{"title": "Submodular Function Minimization with Noisy Evaluation Oracle", "book": "Advances in Neural Information Processing Systems", "page_first": 12103, "page_last": 12113, "abstract": "This paper considers submodular function minimization with \\textit{noisy evaluation oracles} that return the function value of a submodular objective with zero-mean additive noise. For this problem, we provide an algorithm that returns an $O(n^{3/2}/\\sqrt{T})$-additive approximate solution in expectation, where $n$ and $T$ stand for the size of the problem and the number of oracle calls, respectively. There is no room for reducing this error bound by a factor smaller than $O(1/\\sqrt{n})$. Indeed, we show that any algorithm will suffer additive errors of $\\Omega(n/\\sqrt{T})$ in the worst case. Further, we consider an extended problem setting with \\textit{multiple-point feedback} in which we can get the feedback of $k$ function values with each oracle call. Under the additional assumption that each noisy oracle is submodular and that $2 \\leq k = O(1)$, we provide an algorithm with an $O(n/\\sqrt{T})$-additive error bound as well as a worst-case analysis including a lower bound of $\\Omega(n/\\sqrt{T})$, which together imply that the algorithm achieves an optimal error bound up to a constant.", "full_text": "Submodular Function Minimization\n\nwith Noisy Evaluation Oracle\n\nShinji Ito\u2217\n\nNEC Corporation, The University of Tokyo\n\ni-shinji@nec.com\n\nAbstract\n\nThis paper considers submodular function minimization with noisy evaluation ora-\n\u221a\ncles that return the function value of a submodular objective with zero-mean addi-\ntive noise. For this problem, we provide an algorithm that returns an O(n3/2/\nT )-\nadditive approximate solution in expectation, where n and T stand for the size of\n\u221a\nthe problem and the number of oracle calls, respectively. There is no room for\nreducing this error bound by a factor smaller than O(1/\nn). 
Indeed, we show that any algorithm will suffer additive errors of Ω(n/√T) in the worst case. Further, we consider an extended problem setting with multiple-point feedback in which we can get the feedback of k function values with each oracle call. Under the additional assumption that each noisy oracle is submodular and that 2 ≤ k = O(1), we provide an algorithm with an O(n/√T)-additive error bound as well as a worst-case analysis including a lower bound of Ω(n/√T), which together imply that the algorithm achieves an optimal error bound up to a constant.

1 Introduction

Submodular function minimization (SFM) is an important problem that appears in a wide range of research areas, including image segmentation [31; 33], learning with structured regularization [6], and pricing optimization [26]. The goal in this problem is to find a minimizer of a submodular function, a function f : 2^[n] → R defined on the subsets of a given finite set [n] := {1, 2, . . . , n} and satisfying the following inequality:

f(X) + f(Y) ≥ f(X ∩ Y) + f(X ∪ Y).   (1)

This condition is equivalent to the diminishing marginal returns property (see, e.g., [17]): for every X ⊆ Y ⊆ [n] and i ∈ [n] \ Y, f(X ∪ {i}) − f(X) ≥ f(Y ∪ {i}) − f(Y).
Existing studies on SFM assume access to an evaluation oracle for f that returns the value f(X) for any X in the feasible region. Under this assumption, a number of efficient algorithms have been discovered, in which the number of oracle calls as well as other computational time is bounded by a polynomial in n. The first polynomial-time algorithm was given by Grötschel, Lovász, and Schrijver [19] and was based on the ellipsoid method. Combinatorial strongly polynomial-time algorithms have been independently proposed by Iwata, Fleischer, and Fujishige [28] and by Schrijver [38].
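For small ground sets, the defining inequality (1) and the equivalent diminishing-returns property can be verified by brute force. The sketch below is our own illustration, not code from the paper; the graph-cut function used as the example is a standard submodular function and is an assumption of ours, and the check is exponential in n, so it is only a sanity check:

```python
from itertools import chain, combinations

def subsets(ground):
    """All subsets of the iterable `ground`, as tuples."""
    return chain.from_iterable(combinations(ground, r) for r in range(len(ground) + 1))

def is_submodular(f, n):
    """Brute-force check of inequality (1): f(X) + f(Y) >= f(X ∪ Y) + f(X ∩ Y)."""
    sets = [frozenset(s) for s in subsets(range(1, n + 1))]
    return all(f(X) + f(Y) >= f(X | Y) + f(X & Y) for X in sets for Y in sets)

# A standard example: the cut function of an undirected graph is submodular.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
def cut(X):
    return sum(1 for (u, v) in edges if (u in X) != (v in X))

print(is_submodular(cut, 4))  # True
```

By contrast, a supermodular function such as X ↦ |X|² fails the same check, which makes the test a quick way to probe whether a given objective fits the SFM framework.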
The current best computational time is O(n³ log² n · EO + n⁴ log^{O(1)} n) by Lee et al. [34], where EO denotes the time taken by the evaluation oracle to answer a single query. For approximate optimization, Chakrabarty et al. [11] have proposed an algorithm that finds an ε-additive approximate solution in Õ(n^{5/3} · EO/ε²) time. The time complexity has been improved to Õ(n · EO/ε²) by Axelrod et al. [4].

∗This work was supported by JST, ACT-I, Grant Number JPMJPR18U5, Japan.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Additive error bounds for submodular minimization with noisy evaluation oracle. (Ω'(·) := Ω(min{1, ·}))

                              | Do not assume f̂_t submodular                      | Assume f̂_t submodular
single-point feedback         | [This paper]: O(n^{3/2}/√T) and Ω'(n/√T)          | [23]: O(n/T^{1/3}) (if T = Ω(n³))
k-point feedback (2 ≤ k ≤ n)  | [This paper]: O(n^{3/2}/√(kT)) and Ω'(n/√(kT))    | [This paper]: O(n/√(kT)) and Ω'(n/√(2kT) + √(n/T))
(n+1)-point feedback          | [This paper]: O(n/√T) and Ω'(√(n/T))              | [23]: O(√(n/T)) and Ω'(√(n/T))

In some applications, however, evaluation oracles are not always available, and only noisy function values are observable. For example, in the pricing optimization problem, let us consider selling n types of products, where the value of the objective function f(X) corresponds to the expected gross profit, and the variable X ⊆ [n] corresponds to the set of discounted products.
In this scenario, Ito and Fujimaki [26] have shown that −f(X) is a submodular function under certain assumptions, which means that the problem of maximizing the gross profit f(X) is an example of SFM. In a realistic situation, however, we are not given an explicit form of f, and the only thing we can do is to observe the sales of products while changing prices. The observed gross profit does not coincide with its expectation f(X), but changes randomly due to the inherent randomness of purchasing behavior or some temporary events. This means that exact values of f(X) are unavailable, and, consequently, existing works do not directly apply to this situation.
To deal with such problems, we introduce SFM with noisy evaluation oracles that return a random value with expectation f(X). In other words, the noisy evaluation oracle f̂ returns f̂(X) = f(X) + ξ, where ξ is a zero-mean noise that may or may not depend on X. We assume access to T independent noisy evaluation oracles f̂_1, f̂_2, . . . , f̂_T with bounded ranges. We start with the single-point feedback setting and then study the more general multiple-point feedback (or k-point feedback) setting: In the former setting, for each t ∈ [T], we choose one query X_t to feed f̂_t, and get feedback of f̂_t(X_t). In the latter setting, we are given a positive integer k, and for each t, choose k queries to feed f̂_t and observe k real values of feedback. Such a situation with multiple-point feedback arises in some applications. For example, in the case of pricing optimization for E-commerce, we can get multiple-point feedback by employing the A/B-testing framework, i.e., by showing different prices to randomly divided groups of customers.
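This oracle model is straightforward to simulate. In the sketch below, the ground-truth function, the uniform noise distribution, and all names are illustrative assumptions of ours, not part of the paper; each round t draws a fresh independent noisy oracle f̂_t with E[f̂_t(X)] = f(X), and a k-point round simply queries the same f̂_t at k sets:

```python
import random

def make_noisy_oracle(f, noise_scale=0.1, rng=random):
    """Wrap f as a noisy evaluation oracle: f_hat(X) = f(X) + xi with E[xi] = 0."""
    def f_hat(X):
        return f(X) + rng.uniform(-noise_scale, noise_scale)
    return f_hat

# Illustrative ground truth: a small modular (hence submodular) "profit" function.
weights = {1: -0.3, 2: 0.5, 3: -0.1}
def f(X):
    return sum(weights[i] for i in X)

T, k = 5, 3
for t in range(T):
    f_hat_t = make_noisy_oracle(f)                        # independent oracle per round
    queries = [frozenset(), frozenset({1}), frozenset({1, 2})][:k]
    feedback = [f_hat_t(X) for X in queries]              # k-point feedback for round t
```

Averaging repeated calls to a noisy oracle at a fixed X recovers f(X), which is what the algorithms in this paper exploit implicitly through stochastic gradient estimates rather than by explicit averaging.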
Note that each f̂_t is not necessarily submodular even if its expectation is submodular.
Our contribution is two-fold, positive results (algorithms, Theorem 1) and negative results (worst-case analyses, Theorem 2): We propose algorithms that return O(1/√T)-additive approximate solutions, and we show that arbitrary algorithms suffer additive errors of Ω(1/√T) in the worst case. The results are summarized in Table 1, with positive results in O(·) notation and negative ones in Ω'(·) notation.
As shown in Table 1, for the single-point feedback setting, we propose an algorithm that finds an O(n^{3/2}/√T)-additive approximate solution. Moreover, there is no room for reducing this additive error bound by a smaller factor than O(1/√n). Indeed, our Theorem 2 implies that arbitrary algorithms, including those requiring exponential time and space, suffer at least Ω(n/√T) additive errors. For the k-point feedback setting, both the lower and the upper bounds are decreased by 1/√k factors, without additional assumptions. Under the assumption that each f̂_t is submodular (Assumption 1), however, the situation changes: Our proposed algorithm achieves O(n/√(kT))-additive error, which is O(1/√n)-times smaller than without Assumption 1. We also show the lower bound of Ω'(n/√(2kT) + √(n/T)), which implies that, if k = O(1) or k = Ω(n), then our algorithm is optimal up to constant factors, i.e., no algorithms achieve additive errors of a smaller order.

1 This work applies to more general problem settings than ours, bandit submodular minimization and online submodular minimization. See Section 2 for details.

To construct the algorithms, we combine a convex relaxation technique based on the Lovász extension and the stochastic gradient descent (SGD) method.
The Lovász extension of a submodular function is a convex function whose minimizers lead to solutions for SFM. Thanks to this, we can reduce SFM to a convex optimization problem. In this study, we seek a minimizer of the Lovász extension by means of SGD, in which we need to construct unbiased estimators of subgradients. The performance of the SGD depends strongly on the variance of subgradient estimators. We present ways for constructing subgradient estimators, and it turns out that Assumption 1 enables us to obtain estimators with smaller variances. The combination of the Lovász extension and SGD has already been introduced in the work on bandit submodular minimization by Hazan and Kale [23]. Our work, however, considers different problem settings, including multiple-point feedback, and presents tighter and more detailed analyses. Details on the differences are given in Section 2.
A key technique for our lower bounds comes from the proof of regret lower bounds for bandit problems by Auer et al. [3]. Their proof consists of two steps: they first construct a probability distribution of inputs for which it is hard to detect a good arm offering a large reward, and then show that any algorithm actually chooses the good arm only infrequently. We follow a line similar to these two steps to prove Theorem 2, in which a number of technical issues arise. In the case of multiple-point feedback, in particular, we need to assess the KL divergence carefully for the observed signals from evaluation oracles.

2 Related Work

Bandit submodular minimization (BSM) by Hazan and Kale [23] is strongly related to our model. BSM is described as follows: in each iteration t ∈ [T], a decision maker chooses X_t ⊆ [n] and observes f_t(X_t), where each f_t : 2^[n] → [−1, 1] is a submodular function. In contrast to our model, no stochastic models for f_t are assumed, and the performance of the decision maker is measured by the regret defined as Regret_T := Σ_{t=1}^T f_t(X_t) − min_{X⊆[n]} Σ_{t=1}^T f_t(X). This BSM problem can be regarded as a generalization of our problem with single-point feedback under Assumption 1. Indeed, given a BSM algorithm achieving Regret_T ≤ b(n, T) for some function b, one can construct an SFM algorithm that returns b(n, T)/T-additive approximate solutions (see, e.g., [25]). Since a BSM algorithm with an O(nT^{2/3}) regret bound has been proposed in [23], an O(n/T^{1/3})-additive approximate algorithm immediately follows, as in Table 1. In BSM, however, it has been left as an open problem whether or not one can achieve O(n^{O(1)}√T) regret bounds.
With respect to SFM with an exact evaluation oracle, there is a large body of literature [6; 10; 27; 37; 42; 13], in addition to the works mentioned in Section 1. The Fujishige–Wolfe algorithm [17], based on Wolfe's minimum norm point algorithm [42] and the connection between minimum norm points and SFM shown in [16], is known to have the best empirical performance in many cases [5; 18]. Chakrabarty et al. [10] have shown that the Fujishige–Wolfe algorithm finds an ε-additive approximate solution with a running time of O(n²(EO + n)/ε²). The same runtime bound can be achieved by a gradient descent approach presented by Bach [6].
For submodular function maximization with noisy evaluation oracles, there have been many studies. Hassani et al. [21] provided a nearly 1/2-approximate algorithm for monotone submodular maximization. Singla et al. [41] considered a similar problem with applications to crowdsourcing. Karimi et al. [32] considered maximizing weighted coverage functions, a special case of submodular functions, under matroid constraints, and presented an efficient nearly (1 − 1/e)-approximate algorithm. Hassidim and Singer [22] provided a nearly (1 − 1/e)-approximate algorithm for monotone submodular maximization with cardinality constraints. Mokhtari et al. [36] showed that a stochastic continuous greedy method works well for monotone submodular function maximization subject to a convex body constraint. For minimization problems with similar assumptions, in contrast to maximization problems, relatively little literature can be found. Blais et al. [8] considered approximate submodular minimization with an approximate oracle model, and presented a polynomial-time algorithm with a high-probability error bound. While their model is more general than ours, their algorithm requires more computational cost and more oracle calls than ours to achieve a similar error bound. Halabi and Jegelka [20] dealt with minimization of weakly DR-submodular functions, which is a class of approximately submodular functions, and provided algorithms with reasonable approximation ratios.
Zero-order or derivative-free convex optimization [2; 29; 39], i.e., optimization with an evaluation oracle for convex objectives without access to gradients, is also related to our model because Lovász extensions are convex. For general convex objectives, Agarwal et al. [2], Belloni et al. [7] and Bubeck et al. [9] have proposed algorithms that return Õ(1/√T)-additive approximate solutions, ignoring factors of poly(log T) and n^{O(1)}, where n stands for the dimension of the feasible region. Though the error bounds in these results include factors larger than O(n³), it has been reported [5; 24] that the dependence on n can be improved under such additional assumptions as smoothness and strong convexity of the objectives. These improved results, however, do not apply to our problems because Lovász extensions are neither smooth nor strongly convex.
Multiple-point feedback has been considered in zero-order convex optimization, and some algorithms have been reported to achieve optimal performance in such problem settings [1; 15; 40]. In terms of the lower bound on the additive error, Jamieson et al. [29] and Shamir [39] have shown lower bounds of Ω(1/√T) or Ω(1/T) for various classes of convex objectives, which, however, do not directly apply to our model.

3 Problem Setting

Let n be a positive integer, and let [n] = {1, 2, . . . , n} stand for the finite set consisting of positive integers at most n. Let L ⊆ 2^[n] be a distributive lattice, i.e., we assume that X, Y ∈ L implies X ∩ Y, X ∪ Y ∈ L. Let f : L → [−1, 1] be a submodular function that we aim to minimize. In our problem setting, we are not given access to exact values of f, but given noisy evaluation oracles {f̂_t}_{t=1}^T of f, where the f̂_t are random functions from L to [−1, 1] that satisfy E[f̂_t(X)] = f(X) for all t = 1, 2, . . . , T and X ∈ L. We also assume that f̂_1, f̂_2, . . . , f̂_T are independent.
Our goal is to construct algorithms for solving the following problem: First, the algorithm is given the decision set L and the number T of available oracle calls. For t = 1, 2, . . . , T, the algorithm chooses X_t ∈ L and observes f̂_t(X_t). The chosen query X_t can depend on previous observations {f̂_j(X_j)}_{j=1}^{t−1}. After T rounds of observation, the algorithm outputs X̂ ∈ L. We call this problem a single-point feedback setting. In an alternative problem setting, a multi-point or k-point feedback setting, we are given a parameter k ≥ 2 in addition to T and L. In the k-point feedback setting, the algorithm can choose k queries X_t^{(1)}, X_t^{(2)}, . . . , X_t^{(k)} ∈ L, and, after that, it observes the values f̂_t(X_t^{(1)}), f̂_t(X_t^{(2)}), . . . , f̂_t(X_t^{(k)}) from the evaluation oracle in each round t ∈ [T]. In both settings, the performance of the algorithm is evaluated in terms of the additive error E_T defined as E_T = f(X̂) − min_{X∈L} f(X).
A part of our results relies on the following assumption. Note that the following is assumed only when it is explicitly mentioned.
Assumption 1. Assume that each f̂_t : L → [−1, 1] is submodular and that k ≥ 2.

4 Our Contribution

Our contribution is two-fold: positive results (Theorem 1) and negative results (Theorem 2).
Theorem 1. Suppose 1 ≤ k ≤ n + 1. For the problem with k-point feedback, there is an algorithm that returns X̂ such that

E[E_T] = E[f(X̂)] − min_{X∈L} f(X) = O(n^{3/2}/√(kT)).   (2)

If Assumption 1 holds, there is an algorithm that returns X̂ such that

E[E_T] = E[f(X̂)] − min_{X∈L} f(X) = O(n/√(kT)).   (3)

The expectation is taken w.r.t. the randomness of the oracles f̂_t and the algorithm's internal randomness. In both algorithms, the running time is bounded by O((kEO + n log n)T) if L = 2^[n], where EO stands for the time taken by an evaluation oracle to answer a single query.
If we can choose the number T of oracle calls arbitrarily, we are then able to compute an ε-additive approximate solution (in expectation) for arbitrary ε > 0, by means of the algorithm with the error bound (2). The computational time for it is of O((n³/ε²)(EO + (n/k) log n)). Indeed, to find an ε-additive approximate solution, it suffices to set T so that ε = Θ(n^{3/2}/√(kT)), which is equivalent to T = Θ(n³/(kε²)). The computational time is then bounded as O((kEO + n log n)T) = O((n³/ε²)(EO + (n/k) log n)). Similarly, if Assumption 1 holds and the algorithm achieving (3) is used, an ε-additive approximate solution can be found in O((n²/ε²)(EO + (n/k) log n)) time.
The following theorem gives an insight regarding how tight the above error bounds in Theorem 1 are.
Theorem 2. There is a probability distribution of instances for which any algorithm suffers errors of

E[E_T] = E[f(X̂) − min_{X∈L} f(X)] = Ω'(n/√(kT)),   (4)

where we denote Ω'(·) := Ω(min{1, ·}). In addition, there is a probability distribution of instances satisfying Assumption 1 for which any algorithm suffers errors of

E[E_T] = E[f(X̂) − min_{X∈L} f(X)] = Ω'(n/√(2kT) + √(n/T)).   (5)

The expectation is taken w.r.t. the randomness of the instance f and the oracles f̂_t, and the algorithm's internal randomness.
From (4) in this theorem, we can see that at least Ω((n²/ε²)EO) computational time is required to find an ε-additive approximate solution. This can be shown by an argument similar to that after Theorem 1. For the problem with exact evaluation oracles, on the other hand, Chakrabarty et al. [11] have proposed an algorithm running in Õ((n^{5/3}/ε²)EO) time. By comparing these two results, we can see that SFM with a noisy oracle is essentially harder than SFM with an exact oracle.

5 Algorithm

5.1 Preliminary

Lovász extension of a submodular function   For a [0, 1]-valued vector x = (x_1, . . . , x_n)^⊤ ∈ [0, 1]^n and a real value u ∈ [0, 1], define H_x(u) ⊆ [n] to be the set of indices i for which x_i ≥ u, i.e., H_x(u) = {i ∈ [n] | x_i ≥ u}. For a distributive lattice L, define a convex hull L̃ ⊆ [0, 1]^n of L as follows: L̃ = {x ∈ [0, 1]^n | H_x(u) ∈ L for all u ∈ [0, 1]}.
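The Lovász extension defined next can be evaluated exactly with n + 1 oracle calls by sorting the coordinates of x, which is also how the permutation σ of Section 5 is obtained. The following minimal sketch is our own illustration (function names are ours), implementing the sorted form f̃(x) = f(∅) + Σ_i (f(S_σ(i)) − f(S_σ(i−1))) x_{σ(i)}:

```python
def lovasz_extension(f, x):
    """Evaluate the Lovász extension of the set function f at x in [0, 1]^n."""
    n = len(x)
    order = sorted(range(n), key=lambda i: -x[i])  # σ: coordinates in descending order
    value = f(frozenset())                         # f(S_σ(0)) = f(∅)
    prefix = set()
    for i in order:
        before = f(frozenset(prefix))              # f(S_σ(i−1))
        prefix.add(i + 1)                          # ground set is {1, ..., n}
        value += (f(frozenset(prefix)) - before) * x[i]
    return value
```

On an indicator vector χ_X this recovers f(X) exactly, consistent with the extension property, and for submodular f it agrees with the integral definition of f̃.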
Given a function f : L → R, we define the Lovász extension f̃ : L̃ → R of f as

f̃(x) = ∫_0^1 f(H_x(u)) du.   (6)

From the definition, we have f̃(χ_X) = f(X) for all X ∈ L, i.e., f̃ is an extension of f.²
The following theorem provides a connection between submodular functions and convex functions:
Theorem 3 ([35]). A function f : L → R is submodular if and only if f̃ is convex. For a submodular function f : L → R, we have min_{X∈L} f(X) = min_{x∈L̃} f̃(x).
For a proof of this theorem, see, e.g., [17; 35].
For x ∈ [0, 1]^n, let σ : [n] → [n] be a permutation over [n] such that x_{σ(1)} ≥ x_{σ(2)} ≥ ··· ≥ x_{σ(n)}. For any permutation σ over [n], define S_σ(i) = {σ(j) | j ≤ i}. The Lovász extension defined by (6) can then be rewritten as

f̃(x) = f(∅) + Σ_{i=1}^n (f(S_σ(i)) − f(S_σ(i−1))) x_{σ(i)}   (7)
     = f(∅)(1 − x_{σ(1)}) + Σ_{i=1}^{n−1} f(S_σ(i))(x_{σ(i)} − x_{σ(i+1)}) + f([n]) x_{σ(n)}.   (8)

A similar expression can be found in, e.g., Lemma 6.19 of the book [17].

² χ_X ∈ {0, 1}^n denotes the indicator vector of X, i.e., (χ_X)_i = 1 for i ∈ X and (χ_X)_i = 0 for i ∈ [n] \ X.

Subgradient of the Lovász extension   From the above two expressions (7) and (8) of the Lovász extension, we obtain two alternative ways to express its subgradient. For a permutation σ over [n] and i ∈ {0, 1, . . . , n}, define ψ_σ(i) ∈ {−1, 0, 1}^n as

ψ_σ(0) = −χ_{σ(1)},  ψ_σ(n) = χ_{σ(n)},  ψ_σ(i) = χ_{σ(i)} − χ_{σ(i+1)}  (i = 1, 2, . . . , n − 1).   (9)

A subgradient of f̃ at x can then be expressed by g(σ) defined as

g(σ) := Σ_{i=1}^n (f(S_σ(i)) − f(S_σ(i−1))) χ_{σ(i)}   (10)
      = −f(∅) χ_{σ(1)} + Σ_{i=1}^{n−1} f(S_σ(i))(χ_{σ(i)} − χ_{σ(i+1)}) + f([n]) χ_{σ(n)} = Σ_{i=0}^n f(S_σ(i)) ψ_σ(i),   (11)

where (10) and (11) come from (7) and (8), respectively.

5.2 Stochastic Gradient Descent Method

Our algorithm is based on the stochastic gradient descent method for f̃ : L̃ → [−1, 1]. To start with, we initialize x_1 = (1/2)·1 ∈ L̃. For t = 1, 2, . . . , T, we update x_t by iteratively calling the oracle f̂_t to obtain x_{t+1}. In each update, we construct an unbiased estimator ĝ_t of a subgradient of f̃ at x_t (a more concrete construction will be given later), and x_{t+1} is given by

x_{t+1} = P_{L̃}(x_t − η ĝ_t),   (12)

where P_{L̃} : R^n → L̃ stands for the Euclidean projection onto L̃, i.e., P_{L̃}(x) ∈ arg min_{y∈L̃} ‖y − x‖_2, and η > 0 is a parameter that we can change arbitrarily. We then compute x̄ = (1/T) Σ_{t=1}^T x_t and draw u from a uniform distribution over [0, 1], and output X̂ = H_{x̄}(u). From (6), we have E[f(X̂)] = E[f̃(x̄)]. To analyze the performance of our algorithm, we use the following theorem:
Theorem 4. Let D ⊆ R^n be a compact convex set containing 0. For a convex function f̃ : D → R, let x_1, . . . , x_T be defined by x_1 = 0 and x_{t+1} = P_D(x_t − η ĝ_t), where E[ĝ_t | x_t] is a subgradient of f̃ at x_t for each t. Then, x̄ := (1/T) Σ_{t=1}^T x_t satisfies

E[f̃(x̄)] − min_{x*∈D} f̃(x*) ≤ (1/T) ( max_{x∈D} ‖x‖_2² / (2η) + (η/2) Σ_{t=1}^T E[‖ĝ_t‖_2²] ).   (13)

For completeness, we give a proof of this theorem in Appendix A. A similar analysis can be found in, e.g., Lemma 11 of [23]. When setting D = L̃ − (1/2)·1, we have max_{x∈D} ‖x‖_2² ≤ n/4. From this and Theorems 3 and 4, if ĝ_t is bounded as E[‖ĝ_t‖_2²] ≤ G² for all t, we then have E[f(X̂)] − min_{X*∈L} f(X*) ≤ (1/T)(n/(8η) + (η/2)G²T). The performance of the algorithm here depends on G, an upper bound on the expected norm of the unbiased estimator ĝ_t. We evaluate the magnitude of G for specific examples of ĝ_t in the following subsection.

5.3 Unbiased Estimators of Subgradients

In this subsection, we present two different ways to construct unbiased estimators for a subgradient of f̃ that are based on (11) and (10), respectively. The latter is available for the case of multiple-point feedback, i.e., k ≥ 2, and produces a smaller error bound under Assumption 1. Without such an assumption, the former gives a better error bound. Given x_t = (x_{t1}, x_{t2}, . . . , x_{tn})^⊤ ∈ [0, 1]^n, let σ : [n] → [n] be a permutation over [n] for which x_{tσ(1)} ≥ x_{tσ(2)} ≥ ··· ≥ x_{tσ(n)}.

An estimator based on the expression (11)   Suppose k ∈ [n + 1]. Consider choosing queries {X_t^{(j)}}_{j=1}^k randomly as follows: Choose a subset I_t = {i_t^{(j)}}_{j=1}^k ⊆ {0, 1, . . . , n} of size k, uniformly at random, i.e., I_t follows a uniform distribution over the subset family {I ⊆ {0, 1, . . . , n} | |I| = k}. Then let X_t^{(j)} = S_σ(i_t^{(j)}) = {σ(j') | j' ≤ i_t^{(j)}} and observe f̂_t(X_t^{(j)}) for j ∈ [k]. Define ĝ_t as

ĝ_t = ((n+1)/k) Σ_{j=1}^k f̂_t(X_t^{(j)}) ψ_σ(i_t^{(j)}) = ((n+1)/k) Σ_{i∈I_t} f̂_t(S_σ(i)) ψ_σ(i),   (14)

where ψ_σ(i) is defined in (9). Note that ĝ_t relies on x_t since σ depends on x_t. Then, ĝ_t is an unbiased estimator of a subgradient and satisfies E[‖ĝ_t‖_2²] = O(n²/k):
Lemma 1. Suppose that ĝ_t is given by (14). We then have

E[ĝ_t | x_t] ∈ ∂f̃(x_t),  E[‖ĝ_t‖_2²] ≤ 2(n + 1)(n + k)/k.   (15)

Proofs of all lemmas in this paper are given in Appendix B.

Algorithm 1  An algorithm for submodular function minimization with noisy evaluation oracle
Require: The size n ≥ 1 of the problem, the number T ≥ 1 of oracle calls, the number k ∈ [n + 1] of feedback values per oracle call, and the learning rate η > 0.
1: Set x_1 = (1/2)·1.
2: for t = 1, 2, . . . , T do
3:   Let σ : [n] → [n] be a permutation corresponding to x_t, i.e., x_{tσ(1)} ≥ ··· ≥ x_{tσ(n)}.
4:   if Assumption 1 holds then
5:     Choose a subset J_t ⊆ [n] of size l = ⌊k/2⌋, uniformly at random.
6:     Call the evaluation oracle f̂_t to observe f̂_t(S_σ(i)) and f̂_t(S_σ(i − 1)) for i ∈ J_t.
7:     Construct an unbiased estimator ĝ_t of a subgradient of f̃ at x_t, as (16).
8:   else
9:     Choose a subset I_t ⊆ {0, 1, . . . , n} of size k, uniformly at random.
10:    Call the evaluation oracle f̂_t to observe f̂_t(S_σ(i)) for i ∈ I_t.
11:    Construct an unbiased estimator ĝ_t of a subgradient of f̃ at x_t, as (14).
12:  end if
13:  Compute x_{t+1} from x_t and ĝ_t on the basis of (12).
14: end for
15: Set x̄ = (1/T) Σ_{t=1}^T x_t.
16: Draw u from a uniform distribution over [0, 1], and output X̂ = H_{x̄}(u) = {i ∈ [n] | x̄_i ≥ u}.

An estimator based on the expression (10)   Suppose 2 ≤ k ≤ n + 1 holds, and let l denote l = ⌊k/2⌋ ≥ 1. Consider choosing queries {X_t^{(j)}}_{j=1}^k randomly as follows: Choose a subset J_t ⊆ {1, . . . , n} of size l, uniformly at random. Then, set queries {X_t^{(j)}}_{j=1}^k so that ∪_{i∈J_t} {S_σ(i), S_σ(i − 1)} ⊆ {X_t^{(j)}}_{j=1}^k, and observe f̂_t(S_σ(i)) and f̂_t(S_σ(i − 1)) for i ∈ J_t. Define ĝ_t as

ĝ_t = (n/l) Σ_{i∈J_t} (f̂_t(S_σ(i)) − f̂_t(S_σ(i − 1))) χ_{σ(i)}.   (16)

Then, ĝ_t is an unbiased estimator of a subgradient and satisfies E[‖ĝ_t‖_2²] = O(n²/k), and if f̂_t is a submodular function, then E[‖ĝ_t‖_2²] = O(n/k) holds.
Lemma 2. Suppose that ĝ_t is given by (16). We then have

E[ĝ_t | x_t] ∈ ∂f̃(x_t),  E[‖ĝ_t‖_2²] ≤ 4n²/l ≤ 12n²/k.   (17)

In addition, if f̂_t is a submodular function, we then have

E[‖ĝ_t‖_2²] ≤ 16n/l ≤ 48n/k.   (18)

A key factor in the advantage of the estimator defined by (16) is that the vector (f_t(S_σ(i)) − f_t(S_σ(i − 1)))_{i=1}^n ∈ R^n has a smaller norm than (f_t(S_σ(i)))_{i=0}^n ∈ R^{n+1}, which is implied by Lemma 8 in [23] or Lemma 1 in [30].

5.4 Proof of Theorem 1

By combining the SGD described in Section 5.2 and the unbiased estimators defined by (14) or (16), we obtain Algorithm 1. Let us evaluate the additive errors for this algorithm. Note that we have E[f̃(x̄)] − min_{x*∈L̃} f̃(x*) = E[f(X̂)] − min_{X*∈L} f(X*) from (6) and Theorem 3.
Suppose X̂ is produced by Algorithm 1 in which Steps 9–11 are chosen. From Theorem 4 and Lemma 1, we have E[f(X̂)] − min_{X*∈L} f(X*) ≤ (1/T)(n/(8η) + ηT(n + 1)(n + k)/k). The right-hand side is minimized when η is chosen as η = √(kn/(8T(n + 1)(n + k))). We then have E[f(X̂)] − min_{X*∈L} f(X*) ≤ √(n(n + 1)(n + k)/(2kT)) = O(n^{3/2}/√(kT)), which proves (2).
Suppose that Assumption 1 holds and that X̂ is produced by Algorithm 1, where Steps 5–7 are chosen. From Theorem 4 and Lemma 2, we have E[f(X̂)] − min_{X*∈L} f(X*) ≤ (1/T)(n/(8η) + 24ηTn/k). The right-hand side is minimized when η is chosen as η = √(k/(192T)).
We then have E[f(X̂)] − min_{X*∈L} f(X*) ≤ √12 · n/√(kT) = O(n/√(kT)), which proves (3).
The computational time of Algorithm 1 can be evaluated as follows: Step 3 can be conducted by a sorting algorithm, which takes O(n log n) time. Step 5 can be done by generating uniform random numbers over [m] for m = n, n − 1, . . . , n − k + 1, which takes O(k log n) time. Step 6 requires O(kEO) time computation. Step 7 can be computed with O(n) arithmetic operations. Steps 9–11 are similar to Steps 5–7. Step 13 takes O(n) time since x_t − η ĝ_t can be computed with O(n) arithmetic operations and since P_{L̃}(x) has an explicit form. Hence, Steps 2–14 require O((n log n + k log n + kEO + n + n) · T) = O((kEO + n log n)T) time. The other steps do not require time greater than this. Therefore, the overall time complexity is of O((kEO + n log n)T).

6 Lower Bound

6.1 Construction of Hard Instance

Define h_i : 2^[n] → {−1, 1} by h_i(X) := −1 if i ∈ X and h_i(X) := 1 if i ∉ X, for i ∈ [n]. Fix a subset S* ⊆ [n] and a positive real value ε ∈ [0, 1]. Consider the following procedure that produces a function f̂ : 2^[n] → {−1, 1}: (1) Choose i ∈ [n] uniformly at random, and set s = 1 with probability (1 − ε)/2 and s = −1 with probability (1 + ε)/2. (2) Define f̂ : 2^[n] → {−1, 1} by f̂(X) = s · h_i(S*△X) = s · h_i(S*) h_i(X), where S*△X stands for the symmetric difference between S* and X, i.e., S*△X = (S* \ X) ∪ (X \ S*). Let F(S*, ε) denote the distribution of functions generated by the above procedure.
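The sampling procedure above is straightforward to implement. The sketch below is our own illustrative code (all names are ours): it draws f̂ ∼ F(S*, ε) and can be used to check empirically that the mean of f̂(X) approaches the closed form (ε/n)(2|S*△X| − n) stated in (19):

```python
import random

def h(i, X):
    """h_i(X) = -1 if i in X, else 1."""
    return -1 if i in X else 1

def sample_from_F(n, S_star, eps, rng):
    """Draw one oracle f_hat ~ F(S*, eps): f_hat(X) = s · h_i(S* △ X)."""
    i = rng.randrange(1, n + 1)                    # i uniform over [n]
    s = 1 if rng.random() < (1 - eps) / 2 else -1  # P[s = 1] = (1 - eps)/2
    return lambda X: s * h(i, S_star.symmetric_difference(X))

# Empirical mean of f_hat(X) should approach (eps/n)(2|S* △ X| - n).
rng = random.Random(0)
n, S_star, eps = 4, frozenset({1, 2}), 0.5
X = frozenset({1})
mean = sum(sample_from_F(n, S_star, eps, rng)(X) for _ in range(100000)) / 100000
```

Since each draw is ±1-valued, many independent draws are needed before the signal of size ε/n separates from the noise, which is exactly the effect the KL-divergence argument of Section 6.2 quantifies.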
A similar construction can be found in [14], where it is used for a lower bound for bandit linear optimization.

In addition, define $F'(S^*, \varepsilon)$ similarly, so that all function values of $\hat{f} \sim F'(S^*, \varepsilon)$ are stochastically independent: Choose $i_X \in [n]$ and $s_X$ with the probabilities defined above, independently for all $X \subseteq [n]$, and define $\hat{f}(X) = s_X \cdot h_{i_X}(S^*) h_{i_X}(X)$. Let $F'(S^*, \varepsilon)$ denote the distribution of functions generated by this procedure. Note that each $\hat{f}$ generated from $F(S^*, \varepsilon)$ is a modular function and that this does not always hold for $F'(S^*, \varepsilon)$. If $D_{S^*} = F(S^*, \varepsilon)$ or if $D_{S^*} = F'(S^*, \varepsilon)$, the expectation of $\hat{f} \sim D_{S^*}$ is a submodular function expressed as

$f_{S^*,\varepsilon}(X) := E_{\hat{f} \sim D_{S^*}}[\hat{f}(X)] = -\frac{\varepsilon}{n} \sum_{i=1}^n h_i(S^*) h_i(X) = \frac{\varepsilon}{n}\left(2|S^* \triangle X| - n\right),$  (19)

where the second equality comes from $E[s] = E[s_X] = \frac{1-\varepsilon}{2} - \frac{1+\varepsilon}{2} = -\varepsilon$.

6.2 Proof of Theorem 2

To prove Theorem 2, we start with bounding the additive error from below by means of KL divergences. Fix $X^{(1)}, X^{(2)}, \ldots, X^{(k)} \subseteq [n]$ arbitrarily. For a class $\{D_{S^*} \mid S^* \subseteq [n]\}$ of distributions over $\{\hat{f} : 2^{[n]} \to \{-1, 1\}\}$, let $P_{S^*}$ denote the distribution of $y(\hat{f}) = (\hat{f}(X^{(1)}), \hat{f}(X^{(2)}), \ldots, \hat{f}(X^{(k)}))^\top \in \mathbf{R}^k$ for $\hat{f} \sim D_{S^*}$. We then have the following:

Lemma 3. Suppose that a class of distributions $\{D_{S^*} \mid S^* \subseteq [n]\}$ satisfies (19) for all $S^* \subseteq [n]$. In addition, suppose that the following holds for arbitrary $S^*, X^{(1)}, X^{(2)}, \ldots, X^{(k)} \subseteq [n]$:

$\sum_{i=1}^n D_{\mathrm{KL}}(P_{S^*} \,\|\, P_{S^* \triangle \{i\}}) \le \frac{n}{2T}.$  (20)

If $S^*$ is chosen uniformly at random from $2^{[n]}$, and $\hat{f}_t$ follows $D_{S^*}$ i.i.d. for $t = 1, 2, \ldots, T$, then any algorithm suffers an additive error of $E[E_T] = E\left[ f_{S^*,\varepsilon}(\hat{X}) - \min_{S \in 2^{[n]}} f_{S^*,\varepsilon}(S) \right] \ge \frac{\varepsilon}{2}$, where the expectation is taken w.r.t. $S^*$, $\hat{f}_t$, and the internal randomness of the algorithm.

Intuitively, the condition (20) means that the distribution of the observed values $y$ does not change much even if the optimal solution $S^*$ is perturbed. Consequently, under the condition (20), it is hard for any algorithm to detect $S^*$. Sufficient conditions for (20) are given in the following two lemmas:

Lemma 4. Suppose that $\{P_{S^*}\}$ is defined by $D_{S^*} = F'(S^*, \varepsilon)$ for $0 \le \varepsilon \le \min\left\{\frac{1}{6}, \frac{n}{\sqrt{8kT}}\right\}$. Then (20) holds for arbitrary $S^*, X^{(1)}, \ldots, X^{(k)} \subseteq [n]$.

Lemma 5. Suppose that $\{P_{S^*}\}$ is defined by $D_{S^*} = F(S^*, \varepsilon)$ for $0 \le \varepsilon \le \min\left\{\frac{1}{6}, n\sqrt{\frac{5}{24T \min\{2^k, 2^n\}}}\right\}$. Then (20) holds for arbitrary $S^*, X^{(1)}, \ldots, X^{(k)} \subseteq [n]$.

Theorem 2 can be proven by combining Lemmas 3, 4, and 5. From Lemmas 3 and 4, if $S^*$ is chosen uniformly at random from $2^{[n]}$ and if $\hat{f}_t$ follows $F'(S^*, \varepsilon)$ with $\varepsilon = \min\left\{\frac{1}{6}, \frac{n}{\sqrt{8kT}}\right\}$, i.i.d. for $t \in [T]$, we then have $E[E_T] \ge \frac{\varepsilon}{2} = \min\left\{\frac{1}{12}, \frac{n}{\sqrt{32kT}}\right\} = \Omega'\left(\frac{n}{\sqrt{kT}}\right)$, which proves (4). If $k \ge 2$, if $S^*$ is chosen uniformly at random from $2^{[n]}$, and if $\hat{f}_t$ follows $F(S^*, \varepsilon)$ with $\varepsilon = \min\left\{\frac{1}{6}, n\sqrt{\frac{5}{24T \min\{2^k, 2^n\}}}\right\}$, i.i.d. for $t \in [T]$, then Assumption 1 is satisfied since $\hat{f}_t \sim F(S^*, \varepsilon)$ is submodular.
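As a numerical sanity check of Lemma 4 (a sketch, not part of the proof): under $F'(S^*, \varepsilon)$ the $k$ coordinates of $y$ are independent, so $D_{\mathrm{KL}}(P_{S^*} \| P_{S^* \triangle \{i\}})$ decomposes into a sum of KL divergences between $\{-1, 1\}$-valued variables whose means are given by (19). The following Python sketch (the instance values $n$, $k$, $T$, $S^*$, and the query sets are illustrative, not from the paper) evaluates the left-hand side of (20) exactly for the choice of $\varepsilon$ in Lemma 4:

```python
import math

def mean_f(S, X, n, eps):
    """E[f_hat(X)] for f_hat ~ F'(S, eps), as in (19): (eps/n)(2|S triangle X| - n)."""
    return (eps / n) * (2 * len(S ^ X) - n)

def kl_pm1(p, q):
    """KL divergence between {-1,+1}-valued variables with P(+1)=p, Q(+1)=q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n, k, T = 6, 3, 100                               # illustrative instance
eps = min(1 / 6, n / math.sqrt(8 * k * T))        # the choice of eps in Lemma 4
S_star = {1, 2, 5}
queries = [{1, 2, 3}, {4}, {1, 2, 3, 4, 5, 6}]    # arbitrary X^(1), ..., X^(k)

# Coordinates of y are independent under F', so the KL divergence of the
# joint distribution is the sum of per-coordinate KL divergences.
total = 0.0
for i in range(1, n + 1):
    S_i = S_star ^ {i}                            # S* with element i flipped
    for X in queries:
        p = (1 + mean_f(S_star, X, n, eps)) / 2   # P(f_hat(X) = +1) under S*
        q = (1 + mean_f(S_i, X, n, eps)) / 2      # ... under S* triangle {i}
        total += kl_pm1(p, q)

assert total <= n / (2 * T)                       # condition (20) holds
```

For these values the sum is roughly half of the threshold $n/(2T)$, consistent with the lemma.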
Further, from Lemmas 3 and 5, we have $E[E_T] \ge \frac{\varepsilon}{2} = \min\left\{\frac{1}{12}, n\sqrt{\frac{5}{96T \min\{2^k, 2^n\}}}\right\} = \Omega'\left(\max\left\{\frac{n}{\sqrt{T 2^k}}, \frac{n}{\sqrt{T 2^n}}\right\}\right)$, which is $\Omega'\left(\frac{n}{\sqrt{T}}\right)$ for $k = O(1)$; this proves (5).

7 Conclusion and Open Questions

We have introduced submodular function minimization with a noisy evaluation oracle, and have provided algorithms and lower bounds, which together imply that the proposed algorithms achieve nearly optimal additive errors, modulo $O(\sqrt{n})$ factors. For the special case of $k$-point feedback settings, in which $2 \le k = O(1)$ and each noisy evaluation oracle itself is a submodular function, we have provided a tight error bound. For the other cases, we leave it as an open question to find tight bounds.

Acknowledgment

The author thanks Satoru Iwata for valuable discussions and for pointing out important literature. The author also thanks the reviewers for many helpful comments and suggestions.

References

[1] A. Agarwal, O. Dekel, and L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In Conference on Learning Theory, pages 28–40, 2010.

[2] A. Agarwal, D. P. Foster, D. J. Hsu, S. M. Kakade, and A. Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035–1043, 2011.

[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[4] B. Axelrod, Y. P. Liu, and A. Sidford. Near-optimal approximate discrete and continuous submodular function minimization. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 837–853. SIAM, 2020.

[5] F. Bach.
Convex analysis and optimization with submodular functions: a tutorial. arXiv preprint arXiv:1010.4207, 2010.

[6] F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends® in Machine Learning, 6(2-3):145–373, 2013.

[7] A. Belloni, T. Liang, H. Narayanan, and A. Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Conference on Learning Theory, pages 240–265, 2015.

[8] E. Blais, C. L. Canonne, T. Eden, A. Levi, and D. Ron. Tolerant junta testing and the connection to submodular optimization and function isomorphism. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2113–2132. SIAM, 2018.

[9] S. Bubeck, Y. T. Lee, and R. Eldan. Kernel-based methods for bandit convex optimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 72–85. ACM, 2017.

[10] D. Chakrabarty, P. Jain, and P. Kothari. Provable submodular minimization using Wolfe's algorithm. In Advances in Neural Information Processing Systems, pages 802–809, 2014.

[11] D. Chakrabarty, Y. T. Lee, A. Sidford, and S. C.-w. Wong. Subquadratic submodular function minimization. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1220–1231. ACM, 2017.

[12] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[13] D. Dadush, L. A. Végh, and G. Zambelli. Geometric rescaling algorithms for submodular function minimization. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 832–848. SIAM, 2018.

[14] V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2008.

[15] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

[16] S. Fujishige. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research, 5(2):186–196, 1980.

[17] S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier, 2005.

[18] S. Fujishige and S. Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.

[19] M. Grötschel, L. Lovász, and A. Schrijver. The ellipsoid method and its consequences in combinatorial optimization. Combinatorica, 1(2):169–197, 1981.

[20] M. E. Halabi and S. Jegelka. Minimizing approximately submodular functions. arXiv preprint arXiv:1905.12145, 2019.

[21] H. Hassani, M. Soltanolkotabi, and A. Karbasi. Gradient methods for submodular maximization. In Advances in Neural Information Processing Systems, pages 5841–5851, 2017.

[22] A. Hassidim and Y. Singer. Submodular optimization under noise. In Conference on Learning Theory, pages 1069–1122, 2017.

[23] E. Hazan and S. Kale. Online submodular minimization. Journal of Machine Learning Research, 13(Oct):2903–2922, 2012.

[24] E. Hazan and K. Levy. Bandit convex optimization: Towards tight bounds. In Advances in Neural Information Processing Systems, pages 784–792, 2014.

[25] E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[26] S. Ito and R. Fujimaki. Large-scale price optimization via network flow. In Advances in Neural Information Processing Systems, pages 3855–3863, 2016.

[27] S. Iwata. A faster scaling algorithm for minimizing submodular functions. SIAM Journal on Computing, 32(4):833–840, 2003.

[28] S. Iwata, L. Fleischer, and S. Fujishige. A combinatorial strongly polynomial algorithm for minimizing submodular functions. Journal of the ACM (JACM), 48(4):761–777, 2001.

[29] K. G. Jamieson, R. Nowak, and B. Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.

[30] S. Jegelka and J. Bilmes. Online submodular minimization for combinatorial structures. In International Conference on Machine Learning, pages 345–352, 2011.

[31] S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: Coupling edges in graph cuts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1897–1904, 2011.

[32] M. Karimi, M. Lucic, H. Hassani, and A. Krause. Stochastic submodular maximization: The case of coverage functions. In Advances in Neural Information Processing Systems, pages 6853–6863, 2017.

[33] P. Kohli and P. H. Torr. Dynamic graph cuts and their applications in computer vision. In Computer Vision, pages 51–108. Springer, 2010.

[34] Y. T. Lee, A. Sidford, and S. C.-w. Wong. A faster cutting plane method and its implications for combinatorial and convex optimization. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1049–1065. IEEE, 2015.

[35] L. Lovász. Submodular functions and convexity. In Mathematical Programming: The State of the Art, pages 235–257. Springer, 1983.

[36] A. Mokhtari, H. Hassani, and A. Karbasi. Conditional gradient method for stochastic submodular maximization: Closing the gap. In International Conference on Artificial Intelligence and Statistics, pages 1886–1895, 2018.

[37] J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.

[38] A. Schrijver. A combinatorial algorithm minimizing submodular functions in strongly polynomial time. Journal of Combinatorial Theory, Series B, 80(2):346–355, 2000.

[39] O. Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In Conference on Learning Theory, pages 3–24, 2013.

[40] O. Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. Journal of Machine Learning Research, 18(52):1–11, 2017.

[41] A. Singla, S. Tschiatschek, and A. Krause. Noisy submodular maximization via adaptive sampling with applications to crowdsourced image collection summarization. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[42] P. Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128–149, 1976.