{"title": "Information-theoretic lower bounds for convex optimization with erroneous oracles", "book": "Advances in Neural Information Processing Systems", "page_first": 3204, "page_last": 3212, "abstract": "We consider the problem of optimizing convex and concave functions with access to an erroneous zeroth-order oracle. In particular, for a given function $x \\to f(x)$ we consider optimization when one is given access to absolute error oracles that return values in [f(x) - \\epsilon,f(x)+\\epsilon] or relative error oracles that return value in [(1+\\epsilon)f(x), (1 +\\epsilon)f (x)], for some \\epsilon larger than 0. We show stark information theoretic impossibility results for minimizing convex functions and maximizing concave functions over polytopes in this model.", "full_text": "Information-theoretic lower bounds for convex\n\noptimization with erroneous oracles\n\nYaron Singer\n\nHarvard University\n\nCambridge, MA 02138\n\nyaron@seas.harvard.edu\n\nJan Vondr\u00b4ak\n\nIBM Almaden Research Center\n\nSan Jose, CA 95120\n\njvondrak@us.ibm.com\n\nAbstract\n\nWe consider the problem of optimizing convex and concave functions with access\nto an erroneous zeroth-order oracle. In particular, for a given function x \u2192 f (x)\nwe consider optimization when one is given access to absolute error oracles that\nreturn values in [f (x) \u2212 \u0001, f (x) + \u0001] or relative error oracles that return value in\n[(1 \u2212 \u0001)f (x), (1 + \u0001)f (x)], for some \u0001 > 0. We show stark information theoretic\nimpossibility results for minimizing convex functions and maximizing concave\nfunctions over polytopes in this model.\n\n1\n\nIntroduction\n\nConsider the problem of minimizing a convex function over some convex domain. 
It is well known that this problem is solvable, in the sense that there are algorithms which make polynomially many calls to an oracle that evaluates the function at any given point, and return a point which is arbitrarily close to the true minimum of the function. But suppose that instead of the true value of the function, the oracle has some small error. Would it still be possible to optimize the function efficiently? To formalize the notion of error, we consider two types of erroneous oracles:\n\n• For a given function f : [0, 1]^n → [0, 1] we say that f̃ : [0, 1]^n → [0, 1] is an absolute ε-erroneous oracle if for all x ∈ [0, 1]^n we have f̃(x) = f(x) + ξ_x, where ξ_x ∈ [−ε, ε].\n\n• For a given function f : [0, 1]^n → R we say that f̃ : [0, 1]^n → R is a relative ε-erroneous oracle if for all x ∈ [0, 1]^n we have f̃(x) = ξ_x f(x), where ξ_x ∈ [1 − ε, 1 + ε].\n\nNote that we intentionally do not make distributional assumptions about the errors. This is in contrast to noise, where the errors are assumed to be random and independently generated from some distribution. In such cases, under reasonable conditions on the distribution, one can obtain arbitrarily good approximations of the true function value by averaging polynomially many points in some ε-ball around the point of interest. Stated in terms of noise, in this paper we consider oracles that have some small adversarial noise, and we wish to understand whether desirable optimization guarantees are obtainable. To avoid ambiguity, we refrain from using the term noise altogether, and refer to such inaccuracies in evaluation as error.\nWhile distributional i.i.d. assumptions are often reasonable models, evaluating our dependency on these assumptions seems necessary. 
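To make the two oracle models concrete, here is a minimal Python sketch (our own illustration, not from the paper; the function names are ours, and drawing ξ_x at random is purely illustrative, since the model allows the perturbations to be chosen adversarially):

```python
import random

def absolute_erroneous_oracle(f, eps, rng=random.Random(0)):
    """Wrap f so each query x is answered with f(x) + xi_x, xi_x in [-eps, eps].
    The perturbation is fixed per point, so repeated queries are consistent."""
    cache = {}  # remembers the perturbation chosen for each queried point
    def oracle(x):
        key = tuple(x)
        if key not in cache:
            cache[key] = rng.uniform(-eps, eps)
        return f(x) + cache[key]
    return oracle

def relative_erroneous_oracle(f, eps, rng=random.Random(0)):
    """Wrap f so each query x is answered with xi_x * f(x), xi_x in [1-eps, 1+eps]."""
    cache = {}
    def oracle(x):
        key = tuple(x)
        if key not in cache:
            cache[key] = rng.uniform(1 - eps, 1 + eps)
        return cache[key] * f(x)
    return oracle
```

An averaging strategy that would defeat i.i.d. noise gains nothing here, since the same point always receives the same (possibly worst-case) answer.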
From a practical perspective, there are cases in which noise can be correlated, or where the data we use to estimate the function is corrupted in some arbitrary way. Furthermore, since we often optimize over functions that we learn from data, the process of fitting a model to a function may also introduce some bias that does not necessarily vanish. But more generally, we should understand the consequences that modest inaccuracies may have.\n\nFigure 1: An illustration of an erroneous oracle to a convex function that fools a gradient descent algorithm.\n\nBenign cases. In the special case of a linear function f(x) = c^⊤ x, for some c ∈ R^n, a relative ε-error has little effect on the optimization. By querying f(e_i) for every i ∈ [n], we can extract c̃_i ∈ [(1 − ε)c_i, (1 + ε)c_i] and then optimize over f′(x) = c̃^⊤ x. This results in a (1 ± ε)-multiplicative approximation. Alternatively, if the erroneous oracle f̃ happens to be a convex function, optimizing f̃(x) directly retains desirable optimization guarantees, up to additive or multiplicative errors. We are therefore interested in scenarios where the error does not necessarily have such nice properties.\n\nGradient descent fails with error. For a simple example, consider the function illustrated in Figure 1. The figure illustrates a convex function (depicted in blue) and an erroneous version of it (dotted red), such that at every point the oracle is at most some additive ε > 0 away from the true function value (the ε margins of the function are depicted in grey). If a gradient descent algorithm is given access to the erroneous version (dotted red) instead of the true function (blue), the algorithm will be trapped in a local minimum that can be arbitrarily far from the true minimum. 
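The benign linear case above can be sketched in a few lines of Python (our own illustration; `recover_linear_coefficients` is a name we introduce, and we assume the oracle answers each point consistently):

```python
def recover_linear_coefficients(oracle, n):
    """Query a relative eps-erroneous oracle for f(x) = c^T x at each
    standard basis vector e_i, obtaining c~_i in [(1-eps)c_i, (1+eps)c_i].
    Optimizing c~^T x in place of c^T x over [0,1]^n then loses at most
    a (1 +/- eps) multiplicative factor."""
    estimates = []
    for i in range(n):
        e_i = tuple(1.0 if j == i else 0.0 for j in range(n))
        estimates.append(oracle(e_i))  # equals xi * c_i with xi in [1-eps, 1+eps]
    return estimates
```

Note this exploits that a linear function has no constant term; as shown later in the paper, adding a constant b already destroys this robustness.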
But the fact that a naive gradient descent algorithm fails does not necessarily mean that there isn't an algorithm that can overcome small errors. This motivates the main question of this paper.\n\nIs convex optimization robust to error?\n\nMain Results. Our results are largely spoilers. We present stark information-theoretic lower bounds for both relative and absolute ε-erroneous oracles, for any constant and even sub-constant ε > 0. In particular, we show that:\n\n• For minimizing a convex function (or maximizing a concave function) f : [0, 1]^n → [0, 1] over [0, 1]^n: for any fixed δ > 0, no algorithm can achieve an additive approximation within 1/2 − δ of the optimum using a subexponential number of calls to an absolute n^{-1/2+δ}-erroneous oracle.\n\n• For minimizing a convex function f : [0, 1]^n → [0, 1] over a polytope P ⊂ [0, 1]^n: for any fixed ε > 0, no algorithm can achieve a finite multiplicative factor using a subexponential number of calls to a relative ε-erroneous oracle.\n\n• For maximizing a concave function f : [0, 1]^n → [0, 1] over a polytope P ⊂ [0, 1]^n: for any fixed ε > 0, no algorithm can achieve a multiplicative factor better than Θ(n^{-1/2+ε}) using a subexponential number of calls to a relative ε-erroneous oracle.\n\n• For maximizing a concave function f : [0, 1]^n → [0, 1] over [0, 1]^n: for any fixed ε > 0, no algorithm can obtain a multiplicative factor better than 1/2 + ε using a subexponential number of calls to a relative ε-erroneous oracle. (And there is a trivial 1/2-approximation without asking any queries.)\n\nSomewhat surprisingly, many of the impossibility results listed above are shown for a class of extremely simple convex and concave functions, namely, affine functions: f(x) = c^⊤ x + b. 
This is in sharp contrast to the case of linear functions (without the constant term b) with relative erroneous oracles, as discussed above. In addition, we note that our results extend to strongly convex functions.\n\n1.1 Related work\n\nThe oracle models we study here fall in the category of zeroth-order, or derivative-free, optimization. Derivative-free methods have a rich history in convex optimization and were among the earliest approaches to numerically solve unconstrained optimization problems. Recently these approaches have enjoyed increasing interest, as they are useful in scenarios where black-box access is given to the function, or in cases in which gradient information is difficult to compute or does not exist [9, 8, 11, 15, 14, 6].\nThere has been a rich line of work on noisy oracles, where the oracle returns some erroneous version of the function value which is random. In a stochastic framework, these settings correspond to repeatedly choosing points in some convex domain and obtaining noisy realizations of some underlying convex function's value. Most frequently, the assumption is that one is given a first-order noisy oracle with some assumptions about the distribution that generates the error [13, 12]. In the learning theory community, optimization with stochastic noisy oracles is often motivated by multi-armed bandit settings [4, 1] and regret minimization with zeroth-order feedback [2]. All these models consider the case in which the error is drawn from a distribution.\nThe model of adversarial noise in zeroth-order oracles is mentioned in [10], which considers a related model of erroneous oracles and informally argues that exponentially many queries are required to approximately minimize a convex function in this model (under an ℓ2-ball constraint).\nIn recent work, Belloni et al. [3] study convex optimization with erroneous oracles. Interestingly, Belloni et al. show positive results. 
In their work they develop a novel algorithm that is based on sampling from an approximately log-concave distribution using the Hit-and-Run method, and show that their method has polynomial query complexity. In contrast to the negative results we show in this work, the work of Belloni et al. assumes the (absolute) erroneous oracle returns f(x) + ξ_x with ξ_x ∈ [−ε/n, ε/n]. That is, the error is not a constant term, but rather is inversely proportional to the dimension. Our lower bounds for additive approximation hold when the oracle error is not necessarily a constant but ξ_x ∈ [−1/n^{1/2-δ}, 1/n^{1/2-δ}] for a constant 0 < δ < 1/2.\n\n2 Preliminaries\n\nOptimization and convexity. For a minimization problem, given a nonnegative objective function f and a polytope P, we say that an algorithm provides a (multiplicative) α-approximation (α > 1) if it finds a point x̄ ∈ P s.t. f(x̄) ≤ α · min_{x∈P} f(x). For a maximization problem, an algorithm provides an α-approximation (α < 1) if it finds a point x̄ s.t. f(x̄) ≥ α · max_{x∈P} f(x).\nFor absolute erroneous oracles, given an objective function f and a polytope P, we aim to find a point x̄ ∈ P which is within an additive error of δ from the optimum, with δ as small as possible. That is, for a δ > 0 we aim to find a point x̄ s.t. |f(x̄) − min_x f(x)| < δ in the case of minimization.\nA function f : P → R is convex on P if f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) (or concave if f(tx + (1 − t)y) ≥ tf(x) + (1 − t)f(y)) for every x, y ∈ P and t ∈ [0, 1].\n\nChernoff bounds. Throughout the paper we appeal to Chernoff bounds. We note that while typically stated for independent random variables X_1, . . . 
, X_m, Chernoff bounds also hold for negatively associated random variables.\nDefinition 2.1 ([5], Definition 1). Random variables X_1, . . . , X_n are negatively associated if for every I ⊆ [n] and every pair of non-decreasing functions f : R^I → R, g : R^Ī → R,\n\nE[f(X_i, i ∈ I) g(X_j, j ∈ Ī)] ≤ E[f(X_i, i ∈ I)] E[g(X_j, j ∈ Ī)].\n\nClaim 2.2 ([5], Theorem 14). Let X_1, . . . , X_n be negatively associated random variables that take values in [0, 1] and let μ = E[Σ_{i=1}^n X_i]. Then, for any δ ∈ [0, 1] we have:\n\nPr[Σ_{i=1}^n X_i > (1 + δ)μ] ≤ e^{-δ²μ/3},\n\nPr[Σ_{i=1}^n X_i < (1 − δ)μ] ≤ e^{-δ²μ/2}.\n\nWe apply this to random variables that are formed by selecting a random subset of a fixed size. In particular, we use the following.\nClaim 2.3. Let x_1, . . . , x_n ≥ 0 be fixed. For 1 ≤ k ≤ n, let R be a uniformly random subset of k elements out of [n]. Let X_i = x_i if i ∈ R and X_i = 0 otherwise. Then X_1, . . . , X_n are negatively associated.\n\nProof. For x_1 = x_2 = . . . = x_n = 1, the statement holds by Corollary 11 of [5] (which refers to this distribution as the Fermi-Dirac model). The generalization to arbitrary x_i ≥ 0 follows from Proposition 4 of [5] with I_j = {j} and h_j(t) = x_j t.\n\n3 Optimization over the unit cube\n\nWe start with optimization over [0, 1]^n, arguably the simplest possible polytope. We show that already in this setting, the presence of adversarial noise prevents us from achieving much more than trivial results.\n\n3.1 Convex minimization\n\nFirst let us consider convex minimization over [0, 1]^n. In this setting, we show that errors as small as n^{-(1-δ)/2} prevent us from optimizing within a constant additive error.\nTheorem 3.1. Let δ > 0 be a constant. 
There are instances of a convex function f : [0, 1]^n → [0, 1] accessible through an absolute n^{-(1-δ)/2}-erroneous oracle, such that a (possibly randomized) algorithm that makes e^{O(n^δ)} queries cannot find a solution of value better than within additive 1/2 − o(1) of the optimum with probability more than e^{-Ω(n^δ)}.\n\nWe remark that the proof of this theorem is inspired by the proof of hardness of (1/2 + ε)-approximation for unconstrained submodular maximization [7]; in particular it can be viewed as a simple application of the “symmetry gap” argument (see [16] for a more general exposition).\n\nProof. Let ε = n^{-(1-δ)/2}; we can assume that ε < 1/2, otherwise n is constant and the statement is trivial. We will construct an ε-erroneous oracle (both in the relative and absolute sense) for a convex function f : [0, 1]^n → [0, 1]. Consider a partition of [n] into two subsets A, B of size |A| = |B| = n/2 (which will be eventually chosen randomly). We define the following function:\n\n• f(x) = 1/2 + (1/n) (Σ_{i∈A} x_i − Σ_{j∈B} x_j).\n\nThis is a convex (in fact linear) function. Next, we define the following modification of f, which could be the function returned by an ε-erroneous oracle.\n\n• If |Σ_{i∈A} x_i − Σ_{j∈B} x_j| > εn/2, then f̃(x) = f(x) = 1/2 + (1/n) (Σ_{i∈A} x_i − Σ_{j∈B} x_j).\n\n• If |Σ_{i∈A} x_i − Σ_{j∈B} x_j| ≤ εn/2, then f̃(x) = 1/2.\n\nNote that f(x) and f̃(x) differ only in the region where |Σ_{i∈A} x_i − Σ_{j∈B} x_j| ≤ εn/2. In particular, the value of f(x) in this region is within [(1−ε)/2, (1+ε)/2], while f̃(x) = 1/2, so an ε-erroneous oracle for f(x) (both in the relative and absolute sense) could very well return f̃(x) instead.\nNow assume that (A, B) is a random partition, unknown to the algorithm. We argue that with high probability, a fixed query x issued by the algorithm will have the property that |Σ_{i∈A} x_i − Σ_{j∈B} x_j| ≤ εn/2. More precisely, since (A, B) is chosen at random subject to |A| = |B| = n/2, we have that Σ_{i∈A} x_i is a sum of negatively associated random variables in [0, 1] (by Claim 2.3). The expectation of this quantity is μ = E[Σ_{i∈A} x_i] = (1/2) Σ_{i=1}^n x_i ≤ n/2. By Claim 2.2, we have\n\nPr[Σ_{i∈A} x_i > μ + εn/4] = Pr[Σ_{i∈A} x_i > (1 + εn/(4μ)) μ] < e^{-(εn/(4μ))² μ/3} ≤ e^{-ε²n/24}.\n\nSince Σ_{i∈A} x_i + Σ_{i∈B} x_i = Σ_{i=1}^n x_i = 2μ, we get\n\nPr[Σ_{i∈A} x_i − Σ_{i∈B} x_i > εn/2] = Pr[Σ_{i∈A} x_i − μ > εn/4] < e^{-ε²n/24}.\n\nBy symmetry,\n\nPr[|Σ_{i∈A} x_i − Σ_{j∈B} x_j| > εn/2] < 2e^{-ε²n/24}.\n\nWe emphasize that this holds for a fixed query x.\nRecall that we assumed the algorithm to be deterministic. Hence, as long as its queries satisfy the property above, the answers will be f̃(x) = 1/2, and the algorithm will follow the same path of computation, no matter what the choice of (A, B) is. (Effectively we will not learn anything about A and B.) Considering the sequence of queries on this computation path, if the number of queries is t then with probability at least 1 − 2te^{-ε²n/24} the queries will indeed fall in the region where f̃(x) = 1/2 and the algorithm will follow this path. If t ≤ e^{ε²n/48}, this happens with probability at least 1 − 2e^{-ε²n/48}. In this case, all the points queried by the algorithm, as well as the returned solution x_out (by the same argument), satisfy f̃(x_out) = 1/2, and hence f(x_out) ≥ (1−ε)/2. In contrast, the actual optimum is f(1_B) = 0. Recall that ε = n^{-(1-δ)/2}; hence, f(x_out) ≥ (1/2)(1 − n^{-(1-δ)/2}) and the bounds on the number of queries and probability of success are as in the statement of the theorem.\nFinally, consider a randomized algorithm. 
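The construction in this proof is simple enough to simulate. The following Python sketch (our own illustration, not the paper's code; the names are ours) builds the pair (f, f̃) for a random hidden balanced partition:

```python
import random

def make_hard_instance(n, eps, seed=0):
    """Build the Theorem 3.1 instance: a random balanced partition (A, B)
    of {0, ..., n-1}, the linear function
        f(x) = 1/2 + (sum_{i in A} x_i - sum_{j in B} x_j) / n,
    and the erroneous oracle f~ that answers 1/2 whenever the query's
    partition imbalance is at most eps*n/2, and f(x) otherwise."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    A, B = set(perm[:n // 2]), set(perm[n // 2:])

    def f(x):
        gap = sum(x[i] for i in A) - sum(x[j] for j in B)
        return 0.5 + gap / n

    def f_tilde(x):
        gap = sum(x[i] for i in A) - sum(x[j] for j in B)
        return f(x) if abs(gap) > eps * n / 2 else 0.5

    return A, B, f, f_tilde
```

On any query whose mass is split roughly evenly across the hidden partition, f̃ answers exactly 1/2 and reveals nothing about (A, B), while the true minimum f(1_B) = 0; this is the gap the lower bound exploits.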
Denote by (R_1, R_2, . . .) the random variables used by the algorithm in its decisions. We can condition on a fixed choice of (R_1 = r_1, R_2 = r_2, . . .), which makes the algorithm deterministic. By our proof, the algorithm conditioned on this choice cannot succeed with probability more than e^{-Ω(n^δ)}. Since this is true for each particular choice of (r_1, r_2, . . .), by averaging it is also true for a random choice of (R_1, R_2, . . .). Hence, we obtain the same result for randomized algorithms as well.\n\n3.2 Concave maximization\n\nHere we consider the problem of maximizing a concave function f : [0, 1]^n → [0, 1]. One can obtain a result for concave maximization analogous to Theorem 3.1, which we do not state; in terms of additive errors, there is really no difference between convex minimization and concave maximization. However, in the case of concave maximization we can also formulate the following hardness result for multiplicative approximation.\nTheorem 3.2. If a concave function f : [0, 1]^n → [0, 1] is accessible through a relative δ-erroneous oracle, then for any ε ∈ [0, δ], an algorithm that makes less than e^{ε²n/48} queries cannot find a solution of value greater than ((1+ε)/2) OPT with probability more than 2e^{-ε²n/48}.\n\nProof. This result follows from the same construction as Theorem 3.1. Recall that f(x) is a linear function, hence also concave. As we mentioned in the proof of Theorem 3.1, f̃(x) could be the values returned by a relative ε-erroneous oracle. Now we consider an arbitrary ε > 0; note that for δ ≥ ε it still holds that f̃(x) is a relative δ-erroneous oracle.\nBy the same proof, an algorithm querying less than e^{ε²n/48} points cannot find a solution of value better than (1+ε)/2 with probability more than 2e^{-ε²n/48}. In contrast, the optimum of the maximization problem is sup_{x∈[0,1]^n} f(x) = 1. Therefore, the algorithm cannot achieve a multiplicative approximation better than (1+ε)/2.\n\nWe note that this hardness result is optimal due to the following easy observation.\n\nTheorem 3.3. For any concave function f : [0, 1]^n → R_+, let OPT = sup_{x∈[0,1]^n} f(x). Then\n\nf(1/2, 1/2, . . . , 1/2) ≥ (1/2) OPT.\n\nProof. By compactness, the optimum is attained at a point: let OPT = f(x*). Let also x′ = (1, 1, . . . , 1) − x*. We have x′ ∈ [0, 1]^n and hence f(x′) ≥ 0. By concavity, we obtain\n\nf(1/2, 1/2, . . . , 1/2) = f((x* + x′)/2) ≥ (f(x*) + f(x′))/2 ≥ (1/2) f(x*) = (1/2) OPT.\n\nIn other words, a multiplicative 1/2-approximation for this problem is trivial to obtain, even without asking any queries about f: we just return the point (1/2, 1/2, . . . , 1/2). Thus we can conclude that for concave maximization, a relative ε-erroneous oracle is not useful at all.\n\n4 Optimization over polytopes\n\nIn this section we consider optimization of convex and concave functions over a polytope P = {x ≥ 0 : Ax = b}. We will show inapproximability results for the relative error model. Note that for the absolute error case, the lower bound on convex minimization from the previous section holds, and can be applied to show a lower bound for concave maximization with absolute errors.\nTheorem 4.1. Let ε, δ ∈ (0, 1/2) be some constants. There are convex functions for which no algorithm can obtain a finite approximation ratio to min_{x∈P} f(x) without using Ω(e^{n^δ}) queries to a relative ε-erroneous oracle of the function.\n\nProof. 
We will prove our theorem for the case in which P = {x ≥ 0 : Σ_i x_i ≤ n^{1/2+δ}}. Let H be a subset of indices chosen uniformly at random from all subsets of size exactly n^{1/2+δ}. We construct two functions:\n\nf(x) = n^{1+δ} − n^{1/2} Σ_{i∈H} x_i,\n\ng(x) = n^{1+δ} − n^{δ} Σ_i x_i.\n\nObserve that both these functions are convex and non-negative. Also, observe that the minimizer of f is x* = 1_H with f(x*) = 0, while the minimizer of g is any vector x′ with Σ_i x′_i = n^{1/2+δ}, for which g(x′) = n^{1+δ} − n^{1/2+2δ} = Θ(n^{1+δ}). Therefore, the ratio between these two functions is unbounded. We now construct the erroneous oracle in the following manner:\n\nf̃(x) = g(x) if (1 − ε)f(x) ≤ g(x) ≤ (1 + ε)f(x), and f̃(x) = f(x) otherwise.\n\nBy definition, f̃ is an ε-erroneous oracle for f. The claim will follow from the fact that given access to f̃ one cannot distinguish between f and g using a subexponential number of queries. This implies the inapproximability result, since an approximation algorithm which guarantees a finite approximation ratio using a subexponential number of queries could be used to distinguish between the two functions: if the algorithm returns an answer strictly greater than 0 then we know the underlying function is g, and otherwise it is f.\nGiven a query x ∈ [0, 1]^n to the oracle, we consider two cases.\n\n• In case the query x is such that Σ_i x_i ≤ n^{1/2}, we have that:\n\nn^{1+δ} − n ≤ f(x) ≤ n^{1+δ},\n\nn^{1+δ} − n^{1/2+δ} ≤ g(x) ≤ n^{1+δ}.\n\nSince for any ε, δ > 0 there is a large enough n s.t. n^δ > (1 + ε)/ε, this implies that for any query for which Σ_i x_i ≤ n^{1/2} we have g(x) ∈ [(1 − ε)f(x), (1 + ε)f(x)], and thus the oracle returns g(x).\n\n• In case the query is such that Σ_i x_i > n^{1/2}, we can interpret the value of Σ_{i∈H} x_i, which determines the value of f, as a sum of negatively associated random variables X_1, . . . , X_n, where X_i realizes with probability n^{-1/2+δ} and takes value x_i if realized (see Claim 2.3). We can then apply the Chernoff bound (Claim 2.2), using the fact that E[Σ_{i∈H} x_i] = (Σ_i x_i)/n^{1/2-δ}, and get that for any constant 0 < β < 1, with probability 1 − e^{-Ω(n^δ)}:\n\n(1 − β) (Σ_i x_i)/n^{1/2-δ} ≤ Σ_{i∈H} x_i ≤ (1 + β) (Σ_i x_i)/n^{1/2-δ}.\n\nBy using β ≤ ε/(1 + ε), this implies that with probability at least 1 − e^{-Ω(n^δ)} we get:\n\n(1 − ε)f(x) ≤ g(x) ≤ (1 + ε)f(x).\n\nSince the likelihood of distinguishing between f and g on a single query is exponentially small in n^δ, the same arguments used throughout the paper imply that it takes an exponential number of queries to distinguish between f and g.\n\nTo conclude, for any query x ∈ [0, 1]^n it takes Ω(e^{n^δ}) queries to distinguish between f and g. As discussed above, since the ratio between the optima of these two functions is unbounded, this concludes the proof.\nTheorem 4.2. For all constants ε, δ ∈ (0, 1/2) there is a concave function f : [0, 1]^n → R_+ for which no algorithm can obtain an approximation strictly better than O(n^{-1/2+δ}) to max_{x∈P} f(x) without using Ω(e^{n^δ}) queries to a relative ε-erroneous oracle of the function.\n\nProof. 
We follow a similar methodology as in the proof of Theorem 4.1. We again select a set H of size n^{1/2+δ} uniformly at random and construct two functions: f(x) = n^{1/2} Σ_{i∈H} x_i + n^{1/2+δ} and g(x) = n^{δ} Σ_i x_i + n^{1/2+δ}. As in the proof of Theorem 4.1, the noisy oracle is f̃(x) = g(x) when (1 − ε)f(x) ≤ g(x) ≤ (1 + ε)f(x), and otherwise f̃(x) = f(x). Note that both functions are concave and non-negative, and by its definition the oracle is ε-erroneous for the function f. For b = n^{1/2+δ} it is easy to see that the optimal value when the objective is f is Θ(n^{1+δ}), while the optimal value is O(n^{1/2+2δ}) when the objective is g, which implies that one cannot obtain an approximation better than Ω(n^{-1/2+δ}) with a subexponential number of queries. In case the query to the oracle is a point x s.t. Σ_i x_i ≤ n^{1/2}, then by Chernoff bound arguments similar to the ones we used above, with probability at least 1 − e^{-Ω(n^δ)} we get (1 − ε)f(x) ≤ g(x) ≤ (1 + ε)f(x). Thus, for any query in which Σ_i x_i ≤ n^{1/2}, the likelihood of the oracle returning f is exponentially small in n^δ. In case the query is a point x s.t. Σ_i x_i > n^{1/2}, standard concentration bound arguments as before imply that with probability at least 1 − e^{-Ω(n^δ)} we get (1 − ε)f(x) ≤ g(x) ≤ (1 + ε)f(x). Since the likelihood of distinguishing between f and g on a single query is exponentially small in n^δ, we can conclude that it takes an exponential number of queries to distinguish between f and g.\n\n5 Optimization over assignments\n\nIn this section, we consider the concave maximization problem over a more specific polytope,\n\nP_{n,k} = {x ∈ R_+^{n×k} : Σ_{j=1}^k x_{ij} = 1 ∀i ∈ [n]}.\n\nThis can be viewed as the matroid polytope for a partition matroid on n blocks of k elements, or alternatively the convex hull of assignments of n items to k agents. In this case, there is a trivial 1/k-approximation, similar to the 1/2-approximation in the case of the unit cube.\n\nTheorem 5.1. For any k ≥ 2 and a concave function f : P_{n,k} → R_+, let OPT = sup_{x∈P_{n,k}} f(x). Then\n\nf(1/k, . . . , 1/k) ≥ (1/k) OPT.\n\nProof. By compactness, the optimum is attained at a point: let OPT = f(x*). Let x^(ℓ)_{ij} = x*_{i,(j+ℓ mod k)}; i.e., x^(ℓ) is a cyclic shift of the coordinates of x* by ℓ in each block. We have x^(ℓ) ∈ P_{n,k} and (1/k) Σ_{ℓ=0}^{k-1} x^(ℓ)_{ij} = (1/k) Σ_{j=1}^k x*_{ij} = 1/k. By concavity and nonnegativity of f, we obtain\n\nf(1/k, . . . , 1/k) = f((1/k) Σ_{ℓ=0}^{k-1} x^(ℓ)) ≥ (1/k) f(x^(0)) = (1/k) OPT.\n\nWe show that this approximation is best possible if we have access only to a δ-erroneous oracle.\nTheorem 5.2. 
If k ≥ 2 and a concave function f : P_{n,k} → [0, 1] is accessible through a relative δ-erroneous oracle, then for any ε ∈ [0, δ], an algorithm that makes less than e^{ε²n/6k} queries cannot find a solution of value greater than ((1+ε)/k) OPT with probability more than 2e^{-ε²n/6k}.\nNote that this result is nontrivial only for n ≫ k. In other words, the hardness factor of k is never worse than a square root of the dimension of the problem. Therefore, this result can be viewed as interpolating between the hardness of (1+ε)/2-approximation over the unit cube (Theorem 3.2) and the hardness of n^{δ-1/2}-approximation over a general polytope (Theorem 4.2).\nProof. Given π : [n] → [k], we construct a function f^π : P_{n,k} → [0, 1] (where π describes the intended optimal solution):\n\n• f^π(x) = (1/n) Σ_{i=1}^n x_{i,π(i)}.\n\nNext we define a modified function f̃^π as follows:\n\n• If |f^π(x) − 1/k| > ε/k then f̃^π(x) = f^π(x).\n\n• If |f^π(x) − 1/k| ≤ ε/k then f̃^π(x) = 1/k.\n\nBy definition, f^π(x) and f̃^π(x) differ only if |f^π(x) − 1/k| ≤ ε/k, and then f^π(x) ∈ [(1−ε)/k, (1+ε)/k] while f̃^π(x) = 1/k. Therefore, f̃^π(x) is a valid relative ε-erroneous oracle for f^π.\nSimilarly to the proofs above, we argue that if π is chosen uniformly at random, then with high probability f̃^π(x) = 1/k for any fixed query x ∈ P_{n,k}. This holds again by a Chernoff bound: for fixed x_{ij} such that Σ_{j=1}^k x_{ij} = 1, we have that f^π(x) = (1/n) Σ_{i=1}^n x_{i,π(i)} = (1/n) Z, where Z is a sum of independent random variables with values in [0, 1] and expectation E[Z] = (1/k) Σ_{i,j} x_{ij} = n/k. By the Chernoff bound, Pr[|Z − n/k| > εn/k] < 2e^{-ε²n/3k}. This gives\n\nPr[|f^π(x) − 1/k| > ε/k] < 2e^{-ε²n/3k}.\n\nBy the same arguments as before, if the algorithm asks less than e^{ε²n/6k} queries, then it will not detect a point such that |f^π(x) − 1/k| > ε/k with probability more than 2e^{-ε²n/6k}. Then the query answers will all be f̃^π(x) = 1/k, and the value of the returned solution will be at most (1+ε)/k. Meanwhile, the optimum solution is x*_{i,π(i)} = 1 for all i, which gives f^π(x*) = 1.\n\nAcknowledgements. YS was supported by NSF grant CCF-1301976, CAREER CCF-1452961 and a Google Faculty Research Award.\n\nReferences\n\n[1] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 28–40, 2010.\n\n[2] Alekh Agarwal, Dean P. Foster, Daniel Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. SIAM Journal on Optimization, 23(1):213–240, 2013.\n\n[3] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexander Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In COLT 2015.\n\n[4] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.\n\n[5] Devdatt Dubhashi, Volker Priebe, and Desh Ranjan. Negative dependence through the FKG inequality. Research report MPI-I-96-1-020, Max-Planck-Institut für Informatik, Saarbrücken, 1996.\n\n[6] John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. 
Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.\n\n[7] Uriel Feige, Vahab S. Mirrokni, and Jan Vondrák. Maximizing non-monotone submodular functions. SIAM J. Comput., 40(4):1133–1153, 2011.\n\n[8] Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2005, Vancouver, British Columbia, Canada, January 23-25, 2005, pages 385–394, 2005.\n\n[9] Kevin G. Jamieson, Robert D. Nowak, and Benjamin Recht. Query complexity of derivative-free optimization. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, United States, pages 2681–2689, 2012.\n\n[10] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York, 1983.\n\n[11] Yurii Nesterov. Random gradient-free minimization of convex functions. CORE Discussion Papers 2011001, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2011.\n\n[12] Aaditya Ramdas, Barnabás Póczos, Aarti Singh, and Larry A. Wasserman. An analysis of active learning with uniform feature noise. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pages 805–813, 2014.\n\n[13] Aaditya Ramdas and Aarti Singh. Optimal rates for stochastic convex optimization under Tsybakov noise condition. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 365–373, 2013.\n\n[14] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. In COLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA, pages 3–24, 2013.\n\n[15] Sebastian U. Stich, Christian L. Müller, and Bernd Gärtner. Optimization of convex functions with random pursuit. CoRR, abs/1111.0194, 2011.\n\n[16] Jan Vondrák. Symmetry and approximability of submodular maximization problems. SIAM J. Comput., 42(1):265–304, 2013.\n", "award": [], "sourceid": 1785, "authors": [{"given_name": "Yaron", "family_name": "Singer", "institution": "Harvard University"}, {"given_name": "Jan", "family_name": "Vondrak", "institution": "IBM Research"}]}