{"title": "Information-theoretic lower bounds on the oracle complexity of convex optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1, "page_last": 9, "abstract": "Despite the large amount of literature on upper bounds on complexity of convex analysis, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding critical to understand fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss implications of these results to the understanding the inherent complexity of large-scale learning and estimation problems.", "full_text": "Information-theoretic lower bounds on the oracle\n\ncomplexity of convex optimization\n\nAlekh Agarwal\n\nComputer Science Division\n\nUC Berkeley\n\nalekh@cs.berkeley.edu\n\nPradeep Ravikumar\n\nDepartment of Computer Sciences\n\nUT Austin\n\npradeepr@cs.utexas.edu\n\nPeter Bartlett\n\nComputer Science Division\n\nDepartment of Statistics\n\nUC Berkeley\n\nbartlett@cs.berkeley.edu\n\nMartin J. Wainwright\nDepartment of EECS, and\nDepartment of Statistics\n\nUC Berkeley\n\nwainwrig@eecs.berkeley.edu\n\nAbstract\n\nDespite a large literature on upper bounds on complexity of convex optimization,\nrelatively less attention has been paid to the fundamental hardness of these prob-\nlems. Given the extensive use of convex optimization in machine learning and\nstatistics, gaining a understanding of these complexity-theoretic issues is impor-\ntant.\nIn this paper, we study the complexity of stochastic convex optimization\nin an oracle model of computation. We improve upon known results and obtain\ntight minimax complexity estimates for various function classes. 
We also discuss implications of these results for understanding the inherent complexity of large-scale learning and estimation problems.\n\n1 Introduction\n\nConvex optimization forms the backbone of many algorithms for statistical learning and estimation. In large-scale learning problems, in which the problem dimension and/or data are large, it is essential to exploit bounded computational resources in a (near)-optimal manner. For such problems, understanding the computational complexity of convex optimization is a key issue.\nA large body of literature is devoted to obtaining rates of convergence of specific procedures for various classes of convex optimization problems. A typical outcome of such analysis is an upper bound on the error (for instance, the gap to the optimal cost) as a function of the number of iterations. Such analyses have been performed for many standard optimization algorithms, among them gradient descent, mirror descent, interior-point methods, and stochastic gradient descent, to name a few. We refer the reader to standard texts on optimization (e.g., [4, 1, 10]) for further details on such results.\nOn the other hand, there has been relatively little study of the inherent complexity of convex optimization problems. To the best of our knowledge, the first formal study in this area was undertaken in the seminal work of Nemirovski and Yudin [8] (hereafter referred to as NY). One obstacle to a classical complexity-theoretic analysis, as the authors observed, was that of casting convex optimization problems in a Turing Machine model. They avoided this problem by instead considering a natural oracle model of complexity, in which at every round the optimization procedure queries an oracle for certain information on the function being optimized. Working within this framework, the authors obtained a series of lower bounds on the computational complexity of convex optimization problems. 
In addition to the original text of NY [8], we refer the reader to Nesterov [10] or the lecture notes by Nemirovski [7].\nIn this paper, we consider the computational complexity of stochastic convex optimization in the oracle model. Our results lead to a characterization of the inherent difficulty of learning and estimation problems when computational resources are constrained. In particular, we improve upon the work of NY [8] in two ways. First, our lower bounds have an improved dependence on the dimension of the space. In the context of statistical estimation, these bounds show how the difficulty of the estimation problem increases with the number of parameters. Second, our techniques naturally extend to give sharper results for optimization over simpler function classes. For instance, they show that the optimal oracle complexity of statistical estimation with quadratic loss is significantly smaller than the corresponding complexity with absolute loss. Our proofs exploit a new notion of the discrepancy between two functions that appears to be natural for optimization problems. They are based on a reduction from a statistical parameter estimation problem to the stochastic optimization problem, and an application of information-theoretic lower bounds for the estimation problem.\n\n2 Background and problem formulation\n\nIn this section, we introduce background on the oracle model of complexity for convex optimization, and then define the oracles considered in this paper.\n\n2.1 Convex optimization in the oracle model\n\nConvex optimization is the task of minimizing a convex function f over a convex set S ⊆ R^d. Assuming that the minimum is achieved, this corresponds to computing an element x*_f that achieves the minimum, that is, x*_f ∈ arg min_{x∈S} f(x). An optimization method is any procedure that solves this task, typically by repeatedly selecting values from S. 
Our primary focus in this paper is the following question: given any class of convex functions F, what is the minimum computational labor any such optimization method would expend for any function in F?\nIn order to address this question, we follow the approach of Nemirovski and Yudin [8], based on the oracle model of optimization. More precisely, an oracle is a (possibly random) function φ : S → I that answers any query x ∈ S by returning an element φ(x) in an information set I. The information set varies depending on the oracle; for instance, for an exact oracle of kth order, the answer to a query x_t consists of f(x_t) and the first k derivatives of f at x_t. For the case of stochastic oracles studied in this paper, these values are corrupted with zero-mean noise with bounded variance.\nGiven some number of rounds T, an optimization method M designed to approximately minimize the convex function f over the convex set S proceeds as follows: at any given round t = 1, . . . , T, the method M queries at a point x_t ∈ S, and the oracle reveals the information φ(x_t, f). The method then uses this information to decide at which point x_{t+1} the next query should be made. For a given oracle function φ, let M_T denote the class of all optimization methods M that make T queries according to the procedure outlined above. For any method M ∈ M_T, we define its error on function f after T steps as\n\nε(M, f, S, φ) := f(x_T) − inf_{x∈S} f(x) = f(x_T) − f(x*_f),   (1)\n\nwhere x_T is the method's query at time T. Note that by definition of x*_f, this error is a non-negative quantity.\n\n2.2 Minimax error\n\nWhen the oracle is stochastic, the method's query x_T at time T is itself random, since it depends on the random answers provided by the oracle. In this case, the optimization error ε(M, f, S, φ) is also a random variable. 
Accordingly, for the case of stochastic oracles, we measure the accuracy in terms of the expected value E_φ[ε(M, f, S, φ)], where the expectation is taken over the oracle randomness. Given a class of functions F, and the class M_T of optimization methods making T oracle queries, we can define the minimax error\n\nε*(F, S, φ) := inf_{M∈M_T} sup_{f∈F} E_φ[ε(M, f, S, φ)].   (2)\n\nNote that this definition depends on the optimization set S. In order to obtain uniform bounds, we define S := {S ⊆ R^d : S convex, ‖x − y‖_∞ ≤ 1 for x, y ∈ S}, and consider the worst-case average error over all S ∈ S, given by\n\nε*(F, φ) := sup_{S∈S} ε*(F, S, φ).   (3)\n\nIn the sequel, we provide results for particular classes of oracles. So as to ease the notation, when the function φ is clear from the context, we simply write ε*(F).\nIt is worth noting that oracle complexity measures only the number of queries to the oracle (for instance, the number of approximate function or gradient evaluations). However, it does not track the computational cost within each component of the oracle query (e.g., the actual flop count associated with evaluating the gradient).\n\n2.3 Types of Oracle\n\nIn this paper we study the class of stochastic first-order oracles, which we denote simply by O. For this class of oracles, the information set I consists of pairs of noisy function and gradient evaluations; consequently, any oracle φ in this class can be written as\n\nφ(x, f) = (f̂(x), ĝ(x)),   (4)\n\nwhere f̂(x) and ĝ(x) are random variables that are unbiased as estimators of the function and gradient values respectively (i.e., E f̂(x) = f(x) and E ĝ(x) = ∇f(x)). Moreover, we assume that both f̂(x) and ĝ(x) have variances bounded by one. When the gradient is not defined at x, the notation ∇f(x) should be understood to mean an arbitrary subgradient at x. Recall that a subgradient of a convex function f at x is any vector v ∈ R^d such that\n\nf(y) ≥ f(x) + v^T (y − x) for all y.\n\nStochastic gradient methods are popular examples of algorithms for such oracles.\nNotation: For the convenience of the reader, we collect here some notation used throughout the paper. We use x_1^t to refer to the sequence (x_1, . . . , x_t). We refer to the i-th coordinate of any vector x ∈ R^d as x(i). For a convex set S, the radius of the largest inscribed ℓ_∞ ball is denoted by r_∞. For a convex function f, its minimizer over a set S is denoted by x*_f when S is obvious from the context. We often use the notation x*_α to denote the minimizer of f_α when α is an index variable over a class. For two distributions p and q, KL(p‖q) refers to the Kullback-Leibler divergence between the distributions. The notation I(A) is the 0-1 valued indicator random variable of the set (equivalently, event) A. For two vectors α, β ∈ {−1, +1}^d, we define the Hamming distance ∆_H(α, β) := Σ_{i=1}^d I[α_i ≠ β_i].\n\n3 Main results and their consequences\n\nWith the setup of stochastic convex optimization in place, we are now in a position to state the main results of this paper. In particular, we provide some tight lower bounds on the complexity of stochastic oracle optimization. We begin by analyzing the minimax oracle complexity of optimization for the class of convex Lipschitz functions. Recall that a function f : R^d → R is convex if for all x, y ∈ R^d and λ ∈ (0, 1), we have the inequality f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). 
For\nsome constant L > 0, we say that the function f is L-Lipschitz on S if |f(x)\u2212 f(y)| \u2264 L(cid:107)x\u2212 y(cid:107)\u221e\nfor all x, y \u2208 S.\nBefore stating the results, we note that scaling the Lipschitz constant scales minimax optimization\nerror linearly. Hence, to keep our results scale-free, we consider 1-Lipschitz functions only. As the\ndiameter of S is also bounded by 1, this automatically enforces that |f(x)| \u2264 1, \u2200x \u2208 S.\nTheorem 1. Let F C be the class of all bounded convex 1-Lipschitz functions on Rd. Then there is\na constant c (independent of d) such that\n\n\u0001\u2217(F C, \u03c6) \u2265 c\n\nsup\n\u03c6\u2208O\n\nd\nT\n\n.\n\n(5)\n\n(cid:114)\n\n3\n\n\fRemarks: This lower bound is tight in the minimax sense, since the method of stochastic gradient\ndescent attains a matching upper bound for all stochastic \ufb01rst order oracles for any convex set S\n(see Chapter 5 of NY [8]). Also, even though this lower bound requires the oracle to have only\nbounded variance, we will use an oracle based on Bernoulli random variables, which has all mo-\nments bounded. As a result there is no hope to get faster rates in a simple way by assuming bounds\non higher moments for the oracle. This is in interesting contrast to the case of having less than 2\nbounded moments where we get slower rates (again, see Chapter 5 of NY [8]).\nThe above lower bound is obtained by considering the worst case over all convex sets. However,\nwe expect optimization over a smaller convex set to be easier than over a large set. Indeed, we can\neasily obtain a corollary of Theorem 1 that quanti\ufb01es this intuition.\nCorollary 1. Let F C be the class of all bounded convex 1-Lipschitz functions on Rd. Let S be a\nconvex set such that it contains an (cid:96)\u221e ball of radius r\u221e and is contained in an (cid:96)\u221e ball of radius\nR\u221e. 
Then there is a universal constant c such that,\n\n(cid:114)\n\nsup\n\u03c6\u2208O\n\n\u0001\u2217(F C, S, \u03c6) \u2265 c\n\nr\u221e\nR\u221e\n\nd\nT\n\n.\n\n(6)\n\nRemark: The ratio r\u221e\nR\u221e is also common in results of [8], and is called the asphericity of S. As\na particular application of above corollary, consider S to be the unit (cid:96)2 ball. Then r\u221e = 1\u221a\n, and\nR\u221e = 1. which gives a dimension independent lower bound. This lower bound for the case of the\n(cid:96)2 ball is indeed tight, and is recovered by the stochastic gradient descent algorithm [8].\nJust as optimization over simpler sets gets easier, optimization over simple function classes should\nbe easier too. A natural function class that has been studied extensively in the context of better upper\nbounds is that of strongly convex functions. For any given norm (cid:107) \u00b7 (cid:107) on S, a function f is strongly\nconvex with coef\ufb01cient \u03ba means that f(x) \u2265 f(y) + \u2207f(y)T (x \u2212 y) + \u03ba\n2(cid:107)x \u2212 y(cid:107)2 for all x, y \u2208 S.\nFor this class of functions, we obtain a smaller lower bound on the minimax oracle complexity of\noptimization.\nTheorem 2. Let FS be the class of all bounded strongly convex and 1-Lipschitz functions on Rd.\nThen there is a universal constant c such that,\n\nd\n\n\u0001\u2217(FS , \u03c6) \u2265 c\n\nd\nT\n\n.\n\nsup\n\u03c6\u2208O\n\n(7)\n\n(cid:1)2 d\n\nT .\n\nR\u221e\n\nOnce again there is a matching upper bound using stochastic gradient descent for example, when\nthe strong convexity is with respect to the (cid:96)2 norm. The corollary depending on the geometry of S\nfollows again.\nCorollary 2. Let FS be the class of all bounded convex 1-Lipschitz functions on Rd. Let S be a\nconvex set such that it contains an (cid:96)\u221e ball of radius r\u221e. 
Then there is a universal constant c such\n\nthat sup\u03c6\u2208O \u0001\u2217(FS , S, \u03c6) \u2265 c(cid:0) r\u221e\nIn comparison, Nemirovski and Yudin [8] obtained a lower bound scaling as \u2126(cid:0) 1\u221a\n\n(cid:1) for the class\n\nF C. Their bound applies only to the class F C, and does not provide any dimension dependence,\nas opposed to the bounds provided here. Obtaining the correct dependence yields tight minimax\nresults, and allows us to highlight the dependence of bounds on the geometry of the set S. Our\nproofs are information-theoretic in nature. We characterize the hardness of optimization in terms of\na relatively easy to compute complexity measure. As a result, our technique provides tight lower\nbounds for smaller function classes like strongly convex functions rather easily. Indeed, we will also\nstate a result for general function classes.\n\nT\n\n3.1 An application to statistical estimation\n\nWe now describe a simple application of the results developed above to obtain results on the oracle\ncomplexity of statistical estimation, where the typical setup is the following: given a convex loss\nfunction (cid:96), a class of functions F indexed by a d-dimensional parameter \u03b8 so that F = {f\u03b8 : \u03b8 \u2208\n\n4\n\n\fRd}, \ufb01nd a function f \u2208 F such that E(cid:96)(f) \u2212 inf f\u2208F E(cid:96)(f) \u2264 \u0001. If the distribution were known,\nthis is exactly the problem of computing the \u0001-accurate optimizer of a convex function, assuming\nthe function class F is convex. Even though we do not have the distribution in practice, we typically\nare provided with i.i.d. samples from it, which can be used to obtain unbiased estimates of the\nvalue and gradients of the risk functional E(cid:96)(f) for any given f. If indeed the computational model\nof the estimator were restricted to querying these values and gradients, then the lower bounds in\nthe previous sections would apply. 
Our bounds then allow us to deduce the oracle complexity of statistical estimation problems in this realistic model. In particular, a case of interest is when we fix a convex loss function ℓ and consider the worst oracle complexity over all possible distributions under which the expectation is taken. From our bounds, it is straightforward to deduce:\n• For the absolute loss ℓ(f(x), y) = |f(x) − y|, the oracle complexity of ε-accurate estimation over all possible distributions is Ω(d/ε^2).\n• For the quadratic loss ℓ(f(x), y) = (f(x) − y)^2, the oracle complexity of ε-accurate estimation over all possible distributions is Ω(d/ε).\nWe can use such an analysis to determine the limits of statistical estimation under computational constraints. Several authors have recently considered this problem [3, 9], and provided upper bounds for particular algorithms. In contrast, our results provide algorithm-independent lower bounds on the complexity of statistical estimation within the oracle model. An interesting direction for future work is to broaden the oracle model so as to more accurately reflect the computational trade-offs in learning and estimation problems, for instance by allowing a method to pay a higher price to query an oracle with lower variance.\n\n4 Proofs of results\n\nWe now turn to the proofs of our main results, beginning with a high-level outline of the main ideas common to our proofs.\n\n4.1 High-level outline\n\nOur main idea is to embed the problem of estimating the parameter of a Bernoulli vector (alternatively, the biases of d coins) into a convex optimization problem. We start with an appropriately chosen subset of the vertices of a d-dimensional hypercube, each of which corresponds to some value of the Bernoulli vector. 
For any given function class, we then construct a "difficult" subclass of functions parameterized by these hypercube vertices. We then show that being able to optimize any function in this subclass requires estimating its hypercube vertex, that is, the corresponding biases of the d coins. But the only information available for this estimation comes from the coin-toss outcomes revealed by the oracle in T queries. With this set-up, we are able to apply the Fano lower bound for statistical estimation, as has been done in past work on nonparametric estimation (e.g., [5, 2, 11]).\nIn more detail, the proofs of Theorems 1 and 2 are both based on a common set of steps, which we describe here.\n\nStep I: Constructing a difficult subclass of functions. Our first step is to construct a subclass of functions G ⊆ F that we use to derive lower bounds. Any such subclass is parameterized by a subset V ⊆ {−1, +1}^d of the hypercube, chosen as follows. Recalling that ∆_H denotes the Hamming metric on the space {−1, +1}^d, we choose V to be a d/4-packing of this hypercube. That is, V is a subset of the hypercube such that for all α, β ∈ V, the Hamming distance satisfies ∆_H(α, β) ≥ d/4. By standard arguments [6], we can construct such a packing set V with cardinality |V| ≥ (2/√e)^{d/2}.\nWe then let G_base = {f_i^+, f_i^-, i = 1, . . . , d} denote some base set of 2d functions (to be chosen depending on the problem at hand). 
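The d/4-packing of the hypercube invoked above is guaranteed to exist by the counting argument of [6]; for intuition, a small packing can also be built by brute-force greedy selection. The sketch below is our own illustration (feasible only for small d, since it enumerates all 2^d vertices):

```python
import itertools
import numpy as np

def greedy_hamming_packing(d, min_dist):
    """Greedily select vertices of {-1,+1}^d that are pairwise at Hamming
    distance >= min_dist (brute force over all 2**d vertices)."""
    packing = []
    for vertex in itertools.product((-1, 1), repeat=d):
        v = np.array(vertex)
        if all(int(np.sum(v != u)) >= min_dist for u in packing):
            packing.append(v)
    return packing

# d = 8, so a d/4-packing needs pairwise Hamming distance >= 2.
V = greedy_hamming_packing(8, 2)
```

For d = 8 the theoretical guarantee is |V| ≥ (2/√e)^4 ≈ 2.2; the greedy construction comfortably exceeds this.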
Given the packing set V and some parameter δ ∈ [0, 1/4], we define a larger class (with a total of |V| functions) via G(δ) := {g_α, α ∈ V}, where each function g_α ∈ G(δ) has the form\n\ng_α(x) = (1/d) Σ_{i=1}^d { (1/2 + α_i δ) f_i^+(x) + (1/2 − α_i δ) f_i^-(x) }.   (8)\n\nIn our proofs, the subclasses G_base and G(δ) are chosen such that G(δ) ⊆ F, the functions f_i^+, f_i^- are bounded over the convex set S with a Lipschitz constant independent of the dimension d, and the minimizers x*_β of g_β over R^d are contained in S for all β ∈ V. We demonstrate specific choices in the proofs of Theorems 1 and 2.\n\nStep II: Optimizing well is equivalent to function identification. In this step, we show that if a method can optimize over the subclass G(δ) up to a certain tolerance ψ(G(δ)), then it must be capable of identifying which function g_α ∈ G(δ) was chosen. We first require a measure of the closeness of functions in terms of their behavior near each others' minima. Recall that we use x*_f ∈ R^d to denote a minimizing point of the function f. 
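To make the construction (8) from Step I concrete, one can instantiate it with the absolute-value base functions f_i^+(x) = |x(i) + 1/2| and f_i^-(x) = |x(i) − 1/2| that the proof of Theorem 1 later uses; the code below is our own numerical illustration of that choice.

```python
import numpy as np

def g_alpha(x, alpha, delta):
    """Equation (8) with f_i^+(x) = |x_i + 1/2| and f_i^-(x) = |x_i - 1/2|."""
    f_plus = np.abs(x + 0.5)
    f_minus = np.abs(x - 0.5)
    return np.mean((0.5 + alpha * delta) * f_plus + (0.5 - alpha * delta) * f_minus)

# Over the l_inf ball of radius 1/2, g_alpha is minimized at x = -alpha/2,
# where it attains the value 1/2 - delta.
alpha = np.array([1, -1, 1, 1])
delta = 0.1
x_star = -alpha / 2.0
```

Flipping a single coordinate of α moves the minimizer by a full side of the cube, which is what makes optimizing g_α well equivalent to identifying α.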
Given a convex set S ⊆ R^d and two functions f, g, we define\n\nρ(f, g) := inf_{x∈S} [ f(x) + g(x) − f(x*_f) − g(x*_g) ].   (9)\n\nThe discrepancy measure is non-negative, symmetric in its arguments,1 and satisfies ρ(f, g) = 0 if and only if x*_f = x*_g, so that we may refer to it as a semimetric.\nGiven the subclass G(δ), we quantify how densely it is packed with respect to the semimetric ρ using the quantity\n\nψ(G(δ)) := min_{α≠β∈V} ρ(g_α, g_β),   (10)\n\nwhich we also denote by ψ(δ) when the class G is clear from the context. We now state a simple result that demonstrates the utility of maintaining a separation under ρ among functions in G(δ). Note that x*_α denotes a minimizing argument of the function g_α.\nLemma 1. For any x̃ ∈ S, there can be at most one function g_α ∈ G(δ) for which\n\ng_α(x̃) − g_α(x*_α) ≤ ψ(δ)/3.\n\nThus, if we have an element x̃ that approximately minimizes (meaning up to tolerance ψ(δ)/3) one function in the set G(δ), then it cannot approximately minimize any other function in the set.\nProof. 
For a given(cid:101)x \u2208 S, suppose that there exists an \u03b1 \u2208 V such that g\u03b1((cid:101)x) \u2212 g\u03b1(x\u2217\n\u03b1) \u2264 \u03c8(\u03b4)\n3 .\n\u03c8(\u03b4) \u2264 g\u03b1((cid:101)x) \u2212 g\u03b1(x\u2217\n\u03b2) \u2264 \u03c8(\u03b4)/3 + g\u03b2((cid:101)x) \u2212 g\u03b2(x\u2217\nwhich implies that g\u03b2((cid:101)x) \u2212 g\u03b2(x\u2217\n\n\u03b1) + g\u03b2((cid:101)x) \u2212 g\u03b2(x\u2217\n\u03b2) \u2265 2\u03c8(\u03b4)/3, from which the claim follows.\n\nfunction in the set G(\u03b4), then it cannot approximately minimize any other function in the set.\n\nFrom the de\ufb01nition of \u03c8(\u03b4) in (10), for any \u03b2 \u2208 V, \u03b2 (cid:54)= \u03b1, we have\n\n\u03b2),\n\nSuppose that we choose some function g\u03b1\u2217 \u2208 G(\u03b4), and some method MT is allowed to make T\nqueries to an oracle with information function \u03c6(\u00b7, g\u03b1\u2217). Our next lemma shows that in this set-up,\nif the method MT can optimize well over the class G(\u03b4), then it must be capable of determining the\ntrue function g\u03b1\u2217. Recall the de\ufb01nition (2) of the minimax error in optimization:\nLemma 2. Suppose that some method MT has minimax optimization error upper bounded as\n\nE(cid:2)\u0001\u2217(MT ,G(\u03b4), S, \u03c6)(cid:3) \u2264 \u03c8(\u03b4)\n\n(11)\n\n9 .\n\nP\u03c6[(cid:98)\u03b1(MT ) (cid:54)= \u03b1\u2217] \u2264 1\nThen the method MT can construct an estimator(cid:98)\u03b1(MT ) such that max\nProof. Given a method MT that satis\ufb01es the bound (11), we construct an estimator(cid:98)\u03b1(MT ) of the\nset(cid:98)\u03b1(MT ) equal to \u03b1. If no such \u03b1 exists, then we choose(cid:98)\u03b1(MT ) uniformly at random from V.\n(cid:2)\u0001\u2217(MT , g\u03b1\u2217 , S, \u03c6) \u2265 \u03c8(\u03b4)/3(cid:3) \u2264 1\nusing Markov\u2019s inequality, we have P\u03c6[(cid:98)\u03b1(MT ) (cid:54)= \u03b1\u2217] \u2264 P\u03c6\n\ntrue vertex \u03b1\u2217 as follows. 
If there exists some \u03b1 \u2208 V such that g\u03b1(xT ) \u2212 g\u03b1(x\u03b1) \u2264 \u03c8(\u03b4)\nthen we\nFrom Lemma 1, there can exist only one such \u03b1 \u2208 V that satis\ufb01es this inequality. Consequently,\n3.\nMaximizing over \u03b1\u2217 completes the proof.\nWe have thus shown that having a low minimax optimization error over G(\u03b4) implies that the vertex\n\u03b1 \u2208 V can be identi\ufb01ed.\n\n\u03b1\u2217\u2208V\n\n3 .\n\n3\n\n1However, it fails to satisfy the triangle inequality and so is not a metric.\n\n6\n\n\fStep III: Oracle answers and coin tosses. We now demonstrate a stochastic \ufb01rst order oracle \u03c6\nfor which the samples {\u03c6(x1, g\u03b1), . . . , \u03c6(xT , g\u03b1)} can be related to coin tosses. In particular, we\nassociate a coin with each dimension i \u2208 {1, 2, . . . , d}, and consider the set of coin bias vectors\nlying in the set\n\n\u0398(\u03b4) =(cid:8)(1/2 + \u03b11\u03b4, . . . , 1/2 + \u03b1d\u03b4) | \u03b1 \u2208 V(cid:9),\n\n(12)\nGiven a particular function g\u03b1 \u2208 G(\u03b4) (or equivalently, vertex \u03b1 \u2208 V), we consider the oracle \u03c6 that\npresents noisy value and gradient samples from g\u03b1 according to the following prescription:\n\u2022 Pick an index it \u2208 {1, . . . 
, d} uniformly at random.\n\u2022 Draw bit \u2208 {0, 1} according to a Bernoulli distribution with parameter 1/2 + \u03b1it\u03b4.\n\u2022 Return the value and sub-gradient of the function\n\n(cid:98)g\u03b1(x) = bitf +\n\nit\n\n(x) + (1 \u2212 bit)f\u2212\n\nit\n\n(x).\n\nBy construction, the function value and gradient samples are unbiased estimates of those of g\u03b1;\nmoreover, the variance of the effective \u201cnoise\u201d is bounded independently of d as long as the Lipschitz\nconstant is independent of d since the function values and gradients are bounded on S.\n\nStep IV: Lower bounds on coin-tossing Finally, we use information-theoretic methods to lower\nbound the probability of correctly estimating the true vertex \u03b1\u2217 \u2208 V in our model.\nLemma 3. Given an arbitrary vertex \u03b1\u2217 \u2208 V, suppose that we toss a set of d coins with bias\n\u03b8\u2217 = ( 1\n2\u03b4) a total of T times, but that the outcome of only one coin chosen\n\nuniformly at random is revealed at every round. Then for all \u03b4 \u2264 1/4, any estimator(cid:98)\u03b1 satis\ufb01es\n\n2 + \u03b1\u2217\n\n2 + \u03b1\u2217\n\n1\u03b4, . . . , 1\n\n(cid:26)\nP[(cid:98)\u03b1 (cid:54)= \u03b1\u2217] \u2265\n\ninf(cid:98)\u03b1\n\nmax\n\u03b1\u2217\u2208V\n\n1 \u2212 16T \u03b42 + log 2\ne)\n\n2 log(2/\n\n\u221a\n\nd\n\n(cid:27)\n\n.\n\n\u03b8 . Note that P\u03b8(i, b) = 1\n\nProof. Denote the Bernoulli distribution for the i-th coin by P\u03b8i. Let Yt \u2208 {1, . . . , d} be the variable\nindicating the coin revealed at time T , and let Xt \u2208 {0, 1} denote its outcome. With some abuse of\nnotation, we also denote the distribution of (Xt, Yt) by P\u03b8, and that of the entire data {(Xt, Yt)}T\nd P\u03b8i(b). We now apply a version of Fano\u2019s lemma [11] to the set of\nby P T\ndistributions P T\n\n(cid:19)\n\u03b8 for \u03b8 \u2208 \u0398(\u03b4). 
In particular, using the proof of Lemma 3 in [11] we get:\n1 \u2212 b + log 2\n\u03b8 ||P T\nlog |\u0398|\n\n\u03b8(cid:48) ) \u2264 b, \u2200\u03b8, \u03b8(cid:48) \u2208 \u0398(\u03b4) \u21d2 inf(cid:98)\u03b8\n\nmax\n\u03b8\u2208\u0398(\u03b4)\n\n.\n\n(13)\n\nKL(P T\n\n(cid:18)\n\nt=1\n\n\u03b8(cid:48) ) =\n\n\u03b8 ||P T\n\nb = KL(P T\n\nIn our case, we upper bound b as follows:\n\nKL(P\u03b8(Xt, Yt)||P\u03b8(cid:48)(Xt, Yt)) =\n\nT(cid:88)\nEach term KL(P\u03b8i(Xt)||P\u03b8\nwith parameters 1/2 + \u03b4 and 1/2 \u2212 \u03b4. A little calculation shows that\n\u2264 8\u03b42\n1 \u2212 2\u03b4\n\ng(\u03b4) = 2\u03b4 log\n\n4\u03b4\n1 \u2212 2\u03b4\n\n(cid:19)\n\n(cid:18)\n\n1 +\n\n1\nd\n\nt=1\n\n(cid:48)\ni\n\n,\n\nKL(P\u03b8i(Xt)||P\u03b8\n\n(Xt)).\n\n(cid:48)\ni\n\n(Xt)) is at most the KL divergence g(\u03b4) between Bernoulli variates\n\nP\u03b8[(cid:98)\u03b8 (cid:54)= \u03b8] \u2265\nT(cid:88)\nd(cid:88)\n\nt=1\n\ni=1\n\nwhich is less than 16\u03b42 as long as \u03b4 \u2264 1/4. Consequently, we conclude that b \u2264 16T \u03b42. Also, we\n\nnote that P[(cid:98)\u03b1 (cid:54)= \u03b1\u2217] = P\u03b8[(cid:98)\u03b8 (cid:54)= \u03b8\u2217]. Substituting these values and the size of V into (13) yields the\n\nclaim.\n\n4.2 Proofs of main results\n\nWe are now in a position to prove our main theorems.\n\n7\n\n\fProof of Theorem 1: By the construction of our oracle, it is clear that, at each round, only one\n\ncoin is revealed to the method MT . Thus Lemma 3 applies to the estimator(cid:98)\u03b1(MT ):\nIn order to obtain an upper bound on P[(cid:98)\u03b1(MT ) (cid:54)= \u03b1] using Lemma 2, we need to identify the\n\n16T \u03b42 + log 2\n\u221a\nd log(2/\ne)\n\n1 \u2212 2\n\nsubclass Gbase of F C. For i = 1, . . . 
, d, de\ufb01ne:\n\nP[(cid:98)\u03b1(MT ) (cid:54)= \u03b1] \u2265\ni (x) :=(cid:12)(cid:12)x(i) + 1/2(cid:12)(cid:12),\n\ni (x) :=(cid:12)(cid:12)x(i) \u2212 1/2(cid:12)(cid:12).\n\n(cid:19)\n\n(cid:18)\n\nf\u2212\n\n(14)\n\nand\n\nf +\n\n.\n\ni , f\u2212\n\nWe take S to be the (cid:96)\u221e ball of radius 1/2. It is clear then that the minimizers of g\u03b1 are contained in\ni are bounded in [0, 1] and 1-Lipschitz in the \u221e-norm, giving the same\nS. Also, the functions f +\nd \u2206H(\u03b1, \u03b2) \u2265 \u03b4\n2 for \u03b1 (cid:54)= \u03b2 \u2208 V.\nproperties for each function g\u03b1. Finally, we note that \u03c1(g\u03b1, g\u03b2) = 2\u03b4\nSetting \u0001 = \u03b4/18 < 1/2, we obtain \u0001\u2217(F C, \u03c6) \u2264 \u0001 = \u03b4\n18 = \u03c8(\u03b4)\n9 . Then by Lemma 2, we have\n\n3 \u2265(cid:0)1 \u2212 2 16T \u03b42+log 2\n(cid:1).\n(cid:1) for all d \u2265 11. Combining this with Theorem 5.3.1 of\n\nP\u03c6[(cid:98)\u03b1(MT ) (cid:54)= \u03b1] \u2264 1\nSubstituting \u03b4 = 18\u0001 yields T = \u2126(cid:0) d\n(cid:1) for all d.\nNY [8] gives T = \u2126(cid:0) d\n\n3 which, when combined with equation (14), yields 1\n\nd log(2/\n\n\u221a\n\n\u00012\n\ne)\n\nTo prove Corollary 1, we note that the proof of Theorem 1 required r\u221e \u2265 1\n2. If not, it is easy to see\nthat the computation of \u03c1 on G(\u03b4) scales by r\u221e. Further, if the set is contained in a ball of radius\nR\u221e, then we need to scale the function with 1\nR\u221e to keep the function values bounded. Taking both\nthese dependences into account gives the desired result.\n\n\u00012\n\nProof of Theorem 2:\n\nIn this case, we de\ufb01ne the base class\n\ni (x) =(cid:0)x(i) + 1/2(cid:1)2\n\nf +\n\n,\n\ni (x) =(cid:0)x(i) \u2212 1/2(cid:1)2\n\nand f\u2212\n\n,\n\nfor i = 1, . . . 
, d.\n\nThen the functions g\u03b1 are strongly convex w.r.t.\nSome calculation shows that \u03c1(g\u03b1, g\u03b2) = 2\u03b42\nis identical to Theorem 1.\nThe reader might suspect that the dimension dependence in our lower bound for strongly convex\nfunctions is not tight, due to the dependence of \u03ba on the dimension d. However, this is the largest\npossible value of \u03ba under the assumptions of the theorem.\n\nthe Euclidean norm with coef\ufb01cient \u03ba = 1/d.\nd \u2206H(\u03b1, \u03b2) for all \u03b1 (cid:54)= \u03b2. The remainder of the proof\n\n4.3 A general result\n\nArmed with the greater understanding from these proofs, we can now state a general result for any\nfunction class F. The proof is similar to that of earlier results.\nTheorem 3. For any function class F \u2286 F C, suppose a given base set of functions Gbase yields the\nmeasure \u03c8 as de\ufb01ned in (10). Then there exists a universal constant c such that sup\u03c6\u2208O \u0001\u2217(FS , \u03c6) \u2265\n\nc \u03c8(cid:0)(cid:113) d\n\nT\n\n(cid:1).\n\nAcknowledgements We gratefully acknowledge the support of the NSF under award DMS-0830410\nand of DARPA under award HR0011-08-2-0002. Alekh is supported in part by MSR PhD Fellow-\nship.\n\nReferences\n[1] D.P. Bertsekas. Nonlinear programming. Athena Scienti\ufb01c, Belmont, MA, 1995.\n[2] L. Birg\u00b4e. Approximation dans les espaces metriques et theorie de l\u2019estimation. Z. Wahrsch.\n\nverw. Gebiete, 65:181\u2013327, 1983.\n\n[3] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In NIPS. 2008.\n[4] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, Cambridge,\n\nUK, 2004.\n\n8\n\n\f[5] R. Z. Has\u2019minskii. A lower bound on the risks of nonparametric estimates of densities in the\n\nuniform metric. Theory Prob. Appl., 23:794\u2013798, 1978.\n\n[6] J. Matousek. Lectures on discrete geometry. Springer-Verlag, New York, 2002.\n[7] A. S. Nemirovski. 
Ef\ufb01cient methods in convex programming. Lecture notes.\n[8] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Ef\ufb01ciency in Optimiza-\n\ntion. John Wiley UK/USA, 1983.\n\n[9] S. Shalev-Shwartz and N. Srebro. SVM optimization: inverse dependence on training set size.\n\nIn ICML, 2008.\n\n[10] Nesterov Y. Introductory lectures on convex optimization: Basic course. Kluwer Academic\n\nPublishers, 2004.\n\n[11] B. Yu. Assouad, Fano and Le Cam. In Festschrift in Honor of L. Le Cam on his 70th Birthday.\n\nSpringer-Verlag, 1993.\n\n9\n\n\f", "award": [], "sourceid": 1005, "authors": [{"given_name": "Alekh", "family_name": "Agarwal", "institution": null}, {"given_name": "Martin", "family_name": "Wainwright", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}, {"given_name": "Pradeep", "family_name": "Ravikumar", "institution": null}]}