{"title": "Algorithms and matching lower bounds for approximately-convex optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 4745, "page_last": 4753, "abstract": "In recent years, a rapidly increasing number of applications in practice requires solving non-convex objectives, like training neural networks, learning graphical models, maximum likelihood estimation etc. Though simple heuristics such as gradient descent with very few modifications tend to work well, theoretical understanding is very weak. We consider possibly the most natural class of non-convex functions where one could hope to obtain provable guarantees: functions that are ``approximately convex'', i.e. functions $\\tf: \\Real^d \\to \\Real$ for which there exists a \\emph{convex function} $f$ such that for all $x$, $|\\tf(x) - f(x)| \\le \\errnoise$ for a fixed value $\\errnoise$. We then want to minimize $\\tf$, i.e. output a point $\\tx$ such that $\\tf(\\tx) \\le \\min_{x} \\tf(x) + \\err$. It is quite natural to conjecture that for fixed $\\err$, the problem gets harder for larger $\\errnoise$, however, the exact dependency of $\\err$ and $\\errnoise$ is not known. In this paper, we strengthen the known \\emph{information theoretic} lower bounds on the trade-off between $\\err$ and $\\errnoise$ substantially, and exhibit an algorithm that matches these lower bounds for a large class of convex bodies.", "full_text": "Algorithms and matching lower bounds for\n\napproximately-convex optimization\n\nYuanzhi Li\n\nPrinceton University\nPrinceton, NJ, 08450\n\nAndrej Risteski\n\nPrinceton University\nPrinceton, NJ, 08450\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nyuanzhil@cs.princeton.edu\n\nristeski@cs.princeton.edu\n\nAbstract\n\nIn recent years, a rapidly increasing number of applications in practice requires\noptimizing non-convex objectives, like training neural networks, learning graphical\nmodels, maximum likelihood estimation. 
Though simple heuristics such as gradient descent with very few modifications tend to work well, theoretical understanding is very weak.\nWe consider possibly the most natural class of non-convex functions where one could hope to obtain provable guarantees: functions that are \u201capproximately convex\u201d, i.e. functions \u02dcf : R^d \u2192 R for which there exists a convex function f such that for all x, |\u02dcf(x) \u2212 f(x)| \u2264 \u2206 for a fixed value \u2206. We then want to minimize \u02dcf, i.e. output a point \u02dcx such that \u02dcf(\u02dcx) \u2264 min_x \u02dcf(x) + \u03b5.\nIt is quite natural to conjecture that for fixed \u03b5, the problem gets harder for larger \u2206; however, the exact dependency of \u03b5 and \u2206 is not known. In this paper, we significantly improve the known lower bound on \u2206 as a function of \u03b5 and give an algorithm matching this lower bound for a natural class of convex bodies. More precisely, we identify a function T : R+ \u2192 R+ such that when \u2206 = O(T(\u03b5)), we can give an algorithm that outputs a point \u02dcx such that \u02dcf(\u02dcx) \u2264 min_x \u02dcf(x) + \u03b5 within time poly(d, 1/\u03b5). On the other hand, when \u2206 = \u2126(T(\u03b5)), we also prove an information-theoretic lower bound: any algorithm that outputs such an \u02dcx must use a super-polynomial number of evaluations of \u02dcf.\n\n1 Introduction\n\nOptimization of convex functions over a convex domain is a well-studied problem in machine learning, where a variety of algorithms exist to solve the problem efficiently. However, in recent years, practitioners face ever more often non-convex objectives \u2013 e.g. training neural networks, learning graphical models, clustering data, maximum likelihood estimation etc. Although simple heuristics such as gradient descent with few modifications usually work very well, theoretical understanding in these settings is still largely open.\nThe most natural class of non-convex functions where one could hope to obtain provable guarantees is functions that are \u201capproximately convex\u201d: functions \u02dcf : R^d \u2192 R for which there exists a convex function f such that for all x, |\u02dcf(x) \u2212 f(x)| \u2264 \u2206 for a fixed value \u2206. In this paper, we focus on zero-order optimization of \u02dcf: an algorithm that outputs a point \u02dcx such that \u02dcf(\u02dcx) \u2264 min_x \u02dcf(x) + \u03b5, where the algorithm in the course of its execution is allowed to pick points x \u2208 R^d and query the value of \u02dcf(x).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTrivially, one can solve the problem by constructing an \u03b5-net and searching through all the net points. However, such an algorithm requires \u2126((1/\u03b5)^d) evaluations of \u02dcf, which is highly inefficient in high dimension. In this paper, we are interested in efficient algorithms: algorithms that run in time poly(d, 1/\u03b5) (in particular, this implies the algorithm makes poly(d, 1/\u03b5) evaluations of \u02dcf).\nOne extreme case of the problem is \u2206 = 0, which is just standard convex optimization, where algorithms exist to solve it in polynomial time for every \u03b5 > 0. However, even when \u2206 is any quantity > 0, none of these algorithms extend without modification. (Indeed, we are not imposing any structure on \u02dcf \u2212 f, like stochasticity.) Of course, when \u2206 = +\u221e, the problem includes arbitrary non-convex optimization, where we cannot hope for an efficient solution for any finite \u03b5. 
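A minimal numerical sketch of the brute-force \u03b5-net baseline mentioned above (the function and helper names here are ours, for illustration only): enumerating a grid of spacing \u03b5 over the unit cube already costs roughly (1/\u03b5)^d oracle queries, which is why efficient algorithms are the interesting regime.

```python
import itertools

def net_search_minimize(f_tilde, d, eps):
    # Brute-force baseline: evaluate f_tilde on an eps-grid of [0, 1]^d.
    # int(1/eps) + 1 points per coordinate, i.e. ~(1/eps)^d queries in total,
    # which is why this approach is hopeless in high dimension.
    ticks = [i * eps for i in range(int(round(1 / eps)) + 1)]
    best_x, best_val, queries = None, float('inf'), 0
    for x in itertools.product(ticks, repeat=d):
        queries += 1
        v = f_tilde(x)
        if v < best_val:
            best_x, best_val = x, v
    return best_x, best_val, queries

# A convex quadratic plus a tiny bounded perturbation (playing the role of Delta).
def f_tilde(x):
    noise = 1e-4 * ((hash(x) % 3) - 1)
    return sum((xi - 0.5) ** 2 for xi in x) + noise

x_best, v_best, queries = net_search_minimize(f_tilde, d=3, eps=0.25)
print(queries)  # 5 ** 3 = 125 evaluations already for d = 3, eps = 0.25
```

Even this toy instance makes 125 queries; the count scales as (1/\u03b5)^d, so for d in the hundreds the baseline is unusable.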
Therefore, the crucial quantity to study is the optimal tradeoff of \u03b5 and \u2206: for which \u03b5, \u2206 the problem can be solved in polynomial time, and for which it cannot.\nIn this paper, we study the rate of \u2206 as a function of \u03b5: we identify a function T : R+ \u2192 R+ such that when \u2206 = O(T(\u03b5)), we can give an algorithm that outputs a point \u02dcx such that \u02dcf(\u02dcx) \u2264 min_x \u02dcf(x) + \u03b5 within time poly(d, 1/\u03b5) over a natural class of well-conditioned convex bodies. On the other hand, when \u2206 = \u02dc\u2126(T(\u03b5)) (the \u02dc\u2126 notation hides polylog(d/\u03b5) factors), we also prove an information-theoretic lower bound: any algorithm that outputs such an \u02dcx must use a super-polynomial number of evaluations of \u02dcf. Our results can be summarized as the following two theorems.\n\nTheorem (Algorithmic upper bound, informal). There exists an algorithm A such that for any function \u02dcf over a well-conditioned convex set in R^d of diameter 1 which is \u2206-close to a 1-Lipschitz convex function f, with\n\n\u2206 = O(max{\u03b5^2/\u221ad, \u03b5/d}),\n\nA finds a point \u02dcx such that \u02dcf(\u02dcx) \u2264 min_x \u02dcf(x) + \u03b5 within time poly(d, 1/\u03b5). (The assumptions on the diameter of K and the Lipschitz condition are for convenience of stating the results; see Section 6 for the extension to arbitrary diameter and Lipschitz constant.)\n\nThe notion of well-conditioning will be formally defined in Section 3, but it intuitively captures the notion that the convex body \u201ccurves\u201d in all directions to a good extent.\n\nTheorem (Information theoretic lower bound, informal). For every algorithm A and all d, \u2206, \u03b5 with\n\n\u2206 = \u02dc\u2126(max{\u03b5^2/\u221ad, \u03b5/d}),\n\nthere exists a function \u02dcf on a convex set in R^d of diameter 1, with \u02dcf being \u2206-close to a 1-Lipschitz convex function f, such that A cannot find a point \u02dcx with \u02dcf(\u02dcx) \u2264 min_x \u02dcf(x) + \u03b5 in poly(d, 1/\u03b5) evaluations of \u02dcf.\n\n2 Prior work\n\nTo the best of our knowledge, there are three prior works on the problem of approximately convex optimization, which we summarize briefly below.\nOn the algorithmic side, the classical paper by [DKS14] considered optimizing smooth convex functions over convex bodies with smooth boundaries. More precisely, they assume a bound on both the gradient and the Hessian of f. Furthermore, they assume that for every small ball centered at a point in the body, a large proportion of the volume of the ball lies in the body. Their algorithm is local search: they show that for a sufficiently small r, in a ball of radius r there is with high probability a point which has a smaller value than the current one, as long as the current value is sufficiently larger than the optimum. For constant-smooth functions only, their algorithm applies when \u2206 = O(\u03b5/\u221ad).\nAlso on the algorithmic side, the work by [BLNR15] considers 1-Lipschitz functions, but their algorithm only applies to the case where \u2206 = O(\u03b5/d) (so not optimal unless \u03b5 = O(1/\u221ad)). Their methods rely on sampling log-concave distributions via hit-and-run walks. The crucial idea is to show that for approximately convex functions, one needs to sample from \u201capproximately log-concave\u201d distributions, which they show can be done by a form of rejection sampling together with classical methods for sampling log-concave distributions.\nFinally, [SV15] consider information-theoretic lower bounds. They show that when \u2206 = 1/d^{1/2\u2212\u03b4}, no algorithm can, in polynomial time, achieve \u03b5 = 1/2 \u2212 \u03b4 when optimizing a convex function over the hypercube. This translates to a super-polynomial information-theoretic lower bound when \u2206 = \u2126(\u03b5/\u221ad). They additionally give lower bounds when the approximately convex function is multiplicatively, rather than additively, close to a convex function. (These are not too difficult to derive from the additive ones, considering the convex body has diameter bounded by 1.)\nWe also note that a related problem is zero-order optimization, where the goal is to minimize a function to which we only have value-oracle access. The algorithmic motivations here come from various applications where we only have black-box access to the function we are optimizing, and there is a classical line of work on characterizing the oracle complexity of convex optimization [NY83, NS, DJWW15]. In all of these settings, however, the oracles are either noiseless, or the noise is stochastic, usually because the target application is in bandit optimization [AD10, AFH+11, Sha12].\n\n3 Overview of results\n\nFormally, we will consider the following scenario.\nDefinition 3.1. A function \u02dcf : K \u2192 R will be called \u2206-approximately convex if there exists a 1-Lipschitz convex function f : K \u2192 R, s.t. \u2200x \u2208 K, |\u02dcf(x) \u2212 f(x)| \u2264 \u2206.\nFor ease of exposition, we also assume that K has diameter 1. (Since we normalize f to be 1-Lipschitz and K to have diameter 1, the problem is only interesting for \u03b5 \u2264 1; generalizing to arbitrary Lipschitz constants and diameters is discussed in Section 6.) We consider the problem of optimizing \u02dcf; more precisely, we are interested in finding a point \u02dcx \u2208 K such that\n\n\u02dcf(\u02dcx) \u2264 min_{x \u2208 K} \u02dcf(x) + \u03b5\n\nWe give the following results:\nTheorem 3.1 (Information theoretic lower bound). 
For every constant c \u2265 1, there exists a constant d_c such that for every algorithm A and every d \u2265 d_c, there exist a convex set K \u2286 R^d with diameter 1, a \u2206-approximately convex function \u02dcf : K \u2192 R and an \u03b5 \u2208 [0, 1/64) with\n\n\u2206 \u2265 max{\u03b5^2/\u221ad, \u03b5/d} \u00d7 (13c log(d/\u03b5))^2\n\nsuch that A fails to output, with probability \u2265 1/2, a point \u02dcx \u2208 K with \u02dcf(\u02dcx) \u2264 min_{x \u2208 K}{\u02dcf(x)} + \u03b5 in o((d/\u03b5)^c) time.\n\nIn order to state the upper bounds, we will need the definition of a well-conditioned body.\nDefinition 3.2 (\u00b5-well-conditioned). A convex body K is said to be \u00b5-well-conditioned for \u00b5 \u2265 1 if there exists a function F : R^d \u2192 R such that K = {x | F(x) \u2264 0} and for every x \u2208 \u2202K:\n\n\u2016\u2207^2 F(x)\u2016_2 / \u2016\u2207F(x)\u2016_2 \u2264 \u00b5\n\nThis notion of well-conditioning of a convex body has, to the best of our knowledge, not been defined before, but it intuitively captures the notion that the convex body should \u201ccurve\u201d in all directions to a certain extent. In particular, the unit ball has \u00b5 = 1.\nTheorem 3.2 (Algorithmic upper bound). Let d be a positive integer and \u03b4, \u03b5, \u2206 positive real numbers such that\n\n\u2206 \u2264 (1/16348) \u00d7 max{(\u03b5^2/\u221ad) \u00d7 (1/\u00b5), \u03b5/d}\n\nThen there exists an algorithm A such that, given any \u2206-approximately convex function \u02dcf over a \u00b5-well-conditioned convex set K \u2286 R^d of diameter 1, A returns, with probability 1 \u2212 \u03b4 and in time poly(d, 1/\u03b5, log(1/\u03b4)), a point \u02dcx \u2208 K such that\n\n\u02dcf(\u02dcx) \u2264 min_{x \u2208 K} \u02dcf(x) + \u03b5\n\nFor the reader wishing to digest a condition-free version of the above result, the following weaker result is also true (and much easier to prove):\nTheorem 3.3 (Algorithmic upper bound (condition-free)). 
Let d be a positive integer and \u03b4, \u03b5, \u2206 positive real numbers such that\n\n\u2206 \u2264 (1/16348) \u00d7 max{\u03b5^2/\u221ad, \u03b5/d}\n\nThen there exists an algorithm A such that, given any \u2206-approximately convex function \u02dcf over a convex set K \u2286 R^d of diameter 1, A returns, with probability 1 \u2212 \u03b4 and in time poly(d, 1/\u03b5, log(1/\u03b4)), a point \u02dcx \u2208 K such that\n\n\u02dcf(\u02dcx) \u2264 min_{x \u2208 S(K, \u2212\u03b5)} \u02dcf(x) + \u03b5\n\nwhere S(K, \u2212\u03b5) = {x \u2208 K | B_\u03b5(x) \u2286 K}.\nThe result merely states that we can output a value that competes with points \u201cwell-inside\u201d the convex body \u2013 around which a ball of radius \u03b5 still lies inside the body.\nThe assumptions on the diameter of K and the Lipschitz condition are for convenience of stating the results. It\u2019s quite easy to extend both the lower and upper bounds to an arbitrary diameter and Lipschitz constant, as we discuss in Section 6.\n\n3.1 Proof techniques\n\nWe briefly outline the proof techniques we use. We proceed with the information-theoretic lower bound first. The idea behind the proof is the following. We will construct a function G(x) and a family of convex functions {f_w(x)} depending on a direction w \u2208 S^d (S^d is the unit sphere in R^d). On one hand, the minimal values of G and f_w are quite different: min_x G(x) \u2265 0, and min_x f_w(x) \u2264 \u22122\u03b5. On the other hand, the approximately convex function \u02dcf_w(x) for f_w(x) we consider will be such that \u02dcf_w(x) = G(x) except in a very small cone around w. Picking w at random, no algorithm with a small number of queries will, with high probability, ever query a point in this cone. Therefore, the algorithm will proceed as if the function is G(x) and fail to optimize \u02dcf_w.\nProceeding to the algorithmic result: since [BLNR15] already shows the existence of an efficient algorithm when \u2206 = O(\u03b5/d), we only need to give an algorithm that solves the problem when \u2206 = \u2126(\u03b5/d) and \u2206 = O(\u03b5^2/\u221ad) (i.e. when \u03b5, \u2206 are large). There are two main ideas for the algorithm. First, we show that the gradient of a smoothed version of \u02dcf_w (in the spirit of [FKM05]) at any point x will be correlated with x\u2217 \u2212 x, where x\u2217 = argmin_{x \u2208 K} \u02dcf_w(x). The above strategy will however require averaging the value of \u02dcf_w along a ball of radius \u03b5, which in many cases will not be contained in K (especially when \u03b5 is large). Therefore, we come up with a way to extend \u02dcf_w outside of K in a manner that maintains the correlation with x\u2217 \u2212 x.\n\n4 Information-theoretic lower bound\n\nIn this section, we present the proof of Theorem 3.1. The idea is to construct a function G(x) and a family of convex functions {f_w(x)} depending on a direction w \u2208 S^d, such that min_x G(x) \u2265 0, min_x f_w(x) \u2264 \u22122\u03b5, and an approximately convex \u02dcf_w(x) for f_w(x) such that \u02dcf_w(x) = G(x) except in a very small \u201ccritical\u201d region depending on w. Picking w at random, we want to argue that the algorithm will with high probability not query the critical region. 
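The probabilistic heart of this argument is concentration of \u27e8x, w\u27e9 for a random direction w: a fixed query x lands in a cone of the form {x : |\u27e8x, w\u27e9| \u2265 \u03b2\u2016x\u2016_2} with probability exponentially small in \u03b2^2 d. A quick Monte Carlo sanity check (our illustration, not part of the proof; the \u03b2 below only mimics the \u221a(c log d / d) scale of the construction):

```python
import math, random

def random_unit_vector(d, rng):
    # Gaussian sampling followed by normalization gives a uniform direction.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cone_hit_probability(d, beta, trials, seed=0):
    # Estimate P[ |<x, w>| >= beta * ||x||_2 ] over random directions w,
    # for a fixed query point x (by symmetry we may take x = e_1).
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        w = random_unit_vector(d, rng)
        if abs(w[0]) >= beta:
            hits += 1
    return hits / trials

d = 500
beta = math.sqrt(2.0 * math.log(d) / d)  # the sqrt(c log d / d) scale
p = cone_hit_probability(d, beta, trials=2000)
print(p)  # tiny: a fixed query almost never lands in the critical cone
```

Since each query misses the cone with overwhelming probability, a union bound over polynomially many queries gives the failure event used in the proof.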
The convex body K used in the lower bound will be arguably the simplest convex body imaginable: a ball centered at the origin.\nWe might hope to prove a lower bound for even a linear function f_w for a start, similarly as in [SV15]. A reasonable candidate construction is the following: we set f_w(x) = \u2212\u03b5\u27e8w, x\u27e9 for a randomly chosen unit vector w, and define \u02dcf(x) = 0 when |\u27e8x, w\u27e9| \u2264 (log d/\u221ad)\u2016x\u2016_2 and \u02dcf(x) = f_w(x) otherwise. (For the proof sketch only, to maintain ease of reading, all the inequalities we state will be correct only up to constants; in the actual proofs we will be completely formal.) Observe that this translates to \u2206 = (log d/\u221ad)\u03b5. It\u2019s a standard concentration-of-measure fact that for \u201cmost\u201d of the points x in the unit ball, |\u27e8x, w\u27e9| \u2264 (log d/\u221ad)\u2016x\u2016_2. This implies that any algorithm that makes a polynomial number of queries to \u02dcf will with high probability see 0 in all of the queries, but clearly min \u02dcf(x) = \u2212\u03b5. However, this idea fails to generalize to the optimal range, as \u2206 = (1/\u221ad)\u03b5 is tight for linear, even smooth, functions. (This follows from the results in [DKS14].)\nIn order to obtain the optimal bound, we need to modify the construction to a non-linear, non-smooth function. We will, in a certain sense, \u201chide\u201d a random linear function inside a non-linear function. For a random unit vector w, we consider two regions inside the unit ball: a core C = B_r(0) for r = max{\u03b5, 1/\u221ad} \u00d7 polylog(d/\u03b5), and a \u201ccritical angle\u201d A = {x | |\u27e8x, w\u27e9| \u2265 (log d/\u221ad)\u2016x\u2016_2}. The convex function f will look like \u2016x\u2016_2^{1+\u03b1} for some \u03b1 > 0 outside C \u222a A, and like \u2212\u03b5\u27e8w, x\u27e9 for x \u2208 C \u222a A. We construct \u02dcf as \u02dcf = f when f(x) is sufficiently large (e.g. |f(x)| > \u2206/2), and \u02dcf = \u2206/2 otherwise. Clearly, such an \u02dcf attains its minimum at the point w, with \u02dcf(w) = \u2212\u03b5. However, since \u02dcf = \u2016x\u2016_2^{1+\u03b1} outside C \u222a A, the algorithm needs to query either A or C \u2229 A^c to detect w. The former happens with exponentially small probability in high dimensions, and for any x \u2208 C \u2229 A^c, |f(x)| = \u03b5|\u27e8w, x\u27e9| \u2264 \u03b5(log d/\u221ad)\u2016x\u2016_2 \u2264 \u03b5(log d/\u221ad)r \u2264 \u2206/2, which implies that \u02dcf(x) = \u2206/2. Therefore, the algorithm will fail with high probability.\nNow, we move on to the details of the construction. We will consider K = B_{1/2}(0): the ball of radius 1/2 in R^d centered at 0. (We pick B_{1/2}(0) instead of the unit ball in order to ensure the diameter is 1.)\n\n4.1 The family {f_w(x)}\n\nBefore delving into the construction we need the following definition:\nDefinition 4.1 (Lower Convex Envelope (LCE)). Given a set S \u2286 R^d and a function F : S \u2192 R, define the lower convex envelope F_LCE = LCE(F) as the function F_LCE : R^d \u2192 R such that for every x \u2208 R^d,\n\nF_LCE(x) = max_{y \u2208 S} {\u27e8x \u2212 y, \u2207F(y)\u27e9 + F(y)}\n\nProposition 4.1. LCE(F) is convex.\nProof. LCE(F) is the pointwise maximum of linear functions, so the claim follows.\nRemark: The LCE of a function F is a function defined over the entire R^d, while the input function F is only defined on a set S (not necessarily a convex set). 
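To make Definition 4.1 concrete, here is a one-dimensional numeric sketch (ours, not from the paper): the envelope is a pointwise max of tangent planes, it reproduces a convex differentiable input on S, and, per Proposition 4.1, it is convex even when the input is not.

```python
import math

def lce(F, dF, S):
    # Lower convex envelope from Definition 4.1: pointwise max of tangent
    # planes of F at the sample points y in S (one-dimensional version).
    return lambda x: max((x - y) * dF(y) + F(y) for y in S)

S = [i / 100.0 for i in range(-200, 201)]  # sample set S = {-2.00, ..., 2.00}

# Convex input F(x) = x^2: the envelope reproduces F on S exactly,
# since the tangent plane at y = x attains the maximum.
F = lambda x: x * x
dF = lambda x: 2.0 * x
F_lce = lce(F, dF, S)
assert all(abs(F_lce(y) - F(y)) < 1e-9 for y in S)

# Non-convex input: Proposition 4.1 still guarantees the envelope is convex
# (it is a max of linear functions); check midpoint convexity numerically.
G = lambda x: x * x + 0.3 * math.sin(10.0 * x)
dG = lambda x: 2.0 * x + 3.0 * math.cos(10.0 * x)
G_lce = lce(G, dG, S)
pts = [i / 10.0 for i in range(-15, 16)]
assert all(G_lce((a + b) / 2.0) <= (G_lce(a) + G_lce(b)) / 2.0 + 1e-9
           for a in pts for b in pts)
print('LCE checks passed')
```

Note that for a non-convex input the envelope need not lie below the input; the paper only applies the LCE to a convex function on a non-convex domain, where it acts as a convex extension.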
When the input function F is convex, LCE(F) can be considered as an extension of F to the entire R^d.\nTo define the family f_w(x), we will need four parameters: a power factor \u03b1 > 0, a shrinking factor \u03b2 > 0, a radius factor \u03b3 > 0, and a vector w \u2208 R^d with \u2016w\u2016_2 = 1/2, which we specify shortly.\nConstruction 4.1. Given w, \u03b1, \u03b2, \u03b3, define the core C = B_\u03b3(0), the critical angle A = {x | |\u27e8x, w\u27e9| \u2265 \u03b2\u2016x\u2016_2}, and let H = K \u2229 C^c \u2229 A^c. Let \u02dch : H \u2192 R be defined as\n\n\u02dch(x) = (1/2)\u2016x\u2016_2^{1+\u03b1}\n\nand define l_w(x) = \u22128\u03b5\u27e8x, w\u27e9. Finally, let f_w : K \u2192 R be\n\nf_w(x) = max{\u02dch_LCE(x), l_w(x)}\n\nwhere \u02dch_LCE = LCE(\u02dch) as in Definition 4.1.\nWe then construct the \u201chard\u201d function \u02dcf_w as the following:\nConstruction 4.2. Consider the function \u02dcf_w : K \u2192 R:\n\n\u02dcf_w(x) = f_w(x) if x \u2208 K \u2229 (C^c \u222a A); \u02dcf_w(x) = max{f_w(x), \u2206/2} otherwise.\n\nConsider the following settings of the parameters \u03b2, \u03b3, \u03b1 (depending on the magnitude of \u03b5):\n\n\u2022 Case 1, 1/\u221ad \u2264 \u03b5 \u2264 1/(log d)^2: \u03b2 = \u221a(c log(d/\u03b5))/\u221ad, \u03b3 = 10c\u03b5(log(d/\u03b5))^{1.5}, \u03b1 = 1/log(1/\u03b3).\n\u2022 Case 2, \u03b5 \u2264 1/\u221ad: \u03b2 = \u221a(c log(d/\u03b5))/\u221ad, \u03b3 = (10c/\u221ad)(log(d/\u03b5))^{3/2}, \u03b1 = 1/log(1/\u03b3).\n\u2022 Case 3, 1/(log d)^2 \u2264 \u03b5 \u2264 1/64: \u03b2 = \u221a(c log d)/\u221ad, \u03b3 = 1/2, \u03b1 = 1.\n\nWe then formalize the proof intuition from the previous section with the following claims. Following the proof outline, we first show that the minimum of f_w is small; in particular, we will show f_w(w) \u2264 \u22122\u03b5.\nLemma 4.1. f_w(w) = \u22122\u03b5\nNext, we show that \u02dcf_w is indeed \u2206-approximately convex, by showing that \u2200x \u2208 K, |f_w \u2212 \u02dcf_w| \u2264 \u2206 and that f_w is 1-Lipschitz and convex.\nProposition 4.2. \u02dcf_w is \u2206-approximately convex.\nNext, we construct G(x), which does not depend on w; we want to show that an algorithm with a small number of queries to \u02dcf_w cannot distinguish f_w from this function.\nConstruction 4.3. Let G : K \u2192 R be defined as:\n\nG(x) = max{((1+\u03b1)/4)\u2016x\u2016_2 \u2212 (\u03b1/4)\u03b3, \u2206/2} if x \u2208 K \u2229 C; G(x) = (1/2)\u2016x\u2016_2^{1+\u03b1} otherwise.\n\nThe following is true:\nLemma 4.2. G(x) \u2265 0 and {x \u2208 K | G(x) \u2260 \u02dcf_w(x)} \u2286 A\nWe show how Theorem 3.1 is implied given these statements.\nProof of Theorem 3.1. With everything prior to this set up, the final claim is somewhat standard. We want to show that no algorithm can, with probability \u2265 1/2, output a point x s.t. \u02dcf_w(x) \u2264 min_x \u02dcf_w(x) + \u03b5. Since we know that \u02dcf_w(x) agrees with G(x) everywhere except in K \u2229 A, and G(x) satisfies min_x G(x) \u2265 min_x \u02dcf_w(x) + \u03b5, we only need to show that with high probability, any polynomial-time algorithm will not query any point in K \u2229 A.\nConsider a (potentially) randomized algorithm A, making random choices R_1, R_2, . . . , R_m. Conditioned on a particular choice of randomness r_1, r_2, . . . , r_m, for a random choice of w, each r_i lies in A with probability at most exp(\u2212c log(d/\u03b5)), by a standard Gaussian tail bound. Union bounding, since m = o((d/\u03b5)^c) for an algorithm that runs in time o((d/\u03b5)^c), the probability that at least one of the queries of A lies in A is at most 1/2.\nBut the claim is true for any choice r_1, r_2, 
. . . , r_m of the randomness; by averaging, the claim then holds when r_1, r_2, . . . , r_m are sampled according to the randomness of the algorithm.\nThe proofs of all of the lemmas above have been omitted due to space constraints, and are included in the appendix in full.\n\n5 Algorithmic upper bound\n\nAs mentioned before, the algorithm in [BLNR15] covers the case when \u2206 = O(\u03b5/d), so we only need to give an algorithm when \u2206 = \u2126(\u03b5/d) and \u2206 = O(\u03b5^2/\u221ad). Our approach will not be making use of simulated annealing, but of a more robust version of gradient descent. The intuition comes from [FKM05], who use estimates of the gradient of a convex function derived from Stokes\u2019 formula:\n\nE_{w \u223c S^d}[(d/r) f(x + rw) w] = \u222b_B \u2207f(x) dx\n\nwhere w \u223c S^d denotes w being a uniform sample from the sphere S^d. Our observation is that this gradient estimation is robust to noise if we instead use \u02dcf on the left-hand side. Crucially, \u201crobust\u201d is not in the sense that it approximates the gradient of f, but in the sense that it preserves the crucial property of the gradient of f we need: \u27e8\u2212\u2207f(x), x\u2217 \u2212 x\u27e9 \u2265 f(x) \u2212 f(x\u2217). In words, this means that if we move x in the direction \u2212\u2207f(x) for a small step, then x will be closer to x\u2217; we will show the property is preserved by \u02dcf when \u2206 \u2264 \u03b5^2/\u221ad. Indeed, we have that:\n\n\u27e8\u2212E_{w \u223c S^d}[(d/r) \u02dcf(x + rw) w], x\u2217 \u2212 x\u27e9 \u2265 \u27e8\u2212E_{w \u223c S^d}[(d/r) f(x + rw) w], x\u2217 \u2212 x\u27e9 \u2212 (d/r)\u2206 E_{w \u223c S^d}[|\u27e8w, x\u2217 \u2212 x\u27e9|] = \u2126(f(x) \u2212 f(x\u2217) \u2212 2r) \u2212 (d/r)\u2206 E_{w \u223c S^d}[|\u27e8w, x\u2217 \u2212 x\u27e9|]\n\nThe usual [FKM05] calculation shows that (d/r)\u2206 E_{w \u223c S^d}[|\u27e8w, x\u2217 \u2212 x\u27e9|] is bounded by O(\u2206\u221ad/r), since E_{w \u223c S^d}[|\u27e8w, x\u2217 \u2212 x\u27e9|] = O(1/\u221ad). Therefore, we want f(x) \u2212 f(x\u2217) \u2212 2r \u2265 \u2206\u221ad/r whenever f(x) \u2212 f(x\u2217) \u2265 \u03b5. Choosing the optimal parameters leads to r = \u03b5/4 and \u2206 \u2264 \u03b5^2/\u221ad.\nThis intuitive calculation basically proves the simple upper bound guarantee (Theorem 3.3). On the other hand, the argument requires sampling from a ball of radius \u2126(\u03b5) around the point x. This is problematic when \u03b5 > 1/\u221ad: many convex bodies (e.g. the simplex, or the L1 ball after rescaling to diameter one) will not contain a ball of radius even 1/\u221ad. The idea is then to make the sampling possible by \u201cextending\u201d \u02dcf outside of K. Namely, we define a new function \u02dcg : R^d \u2192 R such that (\u03a0_K(x) is the projection of x to K, and d(x, K) is the Euclidean distance from x to K)\n\n\u02dcg(x) = \u02dcf(\u03a0_K(x)) + d(x, K)\n\nwhich agrees with \u02dcf for x \u2208 K. \u02dcg(x) will not in general be convex, but we instead directly bound \u27e8E_{w \u223c S^d}[(d/r) \u02dcg(x + rw) w], x \u2212 x\u2217\u27e9 and show that it behaves like \u27e8\u2212\u2207f(x), x\u2217 \u2212 x\u27e9 \u2265 f(x) \u2212 f(x\u2217).\n\nAlgorithm 1 Noisy Convex Optimization\n1: Input: A convex set K \u2282 R^d with diam(K) = 1 and 0 \u2208 K. A \u2206-approximately convex function \u02dcf.\n2: Define: \u02dcg : R^d \u2192 R as \u02dcg(x) = \u02dcf(\u03a0_K(x)) + d(x, K), where \u03a0_K is the projection to K and d(x, K) is the Euclidean distance from x to K.\n3: Initialize: x_1 = 0, r = \u03b5/(128\u00b5), \u03b7 = \u03b5^3/(4194304 d^2), T = 8388608 d^2/\u03b5^4.\n4: for t = 1, 2, . . . , T do\n5:   Let v_t = \u02dcf(x_t).\n6:   Estimate, up to accuracy \u03b5/4194304 in l2 norm (by uniformly random samples of w):\n\n     g_t = E_{w \u223c S^d}[(d/r) \u02dcg(x_t + rw) w]\n\n     where w \u223c S^d means w is a uniform sample from the unit sphere.\n7:   Update x_{t+1} = \u03a0_K(x_t \u2212 \u03b7 g_t).\n8: end for\n9: Output min_{t \u2208 [T]}{v_t}.\n\nThe rest of this section will be dedicated to showing the following main lemma for Algorithm 1.\nLemma 5.1 (Main, algorithm). Suppose \u2206 < \u03b5^2/(16348\u221ad). Then, for every t \u2208 [T], if there exists x\u2217 \u2208 K such that \u02dcf(x\u2217) < \u02dcf(x_t) \u2212 2\u03b5, then\n\n\u27e8\u2212g_t, x\u2217 \u2212 x_t\u27e9 \u2265 \u03b5/64\n\nAssuming this lemma, we can prove Theorem 3.2.\n\nProof of Theorem 3.2. 
We first focus on the number of iterations. For every t \u2265 1, suppose \u02dcf(x\u2217) < \u02dcf(x_t) \u2212 2\u03b5. Then, since \u2016g_t\u2016_2 \u2264 2d/r \u2264 256d/\u03b5, we have:\n\n\u2016x\u2217 \u2212 x_{t+1}\u2016_2^2 \u2264 \u2016x\u2217 \u2212 (x_t \u2212 \u03b7 g_t)\u2016_2^2\n= \u2016x\u2217 \u2212 x_t\u2016_2^2 \u2212 2\u03b7\u27e8x\u2217 \u2212 x_t, \u2212g_t\u27e9 + \u03b7^2 \u2016g_t\u2016_2^2\n\u2264 \u2016x\u2217 \u2212 x_t\u2016_2^2 \u2212 \u03b7\u03b5/32 + \u03b7^2 \u00b7 65536 d^2/\u03b5^2\n= \u2016x\u2217 \u2212 x_t\u2016_2^2 \u2212 \u03b5^4/(268435456 d^2)\n\nSince originally \u2016x\u2217 \u2212 x_1\u2016 \u2264 1, the algorithm ends in poly(d, 1/\u03b5) iterations.\nNow we consider the sample complexity. Since \u2016(d/r) \u02dcg(x_t + rw) w\u2016_2 = O(d/\u03b5), by a standard concentration bound we need poly(d, 1/\u03b5) samples to estimate the expectation up to error \u03b5/2097152 per iteration.\nDue to space constraints, we defer the proof of Lemma 5.1 to the appendix.\n\n6 Discussion and open problems\n\n6.1 Arbitrary Lipschitz constants and diameter\n\nWe assumed throughout the paper that the convex function f is 1-Lipschitz and the convex set K has diameter 1. Our results can be easily extended to arbitrary functions and convex sets through a simple linear transformation. For f with Lipschitz constant \u2016f\u2016_Lip and K with diameter D, and the corresponding approximately convex \u02dcf, define \u02dcg : K/D \u2192 R as \u02dcg(x) = \u02dcf(Dx)/(D\u2016f\u2016_Lip), where K/D is the rescaling of K by a factor of 1/D. This translates to |\u02dcg(x) \u2212 g(x)| \u2264 \u2206/(D\u2016f\u2016_Lip) for g(x) = f(Dx)/(D\u2016f\u2016_Lip). But g(x) = f(Dx)/(D\u2016f\u2016_Lip) is 1-Lipschitz over a set K/D of diameter 1. 
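The reduction just described can be sketched directly (our illustrative code; the helper names are ours): wrap an oracle for \u02dcf into the normalized oracle \u02dcg on the rescaled body, and map any returned minimizer back.

```python
def normalize_problem(f_tilde, D, lip):
    # Reduce minimizing f_tilde over a body of diameter D, whose underlying
    # convex function has Lipschitz constant lip, to the normalized setting:
    # g_tilde lives on K/D (diameter 1) and is Delta/(D*lip)-approximately
    # convex with respect to a 1-Lipschitz convex function.
    g_tilde = lambda y: f_tilde(tuple(D * yi for yi in y)) / (D * lip)
    to_original = lambda y: tuple(D * yi for yi in y)  # map a minimizer back
    return g_tilde, to_original

# Sanity check with a linear f (Lipschitz constant 5) over a body of diameter 10.
f_tilde = lambda x: 5.0 * x[0] + 2.0
g_tilde, back = normalize_problem(f_tilde, D=10.0, lip=5.0)

y = (0.3, -0.2)
assert abs(g_tilde(y) - (y[0] + 0.04)) < 1e-12  # g(y) = y_0 + 0.04: 1-Lipschitz
assert back(y) == (3.0, -2.0)
# An additive error eps on the normalized problem is eps * D * lip on the original.
assert abs(g_tilde(y) * (10.0 * 5.0) - f_tilde(back(y))) < 1e-9
print('reduction consistent')
```

The same change of variables carries the error parameters: an \u03b5-approximate minimizer of \u02dcg corresponds to an (\u03b5 \u00b7 D\u2016f\u2016_Lip)-approximate minimizer of \u02dcf.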
Therefore, for general functions over general convex sets, our result trivially implies that the rate for being able to optimize approximately-convex functions is\n\n\u2206/(D\u2016f\u2016_Lip) = max{(\u03b5/(D\u2016f\u2016_Lip))^2 \u00b7 (1/\u221ad), (\u03b5/(D\u2016f\u2016_Lip)) \u00b7 (1/d)}\n\nwhich simplifies to \u2206 = max{\u03b5^2/(\u221ad \u00b7 D\u2016f\u2016_Lip), \u03b5/d}.\n\n6.2 Body-specific bounds\n\nOur algorithmic result matches the lower bound on well-conditioned bodies. The natural open problem is to resolve the problem for arbitrary bodies. (We do not show it here, but one can prove that the upper/lower bound still holds over the hypercube, and whenever one can find a ball of radius \u03b5 that has most of its mass in the convex body K.)\nAlso note that the lower bound cannot hold for every convex body K in R^d: for example, if K is just a one-dimensional line in R^d, then the threshold should not depend on d at all. But even when the \u201cinherent dimension\u201d of K is d, the result is still body-specific: one can show that for \u02dcf over the simplex in R^d, when \u03b5 \u2265 1/\u221ad, it is possible to optimize \u02dcf in polynomial time even when \u2206 is as large as \u03b5. (Again, we do not show that here, but essentially one can search through the d + 1 lines from the center to the d + 1 corners.)\nFinally, while our algorithm made use of well-conditioning, what the correct property/parameter of the convex body is that governs the rate T(\u03b5) is a tantalizing question to explore in future work.\n\nReferences\n\n[AD10] Alekh Agarwal and Ofer Dekel. Optimal algorithms for online convex optimization with multi-point bandit feedback. 
In COLT, pages 28\u201340. Citeseer, 2010.\n[AFH+11] Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, and Alexander Rakhlin. Stochastic convex optimization with bandit feedback. In Advances in Neural Information Processing Systems, pages 1035\u20131043, 2011.\n[BLNR15] Alexandre Belloni, Tengyuan Liang, Hariharan Narayanan, and Alexander Rakhlin. Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In Proceedings of The 28th Conference on Learning Theory, pages 240\u2013265, 2015.\n[DJWW15] John C. Duchi, Michael I. Jordan, Martin J. Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788\u20132806, 2015.\n[DKS14] Martin Dyer, Ravi Kannan, and Leen Stougie. A simple randomised algorithm for convex optimisation. Mathematical Programming, 147(1-2):207\u2013229, 2014.\n[FKM05] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385\u2013394. Society for Industrial and Applied Mathematics, 2005.\n[NS] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pages 1\u201340.\n[NY83] Arkadii Nemirovskii and David Borisovich Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, Chichester, New York, 1983.\n[Sha12] Ohad Shamir. On the complexity of bandit and derivative-free stochastic convex optimization. arXiv preprint arXiv:1209.2388, 2012.\n[SV15] Yaron Singer and Jan Vondr\u00e1k. Information-theoretic lower bounds for convex optimization with erroneous oracles. 
In Advances in Neural Information Processing Systems, pages 3186\u20133194, 2015.", "award": [], "sourceid": 2412, "authors": [{"given_name": "Andrej", "family_name": "Risteski", "institution": "Princeton University"}, {"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton University"}]}