{"title": "Complexity of Highly Parallel Non-Smooth Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 13900, "page_last": 13909, "abstract": "A landmark result of non-smooth convex optimization is that gradient descent is an optimal algorithm whenever the number of computed gradients is smaller than the dimension $d$. In this paper we study the extension of this result to the parallel optimization setting. Namely we consider optimization algorithms interacting with a highly parallel gradient oracle, that is one that can answer $\\mathrm{poly}(d)$ gradient queries in parallel. We show that in this case gradient descent is optimal only up to $\\tilde{O}(\\sqrt{d})$ rounds of interactions with the oracle. The lower bound improves upon a decades old construction by Nemirovski which proves optimality only up to $d^{1/3}$ rounds (as recently observed by Balkanski and Singer), and the suboptimality of gradient descent after $\\sqrt{d}$ rounds was already observed by Duchi, Bartlett and Wainwright. In the latter regime we propose a new method with improved complexity, which we conjecture to be optimal. The analysis of this new method is based upon a generalized version of the recent results on optimal acceleration for highly smooth convex optimization.", "full_text": "Complexity of Highly Parallel\n\nNon-Smooth Convex Optimization\n\nS\u00e9bastien Bubeck\nMicrosoft Research\n\nsebubeck@microsoft.com\n\nQijia Jiang\n\nStanford University\n\nqjiang2@stanford.edu\n\nYin Tat Lee\n\nUniversity of Washington\n\n& Microsoft Research\n\nyintat@uw.edu\n\nYuanzhi Li\n\nStanford University\n\nyuanzhil@stanford.edu\n\nAaron Sidford\n\nStanford University\n\nsidford@stanford.edu\n\nAbstract\n\nA landmark result of non-smooth convex optimization is that gradient descent is\nan optimal algorithm whenever the number of computed gradients is smaller than\nthe dimension d. 
In this paper we study the extension of this result to the parallel optimization setting. Namely we consider optimization algorithms interacting with a highly parallel gradient oracle, that is one that can answer $\mathrm{poly}(d)$ gradient queries in parallel. We show that in this case gradient descent is optimal only up to $\tilde{O}(\sqrt{d})$ rounds of interactions with the oracle. The lower bound improves upon a decades old construction by Nemirovski which proves optimality only up to $d^{1/3}$ rounds (as recently observed by Balkanski and Singer), and the suboptimality of gradient descent after $\sqrt{d}$ rounds was already observed by Duchi, Bartlett and Wainwright. In the latter regime we propose a new method with improved complexity, which we conjecture to be optimal. The analysis of this new method is based upon a generalized version of the recent results on optimal acceleration for highly smooth convex optimization.

1 Introduction

Much of the research in convex optimization has focused on the oracle model, where an algorithm optimizing some objective function $f : \mathbb{R}^d \to \mathbb{R}$ does so by sequential interaction with, e.g., a gradient oracle (given a query $x \in \mathbb{R}^d$, the oracle returns $\nabla f(x)$), [Nemirovski and Yudin, 1983, Nesterov, 2004, Bubeck, 2015].¹ In the early 1990s, Arkadi Nemirovski introduced the parallel version of this problem [Nemirovski, 1994]: instead of submitting queries one by one sequentially, the algorithm can submit in parallel up to $Q \geq 1$ queries. We refer to the depth of such a parallel algorithm as the number of rounds of interaction with the oracle, and the work as the total number of queries (in particular work $\leq Q \times$ depth). In this paper we study the optimal depth achievable for highly parallel algorithms, namely we consider the regime $Q = \mathrm{poly}(d)$.
We focus on non-smooth convex optimization, that is we want to optimize a Lipschitz, convex function $f$ on the unit Euclidean ball.

¹Throughout we assume that $f$ is differentiable, though our results carry over to the case where $f$ is non-differentiable and given by a sub-gradient oracle. This generalization is immediate as our analysis and algorithms are stable under finite-precision arithmetic and convex functions are almost everywhere differentiable.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our key result is a new form of quadratic acceleration: while for purely sequential methods the critical depth at which one can improve upon local search is $\tilde{O}(d)$, we show that in the highly parallel regime the critical depth is $\tilde{O}(\sqrt{d})$.

1.1 Classical optimality results

Classically, when $Q = 1$, it is known that gradient descent's query complexity is order optimal for any target accuracy $\varepsilon$ in the range $[d^{-1/2}, 1]$. More precisely, it is known that the query complexity of gradient descent is $O(1/\varepsilon^2)$ and that for any $\varepsilon$ in the range $[d^{-1/2}, 1]$, and for any algorithm, there exists a Lipschitz and convex function $f$ on which the number of oracle queries the algorithm makes to achieve additive $\varepsilon$ accuracy is $\Omega(1/\varepsilon^2)$. Furthermore, whenever $\varepsilon$ is smaller than $d^{-1/2}$ there exists a better algorithm (i.e., with smaller depth), namely the center of gravity whose depth is $O(d \log(1/\varepsilon))$. Consequently, an alternative formulation of these results is that, for $Q = 1$, gradient descent is order optimal if and only if the depth is smaller than $\tilde{O}(d)$. (See previously cited references for the exact statements.)

1.2 Optimality for highly parallel algorithms

The main result of this paper is to show that in the highly parallel regime ($Q = \mathrm{poly}(d)$), gradient descent is order optimal if and only if the depth is smaller than $\tilde{O}(\sqrt{d})$.
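To make the local-search baseline of these statements concrete, here is a minimal sketch of the projected subgradient method on the unit Euclidean ball. The step-size rule, function names, and test function below are ours and purely illustrative; they are not taken from the paper.

```python
import numpy as np

def projected_subgradient(subgrad, d, n_steps, R=1.0):
    """Projected subgradient method on the Euclidean ball of radius R.

    For an L-Lipschitz convex f, with step size ~ R/(L*sqrt(n_steps)) the
    averaged iterate is O(L*R/sqrt(n_steps))-suboptimal, i.e. depth 1/eps^2
    in the language of the paper (one gradient query per round).
    """
    x = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(n_steps):
        g = subgrad(x)
        step = R / (np.linalg.norm(g) * np.sqrt(n_steps) + 1e-12)
        x = x - step * g
        nrm = np.linalg.norm(x)
        if nrm > R:                 # project back onto the ball
            x = x * (R / nrm)
        avg += x / n_steps          # average of the iterates
    return avg

# Example: f(x) = max_i x_i is 1-Lipschitz and convex; its minimum on the
# unit ball is -1/sqrt(d), attained at -(1/sqrt(d)) * ones.
d = 50
f = lambda x: np.max(x)
subgrad = lambda x: np.eye(d)[np.argmax(x)]   # a subgradient of f
x = projected_subgradient(subgrad, d, n_steps=4000)
```

After a few thousand rounds the averaged iterate is far below the value $0$ of the starting point, consistent with the $1/\sqrt{N}$ rate discussed above.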
Thus one has a "quadratic" improvement over the purely sequential setting in terms of the critical depth at which naive local search becomes suboptimal.

The "only if" part of the above statement follows from Duchi et al. [2012], where randomized smoothing with accelerated gradient descent was proposed (henceforth referred to as distributed randomized smoothing [Scaman et al., 2018]), and shown to achieve depth $d^{1/4}/\varepsilon$, which is order better than $1/\varepsilon^2$ exactly when the latter is equal to $\sqrt{d}$. A first key contribution of our work is a matching lower bound showing that, when the depth is smaller than $\tilde{O}(\sqrt{d})$, no significant improvement over gradient descent is possible, i.e. $Q = 1$ and $Q = \mathrm{poly}(d)$ have essentially the same power. Importantly we note that our lower bound applies to randomized algorithms. The previous state of the art lower bound was that gradient descent is optimal up to depth $\tilde{O}(d^{1/3})$ [Balkanski and Singer, 2018]. In fact the construction in the latter paper is exactly the same as the original construction of Nemirovski in [Nemirovski, 1994] (however the final statements are different, as Nemirovski was concerned with an $\ell_\infty$ setting instead of $\ell_2$, see also Diakonikolas and Guzmán [2018] for more results about non-Euclidean setups).

A second key contribution of this work is to improve the state of the art complexity of parallel algorithms with depth between $\sqrt{d}$ and $d$. Improving the depth $d^{1/4}/\varepsilon$ of Duchi et al. [2012] was explicitly mentioned as an open problem by Scaman et al. [2018]. Leveraging the recent higher order acceleration schemes of Gasnikov et al. [2018], Jiang et al. [2018], Bubeck et al. [2018], we propose a new method with depth $d^{1/3}/\varepsilon^{2/3}$. This means that for any value of $\varepsilon$ in the range $[d^{-1}, d^{-1/4}]$ there is an algorithm that is order better than both gradient descent and center of gravity.
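The randomized-smoothing step behind the $d^{1/4}/\varepsilon$ upper bound can be sketched numerically: a single round of $Q$ parallel (sub)gradient queries at Gaussian perturbations of $x$ estimates the gradient of the smoothed function, which is then smooth enough for acceleration. This is a sketch under our own illustrative names and parameters, not the authors' implementation.

```python
import numpy as np

def smoothed_grad(grad_f, x, r, Q, rng):
    """One parallel round of Q queries at x + r*xi with xi ~ N(0, I).

    The average is an unbiased estimate of the gradient of the Gaussian
    convolution g(x) = E[f(x + r*xi)], which is (L/r)-smooth whenever f
    is L-Lipschitz -- smooth enough for accelerated gradient descent.
    """
    xi = rng.standard_normal((Q, x.size))
    return np.mean([grad_f(x + r * z) for z in xi], axis=0)

rng = np.random.default_rng(0)
d, r, Q = 20, 0.1, 2000
grad_abs = lambda x: np.sign(x)   # subgradient of the non-smooth f(x) = ||x||_1
x0 = np.ones(d)                   # a point far from the kink at 0
g_hat = smoothed_grad(grad_abs, x0, r, Q, rng)
```

Away from the kink the smoothed gradient essentially agrees with the subgradient, while near the kink it interpolates smoothly; the depth cost is one round regardless of $Q$.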
Moreover we conjecture that the depth $d^{1/3}/\varepsilon^{2/3}$ is in fact optimal for any $\varepsilon$ in this range (as the arguments of Section 2.3 would imply this if a similar argument could be made for $\gamma = \varepsilon$, i.e. a smaller wall radius). We leave this question, the optimality of the center of gravity method for small $\varepsilon < 1/\mathrm{poly}(d)$, and the optimal work among optimal depth algorithms, for future works.

1.3 Related works

Though Nemirovski's prescient work stood alone for decades, more recently the subfield of parallel/distributed optimization is booming, propelled by problems in machine learning, see e.g., [Boyd et al., 2011]. Chief among those problems is how to leverage mini-batches in stochastic gradient descent as efficiently as possible [Dekel et al., 2012]. The literature on this topic is sprawling, see for example [Duchi et al., 2018] which studies the total work achievable in parallel stochastic convex optimization, or [Zhang and Xiao, 2018] where the stochastic assumptions are leveraged to take advantage of second order information. More directly related to our work is [Nemirovski, 1994, Diakonikolas and Guzmán, 2018, Balkanski and Singer, 2018] from the lower bound side (we directly improve upon the result in the latter paper), and [Duchi et al., 2012, Scaman et al., 2018] from the upper bound side (we directly improve upon the depth provided by the algorithms in those works).

2 Lower bound

Fix $\varepsilon > 0$ such that $1/\varepsilon^2 = \tilde{O}(\sqrt{d})$. In this section we construct a random function $f$ such that, for any deterministic algorithm with depth $O(1/\varepsilon^2)$ and total work $\mathrm{poly}(d)$, the output point $x$ is such that $\mathbb{E}[f(x) - f^*] > \varepsilon$, where the expectation is with respect to the random function $f$, and $f^*$ denotes the minimum value of $f$ on the unit centered Euclidean ball. Note that by the minimax theorem, this implies that for any randomized algorithm there exists a deterministic function such that the same conclusion applies.
Formally, we prove the following:

Theorem 1 (Lower Bound) Let $\rho \in (0,1)$ and $C = 12 + 4\log_d(Q/\rho)$. Further, assume that it holds that $\log(N)\, N \sqrt{C \log(d)/d} \leq \frac{1}{4}$ (i.e., $N \lesssim \sqrt{d/\log^3(d)}$). Fix a randomized algorithm that queries at most $Q$ points per iteration (both function value and gradient), and that runs for at most $N$ iterations. Then, with probability at least $1 - \rho$, when run on the shielded Nemirovski function $f$ (see Section 2.3 and Section 2.4) one has for any queried point: $f(x) - f^* \geq \frac{1}{4\sqrt{N}}$.

The details of the proof of this theorem are deferred to Appendix A. In the remainder of this section we instead provide a sketch of its proof. We first recall in Section 2.1 why, for purely sequential algorithms, the above statement holds true, and in fact one can even replace $\sqrt{d}$ by $d$ in this case (this construction goes back to [Yudin and Nemirovski, 1976], see also [Nemirovski and Yudin, 1983]). Next, in Section 2.2 we explain Nemirovski [1994]'s construction, which yields a weaker version of the above statement, with $\sqrt{d}$ replaced by $d^{1/3}$ (as rediscovered by [Balkanski and Singer, 2018]). We then explain in Section 2.3 our key new construction, a type of shielding operation. Finally, we conclude the proof sketch in Section 2.4.

For the rest of the section we let $v_1, \ldots, v_N$ denote $N$ random orthonormal vectors in $\mathbb{R}^d$ (in particular $N \leq d$), and $x^* = -\frac{1}{\sqrt{N}}\sum_{i=1}^N v_i$. We define the Nemirovski function with parameter $\gamma \geq 0$ by:

$$N_\gamma(x) = \max_{i \in [N]} v_i \cdot x - i\gamma.$$

Note that

$$N_\gamma^* \leq N_\gamma(x^*) \leq -\frac{1}{\sqrt{N}}. \qquad (1)$$

2.1 The classical argument

We consider here the Nemirovski function with parameter $\gamma = 0$. Each gradient query reveals a single vector in the collection of the $v_i$, so after $N/2$ iterations one might know say $v_1, \ldots, v_{N/2}$, but the rest remain unknown (or in other words they remain random orthonormal vectors in $\mathrm{span}(v_1, \ldots, v_{N/2})^\perp$). Thus for any output $x$ that depends on only $N/2$ queries, one has $\mathbb{E}[N_0(x)] \geq 0$ (formally this inequality follows from Jensen's inequality and the tower rule). Thus, together with (1), it follows that $\mathbb{E}[N_0(x) - N_0^*] \geq 1/\sqrt{N}$. In other words the best rate of convergence of sequential methods is $1/\sqrt{N}$, provided that $N \leq d$.

2.2 The basic parallel argument

We consider here the Nemirovski function with parameter $\gamma = C\sqrt{\log(d)/d}$ for some large enough constant $C$ (more precisely that constant $C$ depends on the exponent in the $\mathrm{poly}(d)$ number of allowed queries per round). The key observation is as follows: Imagine that the algorithm has already discovered $v_1, \ldots, v_{i-1}$. Then for any set of $\mathrm{poly}(d)$ queries, with high probability with respect to the random draw of $v_i, \ldots, v_N$, one has that the inner product of any of those vectors with any of the queried points is in $[-\gamma/2, \gamma/2]$ (using both basic concentration of measure on the sphere, and a union bound). Thus the maximum in the definition of $N_\gamma$ is attained at some index $\leq i$. This means that this set of $\mathrm{poly}(d)$ queries can only reveal $v_i$, and not any of the $v_j$, $j > i$. Thus after $N - 1$ rounds we know that with high probability any output $x$ satisfies $N_\gamma(x) \geq v_N \cdot x - N\gamma \geq -(N+1)\gamma$ (since $v_N$ is a random direction orthogonal to $\mathrm{span}(v_1, \ldots, v_{N-1})$ and $x$ only depends on $v_1, \ldots, v_{N-1}$). Thus we obtain that the suboptimality gap is at least $\frac{1}{\sqrt{N}} - (N+1)\gamma$. Let us assume that

$$\gamma N^{3/2} \leq \frac{1}{2}, \qquad (2)$$

i.e., $N = \tilde{O}(d^{1/3})$ (since $\gamma = C\sqrt{\log(d)/d}$). Then one has that the best rate of convergence with a highly parallel algorithm is $\Omega(1/\sqrt{N})$ (i.e., the same as with purely sequential methods).

2.3 The wall function

Our new idea to improve upon Nemirovski's construction is to introduce a new random wall function $W_\gamma$ (with parameter $\gamma > 0$), where the randomness comes from $v_1, \ldots, v_N$.
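Before the wall is constructed, the concentration fact driving the argument of Section 2.2 can be checked numerically: for a batch of random unit-norm queries, every inner product $|v_i \cdot x|$ falls well inside the margin $\gamma$, so the index penalty forces the maximum to the first index and the whole batch reveals only $v_1$. The constant 5 below stands in for $C$, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 2000, 10

# N random orthonormal vectors v_1, ..., v_N (QR of a Gaussian matrix).
V = np.linalg.qr(rng.standard_normal((d, N)))[0].T

gamma = 5 * np.sqrt(np.log(d) / d)   # margin of order C * sqrt(log(d)/d)

def nemirovski(x):
    """Value and maximizing index of N_gamma(x) = max_i (v_i . x - i*gamma)."""
    vals = V @ x - gamma * np.arange(1, N + 1)
    i = int(np.argmax(vals))
    return vals[i], i

# One "parallel round" of Q queries: random unit vectors.
Q = 200
X = rng.standard_normal((Q, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Concentration on the sphere: all |v_i . x| are far below gamma, hence the
# penalty -i*gamma pins the argmax at the smallest index for every query.
inner = np.abs(X @ V.T)
```

With $d = 2000$ the typical inner product is of order $1/\sqrt{d} \approx 0.02$, an order of magnitude below the margin, which is exactly why one round of polynomially many queries learns only one new vector.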
Our new random hard function, which we term the shielded-Nemirovski function, is then defined by:

$$f(x) = \max\{N_\lambda(x), W_\gamma(x)\}.$$

We construct the convex function $W_\gamma$ so that one can essentially repeat the argument of Section 2.2 with a smaller value $\lambda$ of the parameter in the Nemirovski function, so that the condition (2) becomes less restrictive and allows us to take $N$ as large as $\tilde{O}(\sqrt{d})$.

Roughly speaking the wall function will satisfy the following properties:

1. The value of $W_\gamma$ at $x^*$ is small, namely $W_\gamma(x^*) \leq -\frac{1}{\sqrt{N}}$.

2. The value of $W_\gamma$ at "most" vectors $x$ with $\|x\| \geq \gamma$ is large, namely $W_\gamma(x) \geq N_\lambda(x)$, and moreover it does not depend on the collection $v_i$ (in fact at most points we will have the simple formula $W_\gamma(x) = \gamma^{-2\alpha}\|x\|^{1+\alpha}$, for some small $\alpha$ that depends on $\gamma$, to be defined later).

The key argument is that, by property 2, one can expect (roughly) that information on the random collection of $v_i$'s can only be obtained by querying points of norm smaller than $\gamma$. This means that one can repeat the argument of Section 2.2 with a smaller value of the Nemirovski parameter, namely $\lambda = \gamma \cdot C\sqrt{\log(d)/d}$. In turn the condition (2) now becomes $N = \tilde{O}\left(d^{1/3}/\gamma^{2/3}\right)$. Due to convexity of $W_\gamma$, there is a tension between property 1 and 2, so that one cannot take $\gamma$ too small. We will show below that it is possible to take $\gamma = \sqrt{N/d}$. In turn this means that the argument proves that $1/\sqrt{N}$ is the best possible rate, up to $N = \tilde{O}(\sqrt{d})$.

The above argument is imprecise because the meaning of "most" in property 2 is unclear. A more precise formulation of the required property is as follows:

2′. Let $x = w + z$ with $w \in V_i$ and $z \in V_i^\perp$ where $V_i = \mathrm{span}(v_1, \ldots, v_i)$. Assume that $\|z\| \geq \gamma$; then the total variation distance between the conditional distribution of $v_{i+1}, \ldots, v_N$ given $\nabla W_\gamma(x)$ (and $W_\gamma(x)$) and the unconditional distribution is polynomially small in $d$ with high probability (here high probability is with respect to the realization of $\nabla W_\gamma(x)$ and $W_\gamma(x)$, see below for an additional comment about such conditional reasoning). Moreover if the argmax in the definition of $N_\lambda(x)$ is attained at some index $> i$, then $W_\gamma(x) \geq N_\lambda(x)$.

Given both property 1 and 2′ it is actually easy to formalize the whole argument. We do so by considering a game between Alice, who is choosing the query points, and Bob, who is choosing the random vectors $v_1, \ldots, v_N$. Moreover, to clarify the reasoning about conditional distributions, Bob will resample the vectors $v_i, \ldots, v_N$ at the beginning of the $i$th round of interaction, so that one explicitly does not have any information about those vectors given the first $i - 1$ rounds of interaction. Then we argue that with high probability all the oracle answers remain consistent throughout this resampling process. See Appendix A for the details. Next we explain how to build $W_\gamma$ so as to satisfy property 1 and 2′.

2.4 Building the wall

Let $h(x) = \gamma^{-2\alpha}\|x\|^{1+\alpha}$ be the basic building block of the wall. Consider the correlation cones:

$$C_i = \left\{x \in \mathbb{R}^d : v_i \cdot \frac{x}{\|x\|} \geq C\sqrt{\frac{\log(d)}{d}}\right\}.$$

Note that for any fixed query $x$, the probability (with respect to the random draw of $v_i$) that $x$ is in $C_i$ is polynomially small in $d$. We now define the wall $W_\gamma$ as follows: it is equal to the function $h$ outside of the correlation cones and the ball of radius $\gamma$, and it is extended by convexity to the rest of the unit ball. In other words, let $\Omega = \{x \in \mathbb{R}^d : \|x\| \in [\gamma, 1] \text{ and } x \notin C_i \text{ for all } i \in [N]\}$, and

$$W_\gamma(x) = \max_{y \in \Omega}\ \{h(y) + \nabla h(y) \cdot (x - y)\}. \qquad (3)$$

Let us first prove property 1:

Lemma 2 Let $\alpha = \frac{1}{\log_2(1/\gamma)} \leq 1$, and let $\gamma$ be such that $\frac{\gamma}{\log_2(1/\gamma)} = 4C\sqrt{\frac{N\log(d)}{d}} + \frac{1}{\sqrt{N}}$. Then $W_\gamma(x^*) \leq -\frac{1}{\sqrt{N}}$.

Proof One has $\nabla h(y) = \gamma^{-2\alpha}(1+\alpha)\frac{y}{\|y\|^{1-\alpha}}$ and thus

$$h(y) + \nabla h(y) \cdot (x - y) = -\alpha\gamma^{-2\alpha}\|y\|^{1+\alpha} + \gamma^{-2\alpha}(1+\alpha)\frac{y \cdot x}{\|y\|^{1-\alpha}}.$$

Moreover for any $y \in \Omega$ one has:

$$|y \cdot x^*| \leq \frac{1}{\sqrt{N}}\sum_{i=1}^N |y \cdot v_i| \leq C\sqrt{\frac{N\log(d)}{d}} \cdot \|y\|.$$

Thus for any $y \in \Omega$ we have:

$$h(y) + \nabla h(y) \cdot (x^* - y) \leq -\alpha\gamma^{-2\alpha}\gamma^{1+\alpha} + \gamma^{-2\alpha}(1+\alpha)C\sqrt{\frac{N\log(d)}{d}}.$$

The proof is straightforwardly concluded by using the values of $\alpha$ and $\gamma$.

Next we prove a simple formula for $W_\gamma(x)$ in the context of property 2′. More precisely we assume that $x = w + z$ with $w \in V_i$ and $z \in V_i^\perp$ with $z \notin C_j$ for any $j > i$. Note that for any fixed $z$, the latter condition happens with high probability with respect to the random draw of $v_{i+1}, \ldots, v_N$.

Lemma 3 Let $x = w + z$ with $w \in V_i$ and $z \in V_i^\perp$ with $z \notin C_j$ for any $j > i$. Then one has:

$$W_\gamma(x) = \max_{a,b \in \mathbb{R}_+ : a^2 + b^2 \in [\gamma^2, 1]}\left(-\alpha\gamma^{-2\alpha}(a^2+b^2)^{\frac{1+\alpha}{2}} + \gamma^{-2\alpha}(1+\alpha)(a^2+b^2)^{-\frac{1-\alpha}{2}}\left(\max_{y \in \tilde\Omega, \|y\| = a} y \cdot w + b\|z\|\right)\right),$$

where $\tilde\Omega = \{x \in V_i : x \notin C_j \text{ for all } j \in [i]\}$ and we use the convention that the maximum of an empty set is $-\infty$.

Proof Recall (3), and let us optimize over $y \in \Omega$ subject to $\|P_{V_i} y\| = a$ and $\|P_{V_i^\perp} y\| = b$ for some $a, b$ such that $a^2 + b^2 \in [\gamma^2, 1]$. Note that in fact there is an upper bound constraint on $a$ for such a $y$ to exist (for if the projection of $y$ onto $V_i$ is large, then necessarily $y$ must be in one of the correlation cones), which we can ignore thanks to the convention choice for the maximum of an empty set. Thus the only calculation we have to do is to verify that:

$$\max_{y \in \Omega : \|P_{V_i} y\| = a \text{ and } \|P_{V_i^\perp} y\| = b} y \cdot x = \max_{y \in \tilde\Omega, \|y\| = a} y \cdot w + b\|z\|.$$

Note that $y \cdot x = P_{V_i} y \cdot w + P_{V_i^\perp} y \cdot z$.
Thus the right hand side is clearly an upper bound on the left hand side (note that $P_{V_i} y \in \tilde\Omega$). To see that it is also a lower bound take $y = y_0 + b\frac{z}{\|z\|}$ for some arbitrary $y_0 \in \tilde\Omega$ with $\|y_0\| = a$, and note that $y \in \Omega$ (in particular using the assumption on $z$) with $\|P_{V_i} y\| = a$ and $\|P_{V_i^\perp} y\| = b$.

The key point of the formula given in Lemma 3 is that it does not depend on $v_{i+1}, \ldots, v_N$. Thus when the algorithm queries the point $x$ and obtains the above value for $W_\gamma(x)$ (and the corresponding gradient), the only information that it obtains is that $z \notin C_j$ for any $j > i$. Since the latter condition holds with high probability, the algorithm essentially learns nothing (more precisely the conditional distribution of $v_{i+1}, \ldots, v_N$ only changes by $1/\mathrm{poly}(d)$ compared to the unconditional distribution). Thus to complete the proof of property 2′ it only remains to show that if $\|z\| \geq \gamma$ and the argmax in $N_\lambda(x)$ is attained at an index $> i$, then the formula in Lemma 3 is larger than $N_\lambda(x)$. By taking $a = 0$ and $b = \gamma$ one obtains that this formula is at least (using also the values assigned to $\alpha$ and $\gamma$ in Lemma 2, and in particular $\gamma^{-\alpha} = 2$):

$$-\alpha\gamma^{-2\alpha}\gamma^{1+\alpha} + (1+\alpha)\gamma^{-\alpha}\|z\| = -2\alpha\gamma + 2(1+\alpha)\|z\| \geq \|z\|,$$

where the last inequality uses $\|z\| \geq \gamma$. On the other hand one has (by assumption that the argmax index is $> i$)

$$N_\lambda(x) = \max_{j > i}\,\{v_j \cdot x - j\lambda\} \leq \|z\|.$$

This concludes the proof of property 2′, and in turn concludes the proof sketch of our lower bound.

3 Upper bound

Here we present our highly parallel optimization procedure. Throughout this section we let $f : \mathbb{R}^d \to \mathbb{R}$ denote a differentiable $L$-Lipschitz function that obtains its minimum value at $x^* \in \mathbb{R}^d$ with $\|x^*\|_2 \leq R$.
The main result of this section is the following theorem, which provides an $\tilde{O}(d^{1/3}/\varepsilon^{2/3})$-depth highly-parallel algorithm that computes an $\varepsilon$-optimal point with high probability.

Theorem 4 (Highly Parallel Function Minimization) There is a randomized highly-parallel algorithm which given any differentiable $L$-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}$ minimized at $x^*$ with $\|x^*\| \leq R$ computes with probability $1 - \delta$ a point $x \in \mathbb{R}^d$ with $f(x) - f(x^*) \leq \varepsilon$ in depth $\tilde{O}(d^{1/3}(LR/\varepsilon)^{2/3})$ and work $\tilde{O}(d^{4/3}(LR/\varepsilon)^{8/3})$, where $\tilde{O}(\cdot)$ hides factors polylogarithmic in $d$, $\varepsilon$, $L$, $R$, and $\delta^{-1}$.

Our starting point for obtaining this result is the $O(d^{1/4}/\varepsilon)$-depth highly parallel algorithm of [Duchi et al., 2012]. This paper considers the convolution of $f$ with simple functions, e.g. Gaussians and uniform distributions, and shows this preserves the convexity and continuity of $f$ while improving the smoothness and thereby enables methods like accelerated gradient descent (AGD) to run efficiently. Since the convolved function can be accessed efficiently in parallel by random sampling, working with the convolved function is comparable to working with the original function in terms of query depth (up to the sampling error). Consequently, the paper achieves its depth bound by trading off the error induced by convolution with the depth improvements gained from stochastic variants of AGD.

To improve upon this bound, we apply a similar approach of working with the convolution of $f$ with a Gaussian. However, instead of applying standard stochastic AGD we consider accelerated methods which build a more sophisticated model of the convolved function in parallel.
Instead of using random sampling to approximate only the gradient of the convolved function, we obtain our improvements by using random sampling to glean more local information with each highly-parallel query and then use this to minimize the convolved function at an accelerated rate.

To enable the use of these more sophisticated models we develop a general acceleration framework that allows us to leverage any subroutine for approximate minimization of a local model / approximate gradient computation into an accelerated minimization scheme. We believe this framework is of independent interest, as we show that we can analyze the performance of this method just in terms of simple quantities regarding the local model. This framework is discussed in Section 3.1 and in Appendix C, where we show how it generalizes multiple previous results on near-optimal acceleration.

Using this framework, proving Theorem 4 reduces to showing that we can minimize high quality local models of the convolved function. Interestingly, it is possible to nearly obtain this result by simply using random sampling to estimate all derivatives up to some order $k$ and then using this to minimize a regularized $k$-th order Taylor approximation to the function. Near-optimal convergence for such methods under Lipschitz bounds on the $k$-th derivatives was recently given by [Gasnikov et al., 2018, Jiang et al., 2018, Bubeck et al., 2018] (and follows from our framework). This approach can be shown to give a highly-parallel algorithm of depth $\tilde{O}(d^{1/3+c}/\varepsilon^{2/3})$ for any $c > 0$ (with an appropriately large $k$). Unfortunately, the work of these methods is $O(d^{\mathrm{poly}(1/c)})$ and expensive for small $c$. To overcome this limitation, we leverage the full power of our acceleration framework and instead show that we can randomly sample to build a model of the convolved function accurate within a ball of sufficiently large radius.
In Section 3.2 we bound this quality of approximation and show that this local model can be optimized to sufficient accuracy efficiently. By combining this result with our framework we prove Theorem 4. We believe this demonstrates the utility of our general acceleration scheme and we plan to further explore its implications in future work.

3.1 Acceleration framework

Here we provide a general framework for accelerated convex function minimization. Throughout this section we assume that there is a twice-differentiable convex function $g : \mathbb{R}^d \to \mathbb{R}$ given by an approximate proximal step oracle and an approximate gradient oracle defined as follows.

Definition 5 (Approximate Proximal Step Oracle) Let $\omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a non-decreasing function, $\delta \geq 0$, and $\alpha \in [0, 1)$. We call $T_{\mathrm{prox}}$ an $(\alpha, \delta)$-approximate $\omega$-proximal step oracle for $g : \mathbb{R}^d \to \mathbb{R}$ if, when queried at $x \in \mathbb{R}^d$, the oracle returns $y = T_{\mathrm{prox}}(x)$ such that

$$\|\nabla g(y) + \omega(\|y - x\|)(y - x)\| \leq \alpha \cdot \omega(\|y - x\|)\|y - x\| + \delta. \qquad (4)$$

Definition 6 (Approximate Gradient Oracle) We call $T_{\mathrm{grad}}$ a $\delta$-approximate gradient oracle for $g : \mathbb{R}^d \to \mathbb{R}$ if when queried at $x \in \mathbb{R}^d$ the oracle returns $v = T_{\mathrm{grad}}(x)$ such that $\|v - \nabla g(x)\| \leq \delta$.

We show that there is an efficient accelerated optimization algorithm for minimizing $g$ using only these oracles. Its performance is encapsulated by the following theorem.

Theorem 7 (Acceleration Framework) Let $g : \mathbb{R}^d \to \mathbb{R}$ be a convex twice-differentiable function minimized at $x^*$ with $\|x^*\| \leq R$, let $\varepsilon > 0$ and $\delta \geq 0$, and let $\alpha \in [0, 1)$ be such that $128\alpha^2 \leq 1$. Further, let $\omega : \mathbb{R}_+ \to \mathbb{R}_+$ be a monotonically increasing continuously differentiable function with $\omega(0) \geq 0$.
There is an algorithm which for all $k$ computes a point $y_k$ with

$$g(y_k) - g^* \leq \max\left(\varepsilon,\ \frac{32 \cdot \omega\!\left(\frac{40\|x^*\|}{k^{3/2}}\right)\|x^*\|^2}{k^2}\right)$$

using $k(6 + \log_2[10206 R^2 \cdot \omega(1052 R) \cdot \varepsilon^{-1}])^2$ queries to an $(\alpha, \delta)$-approximate $\omega$-proximal step oracle for $g$ and a $\delta$-approximate gradient oracle for $g$, provided that both $\delta \leq \varepsilon/[10202 R]$ and $\varepsilon \leq 10204 R^3 \omega(80R)$.

This theorem generalizes multiple accelerated methods (up to polylogarithmic factors) and sheds light on the rates of these methods (see Appendix C for applications). For example, choosing $\omega(x) \stackrel{\mathrm{def}}{=} L$ and $T_{\mathrm{prox}}(x) = x - \frac{1}{L}\nabla f(x)$ recovers standard accelerated minimization of $L$-smooth functions, choosing $\omega(x) \stackrel{\mathrm{def}}{=} L$ and $T_{\mathrm{prox}}(x) \approx \arg\min_y g(y) + \frac{L}{2}\|y - x\|^2$ recovers a variant of approximate proximal point [Frostig et al.] and Catalyst [Lin et al., 2015], and choosing $\omega(x) \stackrel{\mathrm{def}}{=} \frac{L_p \cdot (p+1)}{p!} x^{p-1}$ and $T_{\mathrm{prox}}(x) = \arg\min_y g_p(y; x) + \frac{L_p}{p!}\|y - x\|^{p+1}$, where $g_p(y; x)$ is the value of the $p$'th order Taylor approximation of $g$ about $x$ evaluated at $y$, recovers highly smooth function minimization [Monteiro and Svaiter, 2013, Gasnikov et al., 2018, Jiang et al., 2018, Bubeck et al., 2018].

We prove Theorem 7 by generalizing an acceleration framework due to [Monteiro and Svaiter, 2013]. This framework was recently used by several results to obtain near-optimal query complexities for minimizing highly smooth convex functions [Gasnikov et al., 2018, Jiang et al., 2018, Bubeck et al., 2018]. In Section B.1 we provide a variant of this general framework that is amenable to the noise induced by our oracles. In Section B.2 we show how to instantiate our framework using the oracles assuming a particular type of line search can be performed. In Section B.3 we then prove Theorem 7.
The algorithm for the line search and its analysis are deferred to Appendix E.

3.2 Highly parallel optimization

With Theorem 7 in hand, to obtain our result we need to provide, for an appropriate function $\omega$, a highly parallel implementation of an approximate proximal step oracle and an approximate gradient oracle for a function that is an $O(\varepsilon)$ additive approximation of $f$. As with previous work [Duchi et al., 2012, Scaman et al., 2018] we consider the convolution of $f$ with a Gaussian of covariance $r^2 \cdot I_d$ for $r > 0$ we will tune later. Formally, we define $g : \mathbb{R}^d \to \mathbb{R}$ for all $x \in \mathbb{R}^d$ as

$$g(x) \stackrel{\mathrm{def}}{=} \int_{\mathbb{R}^d} \gamma_r(y) f(x - y)\,dy \quad\text{where}\quad \gamma_r(x) \stackrel{\mathrm{def}}{=} \frac{1}{(\sqrt{2\pi}\,r)^d}\exp\left(-\frac{\|x\|^2}{2r^2}\right).$$

It is straightforward to prove (see Section D.1) the following standard facts regarding $g$.

Lemma 8 The function $g$ is convex, $L$-Lipschitz, and satisfies both $|g(y) - f(y)| \leq \sqrt{d} \cdot Lr$ and $\nabla^2 g(y) \preceq (L/r) \cdot I_d$ for all $y \in \mathbb{R}^d$.

Consequently, to minimize $f$ up to $\varepsilon$ error, it suffices to minimize $g$ to $O(\varepsilon)$ error with $r = O(\frac{\varepsilon}{\sqrt{d}L})$. In the remainder of this section we simply show how to provide highly parallel implementations of the requisite oracles to achieve this by Theorem 7.

Now, as we have discussed, one way we could achieve this goal would be to use random sampling to approximate (in parallel) the $k$-th order Taylor approximation to $g$ and minimize a regularization of this function to implement the approximate proximal step oracle. While this procedure is depth-efficient, its work is quite large. Instead, we provide a more work efficient application of our acceleration framework. To implement a query to the oracles at some point $c \in \mathbb{R}^d$ we instead simply take multiple samples from $\gamma_r(x - c)$, i.e., the normal distribution with covariance $r^2 I_d$ and mean $c$, and use these samples to build an approximation to the gradient field of $g$.
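Lemma 8's approximation bound $|g(y) - f(y)| \leq \sqrt{d} \cdot Lr$ is easy to check by Monte Carlo; the test function, sample sizes, and names below are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, n_samples = 100, 0.01, 20000

f = np.linalg.norm   # a 1-Lipschitz convex test function (L = 1)

def g(y):
    """Monte Carlo estimate of the Gaussian convolution g(y) = E[f(y - r*xi)]."""
    xi = rng.standard_normal((n_samples, d))
    return float(np.mean([f(y - r * z) for z in xi]))

y = rng.standard_normal(d)
gap = abs(g(y) - f(y))   # should sit well below sqrt(d) * L * r = 0.1
```

In practice the gap is far smaller than the worst-case bound here, since $f(x) = \|x\|$ is already smooth away from the origin; the bound is tight only near the kink.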
The algorithm for this procedure is given by Algorithm 1 and carefully combines the gradients of the sampled points to build a model with small bias and variance. By a concentration bound and an $\varepsilon$-net argument, we can show that Algorithm 1 outputs a vector field $v : \mathbb{R}^d \to \mathbb{R}^d$ that is a uniform approximation of $\nabla g$ within a small ball (see Section D.2 for the proof).

Algorithm 1: Compute vector field approximating $\nabla g$
1 Input: Number of samples $N$, radius $r > 0$, error parameter $\eta \in (0, 1)$, center $c \in \mathbb{R}^d$.
2 Sample $x_1, x_2, \cdots, x_N$ independently from $\gamma_r(x - c)$.
3 return $v : \mathbb{R}^d \to \mathbb{R}^d$ defined for all $y \in \mathbb{R}^d$ by

$$v(y) = \frac{1}{N}\sum_{i=1}^N \frac{\gamma_r(y - x_i)}{\gamma_r(c - x_i)} \cdot \nabla f(x_i) \cdot \phi\big((x_i - c)^\top(y - c)\big) \cdot \mathbb{1}_{\|x_i - c\| \leq (\sqrt{d} + \frac{1}{\eta})r}$$

where $\phi(t) \stackrel{\mathrm{def}}{=} 0$ if $|t| \geq r^2$, $\phi(t) \stackrel{\mathrm{def}}{=} 1$ if $|t| \leq \frac{r^2}{2}$, and $\phi(t) \stackrel{\mathrm{def}}{=} 2 - \frac{2|t|}{r^2}$ otherwise.

Lemma 9 (Uniform Approximation) Algorithm 1 outputs a vector field $v : \mathbb{R}^d \to \mathbb{R}^d$ such that for any $\delta \in (0, \frac{1}{2})$ with probability at least $1 - \delta$ the following holds

$$\max_{y : \|y - c\| \leq \frac{\eta}{4}r} \|v(y) - \nabla g(y)\| \leq 5L \cdot \exp\left(-\frac{1}{2\eta^2}\right) + \frac{8L}{\sqrt{N}}\sqrt{\frac{d}{\eta^2}\log(9d) + \log\frac{1}{\delta}}.$$

Consequently, for any $\varepsilon \in [0, 1]$, choosing $N = O([d \log d\, \log(\frac{1}{\varepsilon}) + \log(\frac{1}{\delta})]\varepsilon^{-2})$ and $\eta = \frac{1}{8\sqrt{\log(10/\varepsilon)}}$ yields that $\max_{y : \|y - c\| \leq \tilde{r}} \|v(y) - \nabla g(y)\| \leq L \cdot \varepsilon$ where $\tilde{r} = \frac{\eta r}{4}$.

This lemma immediately yields that we can use Algorithm 1 to implement a highly-parallel approximate gradient oracle for $g$. Interestingly, it can also be leveraged to implement a highly-parallel approximate proximal step oracle. Formally, we show how to use it to find $y$ such that

$$\nabla g(y) + \omega(\|y - x\|) \cdot (y - x) \approx 0 \quad\text{where}\quad \omega(s) \stackrel{\mathrm{def}}{=} \frac{4Ls^p}{\tilde{r}^{p+1}} \qquad (5)$$

for some $p$ to be determined later.
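The sampling scheme of Algorithm 1 can be sketched in code as follows. This is a simplified sketch under our own names; the test at a linear $f$, where the gradient field of $g$ is constant, is purely illustrative.

```python
import numpy as np

def gaussian_density_ratio(a, b, r):
    # gamma_r(a) / gamma_r(b) for the isotropic Gaussian with covariance r^2 I.
    return np.exp((np.dot(b, b) - np.dot(a, a)) / (2 * r**2))

def phi(t, r):
    """Truncation weight: 1 on [0, r^2/2], 0 beyond r^2, linear in between."""
    t = abs(t)
    if t >= r**2:
        return 0.0
    if t <= r**2 / 2:
        return 1.0
    return 2.0 - 2.0 * t / r**2

def vector_field(grad_f, c, r, eta, n_samples, rng):
    """Sample x_i ~ N(c, r^2 I) once; return v(.), an approximation of the
    gradient field of the Gaussian convolution g near c.  Reweighting by the
    density ratio lets one batch of samples serve every y in a small ball."""
    d = c.size
    xs = c + r * rng.standard_normal((n_samples, d))
    grads = np.array([grad_f(x) for x in xs])
    keep = np.linalg.norm(xs - c, axis=1) <= (np.sqrt(d) + 1 / eta) * r

    def v(y):
        out = np.zeros(d)
        for x, gx, ok in zip(xs, grads, keep):
            if not ok:
                continue
            w = gaussian_density_ratio(y - x, c - x, r) * phi(np.dot(x - c, y - c), r)
            out += w * gx
        return out / n_samples

    return v
```

Since all $n$ samples (and hence all $n$ gradient queries) are drawn in one batch, the oracle costs a single round of depth; only the cheap reweighting depends on the evaluation point $y$.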
Ignoring logarithmic factors and supposing for simplicity that $L, R \leq 1$, Theorem 7 shows that by invoking this procedure $\tilde{O}(k)$ times we could achieve function error on the order of

$$\omega(1/k^{3/2})/k^2 \approx \tilde{r}^{-(p+1)} k^{-\frac{3p+4}{2}} \approx d^{\frac{p+1}{2}}\varepsilon^{-(p+1)} k^{-\frac{3p+4}{2}},$$

so that error $\varepsilon$ is reached after $k \approx d^{\frac{p+1}{3p+4}}\varepsilon^{-\frac{2p+4}{3p+4}}$ invocations, and we can therefore achieve the desired result by setting $p$ to be polylogarithmic in the problem parameters. Consequently, we simply need to find $y$ satisfying (5). The algorithm that achieves this is Algorithm 2, which essentially performs gradient descent on

$$g(y) + \Phi(\|y - c\|) \quad\text{where}\quad \Phi(s) = \int_0^s \omega(t) \cdot t\,dt. \qquad (6)$$

The performance of this algorithm is given by Theorem 24 in Section D.3. Combined with all of the above it proves Theorem 4; see Section D.4 for the details.

Algorithm 2: Approximate minimization of $g(y) + \Phi(\|y - c\|)$
1 Input: center $c \in \mathbb{R}^d$, accuracy $\varepsilon$, inner radius $\tilde{r} = \frac{r}{8\sqrt{\log(60/\varepsilon)}}$, and step size $h = \frac{\tilde{r}}{48p\sqrt{d}L}$.
2 Use Algorithm 1 to find a vector field $v$ such that $\max_{y : \|y - c\| \leq \tilde{r}} \|v(y) - \nabla g(y)\| \leq L \cdot \frac{\varepsilon}{6}$.
3 $y \leftarrow c$.
4 for $i = 1, 2, \cdots$ do
5   $\Delta = v(y) + \omega(\|y - c\|) \cdot (y - c)$ where $\omega$ is defined by (5) with $p \geq 1$.
6   if $\|\Delta\| \leq L \cdot \frac{5\varepsilon}{6}$ then return $y$ else $y = y - h \cdot \Delta$;
7 end

Acknowledgments

The authors thank the anonymous reviewers for their helpful feedback in preparing this final version. Further, the authors are grateful for multiple funding sources which supported this work in part, including NSF Awards CCF-1740551, CCF-1749609, and DMS-1839116 and NSF CAREER Award CCF-1844855.

References

E. Balkanski and Y. Singer. Parallelization does not accelerate convex optimization: Adaptivity lower bounds for non-smooth convex minimization. arXiv preprint arXiv:1808.03880, 2018.

K. Ball. An elementary introduction to modern convex geometry. In S. Levy, editor, Flavors of Geometry, pages 1–58. Cambridge University Press, 1997.

S. 
Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

S. Bubeck, Q. Jiang, Y.T. Lee, Y. Li, and A. Sidford. Near-optimal method for highly smooth convex optimization. arXiv preprint arXiv:1812.08026, 2018.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.

J. Diakonikolas and C. Guzmán. Lower bounds for parallel and randomized convex optimization. arXiv preprint arXiv:1811.01903, 2018.

J. Duchi, P. Bartlett, and M. Wainwright. Randomized smoothing for stochastic optimization. SIAM Journal on Optimization, 22(2):674–701, 2012.

J. Duchi, F. Ruan, and C. Yun. Minimax bounds on stochastic batched convex optimization. In Proceedings of the 31st Conference On Learning Theory (COLT), volume 75 of Proceedings of Machine Learning Research, pages 3065–3162, 2018.

R. Frostig, R. Ge, S. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.

A. Gasnikov, E. Gorbunov, D. Kovalev, A. Mohhamed, and E. Chernousova. The global rate of convergence for optimal tensor methods in smooth convex optimization. arXiv preprint arXiv:1809.00382, 2018.

B. Jiang, H. Wang, and S. Zhang. An optimal high-order tensor method for convex optimization. arXiv preprint arXiv:1812.06557, 2018.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5):1302–1338, 2000.
doi: 10.1214/aos/1015957395. URL https://doi.org/10.1214/aos/1015957395.

H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems (NIPS) 2015, pages 3384–3392, 2015.

R. D. C. Monteiro and B. F. Svaiter. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM Journal on Optimization, 23(2):1092–1125, 2013.

A. Nemirovski. On parallel complexity of nonsmooth convex optimization. Journal of Complexity, 10(4):451–463, 1994.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.

Y. Nesterov. Introductory lectures on convex optimization: A basic course. Kluwer Academic Publishers, 2004.

I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab., 22(4):1679–1706, 1994. doi: 10.1214/aop/1176988477. URL https://doi.org/10.1214/aop/1176988477.

K. Scaman, F. Bach, S. Bubeck, L. Massoulié, and Y.T. Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Advances in Neural Information Processing Systems 31, pages 2740–2749, 2018.

D. Yudin and A. Nemirovski. Estimating the computational complexity of mathematical programming problems. Ekonomika i matem metody, 12(1):128–142, 1976. In Russian.

Y. Zhang and L. Xiao. Communication-efficient distributed optimization of self-concordant empirical loss. In Large-Scale and Distributed Optimization, pages 289–341.
Springer, 2018.
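For intuition, the gradient-descent loop of Algorithm 2 can be sketched in a few lines. This is a toy sketch, not the paper's implementation: the exact gradient of a simple quadratic $g$ stands in for the vector field $v$ produced by Algorithm 1, and $\omega(t) = t^p/\tilde{r}^{p+1}$ is one concrete choice consistent with $\Phi(s) = \int_0^s \omega(t)\, t\, dt$ being a $(p+2)$-th power penalty; the radius, step size, and stopping constants here are illustrative rather than the paper's exact values.

```python
import numpy as np

def algorithm2_sketch(grad_g, c, eps, r_tilde, h, L=1.0, p=2, max_iter=10000):
    """Toy sketch of Algorithm 2: gradient descent on g(y) + Phi(||y - c||).

    Assumptions (illustrative, not the paper's exact constants):
      - grad_g plays the role of the vector field v from Algorithm 1,
        here taken to be the exact gradient of g;
      - omega(t) = t**p / r_tilde**(p+1), so that Phi is a (p+2)-th power
        penalty that is tiny inside the ball of radius r_tilde and grows
        quickly outside it.
    """
    omega = lambda t: t ** p / r_tilde ** (p + 1)
    y = c.copy()
    for _ in range(max_iter):
        # approximate gradient of g(y) + Phi(||y - c||);
        # note d/dy Phi(||y - c||) = omega(||y - c||) * (y - c)
        g_y = grad_g(y) + omega(np.linalg.norm(y - c)) * (y - c)
        if np.linalg.norm(g_y) <= 5 * L * eps / 6:
            return y  # approximate stationary point of the penalized objective
        y = y - h * g_y
    return y

# toy usage: g(y) = 0.5 * ||y - y_star||^2 with minimizer near the center c
y_star = np.array([0.3, -0.2])
c = np.zeros(2)
y = algorithm2_sketch(lambda y: y - y_star, c, eps=1e-3, r_tilde=1.0, h=0.1)
```

The penalty term pulls the iterates toward the center $c$, so the returned point is a near-stationary point of the penalized objective rather than of $g$ itself, mirroring how Algorithm 2 only needs to solve the regularized subproblem within the inner radius.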