{"title": "Graph Oracle Models, Lower Bounds, and Gaps for Parallel Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 8496, "page_last": 8506, "abstract": "We suggest a general oracle-based framework that captures parallel\n stochastic optimization in different parallelization settings\n described by a dependency graph, and derive generic lower bounds \n in terms of this graph. We then use the framework and derive lower\n bounds to study several specific parallel optimization settings,\n including delayed updates and parallel processing with intermittent\n communication. We highlight gaps between lower and upper bounds on\n the oracle complexity, and cases where the ``natural'' algorithms\n are not known to be optimal.", "full_text": "Graph Oracle Models, Lower Bounds, and Gaps for\n\nParallel Stochastic Optimization\n\nToyota Technological Institute at Chicago\n\nBlake Woodworth\n\nblake@ttic.edu\n\nJialei Wang\n\nTwo Sigma Investments\n\njialei.wang@twosigma.com\n\nAdam Smith\n\nBoston University\nads22@bu.edu\n\nBrendan McMahan\n\nGoogle\n\nmcmahan@google.com\n\nToyota Technological Institute at Chicago\n\nNathan Srebro\u21e4\n\nnati@ttic.edu\n\nAbstract\n\nWe suggest a general oracle-based framework that captures different parallel\nstochastic optimization settings described by a dependency graph, and derive\ngeneric lower bounds in terms of this graph. We then use the framework and derive\nlower bounds for several speci\ufb01c parallel optimization settings, including delayed\nupdates and parallel processing with intermittent communication. We highlight\ngaps between lower and upper bounds on the oracle complexity, and cases where\nthe \u201cnatural\u201d algorithms are not known to be optimal.\n\n1\n\nIntroduction\n\nRecently, there has been great interest in stochastic optimization and learning algorithms that leverage\nparallelism, including e.g. 
delayed updates arising from pipelining and asynchronous concurrent\nprocessing, synchronous single-instruction-multiple-data parallelism, and parallelism across distant\ndevices. With the abundance of parallelization settings and associated algorithms, it is important to\nprecisely formulate the problem, which allows us to ask questions such as \u201cis there a better method\nfor this problem than what we have?\u201d and \u201cwhat is the best we could possibly expect?\u201d\nOracle models have long been a useful framework for formalizing stochastic optimization and\nlearning problems. In an oracle model, we place limits on the algorithm\u2019s access to the optimization\nobjective, but not what it may do with the information it receives. This allows us to obtain sharp\nlower bounds, which can be used to argue that an algorithm is optimal and to identify gaps between\ncurrent algorithms and what might be possible. Finding such gaps can be very useful\u2014for example,\nthe gap between the \ufb01rst order optimization lower bound of Nemirovski et al. [21] and the best known\nalgorithms at the time inspired Nesterov\u2019s accelerated gradient descent algorithm [22].\nWe propose an oracle framework for formalizing different parallel optimization problems. We specify\nthe structure of parallel computation using an \u201coracle graph\u201d which indicates how an algorithm\naccesses the oracle. Each node in the graph corresponds to a single stochastic oracle query, and that\nquery (e.g. the point at which a gradient is calculated) must be computed using only oracle accesses\nin ancestors of the node. 
We generally think of each stochastic oracle access as being based on a single data sample, thus involving one or maybe a small number of vector operations. In Section 3 we devise generic lower bounds for parallel optimization problems in terms of simple properties of the associated oracle graph, namely the length of the longest dependency chain and the total number of nodes. In Section 4 we study specific parallel optimization settings in which many algorithms have been proposed, formulate them as graph-based oracle parallel optimization problems, instantiate our lower bounds, and compare them with the performance guarantees of specific algorithms. We highlight gaps between the lower bound and the best known upper bound and also situations where we can devise an optimal algorithm that matches the lower bound, but where this is not the \u201cnatural\u201d and typical algorithm used in these settings. The latter indicates either a gap in our understanding of the \u201cnatural\u201d algorithm or a need to depart from it.\n\n\u21e4Part of this work was done while visiting Google.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nPreviously suggested models Previous work studied communication lower bounds for parallel convex optimization where there are M machines each containing a local function (e.g. a collection of samples from a distribution). Each machine can perform computation on its own function, and then periodically every machine is allowed to transmit information to the others. In order to prove meaningful lower bounds based on the number of rounds of communication, it is necessary to prevent the machines from simply transmitting their local function to a central machine, or else any objective could be optimized in one round. There are two established ways of doing this. 
First, one can allow arbitrary computation on the local machines, but restrict the number of bits that can be transmitted in each round. There is work focusing on specific statistical estimation problems that establishes communication lower bounds via information-theoretic arguments [7, 12, 29]. Alternatively, one can allow the machines to communicate real-valued vectors, but restrict the types of computation they are allowed to perform. For instance, Arjevani and Shamir [3] present communication complexity lower bounds for algorithms which can only compute vectors that lie in a certain subspace, which includes e.g. linear combinations of gradients of their local function. Lee et al. [16] assume a similar restriction, but allow the data defining the local functions to be allocated to the different machines in a strategic manner. Our framework applies to general stochastic optimization problems and does not impose any restrictions on what computation the algorithm may perform, and is thus a more direct generalization of the oracle model of optimization.\n\nRecently, Duchi et al. [10] considered first-order optimization in a special case of our proposed model (the \u201csimple parallelism\u201d graph of Section 4.2), but their bounds apply in a more limited parameter regime; see Section 3 for discussion.\n\n2 The graph-based oracle model\n\nWe consider the following stochastic optimization problem\n\nmin_{x \u2208 R^m : \u2016x\u2016 \u2264 B} F(x) := E_{z \u223c P}[f(x; z)] (1)\n\nThe problem (1) captures many important tasks, such as supervised learning, in which case f(x; z) is the loss of a model parametrized by x on data instance z and the goal is to minimize the population risk E[f(x; z)]. We assume that f(\u00b7; z) is convex, L-Lipschitz, and H-smooth for all z. We also allow f to be non-smooth, which corresponds to H = \u221e. 
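As a concrete instance of (1), the following sketch (our own illustration, not from the paper; the distribution and loss are hypothetical) takes f(x; z) = 0.5·||x − z||², so the population risk F is minimized at the mean of P, and estimates F by Monte Carlo sampling:

```python
import numpy as np

# Hypothetical instance of problem (1): f(x; z) = 0.5 * ||x - z||^2,
# so F(x) = E_z[f(x; z)] is minimized at the mean of the distribution P.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])  # mean of P = N(mu, I) in R^3

def f(x, z):
    return 0.5 * np.sum((x - z) ** 2)

def sample_z():
    return mu + rng.standard_normal(3)

def F_hat(x, n=20000):
    # Monte Carlo estimate of the population risk F(x)
    return np.mean([f(x, sample_z()) for _ in range(n)])

# F(mu) = 0.5 * E||z - mu||^2 = 0.5 * 3 = 1.5 for a unit-covariance Gaussian in R^3
print(F_hat(mu))           # close to 1.5
print(F_hat(np.zeros(3)))  # larger: any x != mu is suboptimal
```

Any algorithm in the paper's framework only ever sees F through stochastic oracle answers at sampled z's, never F itself.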
A function g is L-Lipschitz when \u2016g(x) - g(y)\u2016 \u2264 L\u2016x - y\u2016 for all x, y, and it is H-smooth when it is differentiable and its gradient is H-Lipschitz. We consider optimization algorithms that use either a stochastic gradient or stochastic prox oracle (O_grad and O_prox respectively):\n\nO_grad(x, z) = (f(x; z), \u2207f(x; z)) (2)\n\nO_prox(x, \u03b2, z) = (f(x; z), \u2207f(x; z), prox_{f(\u00b7;z)}(x, \u03b2)) (3)\n\nwhere prox_{f(\u00b7;z)}(x, \u03b2) = argmin_y f(y; z) + (\u03b2/2)\u2016y - x\u2016^2 (4)\n\nThe prox oracle is quite powerful and provides global rather than local information about f. In particular, querying the prox oracle with \u03b2 = 0 fully optimizes f(\u00b7; z).\n\nAs stated, z is an argument to the oracle, however there are two distinct cases. In the \u201cfully stochastic\u201d oracle setting, the algorithm receives an oracle answer corresponding to a random z \u223c P. We also consider a setting in which the algorithm is allowed to \u201cactively query\u201d the oracle. In this case, the algorithm may either sample z \u223c P or choose a desired z and receive an oracle answer for that z. Our lower bounds hold for either type of oracle. Most optimization algorithms only use the fully stochastic oracle, but some require more powerful active queries.\n\nWe capture the structure of a parallel optimization algorithm with a directed, acyclic oracle graph G. Its depth, D, is the length of the longest directed path, and the size, N, is the number of nodes. Each node in the graph represents a single stochastic oracle access, and the edges in the graph indicate where the results of that oracle access may be used: only the oracle accesses from ancestors of each node are available when issuing a new query. These limitations might arise e.g. due to parallel computation delays or the expense of communicating between disparate machines.\n\nLet Q be the set of possible oracle queries, with the exact form of queries (e.g., q = x vs. 
q = (x, \u03b2, z)) depending on the context. Formally, a randomized optimization algorithm that accesses the stochastic oracle O as prescribed by the graph G is specified by associating with each node v_t a query rule R_t : (Q, O(Q))\u21e4 \u00d7 \u039e \u2192 Q, plus a single output rule \u02c6X : (Q, O(Q))\u21e4 \u00d7 \u039e \u2192 X. We grant all of the nodes access to a source of shared randomness \u03be \u2208 \u039e (e.g. an infinite stream of random bits). The mapping R_t selects a query q_t to make at node v_t using the set of queries and oracle responses in ancestors of v_t, namely\n\nq_t = R_t((q_i, O(q_i) : i \u2208 Ancestors(v_t)), \u03be) (5)\n\nSimilarly, the output rule \u02c6X maps from all of the queries and oracle responses to the algorithm\u2019s output as \u02c6x = \u02c6X((q_i, O(q_i) : i \u2208 [N]), \u03be). The essential question is: for a class of optimization problems (G, O, F) specified by a dependency graph G, a stochastic oracle O, and a function class F, what is the best possible guarantee on the expected suboptimality of an algorithm\u2019s output, i.e.\n\ninf_{(R_1,...,R_N, \u02c6X)} sup_{f \u2208 F} E_{\u02c6x,z}[f(\u02c6x; z)] - min_x E_z[f(x; z)] (6)\n\nIn this paper, we consider optimization problems (G, O, F_{L,H,B}) where F_{L,H,B} is the class of convex, L-Lipschitz, and H-smooth functions on the domain {x \u2208 R^m : \u2016x\u2016 \u2264 B} and parametrized by z, and O is either a stochastic gradient oracle O_grad (2) or a stochastic prox oracle O_prox (3). 
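To make the formalism concrete, here is a minimal simulation (our own toy illustration; the node structure, query rule, and objective are all hypothetical) in which each node's query is computed only from the queries and oracle answers at its ancestors, as in (5):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0])

# Toy fully stochastic gradient oracle for f(x; z) = 0.5*||x - z||^2, z ~ N(mu, I):
# returns the pair (value, gradient) as in (2).
def oracle_grad(x):
    z = mu + rng.standard_normal(2)
    return 0.5 * np.sum((x - z) ** 2), x - z

# A path graph on T nodes: Ancestors(v_t) = {v_1, ..., v_{t-1}}.
T = 400
ancestors = {t: list(range(t)) for t in range(T)}

# Query rule R_t: a gradient step from the most recent ancestor's query
# (starting at the origin when there are no ancestors).
queries, responses = [], []
eta = 0.05
for t in range(T):
    anc = ancestors[t]
    if anc:
        last = anc[-1]
        q = queries[last] - eta * responses[last][1]  # uses only ancestor data
    else:
        q = np.zeros(2)
    queries.append(q)
    responses.append(oracle_grad(q))

# Output rule: average the second half of the queries (a common SGD output choice).
x_hat = np.mean(queries[T // 2:], axis=0)
print(np.linalg.norm(x_hat - mu))  # small: the protocol recovers the minimizer of F
```

Changing only the `ancestors` map turns the same protocol into the layer, delay, or intermittent-communication graphs studied in Section 4.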
We consider this function class to contain Lipschitz but non-smooth functions too, which corresponds to H = \u221e. Our function class does not bound the dimension m of the problem, as we seek to understand the best possible guarantees in terms of Lipschitz and smoothness constants that hold in any dimension. Indeed, there are (typically impractical) algorithms such as center-of-mass methods, which might use the dimension in order to significantly reduce the oracle complexity, but at a potentially huge computational cost. Nemirovski [20] studied non-smooth optimization in the case that the dimension is bounded, proving lower bounds in this setting that scale with the 1/3-power of the dimension but have only logarithmic dependence on the suboptimality. We do not analyze strongly convex functions, but the situation is similar and lower bounds can be established via reduction [28].\n\n3 Lower bounds\n\nWe now provide lower bounds for optimization problems (G, O_grad, F_{L,H,B}) and (G, O_prox, F_{L,H,B}) in terms of L, H, B, and the depth and size of G.\n\nTheorem 1. Let L, B \u2208 (0, \u221e), H \u2208 [0, \u221e], N \u2265 D \u2265 1, let G be any oracle graph of depth D and size N and consider the optimization problem (G, O_grad, F_{L,H,B}). For any randomized algorithm A = (R_1, . . . , R_N, \u02c6X), there exists a distribution P and a convex, L-Lipschitz, and H-smooth function f on a B-bounded domain in R^m for m = O(max{N^2, D^3 N} log(DN)) such that\n\nE_{z \u223c P, \u02c6X \u223c A}[f(\u02c6X; z)] - min_x E_{z \u223c P}[f(x; z)] \u2265 \u03a9(min{LB/\u221aD, HB^2/D^2} + LB/\u221aN)\n\nTheorem 2. Let L, B \u2208 (0, \u221e), H \u2208 [0, \u221e], N \u2265 D \u2265 1, let G be any oracle graph of depth D and size N and consider the optimization problem (G, O_prox, F_{L,H,B}). For any randomized algorithm A = (R_1, . . . 
, R_N, \u02c6X), there exists a distribution P and a convex, L-Lipschitz, and H-smooth function f on a B-bounded domain in R^m for m = O(max{N^2, D^3 N} log(DN)) such that\n\nE_{z \u223c P, \u02c6X \u223c A}[f(\u02c6X; z)] - min_x E_{z \u223c P}[f(x; z)] \u2265 \u03a9(min{LB/D, HB^2/D^2} + LB/\u221aN)\n\nThese are the tightest possible lower bounds in terms of just the depth and size of G in the sense that for all D, N there are graphs G and associated algorithms which match the lower bound. Of course, for specific, mostly degenerate graphs they might not be tight. For instance, our lower bound for the graph consisting of a short sequential chain plus a very large number of disconnected nodes might be quite loose due to the artificial inflation of N. Nevertheless, for many interesting graphs they are tight, as we shall see in Section 4.\n\nEach lower bound has two components: an \u201coptimization\u201d term and a \u201cstatistical\u201d term. The statistical term \u03a9(LB/\u221aN) is well known, although we include a brief proof of this portion of the bound in Appendix D for completeness. The optimization term depends on the depth D, and indicates, intuitively, the best suboptimality guarantee that can be achieved by an algorithm using unlimited parallelism but only D rounds of communication. Arjevani and Shamir [3] also obtain lower bounds in terms of rounds of communication, which are similar to how our lower bounds depend on depth. However they restricted the type of computations that are allowed to the algorithm to a specific class of operations, while we only limit the number of oracle queries and the dependency structure between them, but allow forming the queries in any arbitrary way.\n\nSimilar to Arjevani and Shamir [3], to establish the optimization term in the lower bounds, we construct functions that require multiple rounds of sequential oracle accesses to optimize. 
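To illustrate how the generic bounds instantiate, substituting the depth and size of the width-M layer graph of Section 4.2 (D = T, N = MT, and B = 1 after normalization) into Theorem 1 yields the bound used there; the algebra below is our own restatement:

```latex
% Theorem 1, with depth D and size N:
\Omega\!\left( \min\!\left\{ \frac{LB}{\sqrt{D}},\; \frac{HB^2}{D^2} \right\} + \frac{LB}{\sqrt{N}} \right)
\quad\xrightarrow{\;D = T,\; N = MT,\; B = 1\;}\quad
\Omega\!\left( \min\!\left\{ \frac{L}{\sqrt{T}},\; \frac{H}{T^2} \right\} + \frac{L}{\sqrt{MT}} \right)
```

The same substitution with Theorem 2 replaces L/√T by L/T, which is the prox-oracle bound of Section 4.2.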
In the gradient oracle case, we use a single, deterministic function which resembles a standard construction for first order optimization lower bounds. For the prox case, we construct two functions inspired by previous lower bounds for round-based and finite sum optimization [3, 28]. In order to account for randomized algorithms that might leave the span of gradients or prox outputs returned by the oracle, we use a technique that was proposed by Woodworth and Srebro [27, 28] and refined by Carmon et al. [8]. For our specific setting, we must slightly modify the existing analysis, which is detailed in Appendix A.\n\nA useful feature of our lower bounds is that they apply when both the Lipschitz constant and smoothness are bounded concurrently. Consequently, \u201cnon-smooth\u201d in the subsequent discussion can be read as simply identifying the case where the L term achieves the minimum as opposed to the H term (even if H < \u221e). This is particularly important when studying stochastic parallel optimization, since obtaining non-trivial guarantees in a purely stochastic setting requires some sort of control on the magnitude of the gradients (smoothness by itself is not sufficient), while obtaining parallelization speedups often requires smoothness, and so we would like to ask what is the best that can be done when both Lipschitz and smoothness are controlled. Interestingly, the dependence on both L and H in our bounds is tight, even when the other is constrained, which shows that the optimization term cannot be substantially reduced by using both conditions together.\n\nIn the case of the gradient oracle, we \u201csmooth out\u201d a standard non-smooth lower bound construction [21, 27]; previous work has used a similar approach in slightly different settings [2, 13]. For \u2113 \u2264 L and \u03b7 \u2264 H, and orthonormal v_1, . . . 
, v_{D+1} drawn uniformly at random, we define the \u2113-Lipschitz but non-smooth function \u02dcf, and its \u2113-Lipschitz, \u03b7-smooth \u201c\u03b7-Moreau envelope\u201d [5]:\n\n\u02dcf(x) = max_{1 \u2264 r \u2264 D+1} \u2113(v_r^T x - (r - 1)/(2(D + 1)^{1.5})), f(x) = min_y \u02dcf(y) + (\u03b7/2)\u2016y - x\u2016^2 (7)\n\nThis defines a distribution over f\u2019s based on the randomness in the draw of v_1, . . . , v_{D+1}, and we apply Yao\u2019s minimax principle. In Appendix B, we prove Theorem 1 using this construction.\n\nIn the case of the prox oracle, we \u201cstraighten out\u201d the smooth construction of Woodworth and Srebro [28]. For fixed constants c, \u03b3, we define the following Lipschitz and smooth scalar function \u03c8_c:\n\n\u03c8_c(z) = 0 for |z| \u2264 c; 2(|z| - c)^2 for c < |z| \u2264 2c; z^2 - 2c^2 for 2c < |z| \u2264 \u03b3; 2\u03b3|z| - \u03b3^2 - 2c^2 for |z| > \u03b3 (8)\n\nFor P = Uniform{1, 2} and orthonormal v_1, . . . , v_{2D} drawn uniformly at random, we define\n\nf(x; 1) = (\u03b7/8)(-2a v_1^T x + \u03c8_c(v_{2D}^T x) + \u03a3_{r=3,5,7,...}^{2D-1} \u03c8_c(v_{r-1}^T x - v_r^T x)) (9)\n\nf(x; 2) = (\u03b7/8) \u03a3_{r=2,4,6,...}^{2D} \u03c8_c(v_{r-1}^T x - v_r^T x) (10)\n\nAgain, this defines a distribution over f\u2019s based on the randomness in the draw of v_1, . . . , v_{2D} and we apply Yao\u2019s minimax principle. 
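As a quick numerical sanity check on the scalar building block in (8) (using our reconstruction of the extracted formula, with hypothetical constants; this check is not part of the paper), the four pieces match in value at the breakpoints and the derivative is bounded and varies slowly, consistent with a Lipschitz and smooth function:

```python
import numpy as np

c, gamma = 0.5, 2.0  # hypothetical constants with 2c < gamma

def psi(z):
    # Piecewise scalar function from (8), as reconstructed
    a = abs(z)
    if a <= c:
        return 0.0
    if a <= 2 * c:
        return 2.0 * (a - c) ** 2
    if a <= gamma:
        return z * z - 2.0 * c * c
    return 2.0 * gamma * a - gamma ** 2 - 2.0 * c * c

# Continuity at the breakpoints
for b in (c, 2 * c, gamma):
    assert abs(psi(b - 1e-9) - psi(b + 1e-9)) < 1e-6

# Finite differences: the slope is bounded by 2*gamma (Lipschitz) and
# changes slowly between adjacent grid points (smoothness).
zs = np.linspace(-2 * gamma, 2 * gamma, 20001)
vals = np.array([psi(z) for z in zs])
d = np.diff(vals) / np.diff(zs)
print(np.max(np.abs(d)))           # at most 2*gamma = 4
print(np.max(np.abs(np.diff(d))))  # small: the gradient has no jumps
```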
In Appendix C, we prove Theorem 2 using this construction.\n\nTable 1: Summary of upper and lower bounds for stochastic convex optimization of L-Lipschitz and H-smooth functions with T iterations, M machines, and K sequential steps per machine (\u2227 denotes min). Green indicates lower bounds matched only by \u201cunnatural\u201d methods; red and blue indicate a gap between the lower and upper bounds.\n\npath(T) (Section 4.1): with gradient oracle: L/\u221aT; with gradient and prox oracle: L/\u221aT\n\nlayer(T, M) (Section 4.2): with gradient oracle: (L/\u221aT \u2227 H/T^2) + L/\u221a(MT); with gradient and prox oracle: (L/T \u2227 H/T^2) + L/\u221a(MT)\n\ndelay(T, \u03c4) (Section 4.3): with gradient oracle: (L/\u221a(T/\u03c4) \u2227 H\u03c4^2/T^2) + L/\u221aT; with gradient and prox oracle: (L\u03c4/T \u2227 H\u03c4^2/T^2) + L/\u221aT\n\nintermittent(T, K, M) (Section 4.4): with gradient oracle: lower bound (L/\u221a(KT) \u2227 H/(K^2 T^2)) + L/\u221a(MKT), upper bound L/\u221a(KT) \u2227 (H/T^2 + L/\u221a(MKT)) \u2227 ((H/(TK) + L/\u221a(MKT)) log(MKT/L)); with gradient and prox oracle: lower bound (L/(KT) \u2227 H/(K^2 T^2)) + L/\u221a(MKT), upper bound L/\u221a(KT) \u2227 ((L/T \u2227 H/T^2) + L/\u221a(MKT)) \u2227 ((H/(TK) + L/\u221a(MKT)) log(MKT/L))\n\nRelation to previous bounds As mentioned above, Duchi et al. [10] recently showed a lower bound for first- and zero-order stochastic optimization in the \u201csimple parallelism\u201d graph consisting of D layers, each with M nodes. Their bound [10, Thm 2] applies only when the dimension m is constant, and D = O(m log log M). Our lower bound requires non-constant dimension, but applies in any range of M. Furthermore, their proof techniques do not obviously extend to prox oracles.\n\n4 Specific dependency graphs\n\nWe now use our framework to study four specific parallelization structures. The main results (tight complexities and gaps between lower and upper bounds) are summarized in Table 1. For simplicity and without loss of generality, we set B = 1, i.e. we normalize the optimization domain to be {x \u2208 R^m : \u2016x\u2016 \u2264 1}. 
All stated upper and lower bounds are for the expected suboptimality E[F(\u02c6x)] - F(x*) of the algorithm\u2019s output.\n\n4.1 Sequential computation: the path graph\n\nWe begin with the simplest model, that of sequential computation captured by the path graph of length T depicted above. The ancestors of each vertex v_i, i = 1 . . . T are all the preceding vertices (v_1, . . . , v_{i-1}). The sequential model is of course well studied and understood. To see how it fits into our framework: A path graph of length T has a depth of D = T and size of N = T, thus with either gradient or prox oracles, the statistical term is dominant in Theorems 1 and 2. These lower bounds are matched by sequential stochastic gradient descent, yielding a tight complexity of \u0398(L/\u221aT) and the familiar conclusion that SGD is (worst case) optimal in this setting.\n\n4.2 Simple parallelism: the layer graph\n\nWe now turn to a model in which M oracle queries can be made in parallel, and the results are broadcast for use in making the next batch of M queries. This corresponds to synchronized parallelism and fast communication between processors. The model is captured by a layer graph of width M, depicted above for M = 3. The graph consists of T layers t = 1, . . . , T each with M nodes v_{t,1}, . . . , v_{t,M} whose ancestors include v_{t',i} for all t' < t and i \u2208 [M]. The graph has a depth of D = T and size of N = MT. With a stochastic gradient oracle, Theorem 1 yields the lower bound (11) below, which is matched by accelerated mini-batch SGD (A-MB-SGD) [9, 15], establishing the optimality of A-MB-SGD in this setting. 
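A minimal simulation of the layer-graph pattern (our own toy sketch: plain, non-accelerated mini-batch SGD rather than A-MB-SGD, on a hypothetical quadratic objective) shows the mechanics: M parallel oracle queries per layer are averaged into one update, and M = 1 recovers sequential SGD on the path graph:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.3, -0.4])  # minimizer of F for f(x; z) = 0.5*||x - z||^2

def minibatch_sgd(T, M, eta=0.1):
    # T layers; in each layer, the M nodes query the gradient at the same
    # point and the M answers are averaged into a single step.
    x = np.zeros(2)
    for _ in range(T):
        grads = [x - (mu + rng.standard_normal(2)) for _ in range(M)]
        x = x - eta * np.mean(grads, axis=0)
    return x

def avg_err(T, M, reps=30):
    return np.mean([np.linalg.norm(minibatch_sgd(T, M) - mu) for _ in range(reps)])

err_seq = avg_err(T=200, M=1)   # path graph: sequential SGD
err_par = avg_err(T=200, M=64)  # layer graph: width-64 mini-batches
print(err_seq, err_par)  # the wider graph reaches a much smaller error
```

The statistical L/√(MT) improvement from width is visible here; the accelerated H/T² optimization term of A-MB-SGD requires the momentum scheme of [9, 15], which this sketch omits.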
For sufficiently smooth objectives, the same algorithm is also optimal even if prox access is allowed, since Theorem 2 implies the lower bound (12):\n\n\u03a9(min{L/\u221aT, H/T^2} + L/\u221a(MT)) (11)\n\n\u03a9(min{L/T, H/T^2} + L/\u221a(MT)) (12)\n\nThat is, for smooth objectives, having access to a prox oracle does not improve the optimal complexity over just using gradient access. However, for non-smooth or insufficiently smooth objectives, there is a gap between (11) and (12). An optimal algorithm, smoothed A-MB-SGD, uses the prox oracle in order to calculate gradients of the Moreau envelope of f(x; z) (cf. Proposition 12.29 of [5]), and then performs A-MB-SGD on the smoothed objectives. This yields a suboptimality guarantee that precisely matches (12), establishing that the lower bound from Theorem 2 is tight for the layer graph, and that smoothed A-MB-SGD is optimal. An analysis of the smoothed A-MB-SGD algorithm is provided in Appendix E.1.\n\n4.3 Delayed updates\n\nWe now turn to a delayed computation model that is typical in many asynchronous parallelization and pipelined computation settings, e.g. when multiple processors or machines are working asynchronously, reading iterates, taking some time to perform the oracle accesses and computation, then communicating the results back (or updating the iterate accordingly) [1, 6, 17, 19, 25]. This is captured by a \u201cdelay graph\u201d with T nodes v_1, . . . , v_T and delays \u03c4_t for the response to the oracle query performed at v_t to become available. Hence, Ancestors(v_t) = {v_s | s + \u03c4_s \u2264 t}. Analysis is typically based on the delays being bounded, i.e. \u03c4_t \u2264 \u03c4 for all t. The depiction above corresponds to \u03c4_t = 2; the case \u03c4_t = 1 corresponds to the path graph. 
With constant delays \u03c4_t = \u03c4, the delay graph has depth D \u2264 T/\u03c4 and size N = T, so Theorem 1 gives the following lower bound when using a gradient oracle:\n\n\u03a9(min{L/\u221a(T/\u03c4), H/(T/\u03c4)^2} + L/\u221aT) (13)\n\nDelayed SGD, with updates x_t \u2190 x_{t-1} - \u03b7_t \u2207f(x_{t-\u03c4_t}; z), is a natural algorithm in this setting. Under the bounded delay assumption the best guarantee we are aware of for delayed update SGD is (see [11] improving over [1])\n\nO(H\u03c4^2/T + L/\u221aT) (14)\n\nThis result is significantly worse than the lower bound (13) and quite disappointing. It does not provide for a 1/T^2 accelerated optimization rate, but even worse, compared to non-accelerated SGD it suffers a slowdown quadratic in the delay, compared to the linear slowdown we would expect. In particular, the guarantee (14) only allows a maximum delay of \u03c4 = O(T^{1/4}) in order to attain the optimal statistical rate \u0398(L/\u221aT), whereas the lower bound allows a delay up to \u03c4 = O(T^{3/4}).\n\nThis raises the question of whether a different algorithm can match the lower bound (13). The answer is affirmative, but it requires using an \u201cunnatural\u201d algorithm, which simulates a mini-batch approach in what seems an unnecessarily wasteful way. We refer to this as a \u201cwait-and-collect\u201d approach: it works in T/(2\u03c4) stages, each stage consisting of 2\u03c4 iterations (i.e. nodes or oracle accesses). In stage i, \u03c4 iterations are used to obtain \u03c4 stochastic gradient estimates \u2207f(x_i; z_{2\u03c4 i + j}), j = 1, . . . , \u03c4 at the same point x_i. For the remaining \u03c4 iterations, we wait for all the preceding oracle computations to become available and do not even use our allowed oracle access. We can then finally update the x_{i+1} using the minibatch of \u03c4 gradient estimates. This approach is also specified formally as Algorithm 2 in Appendix E.2. 
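The wait-and-collect schedule can be simulated directly (a toy sketch under our own assumptions: fixed delay τ, a hypothetical quadratic objective, and plain rather than accelerated mini-batch updates):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, 0.5])

def wait_and_collect(T, tau, eta=0.2):
    # T oracle nodes, fixed delay tau: T/(2*tau) stages of 2*tau iterations.
    # The first tau iterations of a stage query the gradient at the current
    # point; the remaining tau iterations just wait for those answers to
    # arrive, querying nothing.
    x = np.zeros(2)
    for _ in range(T // (2 * tau)):
        grads = [x - (mu + rng.standard_normal(2)) for _ in range(tau)]
        # ...after tau idle iterations, all tau answers are available...
        x = x - eta * np.mean(grads, axis=0)
    return x

errs = [np.linalg.norm(wait_and_collect(T=4000, tau=20) - mu) for _ in range(10)]
print(np.mean(errs))  # converges despite the delay: one mini-batch step per stage
```

Half the oracle budget is deliberately discarded, which is exactly what makes the approach feel "unnatural" even though it matches the lower bound (13) when combined with acceleration.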
Using this approach, we can perform T/(2\u03c4) A-MB-SGD updates with a minibatch size of \u03c4, yielding a suboptimality guarantee that precisely matches the lower bound (13).\n\nThus (13) indeed represents the tight complexity of the delay graph with a stochastic gradient oracle, and the wait-and-collect approach is optimal. However, this answer is somewhat disappointing and leaves an intriguing open question: can a more natural, and seemingly more efficient (no wasted oracle accesses) delayed update SGD algorithm also match the lower bound? An answer to this question has two parts: first, does the delayed update SGD truly suffer from a \u03c4^2 slowdown as indicated by (14), or does it achieve linear degradation and a speculative guarantee of\n\nO(H\u03c4/T + L/\u221aT) (15)\n\nSecond, can delayed update SGD be accelerated to achieve the optimal rate (13)? We note that concurrent with our work there has been progress toward closing this gap: Arjevani et al. [4] showed an improved bound matching the non-accelerated (15) for delayed updates (with a fixed delay) on quadratic objectives. It still remains to generalize the result to smooth non-quadratic objectives, handle non-constant bounded delays, and accelerate the procedure so as to improve the rate to (\u03c4/T)^2.\n\n4.4 Intermittent communication\n\nWe now turn to a parallel computation model which is relevant especially when parallelizing across disparate machines: in each of T iterations, there are M machines that, instead of just a single oracle access, perform K sequential oracle accesses before broadcasting to all other machines synchronously. This communication pattern is relevant in the realistic scenario where local computation is plentiful relative to communication costs (i.e. K is large). 
This may be the case with fast processors distributed across different machines, or in the setting of federated learning, where mobile devices collaborate to train a shared model while keeping their respective training datasets local [18].\n\nThis is captured by a graph consisting of M parallel chains of length TK, with cross connections between the chains every K nodes. Indexing the nodes as v_{t,m,k}, the nodes v_{t,m,1} \u2192 \u00b7\u00b7\u00b7 \u2192 v_{t,m,K} form a chain, and v_{t,m,K} is connected to v_{t+1,m',1} for all m' = 1..M. This graph generalizes the layer graph by allowing K sequential oracle queries between each complete synchronization; K = 1 recovers the layer graph, and the depiction above corresponds to K = M = 3. We refer to the computation between each synchronization step as a (communication) round.\n\nThe depth of this graph is D = TK and the size is N = TKM. Focusing on the stochastic gradient oracle (the situation is similar for the prox oracle, except with the potential of smoothing a non-smooth objective, as discussed in Section 4.2), Theorem 1 yields the lower bound:\n\n\u03a9(min{L/\u221a(TK), H/(T^2 K^2)} + L/\u221a(TKM)) (16)\n\nA natural algorithm for this graph is parallel SGD, where we run an SGD chain on each machine and average iterates during communication rounds, e.g. [18]. The updates are then given by:\n\nx_{t,m,0} = (1/M) \u03a3_{m'} x_{t-1,m',K}, x_{t,m,k} = x_{t,m,k-1} - \u03b7_t \u2207f(x_{t,m,k-1}; z_{t,m,k}), k = 1, . . . , K (17)\n\n(note that x_{t,m,0} does not correspond to any node in the graph, and is included for convenience of presentation). Unfortunately, we are not aware of any satisfying analysis of such a parallel SGD approach. Instead, we consider two other algorithms in an attempt to match the lower bound (16).\n\nFirst, we can combine all KM oracle accesses between communication rounds in order to form a single mini-batch, giving up on the possibility of sequential computation along the \u201clocal\u201d K node sub-paths. 
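The parallel (local) SGD updates (17) can be sketched as follows (our own toy simulation on a hypothetical quadratic objective; none of the constants come from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.2, -0.7])

def parallel_sgd(T, K, M, eta=0.05):
    # M machines; in each of T rounds every machine takes K local SGD steps
    # and then all machines average their iterates, as in the update (17).
    xs = [np.zeros(2) for _ in range(M)]
    for _ in range(T):
        avg = np.mean(xs, axis=0)  # x_{t,m,0}: average of the previous round
        xs = []
        for _ in range(M):
            x = avg.copy()
            for _ in range(K):     # K sequential local steps between rounds
                g = x - (mu + rng.standard_normal(2))
                x = x - eta * g
            xs.append(x)
    return np.mean(xs, axis=0)

errs = [np.linalg.norm(parallel_sgd(T=20, K=30, M=8) - mu) for _ in range(10)]
print(np.mean(errs))  # small: local steps plus averaging approach the minimizer
```

The simulation runs fine, but, as the text notes, a satisfying worst-case analysis of this natural scheme is exactly what is missing.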
Using all KM nodes to obtain stochastic gradient estimates at the same point, we can perform T iterations of A-MB-SGD with a mini-batch size of KM, yielding an upper bound of\n\nO(H/T^2 + L/\u221a(TKM)) (18)\n\nThis is a reasonable and common approach, and it is optimal (up to constant factors) when KM = O((L^2/H^2) T^3) so that the statistical term is limiting. However, comparing (18) to the lower bound (16) we see a gap by a factor of K^2 in the optimization term, indicating the possibility for significant gains when K is large (i.e. when we can process a large number of examples on each machine at each round). Improving the optimization term by this K^2 factor would allow statistical optimality as long as M = O(T^3 K^3)\u2014this is a very significant difference. In many scenarios we would expect a modest number of machines, but the amount of data on each machine could easily be much more than the number of communication rounds, especially if communication is across a wide area network.\n\nIn fact, when K is large, a different approach is preferable: we can ignore all but a single chain and simply execute KT iterations of sequential SGD, offering an upper bound of\n\nO(L/\u221a(TK)) (19)\n\nAlthough this approach seems extremely wasteful, it actually yields a better guarantee than (18) when K \u2265 \u03a9(T^3 L^2/H^2). This is a realistic regime, e.g. in federated learning when computation is distributed across devices, communication is limited and sporadic and so only a relatively small number of rounds T are possible, but each device already possesses a large amount of data. 
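The choice between the two baselines is just a comparison of the guarantees (18) and (19); a small helper (our own illustration, with hypothetical parameter values) makes the two regimes explicit:

```python
import math

def amb_sgd_bound(L, H, T, K, M):
    # Guarantee (18): A-MB-SGD with mini-batch size K*M
    return H / T**2 + L / math.sqrt(T * K * M)

def seq_sgd_bound(L, T, K):
    # Guarantee (19): sequential SGD on a single chain of length K*T
    return L / math.sqrt(T * K)

def combined_bound(L, H, T, K, M):
    # The combined upper bound (20): the better of the two algorithms
    return min(seq_sgd_bound(L, T, K), amb_sgd_bound(L, H, T, K, M))

L, H, M, T = 1.0, 1.0, 16, 100
small_K, big_K = 10, 10**8  # big_K is far beyond the T^3 L^2 / H^2 threshold
print(combined_bound(L, H, T, small_K, M) == amb_sgd_bound(L, H, T, small_K, M))
print(combined_bound(L, H, T, big_K, M) == seq_sgd_bound(L, T, big_K))
```

For small K the mini-batch guarantee wins; once K is large enough, ignoring all but one chain is (perhaps surprisingly) the better of the two known guarantees.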
Furthermore, for non-smooth functions, (19) matches the lower bound (16).\n\nOur upper bound on the complexity is therefore obtained by selecting either A-MB-SGD or single-machine sequential SGD, yielding a combined upper bound of\n\nO(min{L/\u221a(TK), H/T^2 + L/\u221a(TKM)}) (20)\n\nFor smooth functions, there is still a significant gap between this upper bound and the lower bound (16). Furthermore, this upper bound is not achieved by a single algorithm, but rather a combination of two separate algorithms, covering two different regimes. This raises the question of whether there is a single, natural algorithm, perhaps an accelerated variant of the parallel SGD updates (17), that at the very least matches (20), and preferably also improves over them in the intermediate regime or even matches the lower bound (16).\n\nActive querying and SVRG All methods discussed so far used fully stochastic oracles, requesting a gradient (or prox computation) with respect to an independently and randomly drawn z \u223c P. We now turn to methods that also make active queries, i.e. draw samples from P and then repeatedly query the oracle, at different points x, but on the same samples z. Recall that all of our lower bounds are valid also in this setting.\n\nWith an active query gradient oracle, we can implement SVRG [14, 16] on an intermittent communication graph. More specifically, for an appropriate choice of n and \u03bb, we apply SVRG to the regularized empirical objective \u02c6F_\u03bb(x) = (1/n) \u03a3_{i=1}^n f(x; z_i) + (\u03bb/2)\u2016x\u2016^2\n\nAlgorithm 1 SVRG\nParameters: n, S, I, \u03b7\nSample z_1, . . . , z_n \u223c P, Initialize x_0 = 0\nfor s = 1, 2, . . . , S = \u230aT/(\u2308n/(KM)\u2309 + \u2308I/K\u2309)\u230b do\n  \u02dcx = x_{s-1}, x^0_s = \u02dcx\n  \u02dcg = \u2207\u02c6F_\u03bb(\u02dcx) = (1/n) \u03a3_{i=1}^n \u2207f(\u02dcx; z_i) + \u03bb\u02dcx (\u21e4)\n  for i = 1, 2, . . . , I do\n    Sample j \u223c Uniform{1, . . . , n}\n    x^i_s = x^{i-1}_s - \u03b7(\u2207f(x^{i-1}_s; z_j) + \u03bbx^{i-1}_s - (\u2207f(\u02dcx; z_j) + \u03bb\u02dcx) + \u02dcg) (\u21e4\u21e4)\n  end for\n  x_s = x^i_s for i \u223c Uniform{1, . . . 
, I}\n\nend for\nxs = xi\n\ns\n\ns\n\nend for\nReturn xS\n\nInitialize x0 = 0\n\n (rf (\u02dcx; zj) + \u02dcx) + \u02dcg\n\n(\u21e4)\n\n(\u21e4\u21e4)\n\n(21)\n\nTo do so, we \ufb01rst pick a sample {z1, . . . zn} (without actually querying the oracle). As indicated\nby Algorithm 1, we then alternate between computing full gradients on {z1, . . . zn} in parallel (\u21e4),\nand sequential variance-reduced stochastic gradient updates in between (\u21e4\u21e4). The full gradient \u02dcg is\ncomputed using n active queries to the gradient oracle. Since all of these oracle accesses are made\nat the same point \u02dcx, this can be fully parallelized across the M parallel chains of length K thus\nrequiring n/KM rounds. The sequential variance-reduced stochastic gradient updates cannot be\nparallelized in this way, and must be performed using queries to the gradient oracle in just one of\nthe M available parallel chains, requiring I/K rounds of synchronization. Consequently, each outer\n\nK\u2325 rounds. We analyze this method using =\u21e5 \u21e3 Lpn\u2318,\niteration of SVRG requires\u2303 n\nlog(M KT /L)\u2318o. Using the\nI =\u21e5 H\nanalysis of Johnson and Zhang [14], SVRG guarantees that, with an appropriate stepsize, we have\n\u02c6F(xS) minx \u02c6F(x) \uf8ff 2S; the value of xS on the empirical objective also generalizes to the\npopulation, so E [f (xS; z)] minx E [f (x; z)] \uf8ff 2S + O\u21e3 Lpn\u2318 (see [23]). With our choice of\nparameters, this implies upper bound (see Appendix E.3)\n\nH 2 log2(M KT /L)\u2318 , \u21e5\u21e3 M KT\n\nL \u2318, and n = minn\u21e5\u21e3\n\n =\u21e5 \u21e3 Hpn\n\nKM\u2325 +\u2303 I\n\nK2T 2L2\n\nO\u2713\u2713 H\n\nT K\n\n+\n\nL\n\npT KM\u25c6 log\u2713 T KM\n\nL \u25c6\u25c6 .\n\n8\n\n\fThese guarantees improve over sequential SGD (17) as soon as M > log2(T KM/L) and K >\nH 2/L2, i.e. L/pT K < L2/H. 
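To make the structure of Algorithm 1 concrete, here is a minimal sequential Python sketch of the same alternation on a toy regularized least-squares instance. The parallel execution of the (*) step across the M chains is not simulated, and the values of $\eta$, $\lambda$, $S$, and $I$ here are toy choices for the example, not the theoretical settings above:

```python
import numpy as np

def svrg(grad_f, zs, lam, eta, S, I, d, rng):
    """Sketch of Algorithm 1: SVRG on the regularized empirical objective.

    grad_f(x, z) returns the gradient of f(.; z) at x.  The (*) step below
    (the full gradient at the anchor point) is what would be parallelized
    across the M chains; the (**) steps are the sequential variance-reduced
    updates run on a single chain.
    """
    n = len(zs)
    x = np.zeros(d)
    for _ in range(S):
        x_tilde = x.copy()
        # (*) full gradient of \hat F at x_tilde, using n active queries
        g_tilde = np.mean([grad_f(x_tilde, z) for z in zs], axis=0) + lam * x_tilde
        iterates = []
        xi = x_tilde.copy()
        for _ in range(I):
            j = rng.integers(n)
            # (**) variance-reduced stochastic gradient step
            g = (grad_f(xi, zs[j]) + lam * xi) \
                - (grad_f(x_tilde, zs[j]) + lam * x_tilde) + g_tilde
            xi = xi - eta * g
            iterates.append(xi)
        x = iterates[rng.integers(I)]  # a uniformly random inner iterate
    return x

# Toy instance: f(x; z) = 0.5 * (a @ x - b)^2 with z = (a, b)
rng = np.random.default_rng(0)
d, n = 5, 200
x_star = rng.normal(size=d)
zs = [(a, float(a @ x_star)) for a in rng.normal(size=(n, d))]
grad_f = lambda x, z: (z[0] @ x - z[1]) * z[0]
x_hat = svrg(grad_f, zs, lam=1e-3, eta=0.005, S=20, I=1000, d=d, rng=rng)
```

On this noiseless toy problem the returned `x_hat` lands close to `x_star`; in the paper's accounting, each outer loop costs $\lceil n/KM \rceil$ rounds for the full gradient plus $\lceil I/K \rceil$ rounds for the inner updates.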
This is a very wide regime: we require only a moderate number of machines, and the second condition will typically hold for a smooth loss. Intuitively, SVRG does roughly the same number (up to a factor of two) of sequential updates as in the sequential SGD approach, but it uses better, variance-reduced updates. The price we pay is in the smaller total sample size, since we keep calling the oracle on the same samples. Nevertheless, since SVRG only needs to calculate the "batch" gradient a logarithmic number of times, this incurs only an additional logarithmic factor.

Comparing (18) and (21), we see that SVRG also improves over A-MB-SGD as soon as $K > T\log(TKM/L)$, that is, if the number of points we are processing on each machine each round is slightly more than the total number of rounds, which is also a realistic scenario.

To summarize, the best known upper bound for optimizing with intermittent communication using a pure stochastic oracle is (20), which combines two different algorithms. However, with active oracle accesses, SVRG is also possible and the upper bound becomes:

$$O\left(\min\left\{\frac{L}{\sqrt{TK}},\ \left(\frac{H}{TK} + \frac{L}{\sqrt{TKM}}\right)\log\left(\frac{TKM}{L}\right),\ \frac{H}{T^2} + \frac{L}{\sqrt{TKM}}\right\}\right) \tag{22}$$

5 Summary

Our main contributions in this paper are: (1) presenting a precise formal oracle framework for studying parallel stochastic optimization; (2) establishing tight oracle lower bounds in this framework that can then be easily applied to particular instances of parallel optimization; and (3) using the framework to study specific settings, obtaining optimality guarantees, understanding where additional assumptions would be needed to break barriers, and, perhaps most importantly, identifying gaps in our understanding that highlight possibilities for algorithmic improvement.
Specifically,

• For non-smooth objectives and a stochastic prox oracle, smoothing and acceleration can improve performance in the layer graph setting. It is not clear if there is a more direct algorithm with the same optimal performance, e.g. averaging the answers from the prox oracle.

• In the delay graph setting, delayed update SGD's guarantee is not optimal. We suggest an alternative optimal algorithm, but it would be interesting and beneficial to understand the true behavior of delayed update SGD and to improve it as necessary to attain optimality.

• With intermittent communication, we show how different methods are better in different regimes, but even combining these methods does not match our lower bound. This raises the question of whether our lower bound is achievable. Are current methods optimal? Is the true optimal complexity somewhere in between? Even finding a single method that matches the current best performance in all regimes would be a significant advance here.

• With intermittent communication, active queries allow us to obtain better performance in a certain regime. Can we match this performance using pure stochastic queries, or is there a real gap between active and pure stochastic queries?

The investigation into optimizing over $F_{L,H,B}$ in our framework indicates that there is no advantage to the prox oracle for optimizing (sufficiently) smooth functions. This raises the question of what additional assumptions might allow us to leverage the prox oracle, which is intuitively much stronger as it allows global access to $f(\cdot; z)$. One option is to assume a bound on the variance of the stochastic oracle, i.e. $\mathbb{E}_z[\|\nabla f(x; z) - \nabla F(x)\|^2] \le \sigma^2$, which captures the notion that the functions $f(\cdot; z)$ are somehow related and not arbitrarily different.
In particular, if each stochastic oracle access, in each node, is based on a sample of b data points (thus, a prox operation optimizes a sub-problem of size b), we have that $\sigma^2 \le L^2/b$. Initial investigation into the complexity of optimizing over the restricted class $F_{L,H,B,\sigma^2}$ (where we also require the above variance bound) reveals a significant theoretical advantage for the prox oracle over the gradient oracle, even for smooth functions. This is an example of how formalizing the optimization problem gives insight into additional assumptions, in this case low variance, that are necessary for realizing the benefits of a stronger oracle.

Acknowledgements

We would like to thank Ohad Shamir for helpful discussions. This work was partially funded by NSF-BSF award 1718970 ("Convex and Non-Convex Distributed Learning") and a Google Research Award. BW is supported by the NSF Graduate Research Fellowship under award 1754881. AS was supported by NSF awards IIS-1447700 and AF-1763786, as well as a Sloan Foundation research award.

References

[1] Alekh Agarwal and John C Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873–881, 2011.

[2] Naman Agarwal and Elad Hazan. Lower bounds for higher-order convex optimization. arXiv preprint arXiv:1710.10329, 2017.

[3] Yossi Arjevani and Ohad Shamir. Communication complexity of distributed convex learning and optimization. In Advances in Neural Information Processing Systems, pages 1756–1764, 2015.

[4] Yossi Arjevani, Ohad Shamir, and Nathan Srebro. A tight convergence analysis for stochastic gradient descent with delayed updates. 2018.

[5] Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 2011. Springer, 2017.

[6] Dimitri P Bertsekas.
Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.

[7] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020. ACM, 2016.

[8] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. arXiv preprint arXiv:1710.11606, 2017.

[9] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In Advances in Neural Information Processing Systems, pages 1647–1655, 2011.

[10] John Duchi, Feng Ruan, and Chulhee Yun. Minimax bounds on stochastic batched convex optimization. In Proceedings of the 31st Conference On Learning Theory, pages 3065–3162, 2018.

[11] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE Transactions on Automatic Control, 61(12):3740–3754, 2016.

[12] Ankit Garg, Tengyu Ma, and Huy Nguyen. On communication cost of distributed statistical estimation and dimensionality. In Advances in Neural Information Processing Systems, pages 2726–2734, 2014.

[13] Cristóbal Guzmán and Arkadi Nemirovski. On lower complexity bounds for large-scale smooth convex optimization. Journal of Complexity, 31(1):1–14, 2015.

[14] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

[15] Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.

[16] Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang.
Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. The Journal of Machine Learning Research, 18(1):4404–4446, 2017.

[17] Brendan McMahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems, pages 2915–2923, 2014.

[18] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 2017.

[19] A Nedić, Dimitri P Bertsekas, and Vivek S Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8(C):381–407, 2001.

[20] Arkadi Nemirovski. On parallel complexity of nonsmooth convex optimization. Journal of Complexity, 10(4):451–463, 1994.

[21] Arkadii Nemirovski, David Borisovich Yudin, and Edgar Ronald Dawson. Problem complexity and method efficiency in optimization. 1983.

[22] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). 1983.

[23] Shai Shalev-Shwartz and Nathan Srebro. SVM optimization: inverse dependence on training set size. In International Conference on Machine Learning, pages 928–935, 2008.

[24] Eric V Slud et al. Distribution inequalities for the binomial law. The Annals of Probability, 5(3):404–412, 1977.

[25] Suvrit Sra, Adams Wei Yu, Mu Li, and Alex Smola. AdaDelay: Delay adaptive distributed stochastic optimization. In Artificial Intelligence and Statistics, pages 957–965, 2016.

[26] Jialei Wang, Weiran Wang, and Nathan Srebro. Memory and communication efficient distributed stochastic optimization with minibatch prox. In Conference on Learning Theory, 2017.

[27] Blake Woodworth and Nathan Srebro.
Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594, 2017.

[28] Blake Woodworth and Nati Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pages 3639–3647, 2016.

[29] Yuchen Zhang, John Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pages 2328–2336, 2013.