{"title": "Think out of the \"Box\": Generically-Constrained Asynchronous Composite Optimization and Hedging", "book": "Advances in Neural Information Processing Systems", "page_first": 12246, "page_last": 12256, "abstract": "We present two new algorithms, ASYNCADA and HEDGEHOG, for asynchronous sparse online and stochastic optimization. ASYNCADA is, to our knowledge, the first asynchronous stochastic optimization algorithm with finite-time data-dependent convergence guarantees for generic convex constraints. In addition, ASYNCADA: (a) allows for proximal (i.e., composite-objective) updates and adaptive step-sizes; (b) enjoys any-time convergence guarantees without requiring an exact global clock; and (c) when the data is sufficiently sparse, its convergence rate for (non-)smooth, (non-)strongly-convex, and even a limited class of non-convex objectives matches the corresponding serial rate, implying a theoretical \u201clinear speed-up\u201d. The second algorithm, HEDGEHOG, is an asynchronous parallel version of the Exponentiated Gradient (EG) algorithm for optimization over the probability simplex (a.k.a. Hedge in online learning), and, to our knowledge, the first asynchronous algorithm enjoying linear speed-ups under sparsity with non-SGD-style updates. Unlike previous work, ASYNCADA and HEDGEHOG and their convergence and speed-up analyses are not limited to individual coordinate-wise (i.e., \u201cbox-shaped\u201d) constraints or smooth and strongly-convex objectives. 
Underlying both results is a generic analysis framework that is of independent\ninterest, and further applicable to distributed and delayed feedback optimization", "full_text": "Think out of the \u201cBox\u201d: Generically-Constrained\n\nAsynchronous Composite Optimization and Hedging\n\nPooria Joulani\u21e4\nDeepMind, UK\n\npjoulani@google.com\n\nAndr\u00e1s Gy\u00f6rgy\nDeepMind, UK\n\nagyorgy@google.com\n\nCsaba Szepesv\u00e1ri\nDeepMind, UK\n\nszepi@google.com\n\nAbstract\n\nWe present two new algorithms, ASYNCADA and HEDGEHOG, for asynchronous\nsparse online and stochastic optimization. ASYNCADA is, to our knowledge,\nthe \ufb01rst asynchronous stochastic optimization algorithm with \ufb01nite-time data-\ndependent convergence guarantees for generic convex constraints. In addition,\nASYNCADA: (a) allows for proximal (i.e., composite-objective) updates and\nadaptive step-sizes; (b) enjoys any-time convergence guarantees without requiring\nan exact global clock; and (c) when the data is suf\ufb01ciently sparse, its convergence\nrate for (non-)smooth, (non-)strongly-convex, and even a limited class of non-\nconvex objectives matches the corresponding serial rate, implying a theoretical\n\u201clinear speed-up\u201d. The second algorithm, HEDGEHOG, is an asynchronous parallel\nversion of the Exponentiated Gradient (EG) algorithm for optimization over the\nprobability simplex (a.k.a. Hedge in online learning), and, to our knowledge, the\n\ufb01rst asynchronous algorithm enjoying linear speed-ups under sparsity with non-\nSGD-style updates. 
Unlike previous work, ASYNCADA and HEDGEHOG and their convergence and speed-up analyses are not limited to individual coordinate-wise (i.e., "box-shaped") constraints or smooth and strongly-convex objectives. Underlying both results is a generic analysis framework that is of independent interest, and further applicable to distributed and delayed-feedback optimization.

1 Introduction

Many modern machine learning methods are based on iteratively optimizing a regularized objective. Given a convex, non-empty set of feasible model parameters X ⊆ R^d, a differentiable loss function f : R^d → R, and a convex (possibly non-differentiable) regularizer function ψ : R^d → R, these methods seek the parameter vector x* ∈ X that minimizes f + ψ (assuming a minimizer exists):

x* = arg min_{x ∈ X} f(x) + ψ(x).   (1)

In particular, empirical risk minimization (ERM) methods such as (regularized) least-squares, logistic regression, LASSO, and support vector machines solve optimization problems of the form (1). In these cases, f(x) = (1/m) Σ_{i=1}^{m} F(x, ξ_i) is the average of the loss F(x, ξ_i) of the model parameter x on the given training data ξ_1, ξ_2, …, ξ_m, and ψ(x) is a norm (or a combination of norms) on R^d (e.g., F(x, ξ) = log(1 + exp(x^⊤ξ)) and ψ(x) = (1/2)‖x‖²_2 in linear logistic regression [13]).

To bring the power of modern parallel computing architectures to such optimization problems, several papers in the past decade have studied parallel variants of the stochastic optimization algorithms applied to these problems. Here one of the main questions is to quantify the cost of parallelization, that is, how much extra work is needed by a parallel algorithm to achieve the same accuracy as its serial variant. Ideally, a parallel algorithm is required to do no more work than the serial version, but this is very hard to achieve in our case. Instead, a somewhat weaker goal is to ensure that the price of parallelism is at most a constant factor: that is, the parallel variant needs at most constant-times more updates (or work). In other words, using τ parallel processes requires a wall-clock running time that is only O(1/τ)-times that of the serial variant. In this case we say that the parallel algorithm achieves a linear speed-up. Of particular interest are asynchronous lock-free algorithms, where Recht et al. [30] first demonstrated that linear speed-ups are possible: they showed that if τ processes run stochastic gradient descent (SGD) and apply their updates to the same shared iterate without locking, then the overall algorithm (called Hogwild!) converges after the same amount of work as serial SGD, up to a multiplicative factor that increases with the number of concurrent processes and decreases with the sparsity of the problem. Thus, if the problem is sparse enough, this penalty can be considered a constant, and the algorithm achieves a linear speed-up. Several follow-up works (see e.g., [20, 18, 17, 27, 24, 10, 29, 7, 11, 4, 2, 3, 19, 31, 33, 32, 35, 36, 12, 6, 28] and the references therein) have demonstrated linear speed-ups for methods based on (block-)coordinate descent (BCD), as well as other variants of SGD such as SVRG [15], SAGA [8], ADAGRAD [22, 9], and SGD with a time-decaying step-size.

⁎Work partially done when the author was at the University of Alberta, Edmonton, AB, Canada.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Despite the great advances, however, several problems remain open.2\nFirst, the existing convergence guarantees concern SGD when the constraint set X is box-shaped,\nthat is, a Cartesian product of (block-)coordinatewise constraints X = \u21e5d\ni=1Xi. This leaves it unclear\nwhether existing techniques apply to stochastic optimization algorithms that operate on non-box-\nshaped constraints (e.g., on the `2 ball), or algorithms that use a non-Euclidean regularizer, such as\nthe exponentiated gradient (EG) algorithm used on the probability simplex (see, e.g., [34, 14]).\nSecond, with the exception of the works of Duchi et al. [10] and Pan et al. [26] (which still require\nbox-shaped constraints), and De Sa et al. [7] (which only bounds the probability of \u201cfailure\u201d, i.e., of\nproducing no iterates in the \u270f-ball around x\u21e4), the existing analyses demonstrating linear speed-ups\nare limited to strongly-convex (or Polyak-\u0141ojasiewicz) objectives. Thus, so far it has remained\nunclear whether a similar speed-up analysis is possible if the objective is simply convex or smooth\n[20], or if we are in the closely-related online-learning setting with the objective changing over time.\nThird, with the exception of the work of Pedregosa et al. [27] (which still requires box-shaped\nconstraints, block-separable and strongly-convex f), the existing analyses do not take advantage of\nthe structure of problem (1). In particular, when is \u201csimple to optimize\u201d over X (formally de\ufb01ned\nas having access to a proximal operator oracle, as we make precise in what follows), serial algorithms\nsuch as Proximal-SGD take advantage of this property to achieve considerably faster convergence\nrates. 
Asynchronous variants of the Proximal-SGD algorithm with such faster rates have so far been\nunavailable for non-strongly-convex objectives and non-box constraints.\n\n1.1 Contributions\nIn this paper we address the aforementioned problems and present algorithms that are applicable\nto general convex constraint sets, not just box-shaped X , but still achieve linear speed-ups (under\nsparsity) for non-smooth and non-strongly-convex (as well as smooth or strongly convex) objectives,\nand even for a speci\ufb01c class of non-convex problems. This is achieved through our new asynchronous\noptimization algorithm, ASYNCADA, which generalizes the ASYNC-ADAGRAD (and ASYNC-DA)\nalgorithm of Duchi et al. [10] to proximal updates and its data-dependent bound to arbitrary constraint\nsets. Instantiations of ASYNCADA under different settings are given in Table 1. Indeed, the results\nare obtained by a more general analysis framework, built on the work of Duchi et al. [10], that yields\ndata-dependent convergence guarantees for a generic class of adaptive, composite-objective online\noptimization algorithms undergoing perturbations to their \u201cstate\u201d. We further use this framework to\nderive the \ufb01rst asynchronous online and stochastic optimization algorithm with non-box constraints\nthat uses non-Euclidean regularizers. In particular, we present and analyze HEDGEHOG, the parallel\nasynchronous variant of the EG algorithm, also known as Hedge in online linear optimization [34, 14],\n2 In this paper, we do not further consider BCD-based methods, for two main reasons: a) in general, a\nBCD update may unnecessarily slow down the convergence of the algorithm by focusing only on a single\ncoordinate of the gradient information, especially in the sparse-data problems we consider in this paper (see,\ne.g., Pedregosa et al. [27, Appendix F]); and b) BCD algorithms typically apply only to box-shaped constraints,\nwhich is what our algorithms are designed to be able to avoid. 
We would like to note, however, that our\nstochastic gradient oracle set-up (Section 2) does allow for building an unbiased gradient estimate using only\none randomly-selected (block-)coordinate, as done in BCD methods. Nevertheless, the literature on parallel\nasynchronous BCD algorithms is vast, including especially algorithms for proximal, non-strongly-convex, and\nnon-convex optimization; see, e.g., [29, 11, 4, 2, 3, 19, 31, 33, 32, 35, 36, 12, 6, 28] and the references therein.\n\n2\n\n\f[10, 26] X\n\n[26] X\n\nStrongly-convex\n\nSmooth f + Strongly-convex\n\n[26] X\n[26]\nX\n\n[26] X\n\nNonsmooth\n[10, 26] X\n[10, 26]\n\nSmooth f\n[26] X\n[26]\nX\n\n[30, 7, 20, 17, 24, 26] X\n[30, 7, 20, 17, 24, 26]\n\nAlgorithm\nX\nSGD (DA) Rd\nSGD (MD) \u21e4\nDA\n\n\u21e4\nAG / DA\nAG / DA\n\n\u21e4\nProx-MD\nProx-DA\n\nProx-AG\n\nHedge/EG 4\nTable 1: (Star-)convex optimization settings under which suf\ufb01cient sparsity results in linear speed-up.\nPrevious work are cited under the settings they address. A X indicates a setting covered by the results\nin this paper. The symbols \u21e4, 4, and indicate, respectively, the case when the constraint set is\nbox-shaped, the probability simplex, or any convex constraint set with a projection oracle. AG, DA,\nand MD stand, respectively, for ADAGRAD, Dual-Averaging, and Mirror Descent, while Prox-AG,\nProx-DA, and Prox-MD denote their proximal variants (using the proximal operator of ).\n\nX\n\nX\n-\nX\nX\nX\n\nX\n\n[26] X\n\nX\n[27]\nX\nX\nX\n\nX\n-\nX\nX\nX\n\nX\n-\nX\nX\nX\n\nand show that it enjoys similar parallel speed-up regimes as ASYNCADA. 
The results are derived for\nthe more general setting of noisy online optimization, and the generic framework is of independent\ninterest, in particular in the related settings of distributed and delayed-feedback learning.\nThe rest of the paper is organized as follows: The optimization problem and its solution with serial\nalgorithms are described in Section 2 and Section 3, respectively. The generic perturbed-iterate\nframework is given in Section 4. Our main algorithms, ASYNCADA and HEDGEHOG are presented\nand analyzed in Section 5 and Section 6, respectively. Conclusions are drawn and some open problems\nare discussed in Section 7, while omitted technical details are given in the appendices.\n\n\u21b5 = 1\n\nj=1 \u21b5(j)x(j)2, and k\u00b7k \u21b5,\u21e4 its dual. We use (at)j\n2Pd\n\n1.2 Notation and de\ufb01nitions\nWe use [n] to denote the set {1, 2, . . . , n}, I{E} for the indicator of an event E, and (H) to denote\nthe sigma-\ufb01eld generated by a set H of random variables. The j-th coordinate of a vector a 2 Rd\nis denoted a(j). For \u21b5 2 Rd with positive entries, k\u00b7k \u21b5 denotes the \u21b5-weighted Euclidean norm,\ngiven by kxk2\nt=i to denote a sequence\nai, ai+1, . . . , aj and de\ufb01ne ai:j :=Pj\nt=i at, with ai:j := 0 if i > j. Given a differentiable function\nh : Rd ! R, the Bregman divergence of y 2 Rd from x 2 Rd with respect to (w.r.t.) h is given by\nBh(y, x) := h(y) h(x) hrh(x), y xi. It can be shown that a differentiable function is convex\nif and only if Bh(x, y) 0 for all x, y 2 Rd. The function h : Rd ! R is \u00b5-strongly convex w.r.t. a\nnorm k\u00b7k on Rd if and only if for all x, y 2 Rd Bh(x, y) \u00b5\n2kx yk2, and smooth w.r.t. a norm k\u00b7k\nif and only if for all x, y 2 Rd, |Bh(x, y)|\uf8ff 1\n2kx yk2. 
A differentiable function f is star-convex if and only if there exists a global minimizer x* of f such that for all x ∈ R^d, B_f(x*, x) ≥ 0.

2 Problem setting: noisy online optimization

We consider a generic iterative optimization setting that enables us to study both online learning and stochastic composite optimization. The problem is defined by a (known) constraint set X and a (known) convex (possibly non-differentiable) function ψ, as well as differentiable functions f_1, f_2, … about which an algorithm learns iteratively. At each iteration t = 1, 2, …, the algorithm picks an iterate x_t ∈ X, and observes an unbiased estimate g_t ∈ R^d of the gradient ∇f_t(x_t), that is, E{g_t | x_t} = ∇f_t(x_t). The goal is to minimize the composite-objective online regret after T iterations, given by

R_T^{(f+ψ)} = Σ_{t=1}^{T} ( f_t(x_t) + ψ(x_t) − f_t(x*_T) − ψ(x*_T) ),

where x*_T = arg min_{x ∈ X} { Σ_{t=1}^{T} ( f_t(x) + ψ(x) ) }. In the absence of noise (i.e., when g_t = ∇f_t(x_t)), this reduces to the (composite-objective) online (convex) optimization setting [34, 14].

Stochastic optimization, online regret, and iterate averaging. If f_t = f for all t = 1, 2, …, we recover the stochastic optimization setting, with the algorithm aiming to minimize the composite objective f + ψ over X while receiving noisy estimates of ∇f at the points (x_t)_{t=1}^{T}. The algorithm's online regret can then be used to control the optimization risk: since f_t ≡ f, we have x*_T = x* = arg min_{x ∈ X} { f(x) + ψ(x) }, and by Jensen's inequality, if f is convex and x̄_T = (1/T) x_{1:T} is the average iterate,

f(x̄_T) + ψ(x̄_T) − f(x*) − ψ(x*) ≤ (1/T) R_T^{(f+ψ)}.

In addition, if f is non-convex but x̄_T is selected uniformly at random from x_1, …, x_T, then the above bound holds in expectation.
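As a concrete illustration of the regret-to-risk conversion above, the following sketch checks Jensen's inequality numerically on a toy one-dimensional instance; the quadratic f, the iterate sequence, and the choice of a zero regularizer are all illustrative assumptions, not from the paper.

```python
import numpy as np

# Toy instance: f(x) = 0.5 * (x - 3)^2 (convex), regularizer = 0, X = R.
f = lambda x: 0.5 * (x - 3.0) ** 2
x_star = 3.0  # the minimizer of f

# Any sequence of iterates an algorithm might produce.
xs = np.array([0.0, 1.0, 2.5, 2.9, 3.2])
T = len(xs)

# Online regret against x*: sum_t f(x_t) - f(x*).
regret = np.sum(f(xs) - f(x_star))

# Average iterate; by Jensen's inequality, for convex f:
#   f(mean(xs)) - f(x*) <= regret / T.
x_bar = xs.mean()
assert f(x_bar) - f(x_star) <= regret / T + 1e-12
```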
As such, in the rest of the paper we study the optimization risk\nthrough the lens of online regret.\n\nStochastic \ufb01rst-order oracle. Throughout the paper, we assume that at time t, the noisy gradient\nestimate gt is given by a randomized \ufb01rst-order oracle3 gt : Rd \u21e5 \u2305 ! Rd, where \u2305 is some space\nof random variables, and there exists a sequence (\u21e0t)T\nt=1 of independent elements from \u2305, with\ndistribution P\u2305, such thatR\u2305 gt(x, \u21e0)dP\u2305(\u21e0) = rft(x) for all x 2X .\nFor example, in the \ufb01nite-sum stochastic optimization case when f = PN\ni fi, selecting one fi\nuniformly at random to estimate the gradient corresponds to P\u2305 being the uniform distribution on \u2305=\n{1, 2, . . . , N} and gt(x, \u21e0t) = rf\u21e0t(x), whereas selecting a mini-batch of fi\u2019s corresponds to \u2305 being\n|\u21e0t|Pi2\u21e0t rfi(x).\nthe set of subsets (of a \ufb01xed or varying size) of {1, 2, . . . , N} and gt(x, \u21e0t) = 1\nThis also covers variance-reduced gradient estimates as formed, e.g., by SAGA and SVRG, in which\ncase gt is built using information from the previous rounds.4\n\n3 Preliminaries: analysis in the serial setting\n\nFirst, we recall the analysis of a generic serial dual-averaging algorithm, known as Adaptive Follow-\nthe-Regularized-Leader (ADA-FTRL) [21, 25, 16], that generalizes regularized dual-averaging [37]\nand captures the dual-averaging variants of SGD, Ada-Grad, Proximal-SGD and EG as special case.\n\nSerial ADA-FTRL. The serial ADA-FTRL algorithm uses a sequence of regularizer functions\nr0, r1, r2, . . . . At time t = 1, 2, . . . , given the previous feedback gs 2 Rd, s 2 [t 1], ADA-FTRL\nselects the next point xt such that\n(2)\n\nxt 2 arg min\n\nx2X hzt1, xi + t(x) + r0:t1(x) ,\n\nwhere zt1 = g1:t1 is the sum of the past feedback. 
We refer to (zt, t, r0:t) as the state of the\nalgorithm at time t, noting that apart from tie-breaking in (2), this state determines xt.\nIt is straightforward to verify that with = 0,X = Rd, and r0:t1 = \u2318\n2k\u00b7k 2 for some \u2318> 0, we get\nthe SGD update xt = 1\n, i 2 [d] are positive\nstep-sizes (possibly adaptively tuned [22, 9]), ADA-FTRL reduces to xt = prox(t,zt1,\u2318 t),\nwhere prox is the generalized proximal operator oracle5 over X that, given a function and vectors\nz and \u2318, returns6\n\n\u2318 g1:t1. In addition, using r0:t1 = 1\n\n\u2318t where \u2318(i)\n\n2k\u00b7k 2\n\nt\n\n(3)\n\nprox( , z, \u2318 ) := arg min\n\nx2X\n\n (x) +\n\n1\n\n2x \u23181 z2\n\n\u2318 .\n\n3With a slight abuse of notation, gt(x, \u21e0) (with arguments x, \u21e0) is from now on used to denote the oracle at\n\ntime t evaluated at x, \u21e0, where as gt (without arguments) denotes the observed noisy gradient gt(xt,\u21e0 t).\n4Note that in this case \u21e0t remains an independent sequence, even though gt changes with the history.\n5 Serial proximal DA [37] and ADA-FTRL call prox with t, whereas the conventional Proximal-SGD\nalgorithm (based on Mirror-Descent) invokes the proximal operator with irrespective of the iteration; see\nthe paper of Xiao [37, Sections 5 and 6] for a detailed discussion of this phenomenon.\n\n6Here \u23181 denotes the elementwise inverse of \u2318 and denotes elementwise multiplication.\n\n4\n\n\fWhen \u2318 is the same for all coordinates (in which case we simply treat it as a scalar), this reduces\nto prox( , z, \u2318 ) = arg minx2X (x) + \u2318\n2kx z/\u2318k2, which is the standard proximal operator;\nthe generalized version (3) makes it possible to use coordinatewise step-sizes as in ADAGRAD\n[22, 9]. 
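To make the proximal-operator oracle concrete: for the common special case ψ(x) = λ‖x‖₁ with X = R^d (λ, the vectors below, and all names are illustrative choices, not from the paper), the generalized prox (3) decomposes coordinatewise into the soft-thresholding operator. A minimal sketch:

```python
import numpy as np

def prox_l1(lmbda, z, eta):
    """Generalized proximal operator (3) for psi(x) = lmbda * ||x||_1 and
    X = R^d: argmin_x  psi(x) + 0.5 * ||x - z/eta||^2_eta, where eta is a
    vector of coordinatewise step-sizes. Solving each coordinate separately
    gives soft-thresholding: x_j = sign(z_j) * max(|z_j| - lmbda, 0) / eta_j."""
    return np.sign(z) * np.maximum(np.abs(z) - lmbda, 0.0) / eta

# One dual-averaging-style step: z plays the role of the prox argument
# (in ADA-FTRL, minus the running gradient sum), and a vector-valued eta
# allows ADAGRAD-style coordinatewise step-sizes.
z = np.array([3.0, -0.5, 0.2])
eta = np.array([2.0, 2.0, 4.0])
x_t = prox_l1(1.0, z, eta)  # coordinates with |z_j| <= lmbda are zeroed
```

The l1 case is only one instance; any ψ with an efficiently computable prox (e.g., group norms) fits the same oracle interface.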
Finally, when = 0 and X is the probability simplex, ADA-FTRL with the negen-\ntropy regularizer r0:t1(x) = r0(x) = \u2318Pd\ni=1 xi log(xi) for some \u2318> 0, recovers the update\nx(i)\nt = Ct exp(z(i)\nt1/\u2318) is the constant\nnormalizing xt to lie in X . Other choices of rt recover algorithms such as the p-norm update; we refer\nto Shalev-Shwartz [34], Hazan [14], McMahan [21], and Orabona et al. [25] for further examples.\n\nt1/\u2318) of the EG algorithm, where Ct = 1/Pj=1 exp(z(j)\n\nAnalysis of ADA-FTRL ADA-FTRL and its special cases have been extensively studied in the\nliterature [5, 34, 14, 21, 25, 16]. In particular, it has been shown that under speci\ufb01c conditions on rt\nand , which we discuss in detail in Appendix F, ADA-FTRL enjoys the following bound on the\nlinearized regret [25, 16]:\nTheorem 1 (Regret of ADA-FTRL). For any x\u21e4 2X and any sequence of vectors (gt)T\nt=1 in Rd,\nusing any sequence of regularizers r0, r1, . . . , rT that are admissible w.r.t. a sequence of norms\nk\u00b7k (t) (see De\ufb01nition 2 in Appendix F), the iterates (xt)T\n(hgt, xt x\u21e4i + (xt) (x\u21e4)) \uf8ff r0:T (x\u21e4) \n\nt=1 generated by ADA-FTRL satisfy\n\n1\n2kgtk2\n\nrt(xt+1) +\n\n(t,\u21e4) .\n\n(4)\n\nTXt=0\n\nTXt=1\n\nTXt=1\n\nImportantly, this bound holds for any feedback sequence gt irrespective of the way it is generated,\nand serves as a solid basis to derive bounds under different assumptions on f, , and rt [25, 16].\n\n4 Relaxing the serial analysis: algorithms with perturbed state\nIn this section, we show that Theorem 1 can be used to analyze ADA-FTRL when its state undergoes\nspeci\ufb01c perturbations. This relaxation of the generic serial analysis framework underlies our analysis\nof parallel asynchronous algorithms, since parallel algorithms like ASYNCADA and HEDGEHOG\ncan be viewed as serial ADA-FTRL algorithms with perturbed states, as we show in Sections 5 and 6.\nPerturbed ADA-FTRL. 
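The negentropy instantiation of ADA-FTRL described above can be sketched in a few lines, assuming the convention that z accumulates (linear) losses, so that coordinates with smaller cumulative loss get exponentially larger weight; the sign convention and names here are assumptions of this sketch.

```python
import numpy as np

def eg_iterate(z, eta):
    """EG / Hedge iterate on the probability simplex from the negentropy
    ADA-FTRL update: x^(i) proportional to exp(-z^(i)/eta), normalized to
    sum to one. Shifting by max(-z/eta) before exponentiating avoids
    overflow without changing the normalized result."""
    u = -z / eta
    w = np.exp(u - u.max())
    return w / w.sum()

# Coordinates with smaller cumulative loss receive exponentially more weight.
z = np.array([4.0, 1.0, 1.0])
x = eg_iterate(z, eta=1.0)
```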
Next, we show that Theorem 1 also provides the basis to analyze ADA-\nFTRL with perturbed states. Speci\ufb01cally, suppose that instead of (2), the iterate xt is given by\n\nxt 2 arg min\n\nx2X h\u02c6zt1, xi + \u02c6tt(x) + \u02c6r0:t1(x),\n\nt = 1, 2, . . . ,\n\n(5)\n\nwhere \u02c6zt1 denotes a perturbed version of the dual vector zt1, \u02c6tt denotes a perturbed version of\nADA-FTRL\u2019s iteration counter t, and \u02c6r0:t1 denotes a perturbed version of the regularizer r0:t1.\nThen, we can analyze the regret of the Perturbed-ADA-FTRL update (5) by comparing xt to the\n\u201cideal\u201d iterate \u02dcxt, given by\n\n\u02dcxt := arg min\n\nx2X hzt1, xi + t(x) + r0:t1(x),\n\nt = 1, 2, . . . .\n\n(6)\n\nSince (\u02dcxt)T\nt=1 is given by a non-perturbed ADA-FTRL update, it enjoys the bound of Theorem 1. The\ncrucial observation of Duchi et al. [10] (who studied the special case of (5) with = 0, box-shaped\nX , and \u02c6rt = rt) was that the regret of Perturbed-ADA-FTRL is related to the linearized regret of \u02dcxt.\nWhen may be non-zero, we capture this relation by the next lemma, proved in Appendix A:\nt=1 and (\u02dcxt)T\nLemma 1 (Perturbation penalty of ADA-FTRL). Consider any sequences (xt)T\nt=1 in\nX , and any sequence (gt)T\nof the sequence (xt)T\nt=1 satis\ufb01es\nTXt=1\n\n(hgt, \u02dcxt x\u21e4i + (\u02dcxt) (x\u21e4)) + \u02dc\u270f1:T + 1:T B1:T ,\n\nt=1 in Rd. Then, the regret R(f +)\n\nwhere \u02dc\u270ft = hgt, xt \u02dcxti + (xt) (\u02dcxt), t = hrft(xt) gt, xt x\u21e4i and Bt = Bft(x\u21e4, xt).\nSince gt is an unbiased estimate of rft(xt) (conditionally given xt), 1:T is zero in expectation\n0, and for \u02dcxt given by (6), the \ufb01rst summation is bounded by Theorem 1. Also note that when\nthe ft are (star-)convex, B1:T \uf8ff 0. 
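The comparison between the perturbed iterate (5) and the ideal iterate (6) can be made concrete in the unconstrained case with a zero regularizer, where both have closed forms. The sketch below, with a hypothetical delayed-update perturbation and seeded random gradients, checks the Cauchy-Schwarz bound on the resulting penalty term:

```python
import numpy as np

def da_iterate(z, eta):
    """Unconstrained dual-averaging iterate (regularizer = 0):
    argmin_x <z, x> + (eta/2)||x||^2  =  -z / eta."""
    return -z / eta

# Gradients g_1..g_T; the perturbed state z_hat misses the last few
# (delayed) updates, as it would in an asynchronous run.
rng = np.random.default_rng(0)
G = rng.normal(size=(50, 3))
t, delay, eta = 40, 3, np.sqrt(40.0)

z_true = G[:t - 1].sum(axis=0)           # z_{t-1} = g_{1:t-1}
z_hat = G[:t - 1 - delay].sum(axis=0)    # delayed dual vector
x_tilde = da_iterate(z_true, eta)        # ideal iterate, cf. (6)
x_t = da_iterate(z_hat, eta)             # perturbed iterate, cf. (5)

# Perturbation penalty <g_t, x_t - x_tilde_t>; its size is controlled by
# the delayed gradients g_{t-delay:t-1} and the step-size, via Cauchy-Schwarz.
eps_t = G[t - 1] @ (x_t - x_tilde)
bound = (np.linalg.norm(G[t - 1])
         * np.linalg.norm(G[t - 1 - delay:t - 1].sum(axis=0)) / eta)
assert abs(eps_t) <= bound + 1e-12
```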
Thus, to bound the regret of Perturbed-ADA-FTRL, it only\n\nR(f +)\n\n(7)\n\n=\n\nT\n\nT\n\n5\n\n\fremains to control the \u201cperturbation penalty\u201d terms \u02dc\u270ft capturing the difference in the composite linear\nloss hgt,\u00b7i + between xt and \u02dcxt. In Appendix A, we use the stability of ADA-FTRL algorithms\n(Lemma 3) to control \u02dc\u270f1:T , under a speci\ufb01c perturbation structure (coming from delayed updates to \u02c6zt)\nthat captures the evolution of the state of asynchronous dual-averaging algorithms like ASYNCADA\nand HEDGEHOG. Unlike Duchi et al. [10], our derivation applies to any convex constraint set X and,\ncrucially, to ADA-FTRL updates incorporating non-zero and a perturbed counter \u02c6tt. The following\n(informal) theorem, whose formal version is given in Appendix A, captures the result.\nTheorem 4 (informal). Under appropriate independence, regularity, and structural assumptions on\nthe regularizers and the perturbations, the Perturbed-ADA-FTRL update (5) satis\ufb01es\n\no \uf8ff E(r0:T (x\u21e4) +\n\nTXt=1 1 + p\u21e4\u232bt +Ps:t2Os\n\n\u232bt! 
B1:T) ,\nEnR(f +)\nwhere p\u21e4,\u232b t,\u2327 t and t measure, respectively, the sparsity of the gradient estimates gt, the difference\n\u02c6ttt, and the amount of perturbations in \u02c6zt1, and \u02c6r0:t1, while Os is the set of time steps whose\nattributed perturbations affect iteration s (i.e., their updates are delayed beyond s).\nAs we show next, we can control the effect of p\u21e4,\u2327t and t in the bound by appropriately tuning \u02c6tt,\nresulting in linear speed-ups for ASYNCADA and HEDGEHOG.\n\nkgtk2\n\n(t,\u21e4) +\n\nt\n\nT\n\n\u2327s\n\u232bs\n\n2\n\n5 ASYNCADA: Asynchronous Composite Adaptive Dual Averaging\nIn this section, we introduce and analyze ASYNCADA for asynchronous noisy online optimization.\nASYNCADA consists of \u2327 processes running in parallel (e.g., threads on the same physical machine\nor computing nodes distributed over a network accessing a shared data store). The processes can\naccess a shared memory, consisting of a dual vector z 2 Rd to store the sum of observed gradient\nestimates gt, a step-size vector \u2318 2 Rd, and an integer t, referred to as the clock, to track the number\nof iterations completed at each point in time. The processes run copies of Algorithm 1 concurrently.\n\nAlgorithm 1: ASYNCADA: Asynchronous Composite Adaptive Dual Averaging\n\n1 repeat\n2\n3\n4\n5\n6\n7\n8\n\n\u02c6\u2318 a full (lock-free) read of the shared step-sizes \u2318\n\u02c6z a full (lock-free) read of the shared dual vector z\nt t + 1\n\u02c6t t + \nReceive \u21e0t\nCompute the next iterate: xt prox(\u02c6tt,\u02c6zt1, \u02c6\u2318t)\nObtain the noisy gradient estimate: gt gt(xt,\u21e0 t)\nfor j such that g(j)\nUpdate the shared step-size vector \u2318\n\n6= 0 do z(j) z(j) + g(j)\n\nt\n\nt\n\n9\n10\n11 until terminated\n\n// atomic read-increment\n// denote \u02c6zt1 = \u02c6z, \u02c6\u2318t = \u02c6\u2318, \u02c6tt = \u02c6t\n\n// prox defined in (3)\n\n// atomic update\n\nInconsistent reads. 
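The shared-memory pattern of Algorithm 1 can be sketched with Python threads. This is purely illustrative: Python's GIL serializes most of the work, the toy objective, step-size schedule, and all names are assumptions of this sketch, and the regularizer is zero so the prox step is a plain rescaling. Still, it shows the lock-free reads of z, the atomic-style clock increment, the coordinatewise dual updates, and the coordination-free average iterate.

```python
import threading
import numpy as np

def asyncada_sketch(grad, d, T, tau=4, eta0=1.0, gamma=0):
    """Hogwild!-style dual averaging: tau threads share the dual vector z and
    an iteration clock. Each iteration does a lock-free full read of z, forms
    x_t via the prox step (here regularizer = 0 and X = R^d, so x_t = -z/eta_t),
    then adds its gradient estimate into z coordinate by coordinate."""
    z = np.zeros(d)                      # shared dual vector
    clock = [0]
    clock_lock = threading.Lock()        # stands in for an atomic read-increment
    results = []                         # per-thread (local_sum, local_count)

    def worker():
        local_sum, local_count = np.zeros(d), 0
        while True:
            z_hat = z.copy()             # lock-free read (may be inconsistent)
            with clock_lock:
                clock[0] += 1
                t = clock[0]
            if t > T:
                break
            eta_t = eta0 * np.sqrt(t + gamma)  # gamma: clock over-estimation
            x = -z_hat / eta_t           # prox step with zero regularizer
            g = grad(x)
            for j in np.nonzero(g)[0]:
                z[j] += g[j]             # coordinatewise "atomic" add
            local_sum += x
            local_count += 1
        results.append((local_sum, local_count))  # list.append is thread-safe

    threads = [threading.Thread(target=worker) for _ in range(tau)]
    for th in threads: th.start()
    for th in threads: th.join()
    # Average iterate needs no coordination: combine local sums at the end.
    return sum(s for s, _ in results) / sum(c for _, c in results)

# Toy smooth objective f(x) = 0.5 * ||x - b||^2, so grad f(x) = x - b.
b = np.array([1.0, -2.0, 0.0])
x_bar = asyncada_sketch(lambda x: x - b, d=3, T=2000)
```

In the real algorithm the prox call handles a nonzero ψ and a generic constraint set X, and the shared step-size vector η would also be read and updated.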
The processes access the shared memory without necessarily acquiring a lock:\nas in previous Hogwild!-style algorithms [30, 20, 18, 17, 27], we only assume that operations on\nsingle coordinates of z and \u2318, as well as on t0, are atomic. This in particular means that the values of\n\u02c6z or \u02c6\u2318 read by a process may not correspond to an actual state of z or \u2318 at any given point in time, as\ndifferent processes can modify the coordinates in parallel while the read is taking place. A process \u21e1\nis in write-con\ufb02ict with another process \u21e10 (equivalently, \u21e10 is in read-con\ufb02ict with \u21e1) if \u21e10 reads parts\nof the memory which should have been updated by \u21e1 before. To limit the effects of asynchrony, we\nassume that a process can be in write- and read con\ufb02icts with at most \u2327c1 processes, respectively.\nThe role of . ASYNCADA uses an over-estimate \u02c6tt of the current global clock t by an additional .\nThis over-estimation enables us to better handle the effect of asynchrony when composite objectives\nare involved, in particular ensuring the appropriate tuning of \u232bt in Theorem 4; see Appendix C.\nASYNCADA can nevertheless be run without (i.e., with = 0).7\n\n7 In Theorems 2, 5 and 6, we set based on \u2327\u21e4 := max{\u2327c, \u2327}. The analysis is still possible, and\nstraightforward, with = 0, but results in a worst constant factor in the rate, as well as an extra additive term of\norder O(\u2327 2\n\u21e4 ) where = sup x,y2X {(x) (y)} is the diameter of X w.r.t. . This term does not diminish\nwith p\u21e4 and may be unnecessarily large, affecting convergence in early stages of the optimization process.\n\n6\n\n\fExact vs estimated clock. 
ASYNCADA as given in Algorithm 1 maintains the exact global clock t.\nHowever, this option may not be desirable (or available) in certain asynchronous computing scenarios.\nFor example, if the processes are distributed over a network, then maintaining an exact global clock\namounts to changing the pattern of asynchrony and delaying the computations by repeated calls over\na network. To mitigate this requirement, in Appendix B we provide ASYNCADA(\u21e2), a version of\nASYNCADA in which the processes update the global clock only every \u21e2 iterations. ASYNCADA as\npresented in Algorithm 1 is equivalent to ASYNCADA(\u21e2) with \u21e2 = 1, and both algorithms enjoy the\nsame rate of convergence and linear speed-up. Obviously, when \u2318 0 and t is not used for setting\nthe step-sizes \u2318 either, there is no need to maintain t physically, and Line 4 can be omitted.\nUpdating the step-sizes \u2318:\nIn Line 10 of Algorithm 1, the step-size \u2318 has to be updated based on\nthe information received. The exact way this is done depends on the speci\ufb01c step-size sched-\nule.\nIn particular, we consider two situations: First, when the step-size is either constant or\na simple function of t (or \u02c6tt in case of ASYNCADA(\u21e2)), and second, when diagonal ADA-\nGRAD step-sizes are used. In the \ufb01rst case, the vector \u2318 need not be kept in the shared mem-\nory explicitly, and Lines 2 and 10 can be omitted.\nIn the second case, following [10], we\nstore the sum of squared gradients in the shared \u2318, i.e., Line 10 is implemented as follows:\n\n10* for j such that g(j)\n\nt\n\n6= 0 do \u2318(j)2\n\n \u2318(j)2\n\n+ \u21b52\u21e3g(j)\nt \u23182\n\n// atomic update\n\nfor a \ufb01xed hyper-parameter \u21b5> 0. 
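The squared-step-size bookkeeping of Line 10* can be sketched directly (the array size, α, and the gradients below are illustrative): the shared vector stores η², each nonzero gradient coordinate contributes α²(g^(j))², and a read recovers η by taking a square root.

```python
import numpy as np

# Shared memory holds eta_sq: per-coordinate sums of squared gradients,
# i.e., the square of the step-size vector, updated coordinatewise.
eta_sq = np.zeros(4)
alpha = 0.1  # hypothetical hyper-parameter alpha > 0

def write_step(g):
    """Line 10*: accumulate alpha^2 * g_j^2 for each nonzero coordinate
    (each += would be an atomic add in the shared-memory algorithm)."""
    for j in np.nonzero(g)[0]:
        eta_sq[j] += alpha ** 2 * g[j] ** 2

def read_step_sizes():
    """Line 2: a full read returns eta_sq; taking the square root
    recovers the ADAGRAD step-sizes eta."""
    return np.sqrt(eta_sq)

for g in ([1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0]):
    write_step(np.array(g))
eta = read_step_sizes()
# eta_j = alpha * sqrt(sum_t g_{t,j}^2), e.g. eta_0 = 0.1 * sqrt(1 + 9)
```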
In this case, we are storing the square of \u2318 in the shared memory,\nso a square root operation needs to be applied after reading the shared memory in Line 2 to retrieve \u2318.\nForming the output \u00afxT for stochastic optimization: For stochastic optimization, the algorithm\nneeds to output the average (or randomized) iterate \u00afxT at the end. However, this needs no further\ncoordination between the processes. To form the average iterate, it suf\ufb01ces for each process to keep a\nlocal running sum of the iterates it produces and the number of updates it makes. At the end, \u00afxT is\nbuilt from these sums and the total number of updates. Alternatively, we can return a random iterate\nas \u00afxT by terminating the algorithm, with probability 1/T , after calculating x in Line 7.\n\nt\n\n5.1 Analysis of ASYNCADA\nThe analysis of ASYNCADA is based on treating it as a special case of Perturbed-ADA-FTRL. In\norder to be able to use Theorem 4, we start with the following independence assumption on \u21e0t:\nAssumption 1 (Independence of \u21e0t). For all t = 1, 2, . . . , T , the t-th sample \u21e0t is independent of the\n\nhistory \u02c6Ht := (\u21e0s, \u02c6zs, \u02c6\u2318s+1)t1\ns=1 .\nThis, in turn, implies that \u21e0t is independent of xt as well as xs and \u21e0s for all s < t.\nFor general (non-box-shaped) X , Assumption 1 is plausible, as ASYNCADA needs to read z (and \u2318)\ncompletely and independently of \u21e0t. If X is box-shaped and is coordinate-separable, however, the\nvalues of x(j)\nfor different coordinates j can be calculated independently. In this, case, the algorithm\nmay \ufb01rst sample \u21e0t, and then only read the relevant coordinates j from z (and \u2318) for which gt may\nbe non-zero, as calculating other values of x(j)\nis unnecessary for calculating gt. As mentioned by\nMania et al. [20], this violates Assumption 1. 
This is because multiple other processes are updating z and η, and the updates that are included in the value read for ẑ_{t−1} (and η̂_t) would then depend on ξ_t. Previous papers either assume that this independence holds in their analysis, e.g., by enforcing a full read of z and η [20, 18, 17, 27], or rely on the smoothness of the objective to bound the effect of the possible change in the read values [20, Appendix A]. It seems possible to adapt the argument of Mania et al. [20, Appendix A] to ASYNCADA for box-shaped X, by comparing x_t to the iterate that would have been created based on the content of the shared memory right before the start of the execution of the t-th iteration. This makes the analysis more complicated, and is not necessary when X is not box-shaped; hence, we do not further pursue this construction in this paper.

Sparsity of the gradient estimates. For t ∈ [T] and j ∈ [d], let p_{t,j} denote the probability that the j-th coordinate of g_t is non-zero given the history Ĥ_t, that is, p_{t,j} = P{ g_t^{(j)} ≠ 0 | Ĥ_t }. Let p* denote an upper bound on max_{t∈[T], j∈[d]} p_{t,j}. We use p* as a measure of the sparsity of the problem.⁸

⁸ In stochastic optimization with a finite-sum objective f = Σ_{i=1}^{m} f_i, where g_t = ∇f_{ξ_t}(x_t) and ξ_t ∈ [m] is an index at time t sampled uniformly at random and independently of the history, one could measure the
De\ufb01ne \u2327\u21e4 = max{\u2327c, \u2327}. The next theorem gives\nbounds on the regret of ASYNCADA under various scenarios. It is proved in Appendix C, where a\nsimilar result is also given for ASYNCADA(\u21e2) (Theorem 5).\nTheorem 2. Suppose that either all ft, t 2 [T ] are convex, or \u2318 0 and ft \u2318 f for some star-convex\nfunction f. Consider ASYNCADA running under Assumption 1 for T >\u2327 2\n.\nupdates, using = 2\u2327 2\n\u21e4\n\u21e4\nLet \u23180 > 0. Then:\nfor all t 2 [T ], then using a \ufb01xed \u2318t = \u23180pT or a time-varying \u2318t = \u23180p\u02c6tt,\n(i) If Ekgtk2\npT \u2713\u23180kx\u21e4k2\nT EnR(f +)\no \uf8ff\n2 , and for all \u21e0 2 \u2305, F (\u00b7,\u21e0 ) is convex\n\u21e4 := EkrF (x\u21e4,\u00b7)k2\n(ii) If ft = f = E\u21e0\u21e0P\u2305 {F (x, \u21e0)}, 2\nand 1-smooth w.r.t. the norm k\u00b7k l for some l 2 Rd with positive entries, then given a constant\n\u21e4 ) and using a \ufb01xed \u2318t,i = c0li + \u23180pT or a time-varying \u2318t,i = c0li + \u23180p\u02c6tt,\nc0 > 8(1 + p\u21e4\u2327 2\npT \u2713\u23180kx\u21e4k2\no \uf8ff\nT EnR(f +)\nc0kx\u21e4k2\n1\n(iii) If is \u00b5-strongly-convex and Ekgtk2\n2 \uf8ff G2\nfor all t 2 [T ], then using \u2318t \u2318 0 or, equivalently,\n\u02c6tt(x) + hz, xi = r\u21e4(z/\u02c6tt),\nprox(\u02c6tt,z, 0) := arg minx2X\no \uf8ff\nT EnR(f +)\n\nRemark 1. If c = p\u21e4\u2327 2\nis constant, the bounds match the corresponding serial bounds [16] up to\n\u21e4\nconstant factors, implying a linear speed-up. This also extends the analysis of ASYNC-DA [10] to\nnon-box-shaped X , non-zero , time-varying step sizes, and smooth and strongly-convex objectives.9\nRemark 2. Note that (10) holds for all time steps, and converges to zero as T grows, without\nthe knowledge of T or epoch-based updates. In case of ASYNCADA(\u21e2), the algorithm does not\nmaintain an exact clock either. 
To our knowledge, this makes ASYNCADA(\u21e2) the \ufb01rst Hogwild!-style\nalgorithm with an any-time guarantee without maintaining a global clock.\nRemark 3. Since strongly convex functions have unbounded gradients on unbounded domains, it\nis not possible to impose a uniform bound on the gradient of f + in part (iii) for unconstrained\noptimization (i.e., when X = Rd). However, we only require the gradients of f, the non-strongly-\nconvex part of the objective, to be bounded, which is a feasible assumption. Similarly, Nguyen\net al. [24] analyzed strongly-convex optimization with unconstrained Hogwild! while avoiding the\naforementioned uniform boundedness assumption,s using a global clock. ASYNCADA(\u21e2) achieves\nthe same result, but applies to arbitrary convex X and , without requiring a global clock.\nAdaptive step-sizes. Due to space constraints, we relegate the analysis of ASYNCADA(\u21e2) with\nAdaGrad step-sizes given by Line 10* to Appendix D.\n\n\u21e4 )G2\n\u21e4(1 + log(T ))\n\u00b5T\n\n4(1 + p\u21e4\u2327 2\n\u21e4 )\n\n2\n\n\u21e4\u25c6 .\n\n2\n\nl\n\n+\n\nT\n\n(1 + p\u21e4\u2327 2\n\n1\n\nT\n\n2 +\n\n\u23180\n\n(8)\n\n(9)\n\n.\n\n(10)\n\nT\n\n\u21e4\n\n6 HEDGEHOG: Hogwild-Style Hedge\n\nNext, we present HEDGEHOG, which is, to our knowledge the \ufb01rst asynchronous version of the EG\nalgorithm. The parallelization scheme is very similar to ASYNCADA, the difference being that EG\nuses multiplicative updates rather than additive SGD-style updates. We focus only on the case of\n \u2318 0. Each processe runs Lines 3\u201310 of Algorithm 2 concurrently with the other processes, sharing\nthe dual vector z.\nsparsity of the problem through a \u201ccon\ufb02ict graph\u201d [30, 20, 17, 27], which is a bi-partite graph with fi, i 2 [m]\non the left and coordinates j 2 [d] on the right, and an edge between fi and coordinate j if rfi(x)(j) can be\nnon-zero for some x 2X . 
In this graph, let Δ_j denote the degree of the node corresponding to coordinate j, and let Δ_r be the largest Δ_j, j ∈ [d]. Then, it is straightforward to see that p_{t,j} ≤ Δ_j/m. Thus, p_* = Δ_r/m is a valid upper bound, and gives the sparsity measure used, e.g., by Leblond et al. [17] and Pedregosa et al. [27].

⁹ Note that under the conditions considered in [10], which include that X is box-shaped and ψ = 0, ASYNC-DA requires a less restrictive sparsity regime of p_*τ_* ≤ c for linear speed-up.

Algorithm 2: HEDGEHOG!: Asynchronous Stochastic Exponentiated Gradient.
Input: Step size η
 1  Initialization:
 2      Let z ← 0 be the shared sum of observed gradient estimates
 3  repeat in parallel by each process
 4      ẑ ← a full lock-free read of the shared dual vector z    // t ← t + 1, denote ẑ_{t−1} = ẑ
 5      Receive ξ_t
 6      Compute the next iterate: w_t^{(i)} ← exp(−ẑ_{t−1}^{(i)}/η),  i = 1, 2, . . . , d
 7      Normalize: x_t ← w_t/‖w_t‖_1
 8      Obtain the noisy gradient estimate: g_t ← g_t(x_t, ξ_t)
 9      for j such that g_t^{(j)} ≠ 0 do z^{(j)} ← z^{(j)} + g_t^{(j)}    // atomic update
10  until terminated

As in ASYNCADA(ρ), we index the iterations by the time they finish the reading of z in Line 4 of HEDGEHOG ("after-read" labeling [18]). Similarly, we use Ĥ_t = {(ξ_s, ẑ_s)}_{s=1}^{t−1} to denote the history of HEDGEHOG at time t, and use Ĥ_t to define the sparsity measure p_* as in Section 5.1. Then, we have the following regret bound for HEDGEHOG.

Theorem 3. Let X be the probability simplex X = {x | x^{(j)} > 0, ‖x‖_1 = 1}, and suppose that either the f_t are all convex, or f_t ≡ f for a star-convex f. Assume that for all t ∈ [T], the sampling of ξ_t in Line 5 of HEDGEHOG is independent of the history Ĥ_t.
Then, after T updates, HEDGEHOG satisfies

    E{R_T^(f)} ≤ η log(d) + Σ_{t=1}^T E{ (1 + √p_* τ_*)/(2η) · ‖g_t‖_∞² }.

Remark 4. As in the case of ASYNCADA, as long as √p_* τ_* is a constant, the rate above matches the worst-case rate of serial EG up to constant factors, implying a linear speed-up. In particular, given an upper bound G_* on E{‖g_t‖_∞} and setting η = G_*√(T/log(d)), we recover the well-known O(G_*√(T log(d))) rate for EG [14], but in the parallel asynchronous setting.

7 Conclusion, limitations, and future work

We presented and analyzed ASYNCADA, a parallel asynchronous online optimization algorithm with composite, adaptive updates, and global convergence rates under generic convex constraints and convex composite objectives which can be smooth, non-smooth, or non-strongly-convex. We also showed a similar global convergence rate for the so-called "star-convex" class of non-convex functions. Under all of the aforementioned settings, we showed that ASYNCADA enjoys linear speed-ups when the data is sparse. We also derived and analyzed HEDGEHOG, to our knowledge the first Hogwild-style asynchronous variant of the Exponentiated Gradient algorithm working on the probability simplex, and showed that HEDGEHOG enjoys similar linear speed-ups.

To derive and analyze ASYNCADA and HEDGEHOG, we showed that the idea of perturbed iterates, used previously in the analysis of asynchronous SGD algorithms, naturally extends to generic dual-averaging algorithms, in the form of a perturbation in the "state" of the algorithm. Then, building on the work of Duchi et al. [10], we studied a unified framework for analyzing generic adaptive dual-averaging algorithms for composite-objective noisy online optimization (including ASYNCADA and HEDGEHOG as special cases).
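As a concrete illustration of the multiplicative update at the heart of HEDGEHOG (Lines 6–7 of Algorithm 2), the following minimal serial sketch shows one EG iterate and a serial reference loop; the function names are ours, and we assume ψ ≡ 0 and the w ∝ exp(−ẑ/η) form of the update (the asynchronous version would instead accumulate z with sparse atomic updates):

```python
import numpy as np

def eg_iterate(z_hat, eta):
    """One EG/Hedge iterate from a (possibly stale) read z_hat of the dual
    vector: w_i = exp(-z_hat_i / eta), then normalize onto the simplex."""
    w = np.exp(-z_hat / eta)
    return w / w.sum()

def eg_serial(grad_fn, d, eta, T):
    """Serial reference loop (names ours): accumulate gradients in z and
    form each iterate multiplicatively from the running sum."""
    z = np.zeros(d)
    for _ in range(T):
        x = eg_iterate(z, eta)
        z += grad_fn(x)  # HEDGEHOG does this coordinate-wise and atomically
    return eg_iterate(z, eta)
```

With z = 0 the iterate is uniform, and under a fixed linear loss with gradient (1, 0) the accumulated dual variable drives the weight of the first coordinate to zero, concentrating the mass on the second coordinate, as the multiplicative update dictates.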
Possible directions for future research include applying the analysis to other problem settings, such as multi-armed bandits. In addition, it remains an open problem whether such an analysis is obtainable for constrained adaptive Mirror Descent without further restrictions on the regularizers (e.g., smoothness of the regularizer seems to help). Finally, the derivation of such data-dependent bounds for the final (rather than the average) iterate in stochastic optimization, without the usual strong-convexity and smoothness assumptions, remains an interesting open problem.

References

[1] Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer Science & Business Media, 2011.

[2] Loris Cannelli et al. "Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization. Part I: Model and Convergence". In: arXiv preprint arXiv:1607.04818 (2017).

[3] Loris Cannelli et al. "Asynchronous Parallel Algorithms for Nonconvex Big-Data Optimization. Part II: Complexity and Numerical Results". In: arXiv preprint arXiv:1701.04900 (2017).

[4] Loris Cannelli et al. "Asynchronous parallel algorithms for nonconvex optimization". In: arXiv preprint arXiv:1607.04818 (2016).

[5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. New York, NY, USA: Cambridge University Press, 2006.

[6] Damek Davis, Brent Edmunds, and Madeleine Udell. "The sound of APALM clapping: Faster nonsmooth nonconvex optimization with stochastic asynchronous PALM". In: Advances in Neural Information Processing Systems. 2016, pp. 226–234.

[7] Christopher De Sa et al. "Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms". In: arXiv preprint arXiv:1506.06438 (2015).

[8] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien.
"SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives". In: Advances in Neural Information Processing Systems. 2014, pp. 1646–1654.

[9] John Duchi, Elad Hazan, and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization". In: Journal of Machine Learning Research 12 (July 2011), pp. 2121–2159.

[10] John Duchi, Michael I. Jordan, and Brendan McMahan. "Estimation, optimization, and parallelism when data is sparse". In: Advances in Neural Information Processing Systems. 2013, pp. 2832–2840.

[11] Francisco Facchinei, Gesualdo Scutari, and Simone Sagratella. "Parallel selective algorithms for nonconvex big data optimization". In: IEEE Transactions on Signal Processing 63.7 (2015), pp. 1874–1889.

[12] Olivier Fercoq and Peter Richtárik. "Optimization in high dimensions via accelerated, parallel, and proximal coordinate descent". In: SIAM Review 58.4 (2016), pp. 739–771.

[13] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Berlin: Springer, 2001.

[14] Elad Hazan. "Introduction to online convex optimization". In: Foundations and Trends in Optimization 2.3-4 (2016), pp. 157–325.

[15] Rie Johnson and Tong Zhang. "Accelerating stochastic gradient descent using predictive variance reduction". In: Advances in Neural Information Processing Systems. 2013, pp. 315–323.

[16] Pooria Joulani, András György, and Csaba Szepesvári. "A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds". In: Proceedings of Machine Learning Research (Algorithmic Learning Theory 2017). 2017, pp. 681–720.

[17] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien.
"Improved asynchronous parallel optimization analysis for stochastic incremental methods". In: arXiv preprint arXiv:1801.03749 (2018).

[18] Rémi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. "ASAGA: Asynchronous parallel SAGA". In: arXiv preprint arXiv:1606.04809 (2016).

[19] Ji Liu et al. "An asynchronous parallel stochastic coordinate descent algorithm". In: arXiv preprint arXiv:1311.1873 (2013).

[20] H. Mania et al. "Perturbed Iterate Analysis for Asynchronous Stochastic Optimization". In: arXiv e-prints (July 2015). arXiv: 1507.06970 [stat.ML].

[21] H. Brendan McMahan. "A Survey of Algorithms and Analysis for Adaptive Online Learning". In: Journal of Machine Learning Research 18.90 (2017), pp. 1–50.

[22] H. Brendan McMahan and Matthew Streeter. "Adaptive bound optimization for online convex optimization". In: Proceedings of the 23rd Conference on Learning Theory. 2010.

[23] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Vol. 87. Springer Science & Business Media, 2013.

[24] Lam M. Nguyen et al. "SGD and Hogwild! convergence without the bounded gradients assumption". In: arXiv preprint arXiv:1802.03801 (2018).

[25] Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi. "A generalized online mirror descent with applications to classification and regression". In: Machine Learning 99.3 (2015), pp. 411–435.

[26] Xinghao Pan et al. "Cyclades: Conflict-free asynchronous machine learning". In: Advances in Neural Information Processing Systems. 2016, pp. 2568–2576.

[27] Fabian Pedregosa, Rémi Leblond, and Simon Lacoste-Julien. "Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization". In: Advances in Neural Information Processing Systems. 2017, pp. 55–64.

[28] Zhimin Peng et al.
"ARock: An algorithmic framework for asynchronous parallel coordinate updates". In: SIAM Journal on Scientific Computing 38.5 (2016), A2851–A2879.

[29] Meisam Razaviyayn et al. "Parallel successive convex approximation for nonsmooth nonconvex optimization". In: Advances in Neural Information Processing Systems. 2014, pp. 1440–1448.

[30] Benjamin Recht et al. "Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent". In: Advances in Neural Information Processing Systems 24. Ed. by J. Shawe-Taylor et al. Curran Associates, Inc., 2011, pp. 693–701.

[31] Gesualdo Scutari, Francisco Facchinei, and Lorenzo Lampariello. "Parallel and distributed methods for constrained nonconvex optimization—Part I: Theory". In: IEEE Transactions on Signal Processing 65.8 (2016), pp. 1929–1944.

[32] Gesualdo Scutari and Ying Sun. "Parallel and distributed successive convex approximation methods for big-data optimization". In: Multi-agent Optimization. Springer, 2018, pp. 141–308.

[33] Gesualdo Scutari et al. "Parallel and distributed methods for constrained nonconvex optimization—Part II: Applications in communications and machine learning". In: IEEE Transactions on Signal Processing 65.8 (2016), pp. 1945–1960.

[34] Shai Shalev-Shwartz. "Online learning and online convex optimization". In: Foundations and Trends in Machine Learning 4.2 (2011), pp. 107–194.

[35] Tao Sun, Robert Hannah, and Wotao Yin. "Asynchronous coordinate descent under more realistic assumptions". In: Advances in Neural Information Processing Systems. 2017, pp. 6182–6190.

[36] Yu-Xiang Wang et al. "Parallel and distributed block-coordinate Frank-Wolfe algorithms". In: International Conference on Machine Learning. 2016, pp. 1548–1557.

[37] Lin Xiao.
"Dual averaging method for regularized stochastic learning and online optimization". In: Advances in Neural Information Processing Systems. 2009, pp. 2116–2124.