{"title": "A Generic Acceleration Framework for Stochastic Composite Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 12556, "page_last": 12567, "abstract": "In this paper, we introduce various mechanisms to obtain accelerated first-order stochastic optimization algorithms when the objective function is convex or strongly convex. Specifically, we extend the Catalyst approach originally designed for deterministic objectives to the stochastic setting. Given an optimization method with mild convergence guarantees for strongly convex problems, the challenge is to accelerate convergence to a noise-dominated region, and then achieve convergence with an optimal worst-case complexity depending on the noise variance of the gradients. A side contribution of our work is also a generic analysis that can handle inexact proximal operators, providing new insights about the robustness of stochastic algorithms when the proximal operator cannot be exactly computed.", "full_text": "A Generic Acceleration Framework\n\nfor Stochastic Composite Optimization\n\nAndrei Kulunchakov and Julien Mairal\n\nUniv. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France\n\n\u275b\u2665\u275er\u2761\u2710\u2733\u2766\u2709\u2767\u2709\u2665\u275d\u2764\u275b\u2766\u2666\u2708\u2745\u2710\u2665r\u2710\u275b\u2733\u2762r \u275b\u2665\u275e \u2765\u2709\u2767\u2710\u2761\u2665\u2733\u2660\u275b\u2710r\u275b\u2767\u2745\u2710\u2665r\u2710\u275b\u2733\u2762r\n\nAbstract\n\nIn this paper, we introduce various mechanisms to obtain accelerated \ufb01rst-order\nstochastic optimization algorithms when the objective function is convex or strongly\nconvex. Speci\ufb01cally, we extend the Catalyst approach originally designed for\ndeterministic objectives to the stochastic setting. 
Given an optimization method\nwith mild convergence guarantees for strongly convex problems, the challenge is to\naccelerate convergence to a noise-dominated region, and then achieve convergence\nwith an optimal worst-case complexity depending on the noise variance of the\ngradients. A side contribution of our work is also a generic analysis that can\nhandle inexact proximal operators, providing new insights about the robustness of\nstochastic algorithms when the proximal operator cannot be exactly computed.\n\n1\n\nIntroduction\n\nIn this paper, we consider stochastic composite optimization problems of the form\n\nmin\n\nx\u2208Rp {F (x) := f (x) + \u03c8(x)} with\n\nf (x) = E\u03be[ \u02dcf (x, \u03be)],\n\n(1)\n\nwhere the function f is convex, or \u00b5-strongly convex, and L-smooth (meaning differentiable with\nL-Lipschitz continuous gradient), and \u03c8 is a possibly non-smooth convex lower-semicontinuous\nfunction. For instance, \u03c8 may be the \u21131-norm, which is known to induce sparsity, or an indicator\nfunction of a convex set [21]. The random variable \u03be corresponds to data samples. When the amount\nof training data is \ufb01nite, the expectation E\u03be[ \u02dcf (x, \u03be)] can be replaced by a \ufb01nite sum, a setting that\nhas attracted a lot of attention in machine learning recently, see, e.g., [13, 14, 19, 25, 35, 42, 53] for\nincremental algorithms and [1, 26, 30, 33, 47, 55, 56] for accelerated variants.\n\nYet, as noted in [8], one is typically not interested in the minimization of the empirical risk\u2014that is,\na \ufb01nite sum of functions\u2014with high precision, but instead, one should focus on the expected risk\ninvolving the true (unknown) data distribution. When one can draw an in\ufb01nite number of samples\nfrom this distribution, the true risk (1) may be minimized by using appropriate stochastic optimization\ntechniques. 
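As a concrete illustration of the composite structure in (1), when ψ is the ℓ1-norm the proximal operator has the well-known closed form of soft-thresholding; the sketch below (the toy least-squares f and all names are our own illustrative choices, not from the paper) shows the two ingredients side by side:

```python
import numpy as np

def prox_l1(x, lam):
    # Proximal operator of lam * ||.||_1, i.e. the minimizer of
    # 0.5 * ||z - x||^2 + lam * ||z||_1; acts coordinate-wise (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def F(x, A, b, lam):
    # Composite objective F(x) = f(x) + psi(x) with a toy least-squares f
    # and psi = lam * ||.||_1, matching the structure of problem (1).
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```

The map `prox_l1` shrinks each coordinate toward zero, which is exactly how the ℓ1-norm induces sparsity in the solution.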
Unfortunately, fast methods designed for deterministic objectives do not apply to this setting; methods based on stochastic approximations indeed admit optimal "slow" rates, typically O(1/√k) for convex functions and O(1/k) for strongly convex ones, depending on the exact assumptions made on the problem, where k is the number of noisy gradient evaluations [38].\n\nBetter understanding the gap between deterministic and stochastic optimization is one goal of this paper. Specifically, we are interested in Nesterov's acceleration of gradient-based approaches [39, 40]. In a nutshell, gradient descent or its proximal variant applied to a µ-strongly convex L-smooth function achieves an exponential convergence rate O((1 − µ/L)^k) in the worst case in function values, and a sublinear rate O(L/k) if the function is simply convex (µ = 0). By interleaving the algorithm with clever extrapolation steps, Nesterov showed that faster convergence could be achieved, and the previous convergence rates become O((1 − √(µ/L))^k) and O(L/k²), respectively.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWhereas no clear geometrical intuition seems to appear in the literature to explain why acceleration occurs, proof techniques to show accelerated convergence [5, 40, 50] and extensions to a large class of other gradient-based algorithms are now well established [1, 10, 33, 41, 47].\n\nYet, the effect of Nesterov's acceleration on stochastic objectives remains poorly understood, since existing unaccelerated algorithms such as stochastic mirror descent [38] and their variants already achieve the optimal asymptotic rate. Besides, negative results also exist, showing that Nesterov's method may be unstable when the gradients are computed approximately [12, 16]. 
Nevertheless,\nseveral approaches such as [4, 11, 15, 17, 18, 23, 28, 29, 52] have managed to show that acceleration\nmay be useful to forget faster the algorithm\u2019s initialization and reach a region dominated by the\nnoise of stochastic gradients; then, \u201cgood\u201d methods are expected to asymptotically converge with\na rate exhibiting an optimal dependency in the noise variance [38], but with no dependency on the\ninitialization. A major challenge is then to achieve the optimal rate for these two regimes.\nIn this paper, we consider an optimization method M with the following property: given an auxiliary\nstrongly convex objective function h, we assume that M is able to produce iterates (zt)t\u22650 with\nexpected linear convergence to a noise-dominated region\u2014that is, such that\n\nE[h(zt) \u2212 h\u22c6] \u2264 C(1 \u2212 \u03c4 )t(h(z0) \u2212 h\u22c6) + B\u03c32,\n\n(2)\n\nwhere C, \u03c4, B > 0, h\u22c6 is the minimum function value, and \u03c32 is an upper bound on the variance of\nstochastic gradients accessed by M, which we assume to be uniformly bounded. Whereas such an\nassumption has limitations, it remains the most standard one for stochastic optimization (see [9, 43]\nfor more realistic settings in the smooth case). 
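To make assumption (2) concrete, here is a sketch of a method satisfying such a bound: constant step-size SGD with iterate averaging on a toy strongly convex quadratic. The problem, noise model, and all constants are illustrative assumptions of ours; the averaged iterate settles in a noise-dominated region whose size scales with σ²:

```python
import numpy as np

def averaged_sgd(grad_oracle, x0, step, n_iters, rng):
    # Constant step-size SGD with uniform iterate averaging; grad_oracle(x, rng)
    # must return an unbiased stochastic gradient with bounded variance.
    x = x0.copy()
    avg = x0.copy()
    for t in range(1, n_iters + 1):
        x = x - step * grad_oracle(x, rng)
        avg += (x - avg) / (t + 1)  # running average of all iterates
    return avg

# Toy mu-strongly convex, L-smooth quadratic h(x) = 0.5 * sum(D * x^2):
rng = np.random.default_rng(0)
D = np.array([1.0, 10.0])          # eigenvalues: mu = 1, L = 10
sigma = 0.1                        # additive gradient noise level
oracle = lambda x, r: D * x + sigma * r.standard_normal(2)
z = averaged_sgd(oracle, np.array([5.0, 5.0]), step=1.0 / 10.0, n_iters=2000, rng=rng)
h = 0.5 * np.sum(D * z ** 2)       # optimum is 0; h should reach the noise floor
```

With step size 1/L this matches the regime described next, where (2) holds with τ = µ/L, B = 1/L, and C = 1.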
The class of methods satisfying (2) is relatively large.\nFor instance, when h is L-smooth, the stochastic gradient descent method (SGD) with constant step\nsize 1/L and iterate averaging satis\ufb01es (2) with \u03c4 = \u00b5/L, B = 1/L, and C = 1, see [28].\n\nMain contribution.\n\nIn this paper, we extend the Catalyst approach [33] to general stochastic\nproblems.1 Under mild conditions, our approach is able to turn M into a converging algorithm with a\nworst-case expected complexity that decomposes into two parts: the \ufb01rst one exhibits an accelerated\nconvergence rate in the sense of Nesterov and shows how fast one forgets the initial point; the second\none corresponds to the stochastic regime and typically depends (optimally in many cases) on \u03c32.\nNote that even though we only make assumptions about the behavior of M on strongly convex\nsub-problems (2), we also treat the case where the objective (1) is convex, but not strongly convex.\n\nTo illustrate the versatility of our approach, we consider the stochastic \ufb01nite-sum problem [7, 22, 31,\n54], where the objective (1) decomposes into n components \u02dcf (x, \u03be) = 1\n\u02dcfi(x, \u03be) and \u03be is a\nstochastic perturbation, coming, e.g., from data augmentation or noise injected during training to\nimprove generalization or privacy (see [28, 35]). The underlying \ufb01nite-sum structure may also result\nfrom clustering assumptions on the data [22], or from distributed computing [31], a setting beyond\nthe scope of our paper. Whereas it was shown in [28] that classical variance-reduced stochastic\noptimization methods such as SVRG [53], SDCA [47], SAGA [13], or MISO [35], can be made robust\nto noise, the analysis of [28] is only able to accelerate the SVRG approach. 
With our acceleration technique, all of the aforementioned methods can be modified such that they find a point x̂ satisfying E[F(x̂) − F⋆] ≤ ε with global iteration complexity, for the µ-strongly convex case,\n\nÕ( (n + √(nL/µ)) log((F(x0) − F⋆)/ε) + σ²/(µε) ). (3)\n\nThe term on the left is the optimal complexity for finite-sum optimization [1, 2], up to logarithmic terms in L, µ hidden in the Õ(.) notation, and the term on the right is the optimal complexity for µ-strongly convex stochastic objectives [17], where σ² is due to the perturbations ξ. As with Catalyst [33], the price to pay compared to non-generic direct acceleration techniques [1, 28] is a logarithmic factor.\n\nOther contributions. In this paper, we generalize the analysis of Catalyst [33, 44] to handle various new cases. Beyond the ability to deal with stochastic optimization problems, our approach (i) improves Catalyst by allowing sub-problems of the form (2) to be solved approximately in expectation, which is more realistic than the deterministic requirement made in [33] and which is also critical for stochastic optimization, (ii) leads to new accelerated stochastic gradient descent algorithms for composite optimization with similar guarantees as [17, 18, 28], and (iii) handles the analysis of accelerated proximal gradient descent methods with inexact computation of proximal operators, improving the results of [45] while also treating the stochastic setting.\n\n1 All objectives addressed by the original Catalyst approach are deterministic, even though they may be large finite sums. Here, we consider general expectations as defined in (1).\n\nFinally, we note that the extension of Catalyst we propose is easy to implement. 
The original Catalyst method introduced in [32] indeed required solving a sequence of sub-problems while carefully controlling the convergence, e.g., with duality gaps. For this reason, Catalyst has sometimes been seen as theoretically appealing but not practical enough [46]. Here, we focus on a simpler and more practical variant presented later in [33], which consists of solving sub-problems with a fixed computational budget, thus removing the need to define stopping criteria for sub-problems. The code used for our experiments is available here: https://github.com/KuluAndrej/NIPS-2019-code.\n\n2 Related Work on Inexact and Stochastic Proximal Point Methods\n\nCatalyst is based on the inexact accelerated proximal point algorithm [20], which consists in solving approximately a sequence of sub-problems and updating two sequences (xk)k≥0 and (yk)k≥0 by\n\nxk ≈ argmin_{x∈Rp} { hk(x) := F(x) + (κ/2)‖x − yk−1‖² } and yk = xk + βk(xk − xk−1), (4)\n\nwhere βk in (0, 1) is obtained from Nesterov's acceleration principles [40], κ is a well-chosen regularization parameter, and ‖·‖ denotes the Euclidean norm. The method M is used to obtain an approximate minimizer of hk; when M converges linearly, it may be shown that the resulting algorithm (4) enjoys a better worst-case complexity than if M were used directly on f, see [33]. Since asymptotic linear convergence is out of reach when f is a stochastic objective, a classical strategy consists in replacing F(x) in (4) by a finite-sum approximation obtained by random sampling, leading to deterministic sub-problems. 
Typically without Nesterov\u2019s acceleration (with yk = xk),\nthis strategy is often called the stochastic proximal point method [3, 6, 27, 48, 49]. The point of view\nwe adopt in this paper is different and is based on the minimization of surrogate functions hk related\nto (4), but which are more general and may take other forms than F (x) + \u03ba\n\n2kx \u2212 yk\u20131k2.\n\n3 Preliminaries: Basic Multi-Stage Schemes\n\nIn this section, we present two simple multi-stage mechanisms to improve the worst-case complexities\nof stochastic optimization methods, before introducing acceleration principles.\n\nBasic restart with mini-batching or decaying step sizes. Consider an optimization method M\nwith convergence rate (2) and assume that there exists a hyper-parameter to control a trade-off\nbetween the bias B\u03c32 and the computational complexity. Speci\ufb01cally, we assume that the bias can be\nreduced by an arbitrary factor \u03b7 < 1, while paying a factor 1/\u03b7 in terms of complexity per iteration\n(or \u03c4 may be reduced by a factor \u03b7, thus slowing down convergence). This may occur in two cases:\n\n\u2022 by using a mini-batch of size 1/\u03b7 to sample gradients, which replaces \u03c32 by \u03b7\u03c32;\n\u2022 or the method uses a step size proportional to \u03b7 that can be chosen arbitrarily small.\n\nFor instance, stochastic gradient descent with constant step size and iterate averaging is compatible\nwith both scenarios [28]. Then, consider a target accuracy \u03b5 and de\ufb01ne the sequences \u03b7k = 1/2k\nand \u03b5k = 2B\u03c32\u03b7k for k \u2265 0. We may now solve successively the problem up to accuracy \u03b5k\u2014e.g.,\nwith a constant number O(1/\u03c4 ) steps of M when using mini-batches of size 1/\u03b7k = 2k to reduce\nthe bias\u2014and by using the solution of iteration k\u20131 as a warm restart. 
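The restart scheme just described can be sketched as follows, assuming a toy quadratic oracle; the constant per-stage budget and mini-batch doubling mirror the construction above, but all constants are illustrative choices of ours:

```python
import numpy as np

def restarted_sgd(grad_oracle, x0, L, mu, n_stages, rng):
    # Stage k: run a constant budget of averaged SGD steps with step 1/L and
    # mini-batch size 2**k (halving the bias B * sigma^2 each stage), using the
    # previous stage's output as a warm restart.
    x = x0.copy()
    budget = int(np.ceil(L / mu))          # O(1/tau) inner steps per stage
    for k in range(n_stages):
        batch = 2 ** k
        z, avg = x.copy(), x.copy()
        for t in range(1, budget + 1):
            g = np.mean([grad_oracle(z, rng) for _ in range(batch)], axis=0)
            z = z - g / L
            avg += (z - avg) / (t + 1)     # iterate averaging within the stage
        x = avg                            # warm restart for the next stage
    return x

rng = np.random.default_rng(0)
D = np.array([1.0, 4.0])                   # mu = 1, L = 4
oracle = lambda x, r: D * x + 0.5 * r.standard_normal(2)
x = restarted_sgd(oracle, np.array([3.0, 3.0]), L=4.0, mu=1.0, n_stages=8, rng=rng)
```

Each stage halves the target accuracy εk while the mini-batch doubles, so the total complexity is dominated by the last stage, as in the bound that follows.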
As shown in Appendix B, the scheme converges, and the worst-case complexity to achieve the accuracy ε in expectation is\n\nO( (1/τ) log(C(F(x0) − F⋆)/ε) + Bσ² log(2C)/(τε) ). (5)\n\nFor instance, one may run SGD with constant step size ηk/L at stage k with iterate averaging as in [28], which yields B = 1/L, C = 1, and τ = µ/L. Then, the left term is the classical complexity O((L/µ) log(1/ε)) of the (unaccelerated) gradient descent algorithm for deterministic objectives, whereas the right term is the optimal complexity for stochastic optimization in O(σ²/(µε)). Similar restart principles appear for instance in [4] in the design of a multistage accelerated SGD algorithm.\n\nRestart: from sub-linear to linear rate with strong convexity. A natural question is whether asking for a linear rate in (2) for strongly convex problems is a strong requirement. Here, we show that a sublinear rate is in fact sufficient for our needs by generalizing a restart technique introduced in [18] for stochastic optimization, which was previously used for deterministic objectives in [24]. Specifically, consider an optimization method M such that the convergence rate (2) is replaced by\n\nE[h(zt) − h⋆] ≤ D‖z0 − z⋆‖²/(2t^d) + Bσ²/2, (6)\n\nwhere D, d > 0 and z⋆ is a minimizer of h. Assume now that h is µ-strongly convex with D ≥ µ and consider restarting s times the method M, each time running M for a constant number t′ = ⌈(2D/µ)^(1/d)⌉ of iterations. Then, it may be shown (see Appendix B) that the relation (2) holds with t = st′, τ = 1/(2t′), and C = 1. 
If a mini-batch or step size mechanism is available, we may then proceed\nas before and obtain a converging scheme with complexity (5), e.g., by using mini-batches of\nexponentially increasing sizes once the method reaches a noise-dominated region, and by using a\nrestart frequency of order O(1/\u03c4 ).\n\n4 Generic Multi-Stage Approaches with Acceleration\n\nWe are now in shape to introduce a generic acceleration framework that generalizes (4). Speci\ufb01cally,\ngiven some point yk\u20131 at iteration k, we consider a surrogate function hk related to a parameter \u03ba > 0,\nan approximation error \u03b4k \u2265 0, and an optimization method M that satisfy the following properties:\n\n(H1) hk is (\u03ba + \u00b5)-strongly convex, where \u00b5 is the strong convexity parameter of f ;\n(H2) E[hk(x)|Fk\u20131] \u2264 F (x) + \u03ba\n(H3) M can provide the exact minimizer x\u22c6\n\n2kx \u2212 yk\u20131k2 for x = \u03b1k\u20131x\u22c6 + (1 \u2212 \u03b1k\u20131)xk\u20131, which is\ndeteministic given the past information Fk\u20131 up to iteration k\u20131 and \u03b1k\u20131 is given in Alg. 1;\nk) such that\nE[F (xk)] \u2264 E[h\u22c6\n\nk of hk and a point xk (possibly equal to x\u22c6\n\nk] + \u03b4k where h\u22c6\n\nk = minx hk(x).\n\nThe generic acceleration framework is presented in Algorithm 1. Note that the conditions on hk\nbear similarities with estimate sequences introduced by Nesterov [40]; indeed, (H3) is a direct\ngeneralization of (2.2.2) from [40] and (H2) resembles (2.2.1). However, the choices of hk and the\nproof technique are signi\ufb01cantly different, as we will see with various examples below. 
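Before the formal framework, the following sketch previews its mechanics on the simplest possible surrogate: an exactly minimized gradient-step model with ψ = 0 and κ = L − µ, for which the extrapolation recovers a Nesterov-style accelerated gradient method (the toy problem and constants are our own illustrative assumptions):

```python
import numpy as np

def accelerated_gradient(grad, x0, L, mu, n_iters):
    # Sketch of the generic scheme with an exactly minimized gradient-step
    # surrogate (psi = 0, kappa = L - mu): x_k is a gradient step from y_{k-1},
    # alpha_k solves alpha^2 = (1 - alpha) * alpha_prev^2 + q * alpha, and
    # y_k extrapolates with beta = alpha_prev * (1 - alpha_prev) / (alpha_prev^2 + alpha).
    q = mu / L
    alpha_prev = np.sqrt(q) if mu > 0 else 1.0
    x_prev = x0.copy()
    y = x0.copy()
    for _ in range(n_iters):
        x = y - grad(y) / L                      # exact surrogate minimizer
        b = alpha_prev ** 2 - q
        alpha = (-b + np.sqrt(b ** 2 + 4 * alpha_prev ** 2)) / 2.0
        beta = alpha_prev * (1 - alpha_prev) / (alpha_prev ** 2 + alpha)
        y = x + beta * (x - x_prev)              # extrapolation step
        x_prev, alpha_prev = x, alpha
    return x

D = np.array([1.0, 100.0])                       # mu = 1, L = 100
x = accelerated_gradient(lambda v: D * v, np.array([1.0, 1.0]),
                         L=100.0, mu=1.0, n_iters=200)
```

When µ > 0, the recursion keeps αk = √q constant and β = (1 − √q)/(1 + √q), the classical momentum of Nesterov's method for strongly convex objectives.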
We also assume at the moment that the exact minimizer x⋆k of hk is available, which differs from the Catalyst framework [33]; the case with approximate minimization will be presented in Section 4.1.\n\nAlgorithm 1 Generic Acceleration Framework with Exact Minimization of hk\n1: Input: x0 (initial estimate); M (optimization method); µ (strong convexity constant); κ (parameter for hk); K (number of iterations); (δk)k≥0 (approximation errors);\n2: Initialization: y0 = x0; q = µ/(µ + κ); α0 = 1 if µ = 0 or α0 = √q if µ ≠ 0;\n3: for k = 1, . . . , K do\n4: Consider a surrogate hk satisfying (H1), (H2) and obtain xk, x⋆k using M satisfying (H3);\n5: Compute αk in (0, 1) by solving the equation αk² = (1 − αk)αk−1² + qαk;\n6: Update the extrapolated sequence\n\nyk = x⋆k + βk(x⋆k − xk−1) + ((κ + µ)(1 − αk)/κ)(xk − x⋆k) with βk = αk−1(1 − αk−1)/(αk−1² + αk). (7)\n\n7: end for\n8: Output: xk (final estimate).\n\nProposition 1 (Convergence analysis for Algorithm 1). Consider Algorithm 1. Then,\n\nE[F(xk) − F⋆] ≤ (1 − √q)^k ( 2(F(x0) − F⋆) + Σ_{j=1..k} (1 − √q)^(−j) δj ) if µ ≠ 0, and\n\nE[F(xk) − F⋆] ≤ (2/(k+1)²) ( κ‖x0 − x⋆‖² + Σ_{j=1..k} δj (j+1)² ) otherwise. (8)\n\nThe proof of the proposition is given in Appendix C and is based on an extension of the analysis of Catalyst [33]. Next, we present various application cases leading to algorithms with acceleration.\n\nAccelerated proximal gradient method. 
When f is deterministic and the proximal operator of \u03c8\n(see Appendix A for the de\ufb01nition) can be computed in closed form, choose \u03ba = L \u2212 \u00b5 and de\ufb01ne\n(9)\n\nhk(x) := f (yk\u20131) + \u2207f (yk\u20131)\u22a4(x \u2212 yk\u20131) +\nConsider M that minimizes hk in closed form: xk = x\u22c6\nL\u2207f (yk\u20131)(cid:3). Then,\n(H1) is obvious; (H2) holds from the convexity of f , and (H3) with \u03b4k = 0 follows from classical\ninequalities for L-smooth functions [40]. Finally, we recover accelerated convergence rates [5, 40].\n\nk = Prox\u03c8/L(cid:2)yk\u20131 \u2212 1\n\nL\n2 kx \u2212 yk\u20131k2 + \u03c8(x).\n\nAccelerated proximal point algorithm. We consider hk given in (4) with exact minimization (thus\nan unrealistic setting, but conceptually interesting) with \u03ba = L \u2212 \u00b5. Then, the assumptions (H1),\n(H2), and (H3) are satis\ufb01ed with \u03b4k = 0 and we recover the accelerated rates of [20].\nAccelerated stochastic gradient descent with prox. A more interesting choice of surrogate is\n\nhk(x) := f (yk\u20131) + g\u22a4k (x \u2212 yk\u20131) +\n\n\u03ba + \u00b5\n\n2\n\nkx \u2212 yk\u20131k2 + \u03c8(x),\n\n(10)\n\nwhere \u03ba \u2265 L \u2212 \u00b5 and gk is an unbiased estimate of \u2207f (yk\u20131)\u2014that is, E[gk|Fk\u20131] = \u2207f (yk\u20131)\u2014\nwith variance bounded by \u03c32, following classical assumptions from the stochastic optimization\nliterature [17, 18, 23]. Then, (H1) and (H2) are satis\ufb01ed given that f is convex. To characterize (H3),\nconsider M that minimizes hk in closed form: xk = x\u22c6\n\u03ba+\u00b5 gk], and de\ufb01ne\nuk\u20131 := Prox\u03c8/(\u03ba+\u00b5)[yk\u20131 \u2212 1\n\u03ba+\u00b5\u2207f (yk\u20131)], which is deterministic given Fk\u20131. 
Then, from (10),\n\nF(xk) ≤ hk(xk) + (∇f(yk−1) − gk)⊤(xk − yk−1) (from L-smoothness of f)\n\n= h⋆k + (∇f(yk−1) − gk)⊤(xk − uk−1) + (∇f(yk−1) − gk)⊤(uk−1 − yk−1).\n\nWhen taking expectations, the last term on the right disappears since E[gk|Fk−1] = ∇f(yk−1):\n\nE[F(xk)] ≤ E[h⋆k] + E[‖gk − ∇f(yk−1)‖ ‖xk − uk−1‖] ≤ E[h⋆k] + (1/(κ + µ)) E[‖gk − ∇f(yk−1)‖²] ≤ E[h⋆k] + σ²/(κ + µ), (11)\n\nwhere we used the non-expansiveness of the proximal operator [37]. Therefore, (H3) holds with δk = σ²/(κ + µ). The resulting algorithm is similar to [28] and offers the same guarantees. The novelty of our approach is then a unified convergence proof for the deterministic and stochastic cases.\n\nCorollary 2 (Complexity of proximal stochastic gradient algorithm, µ > 0). Consider Algorithm 1 with hk defined in (10). When f is µ-strongly convex, choose κ = L − µ. Then,\n\nE[F(xk) − F⋆] ≤ (1 − √(µ/L))^k (F(x0) − F⋆) + σ²/√(µL),\n\nwhich is of the form (2) with τ = √(µ/L) and B = 1/√(µL). Interestingly, the optimal complexity O( √(L/µ) log((F(x0) − F⋆)/ε) + σ²/(µε) ) can be obtained by using the first restart strategy presented in Section 3, see Eq. (5), either by using increasing mini-batches or decreasing step sizes. When the objective is convex, but not strongly convex, Proposition 1 gives a bias term O(σ²k/κ) that increases linearly with k. 
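The closed-form minimizer of the surrogate (10) used in this derivation is a stochastic proximal-gradient step; here is a minimal sketch when ψ is a scaled ℓ1-norm (the ℓ1 instantiation and all names are our own illustrative choices):

```python
import numpy as np

def soft_threshold(v, lam):
    # Proximal operator of lam * ||.||_1, applied coordinate-wise.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def surrogate_min_step(y, g, kappa_plus_mu, lam):
    # Closed-form minimizer of the stochastic surrogate (10) when psi is
    # lam * ||.||_1: x_k = prox_{psi/(kappa+mu)}[y_{k-1} - g_k/(kappa+mu)],
    # where g_k is an unbiased stochastic gradient estimate at y_{k-1}.
    return soft_threshold(y - g / kappa_plus_mu, lam / kappa_plus_mu)
```

With lam = 0 this reduces to a plain stochastic gradient step with step size 1/(κ + µ), matching the deterministic comparison point uk−1 used in the proof.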
Yet, the following corollary exhibits an optimal rate with finite horizon, when both σ² and an upper bound on ‖x0 − x⋆‖² are available. Even though non-practical, the result shows that our analysis recovers the optimal dependency on the noise level, as do [18, 28] and others.\n\nCorollary 3 (Complexity of proximal stochastic gradient algorithm, µ = 0). Consider a fixed budget K of iterations of Algorithm 1 with hk defined in (10). When κ = max(L, σ(K + 1)^(3/2)/‖x0 − x⋆‖),\n\nE[F(xK) − F⋆] ≤ 2L‖x0 − x⋆‖²/(K + 1)² + 3σ‖x0 − x⋆‖/√(K + 1).\n\nWhile all the previous examples use the choice xk = x⋆k, we will see in Section 4.2 cases where we may choose xk ≠ x⋆k. Before that, we introduce a variant when x⋆k is not available.\n\nIn principle, it is possible to design other surrogates, which would lead to new algorithms coming with convergence guarantees given by Propositions 1 and 4, but the given examples (4), (9), and (10) already cover all important cases considered in the paper for functions of the form (1).\n\n4.1 Variant with Inexact Minimization\n\nIn this variant, presented in Algorithm 2, x⋆k is not available and we assume that M also satisfies:\n\n(H4) given εk ≥ 0, M can provide a point xk such that E[hk(xk) − h⋆k] ≤ εk.\n\nAlgorithm 2 Generic Acceleration Framework with Inexact Minimization of hk\n1: Input: same as Algorithm 1;\n2: Initialization: y0 = x0; q = µ/(µ + κ); α0 = 1 if µ = 0 or α0 = √q if µ ≠ 0;\n3: for k = 1, . . . , K do\n4: Consider a surrogate hk satisfying (H1), (H2) and obtain xk satisfying (H4);\n5: Compute αk in (0, 1) by solving the equation αk² = (1 − αk)αk−1² + qαk;\n6: Update the extrapolated sequence yk = xk + βk(xk − xk−1) with βk defined in (7);\n7: end for\n8: Output: xk (final estimate).\n\nThe next proposition, proven in Appendix C, gives us some insight on how to achieve acceleration.\n\nProposition 4 (Convergence analysis for Algorithm 2). Consider Alg. 2. Then, for any γ ∈ (0, 1],\n\nE[F(xk) − F⋆] ≤ (1 − √q/2)^k ( 2(F(x0) − F⋆) + 4 Σ_{j=1..k} (1 − √q/2)^(−j) (δj + εj/√q) ) if µ ≠ 0, and\n\nE[F(xk) − F⋆] ≤ (2e^(1+γ)/(k+1)²) ( κ‖x0 − x⋆‖² + Σ_{j=1..k} ( (j+1)² δj + (j+1)^(3+γ) εj/γ ) ) if µ = 0.\n\nTo maintain the accelerated rate, the sequence (δk)k≥0 needs to converge at a similar speed as in Proposition 1, but the dependency on εk is slightly worse. Specifically, when µ is positive, we may have both (εk)k≥0 and (δk)k≥0 decreasing at a rate O((1 − ρ)^k) with ρ < √q/2, but we pay a factor (1/√q) compared to (8). When µ = 0, the accelerated O(1/k²) rate is preserved whenever εk = O(1/k^(4+2γ)) and δk = O(1/k^(3+γ)), but we pay a factor O(1/γ) compared to (8).\n\nCatalyst [33]. When using hk defined in (4), we recover the convergence rates of [33]. In such a case δk = εk since E[F(xk)] ≤ E[hk(xk)] ≤ E[h⋆k] + δk. 
In order to analyze the complexity of\nminimizing each hk with M and derive the global complexity of the multi-stage algorithm, the next\nproposition, proven in Appendix C, characterizes the quality of the initialization xk\u20131.\nProposition 5 (Warm restart for Catalyst). Consider Alg. 2 with hk de\ufb01ned in (4). Then, for k \u2265 2,\n(12)\n\n3\u03b5k\u20131\n\n+ 54\u03ba max(cid:0)kxk\u20131 \u2212 x\u22c6k2,kxk\u20132 \u2212 x\u22c6k2,kxk\u20133 \u2212 x\u22c6k2(cid:1) ,\n\nE[hk(xk\u20131) \u2212 h\u22c6\n\nk] \u2264\n\n2\n\n\u00b5\n\nk] = O( \u03ba\n\nwhere x\u20131 = x0. Following [33], we may now analyze the global complexity. For instance, when f\n\nis \u00b5-strongly convex, we may choose \u03b5k = O((1 \u2212 \u03c1)k(F (x0) \u2212 F \u22c6)) with \u03c1 = \u221aq/3. Then, it\nis possible to show that Proposition (4) yields E[F (xk) \u2212 F \u22c6] = O(\u03b5k/q) and from the inequality\n2kxk \u2212 x\u22c6k2 \u2264 F (xk) \u2212 F \u22c6 and (12), we have E[hk(xk\u20131) \u2212 h\u22c6\n\u00b5q \u03b5k\u20131) = O(\u03b5k\u20131/q2).\nConsider now a method M that behaves as (2). When \u03c3 = 0, xk can be obtained in O(log(1/q)/\u03c4 ) =\n\u02dcO(1/\u03c4 ) iterations of M after initializing with xk\u20131. This allows us to obtain the global complexity\n\u02dcO((1/\u03c4\u221aq) log(1/\u03b5)). For example, when M is the proximal gradient descent method, \u03ba = L and\n\u03c4 = (\u00b5 + \u03ba)/(L + \u03ba) yield the global complexity \u02dcO(pL/\u00b5 log(1/\u03b5)) of an accelerated method.\nOur results improve upon Catalyst [33] in two aspects that are crucial for stochastic optimization:\n(i) we allow the sub-problems to be solved in expectation, whereas Catalyst requires the stronger\ncondition hk(xk) \u2212 h\u22c6\nk \u2264 \u03b5k; (ii) Proposition 5 removes the requirement of [33] to perform a full\ngradient step for initializing the method M in the composite case (see Prop. 
12 in [33]).\nProximal gradient descent with inexact prox [45]. The surrogate (10) with inexact minimization\ncan be treated in the same way as Catalyst, which provides a uni\ufb01ed proof for both problems. Then,\nwe recover the results of [45], while allowing inexact minimization to be performed in expectation.\n\nStochastic Catalyst. With Proposition 5, we are in shape to consider stochastic problems when\n\nusing a method M that converges linearly as (2) with \u03c32 6= 0 for minimizing hk. As in Section 3,\n\n6\n\n\fwe also assume that there exists a mini-batch/step-size parameter \u03b7 that can reduce the bias by a\nfactor \u03b7 < 1 while paying a factor 1/\u03b7 in terms of inner-loop complexity. As above, we discuss the\nstrongly-convex case and choose the same sequence (\u03b5k)k\u22650. In order to minimize hk up to accuracy\n\u03b5k, we set \u03b7k = min(1, \u03b5k/(2B\u03c32)) such that \u03b7kB\u03c32 \u2264 \u03b5k/2. Then, the complexity to minimize hk\nwith M when using the initialization xk\u20131 becomes \u02dcO(1/\u03c4 \u03b7k), leading to the global complexity\n\n\u02dcO(cid:18) 1\n\u03c4\u221aq\n\nlog(cid:18) F (x0) \u2212 F \u22c6\n\n\u03b5\n\n(cid:19) +\n\nB\u03c32\n\nq3/2\u03c4 \u03b5(cid:19) .\n\n(13)\n\nDetails about the derivation are given in Appendix B. The left term corresponds to the Catalyst\naccelerated rate, but it may be shown that the term on the right is sub-optimal. Indeed, consider M to\nbe ISTA with \u03ba = L\u2212\u00b5. Then, B = 1/L, \u03c4 = O(1), and the right term becomes \u02dcO((pL/\u00b5)\u03c32/\u00b5\u03b5),\nwhich is sub-optimal by a factorpL/\u00b5. Whereas this result is a negative one, suggesting that Catalyst\n\nis not robust to noise, we show in Section 4.2 how to circumvent this for a large class of algorithms.\n\nAccelerated stochastic proximal gradient descent with inexact prox. 
Finally, consider hk de-\n\ufb01ned in (10) but the proximal operator is computed approximately, which, to our knowledge, has never\nbeen analyzed in the stochastic context. Then, it may be shown (see Appendix B for details) that, even\nthough x\u22c6\nk is not available, Proposition 4 holds nonetheless with \u03b4k = 2\u03b5k+3\u03c32/(2(\u03ba + \u00b5)). Then, an\ninteresting question is how small should \u03b5k be to guarantee the optimal dependency with respect to \u03c32\nas in Corollary 2. In the strongly-convex case, Proposition 4 simply gives \u03b5k = O(\u221aq\u03c32/(\u03ba + \u00b5))\nsuch that \u03b4k \u2248 \u03b5k/\u221aq.\n4.2 Exploiting methods M providing strongly convex surrogates\nAmong various application cases, we have seen an extension of Catalyst to stochastic problems. To\nachieve convergence, the strategy requires a mechanism to reduce the bias B\u03c32 in (2), e.g., by using\nmini-batches or decreasing step sizes. Yet, the approach suffers from two issues: (i) some of the\nparameters are based on unknown quantities such as \u03c32; (ii) the worst-case complexity exhibits a\nsub-optimal dependency in \u03c32, typically of order 1/\u221aq when \u00b5 > 0. Whereas practical workarounds\nfor the \ufb01rst point are discussed in Section 5, we now show how to solve the second one in some\ncases, by using Algorithm 1 with an optimization method M, which is able not only to minimize an\nauxiliary objective Hk, but also at the same time is able to provide a model hk, typically a quadratic\nfunction, which is easy to minimize. 
Consider then a method M satisfying (2) which produces, after T steps, a point xk and a surrogate hk such that

$$\mathbb{E}[H_k(x_k)-h_k^\star] \leq C(1-\tau)^T\big(H_k(x_{k-1})-H_k^\star+\xi_{k-1}\big)+B\sigma^2 \quad \text{with} \quad H_k(x)=F(x)+\frac{\kappa}{2}\|x-y_{k-1}\|^2, \tag{14}$$

where Hk is approximately minimized by M, hk is a model of Hk that satisfies (H1), (H2) and that can be minimized in closed form, and ξk−1 = O(E[F(xk−1) − F⋆]); it is easy to show that (H3) is also satisfied with the choice δk = C(1 − τ)^T (Hk(xk−1) − H⋆k + ξk−1) + Bσ² since E[F(xk)] ≤ E[Hk(xk)] ≤ E[h⋆k] + δk. In other words, M is used to perform approximate minimization of Hk, but we consider cases where M also provides another surrogate hk with a closed-form minimizer that satisfies the conditions required to use Algorithm 1, which has better convergence guarantees than Algorithm 2 (the same convergence rate with a better factor).

As shown in Appendix D, even though (14) looks technical, a large class of optimization techniques is able to satisfy it, including many variants of proximal stochastic gradient descent with variance reduction such as SAGA [13], MISO [35], SDCA [47], or SVRG [53].

Whereas (14) seems to be a minor modification of (2), an important consequence is that it allows us to gain a factor 1/√q in complexity when µ > 0, corresponding precisely to the sub-optimality factor. Therefore, even though the surrogate Hk needs only be minimized approximately, condition (14) allows us to use Algorithm 1 instead of Algorithm 2. Since the dependency with respect to δk is better than with respect to εk (by 1/√q), we then have the following result:

Proposition 6 (Stochastic Catalyst with Optimality Gaps, µ > 0).
Consider Algorithm 1 with a method M and surrogate hk satisfying (14) when M is used to minimize Hk by using xk−1 as a warm restart. Assume that f is µ-strongly convex and that there exists a parameter η that can reduce the bias Bσ² by a factor η < 1 while paying a factor 1/η in terms of inner-loop complexity. Choose δk = O((1 − √q/2)^k (F(x0) − F⋆)) and ηk = min(1, δk/(2Bσ²)). Then, the complexity to solve (14) and compute xk is Õ(1/(τηk)), and the global complexity to obtain E[F(xk) − F⋆] ≤ ε is

$$\tilde{O}\left(\frac{1}{\tau\sqrt{q}}\log\left(\frac{F(x_0)-F^\star}{\varepsilon}\right)+\frac{B\sigma^2}{q\tau\varepsilon}\right).$$

The term on the left is the accelerated rate of Catalyst for deterministic problems, whereas the term on the right is potentially optimal for strongly convex problems, as illustrated in the next table. Indeed, we provide practical choices for the parameter κ, leading to various values of B, τ, q, for the proximal stochastic gradient descent method with iterate averaging, as well as for variants of SAGA, MISO, SVRG that can cope with stochastic perturbations, which are discussed in Appendix D. All the values below are given up to universal constants to simplify the presentation.

| Method M | hk | κ | τ | B | q | Complexity after Catalyst |
| prox-SGD | (10) | L − µ | 1/2 | 1/L | µ/L | Õ(√(L/µ) log(F0/ε) + σ²/(µε)) |
| SAGA/MISO/SVRG with L/n ≥ µ | (14) | L/n − µ | 1/n | 1/L | µn/L | Õ(√(nL/µ) log(F0/ε) + σ²/(µε)) |

In this table, F0 := F(x0) − F⋆ and the methods SAGA/MISO/SVRG are applied to the stochastic finite-sum problem discussed in Section 1 with n L-smooth functions.
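The two complexity estimates in the table can be evaluated numerically; the following sketch (our own, with illustrative function names, valid only up to the universal constants and logarithmic factors hidden in the Õ notation) reproduces the two rows. Note that, as discussed next, the two σ² values are not comparable across rows.

```python
import math

def catalyst_complexity_sgd(n, L, mu, sigma2, F0, eps):
    """prox-SGD row of the table, up to constants and log factors:
    sqrt(L/mu) * log(F0/eps) + sigma^2 / (mu * eps)."""
    return math.sqrt(L / mu) * math.log(F0 / eps) + sigma2 / (mu * eps)

def catalyst_complexity_vr(n, L, mu, sigma2, F0, eps):
    """SAGA/MISO/SVRG row (valid when L/n >= mu), up to the same factors:
    sqrt(n * L / mu) * log(F0/eps) + sigma^2 / (mu * eps)."""
    return math.sqrt(n * L / mu) * math.log(F0 / eps) + sigma2 / (mu * eps)
```

For n = 1 the two deterministic terms coincide; for a finite sum with L/n ≥ µ, the accelerated term √(nL/µ) should be compared with the (n + L/µ) term of the unaccelerated variance-reduced methods.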
As in the deterministic case, we note that when L/n ≤ µ, there is no acceleration for SAGA/MISO/SVRG since the complexity of the unaccelerated method M is Õ(n log(F0/ε) + σ²/(µε)), which is independent of the condition number and already optimal [28]. In comparison, the logarithmic terms in L, µ that are hidden in the notation Õ do not appear for a variant of the SVRG method with direct acceleration introduced in [28]; here, our approach is more generic. Note also that σ² for prox-SGD and SAGA/MISO/SVRG cannot be compared to each other since the source of randomness is larger for prox-SGD, see [7, 28].

5 Experiments

In this section, we perform numerical evaluations by following [28], which was notably able to make SVRG and SAGA robust to stochastic noise, and to accelerate SVRG. Code to reproduce the experiments is provided with the submission, and more details and experiments are given in Appendix E.

Formulations. Given training data (ai, bi)i=1,...,n, with ai in Rp and bi in {−1, +1}, we consider the optimization problem

$$\min_{x \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^n \phi(b_i a_i^\top x) + \frac{\mu}{2}\|x\|^2,$$

where φ is either the logistic loss φ(u) = log(1 + e^{−u}), or the squared hinge loss φ(u) = (1/2) max(0, 1 − u)², which are both L-smooth, with L = 0.25 for the logistic loss and L = 1 for the squared hinge loss. Studying the squared hinge loss is interesting since its gradients are unbounded on the optimization domain, which may break the bounded noise assumption. The regularization parameter µ acts as the strong convexity constant for the problem and is chosen among the smallest values one would try when performing parameter search, e.g., by cross validation.
Specifically, we consider µ = 1/(10n) and µ = 1/(100n), where n is the number of training points; we also try µ = 1/(1000n) to evaluate the numerical stability of methods on very ill-conditioned problems. Following [7, 28, 54], we consider DropOut perturbations [51]: each component (∇f(x))i is set to 0 with probability δ and to (∇f(x))i/(1 − δ) otherwise. This procedure is motivated by the need for a simple optimization benchmark illustrating stochastic finite-sum problems, where the amount of perturbation is easy to control. The settings used in our experiments are δ = 0 (no noise) and δ ∈ {0.01, 0.1}.

Datasets. We consider three datasets with various numbers of points n and dimensions p. All the data points are normalized to have unit ℓ2-norm. The description comes from [28]:

• alpha is from the Pascal Large Scale Learning Challenge website² and contains n = 250 000 points in dimension p = 500.

• gene consists of gene expression data and the binary labels bi characterize two different types of breast cancer. This is a small dataset with n = 295 and p = 8 141.

• ckn-cifar is an image classification task where each image from the CIFAR-10 dataset³ is represented by using a two-layer unsupervised convolutional neural network [36]. We consider here the binary classification task consisting of predicting the class 1 vs. other classes, and use our algorithms for the classification layer of the network, which is convex. The dataset contains n = 50 000 images and the dimension of the representation is p = 9 216.

Methods. We consider the variants of SVRG and SAGA of [28], which use decreasing step sizes when δ > 0 (otherwise, they do not converge).

² http://largescale.ml.tu-berlin.de/
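The DropOut perturbation described above, applied to the gradient of the ℓ2-regularized logistic objective, can be sketched as follows. This is a minimal NumPy illustration with function names of our choosing, not the code used in the experiments; note that scaling the surviving coordinates by 1/(1 − δ) keeps the perturbed gradient unbiased.

```python
import numpy as np

def logistic_grad(x, A, b, mu):
    """Full gradient of (1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (mu/2)||x||^2,
    where the rows of A are the data points a_i and b holds the labels."""
    z = b * (A @ x)                      # margins b_i a_i^T x
    s = -b / (1.0 + np.exp(z))           # per-sample derivative of the loss
    return A.T @ s / A.shape[0] + mu * x

def dropout_perturb(g, delta, rng):
    """DropOut perturbation [51]: zero each coordinate with probability delta
    and rescale the survivors by 1/(1 - delta), so that E[output] = g."""
    mask = rng.random(g.shape) >= delta
    return g * mask / (1.0 - delta)
```

With δ = 0 the gradient is returned unchanged, matching the noiseless setting of the experiments.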
We use the suffix "-d" each time decreasing step sizes are used. We also consider Katyusha [1] when δ = 0, and the accelerated SVRG method of [28], denoted by acc-SVRG. Then, SVRG-d, SAGA-d, acc-SVRG-d are used with the step-size strategies described in [28], by using the code provided to us by the authors.

Practical questions and implementation. In all setups, we choose the parameter κ according to theory, as described in the previous section, following Catalyst [33]. For composite problems, Proposition 5 suggests using xk−1 as a warm start for inner-loop problems. For smooth ones, [33] shows that other choices such as yk−1 are in fact appropriate and lead to similar complexity results. In our experiments with smooth losses, we use yk−1, which we found to perform consistently better.

The strategy for ηk discussed in Proposition 6 suggests using constant step sizes for a while in the inner loop, typically of order 1/(κ + L) for the methods we consider, before switching to an exponentially decreasing schedule. Unfortunately, even though theory suggests a rate of decay in (1 − √q/2)^k, it does not provide useful insight on when the decay should start, since the theoretical starting time requires knowing σ². A similar issue arises in stochastic optimization techniques involving iterate averaging [9]. We adopt a heuristic similar to the one used in this literature and start decaying after k0 epochs, with k0 = 30. Finally, we discuss the number of iterations of M to perform in the inner loop. When ηk = 1, the theoretical value is of order Õ(1/τ) = Õ(n), and we choose exactly n iterations (one epoch), as in Catalyst [33]. After the step sizes start decreasing (ηk < 1), we use ⌈n/ηk⌉ iterations, according to theory.

Experiments and conclusions. We run each experiment five times with a different random seed and average the results.
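The heuristic above can be summarized as a schedule. The sketch below is our reading of it (function name and signature are ours, and the exact schedule in the released code may differ): a constant step size of order 1/(κ + L) for the first k0 epochs, then decay by the factor (1 − √q/2) per epoch, with the inner-loop length scaled by 1/ηk accordingly.

```python
import math

def step_size_schedule(kappa, L, q, n, k0=30, n_epochs=100):
    """Return, for each outer epoch k, a pair (step size, inner iterations):
    constant step of order 1/(kappa + L) while k < k0, then exponential
    decay with factor (1 - sqrt(q)/2); the inner loop runs n iterations
    while eta_k = 1 and ceil(n / eta_k) once decay has started."""
    base = 1.0 / (kappa + L)
    plan = []
    for k in range(n_epochs):
        eta_k = 1.0 if k < k0 else (1 - math.sqrt(q) / 2) ** (k - k0)
        plan.append((base * eta_k, math.ceil(n / eta_k)))
    return plan
```

The trade-off is visible directly: as the step size shrinks, each outer epoch requires proportionally more inner iterations, which matches the 1/ηk cost factor of the analysis.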
All curves also display one standard deviation. Appendix E contains numerous experiments, where we vary the amount of noise, the type of approach (SVRG vs. SAGA), the amount of regularization µ, and the choice of loss function. In Figure 1, we show a subset of these curves. Most of them show that acceleration may be useful even in the stochastic optimization regime, consistent with [28]. At the same time, accelerated methods may not perform well on very ill-conditioned problems with µ = 1/(1000n), where the sublinear convergence rates for convex optimization (µ = 0) are typically better than the linear rates for strongly convex optimization (µ > 0). However, such ill-conditioned cases are often unrealistic in the context of empirical risk minimization.

[Figure 1: three panels (ckn-cifar, gene, alpha) comparing cat-svrg-d, svrg-d, acc-svrg-d, cat-saga-d, and saga-d.]

Figure 1: Accelerating SVRG-like (top) and SAGA (bottom) methods for ℓ2-logistic regression with µ = 1/(100n) and δ = 0.1. All plots are on a logarithmic scale for the objective function value, and the x-axis denotes the number of epochs. The colored tubes around each curve denote one standard deviation across 5 runs. They do not look symmetric because of the logarithmic scale.

³ https://www.cs.toronto.edu/~kriz/cifar.html

Acknowledgments

This work was supported by the ERC grant SOLARIS (number 714381) and ANR 3IA MIAI@Grenoble Alpes.
The authors would like to thank Anatoli Juditsky for numerous interesting\ndiscussions that greatly improved the quality of this manuscript.\n\nReferences\n\n[1] Z. Allen-Zhu. Katyusha: The \ufb01rst direct acceleration of stochastic gradient methods. In Proceedings of\n\nSymposium on Theory of Computing (STOC), 2017.\n\n[2] Y. Arjevani and O. Shamir. Dimension-free iteration complexity of \ufb01nite sum optimization problems. In\n\nAdvances in Neural Information Processing Systems (NIPS), 2016.\n\n[3] H. Asi and J. C. Duchi. Stochastic (approximate) proximal point methods: Convergence, optimality, and\n\nadaptivity. SIAM Journal on Optimization, 29(3):2257\u20132290, 2019.\n\n[4] N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated\n\nstochastic gradient method. preprint arXiv:1901.08022, 2019.\n\n[5] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.\n\nSIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[6] D. P. Bertsekas.\n\nIncremental proximal methods for large scale convex optimization. Mathematical\n\nProgramming, 129(2):163, 2011.\n\n[7] A. Bietti and J. Mairal. Stochastic optimization with variance reduction for in\ufb01nite datasets with \ufb01nite-sum\n\nstructure. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[8] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information\n\nProcessing Systems (NIPS), 2008.\n\n[9] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM\n\nReview, 60(2):223\u2013311, 2018.\n\n[10] A. Chambolle and T. Pock. A remark on accelerated block coordinate descent for computing the proximity\n\noperators of a sum of convex functions. SMAI Journal of Computational Mathematics, 1:29\u201354, 2015.\n\n[11] M. B. Cohen, J. Diakonikolas, and L. Orecchia. On acceleration with noise-corrupted gradients. 
In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[12] A. d'Aspremont. Smooth optimization with approximate gradient. SIAM Journal on Optimization, 19(3):1171–1183, 2008.

[13] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014.

[14] A. Defazio, T. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014.

[15] O. Devolder. Stochastic first order methods in smooth convex optimization. Technical report, Université catholique de Louvain, 2011.

[16] O. Devolder, F. Glineur, and Y. Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.

[17] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization I: A generic algorithmic framework. SIAM Journal on Optimization, 22(4):1469–1492, 2012.

[18] S. Ghadimi and G. Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization II: Shrinking procedures and optimal algorithms. SIAM Journal on Optimization, 23(4):2061–2089, 2013.

[19] R. M. Gower, P. Richtárik, and F. Bach. Stochastic quasi-gradient methods: Variance reduction via Jacobian sketching. preprint arXiv:1805.02632, 2018.

[20] O. Güler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

[21] J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. II. Springer, 1996.

[22] T. Hofmann, A. Lucchi, S. Lacoste-Julien, and B. McWilliams.
Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems (NIPS), 2015.

[23] C. Hu, W. Pan, and J. T. Kwok. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (NIPS), 2009.

[24] A. Iouditski and Y. Nesterov. Primal-dual subgradient methods for minimizing uniformly convex functions. preprint arXiv:1401.1792, 2014.

[25] J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. Frontiers in Applied Mathematics and Statistics, 3:9, 2017.

[26] D. Kovalev, S. Horvath, and P. Richtarik. Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. preprint arXiv:1901.08689, 2019.

[27] B. Kulis and P. L. Bartlett. Implicit online learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

[28] A. Kulunchakov and J. Mairal. Estimate sequences for stochastic composite optimization: Variance reduction, acceleration, and robustness to noise. preprint arXiv:1901.08788, 2019.

[29] G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365–397, 2012.

[30] G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Mathematical Programming, 171(1–2):167–215, 2018.

[31] G. Lan and Y. Zhou. Random gradient extrapolation for distributed and stochastic optimization. SIAM Journal on Optimization, 28(4):2753–2782, 2018.

[32] H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.

[33] H. Lin, J. Mairal, and Z. Harchaoui. Catalyst acceleration for first-order convex optimization: from theory to practice. Journal of Machine Learning Research (JMLR), 18(212):1–54, 2018.

[34] H. Lin, J. Mairal, and Z.
Harchaoui. An inexact variable metric proximal point algorithm for generic quasi-Newton acceleration. SIAM Journal on Optimization, 29(2):1408–1443, 2019.

[35] J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

[36] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[37] J.-J. Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société Mathématique de France, 93(2):273–299, 1965.

[38] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[39] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

[40] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2004.

[41] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[42] L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

[43] L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtárik, K. Scheinberg, and M. Takáč. SGD and Hogwild! convergence without the bounded gradients assumption. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[44] C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, and Z. Harchaoui. Catalyst acceleration for gradient-based non-convex optimization. preprint arXiv:1703.10993, 2018.

[45] M. Schmidt, N.
Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2011.

[46] D. Scieur, F. Bach, and A. d'Aspremont. Nonlinear acceleration of stochastic algorithms. In Advances in Neural Information Processing Systems (NIPS), 2017.

[47] S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1):105–145, 2016.

[48] P. Toulis, T. Horel, and E. M. Airoldi. Stable Robbins-Monro approximations through stochastic proximal updates. preprint arXiv:1510.00967, 2018.

[49] P. Toulis, D. Tran, and E. Airoldi. Towards stability and optimality in stochastic gradient descent. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[50] P. Tseng. On accelerated proximal gradient methods for convex-concave optimization. Unpublished manuscript, 2008.

[51] S. Wager, W. Fithian, S. Wang, and P. S. Liang. Altitude training: Strong bounds for single-layer dropout. In Advances in Neural Information Processing Systems (NIPS), 2014.

[52] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 11(Oct):2543–2596, 2010.

[53] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[54] S. Zheng and J. T. Kwok. Lightweight stochastic optimization for minimizing finite sums with infinite data. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[55] K. Zhou. Direct acceleration of SAGA using sampled negative momentum. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[56] K. Zhou, F. Shang, and J. Cheng.
A simple stochastic variance reduced algorithm with fast convergence rates. In Proceedings of the International Conference on Machine Learning (ICML), 2018.