{"title": "On Variance Reduction in Stochastic Gradient Descent and its Asynchronous Variants", "book": "Advances in Neural Information Processing Systems", "page_first": 2647, "page_last": 2655, "abstract": "We study optimization algorithms based on variance reduction for stochastic gradientdescent (SGD). Remarkable recent progress has been made in this directionthrough development of algorithms like SAG, SVRG, SAGA. These algorithmshave been shown to outperform SGD, both theoretically and empirically. However,asynchronous versions of these algorithms\u2014a crucial requirement for modernlarge-scale applications\u2014have not been studied. We bridge this gap by presentinga unifying framework that captures many variance reduction techniques.Subsequently, we propose an asynchronous algorithm grounded in our framework,with fast convergence rates. An important consequence of our general approachis that it yields asynchronous versions of variance reduction algorithms such asSVRG, SAGA as a byproduct. Our method achieves near linear speedup in sparsesettings common to machine learning. We demonstrate the empirical performanceof our method through a concrete realization of asynchronous SVRG.", "full_text": "On Variance Reduction in Stochastic Gradient\n\nDescent and its Asynchronous Variants\n\nSashank J. Reddi\n\nCarnegie Mellon University\nsjakkamr@cs.cmu.edu\n\nAhmed Hefny\n\nCarnegie Mellon University\nahefny@cs.cmu.edu\n\nBarnab\u00b4as P\u00b4oczos\n\nCarnegie Mellon University\nbapoczos@cs.cmu.edu\n\nSuvrit Sra\n\nMassachusetts Institute of Technology\n\nsuvrit@mit.edu\n\nAlex Smola\n\nCarnegie Mellon University\n\nalex@smola.org\n\nAbstract\n\nWe study optimization algorithms based on variance reduction for stochastic gra-\ndient descent (SGD). Remarkable recent progress has been made in this direc-\ntion through development of algorithms like SAG, SVRG, SAGA. 
These algorithms have been shown to outperform SGD, both theoretically and empirically. However, asynchronous versions of these algorithms\u2014a crucial requirement for modern large-scale applications\u2014have not been studied. We bridge this gap by presenting a unifying framework for many variance reduction techniques. Subsequently, we propose an asynchronous algorithm grounded in our framework, and prove its fast convergence. An important consequence of our general approach is that it yields asynchronous versions of variance reduction algorithms such as SVRG and SAGA as a byproduct. Our method achieves near linear speedup in sparse settings common to machine learning. We demonstrate the empirical performance of our method through a concrete realization of asynchronous SVRG.

1 Introduction

There has been a steep rise in recent work [6, 7, 9-12, 25, 27, 29] on "variance reduced" stochastic gradient algorithms for convex problems of the finite-sum form:

    min_{x ∈ ℝ^d} f(x) := (1/n) Σ_{i=1}^{n} f_i(x).    (1.1)

Under strong convexity assumptions, such variance reduced (VR) stochastic algorithms attain better convergence rates (in expectation) than stochastic gradient descent (SGD) [18, 24], both in theory and practice.1 The key property of these VR algorithms is that, by exploiting problem structure and by making suitable space-time tradeoffs, they reduce the variance incurred due to stochastic gradients. This variance reduction has powerful consequences: it helps VR stochastic methods attain linear convergence rates, and thereby circumvents slowdowns that usually hit SGD.

1Though we should note that SGD also applies to the harder stochastic optimization problem min F(x) = E[f(x; ξ)], which need not be a finite-sum.

Although these advances have great value in general, for large-scale problems we still require parallel or distributed processing.
And in this setting, asynchronous variants of SGD remain indispensable [2, 8, 13, 21, 28, 30]. Therefore, a key question is how to extend the synchronous finite-sum VR algorithms to asynchronous parallel and distributed settings.

We answer one part of this question by developing new asynchronous parallel stochastic gradient methods that provably converge at a linear rate for smooth strongly convex finite-sum problems. Our methods are inspired by the influential SVRG [10], S2GD [12], SAG [25] and SAGA [6] family of algorithms. We list our contributions more precisely below.

Contributions. Our paper makes two core contributions: (i) a formal general framework for variance reduced stochastic methods based on discussions in [6]; and (ii) asynchronous parallel VR algorithms within this framework. Our general framework presents a formal unifying view of several VR methods (e.g., it includes SAGA and SVRG as special cases) while expressing key algorithmic and practical tradeoffs concisely. Thus, it yields a broader understanding of VR methods, which helps us obtain asynchronous parallel variants of VR methods. Under sparse-data settings common to machine learning problems, our parallel algorithms attain speedups that scale near linearly with the number of processors.

As a concrete illustration, we present a specialization to an asynchronous SVRG-like method. We compare this specialization with non-variance-reduced asynchronous SGD methods, and observe strong empirical speedups that agree with the theory.

Related work. As already mentioned, our work is closest to (and generalizes) SAG [25], SAGA [6], SVRG [10] and S2GD [12], which are primal methods. Also closely related are dual methods such as SDCA [27] and Finito [7], and in its convex incarnation MISO [16]; a more precise relation between these dual methods and VR stochastic methods is described in Defazio's thesis [5].
By their algorithmic structure, these VR methods trace back to classical non-stochastic incremental gradient algorithms [4], but by now it is well recognized that randomization helps obtain much sharper convergence results (in expectation). Proximal [29] and accelerated VR methods have also been proposed [20, 26]; we leave a study of such variants of our framework as future work. Finally, there is recent work on lower bounds for finite-sum problems [1].

Within asynchronous SGD algorithms, both parallel [21] and distributed [2, 17] variants are known. In this paper, we focus our attention on the parallel setting. A different line of methods is that of (primal) coordinate descent methods, and their parallel and distributed variants [14, 15, 19, 22, 23]. Our asynchronous methods share some structural assumptions with these methods. Finally, the recent work [11] generalizes S2GD to the mini-batch setting, thereby also permitting parallel processing, albeit with more synchronization and allowing only small mini-batches.

2 A General Framework for VR Stochastic Methods

We focus on instances of (1.1) where the cost function f(x) has an L-Lipschitz gradient, so that ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, and is μ-strongly convex, i.e., for all x, y ∈ ℝ^d,

    f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (μ/2)‖x − y‖².    (2.1)

While our analysis focuses on strongly convex functions, we can extend it to merely smooth convex functions along the lines of [6, 29].

Inspired by the discussion of a general view of variance reduction techniques in [6], we now describe a formal general framework for variance reduction in stochastic gradient descent. We denote the collection {f_i}_{i=1}^{n} of functions that make up f in (1.1) by F. For our algorithm, we maintain an additional parameter α_i^t ∈ ℝ^d for each f_i ∈ F. We use A^t to denote {α_i^t}_{i=1}^{n}. The general iterative framework for updating the parameters is presented as Algorithm 1.
Observe that the algorithm is still abstract, since it does not specify the subroutine SCHEDULEUPDATE. This subroutine determines the crucial update mechanism of {α_i^t} (and thereby of A^t). As we will see, different schedules give rise to different fast first-order methods proposed in the literature. The part of the update based on A^t is the key to these approaches and is responsible for variance reduction.

Next, we provide different instantiations of the framework and construct a new algorithm derived from it. In particular, we consider the incremental methods SAG [25], SVRG [10] and SAGA [6], and classic gradient descent GRADIENTDESCENT, for demonstrating our framework.

ALGORITHM 1: GENERIC STOCHASTIC VARIANCE REDUCTION ALGORITHM
Data: x^0 ∈ ℝ^d, α_i^0 = x^0 ∀i ∈ [n] ≜ {1, . . . , n}, step size η > 0
Randomly pick I_T = {i_0, . . . , i_T} where i_t ∈ {1, . . . , n} ∀t ∈ {0, . . . , T};
for t = 0 to T do
    Update iterate as x^{t+1} ← x^t − η(∇f_{i_t}(x^t) − ∇f_{i_t}(α_{i_t}^t) + (1/n) Σ_i ∇f_i(α_i^t));
    A^{t+1} = SCHEDULEUPDATE({x^i}_{i=0}^{t+1}, A^t, t, I_T);
end
return x^T

Figure 1 shows the schedules for the aforementioned algorithms. In the case of SVRG, SCHEDULEUPDATE is triggered every m iterations (here m denotes precisely the number of inner iterations used in [10]); so A^t remains unchanged for m iterations, and all α_i^t are updated to the current iterate at the mth iteration. For SAGA, unlike SVRG, A^t changes at the tth iteration for all t ∈ [T]. This change affects only a single element of A^t, and is determined by the index i_t (the function chosen at iteration t). The update of SAG is similar to SAGA insofar as only one of the α_i is updated at each iteration. However, the update for A^{t+1} is based on i_{t+1} rather than i_t. This results in a biased estimate of the gradient, unlike SVRG and SAGA.
Finally, the schedule for gradient descent is similar to SAG, except that all the α_i's are updated at each iteration. Due to the full update, we end up with the exact gradient at each iteration. This discussion highlights how the scheduler determines the resulting gradient method.

To motivate the design of another schedule, let us consider the computational and storage costs of each of these algorithms. For SVRG, since we update A^t only after every m iterations, it is enough to store a full gradient, and hence the storage cost is O(d). However, the running time is O(d) at each iteration and O(nd) at the end of each epoch (for calculating the full gradient at the end of each epoch). In contrast, both SAG and SAGA have a high storage cost of O(nd) and a running time of O(d) per iteration. Finally, GRADIENTDESCENT has a low storage cost of O(d), since it only needs to store the gradient, but a very high computational cost of O(nd) at each iteration.

SVRG has an additional computational overhead at the end of each epoch due to the calculation of the whole gradient. This is avoided in SAG and SAGA at the cost of additional storage. When m is very large, the additional computational overhead of SVRG, amortized over all the iterations, is small. However, as we will later see, this comes at the expense of slower convergence to the optimal solution.
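To make the interplay between the update rule and SCHEDULEUPDATE concrete, the following is a minimal Python sketch of Algorithm 1 with pluggable SVRG and SAGA schedules. It is illustrative only: the function names and the dense n × d anchor array are our simplifications (a real SVRG implementation would store a single snapshot and one full gradient, and the mean over cached gradients would be maintained as a running average).

```python
import numpy as np

def generic_vr_sgd(grad, n, d, schedule, eta=0.1, T=3000, seed=0):
    """Algorithm 1 sketch: x <- x - eta*(grad_i(x) - grad_i(alpha_i) + avg_j grad_j(alpha_j))."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    alpha = np.zeros((n, d))                                   # A^t: one anchor per f_i
    g_alpha = np.array([grad(j, alpha[j]) for j in range(n)])  # cached anchor gradients
    for t in range(T):
        i = int(rng.integers(n))
        v = grad(i, x) - g_alpha[i] + g_alpha.mean(axis=0)     # variance-reduced estimate
        x_new = x - eta * v
        schedule(t, i, x, x_new, alpha, g_alpha, grad)         # SCHEDULEUPDATE
        x = x_new
    return x

def svrg_schedule(m):
    """Refresh every anchor to the current iterate once every m iterations (epoch-based)."""
    def update(t, i, x, x_new, alpha, g_alpha, grad):
        if (t + 1) % m == 0:
            for j in range(len(alpha)):
                alpha[j] = x_new
                g_alpha[j] = grad(j, x_new)
    return update

def saga_schedule(t, i, x, x_new, alpha, g_alpha, grad):
    """Refresh only the anchor of the component sampled at iteration t (to x^t)."""
    alpha[i] = x
    g_alpha[i] = grad(i, x)
```

On a toy least-squares instance of (1.1), both schedules drive the iterate to the minimizer, mirroring the storage/computation trade-offs discussed above: the SVRG schedule touches all anchors once per epoch, while SAGA touches one anchor per step.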
The tradeoffs between the epoch size m, additional storage, frequency of updates, and convergence to the optimal solution are still not completely resolved.

SVRG: SCHEDULEUPDATE({x^i}_{i=0}^{t+1}, A^t, t, I_T)
for i = 1 to n do
    α_i^{t+1} = 1(m | t) x^t + 1(m ∤ t) α_i^t;
end
return A^{t+1}

SAGA: SCHEDULEUPDATE({x^i}_{i=0}^{t+1}, A^t, t, I_T)
for i = 1 to n do
    α_i^{t+1} = 1(i_t = i) x^t + 1(i_t ≠ i) α_i^t;
end
return A^{t+1}

SAG: SCHEDULEUPDATE({x^i}_{i=0}^{t+1}, A^t, t, I_T)
for i = 1 to n do
    α_i^{t+1} = 1(i_{t+1} = i) x^{t+1} + 1(i_{t+1} ≠ i) α_i^t;
end
return A^{t+1}

GD: SCHEDULEUPDATE({x^i}_{i=0}^{t+1}, A^t, t, I_T)
for i = 1 to n do
    α_i^{t+1} = x^{t+1};
end
return A^{t+1}

Figure 1: SCHEDULEUPDATE function for SVRG (top left), SAGA (top right), SAG (bottom left) and GRADIENTDESCENT (bottom right). While SVRG is epoch-based, the rest of the algorithms perform updates at each iteration. Here a | b denotes that a divides b, and 1(·) is the indicator function.

A straightforward approach to designing a new scheduler is to combine the schedules of the above algorithms. This allows us to trade off between the various aforementioned parameters of interest. We call this schedule hybrid stochastic average gradient (HSAG). Here, we use the schedules of SVRG and SAGA to develop HSAG. However, in general, the schedules of any of these algorithms can be combined to obtain a hybrid algorithm. Consider some S ⊆ [n], the indices that follow the SAGA schedule. We assume that the rest of the indices follow an SVRG-like schedule with schedule frequency s_i for all i ∈ S̄ ≜ [n] \ S. Figure 2 shows the corresponding update schedule of HSAG.

HSAG: SCHEDULEUPDATE(x^t, A^t, t, I_T)
for i = 1 to n do
    α_i^{t+1} = { 1(i_t = i) x^t + 1(i_t ≠ i) α_i^t    if i ∈ S
                { 1(s_i | t) x^t + 1(s_i ∤ t) α_i^t    if i ∉ S
end
return A^{t+1}

Figure 2: SCHEDULEUPDATE for HSAG. This algorithm assumes access to some index set S and the schedule frequency vector s. Recall that a | b denotes that a divides b.
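Following Figure 2, the HSAG scheduler is a single rule that dispatches on membership in S: sampled indices in S follow the SAGA refresh, and the remaining indices are refreshed whenever their schedule frequency divides the iteration counter. A small illustrative sketch (the function name and array representation are ours, not part of the paper's implementation):

```python
import numpy as np

def hsag_schedule_update(t, i_t, x, alpha, S, s):
    """One HSAG SCHEDULEUPDATE step, as in Figure 2.

    Indices in S follow the SAGA rule (refresh the anchor of the sampled
    component); indices outside S follow an SVRG-like rule with per-index
    frequency s[i] (refresh whenever s[i] divides t).
    """
    n = len(alpha)
    for i in range(n):
        if i in S:
            if i == i_t:            # SAGA part: 1(i_t = i) x^t + 1(i_t != i) alpha_i^t
                alpha[i] = x.copy()
        else:
            if t % s[i] == 0:       # SVRG part: 1(s_i | t) x^t + 1(s_i !| t) alpha_i^t
                alpha[i] = x.copy()
    return alpha
```

With S covering all indices this reduces to the SAGA schedule, and with S empty and a common frequency s_i = m it reduces to SVRG, which are exactly the two extremes of HSAG noted in the text.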
If S = [n], then HSAG is equivalent to SAGA, while at the other extreme, for S = ∅ and s_i = m for all i ∈ [n], it corresponds to SVRG. HSAG exhibits interesting storage, computational and convergence trade-offs that depend on S. In general, while a large cardinality of S likely incurs a high storage cost, the computational cost per iteration is relatively low. On the other hand, when the cardinality of S is small and the s_i's are large, the storage cost is low but convergence typically slows down.

Before concluding our discussion of the general framework, we would like to draw the reader's attention to the advantages of studying Algorithm 1. First, note that Algorithm 1 provides a unifying framework for many incremental/stochastic gradient methods proposed in the literature. Second, and more importantly, it provides a generic platform for analyzing this class of algorithms. As we will see in Section 3, this helps us develop and analyze asynchronous versions of different finite-sum algorithms under a common umbrella. Finally, it provides a mechanism for deriving new algorithms by designing more sophisticated schedules; as noted above, one such construction gives rise to HSAG.

2.1 Convergence Analysis

In this section, we provide a convergence analysis for Algorithm 1 with HSAG schedules. As observed earlier, SVRG and SAGA are special cases of this setup. Our analysis assumes unbiasedness of the gradient estimates at each iteration, so it does not encompass SAG. For ease of exposition, we assume that s_i = m for all i ∈ [n]. Since HSAG is epoch-based, our analysis focuses on the iterates obtained after each epoch. Similar to [10] (see Option II of SVRG in [10]), our analysis is for the case where the iterate at the end of the (k + 1)st epoch, x^{km+m}, is replaced with an element chosen randomly from {x^{km}, . . . , x^{km+m−1}} with probabilities {p_1, . . . , p_m}. For brevity, we use x̃_k to denote the iterate chosen at the kth epoch.
We also need the following quantity for our analysis:

    G̃_k ≜ (1/n) Σ_{i∈S} [ f_i(α_i^{km}) − f_i(x*) − ⟨∇f_i(x*), α_i^{km} − x*⟩ ].

Theorem 1. For any positive parameters c, β, κ > 1, step size η and epoch size m, we define the following quantities:

    γ = κ [1 − (1 − 1/κ)^m] [ 2cη(1 − Lη(1 + β)) − 2Lcη²(1 + 1/β)(1/n) ],
    θ = max{ (κ [1 − (1 − 1/κ)^m] / γ) [ 2c/κ + 2Lcη²(1 + 1/β)(1/n) ], (1 − 1/κ)^m }.

Suppose the probabilities p_i ∝ (1 − 1/κ)^{m−i}, and that c, β, κ, the step size η and the epoch size m are chosen such that the following conditions are satisfied:

    γ > 0,   θ < 1.

Then, for the iterates of Algorithm 1 under the HSAG schedule, we have

    E[ f(x̃_{k+1}) − f(x*) + (1/γ) G̃_{k+1} ] ≤ θ E[ f(x̃_k) − f(x*) + (1/γ) G̃_k ].

As a corollary, we immediately obtain an expected linear rate of convergence for HSAG.

Corollary 1. Note that G̃_k ≥ 0 and therefore, under the conditions specified in Theorem 1 and with θ̄ = θ(1 + 1/γ) < 1, we have

    E[ f(x̃_k) − f(x*) ] ≤ θ̄^k [ f(x^0) − f(x*) ].

We emphasize that there exist values of the parameters for which the conditions in Theorem 1 and Corollary 1 are easily satisfied. For instance, setting η = 1/(16(μn + L)), κ = 4/(ημ), β = (2μn + L)/L and c = 2/(ηn), the conditions in Theorem 1 are satisfied for sufficiently large m. Additionally, in the high condition number regime of L/μ = n, we can obtain a constant θ < 1 (say 0.5) with an epoch size m = O(n) (similar to [6, 10]).
This leads to a computational complexity of O(n log(1/ε)) for HSAG to achieve ε accuracy in the objective function, as opposed to O(n² log(1/ε)) for the batch gradient descent method. Please refer to the appendix for more details on the parameters in Theorem 1.

3 Asynchronous Stochastic Variance Reduction

We are now ready to present asynchronous versions of the algorithms captured by our general framework. We first describe our setup before delving into the details of these algorithms. Our model of computation is similar to the ones used in Hogwild! [21] and AsySCD [14]. We assume a multicore architecture where each core makes stochastic gradient updates to a centrally stored vector x in an asynchronous manner. There are four key components in our asynchronous algorithm; these are briefly described below.

1. Read: Read the iterate x and compute the gradient ∇f_{i_t}(x) for a randomly chosen i_t.
2. Read schedule iterate: Read the schedule iterate A and compute the gradients required for the update in Algorithm 1.
3. Update: Update the iterate x with the computed incremental update in Algorithm 1.
4. Schedule Update: Run a scheduler update for updating A.

Each processor repeatedly runs these procedures concurrently, without any synchronization. Hence, x may change in between Step 1 and Step 3. Similarly, A may change in between Steps 2 and 4. In fact, the states of the iterates x and A can correspond to different time-stamps. We maintain a global counter t to track the number of updates successfully executed. We use D(t) ∈ [t] and D′(t) ∈ [t] to denote the particular x-iterate and A-iterate used for evaluating the update at the tth iteration. We assume that the delay between the time of evaluation and updating is bounded by a non-negative integer τ, i.e., t − D(t) ≤ τ and t − D′(t) ≤ τ.
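The four components above can be sketched as a shared iterate updated by per-thread workers, with the schedule iterate refreshed at the per-epoch synchronization point. The Python rendering below (names ours) only illustrates the control flow: the paper's actual implementation is in C++ with atomic compare-and-swap updates, and CPython threads do not deliver true parallel speedups.

```python
import threading
import numpy as np

def async_svrg(grad, full_grad, n, d, eta=0.05, epochs=5, m=1000, n_threads=2):
    """Sketch of the four-component asynchronous scheme with an SVRG schedule.

    Threads apply lock-free variance-reduced updates to the shared iterate x;
    the schedule iterate A (a snapshot and its full gradient) is refreshed once
    per epoch, matching the per-epoch synchronization assumed in the analysis.
    """
    x = np.zeros(d)
    for _ in range(epochs):
        x_snap = x.copy()                  # 4. schedule update (synchronized per epoch)
        g_full = full_grad(x_snap)
        def worker(tid):
            rng = np.random.default_rng(tid)
            for _ in range(m // n_threads):
                i = int(rng.integers(n))
                x_read = x.copy()          # 1. read the (possibly stale) iterate
                v = grad(i, x_read) - grad(i, x_snap) + g_full  # 2. read schedule iterate
                x[:] -= eta * v            # 3. lock-free update; concurrent writes may race
        threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_threads)]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
    return x
```

Note that the read in step 1 can be stale by the time step 3 writes, which is exactly the bounded-delay behavior the parameter τ captures.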
The bound on the staleness captures the degree of parallelism in the method: such parameters are typical in asynchronous systems (see e.g., [3, 14]). Furthermore, we also assume that the system is synchronized after every epoch, i.e., D(t) ≥ km for t ≥ km. We would like to emphasize that this assumption is not strong, since such a synchronization needs to be done only once per epoch.

For the purpose of our analysis, we assume a consistent read model. In particular, our analysis assumes that the vector x used for the evaluation of gradients is a valid iterate that existed at some point in time. Such an assumption typically amounts to using locks in practice. This problem can be avoided by using random coordinate updates as in [21] (see Section 4 of [21]), but such a procedure is computationally wasteful in practice. We leave the analysis of the inconsistent read model as future work. Nonetheless, we report results for both locked and lock-free implementations (see Section 4).

3.1 Convergence Analysis

The key ingredients in the success of asynchronous algorithms for multicore stochastic gradient descent are sparsity and "disjointness" of the data matrix [21]. More formally, suppose f_i depends only on x_{e_i} where e_i ⊆ [d], i.e., f_i acts only on the components of x indexed by the set e_i. Let ‖x‖_i² denote Σ_{j∈e_i} ‖x_j‖²; then, the convergence depends on Δ, the smallest constant such that E_i[‖x‖_i²] ≤ Δ‖x‖². Intuitively, Δ denotes the average frequency with which a feature appears in the data matrix. We are interested in situations where Δτ ≪ 1. As a warm-up, let us first discuss the convergence analysis for asynchronous SVRG. The general case is similar, but much more involved. Hence, it is instructive to first go through the analysis of asynchronous SVRG.

Theorem 2.
Suppose the step size η and epoch size m are chosen such that the following condition holds:

    0 < θ_s := [ 1/(μηm) + 4L(η + LΔτ²η²)/(1 − 12L²Δη²τ²) ] / [ 1 − 4L(η + LΔτ²η²)/(1 − 12L²Δη²τ²) ] < 1.

Then, for the iterates of an asynchronous variant of Algorithm 1 with the SVRG schedule and probabilities p_i = 1/m for all i ∈ [m], we have

    E[ f(x̃_{k+1}) − f(x*) ] ≤ θ_s E[ f(x̃_k) − f(x*) ].

The bound obtained in Theorem 2 is useful when Δ is small. To see this, as earlier, consider the indicative case where L/μ = n. The synchronous version of SVRG obtains a convergence rate of θ = 0.5 for step size η = 0.1/L and epoch size m = O(n). For the asynchronous variant of SVRG, by setting η = 0.1/(2 max{1, Δ^{1/2}τ} L), we obtain a similar rate with m = O(n + Δ^{1/2}τn). To obtain this, set η = ρ/L where ρ = 0.1/(2 max{1, Δ^{1/2}τ}) and θ_s = 0.5. Then, a simple calculation gives the following:

    m/n = (2/ρ) (1 − 12Δτ²ρ²) / (1 − 12ρ − 24Δτ²ρ²) ≤ c₀ max{1, Δ^{1/2}τ},

where c₀ is some constant. This follows from the fact that ρ = 0.1/(2 max{1, Δ^{1/2}τ}). Suppose τ < 1/Δ^{1/2}. Then we can achieve nearly the same guarantees as the synchronous version, but τ times faster, since we are running the algorithm asynchronously. For example, consider the sparse setting where Δ = o(1/n); then it is possible to get near linear speedup when τ = o(n^{1/2}). On the other hand, when Δ^{1/2}τ > 1, we can obtain a theoretical speedup of 1/Δ^{1/2}.

We finally provide the convergence result for the asynchronous algorithm in the general case. The proof is complicated by the fact that the set A, unlike in SVRG, changes during the epoch. The key idea is that only a single element of A changes at each iteration. Furthermore, it can only change to one of the iterates in the epoch.
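The sparsity constant Δ that governs these speedups can be read off directly from the support pattern of the data: E_i‖x_{e_i}‖² = Σ_j (c_j/n) x_j², where c_j counts the examples whose support contains feature j, so the tightest admissible Δ is the maximum normalized feature frequency max_j c_j/n. A small sketch (the helper name is ours):

```python
import numpy as np

def sparsity_delta(supports, n_features):
    """Smallest Delta with E_i ||x_{e_i}||^2 <= Delta * ||x||^2.

    supports: list of index sets e_i, one per example. Since
    E_i ||x_{e_i}||^2 = sum_j (c_j / n) * x_j^2 with c_j = #{i : j in e_i},
    the tightest constant is the maximum normalized feature frequency.
    """
    n = len(supports)
    counts = np.zeros(n_features)
    for e in supports:
        for j in e:
            counts[j] += 1
    return counts.max() / n
```

For fully dense data Δ = 1 and the bound gives no parallel gain, whereas when every feature appears in only O(1) examples, Δ = O(1/n) and the τ = o(n^{1/2}) near-linear-speedup regime applies.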
This control provides a handle on the error introduced by staleness. Due to space constraints, the proof is relegated to the appendix.

Theorem 3. For any positive parameters c, β, κ > 1, step size η and epoch size m, we define the following quantities:

    ζ = cη² + cLτ²η³ (1 − 1/κ)^{−τ},
    γ_a = κ [1 − (1 − 1/κ)^m] [ 2cη − 8ζL(1 + β) − 8ζL(1 + 1/β)(1/n) − 96ζLτ(1/n)(1 − 1/κ)^{−τ} ],
    θ_a = max{ (κ [1 − (1 − 1/κ)^m] / γ_a) [ 2c/κ + 8ζL(1 + 1/β)(1/n) + 96ζLτ(1/n)(1 − 1/κ)^{−τ} ], (1 − 1/κ)^m }.

Suppose the probabilities p_i ∝ (1 − 1/κ)^{m−i}, and that the parameters β, κ, step size η and epoch size m are chosen such that the following conditions are satisfied:

    η² ≤ (1 − 1/κ) / (12L²τ²Δ),   γ_a > 0,   θ_a < 1.

Then, for the iterates of the asynchronous variant of Algorithm 1 with the HSAG schedule, we have

    E[ f(x̃_{k+1}) − f(x*) + (1/γ_a) G̃_{k+1} ] ≤ θ_a E[ f(x̃_k) − f(x*) + (1/γ_a) G̃_k ].

Corollary 2.
Note that G̃_k ≥ 0 and therefore, under the conditions specified in Theorem 3 and with θ̄_a = θ_a(1 + 1/γ_a) < 1, we have

    E[ f(x̃_k) − f(x*) ] ≤ θ̄_a^k [ f(x^0) − f(x*) ].

Figure 3: l2-regularized logistic regression. Speedup curves for Lock-Free SVRG and Locked SVRG on rcv1 (left), real-sim (left center), news20 (right center) and url (right) datasets. We report the speedup achieved by increasing the number of threads.

By using a step size normalized by Δ^{1/2}τ (similar to Theorem 2) and parameters similar to the ones specified after Theorem 1, we can show speedups similar to the ones obtained in Theorem 2. Please refer to the appendix for more details on the parameters in Theorem 3.

Before ending our discussion of the theoretical analysis, we would like to highlight an important point. Our emphasis throughout the paper has been on generality. While the results are presented here in full generality, one can obtain stronger results in specific cases. For example, in the case of SAGA, one can obtain per-iteration convergence guarantees (see [6]) rather than the per-epoch guarantees presented in this paper. Also, SAGA can be analyzed without any additional synchronization per epoch. However, there is no qualitative difference in these guarantees accumulated over the epoch.
Furthermore, in this case, our analysis for both the synchronous and asynchronous cases can easily be modified to obtain convergence properties similar to those in [6].

4 Experiments

We present our empirical results in this section. For our experiments, we study the problem of binary classification via l2-regularized logistic regression. More formally, we are interested in the following optimization problem:

    min_x (1/n) Σ_{i=1}^{n} log(1 + exp(−y_i z_i^⊤ x)) + λ‖x‖²,    (4.1)

where z_i ∈ ℝ^d and y_i is the corresponding label for each i ∈ [n]. In all our experiments, we set λ = 1/n. Note that such a choice leads to a high condition number.

A careful implementation of SVRG is required for sparse gradients, since the implementation as stated in Algorithm 1 would lead to dense updates at each iteration. For an efficient implementation, a scheme like the 'just-in-time' update scheme suggested in [25] is required. Due to lack of space, we provide the implementation details in the appendix.

We evaluate the following algorithms in our experiments:

• Lock-Free SVRG: This is the lock-free asynchronous variant of Algorithm 1 using the SVRG schedule; all threads can read and update the parameters without any synchronization. Parameter updates are performed through atomic compare-and-swap instructions [21]. A constant step size that gives the best convergence is chosen for each dataset.

• Locked SVRG: This is the locked version of the asynchronous variant of Algorithm 1 using the SVRG schedule. In particular, we use a concurrent-read exclusive-write locking model, where all threads can read the parameters but only one thread can update the parameters at a given time. The step size is chosen as for Lock-Free SVRG.

• Lock-Free SGD: This is the lock-free asynchronous variant of the SGD algorithm (see [21]). We compare two different versions of this algorithm: (i) SGD with constant step size (referred to as CSGD).
(ii) SGD with decaying step size η₀√(σ₀/(t + σ₀)) (referred to as DSGD), where the constants η₀ and σ₀ specify the scale and speed of the decay. For each of these versions, the step size is tuned for each dataset to give the best convergence progress.

Figure 4: l2-regularized logistic regression. Training loss residual f(x) − f(x*) versus time plot of Lock-Free SVRG, DSGD and CSGD on rcv1 (left), real-sim (left center), news20 (right center) and url (right) datasets. The experiments are parallelized over 10 cores.

All the algorithms were implemented in C++.2 We run our experiments on datasets from the LIBSVM website.3 Similar to [29], we normalize each example in the dataset so that ‖z_i‖₂ = 1 for all i ∈ [n]. Such a normalization leads to an upper bound of 0.25 on the Lipschitz constant of the gradient of f_i. The epoch size m is chosen as 2n (as recommended in [10]) in all our experiments.

In the first experiment, we compare the speedup achieved by our asynchronous algorithm. To this end, for each dataset we first measure the time required for the algorithm to reach an accuracy of 10^{−10} (i.e., f(x) − f(x*) < 10^{−10}). The speedup with P threads is defined as the ratio of the runtime with a single thread to the runtime with P threads.
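For reference, the objective (4.1) and its component gradients used in these experiments can be written down directly; the sketch below uses our own variable names. With ‖z_i‖₂ = 1, each logistic loss term has a gradient-Lipschitz constant of at most 1/4, which is the 0.25 bound quoted above.

```python
import numpy as np

def logistic_objective(x, Z, y, lam):
    """f(x) = (1/n) sum_i log(1 + exp(-y_i z_i^T x)) + lam * ||x||^2   -- eq. (4.1)."""
    margins = y * (Z @ x)
    return np.mean(np.log1p(np.exp(-margins))) + lam * (x @ x)

def component_gradient(x, Z, y, lam, i):
    """Gradient of the i-th summand, log(1 + exp(-y_i z_i^T x)) + lam * ||x||^2."""
    zi, yi = Z[i], y[i]
    sigma = 1.0 / (1.0 + np.exp(yi * (zi @ x)))   # = sigmoid(-y_i z_i^T x)
    return -yi * sigma * zi + 2.0 * lam * x
```

These two functions are all that Algorithm 1 needs from the problem: component_gradient plays the role of ∇f_i, and averaging it over i gives the full gradient used by the SVRG schedule.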
Results in Figure 3 show the speedup on various datasets. As seen in the figure, we achieve significant speedups for all the datasets. Not surprisingly, the speedup achieved by Lock-Free SVRG is much higher than that obtained with locking. Furthermore, the lowest speedup is achieved on the rcv1 dataset. Similar speedup behavior was reported for this dataset in [21]. It should be noted that this dataset is not sparse, and is hence a bad case for the algorithm (similar to [21]).

For the second set of experiments, we compare the performance of Lock-Free SVRG with stochastic gradient descent. In particular, we compare it with the variants of stochastic gradient descent, DSGD and CSGD, described earlier in this section. It is well established that the performance of variance reduced stochastic methods is better than that of SGD. We would like to empirically verify that such benefits carry over to the asynchronous variants of these algorithms. Figure 4 shows the performance of Lock-Free SVRG, DSGD and CSGD. Since the computational complexity of each epoch of these algorithms differs, we directly plot the objective value versus the runtime for each of these algorithms. We use 10 cores for comparing the algorithms in this experiment. As seen in the figure, Lock-Free SVRG outperforms both DSGD and CSGD. The performance gains are qualitatively similar to those reported in [10] for the synchronous versions of these algorithms. It can also be seen that DSGD, not surprisingly, outperforms CSGD in all cases. In our experiments, we observed that Lock-Free SVRG, in comparison to SGD, is much less sensitive to the step size and more robust to an increasing number of threads.

5 Discussion & Future Work

In this paper, we presented a unifying framework based on [6] that captures many popular variance reduction techniques for stochastic gradient descent. We used this framework to develop a simple hybrid variance reduction method.
The primary purpose of the framework, however, was to provide a common platform for analyzing various variance reduction techniques. To this end, we provided a convergence analysis for the framework under certain conditions. More importantly, we proposed an asynchronous algorithm for the framework with provable convergence guarantees. The key consequence of our approach is that we obtain asynchronous variants of several algorithms like SVRG, SAGA and S2GD. Our asynchronous algorithms exploit sparsity in the data to obtain near linear speedup in settings that are typically encountered in machine learning.

For future work, it would be interesting to perform an empirical comparison of various schedules. In particular, it would be worth exploring the space-time-accuracy tradeoffs of these schedules. We would also like to analyze the effect of these tradeoffs on the asynchronous variants.

Acknowledgments. SS was partially supported by NSF IIS-1409802.

2All experiments were conducted on a Google Compute Engine n1-highcpu-32 machine with 32 processors and 28.8 GB RAM.
3http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

Bibliography
[1] A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. arXiv:1410.0723, 2014.
[2] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In Advances in Neural Information Processing Systems, pages 873-881, 2011.
[3] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
[4] D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1-38, 2011.
[5] A. Defazio. New Optimization Methods for Machine Learning. PhD thesis, Australian National University, 2014.
[6] A. Defazio, F. Bach, and S. Lacoste-Julien.
SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS 27, pages 1646–1654, 2014.

[7] A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. arXiv:1407.2710, 2014.

[8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. The Journal of Machine Learning Research, 13(1):165–202, 2012.

[9] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo. A globally convergent incremental Newton method. Mathematical Programming, 151(1):283–313, 2015.

[10] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS 26, pages 315–323, 2013.

[11] J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. arXiv:1504.04407, 2015.

[12] J. Konečný and P. Richtárik. Semi-stochastic gradient descent methods. arXiv:1312.1666, 2013.

[13] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In NIPS 27, pages 19–27, 2014.

[14] J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. In ICML 2014, pages 469–477, 2014.

[15] J. Liu and S. J. Wright. Asynchronous stochastic coordinate descent: Parallelism and convergence properties. SIAM Journal on Optimization, 25(1):351–376, 2015.

[16] J. Mairal. Optimization with first-order surrogate functions. arXiv:1305.3120, 2013.

[17] A. Nedić, D. P. Bertsekas, and V. S. Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8:381–407, 2001.

[18] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro.
Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[19] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

[20] A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In NIPS 27, pages 1574–1582, 2014.

[21] B. Recht, C. Re, S. Wright, and F. Niu. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS 24, pages 693–701, 2011.

[22] S. Reddi, A. Hefny, C. Downey, A. Dubey, and S. Sra. Large-scale randomized-coordinate descent methods with non-separable linear constraints. In UAI 31, 2015.

[23] P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

[24] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.

[25] M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.

[26] S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In NIPS 26, pages 378–385, 2013.

[27] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14(1):567–599, 2013.

[28] O. Shamir and N. Srebro. On distributed stochastic optimization and learning. In Proceedings of the 52nd Annual Allerton Conference on Communication, Control, and Computing, 2014.

[29] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

[30] M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In NIPS, pages 2595–2603, 2010.