Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization

Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu
Department of Computer Science, University of Rochester
{lianxiangru,huangyj0,raingomm,ji.liu.uwisc}@gmail.com

Advances in Neural Information Processing Systems, pages 2737-2745

Abstract

Asynchronous parallel implementations of stochastic gradient (SG) have been broadly used in solving deep neural networks and have achieved many successes in practice recently. However, existing theories cannot explain their convergence and speedup properties, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. To fill these gaps in theory and provide theoretical support, this paper studies two asynchronous parallel implementations of SG: one over a computer network and the other on a shared memory system.
We establish an ergodic convergence rate O(1/√K) for both algorithms and prove that linear speedup is achievable if the number of workers is bounded by √K, where K is the total number of iterations. Our results generalize and improve the existing analysis for convex minimization.

1 Introduction

Asynchronous parallel optimization has recently received many successes and broad attention in machine learning and optimization [Niu et al., 2011, Li et al., 2013, 2014b, Yun et al., 2013, Fercoq and Richtárik, 2013, Zhang and Kwok, 2014, Marecek et al., 2014, Tappenden et al., 2015, Hong, 2014], mainly because asynchronous parallelism largely reduces the system overhead compared to synchronous parallelism. The key idea of asynchronous parallelism is to allow all workers to work independently, with no need for synchronization or coordination. Asynchronous parallelism has been successfully applied to speed up many state-of-the-art optimization algorithms, including stochastic gradient [Niu et al., 2011, Agarwal and Duchi, 2011, Zhang et al., 2014, Feyzmahdavian et al., 2015, Paine et al., 2013, Mania et al., 2015], stochastic coordinate descent [Avron et al., 2014, Liu et al., 2014a, Sridhar et al., 2013], dual stochastic coordinate ascent [Tran et al., 2015], and the randomized Kaczmarz algorithm [Liu et al., 2014b].

In this paper, we are particularly interested in the asynchronous parallel stochastic gradient algorithm (ASYSG) for nonconvex optimization, mainly due to its recent successes and popularity in deep neural networks [Bengio et al., 2003, Dean et al., 2012, Paine et al., 2013, Zhang et al., 2014, Li et al., 2014a] and matrix completion [Niu et al., 2011, Petroni and Querzoni, 2014, Yun et al., 2013]. While some research efforts have been made to study the convergence and speedup properties of ASYSG for convex optimization, very little is known about its properties in nonconvex optimization.
Existing theories cannot explain its convergence and excellent speedup property in practice, mainly due to the nonconvexity of most deep learning formulations and the asynchronous parallel mechanism. It is not even known whether its convergence is certified for nonconvex optimization, although it has been widely used in solving deep neural networks and implemented on different platforms such as computer networks and shared memory (for example, multicore and multi-GPU) systems.

To fill these gaps in theory, this paper makes the first attempt to study ASYSG for the following nonconvex optimization problem:

    min_{x∈R^n} f(x) := E_ξ[F(x; ξ)]    (1)

where ξ ∈ Ξ is a random variable and f(x) is a smooth (but not necessarily convex) function. The most common specification is that Ξ is the index set of all training samples, Ξ = {1, 2, ..., N}, and F(x; ξ) is the loss function with respect to the training sample indexed by ξ.

We consider two popular asynchronous parallel implementations of SG: one for the computer network, originally proposed in [Agarwal and Duchi, 2011], and the other for the shared memory (including multicore/multi-GPU) system, originally proposed in [Niu et al., 2011]. Note that the architecture diversity leads to two different algorithms. The key difference lies in that the computer network can naturally (and efficiently) ensure the atomicity of reading and writing the whole vector of x, while the shared memory system is unable to do that efficiently and usually only ensures efficient atomic reading and writing on a single coordinate of parameter x.
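For concreteness, the finite-sum specification of problem (1) can be sketched as follows; the least-squares loss and the synthetic data are illustrative assumptions of this sketch, not from the paper. Averaging the per-sample gradients G(x; ξ) uniformly over Ξ recovers ∇f(x), which is the unbiasedness property stated later in Assumption 1.

```python
import numpy as np

# Illustrative finite-sum instance of (1): Xi = {0, ..., N-1} indexes the N
# training samples and F(x; xi) = 0.5 * (a_xi^T x - b_xi)^2 is the loss on
# sample xi. The least-squares form and the data are assumed for illustration.
rng = np.random.default_rng(0)
N, n = 100, 5
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)

def full_gradient(x):
    # grad f(x) = (1/N) * sum over xi of grad F(x; xi)
    return A.T @ (A @ x - b) / N

def stochastic_gradient(x, xi):
    # G(x; xi) = grad F(x; xi) for a single sample index xi
    a = A[xi]
    return a * (a @ x - b[xi])

x = rng.standard_normal(n)
# Averaging G(x; xi) over all of Xi recovers grad f(x), i.e. the stochastic
# gradient sampled uniformly from Xi is an unbiased estimator of grad f(x).
avg = np.mean([stochastic_gradient(x, xi) for xi in range(N)], axis=0)
assert np.allclose(avg, full_gradient(x))
```

Sampling ξ uniformly and evaluating G(x; ξ) is what each worker does below; only the unbiasedness, not the identity of the sampler, matters for the analysis.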
The implementation on the computer cluster is described by the "consistent asynchronous parallel SG" algorithm (ASYSG-CON), because the value of parameter x used for stochastic gradient evaluation is consistent: it is an actual value of parameter x at some time point. In contrast, we use the "inconsistent asynchronous parallel SG" algorithm (ASYSG-INCON) to describe the implementation on the shared memory platform, because the value of parameter x used is inconsistent, that is, it might not be the real state of x at any time point.

This paper studies the theoretical convergence and speedup properties of both algorithms. We establish an asymptotic convergence rate of O(1/√(KM)) for ASYSG-CON, where K is the total iteration number and M is the size of the minibatch. The linear speedup¹ is proved to be achievable while the number of workers is bounded by O(√K). For ASYSG-INCON, we establish asymptotic convergence and speedup properties similar to ASYSG-CON. The intuition behind the linear speedup of asynchronous parallelism for SG can be explained as follows. Recall that serial SG essentially uses the "stochastic" gradient as a surrogate for the accurate gradient. ASYSG brings additional deviation from the accurate gradient due to using "stale" (or delayed) information. If this additional deviation is minor relative to the deviation caused by the "stochastic" nature of SG, the total iteration complexity (or convergence rate) of ASYSG is comparable to serial SG, which implies a nearly linear speedup. This is the key reason why ASYSG works.

The main contributions of this paper are highlighted as follows:
• Our result for ASYSG-CON generalizes and improves the earlier analysis of ASYSG-CON for convex optimization in [Agarwal and Duchi, 2011].
Particularly, we improve the upper bound on the maximal number of workers that ensures linear speedup from O(K^{1/4}M^{-3/4}) to O(K^{1/2}M^{-1/2}), that is, by a factor K^{1/4}M^{1/4};
• The proposed ASYSG-INCON algorithm provides a more accurate description than HOGWILD! [Niu et al., 2011] of the lock-free implementation of ASYSG on the shared memory system. Although our result does not strictly dominate the result for HOGWILD! due to different problem settings, our result can be applied to more scenarios (e.g., nonconvex optimization);
• Our analysis provides theoretical (convergence and speedup) guarantees for many recent successes of ASYSG in deep learning. To the best of our knowledge, this is the first work to offer such theoretical support.

Notation. x* denotes the global optimal solution to (1). ‖x‖_0 denotes the ℓ_0 norm of vector x, that is, the number of nonzeros in x; e_i ∈ R^n denotes the ith natural unit basis vector. We use E_{ξ_{k,*}}(·) to denote the expectation with respect to a set of variables {ξ_{k,1}, ..., ξ_{k,M}}. E(·) means taking the expectation with respect to all random variables. G(x; ξ) is used to denote ∇F(x; ξ) for short. We use ∇_i f(x) and (G(x; ξ))_i to denote the ith elements of ∇f(x) and G(x; ξ), respectively.

Assumption. Throughout this paper, we make the following assumptions on the objective function. All of them are quite common in the analysis of stochastic gradient algorithms.

Assumption 1. We assume that the following holds:
• (Unbiased Gradient): The stochastic gradient G(x; ξ) is unbiased, that is to say,

    ∇f(x) = E_ξ[G(x; ξ)]    (2)

¹The speedup for T workers is defined as the ratio between the total work load using one worker and the average work load using T workers to obtain a solution at the same precision.
"The linear speedup is achieved" means that the speedup with T workers is greater than cT for any value of T, where c ∈ (0, 1] is a constant independent of T.

• (Bounded Variance): The variance of the stochastic gradient is bounded:

    E_ξ(‖G(x; ξ) − ∇f(x)‖²) ≤ σ², ∀x.    (3)

• (Lipschitzian Gradient): The gradient function ∇f(·) is Lipschitzian, that is to say,

    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, ∀y.    (4)

Under the Lipschitzian gradient assumption, we can define two more constants, L_s and L_max. Let s be any positive integer. Define L_s to be the minimal constant satisfying the following inequality:

    ‖∇f(x) − ∇f(x + Σ_{i∈S} α_i e_i)‖ ≤ L_s ‖Σ_{i∈S} α_i e_i‖, ∀S ⊂ {1, 2, ..., n} with |S| ≤ s.    (5)

Define L_max as the minimal constant that satisfies

    |∇_i f(x) − ∇_i f(x + α e_i)| ≤ L_max |α|, ∀i ∈ {1, 2, ..., n}.    (6)

It can be seen that L_max ≤ L_s ≤ L.

2 Related Work

This section mainly reviews asynchronous parallel gradient algorithms and asynchronous parallel stochastic gradient algorithms; we refer readers to the long version of this paper² for a review of stochastic gradient algorithms and synchronous parallel stochastic gradient algorithms.

Asynchronous parallel algorithms have received broad attention in optimization recently, although pioneering studies date back to the 1980s [Bertsekas and Tsitsiklis, 1989].
Due to the rapid development of hardware resources, asynchronous parallelism has recently achieved many successes when applied to parallel stochastic gradient [Niu et al., 2011, Agarwal and Duchi, 2011, Zhang et al., 2014, Feyzmahdavian et al., 2015, Paine et al., 2013], stochastic coordinate descent [Avron et al., 2014, Liu et al., 2014a], dual stochastic coordinate ascent [Tran et al., 2015], the randomized Kaczmarz algorithm [Liu et al., 2014b], and ADMM [Zhang and Kwok, 2014]. Liu et al. [2014a] and Liu and Wright [2014] studied the asynchronous parallel stochastic coordinate descent algorithm with consistent read and inconsistent read, respectively, and proved that linear speedup is achievable if T ≤ O(n^{1/2}) for smooth convex functions and T ≤ O(n^{1/4}) for functions of the form "smooth convex loss + nonsmooth convex separable regularization". Avron et al. [2014] studied this asynchronous parallel stochastic coordinate descent algorithm for solving Ax = b, where A is a symmetric positive definite matrix, and showed that linear speedup is achievable if T ≤ O(n) for consistent read and T ≤ O(n^{1/2}) for inconsistent read. Tran et al. [2015] studied a semi-asynchronous parallel version of the Stochastic Dual Coordinate Ascent algorithm which periodically enforces primal-dual synchronization in a separate thread.

We review asynchronous parallel stochastic gradient algorithms last. Agarwal and Duchi [2011] analyzed the ASYSG-CON algorithm (on a computer cluster) for convex smooth optimization and proved a convergence rate of O(1/√(MK) + MT²/K), which implies that linear speedup is achieved when T is bounded by O(K^{1/4}/M^{3/4}). In comparison, our analysis for the more general nonconvex smooth optimization improves this upper bound by a factor K^{1/4}M^{1/4}.
A very recent work [Feyzmahdavian et al., 2015] extended the analysis in Agarwal and Duchi [2011] to minimize functions of the form "smooth convex loss + nonsmooth convex regularization" and obtained similar results. Niu et al. [2011] proposed a lock-free asynchronous parallel implementation of SG on the shared memory system and described this implementation as the HOGWILD! algorithm. They proved a sublinear convergence rate O(1/K) for strongly convex smooth objectives. Another recent work, Mania et al. [2015], analyzed asynchronous stochastic optimization algorithms for convex functions by viewing them as serial algorithms with input perturbed by bounded noise, and proved convergence rates no worse than those obtained from the traditional point of view for several algorithms.

3 Asynchronous parallel stochastic gradient for computer network

This section considers the asynchronous parallel implementation of SG on a computer network proposed by Agarwal and Duchi [2011]. It has been successfully applied in the distributed neural network [Dean et al., 2012] and the parameter server [Li et al., 2014a] to solve deep neural networks.

²http://arxiv.org/abs/1506.08272

3.1 Algorithm Description: ASYSG-CON

Algorithm 1 ASYSG-CON
Require: x_0, K, {γ_k}_{k=0,...,K−1}
Ensure: x_K
1: for k = 0, ..., K − 1 do
2:   Randomly select M training samples indexed by ξ_{k,1}, ξ_{k,2}, ..., ξ_{k,M};
3:   x_{k+1} = x_k − γ_k Σ_{m=1}^{M} G(x_{k−τ_{k,m}}; ξ_{k,m});
4: end for

The "star" in the star-shaped network is a master machine³ which maintains the parameter x. Other machines in the computer network serve as workers, which only communicate with the master.
All workers exchange information with the master independently and simultaneously, basically repeating the following steps:
• (Select): randomly select a subset of training samples S ⊂ Ξ;
• (Pull): pull parameter x from the master;
• (Compute): compute the stochastic gradient g ← Σ_{ξ∈S} G(x; ξ);
• (Push): push g to the master.
The master basically repeats the following steps:
• (Aggregate): aggregate a certain amount of stochastic gradients "g" from workers;
• (Sum): sum all "g"s into a vector Δ;
• (Update): update parameter x by x ← x − γΔ.
While the master is aggregating stochastic gradients from workers, it does not care about the sources of the collected stochastic gradients. As long as the total amount reaches the predefined quantity, the master computes Δ and performs the update on x. The "update" step is performed as an atomic operation (workers cannot read the value of x during this step), which can be efficiently implemented in the network (especially in the parameter server [Li et al., 2014a]). The key difference between this asynchronous parallel implementation of SG and the serial (or synchronous parallel) SG algorithm lies in the "update" step: some stochastic gradients "g" in "Δ" might be computed from an early value of x instead of the current one, while in serial SG, all g's are guaranteed to use the current value of x.

The asynchronous parallel implementation substantially reduces the system overhead and overcomes possibly large network delays, but the cost is using old values of "x" in the stochastic gradient evaluation. We will show in Section 3.2 that the negative effect of this cost vanishes asymptotically.

To mathematically characterize this asynchronous parallel implementation, we monitor parameter x in the master.
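The worker and master loops described above can be sketched with threads and a shared queue. This is a minimal single-machine sketch, not the distributed implementation: the least-squares objective, the 4 workers, the constants, and the bounded queue (which caps how stale a pulled gradient can be) are all assumptions made for illustration, and the master here aggregates one mini-batch gradient per update.

```python
import threading
import queue

import numpy as np

# Illustrative least-squares objective with known minimizer x_star.
rng = np.random.default_rng(1)
N, n, M, K = 200, 4, 8, 1200
A = rng.standard_normal((N, n))
x_star = np.ones(n)
b = A @ x_star

x = np.zeros(n)                 # parameter maintained by the master
x_lock = threading.Lock()       # makes the master's "update" step atomic
grads = queue.Queue(maxsize=8)  # bounded backlog => bounded staleness
stop = threading.Event()

def worker(seed):
    wrng = np.random.default_rng(seed)
    while not stop.is_set():
        with x_lock:                        # (Pull) a snapshot of x; it may be
            x_snap = x.copy()               # stale by the time it is consumed
        idx = wrng.integers(0, N, size=M)   # (Select) a mini-batch
        g = A[idx].T @ (A[idx] @ x_snap - b[idx]) / M   # (Compute)
        try:
            grads.put(g, timeout=0.1)       # (Push)
        except queue.Full:
            pass                            # drop; recompute from a fresher x

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()

gamma = 0.01
for k in range(K):
    g = grads.get()             # (Aggregate): the source of g does not matter
    with x_lock:                # (Update), performed atomically
        x -= gamma * g

stop.set()
for t in threads:
    t.join()

dist = np.linalg.norm(x - x_star)   # well below the initial distance of 2.0
```

Even though many of the consumed gradients were computed at stale snapshots of x, the iterate still approaches x_star, which is the delayed-gradient effect the analysis in Section 3.2 quantifies.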
We use the subscript k to indicate the kth iteration on the master. For example, x_k denotes the value of parameter x after k updates, and so on. We introduce a variable τ_{k,m} to denote the delay of the value of x used in evaluating the mth stochastic gradient at the kth iteration. This asynchronous parallel implementation of SG on the "star-shaped" network is summarized as the ASYSG-CON algorithm; see Algorithm 1. The suffix "CON" is short for "consistent read". "Consistent read" means that the value of x used to compute the stochastic gradient is a real state of x, no matter at which time point. "Consistent read" is ensured by the atomicity of the "update" step. When the atomicity fails, it leads to "inconsistent read", which will be discussed in Section 4. It is worth noting that on some "non-star" structures the asynchronous implementation can also be described as ASYSG-CON in Algorithm 1, for example, the cyclic delayed architecture and the locally averaged delayed architecture [Agarwal and Duchi, 2011, Figure 2].

3.2 Analysis for ASYSG-CON

To analyze Algorithm 1, besides Assumption 1 we make the following additional assumptions.

Assumption 2. We assume that the following holds:
• (Independence): All random variables in {ξ_{k,m}}_{k=0,1,...,K; m=1,...,M} in Algorithm 1 are independent of each other;
• (Bounded Age): All delay variables τ_{k,m} are bounded: max_{k,m} τ_{k,m} ≤ T.

The independence assumption strictly holds if all workers select samples with replacement. Although it might not be satisfied strictly in practice, it is a common assumption made for analysis purposes.

³There could be more than one such machine in some networks, but they all serve the same purpose and can be treated as a single machine.
The bounded delay assumption is much more important. As pointed out before, the asynchronous implementation may use some old value of parameter x to evaluate the stochastic gradient. Intuitively, the age (or "oldness") should not be too large, to ensure convergence. Therefore, it is natural and reasonable to assume an upper bound on the ages. This assumption is commonly used in the analysis of asynchronous algorithms, for example, [Niu et al., 2011, Avron et al., 2014, Liu and Wright, 2014, Liu et al., 2014a, Feyzmahdavian et al., 2015, Liu et al., 2014b]. It is worth noting that the upper bound T is roughly proportional to the number of workers.

Under Assumptions 1 and 2, we have the following convergence rate for nonconvex optimization.

Theorem 1. Assume that Assumptions 1 and 2 hold and the steplength sequence {γ_k}_{k=1,...,K} in Algorithm 1 satisfies

    LMγ_k + 2L²M²Tγ_k Σ_{κ=1}^{T} γ_{k+κ} ≤ 1  for all k = 1, 2, ....    (7)

We have the following ergodic convergence rate for the iterates of Algorithm 1:

    (1/Σ_{k=1}^{K} γ_k) Σ_{k=1}^{K} γ_k E(‖∇f(x_k)‖²) ≤ [2(f(x_1) − f(x*)) + Σ_{k=1}^{K} (γ_k² ML + 2L²M²γ_k Σ_{j=k−T}^{k−1} γ_j²) σ²] / (M Σ_{k=1}^{K} γ_k),    (8)

where E(·) denotes taking the expectation with respect to all random variables in Algorithm 1.

To evaluate the convergence rate, the metrics commonly used in convex optimization, for example, f(x_k) − f* and ‖x_k − x*‖², are not eligible. For nonconvex optimization, we use the ergodic convergence as the metric, that is, the weighted average of the ℓ_2 norms of all gradients ‖∇f(x_k)‖², which is used in the analysis for nonconvex optimization [Ghadimi and Lan, 2013].
Although the metric used in nonconvex optimization is not exactly comparable to f(x_k) − f* or ‖x_k − x*‖² used in the analysis for convex optimization, it is not unreasonable to consider them roughly of the same order. The ergodic convergence directly implies the following convergence: if an index K̃ is randomly selected from {1, 2, ..., K} with probabilities {γ_k / Σ_{k=1}^{K} γ_k}, then E(‖∇f(x_{K̃})‖²) is bounded by the right-hand side of (8) and by all the bounds we show in the following.

Taking a close look at Theorem 1, we can choose the steplength γ_k as a constant value and obtain the following convergence rate:

Corollary 2. Assume that Assumptions 1 and 2 hold. Set the steplength γ_k to be a constant γ:

    γ := √((f(x_1) − f(x*)) / (MLKσ²)).    (9)

If the delay parameter T is bounded by

    K ≥ 4ML(f(x_1) − f(x*))(T + 1)²/σ²,    (10)

then the output of Algorithm 1 satisfies the following ergodic convergence rate:

    min_{k∈{1,...,K}} E‖∇f(x_k)‖² ≤ (1/K) Σ_{k=1}^{K} E‖∇f(x_k)‖² ≤ 4√((f(x_1) − f(x*))L/(MK)) σ.    (11)

This corollary basically claims that when the total iteration number K is greater than O(T²), the convergence rate achieves O(1/√(MK)). Since this rate does not depend on the delay parameter T after a sufficient number of iterations, the negative effect of using old values of x for stochastic gradient evaluation vanishes asymptotically.
In other words, if the total number of workers is bounded by O(√(K/M)), linear speedup is achieved.

Note that our convergence rate O(1/√(MK)) is consistent with serial SG (with M = 1) for convex optimization [Nemirovski et al., 2009], the synchronous parallel (or mini-batch) SG for convex optimization [Dekel et al., 2012], and nonconvex smooth optimization [Ghadimi and Lan, 2013]. Therefore, an important observation is that as long as the number of workers (which is proportional to T) is bounded by O(√(K/M)), the iteration complexity to achieve the same accuracy level will be roughly the same. In other words, the average work load for each worker is reduced by a factor of T compared to serial SG. Therefore, linear speedup is achievable if T ≤ O(√(K/M)). Since our convergence rate matches several special cases, it is tight.

Next we compare with the analysis of ASYSG-CON for convex smooth optimization in Agarwal and Duchi [2011, Corollary 2]. They proved an asymptotic convergence rate of O(1/√(MK)), which is consistent with ours. But their results require T ≤ O(K^{1/4}M^{-3/4}) to guarantee linear speedup. Our result improves this by a factor O(K^{1/4}M^{1/4}).

4 Asynchronous parallel stochastic gradient for shared memory architecture

This section considers a widely used lock-free asynchronous implementation of SG on the shared memory system proposed in Niu et al. [2011]. Its advantages have been witnessed in solving SVMs, graph cuts [Niu et al., 2011], linear equations [Liu et al., 2014b], and matrix completion [Petroni and Querzoni, 2014].
While the computer network always involves multiple machines, the shared memory platform usually only includes a single machine with multiple cores / GPUs sharing the same memory.

4.1 Algorithm Description: ASYSG-INCON

Algorithm 2 ASYSG-INCON
Require: x_0, K, γ
Ensure: x_K
1: for k = 0, ..., K − 1 do
2:   Randomly select M training samples indexed by ξ_{k,1}, ξ_{k,2}, ..., ξ_{k,M};
3:   Randomly select i_k ∈ {1, 2, ..., n} with uniform distribution;
4:   (x_{k+1})_{i_k} = (x_k)_{i_k} − γ Σ_{m=1}^{M} (G(x̂_{k,m}; ξ_{k,m}))_{i_k};
5: end for

For the shared memory platform, one could exactly follow ASYSG-CON on the computer network using software locks, but this is expensive⁴. Therefore, in practice the lock-free asynchronous parallel implementation of SG is preferred. This section considers the same implementation as Niu et al. [2011], but provides a more precise algorithm description, ASYSG-INCON, than the HOGWILD! algorithm proposed in Niu et al. [2011].

In this lock-free implementation, the shared memory stores the parameter "x" and allows all workers to read and modify parameter x simultaneously without using locks. All workers repeat the following steps independently, concurrently, and simultaneously:
• (Read): read the parameter from the shared memory to the local memory without software locks (we use x̂ to denote its value);
• (Compute): sample a training datum ξ and use x̂ to compute the stochastic gradient G(x̂; ξ) locally;
• (Update): update parameter x in the shared memory without software locks: x ← x − γG(x̂; ξ).
Since we do not use locks in either the "read" or the "update" step, multiple workers may manipulate the shared memory simultaneously. This causes "inconsistent read" at the "read" step, that is, the value of x̂ read from the shared memory might not be any state of x in the shared memory at any time point.
For example, at time 0, the original value of x in the shared memory is a two-dimensional vector [a, b]; at time 1, worker W runs the "read" step and first reads a from the shared memory; at time 2, worker W′ updates the first component of x in the shared memory from a to a′; at time 3, worker W′ updates the second component of x in the shared memory from b to b′; at time 4, worker W reads the value of the second component of x in the shared memory as b′. In this case, worker W eventually obtains the value of x̂ as [a, b′], which is not a real state of x in the shared memory at any time point. Recall that in ASYSG-CON the parameter value obtained by any worker is guaranteed to be some real value of parameter x at some time point.

To precisely characterize this implementation, and especially to represent x̂, we monitor the value of parameter x in the shared memory. We define one iteration as a modification of any single component of x in the shared memory, since the update of a single component can be considered atomic on GPUs and DSPs [Niu et al., 2011]. We use x_k to denote the value of parameter x in the shared memory after k iterations, and x̂_k to denote the value read from the shared memory and used for computing the stochastic gradient at the kth iteration. x̂_k can be represented by x_k with a few earlier updates missing:

    x̂_k = x_k − Σ_{j∈J(k)} (x_{j+1} − x_j),    (12)

where J(k) ⊂ {k − 1, k − 2, ..., 0} is a subset of index numbers of previous iterations. This approach is also used in analyzing asynchronous parallel coordinate descent algorithms in [Avron et al., 2014, Liu and Wright, 2014].
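The interleaving described above, and the representation (12), can be replayed in a small deterministic sketch; the numeric values and update sizes are illustrative assumptions, not from the paper.

```python
import numpy as np

# Replay of the interleaving above with concrete numbers: x starts as [a, b].
x = np.array([1.0, 2.0])            # state [a, b] at time 0
states = [x.copy()]                 # every state x ever takes in shared memory

x_hat = np.empty(2)
x_hat[0] = x[0]                     # worker W reads coordinate 0 -> a
x[0] -= 0.5                         # worker W' writes coordinate 0: a -> a'
states.append(x.copy())
x[1] -= 0.25                        # worker W' writes coordinate 1: b -> b'
states.append(x.copy())
x_hat[1] = x[1]                     # worker W reads coordinate 1 -> b'

# [a, b'] was never a state of x in shared memory at any single time point...
assert not any(np.array_equal(x_hat, s) for s in states)

# ...but (12) recovers it: x_hat equals the current x minus the updates W
# missed, here the single write to coordinate 0.
missed = states[1] - states[0]      # x_{j+1} - x_j for the missed iteration
assert np.allclose(x_hat, x - missed)
```

The second assertion is exactly equation (12) with J(k) containing the one iteration whose write worker W did not observe.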
The kth update in the shared memory can be described as

    (x_{k+1})_{i_k} = (x_k)_{i_k} − γ(G(x̂_k; ξ_k))_{i_k},

where ξ_k denotes the index of the selected datum and i_k denotes the index of the component updated at the kth iteration. In the original analysis of the HOGWILD! implementation [Niu et al., 2011], x̂_k is assumed to be some earlier state of x in the shared memory (that is, consistent read) for simpler analysis, although this is not true in practice.

⁴The time consumed by a lock is roughly equal to the time of 10^4 floating-point computations. The additional cost of using locks is the waiting time during which multiple workers access the same memory address.

One more complication is applying the mini-batch strategy as before. Since the "update" step requires a physical modification of the shared memory, it is usually much more time-consuming than the "read" and "compute" steps. If many workers run the "update" step simultaneously, memory contention will seriously harm the performance. To reduce the risk of memory contention, a common trick is to ask each worker to gather multiple (say, M) stochastic gradients and write the shared memory only once. That is, in each cycle, a worker runs the "read" and "compute" steps M times before running the "update" step once. The mini-batch updates in the shared memory can thus be written as

    (x_{k+1})_{i_k} = (x_k)_{i_k} − γ Σ_{m=1}^{M} (G(x̂_{k,m}; ξ_{k,m}))_{i_k},    (13)

where i_k denotes the coordinate index updated at the kth iteration, and G(x̂_{k,m}; ξ_{k,m}) is the mth stochastic gradient computed from the data sample indexed by ξ_{k,m} and the parameter value denoted by x̂_{k,m} at the kth iteration.
x̂_{k,m} can be expressed as

    x̂_{k,m} = x_k − Σ_{j∈J(k,m)} (x_{j+1} − x_j),    (14)

where J(k, m) ⊂ {k − 1, k − 2, ..., 0} is a subset of index numbers of previous iterations. The algorithm is summarized in Algorithm 2 from the view of the shared memory.

4.2 Analysis for ASYSG-INCON

To analyze ASYSG-INCON, we need a few assumptions similar to Niu et al. [2011], Liu et al. [2014b], Avron et al. [2014], Liu and Wright [2014].

Assumption 3. We assume that the following holds for Algorithm 2:
• (Independence): All groups of variables {i_k, {ξ_{k,m}}_{m=1}^{M}} at different iterations from k = 1 to K are independent of each other;
• (Bounded Age): Let T be the global bound for the delay: J(k, m) ⊂ {k − 1, ..., k − T}, ∀k, ∀m, so |J(k, m)| ≤ T.

The independence assumption might not hold in practice, but it is probably the best assumption one can make to analyze the asynchronous parallel SG algorithm. This assumption was also used in the analysis of HOGWILD! [Niu et al., 2011] and the asynchronous randomized Kaczmarz algorithm [Liu et al., 2014b]. The bounded delay assumption basically restricts the age of all missing components in x̂_{k,m} (∀m, ∀k). The upper bound "T" here serves a purpose similar to that in Assumption 2; thus we abuse this notation in this section. The value of T is proportional to the number of workers and does not depend on the size of the mini-batch M. The bounded age assumption is used in the analysis of asynchronous stochastic coordinate descent with "inconsistent read" [Avron et al., 2014, Liu and Wright, 2014]. Under Assumptions 1 and 3, we have the following results:

Theorem 3.
Assume that Assumptions 1 and 3 hold and the constant steplength γ satisfies

    2M²TL_T²(√n + T − 1)γ²/n^{3/2} + 2ML_max γ ≤ 1.    (15)

We have the following ergodic convergence rate for Algorithm 2:

    (1/K) Σ_{t=1}^{K} E(‖∇f(x_t)‖²) ≤ (2n/(KMγ))(f(x_1) − f(x*)) + (L_T² TMγ²/(2n))σ² + L_max γσ².    (16)

Taking a close look at Theorem 3, we can choose the steplength γ properly and obtain the following error bound:

Corollary 4. Assume that Assumptions 1 and 3 hold. Set the steplength to be a constant γ:

    γ := √(2(f(x_1) − f(x*))n) / (√(KL_T M) σ).    (17)

If the total number of iterations K is greater than

    K ≥ 16(f(x_1) − f(x*))L_T M (n^{3/2} + 4T²) / (√n σ²),    (18)

then the output of Algorithm 2 satisfies the following ergodic convergence rate:

    (1/K) Σ_{k=1}^{K} E(‖∇f(x_k)‖²) ≤ √(72(f(x_1) − f(x*))L_T n/(KM)) σ.    (19)

This corollary indicates that the asymptotic convergence rate achieves O(1/√(MK)) when the total iteration number K exceeds a threshold of order O(T²) (if n is considered a constant). We can see that this rate and threshold are consistent with the result in Corollary 2 for ASYSG-CON. One may ask why there is an additional factor √n in the numerator of (19). That is due to the way we count iterations: one iteration is defined as updating a single component of x. If we take this factor into account when comparing with ASYSG-CON, the convergence rates for ASYSG-CON and ASYSG-INCON are essentially consistent.
The comparison between ASYSG-CON and ASYSG-INCON above implies that the "inconsistent read" does not make a big difference compared with the "consistent read".

Next we compare our result with the analysis of HOGWILD! by Niu et al. [2011]. In principle, our analysis and their analysis consider the same implementation of asynchronous parallel SG, but they differ in the following aspects: 1) our analysis considers smooth nonconvex optimization, which includes the smooth strongly convex optimization considered in their analysis; 2) our analysis considers the "inconsistent read" model, which matches the practice, while their analysis assumes the impractical "consistent read" model. Although the two results are not directly comparable, it is still interesting to see the difference. Niu et al. [2011] proved that the linear speedup is achievable if the maximal number of nonzeros in the stochastic gradients is bounded by $O(1)$ and the number of workers is bounded by $O(n^{1/4})$. Our analysis does not need this prerequisite and guarantees the linear speedup as long as the number of workers is bounded by $O(\sqrt{K})$. Although it is hard to say that our result strictly dominates HOGWILD! in Niu et al. [2011], our asymptotic result applies to more scenarios.

5 Experiments

The successes of ASYSG-CON and ASYSG-INCON and their advantages over synchronous parallel algorithms have been widely witnessed in many applications such as deep neural networks [Dean et al., 2012, Paine et al., 2013, Zhang et al., 2014, Li et al., 2014a], matrix completion [Niu et al., 2011, Petroni and Querzoni, 2014, Yun et al., 2013], SVM [Niu et al., 2011], and linear equations [Liu et al., 2014b]. We refer readers to these works for more comprehensive comparisons and empirical studies. This section mainly provides an empirical study to validate the speedup properties for completeness.
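As a complement to those experiments, the inconsistent-read update (14) can be illustrated with a toy, purely serial Python simulation (the function name, the random-subset model for $J(k,m)$, and the quadratic test objective are our own illustrative choices, not the paper's implementation): each of the $M$ stochastic gradients is evaluated at a snapshot $\hat{x}$ that misses a random subset of the last $T$ single-coordinate updates, per Assumption 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def asysg_incon_step(x, deltas, grad, gamma, M, T):
    """Serial simulation of one ASYSG-INCON iteration, update (14).

    `deltas` stores the recent per-iteration updates x_{j+1} - x_j; each
    sampled gradient is evaluated at a snapshot x_hat that misses a random
    subset J(k, m) of the last (at most) T updates (Assumption 3).
    """
    n = x.size
    ik = rng.integers(n)  # coordinate i_k updated in this iteration
    g_sum = 0.0
    for _ in range(M):
        recent = deltas[-T:]
        missed = rng.random(len(recent)) < 0.5  # membership in J(k, m)
        x_hat = x.copy()
        for d, miss in zip(recent, missed):
            if miss:
                x_hat -= d  # roll back an update this reader did not see
        g_sum += grad(x_hat)[ik]  # (G(x_hat; xi_{k,m}))_{i_k}
    delta = np.zeros(n)
    delta[ik] = -gamma * g_sum
    deltas.append(delta)
    return x + delta

# Toy objective f(x) = 0.5 * ||x||^2, whose (noise-free) gradient is x.
x, deltas = np.ones(4), []
for _ in range(400):
    x = asysg_incon_step(x, deltas, lambda z: z, gamma=0.05, M=2, T=4)
print(np.linalg.norm(x))  # shrinks toward 0 despite the stale reads
```

The point of the sketch is that $\hat{x}$ need not equal any iterate $x_j$ that ever existed in memory, which is exactly what distinguishes "inconsistent read" from the "consistent read" model.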
Due to the space limit, the empirical study is provided in the Supplemental Materials.

6 Conclusion

This paper studied two popular asynchronous parallel implementations of SG, on a computer cluster and on a shared memory system respectively. Two algorithms (ASYSG-CON and ASYSG-INCON) are used to describe the two implementations. An asymptotic sublinear convergence rate is proven for both algorithms for nonconvex smooth optimization. This rate is consistent with the result of SG for convex optimization. The linear speedup is proven to be achievable when the number of workers is bounded by $\sqrt{K}$, which improves the earlier analysis of ASYSG-CON for convex optimization in [Agarwal and Duchi, 2011]. The proposed ASYSG-INCON algorithm provides a more precise description of the lock-free implementation on a shared memory system than HOGWILD! [Niu et al., 2011]. Our result for ASYSG-INCON can be applied to more scenarios.

Acknowledgements

This project is supported by the NSF grant CNS-1548078, the NEC fellowship, and the startup funding at the University of Rochester. We thank Professor Daniel Gildea and Professor Sandhya Dwarkadas at the University of Rochester, Professor Stephen J. Wright at the University of Wisconsin-Madison, and the anonymous (meta-)reviewers for their constructive comments and helpful advice.

References

A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. NIPS, 2011.

H. Avron, A. Druinsky, and A. Gupta. Revisiting asynchronous linear solvers: Provable convergence rate through randomization. IPDPS, 2014.

Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137-1155, 2003.

D. P. Bertsekas and J. N. Tsitsiklis. Parallel and distributed computation: numerical methods, volume 23. Prentice Hall, Englewood Cliffs, NJ, 1989.

J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V.
Le, et al. Large scale distributed deep networks. NIPS, 2012.

O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1):165-202, 2012.

O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. arXiv preprint arXiv:1312.5799, 2013.

H. R. Feyzmahdavian, A. Aytekin, and M. Johansson. An asynchronous mini-batch algorithm for regularized stochastic optimization. ArXiv e-prints, May 18, 2015.

S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341-2368, 2013.

M. Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM based approach. arXiv preprint arXiv:1412.6058, 2014.

Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 1(4):7, 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, pages 1097-1105, 2012.

M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola. Parameter server for distributed machine learning. Big Learning NIPS Workshop, 2013.

M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. OSDI, 2014a.

M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. NIPS, 2014b.

J. Liu and S. J. Wright.
Asynchronous stochastic coordinate descent: Parallelism and convergence properties. arXiv preprint arXiv:1403.3862, 2014.

J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar. An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014a.

J. Liu, S. J. Wright, and S. Sridhar. An asynchronous parallel randomized Kaczmarz algorithm. arXiv preprint arXiv:1401.4780, 2014b.

H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.

J. Marecek, P. Richtárik, and M. Takáč. Distributed block coordinate descent for minimizing partially separable functions. arXiv preprint arXiv:1406.0238, 2014.

A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. NIPS, 2011.

T. Paine, H. Jin, J. Yang, Z. Lin, and T. Huang. GPU asynchronous stochastic gradient descent to speed up neural network training. NIPS, 2013.

F. Petroni and L. Querzoni. GASGD: Stochastic gradient descent for distributed asynchronous matrix completion via graph partitioning. ACM Conference on Recommender Systems, 2014.

S. Sridhar, S. Wright, C. Ré, J. Liu, V. Bittorf, and C. Zhang. An approximate, efficient LP solver for LP rounding. NIPS, 2013.

R. Tappenden, M. Takáč, and P. Richtárik. On the complexity of parallel coordinate descent. arXiv preprint arXiv:1503.03033, 2015.

K. Tran, S. Hosseini, L. Xiao, T. Finley, and M. Bilenko. Scaling up stochastic dual coordinate ascent. ICML, 2015.

H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. Dhillon.
NOMAD: Non-locking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. arXiv preprint arXiv:1312.0193, 2013.

R. Zhang and J. Kwok. Asynchronous distributed ADMM for consensus optimization. ICML, 2014.

S. Zhang, A. Choromanska, and Y. LeCun. Deep learning with elastic averaging SGD. CoRR, abs/1412.6651, 2014.