{"title": "Collaborative Deep Learning in Fixed Topology Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 5904, "page_last": 5914, "abstract": "There is significant recent interest to parallelize deep learning algorithms in order to handle the enormous growth in data and model sizes. While most advances focus on model parallelization and engaging multiple computing agents via using a central parameter server, aspect of data parallelization along with decentralized computation has not been explored sufficiently. In this context, this paper presents a new consensus-based distributed SGD (CDSGD) (and its momentum variant, CDMSGD) algorithm for collaborative deep learning over fixed topology networks that enables data parallelization as well as decentralized computation. Such a framework can be extremely useful for learning agents with access to only local/private data in a communication constrained environment. We analyze the convergence properties of the proposed algorithm with strongly convex and nonconvex objective functions with fixed and diminishing step sizes using concepts of Lyapunov function construction. We demonstrate the efficacy of our algorithms in comparison with the baseline centralized SGD and the recently proposed federated averaging algorithm (that also enables data parallelism) based on benchmark datasets such as MNIST, CIFAR-10 and CIFAR-100.", "full_text": "Collaborative Deep Learning in\n\nFixed Topology Networks\n\nZhanhong Jiang1, Aditya Balu1, Chinmay Hegde2, and Soumik Sarkar1\n\n1Department of Mechanical Engineering, Iowa State University,\n\nzhjiang, baditya, soumiks@iastate.edu\n\n2Department of Electrical and Computer Engineering , Iowa State University, chinmay@iastate.edu\n\nAbstract\n\nThere is signi\ufb01cant recent interest to parallelize deep learning algorithms in order\nto handle the enormous growth in data and model sizes. 
While most advances focus on model parallelization and on engaging multiple computing agents via a central parameter server, the aspect of data parallelization along with decentralized computation has not been explored sufficiently. In this context, this paper presents a new consensus-based distributed SGD (CDSGD) (and its momentum variant, CDMSGD) algorithm for collaborative deep learning over fixed topology networks that enables data parallelization as well as decentralized computation. Such a framework can be extremely useful for learning agents with access to only local/private data in a communication constrained environment. We analyze the convergence properties of the proposed algorithm with strongly convex and nonconvex objective functions with fixed and diminishing step sizes using concepts of Lyapunov function construction. We demonstrate the efficacy of our algorithms in comparison with the baseline centralized SGD and the recently proposed federated averaging algorithm (that also enables data parallelism) based on benchmark datasets such as MNIST, CIFAR-10 and CIFAR-100.

1 Introduction

In this paper, we address the scalability of optimization algorithms for deep learning in a distributed setting. Scaling up deep learning [1] is becoming increasingly crucial for large-scale applications where the sizes of both the available data as well as the models are massive [2]. Among various algorithmic advances, many recent attempts have been made to parallelize stochastic gradient descent (SGD) based learning schemes across multiple computing agents. An early approach called Downpour SGD [3], developed within Google's DistBelief software framework, primarily focuses on model parallelization (i.e., splitting the model across the agents).
A different approach known as elastic averaging SGD (EASGD) [4] attempts to improve performance by running multiple SGD instances in parallel; this method uses a central parameter server that helps in assimilating parameter updates from the computing agents. However, none of the above approaches concretely address the issue of data parallelization, which is an important issue for several learning scenarios: for example, data parallelization enables privacy-preserving learning in scenarios such as distributed learning with a network of mobile and Internet-of-Things (IoT) devices. A recent scheme called Federated Averaging SGD [5] attempts such a data parallelization in the context of deep learning with significant success; however, they still use a central parameter server.

In contrast, deep learning with decentralized computation can be achieved via gossip SGD algorithms [6, 7], where agents communicate probabilistically without the aid of a parameter server. However, decentralized computation in the sense of gossip SGD is not feasible in many real life applications. For instance, consider a large (wide-area) sensor network [8, 9] or multi-agent robotic

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Comparisons between different optimization approaches

Method             | f         | ∇f        | Step Size | Con.Rate  | D.P. | D.C. | C.C.T.
SGD                | Str-con   | Lip.      | Con.      | O(γ^k)    | No   | No   | No
Downpour SGD [3]   | Nonconvex | Lip.      | Con.&Ada. | N/A       | Yes  | No   | No
EASGD [4]          | Str-con   | Lip.      | Con.      | O(γ^k)    | No   | No   | No
Gossip SGD [7]     | Str-con   | Lip.&Bou. | Con.      | O(γ^k)    | No   | Yes  | No
Gossip SGD [7]     | Str-con   | Lip.&Bou. | Dim.      | O(1/k)    | No   | Yes  | No
FedAvg [5]         | Nonconvex | Lip.      | Con.      | N/A       | Yes  | No   | No
CDSGD [This paper] | Str-con   | Lip.&Bou. | Con.      | O(γ^k)    | Yes  | Yes  | Yes
CDSGD [This paper] | Str-con   | Lip.&Bou. | Dim.      | O(1/k^ε)  | Yes  | Yes  | Yes
CDSGD [This paper] | Nonconvex | Lip.&Bou. | Con.      | N/A       | Yes  | Yes  | Yes
CDSGD [This paper] | Nonconvex | Lip.&Bou. | Dim.      | N/A       | Yes  | Yes  | Yes

Con.Rate: convergence rate, Str-con: strongly convex.
Lip.&Bou.: Lipschitz continuous and bounded. Con.: constant and Con.&Ada.: constant & Adagrad. Dim.: diminishing. γ ∈ (0, 1) is a positive constant. ε ∈ (0.5, 1] is a positive constant. D.P.: data parallelism. D.C.: decentralized computation. C.C.T.: constrained communication topology.

network that aims to learn a model of the environment in a collaborative manner [10, 11]. For such cases, it may be infeasible for arbitrary pairs of agents to communicate on-demand; typically, agents are only able to communicate with their respective neighbors in a communication network in a fixed (or evolving) topology.

Contribution: This paper introduces a new class of approaches for deep learning that enables both data parallelization and decentralized computation. Specifically, we propose consensus-based distributed SGD (CDSGD) and consensus-based distributed momentum SGD (CDMSGD) algorithms for collaborative deep learning that, for the first time, satisfy all three requirements: data parallelization, decentralized computation, and constrained communication over fixed topology networks. Moreover, while most existing studies solely rely on empirical evidence from simulations, we present rigorous convergence analysis for both (strongly) convex and non-convex objective functions, with both fixed and diminishing step sizes, using a Lyapunov function construction approach. Our analysis reveals several advantages of our method: we match the best existing rates of convergence in the centralized setting, while simultaneously supporting data parallelism as well as constrained communication topologies; to our knowledge, this is the first approach that achieves all three desirable properties; see Table 1 for a detailed comparison.

Finally, we validate our algorithms' performance on benchmark datasets, such as MNIST, CIFAR-10, and CIFAR-100.
Apart from centralized SGD as a baseline, we also compare performance with that of Federated Averaging SGD, as it also enables data parallelization. Empirical evidence (for a given number of agents and other hyperparametric conditions) suggests that while our method is slightly slower, we can achieve higher accuracy compared to the best available algorithm (Federated Averaging (FedAvg)). Empirically, the proposed framework is not only suitable for situations without a central parameter server, but is also robust to failure of a central parameter server.

Related work: Apart from the algorithms mentioned above, a few other related works exist, including a distributed system called Adam for large deep neural network (DNN) models [12] and a distributed methodology by Strom [13] for DNN training that controls the rate of weight updates to reduce the amount of communication. Natural Gradient Stochastic Gradient Descent (NG-SGD) based on model averaging [14] and staleness-aware async-SGD [15] have also been developed for distributed deep learning. A method called CentralVR [16] was proposed for reducing the variance and conducting parallel execution with a linear convergence rate. Moreover, a decentralized algorithm based on a gossip protocol, called multi-step dual accelerated (MSDA) [17], was developed for solving deterministically smooth and strongly convex distributed optimization problems in networks with a provable optimal linear convergence rate. A new class of decentralized primal-dual methods [18] was also proposed recently in order to improve inter-node communication efficiency for distributed convex optimization problems. To minimize a finite sum of nonconvex functions over a network, the authors in [19] proposed a zeroth-order distributed algorithm (ZENITH) that was globally convergent with a sublinear rate.
From the perspective of distributed optimization, the proposed algorithms have similarities with the approaches of [20, 21]. However, we distinguish our work due to the collaborative learning aspect with data parallelization and the extension to the stochastic setting and nonconvex objective functions. In [20] the authors only considered convex objective functions in a deterministic setting, while the authors in [21] presented results for non-convex optimization problems in a deterministic setting. Our proof techniques are different from those in [20, 21] in the choice of Lyapunov function, as well as the notion of stochastic Lyapunov gradient. More importantly, we provide an extensive and thorough suite of numerical comparisons with both centralized methods and distributed methods on benchmark datasets.

The rest of the paper is organized as follows. While section 2 formulates the distributed, unconstrained stochastic optimization problem, section 3 presents the CDSGD algorithm and the Lyapunov stochastic gradient required for the analysis presented in section 4. Validation experiments and performance comparison results are described in section 5. The paper is summarized and concluded in section 6 along with future research directions. Detailed proofs of analytical results, extensions (e.g., effect of diminishing step size) and additional experiments are included in the supplementary section 7.

2 Formulation

We consider the standard (unconstrained) empirical risk minimization problem typically used in machine learning problems (such as deep learning):

min (1/n) Σ_{i=1}^{n} f^i(x),   (1)

where x ∈ R^d denotes the parameter of interest, f : R^d → R is a given loss function, and f^i is the function value corresponding to data point i. In this paper, we are interested in learning problems where the computational agents exhibit data parallelism, i.e., they only have access to their own respective training datasets.
However, we assume that the agents can communicate over a static undirected graph G = (V, E), where V is a vertex set (with nodes corresponding to agents) and E is an edge set. With N agents, we have V = {1, 2, ..., N} and E ⊆ V × V. If (j, l) ∈ E, then Agent j can communicate with Agent l. The neighborhood of agent j ∈ V is defined as: Nb(j) ≜ {l ∈ V : (j, l) ∈ E or j = l}. Throughout this paper we assume that the graph G is connected. Let D_j, j = 1, ..., N denote the subset of the training data (comprising n_j samples) corresponding to the jth agent, such that Σ_{j=1}^{N} n_j = n. With this setup, we have the following simplification of Eq. 1:

min (1/n) Σ_{j=1}^{N} Σ_{i∈D_j} f^i(x) = min (N/n) Σ_{j=1}^{N} Σ_{i∈D_j} f^i_j(x),   (2)

where f_j(x) = (1/N) f(x) is the objective function specific to Agent j. This formulation enables us to state the optimization problem in a distributed manner, where f(x) = Σ_{j=1}^{N} f_j(x).^1 Furthermore, the problem (1) can be reformulated as

min (N/n) 1^T F(x) := (N/n) Σ_{j=1}^{N} Σ_{i∈D_j} f^i_j(x_j)   (3a)
s.t. x_j = x_l ∀(j, l) ∈ E,   (3b)

where x := (x_1, x_2, ..., x_N)^T ∈ R^{N×d} and F(x) can be written as

F(x) = [Σ_{i∈D_1} f^i_1(x_1), Σ_{i∈D_2} f^i_2(x_2), ..., Σ_{i∈D_N} f^i_N(x_N)]^T.   (4)

Note that with d > 1, the parameter set x as well as the gradient ∇F(x) correspond to matrix variables.
However, for simplicity in presenting our analysis, we set d = 1 in this paper, which corresponds to the case where x and ∇F(x) are vectors.

1 Note that in our formulation, we are assuming that every agent has the same local objective function, while in general distributed optimization problems they can be different.

We now introduce several key definitions and assumptions that characterize the objective functions and the agent interaction matrix.

Definition 1. A function f : R^d → R is H-strongly convex if for all x, y ∈ R^d, we have f(y) ≥ f(x) + ∇f(x)^T (y − x) + (H/2)‖y − x‖².

Definition 2. A function f : R^d → R is γ-smooth if for all x, y ∈ R^d, we have f(y) ≤ f(x) + ∇f(x)^T (y − x) + (γ/2)‖y − x‖².

As a consequence of Definition 2, we can conclude that ∇f is Lipschitz continuous, i.e., ‖∇f(y) − ∇f(x)‖ ≤ γ‖y − x‖ [22].

Definition 3. A function c is said to be coercive if it satisfies: c(x) → ∞ when ‖x‖ → ∞.

Assumption 1. The objective functions f_j : R^d → R are assumed to satisfy the following conditions: a) Each f_j is γ_j-smooth; b) each f_j is proper (not everywhere infinite) and coercive; and c) each f_j is L_j-Lipschitz continuous, i.e., |f_j(y) − f_j(x)| < L_j‖y − x‖ ∀x, y ∈ R^d.

As a consequence of Assumption 1, we can conclude that Σ_{j=1}^{N} f_j(x_j) possesses a Lipschitz continuous gradient with parameter γ_m := max_j γ_j. Similarly, each f_j is strongly convex with H_j, such that Σ_{j=1}^{N} f_j(x_j) is strongly convex with H_m = min_j H_j.

Regarding the communication network, we use Π to denote the agent interaction matrix, where the element π_jl signifies the link weight between agents j and l.

Assumption 2.
a) If (j, l) ∉ E, then π_jl = 0; b) Π^T = Π; c) null{I − Π} = span{1}; and d) I ⪰ Π ≻ −I.

The main outcome of Assumption 2 is that the probability transition matrix is doubly stochastic and that we have λ_1(Π) = 1 > λ_2(Π) ≥ ··· ≥ λ_N(Π) ≥ 0, where λ_z(Π) denotes the z-th largest eigenvalue of Π.

3 Proposed Algorithm

3.1 Consensus Distributed SGD

For solving stochastic optimization problems, SGD and its variants have been commonly applied to both centralized and distributed problem formulations. Therefore, the following algorithm is proposed based on SGD and the concept of consensus to solve the problem laid out in Eq. 2:

x^j_{k+1} = Σ_{l∈Nb(j)} π_jl x^l_k − α g_j(x^j_k),   (5)

where Nb(j) indicates the neighborhood of agent j, α is the step size, and g_j(x^j_k) is the stochastic gradient of f_j at x^j_k, which corresponds to a minibatch of sampled data points at the kth epoch. More formally, g_j(x^j_k) = (1/b′) Σ_{q′∈D′} ∇f^{q′}_j(x^j_k), where b′ is the size of the minibatch D′ randomly selected from the data subset D_j. While the pseudo-code of CDSGD is shown below in Algorithm 1, momentum versions of CDSGD based on Polyak momentum [23] and Nesterov momentum [24] are also presented in the supplementary section 7. In the experiments, Nesterov momentum is used, as it has been shown in traditional SGD implementations that the Nesterov variant outperforms Polyak momentum.
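To make the consensus part of the update in Eq. 5 concrete, the following pure-Python sketch (all names and values are illustrative, not taken from the paper) builds a uniform agent interaction matrix for a fully connected topology and shows that one mixing step, taken without the gradient term, replaces every agent's parameter by the network average:

```python
# Illustrative sketch: a uniform agent interaction matrix for a fully
# connected topology, and the effect of one consensus step
# x_j <- sum_l pi_jl * x_l (Eq. 5 with the gradient term dropped).

def uniform_pi(n_agents):
    # pi_jl = 1/N for all j, l: symmetric and doubly stochastic (Assumption 2)
    return [[1.0 / n_agents] * n_agents for _ in range(n_agents)]

def consensus_step(pi, x):
    # one application of the interaction matrix to the stacked parameters
    return [sum(pi[j][l] * x[l] for l in range(len(x))) for j in range(len(x))]

pi = uniform_pi(5)
x = [0.0, 1.0, 2.0, 3.0, 4.0]      # one scalar parameter per agent (d = 1)
mixed = consensus_step(pi, x)      # every agent now holds the average, 2.0
```

For a sparser topology, π_jl would be nonzero only for neighbors l ∈ Nb(j), and several mixing steps would be needed to approach the network average.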
Note that mini-batch implementations of these algorithms are straightforward and hence are not discussed here in detail, and that the convergence analysis of the momentum variants is out of the scope of this paper and will be presented in our future work.

Algorithm 1: CDSGD
Input: m, α, N
Initialize: x^j_0, (j = 1, 2, ..., N)
Distribute the training dataset to N agents.
for each agent do
    for k = 0 : m do
        Randomly shuffle the corresponding data subset D_j (without replacement)
        w^j_{k+1} = Σ_{l∈Nb(j)} π_jl x^l_k
        x^j_{k+1} = w^j_{k+1} − α g_j(x^j_k)
    end
end

3.2 Tools for convergence analysis

We now analyze the convergence properties of the iterates {x^j_k} generated by Algorithm 1. The following section summarizes some key intermediate concepts required to establish our main results. First, we construct an appropriate Lyapunov function that will enable us to establish convergence. Observe that the update law in Alg. 1 can be expressed as:

x_{k+1} = Π x_k − α g(x_k),   (6)

where g(x_k) = [g_1(x^1_k), g_2(x^2_k), ..., g_N(x^N_k)]^T.

Denoting w_k = Π x_k, the update law can be re-written as x_{k+1} = w_k − α g(x_k). Moreover, x_{k+1} = x_k − x_k + w_k − α g(x_k). Rearranging the last equality yields the following relation:

x_{k+1} = x_k − α(g(x_k) + α^{−1}(x_k − w_k)) = x_k − α(g(x_k) + α^{−1}(I − Π)x_k),   (7)

where the last term in Eq. 7 is the Stochastic Lyapunov Gradient. From Eq. 7, we observe that the "effective" gradient step is given by g(x_k) + α^{−1}(I − Π)x_k.
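The two-step structure of Algorithm 1 (consensus mixing followed by a local stochastic gradient step) can be sanity-checked numerically. The following minimal sketch runs it on a toy problem; the quadratic local losses, the uniform Π, the step size, and the epoch count are all assumptions made purely for illustration, not the paper's experimental setup:

```python
import random

# Toy CDSGD sketch: agent j holds a tiny private dataset and minimizes the
# local quadratic loss f_j(x) = mean over its data d of (x - d)^2 / 2, whose
# stochastic gradient at a sampled point d is simply (x - d).
def cdsgd(data, pi, alpha=0.1, n_epochs=300, seed=0):
    rng = random.Random(seed)
    n = len(data)
    x = [0.0] * n                                   # x_0^j for each agent
    for _ in range(n_epochs):
        # consensus step: w_{k+1}^j = sum_l pi_jl x_k^l
        w = [sum(pi[j][l] * x[l] for l in range(n)) for j in range(n)]
        # stochastic gradient step (minibatch of size b' = 1), taken at x_k^j
        x = [w[j] - alpha * (x[j] - rng.choice(data[j])) for j in range(n)]
    return x

pi = [[0.2] * 5 for _ in range(5)]         # uniform, fully connected topology
data = [[float(j)] for j in range(5)]      # agent j's private "dataset": {j}
x_final = cdsgd(data, pi)                  # all agents settle near the mean 2.0
```

With a sparser Π the same loop applies unchanged; only the mixing weights differ, which is exactly where the spectral quantities in the analysis below enter.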
Rewriting ∇J^i(x_k) = g(x_k) + α^{−1}(I − Π)x_k, the updates of CDSGD can be expressed as:

x_{k+1} = x_k − α ∇J^i(x_k).   (8)

The above expression naturally motivates the following Lyapunov function candidate:

V(x, α) := (N/n) 1^T F(x) + (1/(2α)) ‖x‖²_{I−Π},   (9)

where ‖·‖_{I−Π} denotes the norm with respect to the PSD matrix I − Π. Since Σ_{j=1}^{N} f_j(x_j) has a γ_m-Lipschitz continuous gradient, ∇V(x) is also Lipschitz continuous, with parameter:

γ̂ := γ_m + α^{−1} λ_max(I − Π) = γ_m + α^{−1}(1 − λ_N(Π)).

Similarly, as Σ_{j=1}^{N} f_j(x_j) is H_m-strongly convex, V(x) is strongly convex with parameter:

Ĥ := H_m + (2α)^{−1} λ_min(I − Π) = H_m + (2α)^{−1}(1 − λ_2(Π)).

Based on Definition 1, V has a unique minimizer, denoted by x* with V* = V(x*). Correspondingly, using strong convexity of V, we can obtain the relation:

2Ĥ(V(x) − V*) ≤ ‖∇V(x)‖² for all x ∈ R^N.   (10)

From strong convexity and the Lipschitz continuous property of ∇f_j, the constants H_m and γ_m further satisfy H_m ≤ γ_m and hence, Ĥ ≤ γ̂.

Next, we introduce two key lemmas that will help establish our main theoretical guarantees. Due to space limitations, all proofs are deferred to the supplementary material in Section 7.

Lemma 1.
Under Assumptions 1 and 2, the iterates of CDSGD satisfy ∀k ∈ N:

E[V(x_{k+1})] − V(x_k) ≤ −α ∇V(x_k)^T E[∇J^i(x_k)] + (γ̂/2) α² E[‖∇J^i(x_k)‖²].   (11)

At a high level, since E[∇J^i(x_k)] is the unbiased estimate of ∇V(x_k), using the updates ∇J^i(x_k) will lead to sufficient decrease in the Lyapunov function. However, unbiasedness is not enough, and we also need to control higher order moments of ∇J^i(x_k) to ensure convergence. Specifically, we consider the variance of ∇J^i(x_k):

Var[∇J^i(x_k)] := E[‖∇J^i(x_k)‖²] − ‖E[∇J^i(x_k)]‖².   (12)

To bound the variance of ∇J^i(x_k), we use a standard assumption presented in [25] in the context of (centralized) deep learning. Such an assumption aims at providing an upper bound for the "gradient noise" caused by the randomness in the minibatch selection at each iteration.

Assumption 3. a) There exist scalars ζ_2 ≥ ζ_1 > 0 such that ∇V(x_k)^T E[∇J^i(x_k)] ≥ ζ_1‖∇V(x_k)‖² and ‖E[∇J^i(x_k)]‖ ≤ ζ_2‖∇V(x_k)‖ for all k ∈ N; b) There exist scalars Q ≥ 0 and Q_V ≥ 0 such that Var[∇J^i(x_k)] ≤ Q + Q_V‖∇V(x_k)‖² for all k ∈ N.

Remark 1. While Assumption 3(a) guarantees the sufficient descent of V in the direction of −∇J^i(x_k), Assumption 3(b) states that the variance of ∇J^i(x_k) is bounded above by the second moment of ∇V(x_k). The constant Q can be considered to represent the second moment of the "gradient noise" in ∇J^i(x_k). Therefore, the second moment of ∇J^i(x_k) can be bounded above as E[‖∇J^i(x_k)‖²] ≤ Q + Q_m‖∇V(x_k)‖², where Q_m := Q_V + ζ_2² ≥ ζ_1² > 0.

Lemma 2.
Under Assumptions 1, 2, and 3, the iterates of CDSGD satisfy ∀k ∈ N:

E[V(x_{k+1})] − V(x_k) ≤ −(ζ_1 − (γ̂/2) α Q_m) α ‖∇V(x_k)‖² + (γ̂/2) α² Q.   (13)

In Lemma 2, the first term is strictly negative if the step size satisfies the following necessary condition:

0 < α ≤ 2ζ_1/(γ̂ Q_m).   (14)

However, in the later analysis, when such a condition is substituted into the convergence analysis, it may produce a larger upper bound. For obtaining a tight upper bound, we impose a sufficient condition for the rest of the analysis:

0 < α ≤ ζ_1/(γ̂ Q_m).   (15)

As γ̂ is a function of α, the above inequality can be rewritten as 0 < α ≤ (ζ_1 − (1 − λ_N(Π))Q_m)/(γ_m Q_m).

4 Main Results

We now present our main theoretical results establishing the convergence of CDSGD. First, we show that for most generic loss functions (whether convex or not), CDSGD achieves consensus across different agents in the graph, provided the step size (which is fixed across iterations) does not exceed a natural upper bound.

Proposition 1.
(Consensus with fixed step size) Under Assumptions 1 and 2, the iterates of CDSGD (Algorithm 1) satisfy ∀k ∈ N:

E[‖x^j_k − s_k‖] ≤ αL/(1 − λ_2(Π)),   (16)

where α satisfies 0 < α ≤ (ζ_1 − (1 − λ_N(Π))Q_m)/(γ_m Q_m), L is an upper bound of E[‖g(x_k)‖] ∀k ∈ N (defined properly and discussed in Lemma 4 in the supplementary section 7), and s_k = (1/N) Σ_{j=1}^{N} x^j_k represents the average parameter estimate.

The proof of this proposition can be adapted from [26, Lemma 1].

Next, we show that for strongly convex loss functions, CDSGD converges linearly to a neighborhood of the global optimum.

Theorem 1. (Convergence of CDSGD with fixed step size, strongly convex case) Under Assumptions 1, 2 and 3, the iterates of CDSGD satisfy the following inequality ∀k ∈ N:

E[V(x_k) − V*] ≤ (1 − α Ĥ ζ_1)^{k−1}(V(x_1) − V*) + (α² γ̂ Q / 2) Σ_{l=0}^{k−1} (1 − α Ĥ ζ_1)^l
             = (1 − (α H_m + (1 − λ_2(Π))/2) ζ_1)^{k−1}(V(x_1) − V*) + ((α² γ_m + α(1 − λ_N(Π))) Q / 2) Σ_{l=0}^{k−1} (1 − (α H_m + (1 − λ_2(Π))/2) ζ_1)^l,   (17)

when the step size satisfies 0 < α ≤ (ζ_1 − (1 − λ_N(Π))Q_m)/(γ_m Q_m).

A detailed proof is presented in the supplementary section 7.
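To get a feel for the constants appearing in Theorem 1, the following sketch plugs in made-up values for H_m, γ_m, ζ_1, Q, Q_m, and the eigenvalues of Π (none of these numbers come from the paper; they are chosen only so that the step-size condition is satisfiable) and evaluates the per-step contraction factor and the asymptotic error radius:

```python
# Assumed constants, for illustration only (not values from the paper).
H_m, gamma_m = 1.0, 4.0            # strong convexity / smoothness of sum_j f_j
zeta_1, Q, Q_m = 0.8, 0.2, 1.5     # Assumption 3 constants
lam_2, lam_N = 0.95, 0.90          # second-largest / smallest eigenvalues of Pi
alpha = 0.1                        # fixed step size

# sufficient step-size condition used in Theorem 1
alpha_max = (zeta_1 - (1.0 - lam_N) * Q_m) / (gamma_m * Q_m)
assert 0.0 < alpha <= alpha_max

hat_gamma = gamma_m + (1.0 - lam_N) / alpha        # Lipschitz constant of grad V
hat_H = H_m + (1.0 - lam_2) / (2.0 * alpha)        # strong-convexity constant of V

rho = 1.0 - alpha * hat_H * zeta_1                 # per-step contraction factor
radius = alpha * hat_gamma * Q / (2.0 * hat_H * zeta_1)  # asymptotic error bound
```

With these particular numbers, ρ = 0.9 and the radius is 0.05; shrinking α shrinks the radius but pushes ρ toward 1, reflecting the rate-versus-accuracy trade-off discussed next.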
We observe from Theorem 1 that the sequence of Lyapunov function values {V(x_k)} converges linearly to a neighborhood of the optimal value, i.e.,

lim_{k→∞} E[V(x_k) − V*] ≤ α γ̂ Q/(2 Ĥ ζ_1) = (α γ_m + 1 − λ_N(Π)) Q / ((2H_m + α^{−1}(1 − λ_2(Π))) ζ_1).

We also observe that the term on the right hand side decreases with the spectral gap of the agent interaction matrix Π, i.e., 1 − λ_2(Π), which suggests an interesting relation between convergence and the topology of the graph. Moreover, we observe that the upper bound is proportional to the step size parameter α, and smaller step sizes lead to smaller radii of convergence. (However, choosing a very small step size may negatively affect the convergence rate of the algorithm.) Finally, if the gradient in this context is not stochastic (i.e., the parameter Q = 0), then linear convergence to the optimal value is achieved, which matches known rates of convergence with (centralized) gradient descent under strong convexity and smoothness assumptions.

Remark 2. Since E[(N/n) 1^T F(x_k)] ≤ E[V(x_k)] and (N/n) 1^T F(x*) = V*, the sequence of objective function values is upper bounded as follows: E[(N/n) 1^T F(x_k) − (N/n) 1^T F(x*)] ≤ E[V(x_k) − V*]. Therefore, using Theorem 1 we can establish analogous convergence rates in terms of the true objective function values {(N/n) 1^T F(x_k)} as well.

The above convergence result for CDSGD is limited to the case when the objective functions are strongly convex. However, most practical deep learning systems (such as convolutional neural network learning) involve optimizing over highly non-convex objective functions, which are much harder to analyze. Nevertheless, we show that even under such situations, CDSGD exhibits a (weaker) notion of convergence.

Theorem 2.
(Convergence of CDSGD with fixed step size, nonconvex case) Under Assumptions 1, 2, and 3, the iterates of CDSGD satisfy ∀m ∈ N:

E[Σ_{k=1}^{m} ‖∇V(x_k)‖²] ≤ (γ̂ m α Q)/ζ_1 + 2(V(x_1) − V_inf)/(ζ_1 α) = ((γ_m α + 1 − λ_N(Π)) m Q)/ζ_1 + 2(V(x_1) − V_inf)/(ζ_1 α),   (18)

when the step size satisfies 0 < α ≤ (ζ_1 − (1 − λ_N(Π))Q_m)/(γ_m Q_m).

Remark 3. Theorem 2 states that in the absence of "gradient noise" (i.e., when Q = 0), the quantity E[Σ_{k=1}^{m} ‖∇V(x_k)‖²] remains finite. Therefore, necessarily {‖∇V(x_k)‖} → 0 and the estimates approach a stationary point. On the other hand, if the gradient calculations are stochastic, then a similar claim cannot be made. However, for this case we have the upper bound lim_{m→∞} E[(1/m) Σ_{k=1}^{m} ‖∇V(x_k)‖²] ≤ (γ_m α + 1 − λ_N(Π))Q/ζ_1. This tells us that while we cannot guarantee convergence in terms of the sequence of objective function values, we can still assert that the average of the second moment of the gradients is strictly bounded from above, even for the case of nonconvex objective functions.

Moreover, the upper bound cannot be solely controlled via the step-size parameter α (which is different from what is implied in the strongly convex case by Theorem 1). In general, the upper bound becomes tighter as λ_N(Π) increases; however, an increase in λ_N(Π) may result in a commensurate increase in λ_2(Π), leading to worse connectivity in the graph and adversely affecting consensus among agents.
Again, our upper bounds are reflective of interesting tradeoffs between consensus and convergence in the gradients, and their dependence on graph topology.

Figure 1: Average training (solid lines) and validation (dashed lines) accuracy for (a) comparison of CDSGD with centralized SGD and (b) comparison of CDMSGD with the federated averaging method

The above results are for fixed step size α, and we can prove complementary results for CDSGD even for the (more prevalent) case of diminishing step sizes α_k. These are presented in the supplementary material due to space constraints.

5 Experimental Results

This section presents experimental results using the benchmark image recognition dataset CIFAR-10. We use a deep convolutional neural network (CNN) model (with 2 convolutional layers with 32 filters each followed by a max pooling layer, then 2 more convolutional layers with 64 filters each followed by another max pooling layer, and a dense layer with 512 units; ReLU activation is used in the convolutional layers) to validate the proposed algorithm. We use a fully connected topology with 5 agents and a uniform agent interaction matrix unless mentioned otherwise. A mini-batch size of 128 and a fixed step size of 0.01 are used in these experiments. The experiments are performed using Keras and TensorFlow [27, 28], and the code will be made publicly available soon. While we include the training and validation accuracy plots for the different case studies here, the corresponding training loss plots, results with other benchmark datasets such as MNIST and CIFAR-100, and results with decaying as well as different fixed step sizes are presented in the supplementary section 7.

5.1 Performance comparison with benchmark methods

We begin by comparing the accuracy of CDSGD with that of the centralized SGD algorithm, as shown in Fig. 1(a).
While the CDSGD convergence rate is significantly slower compared to SGD, as expected, it is observed that CDSGD can eventually achieve high accuracy, comparable with centralized SGD. However, another interesting observation is that the generalization gap (the difference between training and validation accuracy, as defined in [29]) for the proposed CDSGD algorithm is significantly smaller than that of SGD, which is a useful property. We also compare both CDSGD and CDMSGD with the Federated Averaging SGD (FedAvg) algorithm, which also performs data parallelization (see Fig. 1(b)). For the sake of comparison, we use the same number of agents and choose E = 1 and C = 1 as the hyperparameters in the FedAvg algorithm, as this is close to the fully connected topology scenario considered in the CDSGD and CDMSGD experiments. As CDSGD is significantly slower, we mainly compare CDMSGD with FedAvg, which have similar convergence rates (CDMSGD being slightly slower). The main observation is that CDMSGD performs better than FedAvg at steady state and can achieve centralized SGD level performance. It is important to note that FedAvg does not perform decentralized computation. Essentially, it runs a brute force parameter averaging on a central parameter server at every epoch (i.e., consensus at every epoch) and then broadcasts the updated parameters to the agents. Hence, it tends to be slightly faster than CDMSGD, which uses a truly decentralized computation over a network.

5.2 Effect of network size and topology

In this section, we investigate the effects of network size and topology on the performance of the proposed algorithms. Figure 2(a) shows the change in training performance as the number of agents grows from 2 to 8 to 16. Although the convergence rate slows down as the number of agents increases, all networks are able to achieve similar accuracy levels.
Finally, we investigate the impact of network sparsity (as quantified by the second largest eigenvalue of the agent interaction matrix) on the learning performance.

Figure 2: Average training (solid lines) and validation (dashed lines) accuracy along with accuracy variance over agents for the CDMSGD algorithm with (a) varying network size and (b) varying network topology

The primary observation is that the average accuracy converges faster for sparser networks (higher second largest eigenvalue); we compare a fully connected topology with λ2 = 0 against sparse topologies with λ2 = 0.54 and λ2 = 0.86. This is similar to the trend observed for the FedAvg algorithm when reducing the client fraction (C), which makes the (stochastic) agent interaction matrix sparser. However, from the plot of the variance of accuracy values over agents (smoothed using a moving average filter), it can be observed that the level of consensus is more stable for denser networks than for sparser ones. This is also expected, as discussed in Proposition 1. Note that with the availability of a central parameter server (as in federated averaging), a sparser topology may be useful for faster convergence; however, consensus (and hence topology density) is critical for a collaborative learning paradigm with decentralized computation.

6 Conclusion and Future Work

This paper addresses the collaborative deep learning (and many other machine learning) problem in a completely distributed manner (i.e., with data parallelism and decentralized computation) over networks with fixed topology. We establish a consensus-based distributed SGD framework and propose associated learning algorithms that can prove to be extremely useful in practice.
Using a Lyapunov function construction approach, we show that the proposed CDSGD algorithm can achieve a linear convergence rate with a sufficiently small fixed step size and a sublinear convergence rate with a diminishing step size (see supplementary section 7 for details) for strongly convex and Lipschitz differentiable objective functions. Moreover, decaying gradients can be observed for nonconvex objective functions using CDSGD. Relevant experimental results using benchmark datasets show that CDSGD can achieve centralized-SGD-level accuracy with sufficient training epochs while maintaining a significantly lower generalization error. The momentum variant of the proposed algorithm, CDMSGD, can outperform the recently proposed FedAvg algorithm, which also uses data parallelism but does not perform decentralized computation, i.e., it uses a central parameter server. The effects of network size and topology are also explored experimentally, and the observations conform to the analytical understanding. While current and future research focuses on extensive testing and validation of the proposed framework, especially for large networks, a few technical research directions include: (i) collaborative learning with extreme non-IID data; (ii) collaborative learning over directed, time-varying graphs; and (iii) understanding the dependencies between learning rate and consensus.

Acknowledgments

This paper is based upon research partially supported by the USDA-NIFA under Award no. 2017-67021-25965 and the National Science Foundation under Grant No. CNS-1464279 and No. CCF-1566281.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[2] Suyog Gupta, Wei Zhang, and Josh Milthorpe. Model accuracy and runtime tradeoff in distributed deep learning. arXiv preprint arXiv:1509.04210, 2015.

[3] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.

[4] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems, pages 685–693, 2015.

[5] H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.

[6] Michael Blot, David Picard, Matthieu Cord, and Nicolas Thome. Gossip training for deep learning. arXiv preprint arXiv:1611.09726, 2016.

[7] Peter H Jin, Qiaochu Yuan, Forrest Iandola, and Kurt Keutzer. How to scale distributed deep learning? arXiv preprint arXiv:1611.04581, 2016.

[8] Kushal Mukherjee, Asok Ray, Thomas Wettergren, Shalabh Gupta, and Shashi Phoha. Real-time adaptation of decision thresholds in sensor networks for detection of moving targets. 
Automatica, 47(1):185–191, 2011.

[9] Chao Liu, Yongqiang Gong, Simon Laflamme, Brent Phares, and Soumik Sarkar. Bridge damage detection using spatiotemporal patterns extracted from dense sensor network. Measurement Science and Technology, 28(1):014011, 2017.

[10] H.-L. Choi and J. P. How. Continuous trajectory planning of mobile sensors for informative forecasting. Automatica, 46(8):1266–1275, 2010.

[11] D. K. Jha, P. Chattopadhyay, S. Sarkar, and A. Ray. Path planning in GPS-denied environments with collective intelligence of distributed sensor networks. International Journal of Control, 89, 2016.

[12] Trishul M Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In OSDI, volume 14, pages 571–582, 2014.

[13] Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, volume 7, page 10, 2015.

[14] Hang Su and Haoyu Chen. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239, 2015.

[15] Wei Zhang, Suyog Gupta, Xiangru Lian, and Ji Liu. Staleness-aware async-SGD for distributed deep learning. arXiv preprint arXiv:1511.05950, 2015.

[16] Soham De and Tom Goldstein. Efficient distributed SGD with variance reduction. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 111–120. IEEE, 2016.

[17] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. arXiv preprint arXiv:1702.08704, 2017.

[18] Guanghui Lan, Soomin Lee, and Yi Zhou. Communication-efficient algorithms for decentralized and stochastic optimization. arXiv preprint arXiv:1701.03961, 2017.

[19] Davood Hajinezhad, Mingyi Hong, and Alfredo Garcia. 
Zenith: A zeroth-order distributed algorithm for multi-agent nonconvex optimization.

[20] Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

[21] Jinshan Zeng and Wotao Yin. On nonconvex decentralized gradient descent. arXiv preprint arXiv:1608.05766, 2016.

[22] Angelia Nedić and Alex Olshevsky. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE Transactions on Automatic Control, 61(12):3936–3947, 2016.

[23] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.

[24] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.

[25] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

[26] Kun Yuan, Qing Ling, and Wotao Yin. On the convergence of decentralized gradient descent. arXiv preprint arXiv:1310.7063, 2013.

[27] François Chollet. Keras. https://github.com/fchollet/keras, 2015.

[28] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[29] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.

[30] Angelia Nedić and Alex Olshevsky. Distributed optimization over time-varying directed graphs. IEEE Transactions on Automatic Control, 60(3):601–615, 2015.

[31] S. Ram, A. Nedic, and V. Veeravalli. 
A new class of distributed optimization algorithms: application to regression of distributed data. Optimization Methods and Software, 27(1):71–88, 2012.