{"title": "Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent", "book": "Advances in Neural Information Processing Systems", "page_first": 5330, "page_last": 5340, "abstract": "Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies on high communication cost on the central node. Motivated by this, we ask, can decentralized algorithms be faster than its centralized counterpart? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.", "full_text": "Can Decentralized Algorithms Outperform\nCentralized Algorithms? A Case Study for\n\nDecentralized Parallel Stochastic Gradient Descent\n\nXiangru Lian\u2020, Ce Zhang\u2217, Huan Zhang+, Cho-Jui Hsieh+, Wei Zhang#, and Ji Liu\u2020(cid:92)\n\n\u2020University of Rochester, \u2217ETH Zurich\n\n+University of California, Davis, #IBM T. J. 
Watson Research Center, Tencent AI Lab\n\nxiangru@yandex.com, ce.zhang@inf.ethz.ch, victzhang@gmail.com, chohsieh@ucdavis.edu, weiz@us.ibm.com, ji.liu.uwisc@gmail.com\n\nAbstract\n\nMost distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies in the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, as they simply assume the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has a total computational complexity comparable to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.\n\n1 Introduction\n\nIn the context of distributed machine learning, decentralized algorithms have long been treated as a compromise — when the underlying network topology does not allow centralized communication, one has to resort to decentralized communication, while, understandably, paying the "cost of being decentralized".
In fact, most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. But can decentralized algorithms be faster than their centralized counterparts? In this paper, we provide the first theoretical analysis, verified by empirical experiments, for a positive answer to this question.\n\nWe consider solving the following stochastic optimization problem\n\nmin_{x∈R^N} f(x) := E_{ξ∼D} F(x; ξ),    (1)\n\nwhere D is a predefined distribution and ξ is a random variable usually referring to a data sample in machine learning. This formulation summarizes many popular machine learning models including deep learning [LeCun et al., 2015], linear regression, and logistic regression.\n\nParallel stochastic gradient descent (PSGD) methods are leading algorithms for solving large-scale machine learning problems such as deep learning [Dean et al., 2012, Li et al., 2014], matrix completion\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: An illustration of different network topologies.\n\nAlgorithm | communication complexity on the busiest node | computational complexity\nC-PSGD (mini-batch SGD) | O(n) | O(n/ε + 1/ε²)\nD-PSGD | O(Deg(network)) | O(n/ε + 1/ε²)\n\nTable 1: Comparison of C-PSGD and D-PSGD. The unit of the communication cost is the number of stochastic gradients or optimization variables. n is the number of nodes. The computational complexity is the number of stochastic gradient evaluations needed to get an ε-approximation solution, which is defined in (3).\n\n[Recht et al., 2011, Zhuang et al., 2013] and SVM. Existing PSGD algorithms are mostly designed for a centralized network topology, for example, the parameter server topology [Li et al., 2014], where there is a central node connected with multiple nodes as shown in Figure 1(a).
The central node aggregates the stochastic gradients computed on all other nodes and updates the model parameters, for example, the weights of a neural network. The potential bottleneck of the centralized network topology lies in the communication traffic jam on the central node, because all nodes need to communicate with it concurrently and repeatedly. The performance will be significantly degraded when the network bandwidth is low.1 This motivates us to study algorithms for decentralized topologies, where each node can only communicate with its neighbors and there is no central node, as shown in Figure 1(b).\n\nAlthough decentralized algorithms have been studied as consensus optimization in the control community and used for preserving data privacy [Ram et al., 2009a, Yan et al., 2013, Yuan et al., 2016], for the application scenario where only the decentralized network is available, it is still an open question whether decentralized methods could have advantages over centralized algorithms in some scenarios where both types of communication patterns are feasible — for example, on a supercomputer with thousands of nodes, should we use decentralized or centralized communication? Existing theory and analysis either do not make such a comparison [Bianchi et al., 2013, Ram et al., 2009a, Srivastava and Nedic, 2011, Sundhar Ram et al., 2010] or implicitly indicate that decentralized algorithms are much worse than centralized algorithms in terms of computational complexity and total communication complexity [Aybat et al., 2015, Lan et al., 2017, Ram et al., 2010, Zhang and Kwok, 2014]. This paper gives a positive result for decentralized algorithms by studying a decentralized PSGD (D-PSGD) algorithm on a connected decentralized network. Our theory indicates that D-PSGD admits a similar total computational complexity but requires much less communication on the busiest node.
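The communication asymmetry summarized in Table 1 can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative (the node counts and the ring topology are our assumptions, not settings from the paper); it counts one unit of communication per model copy or gradient sent or received, as in Table 1.

```python
# Per-iteration communication load on the busiest node.

def busiest_node_load_centralized(n):
    # The parameter server receives n gradients and sends back n model
    # copies, so its load grows linearly with the number of nodes.
    return 2 * n

def busiest_node_load_ring(n):
    # In a ring, every node talks to exactly its 2 neighbors
    # (degree 2), regardless of the total number of nodes.
    return 2 * 2  # one send + one receive per neighbor

for n in (4, 16, 112):
    print(n, busiest_node_load_centralized(n), busiest_node_load_ring(n))
```

At 112 nodes the central node handles 224 units per iteration while every ring node handles 4, which is the O(n) vs. O(Deg(network)) gap in Table 1.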
Table 1 shows a quick comparison between C-PSGD and D-PSGD with respect to the computation and communication complexity. Our contributions are:\n\n• We theoretically justify the potential advantage of decentralized algorithms over centralized algorithms. Instead of treating decentralized algorithms as a compromise one has to make, we are the first to conduct a theoretical analysis that identifies cases in which decentralized algorithms can be faster than their centralized counterparts.\n• We theoretically analyze the scalability behavior of decentralized SGD when more nodes are used. Surprisingly, we show that, when more nodes are available, decentralized algorithms can bring speedup, asymptotically linearly, with respect to computational complexity. To the best of our knowledge, this is the first speedup result related to decentralized algorithms.\n• We conduct an extensive empirical study to validate our theoretical analysis of D-PSGD and different C-PSGD variants (e.g., plain SGD, EASGD [Zhang et al., 2015]). We observe computational complexity similar to what our theory indicates; on networks with low bandwidth or high latency, D-PSGD can be up to 10× faster than C-PSGD. Our result holds across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. This indicates a promising future direction in pushing the research horizon of machine learning systems from a purely centralized topology to a more decentralized fashion.\n\n1There has been research on how to alleviate this problem by having multiple parameter servers communicating with efficient MPI AllReduce primitives. As we will see in the experiments, these methods, on the other hand, might suffer when the network latency is high.\n\nDefinitions and notations Throughout this paper, we use the following notation and definitions:\n• ‖·‖ denotes the vector ℓ2 norm or the matrix spectral norm depending on the argument.\n• ‖·‖_F denotes the matrix Frobenius norm.\n• ∇f(·) denotes the gradient of a function f.\n• 1_n denotes the column vector in R^n with 1 for all elements.\n• f* denotes the optimal value of (1).\n• λ_i(·) denotes the i-th largest eigenvalue of a matrix.\n\n2 Related work\n\nIn the following, we use K and n to refer to the number of iterations and the number of nodes.\n\nStochastic Gradient Descent (SGD) SGD is a powerful approach for solving large-scale machine learning problems. The well-known convergence rate of stochastic gradient descent is O(1/√K) for convex problems and O(1/K) for strongly convex problems [Moulines and Bach, 2011, Nemirovski et al., 2009]. SGD is closely related to online learning algorithms, for example, Crammer et al. [2006], Shalev-Shwartz [2011], Yang et al. [2014]. For SGD on nonconvex optimization, an ergodic convergence rate of O(1/√K) is proved in Ghadimi and Lan [2013].\n\nCentralized parallel SGD For CENTRALIZED PARALLEL SGD (C-PSGD) algorithms, the most popular implementation is based on the parameter server, which is essentially mini-batch SGD admitting a convergence rate of O(1/√(Kn)) [Agarwal and Duchi, 2011, Dekel et al., 2012, Lian et al., 2015], where in each iteration n stochastic gradients are evaluated. In this implementation there is a parameter server communicating with all nodes. The linear speedup is implied by the convergence rate automatically. More implementation details for C-PSGD can be found in Chen et al. [2016], Dean et al. [2012], Li et al. [2014], Zinkevich et al. [2010].
The asynchronous version of centralized parallel SGD is proved to guarantee linear speedup on all kinds of objectives (including convex, strongly convex, and nonconvex objectives) if the staleness of the stochastic gradient is bounded [Agarwal and Duchi, 2011, Feyzmahdavian et al., 2015, Lian et al., 2015, 2016, Recht et al., 2011, Zhang et al., 2016b,c].\n\nDecentralized parallel stochastic algorithms Unlike centralized algorithms, decentralized algorithms do not specify any central node; each node maintains its own local model and can only communicate with its neighbors. Decentralized algorithms can usually be applied to any connected computational network. Lan et al. [2017] proposed a decentralized stochastic algorithm with computational complexity O(n/ε²) for general convex objectives and O(n/ε) for strongly convex objectives. Sirb and Ye [2016] proposed an asynchronous decentralized stochastic algorithm ensuring complexity O(n/ε²) for convex objectives. Algorithms similar to our D-PSGD, in both synchronous and asynchronous fashion, were studied in Ram et al. [2009a, 2010], Srivastava and Nedic [2011], Sundhar Ram et al. [2010]. The difference is that in their algorithms each node can only perform either communication or computation at a time, but not both simultaneously. Sundhar Ram et al. [2010] proposed a stochastic decentralized optimization algorithm for constrained convex optimization; the algorithm can also be used for non-differentiable objectives by using subgradients. Please also refer to Srivastava and Nedic [2011] for the subgradient variant. The analysis in Ram et al. [2009a, 2010], Srivastava and Nedic [2011], Sundhar Ram et al. [2010] requires the gradients of each term of the objective to be bounded by a constant. Bianchi et al. [2013] proposed a similar decentralized stochastic algorithm and provided a convergence rate for the consensus of the local models when the local models are bounded.
The convergence to a solution was also provided by using the central limit theorem, but the rate is unclear. HogWild++ [Zhang et al., 2016a] uses decentralized model parameters for parallel asynchronous SGD on multi-socket systems and shows that this algorithm empirically outperforms some centralized algorithms, yet its convergence and convergence rate are unclear. The common issue among the works above is that the speedup is unclear; that is, we do not know whether decentralized algorithms (involving multiple nodes) can improve the efficiency over using a single node.\n\nOther decentralized algorithms In other areas including control, privacy, and wireless sensor networks, decentralized algorithms are usually studied for solving the consensus problem [Aysal et al., 2009, Boyd et al., 2005, Carli et al., 2010, Fagnani and Zampieri, 2008, Olfati-Saber et al., 2007, Schenato and Gamba, 2007]. Lu et al. [2010] proved that a gossip algorithm converges to the optimal solution for convex optimization. Mokhtari and Ribeiro [2016] analyzed decentralized SAG and SAGA algorithms for minimizing finite-sum strongly convex objectives, but they are not shown to admit any speedup. The decentralized gradient descent method for convex and strongly convex problems was analyzed in Yuan et al. [2016]. Nedic and Ozdaglar [2009], Ram et al. [2009b] studied its subgradient variants. However, this type of algorithm can only converge to a ball around the optimal solution, whose diameter depends on the steplength. This issue was fixed by Shi et al. [2015] using a modified algorithm, namely EXTRA, that is guaranteed to converge to the optimal solution. Wu et al. [2016] analyzed an asynchronous version of decentralized gradient descent with modifications similar to Shi et al. [2015] and showed that the algorithm converges to a solution as K → ∞. Aybat et al.
[2015], Shi et al., Zhang and Kwok [2014] analyzed decentralized ADMM algorithms, which are not shown to have speedup. From all of these reviewed papers, it is still unclear whether decentralized algorithms can have any advantage over their centralized counterparts.\n\n3 Decentralized parallel stochastic gradient descent (D-PSGD)\n\nAlgorithm 1 Decentralized Parallel Stochastic Gradient Descent (D-PSGD) on the i-th node\nRequire: initial point x_{0,i} = x_0, steplength γ, weight matrix W, and number of iterations K\n1: for k = 0, 1, 2, ..., K − 1 do\n2:   Randomly sample ξ_{k,i} from the local data of the i-th node\n3:   Compute the local stochastic gradient ∇F_i(x_{k,i}; ξ_{k,i}) on all nodes^a\n4:   Compute the neighborhood weighted average by fetching optimization variables from neighbors: x_{k+1/2,i} = Σ_{j=1}^n W_{ij} x_{k,j}^b\n5:   Update the local optimization variable x_{k+1,i} ← x_{k+1/2,i} − γ∇F_i(x_{k,i}; ξ_{k,i})^c\n6: end for\n7: Output: (1/n) Σ_{i=1}^n x_{K,i}\n\n^a Note that the stochastic gradient computed here can be replaced with a mini-batch of stochastic gradients, which will not hurt our theoretical results.\n^b Note that Line 3 and Line 4 can be run in parallel.\n^c Note that Line 4 and Line 5 can be exchanged. That is, we first apply the local stochastic gradient to the local optimization variable, and then average the local optimization variable with neighbors. This does not hurt our theoretical analysis. When Line 4 is logically before Line 5, Line 3 and Line 4 can be run in parallel. That is to say, if the communication time used by Line 4 is smaller than the computation time used by Line 3, the communication time can be completely hidden (it is overlapped by the computation time).\n\nThis section introduces the D-PSGD algorithm. We represent the decentralized communication topology with an undirected weighted graph (V, W).
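Algorithm 1 can be sketched in a few lines. The toy problem below is our own illustration, not from the paper: node i holds the deterministic objective f_i(x) = (x − a_i)²/2 (so the "stochastic" gradients have zero variance, σ = 0), and the nodes are connected in a ring with the uniform 1/3 weights used later in Theorem 3.

```python
# Minimal D-PSGD sketch (Algorithm 1) on a toy scalar problem.
# The global objective f = (1/n) sum_i f_i is minimized at mean(a).

n = 5                          # number of nodes
a = [0.0, 1.0, 2.0, 3.0, 4.0]  # "local data": one target per node
gamma = 0.1                    # steplength
K = 400                        # iterations

# Ring topology: W[i][j] = 1/3 for j in {i-1, i, i+1} (mod n),
# a symmetric doubly stochastic matrix.
W = [[1 / 3 if (j - i) % n in (0, 1, n - 1) else 0.0 for j in range(n)]
     for i in range(n)]

x = [0.0] * n                  # x_{0,i} = x_0 = 0 on every node
for k in range(K):
    grads = [x[i] - a[i] for i in range(n)]               # line 3
    x_half = [sum(W[i][j] * x[j] for j in range(n))       # line 4
              for i in range(n)]
    x = [x_half[i] - gamma * grads[i] for i in range(n)]  # line 5

avg = sum(x) / n               # line 7: output the averaged model
print(avg)                     # close to 2.0, the minimizer of f
```

Because W is doubly stochastic, the gossip step in line 4 never moves the average of the local variables; only the gradient step does, which is why the averaged iterate converges to the minimizer of f.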
V denotes the set of n computational nodes: V := {1, 2, ..., n}. W ∈ R^{n×n} is a symmetric doubly stochastic matrix, which means (i) W_{ij} ∈ [0, 1], ∀i, j, (ii) W_{ij} = W_{ji} for all i, j, and (iii) Σ_j W_{ij} = 1 for all i. We use W_{ij} to encode how much node j can affect node i, while W_{ij} = 0 means nodes i and j are disconnected.\n\nTo design distributed algorithms on a decentralized network, we first distribute the data onto all nodes such that the original objective defined in (1) can be rewritten as\n\nmin_{x∈R^N} f(x) = (1/n) Σ_{i=1}^n E_{ξ∼D_i} F_i(x; ξ),  where we define f_i(x) := E_{ξ∼D_i} F_i(x; ξ).    (2)\n\nThere are two simple ways to achieve (2), both of which can be captured by our theoretical analysis, and both of which imply F_i(·;·) = F(·;·), ∀i.\n\nStrategy-1 All distributions D_i are the same as D, that is, all nodes can access a shared database;\nStrategy-2 The n nodes partition all data in the database and appropriately define a distribution for sampling local data; for example, if D is the uniform distribution over all data, D_i can be defined to be the uniform distribution over the local data.\n\nThe D-PSGD algorithm is a synchronous parallel algorithm. All nodes are usually synchronized by a clock.
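The two data-distribution strategies can be sketched as follows. The toy database, the round-robin partition, and the helper names are all illustrative assumptions, not details from the paper; the point is only that Strategy-1 makes every D_i identical (so ς = 0 later in Assumption 1), while Strategy-2 gives each node a distribution over its own shard (so ς > 0 in general).

```python
import random

# Toy "database": 12 samples, n = 3 nodes.
data = list(range(12))
n = 3

def sample_strategy1(i, rng):
    # Strategy-1: every node draws from the shared database (D_i = D).
    return rng.choice(data)

# Strategy-2: partition the database; node i only sees its own shard.
partitions = [data[i::n] for i in range(n)]  # node 0 gets 0, 3, 6, 9, ...

def sample_strategy2(i, rng):
    return rng.choice(partitions[i])

rng = random.Random(0)
print([sample_strategy2(i, rng) for i in range(n)])
```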
Each node maintains its own local variable and runs the protocol in Algorithm 1 concurrently, which includes three key steps at iteration k:\n\n• Each node computes the stochastic gradient ∇F_i(x_{k,i}; ξ_{k,i})² using the current local variable x_{k,i}, where k is the iteration number and i is the node index;\n• When the synchronization barrier is met, each node exchanges its local variable with its neighbors and averages the local variables it receives with its own local variable;\n• Each node updates its local variable using the average and the local stochastic gradient.\n\nTo view the D-PSGD algorithm from a global view, at iteration k, we define the concatenation of all local variables, random samples, and stochastic gradients by the matrix X_k ∈ R^{N×n}, the vector ξ_k ∈ R^n, and ∂F(X_k, ξ_k), respectively:\n\nX_k := [x_{k,1} ··· x_{k,n}] ∈ R^{N×n},   ξ_k := [ξ_{k,1} ··· ξ_{k,n}]^⊤ ∈ R^n,\n∂F(X_k, ξ_k) := [∇F_1(x_{k,1}; ξ_{k,1})  ∇F_2(x_{k,2}; ξ_{k,2})  ···  ∇F_n(x_{k,n}; ξ_{k,n})] ∈ R^{N×n}.\n\nThen the k-th iteration of Algorithm 1 can be viewed as the following update\n\nX_{k+1} ← X_k W − γ ∂F(X_k; ξ_k).\n\nWe say the algorithm gives an ε-approximation solution if\n\n(1/K) Σ_{k=0}^{K−1} E‖∇f(X_k 1_n / n)‖² ≤ ε.    (3)\n\n4 Convergence rate analysis\n\nThis section provides the analysis for the convergence rate of the D-PSGD algorithm. Our analysis will show that the convergence rate of D-PSGD w.r.t.
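A useful property of the matrix-form update X_{k+1} = X_k W − γ∂F(X_k; ξ_k) is that, because W is (doubly) stochastic with W 1_n = 1_n, the multiplication by W preserves the average of the columns: (X W)1_n/n = X 1_n/n. The small sketch below (toy numbers of our own choosing, with a fully connected uniform-averaging W) checks this numerically.

```python
# Column-average preservation under a doubly stochastic W.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# X in R^{N x n} with N = 2, n = 3 (columns are local variables x_{k,i}).
X = [[1.0, 2.0, 6.0],
     [0.0, 3.0, 3.0]]
W = [[1 / 3] * 3 for _ in range(3)]  # uniform averaging, doubly stochastic

avg_before = [sum(row) / 3 for row in X]    # X 1_n / n
avg_after = [sum(row) / 3 for row in matmul(X, W)]
print(avg_before, avg_after)                # identical up to rounding
```

So gossip steps never move the averaged iterate X_k 1_n/n; only the gradient term does, which is why the average behaves like a mini-batch SGD iterate in the analysis.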
iterations is similar to that of C-PSGD (or mini-batch SGD) [Agarwal and Duchi, 2011, Dekel et al., 2012, Lian et al., 2015], but D-PSGD avoids the communication traffic jam on the parameter server.\n\nTo show the convergence results, we first define\n\n∂f(X_k) := [∇f_1(x_{k,1})  ∇f_2(x_{k,2})  ···  ∇f_n(x_{k,n})] ∈ R^{N×n},\n\nwhere the functions f_i(·) are defined in (2).\n\nAssumption 1. Throughout this paper, we make the following commonly used assumptions:\n1. Lipschitzian gradient: All functions f_i(·) have L-Lipschitzian gradients.\n2. Spectral gap: Given the symmetric doubly stochastic matrix W, we define ρ := (max{|λ_2(W)|, |λ_n(W)|})². We assume ρ < 1.\n3. Bounded variance: Assume the variance of the stochastic gradient, E_{i∼U([n])} E_{ξ∼D_i} ‖∇F_i(x; ξ) − ∇f(x)‖², is bounded for any x, with i uniformly sampled from {1, ..., n} and ξ sampled from the distribution D_i. This implies there exist constants σ, ς such that\n\nE_{ξ∼D_i} ‖∇F_i(x; ξ) − ∇f_i(x)‖² ≤ σ², ∀i, ∀x,   E_{i∼U([n])} ‖∇f_i(x) − ∇f(x)‖² ≤ ς², ∀x.\n\nNote that if all nodes can access the shared database, then ς = 0.\n4. Start from 0: We assume X_0 = 0. This assumption simplifies the proof w.l.o.g.\n\nLet\n\nD_1 := 1/2 − 9γ²L²n / ((1 − √ρ)² D_2),   D_2 := 1 − 18γ²nL² / (1 − √ρ)².\n\nUnder Assumption 1, we have the following convergence result for Algorithm 1.\n\nTheorem 1 (Convergence of Algorithm 1). Under Assumption 1, we have the following convergence rate for Algorithm 1:\n\n(1/K) Σ_{k=0}^{K−1} E[ ((1 − γL)/2) ‖∂f(X_k)1_n/n‖² + D_1 ‖∇f(X_k 1_n/n)‖² ]\n  ≤ (f(0) − f*)/(γK) + (γL/(2n)) σ² + γ²L²nσ² / ((1 − ρ)D_2) + 9γ²L²nς² / ((1 − √ρ)² D_2).\n\nNoting that X_k 1_n/n = (1/n) Σ_{i=1}^n x_{k,i}, this theorem characterizes the convergence of the average of all local optimization variables x_{k,i}. To take a closer look at this result, we appropriately choose the steplength in Theorem 1 to obtain the following result:\n\nCorollary 2. Under the same assumptions as in Theorem 1, if we set γ = 1/(2L + σ√(K/n)),³ then Algorithm 1 has the following convergence rate:\n\n(1/K) Σ_{k=0}^{K−1} E‖∇f(X_k 1_n/n)‖² ≤ 8(f(0) − f*)L/K + (8f(0) − 8f* + 4L)σ/√(Kn)    (4)\n\nif the total number of iterations K is sufficiently large, in particular,\n\nK ≥ (4L⁴n⁵ / (σ⁶(f(0) − f* + L)²)) · (σ²/(1 − ρ) + 9ς²/(1 − √ρ)²)²,    (5)\n\nand\n\nK ≥ 72L²n² / (σ²(1 − √ρ)²).    (6)\n\nThis result basically suggests that the convergence rate for D-PSGD is O(1/K + 1/√(Kn)), if K is large enough.\n\n2It can be easily extended to mini-batch stochastic gradient descent.
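The steplength of Corollary 2 and the condition (6) are easy to exercise numerically. The constants below (L, σ, ς, ρ, and f(0) − f*) are hypothetical values chosen only to illustrate the formulas; they are not taken from the paper.

```python
import math

# Hypothetical problem constants, just to exercise Corollary 2.
L, sigma = 1.0, 0.5
n, K = 8, 10**6
rho = 0.64          # spectral quantity of some assumed W, rho < 1

# Steplength from Corollary 2: gamma = 1 / (2L + sigma * sqrt(K/n)).
gamma = 1.0 / (2 * L + sigma * math.sqrt(K / n))

# The two terms of the rate bound (4), with delta := f(0) - f* assumed.
delta = 1.0
bound = 8 * delta * L / K + (8 * delta + 4 * L) * sigma / math.sqrt(K * n)

# Condition (6): K must exceed 72 L^2 n^2 / (sigma^2 (1 - sqrt(rho))^2).
K_min = 72 * L**2 * n**2 / (sigma**2 * (1 - math.sqrt(rho))**2)
print(gamma, bound, K > K_min)
```

With these numbers the 1/√(Kn) term dominates the 1/K term, matching the "linear speedup" discussion that follows.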
We highlight two key observations from this result:\n\nLinear speedup When K is large enough, the 1/K term will be dominated by the 1/√(Kn) term, which leads to a O(1/√(Kn)) convergence rate. It indicates that the total computational complexity⁴ to achieve an ε-approximation solution (3) is bounded by O(1/ε²). Since the total number of nodes does not affect the total complexity, a single node only shares a computational complexity of O(1/(nε²)). Thus linear speedup can be achieved by D-PSGD asymptotically w.r.t. computational complexity.\n\nD-PSGD can be better than C-PSGD Note that this rate is the same as for C-PSGD (or mini-batch SGD with mini-batch size n) [Agarwal and Duchi, 2011, Dekel et al., 2012, Lian et al., 2015]. The advantage of D-PSGD over C-PSGD is that it avoids the communication traffic jam. At each iteration, the maximal communication cost for every single node is O(the degree of the network) for D-PSGD, in contrast with O(n) for C-PSGD. The degree of the network can be much smaller than O(n); for example, it is O(1) in the special case of a ring.\n\nThe key difference from most existing analyses for decentralized algorithms is that we do not use a boundedness assumption for the domain, the gradient, or the stochastic gradient. Such boundedness assumptions can significantly simplify the proofs but lose some subtle structures in the problem.\n\nThe linear speedup indicated by Corollary 2 requires the total number of iterations K to be sufficiently large. The following special example gives a concrete bound on K for the ring network topology.\n\nTheorem 3 (Ring network). Choose the steplength γ as in Corollary 2 and consider the ring network topology with the corresponding W in the form of\n\nW = [ 1/3  1/3   0   ···   0   1/3 ]\n    [ 1/3  1/3  1/3   0   ···   0  ]\n    [  0   1/3  1/3  1/3  ···   0  ]\n    [  ⋮    ⋱    ⋱    ⋱    ⋱    ⋮  ]\n    [  0   ···   0   1/3  1/3  1/3 ]\n    [ 1/3   0   ···   0   1/3  1/3 ] ∈ R^{n×n}.\n\nUnder Assumption 1, Algorithm 1 achieves the same convergence rate as in (4), which indicates that a linear speedup can be achieved, if the number of involved nodes is bounded by\n• n = O(K^{1/9}), if we apply strategy-1 for distributing the data (ς = 0);\n• n = O(K^{1/13}), if we apply strategy-2 for distributing the data (ς > 0),\nwhere the capital "O" swallows σ, ς, L, and f(0) − f*.\n\n3In Theorem 1 and Corollary 2, we choose a constant steplength for simplicity. Using the diminishing steplength O(√(n/k)) can achieve a similar convergence rate by following the proof procedure in this paper. For convex objectives, D-PSGD could be proven to admit the convergence rate O(1/√(nK)), which is consistent with the nonconvex case. For strongly convex objectives, the convergence rate for D-PSGD could be improved to O(1/(nK)), which is consistent with the rate for C-PSGD.\n4The complexity to compute a single stochastic gradient counts as 1.\n\nFigure 2: Comparison between D-PSGD and two centralized implementations (7 and 10 GPUs).\n\nFigure 3: (a) Convergence Rate; (b) D-PSGD Speedup; (c) D-PSGD Communication Patterns.\n\nThis result considers a special decentralized network topology: the ring network, where each node can only exchange information with its two neighbors. The linear speedup can be achieved for up to K^{1/9} and K^{1/13} nodes in the two scenarios, respectively. These two upper bounds can potentially be improved. To the best of our knowledge, this is the first work to show a speedup for decentralized algorithms.\n\nIn this section, we mainly investigated the convergence rate for the average of all local variables {x_{k,i}}_{i=1}^n.
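For the circulant ring matrix above, the eigenvalues are available in closed form, λ_k = 1/3 + (2/3)cos(2πk/n) for k = 0, ..., n − 1, so the spectral quantity ρ of Assumption 1 can be computed directly. The sketch below is illustrative; it also shows why large rings need large K: as n grows, ρ approaches 1 and the spectral gap 1 − √ρ shrinks, tightening conditions (5) and (6).

```python
import math

# Closed-form spectrum of the ring matrix W (1/3 on the diagonal and on
# both neighbors): lambda_k = 1/3 + (2/3) * cos(2*pi*k / n).
def ring_spectral_quantities(n):
    lams = sorted((1 / 3 + (2 / 3) * math.cos(2 * math.pi * k / n)
                   for k in range(n)), reverse=True)
    # rho = (max{|lambda_2|, |lambda_n|})^2, as in Assumption 1.
    rho = max(abs(lams[1]), abs(lams[-1])) ** 2
    return lams, rho

for n in (5, 16, 64):
    lams, rho = ring_spectral_quantities(n)
    print(n, round(lams[0], 6), round(rho, 6))  # lams[0] = 1, rho < 1
```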
Actually one can also obtain a similar rate for each individual x_{k,i}, since all nodes achieve consensus quickly; in particular, the running average of E‖(1/n) Σ_{i'=1}^n x_{k,i'} − x_{k,i}‖² converges to 0 with a O(1/K) rate, where the "O" swallows n, ρ, σ, ς, L, and f(0) − f*. See Theorem 6 in the Supplemental Material for more details.\n\n5 Experiments\n\nWe validate our theory with experiments that compare D-PSGD with other centralized implementations. We run experiments on clusters with up to 112 GPUs and show that, on some network configurations, D-PSGD can outperform well-optimized centralized implementations by an order of magnitude.\n\n5.1 Experiment setting\n\nDatasets and models We evaluate D-PSGD on two machine learning tasks, namely (1) image classification and (2) natural language processing (NLP). For image classification we train ResNet [He et al., 2015] with different numbers of layers on CIFAR-10 [Krizhevsky, 2009]; for natural language processing, we train on both a proprietary and a public dataset with a proprietary CNN model that we obtained from our industry partner [Feng et al., 2016, Lin et al., 2017, Zhang et al., 2017].\n\nImplementations and setups We implement D-PSGD on two different frameworks, namely Microsoft CNTK and Torch. We evaluate four SGD implementations:\n1. CNTK. We compare with the standard CNTK implementation of synchronous SGD. The implementation is based on MPI's AllReduce primitive.\n2. Centralized. We implemented the standard parameter-server-based synchronous SGD using MPI. One node serves as the parameter server in our implementation.\n3. Decentralized. We implemented our D-PSGD algorithm using MPI within CNTK.\n4. EASGD. We compare with the standard EASGD implementation in Torch.\nThe first three implementations are compiled with gcc 7.1, cuDNN 5.0, and OpenMPI 2.1.1.
We fork from CNTK after commit 57d7b9d and enable distributed minibatch reading for all of our experiments. During training, we keep the local batch size of each node the same as in the reference configurations provided by CNTK. We tune the learning rate for each SGD variant and report the best configuration.\n\nMachines/Clusters We conduct experiments on four different machines/clusters:\n1. 7GPUs. A single local machine with 8 GPUs, each of which is an Nvidia TITAN Xp.\n2. 10GPUs. 10 p2.xlarge EC2 instances, each of which has one Nvidia K80 GPU.\n3. 16GPUs. 16 local machines, each of which has two Xeon E5-2680 8-core processors and an NVIDIA K20 GPU. Machines are connected by Gigabit Ethernet in this case.\n4. 112GPUs. 4 p2.16xlarge and 6 p2.8xlarge EC2 instances. Each p2.16xlarge (resp. p2.8xlarge) instance has 16 (resp. 8) Nvidia K80 GPUs.\nIn all of our experiments, we use each GPU as a node.\n\n5.2 Results on CNTK\n\nEnd-to-end performance We first validate that, under certain network configurations, D-PSGD converges faster, in wall-clock time, to a solution that has the same quality as centralized SGD. Figure 2(a, b) and Figure 3(a) show the results of training ResNet20 on 7GPUs. We see that D-PSGD converges faster than both centralized SGD competitors.
This is because when the network is slow, both centralized SGD competitors take more time per epoch due to communication overheads. Figure 3(a, b) illustrates the convergence with respect to the number of epochs, and D-PSGD shows a convergence rate similar to centralized SGD even with 112 nodes.\n\nSpeedup The end-to-end speedup of D-PSGD over centralized SGD highly depends on the underlying network. We use the tc command to manually vary the network bandwidth and latency and compare the wall-clock time that all three SGD implementations need to finish one epoch.\n\nFigure 2(c, d) shows the result. We see that, when the network has high bandwidth and low latency, not surprisingly, all three SGD implementations have similar speed. This is because in this case the communication is never the system bottleneck. However, when the bandwidth becomes smaller (Figure 2(c)) or the latency becomes higher (Figure 2(d)), both centralized SGD implementations slow down significantly. In some cases, D-PSGD can be up to one order of magnitude faster than its centralized competitors. Compared with Centralized (implemented with a parameter server), D-PSGD has more balanced communication patterns between nodes and thus outperforms Centralized in low-bandwidth networks; compared with CNTK (implemented with AllReduce), D-PSGD needs fewer communication rounds between nodes and thus outperforms CNTK in high-latency networks. Figure 3(c) illustrates the communication between nodes for one run of D-PSGD.\n\nWe also vary the number of GPUs that D-PSGD uses and report the speedup over a single GPU to reach the same loss. Figure 3(b) shows the result on a machine with 7 GPUs. We see that, up to 4 GPUs, D-PSGD shows near-linear speedup. When all seven GPUs are used, D-PSGD achieves up to 5× speedup.
This sublinear speedup with 7 GPUs is due to the synchronization cost, and also to the fact that our machine has only 4 PCIe channels, so when more than two GPUs are used they must share PCIe bandwidth.

5.3 Results on Torch
Due to space limitations, the results on Torch can be found in the supplementary material.

6 Conclusion
This paper studies the D-PSGD algorithm on a decentralized computational network. We prove that D-PSGD achieves the same convergence rate (or, equivalently, computational complexity) as the C-PSGD algorithm, but outperforms C-PSGD by avoiding the communication traffic jam on the central node. To the best of our knowledge, this is the first work to show that decentralized algorithms admit a linear speedup and can outperform centralized algorithms.

Limitation and Future Work The potential limitation of D-PSGD lies in the cost of synchronization. Breaking the synchronization barrier could make decentralized algorithms even more efficient, but requires a more complicated analysis. We leave this direction for future work.

On the system side, one future direction is to deploy D-PSGD on larger clusters beyond 112 GPUs; one such environment is state-of-the-art supercomputers. In such environments, we envision D-PSGD serving as a necessary building block that allows multiple "centralized groups" to communicate. It is also interesting to deploy D-PSGD in mobile environments.

Acknowledgements Xiangru Lian and Ji Liu are supported in part by NSF CCF1718513. Ce Zhang gratefully acknowledges the support from the Swiss National Science Foundation NRP 75 407540_167266, IBM Zurich, Mercedes-Benz Research & Development North America, Oracle Labs, Swisscom, the Chinese Scholarship Council, the Department of Computer Science at ETH Zurich, the GPU donation from NVIDIA Corporation, and the cloud computation resources from the Microsoft Azure for Research award program.
Huan Zhang and Cho-Jui Hsieh acknowledge the support of NSF IIS-1719097 and the TACC computation resources.