{"title": "Training Neural Networks Using Features Replay", "book": "Advances in Neural Information Processing Systems", "page_first": 6659, "page_last": 6668, "abstract": "Training a neural network using backpropagation algorithm requires passing error gradients sequentially through the network.\nThe backward locking prevents us from updating network layers in parallel and fully leveraging the computing resources. Recently, there are several works trying to decouple and parallelize the backpropagation algorithm. However, all of them suffer from severe accuracy loss or memory explosion when the neural network is deep. To address these challenging issues, we propose a novel parallel-objective formulation for the objective function of the neural network. After that, we introduce features replay algorithm and prove that it is guaranteed to converge to critical points for the non-convex problem under certain conditions. Finally, we apply our method to training deep convolutional neural networks, and the experimental results show that the proposed method achieves {faster} convergence, {lower} memory consumption, and {better} generalization error than compared methods.", "full_text": "Training Neural Networks Using Features Replay\n\nZhouyuan Huo1,2, Bin Gu2, Heng Huang1,2\u2217\n\n1Electrical and Computer Engineering, University of Pittsburgh, 2 JDDGlobal.com\n\nzhouyuan.huo@pitt.edu, jsgubin@gmail.com\n\nheng.huang@pitt.edu\n\nAbstract\n\nTraining a neural network using backpropagation algorithm requires passing error\ngradients sequentially through the network. The backward locking prevents us from\nupdating network layers in parallel and fully leveraging the computing resources.\nRecently, there are several works trying to decouple and parallelize the backpropa-\ngation algorithm. However, all of them suffer from severe accuracy loss or memory\nexplosion when the neural network is deep. 
To address these challenging issues, we propose a novel parallel-objective formulation for the objective function of the neural network. We then introduce the features replay algorithm and prove that it is guaranteed to converge to critical points for the non-convex problem under certain conditions. Finally, we apply our method to training deep convolutional neural networks, and the experimental results show that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.\n\n1 Introduction\n\nIn recent years, deep convolutional neural networks have made great breakthroughs in computer vision [8, 10, 19, 20, 32, 33], natural language processing [15, 16, 31, 36], and reinforcement learning [21, 23, 24, 25]. The growth in the depth of neural networks is one of the most critical factors contributing to the success of deep learning, which has been verified both in practice [8, 10] and in theory [2, 7, 35]. Gradient-based methods are the major methods to train deep neural networks, such as stochastic gradient descent (SGD) [29], ADAGRAD [6], RMSPROP [9] and ADAM [17]. As long as the loss functions are differentiable, we can compute the gradients of the networks using the backpropagation algorithm [30]. The backpropagation algorithm requires two passes through the neural network: the forward pass to compute activations and the backward pass to compute gradients. As shown in Figure 1 (BP), error gradients are repeatedly propagated from the top (output layer) all the way back to the bottom (input layer) in the backward pass. The sequential propagation of the error gradients is called backward locking, because all layers of the network are locked until their dependencies have executed. According to the benchmark report in [14], the computational time of the backward pass is about twice that of the forward pass. 
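As a minimal illustration of the two passes and of the backward locking, here is a numpy sketch (hypothetical layer sizes and tanh activations, not the paper's code): the backward loop must visit layers from top to bottom, so layer l cannot compute its weight gradient until layer l + 1 has sent the error gradient back.

```python
# Sketch of the two passes of backpropagation on a toy feedforward net.
# The backward pass is inherently sequential: each layer is "locked"
# until it receives the error gradient delta from the layer above.
import numpy as np

rng = np.random.default_rng(0)
L = 4
dims = [8, 8, 8, 8, 1]                    # hypothetical layer widths
W = [rng.standard_normal((dims[l], dims[l + 1])) * 0.1 for l in range(L)]

x = rng.standard_normal((16, dims[0]))    # a mini-batch of inputs
y = rng.standard_normal((16, 1))          # toy regression targets

# Forward pass: activations flow from the bottom layer to the top layer.
h = [x]
for l in range(L):
    h.append(np.tanh(h[l] @ W[l]))

loss = 0.5 * np.mean((h[-1] - y) ** 2)

# Backward pass: the error gradient delta is propagated sequentially
# from layer L back to layer 1 via the chain rule.
delta = (h[-1] - y) / y.shape[0]          # d loss / d h_L
grads = [None] * L
for l in reversed(range(L)):
    delta = delta * (1.0 - h[l + 1] ** 2) # back through tanh of layer l
    grads[l] = h[l].T @ delta             # d loss / d W_l
    delta = delta @ W[l].T                # error gradient sent to layer l-1

assert all(g.shape == w.shape for g, w in zip(grads, W))
```

Every iteration of the backward loop depends on the `delta` produced by the previous one, which is exactly the dependency the paper sets out to break.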
When networks are quite deep, backward locking becomes the bottleneck of making good use of computing resources, preventing us from updating layers in parallel.\nThere have been several attempts to break the backward locking in the backpropagation algorithm. [4] and [34] avoid the backward locking by removing the backpropagation algorithm completely. In [4], the authors proposed the method of auxiliary coordinates (MAC) and simplified the nested functions by imposing quadratic penalties. Similarly, [34] used Lagrange multipliers to enforce equality constraints between auxiliary variables and activations. Both of the reformulated problems do not require the backpropagation algorithm at all and are easy to parallelize. However, neither of them has been applied to training convolutional neural networks yet.\n\n∗Corresponding Author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Illustrations of the backward pass of the backpropagation algorithm (BP) [30], decoupled neural interface (DNI) [13] and decoupled parallel backpropagation (DDG) [11]. DNI breaks the backward locking by synthesizing error gradients. DDG breaks the backward locking by storing stale gradients.\n\nThere are also several works breaking the dependencies between groups of layers or modules in the backpropagation algorithm. In [13], the authors proposed to remove the backward locking by employing the decoupled neural interface to approximate error gradients (Figure 1 DNI). [1, 27] broke the local dependencies between successive layers and made all hidden layers receive error information from the output layer directly. In the backward pass, we can use the synthetic gradients or the direct feedbacks to update the weights of all modules without incurring any delay. However, these methods work poorly when the neural networks are very deep. 
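As a toy illustration of the synthetic-gradient idea behind DNI (not the paper's setup, and much simpler than the convolutional synthesizer in [13]): a linear "synthesizer" can be trained online to predict a layer's error gradient from its activation, so that the layer would not have to wait for the real gradient to arrive.

```python
# Toy sketch of DNI-style synthetic gradients: a linear synthesizer
# S(h) = M @ h learns to predict the true error gradient d loss / d h.
# The "upper network" here is a fixed linear map A with squared loss;
# all shapes and targets are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))   # fixed upper layers: loss = 0.5*||A h - y||^2
M = np.zeros((5, 5))              # synthesizer weights, trained online

for step in range(2000):
    h = rng.standard_normal(5)
    y = rng.standard_normal(5)
    true_grad = A.T @ (A @ h - y)  # real d loss / d h from the upper layers
    synth_grad = M @ h             # synthesizer's prediction
    # train the synthesizer on its squared prediction error
    err = synth_grad - true_grad
    M -= 0.01 * np.outer(err, h)
```

Since the label term A.T @ y is unpredictable from h alone, the synthesizer can at best learn the h-dependent part of the gradient (here, M approaches A.T @ A); this limitation is one intuition for why a too-small synthesizer fails on deep networks.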
In [11], the authors proposed decoupled parallel backpropagation by using stale gradients, where modules are updated with gradients from different timestamps (Figure 1 DDG). However, it requires large amounts of memory to store the stale gradients and suffers from a loss of accuracy.\nIn this paper, we propose the features replay algorithm, which is free of the above three issues: backward locking, memory explosion and accuracy loss. The main contributions of our work are summarized as follows:\n• Firstly, we propose a novel parallel-objective formulation for the objective function of the neural networks in Section 3. Using this new formulation, we break the backward locking by introducing the features replay algorithm, which is easy to parallelize.\n• Secondly, we provide the theoretical analysis in Section 4 and prove that the proposed method is guaranteed to converge to critical points for the non-convex problem under certain conditions.\n• Finally, we validate our method with experiments on training deep convolutional neural networks in Section 5. Experimental results demonstrate that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.\n\n2 Background\n\nWe assume there is a feedforward neural network with L layers, where w = [w_1, w_2, ..., w_L] ∈ R^d denotes the weights of all layers. The computation in each layer can be represented as taking an input h_{l-1} and producing an activation h_l = F_l(h_{l-1}; w_l) using weight w_l. Given a loss function f and target y, we can formulate the objective function of the neural network f(w) as follows:\n\nmin_w f(h_L, y)  s.t.  h_l = F_l(h_{l-1}; w_l)  for all l ∈ {1, 2, ..., L},  (1)\n\nwhere h_0 denotes the input data x.\n\nFigure 2: Backward pass of the features replay algorithm. We divide a 12-layer neural network into four modules, where each module stores its input history and a stale error gradient from the upper module. At each iteration, all modules compute the activations by inputting features from the history and compute the gradients by applying the chain rule. After that, they receive the error gradients from the upper modules for the next iteration.\n\nBy using stochastic gradient descent, the weights of the network are updated in the direction of their negative gradients of the loss function following:\n\nw_l^{t+1} = w_l^t − γ_t · g_l^t  for all l ∈ {1, 2, ..., L},  (2)\n\nwhere γ_t denotes the stepsize and g_l^t := ∂f_{x^t}(w^t)/∂w_l^t denotes the gradient of the loss function (1) regarding w_l^t with input samples x^t. The backpropagation algorithm [30] is utilized to compute the gradients for the neural networks. 
At iteration t, it requires two passes over the network: in the forward pass, the activations of all layers are computed from the bottom layer l = 1 to the top layer l = L following h_l^t = F_l(h_{l-1}^t; w_l^t); in the backward pass, it applies the chain rule and propagates error gradients through the network from the top layer l = L to the bottom layer l = 1 following:\n\n∂f_{x^t}(w^t)/∂w_l^t = (∂h_l^t/∂w_l^t) × (∂f_{x^t}(w^t)/∂h_l^t)  and  ∂f_{x^t}(w^t)/∂h_{l-1}^t = (∂h_l^t/∂h_{l-1}^t) × (∂f_{x^t}(w^t)/∂h_l^t).  (3)\n\nAccording to (3), computing the gradients for the weights w_l of the layer l is dependent on the error gradient ∂f_{x^t}(w^t)/∂h_l^t from the layer l + 1, which is known as backward locking. Therefore, the backward locking prevents all layers from updating before receiving error gradients from the dependent layers. When the networks are deep, the backward locking becomes the bottleneck in the training process.\n\n3 Features Replay\n\nIn this section, we propose a novel parallel-objective formulation for the objective function of the neural networks. Using our new formulation, we break the backward locking in the backpropagation algorithm by using the features replay algorithm.\n\n3.1 Problem Reformulation\n\nAs shown in Figure 2, we divide an L-layer feedforward neural network into K modules, where K ≪ L, such that w = [w_{G(1)}, w_{G(2)}, ..., w_{G(K)}] ∈ R^d and G(k) denotes the layers in the module k. Let L_k represent the last layer of the module k; the output of this module can be written as h_{L_k}. The error gradient variable is denoted as δ_k^t, which is used for the gradient computation of the module k. We can split the problem (1) into K subproblems. 
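The partition into modules of consecutive layers (the mapping G(k) above) can be sketched as follows; the even split is an assumption, and any contiguous grouping works, e.g. Figure 2's 12-layer network divided into four modules of three layers each.

```python
# Sketch of partitioning an L-layer network into K modules of
# consecutive layers (the paper's G(k)); the near-even split below is
# one hypothetical choice.
def partition_layers(L, K):
    """Return K lists of consecutive 1-based layer indices covering 1..L."""
    sizes = [L // K + (1 if k < L % K else 0) for k in range(K)]
    modules, start = [], 1
    for s in sizes:
        modules.append(list(range(start, start + s)))
        start += s
    return modules

G = partition_layers(L=12, K=4)
# G == [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
# The last layer of module k is L_k = G[k][-1], e.g. module 1 ends at layer 3.
```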
The task of the module k (except k = K) is to minimize the least-squares error between the error gradient variable δ_k^t and ∂f_{h_{L_k}^t}(w^t)/∂h_{L_k}^t, which is the gradient of the loss function regarding h_{L_k}^t with input h_{L_k}^t into the module k + 1; the task of the module K is to minimize the loss between the prediction h_{L_K}^t and the real label y^t. From this point of view, we propose a novel parallel-objective loss function at iteration t as follows:\n\nmin_{w, δ} Σ_{k=1}^{K−1} ||δ_k^t − ∂f_{h_{L_k}^t}(w^t)/∂h_{L_k}^t||_2^2 + f(h_{L_K}^t, y^t)\ns.t. h_{L_k}^t = F_{G(k)}(h_{L_{k−1}}^t; w_{G(k)}^t)  for all k ∈ {1, ..., K},  (4)\n\nwhere h_{L_0}^t denotes the input data x^t. It is obvious that the optimal solution of the left term of the problem (4) is δ_k^t = ∂f_{h_{L_k}^t}(w^t)/∂h_{L_k}^t for all k ∈ {1, ..., K − 1}. In other words, the optimal solution of the module k is dependent on the output of the upper modules. Therefore, minimizing the problem (1) with the backpropagation algorithm is equivalent to minimizing the problem (4) with the first K − 1 subproblems obtaining optimal solutions.\n\nAlgorithm 1 Features Replay Algorithm\n1: Initialize: weights w^0 = [w_{G(1)}^0, ..., w_{G(K)}^0] ∈ R^d and stepsize sequence {γ_t};\n2: for t = 0, 1, 2, ..., T − 1 do\n3:   Sample mini-batch (x^t, y^t) from the dataset and let h_{L_0}^t = x^t;\n4:   for k = 1, ..., K do  (Forward pass)\n5:     Store h_{L_{k−1}}^t in the memory;\n6:     Compute h_{L_k}^t following h_{L_k}^t = F_{G(k)}(h_{L_{k−1}}^t; w_{G(k)}^t);  ← Play\n7:     Send h_{L_k}^t to the module k + 1 if k < K;\n8:   end for\n9:   Compute loss f(w^t) = f(h_{L_K}^t, y^t);\n10:  for k = 1, ..., K in parallel do  (Backward pass)\n11:    Compute h̃_{L_k}^t following h̃_{L_k}^t = F_{G(k)}(h_{L_{k−1}}^{t+k−K}; w_{G(k)}^t);  ← Replay\n12:    Compute the gradient g_{G(k)}^t following (7);\n13:    Update the weights: w_{G(k)}^{t+1} = w_{G(k)}^t − γ_t · g_{G(k)}^t;\n14:    Send ∂f_{h_{L_{k−1}}^{t+k−K}}(w^t)/∂h_{L_{k−1}}^{t+k−K} to the module k − 1 if k > 1;\n15:  end for\n16: end for\n\n3.2 Breaking Dependencies by Replaying Features\n\nThe features replay algorithm is introduced in Algorithm 1. In the forward pass, intermediate features are generated and passed through the network, and the module k keeps a history of its input with size K − k + 1. To break the dependencies between modules in the backward pass, we propose to compute the gradients of the modules using intermediate features from different timestamps. Features replay denotes that the intermediate feature h_{L_{k−1}}^{t+k−K} is input into the module k for the first time in the forward pass at iteration t + k − K, and it is input into the module k for the second time in the backward pass at iteration t. If t + k − K < 0, we set h_{L_{k−1}}^{t+k−K} = 0. 
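This bookkeeping (a history of size K − k + 1 per module, replay of the input first seen at iteration t + k − K) can be simulated with indices alone; the buffer names below are hypothetical and no real network is involved.

```python
# Toy simulation of the features replay schedule: module k (1-based)
# stores K - k + 1 past inputs and, in the backward pass of iteration t,
# replays the input it first received at iteration t + k - K (treated as
# "empty" when t + k - K < 0, matching the zero initialization above).
from collections import deque

K = 4  # hypothetical number of modules

def replay_timestamp(t, k, K):
    return t + k - K  # iteration whose input module k replays at time t

histories = {k: deque(maxlen=K - k + 1) for k in range(1, K + 1)}
replayed = {k: [] for k in range(1, K + 1)}

for t in range(6):
    for k in range(1, K + 1):
        histories[k].append(t)  # input arriving at iteration t
        s = replay_timestamp(t, k, K)
        # once the pipeline is full, the replayed input is simply the
        # oldest entry of the bounded FIFO history
        replayed[k].append(histories[k][0] if s >= 0 else None)
```

Module K replays its current input (no staleness), while module 1 works with features that are K − 1 iterations old; deeper modules therefore need shorter histories.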
Therefore, the new problem can be written as:\n\nmin_{w, δ} Σ_{k=1}^{K−1} ||δ_k^t − ∂f_{h̃_{L_k}^t}(w^t)/∂h̃_{L_k}^t||_2^2 + f(h̃_{L_K}^t, y^t)\ns.t. h̃_{L_k}^t = F_{G(k)}(h_{L_{k−1}}^{t+k−K}; w_{G(k)}^t)  for all k ∈ {1, ..., K},  (5)\n\nwhere ∂f_{h̃_{L_k}^t}(w^t)/∂h̃_{L_k}^t denotes the gradient of the loss f(w^t) regarding h̃_{L_k}^t with input h̃_{L_k}^t into the module k + 1. It is important to note that it is not necessary to obtain the optimal solutions of the first K − 1 subproblems, since we do not compute the optimal solution of the last subproblem either. To avoid this tedious computation, we make a trade-off between the error of the left term in (5) and the computational time by setting:\n\nδ_k^t = ∂f_{h_{L_k}^{t+k−K}}(w^{t−1})/∂h_{L_k}^{t+k−K}  for all k ∈ {1, ..., K − 1},  (6)\n\nwhere ∂f_{h_{L_k}^{t+k−K}}(w^{t−1})/∂h_{L_k}^{t+k−K} denotes the gradient of the loss f(w^{t−1}) regarding h_{L_k}^{t+k−K} with input h_{L_k}^{t+k−K} into the module k + 1 at the previous iteration. Assuming the algorithm has converged as t → ∞, we have w^t ≈ w^{t−1} ≈ w^{t+k−K}, such that h̃_{L_k}^t ≈ h_{L_k}^{t+k−K} and ||∂f_{h_{L_k}^{t+k−K}}(w^{t−1})/∂h_{L_k}^{t+k−K} − ∂f_{h̃_{L_k}^t}(w^t)/∂h̃_{L_k}^t||_2^2 ≈ 0 for all k ∈ {1, ..., K − 1}. Therefore, (6) is a reasonable approximation of the optimal solutions to the first K − 1 subproblems in (5). 
In this way, we break the backward locking in the backpropagation algorithm, because the error gradient variable δ_k^t can be determined at the previous iteration t − 1, such that all modules are independent of each other at iteration t. Additionally, we compute the gradients inside each module following:\n\n∂f_{h_{L_{k−1}}^{t+k−K}}(w^t)/∂w_l^t = (∂h̃_{L_k}^t/∂w_l^t) × δ_k^t  and  ∂f_{h_{L_{k−1}}^{t+k−K}}(w^t)/∂h̃_l^t = (∂h̃_{L_k}^t/∂h̃_l^t) × δ_k^t,  (7)\n\nwhere l ∈ G(k). At the end of each iteration, the module k sends ∂f_{h_{L_{k−1}}^{t+k−K}}(w^t)/∂h_{L_{k−1}}^{t+k−K} to the module k − 1 for the computation of the next iteration.\n\n4 Convergence Analysis\n\nIn this section, we provide theoretical analysis for Algorithm 1. Analyzing the convergence of the problem (5) directly is difficult, as it involves the variables of different timestamps. Instead, we solve this problem by building a connection between the gradients of Algorithm 1 and stochastic gradient descent in Assumption 1, and prove that the proposed method is guaranteed to converge to critical points for the non-convex problem (1).\n\nAssumption 1 (Sufficient direction) We assume that the expectation of the descent direction E[Σ_{k=1}^K g_{G(k)}^t] in Algorithm 1 is a sufficient descent direction of the loss f(w^t) regarding w^t. 
Let ∇f(w^t) denote the full gradient of the loss; there exists a constant σ > 0 such that\n\n⟨∇f(w^t), E[Σ_{k=1}^K g_{G(k)}^t]⟩ ≥ σ||∇f(w^t)||_2^2.  (8)\n\nThe sufficient direction assumption guarantees that the model is moving towards a descent direction of the loss function.\n\nAssumption 2 Throughout this paper, we make two assumptions following [3]:\n• (Lipschitz-continuous gradient) The gradient of f is Lipschitz continuous with a constant L > 0, such that for any w_1, w_2 ∈ R^d, it is satisfied that ||∇f(w_1) − ∇f(w_2)||_2 ≤ L||w_1 − w_2||_2.\n• (Bounded variance) We assume that the second moment of the descent direction in Algorithm 1 is upper bounded. There exists a constant M ≥ 0 such that E||Σ_{k=1}^K g_{G(k)}^t||_2^2 ≤ M.\n\nAccording to the identity for the variance, E||ξ − E[ξ]||_2^2 = E||ξ||_2^2 − ||E[ξ]||_2^2, the variance of the descent direction Σ_{k=1}^K g_{G(k)}^t is guaranteed to be less than M. Under the above assumptions, we prove the convergence rate for the proposed method under two choices of γ_t. Firstly, we analyze the convergence of Algorithm 1 when γ_t is fixed, and prove that the learned model will converge sub-linearly to a neighborhood of the critical points of the non-convex problem.\n\nTheorem 1 Assume that Assumptions 1 and 2 hold, and the fixed stepsize sequence {γ_t} satisfies γ_t = γ for all t ∈ {0, 1, ..., T − 1}. 
In addition, we assume w∗ to be the optimal solution to f(w). Then, the output of Algorithm 1 satisfies:\n\n(1/T) Σ_{t=0}^{T−1} E||∇f(w^t)||_2^2 ≤ (f(w^0) − f(w∗))/(σγT) + γLM/(2σ).  (9)\n\nFigure 3: Sufficient direction constant σ for ResNet164 and ResNet101 on CIFAR-10.\n\nTherefore, the best solution we can obtain is controlled by γLM/(2σ). We also prove that Algorithm 1 can guarantee the convergence to critical points for the non-convex problem, as long as the diminishing stepsizes satisfy the requirements in [29] such that:\n\nlim_{T→∞} Σ_{t=0}^{T−1} γ_t = ∞  and  lim_{T→∞} Σ_{t=0}^{T−1} γ_t^2 < ∞.  (10)\n\nTheorem 2 Assume that Assumptions 1 and 2 hold and the diminishing stepsize sequence {γ_t} satisfies (10). In addition, we assume w∗ to be the optimal solution to f(w). Setting Γ_T = Σ_{t=0}^{T−1} γ_t, the output of Algorithm 1 satisfies:\n\n(1/Γ_T) Σ_{t=0}^{T−1} γ_t E||∇f(w^t)||_2^2 ≤ (f(w^0) − f(w∗))/(σΓ_T) + (LM/(2σ)) · (Σ_{t=0}^{T−1} γ_t^2)/Γ_T.  (11)\n\nRemark 1 Suppose w^s is chosen randomly from {w^t}_{t=0}^{T−1} with probabilities proportional to {γ_t}_{t=0}^{T−1}. 
According to Theorem 2, we can prove that Algorithm 1 guarantees convergence to critical points for the non-convex problem:\n\nlim_{s→∞} E||∇f(w^s)||_2^2 = 0.  (12)\n\n5 Experiments\n\nIn this section, we validate our method with experiments on training deep convolutional neural networks. Experimental results show that the proposed method achieves faster convergence, lower memory consumption and better generalization error than the compared methods.\n\n5.1 Experimental Setting\n\nImplementations: We implement our method in PyTorch [28], and evaluate it with ResNet models [8] on two image classification benchmark datasets: CIFAR-10 and CIFAR-100 [18]. We adopt the standard data augmentation techniques in [8, 10, 22] for training on these two datasets: random cropping, random horizontal flipping and normalizing. We use SGD with a momentum of 0.9, and the stepsize is initialized to 0.01. Each model is trained with batch size 128 for 300 epochs, and the stepsize is divided by a factor of 10 at epochs 150 and 225. The weight decay constant is set to 5 × 10^{−4}. In the experiments, a neural network with K modules is sequentially distributed across K GPUs. All experiments are performed on a server with four Titan X GPUs.\nCompared Methods: We compare the performance of four methods in the experiments, including:\n• BP: the backpropagation algorithm [30] in the PyTorch library.\n• DNI: our implementation of the decoupled neural interface in [13]. Following [13], the synthetic network has two hidden convolutional layers with 5 × 5 filters, padding of size 2, batch normalization [12] and ReLU [26]. The output layer is a convolutional layer with 5 × 5 filters and padding of size 2.\n• DDG: our implementation of the decoupled parallel backpropagation in [11].\n• FR: the features replay algorithm in Algorithm 1.\n\nFigure 4: Training and testing curves for ResNet164, ResNet101 and ResNet152 on CIFAR-10. Rows 1 and 2 present the convergence of the loss function regarding epochs and computational time, respectively. Because DNI diverges for all models, we only plot the result of DNI for ResNet164.\n\n5.2 Sufficient Direction\n\nWe demonstrate that the proposed method satisfies Assumption 1 empirically. In the experiment, we divide ResNet164 and ResNet101 into 4 modules and visualize the variation of the sufficient direction constant σ during the training period in Figure 3. Firstly, it is obvious that the values of σ of these modules are larger than 0 all the time. Therefore, Assumption 1 is satisfied, such that Algorithm 1 is guaranteed to converge to the critical points of the non-convex problem. Secondly, we can observe that the values of σ of the lower modules are relatively small during the first half of the epochs, and become close to 1 afterwards. The variation of σ indicates the difference between the descent direction of FR and the steepest descent direction. A small σ at the early epochs can help the method escape from saddle points and find a better local minimum; a large σ at the final epochs can prevent the method from diverging. In the following, we will show that our method has better generalization error than the compared methods.\n\n5.3 Performance Comparisons\n\nTo evaluate the performance of the compared methods, we utilize three criteria in the experiments: convergence speed, memory consumption, and generalization error.\nFaster Convergence: In the experiments, we evaluate the compared methods with three ResNet models: ResNet164 with the basic building block, and ResNet101 and ResNet152 with the bottleneck building block [8]. 
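For concreteness, the stepwise stepsize schedule used in Section 5.1 (start at 0.01, divide by 10 at epochs 150 and 225 over 300 epochs) can be sketched as a small helper; the function name and keyword defaults are our own.

```python
# Sketch of the stepwise learning-rate schedule from Section 5.1:
# the stepsize starts at 0.01 and is divided by 10 at epochs 150 and 225.
def stepsize(epoch, base=0.01, milestones=(150, 225), factor=0.1):
    lr = base
    for m in milestones:
        if epoch >= m:
            lr *= factor
    return lr

schedule = [stepsize(e) for e in range(300)]
```

This is the same behavior PyTorch exposes as a multi-step schedule with milestones [150, 225] and decay factor 0.1.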
The performances of the compared methods on CIFAR-10 are shown in Figure 4. There are several nontrivial observations as follows. Firstly, DNI cannot converge for any of the models. The synthesizer network in [13] is so small that it cannot learn an accurate approximation of the error gradient when the network is deep. Secondly, DDG cannot converge for the model ResNet152 when we set K = 4. The stale gradients can impose noise on the optimization and lead to divergence. Thirdly, our method converges much faster than BP when we increase the number of modules. In the experiments, the proposed method FR achieves a speedup of up to 2 times compared to BP. We do not consider data parallelism for BP in this section. In the supplementary material, we show that our method also converges faster than BP with data parallelism.\n\nFigure 5: Memory consumption for ResNet164, ResNet101 and ResNet152. We do not report the memory consumption of DNI because it does not converge. DDG also diverges when K = 3, 4 for ResNet152.\n\nAlgorithm  | Backward Locking | Memory (Activations)\nBP [30]    | yes              | O(L)\nDNI [13]   | no               | O(L + K·Ls)\nDDG [11]   | no               | O(LK + K^2)\nFR         | no               | O(L + K^2)\n\nTable 1: Comparisons of the memory consumption of a neural network with L layers, which is divided into K modules with L ≫ K. We use O(L) to represent the memory consumption of the activations. 
For DNI, each gradient synthesizer has Ls layers. From the experiments, it is reasonable to assume that Ls ≫ K to make the algorithm converge. The memory consumed by the weights is negligible compared to the activations.\n\nLower Memory Consumption: In Figure 5, we present the memory consumption of the compared methods for the three models when we vary the number of modules K. We do not consider DNI because it does not converge for any of the models. It is evident that the memory consumptions of FR and BP are very close. On the contrary, when K = 4, the memory consumption of DDG is more than two times that of BP. The observations in the experiment are also consistent with the analysis in Table 1. For DNI, since a three-layer synthesizer network cannot converge, it is reasonable to assume that Ls should be large if the network is very deep. We do not explore this because it is out of the scope of this paper. We always set K very small, such that K ≪ L and K ≪ Ls. FR can still obtain a good speedup when K is very small, according to the second row in Figure 4.\nBetter Generalization Error: Table 2 shows the best testing error rates for the compared methods. We do not report the result of DNI because it does not converge. We can observe that FR always obtains better testing error rates than the other two methods, BP and DDG, by a large margin. We think this is related to the variation of the sufficient descent constant σ. A small σ at the early epochs helps FR escape saddle points and find a better local minimum; a large σ at the final epochs prevents FR from diverging. DDG usually performs worse than BP because the stale gradients impose noise on the optimization, which is commonly observed in asynchronous algorithms with stale gradients [5].\n\nTable 2: Best testing error rates (%) of the compared methods on the CIFAR-10 and CIFAR-100 datasets. 
For DDG and FR, we set K = 2 in the experiment.\n\nModel      | CIFAR [18] | BP [30] | DDG [11] | FR\nResNet164  | C-10       | 6.40    | 6.45     | 6.03\nResNet164  | C-100      | 28.53   | 28.51    | 27.34\nResNet101  | C-10       | 5.25    | 5.35     | 4.97\nResNet101  | C-100      | 23.48   | 24.25    | 23.10\nResNet152  | C-10       | 5.26    | 5.72     | 4.91\nResNet152  | C-100      | 25.20   | 26.39    | 23.61\n\n6 Conclusion\n\nIn this paper, we proposed a novel parallel-objective formulation for the objective function of the neural network and broke the backward locking using a new features replay algorithm. Besides the new algorithm, our theoretical contributions include analyzing the convergence property of the proposed method and proving that the new algorithm is guaranteed to converge to critical points for the non-convex problem under certain conditions. We conducted experiments with deep convolutional neural networks on two image classification datasets, and all experimental results verify that the proposed method can achieve faster convergence, lower memory consumption, and better generalization error than the compared methods.\n\nReferences\n\n[1] David Balduzzi, Hastagiri Vanchinathan, and Joachim M Buhmann. Kickback cuts backprop's red-tape: Biologically plausible credit assignment in neural networks. In AAAI, pages 485–491, 2015.\n\n[2] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.\n\n[3] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.\n\n[4] Miguel Carreira-Perpinan and Weiran Wang. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pages 10–19, 2014.\n\n[5] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. 
arXiv preprint arXiv:1604.00981, 2016.\n\n[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.\n\n[7] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.\n\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.\n\n[9] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, online, 2012.\n\n[10] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.\n\n[11] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.\n\n[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.\n\n[13] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.\n\n[14] Justin Johnson. Benchmarks for popular CNN models. https://github.com/jcjohnson/cnn-benchmarks, 2017.\n\n[15] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.\n\n[16] Yoon Kim. 
Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.

[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[21] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[22] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[26] Vinod Nair and Geoffrey E Hinton.
Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.

[27] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.

[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

[29] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.

[30] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.

[31] Cicero D Santos and Bianca Zadrozny. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826, 2014.

[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.

[34] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning, pages 2722–2731, 2016.

[35] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.

[36] Xiang Zhang, Junbo Zhao, and Yann LeCun.
Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015.