{"title": "The Reversible Residual Network: Backpropagation Without Storing Activations", "book": "Advances in Neural Information Processing Systems", "page_first": 2214, "page_last": 2224, "abstract": "Residual Networks (ResNets) have demonstrated significant improvement over traditional Convolutional Neural Networks (CNNs) on image classification, increasing in performance as networks grow both deeper and wider. However, memory consumption becomes a bottleneck as one needs to store all the intermediate activations for calculating gradients using backpropagation. In this work, we present the Reversible Residual Network (RevNet), a variant of ResNets where each layer's activations can be reconstructed exactly from the next layer's. Therefore, the activations for most layers need not be stored in memory during backprop. We demonstrate the effectiveness of RevNets on CIFAR and ImageNet, establishing nearly identical performance to equally-sized ResNets, with activation storage requirements independent of depth.", "full_text": "The Reversible Residual Network:\n\nBackpropagation Without Storing Activations\n\nAidan N. Gomez\u2217 1, Mengye Ren\u2217 1,2,3, Raquel Urtasun1,2,3, Roger B. Grosse1,2\n\nUniversity of Toronto1\n\nVector Institute for Arti\ufb01cial Intelligence2\n\nUber Advanced Technologies Group3\n\n{aidan, mren, urtasun, rgrosse}@cs.toronto.edu\n\nAbstract\n\nDeep residual networks (ResNets) have signi\ufb01cantly pushed forward the state-of-\nthe-art on image classi\ufb01cation, increasing in performance as networks grow both\ndeeper and wider. However, memory consumption becomes a bottleneck, as one\nneeds to store the activations in order to calculate gradients using backpropaga-\ntion. We present the Reversible Residual Network (RevNet), a variant of ResNets\nwhere each layer\u2019s activations can be reconstructed exactly from the next layer\u2019s.\nTherefore, the activations for most layers need not be stored in memory during\nbackpropagation. 
We demonstrate the effectiveness of RevNets on CIFAR-10,\nCIFAR-100, and ImageNet, establishing nearly identical classi\ufb01cation accuracy\nto equally-sized ResNets, even though the activation storage requirements are\nindependent of depth.\n\n1\n\nIntroduction\n\nOver the last \ufb01ve years, deep convolutional neural networks have enabled rapid performance im-\nprovements across a wide range of visual processing tasks [19, 26, 20]. For the most part, the\nstate-of-the-art networks have been growing deeper. For instance, deep residual networks (ResNets)\n[13] are the state-of-the-art architecture across multiple computer vision tasks [19, 26, 20]. The\nkey architectural innovation behind ResNets was the residual block, which allows information to be\npassed directly through, making the backpropagated error signals less prone to exploding or vanishing.\nThis made it possible to train networks with hundreds of layers, and this vastly increased depth led to\nsigni\ufb01cant performance gains.\nNearly all modern neural networks are trained using backpropagation. Since backpropagation\nrequires storing the network\u2019s activations in memory, the memory cost is proportional to the number\nof units in the network. Unfortunately, this means that as networks grow wider and deeper, storing\nthe activations imposes an increasing memory burden, which has become a bottleneck for many\napplications [34, 37]. Graphics processing units (GPUs) have limited memory capacity, leading to\nconstraints often exceeded by state-of-the-art architectures, some of which reach over one thousand\nlayers [13]. Training large networks may require parallelization across multiple GPUs [7, 28], which\nis both expensive and complicated to implement. Due to memory constraints, modern architectures\nare often trained with a mini-batch size of 1 (e.g. [34, 37]), which is inef\ufb01cient for stochastic gradient\nmethods [11]. 
Reducing the memory cost of storing activations would signi\ufb01cantly improve our\nability to ef\ufb01ciently train wider and deeper networks.\n\n\u2217These authors contributed equally.\nCode available at https://github.com/renmengye/revnet-public\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: (left) A traditional residual block as in Equation 2. (right-top) A basic residual function.\n(right-bottom) A bottleneck residual function.\n\nWe present Reversible Residual Networks (RevNets), a variant of ResNets which is reversible in the\nsense that each layer\u2019s activations can be computed from the subsequent reversible layer\u2019s activations.\nThis enables us to perform backpropagation without storing the activations in memory, with the\nexception of a handful of non-reversible layers. The result is a network architecture whose activation\nstorage requirements are independent of depth, and typically at least an order of magnitude smaller\ncompared with equally sized ResNets. Surprisingly, constraining the architecture to be reversible\nincurs no noticeable loss in performance: in our experiments, RevNets achieved nearly identical\nclassi\ufb01cation accuracy to standard ResNets on CIFAR-10, CIFAR-100, and ImageNet, with only a\nmodest increase in the training time.\n\n2 Background\n\n2.1 Backpropagation\n\nBackpropagation [25] is a classic algorithm for computing the gradient of a cost function with respect\nto the parameters of a neural network. It is used in nearly all neural network algorithms, and is now\ntaken for granted in light of neural network frameworks which implement automatic differentiation\n[1, 2]. Because achieving the memory savings of our method requires manual implementation of part\nof the backprop computations, we brie\ufb02y review the algorithm.\nWe treat backprop as an instance of reverse mode automatic differentiation [24]. Let v1, . . . 
, vK denote a topological ordering of the nodes in the network's computation graph G, where vK denotes the cost function C. Each node is defined as a function fi of its parents in G. Backprop computes the total derivative dC/dvi for each node in the computation graph. This total derivative defines the effect on C of an infinitesimal change to vi, taking into account the indirect effects through the descendants of vi in the computation graph. Note that the total derivative is distinct from the partial derivative ∂f/∂xi of a function f with respect to one of its arguments xi, which does not take into account the effect of changes to xi on the other arguments. To avoid using a small typographical difference to represent a significant conceptual difference, we will denote total derivatives using v̄i = dC/dvi.
Backprop iterates over the nodes in the computation graph in reverse topological order. For each node vi, it computes the total derivative v̄i using the following rule:

    v̄i = Σ_{j ∈ Child(i)} (∂fj/∂vi)^⊤ v̄j,    (1)

where Child(i) denotes the children of node vi in G and ∂fj/∂vi denotes the Jacobian matrix.

2.2 Deep Residual Networks

One of the main difficulties in training very deep networks is the problem of exploding and vanishing gradients, first observed in the context of recurrent neural networks [3]. In particular, because a deep network is a composition of many nonlinear functions, the dependencies across distant layers can be highly complex, making the gradient computations unstable. Highway networks [29] circumvented this problem by introducing skip connections. Similarly, deep residual networks (ResNets) [13] use a functional form which allows information to pass directly through the network, thereby keeping the computations stable.
ResNets currently represent the state-of-the-art in object recognition [13], semantic segmentation [35] and image generation [32]. Outside of vision, residuals have displayed impressive performance in audio generation [31] and neural machine translation [16].
ResNets are built out of modules called residual blocks, which have the following form:

    y = x + F(x),    (2)

where F, a function called the residual function, is typically a shallow neural net. ResNets are robust to exploding and vanishing gradients because each residual block is able to pass signals directly through, allowing the signals to be propagated faithfully across many layers. As displayed in Figure 1, residual functions for image recognition generally consist of stacked batch normalization ("BN") [14], rectified linear activation ("ReLU") [23] and convolution layers (with filters of shape three "C3" and one "C1").
As in He et al. [13], we use two residual block architectures: the basic residual function (Figure 1 right-top) and the bottleneck residual function (Figure 1 right-bottom). The bottleneck residual function consists of three convolutions: the first is a point-wise convolution which reduces the dimensionality of the feature dimension, the second is a standard convolution with filter size 3, and the final point-wise convolution projects into the desired output feature depth.

    a(x) = ReLU(BN(x))
    ck(x) = Conv_{k×k}(a(x))
    Basic(x) = c3(c3(x))
    Bottleneck(x) = c1(c3(c1(x)))    (3)

2.3 Reversible Architectures

Various reversible neural net architectures have been proposed, though for motivations distinct from our own. Deco and Brauer [8] develop a similar reversible architecture to ensure the preservation of information in unsupervised learning contexts. The proposed architecture is indeed residual and constructed to produce a lower triangular Jacobian matrix with ones along the diagonal.
In Deco and Brauer [8], the residual connections are composed of all 'prior' neurons in the layer, while NICE and our own architecture partition the units of a layer into two groups and additively couple one group with a residual function of the other. Maclaurin et al. [21] made use of the reversible nature of stochastic gradient descent to tune hyperparameters via gradient descent. Our proposed method is inspired by nonlinear independent components estimation (NICE) [9, 10], an approach to unsupervised generative modeling. NICE is based on learning a non-linear bijective transformation between the data space and a latent space. The architecture is composed of a series of blocks defined as follows, where x1 and x2 are a partition of the units in each layer:

    y1 = x1
    y2 = x2 + F(x1)    (4)

Because the model is invertible and its Jacobian has unit determinant, the log-likelihood and its gradients can be tractably computed. This architecture imposes some constraints on the functions the network can represent; for instance, it can only represent volume-preserving mappings. Follow-up work by Dinh et al. [10] addressed this limitation by introducing a new reversible transformation:

    y1 = x1
    y2 = x2 ⊙ exp(F(x1)) + G(x1).    (5)

Here, ⊙ represents the Hadamard or element-wise product.
This transformation has a non-unit\nJacobian determinant due to multiplication by exp (F(x1)).\n\n3\n\n\f(a)\n\n(b)\n\nFigure 2: (a) the forward, and (b) the reverse computations of a residual block, as in Equation 8.\n\n3 Methods\n\nWe now introduce Reversible Residual Networks (RevNets), a variant of Residual Networks which is\nreversible in the sense that each layer\u2019s activations can be computed from the next layer\u2019s activations.\nWe discuss how to reconstruct the activations online during backprop, eliminating the need to store\nthe activations in memory.\n\n3.1 Reversible Residual Networks\n\nRevNets are composed of a series of reversible blocks, which we now de\ufb01ne. We must partition the\nunits in each layer into two groups, denoted x1 and x2; for the remainder of the paper, we assume\nthis is done by partitioning the channels, since we found this to work the best in our experiments.2\nEach reversible block takes inputs (x1, x2) and produces outputs (y1, y2) according to the following\nadditive coupling rules \u2013 inspired by NICE\u2019s [9] transformation in Equation 4 \u2013 and residual functions\nF and G analogous to those in standard ResNets:\n\ny1 = x1 + F(x2)\ny2 = x2 + G(y1)\n\nEach layer\u2019s activations can be reconstructed from the next layer\u2019s activations as follows:\n\nx2 = y2 \u2212 G(y1)\nx1 = y1 \u2212 F(x2)\n\n(6)\n\n(7)\n\nNote that unlike residual blocks, reversible blocks must have a stride of 1 because otherwise the layer\ndiscards information, and therefore cannot be reversible. Standard ResNet architectures typically\nhave a handful of layers with a larger stride. 
If we de\ufb01ne a RevNet architecture analogously, the\nactivations must be stored explicitly for all non-reversible layers.\n\n3.2 Backpropagation Without Storing Activations\n\nTo derive the backprop procedure, it is helpful to rewrite the forward (left) and reverse (right)\ncomputations in the following way:\n\nz1 = x1 + F(x2)\ny2 = x2 + G(z1)\ny1 = z1\n\nz1 = y1\nx2 = y2 \u2212 G(z1)\nx1 = z1 \u2212 F(x2)\n\n(8)\n\nEven though z1 = y1, the two variables represent distinct nodes of the computation graph, so the\ntotal derivatives z1 and y1 are different. In particular, z1 includes the indirect effect through y2, while\ny1 does not. This splitting lets us implement the forward and backward passes for reversible blocks\nin a modular fashion. In the backwards pass, we are given the activations (y1, y2) and their total\nderivatives (y1, y2) and wish to compute the inputs (x1, x2), their total derivatives (x1, x2), and the\ntotal derivatives for any parameters associated with F and G. (See Section 2.1 for our backprop\n\n2The possibilities we explored included columns, checkerboard, rows and channels, as done by [10]. We\nfound that performance was consistently superior using the channel-wise partitioning scheme and comparable\nacross the remaining options. We note that channel-wise partitioning has also been explored in the context of\nmulti-GPU training via \u2019grouped\u2019 convolutions [18], and more recently, convolutional neural networks have\nseen signi\ufb01cant success by way of \u2019separable\u2019 convolutions [27, 6].\n\n4\n\n\fnotation.) We do this by combining the reconstruction formulas (Eqn. 8) with the backprop rule\n(Eqn. 1). The resulting algorithm is given as Algorithm 1.3\nBy applying Algorithm 1 repeatedly, one can perform backprop on a sequence of reversible blocks if\none is given simply the activations and their derivatives for the top layer in the sequence. 
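This procedure can be illustrated end-to-end. The sketch below is ours (not the paper's TensorFlow code): it uses NumPy with linear residual functions F(x) = Wf·x and G(x) = Wg·x, chosen so the Jacobian-vector products in Algorithm 1 can be written explicitly, and it checks both exact reconstruction and the resulting gradients against finite differences.

```python
import numpy as np

# Illustrative sketch of a reversible block (Eqn. 8) and Algorithm 1, with
# linear residual functions F(x) = Wf @ x and G(x) = Wg @ x so that each
# Jacobian is simply the weight matrix. Assumed names (forward,
# block_reverse) are ours, for illustration only.

def forward(x1, x2, Wf, Wg):
    """Forward pass: z1 = x1 + F(x2); y2 = x2 + G(z1); y1 = z1."""
    z1 = x1 + Wf @ x2
    y2 = x2 + Wg @ z1
    return z1, y2  # returns (y1, y2), with y1 = z1

def block_reverse(y1, y2, y1_bar, y2_bar, Wf, Wg):
    """Algorithm 1: reconstruct the inputs, then backprop through the block."""
    # Reconstruction (right column of Eqn. 8)
    z1 = y1
    x2 = y2 - Wg @ z1
    x1 = z1 - Wf @ x2
    # Total derivatives; for linear F, G the Jacobians are Wf, Wg
    z1_bar = y1_bar + Wg.T @ y2_bar
    x2_bar = y2_bar + Wf.T @ z1_bar
    x1_bar = z1_bar
    # Parameter gradients: (dF/dWf)^T z1_bar and (dG/dWg)^T y2_bar
    Wf_bar = np.outer(z1_bar, x2)
    Wg_bar = np.outer(y2_bar, z1)
    return (x1, x2), (x1_bar, x2_bar), (Wf_bar, Wg_bar)

rng = np.random.default_rng(0)
d = 4
Wf, Wg = rng.normal(size=(d, d)), rng.normal(size=(d, d))
x1, x2 = rng.normal(size=d), rng.normal(size=d)

y1, y2 = forward(x1, x2, Wf, Wg)
# Gradients of the scalar cost C = sum(y1) + sum(y2) with respect to (y1, y2)
ones = np.ones(d)
(x1_rec, x2_rec), (x1_bar, x2_bar), _ = block_reverse(y1, y2, ones, ones, Wf, Wg)

# The inputs are reconstructed exactly (up to floating-point round-off) ...
assert np.allclose(x1_rec, x1) and np.allclose(x2_rec, x2)

# ... and the gradient matches a finite-difference estimate of dC/dx1[0].
def cost(a, b):
    u, v = forward(a, b, Wf, Wg)
    return u.sum() + v.sum()

eps = 1e-6
fd = (cost(x1 + eps * np.eye(d)[0], x2) - cost(x1, x2)) / eps
assert np.isclose(x1_bar[0], fd, atol=1e-4)
```

Note that the backward pass above never reads stored forward activations: x2 and z1, which the parameter gradients require, are recomputed from (y1, y2).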
In general, a practical architecture would likely also include non-reversible layers, such as subsampling layers; the inputs to these layers would need to be stored explicitly during backprop. However, a typical ResNet architecture involves long sequences of residual blocks and only a handful of subsampling layers; if we mirror the architecture of a ResNet, there would be only a handful of non-reversible layers, and the number would not grow with the depth of the network. In this case, the storage cost of the activations would be small, and independent of the depth of the network.
Computational overhead. In general, for a network with N connections, the forward and backward passes of backprop require approximately N and 2N add-multiply operations, respectively. For a RevNet, the residual functions each must be recomputed during the backward pass. Therefore, the number of operations required for reversible backprop is approximately 4N, or roughly 33% more than ordinary backprop. (This is the same as the overhead introduced by checkpointing [22].) In practice, we have found the forward and backward passes to be about equally expensive on GPU architectures; if this is the case, then the computational overhead of RevNets is closer to 50%.

Algorithm 1 Reversible Residual Block Backprop
 1: function BLOCKREVERSE((y1, y2), (ȳ1, ȳ2))
 2:   z1 ← y1
 3:   x2 ← y2 − G(z1)
 4:   x1 ← z1 − F(x2)
 5:   z̄1 ← ȳ1 + (∂G/∂z1)^⊤ ȳ2    ▷ ordinary backprop
 6:   x̄2 ← ȳ2 + (∂F/∂x2)^⊤ z̄1    ▷ ordinary backprop
 7:   x̄1 ← z̄1
 8:   w̄F ← (∂F/∂wF)^⊤ z̄1    ▷ ordinary backprop
 9:   w̄G ← (∂G/∂wG)^⊤ ȳ2    ▷ ordinary backprop
10:   return (x1, x2) and (x̄1, x̄2) and (w̄F, w̄G)
11: end function

Modularity. 
Note that Algorithm 1 is agnostic to the form of the residual functions F and G. The\nsteps which use the Jacobians of these functions are implemented in terms of ordinary backprop, which\ncan be achieved by calling automatic differentiation routines (e.g. tf.gradients or Theano.grad).\nTherefore, even though implementing our algorithm requires some amount of manual implementation\nof backprop, one does not need to modify the implementation in order to change the residual functions.\nNumerical error. While Eqn. 8 reconstructs the activations exactly when done in exact arithmetic,\npractical float32 implementations may accumulate numerical error during backprop. We study the\neffect of numerical error in Section 5.2; while the error is noticeable in our experiments, it does not\nsigni\ufb01cantly affect \ufb01nal performance. We note that if numerical error becomes a signi\ufb01cant issue,\none could use \ufb01xed-point arithmetic on the x\u2019s and y\u2019s (but ordinary \ufb02oating point to compute F and\nG), analogously to [21]. In principle, this would enable exact reconstruction while introducing little\noverhead, since the computation of the residual functions and their derivatives (which dominate the\ncomputational cost) would be unchanged.\n\n4 Related Work\n\nA number of steps have been taken towards reducing the storage requirements of extremely deep\nneural networks. Much of this work has focused on the modi\ufb01cation of memory allocation within the\ntraining algorithms themselves [1, 2]. Checkpointing [22, 5, 12] is one well-known technique which\n\n3We assume for notational clarity that the residual functions do not share parameters, but Algorithm 1 can be\n\ntrivially extended to a network with weight sharing, such as a recurrent neural net.\n\n5\n\n\fTable 1: Computational and spatial complexity comparisons. 
L denotes the number of layers.

Technique                      Spatial Complexity (Activations)   Computational Complexity
Naive                          O(L)                               O(L)
Checkpointing [22]             O(√L)                              O(L)
Recursive Checkpointing [5]    O(log L)                           O(L log L)
Reversible Networks (Ours)     O(1)                               O(L)

trades off spatial and temporal complexity; during backprop, one stores a subset of the activations (called checkpoints) and recomputes the remaining activations as required. Martens and Sutskever [22] adopted this technique in the context of training recurrent neural networks on a sequence of length T using backpropagation through time [33], storing every ⌈√T⌉-th layer's activations and recomputing the intermediate activations between each during the backward pass. Chen et al. [5] later proposed to recursively apply this strategy on the sub-graph between checkpoints. Gruslys et al. [12] extended this approach by applying dynamic programming to determine a storage strategy which minimizes the computational cost for a given memory budget.
To analyze the computational and memory complexity of these alternatives, assume for simplicity a feed-forward network consisting of L identical layers. Again, for simplicity, assume the units are chosen such that the cost of forward propagation or backpropagation through a single layer is 1, and the memory cost of storing a single layer's activations is 1. In this case, ordinary backpropagation has computational cost 2L and storage cost L for the activations. The method of Martens and Sutskever [22] requires 2√L storage, and it demands an additional forward computation for each layer, leading to a total computational cost of 3L. The recursive algorithm of Chen et al. [5] reduces the required memory to O(log L), while increasing the computational cost to O(L log L). In comparison to these, our method incurs O(1) storage cost (as only a single block must be stored) and computational cost of 3L.
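Under the unit-cost model above, this accounting can be sketched programmatically. The function below is ours, for illustration only; the constant for the O(log L) row is chosen arbitrarily, since only the asymptotic rate is specified.

```python
import math

# Activation-storage and compute costs for an L-layer network under the
# unit-cost model in the text: one layer's forward or backward pass costs 1,
# and storing one layer's activations costs 1. Illustrative accounting only.
def costs(L):
    return {
        "naive":      {"memory": L,                          "compute": 2 * L},
        "checkpoint": {"memory": 2 * math.isqrt(L),          "compute": 3 * L},  # sqrt(L) checkpoints [22]
        "recursive":  {"memory": math.ceil(math.log2(L)),    "compute": L * math.ceil(math.log2(L))},  # [5]
        "reversible": {"memory": 1,                          "compute": 3 * L},  # store one block (ours)
    }

for L in (100, 10000):
    print(L, costs(L))
```

At L = 10000, for example, the naive strategy stores 10000 layers' activations, checkpointing stores 200, and the reversible scheme still stores only one block.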
The time and space complexities of these methods are summarized in Table 1.
Another approach to saving memory is to replace backprop itself. The decoupled neural interface [15] updates each weight matrix using a gradient approximation, termed the synthetic gradient, computed based on only the node's activations instead of the global network error. This removes any long-range gradient computation dependencies in the computation graph, leading to O(1) activation storage requirements. However, these savings are achieved only after the synthetic gradient estimators have been trained; that training requires all the activations to be stored.

5 Experiments

We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26]. In order to make our results directly comparable with standard ResNets, we tried to match both the computational depth and the number of parameters as closely as possible. We observed that each reversible block has the computational depth of two original residual blocks. Therefore, we reduced the total number of residual blocks by approximately half, while approximately doubling the number of channels per block, since they are partitioned into two. Table 2 shows the details of the RevNets and their corresponding traditional ResNets. In all of our experiments, we were interested in whether our RevNet architectures (which are far more memory efficient) were able to match the classification accuracy of ResNets of the same size.

5.1 Implementation

We implemented the RevNets using the TensorFlow library [1]. We manually make calls to TensorFlow's automatic differentiation method (i.e. tf.gradients) to construct the backward-pass computation graph without referencing activations computed in the forward pass.
While building the backward graph, we first reconstruct the input activations (x̂1, x̂2) for each block (Equation 8); second, we apply tf.stop_gradient on the reconstructed inputs to prevent auto-diff from traversing into the reconstructions' computation graph, then call the forward functions again to compute (ŷ1, ŷ2) (Equation 8). Lastly, we use auto-diff to traverse from (ŷ1, ŷ2) to (x̂1, x̂2) and the parameters (wF, wG).

Table 2: Architectural details. 'Bottleneck' indicates whether the residual unit type used was the Bottleneck or Basic variant (see Equation 3). 'Units' indicates the number of residual units in each group. 'Channels' indicates the number of filters used in each unit in each group. 'Params' indicates the number of parameters, in millions, each network uses.

Dataset          Version      Bottleneck   Units      Channels          Params (M)
CIFAR-10 (100)   ResNet-32    No           5-5-5      16-16-32-64       0.46 (0.47)
CIFAR-10 (100)   RevNet-38    No           3-3-3      32-32-64-112      0.46 (0.48)
CIFAR-10 (100)   ResNet-110   No           18-18-18   16-16-32-64       1.73 (1.73)
CIFAR-10 (100)   RevNet-110   No           9-9-9      32-32-64-128      1.73 (1.74)
CIFAR-10 (100)   ResNet-164   Yes          18-18-18   16-16-32-64       1.70 (1.73)
CIFAR-10 (100)   RevNet-164   Yes          9-9-9      32-32-64-128      1.75 (1.79)
ImageNet         ResNet-101   Yes          3-4-23-3   64-128-256-512    44.5
ImageNet         RevNet-104   Yes          2-2-11-2   128-256-512-832   45.2

Table 3: Classification error on CIFAR

                 CIFAR-10 [17]         CIFAR-100 [17]
Architecture     ResNet    RevNet      ResNet     RevNet
32 (38)          7.14%     7.24%       29.95%     28.96%
110              5.74%     5.76%       26.44%     25.40%
164              5.24%     5.17%       23.37%     23.69%

This implementation leverages the convenience of the auto-diff functionality to avoid manually deriving gradients; however, the computational cost becomes 5N, compared with 4N for Algorithm 1, and 3N for ordinary backpropagation (see Section 3.2).
The full theoretical efficiency can be realized by reusing the F and G graphs' activations that were computed in the reconstruction steps (lines 3 and 4 of Algorithm 1).

Table 4: Top-1 classification error on ImageNet (single crop)

ResNet-101   RevNet-104
23.01%       23.10%

5.2 RevNet performance

Our ResNet implementation roughly matches the previously reported classification error rates [13]. As shown in Table 3, our RevNets roughly matched the error rates of traditional ResNets (of roughly equal computational depth and number of parameters) on CIFAR-10 & 100 as well as ImageNet (Table 4). In no condition did the RevNet underperform the ResNet by more than 0.5%, and in some cases, RevNets achieved slightly better performance. Furthermore, Figure 3 compares ImageNet training curves of the ResNet and RevNet architectures; reversibility did not lead to any noticeable per-iteration slowdown in training. (As discussed above, each RevNet update is about 1.5-2× more expensive, depending on the implementation.) We found it surprising that the performance matched so closely, because reversibility would appear to be a significant constraint on the architecture, and one might expect large memory savings to come at the expense of classification error.
Impact of numerical error. As described in Section 3.2, reconstructing the activations over many layers causes numerical errors to accumulate. In order to measure the magnitude of this effect, we computed the angle between the gradients computed using stored and reconstructed activations over the course of training.
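This diagnostic is straightforward to compute from two flattened gradient vectors; a minimal sketch (the function name is ours):

```python
import numpy as np

def gradient_angle_degrees(g_stored, g_reconstructed):
    """Angle (degrees) between two gradient vectors, as in Figure 4 (left)."""
    cos = np.dot(g_stored, g_reconstructed) / (
        np.linalg.norm(g_stored) * np.linalg.norm(g_reconstructed))
    # Clip guards against |cos| exceeding 1 due to floating-point round-off.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Identical (or merely rescaled) gradients give an angle near zero.
angle = gradient_angle_degrees(np.array([1.0, 1.0]), np.array([1.0, 0.0]))  # ≈ 45 degrees
```

The measure is invariant to the gradient's scale, so it isolates the direction error introduced by reconstruction.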
Figure 4 shows how this angle evolved over the course of training for a CIFAR-10 RevNet; while the angle increased during training, it remained small in magnitude.

Table 5: Comparison of parameter and activation storage costs for ResNet and RevNet.

Network      Parameter Cost   Activation Cost
ResNet-101   ∼ 178MB          ∼ 5250MB
RevNet-104   ∼ 180MB          ∼ 1440MB

Figure 3: Training curves for ResNet-101 vs. RevNet-104 on ImageNet, with both networks having approximately the same depth and number of free parameters. Left: training cross entropy; Right: classification error, where dotted lines indicate training, and solid lines validation.

Figure 4: Left: angle (degrees) between the gradient computed using stored and reconstructed activations throughout training. While the angle grows during training, it remains small in magnitude. We measured 4 more epochs after regular training length and did not observe any instability. Middle: training cross entropy; Right: classification error, where dotted lines indicate training, and solid lines validation. No meaningful difference in training efficiency or final performance was observed between stored and reconstructed activations.

Figure 4 also shows training curves for CIFAR-10 networks trained using both methods of computing gradients. Despite the numerical error from reconstructing activations, both methods performed almost indistinguishably in terms of the training efficiency and the final performance.

6 Conclusion and Future Work

We introduced RevNets, a neural network architecture where the activations for most layers need not be stored in memory. We found that RevNets provide considerable reduction in the memory footprint at little or no cost to performance.
As future work, we are currently working on applying RevNets to the task of semantic segmentation, the performance of which is limited by a critical memory bottleneck: the input image patch needs to be large enough to process high resolution images, while the batch size also needs to be large enough to perform effective batch normalization (e.g. [36]). We also intend to develop reversible recurrent neural net architectures; this is a particularly interesting use case, because weight sharing implies that most of the memory cost is due to storing the activations (rather than parameters). Another interesting direction is predicting the activations of previous layers, similar to synthetic gradients. We envision our reversible block as a module which will soon enable training larger and more powerful networks with limited computational resources.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller, D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, et al. Theano: A Python framework for fast computation of mathematical expressions. 
arXiv preprint arXiv:1605.02688, 2016.

[3] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.

[4] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

[5] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.

[7] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In NIPS, 2012.

[8] G. Deco and W. Brauer. Higher order statistical decorrelation without information loss. In Advances in Neural Information Processing Systems 7, pages 247-254. MIT Press, 1995. URL http://papers.nips.cc/paper/901-higher-order-statistical-decorrelation-without-information-loss.pdf.

[9] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation. 2015.

[10] L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. In ICLR, 2017.

[11] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[12] A. Gruslys, R. Munos, I. Danihelka, M. Lanctot, and A. Graves. Memory-efficient backpropagation through time. In NIPS, 2016.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.

[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[15] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, and K. Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.

[16] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.

[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Department of Computer Science, 2009.

[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

[19] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In NIPS, pages 396-404, 1990.

[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.

[21] D. Maclaurin, D. K. Duvenaud, and R. P. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.

[22] J. Martens and I. Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479-535. Springer, 2012.

[23] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

[24] L. B. Rall. Automatic differentiation: Techniques and applications. 1981.

[25] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 1986.

[26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.

[27] L. Sifre. Rigid-motion scattering for image classification. PhD thesis, 2014.

[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[29] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[31] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[32] A. van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. Conditional image generation with PixelCNN decoders. In NIPS, 2016.

[33] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270-280, 1989.

[34] Z. Wu, C. Shen, and A. v. d. Hengel. High-performance semantic segmentation using very deep fully convolutional networks. arXiv preprint arXiv:1604.04339, 2016.

[35] Z. Wu, C. Shen, and A. v. d. Hengel. Wider or deeper: Revisiting the ResNet model for visual recognition. arXiv preprint arXiv:1611.10080, 2016.

[36] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.

[37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

7 Appendix

7.1 Experiment details

For our CIFAR-10/100 experiments, we fixed the mini-batch size to be 100.
The learning rate was initialized to 0.1 and decayed by a factor of 10 at 40K and 60K training steps, training for a total of 80K steps. The weight decay constant was set to 2 × 10⁻⁴ and the momentum to 0.9. We subtracted the mean image, and augmented the dataset with random cropping and random horizontal flipping.

For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28]. We employed synchronous SGD [4] with momentum of 0.9. The model was trained for 600K steps, with factor-of-10 learning rate decays scheduled at 160K, 320K, and 480K steps. Weight decay was set to 1 × 10⁻⁴. We applied the standard input preprocessing and data augmentation used in training Inception networks [30]: pixel intensities rescaled to [0, 1], random cropping of size 224 × 224 around object bounding boxes, random scaling, random horizontal flipping, and color distortion, all of which are available in TensorFlow. For the original ResNet-101, we were unable to fit a mini-batch size of 256 on 4 GPUs, so we instead averaged the gradients from two serial runs with mini-batch size 128 (32 per GPU). For the RevNet, we were able to fit a mini-batch size of 256 on 4 GPUs (i.e., 64 per GPU).

7.2 Memory savings

Fully realizing the theoretical memory gains of RevNets is a non-trivial task, and requires precise low-level GPU memory management. We experimented with two different implementations within TensorFlow.

With the first, we were able to reach reasonable spatial gains using "Tensor Handles" provided by TensorFlow, which preserve the activations of graph nodes between calls to session.run. Multiple session.run calls ensure that TensorFlow frees up activations that will not be referenced later. We segment our computation graph into separate sections and save the bordering activations and gradients into the persistent Tensor Handles.
During the forward pass of the backpropagation algorithm, each section of the graph is executed sequentially, with the input tensors being reloaded from the previous section and the output tensors being saved for use in the subsequent section. We empirically verified the memory gain by fitting at least twice as many examples per GPU while training on ImageNet: each GPU can now fit a mini-batch of 128 images, compared to the original ResNet, which can only fit a mini-batch of 32. This graph-splitting trick brings only a small computational overhead (around 10%).

The second implementation, which achieved the most significant spatial gains, realizes each residual stack as a tf.while_loop with the back_prop parameter set to False. This setting ensures that the activations of each layer in the residual stack (aside from the last) are discarded from memory immediately after their utility expires. We use tf.while_loop for both the forward and backward passes, ensuring that both efficiently discard activations. Using this implementation, we were able to train a 600-layer RevNet on the ImageNet classification task on a single GPU; although prohibitively slow to train, this demonstrates the potential for massive savings in the spatial cost of training extremely deep networks.
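The reason activations can be discarded and recovered at all is the reversible block structure itself: with outputs y1 = x1 + F(x2) and y2 = x2 + G(y1), the inputs can be reconstructed exactly from the outputs alone. The following is a minimal NumPy sketch of this property; the tanh layers stand in as hypothetical residual functions F and G, and all names and shapes are illustrative rather than the paper's implementation:

```python
import numpy as np

# Hypothetical stand-ins for the residual functions F and G
# (convolutional stacks in a real RevNet block); any
# deterministic functions of the input work here.
def F(x, w):
    return np.tanh(x @ w)

def G(x, w):
    return np.tanh(x @ w)

def rev_block_forward(x1, x2, wf, wg):
    # Reversible residual block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = x1 + F(x2, wf)
    y2 = x2 + G(y1, wg)
    return y1, y2

def rev_block_inverse(y1, y2, wf, wg):
    # Reconstruct the inputs from the outputs alone, so the
    # forward pass never needs to store (x1, x2).
    x2 = y2 - G(y1, wg)
    x1 = y1 - F(x2, wf)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
wf, wg = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))

y1, y2 = rev_block_forward(x1, x2, wf, wg)
r1, r2 = rev_block_inverse(y1, y2, wf, wg)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

During the backward pass, each block's inputs are recomputed on the fly by this inverse rather than read from memory, which is what the tf.while_loop implementation described above exploits to keep activation storage independent of depth.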