{"title": "Discovering Neural Wirings", "book": "Advances in Neural Information Processing Systems", "page_first": 2684, "page_last": 2694, "abstract": "The success of neural networks has driven a shift in focus from feature engineering to architecture engineering. However, successful networks today are constructed using a small and manually defined set of building blocks. Even in methods of neural architecture search (NAS) the network connectivity patterns are largely constrained. In this work we propose a method for discovering neural wirings. We relax the typical notion of layers and instead enable channels to form connections independent of each other. This allows for a much larger space of possible networks. The wiring of our network is not fixed during training -- as we learn the network parameters we also learn the structure itself. Our experiments demonstrate that our learned connectivity outperforms hand engineered and randomly wired networks. By learning the connectivity of MobileNetV1 we boost the ImageNet accuracy by 10% at ~41M FLOPs. Moreover, we show that our method generalizes to recurrent and continuous time networks.\nOur work may also be regarded as unifying core aspects of the neural architecture search problem with sparse neural network learning. As NAS becomes more fine grained, finding a good architecture is akin to finding a sparse subnetwork of the complete graph. Accordingly, DNW provides an effective mechanism for discovering sparse subnetworks of predefined architectures in a single training run. Though we only ever use a small percentage of the weights during the forward pass, we still play the so-called initialization lottery with a combinatorial number of subnetworks. 
Code and pretrained models are available at https://github.com/allenai/dnw while additional visualizations may be found at https://mitchellnw.github.io/blog/2019/dnw/.", "full_text": "Discovering Neural Wirings\n\nMitchell Wortsman1,2, Ali Farhadi1,2,3, Mohammad Rastegari1,3\n\n1PRIOR @ Allen Institute for AI, 2University of Washington, 3XNOR.AI\n\nmitchnw@cs.washington.edu, {ali, mohammad}@xnor.ai\n\nAbstract\n\nThe success of neural networks has driven a shift in focus from feature engineering\nto architecture engineering. However, successful networks today are constructed\nusing a small and manually defined set of building blocks. Even in methods of\nneural architecture search (NAS) the network connectivity patterns are largely\nconstrained. In this work we propose a method for discovering neural wirings. We\nrelax the typical notion of layers and instead enable channels to form connections\nindependent of each other. This allows for a much larger space of possible networks.\nThe wiring of our network is not fixed during training \u2013 as we learn the network\nparameters we also learn the structure itself. Our experiments demonstrate that our\nlearned connectivity outperforms hand engineered and randomly wired networks.\nBy learning the connectivity of MobileNetV1 [12] we boost the ImageNet accuracy\nby 10% at ~41M FLOPs. Moreover, we show that our method generalizes to\nrecurrent and continuous time networks. Our work may also be regarded as unifying\ncore aspects of the neural architecture search problem with sparse neural network\nlearning. As NAS becomes more fine grained, finding a good architecture is akin to\nfinding a sparse subnetwork of the complete graph. Accordingly, DNW provides an\neffective mechanism for discovering sparse subnetworks of predefined architectures\nin a single training run. 
Though we only ever use a small percentage of the weights\nduring the forward pass, we still play the so-called initialization lottery [8] with a\ncombinatorial number of subnetworks. Code and pretrained models are available\nat https://github.com/allenai/dnw while additional visualizations may be\nfound at https://mitchellnw.github.io/blog/2019/dnw/.\n\n1 Introduction\n\nDeep neural networks have shifted the prevailing paradigm from feature engineering to feature\nlearning. The architecture of deep neural networks, however, must still be hand designed in a\nprocess known as architecture engineering. A myriad of recent efforts attempt to automate the\nprocess of architecture design by searching among a set of smaller well-known building blocks\n[30, 34, 37, 19, 2, 20]. While methods of search range from reinforcement learning to gradient based\napproaches [34, 20], the space of possible connectivity patterns is still largely constrained. NAS\nmethods explore wirings between predefined blocks, and [28] learns the recurrent structure of CNNs.\nWe believe that more efficient solutions may arise from searching the space of wirings at a more fine\ngrained level, i.e. single channels.\nIn this work, we consider an unconstrained set of possible wirings by allowing channels to form\nconnections independent of each other. This enables us to discover a wide variety of operations (e.g.\ndepthwise separable convs [12], channel shuffle and split [36], and more). Formally, we treat the\nnetwork as a large neural graph where each node processes a single channel.\nOne key challenge lies in searching the space of all possible wirings \u2013 the number of possible\nsub-graphs is combinatorial in nature. When considering thousands of nodes, traditional search\nmethods are either prohibitively expensive or offer only approximate solutions. 
In this paper we introduce a simple\nand efficient algorithm for discovering neural wirings (DNW). Our method searches the space of all\npossible wirings with a simple modification of the backwards pass.\n\nFigure 1: Dynamic Neural Graph: A 3-layer perceptron (left) can be expressed by a dynamic neural\ngraph with 3 time steps (right).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRecent work in randomly wired neural networks [35] aims to explore the space of novel neural\nnetwork wirings. Intriguingly, they show that constructing neural networks with random graph\nalgorithms often outperforms a manually engineered architecture. However, these wirings are fixed\nduring training.\nOur method for discovering neural wirings is as follows: First, we consider the sole constraint\nthat the total number of edges in the neural graph is fixed to be k. Initially we randomly assign a\nweight to each edge. We then choose the k edges whose weights have the highest magnitude as the real edges and refer to\nthe remaining edges as hallucinated. As we train, we modify the weights of all edges according to a\nspecified update rule. Accordingly, a hallucinated edge may strengthen to the point that it replaces a real\nedge. We tailor the update rule so that when swapping does occur, it is beneficial.\nWe consider the application of DNW for static and dynamic neural graphs. In the static regime each\nnode has a single output and the graphical structure is acyclic. In the case of a dynamic neural graph\nwe allow the state of a node to vary with time. Dynamic neural graphs may contain cycles and\nexpress popular sequential models such as LSTMs [11]. 
As dynamic neural graphs are strictly more\nexpressive than static neural graphs, they can also express feed-forward networks (as in Figure 1).\nOur work may also be regarded as a unification between the problem of neural architecture search\nand sparse neural network learning. As NAS becomes less restrictive and more fine grained, finding a\ngood architecture is akin to finding a sparse sub-network of the complete graph. Accordingly, DNW\nprovides an effective mechanism for discovering sparse networks in a single training run.\nThe Lottery Ticket Hypothesis [8, 9] demonstrates that dense feed-forward neural networks contain\nso-called winning-tickets. These winning-tickets are sparse subnetworks which, when reset to their\ninitialization and trained in isolation, reach an accuracy comparable to their dense counterparts. This\nhypothesis articulates an advantage of overparameterization during training \u2013 having more parameters\nincreases the chance of winning the initialization lottery. We leverage this idea to train a sparse neural\nnetwork without retraining or fine-tuning. Though we only ever use a small percentage of the weights\nduring the forward pass, we still play the lottery with a combinatorial number of sub-networks.\nWe demonstrate the efficacy of DNW on small and large scale data-sets, and for feed-forward,\nrecurrent, continuous, and sparse networks. Notably, we augment MobileNetV1 [12] with DNW to\nachieve a 10% improvement on ImageNet [5] from the hand engineered MobileNetV1 at ~41M\nFLOPs (see footnote 1).\n\n2 Discovering Neural Wirings\n\nIn this section we describe our method for jointly discovering the structure and learning the parameters\nof a neural network. We first consider the algorithm in a familiar setting, a feed-forward neural\nnetwork, which we abstract as a static neural graph. 
We then present a more expressive dynamic\nneural graph which extends to discrete and continuous time and generalizes feed-forward, recurrent,\nand continuous time neural networks.\n\nFootnote 1: We follow [36, 22] and define FLOPs as the number of Multiply-Adds.\n\nFigure 2: An example of a dynamic (left) and static (right) neural graph. Legend: [3x3-conv2D, stride=1]; [3x3-conv2D, stride=2], output zero-padded. Details in Section 2.3.\n\n2.1 Static Neural Graph\nA static neural graph is a directed acyclic graph $\\mathcal{G} = (\\mathcal{V}, \\mathcal{E})$ consisting of nodes $\\mathcal{V}$ and edges\n$\\mathcal{E} \\subseteq \\mathcal{V} \\times \\mathcal{V}$. The state of a node $v \\in \\mathcal{V}$ is given by the random variable $Z_v$. At each node $v$ we apply\na function $f_{\\theta_v}$ and with each edge $(u, v)$ we associate a weight $w_{uv}$. In the case of a multi-layer\nperceptron, $f$ is simply a parameter-free non-linear activation like ReLU [17].\nFor any set $\\mathcal{A} \\subseteq \\mathcal{V}$ we let $Z_{\\mathcal{A}}$ denote $(Z_v)_{v \\in \\mathcal{A}}$, and so $Z_{\\mathcal{V}}$ is the state of all nodes in the network.\n$\\mathcal{V}$ contains a subset of input nodes $\\mathcal{V}_0$ with no parents and output nodes $\\mathcal{V}_E$ with no children. The\ninput data $\\mathcal{X} \\sim p_x$ flows into the network through $\\mathcal{V}_0$ as $Z_{\\mathcal{V}_0} = g_\\phi(\\mathcal{X})$ for a function $g$ which may\nhave parameters $\\phi$. Similarly, the output of the network $\\hat{\\mathcal{Y}}$ is given by $h_\\psi(Z_{\\mathcal{V}_E})$. The state of each node is then\n\n$$Z_v = \\begin{cases} f_{\\theta_v}\\left(\\sum_{(u,v) \\in \\mathcal{E}} w_{uv} Z_u\\right) & v \\in \\mathcal{V} \\setminus \\mathcal{V}_0 \\\\ g_\\phi^{(v)}(\\mathcal{X}) & v \\in \\mathcal{V}_0. \\end{cases} \\tag{1}$$\n\nFor brevity, we let $I_v$ denote the \u201cinput\u201d to node $v$, where $I_v$ may be expressed\n\n$$I_v = \\sum_{(u,v) \\in \\mathcal{E}} w_{uv} Z_u. \\tag{2}$$\n\nIn this work we consider the case where the input and output of each node is a two-dimensional\nmatrix, commonly referred to as a channel. Each node performs a non-linear activation followed by\nnormalization and convolution (which may be strided to reduce the spatial resolution). 
As in [35], we\nno longer conform to the traditional notion of \u201clayers\u201d in a deep network.\nThe combination of a separate $3 \\times 3$ convolution for each channel (depthwise convolution) followed by\na $1 \\times 1$ convolution (pointwise convolution) is often referred to as a depthwise separable convolution,\nand is essential in efficient network design [12, 22]. With a static neural graph this process may\nbe interpreted equivalently as a $3 \\times 3$ convolution at each node followed by information flow on a\ncomplete bipartite graph.\n\n2.2 Discovering a k-Edge neural graph\nWe now outline our method for discovering the edges of a static neural graph subject to the constraint\nthat the total number of edges must not exceed $k$.\nWe consider a set of real edges $\\mathcal{E}$ and a set of hallucinated edges $\\mathcal{E}_{hal} = \\mathcal{V} \\times \\mathcal{V} \\setminus \\mathcal{E}$. The real edge set\nis comprised of the $k$ edges which have the largest magnitude weight. As we allow the magnitude of\nthe weights in both sets to change throughout training, the edges in $\\mathcal{E}_{hal}$ may replace those in $\\mathcal{E}$.\nConsider a hallucinated edge $(u, v) \\notin \\mathcal{E}$. If the gradient is pushing $I_v$ in a direction which aligns\nwith $Z_u$, then our update rule strengthens the magnitude of the weight $w_{uv}$. If this alignment happens\nconsistently then $w_{uv}$ will eventually be strong enough to enter the real edge set $\\mathcal{E}$. As the total\nnumber of edges is conserved, when $(u, v)$ enters the edge set $\\mathcal{E}$ another edge is removed and placed\nin $\\mathcal{E}_{hal}$. This procedure is detailed by Algorithm 1, where $\\mathcal{V}$ is the node set, $\\mathcal{V}_0, \\mathcal{V}_E$ are the input\nand output node sets, $g_\\phi$, $h_\\psi$ and $\\{f_{\\theta_v}\\}_{v \\in \\mathcal{V}}$ are the input, output, and node functions, $p_{xy}$ is the data\ndistribution, $k$ is the number of edges in the graph and $\\mathcal{L}$ is the loss.\n\nAlgorithm 1 DNW-Train($\\mathcal{V}, \\mathcal{V}_0, \\mathcal{V}_E, g_\\phi, h_\\psi, \\{f_{\\theta_v}\\}_{v \\in \\mathcal{V}}, p_{xy}, k, \\mathcal{L}$)\n1: for each pair of nodes $(u, v)$ such that $u < v$ do \u25b7 Initialize\n2:     Initialize $w_{uv}$ by independently sampling from a uniform distribution.\n3: for each training iteration do\n4:     Sample mini batch of data and labels $(\\mathcal{X}, \\mathcal{Y}) = \\{(\\mathcal{X}_i, \\mathcal{Y}_i)\\}$ using $p_{xy}$ \u25b7 Sample data\n5:     $\\mathcal{E} \\leftarrow \\{(u, v) : |w_{uv}| \\geq \\tau\\}$ where $\\tau$ is chosen so that $|\\mathcal{E}| = k$ \u25b7 Choose edges\n6:     $Z_v \\leftarrow f_{\\theta_v}\\left(\\sum_{(u,v) \\in \\mathcal{E}} w_{uv} Z_u\\right)$ for $v \\in \\mathcal{V} \\setminus \\mathcal{V}_0$, and $Z_v \\leftarrow g_\\phi^{(v)}(\\mathcal{X})$ for $v \\in \\mathcal{V}_0$ \u25b7 Forward pass\n7:     $\\hat{\\mathcal{Y}} = h_\\psi(\\{Z_v\\}_{v \\in \\mathcal{V}_E})$ \u25b7 Compute output\n8:     Update $\\phi, \\{\\theta_v\\}_{v \\in \\mathcal{V}}, \\psi$ via SGD & Backprop [26] using loss $\\mathcal{L}(\\hat{\\mathcal{Y}}, \\mathcal{Y})$\n9:     for each pair of nodes $(u, v)$ such that $u < v$ do \u25b7 Update edge weights\n10:         $w_{uv} \\leftarrow w_{uv} + \\left\\langle Z_u, -\\alpha \\frac{\\partial \\mathcal{L}}{\\partial I_v} \\right\\rangle$ \u25b7 Recall $I_v = \\sum_{(u,v) \\in \\mathcal{E}} w_{uv} Z_u$\n\nFigure 3: Gradient flow: On the forward pass we use only the real edges. On the backwards pass\nwe allow the gradient to flow to but not through the hallucinated edges (as in Algorithm 1).\n\nIn practice we may also include a momentum and weight decay (see footnote 2) term in the weight update rule\n(line 10 in Algorithm 1). In fact, the weight update rule looks nearly identical to that in traditional\nSGD & Backprop but for one key difference: we allow the gradient to flow to edges which did not\nexist during the forward pass. Importantly, we do not allow the gradient to flow through these edges\nand so the rest of the parameters update as in traditional SGD & Backprop. This gradient flow is\nillustrated in Figure 3.\nUnder certain conditions we formally show that swapping an edge from $\\mathcal{E}_{hal}$ to $\\mathcal{E}$ decreases the loss $\\mathcal{L}$.\nWe first consider the simple case where the hallucinated edge $(i, k)$ replaces $(j, k) \\in \\mathcal{E}$. In Section C\nwe discuss the proof of a more general case.\nWe let $\\tilde{w}$ denote the weight $w$ after the weight update rule $\\tilde{w}_{uv} = w_{uv} + \\left\\langle Z_u, -\\alpha \\frac{\\partial \\mathcal{L}}{\\partial I_v} \\right\\rangle$. 
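
As a concrete illustration of this update rule, the following is a minimal sketch and not the authors' implementation. It assumes node states are flattened into plain Python lists, that the gradient with respect to each node input is already available, and that momentum and weight decay are omitted; the names inner, real_edges, and dnw_update are hypothetical.

```python
def inner(a, b):
    # Inner product of two flattened node states.
    return sum(x * y for x, y in zip(a, b))

def real_edges(w, k):
    # The real edge set: the k edges with the largest-magnitude weights.
    return set(sorted(w, key=lambda e: abs(w[e]), reverse=True)[:k])

def dnw_update(w, z, grad_in, alpha):
    # w: dict (u, v) -> weight for every candidate edge, real or hallucinated.
    # z: dict u -> state Z_u.  grad_in: dict v -> dL/dI_v.
    # Each weight moves by <Z_u, -alpha * dL/dI_v>, whether or not the edge
    # took part in the forward pass.
    return {(u, v): w_uv + inner(z[u], [-alpha * g for g in grad_in[v]])
            for (u, v), w_uv in w.items()}

# Toy example: nodes 0 and 1 compete for the single edge (k = 1) into node 2.
w = {(0, 2): 0.5, (1, 2): 0.4}
z = {0: [1.0, 0.0], 1: [0.0, 1.0]}
grad_in = {2: [0.0, -1.0]}  # the loss falls as the second entry of I_2 grows
before = real_edges(w, k=1)
w = dnw_update(w, z, grad_in, alpha=0.2)
after = real_edges(w, k=1)
print(before, after)
```

In this toy setting the hallucinated edge (1, 2) is aligned with the descent direction, so its weight grows past that of (0, 2) and the two edges swap roles.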
We\nassume that $\\alpha$ is small enough so that $\\mathrm{sign}(\\tilde{w}) = \\mathrm{sign}(w)$.\nClaim: Assume $\\mathcal{L}$ is Lipschitz continuous. There exists a learning rate $\\alpha^* > 0$ such that for\n$\\alpha \\in (0, \\alpha^*)$ the process of swapping $(i, k)$ for $(j, k)$ will decrease the loss on the mini-batch when\nthe states of the nodes are fixed and $|w_{ik}| < |w_{jk}|$ but $|\\tilde{w}_{ik}| > |\\tilde{w}_{jk}|$.\n\nFootnote 2: Weight decay [18] may in fact be very helpful for eliminating dead ends.\n\nProof. Let $A$ be the value of $I_k$ after the update rule if $(j, k)$ is replaced with $(i, k)$. Let $B$ be the state of\n$I_k$ after the update rule if we do not allow for swapping. $A$ and $B$ are then given by\n\n$$A = \\tilde{w}_{ik} Z_i + \\sum_{(u,k) \\in \\mathcal{E},\\, u \\neq i,j} \\tilde{w}_{uk} Z_u, \\qquad B = \\tilde{w}_{jk} Z_j + \\sum_{(u,k) \\in \\mathcal{E},\\, u \\neq i,j} \\tilde{w}_{uk} Z_u. \\tag{3}$$\n\nAdditionally, let $g = -\\alpha \\frac{\\partial \\mathcal{L}}{\\partial I_k}$ be the direction in which the loss most steeply descends with respect to\n$I_k$. By Lemma 1 (Section D of the Appendix) it suffices to show that moving $I_k$ towards $A$ is more\naligned with $g$ than moving $I_k$ towards $B$. Formally we wish to show that\n\n$$\\langle A - I_k, g \\rangle \\geq \\langle B - I_k, g \\rangle \\tag{4}$$\n\nwhich simplifies to\n\n$$\\tilde{w}_{ik} \\langle Z_i, g \\rangle \\geq \\tilde{w}_{jk} \\langle Z_j, g \\rangle \\tag{5}$$\n$$\\iff \\tilde{w}_{ik} (\\tilde{w}_{ik} - w_{ik}) \\geq \\tilde{w}_{jk} (\\tilde{w}_{jk} - w_{jk}), \\tag{6}$$\n\nwhere the equivalence uses $\\langle Z_u, g \\rangle = \\tilde{w}_{uk} - w_{uk}$ from the update rule.\nIn the case where $\\tilde{w}_{ik}$ and $(\\tilde{w}_{ik} - w_{ik})$ have the same sign but $\\tilde{w}_{jk}$ and $(\\tilde{w}_{jk} - w_{jk})$ have different signs\nthe inequality immediately holds. This corresponds to the case where $w_{ik}$ increases in magnitude but\n$w_{jk}$ decreases in magnitude. The opposite scenario ($w_{ik}$ decreases in magnitude but $w_{jk}$ increases)\nis impossible since $|w_{ik}| < |w_{jk}|$ but $|\\tilde{w}_{ik}| > |\\tilde{w}_{jk}|$.\nWe now consider the scenario where both sides of the inequality (equation 6) are positive. Simplifying\nfurther we obtain\n\n$$\\tilde{w}_{jk} w_{jk} - \\tilde{w}_{ik} w_{ik} \\geq \\tilde{w}_{jk}^2 - \\tilde{w}_{ik}^2 \\tag{7}$$\n\nand are now able to identify a range for $\\alpha$ such that the inequality above is satisfied. By assumption\nthe right hand side is less than 0 and $\\mathrm{sign}(\\tilde{w}) = \\mathrm{sign}(w)$ so $\\tilde{w} w = |\\tilde{w}||w|$. Accordingly, it suffices to\nshow that\n\n$$|\\tilde{w}_{jk}||w_{jk}| - |\\tilde{w}_{ik}||w_{ik}| \\geq 0. \\tag{8}$$\n\nIf we let $\\epsilon = |w_{jk}| - |w_{ik}|$ and $\\alpha^* = \\sup\\{\\alpha : |\\tilde{w}_{ik}| \\leq |\\tilde{w}_{jk}| + \\epsilon |\\tilde{w}_{jk}||w_{ik}|^{-1}\\}$, then for $\\alpha \\in (0, \\alpha^*)$\n\n$$|\\tilde{w}_{jk}||w_{jk}| - |\\tilde{w}_{ik}||w_{ik}| \\geq |\\tilde{w}_{jk}| \\Big( \\underbrace{|w_{jk}| - |w_{ik}|}_{=\\epsilon} - \\epsilon \\Big) = 0 \\tag{9}$$\n\nand so the inequality (equation 7) is satisfied. Here we are implicitly using our assumption that the gradient\nis bounded and we may \u201ctune\u201d $\\alpha$ to control the magnitude $|\\tilde{w}_{ik}| - |\\tilde{w}_{jk}|$. In the case where $\\alpha =\n\\inf\\{\\alpha : |\\tilde{w}_{ik}| > |\\tilde{w}_{jk}|\\}$ the right hand side of equation 7 becomes 0 while the left hand side is $\\epsilon |\\tilde{w}_{jk}| > 0$.\nIn Section E of the Appendix we discuss the effect of $\\theta_v$ on $w_{uv}$. In Section F of the Appendix, we\nshow that the update rule is equivalently a straight-through estimator [1].\n\n2.3 Dynamic Neural Graph\nWe now consider a more general setting where the state of each node $Z_v(t)$ may vary through time.\nWe refer to this model as a dynamic neural graph.\nThe initial conditions of a dynamic neural graph are given by\n\n$$Z_v(0) = \\begin{cases} g_\\phi^{(v)}(\\mathcal{X}) & v \\in \\mathcal{V}_0 \\\\ 0 & v \\in \\mathcal{V} \\setminus \\mathcal{V}_0 \\end{cases} \\tag{10}$$\n\nwhere $\\mathcal{V}_0$ is a designated set of input nodes, which may now have parents.\n\nDiscrete Time Dynamics: For a discrete time neural graph we consider times $\\ell \\in \\{0, 1, ..., L\\}$.\nThe dynamics are then given by\n\n$$Z_v(\\ell + 1) = f_{\\theta_v}\\left( \\sum_{(u,v) \\in \\mathcal{E}} w_{uv} Z_u(\\ell), \\ell \\right) \\tag{11}$$\n\nand the network output is $\\hat{\\mathcal{Y}} = h_\\psi(Z_{\\mathcal{V}_E}(L))$. We may express equation 11 more succinctly as\n\n$$Z_{\\mathcal{V}}(\\ell + 1) = f_\\theta(A_{\\mathcal{G}} Z_{\\mathcal{V}}(\\ell), \\ell) \\tag{12}$$\n\nwhere $Z_{\\mathcal{V}}(\\ell) = (Z_v(\\ell))_{v \\in \\mathcal{V}}$, $f_\\theta(z, \\ell) = (f_{\\theta_v}(z_v, \\ell))_{v \\in \\mathcal{V}}$, and $A_{\\mathcal{G}}$ is the weighted adjacency\nmatrix for graph $\\mathcal{G}$. 
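
The discrete time update in equation 12 can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions that are not in the paper: node states are scalars rather than channels, every node applies the same ReLU with no dependence on the time step, and the names step, run, and relu are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def step(adj, z, f):
    # adj[v][u] holds the weight w_uv of edge (u, v); 0.0 means no edge.
    # Send information through the edges, then apply f at each node.
    n = len(z)
    return [f(sum(adj[v][u] * z[u] for u in range(n))) for v in range(n)]

def run(adj, z0, f, num_steps):
    # Unroll the dynamics for a fixed number of time steps.
    z = list(z0)
    for _ in range(num_steps):
        z = step(adj, z, f)
    return z

# A chain 0 -> 1 -> 2 with weights 2 and 3, run for two time steps.
adj = [[0.0, 0.0, 0.0],
       [2.0, 0.0, 0.0],
       [0.0, 3.0, 0.0]]
z_final = run(adj, [1.0, 0.0, 0.0], relu, num_steps=2)
print(z_final)
```

With a chain-shaped adjacency matrix the signal advances one node per time step, which mirrors how a feed-forward network can be unrolled as a dynamic neural graph.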
Equation 12 suggests the following interpretation: At each time step we\nsend information through the edges using $A_{\\mathcal{G}}$ then apply a function at each node.\nContinuous Time Dynamics: As in [3], we consider the case where $t$ may take on a continuous\nrange of values. We then arrive at dynamics given by\n\n$$\\nabla Z_{\\mathcal{V}}(t) = f_\\theta(A_{\\mathcal{G}} Z_{\\mathcal{V}}(t), t), \\tag{13}$$\n\nwhere $\\nabla$ denotes the derivative with respect to $t$. Interestingly, if $\\mathcal{V}_0$ is a strict subset of $\\mathcal{V}$ we uncover an Augmented Neural ODE [7].\n\nThe discrete time case is unifying in the sense that it may also express any static neural graph.\nIn Figure 1 we illustrate that an MLP may also be expressed by a discrete time neural graph.\nAdditionally, the discrete time dynamics are able to capture sequential models such as LSTMs [11],\nas long as we allow input to flow into $\\mathcal{V}_0$ at any time.\nIn continuous time it is not immediately obvious how to incorporate strided convolutions. One\napproach is to keep the same spatial resolution throughout and pad with zeros after applying strided\nconvolutions. This design is illustrated by Figure 2.\nWe may also apply Algorithm 1 to learn the structure of dynamic neural graphs. One may use\nbackpropagation through time [33] and the adjoint-sensitivity method [3] for optimization in the\ndiscrete and continuous time settings respectively. In Section 3.1, we demonstrate empirically that\nour method performs better than a random graph, though we do not formally justify the application\nof our algorithm in this setting.\n\n2.4 Implementation details for Large Scale Experiments\n\nFor large scale experiments we do not consider the dynamic case as optimization is too expensive.\nAccordingly, we now present our method for constructing a large and efficient static neural graph.\nWith this model we may jointly learn the structure of the graph along with the parameters on ImageNet\n[5]. As illustrated by Table 5 our model closely follows the structure of MobileNetV1 [12], and\nso we refer to it as MobileNetV1-DNW. 
We consider a separate neural graph for each spatial\nresolution \u2013 the output of graph $\\mathcal{G}_i$ is the input of graph $\\mathcal{G}_{i+1}$. For width multiplier [12] $d$ and spatial\nresolution $s \\times s$ we constrain MobileNetV1-DNW to have the same number of edges for resolution\n$s \\times s$ as the corresponding MobileNetV1 ($\\times d$). We use a slightly smaller width multiplier to obtain a\nmodel with similar FLOPs as we do not explicitly reduce the number of depthwise convolutions in\nMobileNetV1-DNW. However, we do find that neurons often die (have no output) and we may then\nskip the depthwise convolution during inference. Note that if we interpret a pointwise convolution\nwith $c_1$ input channels and $c_2$ output channels as a complete bipartite graph then the number of edges\nis simply $c_1 * c_2$.\nWe also constrain the longest path in graph $\\mathcal{G}$ to be equivalent to the number of layers of the\ncorresponding MobileNetV1. We do so by partitioning the nodes $\\mathcal{V}$ into blocks $\\mathcal{B} = \\{\\mathcal{B}_0, ..., \\mathcal{B}_{L-1}\\}$\nwhere $\\mathcal{B}_0$ is the input nodes $\\mathcal{V}_0$, $\\mathcal{B}_{L-1}$ is the output nodes $\\mathcal{V}_E$, and we only allow edges between nodes\nin $\\mathcal{B}_i$ and $\\mathcal{B}_j$ if $i < j$. The longest path in a graph with $L$ blocks is then $L - 1$. Splitting the graph\ninto blocks also improves efficiency as we may operate on one block at a time. The structure of\nMobileNetV1 may be recovered by considering a complete bipartite graph between adjacent blocks.\nThe operation $f_{\\theta_v}$ at each non-output node is a batch-norm [14] (2 parameters), ReLU [17], $3 \\times 3$\nconvolution (9 parameters) triplet. There are no operations at the output nodes. When the spatial\nresolution decreases in MobileNetV1 we change the convolutional stride of the input nodes to 2.\nIn models denoted MobileNetV1-DNW-Small ($\\times d$) we also limit the last fully connected (FC) layer\nto have the same number of edges as the FC layer in MobileNetV1 ($\\times d$). 
In the normal setting of\nMobileNetV1-DNW we do not modify the last FC layer.\n\n3 Experiments\n\nIn this section we demonstrate the effectiveness of DNW for image classification in small and large\nscale settings. We begin by comparing our method with a random wiring on a small scale dataset\nand model. This allows us to experiment in static, discrete time, and continuous settings. Next we\nexplore the use of DNW at scale with experiments on ImageNet [5] and compare DNW with other\nmethods of discovering network structures. Finally we use our algorithm to effectively train sparse\nneural networks without retraining or fine-tuning.\n\nTable 1: Testing a tiny (41k parameters) classifier on CIFAR-10 [16] in static and dynamic settings shown as mean and standard deviation (std) over 5 runs.\n\nModel | Accuracy\nStatic (RG) | 76.1 \u00b1 0.5%\nStatic (DNW) | 80.9 \u00b1 0.6%\nDiscrete Time (RG) | 77.3 \u00b1 0.7%\nDiscrete Time (DNW) | 82.3 \u00b1 0.6%\nContinuous (RG) | 78.5 \u00b1 1.2%\nContinuous (DNW) | 83.1 \u00b1 0.3%\n\nTable 2: Other methods for discovering wirings (using the architecture described in Table 5) tested on CIFAR-10 shown as mean and std over 5 runs. Models with \u2020 first require the complete graph to be trained.\n\nModel | Accuracy\nMobileNetV1 ($\\times 0.25$) | 86.3 \u00b1 0.2%\nMobileNetV1-RG ($\\times 0.225$) | 87.2 \u00b1 0.1%\nNo Update Rule | 86.7 \u00b1 0.5%\nL1 + Anneal | 84.3 \u00b1 0.6%\nTD $\\rho = 0.95$ | 89.2 \u00b1 0.4%\nLottery Ticket (one-shot)\u2020 | 87.9 \u00b1 0.3%\nFine Tune $\\alpha = 0.1$\u2020 | 89.4 \u00b1 0.2%\nFine Tune $\\alpha = 0.01$\u2020 | 89.7 \u00b1 0.1%\nFine Tune $\\alpha = 0.001$\u2020 | 88.7 \u00b1 0.2%\nMobileNetV1-DNW ($\\times 0.225$) | 89.7 \u00b1 0.2%\n\nThroughout this section we let RG denote our primary baseline \u2013 a randomly wired graph. To\nconstruct a randomly wired graph with $k$ edges we assign a uniform random weight to each edge\nthen pick the $k$ edges with the largest magnitude weights. 
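
The baseline construction just described can be sketched as follows. This is a hedged illustration only: the function name random_graph, the weight range of -1 to 1, the restriction to pairs with u < v, and the seed argument are assumptions made for the example.

```python
import random

def random_graph(num_nodes, k, seed=0):
    rng = random.Random(seed)
    # Assign a uniform random weight to every candidate edge (u, v) with u < v.
    w = {(u, v): rng.uniform(-1.0, 1.0)
         for u in range(num_nodes) for v in range(u + 1, num_nodes)}
    # Keep the k edges with the largest-magnitude weights; these stay fixed.
    kept = sorted(w, key=lambda e: abs(w[e]), reverse=True)[:k]
    return {e: w[e] for e in kept}

rg = random_graph(num_nodes=6, k=5)
print(sorted(rg))
```

Unlike DNW, the edge set returned here is never revisited during training.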
As shown in [35], random graphs often\noutperform manually designed networks.\n\n3.1 Small Scale Experiments For Static and Dynamic Neural Graphs\n\nWe begin by training tiny classifiers for the CIFAR-10 dataset [16]. Our initial aim is not to achieve\nstate of the art performance but instead to explore DNW in the static, discrete, and continuous time\nsettings. As illustrated by Table 1, our method outperforms a random graph by 4.6 to 5.0 percentage points in each setting.\nThe image is first downsampled (see footnote 3) then each channel is given as input to a node in a neural graph. The\nstatic graph uses 5 blocks and the discrete time graph uses 5 time steps. For the continuous case we\nbackprop through the operation of an adaptive ODE solver (see footnote 4). The models have 41k parameters. At\neach node we perform Instance Normalization [32], ReLU, and a $3 \\times 3$ single channel convolution.\n\n3.2 ImageNet Classification\n\nFor large scale experiments on ImageNet [5] we are limited to exploring DNW in the static case\n(recurrent and continuous time networks are more expensive to optimize due to lack of parallelization).\nAlthough our network follows the simple structure of MobileNetV1 [12] we are able to achieve higher\naccuracy than modern networks which are more advanced and optimized. Notably, MobileNetV2\n[27] extends MobileNetV1 by adding residual connections and linear bottlenecks and ShuffleNet\n[36, 22] introduces channel splits and channel shuffles. The results of the large scale experiments\nmay be found in Table 3.\nAs standard, we have divided the results of Table 3 to consider models which have similar FLOPs.\nIn the more sparse case (~41M FLOPs) we are able to use DNW to boost the performance of\nMobileNetV1 by 10%. Though random graphs perform extremely well we still observe a 7% boost in\nperformance. In each experiment we train for 250 epochs using Cosine Annealing as the learning rate\nscheduler with initial learning rate 0.1, as in [35]. 
Models using random graphs have considerably\nmore FLOPs as nearly all depthwise convolutions must be performed. DNW allows neurons to die\nand we may therefore skip many operations.\n\nFootnote 3: We use two $3 \\times 3$ strided convolutions. The first is standard while the second is depthwise-separable.\nFootnote 4: We use a 5th order Runge-Kutta method [29] as implemented by [3] (from t = 0 to 1 with tolerance 0.001).\n\nTable 3: ImageNet Experiments (see Section 2.4 for more details). Models with * use the implementations of [22]. Models with multiple asterisks use different image resolutions so that the FLOPs are\ncomparable (see Table 8 in [22] for more details).\n\nModel | Params | FLOPs | Accuracy\nMobileNetV1 ($\\times 0.25$) [12] | 0.5M | 41M | 50.6%\nX-4 MobileNetV1 [25] | - | > 50M | 54.0%\nMobileNetV2 ($\\times 0.15$)* [27] | - | 39M | 44.9%\nMobileNetV2 ($\\times 0.4$)** | - | 43M | 56.6%\nDenseNet ($\\times 0.5$)* [13] | - | 42M | 41.1%\nXception ($\\times 0.5$)* [4] | - | 40M | 55.1%\nShuffleNetV1 ($\\times 0.5$, g = 3) [36] | - | 38M | 56.8%\nShuffleNetV2 ($\\times 0.5$) [22] | 1.4M | 41M | 60.3%\nMobileNetV1-RG ($\\times 0.225$) | 1.2M | 55.7M | 53.3%\nMobileNetV1-DNW-Small ($\\times 0.15$) | 0.24M | 22.1M | 50.3%\nMobileNetV1-DNW-Small ($\\times 0.225$) | 0.4M | 41.2M | 59.9%\nMobileNetV1-DNW ($\\times 0.225$) | 1.1M | 42.1M | 60.9%\nMnasNet-search1 [30] | 1.9M | 65M | 64.9%\nMobileNetV1-DNW ($\\times 0.3$) | 1.3M | 66.7M | 65.0%\nMobileNetV1 ($\\times 0.5$) | 1.3M | 149M | 63.7%\nMobileNetV2 ($\\times 0.6$)* | - | 141M | 66.6%\nMobileNetV2 ($\\times 0.75$)*** | - | 145M | 67.9%\nDenseNet ($\\times 1$)* | - | 142M | 54.8%\nXception ($\\times 1$)* | - | 145M | 65.9%\nShuffleNetV1 ($\\times 1$, g = 3) | - | 140M | 67.4%\nShuffleNetV2 ($\\times 1$) | 2.3M | 146M | 69.4%\nMobileNetV1-RG ($\\times 0.49$) | 1.8M | 170M | 64.1%\nMobileNetV1-DNW ($\\times 0.49$) | 1.8M | 154M | 70.4%\n\n3.3 Related Methods\n\nWe compare DNW with various methods for discovering neural wirings. 
In Table 2 we use the structure\nof MobileNetV1-DNW but try other methods which find $k$-edge sub-networks. The experiments\nin Table 2 are conducted using CIFAR-10 [16]. We train for 160 epochs using Cosine Annealing as\nthe learning rate scheduler with initial learning rate $\\alpha = 0.1$ unless otherwise noted.\nThe Lottery Ticket Hypothesis: The authors of [8, 9] offer an intriguing hypothesis: sparse sub-\nnetworks may be trained in isolation when reset to their initialization. However, their method for\nfinding so-called winning tickets is quite expensive as it requires training the full graph from scratch.\nWe compare with one-shot pruning from [9]. One-shot pruning is more comparable in training\nFLOPs than iterative pruning [8], though both methods are more expensive in training FLOPs than\nDNW. After training the full network $\\mathcal{G}_{full}$ (i.e. no edges pruned) the optimal sub-network $\\mathcal{G}_k$ with\n$k$ edges is chosen by taking the weights with the highest magnitude. In the row denoted Lottery\nTicket we retrain $\\mathcal{G}_k$ using the initialization of $\\mathcal{G}_{full}$. We also initialize $\\mathcal{G}_k$ with the weights of $\\mathcal{G}_{full}$\nafter training \u2013 denoted Fine Tune in Table 2 (we try different initial learning rates $\\alpha$). Though these\nexperiments perform comparably with DNW, their training is more expensive as the full graph must\ninitially be trained.\nExploring Randomly Wired Networks for Image Recognition: The authors of [35] explore \u201ca\nmore diverse set of connectivity patterns through the lens of randomly wired neural networks.\u201d\nThey achieve impressive performance on ImageNet [5] using random graph algorithms to generate\nthe structure of a neural network. 
Their network connectivity, however, is fixed during training.\nThroughout this section we have a random graph (denoted RG) as our primary baseline \u2013 as in [35]\nwe have seen that random graphs outperform hand-designed networks.\nNo Update Rule: In this ablation on DNW we do not apply the update rule to the hallucinated edges.\nAn edge may only leave the hallucinated edge set if the magnitude of a real edge is sufficiently\ndecreased. This experiment demonstrates the importance of the update rule.\nL1 + Anneal: We experiment with a simple pruning technique \u2013 start with a fully connected graph\nand remove edges by magnitude throughout training until there are only $k$ remaining. We found that\naccuracy was much better if we added an L1 regularization term.\n\nTable 4: Training a tuned version of ResNet50 on ImageNet with modern optimization techniques, as\nin Appendix C of [6]. For All Layers Sparse, every layer has a fixed sparsity. In contrast, we leave\nthe very first convolution dense for First Layer Dense. The parameters in the first layer constitute\nonly 0.04% of the total network.\n\nMethod | Weights (%) | Top-1 Accuracy | Top-5 Accuracy\nSparse Networks from Scratch [6] | 10% | 72.9% | 91.5%\nOurs - All Layers Sparse | 10% | 74.0% | 92.0%\nOurs - First Layer Dense | 10% | 75.0% | 92.5%\nSparse Networks from Scratch [6] | 20% | 74.9% | 92.5%\nOurs - All Layers Sparse | 20% | 76.2% | 93.0%\nOurs - First Layer Dense | 20% | 76.6% | 93.4%\nSparse Networks from Scratch [6] | 30% | 75.9% | 92.9%\nOurs - All Layers Sparse | 30% | 76.9% | 93.4%\nOurs - First Layer Dense | 30% | 77.1% | 93.5%\nSparse Networks from Scratch [6] | 100% | 77.0% | 93.5%\nOurs - Dense Baseline | 100% | 77.5% | 93.7%\n\nTargeted Dropout: The authors of [10] present a simple and effective method for training a network\nwhich is robust to subsequent pruning. Their method outperforms variational dropout [23] and\nL0 pruning [21]. 
We compare with Weight Dropout/Pruning from [10], which we denote as TD. Section B of the Appendix contains more information, experimental details, and hyperparameter trials for the Targeted Dropout experiments, though we provide the best result in Table 2.\nNeural Architecture Search: As illustrated by Table 3, our network (with a very simple MobileNetV1-like structure) is able to achieve comparable accuracy to an expensive method which performs neural architecture search using reinforcement learning [30].\n\n3.4 Training Sparse Neural Networks\n\nWe may apply our algorithm for Discovering Neural Wirings to the task of training sparse neural networks. Importantly, our method requires no fine-tuning or retraining to discover sparse sub-networks \u2013 the sparsity is maintained throughout training. This perspective was guided by the work of Dettmers and Zettlemoyer in [6], though we would like to highlight some differences. Their work enables faster training, whereas our backwards pass is still dense. Moreover, their work allows for a redistribution of parameters across layers whereas we consider a fixed sparsity per layer.\nOur algorithm for training a sparse neural network is similar to Algorithm 1, though we implicitly treat each convolution as a separate graph where each parameter is an edge. For each convolutional layer on the forwards pass, we use the top k% of the parameters chosen by magnitude. On the backwards pass we allow the gradient to flow to, but not through, all weights that were zeroed out on the forwards pass. All weights receive gradients as if they existed on the forwards pass, regardless of whether they were zeroed out.\nAs in [6] we leave the biases and batchnorm dense. We compare with the result in Appendix C of [6], as we also use a tuned version of a ResNet50 that uses modern optimization techniques such as cosine learning rate scheduling and warmup (footnote 5). 
We train for 100 epochs and showcase our results in Table 4.\n\n4 Conclusion\n\nWe present a novel method for discovering neural wirings. With a simple algorithm we demonstrate a significant boost in accuracy over randomly wired networks. We benefit from overparameterization during training even when the resulting model is sparse. Just as in [35], our networks are free from the typical constraints of NAS. This work suggests exciting directions for more complex and efficient methods of discovering neural wirings.\n\nFootnote 5: We adapt the code from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Classification/RN50v1.5, using the exact same hyperparameters but training for 100 epochs.\n\nAcknowledgments\nWe thank Sarah Pratt, Mark Yatskar and the Beaker team. We also thank Tim Dettmers for his assistance and guidance in the experiments regarding sparse networks. This work is in part supported by DARPA N66001-19-2-4031, NSF IIS-165205, NSF IIS-1637479, NSF IIS-1703166, Sloan Fellowship, NVIDIA Artificial Intelligence Lab, the Allen Institute for Artificial Intelligence, and the AI2 fellowship for AI. Computations on beaker.org were supported in part by credits from Google Cloud.\n\nReferences\n[1] Yoshua Bengio, Nicholas L\u00e9onard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. ArXiv, abs/1308.3432, 2013.\n\n[2] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In ICLR, 2019.\n\n[3] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.\n\n[4] Fran\u00e7ois Chollet. Xception: Deep learning with depthwise separable convolutions. 
2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1800\u20131807, 2017.\n\n[5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR 2009, 2009.\n\n[6] Tim Dettmers and Luke S. Zettlemoyer. Sparse networks from scratch: Faster training without losing performance. ArXiv, abs/1907.04840, 2019.\n\n[7] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural odes. CoRR, abs/1904.01681, 2019.\n\n[8] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR 2019, 2019.\n\n[9] Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. The lottery ticket hypothesis at scale. CoRR, abs/1903.01611, 2019.\n\n[10] Aidan N. Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. Learning sparse networks using targeted dropout, 2019.\n\n[11] Sepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735\u20131780, 1997.\n\n[12] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.\n\n[13] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261\u20132269, 2017.\n\n[14] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.\n\n[15] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. ArXiv, abs/1611.01144, 2016.\n\n[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. 
Technical report, University of Toronto, 2009.\n\n[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60:84\u201390, 2012.\n\n[18] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NIPS, 1991.\n\n[19] Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19\u201334, 2018.\n\n[20] Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search. CoRR, abs/1806.09055, 2019.\n\n[21] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through l0 regularization. CoRR, abs/1712.01312, 2018.\n\n[22] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In ECCV, 2018.\n\n[23] Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Variational dropout sparsifies deep neural networks. In ICML, 2017.\n\n[24] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.\n\n[25] Ameya Prabhu, Girish Varma, and Anoop M. Namboodiri. Deep expander networks: Efficient deep networks from graph theory. In ECCV, 2017.\n\n[26] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533\u2013536, 1986.\n\n[27] Mark B. Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. 
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510\u20134520, 2018.\n\n[28] Pedro H. P. Savarese and Michael Maire. Learning implicitly recurrent cnns through parameter sharing. ArXiv, abs/1902.09701, 2019.\n\n[29] Lawrence F. Shampine. Some practical runge-kutta formulas. Math. Comput., 46(173):135\u2013150, January 1986.\n\n[30] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.\n\n[31] Yuandong Tian, Tina Jiang, Qucheng Gong, and Ari S. Morcos. Luck matters: Understanding training dynamics of deep relu networks. ArXiv, abs/1905.13405, 2019.\n\n[32] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. CoRR, abs/1607.08022, 2016.\n\n[33] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550\u20131560, Oct 1990.\n\n[34] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang, Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443, 2018.\n\n[35] Saining Xie, Alexander Kirillov, Ross B. Girshick, and Kaiming He. Exploring randomly wired neural networks for image recognition. CoRR, abs/1904.01569, 2019.\n\n[36] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6848\u20136856, 2018.\n\n[37] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. 
arXiv preprint arXiv:1611.01578, 2016.", "award": [], "sourceid": 1542, "authors": [{"given_name": "Mitchell", "family_name": "Wortsman", "institution": "University of Washington, Allen Institute for Artificial Intelligence"}, {"given_name": "Ali", "family_name": "Farhadi", "institution": "University of Washington, Allen Institute for Artificial Intelligence"}, {"given_name": "Mohammad", "family_name": "Rastegari", "institution": "XNOR.AI- AI2"}]}