{"title": "Deep ReLU Networks Have Surprisingly Few Activation Patterns", "book": "Advances in Neural Information Processing Systems", "page_first": 361, "page_last": 370, "abstract": "The success of deep networks has been attributed in part to their expressivity: per parameter, deep networks can approximate a richer class of functions than shallow networks. In ReLU networks, the number of activation patterns is one measure of expressivity, and the maximum number of patterns grows exponentially with the depth. However, recent work has shown that the practical expressivity of deep networks - the functions they can learn rather than express - is often far from the theoretical maximum. In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns. Our work suggests that realizing the full expressivity of deep networks may not be possible in practice, at least with current methods.", "full_text": "Deep ReLU Networks Have\n\nSurprisingly Few Activation Patterns\n\nBoris Hanin\nFacebook AI Research\nTexas A&M University\nbhanin@math.tamu.edu\n\nDavid Rolnick\nUniversity of Pennsylvania\nPhiladelphia, PA USA\ndrolnick@seas.upenn.edu\n\nAbstract\n\nThe success of deep networks has been attributed in part to their expressivity: per parameter, deep networks can approximate a richer class of functions than shallow networks. In ReLU networks, the number of activation patterns is one measure of expressivity, and the maximum number of patterns grows exponentially with the depth. 
However, recent work has shown that the practical expressivity of deep networks – the functions they can learn rather than express – is often far from the theoretical maximum. In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns. Our work suggests that realizing the full expressivity of deep networks may not be possible in practice, at least with current methods.\n\n1 Introduction\n\nA fundamental question in the theory of deep learning is why deeper networks often work better in practice than shallow ones. One proposed explanation is that, while even shallow neural networks are universal approximators [3, 7, 9, 16, 21], there are functions for which increased depth allows exponentially more efficient representations. This phenomenon has been quantified for various complexity measures [4, 5, 6, 10, 18, 19, 22, 23, 24, 25]. However, authors such as Ba and Caruana have called this point of view into question [2], observing that shallow networks can often be trained to imitate deep networks, so that the functions learned in practice by deep networks may not achieve the full expressive power of depth.\nIn this article, we attempt to capture the difference between the maximum complexity of deep networks and the complexity of functions that are actually learned (see Figure 1). We provide theoretical and empirical analyses of the typical complexity of the function computed by a ReLU network N. Given a vector θ of its trainable parameters, N computes a continuous and piecewise linear function x ↦ N(x; θ). 
Each θ is thus associated with a partition of input space ℝ^nin into activation regions, polytopes on which N(x; θ) computes a single linear function corresponding to a fixed activation pattern in the neurons of N.\nWe aim to count the number of such activation regions. This number has been the subject of previous work (see §1.1), with the majority concerning large lower bounds on the maximum over all θ of the number of regions for a given network architecture. In contrast, we are interested in the typical behavior of ReLU nets as they are used in practice. We therefore focus on small upper bounds for the average number of activation regions present for a typical value of θ. Our main contributions are:\n• We give precise definitions and prove several fundamental properties of both linear and activation regions, two concepts that are often conflated in the literature (see §2).\n• We prove in Theorem 5 an upper bound for the expected number of activation regions in a ReLU net N. Roughly, we show that if nin is the input dimension and C is a cube in input space ℝ^nin, then, under reasonable assumptions on network gradients and biases,\n\n(#activation regions of N that intersect C) / vol(C) ≤ (T #neurons)^nin / nin!, T > 0. (1)\n\n• This bound holds in particular for deep ReLU nets at initialization, and is in sharp contrast to the maximum possible number of activation patterns, which is exponential in depth [23, 28].\n• Theorem 5 also strongly suggests that the bounds on the number of activation regions continue to hold approximately throughout training. We empirically verify that this behavior holds, even for networks trained on memorization-based tasks (see §4 and Figures 3-6).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Schematic illustration of the space of functions f : ℝ^nin → ℝ^nout. For a given neural network architecture, there is a set F_express of functions expressible by that architecture. Within this set, the functions corresponding to networks at initialization are concentrated within a set F_init. Intermediate between F_init and F_express is a set F_learn containing the functions which the network has non-vanishing probability of learning using gradient descent. (None of these is of course a formal definition.) This paper seeks to demonstrate the gap between F_express and F_learn and that, at least for certain measures of complexity, there is a surprisingly small gap between F_init and F_learn.\n\nFigure 2: Function defined by a ReLU network of depth 5 and width 8 at initialization. Left: partition of the input space into regions, on each of which the activation pattern of neurons is constant. Right: the function computed by the network, which is linear on each activation region.\nIt may seem counterintuitive that the number of activation patterns in a ReLU net is effectively capped far below its theoretical maximum during training, even for tasks where a higher number of regions would be advantageous (see §4). We provide in §3.2-3.3 two intuitive explanations for this phenomenon. The essence of both is that many activation patterns can be created only when a typical neuron z in N turns on/off repeatedly, forcing the value of its pre-activation z(x) to cross the level of its bias b_z many times. This requires (i) significant overlap between the range of z(x) on the different activation regions of x ↦ z(x) and (ii) the bias b_z to be picked within this overlap. Intuitively, (i) and (ii) require either large or highly coordinated gradients. 
In the former case, z(x) oscillates over a large range of outputs and b_z can be random, while in the latter z(x) may oscillate only over a small range of outputs and b_z is carefully chosen. Neither is likely to happen with a proper initialization. Moreover, both appear to be difficult to learn with gradient-based optimization.\nThe rest of this article is structured as follows. Section 2 gives formal definitions and some important properties of both activation regions and the closely related notion of linear regions (see Definitions 1 and 2). Section 3 contains our main technical result, Theorem 5, stated in §3.1. Sections 3.2 and 3.3 provide heuristics for understanding Theorem 5 and its implications. Finally, §4 is devoted to experiments that push the limits of how many activation regions a ReLU network can learn in practice.\n1.1 Relation to Prior Work\nWe consider the typical number of activation regions in ReLU nets. Interesting bounds on the maximum number of regions are given in [4, 6, 19, 22, 23, 25, 26, 28]. Our main theoretical result, Theorem 5, is related to [14], which conjectured that our Theorem 5 should hold and proved bounds for other notions of average complexity of activation regions. Theorem 5 is also related in spirit to [8], which uses a mean field analysis of wide ReLU nets to show that they are biased towards simple functions. Our empirical work (e.g. §4) is related both to the experiments of [20] and to those of [1, 29]. The last two observe that neural networks are capable of fitting noisy or completely random data. Theorem 5 and the experiments in §4 give a counterpoint, suggesting limitations on the complexity of random functions that ReLU nets can fit in practice (see Figures 4-6).\n\nFigure 3: The average number of activation regions in a 2D cross-section of input space, for fully connected networks of various architectures trained on MNIST. 
Left: a closeup of 0.5 epochs of training. Right: 20 epochs of training. The notation [20, 20, 20] indicates a network with three layers, each of width 20. The number of activation regions starts at approximately (#neurons)^2/2, as predicted by Theorem 5 (see Remark 1). This value changes little during training, first decreasing slightly and then rebounding, but never increasing exponentially. Each curve is averaged over 10 independent training runs, and for each run the number of regions is averaged over 5 different 2D cross-sections, where for each cross-section we count the number of regions in the (infinite) plane passing through the origin and two random training examples. Standard deviations between different runs are shown for each curve. See Appendix A for more details.\n2 How to Think about Activation Regions\nBefore stating our main results on counting activation regions in §3, we provide a formal definition of activation regions and contrast them with linear regions in §2.1. We also note in §2.1 some simple properties of activation regions that are useful both for understanding how they are built up layer by layer in a deep ReLU net and for visualizing them. Then, in §2.2, we explain the relationship between activation regions and arrangements of bent hyperplanes (see Lemma 4).\n2.1 Activation Regions vs. Linear Regions\nOur main objects of study in this article are activation regions, which we now define.\nDefinition 1 (Activation Patterns/Regions). Let N be a ReLU net with input dimension nin. An activation pattern for N is an assignment to each neuron of a sign:\n\nA := {a_z : z a neuron in N} ∈ {−1, 1}^{#neurons}.\n\nFix θ, a vector of trainable parameters in N, and an activation pattern A. 
The activation region corresponding to A and θ is\n\nR(A; θ) := {x ∈ ℝ^nin | (−1)^{a_z} (z(x; θ) − b_z) > 0 for every neuron z in N},\n\nwhere neuron z has pre-activation z(x; θ), bias b_z, and post-activation max{0, z(x; θ) − b_z}. We say the activation regions of N at θ are the non-empty activation regions R(A; θ).\nPerhaps the most fundamental property of activation regions is their convexity.\nLemma 1 (Convexity of Activation Regions). Let N be a ReLU net. Then for every activation pattern A and any vector θ of trainable parameters for N, each activation region R(A; θ) is convex.\nWe note that Lemma 1 has been observed before (e.g. Theorem 2 in [23]), but in much of the literature the difference between linear regions (defined below), which are not necessarily convex, and activation regions, which are, is ignored. It turns out that Lemma 1 holds for any piecewise linear activation, such as leaky ReLU and the hard hyperbolic tangent/sigmoid. This fact seems to be less well-known (see Appendix B.1 for a proof). To provide a useful alternative description of activation regions, for a ReLU net N, a fixed vector θ of trainable parameters, and a neuron z of N, define\n\nH_z(θ) := {x ∈ ℝ^nin | z(x; θ) = b_z}. (2)\n\nThe sets H_z(θ) can be thought of as “bent hyperplanes” (see Lemma 4). The non-empty activation regions of N at θ are the connected components of ℝ^nin with all the bent hyperplanes H_z(θ) removed:\nLemma 2 (Activation Regions as Connected Components). For any ReLU net N and any vector θ of trainable parameters,\n\nactivation regions(N, θ) = connected components(ℝ^nin ∖ ∪_{neurons z} H_z(θ)).\n\nWe prove Lemma 2 in Appendix B.2. We may compare activation regions with linear regions, which are the regions of input space on which the network defines different linear functions.\nDefinition 2 (Linear Regions). 
Let N be a ReLU net with input dimension nin, and fix θ, a vector of trainable parameters for N. Define\n\nB_N(θ) := {x ∈ ℝ^nin | ∇N(· ; θ) is discontinuous at x}. (3)\n\nThe linear regions of N at θ are the connected components of input space with B_N removed:\n\nlinear regions(N, θ) = connected components(ℝ^nin ∖ B_N(θ)).\n\nLinear regions have often been conflated with activation regions, but in some cases they are different. This can, for example, happen when an entire layer of the network is zeroed out by ReLUs, leading many distinct activation regions to coalesce into a single linear region. However, the number of activation regions is always at least as large as the number of linear regions.\nLemma 3 (More Activation Regions than Linear Regions). Let N be a ReLU net. For any parameter vector θ for N, the number of linear regions of N at θ is bounded above by the number of activation regions of N at θ. In fact, the closure of every linear region is the closure of the union of some number of activation regions.\nLemma 3 is proved in Appendix B.3. We prove moreover in Appendix B.4 that, generically, the gradient ∇N is different in the interior of most activation regions, and hence that most activation regions lie in different linear regions. In particular, this means that the number of linear regions is generically very similar to the number of activation regions.\n2.2 Activation Regions and Hyperplane Arrangements\nActivation regions in depth 1 ReLU nets are given by hyperplane arrangements in ℝ^nin (see [27]). Indeed, if N is a ReLU net with one hidden layer, then the sets H_z(θ) from (2) are simply hyperplanes, giving the well-known observation that the activation regions of a depth 1 ReLU net are the connected components of ℝ^nin with the hyperplanes H_z(θ) removed. 
The study of regions induced by hyperplane arrangements in ℝ^n is a classical subject in combinatorics [27]. A basic result is that for hyperplanes in general position (e.g. chosen at random), the total number of connected components coming from an arrangement of m hyperplanes in ℝ^n is constant:\n\n#connected components = Σ_{i=0}^{n} (m choose i) ≈ { m^n/n! if m ≥ n; 2^m if m ≤ n }. (4)\n\nHence, for random w_j, b_j drawn from any reasonable distributions, the number of activation regions in a ReLU net with input dimension nin and one hidden layer of size m is given by (4). The situation is more subtle for deeper networks. By Lemma 2, activation regions are connected components for an arrangement of “bent” hyperplanes H_z(θ) from (2), which are only locally described by hyperplanes. To understand their structure more carefully, fix a ReLU net N with d hidden layers and a vector θ of trainable parameters for N. Write N_j for the network obtained by keeping only the first j layers of N and θ_j for the corresponding parameter vector. The following lemma makes precise the observation that the hyperplane H_z(θ) can bend only when it meets a bent hyperplane H_ẑ(θ) corresponding to some neuron ẑ in an earlier layer.\nLemma 4 (H_z(θ) as Bent Hyperplanes). Except on a set of θ ∈ ℝ^{#params} of Lebesgue measure 0, the sets H_z(θ_1) corresponding to neurons z from the first hidden layer are hyperplanes in ℝ^nin. Moreover, fix 2 ≤ j ≤ d. Then, for each neuron z in layer j, the set H_z(θ_j) coincides with a single hyperplane in the interior of each activation region of N_{j−1}.\nLemma 4, which follows immediately from the proof of Lemma 7 in Appendix B.1, ensures that in a small ball near any point that does not belong to ∪_z H_z(θ), the collection of bent hyperplanes H_z(θ) looks like an ordinary hyperplane arrangement. 
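The count in (4) can be checked numerically. The short sketch below is our own illustration (not code from the paper): it evaluates the exact formula Σ_{i=0}^{n} (m choose i) and confirms the two asymptotic regimes stated in (4).

```python
from math import comb, factorial

def regions_of_arrangement(m: int, n: int) -> int:
    """Exact number of connected components cut out of R^n by m
    hyperplanes in general position: sum_{i=0}^{n} C(m, i)."""
    return sum(comb(m, i) for i in range(n + 1))

# A single hidden layer of m = 20 neurons acting on n = 2 inputs:
print(regions_of_arrangement(20, 2))  # 1 + 20 + 190 = 211

# Regime m <= n: the sum collapses to 2^m (every sign pattern occurs).
assert regions_of_arrangement(5, 10) == 2 ** 5

# Regime m >> n: the leading term is m^n / n!.
m, n = 1000, 2
assert abs(regions_of_arrangement(m, n) / (m ** n / factorial(n)) - 1) < 0.01
```

For a depth 1 ReLU net this count is deterministic, which is why the variance discussed below vanishes in the shallow case.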
Globally, however, the H_z(θ) can define many more regions than ordinary hyperplane arrangements. This reflects the fact that deep ReLU nets may have many more activation regions than shallow networks with the same number of neurons.\nDespite these different extremal behaviors, we show in Theorem 5 that the average number of activation regions in a random ReLU net enjoys depth-independent upper bounds at initialization. We show experimentally that this holds throughout training as well (see §4). On the other hand, although we do not prove this here, we believe that the effect of depth can be seen through the fluctuations (e.g. the variance), rather than the mean, of the number of activation regions. For instance, for depth 1 ReLU nets, the variance is 0, since for a generic configuration of weights/biases the number of activation regions is constant (see (4)). The variance is strictly positive, however, for deeper networks.\n3 Main Result\n3.1 Formal Statement\nTheorem 5 gives upper bounds on the average number of activation regions per unit volume of input space for a feed-forward ReLU net with random weights/biases. Note that it applies even to highly correlated weight/bias distributions and hence holds throughout training. Also note that although we require no tied weights, there are no further constraints on the connectivity between adjacent layers.\nTheorem 5 (Counting Activation Regions). Let N be a feed-forward ReLU network with no tied weights, input dimension nin, output dimension 1, and random weights/biases satisfying:\n\n1. The distribution of all weights has a density with respect to Lebesgue measure on ℝ^{#weights}.\n2. Every collection of biases has a density with respect to Lebesgue measure conditional on the values of all weights and other biases (for identically zero biases, see Appendix D).\n\n3. 
There exists C_grad > 0 so that for every neuron z and each m ≥ 1, we have\n\nsup_{x ∈ ℝ^nin} E[‖∇z(x)‖^m] ≤ C_grad^m.\n\n4. There exists C_bias > 0 so that for any neurons z_1, ..., z_k, the conditional density ρ_{b_{z_1},...,b_{z_k}} of the biases of these neurons given all the other weights and biases in N satisfies\n\nsup_{b_1,...,b_k ∈ ℝ} ρ_{b_{z_1},...,b_{z_k}}(b_1, ..., b_k) ≤ C_bias^k.\n\nThen, there exist δ_0, T > 0, depending on C_grad, C_bias, with the following property. Suppose that δ ≥ δ_0. Then, for all cubes C with side length δ, we have\n\nE[#non-empty activation regions of N in C] / vol(C) ≤ { (T #neurons)^nin / nin! if #neurons ≥ nin; 2^{#neurons} if #neurons ≤ nin }. (5)\n\nHere, the average is with respect to the distribution of weights and biases in N.\nRemark 1. The heuristic of §3.3 suggests the average number of activation patterns in N over all of ℝ^nin is at most (#neurons)^nin / nin!, its value for depth 1 networks (see (4)). This is confirmed in our experiments (see Figures 3-6).\nWe state and prove a generalization of Theorem 5 in Appendix C. Note that by Theorem 1 (and Proposition 2) in [12], Condition 3 is automatically satisfied by a fully connected depth d ReLU net N with independent weights and biases whose marginals are symmetric around 0 and satisfy Var[weights] = 2/fan-in, with the constant C_grad in Condition 3 depending only on an upper bound for the sum Σ_{j=1}^{d} 1/n_j of the reciprocals of the hidden layer widths of N. For example, if the layers of N have constant width n, then C_grad depends on the depth and width only via the aspect ratio d/n of N, which is small for wide networks. Also, at initialization, when all biases are independent, the constant C_bias can be taken simply to be the maximum of the density of the bias distribution.\nBelow are two heuristics for the bound (5). First, in §3.2 we derive the upper bound (5) via an intuitive geometric argument. 
Then in §3.3, we explain why, at initialization, we expect the upper bounds (5) to have matching, depth-independent lower bounds (to leading order in the number of neurons). This suggests that the average total number of activation regions at initialization should be the same for any two ReLU nets with the same number of neurons (see (4) and Figure 3).\n\n3.2 Geometric Intuition\nWe give an intuitive explanation for the upper bounds in Theorem 5, beginning with the simplest case of a ReLU net N with nin = 1. Activation regions for N are intervals, and at an endpoint x of such an interval the pre-activation of some neuron z in N equals its bias: i.e. z(x) = b_z. Thus,\n\n#activation regions of N in [a, b] ≤ 1 + Σ_{neurons z} #{x ∈ [a, b] | z(x) = b_z}.\n\nGeometrically, the number of solutions to z(x) = b_z for inputs x ∈ I is the number of times the horizontal line y = b_z intersects the graph y = z(x) over x ∈ I. A large number of intersections at a given bias b_z may only occur if the graph of z(x) has many oscillations around that level. Hence, since b_z is random, the graph of z(x) must oscillate many times over a large range on the y axis. This can happen only if the total variation ∫_{x∈I} |z′(x)| dx of z(x) over I is large. Thus, if |z′(x)| is typically of moderate size, we expect only O(1) solutions to z(x) = b_z per unit input length, suggesting\n\nE[#activation regions of N in [a, b]] = O((b − a) · #neurons),\n\nin accordance with Theorem 5 (cf. Theorems 1, 3 in [14]). When nin > 1, the preceding argument shows that the density of 1-dimensional regions per unit length along any 1-dimensional line segment in input space is bounded above by the number of neurons in N. A unit-counting argument therefore suggests that the density of nin-dimensional regions per unit nin-dimensional volume is bounded above by #neurons raised to the input dimension, which is precisely the upper bound in Theorem 5 in the non-trivial regime where #neurons ≥ nin.\n3.3 Is Theorem 5 Sharp?\nTheorem 5 shows that, on average, depth does not increase the local density of activation regions. We give here an intuitive explanation of why this should be the case in wide networks on any fixed subset of input space ℝ^nin. Consider a ReLU net N with random weights/biases, and fix a layer index ℓ ≥ 1. Note that the map x ↦ x^(ℓ−1) from inputs x to the post-activations of layer ℓ − 1 is itself a ReLU net. Note also that in wide networks, the gradients ∇z(x) for different neurons z in the same layer are only weakly correlated (cf. e.g. [17]). Hence, for the purpose of this heuristic, we will assume that the bent hyperplanes H_z(θ) for neurons z in layer ℓ are independent. Consider an activation region R for x ↦ x^(ℓ−1). By definition, in the interior of R, the gradients ∇z(x) for neurons z in layer ℓ are constant, and hence the corresponding bent hyperplane from (2) inside R is the hyperplane {x ∈ R | ⟨∇z, x⟩ = b_z}. This is in keeping with Lemma 4. The 2/fan-in weight normalization ensures that for each x\n\nE[∂_{x_i} z(x) · ∂_{x_j} z(x)] = 2 · δ_{i,j} ⟹ Cov[∇z(x)] = 2 · Id.\n\nSee, for example, equation (17) in [11]. Thus, the covariance matrices of the normal vectors ∇z of the hyperplanes H_z(θ) ∩ R for neurons z in layer ℓ are independent of ℓ! This suggests that, per neuron, the average contribution to the number of activation regions is the same in every layer. 
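The 1-dimensional counting used in this argument (and in our experiments) is straightforward to reproduce. The numpy sketch below is our own illustration, with hypothetical function names: it counts sign-pattern changes along a random line through input space for a shallow and a deep net with the same total number of neurons, both initialized with Var[weights] = 2/fan-in.

```python
import numpy as np

def activation_pattern(x, weights, biases):
    """Sign pattern of all neurons of a fully connected ReLU net at input x."""
    pattern, h = [], x
    for W, b in zip(weights, biases):
        pre = W @ h + b
        pattern.append(pre > 0)
        h = np.maximum(pre, 0.0)
    return np.concatenate(pattern)

def init_net(widths, n_in, rng):
    """He-style initialization: Var[weights] = 2 / fan-in, small random biases."""
    weights, biases, fan_in = [], [], n_in
    for w in widths:
        weights.append(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(w, fan_in)))
        biases.append(rng.normal(0.0, 0.1, size=w))
        fan_in = w
    return weights, biases

def regions_along_line(widths, n_in=2, n_samples=10000, seed=0):
    """Count sign-pattern changes along a random segment; a lower bound
    on the number of 1-D activation regions the segment crosses."""
    rng = np.random.default_rng(seed)
    weights, biases = init_net(widths, n_in, rng)
    a, b = rng.normal(size=n_in), rng.normal(size=n_in)
    ts = np.linspace(-3.0, 3.0, n_samples)
    patterns = [activation_pattern(a + t * (b - a), weights, biases) for t in ts]
    changes = sum(not np.array_equal(p, q) for p, q in zip(patterns, patterns[1:]))
    return changes + 1

# 48 neurons arranged shallow vs. deep: both counts are of the order of
# #neurons, not exponential in depth.
print(regions_along_line([48]), regions_along_line([16, 16, 16]))
```

For the depth 1 net, each pre-activation is affine along the line, so the count is provably at most #neurons + 1; the heuristic above predicts the deep count should be comparable rather than exponentially larger.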
In particular, deep and shallow ReLU nets with the same number of neurons should have the same average number of activation regions (see (4), Remark 1, and Figures 3-6).\n4 Maximizing the Number of Activation Regions\nWhile we have seen in Figure 3 that the number of regions does not strongly increase during training on a simple task, such experiments leave open the possibility that the number of regions would go up markedly if the task were more complicated. Will the number of regions grow to achieve the theoretical upper bound (exponential in the depth) if the task is designed so that having more regions is advantageous? We now investigate this possibility. See Appendix A for experimental details.\n4.1 Memorization\nMemorization tasks on large datasets require learning highly oscillatory functions with large numbers of activation regions. Inspired by the work of Arpit et al. [1], we train on several tasks interpolating between memorization and generalization (see Figure 4), in which a certain fraction of MNIST labels have been randomized. We find that the maximum number of activation regions learned does increase with the amount of noise to be memorized, but only slightly. In no case does the number of activation regions change by more than a small constant factor from its initial value. Next, we train a network to memorize binary labels for random 2D points (see Figure 5). Again, the number of activation regions after training increases slightly with increasing memorization, until the task becomes too hard for the network and training fails altogether. 
Varying the learning rate yields similar results (see Figure 6(a)), suggesting the small increase in activation regions is probably not a result of hyperparameter choice.\n\nFigure 4: Depth 3, width 32 network trained on MNIST with varying levels of label corruption. Activation regions are counted along lines through input space (lines are selected to pass through both the origin and randomly selected MNIST examples), with counts averaged across 100 such lines. Theorem 5 and [14] predict the expected number of regions should be approximately the number of neurons (in this case, 96). Left: average number of regions plotted against epoch. Curves are averaged over 40 independent training runs, with standard deviations shown. Right: average number of regions plotted against average training accuracy. Throughout training the number of regions is well-predicted by our result. There are slightly, but not exponentially, more regions when memorizing more datapoints. See Appendix A for more details.\n\nFigure 5: Depth 3, width 32 fully connected ReLU net trained for 2000 epochs to memorize random 2D points with binary labels. The number of regions predicted by Theorem 5 for such a network is 96^2/2! = 4608. Left: number of regions plotted against epoch. Curves are averaged over 40 independent training runs, with standard deviations shown. Right: #regions plotted against training accuracy. The number of regions increased during training, and increased more for greater amounts of memorization. The exception was the maximum amount of memorization, where the network essentially failed to learn, perhaps because of insufficient capacity. See Appendix A for more details.\n\n4.2 The Effect of Initialization\nWe explore here whether varying the scale of biases and weights at initialization affects the number of activation regions in a ReLU net. 
Note that scaling the biases changes the maximum density of the bias distribution, and thus affects the upper bound on the density of activation regions given in Theorem 5 through the constant C_bias. Larger, more diffuse biases reduce the upper bound, while smaller, more tightly concentrated biases increase it. However, Theorem 5 counts only the local rather than global number of regions. The latter is independent of scaling the biases:\nLemma 6. Let N be a deep ReLU network, and for c > 0 let N_c^bias be the network obtained by multiplying all biases in N by c. Then, N(x) = N_c^bias(cx)/c. Rescaling all biases by the same constant therefore does not change the total number of activation regions.\nIn the extreme case of biases initialized to zero, Theorem 5 does not apply. However, as we explain in Appendix D, zero biases only create fewer activation regions (see Figure 7). We now consider changing the scale of weights at initialization. In [23], it was suggested that initializing the weights of a network with greater variance should increase the number of activation regions. Likewise, the upper bound in Theorem 5 on the density of activation regions increases as gradient norms increase, and it has been shown that increased weight variance increases gradient norms [12]. However, this is again a property of the local, rather than global, number of regions.\nIndeed, for a network N of depth d, write N_c^weight for the network obtained from N by multiplying all its weights by c, and let N_{1/c*}^bias be obtained from N by dividing the biases in the kth layer by c^k.\n\nFigure 6: Depth 3, width 32 network trained to memorize 5000 random 2D points with independent binary labels, for various learning rates and weight scales at initialization. All networks start with ≈ 4608 regions, as predicted by Theorem 5. Left: None of the learning rates gives a number of regions larger than a small constant times the initial value. Learning rate 10^-3, which gives the maximum number of regions, is the learning rate in all other experiments, while 10^-2 is too large and causes learning to fail. Center: Different weight scales at initialization do not strongly affect the number of regions. All weight scales are given relative to variance 2/fan-in. Right: For a given accuracy, the number of regions learned grows with the weight scale at initialization. However, poor initialization impedes high accuracy. See Appendix A for details.\n\nFigure 7: Activation regions within input space, for a network of depth 3 and width 64 trained on MNIST. (a) Cross-section through the origin, shown at initialization, after one epoch, and after twenty epochs. The plane is chosen to pass through two sample points from MNIST, shown as black dots. (b) Cross-section not through the origin, shown at initialization. The plane is chosen to pass through three sample points from MNIST. For discussion of activation regions at zero bias, see Appendix D.\n\nA scaling argument shows that N_c^weight(x) = c^d · N_{1/c*}^bias(x). We therefore conclude that the activation regions of N_c^weight and N_{1/c*}^bias are the same. Thus, scaling the weights uniformly is equivalent to scaling the biases differently for every layer. We have seen from Lemma 6 that scaling the biases uniformly by any amount does not affect the global number of activation regions. Therefore, it makes sense (though we do not prove it) that scaling the weights uniformly should approximately preserve the global number of activation regions. We test this intuition empirically by attempting to memorize points randomly drawn from a 2D input space with arbitrary binary labels for various initializations (see Figure 6). 
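The bias-scaling identity of Lemma 6 is easy to verify numerically. The sketch below is our own illustration (assuming a fully connected net with an affine output layer): it checks N(x) = N_c^bias(cx)/c at a random input, which is the reason uniform bias rescaling cannot change the global number of activation regions.

```python
import numpy as np

def relu_net(x, weights, biases):
    """Fully connected ReLU net; the final layer is affine (no ReLU)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(1)
widths, fan_in = [8, 8, 1], 2
weights, biases = [], []
for w in widths:
    weights.append(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(w, fan_in)))
    biases.append(rng.normal(0.0, 0.5, size=w))
    fan_in = w

c = 3.7
x = rng.normal(size=2)
scaled_biases = [c * b for b in biases]  # N_c^bias: every bias multiplied by c
lhs = relu_net(x, weights, biases)
rhs = relu_net(c * x, weights, scaled_biases) / c
assert np.allclose(lhs, rhs)  # Lemma 6: N(x) = N_c^bias(cx)/c
print(lhs, rhs)
```

The identity holds layer by layer: scaling the input by c scales every pre-activation by c, which the scaled biases track exactly, so the region boundaries are simply dilated by c without changing their number.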
We \ufb01nd that neither at initialization nor during training is the number of activation\nregions strongly dependent on the weight scaling used for initialization.\n\n5 Conclusion\nWe have presented theoretical and empirical evidence that the number of activation regions learned in\npractice by a ReLU network is far from the maximum possible and depends mainly on the number\nof neurons in the network, rather than its depth. This surprising result implies that, at least when\nnetwork gradients and biases are well-behaved (see conditions 3,4 in the statement of Theorem 5), the\npartition of input space learned by a deep ReLU network is not signi\ufb01cantly more complex than that\nof a shallow network with the same number of neurons. We found that this is true even after training\non memorization-based tasks, in which we expect a large number of regions to be advantageous for\n\ufb01tting many randomly labeled inputs. Our results are stated for ReLU nets with no tied weights and\nbiases (and arbitrary connectivity). We believe that analogous results and proofs hold for residual and\nconvolutional networks but have not veri\ufb01ed the technical details.\n\n8\n\n\fReferences\n[1] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio,\nMaxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A\ncloser look at memorization in deep networks. In ICML, 2017.\n\n[2] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NeurIPS, 2014.\n\n[3] Andrew R Barron. Approximation and estimation bounds for arti\ufb01cial neural networks. Machine\n\nlearning, 14(1):115\u2013133, 1994.\n\n[4] Monica Bianchini and Franco Scarselli. On the complexity of neural network classi\ufb01ers: A\ncomparison between shallow and deep architectures. IEEE Transactions on Neural Networks\nand Learning Systems, 25(8):1553\u20131565, 2014.\n\n[5] Nadav Cohen, Or Sharir, and Amnon Shashua. 
On the expressive power of deep learning: A tensor analysis. In COLT, pages 698–728, 2016.

[6] Francesco Croce, Maksym Andriushchenko, and Matthias Hein. Provable robustness of ReLU networks via maximization of linear regions. In AISTATS, 2018.

[7] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[8] Giacomo De Palma, Bobak Toussi Kiani, and Seth Lloyd. Deep neural networks are biased towards simple functions. Preprint arXiv:1812.10156, 2018.

[9] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.

[10] Boris Hanin. Universal function approximation by deep neural nets with bounded width and ReLU activations. Preprint arXiv:1708.02691, 2017.

[11] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In NeurIPS, 2018.

[12] Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Preprint arXiv:1812.05994, 2018.

[13] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In NeurIPS, 2018.

[14] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In ICML, 2019.

[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

[16] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[17] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In ICLR, 2018.

[18] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well?
Journal of Statistical Physics, 168(6):1223–1247, 2017.

[19] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In NeurIPS, 2014.

[20] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In ICLR, 2018.

[21] Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195, 1999.

[22] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS, 2016.

[23] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In ICML, pages 2847–2854, 2017.

[24] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In ICLR, 2018.

[25] Thiago Serra and Srikumar Ramalingam. Empirical bounds on linear regions of deep rectifier networks. Preprint arXiv:1810.03370, 2018.

[26] Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and counting linear regions of deep neural networks. In ICML, 2018.

[27] Richard P Stanley. An introduction to hyperplane arrangements. Geometric Combinatorics, 13:389–496, 2004.

[28] Matus Telgarsky. Representation benefits of deep feedforward networks. Preprint arXiv:1509.08101, 2015.

[29] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.