{"title": "On the Number of Linear Regions of Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2924, "page_last": 2932, "abstract": "We study the complexity of functions computable by deep feedforward neural networks with piecewise linear activations in terms of the symmetries and the number of linear regions that they have. Deep networks are able to sequentially map portions of each layer's input-space to the same output. In this way, deep models compute functions that react equally to complicated patterns of different inputs. The compositional structure of these functions enables them to re-use pieces of computation exponentially often in terms of the network's depth. This paper investigates the complexity of such compositional maps and contributes new theoretical results regarding the advantage of depth for neural networks with piecewise linear activation functions. In particular, our analysis is not specific to a single family of models, and as an example, we employ it for rectifier and maxout networks. We improve complexity bounds from pre-existing work and investigate the behavior of units in higher layers.", "full_text": "On the Number of Linear Regions of\n\nDeep Neural Networks\n\nGuido Montúfar\n\nMax Planck Institute for Mathematics in the Sciences\n\nmontufar@mis.mpg.de\n\nRazvan Pascanu\n\nUniversité de Montréal\n\npascanur@iro.umontreal.ca\n\nKyunghyun Cho\n\nUniversité de Montréal\n\nkyunghyun.cho@umontreal.ca\n\nYoshua Bengio\n\nUniversité de Montréal, CIFAR Fellow\nyoshua.bengio@umontreal.ca\n\nAbstract\n\nWe study the complexity of functions computable by deep feedforward neural networks\nwith piecewise linear activations in terms of the symmetries and the number of linear\nregions that they have. Deep networks are able to sequentially map portions of each\nlayer\u2019s input-space to the same output. 
In this way, deep models compute functions\nthat react equally to complicated patterns of different inputs. The compositional\nstructure of these functions enables them to re-use pieces of computation exponentially\noften in terms of the network\u2019s depth. This paper investigates the complexity of such\ncompositional maps and contributes new theoretical results regarding the advantage\nof depth for neural networks with piecewise linear activation functions. In particular,\nour analysis is not specific to a single family of models, and as an example, we employ\nit for rectifier and maxout networks. We improve complexity bounds from pre-existing\nwork and investigate the behavior of units in higher layers.\nKeywords: Deep learning, neural network, input space partition, rectifier, maxout\n\n1 Introduction\n\nArtificial neural networks with several hidden layers, called deep neural networks, have become popular\ndue to their unprecedented success in a variety of machine learning tasks (see, e.g., Krizhevsky et al.\n2012, Ciresan et al. 2012, Goodfellow et al. 2013, Hinton et al. 2012). In view of this empirical evidence,\ndeep neural networks are becoming increasingly favored over shallow networks (i.e., with a single layer\nof hidden units), and are often implemented with more than five layers. At the time being, however, the\ntheory of deep networks still poses many questions. Recently, Delalleau and Bengio (2011) showed that\na shallow network requires exponentially many more sum-product hidden units1 than a deep sum-product\nnetwork in order to compute certain families of polynomials. We are interested in extending this kind\nof analysis to more popular neural networks, such as those with maxout and rectifier units.\nThere is a wealth of literature discussing approximation, estimation, and complexity of artificial neural\nnetworks (see, e.g., Anthony and Bartlett 1999). 
A well-known result states that a feedforward neural\nnetwork with a single, huge, hidden layer is a universal approximator of Borel measurable functions (see\nHornik et al. 1989, Cybenko 1989). Other works have investigated universal approximation of probability\ndistributions by deep belief networks (Le Roux and Bengio 2010, Mont\u00b4ufar and Ay 2011), as well as\ntheir approximation properties (Mont\u00b4ufar 2014, Krause et al. 2013).\nThese previous theoretical results, however, do not trivially apply to the types of deep neural networks\nthat have seen success in recent years. Conventional neural networks often employ either hidden units\n\n1A single sum-product hidden layer summarizes a layer of product units followed by a layer of sum units.\n\n1\n\n\fFigure 1: Binary classification using a shallow model with 20 hidden units (solid line) and a deep model\nwith two layers of 10 units each (dashed line). The right panel shows a close-up of the left panel. Filled\nmarkers indicate errors made by the shallow model.\n\nwith a bounded smooth activation function, or Boolean hidden units. On the other hand, recently it has\nbecome more common to use piecewise linear functions, such as the rectifier activation g(a) = max{0, a}\n(Glorot et al. 2011, Nair and Hinton 2010) or the maxout activation g(a1, . . . , ak) = max{a1, . . . , ak}\n(Goodfellow et al. 2013). The practical success of deep neural networks with piecewise linear units calls\nfor the theoretical analysis specific for this type of neural networks.\nIn this respect, Pascanu et al. (2013) reported a theoretical result on the complexity of functions computable\nby deep feedforward networks with rectifier units. They showed that, in the asymptotic limit of many\nhidden layers, deep networks are able to separate their input space into exponentially more linear response\nregions than their shallow counterparts, despite using the same number of computational units.\nBuilding on the ideas from Pascanu et al. 
(2013), we develop a general framework for analyzing deep\nmodels with piecewise linear activations. We describe how the intermediary layers of these models\nare able to map several pieces of their inputs into the same output. The layer-wise composition of the\nfunctions computed in this way re-uses low-level computations exponentially often as the number of\nlayers increases. This key property enables deep networks to compute highly complex and structured\nfunctions. We underpin this idea by estimating the number of linear regions of functions computable by\ntwo important types of piecewise linear networks: with rectifier units and with maxout units. Our results\nfor the complexity of deep rectifier networks yield a significant improvement over the previous results\non rectifier networks mentioned above, showing a favorable behavior of deep over shallow networks even\nwith a moderate number of hidden layers. Furthermore, our analysis of deep rectifier and maxout networks\nprovides a platform to study a broad variety of related networks, such as convolutional networks.\nThe number of linear regions of the functions that can be computed by a given model is a measure of the\nmodel\u2019s flexibility. An example of this is given in Fig. 1, which compares the learned decision boundary of a\nsingle-layer and a two-layer model with the same number of hidden units (see details in the Supplementary\nMaterial). This illustrates the advantage of depth; the deep model captures the desired boundary more\naccurately, approximating it with a larger number of linear pieces. As noted earlier, deep networks are able\nto identify an exponential number of input neighborhoods by mapping them to a common output of some\nintermediary hidden layer. The computations carried out on the activations of this intermediary layer are\nreplicated many times, once in each of the identified neighborhoods. 
This allows the networks to compute\nvery complex looking functions even when they are defined with relatively few parameters. The number\nof parameters is an upper bound for the dimension of the set of functions computable by a network, and\na small number of parameters means that the class of computable functions has a low dimension. The\nset of functions computable by a deep feedforward piecewise linear network, although low dimensional,\nachieves exponential complexity by re-using and composing features from layer to layer.\n\n2 Feedforward Neural Networks and their Compositional Properties\n\nIn this section we discuss the ability of deep feedforward networks to re-map their input-space to create\ncomplex symmetries by using only relatively few computational units. The key observation of our analysis\nis that each layer of a deep model is able to map different regions of its input to a common output. This\nleads to a compositional structure, where computations on higher layers are effectively replicated in all\ninput regions that produced the same output at a given layer. The capacity to replicate computations over\nthe input-space grows exponentially with the number of network layers. Before expanding these ideas, we\nintroduce basic definitions needed in the rest of the paper. At the end of this section, we give an intuitive\nperspective for reasoning about the replicative capacity of deep models.\n\n2\n\n\f2.1 Definitions\n\nA feedforward neural network is a composition of layers of computational units which defines a function\nF : Rn0 \u2192 Rout of the form\n\nF (x; \u03b8) = fout \u25e6 gL \u25e6 fL \u25e6 \u00b7\u00b7\u00b7 \u25e6 g1 \u25e6 f1(x),\n\n(1)\nwhere fl is a linear preactivation function and gl is a nonlinear activation function. 
The parameter \u03b8 is\ncomposed of input weight matrices Wl \u2208 Rk\u00b7nl\u00d7nl\u22121 and bias vectors bl \u2208 Rk\u00b7nl for each layer l \u2208 [L].\nThe output of the l-th layer is a vector xl = [xl,1, . . . , xl,nl]\u22a4 of activations xl,i of the units i \u2208 [nl] in\nthat layer. This is computed from the activations of the preceding layer by xl = gl(fl(xl\u22121)). Given the\nactivations xl\u22121 of the units in the (l \u2212 1)-th layer, the preactivation of layer l is given by\n\nfl(xl\u22121) = Wlxl\u22121 + bl,\n\nwhere fl = [fl,1, . . . , fl,nl]\u22a4 is an array composed of nl preactivation vectors fl,i \u2208 Rk, and the activation\nof the i-th unit in the l-th layer is given by\n\nxl,i = gl,i(fl,i(xl\u22121)).\n\nWe will abbreviate gl \u25e6 fl by hl. When the layer index l is clear, we will drop the corresponding subscript.\nWe are interested in piecewise linear activations, and will consider the following two important types.\n\n\u2022 Rectifier unit: gi(fi) = max{0, fi}, where fi \u2208 R and k = 1.\n\u2022 Rank-k maxout unit: gi(fi) = max{fi,1, . . . , fi,k}, where fi = [fi,1, . . . , fi,k] \u2208 Rk.\n\nThe structure of the network refers to the way its units are arranged. It is specified by the number n0 of\ninput dimensions, the number of layers L, and the number of units or width nl of each layer.\nWe will classify the functions computed by different network structures, for different choices of parameters,\nin terms of their number of linear regions. A linear region of a piecewise linear function F : Rn0 \u2192 Rm\nis a maximal connected subset of the input-space Rn0, on which F is linear. For the functions that we\nconsider, each linear region has full dimension, n0.\n\n2.2 Shallow Neural Networks\n\nRectifier units have two types of behavior; they can be either constant 0 or linear, depending on their\ninputs. 
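As an illustrative aside (not part of the original text), the two unit types defined above can be sketched numerically. This is a minimal Python sketch following the preactivation fl(x) = Wx + b and the unit-wise maxima; the helper names `rectifier_layer` and `maxout_layer` are ours, not the paper's:

```python
import numpy as np

def rectifier_layer(W, b, x):
    """Rectifier layer: g(f(x)) = max{0, Wx + b}, elementwise (rank k = 1)."""
    return np.maximum(0.0, W @ x + b)

def maxout_layer(W, b, x, k):
    """Rank-k maxout layer: groups of k preactivations reduced by max.
    W has shape (m*k, n), b has shape (m*k,); the output has m entries."""
    z = W @ x + b
    return z.reshape(-1, k).max(axis=1)

# A rank-2 maxout unit whose two preactivations are x and -x computes |x|:
W = np.array([[1.0], [-1.0]])
b = np.zeros(2)
out = maxout_layer(W, b, np.array([-3.0]), k=2)  # |-3| = 3
```

Note how this rank-2 maxout unit recovers the absolute value used later in Example 2.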
The boundary between these two behaviors is given by a hyperplane, and the collection of all\nthe hyperplanes coming from all units in a rectifier layer forms a hyperplane arrangement. In general,\nif the activation function g : R \u2192 R has a distinguished (i.e., irregular) behavior at zero (e.g., an inflection\npoint or non-linearity), then the function Rn0 \u2192 Rn1; x \u21a6 g(Wx + b) has a distinguished behavior at\nall inputs from any of the hyperplanes Hi := {x \u2208 Rn0 : Wi,:x + bi = 0} for i \u2208 [n1]. The hyperplanes\ncapturing this distinguished behavior also form a hyperplane arrangement (see, e.g., Pascanu et al. 2013).\nThe hyperplanes in the arrangement split the input-space into several regions. Formally, a region of a\nhyperplane arrangement {H1, . . . , Hn1} is a connected component of the complement Rn0 \\ (\u222aiHi),\ni.e., a set of points delimited by these hyperplanes (possibly open towards infinity). The number of regions\nof an arrangement can be given in terms of a characteristic function of the arrangement, as shown in a\nwell-known result by Zaslavsky (1975). An arrangement of n1 hyperplanes in Rn0 has at most\n\u2211_{j=0}^{n0} (n1 choose j) regions. Furthermore, this number of regions is attained if and only if the hyperplanes are in general\nposition. This implies that the maximal number of linear regions of functions computed by a shallow\nrectifier network with n0 inputs and n1 hidden units is \u2211_{j=0}^{n0} (n1 choose j) (see Pascanu et al. 2013; Proposition 5).\n\n2.3 Deep Neural Networks\n\nWe start by defining the identification of input neighborhoods mentioned in the introduction more formally:\nDefinition 1. A map F identifies two neighborhoods S and T of its input domain if it maps them to a common\nsubset F (S) = F (T ) of its output domain. In this case we also say that S and T are identified by F.\nExample 2. 
The four quadrants of 2-D Euclidean space are regions that are identified by the absolute\nvalue function g : R2 \u2192 R2; (x1, x2) \u21a6 [|x1|, |x2|]\u22a4.\n\n3\n\n\fFigure 2: (a) Space folding of 2-D Euclidean space along the two coordinate axes. (b) An illustration of\nhow the top-level partitioning (on the right) is replicated to the original input space (left). (c) Identification\nof regions across the layers of a deep model.\n\nThe computation carried out by the l-th layer of a feedforward network on a set of activations from the\n(l \u2212 1)-th layer is effectively carried out for all regions of the input space that lead to the same activations\nof the (l \u2212 1)-th layer. One can choose the input weights and biases of a given layer in such a way that\nthe computed function behaves most interestingly on those activation values of the preceding layer which\nhave the largest number of preimages in the input space, thus replicating the interesting computation many\ntimes in the input space and generating an overall complicated-looking function.\nFor any given choice of the network parameters, each hidden layer l computes a function hl = gl \u25e6 fl on\nthe output activations of the preceding layer. We consider the function Fl : Rn0 \u2192 Rnl; Fl := hl \u25e6 \u00b7\u00b7\u00b7 \u25e6 h1\nthat computes the activations of the l-th hidden layer. We denote the image of Fl by Sl \u2286 Rnl, i.e., the\nset of (vector valued) activations reachable by the l-th layer for all possible inputs. Given a subset R \u2286 Sl,\nwe denote by P^l_R the set of subsets \u00afR1, . . . , \u00afRk \u2286 Sl\u22121 that are mapped by hl onto R; that is, subsets\nthat satisfy hl( \u00afR1) = \u00b7\u00b7\u00b7 = hl( \u00afRk) = R. See Fig. 
2 for an illustration.\nThe number of separate input-space neighborhoods that are mapped to a common neighborhood\nR \u2286 Sl \u2286 Rnl can be given recursively as\n\nN^l_R = \u2211_{R\u2032 \u2208 P^l_R} N^{l\u22121}_{R\u2032}, N^0_R = 1, for each region R \u2286 Rn0. (2)\n\nFor example, P^1_R is the set of all disjoint input-space neighborhoods whose image by the function\ncomputed by the first layer, h1 : x \u21a6 g(Wx + b), equals R \u2286 S1 \u2286 Rn1.\nThe recursive formula (2) counts the number of identified sets by moving along the branches of a tree\nrooted at the set R of the j-th layer\u2019s output-space (see Fig. 2 (c)). Based on these observations, we can\nestimate the maximal number of linear regions as follows.\nLemma 3. The maximal number of linear regions of the functions computed by an L-layer neural network\nwith piecewise linear activations is at least N = \u2211_{R \u2208 P^L} N^{L\u22121}_R, where N^{L\u22121}_R is defined by Eq. (2), and\nP^L is a set of neighborhoods in distinct linear regions of the function computed by the last hidden layer.\nHere, the idea to construct a function with many linear regions is to use the first L \u2212 1 hidden layers to\nidentify many input-space neighborhoods, mapping all of them to the activation neighborhoods P^L of\nthe (L \u2212 1)-th hidden layer, each of which belongs to a distinct linear region of the last hidden layer. We\nwill follow this strategy in Secs. 3 and 4, where we analyze rectifier and maxout networks in detail.\n\n2.4 Identification of Inputs as Space Foldings\n\nIn this section, we discuss an intuition behind Lemma 3 in terms of space folding. 
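Before turning to the folding intuition, the counting in Eq. (2) and Lemma 3 can be sketched numerically for the special case where every neighborhood has the same number of preimage subsets at each layer. This uniform-fanout setting is our own simplification for illustration, not a construction from the paper:

```python
def count_identified(fanout_per_layer, final_regions):
    """Evaluate recursion (2) under uniform fanout: N^0_R = 1, and each
    neighborhood R in layer l's image has fanout_per_layer[l-1] preimage
    subsets, so N^{l}_R = fanout * N^{l-1}. Lemma 3 then sums N^{L-1}_R
    over the linear regions of the last hidden layer."""
    n = 1
    for f in fanout_per_layer:  # layers 1 .. L-1
        n *= f
    return n * final_regions

# Two folding layers, each identifying 4 quadrants (like |.| on R^2),
# followed by a last layer whose function has 5 linear regions:
total = count_identified([4, 4], 5)  # 4 * 4 * 5 = 80
```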
A map F that identifies\ntwo subsets S and S\u2032 can be considered as an operator that folds its domain in such a way that the two\nsubsets S and S\u2032 coincide and are mapped to the same output.\n\n4\n\n\fFigure 3: Space folding of 2-D space in a non-trivial way. Note how the folding can potentially identify\nsymmetries in the boundary that it needs to learn.\n\nFor instance, the absolute value function\ng : R2 \u2192 R2 from Example 2 folds its domain twice (once along each coordinate axis), as illustrated\nin Fig. 2 (a). This folding identifies the four quadrants of 2-D Euclidean space. By composing such\noperations, the same kind of map can be applied again to the output, in order to re-fold the first folding.\nEach hidden layer of a deep neural network can be associated with a folding operator. Each hidden layer\nfolds the space of activations of the previous layer. In turn, a deep neural network effectively folds its\ninput-space recursively, starting with the first layer. The consequence of this recursive folding is that\nany function computed on the final folded space will apply to all the collapsed subsets identified by the\nmap corresponding to the succession of foldings. This means that in a deep model any partitioning of\nthe last layer\u2019s image-space is replicated in all input-space regions which are identified by the succession\nof foldings. Fig. 2 (b) offers an illustration of this replication property.\nSpace foldings are not restricted to foldings along coordinate axes and they do not have to preserve lengths.\nInstead, the space is folded depending on the orientations and shifts encoded in the input weights W and\nbiases b and on the nonlinear activation function used at each hidden layer. 
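This recursive re-folding can be checked in a few lines. The following is our own illustration (not from the paper), composing the coordinate-wise absolute value of Example 2 with a shifted second fold:

```python
import numpy as np

def fold_abs(v):
    """First layer: coordinate-wise |.| (Example 2) folds R^2 onto the
    positive quadrant, identifying the four sign patterns of the input."""
    return np.abs(v)

def fold_shifted(v):
    """Second layer: coordinate-wise |. - 1| re-folds the first fold."""
    return np.abs(v - 1.0)

# Per coordinate, the composition maps -1.5, -0.5, 0.5 and 1.5 to the same
# value, so 4 x 4 = 16 points of R^2 land on one common output:
pts = [np.array([a, b]) for a in (-1.5, -0.5, 0.5, 1.5)
                        for b in (-1.5, -0.5, 0.5, 1.5)]
images = [fold_shifted(fold_abs(p)) for p in pts]
same = all(np.allclose(images[0], im) for im in images)
```

All 16 sample points are identified by the two foldings, matching the replication picture of Fig. 2 (b).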
In particular, this means that the\nsizes and orientations of identified input-space regions may differ from each other. See Fig. 3. In the case\nof activation functions which are not piecewise linear, the folding operations may be even more complex.\n\n2.5 Stability to Perturbation\n\nOur bounds on the complexity attainable by deep models (Secs. 3 and 4) are based on suitable choices\nof the network weights. However, this does not mean that the indicated complexity is only attainable\nin singular cases. The parametrization of the functions computed by a neural network is continuous.\nMore precisely, the map \u03c8 : RN \u2192 C(Rn0; RnL); \u03b8 \u21a6 F\u03b8, which maps input weights and biases\n\u03b8 = {Wi, bi}^L_{i=1} to the continuous functions F\u03b8 : Rn0 \u2192 RnL computed by the network, is continuous.\nOur analysis considers the number of linear regions of the functions F\u03b8. By definition, each linear region\ncontains an open neighborhood of the input-space Rn0. Given any function F\u03b8 with a finite number\nof linear regions, there is an \u03b5 > 0 such that for each \u03b5-perturbation of the parameter \u03b8, the resulting\nfunction has at least as many linear regions as F\u03b8. The linear regions of F\u03b8 are preserved under\nsmall perturbations of the parameters, because they have a finite volume.\nIf we define a probability density on the space of parameters, what is the probability of the event that\nthe function represented by the network has a given number of linear regions? By the above discussion,\nthe probability of getting a number of regions at least as large as the number resulting from any particular\nchoice of parameters (for a uniform measure within a bounded domain) is nonzero, even though it may be\nvery small. This is because there exists an \u03b5-ball of non-zero volume around that particular choice of\nparameters, for which at least the same number of linear regions is attained. 
For example, shallow rectifier\nnetworks generically attain the maximal number of regions, even if in close vicinity of any parameter\nchoice there may be parameters corresponding to functions with very few regions.\nFor future work it would be interesting to study the partitions of parameter space RN into pieces where\nthe resulting functions partition their input-spaces into isomorphic linear regions, and to investigate how\nmany of these pieces of parameter space correspond to functions with a given number of linear regions.\n\n2.6 Empirical Evaluation of Folding in Rectifier MLPs\n\nWe empirically examined the behavior of a trained MLP to see if it folds the input-space in the way described\nabove. First, we note that tracing the activation of each hidden unit in this model gives a piecewise linear\nmap Rn0 \u2192 R (from inputs to activation values of that unit). Hence, we can analyze the behavior of each\nunit by visualizing the different weight matrices corresponding to the different linear pieces of this map.\n\n5\n\n\fFigure 4: Folding of the real line into equal-length segments by a sum of rectifiers.\n\nThe\nweight matrix of one piece of this map can be found by tracking the linear piece used in each intermediary\nlayer, starting from an input example. This visualization technique, a byproduct of our theoretical analysis,\nis similar to the one proposed by Zeiler and Fergus (2013), but is motivated by a different perspective.\nAfter computing the activations of an intermediary hidden unit for each training example, we can, for\ninstance, inspect two examples that result in similar levels of activation for a hidden unit. With the linear\nmaps of the hidden unit corresponding to the two examples we perturb one of the examples until it results\nin exactly the same activation. These two inputs then can be safely considered as points in two regions\nidentified by the hidden unit. 
In the Supplementary Material we provide details and examples of this\nvisualization technique. We also show inputs identified by a deep MLP.\n\n3 Deep Rectifier Networks\n\nIn this section we analyze deep neural networks with rectifier units, based on the general observations\nfrom Sec. 2. We improve upon the results by Pascanu et al. (2013), with a tighter lower-bound on the\nmaximal number of linear regions of functions computable by deep rectifier networks. First, let us note the\nfollowing upper-bound, which follows directly from the fact that each linear region of a rectifier network\ncorresponds to a pattern of hidden units being active:\nProposition 4. The maximal number of linear regions of the functions computed by any rectifier network\nwith a total of N hidden units is bounded from above by 2^N.\n\n3.1 Illustration of the Construction\nConsider a layer of n rectifiers with n0 input variables, where n \u2265 n0. We partition the set of rectifier\nunits into n0 (non-overlapping) subsets of cardinality p = \u230an/n0\u230b and ignore the remainder units. Consider\nthe units in the j-th subset. We can choose their input weights and biases such that\n\nh1(x) = max{0, wx},\nh2(x) = max{0, 2wx \u2212 1},\nh3(x) = max{0, 2wx \u2212 2},\n...\nhp(x) = max{0, 2wx \u2212 (p \u2212 1)},\n\nwhere w is a row vector with j-th entry equal to 1 and all other entries set to 0. The product wx selects\nthe j-th coordinate of x. Adding these rectifiers with alternating signs, we obtain the following scalar function:\n\n\u02dchj(x) = [1, \u22121, 1, . . . , (\u22121)^{p\u22121}] [h1(x), h2(x), h3(x), . . . , hp(x)]\u22a4. (3)\n\nSince \u02dchj acts only on the j-th input coordinate, we may redefine it to take a scalar input, namely the\nj-th coordinate of x. This function has p linear regions given by the intervals (\u2212\u221e, 0], [0, 1], [1, 2],\n. . . , [p \u2212 1, \u221e). 
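The alternating sum in Eq. (3) can be checked numerically. The following is our own sketch for the scalar case (w = 1), verifying that several distinct inputs are mapped by the fold to a common output value:

```python
def h_tilde(x, p):
    """Alternating sum of the p rectifiers from the construction above,
    scalar case: h1 = max(0, x), h_i = max(0, 2x - (i-1)) for i >= 2."""
    hs = [max(0.0, x)] + [max(0.0, 2.0 * x - i) for i in range(1, p)]
    signs = [(-1.0) ** i for i in range(p)]  # +1, -1, +1, ...
    return sum(s * h for s, h in zip(signs, hs))

# Three inputs in different linear pieces of the fold share one image:
p = 3
vals = [h_tilde(0.25, p), h_tilde(0.75, p), h_tilde(1.25, p)]  # all equal
```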
Each of these intervals has a subset that is mapped by \u02dchj onto the interval (0, 1), as\nillustrated in Fig. 4. The function \u02dchj identifies the input-space strips with j-th coordinate xj restricted to\nthe intervals (0, 1), (1, 2), . . . , (p \u2212 1, p). Consider now all the n0 subsets of rectifiers and the function \u02dch = [\u02dch1, \u02dch2, . . . , \u02dchn0]\u22a4. This function is locally symmetric about each hyperplane with a fixed j-th coordinate\nequal to xj = 1, . . . , xj = p \u2212 1 (vertical lines in Fig. 4), for all j = 1, . . . , n0. Note the periodic pattern\nthat emerges. In fact, the function \u02dch identifies a total of p^{n0} hypercubes delimited by these hyperplanes.\nNow, note that \u02dch arises from h by composition with a linear function (alternating sums). This linear\nfunction can be effectively absorbed in the preactivation function of the next layer. Hence we can treat \u02dch as\nbeing the function computed by the current layer. Computations by deeper layers, as functions of the unit\nhypercube output of this rectifier layer, are replicated on each of the p^{n0} identified input-space hypercubes.\n\n3.2 Formal Result\n\nWe can generalize the construction described above to the case of a deep rectifier network with n0 inputs\nand L hidden layers of widths ni \u2265 n0 for all i \u2208 [L]. We obtain the following lower bound for the\nmaximal number of linear regions of deep rectifier networks:\nTheorem 5. The maximal number of linear regions of the functions computed by a neural network with\nn0 input units and L hidden layers, with ni \u2265 n0 rectifiers at the i-th layer, is lower bounded by\n\n(\u220f_{i=1}^{L\u22121} \u230ani/n0\u230b^{n0}) \u2211_{j=0}^{n0} (nL choose j).\n\nThe next corollary gives an expression for the asymptotic behavior of these bounds. 
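The bound of Theorem 5 can be evaluated directly. Here is our own sketch comparing it against Zaslavsky's bound for a shallow network with the same total number of hidden units (helper names are ours):

```python
from math import comb

def deep_rectifier_lower_bound(n0, widths):
    """Theorem 5: (prod_{i=1}^{L-1} floor(n_i/n_0)^{n_0}) * sum_{j=0}^{n_0} C(n_L, j)."""
    *hidden, last = widths
    prod = 1
    for ni in hidden:
        prod *= (ni // n0) ** n0
    return prod * sum(comb(last, j) for j in range(n0 + 1))

def shallow_rectifier_max_regions(n0, total_units):
    """Maximal regions of a single hidden layer with the same unit budget."""
    return sum(comb(total_units, j) for j in range(n0 + 1))

# L = 3 layers of width 4 on n0 = 2 inputs vs. 12 units in one layer:
deep = deep_rectifier_lower_bound(2, [4, 4, 4])   # 2^2 * 2^2 * (1+4+6) = 176
shallow = shallow_rectifier_max_regions(2, 12)    # 1 + 12 + 66 = 79
```

Already at this small size the deep lower bound (176) exceeds the shallow maximum (79).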
Assuming that\nn0 = O(1) and ni = n for all i \u2265 1, the number of regions of a single layer model with Ln hidden units\nbehaves as O(L^{n0} n^{n0}) (see Pascanu et al. 2013; Proposition 10). For a deep model, Theorem 5 implies:\nCorollary 6. A rectifier neural network with n0 input units and L hidden layers of width n \u2265 n0 can\ncompute functions that have \u2126((n/n0)^{(L\u22121)n0} n^{n0}) linear regions.\nThus we see that the number of linear regions of deep models grows exponentially in L and polynomially\nin n, which is much faster than that of shallow models with nL hidden units. Our result is a significant\nimprovement over the bound \u2126((n/n0)^{L\u22121} n^{n0}) obtained by Pascanu et al. (2013). In particular, our\nresult demonstrates that even for small values of L and n, deep rectifier models are able to produce\nsubstantially more linear regions than shallow rectifier models. Additionally, using the same strategy\nas Pascanu et al. (2013), our result can be reformulated in terms of the number of linear regions per\nparameter. This results in a similar behavior, with deep models being exponentially more efficient than\nshallow models (see the Supplementary Material).\n\n4 Deep Maxout Networks\n\nA maxout network is a feedforward network with layers defined as follows:\nDefinition 7. A rank-k maxout layer with n input and m output units is defined by a preactivation function\nof the form f : Rn \u2192 Rm\u00b7k; f(x) = Wx + b, with input and bias weights W \u2208 Rm\u00b7k\u00d7n, b \u2208 Rm\u00b7k, and\nactivations of the form gj(z) = max{z_{(j\u22121)k+1}, . . . , z_{jk}} for all j \u2208 [m]. The layer computes a function\n\ng \u25e6 f : Rn \u2192 Rm; x \u21a6 [max{f1(x), . . . , fk(x)}, . . . , max{f_{(m\u22121)k+1}(x), . . . , f_{mk}(x)}]\u22a4. (4)\n\nSince the maximum of two convex functions is convex, maxout units and maxout layers compute convex\nfunctions. 
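As a small numerical aside (our own sketch, not from the paper), a rank-k maxout unit per Definition 7 is the pointwise maximum of its k affine preactivations, and its convexity can be spot-checked at midpoints:

```python
import numpy as np

# Rank-3 maxout unit (m = 1) on R with three affine preactivations:
# f1(x) = -x, f2(x) = 0.5, f3(x) = x.
W = np.array([[-1.0], [0.0], [1.0]])
b = np.array([0.0, 0.5, 0.0])

def maxout(x):
    """Eq. (4) for m = 1: the maximum of the k = 3 preactivations."""
    return float(np.max(W @ np.array([x]) + b))

# Midpoint convexity check: g((a+c)/2) <= (g(a) + g(c)) / 2.
pairs = [(-2.0, 1.0), (-0.5, 2.0), (0.0, 3.0)]
ok = all(maxout((a + c) / 2) <= (maxout(a) + maxout(c)) / 2 + 1e-12
         for a, c in pairs)
```

This unit has three linear regions, one per preactivation, with boundaries at x = -0.5 and x = 0.5.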
The maximum of a collection of functions is called their upper envelope. We can view the graph\nof each linear function fi : Rn \u2192 R as a supporting hyperplane of a convex set in (n + 1)-dimensional\nspace. In particular, if each fi, i \u2208 [k] is the unique maximizer fi = max{f_{i\u2032} : i\u2032 \u2208 [k]} at some input\nneighborhood, then the number of linear regions of the upper envelope g1 \u25e6 f = max{fi : i \u2208 [k]} is\nexactly k. This shows that the maximal number of linear regions of a maxout unit is equal to its rank.\nThe linear regions of the maxout layer are the intersections of the linear regions of the individual maxout\nunits. In order to obtain the number of linear regions for the layer, we need to describe the structure of\nthe linear regions of each maxout unit, and study their possible intersections. Voronoi diagrams can be\nlifted to upper envelopes of linear functions, and hence they describe input-space partitions generated\nby maxout units. Now, how many regions do we obtain by intersecting the regions of m Voronoi diagrams\nwith k regions each? Computing the intersections of Voronoi diagrams is not easy, in general. A trivial\nupper bound for the number of linear regions is k^m, which corresponds to the case where all intersections\nof regions of different units are different from each other. We will give a better bound in Proposition 8.\nNow, for the purpose of computing lower bounds, here it will be sufficient to consider certain well-behaved\nspecial cases. One simple example is the division of input-space by k \u2212 1 parallel hyperplanes. If m \u2264 n, we\ncan consider the arrangement of hyperplanes Hi = {x \u2208 Rn : xj = i} for i = 1, . . . , k \u2212 1, for each maxout\nunit j \u2208 [m]. In this case, the number of regions is k^m. If m > n, the same arguments yield k^n regions.\nProposition 8. 
The maximal number of regions of a single layer maxout network with n inputs and m\noutputs of rank k is lower bounded by k^{min{n,m}} and upper bounded by min{\u2211_{j=0}^{n} (k^2m choose j), k^m}.\nNow we take a look at the deep maxout model. Note that a rank-2 maxout layer can be simulated by a\nrectifier layer with twice as many units. Then, by the results from the last section, a rank-2 maxout network\nwith L \u2212 1 hidden layers of width n = n0 can identify 2^{n0(L\u22121)} input-space regions, and, in turn, it can\ncompute functions with 2^{n0(L\u22121)} \u00b7 2^{n0} = 2^{n0L} linear regions. For the rank-k case, we note that a rank-k\nmaxout unit can identify k cones from its input-domain, whereby each cone is a neighborhood of the\npositive half-ray {rWi \u2208 Rn : r \u2208 R+} corresponding to the gradient Wi of the linear function fi for\nall i \u2208 [k]. Elaborating this observation, we obtain:\nTheorem 9. A maxout network with L layers of width n0 and rank k can compute functions with at least\nk^{L\u22121} k^{n0} linear regions.\nTheorem 9 and Proposition 8 show that deep maxout networks can compute functions with a number of\nlinear regions that grows exponentially with the number of layers, and exponentially faster than the maximal\nnumber of regions of shallow models with the same number of units. Similarly to the rectifier model, this\nexponential behavior can also be established with respect to the number of network parameters. We note\nthat although certain functions that can be computed by maxout layers can also be computed by rectifier\nlayers, the rectifier construction from the last section leads to functions that are not computable by maxout\nnetworks (except in the rank-2 case). The proof of Theorem 9 is based on the same general arguments\nfrom Sec. 
2, but uses a different construction than Theorem 5 (details in the Supplementary Material).

5 Conclusions and Outlook

We studied the complexity of functions computable by deep feedforward neural networks in terms of their number of linear regions. We specifically focused on deep neural networks with piecewise linear hidden units, which have recently been found to provide superior performance in many machine learning applications. We discussed the idea that each layer of a deep model is able to identify pieces of its input in such a way that the composition of layers identifies an exponential number of input regions. This results in exponentially replicating the complexity of the functions computed in the higher layers. The functions computed in this way by deep models are complicated, but still have an intrinsic rigidity caused by the replications, which may help deep models generalize to unseen samples better than shallow models.

This framework is applicable to any neural network with a piecewise linear activation function. For example, if we consider a convolutional network with rectifier units, such as the one used by Krizhevsky et al. (2012), we can see that the convolution followed by max pooling at each layer identifies all patches of the input within a pooling region. This lets such a deep convolutional neural network recursively identify patches of the images of lower layers, resulting in exponentially many linear regions of the input space.

The structure of the linear regions depends on the type of units, e.g., hyperplane arrangements for shallow rectifier networks vs. Voronoi diagrams for shallow maxout networks. The pros and cons of each type of constraint will likely depend on the task and are not easily quantifiable at this point. As for the number of regions, in both maxout and rectifier networks we obtain an exponential increase with depth.
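As an illustrative sanity check (not part of the analysis above), the region counts discussed here can be observed numerically: a linear region of a maxout layer is labelled by the tuple of indices of the active linear pieces of its units, so counting distinct tuples over a dense input grid recovers the number of regions. The sketch below is a hypothetical NumPy construction (the function name `active_pieces` and the parameter choices n = m = 2, k = 3 are ours) realizing the parallel-hyperplane arrangement used for the lower bound, which yields k^{min{n, m}} regions; it also checks the algebraic identity max(a, b) = b + max(a − b, 0) underlying the simulation of rank-2 maxout units by rectifier units.

```python
import numpy as np
from itertools import product

def active_pieces(x, j, k):
    """Rank-k maxout unit whose upper-envelope breakpoints lie on the
    parallel hyperplanes {x : x_j = 1}, ..., {x : x_j = k-1}.
    Returns the index of the maximizing linear piece at each input."""
    i = np.arange(k)                   # slopes 0, 1, ..., k-1
    b = -i * (i + 1) / 2.0             # offsets placing the breaks at x_j = 1, ..., k-1
    return np.argmax(np.outer(x[:, j], i) + b, axis=1)

n, m, k = 2, 2, 3                      # inputs, maxout units, rank
axis = np.linspace(-0.5, k - 0.5, 200) # grid covering all breakpoints
grid = np.array(list(product(axis, repeat=n)))

# a linear region of the layer corresponds to a tuple of active pieces
patterns = np.stack([active_pieces(grid, j, k) for j in range(m)], axis=1)
num_regions = len({tuple(p) for p in patterns})
print(num_regions)                     # k**min(n, m) = 9

# identity behind simulating a rank-2 maxout unit with rectifier units:
# max(a, b) = b + relu(a - b)
a, b = np.random.randn(1000), np.random.randn(1000)
assert np.allclose(np.maximum(a, b), b + np.maximum(a - b, 0.0))
```

With m ≤ n each unit partitions a different input coordinate, so the m per-unit partitions intersect transversally and every one of the k^m label tuples is realized, matching the lower bound of Proposition 8.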
However, our bounds are not conclusive about which model is more powerful in this respect. This is an interesting question that would be worth investigating in more detail.

The parameter space of a given network is partitioned into regions on which the resulting functions have corresponding linear regions. The combinatorics of such structures is in general hard to compute, even for simple hyperplane arrangements. One interesting question for future analysis is whether many regions of the parameter space of a given network correspond to functions that have a given number of linear regions.

References

M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. Multi-column deep neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.

G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

O. Delalleau and Y. Bengio. Shallow vs. deep sum-product networks. In NIPS, 2011.

X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.

I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In Proc. 30th International Conference on Machine Learning, pages 1319–1327, 2013.

G. Hinton, L. Deng, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29(6):82–97, Nov. 2012.

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.

O. Krause, A. Fischer, T. Glasmachers, and C. Igel. Approximation properties of DBNs with binary hidden units and real-valued visible units. In Proc.
30th International Conference on Machine Learning, pages 419–426, 2013.

A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

N. Le Roux and Y. Bengio. Deep belief networks are compact universal approximators. Neural Computation, 22(8):2192–2207, Aug. 2010.

G. Montúfar. Universal approximation depth and errors of narrow belief networks with discrete units. Neural Computation, 26, July 2014.

G. Montúfar and N. Ay. Refinements of universal approximation results for deep belief networks and restricted Boltzmann machines. Neural Computation, 23(5):1306–1319, May 2011.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. 27th International Conference on Machine Learning, pages 807–814, 2010.

R. Pascanu and Y. Bengio. Revisiting natural gradient for deep networks. In International Conference on Learning Representations, 2014.

R. Pascanu, G. Montúfar, and Y. Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv:1312.6098, Dec. 2013.

R. Stanley. An introduction to hyperplane arrangements. In Lect. notes, IAS/Park City Math. Inst., 2004.

J. Susskind, A. Anderson, and G. E. Hinton. The Toronto face dataset. Technical Report UTML TR 2010-001, U. Toronto, 2010.

T. Zaslavsky. Facing Up to Arrangements: Face-Count Formulas for Partitions of Space by Hyperplanes. Number 154 in Memoirs of the American Mathematical Society. American Mathematical Society, Providence, RI, 1975.

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks.
arXiv:1311.2901, 2013.