{"title": "The Expressive Power of Neural Networks: A View from the Width", "book": "Advances in Neural Information Processing Systems", "page_first": 6231, "page_last": 6239, "abstract": "The expressive power of neural networks is important for understanding deep learning. Most existing works consider this problem from the view of the depth of a network. In this paper, we study how width affects the expressiveness of neural networks. Classical results state that depth-bounded (e.g. depth-2) networks with suitable activation functions are universal approximators. We show a universal approximation theorem for width-bounded ReLU networks: width-(n + 4) ReLU networks, where n is the input dimension, are universal approximators. Moreover, except for a measure zero set, all functions cannot be approximated by width-n ReLU networks, which exhibits a phase transition. Several recent works demonstrate the benefits of depth by proving the depth-efficiency of neural networks. That is, there are classes of deep networks which cannot be realized by any shallow network whose size is no more than an exponential bound. Here we pose the dual question on the width-efficiency of ReLU networks: Are there wide networks that cannot be realized by narrow networks whose size is not substantially larger? We show that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound. On the other hand, we demonstrate by extensive experiments that narrow networks whose size exceed the polynomial bound by a constant factor can approximate wide and shallow network with high accuracy. 
Our results provide more comprehensive evidence that depth may be more effective than width for the expressiveness of ReLU networks.", "full_text": "The Expressive Power of Neural Networks: A View from the Width

Zhou Lu^{1,3} (1400010739@pku.edu.cn), Hongming Pu^1 (1400010621@pku.edu.cn), Feicheng Wang^{1,3} (1400010604@pku.edu.cn), Zhiqiang Hu^2 (huzq@pku.edu.cn), Liwei Wang^{2,3} (wanglw@cis.pku.edu.cn)

1 Department of Mathematics, Peking University
2 Key Laboratory of Machine Perception, MOE, School of EECS, Peking University
3 Center for Data Science, Peking University, Beijing Institute of Big Data Research

Abstract

The expressive power of neural networks is important for understanding deep learning. Most existing works consider this problem from the view of the depth of a network. In this paper, we study how width affects the expressiveness of neural networks. Classical results state that depth-bounded (e.g. depth-2) networks with suitable activation functions are universal approximators. We show a universal approximation theorem for width-bounded ReLU networks: width-(n + 4) ReLU networks, where n is the input dimension, are universal approximators. Moreover, except for a measure zero set, all functions cannot be approximated by width-n ReLU networks, which exhibits a phase transition. Several recent works demonstrate the benefits of depth by proving the depth-efficiency of neural networks. That is, there are classes of deep networks which cannot be realized by any shallow network whose size is no more than an exponential bound. Here we pose the dual question on the width-efficiency of ReLU networks: Are there wide networks that cannot be realized by narrow networks whose size is not substantially larger? We show that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound. 
On the other hand, we demonstrate by extensive experiments that narrow networks whose size exceeds the polynomial bound by a constant factor can approximate wide and shallow networks with high accuracy. Our results provide more comprehensive evidence that depth may be more effective than width for the expressiveness of ReLU networks.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

1 Introduction

Deep neural networks have achieved state-of-the-art performance in a wide range of tasks such as speech recognition, computer vision, and natural language processing. Despite their promising results in applications, our theoretical understanding of neural networks remains limited. The expressive power of neural networks, being one of their vital properties, is crucial on the way towards a more thorough comprehension.

The expressive power describes neural networks' ability to approximate functions. This line of research dates back at least to the 1980s. The celebrated universal approximation theorem states that depth-2 networks with suitable activation functions can approximate any continuous function on a compact domain to any desired accuracy [3] [1] [9] [6]. However, the size of such a neural network can be exponential in the input dimension, which means that the depth-2 network must have a very large width.

From a learning perspective, achieving universal approximation is just the first step. One must also consider efficiency, i.e., the size of the neural network needed to achieve a given approximation. Keeping the size small requires an understanding of the roles of depth and width for the expressive power. Recently, a series of works has tried to characterize how depth affects the expressiveness of a neural network. [5] showed the existence of a 3-layer network which cannot be realized by any 2-layer network to more than constant accuracy if the size is subexponential in the dimension. 
[2] proved the existence of classes of deep convolutional ReLU networks that cannot be realized by shallow ones if their size is no more than an exponential bound. For any integer k, [15] explicitly constructed networks with O(k^3) layers and constant width which cannot be realized by any network with O(k) layers whose size is smaller than 2^k. This type of result is referred to as the depth efficiency of neural networks: a reduction in depth results in an exponential sacrifice in width. It is worth noting, however, that these are existence results. In fact, as pointed out in [2], proving existence is inevitable: there is always a positive measure of network parameters such that deep nets can't be realized by shallow ones without substantially larger size. Thus we should explore more in addition to proving existence.

Different from most of the previous works, which investigate the expressive power in terms of the depth of neural networks, in this paper we study the problem from the view of width. We argue that an integration of both views will provide a better understanding of the expressive power of neural networks.

Firstly, we prove a universal approximation theorem for width-bounded ReLU networks. Let n denote the input dimension; we show that width-(n + 4) ReLU networks can approximate any Lebesgue-integrable function on n-dimensional space with respect to the L1 distance. On the other hand, except for a zero measure set, all Lebesgue-integrable functions cannot be approximated by width-n ReLU networks, which demonstrates a phase transition. Our result is a dual version of the classical universal approximation theorem for depth-bounded networks.

Next, we explore quantitatively the role of width for the expressive power of neural networks. 
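Before doing so, the L1 approximation criterion used above can be made concrete with a small numerical sketch. The target f and the hand-built depth-2, width-3 ReLU network F_A below are hypothetical illustrations (not the paper's width-(n + 4) construction); the snippet estimates the L1 distance by a trapezoid rule:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f(x):
    # hypothetical target: the 'hat' function, zero outside [-1, 1]
    return np.where(np.abs(x) <= 1.0, 1.0 - np.abs(x), 0.0)

def F_A(x):
    # a depth-2, width-3 ReLU network that realizes the hat function exactly:
    # hat(x) = ReLU(x + 1) - 2 * ReLU(x) + ReLU(x - 1)
    return relu(x + 1.0) - 2.0 * relu(x) + relu(x - 1.0)

def l1_distance(g, h, lo=-2.0, hi=2.0, m=100001):
    # trapezoid-rule estimate of the L1 distance, i.e. the integral of |g - h|
    xs = np.linspace(lo, hi, m)
    vals = np.abs(g(xs) - h(xs))
    dx = xs[1] - xs[0]
    return float(dx * (vals.sum() - 0.5 * (vals[0] + vals[-1])))

err = l1_distance(f, F_A)   # the network matches f, so err is numerically ~0
```

Here the approximation happens to be exact; for a generic integrable f, the theorems below concern how small this distance can be made as the network grows.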
Similar to depth efficiency, we raise the following question on width efficiency:

Are there wide ReLU networks that cannot be realized by any narrow network whose size is not substantially increased?

We argue that investigating this question is important for understanding the roles of depth and width for the expressive power of neural networks. Indeed, if the answer is yes, and the size of the narrow networks must be exponentially larger, then it would be appropriate to say that width is of equal importance to depth for neural networks.

In this paper, we prove that there exists a family of ReLU networks that cannot be approximated by narrower networks whose depth increase is no more than polynomial. This polynomial lower bound for width is significantly smaller than the exponential lower bound for depth. However, it does not rule out the possibility of an exponential lower bound for width efficiency. On the other hand, insights from the preceding analysis suggest studying whether there is a polynomial upper bound, i.e., whether a polynomial increase in depth and size suffices for narrow networks to approximate wide and shallow networks. Theoretically proving a polynomial upper bound seems very difficult, and we formally pose it as an open problem. Nevertheless, we conduct extensive experiments, and the results demonstrate that when the depth of the narrow network exceeds the polynomial lower bound by just a constant factor, it can approximate wide shallow networks to high accuracy. Together, these results provide more comprehensive evidence that depth is more effective than width for the expressive power of ReLU networks.

Our contributions are summarized as follows:

• We prove a Universal Approximation Theorem for Width-Bounded ReLU Networks. 
We show that any Lebesgue-integrable function f from R^n to R can be approximated by a fully-connected width-(n + 4) ReLU network to arbitrary accuracy with respect to the L1 distance. In addition, except for a negligible set, all functions f from R^n to R cannot be approximated by any ReLU network whose width is no more than n.

• We show a polynomial lower bound for width efficiency. For any integer k, there exists a class of width-O(k^2), depth-2 ReLU networks that cannot be approximated by any width-O(k^{3/2}), depth-k network. On the other hand, experimental results demonstrate that networks with size only slightly larger than the lower bound achieve high approximation accuracy.

1.1 Related Work

Research analyzing the expressive power of neural networks dates back decades. In one of the most classic works, Cybenko [3] proved that a fully-connected sigmoid neural network with a single hidden layer can universally approximate any continuous univariate function on a bounded domain with arbitrarily small error. Barron [1], Hornik et al. [9], and Funahashi [6] achieved similar results. They also generalized the sigmoid to a large class of activation functions, showing that universal approximation is essentially implied by the network structure. Delalleau et al. [4] likewise showed that there exists a family of functions which can be represented much more efficiently with deep networks than with shallow ones.

Due to the recent development and success of deep neural networks, there have been many more works discussing the expressive power of neural networks theoretically. Depth efficiency is among the most typical results. 
Eldan and Shamir [5] showed the existence of a 3-layer network which cannot be realized by any 2-layer network to more than constant accuracy if the size is subexponential in the dimension. Cohen et al. [2] proved the existence of classes of deep convolutional ReLU networks that cannot be realized by shallow ones if their size is no more than an exponential bound. For any integer k, Telgarsky [15] explicitly constructed networks with O(k^3) layers and constant width which cannot be realized by any network with O(k) layers whose size is smaller than 2^k.

Other works turn to deep networks' ability to approximate a wide range of functions. For example, Liang et al. [12] showed that in order to universally approximate a function which is Θ(log(1/ε))-order derivable with error ε, a deep network with O(log(1/ε)) layers and O(poly log(1/ε)) weights suffices, but Ω(poly(1/ε)) weights are required if there are only o(log(1/ε)) layers. Yarotsky [16] showed that C^n-functions on R^d with a bounded domain can be universally approximated with error ε by a ReLU network with O(log(1/ε)) layers and O((1/ε)^{d/n} log(1/ε)) weights. In addition, among results based on classical theories, Harvey et al. [7] provided a nearly tight bound for the VC-dimension of neural networks: a network with W weights and L layers has VC-dimension O(WL log W) and Ω(WL log(W/L)). There are also several works arguing for width's importance from other aspects; for example, Nguyen et al. [11] show, from the view of optimization, that if a deep architecture is sufficiently wide at one hidden layer, then it has a well-behaved loss surface, in the sense that almost every critical point with full-rank weight matrices is a global minimum.

The remainder of the paper is organized as follows. In Section 2 we introduce background knowledge needed in this article. In Section 3 we present our main result, the Width-Bounded Universal Approximation Theorem, together with two comparison results related to the theorem. In Section 4 we explore quantitatively the role of width for the expressive power of neural networks. Finally, Section 5 concludes. All proofs can be found in the Appendix; we give proof sketches in the main text as well.

2 Preliminaries

We begin by presenting basic definitions that will be used throughout the paper. A neural network is a directed computation graph, where the nodes are computation units and the edges describe the connection pattern among the nodes. Each node receives as input a weighted sum of the activations flowing through the edges, applies some activation function, and releases the output via the edges to other nodes. Neural networks are often organized in layers, so that nodes receive signals only from the previous layer and release signals only to the next layer. A fully-connected neural network is a layered neural network in which there is a connection between every two nodes in adjacent layers. In this paper, we study the fully-connected ReLU network, which is a fully-connected neural network with Rectified Linear Unit (ReLU) activation functions. The ReLU function ReLU : R → R is formally defined as

ReLU(x) = max{x, 0}    (1)

The architecture of a neural network is often specified by its width and depth. The depth h of a network is defined as its number of layers (including the output layer but excluding the input layer), while the width d_m of a network is defined as the maximal number of nodes in a layer. The number of input nodes, i.e.
the input dimension, is denoted as n.

In this paper we study the expressive power of neural networks, i.e., their ability to approximate functions. We focus on Lebesgue-integrable functions. A Lebesgue-integrable function f : R^n → R is a Lebesgue-measurable function satisfying

∫_{R^n} |f(x)| dx < ∞    (2)

This class contains many continuous functions, as well as discontinuous functions such as the sgn function. Because we deal with Lebesgue-integrable functions, we adopt the L1 distance as the measure of approximation error, different from the L∞ distance used by previous works that consider continuous functions.

3 Width-bounded ReLU Networks as Universal Approximators

In this section we consider universal approximation with width-bounded ReLU networks. The following theorem is the main result of this section.

Theorem 1 (Universal Approximation Theorem for Width-Bounded ReLU Networks). For any Lebesgue-integrable function f : R^n → R and any ε > 0, there exists a fully-connected ReLU network A with width d_m ≤ n + 4, such that the function F_A represented by this network satisfies

∫_{R^n} |f(x) − F_A(x)| dx < ε.    (3)

The proof of this theorem is lengthy and is deferred to the supplementary material. Here we provide an informal description of the high-level idea.

For any Lebesgue-integrable function and any predefined approximation accuracy, we explicitly construct a width-(n + 4) ReLU network that approximates the function to the given accuracy. The network is a concatenation of a series of blocks. 
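For illustration, a generic fully-connected ReLU network of width n + 4 can be sketched as follows. This shows only the architecture shape with random parameters (a hypothetical example, not the explicit block construction used in the proof):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def make_narrow_net(n, depth, rng):
    # fully-connected ReLU net: every hidden layer has width n + 4,
    # scalar output; returns a list of (weight, bias) pairs
    width = n + 4
    sizes = [n] + [width] * (depth - 1) + [1]
    return [(rng.standard_normal((m, k)), rng.standard_normal(m))
            for k, m in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    # ReLU on every layer except the final linear output layer
    for W, b in layers[:-1]:
        x = relu(W @ x + b)
    W, b = layers[-1]
    return (W @ x + b).item()

rng = np.random.default_rng(0)
net = make_narrow_net(n=3, depth=6, rng=rng)   # depth counts the output layer
y = forward(net, np.zeros(3))
hidden_widths = [W.shape[0] for W, _ in net[:-1]]
```

With n = 3, every hidden layer has width n + 4 = 7, matching the width bound in Theorem 1; the proof's construction additionally wires these layers into the sparse memory-carrying blocks described next.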
Each block satisfies the following properties:
1) It is a depth-(4n + 1), width-(n + 4) ReLU network.
2) It can approximate, to high accuracy, any Lebesgue-integrable function which is uniformly zero outside a cube of side length δ.
3) It can store the output of the previous block, i.e., the approximation of other Lebesgue-integrable functions on different cubes.
4) It can sum up its current approximation and the memory of the previous approximations.

It is not difficult to see that the construction of the whole network is complete once we build the blocks. We illustrate such a block in Figure 1. In this block, each layer has n + 4 neurons. Each rectangle in Figure 1 represents a neuron, and the symbols in the rectangle describe the output of that neuron as a function of the block input. Among the n + 4 neurons, n neurons simply transfer the input coordinates. Of the other 4 neurons, 2 store the approximation fulfilled by previous blocks, and the other 2 help to carry out the approximation on the current cube. The topology of the block is rather simple and very sparse: each neuron connects to at most 2 neurons in the next layer.

The proof simply verifies that the construction illustrated in Figure 1 is correct. Because of the space limit, we defer all details to the supplementary materials.

Theorem 1 can be regarded as a dual version of the classical universal approximation theorem, which proves that depth-bounded networks are universal approximators. If we ignore the size of the network,

Figure 1: One block to simulate the indicator function on [a_1, b_1] × [a_2, b_2] × ··· × [a_n, b_n]. For k from 1 to n, we \"chop\" two sides in the kth dimension, and for every k the \"chopping\" process is completed within a 4-layer sub-network, as shown in the figure. The result is stored in the (n+3)th node, as L_n, in the last layer of A. 
We then use a single layer to record it in the (n+1)th or the (n+2)th node, and reset the last two nodes to zero. The network is then ready to simulate another (n+1)-dimensional cube.

both depth and width themselves are efficient for universal approximation. At the technical level, however, there are a few differences between the two universal approximation theorems. The classical depth-bounded theorem considers continuous functions on a compact domain and uses the L∞ distance; our width-bounded theorem instead deals with Lebesgue-integrable functions on the whole Euclidean space and therefore uses the L1 distance.

Theorem 1 implies a phase transition in the expressive power of ReLU networks as the width of the network varies across n, the input dimension. It is not difficult to see that if the width is much smaller than n, then the expressive power of the network must be very weak. Formally, we have the following two results.

Theorem 2. For any Lebesgue-integrable function f : R^n → R such that {x : f(x) ≠ 0} is a set of positive Lebesgue measure, and any function F_A represented by a fully-connected ReLU network A with width d_m ≤ n, the following holds:

∫_{R^n} |f(x) − F_A(x)| dx = +∞  or  ∫_{R^n} |f(x)| dx.    (4)

Theorem 2 says that even when the width equals n, the approximation ability of the ReLU network is still weak, at least on the Euclidean space R^n. If we restrict the function to a bounded set, we can still prove the following theorem.

Theorem 3. 
For any continuous function f : [−1, 1]^n → R which is not constant along any direction, there exists a universal ε* > 0 such that for any function F_A represented by a fully-connected ReLU network with width d_m ≤ n − 1, the L1 distance between f and F_A is at least ε*:

∫_{[−1,1]^n} |f(x) − F_A(x)| dx ≥ ε*.    (5)

Theorem 3 thus stands in direct contrast to Theorem 1, since in Theorem 1 the L1 distance can be made arbitrarily small.

The main idea behind both theorems is to exploit the disadvantage caused by an insufficiency of dimension. If two different input points produce the same first-layer values, the outputs will be the same as well. When the ReLU network's width is no larger than the input dimension, for \"most\" points we can find a ray passing through the point along which the first-layer values are constant. This is, in effect, a dimension reduction caused by the insufficiency of width. Exploiting this weakness of narrow networks, we can prove the two theorems.

4 Width Efficiency vs. Depth Efficiency

Going deeper and deeper has been a trend in recent years, starting from the 8-layer AlexNet [10], the 19-layer VGG [13], and the 22-layer GoogLeNet [14], and finally reaching the 152-layer and 1001-layer ResNets [8]. The superiority of larger depth has been extensively demonstrated in applications across many areas. For example, ResNet has greatly advanced the state-of-the-art performance in computer vision related fields, which is claimed to be due solely to its extremely deep representations. 
Despite the great practical success, theoretical understanding of the role of depth remains limited.

Theoretical understanding of the strength of depth starts from analyzing depth efficiency: proving the existence of deep neural networks that cannot be realized by any shallow network unless its size is exponentially larger. However, we argue that even for a comprehensive understanding of depth itself, one needs to study the dual problem of width efficiency: if we switch the roles of depth and width in the depth efficiency theorems and the resulting statements remain true, then width would have the same power as depth for expressiveness, at least in theory. It is worth noting that, a priori, depth efficiency theorems do not imply anything about the validity of width efficiency. In this section, we study the width efficiency of ReLU networks quantitatively.

Theorem 4. Let n be the input dimension. For any integer k ≥ n + 4, there exists F_A : R^n → R represented by a ReLU neural network A with width d_m = 2k^2 and depth h = 3, such that for any constant b > 0, there exists ε > 0 such that for any function F_B : R^n → R represented by a ReLU neural network B whose parameters are bounded in [−b, b], with width d_m ≤ k^{3/2} and depth h ≤ k + 2, the following inequality holds:

∫_{R^n} |F_A − F_B| dx ≥ ε.    (6)

Theorem 4 states that there are networks for which reducing the width requires an increase in size to compensate, qualitatively similar to the situation for depth. At the quantitative level, however, this theorem is very different from the depth efficiency theorems in [15] [5] [2]: depth efficiency enjoys an exponential lower bound, while Theorem 4 gives only a polynomial lower bound for width. 
Of course, if a corresponding polynomial upper bound can be proven, we may say that depth plays the more important role in efficiency; but even a polynomial lower bound means that depth is not strictly stronger than width in efficiency: sometimes trading width for depth costs super-linearly many more nodes.

This raises a natural question: can we improve the polynomial lower bound? There are at least two possibilities:

1) Width efficiency has an exponential lower bound. Concretely, there are wide networks that cannot be approximated by any narrow network whose size is no more than an exponential bound.

2) Width efficiency has a polynomial upper bound. Every wide network can be approximated by a narrow network whose size increase is no more than polynomial.

The exponential lower bound and the polynomial upper bound have completely different implications. If the exponential lower bound is true, then width and depth have the same strength for expressiveness, at least in theory. If the polynomial upper bound is true, then depth plays a significantly stronger role in the expressive power of ReLU networks.

Currently, neither the exponential lower bound nor the polynomial upper bound seems within reach. We pose this as a formal open problem.

4.1 Experiments

We further conduct extensive experiments to provide some insight into the upper bound of such an approximation. To this end, we study a series of network architectures with varying width. For each network architecture, we randomly sample the parameters, which, together with the architecture, represent the function that we would like narrower networks to approximate. The approximation error is empirically calculated as the mean squared error between the target function and the approximator function, evaluated on a series of uniformly placed inputs. 
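This error measure can be sketched as follows. The target/approximator pair here is a toy stand-in for illustration only (the real pair is the sampled and trained networks described below):

```python
import numpy as np

def uniform_grid(n, num_points):
    # uniformly placed inputs in [-1, 1)^n via a per-dimension grid,
    # as in the 1-D and 2-D evaluations
    per_dim = int(round(num_points ** (1.0 / n)))
    axes = [np.linspace(-1.0, 1.0, per_dim, endpoint=False)] * n
    mesh = np.meshgrid(*axes, indexing='ij')
    return np.stack([m.ravel() for m in mesh], axis=1)

def empirical_mse(target, approx, xs):
    # mean squared error between the two functions on the evaluation inputs
    diff = target(xs) - approx(xs)
    return float(np.mean(diff ** 2))

xs = uniform_grid(n=2, num_points=40000)                  # a 200 x 200 grid
target = lambda x: np.maximum(x[:, 0] + x[:, 1], 0.0)     # hypothetical pair
approx = lambda x: np.maximum(x[:, 0] + x[:, 1] - 0.01, 0.0)
mse = empirical_mse(target, approx, xs)
```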
For simplicity and clarity, we refer to the network architectures that represent the target functions once parameters are assigned as target networks, and the corresponding network architectures for the approximator functions as approximator networks.

In detail, the target networks are fully-connected ReLU networks with input dimension n, output dimension 1, width 2k^2, and depth 3, for n = 1, 2 and k = 3, 4, 5. For each of these networks, we sample weight parameters from the standard normal distribution and bias parameters from the uniform distribution over [−1, 1). The network and the sampled parameters collectively represent a target function, which we approximate with a narrow approximator network of width 3k^{3/2} and depth k + 2 for the corresponding k. The architectures are designed in accordance with Theorem 4: we aim to investigate whether this lower bound is actually also an upper bound. To empirically calculate the approximation error, 20000 uniformly placed inputs from [−1, 1)^n for n = 1, and 40000 such inputs for n = 2, are evaluated by the target function and the approximator function respectively, and the mean squared error is reported. For each target network, we repeat the parameter-sampling process 50 times and report the mean squared error in the worst and average cases.

We adopt the standard supervised learning approach, searching the parameter space of the approximator network for the best approximator function. Specifically, half of all the test inputs from [−1, 1)^n, together with the corresponding values of the target function, constitute the training set. The training set is used to train the approximator network with a mini-batch AdaDelta optimizer and learning rate 1.0. The parameters of the approximator network are randomly initialized according to [8]. 
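The target-function generation just described can be sketched as follows (width 2k^2, depth 3, N(0, 1) weights, Uniform[−1, 1) biases); training the approximator network with AdaDelta is omitted from this sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sample_target_net(n, k, rng):
    # target network from the experiments: input dim n, output dim 1,
    # width 2*k**2, depth 3 (two hidden layers plus the output layer);
    # weights ~ N(0, 1), biases ~ Uniform[-1, 1)
    width = 2 * k * k
    sizes = [n, width, width, 1]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        W = rng.standard_normal((fan_out, fan_in))
        b = rng.uniform(-1.0, 1.0, size=fan_out)
        layers.append((W, b))
    return layers

def target_fn(layers, X):
    # evaluate the network on a batch X of shape (num_points, n)
    H = X.T
    for W, b in layers[:-1]:
        H = relu(W @ H + b[:, None])
    W, b = layers[-1]
    return (W @ H + b[:, None]).ravel()

rng = np.random.default_rng(0)
net = sample_target_net(n=1, k=5, rng=rng)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = target_fn(net, X)
```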
The training process runs for 100 epochs for n = 1 and 200 epochs for n = 2; the best approximator function is recorded.

Table 1 lists the results. Figure 2 illustrates the comparison of an example target function and the corresponding approximator function for n = 1 and k = 5. Note that the target function values vary on a scale of ~10 over the given domain, so the (absolute) mean squared error is indeed a reasonable measure of the approximation error. The approximation error is indeed very small for the target networks and approximator networks we study. From Figure 2 we can see that the approximator function is so close to the target function that we have to enlarge a local region to better display the difference. Since the architectures of both the target networks and the approximator networks are determined according to Theorem 4, where the depth of the approximator networks is polynomial in that of the target networks, the empirical results indicate that a polynomially larger depth may be sufficient for a narrow network to approximate a wide network.

5 Conclusion

In this paper, we analyze the expressive power of neural networks with a view from the width, in contrast to many previous works which focus on the view from the depth. We establish the Universal Approximation Theorem for Width-Bounded ReLU Networks, a counterpart to the well-known Universal Approximation Theorem, which studies depth-bounded networks. Our result demonstrates a phase transition in expressive power as the width of a ReLU network with given input dimension varies.

We also explore the role of width for the expressive power of neural networks: we prove that a wide network cannot be approximated by a narrow network unless the narrow network has polynomially more nodes, which gives a lower bound on the number of nodes needed for approximation. We pose open problems on whether

Table 1: Empirical study results. 
n denotes the input dimension; k is defined in Theorem 4; the width/depth of both the target network and the approximator network are determined in accordance with Theorem 4. We report the mean squared error in the worst and average cases over 50 runs of randomly sampled parameters for the target network.

n  k | target width  target depth | approximator width  approximator depth | worst case error | average case error
1  3 |      18            3       |         16                  5           |     0.002248     |      0.000345
1  4 |      36            3       |         24                  6           |     0.003263     |      0.000892
1  5 |      50            3       |         34                  7           |     0.005643     |      0.001296
2  3 |      18            3       |         16                  5           |     0.008729     |      0.001990
2  4 |      36            3       |         24                  6           |     0.018852     |      0.006251
2  5 |      50            3       |         34                  7           |     0.030114     |      0.007984

Figure 2: Comparison of an example target function and the corresponding approximator function for n = 1 and k = 5. A local region is enlarged to better display the difference.

the exponential lower bound or the polynomial upper bound holds for width efficiency, which we think is crucial on the way to a more thorough understanding of the expressive power of neural networks. The experimental results support the polynomial upper bound and agree with our intuition and insights from the analysis.

The width and the depth are two key components in the design of a neural network architecture. Both are important and should be carefully tuned together for the best performance, since the depth may determine the level of abstraction while the width may influence the loss of information in the forward pass. A comprehensive understanding of the expressive power of neural networks requires looking from both views.

Acknowledgments

This work was partially supported by the National Basic Research Program of China (973 Program) (grant no. 2015CB352502), NSFC (61573026), and the Center for Data Science, Peking University / Beijing Institute of Big Data Research. 
We would like to thank the anonymous reviewers for their valuable comments on our paper.

References

[1] Andrew R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14(1):115–133, 1994.

[2] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728, 2016.

[3] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, 1989.

[4] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.

[5] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.

[6] Ken-Ichi Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183–192, 1989.

[7] Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Conference on Learning Theory (COLT), 2017.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[9] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[11] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. 
In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2603–2612, Sydney, Australia, 2017. PMLR.

[12] Shiyu Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations (ICLR), 2017.

[13] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

[15] Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory (COLT), pages 1517–1539, 2016.

[16] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. arXiv preprint arXiv:1610.01145, 2016.
", "award": [], "sourceid": 3152, "authors": [{"given_name": "Zhou", "family_name": "Lu", "institution": "Peking University"}, {"given_name": "Hongming", "family_name": "Pu", "institution": "Peking University"}, {"given_name": "Feicheng", "family_name": "Wang", "institution": "Peking University"}, {"given_name": "Zhiqiang", "family_name": "Hu", "institution": "Peking University"}, {"given_name": "Liwei", "family_name": "Wang", "institution": "Peking University"}]}