{"title": "ResNet with one-neuron hidden layers is a Universal Approximator", "book": "Advances in Neural Information Processing Systems", "page_first": 6169, "page_last": 6178, "abstract": "We demonstrate that a very deep ResNet with stacked modules that have one neuron per hidden layer and ReLU activation functions can uniformly approximate any Lebesgue integrable function in d dimensions, i.e. \\ell_1(R^d). Due to the identity mapping inherent to ResNets, our network has alternating layers of dimension one and d. This stands in sharp contrast to fully connected networks, which are not universal approximators if their width is the input dimension d [21,11]. Hence, our result implies an increase in representational power for narrow deep networks by the ResNet architecture.", "full_text": "ResNet with one-neuron hidden layers is a Universal\n\nApproximator\n\nHongzhou Lin\n\nMIT\n\nCambridge, MA 02139\nhongzhou@mit.edu\n\nStefanie Jegelka\n\nMIT\n\nCambridge, MA 02139\n\nstefje@mit.edu\n\nAbstract\n\nWe demonstrate that a very deep ResNet with stacked modules that have one\nneuron per hidden layer and ReLU activation functions can uniformly approximate\nany Lebesgue integrable function in d dimensions, i.e. \u21131(Rd). Due to the identity\nmapping inherent to ResNets, our network has alternating layers of dimension one\nand d. This stands in sharp contrast to fully connected networks, which are not\nuniversal approximators if their width is the input dimension d [21, 11]. Hence,\nour result implies an increase in representational power for narrow deep networks\nby the ResNet architecture.\n\n1\n\nIntroduction\n\nDeep neural networks are central to many recent successes of machine learning, including applications\nsuch as computer vision, natural language processing, or reinforcement learning. 
A common trend in\ndeep learning has been to construct larger and deeper networks, starting from the pioneer convolutional\nnetwork LeNet [19], to networks with tens of layers such as AlexNet [17] or VGG-Net [28], or recent\narchitectures like GoogLeNet/Inception [30] or ResNet [13, 14], which may contain hundreds or\nthousands of layers. A typical observation is that deeper networks offer better performance. This\nphenomenon, at least on the training set, supports the intuition that a deeper network should have\nmore capacity to approximate the target function, and leads to a question that has received increasing\ninterest in the theory of deep learning: can all functions that we may care about be approximated well\nby a suf\ufb01ciently large and deep network? In this work, we address this important question for the\npopular ResNet architecture.\n\nThe question of representational power of neural networks has been answered in different forms.\nResults in the late eighties showed that a network with a single hidden layer can approximate any\ncontinuous function with compact support to arbitrary accuracy, when the width goes to in\ufb01nity [7,\n15, 10, 18]. This result is referred to as the universal approximation theorem. Analogous to the\nclassical Stone-Weierstrass theorem on polynomials or the convergence theorem on Fourier series,\nthis theorem implies that the family of neural networks are universal approximators: we can apply\nneural networks to approximate any continuous function and the accuracy improves as we add more\nneurons in the width. More importantly, the coef\ufb01cients in the network can be ef\ufb01ciently learned via\nback-propagation, providing an explicit representation of the approximation.\n\nThis classical universal approximation theorem completely relies on the power of the width increasing\nto in\ufb01nity, i.e., \u201cfat\u201d networks. Current \u201ctall\u201d deep learning models, however, are not captured by this\nsetting. 
Consequently, theoretically analyzing the bene\ufb01t of depth has gained much attention in the\nrecent literature [31, 6, 9, 32, 23, 20, 25]. The main focus of these papers is to provide examples of\nfunctions that can be ef\ufb01ciently represented by a deep network but are hard to represent by shallow\nnetworks. These examples require exponentially many neurons in a shallow network to achieve\nthe same approximation accuracy as a deep network with only a polynomial or linear number of\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fneurons. Yet, these speci\ufb01c examples do not imply that all shallow networks can be represented by\ndeep networks, leading to an important question:\n\nIf the number of neurons in each layer is bounded, does universal approximation hold when the depth\ngoes to in\ufb01nity?\n\nThis question has recently been studied by [21, 11] for fully connected networks with ReLU activation\nfunctions: if each hidden layer has at least d + 1 neurons, where d is the dimension of the input\nspace, the universal approximation theorem holds as the depth goes to in\ufb01nity. If, however, at most\nd neurons can be used in each hidden layer, then universal approximation is impossible even with\nin\ufb01nite depth.\n\nIn practice, other architectures have been developed to improve empirical results. A popular example\nis ResNet [13, 14], which includes an identity mapping in addition to each layer. A \ufb01rst step towards\na better theoretical understanding of those empirically successful models is to ask how the above\nquestion extends to them. Do the architecture variations make a difference theoretically? Due to the\nidentity mapping, for ResNet, the width of the network remains the same as the input dimension. For\na formal analysis, we stack modules of the form shown in Figure 1, and analyze how small the hidden\ngreen layers can be. 
The resulting width of d (blue) or even less (green) stands in sharp contrast
with the negative result for width d for fully connected networks in [21, 11]. Indeed, our empirical
illustrations in Section 2 demonstrate significant differences in the representational power of narrow
ResNets versus narrow fully connected networks. Our theoretical results confirm those observations.

Hardt and Ma [12] show that ResNet enjoys universal finite-sample expressive power, i.e., ResNet
can represent any classifier on any finite sample perfectly. This positive result in the discrete setting
motivates our work. Their proof, however, relies on the fact that samples are "far" from each other
and hence cannot be used in the setting of full functions in continuous space.

Contributions. The main contribution of this paper is to show that ResNet with one single neuron
per hidden layer is enough to provide universal approximation as the depth goes to infinity. More
precisely, we show that for any Lebesgue-integrable¹ function f : Rd → R and any ε > 0, there
exists a ResNet R with ReLU activation and one neuron per hidden layer such that

∫_{Rd} |f(x) − R(x)| dx ≤ ε.

This result implies that, compared to fully connected networks, the identity mapping of ResNet
indeed adds representational power for tall networks.

Figure 1: The basic residual block with one neuron per hidden layer.

The ResNet in our construction is built by stacking residual blocks of the form illustrated in Figure 1,
with one neuron in the hidden layer. A basic residual block consists of two linear mappings and a
single ReLU activation [12, 13]. 
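This block and the stacked network are easy to write down concretely; the following NumPy sketch (our own illustrative code, with hypothetical names, not the authors' implementation) mirrors the structure of Figure 1:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, U, V, u):
    """One basic residual block: x -> x + V * ReLU(U x + u).

    x has shape (d,); U has shape (1, d); V has shape (d, 1); u is a scalar.
    The hidden activation U x + u is a single number: one neuron."""
    hidden = relu(U @ x + u)   # shape (1,): the one-neuron hidden layer
    return x + V @ hidden      # identity mapping plus the block output

def resnet(x, blocks, L):
    """Stack one-neuron residual blocks, then apply a linear output L : R^d -> R."""
    for U, V, u in blocks:
        x = residual_block(x, U, V, u)
    return float(L @ x)
```

Every layer thus keeps width d, while the trainable part of each block touches only 2d + 1 scalars.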
More formally, it is a function T_{U,V,u} from Rd to Rd defined by

T_{U,V,u}(x) = V ReLU(U x + u),

where U ∈ R^{1×d}, V ∈ R^{d×1}, u ∈ R, and the ReLU activation function is defined by

ReLU(x) = max(x, 0) = [x]_+.   (1)

After performing the nonlinear transformation, we add the identity to form the input of the next layer.
The resulting ResNet is a combination of several basic residual blocks and a final linear output layer:

R(x) = L ∘ (Id + T_N) ∘ (Id + T_{N−1}) ∘ ··· ∘ (Id + T_0)(x),

where L : Rd → R is a linear operator and the T_i are basic one-neuron residual blocks.

¹A function f is Lebesgue-integrable if ∫_{Rd} |f(x)| dx < ∞.

Unlike the original architecture [13], we do not include any convolutional layers, max pooling or batch
normalization; the above simplified architecture turns out to be sufficient for universal approximation.

2 A motivating example

We begin by empirically exploring the difference between narrow fully connected networks, with d
neurons per hidden layer, and ResNet via a simple example: classifying the unit ball in the plane.
The training set consists of randomly generated samples (z_i, y_i)_{i=1···n} ∈ R² × {−1, 1} with

y_i = 1 if ‖z_i‖_2 ≤ 1;  y_i = −1 if 2 ≤ ‖z_i‖_2 ≤ 3.

We artificially create a margin between positive and negative samples to make the classification task
easier. As loss, we use the logistic loss (1/n) Σ_i log(1 + e^{−y_i ŷ_i}), where ŷ_i = f_N(z_i) is the output of
the network on the i-th sample. After training, we illustrate the learned decision boundaries of the
networks for various depths. Ideally, we would expect the decision boundaries of our models to be
close to the true distribution, i.e., the unit ball.

Figure 2: Decision boundaries obtained by training fully connected networks with width d = 2 per
hidden layer (top row) and ResNet (bottom row) with one neuron in the hidden layers on the unit
ball classification problem (panels: the training data and networks with 1, 2, 3, and 5 hidden layers).
The fully connected networks fail to capture the true function, in line with the theory stating that
width d is too narrow for universal approximation. ResNet in contrast approximates the function
well, empirically supporting our theoretical results.

Figure 2 shows the results. For the fully connected networks (top row), the learned decision boundaries
have roughly the same shape for different depths: the approximation quality seems to not improve
with increasing depth. While one may be inclined to argue that this is due to local optimality, our
observation agrees with the results in [21]:

Proposition 2.1. Let f_N : Rd → R be the function defined by a fully connected network N with
ReLU activation. Denote by P = {x ∈ Rd | f_N(x) > 0} the positive level set of f_N. If each hidden
layer of N has at most d neurons, then

λ(P) = 0 or λ(P) = +∞, where λ denotes the Lebesgue measure.

In other words, the non-trivial level set of a narrow fully connected network is always unbounded.

The proof is a direct application of Theorem 2 of [21], see Appendix E. Thus, even when the depth
goes to infinity, a narrow fully connected network can never approximate a bounded region. Here we
only show the case d = 2 because we can easily visualize the data; the same observation will still
hold in higher dimensions. 
An even stronger, recent result states that any connected component of
the decision boundaries obtained by a narrow fully connected network is unbounded [3].

The decision boundaries for ResNet appear strikingly different: despite the even narrower width
of one, from 2 hidden layers onwards, the ResNet represents the indicator of a bounded region.
With increasing depth, the decision boundary seems to converge to the unit ball, implying that
Proposition 2.1 cannot hold for ResNet. These observations motivate the universal approximation
theorem that we will show in the next section.

3 Universal approximation theorem

In this section, we present the universal approximation theorem for ResNet with one-neuron hidden
layers. We sketch the proof in the one-dimensional case; the induction for higher dimensions relies
on similar ideas and may be found in the appendix.

Theorem 3.1 (Universal Approximation of ResNet). For any d ∈ N, the family of ResNet with
one-neuron hidden layers and ReLU activation function can universally approximate any f ∈ ℓ1(Rd).
In other words, for any ε > 0, there is a ResNet R with finitely many layers such that

∫_{Rd} |f(x) − R(x)| dx ≤ ε.

Outline of the proof. The proof starts with a well-known fact: the class of piecewise constant
functions with compact support and finitely many discontinuities is dense in ℓ1(Rd). Thus it suffices
to approximate any piecewise constant function. Given a piecewise constant function, we first
construct a grid "indicator" function on its support, as shown in Figure 4. This function is similar to
an indicator function in the sense that it vanishes outside the support, but, instead of being constantly
equal to one, a grid indicator function takes different constant values on different grid cells; see
Definition B.3 for a formal definition. 
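The first step of the outline, the density of piecewise constant functions in ℓ1, is easy to check numerically for a concrete target; the sketch below (our own example, with an arbitrary integrable f and a midpoint step-function approximation) estimates the L1 error on two grid resolutions:

```python
import numpy as np

def l1_error_of_step_approx(f, a, b, num_cells, num_samples=100_000):
    """Approximate f on [a, b] by the piecewise constant function taking the
    cell-midpoint value of f on each grid cell; estimate the L1 error by a
    fine Riemann sum."""
    edges = np.linspace(a, b, num_cells + 1)
    x = np.linspace(a, b, num_samples, endpoint=False)
    # index of the grid cell containing each sample point
    idx = np.minimum(((x - a) / (b - a) * num_cells).astype(int), num_cells - 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    step_values = f(midpoints)[idx]
    return np.mean(np.abs(f(x) - step_values)) * (b - a)

f = lambda x: np.exp(-x ** 2)  # an integrable target, for illustration only
coarse = l1_error_of_step_approx(f, -3.0, 3.0, 8)
fine = l1_error_of_step_approx(f, -3.0, 3.0, 128)
```

Refining the grid shrinks the L1 gap, which is exactly the reduction the proof relies on before the ResNet construction takes over.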
The property of having different function values creates a "fingerprint" on each grid cell, which will
help to distinguish them. Then, we divide the space into different level sets, such that one level set
contains exactly one grid cell. Finally, we fit the function value on each grid cell, cell by cell.

Sketch of the proof when d = 1. We start with the one-dimensional case, which is central to our
construction. As mentioned above, it is sufficient to approximate piecewise constant functions. Given
a piecewise constant function h, there is a subdivision −∞ < a_0 < a_1 < ··· < a_M < +∞ such that

h(x) = Σ_{k=1}^{M} h_k 1_{x ∈ [a_{k−1}, a_k)},

where h_k is the constant value on the k-th subdivision I_k = [a_{k−1}, a_k). We will approximate h via
trapezoid functions of the following form, shown in Figure 3.

Figure 3: A trapezoid function, which is a continuous approximation of the indicator function. The
parameter δ measures the quality of the approximation.

A trapezoid function is a simple continuous approximation of the indicator function. It is constant on
the segment I_k^δ = [a_{k−1} + δ, a_k − δ] and linear in the δ-tolerant region I_k \ I_k^δ. As δ goes to zero, the
trapezoid function tends point-wise to the indicator function.

A natural idea to approximate h is to construct a trapezoid function on each subdivision I_k and to then
sum them up. This is the main strategy used in [21, 11] to show a universal approximation theorem
for fully connected networks with width at least d + 1. However, this strategy is not applicable for the
ResNet structure because the summation requires memory of past components, and hence requires
additional units in every layer. 
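For concreteness, the trapezoid function of Figure 3 and its δ → 0 behavior can be sketched as follows (our own illustrative code, not from the paper):

```python
import numpy as np

def trapezoid(x, a_left, a_right, delta, height=1.0):
    """Continuous approximation of height * 1_[a_left, a_right): equal to
    `height` on [a_left + delta, a_right - delta], zero outside
    [a_left, a_right], and linear on the two delta-wide ramps."""
    up = np.clip((x - a_left) / delta, 0.0, 1.0)     # rising ramp
    down = np.clip((a_right - x) / delta, 0.0, 1.0)  # falling ramp
    return height * np.minimum(up, down)

x = np.linspace(-1.0, 3.0, 4001)
indicator = ((x >= 0.0) & (x < 2.0)).astype(float)

# the L1 gap to the indicator comes only from the two ramps, so it is about delta
err = {d: np.mean(np.abs(trapezoid(x, 0.0, 2.0, d) - indicator)) * 4.0
       for d in (0.5, 0.1, 0.02)}
```

The measured L1 error shrinks roughly like δ, matching the δ → 0 limit used in the proof.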
The width constraint of ResNet makes the difference here.

In contrast, we construct our approximation in a sequential way: we build the components of the
trapezoid function one after another. With this sequential construction, we can only build increasing
trapezoid functions as shown in Figure 4.

Figure 4: An increasing trapezoid function, which is a special case of grid indicator function when
d = 1, is trapezoidal on each subdivision with increasing constant value from left to right.

Such functions are trapezoidal on each subdivision I_k and the constant value on I_k^δ increases when
k grows. The construction relies on the following basic operations:

Proposition 3.2 (Basic operations). The following operations are realizable by a single basic
residual block of ResNet with one neuron:

(a) Shifting by a constant: R_+ = R + c for any c ∈ R;

(b) Min or Max with a constant: R_+ = min{R, c} or R_+ = max{R, c} for any c ∈ R;

(c) Min or Max with a linear transformation: R_+ = min{R, αR + β} (or max) for any α, β ∈ R;

where R represents the input layer in the basic residual block and R_+ the output layer.

Geometrically, operation (a) allows us to shift the function by a constant; operation (b) allows us to
remove the level set {R ≥ c} or {R ≤ c}; and operation (c) can be used to adjust the slope. With
these basic operations at hand, we construct the increasing trapezoid function by induction on the
subdivisions. For any m ∈ [0, M], we construct a function R_m satisfying

C1. R_m = 0 on (−∞, a_0];

C2. R_m is a trapezoid function on each I_k, for any k = 1, ···, m;

C3. R_m = (k + 1)‖h‖∞ on I_k^δ = [a_{k−1} + δ, a_k − δ] for any k = 1, ···, m;

C4. R_m is bounded on (−∞, a_m] by 0 ≤ R_m ≤ (m + 1)‖h‖∞;

C5. R_m(x) = −((m + 1)‖h‖∞ / δ)(x − a_m) if x ∈ [a_m, +∞);

where ‖h‖∞ = max_{k=1···M} |h_k| is the infinity norm and δ > 0 measures the quality of the approximation.

An illustration of R_m is shown in Figure 5. On the first m subdivisions, R_m is the restriction of the
desired increasing trapezoid function. On [a_m, +∞), the function R_m is a very steep linear function
with negative slope that enables the construction of the next subdivision.

Given R_m, we sequentially stack three residual blocks to build R_{m+1}:

• R⁺_m = max{R_m, −(1 + 1/(m + 1)) R_m};

• R⁺⁺_m = min{R⁺_m, −R⁺_m + ((m + 2)‖h‖∞ / δ)(a_{m+1} − a_m)};

• R_{m+1} = min{R⁺⁺_m, (m + 2)‖h‖∞}.

Figure 5 illustrates the effect of these blocks: the first operation flips the linear part on [a_m, +∞) by
adjusting the slope, the second operation folds the linear function in the middle of [a_m, a_{m+1}], and
finally we cut off the peak at the appropriate level (m + 2)‖h‖∞.

Figure 5: The construction of R_{m+1} based on R_m. We build the next trapezoid function (red) and
keep the previous ones (blue) unchanged.

An important consideration is that we need to keep the function on previous subdivisions unchanged
while building the next trapezoid function. We achieve this by increasing the function values. The
different values will be the basis for adjusting the function value in each subdivision to the final value
of the target function we want to approximate. Before proceeding with the adjustment, we remark
that R_M goes to −∞ as x → ∞. 
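The basic operations of Proposition 3.2 are each a single one-neuron block R + V·ReLU(U·R + u) acting on a scalar; here is one concrete choice of (U, V, u) per operation (a sketch with our own parameter choices, e.g. min{R, c} = R − ReLU(R − c)):

```python
import numpy as np

def block(x, U, V, u):
    """One-neuron residual block applied pointwise to a scalar signal."""
    return x + V * np.maximum(U * x + u, 0.0)

x = np.linspace(-5.0, 5.0, 101)
c, alpha, beta = 2.0, -0.5, 1.0

shift   = block(x, 0.0, c, 1.0)               # (a): x + c, since ReLU(0*x + 1) = 1
max_c   = block(x, -1.0, 1.0, c)              # (b): x + ReLU(c - x) = max{x, c}
min_c   = block(x, 1.0, -1.0, -c)             # (b): x - ReLU(x - c) = min{x, c}
min_lin = block(x, 1.0 - alpha, -1.0, -beta)  # (c): x - ReLU((1-alpha)x - beta)
                                              #      = min{x, alpha*x + beta}
```

The three blocks used to build R_{m+1} above are instances of (c), (c), and (b), respectively.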
This negative "tail" is easily removed by performing a cut-off
operation via the max operator. This gives us the desired increasing trapezoid function R*_M.

To adjust the function values on the intervals I_k^δ, we identify the I_k^δ via the level sets of R*_M. This
works because, by construction, R*_M = (k + 1)‖h‖∞ on I_k^δ. More precisely, we define the level
sets L_k = {k‖h‖∞ < R*_M ≤ (k + 1)‖h‖∞} (for k = 0, ···, M) and adjust them one by one from
highest to lowest value: for any k = M, ···, 1, we sequentially build

R*_{k−1} = R*_k + ((h_k − (k + 1)‖h‖∞) / ‖h‖∞) [R*_k − k‖h‖∞]_+.   (2)

An illustration of the R*_k is shown in Figure 6.

Figure 6: An illustration of the function adjustment procedure applied to the top level sets. At each
step, we adjust one I_k^δ to the desired function value h_k.

In particular, the first step from R*_M to R*_{M−1} only scales the top level set because the ReLU activation
[R*_M − M‖h‖∞]_+ is active if and only if x ∈ L_M. The coefficients are appropriately selected such
that after the scaling, the constant in I_M^δ matches h_M. Hence, we have

R*_{M−1} = h_M if x ∈ I_M^δ ⊂ L_M;  R*_{M−1} = R*_M if x ∉ L_M.

Next, we set the second largest level set to h_{M−1}, and so on. 
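The effect of recurrence (2) on the plateau values can be checked directly: start from R*_M = (k + 1)‖h‖∞ on I_k^δ and iterate. Below is a simplified numeric sketch of ours that tracks only the plateau values, not the full function:

```python
import numpy as np

def adjust_plateaus(h):
    """Track the value of R*_k on each plateau I_j^delta under recurrence (2).

    Initially R*_M = (j + 1) * H on the j-th plateau, with H = ||h||_inf.
    Scaling the level sets from highest to lowest sets plateau j to h_j."""
    h = np.asarray(h, dtype=float)
    H = np.max(np.abs(h))                  # ||h||_inf
    M = len(h)
    R = (np.arange(1, M + 1) + 1.0) * H    # plateau values of R*_M
    for k in range(M, 0, -1):              # k = M, ..., 1
        R = R + (h[k - 1] - (k + 1) * H) / H * np.maximum(R - k * H, 0.0)
    return R
```

Each iteration is itself one one-neuron residual block, with U = 1, u = −k‖h‖∞ and V = (h_k − (k + 1)‖h‖∞)/‖h‖∞; already-adjusted plateaus satisfy |h_j| ≤ k‖h‖∞ and are left untouched by the ReLU.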
As a result, the function R*_0, obtained after rescaling all the level sets, is the desired approximation
of the piecewise constant function h. Concretely, we show that R*_0 satisfies

• R*_0 = 0 on (−∞, a_0] and [a_M, +∞);

• R*_0 = h_k on I_k^δ = [a_{k−1} + δ, a_k − δ] for any k = 1, ···, M;

• R*_0 is bounded with −‖h‖∞ ≤ R*_0 ≤ ‖h‖∞.

The detailed proof is deferred to Appendix B. Importantly, our construction is valid for any small
enough δ satisfying 0 < 2δ < min_{k=1,···,M} {a_k − a_{k−1}}. Hence, the approximation error, which is
bounded by

∫_R |R*_0(x) − h(x)| dx ≤ 4Mδ‖h‖∞,

can be made arbitrarily small by taking δ to 0. This completes the proof.

Extension to higher dimensions. The last step of the one-dimensional construction is performed
by sliding through all the grid cells and adjusting the function value sequentially. This procedure can
be done regardless of the dimension. Therefore, it suffices to build a d-dimensional grid indicator
function, which generalizes the notion of increasing trapezoid function into high dimensional space
(Definition B.3 in the appendix).

We perform an induction over dimensions. The main idea is to sum up an appropriate one-dimensional
grid indicator function and an appropriate d − 1 dimensional grid indicator function, as illustrated in
Figure 7.

Figure 7: One dimensional grid indicator functions on the first (left) and second (middle) coordinate.
Both functions can be constructed independently by our one hidden unit ResNet.

The summation gives the desired shape inside each grid cell. However, some regions outside the grid
cells are also raised, but were supposed to be zero. 
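A toy numerical sketch of this situation in d = 2 (the plateau values, cells, and threshold below are our own illustration, not the paper's constants): summing a grid indicator along x1 and one along x2 raises some off-cell regions, but a threshold cut, as described next, restores zero there while keeping one distinct value per cell:

```python
import numpy as np

def grid_indicator_1d(x, cells, values, delta=0.05):
    """Trapezoidal plateaus with distinct positive values on disjoint 1-D cells,
    zero elsewhere."""
    out = np.zeros_like(x)
    for (lo, hi), v in zip(cells, values):
        ramp = np.minimum(np.clip((x - lo) / delta, 0.0, 1.0),
                          np.clip((hi - x) / delta, 0.0, 1.0))
        out = np.maximum(out, v * ramp)
    return out

xs = np.linspace(0.0, 2.0, 201)
X1, X2 = np.meshgrid(xs, xs, indexing="ij")
cells = [(0.0, 0.8), (1.2, 2.0)]

# Plateau values chosen so that every on-cell sum exceeds T while every
# single-coordinate value stays below T (the "separate level set" property).
g1 = grid_indicator_1d(X1, cells, [10.0, 11.0])
g2 = grid_indicator_1d(X2, cells, [10.25, 10.5])
T = 15.0
grid_2d = np.maximum(g1 + g2 - T, 0.0)  # zero off-cell, a distinct value per cell
```

The cut leaves four distinct plateau values, one per 2-D grid cell, which is exactly the fingerprint the adjustment step needs.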
We address this issue via another, separate level
set property: there is a threshold T such that (a) the function value inside each I_k^δ is larger than T;
(b) the function values outside the grid cells are smaller than T. Therefore, the desired grid indicator
function can be obtained by performing a max operator with the threshold T, i.e., cutting off the
smaller values and setting them to zero (see Appendix C).

Number of neurons/layers. A straightforward consequence of our construction is that we
can approximate any piecewise constant function to arbitrary accuracy with a ResNet of
O(number of grid cells) hidden units/layers. The most space-consuming procedure is the func-
tion adjustment that requires going through each of the grid cells one by one. This procedure however
can be parallelized if we allow more hidden units per layer.

Deriving an exact relationship between the original target function f and the required number of grid
cells is nontrivial and highly dependent on characteristics of f. In particular, when the function f is
continuous, this number is related to the modulus of continuity of f defined by

ω_K(r) = max_{x,y ∈ K, ‖x−y‖ ≤ r} |f(x) − f(y)|,

where K is any compact set and r represents the radius of the discretization. Given a desired
approximation accuracy ε, we need to

• first, determine a compact set K such that ∫_{Rd\K} |f| ≤ ε and restrict f to K;

• second, determine r such that ω_K(r) ≤ ε/Vol(K).

Then, the number of grid cells is O(1/r^d). This dependence is suboptimal in the exponent, and it
may be possible to improve it using a similar strategy as [34]. Also, by imposing stronger smoothness
assumptions, this number may be reducible dramatically [2, 22, 33]. 
These improvements are not the
main focus of this paper, and we leave them for future work.

4 Discussion and concluding remarks

In this paper, we have shown a universal approximation theorem for the ResNet structure with one
unit per hidden layer. This result stands in contrast to recent results on fully connected networks, for
which universal approximation fails with width d or less. To conclude, we add some final remarks
and implications.

ResNet vs Fully connected networks. While we achieve universal approximation with only one
hidden neuron in each basic residual block, one may argue that the structure of ResNet still passes
the identity to the next layer. This identity map could be counted as d hidden units, resulting in a
total of d + 1 hidden units per residual block, and could be viewed as making the network a width
(d + 1) fully connected network. But, even from this angle, ResNet corresponds to a compressed or
sparse version of a fully connected network. In particular, a width (d + 1) fully connected network
has O(d²) connections per layer, whereas only O(d) connections are present in ResNet thanks to the
identity map. This "overparametrization" of fully connected networks may be a partial explanation
why dropout [29] has been observed to be beneficial for such networks. By the same argument, our
result implies that width (d + 1) fully connected networks are universal approximators, which is the
minimum width needed [11]. A detailed construction may be found in Appendix F.

Why does universal approximation matter? As shown in Section 2, a width d fully connected
network can never approximate a compact decision boundary even if we allow infinite depth. However,
in high dimensional space, it is very hard to visualize and check the obtained decision boundary. 
The\nuniversal approximation theorem then provides a sanity check, and ensures that, in principle, we are\nable to capture any desired decision boundary.\n\nTraining ef\ufb01ciency. The universal approximation theorem only guarantees the possibility of approx-\nimating any desired function, but it does not guarantee that we will actually \ufb01nd it in practice by run-\nning SGD or any other optimization algorithm. Understanding the ef\ufb01ciency of training may require a\nbetter understanding of the optimization landscape, a topic of recent attention [5, 16, 24, 26, 8, 35, 27].\n\nHere, we try to provide a slightly different angle. By our theory, ResNet with one-neuron hidden\nlayers is already a universal approximator. In other words, a ResNet with multiple units per layer is\nin some sense an over-parametrization of the model, and over-parametrization has been observed to\nbene\ufb01t optimization [36, 4, 1]. This might be one reason why training a very deep ResNet is \u201ceasier\u201d\nthan training a fully connected network. A more rigorous analysis is an interesting direction for\nfuture work.\n\nGeneralization. Since a universal approximator is able to \ufb01t any function, one might expect it to\nover\ufb01t very easily. Yet, it is commonly observed that deep networks generalize surprisingly well on\nthe test set. The explanation of this phenomenon is orthogonal to our paper, however, knowing the\nuniversal approximation capability is an important building block of such a theory. Moreover, the\nabove-mentioned \u201cover-parametrization\u201d implied by our results may play a role too.\n\nTo conclude, we have shown a universal approximation theorem for ResNet with one-neuron hidden\nlayers. This theoretically distinguishes them from fully connected networks. To some extent, our\nconstruction also theoretically motivates the current practice of going deeper and deeper in the ResNet\narchitecture.\n\nAcknowledgements\n\nWe would like to thank Jeffery Z. 
HaoChen for useful feedback and suggestions for this paper. This\nresearch was supported by The Defense Advanced Research Projects Agency (grant number YFA17\nN66001-17-1-4039). The views, opinions, and/or \ufb01ndings contained in this article are those of the\nauthor and should not be interpreted as representing the of\ufb01cial views or policies, either expressed or\nimplied, of the Defense Advanced Research Projects Agency or the Department of Defense.\n\nReferences\n\n[1] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration\n\nby overparameterization. arXiv:1802.06509, 2018.\n\n8\n\n\f[2] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE\n\nTransactions on Information theory, 39(3):930\u2013945, 1993.\n\n[3] H. Beise, S. D. Da Cruz, and U. Schroder. On decision regions of narrow deep neural networks.\n\narXiv:1807.01194, 2018.\n\n[4] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. Sgd learns over-parameterized\nnetworks that provably generalize on linearly separable data. In The International Conference\non Learning Representations (ICLR), 2018.\n\n[5] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of\nmultilayer networks. In The International Conference on Arti\ufb01cial Intelligence and Statistics\n(AISTATS), 2015.\n\n[6] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor\n\nanalysis. In Conference on Learning Theory (COLT), 2016.\n\n[7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control,\n\nsignals and systems, 2(4):303\u2013314, 1989.\n\n[8] S. S. Du, J.D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer\n\ncnn: Don\u2019t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.\n\n[9] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. 
In Conference\n\non Learning Theory (COLT), 2016.\n\n[10] K. Funahashi. On the approximate realization of continuous mappings by neural networks.\n\nNeural networks, 2(3):183\u2013192, 1989.\n\n[11] B. Hanin and M. Sellke. Approximating continuous functions by relu nets of minimal width.\n\narXiv:1710.11278, 2017.\n\n[12] M. Hardt and T. Ma. Identity matters in deep learning. In The International Conference on\n\nLearning Representations (ICLR), 2017.\n\n[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE\n\nconference on computer vision and pattern recognition (CVPR), 2016.\n\n[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual\n\nnetworks. In European Conference on Computer Vision (ECCV), 2016.\n\n[15] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal\n\napproximators. Neural networks, 2(5):359\u2013366, 1989.\n\n[16] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information\n\nProcessing Systems (NIPS), 2016.\n\n[17] A. Krizhevsky, I. Sutskever, and G.E. Hinton. Imagenet classi\ufb01cation with deep convolutional\nneural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097\u2013\n1105, 2012.\n\n[18] V. Kurkov\u00e1. Kolmogorov\u2019s theorem and multilayer neural networks. Neural networks, 5(3):\n\n501\u2013506, 1992.\n\n[19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document\n\nrecognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[20] S. Liang and R. Srikant. Why deep neural networks for function approximation? In The\n\nInternational Conference on Learning Representations (ICLR), 2017.\n\n[21] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view\n\nfrom the width. In Advances in Neural Information Processing Systems (NIPS), 2017.\n\n[22] H. N. Mhaskar. 
Neural networks for optimal approximation of smooth and analytic functions.\n\nNeural computation, 8(1):164\u2013177, 1996.\n\n9\n\n\f[23] H. N. Mhaskar and T. Poggio. Deep vs. shallow networks: An approximation theory perspective.\n\nAnalysis and Applications, 14(06):829\u2013848, 2016.\n\n[24] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. In Proceedings of\n\nthe International Conferences on Machine Learning (ICML), 2017.\n\n[25] D. Rolnick and M. Tegmark. The power of deeper networks for expressing natural functions. In\n\nThe International Conference on Learning Representations (ICLR), 2018.\n\n[26] S. Shalev-Shwartz, O. Shamir, and S. Shammah. Weight sharing is crucial to succesful opti-\n\nmization. arXiv:1706.00687, 2017.\n\n[27] O. Shamir. Are resnets provably better than linear predictors? arXiv:1804.06739, 2018.\n\n[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image\n\nrecognition. In The International Conference on Learning Representations (ICLR), 2015.\n\n[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple\nway to prevent neural networks from over\ufb01tting. The Journal of Machine Learning Research\n(JMLR), 15(1):1929\u20131958, 2014.\n\n[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,\nA. Rabinovich, J. Rick Chang, et al. Going deeper with convolutions. In IEEE conference on\ncomputer vision and pattern recognition (CVPR), 2015.\n\n[31] L. Szymanski and B. McCane. Deep networks are effective encoders of periodicity. IEEE\n\ntransactions on neural networks and learning systems, 25(10):1816\u20131827, 2014.\n\n[32] M. Telgarsky. Bene\ufb01ts of depth in neural networks. In Conference on Learning Theory (COLT),\n\n2016.\n\n[33] D. Yarotsky. Error bounds for approximations with deep relu networks. Neural Networks, 94:\n\n103\u2013114, 2017.\n\n[34] D. Yarotsky. 
Optimal approximation of continuous functions by very deep relu networks. arXiv\n\npreprint arXiv:1802.03620, 2018.\n\n[35] C. Yun, S. Sra, and A. Jadbabaie. Global optimality conditions for deep neural networks. In\n\nThe International Conference on Learning Representations (ICLR), 2018.\n\n[36] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires\nrethinking generalization. In The International Conference on Learning Representations (ICLR),\n2016.\n\n10\n\n\f", "award": [], "sourceid": 3035, "authors": [{"given_name": "Hongzhou", "family_name": "Lin", "institution": "MIT"}, {"given_name": "Stefanie", "family_name": "Jegelka", "institution": "MIT"}]}