{"title": "Lower Bounds on the Complexity of Approximating Continuous Functions by Sigmoidal Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 328, "page_last": 334, "abstract": null, "full_text": "Lower Bounds on the Complexity of \n\nApproximating Continuous Functions by \n\nSigmoidal Neural Networks \n\nMichael Schmitt \n\nLehrstuhl Mathematik und Informatik \n\nFakuWit ftir Mathematik \nRuhr-Universitat Bochum \nD-44780 Bochum, Germany \n\nmschmitt@lmi.ruhr-uni-bochum.de \n\nAbstract \n\nWe calculate lower bounds on the size of sigmoidal neural networks \nthat approximate continuous functions. \nIn particular, we show \nthat for the approximation of polynomials the network size has \nto grow as O((logk)1/4) where k is the degree of the polynomials. \nThis bound is valid for any input dimension, i.e. independently of \nthe number of variables. The result is obtained by introducing a \nnew method employing upper bounds on the Vapnik-Chervonenkis \ndimension for proving lower bounds on the size of networks that \napproximate continuous functions. \n\n1 \n\nIntroduction \n\nSigmoidal neural networks are known to be universal approximators. This is one of \nthe theoretical results most frequently cited to justify the use of sigmoidal neural \nnetworks in applications. By this statement one refers to the fact that sigmoidal \nneural networks have been shown to be able to approximate any continuous function \narbitrarily well. Numerous results in the literature have established variants of \nthis universal approximation property by considering distinct function classes to be \napproximated by network architectures using different types of neural activation \nfunctions with respect to various approximation criteria, see for instance [1, 2, 3, 5, \n6, 11, 12, 14, 15]. (See in particular Scarselli and Tsoi [15] for a recent survey and \nfurther references.) 
\n\nAll these results and many others not referenced here, some of them being constructive, some being merely existence proofs, provide upper bounds for the network size, asserting that good approximation is possible if there are sufficiently many network nodes available. This, however, is only a partial answer to the question that mainly arises in practical applications: \"Given some function, how many network nodes are needed to approximate it?\" Not much attention has been focused on establishing lower bounds on the network size and, in particular, for the approximation of functions over the reals. As far as the computation of binary-valued functions by sigmoidal networks is concerned (where the output value of a network is thresholded to yield 0 or 1) there are a few results in this direction. For a specific Boolean function Koiran [9] showed that networks using the standard sigmoid σ(y) = 1/(1 + e^(-y)) as activation function must have size Ω(n^(1/4)) where n is the number of inputs. (When measuring network size we do not count the input nodes here and in what follows.) Maass [13] established a larger lower bound by constructing a binary-valued function over R^n and showing that standard sigmoidal networks require Ω(n) many network nodes for computing this function. The first work on the complexity of sigmoidal networks for approximating continuous functions is due to DasGupta and Schnitger [4]. They showed that the standard sigmoid in network nodes can be replaced by other types of activation functions without increasing the size of the network by more than a polynomial. This yields indirect lower bounds for the size of sigmoidal networks in terms of other network types. 
DasGupta and Schnitger [4] also claimed the size bound λ^(Ω(1/d)) for sigmoidal networks with d layers approximating the function sin(λx). \n\n
In this paper we consider the problem of using the standard sigmoid σ(y) = 1/(1 + e^(-y)) in neural networks for the approximation of polynomials. We show that at least Ω((log k)^(1/4)) network nodes are required to approximate polynomials of degree k with small error in the l_∞ norm. This bound is valid for arbitrary input dimension, i.e., it does not depend on the number of variables. (Lower bounds can also be obtained from the results on binary-valued functions mentioned above by interpolating the corresponding functions by polynomials. This, however, requires growing input dimension and does not yield a lower bound in terms of the degree.) Further, the bound established here holds for networks of any number of layers. As far as we know this is the first lower bound result for the approximation of polynomials. From the computational point of view this is a very simple class of functions; they can be computed using the basic operations addition and multiplication only. Polynomials also play an important role in approximation theory since they are dense in the class of continuous functions and some approximation results for neural networks rely on the approximability of polynomials by sigmoidal networks (see, e.g., [2, 15]). \n\n
We obtain the result by introducing a new method that employs upper bounds on the Vapnik-Chervonenkis dimension of neural networks to establish lower bounds on the network size. The first use of the Vapnik-Chervonenkis dimension to obtain a lower bound is due to Koiran [9] who calculated the above-mentioned bound on the size of sigmoidal networks for a Boolean function. Koiran's method was further developed and extended by Maass [13] using a similar argument but another combinatorial dimension. 
Both papers derived lower bounds for the computation of binary-valued functions (Koiran [9] for inputs from {0, 1}^n, Maass [13] for inputs from R^n). Here, we present a new technique to show that and how lower bounds can be obtained for networks that approximate continuous functions. It rests on two fundamental results about the Vapnik-Chervonenkis dimension of neural networks. On the one hand, we use constructions provided by Koiran and Sontag [10] to build networks that have large Vapnik-Chervonenkis dimension and consist of gates that compute certain arithmetic functions. On the other hand, we follow the lines of reasoning of Karpinski and Macintyre [7] to derive an upper bound for the Vapnik-Chervonenkis dimension of these networks from the estimates of Khovanskii [8] and a result due to Warren [16]. \n\n
In the following section we give the definitions of sigmoidal networks and the Vapnik-Chervonenkis dimension. Then we present the lower bound result for function approximation. Finally, we conclude with some discussion and open questions. \n\n
2 Sigmoidal Neural Networks and VC Dimension \n\n
We briefly recall the definitions of a sigmoidal neural network and the Vapnik-Chervonenkis dimension (see, e.g., [7, 10]). We consider feedforward neural networks which have a certain number of input nodes and one output node. The nodes which are not input nodes are called computation nodes, and associated with each of them is a real number t, the threshold. Further, each edge is labelled with a real number w called weight. Computation in the network takes place as follows: The input values are assigned to the input nodes. Each computation node applies the standard sigmoid σ(y) = 1/(1 + e^(-y)) to the sum w_1x_1 + ... + w_rx_r − t, where x_1, ..., x_r are the values computed by the node's predecessors, w_1, ..., w_r are the weights of the corresponding edges, and t is the threshold. The output value of the network is defined to be the value computed by the output node. As is common for approximation results by means of neural networks, we assume that the output node is a linear gate, i.e., it just outputs the sum w_1x_1 + ... + w_rx_r − t. (Clearly, for computing functions on finite sets with output range [0, 1] the output node may apply the standard sigmoid as well.) Since σ is the only sigmoidal function that we consider here we will refer to such networks as sigmoidal neural networks. (Sigmoidal functions in general need to satisfy much weaker assumptions than σ does.) The definition naturally generalizes to networks employing other types of gates that we will make use of (e.g. linear, multiplication, and division gates). \n\n
The Vapnik-Chervonenkis dimension is a combinatorial dimension of a function class and is defined as follows: A dichotomy of a set S ⊆ R^n is a partition of S into two disjoint subsets (S_0, S_1) such that S_0 ∪ S_1 = S. Given a set F of functions mapping R^n to {0, 1} and a dichotomy (S_0, S_1) of S, we say that F induces the dichotomy (S_0, S_1) on S if there is some f ∈ F such that f(S_0) ⊆ {0} and f(S_1) ⊆ {1}. We say further that F shatters S if F induces all dichotomies on S. The Vapnik-Chervonenkis (VC) dimension of F, denoted VCdim(F), is defined as the largest number m such that there is a set of m elements that is shattered by F. We refer to the VC dimension of a neural network, which is given in terms of a \"feedforward architecture\", i.e. a directed acyclic graph, as the VC dimension of the class of functions obtained by assigning real numbers to all its programmable parameters, which are in general the weights and thresholds of the network or a subset thereof. 
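To make the computation model concrete, here is a minimal Python sketch (an illustration, not from the paper) of a feedforward network in this sense: sigmoidal computation nodes apply σ to a weighted sum minus a threshold, and a linear output gate combines their values. The architecture and all weights below are arbitrary illustrative values.

```python
import math

def sigmoid(y):
    # standard sigmoid σ(y) = 1/(1 + e^(-y))
    return 1.0 / (1.0 + math.exp(-y))

def node(inputs, weights, t, activation=sigmoid):
    # one computation node: activation applied to w_1 x_1 + ... + w_r x_r - t
    s = sum(w * x for w, x in zip(weights, inputs)) - t
    return activation(s)

def two_layer_net(x, hidden, out_w, out_t):
    # 'hidden' is a list of (weights, threshold) pairs for sigmoidal nodes;
    # the output node is a linear gate, as assumed for approximation results
    h = [node(x, w, t) for (w, t) in hidden]
    return sum(wo * ho for wo, ho in zip(out_w, h)) - out_t

# example: one input, two sigmoidal hidden nodes, linear output
y = two_layer_net([0.5], hidden=[([1.0], 0.0), ([-2.0], 1.0)],
                  out_w=[3.0, -1.0], out_t=0.0)
```

Thresholding such an output at 1/2, as described below, turns the real-valued network into the {0, 1}-valued function class whose VC dimension is studied here.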
\nFurther, we assume that the output value of the network is thresholded at 1/2 to obtain binary values. \n\n
3 Lower Bounds on Network Size \n\n
Before we present the lower bound on the size of sigmoidal networks required for the approximation of polynomials we first give a brief outline of the proof idea. We will define a sequence of univariate polynomials (p_n)_{n≥1} by means of which we show how to construct neural architectures N_n consisting of various types of gates such as linear, multiplication, and division gates, and, in particular, gates that compute some of the polynomials. Further, this architecture has a single weight as programmable parameter (all other weights and thresholds are fixed). We then demonstrate that, assuming the gates computing the polynomials can be approximated by sigmoidal neural networks sufficiently well, the architecture N_n can shatter a certain set by assigning suitable values to its programmable weight. The final step is to reason along the lines of Karpinski and Macintyre [7] to obtain via Khovanskii's estimates [8] and Warren's result [16] an upper bound on the VC dimension of N_n in terms of the number of its computation nodes. (Note that we cannot directly apply Theorem 7 of [7] since it does not deal with division gates.) Comparing this bound with the cardinality of the shattered set we will then be able \n\n
[Figure 1 diagram] \n\n
Figure 1: The network N_n with values k, j, i, 1 assigned to the input nodes x_1, x_2, x_3, x_4, respectively. The weight w is the only programmable parameter of the network. 
\n\nto conclude with a lower bound on the number of computation nodes in N_n and thus in the networks that approximate the polynomials. \n\n
Let the sequence (p_n)_{n≥1} of polynomials over R be inductively defined by \n\n
p_n(x) = 4x(1 − x) for n = 1, and p_n(x) = p_1(p_{n−1}(x)) for n ≥ 2. \n\n
Clearly, this uniquely defines p_n for every n ≥ 1 and it can readily be seen that p_n has degree 2^n. The main lower bound result is made precise in the following statement. \n\n
Theorem 1. Sigmoidal neural networks that approximate the polynomials (p_n)_{n≥1} on the interval [0, 1] with error at most O(2^(−n)) in the l_∞ norm must have at least Ω(n^(1/4)) computation nodes. \n\n
Proof. For each n a neural architecture N_n can be constructed as follows: The network has four input nodes x_1, x_2, x_3, x_4. Figure 1 shows the network with input values assigned to the input nodes in the order x_4 = 1, x_3 = i, x_2 = j, x_1 = k. There is one weight which we consider as the (only) programmable parameter of N_n. It is associated with the edge outgoing from input node x_4 and is denoted by w. The computation nodes are partitioned into six levels as indicated by the boxes in Figure 1. Each level is itself a network. Let us first assume, for the sake of simplicity, that all computations over real numbers are exact. There are three levels labeled with Π, having n + 1 input nodes and one output node each, that compute so-called projections π: R^(n+1) → R where π(y_1, ..., y_n, a) = y_a for a ∈ {1, ..., n}. The levels labeled P_3, P_2, P_1 have one input node and n output nodes each. Level P_3 receives the constant 1 as input and thus the value w which is the parameter of the network. We define the output values of level P_λ for λ = 3, 2, 1 by \n\n
w_b^(λ) = p_{b·n^(λ−1)}(v), b = 1, ..., n, \n\n
where v denotes the input value to level P_λ. This value is equal to w for λ = 3 and π(w_1^(λ+1), ..., w_n^(λ+1), x_{λ+1}) otherwise. We observe that w_{b+1}^(λ) can be calculated from w_b^(λ) as p_{n^(λ−1)}(w_b^(λ)). Therefore, the computations of level P_λ can be implemented using n gates each of them computing the function p_{n^(λ−1)}. \n\n
We show now that N_n can shatter a set of cardinality n^3. Let S = {1, ..., n}^3. It has been shown in Lemma 2 of [10] that for each (β_1, ..., β_r) ∈ {0, 1}^r there exists some w ∈ [0, 1] such that for q = 1, ..., r \n\n
p_q(w) ∈ [0, 1/2) if β_q = 0, and p_q(w) ∈ (1/2, 1] if β_q = 1. \n\n
This implies that, for each dichotomy (S_0, S_1) of S there is some w ∈ [0, 1] such that for every (i, j, k) ∈ S \n\n
p_k(p_{j·n}(p_{i·n²}(w))) < 1/2 if (i, j, k) ∈ S_0, and p_k(p_{j·n}(p_{i·n²}(w))) > 1/2 if (i, j, k) ∈ S_1. \n\n
Note that p_k(p_{j·n}(p_{i·n²}(w))) is the value computed by N_n given input values k, j, i, 1. Therefore, choosing a suitable value for w, which is the parameter of N_n, the network can induce any dichotomy on S. In other words, S is shattered by N_n. \n\n
It has been shown in Lemma 1 of [10] that there is an architecture A_n such that for each ε > 0 weights can be chosen for A_n such that the function f_{n,ε} computed by this network satisfies lim_{ε→0} f_{n,ε}(y_1, ..., y_n, a) = y_a. Moreover, this architecture consists of O(n) computation nodes, which are linear, multiplication, and division gates. (Note that the size of A_n does not depend on ε.) Therefore, choosing ε sufficiently small, we can implement the projections π in N_n by networks of O(n) computation nodes such that the resulting network N_n' still shatters S. Now in N_n' we have O(n) computation nodes for implementing the three levels labeled Π and we have in each level P_λ a number of O(n) computation nodes for computing p_{n^(λ−1)}, respectively. 
Assume now that the computation nodes for p_{n^(λ−1)} can be replaced by sigmoidal networks such that on inputs from S and with the parameter values defined above the resulting network N_n'' computes the same functions as N_n'. (Note that the computation nodes for p_{n^(λ−1)} have no programmable parameters.) \n\n
We estimate the size of N_n''. According to Theorem 7 of Karpinski and Macintyre [7] a sigmoidal neural network with l programmable parameters and m computation nodes has VC dimension O((ml)^2). We have to generalize this result slightly before being able to apply it. It can readily be seen from the proof of Theorem 7 in [7] that the result also holds if the network additionally contains linear and multiplication gates. For division gates we can derive the same bound taking into account that for a gate computing division, say x/y, we can introduce a defining equality x = z·y where z is a new variable. (See [7] for how to proceed.) Thus, we have that a network with l programmable parameters and m computation nodes, which are linear, multiplication, division, and sigmoidal gates, has VC dimension O((ml)^2). In particular, if m is the number of computation nodes of N_n'', the VC dimension is O(m^2). On the other hand, as we have shown above, N_n'' can shatter a set of cardinality n^3. Since there are O(n) sigmoidal networks in N_n'' computing the functions p_{n^(λ−1)}, and since the number of linear, multiplication, and division gates is bounded by O(n), for some value of λ a single network computing p_{n^(λ−1)} must have size at least Ω(√n). This yields a lower bound of Ω(n^(1/4)) for the size of a sigmoidal network computing p_n. \n\n
Thus far, we have assumed that the polynomials p_n are computed exactly. 
Since polynomials are continuous functions and since we require them to be calculated only on a finite set of input values (those resulting from S and from the parameter values chosen for w to shatter S) an approximation of these polynomials is sufficient. A straightforward analysis, based on the fact that the output value of the network has a \"tolerance\" close to 1/2, shows that if p_n is approximated with error O(2^(−n)) in the l_∞ norm, the resulting network still shatters the set S. This completes the proof of the theorem. □ \n\n
The statement of the previous theorem is restricted to the approximation of polynomials on the input domain [0, 1]. However, the result immediately generalizes to any arbitrary interval in R. Moreover, it remains valid for multivariate polynomials of arbitrary input dimension. \n\n
Corollary 2. The approximation of polynomials of degree k by sigmoidal neural networks with approximation error O(1/k) in the l_∞ norm requires networks of size Ω((log k)^(1/4)). This holds for polynomials over any number of variables. \n\n
4 Conclusions and Open Questions \n\n
We have established lower bounds on the size of sigmoidal networks for the approximation of continuous functions. In particular, for a concrete class of polynomials we have calculated a lower bound in terms of the degree of the polynomials. The main result already holds for the approximation of univariate polynomials. Intuitively, approximation of multivariate polynomials seems to become harder when the dimension increases. Therefore, it would be interesting to have lower bounds both in terms of the degree and the input dimension. \n\n
Further, in our result the approximation error and the degree are coupled. Naturally, one would expect that the number of nodes has to grow for each fixed function when the error decreases. 
At present we do not know of any such lower bound. \n\n
We have not aimed at calculating the constants in the bounds. For practical applications such values are indispensable. Refining our method and using tighter results it should be straightforward to obtain such numbers. Further, we expect that better lower bounds can be obtained by considering networks of restricted depth. \n\n
To establish the result we have introduced a new method for deriving lower bounds on network sizes. One of the main arguments is to use the functions to be approximated to construct networks with large VC dimension. The method seems suitable to obtain bounds also for the approximation of other types of functions as long as they are computationally powerful enough. \n\n
Moreover, the method could be adapted to obtain lower bounds also for networks using other activation functions (e.g. more general sigmoidal functions, ridge functions, radial basis functions). This may lead to new separation results for the approximation capabilities of different types of neural networks. In order for this to be accomplished, however, an essential requirement is that small upper bounds can be calculated for the VC dimension of such networks. \n\n
Acknowledgments \n\n
I thank Hans U. Simon for helpful discussions. This work was supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150. \n\n
References \n\n
[1] A. Barron. Universal approximation bounds for superposition of a sigmoidal function. IEEE Transactions on Information Theory, 39:930-945, 1993. \n\n
[2] C. K. Chui and X. Li. Approximation by ridge functions and neural networks with one hidden layer. Journal of Approximation Theory, 70:131-141, 1992. \n\n
[3] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989. \n\n
[4] B. 
DasGupta and G. Schnitger. The power of approximating: A comparison of activation functions. In C. L. Giles, S. J. Hanson, and J. D. Cowan, editors, Advances in Neural Information Processing Systems 5, pages 615-622. Morgan Kaufmann, San Mateo, CA, 1993. \n\n
[5] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991. \n\n
[6] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989. \n\n
[7] M. Karpinski and A. Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54:169-176, 1997. \n\n
[8] A. G. Khovanskii. Fewnomials, volume 88 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1991. \n\n
[9] P. Koiran. VC dimension in circuit complexity. In Proceedings of the 11th Annual IEEE Conference on Computational Complexity CCC'96, pages 81-85. IEEE Computer Society Press, Los Alamitos, CA, 1996. \n\n
[10] P. Koiran and E. D. Sontag. Neural networks with quadratic VC dimension. Journal of Computer and System Sciences, 54:190-198, 1997. \n\n
[11] V. Y. Kreinovich. Arbitrary nonlinearity is sufficient to represent all functions by neural networks: A theorem. Neural Networks, 4:381-383, 1991. \n\n
[12] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6:861-867, 1993. \n\n
[13] W. Maass. Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons. In M. Mozer, M. I. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 211-217. MIT Press, Cambridge, MA, 1997. \n\n
[14] H. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. 
Neural Computation, 8:164-177, 1996. \n\n
[15] F. Scarselli and A. C. Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods and some new results. Neural Networks, 11:15-37, 1998. \n\n
[16] H. E. Warren. Lower bounds for approximation by nonlinear manifolds. Transactions of the American Mathematical Society, 133:167-178, 1968. \n", "award": [], "sourceid": 1692, "authors": [{"given_name": "Michael", "family_name": "Schmitt", "institution": null}]}