{"title": "How to Choose an Activation Function", "book": "Advances in Neural Information Processing Systems", "page_first": 319, "page_last": 326, "abstract": null, "full_text": "How to Choose an Activation Function\n\nH. N. Mhaskar\nDepartment of Mathematics\nCalifornia State University\nLos Angeles, CA 90032\nhmhaska@calstatela.edu\n\nC. A. Micchelli\nIBM Watson Research Center\nP. O. Box 218\nYorktown Heights, NY 10598\ncam@watson.ibm.com\n\nAbstract\n\nWe study the complexity problem in artificial feedforward neural networks designed to approximate real valued functions of several real variables; i.e., we estimate the number of neurons in a network required to ensure a given degree of approximation to every function in a given function class. We indicate how to construct networks with the indicated number of neurons evaluating standard activation functions. Our general theorem shows that the smoother the activation function, the better the rate of approximation.\n\n1 INTRODUCTION\n\nThe approximation capabilities of feedforward neural networks with a single hidden layer have been studied by many authors, e.g., [1, 2, 5]. In [10], we have shown that such a network using practically any nonlinear activation function can approximate any continuous function of any number of real variables on any compact set to any desired degree of accuracy.\n\nA central question in this theory is the following. If one needs to approximate a function from a known class of functions to a prescribed accuracy, how many neurons will be necessary to accomplish this approximation for all functions in the class? For example, Barron shows in [1] that it is possible to approximate any function satisfying certain conditions on its Fourier transform within an $L^2$ error of $O(1/n)$ using a feedforward neural network with one hidden layer comprising $n^2$ neurons, each with a sigmoidal activation function. 
In contrast, if one is interested in a class of functions of $s$ variables with a bounded gradient on $[-1,1]^s$, then in order to accomplish this order of approximation, it is necessary to use at least $O(n^s)$ neurons, regardless of the activation function (cf. [3]).\n\nIn this paper, our main interest is to consider the problem of approximating a function which is known only to have a certain number of smooth derivatives. We investigate the question of deciding which activation function will require how many neurons to achieve a given order of approximation for all such functions. We will describe a very general theorem and explain how to construct networks with various activation functions, such as the Gaussian and other radial basis functions advocated by Girosi and Poggio [13] as well as the classical squashing function and other sigmoidal functions.\n\nIn the next section, we develop some notation and briefly review some known facts about approximation order with a sigmoidal type activation function. In Section 3, we discuss our general theorem. This theorem is applied in Section 4 to yield the approximation bounds for various special functions which are commonly in use. In Section 5, we briefly describe certain dimension independent bounds similar to those due to Barron [1], but applicable with a general activation function. Section 6 summarizes our results.\n\n2 SIGMOIDAL-TYPE ACTIVATION FUNCTIONS\n\nIn this section, we develop some notation and review certain known facts. For the sake of concreteness, we consider only uniform approximation, but our results are valid also for other $L^p$-norms with minor modifications, if any. Let $s \\ge 1$ be the number of input variables. The class of all continuous functions on $[-1,1]^s$ will be denoted by $C^s$. The class of all $2\\pi$-periodic continuous functions will be denoted by $C^{s*}$. 
The uniform norm in either case will be denoted by $\\|\\cdot\\|$. Let $\\Pi_{n,l,s,\\sigma}$ denote the set of all possible outputs of feedforward neural networks consisting of $n$ neurons arranged in $l$ hidden layers and each neuron evaluating an activation function $\\sigma$ where the inputs to the network are from $\\mathbb{R}^s$. It is customary to assume more a priori knowledge about the target function than the fact that it belongs to $C^s$ or $C^{s*}$. For example, one may assume that it has continuous derivatives of order $r \\ge 1$ and the sum of the norms of all the partial derivatives up to (and including) order $r$ is bounded. Since we are interested mainly in the relative error in approximation, we may assume that the target function is normalized so that this sum of the norms is bounded above by 1. The class of all the functions satisfying this condition will be denoted by $W_r^s$ (or $W_r^{s*}$ if the functions are periodic). In this paper, we are interested in the universal approximation of the classes $W_r^s$ (and their periodic versions). Specifically, we are interested in estimating the quantity\n\n$$\\sup_{f \\in W_r^s} E_{n,l,s,\\sigma}(f), \\tag{2.1}$$\n\nwhere\n\n$$E_{n,l,s,\\sigma}(f) := \\inf_{P \\in \\Pi_{n,l,s,\\sigma}} \\|f - P\\|. \\tag{2.2}$$\n\nThe quantity $E_{n,l,s,\\sigma}(f)$ measures the theoretically possible best order of approximation of an individual function $f$ by networks with $n$ neurons. We are interested in determining the order that such a network can possibly achieve for all functions in the given class. An equivalent dual formulation is to estimate\n\n$$E_{n,l,s,\\sigma}(W_r^s) := \\min\\{m \\in \\mathbb{Z} : \\sup_{f \\in W_r^s} E_{m,l,s,\\sigma}(f) \\le 1/n\\}. \\tag{2.3}$$\n\nThis quantity measures the minimum number of neurons required to obtain accuracy of $1/n$ for all functions in the class $W_r^s$. An analogous definition is assumed for $W_r^{s*}$ in place of $W_r^s$. 
\n\nLet $\\mathbb{H}_n^s$ denote the class of all $s$-variable trigonometric polynomials of order at most $n$ and, for a continuous function $f$ that is $2\\pi$-periodic in each of its $s$ variables, let\n\n$$E_n^s(f) := \\min_{P \\in \\mathbb{H}_n^s} \\|f - P\\|. \\tag{2.4}$$\n\nWe observe that $\\mathbb{H}_n^s$ can be thought of as a subclass of all outputs of networks with a single hidden layer comprising at most $(2n+1)^s$ neurons, each evaluating the activation function $\\sin x$. It is then well known that\n\n$$\\sup_{f \\in W_r^{s*}} E_n^s(f) \\le c n^{-r}. \\tag{2.5}$$\n\nHere and in the sequel, $c, c_1, \\ldots$ will denote positive constants independent of the functions and the number of neurons involved, but generally dependent on the other parameters of the problem such as $r$, $s$ and $\\sigma$. Moreover, several constructions for the approximating trigonometric polynomials involved in (2.5) are also well known. In the dual formulation, (2.5) states that if $\\sigma(x) := \\sin x$ then\n\n$$E_{n,1,s,\\sigma}(W_r^{s*}) = O(n^{s/r}). \\tag{2.6}$$\n\nIt can be proved [3] that any \"reasonable\" approximation process that aims to approximate all functions in $W_r^{s*}$ up to an order of accuracy $1/n$ must necessarily depend upon at least $O(n^{s/r})$ parameters. Thus, the activation function $\\sin x$ provides optimal convergence rates for the class $W_r^{s*}$.\n\nThe problem of approximating an $r$ times continuously differentiable function $f : \\mathbb{R}^s \\to \\mathbb{R}$ on $[-1,1]^s$ can be reduced to that of approximating another function from the corresponding periodic class as follows. We take an infinitely differentiable function $\\psi$ which is equal to 1 on $[-2,2]^s$ and 0 outside of $[-\\pi,\\pi]^s$. The function $f\\psi$ can then be extended as a $2\\pi$-periodic function. This function is $r$ times continuously differentiable and its derivatives can be bounded by the derivatives of $f$ using the Leibniz formula. A function that approximates this $2\\pi$-periodic function also approximates $f$ on $[-1,1]^s$ with the same order of approximation. In contrast, it is not customary to choose the activation function to be periodic. 
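\n\nAs a rough check on how a Jackson-type estimate translates into a neuron count, the following sketch assumes (for illustration only, with a generic constant $c$) that trigonometric polynomials of order $m$ approximate every $f \\in W_r^{s*}$ to within $c m^{-r}$, and that such a polynomial is realized by at most $(2m+1)^s$ neurons evaluating $\\sin x$:\n\n$$c m^{-r} \\le \\frac{1}{n} \\iff m \\ge (cn)^{1/r}, \\qquad \\text{so that} \\qquad (2m+1)^s \\le \\big(2(cn)^{1/r} + 3\\big)^s = O(n^{s/r}) \\quad \\text{for } m = \\lceil (cn)^{1/r} \\rceil.$$\n\nUnder these assumptions, accuracy $1/n$ costs on the order of $n^{s/r}$ neurons, which matches the lower bound on the number of parameters quoted from [3]. 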
\n\nIn [10] we introduced the notion of a higher order sigmoidal function as follows. Let $k \\ge 0$. We say that a function $\\sigma : \\mathbb{R} \\to \\mathbb{R}$ is sigmoidal of order $k$ if\n\n$$\\lim_{x \\to \\infty} \\frac{\\sigma(x)}{x^k} = 1, \\qquad \\lim_{x \\to -\\infty} \\frac{\\sigma(x)}{x^k} = 0, \\tag{2.7}$$\n\nand\n\n$$|\\sigma(x)| \\le c(1 + |x|)^k, \\qquad x \\in \\mathbb{R}. \\tag{2.8}$$\n\nA sigmoidal function of order 0 is thus the customary bounded sigmoidal function.\n\nWe proved in [10] that for any integer $r \\ge 1$ and a sigmoidal function $\\sigma$ of order $r - 1$, we have\n\n$$E_{n,1,s,\\sigma}(W_r^s) = \\begin{cases} O(n^{1/r}), & \\text{if } s = 1, \\\\ O(n^{s/r + (s+2r)/r^2}), & \\text{if } s \\ge 2. \\end{cases} \\tag{2.9}$$\n\nSubsequently, Mhaskar showed in [6] that if $\\sigma$ is a sigmoidal function of order $k \\ge 2$ and $r \\ge 1$ then, with $l = O(\\log r / \\log k)$,\n\n$$E_{n,l,s,\\sigma}(W_r^s) = O(n^{s/r}). \\tag{2.10}$$\n\nThus, an optimal network can be constructed using a sigmoidal function of higher order. During the course of the proofs in [10] and [6], we actually constructed the networks explicitly. The various features of these constructions from the connectionist point of view are discussed in [7, 8, 9].\n\nIn this paper, we take a different viewpoint. We wish to determine which activation function leads to what approximation order. As remarked above, for the approximation of periodic functions, the periodic activation function $\\sin x$ provides an optimal network. Therefore, we will investigate the degree of approximation by neural networks first in terms of a general periodic activation function and then apply these results to the case when the activation function is not periodic.\n\n3 A GENERAL THEOREM\n\nIn this section, we discuss the degree of approximation of periodic functions using periodic activation functions. It is our objective to include the case of radial basis functions as well as the usual \"first order\" neural networks in our discussion. To encompass both of these cases, we discuss the following general formulation. Let $s \\ge d \\ge 1$ be integers and $\\phi \\in C^{d*}$. We will consider the approximation of functions in $C^{s*}$ 
by linear combinations of quantities of the form $\\phi(Ax + t)$ where $A$ is a $d \\times s$ matrix and $t \\in \\mathbb{R}^d$. (In general, both $A$ and $t$ are parameters of the network.) When $d = s$, $A$ is the identity matrix and $\\phi$ is a radial function, then a linear combination of $n$ such quantities represents the output of a radial basis function network with $n$ neurons. When $d = 1$ then we have the usual neural network with one hidden layer and periodic activation function $\\phi$.\n\nWe define the Fourier coefficients of $\\phi$ by the formula\n\n$$\\hat{\\phi}(m) := \\frac{1}{(2\\pi)^d} \\int_{[-\\pi,\\pi]^d} \\phi(t) e^{-i m \\cdot t} \\, dt, \\qquad m \\in \\mathbb{Z}^d. \\tag{3.1}$$\n\nLet\n\n$$S_\\phi \\subseteq \\{m \\in \\mathbb{Z}^d : \\hat{\\phi}(m) \\ne 0\\} \\tag{3.2}$$\n\nand assume that there is a set $J$ containing $d \\times s$ matrices with integer entries such that\n\n$$\\mathbb{Z}^s = \\{A^T k : k \\in S_\\phi, \\ A \\in J\\}, \\tag{3.3}$$\n\nwhere $A^T$ denotes the transpose of $A$. If $d = 1$ and $\\hat{\\phi}(1) \\ne 0$ (the neural network case) then we may choose $S_\\phi = \\{1\\}$ and $J$ to be $\\mathbb{Z}^s$ (considered as row vectors). If $d = s$ and $\\phi$ is a function with none of its Fourier coefficients equal to zero (the radial basis case) then we may choose $S_\\phi = \\mathbb{Z}^s$ and $J = \\{I_{s \\times s}\\}$. For $m \\in \\mathbb{Z}^s$, we let $k_m$ be the multi-integer with minimum magnitude such that $m = A^T k_m$ for some $A = A_m \\in J$. Our estimates will need the quantities\n\n$$m_n := \\min\\{|\\hat{\\phi}(k_m)| : -2n \\le m \\le 2n\\} \\tag{3.4}$$\n\nand\n\n$$N_n := \\max\\{|k_m| : -2n \\le m \\le 2n\\}, \\tag{3.5}$$\n\nwhere $|k_m|$ is the maximum absolute value of the components of $k_m$. In the neural network case, we have $m_n = |\\hat{\\phi}(1)|$ and $N_n = 1$. In the radial basis case, $N_n = 2n$. Our main theorem can be formulated as follows.\n\nTHEOREM 3.1. Let $s \\ge d \\ge 1$, $n \\ge 1$ and $N \\ge N_n$ be integers, $f \\in C^{s*}$, $\\phi \\in C^{d*}$. It is possible to construct a network\n\n$$G_{n,N,\\phi}(f; x) := \\sum_j d_j \\phi(A_j x + t_j) \\tag{3.6}$$\n\nsuch that\n\n(3.7)\n\nIn (3.6), the sum contains at most $O(n^s N^d)$ terms, $A_j \\in J$, $t_j \\in \\mathbb{R}^d$, and $d_j$ are linear functionals of $f$, depending upon $n$, $N$, $\\phi$. 
The formulas in [11] show that the network can be trained in a very simple manner, given the Fourier coefficients of the target function. The weights and thresholds (or the centers in the case of the radial basis networks) are determined universally for all functions being approximated. Only the coefficients at the output layer depend upon the function. Even these are given explicitly as linear combinations of the Fourier coefficients of the target function. The explicit formulas in [11] show that in the radial basis case, the operator $G_{n,N,\\phi}$ actually contains only $O((n + N)^s)$ summands.\n\n4 APPLICATIONS\n\nIn Section 3, we assumed that the activation function $\\phi$ is periodic. If the activation function $\\sigma$ is not periodic, but satisfies certain decay conditions near $\\infty$, it is still possible to construct a periodic function for which Theorem 3.1 can be applied. Suppose that there exists a function $\\psi$ in the linear span of $A_{\\sigma,J} := \\{\\sigma(Ax + t) : A \\in J, \\ t \\in \\mathbb{R}^d\\}$, which is integrable on $\\mathbb{R}^d$ and satisfies the condition that\n\n$$|\\psi(x)| \\le c(1 + |x|)^{-\\tau} \\quad \\text{for some } \\tau > d. \\tag{4.1}$$\n\nUnder this assumption, the function\n\n$$\\psi^\\circ(x) := \\sum_{k \\in \\mathbb{Z}^d} \\psi(x - 2\\pi k) \\tag{4.2}$$\n\nis a $2\\pi$-periodic function integrable on $[-\\pi,\\pi]^d$. We can then apply Theorem 3.1 with $\\psi^\\circ$ instead of $\\phi$. In $G_{n,N,\\psi^\\circ}$, we next replace $\\psi^\\circ$ by a function obtained by judiciously truncating the infinite sum in (4.2). The error made in this replacement can be estimated using (4.1). Knowing the number of evaluations of $\\sigma$ in the expression for $\\psi$ as a finite linear combination of elements of $A_{\\sigma,J}$, we then have an estimate on the degree of approximation of $f$ in terms of the number of evaluations of $\\sigma$. This process was applied to a number of functions $\\sigma$. The results are summarized in Table 1. 
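\n\nTo indicate how a decay condition controls the truncation error, here is a sketch under assumptions made only for illustration: the decay bound is taken in the form $|\\psi(x)| \\le c(1+|x|)^{-\\tau}$ with $\\tau > d$, $|\\cdot|$ is the maximum norm, and the periodizing sum $\\sum_{k \\in \\mathbb{Z}^d} \\psi(x - 2\\pi k)$ is truncated to $|k| \\le K$ with an integer $K \\ge 1$. For $x \\in [-\\pi,\\pi]^d$ and $|k| = j > K$ we have $|x - 2\\pi k| \\ge 2\\pi j - \\pi \\ge \\pi j$, and the number of $k \\in \\mathbb{Z}^d$ with $|k| = j$ is $(2j+1)^d - (2j-1)^d \\le 2d(3j)^{d-1}$, so\n\n$$\\sum_{|k| > K} |\\psi(x - 2\\pi k)| \\le c \\sum_{j > K} 2d(3j)^{d-1} (\\pi j)^{-\\tau} \\le c_1 \\sum_{j > K} j^{d-1-\\tau} \\le c_1 \\int_K^\\infty t^{d-1-\\tau} \\, dt = \\frac{c_1}{\\tau - d} K^{d-\\tau}.$$\n\nThus, under these assumptions, the tail tends to zero at the rate $K^{d-\\tau}$ uniformly in $x$; this is only one possible way to carry out the estimate. 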
\n\nTable 1: Order of magnitude of $E_{n,l,s,\\sigma}(W_r^s)$ for different $\\sigma$'s\n\nFunction $\\sigma$ | $E_{n,l,s,\\sigma}(W_r^s)$ | Remarks\nSigmoidal, order $r - 1$ | $n^{1/r}$ | $s = d = 1$, $l = 1$\nSigmoidal, order $r - 1$ | $n^{s/r + (s+2r)/r^2}$ | $s \\ge 2$, $d = 1$, $l = 1$\n$x^k$ if $x \\ge 0$, $0$ if $x < 0$ | $n^{s/r + (2r+s)/(2rk)}$ | $k \\ge 2$, $s \\ge 2$, $d = 1$, $l = 1$\n$(1 + e^{-x})^{-1}$ | $n^{s/r}(\\log n)^2$ | $s \\ge 2$, $d = 1$, $l = 1$\nSigmoidal, order $k$ | $n^{s/r}$ | $k \\ge 2$, $s \\ge 1$, $d = 1$, $l = O(\\log r / \\log k)$\n$\\exp(-|x|^2/2)$ | $n^{2s/r}$ | $s = d > 2$, $l = 1$\n$|x|^k (\\log |x|)^\\delta$ | $n^{(s/r)(2 + (3s+2r)/k)}$ | $s = d > 2$, $k > 0$, $k + s$ even, $\\delta = 0$ if $s$ odd, $1$ if $s$ even, $l = 1$\n\n5 DIMENSION INDEPENDENT BOUNDS\n\nIn this section, we describe certain estimates on the $L^2$ degree of approximation that are independent of the dimension of the input space. In this section, $\\|\\cdot\\|$ denotes the $L^2$ norm on $[-1,1]^s$ (respectively $[-\\pi,\\pi]^s$) and we approximate functions in the class $SF_s$ defined by\n\n$$SF_s := \\{f \\in C^{s*} : \\|f\\|_{SF,s} := \\sum_{m \\in \\mathbb{Z}^s} |\\hat{f}(m)| \\le 1\\}. \\tag{5.1}$$\n\nAnalogous to the degree of approximation from $\\mathbb{H}_n^s$, we define the $n$-th degree of approximation of a function $f \\in C^{s*}$ by the formula\n\n$$E_{n,s}(f) := \\inf_{\\Lambda \\subset \\mathbb{Z}^s, \\ |\\Lambda| \\le n} \\Big\\| f - \\sum_{m \\in \\Lambda} \\hat{f}(m) e^{i m \\cdot x} \\Big\\|, \\tag{5.2}$$\n\nwhere we require the norm involved to be the $L^2$ norm. In (5.2), there is no need to assume that $n$ is an integer.\n\nLet $\\phi$ be a square integrable $2\\pi$-periodic function of one variable. We define the $L^2$ degree of approximation by networks with a single hidden layer by the formula\n\n$$E_{n,s}^{\\phi}(f) := \\inf_{P \\in \\Pi_{m,1,s,\\phi}} \\|f - P\\|, \\tag{5.3}$$\n\nwhere $m$ is the largest integer not exceeding $n$. Our main theorem in this connection is the following.\n\nTHEOREM 5.1. Let $s \\ge 1$ be an integer, $f \\in SF_s$, $\\phi \\in L^{2*}$ and $\\hat{\\phi}(1) \\ne 0$. Then, for integers $n, N \\ge 1$,\n\n(5.4)\n\nwhere $\\{\\delta_n\\}$ is a sequence of positive numbers, $0 \\le \\delta_n \\le 2$, depending upon $f$ such that $\\delta_n \\to 0$ as $n \\to \\infty$. Moreover, the coefficients in the network that yields (5.4) are bounded, independent of $n$ and $N$.\n\nWe may apply Theorem 5.1 in the same way as Theorem 3.1. For the squashing activation function, this gives an order of approximation $O(n^{-1/2})$ with a network consisting of $n(\\log n)^2$ neurons arranged in one hidden layer. With the truncated power function $x_+^k$ (cf. Table 1, entry 3) as the activation function, the same order of approximation is obtained with a network with a single hidden layer and $O(n^{1+1/(2k)})$ neurons.\n\n6 CONCLUSIONS\n\nWe have obtained estimates on the number of neurons necessary for a network with a single hidden layer to provide a given accuracy for all functions under the only a priori assumption that the derivatives of the function up to a certain order should exist. We have proved a general theorem which enables us to estimate this number in terms of the growth and smoothness of the activation function. We have explicitly constructed networks which provide the desired accuracy with the indicated number of neurons.\n\nAcknowledgements\n\nThe research of H. N. Mhaskar was supported in part by AFOSR grant 2-26 113.\n\nReferences\n\n1. BARRON, A. R., Universal approximation bounds for superposition of a sigmoidal function, IEEE Trans. on Information Theory, 39.\n\n2. 
CYBENKO, G., Approximation by superposition of sigmoidal functions, Mathematics of Control, Signals and Systems, 2, #4 (1989), 303-314.\n\n3. DEVORE, R., HOWARD, R. AND MICCHELLI, C. A., Optimal nonlinear approximation, Manuscripta Mathematica, 63 (1989), 469-478.\n\n4. HECHT-NIELSEN, R., Theory of the backpropagation neural network, IEEE International Conference on Neural Networks, 1 (1988), 593-605.\n\n5. HORNIK, K., STINCHCOMBE, M. AND WHITE, H., Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989), 359-366.\n\n6. MHASKAR, H. N., Approximation properties of a multilayered feedforward artificial neural network, Advances in Computational Mathematics, 1 (1993), 61-80.\n\n7. MHASKAR, H. N., Neural networks for localized approximation of real functions, in \"Neural Networks for Signal Processing, III\", (Kamm, Huhn, Yoon, Chellappa and Kung Eds.), IEEE New York, 1993, pp. 190-196.\n\n8. MHASKAR, H. N., Approximation of real functions using neural networks, in Proc. of Int. Conf. on Advances in Comput. Math., New Delhi, India, 1993, World Sci. Publ., H. P. Dikshit, C. A. Micchelli eds., 1994.\n\n9. MHASKAR, H. N., Noniterative training algorithms for neural networks, Manuscript, 1993.\n\n10. MHASKAR, H. N. AND MICCHELLI, C. A., Approximation by superposition of a sigmoidal function and radial basis functions, Advances in Applied Mathematics, 13 (1992), 350-373.\n\n11. MHASKAR, H. N. AND MICCHELLI, C. A., Degree of approximation by superpositions of a fixed function, in preparation.\n\n12. MHASKAR, H. N. AND MICCHELLI, C. A., Dimension independent bounds on the degree of approximation by neural networks, Manuscript, 1993.\n\n13. POGGIO, T. AND GIROSI, F., Regularization algorithms for learning that are equivalent to multilayer networks, Science, 247 (1990), 978-982.\n", "award": [], "sourceid": 874, "authors": [{"given_name": "H. 
N.", "family_name": "Mhaskar", "institution": null}, {"given_name": "C. A..", "family_name": "Micchelli", "institution": null}]}