{"title": "The Power of Approximating: a Comparison of Activation Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 615, "page_last": 622, "abstract": "", "full_text": "The Power of Approximating: a \n\nComparison of Activation Functions \n\nBhaskar DasGupta \n\nDepartment of Computer Science \n\nUniversity of Minnesota \n\nMinneapolis, MN 55455-0159 \n\nemail: dasgupta~cs.umn.edu \n\nGeorg Schnitger \n\nDepartment of Computer Science \nThe Pennsylvania State University \n\nUniversity Park, PA 16802 \nemail: georg~cs.psu.edu \n\nAbstract \n\nWe compare activation functions in terms of the approximation \npower of their feedforward nets. We consider the case of analog as \nwell as boolean input. \n\n1 \n\nIntroduction \n\nWe consider efficient approximationsofa given multivariate function I: [-1, l]m-+ \nn by feedforward neural networks. We first introduce the notion of a feedforward \nnet. \n\nLet r be a class of real-valued functions, where each function is defined on some \nsubset of n. A r-net C is an unbounded fan-in circuit whose edges and vertices \nare labeled by real numbers. The real number assigned to an edge (resp. vertex) \nis called its weight (resp. its threshold). Moreover, to each vertex v an activation \nfunction IV E r is assigned. Finally, we assume that C has a single sink w. \nThe net C computes a function Ie : [-1,11 m --+ n as follows. The components \nof the input vector x = (Xl, . .. , x m ) E [-1, 11 m are assigned to the sources of C. \nLet Vl, \u2022\u2022\u2022 , Vn be the immediate predecessors of a vertex v. The input for v is then \nsv(x) = E~=l WiYi -tv, where Wi is the weight of the edge (Vi, V), tv is the threshold \nof v and Yi is the value assigned to Vi. If V is not the sink, then we assign the value \nIv (sv (x)) to v. Otherwise we assign Sv (x) to v. \nThen Ie = Sw is the function computed by C where W is the unique sink of C. 
\n\n615 \n\nDasGupta and Schnitger \n\nA great deal of work has been done showing that nets of two layers can approximate (in various norms) large function classes (including continuous functions) arbitrarily well (Arai, 1989; Carrol and Dickinson, 1989; Cybenko, 1989; Funahashi, 1989; Gallant and White, 1988; Hornik et al., 1989; Irie and Miyake, 1988; Lapedes and Farber, 1987; Hecht-Nielsen, 1989; Poggio and Girosi, 1989; Wei et al., 1991). Various activation functions have been used, among others the cosine squasher, the standard sigmoid, radial basis functions, generalized radial basis functions, polynomials, trigonometric polynomials and binary thresholds. Still, as we will see, these functions differ greatly in terms of their approximation power when we only consider efficient nets, i.e. nets with few layers and few vertices. \n\nOur goal is to compare activation functions in terms of efficiency and quality of approximation. We measure efficiency by the size of the net (i.e. the number of vertices, not counting input units) and by its number of layers. Another resource of interest is the Lipschitz-bound of the net, which is a measure of the numerical stability of the net. We say that net C has Lipschitz-bound L if all weights and thresholds of C are bounded in absolute value by L and for each vertex v of C and for all inputs x, y ∈ [-1,1]^m, \n\n|γ_v(s_v(x)) - γ_v(s_v(y))| ≤ L · |s_v(x) - s_v(y)|. \n\n(Thus we do not demand that the activation function γ_v has Lipschitz-bound L, but only that γ_v has Lipschitz-bound L for the inputs it receives.) We measure the quality of an approximation of function f by function f_C by the Chebychev norm, i.e. by the maximum distance between f and f_C over the input domain [-1,1]^m. \n\nLet Γ be a class of activation functions. We are particularly interested in the following two questions. 
\n\n\u2022 Given a function f : [-1,1]^m -> R, how well can we approximate f by a Γ-net with d layers, size s, and Lipschitz-bound L? Thus, we are particularly interested in the behavior of the approximation error e(s, d) as a function of size and number of layers. This set-up allows us to investigate how much the approximation error decreases with increased size and/or number of layers. \n\n\u2022 Given two classes of activation functions Γ_1 and Γ_2, when do Γ_1-nets and Γ_2-nets have essentially the same \"approximation power\" with respect to some error function e(s, d)? \n\nWe first formalize the notion of \"essentially the same approximation power\". \n\nDefinition 1.1 Let e : N^2 -> R^+ be a function. Γ_1 and Γ_2 are classes of activation functions. \n(a) We say that Γ_1 simulates Γ_2 with respect to e if and only if there is a constant k such that for all functions f : [-1,1]^m -> R with Lipschitz-bound 1/e(s, d): if f can be approximated by a Γ_2-net with d layers, size s, Lipschitz-bound 2^s and approximation error e(s, d), then f can also be approximated with error e(s, d) by a Γ_1-net with k(d + 1) layers, size (s + 1)^k and Lipschitz-bound 2^{s^k}. \n(b) We say that Γ_1 and Γ_2 are equivalent with respect to e if and only if Γ_2 simulates Γ_1 with respect to e and Γ_1 simulates Γ_2 with respect to e. \n\nIn other words, when comparing the approximation power of activation functions, we allow size to increase polynomially and the number of layers to increase by a constant factor, but we insist on at least the same approximation error. Observe that we have linked the approximation error e(s, d) and the Lipschitz-bound of the function to be approximated. The reason is that approximations of functions with high Lipschitz-bound \"tend\" to have an inversely proportional approximation error. 
\nMoreover, observe that the Lipschitz-bounds of the involved nets are allowed to be exponential in the size of the net. We will see in section 3 that for some activation functions far smaller Lipschitz-bounds suffice. \n\nBelow we discuss our results. In section 2 we consider the case of tight approximations, i.e. e(s, d) = 2^{-s}. Then in section 3 the more relaxed error model e(s, d) = s^{-d} is discussed. In section 4 we consider the computation of boolean functions and show that sigmoidal nets can be far more efficient than threshold-nets. \n\n2 Equivalence of Activation Functions for Error e(s, d) = 2^{-s} \n\nWe obtain the following result. \n\nTheorem 2.1 The following activation functions are equivalent with respect to error e(s, d) = 2^{-s}: \n\u2022 the standard sigmoid σ(x) = 1/(1 + exp(-x)), \n\u2022 any rational function which is not a polynomial, \n\u2022 any root x^α, provided α is not a natural number, \n\u2022 the logarithm (for any base b > 1), \n\u2022 the gaussian e^{-x^2}, \n\u2022 the radial basis functions (1 + x^2)^α, α < 1, α ≠ 0. \n\nNotable exceptions from the list of functions equivalent to the standard sigmoid are polynomials, trigonometric polynomials and splines. We do obtain an equivalence to the standard sigmoid by allowing splines of degree s as activation functions for nets of size s. (We will always assume that splines are continuous with a single knot only.) \n\nTheorem 2.2 Assume that e(s, d) = 2^{-s}. Then splines (of degree s for nets of size s) and the standard sigmoid are equivalent with respect to e(s, d). \n\nRemark 2.1 \n(a) Of course, the equivalence of spline-nets and {σ}-nets also holds for binary input. Since threshold-nets can add and multiply m m-bit numbers with constantly many layers and size polynomial in m (Reif, 1987), threshold-nets can efficiently approximate polynomials and splines. 
\n\nThus, we obtain that {σ}-nets with d layers, size s and Lipschitz-bound L can be simulated by nets of binary thresholds. The number of layers of the simulating threshold-net will increase by a constant factor and its size will increase by a polynomial in (s + n) log(L), where n is the number of input bits. (The inclusion of n accounts for the additional increase in size when approximately computing a weighted sum by a threshold-net.) \n(b) If we allow size to increase by a polynomial in s + n, then threshold-nets and {σ}-nets are actually equivalent with respect to error bound 2^{-s}. This follows, since a threshold function can easily be implemented by a sigmoidal gate (Maass et al., 1991). \n\nThus, if we allow size to increase polynomially (in s + n) and the number of layers to increase by a constant factor, then {σ}-nets with weights that are at most exponential (in s + n) can be simulated by {σ}-nets with weights of size polynomial in s. \n\n{σ}-nets and threshold-nets (respectively nets of linear splines) are not equivalent for analog input. The same applies to polynomials, even if we allow polynomials of degree s as activation function for nets of size s: \n\nTheorem 2.3 \n(a) Let sq(x) = x^2. If a net of linear splines (with d layers and size s) approximates sq(x) over the interval [-1,1], then its approximation error will be at least s^{-O(d)}. \n(b) Let abs(x) = |x|. If a polynomial net with d layers and size s approximates abs(x) over the interval [-1,1], then the approximation error will be at least s^{-O(d)}. \n\nWe will see in Theorem 2.5 that the standard sigmoid (and hence any activation function listed in Theorem 2.1) is capable of approximating sq(x) and abs(x) with error at most 2^{-s} by constant-layer nets of size polynomial in s. Hence the standard sigmoid is properly stronger than linear splines and polynomials. 
Finally, we show that sine and the standard sigmoid are inequivalent with respect to error 2^{-s}. \n\nTheorem 2.4 The function sin(λx) can be approximated by a {σ}-net C_λ with d layers, size s = λ^{O(1/d)} and error at most s^{-O(d)}. On the other hand, every {σ}-net with d layers which approximates sin(λx) with error at most 1/3 has to have size at least λ^{Ω(1/d)}. \n\nBelow we sketch the proof of Theorem 2.1. The proof itself will actually be more instructive than the statement of Theorem 2.1. In particular, we will obtain a general criterion that allows us to decide whether a given activation function (or class of activation functions) has at least the approximation power of splines. \n\n2.1 Activation Functions with the Approximation Power of Splines \n\nObviously, any activation function which can efficiently approximate polynomials and the binary threshold will be able to efficiently approximate splines. This follows since a spline can be approximated by the sum p + t · q with polynomials p and q and a binary threshold t. (Observe that we can approximate a product once we can approximately square: (x + y)^2/2 - x^2/2 - y^2/2 = x · y.) \n\nFirstly, we will see that any sufficiently smooth activation function is capable of approximating polynomials. \n\nDefinition 2.1 Let γ : R -> R be a function. We call γ suitable if and only if there exist real numbers α, β (α > 0) and an integer k such that \n(a) γ can be represented by the power series Σ_{i=0}^{∞} a_i (x - β)^i for all x ∈ [-α, α]. The coefficients are rationals of the form a_i = p_i/q_i with |p_i|, |q_i| ≤ 2^{i^k} (for i > 1). \n(b) For each i > 2 there exists j with i ≤ j ≤ i^k and a_j ≠ 0. \n\nProposition 2.1 Assume that γ is suitable with parameter k. 
\nThen, over the domain [-D, D], any degree n polynomial p can be approximated with error ε by a {γ}-net C_p. C_p has 2 layers and size O(n^{2k}); its weights are rational numbers whose numerator and denominator are bounded in absolute value by \n\np_max · (2 + D)^{poly(n)} · ||γ^{(n+1)}||_{[-α,α]} / ε. \n\nHere we have assumed that the coefficients of p are rational numbers with numerator and denominator bounded in absolute value by p_max. \n\nThus, in order to have at least the approximation power of splines, a suitable activation function has to be able to approximate the binary threshold. This is achieved by the following function class. \n\nDefinition 2.2 Let Γ be a class of activation functions and let g : [1, ∞) -> R be a function. \n(a) We say that g is fast converging if and only if \n\n|g(x) - g(x + ε)| = O(ε/x^2) for x ≥ 1, ε ≥ 0, \n\n0 < ∫_{1}^{∞} g(u^2) du < ∞ and |∫_{2^N}^{∞} g(u^2) du| = O(1/N) for all N ≥ 1. \n\n(b) We say that Γ is powerful if and only if at least one function in Γ is suitable and there is a fast converging function g which can be approximated for all s ≥ 1 (over the domain [-2^s, 2^s]) with error 2^{-s} by a Γ-net with a constant number of layers, size polynomial in s and Lipschitz-bound 2^s. \n\nFast convergence can be checked easily for differentiable functions by applying the mean value theorem. Examples are x^{-a} for a ≥ 1, exp(-x) and σ(-x). Moreover, it is not difficult to show that each function mentioned in Theorem 2.1 is powerful. Hence Theorem 2.1 is a corollary of \n\nTheorem 2.5 Assume that Γ is powerful. \n(a) Γ simulates splines with respect to error e(s, d) = 2^{-s}. \n(b) Assume that each activation function in Γ can be approximated (over the domain [-2^s, 2^s]) with error 2^{-s} by a spline-net N_s of size s and with constantly many layers. Then Γ is equivalent to splines. \n\nRemark 2.2 Obviously, 1/x is powerful. 
Therefore Theorem 2.5 implies that constant-layer {1/x}-nets of size s approximate abs(x) = |x| with error 2^{-s}. The degree of the resulting rational function will be polynomial in s. Thus Theorem 2.5 generalizes Newman's approximation of the absolute value by rational functions (Newman, 1964). \n\n3 Equivalence of Activation Functions for Error s^{-d} \n\nThe lower bounds in the previous section suggest that the relaxed error bound e(s, d) = s^{-d} is of importance. Indeed, it will turn out that many non-trivial smooth activation functions lead to nets that simulate {σ}-nets, provided the number of input units is counted when determining the size of the net. (We will see in section 4 that linear splines and the standard sigmoid are not equivalent if the number of inputs is not counted.) The concept of threshold-property will be crucial for us. \n\nDefinition 3.1 Let Γ be a collection of activation functions. We say that Γ has the threshold-property if there is a constant c such that the following two properties are satisfied for all m ≥ 1. \n(a) For each γ ∈ Γ there is a threshold-net T_{γ,m} with c layers and size (s + m)^c which computes the binary representation of γ'(x), where |γ(x) - γ'(x)| ≤ 2^{-m}. The input x of T_{γ,m} is given in binary and consists of 2m + 1 bits; m bits describe the integral part of x, m bits describe its fractional part and one bit indicates the sign. s + m specifies the required number of output bits, i.e. s = ⌈log_2(sup{γ(x) : -2^{m+1} ≤ x ≤ 2^{m+1}})⌉. \n(b) There is a Γ-net with c layers, size m^c and Lipschitz-bound 2^{m^c} which approximates the binary threshold over D = [-1,1] - [-1/m, 1/m] with error 1/m. \n\nWe can now state the main result of this section. \n\nTheorem 3.1 Assume that e(s, d) = s^{-d}. \n(a) Let Γ be a class of activation functions and assume that Γ has the threshold-property. Then σ and Γ are equivalent with respect to e. 
Moreover, {σ}-nets only require weights and thresholds of absolute value at most s. (Observe that Γ-nets are allowed to have weights as large as 2^s!) \n(b) If Γ and σ are equivalent with respect to error 2^{-s}, then Γ and σ are equivalent with respect to error s^{-d}. \n(c) Additionally, the following classes are equivalent to {σ}-nets with respect to e. (We assume throughout that all coefficients, weights and thresholds are bounded by 2^s for nets of size s.) \n\u2022 polynomial nets (i.e. polynomials of degree s appear as activation function for nets of size s), \n\u2022 {γ}-nets, where γ is a suitable function and γ satisfies part (a) of Definition 3.1 (this includes the sine-function), \n\u2022 nets of linear splines. \n\nThe equivalence proof involves a first phase of extracting O(d log s) bits from the analog input. In a second phase, a binary computation is mimicked. The extraction process can be carried out with error s^{-1} (over the domain [-1,1] - [-1/s, 1/s]) once the binary threshold is approximated. \n\n4 Computing boolean functions \n\nAs we have seen in Remark 2.1, the binary threshold (respectively linear splines) gains considerable power when computing boolean functions as compared to approximating analog functions. But sigmoidal nets will be far more powerful when only the number of neurons is counted and the number of input units is disregarded. For instance, sigmoidal nets are far more efficient for \"squaring\", i.e. when computing \n\nM_n = {(x, y) : x ∈ {0,1}^n, y ∈ {0,1}^{n^2} and [x]^2 ≥ [y]}, where [z] = Σ_i z_i. \n\nTheorem 4.1 A threshold-net computing M_n must have size at least Ω(log n). But M_n can be computed by a {σ}-net with constantly many gates. \n\nThe previously best known separation of threshold-nets and sigmoidal-nets is due to Maass, Schnitger and Sontag (Maass et al., 1991). 
But their result only applies to threshold-nets with at most two layers; our result holds without any restriction on the number of layers. Theorem 4.1 can be generalized to separate threshold-nets and 3-times differentiable activation functions, but this smoothness requirement is more severe than the one assumed in (Maass et al., 1991). \n\n5 Conclusions \n\nOur results show that good approximation performance (for error 2^{-s}) hinges on two properties, namely efficient approximation of polynomials and efficient approximation of the binary threshold. These two properties are shared by a quite large class of activation functions, i.e. powerful functions. Since (non-polynomial) rational functions are powerful, we were able to generalize Newman's approximation of |x| by rational functions. \n\nOn the other hand, for a good approximation performance relative to the relaxed error bound s^{-d} it is already sufficient to efficiently approximate the binary threshold. Consequently, the class of equivalent activation functions grows considerably (but only if the number of input units is counted). The standard sigmoid is distinguished in that its approximation performance scales with the error bound: if larger error is allowed, then smaller weights suffice. \n\nMoreover, the standard sigmoid is actually more powerful than the binary threshold even when computing boolean functions. In particular, the standard sigmoid is able to take advantage of its (non-trivial) smoothness to allow for more efficient nets. \n\nAcknowledgements. We wish to thank R. Paturi, K. Y. Siu and V. P. Roychowdhury for helpful discussions. Special thanks go to W. Maass for suggesting this research, to E. Sontag for continued encouragement and very valuable advice and to J. Lambert for his never-ending patience. 
\n\nThe second author gratefully acknowledges partial support by NSF-CCR-9114545. \n\nReferences \n\nArai, W. (1989), Mapping abilities of three-layer networks, in \"Proc. of the International Joint Conference on Neural Networks\", pp. 419-423. \n\nCarrol, S. M., and Dickinson, B. W. (1989), Construction of neural nets using the Radon transform, in \"Proc. of the International Joint Conference on Neural Networks\", pp. 607-611. \n\nCybenko, G. (1989), Approximation by superposition of a sigmoidal function, Mathematics of Control, Signals, and Systems, 2, pp. 303-314. \n\nFunahashi, K. (1989), On the approximate realization of continuous mappings by neural networks, Neural Networks, 2, pp. 183-192. \n\nGallant, A. R., and White, H. (1988), There exists a neural network that does not make avoidable mistakes, in \"Proc. of the International Joint Conference on Neural Networks\", pp. 657-664. \n\nHornik, K., Stinchcombe, M., and White, H. (1989), Multilayer feedforward networks are universal approximators, Neural Networks, 2, pp. 359-366. \n\nIrie, B., and Miyake, S. (1988), Capabilities of the three-layered perceptrons, in \"Proc. of the International Joint Conference on Neural Networks\", pp. 641-648. \n\nLapedes, A., and Farber, R. (1987), How neural nets work, in \"Advances in Neural Information Processing Systems\", pp. 442-456. \n\nMaass, W., Schnitger, G., and Sontag, E. (1991), On the computational power of sigmoid versus boolean threshold circuits, in \"Proc. of the 32nd Annual Symp. on Foundations of Computer Science\", pp. 767-776. \n\nNewman, D. J. (1964), Rational approximation to |x|, Michigan Math. Journal, 11, pp. 11-14. \n\nHecht-Nielsen, R. (1989), Theory of the backpropagation neural network, in \"Proc. of the International Joint Conference on Neural Networks\", pp. 593-611. \n\nPoggio, T., and Girosi, F. 
(1989), A theory of networks for approximation and learning, Artificial Intelligence Memorandum, No. 1140. \n\nReif, J. H. (1987), On threshold circuits and polynomial computation, in \"Proceedings of the 2nd Annual Structure in Complexity Theory Conference\", pp. 118-123. \n\nWei, Z., Yinglin, Y., and Qing, J. (1991), Approximation property of multi-layer neural networks (MLNN) and its application in nonlinear simulation, in \"Proc. of the International Joint Conference on Neural Networks\", pp. 171-176. \n", "award": [], "sourceid": 692, "authors": [{"given_name": "Bhaskar", "family_name": "DasGupta", "institution": null}, {"given_name": "Georg", "family_name": "Schnitger", "institution": null}]}