{"title": "Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 503, "page_last": 510, "abstract": null, "full_text": "Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks \n\nMichael Schmitt \n\nLehrstuhl Mathematik und Informatik, Fakultät für Mathematik \nRuhr-Universität Bochum, D-44780 Bochum, Germany \nmschmitt@lmi.ruhr-uni-bochum.de \n\nAbstract \n\nRecurrent neural networks of analog units are computers for real-valued functions. We study the time complexity of real computation in general recurrent neural networks. These have sigmoidal, linear, and product units of unlimited order as nodes and no restrictions on the weights. For networks operating in discrete time, we exhibit a family of functions with arbitrarily high complexity, and we derive almost tight bounds on the time required to compute these functions. Thus, evidence is given of the computational limitations that time-bounded analog recurrent neural networks are subject to. \n\n1 Introduction \n\nAnalog recurrent neural networks are known to have computational capabilities that exceed those of classical Turing machines (see, e.g., Siegelmann and Sontag, 1995; Kilian and Siegelmann, 1996; Siegelmann, 1999). Very little, however, is known about their limitations. Among the rare results in this direction is, for instance, the one of Šíma and Orponen (2001) showing that continuous-time Hopfield networks may require exponential time before converging to a stable state. This bound, however, is expressed in terms of the size of the network and, hence, does not apply to fixed-size networks with a given number of nodes. Other bounds on the computational power of analog recurrent networks have been established by Maass and Orponen (1998) and Maass and Sontag (1999). 
They show that discrete-time recurrent neural networks recognize only a subset of the regular languages in the presence of noise. This model of computation in recurrent networks, however, receives its inputs as sequences. Therefore, computing time is not an issue since the network halts when the input sequence terminates. Analog recurrent neural networks, however, can also be run as \"real\" computers that get as input a vector of real numbers and, after computing for a while, yield a real output value. No results are available thus far regarding the time complexity of analog recurrent neural networks with given size. \n\nWe investigate here the time complexity of discrete-time recurrent neural networks that compute functions over the reals. As network nodes we allow sigmoidal units, linear units, and product units, that is, monomials where the exponents are adjustable weights (Durbin and Rumelhart, 1989). We study the complexity of real computation in the sense of Blum et al. (1998). That means, we consider real numbers as entities that are represented exactly and processed without restricting their precision. Moreover, we do not assume that the information content of the network weights is bounded (as done, e.g., in the works of Balcázar et al., 1997; Gavaldà and Siegelmann, 1999). With such a general type of network, the question arises which functions can be computed with a given number of nodes and a limited amount of time. In the following, we exhibit a family of real-valued functions f_l, l ≥ 1, in one variable that is computed by some fixed-size network in time O(l). Our main result then shows that every recurrent neural network computing the functions f_l requires at least time Ω(l^{1/4}). Thus, we obtain almost tight time bounds for real computation in recurrent neural networks. 
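As a concrete illustration (not part of the original paper), the functions f_l studied below are simply iterates of the logistic map φ(x) = 4x(1 - x); a minimal sketch in Python, with function names of my own choosing:

```python
def phi(x):
    """One application of the logistic map on [0, 1]."""
    return 4.0 * x * (1.0 - x)

def f(l, x):
    """The l-th iterate f_l: f_1 = phi, and f_l = phi composed with f_{l-1}."""
    for _ in range(l):
        x = phi(x)
    return x
```

Each evaluation of f_l costs l map applications here; the paper's Lemma 2 achieves the same in network time O(l), and Theorem 3 shows that time of order l^{1/4} is unavoidable.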
\n\n2 Analog Computation in Recurrent Neural Networks \n\nWe study a very comprehensive type of discrete-time recurrent neural network that we call a general recurrent neural network (see Figure 1). For every k, n ∈ N there is a recurrent neural architecture consisting of k computation nodes y_1, ..., y_k and n input nodes x_1, ..., x_n. The size of a network is defined to be the number of its computation nodes. The computation nodes form a fully connected recurrent network. Every computation node also receives connections from every input node. The input nodes play the role of the input variables of the system. All connections are parameterized by real-valued adjustable weights. There are three types of computation nodes: product units, sigmoidal units, and linear units. Assume that computation node i has connections from computation nodes weighted by w_i1, ..., w_ik and from input nodes weighted by v_i1, ..., v_in. Let y_1(t), ..., y_k(t) and x_1(t), ..., x_n(t) be the values of the computation nodes and input nodes at time t, respectively. If node i is a product unit, it computes at time t + 1 the value \n\ny_i(t+1) = y_1(t)^{w_i1} · · · y_k(t)^{w_ik} · x_1(t)^{v_i1} · · · x_n(t)^{v_in},   (1) \n\nthat is, after weighting them exponentially, the incoming values are multiplied. Sigmoidal and linear units have an additional parameter associated with them, the threshold or bias θ_i. A sigmoidal unit computes the value \n\ny_i(t+1) = σ(w_i1 y_1(t) + ... + w_ik y_k(t) + v_i1 x_1(t) + ... + v_in x_n(t) - θ_i),   (2) \n\nwhere σ is the standard sigmoid σ(z) = 1/(1 + e^{-z}). If node i is a linear unit, it simply outputs the weighted sum \n\ny_i(t+1) = w_i1 y_1(t) + ... + w_ik y_k(t) + v_i1 x_1(t) + ... + v_in x_n(t) - θ_i.   (3) \n\nWe allow the networks to be heterogeneous, that is, they may contain all three types of computation nodes simultaneously. Thus, this model encompasses a wide class of network types considered in research and applications. For instance, architectures have been proposed that include a second layer of linear computation nodes which have no recurrent connections to computation nodes but serve as output nodes (see, e.g.
, Koiran and Sontag, 1998; Haykin, 1999; Siegelmann, 1999). It is clear that in the definition given here, the linear units can function as these output nodes if the weights of the outgoing connections are set to 0. Also very common is the use of sigmoidal units with higher-order terms as computation nodes in recurrent networks (see, e.g., Omlin and Giles, 1996; Gavaldà and Siegelmann, 1999; Carrasco et al., 2000). Obviously, the model here includes these higher-order networks as a special case since the computation of a higher-order sigmoidal unit can be simulated by first computing the higher-order terms using product units and then passing their outputs to a sigmoidal unit. \n\n[Figure 1: A general recurrent neural network of size k, with computation nodes y_1, ..., y_k (sigmoidal, product, and linear units) and input nodes x_1, ..., x_n. Any computation node may serve as output node.] \n\nProduct units, however, are even more powerful than higher-order terms since they allow division operations to be performed using negative weights. Moreover, if a negative input value is weighted by a non-integer weight, the output of a product unit may be a complex number. We shall ensure here that all computations are real-valued. Since we are mainly interested in lower bounds, however, these bounds obviously remain valid if the computations of the networks are extended to the complex domain. \n\nWe now define what it means that a recurrent neural network N computes a function f : R^n → R. Assume that N has n input nodes and let x ∈ R^n. Given t ∈ N, we say that N computes f(x) in t steps if after initializing at time 0 the input nodes with x and the computation nodes with some fixed values, and performing t computation steps as defined in Equations (1), (2), and (3), one of the computation nodes yields the value f(x). We assume that the input nodes remain unchanged during the computation. 
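The update rules (1)-(3) and this notion of computing in t steps can be sketched as one synchronous update in Python (an illustrative reimplementation, not code from the paper; the unit encoding and names are my own):

```python
import math

def sigma(z):
    # standard sigmoid
    return 1.0 / (1.0 + math.exp(-z))

def step(units, y, x):
    """One synchronous update of all k computation nodes.

    units: one (kind, w, v, theta) tuple per node, kind in
    {'product', 'sigmoid', 'linear'}; w holds the k weights from the
    computation nodes, v the n weights from the input nodes.
    """
    new_y = []
    for kind, w, v, theta in units:
        if kind == 'product':
            # Eq. (1): multiply the exponentially weighted incoming values
            val = 1.0
            for yj, wj in zip(y, w):
                val *= yj ** wj
            for xj, vj in zip(x, v):
                val *= xj ** vj
        else:
            # Eqs. (2), (3): weighted sum minus threshold, squashed if sigmoidal
            s = (sum(yj * wj for yj, wj in zip(y, w))
                 + sum(xj * vj for xj, vj in zip(x, v)) - theta)
            val = sigma(s) if kind == 'sigmoid' else s
        new_y.append(val)
    return new_y
```

Computing f(x) in t steps then amounts to initializing y with fixed values, clamping x, and iterating step t times until some node carries the value f(x).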
We further say that N computes f in time t if for every x ∈ R^n, network N computes f(x) in at most t steps. Note that t may depend on f but must be independent of the input vector. We emphasize that this is a very general definition of analog computation in recurrent neural networks. In particular, we do not specify any definite output node but allow the output to occur at any node. Moreover, it is not even required that the network reach a stable state, as with attractor or Hopfield networks. It is sufficient that the output value appears at some point of the trajectory the network performs. A similar view of computation in recurrent networks is captured in a model proposed by Maass et al. (2001). Clearly, the lower bounds remain valid for more restrictive definitions of analog computation that require output nodes or stable states. Moreover, they hold for architectures that have no input nodes but receive their inputs as initial values of the computation nodes. Thus, the bounds serve as lower bounds also for the transition times between real-valued states of discrete-time dynamical systems comprising the networks considered here. \n\nOur main tool of investigation is the Vapnik-Chervonenkis dimension of neural networks. It is defined as follows (see also Anthony and Bartlett, 1999): A dichotomy of a set S ⊆ R^n is a partition of S into two disjoint subsets (S_0, S_1) satisfying S_0 ∪ S_1 = S. A class F of functions mapping R^n to {0, 1} is said to shatter S if for every dichotomy (S_0, S_1) of S there is some f ∈ F that satisfies f(S_0) ⊆ {0} and f(S_1) ⊆ {1}. The Vapnik-Chervonenkis (VC) dimension of F is defined as the largest number m such that there is a set of m elements shattered by F. \n\n[Figure 2: A recurrent neural network computing the functions f_l in time 2l + 1.] 
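The shattering condition just defined can be checked mechanically for small finite function classes; a brute-force sketch (illustrative only, not from the paper, with helper names of my own choosing):

```python
from itertools import combinations

def shatters(fs, S):
    """True if the 0/1-valued function class fs realizes every dichotomy of S."""
    realized = {tuple(f(s) for s in S) for f in fs}
    return len(realized) == 2 ** len(S)

def vc_dimension(fs, domain):
    """Largest m such that some m-element subset of domain is shattered."""
    d = 0
    for m in range(1, len(domain) + 1):
        if any(shatters(fs, subset) for subset in combinations(domain, m)):
            d = m
    return d
```

For example, the class of threshold functions x ↦ [x ≥ t] on a linearly ordered domain shatters every single point but no pair, so its VC dimension is 1.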
A neural network given in terms of an architecture represents a class of functions obtained by assigning real numbers to all its adjustable parameters, that is, weights and thresholds or a subset thereof. The output of the network is assumed to be thresholded at some fixed constant so that the output values are binary. The VC dimension of a neural network is then defined as the VC dimension of the class of functions computed by this network. \n\nIn deriving lower bounds in the next section, we make use of the following result on networks with product and sigmoidal units that has been previously established (Schmitt, 2002). We emphasize that the only constraint on the parameters of the product units is that they yield real-valued, that is, not complex-valued, functions. This means further that the statement holds for networks of arbitrary order, that is, it does not impose any restrictions on the magnitude of the weights of the product units. \n\nProposition 1. (Schmitt, 2002, Theorem 2) Suppose N is a feedforward neural network consisting of sigmoidal, product, and linear units. Let k be its size and W the number of adjustable weights. The VC dimension of N restricted to real-valued functions is at most 4(Wk)^2 + 20Wk log(36Wk). \n\n3 Bounds on Computing Time \n\nWe establish bounds on the time required by recurrent neural networks for computing a family of functions f_l : R → R, l ≥ 1, where l can be considered as a measure of the complexity of f_l. Specifically, f_l is defined in terms of a dynamical system as the l-th iterate of the logistic map φ(x) = 4x(1 - x), that is, \n\nf_l(x) = φ(x) for l = 1,  and  f_l(x) = φ(f_{l-1}(x)) for l ≥ 2. \n\nWe observe that there is a single recurrent network capable of computing every f_l in time O(l). \n\nLemma 2. There is a general recurrent neural network that computes f_l in time 2l + 1 for every l. \n\nProof. The network is shown in Figure 2. 
It consists of linear and second-order units. All computation nodes are initialized with 0, except y_1, which starts with 1 and outputs 0 during all following steps. The purpose of y_1 is to let the input x enter node y_2 at time 1 and keep it away at later times. Clearly, the value f_l(x) results at node y_5 after 2l + 1 steps. □ \n\nThe network used for computing f_l requires only linear and second-order units. The following result shows that the established upper bound is asymptotically almost tight, with a gap only of order four. Moreover, the lower bound holds for networks of unrestricted order and with sigmoidal units. \n\nTheorem 3. Every general recurrent neural network of size k requires at least time c l^{1/4} / k to compute function f_l, where c > 0 is some constant. \n\nProof. The idea is to construct higher-order networks N_l of small size that have comparatively large VC dimension. Such a network will consist of linear and product units and hypothetical units that compute functions f_j for certain values of j. We shall derive a lower bound on the VC dimension of these networks. Assuming that the hypothetical units can be replaced by time-bounded general recurrent networks, we determine an upper bound on the VC dimension of the resulting networks in terms of size and computing time using an idea from Koiran and Sontag (1998) and Proposition 1. The comparison of the lower and upper VC dimension bounds will give an estimate of the time required for computing f_l. \n\n[Figure 3: Network N_l.] \n\nNetwork N_l, shown in Figure 3, is a feedforward network composed of three networks N_l^(1), N_l^(2), N_l^(3), where each network N_l^(μ), μ = 1, 2, 3, has input nodes x_1^(μ), ..., x_l^(μ) and 2l + 2 computation nodes y_0^(μ), ..., y_{2l+1}^(μ) (see Figure 4). There is only one adjustable parameter in N_l, denoted w; all other weights are fixed. The computation nodes are defined as follows (omitting the time parameter t): \n\ny_0^(μ) = w for μ = 3, and y_0^(μ) = y_{2l+1}^(μ+1) for μ = 1, 2; \ny_i^(μ) = f_{l^{μ-1}}(y_{i-1}^(μ)) for i = 1, ..., l and μ = 1, 2, 3; \ny_{l+i}^(μ) = y_i^(μ) · x_i^(μ) for i = 1, ..., l and μ = 1, 2, 3; \ny_{2l+1}^(μ) = y_{l+1}^(μ) + ... + y_{2l}^(μ) for μ = 1, 2, 3. \n\nThe nodes y_0^(μ) can be considered as additional input nodes for N_l^(μ), where N_l^(3) gets this input from w, and N_l^(μ) from N_l^(μ+1) for μ = 1, 2. Node y_{2l+1}^(μ) is the output node of N_l^(μ), and node y_{2l+1}^(1) is also the output node of N_l. Thus, the entire network has 3l + 6 nodes that are linear or product units and 3l nodes that compute functions f_1, f_l, or f_{l^2}. \n\n[Figure 4: Network N_l^(μ), receiving as input w (for μ = 3) or the output of N_l^(μ+1) (for μ = 1, 2).] \n\nWe show that N_l shatters some set of cardinality l^3, in particular, the set S = ({e_i : i = 1, ..., l})^3, where e_i ∈ {0, 1}^l is the unit vector with a 1 in position i and 0 elsewhere. Every dichotomy of S can be programmed into the network parameter w using the following fact about the logistic function φ (see Koiran and Sontag, 1998, Lemma 2): For every binary vector b ∈ {0, 1}^m, b = b_1 ... b_m, there is some real number w ∈ [0, 1] such that for i = 1, ..., m \n\nf_i(w) ∈ [0, 1/2) if b_i = 0,  and  f_i(w) ∈ (1/2, 1] if b_i = 1. \n\nHence, for every dichotomy (S_0, S_1) of S the parameter w can be chosen such that every (e_{i_1}, e_{i_2}, e_{i_3}) ∈ S satisfies \n\nf_{i_1 + i_2 l + i_3 l^2}(w) < 1/2 if (e_{i_1}, e_{i_2}, e_{i_3}) ∈ S_0,  and  f_{i_1 + i_2 l + i_3 l^2}(w) > 1/2 if (e_{i_1}, e_{i_2}, e_{i_3}) ∈ S_1. \n\nSince f_{i_1 + i_2 l + i_3 l^2}(w) = f_{i_1}(f_{i_2 l}(f_{i_3 l^2}(w))), this is the value computed by N_l on input (e_{i_1}, e_{i_2}, e_{i_3}), where e_{i_μ} is the input given to network N_l^(μ). 
(Input ei\" selects \nthe function li\"'I,,-1 in N;(p).) Hence, S is shattered by Ni, implying that Ni has \nVC dimension at least [3. \n\n\fAssume now that Ii can be computed by a general recurrent neural network of size \nat most kj in time tj. Using an idea of Koiran and Sontag (1998), we unfold the \nnetwork to obtain a feedforward network of size at most kjtj computing fj. Thus we \ncan replace the nodes computing ft, ft, fl2 in Nz by networks of size k1t1, kltl, k12t12, \nrespectively, such that we have a feedforward network '!J consisting of sigmoidal, \nproduct, and linear units. Since there are 3l units in Nl computing ft, ft, or fl2 \nand at most 3l + 6 product and linear units, the size of Nt is at most c1lkl2tl2 \nfor some constant C1 > O. Using that Nt has one adjustable weight, we get from \nProposition 1 that its VC dimension is at most c2l2kr2tr2 for some constant C2 > o. \nOn the other hand, since Nz and Nt both shatter S, the VC dimension of Nt is at \nleast l3. Hence, l3 ~ C2l2 kr2 tr2 holds, which implies that tl2 2: cl 1/ 2 / kl2 for some \nc > 0, and hence tl 2: cl1/4 / kl. \nD \nLemma 2 shows that a single recurrent network is capable of computing every \nfunction fl in time O(l). The following consequence of Theorem 3 establishes that \nthis bound cannot be much improved. \n\nCorollary 4. Every general recurrent neural network requires at least time 0(ll /4 ) \nto compute the functions fl. \n\n4 Conclusions and Perspectives \n\nWe have established bounds on the computing time of analog recurrent neural \nnetworks. The result shows that for every network of given size there are functions \nof arbitrarily high time complexity. This fact does not rely on a bound on the \nmagnitude of weights. We have derived upper and lower bounds that are rather \ntight- with a polynomial gap of order four- and hold for the computation of a \nspecific family of real-valued functions in one variable. 
Interestingly, the upper bound is shown using second-order networks without sigmoidal units, whereas the lower bound is valid even for networks with sigmoidal units and arbitrary product units. This indicates that adding these units might decrease the computing time only marginally. The derivation made use of an upper bound on the VC dimension of higher-order sigmoidal networks. This bound is not known to be optimal. Any future improvement will therefore lead to a better lower bound on the computing time. \n\nWe have focused on product and sigmoidal units as nonlinear computing elements. However, the construction presented here is generic. Thus, it is possible to derive similar results for radial basis function units, models of spiking neurons, and other unit types that are known to yield networks with bounded VC dimension. The questions whether such results can be obtained for continuous-time networks and for networks operating in the domain of complex numbers are challenging. A further assumption made here is that the networks compute the functions exactly. By a more detailed analysis, and using the fact that the shattering of sets requires the outputs only to lie below or above some threshold, similar results can be obtained for networks that approximate the functions more or less closely and for networks that are subject to noise. \n\nAcknowledgment \n\nThe author gratefully acknowledges funding from the Deutsche Forschungsgemeinschaft (DFG). This work was also supported in part by the ESPRIT Working Group in Neural and Computational Learning II, NeuroCOLT2, No. 27150. \n\nReferences \n\nAnthony, M. and Bartlett, P. L. (1999). Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge. \n\nBalcázar, J., Gavaldà, R., and Siegelmann, H. T. (1997). Computational power of neural networks: A characterization in terms of Kolmogorov complexity. 
IEEE Transactions on Information Theory, 43:1175-1183. \n\nBlum, L., Cucker, F., Shub, M., and Smale, S. (1998). Complexity and Real Computation. Springer-Verlag, New York. \n\nCarrasco, R. C., Forcada, M. L., Valdés-Muñoz, M. A., and Ñeco, R. P. (2000). Stable encoding of finite state machines in discrete-time recurrent neural nets with sigmoid units. Neural Computation, 12:2129-2174. \n\nDurbin, R. and Rumelhart, D. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1:133-142. \n\nGavaldà, R. and Siegelmann, H. T. (1999). Discontinuities in recurrent neural networks. Neural Computation, 11:715-745. \n\nHaykin, S. (1999). Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, NJ, second edition. \n\nKilian, J. and Siegelmann, H. T. (1996). The dynamic universality of sigmoidal neural networks. Information and Computation, 128:48-56. \n\nKoiran, P. and Sontag, E. D. (1998). Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Applied Mathematics, 86:63-79. \n\nMaass, W., Natschläger, T., and Markram, H. (2001). Real-time computing without stable states: A new framework for neural computation based on perturbations. Preprint. \n\nMaass, W. and Orponen, P. (1998). On the effect of analog noise in discrete-time analog computations. Neural Computation, 10:1071-1095. \n\nMaass, W. and Sontag, E. D. (1999). Analog neural nets with Gaussian or other common noise distributions cannot recognize arbitrary regular languages. Neural Computation, 11:771-782. \n\nOmlin, C. W. and Giles, C. L. (1996). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the Association for Computing Machinery, 43:937-972. \n\nSchmitt, M. (2002). On the complexity of computing and learning with multiplicative neural networks. 
Neural Computation, 14. In press. \n\nSiegelmann, H. T. (1999). Neural Networks and Analog Computation: Beyond the Turing Limit. Progress in Theoretical Computer Science. Birkhäuser, Boston. \n\nSiegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal of Computer and System Sciences, 50:132-150. \n\nŠíma, J. and Orponen, P. (2001). Exponential transients in continuous-time symmetric Hopfield nets. In Dorffner, G., Bischof, H., and Hornik, K., editors, Proceedings of the International Conference on Artificial Neural Networks ICANN 2001, volume 2130 of Lecture Notes in Computer Science, pages 806-813. Springer, Berlin. \n", "award": [], "sourceid": 2111, "authors": [{"given_name": "M.", "family_name": "Schmitt", "institution": null}]}