{"title": "Bounds on the complexity of recurrent neural network implementations of finite state machines", "book": "Advances in Neural Information Processing Systems", "page_first": 359, "page_last": 366, "abstract": null, "full_text": "Bounds on the complexity of recurrent neural network implementations of finite state machines \n\nBill G. Horne \nNEC Research Institute \n4 Independence Way \nPrinceton, NJ 08540 \n\nDon R. Hush \nEECE Department \nUniversity of New Mexico \nAlbuquerque, NM 87131 \n\nAbstract \n\nIn this paper the efficiency of recurrent neural network implementations of m-state finite state machines is explored. Specifically, it is shown that the node complexity for the unrestricted case can be bounded above by O(√m). It is also shown that the node complexity is O(√(m log m)) when the weights and thresholds are restricted to the set {-1, 1}, and O(m) when the fan-in is restricted to two. Matching lower bounds are provided for each of these upper bounds, assuming that the state of the FSM can be encoded in a subset of the nodes of size ⌈log m⌉. \n\n1 Introduction \n\nThe topic of this paper is understanding how efficiently neural networks scale to large problems. Although there are many ways to measure efficiency, we shall be concerned with node complexity, which, as its name implies, is a count of the required number of nodes. Node complexity is a useful measure of efficiency since the amount of resources required to implement or even simulate a recurrent neural network is typically related to the number of nodes. Node complexity can also be related to the efficiency of learning algorithms for these networks, and perhaps to their generalization ability as well. We shall focus on the node complexity of recurrent neural network implementations of finite state machines (FSMs) when the nodes of the network are restricted to threshold logic units. 
\n\nIn the 1960s it was shown that recurrent neural networks are capable of implementing arbitrary FSMs. The first result in this area was due to Minsky [7], who showed that m-state FSMs can be implemented in a fully connected recurrent neural network. Although circuit complexity was not the focus of his investigation, it turns out that his construction yields O(m) nodes. This construction was also guaranteed to use weight values limited to the set {1, 2}. Since a recurrent neural network with k hard-limiting nodes is capable of representing as many as 2^k states, one might wonder if an m-state FSM could be implemented by a network with log m nodes. However, it was shown in [1] that the node complexity for a standard fully connected network is Ω((m log m)^(1/3)). They were also able to improve upon Minsky's result by providing a construction which is guaranteed to yield no more than O(m^(3/4)) nodes. In the same paper, lower bounds on node complexity were investigated as the network was subjected to restrictions on the possible range of weight values and on the fan-in and fan-out of the nodes in the network. Their investigation was limited to fully connected recurrent neural networks, and they discovered that the node complexity for the case where the weights are restricted to a finite-size set is Ω(√(m log m)). Alternatively, if the nodes in the network are restricted to have a constant fan-in, then the node complexity becomes Ω(m). However, they left open the question of how tight these bounds are and whether they apply to variations on the basic architecture. Other recent work includes investigation of the node complexity for networks with continuous-valued nonlinearities [14]. However, it can also be shown that when continuous nonlinearities are used, recurrent neural networks are far more powerful than FSMs; in fact, they are Turing equivalent [13]. 
\n\nIn this paper we improve the upper bound on the node complexity for the unrestricted case to O(√m). We also provide upper bounds that match the lower bounds above for various restrictions. Specifically, we show that a node complexity of O(√(m log m)) can be achieved if the weights are restricted to the set {-1, 1}, and that the node complexity is O(m) for the case when the fan-in of each node in the network is restricted to two. Finally, we explore the possibility that implementing finite state machines in more complex models might yield a lower node complexity. Specifically, we explore the node complexity of a general recurrent neural network topology that is capable of simulating a variety of popular recurrent neural network architectures. Except for the unrestricted case, we will show that the node complexity is no different for this architecture than for the fully connected case if the number of feedback variables is limited to ⌈log m⌉, i.e. if the state of the FSM is encoded optimally in a subset of the nodes. We leave it as an open question whether a sparser encoding can lead to a more efficient implementation. \n\n2 Background \n\n2.1 Finite State Machines \n\nFSMs may be defined in several ways. In this paper we shall be concerned with Mealy machines, although our approach can easily be extended to other formulations to yield equivalent results. \n\nDefinition 1 A Mealy machine is a quintuple M = (Q, q0, Σ, Δ, φ), where Q is a finite set of states; q0 is the initial state; Σ is the input alphabet; Δ is the output alphabet; and φ : Q × Σ → Q × Δ is the combined transition and output function. \n\nThroughout this paper both the input and output alphabets will be binary (i.e. Σ = Δ = {0, 1}). In general, the number of states, m = |Q|, may be arbitrary. 
Since any element of Q can be encoded as a binary vector whose minimum length is ⌈log m⌉, the function φ can be implemented as a boolean logic function of the form \n\nφ : {0, 1}^(⌈log m⌉+1) → {0, 1}^(⌈log m⌉+1).   (1) \n\nThe number, N_M, of different minimal FSMs with m states will be used to determine lower bounds on the number of gates required to implement an arbitrary FSM in a recurrent neural network. It can easily be shown that (2m)^m ≤ N_M [5]. However, it will be convenient to reexpress this bound in terms of n = ⌈log m⌉ + 1 as follows: \n\nN_M ≥ 2^((n-1) 2^(n-2)).   (2) \n\n2.2 Recurrent Neural Networks \n\nThe fundamental processing unit in the models we wish to consider is the perceptron, which is a biased, linearly weighted sum of its inputs followed by a hard-limiting nonlinearity whose output is zero if its input is negative and one otherwise. The fan-in of the perceptron is defined to be the number of non-zero weights. When the values of the inputs x_i are binary (as they are in this paper), the perceptron is often referred to as a threshold logic unit (TLU). \n\nA count of the number of different partially specified threshold logic functions, which are threshold logic functions whose values are defined over only v vertices of the unit hypercube, will be needed to develop lower bounds on the node complexity required to implement an arbitrary logic function. It has been shown that this number, denoted L_v^n, satisfies [15] \n\nL_v^n ≤ 2 v^n / n!.   (3) \n\nAs pointed out in [10], many of the most popular discrete-time recurrent neural network models can be implemented as a feedforward network whose outputs are fed back recurrently through a set of unit time delays. In the most generic version of this architecture, the feedforward section is lower triangular, meaning the l-th node is the only node in layer l and receives input from all nodes in previous layers (including the input layer). 
A lower triangular network of k threshold logic elements is the most general topology possible for a feedforward network, since all other feedforward networks can be viewed as a special case of this network with the appropriate weights set equal to zero. The most direct implementation of this model is the architecture proposed in [11]. However, many recurrent neural network architectures can be cast into this framework. For example, fully connected networks [3] fit this model when the feedforward network is simply a single layer of nodes. Even models which appear very different [2, 9] can be cast into this framework. \n\n3 The unrestricted case \n\nThe unrestricted case is the most general, and thus explores the inherent power of recurrent neural networks. The unrestricted case is also important because it serves as a baseline from which one can evaluate the effect of various restrictions on the node complexity. \n\nIn order to derive an upper bound on the node complexity of recurrent neural network implementations of FSMs we shall utilize the following lemma, due to Lupanov [6]. The proof of this lemma involves a construction that is extremely complex and beyond the scope of this paper. \n\nLemma 1 (Lupanov, 1973) Arbitrary boolean logic functions with x inputs and y outputs can be implemented in a network of perceptrons with a node complexity of \n\nO(√(y 2^x / (x + log y))). \n\nTheorem 1 Multilayer recurrent neural networks can implement FSMs having m states with a node complexity of O(√m). \n\nProof: An m-state FSM can be implemented in a recurrent neural network in which the multilayer network performs a mapping of the form in equation (1). Using x = y = n = ⌈log m⌉ + 1 and applying Lemma 1 gives an upper bound of O(√m). Q.E.D. 
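As a concrete illustration of this kind of construction, the following sketch (illustrative only; the weights are chosen by hand and are not the construction used in the proofs) implements a two-state Mealy machine, a running-parity machine, as a feedforward network of threshold logic units whose single state bit is fed back through a unit time delay. Three TLUs are used for the update because state XOR input is not computable by a single threshold unit.

```python
def tlu(weights, threshold, inputs):
    """Hard-limiting perceptron: output 1 if the weighted sum reaches the
    threshold, else 0 (the threshold logic unit of Section 2.2)."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def update(state, x):
    """One pass through the feedforward section: next state and output both
    equal state XOR x, built from three TLUs (XOR is not linearly separable)."""
    h_or = tlu([1, 1], 1, [state, x])    # state OR x
    h_and = tlu([1, 1], 2, [state, x])   # state AND x
    y = tlu([1, -1], 1, [h_or, h_and])   # OR but not AND == XOR
    return y, y

state = 0                                # initial state q0
outputs = []
for x in [1, 1, 0, 1]:
    state, y = update(state, x)          # unit time delay: feed the state back
    outputs.append(y)
print(outputs)                           # running parity: [1, 0, 0, 1]
```

Encoding the two states in ⌈log 2⌉ = 1 feedback bit plus one external input bit gives exactly the mapping {0, 1}^2 → {0, 1}^2 of equation (1).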
\n\nTheorem 2 Multilayer recurrent neural networks can implement FSMs having m states with a node complexity of Ω(√m) if the number of unit time delays is ⌈log m⌉. \n\nProof: In order to prove the theorem we derive an expression for the maximum number of functions that a k-node recurrent neural network can compute and compare that against the minimum number of finite state machines. Then we solve for k in terms of the number of states of the FSM. \n\nSpecifically, we wish to manipulate the inequality \n\n2^((n-1) 2^(n-2)) ≤ n! C(k-1, n-1) ∏_(i=0)^(k-1) 2^(n(n+i)+1) / (n+i)! \n\nwhere the left-hand side is given in equation (2); the factor (a) = n! C(k-1, n-1), with C(·,·) the binomial coefficient, represents the total number of ways to choose the outputs and feedback variables of the network; and the product (b) represents the total number of logic functions computable by the feedforward section of the network, which is lower triangular. Part (a) is found by simple combinatorial arguments, noting that the last node in the network must be used as either an output or a feedback node. Part (b) is obtained by the following argument: if the state is optimally encoded in ⌈log m⌉ nodes, then only ⌈log m⌉ variables need to be fed back. Together with the external input this gives n = ⌈log m⌉ + 1 local inputs to the feedforward network. Repeated application of (3) with v = 2^n yields expression (b). \n\nFollowing a series of algebraic manipulations it can easily be shown that there exists a constant c such that \n\nn 2^n ≤ c k^2 n. \n\nSince n = ⌈log m⌉ + 1 it follows that k = Ω(√m). Q.E.D. \n\n4 Restriction on weights and thresholds \n\nAll threshold logic functions can be implemented with perceptrons whose weight and threshold values are integers. 
It is well known that there are threshold logic functions of n variables that require a perceptron with weights whose maximum magnitude is Ω(2^n), and that weights of magnitude O(n^(n/2)) always suffice [8]. This implies that if a perceptron is to be implemented digitally, the number of bits required to represent each weight and threshold in the worst case will be a superlinear function of the fan-in. This is generally undesirable; it would be far better to require only a logarithmic number of bits per weight, or even better, a constant number of bits per weight. We will be primarily interested in the most extreme case, where the weights are limited to values from the set {-1, 1}. \n\nIn order to derive the node complexity for networks with weight restrictions, we shall utilize the following lemma, proved in [4]. \n\nLemma 2 Arbitrary boolean logic functions with x inputs and y outputs can be implemented in a network of perceptrons whose weights and thresholds are restricted to the set {-1, 1} with a node complexity of Θ(√(y 2^x)). \n\nThis lemma is not difficult to prove; however, the proof is beyond the scope of this paper. The basic idea involves using a decomposition of logic functions proposed in [12]. Specifically, a boolean function f may always be decomposed into a disjunction of 2^r terms of the form x̃_1 x̃_2 ··· x̃_r f_i(x_(r+1), ..., x_n), one for each conjunction of the first r variables, where x̃_j represents either a complemented or uncomplemented version of the input variable x_j and each f_i is a logic function of the last n - r variables. This expression can be implemented directly in a neural network. With a negligible number of additional nodes, the construction can be implemented in such a way that all weights are either -1 or 1. Finally, the variable r is optimized to yield the minimum number of nodes in the network. 
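The decomposition behind Lemma 2 is easy to state in code. The sketch below is a toy functional check only (it evaluates the residual functions f_i directly rather than implementing them as subnetworks, and maj4 is a hypothetical example function): it expands a function over all 2^r assignments of its first r variables and verifies that the disjunction of the resulting terms reproduces the original function.

```python
from itertools import product

def shannon_expand(f, r):
    """Residual functions f_i of the remaining variables, one per assignment
    of the first r variables (the decomposition credited to [12])."""
    return {prefix: (lambda rest, p=prefix: f(p + rest))
            for prefix in product((0, 1), repeat=r)}

def recombine(residuals, r):
    """Disjunction of 2^r terms of the form x~1 ... x~r * f_i(x_{r+1}, ..., x_n)."""
    def g(bits):
        prefix, rest = tuple(bits[:r]), tuple(bits[r:])
        # each term is a conjunction of (possibly complemented) literals that
        # fires for exactly one prefix, ANDed with its residual function
        return max(int(p == prefix) & f_i(rest)
                   for p, f_i in residuals.items())
    return g

def maj4(bits):
    """Hypothetical test function: 1 iff at least three of four inputs are 1."""
    return 1 if sum(bits) >= 3 else 0

g = recombine(shannon_expand(maj4, 2), 2)
assert all(g(b) == maj4(b) for b in product((0, 1), repeat=4))
print("decomposition agrees on all 16 inputs")
```

In the actual construction of [4], each conjunction and each f_i becomes a subnetwork of {-1, 1}-weight perceptrons, and r is then chosen to minimize the total node count.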
\n\nTheorem 3 Multilayer recurrent neural networks that have nodes whose weights and thresholds are restricted to the set {-1, 1} can implement FSMs having m states with a node complexity of O(√(m log m)). \n\nProof: An m-state FSM can be implemented in a recurrent neural network in which the multilayer network performs a mapping of the form in equation (1). Using x = y = n = ⌈log m⌉ + 1 and applying Lemma 2 gives an upper bound of O(√(m log m)). Q.E.D. \n\nTheorem 4 Multilayer recurrent neural networks that have nodes whose weights and thresholds are restricted to a set W of size |W| can implement FSMs having m states with a node complexity of Ω(√(m log m / log |W|)) if the number of unit time delays is ⌈log m⌉. \n\nProof: The proof is similar to the proof of Theorem 2, which gave a lower bound for the node complexity required in an arbitrary network of threshold logic units. Here, the inequality we wish to manipulate is given by \n\n2^((n-1) 2^(n-2)) ≤ n! C(k-1, n-1) ∏_(i=0)^(k-1) |W|^(n+i+1) \n\nwhere the left-hand side and the factor (a) = n! C(k-1, n-1) are computed as before, and the product (b) represents the maximum number of ways to configure the nodes in the network when there are only |W| choices for each weight and threshold. Following a series of algebraic manipulations it can be shown that there exists a constant c such that \n\nn 2^n ≤ c k^2 log |W|. \n\nSince n = ⌈log m⌉ + 1 it follows that k = Ω(√(m log m / log |W|)). Q.E.D. \n\nClearly, for W = {-1, 1} this lower bound matches the upper bound in Theorem 3. \n\n5 Restriction on fan-in \n\nA limit on the fan-in of a perceptron is another important practical restriction. In the networks discussed so far each node has an unlimited fan-in. In fact, in the constructions described above, many nodes receive inputs from a polynomial number of nodes (in terms of m) in a previous layer. In practice it is not possible to build devices that have such a large connectivity. 
Restricting the fan-in to two is the most severe restriction, and it will be of primary interest in this paper. \n\nOnce again, in order to derive the node complexity for restricted fan-in, we shall utilize the following lemma, proved in [4]. \n\nLemma 3 Arbitrary boolean logic functions with x inputs and y outputs can be implemented in a network of perceptrons restricted to fan-in two with a node complexity of \n\nΘ(y 2^x / (x + log y)). \n\nThe proof of this lemma is very similar to the proof of Lemma 2. Here Shannon's decomposition is used with r = 2 to recursively decompose the logic function into a set of trees, until each tree has depth d. Then, all possible functions of the last n - d variables are implemented in an inverted tree-like structure, which feeds into the bottom of the trees. Finally, d is optimized to yield the minimum number of nodes. \n\nTheorem 5 Multilayer recurrent neural networks that have nodes whose fan-in is restricted to two can implement FSMs having m states with a node complexity of O(m). \n\nProof: An m-state FSM can be implemented in a recurrent neural network in which the multilayer network performs a mapping of the form in equation (1). Using x = y = n = ⌈log m⌉ + 1 and applying Lemma 3 gives an upper bound of O(m). Q.E.D. \n\nTheorem 6 Multilayer recurrent neural networks that have nodes whose fan-in is restricted to two can implement FSMs having m states with a node complexity of Ω(m) if the number of unit time delays is ⌈log m⌉. \n\nProof: Once again the proof is similar to Theorem 2, which gave a lower bound for the node complexity required in an arbitrary network of threshold logic units. Here, the inequality we need to solve is given by \n\n2^((n-1) 2^(n-2)) ≤ n! C(k-1, n-1) ∏_(i=0)^(k-1) 14 C(n+i, 2) \n\nwhere the left-hand side and the factor (a) = n! C(k-1, n-1) are computed as before, and the product (b) represents the maximum number of ways to configure the nodes in the network. The term C(n+i, 2) is used since a node in the i-th layer has n + i possible inputs, from which two are chosen. The constant 14 represents the fourteen possible threshold logic functions of two variables. Following a series of algebraic manipulations it can be shown that there exists a constant c such that \n\nn 2^n ≤ c k log k. \n\nSince n = ⌈log m⌉ + 1 it follows that k = Ω(m). Q.E.D. \n\n6 Summary \n\nIn summary, we provide new bounds on the node complexity of implementing FSMs with recurrent neural networks. These upper bounds match lower bounds developed in [1] for fully connected recurrent networks when the size of the weight set or the fan-in of each node is finite. Although one might speculate that more complex networks might yield more efficient constructions, we showed that these lower bounds do not change under restrictions on weights or fan-in, at least when the state of the FSM is encoded optimally in a subset of ⌈log m⌉ nodes. When the network is unrestricted, this lower bound matches our upper bound. We leave it as an open question whether a sparser encoding of the state variables can lead to a more efficient implementation. \n\nOne interesting aspect of this study is that there is really not much difference in efficiency between the totally unrestricted case and the case where severe restrictions are placed on the weights. Assuming that our bounds are tight, there is only a √(log m) penalty for restricting the weights to either -1 or 1. To get some idea of how marginal this difference is, consider that for a finite state machine with m = 18 × 10^18 states, √(log m) is only eight! \n\nA more detailed version of this paper can be found in [5]. 
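The arithmetic behind this closing example is easy to verify, and it is also instructive to see how far apart the three upper bounds actually are; the following quick numeric check is an illustration added here, not part of the paper's analysis.

```python
import math

# The closing example: for m = 18 * 10^18 states, the state encoding needs
# ceil(log2(m)) = 64 bits, so the sqrt(log m) penalty factor is sqrt(64) = 8.
m = 18 * 10**18
n = math.ceil(math.log2(m))
penalty = math.sqrt(n)
print(n, penalty)                 # 64 8.0

# How the three upper bounds grow relative to one another:
for states in (2**10, 2**20, 2**40):
    unrestricted = math.sqrt(states)                      # O(sqrt(m))
    binary_wts = math.sqrt(states * math.log2(states))    # O(sqrt(m log m))
    fan_in_two = float(states)                            # O(m)
    print(f"m = 2^{int(math.log2(states))}: "
          f"{unrestricted:.3g}  {binary_wts:.3g}  {fan_in_two:.3g}")
```

The gap between the unrestricted and {-1, 1}-weight cases stays a modest √(log m) factor, while the fan-in-two restriction costs a full √m factor.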
\n\nReferences \n\n[1] N. Alon, A.K. Dewdney, and T.J. Ott. Efficient simulation of finite automata by neural nets. JACM, 38(2):495-514, 1991. \n\n[2] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375-385, 1991. \n\n[3] J.J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci., 79:2554-2558, 1982. \n\n[4] B.G. Horne and D.R. Hush. On the node complexity of neural networks. Technical Report EECE 93-003, Dept. EECE, U. New Mexico, 1993. \n\n[5] B.G. Horne and D.R. Hush. Bounds on the complexity of recurrent neural network implementations of finite state machines. Technical Report EECE 94-001, Dept. EECE, U. New Mexico, 1994. \n\n[6] O.B. Lupanov. The synthesis of circuits from threshold elements. Problemy Kibernetiki, 26:109-140, 1973. \n\n[7] M. Minsky. Computation: Finite and infinite machines. Prentice-Hall, 1967. \n\n[8] S. Muroga. Threshold Logic and Its Applications. Wiley, 1971. \n\n[9] K.S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Trans. on Neural Networks, 1:4-27, 1990. \n\n[10] O. Nerrand et al. Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Computation, 5(2):165-199, 1993. \n\n[11] A.J. Robinson and F. Fallside. Static and dynamic error propagation networks with application to speech coding. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 632-641, 1988. \n\n[12] C. Shannon. The synthesis of two-terminal switching circuits. Bell Sys. Tech. J., 28:59-98, 1949. \n\n[13] H. Siegelmann and E.D. Sontag. Neural networks are universal computing devices. Technical Report SYCON-91-08, Rutgers Ctr. for Sys. and Cont., 1991. \n\n[14] H.T. Siegelmann, E.D. Sontag, and C.L. Giles. 
The complexity of language recognition by neural networks. In Proc. IFIP 12th World Comp. Cong., pages 329-335, 1992. \n\n[15] R.O. Winder. Bounds on threshold gate realizability. IEEE Trans. on Elect. Comp., EC-12:561-564, 1963. \n", "award": [], "sourceid": 822, "authors": [{"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "Don", "family_name": "Hush", "institution": null}]}