{"title": "RCC Cannot Compute Certain FSA, Even with Arbitrary Transfer Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 619, "page_last": 625, "abstract": "", "full_text": "RCC Cannot Compute Certain FSA, \n\nEven with Arbitrary Transfer Functions \n\nRWCP Theoretical Foundation GMD Laboratory \n\nGMD - German National Research Center for Information Technology \n\nMark Ring \n\nSchloss Birlinghoven \n\nD-53 754 Sankt Augustin, Germany \n\nemail: Mark .Ring@GMD.de \n\nAbstract \n\nExisting proofs demonstrating the computational limitations of Re(cid:173)\ncurrent Cascade Correlation and similar networks (Fahlman, 1991; \nBachrach, 1988; Mozer, 1988) explicitly limit their results to units \nhaving sigmoidal or hard-threshold transfer functions (Giles et aI., \n1995; and Kremer, 1996). The proof given here shows that for \nany finite, discrete transfer function used by the units of an RCC \nnetwork, there are finite-state automata (FSA) that the network \ncannot model, no matter how many units are used. The proof also \napplies to continuous transfer functions with a finite number of \nfixed-points, such as sigmoid and radial-basis functions. \n\n1 \n\nIntroduction \n\nThe Recurrent Cascade Correlation (RCC) network was proposed by Fahlman \n(1991) to offer a fast and efficient alternative to fully connected recurrent networks. \nThe network is arranged such that each unit has only a single recurrent connection: \nthe connection that goes from itself to itself. Networks with the same structure \nhave been proposed by Mozer (Mozer, 1988) and Bachrach (Bachrach, 1988). This \nstructure is intended to allow simplified training of recurrent networks in the hopes \nof making them computationally feasible. However, this increase in efficiency comes \nat the cost of computational power: the networks' computational capabilities are \nlimited regardless of the power of their activation functions. 
The remaining input to each unit consists of the input to the network as a whole together with the outputs from all units lower in the RCC network. Since it is the structure of the network and not the learning algorithm that is of interest here, only the structure will be described in detail. \n\n620 \n\nM. Ring \n\nFigure 1: This finite-state automaton was shown by Giles et al. (1995) to be unrepresentable by an RCC network whose units have hard-threshold or sigmoidal transfer functions. The arcs are labeled with the transition labels of the FSA, which are given as input to the RCC network. The nodes are labeled with the output values that the network is required to generate. The node with an inner circle is an accepting or halting state. \n\nFigure 2: This finite-state automaton is one of those shown by Kremer (1996) not to be representable by an RCC network whose units have a hard-threshold or sigmoidal transfer function. This FSA computes the parity of the inputs seen so far. \n\nThe functionality of a network of N RCC units, U_0, ..., U_{N-1}, can be described in the following way: \n\nV_0(t) = f_0(I(t), V_0(t-1)) (1) \nV_x(t) = f_x(I(t), V_x(t-1), V_{x-1}(t), V_{x-2}(t), ..., V_0(t)) (2) \n\nwhere V_x(t) is the output value of U_x at time step t, and I(t) is the input to the network at time step t. The value of each unit is determined from: (1) the network input at the current time step, (2) its own value at the previous time step, and (3) the output values of the units lower in the network at the current time step. Since learning is not being considered here, the weights are assumed to be constant. \n\n2 Existing Proofs \n\nThe proof of Giles et al. (1995) showed that an RCC network whose units have a hard-threshold or sigmoidal transfer function cannot produce outputs that oscillate with a period greater than two when the network input is constant. (An oscillation has a period of x if it repeats itself every x steps.) 
Thus, the FSA shown in Figure 1 cannot be modeled by such an RCC network, since its output (shown as node labels) oscillates with a period greater than two given constant input. Kremer (1996) refined the class of FSA representable by an RCC network, showing that, if the input to the net oscillates with period p, then the output can only oscillate with a period of w, where w is one of p's factors (or one of 2p's factors if p is odd). An unrepresentable example, therefore, is the parity FSA shown in Figure 2, whose output has a period of four given the following input (of period two): 0, 1, 0, 1, .... \n\nFigure 3: This finite-state automaton cannot be modeled with any RCC network whose units are capable of representing only k discrete outputs. The values within the circles are the state names and the output expected from the network. The arcs describe transitions from state to state, and their values represent the input given to the network when the transition is made. The dashed lines indicate an arbitrary number of further states between state 3 and state k, which are connected in the same manner as states 1, 2, and 3. (All states are halting states.) \n\nBoth proofs, that by Giles et al. and that by Kremer, are explicitly designed with hard-threshold and sigmoidal transfer functions in mind, and can say nothing about other transfer functions. In other words, these proofs demonstrate not the limitations of the RCC-type network structure, but the limitations of using threshold units within this structure. The following proof is the first that actually demonstrates the limitations of the single-recurrent-link network structure. 
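The single-recurrent-link update of Equations (1) and (2) can be sketched in a few lines of Python. This is an illustrative sketch only, not Fahlman's implementation: the weight values and the hard-threshold transfer function used here are hypothetical placeholders.

```python
# Sketch of the RCC forward pass: unit x receives the network input, its own
# previous output, and the current outputs of all units below it (Eqs. 1-2).
def rcc_step(units, state, net_input):
    new_state = []
    for x, f in enumerate(units):
        # tuple(new_state) holds V_0(t) .. V_{x-1}(t), already updated.
        new_state.append(f(net_input, state[x], tuple(new_state)))
    return new_state

# Hypothetical hard-threshold transfer function, just to make this runnable.
def make_unit(w_in, w_self, w_lower):
    def f(i, v_prev, lower):
        s = w_in * i + w_self * v_prev + sum(w_lower * v for v in lower)
        return 1.0 if s > 0 else 0.0
    return f

units = [make_unit(1.0, -1.0, 0.5) for _ in range(3)]
state = [0.0, 0.0, 0.0]
for i in [1, 0, 1, 1]:
    state = rcc_step(units, state, i)
print(state)
```

The essential restriction is visible in `rcc_step`: no unit ever sees the previous output of any unit other than itself.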
\n\n3 Details of the Proof \n\nThis section proves that RCC networks are incapable even in principle of modeling certain kinds of FSA, regardless of the sophistication of each unit's transfer function, provided only that the transfer function be discrete and finite, meaning only that the units of the RCC network are capable of generating a fixed number, k, of distinct output values. (Since all functions implemented on a discrete computer fall into this category, this assumption is minor. Furthermore, as will be discussed in Section 4, the outputs of most interesting continuous transfer functions reduce to only a small number of distinct values.) This generalized RCC network is proven here to be incapable of modeling the finite-state automaton shown in Figure 3. \n\nFor ease of exposition, let us call any FSA of the form shown in Figure 3 an RFk+1, for Ring FSA with k + 1 states.[1] Further, call a unit whose output can be any of k distinct values and whose input includes its own previous output a DRUk, for Discrete Recurrent Unit. These units are a generalization of the units used by RCC networks in that the specific transfer function is left unspecified. Proving that the network is limited when its units are DRUk's proves the limitations of the network's structure regardless of the transfer function used. \n\nClearly, a DRUk+1 with a sufficiently sophisticated transfer function could by itself model an RFk+1 by simply allocating one of its k + 1 output values for each of the k + 1 states. At each step it would receive as input the last state of the FSA and the next transition and could therefore compute the next state. By restricting the units in the least conceivable manner, i.e., by reducing the number of distinct output values to k, the RCC network becomes incapable of modeling any RFk+1, regardless of how many DRUk's the network contains. This will now be proven. 
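The claim above, that a unit with k + 1 output values models the RFk+1 perfectly, can be sketched in Python. The transition rule used here (on input i, state i advances cyclically and every other state is unchanged) is one plausible reading of Figure 3, reconstructed from the proof text rather than a verbatim specification.

```python
import random

# Assumed RFk+1 transition rule, reconstructed from the proof text: on input
# i, state i advances cyclically to i+1; every other state is unchanged.
def rf_step(state, i, k):
    return (state + 1) % (k + 1) if state == i else state

# A hypothetical unit with k+1 distinct output values models the FSA exactly
# by dedicating one output value to each state, as the text describes.
def dru_step(prev_output, i, k):
    return rf_step(prev_output, i, k)   # output value == current FSA state

k = 5
state = output = 0
for _ in range(1000):
    i = random.randrange(k + 1)         # an arbitrary transition label
    state = rf_step(state, i, k)
    output = dru_step(output, i, k)
assert output == state
```

The unit succeeds only because its previous output uniquely names the FSA state; the proof shows that with one value fewer, this bookkeeping collapses.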
\n\nThe proof is inductive and begins with the first unit in the network, which, after being given certain sequences of inputs, becomes incapable of distinguishing among any states of the FSA. The second step, the inductive step, proves that no finite number of such units can assist a unit higher in the RCC network in making a distinction between any states of the RFk+1. \n\nLemma 1 No DRUk whose input is the current transition of an RFk+1 can reliably distinguish among any states of the RFk+1. More specifically, at least one of the DRUk's k output values can be generated in all of the RFk+1's k + 1 states. \n\nProof: Let us name the DRUk's k distinct output values V_0, V_1, ..., V_{k-1}. The mapping function implemented by the DRUk can be expressed as follows: \n\n(V_x, i) => V_y, \n\nwhich indicates that when the unit's last output was V_x and its current input is i, then its next output is V_y. \n\nSince an RFk is cyclical, the arithmetic in the following will also be cyclical (i.e., modular): \n\nx ⊕ y = x + y, if x + y < k; x + y - k, if x + y >= k \nx ⊖ y = x - y, if x >= y; x + k - y, if x < y \n\nwhere 0 <= x < k and 0 <= y < k. \n\nSince it is impossible for the DRUk to represent each of the RFk+1's k + 1 states with a distinct output value, at least two of these states must be represented ambiguously by the same value. That is, there are two RFk+1 states a and b and one DRUk value V_{a/b} such that V_{a/b} can be generated by the unit both when the FSA is in state a and when it is in state b. Furthermore, this value will be generated by the unit given an appropriate sequence of inputs. (Otherwise the value is unreachable, serves no purpose, and can be discarded, reducing the unit to a DRUk-1.) \n\nOnce the DRUk has generated V_{a/b}, it cannot in the next step distinguish whether the FSA's current state is a or b. Since the FSA could be in either state a or b, the next state after a b transition could be either a or b ⊕ 1. 
That is: \n\n(V_{a/b}, b) => V_{a/(b ⊕ 1)}, (3) \n\nwhere a ⊖ b >= b ⊖ a and k > 1. \n\n[1] Thanks to Mike Mozer for suggesting this catchy name. \n\nThis new output value V_{a/(b ⊕ 1)} can therefore be generated when the FSA is in either state a or state b ⊕ 1. By repeatedly replacing b with b ⊕ 1 in Equation 3, all states from b to a ⊖ 1 can be shown to share output values with state a; i.e., V_{a/b}, V_{a/(b ⊕ 1)}, V_{a/(b ⊕ 2)}, ..., V_{a/(a ⊖ 2)}, V_{a/(a ⊖ 1)} all exist. \n\nRepeatedly substituting a ⊖ 1 and a for a and b respectively in the last paragraph produces values V_{x/y} for all x, y in {0, 1, ..., k}. There is, therefore, at least one value that can be generated by the unit in both states of every possible pair of states. Since there are (k+1 choose 2) distinct pairs but only k distinct output values, and since \n\nceil((k+1 choose 2) / k) > 1 \n\nwhen k > 1, not all of these pairs can be represented by unique V values. At least two of these pairs must share the same output value, and this implies that some V_{a/b/c} exists that can be output by the unit in any of the three FSA states a, b, and c. \n\nStarting with \n\n(V_{a/b/c}, c) => V_{a/b/(c ⊕ 1)}, \n\nand following the same argument given above for V_{a/b}, there must be a V_{x/y/z} for all triples of states x, y, and z. Since there are (k+1 choose 3) distinct triples but only k distinct output values, and since \n\nceil((k+1 choose 3) / k) > 1 \n\nwhen k >= 3, some V_{a/b/c/d} must also exist. \n\nThis argument can be followed repeatedly since \n\nceil((k+1 choose m) / k) > 1 \n\nfor all m < k + 1, including when m = k. Therefore, there is at least one V_{0/1/2/.../k} that can be output by the unit in all k + 1 states of the RFk+1. Call this value, and any other that can be generated in all FSA states, a Vk. All Vk's are reachable (else they could be discarded and the above proof applied for a DRUl, l < k). When a Vk is output by a DRUk, it does not distinguish any states of the RFk+1. 
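The counting steps in Lemma 1 rest on the pigeonhole inequality ceil((k+1 choose m) / k) > 1, i.e., (k+1 choose m) > k, for every tuple size m up to k. This is easy to confirm numerically (a sanity check, not part of the proof itself):

```python
from math import comb, ceil

# Pigeonhole inequality behind Lemma 1: there are C(k+1, m) distinct m-tuples
# of FSA states but only k distinct unit output values, so some output value
# is forced to cover more than one m-tuple whenever C(k+1, m) > k.
for k in range(2, 50):
    for m in range(2, k + 1):
        assert comb(k + 1, m) > k
        assert ceil(comb(k + 1, m) / k) > 1
print("pigeonhole inequality holds for k = 2..49, m = 2..k")
```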
\n\nLemma 2 Once a DRUk outputs a Vk, all future outputs will also be Vk's. \n\nProof: The proof is simply by inspection, and is shown in the following table: \n\nActual State   Transition   Next State \nx              x            x ⊕ 1 \nx ⊕ 1          x            x ⊕ 1 \nx ⊕ 2          x            x ⊕ 2 \nx ⊕ 3          x            x ⊕ 3 \n...            ...          ... \nx ⊖ 2          x            x ⊖ 2 \nx ⊖ 1          x            x ⊖ 1 \n\nIf the unit's last output value was a Vk, then the FSA might be in any of its k + 1 possible states. As can be seen, if at this point any of the possible transitions is given as input, the next state can also be any of the k + 1 possible states. Therefore, no future input can ever serve to lessen the unit's ambiguity. \n\nTheorem 1 An RCC network composed of any finite number of DRUk's cannot model an RFk+1. \n\nProof: Let us describe the transitions of an RCC network of N units by using the following notation: \n\n((V_{N-1}, V_{N-2}, ..., V_1, V_0), i) => (V'_{N-1}, V'_{N-2}, ..., V'_1, V'_0), \n\nwhere V_m is the output value of the m'th unit (i.e., U_m) before the given input, i, is seen by the network, and V'_m is U_m's value after i has been processed by the network. The first unit, U_0, receives only i and V_0 as input. Every other unit U_x receives as input i and V_x as well as V'_y, y < x. \n\nLemma 1 shows that the first unit, U_0, will eventually generate a value V_0^k, which can be generated in any of the RFk+1 states. From Lemma 2, the unit will continue to produce V_0^k values after this point. \n\nGiven any finite number N of DRUk's, U_{N-1}, ..., U_0, that are producing their Vk values, V_{N-1}^k, ..., V_0^k, the next higher unit, U_N, will be incapable of disambiguating all states by itself; i.e., at least two FSA states, a and b, will have overlapping output values, V_N^{a/b}. Since none of the units U_{N-1}, ... 
, U_0 can distinguish between any states (including a and b), \n\n((V_N^{a/b}, V_{N-1}^k, ..., V_1^k, V_0^k), b) => (V_N^{a/(b ⊕ 1)}, V_{N-1}^k, ..., V_1^k, V_0^k), \n\nassuming that b ⊖ a >= a ⊖ b and k > 1. The remainder of the proof follows identically along the lines developed for Lemmas 1 and 2. The result of this development is that U_N also has a set of reachable output values V_N^k that can be produced in any state of the FSA. Once one such value is produced, no less-ambiguous value is ever generated. Since no RCC network containing any number of DRUk's can over time distinguish among any states of an RFk+1, no such RCC network can model such an FSA. \n\n4 Continuous Transfer Functions \n\nSigmoid functions can generate a theoretically infinite number of output values; if represented with 32 bits, they can generate 2^32 outputs. This hardly means, however, that all such values are of use. In fact, as was shown by Giles et al. (1995), if the input remains constant for a long enough period of time (as it can in all RFk+1's), the output of a sigmoid unit will converge to a constant value (a fixed point) or oscillate between two values. This means that a unit with a sigmoid transfer function is in principle a DRU2. Most useful continuous transfer functions (radial-basis functions, for example) exhibit the same property, reducing to only a small number of distinct output values when given the same input repeatedly. The results shown here are therefore not merely theoretical, but are of real practical significance and apply to any network whose recurrent links are restricted to self-connections. 
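The reduction of a sigmoid unit to a DRU2 can be illustrated numerically: iterating a self-recurrent sigmoid under constant input settles onto a fixed point or a period-two oscillation. The weights below are arbitrary, hypothetical choices made only to exhibit both behaviors.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(v, w_self, w_in, bias, i):
    # One update of a self-recurrent sigmoid unit driven by constant input i.
    return sigmoid(w_self * v + w_in * i + bias)

def limit_set(w_self, w_in, bias, i, transient=500, tail=10):
    v = 0.3
    for _ in range(transient):          # discard the transient
        v = step(v, w_self, w_in, bias, i)
    seen = set()
    for _ in range(tail):               # record the values that survive
        v = step(v, w_self, w_in, bias, i)
        seen.add(round(v, 6))
    return seen

# Strong negative self-weight: the unit flips between two values (period two).
oscillating = limit_set(-8.0, 1.0, 4.0, 0.0)
# Moderate self-weight: the unit contracts to a single fixed point.
converging = limit_set(3.0, 1.0, -1.5, 0.0)
print(len(oscillating), len(converging))
```

In both regimes at most two output values survive in the limit, which is the sense in which a sigmoid unit under repeated input behaves as a DRU2.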
\n\n5 Conclusion \n\nNo RCC network can model any FSA containing an RFk+1 (such as that shown in Figure 3), given units limited to generating k possible output values, regardless of the sophistication of the transfer function that generates these values. This places an upper bound on the computational capabilities of an RCC network. Less sophisticated transfer functions, such as the sigmoid units investigated by Giles et al. and Kremer, may have even greater limitations. Figure 2, for example, could be modeled by a single sufficiently sophisticated DRU2, but cannot be modeled by an RCC network composed of hard-threshold or sigmoidal units (Giles et al., 1995; Kremer, 1996), because these units cannot exploit all mappings from inputs to outputs. By not assuming arbitrary transfer functions, previous proofs could not isolate the network's structure as the source of RCC's limitations. \n\nReferences \n\nBachrach, J. R. (1988). Learning to represent state. Master's thesis, Department of Computer and Information Sciences, University of Massachusetts, Amherst, MA 01003. \n\nFahlman, S. E. (1991). The recurrent cascade-correlation architecture. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 190-196, San Mateo, California. Morgan Kaufmann Publishers. \n\nGiles, C., Chen, D., Sun, G., Chen, H., Lee, Y., and Goudreau, M. (1995). Constructive learning of recurrent neural networks: Problems with recurrent cascade correlation and a simple solution. IEEE Transactions on Neural Networks, 6(4):829. \n\nKremer, S. C. (1996). Finite state automata that recurrent cascade-correlation cannot represent. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 679-686. MIT Press. 
In Press. \n\nMozer, M. C. (1988). A focused back-propagation algorithm for temporal pattern recognition. Technical Report CRG-TR-88-3, Department of Psychology, University of Toronto. \n", "award": [], "sourceid": 1426, "authors": [{"given_name": "Mark", "family_name": "Ring", "institution": null}]}