{"title": "RCC Cannot Compute Certain FSA, Even with Arbitrary Transfer Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 619, "page_last": 625, "abstract": "", "full_text": "RCC  Cannot  Compute  Certain FSA, \n\nEven with  Arbitrary Transfer  Functions \n\nRWCP Theoretical  Foundation GMD  Laboratory \n\nGMD - German National Research  Center for  Information Technology \n\nMark  Ring \n\nSchloss  Birlinghoven \n\nD-53  754  Sankt Augustin,  Germany \n\nemail:  Mark .Ring@GMD.de \n\nAbstract \n\nExisting proofs demonstrating the computational limitations of Re(cid:173)\ncurrent  Cascade Correlation and similar networks  (Fahlman, 1991; \nBachrach,  1988;  Mozer,  1988)  explicitly limit their  results  to  units \nhaving sigmoidal or hard-threshold  transfer functions  (Giles et  aI., \n1995;  and  Kremer,  1996).  The  proof given  here  shows  that  for \nany  finite,  discrete  transfer  function  used  by  the  units of an  RCC \nnetwork,  there  are  finite-state  automata  (FSA)  that  the  network \ncannot model, no matter how  many units are  used.  The proof also \napplies  to  continuous  transfer  functions  with  a  finite  number  of \nfixed-points,  such  as  sigmoid and  radial-basis functions. \n\n1 \n\nIntroduction \n\nThe  Recurrent  Cascade  Correlation  (RCC)  network  was  proposed  by  Fahlman \n(1991)  to offer  a fast  and efficient  alternative to fully  connected recurrent  networks. \nThe network is  arranged such that each unit has only a single recurrent  connection: \nthe  connection  that  goes  from  itself to  itself.  Networks  with  the  same  structure \nhave  been  proposed  by  Mozer  (Mozer,  1988)  and  Bachrach  (Bachrach,  1988).  This \nstructure  is  intended  to allow simplified training of recurrent  networks  in  the hopes \nof making them computationally feasible.  
However, this increase in efficiency comes at the cost of computational power: the networks' computational capabilities are limited regardless of the power of their activation functions. The remaining input to each unit consists of the input to the network as a whole together with the outputs from all units lower in the RCC network. Since it is the structure of the network and not the learning algorithm that is of interest here, only the structure will be described in detail. \n\nFigure 1: This finite-state automaton was shown by Giles et al. (1995) to be unrepresentable by an RCC network whose units have hard-threshold or sigmoidal transfer functions. The arcs are labeled with transition labels of the FSA, which are given as input to the RCC network. The nodes are labeled with the output values that the network is required to generate. The node with an inner circle is an accepting or halting state. \n\nFigure 2: This finite-state automaton is one of those shown by Kremer (1996) not to be representable by an RCC network whose units have a hard-threshold or sigmoidal transfer function. This FSA computes the parity of the inputs seen so far. \n\nThe functionality of a network of N RCC units, U0, ..., U(N-1), can be described in the following way: \n\nV0(t) = f0(I(t), V0(t - 1)),  (1) \nVx(t) = fx(I(t), Vx(t - 1), V(x-1)(t), V(x-2)(t), ..., V0(t)),  (2) \n\nwhere Vx(t) is the output value of Ux at time step t, and I(t) is the input to the network at time step t. The value of each unit is determined from: (1) the network input at the current time step, (2) its own value at the previous time step, and (3) the output values of the units lower in the network at the current time step. 
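To make the cascade structure concrete, here is a minimal sketch of the update described by Equations 1 and 2; the two hard-threshold transfer functions at the end are arbitrary illustrative choices, not part of the original architecture.

```python
# Minimal sketch of an RCC-style update (Equations 1 and 2): unit x sees the
# network input, its own previous output, and the current outputs of all
# lower units. The transfer functions are arbitrary illustrative choices.

def rcc_step(fs, prev, i):
    # fs: one transfer function per unit, called as f(i, own_prev, lower)
    # prev: previous outputs V_x(t-1); i: network input I(t)
    new = []
    for x, f in enumerate(fs):
        new.append(f(i, prev[x], tuple(new)))  # lower units already updated
    return new

# Two hypothetical hard-threshold units.
f0 = lambda i, v, lower: 1 if i + v >= 1 else 0
f1 = lambda i, v, lower: 1 if i + v + sum(lower) >= 2 else 0

state = [0, 0]
for i in [1, 1, 0]:
    state = rcc_step([f0, f1], state, i)
```

Note that each unit's only recurrent input is `prev[x]`, its own previous output, which is exactly the structural restriction the proof exploits.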
Since learning is not being considered here, the weights are assumed to be constant. \n\n2 Existing Proofs \n\nThe proof of Giles et al. (1995) showed that an RCC network whose units have a hard-threshold or sigmoidal transfer function cannot produce outputs that oscillate with a period greater than two when the network input is constant. (An oscillation has a period of x if it repeats itself every x steps.) Thus, the FSA shown in Figure 1 cannot be modeled by such an RCC network, since its output (shown as node labels) oscillates at a period greater than two given constant input. Kremer (1996) refined the class of FSA representable by an RCC network, showing that, if the input to the net oscillates with period p, then the output can only oscillate with a period of w, where w is one of p's factors (or one of 2p's factors if p is odd). An unrepresentable example, therefore, is the parity FSA shown in Figure 2, whose output has a period of four given the following input (of period two): 0, 1, 0, 1, .... \n\nFigure 3: This finite-state automaton cannot be modeled with any RCC network whose units are capable of representing only k discrete outputs. The values within the circles are the state names and the output expected from the network. The arcs describe transitions from state to state, and their values represent the input given to the network when the transition is made. The dashed lines indicate an arbitrary number of further states between state 3 and state k which are connected in the same manner as states 1, 2, and 3. (All states are halting states.) \n\nBoth proofs, that by Giles et al. and that by Kremer, are explicitly designed with 
\n\nhard-threshold and sigmoidal transfer functions in mind, and can say nothing about \nother  transfer functions.  In other  words,  these  proofs  do  not  demonstrate  the  lim(cid:173)\nitations of the  RCC-type  network  structure,  but  about  the  use  of threshold  units \nwithin  this  structure.  The  following  proof is  the  first  that  actually  demonstrates \nthe limitations of the single-recurrent-link  network  structure. \n\n3  Details  of the  Proof \n\nThis section  proves  that  RCC networks  are  incapable even in principle of modeling \ncertain kinds of FSA, regardless of the sophistication of each unit's transfer function, \nprovided  only  that  the  transfer  function  be  discrete  and finite,  meaning only  that \nthe units of the RCC network are capable of generating a fixed number, k, of distinct \noutput values.  (Since all functions implemented on a discrete computer fall into this \ncategory,  this  assumption is  minor.  Furthermore, as  will  be  discussed  in  Section 4, \nthe outputs of most interesting continuous transfer functions reduce  to only a small \nnumber  of distinct  values.)  This  generalized  RCC  network  is  proven  here  to  be \nincapable of modeling the finite-state  automaton shown  in  Figure  3. \n\n\f622 \n\nMRing \n\nFor ease of exposition, let  us  call any FSA  of the form shown  in  Figure 3 an  RFk+l \nfor  Ring  FSA  with  k + 1  states. I Further,  call  a  unit  whose  output  can  be  any  of \nk  distinct  values  and  whose  input  includes  its  own  previous  output,  a  DRUk  for \nDiscrete  Recurrent  Unit.  These units are  a generalization ofthe units used  by  RCC \nnetworks  in  that  the  specific  transfer  function  is  left  unspecified.  By  proving  the \nnetwork is limited when  its units are DRUbs proves the limitations of the network's \nstructure  regardless  of the  transfer  function  used. 
\n\nClearly,  a  DRUk+1  with a sufficiently sophisticated transfer function  could  by itself \nmodel  an  RFk+1  by  simply  allocating  one  of  its  k + 1  output  values  for  each  of \nthe  k + 1 states.  At  each  step  it  would  receive  as  input  the  last  state  of the  FSA \nand the next  transition and  could  therefore  compute the next  state.  By  restricting \nthe  units  in  the  least  conceivable  manner,  i.e.,  by  reducing  the  number of distinct \noutput  values  to  k,  the  RCC  network  becomes  incapable of modeling  any  RFk+1 \nregardless  of how  many DRUk's the  network  contains.  This will  now  be  proven. \n\nThe  proof is  inductive  and  begins  with  the  first  unit  in  the  network,  which,  after \nbeing given certain sequences  of inputs, becomes  incapable of distinguishing among \nany  states  of the  FSA.  The  second  step,  the  inductive  step,  proves  that  no  finite \nnumber  of such  units  can 'assist  a  unit  hi~her in  the  ReC  network  in  making  a \ndistinction  between  any states of the  RFk+  . \n\nLemma 1  No  DR Uk  whose  input  is  the  current transition of an  RFk+1  can  reliably \ndistinguish  among  any  states  of the  RP+I.  More  specifically,  at  least  one  of the \nDR Uk,s  k  output  values  can  be  generated  in  all  of the  RP+I 's  k + 1 states. \n\nProof:  Let  us  name  the  DRUbs  k  distinct  output  values  VO,  VI, ... , Vk-I.  The \nmapping function  implemented by  the  DRU k can  be expressed  as  follows: \n\n( V X  , i)  =}  VY, \n\nwhich  indicates that when  the unit's last output was  V X  and  its current  input is  i, \nthen  its next  output is  VY. \n\nSince  an  RFk  is  cyclical,  the  arithmetic  in  the  following  will  also  be  cyclical  (i.e., \nmodular): \n\nxtfJy  =  { x+y \nif x + y < k \nx+y-k  if x + y ~ k \n{ x-y \n\nif x 2:  y \nx+k-y  if x < y \n\nx8y \n\n-\n\nwhere  0 ~ x < k  and  0 ~ y < k. 
\n\nSince it is impossible for the DRUk to represent each of the RF(k+1)'s k + 1 states with a distinct output value, at least two of these states must be represented ambiguously by the same value. That is, there are two RF(k+1) states a and b and one DRUk value v(a/b) such that v(a/b) can be generated by the unit both when the FSA is in state a and when it is in state b. Furthermore, this value will be generated by the unit given an appropriate sequence of inputs. (Otherwise the value is unreachable, serves no purpose, and can be discarded, reducing the unit to a DRU(k-1).) \n\nOnce the DRUk has generated v(a/b), it cannot in the next step distinguish whether the FSA's current state is a or b. Since the FSA could be in either state a or b, the next state after a b transition could be either a or b ⊕ 1. That is: \n\n(v(a/b), b) ⇒ v(a/b⊕1),  (3) \n\nwhere a ⊖ b ≠ b ⊖ a and k > 1. This new output value v(a/b⊕1) can therefore be generated when the FSA is in either state a or state b ⊕ 1. By repeatedly replacing b with b ⊕ 1 in Equation 3, all states from b to a ⊖ 1 can be shown to share output values with state a, i.e., v(a/b), v(a/b⊕1), v(a/b⊕2), ..., v(a/a⊖2), v(a/a⊖1) all exist. \n\nRepeatedly substituting a ⊖ 1 and a for a and b respectively in the last paragraph produces values v(x/y) for all x, y ∈ {0, 1, ..., k}. There is, therefore, at least one value that can be generated by the unit in both states of every possible pair of states. Since there are C(k+1, 2) distinct pairs of states but only k distinct output values, and since \n\n⌈C(k+1, 2) / k⌉ > 1 \n\nwhen k > 1, not all of these pairs can be represented by unique v values. \n\n(1) Thanks to Mike Mozer for suggesting this catchy name. 
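The counting step above can be sanity-checked numerically (a sketch using the standard binomial coefficient):

```python
from math import comb

# Pigeonhole check: C(k+1, 2) unordered pairs of FSA states versus only k
# output values; for every k > 1 there are more pairs than values, so some
# two pairs must share an output value.
for k in range(2, 100):
    assert comb(k + 1, 2) > k
```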
At least two of these pairs must share the same output value, and this implies that some v(a/b/c) exists that can be output by the unit in any of the three FSA states a, b, and c. \n\nStarting with \n\n(v(a/b/c), c) ⇒ v(a/b/c⊕1), \n\nand following the same argument given above for v(a/b), there must be a v(x/y/z) for all triples of states x, y, and z. Since there are C(k+1, 3) distinct triples but only k distinct output values, and since \n\n⌈C(k+1, 3) / k⌉ > 1 \n\nwhere k > 3, some v(a/b/c/d) must also exist. \n\nThis argument can be followed repeatedly, since \n\n⌈C(k+1, m) / k⌉ > 1 \n\nfor all m < k + 1, including when m = k. Therefore, there is at least one v(0/1/2/.../k) that can be output by the unit in all k + 1 states of the RF(k+1). Call this value, and any other that can be generated in all FSA states, Vk. All Vk's are reachable (else they could be discarded and the above proof applied for DRUl, l < k). When a Vk is output by a DRUk, it does not distinguish any states of the RF(k+1). \n\nLemma 2 Once a DRUk outputs a Vk, all future outputs will also be Vk's. \n\nProof: The proof is simply by inspection, and is shown in the following table: \n\nActual State   Transition   Next State \nx              x            x ⊕ 1 \nx ⊕ 1          x            x \nx ⊕ 2          x            x ⊕ 2 \nx ⊕ 3          x            x ⊕ 3 \n...            ...          ... \nx ⊖ 2          x            x ⊖ 2 \nx ⊖ 1          x            x ⊖ 1 \n\nIf the unit's last output value was a Vk, then the FSA might be in any of its k + 1 possible states. As can be seen, if at this point any of the possible transitions is given as input, the next state can also be any of the k + 1 possible states. Therefore, no future input can ever serve to lessen the unit's ambiguity. \n\nTheorem 1 An RCC network composed of any finite number of DRUk's cannot model an RF(k+1). 
\n\nProof:  Let  us  describe  the  transitions of an  RCC  network  of N  units  by  using the \nfollowing notation: \n\n((VN-I , VN-2, ... , VI, Va), i)  ~ (VN- I , VN-2, ... , V{,  V~), \n\nwhere  Vrn  is  the  output  value  of the  m'th  unit  (i.e.,  Urn)  before  the  given  input, \ni,  is  seen  by  the  network,  and  V~ is  Urn's  value  after  i  has  been  processed  by  the \nnetwork.  The  first  unit,  Uo, receives  only  i  and  Va  as  input.  Every  other  unit  Ux \nreceives  as  input i  and  Vx  as well  as  v~, y < x. \nLemma 1 shows  that  the  first  unit,  Uo,  will  eventually generate  a  value vl, which \ncan be generated in any of the RFk+1  states.  From Lemma 2,  the unit will continue \nto  produce  vl  values after  this point. \nGiven  any  finite  number  N  of DRUk,s,  Urn-I, ... , Uo that  are  producing  their  Vk \nvalues,  V~ -1' .. . , Vt,  the  next  higher  unit , UN,  will  be incapable of disambiguating \nall  states  by  itself,  i.e.,  at  least  two  FSA  states,  a  and  b,  will  have  overlapping \noutput  values,  V;,p.  Since  none  of the  units UN-I, ... , Uo can  distinguish  between \nany states  (including a and b), \n\n( ( a / b \n\nk \n\nVN  ,VN-I,\u00b7\u00b7 \u00b7, VI'VO ' )~ VN \n\nk \n\nk)  b \n\n(a / b (JJ 1 \n\nk \n\nk ) \n'VN- I '\u00b7\u00b7 \u00b7,  I'VO ' \n\nVk \n\nassuming that be a ~ ae b and k > 1.  The remainder of the prooffollows identically \nalong  the  lines  developed  for  Lemmas  1  and  2.  The  result  of this  development  is \nthat  UN  also  has  a set  of reachable  output values  V~ that  can  be produced  in  any \nstate of the  FSA.  Once one such value is  produced,  no less-ambiguous value is ever \ngenerated.  Since no  RCC network containing any number of DRU k 's  can  over  time \ndistinguish  among any states  of an  RFHI,  no  such  RCC  network  can  model such \nan  FSA. 
\n\n4  Continuous  Transfer  Functions \n\nSigmoid  functions  can  generate  a  theoretically  infinite  number  of  output  values; \nif represented  with  32  bits,  they  can  generate  232  outputs.  This  hardly  means, \nhowever,  that all such values are of use.  In fact,  as was shown by  Giles et al.  (1995), \nif  the  input  remains  constant  for  a  long  enough  period  of time  (as  it  can  in  all \nRFHI'S) ,  the  output  of sigmoid  units  will  converge  to  a  constant  value  (a  fixed \npoint)  or  oscillate  between  two  values.  This  means  that  a  unit  with  a  sigmoid \ntransfer function  is  in  principle a  DRU 2 .  Most  useful  continuous  transfer  functions \n(radial-basis functions,  for  example), exhibit the same property,  reducing  to  only a \nsmall number of distinct  output values when  given  the same input repeatedly.  The \nresults  shown  here  are  therefore  not  merely  theoretical,  but  are  of real  practical \nsignificance  and  apply  to  any  network  whose  recurrent  links  are  restricted  to  self \nconnections. \n\n5  Concl usion \n\nNo  RCC  network  can  model  any  FSA  containing  an  RFk+1  (such  as  that  shown \nin  Figure  3),  given  units  limited to  generating  k  possible  output  values,  regardless \n\n\fRCC Cannot Compute Certain FSA,  Even with Arbitrary Transfer Functions \n\n625 \n\nof  the  sophistication  of  the  transfer  function  that  generates  these  values.  This \nplaces  an  upper  bound on  the computational capabilities of an RCC  network.  Less \nsophisticated  transfer  functions,  such  as  the  sigmoid units investigated  by  Giles  et \nal.  and  Kremer  may  have  even  greater  limitations.  
Figure 2, for example, could be modeled by a single sufficiently sophisticated DRU2, but cannot be modeled by an RCC network composed of hard-threshold or sigmoidal units (Giles et al., 1995; Kremer, 1996) because these units cannot exploit all mappings from inputs to outputs. By not assuming arbitrary transfer functions, previous proofs could not isolate the network's structure as the source of RCC's limitations. \n\nReferences \n\nBachrach, J. R. (1988). Learning to represent state. Master's thesis, Department of Computer and Information Sciences, University of Massachusetts, Amherst, MA 01003. \n\nFahlman, S. E. (1991). The recurrent cascade-correlation architecture. In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 190-196, San Mateo, California. Morgan Kaufmann Publishers. \n\nGiles, C., Chen, D., Sun, G., Chen, H., Lee, Y., and Goudreau, M. (1995). Constructive learning of recurrent neural networks: Problems with recurrent cascade correlation and a simple solution. IEEE Transactions on Neural Networks, 6(4):829. \n\nKremer, S. C. (1996). Finite state automata that recurrent cascade-correlation cannot represent. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 679-686. MIT Press. \n\nMozer, M. C. (1988). A focused back-propagation algorithm for temporal pattern recognition. Technical Report CRG-TR-88-3, Department of Psychology, University of Toronto. \n", "award": [], "sourceid": 1426, "authors": [{"given_name": "Mark", "family_name": "Ring", "institution": null}]}