{"title": "Adaptive Nonlinear System Identification with Echo State Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 609, "page_last": 616, "abstract": "", "full_text": "Adaptive Nonlinear  System Identification \n\nwith Echo  State Networks \n\nHerbert  Jaeger \n\nInternational University Bremen \n\nD-28759  Bremen,  Germany \n\nh.jaeger@iu-bremen. de \n\nAbstract \n\nEcho state networks  (ESN)  are a  novel approach to recurrent neu(cid:173)\nral  network training.  An  ESN  consists  of a  large,  fixed,  recurrent \n\"reservoir\"  network, from  which the desired output is  obtained by \ntraining suitable output connection weights.  Determination of op(cid:173)\ntimal  output  weights  becomes  a  linear,  uniquely  solvable  task  of \nMSE  minimization.  This  article  reviews  the  basic  ideas  and  de(cid:173)\nscribes  an  online  adaptation scheme  based  on the  RLS  algorithm \nknown  from  adaptive  linear  systems.  As  an  example,  a  10-th or(cid:173)\nder  NARMA  system  is  adaptively identified.  The known  benefits \nof the RLS  algorithms carryover from  linear systems to nonlinear \nones;  specifically,  the  convergence rate and  misadjustment  can be \ndetermined at design  time. \n\n1 \n\nIntroduction \n\nIt is  fair  to say that difficulties  with  existing  algorithms  have so far  precluded  su(cid:173)\npervised training techniques for recurrent neural networks (RNNs)  from widespread \nuse.  Echo  state  networks  (ESNs)  provide  a  novel  and  easier  to  manage  approach \nto supervised training of RNNs.  A  large  (order  of 100s of units)  RNN is  used  as a \n\"reservoir\"  of dynamics  which  can  be  excited  by  suitably  presented  input  and/or \nfed-back output.  The connection weights of this reservoir network are not changed \nby  training.  In  order to  compute  a  desired  output  dynamics,  only  the  weights  of \nconnections from  the reservoir to the output units are calculated.  This boils  down \nto  a  linear  regression.  The theory of ESNs,  references  and many examples  can be \nfound  in  [5]  [6].  A  tutorial  is  [7].  A  similar idea has  recently  been  independently \ninvestigated in a  more biologically oriented setting under the name of \"liquid state \nnetworks\"  [8]  [9]. \n\nIn  this  article  I  describe  how  ESNs  can  be  conjoined  with  the  \"recursive  least \nsquares\"  (RLS)  algorithm, a  method for  fast  online  adaptation known  from  linear \nsystems.  The  resulting  RLS-ESN  is  capable  of tracking  a  10-th  order  nonlinear \nsystem  with  high  quality  in  convergence  speed  and  residual  error.  Furthermore, \nthe approach yields  apriori estimates of tracking performance parameters and thus \nallows one to design nonlinear trackers according to specificationsl . \n\n1 All \n\nalgorithms \n\nand \n\ncalculations \n\ndescribed  m \n\nthis \n\narticle \n\nare \n\ncon-\n\n\fArticle  organization.  Section  2 recalls  the basic ideas  and  definitions  of ESNs  and \nintroduces  an  augmentation  of the  basic  technique.  Section  3  demonstrates  ESN \nomine learning on the 10th order system identification task.  Section 4 describes the \nprinciples of using the RLS algorithm with ESN networks and presents a simulation \nstudy.  Section 5 wraps up. \n\n2  Basic ideas  of echo  state networks \n\nFor the sake of a  simple notation, in this  article I  address only single-input, single(cid:173)\noutput systems  (general treatment in  [5]).  
We consider a discrete-time "reservoir" RNN with N internal network units, a single extra input unit, and a single extra output unit. The input at time n ≥ 1 is u(n), activations of internal units are x(n) = (x_1(n), ..., x_N(n)), and the activation of the output unit is y(n). Internal connection weights are collected in an N x N matrix W = (w_ij), weights of connections going from the input unit into the network in an N-element (column) weight vector w^in = (w_i^in), and the N+1 (input-and-network)-to-output connection weights in an (N+1)-element (row) vector w^out = (w_i^out). The output weights w^out will be learned; the internal weights W and the input weights w^in are fixed before learning, typically in a sparse random connectivity pattern. Figure 1 sketches the setup used in this article.

[Figure 1: Basic setup of ESN (input unit, N internal reservoir units, output unit). Solid arrows: fixed weights; dashed arrows: trainable weights.]

The activations of the internal units and of the output unit are updated according to

    x(n+1) = f(W x(n) + w^in u(n+1) + v(n+1)),                    (1)
    y(n+1) = f^out(w^out (u(n+1), x(n+1))),                       (2)

where f stands for an element-wise application of the unit nonlinearity, for which we here use tanh; v(n+1) is an optional noise vector; (u(n+1), x(n+1)) is a vector concatenated from u(n+1) and x(n+1); and f^out is the output unit's nonlinearity (tanh will be used here, too). Training data is a stationary I/O signal (u_teach(n), y_teach(n)). When the network is updated according to (1), then under certain conditions the network state becomes asymptotically independent of initial conditions. More precisely, if the network is started from two arbitrary states x(0), x'(0) and is run with the same input sequence in both cases, the resulting state sequences x(n), x'(n) converge to each other. If this condition holds, the reservoir network state will asymptotically depend only on the input history, and the network is said to be an echo state network (ESN). A sufficient condition for the echo state property is contractivity of W. In practice it was found that a weaker condition suffices, namely, to ensure that the spectral radius |λ_max| of W is less than unity. [5] gives a detailed account.
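The following is a minimal sketch, in Python/NumPy rather than the Mathematica of the accompanying notebook, of how such a reservoir and the update equations (1) and (2) can be set up. The concrete numbers (N = 100 units, 5% connectivity, spectral radius 0.8, input weights from [-0.1, 0.1], noise amplitude 0.0001) are the ones used in Section 3; the variable and function names are mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

# Sparse random reservoir: roughly 5% of the entries nonzero, uniform over [-1, 1].
W = rng.uniform(-1.0, 1.0, size=(N, N)) * (rng.random((N, N)) < 0.05)
# Rescale to spectral radius 0.8, so that |lambda_max| < 1 (echo state property).
W *= 0.8 / np.max(np.abs(np.linalg.eigvals(W)))
# Input weights, uniform over [-0.1, 0.1].
w_in = rng.uniform(-0.1, 0.1, size=N)

def esn_update(x, u, noise=1e-4):
    """One application of Eq. (1): x(n+1) = tanh(W x(n) + w_in u(n+1) + v(n+1))."""
    v = rng.uniform(-noise, noise, size=N)   # 'size 0.0001' read here as noise amplitude
    return np.tanh(W @ x + w_in * u + v)

def esn_output(x, u, w_out):
    """Eq. (2): y(n+1) = tanh(w_out . (u(n+1), x(n+1))), with f_out = tanh."""
    return np.tanh(w_out @ np.concatenate(([u], x)))
```

Only the output weight vector passed to esn_output is ever trained; W and w_in stay fixed after this construction.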
Consider the task of computing the output weights such that the teacher output is approximated by the network. In the ESN approach, this task is spelled out concretely as follows: compute w^out such that the training error

    ε_train(n) = ((f^out)^(-1) y_teach(n) - w^out (u_teach(n), x(n)))²        (3)

is minimized in the mean square sense. Note that the effect of the output nonlinearity is undone by (f^out)^(-1) in this error definition. We dub (f^out)^(-1) y_teach(n) the teacher pre-signal and w^out (u_teach(n), x(n)) the network's pre-output. The computation of w^out is a linear regression. Here is a sketch of an offline algorithm for the entire learning procedure:

1. Fix an RNN with a single input and a single output unit, scaling the weight matrix W such that |λ_max| < 1 obtains.
2. Run this RNN by driving it with the teaching input signal. Dismiss data from the initial transient and collect the remaining input+network states (u_teach(n), x_teach(n)) row-wise into a matrix M. Simultaneously, collect the remaining training pre-signals (f^out)^(-1) y_teach(n) into a column vector r.
3. Compute the pseudo-inverse M^(-1), and put w^out = (M^(-1) r)^T (where T denotes transpose).
4. Write w^out into the output connections; the ESN is now trained.

The modeling power of an ESN grows with network size. A cheaper way to increase the power is to use additional nonlinear transformations of the network state x(n) for computing the network output in (2). We use here a squared version of the network state. Let w^out_squares denote a length-(2N+2) output weight vector and x_squares(n) the length-(2N+2) (column) vector (u(n), x_1(n), ..., x_N(n), u²(n), x_1²(n), ..., x_N²(n)). Keep the network update (1) unchanged, but compute outputs with the following variant of (2):

    y(n+1) = f^out(w^out_squares x_squares(n+1)).                            (4)

The "reservoir" and the input are now tapped by linear and quadratic connections. The learning procedure remains linear and now goes like this:

1. (unchanged)
2. Drive the ESN with the training input. Dismiss the initial transient and collect the remaining augmented states x_squares(n) row-wise into M. Simultaneously, collect the training pre-signals (f^out)^(-1) y_teach(n) into a column vector r.
3. Compute the pseudo-inverse M^(-1), and put w^out_squares = (M^(-1) r)^T.
4. The ESN is now ready for exploitation, using output formula (4).

3 Identifying a 10th-order system: offline case

In this section the workings of the augmented algorithm will be demonstrated with a nonlinear system identification task. The system was introduced in a survey-and-unification paper [1]. It is a 10th-order NARMA system:

    d(n+1) = 0.3 d(n) + 0.05 d(n) [Σ_{i=0}^{9} d(n-i)] + 1.5 u(n-9) u(n) + 0.1.        (5)

Network setup. An N = 100 ESN was prepared by fixing a random, sparse connection weight matrix W (connectivity 5%, non-zero weights sampled from the uniform distribution over [-1, 1]; the resulting raw matrix was re-scaled to a spectral radius of 0.8, thus ensuring the echo state property). An input unit was attached with a random weight vector w^in sampled from a uniform distribution over [-0.1, 0.1].

Training data and training. An I/O training sequence was prepared by driving the system (5) with an i.i.d. input sequence sampled from the uniform distribution over [0, 0.5], as in [1]. The network was run according to (1) with the training input for 1200 time steps, with uniform noise v(n) of size 0.0001. Data from the first 200 steps were discarded. The remaining 1000 network states were entered into the augmented training algorithm, and a 202-length augmented output weight vector w^out_squares was calculated.

Testing. The learnt output vector was installed and the network was run from a zero starting state with newly created testing input for 2200 steps, of which the first 200 were discarded. From the remaining 2000 steps, the test error NMSE_test = E[(y(n) - d(n))²] / σ²(d) was estimated. A value of NMSE_test ≈ 0.032 was found.
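Purely as an illustration, one way the offline experiment just described might be coded is sketched below. It continues the NumPy sketch from Section 2 (it assumes N, W, w_in and esn_update from there); narma10, augmented_state and train_offline are my own names, the state/target alignment is one plausible convention, and the clipping before arctanh is a safeguard of mine, not part of the original procedure.

```python
import numpy as np

def narma10(u):
    """Teacher system (5), driven by an input sequence u."""
    d = np.zeros_like(u)
    for n in range(9, len(u) - 1):
        d[n + 1] = (0.3 * d[n] + 0.05 * d[n] * np.sum(d[n - 9:n + 1])
                    + 1.5 * u[n - 9] * u[n] + 0.1)
    return d

def augmented_state(u_n, x):
    """x_squares(n) = (u, x_1..x_N, u^2, x_1^2..x_N^2), length 2N + 2."""
    z = np.concatenate(([u_n], x))
    return np.concatenate((z, z ** 2))

def train_offline(u, d, washout=200):
    """Harvest augmented states, then solve the linear regression of step 3."""
    x = np.zeros(N)
    rows, pre = [], []
    for n in range(len(u)):
        x = esn_update(x, u[n])
        if n >= washout:
            rows.append(augmented_state(u[n], x))
            # teacher pre-signal (f_out)^{-1} d(n); clipped to keep arctanh finite
            pre.append(np.arctanh(np.clip(d[n], -0.999, 0.999)))
    M = np.array(rows)                          # one augmented state per row
    return np.linalg.pinv(M) @ np.array(pre)    # w_out_squares = (M^+ r)^T as a vector

u_train = np.random.default_rng(1).uniform(0.0, 0.5, 1200)
w_out_sq = train_offline(u_train, narma10(u_train))
```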
Comments. (1) The noise term v(n) functions as a regularizer, slightly compromising the training error but improving the test error. (2) Generally, the larger an ESN, the more training data is required and the more precise the learning. Set up exactly like the 100-unit example described above, an augmented 20-unit ESN trained on 500 data points gave NMSE_test ≈ 0.31, a 50-unit ESN trained on 1000 points gave NMSE_test ≈ 0.084, and a 400-unit ESN trained on 4000 points gave NMSE_test ≈ 0.0098.

Comparison. The best NMSE training [!] error obtained in [1] on a length-200 training sequence was NMSE_train ≈ 0.241.² However, the levels of precision reported in [1] and many other published papers about RNN training appear to be based on suboptimal training schemes. After submission of this paper I went into a friendly modeling competition with Danil Prokhorov, who expertly applied EKF-BPTT techniques [3] to the same tasks. His results improve on the results of [1] by an order of magnitude and reach a slightly better precision than the results reported here.

² The authors miscalculated their NMSE because they used a formula for zero-mean signals. I re-calculated the value NMSE_train ≈ 0.241 from their reported best (miscalculated) NMSE of 0.015. The larger value agrees with the plots supplied in that paper.

4 Online adaptation of ESN networks

Because the determination of optimal (augmented) output weights is a linear task, standard recursive algorithms for MSE minimization known from adaptive linear signal processing can be applied to online ESN estimation. I assume that the reader is familiar with the basic idea of FIR tap-weight (Wiener) filters: N input signals x_1(n), ..., x_N(n) are transformed into an output signal y(n) by an inner product with a tap-weight vector (w_1, ..., w_N): y(n) = w_1 x_1(n) + ... + w_N x_N(n). In the ESN context, the input signals are the 2N+2 components of the augmented input+network state vector, the tap-weight vector is the augmented output weight vector, and the output signal is the network pre-output (f^out)^(-1) y(n).

4.1 A refresher on adaptive linear system identification

For a recursive online estimation of tap-weight vectors, "recursive least squares" (RLS) algorithms are widely used in linear signal processing when fast convergence is of prime importance. A good introduction to RLS is given in [2], whose notation I follow. An online algorithm in the augmented ESN setting should do the following: given an open-ended, typically non-stationary training I/O sequence (u_teach(n), y_teach(n)), at each time n ≥ 1 determine an augmented output weight vector w^out_squares(n) which yields a good model of the current teacher system.

Formally, an RLS algorithm for ESN output weight update minimizes the exponentially discounted squared "pre-error"

    Σ_{k=1}^{n} λ^{n-k} ((f^out)^(-1) y_teach(k) - (f^out)^(-1) y^[n](k))²,        (6)

where λ < 1 is the forgetting factor and y^[n](k) is the model output that would be obtained at time k if a network with the current output weights w^out_squares(n) were employed at all times k = 1, ..., n.

There are many variants of RLS algorithms minimizing (6), differing in their trade-offs between computational cost, simplicity, and numerical stability. I use a "vanilla" version, which is detailed in Table 12.1 of [2] and in the web tutorial package accompanying this paper.
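To make the recursion concrete, here is a hedged sketch of one exponentially weighted RLS step as it could be applied to the augmented state vector. The helper name rls_step and the variable names (w, P, z, lam) are mine and do not follow the notation of [2] literally; the initialization with a small constant delta is the usual textbook prescription, not a value taken from the paper.

```python
import numpy as np

def rls_step(w, P, z, d_pre, lam=0.995):
    """One recursive least squares update of the output weight vector.

    w     : current output weights, shape (2N+2,)
    P     : running estimate of the inverse input correlation matrix, shape (2N+2, 2N+2)
    z     : augmented input+network state (the 'tap inputs'), shape (2N+2,)
    d_pre : desired pre-output (f_out)^{-1} y_teach(n) at this step
    lam   : forgetting factor (lambda < 1)
    """
    Pz = P @ z
    k = Pz / (lam + z @ Pz)            # gain vector
    e = d_pre - w @ z                  # a-priori error on the pre-signal
    w = w + k * e                      # weight update
    P = (P - np.outer(k, Pz)) / lam    # update of the inverse correlation matrix
    return w, P

# Typical initialization: zero weights and P(0) = delta^{-1} * I with a small delta.
dim = 2 * 100 + 2
w0, P0 = np.zeros(dim), np.eye(dim) / 1e-2
```

Each step costs O((2N+2)²) operations, which is the cost figure referred to in the Discussion.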
Two parameters characterise the tracking performance of an RLS algorithm: the misadjustment M and the convergence time constant τ. The misadjustment gives the ratio between the excess MSE (or excess NMSE) incurred by the fluctuations of the adaptation process and the optimal steady-state MSE that would be obtained in the limit of offline training on infinite stationary training data. For instance, a misadjustment of M = 0.3 means that the tracking error of the adaptive algorithm in a steady-state situation exceeds the theoretically achievable optimum (with the same tap-weight vector length) by 30%. The time constant τ associated with an RLS algorithm determines the exponent of the MSE convergence, e^(-n/τ). For example, τ = 200 would imply an excess MSE reduction by 1/e every 200 steps. Misadjustment and convergence exponent are related to the forgetting factor λ and the tap-vector length N through

    M ≈ N (1-λ) / (1+λ)    and    τ ≈ 1 / (1-λ).        (7)

4.2 Case study: RLS-ESN for our 10th-order system

Eqns. (7) can be used to predict/design the tracking characteristics of an RLS-powered ESN. I will demonstrate this with the 10th-order system (5). I re-use the same augmented 100-unit ESN, but now determine its 2N+2 output weight vector online with RLS. Setting λ = 0.995, and considering the tap-vector length N = 202, Eqns. (7) yield a misadjustment of M ≈ 0.5 and a time constant τ ≈ 200. Since the asymptotically optimal NMSE is approximately the NMSE of the offline-trained network, namely NMSE ≈ 0.032, the misadjustment M ≈ 0.5 lets us expect an NMSE of 0.032 × 150% ≈ 0.048 for the online adaptation after convergence. The time constant τ ≈ 200 makes us expect NMSE convergence to the expected asymptotic NMSE by a factor of 1/e every 200 steps.

Training data. Experiments with the system (5) revealed that the system sometimes explodes when driven with i.i.d. input from [0, 0.5]. To bound outputs, I wrapped the r.h.s. of (5) with a tanh. Furthermore, I replaced the original constants 0.3, 0.05, 1.5, 0.1 by free parameters α, β, γ, δ, to obtain

    d(n+1) = tanh(α d(n) + β d(n) [Σ_{i=0}^{9} d(n-i)] + γ u(n-9) u(n) + δ).        (8)

This system was run for 10000 steps with an i.i.d. teacher input from [0, 0.5]. Every 2000 steps, α, β, γ, δ were assigned new random values taken from a ±50% interval around the respective original constants. Fig. 2A shows the resulting teacher output sequence, which clearly shows transitions between different "episodes" every 2000 steps.
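For concreteness, a small sketch (again my own code, with an assumed function name and a fixed random seed, not material from the paper) of how such a drifting teacher sequence could be generated from (8), redrawing α, β, γ, δ every 2000 steps from ±50% intervals around the original constants:

```python
import numpy as np

def drifting_teacher(steps=10000, episode=2000, seed=2):
    """Input u(n) ~ U[0, 0.5] and output d(n) of system (8) with episode-wise parameters."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, steps)
    d = np.zeros(steps)
    params = None
    for n in range(9, steps - 1):
        if params is None or n % episode == 0:
            # redraw alpha, beta, gamma, delta around the constants 0.3, 0.05, 1.5, 0.1
            params = [c * rng.uniform(0.5, 1.5) for c in (0.3, 0.05, 1.5, 0.1)]
        a, b, g, dl = params
        d[n + 1] = np.tanh(a * d[n] + b * d[n] * np.sum(d[n - 9:n + 1])
                           + g * u[n - 9] * u[n] + dl)
    return u, d

u_teach, d_teach = drifting_teacher()
```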
Running the RLS-ESN algorithm. The ESN was started from a zero state and with a zero augmented output weight vector. It was driven by the teacher input, and a noise of size 0.0001 was inserted into the state update, as in the offline training. The RLS algorithm (with forgetting factor 0.995) was initialized according to the prescriptions given in [2] and then run together with the network updates, to compute from the augmented input+network states x_squares(n) = (u(n), x_1(n), ..., x_N(n), u²(n), x_1²(n), ..., x_N²(n)) a sequence of augmented output weight vectors w^out_squares(n). These output weight vectors were used to calculate a network output y(n) = tanh(w^out_squares(n) x_squares(n)).

Results. From the resulting length-10000 sequences of desired outputs d(n) and network productions y(n), NMSEs were numerically estimated by averaging within subsequent length-100 blocks. Fig. 2B gives a logarithmic plot.
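The blockwise error estimate underlying Fig. 2B might be computed along the following lines (a small helper of my own, not taken from the paper; normalizing by the per-block variance of d is my choice, since the paper does not specify the normalization window):

```python
import numpy as np

def blockwise_log10_nmse(d, y, block=100):
    """log10 NMSE of network output y against desired output d, per length-`block` block."""
    d, y = np.asarray(d), np.asarray(y)
    n_blocks = len(d) // block
    out = np.empty(n_blocks)
    for b in range(n_blocks):
        sl = slice(b * block, (b + 1) * block)
        mse = np.mean((y[sl] - d[sl]) ** 2)
        out[b] = np.log10(mse / np.var(d[sl]))   # normalize by the block's signal variance
    return out
```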
In the last three episodes, the exponential NMSE convergence after each episode-onset disruption is clearly recognizable. The convergence speed also matches the predicted time constant, as revealed by the τ = 200 slope line inserted in Fig. 2B.

The dotted horizontal line in Fig. 2B marks the NMSE of the offline-trained ESN described in the previous section. Surprisingly, after convergence the online NMSE is lower than the offline NMSE. This can be explained through the IIR (autoregressive) nature of the system (5) resp. (8), which incurs long-term correlations in the signal d(n), or in other words, a nonstationarity of the signal on the timescale of the correlation lengths, even with fixed parameters α, β, γ, δ. This medium-term nonstationarity compromises the performance of the offline algorithm, but the online adaptation can to a certain degree follow this nonstationarity.

Fig. 2C is a logarithmic plot of the development of the mean absolute output weight size. It is apparent that after starting from zero, there is an initial exponential growth of the absolute values of the output weights, until a stabilization at a size of about 1000, whereafter the NMSE develops a regular pattern (Fig. 2B).

Finally, Fig. 2D shows an overlay of d(n) (solid) with y(n) (dashed) of the last 100 steps in the experiment, visually demonstrating the precision after convergence.

A note on noise and stability. Standard offline training of ESNs yields output weights whose absolute size depends on the noise inserted into the network during training: the larger the noise, the smaller the mean output weights (extensive discussion in [5]). In online training, a similar inverse correlation between output weight size (after settling on a plateau) and noise size can be observed. When the online learning experiment was done otherwise identically but without noise insertion, weights grew so large that the RLS algorithm entered a region of numerical instability. Thus, the noise term is crucial here for numerical stability, a condition familiar from EKF-based RNN training schemes [3], which are computationally closely related to RLS.

[Figure 2 (four panels): A. Teacher output signal. B. Log10 of NMSE, with predicted baseline (dotted) and slope line. C. Log10 of average absolute output weights. D. Teacher vs. network over the last 100 steps: desired (solid) and network-predicted (dashed) signal. For details see text.]

5 Discussion

Several of the well-known error-gradient-based RNN training algorithms can be used for online weight adaptation. The update costs per time step in the most efficient of those algorithms (overview in [1]) are O(N²), where N is the network size. Typically, standard approaches train small networks (order of N = 20), whereas ESNs typically rely on large networks for precision (order of N = 100). Thus, the RLS-based ESN online learning algorithm is typically more expensive than standard techniques. However, this drawback might be compensated by the following properties of RLS-ESN:

• Simplicity of design and implementation; robust behavior with little need for learning-parameter hand-tuning.
• Custom design of RLS-ESNs with prescribed tracking parameters, transferring well-understood linear-systems methods to nonlinear systems.
• Systems with long-lasting short-term memory can be learnt. Exploitable ESN memory spans grow with network size (analysis in [6]). Consider the 30th-order system d(n+1) = tanh(0.2 d(n) + 0.04 d(n) [Σ_{i=0}^{29} d(n-i)] + 1.5 u(n-29) u(n) + 0.001). It was learnt by a 400-unit augmented adaptive ESN with a test NMSE of 0.0081. The 51st (!) order system y(n+1) = u(n-10) u(n-50) was learnt offline by a 400-unit augmented ESN with an NMSE of 0.213.³

³ See the Mathematica notebook for details.

All in all, on the kind of tasks considered above, adaptive (augmented) ESNs reach a similar level of precision as today's most refined gradient-based techniques. A given level of precision is attained in ESN vs. gradient-based techniques with a similar number of trainable weights (D. Prokhorov, private communication). Because gradient-based techniques train every connection weight in the RNN, whereas ESNs train only the output weights, the numbers of units of similarly performing standard RNNs vs. ESNs relate as N to N². Thus, RNNs are more compact than equivalent ESNs. However, when working with ESNs, for each new trained output signal one can re-use the same "reservoir", adding only N new connections and weights. This has for instance been exploited for robots at the AIS institute by simultaneously training multiple feature detectors from a single "reservoir" [4]. In this circumstance, with a growing number of simultaneously required outputs, the requisite net model sizes for ESNs vs. traditional RNNs become asymptotically equal. The size disadvantage of ESNs is further balanced by much faster offline training, greater simplicity, and the general possibility to exploit linear-systems expertise for nonlinear adaptive modeling.

Acknowledgments. The results described in this paper were obtained while I worked at the Fraunhofer AIS Institute. I am greatly indebted to Thomas Christaller for unfaltering support. Wolfgang Maass and Danil Prokhorov contributed motivating discussions and valuable references. An international patent application for the ESN technique was filed on October 13, 2000 (PCT/EP01/11490).
References

[1] A.F. Atiya and A.G. Parlos. New results on recurrent network training: Unifying the algorithms and accelerating convergence. IEEE Trans. Neural Networks, 11(3):697-709, 2000.

[2] B. Farhang-Boroujeny. Adaptive Filters: Theory and Applications. Wiley, 1998.

[3] L.A. Feldkamp, D.V. Prokhorov, C.F. Eagen, and F. Yuan. Enhanced multi-stream Kalman filter training for recurrent neural networks. In J.A.K. Suykens and J. Vandewalle, editors, Nonlinear Modeling: Advanced Black-Box Techniques, pages 29-54. Kluwer, 1998.

[4] J. Hertzberg, H. Jaeger, and F. Schönherr. Learning to ground fact symbols in behavior-based robots. In F. van Harmelen, editor, Proc. 15th Europ. Conf. on Art. Int. (ECAI 02), pages 708-712. IOS Press, Amsterdam, 2002.

[5] H. Jaeger. The "echo state" approach to analysing and training recurrent neural networks. GMD Report 148, GMD - German National Research Institute for Computer Science, 2001. http://www.gmd.de/People/Herbert.Jaeger/Publications.html.

[6] H. Jaeger. Short term memory in echo state networks. GMD Report 152, GMD - German National Research Institute for Computer Science, 2002. http://www.gmd.de/People/Herbert.Jaeger/Publications.html.

[7] H. Jaeger. Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach. GMD Report 159, Fraunhofer Institute AIS, 2002.

[8] W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. http://www.cis.tugraz.at/igi/maass/psfiles/LSM-vl06.pdf, 2002.

[9] W. Maass, T. Natschläger, and H. Markram. A model for real-time computation in generic neural microcircuits. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (Proc. NIPS 2002). MIT Press, 2002.
", "award": [], "sourceid": 2318, "authors": [{"given_name": "Herbert", "family_name": "Jaeger", "institution": null}]}