{"title": "Recurrent Neural Networks for Missing or Asynchronous Data", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 401, "abstract": null, "full_text": "Recurrent Neural Networks for Missing or \n\nAsynchronous Data \n\nYoshua Bengio -\n\nDept. Informatique et \n\nRecherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nFrancois Gingras \nDept. Informatique et \n\nRecherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nbengioy~iro.umontreal.ca \n\ngingra8~iro.umontreal.ca \n\nAbstract \n\nIn this paper we propose recurrent neural networks with feedback into the input \nunits for handling two types of data analysis problems. On the one hand, this \nscheme can be used for static data when some of the input variables are missing. \nOn the other hand, it can also be used for sequential data, when some of the \ninput variables are missing or are available at different frequencies. Unlike in the \ncase of probabilistic models (e.g. Gaussian) of the missing variables, the network \ndoes not attempt to model the distribution of the missmg variables given the \nobserved variables. Instead it is a more \"discriminant\" approach that fills in the \nmissing variables for the sole purpose of minimizing a learning criterion (e.g., to \nminimize an output error). \n\nIntroduction \n\n1 \nLearning from examples implies discovering certain relations between variables of interest. The \nmost general form of learning requires to essentially capture the joint distribution between these \nvariables. However, for many specific problems, we are only interested in predicting the value \nof certain variables when the others (or some of the others) are given. A distinction IS therefore \nmade between input variables and output variables. Such a task requires less information (and \nless p'arameters, in the case of a parameterized model) than that of estimating the full joint \ndistrIbution. 
For example, in the case of classification problems, a traditional statistical approach is based on estimating the conditional distribution of the inputs for each class as well as the class prior probabilities (thus yielding the full joint distribution of inputs and classes). A more discriminant approach concentrates on estimating the class boundaries (and therefore requires fewer parameters), as for example with a feedforward neural network trained to estimate the output class probabilities given the observed variables. \n\nHowever, for many learning problems, only some of the input variables are given for each particular training case, and the missing variables differ from case to case. The simplest way to deal with this missing data problem consists in replacing the missing values by their unconditional mean. It can be used with \"discriminant\" training algorithms such as those used with feedforward neural networks. However, in some problems, one can obtain better results by taking advantage of the dependencies between the input variables. A simple idea therefore consists \n\n* Also with AT&T Bell Labs, Holmdel, NJ 07733. \n\nFigure 1: Architectures of the recurrent networks in the experiments. On the left, a 90-3-4 architecture for static data with missing values; on the right, a 6-3-2-1 architecture with multiple time-scales for asynchronous sequential data. Small squares represent a unit delay. The number of units in each layer is inside the rectangles. The time scale at which each layer operates is on the right of each rectangle. \n\nin replacing the missing input variables by their conditional expected value, given the observed input variables. An even better scheme is to compute the expected output given the observed inputs, e.g. with a mixture of Gaussians. Unfortunately, this amounts to estimating the full joint distribution of all the variables. 
For example, with n_i inputs, capturing the possible effect of each observed variable on each missing variable would require O(n_i^2) parameters (at least one parameter to capture some co-occurrence statistic for each pair of input variables). Many related approaches have been proposed to deal with missing inputs using a Gaussian (or Gaussian mixture) model (Ahmad and Tresp, 1993; Tresp, Ahmad and Neuneier, 1994; Ghahramani and Jordan, 1994). In the experiments presented here, the proposed recurrent network is compared with a Gaussian mixture model trained with EM to handle missing values (Ghahramani and Jordan, 1994). \n\nThe approach proposed in section 2 is more economical than the traditional Gaussian-based approaches for two reasons. First, we take advantage of hidden units in a recurrent network, which might be less numerous than the inputs. The number of parameters depends on the product of the number of hidden units and the number of inputs, and the hidden units only need to capture those dependencies between input variables that are useful for reducing the output error. The second advantage is that training is based on optimizing the desired criterion (e.g., reducing an output error), rather than on predicting the values of the missing inputs as well as possible. The recurrent network is allowed to relax for a few iterations (typically as few as 4 or 5) in order to fill in some values for the missing inputs and produce an output. In section 3 we present experimental results with this approach, comparing the results with those obtained with a feedforward network. \n\nIn section 4 we propose an extension of this scheme to sequential data. In this case, the network is not relaxing: inputs keep changing with time and the network maps an input sequence (with possibly missing values) to an output sequence. 
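To make the simplest baseline concrete, the unconditional-mean imputation mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper; `impute_mean`, the toy data, and the use of `None` to mark a missing value are our own choices.

```python
def column_means(rows):
    """Mean of each variable over the cases where it is observed."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

def impute_mean(rows):
    """Replace each missing input by that variable's unconditional mean."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

data = [[1.0, None], [3.0, 4.0], [None, 6.0]]
filled = impute_mean(data)  # missing entries become 2.0 and 5.0
```

This baseline ignores all dependencies between input variables, which is exactly the weakness the conditional schemes below try to address.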
The main advantage of this sequential extension is that it allows us to deal with sequential data in which the variables occur at different frequencies. This type of problem is frequent, for example, with economic or financial data. An experiment with asynchronous data is presented in section 5. \n\n2 Relaxing Recurrent Network for Missing Inputs \n\nNetworks with feedback such as those proposed in (Almeida, 1987; Pineda, 1989) can be applied to learning a static input/output mapping when some of the inputs are missing. In both cases, however, one has to wait for the network to relax either to a fixed point (assuming it does find one) or to a \"stable distribution\" (in the case of the Boltzmann machine). In the case of fixed-point recurrent networks, the training algorithm assumes that a fixed point has been reached. The gradient with respect to the weights is then computed in order to move the fixed point to a more desirable position. The approach we have preferred here avoids such an assumption. Instead it uses a more explicit optimization of the whole behavior of the network as it unfolds in time, fills in the missing inputs and produces an output. The network is trained to minimize some function of its output by back-propagation through time. \n\nComputation of Outputs Given Observed Inputs \nGiven: input vector u = [u_1, u_2, ..., u_{n_i}] \nResult: output vector y = [y_1, y_2, ..., y_{n_o}] \n\n1. Initialize for t = 0: \nFor i = 1 ... n_u, x_{0,i} <- 0. \nFor i = 1 ... n_i, if u_i is missing then x_{0,I(i)} <- E(i), else x_{0,I(i)} <- u_i. \n\n2. Loop over time: \nFor t = 1 to T, for i = 1 ... n_u: \nIf i = I(k) is an input unit and u_k is not missing, then \nx_{t,i} <- u_k \nElse \nx_{t,i} <- (1 - gamma) x_{t-1,i} + gamma f(sum_{l in S_i} w_l x_{t-d_l, p_l}) \nwhere S_i is the set of links into unit i, each coming from a unit p_l with weight w_l and a discrete delay d_l (terms for which t - d_l < 0 are not considered). \n\n3. Collect outputs by averaging at the end of the sequence: \ny_i <- sum_t v_t x_{t,O(i)} \n\nBack-Propagation \nThe back-propagation computation requires an extra set of variables xhat_{t,i} and what_l, which will contain respectively dC/dx_{t,i} and dC/dw_l after this computation. \nGiven: output gradient vector dC/dy \nResult: input gradient dC/du and parameter gradient dC/dw \n\n1. Initialize unit gradients using the outside gradient: \nInitialize xhat_{t,i} = 0 for all t and i. \nFor i = 1 ... n_o, initialize xhat_{t,O(i)} <- v_t dC/dy_i. \n\n2. Backward loop over time: \nFor t = T to 1, for i = n_u ... 1: \nIf i = I(k) is an input unit and u_k is not missing, then \nno backward propagation \nElse \nxhat_{t-1,i} <- xhat_{t-1,i} + (1 - gamma) xhat_{t,i} \nFor l in S_i, if t - d_l >= 0: \nxhat_{t-d_l, p_l} <- xhat_{t-d_l, p_l} + gamma w_l f'(sum_{l' in S_i} w_{l'} x_{t-d_{l'}, p_{l'}}) xhat_{t,i} \nwhat_l <- what_l + gamma f'(sum_{l' in S_i} w_{l'} x_{t-d_{l'}, p_{l'}}) x_{t-d_l, p_l} xhat_{t,i} \n\n3. Collect input gradients: \nFor i = 1 ... n_i: \nIf u_i is missing, then dC/du_i <- 0 \nElse dC/du_i <- sum_t xhat_{t,I(i)} \n\nThe observed inputs are clamped for the whole duration of the sequence. The units corresponding to missing inputs are initialized to their unconditional expectation, and their value is then updated through the feedback links for the rest of the sequence (just as if they were hidden units). To help the stability of the network and prevent it from finding periodic solutions (in which the outputs are correct only periodically), output supervision is given for several time steps. A fixed vector v, with v_t > 0 and sum_t v_t = 1, specifies a weighting scheme that distributes 
the responsibility for producing the correct output among the different time steps. Its purpose is to encourage the network to develop stable dynamics which gradually converge toward the correct output (thus the weights v_t were chosen to gradually increase with t). \n\nThe neuron transfer function was a hyperbolic tangent in our experiments. The inertial term weighted by gamma (in step 2 of the forward propagation algorithm) was used to help the network find stable solutions. The parameter gamma was fixed by hand; in the experiments described below, a value of 0.7 was used, but nearby values yielded similar results. \n\nThis module can therefore be combined within a hybrid system composed of several modules by propagating gradients through the combined system (as in (Bottou and Gallinari, 1991)). For example, as in Figure 2, there might be another module taking the recurrent network's output as input. In this case the recurrent network can be seen as a feature extractor that accepts data with missing values in input and computes a set of features that are never missing. In another example of a hybrid system, the non-missing values in input of the recurrent network are computed by another, upstream module (such as the preprocessing normalization used in our experiments), and the recurrent network would provide gradients to this upstream module (for example to better tune its normalization parameters). \n\n3 Experiments with Static Data \n\nA network with three layers (inputs, hidden, outputs) was trained to classify data with missing values from the audiology database. This database was made public thanks to Jergen and Quinlan, was used by (Bareiss and Porter, 1987), and was obtained from the UCI Repository of machine learning databases (ftp.ics.uci.edu:pub/machine-learning-databases). The original database has 226 patterns, with 69 attributes, and 24 classes. 
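Returning to the relaxation algorithm of section 2, its forward pass can be sketched as follows under simplifying assumptions of our own: a single set of delay-1 links described by one dense matrix `W` (the paper allows arbitrary discrete delays d_l), a tanh transfer function, and the inertial term gamma = 0.7 used in the experiments. The function name and data layout are illustrative; in practice the gradients of steps 1-3 of the back-propagation box would be computed through time as well.

```python
import math

def relax_forward(u, W, E, gamma=0.7, T=5):
    """Forward relaxation with missing-input fill-in (a sketch).
    u[i] is None when input i is missing; E[i] is its unconditional
    expectation; W[i][j] weighs the delay-1 link from unit j to unit i.
    The first len(u) of the n units are the input units."""
    n = len(W)
    x = [0.0] * n
    # step 1: clamp observed inputs, fill missing ones with E(i)
    for i, ui in enumerate(u):
        x[i] = E[i] if ui is None else ui
    # step 2: relax for T steps; observed inputs stay clamped,
    # missing-input and hidden units follow the inertial update
    for _ in range(T):
        new_x = list(x)
        for i in range(n):
            if i < len(u) and u[i] is not None:
                new_x[i] = u[i]
            else:
                s = sum(W[i][j] * x[j] for j in range(n))
                new_x[i] = (1 - gamma) * x[i] + gamma * math.tanh(s)
        x = new_x
    return x
```

In the full algorithm, outputs are then read out as a v_t-weighted average of selected unit activations over time, and training proceeds by back-propagation through time on those outputs.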
Unfortunately, most of the classes have only one exemplar, so we decided to cluster the classes into four groups. To do so, the average pattern for each of the 24 classes was computed, and the K-Means clustering algorithm was then applied to those 24 prototypical class \"patterns\" to yield the 4 \"superclasses\" used in our experiments. The multi-valued symbolic input attributes (with more than 2 possible values) were coded with a \"one-out-of-n\" scheme, using n inputs (all zeros except the one corresponding to the attribute value). Note that a missing value was represented with a special numeric value recognized by the neural network module. The inputs which were constant over the training set were then removed. The remaining 90 inputs were finally standardized (by computing mean and standard deviation) and transformed by a saturating non-linearity (a scaled hyperbolic tangent). The output class is coded with a \"one-out-of-4\" scheme, and the recognized class is the one for which the corresponding output has the largest value. \n\nThe architecture of the network is depicted in Figure 1 (left). The length of each relaxing sequence in the experiments was 5. Higher values did not bring any measurable improvement, whereas for shorter sequences performance degraded. The number of hidden units was varied, with the best generalization performance obtained using 3 hidden units. \n\nThe recurrent network was compared with feedforward networks as well as with a mixture of Gaussians. For the feedforward networks, the missing input values were replaced by their unconditional expected value. They were trained to minimize the same criterion as the recurrent networks, i.e., the sum of squared differences between network output and desired output. Several feedforward neural networks with varying numbers of hidden units were trained. The best generalization was obtained with 15 hidden units. 
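The one-out-of-n input coding described above can be sketched as follows. The `MISSING` sentinel is an illustrative stand-in for the special numeric value mentioned in the text; the paper does not specify which value was actually used.

```python
MISSING = -2.0  # illustrative sentinel recognized as "missing" downstream

def one_out_of_n(value, n_values):
    """Code a symbolic attribute with n_values possibilities as n_values
    inputs, all zero except the one matching the attribute value.
    A missing attribute marks all of its inputs as missing."""
    if value is None:
        return [MISSING] * n_values
    code = [0.0] * n_values
    code[value] = 1.0
    return code
```

The same convention extends to the \"one-out-of-4\" output coding: the recognized class is simply the index of the largest output.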
Experiments were also performed with no hidden units and with two hidden layers (see Table 1). We found that the recurrent network not only generalized better but also learned much faster (although each pattern required 5 times more work because of the relaxation), as depicted in Figure 3. \n\nThe recurrent network was also compared with an approach based on a Gaussian and Gaussian mixture model of the data. We used the algorithm described in (Ghahramani and Jordan, 1994) for supervised learning from incomplete data with the EM algorithm. The whole joint input/output distribution is modeled using a mixture model with Gaussian (for the inputs) and multinomial (for the outputs) components: \n\nP(X = x, C = c) = sum_j P(omega_j) theta_{jc} (2 pi)^{-n_i/2} |Sigma_j|^{-1/2} exp{ -1/2 (x - mu_j)' Sigma_j^{-1} (x - mu_j) } \n\nwhere x is the input vector, c the output class, and P(omega_j) the prior probability of component j of the mixture. The theta_{jc} are the multinomial parameters; mu_j and Sigma_j are the Gaussian mean vector and covariance matrix for component j. Maximum likelihood training is applied as explained in (Ghahramani and Jordan, 1994), taking missing values into account (as additional missing variables for the EM algorithm). \n\nFigure 2: Example of a hybrid modular system, using the recurrent network (middle) to extract features from patterns which may have missing values. It can be combined with upstream modules (e.g., a normalizing preprocessor, right) and downstream modules (e.g., a static classifier, left). Dotted arrows show the backward flow of gradients. \n\nFigure 3: Evolution of training and test error for the recurrent network and for the best of the feedforward networks (90-15-4): average classification error w.r.t. training epoch (with 1 standard deviation error bars, computed over 10 trials). \n\nFor each architecture in Table 1, 10 training trials were run with a different subset of 200 training and 26 test patterns (and different initial weights for the neural networks). The recurrent network was clearly superior to the other architectures, probably for the reasons discussed in the conclusion. In addition, we have shown graphically the rate of convergence during training of the best feedforward network (90-15-4) as well as the best recurrent network (90-3-4) in Figure 3. Clearly, the recurrent network not only performs better at the end of training but also learns much faster. \n\n4 Recurrent Network for Asynchronous Sequential Data \n\nAn important problem with many sequential data analysis problems, such as those encountered in financial data sets, is that different variables are known at different frequencies, at different times (phase), or are sometimes missing. For example, some variables are given daily, weekly, monthly, quarterly, or yearly. Furthermore, some variables may not even be given for some of the periods, or the precise timing may change (for example, the date at which a company reports financial performance may vary). \n\nTherefore, we propose to extend the algorithm presented above for static data with missing values to the general case of sequential data with missing values or asynchronous variables. For time steps at which a low-frequency variable is not given, a missing value is assumed in input. Again, the feedback links from the hidden and output units to the input units allow the network 
Table 1: Comparative performance of the recurrent network, feedforward networks, and Gaussian mixture density models on the audiology data. The average percentage of classification error is shown after training, for both training and test sets, with the standard deviation in parentheses, over 10 trials (digits illegible in the source are marked \"?\"). \n\nArchitecture | Training set error | Test set error \n90-3-4 Recurrent net | 0.3 (0.6) | 2.? (?) \n90-6-4 Recurrent net | 0 (0) | 3.8 (4) \n90-25-4 Feedforward net | 0 (?.6) | 15 (7.3) \n90-15-4 Feedforward net | 0.8 (0.4) | 13.8 (7) \n90-10-6-4 Feedforward net | 1 (0.9) | 16 (5.3) \n90-6-4 Feedforward net | 6 (4.9) | 29 (8.9) \n90-2-4 Feedforward net | 18.5 (?) | 27 (10) \n90-4 Feedforward net | 22 (1) | 33 (8) \n1 Gaussian | 35 (1.6) | 38 (9.3) \n4 Gaussians Mixture | 36 (1.5) | 38 (9.2) \n8 Gaussians Mixture | 36 (2.1) | 38 (9.3) \n\nto \"complete\" the missing data. The main differences with the static case are that the inputs and outputs vary with t (we use u_t and y_t at each time step instead of u and y). The training algorithm is otherwise the same. \n\n5 Experiments with Asynchronous Data \n\nTo evaluate the algorithm, we used a recurrent network with random weights and feedback links on the input units to generate artificial data. The generating network has 6 input, 3 hidden, and 1 output units. The hidden layer is connected back to the input layer (with a delay of 1), receives inputs with delays 0 and 1 from the input layer and with delay 1 from itself, and feeds the output layer. At the initial time step, as well as at 5% of the time steps (chosen randomly), the input units were clamped with random values to introduce some further variability. The missing values were then completed by the recurrent network. To generate asynchronous data, half of the inputs were then hidden with missing values 4 out of every 5 time steps. 100 training sequences and 50 test sequences were generated. 
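The asynchronous-data construction just described (half of the inputs hidden 4 out of every 5 time steps) can be simulated as below. This is an illustrative sketch: the function name is ours, `None` marks a missing value, and we assume the slow variables are observed at every fifth step starting from t = 0.

```python
def make_asynchronous(seq, slow_inputs, period=5):
    """Hide the given input variables at every time step except
    multiples of `period`, mimicking variables that are observed
    at a lower frequency. seq is a list of input vectors."""
    out = []
    for t, x in enumerate(seq):
        row = list(x)
        if t % period != 0:
            for i in slow_inputs:
                row[i] = None  # this variable is unobserved at step t
        out.append(row)
    return out
```

A static baseline would fill these `None` entries with the last observed value (as done for the time-delay network below), whereas the recurrent network completes them through its feedback links.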
The learning problem is therefore a sequence regression problem with missing and asynchronous input variables. Preliminary comparative experiments show a clear advantage to completing the missing values (due to the different frequencies of the input variables) with the recurrent network, as shown in Figure 4. The recognition recurrent network is shown on the right of Figure 1. It has multiple time scales (implemented with subsampling and oversampling, as in TDNNs (Lang, Waibel and Hinton, 1990) and reverse-TDNNs (Simard and LeCun, 1992)) to facilitate the learning of such asynchronous data. The static network is a time-delay neural network with 6 input, 8 hidden, and 1 output units, and connections with delays 0, 2, and 4 from the input to hidden and hidden to output units. The \"missing values\" for slow-varying variables were replaced by the last observed value in the sequence. Experiments with 4 and 16 hidden units yielded similar results. \n\n6 Conclusion \n\nWhen there are dependencies between input variables, and the output prediction can be improved by taking them into account, we have seen that a recurrent network with input feedback can perform significantly better than a simpler approach that replaces missing values by their unconditional expectation. In our view, this explains the significant improvement brought by using the recurrent network instead of a feedforward network in the experiments. \n\nOn the other hand, the large number of input variables (n_i = 90 in the experiments) most likely explains the poor performance of the mixture of Gaussians model in comparison to both the static networks and the recurrent network: the Gaussian model requires estimating O(n_i^2) parameters and inverting large covariance matrices. \n\nThe approach to handling missing values presented here can also be extended to sequential data with missing or asynchronous variables. 
As our experiments suggest, for such problems, using recurrence and multiple time scales yields better performance than static or time-delay networks for which the missing values are filled in using a heuristic. \n\nFigure 4: Test set mean squared error on the asynchronous data. Top: static network with time delays. Bottom: recurrent network with feedback to input values to complete missing data. \n\nReferences \n\nAhmad, S. and Tresp, V. (1993). Some solutions to the missing feature problem in vision. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, San Mateo, CA. Morgan Kaufmann. \n\nAlmeida, L. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Caudill, M. and Butler, C., editors, IEEE International Conference on Neural Networks, volume 2, pages 609-618, San Diego, 1987. IEEE, New York. \n\nBareiss, E. and Porter, B. (1987). Protos: An exemplar-based learning apprentice. In Proceedings of the 4th International Workshop on Machine Learning, pages 12-23, Irvine, CA. Morgan Kaufmann. \n\nBottou, L. and Gallinari, P. (1991). A framework for the cooperation of learning algorithms. In Lippmann, R. P., Moody, J., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 781-788, Denver, CO. \n\nGhahramani, Z. and Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, San Mateo, CA. Morgan Kaufmann. \n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \n\nPineda, F. (1989). Recurrent back-propagation and the dynamical approach to adaptive neural computation. Neural Computation, 1:161-172. \n\nSimard, P. and LeCun, Y. (1992). Reverse TDNN: An architecture for trajectory generation. In Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems 4, pages 579-588, Denver, CO. Morgan Kaufmann, San Mateo. \n\nTresp, V., Ahmad, S., and Neuneier, R. (1994). Training neural networks with deficient data. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 128-135. Morgan Kaufmann, San Mateo, CA. \n", "award": [], "sourceid": 1126, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Francois", "family_name": "Gingras", "institution": null}]}