{"title": "Stationarity and Stability of Autoregressive Neural Network Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 273, "abstract": null, "full_text": "Stationarity and  Stability of \n\nAutoregressive Neural Network Processes \n\nFriedrich Leisch\\  Adrian Trapletti 2  &  Kurt  Hornik l \n\n1  Institut fur  Statistik \n\nTechnische  UniversiUit  Wien \n\nWiedner  Hauptstrafie 8-10  /  1071 \n\nA-1040  Wien,  Austria \n\nfirstname.lastname@ci.tuwien.ac.at \n\n2  Institut fiir  Unternehmensfiihrung \n\nWirtschaftsuniversi tat  Wien \n\nAugasse  2-6 \n\nA-lOgO  Wien,  Austria \n\nadrian. trapletti@wu-wien.ac.at \n\nAbstract \n\nWe  analyze  the  asymptotic behavior of autoregressive  neural  net(cid:173)\nwork  (AR-NN)  processes  using techniques from Markov chains and \nnon-linear time series  analysis.  It is  shown  that standard AR-NNs \nwithout shortcut connections are  asymptotically stationary.  If lin(cid:173)\near  shortcut  connections  are  allowed,  only  the  shortcut  weights \ndetermine whether the overall system is stationary, hence standard \nconditions for  linear AR processes  can be  used. \n\n1 \n\nIntroduction \n\nIn  this  paper  we  consider  the  popular  class  of nonlinear  autoregressive  processes \ndriven  by  additive  noise,  which  are  defined  by  stochastic  difference  equations  of \nform \n\n(1) \nwhere  ft  is  an  iid.  noise  process.  If g( . .. , (J)  is  a  feedforward  neural  network  with \nparameter (\"weight\")  vector (J,  we  call Equation 1 an autoregressive neural network \nprocess  of order p,  short AR-NN(p)  in  the following. \n\nAR-NNs are a natural generalization of the classic linear autoregressive  AR(p) pro(cid:173)\ncess \n\n(2) \nSee,  e.g.,  Brockwell  &  Davis  (1987)  for  a  comprehensive introduction into AR and \nARMA  (autoregressive  moving average)  models. \n\n\f268 \n\nF.  Leisch, A.  Trapletti and K.  Hornik \n\nOne of the most central questions  in  linear time series  theory  is  the stationarity of \nthe model, i.e., whether the probabilistic structure of the series is constant over time \nor at least  asymptotically constant  (when  not started in equilibrium).  Surprisingly, \nthis  question  has  not  gained  much  interest  in  the  NN  literature,  especially  there \nare-up to our knowledge-no results  giving conditions for  the stationarity of AR(cid:173)\nNN  models.  There  are results on the stationarity of Hopfield nets  (Wang & Sheng, \n1996), but these  nets  cannot be used  to estimate conditional expectations for  time \nseries  prediction. \n\nThe  rest  of this  paper  is  organized  as  follows:  In  Section  2 we  recall  some  results \nfrom time series analysis and Markov chain theory defining the relationship between \na  time series  and its associated  Markov chain.  In  Section 3  we  use  these  results  to \nestablish that standard AR-NN models without shortcut connections are stationary. \nWe  also give conditions for  AR-NN models with shortcut connections to be station(cid:173)\nary.  Section  4 examines  the  NN  modeling of an important class  of non-stationary \ntime series,  namely integrated series.  All  proofs  are deferred  to the  appendix . \n\n2  Some Time Series and  Markov  Chain  Theory \n\n2.1  Stationarity \n\nLet  ~t denote  a  time series  generated  by  a  (possibly  nonlinear)  autoregressive  pro(cid:173)\nIf lEft  = 0,  then  9  equals  the  conditional  expectation \ncess  as  defined  in  (1). 
2 Some Time Series and Markov Chain Theory

2.1 Stationarity

Let \xi_t denote a time series generated by a (possibly nonlinear) autoregressive process as defined in (1). If E\epsilon_t = 0, then g equals the conditional expectation E(\xi_t | \xi_{t-1}, \ldots, \xi_{t-p}), and g(\xi_{t-1}, \ldots, \xi_{t-p}) is the best prediction for \xi_t in the mean square sense.

If we are interested in the long term properties of the series, we may ask whether certain features such as mean or variance change over time or remain constant. The time series is called weakly stationary if E\xi_t = \mu and cov(\xi_t, \xi_{t+h}) = \gamma_h for all t, i.e., mean and covariances do not depend on the time t. A stronger criterion is that the whole distribution (and not only mean and covariance) of the process does not depend on the time; in this case the series is called strictly stationary. Strong stationarity implies weak stationarity if the second moments of the series exist. For details see standard time series textbooks such as Brockwell & Davis (1987).

If \xi_t is strictly stationary, then P(\xi_t \in A) = \pi(A) for all t, and \pi(.) is called the stationary distribution of the series. Obviously the series can only be stationary from the beginning if it is started with the stationary distribution, such that \xi_0 ~ \pi. If it is not started with \pi, e.g., because \xi_0 is a constant, then we call the series asymptotically stationary if it converges to its stationary distribution:

    lim_{t -> \infty} P(\xi_t \in A) = \pi(A)

2.2 Time Series as Markov Chains

Using the notation

    x_{t-1} = (\xi_{t-1}, \ldots, \xi_{t-p})'    (3)
    G(x_{t-1}) = (g(x_{t-1}), \xi_{t-1}, \ldots, \xi_{t-p+1})'    (4)
    e_t = (\epsilon_t, 0, \ldots, 0)'    (5)

we can write scalar autoregressive models of order p such as (1) or (2) as a first order vector model

    x_t = G(x_{t-1}) + e_t    (6)

with x_t, e_t \in R^p (e.g., Chan & Tong, 1985). If we write

    P^n(x, A) = P{x_{t+n} \in A | x_t = x},    P(x, A) = P^1(x, A)

for the probability of going from point x to set A \in B in n steps, then {x_t} with P(x, A) forms a Markov chain with state space (R^p, B, \lambda), where B are the Borel sets on R^p and \lambda is the usual Lebesgue measure.

The Markov chain {x_t} is called \varphi-irreducible if, for some \sigma-finite measure \varphi on (R^p, B, \lambda),

    \sum_{n=1}^{\infty} P^n(x, A) > 0    for all x \in R^p

whenever \varphi(A) > 0. This means, essentially, that all parts of the state space can be reached by the Markov chain irrespective of the starting point. Another important property of Markov chains is aperiodicity, which loosely speaking means that there are no (infinitely often repeated) cycles. See, e.g., Tong (1990) for details.

The Markov chain {x_t} is called geometrically ergodic if there exists a probability measure \pi(A) on (R^p, B, \lambda) and a \rho > 1 such that

    lim_{n -> \infty} \rho^n || P^n(x, .) - \pi(.) || = 0    for all x \in R^p

where ||.|| denotes the total variation norm. Then \pi satisfies the invariance equation

    \pi(A) = \int P(x, A) \pi(dx),    for all A \in B

There is a close relationship between a time series and its associated Markov chain. If the Markov chain is geometrically ergodic, then its distribution will converge to \pi and the time series is asymptotically stationary. If the time series is started with distribution \pi, i.e., x_0 ~ \pi, then the series {\xi_t} is strictly stationary.
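As a concrete illustration (a sketch with arbitrary made-up weights, not code from the paper), one can embed a scalar AR-NN(2) into the vector model (6) and check empirically that chains started far apart converge to the same marginal law, as geometric ergodicity predicts:

import numpy as np

# A fixed bounded autoregression function g; the tanh weights are arbitrary
# illustration values (hypothetical, not taken from the paper).
def g(x):                              # x = (xi_{t-1}, xi_{t-2})
    return 0.5 * np.tanh(1.2 * x[0] - 0.7 * x[1]) + 0.3 * np.tanh(x[1])

def step(x, eps):
    # One transition of the first order vector model (6): x_t = G(x_{t-1}) + e_t,
    # where G shifts the state down and puts g(x) + eps in the first coordinate.
    return np.concatenate(([g(x) + eps], x[:-1]))

rng = np.random.default_rng(1)
T, n_paths = 200, 500
finals = []
for x0 in (np.array([10.0, 10.0]), np.array([-10.0, -10.0])):   # two distant starts
    xs = np.tile(x0, (n_paths, 1))
    for _ in range(T):
        eps = rng.normal(size=n_paths)
        xs = np.array([step(x, e) for x, e in zip(xs, eps)])
    finals.append(xs[:, 0])

# Geometric ergodicity says both ensembles follow (nearly) the same law at time T:
print(np.mean(finals[0]), np.mean(finals[1]))
print(np.std(finals[0]), np.std(finals[1]))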
3 Stationarity of AR-NN Models

We now apply the concepts defined in Section 2 to the case where g is defined by a neural network. Let x denote a p-dimensional input vector; then we consider the following standard network architectures:

Single hidden layer perceptrons:

    g(x) = \gamma_0 + \sum_i \beta_i \sigma(\alpha_i + a_i' x)    (7)

where \alpha_i, \beta_i and \gamma_0 are scalar weights, the a_i are p-dimensional weight vectors, and \sigma(.) is a bounded sigmoid function such as tanh(.).

Single hidden layer perceptrons with shortcut connections:

    g(x) = \gamma_0 + c'x + \sum_i \beta_i \sigma(\alpha_i + a_i' x)    (8)

where c is an additional weight vector for shortcut connections between inputs and output. In this case we define the characteristic polynomial c(z) associated with the linear shortcuts as

    c(z) = 1 - c_1 z - c_2 z^2 - \ldots - c_p z^p,    z \in C.

Radial basis function networks:

    g(x) = \gamma_0 + \sum_i \beta_i \phi(||x - m_i||)    (9)

where the m_i are center vectors and \phi(.) is one of the usual bounded radial basis functions such as \phi(x) = exp(-x^2).

Lemma 1. Let {x_t} be defined by (6), let E|\epsilon_t| < \infty, and let the PDF of \epsilon_t be positive everywhere in R. Then, if g is defined by any of (7), (8) or (9), the Markov chain {x_t} is \varphi-irreducible and aperiodic.

Lemma 1 basically says that the state space of the Markov chain, i.e., the set of points that can be reached, cannot be reduced depending on the starting point. An example of a reducible Markov chain would be a series that is always positive if only x_0 > 0 (and negative otherwise). This cannot happen in the AR-NN(p) case due to the unbounded additive noise term.

Theorem 1. Let {\xi_t} be defined by (1) and {x_t} by (6); further let E|\epsilon_t| < \infty and the PDF of \epsilon_t be positive everywhere in R. Then:

1. If g is a network without linear shortcuts as defined in (7) and (9), then {x_t} is geometrically ergodic and {\xi_t} is asymptotically stationary.

2. If g is a network with linear shortcuts as defined in (8) and additionally c(z) != 0 for all z \in C with |z| <= 1, then {x_t} is geometrically ergodic and {\xi_t} is asymptotically stationary.

The time series {\xi_t} remains stationary if we allow for more than one hidden layer (i.e., a multilayer perceptron, MLP) or non-linear output units, as long as the overall mapping has bounded range. An MLP with shortcut connections combines a (possibly non-stationary) linear AR(p) process with a non-linear stationary NN part. Thus, the NN part can be used to model non-linear fluctuations around a linear process like a random walk.

The only part of the network that controls whether the overall process is stationary is the set of linear shortcut connections (if present). If there are no shortcuts, then the process is always stationary. With shortcuts, the usual test for stability of a linear system applies.
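The condition of Theorem 1(2) is easy to check numerically. The following sketch (illustrative, not from the paper) tests whether all roots of the shortcut polynomial c(z) lie outside the unit circle:

import numpy as np

def shortcut_stationary(c):
    # Check the condition of Theorem 1(2): all roots of
    # c(z) = 1 - c_1 z - ... - c_p z^p satisfy |z| > 1.
    p = len(c)
    # np.roots expects coefficients from the highest power down to the constant.
    coeffs = [-c[p - 1 - j] for j in range(p)] + [1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(shortcut_stationary([0.5]))        # True:  root at z = 2
print(shortcut_stationary([1.0]))        # False: random-walk shortcut, root at z = 1
print(shortcut_stationary([2.0, -1.0]))  # False: (1 - z)^2, double root at z = 1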
4 Integrated Models

An important method in classic time series analysis is to first transform a non-stationary series into a stationary one and then model the remainder by a stationary process. The probably most popular models of this kind are autoregressive integrated moving average (ARIMA) models, which can be transformed into stationary ARMA processes by simple differencing.

Let \Delta^k denote the k-th order difference operator:

    \Delta \xi_t = \xi_t - \xi_{t-1}    (10)
    \Delta^2 \xi_t = \Delta(\xi_t - \xi_{t-1}) = \xi_t - 2\xi_{t-1} + \xi_{t-2}    (11)
    \Delta^k \xi_t = \sum_{n=0}^{k} (-1)^n \binom{k}{n} \xi_{t-n}    (12)

with \Delta^1 = \Delta. E.g., a standard random walk \xi_t = \xi_{t-1} + \epsilon_t is non-stationary because of the growing variance, but can be transformed into the iid (and hence stationary) noise process \epsilon_t by taking first differences.

If a time series is non-stationary, but can be transformed into a stationary series by taking k-th differences, we call the series integrated of order k. Standard MLPs or RBFs without shortcuts are asymptotically stationary. It is therefore important to take care that these networks are only used to model stationary processes. Of course the network can be trained to mimic a non-stationary process on a finite time interval, but the out-of-sample or prediction performance will be poor, because the network inherently cannot capture some important features of the process. One way to overcome this problem is to first transform the process into a stationary series (e.g., by differencing an integrated series) and train the network on the transformed series (Chng et al., 1996).

As differencing is a linear operation, this transformation can also easily be incorporated into the network by choosing the shortcut connections and the weights from input to hidden units accordingly. Assume we want to model an integrated series of integration order k, such that

    \Delta^k \xi_t = g(\Delta^k \xi_{t-1}, \ldots, \Delta^k \xi_{t-p}) + \epsilon_t

where \Delta^k \xi_t is stationary. By (12) this is equivalent to

    \xi_t = \sum_{n=1}^{k} (-1)^{n-1} \binom{k}{n} \xi_{t-n} + g(\Delta^k \xi_{t-1}, \ldots, \Delta^k \xi_{t-p}) + \epsilon_t
          = \sum_{n=1}^{k} (-1)^{n-1} \binom{k}{n} \xi_{t-n} + \tilde{g}(\xi_{t-1}, \ldots, \xi_{t-p-k}) + \epsilon_t

which (for p > k) can be modeled by an MLP with shortcut connections as defined by (8), where the shortcut weight vector c is fixed to

    c_n = (-1)^{n-1} \binom{k}{n},    with \binom{k}{n} := 0 for n > k,

and \tilde{g} is such that \tilde{g}(\xi_{t-1}, \ldots, \xi_{t-p-k}) = g(\Delta^k x_{t-1}). This is always possible and can basically be obtained by adding c to all weights between the input and the first hidden layer of g.

An AR-NN(p) can model integrated series up to integration order p. If the order of integration is known, the shortcut weights can either be fixed, or the differenced series is used as input. If the order is unknown, we can also train the complete network including the shortcut connections and implicitly estimate the order of integration. After training, the final model can be checked for stationarity by looking at the characteristic roots of the polynomial defined by the shortcut connections.
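As a small worked example (a sketch, not from the paper), the fixed shortcut weights c_n = (-1)^{n-1} \binom{k}{n} can be computed directly; for k = 1 this gives the random-walk shortcut, for k = 2 the weights of a doubly integrated series:

from math import comb

def integration_shortcuts(k, p):
    # Fixed shortcut weights c_n = (-1)**(n-1) * C(k, n) for an integrated
    # series of order k, padded with zeros up to lag p (since C(k, n) := 0 for n > k).
    return [(-1) ** (n - 1) * comb(k, n) if n <= k else 0 for n in range(1, p + 1)]

print(integration_shortcuts(1, 4))  # [1, 0, 0, 0]   random walk
print(integration_shortcuts(2, 4))  # [2, -1, 0, 0]  xi_t = 2 xi_{t-1} - xi_{t-2} + ...
print(integration_shortcuts(3, 4))  # [3, -3, 1, 0]

Feeding any of these vectors into the shortcut_stationary check above returns False, as expected: an integrated series has unit roots, since c(z) = (1 - z)^k.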
4.1 Fractional Integration

Up to now we have only considered integrated series with positive integer order of integration, i.e., k \in N. In recent years, models with fractional integration order have become very popular (again). Series with integration order 0.5 < k < 1 can be shown to exhibit self-similar or fractal behavior, and have long memory. This type of process was introduced by Mandelbrot in a series of papers modeling river flows; see, e.g., Mandelbrot & Van Ness (1968). More recently, self-similar processes were used to model Ethernet traffic by Leland et al. (1994). Also, some financial time series such as foreign exchange data series exhibit long memory and self-similarity.

The fractional differencing operator \Delta^k, k \in [-1, 1], is defined by the series expansion

    \Delta^k \xi_t = \sum_{n=0}^{\infty} \frac{\Gamma(n - k)}{\Gamma(-k) \Gamma(n + 1)} \xi_{t-n}    (13)

which is obtained from the Taylor series of (1 - z)^k. For k > 1 we first use Equation (12) and then the above series for the fractional remainder. For practical computation, the series (13) is of course truncated at some term n = N. An AR-NN(p) model with shortcut connections can approximate the series up to the first p terms.
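The truncated coefficients of (13) are easy to compute. The ratio of successive Gamma terms simplifies to w_n = w_{n-1} (n - 1 - k) / n, so a short sketch (illustrative, not from the paper) is:

def frac_diff_weights(k, N):
    # First N+1 coefficients w_n = Gamma(n - k) / (Gamma(-k) * Gamma(n + 1))
    # of the truncated expansion (13), computed via the stable recursion
    # w_0 = 1,  w_n = w_{n-1} * (n - 1 - k) / n.
    w = [1.0]
    for n in range(1, N + 1):
        w.append(w[-1] * (n - 1 - k) / n)
    return w

print(frac_diff_weights(1.0, 3))  # [1.0, -1.0, 0.0, 0.0]: the ordinary first difference
print(frac_diff_weights(0.7, 3))  # [1.0, -0.7, -0.105, -0.0455]: slowly decaying weights

The slow, hyperbolic decay of the weights for non-integer k is exactly the long-memory signature mentioned above; an AR-NN(p) with fixed shortcuts can carry the first p of these coefficients.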
5 Summary

We have shown that AR-NN models using standard NN architectures without shortcuts are asymptotically stationary. If linear shortcuts between inputs and outputs are included (which many popular software packages have already implemented), then only the weights of the shortcut connections determine whether the overall system is stationary. It is also possible to model many integrated time series by this kind of network. The asymptotic behavior of AR-NNs is especially important for parameter estimation, predictions over larger intervals of time, or when using the network to generate artificial time series. Limiting (normal) distributions of parameter estimates are only guaranteed for stationary series. We therefore always recommend transforming a non-stationary series into a stationary one if possible (e.g., by differencing) before training a network on it.

Another important aspect of stationarity is that a single trajectory displays the complete probability law of the process. If we have observed one long enough trajectory of the process, we can (in theory) estimate all interesting quantities of the process by averaging over time. This need not be true for non-stationary processes in general, where some quantities may only be estimated by averaging over several independent trajectories. E.g., one might train the network on an available sample and then use the trained network afterwards, driven by artificial noise from a random number generator, to generate new data with properties similar to the training sample. Asymptotic stationarity guarantees that the AR-NN model cannot show \"explosive\" behavior or growing variance with time.

We are currently working on extensions of this paper in several directions. AR-NN processes can be shown to be strong mixing (the memory of the process vanishes exponentially fast) and to have autocorrelations going to zero at an exponential rate. Another question is a thorough analysis of the properties of parameter estimates (weights) and tests for the order of integration. Finally, we want to extend the univariate results to the multivariate case, with a special interest in cointegrated processes.

Acknowledgement

This piece of research was supported by the Austrian Science Foundation (FWF) under grant SFB#010 ('Adaptive Information Systems and Modeling in Economics and Management Science').

Appendix: Mathematical Proofs

Proof of Lemma 1

It can easily be shown that {x_t} is \varphi-irreducible if the support of the probability density function (PDF) of \epsilon_t is the whole real line, i.e., the PDF is positive everywhere in R (Chan & Tong, 1985). In this case every non-null p-dimensional hypercube is reached in p steps with positive probability (and hence every non-null Borel set A).

A necessary and sufficient condition for {x_t} to be aperiodic is that there exists a set A and a positive integer n such that P^n(x, A) > 0 and P^{n+1}(x, A) > 0 for all x \in A (Tong, 1990, p. 455). In our case this is true for all n due to the unbounded additive noise.

Proof of Theorem 1

We use the following result from nonlinear time series theory:

Theorem 2 (Chan & Tong 1985). Let {x_t} be defined by (1), (6) and let G be compact, i.e., preserve compact sets. Suppose G can be decomposed as G = G_h + G_d, where G_d(.) is of bounded range and G_h(.) is continuous and homogeneous, i.e., G_h(ax) = a G_h(x), the origin is a fixed point of G_h, and G_h is uniformly asymptotically stable. If, in addition, E|\epsilon_t| < \infty and the PDF of \epsilon_t is positive everywhere in R, then {x_t} is geometrically ergodic.

The noise process \epsilon_t fulfills the conditions by assumption. Clearly all networks are continuous compact functions. Standard MLPs without shortcut connections and RBFs have a bounded range, hence G_h = 0 and G = G_d, and the series {\xi_t} is asymptotically stationary. If we allow for linear shortcut connections between the inputs and the output, we get G_h = c'x and G_d = \gamma_0 + \sum_i \beta_i \sigma(\alpha_i + a_i' x), i.e., G_h is the linear shortcut part of the network and G_d is a standard MLP without shortcut connections. Clearly, G_h is continuous, homogeneous and has the origin as a fixed point. Hence, the series {\xi_t} is asymptotically stationary if G_h is asymptotically stable, i.e., when all characteristic roots of G_h have a magnitude less than unity. Obviously the same is true for RBFs with shortcut connections. Note that the model reduces to a standard linear AR(p) model if G_d = 0.
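The dichotomy in this proof can be illustrated numerically (a hypothetical sketch, not part of the original paper): with a bounded G_d, the shortcut part G_h alone decides between stable and explosive behavior:

import numpy as np

def simulate_shortcut_ar_nn(c1, T=300, seed=2):
    # AR-NN(1) with a linear shortcut: xi_t = c1*xi_{t-1} + tanh(xi_{t-1}) + eps_t.
    # The bounded tanh part (G_d) never affects stability; the shortcut c1 (G_h) does.
    rng = np.random.default_rng(seed)
    xi = np.zeros(T)
    for t in range(1, T):
        xi[t] = c1 * xi[t - 1] + np.tanh(xi[t - 1]) + rng.normal()
    return xi

print(np.abs(simulate_shortcut_ar_nn(0.5)).max())  # bounded: root of c(z) = 1 - 0.5z at z = 2
print(np.abs(simulate_shortcut_ar_nn(1.1)).max())  # explodes: root of c(z) = 1 - 1.1z inside the unit circle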
References

Brockwell, P. J. & Davis, R. A. (1987). Time Series: Theory and Methods. Springer Series in Statistics. New York, USA: Springer Verlag.

Chan, K. S. & Tong, H. (1985). On the use of the deterministic Lyapunov function for the ergodicity of stochastic difference equations. Advances in Applied Probability, 17, 666-678.

Chng, E. S., Chen, S., & Mulgrew, B. (1996). Gradient radial basis function networks for nonlinear and nonstationary time series prediction. IEEE Transactions on Neural Networks, 7(1), 190-194.

Husmeier, D. & Taylor, J. G. (1997). Predicting conditional probability densities of stationary stochastic time series. Neural Networks, 10(3), 479-497.

Jones, D. A. (1978). Nonlinear autoregressive processes. Proceedings of the Royal Society London A, 360, 71-95.

Leland, W. E., Taqqu, M. S., Willinger, W., & Wilson, D. V. (1994). On the self-similar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 2(1), 1-15.

Mandelbrot, B. B. & Van Ness, J. W. (1968). Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4), 422-437.

Tong, H. (1990). Non-linear Time Series: A Dynamical System Approach. New York, USA: Oxford University Press.

Wang, T. & Sheng, Z. (1996). Asymptotic stationarity of discrete-time stochastic neural networks. Neural Networks, 9(6), 957-963.
", "award": [], "sourceid": 1529, "authors": [{"given_name": "Friedrich", "family_name": "Leisch", "institution": null}, {"given_name": "Adrian", "family_name": "Trapletti", "institution": null}, {"given_name": "Kurt", "family_name": "Hornik", "institution": null}]}