{"title": "Learning long-term dependencies is not as difficult with NARX networks", "book": "Advances in Neural Information Processing Systems", "page_first": 577, "page_last": 583, "abstract": null, "full_text": "Learning long-term dependencies \n\nis not as difficult with NARX networks \n\nTsungnan Lin* \n\nBill G. Horne \n\nDepartment of Electrical Engineering \n\nNEC Research Institute \n\nPrinceton University \nPrinceton, N J 08540 \n\n4 Independence Way \nPrinceton, NJ 08540 \n\nPeter Tiiio \n\nDept. of Computer Science and Engineering \n\nSlovak Technical University \n\nIlkovicova 3, 812 19 Bratislava, Slovakia \n\nc. Lee Gilest \n\nNEC Research Institute \n\n4 Independence Way \nPrinceton, N J 08540 \n\nAbstract \n\nIt has recently been shown that gradient descent learning algo(cid:173)\nrithms for recurrent neural networks can perform poorly on tasks \nthat involve long-term dependencies. \nIn this paper we explore \nthis problem for a class of architectures called NARX networks, \nwhich have powerful representational capabilities. Previous work \nreported that gradient descent learning is more effective in NARX \nnetworks than in recurrent networks with \"hidden states\". We \nshow that although NARX networks do not circumvent the prob(cid:173)\nlem of long-term dependencies, they can greatly improve perfor(cid:173)\nmance on such problems. We present some experimental 'results \nthat show that NARX networks can often retain information for \ntwo to three times as long as conventional recurrent networks. \n\n1 \n\nIntroduction \n\nRecurrent Neural Networks (RNNs) are capable of representing arbitrary nonlin(cid:173)\near dynamical systems [19, 20]. However, learning simple behavior can be quite \n\n\"Also with NEC Research Institute. \ntAlso with UMIACS, University of Maryland, College Park, MD 20742 \n\n\f578 \n\nT. LIN, B. G. HORNE, P. TINO, C. L. GILES \n\ndifficult using gradient descent. 
For example, even though these systems are Turing equivalent, it has been difficult to get them to successfully learn small finite state machines from example strings encoded as temporal sequences. Recently, it has been demonstrated that at least part of this difficulty can be attributed to long-term dependencies, i.e. when the desired output at time T depends on inputs presented at times t \u00ab T. In [13] it was reported that RNNs were able to learn short-term musical structure using gradient based methods, but had difficulty capturing global behavior. These ideas were recently formalized in [2], which showed that if a system is to robustly latch information, then the fraction of the gradient due to information n time steps in the past approaches zero as n becomes large. \n\nSeveral approaches have been suggested to circumvent this problem. For example, gradient-based methods can be abandoned in favor of alternative optimization methods [2, 15]. However, the algorithms investigated so far either perform just as poorly on problems involving long-term dependencies, or, when they are better, require far more computational resources [2]. Another possibility is to modify conventional gradient descent by more heavily weighting the fraction of the gradient due to information far in the past, but there is no guarantee that such a modified algorithm would converge to a minimum of the error surface being searched [2]. Another suggestion has been to alter the input data so that it represents a reduced description that makes global features more explicit and more readily detectable [7, 13, 16, 17]. However, this approach may fail if short-term dependencies are equally important. Finally, it has been suggested that a network architecture that operates on multiple time scales might be useful [5, 6]. 
\n\nIn this paper, we also propose an architectural approach to deal with long-term dependencies [11]. We focus on a class of architectures based upon Nonlinear AutoRegressive models with eXogenous inputs (NARX models), which are therefore called NARX networks [3, 14]. This is a powerful class of models which has recently been shown to be computationally equivalent to Turing machines [18]. Furthermore, previous work has shown that gradient descent learning is more effective in NARX networks than in recurrent network architectures with \"hidden states\" when applied to problems including grammatical inference and nonlinear system identification [8]. Typically, these networks converge much faster and generalize better than other networks. The results in this paper give an explanation of this phenomenon. \n\nFigure 1: A NARX network, with input u(k), delayed inputs u(k-1), u(k-2), and delayed outputs y(k-3), y(k-2), y(k-1). \n\n2 Vanishing gradients and long-term dependencies \n\nBengio et al. [2] have analytically explained why learning problems with long-term dependencies is difficult. They argue that for many practical applications the goal of the network must be to robustly latch information, i.e. the network must be able to store information for a long period of time in the presence of noise. More specifically, they argue that latching of information is accomplished when the states of the network stay within the vicinity of a hyperbolic attractor, and robustness to noise is accomplished if the states of the network are contained in the reduced attracting set of that attractor, i.e. the set of points at which the eigenvalues of the Jacobian are contained within the unit circle. \n\nIn algorithms such as Backpropagation Through Time (BPTT), the gradient of the cost function C is written by assuming that the weights at different time 
indices are independent and computing the partial gradient with respect to these weights. The total gradient is then equal to the sum of these partial gradients. \n\nIt can be easily shown that the weight updates are proportional to \n\n\u2211_p (y_p(T) - d_p) \u2211_{\u03c4=1}^{T} J_x(T, T-\u03c4) \u2202x(\u03c4)/\u2202w, \n\nwhere y_p(T) and d_p are the actual and desired (or target) output for the pth pattern\u00b9, x(t) is the state vector of the network at time t, and J_x(T, T-\u03c4) = \u2207_{x(\u03c4)} x(T) denotes the Jacobian of the network expanded over T-\u03c4 time steps. \n\nIn [2], it was shown that if the network robustly latches information, then J_x(T, n) is an exponentially decreasing function of n, so that lim_{n\u2192\u221e} J_x(T, n) = 0. This implies that the portion of \u2207_w C due to information at times \u03c4 \u00ab T is insignificant compared to the portion at times near T. This vanishing gradient is the essential reason why gradient descent methods are not sufficiently powerful to discover a relationship between target outputs and inputs that occur at a much earlier time. \n\n\u00b9We deal only with problems in which the target output is presented at the end of the sequence. \n\n3 NARX networks \n\nAn important class of discrete-time nonlinear systems is the Nonlinear AutoRegressive with eXogenous inputs (NARX) model [3, 10, 12, 21]: \n\ny(t) = f(u(t - D_u), ..., u(t - 1), u(t), y(t - D_y), ..., y(t - 1)), \n\nwhere u(t) and y(t) represent the input and output of the network at time t, D_u and D_y are the input and output order, and f is a nonlinear function. When the function f can be approximated by a Multilayer Perceptron, the resulting system is called a NARX network [3, 14]. \n\nIn this paper we shall consider NARX networks with zero input order and a one dimensional output. However, there is no reason why our results could not be extended to networks with higher input orders. Since the states of a discrete-time dynamical system can always be associated with the unit-delay elements in the realization of the system, we can then describe such a network in a state space form \n\nx_1(t+1) = f(u(t), x_1(t), ..., x_D(t)), and x_i(t+1) = x_{i-1}(t) for i = 2, ..., D, (1) \n\nwith y(t) = x_1(t+1). \n\nFigure 2: Results for the latching problem. (a) Plots of J(t,n) as a function of n for delays 1, 3, and 6. (b) Plots of the ratio J(t,n) / \u2211_{\u03c4=1}^{t} J(t,\u03c4) as a function of n. \n\nIf the Jacobian of this system has all of its eigenvalues inside the unit circle at each time step, then the states of the network will be in the reduced attracting set of some hyperbolic attractor, and thus the system will be robustly latched at that time. As with any other RNN, this implies that lim_{n\u2192\u221e} J_x(t, n) = 0. Thus, NARX networks will also suffer from vanishing gradients and the long-term dependencies problem. However, we find in the simulation results that follow that NARX networks are often much better at discovering long-term dependencies than conventional RNNs. \n\nAn intuitive reason why output delays can help with long-term dependencies can be found by considering how gradients are calculated using the Backpropagation Through Time algorithm. BPTT involves two phases: unfolding the network in time and backpropagating the error through the unfolded network. When a NARX network is unfolded in time, the output delays will appear as jump-ahead connections in the unfolded network. Intuitively, these jump-ahead connections provide a shorter path for propagating gradient information, thus reducing the sensitivity of the network to long-term dependencies. 
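As a concrete illustration of the state space form (1), the following minimal Python sketch (our own; the paper provides no code, and the names `narx_step` and `f` are invented) implements the delay-line update and iterates a one-node autonomous network:

```python
import math

def narx_step(f, x, u):
    """One step of the NARX state space form (1): the new first state is
    f applied to the input and the current delay line, and the remaining
    states shift down the line (x_i(t+1) = x_{i-1}(t))."""
    x1 = f(u, x)          # x_1(t+1) = f(u(t), x_1(t), ..., x_D(t))
    return [x1] + x[:-1]  # shift the delay line; y(t) = x_1(t+1)

# Example: a one-node autonomous network with D = 3 output delays and
# the total recurrent weight w = 1.25 spread evenly, w_r = w / D.
D, w = 3, 1.25
f = lambda u, x: math.tanh(sum((w / D) * xr for xr in x) + u)

x = [0.5] * D           # arbitrary initial delay-line contents
for _ in range(100):    # iterate with no external input (u = 0)
    x = narx_step(f, x, 0.0)
print(round(x[0], 3))   # settles near the stable fixed point ~0.710
```

Iterated this way with no input, the state settles onto the positive stable fixed point near 0.710, which is the operating point assumed in the analysis that follows.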
However, this intuitive reasoning is only valid if the total gradient through these jump-ahead pathways is greater than the gradient through the layer-to-layer pathways. \n\nIt is possible to derive analytical results for some simple toy problems to show that NARX networks are indeed less sensitive to long-term dependencies. Here we give one such example, which is based upon the latching problem described in [2]. Consider the one node autonomous recurrent network described by x(t) = tanh(w x(t-1)), where w = 1.25, which has two stable fixed points at \u00b10.710 and one unstable fixed point at zero. The one node, autonomous NARX network x(t) = tanh(\u2211_{r=1}^{D} w_r x(t-r)) has the same fixed points as long as \u2211_{r=1}^{D} w_r = w. \n\nAssume the state of the network has reached equilibrium at the positive stable fixed point and there are no external inputs. For simplicity, we only consider the Jacobian J(t, n) = \u2202x(t)/\u2202x(t-n), which will be a component of the gradient \u2207_w C. Figure 2a shows plots of J(t, n) with respect to n for D = 1, D = 3 and D = 6 with w_r = w/D. These plots show that the effect of output delays is to flatten out the curves and place more emphasis on the gradient due to terms farther in the past. Note that the gradient contribution due to short-term dependencies is deemphasized. In Figure 2b we show plots of the ratio J(t, n) / \u2211_{\u03c4=1}^{t} J(t, \u03c4), which illustrates the percentage of the total gradient that can be attributed to information n time steps in the past. These plots show that this percentage is larger for the network with output delays, and thus one would expect that these networks would be able to more effectively deal with long-term dependencies. 
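The flattening effect can be reproduced with a short numerical sketch. The following Python fragment (ours, not the authors'; names like `jacobians` are invented for illustration) linearizes the one-node NARX network around the positive fixed point and evaluates J(t, n) via the resulting linear recurrence, for D = 1 and D = 3:

```python
import math

def fixed_point(w, x0=1.0, iters=200):
    """Positive stable fixed point of x = tanh(w * x); for w = 1.25
    this is approximately 0.710."""
    x = x0
    for _ in range(iters):
        x = math.tanh(w * x)
    return x

def jacobians(D, w=1.25, n_max=20):
    """J(t, n) at the fixed point of x(t) = tanh(sum_r w_r x(t-r)) with
    w_r = w / D.  Linearizing around x* gives the recurrence
    J(n) = sum_{r=1}^{min(n, D)} a_r J(n-r) with a_r = (1 - x*^2) w / D."""
    xs = fixed_point(w)
    a = (1.0 - xs * xs) * w / D   # linearized gain per delay tap
    J = [1.0]                     # J(0) = 1
    for n in range(1, n_max + 1):
        J.append(sum(a * J[n - r] for r in range(1, min(n, D) + 1)))
    return J

J1, J3 = jacobians(D=1), jacobians(D=3)
# Output delays flatten the decay: by n = 10 the D = 3 network retains
# considerably more gradient than the D = 1 network.
print(J1[10], J3[10])
```

For D = 1 the sensitivity decays geometrically, like (w(1 - x*^2))^n, roughly 0.62^n for w = 1.25; spreading the same total weight over three delays yields a markedly slower decay, matching the flattened curves of Figure 2a.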
\n\n4 Experimental results \n\n4.1 The latching problem \n\nWe explored a slight modification of the latching problem described in [2], which is a minimal task designed as a test that must necessarily be passed in order for a network to robustly latch information. In this task there are three inputs u_1(t), u_2(t), and a noise input e(t), and a single output y(t). Both u_1(t) and u_2(t) are zero for all times t > 1. At time t = 1, u_1(1) = 1 and u_2(1) = 0 for samples from class 1, and u_1(1) = 0 and u_2(1) = 1 for samples from class 2. The noise input e(t) is drawn uniformly from [-b, b] when L < t \u2264 T, and e(t) = 0 when t \u2264 L. \n\nThe network used to solve this problem is a NARX network consisting of a single neuron, where the parameters h_j are adjustable and the recurrent weights w_r are fixed\u00b2. We fixed the recurrent feedback weights to w_r = 1.25/D, which gives the autonomous network two stable fixed points at \u00b10.710, as described in Section 3. It can be shown [4] that the network is robust to perturbations in the range [-0.155, 0.155]. Thus, the uniform noise in e(t) was restricted to this range. \n\nFor each simulation, we generated 30 strings from each class, each with a different e(t). The initial values of h_j for each simulation were also chosen from the same distribution that defines e(t). For strings from class one, a target value of 0.8 was chosen; for class two, -0.8 was chosen. The network was run using a simple BPTT algorithm with a learning rate of 0.1 for a maximum of 100 epochs. (We found that the network converged to some solution consistently within a few dozen epochs.) If the simulation exceeded 100 epochs and did not correctly classify all strings then the simulation was ruled a failure. We varied T from 10 to 200 in increments of 2. For each value of T, we ran 50 simulations. Figure 3a shows a plot of the percentage of those runs that were successful for each case. 
It is clear from these plots that the NARX networks become increasingly less sensitive to long-term dependencies as the output order is increased. \n\n\u00b2Although this description may appear different from the one in [2], it can be shown that they are actually identical experiments for D = 1. \n\nFigure 3: (a) Plots of percentage of successful simulations as a function of T, the length of the input strings. (b) Plots of the final classification rate with respect to different length input strings. \n\n4.2 The parity problem \n\nIn the parity problem, the task is to classify sequences depending on whether or not the number of 1s in the input string is odd. We generated 20 strings of different lengths from 3 to 5 and added uniformly distributed noise in the range [-0.2, 0.2] at the end of each string. The length of the input noise varied from 0 to 50. We arbitrarily chose 0.7 and -0.7 to represent the symbols \"1\" and \"0\". The target is only given at the end of each string. Three different networks with different numbers of output delays were run on this problem in order to evaluate the capability of the network to learn long-term dependencies. In order to make the networks comparable, we chose networks in which the number of weights was roughly equal. For networks with one to three delays, 5, 4 and 3 hidden neurons were chosen respectively, giving 21, 21, and 19 trainable weights. 
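The input data for this task might be generated as in the following sketch (our own illustration, not the authors' code; `make_parity_string` is an invented name, and details such as the random seed are assumptions):

```python
import random

def make_parity_string(length, noise_len, rng):
    """Build one input sequence: `length` symbols from {"1", "0"}
    encoded as 0.7 / -0.7, followed by `noise_len` values drawn
    uniformly from [-0.2, 0.2].  The target, given only at the end of
    the string, is 1 if the number of 1s is odd, else 0."""
    bits = [rng.randint(0, 1) for _ in range(length)]
    seq = [0.7 if b == 1 else -0.7 for b in bits]
    seq += [rng.uniform(-0.2, 0.2) for _ in range(noise_len)]
    target = sum(bits) % 2
    return seq, target

rng = random.Random(0)  # assumed seed, for reproducibility
# 20 strings with symbol lengths from 3 to 5 and a 10-step noise tail
data = [make_parity_string(rng.randint(3, 5), 10, rng) for _ in range(20)]
seq, target = data[0]
print(len(seq), target)
```

Because the noise values are bounded well below the 0.7 symbol magnitude, the tail carries no class information; only a network that latches the parity across the noise tail can classify these sequences.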
Initial weight values were randomly generated between -0.5 and 0.5 for 10 trials. \n\nFig. 3b shows the average classification rate with respect to different lengths of input noise. When the length of the noise is less than 5, all three of the networks can learn all the sequences, with a classification rate near 100%. When the length increases to between 10 and 35, the classification rate of the networks with one feedback delay drops quickly to about 60%, while the rate of the networks with two or three feedback delays still remains about 80%. \n\n5 Conclusion \n\nIn this paper we considered an architectural approach to dealing with the problem of learning long-term dependencies. We explored the ability of a class of architectures called NARX networks to solve such problems. This has been observed previously, in the sense that gradient descent learning appeared to be more effective in NARX networks than in RNNs [8]. We presented an analytical example that showed that the gradients do not vanish as quickly in NARX networks as they do in networks without multiple delays when the network is operating at a fixed point. We also presented two experimental problems which show that NARX networks can outperform networks with single delays on some simple problems involving long-term dependencies. \n\nWe speculate that similar results could be obtained for other networks. In particular, we hypothesize that any network that uses tapped delay feedback [1, 9] would demonstrate improved performance on problems involving long-term dependencies. \n\nAcknowledgements \n\nWe would like to thank A. Back and Y. Bengio for many useful suggestions. \n\nReferences \n\n[1] A.D. Back and A.C. Tsoi. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3(3):375-385, 1991. \n\n[2] Y. Bengio, P. Simard, and P. 
Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. on Neural Networks, 5(2):157-166, 1994. \n\n[3] S. Chen, S.A. Billings, and P.M. Grant. Non-linear system identification using neural networks. International Journal of Control, 51(6):1191-1214, 1990. \n\n[4] P. Frasconi, M. Gori, M. Maggini, and G. Soda. Unified integration of explicit knowledge and learning by example in recurrent networks. IEEE Trans. on Knowledge and Data Engineering, 7(2):340-346, 1995. \n\n[5] M. Gori, M. Maggini, and G. Soda. Scheduling of modular architectures for inductive inference of regular grammars. In ECAI'94 Work. on Comb. Sym. and Connectionist Proc., pages 78-87. \n\n[6] S. El Hihi and Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. In NIPS 8, 1996. (In this Proceedings.) \n\n[7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Technical Report FKI-207-95, Technische Universitat Munchen, 1995. \n\n[8] B.G. Horne and C.L. Giles. An experimental comparison of recurrent neural networks. In NIPS 7, pages 697-704, 1995. \n\n[9] R.R. Leighton and B.C. Conrath. The autoregressive backpropagation algorithm. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 369-377, July 1991. \n\n[10] I.J. Leontaritis and S.A. Billings. Input-output parametric models for non-linear systems: Part I: deterministic non-linear systems. International Journal of Control, 41(2):303-328, 1985. \n\n[11] T.N. Lin, B.G. Horne, P. Tino, and C.L. Giles. Learning long-term dependencies is not as difficult with NARX recurrent neural networks. Technical Report UMIACS-TR-95-78 and CS-TR-3500, University of Maryland, 1995. \n\n[12] L. Ljung. System Identification: Theory for the User. Prentice-Hall, 1987. \n\n[13] M.C. Mozer. Induction of multiscale temporal structure. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, NIPS 4, pages 275-282, 1992. \n\n[14] K.S. Narendra and K. Parthasarathy. 
Identification and control of dynamical systems using neural networks. IEEE Trans. on Neural Networks, 1(1):4-27, March 1990. \n\n[15] G.V. Puskorius and L.A. Feldkamp. Recurrent network training with the decoupled extended Kalman filter. In Proc. 1992 SPIE Conf. on the Science of Artificial Neural Networks, Orlando, Florida, April 1992. \n\n[16] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. \n\n[17] J. Schmidhuber. Learning unambiguous reduced sequence descriptions. In NIPS 4, pages 291-298, 1992. \n\n[18] H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of NARX neural networks. IEEE Trans. on Systems, Man and Cybernetics, 1996. Accepted. \n\n[19] H.T. Siegelmann and E.D. Sontag. On the computational power of neural networks. Journal of Computer and System Science, 50(1):132-150, 1995. \n\n[20] E.D. Sontag. Systems combining linearity and saturations and relations to neural networks. Technical Report SYCON-92-01, Rutgers Center for Systems and Control, 1992. \n\n[21] H. Su, T. McAvoy, and P. Werbos. Long-term predictions of chemical processes using recurrent neural networks: A parallel training approach. Ind. Eng. Chem. Res., 31:1338, 1992.\n", "award": [], "sourceid": 1151, "authors": [{"given_name": "Tsungnan", "family_name": "Lin", "institution": null}, {"given_name": "Bill", "family_name": "Horne", "institution": null}, {"given_name": "Peter", "family_name": "Ti\u00f1o", "institution": null}, {"given_name": "C.", "family_name": "Giles", "institution": null}]}