{"title": "Credit Assignment through Time: Alternatives to Backpropagation", "book": "Advances in Neural Information Processing Systems", "page_first": 75, "page_last": 82, "abstract": null, "full_text": "Credit Assignment through Time: Alternatives to Backpropagation \n\nYoshua Bengio * \nDept. Informatique et Recherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nPaolo Frasconi \nDip. di Sistemi e Informatica \nUniversita di Firenze \n50139 Firenze (Italy) \n\nAbstract \n\nLearning to recognize or predict sequences using long-term context has many applications. However, practical and theoretical problems arise in training recurrent neural networks to perform tasks in which input/output dependencies span long intervals. Starting from a mathematical analysis of the problem, we consider and compare alternative algorithms and architectures on tasks for which the span of the input/output dependencies can be controlled. Results on the new algorithms show performance qualitatively superior to that obtained with backpropagation. \n\n1 Introduction \n\nRecurrent neural networks can in principle learn to map input sequences to output sequences. Machines that could efficiently learn such tasks would be useful for many applications involving sequence prediction, recognition or production. \n\nHowever, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. In fact, we can prove that dynamical systems such as recurrent neural networks become increasingly difficult to train with gradient descent as the duration of the dependencies to be captured increases. A mathematical analysis of the problem shows that one of two conditions arises in such systems. 
In the first case, the dynamics of the network allow it to reliably store bits of information (with bounded input noise), but gradients (with respect to an error at a given time step) vanish exponentially fast as one propagates them backward in time. In the second case, the gradients can flow backward but the system is locally unstable and cannot reliably store bits of information in the presence of input noise. \n\n* also with AT&T Bell Labs, Holmdel, NJ 07733 \n\nGiven this problem and the understanding brought by the theoretical analysis, we have explored and compared several alternative algorithms and architectures. Comparative experiments were performed on artificial tasks on which the span of the input/output dependencies can be controlled. In all cases, a duration parameter was varied, with sequence lengths drawn between T/2 and T, to avoid short sequences on which the algorithms could learn much more easily. These tasks require learning to latch, i.e., to store bits of information for arbitrary durations (which may vary from example to example). Such tasks cannot be performed by Time Delay Neural Networks or by recurrent networks whose memories are gradually lost with time constants that are fixed by the parameters of the network. \nOf all the alternatives to gradient descent that we have explored, an approach based on a probabilistic interpretation of a discrete state space, similar to hidden Markov models (HMMs), yielded the most interesting results.
\n\n2 A Difficult Problem of Error Propagation \n\nConsider a non-autonomous discrete-time system with additive inputs, such as a recurrent neural network with a continuous activation function: \n\na_t = M(a_{t-1}) + u_t (1) \n\nand the corresponding autonomous dynamics \n\na_t = M(a_{t-1}) (2) \n\nwhere M is a nonlinear map (which may have tunable parameters such as network weights), and a_t ∈ R^n and u_t ∈ R^m are vectors representing respectively the system state and the external input at time t. \nIn order to latch a bit of state information one wants to restrict the values of the system activity a_t to a subset S of its domain. In this way, it will be possible to later interpret a_t in at least two ways: inside S and outside S. To make sure that a_t remains in such a region, the system dynamics can be chosen such that this region is the basin of attraction of an attractor X (or of an attractor in a sub-manifold or subspace of a_t's domain). To \"erase\" that bit of information, the inputs may push the system activity a_t out of this basin of attraction and possibly into another one. \nIn (Bengio, Simard, & Frasconi, 1994) we show that only two conditions can arise when using hyperbolic attractors to latch bits of information in such a system. Either the system is very sensitive to noise, or the derivatives of the cost at time t with respect to the system activations a_0 converge exponentially to 0 as t increases. This situation is the essential reason for the difficulty in using gradient descent to train a dynamical system to capture long-term dependencies in the input/output sequences.
\n\nA first theorem shows that when the state a_t is in a region where |M'| > 1, small perturbations grow exponentially, which can lead to a loss of the information stored in the dynamics of the system: \n\nTheorem 1 Assume x is a point of R^n such that there exists an open sphere U(x) centered on x for which |M'(z)| > 1 for all z ∈ U(x). Then there exists y ∈ U(x) such that ||M(x) - M(y)|| > ||x - y||. \n\nA second theorem shows that when the state a_t is in a region where |M'| < 1, the gradients propagated backwards in time vanish exponentially fast: \n\nTheorem 2 If the input u_t is such that the system remains robustly latched (|M'(a_t)| < 1) on attractor X after time 0, then ∂a_t/∂a_0 → 0 as t → ∞. \n\nSee proofs in (Bengio, Simard, & Frasconi, 1994). A consequence of these results is that it is generally very difficult to train a parametric dynamical system (such as a recurrent neural network) to learn long-term dependencies using gradient descent. Based on the understanding brought by this analysis, we have explored and compared several alternative algorithms and architectures. \n\n3 Global Search Methods \n\nGlobal search methods such as simulated annealing can be applied to this problem, but they are generally very slow. We implemented the simulated annealing algorithm presented in (Corana, Marchesi, Martini, & Ridella, 1987) for optimizing functions of continuous variables. This is a \"batch learning\" algorithm (updating parameters after all examples of the training set have been seen). It performs a cycle of random moves, each along one coordinate (parameter) direction. Each point is accepted or rejected according to the Metropolis criterion (Kirkpatrick, Gelatt, & Vecchi, 1983). The simulated annealing algorithm is very robust with respect to local minima and long plateaus.
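A minimal sketch of one such annealing cycle (the quadratic cost, fixed step size and geometric cooling schedule below are our illustrative choices; the Corana et al. procedure additionally adapts a step size per coordinate, omitted here):

```python
import math
import random

def anneal_cycle(x, cost, temperature, step, rng):
    """One cycle of random moves, one along each coordinate direction,
    each accepted or rejected by the Metropolis criterion."""
    for i in range(len(x)):
        candidate = list(x)
        candidate[i] += rng.uniform(-step, step)
        delta = cost(candidate) - cost(x)
        # Always accept downhill moves; accept uphill with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x = candidate
    return x

rng = random.Random(0)
cost = lambda v: sum(vi ** 2 for vi in v)  # toy cost to minimize
x, temperature = [5.0, -3.0], 1.0
for _ in range(200):
    x = anneal_cycle(x, cost, temperature, 0.5, rng)
    temperature *= 0.98  # slow geometric cooling
```

The occasional uphill acceptance is what lets the method escape local minima and cross plateaus, at the price of many function evaluations.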
Another global search method evaluated in our experiments is a multi-grid random search. The algorithm tries random points around the current solution (within a hyperrectangle of decreasing size) and accepts only those that reduce the error. Thus it is resistant to problems of plateaus but not as resistant to problems of local minima. Indeed, we found the multi-grid random search to be much faster than simulated annealing but to fail on the parity problem, probably because of local minima. \n\n4 Time Weighted Pseudo-Newton \n\nThe time-weighted pseudo-Newton algorithm uses second order derivatives of the cost with respect to each of the instantiations of a weight at different time steps to try to correct for the vanishing gradient problem. The weight update for a weight w_i is computed as follows: \n\nΔw_i(p) = -η Σ_t (∂C(p)/∂w_it) / (|∂²C(p)/∂w_it²| + μ) (3) \n\nwhere w_it is the instantiation for time t of parameter w_i, η is a global learning rate and C(p) is the cost for pattern p. In this way, each (temporal) contribution to Δw_i(p) is weighted by the inverse curvature with respect to w_it. As for the pseudo-Newton algorithm of Becker and Le Cun (1988), we prefer using a diagonal approximation of the Hessian, which is cheap to compute and guaranteed to be positive definite. \nThe constant μ is introduced to prevent Δw from becoming very large (when |∂²C(p)/∂w_it²| is very small). We found the performance of this algorithm to be better than that of the regular pseudo-Newton algorithm, which is better than the simple stochastic backpropagation algorithm, but all of these algorithms perform worse and worse as the length of the sequences is increased. \n\n5 Discrete Error Propagation \n\nThe discrete error propagation algorithm replaces sigmoids in the network by discrete threshold units and attempts to propagate discrete error information backwards in time. 
The basic idea behind the algorithm is that for a simple discrete element such as a threshold unit or a latch, one can write down an error propagation rule that prescribes desired changes in the values of the inputs in order to obtain certain changes in the values of the outputs. In the case of a threshold unit, such a rule assumes that the desired change for the output of the unit is discrete (+2, 0 or -2). However, error information propagated backwards to such a unit might have a continuous value. A stochastic process is used to convert this continuous value into an appropriate discrete desired change. In the case of a self-loop, a clear advantage of this algorithm over gradient back-propagation through sigmoid units is that the error information does not vanish as it is repeatedly propagated backwards in time around the loop, even though the unit can robustly store a bit of information. Details of the algorithm will appear in (Bengio, Simard, & Frasconi, 1994). This algorithm performed better than the time-weighted pseudo-Newton, pseudo-Newton and back-propagation algorithms, but the learning curve appeared very irregular, suggesting that the algorithm is doing a local random search. \n\n6 An EM Approach to Target Propagation \n\nThe most promising of the algorithms we studied was derived from the idea of propagating targets instead of gradients. For this paper we restrict ourselves to sequence classification. We assume a finite-state learning system with the state q_t at time t taking on one of n values. Different final states for each class are used as targets. The system is given a probabilistic interpretation and we assume a Markovian conditional independence model. As in HMMs, the system propagates forward a discrete distribution over the n states. Transitions may be constrained so that each state j has a defined set of successors S_j.
\n\nFigure 1: The proposed architecture \n\nLearning is formulated as a maximum likelihood problem with missing data. Missing variables, over which an expectation is taken, are the paths in state-space. The EM (Expectation/Maximization) or GEM (Generalized EM) algorithms (Dempster, Laird, & Rubin, 1977) can be used to help decouple the influence of different hypothetical paths in state-space. The estimation step of EM requires propagating backward a discrete distribution of targets. In contrast to HMMs, where parameters are adjusted in an unsupervised learning framework, we use EM in a supervised fashion. This new perspective has been successful in training static models (Jordan & Jacobs, 1994). \nTransition probabilities, conditional on the current input, can be computed by a parametric function such as a layer of a neural network with softmax units. We propose a modular architecture with one subnetwork N_j for each state (see Figure 1). Each subnetwork is feedforward, takes as input a continuous vector of features u_t and has one output for each successor state, interpreted as P(q_t = i | q_{t-1} = j, u_t; θ), (j = 1, ..., n, i ∈ S_j). θ is a set of tunable parameters. Using a Markovian assumption, the distribution over states at time t is thus obtained as a linear combination of the outputs of the subnetworks, gated by the previously computed distribution: \n\nP(q_t = i | u_1^t; θ) = Σ_j P(q_{t-1} = j | u_1^{t-1}; θ) P(q_t = i | q_{t-1} = j, u_t; θ) (4) \n\nwhere u_1^t is the subsequence of inputs from time 1 to t inclusively. 
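The forward recursion (4) can be sketched as follows (the single-layer softmax subnetworks, their sizes and the random inputs below are illustrative stand-ins, not the networks used in the experiments):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward_step(p_prev, u_t, Ws):
    """One application of equation (4): each subnetwork N_j outputs a
    softmax over successor states, gated by the previous distribution."""
    p = np.zeros(len(p_prev))
    for j, W in enumerate(Ws):
        # Entry i of softmax(W @ u_t) plays the role of P(q_t=i | q_{t-1}=j, u_t).
        p += p_prev[j] * softmax(W @ u_t)
    return p

rng = np.random.default_rng(0)
n_states, n_features = 3, 4
Ws = [rng.normal(size=(n_states, n_features)) for _ in range(n_states)]
p = np.array([1.0, 0.0, 0.0])  # start with all mass in state 0
for _ in range(5):
    p = forward_step(p, rng.normal(size=n_features), Ws)
# p remains a proper distribution over the n states at every step
```

Because each subnetwork's outputs sum to one and the gating weights form a distribution, the mixture in (4) is automatically a valid distribution at every time step.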
The training algorithm looks for parameters θ of the system that maximize the likelihood L of falling in the \"correct\" state at the end of each sequence: \n\nL(θ) = Π_p P(q_{T_p} = q*_p | u_1^{T_p}; θ) (5) \n\nwhere p ranges over training sequences, T_p is the length of the pth training sequence, and q*_p the desired state at time T_p. \nAn auxiliary function Q(θ, θ_k) is constructed by introducing as hidden variables the whole state sequence; hence the complete likelihood function is defined as follows: \n\nL_c(θ) = Π_p P(q_1^{T_p} | u_1^{T_p}; θ) (6) \n\nand \n\nQ(θ, θ_k) = E_{θ_k}[log L_c(θ)] (7) \n\nwhere at the (k+1)th EM (or GEM) iteration, θ_{k+1} is chosen to maximize (or increase) the auxiliary function Q with respect to θ. \nIf the inputs are quantized and the subnetworks perform a simple look-up in a table of probabilities, then the EM algorithm can be used, i.e., ∂Q(θ, θ_k)/∂θ = 0 can be solved analytically. If the networks have non-linearities (e.g., with hidden units and a softmax at their output to constrain the outputs to sum to 1), then one can use the GEM algorithm (which simply increases Q, for example with gradient ascent) or directly perform (preferably stochastic) gradient ascent on the likelihood. \nAn extra term was introduced in the optimization criterion when we found that in many cases the target information would not propagate backwards (or would be diffused over all the states). These experiments confirmed previous results indicating a general difficulty of training fully connected HMMs, with the EM algorithm converging very often to poor local maxima of the likelihood. In an attempt to better understand the phenomenon, we looked at the quantities propagated forward and the quantities propagated backward (representing credit or blame) in the training algorithm. We found a diffusion of credit or blame occurring when the forward maps (i.e. 
the matrix of transition probabilities) at each time step are such that many inputs map to a few outputs, i.e., when the ratio of the volume of a small region in the image of the map to the volume of the corresponding region in the domain is small. This ratio is the absolute value of the determinant of the Jacobian of the map. Hence, using an optimization criterion that incorporates the maximization of the average magnitude of the determinant of the transition matrices, this algorithm performs much better than the other algorithms. Two other tricks were found to be important to help convergence and reduce the problem of diffusion of credit. \nThe first idea is to use whenever possible a structured model with a sparse connectivity matrix, thus introducing some prior knowledge about the state-space. For example, applications of HMMs to speech recognition always rely on such structured topologies. We could reduce connectivity in the transition matrix for the 2-sequence problem (see next section for its definition) by splitting some of the nodes into two subsets, each specializing on one of the sequence classes. However, sometimes it is not possible to introduce such constraints, as in the parity problem. Another trick that drastically improved performance was to use stochastic gradient ascent in a way that helps the training algorithm get out of local optima. The learning rate is decreased when the likelihood improves, but it is increased when the likelihood remains flat (the system is stuck in a plateau or local optimum). \nAs the results in the next section show, the performance obtained with this algorithm is much better than that obtained with the other algorithms on the two simple test problems that were considered. \n\n7 Experimental Results \n\nWe present here results on two problems for which one can control the span of input/output dependencies. 
The 2-sequence problem is the following: classify an input sequence, at the end of the sequence, as one of two types, when only the first N elements (N = 3 in our experiments) of the sequence carry information about the sequence class. Uniform noise is added to the sequence. For the first 6 methods (see Tables 1 to 4), we used a fully connected recurrent network with 5 units (with 25 free parameters). For the EM algorithm, we used a 7-state system with a sparse connectivity matrix (an initial state, and two separate left-to-right submodels of three states each to model the two types of sequences). \nThe parity problem consists of producing the parity of an input sequence of 1's and -1's (i.e., a 1 should be produced at the final output if and only if the number of 1's in the input is odd). The target is only given at the end of the sequence. For the first 6 methods we used a minimal size network (1 input, 1 hidden, 1 output, 7 free parameters). For the EM algorithm, we used a 2-state system with a full connectivity matrix. \nInitial parameters were chosen randomly for each trial. Noise added to the sequence was also uniformly distributed and chosen independently for each training sequence. We considered two criteria: (1) the average classification error at the end of training, i.e., after a stopping criterion has been met (when either some allowed number of function evaluations has been performed or the task has been learned); (2) the average number of function evaluations needed to reach the stopping criterion. \nIn the tables, \"p-n\" stands for pseudo-Newton. Each column corresponds to a value of the maximum sequence length T for a given set of trials. The sequence length for a particular training sequence was picked randomly between T/2 and T. Numbers reported are averages over 20 or more trials. 
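For concreteness, parity training examples of the kind described above can be generated along these lines (a sketch consistent with the description; the exact sampling protocol of the experiments is not reproduced here):

```python
import random

def parity_example(max_len, rng):
    """Sample a +1/-1 sequence whose length is drawn between max_len//2 and
    max_len, with target +1 iff the number of 1's in the sequence is odd."""
    length = rng.randint(max_len // 2, max_len)
    seq = [rng.choice([1, -1]) for _ in range(length)]
    target = 1 if sum(s == 1 for s in seq) % 2 == 1 else -1
    return seq, target

rng = random.Random(0)
examples = [parity_example(20, rng) for _ in range(100)]
```

Drawing the length between T/2 and T, as above, prevents a learner from succeeding only on the short sequences of a training set.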
\n\n8 Conclusion \n\nRecurrent networks and other parametric dynamical systems are very powerful in their ability to represent and use context. However, theoretical and experimental evidence shows the difficulty of assigning credit through many time steps, which is required in order to learn to use and represent context. This paper studies this fundamental problem and proposes alternatives to the backpropagation algorithm to perform such learning tasks. Experiments show these alternative approaches to perform significantly better than gradient descent. The behavior of these algorithms yields a better understanding of the central issue of learning to use context, or assigning credit through many transformations. Although all of the alternative algorithms presented here showed some improvement with respect to standard stochastic gradient descent, the clear winner in our comparison was an algorithm based on the EM algorithm and a probabilistic interpretation of the system dynamics. However, experiments on more challenging tasks will have to be conducted to confirm those results. Furthermore, several extensions of this model are possible, for example allowing both inputs and outputs, with supervision on outputs rather than on states. Finally, similarly to the work we performed for recurrent networks trained with gradient descent, it would be very important to analyze theoretically the problems of propagation of credit encountered in training such Markov models. \n\nAcknowledgements \n\nWe wish to emphatically thank Patrice Simard, who collaborated with us on the analysis of the theoretical difficulties in learning long-term dependencies, and on the discrete error propagation algorithm. \n\nReferences \n\nS. Becker and Y. Le Cun. (1988) Improving the convergence of back-propagation learning with second order methods, Proc. of the 1988 Connectionist Models Summer School, (eds. 
Touretzky, Hinton and Sejnowski), Morgan Kaufmann, pp. 29-37. \nY. Bengio, P. Simard, and P. Frasconi. (1994) Learning long-term dependencies with gradient descent is difficult, IEEE Trans. Neural Networks, (in press). \nA. Corana, M. Marchesi, C. Martini, and S. Ridella. (1987) Minimizing multimodal functions of continuous variables with the simulated annealing algorithm, ACM Transactions on Mathematical Software, vol. 13, no. 3, pp. 262-280. \nA.P. Dempster, N.M. Laird, and D.B. Rubin. (1977) Maximum-likelihood from incomplete data via the EM algorithm, J. of the Royal Stat. Soc., vol. B39, pp. 1-38. \nM.I. Jordan and R.A. Jacobs. (1994) Hierarchical mixtures of experts and the EM algorithm, Neural Computation, (in press). \nS. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi. (1983) Optimization by simulated annealing, Science 220, 4598, pp. 671-680. \n\nTable 1: Final classification error for the 2-sequence problem wrt sequence length (columns correspond to increasing maximum sequence length T; the column headers and the back-prop row are illegible in the source scan) \n\nback-prop: --, --, --, --, -- \np-n: 2, 3, 10, 25, 29 \ntime-weighted p-n: 0, 0, 9, 34, 14 \nmultigrid: 2, 6, 1, 3, 6 \ndiscrete err. prop.: 6, 16, 23, 29, 22 \nsimulated anneal.: 6, 0, 4, 7, 11 \nEM: 0, 0, 0, 0, 0 \n\nTable 2: # sequence presentations for the 2-sequence problem wrt sequence length (columns correspond to increasing maximum sequence length T; the column headers and the back-prop row are illegible in the source scan) \n\nback-prop: --, --, --, --, -- \np-n: 5.1e2, 1.1e3, 1.9e3, 2.6e3, 2.5e3 \ntime-weighted p-n: 5.4e2, 4.3e2, 2.4e3, 2.9e3, 2.7e3 \nmultigrid: 4.1e3, 5.8e3, 2.5e3, 3.9e3, 6.4e3 \ndiscrete err. prop.: 6.6e2, 1.3e3, 2.1e3, 2.1e3, 2.1e3 \nsimulated anneal.: 2.0e5, 3.9e4, 8.2e4, 7.7e4, 4.3e4 \nEM: 3.2e3, 4.0e3, 2.9e3, 3.2e3, 2.9e3 \n\nTable 3: Final classification error for the parity problem wrt sequence length (T = 5, 10, 20, 50, 100, 500; rows as in Table 1) \n\n[table cell values not reliably recoverable from the source scan]
 \n\nTable 4: # sequence presentations for the parity problem wrt sequence length (columns correspond to increasing maximum sequence length T; rows as in Table 1) \n\n[table cell values not reliably recoverable from the source scan] \n", "award": [], "sourceid": 724, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Paolo", "family_name": "Frasconi", "institution": null}]}