{"title": "Recurrent Neural Networks for Missing or Asynchronous Data", "book": "Advances in Neural Information Processing Systems", "page_first": 395, "page_last": 401, "abstract": null, "full_text": "Recurrent Neural Networks for Missing or \n\nAsynchronous Data \n\nYoshua Bengio -\n\nDept. Informatique et \n\nRecherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nFrancois Gingras \nDept. Informatique et \n\nRecherche Operationnelle \nUniversite de Montreal \nMontreal, Qc H3C-3J7 \n\nbengioy~iro.umontreal.ca \n\ngingra8~iro.umontreal.ca \n\nAbstract \n\nIn this paper we propose recurrent neural networks with feedback into the input \nunits for handling two types of data analysis problems. On the one hand, this \nscheme can be used for static data when some of the input variables are missing. \nOn the other hand, it can also be used for sequential data, when some of the \ninput variables are missing or are available at different frequencies. Unlike in the \ncase of probabilistic models (e.g. Gaussian) of the missing variables, the network \ndoes not attempt to model the distribution of the missmg variables given the \nobserved variables. Instead it is a more \"discriminant\" approach that fills in the \nmissing variables for the sole purpose of minimizing a learning criterion (e.g., to \nminimize an output error). \n\nIntroduction \n\n1 \nLearning from examples implies discovering certain relations between variables of interest. The \nmost general form of learning requires to essentially capture the joint distribution between these \nvariables. However, for many specific problems, we are only interested in predicting the value \nof certain variables when the others (or some of the others) are given. A distinction IS therefore \nmade between input variables and output variables. Such a task requires less information (and \nless p'arameters, in the case of a parameterized model) than that of estimating the full joint \ndistrIbution. 
For example, in the case of classification problems, a traditional statistical approach is based on estimating the conditional distribution of the inputs for each class as well as the class prior probabilities (thus yielding the full joint distribution of inputs and classes). A more discriminant approach concentrates on estimating the class boundaries (and therefore requires fewer parameters), as for example with a feedforward neural network trained to estimate the output class probabilities given the observed variables. \n\nHowever, for many learning problems, only some of the input variables are given for each particular training case, and the missing variables differ from case to case. The simplest way to deal with this missing data problem consists in replacing the missing values by their unconditional mean. It can be used with \"discriminant\" training algorithms such as those used with feedforward neural networks. However, in some problems, one can obtain better results by taking advantage of the dependencies between the input variables. A simple idea therefore consists \n\n* Also with AT&T Bell Labs, Holmdel, NJ 07733. \n\nFigure 1: Architectures of the recurrent networks in the experiments. On the left, a 90-3-4 architecture for static data with missing values; on the right, a 6-3-2-1 architecture with multiple time-scales for asynchronous sequential data. Small squares represent a unit delay. The number of units in each layer is inside the rectangles. The time scale at which each layer operates is on the right of each rectangle. \n\nin replacing the missing input variables by their conditional expected value, given the observed input variables. An even better scheme is to compute the expected output given the observed inputs, e.g. with a mixture of Gaussians. Unfortunately, this amounts to estimating the full joint distribution of all the variables. 
For example, with n_i inputs, capturing the possible effect of each observed variable on each missing variable would require O(n_i^2) parameters (at least one parameter to capture some co-occurrence statistic for each pair of input variables). Many related approaches have been proposed to deal with missing inputs using a Gaussian (or Gaussian mixture) model (Ahmad and Tresp, 1993; Tresp, Ahmad and Neuneier, 1994; Ghahramani and Jordan, 1994). In the experiments presented here, the proposed recurrent network is compared with a Gaussian mixture model trained with EM to handle missing values (Ghahramani and Jordan, 1994). \n\nThe approach proposed in section 2 is more economical than the traditional Gaussian-based approaches for two reasons. First, we take advantage of hidden units in a recurrent network, which might be less numerous than the inputs. The number of parameters depends on the product of the number of hidden units and the number of inputs, and the hidden units only need to capture those dependencies between input variables that are useful for reducing the output error. The second advantage is that training is based on optimizing the desired criterion (e.g., reducing an output error), rather than on predicting the values of the missing inputs as well as possible. The recurrent network is allowed to relax for a few iterations (typically as few as 4 or 5) in order to fill in some values for the missing inputs and produce an output. In section 3 we present experimental results with this approach, comparing the results with those obtained with a feedforward network. \n\nIn section 4 we propose an extension of this scheme to sequential data. In this case, the network is not relaxing: inputs keep changing with time and the network maps an input sequence (with possibly missing values) to an output sequence. 
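To make the simplest baseline concrete, the unconditional-mean imputation mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper; `impute_mean`, the toy data, and the use of `None` to mark a missing value are our own choices.

```python
def column_means(rows):
    """Mean of each variable over the cases where it is observed."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return means

def impute_mean(rows):
    """Replace each missing input by that variable's unconditional mean."""
    means = column_means(rows)
    return [[means[j] if v is None else v for j, v in enumerate(r)]
            for r in rows]

data = [[1.0, None], [3.0, 4.0], [None, 6.0]]
filled = impute_mean(data)  # missing entries become 2.0 and 5.0
```

This baseline ignores all dependencies between input variables, which is exactly the weakness the conditional schemes below try to address.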
The main advantage of this sequential extension is that it allows us to deal with sequential data in which the variables occur at different frequencies. This type of problem is frequent, for example, with economic or financial data. An experiment with asynchronous data is presented in section 5. \n\n2 Relaxing Recurrent Network for Missing Inputs \n\nNetworks with feedback such as those proposed in (Almeida, 1987; Pineda, 1989) can be applied to learning a static input/output mapping when some of the inputs are missing. In both cases, however, one has to wait for the network to relax either to a fixed point (assuming it does find one) or to a \"stable distribution\" (in the case of the Boltzmann machine). In the case of fixed-point recurrent networks, the training algorithm assumes that a fixed point has been reached. The gradient with respect to the weights is then computed in order to move the fixed point to a more desirable position. The approach we have preferred here avoids such an assumption. Instead it uses a more explicit optimization of the whole behavior of the network as it unfolds in time, fills in the missing inputs and produces an output. The network is trained to minimize some function of its output by back-propagation through time. \n\nComputation of Outputs Given Observed Inputs \nGiven: input vector u = [u_1, u_2, ..., u_{n_i}] \nResult: output vector y = [y_1, y_2, ..., y_{n_o}] \n\n1. Initialize for t = 0: \nFor i = 1 ... n_u, x_{0,i} <- 0. \nFor i = 1 ... n_i, if u_i is missing then x_{0,I(i)} <- E(i), else x_{0,I(i)} <- u_i. \n\n2. Loop over time: \nFor t = 1 to T, for i = 1 ... n_u: \nIf i = I(k) is an input unit and u_k is not missing, then \nx_{t,i} <- u_k \nElse \nx_{t,i} <- (1 - gamma) x_{t-1,i} + gamma f(sum_{l in S_i} w_l x_{t-d_l, p_l}) \nwhere S_i is the set of links into unit i, each coming from a unit p_l with weight w_l and a discrete delay d_l (terms for which t - d_l < 0 are not considered). \n\n3. Collect outputs by averaging at the end of the sequence: \ny_i <- sum_t v_t x_{t,O(i)} \n\nBack-Propagation \nThe back-propagation computation requires an extra set of variables xhat_{t,i} and what_l, which will contain respectively dC/dx_{t,i} and dC/dw_l after this computation. \nGiven: output gradient vector dC/dy \nResult: input gradient dC/du and parameter gradient dC/dw \n\n1. Initialize unit gradients using the outside gradient: \nInitialize xhat_{t,i} = 0 for all t and i. \nFor i = 1 ... n_o, initialize xhat_{t,O(i)} <- v_t dC/dy_i. \n\n2. Backward loop over time: \nFor t = T to 1, for i = n_u ... 1: \nIf i = I(k) is an input unit and u_k is not missing, then \nno backward propagation \nElse \nxhat_{t-1,i} <- xhat_{t-1,i} + (1 - gamma) xhat_{t,i} \nFor l in S_i, if t - d_l >= 0: \nxhat_{t-d_l, p_l} <- xhat_{t-d_l, p_l} + gamma w_l f'(sum_{l' in S_i} w_{l'} x_{t-d_{l'}, p_{l'}}) xhat_{t,i} \nwhat_l <- what_l + gamma f'(sum_{l' in S_i} w_{l'} x_{t-d_{l'}, p_{l'}}) x_{t-d_l, p_l} xhat_{t,i} \n\n3. Collect input gradients: \nFor i = 1 ... n_i: \nIf u_i is missing, then dC/du_i <- 0 \nElse dC/du_i <- sum_t xhat_{t,I(i)} \n\nThe observed inputs are clamped for the whole duration of the sequence. The units corresponding to missing inputs are initialized to their unconditional expectation, and their value is then updated through the feedback links for the rest of the sequence (just as if they were hidden units). To help the stability of the network and prevent it from finding periodic solutions (in which the outputs are correct only periodically), output supervision is given for several time steps. A fixed vector v, with v_t > 0 and sum_t v_t = 1, specifies a weighting scheme that distributes 
the responsibility for producing the correct output among the different time steps. Its purpose is to encourage the network to develop stable dynamics which gradually converge toward the correct output (thus the weights v_t were chosen to gradually increase with t). \n\nThe neuron transfer function was a hyperbolic tangent in our experiments. The inertial term weighted by gamma (in step 2 of the forward propagation algorithm) was used to help the network find stable solutions. The parameter gamma was fixed by hand; in the experiments described below, a value of 0.7 was used, but nearby values yielded similar results. \n\nThis module can therefore be combined within a hybrid system composed of several modules by propagating gradients through the combined system (as in (Bottou and Gallinari, 1991)). For example, as in Figure 2, there might be another module taking the recurrent network's output as input. In this case the recurrent network can be seen as a feature extractor that accepts data with missing values in input and computes a set of features that are never missing. In another example of a hybrid system, the non-missing values in input of the recurrent network are computed by another, upstream module (such as the preprocessing normalization used in our experiments), and the recurrent network would provide gradients to this upstream module (for example to better tune its normalization parameters). \n\n3 Experiments with Static Data \n\nA network with three layers (inputs, hidden, outputs) was trained to classify data with missing values from the audiology database. This database was made public thanks to Jergen and Quinlan, was used by (Bareiss and Porter, 1987), and was obtained from the UCI Repository of machine learning databases (ftp.ics.uci.edu:pub/machine-learning-databases). The original database has 226 patterns, with 69 attributes, and 24 classes. 
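Returning to the relaxation algorithm of section 2, its forward pass can be sketched as follows under simplifying assumptions of our own: a single set of delay-1 links described by one dense matrix `W` (the paper allows arbitrary discrete delays d_l), a tanh transfer function, and the inertial term gamma = 0.7 used in the experiments. The function name and data layout are illustrative; in practice the gradients of steps 1-3 of the back-propagation box would be computed through time as well.

```python
import math

def relax_forward(u, W, E, gamma=0.7, T=5):
    """Forward relaxation with missing-input fill-in (a sketch).
    u[i] is None when input i is missing; E[i] is its unconditional
    expectation; W[i][j] weighs the delay-1 link from unit j to unit i.
    The first len(u) of the n units are the input units."""
    n = len(W)
    x = [0.0] * n
    # step 1: clamp observed inputs, fill missing ones with E(i)
    for i, ui in enumerate(u):
        x[i] = E[i] if ui is None else ui
    # step 2: relax for T steps; observed inputs stay clamped,
    # missing-input and hidden units follow the inertial update
    for _ in range(T):
        new_x = list(x)
        for i in range(n):
            if i < len(u) and u[i] is not None:
                new_x[i] = u[i]
            else:
                s = sum(W[i][j] * x[j] for j in range(n))
                new_x[i] = (1 - gamma) * x[i] + gamma * math.tanh(s)
        x = new_x
    return x
```

In the full algorithm, outputs are then read out as a v_t-weighted average of selected unit activations over time, and training proceeds by back-propagation through time on those outputs.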
Unfortunately, most of the classes have only one exemplar, so we decided to cluster the classes into four groups. To do so, the average pattern for each of the 24 classes was computed, and the K-Means clustering algorithm was then applied to those 24 prototypical class \"patterns\" to yield the 4 \"superclasses\" used in our experiments. The multi-valued symbolic input attributes (with more than 2 possible values) were coded with a \"one-out-of-n\" scheme, using n inputs (all zeros except the one corresponding to the attribute value). Note that a missing value was represented with a special numeric value recognized by the neural network module. The inputs which were constant over the training set were then removed. The remaining 90 inputs were finally standardized (by computing mean and standard deviation) and transformed by a saturating non-linearity (a scaled hyperbolic tangent). The output class is coded with a \"one-out-of-4\" scheme, and the recognized class is the one for which the corresponding output has the largest value. \n\nThe architecture of the network is depicted in Figure 1 (left). The length of each relaxing sequence in the experiments was 5. Higher values did not bring any measurable improvement, whereas for shorter sequences performance degraded. The number of hidden units was varied, with the best generalization performance obtained using 3 hidden units. \n\nThe recurrent network was compared with feedforward networks as well as with a mixture of Gaussians. For the feedforward networks, the missing input values were replaced by their unconditional expected value. They were trained to minimize the same criterion as the recurrent networks, i.e., the sum of squared differences between network output and desired output. Several feedforward neural networks with varying numbers of hidden units were trained. The best generalization was obtained with 15 hidden units. 
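The one-out-of-n input coding described above can be sketched as follows. The `MISSING` sentinel is an illustrative stand-in for the special numeric value mentioned in the text; the paper does not specify which value was actually used.

```python
MISSING = -2.0  # illustrative sentinel recognized as "missing" downstream

def one_out_of_n(value, n_values):
    """Code a symbolic attribute with n_values possibilities as n_values
    inputs, all zero except the one matching the attribute value.
    A missing attribute marks all of its inputs as missing."""
    if value is None:
        return [MISSING] * n_values
    code = [0.0] * n_values
    code[value] = 1.0
    return code
```

The same convention extends to the \"one-out-of-4\" output coding: the recognized class is simply the index of the largest output.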
Experiments were also performed with no hidden units and with two hidden layers (see Table 1). We found that the recurrent network not only generalized better but also learned much faster (although each pattern required 5 times more work because of the relaxation), as depicted in Figure 3. \n\nThe recurrent network was also compared with an approach based on a Gaussian and Gaussian mixture model of the data. We used the algorithm described in (Ghahramani and Jordan, 1994) for supervised learning from incomplete data with the EM algorithm. The whole joint input/output distribution is modeled using a mixture model with Gaussian (for the inputs) and multinomial (for the outputs) components: \n\nP(X = x, C = c) = sum_j P(omega_j) theta_{jc} (2 pi)^{-n_i/2} |Sigma_j|^{-1/2} exp{ -1/2 (x - mu_j)' Sigma_j^{-1} (x - mu_j) } \n\nwhere x is the input vector, c the output class, and P(omega_j) the prior probability of component j of the mixture. The theta_{jc} are the multinomial parameters; mu_j and Sigma_j are the Gaussian mean vector and covariance matrix for component j. Maximum likelihood training is applied as explained in (Ghahramani and Jordan, 1994), taking missing values into account (as additional missing variables for the EM algorithm). \n\nFigure 2: Example of a hybrid modular system, using the recurrent network (middle) to extract features from patterns which may have missing values. It can be combined with upstream modules (e.g., a normalizing preprocessor, right) and downstream modules (e.g., a static classifier, left). Dotted arrows show the backward flow of gradients. \n\nFigure 3: Evolution of training and test error for the recurrent network and for the best of the feedforward networks (90-15-4): average classification error w.r.t. training epoch (with 1 standard deviation error bars, computed over 10 trials). \n\nFor each architecture in Table 1, 10 training trials were run with a different subset of 200 training and 26 test patterns (and different initial weights for the neural networks). The recurrent network was clearly superior to the other architectures, probably for the reasons discussed in the conclusion. In addition, we have shown graphically the rate of convergence during training of the best feedforward network (90-15-4) as well as the best recurrent network (90-3-4) in Figure 3. Clearly, the recurrent network not only performs better at the end of training but also learns much faster. \n\n4 Recurrent Network for Asynchronous Sequential Data \n\nAn important problem with many sequential data analysis problems, such as those encountered in financial data sets, is that different variables are known at different frequencies, at different times (phase), or are sometimes missing. For example, some variables are given daily, weekly, monthly, quarterly, or yearly. Furthermore, some variables may not even be given for some of the periods, or the precise timing may change (for example, the date at which a company reports financial performance may vary). \n\nTherefore, we propose to extend the algorithm presented above for static data with missing values to the general case of sequential data with missing values or asynchronous variables. For time steps at which a low-frequency variable is not given, a missing value is assumed in input. Again, the feedback links from the hidden and output units to the input units allow the network 
Table 1: Comparative performance of the recurrent network, feedforward networks, and Gaussian mixture density models on the audiology data. The average percentage of classification error is shown after training, for both training and test sets, with the standard deviation in parentheses, over 10 trials (digits illegible in the source are marked \"?\"). \n\nArchitecture | Training set error | Test set error \n90-3-4 Recurrent net | 0.3 (0.6) | 2.? (?) \n90-6-4 Recurrent net | 0 (0) | 3.8 (4) \n90-25-4 Feedforward net | 0 (?.6) | 15 (7.3) \n90-15-4 Feedforward net | 0.8 (0.4) | 13.8 (7) \n90-10-6-4 Feedforward net | 1 (0.9) | 16 (5.3) \n90-6-4 Feedforward net | 6 (4.9) | 29 (8.9) \n90-2-4 Feedforward net | 18.5 (?) | 27 (10) \n90-4 Feedforward net | 22 (1) | 33 (8) \n1 Gaussian | 35 (1.6) | 38 (9.3) \n4 Gaussians Mixture | 36 (1.5) | 38 (9.2) \n8 Gaussians Mixture | 36 (2.1) | 38 (9.3) \n\nto \"complete\" the missing data. The main differences with the static case are that the inputs and outputs vary with t (we use u_t and y_t at each time step instead of u and y). The training algorithm is otherwise the same. \n\n5 Experiments with Asynchronous Data \n\nTo evaluate the algorithm, we used a recurrent network with random weights and feedback links on the input units to generate artificial data. The generating network has 6 input, 3 hidden, and 1 output units. The hidden layer is connected back to the input layer (with a delay of 1), receives inputs with delays 0 and 1 from the input layer and with delay 1 from itself, and feeds the output layer. At the initial time step, as well as at 5% of the time steps (chosen randomly), the input units were clamped with random values to introduce some further variability. The missing values were then completed by the recurrent network. To generate asynchronous data, half of the inputs were then hidden with missing values 4 out of every 5 time steps. 100 training sequences and 50 test sequences were generated. 
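The asynchronous-data construction just described (half of the inputs hidden 4 out of every 5 time steps) can be simulated as below. This is an illustrative sketch: the function name is ours, `None` marks a missing value, and we assume the slow variables are observed at every fifth step starting from t = 0.

```python
def make_asynchronous(seq, slow_inputs, period=5):
    """Hide the given input variables at every time step except
    multiples of `period`, mimicking variables that are observed
    at a lower frequency. seq is a list of input vectors."""
    out = []
    for t, x in enumerate(seq):
        row = list(x)
        if t % period != 0:
            for i in slow_inputs:
                row[i] = None  # this variable is unobserved at step t
        out.append(row)
    return out
```

A static baseline would fill these `None` entries with the last observed value (as done for the time-delay network below), whereas the recurrent network completes them through its feedback links.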
The learning problem is therefore a sequence regression problem with missing and asynchronous input variables. Preliminary comparative experiments show a clear advantage to completing the missing values (due to the different frequencies of the input variables) with the recurrent network, as shown in Figure 4. The recognition recurrent network is shown on the right of Figure 1. It has multiple time scales (implemented with subsampling and oversampling, as in TDNNs (Lang, Waibel and Hinton, 1990) and reverse-TDNNs (Simard and LeCun, 1992)) to facilitate the learning of such asynchronous data. The static network is a time-delay neural network with 6 input, 8 hidden, and 1 output units, and connections with delays 0, 2, and 4 from the input to hidden and hidden to output units. The \"missing values\" for slow-varying variables were replaced by the last observed value in the sequence. Experiments with 4 and 16 hidden units yielded similar results. \n\n6 Conclusion \n\nWhen there are dependencies between input variables, and the output prediction can be improved by taking them into account, we have seen that a recurrent network with input feedback can perform significantly better than a simpler approach that replaces missing values by their unconditional expectation. In our view, this explains the significant improvement brought by using the recurrent network instead of a feedforward network in the experiments. \n\nOn the other hand, the large number of input variables (n_i = 90 in the experiments) most likely explains the poor performance of the mixture of Gaussians model in comparison to both the static networks and the recurrent network: the Gaussian model requires estimating O(n_i^2) parameters and inverting large covariance matrices. \n\nThe approach to handling missing values presented here can also be extended to sequential data with missing or asynchronous variables. 
As our experiments suggest, for such problems, using recurrence and multiple time scales yields better performance than static or time-delay networks for which the missing values are filled in using a heuristic. \n\nFigure 4: Test set mean squared error on the asynchronous data. Top: static network with time delays. Bottom: recurrent network with feedback to input values to complete missing data. \n\nReferences \n\nAhmad, S. and Tresp, V. (1993). Some solutions to the missing feature problem in vision. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, San Mateo, CA. Morgan Kaufmann. \n\nAlmeida, L. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Caudill, M. and Butler, C., editors, IEEE International Conference on Neural Networks, volume 2, pages 609-618, San Diego, 1987. IEEE, New York. \n\nBareiss, E. and Porter, B. (1987). Protos: An exemplar-based learning apprentice. In Proceedings of the 4th International Workshop on Machine Learning, pages 12-23, Irvine, CA. Morgan Kaufmann. \n\nBottou, L. and Gallinari, P. (1991). A framework for the cooperation of learning algorithms. In Lippmann, R. P., Moody, J., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 781-788, Denver, CO. \n\nGhahramani, Z. and Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, San Mateo, CA. Morgan Kaufmann. \n\nLang, K. J., Waibel, A. H., and Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3:23-43. \n\nPineda, F. (1989). Recurrent back-propagation and the dynamical approach to adaptive neural computation. Neural Computation, 1:161-172. \n\nSimard, P. and LeCun, Y. (1992). Reverse TDNN: An architecture for trajectory generation. In Moody, J., Hanson, S., and Lippmann, R., editors, Advances in Neural Information Processing Systems 4, pages 579-588, Denver, CO. Morgan Kaufmann, San Mateo. \n\nTresp, V., Ahmad, S., and Neuneier, R. (1994). Training neural networks with deficient data. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 128-135. Morgan Kaufmann, San Mateo, CA. \n", "award": [], "sourceid": 1126, "authors": [{"given_name": "Yoshua", "family_name": "Bengio", "institution": null}, {"given_name": "Francois", "family_name": "Gingras", "institution": null}]}