{"title": "Modeling Deep Temporal Dependencies with Recurrent Grammar Cells\"\"", "book": "Advances in Neural Information Processing Systems", "page_first": 1925, "page_last": 1933, "abstract": "We propose modeling time series by representing the transformations that take a frame at time t to a frame at time t+1. To this end we show how a bi-linear model of transformations, such as a gated autoencoder, can be turned into a recurrent network, by training it to predict future frames from the current one and the inferred transformation using backprop-through-time. We also show how stacking multiple layers of gating units in a recurrent pyramid makes it possible to represent the \u201dsyntax\u201d of complicated time series, and that it can outperform standard recurrent neural networks in terms of prediction accuracy on a variety of tasks.", "full_text": "Modeling Deep Temporal Dependencies with\n\nRecurrent \u201cGrammar Cells\u201d\n\nVincent Michalski\n\nGoethe University Frankfurt, Germany\n\nvmichals@rz.uni-frankfurt.de\n\nRoland Memisevic\n\nUniversity of Montreal, Canada\n\nroland.memisevic@umontreal.ca\n\nKishore Konda\n\nGoethe University Frankfurt, Germany\n\nkonda.kishorereddy@gmail.com\n\nAbstract\n\nWe propose modeling time series by representing the transformations that take a\nframe at time t to a frame at time t+1. To this end we show how a bi-linear model\nof transformations, such as a gated autoencoder, can be turned into a recurrent net-\nwork, by training it to predict future frames from the current one and the inferred\ntransformation using backprop-through-time. 
We also show how stacking multiple layers of gating units in a recurrent pyramid makes it possible to represent the \u201csyntax\u201d of complicated time series, and that it can outperform standard recurrent neural networks in terms of prediction accuracy on a variety of tasks.\n\n1 Introduction\n\nThe predominant paradigm of modeling time series is based on state-space models, in which a hidden state evolves according to some predefined dynamical law, and an observation model maps the state to the data space. In this work, we explore an alternative approach to modeling time series, where learning amounts to finding an explicit representation of the transformation that takes an observation at time t to the observation at time t + 1.\nModeling a sequence in terms of transformations makes it easy to exploit redundancies that would be hard to capture otherwise. For example, very little information is needed to specify an element of the signal class sine-wave, if it is represented in terms of a linear mapping that takes a snippet of signal to the next snippet: given an initial \u201cseed\u201d frame, any two sine-waves differ only by the amount of phase shift that the linear transformation has to repeatedly apply at each time step.\nIn order to model a signal as a sequence of transformations, it is necessary to make transformations \u201cfirst-class objects\u201d that can be passed around and picked up by higher layers in the network. To this end, we use bilinear models (e.g. [1, 2, 3]), which use multiplicative interactions to extract transformations from pairs of observations. 
We show that deep learning, which has proven effective at learning structural hierarchies, can also learn to capture hierarchies of relations or transformations. A deep model can be built by stacking multiple layers of the transformation model, so that higher layers capture higher-order transformations (that is, transformations between transformations). To be able to model multiple steps of a time series, we propose a training scheme called predictive training: after computing a deep representation of the dynamics from the first frames of a time series, the model predicts future frames by repeatedly applying the transformations passed down by higher layers, assuming constancy of the transformation in the top-most layer. Derivatives are computed using back-prop through time (BPTT) [4]. We shall refer to this model as a predictive gating pyramid (PGP) in the following.\nSince hidden units at each layer encode transformations, not the content of their inputs, they capture only structural dependencies, and we refer to them as \u201cgrammar cells.\u201d1 The model can also be viewed as a higher-order partial difference equation whose parameters are estimated from data. Generating from the model amounts to providing boundary conditions in the form of seed frames, whose number corresponds to the number of layers (the order of the difference equation). We demonstrate that a two-layer model is already surprisingly effective at capturing whole classes of complicated time series, including frequency-modulated sine-waves (also known as \u201cchirps\u201d), which we found hard to represent using standard recurrent networks.\n\n1.1 Related Work\n\nLSTM units [5] also use multiplicative interactions, in conjunction with self-connections of weight 1, to model long-term dependencies and to avoid vanishing gradient problems [6]. 
Instead of constant self-connections, the lower-layer units in our model can represent long-term structure by using dynamically changing orthogonal transformations, as we shall show. Other related work includes [7], where multiplicative interactions are used to let inputs modulate connections between successive hidden states of a recurrent neural network (RNN), with application to modeling text. Our model also bears some similarity to [3], who model MOCAP data using a three-way Restricted Boltzmann Machine, where a second layer of hidden units can be used to model more \u201cabstract\u201d features of the time series. In contrast to that work, our higher-order units, which are bi-linear too, are used to explicitly model higher-order transformations. More importantly, we train our model predictively using backprop through time, which is crucial for achieving good performance, as we show in our experiments. Other approaches to sequence modeling include [8], who compress sequences using a two-layer RNN, where the second layer predicts residuals which the first layer fails to predict well. In our model, compression amounts to exploiting redundancies in the relations between successive sequence elements. In contrast to [9], who introduce a recursive bi-linear autoencoder for modeling language, our model is recurrent and trained to predict, not reconstruct. The model by [10] is similar to ours in that it learns the dynamics of sequences, but it assumes a simple autoregressive, rather than deep compositional, dependence on the past. An early version of our work is described in [11].\nOur work is also loosely related to sequence-based invariance [12] and slow feature analysis [13], because hidden units are designed to extract structure that is invariant in time. 
In contrast to that work, our multi-layer models assume higher-order invariances, that is, invariance of velocity in the case of one hidden layer, of acceleration in the case of two, of jerk (the rate of change of acceleration) in the case of three, etc.\n\n2 Background on Relational Feature Learning\n\nIn order to learn transformation features, m, that represent the relationship between two observations x(1) and x(2), it is necessary to learn a basis that can represent the correlation structure across the observations. In a time series, knowledge of one frame, x(1), typically highly constrains the distribution over possible next frames, x(2). This suggests modeling x(2) using a feature learning model whose parameters are a function of x(1) [14], giving rise to bi-linear models of transformations, such as the Gated Boltzmann Machine [15, 3], the Gated Autoencoder [16], and similar models (see [14] for an overview). Formally, bi-linear models learn to represent a linear transformation, L, between two observations x(1) and x(2), where\n\nx(2) = Lx(1).    (1)\n\nBi-linear models encode the transformation in a layer of mapping units that get tuned to rotation angles in the invariant subspaces of the transformation class [14]. We shall focus on the gated autoencoder (GAE) in the following, but our description could easily be adapted to other bi-linear models. Formally, the response of a layer of mapping units in the GAE takes the form2\n\nm = \u03c3(W(Ux(1) \u00b7 Vx(2))),    (2)\n\nwhere U, V and W are parameter matrices, \u00b7 denotes elementwise multiplication, and \u03c3 is an elementwise non-linearity, such as the logistic sigmoid.\n\n1We dedicate this paper to the venerable grandmother cell, a grandmother of the grammar cell.\n2We are only using \u201cfactored\u201d [15] bi-linear models in this work, but the framework presented here could be applied to unfactored models, too.\n\n
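As a concrete illustration, the mapping-unit response of Equation (2) can be sketched in NumPy. The layer sizes, random weights, and variable names here are illustrative assumptions, not the configuration used in the paper:

```python
# Minimal sketch of GAE mapping inference, Eq. (2): m = sigma(W(Ux1 . Vx2)).
# All sizes are hypothetical; x1 and x2 stand for two successive frames.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_fac, n_map = 169, 64, 32          # e.g. 13x13 patches; assumed sizes

U = rng.normal(scale=0.1, size=(n_fac, n_in))
V = rng.normal(scale=0.1, size=(n_fac, n_in))
W = rng.normal(scale=0.1, size=(n_map, n_fac))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def infer_mapping(x1, x2):
    """Eq. (2): factor responses are multiplied elementwise, then pooled by W."""
    return sigmoid(W @ ((U @ x1) * (V @ x2)))

x1, x2 = rng.normal(size=n_in), rng.normal(size=n_in)
m = infer_mapping(x1, x2)                 # mapping-unit activations in (0, 1)
```

The multiplicative interaction is what makes the model bi-linear: fixing either frame makes the computation linear in the other.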
Given mapping unit activations, m, and the first observation, x(1), the second observation can be reconstructed using\n\n\u02dcx(2) = V^T(Ux(1) \u00b7 W^Tm),    (3)\n\nwhich amounts to applying the transformation encoded in m to x(1) [16]. As the model is symmetric, the reconstruction of the first observation, given the second, is similarly given by\n\n\u02dcx(1) = U^T(Vx(2) \u00b7 W^Tm).    (4)\n\nFor training one can minimize the symmetric reconstruction error\n\nL = ||x(1) \u2212 \u02dcx(1)||^2 + ||x(2) \u2212 \u02dcx(2)||^2.    (5)\n\nTraining turns the rows of U and V into filter pairs which reside in the invariant subspaces of the transformation class on which the model was trained. After learning, each pair is tuned to a particular rotation angle in the subspace, and the components of m are consequently tuned to subspace rotation angles. Due to the pooling layer, W, they are furthermore independent of the absolute angles in the subspaces [14].\n\n3 Higher-Order Relational Features\n\nAlternatively, one can think of the bilinear model as performing a first-order Taylor approximation of the input sequence, where the hidden representation models the partial first-order derivatives of the inputs with respect to time. If we assume constancy of the first-order derivatives (or higher-order derivatives, as we shall discuss), the complete sequence can be encoded using information about a single frame and the derivatives. This is a very different way of addressing long-range correlations than assuming memory units that explicitly keep state [5]. Instead, here we assume that there is structure in the temporal evolution of the input stream, and we focus on capturing this structure.\nAs an intuitive example, consider a sinusoidal signal with unknown frequency and phase. 
The complete signal can be specified exactly and completely after having seen a few seed frames, making it possible in principle to generate the rest of the signal ad infinitum.\n\n3.1 Learning of Higher-Order Relational Features\n\nThe first-order partial derivative of a multidimensional discrete-time dynamical system describes the correspondences between observations at subsequent time steps. The fact that relational feature learning applied to subsequent frames may be viewed as a way to learn these derivatives suggests modeling higher-order derivatives with another layer of relational features.\nTo this end, we suggest cascading relational features in a \u201cpyramid\u201d as depicted in Figure 1 on the left.3 Given a sequence of inputs x(t\u22122), x(t\u22121), x(t), first-order relational features m1(t\u22121:t) describe the transformations between two subsequent inputs x(t\u22121) and x(t). Second-order relational features m2(t\u22122:t) describe correspondences between two first-order relational features m1(t\u22122:t\u22121) and m1(t\u22121:t), modeling the \u201csecond-order derivatives\u201d of the signal with respect to time.\nTo learn the higher-order features, we can first train a bottom-layer GAE module to represent correspondences between frame pairs using filter matrices U1, V1 and W1 (the subscript index refers to the layer). From the first-layer module we can infer mappings m1(t\u22122:t\u22121) and m1(t\u22121:t) for overlapping input pairs (x(t\u22122), x(t\u22121)) and (x(t\u22121), x(t)), and use these as inputs to a second-layer GAE module. A second GAE can then learn to represent relations between mappings of the first layer using parameters U2, V2 and W2.\nInference of second-order relational features amounts to computing first- and second-order mappings according to\n\nm1(t\u22122:t\u22121) = \u03c3(W1((U1x(t\u22122)) \u00b7 (V1x(t\u22121))))    (6)\nm1(t\u22121:t) = \u03c3(W1((U1x(t\u22121)) \u00b7 (V1x(t))))    (7)\nm2(t\u22122:t) = \u03c3(W2((U2m1(t\u22122:t\u22121)) \u00b7 (V2m1(t\u22121:t)))).    (8)\n\n3Images taken from the NORB data set described in [17].\n\nFigure 1: Left: A two-layer model encodes a sequence by assuming constant \u201cacceleration\u201d. Right: Prediction using first-order relational features.\n\nLike a mixture of experts, a bi-linear model represents a highly non-linear mapping from x(1) to x(2) as a mixture of linear (and thereby possibly orthogonal) transformations. Similar to the LSTM, this facilitates error back-propagation, because orthogonal transformations do not suffer from vanishing/exploding gradient problems. This may be viewed as a way of generalizing the LSTM [5], which uses the identity matrix as the orthogonal transformation. \u201cGrammar units\u201d, in contrast, try to model long-term structure that is dynamic and compositional rather than remembering a fixed value.\nCascading GAE modules in this way can also be motivated from the view of orthogonal transformations as subspace rotations: summing over filter-response products can yield transformation detectors which are sensitive to relative angles (phases in the case of translations) and invariant to the absolute angles [14]. 
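The layerwise inference of Equations (6)-(8) is the same mapping computation applied twice: once to overlapping frame pairs, and once to the resulting first-layer mappings. A NumPy sketch, with all sizes and the random (untrained) weights as illustrative assumptions:

```python
# Sketch of two-layer inference (Eqs. 6-8). First-layer mappings are computed
# for the two overlapping frame pairs; a second-layer mapping relates them.
# All sizes and weights are hypothetical stand-ins, not trained parameters.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_fac1, n_map1 = 100, 64, 32
n_fac2, n_map2 = 32, 16

U1 = rng.normal(scale=0.1, size=(n_fac1, n_in))
V1 = rng.normal(scale=0.1, size=(n_fac1, n_in))
W1 = rng.normal(scale=0.1, size=(n_map1, n_fac1))
U2 = rng.normal(scale=0.1, size=(n_fac2, n_map1))
V2 = rng.normal(scale=0.1, size=(n_fac2, n_map1))
W2 = rng.normal(scale=0.1, size=(n_map2, n_fac2))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Three successive frames x(t-2), x(t-1), x(t).
x_tm2, x_tm1, x_t = (rng.normal(size=n_in) for _ in range(3))

m1_a = sigmoid(W1 @ ((U1 @ x_tm2) * (V1 @ x_tm1)))   # Eq. (6): m1(t-2:t-1)
m1_b = sigmoid(W1 @ ((U1 @ x_tm1) * (V1 @ x_t)))     # Eq. (7): m1(t-1:t)
m2 = sigmoid(W2 @ ((U2 @ m1_a) * (V2 @ m1_b)))       # Eq. (8): m2(t-2:t)
```

Note that the second layer never sees frame content directly, only the first-layer mappings, which is what restricts it to relational ("grammar") structure.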
The relative rotation angle (or phase delta) between two projections is itself an angle, and the relation between two such angles represents an \u201cangular acceleration\u201d that can be picked up by another layer.\nIn contrast to a single-layer, two-frame model, the reconstruction error is no longer directly applicable (although a naive way to train the model would be to minimize the reconstruction error for each pair of adjacent nodes in each layer). However, a natural way of training the model on sequential data is to replace the reconstruction task with the objective of predicting future frames, as we discuss next.\n\n4 Predictive Training\n\n4.1 Single-Step Prediction\n\nIn the GAE model, given two frames x(1) and x(2), one can compute a prediction of the third frame by first inferring mappings m(1,2) from x(1) and x(2) (see Equation 2) and using these to compute a prediction \u02c6x(3) by applying the inferred transformation m(1,2) to frame x(2):\n\n\u02c6x(3) = V^T(Ux(2) \u00b7 W^Tm(1,2)).    (9)\n\nSee Figure 1 (right side) for an outline of the prediction scheme. The prediction of x(3) is a good prediction under the assumption that the frame-to-frame transformations from x(1) to x(2) and from x(2) to x(3) are approximately the same, in other words, if transformations themselves are assumed to be approximately constant in time. 
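This prediction scheme (Eq. 9), iterated by appending each prediction to the sequence, can be sketched as follows. The weights are untrained random stand-ins and the frame size is an illustrative assumption, so the loop only demonstrates the computation, not trained behavior:

```python
# Sketch of single-step prediction (Eq. 9) and its iteration: infer the mapping
# from the last two frames, apply it to the most recent frame, append, repeat.
# Sizes and (untrained) weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_in, n_fac, n_map = 50, 32, 16
U = rng.normal(scale=0.1, size=(n_fac, n_in))
V = rng.normal(scale=0.1, size=(n_fac, n_in))
W = rng.normal(scale=0.1, size=(n_map, n_fac))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_next(x_prev, x_cur):
    m = sigmoid(W @ ((U @ x_prev) * (V @ x_cur)))  # Eq. (2) on the last pair
    return V.T @ ((U @ x_cur) * (W.T @ m))         # Eq. (9): apply m to x_cur

def predict_k_steps(seed_frames, k):
    frames = list(seed_frames)
    for _ in range(k):
        frames.append(predict_next(frames[-2], frames[-1]))
    return frames

seed = [rng.normal(size=n_in) for _ in range(2)]
seq = predict_k_steps(seed, k=3)   # 2 seed frames plus 3 predicted frames
```

Training would minimize the squared error between predicted and observed frames, back-propagating through this loop with BPTT; that part is omitted here.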
We shall show later how to relax the assumption of constancy of the transformation by adding layers to the model.\nThe training criterion for this predictive gating pyramid (PGP) is the prediction error\n\nL = ||\u02c6x(3) \u2212 x(3)||^2_2.    (10)\n\nBesides allowing us to apply bilinear models to sequences, this training objective, in contrast to the reconstruction objective, can guide the mapping representation to be invariant to the content of each frame, because encoding the content of x(2) will not help predicting x(3) well.\n\n4.2 Multi-Step Prediction and Non-Constant Transformations\n\nWe can iterate the inference-prediction process in order to look ahead more than one frame in time. To compute a prediction \u02c6x(4) with the PGP, for example, we can infer the mappings and prediction:\n\nm(2:3) = \u03c3(W(Ux(2) \u00b7 V\u02c6x(3))),    \u02c6x(4) = V^T(U\u02c6x(3) \u00b7 W^Tm(2:3)).    (11)\n\nFigure 2: Left: Prediction with a 2-layer PGP. Right: Multi-step prediction with a 3-layer PGP.\n\nThen mappings can be inferred again from \u02c6x(3) and \u02c6x(4) to compute a prediction of \u02c6x(5), and so on.\nWhen the assumption of constancy of the transformations is violated, one can use an additional layer to model how transformations themselves change over time, as described in Section 3. The assumption behind the two-layer PGP is that the second-order relational structure in the sequence is constant. 
Under this assumption, we compute a prediction \u02c6x(t+1) in two steps after inferring m2(t\u22122:t) according to Equation 8: First, first-order relational features describing the correspondence between x(t) and x(t+1) are inferred top-down as\n\n\u02c6m1(t:t+1) = V2^T(U2m1(t\u22121:t) \u00b7 W2^Tm2(t\u22122:t)),    (12)\n\nfrom which we can compute \u02c6x(t+1) as\n\n\u02c6x(t+1) = V1^T(U1x(t) \u00b7 W1^T\u02c6m1(t:t+1)).    (13)\n\nSee Figure 2 (left side) for an illustration of the two-layer prediction scheme. To predict multiple steps ahead we repeat the inference-prediction process on x(t\u22121), x(t) and \u02c6x(t+1), i.e. by appending the prediction to the sequence and increasing t by one.\nAs outlined in Figure 2 (right side), the concept can be generalized to more than two layers by recursion to yield higher-order relational features. Weights can be shared across layers, but we used untied weights in our experiments.\nTo summarize, the prediction process consists in iteratively computing predictions of the next lower level\u2019s activations, beginning from the top. To infer the top-level activations themselves, one needs a number of seed frames corresponding to the depth of the model. The models can be trained using BPTT to compute gradients of the k-step prediction error (the sum of prediction errors) with respect to the parameters. We observed that starting with few prediction steps and iteratively increasing the number of prediction steps as training progresses considerably stabilizes the learning.\n\n5 Experiments\n\nWe tested and compared the models on sequences and videos with varying degrees of complexity, from synthetic constant to synthetic accelerated transformations to more complex real-world transformations. 
A description of the synthetic shift and rotation data sets is provided in the supplementary material.\n\n5.1 Preprocessing and Initialization\n\nFor all data sets, except for chirps and bouncing balls, PCA whitening was used for dimensionality reduction, retaining around 95% of the variance. The chirps data was normalized by subtracting the mean and dividing by the standard deviation of the training set. For the multi-layer models we used greedy layerwise pretraining before predictive training. We found pretraining to be crucial for the predictive training to work well. Each layer was pretrained using a simple GAE, the first layer on input frames, the next layer on the inferred mappings. Stochastic gradient descent (SGD) with learning rate 0.001 and momentum 0.9 was used for all pretraining.\n\nTable 1: Classification accuracies (%) on accelerated transformation data using mappings from different layers in the PGP (accuracies after pretraining shown in parentheses).\n\nData set | m1(1:2) | m1(2:3) | (m1(1:2), m1(2:3)) | m2(1:3)\nACCROT | 18.1 (19.4) | 29.3 (30.9) | 74.0 (64.9) | 74.4 (53.7)\nACCSHIFT | 20.9 (20.6) | 34.4 (33.3) | 42.7 (38.4) | 80.6 (63.4)\n\n5.2 Comparison of Predictive and Reconstructive Training\n\nTo evaluate whether predictive training (PGP) yields better representations of transformations than training with a reconstruction objective (GAE), we first performed a classification experiment on videos showing artificially transformed natural images. 13 \u00d7 13 patches were cropped from the Berkeley Segmentation data set (BSDS300) [18]. Two data sets with videos featuring constant-velocity shifts (CONSTSHIFT) and rotations (CONSTROT) were generated. 
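The PCA whitening step described in Section 5.1 (dimensionality reduction retaining around 95% of the variance) can be sketched as follows on synthetic data; everything apart from the 95% threshold is an illustrative assumption:

```python
# Sketch of PCA whitening with dimensionality reduction, as used for
# preprocessing: project onto the leading principal components that retain
# ~95% of the variance and rescale each component to unit variance.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 169))            # rows are (hypothetical) patches
X = X - X.mean(axis=0)                      # center the data

cov = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

frac = np.cumsum(eigvals) / np.sum(eigvals)
k = int(np.searchsorted(frac, 0.95) + 1)    # components for ~95% variance

# Whitening: rotate onto the components and divide by sqrt of the eigenvalues.
Z = (X @ eigvecs[:, :k]) / np.sqrt(eigvals[:k])
```

The whitened coordinates then have (approximately) identity covariance, which puts all retained directions on an equal footing for the gated models.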
The shift vectors (for CONSTSHIFT) and rotation angles (for CONSTROT) were each grouped into 8 bins to generate labels for classification.\nThe numbers of filter pairs and mapping units were chosen using a grid search. The setting with the best performance on the validation set was 256 filters and 256 mapping units for both training objectives on both data sets. The models were each trained for 1,000 epochs using SGD with learning rate 0.001 and momentum 0.9. Mappings of the first two inputs were used as input to a logistic regression classifier. The experiment was performed three times on both data sets. The mean accuracy (%) on CONSTSHIFT after predictive training was 79.4, compared to 76.4 after reconstructive training. For CONSTROT, mean accuracies were 98.2 after predictive and 97.6 after reconstructive training. This confirms that predictive training yields a more explicit representation of transformations that is less dependent on image content, as discussed in Section 4.1.\n\n5.3 Detecting Acceleration\n\nTo test the hypothesis that the PGP learns to model second-order correspondences in sequences, image sequences with accelerated shifts (ACCSHIFT) and rotations (ACCROT) of natural image patches were generated. The acceleration vectors (for ACCSHIFT) and angular rotations (for ACCROT) were each grouped into 8 bins to generate output labels for classification.\nNumbers of filter pairs and mapping units were set to 512 and 256, respectively, after performing a grid search. After pretraining, the PGP was trained using SGD with learning rate 0.0001 and momentum 0.9, for 400 epochs on single-step prediction and then 500 epochs on two-step prediction. After training, first- and second-layer mappings were inferred from the first three frames of the test sequences. 
The classification accuracies using logistic regression with second-layer mappings of the PGP (m2(1:3)), with individual first-layer mappings (m1(1:2) and m1(2:3)), and with their concatenation (m1(1:2), m1(2:3)) as classifier inputs are compared in Table 1 for both data sets (before and after predictive finetuning). The second-layer mappings achieved a significantly higher accuracy for both data sets after predictive training. For ACCROT, the concatenation of first-layer mappings performs almost as well as the second-layer mappings, which may be because rotations have fewer degrees of freedom than shifts, making them easier to model. Note that the accuracy for the first-layer mappings also improved with predictive finetuning.\nThese results show that the PGP can learn a better representation of the second-order relational structure in the data than the single-layer model. They further show that predictive training improves the performance of both models and is crucial for the PGP.\n\n5.4 Sequence Prediction\n\nIn these experiments we test the capability of the models to predict previously unseen sequences multiple steps into the future. This allows us to assess to what degree modeling higher-order \u201cderivatives\u201d makes it possible to capture the temporal evolution of a signal without resorting to an explicit representation of a hidden state.\n\nFigure 3: Multi-step predictions by the PGP trained on accelerated rotations (left) and shifts (right). From top to bottom: ground truth, predictions before and after predictive finetuning.\n\nFigure 4: Left: Chirp signal and the predictions of the CRBM, RNN and PGP after seeing the first five 10-frame vectors. Right: The MSE of the three models for each step.\n\n
Unless mentioned otherwise, the presented sequences were seeded with frames from test data (not seen during training).\nAccelerated Transformations\nFigure 3 shows predictions with the PGP on the data sets introduced in Section 5.3 after different stages of training. As can be seen in the figures, the prediction accuracy increases significantly with multi-step training.\nChirps\nThe performance of the PGP was compared with that of a standard RNN (trained with BPTT) and a CRBM (trained with contrastive divergence) [19] on a data set containing chirps (sinusoidal waves that increase or decrease in frequency over time). Training and test set each contain 20,000 sequences. The 160 frames of each sequence are grouped into 16 non-overlapping 10-frame windows, yielding 10-dimensional input vectors. Given the first 5 windows, the remaining 11 windows have to be predicted. Second-order mappings of the PGP are averaged for the seed windows and then held fixed for prediction. Predictions for one test sequence are shown in Figure 4 (left). Mean-squared errors (MSE) on the test set are 1.159 for the RNN, 1.624 for the CRBM and 0.323 for the PGP. A plot of per-step MSEs is shown in Figure 4 (right).\nNorbVideos\nThe NORBvideos data set introduced in [20] contains videos of objects from the NORB dataset [17]. The 5-frame videos each show incrementally changed viewpoints of one object. One- and two-hidden-layer PGP models were trained on this data using the author\u2019s original split. Both models used 2000 features and 1000 mapping units (per layer). The performance of the one-hidden-layer model stopped improving at 2000 features, while the two-hidden-layer model was able to make use of the additional parameters. Two-step MSEs on test data were 448.4 and 582.1, respectively.\nFigure 6 shows predictions made by both models. The second-order PGP generates predictions that reflect the 3-D structure in the data. 
In contrast to the first-order PGP, it is able to extrapolate the observed transformations.\nBouncing Balls\nThe PGP is also able to capture the highly non-linear dynamics in the bouncing balls data set.4 The sequence shown in Figure 5 contains 56 frames, where the first 5 are from the training sequences and are used as seed for sequence generation (as in the chirps experiment, the average top-layer mapping vector for the seed frames is fixed). Note that the sequences used for training were only 20 frames long. The model\u2019s predictions look qualitatively better than most published generated sequences.5 Further results and data can be found on the project website at http://www.ccc.cs.uni-frankfurt.de/people/vincent-michalski/grammar-cells\n\n4The training and test sequences were generated using the script released with [21].\n\nFigure 5: PGP generated sequence of bouncing balls (left-to-right, top-to-bottom).\n\nFigure 6: Two-step PGP test predictions on NORBvideos.\n\n6 Discussion\n\nA major long-standing problem in sequence modeling is dealing with long-range correlations. It has been proposed that deep learning may help address this problem by finding representations that better capture the abstract, semantic content of the inputs [22]. In this work we propose learning representations with the explicit goal of enabling the prediction of the temporal evolution of the input stream multiple time steps ahead. 
Thus we seek a hidden representation that captures those aspects of the input data which allow us to make predictions about the future.\nAs we discussed, learning the long-term evolution of a sequence can be simplified by modeling it as a sequence of temporally varying orthogonal (and thus, in particular, linear) transformations. Since gating networks are like mixtures of experts, the PGP does model its input using a sequence of linear transformations in the lowest layer; it is thus \u201chorizontally linear\u201d. At the same time, it is \u201cvertically compressive\u201d, because its sigmoidal units are encouraged to compute non-linear, sparse representations, like the hidden units in any standard feed-forward neural network. From an optimization perspective this is a very sensible way to model time series, since gradients have to be back-propagated through many more layers horizontally (in time) than vertically (through the non-linear network).\nIt is interesting to note that predictive training can also be viewed as an analogy-making task [15]. It amounts to relating the transformation from frame t \u2212 1 to t with the transformation between a later pair of observations, e.g. those at time t and t + 1. The difference is that in a genuine analogy-making task, the target observation may be unrelated to the source observation pair, whereas here target and source are related. 
It would be interesting to apply the model to word representations, or language in general, as this is a domain where both sequentially structured data and analogical relationships play central roles.\n\nAcknowledgments\n\nThis work was supported by the German Federal Ministry of Education and Research (BMBF) in project 01GQ0841 (BFNT Frankfurt), by an NSERC Discovery grant and by a Google faculty research award.\n\n5Compare with http://www.cs.utoronto.ca/~ilya/pubs/2007/multilayered/index.html and http://www.cs.utoronto.ca/~ilya/pubs/2008/rtrbm_vid.tar.gz.\n\nReferences\n[1] R. Memisevic and G. E. Hinton. Unsupervised learning of image transformations. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n[2] B. A. Olshausen, C. Cadieu, J. Culpepper, and D. K. Warland. Bilinear models of natural images. 2007.\n[3] G. W. Taylor, G. E. Hinton, and S. T. Roweis. Two distributed-state models for generating high-dimensional time series. The Journal of Machine Learning Research, 12:1025\u20131068, 2011.\n[4] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4):339\u2013356, 1988.\n[5] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[6] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut f\u00fcr Informatik, Lehrstuhl Prof. Brauer, Technische Universit\u00e4t M\u00fcnchen, 1991.\n[7] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 2011 International Conference on Machine Learning, 2011.\n[8] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234\u2013242, 1992.\n[9] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. 
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.\n[10] J. Luttinen, T. Raiko, and A. Ilin. Linear state-space model with time-varying dynamics. In Machine Learning and Knowledge Discovery in Databases, pages 338\u2013353. Springer, 2014.\n[11] V. Michalski. Neural networks for motion understanding. Diploma thesis, Goethe-Universit\u00e4t Frankfurt, Frankfurt, Germany, 2013.\n[12] P. F\u00f6ldi\u00e1k. Learning invariance from transformation sequences. Neural Computation, 3(2):194\u2013200, 1991.\n[13] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715\u2013770, 2002.\n[14] R. Memisevic. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829\u20131846, 2013.\n[15] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473\u20131492, 2010.\n[16] R. Memisevic. Gradient-based learning of higher-order image features. In 2011 IEEE International Conference on Computer Vision, pages 1591\u20131598. IEEE, 2011.\n[17] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition, 2004.\n[18] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, volume 2, pages 416\u2013423, July 2001.\n[19] G. W. Taylor, G. E. Hinton, and S. T. Roweis. Modeling human motion using binary latent variables. 
In Advances in Neural Information Processing Systems 20, pages 1345\u20131352, 2007.\n[20] R. Memisevic and G. Exarchakis. Learning invariant features by harnessing the aperture problem. In Proceedings of the 30th International Conference on Machine Learning, 2013.\n[21] I. Sutskever, G. E. Hinton, and G. W. Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems 21, pages 1601\u20131608, 2008.\n[22] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1\u2013127, 2009. Also published as a book, Now Publishers, 2009.\n", "award": [], "sourceid": 1062, "authors": [{"given_name": "Vincent", "family_name": "Michalski", "institution": "Goethe University Frankfurt"}, {"given_name": "Roland", "family_name": "Memisevic", "institution": "University of Montreal"}, {"given_name": "Kishore", "family_name": "Konda", "institution": "Goethe University Frankfurt"}]}