{"title": "Learning to Linearize Under Uncertainty", "book": "Advances in Neural Information Processing Systems", "page_first": 1234, "page_last": 1242, "abstract": "Training deep feature hierarchies to solve supervised learning tasks has achieved state-of-the-art performance on many problems in computer vision. However, a principled way in which to train such hierarchies in the unsupervised setting has remained elusive. In this work we suggest a new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences. This is done by training a generative model to predict video frames. We also address the problem of inherent uncertainty in prediction by introducing latent variables that are non-deterministic functions of the input into the network architecture.", "full_text": "Learning to Linearize Under Uncertainty\n\nRoss Goroshin\u22171 Michael Mathieu\u22171 Yann LeCun1,2\n\n1Dept. of Computer Science, Courant Institute of Mathematical Sciences, New York, NY\n\n2Facebook AI Research, New York, NY\n\n{goroshin,mathieu,yann}@cs.nyu.edu\n\nAbstract\n\nTraining deep feature hierarchies to solve supervised learning tasks has achieved state-of-the-art performance on many problems in computer vision. However, a principled way in which to train such hierarchies in the unsupervised setting has remained elusive. In this work we suggest a new architecture and loss for training deep feature hierarchies that linearize the transformations observed in unlabeled natural video sequences. This is done by training a generative model to predict video frames.
We also address the problem of inherent uncertainty in prediction by introducing latent variables that are non-deterministic functions of the input into the network architecture.\n\n1 Introduction\n\nThe recent success of deep feature learning in the supervised setting has inspired renewed interest in feature learning in weakly supervised and unsupervised settings. Recent findings in computer vision problems have shown that the representations learned for one task can be readily transferred to others [10], which naturally leads to the question: does there exist a generically useful feature representation, and if so, what principles can be exploited to learn it?\n\nRecently there has been a flurry of work on learning features from video using varying degrees of supervision [14][12][13]. Temporal coherence in video can be considered a form of weak supervision that can be exploited for feature learning. More precisely, if we assume that data occupies some low dimensional \u201cmanifold\u201d in a high dimensional space, then videos can be considered one-dimensional trajectories on this manifold parametrized by time. Many unsupervised learning algorithms can be viewed as various parameterizations (implicit or explicit) of the data manifold [1]. For instance, sparse coding implicitly assumes a locally linear model of the data manifold [9]. In this work, we assume that deep convolutional networks are good parametric models for natural data. Parameterizations of the data manifold can be learned by training these networks to linearize short temporal trajectories, thereby implicitly learning a local parametrization.\n\nIn this work we cast the linearization objective as a frame prediction problem. As in many other unsupervised learning schemes, this necessitates a generative model. Several recent works have also trained deep networks for the task of frame prediction [12][14][13].
However, unlike other works that focus on prediction as a final objective, in this work prediction is regarded as a proxy for learning representations. We introduce a loss and architecture that address two main problems in frame prediction: (1) minimizing L2 error between the predicted and actual frame leads to unrealistically blurry predictions, which potentially compromises the learned representation, and (2) copying the most recent frame to the input seems to be a hard-to-escape trap of the objective function, which results in the network learning little more than the identity function. We argue that the source of blur partially stems from the inherent unpredictability of natural data; in cases where multiple valid predictions are plausible, a deterministic network will learn to average between all the plausible predictions. To address the first problem we introduce a set of latent variables that are non-deterministic functions of the input, which are used to explain the unpredictable aspects of natural videos. The second problem is addressed by introducing an architecture that explicitly formulates the prediction in the linearized feature space.\n\n\u2217Equal contribution\n\nThe paper is organized as follows. Section 2 reviews relevant prior work. Section 3 introduces the basic architecture used for learning linearized representations. Subsection 3.1 introduces \u201cphase-pooling\u201d\u2013an operator that facilitates linearization by inducing a topology on the feature space. Subsection 3.2 introduces a latent variable formulation as a means of learning to linearize under uncertainty. Section 4 presents experimental results on relatively simple datasets to illustrate the main ideas of our work. Finally, Section 5 offers directions for future research.\n\n2 Prior Work\n\nThis work was heavily inspired by the philosophy revived by Hinton et al. [5], which introduced \u201ccapsule\u201d units.
In that work, an equivariant representation is learned by the capsules when the true latent states are provided to the network as implicit targets. Our work allows us to move to a more unsupervised setting in which the true latent states are not only unknown, but represent completely arbitrary qualities. This was made possible with two assumptions: (1) that temporally adjacent samples also correspond to neighbors in the latent space, and (2) that predictions of future samples can be formulated as linear operations in the latent space. In theory, the representation learned by our method is very similar to the representation learned by the \u201ccapsules\u201d; this representation has a locally stable \u201cwhat\u201d component and a locally linear, or equivariant, \u201cwhere\u201d component. Theoretical properties of linearizing features were studied in [3].\n\nSeveral recent works propose schemes for learning representations from video which use varying degrees of supervision [12][14][13][4]. For instance, [13] assumes that the pre-trained network from [7] is already available and training consists of learning to mimic this network. Similarly, [14] learns a representation by receiving supervision from a tracker. This work is more closely related to fully unsupervised approaches for learning representations from video such as [4][6][2][15][8]. It is most related to [12], which also trains a decoder to explicitly predict video frames. Our proposed architecture was inspired by those presented in [11] and [16].\n\n3 Learning Linearized Representations\n\nOur goal is to obtain a representation of each input sequence that varies linearly in time by transforming each frame individually. Furthermore, we assume that this transformation can be learned by a deep, feed-forward network referred to as the encoder, denoted by the function FW. Denote the code for frame xt by zt = FW(xt).
Assume that the dataset is parameterized by a temporal index t, so it is described by the sequence X = {..., xt\u22121, xt, xt+1, ...} with a corresponding feature sequence produced by the encoder Z = {..., zt\u22121, zt, zt+1, ...}. Our goal is thus to train FW to produce a sequence Z whose average local curvature is smaller than that of the sequence X. A scale invariant local measure of curvature is the cosine distance between the two vectors formed by three temporally adjacent samples. However, minimizing the curvature directly can result in the trivial solutions zt = ct \u2200t and zt = c \u2200t. These solutions are trivial because they are virtually uninformative with respect to the input xt and therefore cannot be a meaningful representation of the input. To avoid this solution, we also minimize the prediction error in the input space. The predicted frame is generated in two steps: (i) linear extrapolation in code space to obtain a predicted code \u02c6zt+1 = a[zt zt\u22121]T, followed by (ii) decoding with GW, which generates the predicted frame \u02c6xt+1 = GW(\u02c6zt+1). For example, if a = [2, \u22121] the predicted code \u02c6zt+1 corresponds to a constant speed linear extrapolation of zt and zt\u22121. The L2 prediction error is minimized by jointly training the encoder and decoder networks. Note that minimizing prediction error alone will not necessarily lead to low curvature trajectories in Z since the decoder is unconstrained; the decoder may learn a many-to-one mapping which maps different codes to the same output image without forcing them to be equal. To prevent this, we add an explicit curvature penalty to the loss, corresponding to the cosine distance between (zt \u2212 zt\u22121) and (zt+1 \u2212 zt).
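This two-step prediction and the curvature penalty can be sketched numerically as follows; this is a minimal NumPy illustration of the two loss terms with identity stand-ins for the encoder and decoder (the function names and toy frames are ours, not the paper's networks):

```python
import numpy as np

def cosine_curvature(z_prev, z_cur, z_next):
    # Scale-invariant local curvature: cosine between (zt - zt-1) and (zt+1 - zt).
    d1, d2 = z_cur - z_prev, z_next - z_cur
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

def linearization_loss(encode, decode, x_prev, x_cur, x_next, lam=0.1, a=(2.0, -1.0)):
    # Two-step prediction: (i) extrapolate in code space, (ii) decode to pixel space.
    z_prev, z_cur, z_next = encode(x_prev), encode(x_cur), encode(x_next)
    z_hat = a[0] * z_cur + a[1] * z_prev        # constant-speed extrapolation
    x_hat = decode(z_hat)
    l2 = 0.5 * np.sum((x_hat - x_next) ** 2)    # prediction error term
    return l2 - lam * cosine_curvature(z_prev, z_cur, z_next)

# Toy check: with identity encoder/decoder and a perfectly linear trajectory,
# the prediction error vanishes and the curvature term saturates at 1.
identity = lambda v: v
frames = [np.array([0.0, 1.0]), np.array([1.0, 2.0]), np.array([2.0, 3.0])]
loss = linearization_loss(identity, identity, *frames, lam=0.1)  # -> -0.1
```

In the full model the encoder and decoder are deep networks trained jointly; the identity maps here only make the two terms of the objective easy to inspect.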
The complete loss to minimize is:\n\nL = \frac{1}{2} \left\| G_W\left(a[z_t\ z_{t-1}]^T\right) - x_{t+1} \right\|_2^2 - \lambda \frac{(z_t - z_{t-1})^T (z_{t+1} - z_t)}{\|z_t - z_{t-1}\| \, \|z_{t+1} - z_t\|}    (1)\n\nFigure 1: (a) A video generated by translating a Gaussian intensity bump over a three pixel array (x, y, z); (b) the corresponding manifold parametrized by time in three dimensional space.\n\nFigure 2: The basic linear prediction architecture with shared weight encoders.\n\nThis feature learning scheme can be implemented using an autoencoder-like network with shared encoder weights.\n\n3.1 Phase Pooling\n\nThus far we have assumed a generic architecture for FW and GW. We now consider custom architectures and operators that are particularly suitable for the task of linearization. To motivate the definition of these operators, consider a video generated by translating a Gaussian \u201cintensity bump\u201d over a three pixel region at constant speed. The video corresponds to a one dimensional manifold in three dimensional space, i.e. a curve parameterized by time (see Figure 1). Next, assume that some convolutional feature detector fires only when centered on the bump. Applying the max-pooling operator to the activations of the detector in this three-pixel region signifies the presence of the feature somewhere in this region (i.e. the \u201cwhat\u201d). Applying the argmax operator over the region returns the position (i.e. the \u201cwhere\u201d) with respect to some local coordinate frame defined over the pooling region. This position variable varies linearly as the bump translates, and thus parameterizes the curve in Figure 1b.
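The three-pixel example above can be reproduced in a few lines (a NumPy sketch under our own choices of bump width and sampling rate):

```python
import numpy as np

def bump_frame(center, positions=np.array([0.0, 1.0, 2.0]), sigma=0.5):
    # One frame of the "three-pixel video": a Gaussian intensity bump
    # sampled at the three pixel positions as it translates.
    return np.exp(-((positions - center) ** 2) / (2.0 * sigma ** 2))

# Translate the bump at constant speed across the array.
centers = np.linspace(0.0, 2.0, 9)
video = np.stack([bump_frame(c) for c in centers])   # shape (9, 3)

# max over the region stays high while the bump is present ("what");
# argmax over the region tracks the bump's position ("where").
what = video.max(axis=1)
where = video.argmax(axis=1)   # non-decreasing as the bump moves right
```

With only three pixels the argmax is coarsely quantized; the soft operators introduced below recover the smooth, linearly varying position.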
These two channels, namely the what and the where, can also be regarded as generalized magnitude m and phase p, corresponding to a factorized representation: the magnitude represents the active set of parameters, while the phase represents the set of local coordinates in this active set. We refer to the operator that outputs both the max and argmax channels as the \u201cphase-pooling\u201d operator.\n\nIn this example, spatial pooling was used to linearize the translation of a fixed feature. More generally, the phase-pooling operator can locally linearize arbitrary transformations if pooling is performed not only spatially, but also across features in some topology.\n\nIn order to be able to back-propagate through p, we define a soft version of the max and argmax operators within each pool group. For simplicity, assume that the encoder has a fully convolutional architecture which outputs a set of feature maps, possibly of a different resolution than the input. Although we can define an arbitrary topology in feature space, for now assume that we have the familiar three-dimensional spatial feature map representation where each activation is a function z(f, x, y), where x and y correspond to the spatial location, and f is the feature map index. Assuming that the feature activations are positive, we define our soft \u201cmax-pooling\u201d operator for the kth neighborhood Nk as:\n\nm_k = \sum_{N_k} z(f, x, y) \frac{e^{\beta z(f, x, y)}}{\sum_{N_k} e^{\beta z(f', x', y')}} \approx \max_{N_k} z(f, x, y),    (2)\n\nwhere \u03b2 \u2265 0. Note that the fraction in the sum is a softmax operation (parametrized by \u03b2), which is positive and sums to one in each pooling region.
The larger the \u03b2, the closer the softmax is to a unimodal distribution and therefore the better mk approximates the max operation. On the other hand, if \u03b2 = 0, Equation 2 reduces to average-pooling. Finally, note that mk is simply the expected value of z (restricted to Nk) under the softmax distribution.\n\nAssuming that the activation pattern within each neighborhood is approximately unimodal, we can define a soft version of the argmax operator. The vector pk approximates the local coordinates in the feature topology at which the max activation value occurred. Assuming that pooling is done volumetrically, that is, spatially and across features, pk will have three components. In general, the number of components in pk is equal to the dimension of the topology of our feature space induced by the pooling neighborhood. The dimensionality of pk can also be interpreted as the maximal intrinsic dimension of the data. If we define a local standard coordinate system in each pooling volume to be bounded between -1 and +1, the soft \u201cargmax-pooling\u201d operator is defined by the vector-valued sum:\n\np_k = \sum_{N_k} \begin{bmatrix} f \\ x \\ y \end{bmatrix} \frac{e^{\beta z(f, x, y)}}{\sum_{N_k} e^{\beta z(f', x', y')}} \approx \arg\max_{N_k} z(f, x, y),    (3)\n\nwhere the indices f, x, y take values from -1 to 1 in equal increments over the pooling region. Again, we observe that pk is simply the expected value of [f x y]T under the softmax distribution.\n\nThe phase-pooling operator acts on the output of the encoder, therefore it can simply be considered as the last encoding step.
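For a single pooling neighborhood, the soft max- and argmax-pooling operators of Equations 2 and 3 can be sketched as follows (a NumPy illustration; `phase_pool` and its argument layout are our own hypothetical names, with the neighborhood flattened to a vector of activations plus a coordinate grid):

```python
import numpy as np

def phase_pool(z, coords, beta=10.0):
    # Soft "what"/"where" pooling over one neighborhood (Equations 2 and 3).
    # z:      positive activations in the neighborhood, shape (n,)
    # coords: local coordinates per activation, shape (n, d), each in [-1, 1]
    w = np.exp(beta * z)
    w = w / w.sum()          # softmax weights: positive, sum to one
    m = float(w @ z)         # soft max     -> magnitude channel
    p = w @ coords           # soft argmax  -> phase channel
    return m, p

# One-dimensional pool of three activations at coordinates -1, 0, +1;
# with a large beta the result approaches the hard max/argmax.
z = np.array([0.1, 0.2, 3.0])
coords = np.array([[-1.0], [0.0], [1.0]])
m, p = phase_pool(z, coords, beta=10.0)   # m close to 3.0, p close to [1.0]
```

Setting beta = 0 makes the weights uniform, so the same code reduces to average-pooling, matching the remark after Equation 2.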
Correspondingly, we define an \u201cun-pooling\u201d operation as the first step of the decoder, which produces reconstructed activation maps by placing the magnitudes m at the appropriate locations given by the phases p.\n\nBecause the phase-pooling operator produces both magnitude and phase signals for each of the two input frames, it remains to define the predicted magnitude and phase of the third frame. In general, this linear extrapolation operator can be learned; however, \u201chard-coding\u201d this operator allows us to place implicit priors on the magnitude and phase channels. The predicted magnitude and phase are defined as follows:\n\nm_{t+1} = \frac{m_t + m_{t-1}}{2}    (4)\n\np_{t+1} = 2p_t - p_{t-1}    (5)\n\nPredicting the magnitude as the mean of the past imposes an implicit stability prior on m, i.e. the temporal sequence corresponding to the m channel should be stable between adjacent frames. The linear extrapolation of the phase variable imposes an implicit linear prior on p. Thus such an architecture produces a factorized representation composed of a locally stable m and a locally linearly varying p. When phase-pooling is used, curvature regularization is only applied to the p variables. The full prediction architecture is shown in Figure 2.\n\n3.2 Addressing Uncertainty\n\nNatural video can be inherently unpredictable; objects enter and leave the field of view, and out of plane rotations can also introduce previously invisible content. In this case, the prediction should correspond to the most likely outcome that can be learned by training on similar video. However, if multiple outcomes are present in the training set then minimizing the L2 distance to these multiple outcomes induces the network to predict the average outcome.
In practice, this phenomenon results in blurry predictions and may lead the encoder to learn a less discriminative representation of the input. To address this inherent unpredictability we introduce latent variables \u03b4 to the prediction architecture that are not deterministic functions of the input. These variables can be adjusted using the target xt+1 in order to minimize the prediction L2 error. The interpretation of these variables is that they explain all aspects of the prediction that are not captured by the encoder. For example, \u03b4 can be used to switch between multiple, equally likely predictions. It is important to control the capacity of \u03b4 to prevent it from explaining the entire prediction on its own. Therefore \u03b4 is restricted to act only as a correction term in the code space output by the encoder. To further restrict the capacity of \u03b4 we enforce that dim(\u03b4) \u226a dim(z). More specifically, the \u03b4-corrected code is defined as:\n\n\hat{z}_{t+1}^{\delta} = z_t + (W_1 \delta) \odot a[z_t\ z_{t-1}]^T    (6)\n\nwhere W1 is a trainable matrix of size dim(\u03b4) \u00d7 dim(z), and \u2299 denotes the component-wise product. During training, \u03b4 is inferred (using gradient descent) for each training sample by minimizing the loss in Equation 7. The corresponding adjusted code \u02c6zt+1 is then used for back-propagation through W and W1. At test time \u03b4 can be selected via sampling, assuming its distribution on the training set has been previously estimated.\n\nL = \min_{\delta} \| G_W(\hat{z}_{t+1}^{\delta}) - x_{t+1} \|_2^2 - \lambda \frac{(z_t - z_{t-1})^T (z_{t+1} - z_t)}{\|z_t - z_{t-1}\| \, \|z_{t+1} - z_t\|}    (7)\n\nThe following algorithm details how the above loss is minimized using stochastic gradient descent:\n\nAlgorithm 1 Minibatch stochastic gradient descent training for prediction with uncertainty.
The number of \u03b4-gradient descent steps (k) is treated as a hyper-parameter.\n\nfor number of training epochs do\n  Sample a mini-batch of temporal triplets {xt\u22121, xt, xt+1}\n  Set \u03b40 = 0\n  Forward propagate xt\u22121, xt through the network and obtain the codes zt\u22121, zt and the initial prediction \u02c6xt+1\n  for i = 1 to k do\n    Compute the L2 prediction error\n    Back propagate the error through the decoder to compute the gradient \u2202L/\u2202\u03b4i\u22121\n    Update \u03b4i = \u03b4i\u22121 \u2212 \u03b1 \u2202L/\u2202\u03b4i\u22121\n    Compute \u02c6zt+1 = zt + (W1\u03b4i) \u2299 a[zt zt\u22121]T\n    Compute \u02c6xt+1 = GW(\u02c6zt+1)\n  end for\n  Back propagate the full encoder/predictor loss from Equation 7 using \u03b4k, and update the weight matrices W and W1\nend for\n\nWhen phase pooling is used we allow \u03b4 to affect only the phase variables in Equation 5; this further encourages the magnitude to be stable and places all the uncertainty in the phase.\n\n4 Experiments\n\nThe following experiments evaluate the proposed feature learning architecture and loss. In the first set of experiments we train a shallow architecture on natural data and visualize the learned features in order to gain a basic intuition. In the second set of experiments we train a deep architecture on simulated movies generated from the NORB dataset.
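The inner loop of Algorithm 1, inferring \u03b4 by gradient descent on the L2 prediction error, can be sketched as follows. This is a NumPy toy in which a fixed random linear map D stands in for the decoder GW and random codes stand in for encoder outputs; every shape, value, and name here is an assumption for illustration, not the trained model (since the paper defines W1 as dim(\u03b4) \u00d7 dim(z), we apply its transpose to map \u03b4 into code space):

```python
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_delta, dim_x = 8, 2, 16
D = 0.1 * rng.standard_normal((dim_x, dim_z))   # toy linear "decoder" G_W
W1 = rng.standard_normal((dim_delta, dim_z))    # delta-to-code matrix
z_prev, z_cur = rng.standard_normal(dim_z), rng.standard_normal(dim_z)
x_next = rng.standard_normal(dim_x)             # target frame x_{t+1}

z_lin = 2.0 * z_cur - z_prev                    # a = [2, -1] extrapolation
delta = np.zeros(dim_delta)                     # delta_0 = 0
alpha, k = 0.02, 100                            # step size and inner steps
losses = []
for _ in range(k):
    z_hat = z_lin + (W1.T @ delta) * z_lin      # Equation 6: delta-corrected code
    r = D @ z_hat - x_next                      # prediction residual
    losses.append(0.5 * float(r @ r))
    grad = W1 @ (z_lin * (D.T @ r))             # dL/d(delta) by the chain rule
    delta -= alpha * grad                       # delta_i = delta_{i-1} - alpha*grad
```

The recorded prediction error decreases over the inner steps; in the full model the final \u03b4k is then held fixed while the loss of Equation 7 is back-propagated into W and W1.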
By generating frames from interpolated and extrapolated points in code space we show that a linearized representation of the input is learned. Finally, we explore the role of uncertainty by training on only partially predictable sequences; we show that our latent variable formulation can account for this uncertainty, enabling the encoder to learn a linearized representation even in this setting.\n\n4.1 Shallow Architecture Trained on Natural Data\n\nTo gain an intuition for the features learned by a phase-pooling architecture let us consider an encoder architecture comprised of the following stages: convolutional filter bank, rectifying point-wise nonlinearity, and phase-pooling. The decoder architecture is comprised of an un-pooling stage followed by a convolutional filter bank. This architecture was trained on simulated 32 \u00d7 32 movie frames taken from YouTube videos [4]. Each frame triplet is generated by transforming still frames with a sequence of three rigid transformations (translation, scale, rotation). More specifically, let A be a random rigid transformation parameterized by \u03c4, and let x denote a still image reshaped into a column vector; the generated triplet of frames is given by {f1 = A_{\u03c4=1/3} x, f2 = A_{\u03c4=2/3} x, f3 = A_{\u03c4=1} x}. Two variants of this architecture were trained; their full architectures are summarized in the first two lines of Table 1.\n\nTable 1: Summary of architectures\n\nShallow Architecture 1 \u2013 Encoder: Conv+ReLU 64 \u00d7 9 \u00d7 9, Phase Pool 4. Prediction: Average Mag., Linear Extrap. Phase. Decoder: Conv 64 \u00d7 9 \u00d7 9.\nShallow Architecture 2 \u2013 Encoder: Conv+ReLU 64 \u00d7 9 \u00d7 9, Phase Pool 4 stride 2. Prediction: Average Mag., Linear Extrap. Phase. Decoder: Conv 64 \u00d7 9 \u00d7 9.\nDeep Architecture 1 \u2013 Encoder: Conv+ReLU 16 \u00d7 9 \u00d7 9, Conv+ReLU 32 \u00d7 9 \u00d7 9, FC+ReLU 8192 \u00d7 4096. Prediction: None. Decoder: FC+ReLU 8192 \u00d7 8192, Reshape 32 \u00d7 16 \u00d7 16, SpatialPadding 8 \u00d7 8, Conv+ReLU 16 \u00d7 9 \u00d7 9, SpatialPadding 8 \u00d7 8, Conv 1 \u00d7 9 \u00d7 9.\nDeep Architecture 2 \u2013 Encoder: Conv+ReLU 16 \u00d7 9 \u00d7 9, Conv+ReLU 32 \u00d7 9 \u00d7 9, FC+ReLU 8192 \u00d7 4096. Prediction: Linear Extrapolation. Decoder: FC+ReLU 4096 \u00d7 8192, Reshape 32 \u00d7 16 \u00d7 16, SpatialPadding 8 \u00d7 8, Conv+ReLU 16 \u00d7 9 \u00d7 9, SpatialPadding 8 \u00d7 8, Conv 1 \u00d7 9 \u00d7 9.\nDeep Architecture 3 \u2013 Encoder: Conv+ReLU 16 \u00d7 9 \u00d7 9, Conv+ReLU 32 \u00d7 9 \u00d7 9, FC+ReLU 8192 \u00d7 4096, Reshape 64 \u00d7 8 \u00d7 8, Phase Pool 8 \u00d7 8. Prediction: Average Mag., Linear Extrap. Phase. Decoder: Unpool 8 \u00d7 8, FC+ReLU 4096 \u00d7 8192, Reshape 32 \u00d7 16 \u00d7 16, SpatialPadding 8 \u00d7 8, Conv+ReLU 16 \u00d7 9 \u00d7 9, SpatialPadding 8 \u00d7 8, Conv 1 \u00d7 9 \u00d7 9.\n\nIn Shallow Architecture 1, phase pooling is performed spatially in non-overlapping groups of 4 \u00d7 4 and across features in a one-dimensional topology consisting of non-overlapping groups of four. Each of the 16 pool-groups produces a code consisting of a scalar m and a three-component p = [pf, px, py]T (corresponding to two spatial and one feature dimensions); thus the encoder architecture produces a code of size 16 \u00d7 4 \u00d7 8 \u00d7 8 for each frame. The corresponding filters whose activations were pooled together are laid out horizontally in groups of four in Figure 3(a). Note that each group learns to exhibit a strong ordering corresponding to the linearized variable pf.
Because global rigid transformations can be locally well approximated by translations, the features learn to parameterize local translations. In effect the network learns to linearize the input by tracking common features in the video sequence. Unlike the spatial phase variables, pf can linearize sub-pixel translations. Next, the second architecture in Table 1 was trained on natural movie patches with the natural motion present in the real videos. The architecture differs only in that pooling across features is done with overlap (groups of 4, stride of 2). The resulting decoder filters are displayed in Figure 3(b). Note that pooling with overlap introduces smoother transitions between the pool groups. Although some groups still capture translations, more complex transformations are learned from natural movies.\n\n4.2 Deep Architecture Trained on NORB\n\nIn the next set of experiments we trained deep feature hierarchies that have the capacity to linearize a richer class of transformations. To evaluate the properties of the learned features in a controlled setting, the networks were trained on simulated videos generated using the NORB dataset rescaled to 32 \u00d7 32 to reduce training time. The simulated videos are generated by tracing constant speed trajectories with random starting points in the two-dimensional latent space of pitch and azimuth rotations. In other words, the models are trained on triplets of frames ordered by their rotation angles. As before, presented with two frames as input, the models are trained to predict the third frame. Recall that prediction is merely a proxy for learning linearized feature representations.
One way to evaluate the linearization properties of the learned features is to linearly interpolate (or extrapolate) new codes and visualize the corresponding images via forward propagation through the decoder. This simultaneously tests the encoder\u2019s capability to linearize the input and the decoder\u2019s (generative) capability to synthesize images from the linearized codes.\n\nFigure 3: Decoder filters learned by shallow phase-pooling architectures: (a) Shallow Architecture 1, (b) Shallow Architecture 2.\n\nFigure 4: (a) Test samples input to the network; (b) linear interpolation in code space learned by our Siamese-encoder network.\n\nIn order to perform these tests we must have an explicit code representation, which is not always available. For instance, consider a simple scheme in which a generic deep network is trained to predict the third frame from the concatenated input of two previous frames. Such a network does not even provide an explicit feature representation for evaluation. A simple baseline architecture that affords this type of evaluation is a Siamese encoder followed by a decoder; this exactly corresponds to our proposed architecture with the linear prediction layer removed. Such an architecture is equivalent to learning the weights of the linear prediction layer of the model shown in Figure 2. In the following experiment we evaluate the effects of: (1) fixing vs. learning the linear prediction operator, (2) including the phase pooling operation, and (3) including explicit curvature regularization (the second term in Equation 1).\n\nLet us first consider Deep Architecture 1, summarized in Table 1. In this architecture a Siamese encoder produces a code of size 4096 for each frame. The codes corresponding to the two frames are concatenated together and propagated to the decoder. In this architecture the first linear layer of the decoder can be interpreted as a learned linear prediction layer.
Figure 4a shows three frames from the test set corresponding to temporal indices 1, 2, and 3, respectively. Figure 4b shows the generated frames corresponding to interpolated codes at temporal indices 0, 0.5, 1, 1.5, 2, 2.5, and 3. The images were generated by propagating the corresponding codes through the decoder. Codes corresponding to non-integer temporal indices were obtained by linearly interpolating in code space.\n\nDeep Architecture 2 differs from Deep Architecture 1 in that it generates the predicted code via a fixed linear extrapolation in code space. The extrapolated code is then fed to the decoder that generates the predicted image. Note that the fully connected stage of the decoder has half as many free parameters compared to the previous architecture. This architecture is further restricted by propagating only the predicted code to the decoder. For instance, unlike in Deep Architecture 1, the decoder cannot copy any of the input frames to the output. The generated images corresponding to this architecture are shown in Figure 5a. These images more closely resemble images from the dataset. Furthermore, Deep Architecture 2 achieves a lower L2 prediction error than Deep Architecture 1.\n\nFigure 5: Linear interpolation in code space learned by our model: (a) no phase-pooling, no curvature regularization; (b) with phase-pooling and curvature regularization. Interpolation results obtained by minimizing (c) Equation 1 and (d) Equation 7, trained with only partially predictable simulated video.\n\nFinally, Deep Architecture 3 uses phase-pooling in the encoder, and \u201cun-pooling\u201d in the decoder. This architecture makes use of phase-pooling in a two-dimensional feature space arranged on an 8 \u00d7 8 grid. The pooling is done in a single group over all the fully-connected features, producing a feature vector of dimension 192 (64 \u00d7 3) compared to 4096 in previous architectures.
Nevertheless, this architecture achieves the best overall L2 prediction error and generates the most visually realistic images (Figure 5b).\n\nIn this subsection we compare the representation learned by minimizing the loss in Equation 1 to that learned by minimizing Equation 7. Uncertainty is simulated by generating triplet sequences where the third frame is skipped randomly with equal probability, determined by a Bernoulli variable s. For example, the sequences corresponding to models with rotation angles 0\u00b0, 20\u00b0, 40\u00b0 and 0\u00b0, 20\u00b0, 60\u00b0 are equally likely. Minimizing Equation 1 with Deep Architecture 3 results in the images displayed in Figure 5c. The interpolations are blurred due to the averaging effect discussed in Subsection 3.2. On the other hand, minimizing Equation 7 (Figure 5d) partially recovers the sharpness of Figure 5b. For this experiment, we used a three-dimensional, real-valued \u03b4. Moreover, training a linear predictor to infer the binary variable s from \u03b4 (after training) results in a 94% test set accuracy. This suggests that \u03b4 does indeed capture the uncertainty in the data.\n\n5 Discussion\n\nIn this work we have proposed a new loss and architecture for learning locally linearized features from video. We have also proposed a method that introduces latent variables that are non-deterministic functions of the input for coping with inherent uncertainty in video. In future work we will suggest methods for \u201cstacking\u201d these architectures that will linearize more complex features over longer temporal scales.\n\nAcknowledgments\n\nWe thank Jonathan Tompson, Joan Bruna, and David Eigen for many insightful discussions. We also gratefully acknowledge NVIDIA Corporation for the donation of a Tesla K40 GPU used for this research.\n\nReferences\n\n[1] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation learning: A review and new perspectives.
Technical report, University of Montreal, 2012.\n\n[2] Charles F. Cadieu and Bruno A. Olshausen. Learning intermediate-level representations of form and motion from natural movies. Neural Computation, 2012.\n\n[3] Taco S. Cohen and Max Welling. Transformation properties of learned visual representations. arXiv preprint arXiv:1412.7659, 2014.\n\n[4] Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. arXiv preprint arXiv:1412.6056, 2014.\n\n[5] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning \u2013 ICANN 2011, pages 44\u201351. Springer, 2011.\n\n[6] Christoph Kayser, Wolfgang Einh\u00e4user, Olaf D\u00fcmmer, Peter K\u00f6nig, and Konrad K\u00f6rding. Extracting slow subspaces from natural videos leads to complex cells. In ICANN\u20192001, 2001.\n\n[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, volume 1, page 4, 2012.\n\n[8] Hossein Mobahi, Ronan Collobert, and Jason Weston. Deep learning from temporal coherence in video. In ICML, 2009.\n\n[9] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481\u2013487, 2004.\n\n[10] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1717\u20131724. IEEE, 2014.\n\n[11] M. Ranzato, Fu Jie Huang, Y-L. Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR\u201907. IEEE Conference on, pages 1\u20138.
IEEE, 2007.\n\n[12] Marc\u2019Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.\n\n[13] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023, 2015.\n\n[14] Xiaolong Wang and Abhinav Gupta. Unsupervised learning of visual representations using videos. arXiv preprint arXiv:1505.00687, 2015.\n\n[15] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 2002.\n\n[16] Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Robert Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528\u20132535. IEEE, 2010.\n", "award": [], "sourceid": 763, "authors": [{"given_name": "Ross", "family_name": "Goroshin", "institution": "New York University"}, {"given_name": "Michael", "family_name": "Mathieu", "institution": "New York University"}, {"given_name": "Yann", "family_name": "LeCun", "institution": "New York University"}]}