{"title": "Dynamic Filter Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 667, "page_last": 675, "abstract": "In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operations can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach state-of-the-art performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation.", "full_text": "Dynamic Filter Networks\n\nBert De Brabandere1\u2217\n\nESAT-PSI, KU Leuven, iMinds\n\nXu Jia1\u2217\n\nESAT-PSI, KU Leuven, iMinds\n\nTinne Tuytelaars1\n\nESAT-PSI, KU Leuven, iMinds\n\nLuc Van Gool1,2\n\n1firstname.lastname@esat.kuleuven.be\n\n2vangool@vision.ee.ethz.ch\n\nESAT-PSI, KU Leuven, iMinds\n\nD-ITET, ETH Zurich\n\nAbstract\n\nIn a traditional convolutional layer, the learned \ufb01lters stay \ufb01xed after training. In\ncontrast, we introduce a new framework, the Dynamic Filter Network, where \ufb01lters\nare generated dynamically conditioned on an input. 
We show that this architecture\nis a powerful one, with increased \ufb02exibility thanks to its adaptive nature, yet without\nan excessive increase in the number of model parameters. A wide variety of \ufb01ltering\noperations can be learned this way, including local spatial transformations, but also\nothers like selective (de)blurring or adaptive feature extraction. Moreover, multiple\nsuch layers can be combined, e.g. in a recurrent architecture.\nWe demonstrate the effectiveness of the dynamic \ufb01lter network on the tasks of\nvideo and stereo prediction, and reach state-of-the-art performance on the moving\nMNIST dataset with a much smaller model. By visualizing the learned \ufb01lters,\nwe illustrate that the network has picked up \ufb02ow information by only looking at\nunlabelled training data. This suggests that the network can be used to pretrain\nnetworks for various supervised tasks in an unsupervised way, like optical \ufb02ow and\ndepth estimation.\n\n1 Introduction\n\nHumans are good at predicting another view from related views. For example, humans can use\ntheir everyday experience to predict how the next frame in a video will differ; or after seeing a\nperson\u2019s pro\ufb01le face have an idea of her frontal view. This capability is extremely useful to get early\nwarnings about impinging dangers, to be prepared for necessary actions, etc. The vision community\nhas realized that endowing machines with similar capabilities would be rewarding.\nSeveral papers have already addressed the generation of an image conditioned on given image(s).\nYim et al. [24] and Yang et al. [23] learn to rotate a given face to another pose. The authors\nof [16, 19, 18, 15, 12] train a deep neural network to predict subsequent video frames. Flynn et al. [3]\nuse a deep network to interpolate between views separated by a wide baseline. Yet all these methods\napply the exact same set of \ufb01ltering operations on each and every input image. 
This seems suboptimal\nfor the tasks at hand. For example, for video prediction, there are different motion patterns within\ndifferent video clips. The main idea behind our work is to generate the future frames with parameters\nadapted to the motion pattern within a particular video. Therefore, we propose a learnable parameter\nlayer that provides custom parameters for different samples.\n\n\u2217X. Jia and B. De Brabandere contributed equally to this work and are listed in alphabetical order.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nOur dynamic \ufb01lter module consists of two parts: a \ufb01lter-generating network and a dynamic \ufb01ltering\nlayer (see Figure 1). The \ufb01lter-generating network dynamically generates sample-speci\ufb01c \ufb01lter\nparameters conditioned on the network\u2019s input. Note that, unlike regular model parameters, these\nare not \ufb01xed after training. The dynamic \ufb01ltering layer then applies those sample-speci\ufb01c \ufb01lters to the\ninput. Both components of the dynamic \ufb01lter module are differentiable with respect to the model\nparameters such that gradients can be backpropagated throughout the network. The \ufb01lters can be\nconvolutional, but other options are possible. In particular, we propose a special kind of dynamic\n\ufb01ltering layer which we coin dynamic local \ufb01ltering layer, which is not only sample-speci\ufb01c but also\nposition-speci\ufb01c. The \ufb01lters in that case vary from position to position and from sample to sample,\nallowing for more sophisticated operations on the input. Our framework can learn both spatial and\nphotometric changes, as pixels are not simply displaced, but the \ufb01lters possibly operate on entire\nneighbourhoods.\nWe demonstrate the effectiveness of the proposed dynamic \ufb01lter module on several tasks, including\nvideo prediction and stereo prediction. 
We also show that the computed dynamic \ufb01lters, because they\nare explicitly calculated, can be visualised as an image similar to an optical \ufb02ow or stereo map.\nMoreover, they are learned in a totally unsupervised way, i.e. without groundtruth maps.\nThe rest of the paper is organised as follows. In section 2 we discuss related work. Section 3 describes\nthe proposed method. We show the evaluation in section 4 and conclude the paper in section 5.\n\n2 Related Work\n\nDeep learning architectures Several recent works explore the idea of introducing more \ufb02exibility\ninto the network architecture. Jaderberg et al. [9] propose a module called Spatial Transformer,\nwhich allows the network to actively spatially transform feature maps conditioned on themselves\nwithout explicit supervision. They show this module is able to perform translation, scaling, rotation\nand other general warping transformations. They apply this module to a standard CNN network\nfor classi\ufb01cation, making it invariant to a set of spatial transformations. This seminal method only\nworks with parametric transformations however, and applies a single transformation to the entire\nfeature map(s). Patraucean et al. [15] extend the Spatial Transformer by modifying the grid generator\nsuch that it has one transformation for each position, instead of a single transformation for the entire\nimage. They exploit this idea for the task of video frame prediction, applying the learned dense\ntransformation map to the current frame to generate the next frame. Similarly, our method also applies\na position-speci\ufb01c transformation to the image or feature maps and takes video frame prediction as\none testbed. In contrast to their work, our method generates the new image by applying dynamic local\n\ufb01lters to the input image or feature maps instead of using a grid generator and sampler. 
Our method\nis not only able to learn how to displace a pixel, but also how to construct it from an entire neighborhood,\nincluding its intensity (e.g. by learning a blur kernel).\nIn the context of visual question answering, Noh et al. [13] introduce a dynamic parameter layer\nwhose output is used as parameters of a fully connected layer. In that work, the dynamic parameter\nlayer takes the information from another domain, i.e. question representation, as input. They further\napply hashing to address the issue of predicting the large number of weights needed for a fully\nconnected layer. Different from their work, we propose to apply the dynamically generated \ufb01lters\nto perform a \ufb01ltering operation on an image, hence we do not have the same problem of predicting\nlarge amounts of parameters. Our work also shares similar ideas with early work on fast-weight\nnetworks [4], that is, having a network learn to generate context dependent weights for another\nnetwork. However, we instantiate this idea as a convolution/local \ufb01ltering operation with spatial\ninformation under consideration while they use a fully connected layer, and use it as an alternative\nfor RNN. Most similar to our work, a dynamic convolution layer is proposed by Klein et al. [10] in\nthe context of short range weather prediction and by Riegler et al. [17] for non-blind\nsingle image super resolution. Our work differs from theirs in that it is more general: dynamic \ufb01lter\nnetworks are not limited to translation-invariant convolutions, but also allow position-speci\ufb01c \ufb01ltering\nusing a dynamic locally connected layer. Lastly, Finn et al. [2] recently independently proposed a\nmechanism called (convolutional) dynamic neural advection that is very similar to ours.\n\nNew view synthesis Our work is also related to works on new view synthesis, that is, generating a\nnew view conditioned on the given views of a scene. 
One popular task in this category is to predict\nfuture video frames. Ranzato et al. [16] use an encoder-decoder framework in a way similar to\nlanguage modeling. Srivastava et al. [19] propose a multilayer LSTM based autoencoder for both\npast frames reconstruction and future frames prediction. This work has been extended by Shi et\nal. [18] who propose to use convolutional LSTM to replace the fully connected LSTM in the network.\nThe use of convolutional LSTM reduces the amount of model parameters and also exploits the local\ncorrelation in the image. Oh et al. [14] address the problem of predicting future frames conditioned\non both previous frames and actions. They propose the encoding-transformation-decoding framework\nwith either feedforward encoding or recurrent encoding to address this task. Mathieu et al. [12]\nmanage to generate reasonably sharp frames by means of a multi-scale architecture, an adversarial\ntraining method, and an image gradient difference loss function. In a similar vein, Flynn et al. [3]\napply a deep network to produce unseen views given neighboring views of a scene. Their network\ncomes with a selection tower and a color tower, and is trained in an end-to-end fashion. This idea\nis further re\ufb01ned by Xie et al. [22] for 2D-to-3D conversion. None of these works adapt the \ufb01lter\noperations of the network to the speci\ufb01c input sample, as we do, with the exception of [3, 22]. We\u2019ll\ndiscuss the relation between their selection tower and our dynamic \ufb01lter layer in section 3.3.\n\nShortcut connections Our work also shares some similarity, through the use of shortcut connec-\ntions, with the highway network [20] and the residual network [7, 8]. For a module in the highway\nnetwork, the transform gate and the carry gate are de\ufb01ned to control the information \ufb02ow across\nlayers. Similarly, He et al. [7, 8] propose to reformulate layers as learning residual functions instead\nof learning unreferenced functions. 
Compared to the highway network, residual networks remove the\ngates in the highway network module and the path for input is always open throughout the network.\nIn our network architecture, we also learn a referenced function. Yet, instead of applying addition to\nthe input, we apply \ufb01ltering to the input - see section 3.3 for more details.\n\n3 Dynamic Filter Networks\n\nIn this section we describe our dynamic \ufb01lter framework. A dynamic \ufb01lter module consists of a\n\ufb01lter-generating network that produces \ufb01lters conditioned on an input, and a dynamic \ufb01ltering layer\nthat applies the generated \ufb01lters to another input. Both components of the dynamic \ufb01lter module are\ndifferentiable. The two inputs of the module can be either identical or different, depending on the\ntask. The general architecture of this module is shown schematically\nin Figure 1. We explicitly model the transformation: invariance to change should not imply one\nbecomes totally blind to it. Moreover, such explicit modeling allows unsupervised learning of\ntransformation \ufb01elds like optical \ufb02ow or depth.\n\nFigure 1: The general architecture of a Dynamic Filter Network.\n\nFor clarity, we make a distinction between model parameters and dynamically generated parameters.\nModel parameters denote the layer parameters that are initialized in advance and only updated during\ntraining. They are the same for all samples. Dynamically generated parameters are sample-speci\ufb01c,\nand are produced on-the-\ufb02y without a need for initialization. The \ufb01lter-generating network outputs\ndynamically generated parameters, while its own parameters are part of the model parameters.\n\n3.1 Filter-Generating Network\nThe \ufb01lter-generating network takes an input IA \u2208 R^{h\u00d7w\u00d7cA}, where h, w and cA are height, width\nand number of channels of the input A respectively. 
It outputs \ufb01lters F\u03b8 parameterized by parameters\n\u03b8 \u2208 R^{s\u00d7s\u00d7cB\u00d7n\u00d7d}, where s is the \ufb01lter size, cB the number of channels in input B and n the number\nof \ufb01lters. d is equal to 1 for dynamic convolution and h \u00d7 w for dynamic local \ufb01ltering, which we\ndiscuss below. The \ufb01lters are applied to input IB \u2208 R^{h\u00d7w\u00d7cB} to generate an output G = F\u03b8(IB),\nwith G \u2208 R^{h\u00d7w\u00d7n}. The \ufb01lter size s determines the receptive \ufb01eld and is chosen depending on the\napplication. The size of the receptive \ufb01eld can also be increased by stacking multiple dynamic \ufb01lter\nmodules. This is for example useful in applications that may involve large local displacements.\nThe \ufb01lter-generating network can be implemented with any differentiable architecture, such as a\nmultilayer perceptron or a convolutional network. A convolutional network is particularly suitable\nwhen using images as input to the \ufb01lter-generating network.\n\n3.2 Dynamic Filtering Layer\n\nThe dynamic \ufb01ltering layer takes images or feature maps IB as input and outputs the \ufb01ltered result\nG \u2208 R^{h\u00d7w\u00d7n}. For simplicity, in the experiments we only consider a single feature map (cB = 1)\n\ufb01ltered with a single generated \ufb01lter (n = 1), but this is not required in a general setting. The dynamic\n\ufb01ltering layer can be instantiated as a dynamic convolutional layer or a dynamic local \ufb01ltering layer.\n\nDynamic convolutional layer. A dynamic convolutional layer is similar to a traditional convolutional\nlayer in that the same \ufb01lter is applied at every position of the input IB. 
But in contrast to the\ntraditional convolutional layer, where \ufb01lter weights are model parameters, in a dynamic convolutional\nlayer the \ufb01lter parameters \u03b8 are dynamically generated by a \ufb01lter-generating network:\n\nG(i, j) = F\u03b8(IB(i, j))    (1)\n\nThe \ufb01lters are sample-speci\ufb01c and conditioned on the input of the \ufb01lter-generating network. The\ndynamic convolutional layer is shown schematically in Figure 2(a). Given some prior knowledge\nabout the application at hand, it is sometimes possible to facilitate training by constraining the\ngenerated convolutional \ufb01lters in a certain way. For example, if the task is to produce a translated\nversion of the input image IB where the translation is conditioned on another input IA, the generated\n\ufb01lter can be sent through a softmax layer to encourage it to have only a few high-magnitude\nelements. We can also make the \ufb01lter separable: instead of a single square \ufb01lter, generate separate\nhorizontal and vertical \ufb01lters that are applied to the image consecutively, similar to what is done\nin [10].\n\nDynamic local \ufb01ltering layer. An extension of the dynamic convolution layer that proves interesting,\nas we show in the experiments, is the dynamic local \ufb01ltering layer. In this layer the \ufb01ltering operation\nis not translation invariant anymore. Instead, different \ufb01lters are applied to different positions of the\ninput IB similarly to the traditional locally connected layer: for each position (i, j) of the input IB, a\nspeci\ufb01c local \ufb01lter F\u03b8^{(i,j)} is applied to the region centered around IB(i, j):\n\nG(i, j) = F\u03b8^{(i,j)}(IB(i, j))    (2)\n\nThe \ufb01lters used in this layer are not only sample speci\ufb01c but also position speci\ufb01c. 
Note that dynamic\nconvolution as discussed in the previous section is a special case of local dynamic \ufb01ltering where\nthe local \ufb01lters are shared over the image\u2019s spatial dimensions. The dynamic local \ufb01ltering layer\nis shown schematically in Figure 2b. If the generated \ufb01lters are again constrained with a softmax\nfunction so that each \ufb01lter only contains one non-zero element, then the dynamic local \ufb01ltering layer\nreplaces each element of the input IB by an element selected from a local neighbourhood around\nit. This offers a natural way to model local spatial deformations conditioned on another input IA.\nThe dynamic local \ufb01ltering layer can perform not only a single transformation like the dynamic\nconvolutional layer, but also position-speci\ufb01c transformations like local deformation. Before or after\napplying the dynamic local \ufb01ltering operation we can add a dynamic pixel-wise bias to each element\nof the input IB to address situations like photometric changes. This dynamic bias can be produced by\nthe same \ufb01lter-generating network that generates the \ufb01lters for the local \ufb01ltering.\nWhen inputs IA and IB are both images, a natural way to implement the \ufb01lter-generating network is\nwith a convolutional network. This way, the generated position-speci\ufb01c \ufb01lters are conditioned on the\nlocal image region around their corresponding position in IA. The receptive \ufb01eld of the convolutional\nnetwork that generates the \ufb01lters can be increased by using an encoder-decoder architecture. We can\nalso apply a smoothness penalty to the output of the \ufb01lter-generating network, so that neighboring\n\ufb01lters are encouraged to apply the same transformation.\nAnother advantage of the dynamic local \ufb01ltering layer over the traditional locally connected layer is\nthat we do not need so many model parameters. 
The learned model is smaller and this is desirable in\nembedded system applications.\n\nFigure 2: (a) Dynamic convolution: the \ufb01lter-generating network produces a single \ufb01lter that is applied\nconvolutionally on IB. (b) Dynamic local \ufb01ltering: each location is \ufb01ltered with a location-speci\ufb01c\ndynamically generated \ufb01lter.\n\n3.3 Relationship with other networks\n\nThe generic formulation of our framework allows us to draw parallels with other networks in the\nliterature. Here we discuss the relation with the spatial transformer networks [9], the deep stereo\nnetwork [3, 22], and the residual networks [7, 8].\n\nSpatial Transformer Networks The proposed dynamic \ufb01lter network shares the same philosophy\nas the spatial transformer network proposed by [9], in that it applies a transformation conditioned on\nan input to a feature map. The spatial transformer network includes a localization network which\ntakes a feature map as input and outputs the parameters of the desired spatial transformation. A\ngrid generator and sampler are needed to apply the desired transformation to the feature map. This\nidea is similar to our dynamic \ufb01lter network, which uses a \ufb01lter-generating network to compute the\nparameters of the desired \ufb01lters. The \ufb01lters are applied on the feature map with a simple \ufb01ltering\noperation that only consists of multiplication and summation operations.\nA spatial transformer network is naturally suited for global transformations, even sophisticated ones\nsuch as a thin plate spline. The dynamic \ufb01lter network is more suitable for local transformations,\nbecause of the limited receptive \ufb01eld of the generated \ufb01lters, although this problem can be alleviated\nwith larger \ufb01lters, stacking multiple dynamic \ufb01lter modules, and using multi-resolution extensions. 
A\nmore fundamental difference is that the spatial transformer is only suited for spatial transformations,\nwhereas the dynamic \ufb01lter network can apply more general ones (e.g. photometric, \ufb01ltering), as long\nas the transformation is implementable as a series of \ufb01ltering operations. This is illustrated in the \ufb01rst\nexperiment in the next section.\n\nDeep Stereo The deep stereo network of [3] can be seen as a speci\ufb01c instantiation of a dynamic\n\ufb01lter network with a local \ufb01ltering layer where inputs IA and IB denote the same image, only a\nhorizontal \ufb01lter is generated and softmax is applied to each dynamic \ufb01lter. The effect of the selection\ntower used in their network is equivalent to the proposed dynamic local \ufb01ltering layer. For the speci\ufb01c\ntask of stereo prediction, they use a more complicated architecture for the \ufb01lter-generating network.\n\nResidual Networks The core idea\nof ResNets [7, 8] is to learn a residual\nfunction with respect to the identity\nmapping, which is implemented as an\nadditive shortcut connection. In the\ndynamic \ufb01lter network, we also have\ntwo branches where one branch acts as\na shortcut connection. This becomes\nclear when we redraw the diagram (Figure 3). Instead of merging the branches with addition, we\nmerge them with a dynamic \ufb01ltering layer which is multiplicative in nature. 
Multiplicative interactions\nin neural networks have also been investigated by [21].\n\nFigure 3: Relation with residual networks.\n\nMoving MNIST\nModel                 # params     bce\nFC-LSTM [19]          142,667,776  341.2\nConv-LSTM [18]        7,585,296    367.1\nSpatio-temporal [15]  1,035,067    179.8\nBaseline (ours)       637,443      432.5\nDFN (ours)            637,361      285.2\n\nTable 1: Left: Quantitative results on Moving MNIST: number of model parameters and average binary\ncross-entropy (bce). Right: The dynamic \ufb01lter network for video prediction.\n\n4 Experiments\n\nThe Dynamic Filter Network can be used in different ways in a wide variety of applications.\nIn this section we show its application in learning steerable \ufb01lters, video prediction and stereo\nprediction. All code to reproduce the experiments is available at https://github.com/dbbert/dfn.\n\n4.1 Learning steerable \ufb01lters\n\nFigure 4: The dynamic \ufb01lter network for learning steerable \ufb01lters and several examples of learned \ufb01lters.\n\nWe \ufb01rst set up a simple experiment to illustrate\nthe basics of the dynamic \ufb01lter module with\na dynamic convolution layer. The task is to\n\ufb01lter an input image with a steerable \ufb01lter of a\ngiven orientation \u03b8. The network must learn this\ntransformation from looking at input-output pairs, consisting of randomly chosen input images and\nangles together with their corresponding output.\nThe task of the \ufb01lter-generating network here is to transform an angle into a \ufb01lter, which is then\napplied to the input image to generate the \ufb01nal output. We implement the \ufb01lter-generating network as\na few fully-connected layers with the last layer containing 81 neurons, corresponding to the elements\nof a 9x9 convolution \ufb01lter. 
Figure 4 shows an example of the trained network. It has indeed learned\nthe expected \ufb01lters and applies the correct transformation to the image.\n\n4.2 Video prediction\n\nIn video prediction, the task is to predict the sequence of future frames that follows the given sequence\nof input frames. To address this task we use a convolutional encoder-decoder as the \ufb01lter-generating\nnetwork where the encoder consists of several strided convolutional layers and the decoder consists\nof several unpooling layers and convolutional layers. The convolutional encoder-decoder is able to\nexploit the spatial correlation within a frame and generates feature maps that are of the same size\nas the frame. To exploit the temporal correlation between frames we add a recurrent connection\ninside the \ufb01lter-generating network: we pass the previous hidden state through two convolutional\nlayers and sum it with the output of the encoder to produce the new hidden state. During prediction,\nwe propagate the prediction from the previous time step. Table 1 (right) shows a diagram of our\narchitecture. Note that we use a very simple recurrent architecture rather than the more advanced\nLSTM as in [19, 18]. A softmax layer is applied to each generated \ufb01lter such that each \ufb01lter is\nencouraged to only have a few high magnitude elements. This helps the dynamic \ufb01ltering layer to\ngenerate sharper images because each pixel in the output image comes from only a few pixels in the\nprevious frame. To produce the prediction of the next frame, the generated \ufb01lters are applied on the\nprevious frame to transform it with the dynamic local \ufb01ltering mechanism explained in Section 3.\n\nMoving MNIST We \ufb01rst evaluate the method on the synthetic moving MNIST dataset [19]. 
Given\na sequence of 10 frames with two moving digits as input, the goal is to predict the following 10 frames.\nWe use the code provided by [19] to generate training samples on-the-\ufb02y, and use the provided test\nset for comparison. Only simple pre-processing is done to convert pixel values into the range [0,1].\nAs the loss function we use average binary cross-entropy over the 10 frames. The size of the dynamic\n\ufb01lters is set to 9x9. This allows the network to translate pixels over a distance of at most 4 pixels,\nwhich is suf\ufb01cient for this dataset. Details on the hyper-parameters can be found in the available code.\n\nFigure 5: Qualitative results on moving MNIST. Note that the network has learned the bouncing dynamics and\nseparation of overlapping digits. More examples and out-of-domain results are in the supplementary material.\n\nFigure 6: Qualitative results of video prediction on the Highway Driving dataset. Note the good prediction of\nthe lanes (red), bridge (green) and a car moving in the opposite direction (purple).\n\nWe also compare our results with a baseline consisting of only the \ufb01lter-generating network, followed\nby a 1\u00d71 convolution layer. This way, the baseline network has approximately the same structure and\nnumber of parameters as the proposed dynamic \ufb01lter network. The quantitative results are shown in\nTable 1 (left). Our method outperforms the baseline and [19, 18] with a much smaller model. Figure 5\nshows some qualitative results. Our method is able to correctly learn the individual motions of digits.\nWe observe that the predictions deteriorate over time, i.e. the digits become blurry. This is partly\nbecause of the model error: our model is not able to perfectly separate digits after an overlap, and\nthese errors accumulate over time. 
Another cause of blurring comes from an artifact of the dataset:\nbecause of imperfect cropping, it is uncertain when exactly the digit will bounce and change its\ndirection. The behavior is not perfectly deterministic. This uncertainty combined with the pixel-wise\nloss function encourages the model to \"hedge its bets\" when a digit reaches the boundary, causing a\nblurry result. This issue could be alleviated with the methods proposed in [5, 6, 11].\n\nHighway Driving We also evaluate our method on real-world data of a car driving on the highway.\nCompared to natural video like UCF101 used in [16, 12], the highway driving data is highly structured\nand much more predictable, making it a good testbed for video prediction. We add a small extension\nto the architecture: a dynamic per-pixel bias is added to the image before the \ufb01ltering operation. This\nallows the network to handle illumination changes such as when the car drives through a tunnel.\nBecause the Highway Driving sequence is less deterministic than moving MNIST, we only predict\nthe next 3 frames given an input sequence of 3 frames. We split the approximately 20,000 frames\nof the 30-minute video into a training set of 16,000 frames and a test set of 4,000 frames. We train\nwith a Euclidean loss function and obtain a loss of 13.54 on the test set with a model consisting of\n368,122 parameters, beating the baseline which gets a loss of 15.97 with 368,245 parameters.\nFigure 6 shows some qualitative results. Similar to the experiments on moving MNIST, the predictions\nget blurry over time. This can partly be attributed to the increasing uncertainty combined with an\nelement-wise loss-function which encourages averaging out the possible predictions. Moreover, the\nerrors accumulate over time and make the network operate in an out-of-domain regime.\n\nFigure 7: Some samples for video (left) and stereo (right) prediction and visualization of the dynamically\ngenerated \ufb01lters. More examples and a video can be found in the supplementary material.\n\nWe can visualize the dynamically generated \ufb01lters of the trained model in a \ufb02ow-like manner. The\nresult is shown in Figure 7 and the visualization process is explained in the supplementary material.\nNote that the network seems to generate \"valid\" \ufb02ow only insofar as it helps with minimizing its\nvideo prediction objective. This is sometimes noticeable in uniform, textureless regions of the image,\nwhere a valid optical \ufb02ow is no prerequisite for correctly predicting the next frame. Although the\n\ufb02ow map is not perfectly smooth, it is learned in a self-supervised way by only training on unlabeled\nvideo data. This is different from supervised methods like [1].\n\n4.3 Stereo prediction\n\nWe de\ufb01ne stereo prediction as predicting the right view given the left view of a stereo camera. This\ntask is a variant of video prediction, where the goal is to predict a new view in space rather than in\ntime, and from a single image rather than multiple ones. Flynn et al. [3] developed a network for new\nview synthesis from multiple views in unconstrained settings like museums, parks and streets. We limit\nourselves to the more structured Highway Driving dataset and a classical two-view stereo setup.\nWe recycle the architecture from the previous section, and replace the square 9x9 \ufb01lters with horizontal\n13x1 \ufb01lters. The network is trained and evaluated on the same train/test split as in the previous\nsection, with the left view as input and the right one as target. It reaches a loss of 0.52 on the test set\nwith a model consisting of 464,494 parameters. The baseline obtains a loss of 1.68 with 464,509\nparameters. 
The network has learned to shift objects to the left depending on their distance to the\ncamera, as shown in Figure 7 (right). The results suggest that it is possible to use the proposed\ndynamic \ufb01lter network architecture to pre-train networks for optical \ufb02ow and disparity map estimation\nin a self-supervised manner using only unlabeled data.\n\n5 Conclusion\n\nIn this paper we introduced Dynamic Filter Networks, a class of networks that applies dynamically\ngenerated \ufb01lters to an image in a sample-speci\ufb01c way. We discussed two versions: dynamic convolution\nand dynamic local \ufb01ltering. We validated our framework in the context of steerable \ufb01lters, video\nprediction and stereo prediction. As future work, we plan to explore the potential of dynamic \ufb01lter\nnetworks on other tasks, such as \ufb01ne-grained image classi\ufb01cation, where \ufb01lters could learn to adapt to\nthe object pose, or image deblurring, where \ufb01lters can be tuned to adapt to the image structure.\n\n6 Acknowledgements\n\nThis work was supported by FWO through the project G.0696.12N \u201cRepresentations and algorithms\nfor captation, visualization and manipulation of moving 3D objects, subjects and scenes\u201d, the EU\nFP7 project Europa2, the iMinds ICON project Footwork and a bilateral Toyota project.\n\nReferences\n[1] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip H\u00e4usser, Caner Hazirbas, Vladimir Golkov, Patrick\nvan der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical \ufb02ow with convolutional\nnetworks. In ICCV, 2015.\n\n[2] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through\n\nvideo prediction. In NIPS, 2016.\n\n[3] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. Deepstereo: Learning to predict new views\n\nfrom the world\u2019s imagery. In CVPR, 2015.\n\n[4] Faustino J. Gomez and J\u00fcrgen Schmidhuber. 
Evolving modular fast-weight networks for control. In ICANN, 2005.\n\n[5] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[6] Ross Goroshin, Micha\u00ebl Mathieu, and Yann LeCun. Learning to linearize under uncertainty. In NIPS, 2015.\n\n[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.\n\n[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.\n\n[9] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.\n\n[10] Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range weather prediction. In CVPR, 2015.\n\n[11] Anders B. L. Larsen, S\u00f8ren Kaae S\u00f8nderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.\n\n[12] Micha\u00ebl Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.\n\n[13] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In CVPR, 2016.\n\n[14] Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L. Lewis, and Satinder Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.\n\n[15] Viorica Patraucean, Ankur Handa, and Roberto Cipolla. Spatio-temporal video autoencoder with differentiable memory. CoRR, abs/1511.06309, 2015.\n\n[16] Marc\u2019Aurelio Ranzato, Arthur Szlam, Joan Bruna, Micha\u00ebl Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. 
CoRR, abs/1412.6604, 2014.\n\n[17] Gernot Riegler, Samuel Schulter, Matthias R\u00fcther, and Horst Bischof. Conditioned regression models for non-blind single image super-resolution. In ICCV, 2015.\n\n[18] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.\n\n[19] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.\n\n[20] Rupesh Kumar Srivastava, Klaus Greff, and J\u00fcrgen Schmidhuber. Training very deep networks. In NIPS, 2015.\n\n[21] G. Taylor and G. Hinton. Factored conditional restricted Boltzmann machines for modeling motion style. In ICML, 2009.\n\n[22] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks. In ECCV, 2016.\n\n[23] Jimei Yang, Scott Reed, Ming-Hsuan Yang, and Honglak Lee. Weakly-supervised disentangling with recurrent transformations for 3D view synthesis. In NIPS, 2015.\n\n[24] Junho Yim, Heechul Jung, ByungIn Yoo, Changkyu Choi, Du-Sik Park, and Junmo Kim. Rotating your face using multi-task deep neural network. In CVPR, 2015.", "award": [], "sourceid": 378, "authors": [{"given_name": "Xu", "family_name": "Jia", "institution": "KU Leuven"}, {"given_name": "Bert", "family_name": "De Brabandere", "institution": "KU Leuven"}, {"given_name": "Tinne", "family_name": "Tuytelaars", "institution": "KU Leuven"}, {"given_name": "Luc", "family_name": "Gool", "institution": "ETH Z\u00fcrich"}]}