{"title": "Semi-Supervised Learning for Optical Flow with Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 354, "page_last": 364, "abstract": "Convolutional neural networks (CNNs) have recently been applied to the optical flow estimation problem. As training the CNNs requires sufficiently large ground truth training data, existing approaches resort to synthetic, unrealistic datasets. On the other hand, unsupervised methods are capable of leveraging real-world videos for training where the ground truth flow fields are not available. These methods, however, rely on the fundamental assumptions of brightness constancy and spatial smoothness priors which do not hold near motion boundaries. In this paper, we propose to exploit unlabeled videos for semi-supervised learning of optical flow with a Generative Adversarial Network. Our key insight is that the adversarial loss can capture the structural patterns of flow warp errors without making explicit assumptions. Extensive experiments on benchmark datasets demonstrate that the proposed semi-supervised algorithm performs favorably against purely supervised and semi-supervised learning schemes.", "full_text": "Semi-Supervised Learning for Optical Flow\n\nwith Generative Adversarial Networks\n\nWei-Sheng Lai1\n1University of California, Merced\n\nJia-Bin Huang2\n\n2Virginia Tech\n\n1{wlai24|mhyang}@ucmerced.edu\n\n2jbhuang@vt.edu\n\nMing-Hsuan Yang1,3\n3Nvidia Research\n\nAbstract\n\nConvolutional neural networks (CNNs) have recently been applied to the optical\n\ufb02ow estimation problem. As training the CNNs requires suf\ufb01ciently large amounts\nof labeled data, existing approaches resort to synthetic, unrealistic datasets. On\nthe other hand, unsupervised methods are capable of leveraging real-world videos\nfor training where the ground truth \ufb02ow \ufb01elds are not available. 
These methods, however, rely on the fundamental assumptions of brightness constancy and spatial smoothness priors that do not hold near motion boundaries. In this paper, we propose to exploit unlabeled videos for semi-supervised learning of optical flow with a Generative Adversarial Network. Our key insight is that the adversarial loss can capture the structural patterns of flow warp errors without making explicit assumptions. Extensive experiments on benchmark datasets demonstrate that the proposed semi-supervised algorithm performs favorably against purely supervised and baseline semi-supervised learning schemes.

1 Introduction

Optical flow estimation is one of the fundamental problems in computer vision. The classical formulation builds upon the assumptions of brightness constancy and spatial smoothness [15, 25]. Recent advancements in this field include using sparse descriptor matching as guidance [4], leveraging dense correspondences from hierarchical features [2, 39], or adopting edge-preserving interpolation techniques [32]. Existing classical approaches, however, involve optimizing computationally expensive non-convex objective functions.
With the rapid growth of deep convolutional neural networks (CNNs), several approaches have been proposed to solve optical flow estimation in an end-to-end manner. Due to the lack of large-scale ground truth flow datasets of real-world scenes, existing approaches [8, 16, 30] rely on training on synthetic datasets. These synthetic datasets, however, do not reflect the complexity of realistic photometric effects, motion blur, illumination, occlusion, and natural image noise. Several recent methods [1, 40] propose to leverage real-world videos for training CNNs in an unsupervised setting (i.e., without using ground truth flow).
The main idea is to use loss functions measuring brightness constancy and spatial smoothness of flow fields as a proxy for losses using ground truth flow. However, the assumptions of brightness constancy and spatial smoothness often do not hold near motion boundaries. Despite the acceleration in computational speed, the performance of these approaches still does not match that of classical flow estimation algorithms.
With the limited quantity and unrealistic nature of ground truth flow, and the large amounts of real-world unlabeled data, it is thus of great interest to explore the semi-supervised learning framework. A straightforward approach is to minimize the End Point Error (EPE) loss for data with ground truth flow, and the loss functions that measure the classical brightness constancy and smoothness assumptions for unlabeled training images (Figure 1 (a)). However, we show that such an approach is sensitive to the choice of parameters and may sometimes decrease the accuracy of flow estimation. Prior work [1, 40] minimizes a robust loss function (e.g., the Charbonnier function) on the flow warp error

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

(a) Baseline semi-supervised learning

(b) The proposed semi-supervised learning

Figure 1: Semi-supervised learning for optical flow estimation. (a) A baseline semi-supervised algorithm utilizes the assumptions of brightness constancy and spatial smoothness to train a CNN from unlabeled data (e.g., [1, 40]). (b) We train a generative adversarial network to capture the structural patterns in flow warp error images without making any prior assumptions.

(i.e., the difference between the first input image and the warped second image) by modeling the brightness constancy with a Laplacian distribution.
As shown in Figure 2, although robust loss functions can fit the likelihood of the per-pixel flow warp error well, the spatial structure in the warp error images cannot be modeled by simple distributions. Such structural patterns often arise from occlusion and dis-occlusion caused by large object motion, where the brightness constancy assumption does not hold. A few approaches have been developed to cope with such a brightness inconsistency problem using the Fields-of-Experts (FoE) [37] or a Gaussian Mixture Model (GMM) [33]. However, the inference of optical flow then entails solving time-consuming optimization problems.
In this work, our goal is to leverage both the labeled and the unlabeled data without making explicit assumptions on brightness constancy and flow smoothness. Specifically, we propose to impose an adversarial loss [12] on the flow warp error image to replace the commonly used brightness constancy loss. We formulate optical flow estimation as a conditional Generative Adversarial Network (GAN) [12]. Our generator takes the input image pair and predicts the flow. We then compute the flow warp error image using a bilinear sampling layer. We learn a discriminator to distinguish between the flow warp errors computed from the predicted flow and from the ground truth optical flow fields. The adversarial training scheme encourages the generator to produce flow warp error images that are indistinguishable from those of the ground truth. The adversarial loss serves as a regularizer for both labeled and unlabeled data (Figure 1 (b)). With the adversarial training, our network learns to model the structural patterns of the flow warp error to refine the motion boundaries.
During the test phase, the generator can efficiently predict optical flow in one feed-forward pass.
We make the following three contributions:
• We propose a generative adversarial training framework to learn to predict optical flow by leveraging both labeled and unlabeled data in a semi-supervised learning framework.
• We develop a network to capture the spatial structure of the flow warp error without making primitive assumptions on brightness constancy or spatial smoothness.
• We demonstrate that the proposed semi-supervised flow estimation method outperforms the purely supervised and baseline semi-supervised learning when using the same amount of ground truth flow and network parameters.

2 Related Work

In the following, we discuss learning-based optical flow algorithms, CNN-based semi-supervised learning approaches, and generative adversarial networks within the context of this work.
Optical flow. Classical optical flow estimation approaches typically rely on the assumptions of brightness constancy and spatial smoothness [15, 25]. Sun et al. [36] provide a unified review of classical algorithms. Here we focus our discussion on recent learning-based methods in this field.
Learning-based methods aim to learn priors from natural image sequences without using hand-crafted assumptions. Sun et al. [37] assume that the flow warp error at each pixel is independent and use a set of linear filters to learn the brightness inconsistency. Rosenbaum and Weiss [33] use a GMM to learn the flow warp error at the patch level. The work of Rosenbaum et al.
[34] learns patch priors

Figure 2: Modeling the distribution of flow warp error (panels: input image 1, input image 2, ground truth optical flow, ground truth flow warp error, and the negative log likelihood plot). The robust loss functions, e.g., the Lorentzian or Charbonnier functions, can model the distribution of per-pixel flow warp error well. However, the spatial pattern resulting from large motion and occlusion cannot be captured by simple distributions.

to model the local flow statistics. These approaches incorporate the learned priors into the classical formulation and thus require solving time-consuming alternating optimization to infer the optical flow. Furthermore, the limited amount of training data (e.g., Middlebury [3] or Sintel [5]) may not fully demonstrate the capability of learning-based optical flow algorithms. In contrast, we train a deep CNN with large datasets (FlyingChairs [8] and KITTI [10]) in an end-to-end manner. Our model can predict flow efficiently in a single feed-forward pass.
The FlowNet [8] presents a deep CNN approach for learning optical flow. Even though the network is trained on a large dataset with ground truth flow, strong data augmentation and variational refinement are required. Ilg et al.
[16] extend the FlowNet by stacking multiple networks and using more training data with different motion types, including complex 3D motion and small displacements. To handle large motion, the SPyNet approach [30] estimates flow in a classical spatial pyramid framework by warping one of the input images and predicting the residual flow at each pyramid level.
A few attempts have recently been made to learn optical flow from unlabeled videos in an unsupervised manner. The USCNN method [1] approximates the brightness constancy with a Taylor series expansion and trains a deep network using the UCF101 dataset [35]. Yu et al. [40] enable the back-propagation of the warping function using the bilinear sampling layer from the spatial transformer network [18] and explicitly optimize the brightness constancy and spatial smoothness assumptions. While Yu et al. [40] demonstrate comparable performance with the FlowNet on the KITTI dataset, their method requires significantly more sophisticated data augmentation techniques and different parameter settings for each dataset. Our approach differs from these methods in that we use both labeled and unlabeled data to learn optical flow in a semi-supervised framework.
Semi-supervised learning. Several methods combine the classification objective with unsupervised reconstruction losses for image recognition [31, 41]. In low-level vision tasks, Kuznietsov et al. [21] train a deep CNN using sparse ground truth data for single-image depth estimation. This method optimizes a supervised loss for pixels with ground truth depth values as well as an unsupervised image alignment cost and a regularization cost. The image alignment cost resembles the brightness constancy, and the regularization cost enforces spatial smoothness on the predicted depth maps. We show that adopting a similar idea to combine the EPE loss with image reconstruction and smoothness losses may not improve flow accuracy.
Instead, we use the adversarial training scheme to learn to model the structural patterns of the flow warp error without making assumptions on images or flow.
Generative adversarial networks. The GAN framework [12] has been successfully applied to numerous problems, including image generation [7, 38], image inpainting [28], face completion [23], image super-resolution [22], semantic segmentation [24], and image-to-image translation [17, 42]. Within the scope of domain adaptation [9, 14], the discriminator learns to differentiate features from two different domains, e.g., synthetic and real images. Koziński et al. [20] adopt the adversarial training framework for semi-supervised learning on the image segmentation task, where the discriminator is trained to distinguish between the predictions produced from labeled and unlabeled data. Different from Koziński et al. [20], our discriminator learns to distinguish between the flow warp errors obtained using the ground truth flow and using the estimated flow. The generator thus learns to model the spatial structure of flow warp error images and can improve flow estimation accuracy around motion boundaries.

(Figure 2, right panel: negative log likelihood of the flow warp error under Gaussian, Laplacian, and Lorentzian penalties.)

3 Semi-Supervised Optical Flow Estimation

In this section, we describe the semi-supervised learning approach for optical flow estimation, the design methodology of the proposed generative adversarial network for learning the flow warp error, and the use of the adversarial loss to leverage labeled and unlabeled data.

3.1 Semi-supervised learning

We address the problem of learning optical flow by using both labeled data (i.e., with ground truth dense optical flow) and unlabeled data (i.e., raw videos). Given a pair of input images {I1, I2}, we train a deep network to generate the dense optical flow field f = [u, v].
For labeled data with the ground truth optical flow (denoted by $\hat{f} = [\hat{u}, \hat{v}]$), we optimize the EPE loss between the predicted and ground truth flow:

$L_{EPE}(f, \hat{f}) = \sqrt{(u - \hat{u})^2 + (v - \hat{v})^2}.$  (1)

For unlabeled data, existing work [40] makes use of the classical brightness constancy and spatial smoothness to define the image warping loss and the flow smoothness loss:

$L_{warp}(I_1, I_2, f) = \rho\left(I_1 - W(I_2, f)\right),$  (2)

$L_{smooth}(f) = \rho(\partial_x u) + \rho(\partial_y u) + \rho(\partial_x v) + \rho(\partial_y v),$  (3)

where $\partial_x$ and $\partial_y$ are the horizontal and vertical gradient operators and $\rho(\cdot)$ is a robust penalty function. The warping function $W(I_2, f)$ uses bilinear sampling [18] to warp $I_2$ according to the flow field $f$. The difference $I_1 - W(I_2, f)$ is the flow warp error shown in Figure 2. Minimizing $L_{warp}(I_1, I_2, f)$ enforces the flow warp error to be close to zero at every pixel.
A baseline semi-supervised learning approach is to minimize $L_{EPE}$ for labeled data and minimize $L_{warp}$ and $L_{smooth}$ for unlabeled data:

$\sum_{i \in D_l} L_{EPE}\left(f^{(i)}, \hat{f}^{(i)}\right) + \sum_{j \in D_u} \left( \lambda_w L_{warp}\left(I_1^{(j)}, I_2^{(j)}, f^{(j)}\right) + \lambda_s L_{smooth}\left(f^{(j)}\right) \right),$  (4)

where $D_l$ and $D_u$ represent the labeled and unlabeled datasets, respectively. However, the commonly used robust loss functions (e.g., Lorentzian and Charbonnier) assume that the error is independent at each pixel and thus cannot model the structural patterns of flow warp error caused by occlusion. Minimizing the combination of the supervised loss in (1) and the unsupervised losses in (2) and (3) may degrade the flow accuracy, especially when large motion is present in the input image pair.
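To make the losses in (1)-(3) concrete, they can be sketched in numpy as follows. This is a minimal illustration of the mathematical definitions only (the function names are ours), not the authors' Torch implementation; in an actual training pipeline these operations must be implemented as differentiable layers:

```python
import numpy as np

def epe(flow, flow_gt):
    """Average end-point error (Eq. 1): mean Euclidean distance
    between predicted and ground-truth flow vectors of shape (H, W, 2)."""
    return np.mean(np.sqrt(np.sum((flow - flow_gt) ** 2, axis=-1)))

def charbonnier(x, eps=1e-3):
    """Robust penalty rho(x) = sqrt(x^2 + eps^2), a smooth L1."""
    return np.sqrt(x ** 2 + eps ** 2)

def warp(img, flow):
    """Bilinearly sample a grayscale image (H, W) at pixel + flow,
    i.e. the warping function W(I2, f)."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    wx, wy = x - x0, y - y0
    top = img[y0, x0] * (1 - wx) + img[y0, x0 + 1] * wx
    bot = img[y0 + 1, x0] * (1 - wx) + img[y0 + 1, x0 + 1] * wx
    return top * (1 - wy) + bot * wy

def warp_loss(i1, i2, flow):
    """Eq. 2: robust penalty on the flow warp error I1 - W(I2, f)."""
    return charbonnier(i1 - warp(i2, flow)).mean()

def smooth_loss(flow):
    """Eq. 3: robust penalty on horizontal/vertical flow gradients."""
    return sum(charbonnier(np.diff(flow[..., c], axis=a)).mean()
               for c in (0, 1) for a in (0, 1))
```

With a zero flow field, `warp` returns the input image unchanged and both unsupervised losses reduce to the small Charbonnier offset, which matches the definitions above.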
As a result, instead of using the unsupervised losses based on classical assumptions, we propose to impose an adversarial loss on the flow warp error images within a generative adversarial network. We use the adversarial loss to regularize the flow estimation for both labeled and unlabeled data.

3.2 Adversarial training

Training a GAN involves optimizing two networks: a generator $G$ and a discriminator $D$. The generator $G$ takes a pair of input images to generate optical flow. The discriminator $D$ performs binary classification to distinguish whether a flow warp error image is produced by the estimated flow from the generator $G$ or by the ground truth flow. We denote the flow warp error images from the ground truth flow and the generated flow by $\hat{y} = I_1 - W(I_2, \hat{f})$ and $y = I_1 - W(I_2, f)$, respectively. The objective function to train the GAN can be expressed as:

$L_{adv}(y, \hat{y}) = E_{\hat{y}}[\log D(\hat{y})] + E_{y}[\log(1 - D(y))].$  (5)

We incorporate the adversarial loss with the supervised EPE loss and solve the following minimax problem for optimizing $G$ and $D$:

$\min_G \max_D \; L_{EPE}(G) + \lambda_{adv} L_{adv}(G, D),$  (6)

where $\lambda_{adv}$ controls the relative importance of the adversarial loss for optical flow estimation. Following the standard procedure for GAN training, we alternate between the following two steps to solve (6): (1) update the discriminator $D$ while holding the generator $G$ fixed, and (2) update the generator $G$ while holding the discriminator $D$ fixed.

(a) Update discriminator D using labeled data

(b) Update generator G using both labeled and unlabeled data

Figure 3: Adversarial training procedure. Training a generative adversarial network involves the alternating optimization of the discriminator D and the generator G.

Updating discriminator D.
We train the discriminator D to classify between the ground truth flow warp error (real samples, labeled as 1) and the flow warp error from the predicted flow (fake samples, labeled as 0). Maximizing (5) is equivalent to minimizing the binary cross-entropy loss $L_{BCE}(p, t) = -t \log(p) - (1 - t) \log(1 - p)$, where $p$ is the output of the discriminator and $t$ is the target label. The adversarial loss for updating $D$ is defined as:

$L^D_{adv}(y, \hat{y}) = L_{BCE}(D(\hat{y}), 1) + L_{BCE}(D(y), 0) = -\log D(\hat{y}) - \log(1 - D(y)).$  (7)

As the ground truth flow is required to train the discriminator, only the labeled data $D_l$ is involved in this step. By fixing $G$ in (6), we minimize the following loss function for updating $D$:

$\sum_{i \in D_l} L^D_{adv}\left(y^{(i)}, \hat{y}^{(i)}\right).$  (8)

Updating generator G. The goal of the generator is to "fool" the discriminator by producing flow that yields realistic flow warp error images. Optimizing (6) with respect to $G$ amounts to minimizing $\log(1 - D(y))$. As suggested by Goodfellow et al. [12], one can instead minimize $-\log(D(y))$ to speed up convergence. The adversarial loss for updating $G$ is then equivalent to the binary cross-entropy loss that assigns label 1 to the generated flow warp error $y$:

$L^G_{adv}(y) = L_{BCE}(D(y), 1) = -\log(D(y)).$  (9)

By combining the adversarial loss with the supervised EPE loss, we minimize the following function for updating $G$:

$\sum_{i \in D_l} \left( L_{EPE}\left(f^{(i)}, \hat{f}^{(i)}\right) + \lambda_{adv} L^G_{adv}\left(y^{(i)}\right) \right) + \sum_{j \in D_u} \lambda_{adv} L^G_{adv}\left(y^{(j)}\right).$  (10)

We note that the adversarial loss is computed for both labeled and unlabeled data, and thus guides the flow estimation for image pairs without the ground truth flow.
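The update rules in (7)-(10) amount to standard binary cross-entropy computations. A minimal numpy sketch of the two objectives is shown below; the discriminator outputs are passed in as probabilities, and the helper names are ours for illustration, not from the released code:

```python
import numpy as np

def bce(p, t, eps=1e-12):
    """Binary cross-entropy L_BCE(p, t) = -t log p - (1 - t) log(1 - p)."""
    p = np.clip(p, eps, 1 - eps)  # guard against log(0)
    return -t * np.log(p) - (1 - t) * np.log(1 - p)

def d_loss(d_real, d_fake):
    """Eq. 7: ground-truth flow warp errors are real (label 1),
    warp errors from the predicted flow are fake (label 0).
    Only labeled data contributes to this update."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def g_loss(epe_labeled, d_fake_labeled, d_fake_unlabeled, lam_adv=0.01):
    """Eq. 10: EPE on labeled pairs plus the non-saturating
    adversarial term -log D(y) (Eq. 9) on both labeled and
    unlabeled pairs, weighted by lambda_adv."""
    adv = bce(np.concatenate([d_fake_labeled, d_fake_unlabeled]), 1.0)
    return np.sum(epe_labeled) + lam_adv * np.sum(adv)
```

Note that `g_loss` assigns label 1 to the fake samples, which is the non-saturating generator objective from the text; unlabeled pairs contribute only through the adversarial term, exactly as in (10).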
Figure 3 illustrates the two main steps to update the generator G and the discriminator D in the proposed semi-supervised learning framework.

3.3 Network architecture and implementation details

Generator. We construct a 5-level SPyNet [30] as our generator. Instead of using simple stacks of convolutional layers as sub-networks [30], we choose an encoder-decoder architecture with skip connections to effectively increase the receptive fields. Each convolutional layer has a 3 × 3 spatial support and is followed by a ReLU activation. We present the details of our SPyNet architecture in the supplementary material.
Discriminator. As we aim to learn the local structure of the flow warp error at motion boundaries, it is more effective to penalize the structure at the scale of local patches instead of the whole image. Therefore, we use the PatchGAN [17] architecture as our discriminator. The PatchGAN is a fully convolutional classifier that classifies whether each N × N overlapping patch is real or fake. Our PatchGAN has a receptive field of 47 × 47 pixels.
Implementation details. We implement the proposed method using the Torch framework [6]. We use the Adam solver [19] to optimize both the generator and the discriminator with β1 = 0.9, β2 = 0.999, and a weight decay of 1e-4. We set the initial learning rate to 1e-4 and multiply it by 0.5 every 100k iterations after the first 200k iterations. We train the network for a total of 600k iterations. We use the FlyingChairs dataset [8] as the labeled dataset and the KITTI raw videos [10] as the unlabeled dataset.
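The 47 × 47 receptive field quoted above follows from standard receptive-field arithmetic over the stacked convolutions. The sketch below reproduces the three sizes reported in Table 2 under our assumption, made for illustration only, that the discriminator stacks d stride-2 3 × 3 convolutions followed by two stride-1 3 × 3 convolutions:

```python
def receptive_field(layers):
    """Receptive field of a stack of convolutions.
    layers: iterable of (kernel_size, stride), input to output."""
    rf, jump = 1, 1  # jump = input-pixel distance between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

def discriminator_rf(d):
    """Assumed layout (ours, for illustration): d stride-2 3x3 convs
    followed by two stride-1 3x3 convs."""
    return receptive_field([(3, 2)] * d + [(3, 1)] * 2)
```

With this layout, d = 2, 3, 4 give receptive fields of 23, 47, and 95 pixels, matching the three configurations evaluated in Table 2.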
In each mini-batch, we randomly sample 4 image pairs from each dataset. We randomly augment the training data in the following ways: (1) scaling between [1, 2]; (2) rotating within [-17°, 17°]; (3) adding Gaussian noise with a sigma of 0.1; (4) using color jitter with respect to brightness, contrast, and saturation uniformly sampled from [0, 0.04]. We then crop images to 384 × 384 patches and normalize them by the mean and standard deviation computed from the ImageNet dataset [13]. The source code is publicly available at http://vllab.ucmerced.edu/wlai24/semiFlowGAN.

4 Experimental Results

We evaluate the performance of optical flow estimation on five benchmark datasets. We conduct ablation studies to analyze the contributions of individual components and present comparisons with the state-of-the-art algorithms, including classical variational algorithms and CNN-based approaches.

4.1 Evaluated datasets and metrics

We evaluate the proposed optical flow estimation method on the benchmark datasets: MPI-Sintel [5], KITTI 2012 [11], KITTI 2015 [27], Middlebury [3], and the test set of FlyingChairs [8]. MPI-Sintel and FlyingChairs are synthetic datasets with dense ground truth flow. The Sintel dataset provides two rendered sets, Clean and Final, that contain both small displacements and large motion. The training and test sets contain 1041 and 552 image pairs, respectively. The FlyingChairs test set is composed of 640 image pairs with similar motion statistics to the training set. The Middlebury dataset has only eight image pairs with small motion. The images in the KITTI 2012 and KITTI 2015 datasets are collected from real-world driving scenes with large forward motion.
The ground truth optical flow is obtained from a 3D laser scanner and thus only covers about 50% of the image pixels. There are 194 image pairs in the KITTI 2012 dataset and 200 image pairs in the KITTI 2015 dataset. We compute the average EPE (1) on pixels with the ground truth flow available for each dataset. On the KITTI 2015 dataset, we also compute the Fl score [27], i.e., the ratio of pixels whose EPE is greater than 3 pixels and greater than 5% of the ground truth flow magnitude.

4.2 Ablation study

We conduct ablation studies to analyze the contributions of the adversarial loss and the proposed semi-supervised learning with different training schemes.
Adversarial loss. We adjust the weight of the adversarial loss λadv in (10) to validate the effect of the adversarial training. When λadv = 0, our method falls back to the fully supervised learning setting. We show the quantitative evaluation in Table 1. Using larger values of λadv may decrease the performance and cause visual artifacts, as shown in Figure 4. We therefore choose λadv = 0.01.

Table 1: Analysis on the adversarial loss. We train the proposed model using different weights for the adversarial loss in (10).

λadv | Sintel-Clean EPE | Sintel-Final EPE | KITTI 2012 EPE | KITTI 2015 EPE | KITTI 2015 Fl-all | FlyingChairs EPE
0    | 3.51 | 4.70 | 7.69  | 17.19 | 40.82% | 2.15
0.01 | 3.30 | 4.68 | 7.16  | 16.02 | 38.77% | 1.95
0.1  | 3.57 | 4.73 | 8.25  | 16.82 | 42.78% | 2.11
1    | 3.93 | 5.18 | 13.89 | 21.07 | 63.43% | 2.21

Table 2: Analysis on the receptive field of the discriminator.
We vary the number of strided convolutional layers in the discriminator to achieve different sizes of receptive fields.

# strided convolutions | Receptive field | Sintel-Clean EPE | Sintel-Final EPE | KITTI 2012 EPE | KITTI 2015 EPE | KITTI 2015 Fl-all | FlyingChairs EPE
d = 2 | 23 × 23 | 3.66 | 4.90 | 7.38 | 16.28 | 40.19% | 2.15
d = 3 | 47 × 47 | 3.30 | 4.68 | 7.16 | 16.02 | 38.77% | 1.95
d = 4 | 95 × 95 | 3.70 | 5.00 | 7.54 | 16.38 | 41.52% | 2.16

Receptive fields of the discriminator. The receptive field of the discriminator is equivalent to the size of the patches used for classification. The size of the receptive field is determined by the number of strided convolutional layers, denoted by d. We test three values, d = 2, 3, 4, which correspond to receptive fields of 23 × 23, 47 × 47, and 95 × 95, respectively. As shown in Table 2, the network with d = 3 performs favorably against the other choices on all benchmark datasets. Patches that are too large or too small might not capture the structure of the flow warp error well. Therefore, we design our discriminator to have a receptive field of 47 × 47 pixels.
Training schemes. We train the same network (i.e., our generator G) with the following training schemes: (a) Supervised: minimizing the EPE loss (1) on the FlyingChairs dataset. (b) Unsupervised: minimizing the classical brightness constancy (2) and spatial smoothness (3) losses using the Charbonnier penalty function on the KITTI raw dataset. (c) Baseline semi-supervised: minimizing the combination of supervised and unsupervised losses (4) on the FlyingChairs and KITTI raw datasets. For the semi-supervised setting, we evaluate different combinations of λw and λs in Table 3. We note that it is not easy to run a grid search to find the best parameter combination for all evaluated datasets.
We choose λw = 1 and λs = 0.01 for the baseline semi-supervised and unsupervised settings.
We provide the quantitative evaluation of the above training schemes in Table 4 and visual comparisons in Figures 5 and 6. As images in KITTI 2015 have large forward motion, there are large occluded/dis-occluded regions, particularly on the image and moving object boundaries. The brightness constancy assumption does not hold in these regions. Consequently, minimizing the image warping loss (2) results in inaccurate flow estimation. Compared to fully supervised learning, our method further refines the motion boundaries by modeling the flow warp error. By incorporating both labeled and unlabeled data in training, our method effectively reduces the EPE on the KITTI 2012 and 2015 datasets.
Training on partially labeled data. We further analyze the effect of the proposed semi-supervised method by reducing the amount of labeled training data. Specifically, we use 75%, 50% and 25%

Figure 4: Comparisons of the adversarial loss weight λadv (panels: input images, ground truth flow, and results for λadv = 0, 0.01, 0.1, 1). Using a larger value of λadv does not necessarily improve the performance and may cause unwanted visual artifacts.

Table 3: Evaluation of the baseline semi-supervised setting. We test different combinations of λw and λs in (4). We note that it is difficult to find the best parameters for all evaluated datasets.

λw   | λs   | Sintel-Clean EPE | Sintel-Final EPE | KITTI 2012 EPE | KITTI 2015 EPE | KITTI 2015 Fl-all | FlyingChairs EPE
1    | 0    | 3.77 | 5.02 | 10.90 | 18.52 | 39.94% | 2.25
1    | 0.1  | 3.75 | 5.05 | 11.82 | 19.98 | 43.18% | 2.19
1    | 0.01 | 3.69 | 4.86 | 10.38 | 18.07 | 39.33% | 2.11
0.1  | 0.01 | 3.64 | 4.81 | 10.15 | 18.94 | 40.85% | 2.17
0.01 | 0.01 | 3.57 | 4.82 | 8.63  | 18.87 | 42.63% | 2.22

Table 4: Analysis on different training schemes.
“Chairs” represents the FlyingChairs dataset and “KITTI” denotes the KITTI raw dataset. The baseline semi-supervised setting cannot improve the flow accuracy, as the brightness constancy assumption does not hold in occluded regions. In contrast, our approach effectively utilizes the unlabeled data to improve the performance.

Method                   | Training datasets | Sintel-Clean EPE | Sintel-Final EPE | KITTI 2012 EPE | KITTI 2015 EPE | KITTI 2015 Fl | FlyingChairs EPE
Supervised               | Chairs            | 3.51 | 4.70 | 7.69  | 17.19 | 40.82% | 2.15
Unsupervised             | KITTI             | 8.01 | 8.97 | 16.54 | 25.53 | 54.40% | 6.66
Baseline semi-supervised | Chairs + KITTI    | 3.69 | 4.86 | 10.38 | 18.07 | 39.33% | 2.11
Proposed semi-supervised | Chairs + KITTI    | 3.30 | 4.68 | 7.16  | 16.02 | 38.77% | 1.95

of labeled data with ground truth flow from the FlyingChairs dataset and treat the remaining part as unlabeled data to train the proposed semi-supervised method. We also train the purely supervised method with the same amount of labeled data for comparison. Table 5 shows that the proposed semi-supervised method consistently outperforms the purely supervised method on the Sintel, KITTI 2012, and KITTI 2015 datasets. The performance gap becomes larger when less labeled data is used, which demonstrates the capability of the proposed method in utilizing the unlabeled data.

4.3 Comparisons with the state-of-the-arts

In Table 6, we compare the proposed algorithm with four variational methods: EpicFlow [32], DeepFlow [39], LDOF [4], and FlowField [2], and four CNN-based algorithms: FlowNetS [8], FlowNetC [8], SPyNet [30], and FlowNet 2.0 [16]. We further fine-tune our model on the Sintel training set (denoted by “+ft”) and compare with the fine-tuned results of FlowNetS, FlowNetC, SPyNet, and FlowNet2.
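For reference, the Fl score reported throughout these tables is the outlier ratio defined in Section 4.1: the fraction of (valid) pixels whose EPE exceeds both 3 pixels and 5% of the ground truth flow magnitude. A minimal numpy sketch (the function name is ours, not from the official KITTI development kit):

```python
import numpy as np

def fl_score(flow, flow_gt, valid=None):
    """KITTI 2015 Fl outlier ratio: fraction of valid pixels whose
    end-point error is > 3 px AND > 5% of the ground-truth magnitude.
    flow, flow_gt: (H, W, 2); valid: optional boolean mask (H, W),
    since the sparse laser ground truth covers only ~50% of pixels."""
    epe = np.linalg.norm(flow - flow_gt, axis=-1)
    mag = np.linalg.norm(flow_gt, axis=-1)
    outlier = (epe > 3.0) & (epe > 0.05 * mag)
    if valid is not None:
        outlier = outlier[valid]
    return outlier.mean()
```

For example, with a ground truth flow of magnitude 100, a 4-pixel error is not counted as an outlier (it exceeds 3 px but not 5% of the magnitude), while a 6-pixel error is.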
We note that SPyNet+ft is also fine-tuned on the Driving dataset [26] for evaluation on the KITTI 2012 and KITTI 2015 datasets, while the other methods are fine-tuned on the Sintel training data. FlowNet 2.0 has significantly more network parameters and uses more training datasets (e.g., FlyingThings3D [26]) to achieve the state-of-the-art performance. We show that our model achieves competitive performance with FlowNet and SPyNet when using the same amount of ground truth flow (i.e., the FlyingChairs and Sintel datasets). We present more qualitative comparisons with the state-of-the-art methods in the supplementary material.

4.4 Limitations

As the images in the KITTI raw dataset are captured in driving scenes and have a strong prior of forward camera motion, the gain of our semi-supervised learning over the supervised setting is mainly on the KITTI 2012 and 2015 datasets. In contrast, the Sintel dataset typically has moving objects with various types of motion. Exploring different types of video datasets, e.g., UCF101 [35] or DAVIS [29], as the source of unlabeled data in our semi-supervised learning framework is a promising future direction to improve the accuracy on general scenes.

Table 5: Training on partially labeled data. We use 75%, 50%, and 25% of the data with ground truth flow from the FlyingChairs dataset as labeled data and treat the remaining part as unlabeled data.
The proposed semi-supervised method consistently outperforms the purely supervised method.

| Method | Amount of labeled data | Sintel-Clean EPE | Sintel-Final EPE | KITTI 2012 EPE | KITTI 2015 EPE | KITTI 2015 Fl-all | FlyingChairs EPE |
|---|---|---|---|---|---|---|---|
| Supervised | 75% | 4.35 | 5.40 | 8.22 | 17.43 | 41.62% | 1.96 |
| Proposed semi-supervised | 75% | 3.58 | 4.81 | 7.30 | 16.46 | 41.00% | 2.20 |
| Supervised | 50% | 4.48 | 5.46 | 9.34 | 18.71 | 42.14% | 2.04 |
| Proposed semi-supervised | 50% | 3.67 | 4.92 | 7.39 | 16.64 | 40.48% | 2.28 |
| Supervised | 25% | 4.91 | 5.78 | 10.60 | 19.90 | 43.79% | 2.09 |
| Proposed semi-supervised | 25% | 3.95 | 5.00 | 7.40 | 16.61 | 40.68% | 2.33 |

Figure 5: Comparisons of training schemes. (Panels: input images, ground truth flow, and flow estimated under the unsupervised, supervised, baseline semi-supervised, and proposed semi-supervised settings.) The proposed method learns the flow warp error through adversarial training and improves the flow accuracy near motion boundaries.

Figure 6: Comparisons of flow warp error. (Panels: ground truth, baseline semi-supervised, and proposed semi-supervised.) The baseline semi-supervised approach penalizes the flow warp error in occluded regions and thus produces inaccurate flow.

Table 6: Comparisons with the state of the art.
We report the average EPE on six benchmark datasets and the Fl score on the KITTI 2015 dataset. Values in parentheses are results on data that the corresponding model has been fine-tuned on.

| Method | Middlebury Train EPE | Sintel-Clean Train EPE | Sintel-Clean Test EPE | Sintel-Final Train EPE | Sintel-Final Test EPE | KITTI 2012 Train EPE | KITTI 2012 Test EPE | KITTI 2015 Train EPE | KITTI 2015 Train Fl-all | KITTI 2015 Test Fl-all | Chairs Test EPE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| EpicFlow [32] | 0.31 | 2.27 | 4.12 | 3.57 | 6.29 | 3.47 | 3.8 | 9.27 | 27.18% | 27.10% | 2.94 |
| DeepFlow [39] | 0.25 | 2.66 | 5.38 | 4.40 | 7.21 | 4.58 | 5.8 | 10.63 | 26.52% | 29.18% | 3.53 |
| LDOF [4] | 0.44 | 4.64 | 7.56 | 5.96 | 9.12 | 10.94 | 12.4 | 18.19 | 38.11% | - | 3.47 |
| FlowField [2] | 0.27 | 1.86 | 3.75 | 3.06 | 5.81 | 3.33 | 3.5 | 8.33 | 24.43% | - | - |
| FlowNetS [8] | 1.09 | 4.50 | 7.42 | 5.45 | 8.43 | 8.26 | - | 15.44 | 52.86% | - | 2.71 |
| FlowNetC [8] | 1.15 | 4.31 | 7.28 | 5.87 | 8.81 | 9.35 | - | 12.52 | 47.93% | - | 2.19 |
| SPyNet [30] | 0.33 | 4.12 | 6.69 | 5.57 | 8.43 | 9.12 | - | 20.56 | 44.78% | - | 2.63 |
| FlowNet2 [16] | 0.35 | 2.02 | 3.96 | 3.14 | 6.02 | 4.09 | - | 10.06 | 30.37% | - | 1.68 |
| FlowNetS + ft [8] | 0.98 | (3.66) | 6.96 | (4.44) | 7.76 | 7.52 | 9.10 | - | - | - | 3.04 |
| FlowNetC + ft [8] | 0.93 | (3.78) | 6.85 | (5.28) | 8.51 | 8.79 | - | - | - | - | 2.27 |
| SPyNet + ft [30] | 0.33 | (3.17) | 6.64 | (4.32) | 8.36 | 4.13 | 4.7 | - | - | - | 3.07 |
| FlowNet2 + ft [16] | 0.35 | (1.45) | 4.16 | (2.01) | 5.74 | 3.61 | - | 9.84 | 28.20% | - | - |
| Ours | 0.37 | 3.30 | 6.28 | 4.68 | 7.61 | 7.16 | 7.5 | 16.02 | 38.77% | 39.71% | 1.95 |
| Ours + ft | 0.32 | (2.41) | 6.27 | (3.16) | 7.31 | 5.23 | 6.8 | 14.69 | 30.30% | 31.01% | 2.41 |

5 Conclusions
In this work, we propose a generative adversarial network for learning optical flow in a semi-supervised manner. We use a discriminative network and an adversarial loss to learn the structural patterns of the flow warp error without making assumptions of brightness constancy and spatial smoothness. The adversarial loss serves as guidance for estimating optical flow from both labeled and unlabeled datasets.
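The flow warp error at the heart of this formulation can be illustrated with a minimal sketch: warp the second frame toward the first using the estimated flow, then take the absolute brightness difference. This NumPy/SciPy sketch assumes grayscale frames and bilinear sampling; the function name is ours, not from the paper.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def flow_warp_error(img1, img2, flow):
    # Warp img2 toward img1 using the estimated flow (bilinear
    # sampling), then take the absolute brightness difference.
    # img1, img2: H x W grayscale arrays; flow: H x W x 2 with
    # (u, v) = (horizontal, vertical) displacement in pixels.
    h, w = img1.shape
    rows, cols = np.mgrid[0:h, 0:w].astype(np.float64)
    coords = np.stack([rows + flow[..., 1], cols + flow[..., 0]])
    warped = map_coordinates(img2, coords, order=1, mode='nearest')
    return np.abs(img1 - warped)
```

For an accurate flow on non-occluded pixels, this error map is close to zero; large residuals concentrate near motion boundaries and occlusions, which is the structural pattern the discriminator learns.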
Extensive evaluations on benchmark datasets validate the effect of the adversarial loss and demonstrate that the proposed method performs favorably against purely supervised and baseline semi-supervised learning approaches for learning optical flow.

Acknowledgement
This work is supported in part by NSF CAREER Grant #1149783 and gifts from Adobe and NVIDIA.

References
[1] A. Ahmadi and I. Patras. Unsupervised convolutional neural networks for motion estimation. In ICIP, 2016.
[2] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In ICCV, 2015.
[3] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. IJCV, 92(1):1–31, 2011.
[4] T. Brox and J. Malik. Large displacement optical flow: Descriptor matching in variational motion estimation. TPAMI, 33(3):500–513, 2011.
[5] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012.
[6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[7] E. L. Denton, S. Chintala, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[8] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015.
[9] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17(59):1–35, 2016.
[10] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun.
Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[11] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] J. Hoffman, D. Wang, F. Yu, and T. Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv, 2016.
[15] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[16] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[17] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[18] M. Jaderberg, K. Simonyan, and A. Zisserman. Spatial transformer networks. In NIPS, 2015.
[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[20] M. Koziński, L. Simon, and F. Jurie. An adversarial regularisation for semi-supervised training of structured output neural networks. arXiv, 2017.
[21] Y. Kuznietsov, J. Stückler, and B. Leibe. Semi-supervised deep learning for monocular depth map prediction. In CVPR, 2017.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[23] Y. Li, S. Liu, J. Yang, and M.-H. Yang. Generative face completion. In CVPR, 2017.
[24] P. Luc, C. Couprie, S. Chintala, and J. Verbeek.
Semantic segmentation using adversarial networks. In NIPS Workshop on Adversarial Training, 2016.
[25] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In International Joint Conference on Artificial Intelligence, 1981.
[26] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016.
[27] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
[28] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[29] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
[30] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In CVPR, 2017.
[31] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks. In NIPS, 2015.
[32] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In CVPR, 2015.
[33] D. Rosenbaum and Y. Weiss. Beyond brightness constancy: Learning noise models for optical flow. arXiv, 2016.
[34] D. Rosenbaum, D. Zoran, and Y. Weiss. Learning the local statistics of optical flow. In NIPS, 2013.
[35] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
[36] D. Sun, S. Roth, and M. J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. IJCV, 106(2):115–137, 2014.
[37] D. Sun, S. Roth, J. Lewis, and M. Black.
Learning optical flow. In ECCV, 2008.
[38] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In ECCV, 2016.
[39] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
[40] J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV Workshops, 2016.
[41] Y. Zhang, K. Lee, and H. Lee. Augmenting supervised neural networks with unsupervised objectives for large-scale image classification. In ICML, 2016.
[42] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.