{"title": "Joint-task Self-supervised Learning for Temporal Correspondence", "book": "Advances in Neural Information Processing Systems", "page_first": 318, "page_last": 328, "abstract": "This paper proposes to learn reliable dense correspondence from videos in a self-supervised manner. Our learning process integrates two highly related tasks: tracking large image regions and establishing fine-grained pixel-level associations between consecutive video frames. We exploit the synergy between both tasks through a shared inter-frame affinity matrix, which simultaneously models transitions between video frames at both the region- and pixel-levels. While region-level localization helps reduce ambiguities in fine-grained matching by narrowing down search regions; fine-grained matching provides bottom-up features to facilitate region-level localization. Our method outperforms the state-of-the-art self-supervised methods on a variety of visual correspondence tasks, including video-object and part-segmentation propagation, keypoint tracking, and object tracking. Our self-supervised method even surpasses the fully-supervised affinity feature representation obtained from a ResNet-18 pre-trained on the ImageNet.", "full_text": "Joint-task Self-supervised Learning\n\nfor Temporal Correspondence\n\nXueting Li1\u21e4, Sifei Liu2\u21e4, Shalini De Mello2, Xiaolong Wang3, Jan Kautz2, Ming-Hsuan Yang1\n\n1University of California, Merced, 2NVIDIA, 3 Carnegie Mellon University\n\nAbstract\n\nThis paper proposes to learn reliable dense correspondence from videos in a\nself-supervised manner. Our learning process integrates two highly related tasks:\ntracking large image regions and establishing \ufb01ne-grained pixel-level associa-\ntions between consecutive video frames. We exploit the synergy between both\ntasks through a shared inter-frame af\ufb01nity matrix, which simultaneously mod-\nels transitions between video frames at both the region- and pixel-levels. 
While\nregion-level localization helps reduce ambiguities in \ufb01ne-grained matching by\nnarrowing down search regions; \ufb01ne-grained matching provides bottom-up features\nto facilitate region-level localization. Our method outperforms the state-of-the-art\nself-supervised methods on a variety of visual correspondence tasks, including\nvideo-object and part-segmentation propagation, keypoint tracking, and object\ntracking. Our self-supervised method even surpasses the fully-supervised af\ufb01nity\nfeature representation obtained from a ResNet-18 pre-trained on the ImageNet.\nThe project website can be found at https://sites.google.com/view/uvc2019/.\n\nIntroduction\n\n1\nLearning representations for visual correspondence is a fundamental problem that is closely related\nto a variety of vision tasks: correspondences between multi-view images relate 2D and 3D represen-\ntations, and those between frames link static images to dynamic scenes. To learn correspondences\nacross frames in a video, numerous methods have been developed from two perspectives: (a) learning\nregion/object-level correspondences, via object tracking [2, 42, 44, 37, 49] or (b) learning pixel-level\ncorrespondences between multi-view images or frames, e.g., via stereo matching [35] or optical \ufb02ow\nestimation [29, 41, 16, 31].\nHowever, most methods address one or the other problem and signi\ufb01cantly less effort has been\nmade to solve both of them together. The main reason is that methods designed to address either of\nthem optimize different goals. Object tracking focuses on learning object representations that are\ninvariant to viewpoint and deformation changes, while learning pixel-level correspondence focuses on\nmodeling detailed changes within an object over time. Subsequently, the existing supervised methods\nfor these two problems often use different annotations. 
For example, bounding boxes are annotated in\nreal videos for object tracking [53]; and pixel-wise associations are generated from synthesized data\nfor optical flow estimation [4, 10]. Datasets with annotations for both tasks are scarcely available and\nsupervision, here, is a further bottleneck preventing us from connecting the two tasks.\n\nEqual contribution: Xueting Li and Sifei Liu.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Our method (c) compared against (a) region-level matching (e.g., object tracking), and (b) pixel-level\nmatching, e.g., matching by colorization [45]. We propose a joint-task framework which conducts region-level\nand fine-grained matching simultaneously, both supported by a single inter-frame affinity matrix A.\nDuring training, the two tasks improve each other progressively. To illustrate this, we unroll two training\niterations and illustrate the improvement with the red box and arrow.\n\nIn this paper, we demonstrate that these two tasks inherently require the same operation of learning\nan inter-frame transformation that associates the contents of two images. We show that the two\ntasks benefit greatly by modeling them jointly via a single transformation operation which can\nsimultaneously match regions and pixels. To overcome the lack of data with annotations for both\ntasks we exploit self-supervision via the signals of (a) Temporal Coherency, which states that objects\nor scenes move smoothly and gradually over time; (b) Cycle Consistency, which states that correct correspondences\nshould ensure that pixels or regions match bi-directionally; and (c) Energy Preservation, which\npreserves the energy of feature representations during transformations. 
Since all these supervisory\nsignals naturally exist in videos and are task-agnostic, the transformation that we learn through them\ncan generalize well to any video without restriction on domain or object category.\nOur key idea is to learn a single af\ufb01nity matrix for modeling all inter-frame transformations through\na network that learns appropriate feature representations that model the af\ufb01nity. We show that\nregion localization and \ufb01ne-grained matching can be carried out by sharing the af\ufb01nity in a fully\ndifferentiable manner: the region localization module \ufb01nds a pair of patches with matching parts\nin the two frames (Figure 1, mid-top), and the \ufb01ne-grained module reconstructs the color feature\nby transforming it between the patches (Figure 1, mid-bottom), all through the same af\ufb01nity matrix.\nThese two tasks symbiotically facilitate each other: the \ufb01ne-grained matching module learns better\nfeature representations that lead to an improved af\ufb01nity matrix, which in turn generates better\nlocalization that reduces the search space and ambiguities for \ufb01ne-grained matching (Figure 1, right).\nThe contributions of this work are summarized as: (a) A joint-task self-supervision network is\nintroduced to \ufb01nd accurate correspondences at different levels across video frames. (b) A general\ninter-frame transformation is proposed to support both tasks and to satisfy various video constraints \u2013\ncoherency, cycle, and energy consistency. (c) Our method outperforms state-of-the-art methods on a\nvariety of visual correspondence tasks, e.g., video instance and part segmentation, keypoints tracking,\nand object tracking. 
Our self-supervised method even surpasses the fully-supervised af\ufb01nity feature\nrepresentation obtained from a ResNet-18 pre-trained on the ImageNet [9].\n2 Related Work\nLearning correspondence in time is widely explored in visual tracking [2, 42, 44, 37, 49] and optical\n\ufb02ow estimation [41, 29, 16]. Existing models are mainly trained on large annotated datasets, which\nrequire signi\ufb01cant efforts. To overcome the limit of annotations, numerous methods have been\ndeveloped to learn correspondences in a self-supervised manner [46, 52, 45]. Our work establishes\non learning correspondence with self-supervision, and we discuss the most related methods here.\nObject-level correspondence. The goal of visual tracking is to determine a bounding box in each\nframe based on an annotated box in the reference image. Most methods belong to one of the two\ncategories that use: (a) the tracking-by-detection framework [1, 20, 47, 25], which models tracking as\ndetection applied independently to individual frames; or (b) the tracking-by-matching framework that\nmodels cross-frame relations and includes several early attempts, e.g., mean-shift trackers [8, 55],\nkernelized correlation \ufb01lters (KCF) [14, 27], and several works that model correlation \ufb01lters as\ndifferentiable blocks [32, 33, 7, 48]. Most of these methods use annotated bounding boxes [53] in\nevery frame of the videos to learn feature representations for tracking. Our work can be viewed as\nexploiting the tracking-by-matching framework in a self-supervised manner.\nFine-grained correspondence. Dense correspondence between video frames has been widely\napplied for optical \ufb02ow and motion estimation [31, 41, 29, 16], where the goal is to track individual\npixels. Most deep neural networks [16, 41] are trained with the objective of regressing the ground-\ntruth optical \ufb02ow produced by synthetic datasets [4, 10]. 
In contrast to many classic methods [31, 29]\nthat model dense correspondence as a matching problem, direct regression of pixel offsets has limited\ncapability for frames containing dramatic appearance changes [3, 40], and suffers from problems\nrelated to domain shift when applied to real-world scenarios.\n\nFigure 2: Main steps of the proposed method (panels: region-level localization, fine-grained matching; legend: \u2a02 matrix multiplication, gradient flow, data flow). Blue grids represent the reference-patch $p_1$\u2019s and target-frame $f_2$\u2019s\nfeature maps that are shared by the region-level localization (left box) and fine-grained matching (right box)\nmodules. $A_{pf}$ is the affinity between $p_1$ and $f_2$, and $A_{pp}$ is that between $p_1$ and $p_2$. $p_2$ is a differentiable\ncrop from the frame $f_2$. The maps $l_x$ and $l_y$ are the coordinates of pixels on a regular grid. All modules are\ndifferentiable, where the gradient flow is visualized via the red dashed arrows.\n\nSelf-supervised learning. Recently, numerous approaches have been developed for correspondence\nlearning via various self-supervised signals, including image [17] or color transformation [45] and\ncycle-consistency [52, 46]. Self-supervised learning of correspondence in videos has been explored\nalong the two different directions \u2013 for region-level localization [52, 46] and for fine-grained pixel-level\nmatching [45, 23]. In [46], a correlation filter is learned to track regions via a cycle-consistency\nconstraint, and no pixel-level correspondence is determined. [52] develops patch-level tracking by\nmodeling the similarity transformation of pixels within a fixed rectangular region. Conversely, several\nmethods learn a matching network by transforming color/RGB information between adjacent frames\n[45, 24, 23]. 
As no region-level regularization is exploited, these approaches are less effective when\ncolor features are less distinctive (see Figure 1(b)). In contrast, our method learns object-level and\npixel-level correspondence jointly across video frames in a self-supervised manner.\n3 Approach\nVideo frames are temporally coherent in nature. For a pair of adjacent frames, pixels in a later frame\ncan be considered as being copied from some locations of an earlier one with slight appearance\nchanges conforming to object motion. This \u201ccopy\u201d operator can be expressed via a linear transformation\nwith a matrix A, in which $A_{ij} = 1$ denotes that the pixel j in the second frame is copied from\npixel i in the first one. An approximation of A is the inter-frame affinity matrix [44, 30, 52]:\n\n$$A_{ij} = \\kappa(f_{1i}, f_{2j}) \\qquad (1)$$\n\nwhere $\\kappa$ denotes some similarity function. Each entry $A_{ij}$ represents the similarity of subspace pixels\ni and j in the two frames $f_1 \\in \\mathbb{R}^{C \\times N_1}$ and $f_2 \\in \\mathbb{R}^{C \\times N_2}$, where $f \\in \\mathbb{R}^{C \\times N}$ is a vectorized feature\nmap with C channels and N pixels. In this work, our goal is to learn the feature embedding f that\noptimally associates the contents of the two frames.\nOne free supervisory signal that we can utilize is color. To learn the inter-frame transformation in\na self-supervised manner, we can slightly modify (1) to generate the affinity via features f learned\nonly from gray-scale images. The learned affinity is then utilized to map the color channels from one\nframe to another [45, 30], while using the ground-truth color as the self-supervisory signal.\nOne strict assumption of this formulation is that the paired frames need to have the same contents \u2013\nno new object or scene pixel should emerge over time. Hence, the existing methods [45, 30] sample\npairs of frames either uniformly, or randomly within a specified interval, e.g., 50 frames. 
However,\nit is difficult to determine a \u201cperfect\u201d interval as video contents may change sporadically. When\ntransforming color from a reference frame to a target one, the objects/scene pixels in the target frame\nmay not exist in the reference frame, thereby leading to wrong matches and an adverse effect on\nfeature learning. Another issue is that a large portion of the video frames are \u201cstatic\u201d, in which the\nsampled pair of frames are almost the same and cause the learned affinity to be an identity matrix.\nWe show that the above problems can be addressed by incorporating a region-level localization\nmodule. Given a pair of reference and target frames, we first randomly sample a patch in the reference\nframe and localize this patch in the target frame (see Figure 2). The inter-frame color transformation is\nthen estimated between the paired patches. Both localization and color transformation are supported\nby a single affinity derived from a convolutional neural network (CNN), based on the fact that the\naffinity matrix can simultaneously track locations and transform features, as discussed in this section.\n3.1 Transforming Feature and Location via Affinity\nWe sample a pair of frames and denote the 1st frame as the reference and the 2nd one as the target. The\nCNN can be any effective model, e.g., ResNet-18 [13] with the first 4 blocks that takes a gray-scale\nimage as input. We compute the affinity and conduct the feature transformation and localization on\nthe top layer of the CNN, with features that are one-eighth the size of the input image. This ensures\nthat the affinity matrix is memory efficient and that each pixel in the feature space contains considerable\nlocal contextual information.\nTransforming feature representations. 
We adopt the dot product for $\\kappa$ in (1) to compute the\naffinity, where each column can be interpreted as the similarity score between a point in the target\nframe and all points in the reference frame. For dense correspondence, the inter-frame affinity needs\nto be sparse to ensure one-to-one mapping. However, it is challenging to model a sparse matrix in\na deep neural network. We relax this constraint and encourage the affinity matrix to be sparse by\nnormalizing each column with the softmax function, so that the similarity score distribution can be\npeaky and only a few pixels with high similarity in the reference frame are matched to each point in\nthe target frame:\n\n$$A_{ij} = \\frac{\\exp(f_{1i}^\\top f_{2j})}{\\sum_k \\exp(f_{1k}^\\top f_{2j})}, \\quad \\forall i \\in [1, N_1], j \\in [1, N_2] \\qquad (2)$$\n\nwhere the variable definitions follow (1). The transformation is carried out as $\\hat{c}_2 = c_1 A$, where\n$A \\in \\mathbb{R}^{N_1 \\times N_2}$, and $c_i$ has the same number of entries as $f_i$ and can be features of the reference frame\nor any associated label, e.g., color, segmentation mask or keypoint heatmap.\nTracing pixel locations. We denote $l_j = (x_j, y_j)$, $l \\in \\mathbb{R}^{2 \\times N}$ as the vectorized location map for an\nimage/feature with N pixels. Given a sparse affinity matrix, the location of an individual pixel can be\ntraced from a reference frame to an adjacent target frame:\n\n$$l_j^{12} = \\sum_{k=1}^{N_1} l_k^{11} A_{kj}, \\quad \\forall j \\in [1, N_2] \\qquad (3)$$\n\nwhere $l_j^{mn}$ represents the coordinate in frame m that transits to the jth pixel in frame n. Note that\n$l^{nn}$ (e.g., $l^{11}$ in (3)) usually represents a canonical grid as shown in Figure 3.\n3.2 Region-level Localization\nIn the target frame, region-level localization aims to localize a patch randomly selected from the\nreference frame by predicting a bounding box (denoted as \u201cbbox\u201d) on a region that shares matching\nparts with the selected patch.\nIn other words, it is a differentiable region of interest (ROI) with\nlearnable center and scale. 
We compute an $N_1 \\times N_2$ affinity $A_{pf}$ according to (2) between feature\nrepresentations of the patch in the reference frame, and that of the whole target frame (see Figure 2(a)).\nLocating the center. To track the center position of the reference patch in the target frame, we first\nlocalize each individual pixel of the reference patch $p_1$ in the target frame $f_2$, according to (3). As\nwe obtain the set $l^{21}$, with the same number of entries as $p_1$, that collects the coordinates of the most\nsimilar pixels in $f_2$, we can compute the average coordinate $C^{21} = \\frac{1}{N_1} \\sum_{i=1}^{N_1} l_i^{21}$ of all the points, as\nthe estimated new position of the reference patch.\nScale modeling. For region-level tracking, the reference patch may undergo significant scale\nchanges. Scale estimation in object tracking is challenging and existing methods mainly enumerate\npossible scales [2, 46] and select the optimal one. In contrast, the scale can be estimated by our\nproposed model. We assume that the transformed locations $l^{21}$ are still distributed uniformly in a\nlocal rectangular region. By denoting w as the width of the new bounding box, the scale is estimated\nby:\n\n$$\\hat{w} = \\frac{2}{N_1} \\sum_{i=1}^{N_1} \\| x_i - C^{21}(x) \\|_1 \\qquad (4)$$\n\nwhere $x_i$ is the x-coordinate of the ith entry in $l^{21}$. We note that (4) can be proved by using\nthe analogous continuous space. Suppose there is a rectangle with scale (2w, 2h) and with its center\nlocated at the origin of a 2D coordinate plane. By integrating points inside of it, we have:\n\n$$\\frac{1}{w} \\int_{-w}^{w} \\| x \\|_1 \\, dx = \\frac{2}{w} \\int_{0}^{w} x \\, dx = w \\qquad (5)$$\n\nThis represents the average absolute distances w.r.t. the center when transforming to the discrete\nspace. The estimation of height is conducted in the same manner.\nMoving as a unit. An important assumption in the aforementioned ROI estimation in the target\nframe is that the pixels from the reference patch should move in unison \u2013 this is true in most videos,\nas an object or its parts typically move as one unit at the region level. We enforce this constraint\nwith a concentration regularization [58, 15] term on the transformed pixels, with a truncated loss that\npenalizes points for moving too far away from the center:\n\n$$L_c = \\begin{cases} 0, & \\| l_j^{12}(x) - C^{12}(x) \\|_1 \\leq w \\text{ and } \\| l_j^{12}(y) - C^{12}(y) \\|_1 \\leq h \\\\\\\\ \\frac{1}{N_2} \\sum_{j=1}^{N_2} \\| l_j^{12} - C^{12} \\|_2, & \\text{otherwise} \\end{cases} \\qquad (6)$$\n\nThis formulation encourages all the tracked pixels, originally from a patch, to be concentrated (see\nFigure 3) rather than being dispersed to other objects, which is likely to happen for methods that are\nbased on pixel-wise matching only, e.g., when matching by color reconstruction, pixels of different\nobjects having similar colors may match each other, as shown in Figure 1(b).\n\n3.3 Fine-grained Matching\n\nFine-grained matching aims to reconstruct the\ncolor information of the located patch in the\ntarget frame, given the reference patch (see Figure 1).\nWe re-use the inter-frame affinity $A_{pf}$ by\nextracting a sub-affinity matrix $A_{pp}$ containing\nthe columns corresponding to the located pixels\nin the target frame, and by using it for the color\ntransformation described in the formulations in\nSection 3.1. To make the color feature compatible\nwith the affinity matrix, we train an auto-encoder that learns to reconstruct an image in the Lab\nspace faithfully (see the encoder E and the decoder D in Figure 2). This network also encodes global\ncontextual information from color channels. We show that using the color feature instead of pixels\nsignificantly reduces the errors caused by reconstructing color directly in the image space [45] (see\nTable 1, ours vs. [45]). In the following, we introduce self-supervisory signals as regularization\nfor fine-grained matching. 
For brevity, we denote A as the sub-affinity, l and f as the vectorized\ncoordinate and feature map, respectively, for the paired patches.\n\nFigure 3: Concentration (left) and orthogonal (right)\nregularization. The dots denote pixels in feature space.\nThe orange arrows show how they push the pixels.\n\nOrthogonal regularization. Another important constraint, cycle-consistency, for the transformation\nof both location [52] and feature [30] is the orthogonal regularization. For a pair of patches, we\nencourage every pixel to fall into the same location after one cycle of forward and backward tracking,\nas shown in Figure 3 (middle and right):\n\n$$\\hat{l}^{12} = l^{11} A_{1 \\to 2}, \\quad \\hat{l}^{11} = \\hat{l}^{12} A_{2 \\to 1} \\qquad (7)$$\n\nHere we specifically add $m \\to n$ to denote affinity transforming from the frame m to n, i.e.,\n$A_{m \\to n} = \\kappa(f_m, f_n)$. Similarly, the cycle-consistency can be applied to the feature space:\n\n$$\\hat{f}_2 = f_1 A_{1 \\to 2}, \\quad \\hat{f}_1 = \\hat{f}_2 A_{2 \\to 1} \\qquad (8)$$\n\nWe show that enforcing cycle-consistency is equivalent to regularizing A to be orthogonal: With (7)\nand (8), it is easy to show that the optimal solution is achieved when $A_{1 \\to 2}^{-1} = A_{2 \\to 1}$. Inspired by\nrecent style transfer methods [12, 30], the color energy represented by the Gram-matrix should be\nconsistent such that $f_1 f_1^\\top = f_2 f_2^\\top$, which implies that $A_{1 \\to 2}^\\top = A_{2 \\to 1}$ is the goal to reconstruct\nthe color information. Thus, it is easy to show that regularizing A as orthogonal automatically\nsatisfies the cycle constraint. In practice, we switch the role of reference and target to perform the\ntransformation, as described in (7) and (8). We use the MSE loss between both $\\hat{l}^{11}$ and $l^{11}$, $\\hat{f}_1$ and\n$f_1$, and specifically replace $A_{2 \\to 1}$ with $A_{1 \\to 2}^\\top$ in Eq. (8) to enforce the regularization. Namely, the\northogonal regularization provides a concise mathematical formulation for many recent works [52, 46]\nthat exploit cycle-consistency in videos.\nConcentration regularization. 
We additionally apply the concentration loss (i.e., Eq. (6) without\nthe truncation) in local, non-overlapping 8 \u00d7 8 grids of a feature map, to encourage local context or\nobject parts to move as an entity over time. Unlike [52, 39] where local patches are regularized by\nsimilarity transformation via a spatial transformation network [18], this local concentration loss is\nmore flexible by allowing arbitrary deformations within each local grid.\n\nFigure 4: Visualization of the propagation results (columns: Input, Propagated Results). (a) Instance mask propagation on the DAVIS-2017 [36]\ndataset. (b) Pose keypoints propagation on the J-HMDB [19] dataset. (c) Parts segmentation propagation on the\nVIP [59] dataset. (d) Visual tracking on the OTB2015 [53] dataset.\n\n4 Experiments\nWe compare with state-of-the-art algorithms [45, 46, 52] on several tasks: instance mask propagation,\npose keypoints tracking, human parts segmentation propagation and visual tracking.\n4.1 Network Architecture\nAs shown in Figure 2, our model consists of a region-level localization module and a fine-grained\nmatching module that share a feature representation network. We use the ResNet-18 [13]\nas the network for fair comparisons with [45, 52]. The patch randomly cropped from the\nreference frame is of 256 \u00d7 256 pixels. We carry out all our experiments on servers equipped with\nfour 16GB Tesla V100 GPUs.\n\nTraining. We first train the auto-encoder in the matching module (the encoder \u201cE\u201d and decoder \u201cD\u201d\nin Figure 2) to reconstruct images in the Lab space using the MSCOCO [28] dataset. We then fix it\nand train the feature representation network using the Kinetics dataset [21]. For all experiments, we\ntrain our model from scratch without any level of pre-training or human annotations. 
The objectives\ninclude: (a) concentration loss (Sections 3.2 and 3.3), (b) color reconstruction loss and (c) orthogonal\nregularization (Section 3.3). Involving the localization module from the beginning in the training\nprocess prevents the network from converging because poor localization makes matching impossible.\nThus we first train our network using patches cropped at the same location with the same size in the\nreference and target frame respectively. Fine-grained matching is conducted between the two patches\nfor 10 epochs. We then jointly train the localization and matching module for another 10 epochs. We\ntrain our model using Adam [22] as the optimizer with a learning rate of $10^{-4}$ for the warm-up and\n$0.5 \\times 10^{-4}$ for the joint training of the localization and matching modules. We set the temperature in\nthe softmax layer applied to the affinity matrix to 1, which empirically achieves the best performance.\n\nInference. In the inference stage, we directly apply the affinity learned to transform color feature\nrepresentations, on different types of inputs, e.g., segmentation masks and keypoint maps. We use the\nsame testing protocol as Wang et al. [52] for all tasks. Similar to [52], we adopt a recurrent inference\nstrategy by propagating the ground truth segmentation mask or keypoint heatmap from the first frame,\nas well as the predicted results from the preceding n frames onto the target frame. We average all\nn + 1 predictions to obtain the final propagated map (n is 1 for the VIP, and 7 for all the other tasks).\nFor fair comparisons, we also use the same k-NN propagation scheme as Wang et al. [52] and set k = 5\nfor all tasks. 
To compare with the ResNet-18 trained on the ImageNet with classification labels, we\nreplace our learned network weights with it and leave other settings unchanged for fair comparisons.\n\nFigure 5: Qualitative comparison with other methods. (a) Reference frame with instance masks. (b) Results by\nthe ResNet-18 trained on ImageNet. (c) Results by Wang et al. [52]. (d) Ours (global matching). (e) Ours with\nlocalization during inference. (f) Target frame with ground truth instance masks.\n\nTable 1: Evaluation of instance segmentation propagation on the DAVIS-2017 dataset [36].\n\nModel | Supervised | Dataset | J (Mean) | J (Recall) | F (Mean) | F (Recall)\nSIFT Flow [29] | \u00d7 | - | 33.0 | - | 35.0 | -\nDeepCluster [6] | \u00d7 | YFCC100M [43] | 37.5 | - | 33.2 | -\nTransitive Inv [51] | \u00d7 | - | 32.0 | - | 26.8 | -\nVondrick et al. [45] | \u00d7 | Kinetics [21] | 34.6 | 34.1 | 32.7 | 26.8\nWang et al. [52] (400 \u00d7 400) | \u00d7 | VLOG [11] | 43.0 | 43.7 | 42.6 | 41.3\nWang et al. [52] (480p) | \u00d7 | VLOG [11] | 46.4 | 50.1 | 50.0 | 48.0\nmgPFF [23] | \u00d7 | - | 42.2 | 41.8 | 46.9 | 44.4\nLai et al. [24] | \u00d7 | Kinetics [21] | 47.7 | - | 51.3 | -\nours | \u00d7 | Kinetics [21] | 56.8 | 65.7 | 59.5 | 65.1\nours-track | \u00d7 | Kinetics [21] | 57.7 | 67.1 | 60.0 | 65.7\nResNet-18 (3 blocks) | \u2713 | ImageNet [9] | 49.4 | 52.9 | 55.1 | 56.6\nResNet-18 (4 blocks) | \u2713 | ImageNet [9] | 40.2 | 36.1 | 42.5 | 36.6\nFlowNet2 [16] | \u2713 | FlyingThings3D [34] | 26.7 | - | 25.2 | -\nPWC-Net [41] | \u2713 | FlyingThings3D [34] | 35.2 | 34.0 | 37.4 | 33.1\nSiamMask [50] | \u2713 | YouTube-VOS [54] | 54.3 | 62.8 | 58.5 | 67.5\nOSVOS [5] | \u2713 | ImageNet, DAVIS [36] | 56.6 | 63.8 | 63.9 | 73.8\n\n4.2 Instance Segmentation Mask Propagation on the DAVIS-2017 dataset\nFigure 4 (a) and Figure 5 show the propagated instance masks and Table 1 lists quantitative results\nof all evaluated 
methods based on the Jaccard index J (IOU) and contour-based accuracy F. We\nuse the full 480p images during inference for our model. For fair comparisons we test the model by\nWang et al. [52] with the resolution of 480p, in addition to the result reported using 400 \u00d7 400 images.\nOur model performs favorably against the self-supervised state-of-the-art methods. Specifically, our\nmodel outperforms Wang et al. [52] by 13.3% in J and 16.6% in F, and is even 6.9% better in J and\n4.1% better in F than the ResNet-18 model [13] trained on ImageNet [9] with classification labels.\nFurthermore, we demonstrate that by including the localization module during inference, our model\ncan exclude noise from background pixels. Given the instance masks in the first frame, we obtain\nthe bounding box w.r.t. the instance mask and first locate it in the target frame by our localization\nmodule. Then, we propagate the instance masks within the bounding box in the reference frame to\nthe localized bounding box in the target frame using our matching module. Since the propagation is\ncarried out within two bounding boxes instead of the entire frames, we can minimize noise introduced\nby background pixels as shown in Figure 5 (d) and (e). Quantitatively, this improved\nmodel outperforms the model that does not include the localization module during inference (see\n\u201cOurs-track\u201d vs. \u201cOurs\u201d in Table 1).\n4.3 Ablation Studies on the DAVIS-2017 Dataset\nWe carry out ablation studies to see the contributions of each term, as shown in Figure 6 and Table 2.\nNote that inference is conducted between a pair of full-size frames without localization.\nRegion-level Localization. Our model trained with the region-level localization module is able to\nplace the individual points all within a reasonable local region (Figure 6 (c)). 
We show that the model can accurately capture both region-level shifts (e.g., a person moving forward) and subtle deformations (e.g., movement of body parts), while preserving the correct spatial relations among all the points. In contrast, the model trained without the localization module tends to model global matching, leading to less accurate preservation of the local spatial relationships among points, e.g., the red points in Figure 6 (d) tend to cluster together, as shown in the cyan circle. Consistent quantitative results can also be found in Table 2 (c), where the J and F measures drop by 2.5% and 0.9%, respectively, when the model is trained without the localization module. We also find that the localization module should always be trained together with the concentration loss to satisfy the assumption in Section 3.2 (Table 2 (f) and (g)).\nConcentration regularization. The concentration regularization encourages locality during the transformation process, i.e., points within a neighborhood in the reference frame stay together in the target frame. The model trained without it tends to introduce outliers, as shown in the cyan circle of Figure 6 (e). Table 2 (b) and (e) demonstrate the contribution of this concentration regularization term, e.g., compared to (b), the J in (e) decreases by 8% without this regularization term.\nOrthogonal regularization. The orthogonal regularization term enforces points to match back to themselves after a cycle of forward and backward transformation. As shown in Figure 6 (f), the model trained without the orthogonal regularization term is less effective in preserving local structures. The effectiveness of the orthogonal regularization is also validated quantitatively in Table 2 (e) and (f).\n\nFigure 6: Visualization of the ablation studies. Given a set of points in the reference frame (a), we visualize the results of propagating these points onto the target frame (b). \u201cL\u201d, \u201cC\u201d, \u201cO\u201d and \u201call\u201d correspond to the localization module, the concentration regularization, the orthogonal regularization, and all of them, respectively (d-g). Panels: (a) Reference frame, (b) Target frame, (c) Ours, (d) w/o L, (e) w/o C, (f) w/o O, (g) w/o all.\n\nTable 2: Ablation studies. The minus sign \u201c-\u201d indicates training without the specific module or regularization. \u201cL\u201d, \u201cO\u201d and \u201cC\u201d mean the localization module, orthogonal and concentration regularization, respectively. The last column (\u201c(g) -all\u201d) shows results of a baseline model trained without any of \u201cL\u201d, \u201cO\u201d or \u201cC\u201d.\nMetric\t(a) Ours-track\t(b) Ours\t(c) -L\t(d) -O\t(e) -C\t(f) -O&C\t(g) -all\nJ (Mean)\t57.7\t56.3\t53.8\t55.2\t48.3\t44.3\t45.7\nF (Mean)\t61.3\t59.2\t58.3\t58.7\t52.4\t49.6\t52.3\n\n4.4 Tracking Pose Keypoint Propagation on the J-HMDB Dataset\nWe demonstrate that our model learns accurate correspondence by evaluating it on the J-HMDB dataset [19], which requires more precise point matching than the coarser propagation of masks. Given the 15 ground truth human pose keypoints in the first frame, we propagate them to the remaining frames. We quantitatively evaluate performance using the probability of correct keypoint (PCK) metric [57], which measures the ratio of joints that fall within a threshold distance from the ground truth joint locations. We show quantitative evaluations against the state-of-the-art methods in Table 5 and qualitative propagation results in Figure 4(b). Our model performs well versus all self-supervised methods [52, 45] and notably achieves better results than ResNet-18 [13] trained with classification labels [9].\n\nTable 5: Keypoint propagation on J-HMDB [19].\nModel\tSupervised\tPCK@.1\tPCK@.2\nVondrick et al. [45]\t\u00d7\t45.2\t69.6\nWang et al. [52]\t\u00d7\t57.3\t78.1\nOurs\t\u00d7\t58.6\t79.8\nResNet-18\t\u2713\t53.8\t74.6\nFully Supervised [56]\t\u2713\t68.7\t92.1\n\n4.5 Visual Tracking on the OTB Dataset\nBeyond tasks that require dense matching, e.g., segmentation or keypoint propagation, the features learned by our model can be applied to object matching tasks such as visual tracking, because of their capability of localizing an object or a relatively global region. Without any fine-tuning, we directly integrate our network trained via self-supervision into a classic tracking framework [46, 37] based on correlation filters, by replacing the Siamese network in [46, 37] with our model, while keeping other parts in the tracking framework unchanged. Even without training with a correlation filter, our features are general and robust enough to achieve performance on the OTB2015 dataset [53] comparable to that of methods trained with this filter [46], as shown in Table 3. Figure 4(d) shows that our learned features are robust to occlusion (left) and to changes in object scale and illumination (right), and can track objects through long sequences (hundreds of frames in the OTB2015 dataset).\n\nTable 3: Tracking results on OTB2015 [53].\nModel\tSupervised\tAUC score (%)\nUDT [46]\t\u00d7\t59.4\nOurs\t\u00d7\t59.2\nResNet-18\t\u2713\t55.6\nFully Supervised [2]\t\u2713\t58.2\n\n4.6 Semantic and Instance Propagation on the VIP Dataset\nWe evaluate our method on the VIP dataset [59], which includes dense human parts segmentation masks on both the semantic and instance levels. We use the same settings as Wang et al. [52] and resize the input frames to 560 \u00d7 560. For the semantic propagation task, we propagate the semantic segmentation maps of human parts (e.g., arms and legs) and evaluate performance via the mean IoU metric. For the part instance propagation task, we propagate the instance-level segmentation of human parts (e.g., arms of the first person or legs of the second person) and evaluate performance via the mean average precision of the instance-level human parsing metric [26]. Table 4 shows that, on both tasks, our method performs favorably against all self-supervised methods and, notably, against the ResNet-18 model trained on ImageNet with classification labels. Figure 4(c) shows sample semantic segmentation propagation results. Interestingly, our model correctly propagates each part mask onto an unseen instance (the woman, who does not appear in the first frame) in the second example.\n\nTable 4: Segmentation propagation on VIP [59].\nModel\tSupervised\tmIoU\tAP^r_vol\nDeepCluster [6]\t\u00d7\t21.8\t8.1\nWang et al. [52]\t\u00d7\t28.9\t15.6\nOurs\t\u00d7\t34.1\t17.7\nResNet-18\t\u2713\t31.8\t12.6\nFully Supervised [38]\t\u2713\t37.9\t24.1\n\n5 Conclusions\nIn this work, we propose to learn correspondences across video frames in a self-supervised manner. Our method jointly tackles region-level and pixel-level correspondence learning and allows them to facilitate each other through a shared inter-frame affinity matrix. Experimental results demonstrate the effectiveness of our approach versus the state-of-the-art self-supervised video correspondence learning methods, as well as supervised models such as the ResNet-18 trained on ImageNet with classification labels.\n\nReferences\n[1] M. Andriluka, S. Roth, and B. Schiele. 
People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008. 2\n\n[2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. arXiv preprint arXiv:1606.09549, 2016. 1, 2, 4, 8\n\n[3] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI, 2010. 3\n\n[4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012. 1, 2\n\n[5] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taix\u00e9, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR, 2017. 7\n\n[6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018. 7, 8\n\n[7] J. Choi, H. Jin Chang, S. Yun, T. Fischer, Y. Demiris, and J. Young Choi. Attentional correlation filter network for adaptive visual tracking. In CVPR, 2017. 2\n\n[8] D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In CVPR, 2000. 2\n\n[9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 2, 7, 8\n\n[10] A. Dosovitskiy, P. Fischer, E. Ilg, P. H\u00e4usser, C. Haz\u0131rba\u015f, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet: Learning optical flow with convolutional networks. In ICCV, 2015. 1, 2\n\n[11] D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik. From lifestyle vlogs to everyday interactions. In CVPR, 2018. 7\n\n[12] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In CVPR, 2016. 5\n\n[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 4, 6, 7, 8\n\n[14] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. 
High-speed tracking with kernelized correlation filters. TPAMI, 2014. 2\n\n[15] W.-C. Hung, V. Jampani, S. Liu, P. Molchanov, M.-H. Yang, and J. Kautz. Scops: Self-supervised co-part segmentation. CVPR, 2019. 5\n\n[16] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017. 1, 2, 7\n\n[17] J. P. B. H. J. Lee, D. Kim. Sfnet: Learning object-aware semantic flow. In CVPR, 2019. 3\n\n[18] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Neurips, 2015. 6\n\n[19] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In ICCV, 2013. 6, 8\n\n[20] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. TPAMI, 2011. 2\n\n[21] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017. 6, 7\n\n[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICML, 2015. 6\n\n[23] S. Kong and C. Fowlkes. Multigrid predictive filter flow for unsupervised learning on videos. arXiv preprint arXiv:1904.01693, 2019. 3, 7\n\n[24] Z. Lai and W. Xie. Self-supervised learning for video correspondence flow. arXiv preprint arXiv:1905.00875, 2019. 3, 7\n\n[25] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018. 2\n\n[26] Q. Li, A. Arnab, and P. H. Torr. Holistic, instance-level human parsing. arXiv preprint arXiv:1709.03612, 2017. 9\n\n[27] Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCV, 2014. 2\n\n[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll\u00e1r, and C. L. Zitnick. Microsoft coco: Common objects in context. 
In ECCV, 2014. 6\n\n[29] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. TPAMI, 2011. 1, 2, 7\n\n[30] S. Liu, G. Zhong, S. De Mello, J. Gu, V. Jampani, M.-H. Yang, and J. Kautz. Switchable temporal propagation network. In ECCV, 2018. 3, 5\n\n[31] B. D. Lucas, T. Kanade, et al. An iterative image registration technique with an application to stereo vision. In DARPA Image Understanding Workshop. Vancouver, British Columbia, 1981. 1, 2\n\n[32] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In CVPR, 2015. 2\n\n[33] C. Ma, X. Yang, C. Zhang, and M.-H. Yang. Long-term correlation tracking. In CVPR, 2015. 2\n\n[34] N. Mayer, E. Ilg, P. H\u00e4usser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016. 7\n\n[35] M. Okutomi and T. Kanade. A multiple-baseline stereo. TPAMI, 1993. 1\n\n[36] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbel\u00e1ez, A. Sorkine-Hornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017. 6, 7, 12\n\n[37] J. X. M. Z. W. H. Qiang Wang, Jin Gao. Dcfnet: Discriminant correlation filters network for visual tracking. arXiv preprint arXiv:1704.04057, 2017. 1, 2, 8\n\n[38] K. G. L. L. Qixian Zhou, Xiaodan Liang. Adaptive temporal encoding network for video instance-level human parsing. In Proc. of ACM International Conference on Multimedia (ACM MM), 2018. 8\n\n[39] I. Rocco, R. Arandjelovi\u0107, and J. Sivic. End-to-end weakly-supervised semantic alignment. In CVPR, 2018. 6\n\n[40] M. Rubinstein, C. Liu, and W. Freeman. Towards longer long-range motion trajectories. BMVC, 2012. 3\n\n[41] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018. 
1, 2, 7\n\n[42] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese instance search for tracking. In CVPR, 2016. 1, 2\n\n[43] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. Yfcc100m: The new data in multimedia research. arXiv preprint arXiv:1503.01817, 2015. 7\n\n[44] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017. 1, 2, 3\n\n[45] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy. Tracking emerges by colorizing videos. In ECCV, 2018. 2, 3, 5, 6, 7, 8\n\n[46] N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li. Unsupervised deep tracking. In CVPR, 2019. 2, 3, 4, 5, 6, 8\n\n[47] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Neurips, 2013. 2\n\n[48] Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu. Dcfnet: Discriminant correlation filters network for visual tracking. arXiv preprint arXiv:1704.04057, 2017. 2\n\n[49] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank. Learning attentions: residual attentional siamese network for high performance online visual tracking. In CVPR, 2018. 1, 2\n\n[50] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr. Fast online object tracking and segmentation: A unifying approach. CVPR, 2019. 7\n\n[51] X. Wang, K. He, and A. Gupta. Transitive invariance for self-supervised visual representation learning. In ICCV, 2017. 7\n\n[52] X. Wang, A. Jabri, and A. A. Efros. Learning correspondence from the cycle-consistency of time. In CVPR, 2019. 2, 3, 5, 6, 7, 8, 12\n\n[53] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. TPAMI, 2015. 1, 2, 6, 8\n\n[54] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018. 7\n\n[55] C. Yang, R. Duraiswami, and L. Davis. 
Efficient mean-shift tracking via a new similarity measure. In CVPR, 2005. 2\n\n[56] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In CVPR, 2018. 8\n\n[57] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. TPAMI, 2013. 8\n\n[58] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee. Unsupervised discovery of object landmarks as structural representations. In CVPR, 2018. 5\n\n[59] Q. Zhou, X. Liang, K. Gong, and L. Lin. Adaptive temporal encoding network for video instance-level human parsing. arXiv preprint arXiv:1808.00661, 2018. 6, 8", "award": [], "sourceid": 159, "authors": [{"given_name": "Xueting", "family_name": "Li", "institution": "University of California, Merced"}, {"given_name": "Sifei", "family_name": "Liu", "institution": "NVIDIA"}, {"given_name": "Shalini", "family_name": "De Mello", "institution": "NVIDIA"}, {"given_name": "Xiaolong", "family_name": "Wang", "institution": "CMU"}, {"given_name": "Jan", "family_name": "Kautz", "institution": "NVIDIA"}, {"given_name": "Ming-Hsuan", "family_name": "Yang", "institution": "Google / UC Merced"}]}