{"title": "Volumetric Correspondence Networks for Optical Flow", "book": "Advances in Neural Information Processing Systems", "page_first": 794, "page_last": 805, "abstract": "Many classic tasks in vision -- such as the estimation of optical flow or stereo disparities -- can be cast as dense correspondence matching. Well-known techniques for doing so make use of a cost volume, typically a 4D tensor of match costs between all pixels in a 2D image and their potential matches in a 2D search window. State-of-the-art (SOTA) deep networks for flow/stereo make use of such volumetric representations as internal layers. However, such layers require significant amounts of memory and compute, making them cumbersome to use in practice. As a result, SOTA networks also employ various heuristics designed to limit volumetric processing, leading to limited accuracy and overfitting. Instead, we introduce several simple modifications that dramatically simplify the use of volumetric layers - (1) volumetric encoder-decoder architectures that efficiently capture large receptive fields, (2) multi-channel cost volumes that capture multi-dimensional notions of pixel similarities, and finally, (3) separable volumetric filtering that significantly reduces computation and parameters while preserving accuracy. Our innovations dramatically improve accuracy over SOTA on standard benchmarks while being significantly easier to work with - training converges in 10X fewer iterations, and most importantly, our networks generalize across correspondence tasks. 
On-the-fly adaptation of search windows allows us to repurpose optical flow networks for stereo (and vice versa), and can also be used to implement adaptive networks that increase search window sizes on-demand.", "full_text": "Volumetric Correspondence Networks for Optical Flow\n\nGengshan Yang1∗, Deva Ramanan1,2\n1Carnegie Mellon University, 2Argo AI\n{gengshay, deva}@cs.cmu.edu\n\nAbstract\n\nMany classic tasks in vision – such as the estimation of optical flow or stereo disparities – can be cast as dense correspondence matching. Well-known techniques for doing so make use of a cost volume, typically a 4D tensor of match costs between all pixels in a 2D image and their potential matches in a 2D search window. State-of-the-art (SOTA) deep networks for flow/stereo make use of such volumetric representations as internal layers. However, such layers require significant amounts of memory and compute, making them cumbersome to use in practice. As a result, SOTA networks also employ various heuristics designed to limit volumetric processing, leading to limited accuracy and overfitting. Instead, we introduce several simple modifications that dramatically simplify the use of volumetric layers - (1) volumetric encoder-decoder architectures that efficiently capture large receptive fields, (2) multi-channel cost volumes that capture multi-dimensional notions of pixel similarities, and finally, (3) separable volumetric filtering that significantly reduces computation and parameters while preserving accuracy. Our innovations dramatically improve accuracy over SOTA on standard benchmarks while being significantly easier to work with - training converges in 7X fewer iterations, and most importantly, our networks generalize across correspondence tasks. 
On-the-fly adaptation of search windows allows us to repurpose optical flow networks for stereo (and vice versa), and can also be used to implement adaptive networks that increase search window sizes on-demand.\n\n1 Introduction\n\nMany classic tasks in vision – such as the estimation of optical flow [13] or stereo disparities [34] – can be cast as dense correspondence matching. Well-known techniques for doing so make use of a cost volume, typically a 4D tensor of match costs between all pixels in a 2D image and their potential matches in a 2D search window. State-of-the-art (SOTA) deep networks for stereo can make use of 3D volumetric representations because the search window reduces to an epipolar line [11, 22]. Search windows for optical flow need to be two-dimensional, implying that cost volumes have to be 4D. Because of the added memory and compute demands, deep optical flow networks have rarely exploited volumetric processing until recently. Even then, most employ heuristics that reshape cost volumes into 2D data structures that are processed with 2D spatial processing [7, 18, 19, 39, 42]. Specifically, common workarounds reshape a 4D array (x, y, u, v) into a multichannel 2D array (x, y) with uv channels. This allows for the use of standard 2D convolutional processing routines, but implies that feature channels are now tied to particular (u, v) displacements. This requires the network to memorize particular displacements in order to report them at test-time. 
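As a concrete illustration of this workaround (a toy numpy sketch with made-up sizes; names are our own), the reshape ties each 2D feature channel to one fixed displacement:

```python
import numpy as np

rng = np.random.default_rng(0)
U, V, H, W = 7, 7, 4, 5
C4d = rng.standard_normal((U, V, H, W))   # true 4D cost volume C(u, v, x, y)
C2d = C4d.reshape(U * V, H, W)            # "pseudo" 2D volume with uv channels

# Channel c of the reshaped array always holds the costs of the fixed
# displacement (u, v) = (c // V, c % V), so a 2D filter acting on these
# channels must memorize which displacement each channel encodes.
assert np.array_equal(C2d[3], C4d[0, 3])
```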
In practice, such networks are quite difficult to train because they require massive amounts of data augmentation and millions of training iterations to effectively memorize [7, 19].\n\n∗Code will be available at github.com/gengshay-y/VCN.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe introduce three simple modifications that significantly improve performance and generalizability by enabling true volumetric processing of cost volumes:\n\n1. We propose the 4D volumetric counterpart of 2D encoder-decoder \"U-Net\" architectures, which are able to efficiently encode large receptive fields for cost volume processing.\n\n2. We propose multi-channel cost volumes that make use of multiple pixel embeddings to capture complementary notions of similarity (or match cost). We demonstrate that these multiple matches allow for better handling of ambiguous correspondences, which is particularly helpful for ambiguous coarse matches in a coarse-to-fine matching network [38].\n\n3. We implement 4D convolutional kernels with separable high-order filters. In particular, our separable factorization results in a spatial (x, y) filter that enforces spatial regularity of the flow field, and an inhibitory \"winner-take-all\" or WTA (u, v) filter under which candidate matches for a given (x, y) pixel compete.\n\nOur innovations dramatically improve accuracy over SOTA on standard flow benchmarks while being significantly easier to work with - training converges in 7X fewer iterations. Interestingly, our networks appear to generalize across diverse correspondence tasks. On-the-fly adaptation of search windows allows us to repurpose optical flow networks for stereo (and vice versa), and can also be used to implement adaptive networks that increase search window sizes on-demand. 
We demonstrate the latter to be useful for stereo matching with noisy rectifications.\n\n2 Related Work\n\nDense visual correspondence Finding dense pixel correspondences between a pair of images has been studied extensively in low-level vision. Concrete examples include stereo matching and optical flow [13, 34]. Stereo matching constrains the search space to a horizontal scanline, where a 3D cost volume is usually built and optimized to ensure global consistency [11, 23]. Though optical flow with small motion has been well-addressed by classic variational approaches [37], finding correspondences in the 2D target image remains a challenge when displacements are large and occlusions occur [3].\n\nCorrespondence matching with cost volume Classic stereo matching algorithms usually extract local patch features and create a regular 3D cost volume, where smoothness constraints are further enforced by energy minimization [13, 34]. Recently, hand-crafted feature extraction has been replaced with convolutional networks, and the cost-volume optimization step is commonly substituted by 3D convolutions [22, 28, 45]. Despite the similar formulation, a \"true\" 4D cost volume was rarely used in optical flow estimation until very recently. Xu et al. [42] directly construct and process a 4D cost volume using semi-global matching. Recent successful optical flow networks also build a correlation cost volume and process it with 2D convolutions [7, 24, 39]. There also exists work in semantic correspondence matching on a 4D cost volume with 4D convolutions [31].\n\nEfficient convolutional networks Recent years have seen great interest in designing computation-efficient and memory-friendly deep convolutional networks. At the operation level, depthwise separable convolutions [36] save computation by separating a multi-channel 2D convolution into a depthwise convolution and a pointwise convolution [6, 14, 33, 46]. 
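For intuition, the savings of a depthwise separable convolution follow from a simple parameter count (the channel and kernel sizes below are illustrative, not taken from any cited network):

```python
# Parameter count of a standard k x k multi-channel 2D convolution versus
# its depthwise-separable factorization (depthwise k x k + pointwise 1 x 1).
cin, cout, k = 64, 128, 3                # illustrative sizes

standard = cin * cout * k * k            # one dense k x k kernel per in/out pair
separable = cin * k * k + cin * cout     # depthwise filters + pointwise mixing

assert separable < standard              # roughly 8x fewer here
```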
Efforts have also been made to use tensor factorization to speed up a trained network [20, 25]. Different from prior work, we separate a 4D convolution kernel into two separate 2D kernels. At the architecture level, the U-Net encoder-decoder scheme is widely used in dense prediction tasks [1, 7, 32]. Instead of directly filtering the high-res feature maps, it saves memory and computation by downsampling the input feature maps with strided convolutions and upsampling them back, and it is typically able to acquire sufficient receptive fields with very few layers. Similarly, we downsample the 4D cost volume in the (u, v) dimensions to maintain a small memory footprint.\n\n3 Approach\n\nIn this section, we first introduce a 4D convolutional matching module for volumetric correspondence processing. We then show that by factorizing the filters into separable components implemented within an encoder-decoder [32], one can significantly reduce computation and memory. Finally, we integrate volumetric filtering into a coarse-to-fine warping scheme [18, 39], where ambiguous matches and coarse-scale mistakes are handled by the multi-hypotheses design.\n\nFigure 1: We compare 2D filtering of a 4D cost volume reshaped to be a multi-channel 2D array (left) versus true 4D filtering (right). For simplicity, we visualize the candidate 7 × 7 array of (u, v) match costs for a particular (x, y) pixel. Blue and red circles indicate filtered values, and lines connected to them indicate filter weights between two layers. Note that 2D filter weights are not shared across spatial locations (indicated by different colors), while 4D filter weights are. During gradient-based learning of the 2D filter, a particular observed (u, v) displacement only backprops along the particular colored weights connected to it. On the other hand, the 4D filter will be updated for any observed (u, v) displacement, making it easier to generalize to different displacements.\n\n3.1 4D Convolutional Matching Module\n\nLet F1, F2 ∈ R^{d×H×W} be the d-dimensional pixelwise embeddings of the source and target image. We construct a 4D cost volume by computing the cosine similarity between each pixel in the H × W source image and a set of candidate targets in a U × V search window:\n\nC(u, x) = F1(x) · F2(x + u) / (||F1(x)|| · ||F2(x + u)||),   C(u, x) ∈ R^{U×V×H×W},\n\nwhere x = (x, y) is the source pixel coordinate and u = (u, v) is the pixel displacement. Cosine similarity is used in person re-identification and face verification [27, 41] in place of the dot product, and empirically we find that it produces better results than the dot product.\n\n2D convolution vs 4D convolution Many recent optical flow networks re-organize the 4D cost volume into a multichannel 2D array with N = U × V channels, and process it with multi-channel 2D convolutions [7, 18, 19, 39]. Instead, we leave the 4D cost volume C(u, x) as-is and filter it with 4D convolutions. Much as 2D filters ensure translation invariance and generalize to images of different sizes [26], we posit that 4D filters may ensure a form of offset \"invariance\" and generalize to search windows of different sizes. Fig. 1 suggests that multi-channel 2D filtering requires the network to memorize particular displacements seen during training. 
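The cost volume construction above can be sketched in a few lines of numpy (a toy, loop-based reference; the function name and the choice to score out-of-image candidates as zero are our own):

```python
import numpy as np

def cost_volume(F1, F2, U=7, V=7):
    """4D cost volume C[u, v, y, x] of cosine similarities between each
    source pixel and a U x V window of candidate target pixels.
    F1, F2: (d, H, W) feature maps; out-of-image candidates score 0."""
    d, H, W = F1.shape
    # L2-normalize embeddings so a plain dot product is cosine similarity.
    F1n = F1 / (np.linalg.norm(F1, axis=0, keepdims=True) + 1e-9)
    F2n = F2 / (np.linalg.norm(F2, axis=0, keepdims=True) + 1e-9)
    pad = np.zeros((d, H + V - 1, W + U - 1), dtype=F2.dtype)
    pad[:, V // 2:V // 2 + H, U // 2:U // 2 + W] = F2n
    C = np.zeros((U, V, H, W), dtype=F1.dtype)
    for i in range(U):          # horizontal displacement u = i - U//2
        for j in range(V):      # vertical displacement v = j - V//2
            C[i, j] = (F1n * pad[:, j:j + H, i:i + W]).sum(axis=0)
    return C
```

With identical source and target features, the zero-displacement slice is exactly 1 everywhere, since each pixel matches itself perfectly.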
Such explicit volumetric filtering of cost volumes is preferable because (1) it significantly reduces the number of parameters and computations; (2) it can process variable-sized cost volumes on demand; and (3) it generalizes better to displacements that are not seen during training.\n\nTruncated soft-argmin Given a (filtered) cost volume, one natural approach to reporting the (u, v) displacement for a pixel (x, y) is a \"winner take all\" (WTA) operation that returns the argmin displacement. Alternatively, if the offset dimensions are normalized by a softmax, one could compute the expected offset by taking a weighted average of offsets with weights given by the probabilistic softmax (soft argmin) [22]:\n\nE[u] = Σ_i u_i p(u = u_i)   [Soft Argmin]\n\nUnfortunately, WTA is not differentiable, while the soft argmin is sensitive to changes in the size of the search window [40]. Instead, we combine both with a \"truncated soft-argmin\" that zeros out the softmax probabilities for displacements more than M pixels away from the argmin u∗:\n\np′(u = u_i) ∝ p(u = u_i) if |u_i − u∗| ≤ M, and 0 otherwise   [Truncated Soft Argmin]\n\nWe empirically set M = 3 for a 7 × 7 search window, and use the truncated soft-argmin for both training and testing. Later we show that the truncated soft-argmin produces a notable improvement over the soft-argmin.\n\nFigure 2: For ease of visualization, we show the 2D cost volume C(u, x) for matching pixels across a source and target scanline image (a). To efficiently filter the volume, we factor the 3 × 3 filter (b) into a 1D spatial convolution over positions (c) followed by a 1D WTA convolution over displacements (d).\n\n3.2 Efficient Cost Volume Processing\n\nSeparable 4D convolution We now show that 4D volumetric kernels can be dramatically simplified by factorizing them into separable components. 
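Before the 4D derivation, the idea can be checked numerically on the 2D cost volume of Fig. 2 (positions × displacements): convolving with a rank-1 kernel equals two sequential 1D convolutions. A self-contained sketch (zero padding and the 3-tap filters are our own illustrative choices):

```python
import numpy as np

def conv2d_same(C, K):
    """Zero-padded 'same' 2D (true) convolution with an odd-sized kernel."""
    ku, kx = K.shape
    pu, px = ku // 2, kx // 2
    P = np.pad(C, ((pu, pu), (px, px)))
    Kf = K[::-1, ::-1]  # flip: true convolution, matching np.convolve
    out = np.zeros_like(C, dtype=float)
    for a in range(ku):
        for b in range(kx):
            out += Kf[a, b] * P[a:a + C.shape[0], b:b + C.shape[1]]
    return out

rng = np.random.default_rng(0)
C = rng.standard_normal((7, 9))          # C[u, x]: displacements x positions
k_wta = rng.standard_normal(3)           # 1D filter over displacements u
k_s = rng.standard_normal(3)             # 1D filter over positions x

full = conv2d_same(C, np.outer(k_wta, k_s))   # K(u, x) = K_WTA(u) K_S(x)
sep = np.apply_along_axis(np.convolve, 1, C, k_s, mode="same")      # over x
sep = np.apply_along_axis(np.convolve, 0, sep, k_wta, mode="same")  # over u
assert np.allclose(full, sep)            # separable == full for rank-1 kernels
```

The equality is exact (up to floating point) because the full kernel is the outer product of the two 1D filters, which is precisely the factorization assumed in the derivation that follows.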
In the context of a cost volume, we propose a factorization of a 4D filter K(u, x) into a 2D spatial filter K_S(x) and a 2D WTA filter K_WTA(u):\n\nK(u, x) ∗ C(u, x) = Σ_{v,y} K(v, y) C(u − v, x − y)   [4D Convolution]\n= Σ_{v,y} K_WTA(v) K_S(y) C(u − v, x − y)   [Factorization]\n= Σ_v K_WTA(v) [Σ_y K_S(y) C(u − v, x − y)]   [Separable Filtering]\n= K_WTA(u) ∗ [K_S(x) ∗ C(u, x)]\n\nFig. 2 visualizes this factorization, which reduces computation by a factor of N^2 for an N × N × N × N filter with negligible effect on performance, as shown in the ablation study in Tab. 4.\n\nU-Net encoder-decoder volume filtering We find it important to make use of 4D kernels with large receptive fields that can take advantage of contextual cues (as is the case for 2D image filtering). However, naively implementing large volumetric filters takes a considerable amount of memory [22]. We found it particularly important to include context for WTA filtering. Inspired by spatial encoder-decoder networks [1, 32], we apply two downsampling layers and two upsampling layers rather than stacking multiple 4D convolutional layers. In Sec. 
4.3, we show that encoder-decoder architectures allow us to significantly improve accuracy over alternatives with a similar compute budget.\n\nFigure 3: Illustration of volumetric processing at one pyramid level. 1) Cost volume construction: we warp features of the target image using the upsampled coarse flow and compute a multi-channel cost volume. 2) Volume processing: the multi-channel cost volume is filtered with separable 4D convolutions, integrated into a volumetric U-Net architecture; we predict multiple flow hypotheses using the truncated soft-argmin. 3) Soft selection: the flow hypotheses are linearly combined according to their uncertainties and the appearance features.\n\n3.3 Multi-hypotheses Correspondence Matching\n\nMulti-channel cost volume Past work has suggested that cost volumes might be too restricted in size and serve as too much of an information bottleneck for subsequent layers of a network [5, 22]. One common solution in the stereo literature is the construction of a feature volume rather than a cost volume, where an additional dimension of feature channels is encoded in the volumetric tensor [22] - typically, one might include the difference of the two feature descriptors being compared, resulting in an additional channel dimension of size |F(x)|. In our case, this would result in a prohibitively large volume. Instead, we propose an \"intermediate\" strategy between a traditional cost volume and a contemporary (deep) feature volume: a multi-channel cost volume. Intuitively, rather than simply encoding the cosine similarity between two embedding vectors, we record K similarities between K different feature embeddings that are trained jointly, by taking the channel-wise product between each pair of potential matches [10]. 
While this can be thought of as K distinct cost volumes, we instead concatenate them into a multi-channel 4D cost volume R^{K×U×V×H×W}, where K is treated as a feature channel whose dimension is kept constant during filtering. After being processed by the volumetric U-Net, each of the K cost volumes Ck(u, v, x, y) is used to compute a truncated softmax expectation.\n\nMulti-hypotheses selection Considering the multimodal nature of correspondence matching, we propose a multi-hypotheses selection module that assigns a weight to each hypothesis given its value, uncertainty, and appearance information. Inspired by Campbell et al. [4], we treat it as a labelling problem and use a stacked 2D convolution network that takes the image features, K hypothesis values, and K entropy scores as input, and produces a softmax distribution over the hypotheses. The final correspondence prediction is computed by weighting the hypotheses with the softmax distribution.\n\nCoarse-to-fine warping architectures, such as PWC-Net [39], are sensitive to coarse-level failures, where an incorrect coarse flow is used to warp the features, leading to gross errors. More importantly, small objects with large displacements are never considered, since only one coarse prediction is used to warp a group of fine pixels (usually 2 × 2). To account for the missing multi-modal information at the coarse scale, one solution is to create K different warpings and delta fine cost volumes according to the K coarse-scale hypotheses, and then aggregate the results. However, processing K different hypotheses would be prohibitively expensive. 
Instead, we directly pass the K coarse-level hypothesized correspondences to the subsequent fine-scale multi-hypotheses network as additional hypotheses [43].\n\nOut-of-range detection During occlusions or severe displacements, the optimal predicted displacement is likely an \"out-of-range\" output that lies outside the search window. We use the processed cost volumes to train such a binary classifier. Since cost volumes give us access to a distribution over all candidate matches, we can use the distribution to estimate uncertainty. Specifically, for each of the K hypothesized cost volumes, we compute the Shannon entropy of the truncated softmax, given by\n\nH[u] = −Σ_i p′(u = u_i) · log p′(u = u_i)\n\nSince the Shannon entropy itself is not a reliable uncertainty indicator [15], we pass the entropies into a U-Net module along with the image features and expected displacements, and produce a binary variable that indicates whether the ground-truth displacement is out of the search range. The out-of-range detection module is trained with a binary cross-entropy loss, where the supervision comes from comparing the ground-truth flow with the maximum search range. 
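The truncated softmax of Sec. 3.1 and its entropy can be sketched for a single pixel's (U, V) cost slice as follows (a toy numpy version; treating the truncation radius M as a Chebyshev distance is our assumption, and lower cost means a better match):

```python
import numpy as np

def truncated_softargmin_stats(costs, M=3):
    """Truncated soft-argmin expectation and Shannon entropy H[u] for one
    pixel. costs: (U, V) match costs over the search window (lower = better).
    Probabilities farther than M from the argmin u* are zeroed, then
    renormalized; the expectation is taken under the truncated distribution."""
    U, V = costs.shape
    logits = -costs
    p = np.exp(logits - logits.max())
    p /= p.sum()                                   # softmax over the window
    u_star = np.unravel_index(np.argmin(costs), costs.shape)
    uu, vv = np.meshgrid(np.arange(U), np.arange(V), indexing="ij")
    keep = np.maximum(np.abs(uu - u_star[0]), np.abs(vv - u_star[1])) <= M
    p_t = np.where(keep, p, 0.0)
    p_t /= p_t.sum()                               # truncated softmax p'
    expected = np.array([(p_t * uu).sum(), (p_t * vv).sum()])
    nz = p_t[p_t > 0]
    entropy = float(-(nz * np.log(nz)).sum())      # Shannon entropy H[u]
    return expected, entropy
```

A sharply peaked cost slice yields an expectation near the argmin and a near-zero entropy; a flat slice yields a high entropy, which is what the out-of-range detector consumes as an uncertainty cue.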
Empirically, adding the out-of-range detector regularizes the model and improves generalization, as shown in Sec. 4.3.\n\nTable 1: Model size and running time. Gflops are measured on KITTI-sized (0.5 megapixel) images. The number of training iterations is recorded for the pre-training stage on FlyingChairs and FlyingThings, and (S) indicates sequential training of separate modules.\n\nMethod | #param. | Gflops | #train iter.\nFlowNetS [7] | 38.7M | 66.8 | 1700K\nFlowNetC [7] | 39.2M | 69.6 | 1700K\nFlowNet2 [19] | 162.5M | 365.6 | 7100K (S)\nPWC-Net+ [39] | 9.4M | 90.8 | 1700K\nLiteFlowNet [17] | 5.4M | 151.7 | 2000K (S)\nHD^3F [44] | 39.9M | 186.1 | -\nIRR-PWC [21] | 6.4M | - | 1700K\nOurs-small | 5.2M | 36.9 | 220K\nOurs | 6.2M | 96.5 | 220K\n\nFigure 4: Stereo → Flow transfer. After fine-tuning with KITTI stereo data, our small model consistently out-performs PWC-Net on KITTI flow, though with similar error on the stereo training set, indicating that our model is more generalizable.\n\n4 Experiments\n\nNetwork specification Similar to PWC-Net and LiteFlowNet [18, 39], we follow the coarse-to-fine feature warping scheme shown in Fig. 3. 
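The warping step in Stage 1 of Fig. 3 can be sketched as follows (a toy version: nearest-neighbor sampling and 2x nearest upsampling stand in for the bilinear operations a real pipeline would use; function names are our own):

```python
import numpy as np

def upsample_flow(flow):
    """Upsample a coarse flow field (2, H, W) by 2x: repeat each prediction
    over a 2 x 2 block of fine pixels and double the displacements."""
    return 2.0 * flow.repeat(2, axis=1).repeat(2, axis=2)

def warp(feat, flow):
    """Sample target features at x + flow(x) (nearest neighbor, zero pad).
    feat: (d, H, W); flow: (2, H, W), flow[0] = u (x-shift), flow[1] = v."""
    d, H, W = feat.shape
    out = np.zeros_like(feat)
    for y in range(H):
        for x in range(W):
            xs = x + int(round(float(flow[0, y, x])))
            ys = y + int(round(float(flow[1, y, x])))
            if 0 <= xs < W and 0 <= ys < H:
                out[:, y, x] = feat[:, ys, xs]
    return out
```

After warping, the residual cost volume of Fig. 3 only needs to cover the small remaining displacement around the upsampled coarse prediction.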
We find correspondences with 9 × 9 search windows on a feature pyramid with strides {64, 32, 16, 8, 4}, and keep K = {16, 16, 16, 16, 12} hypotheses at each scale. Besides the full model, we also train a smaller model that only takes features from the coarse levels with strides {64, 32, 16, 8}, indicated by \"Ours-small\".\n\nTraining procedure We build the model and re-implement the training pipeline of PWC-Net+ [39] using PyTorch. The model is trained on a machine with 4 Titan X Pascal GPUs, following the same training and fine-tuning procedure. Notably, we are able to stably train the network with a larger learning rate (10^-3 vs 10^-4) and fewer iterations (140K vs 1200K on FlyingChairs and 80K vs 500K on FlyingThings) compared to prior optical flow networks. Furthermore, PWC-Net has been found to be sensitive to initialization [39], requiring several training attempts from random initialization to avoid poor local minima; we never observed this for our networks.\n\n4.1 Benchmark results\n\nAs shown in Tab. 1, our models can be trained with significantly fewer iterations and without sequential training of submodules. In terms of computational efficiency, our small model uses less than half the FLOPS of PWC-Net and a quarter of the FLOPS of LiteFlowNet. Our full model uses similar computation to PWC-Net and 40% fewer computations than LiteFlowNet. It is also among the most compact models in terms of parameter count. Notably, our model is the only optical flow network in the table that processes a \"true\" 4D cost volume instead of convolving a \"pseudo\" multi-channel 2D cost volume.\n\nThough our model is compact, computationally efficient, and trained with fewer iterations, it demonstrates SOTA accuracy on multiple benchmarks. As shown in Tab. 
2, after the pretraining stage, Ours-small achieves a smaller end-point error (EPE) than all methods on KITTI [9, 30] except LiteFlowNet2, which is heavier than LiteFlowNet and much heavier than Ours-small. Our full model further out-performs our small model and reduces the Fl-all error by one-third compared to PWC-Net. On Sintel, our small model beats all previous networks except FlowNet2, which uses 8X more computation, 30X more parameters, and 30X more training iterations. Our full model further improves the accuracy over our small model. The pretraining-stage results demonstrate that our network generalizes better than existing optical flow architectures.\n\nAfter fine-tuning on KITTI, our model clearly out-performs existing SOTA methods by a large margin. The only method comparable to ours is HD^3F, which uses 6X more parameters and 1.76X more computation. On Sintel, our method ranks 1st on both the \"clean\" pass and the \"final\" pass among all two-frame optical flow methods. Notably, our small model achieves flow error on KITTI similar to LiteFlowNet2 and PWC-Net+ while using 1/4 and 2/5 of their computation, respectively.\n\nTable 2: Results on the K(ITTI)-15 and S(intel) optical flow benchmarks. \"C+T\" indicates models pre-trained on Chairs and Things [7, 29]. \"+K/S\" indicates models fine-tuned on KITTI or Sintel. †: D1-all is the default metric for KITTI stereo matching, and is evaluated on the KITTI-15 stereo training data. The subscript number shows the absolute ranking among all two-frame optical flow methods in the benchmark. Best results in each group are bolded, and best results overall are underlined. Parentheses mean that training and testing are performed on the same dataset. 
Some results are shown as mean \u00b1 standard deviation in \ufb01ve trials.\n\nMethod\n\nK-15-train\n\nK-15-test\n\nFl-all D1-all\u2020 Clean\n1.86\n\n-\n\nS-train (epe)\nFinal\n3.06\n\n-\n\nC+T\n\n+K/S\n\nFlowFields [2]\nDCFlow [42]\nFlowNet2 [19]\nPWC-Net [38]\nLiteFlowNet [17]\nLiteFlowNet2 [18]\nHD\u22273F [44]\nOurs-small\nOurs\nFlowNet2 [19]\nPWC-Net-ft+ [39]\nLiteFlowNet2-ft [18]\nIRR-PWC-ft [21]\nHD\u22273F-ft [44]\nOurs-small-ft\nOurs-ft\n\n9.43 \u00b1 0.18\n\nFl-epe\n8.33\n\n-\n\n10.08\n10.35\n10.39\n8.97\n13.17\n\n8.36\n(2.30)\n(1.50)\n(1.47)\n(1.63)\n(1.31)\n(1.41)\n(1.16)\n\nFl-all\n24.4\n15.1\n30.0\n33.7\n28.5\n25.9\n24.0\n33.4\n25.1\n(8.6)\n(5.3)\n(4.8)\n(5.3)\n(4.1)\n(5.5)\n(4.1)\n\n14.83\n\n-\n-\n-\n-\n-\n-\n-\n\n11.48\n7.72\n7.74\n7.653\n6.552\n7.74\n6.301\n\n-\n-\n-\n\n-\n-\n-\n\n-\n\n-\n-\n-\n\n23.30\n\n13.12\n8.73\n\n9.17\n\n6.10\n4.67\n\nS-test (epe)\nFinal\nClean\n5.81\n3.75\n3.54\n5.12\n6.02\n3.96\n\n-\n-\n-\n-\n-\n-\n\n-\n-\n-\n-\n-\n-\n\n4.16\n3.45\n3.45\n3.84\n4.79\n3.26\n2.811\n\n5.74\n4.60\n4.90\n4.58\n4.67\n4.73\n4.401\n\n-\n\n-\n\n2.02\n2.55\n2.48\n2.24\n3.84\n2.45\n2.21\n(1.45)\n(1.71)\n(1.30)\n(1.92)\n(1.87)\n(1.84)\n(1.66)\n\n3.54\n3.93\n4.04\n3.78\n8.77\n3.63\n3.62\n(2.01)\n(2.34)\n(1.62)\n(2.51)\n(1.17)\n(2.44)\n(2.24)\n\nOn Sintel clean pass, our small model is better than all convolutional optical \ufb02ow methods except for\nour full model.\nInterestingly, on KITTI stereo matching training set, our method out-performs PWC-Net with an\neven larger margin, i.e., 8.73% error versus 23.30% without \ufb01ne-tuning, and 4.67% versus 9.17%\nafter \ufb01ne-tuning on KITTI \ufb02ow data. 
This indicates the superior generalization ability of our model across correspondence tasks.\n\n4.2 Generalization ability\n\nCross-task generalization: Stereo → Flow To compare the generalization ability of our method with existing deep flow networks [7, 38], we transfer the Chairs/Things-pretrained model to the real domain, i.e., KITTI, where flow annotations are more difficult to acquire than stereo (depth) annotations. To do so, we fine-tune our pretrained small model on the KITTI stereo training set together with FlyingChairs and FlyingThings for 75K iterations. For comparison, a pretrained official PWC-Net model is fine-tuned with the same procedure, except that the learning rate is set to 0.0001, since a larger learning rate makes training PWC-Net unstable.\n\nAs shown in Fig. 4, our pre-trained model initially performs on par with PWC-Net on the KITTI optical flow training set. After fine-tuning on KITTI-15 stereo images for 75K iterations, although both methods perform similarly on the training data, Ours-small achieves much lower error on out-of-domain optical flow image pairs. This indicates that our model is less overfitted to the training distribution. Qualitative results can be found in the supplementary material.\n\nCross-range generalization: small motion → large motion In-the-wild image pairs have unknown maximum displacement, i.e., they may be captured from very different viewpoints, and objects can move anywhere. Therefore, the ability to find correspondences outside the training search range is important for real-world applications. To deal with large displacements, one could simply find correspondences on downsampled images; however, this loses high-frequency information. Instead, our proposed separable 4D convolutional matching module is able to vary its search range at test time on demand. 
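Because the learned (u, v) filters slide over the displacement dimensions just as 2D filters slide over (x, y), the same weights can be applied to a wider window at test time. Schematically (a toy 1D example; names are our own):

```python
import numpy as np

def filter_displacements(C, k_wta):
    """Apply one shared 1D WTA filter along the displacement axis of a 2D
    cost slice C[u, x]; the same weights work for any window size U."""
    return np.apply_along_axis(np.convolve, 0, C, k_wta, mode="same")

rng = np.random.default_rng(0)
k_wta = rng.standard_normal(3)                                     # trained once
small = filter_displacements(rng.standard_normal((7, 5)), k_wta)   # 7-px range
large = filter_displacements(rng.standard_normal((15, 5)), k_wta)  # 15-px range
assert small.shape == (7, 5) and large.shape == (15, 5)
```

A reshaped multi-channel 2D network, by contrast, bakes U × V into its channel dimension, so it cannot be evaluated on a larger window without retraining.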
To demonstrate this, we train the correspondence model on pixels with small motion (0-32px) on FlyingThings, and test on two displacement ranges (0-32px and 0-64px) on the KITTI-15 training set. Ours-32 is our proposed matching module operating on stride-8 features. As a comparison, we train a PWC-Net baseline using the same annotated data, referred to as PWC-32. We also train a PWC-Net baseline with 0-64px motion to serve as an upper bound for our method.\n\nAs shown in Tab. 3, our method achieves 39.2% lower error than PWC-32 for in-distribution pixels (pixels with 0-32px motion), while achieving 65.4% lower error for out-of-distribution pixels (pixels with 0-64px motion). Moving from in-distribution to out-of-distribution data, the error rate of PWC-32 increases by 231%, while that of our model increases by 89%, on par with a model trained on both in-distribution and out-of-distribution data, i.e., PWC-64, demonstrating strong generalization to out-of-training-range data.\n\nTable 3: On-demand correspondence matching with an extended search range.\n\nMethod | EPE 0-32px | EPE 0-64px | ratio\nPWC-32 [38] | 2.85 | 9.44 | 3.31\nPWC-64† [38] | 2.72 | 5.50 | 2.02\nOurs-32 | 1.73 | 3.27 | 1.89\n\nTable 4: Results of the single-stage ablation study.\n\nMethod | EPE (px) | GFlops | # Params.\nDenseNet [38] | 2.64 | 25.5 | 8.2\nFull-4D | 2.30 | 52.5 | 1.83\nSep-4D | 2.31 | 23.4 | 1.78\nOurs-UNet | 1.73 | 28.5 | 2.94\nUNet→Plain×4 | -0.02 | +20.9 | -\n- Multi-channel | +0.32 | -0.7 | -0.001\nT-soft.→Soft. | +0.10 | -0.5 | +0.001\nT-soft.→Reg. | +0.58 | -0.4 | -\n- OOR | +0.07 | - | -\n\n4.3 Diagnostics\n\nSingle-stage ablation study To reveal the contribution of each component, we perform a detailed ablation study. For clarity, we use a single-stage architecture, i.e., without coarse-to-fine warping, on stride-8 features. 
The models are trained on 0-32px motions (in both x and y directions) on FlyingChairs and evaluated on the KITTI-15 training set on pixels with the same motion range. As the baseline model, we implement a DenseNet matching module followed by a refinement module as used in PWC-Net [16], referred to as \"DenseNet\". For \"Full-4D\", we replace the DenseNet and refinement modules with two residual 4D convolution blocks (four convolutions in total). As shown in Tab. 4, this reduces the error by 12.9% and the number of parameters by 77.7%, though with an increased amount of computation. \"Sep-4D\" separates the 4D kernels into WTA kernels and spatial kernels, reducing GFlops by half without a significant loss in accuracy. \"Ours-UNet\" is our final model, which uses multi-channel cost volumes, the volumetric U-Net architecture, truncated soft-argmin inference, and out-of-range (OOR) detection. It further reduces the error rate by 23.4%.\n\nWe then remove or replace each component of our final model. Replacing the U-Net architecture (ten convolutions) with a plain architecture (eight convolutions) slightly reduces the error but adds a large compute and memory overhead. Replacing the multi-channel cost volume with a standard single-channel cost volume increases the error by 18.5%. Replacing the truncated soft-argmin with a standard soft-argmin increases the error by 6.8%, and direct regression of flow vectors from cost volumes increases the error by almost one-third, demonstrating the benefits of truncated soft-argmin inference. Interestingly, removing the out-of-range detection module during training also increases the error. We posit that it exploits the cost volume structure to regularize the network and helps the model generalize better.\n\nAnalysis of cost volume filtering We also compare different architectural designs for cost volume filtering in terms of the FLOPS and the number of parameters they use. 
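The parameter counts in this comparison follow directly from the kernel shapes; a quick numerical check of the ratios reported in Tab. 5 (shapes as listed there, sizes illustrative):

```python
import numpy as np

def n_params(shape):
    """Number of weights in a convolution kernel of the given shape."""
    return int(np.prod(shape))

K, U, V = 16, 9, 9                                 # e.g., a 9 x 9 window
p_2d = n_params((K * U * V, K * U * V, 3, 3))      # multi-channel 2D conv
p_4d = n_params((K, K, 3, 3, 3, 3))                # full 4D conv
p_sep = n_params((2, K, K, 3, 3))                  # separable: two 2D kernels

assert p_4d / p_sep == 4.5                         # 81K^2 vs 18K^2
assert p_2d / p_sep == U * U * V * V / 2           # the U^2V^2/2 ratio
```

At U = V = 9 the parameter ratio is about 3280, consistent with the roughly 3000x reduction quoted in the text.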
To \ufb01lter a multi-channel cost\nvolume of size (K, U, V, H, W ), \"2D convolution\" reshapes the \ufb01rst three dimensions (k, u, v) into a\nfeature vector and \ufb01lters along the height and width dimension (x, y). Our \"4D convolution\" and\n\"separable 4D convolution\" treat the hypotheses dimension k as feature dimension and \ufb01lter along\nthe (u, v, x, y) dimension. As shown in Tab. 5, separable 4D convolution uses 3.5X fewer parameters\nand computations compared to the full 4D convolution. Compared to 2D convolution, separable\n4D convolution only uses\nU V computations. Speci\ufb01cally when U = V = 9\nas in PWC-Net [39], replacing the 2D convolutions with separable 4D convolutions reduces the\ncomputation by 40x and number of parameters by 3000x.\n\nU 2V 2 parameters and 2\n\n2\n\n8\n\n\fTable 5: Comparison between \ufb01ltering approaches on a (K,U,V,H,W) multi-channel 4D cost volume.\nratio\n\n# Mult-Adds\n\n# Param.\n\nMethod\n\nKernel\n\n2D conv.\n4D conv.\n\nSep. 4D conv.\n\n(KU V, KU V, 3, 3)\n\n9K 2U 2V 2\n\n(K, K, 3, 3, 3, 3)\n(2, K, K, 3, 3)\n\n81K 2\n18K2\n\n4.4 Stereo matching with vertical disparity\n\nratio\nU 2V 2\n\n2\n4.5\n1\n\n9HW \u00d7 K 2U 2V 2\n81HW \u00d7 K 2U V\n18HW \u00d7 K2UV\n\nU V\n2\n4.5\n1\n\nWe further show an application of our correspondence network in stereo matching with imperfect\nrecti\ufb01cation. Although most stereo systems assume that cameras are perfectly calibrated and cor-\nrespondences lie on the same horizontal scan-line. However in reality, it is dif\ufb01cult to perfectly\ncalibrate stereo pairs during large temperature changes and vibrations [12]. Such errors result in\nground-truth disparity matches that have a vertical component (e.g., match to a different horizontal\nscanline). 
Instead of searching for stereo correspondences along the horizontal scanline, we find matches in a 2D rectangular area and project the displacement vector onto the horizontal direction.
We fine-tune our model and PWC-Net using stereo data from the KITTI, Middlebury, and SceneFlow [9, 29, 30, 35] training sets for 70K iterations. For our model, we set U = 6, V = 1 for each level. We then evaluate on half-sized Middlebury-14 additional images, of which thirteen have perfect rectification and thirteen have imperfect rectification. ELAS [8] is taken from the official Robust Vision Challenge package, and we implemented two-pass SGBM2 [11] using OpenCV (with SAD window size = 3, truncation value for pre-filter = 63, p1 = 216, p2 = 864, uniqueness ratio = 10, speckle window size = 100, speckle range = 32). The SGBM2 results are also post-processed with a weighted least-squares filter using default parameters.
As shown in Tab. 6, going from perfectly rectified stereo images to imperfectly rectified ones, the error rate of our method does not increase, while methods without explicit vertical-displacement handling, for example ELAS [8], suffer heavily in such situations. Compared to PWC-Net, our model achieves a lower error, possibly due to the effectiveness of volumetric filtering, and is more flexible because of the on-demand selection of the search space. A qualitative comparison is shown in Fig.
5. Though ELAS handles stereo images with perfect calibration well, it fails on imperfectly rectified pairs, yielding gross errors on repeated patterns and textureless surfaces, as indicated by the circles. Our method is not affected by the vertical displacement caused by imperfect rectification, given its pre-defined 2D search space.

Table 6: Results on Middlebury stereo images.

Method        avg. err. perfect (px)  avg. err. imperfect (px)  inc. (%)
SGBM2 [11]    14.51                   15.89                     9.5
ELAS [8]      9.89                    11.79                     19.2
PWC-Net [38]  9.41                    9.92                      5.4
Ours          9.03                    8.79                      -2.7

Figure 5: Result on the Middlebury-14 image "Stick2". Panels: Left, Right, ELAS-H (perfect), ELAS-H (imperfect), Ours-ft (perfect), Ours-ft (imperfect).

5 Discussion

We introduce efficient volumetric networks for dense 2D correspondence matching. Compared to prior SOTA, our approach is more accurate, easier to train, generalizes better, and produces multiple candidate matches. To do so, we make use of volumetric encoder-decoder layers, multi-channel cost volumes, and separable volumetric filters. Our formulation is general enough to adapt search windows on-the-fly, allowing us to repurpose optical flow networks for stereo (and vice versa) and to implement on-demand expansion of search windows. Due to limited CUDA kernel and hardware support for convolutions and poolings with non-standard shapes, the FLOPS numbers for our current implementation do not translate directly into running time; we will explore this in future work.
Acknowledgements: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research.

References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. PAMI, 39(12):2481–2495, 2017.
[2] C. Bailer, B. Taetz, and D. Stricker. Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation.
In ICCV, pages 4015–4023, 2015.
[3] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In ECCV, pages 611–625. Springer, 2012.
[4] N. D. Campbell, G. Vogiatzis, C. Hernández, and R. Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In ECCV, pages 766–779. Springer, 2008.
[5] J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In CVPR, pages 5410–5418, 2018.
[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.
[7] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, pages 2758–2766, 2015.
[8] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In ACCV, 2010.
[9] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[10] X. Guo, K. Yang, W. Yang, X. Wang, and H. Li. Group-wise correlation stereo network. In CVPR, 2019.
[11] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. PAMI, 30(2):328–341, 2008.
[12] H. Hirschmuller and S. Gehrig. Stereo matching in the presence of sub-pixel calibration errors. In CVPR, pages 437–444. IEEE, 2009.
[13] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[14] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[15] X. Hu and P. Mordohai.
A quantitative evaluation of confidence measures for stereo vision. PAMI, 34(11):2121–2133, 2012.
[16] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
[17] T.-W. Hui, X. Tang, and C. C. Loy. LiteFlowNet: A lightweight convolutional neural network for optical flow estimation. In CVPR, pages 8981–8989, June 2018.
[18] T.-W. Hui, X. Tang, and C. C. Loy. A lightweight optical flow CNN: Revisiting data fidelity and regularization. arXiv preprint arXiv:1903.07414, 2019.
[19] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, pages 2462–2470, 2017.
[20] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC. BMVA Press, 2014.
[21] J. Hur and S. Roth. Iterative residual refinement for joint optical flow and occlusion estimation. In CVPR, 2019.
[22] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, pages 66–75, 2017.
[23] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions via graph cuts. Technical report, Cornell University, 2001.
[24] S. Kong and C. Fowlkes. Multigrid predictive filter flow for unsupervised learning on videos. arXiv preprint arXiv:1904.01693, 2019.
[25] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[26] Y. LeCun, Y. Bengio, et al. Convolutional networks for images, speech, and time series.
The Handbook of Brain Theory and Neural Networks, 3361(10):1995, 1995.
[27] Y. Liu, H. Li, and X. Wang. Learning deep features via congenerous cosine loss for person recognition. arXiv preprint arXiv:1702.06890, 2017.
[28] W. Luo, A. G. Schwing, and R. Urtasun. Efficient deep learning for stereo matching. In CVPR, 2016.
[29] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016. URL http://lmb.informatik.uni-freiburg.de/Publications/2016/MIFDB16. arXiv:1512.02134.
[30] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015.
[31] I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic. Neighbourhood consensus networks. In NeurIPS, pages 1651–1662, 2018.
[32] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
[33] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
[34] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, 47(1-3):7–42, 2002.
[35] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In German Conference on Pattern Recognition, 2014.
[36] L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. arXiv preprint arXiv:1403.1687, 2014.
[37] D. Sun, S. Roth, and M. J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them.
IJCV, 106(2):115–137, 2014.
[38] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, 2018.
[39] D. Sun, X. Yang, M. Liu, and J. Kautz. Models matter, so does training: An empirical study of CNNs for optical flow estimation. PAMI, 2019.
[40] S. Tulyakov, A. Ivanov, and F. Fleuret. Practical deep stereo (PDS): Toward applications-friendly deep stereo matching. In NeurIPS, pages 5871–5881, 2018.
[41] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille. NormFace: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1041–1049. ACM, 2017.
[42] J. Xu, R. Ranftl, and V. Koltun. Accurate optical flow via direct cost volume processing. In CVPR, pages 1289–1297, 2017.
[43] L. Xu, J. Jia, and Y. Matsushita. Motion detail preserving optical flow estimation. PAMI, 34(9):1744–1757, 2011.
[44] Z. Yin, T. Darrell, and F. Yu. Hierarchical discrete distribution decomposition for match density estimation. In CVPR, 2019.
[45] J. Zbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016.
[46] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018.