{"title": "Do Convnets Learn Correspondence?", "book": "Advances in Neural Information Processing Systems", "page_first": 1601, "page_last": 1609, "abstract": "Convolutional neural nets (convnets) trained from massive labeled datasets have substantially improved the state-of-the-art in image classification and object detection. However, visual understanding requires establishing correspondence on a finer level than object category. Given their large pooling regions and training from whole-image labels, it is not clear that convnets derive their success from an accurate correspondence model which could be used for precise localization. In this paper, we study the effectiveness of convnet activation features for tasks requiring correspondence. We present evidence that convnet features localize at a much finer scale than their receptive field sizes, that they can be used to perform intraclass aligment as well as conventional hand-engineered features, and that they outperform conventional features in keypoint prediction on objects from PASCAL VOC 2011.", "full_text": "Do Convnets Learn Correspondence?\n\nTrevor Darrell\nJonathan Long\n{jonlong, nzhang, trevor}@cs.berkeley.edu\n\nUniversity of California \u2013 Berkeley\n\nNing Zhang\n\nAbstract\n\nConvolutional neural nets (convnets) trained from massive labeled datasets [1]\nhave substantially improved the state-of-the-art in image classi\ufb01cation [2] and ob-\nject detection [3]. However, visual understanding requires establishing correspon-\ndence on a \ufb01ner level than object category. Given their large pooling regions and\ntraining from whole-image labels, it is not clear that convnets derive their success\nfrom an accurate correspondence model which could be used for precise localiza-\ntion. In this paper, we study the effectiveness of convnet activation features for\ntasks requiring correspondence. 
We present evidence that convnet features localize\nat a much finer scale than their receptive field sizes, that they can be used to\nperform intraclass alignment as well as conventional hand-engineered features, and\nthat they outperform conventional features in keypoint prediction on objects from\nPASCAL VOC 2011 [4].\n\n1 Introduction\n\nRecent advances in convolutional neural nets [2] dramatically improved the state-of-the-art in image\nclassification. Despite the magnitude of these results, many doubted [5] that the resulting features\nhad the spatial specificity necessary for localization; after all, whole image classification can rely\non context cues and overly large pooling regions to get the job done. For coarse localization, such\ndoubts were alleviated by record-breaking results extending the same features to detection on PASCAL [3].\nNow, the same questions loom on a finer scale. Are the modern convnets that excel at classification\nand detection also able to find precise correspondences between object parts? Or do large receptive\nfields mean that correspondence is effectively pooled away, making this a task better suited for\nhand-engineered features?\nIn this paper, we provide evidence that convnet features perform at least as well as conventional\nones, even in the regime of point-to-point correspondence, and we show considerable performance\nimprovement in certain settings, including category-level keypoint prediction.\n\n1.1 Related work\n\nImage alignment\nImage alignment is a key step in many computer vision tasks, including face\nverification, motion analysis, stereo matching, and object recognition. 
Alignment results in correspondence\nacross different images by removing intraclass variability and canonicalizing pose.\nAlignment methods exist on a supervision spectrum from requiring manually labeled fiducial points\nor landmarks, to requiring class labels, to fully unsupervised joint alignment and clustering models.\nCongealing [6] is an unsupervised joint alignment method based on an entropy objective. Deep\ncongealing [7] builds on this idea by replacing hand-engineered features with unsupervised feature\nlearning from multiple resolutions. Inspired by optical flow, SIFT flow [8] matches densely sampled\nSIFT features for correspondence and has been applied to motion prediction and motion transfer. In\nSection 3, we apply SIFT flow using deep features for aligning different instances of the same class.\n\nKeypoint localization Semantic parts carry important information for object recognition, object\ndetection, and pose estimation. In particular, fine-grained categorization, the subject of many recent\nworks, depends strongly on part localization [9, 10]. Large pose and appearance variation across\nexamples makes part localization for generic object categories a challenging task.\nMost of the existing works on part localization or keypoint prediction focus on either facial landmark\nlocalization [11] or human pose estimation. Human pose estimation has been approached using\ntree-structured methods to model the spatial relationships between parts [12, 13, 14], and also using\nposelets [15] as an intermediate step to localize human keypoints [16, 17]. Tree-structured models\nand poselets may struggle when applied to generic objects with large articulated deformations and\nwide shape variance.\n\nDeep learning Convolutional neural networks have gained much recent attention due to their success\nin image classification [2]. Convnets trained with backpropagation were initially successful in\ndigit recognition [18] and OCR [19]. 
The feature representations learned from large data sets have\nbeen found to generalize well to other image classification tasks [20] and even to object detection\n[3, 21]. Recently, Toshev et al. [22] trained a cascade of regression-based convnets for human pose\nestimation and Jain et al. [23] combined a weak spatial model with deep learning methods.\nThe latter work trains multiple small, independent convnets on 64 \u00d7 64 patches for binary body-\npart detection. In contrast, in Section 5 we employ a powerful pretrained ImageNet model that shares\nmid-level feature representations among all parts.\nSeveral recent works have attempted to analyze and explain this overwhelming success. Zeiler and\nFergus [24] provide several heuristic visualizations suggesting coarse localization ability. Szegedy\net al. [25] show counterintuitive properties of the convnet representation, and suggest that individual\nfeature channels may not be more semantically meaningful than other bases in feature space. A\nconcurrent work [26] compares convnet features with SIFT in a standard descriptor matching task.\nThe present work illuminates and extends that comparison by providing visual analysis and by moving\nbeyond single instance matching to intraclass correspondence and keypoint prediction.\n\n1.2 Preliminaries\n\nWe perform experiments using a network architecture almost identical (see footnote 1) to that popularized by\nKrizhevsky et al. [2] and trained for classification using the 1.2 million images of the ILSVRC\n2012 challenge dataset [1]. All experiments are implemented using caffe [27], and our network\nis the publicly available caffe reference model. We use the activations of each layer as features,\nreferred to as convn, pooln, or fcn for the nth convolutional, pooling, or fully connected layer,\nrespectively. 
We will use the term receptive \ufb01eld, abbreviated rf, to refer to the set of input pixels\nthat are path-connected to a particular unit in the convnet.\n\n2 Feature visualization\n\nTable 1: Convnet receptive \ufb01eld sizes and strides,\nfor an input of size 227 \u00d7 227.\n\nIn this section and Figures 1 and 2, we provide a\nnovel visual investigation of the effective pool-\ning regions of convnet features.\nIn Figure 1, we perform a nonparametric recon-\nstruction of images from features in the spirit\nof HOGgles [28]. Rather than paired dictionary\nlearning, however, we simply replace patches\nwith averages of their top-k nearest neighbors\nin a convnet feature space. To do so, we \ufb01rst\ncompute all features at a particular layer, re-\nsulting in an 2d grid of feature vectors. We as-\nsociate each feature vector with a patch in the\noriginal image at the center of the corresponding receptive \ufb01eld and with size equal to the receptive\n\ufb01eld stride. (Note that the strides of the receptive \ufb01elds are much smaller than the receptive \ufb01elds\n\nlayer\nconv1\nconv2\nconv3\nconv4\nconv5\npool5\n\nrf size\n11 \u00d7 11\n51 \u00d7 51\n99 \u00d7 99\n131 \u00d7 131\n163 \u00d7 163\n195 \u00d7 195\n\nstride\n4 \u00d7 4\n8 \u00d7 8\n16 \u00d7 16\n16 \u00d7 16\n16 \u00d7 16\n32 \u00d7 32\n\n1Ours reverses the order of the response normalization and pooling layers.\n\n2\n\n\fconv3\n\nconv4\n\nconv5\n\nuniform rf\n\nr\no\nb\nh\ng\ni\ne\nn\n\n1\n\ns\nr\no\nb\nh\ng\ni\ne\nn\n\n5\n\nr\no\nb\nh\ng\ni\ne\nn\n\n1\n\ns\nr\no\nb\nh\ng\ni\ne\nn\n\n5\n\nFigure 1: Even though they have large receptive \ufb01elds, convnet features carry local information at\na \ufb01ner scale. Upper left: given an input image, we replaced 16 \u00d7 16 patches with averages over\n1 or 5 nearest neighbor patches, computed using convnet features centered at those patches. 
The\nyellow square illustrates one input patch, and the black squares show the corresponding rfs for the\nthree layers shown. Right: Notice that the features retrieve reasonable matches for the centers of\ntheir receptive \ufb01elds, even though those rfs extend over large regions of the source image. In the\n\u201cuniform rf\u201d column, we show the best that could be expected if convnet features discarded all\nspatial information within their rfs, by choosing input patches uniformly at random from conv3-\nsized neighborhoods. (Best viewed electronically.)\n\nthemselves, which overlap. Refer to Table 1 above for speci\ufb01c numbers.) We replace each such\npatch with an average over k nearest neighbor patches using a database of features densely com-\nputed on the images of PASCAL VOC 2011. Our database contains at least one million patches for\nevery layer. Features are matched by cosine similarity.\nEven though the feature rfs cover large regions of the source images, the speci\ufb01c resemblance of\nthe resulting images shows that information is not spread uniformly throughout those regions. No-\ntable features (e.g., the tires of the bicycle and the facial features of the cat) are replaced in their\ncorresponding locations. Also note that replacement appears to become more semantic and less\nvisually speci\ufb01c as the layer deepens: the eyes and nose of the cat get replaced with differently col-\nored or shaped eyes and noses, and the fur gets replaced with various animal furs, with the diversity\nincreasing with layer number.\nFigure 2 gives a feature-centric rather than image-centric view of feature locality. For each column,\nwe \ufb01rst pick a random seed feature vector (computed from a PASCAL image), and \ufb01nd k nearest\nneighbor features, again by cosine similarity. Instead of averaging only the centers, we average\nthe entire receptive \ufb01elds of the neighbors. 
The resulting images show that similar features tend to\nrespond to similar colors specifically in the centers of their receptive fields.\n\n[Figure 2 panels: columns conv3, conv4, conv5; rows 5 nbrs, 50 nbrs, 500 nbrs]\n\nFigure 2: Similar convnet features tend to have similar receptive field centers. Starting from a\nrandomly selected seed patch occupying one rf in conv3, 4, or 5, we find the nearest k neighbor\nfeatures computed on a database of natural images, and average together the corresponding receptive\nfields. The contrast of each image has been expanded after averaging. (Note that since each layer\nis computed with a stride of 16, there is an upper bound on the quality of alignment that can be\nwitnessed here.)\n\n3 Intraclass alignment\n\nWe conjecture that category learning implicitly aligns instances by pooling over a discriminative\nmid-level representation. If this is true, then such features should be useful for post-hoc alignment\nin a similar fashion to conventional features. To test this, we use convnet features for the task of\naligning different instances of the same class. We approach this difficult task in the style of SIFT\nflow [8]: we retrieve near neighbors using a coarse similarity measure, and then compute dense\ncorrespondences on which we impose an MRF smoothness prior which finally allows all images to\nbe warped into alignment.\nNearest neighbors are computed using fc7 features. Since we are specifically testing the quality of\nalignment, we use the same nearest neighbors for convnet or conventional features, and we compute\nboth types of features at the same locations, the grid of convnet rf centers in the response to a single\nimage.\nAlignment is determined by solving an MRF formulated on this grid of feature locations. Let p be a\npoint on this grid, let f_s(p) be the feature vector of the source image at that point, and let f_t(p) be the\nfeature vector of the target image at that point. For each feature grid location p of the source image,\nthere is a vector w(p) giving the displacement of the corresponding feature in the target image. We\nuse the energy function\n\n$$E(w) = \\sum_{p} \\|f_s(p) - f_t(p + w(p))\\|_2 + \\beta \\sum_{(p,q) \\in \\mathcal{E}} \\|w(p) - w(q)\\|_2^2,$$\n\nwhere $\\mathcal{E}$ is the set of edges of a 4-neighborhood graph and \u03b2 is the regularization parameter.\nOptimization is performed using belief propagation, with the techniques suggested in [29]. Message passing\nis performed efficiently using the squared Euclidean distance transform [30]. (Unlike the L1 regularization\noriginally used by SIFT flow [8], this formulation maintains rotational invariance of w.)\nBased on its performance in the next section, we use conv4 as our convnet feature, and SIFT with\ndescriptor radius 20 as our conventional feature. From validation experiments, we set \u03b2 = 3 \u00b7 10^{\u22123}\nfor both conv4 and SIFT features (which have a similar scale).\nGiven the alignment field w, we warp target to source using bivariate spline interpolation (implemented\nin SciPy [31]). Figure 3 gives examples of alignment quality for a few different seed images,\nusing both SIFT and convnet features. We show five warped nearest neighbors as well as keypoints\ntransferred from those neighbors.\nWe quantitatively assess the alignment by measuring the accuracy of predicted keypoints. To obtain\ngood predictions, we warp 25 nearest neighbors for each target image, and order them from smallest\nto greatest deformation energy (we found this method to outperform ordering using the data term).\nWe take the predicted keypoints to be the median points (coordinate-wise) of the top five aligned\nkeypoints according to this ordering.\nWe assess correctness using mean PCK [32]. 
We consider a ground truth keypoint to be correctly\npredicted if the prediction lies within a Euclidean distance of \u03b1 times the maximum of the bounding\n\n4\n\n\ftarget image\n\n\ufb01ve nearest neighbors\n\nw\no\n\ufb02\n4\nv\nn\no\nc\n\nw\no\n\ufb02\nT\nF\nI\nS\n\nw\no\n\ufb02\n4\nv\nn\no\nc\n\nw\no\n\ufb02\nT\nF\nI\nS\n\nFigure 3: Convnet features can bring different instances of the same class into good alignment at\nleast as well (on average) as traditional features. For each target image (left column), we show\nwarped versions of \ufb01ve nearest neighbor images aligned with conv4 \ufb02ow (\ufb01rst row), and warped\nversions aligned with SIFT \ufb02ow [8] (second row). Keypoints from the warped images are shown\ncopied to the target image. The cat shows a case where convnet features perform better, while the\nbicycle shows a case where SIFT features perform better. (Note that each instance is warped to a\nsquare bounding box before alignment. Best viewed in color.)\n\nTable 2: Keypoint transfer accuracy using convnet \ufb02ow, SIFT \ufb02ow, and simple copying from nearest\nneighbors. Accuracy (PCK) is shown per category using \u03b1 = 0.1 (see text) and means are also\nshown for the stricter values \u03b1 = 0.05 and 0.025. 
On average, convnet \ufb02ow performs as well as\nSIFT \ufb02ow, and performs a bit better for stricter tolerances.\n\naero bike bird boat bttl bus car\n\ncat chair cow table dog horse mbike prsn plant sheep sofa train tv mean\nconv4 \ufb02ow 28.2 34.1 20.4 17.1 50.6 36.7 20.9 19.6 15.7 25.4 12.7 18.7 25.9 23.1 21.4 40.2 21.1 14.5 18.3 33.3 24.9\nSIFT \ufb02ow 27.6 30.8 19.9 17.5 49.4 36.4 20.7 16.0 16.1 25.0 16.1 16.3 27.7 28.3 20.2 36.4 20.5 17.2 19.9 32.9 24.7\nNN transfer 18.3 24.8 14.5 15.4 48.1 27.6 16.0 11.1 12.0 16.8 15.7 12.7 20.2 18.5 18.7 33.4 14.0 15.5 14.6 30.0 19.9\n\nmean \u03b1 = 0.1 \u03b1 = 0.05 \u03b1 = 0.025\n\nconv4 \ufb02ow\nSIFT \ufb02ow\nNN transfer\n\n24.9\n24.7\n19.9\n\n11.8\n10.9\n7.8\n\n4.08\n3.55\n2.35\n\nbox width and height, picking some \u03b1 \u2208 [0, 1]. We compute the overall accuracy for each type of\nkeypoint, and report the average over keypoint types. We do not penalize predicted keypoints that\nare not visible in the target image.\nResults are given in Table 2. We show per category results using \u03b1 = 0.1, and mean results for\n\u03b1 = 0.1, 0.05, and 0.025.\nIndeed, convnet learned features are at least as capable as SIFT at\nalignment, and better than might have been expected given the size of their receptive \ufb01elds.\n\n4 Keypoint classi\ufb01cation\n\nIn this section, we speci\ufb01cally address the ability of convnet features to understand semantic in-\nformation at the scale of parts. As an initial test, we consider the task of keypoint classi\ufb01cation:\ngiven an image and the coordinates of a keypoint on that image, can we train a classi\ufb01er to label the\nkeypoint?\n\n5\n\n\fTable 3: Keypoint classi\ufb01cation accuracies, in percent, on the twenty categories of PASCAL 2011\nval, trained with SIFT or convnet features. 
The best SIFT and convnet scores are bolded in each\ncategory.\n\nSIFT 10 36\n(radius) 20 37\n40 35\n80 33\n160 27\n1 16\n2 37\n3 42\n4 44\n5 44\n\nconv\n(layer)\n\n42\n50\n54\n43\n36\n14\n43\n50\n53\n51\n\n36\n39\n37\n37\n34\n15\n40\n46\n49\n49\n\naero bike bird boat bttl bus car cat chair cow table dog horse mbike prsn plant sheep sofa train tv mean\n64 75 45\n68 77 50\n74 78 51\n69 77 48\n66 76 44\n27 29 20\n63 72 47\n71 77 54\n73 76 56\n71 75 54\n\n32 67 64 40 37 33\n35 74 67 47 40 36\n41 76 68 47 37 39\n42 75 66 42 30 43\n38 72 59 35 25 39\n19 20 29 15 22 16\n35 69 63 38 44 35\n41 76 69 46 52 39\n42 78 70 45 55 41\n41 77 68 44 53 39\n\n37\n43\n40\n36\n30\n17\n40\n45\n48\n45\n\n42\n52\n52\n49\n48\n12\n41\n50\n52\n47\n\n37\n44\n39\n35\n29\n18\n39\n46\n49\n47\n\n63\n70\n69\n70\n70\n33\n65\n74\n76\n73\n\n60\n68\n69\n70\n67\n29\n61\n64\n68\n63\n\n34\n38\n36\n31\n27\n17\n38\n47\n51\n50\n\n39\n42\n42\n36\n32\n14\n40\n48\n51\n49\n\n38\n48\n49\n51\n46\n16\n44\n52\n53\n52\n\n29\n33\n32\n27\n25\n15\n34\n40\n41\n39\n\n(a)\n\n(b)\n\nFigure 5: Cross validation scores for cat\nkeypoint classi\ufb01cation as a function of\nthe SVM parameter C. In (a), we plot\nmean accuracy against C for \ufb01ve dif-\nferent convnet features; in (b) we plot\nthe same for SIFT features of different\nsizes. We use C = 10\u22126 for all experi-\nments in Table 3.\n\n(a) cat left eye\n\n(b) cat nose\n\nFigure 4: Convnet features show \ufb01ne\nlocalization ability, even beyond their\nstride and in cases where SIFT features\ndo not perform as well. Each plot is\na 2D histogram of the locations of the\nmaximum responses of a classifer in a\n21 by 21 pixel rectangle taken around a\nground truth keypoint.\n\nFor this task we use keypoint data [15] on the twenty classes of PASCAL VOC 2011 [4]. We extract\nfeatures at each keypoint using SIFT [33] and using the column of each convnet layer whose rf\ncenter lies closest to the keypoint. 
(Note that the SIFT features will be more precisely placed as a\nresult of this approximation.) We trained one-vs-all linear SVMs on the train set using SIFT at \ufb01ve\ndifferent radii and each of the \ufb01ve convolutional layer activations as features (in general, we found\npooling and normalization layers to have lower performance). We set the SVM parameter C = 10\u22126\nfor all experiments based on \ufb01ve-fold cross validation on the training set (see Figure 5).\nTable 3 gives the resulting accuracies on the val set. We \ufb01nd features from convnet layers consis-\ntently perform at least as well as and often better than SIFT at this task, with the highest performance\ncoming from layers conv4 and conv5. Note that we are speci\ufb01cally testing convnet features trained\nonly for classi\ufb01cation; the same net could be expected to achieve even higher performance if trained\nfor this task.\nFinally, we study the precise location understanding of our classi\ufb01ers by computing their responses\nwith a single-pixel stride around ground truth keypoint locations. For two example keypoints (cat\nleft eye and nose), we histogram the locations of the maximum responses within a 21 pixel by 21\npixel rectangle around the keypoint, shown in Figure 4. We do not include maximum responses\nthat lie on the boundary of this rectangle. While the SIFT classi\ufb01ers do not seem to be sensitive\nto the precise locations of the keypoints, in many cases the convnet ones seem to be capable of\nlocalization \ufb01ner than their strides, not just their receptive \ufb01eld sizes. This observation motivates\nour \ufb01nal experiments to consider detection-based localization performance.\n\n6\n\n\f5 Keypoint prediction\n\nWe have seen that despite their large receptive \ufb01eld sizes, convnets work as well as the hand-\nengineered feature SIFT for alignment and slightly better than SIFT for keypoint classi\ufb01cation.\nKeypoint prediction provides a natural follow-up test. 
As in Section 3, we use keypoint annotations\nfrom PASCAL VOC 2011, and we assume a ground truth bounding box.\nInspired in part by [3, 34, 23], we train sliding window part detectors to predict keypoint locations\nindependently. R-CNN [3] and OverFeat [34] have both demonstrated the effectiveness of deep con-\nvolutional networks on the generic object detection task. However, neither of them have investigated\nthe application of CNNs for keypoint prediction.2 R-CNN starts from bottom-up region proposal\n[35], which tends to overlook the signal from small parts. OverFeat, on the other hand, combines\nconvnets trained for classi\ufb01cation and for regression and runs in multi-scale sliding window fashion.\nWe rescale each bounding box to 500 \u00d7 500 and compute conv5 (with a stride of 16 pixels). Each\ncell of conv5 contains one 256-dimensional descriptor. We concatenate conv5 descriptors from a\nlocal region of 3 \u00d7 3 cells, giving an overall receptive \ufb01eld size of 195 \u00d7 195 and feature dimension\nof 2304. For each keypoint, we train a linear SVM with hard negative mining. We consider the ten\nclosest features to each ground truth keypoint as positive examples, and all the features whose rfs\ndo not contain the keypoint as negative examples. We also train using dense SIFT descriptors for\ncomparison. We compute SIFT on a grid of stride eight and bin size of eight using VLFeat [36]. For\nSIFT, we consider features within twice the bin size from the ground truth keypoint to be positives,\nwhile samples that are at least four times the bin size away are negatives.\nWe augment our SVM detectors with a spherical Gaussian prior over candidate locations constructed\nby nearest neighbor matching. The mean of each Gaussian is taken to be the location of the keypoint\nin the nearest neighbor in the training set found using cosine similarity on pool5 features, and we\nuse a \ufb01xed standard deviation of 22 pixels. 
Let s(X_i) be the output score of our local detector for\nkeypoint X_i, and let p(X_i) be the prior score. We combine these to yield a final score\n$f(X_i) = s(X_i)^{1-\\eta} p(X_i)^{\\eta}$, where \u03b7 \u2208 [0, 1] is a tradeoff parameter. In our experiments, we set \u03b7 = 0.1 by\ncross validation. At test time, we predict the keypoint location as the highest scoring candidate over\nall feature locations.\nWe evaluate the predicted keypoints using the measure PCK introduced in Section 3, taking \u03b1 = 0.1.\nA predicted keypoint is defined as correct if the distance between it and the ground truth keypoint is\nless than \u03b1 \u00b7 max(h, w) where h and w are the height and width of the bounding box. The results\nusing conv5 and SIFT with and without the prior are shown in Table 4. From the table, we can see\nthat local part detectors trained on the conv5 feature outperform SIFT by a large margin and that the\nprior information is helpful in both cases. To our knowledge, these are the first keypoint prediction\nresults reported on this dataset. We show example results from five different categories in Figure\n6. Each set consists of rescaled bounding box images with ground truth keypoint annotations and\npredicted keypoints using SIFT and conv5 features, where each color corresponds to one keypoint.\nAs the figure shows, conv5 outperforms SIFT, often managing satisfactory outputs despite the\nchallenge of this task. A small offset can be noticed for some keypoints like eyes and noses, likely\ndue to the limited stride of our scanning windows. A final regression or finer stride could mitigate\nthis issue.\n\n6 Conclusion\n\nThrough visualization, alignment, and keypoint prediction, we have studied the ability of the intermediate\nfeatures implicitly learned in a state-of-the-art convnet classifier to understand specific,\nlocal correspondence. 
Despite their large receptive \ufb01elds and weak label training, we have found in\nall cases that convnet features are at least as useful (and sometimes considerably more useful) than\nconventional ones for extracting local visual information.\nAcknowledgements This work was supported in part by DARPA\u2019s MSEE and SMISC programs, by NSF\nawards IIS-1427425, IIS-1212798, and IIS-1116411, and by support from Toyota.\n\n2But see works cited in Section 1.1 regarding keypoint localization.\n\n7\n\n\fTable 4: Keypoint prediction results on PASCAL VOC 2011. The numbers give average accuracy\nof keypoint prediction using the criterion described in Section 3, PCK with \u03b1 = 0.1.\n\nSIFT\n\naero bike bird boat bttl bus car\ncat chair cow table dog horse mbike prsn plant sheep sofa train tv mean\n17.9 16.5 15.3 15.6 25.7 21.7 22.0 12.6 11.3 7.6 6.5 12.5 18.3 15.1 15.9 21.3 14.7 15.1 9.2 19.9 15.7\nSIFT+prior 33.5 36.9 22.7 23.1 44.0 42.6 39.3 22.1 18.5 23.5 11.2 20.6 32.2 33.9 26.7 30.6 25.7 26.5 21.9 32.4 28.4\n38.5 37.6 29.6 25.3 54.5 52.1 28.6 31.5 8.9 30.5 24.1 23.7 35.8 29.9 39.3 38.2 30.5 24.5 41.5 42.0 33.3\nconv5+prior 50.9 48.8 35.1 32.5 66.1 62.0 45.7 34.2 21.4 41.1 27.2 29.3 46.8 45.6 47.1 42.5 38.8 37.6 50.7 45.6 42.5\n\nconv5\n\nGroundtruth\n\nSIFT+prior\n\nconv5+prior\n\nGroundtruth\n\nSIFT+prior\n\nconv5+prior\n\nFigure 6: Examples of keypoint prediction on \ufb01ve classes of the PASCAL dataset: aeroplane, cat,\ncow, potted plant, and horse. Each keypoint is associated with one color. The \ufb01rst column is the\nground truth annotation, the second column is the prediction result of SIFT+prior and the third\ncolumn is conv5+prior. (Best viewed in color).\n\nReferences\n[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei.\n\nImage Database. In CVPR, 2009.\n\nImageNet: A Large-Scale Hierarchical\n\n[2] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classi\ufb01cation with deep convolutional neural net-\n\nworks. In NIPS, 2012.\n\n[3] R. 
Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection\n\nand semantic segmentation. In CVPR, 2014.\n[4] M. Everingham, L. Van Gool, C. K.\n\nPASCAL Visual Object Classes Challenge 2011 (VOC2011) Results.\nnetwork.org/challenges/VOC/voc2011/workshop/index.html.\n\nI. Williams,\n\nJ. Winn,\n\nand A. Zisserman.\n\nThe\nhttp://www.pascal-\n\n[5] Debate on Yann LeCun\u2019s Google+ page. https://plus.google.com/+YannLeCunPhD/posts/JBBFfv2XgWM.\n\nAccessed: 2014-5-31.\n\n[6] G. B. Huang, V. Jain, and E. Learned-Miller. Unsupervised joint alignment of complex images. In ICCV,\n\n2007.\n\n8\n\n\f[7] G. B. Huang, M. A. Mattar, H. Lee, and E. Learned-Miller. Learning to align from scratch. In NIPS,\n\n2012.\n\n[8] C. Liu, J. Yuen, and A. Torralba. Sift \ufb02ow: Dense correspondence across scenes and its applications.\n\nPAMI, 33(5):978\u2013994, 2011.\n\n[9] J. Liu and P. N. Belhumeur. Bird part localization using exemplar-based models with enforced pose and\n\nsubcategory consistenty. In ICCV, 2013.\n\n[10] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for \ufb01ne-grained categorization, face\n\nveri\ufb01cation, and attribute estimation. In CVPR, 2013.\n\n[11] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus\n\nof exemplars. In CVPR, 2011.\n\n[12] Y. Yang and D. Ramanan. Articulated pose estimation using \ufb02exible mixtures of parts. In CVPR, 2011.\n[13] M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In\n\nICCV, 2011.\n\n[14] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild.\n\nCVPR, 2012.\n\nIn\n\n[15] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In\n\nICCV, 2009.\n\n[16] G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing\n\ntheir keypoints. 
In CVPR, 2014.\n\n[17] G. Gkioxari, P. Arbelaez, L. Bourdev, and J. Malik. Articulated pose estimation using discriminative\n\narmlet classi\ufb01ers. In CVPR, 2013.\n\n[18] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backprop-\n\nagation applied to hand-written zip code recognition. In Neural Computation, 1989.\n\n[19] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.\n\nIn Proceedings of the IEEE, pages 2278\u20132324, 1998.\n\n[20] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convo-\n\nlutional activation feature for generic visual recognition. In ICML, 2014.\n\n[21] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-\n\nstage feature learning. In CVPR, 2013.\n\n[22] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.\n[23] A. Jain, J. Tompson, M. Andriluka, G. W. Taylor, and C. Bregler. Learning human pose estimation\n\nfeatures with convolutional networks. In ICLR, 2014.\n\n[24] M. D Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks.\n\n2014.\n\n[25] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus.\n\nproperties of neural networks. In ICLR, 2014.\n\nIn ECCV,\n\nIntriguing\n\n[26] P. Fischer, A. Dosovitskiy, and T. Brox. Descriptor Matching with Convolutional Neural Networks: a\n\nComparison to SIFT. ArXiv e-prints, May 2014.\n\n[27] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe:\n\nConvolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[28] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing Object Detection Fea-\n\ntures. In ICCV, 2013.\n\n[29] P. Felzenszwalb and D. P. Huttenlocher. 
Ef\ufb01cient belief propagation for early vision. International journal\n\nof computer vision, 70(1):41\u201354, 2006.\n\n[30] P. Felzenszwalb and D. Huttenlocher. Distance transforms of sampled functions. Technical report, Cornell\n\nUniversity, 2004.\n\n[31] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scienti\ufb01c tools for Python, 2001.\n[32] Y. Yang and D. Ramanan. Articulated human detection with \ufb02exible mixtures of parts. In PAMI, 2013.\n[33] D.G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.\n[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition,\n\nlocalization and detection using convolutional networks. In ICLR, 2014.\n\n[35] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV,\n\n2013.\n\n[36] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms.\n\nhttp://www.vlfeat.org/, 2008.\n\n9\n\n\f", "award": [], "sourceid": 845, "authors": [{"given_name": "Jonathan", "family_name": "Long", "institution": "UC Berkeley"}, {"given_name": "Ning", "family_name": "Zhang", "institution": "UC Berkeley"}, {"given_name": "Trevor", "family_name": "Darrell", "institution": "UC Berkeley"}]}