{"title": "CliqueCNN: Deep Unsupervised Exemplar Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3846, "page_last": 3854, "abstract": "Exemplar learning is a powerful paradigm for discovering visual similarities in an unsupervised manner. In this context, however, the recent breakthrough in deep learning could not yet unfold its full potential. With only a single positive sample, a great imbalance between one positive and many negatives, and unreliable relationships between most samples, training of convolutional neural networks is impaired. Given weak estimates of local distance we propose a single optimization problem to extract batches of samples with mutually consistent relations. Conflicting relations are distributed over different batches and similar samples are grouped into compact cliques. Learning exemplar similarities is framed as a sequence of clique categorization tasks. The CNN then consolidates transitivity relations within and between cliques and learns a single representation for all samples without the need for labels. The proposed unsupervised approach has shown competitive performance on detailed posture analysis and object classification.", "full_text": "CliqueCNN: Deep Unsupervised Exemplar Learning\n\nMiguel A. Bautista\u2217, Artsiom Sanakoyeu\u2217, Ekaterina Sutter, Bj\u00f6rn Ommer\n\nfirstname.lastname@iwr.uni-heidelberg.de\n\nHeidelberg Collaboratory for Image Processing\n\nIWR, Heidelberg University, Germany\n\nAbstract\n\nExemplar learning is a powerful paradigm for discovering visual similarities in\nan unsupervised manner. In this context, however, the recent breakthrough in\ndeep learning could not yet unfold its full potential. With only a single positive\nsample, a great imbalance between one positive and many negatives, and unreliable\nrelationships between most samples, training of Convolutional Neural networks is\nimpaired. 
Given weak estimates of local distance we propose a single optimization\nproblem to extract batches of samples with mutually consistent relations. Con\ufb02ict-\ning relations are distributed over different batches and similar samples are grouped\ninto compact cliques. Learning exemplar similarities is framed as a sequence of\nclique categorization tasks. The CNN then consolidates transitivity relations within\nand between cliques and learns a single representation for all samples without\nthe need for labels. The proposed unsupervised approach has shown competitive\nperformance on detailed posture analysis and object classi\ufb01cation.\n\n1\n\nIntroduction\n\nVisual similarity learning is the foundation for numerous computer vision subtasks ranging from\nlow-level image processing to high-level object recognition or posture analysis. A common paradigm\nhas been category-level recognition, where categories and the similarities of all their instances\nto other classes are jointly modeled. However, large intra-class variability has recently spurred\nexemplar methods [15, 11], which split this problem into simpler sub-tasks. Therefore, separate\nexemplar classi\ufb01ers are trained by learning the similarities of individual exemplars against a large\nset of negatives. The exemplar paradigm has been successfully employed in diverse areas such as\nsegmentation [11], grouping [10], instance retrieval [2, 19], and object recognition [15, 5]. Learning\nsimilarities is also of particular importance for posture analysis [8] and video parsing [17].\nAmong the many approaches for similarity learning, supervised techniques have been particularly\npopular in the vision community, leading to the formulation as a ranking [23], regression [6], and\nclassi\ufb01cation [17] task. With the recent advances of convolutional neural networks (CNN), two-stream\narchitectures [25] and ranking losses [21] have shown great improvements. 
However, to achieve their\nperformance gain, CNN architectures require millions of samples of supervised training data or at\nleast fine-tuning [3] on large datasets such as PASCAL VOC. Although the amount of accessible\nimage data is increasing at an enormous rate, supervised labeling of similarities is very costly. In\naddition, not only similarities between images are important, but especially those between objects and their\nparts. Annotating the fine-grained similarities between all these entities is hopelessly complex, in\nparticular for the large datasets typically used for training CNNs.\nUnsupervised deep learning of similarities that does not require any labels for pre-training or\nfine-tuning is, therefore, of great interest to the vision community. This way we can utilize large\n\n\u2217Both authors contributed equally to this work.\nProject on GitHub: https://github.com/asanakoy/cliquecnn\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fimage datasets without being limited by the need for costly manual annotations. However, CNNs for\nexemplar-based learning have been rare [4] due to limitations resulting from the widely used softmax\nloss. The learning task suffers from only a single positive instance, it is highly unbalanced with many\nmore negatives, and the relationships between samples are unknown, cf. Sec. 2. Consequently,\nstochastic gradient descent (SGD) gets corrupted and has a bias towards negatives, thus forfeiting\nthe benefits of deep learning.\nOutline of the proposed approach: We overcome these limitations by alternately updating similarities and\nthe CNN. Typically, at the beginning only a few local estimates of (dis-)similarity are easily available,\ni.e., pairs of samples that are highly similar (near duplicates) or that are very distant. 
Most of the\nsimilarities are, however, unknown or mutually contradictory, so that transitivity does not hold.\nTherefore, we initially can only gather small, compact cliques of mutually similar samples around an\nexemplar, but for most pairs of exemplars we know neither that they are similar nor that they are dissimilar. To nevertheless\ndefine balanced classification tasks suited for CNN training, we formulate an optimization problem\nthat builds training batches for the CNN by selecting groups of compact cliques, so that all cliques in\na batch are mutually distant. Thus for all samples of a batch (dis-)similarity is defined: they either\nbelong to the same compact clique or are far away and belong to different cliques. However, pairs of\nsamples with no reliable similarities end up in different batches, so they do not yield a false training\nsignal for SGD. Classifying whether a sample belongs to a clique serves as a pretext task for learning\nexemplar similarity. Training the network then implicitly reconciles the transitivity relations between\nsamples in different batches. Thus, the learned CNN representations impute similarities that were\ninitially unavailable and generalize them to unseen data.\nIn the experimental evaluation the proposed approach significantly improves over state-of-the-art\napproaches for posture analysis and retrieval by learning a general feature representation for human\npose that can be transferred across datasets.\n1.1 Exemplar Based Methods for Similarity Learning\nThe Exemplar Support Vector Machine (Exemplar-SVM) has been one of the driving methods for\nexemplar based learning [15]. Each Exemplar-SVM classifier is defined by a single positive instance\nand a large set of negatives. To improve performance, Exemplar-SVMs require several rounds of hard\nnegative mining, greatly increasing the computational cost of this approach. 
To circumvent this high\ncomputational cost, [10] proposes to train Linear Discriminant Analysis (LDA) over Histogram of\nOriented Gradients (HOG) features. LDA-whitened HOG features, with the common covariance matrix\nestimated over all exemplars, remove correlations between the HOG features, which tend to amplify\nthe background of the image.\nRecently, several CNN approaches have been proposed for supervised similarity learning using either\npairs [25] or triplets [21] of images. However, supervised formulations for learning similarities\nrequire supervisory information that scales quadratically for pairs of images, or cubically for\ntriplets. This results in very large training times.\nLiterature on exemplar based learning in CNNs is very scarce. In [4] the authors of Exemplar-CNN\ntackle the problem of unsupervised feature learning. A patch-based categorization problem is\ndesigned by randomly extracting a patch for each image in the training set and defining it as a surrogate\nclass. Hence, since this approach does not take into account (dis-)similarities between exemplars, it\nfails to model their transitivity relationships, resulting in poor performance (see Sect. 3.1).\nFurthermore, recent works by Wang et al. [22] and Doersch et al. [3] showed that temporal information\nin videos and spatial context information in images can be utilized as a convenient supervisory\nsignal for learning feature representations with CNNs. However, the computational cost of the\ntraining algorithm is enormous, since the approach in [3] needs to tackle all possible pair-wise image\nrelationships, requiring a training set that scales quadratically with the number of samples. In\ncontrast, our approach leverages the relationship information between compact cliques, defining\na multi-class classification problem. 
As each training batch contains mutually distinct cliques, the\ncomputational cost of the training algorithm is greatly decreased.\n\n2 Approach\n\nWe will now discuss how we can employ a CNN for learning similarities between all pairs of a large\nnumber of exemplars. Exemplar learning in CNNs has been a relatively unexplored approach for\nmultiple reasons. First and foremost, deep learning requires large amounts of training data, thus\nconflicting with having only a single positive exemplar in a setup that we now abbreviate as 1-sample\nCNN.\n\nFigure 1: (a) Average AUC for posture retrieval in the Olympic Sports dataset. Similarities learnt\nby (b) 1-sample CNN, (c) using NN-CNN, and (d) for the proposed approach. The plots show a\nmagnified crop of the full similarity matrix. Note the more detailed fine structure in (d).\n\nSuch a 1-sample CNN faces several issues. (i) The within-class variance of an individual\nexemplar cannot be modeled. (ii) The ratio of one exemplar and many negatives is highly imbalanced,\nso that the softmax loss over SGD batches overfits against the negatives. (iii) An SGD batch for\ntraining a CNN on multiple exemplars can contain arbitrarily similar samples with different labels (the\ndifferent exemplars may be similar or dissimilar), resulting in label inconsistencies. The proposed\nmethod overcomes these issues as follows. In Sect. 2.2 we discuss why simply merging an exemplar\nwith its nearest neighbors and data augmentation (similar in spirit to the Clustered Exemplar-SVM\n[20]) is not sufficient to address (i). Sect. 3.1 compares this NN-CNN approach against other methods.\nSect. 2.3 deals with (ii) and (iii) by generating batches of cliques that maximize the intra-clique\nsimilarity while minimizing inter-clique similarity.\nTo show the effectiveness of the proposed method we give empirical proof by training CNNs in both\nthe 1-sample CNN and NN-CNN manner. Fig. 
1(a) shows the average ROC curve for posture retrieval\nin the Olympic Sports dataset [16] (refer to Sec. 3.1 for further details) for the 1-sample CNN, NN-CNN,\nand the proposed method, which clearly outperforms both exemplar-based strategies. In addition,\nFig. 1(b-d) shows an excerpt of the similarity matrix learned by each method. It becomes evident\nthat the proposed approach captures more detailed similarity structures; e.g., the diagonal structures\ncorrespond to repetitions of the same gait cycle within a long jump.\n\n2.1 Initialization\n\nSince deep learning benefits from large amounts of data and requires more than a single exemplar\nto avoid biased gradients, we now reframe exemplar-based learning of similarities so that it can\nbe handled by a CNN. Given a single exemplar di we thus strive for related samples to enable a\nCNN training that then further improves the similarities between samples. To obtain this initial\nset of few, mutually similar samples for an exemplar, we now briefly discuss the reliability of\nstandard feature distances such as HOG features whitened using LDA [10]. HOG-LDA is a\ncomputationally effective foundation for estimating similarities sij between large numbers of samples,\nsij = s(di, dj) = φ(di)⊤φ(dj). Here φ(di) is the initial HOG-LDA representation of the exemplar\nand S = (sij) is the resulting kernel matrix.\nMost of these initial similarities are unreliable (cf. Fig. 4(b)) and, thus, the majority of samples\ncannot be properly ranked w.r.t. their similarity to an exemplar di. However, highly similar samples\nand those that are far away can be reliably identified, as they stand out from the similarity distribution.\nSubsequently we utilize these few reliable relationships to build groups of compact cliques.\n\n2.2 Compact Cliques\n\nSimply assigning the same label to all the nearest and another label to all the furthest neighbors\nof an exemplar is inappropriate. 
The samples in these groups may be close to di (or distant from it, for\nthe negative group) but not to one another, due to lacking transitivity. Moreover, mere augmentation of\nthe exemplar with synthetic data does not add transitivity relations to other samples. Therefore, to\nlearn within-class similarities we need to restrict the model to compact cliques of samples, so that all\nsamples in a clique are also mutually close to one another and deserve the same label.\n\n[Figure 1(a) plot: ROC curves with average AUC 0.62 for 1-sample-CNN, 0.65 for NN-CNN, and 0.79 for Ours.]\n\nFigure 2: Averaging of the 50 nearest neighbours for a given query frame (columns: Query, Ours, Alexnet [13], HOG-LDA [10]) using similarities obtained\nby our approach, Alexnet [13] and HOG-LDA [10].\n\nTo build candidate cliques we apply complete-linkage clustering starting at each di to merge the\nsample with its local neighborhood, so that all merged samples are mutually similar. Thus, cliques are\ncompact, differ in size, and may be mutually overlapping. To reduce redundancy, highly overlapping\ncliques are subsequently merged by clustering cliques using farthest-neighbor clustering. This\nagglomerative grouping is terminated if the intra-clique similarity of a cluster is less than half that of its\nconstituents. Let K be the resulting number of clustered cliques and N the number of samples di.\nThen C ∈ {0, 1}^{K×N} is the resulting assignment matrix of samples to cliques.\n\n2.3 Selecting Batches of Mutually Consistent Cliques\n\nWe now have a set of compact cliques that comprise all training data. Thus, one may consider training\na CNN to assign the same label to all samples of a clique. However, since only the highest/lowest\nsimilarities are reliable, samples in different cliques are not necessarily dissimilar. Forcing them into\ndifferent classes can consequently entail incorrect similarities. 
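To make the clique construction of Sect. 2.2 concrete, a minimal sketch using off-the-shelf complete-linkage clustering is shown below. The toy similarity matrix, the merge threshold, and the function name `build_cliques` are illustrative assumptions, not the authors' implementation; the exemplar-centered candidate generation and the merging of overlapping cliques are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def build_cliques(S, merge_thresh):
    """Group samples into compact cliques via complete linkage.

    S: (N, N) symmetric similarity matrix (higher = more similar).
    Returns C: (K, N) binary assignment matrix of samples to cliques.
    """
    # Convert similarities to distances for scipy's linkage.
    D = S.max() - S
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method='complete')
    labels = fcluster(Z, t=merge_thresh, criterion='distance')
    K = labels.max()
    C = np.zeros((K, S.shape[0]), dtype=int)
    C[labels - 1, np.arange(S.shape[0])] = 1
    return C

# Toy example: two obvious groups of mutually similar samples.
S = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.8],
              [0.1, 0.1, 0.8, 1.0]])
C = build_cliques(S, merge_thresh=0.5)
```

Complete linkage guarantees that every pair inside a clique is within the merge threshold, which is exactly the "mutually similar" property required above.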
Therefore, we now seek batches of\nmutually distant cliques, so that all samples in a batch can be labeled consistently because they are\neither similar (same compact clique) or dissimilar (different, distant cliques). Samples with unreliable\nsimilarity then end up in different batches, and we train a CNN successively on these batches.\nWe now formulate an optimization problem that produces a set of consistent batches of cliques. Let\nX ∈ {0, 1}^{B×K} be an indicator matrix that assigns K cliques to B batches (the rows x_b of X are the\ncliques in batch b) and S′ ∈ R^{K×K} be the similarity between cliques. We enforce cliques in the same\nbatch to be dissimilar by minimizing tr(X S′ X⊤), from which the diagonal (self-similarity) contribution\nof the cliques selected for each batch is subtracted (see Eq. (1)). Moreover, each batch should maximize sample\ncoverage, i.e., the number of distinct samples in all cliques of a batch, ‖x_b C‖_p^p, should be maximal.\nFinally, the number of distinct points covered by all batches, ‖1 X C‖_p^p, should be maximal, so that\nthe different (potentially overlapping) batches together comprise as many samples as possible. We\nselect p = 1/16 so that our penalty function roughly approximates the non-linear step function. The\nobjective of the optimization problem then becomes\n\n  min_{X ∈ {0,1}^{B×K}}  tr(X S′ X⊤) − tr(X diag(S′) X⊤) − λ1 ∑_{b=1}^{B} ‖x_b C‖_p^p − λ2 ‖1 X C‖_p^p    (1)\n  s.t.  X 1_K⊤ = r 1_B⊤    (2)\n\nwhere r is the desired number of cliques in one batch for CNN training. The number of batches, B,\ncan be set arbitrarily high to allow for as many rounds of SGD training as desired. If it is too low,\nthis can be easily spotted, as only limited coverage of training data can be achieved in the last term of\nEq. (1). 
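The cost in Eq. (1) can be illustrated with a minimal numpy sketch that evaluates it for a given binary assignment X. The toy matrices and the λ values are assumptions for illustration only; the actual problem is solved via the relaxation described in the sequel, not by evaluating candidates.

```python
import numpy as np

def batch_selection_cost(X, S_c, C, lam1=1.0, lam2=1.0, p=1.0 / 16):
    """Objective of Eq. (1) for a binary clique-to-batch assignment X.

    X: (B, K) assignment of K cliques to B batches.
    S_c: (K, K) clique similarity matrix S'.
    C: (K, N) assignment of samples to cliques.
    """
    # Penalize similar cliques sharing a batch; the diagonal
    # (self-similarity) contribution is subtracted, as in Eq. (1).
    intra = np.trace(X @ S_c @ X.T) - np.trace(X @ np.diag(np.diag(S_c)) @ X.T)
    # Reward per-batch sample coverage ||x_b C||_p^p ...
    per_batch = sum(((xb @ C) ** p).sum() for xb in X)
    # ... and overall coverage ||1 X C||_p^p across all batches.
    overall = ((np.ones(X.shape[0]) @ X @ C) ** p).sum()
    return intra - lam1 * per_batch - lam2 * overall

# Toy setup: cliques 0 and 1 are similar, clique 2 is unrelated.
S_c = np.array([[1.0, 0.9, 0.0],
                [0.9, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
C = np.eye(3)                              # one sample per clique
X_sep = np.array([[1, 0, 1], [0, 1, 0]])   # keeps similar cliques apart
X_mix = np.array([[1, 1, 0], [0, 0, 1]])   # puts similar cliques together
```

As intended, separating the two similar cliques into different batches yields a strictly lower cost than placing them in the same batch, while the coverage terms are identical for both assignments.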
Since X is discrete, the optimization problem (1) is not easier than the Quadratic Assignment\nProblem, which is known to be NP-hard [1]. To overcome this issue we relax the binary constraints\nand instead force the continuous solution to the boundaries of the feasible range by maximizing the\nadditional term λ3 ‖X − 0.5‖_F^2, using the Frobenius norm.\nWe condition S′ to be positive semi-definite by thresholding its eigenvalues and projecting onto the\nresulting basis. Since also p < 1, the previous objective function is a difference of convex functions\nu(X) − v(X), where\n\n  u(X) = tr(X S′ X⊤) − λ1 ∑_{b=1}^{B} ‖x_b C‖_p^p − λ2 ‖1 X C‖_p^p    (3)\n  v(X) = tr(X diag(S′) X⊤) + λ3 ‖X − 0.5‖_F^2    (4)\n\nFigure 3: Visual example of a resulting batch of cliques for the long jump category of the Olympic Sports\ndataset. Each clique contains at least 20 samples and is represented as their average.\n\nIt can be solved using the CCCP algorithm [24]. In each iteration of CCCP the following convex\noptimization problem is solved,\n\n  argmin_{X ∈ [0,1]^{B×K}}  u(X) − vec(X)⊤ vec(∇v(X_t)),    (5)\n  s.t.  X 1_K⊤ = r 1_B⊤,    (6)\n\nwhere ∇v(X_t) = 2 X_t ⊙ (1 diag(S′)) + λ3 (2 X_t − 1) and ⊙ denotes the Hadamard product. We solve\nthis constrained optimization problem by means of the interior-point method. Fig. 3 shows a visual\nexample of a selected batch of cliques.\n\n2.4 CNN Training\n\nWe successively train a CNN on the different batches x_b obtained using Eq. (1). In each batch,\nclassifying samples according to the clique they are in then serves as a pretext task for learning\nsample similarities. One of the key properties of CNNs is the training using SGD and backpropagation\n[14]. 
The backpropagated gradient is estimated only over a subset (batch) of training samples, so it\ndepends only on the subset of cliques in x_b. Following this observation, the clique categorization\nproblem is effectively decoupled into a set of smaller sub-tasks: the individual batches of cliques.\nDuring training, we randomly pick a batch b in each iteration and compute the stochastic gradient\nusing the softmax loss ℓ of the clique classification, yielding the objective L(W),\n\n  L(W) ≈ (1/M) ∑_{j ∈ x_b} ℓ(c_j, f_W(d_j)) + λ r(W),    (7)\n  V_{t+1} = μ V_t − α ∇L(W_t),  W_{t+1} = W_t + V_{t+1},    (8)\n\nwhere M is the SGD batch size, f_W(d_j) is the network prediction for sample d_j with clique label c_j, r(W) is a regularizer, W_t denotes the CNN weights at iteration t, and V_t denotes the\nweight update of the previous iteration. Parameters α and μ denote the learning rate and momentum,\nrespectively. We then compute similarities between exemplars by simply measuring correlation on\nthe learned feature representation extracted from the CNN (see Sect. 3.1 for details).\n\n2.5 Similarity Imputation\n\nBy alternating between the different batches, which contain cliques with mutually inconsistent\nsimilarities, the CNN learns a single representation for samples from all batches. In effect, this\nconsolidates similarities between cliques in different batches. It generalizes from a subset of initial\ncliques to new, previously unreliable relations between samples in different batches by utilizing\ntransitivity relationships implied by the cliques.\nAfter a training round over all batches we impute the similarities using the representation learned\nby the CNN. The resulting similarities are more reliable and enable the grouping algorithm from\nSect. 2.2 to find larger cliques of mutually related samples. 
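Viewed in isolation, the weight update of Eq. (8) in Sect. 2.4 is standard SGD with momentum and can be sketched on a toy objective; the quadratic loss and the hyperparameter values are chosen only for illustration and are not the paper's training setup.

```python
import numpy as np

def sgd_momentum_step(W, V, grad, lr=0.1, mu=0.9):
    """One update of Eq. (8): V_{t+1} = mu*V_t - lr*grad(W_t); W_{t+1} = W_t + V_{t+1}."""
    V_next = mu * V - lr * grad
    return W + V_next, V_next

# Toy check on L(W) = 0.5 * ||W||^2, whose gradient is simply W.
W, V = np.array([4.0, -2.0]), np.zeros(2)
for _ in range(200):
    W, V = sgd_momentum_step(W, V, grad=W)
```

On this convex toy problem the iterates spiral into the minimizer at the origin, illustrating how the velocity term V accumulates and then damps the gradient steps.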
As there are fewer unreliable similarities,\nmore samples can be comprised in a batch and overall fewer batches already cover the same fraction of\ndata as before. Consequently, we alternately train the CNN and recompute cliques and batches using\nthe similarities inferred in the previous iteration of CNN training. This alternating imputation of\nsimilarities and update of the classifier follows the idea of multiple-instance learning and has been shown\nto converge quickly, in less than four iterations.\n\nFigure 4: (a) Cumulative distribution of the spectrum of the similarity matrices obtained by our\nmethod and the HOG-LDA initialization. (b) Sorted similarities with respect to one exemplar,\nwhere only similarities at the ends of the distribution can be trusted.\n\nTo evaluate the improvement of the similarities, Fig. 4 analyzes the eigenvalue spectrum of S on\nthe Olympic Sports dataset, see Sect. 3.1. The plot shows the normalized cumulative sum of the\neigenvalues as a function of the number of eigenvectors. Compared to the initialization, transitivity\nrelations are learned and the approach can generalize from an exemplar to more related samples.\nTherefore, the similarity matrix becomes more structured (cf. Fig. 1) and random noisy relations\ndisappear. As a consequence, it can be represented using very few basis vectors. In a further\nexperiment we evaluate the number of reliable similarities and dissimilarities within and between\ncliques per batch. Recall that samples can only be part of the same batch if their similarity is\nreliable. So the goal of similarity learning is to remove transitivity conflicts and reconcile relations\nbetween samples to yield larger batches. 
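The imputation step itself, measuring correlation on the learned representation (cf. Sect. 2.4), can be sketched as follows; the small feature matrix is a toy stand-in for extracted CNN activations.

```python
import numpy as np

def correlation_similarity(F):
    """Pairwise Pearson correlation between feature rows,
    used here as the imputed similarity matrix S."""
    Fc = F - F.mean(axis=1, keepdims=True)        # center each feature vector
    Fc /= np.linalg.norm(Fc, axis=1, keepdims=True)
    return Fc @ Fc.T

# Toy features: rows 0 and 1 vary together, row 2 varies oppositely.
F = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 2.0, 1.0]])
S = correlation_similarity(F)
```

The resulting S is symmetric with unit diagonal, so it can directly replace the initial HOG-LDA similarities when recomputing cliques and batches.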
We now observe that after the iterative update of similarities,\nthe average number of similarities and dissimilarities in a batch has increased by a factor of 2.34\ncompared to the batches at initialization.\n\n3 Experimental Evaluation\n\nWe provide a quantitative and qualitative analysis of our exemplar-based approach for unsupervised\nsimilarity learning. For evaluation, three different settings are considered: posture analysis on\nOlympic Sports [16], pose estimation on Leeds Sports [12], and object classification on PASCAL\nVOC 2007.\n\n3.1 Olympic Sports Dataset: Posture Analysis\n\nThe Olympic Sports dataset [16] is a video compilation of different sports competitions. To evaluate\nfine-scale pose similarity, for each sports category we had independent annotators manually label 20\npositive (similar) and negative (dissimilar) samples for 1033 exemplars. Note that these annotations\nare solely used for testing, since we follow an unsupervised approach.\nWe compare the proposed method with the Exemplar-CNN [4], the two-stream approach of Doersch\net al. [3], 1-sample CNN and NN-CNN models (in a very similar spirit to [20]), Alexnet [13],\nExemplar-SVMs [15], and HOG-LDA [10]. Due to its performance in object and person detection,\nwe use the approach of [7] to compute person bounding boxes. (i) The evaluation should investigate\nthe benefit of the unsupervised gathering of batches of cliques for deep learning of exemplars using\nstandard CNN architectures. Therefore we instantiate our approach by adopting the widely used model\nof Krizhevsky et al. [13]. Batches for training the network are obtained by solving the optimization\nproblem in Eq. (1) with B = 100, K = 100, and r = 20, and fine-tuning the model for 10^5 iterations.\nThereafter we compute similarities using features extracted from layer fc7 in the Caffe implementation\nof [13]. 
(ii) Exemplar-CNN is trained using the best performing parameters reported in [4] and the\n64c5-128c5-256c5-512f architecture. Then we use the output of fc4 and compute 4-quadrant max\npooling. (iii) Exemplar-SVM was trained on the exemplar frames using the HOG descriptor. The\nsamples for hard negative mining come from all categories except the one that an exemplar is from.\nWe performed cross-validation to find an optimal number of negative mining rounds (less than three).\nThe class weights of the linear SVM were set as C1 = 0.5 and C2 = 0.01. (iv) LDA-whitened HOG\nwas computed as specified in [10]. (v) The 1-sample CNN was trained by defining a separate class for\neach exemplar sample plus a negative category containing all other samples. (vi) In a similar fashion,\nthe NN-CNN was trained using the exemplar plus 10 nearest neighbours obtained using the whitened\nHOG similarities. As the implementation for both CNNs we again used the model of [13], fine-tuned\nfor 10^5 iterations. Each image in the training set is augmented with 10 transformed versions by\nperforming random translation, scaling, rotation and color transformation, to improve invariance with\nrespect to these transformations.\n\nHOG-LDA [10] | Ex-SVM [15] | Ex-CNN [4] | Alexnet [13] | 1-s CNN | NN-CNN | Doersch et al. [3] | Ours\n0.58 | 0.67 | 0.56 | 0.65 | 0.62 | 0.65 | 0.58 | 0.79\nTable 1: Avg. AUC for each method on the Olympic Sports dataset.\n\nTab. 1 reports the average AUC for each method over all categories of the Olympic Sports dataset.\nOur approach obtains a performance improvement of at least 10% w.r.t. the other methods. In\nparticular, the experiments show that the 1-sample CNN fails to model the positive distribution,\ndue to the high imbalance between positives and negatives and the resulting biased gradient. 
In\ncomparison, additional nearest neighbours to the exemplar (NN-CNN) yield a better model of within-class\nvariability of the exemplar, leading to a 3% performance increase over the 1-sample CNN.\nHowever, NN-CNN also sees a large set of negatives, which are partially similar and partially dissimilar. Due\nto this unstructured negative set, the approach fails to thoroughly capture the fine-grained\nsimilarity structure over the negative samples. To circumvent this issue we compute sets of mutually\ndistant compact cliques, resulting in a relative performance increase of 12% over NN-CNN.\nFurthermore, Fig. 1 presents the similarity structures which the different approaches extract when\nanalyzing human postures. Fig. 2 further highlights the similarities and the relations between\nneighbors. For each method the top 50 nearest neighbours for a randomly chosen exemplar frame in\nthe Olympic Sports dataset are blended. We can see how the neighbors obtained by our approach\ndepict a sharper average posture, since they result from compact cliques of mutually similar samples.\nTherefore they retain more details and are more similar to the original than in the case of the other\nmethods.\n\n3.2 Leeds Sports Dataset: Pose Estimation\n\nThe Leeds Sports Dataset [12] is the most widely used benchmark for pose estimation. For training\nwe employ 1000 images from the dataset combined with 4000 images from the extended version of\nthis dataset, where each image is annotated with 14 joint locations. We use the visual similarities\nlearned by our approach to find frames similar in posture to a query frame. Since our training is\nunsupervised, joint labels are not available. At test time we therefore estimate the pose of a query\nperson by identifying the nearest neighbor from the training set. 
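This nearest-neighbor pose transfer can be sketched in a few lines; the feature and joint arrays below are toy placeholders, not the learned representation or the dataset annotations.

```python
import numpy as np

def transfer_pose(query_feat, train_feats, train_joints):
    """Predict pose by copying the joint locations of the most
    similar training frame under the learned similarity."""
    sims = train_feats @ query_feat        # e.g. correlation scores
    return train_joints[np.argmax(sims)]

# Toy example: 3 training frames, 14 joints with (x, y) coordinates each.
train_feats = np.eye(3)                    # orthogonal toy features
train_joints = np.arange(3 * 14 * 2, dtype=float).reshape(3, 14, 2)
pred = transfer_pose(np.array([0.0, 1.0, 0.0]), train_feats, train_joints)
```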
To compare against the supervised\nmethods, the pose of the nearest neighbor is then compared against the ground truth.\nNow we evaluate our visual similarity learning and the resulting identification of nearest postures.\nFor comparison, similar postures are also retrieved using HOG-LDA [10] and Alexnet [13]. In\naddition, we also report an upper bound on the performance that can be achieved by the nearest\nneighbor using ground-truth similarities. Therefore, the nearest training pose for a query is identified\nby minimizing the average distance between their ground-truth pose annotations. This is the best one\ncan do by finding the most similar frame, when not provided with a supervised parametric model (the\nperformance gap to 100% shows the difference between training and test poses). For completeness,\nwe compare with a fully supervised state-of-the-art approach for pose estimation [18]. We use the\nsame experimental settings described in Sect. 3.1.\nTab. 2 reports the Percentage of Correct Parts (PCP) for the different methods. The prediction for a\npart is considered correct when its endpoints are within 50% of the part length of the corresponding\nground-truth endpoints. Our approach significantly improves the visual similarities learned using Alexnet\nand HOG-LDA. It is noteworthy that even though our approach for estimating the pose is fully\nunsupervised, it attains a competitive performance when compared to the upper bound of supervised\nground-truth similarities.\nIn addition, Fig. 5 presents success (a) and failure (c) cases of our method. In Fig. 5(a) we can see\nthat the pose is correctly transferred from the nearest neighbor (b) from the training set, resulting in a\nPCP score of 0.6 for that particular image. 
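The PCP criterion stated above can be written directly from its definition; this is a toy reading of the 50% rule for a single part, not the official evaluation code, and the endpoint layout is an assumption.

```python
import numpy as np

def pcp_correct(pred_part, gt_part, thresh=0.5):
    """A part (two endpoints) counts as correct if both predicted
    endpoints lie within thresh * part length of the ground truth."""
    gt_len = np.linalg.norm(gt_part[1] - gt_part[0])
    d0 = np.linalg.norm(pred_part[0] - gt_part[0])
    d1 = np.linalg.norm(pred_part[1] - gt_part[1])
    return d0 <= thresh * gt_len and d1 <= thresh * gt_len

gt = np.array([[0.0, 0.0], [10.0, 0.0]])    # a part of length 10
good = np.array([[1.0, 1.0], [9.0, 1.0]])   # both endpoints off by ~1.4
bad = np.array([[0.0, 6.0], [10.0, 0.0]])   # first endpoint off by 6 > 5
```

A per-image PCP score such as the 0.6 mentioned above is then simply the fraction of parts for which this check passes.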
Moreover, Fig. 5(c), (d) show that the representation learnt\nby our method is invariant to front-back flips (matching a person facing away from the camera to one\nfacing the camera). Since our approach learns pose similarity in an unsupervised manner, it becomes\ninvariant to changes in appearance as long as the shape is similar, thus explaining this confusion.\nAdding additional training data or directly incorporating face detection-based features could resolve\nthis.\n\nMethod | Head | Torso | Upper arms | Lower arms | Upper legs | Lower legs | Total\nOurs | 45.5 | 80.1 | 27.2 | 12.6 | 50.1 | 45.7 | 43.5\nHOG-LDA [10] | 42.2 | 73.7 | 23.2 | 10.3 | 41.8 | 39.2 | 38.4\nAlexnet [13] | 42.4 | 76.9 | 26.7 | 11.2 | 47.8 | 41.8 | 41.1\nGround Truth | 72.4 | 93.7 | 58.7 | 36.4 | 78.8 | 74.9 | 69.2\nPose Machines [18] | 85.4 | 93.1 | 68.1 | 42.2 | 83.6 | 76.8 | 72.0\nTable 2: PCP measure for each method on the Leeds Sports dataset.\n\nFigure 5: Pose prediction results. (a) and (c) are test images with the superimposed ground-truth\nskeleton depicted in red and the predicted skeleton in green. (b) and (d) are the corresponding nearest\nneighbours, which were used to transfer the pose.\n\n3.3 PASCAL VOC 2007: Object Classification\n\nThe previous sections have analyzed the learning of pose similarities. Now we evaluate the learning\nof similarities over object categories. Therefore, we classify object bounding boxes of the PASCAL\nVOC 2007 dataset. To initialize our model we now use the visual similarities of Wang et al. [22]\nwithout applying any fine-tuning on PASCAL, and also compare against this approach. Thus, neither\nImageNet nor PASCAL VOC labels are utilized. For comparison we evaluate against HOG-LDA [10],\n[22], and R-CNN [9]. For our method and HOG-LDA we use the same experimental settings as\ndescribed in Sect. 
3.1, initializing our method and network with the similarities obtained by [22].\nFor all methods, the k nearest neighbors are computed using similarities (Pearson correlation) based\non fc6. In Tab. 3 we show the classification accuracies of all approaches for k = 5. Our approach\nimproves upon the initial similarities of the unsupervised approach of [22] to yield a performance\ngain of 3% without requiring any supervision information or fine-tuning on PASCAL.\n\nHOG-LDA | Wang et al. [22] | Wang et al. [22] + Ours | RCNN\n0.1180 | 0.4501 | 0.4812 | 0.6825\nTable 3: Classification results for PASCAL VOC 2007.\n\n4 Conclusion\n\nWe have proposed an approach for unsupervised learning of similarities between large numbers\nof exemplars using CNNs. CNN training is made applicable in this context by addressing crucial\nproblems resulting from the single positive exemplar setup, the imbalance between exemplar and\nnegatives, and inconsistent labels within SGD batches. Optimization of a single cost function yields\nSGD batches of compact, mutually dissimilar cliques of samples. Learning exemplar similarities is\nthen posed as a categorization task on individual batches. In the experimental evaluation the approach\nhas shown competitive performance compared to the state-of-the-art, providing significantly finer\nsimilarity structure that is particularly crucial for detailed posture analysis.\n\nThis research has been funded in part by the Ministry for Science, Baden-Württemberg and the Heidelberg\nAcademy of Sciences, Heidelberg, Germany. We are grateful to the NVIDIA corporation for donating a Titan X\nGPU.\n\nReferences\n[1] R. E. Burkard, E. Çela, P. M. Pardalos, and L. Pitsoulis. The quadratic assignment problem. In P. M.\n\nPardalos and D.-Z. Du, editors, Handbook of Combinatorial Optimization. 1998.\n\n[2] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? 
ACM TOG, 31(4), 2012.

[3] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, pages 1422–1430, 2015.

[4] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, pages 766–774, 2014.

[5] A. Eigenstetter, M. Takami, and B. Ommer. Randomized max-margin compositions for visual recognition. In CVPR, 2014.

[6] I. El-Naqa, Y. Yang, N. P. Galatsanos, R. M. Nishikawa, and M. N. Wernick. A similarity learning approach to content-based image retrieval: application to digital mammography. TMI, 23(10):1233–1244, 2004.

[7] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, pages 1–8. IEEE, 2008.

[8] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Pose search: retrieving people using their pose. In CVPR, pages 1–8. IEEE, 2009.

[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.

[10] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, pages 459–472. Springer, 2012.

[11] X. He and S. Gould. An exemplar-based CRF for multi-instance object segmentation. In CVPR. IEEE, 2014.

[12] S. Johnson and M. Everingham. Learning effective human pose estimation from inaccurate annotation. In CVPR, 2011.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D.
Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[15] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, pages 89–96. IEEE, 2011.

[16] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, pages 392–405. Springer, 2010.

[17] H. Pirsiavash and D. Ramanan. Parsing videos of actions with segmental grammars. In CVPR, 2014.

[18] V. Ramakrishna, D. Munoz, M. Hebert, J. A. Bagnell, and Y. Sheikh. Pose machines: Articulated pose estimation via inference machines. In ECCV. Springer, 2014.

[19] J. Rubio, A. Eigenstetter, and B. Ommer. Generative regularization with latent topics for discriminative object recognition. PR, 48:3871–3880, 2015.

[20] N. Shapovalova and G. Mori. Clustered exemplar-SVM: Discovering sub-categories for visual recognition. In ICIP, pages 93–97. IEEE, 2015.

[21] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu. Learning fine-grained image similarity with deep ranking. In CVPR, pages 1386–1393, 2014.

[22] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.

[23] H. Xia, S. C. H. Hoi, R. Jin, and P. Zhao. Online multiple kernel similarity learning for visual search. TPAMI, 36(3):536–549, 2014.

[24] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). Neural Computation, 15(4):915–936, 2003.

[25] S. Zagoruyko and N. Komodakis.
Learning to compare image patches via convolutional neural networks. In CVPR, 2015.
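The evaluation protocol of Sect. 3.3 labels each test bounding box by a vote over its k = 5 nearest training samples under Pearson correlation of fc6 features. A minimal sketch of that protocol (not the authors' implementation; feature extraction is assumed to have happened elsewhere, and all function and array names here are hypothetical):

```python
import numpy as np

def pearson_similarity(a, b):
    """Pearson correlation between every row of `a` and every row of `b`.

    Centering each row and normalizing to unit length reduces Pearson
    correlation to a plain dot product between the resulting rows.
    """
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # shape: (len(a), len(b))

def knn_classify(train_feats, train_labels, test_feats, k=5):
    """Label each test sample by majority vote over its k most
    similar training samples (Pearson correlation on features)."""
    sim = pearson_similarity(test_feats, train_feats)
    nn = np.argsort(-sim, axis=1)[:, :k]   # indices of k nearest neighbours
    votes = train_labels[nn]               # (n_test, k) neighbour labels
    return np.array([np.bincount(v).argmax() for v in votes])
```

Correlation, unlike Euclidean distance, is invariant to per-sample shifts and scalings of the feature vector, which is one common reason to prefer it for comparing CNN activations.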