{"title": "Learning Deep Embeddings with Histogram Loss", "book": "Advances in Neural Information Processing Systems", "page_first": 4170, "page_last": 4178, "abstract": "We suggest a new loss for learning deep embeddings. The key characteristics of the new loss are the absence of tunable parameters and very good results obtained across a range of datasets and problems. The loss is computed by estimating two distributions of similarities for positive (matching) and negative (non-matching) point pairs, and then computing the probability of a positive pair to have a lower similarity score than a negative pair based on these probability estimates. We show that these operations can be performed in a simple and piecewise-differentiable manner using 1D histograms with soft assignment operations. This makes the proposed loss suitable for learning deep embeddings using stochastic optimization. The experiments reveal favourable results compared to recently proposed loss functions.", "full_text": "Learning Deep Embeddings with Histogram Loss\n\nEvgeniya Ustinova and Victor Lempitsky\n\nSkolkovo Institute of Science and Technology (Skoltech)\n\nMoscow, Russia\n\nAbstract\n\nWe suggest a loss for learning deep embeddings. The new loss does not introduce\nparameters that need to be tuned and results in very good embeddings across a range\nof datasets and problems. The loss is computed by estimating two distributions of\nsimilarities for positive (matching) and negative (non-matching) sample pairs, and\nthen computing the probability of a positive pair to have a lower similarity score\nthan a negative pair based on the estimated similarity distributions. We show that\nsuch operations can be performed in a simple and piecewise-differentiable manner\nusing 1D histograms with soft assignment operations. This makes the proposed\nloss suitable for learning deep embeddings using stochastic optimization. 
In the\nexperiments, the new loss performs favourably compared to recently proposed\nalternatives.\n\n1 Introduction\n\nDeep feed-forward embeddings play a crucial role across a wide range of tasks and applications in\nimage retrieval [1, 8, 15], biometric verification [3, 5, 13, 17, 22, 25, 28], visual product search [21],\nfinding sparse and dense image correspondences [20, 29], etc. Under this approach, complex input\npatterns (e.g. images) are mapped into a high-dimensional space through a chain of feed-forward\ntransformations, while the parameters of the transformations are learned from a large amount of\nsupervised data. The objective of the learning process is to achieve the proximity of semantically-related\npatterns (e.g. faces of the same person) and avoid the proximity of semantically-unrelated ones (e.g.\nfaces of different people) in the target space. In this work, we focus on simple similarity measures\nsuch as Euclidean distance or scalar products, as they allow fast evaluation, the use of approximate\nsearch methods, and ultimately lead to faster and more scalable systems.\nDespite the ubiquity of deep feed-forward embeddings, learning them still poses a challenge and is\nrelatively poorly understood. While it is not hard to write down a loss based on tuples of training\npoints expressing the above-mentioned objective, optimizing such a loss rarely works \u201cout of the\nbox\u201d for complex data. This is evidenced by the broad variety of losses, which can be based on pairs,\ntriplets or quadruplets of points, as well as by a large number of optimization tricks employed in\nrecent works to reach state-of-the-art results, such as pretraining for the classification task while restricting\nfine-tuning to top layers only [13, 25], combining the embedding loss with the classification loss [22],\nand using complex data sampling such as mining \u201csemi-hard\u201d training triplets [17]. 
Most of the proposed\nlosses and optimization tricks come with a certain number of tunable parameters, and the quality of\nthe final embedding is often sensitive to them.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: The histogram loss computation for a batch of examples (color-coded; same color indicates matching\nsamples). After the batch (left) is embedded into a high-dimensional space by a deep network (middle), we\ncompute the histograms of similarities of positive (top-right) and negative pairs (bottom-right). We then evaluate\nthe integral of the product between the negative distribution and the cumulative distribution function of the positive\ndistribution (shown with a dashed line), which corresponds to the probability that a randomly sampled positive\npair has smaller similarity than a randomly sampled negative pair. Such a histogram loss can be minimized by\nbackpropagation. The only associated parameter of such a loss is the number of histogram bins, to which the\nresults have very low sensitivity.\n\nHere, we propose a new loss function for learning deep embeddings. In designing this function\nwe strive to avoid highly-sensitive parameters such as margins or thresholds of any kind. While\nprocessing a batch of data points, the proposed loss is computed in two stages. Firstly, the two\none-dimensional distributions of similarities in the embedding space are estimated, one corresponding\nto similarities between matching (positive) pairs, the other corresponding to similarities between\nnon-matching (negative) pairs. The distributions are estimated in a simple non-parametric way\n(as histograms with linearly-interpolated values-to-bins assignments).\nIn the second stage, the\noverlap between the two distributions is computed by estimating the probability that the two points\nsampled from the two distributions are in a wrong order, i.e. 
that a random negative pair has a higher\nsimilarity than a random positive pair. The two stages are implemented in a piecewise-differentiable\nmanner, so that the loss (i.e. the overlap between distributions) can be minimized using standard\nbackpropagation. The number of bins in the histograms is the only tunable parameter associated\nwith our loss, and it can be set according to the batch size independently of the data itself. In the\nexperiments, we fix this parameter (and the batch size) and demonstrate the versatility of the loss by\napplying it to four different image datasets of varying complexity and nature. Comparing the new\nloss to state-of-the-art alternatives reveals its favourable performance. Overall, we hope that the proposed loss\nwill be used as an \u201cout-of-the-box\u201d solution for learning deep embeddings that requires little tuning\nand leads to results close to the state of the art.\n\n2 Related work\n\nRecent works on learning embeddings use deep architectures (typically ConvNets [8, 10]) and\nstochastic optimization. Below we review the loss functions that have been used in recent works.\nClassification losses. It has been observed in [8] and confirmed later in multiple works (e.g. [15])\nthat deep networks trained for classification can be used for deep embedding. In particular, it is\nsufficient to consider an intermediate representation arising in one of the last layers of the deep\nnetwork. The normalization is added post-hoc. Many of the works mentioned below pre-train their\nembeddings as a part of the classification networks.\nPairwise losses. Methods that use pairwise losses sample pairs of training points and score them\nindependently. The pioneering work on deep embeddings [3] penalizes the deviation from the unit\ncosine similarity for positive pairs and the deviation from \u22121 or \u22120.9 for negative pairs. 
Perhaps\nthe most popular of pairwise losses is the contrastive loss [5, 20], which minimizes the distances in\nthe positive pairs and tries to maximize the distances in the negative pairs as long as these distances\nare smaller than some margin M. Several works pointed to the fact that attempting to collapse all\npositive pairs may lead to excessive overfitting and therefore suggested losses that mitigate this\neffect, e.g. a double-margin contrastive loss [12], which drops to zero for positive pairs as long as\ntheir distances fall below the second (smaller) margin. Finally, several works use non-hinge based\npairwise losses such as log-sum-exp and cross-entropy on the similarity values that softly encourage\nthe similarity to be high for positive pairs and low for negative pairs (e.g. [25, 28]). The main\nproblem with pairwise losses is that the margin parameters might be hard to tune, especially since\nthe distributions of distances or similarities can change dramatically as the learning progresses.\nWhile most works \u201cskip\u201d the burn-in period by initializing the embedding to a network pre-trained\nfor classification [25], [22] further demonstrated the benefit of admixing the classification loss during\nthe fine-tuning stage (which brings in another parameter).\nTriplet losses. While pairwise losses care about the absolute values of distances of positive and\nnegative pairs, the quality of embeddings ultimately depends on the relative ordering between positive\nand negative distances (or similarities). Indeed, the embedding meets the needs of most practical\napplications as long as the similarities of positive pairs are greater than similarities of negative pairs\n[19, 27]. 
The most popular class of losses for metric learning therefore considers triplets of points\nx0, x+, x\u2212, where x0, x+ form a positive pair and x0, x\u2212 form a negative pair, and measures the\ndifference in their distances or similarities. A triplet-based loss can then be aggregated over all\ntriplets, e.g. using a hinge function of these differences. Triplet-based losses are popular for large-scale\nembedding learning [4] and in particular for deep embeddings [13, 14, 17, 21, 29]. Setting the margin\nin the triplet hinge-loss remains a challenge, as does sampling \u201ccorrect\u201d triplets, since the\nmajority of them quickly become associated with zero loss. On the other hand, focusing sampling on\nthe hardest triplets can prevent efficient learning [17]. Triplet-based losses generally make learning\nless constrained than pairwise losses. This is because for a low-loss embedding, the characteristic\ndistance separating positive and negative pairs can vary across the embedding space (depending on\nthe location of x0), which is not possible for pairwise losses. In some situations, such added flexibility\ncan increase overfitting.\nQuadruplet losses. Quadruplet-based losses are similar to triplet-based losses as they are computed\nby looking at the differences in distances/similarities of positive pairs and negative pairs. In the case\nof quadruplet-based losses, the compared positive and negative pairs do not share a common point\n(as they do for triplet-based losses). Quadruplet-based losses do not allow the flexibility of triplet-based\nlosses discussed above (as they include comparisons of positive and negative pairs located in\ndifferent parts of the embedding space). At the same time, they are not as rigid as pairwise losses, as\nthey only penalize the relative ordering for negative pairs and positive pairs. 
Nevertheless, despite\nthese appealing properties, quadruplet-based losses remain rarely used and confined to \u201cshallow\u201d\nembeddings [9, 31]. We are unaware of deep embedding approaches using quadruplet losses. A\npotential problem with quadruplet-based losses in the large-scale setting is that the number of all\nquadruplets is even larger than the number of triplets. Among all groups of losses, our approach\nis most related to quadruplet-based ones, and can be seen as a way to organize learning of deep\nembeddings with a quadruplet-based loss in an efficient and (almost) parameter-free manner.\n\n3 Histogram loss\n\nWe now describe our loss function and then relate it to the quadruplet-based loss. Our loss (Figure 1)\nis defined for a batch of examples X = {x1, x2, . . . , xN} and a deep feedforward network f (\u00b7; \u03b8),\nwhere \u03b8 represents learnable parameters of the network. We assume that the last layer of the network\nperforms length-normalization, so that the embedded vectors {yi = f (xi; \u03b8)} are L2-normalized.\nWe further assume that we know which elements should match each other and which ones should\nnot. Let mij be +1 if xi and xj form a positive pair (correspond to a match) and mij be \u22121 if\nxi and xj are known to form a negative pair (these labels can be derived from class labels or be\nspecified otherwise). Given {mij} and {yi} we can estimate the two probability distributions p+\nand p\u2212 corresponding to the similarities in positive and negative pairs respectively. In particular\nS+ = {sij = \u27e8xi, xj\u27e9 | mij = +1} and S\u2212 = {sij = \u27e8xi, xj\u27e9 | mij = \u22121} can be regarded as\nsample sets from these two distributions. Although samples in these sets are not independent, we\nkeep all of them to ensure a large sample size.\nGiven sample sets S+ and S\u2212, we can use any statistical approach to estimate p+ and p\u2212. 
The fact\nthat these distributions are one-dimensional and bounded to [\u22121; +1] simplifies the task. Perhaps\nthe most obvious choice in this case is fitting simple histograms with uniformly spaced bins, and we\nuse this approach in our experiments. We therefore consider R-dimensional histograms H+ and H\u2212,\nwith the nodes t1 = \u22121, t2, . . . , tR = +1 uniformly filling [\u22121; +1] with the step $\\Delta = \\frac{2}{R-1}$. We\nestimate the value $h^+_r$ of the histogram H+ at each node as:\n\n$$h^+_r = \\frac{1}{|S^+|} \\sum_{(i,j)\\,:\\,m_{ij}=+1} \\delta_{i,j,r} \\qquad (1)$$\n\nwhere (i, j) spans all positive pairs of points in the batch. The weights \u03b4i,j,r are chosen so that each\npair sample is assigned to the two adjacent nodes:\n\n$$\\delta_{i,j,r} = \\begin{cases} (s_{ij} - t_{r-1})/\\Delta, & \\text{if } s_{ij} \\in [t_{r-1}; t_r], \\\\\\\\ (t_{r+1} - s_{ij})/\\Delta, & \\text{if } s_{ij} \\in [t_r; t_{r+1}], \\\\\\\\ 0, & \\text{otherwise.} \\end{cases} \\qquad (2)$$\n\nWe thus use linear interpolation for each entry in the pair set, when assigning it to the two nodes. The\nestimation of H\u2212 proceeds analogously. Note that the described approach is equivalent to using a\n\u201ctriangular\u201d kernel for density estimation; other kernel functions can be used as well [2].\nOnce we have the estimates for the distributions p+ and p\u2212, we use them to estimate the probability\nof the similarity in a random negative pair to be more than the similarity in a random positive pair (the\nprobability of reverse). Generally, this probability can be estimated as:\n\n$$p_{\\text{reverse}} = \\int_{-1}^{1} p^-(x) \\left[ \\int_{-1}^{x} p^+(y)\\,dy \\right] dx = \\int_{-1}^{1} p^-(x)\\,\\Phi^+(x)\\,dx = \\mathbb{E}_{x \\sim p^-}[\\Phi^+(x)], \\qquad (3)$$\n\nwhere \u03a6+(x) is the CDF (cumulative distribution function) of p+(x). The integral (3) can then be\napproximated and computed as:\n\n$$L(X, \\theta) = \\sum_{r=1}^{R} \\left( h^-_r \\sum_{q=1}^{r} h^+_q \\right) = \\sum_{r=1}^{R} h^-_r \\phi^+_r, \\qquad (4)$$\n\nwhere L is our loss function (the histogram loss) computed for the batch X and the embedding\nparameters \u03b8, which approximates the reverse probability; $\\phi^+_r = \\sum_{q=1}^{r} h^+_q$ is the cumulative sum of\nthe histogram H+.\nImportantly, the loss (4) is differentiable w.r.t. the pairwise similarities s \u2208 S+ and s \u2208 S\u2212. Indeed,\nit is straightforward to obtain $\\frac{\\partial L}{\\partial h^-_r} = \\sum_{q=1}^{r} h^+_q$ and $\\frac{\\partial L}{\\partial h^+_r} = \\sum_{q=r}^{R} h^-_q$ from (4). Furthermore, from\n(1) and (2) it follows that:\n\n$$\\frac{\\partial h^+_r}{\\partial s_{ij}} = \\begin{cases} \\frac{+1}{\\Delta |S^+|}, & \\text{if } s_{ij} \\in [t_{r-1}; t_r], \\\\\\\\ \\frac{-1}{\\Delta |S^+|}, & \\text{if } s_{ij} \\in [t_r; t_{r+1}], \\\\\\\\ 0, & \\text{otherwise,} \\end{cases} \\qquad (5)$$\n\nfor any sij such that mij = +1 (and analogously for $\\partial h^-_r / \\partial s_{ij}$). Finally, $\\partial s_{ij} / \\partial x_i = x_j$ and $\\partial s_{ij} / \\partial x_j = x_i$.\nOne can thus backpropagate the loss to the scalar product similarities, then further to the individual\nembedded points, and then further into the deep embedding network.\nRelation to quadruplet loss. Our loss first estimates the probability distributions of similarities\nfor positive and negative pairs in a semi-parametric way (using histograms), and then computes\nthe probability of reverse using these distributions via equation (4). An alternative and purely non-parametric\nway would be to consider all possible pairs of positive and negative pairs contained in\nthe batch and to estimate this probability from such a set of pairs of pairs. This would correspond\nto evaluating a quadruplet-based loss similarly to [9, 31]. 
The number of pairs of pairs in a batch,\nhowever, tends to grow quartically (as a fourth-degree polynomial) with the batch size, rendering exhaustive\nsampling impractical. This is in contrast to our loss, for which the separation into two stages brings\ndown the complexity to quadratic in batch size. Another efficient loss based on quadruplets is\nintroduced in [24]. The training is done pairwise, but the threshold separating positive and negative\npairs is also learned.\nWe note that quadruplet-based losses as in [9, 31] often encourage the positive pairs to be more\nsimilar than negative pairs by some non-zero margin. It is also easy to incorporate such a non-zero\nmargin into our method by defining the loss to be:\n\n$$L_\\mu(X, \\theta) = \\sum_{r=1}^{R} \\left( h^-_r \\sum_{q=1}^{r+\\mu} h^+_q \\right), \\qquad (6)$$\n\nwhere the new loss effectively enforces the margin \u00b5\u2206. We however do not use such a modification in\nour experiments (preliminary experiments do not show any benefit of introducing the margin).\n\nFigure 2: (left) - Recall@K for the CUB-200-2011 dataset for the Histogram loss (4). Different curves\ncorrespond to variable histogram step \u2206, which is the only parameter inherent to our loss. The curves are very\nsimilar for CUB-200-2011. (right) - Recall@K for the CUHK03 labeled dataset for different batch sizes. Results\nfor batch size 256 are uniformly better than those for smaller values.\n\n4 Experiments\n\nIn this section we present the results of embedding learning. We compare our loss to state-of-the-art\npairwise and triplet losses, which have been reported in recent works to give state-of-the-art\nperformance on these datasets.\nBaselines. In particular, we have evaluated the Binomial Deviance loss [28]. 
While we are aware only\nof its use in person re-identification approaches, in our experiments it performed very well for product\nimage search and bird recognition, significantly outperforming the baseline pairwise (contrastive) loss\nreported in [21], once its parameters are tuned. The binomial deviance loss is defined as:\n\n$$J_{dev} = \\sum_{i,j \\in I} w_{i,j} \\ln\\left( e^{-\\alpha (s_{i,j} - \\beta) m_{i,j}} + 1 \\right), \\qquad (7)$$\n\nwhere I is the set of training image indices, and si,j is the similarity measure between ith and jth\nimages (i.e. si,j = cosine(xi, xj)).\nFurthermore, mi,j and wi,j are the learning supervision and scaling factors respectively:\n\n$$m_{i,j} = \\begin{cases} 1, & \\text{if } (i, j) \\text{ is a positive pair,} \\\\\\\\ -C, & \\text{if } (i, j) \\text{ is a negative pair,} \\end{cases} \\qquad w_{i,j} = \\begin{cases} \\frac{1}{n_1}, & \\text{if } (i, j) \\text{ is a positive pair,} \\\\\\\\ \\frac{1}{n_2}, & \\text{if } (i, j) \\text{ is a negative pair,} \\end{cases} \\qquad (8)$$\n\nwhere n1 and n2 are the number of positive and negative pairs in the training set (or mini-batch)\nrespectively, and \u03b1 and \u03b2 are hyper-parameters. Parameter C is the negative cost for balancing\nweights for positive and negative pairs that was introduced in [28]. Our experimental results suggest\nthat the quality of the embedding is sensitive to this parameter. Therefore, in the experiments we\nreport results for the two versions of the loss: with C = 10 that is close to optimal for re-identification\ndatasets, and with C = 25 that is close to optimal for the product and bird datasets.\nWe have also computed the results for the Lifted Structured Similarity Softmax (LSSS) loss [21] on\nCUB-200-2011 [26] and Online Products [21] datasets and additionally applied it to re-identification\ndatasets. The Lifted Structured Similarity Softmax loss is triplet-based and uses a sophisticated triplet\nsampling strategy that was shown in [21] to outperform the standard triplet-based loss.\nAdditionally, we performed experiments for the triplet loss [18] that uses \u201csemi-hard negative\u201d triplet\nsampling. 
Such sampling considers only triplets violating the margin, but still having the positive\ndistance smaller than the negative distance.\n\n[Figure 2 plots: Recall@K, % versus K. Left panel (CUB-200-2011) legend: 0.04, 0.02, 0.01, 0.005. Right panel (CUHK03) legend: 256 hist, 128 hist, 64 hist.]\n\nFigure 3: Recall@K for (left) - CUB-200-2011 and (right) - Online Products datasets for different methods.\nResults for the Histogram loss (4), Binomial Deviance (7), LSSS [21] and Triplet [18] losses are presented.\nThe Binomial Deviance loss with C = 25 outperforms all other methods; the Histogram loss comes very close\nto it. We also include results for contrastive and triplet losses from [21].\n\nDatasets and evaluation metrics. We have evaluated the above mentioned loss functions on the\nfour datasets: CUB-200-2011 [26], CUHK03 [11], Market-1501 [30] and Online Products [21]. All\nthese datasets have been used for evaluating methods of solving embedding learning tasks.\nThe CUB-200-2011 dataset includes 11,788 images of 200 classes corresponding to different bird\nspecies. As in [21] we use the first 100 classes for training (5,864 images) and the remaining classes\nfor testing (5,924 images). The Online Products dataset includes 120,053 images of 22,634 classes.\nClasses correspond to a number of online products from eBay.com. There are approximately 5.3\nimages for each product. We used the standard split from [21]: 11,318 classes (59,551 images) are\nused for training and 11,316 classes (60,502 images) are used for testing. The images from the\nCUB-200-2011 and the Online Products datasets are resized to 256 by 256, keeping the original\naspect ratio (padding is done when needed).\nThe CUHK03 dataset is commonly used for the person re-identification task. It includes 13,164\nimages of 1,360 pedestrians captured from 3 pairs of cameras. Each identity is observed by two\ncameras and has 4.8 images in each camera on average. 
Following most of the previous works we use\nthe \u201cCUHK03-labeled\u201d version of the dataset with manually-annotated bounding boxes. According\nto the CUHK03 evaluation protocol, 1,360 identities are split into 1,160 identities for training, 100\nfor validation and 100 for testing. We use the first split from the CUHK03 standard split set which is\nprovided with the dataset. The Market-1501 dataset includes 32,643 images of 1,501 pedestrians;\neach pedestrian is captured by several cameras (from two to six). The dataset is divided randomly\ninto the test set of 750 identities and the train set of 751 identities.\nFollowing [21, 28, 30], we report the Recall@K\u00b9 metric for all the datasets. For CUB-200-2011 and\nOnline Products, every test image is used as the query in turn and the remaining images are used as the\ngallery. In contrast, for CUHK03 single-shot results are reported. This means that\none image for each identity from the test set is chosen randomly in each of its two camera views.\nRecall@K values for 100 random query-gallery sets are averaged to compute the final result for a\ngiven split. For the Market-1501 dataset, we use the multi-shot protocol (as is done in most other\nworks), as there are many images of the same person in the gallery set.\nArchitectures used. 
For training on the CUB-200-2011 and the Online Products datasets we used\nthe same architecture as in [21], which coincides with the GoogLeNet architecture [23] up to the\n\u2018pool5\u2019 and the inner product layers, while the last layer is used to compute the embedding vectors.\nThe GoogLeNet part is pretrained on ImageNet ILSVRC [16] and the last layer is trained from scratch.\nAs in [21], all GoogLeNet layers are fine-tuned with the learning rate that is ten times less than\nthe learning rate of the last layer. We set the embedding size to 512 for all the experiments with\nthis architecture. We reproduced the results for the LSSS loss [21] for these two datasets. For the\narchitectures that use the Binomial Deviance loss, Histogram loss and Triplet loss the iteration number\nand the parameter values (for the former) are chosen using the validation set.\n\n\u00b9 Recall@K is the probability of getting the right match among the first K gallery candidates sorted by similarity.\n\n[Figure 3 plots: Recall@K, % versus K for CUB-200-2011 and Online Products. Legend: Histogram; LSSS; Binomial Deviance, c=10; Binomial Deviance, c=25; Triplet semi-hard; GoogLeNet pool5; Contrastive (from [21]); Triplet (from [21]).]\n\nFigure 4: Recall@K for (left) - CUHK03 and (right) - Market-1501 datasets. The Histogram loss (4) outperforms\nBinomial Deviance, LSSS and Triplet losses.\n\nFor training on CUHK03 and Market-1501 we used the Deep Metric Learning (DML) architecture\nintroduced in [28]. It has three CNN streams for the three parts of the pedestrian image (head and\nupper torso, torso, lower torso and legs). Each of the streams consists of 2 convolution layers followed\nby the ReLU non-linearity and max-pooling. The first convolution layers for the three streams have\nshared weights. 
Descriptors are produced by the last 500-dimensional inner product layer that has the\nconcatenated outputs of the three streams as an input.\n\nDataset        r = 1    r = 5    r = 10   r = 15   r = 20\nCUHK03         65.77    92.85    97.62    98.94    99.43\nMarket-1501    59.47    80.73    86.94    89.28    91.09\n\nTable 1: Final results for CUHK03-labeled and Market-1501. For\nCUHK03-labeled, results for 5 random splits were averaged. A batch\nof size 256 was used for both experiments.\n\nImplementation details. For all the experiments with loss functions (4) and (7) we used a quadratic number\nof pairs in each batch (all the pairs that can be sampled from the batch). For the triplet loss, \u201csemi-hard\u201d triplets\nchosen from all the possible triplets in the batch are used. For comparison with other methods the batch size was set\nto 128. We sample batches randomly in such a way that there are several\nimages for each sampled class in the batch. We iterate over all the classes and all the images\ncorresponding to the classes, sampling images in turn. The sequences of the classes and of the\ncorresponding images are shuffled for every new epoch. CUB-200-2011 and Market-1501 include\nmore than ten images per class on average, so we limit the number of images of the same class in the\nbatch to ten for the experiments on these datasets. We used ADAM [7] for stochastic optimization\nin all of the experiments. For all losses the learning rate is set to 1e-4 for all the experiments\nexcept the ones on the CUB-200-2011 dataset, for which we have found the learning rate of 1e-5\nmore effective. For the re-identification datasets the learning rate was decreased by a factor of 10 after 100K\niterations; for the other experiments the learning rate was fixed. The number of iterations for each method\nwas chosen using the validation set.\nResults. The Recall@K values for the experiments on CUB-200-2011, Online Products, CUHK03\nand Market-1501 are shown in Figure 3 and Figure 4. 
The Binomial Deviance loss (7) gives the\nbest results for CUB-200-2011 and Online Products with the C parameter set to 25. We previously\nchecked several values of C on the CUB-200-2011 dataset and found the value C = 25 to be the\noptimal one. We also observed that with smaller values of C the results are significantly worse than\nthose presented in Figure 3-left (for C equal to 2 the best Recall@1 is 43.50%).\n\n[Figure 4 plots: Recall@K, % versus K for CUHK03 and Market-1501. Legend: Histogram; Binomial Deviance, c=10; Binomial Deviance, c=25; LSSS; Triplet semi-hard.]\n\nFigure 5: Histograms for positive and negative distance distributions on the CUHK03 test set for: (a) Initial\nstate: randomly initialized net, (b) Network trained with the Histogram loss, (c) same for the Binomial Deviance\nloss, (d) same for the LSSS loss. Red is for negative pairs, green is for positive pairs. Negative cosine distance\nmeasure is used for Histogram and Binomial Deviance losses, Euclidean distance is used for the LSSS loss.\nInitially the two distributions are highly overlapped. For the Histogram loss the distribution overlap is less than\nfor the LSSS.\n\nFor CUHK03\nthe situation is reversed: the Histogram loss gives a boost of 2.64% over the Binomial Deviance\nloss with C = 10 (which we found to be optimal for this dataset). The results are shown in\nFigure 4-left. Embedding distributions of the positive and negative pairs from the CUHK03 test\nset for different methods are shown in Figure 5b, Figure 5c and Figure 5d. For the Market-1501 dataset\nour method also outperforms the Binomial Deviance loss for both values of C. In contrast to the\nexperiments with CUHK03, the Binomial Deviance loss appeared to perform better with C set to 25\nthan to 10 for Market-1501. 
We have also investigated how the size of the histogram bin affects the\nmodel performance for the Histogram loss. As shown in Figure 2-left, the results for CUB-200-2011\nremain stable for the sizes equal to 0.005, 0.01, 0.02 and 0.04 (these values correspond to 400,\n200, 100 and 50 bins in the histograms). In our method, distributions of similarities of training data\nare estimated by distributions of similarities within mini-batches. Therefore we also show results\nfor the Histogram loss for various batch size values (Figure 2-right). Larger batches are\npreferable: for CUHK03, Recall@K for batch size equal to 256 is uniformly better than Recall@K\nfor 128 and 64. We also observed similar behaviour for Market-1501. Additionally, we present\nour final results (batch size set to 256) for CUHK03 and Market-1501 in Table 1. For CUHK03,\nRecall@K values for 5 random splits were averaged. To the best of our knowledge, these results\ncorresponded to the state of the art on CUHK03 and Market-1501 at the moment of submission. To\nsummarize the results of the comparison: the new (Histogram) loss gives the best results on the two\nperson re-identification problems. For CUB-200-2011 and Online Products it came very close to the\nbest loss (Binomial Deviance with C = 25). Interestingly, the histogram loss uniformly outperformed\nthe triplet-based LSSS loss [21] in our experiments including two datasets from [21]. Importantly,\nthe new loss does not require tuning the parameters associated with it (though we have found learning\nwith our loss to be sensitive to the learning rate).\n\n5 Conclusion\n\nIn this work we have suggested a new loss function for learning deep embeddings, called the\nHistogram loss. Like most previous losses, it is based on the idea of making the distributions of\nthe similarities of the positive and negative pairs less overlapping. 
Unlike other losses used for\ndeep embeddings, the new loss comes with virtually no parameters that need to be tuned. It also\nincorporates information across a large number of quadruplets formed from training samples in\nthe mini-batch and implicitly takes into account all of such quadruplets. We have demonstrated\nthe competitive results of the new loss on a number of datasets. In particular, the Histogram loss\noutperformed other losses for the person re-identification problem on CUHK03 and Market-1501\ndatasets. The code for Caffe [6] is available at: https://github.com/madkn/HistogramLoss.\nAcknowledgement: This research is supported by the Russian Ministry of Science and Education\ngrant RFMEFI57914X0071.\n\nReferences\n[1] R. Arandjelovi\u0107, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised\nplace recognition. IEEE International Conference on Computer Vision, 2015.\n\n[2] A. Bowman and A. Azzalini. Applied smoothing techniques for data analysis. Number 18 in Oxford\nstatistical science series. Clarendon Press, Oxford, 1997.\n\n[3] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. S\u00e4ckinger, and R. Shah. Signature\nverification using a \u201csiamese\u201d time delay neural network. International Journal of Pattern Recognition\nand Artificial Intelligence, 7(04):669\u2013688, 1993.\n\n[4] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through\nranking. The Journal of Machine Learning Research, 11:1109\u20131135, 2010.\n\n[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to\nface verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition\n(CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pp. 539\u2013546, 2005.\n\n[6] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. 
Caffe:\nConvolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[7] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.\n\n[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classi\ufb01cation with deep convolutional neural\nnetworks. Advances in neural information processing systems (NIPS), pp. 1097\u20131105, 2012.\n\n[9] M. Law, N. Thome, and M. Cord. Quadruplet-wise image similarity learning. Proceedings of the IEEE\nInternational Conference on Computer Vision, pp. 249\u2013256, 2013.\n\n[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel.\nBackpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541\u2013551, 1989.\n\n[11] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep \ufb01lter pairing neural network for person\nre-identi\ufb01cation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus,\nOH, USA, June 23-28, 2014, pp. 152\u2013159, 2014.\n\n[12] J. Lin, O. Mor\u00e8re, V. Chandrasekhar, A. Veillard, and H. Goh. DeepHash: Getting regularization, depth and\n\ufb01ne-tuning right. CoRR, abs/1501.04711, 2015.\n\n[13] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. Proceedings of the British Machine\nVision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pp. 41.1\u201341.12, 2015.\n\n[14] Q. Qian, R. Jin, S. Zhu, and Y. Lin. Fine-grained visual categorization via multi-stage metric learning.\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3716\u20133724, 2015.\n\n[15] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding\nbaseline for recognition. IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops\n2014, Columbus, OH, USA, June 23-28, 2014, pp. 512\u2013519, 2014.\n\n[16] O. Russakovsky, J. Deng, H. Su, J. 
Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,\nA. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International\nJournal of Computer Vision (IJCV), 115(3):211\u2013252, 2015.\n\n[17] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A uni\ufb01ed embedding for face recognition and\nclustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815\u2013823,\n2015.\n\n[18] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A uni\ufb01ed embedding for face recognition and\nclustering. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA,\nJune 7-12, 2015, pp. 815\u2013823, 2015.\n\n[19] M. Schultz and T. Joachims. Learning a distance metric from relative comparisons. Advances in neural\ninformation processing systems (NIPS), p. 41, 2004.\n\n[20] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning\nof deep convolutional feature point descriptors. Proceedings of the IEEE International Conference on\nComputer Vision, pp. 118\u2013126, 2015.\n\n[21] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature\nembedding. Computer Vision and Pattern Recognition (CVPR), 2016.\n\n[22] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identi\ufb01cation-veri\ufb01cation.\nAdvances in Neural Information Processing Systems, pp. 1988\u20131996, 2014.\n\n[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.\nGoing deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pp. 1\u20139, 2015.\n\n[24] O. Tadmor, T. Rosenwein, S. Shalev-Shwartz, Y. Wexler, and A. Shashua. Learning a metric embedding\nfor face recognition using the multibatch method. NIPS, 2016.\n\n[25] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. 
DeepFace: Closing the gap to human-level performance\nin face veri\ufb01cation. Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on,\npp. 1701\u20131708. IEEE, 2014.\n\n[26] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset.\n(CNS-TR-2011-001), 2011.\n\n[27] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classi\ufb01cation.\nThe Journal of Machine Learning Research, 10:207\u2013244, 2009.\n\n[28] D. Yi, Z. Lei, and S. Z. Li. Deep metric learning for practical person re-identi\ufb01cation. arXiv preprint\narXiv:1407.4979, 2014.\n\n[29] J. \u017dbontar and Y. LeCun. Stereo matching by training a convolutional neural network to compare image\npatches. arXiv preprint arXiv:1510.05970, 2015.\n\n[30] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identi\ufb01cation: A benchmark.\nComputer Vision, IEEE International Conference on, 2015.\n\n[31] W.-S. Zheng, S. Gong, and T. Xiang. Reidenti\ufb01cation by relative distance comparison. Pattern Analysis\nand Machine Intelligence, IEEE Transactions on, 35(3):653\u2013668, 2013.\n", "award": [], "sourceid": 2069, "authors": [{"given_name": "Evgeniya", "family_name": "Ustinova", "institution": "Skoltech"}, {"given_name": "Victor", "family_name": "Lempitsky", "institution": "Skolkovo Institute of Science and Technology (Skoltech)"}]}