{"title": "Hamming Distance Metric Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1061, "page_last": 1069, "abstract": "Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.", "full_text": "Hamming Distance Metric Learning

Mohammad Norouzi†    David J. Fleet†    Ruslan Salakhutdinov†,‡

Departments of Computer Science† and Statistics‡
University of Toronto
[norouzi,fleet,rsalakhu]@cs.toronto.edu

Abstract

Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length.
We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.

1 Introduction

Many machine learning algorithms presuppose the existence of a pairwise similarity measure on the input space. Examples include semi-supervised clustering, nearest neighbor classification, and kernel-based methods. When similarity measures are not given a priori, one could adopt a generic function such as Euclidean distance, but this often produces unsatisfactory results. The goal of metric learning techniques is to improve matters by incorporating side information, and optimizing parametric distance functions such as the Mahalanobis distance [7, 12, 30, 34, 36].

Motivated by large-scale multimedia applications, this paper advocates the use of discrete mappings from input features to binary codes. Compact binary codes are remarkably storage efficient, allowing one to store massive datasets in memory. The Hamming distance, a natural similarity measure on binary codes, can be computed with just a few machine instructions per comparison. Further, it has been shown that one can perform exact nearest neighbor search in Hamming space significantly faster than linear search, with sub-linear run-times [15, 25]. By contrast, retrieval based on Mahalanobis distance requires approximate nearest neighbor (ANN) search, for which state-of-the-art methods (e.g., see [18, 23]) do not always perform well, especially with massive, high-dimensional datasets when storage overheads and distance computations become prohibitive.

Most approaches to discrete (binary) embeddings have focused on preserving the metric (e.g. Euclidean) structure of the input data, the canonical example being locality-sensitive hashing (LSH) [4, 17]. Based on random projections, LSH and its variants (e.g., [26]) provide guarantees that metric similarity is preserved for sufficiently long codes.
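The claim that Hamming distance costs only a few machine instructions per comparison is easy to illustrate on bit-packed codes. The sketch below is our own illustration, not the authors' code; `hamming_knn` is a naive linear scan for exposition only, whereas [15, 25] describe the sub-linear search algorithms referred to above.

```python
import numpy as np

# Codes are packed 8 bits per byte (uint8). Hamming distance is then a
# bitwise XOR followed by a population count; here the popcount is done
# with a 256-entry lookup table.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_distance(a, b):
    """Hamming distance between two packed uint8 code arrays."""
    return int(POPCOUNT[np.bitwise_xor(a, b)].sum())

def hamming_knn(query, database, k):
    """Indices of the k database codes closest to `query` (linear scan)."""
    d = POPCOUNT[np.bitwise_xor(database, query)].sum(axis=1)
    return np.argsort(d, kind="stable")[:k]
```

On a modern CPU the same XOR/popcount pair maps to single machine instructions, which is what makes exhaustive scans over millions of codes feasible in the first place.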
To find compact codes, recent research has turned to machine learning techniques that optimize mappings for specific datasets (e.g., [20, 28, 29, 32, 3]). However, most such methods aim to preserve Euclidean structure (e.g. [13, 20, 35]).

In metric learning, by comparison, the goal is to preserve semantic structure based on labeled attributes or parameters associated with training exemplars. There are papers on learning binary hash functions that preserve semantic similarity [29, 28, 32, 24], but most have only considered ad hoc datasets and uninformative performance measures, for which it is difficult to judge performance with anything but the qualitative appearance of retrieval results. The question of whether or not it is possible to learn hash functions capable of preserving complex semantic structure, with high fidelity, has remained unanswered.

To address this issue, we introduce a framework for learning a broad class of binary hash functions based on a triplet ranking loss designed to preserve relative similarity (cf. [11, 5]). While certainly useful for preserving metric structure, this loss function is very well suited to the preservation of semantic similarity. Notably, it can be viewed as a form of local ranking loss. It is more flexible than the pairwise hinge loss of [24], and is shown below to produce superior hash functions.

Our formulation is inspired by latent SVM [10] and latent structural SVM [37] models, and it generalizes the minimal loss hashing (MLH) algorithm of [24]. Accordingly, to optimize hash function parameters we formulate a continuous upper bound on empirical loss, with a new form of loss-augmented inference designed for efficient optimization with the proposed triplet loss on the Hamming space. To our knowledge, this is one of the most general frameworks for learning a broad class of hash functions.
In particular, many previous loss-based techniques [20, 24] are not capable of optimizing mappings that involve non-linear projections, e.g., by neural nets.

Our experiments indicate that the framework is capable of preserving semantic structure on challenging datasets, namely, MNIST [1] and CIFAR-10 [19]. We show that k-nearest neighbor (kNN) search on the resulting binary codes retrieves items that bear remarkable similarity to a given query item. To show that the binary representation is rich enough to capture salient semantic structure, as is common in metric learning, we also report classification performance on the binary codes. Surprisingly, on these datasets, simple kNN classifiers in Hamming space are competitive with sophisticated discriminative classifiers, including SVMs and neural networks. An important appeal of our approach is the scalability of kNN search on binary codes to billions of data points, and of kNN classification to millions of class labels.

2 Formulation

The task is to learn a mapping b(x) that projects p-dimensional real-valued inputs x ∈ R^p onto q-dimensional binary codes h ∈ H ≡ {−1, 1}^q, while preserving some notion of similarity. This mapping, referred to as a hash function, is parameterized by a real-valued vector w as

b(x; w) = \operatorname{sign}(f(x; w)) ,     (1)

where sign(.) denotes the element-wise sign function, and f(x; w) : R^p → R^q is a real-valued transformation. Different forms of f give rise to different families of hash functions:

1. A linear transform f(x) = W x, where W ∈ R^{q×p} and w ≡ vec(W), is the simplest and most well-studied case [4, 13, 24, 33]. Under this mapping the kth bit is determined by a hyperplane in the input space whose normal is given by the kth row of W.¹

2. In [35], linear projections are followed by an element-wise cosine transform, i.e. f(x) = cos(W x).
For such mappings the bits correspond to stripes of +1 and −1 regions, oriented parallel to the corresponding hyperplanes, in the input space.

3. Kernelized hash functions [20, 21].

4. More complex hash functions are obtained with multilayer neural networks [28, 32]. For example, a two-layer network with a p′-dimensional hidden layer and weight matrices W1 ∈ R^{p′×p} and W2 ∈ R^{q×p′} can be expressed as f(x) = tanh(W2 tanh(W1 x)), where tanh(.) is the element-wise hyperbolic tangent function.

Our Hamming distance metric learning framework applies to all of the above families of hash functions. The only restriction is that f must be differentiable with respect to its parameters, so that one is able to compute the Jacobian of f(x; w) with respect to w.

2.1 Loss functions

The choice of loss function is crucial for learning good similarity measures. To this end, most existing supervised binary hashing techniques [13, 22, 24] formulate learning objectives in terms of pairwise similarity, where pairs of inputs are labelled as either similar or dissimilar. Similarity-preserving hashing aims to ensure that Hamming distances between binary codes for similar (dissimilar) items are small (large). For example, MLH [24] uses a pairwise hinge loss function. For

¹For presentation clarity, in linear and nonlinear cases, we omit bias terms.
They are incorporated by adding one dimension to the input vectors, and to the hidden layers of neural networks, with a fixed value of one.

two binary codes h, g ∈ H with Hamming distance² ‖h − g‖_H, and a similarity label s ∈ {0, 1}, the pairwise hinge loss is defined as:

\ell_{\text{pair}}(h, g, s) = \begin{cases} \left[\, \|h - g\|_H - \rho + 1 \,\right]_+ & \text{for } s = 1 \text{ (similar)} \\ \left[\, \rho - \|h - g\|_H + 1 \,\right]_+ & \text{for } s = 0 \text{ (dissimilar)}, \end{cases}     (2)

where [α]_+ ≡ max(α, 0), and ρ is a Hamming distance threshold that separates similar from dissimilar codes. This loss incurs zero cost when a pair of similar inputs map to codes that differ by less than ρ bits. The loss is zero for dissimilar items whose Hamming distance is more than ρ bits.

One problem with such loss functions is that finding a suitable threshold ρ with cross-validation is slow. Furthermore, for many problems one cares more about the relative magnitudes of pairwise distances than their precise numerical values. So, constraining pairwise Hamming distances over all pairs of codes with a single threshold is overly restrictive. More importantly, not all datasets are amenable to labeling input pairs as similar or dissimilar. One way to avoid some of these problems is to define loss in terms of relative similarity. Such loss functions have been used in metric learning [5, 11], and, as shown below, they are also naturally suited to Hamming distance metric learning.

To define relative similarity, we assume that the training data includes triplets of items (x, x+, x−), such that the pair (x, x+) is more similar than the pair (x, x−). Our goal is to learn a hash function b such that b(x) is closer to b(x+) than to b(x−) in Hamming distance.
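Both notions of loss can be sketched directly on codes in {−1, +1}^q. The helper functions below are our own illustration (function names are ours); `pairwise_hinge_loss` follows Eq. 2, and `triplet_ranking_loss` corresponds to the ranking loss proposed next in the text.

```python
import numpy as np

def hamming(u, v):
    """Hamming distance between codes in {-1, +1}^q."""
    return int(np.sum(u != v))

def pairwise_hinge_loss(h, g, s, rho):
    """Eq. (2): zero when similar codes (s=1) differ by fewer than rho bits,
    and when dissimilar codes (s=0) differ by more than rho bits."""
    if s == 1:
        return max(hamming(h, g) - rho + 1, 0)
    return max(rho - hamming(h, g) + 1, 0)

def triplet_ranking_loss(h, hp, hn):
    """Zero iff h is at least one bit closer to h+ (hp) than to h- (hn)."""
    return max(hamming(h, hp) - hamming(h, hn) + 1, 0)
```

Note that the triplet loss needs no threshold ρ at all, which is precisely the flexibility argued for above.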
Accordingly, we propose a ranking loss on the triplet of binary codes (h, h+, h−), obtained from b applied to (x, x+, x−):

\ell_{\text{triplet}}(h, h^+, h^-) = \left[\, \|h - h^+\|_H - \|h - h^-\|_H + 1 \,\right]_+ .     (3)

This loss is zero when the Hamming distance between the more-similar pair, ‖h − h+‖_H, is at least one bit smaller than the Hamming distance between the less-similar pair, ‖h − h−‖_H. This loss function is more flexible than the pairwise loss function ℓ_pair, as it can be used to preserve rankings among similar items, for example based on Euclidean distance, or perhaps using path distance between category labels within a phylogenetic tree.

3 Optimization

Given a training set of triplets, D = {(x_i, x_i^+, x_i^-)}_{i=1}^n, our objective is the sum of the empirical loss and a simple regularizer on the vector of unknown parameters w:

L(w) = \sum_{(x, x^+, x^-) \in D} \ell_{\text{triplet}}\big( b(x; w),\, b(x^+; w),\, b(x^-; w) \big) + \frac{\lambda}{2} \|w\|_2^2 .     (4)

This objective is discontinuous and non-convex. The hash function is a discrete mapping and empirical loss is piecewise constant. Hence optimization is very challenging. We cannot overcome the non-convexity, but the problems owing to the discontinuity can be mitigated through the construction of a continuous upper bound on the loss.

The upper bound on loss that we adopt is inspired by previous work on latent structural SVMs [37]. The key observation that relates our Hamming distance metric learning framework to structured prediction is as follows:

b(x; w) = \operatorname{sign}(f(x; w)) = \operatorname*{argmax}_{h \in H} \; h^T f(x; w) ,     (5)

where H ≡ {−1, +1}^q.
The argmax on the RHS effectively means that for dimensions of f(x; w) with positive values, the optimal code should take on values +1, and when elements of f(x; w) are negative the corresponding bits of the code should be −1. This is identical to the sign function.

3.1 Upper bound on empirical loss

The upper bound on loss that we exploit for learning hash functions takes the following form:

\ell_{\text{triplet}}\big( b(x; w),\, b(x^+; w),\, b(x^-; w) \big) \;\le\; \max_{g, g^+, g^-} \Big\{ \ell_{\text{triplet}}(g, g^+, g^-) + g^T f(x; w) + g^{+T} f(x^+; w) + g^{-T} f(x^-; w) \Big\} \;-\; \max_{h} \big\{ h^T f(x; w) \big\} \;-\; \max_{h^+} \big\{ h^{+T} f(x^+; w) \big\} \;-\; \max_{h^-} \big\{ h^{-T} f(x^-; w) \big\} ,     (6)

where g, g+, g−, h, h+, and h− are constrained to be q-dimensional binary vectors. To prove the inequality in Eq. 6, note that if the first term on the RHS were maximized³ by (g, g+, g−) = (b(x), b(x+), b(x−)), then using Eq. 5, it is straightforward to show that Eq. 6 would become an equality. In all other cases of (g, g+, g−) which maximize the first term, the RHS can only be as large or larger than when (g, g+, g−) = (b(x), b(x+), b(x−)), hence the inequality holds.

²The Hamming norm ‖v‖_H is defined as the number of non-zero entries of vector v.

Summing the upper bound instead of the loss in Eq. 4 yields an upper bound on the regularized empirical loss in Eq. 4. Importantly, the resulting bound is easily shown to be continuous and piecewise smooth in w as long as f is continuous in w. The upper bound of Eq. 6 is a generalization of a bound introduced in [24] for the linear case, f(x) = W x. In particular, when f is linear in w, the bound on regularized empirical loss becomes piecewise linear and convex-concave. While the bound in Eq.
6 is more challenging to optimize than the bound in [24], it allows us to learn hash functions based on non-linear functions f, e.g. neural networks. While the bound in [24] was defined for ℓ_pair-type loss functions and pairwise similarity labels, the bound here applies to the more flexible class of triplet loss functions.

3.2 Loss-augmented inference

To use the upper bound in Eq. 6 for optimization, we must be able to find the binary codes given by

(\hat{g}, \hat{g}^+, \hat{g}^-) = \operatorname*{argmax}_{(g, g^+, g^-)} \Big\{ \ell_{\text{triplet}}(g, g^+, g^-) + g^T f(x) + g^{+T} f(x^+) + g^{-T} f(x^-) \Big\} .     (7)

In the structured prediction literature this maximization is called loss-augmented inference. The challenge stems from the 2^{3q} possible binary codes over which one has to maximize the RHS. Fortunately, we can show that this loss-augmented inference problem can be solved efficiently for the class of triplet loss functions that depend only on the value of

d(g, g^+, g^-) \equiv \|g - g^+\|_H - \|g - g^-\|_H .

Importantly, such loss functions do not depend on the specific binary codes, but rather just the differences. Further, note that d(g, g^+, g^-) can take on only 2q + 1 possible values, since it is an integer between −q and +q. Clearly the triplet ranking loss only depends on d since

\ell_{\text{triplet}}(g, g^+, g^-) = \ell'\big( d(g, g^+, g^-) \big) , \quad \text{where} \quad \ell'(\alpha) = [\, \alpha + 1 \,]_+ .     (8)

For this family of loss functions, given the values of f(.) in Eq. 7, loss-augmented inference can be performed in time O(q²). To prove this, first consider the case d(g, g^+, g^-) = m, where m is an integer between −q and q. In this case we can replace the loss-augmented inference problem with

\ell'(m) + \max_{g, g^+, g^-} \Big\{ g^T f(x) + g^{+T} f(x^+) + g^{-T} f(x^-) \Big\} \quad \text{s.t.} \quad d(g, g^+, g^-) = m .     (9)

One can solve Eq. 9 for each possible value of m. It is straightforward to see that the largest of those 2q + 1 maxima is the solution to Eq. 7. Then, what remains for us is to solve Eq. 9.

To solve Eq. 9, consider the ith bit of each of the three codes, i.e. a = g[i], b = g^+[i], and c = g^-[i], where v[i] denotes the ith element of vector v. There are 8 ways to select a, b, and c, but no matter what values they take on, they can only change the value of d(g, g^+, g^-) by −1, 0, or +1. Accordingly, let e_i ∈ {−1, 0, +1} denote the effect of the ith bits on d(g, g^+, g^-). For each value of e_i, we can easily compute the maximal contribution of (a, b, c) to Eq. 9 by

\operatorname{cont}(i, e_i) = \max_{a, b, c} \big\{ a\, f(x)[i] + b\, f(x^+)[i] + c\, f(x^-)[i] \big\} \quad \text{s.t.} \quad a, b, c \in \{-1, +1\} \ \text{and} \ \|a - b\|_H - \|a - c\|_H = e_i .     (10)

Therefore, to solve Eq. 9, we aim to select values of e_i, for all i, such that \sum_{i=1}^q e_i = m and \sum_{i=1}^q \operatorname{cont}(i, e_i) is maximized. This can be solved for any m using a dynamic programming algorithm, similar to knapsack, in O(q²). Finally, we choose the m that maximizes Eq. 9 and set the bits to the configurations that maximized cont(i, e_i).

³For presentation clarity we will sometimes drop the dependence of f and b on w, and write b(x) and f(x).

3.3 Perceptron-like learning

Our learning algorithm is a form of stochastic gradient descent, where in the tth iteration we sample a triplet (x, x^+, x^-) from the dataset, and then take a step in the direction that decreases the upper bound on the triplet's loss in Eq. 6. To this end, we randomly initialize w^{(0)}. Then, at each iteration t + 1, given w^{(t)}, we use the following procedure to update the parameters, w^{(t+1)}:

1.
Select a random triplet (x, x^+, x^-) from dataset D.

2. Compute (\hat{h}, \hat{h}^+, \hat{h}^-) = (b(x; w^{(t)}), b(x^+; w^{(t)}), b(x^-; w^{(t)})) using Eq. 5.

3. Compute (\hat{g}, \hat{g}^+, \hat{g}^-), the solution to the loss-augmented inference problem in Eq. 7.

4. Update the model parameters using

w^{(t+1)} = w^{(t)} + \eta \left[ \frac{\partial f(x)}{\partial w} \big( \hat{h} - \hat{g} \big) + \frac{\partial f(x^+)}{\partial w} \big( \hat{h}^+ - \hat{g}^+ \big) + \frac{\partial f(x^-)}{\partial w} \big( \hat{h}^- - \hat{g}^- \big) - \lambda w^{(t)} \right] ,

where η is the learning rate, and \partial f(x)/\partial w \equiv \partial f(x; w)/\partial w \,|_{w = w^{(t)}} \in R^{|w| \times q} is the transpose of the Jacobian matrix, where |w| is the number of parameters.

This update rule can be seen as gradient descent in the upper bound of the regularized empirical loss. Although the upper bound in Eq. 6 is not differentiable at isolated points (owing to the max terms), in our experiments we find that this update rule consistently decreases both the upper bound and the actual regularized empirical loss L(w).

4 Asymmetric Hamming distance

When Hamming distance is used to score and retrieve the nearest neighbors to a given query, there is a high probability of a tie, where multiple items are equidistant from the query in Hamming space. To break ties and improve the similarity measure, previous work suggests the use of an asymmetric Hamming (AH) distance [9, 14]. With an AH distance, one stores dataset entries as binary codes (for storage efficiency) but the queries are not binarized. An asymmetric distance function is therefore defined on a real-valued query vector, v ∈ R^q, and a database binary code, h ∈ H. Computing AH distance is slightly less efficient than Hamming distance, and efficient retrieval algorithms, such as [25], are not directly applicable.
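As a concrete illustration of the loss-augmented inference step of Sec. 3.2, the following sketch (our own, not the authors' implementation) runs the knapsack-style dynamic program, assuming the triplet ranking loss, i.e. ℓ′(α) = [α + 1]₊:

```python
import numpy as np

def loss_augmented_inference(fx, fxp, fxn):
    """Maximize l'(d) + g.f(x) + g+.f(x+) + g-.f(x-) over binary codes
    (Eq. 7), where d(g, g+, g-) = ||g - g+||_H - ||g - g-||_H and
    l'(a) = [a + 1]_+, via the O(q^2) knapsack-style DP of Sec. 3.2."""
    q = len(fx)
    # Best per-bit contribution and bits (a, b, c) for each effect
    # e_i = ||a - b||_H - ||a - c||_H in {-1, 0, +1}.
    cont = []
    for i in range(q):
        by_e = {}
        for a in (-1, 1):
            for b in (-1, 1):
                for c in (-1, 1):
                    e = int(a != b) - int(a != c)
                    v = a * fx[i] + b * fxp[i] + c * fxn[i]
                    if e not in by_e or v > by_e[e][0]:
                        by_e[e] = (v, (a, b, c))
        cont.append(by_e)
    # DP over bits: val[m] = best total contribution with effects summing to m.
    val, back = {0: 0.0}, []
    for i in range(q):
        nval, nback = {}, {}
        for m, v in val.items():
            for e, (dv, abc) in cont[i].items():
                if m + e not in nval or v + dv > nval[m + e]:
                    nval[m + e] = v + dv
                    nback[m + e] = (m, abc)
        val = nval
        back.append(nback)
    # Pick m = d(g, g+, g-) maximizing loss plus contribution, then backtrack.
    m = max(val, key=lambda s: max(s + 1, 0) + val[s])
    g, gp, gn = [], [], []
    for i in reversed(range(q)):
        m, (a, b, c) = back[i][m]
        g.append(a)
        gp.append(b)
        gn.append(c)
    return np.array(g[::-1]), np.array(gp[::-1]), np.array(gn[::-1])
```

The DP state is the running sum of effects, an integer in [−q, q], so there are O(q) states per bit and O(q²) work overall, matching the complexity claimed in Sec. 3.2.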
Nevertheless, the AH distance can also be used to re-rank items retrieved using Hamming distance, with a negligible increase in run-time. To improve efficiency further when there are many codes to be re-ranked, AH distances from the query to binary codes can be pre-computed for each 8 or 16 consecutive bits, and stored in a query-specific lookup table.

In this work, we use the following asymmetric Hamming distance function:

\operatorname{AH}(h, v; s) = \frac{1}{4} \big\| h - \tanh(\operatorname{Diag}(s)\, v) \big\|_2^2 ,     (11)

where s ∈ R^q is a vector of scaling parameters that control the slope of the hyperbolic tangent applied to different bits; Diag(s) is a diagonal matrix with the elements of s on its diagonal. As the scaling factors in s approach infinity, AH and Hamming distances become identical. Here we use the AH distance between a database code b(x′) and the real-valued projection for the query, f(x). Based on our validation sets, the AH distance of Eq. 11 is relatively insensitive to the values in s. For the experiments we simply use s to scale the average absolute values of the elements of f(x) to be 0.25.

5 Implementation details

In practice, the basic learning algorithm described in Sec. 3 is implemented with several modifications. First, instead of using a single training triplet to estimate the gradients, we use mini-batches comprising 100 triplets and average the gradient. Second, for each triplet (x, x^+, x^-), we replace x^- with a “hard” example by selecting an item among all negative examples in the mini-batch that is closest in the current Hamming distance to b(x). By harvesting hard negative examples, we ensure that the Hamming constraints for the triplets are not too easily satisfied. Third, to find good binary codes, we encourage each bit, averaged over the training data, to be mean-zero before quantization (motivated in [35]). This is accomplished by adding the following penalty to the objective function:

\frac{1}{2} \Big\| \operatorname*{mean}_x \big( f(x; w) \big) \Big\|_2^2 ,     (12)

where mean_x(f(x; w)) denotes the mean of f(x; w) across the training data. In our implementation, for efficiency, the stochastic gradient of Eq. 12 is computed per mini-batch using the Jacobian matrix in the update rule (see Sec. 3.3). Empirically, we observe that including this term in the objective improves the quality of binary codes, especially with the triplet ranking loss.

We use a heuristic to adapt learning rates, known as bold driver [2]. For each mini-batch we evaluate the learning objective before the parameters are updated. As long as the objective is decreasing we slowly increase the learning rate η, but when the objective increases, η is halved. In particular, after every 25 epochs, if the objective, averaged over the last 25 epochs, decreased, we increase η by 5%; otherwise we decrease η by 50%. We also use a momentum term; i.e., the previous gradient update is scaled by 0.9 and then added to the current gradient.

All experiments are run on a GPU for 2,000 passes through the datasets. The training time for our current implementation is under 4 hours of GPU time for most of our experiments. The two exceptions involve CIFAR-10 with 6400-D inputs and relatively long code lengths of 256 and 512 bits, for which the training times are approximately 8 and 16 hours respectively.

Figure 1: MNIST precision@k: (left) four methods (with 32-bit codes); (right) three code lengths.

6 Experiments

Our experiments evaluate Hamming distance metric learning using two families of hash functions, namely, linear transforms and multilayer neural networks (see Sec. 2). For each, we examine two loss functions, the pairwise hinge loss (Eq. 2) and the triplet ranking loss (Eq.
3).

Experiments are conducted on two well-known image corpora, MNIST [1] and CIFAR-10 [19]. Ground-truth similarity labels are derived from class labels; items from the same class are deemed similar⁴. This definition of similarity ignores intra-class variations and the existence of sub-categories, e.g. styles of handwritten fours, or types of airplanes. Nevertheless, we use these coarse similarity labels to evaluate our framework. To that end, using items from the test set as queries, we report precision@k, i.e. the fraction of the k closest items in Hamming distance that are same-class neighbors. We also show kNN retrieval results for qualitative inspection. Finally, we report Hamming (H) and asymmetric Hamming (AH) kNN classification rates on the test sets.

Datasets. The MNIST [1] digit dataset contains 60,000 training and 10,000 test images (28×28 pixels) of ten handwritten digits (0 to 9). Of the 60,000 training images, we set aside 5,000 for validation. CIFAR-10 [19] comprises 50,000 training and 10,000 test color images (32×32 pixels). Each image belongs to one of 10 classes, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The large variability in scale, viewpoint, illumination, and background clutter poses a significant challenge for classification. Instead of using raw pixel values, we borrow a bag-of-words representation from Coates et al [6]. Its 6400-D feature vector comprises one 1600-bin histogram per image quadrant, the codewords of which are learned from 6×6 image patches. Such high-dimensional inputs are challenging for learning similarity-preserving hash functions. Of the 50,000 training images, we set aside 5,000 for validation.

MNIST: We optimize binary hash functions, mapping raw MNIST images to 32, 64, and 128-bit codes. For each test code we find the k closest training codes using Hamming distance, and report precision@k in Fig. 1.
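The precision@k measure defined above can be sketched directly; the helper below is our own (assuming codes in {−1, +1}^q and integer class labels), not the authors' evaluation code:

```python
import numpy as np

def precision_at_k(query_code, query_label, db_codes, db_labels, k):
    """Fraction of the k nearest database codes (by Hamming distance on
    {-1, +1}^q codes) that share the query's class label."""
    d = np.sum(db_codes != query_code, axis=1)      # Hamming distances
    nearest = np.argsort(d, kind="stable")[:k]       # k closest items
    return float(np.mean(db_labels[nearest] == query_label))
```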
⁴Training triplets are created by taking two items from the same class, and one item from a different class.
⁵The two-layer neural nets for Fig. 1 and Table 1 had 1 hidden layer with 512 units. Weights were initialized randomly, and the Jacobian with respect to the parameters was computed with the backprop algorithm [27].

Table 1: Classification error rates on MNIST test set.

Distance       Hash function, Loss             kNN     32 bits  64 bits  128 bits
Hamming        Linear, pairwise hinge [24]     2 NN    4.66     3.16     2.61
Hamming        Linear, triplet ranking         2 NN    4.44     3.06     2.44
Hamming        Two-layer Net, pairwise hinge   30 NN   1.50     1.45     1.44
Hamming        Two-layer Net, triplet ranking  30 NN   1.45     1.38     1.27
Asym. Hamming  Linear, pairwise hinge          3 NN    4.30     2.78     2.46
Asym. Hamming  Linear, triplet ranking         3 NN    3.88     2.90     2.51
Asym. Hamming  Two-layer Net, pairwise hinge   30 NN   1.50     1.36     1.35
Asym. Hamming  Two-layer Net, triplet ranking  30 NN   1.45     1.29     1.20

Baseline                                       Error
Deep neural nets with pre-training [16]        1.2
Large margin nearest neighbor [34]             1.3
RBF-kernel SVM [8]                             1.4
Neural network [31]                            1.6
Euclidean 3NN                                  2.89

As one might expect, the non-linear mappings⁵ significantly outperform linear mappings. We also find that the triplet loss (Eq. 3) yields better performance than the pairwise loss (Eq. 2). The sharp drop in precision at k = 6000 is a consequence of the fact that each digit in MNIST has approximately 6000 same-class neighbors. Fig. 1 (right) shows how precision improves as a function of the binary code length. Notably, kNN retrieval, for k > 10 and all code lengths, yields higher precision than Euclidean NN on the 784-D input space.
Further, note that these Euclidean results effectively provide an upper bound on the performance one would expect with existing hashing methods that preserve Euclidean distances (e.g., [13, 17, 20, 35]).

One can also evaluate the fidelity of the Hamming space representation in terms of classification performance from the Hamming codes. To focus on the quality of the hash functions, and the speed of retrieval for large-scale multimedia datasets, we use a kNN classifier; i.e. we just use the retrieved neighbors to predict class labels for each test code. Table 1 reports classification error rates using kNN based on Hamming and asymmetric Hamming distance. Non-linear mappings, even with only 32-bit codes, significantly outperform linear mappings (e.g. with 128 bits). The ranking hinge loss also improves upon the pairwise hinge loss, even though the former has no hyperparameters. Table 1 also indicates that AH distance provides a modest boost in performance. For each method the parameter k in the kNN classifier is chosen based on the validation set.

For baseline comparison, Table 1 reports state-of-the-art performance on MNIST with sophisticated discriminative classifiers (excluding those using exemplar deformations and convolutional nets). Despite the simplicity of a kNN classifier, our model achieves error rates of 1.29% and 1.20% using 64- and 128-bit codes. This is compared to 1.4% with RBF-SVM [8], and to 1.6%, the best published neural net result for this version of the task [31]. Our model also outperforms the metric learning approach of [34], and is competitive with the best known Deep Belief Network [16], although they used unsupervised pre-training while we do not.

The above results show that our Hamming distance metric learning framework can preserve sufficient semantic similarity, to the extent that Hamming kNN classification becomes competitive with state-of-the-art discriminative methods.
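The kNN classification protocol, with optional asymmetric Hamming re-ranking (Eq. 11), can be sketched as follows. This is our own illustration, not the paper's implementation; the fixed shortlist size and plain majority vote are our simplifications.

```python
import numpy as np

def asymmetric_hamming(h, v, s):
    """Eq. (11): AH distance between a database code h in {-1, +1}^q and a
    real-valued query projection v = f(x), with per-bit scales s."""
    return 0.25 * np.sum((h - np.tanh(s * v)) ** 2)

def knn_classify(fq, db_codes, db_labels, k, s, shortlist=100):
    """Predict a label for query projection fq: retrieve a Hamming-distance
    shortlist, re-rank it by AH distance, then take a majority vote."""
    q_code = np.sign(fq)                              # binarize the query
    d = np.sum(db_codes != q_code, axis=1)            # Hamming distances
    cand = np.argsort(d, kind="stable")[:shortlist]   # cheap shortlist
    ah = [asymmetric_hamming(db_codes[i], fq, s) for i in cand]
    top = cand[np.argsort(ah, kind="stable")[:k]]     # AH re-ranked top-k
    return np.bincount(db_labels[top]).argmax()       # majority vote
```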
Nevertheless, our method is not solely a classifier, and it can be used within many other machine learning algorithms.

In comparison, another hashing technique called iterative quantization (ITQ) [13] achieves 8.5% error on MNIST and 78% accuracy on CIFAR-10. Our method compares favorably, especially on MNIST. However, ITQ [13] inherently binarizes the outcome of a supervised classifier (Canonical Correlation Analysis with labels), and does not explicitly learn a similarity measure on the input features based on pairs or triplets.

CIFAR-10: On CIFAR-10 we optimize hash functions for 64, 128, 256, and 512-bit codes. The supplementary material includes precision@k curves, showing superior quality of hash functions learned by the ranking loss compared to the pairwise loss. Here, in Fig. 2, we depict the quality of retrieval results for two queries, showing the 16 nearest neighbors using 256-bit codes, 64-bit codes (both learned with the triplet ranking loss), and Euclidean distance in the original 6400-D feature space. The number of class-based retrieval errors is much smaller in Hamming space, and the similarity in visual appearance is also superior. More such results, including failure modes, are shown in the supplementary material.

(Hamming on 256 bit codes)   (Hamming on 64 bit codes)   (Euclidean distance)

Figure 2: Retrieval results for two CIFAR-10 test images using Hamming distance on 256-bit and 64-bit codes, and Euclidean distance on bag-of-words features.
Red rectangles indicate mistakes.

Hashing, Loss                 Distance   kNN    64 bits   128 bits   256 bits   512 bits
Linear, pairwise hinge [24]   H          7 NN   72.2      72.8       73.8       74.6
Linear, pairwise hinge        AH         8 NN   72.3      73.5       74.3       74.9
Linear, triplet ranking       H          2 NN   75.1      75.9       77.1       77.9
Linear, triplet ranking       AH         2 NN   75.7      76.8       77.5       78.0

Baseline                      Accuracy
One-vs-all linear SVM [6]     77.9
Euclidean 3NN                 59.3

Table 2: Recognition accuracy on the CIFAR-10 test set (H ≡ Hamming, AH ≡ Asym. Hamming).

Table 2 reports classification performance (showing accuracy instead of error rates for consistency with previous papers). Euclidean NN on the 6400-D input features yields under 60% accuracy, while kNN with the binary codes obtains 76-78%. As with the MNIST data, this level of performance is comparable to one-vs-all SVMs applied to the same features [6]. Not surprisingly, training fully-connected neural nets on 6400-dimensional features with only 50,000 training examples is challenging and susceptible to over-fitting; hence the results of neural nets on CIFAR-10 were not competitive. Previous work [19] had some success training convolutional neural nets on this dataset. Note that our framework can easily incorporate convolutional neural nets, which are intuitively better suited to the intrinsic spatial structure of natural images.

7 Conclusion

We present a framework for Hamming distance metric learning, which entails learning a discrete mapping from the input space onto binary codes. This framework accommodates different families of hash functions, including quantized linear transforms and multilayer neural nets.
By using a piecewise-smooth upper bound on a triplet ranking loss, we optimize hash functions that are shown to preserve semantic similarity on complex datasets. In particular, our experiments show that a simple kNN classifier on the learned binary codes is competitive with sophisticated discriminative classifiers. While other hashing papers have used CIFAR or MNIST, none report kNN classification performance, often because it has been thought that the bar established by state-of-the-art classifiers is too high. On the contrary, our kNN classification performance suggests that Hamming space can be used to represent complex semantic structures with high fidelity. One appeal of this approach is the scalability of kNN search on binary codes to billions of data points, and of kNN classification to millions of class labels.

References

[1] http://yann.lecun.com/exdb/mnist/.
[2] R. Battiti. Accelerated backpropagation learning: Two optimization methods. Complex Systems, 1989.
[3] A. Bergamo, L. Torresani, and A. Fitzgibbon. Picodes: Learning a compact code for novel-category recognition. NIPS, 2011.
[4] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC, 2002.
[5] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. JMLR, 2010.
[6] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011.
[7] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. ICML, 2007.
[8] D. Decoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 2002.
[9] W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. SIGIR, 2008.
[10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan.
Object detection with discriminatively trained part-based models. IEEE Trans. PAMI, 2010.
[11] A. Frome, Y. Singer, F. Sha, and J. Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. ICCV, 2007.
[12] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. NIPS, 2004.
[13] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011.
[14] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. CVPR, 2011.
[15] D. Greene, M. Parnas, and F. Yao. Multi-index hashing for information retrieval. FOCS, 1994.
[16] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[17] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. STOC, 1998.
[18] H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 2011.
[19] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc thesis, Univ. Toronto, 2009.
[20] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. NIPS, 2009.
[21] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. ICCV, 2009.
[22] W. Liu, J. Wang, R. Ji, Y. Jiang, and S. Chang. Supervised hashing with kernels. CVPR, 2012.
[23] M. Muja and D. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISSAPP, 2009.
[24] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. ICML, 2011.
[25] M. Norouzi, A. Punjani, and D. Fleet. Fast search in Hamming space with multi-index hashing. CVPR, 2012.
[26] M. Raginsky and S. Lazebnik. Locality-sensitive binary codes from shift-invariant kernels. NIPS, 2009.
[27] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. MIT Press, 1986.
[28] R. Salakhutdinov and G. Hinton. Semantic hashing. Int. J. Approximate Reasoning, 2009.
[29] G. Shakhnarovich, P. A. Viola, and T. Darrell. Fast pose estimation with parameter-sensitive hashing. ICCV, 2003.
[30] S. Shalev-Shwartz, Y. Singer, and A. Ng. Online and batch learning of pseudo-metrics. ICML, 2004.
[31] P. Simard, D. Steinkraus, and J. Platt. Best practices for convolutional neural networks applied to visual document analysis. ICDAR, 2003.
[32] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. CVPR, 2008.
[33] J. Wang, S. Kumar, and S. Chang. Sequential projection learning for hashing with compact codes. ICML, 2010.
[34] K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. NIPS, 2006.
[35] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS, 2008.
[36] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. NIPS, 2002.
[37] C. N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. ICML, 2009.