{"title": "Learning to Hash with Binary Reconstructive Embeddings", "book": "Advances in Neural Information Processing Systems", "page_first": 1042, "page_last": 1050, "abstract": "Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques.", "full_text": "Learning to Hash with Binary Reconstructive\n\nEmbeddings\n\nBrian Kulis and Trevor Darrell\n\nUC Berkeley EECS and ICSI\n\nBerkeley, CA\n\n{kulis,trevor}@eecs.berkeley.edu\n\nAbstract\n\nFast retrieval methods are increasingly critical for many large-scale analysis tasks,\nand there have been several recent methods that attempt to learn hash functions for\nfast and accurate nearest neighbor searches. In this paper, we develop an algorithm\nfor learning hash functions based on explicitly minimizing the reconstruction error\nbetween the original distances and the Hamming distances of the corresponding\nbinary embeddings. We develop a scalable coordinate-descent algorithm for our\nproposed hashing objective that is able to ef\ufb01ciently learn hash functions in a va-\nriety of settings. 
Unlike existing methods such as semantic hashing and spectral\nhashing, our method is easily kernelized and does not require restrictive assump-\ntions about the underlying distribution of the data. We present results over sev-\neral domains to demonstrate that our method outperforms existing state-of-the-art\ntechniques.\n\n1\n\nIntroduction\n\nAlgorithms for fast indexing and search have become important for a variety of problems, particu-\nlarly in the domains of computer vision, text mining, and web databases. In cases where the amount\nof data is huge\u2014large image repositories, video sequences, and others\u2014having fast techniques for\n\ufb01nding nearest neighbors to a query is essential. At an abstract level, we may view hashing methods\nfor similarity search as mapping input data (which may be arbitrarily high-dimensional) to a low-\ndimensional binary (Hamming) space. Unlike standard dimensionality-reduction techniques from\nmachine learning, the fact that the embeddings are binary is critical to ensure fast retrieval times\u2014\none can perform ef\ufb01cient linear scans of the binary data to \ufb01nd the exact nearest neighbors in the\nHamming space, or one can use data structures for \ufb01nding approximate nearest neighbors in the\nHamming space which have running times that are sublinear in the number of total objects [1, 2].\nSince the Hamming distance between two objects can be computed via an xor operation and a bit\ncount, even a linear scan in the Hamming space for a nearest neighbor to a query in a database of\n100 million objects can currently be performed within a few seconds on a typical workstation. If the\ninput dimensionality is very high, hashing methods lead to enormous computational savings.\n\nIn order to be successful, hashing techniques must appropriately preserve distances when mapping to\nthe Hamming space. 
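The xor-and-bit-count trick mentioned above can be sketched as follows. This is a minimal illustration (not the paper's code): codes are packed into Python integers, so one Hamming distance is a single XOR plus a bit count, and a linear scan does only that per database item.

```python
# Minimal sketch (illustrative, not from the paper): Hamming distance between
# two packed binary codes via XOR and a bit count, and a linear scan that
# returns the nearest database code in Hamming space.

def hamming(code_a: int, code_b: int) -> int:
    """Number of differing bits between two packed binary codes."""
    return bin(code_a ^ code_b).count("1")

def nearest_in_hamming(query: int, database: list[int]) -> int:
    """Index of the database code closest to the query in Hamming space."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))
```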
One of the basic but most widely-employed methods, locality-sensitive hashing\n(LSH) [1, 2], generates embeddings via random projections and has been used for many large-scale\nsearch tasks. An advantage to this technique is that the random projections provably maintain the\ninput distances in the limit as the number of hash bits increases; at the same time, it has been\nobserved that the number of hash bits required may be large in some cases to faithfully maintain\nthe distances. On the other hand, several recent techniques\u2014most notably semantic hashing [3]\nand spectral hashing [4]\u2014attempt to overcome this problem by designing hashing techniques that\nleverage machine learning to \ufb01nd appropriate hash functions to optimize an underlying hashing\nobjective. Both methods have shown advantages over LSH in terms of the number of bits required\n\n1\n\n\fto \ufb01nd good approximate nearest neighbors. However, these methods cannot be directly applied in\nkernel space and have assumptions about the underlying distributions of the data. In particular, as\nnoted by the authors, spectral hashing assumes a uniform distribution over the data, a potentially\nrestrictive assumption in some cases.\n\nIn this paper, we introduce and analyze a simple objective for learning hash functions, develop an ef-\n\ufb01cient coordinate-descent algorithm, and demonstrate that the proposed approach leads to improved\nresults as compared to existing hashing techniques. The main idea is to construct hash functions\nthat explicitly preserve the input distances when mapping to the Hamming space. To achieve this,\nwe minimize a squared loss over the error between the input distances and the reconstructed Ham-\nming distances. By analyzing the reconstruction objective, we show how to ef\ufb01ciently and exactly\nminimize the objective function with respect to a single variable. 
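For reference, the random-projection hashing that LSH uses (as described above) can be sketched as follows; the array shapes, seed, and function name are illustrative assumptions, not the implementation used in the paper's experiments.

```python
# Sketch of random-hyperplane LSH: each of the b hash bits is the sign of a
# projection of the input onto an independent Gaussian direction.
import numpy as np

def lsh_hash(X, b, seed=0):
    """Map n x dim data X to n x b binary (0/1) codes via random projections."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], b))  # one Gaussian vector per bit
    return (X @ R >= 0).astype(np.uint8)
```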
If there are n training points, k\nnearest neighbors per point in the training data, and b bits in our desired hash table, our method ends\nup costing O(nb(k + log n)) time per iteration to update all hash functions, and provably reaches a\nlocal optimum of the reconstruction objective. In experiments, we compare against relevant existing\nhashing techniques on a variety of important vision data sets, and show that our method is able to\ncompete with or outperform state-of-the-art hashing algorithms on these data sets. We also apply\nour method on the very large Tiny Image data set of 80 million images [5], to qualitatively show\nsome example retrieval results obtained by our proposed method.\n\n1.1 Related Work\n\nMethods for fast nearest neighbor retrieval are generally broken down into two families. One group\npartitions the data space recursively, and includes algorithms such as k \u2212 d trees [6], M-trees [7],\ncover trees [8], metric trees [9], and other related techniques. These methods attempt to speed up\nnearest neighbor computation, but can degenerate to a linear scan in the worst case. Our focus in\nthis paper is on hashing-based methods, which map the data to a low-dimensional Hamming space.\nLocality-sensitive hashing [1, 2] is the most popular method, and extensions have been explored for\naccommodating distances such as \u2113p norms [10], learned metrics [11], and image kernels [12]. Algo-\nrithms based on LSH typically come with guarantees that the approximate nearest neighbors (neigh-\nbors within (1 + \u01eb) times the true nearest neighbor distance) may be found in time that is sublinear\nin the total number of database objects (but as a function of \u01eb). Unlike standard dimensionality-\nreduction techniques, the binary embeddings allow for extremely fast similarity search operations.\nSeveral recent methods have explored ways to improve upon the random projection techniques used\nin LSH. 
These include semantic hashing [3], spectral hashing [4], parameter-sensitive hashing [13], and boosting-based hashing methods [14].

2 Hashing Formulation

In the following section, we describe our proposed method, starting with the choice of parameterization for the hash functions and the objective function to minimize. We then develop a coordinate-descent algorithm used to minimize the objective function, and discuss extensions of the proposed approach.

2.1 Setup

Let our data set be represented by a set of n vectors, given by X = [x1 x2 ... xn]. We will assume that these vectors are normalized to have unit ℓ2 norm; this will make it easier to maintain the proper scale for comparing distances in the input space to distances in the Hamming space.¹ Let a kernel function over the data be denoted as κ(xi, xj). We use a kernel function as opposed to the standard inner product to emphasize that the algorithm can be expressed purely in kernel form.

We would like to project each data point to a low-dimensional binary space to take advantage of fast nearest neighbor routines. Suppose that the desired number of dimensions of the binary space is b; we will compute the b-dimensional binary embedding by projecting our data using a set of b hash functions h1, ..., hb. Each hash function hi is a binary-valued function, and our low-dimensional binary reconstruction can be represented as x̃i = [h1(xi); h2(xi); ...; hb(xi)]. Finally, denote d(xi, xj) = (1/2)||xi − xj||² and d̃(xi, xj) = (1/b)||x̃i − x̃j||². Notice that d and d̃ are always between 0 and 1.

¹Alternatively, we may scale the data appropriately by a constant so that the squared Euclidean distances (1/2)||xi − xj||² are in [0, 1].

2.2 Parameterization and Objective

In standard random hyperplane locality-sensitive hashing (e.g.
[1]), each hash function hp is generated independently by selecting a random vector rp from a multivariate Gaussian with zero mean and identity covariance. Then the hash function is given as hp(x) = sign(rp^T x). In contrast, we propose to generate a sequence of hash functions that are dependent on one another, in the same spirit as in spectral hashing (though with a different parameterization). We introduce a matrix W of size b × s, and we parameterize the hash functions h1, ..., hp, ..., hb as follows:

    hp(x) = sign( Σ_{q=1}^{s} Wpq κ(xpq, x) ).

Note that the data points xpq need not be the same for each hash function (that is, each hash function may utilize different sets of points). Similarly, the number of points s used for each hash function may change, though for simplicity we will present the case when s is the same for each function (and so we can represent all weights via the b × s matrix W). Though we are not aware of any existing methods that parameterize the hash functions in this way, this parameterization is natural for several reasons. It does not explicitly assume anything about the distribution of the data. It is expressed in kernelized form, meaning we can easily work over a variety of input data. Furthermore, the form of each hash function, the sign of a linear combination of kernel function values, is the same as in several kernel-based learning algorithms such as support vector machines.

Rather than simply choosing the matrix W based on random hyperplanes, we will specifically construct this matrix to achieve good reconstructions. In particular, we will look at the squared error between the original distances (using d) and the reconstructed distances (using d̃).
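The hash-function parameterization above can be sketched as follows. This is a hedged illustration: the RBF kernel and the anchor/weight names are assumptions for the example (the method works with any kernel), not the authors' code.

```python
# Sketch of the paper's parameterization: each bit is the sign of a weighted
# sum of kernel evaluations against s anchor points x_pq.
import numpy as np

def rbf(x, y, gamma=1.0):
    # Illustrative kernel choice; any kernel kappa can be substituted.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def hash_bit(x, anchors, weights, kernel=rbf):
    """h_p(x) = sign(sum_q W_pq * kappa(x_pq, x)), reported as a 0/1 bit."""
    activation = sum(w * kernel(a, x) for a, w in zip(anchors, weights))
    return 1 if activation >= 0 else 0

def embed(x, anchor_sets, W):
    """b-bit embedding: one hash function per row of weights in W."""
    return [hash_bit(x, A, w) for A, w in zip(anchor_sets, W)]
```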
We minimize the following objective with respect to the weight matrix W:

    O({xi}_{i=1}^{n}, W) = Σ_{(i,j)∈N} (d(xi, xj) − d̃(xi, xj))².   (1)

The set N is a selection of pairs of points, and can be chosen based on the application. Typically, we will choose this to be a set of pairs which includes both the nearest neighbors as well as other pairs from the database (see Section 3 for details). If we choose k pairs for each point, then the total size of N will be nk.

2.3 Coordinate-Descent Algorithm

The objective O given in (1) is highly non-convex in W, making optimization the main challenge in using the proposed objective for hashing. One of the most difficult issues is due to the fact that the reconstructions are binary; the objective is not continuous or differentiable, so it is not immediately clear how an effective algorithm would proceed. One approach is to replace the sign function by the sigmoid function, as is done with neural networks and logistic regression.² Then the objective O and gradient ∇O can both be computed in O(nkb) time. However, our experience with minimizing O with such an approach using a quasi-Newton L-BFGS algorithm typically resulted in poor local optima; we need an alternative method.

Instead of the continuous relaxation, we will consider fixing all but one weight Wpq, and optimize the original objective O with respect to Wpq. Surprisingly, we will show below that an exact, optimal update to this weight can be achieved in time O(n log n + nk). Such an approach will update a single hash function hp; then, by choosing a single weight to update for each hash function, we can update all hash functions in O(nb(k + log n)) time. In particular, if k = Ω(log n), then we can update all hash functions on the order of the time it takes to compute the objective function itself, making the updates particularly efficient.
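The objective in (1) can be computed directly. The sketch below is an assumption-laden illustration (array shapes and names are ours, not the authors'): X holds unit-norm inputs, H holds the corresponding 0/1 codes, and the pair set N is a list of index pairs.

```python
# Sketch of objective (1): squared error between input distances d and
# normalized Hamming distances d_tilde over the chosen pair set N.
import numpy as np

def bre_objective(X, H, pairs):
    """X: n x dim inputs; H: n x b binary (0/1) codes; pairs: iterable of (i, j)."""
    b = H.shape[1]
    total = 0.0
    for i, j in pairs:
        d = 0.5 * np.sum((X[i] - X[j]) ** 2)       # in [0, 1] for unit-norm rows
        d_tilde = np.sum((H[i] - H[j]) ** 2) / b   # normalized Hamming distance
        total += (d - d_tilde) ** 2
    return total
```

With k pairs per point this loop visits nk pairs, matching the O(nk) cost quoted for evaluating the objective.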
We will also show that this method provably converges to a local optimum of the objective function O.

²The sigmoid function is defined as s(x) = 1/(1 + e^(−x)), and its derivative is s′(x) = s(x)(1 − s(x)).

We sketch out the details of our coordinate-descent scheme below. We begin with a simple lemma characterizing how the objective function changes when we update a single hash function.

Lemma 1. Let D̄ij = d(xi, xj) − d̃(xi, xj). Consider updating some hash function hold to hnew (where d̃ uses hold), and let ho and hn be the n × 1 vectors obtained by applying the old and new hash functions to each data point, respectively. Then the objective function O from (1) after updating the hash function can be expressed as

    O = Σ_{(i,j)∈N} ( D̄ij + (1/b)(ho(i) − ho(j))² − (1/b)(hn(i) − hn(j))² )².

Proof. For notational convenience in this proof, let D̃old and D̃new be the matrices of reconstructed distances using hold and hnew, respectively, and let Hold and Hnew be the n × b matrices of old and new hash bits, respectively. Also, let et be the t-th standard basis vector and e be a vector of all ones. Note that Hnew = Hold + (hn − ho)et^T, where t is the index of the hash function being updated. We can express D̃old as

    D̃old = (1/b)( ℓold e^T + e ℓold^T − 2 Hold Hold^T ),

where ℓold is the vector of squared norms of the rows of Hold. Note that the corresponding vector of squared norms of the rows of Hnew may be expressed as ℓnew = ℓold − ho + hn since the hash vectors are binary-valued.
Therefore we may write

    D̃new = (1/b)( (ℓold + hn − ho)e^T + e(ℓold + hn − ho)^T − 2(Hold + (hn − ho)et^T)(Hold + (hn − ho)et^T)^T )
         = D̃old + (1/b)( (hn − ho)e^T + e(hn − ho)^T − 2(hn hn^T − ho ho^T) )
         = D̃old − (1/b)( (ho e^T + e ho^T − 2 ho ho^T) − (hn e^T + e hn^T − 2 hn hn^T) ),

where we have used the fact that Hold et = ho. We can then write the objective using D̃new to obtain

    O = Σ_{(i,j)∈N} ( D̄ij + (1/b)(ho(i) + ho(j) − 2ho(i)ho(j)) − (1/b)(hn(i) + hn(j) − 2hn(i)hn(j)) )²
      = Σ_{(i,j)∈N} ( D̄ij + (1/b)(ho(i) − ho(j))² − (1/b)(hn(i) − hn(j))² )²,

since ho(i)² = ho(i) and hn(i)² = hn(i). This completes the proof.

The lemma above demonstrates that, when updating a hash function, the new objective function can be computed in O(nk) time, assuming that we have computed and stored the values of D̄ij. Next we show that we can compute an optimal weight update in time O(nk + n log n).

Consider choosing some hash function hp, and choose one weight index q, i.e. fix all entries of W except Wpq, which corresponds to the one weight updated during this iteration of coordinate descent. Modifying the value of Wpq results in updating hp to a new hashing function hnew. Now, for every point x, there is a hashing threshold: a value Ŵpq such that, if Wpq were set to Ŵpq with all other weights held fixed,

    Σ_{q'=1}^{s} Wpq' κ(xpq', x) = 0.

Observe that, if cx = Σ_{q'=1}^{s} Wpq' κ(xpq', x) is the current activation, then the threshold tx is given by

    tx = Wpq − cx / κ(xpq, x).

We first compute the thresholds for all n data points: once we have the values of cx for all x, computing tx for all points requires O(n) time.
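The threshold computation above can be sketched as follows; array names are illustrative assumptions, and κ(xpq, x) is assumed nonzero for every point (points with a zero kernel value never flip and can be skipped).

```python
# Sketch: with every weight but W_pq fixed, point x's bit flips where W_pq
# crosses t_x = W_pq - c_x / kappa(x_pq, x); sorting the thresholds yields
# the n + 1 candidate intervals scanned by the coordinate-descent update.
import numpy as np

def flip_thresholds(w_pq, activations, kappa_q):
    """activations: current c_x for all n points; kappa_q: kappa(x_pq, x)
    values, assumed nonzero. Returns the n flip thresholds t_x."""
    return w_pq - activations / kappa_q

def interval_boundaries(thresholds):
    """Sorted thresholds; consecutive entries bound the n + 1 intervals over
    which the new hash vector is constant (one bit flips at each boundary)."""
    return np.sort(thresholds)
```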
Since we are updating a single Wpq per iteration, we can update the values of cx in O(n) time after updating Wpq, so the total time to compute all thresholds tx is O(n).

Next, we sort the thresholds in increasing order, which defines a set of n + 1 intervals (interval 0 is the interval of values smaller than the first threshold, interval 1 is the interval of points between the first and the second threshold, and so on). Observe that, for any fixed interval, the new computed hash function hnew does not change over the entire interval. Furthermore, observe that as we cross from one threshold to the next, a single bit of the corresponding hash vector flips. As a result, we need only compute the objective function at each of the n + 1 intervals, and choose the interval that minimizes the objective function. We choose a value Wpq within that interval (which will be optimal) and update the hash function using this new choice of weight. The following result shows that we can choose the appropriate interval in time O(nk). When we add the cost of sorting the thresholds, the total cost of an update to a single weight Wpq is O(nk + n log n).

Lemma 2. Consider updating a single hash function. Suppose we have a sequence of hash vectors h_{t0}, ..., h_{tn} such that h_{t(j−1)} and h_{tj} differ by a single bit for 1 ≤ j ≤ n. Then the objective functions for all n + 1 hash functions can be computed in O(nk) time.

Proof. The objective function may be computed in O(nk) time for the hash function h_{t0} corresponding to the smallest interval. Consider the case when going from ho = h_{t(j−1)} to hn = h_{tj} for some 1 ≤ j ≤ n. Let the index of the bit that changes in hn be a. The only terms of the sum in the objective that change are ones of the form (a, j) ∈ N and (i, a) ∈ N. Let fa = 1 if ho(a) = 0 and hn(a) = 1, and fa = −1 otherwise.
Then we can simplify (hn(i) − hn(j))² − (ho(i) − ho(j))² to fa(1 − 2hn(j)) when a = i and to fa(1 − 2hn(i)) when a = j (the expression is zero when i = j and will not contribute to the objective). Therefore the relevant terms in the objective function as given in Lemma 1 may be written as

    Σ_{(a,j)∈N} ( D̄aj − (fa/b)(1 − 2hn(j)) )² + Σ_{(i,a)∈N} ( D̄ia − (fa/b)(1 − 2hn(i)) )².

As there are k nearest neighbors, the first sum will have k elements and can be computed in O(k) time. The second summation may have more or fewer than k terms, but across all data points there will be k terms on average. Furthermore, we must update D̄ as we progress through the hash functions, which can also be straightforwardly done in O(k) time on average. Completing this process over all n + 1 hash functions results in a total of O(nk) time.

Putting everything together, we have shown the following result:

Theorem 3. Fix all but one entry Wpq of the hashing weight matrix W. An optimal update to Wpq to minimize (1) may be computed in O(nk + n log n) time.

Our overall strategy successively cycles through the hash functions one by one, randomly selects a weight to update for each hash function, and computes the optimal updates for those weights. It then repeats this process until reaching local convergence. One full iteration to update all hash functions requires time O(nb(k + log n)). Note that local convergence is guaranteed in a finite number of updates, since each update never increases the objective function value and there are only finitely many possible hash configurations.

2.4 Extensions

The method described in the previous section may be enhanced in various ways. For instance, the algorithm we developed is completely unsupervised.
One could easily extend the method to\na supervised one, which would be useful for example in large-scale k-NN classi\ufb01cation tasks. In\nthis scenario, one would additionally receive a set of similar and dissimilar pairs of points based on\n\n5\n\n\fclass labels or other background knowledge. For all similar pairs, one could set the target original\ndistance to be zero, and for all dissimilar pairs, one could set the target original distance to be large\n(say, 1).\n\nOne may also consider loss functions other than the quadratic loss considered in this paper. Another\noption would be to use an \u21131-type loss, which would not penalize outliers as severely. Additionally,\none may want to introduce regularization, especially for the supervised case. For example, the\naddition of an \u21131 regularization over the entries of W could lead to sparse hash functions, and may\nbe worth additional study.\n\n3 Experiments\n\nWe now present results comparing our proposed approach to the relevant existing methods\u2014locality\nsensitive hashing, semantic hashing (RBM), and spectral hashing. We also compared against the\nBoosting SSC algorithm [14] but were unable to \ufb01nd parameters to yield competitive performance,\nand so we do not present those results here. We implemented our binary reconstructive embedding\nmethod (BRE) and LSH, and used the same code for spectral hashing and RBM that was employed\nin [4]. We further present some qualitative results over the Tiny Image data set to show example\nretrieval results obtained by our method.\n\n3.1 Data Sets and Methodology\n\nWe applied the hashing algorithms to a number of important large-scale data sets from the com-\nputer vision community. 
Our vision data sets include: the Photo Tourism data [15], a collection of approximately 300,000 image patches, processed using SIFT to form 128-dimensional vectors; the Caltech-101 [16], a standard benchmark for object recognition in the vision community; and LabelMe and Peekaboom [17], two image data sets on top of which global Gist descriptors have been extracted. We also applied our method to MNIST, the standard handwritten digits data set, and Nursery, one of the larger UCI data sets.

We mean-centered the data and normalized the feature vectors to have unit norm. Following the suggestion in [4], we apply PCA (or kernel PCA in the case of kernelized data) to the input data before applying spectral hashing or BRE; the results of the RBM method and LSH were better without applying PCA, so PCA is not applied for these algorithms. For all data sets, we trained the methods using 1000 randomly selected data points. For training the BRE method, we select nearest neighbors using the top 5th percentile of the training distances and set the target distances to 0; we found that this ensures that the nearest neighbors in the embedded space will have Hamming distance very close to 0. We also choose farthest neighbors using the 98th percentile of the training distances and maintain their original distances as target distances. Having both near and far neighbors improves performance for BRE, as it prevents a trivial solution where all the database objects are given the same hash key. The spectral hashing and RBM parameters are set as in [4, 17]. After constructing the hash functions for each method, we randomly generate 3000 hashing queries (except for Caltech-101, which has fewer than 4000 data points; in this case we choose the remainder of the data as queries).

We follow the evaluation scheme developed in [4].
We collect training/test pairs such that the unnormalized Hamming distance using the constructed hash functions is less than or equal to three. We then compute the percentage of these pairs that are nearest neighbors in the original data space, which are defined as pairs of points from the training set whose distances are in the top 5th percentile. This percentage is plotted as the number of bits increases. Once the number of bits is sufficiently high (e.g. 50), one would expect that distances with a Hamming distance less than or equal to three would correspond to nearest neighbors in the original data embedding.

3.2 Quantitative Results

In Figure 1, we plot hashing retrieval results over each of the data sets. We can see that the BRE method performs comparably to or outperforms the other methods on all data sets. Observe that both RBM and spectral hashing underperform all other methods on at least one data set.

[Figure 1: six panels (Photo Tourism, Caltech-101, LabelMe, Peekaboom, MNIST, Nursery), each plotting the proportion of good neighbors with Hamming distance <= 3 against the number of bits (10 to 50) for BRE, spectral hashing, RBM, and LSH.]

Figure 1: Results over Photo Tourism, Caltech-101, LabelMe, Peekaboom, MNIST, and Nursery. The plots show how well the nearest neighbors in the Hamming space (pairs of data points with unnormalized Hamming distance less than or equal to 3) correspond to the nearest neighbors (top 5th percentile of distances) in the original dataset. Overall, our method outperforms, or performs comparably to, existing methods. See text for further details.

On some data sets, RBM appears to require significantly more than 1000 training images to achieve good performance, and in these cases the training time is substantially higher than the other methods.

One surprising outcome of these results is that LSH performs well in comparison to the other existing methods (and outperforms some of them for some data sets); this stands in contrast to the results of [4], where LSH showed significantly poorer performance (we also evaluated our LSH implementation using the same training/test split as in [4] and found similar results). The better
The better\nperformance in our tests may be due to our implementation of LSH; we use Charikar\u2019s random\nprojection method [1] to construct hash tables.\n\nIn terms of training time, the BRE method typically converges in 50\u2013100 iterations of updating\nall hash functions, and takes 1\u20135 minutes to train per data set on our machines (depending on the\nnumber of bits requested). Relatively speaking, the time required for training is typically faster than\nRBM but slower than spectral hashing and LSH. Search times in the binary space are uniform across\neach of the methods and our timing results are similar to those reported previously (see, e.g. [17]).\n\n3.3 Qualitative Results\n\nFinally, we present qualitative results on the large Tiny Image data set [5] to demonstrate our method\napplied to a very large database. This data set contains 80 million images, and is one of the largest\nreadily available data sets for content-based image retrieval. Each image is stored as 32 \u00d7 32 pixels,\nand we employ the global Gist descriptors that have been extracted for each image.\n\nWe ran our reconstructive hashing algorithm on the Gist descriptors for the Tiny Image data set\nusing 50 bits, with 1000 training images used to construct the hash functions as before. We selected\na random set of queries from the database and compared the results of a linear scan over the Gist\nfeatures with the hashing results over the Gist features. When obtaining hashing results, we collected\nthe nearest neighbors in the Hamming space to the query (the top 0.01% of the Hamming distances),\nand then sorted these by their distance in the original Gist space. Some example results are displayed\nin Figure 2; we see that, with 50 bits, we can obtain very good results that are qualitatively similar\nto the results of the linear scan.\n\n7\n\n\fFigure 2: Qualitative results over the 80 million images in the Tiny Image database [5]. 
For each\ngroup of images, the top left image is the query, the top row corresponds to a linear scan, and the\nsecond row corresponds to the hashing retrieval results using 50 hash bits. The hashing results are\nsimilar to the linear scan results but are signi\ufb01cantly faster to obtain.\n\n4 Conclusion and Future Work\n\nIn this paper, we presented a method for learning hash functions, developed an ef\ufb01cient coordinate-\ndescent algorithm for \ufb01nding a local optimum, and demonstrated improved performance on several\nbenchmark vision data sets as compared to existing state-of-the-art hashing algorithms. One avenue\nfor future work is to explore alternate methods of optimization; our approach, while simple and fast,\nmay fall into poor local optima in some cases. Second, we would like to explore the use of our\nalgorithm in the supervised setting for large-scale k-NN tasks.\n\nAcknowledgments\n\nThis work was supported in part by DARPA, Google, and NSF grants IIS-0905647 and IIS-0819984.\nWe thank Rob Fergus for the spectral hashing and RBM code, and Greg Shakhnarovich for the\nBoosting SSC code.\n\nReferences\n[1] M. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In STOC, 2002.\n[2] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimension-\n\nality. In STOC, 1998.\n\n[3] R. R. Salakhutdinov and G. E. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbour-\n\nhood Structure. In AISTATS, 2007.\n\n[4] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. In NIPS, 2008.\n[5] A. Torralba, R. Fergus, and W. T. Freeman. 80 Million Tiny Images: A Large Dataset for Non-parametric\n\nObject and Scene Recognition. TPAMI, 30(11):1958\u20131970, 2008.\n\n8\n\n\f[6] J. Freidman, J. Bentley, and A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected\n\nTime. ACM Transactions on Mathematical Software, 3(3):209\u2013226, September 1977.\n\n[7] P. Ciaccia, M. Patella, and P. 
Zezula. M-tree: An Ef\ufb01cient Access Method for Similarity Search in Metric\n\nSpaces. In VLDB, 1997.\n\n[8] A. Beygelzimer, S. Kakade, and J. Langford. Cover Trees for Nearest Neighbor. In ICML, 2006.\n[9] J. Uhlmann. Satisfying General Proximity / Similarity Queries with Metric Trees. Information Processing\n\nLetters, 40:175\u2013179, 1991.\n\n[10] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-Sensitive Hashing Scheme Based on p-Stable\n\nDistributions. In SOCG, 2004.\n\n[11] P. Jain, B. Kulis, and K. Grauman. Fast Image Search for Learned Metrics. In CVPR, 2008.\n[12] K. Grauman and T. Darrell. Pyramid Match Hashing: Sub-Linear Time Indexing Over Partial Correspon-\n\ndences. In CVPR, 2007.\n\n[13] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter-Sensitive Hashing. In\n\nICCV, 2003.\n\n[14] G. Shakhnarovich. Learning Task-speci\ufb01c Similarity. PhD thesis, MIT, 2006.\n[15] N. Snavely, S. Seitz, and R. Szeliski. Photo Tourism: Exploring Photo Collections in 3D. In SIGGRAPH\n\nConference Proceedings, pages 835\u2013846, New York, NY, USA, 2006. ACM Press.\n\n[16] L. Fei-Fei, R. Fergus, and P. Perona. Learning Generative Visual Models from Few Training Examples:\nan Incremental Bayesian Approach Tested on 101 Object Categories. In Workshop on Generative Model\nBased Vision, Washington, D.C., June 2004.\n\n[17] A. Torralba, R. Fergus, and Y. Weiss. Small Codes and Large Databases for Recognition. In CVPR, 2008.\n\n9\n\n\f", "award": [], "sourceid": 971, "authors": [{"given_name": "Brian", "family_name": "Kulis", "institution": null}, {"given_name": "Trevor", "family_name": "Darrell", "institution": null}]}