{"title": "Learning to Search Efficiently in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 1710, "page_last": 1718, "abstract": "High dimensional similarity search in large scale databases becomes an important challenge due to the advent of Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm. Hence learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms the start-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).", "full_text": "Learning to Search Ef\ufb01ciently in High Dimensions\n\nZhen Li \u2217\n\nUIUC\n\nHuazhong Ning\n\nGoogle Inc.\n\nLiangliang Cao\n\nIBM T.J. 
Watson Research Center
zhenli3@uiuc.edu   huazhong@google.com   liangliang.cao@us.ibm.com

Tong Zhang   Rutgers University
Yihong Gong   NEC China
Thomas S. Huang *   UIUC
tzhang@stat.rutgers.edu   ygongca@gmail.com   huang@ifp.uiuc.edu

Abstract

High dimensional similarity search in large scale databases has become an important challenge with the advent of the Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches have relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature have mainly focused on learning hash codes optimized for compact embedding of the data rather than for search efficiency. Consequently, such an embedding has to be used with linear scan or another search algorithm, so learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. 
Experimental results show that our approach significantly outperforms state-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).

1 Introduction

The design of efficient algorithms for large scale similarity search (such as nearest neighbor search) has been a central problem in computer science. This problem becomes increasingly challenging in modern applications because the scale of modern databases has grown substantially and many of them are composed of high dimensional data. This means that classical algorithms such as kd-trees are no longer suitable [25] and new algorithms have to be designed to handle high dimensionality. However, existing approaches for large scale search in high dimensions have relied mainly on algorithmic constructions that are either data independent or weakly dependent. Motivated by the success of machine learning in the design of ranking functions for information retrieval (the learning to rank problem [13, 9]) and the design of compact embeddings into binary codes (the learning to hash problem [10]), it is natural to ask whether we can use machine learning (in particular, supervised learning) to optimize data structures that can improve search efficiency. We call this problem learning to search, and this paper demonstrates that supervised learning can lead to improved search efficiency over algorithms that are not optimized using supervised information.

*These authors were sponsored in part by the U.S. National Science Foundation under grant IIS-1049332 EAGER and by the Beckman Seed Grant.

To leverage machine learning techniques, we need to consider a scalable search structure with parameters optimizable using labeled data. 
The data structure considered in this paper is motivated by the success of the vocabulary tree method in image retrieval [18, 27, 15], which has been adopted in modern image search engines to find near duplicate images. Although the original proposal was based on a "bag of local patches" image representation, this paper considers a general setting where each database item is represented as a high dimensional vector. Recent advances in computer vision show that it is desirable to represent images as numerical vectors of as many as thousands or even millions of dimensions [12, 28]. We can easily adapt the vocabulary tree to this setting: we partition the high dimensional space into disjoint regions using hierarchical k-means, and regard them as the "vocabulary". This representation can then be integrated into an inverted index based text search engine for efficient large scale retrieval. In this paper, we refer to this approach as k-means trees because the underlying algorithm is the same as in [5, 16]. Note that k-means trees can be used for high dimensional data, while the classical kd-trees [1, 3, 22] are limited to dimensions of no more than a few hundred.

In this paper, we also adopt the tree-structured representation, and propose a learning algorithm to construct the trees using supervised data. It is worth noting that the k-means trees approach suffers from several drawbacks that can be addressed in our approach. First, k-means trees only use an unsupervised clustering algorithm, which is not optimized for search purposes; as we will show in the experiments, by employing supervised information, our learning to search approach can achieve significantly better performance. 
Second, the underlying k-means clustering limits the k-means tree approach to Euclidean similarity measures (though it could possibly be extended to Bregman distances), while our approach can be easily applied to more general metrics (including semantic ones) that prove effective in many scenarios [8, 11, 7]. Nevertheless, our experiments still focus on Euclidean distance search, in order to demonstrate the advantage over k-means trees.

The learning to search framework proposed in this paper is based on a formulation of search as a supervised learning problem that jointly optimizes two key factors of search: retrieval quality and computational cost. Specifically, we learn a set of selection functions in the form of a tree ensemble, as motivated by the aforementioned kd-trees and k-means trees approaches. However, unlike the traditional methods that are based only on unsupervised information, our trees are learned under the supervision of pairwise similarity information, and are optimized for the defined search criteria, i.e., to maximize the retrieval quality while keeping the computational cost low. In order to form the forest, boosting is employed to learn the trees sequentially. We call this particular method Boosted Search Forest (BSF).

It is worth comparing the influential Locality Sensitive Hashing (LSH) [6, 2] approach with our learning to search approach. The idea of LSH is to employ random projections to approximate the Euclidean distance of the original features. An inverted index structure can be constructed based on the hashing results [6], which facilitates efficient search. However, the LSH algorithm is completely data independent (using random projections), and thus the data structure is constructed without any learning. 
While interesting theoretical results can be obtained for LSH, as we shall see in the experiments, in practice its performance is inferior to the data-dependent search structures optimized via the learning to search approach of this paper.

Another closely related problem is learning to hash, which includes BoostSSC [20], Spectral Hashing [26], Restricted Boltzmann Machines [19], Semi-Supervised Hashing [24], Hashing with Graphs [14], etc. However, the motivation of the hashing problem is fundamentally different from that of the search problem considered in this paper. Specifically, the goal of learning to hash is to embed data into compact binary codes so that the Hamming distance between two codes reflects their original similarity. In order to perform efficient Hamming distance search using the embedded representation, an additional efficient algorithmic structure is still needed. (How to come up with such an efficient algorithm is an issue usually ignored by learning to hash algorithms.) The compact hash codes were traditionally believed to achieve low search latency by employing either linear scan, hash table lookup, or more sophisticated search mechanisms. As we shall see in our experiments, however, linear scan in the Hamming space is not a feasible solution for large scale search problems. Moreover, if another search data structure is implemented on top of the hash codes, the optimality of the embedding is likely to be lost, which usually yields a suboptimal solution inferior to directly optimizing a search criterion.

2 Background

Given a database X = {x_1, . . . , x_n} and a query q, the search problem is to return the top ranked items from the database that are most similar to the query. Let s(q, x) ≥ 0 be a ranking function that measures the similarity between q and x. In large-scale search applications, the database size n can be billions or larger. 
Explicitly evaluating the ranking function s(q, x) against all samples is very expensive. On the other hand, in order to achieve accurate search results, a complicated ranking function s(q, x) is indispensable.
Modern search engines handle this problem by first employing a non-negative selection function T(q, x) that selects a small set of candidates X_q = {x : T(q, x) > 0, x ∈ X} containing most of the top ranked items (T(q, x) = 0 means "not selected"). This is called the candidate selection stage, which is followed by a reranking stage where a more costly ranking function s(q, x) is evaluated on X_q.
Two properties of the selection function T(q, x) are: 1) It must be evaluated much more efficiently than the ranking function s(q, x). In particular, for a given query, the complexity of evaluating T(q, x) over the entire dataset should be sublinear or even constant, which is usually made possible by dedicated data structures such as inverted index tables. 2) The selection function is an approximation to s(q, x). In other words, with high probability, the more similar q and x are, the more likely x is contained in X_q (which means T(q, x) should take a larger value).
This paper focuses on the candidate selection stage, i.e., learning the selection function T(q, x). In order to achieve both effectiveness and efficiency, three aspects need to be taken into account:

• X_q can be efficiently obtained (this is ensured by the properties of the selection function).
• The size of X_q should be small, since it determines the computational cost of reranking.
• The retrieval quality of X_q, measured by the total similarity \sum_{x \in X_q} s(q, x), should be large.

Therefore, our objective is to retrieve a set of items that maximizes the ranking quality while lowering the computational cost (keeping the candidate set as small as possible). 
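The two-stage pipeline described above can be sketched as follows. The data, the threshold, and the single-random-projection stand-in for T(q, x) are all hypothetical toy choices (the learned selection function of Section 3 would replace them); the sketch only illustrates selecting candidates cheaply and reranking only the candidates:

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 16))      # n items, d dimensions (toy data)
query = rng.normal(size=16)

def selection_function(q, X, threshold=1.2):
    """Cheap stand-in for T(q, x): screen by a single random projection."""
    p = rng.normal(size=X.shape[1])
    return (np.abs(X @ p - q @ p) < threshold).astype(float)

def ranking_function(q, X):
    """Costly s(q, x): here, exact (negative) Euclidean distance."""
    return -np.linalg.norm(X - q, axis=1)

# Candidate selection stage: X_q = {x : T(q, x) > 0}
mask = selection_function(query, database) > 0
candidates = np.flatnonzero(mask)

# Reranking stage: evaluate the expensive s(q, x) only on X_q
scores = ranking_function(query, database[candidates])
top = candidates[np.argsort(-scores)[:10]]
```

The point of the sketch is the asymmetry: the ranking function is evaluated on |X_q| items rather than on all n.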
In addition, to achieve search efficiency, the selection stage employs an inverted index structure, as in standard text search engines, to handle web-scale datasets.

3 Learning to Search

This section presents the proposed learning to search framework. We present the general formulation first, followed by a specific algorithm based on boosted search trees.

3.1 Problem Formulation

As stated in Section 2, the set of candidates returned for a query q is given by X_q = {x ∈ X : T(q, x) > 0}. Intuitively, the quality of this candidate set can be measured by the overall similarities, while the reranking cost is linear in |X_q|. Mathematically, we define:

Retrieval Quality:   Q(T) = \sum_q \sum_{x \in X} s(q, x) \, 1(T(q, x) > 0)   (1)

Computational Cost:   C(T) = \sum_q \sum_{x \in X} 1(T(q, x) > 0)   (2)

where 1(·) is the indicator function.
The learning to search framework considers the search problem as a machine learning problem that finds the optimal selection function T as follows:

max_T Q(T)   subject to C(T) ≤ C_0,   (3)

where C_0 is the upper bound on the computational cost. Alternatively, we can rewrite the optimization problem in (3) by applying a Lagrange multiplier:

max_T Q(T) − λ C(T),   (4)

where λ is a tuning parameter that balances the retrieval quality and computational cost.

To simplify the learning process, we assume that the queries are randomly drawn from the database. Let x_i and x_j be two arbitrary samples in the dataset, and let s_ij = s(x_i, x_j) ∈ {1, 0} indicate whether they are "similar" or "dissimilar". 
The problem in (4) then becomes:

J(T) = max_T \sum_{i,j} s_ij \, 1(T(x_i, x_j) > 0) − λ \sum_{i,j} 1(T(x_i, x_j) > 0)
     = max_T \sum_{i,j} z_ij \, 1(T(x_i, x_j) > 0),   (5)

where

z_ij = 1 − λ for similar pairs, and z_ij = −λ for dissimilar pairs.   (6)

3.2 Learning Ensemble Selection Function via Boosting

Note that (5) is nonconvex in T and thus is difficult to optimize. Inspired by AdaBoost [4], we employ the standard trick of using a convex relaxation, and in particular, we consider the exponential loss as a convex surrogate:

min_T L(T) = \sum_{i,j} e^{−z_ij T(x_i, x_j)} = E[e^{−z T(x_i, x_j)}].   (7)

Here we replace the summation over all (x_i, x_j) ∈ X × X by the expectation over two i.i.d. random variables x_i and x_j. We also drop the subscripts of z_ij and regard z as a random variable conditioned on x_i and x_j.
We define the ensemble selection function as a weighted sum of a set of base selection functions:

T(x_i, x_j) = \sum_{m=1}^{M} c_m · t_m(x_i, x_j).   (8)

Suppose we have learned M base functions, and we are about to learn the (M + 1)-th selection function, denoted as t(x_i, x_j), with weight given by c. The updated loss function is hence given by

min_t L(t, c) = E[e^{−z [T(x_i, x_j) + c t(x_i, x_j)]}] = E_w[e^{−c z t(x_i, x_j)}],   (9)

where E_w[·] denotes the weighted expectation with weights given by

w_ij = w(x_i, x_j) = e^{−z_ij T(x_i, x_j)} = e^{−(1−λ) T(x_i, x_j)} for similar pairs, and e^{λ T(x_i, x_j)} for dissimilar pairs.   (10)

This reweighting scheme leads to the boosting algorithm in Algorithm 1.
In many application scenarios, each base selection function t(x_i, x_j) takes only binary values 1 or 0. 
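The objective (5) with the weights (6) can be made concrete on a tiny toy example (the four items and λ = 0.3 are arbitrary choices for illustration only):

```python
import numpy as np

lam = 0.3                                  # trade-off parameter λ (arbitrary)
# Toy pairwise labels s_ij: items {0,1} similar, items {2,3} similar
S = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
# Eq. (6): z_ij = 1 - λ for similar pairs, -λ for dissimilar pairs
Z = np.where(S == 1, 1 - lam, -lam)

# A selection function that happens to select exactly the similar pairs
T = S.astype(float)
J = np.sum(Z * (T > 0))                    # Eq. (5)

# Equivalently Q(T) - λ·C(T): 8 similar pairs selected at a cost of 8
Q = np.sum(S * (T > 0))
C = np.sum(T > 0)
```

Here J = 8·(1 − λ) = 5.6, and the two forms of the objective agree, since every selected pair contributes either 1 − λ (similar) or −λ (dissimilar).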
Thus, we may want to minimize L(t, c) by choosing the optimal value of t(x_i, x_j) for any given pair (x_i, x_j).

Case 1: t(x_i, x_j) = 0

L(t, c) = E_w[e^{−0}] = 1.   (11)

Case 2: t(x_i, x_j) = 1

L(t, c) = E_w[e^{−zc}] = e^{−(1−λ)c} · P_w[s_ij = 1 | x_i, x_j] + e^{λc} · P_w[s_ij = 0 | x_i, x_j].   (12)

Comparing the two cases leads to:

t*(x_i, x_j) = 1 if P_w[s_ij = 1 | x_i, x_j] > (1 − e^{−λc}) / (1 − e^{−c}), and 0 otherwise.   (13)

To find the optimal c, we first decompose L in the following way:

L(t, c) = E_w[e^{−c z t(x_i, x_j)}]
        = P_w[t(x_i, x_j) = 0 | x_i, x_j] + e^{−c(1−λ)} · P_w[t(x_i, x_j) = 1, s_ij = 1 | x_i, x_j] + e^{cλ} · P_w[t(x_i, x_j) = 1, s_ij = 0 | x_i, x_j].   (14)

Taking the derivative of L with respect to c and setting it to zero, we arrive at the optimal solution for c:

c* = log [ (1 − λ) P_w[t(x_i, x_j) = 1, s_ij = 1 | x_i, x_j] / ( λ P_w[t(x_i, x_j) = 1, s_ij = 0 | x_i, x_j] ) ].   (15)

Algorithm 1 Boosted Selection Function Learning
Input: A set of data points X; pairwise similarities s_ij ∈ {0, 1}; initial weights w_ij = 1
1: for m = 1, 2, · · · , M do
2:    Learn a base selection function t_m(x_i, x_j) based on weights w_ij
3:    Update ensemble: T(x_i, x_j) ← T(x_i, x_j) + c_m · t_m(x_i, x_j)
4:    Update weights: w_ij ← w_ij · e^{−c_m z_ij t_m(x_i, x_j)}
5: end for

3.3 Tree Implementation of Base Selection Function

Simultaneously solving (13) and (15) leads to the optimal solutions at each iteration of boosting. In practice, however, the optimality can hardly be achieved. This is particularly because the binary-valued base selection functions t(x_i, x_j) have to be selected from limited function families to ensure learnability (finite model complexity) and, more importantly, efficiency. As mentioned in Section 2, evaluating t(q, x) for all x ∈ X needs to be accomplished in sublinear or constant time when a query q comes. 
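A minimal sketch of Algorithm 1 with the optimal weight from (15): the labels and the tiny pool of candidate leaf partitions (standing in for learned search trees) are hypothetical, and the base learner is a simple argmax over that pool rather than the tree construction of Section 3.3:

```python
import numpy as np

lam = 0.3
# Toy labels (hypothetical): items {0,1} similar, items {2,3} similar
s = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
z = np.where(s == 1, 1 - lam, -lam)
w = np.ones((4, 4))                      # initial pair weights
T = np.zeros((4, 4))                     # ensemble selection function

# A tiny pool of candidate leaf partitions standing in for search trees
partitions = [np.array([0, 0, 1, 1]),
              np.array([0, 1, 0, 1]),
              np.array([0, 1, 1, 0])]

def pair_matrix(leaf):
    # t(x_i, x_j) = 1 iff both items fall into the same leaf
    return (leaf[:, None] == leaf[None, :]).astype(int)

for m in range(2):                       # M = 2 boosting rounds
    # Base learner: pick the partition maximizing the weighted objective
    t = max((pair_matrix(p) for p in partitions),
            key=lambda t: np.sum(w * z * t))
    num = (1 - lam) * np.sum(w * t * (s == 1))
    den = lam * np.sum(w * t * (s == 0))
    c = np.log(num / den)                # optimal weight c*, cf. Eq. (15)
    T = T + c * t
    w = w * np.exp(-c * z * t)           # reweighting as in Algorithm 1
```

After the first round the reweighting down-weights correctly selected similar pairs and up-weights the selected dissimilar ones, so later rounds focus on the pairs the current ensemble handles poorly.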
This suggests using an inverted table data structure as an efficient implementation of the selection function. Specifically, t(x_i, x_j) = 1 if x_i and x_j get hashed into the same bucket of the inverted table, and 0 otherwise. This paper considers trees (which we name "search trees") as an approximation to the optimal selection functions, and quick inverted table lookup follows naturally.

A natural consideration for the tree construction is that the tree must be balanced. However, we do not need to explicitly enforce this constraint: balance is automatically favored by the term C in (4), as balanced trees give the minimum computational cost. In this sense, unlike other methods that explicitly enforce a balancing constraint, we relax it while jointly optimizing the retrieval quality and computational cost.
Consider a search tree with L leaf nodes {ℓ_1, · · · , ℓ_L}. The selection function given by this tree is defined as

t(x_i, x_j) = \sum_{k=1}^{L} t(x_i, x_j; ℓ_k),   (16)

where t(x_i, x_j; ℓ_k) ∈ {0, 1} indicates whether both x_i and x_j reach the same leaf node ℓ_k. Similar to (5), the objective function for a search tree can be written as:

max_t J = max_t \sum_{i,j} w_ij z_ij \sum_{k=1}^{L} t(x_i, x_j; ℓ_k) = max_t \sum_{k=1}^{L} J^k,   (17)

where J^k = \sum_{i,j} w_ij z_ij t(x_i, x_j; ℓ_k) is a partial objective function for the k-th leaf node, and w_ij is given by (10).

The appealing additive property of the objective function J makes it tractable to analyze each split as the search tree grows. In particular, we split the k-th leaf node into two child nodes k(1) and k(2) if and only if doing so increases the overall objective function: J^{k(1)} + J^{k(2)} > J^k. 
Moreover, we optimize each split by choosing the one that maximizes J^{k(1)} + J^{k(2)}.
To find the optimal split for a leaf node ℓ_k, we confine ourselves to hyperplane splits, i.e., a sample x is assigned to the left child ℓ_{k(1)} if p^T x + b = p̃^T x̃ > 0 and to the right child otherwise, where p̃ = [p^T b]^T and x̃ = [x^T 1]^T are the augmented projection and data vectors. The splitting criterion is given by:

max J^{k(1)} + J^{k(2)} = max_{‖p̃‖=1} \sum_{i,j} w_ij z_ij \, 1(p̃^T x̃_i · p̃^T x̃_j > 0)
                        ≈ max_{‖p̃‖=1} \sum_{i,j} w_ij z_ij [p̃^T x̃_i x̃_j^T p̃]
                        = max_{‖p̃‖=1} p̃^T X̃ M X̃^T p̃,   (18)

where M_ij = w_ij z_ij, and X̃ is the stack of all augmented samples at node ℓ_k. Note that since 1(a > 0) = (1/2) sign(a) + 1/2 is non-differentiable, we approximate it using (1/2) a + 1/2. The optimal p̃ of the above objective function is the eigenvector corresponding to the largest eigenvalue of X̃ M X̃^T.
The search tree construction algorithm is listed in Algorithm 2. In the implementation, if computational resources are critical, we may use stump functions to split the nodes with a large number of samples, while applying the optimal projection p to the small nodes. 
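The eigenvector computation behind (18) can be sketched as follows. The two clusters, the uniform boosting weights, and λ = 0.3 are hypothetical toy choices; the sketch only shows that the top eigenvector of X̃ M X̃^T yields a hyperplane separating similar from dissimilar pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical leaf node: two well-separated clusters in 2-D
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)) + [1.0, 0.0],
               rng.normal(0.0, 0.1, (5, 2)) + [-1.0, 0.0]])
Xa = np.hstack([X, np.ones((10, 1))])        # augmented samples [x^T 1]^T

s = np.zeros((10, 10), dtype=int)            # similar within each cluster
s[:5, :5] = 1
s[5:, 5:] = 1
lam = 0.3
w = np.ones((10, 10))                        # boosting weights (all 1 here)
M = w * np.where(s == 1, 1 - lam, -lam)      # M_ij = w_ij z_ij

# Eq. (18): the optimal augmented projection is the top eigenvector
A = Xa.T @ M @ Xa
eigvals, eigvecs = np.linalg.eigh(A)         # eigh returns ascending order
p = eigvecs[:, -1]                           # eigenvector of largest eigenvalue

left = Xa @ p > 0                            # hyperplane split p̃^T x̃ > 0
```

On this toy node the learned hyperplane sends one cluster left and the other right, which is exactly what maximizes the relaxed split objective.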
The selection of the stump functions is similar to that in traditional decision trees: on the given leaf node, a set of stump functions is attempted, and the one that maximizes (17) is selected if it increases the objective function.

Algorithm 2 Search Tree Construction
Input: A set of data points X; pairwise similarities s_ij ∈ {0, 1} and weights w_ij given by (10)
Output: Tree t
1: Assign X as root; enqueue root
2: repeat
3:    Find a leaf node ℓ in the queue; dequeue ℓ
4:    Find the optimal split for ℓ by solving (18)
5:    if the criterion in (17) increases then
6:        Split ℓ into ℓ_1 and ℓ_2; enqueue ℓ_1 and ℓ_2
7:    end if
8: until the queue is empty

3.4 Boosted Search Forest

In summary, we present a Boosted Search Forest (BSF) algorithm for the learning to search problem. In the learning stage, this algorithm follows the boosting framework described in Algorithm 1 to learn an ensemble of selection functions; each base selection function, in the form of a search tree, is learned with Algorithm 2. We then build inverted indices by passing all data points through the learned search trees. In analogy to text search, each leaf node corresponds to an "index word" in the vocabulary, and the data points reaching this leaf node are the "documents" associated with this "index word". In the candidate selection stage, instead of exhaustively evaluating T(q, x) for all x ∈ X, we only need to traverse the search trees and retrieve all items that collide with the query example in at least one tree. The selected candidate set, given by X_q = {x ∈ X : T(q, x) > 0}, is statistically optimized to have a small size (small computational cost) while containing a large number of relevant samples (good retrieval quality).

4 Experiments

We evaluate the Boosted Search Forest (BSF) algorithm on several image search tasks. 
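The inverted-index view of Section 3.4 can be sketched as follows; the leaf assignments are hypothetical (a real system would obtain them by traversing the learned search trees), but the index structure and the collide-in-at-least-one-tree retrieval rule are as described:

```python
from collections import defaultdict

# leaf_ids[m][i] = leaf reached by database item i in tree m (hypothetical)
leaf_ids = [
    [0, 0, 1, 1, 2],   # tree 1
    [3, 4, 3, 5, 5],   # tree 2
]

# One inverted table per tree: "index word" (tree, leaf) -> posting list
index = defaultdict(list)
for m, leaves in enumerate(leaf_ids):
    for item, leaf in enumerate(leaves):
        index[(m, leaf)].append(item)

def retrieve(query_leaves):
    """Return X_q: items colliding with the query in at least one tree."""
    candidates = set()
    for m, leaf in enumerate(query_leaves):
        candidates.update(index.get((m, leaf), []))
    return candidates

print(sorted(retrieve([0, 5])))   # query reaches leaf 0 of tree 1, leaf 5 of tree 2 -> [0, 1, 3, 4]
```

Lookup cost is one posting list per tree, independent of the database size, which is what allows the selection stage to run in constant time per query.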
Although a more general similarity measure can be used, for simplicity we set s(x_i, x_j) ∈ {0, 1} according to whether x_j is within the top K nearest neighbors (K-NN) of x_i in the designated metric space. We use K = 100 in the implementation.
We compare the performance of BSF to the two most popular algorithms for high dimensional image search: k-means trees and LSH. We also compare to a representative method from the learning to hash community, spectral hashing, although this algorithm was designed for Hamming embedding rather than search. Here linear scan is adopted on top of spectral hashing for search, because its more efficient alternatives are either directly compared (such as LSH) or can easily fail, as noticed in [24]. Our experiments show that exhaustive linear scan is not scalable, especially with the long hash codes needed for better retrieval accuracy (see Table 1).

The above algorithms are the most representative. We do not compare with other algorithms for several reasons. First, LSH was reported to be superior to kd-trees [21], and spectral hashing was reported to outperform RBM and BoostSSC [26]. Second, kd-trees and their extensions still work in low dimensions, but are known to behave poorly on high dimensional data such as that in image search. 
Third, since this paper focuses on learning to search, not learning to hash (Hamming embedding) or learning distance metrics, which pursue different goals, it is not essential to compare with more recent work on those topics such as [8, 11, 24, 7].

Figure 1: Comparison of Boosted Search Forest (BSF) on the Concept-1000 dataset with (a) k-means trees and LSH, and (b) Spectral Hashing (SH) of varying bits.

4.1 Concept-1000 Dataset

This dataset consists of more than 150K images of 1000 concepts selected from the Large Scale Concept Ontology for Multimedia (LSCOM) [17]. The LSCOM categories were specifically selected for multimedia annotation and retrieval, and have been used in the TRECVID video retrieval series. These concept names were input as queries to Google and Bing, and the top returned images were collected.

We choose the image representation proposed in [28], which is a high dimensional (~84K) feature with reported state-of-the-art performance in many visual recognition tasks. PCA is applied to reduce the dimension to 1000. We then randomly select around 6000 images as queries, and use the remaining (~150K) images as the search database.
In image search, we are interested in the overall quality of the set of candidate images returned by a search algorithm. 
This notion coincides with our formulation of the search problem in (4), which aims to maximize retrieval quality while maintaining a relatively low computational cost (for the reranking stage). The number of returned images clearly reflects the computational cost, and the retrieval quality is measured by the recall of retrieved images, i.e., the number of retrieved images that are among the 100-NN of the query. Note that we use recall instead of accuracy because recall gives the upper-bound performance of the reranking stage.

Figure 1(a) shows the performance comparison with two search algorithms: k-means trees and LSH. Since our boosted search forest consists of tree ensembles, for a fair comparison, we also construct an equivalent number of k-means trees (with random initializations) and multiple sets of LSH codes. Our proposed approach significantly outperforms k-means trees and LSH. The better performance is due to our learning to search formulation, which simultaneously maximizes recall while minimizing the size of the returned candidate set. In contrast, k-means trees use only an unsupervised clustering algorithm, and LSH employs purely random projections. Moreover, the performance of the k-means algorithm deteriorates as the dimension increases.

It is still interesting to compare to spectral hashing, although it is not a search algorithm. Since our approach requires more trees when the number of returns increases, we implement spectral hashing with varying bits: 32-bit, 96-bit, and 200-bit. As illustrated in Figure 1(b), our approach significantly outperforms spectral hashing under all configurations. Although the search forest does not have an explicit concept of bits, we can measure it from the information-theoretic point of view, by counting every binary branching in the trees as one bit. In the experiment, our approach retrieves about 70% of the 100-NN out of 500 returned images, after traversing 17 trees, each of 12 layers.
This is equivalent to 17 × 12 = 204 bits. With the same number of bits, spectral hashing only achieves a recall rate of around 60%.

4.2 One Million Tiny Images

In order to examine the scalability of BSF, we conducted experiments on a much larger database. We randomly sample one million images from the 80 Million Tiny Images dataset [23] as the search database, and 5000 additional images as queries. We use the 384-dimensional GIST features provided by the authors of [23].

Figure 2: Comparison of Boosted Search Forest (BSF) on the 1 Million Tiny Images dataset with (a) k-means trees and LSH, and (b) Spectral Hashing (SH) of varying bits.

Table 1: Comparison of retrieval time in a database with 0.5 billion synthesized samples.

#bits                    32       64       128      256       512
Linear scan              1.55s    2.74s    5.13s    10.11s    19.79s
Boosted search forest    0.006s   0.009s   0.017s   0.034s    0.073s

Comparisons with search algorithms (Figure 2(a)) and hashing methods (Figure 2(b)) are made in a similar way as in the previous section. Again, the BSF algorithm substantially outperforms the other methods: using 60 trees (fewer than 800 bits), our approach retrieves 55.0% of the 100-NN with 5000 returns (0.5% of the entire database), while k-means trees achieve only a 47.1% recall rate, and LSH and spectral hashing are even worse. 
Note that using more bits in spectral hashing can even hurt performance on this dataset.

4.3 Search Speed

All three aforementioned search algorithms (boosted search trees, k-means trees, and LSH) can naturally utilize inverted index structures to facilitate very efficient search. In particular, both our boosted search trees and k-means trees use the leaf nodes as the keys to index a list of data points in the database, while LSH uses multiple independently generated bits to form the indexing key. In this sense, all three algorithms have the same order of efficiency (constant time complexity).

On the other hand, in order to perform search with the compact Hamming codes generated by a learning to hash method (e.g. spectral hashing), one has to use either a linear scan approach or a hash table lookup technique that finds the samples within a radius-1 Hamming ball (or more complex methods like LSH). Although much more efficient, the hash table lookup approach is likely to fail as the dimension of the hash code grows to a few dozen, as observed in [24]. The retrieval speed using exhaustive linear scan is, however, far from satisfactory. Table 1 clearly illustrates this phenomenon on a database of 0.5 billion synthesized items. Even small codes with 32 bits take around 1.55 seconds per query (without sorting). When the hash codes grow to 512 bits (which is not unusual for high-dimensional image/video data), the query time is almost 20 seconds. This is not acceptable for most real applications. On the contrary, our boosted search forest with 32 16-layer trees (~512 bits) responds in less than 0.073s. Our timing is carried out on an Intel Xeon Quad X5560 CPU, with a highly optimized implementation of Hamming distance that is at least 8–10 times faster than a naive implementation.

5 Conclusion

This paper introduces a learning to search framework for scalable similarity search in high dimensions. 
Unlike previous methods, our algorithm learns a boosted search forest by jointly optimizing search quality and computational efficiency, under the supervision of pair-wise similarity labels. With a natural integration of the inverted index search structure, our method can handle web-scale datasets efficiently. Experiments show that our approach leads to better retrieval accuracy than state-of-the-art search methods such as locality sensitive hashing and k-means trees.

References

[1] J. S. Beis and D. G. Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In CVPR, pages 1000–1006, 1997.

[2] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, pages 253–262, 2004.

[3] J. Friedman, J. Bentley, and R. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software (TOMS), 3(3):209–226, 1977.

[4] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. The Annals of Statistics, 28(2):337–374, 2000.

[5] K. Fukunaga and P. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 100(7):750–753, 1975.

[6] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.

[7] J. He, W. Liu, and S.-F. Chang. Scalable similarity search with optimized kernel hashing. In KDD, 2010.

[8] P. Jain, B. Kulis, and K. Grauman. Fast image search for learned metrics. In CVPR, 2008.

[9] V. Jain and M. Varma. Learning to re-rank: query-dependent image re-ranking using click data. In WWW, pages 277–286, 2011.

[10] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In NIPS, 2009.

[11] B. Kulis and K.
Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, 2009.

[12] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, and T. Huang. Large-scale image classification: fast feature extraction and svm training. In CVPR, 2011.

[13] T.-Y. Liu. Learning to rank for information retrieval. In SIGIR, page 904, 2010.

[14] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, 2011.

[15] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In NIPS, pages 985–992, 2006.

[16] M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISSAPP, 2009.

[17] M. Naphade, J. Smith, J. Tesic, S. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. IEEE Multimedia Magazine, 13(3):86–91, 2006.

[18] D. Nistér and H. Stewénius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.

[19] R. Salakhutdinov and G. E. Hinton. Semantic hashing. Int. J. Approx. Reasoning, 50(7):969–978, 2009.

[20] G. Shakhnarovich. Learning task-specific similarity. PhD thesis, Massachusetts Institute of Technology, 2005.

[21] G. Shakhnarovich, T. Darrell, and P. Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. The MIT Press, 2006.

[22] C. Silpa-Anan and R. Hartley. Optimised kd-trees for fast image descriptor matching. In CVPR, 2008.

[23] A. Torralba, R. Fergus, and W. T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. PAMI, 30(11), 2008.

[24] J. Wang, O. Kumar, and S.-F. Chang. Semi-supervised hashing for scalable image retrieval. In CVPR, 2010.

[25] R. Weber, H. J. Schek, and S. Blott.
A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, 1998.

[26] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In NIPS, 2008.

[27] T. Yeh, J. J. Lee, and T. Darrell. Adaptive vocabulary forests for dynamic indexing and category learning. In ICCV, pages 1–8, 2007.

[28] X. Zhou, N. Cui, Z. Li, F. Liang, and T. S. Huang. Hierarchical gaussianization for image classification. In ICCV, 2009.