{"title": "A learning framework for nearest neighbor search", "book": "Advances in Neural Information Processing Systems", "page_first": 233, "page_last": 240, "abstract": "Can we leverage learning techniques to build a fast nearest-neighbor (NN) retrieval data structure? We present a general learning framework for the NN problem in which sample queries are used to learn the parameters of a data structure that minimize the retrieval time and/or the miss rate. We explore the potential of this novel framework through two popular NN data structures: KD-trees and the rectilinear structures employed by locality sensitive hashing. We derive a generalization theory for these data structure classes and present simple learning algorithms for both. Experimental results reveal that learning often improves on the already strong performance of these data structures.", "full_text": "A Learning Framework for Nearest Neighbor Search\n\nLawrence Cayton\n\nDepartment of Computer Science\nUniversity of California, San Diego\n\nlcayton@cs.ucsd.edu\n\nSanjoy Dasgupta\n\nDepartment of Computer Science\nUniversity of California, San Diego\n\ndasgupta@cs.ucsd.edu\n\nAbstract\n\nCan we leverage learning techniques to build a fast nearest-neighbor (ANN) re-\ntrieval data structure? We present a general learning framework for the NN prob-\nlem in which sample queries are used to learn the parameters of a data structure\nthat minimize the retrieval time and/or the miss rate. We explore the potential of\nthis novel framework through two popular NN data structures: KD-trees and the\nrectilinear structures employed by locality sensitive hashing. We derive a gener-\nalization theory for these data structure classes and present simple learning algo-\nrithms for both. 
Experimental results reveal that learning often improves on the\nalready strong performance of these data structures.\n\n1 Introduction\n\nNearest neighbor (NN) searching is a fundamental operation in machine learning, databases, signal\nprocessing, and a variety of other disciplines. We have a database of points X = {x1, . . . , xn}, and\non an input query q, we hope to return the nearest (or approximately nearest, or k-nearest) point(s)\nto q in X using some similarity measure.\nA tremendous amount of research has been devoted to designing data structures for fast NN retrieval.\nMost of these structures are based on some clever partitioning of the space and a few have bounds\n(typically worst-case) on the number of distance calculations necessary to query it.\nIn this work, we propose a novel approach to building an ef\ufb01cient NN data structure based on\nlearning. In contrast to the various data structures built using geometric intuitions, this learning\nframework allows one to construct a data structure by directly minimizing the cost of querying it.\nIn our framework, a sample query set guides the construction of the data structure containing the\ndatabase. In the absence of a sample query set, the database itself may be used as a reasonable prior.\nThe problem of building a NN data structure can then be cast as a learning problem:\n\nLearn a data structure that yields ef\ufb01cient retrieval times on the sample queries\nand is simple enough to generalize well.\n\nA major bene\ufb01t of this framework is that one can seamlessly handle situations where the query\ndistribution is substantially different from the distribution of the database.\nWe consider two different function classes that have performed well in NN searching: KD-trees\nand the cell structures employed by locality sensitive hashing. The known algorithms for these\ndata structures do not, of course, use learning to choose the parameters. 
Nevertheless, we can\nexamine the generalization properties of a data structure learned from one of these classes. We\nderive generalization bounds for both of these classes in this paper.\nCan the framework be practically applied? We present very simple learning algorithms for both of\nthese data structure classes that exhibit improved performance over their standard counterparts.\n\n1\n\n\f2 Related work\n\nThere is a voluminous literature on data structures for nearest neighbor search, spanning several\nacademic communities. Work on ef\ufb01cient NN data structures can be classi\ufb01ed according to two\ncriteria: whether they return exact or approximate answers to queries; and whether they merely\nassume the distance function is a metric or make a stronger assumption (usually that the data are\nEuclidean). The framework we describe in this paper applies to all these methods, though we focus\nin particular on data structures for RD.\nPerhaps the most popular data structure for nearest neighbor search in RD is the simple and con-\nvenient KD-tree [1], which has enjoyed success in a vast range of applications. Its main downside\nis that its performance is widely believed to degrade rapidly with increasing dimension. Variants\nof the data structure have been developed to ameliorate this and other problems [2], though high-\ndimensional databases continue to be challenging. One recent line of work suggests randomly pro-\njecting points in the database down to a low-dimensional space, and then using KD-trees [3, 4].\nLocality sensitive hashing (LSH) has emerged as a promising option for high-dimensional NN search\nin RD [5]. It has strong theoretical guarantees for databases of arbitrary dimensionality, though they\nare for approximate NN search. 
We review both KD-trees and LSH in detail later.\nFor data in metric spaces, there are several schemes based on repeatedly applying the triangle in-\nequality to eliminate portions of the space from consideration; these include Orchard\u2019s algorithm\n[6] and AESA [7]. Metric trees [8] and the recently suggested spill trees [3] are based on similar\nideas and are related to KD-trees. A recent trend is to look for data structures that are attuned to the\nintrinsic dimension, e.g. [9]. See the excellent survey [10] for more information.\nThere has been some work on building a data structure for a particular query distribution [11];\nthis line of work is perhaps most similar to ours. Indeed, we discovered at the time of press that the\nalgorithm for KD-trees we describe appeared previously in [12]. Nevertheless, the learning theoretic\napproach in this paper is novel; the study of NN data structures through the lens of generalization\nability provides a fundamentally different theoretical basis for NN search with important practical\nimplications.\n\n3 Learning framework\n\nIn this section we formalize a learning framework for NN search. This framework is quite general\nand will hopefully be of use to algorithmic developments in NN searching beyond those presented\nin this paper.\nLet X = {x1, . . . , xn} denote the database and Q the space from which queries are drawn. A\ntypical example is X \u2282 RD and Q = RD. We take a nearest neighbor data structure to be a\nmapping f : Q \u2192 2X; the interpretation is we compute distances only to f(q), not all of X. For\nexample, the structure underlying LSH partitions RD into cells and a query is assigned to the subset\nof X that falls into the same cell.\nWhat quantities are we interested in optimizing? We want to only compute distances to a small\nfraction of the database on a query; and, in the case of probabilistic algorithms, we want a high\nprobability of success. 
More precisely, we hope to minimize the following two quantities for a data structure f:\n\n\u2022 The fraction of X that we need to compute distances to:\n\nsizef (q) \u2261 |f(q)| / n.\n\n\u2022 The fraction of a query\u2019s k nearest neighbors that are missed:\n\nmissf (q) \u2261 |\u0393k(q) \\ f(q)| / k\n\n(\u0393k(q) denotes the k nearest neighbors of q in X).\n\nIn \u03b5-approximate NN search, we only require a point x such that d(q, x) \u2264 (1 + \u03b5)d(q, X), so we instead use an approximate miss rate:\n\n\u03b5missf (q) \u2261 1 [\u2204x \u2208 f(q) such that d(q, x) \u2264 (1 + \u03b5)d(q, X)] .\n\nNone of the previously discussed data structures are built by explicitly minimizing these quantities, though there are known bounds for some. Why not? One reason is that research has typically focused on worst-case sizef and missf rates, which require minimizing these functions over all q \u2208 Q; Q is, of course, typically infinite.\nIn this work, we instead focus on average-case sizef and missf rates\u2014i.e. we assume q is a draw from some unknown distribution D on Q and hope to minimize\n\nE_{q\u223cD} [sizef (q)] and E_{q\u223cD} [missf (q)] .\n\nTo do so, we assume that we are given a sample query set Q = {q1, . . . , qm} drawn iid from D. We attempt to build f minimizing the empirical size and miss rates, then resort to generalization bounds to relate these rates to the true ones.\n\n4 Learning algorithms\n\nWe propose two learning algorithms in this section. The first is based on a splitting rule for KD-trees designed to minimize a greedy surrogate for the empirical sizef function. The second is an algorithm that determines the boundary locations of the cell structure used in LSH, minimizing a tradeoff of the empirical sizef and \u03b5missf functions.\n\n4.1 KD-trees\n\nKD-trees are a popular cell partitioning scheme for RD based on the binary search paradigm. 
The data structure is built by picking a dimension, splitting the database along the median value in that dimension, and then recursing on both halves.\n\nprocedure BUILDTREE(S)\nif |S| < MinSize, return leaf.\nelse:\nPick an axis i.\nLet median = median(si : s \u2208 S).\nLeftTree = BUILDTREE({s \u2208 S : si \u2264 median}).\nRightTree = BUILDTREE({s \u2208 S : si > median}).\nreturn [LeftTree, RightTree, median, i].\n\nTo find a NN for a query q, one first computes distances to all points in the same cell, then traverses up the tree. At each parent node, the minimum distance between q and points already explored is compared to the distance to the split. If the latter is smaller, then the other child must be explored.\n\n[Figure omitted: one case where the right subtree must be explored, and one where it need not be.]\n\nTypically the cells contain only a few points; a query is expensive when it lies close to many of the cell boundaries, since then much of the tree must be explored.\n\nLearning method\n\nRather than picking the median split at each level, we use the training queries qi to pick a split that greedily minimizes the expected cost. A split s divides the sample queries (that are in the cell being split) into three sets: Qtc, those q that are \u201ctoo close\u201d to s\u2014i.e. nearer to s than d(q, X); Qr, those on the right of s but not in Qtc; and Ql, those on the left of s but not in Qtc. Queries in Qtc will require exploring both sides of the split. The split also divides the database points (that are in the cell being split) into Xl and Xr. The cost of split s is then defined to be\n\ncost(s) \u2261 |Ql| \u00b7 |Xl| + |Qr| \u00b7 |Xr| + |Qtc| \u00b7 |X|.\n\ncost(s) is a greedy surrogate for \u2211i sizef (qi); evaluating the true average size would require a potentially costly recursion. In contrast, minimizing cost(s) can be done painlessly since it takes on at most 2m + n possible values and each can be evaluated quickly. 
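To make the split-selection rule concrete, here is a minimal Python sketch of it (the function names and the brute-force nn_dist helper are ours, not from the paper's implementation; a practical version would precompute each d(q, X) once):

```python
import math

def nn_dist(q, X):
    # Distance from query q to its nearest neighbor in the database X.
    return min(math.dist(q, x) for x in X)

def split_cost(s, axis, X_cell, Q_cell, X):
    # cost(s) = |Ql|*|Xl| + |Qr|*|Xr| + |Qtc|*|X_cell|, where Qtc holds
    # the sample queries nearer to the split than to their nearest neighbor.
    nl = sum(1 for x in X_cell if x[axis] <= s)
    nr = len(X_cell) - nl
    ql = qr = qtc = 0
    for q in Q_cell:
        if abs(q[axis] - s) < nn_dist(q, X):
            qtc += 1                      # "too close": explore both children
        elif q[axis] <= s:
            ql += 1
        else:
            qr += 1
    return ql * nl + qr * nr + qtc * len(X_cell)

def best_split(axis, X_cell, Q_cell, X):
    # cost(s) changes value only at database coordinates and at query
    # coordinates shifted by their NN distance: at most 2m + n candidates.
    cands = sorted({x[axis] for x in X_cell}
                   | {q[axis] - nn_dist(q, X) for q in Q_cell}
                   | {q[axis] + nn_dist(q, X) for q in Q_cell})
    return min(cands, key=lambda s: split_cost(s, axis, X_cell, Q_cell, X))
```

In a full build one would call best_split recursively in place of the median step of BUILDTREE, falling back to the median when no sample queries land in the cell.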
Using a sample set led us to a very simple, natural cost function that can be used to pick splits in a principled manner.\n\n4.2 Locality sensitive hashing\n\nLSH was a tremendous breakthrough in NN search as it led to data structures with provably sublinear (in the database size) retrieval time for approximate NN searching. More impressive still, the bounds on retrieval are independent of the dimensionality of the database. We focus on the LSH scheme for the \u2016 \u00b7 \u2016p norm (p \u2208 (0, 2]), which we refer to as LSHp. It is built on an extremely simple space partitioning scheme which we refer to as a rectilinear cell structure (RCS).\n\nprocedure BUILDRCS(X \u2282 RD)\nLet R \u2208 R^{O(log n)\u00d7D} with Rij iid draws from a p-stable distribution.1\nProject the database down to O(log n) dimensions: xi \u21a6 Rxi.\nUniformly grid the space with B bins per direction.\n\nSee figure 3, left panel, for an example. On query q, one simply finds the cell that q belongs to, and returns the nearest x in that cell.\nIn general, LSHp requires many RCSs, used in parallel, to achieve a constant probability of success; in many situations one may suffice [13]. Note that LSHp only works for distances at a single scale R: the specific guarantee is that LSHp will return a point x \u2208 X within distance (1 + \u03b5)R of q as long as d(q, X) < R. To solve the standard \u03b5-approximate NN problem, one must build O(log(n/\u03b5)) LSHp structures.\n\nLearning method\n\nWe apply our learning framework directly to the class of RCSs since they are the core structural component of LSHp. We consider a slightly wider class of RCSs where the bin widths are allowed to vary. Doing so potentially allows a single RCS to work at multiple scales if the bin positions are chosen appropriately. 
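A minimal Python sketch of the BUILDRCS idea, under our own simplifying assumptions (names are ours; d stands in for the O(log n) projected dimension, we use the 2-stable Gaussian case, and we grid between the min and max projected database coordinates rather than fixing a bin width):

```python
import collections
import random

def build_rcs(X, d, B, seed=0):
    # Project the database with a 2-stable (Gaussian) matrix R, then grid
    # each of the d projected directions uniformly into B bins.
    rng = random.Random(seed)
    D = len(X[0])
    R = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(d)]

    def project(x):
        return [sum(rij * xj for rij, xj in zip(row, x)) for row in R]

    P = [project(x) for x in X]
    lo = [min(p[i] for p in P) for i in range(d)]
    hi = [max(p[i] for p in P) for i in range(d)]

    def cell(x):
        # Tuple of bin indices; out-of-range queries clamp to the edge bins.
        p = project(x)
        return tuple(
            min(B - 1, max(0, int((p[i] - lo[i]) * B / (hi[i] - lo[i] or 1.0))))
            for i in range(d))

    buckets = collections.defaultdict(list)
    for x in X:
        buckets[cell(x)].append(x)
    # The returned mapping is the data structure f: a query is charged
    # distance computations only to the points sharing its cell.
    return lambda q: buckets.get(cell(q), [])
```

This is one RCS; LSHp would build several such structures in parallel and query all of them.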
We give a simple procedure that selects the bin boundary locations.\n\nWe wish to select boundary locations minimizing the cost \u2211i [\u03b5missf (qi) + \u03bb sizef (qi)], where \u03bb is a tradeoff parameter (alternatively, one could fix a miss rate that is reasonable, say 5%, and minimize the size). The optimization is performed along one dimension at a time. Fortunately, the optimal binning along a dimension can be found by dynamic programming. There are at most m + n possible boundary locations; order them from left to right. The cost of placing the boundaries at p1, p2, . . . , pB+1 can be decomposed as c[p1, p2] + \u00b7\u00b7\u00b7 + c[pB, pB+1], where\n\nc[pi, pi+1] = \u2211_{q\u2208[pi,pi+1]} \u03b5missf (q) + \u03bb \u2211_{q\u2208[pi,pi+1]} |{x \u2208 [pi, pi+1]}| .\n\nLet D be our dynamic programming table, where D[p, i] is defined as the cost of putting the ith boundary at position p and the remaining B + 1 \u2212 i boundaries to the right. Then D[p, i] = min_{p\u2032\u2265p} ( c[p, p\u2032] + D[p\u2032, i \u2212 1] ).\n\n5 Generalization theory2\n\nIn our framework, a nearest neighbor data structure is learned by specifically designing it to perform well on a set of sample queries. Under what conditions will this search structure have good performance on future queries?\nRecall the setting: there is a database X = {x1, . . . , xn}, sample queries Q = {q1, . . . , qm} drawn iid from some distribution D on Q, and we wish to learn a data structure f : Q \u2192 2^X drawn from a function class F.\n\n1Dp is p-stable if for any v \u2208 Rd and Z, X1, . . . , Xd drawn iid from Dp, \u27e8v, X\u27e9 d= \u2016v\u2016p Z. For example, N (0, 1) is 2-stable.\n\n2See the full version of this paper for any missing proofs.\n\n
We are interested in the generalization of sizef (q) \u2261 |f (q)|/n and missf (q) \u2261 |\u0393k(q) \\ f (q)|/k, both of which have range [0, 1] (\u03b5missf (q) can be substituted for missf (q) throughout this section).\nSuppose a data structure f is chosen from some class F, so as to have low empirical cost\n\n(1/m) \u2211_{i=1..m} sizef (qi) and (1/m) \u2211_{i=1..m} missf (qi).\n\nCan we then conclude that data structure f will continue to perform well for subsequent queries drawn from the underlying distribution on Q? In other words, are the empirical estimates above necessarily close to the true expected values E_{q\u223cD} sizef (q) and E_{q\u223cD} missf (q)?\nThere is a wide range of uniform convergence results which relate the difference between empirical and true expectations to the number of samples seen (in our case, m) and some measure of the complexity of the two classes {sizef : f \u2208 F} and {missf : f \u2208 F}. The following is particularly convenient to use, and is well known [14, theorem 3.2].\nTheorem 1. Let G be a set of functions from a set Z to [0, 1]. Suppose a sample z1, . . . , zm is drawn from some underlying distribution on Z. Let Gm denote the restriction of G to these samples, that is,\n\nGm = {(g(z1), g(z2), . . . , g(zm)) : g \u2208 G}.\n\nThen for any \u03b4 > 0, the following holds with probability at least 1 \u2212 \u03b4:\n\nsup_{g\u2208G} | E g \u2212 (1/m) \u2211_{i=1..m} g(zi) | \u2264 2\u221a(2 log |Gm| / m) + \u221a(log(2/\u03b4) / m).\n\nThis can be applied immediately to the kind of data structure used by LSH.\nDefinition 2. A (u1, . . . , ud, B)-rectilinear cell structure (RCS) in RD is a partition of RD into B^d cells given by\n\nx \u21a6 (h1(x \u00b7 u1), . . . , hd(x \u00b7 ud)),\n\nwhere each hi : R \u2192 {1, . . . , B} is a partition of the real line into B intervals.\nTheorem 3. Fix any vectors u1, . . . , ud \u2208 RD, and, for some positive integer B, let the set of data structures F consist of all (u1, . . . , ud, B)-rectilinear cell structures in RD. Fix any database of n points X \u2282 RD. Suppose there is an underlying distribution over queries in RD, from which m sample queries q1, . . . , qm are drawn. Then with probability at least 1 \u2212 \u03b4,\n\nsup_{f\u2208F} | E[missf ] \u2212 (1/m) \u2211_{i=1..m} missf (qi) | \u2264 2\u221a(2d(B \u2212 1) log(m + n) / m) + \u221a(log(2/\u03b4) / m),\n\nand likewise for sizef .\nProof. Fix any X = {x1, . . . , xn} and any q1, . . . , qm. In how many ways can these points be assigned to cells by the class of all (u1, . . . , ud, B)-rectilinear data structures? Along each axis ui there are B \u2212 1 boundaries to be chosen and only m + n distinct locations for each of these (as far as partitioning of the xi\u2019s and qi\u2019s is concerned). Therefore there are at most (m + n)^{d(B\u22121)} ways to carve up the points. Thus the functions {missf : f \u2208 F} (or likewise, {sizef : f \u2208 F}) collapse to a set of size just (m + n)^{d(B\u22121)} when restricted to m queries; the rest follows from theorem 1.\n\nThis is good generalization performance because it depends only on the projected dimension, not the original dimension. It holds when the projection directions u1, . . . , ud are chosen randomly, but, more remarkably, even if they are chosen based on X (for instance, by running PCA on X). If we learn the projections as well (instead of using random ones) the bound degrades substantially.\nTheorem 4. Consider the same setting as Theorem 3, except that now F ranges over (u1, . . . , ud, B)-rectilinear cell structures for all choices of u1, . . . , ud \u2208 RD. 
Then with probability at least 1 \u2212 \u03b4,\n\nsup_{f\u2208F} | E[missf ] \u2212 (1/m) \u2211_{i=1..m} missf (qi) | \u2264 2\u221a((2 + 2d(D + B \u2212 2) log(m + n)) / m) + \u221a(log(2/\u03b4) / m),\n\nand likewise for sizef .\n\nFigure 1: Left: Outer ring is the database; inner cluster of points are the queries. Center: KD-tree with standard median splits. Right: KD-tree with learned splits.\n\nKD-trees are slightly different from RCSs: the directions ui are simply the coordinate axes, and the number of partitions per direction varies (e.g. one direction may have 10 partitions, another only 1).\nTheorem 5. Let F be the set of all depth-\u03b7 KD-trees in RD and X \u2282 RD be a database of points. Suppose there is an underlying distribution over queries in RD from which q1, . . . , qm are drawn. Then with probability at least 1 \u2212 \u03b4,\n\nsup_{f\u2208F} | E[missf ] \u2212 (1/m) \u2211_{i=1..m} missf (qi) | \u2264 2\u221a((2^{\u03b7+1} \u2212 2) log (D(3m + n)) / m) + \u221a(log (2/\u03b4) / m).\n\nA KD-tree utilizing median splits has depth \u03b7 \u2264 log n. The depth of a KD-tree with learned splits can be higher, though we found empirically that the depth was always much less than 2 log n (and can of course be restricted manually). KD-trees require significantly more samples than RCSs to generalize; the class of KD-trees is much more complex than that of RCSs.\n\n6 Experiments3\n\n6.1 KD-trees\n\nFirst let us look at a simple example comparing the learned splits to median splits. Figure 1 shows a 2-dimensional dataset and the cell partitions produced by the learned splits and the median splits. The KD-tree constructed with the median splitting rule places nearly all of the boundaries running right through the queries. 
As a result, nearly the entire database will have to be searched for queries drawn from the center cluster distribution. The KD-tree with the learned splits places most of the boundaries right around the actual database points, ensuring that fewer leaves will need to be examined for each query.\nWe now show results on several datasets from the UCI repository and the 2004 KDD cup competition. We restrict attention to relatively low-dimensional datasets (D < 100) since that is the domain in which KD-trees are typically applied. These experiments were all conducted using a modified version of Mount and Arya\u2019s excellent KD-tree software [15]. For this set of experiments, we used a randomly selected subset of the dataset as the database and a separate small subset as the test queries. For the sample queries, we used the database itself\u2014i.e. no additional data was used to build the learned KD-tree.\nThe following table shows the results. We compare performance in terms of the average number of database points we have to compute distances to on a test set.\n\ndata set | DB size | test pts | dim | # distance calculations, median split | learned split | improvement (%)\nCorel (UCI) | 32k | 5k | 32 | 1035.7 | 403.7 | 61.0\nCovertype (UCI) | 100k | 10k | 55 | 20.8 | 18.4 | 11.4\nLetter (UCI) | 18k | 2k | 16 | 470.1 | 353.8 | 27.4\nPen digits (UCI) | 9k | 1k | 16 | 168.9 | 114.9 | 31.9\nBio (KDD) | 100k | 10k | 74 | 1409.8 | 1310.8 | 7.0\nPhysics (KDD) | 100k | 10k | 78 | 1676.6 | 404.0 | 75.9\n\nThe learned method outperforms the standard method on all of the datasets, showing a very large improvement on several of them. Note also that even the standard method exhibits good performance, often requiring distance calculations to less than one percent of the database. We are showing strong improvements on what are already quite good results.\n\n3Additional experiments appear in the full version of this paper.\n\n[Figure 2 plots omitted; panels: Bears, N. American animals, All animals, Everything; curves: standard KD-tree, learned KD-tree.]\nFigure 2: Percentage of DB examined as a function of \u03b5 (the approximation factor) for various query distributions.\n\nFigure 3: Example RCSs. Left: Standard RCS (random boundaries). Right: Learned RCS (tuned boundaries).\n\nWe additionally experimented with the \u2018Corel50\u2019 image dataset. It is divided into 50 classes (e.g. air shows, bears, tigers, Fiji) containing 100 images each. We used the 371-dimensional \u201csemantic space\u201d representation of the images recently developed in a series of image retrieval papers (see e.g. [16]). This dataset allows us to explore the effect of differing query and database distributions in a natural setting. It also demonstrates that KD-trees with learned parameters can perform well on high-dimensional data.\nFigure 2 shows the results of running KD-trees using median and learned splits. In each case, 4000 images were chosen for the database (from across all the classes) and images from select classes were chosen for the queries. The \u201cAll\u201d queries were chosen from all classes; the \u201cAnimals\u201d were chosen from the 11 animal classes; the \u201cN. American animals\u201d were chosen from 5 of the animal classes; and the \u201cBears\u201d were chosen from the two bear classes. Standard KD-trees are performing somewhat better than brute force in these experiments; the learned KD-trees yield much faster retrieval times across a range of approximation errors. 
Note also that the performance of the learned KD-tree seems to improve as the query distribution becomes simpler, whereas the performance of the standard KD-tree actually degrades.\n\n6.2 RCS/LSH\n\nFigure 3 shows a sample run of the learning algorithm. The queries and DB are drawn from the same distribution. The learning algorithm adjusts the bin boundaries to the regions of density.\nExperimenting with RCS structures is somewhat challenging since there are two parameters to set (the number of projections and the number of boundaries), an approximation factor \u03b5, and two quantities to compare (size and miss). We swept over the two parameters to get results for the standard RCSs. Results for learned RCSs were obtained using only a single (essentially unoptimized) parameter setting. Rather than minimizing a tradeoff between sizef and missf , we constrained the miss rate and optimized sizef . The constraint was varied between runs (2%, 4%, etc.) to get comparable results.\nFigure 4 shows the comparison on databases of 10k points drawn from the MNIST and Physics datasets (2.5k points were used as sample queries). We see a marked improvement for the Physics dataset and a small improvement for the MNIST dataset. We suspect that the learning algorithm helps substantially for the physics data because its one-dimensional projections are highly non-uniform, whereas the MNIST one-dimensional projections are much more uniform.\n\nFigure 4: Miss rate vs. size rate (fraction of DB) for standard and tuned RCSs. Left: Physics dataset. Right: MNIST dataset.\n\n7 Conclusion\n\nThe primary contribution of this paper is demonstrating that building a NN search structure can be fruitfully viewed as a learning problem. We used this framework to develop algorithms that learn RCSs and KD-trees optimized for a query distribution. 
Possible future work includes applying the\nlearning framework to other data structures, though we expect that even stronger results may be\nobtained by using this framework to develop a novel data structure from the ground up. On the\ntheoretical side, margin-based generalization bounds may allow the use of richer classes of data\nstructures.\n\nAcknowledgments\nWe are grateful to the NSF for support under grants IIS-0347646 and IIS-0713540. Thanks to Nikhil\nRasiwasia, Sunhyoung Han, and Nuno Vasconcelos for providing the Corel50 data.\n\nReferences\n[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for \ufb01nding best matches in logarithmic\n\nexpected time. ACM Transactions on Mathematical Software, 3(3):209\u2013226, 1977.\n\n[2] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate\n\nnearest neighbor searching. Journal of the ACM, 45(6):891\u2013923, 1998.\n\n[3] T. Liu, A. W. Moore, A. Gray, and K. Yang. An investigation of practical approximate neighbor algo-\n\nrithms. In Neural Information Processing Systems (NIPS), 2004.\n\n[4] S. Dasgupta and Y. Freund. Random projection trees and low dimensional manifolds. Technical report,\n\nUCSD, 2007.\n\n[5] P. Indyk. Nearest neighbors in high dimensional spaces. In J. E. Goodman and J. O\u2019Rourke, editors,\n\nHandbook of Discrete and Computational Geometry. CRC Press, 2006.\n\n[6] M. T. Orchard. A fast nearest-neighbor search algorithm. In ICASSP, pages 2297\u20133000, 1991.\n[7] E. Vidal. An algorithm for \ufb01nding nearest neighbours in (approximately) constant average time. Pattern\n\nRecognition Letters, 4:145\u2013157, 1986.\n\n[8] S. Omohundro. Five balltree construction algorithms. Technical report, ICSI, 1989.\n[9] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.\n[10] K. L. Clarkson. Nearest-neighbor searching and metric space dimensions. 
In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice, pages 15\u201359. MIT Press, 2006.\n\n[11] S. Maneewongvatana and D. Mount. The analysis of a probabilistic approach to nearest neighbor searching. In Workshop on Algorithms and Data Structures, 2001.\n\n[12] S. Maneewongvatana and D. Mount. Analysis of approximate nearest neighbor searching with clustered point sets. In Workshop on Algorithm Engineering and Experimentation (ALENEX), 1999.\n\n[13] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG 2004, pages 253\u2013262, New York, NY, USA, 2004. ACM Press.\n\n[14] O. Bousquet, S. Boucheron, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2004.\n\n[15] D. Mount and S. Arya. ANN library. http://www.cs.umd.edu/\u223cmount/ANN/.\n\n[16] N. Rasiwasia, P. Moreno, and N. Vasconcelos. Bridging the gap: query by semantic example. IEEE Transactions on Multimedia, 2007.\n", "award": [], "sourceid": 674, "authors": [{"given_name": "Lawrence", "family_name": "Cayton", "institution": null}, {"given_name": "Sanjoy", "family_name": "Dasgupta", "institution": null}]}