{"title": "Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 1536, "page_last": 1544, "abstract": "The long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+eps)-distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments with high-dimensional datasets show that it often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior.", "full_text": "Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions\n\nParikshit Ram, Dongryeol Lee, Hua Ouyang and Alexander G. Gray\nComputational Science and Engineering, Georgia Institute of Technology\nAtlanta, GA 30332\n{p.ram@,dongryel@cc.,houyang@,agray@cc.}gatech.edu\n\nAbstract\n\nThe long-standing problem of efficient nearest-neighbor (NN) search has ubiquitous applications ranging from astrophysics to MP3 fingerprinting to bioinformatics to movie recommendations. As the dimensionality of the dataset increases, exact NN search becomes computationally prohibitive; (1+ε) distance-approximate NN search can provide large speedups but risks losing the meaning of NN search present in the ranks (ordering) of the distances. 
This paper presents a simple, practical algorithm allowing the user to, for the first time, directly control the true accuracy of NN search (in terms of ranks) while still achieving the large speedups over exact NN. Experiments on high-dimensional datasets show that our algorithm often achieves faster and more accurate results than the best-known distance-approximate method, with much more stable behavior.\n\n1 Introduction\n\nIn this paper, we address the problem of nearest-neighbor (NN) search in large datasets of high dimensionality. NN search is used for classification (the k-NN classifier [1]), categorizing a test point on the basis of the classes in its close neighborhood. Non-parametric density estimation uses NN algorithms when the bandwidth at any point depends on the k-th NN distance (NN kernel density estimation [2]). NN algorithms are present in, and are often the main cost of, most non-linear dimensionality reduction techniques (manifold learning [3, 4]), which obtain the neighborhood of every point and then preserve it during the dimension reduction. NN search has extensive applications in databases [5] and in computer vision for image search. Further applications abound in machine learning.\n\nTree data structures such as kd-trees are used for efficient exact NN search but do not scale better than the naïve linear search in sufficiently high dimensions. Distance-approximate NN (DANN) search, introduced to increase the scalability of NN search, approximates the distance to the NN; any neighbor found within that distance is considered “good enough”. 
Numerous techniques exist to achieve this form of approximation, and they are fairly scalable to higher dimensions under certain assumptions.\n\nAlthough DANN search places bounds on the numerical values of the distance to the NN, distances themselves are not essential in NN search; rather, the order of the distances from the query to the points in the dataset captures the necessary and sufficient information [6, 7]. For example, consider the two-dimensional dataset (1, 1), (2, 2), (3, 3), (4, 4), . . . with a query at the origin. Appending non-informative dimensions to each of the reference points produces higher dimensional datasets of the form (1, 1, 1, 1, 1, ....), (2, 2, 1, 1, 1, ...), (3, 3, 1, 1, 1, ...), (4, 4, 1, 1, 1, ...), . . .. For a fixed distance approximation, raising the dimension increases the number of points for which the distance to the query (i.e. the origin) satisfies the approximation condition. However, the ordering (and hence the ranks) of those distances remains the same. The proposed framework, rank-approximate nearest-neighbor (RANN) search, approximates the NN in its rank rather than in its distance, thereby making the approximation independent of the distance distribution and dependent only on the ordering of the distances.\n\nThis paper is organized as follows: Section 2 describes the existing methods for exact NN and DANN search and the challenges they face in high dimensions. Section 3 introduces the proposed approach and provides a practical algorithm using stratified sampling with a tree data structure to obtain a user-specified level of rank approximation in Euclidean NN search. Section 4 reports the experiments comparing RANN with exact search and DANN. Finally, Section 5 concludes with a discussion of the road ahead.\n\n2 Related Work\n\nThe problem of NN search is formalized as follows:\nProblem. 
Given a dataset S ⊂ X of size N in a metric space (X, d) and a query q ∈ X, efficiently find a point p ∈ S such that\n\nd(p, q) = min_{r ∈ S} d(r, q).    (1)\n\n2.1 Exact Search\n\nThe simplest approach, linear search over S to find the NN, is easy to implement, but it requires O(N) computations for a single NN query, making it unscalable for moderately large N. Hashing the dataset into buckets is an efficient technique, but scales only to very low dimensional X. Hence data structures are used to answer queries efficiently. Binary spatial partitioning trees, like kd-trees [9], ball trees [10] and metric trees [11], utilize the triangle inequality of the distance metric d (commonly the Euclidean metric) to prune away parts of the dataset from the computation and answer queries in expected O(log N) computations [9]. Non-binary cover trees [12] answer queries in theoretically bounded O(log N) time using the same property under certain mild assumptions on the dataset.\n\nFinding NNs for O(N) queries would then require at least O(N log N) computations using the trees. The dual-tree algorithm [13] for NN search also builds a tree on the queries instead of going through them linearly, hence amortizing the cost of search over the queries. This algorithm shows orders of magnitude improvement in efficiency and is conjectured to be O(N) for answering O(N) queries using cover trees [12].\n\n2.2 Nearest Neighbors in High Dimensions\n\nThe frontier of research in NN methods is high dimensional problems, stemming from common datasets like images and documents to microarray data. 
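The O(N) linear scan described above is the baseline every tree-based method is measured against. The following minimal Python sketch (illustrative only, not code from the paper; the name linear_search_nn is ours) makes it concrete, using the paper's two-dimensional example with the query at the origin:

```python
import math

def linear_search_nn(S, q):
    """Exact NN by brute force: one pass over the dataset, O(N) distance computations."""
    best, best_d = None, math.inf
    for p in S:
        d = math.dist(p, q)  # Euclidean metric
        if d < best_d:
            best, best_d = p, d
    return best, best_d

# The two-dimensional example from the introduction, queried at the origin:
S = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
nn, d_nn = linear_search_nn(S, (0.0, 0.0))
print(nn)  # -> (1.0, 1.0)
```

The tree-based methods below aim to return the same answer while visiting far fewer points.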
But high dimensional data poses an inherent problem for Euclidean NN search, as described in the following theorem:\nTheorem 2.1. [8] Let B be a p-dimensional hypersphere with radius r. Let x and y be any two points chosen at random in B, the distributions of x and y being independent and uniform over the interior of B. Let u be the Euclidean distance between x and y (u ∈ [0, 2r]). Then the asymptotic distribution of u is N(r√2, r²/(2p)).\n\nThis implies that in high dimensions, the Euclidean distances between uniformly distributed points lie in a small range of continuous values. This suggests that the tree-based algorithms perform no better than linear search, since these data structures would be unable to employ sufficiently tight bounds in high dimensions. This turns out to be true in practice [14, 15, 16], which prompted interest in approximating the NN search problem.\n\n2.3 Distance-Approximate Nearest Neighbors\n\nThe problem of NN search is relaxed in the following form to make it more scalable:\nProblem. Given a dataset S ⊂ X of size N in some metric space (X, d) and a query q ∈ X, efficiently find any point p′ ∈ S such that\n\nd(p′, q) ≤ (1 + ε) min_{r ∈ S} d(r, q)    (2)\n\nfor a low value of ε ∈ ℝ+ with high probability.\n\nThis approximation can be achieved with kd-trees, ball trees, and cover trees by modifying the search algorithm to prune more aggressively. This introduces the allowed error while providing some speedup over the exact algorithm [12]. Another approach modifies the tree data structures to bound error with just one root-to-leaf traversal of the tree, i.e. to eliminate backtracking. 
Sibling nodes in kd-trees or ball trees are modified to share points near their boundaries, forming spill trees [14]. These obtain significant speedups over the exact methods. The idea of approximately correct (satisfying Eq. 2) NN is further extended to a formulation where the (1 + ε) bound can be exceeded with a low probability δ, thus forming the PAC-NN search algorithms [17]. They provide 1-2 orders of magnitude speedup in moderately large datasets with suitable ε and δ.\n\nThese methods are still unable to scale to high dimensions. However, they can be used in combination with the assumption that high dimensional data actually lies on a lower dimensional subspace. There are a number of fast DANN methods that preprocess data with randomized projections to reduce dimensionality. Hybrid spill trees [14] build spill trees on the randomly projected data to obtain significant speedups. Locality sensitive hashing [18, 19] hashes the data into lower dimensional buckets using hash functions which guarantee that “close” points are hashed into the same bucket with high probability and “farther apart” points are hashed into the same bucket with low probability. This method yields significant improvements in running time over traditional methods on high dimensional data and is shown to be highly scalable.\n\nHowever, the DANN methods assume that the distances are well behaved and not concentrated in a small range. If, for example, all pairwise distances lie within the range (100.0, 101.0), any distance approximation ε ≥ 0.01 may return an arbitrary point for a NN query. The exact tree-based algorithms failed to be efficient precisely because many datasets encountered in practice suffer this concentration of pairwise distances. Using DANN in such a situation loses the ordering information of the pairwise distances, which is essential for NN search [6]. 
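This failure mode is easy to reproduce. The short Python check below (an illustrative sketch, not code from the paper) builds a set of query-to-point distances concentrated in (100.0, 101.0) and verifies that the (1 + ε) acceptance condition with ε = 0.01 is satisfied by every single point, so a distance-approximate query may return any of them:

```python
# Query-to-point distances concentrated in the narrow range (100.0, 101.0).
distances = [100.0 + i / 1000.0 for i in range(1, 1000)]

d_nn = min(distances)            # true nearest-neighbor distance (100.001)
eps = 0.01
threshold = (1 + eps) * d_nn     # any point this close counts as "good enough"

acceptable = sum(1 for d in distances if d <= threshold)
print(acceptable, len(distances))  # -> 999 999: every point qualifies
```

Rank approximation sidesteps this: the ordering of the 999 distances above is perfectly well defined even though their values are nearly indistinguishable.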
This is too large a loss in accuracy for the increased efficiency. To address this issue, we propose a model of approximation for NN search which preserves the information present in the ordering of the distances by controlling the error in the ordering itself, irrespective of the dimensionality or the distribution of the pairwise distances in the dataset. We also provide a scalable algorithm to obtain this form of approximation.\n\n3 Rank Approximation\n\nTo approximate the NN rank, we formulate and relax NN search in the following way:\nProblem. Given a dataset S ⊂ X of size N in a metric space (X, d) and a query q ∈ X, let D = {d_1, . . . , d_N} be the set of distances between the query and all the points in the dataset S, such that d_i = d(p_i, q), p_i ∈ S, i = 1, . . . , N. Let D_(k) be the k-th order statistic of D. Then the p ∈ S with d(p, q) = D_(1) is the NN of q in S. The rank-approximation of NN search is then to efficiently find a point p′ ∈ S such that\n\nd(p′, q) ≤ D_(1+τ)    (3)\n\nwith high probability for a given value of τ ∈ ℤ+.\n\nRANN search may use any order statistic of the population D, bounded above by the (1 + τ)-th order statistic, to answer a NN query. Sedransk et al. [20] provide a probability bound relating the sample order statistics to the order statistics of the whole set.\nTheorem 3.1. For a population of size N with values ordered as D_(1) ≤ D_(2) ≤ ⋅⋅⋅ ≤ D_(N), let y_(1) ≤ y_(2) ≤ ⋅⋅⋅ ≤ y_(n) be an ordered sample of size n drawn from the population uniformly without replacement. 
Then for 1 ≤ s ≤ n and 1 ≤ r ≤ N,\n\nP(y_(s) ≤ D_(r)) = Σ_{j=0}^{r−s} C(r − j − 1, s − 1) C(N − r + j, n − s) / C(N, n),    (4)\n\nwhere C(a, b) denotes the binomial coefficient “a choose b”.\n\nWe may find a p′ ∈ S satisfying Eq. 3 with high probability by sampling enough points {x_1, . . . , x_n} from S such that for some 1 ≤ s ≤ n, rank error bound τ, and success probability α,\n\nP(d(p′, q) = y_(s) ≤ D_(1+τ)) ≥ α.    (5)\n\nThe sample order statistic s = 1 minimizes the required number of samples; hence we substitute the values s = 1 and r = 1 + τ in Eq. 4, obtaining the following expression, which can be computed in O(τ) time:\n\nP(y_(1) ≤ D_(1+τ)) = Σ_{j=0}^{τ} C(N − τ + j − 1, n − 1) / C(N, n).    (6)\n\nThe required sample size n for a particular error τ with success probability α is computed using binary search over the range (1 + τ, N]. This makes RANN search O(n) (since now we only need to compute the first order statistic of a sample of size n), giving an O(N/n) speedup.\n\n3.1 Stratified Sampling with a Tree\n\nFor a required sample size of n, we could randomly sample n points from S and compute the RANN for a query q by going through the sampled set linearly. But for a tree built on S, parts of the tree would be pruned away for the query q during the tree traversal. 
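The sample-size computation just described can be sketched directly from Eq. 6: evaluate the success probability with Python's math.comb and binary-search for the smallest sufficient n (here over the full range [1, N], which is safe because the probability is monotone in n). This is an illustrative sketch; the names success_prob and sample_size are ours, not the paper's:

```python
from math import comb

def success_prob(n, N, tau):
    """Eq. 6: P(y_(1) <= D_(1+tau)) for a uniform sample of size n
    drawn without replacement from a population of size N."""
    return sum(comb(N - j - 1, n - 1) for j in range(tau + 1)) / comb(N, n)

def sample_size(N, tau, alpha):
    """Smallest n with success_prob(n, N, tau) >= alpha, by binary search
    (success_prob is monotone increasing in n)."""
    lo, hi = 1, N
    while lo < hi:
        mid = (lo + hi) // 2
        if success_prob(mid, N, tau) >= alpha:
            hi = mid
        else:
            lo = mid + 1
    return lo

n = sample_size(1000, 10, 0.95)
print(n, 1000 / n)  # required samples and the resulting O(N/n) speedup factor
```

Note that n depends on N, τ and α but not on the dimension or the distance values, which is the point of rank approximation.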
Hence we can ignore the random samples from the pruned part of the tree, saving us some more computation.\n\nLet S be stored in the form of a binary tree (say a kd-tree) rooted at Root. The root node has N points. Let the left and right children have N_l and N_r points respectively. For a random query q ∈ X, the population D is the set of distances of q to all the N points in Root. The tree stratifies the population D into D_l = {d_l1, . . . , d_lN_l} and D_r = {d_r1, . . . , d_rN_r}, where D_l and D_r are the sets of distances of q to the N_l and N_r points in the left and right child of Root respectively. The following theorem provides a way to decide how much to sample from a particular node, subsequently providing a lower bound on the number of samples required from the unpruned part of the tree without violating Eq. 5.\nTheorem 3.2. Let n_l and n_r be the numbers of random samples from the strata D_l and D_r respectively, obtained by stratified sampling of the population D of size N = N_l + N_r. Let n samples be required for Eq. 5 to hold in the population D for a given value of α. Then Eq. 5 holds for D with the same value of α with random samples of sizes n_l and n_r from the strata D_l and D_r of D respectively if n_l + n_r = n and n_l : n_r = N_l : N_r.\n\nProof. Eq. 5 simply requires n uniformly sampled points, i.e. for each distance in D to have probability n/N of inclusion. 
For n_l + n_r = n and n_l : n_r = N_l : N_r, we have n_l = ⌈(n/N) N_l⌉ and similarly n_r = ⌈(n/N) N_r⌉, and thus samples in both D_l and D_r are included at the proper rate.\n\nSince the ratio of the sample size to the population size is a constant β = n/N, Theorem 3.2 generalizes to any level of the tree.\n\n3.2 The Algorithm\n\nThe proposed algorithm introduces the intended approximation in the unpruned portion of the kd-tree, since the pruned part does not add to the computation in the exact tree-based algorithms. The algorithm starts at the root of the tree. While searching for the NN of a query q in a tree, most of the computation in the traversal involves computing the distance of the query q to a tree node R (dist_to_node(q, R)). If the current upper bound on the NN distance (ub(q)) for the query q is greater than dist_to_node(q, R), the node is traversed and ub(q) is updated. Otherwise node R is pruned. The computation of distances of q to points in the dataset S occurs only when q reaches a leaf node it cannot prune. The NN candidate in that leaf is computed using linear search (the COMPUTEBRUTENN subroutine in Fig. 2). The traversal of the exact algorithm in the tree is illustrated in Fig. 1.\n\nTo approximate the computation by sampling, traversal down the tree is stopped at any node which can be summarized with a small number of samples (below a certain threshold MAXSAMPLES). This is also illustrated in Fig. 1. The value of MAXSAMPLES giving maximum speedup can be obtained by cross-validation. 
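The proportional allocation of Theorem 3.2 is a one-liner in practice. The sketch below (illustrative; the name allocate is ours) splits a required sample size n across the two child strata at the constant rate β = n/N, using integer ceiling division to avoid floating-point rounding:

```python
def allocate(n, N, N_l, N_r):
    """Proportional (stratified) allocation from Theorem 3.2: sample each child
    at the constant rate beta = n/N, so n_l : n_r = N_l : N_r and n_l + n_r >= n."""
    n_l = -(-n * N_l // N)   # ceil(n * N_l / N), exact integer arithmetic
    n_r = -(-n * N_r // N)   # ceil(n * N_r / N)
    return n_l, n_r

n_l, n_r = allocate(100, 1000, 700, 300)
print(n_l, n_r)  # -> 70 30
```

Because the ceilings can only round up, the total n_l + n_r never falls below n, so Eq. 5 is never violated by the split.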
If a node is summarizable within the desired error bounds (decided by the CANAPPROXIMATE subroutine in Fig. 2), the required number of points is sampled from that node and the nearest neighbor candidate is computed from among them using linear search (the COMPUTEAPPROXNN subroutine of Fig. 2).\nSingle Tree. The search algorithm is presented in Fig. 2. The dataset S is stored as a binary tree rooted at Root. The algorithm starts as STRANKAPPROXNN(q, S, τ, α). During the search, if a leaf node is reached (since the tree is rarely balanced), the exact NN candidate is computed. In case a non-leaf node cannot be approximated, the child node closer to the query is always traversed first. The following theorem proves the correctness of the algorithm.\nTheorem 3.3. For a query q and specified values of τ and α, STRANKAPPROXNN(q, S, τ, α) computes a neighbor in S within (1 + τ) rank with probability at least α.\n\nFigure 1: The traversal paths of the exact and the rank-approximate algorithm in a kd-tree.\n\nProof. By Eq. 6, a query requires at least n samples from a dataset of size N to compute a neighbor within (1 + τ) rank with probability α. Let β = n/N. Let a node R contain |R| points. In the algorithm, sampling occurs when a base case of the recursion is reached. There are three base cases:\n\n∙ Case 1 - Exact Pruning (if ub(q) ≤ dist_to_node(q, R)): The number of points required to be sampled from the node is at least ⌈β ⋅ |R|⌉. However, since this node is pruned, we ignore these points. 
Hence nothing is done in the algorithm.\n∙ Case 2 - Exact Computation (COMPUTEBRUTENN(q, R)): In this subroutine, linear search is used to find the NN candidate. Hence the number of points actually sampled is |R| ≥ ⌈β ⋅ |R|⌉.\n∙ Case 3 - Approximate Computation (COMPUTEAPPROXNN(q, R, β)): In this subroutine, exactly ⌈β ⋅ |R|⌉ samples are made and linear search is performed over them.\n\nLet the total number of points effectively sampled from S be n′. From the three base cases of the algorithm, it is confirmed that n′ ≥ ⌈β ⋅ N⌉ = n. Hence the algorithm computes a NN within (1 + τ) rank with probability at least α.\nDual Tree. The single tree algorithm in Fig. 2 can be extended to a dual tree algorithm for the case of O(N) queries. The dual tree RANN algorithm (DTRANKAPPROXNN(Q, S, τ, α)) is also given in Fig. 2. The only difference is that for every query q ∈ Q, the minimum required amount of sampling is done, and the random sampling is done separately for each of the queries. Even though the queries do not share samples from the reference set, when a node of the query tree prunes a reference node, that reference node is pruned for all the queries in that query node simultaneously. This work-sharing is a key feature of all dual-tree algorithms [13].\n\n4 Experiments and Results\n\nA meaningful value for the rank error τ should be relative to the size of the reference dataset N. Hence for the experiments, the (1 + τ)-RANN is modified to (1 + ⌈ε ⋅ N⌉)-RANN, where ε ∈ ℝ+ and ε ≤ 1.0. The Euclidean metric is used in all the experiments. 
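The single-tree search of Section 3.2 can be sketched compactly in Python. This is a simplified illustration, not the paper's implementation: it uses a ball-style node bound (center and radius) in place of kd-tree hyperrectangles, and the names Node, strann and MAX_SAMPLES are ours. It exhibits the three base cases of the proof above: pruning, brute force at leaves, and sampling at summarizable nodes, recursing into the closer child first.

```python
import math
import random

MAX_SAMPLES = 20  # threshold below which a node is "summarizable"

class Node:
    """A simple bounding-ball tree node (stand-in for the paper's kd-tree)."""
    def __init__(self, pts, leaf_size=8):
        self.pts = pts
        self.center = tuple(sum(c) / len(pts) for c in zip(*pts))
        self.radius = max(math.dist(p, self.center) for p in pts)
        self.left = self.right = None
        if len(pts) > leaf_size:
            # split along the dimension of largest spread
            dim = max(range(len(pts[0])),
                      key=lambda d: max(p[d] for p in pts) - min(p[d] for p in pts))
            pts = sorted(pts, key=lambda p: p[dim])
            mid = len(pts) // 2
            self.left = Node(pts[:mid], leaf_size)
            self.right = Node(pts[mid:], leaf_size)

def dist_to_node(q, node):
    """Lower bound on d(q, p) for any point p inside the node's ball."""
    return max(0.0, math.dist(q, node.center) - node.radius)

def strann(q, node, beta, ub):
    """Single-tree RANN traversal; ub is a one-element list holding the
    current upper bound on the NN distance."""
    if ub[0] <= dist_to_node(q, node):                     # Case 1: exact pruning
        return
    n_req = math.ceil(beta * len(node.pts))
    if node.left is None:                                  # Case 2: leaf, brute force
        ub[0] = min(ub[0], min(math.dist(q, p) for p in node.pts))
    elif n_req <= MAX_SAMPLES:                             # Case 3: sample the node
        sample = random.sample(node.pts, n_req)
        ub[0] = min(ub[0], min(math.dist(q, p) for p in sample))
    else:                                                  # recurse, closer child first
        for child in sorted((node.left, node.right), key=lambda c: dist_to_node(q, c)):
            strann(q, child, beta, ub)
```

With β = 1 every visited node is scanned in full and the traversal reduces to exact tree search; smaller β trades rank accuracy for speed exactly as in Theorem 3.3.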
Although the value of MAXSAMPLES giving maximum speedup can be obtained by cross-validation, for practical purposes any low value (≈ 20-30) suffices, and this is what is used in the experiments.\n\n4.1 Comparisons with Exact Search\n\nThe speedups of the exact dual-tree NN algorithm and of the approximate tree-based algorithm over linear search are computed and compared. Different levels of approximation ranging from 0.001% to 10% are used to show how the speedup increases with increasing approximation.\n\nSTRANKAPPROXNN(q, S, τ, α)\n  n ← COMPUTESAMPLESIZE(|S|, τ, α)\n  β ← n / |S|\n  Root ← TREE(S)\n  STRANN(q, Root, β)\n\nSTRANN(q, R, β)\n  if ub(q) > dist_to_node(q, R) then\n    if ISLEAF(R) then\n      COMPUTEBRUTENN(q, R)\n    else if CANAPPROXIMATE(R, β) then\n      COMPUTEAPPROXNN(q, R, β)\n    else\n      STRANN(q, R_c, β), STRANN(q, R_f, β)\n    end if\n  end if\n\nCOMPUTEBRUTENN(q, R)\n  ub(q) ← min(min_{p ∈ R} d(q, p), ub(q))\n\nCOMPUTEBRUTENN(Q, R)\n  for all q ∈ Q do\n    ub(q) ← min(min_{p ∈ R} d(q, p), ub(q))\n  end for\n  node_ub(Q) ← max_{q ∈ Q} ub(q)\n\nCOMPUTEAPPROXNN(q, R, β)\n  R′ ← ⌈β ⋅ |R|⌉ samples from R\n  COMPUTEBRUTENN(q, R′)\n\nCOMPUTEAPPROXNN(Q, R, β)\n  for all q ∈ Q do\n    R′ ← ⌈β ⋅ |R|⌉ samples from R\n    COMPUTEBRUTENN(q, R′)\n  end for\n  node_ub(Q) ← max_{q ∈ Q} ub(q)\n\nDTRANKAPPROXNN(Q, S, τ, α)\n  n ← COMPUTESAMPLESIZE(|S|, τ, α)\n  β ← n / |S|\n  Root_S ← TREE(S)\n  Root_Q ← TREE(Q)\n  DTRANN(Root_Q, Root_S, β)\n\nDTRANN(Q, R, β)\n  if node_ub(Q) > dist_between_nodes(Q, R) then\n    if ISLEAF(Q) && ISLEAF(R) then\n      COMPUTEBRUTENN(Q, R)\n    else if ISLEAF(R) then\n      DTRANN(Q_l, R, β), DTRANN(Q_r, R, β)\n      node_ub(Q) ← max_{C ∈ {Q_l, Q_r}} node_ub(C)\n    else if CANAPPROXIMATE(R, β) then\n      if ISLEAF(Q) then\n        COMPUTEAPPROXNN(Q, R, β)\n      else\n        DTRANN(Q_l, R, β), DTRANN(Q_r, R, β)\n        node_ub(Q) ← max_{C ∈ {Q_l, Q_r}} node_ub(C)\n      end if\n    else if ISLEAF(Q) then\n      DTRANN(Q, R_c, β), DTRANN(Q, R_f, β)\n    else\n      DTRANN(Q_l, R_c, β), DTRANN(Q_l, R_f, β)\n      DTRANN(Q_r, R_c, β), DTRANN(Q_r, R_f, β)\n      node_ub(Q) ← max_{C ∈ {Q_l, Q_r}} node_ub(C)\n    end if\n  end if\n\nCANAPPROXIMATE(R, β)\n  return ⌈β ⋅ |R|⌉ ≤ MAXSAMPLES\n\nFigure 2: Single tree (STRANKAPPROXNN) and dual tree (DTRANKAPPROXNN) algorithms and subroutines for RANN search for a query q (or a query set Q) in a dataset S with rank approximation τ and success probability α. R_c and R_f are the closer and farther child respectively of R from the query q (or a query node Q).\n\nDifferent datasets drawn from the UCI repository (Bio dataset 300k×74, Corel dataset 40k×32, Covertype dataset 600k×55, Phy dataset 150k×78) [21], the MNIST handwritten digit recognition dataset (60k×784) [22] and the Isomap “images” dataset (700×4096) [3] are used. The final dataset, “urand”, is a synthetic dataset of points sampled uniformly at random from a unit ball (1m×20). This dataset is used to show that even in the absence of a lower-dimensional subspace, RANN obtains significant speedups over exact methods for relatively low errors. For each dataset, the NN of every point in the dataset is found in the exact case, and the (1 + ⌈ε ⋅ N⌉)-rank-approximate NN of every point is found in the approximate case. These results are summarized in Fig. 3.\n\nThe results show that even for low values of ε (the high accuracy setting), the RANN algorithm is significantly more scalable than the exact algorithms for all the datasets. 
Note that for some of the datasets, the low values of approximation used in the experiments are equivalent to zero rank error (which is the exact case), hence are equally efficient as the exact algorithm.\n\nFigure 3: Speedups (log scale on the Y-axis) over the linear search algorithm while finding the NN in the exact case or the (1 + ⌈ε ⋅ N⌉)-RANN in the approximate case, with ε = 0.001%, 0.01%, 0.1%, 1.0%, 10.0% and a fixed success probability α = 0.95, for every point in each dataset (bio, corel, covtype, images, mnist, phy, urand). The first (white) bar for each dataset on the X-axis is the speedup of the exact dual tree NN algorithm, and the subsequent (dark) bars are the speedups of the approximate algorithm with increasing approximation.\n\n4.2 Comparison with Distance-Approximate Search\n\nFor the different forms of approximation, the average rank errors and the maximum rank errors achieved in comparable retrieval times are considered for comparison. The rank errors are compared since any method with relatively lower rank error will obviously have relatively lower distance error. For DANN, Locality Sensitive Hashing (LSH) [19, 18] is used.\n\nSubsets of two datasets known to have a lower-dimensional embedding are used for this experiment - Layout Histogram (10k×30) [21] and the MNIST dataset (10k×784) [22]. The approximate NN of every point in the dataset is found with different levels of approximation for both algorithms. The average rank error and maximum rank error are computed for each of the approximation levels. For our algorithm, we increased the rank error and observed a corresponding decrease in the retrieval time. LSH has three parameters. 
To obtain the best retrieval times with low rank error, we fixed one parameter and varied the other two to obtain a decrease in runtime, and did this for many values of the first parameter. The results are summarized in Fig. 4 and Fig. 5.\n\nThe results show that even in the presence of a lower-dimensional embedding of the data, the rank errors for a given retrieval time are comparable for the two approximate algorithms. The advantage of the rank-approximate algorithm is that the rank error can be controlled directly, whereas in LSH, tweaking the cross-product of its three parameters is typically required to obtain the best ranks for a particular retrieval time. Another advantage of the tree-based algorithm for RANN is that even though the maximum error is bounded only with a probability, the actual maximum error is not much worse than the allowed maximum rank error, since a tree is used. In the case of LSH, at times the actual maximum rank error is extremely large, corresponding to LSH returning points which are very far from being the NN. 
This makes the proposed algorithm for RANN much more stable than LSH for Euclidean NN search. Of course, the reported times depend highly on implementation details and optimization tricks, and should be considered carefully.\n\nFigure 4: Average Rank Error on the X-axis and query time (in sec.) on the Y-axis for random samples of size 10000 from (a) Layout Histogram and (b) MNIST, for RANN and LSH.\n\nFigure 5: Maximum Rank Error on the X-axis and query time (in sec.) on the Y-axis for random samples of size 10000 from (a) Layout Histogram and (b) MNIST, for RANN and LSH.\n\n5 Conclusion\n\nWe have proposed a new form of approximation algorithm for unscalable NN search instances that controls the true error of NN search (i.e. the ranks). This allows approximate NN search to retain meaning in high dimensional datasets even in the absence of a lower-dimensional embedding. The proposed algorithm for approximate Euclidean NN has been shown to scale much better than the exact algorithm even for low levels of approximation, even when the true dimension of the data is relatively high. 
When compared with the popular DANN method (LSH), it is shown to be comparably efficient in terms of average rank error even in the presence of a lower-dimensional subspace of the data (a property that is crucial for the performance of the distance-approximate method). Moreover, the use of a spatial-partitioning tree in the algorithm stabilizes the method by clamping the actual maximum error to within a reasonable rank threshold, unlike the distance-approximate method.

However, note that the proposed algorithm still relies on the ability of the underlying tree data structure to bound distances. Therefore, our method is not necessarily immune to the curse of dimensionality. Regardless, RANN provides a new paradigm for NN search that is comparably efficient to the existing distance-approximate methods while allowing the user to directly control the true accuracy, which resides in the ordering of the neighbors.

References

[1] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[2] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.

[3] J. B. Tenenbaum, V. Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290(5500):2319–2323, 2000.

[4] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, December 2000.

[5] A. N. Papadopoulos and Y. Manolopoulos. Nearest Neighbor Search: A Database Perspective. Springer, 2005.

[6] N. Alon, M. Bădoiu, E. D. Demaine, M. Farach-Colton, and M. T. Hajiaghayi. Ordinal Embeddings of Minimum Relaxation: General Properties, Trees, and Ultrametrics. 2008.

[7] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is "Nearest Neighbor" Meaningful?
In Lecture Notes in Computer Science, pages 217–235, 1999.

[8] J. M. Hammersley. The Distribution of Distance in a Hypersphere. Annals of Mathematical Statistics, 21:447–452, 1950.

[9] J. H. Freidman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. Math. Softw., 3(3):209–226, September 1977.

[10] S. M. Omohundro. Five Balltree Construction Algorithms. Technical Report TR-89-063, International Computer Science Institute, December 1989.

[11] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer, 1985.

[12] A. Beygelzimer, S. Kakade, and J. C. Langford. Cover Trees for Nearest Neighbor. In Proceedings of the 23rd International Conference on Machine Learning, pages 97–104, 2006.

[13] A. G. Gray and A. W. Moore. 'N-Body' Problems in Statistical Learning. In NIPS, volume 4, pages 521–527, 2000.

[14] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. In Advances in Neural Information Processing Systems 17, pages 825–832, 2005.

[15] L. Cayton. Fast Nearest Neighbor Retrieval for Bregman Divergences. In Proceedings of the 25th International Conference on Machine Learning, pages 112–119, 2008.

[16] T. Liu, A. W. Moore, and A. G. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions. 2004.

[17] P. Ciaccia and M. Patella. PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-dimensional and Metric Spaces. In Proceedings of the 16th International Conference on Data Engineering, pages 244–255, 2000.

[18] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. pages 518–529, 1999.

[19] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality.
In STOC, pages 604–613, 1998.

[20] J. Sedransk and J. Meyer. Confidence Intervals for the Quantiles of a Finite Population: Simple Random and Stratified Simple Random Sampling. Journal of the Royal Statistical Society, pages 239–252, 1978.

[21] C. L. Blake and C. J. Merz. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/, 1998.

[22] Y. LeCun. MNIST dataset, 2000. http://yann.lecun.com/exdb/mnist/.