{"title": "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node", "book": "Advances in Neural Information Processing Systems", "page_first": 13771, "page_last": 13781, "abstract": "Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms\ngenerate indices that must be stored in main memory for fast high-recall search.\nThis makes them expensive and limits the size of the dataset. We present a\nnew graph-based indexing and search system called DiskANN that can index,\nstore, and search a billion point database on a single workstation with just 64GB\nRAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom,\nwe demonstrate that the SSD-based indices built by DiskANN can meet all three\ndesiderata for large-scale ANNS: high-recall, low query latency and high density\n(points indexed per node). On the billion point SIFT1B bigann dataset, DiskANN\nserves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1\non a 16 core machine, where state-of-the-art billion-point ANNS algorithms with\nsimilar memory footprint like FAISS and IVFOADC+G+P plateau at\naround 50% 1-recall@1. Alternately, in the high recall regime, DiskANN can\nindex and serve 5 \u2212 10x more points per node compared to state-of-the-art graph-\nbased methods such as HNSW and NSG. 
Finally, as part of our overall DiskANN system, we introduce Vamana, a new graph-based ANNS index that is more versatile than the existing graph indices even for in-memory indices.", "full_text": "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node

Suhas Jayaram Subramanya* (Carnegie Mellon University, suhas@cmu.edu)
Devvrit* (University of Texas at Austin, devvrit.03@gmail.com)
Rohan Kadekodi* (University of Texas at Austin, rak@cs.texas.edu)
Ravishankar Krishnaswamy (Microsoft Research India, rakri@microsoft.com)
Harsha Vardhan Simhadri (Microsoft Research India, harshasi@microsoft.com)

Abstract

Current state-of-the-art approximate nearest neighbor search (ANNS) algorithms generate indices that must be stored in main memory for fast high-recall search. This makes them expensive and limits the size of the dataset. We present a new graph-based indexing and search system called DiskANN that can index, store, and search a billion point database on a single workstation with just 64GB RAM and an inexpensive solid-state drive (SSD). Contrary to current wisdom, we demonstrate that the SSD-based indices built by DiskANN can meet all three desiderata for large-scale ANNS: high recall, low query latency and high density (points indexed per node). On the billion point SIFT1B bigann dataset, DiskANN serves > 5000 queries a second with < 3ms mean latency and 95%+ 1-recall@1 on a 16 core machine, where state-of-the-art billion-point ANNS algorithms with similar memory footprint like FAISS [18] and IVFOADC+G+P [8] plateau at around 50% 1-recall@1. Alternately, in the high recall regime, DiskANN can index and serve 5-10x more points per node compared to state-of-the-art graph-based methods such as HNSW [21] and NSG [13].
Finally, as part of our overall DiskANN system, we introduce Vamana, a new graph-based ANNS index that is more versatile than the existing graph indices even for in-memory indices.

1 Introduction

In the nearest neighbor search problem, we are given a dataset P of points in some metric space. The goal is to design a data structure of small size, such that, for any query q in the same metric space, and target k, we can retrieve the k nearest neighbors of q from the dataset P quickly. This is a fundamental problem in algorithms research, and also a commonly used sub-routine in a diverse set of areas such as computer vision, document retrieval and recommendation systems, to name a few. In these applications, the actual entities — images, documents, user profiles — are embedded into a hundred or thousand dimensional space such that a desired notion of the entities' similarity is encoded as distance between their embeddings.

Unfortunately, it is often impossible to retrieve the exact nearest neighbors without essentially resorting to a linear scan of the data (see, e.g., [15, 23]) due to a phenomenon known as the curse of dimensionality [10]. As a result, one resorts to finding the approximate nearest neighbors (ANN), where the goal is to retrieve k neighbors which are close to being optimal. More formally, consider a query q, and suppose the algorithm outputs a set X of k candidate near neighbors, and suppose G is the ground-truth set of the k closest neighbors to q from among the points of the base dataset. Then, we define the k-recall@k of this set X to be |X ∩ G| / k.

*Work done while at Microsoft Research India

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
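For concreteness, the recall measure just defined is straightforward to compute; a minimal sketch (the point ids below are hypothetical, purely for illustration):

```python
def k_recall_at_k(candidates, ground_truth):
    """k-recall@k = |X ∩ G| / k, where X is the returned candidate set and
    G is the ground-truth set of the k closest points (|X| = |G| = k)."""
    k = len(ground_truth)
    return len(set(candidates) & set(ground_truth)) / k

# Toy example: 4 of the 5 returned candidates are true 5-nearest neighbors.
truth = [3, 17, 42, 8, 99]
returned = [3, 17, 42, 8, 250]
print(k_recall_at_k(returned, truth))  # 0.8
```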
The goal of an ANN algorithm then is to maximize recall while retrieving the results as quickly as possible, which results in the recall-vs-latency tradeoff. There are numerous algorithms for this problem with diverse index construction methodologies and a range of tradeoffs w.r.t. indexing time, recall, and query time. For example, while k-d trees generate compact indices that are fast to search in low dimensions, they are typically very slow when the dimension d exceeds about 20. On the other hand, Locality Sensitive Hashing based methods [2, 4] provide near-optimal guarantees on the tradeoff between index size and search time in the worst case, but they fail to exploit the distribution of the points and are outperformed by more recent graph-based methods on real-world datasets. Recent work on data-dependent LSH schemes (e.g. [3]) is yet to be proven at scale. As of this writing, the best algorithms in terms of search time vs recall on real-world datasets are often graph-based algorithms such as HNSW [21] and NSG [13], where the indexing algorithm constructs a navigable graph over the base points, and the search procedure is a best-first traversal that starts at a chosen (or random) point and walks along the edges of the graph, getting closer to the query at each step, until it converges to a local minimum. A recent work of Li et al. [20] provides an excellent survey and comparison of ANN algorithms.

Many applications require fast and accurate search on billions of points in Euclidean metrics. Today, there are essentially two high-level approaches to indexing large datasets.

The first approach is based on Inverted Index + Data Compression and includes methods such as FAISS [18] and IVFOADC+G+P [8].
These methods cluster the dataset into M partitions, and compare the query to only the points in a few, say, m ≪ M partitions closest to the query. Moreover, since the full-precision vectors cannot fit in main memory, the points are compressed using a quantization scheme such as Product Quantization [17]. While these schemes have a small memory footprint (less than 64GB for storing an index on a billion points in 128 dimensions) and can retrieve results in < 5ms using GPUs or other hardware accelerators, their 1-recall@1 is rather low (around 0.5) since the data compression is lossy. These methods report higher recall values for a weaker notion of 1-recall@100: the likelihood that the true nearest neighbor is present in a list of 100 output candidates. However, this measure may not be acceptable in many applications.

The second approach is to divide the dataset into disjoint shards, and build an in-memory index for each shard. However, since these indices store both the index and the uncompressed data points, they have a larger memory footprint than the first approach. For example, an NSG index for 100M floating-point vectors in 128 dimensions would have a memory footprint of around 75GB². Therefore, serving an index over a billion points would need several machines to host the indices. Such a scheme is reportedly [13] in use at Taobao, Alibaba's e-commerce platform, where they divide their dataset of 2 billion 128-dimensional points into 32 shards, and host the index for each shard on a different machine. Queries are routed to all shards, and the results from all shards are aggregated. Using this approach, they report 100-recall@100 values of 0.98 with a latency of ∼5ms.
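The ~75GB NSG footprint cited above can be reproduced with back-of-the-envelope arithmetic; a sketch assuming, as in footnote 2, an average degree of 50, with 4-byte floats and 4-byte neighbor ids (the real index carries some extra metadata, which is why the paper's figure is a bit higher):

```python
# Footprint of an in-memory graph index over 100M float32 vectors
# in 128 dimensions with an assumed average out-degree of 50.
n = 100_000_000
dim, degree = 128, 50

vector_bytes = n * dim * 4     # 4 bytes per float32 coordinate
graph_bytes = n * degree * 4   # 4 bytes per neighbor id
total_gb = (vector_bytes + graph_bytes) / 1e9
print(total_gb)  # 71.2 -- close to the ~75GB cited once index overheads are added
```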
Note that extending this to web scale data with hundreds of billions of points would require thousands of machines.

The scalability of both these classes of algorithms is limited by the fact that they construct indices meant to be served from main memory. Moving these indices to disks, even SSDs, would result in a catastrophic rise in search latency and a corresponding drop in throughput. The current wisdom that search requires main memory is reflected in the blog post by FAISS [11]: “Faiss supports searching only from RAM, as disk databases are orders of magnitude slower. Yes, even with SSDs.”

Indeed, the search throughput of an SSD-resident index is limited by the number of random disk accesses per query, and latency is limited by the number of round-trips (each round-trip can consist of multiple reads) to the disk. An inexpensive retail-grade SSD requires a few hundred microseconds to serve a random read and can service about 300K random reads per second. On the other hand, search applications (e.g. web search) with multi-stage pipelines require mean latencies of a few milliseconds for nearest neighbor search. Therefore, the main challenges in designing a performant SSD-resident index lie in reducing (a) the number of random SSD accesses to a few dozen, and (b) the number of round-trip requests to disk to under ten, preferably five.
Naively mapping indices generated by traditional in-memory ANNS algorithms to SSDs would generate several hundreds of disk reads per query, which would result in unacceptable latencies.

²The average degree of an NSG index can vary depending on the inherent structure of the dataset; here we assume a degree of 50, which is reasonable for datasets with little inherent structure.

1.1 Our technical contribution

We present DiskANN, an SSD-resident ANNS system based on our new graph-based indexing algorithm called Vamana, that debunks current wisdom and establishes that even commodity SSDs can effectively support large-scale ANNS. Some interesting aspects of our work are:

• DiskANN can index and serve a billion point dataset in 100s of dimensions on a workstation with 64GB RAM, providing 95%+ 1-recall@1 with latencies of under 5 milliseconds.
• A new algorithm called Vamana which can generate graph indices with smaller diameter than NSG and HNSW, allowing DiskANN to minimize the number of sequential disk reads.
• The graphs generated by Vamana can also be used in-memory, where their search performance matches or exceeds state-of-the-art in-memory algorithms such as HNSW and NSG.
• Smaller Vamana indices for overlapping partitions of a large dataset can be easily merged into one index that provides nearly the same search performance as a single-shot index constructed for the entire dataset. This allows indexing of datasets that are otherwise too large to fit in memory.
• We show that Vamana can be combined with off-the-shelf vector compression schemes such as product quantization to build the DiskANN system. The graph index, along with the full-precision vectors of the dataset, is stored on the disk, while compressed vectors are cached in memory.

1.2 Notation

For the remainder of the paper, we let P denote the dataset with |P| = n.
We consider directed graphs with vertices corresponding to points in P, and edges between them. With slight notation overload, we refer to such graphs as G = (P, E) by letting P also denote the vertex set. Given a point p ∈ P in a directed graph, we let Nout(p) denote the set of out-neighbors of p. Finally, we let xp denote the vector data corresponding to p, and let d(p, q) = ||xp − xq|| denote the metric distance between two points p and q. All experiments presented in this paper use the Euclidean metric.

1.3 Paper Outline

Section 2 presents Vamana, our new graph index construction algorithm, and Section 3 explains the overall system design of DiskANN. Section 4 presents an empirical comparison of Vamana with HNSW and NSG for in-memory indices, and also demonstrates the search characteristics of DiskANN for billion point datasets on a commodity machine.

2 The Vamana Graph Construction Algorithm

We begin with a brief overview of graph-based ANNS algorithms before presenting the details of Vamana, a specification of which is given in Algorithm 3.

2.1 Relative Neighborhood Graphs and the GreedySearch algorithm

Most graph-based ANNS algorithms work in the following manner: during index construction, they build a graph G = (P, E) based on the geometric properties of the dataset P. At search time, for a query vector xq, search employs a natural greedy or best-first traversal, such as in Algorithm 1, on G. Starting at some designated point s ∈ P, they traverse the graph to get progressively closer to xq. There has been much work on understanding how to construct sparse graphs for which GreedySearch(s, xq, k, L) converges quickly to the (approximate) nearest neighbors for any query. A sufficient condition for this to happen, at least when the queries are close to the dataset points, is the so-called sparse neighborhood graph (SNG), which was introduced in [5]³.
In an SNG, the out-neighbors of each point p are determined as follows: initialize a set S = P \ {p}. As long as S ≠ ∅, add a directed edge from p to p*, where p* is the closest point to p from S, and remove from S all points p' such that d(p, p') > d(p*, p'). It is then easy to see that GreedySearch(s, xp, 1, 1) starting at any s ∈ P would converge to p for all base points p ∈ P.

While this construction is ideal in principle, it is infeasible to construct such graphs for even moderately sized datasets, as the running time is Õ(n²). Building on this intuition, there have been a series of works that design more practical algorithms that generate good approximations of SNGs [21, 13]. However, since they all essentially try to approximate the SNG property, there is very little flexibility in controlling the diameter and the density of the graphs output by these algorithms.

³This notion itself was inspired by a related property known as the Relative Neighborhood Graph (RNG) property, first defined in the 1960s [16].

Algorithm 1: GreedySearch(s, xq, k, L)
Data: Graph G with start node s, query xq, result size k, search list size L ≥ k
Result: Result set L containing k-approx NNs, and a set V containing all the visited nodes
begin
    initialize sets L ← {s} and V ← ∅
    while L \ V ≠ ∅ do
        let p* ← arg min_{p ∈ L\V} ||xp − xq||
        update L ← L ∪ Nout(p*) and V ← V ∪ {p*}
        if |L| > L then
            update L to retain closest L points to xq
    return [closest k points from L; V]

Algorithm 2: RobustPrune(p, V, α, R)
Data: Graph G, point p ∈ P, candidate set V, distance threshold α ≥ 1, degree bound R
Result: G is modified by setting at most R new out-neighbors for p
begin
    V ← (V ∪ Nout(p)) \ {p}
    Nout(p) ← ∅
    while V ≠ ∅ do
        p* ← arg min_{p' ∈ V} d(p, p')
        Nout(p) ← Nout(p) ∪ {p*}
        if |Nout(p)| = R then
            break
        for p' ∈ V do
            if α · d(p*, p') ≤ d(p, p') then
                remove p' from V

2.2 The Robust Pruning Procedure

As mentioned earlier, graphs which satisfy the SNG property are all good candidates for the GreedySearch procedure. However, it is possible that the diameter of such graphs can be quite large. For example, if the points are linearly arranged on the real line in one dimension, the O(n)-diameter line graph, where each point connects to its two neighbors (one on each side), is the one that satisfies the SNG property. Searching such graphs stored on disk would incur many sequential reads to fetch the neighbors of the vertices visited on the search path in Algorithm 1.

To overcome this, we would like to ensure that the distance to the query decreases by a multiplicative factor of α > 1 at every node along the search path, instead of merely decreasing as in the SNG property. Consider the directed graph where the out-neighbors of every point p are determined by the RobustPrune(p, V, α, R) procedure in Algorithm 2. Note that if the out-neighbors of every p ∈ P are determined by RobustPrune(p, P \ {p}, α, n − 1), then GreedySearch(s, p, 1, 1), starting at any s, would converge to p ∈ P in logarithmically many steps if α > 1. However, this would result in a running time of Õ(n²) for index construction. Hence, building on the ideas of [21, 13], Vamana invokes RobustPrune(p, V, α, R) for a carefully selected V with far fewer than n − 1 nodes, to improve index construction time.

2.3 The Vamana Indexing Algorithm

Vamana constructs a directed graph G in an iterative manner. The graph G is initialized so that each vertex has R randomly chosen out-neighbors.
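The GreedySearch and RobustPrune procedures (Algorithms 1 and 2) can be sketched in Python roughly as follows. This is an illustrative in-memory sketch, not the authors' implementation: `points` is assumed to be a NumPy array of vectors, and `graph` a dict mapping each node id to its out-neighbor list.

```python
import numpy as np

def greedy_search(graph, points, s, xq, k, L):
    """Algorithm 1: best-first traversal from start node s toward query xq."""
    dist = lambda i: float(np.linalg.norm(points[i] - xq))
    search_list, visited = {s}, set()
    while search_list - visited:
        p_star = min(search_list - visited, key=dist)  # closest unvisited node
        visited.add(p_star)
        search_list |= set(graph[p_star])              # expand its neighborhood
        if len(search_list) > L:                       # retain L closest only
            search_list = set(sorted(search_list, key=dist)[:L])
    return sorted(search_list, key=dist)[:k], visited

def robust_prune(graph, points, p, cand, alpha, R):
    """Algorithm 2: pick at most R out-neighbors for p with alpha-pruning."""
    d = lambda a, b: float(np.linalg.norm(points[a] - points[b]))
    cand = (cand | set(graph[p])) - {p}
    graph[p] = []
    while cand:
        p_star = min(cand, key=lambda q: d(p, q))      # closest remaining candidate
        graph[p].append(p_star)
        if len(graph[p]) == R:
            break
        # keep only candidates that the new neighbor does not alpha-dominate
        cand = {q for q in cand if alpha * d(p_star, q) > d(p, q)}
    return graph
```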
Note that while the graph is well connected when R > log n, random connections do not ensure convergence of the GreedySearch algorithm to good results. Next, we let s denote the medoid of the dataset P, which will be the starting node for the search algorithm. The algorithm then iterates through all the points p ∈ P in a random order, and in each step, updates the graph to make it more suitable for GreedySearch(s, xp, 1, L) to converge to p. Indeed, in the iteration corresponding to point p, Vamana first runs GreedySearch(s, xp, 1, L) on the current graph G, and sets Vp to the set of all points visited by GreedySearch(s, xp, 1, L). Then, the algorithm updates G by running RobustPrune(p, Vp, α, R) to determine p's new out-neighbors. Then, Vamana updates the graph G by adding backward edges (p', p) for all p' ∈ Nout(p). This ensures that there are connections between the vertices visited on the search path and p, thereby ensuring that the updated graph will be better suited for GreedySearch(s, xp, 1, L) to converge to p.

However, adding backward edges of the form (p', p) might lead to a degree violation at p', and so whenever any vertex p' has an out-degree which exceeds the degree threshold R, the graph is modified by running RobustPrune(p', Nout(p'), α, R), where Nout(p') is the set of existing out-neighbors of p'. As the algorithm proceeds, the graph becomes consistently better and faster for GreedySearch. Our overall algorithm makes two passes over the dataset, the first pass with α = 1, and the second with a user-defined α ≥ 1.
We observed that a second pass results in better graphs, and that running both passes with the user-defined α makes the indexing algorithm slower, since the first pass would then compute a graph with higher average degree, which takes longer to process.

Algorithm 3: Vamana Indexing algorithm
Data: Database P with n points where the i-th point has coordinates xi, parameters α, L, R
Result: Directed graph G over P with out-degree ≤ R
begin
    initialize G to a random R-regular directed graph
    let s denote the medoid of dataset P
    let σ denote a random permutation of 1..n
    for 1 ≤ i ≤ n do
        let [L; V] ← GreedySearch(s, xσ(i), 1, L)
        run RobustPrune(σ(i), V, α, R) to update out-neighbors of σ(i)
        for all points j in Nout(σ(i)) do
            if |Nout(j) ∪ {σ(i)}| > R then
                run RobustPrune(j, Nout(j) ∪ {σ(i)}, α, R) to update out-neighbors of j
            else
                update Nout(j) ← Nout(j) ∪ {σ(i)}

Figure 1: Progression of the graph generated by the Vamana indexing algorithm described in Algorithm 3 on a database with 200 points in 2 dimensions. Notice that the algorithm goes through the first pass with α = 1, followed by the second pass where it introduces long range edges.

2.4 Comparison of Vamana with HNSW [21] and NSG [13]

At a high level, Vamana is rather similar to HNSW and NSG, two very popular ANNS algorithms. All three algorithms iterate over the dataset P, and use the results of GreedySearch(s, xp, 1, L) and RobustPrune(p, V, α, R) to determine p's neighbors. However, there are some important differences between these algorithms. Most crucially, both HNSW and NSG have no tunable parameter α and implicitly use α = 1. This is the main factor which lets Vamana achieve a better trade-off between graph degree and diameter.
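The effect of α is easy to see on the one-dimensional line-graph example from Section 2.2: pruning with α = 1 leaves each point with only its immediate neighbor, while α = 2 retains geometrically spaced long-range edges. Below is a toy illustration of Algorithm 2's pruning rule (simplified: no degree cap R):

```python
def prune_line(p, cand, alpha):
    """Apply Algorithm 2's pruning rule to points on the real line."""
    d = lambda a, b: abs(a - b)
    out = []
    while cand:
        p_star = min(cand, key=lambda q: d(p, q))  # closest remaining candidate
        out.append(p_star)
        # keep only candidates not alpha-dominated by the new neighbor
        cand = [q for q in cand if alpha * d(p_star, q) > d(p, q)]
    return out

# Points 0..9 on a line; prune the candidate list of point 0.
print(prune_line(0, list(range(1, 10)), alpha=1))  # [1]        -> O(n) diameter
print(prune_line(0, list(range(1, 10)), alpha=2))  # [1, 3, 7]  -> long-range edges
```

With α = 1 every point keeps only its nearest neighbor, so reaching a far point takes linearly many hops; with α = 2 the retained neighbors grow roughly geometrically, giving logarithmically many hops.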
Next, while HNSW sets the candidate set V for the pruning procedure to be the final result-set of L candidates output by GreedySearch(s, p, 1, L), Vamana and NSG let V be the entire set of vertices visited by GreedySearch(s, p, 1, L). Intuitively, this feature helps Vamana and NSG add long-range edges, while HNSW, by virtue of adding only local edges to nearby points, has an additional step of constructing a hierarchy of graphs over a nested sequence of samples of the dataset. The next difference pertains to the initial graph: while NSG sets the starting graph to be an approximate K-nearest neighbor graph over the dataset, which is a time and memory intensive step, HNSW and Vamana have simpler initializations, with the former beginning with an empty graph and Vamana beginning with a random graph. We have observed that starting with a random graph results in better quality graphs than beginning with the empty graph. Finally, Vamana makes two passes over the dataset, whereas both HNSW and NSG make only one pass; the two passes are motivated by our observation that the second pass improves the graph quality.

3 DiskANN: Constructing SSD-Resident Indices

We now present the overall design of DiskANN in two parts. In the first part, we explain the index construction algorithm, and in the second part, we explain the search algorithm.

3.1 The DiskANN Index Design

The high-level idea is simple: run Vamana on a dataset P and store the resulting graph on an SSD. At search time, whenever Algorithm 1 requires the out-neighbors of a point p, we simply fetch this information from the SSD. However, note that just storing the vector data for a billion points in 100 dimensions would far exceed the RAM on a workstation!
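The arithmetic behind that claim, as a quick sanity check (assuming 4-byte float coordinates):

```python
# Raw vector data for a billion points in 100 dimensions at float32 precision.
n, dim = 10**9, 100
gb = n * dim * 4 / 1e9
print(gb)  # 400.0 -- far more than the 64GB of RAM on the target workstation
```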
This raises two questions: how do we build a graph over a billion points, and how do we do distance comparisons between the query point and points in our candidate list at search time in Algorithm 1, if we cannot even store the vector data?

One way to address the first question would be to partition the data into multiple smaller shards using clustering techniques like k-means, build a separate index for each shard, and route the query only to a few shards at search time. However, such an approach would suffer from increased search latency and reduced throughput since the query needs to be routed to several shards.

Our idea is simple in hindsight: instead of routing the query to multiple shards at search time, what if we send each base point to multiple nearby centers to obtain overlapping clusters? Indeed, we first partition a billion point dataset into k clusters (with k = 40, say) using k-means, and then assign each base point to the ℓ closest centers (typically ℓ = 2 suffices). We then build Vamana indices for the points assigned to each of the clusters (each of which would now only have about Nℓ/k points and thus can be indexed in memory), and finally merge all the different graphs into a single graph by taking a simple union of edges. Empirically, it turns out that the overlapping nature of the different clusters provides sufficient connectivity for the GreedySearch algorithm to succeed even if the query's nearest neighbors are actually split between multiple shards. We would like to remark that there have been earlier works [9, 22] which construct indices for large datasets by merging several smaller, overlapping indices.
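The overlapping-cluster assignment just described can be sketched as follows. This is an illustrative sketch: the helper name is hypothetical, and for brevity a random sample of the points stands in for actual k-means centroids, which would come from any standard k-means implementation.

```python
import numpy as np

def assign_to_l_closest(points, centers, l=2):
    """Send each base point to its l closest centers, producing overlapping
    shards; each shard would then be indexed by Vamana in memory and the
    per-shard edge sets merged by a simple union."""
    shards = {c: [] for c in range(len(centers))}
    for i, x in enumerate(points):
        dists = np.linalg.norm(centers - x, axis=1)  # distance to every center
        for c in np.argsort(dists)[:l]:              # l nearest centers
            shards[int(c)].append(i)
    return shards

rng = np.random.default_rng(0)
points = rng.normal(size=(100, 8))
centers = points[rng.choice(100, size=5, replace=False)]  # stand-in centroids
shards = assign_to_l_closest(points, centers, l=2)
# Every point lands in exactly l shards, so shard sizes sum to n * l.
print(sum(len(s) for s in shards.values()))  # 200
```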
However, their ideas for constructing the overlapping clusters are different, and a more detailed comparison of these techniques remains to be done.

Our next and natural idea, addressing the second question, is to store a compressed vector x̃p for every database point p ∈ P in main memory, along with storing the graph on the SSD. We use a popular compression scheme known as Product Quantization [17]⁴, which encodes the data and query points into short codes (e.g., 32 bytes per data point) that can be used to efficiently obtain approximate distances d(x̃p, xq) at query time in Algorithm 1. We would like to remark that Vamana uses full-precision coordinates when building the graph index, and hence is able to efficiently guide the search towards the right region of the graph, even though we use only the compressed data at search time.

3.2 DiskANN Index Layout

We store the compressed vectors of all the data points in memory, and store the graph along with the full-precision vectors on the SSD. On the disk, for each point i, we store its full precision vector xi followed by the identities of its ≤ R neighbors. If the degree of a node is smaller than R, we pad with zeros, so that computing the offset within the disk of the data corresponding to any point i is a simple calculation, and does not require storing the offsets in memory. We will explain the need to store full-precision coordinates in the following section.

3.3 DiskANN Beam Search

A natural way to search for the neighbors of a given query xq would be to run Algorithm 1, fetching the neighborhood information Nout(p*) from the SSD as needed. Distance calculations to decide the best vertices (and neighborhoods) to read from disk can be done using the compressed vectors. While reasonable, this requires many round-trips to the SSD (each taking a few hundred microseconds), resulting in high latencies.
To reduce the number of round trips to the SSD (to fetch neighborhoods sequentially) without increasing compute (distance calculations) excessively, we fetch the neighborhoods of a small number, W (say 4 or 8), of the closest points in L \ V in one shot, and update L to be the top L candidates in L along with all the neighbors retrieved in this step. Note that fetching a small number of random sectors from an SSD takes almost the same time as fetching one sector. We refer to this modified search algorithm as BeamSearch. If W = 1, this search resembles normal greedy search. Note that if W is too large, say 16 or more, then both compute and SSD bandwidth could be wasted.

⁴Although more complex compression methods like [14, 19, 18] can deliver better quality approximations, we found simple product quantization sufficient for our purposes.

Figure 2: (a) 1-recall@1 vs latency on the SIFT bigann dataset. The R128 and R128/Merged series represent the one-shot and merged Vamana index constructions, respectively. (b) 1-recall@1 vs latency on the DEEP1B dataset. (c) Average number of hops vs maximum graph degree for achieving 98% 5-recall@5 on SIFT1M.

Figure 3: Latency (microseconds) vs recall plots comparing HNSW, NSG and Vamana.

Although NAND-flash based SSDs can serve 500K+ random reads per second, extracting maximum read throughput requires saturating all I/O request queues. However, operating at peak throughput with backlogged queues results in disk read latencies of over a millisecond. Therefore, it is necessary to operate the SSD at a lower load factor to obtain low search latency.
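A minimal in-memory sketch of the beam-search modification follows; it is illustrative only (in DiskANN the frontier expansion would be one batched SSD read of W neighborhoods, and the distances would use the in-memory compressed vectors, neither of which is modeled here):

```python
import numpy as np

def beam_search(graph, points, s, xq, k, L, W=4):
    """BeamSearch: expand the W closest unvisited candidates per round, so
    each round costs one batched disk access instead of W sequential ones."""
    dist = lambda i: float(np.linalg.norm(points[i] - xq))
    cand, visited, rounds = {s}, set(), 0
    while cand - visited:
        frontier = sorted(cand - visited, key=dist)[:W]  # W closest unvisited
        rounds += 1                       # one batched round-trip to the SSD
        for p in frontier:                # fetch the W neighborhoods "in one shot"
            visited.add(p)
            cand |= set(graph[p])
        cand = set(sorted(cand, key=dist)[:L])  # retain the top-L candidates
    return sorted(cand, key=dist)[:k], rounds
```

Setting W = 1 recovers a greedy search; the round counter makes it easy to see how a wider beam trades extra reads per round for fewer rounds on the critical path.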
We have found that operating at low beam widths (e.g., W = 2, 4, 8) strikes a good balance between latency and throughput. In this setting, the load factor on the SSD is between 30-40% and each thread running our search algorithm spends between 40-50% of the query processing time in I/O.

3.4 DiskANN Caching Frequently Visited Vertices

To further reduce the number of disk accesses per query, we cache the data associated with a subset of vertices in DRAM, either based on a known query distribution, or simply by caching all vertices that are C = 3 or 4 hops from the starting point s. Since the number of nodes in the index graph at distance C grows exponentially with C, larger values of C incur an excessively large memory footprint.

3.5 DiskANN Implicit Re-Ranking Using Full-Precision Vectors

Since Product Quantization is a lossy compression method, there is a discrepancy between the closest k candidates to the query computed using PQ-based approximate distances and those computed using the actual distances. To bridge this gap, we use the full-precision coordinates stored for each point next to its neighborhood on the disk. In fact, when we retrieve the neighborhood of a point during search, we also retrieve the full coordinates of the point without incurring extra disk reads. This is because reading a 4KB-aligned disk address into memory is no more expensive than reading 512B, and the neighborhood of a vertex (4 × 128 bytes for a degree-128 graph) and its full-precision coordinates can be stored on the same disk sector. Hence, as BeamSearch loads the neighborhoods of the search frontier, it can also cache the full-precision coordinates of all the nodes visited during the search process, using no extra reads to the SSD. This allows us to return the top k candidates based on the full-precision vectors.
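The fixed-size record layout of Section 3.2 makes the disk offset of any node a closed-form computation. A sketch with assumed parameters (128-dimensional float32 vectors and degree bound R = 128; the actual record sizes depend on the dataset's precision and the chosen R):

```python
# Per-node record: full-precision vector followed by R zero-padded neighbor ids.
dim, R = 128, 128
record = dim * 4 + R * 4          # 4-byte coordinates + 4-byte neighbor ids
offset = lambda i: i * record     # no in-memory offset table needed

print(record)  # 1024 -- vector and neighborhood fit comfortably in one 4KB read
print(offset(10**6))  # 1024000000
```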
Independent of our work, the idea of fetching and re-ranking full-precision coordinates stored on the SSD is also used in [24]; however, that algorithm fetches all the vectors to re-rank in one shot, which results in hundreds of simultaneous random disk accesses, in turn adversely affecting throughput and latency. We provide a more detailed explanation in Section 4.3. In our case, the full-precision coordinates essentially piggyback on the cost of expanding the neighborhoods.

4 Evaluation

We now compare Vamana with other relevant algorithms for approximate nearest neighbor search. First, for in-memory search, we compare our algorithm with NSG [13] and HNSW [21], which offer best-in-class latency vs recall on most public benchmark datasets. Next, for large billion point datasets, we compare DiskANN with compression based techniques such as FAISS [18] and IVFOADC+G+P [8].

We use the following two machines for all experiments.

• z840: a bare-metal mid-range workstation with dual Xeon E5-2620v4s (16 cores), 64GB DDR4 RAM, and two Samsung 960 EVO 1TB SSDs in RAID-0 configuration.
• M64-32ms: a virtual machine with dual Xeon E7-8890v3s (32 vCPUs) and 1792GB DDR3 RAM that we use to build a one-shot in-memory index for billion point datasets.

4.1 Comparison of HNSW, NSG and Vamana for In-Memory Search Performance

We compared Vamana with HNSW and NSG on three commonly used public benchmarks: SIFT1M (128 dimensions) and GIST1M (960 dimensions), both of which are million point datasets of image descriptors [1], and DEEP1M (96 dimensions), a random one million point sample of DEEP1B, a machine-learned set of one billion vectors [6]. For all three algorithms, we did a parameter sweep and selected a near-optimal choice of parameters for the best recall vs latency trade-off. All HNSW indices were constructed using M = 128, efC = 512, while Vamana indices used L = 125, R = 70, C = 3000, α = 2.
For NSG on SIFT1M and GIST1M, we use the parameters listed on their repository5 due to their excellent performance, and used R = 60, L = 70, C = 500 for DEEP1M. Moreover, since the main focus of this work is SSD-based search, we did not implement our own in-memory search algorithm to test Vamana. Instead, we simply ran the optimized search implementation from the NSG repository on the indices generated by Vamana. From Figure 3, we can see one clear trend: NSG and Vamana outperform HNSW in all instances, and on the 960-dimensional GIST1M dataset, Vamana outperforms both NSG and HNSW. Moreover, the indexing time of Vamana was better than both HNSW and NSG in all three experiments. For example, when indexing DEEP1M on z840, the total index construction times were 149s, 219s, and 480s for Vamana, HNSW and NSG6, respectively. From these experiments, we conclude that Vamana matches or outperforms the current best ANNS methods on both hundred- and thousand-dimensional datasets obtained from different sources.

4.2 Comparison of HNSW, NSG and Vamana for Number of Hops
Vamana is more suitable for SSD-based serving than other graph-based algorithms as it makes 2−3 times fewer hops for search to converge on large datasets compared to HNSW and NSG. By hops, we refer to the number of rounds of disk reads on the critical path of the search; in BeamSearch, this is the number of times the search frontier is expanded by making W parallel disk reads. The number of hops is important as it directly affects search latency. For HNSW, we assume nodes in all levels excluding the base level are cached in DRAM and only count the number of hops on the base-level graph. For NSG and Vamana indices, we assume that the first 3 BFS levels around the navigating node(s) can be cached in DRAM.
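To illustrate what is being counted, here is a simplified in-memory rendition of a beam search that tracks hops. It is a sketch, not the paper's implementation: the graph is an adjacency dict, `dist` is a caller-supplied distance function, and on an SSD each hop would correspond to `w` parallel 4KB reads.

```python
def beam_search_hops(graph, start, query, dist, w=4, l_size=64):
    """Greedy beam search over an adjacency dict; returns (candidates, hops).

    Each hop expands up to `w` frontier nodes at once. On disk those
    expansions are w parallel reads, so the hop count, not the total number
    of reads, sets the latency floor of a query.
    """
    visited = set()
    pool = [(dist(query, start), start)]  # candidate pool, best l_size kept
    hops = 0
    while True:
        # Closest unvisited candidates form the next frontier.
        frontier = [n for _, n in sorted(pool) if n not in visited][:w]
        if not frontier:
            break
        hops += 1                          # one round of w parallel reads
        for u in frontier:
            visited.add(u)
            for v in graph[u]:
                if all(v != n for _, n in pool):
                    pool.append((dist(query, v), v))
        pool = sorted(pool)[:l_size]
    return [n for _, n in sorted(pool)], hops
```

On a path graph the hop count grows with the path length, which is why denser graphs with long-range edges (larger max degree, α > 1) converge in fewer hops.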
We compare the number of hops required to achieve a target 5-recall@5 of 98% while varying the maximum graph degree in Figure 2(c), using the BeamSearch algorithm with a beamwidth of W = 4 for all three algorithms. We notice a stagnation trend for both HNSW and NSG, while Vamana shows a reduction in the number of hops with increasing max degree, due to its ability to add more long-range edges. We thus infer that Vamana with α > 1 makes better use of the high capacity offered by SSDs than HNSW and NSG.

4.3 Comparison on Billion-Scale Datasets: One-Shot Vamana vs Merged Vamana
For our next set of experiments, we focus our evaluations on the 10^9-point ANN_SIFT1B [1] bigann dataset of SIFT image descriptors of 128 uint8s. To demonstrate the effectiveness of the merged-Vamana scheme described in Section 3, we built two indices using Vamana. The first is a single index with L = 125, R = 128, α = 2 on the full billion-point dataset. This procedure takes about 2 days on M64-32ms with peak memory usage of ≈1100GB, and generates an index with an average degree of 113.9. The second is the merged index, constructed as follows: (1) partition the dataset into k = 40 shards using k-means clustering, (2) send each point in the dataset to the ℓ = 2 closest shards, (3) build an index for each shard with L = 125, R = 64, α = 2, and (4) merge the edge sets of all the graphs. The result is a 348GB index with an average degree of 92.1. These indices were built on z840 and took about 5 days, with memory usage remaining under 64GB for the entire process.
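The four build steps above can be sketched as follows. This is a toy Python illustration under stated assumptions: `build_shard` stands in for the real Vamana shard builder (it takes the point set and a list of member ids and returns an adjacency map), and a few Lloyd iterations stand in for a production k-means.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merged_vamana_sketch(points, k_shards, ell, build_shard):
    """Sketch of the merged-index build: cluster, overlap, build, union."""
    # (1) k-means the dataset into k_shards clusters (a few Lloyd rounds).
    rng = random.Random(0)
    centroids = [points[i] for i in rng.sample(range(len(points)), k_shards)]
    for _ in range(5):
        buckets = [[] for _ in range(k_shards)]
        for p in points:
            buckets[min(range(k_shards), key=lambda s: dist(p, centroids[s]))].append(p)
        for c, b in enumerate(buckets):
            if b:
                centroids[c] = [sum(x) / len(b) for x in zip(*b)]
    # (2) send each point to its ell closest shards, so shards overlap.
    shards = [[] for _ in range(k_shards)]
    for i, p in enumerate(points):
        for s in sorted(range(k_shards), key=lambda s: dist(p, centroids[s]))[:ell]:
            shards[s].append(i)
    # (3)+(4) build a graph per shard and merge by unioning the edge sets.
    merged = {i: set() for i in range(len(points))}
    for members in shards:
        for u, nbrs in build_shard(points, members).items():
            merged[u] |= nbrs
    return merged
```

Because each point belongs to ℓ shards, the merged graph contains edges that cross shard boundaries, which is what lets a single search traverse the whole dataset.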
Partitioning the dataset and merging the graphs are fast and can be done directly from the disk; hence, the entire build process consumes under 64GB of main memory.
We compare 1-recall@1 vs latency on the 10,000-query bigann dataset for both configurations in Figure 2(a) by running the search using 16 threads (each query is processed on a single thread). From this experiment we conclude the following. (a) The single index outperforms the merged index, which traverses more links to reach the same neighborhood, thus increasing search latency. This could possibly be because the in- and out-edges of each node in the merged index are limited to about ℓ/k = 5% of all points. (b) The merged index is still a very good choice for billion-scale k-ANN indexing and serving on a single node, easily outperforming the existing state-of-the-art methods and requiring no more than 20% extra latency for a target recall when compared to the single index. The single index, on the other hand, achieves a new state-of-the-art 1-recall@1 of 98.68% with <5 milliseconds latency. The merged index is also a good choice for the DEEP1B dataset. Figure 2(b) shows the recall vs latency curve of the merged DiskANN index for the DEEP1B dataset built using k = 40 shards and ℓ = 2 on the z840 machine, with search running on 16 threads.

5https://github.com/ZJULearning/nsg
6Since NSG needs a starting k-nearest neighbor graph, we also include the time taken by EFANNA [12].

4.4 Comparison on Billion-Scale Datasets: DiskANN vs IVF-based Methods
Our final comparisons are with FAISS [18] and IVFOADC+G+P [7], two recent approaches to constructing billion point indices on a single node. Both methods utilize inverted indexing and Product Quantization-based compression schemes to develop indices with a low memory footprint that can serve queries with high throughput and good 1-recall@100.
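To see why distances computed from compressed vectors are only approximate, consider a toy product quantizer. The layout below is a deliberately simplified stand-in (one codebook per subspace, plain squared-error assignment), not the OPQ variants these systems actually use.

```python
import math

def encode(vec, codebooks):
    """Quantize each subvector to the id of its nearest codeword."""
    m = len(codebooks)                  # number of subspaces
    d_sub = len(vec) // m
    codes = []
    for j in range(m):
        sub = vec[j * d_sub:(j + 1) * d_sub]
        codes.append(min(range(len(codebooks[j])),
                         key=lambda i: sum((a - b) ** 2
                                           for a, b in zip(sub, codebooks[j][i]))))
    return codes

def pq_distance(query, code, codebooks):
    """Distance from `query` to a point stored only as PQ codes.

    The database vector is reconstructed from one codeword per subspace, so
    the result is approximate: this precision loss is what caps the recall of
    compression-based methods.
    """
    m = len(codebooks)
    d_sub = len(query) // m
    total = 0.0
    for j in range(m):
        q_sub = query[j * d_sub:(j + 1) * d_sub]
        c_sub = codebooks[j][code[j]]   # reconstructed subvector
        total += sum((a - b) ** 2 for a, b in zip(q_sub, c_sub))
    return math.sqrt(total)
```

Note that even the distance from a point to its own code is nonzero unless the point coincides with a codeword, which is exactly the error that re-ranking with full-precision vectors removes.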
We compare DiskANN with only IVFOADC+G+P since [7] demonstrates superior recall for IVFOADC+G+P over FAISS, and moreover, billion-scale indexing using FAISS requires GPUs that might not be available on some platforms. IVFOADC+G+P uses HNSW as a routing layer to obtain a small set of clusters that are further refined using a novel grouping and pruning strategy. Using their open-source code, we build indices with 16- and 32-byte OPQ codebooks on the SIFT1B base set. The IVFOADC+G+P-16 and IVFOADC+G+P-32 curves in Figure 2(a) represent the two configurations. While IVFOADC+G+P-16 plateaus at a 1-recall@1 of 37.04%, the larger IVFOADC+G+P-32 indices reach a 1-recall@1 of 62.74%. With the same memory footprint as IVFOADC+G+P-32, DiskANN saturates at a perfect 1-recall@1 of 100%, while providing 1-recall@1 above 95% in under 3.5ms. Thus DiskANN, while matching the memory footprint of compression-based methods, achieves significantly higher recall at the same latency. Compression-based methods provide low recall due to the loss of precision from lossy compression of coordinates, which results in slightly inaccurate distance calculations.
Zoom [24] is a compression-based method, similar to IVFOADC+G+P, that identifies the approximate nearest K′ > K candidates using the compressed vectors, and re-ranks them by fetching the full-precision coordinates from the disk to output the final set of K candidates. However, Zoom suffers from two drawbacks: (a) it fetches all the K′ (often close to a hundred even when K = 1) full-precision vectors using simultaneous random disk reads, which hurts latency and throughput, and (b) it requires expensive k-means clustering with hundreds of thousands of centroids to build the HNSW-based routing layer.
For example, the clustering step described in [24] uses 200K centroids on a 10M base set, and might not scale easily to billion-point datasets.
5 Conclusion
We presented and evaluated a new graph-based indexing algorithm called Vamana for ANNS, whose indices are comparable to the current state-of-the-art methods for in-memory search in high-recall regimes. In addition, we demonstrated the construction of a high-quality SSD-resident index, DiskANN, on a billion point dataset using only 64GB of main memory. We detailed and motivated the algorithmic improvements that enabled us to serve these indices using inexpensive retail-grade SSDs with latencies of a few milliseconds. By combining the high-recall, low-latency properties of graph-based methods with the memory efficiency and scalability of compression-based methods, we established a new state-of-the-art for indexing and serving billion point datasets.

Acknowledgements. We would like to thank Nived Rajaraman and Gopal Srinivasa for several useful discussions during the course of this work.

References
[1] Laurent Amsaleg and Hervé Jegou. Datasets for approximate nearest neighbor search. http://corpus-texmex.irisa.fr/, 2010. [Online; accessed 20-May-2018].

[2] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, January 2008. ISSN 0001-0782. doi: 10.1145/1327452.1327494. URL http://doi.acm.org/10.1145/1327452.1327494.

[3] Alexandr Andoni and Ilya Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 793–801, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3536-2. doi: 10.1145/2746539.2746553. URL http://doi.acm.org/10.1145/2746539.2746553.

[4] Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt.
Practical and optimal LSH for angular distance. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS '15, pages 1225–1233, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969239.2969376.

[5] Sunil Arya and David M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '93, pages 271–280, Philadelphia, PA, USA, 1993. Society for Industrial and Applied Mathematics. ISBN 0-89871-313-7. URL http://dl.acm.org/citation.cfm?id=313559.313768.

[6] Artem Babenko and Victor S. Lempitsky. Efficient indexing of billion-scale datasets of deep descriptors. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2055–2063, 2016. doi: 10.1109/CVPR.2016.226. URL https://doi.org/10.1109/CVPR.2016.226.

[7] Dmitry Baranchuk, Artem Babenko, and Yury Malkov. Revisiting the inverted indices for billion-scale approximate nearest neighbors. CoRR, abs/1802.02422, 2018. URL http://arxiv.org/abs/1802.02422.

[8] Dmitry Baranchuk, Artem Babenko, and Yury Malkov. Revisiting the inverted indices for billion-scale approximate nearest neighbors. In The European Conference on Computer Vision (ECCV), September 2018.

[9] Jon Louis Bentley. Multidimensional divide-and-conquer. Commun. ACM, 23(4):214–229, April 1980. ISSN 0001-0782. doi: 10.1145/358841.358850. URL http://doi.acm.org/10.1145/358841.358850.

[10] Kenneth L. Clarkson. An algorithm for approximate closest-point queries. In Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG '94, pages 160–164, New York, NY, USA, 1994. ACM. ISBN 0-89791-648-4. doi: 10.1145/177424.177609.
URL http://doi.acm.org/10.1145/177424.177609.

[11] Matthijs Douze, Jeff Johnson, and Hervé Jegou. Faiss: A library for efficient similarity search. https://code.fb.com/data-infrastructure/faiss-a-library-for-efficient-similarity-search/, 2017. [Online; accessed 29-March-2017].

[12] Cong Fu and Deng Cai. EFANNA. URL https://github.com/ZJULearning/efanna.

[13] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. Fast approximate nearest neighbor search with the navigating spreading-out graphs. PVLDB, 12(5):461–474, 2019. doi: 10.14778/3303753.3303754. URL http://www.vldb.org/pvldb/vol12/p461-fu.pdf.

[14] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, 2014. doi: 10.1109/TPAMI.2013.240. URL https://doi.org/10.1109/TPAMI.2013.240.

[15] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC '98, pages 604–613, New York, NY, USA, 1998. ACM. ISBN 0-89791-962-9. doi: 10.1145/276698.276876. URL http://doi.acm.org/10.1145/276698.276876.

[16] Jerzy W. Jaromczyk and Godfried T. Toussaint. Relative neighborhood graphs and their relatives. 1992.

[17] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, January 2011. doi: 10.1109/TPAMI.2010.57. URL https://hal.inria.fr/inria-00514462.

[18] Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734, 2017.

[19] Yannis Kalantidis and Yannis Avrithis. Locally optimized product quantization for approximate nearest neighbor search.
In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2329–2336, 2014. doi: 10.1109/CVPR.2014.298. URL https://doi.org/10.1109/CVPR.2014.298.

[20] W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin. Approximate nearest neighbor search on high dimensional data - experiments, analyses, and improvement. IEEE Transactions on Knowledge and Data Engineering, pages 1–1, 2019. doi: 10.1109/TKDE.2019.2909204.

[21] Yury A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. CoRR, abs/1603.09320, 2016. URL http://arxiv.org/abs/1603.09320.

[22] J. Wang, J. Wang, G. Zeng, Z. Tu, R. Gan, and S. Li. Scalable k-nn graph construction for visual descriptors. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1106–1113, June 2012. doi: 10.1109/CVPR.2012.6247790.

[23] Roger Weber, Hans-Jörg Schek, and Stephen Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc. ISBN 1-55860-566-5. URL http://dl.acm.org/citation.cfm?id=645924.671192.

[24] Minjia Zhang and Yuxiong He. Zoom: SSD-based vector search for optimizing accuracy, latency and memory. CoRR, abs/1809.04067, 2018.
URL http://arxiv.org/abs/1809.04067.