Paper ID: | 7667 |
---|---|

Title: | DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node |

The paper under review builds upon prior work on graph-based approximate kNN search, and improves upon by (a) introducing random graph initialization that speeds up the graph construction procedure (b) replacing a hard RNG constraint w/ a soft \alpha-RNG constraint, which reduces the number of hops required by the algorithm (c) expanding the BeamSearch procedure to work with SSD storage, by selecting k nodes to expand at a time, which addresses the I/O bottleneck. Overall, while these all can be thought of as incremental contributions, the resulting algorithm outperforms the relevant baselines. It will no doubt be of interest to practitioners in the field. This type of work, however, could significantly benefit from open-sourcing the code to facilitate adoption. The paper is well written, the techniques are well explained and the connection to prior work is clear, even for non-expert reader. The experimentation is thorough and convincing.

This paper studies the problem of process approximate nearest neighbor search with memory and SSD, and propose a graph-based index and search algorithm named Rand-NSG. It can hold a billion points searching on a normal workstation with a cheap SSD. However, there are some concerns: (1). The point of novelty is not strong enough, because the sub-ideas are all existing techniques in other existing researches (e.g. beam search, product quantization), and making previous NSG method SSD-friendly is an incremental work. (2). In Figure 3, it is not enough to only see the distance comparisons between different methods. The runtime of comparison is not the only factor affecting performance. Other overheads should be considered as well. The experiment of "Recall-Queries per second (1/s) tradeoff" should be added, like in https://github.com/erikbern/ann-benchmarks/. Other minor typos: (1). page 2 line 80: "the number the number of random" -> "the number of random"

A very nice contribution! The writing could be improved, but it's in general understandable. However, citation quality can be improved. In particular, it seems to me that NSG and HNSW are actually using the same pruning rule (which results in approximate relative neighborhood graph). I really like your updated version, which reduces the number hops (and I haven't seen this pruning variant before)! Detailed comments: Abstract and further: base points sounds like a strange term, do you mean domain points? 22-23 Why 100 dimensions? Not 200-300? 27 You talk about the Euclidean search, but cite a paper on the inner-product search. Please, find a more specific-generic citation that describes this phenomena. Off-the-top of my head *POSSIBLY* more appropriate references: i. Weber, R., Schek, H.J. and Blott, S., 1998, August. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB (Vol. 98, pp. 194-205). ii. Beyer, Kevin, et al. "When is “nearest neighbor” meaningful?." International conference on database theory. Springer, Berlin, Heidelberg, 1999. 29 Seems a bit awkward. First, it's IMHO better to first define recall. Second, I am not sure we want to maximize recall, as the maximum recall is 100% and as you mention it's not a realistic target. What we really want to maximize efficiency at a given recall level (likely somewhat close to 100%, albeit the desired level of recall might be application dependent). 37 This is a controversial claim! See, e.g., results of ann-benchmarks. There are bunch of methods whose performance is roughly the same. See the latest comparison showing that graphs based on NN-descent algorithms work really well too (about the same performance as HNSW): Li, Wen, et al. "Approximate nearest neighbor search on high dimensional data-experiments, analyses, and improvement." IEEE Transactions on Knowledge and Data Engineering(2019). Crucially: are these methods all that different? HNSW and NSG are using basically the same graph pruning heuristic (approximation to the relative neighborhood graph). It seems that you fail to disclose this. 39 What is exactly a navigable graph? 40 IMHO, the procedure is not a *GREEDY* walk. It's a best-first graph traversal with a buffer priority queue that keeps a buffer list of candidates. 53-54 awkward writing 127. the main idea in [12]. Actually, this idea has been used in a bunch of other papers, notably in HNSW and its predecessor NSW. 154-156 Although implementationally different, your algorithm is conceptually similar to the NN-descent implemented in Wei Dong's k-graph. https://github.com/aaalgo/kgraph 196. The merging algorithm isn't quite new please cite the relevant paper: i. I known this approach from Scalable k-nn graph construction for visual descriptors J Wang, J Wang, G Zeng, Z Tu… - 2012 IEEE Conference …, 2012 - ieeexplore.ieee.org ii. However, Wang et al. learned it from J. L. Bentley. Multidimensional divide-and-conquer. Commun. ACM, 23(4):214–229, 1980. 1, 2 249 Which public benchmarks?