Paper ID: | 9251 |
---|---|

Title: | Space and Time Efficient Kernel Density Estimation in High Dimensions |

This is overall an interesting direction but the impact of the proposed scheme is somewhat underdeveloped given that the gains are somewhat limited and most of the theoretical results build directly off of existing work. - While the proposed scheme does improve the storage requirements of HBE, it is not clear from empirical evaluation as to how much savings the proposed scheme provides relative to the data set size. The data set size and the O(1/\epsilon^2) factor appears to be the dominating storage requirement here (either the full or the subsampled set); it is not clear from the empirical results how significant are the O(1 / \sqrt{\tau}) savings here. - The example for the proposed scheme on page 5 (lines 150-153), the resulting KDE estimate is only accurate if L = 1/\sqrt{\tau} instead of being O(1 / \sqrt{\tau}), which requires use to know \tau beforehand which might be problematic. The intuition behind this example is very useful but maybe this explanation can be further clarified to address the aforementioned discrepancy. - The theoretical and empirical results demonstrate the ability of the proposed scheme to match the average performance of the existing HBE scheme. However, as mentioned in the manuscript, the variance of the estimates from the proposed scheme would be adversely affected. But this loss in variance is not empirically studied. ===== Post response ----- I thank the authors for the response. Based on the response and the discussion, I am motivated to raise the score.

Originality: This is a clever variant of the original HBE by Charikar and Siminelakis. Though it is simple and natural in hindsight, the fact is that nobody realised this in the last two years even though [CS17] appeared in a top venue. Quality: Proofs (in supplementary material) appear solid. The paper is generally very well written, but comparison of space usage for different methods is sloppy, making the improvement over [CS17] seem larger than it is. Many bounds stated seem to assume d=O(1). Table 1 is misleading. The issue is that it is only necessary to store each point once, and then use pointers to refer to it. Thus, the space usage of (a trivial modification of) the [CS17] HBE is d/τ · 1/ε^2 + 1/τ^{3/2} · 1/ε^2. The new bound is only better when d < τ^{1/2}. This should be made clear. Clarity: The paper is clearly written. I think it would be possible to reproduce all results. Significance: I believe this is a significant improvement over [CS17] as well as over the variant of in the recent ICML paper [SRB+19]. Since the problem is of wide importance, I think this will be of interest to many NeurIPS participants.

This paper is relevant since hashing is important for high-dimensional data, and the proposed method is a simple variant of HBE. The idea of introducing sparseness in estimation is a popular technique in many problems, so the methodology itself is not new. But, adopting it to this setting is a nice attempt, and the space saving should be valuable in practice. The theoretical part is not that strong, perhaps because the idea is simple. Two questions: 1) why is the sampling probability chosen as \delta=1/(n\sqrt\tau)? I didn’t find a clear explanation in the paper. 2) In the theorems and lemmas, by ‘’there exists a data structure…’’ you just refer to the proposed one, right? Post Rebuttal: After reading the rebuttal and re-reading the paper, I am willing to raise the score from 5 to 6.

Overall the paper is an average paper but clearly written. This paper proposes an improvement of Charikar's approach to achieve sublinear kernel density estimation with linear space and linear time preprocessing. Experimental results focus mainly on Laplacian (L1 variant in the main submission and L2 variant added in supplement). The key observation for achieving linear space is to modify the previous HBE approach so that each hash table stores each point in the dataset with constant probability - in this way, the superlinear storage cost is overcome. However, my main complaint is in the experimental results. First of all, the authors need to clarify what is meant by "typical kernel value" - I understand the median nearest neighbor distance is estimated and then subsequently scaled to obtain the bandwidth value used for the kernel. By "typical", does it mean 50 % of the dataset (or any other threshold)? What about the distribution of the raw kernel values themselves and the skewness/kurtosis? From the plots, it is impossible to judge how these factors are accounted to achieve scalability. I think it would be better for the authors to present a more complete picture by showing the scalability as the bandwidth of the kernel is varied. In summary: Originality: builds upon the previous hashing-based approach by Charikar et al., though the modification is trivial. Quality: average - I would like to have seen more experimental results. Clarity: The paper is clearly written - though it may be good idea for the authors to correct minor punctuation errors and run-on sentences. Significance: space improvement over the previous approach is novel but the experimental results need to be supplemented for demonstrating a stronger message.