Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track
Clément L Canonne, Ilias Diakonikolas, Daniel Kane, Sihan Liu
We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, k-histograms over [n], are probability distributions that are piecewise constant over a set of k intervals. Given samples from an unknown distribution p on [n], we want to distinguish between the cases that p is a k-histogram versus far from any k-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly-matching (within logarithmic factors) sample complexity lower bound, showing that the testing problem has sample complexity ˜Θ(√nk/ϵ+k/ϵ2+√n/ϵ2).