Part of Advances in Neural Information Processing Systems 28 (NIPS 2015)
Bhaswar Bhattacharya, Gregory Valiant
We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter $\varepsilon > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ with discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $\|p-q\|_1 \geq \varepsilon$. If $p$ and $q$ are supported on at most $n$ elements, then our test succeeds with high probability provided $m_1 \geq n^{2/3}/\varepsilon^{4/3}$ and $m_2 = \Omega\left(\max\left\{\frac{n}{\sqrt{m_1}\,\varepsilon^2}, \frac{\sqrt{n}}{\varepsilon^2}\right\}\right)$. We show that this tradeoff is information-theoretically optimal throughout this range, in the dependencies on all parameters, $n$, $m_1$, and $\varepsilon$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states, up to a $\log n$ factor, that uses $\tilde{O}(n^{3/2} \tau_{\mathrm{mix}})$ queries to a ``next node'' oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.
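To convey the flavor of such a statistic, the following is a minimal Python sketch, assuming a rescaled variant of the equal-sample closeness statistic of Chan, Diakonikolas, Valiant, and Valiant; the function name `closeness_statistic`, the rejection threshold, and the exact normalization here are illustrative assumptions, not the paper's precise test.

```python
import numpy as np

def closeness_statistic(x, y, m1, m2):
    """Unequal-sample closeness statistic (hedged sketch).

    x[i] = number of occurrences of symbol i among the m1 draws from p,
    y[i] = number of occurrences of symbol i among the m2 draws from q.

    The numerator rescales the two count vectors so the samples are
    comparable, then subtracts a term that makes the statistic have
    (approximately) zero expectation when p = q; the (x + y) denominator
    damps the contribution of heavy symbols.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = (x + y) > 0  # symbols observed in neither sample contribute nothing
    x, y = x[mask], y[mask]
    num = (m2 * x - m1 * y) ** 2 - (m2 ** 2 * x + m1 ** 2 * y)
    return np.sum(num / (x + y))

# Example: both samples drawn from the same uniform distribution,
# so the statistic should concentrate near zero.
rng = np.random.default_rng(0)
n, m1, m2 = 1000, 2000, 500
p = np.full(n, 1.0 / n)
sample1 = rng.multinomial(m1, p)
sample2 = rng.multinomial(m2, p)
print(closeness_statistic(sample1, sample2, m1, m2))
```

A test built on this statistic would compare its value to a threshold: under $p=q$ it concentrates near zero, while under $\|p-q\|_1 \geq \varepsilon$ it is large with high probability, for the sample sizes stated above.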