A Fast, Consistent Kernel Two-Sample Test

Part of Advances in Neural Information Processing Systems 22 (NIPS 2009)


Authors

Arthur Gretton, Kenji Fukumizu, Zaïd Harchaoui, Bharath K. Sriperumbudur

Abstract

A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P=Q) an infinite weighted sum of $\chi^2$ random variables. The main result of the present work is a novel, consistent estimate of this null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q. This estimate may be computed faster than a previous consistent estimate based on the bootstrap. Another prior approach was to compute the null distribution by fitting a parametric family to the low-order moments of the test statistic: unlike the present work, this heuristic has no guarantee of being accurate or consistent. We verify the performance of our null distribution estimate on both an artificial example and high-dimensional multivariate data.
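
The idea of estimating the null distribution from the Gram matrix eigenspectrum can be illustrated with a minimal NumPy sketch. This is not the authors' code. It assumes equal sample sizes m, a Gaussian kernel with a hypothetical `bandwidth` parameter, the biased squared-distance statistic scaled by m, and eigenvalues of the centered aggregate Gram matrix divided by 2m as weights on squared N(0, 2) draws. That scaling is our reading of the method and should be checked against the paper. The names `gaussian_gram` and `mmd_spectral_test` are made up for this example.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (np.sum(a**2, axis=1)[:, None]
          + np.sum(b**2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def mmd_spectral_test(x, y, bandwidth=1.0, n_null=2000, alpha=0.05, seed=0):
    """Two-sample test with a null estimate from the Gram eigenspectrum.

    Assumes x and y have the same number of rows m (an assumption of
    this sketch). Returns (statistic, threshold, p_value).
    """
    m = x.shape[0]
    if y.shape[0] != m:
        raise ValueError("this sketch assumes equal sample sizes")

    # Gram matrix on the aggregate sample from P and Q.
    z = np.vstack([x, y])
    k = gaussian_gram(z, z, bandwidth)

    # Biased squared embedding distance, scaled by m.
    stat = m * (k[:m, :m].mean() + k[m:, m:].mean() - 2.0 * k[:m, m:].mean())

    # Eigenvalues of the centered Gram matrix, scaled by 1 / (2m).
    n = 2 * m
    h = np.eye(n) - np.ones((n, n)) / n
    eigvals = np.linalg.eigvalsh(h @ k @ h)
    weights = np.clip(eigvals, 0.0, None) / n

    # Draw from the weighted sum of squared N(0, 2) variables.
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, np.sqrt(2.0), size=(n_null, weights.size))
    null_samples = (draws**2) @ weights

    threshold = np.quantile(null_samples, 1.0 - alpha)
    p_value = float(np.mean(null_samples >= stat))
    return stat, threshold, p_value
```

The null hypothesis P=Q would be rejected at level `alpha` when the returned statistic exceeds the returned threshold. The single eigendecomposition replaces the repeated recomputation of the statistic that a bootstrap or permutation estimate would need, which is the source of the speed advantage the abstract mentions.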