{"title": "Fast Two-Sample Testing with Analytic Representations of Probability Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 1981, "page_last": 1989, "abstract": "We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both  based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than  competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.", "full_text": "Fast Two-Sample Testing with Analytic\nRepresentations of Probability Measures\n\nKacper Chwialkowski\n\nGatsby Computational Neuroscience Unit, UCL\n\nkacper.chwialkowski@gmail.com\n\nAaditya Ramdas\n\nDept. 
of EECS and Statistics, UC Berkeley\n\naramdas@cs.berkeley.edu\n\nDino Sejdinovic\n\nDept. of Statistics, University of Oxford\n\ndino.sejdinovic@gmail.com\n\nArthur Gretton\n\nGatsby Computational Neuroscience Unit, UCL\n\narthur.gretton@gmail.com\n\nAbstract\n\nWe propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.\n\n1 Introduction\n\nTesting whether two random variables are identically distributed without imposing any parametric assumptions on their distributions is important in a variety of scientific applications. These include data integration in bioinformatics [6], benchmarking for steganography [20] and automated model checking [19]. 
Such problems are addressed in the statistics literature via two-sample tests (also known as homogeneity tests).\nTraditional approaches to two-sample testing are based on distances between representations of the distributions, such as density functions, cumulative distribution functions, characteristic functions or mean embeddings in a reproducing kernel Hilbert space (RKHS) [27, 26]. These representations are infinite dimensional objects, which poses challenges when defining a distance between distributions. Examples of such distances include the classical Kolmogorov-Smirnov distance (sup-norm between cumulative distribution functions); the Maximum Mean Discrepancy (MMD) [9], an RKHS norm of the difference between mean embeddings; and the N-distance (also known as energy distance) [34, 31, 4], which is an MMD-based test for a particular family of kernels [25]. Tests may also be based on quantities other than distances, an example being the Kernel Fisher Discriminant (KFD) [12], the estimation of which still requires calculating the RKHS norm of a difference of mean embeddings, with normalization by an inverse covariance operator.\nIn contrast to consistent two-sample tests, heuristics based on pseudo-distances, such as the difference between characteristic functions evaluated at a single frequency, have been studied in the context of goodness-of-fit tests [13, 14]. It was shown that the power of such tests can be maximized against fully specified alternative hypotheses, where test power is the probability of correctly rejecting the null hypothesis that the distributions are the same. In other words, if the class of distributions being distinguished is known in advance, then the tests can focus only on those particular frequencies where the characteristic functions differ most. 
This approach was generalized to evaluating the empirical characteristic functions at multiple distinct frequencies by [8], thus improving on tests that need to know the single \u201cbest\u201d frequency in advance (the cost remains linear in the sample size, albeit with a larger constant). This approach still fails to solve the consistency problem, however: two distinct characteristic functions can agree on an interval, and if the tested frequencies fall in that interval, the distributions will be indistinguishable.\nIn Section 2 of the present work, we introduce two novel distances between distributions, which both use a parsimonious representation of the probability measures. The first distance builds on the notion of differences in characteristic functions with the introduction of smooth characteristic functions, which can be thought of as the analytic analogues of the characteristic functions. A distance between smooth characteristic functions evaluated at a single random frequency is almost surely a distance (Definition 1 formalizes this concept) between these two distributions. In other words, there is no need to calculate the whole infinite dimensional representation: it is almost surely sufficient to evaluate it at a single random frequency (although checking more frequencies will generally result in more powerful tests). The second distance is based on analytic mean embeddings of two distributions in a characteristic RKHS; again, it is sufficient to evaluate the distance between mean embeddings at a single randomly chosen point to obtain almost surely a distance. To our knowledge, this representation is the first mapping of the space of probability measures into a finite dimensional Euclidean space (in the simplest case, the real line) that is almost surely an injection, and as a result almost surely a metrization. 
This metrization is very appealing from a computational viewpoint, since the statistics based on it have linear time complexity (in the number of samples) and constant memory requirements.\nWe construct statistical tests in Section 3, based on empirical estimates of differences in the analytic representations of the two distributions. Our tests have a number of theoretical and computational advantages over previous approaches. The test based on differences between analytic mean embeddings is a.s. consistent for all distributions, and the test based on differences between smoothed characteristic functions is a.s. consistent for all distributions with integrable characteristic functions (contrast with [8], which is only consistent under much more onerous conditions, as discussed above). This same weakness was used by [1] in justifying a test that integrates over the entire frequency domain (albeit at cost quadratic in the sample size), for which the quadratic-time MMD is a generalization [9]. Compared with such quadratic time tests, our tests can be conducted in linear time; hence, we expect their power/computation tradeoff to be superior.\nWe provide several experimental benchmarks (Section 4) for our tests. First, we compare test power as a function of computation time for two real-life testing settings: amplitude modulated audio samples, and the Higgs dataset, which are both challenging multivariate testing problems. Our tests give a better power/computation tradeoff than the characteristic function-based tests of [8], the previous sub-quadratic-time MMD tests [11, 32], and the quadratic-time MMD test. In terms of power when unlimited computation time is available, we might expect worse performance for the new tests, in line with findings for linear- and sub-quadratic-time MMD-based tests [15, 9, 11, 32]. 
Remarkably, such a loss of power is not the rule: for instance, when distinguishing signatures of the Higgs boson from background noise [3] (\u2018Higgs dataset\u2019), we observe that a test based on differences in smoothed empirical characteristic functions outperforms the quadratic-time MMD. This is in contrast to linear- and sub-quadratic-time MMD-based tests, which by construction are less powerful than the quadratic-time MMD. Next, for challenging artificial data (both high-dimensional distributions, and distributions for which the difference is very subtle), our tests again give a better power/computation tradeoff than competing methods.\n\n2 Analytic embeddings and distances\n\nIn this section we consider mappings from the space of probability measures into a sub-space of real valued analytic functions. We will show that evaluating these maps at $J$ randomly selected points is almost surely injective for any $J > 0$. Using this result, we obtain a simple (randomized) metrization of the space of probability measures. This metrization is used in the next section to construct linear-time nonparametric two-sample tests.\nTo motivate our approach, we begin by recalling an integral family of distances between distributions, denoted Maximum Mean Discrepancies (MMD) [9]. The MMD is defined as\n\n$\\mathrm{MMD}(P, Q) = \\sup_{f \\in B_k} \\left[ \\int_E f \\, dP - \\int_E f \\, dQ \\right]$, (1)\n\nwhere $P$ and $Q$ are probability measures on $E$, and $B_k$ is the unit ball in the RKHS $\\mathcal{H}_k$ associated with a positive definite kernel $k : E \\times E \\to \\mathbb{R}$. A popular choice of $k$ is the Gaussian kernel $k(x, y) = \\exp(-\\|x - y\\|^2 / \\gamma^2)$ with bandwidth parameter $\\gamma$. It can be shown that the MMD is equal to the RKHS distance between so-called mean embeddings,\n\n$\\mathrm{MMD}(P, Q) = \\|\\mu_P - \\mu_Q\\|_{\\mathcal{H}_k}$, (2)\n\nwhere $\\mu_P$ is an embedding of the probability measure $P$ to $\\mathcal{H}_k$,\n\n$\\mu_P(t) = \\int_E k(x, t) \\, dP(x)$, (3)\n\nand $\\|\\cdot\\|_{\\mathcal{H}_k}$ denotes the norm in the RKHS $\\mathcal{H}_k$. When $k$ is translation invariant, i.e., $k(x, y) = \\kappa(x - y)$, the squared MMD can be written [27, Corollary 4]\n\n$\\mathrm{MMD}^2(P, Q) = \\int_{\\mathbb{R}^d} |\\varphi_P(t) - \\varphi_Q(t)|^2 \\, F^{-1}\\kappa(t) \\, dt$, (4)\n\nwhere $F$ denotes the Fourier transform, $F^{-1}$ is the inverse Fourier transform, and $\\varphi_P$, $\\varphi_Q$ are the characteristic functions of $P$, $Q$, respectively. From [27, Theorem 9], a kernel $k$ is called characteristic when the MMD for $\\mathcal{H}_k$ satisfies\n\n$\\mathrm{MMD}(P, Q) = 0$ iff $P = Q$. (5)\n\nAny bounded, continuous, translation-invariant kernel whose inverse Fourier transform is almost everywhere non-zero is characteristic [27]. By representation (2), it is clear that the MMD with a characteristic kernel is a metric.\nPseudometrics based on characteristic functions. A practical limitation when using the MMD in testing is that an empirical estimate is expensive to compute, this being the sum of two U-statistics and an empirical average, with cost quadratic in the sample size [9, Lemma 6]. We might instead consider a finite dimensional approximation to the MMD, achieved by estimating the integral (4), with the random variable\n\n$d^2_{\\varphi,J}(P, Q) = \\frac{1}{J} \\sum_{j=1}^{J} |\\varphi_P(T_j) - \\varphi_Q(T_j)|^2$, (6)\n\nwhere $\\{T_j\\}_{j=1}^{J}$ are sampled independently from the distribution with a density function $F^{-1}\\kappa$. This type of approximation is applied to various kernel algorithms under the name of random Fourier features [21, 17]. In the statistical testing literature, the quantity $d_{\\varphi,J}(P, Q)$ predates the MMD by a considerable time, and was studied in [13, 14, 8], and more recently revisited in [33]. Our first proposition is that $d^2_{\\varphi,J}(P, Q)$ can be a poor choice of distance between probability measures, as it fails to distinguish a large class of measures. The following result is proved in the Appendix.\nProposition 1. Let $J \\in \\mathbb{N}$ and let $\\{T_j\\}_{j=1}^{J}$ be a sequence of real valued i.i.d. random variables with a distribution which is absolutely continuous with respect to the Lebesgue measure. 
For any $0 < \\epsilon < 1$, there exists an uncountable set $\\mathcal{A}$ of mutually distinct probability measures (on the real line) such that for any $P, Q \\in \\mathcal{A}$, $\\mathbb{P}\\left(d^2_{\\varphi,J}(P, Q) = 0\\right) \\geq 1 - \\epsilon$.\n\nWe are therefore motivated to find distances of the form (6) that can distinguish larger classes of distributions, yet remain efficient to compute. These distances are characterized as follows:\nDefinition 1 (Random Metric). A random process $d$ with values in $\\mathbb{R}$, indexed with pairs from the set of probability measures $\\mathcal{M}$, i.e., $d = \\{d(P, Q) : P, Q \\in \\mathcal{M}\\}$, is said to be a random metric if it satisfies all the conditions for a metric with qualification \u2018almost surely\u2019. Formally, for all $P, Q, R \\in \\mathcal{M}$, random variables $d(P, Q)$, $d(P, R)$, $d(R, Q)$ must satisfy\n\n1. $d(P, Q) \\geq 0$ a.s.\n2. if $P = Q$, then $d(P, Q) = 0$ a.s.; if $P \\neq Q$, then $d(P, Q) \\neq 0$ a.s.\n3. $d(P, Q) = d(Q, P)$ a.s.\n4. $d(P, Q) \\leq d(P, R) + d(R, Q)$ a.s. (see footnote 1)\n\nFrom the statistical testing point of view, the coincidence axiom of a metric $d$, $d(P, Q) = 0$ if and only if $P = Q$, is key, as it ensures consistency against all alternatives. The quantity $d_{\\varphi,J}(P, Q)$ in (6) violates the coincidence axiom, so it is only a random pseudometric (other axioms are trivially satisfied). We remedy this problem by replacing the characteristic functions by smooth characteristic functions:\nDefinition 2. A smooth characteristic function $\\phi_P(t)$ of a measure $P$ is a characteristic function of $P$ convolved with an analytic smoothing kernel $l$, i.e.\n\n$\\phi_P(t) = \\int_{\\mathbb{R}^d} \\varphi_P(w) \\, l(t - w) \\, dw, \\quad t \\in \\mathbb{R}^d$. (7)\n\nProposition 3 shows that a smooth characteristic function can be estimated in linear time. 
The analogue of $d_{\\varphi,J}(P, Q)$ for smooth characteristic functions is simply\n\n$d^2_{\\phi,J}(P, Q) = \\frac{1}{J} \\sum_{j=1}^{J} |\\phi_P(T_j) - \\phi_Q(T_j)|^2$, (8)\n\nwhere $\\{T_j\\}_{j=1}^{J}$ are sampled independently from an absolutely continuous distribution (returning to our earlier example, this might be $F^{-1}\\kappa(t)$ if we believe this to be an informative choice). The following theorem, proved in the Appendix, demonstrates that the smoothing greatly increases the class of distributions we can distinguish.\nTheorem 1. Let $l$ be an analytic, integrable kernel with an inverse Fourier transform that is non-zero almost everywhere. Then, for any $J > 0$, $d_{\\phi,J}$ is a random metric on the space of probability measures with integrable characteristic functions, and $\\phi_P$ is an analytic function.\n\nThis result is primarily a consequence of analyticity of smooth characteristic functions and the fact that analytic functions are \u2018well behaved\u2019. There is an additional, practical advantage to smoothing: when the variability in the difference of the characteristic functions is high, and these differences are local, smoothing distributes the difference in CFs more broadly in the frequency domain (a simple illustration is in Fig. A.1, Appendix), making it easier to find by measurement at a small number of randomly chosen points. This accounts for the observed improvements in test power in Section 4, over differences in unsmoothed CFs.\nMetrics based on mean embeddings. The key step which leads us to the construction of a random metric $d_{\\phi,J}$ is the convolution of the original characteristic functions with an analytic smoothing kernel. This idea need not be restricted to the representations of probability measures in the frequency domain. 
We may instead directly convolve the probability measure with a positive definite kernel $k$ (that need not be translation invariant), yielding its mean embedding into the associated RKHS,\n\n$\\mu_P(t) = \\int_E k(x, t) \\, dP(x)$. (9)\n\nWe say that a positive definite kernel $k : \\mathbb{R}^D \\times \\mathbb{R}^D \\to \\mathbb{R}$ is analytic on its domain if for all $x \\in \\mathbb{R}^D$, the feature map $k(x, \\cdot)$ is an analytic function on $\\mathbb{R}^D$. By using embeddings with characteristic and analytic kernels, we obtain particularly useful representations of distributions. As for the smoothed CF case, we define\n\n$d^2_{\\mu,J}(P, Q) = \\frac{1}{J} \\sum_{j=1}^{J} (\\mu_P(T_j) - \\mu_Q(T_j))^2$. (10)\n\nThe following theorem ensures that $d_{\\mu,J}(P, Q)$ is also a random metric.\n\n[Footnote 1: Note that this does not imply that realizations of $d$ are distances on $\\mathcal{M}$, but it does imply that they are almost surely distances for all arbitrary finite subsets of $\\mathcal{M}$.]\n\nTheorem 2. Let $k$ be an analytic, integrable and characteristic kernel. Then for any $J > 0$, $d_{\\mu,J}$ is a random metric on the space of probability measures (and $\\mu_P$ is an analytic function).\n\nNote that this result is stronger than the one presented in Theorem 1, since it is not restricted to the class of probability measures with integrable characteristic functions. Indeed, the assumption that the characteristic function is integrable implies the existence and boundedness of a density. Recalling the representation of MMD in (2), we have proved that it is almost always sufficient to measure the difference between $\\mu_P$ and $\\mu_Q$ at a finite number of points, provided our kernel is characteristic and analytic. In the next section, we will see that metrization of the space of probability measures using random metrics $d_{\\mu,J}$, $d_{\\phi,J}$ is very appealing from the computational point of view. 
It turns out that the statistical tests that arise from these metrics have linear time complexity (in the number of samples) and constant memory requirements.\n\n3 Hypothesis Tests Based on Distances Between Analytic Functions\n\nIn this section, we provide two linear-time two-sample tests: first, a test based on analytic mean embeddings, and next a test based on smooth characteristic functions. We further describe the relation with competing alternatives. Proofs of all propositions are in Appendix B.\nDifference in analytic functions. In the previous section we described the random metric based on a difference in analytic mean embeddings, $d^2_{\\mu,J}(P, Q) = \\frac{1}{J} \\sum_{j=1}^{J} (\\mu_P(T_j) - \\mu_Q(T_j))^2$. If we replace $\\mu_P$ with the empirical mean embedding $\\hat{\\mu}_P = \\frac{1}{n} \\sum_{i=1}^{n} k(X_i, \\cdot)$ it can be shown that for any sequence of unique $\\{t_j\\}_{j=1}^{J}$, under the null hypothesis, as $n \\to \\infty$,\n\n$\\sqrt{n} \\sum_{j=1}^{J} (\\hat{\\mu}_P(t_j) - \\hat{\\mu}_Q(t_j))^2$ (11)\n\nconverges in distribution to a sum of correlated chi-squared variables. Even for fixed $\\{t_j\\}_{j=1}^{J}$, it is very computationally costly to obtain quantiles of this distribution, since this requires a bootstrap or permutation procedure. We will follow a different approach based on Hotelling\u2019s $T^2$-statistic [16]. Hotelling\u2019s $T^2$-statistic of a zero-mean Gaussian vector $W = (W_1, \\cdots, W_J)$ with a covariance matrix $\\Sigma$ is $T^2 = W^\\top \\Sigma^{-1} W$. The compelling property of the statistic is that it is distributed as a $\\chi^2$-random variable with $J$ degrees of freedom. To see a link between $T^2$ and equation (11), consider a random variable $\\sum_{j=1}^{J} W_j^2$: this is also distributed as a sum of correlated chi-squared variables. In our case $W$ is replaced with a difference of normalized empirical mean embeddings, and $\\Sigma$ is replaced with the empirical covariance of the difference of mean embeddings. Formally, let $Z_i$ denote the vector of differences between kernels at test points $T_j$,\n\n$Z_i = (k(X_i, T_1) - k(Y_i, T_1), \\cdots, k(X_i, T_J) - k(Y_i, T_J)) \\in \\mathbb{R}^J$. (12)\n\nWe define the vector of mean empirical differences $W_n = \\frac{1}{n} \\sum_{i=1}^{n} Z_i$, and its covariance matrix $\\Sigma_n = \\frac{1}{n} \\sum_{i} (Z_i - W_n)(Z_i - W_n)^\\top$. The test statistic is\n\n$S_n = n W_n^\\top \\Sigma_n^{-1} W_n$. (13)\n\nThe computation of $S_n$ requires inversion of a $J \\times J$ matrix $\\Sigma_n$, but this is fast and numerically stable: $J$ will typically be small, and is less than 10 in our experiments. The next proposition demonstrates the use of $S_n$ as a two-sample test statistic.\nProposition 2 (Asymptotic behavior of $S_n$). Let $d^2_{\\mu,J}(P, Q) = 0$ a.s. and let $\\{X_i\\}_{i=1}^{n}$ and $\\{Y_i\\}_{i=1}^{n}$ be i.i.d. samples from $P$ and $Q$ respectively. If $\\Sigma_n^{-1}$ exists for $n$ large enough, then the statistic $S_n$ is a.s. asymptotically distributed as a $\\chi^2$-random variable with $J$ degrees of freedom (as $n \\to \\infty$ with $d$ fixed). If $d^2_{\\mu,J}(P, Q) > 0$ a.s., then a.s. for any fixed $r$, $\\mathbb{P}(S_n > r) \\to 1$ as $n \\to \\infty$.\n\nWe now apply the above proposition to obtain a statistical test.\nTest 1 (Analytic mean embedding). Calculate $S_n$. Choose a threshold $r_\\alpha$ corresponding to the $1 - \\alpha$ quantile of a $\\chi^2$ distribution with $J$ degrees of freedom, and reject the null hypothesis whenever $S_n$ is larger than $r_\\alpha$.\n\nThere are a number of valid sampling schemes for the test points $\\{T_j\\}_{j=1}^{J}$ to evaluate the differences in mean embeddings: see Section 4 for a discussion.\nDifference in smooth characteristic functions. From the convolution definition of a smooth characteristic function (7) it is not immediately obvious how to calculate its estimator in linear time. In the next proposition, however, we show that a smooth characteristic function is an expected value of some function (with respect to the given measure), which can be estimated in linear time.\nProposition 3. Let $k$ be an integrable translation-invariant kernel and $f$ its inverse Fourier transform. Then the smooth characteristic function of $P$ can be written as $\\phi_P(t) = \\int_{\\mathbb{R}^d} e^{i t^\\top x} f(x) \\, dP(x)$.\n\nIt is now clear that a test based on the smooth characteristic functions is similar to the test based on mean embeddings. The main difference is in the definition of the vector of differences $Z_i$:\n\n$Z_i = (f(X_i) \\sin(X_i T_1) - f(Y_i) \\sin(Y_i T_1), f(X_i) \\cos(X_i T_1) - f(Y_i) \\cos(Y_i T_1), \\cdots) \\in \\mathbb{R}^{2J}$. (14)\n\nThe imaginary and real parts of $e^{\\sqrt{-1} T_j^\\top X_i} f(X_i) - e^{\\sqrt{-1} T_j^\\top Y_i} f(Y_i)$ are stacked together, in order to ensure that $W_n$, $\\Sigma_n$ and $S_n$ are all real-valued quantities.\nProposition 4. Let $d^2_{\\phi,J}(P, Q) = 0$ and let $\\{X_i\\}_{i=1}^{n}$ and $\\{Y_i\\}_{i=1}^{n}$ be i.i.d. samples from $P$ and $Q$ respectively. Then the statistic $S_n$ is almost surely asymptotically distributed as a $\\chi^2$-random variable with $2J$ degrees of freedom (as $n \\to \\infty$ with $J$ fixed). If $d^2_{\\phi,J}(P, Q) > 0$, then almost surely for any fixed $r$, $\\mathbb{P}(S_n > r) \\to 1$ as $n \\to \\infty$.\nOther tests. The test [8] based on empirical characteristic functions was constructed originally for one test point and then generalized to many points; it is quite similar to our second test, but does not perform smoothing (it is also based on a Hotelling $T^2$-statistic). The block MMD [32] is a sub-quadratic test, which can be trivially linearized by fixing the block size, as presented in the Appendix. Finally, another alternative is the MMD, an inherently quadratic time test. We scale MMD to linear time by sub-sampling our data set, and choosing only $\\sqrt{n}$ points, so that the MMD complexity becomes $O(n)$. Note, however, that the true complexity of MMD involves a permutation calculation of the null distribution at cost $O(b_n n)$, where the number of permutations $b_n$ grows with $n$. See Appendix C for a detailed description of alternative tests.\n\n4 Experiments\n\nIn this section we compare two-sample tests on both artificial benchmark data and on real-world data. 
We denote the smooth characteristic function test as \u2018Smooth CF\u2019, and the test based on the analytic mean embeddings as \u2018Mean Embedding\u2019. We compare against several alternative testing approaches: block MMD (\u2018Block MMD\u2019), a characteristic functions based test (\u2018CF\u2019), a sub-sampling MMD test (\u2018MMD($\\sqrt{n}$)\u2019), and the quadratic-time MMD test (\u2018MMD(n)\u2019).\nExperimental setup. For all the experiments, $D$ is the dimensionality of samples in a dataset, $n$ is the number of samples in the dataset (sample size) and $J$ is the number of test frequencies. Parameter selection is required for all the tests. The table below summarizes the main choices of the parameters made for the experiments. The first parameter is the test function, used to calculate the particular statistic. The scalar $\\gamma$ represents the length-scale of the observed data. Notice that for the kernel tests we recover the standard parameterization $\\exp(-\\|x/\\gamma - y/\\gamma\\|^2) = \\exp(-\\|x - y\\|^2 / \\gamma^2)$. The original CF test was proposed without any parameters, hence we added $\\gamma$ to ensure a fair comparison; for this test varying $\\gamma$ is equivalent to adjusting the variance of the distribution of frequencies $T_j$. For all tests, the value of the scaling parameter $\\gamma$ was chosen so as to minimize a p-value estimate on a held-out training set: details are described in Appendix D. We chose not to optimize the sampling scheme for the Mean Embedding and Smooth CF tests, since this would give them an unfair advantage over the Block MMD, MMD($\\sqrt{n}$) and CF tests. The block size in the Block MMD test and the number of test frequencies in the Mean Embedding, Smooth CF, and CF tests, were always set to the same value (not greater than 10) to maintain exactly the same time complexity. Note that we did not use the popular median heuristic for kernel bandwidth choice (MMD and B-test), since it gives poor results for the Blobs and AM Audio datasets [11]. We do not run the MMD(n) test for \u2018Simulation 1\u2019 or \u2018Amplitude Modulated Music\u2019, since the sample size is 10000, which is too large for a quadratic-time test with permutation sampling for the test critical value.\n\nFigure 1: Higgs dataset. Left: Test power vs. sample size. Right: Test power vs. execution time.\n\nIt is important to verify that Type I error is indeed at the design level, set at $\\alpha = 0.05$ in this paper. This is verified in the Appendix, Figure A.2. Also shown in the plots are the 95% confidence intervals for the results, as averaged over 4000 runs.\n\nTest | Test Function | Sampling scheme | Other parameters\nMean Embedding | $\\exp(-\\|\\gamma^{-1}(x - t)\\|^2)$ | $T_j \\sim N(0_D, I_D)$ | $J$ - no. of test frequencies\nSmooth CF | $\\exp(i t^\\top \\gamma^{-1} x - \\|\\gamma^{-1} x - t\\|^2)$ | $T_j \\sim N(0_D, I_D)$ | $J$ - no. of test frequencies\nMMD(n), MMD($\\sqrt{n}$) | $\\exp(-\\|\\gamma^{-1}(x - t)\\|^2)$ | not applicable | $b$ - bootstraps\nBlock MMD | $\\exp(-\\|\\gamma^{-1}(x - t)\\|^2)$ | not applicable | $B$ - block size\nCF | $\\exp(i t^\\top \\gamma^{-1} x)$ | $T_j \\sim N(0_D, I_D)$ | $J$ - no. of test frequencies\n\nReal Data 1: Higgs dataset, $D = 4$, $n$ varies, $J = 10$. The first experiment we consider is on the UCI Higgs dataset [18] described in [3]; the task is to distinguish signatures of processes that produce Higgs bosons from background processes that do not. We consider a two-sample test on certain extremely low-level features in the dataset: kinematic properties measured by the particle detectors, i.e., the joint distributions of the azimuthal angular momenta $\\varphi$ for four particle jets. We denote by $P$ the jet $\\varphi$-momenta distribution of the background process (no Higgs bosons), and by $Q$ the corresponding distribution for the process that produces Higgs bosons (both are distributions on $\\mathbb{R}^4$). As discussed in [3, Fig. 2], $\\varphi$-momenta, unlike transverse momenta $p_T$, carry very little discriminating information for recognizing whether Higgs bosons were produced. 
Therefore, we would like to test whether the null hypothesis that the distributions of angular momenta $P$ (no Higgs boson observed) and $Q$ (Higgs boson observed) are the same might yet be rejected. The results for different algorithms are presented in Figure 1. We observe that the joint distribution of the angular momenta is in fact discriminative. Sample size varies from 1000 to 12000. The Smooth CF test has significantly higher power than the other tests, including the quadratic-time MMD, which we could only run on up to 5100 samples due to computational limitations. The leading performance of the Smooth CF test is especially remarkable given it is several orders of magnitude faster than the quadratic-time MMD(n), even though we used the fastest quadratic-time MMD implementation, where the asymptotic distribution is approximated by a Gamma density.\nReal Data 2: Amplitude Modulated Music, $D = 1000$, $n = 10000$, $J = 10$. Amplitude modulation is the earliest technique used to transmit voice over the radio. In the following experiment observations were one thousand dimensional samples of carrier signals that were modulated with two different input audio signals from the same album, song $P$ and song $Q$ (further details of these data are described in [11, Section 5]). To increase the difficulty of the testing problem, independent Gaussian noise of increasing variance (in the range 1 to 4.0) was added to the signals. The results are presented in Figure 2. Compared to the other tests, the Mean Embedding and Smooth CF tests are more robust to the moderate noise contamination.\nSimulation 1: High Dimensions, $D$ varies, $n = 10000$, $J = 3$. It has recently been shown, in theory and in practice, that the two-sample problem gets more difficult as the number of dimensions on which the distributions do not differ increases [22, 23]. In the following experiment, we study the power of the two-sample tests as a function of dimension of the samples. 
We run two-sample tests on two datasets of Gaussian random vectors which differ only in the first dimension,\n\nDataset I: $P = N(0_D, I_D)$ vs. $Q = N((1, 0, \\cdots, 0), I_D)$\nDataset II: $P = N(0_D, I_D)$ vs. $Q = N(0_D, \\mathrm{diag}((2, 1, \\cdots, 1)))$,\n\nwhere $0_D$ is a $D$-dimensional vector of zeros, $I_D$ is a $D$-dimensional identity matrix, and $\\mathrm{diag}(v)$ is a diagonal matrix with $v$ on the diagonal. The number of dimensions ($D$) varies from 50 to 2500 (Dataset I) and from 50 to 1200 (Dataset II). The power of the different two-sample tests is presented in Figure 3. The Mean Embedding test yields best performance for both datasets, where the advantage is especially large for differences in variance.\n\nFigure 2: Music Dataset. Left: Test power vs. added noise. Right: four samples from $P$ and $Q$.\n\nFigure 3: Power vs. redundant dimensions comparison for tests on high dimensional data.\n\nSimulation 2: Blobs, $D = 2$, $n$ varies, $J = 5$. The Blobs dataset is a grid of two dimensional Gaussian distributions (see Figure 4), which is known to be a challenging two-sample testing task. The difficulty arises from the fact that the difference in distributions is encoded at a much smaller lengthscale than the overall data. In this experiment both $P$ and $Q$ are four by four grids of Gaussians, where $P$ has unit covariance matrix in each mixture component, while each component of $Q$ has direction of the largest variance rotated by $\\pi/4$ and amplified to 4. It was demonstrated by [11] that a good choice of kernel is crucial for this task. Figure 4 presents the results of two-sample tests on the Blobs dataset. The number of samples varies from 50 to 14000 (MMD(n) reached test power one with $n = 1400$). We found that the MMD(n) test has the best power as a function of the sample size, but the worst power/computation tradeoff. By contrast, random distance based tests have the best power/computation tradeoff.\n\nAcknowledgment. 
We would like to thank Bharath Sriperumbudur and Wittawat Jitkrittum for\ninsightful comments.\n\nFigure 4: Blobs Dataset. Left: test power vs. sample size. Center: test power vs. execution time.\nRight: illustration of the Blobs dataset.\n\nReferences\n[1] V. Alba Fern\u00e1ndez, M. Jim\u00e9nez-Gamero, and J. Mu\u00f1oz Garcia. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis, 52:3730\u20133748, 2008.\n[2] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, July 2003.\n[3] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.\n[4] L Baringhaus and C Franz. On a new multivariate two-sample test. J mult anal, 88(1):190\u2013206, 2004.\n[5] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics, volume 3. Kluwer Academic Boston, 2004.\n[6] K.M. Borgwardt, A. Gretton, M.J. Rasch, H.-P. Kriegel, B. Sch\u00f6lkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49\u2013e57, 2006.\n[7] K. R. Davidson. Pointwise limits of analytic functions. Am math mon, pages 391\u2013394, 1983.\n[8] T.W. Epps and K.J. Singleton. An omnibus test for the two-sample problem using the empirical characteristic function. Journal of Statistical Computation and Simulation, 26(3-4):177\u2013203, 1986.\n[9] A. Gretton, K. Borgwardt, M. Rasch, B. Sch\u00f6lkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723\u2013773, 2012.\n[10] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In NIPS, 2009.\n[11] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and K. Fukumizu. Optimal kernel choice for large-scale two-sample tests. In NIPS, 2012.\n[12] Z. Harchaoui, F.R. Bach, and E. Moulines. Testing for Homogeneity with Kernel Fisher Discriminant Analysis. In NIPS, 2008.\n[13] CE Heathcote. A test of goodness of fit for symmetric random variables. Aust J stat, 14(2):172\u2013181, 1972.\n[14] CR Heathcote. The integrated squared error estimation of parameters. Biometrika, 64(2):255\u2013264, 1977.\n[15] H.-C. Ho and G. Shieh. Two-stage U-statistics for hypothesis testing. Scandinavian Journal of Statistics, 33(4):861\u2013873, 2006.\n[16] H. Hotelling. The generalization of student\u2019s ratio. Ann. Math. Statist., 2(3):360\u2013378, 1931.\n[17] Q. Le, T. Sarlos, and A. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In ICML, volume 28, pages 244\u2013252, 2013.\n[18] M. Lichman. UCI machine learning repository, 2013.\n[19] J.R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. Technical report, 2014.\n[20] Tom\u00e1\u0161 Pevn\u00fd and Jessica Fridrich. Benchmarking for steganography. In Information Hiding, pages 251\u2013267. Springer, 2008.\n[21] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.\n[22] A. Ramdas, S. Reddi, B. P\u00f3czos, A. Singh, and L. Wasserman. On the decreasing power of kernel- and distance-based nonparametric hypothesis tests in high dimensions. AAAI, 2015.\n[23] S. Reddi, A. Ramdas, B. P\u00f3czos, A. Singh, and L. Wasserman. On the high-dimensional power of linear-time kernel two-sample testing under mean-difference alternatives. AISTATS, 2015.\n[24] Walter Rudin. Real and complex analysis. Tata McGraw-Hill Education, 1987.\n[25] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263\u20132291, 2013.\n[26] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet. 
Universality, characteristic kernels and RKHS embedding of measures. JMLR, 12:2389\u20132410, 2011.\n[27] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Sch\u00f6lkopf. Hilbert space embeddings and metrics on probability measures. JMLR, 11:1517\u20131561, 2010.\n[28] I. Steinwart and A. Christmann. Support vector machines. Springer Science & Business Media, 2008.\n[29] I. Steinwart, D. Hush, and C. Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. Information Theory, IEEE Transactions on, 52(10):4635\u20134643, 2006.\n[30] Hong-Wei Sun and Ding-Xuan Zhou. Reproducing kernel Hilbert spaces associated with analytic translation-invariant Mercer kernels. Journal of Fourier Analysis and Applications, 14(1):89\u2013101, 2008.\n[31] GJ Sz\u00e9kely. E-statistics: The energy of statistical samples. Technical report, 2003.\n[32] W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In NIPS, 2013.\n[33] Ji Zhao and Deyu Meng. FastMMD: Ensemble of circular discrepancy for efficient two-sample test. Neural computation, 27:1345\u20131372, 2015.\n[34] AA Zinger, AV Kakosyan, and LB Klebanov. A characterization of distributions by mean values of statistics and certain probabilistic metrics. Journal of Mathematical Sciences, 59(4):914\u2013920, 1992.", "award": [], "sourceid": 1205, "authors": [{"given_name": "Kacper", "family_name": "Chwialkowski", "institution": "University College London"}, {"given_name": "Aaditya", "family_name": "Ramdas", "institution": "Carnegie Mellon University"}, {"given_name": "Dino", "family_name": "Sejdinovic", "institution": "University of Oxford"}, {"given_name": "Arthur", "family_name": "Gretton", "institution": "University Collage London"}]}