{"title": "Comparing distributions: $\\ell_1$ geometry improves kernel two-sample testing", "book": "Advances in Neural Information Processing Systems", "page_first": 12327, "page_last": 12337, "abstract": "Are two sets of observations drawn from the same distribution? This\nproblem is a two-sample test. \nKernel methods lead to many appealing properties. Indeed state-of-the-art\napproaches use the $L^2$ distance between kernel-based\ndistribution representatives to derive their test statistics. Here, we show that\n$L^p$ distances (with $p\\geq 1$) between these\ndistribution representatives give metrics on the space of distributions that are\nwell-behaved to detect differences between distributions as they\nmetrize the weak convergence. Moreover, for analytic kernels,\nwe show that the $L^1$ geometry gives improved testing power for\nscalable computational procedures. Specifically, we derive a finite\ndimensional approximation of the metric given as the $\\ell_1$ norm of a vector which captures differences of expectations of analytic functions evaluated at spatial locations or frequencies (i.e, features). The features can be chosen to\nmaximize the differences of the distributions and give interpretable\nindications of how they differs. Using an $\\ell_1$ norm gives better detection\nbecause differences between representatives are dense\nas we use analytic kernels (non-zero almost everywhere). The tests are consistent, while\nmuch faster than state-of-the-art quadratic-time kernel-based tests. Experiments\non artificial\nand real-world problems demonstrate\nimproved power/time tradeoff than the state of the art, based on\n$\\ell_2$ norms, and in some cases, better outright power than even the most\nexpensive quadratic-time tests. 
This performance gain is retained even in high dimensions.", "full_text": "Comparing distributions: ℓ1 geometry improves kernel two-sample testing

Meyer Scetbon
CREST, ENSAE & Inria, Université Paris-Saclay

Gaël Varoquaux
Inria, Université Paris-Saclay

Abstract

Are two sets of observations drawn from the same distribution? This problem is a two-sample test. Kernel methods lead to many appealing properties. Indeed, state-of-the-art approaches use the L2 distance between kernel-based distribution representatives to derive their test statistics. Here, we show that Lp distances (with p ≥ 1) between these distribution representatives give metrics on the space of distributions that are well-behaved for detecting differences between distributions, as they metrize the weak convergence. Moreover, for analytic kernels, we show that the L1 geometry gives improved testing power for scalable computational procedures. Specifically, we derive a finite-dimensional approximation of the metric given as the ℓ1 norm of a vector which captures differences of expectations of analytic functions evaluated at spatial locations or frequencies (i.e., features). The features can be chosen to maximize the differences of the distributions and give interpretable indications of how they differ. Using an ℓ1 norm gives better detection because differences between representatives are dense, as we use analytic kernels (non-zero almost everywhere). The tests are consistent, while much faster than state-of-the-art quadratic-time kernel-based tests. Experiments on artificial and real-world problems demonstrate a better power/time tradeoff than the state of the art, based on ℓ2 norms, and in some cases, better outright power than even the most expensive quadratic-time tests.

We consider two-sample tests: testing whether two random variables are identically distributed, without assumption on their distributions. 
This problem has many applications, such as data integration [4] or automated model checking [22]. Distances between distributions underlie progress in unsupervised learning with generative adversarial networks [20, 1]. A kernel on the sample space can be used to build the Maximum Mean Discrepancy (MMD) [11, 12, 13, 26], a metric on distributions which has the strong property of metrizing the weak convergence of probability measures. It leads to non-parametric two-sample tests using the reproducing kernel Hilbert space (RKHS) distance [15, 9], or energy distance [32, 3]. The MMD has a quadratic computational cost, which may force the use of subsampled estimates [33, 14]. [5] approximate the L2 distance between distribution representatives in the RKHS, to compute in linear time a pseudo-metric over the space of distributions. Such approximations are related to random (Fourier) features, used in kernel algorithms [24, 19]. Distribution representatives can be mean embeddings [29, 30] or smooth characteristic functions [5, 17].
We first introduce the state of the art on kernel-based two-sample testing built from the L2 distance between mean embeddings in the RKHS. In fact, a wider family of distances is well suited for the two-sample problem: we show that for any p ≥ 1, the Lp distance between these distribution representatives is a metric on the space of Borel probability measures that metrizes their weak convergence. We then define our ℓ1-based statistic derived from the L1 geometry and study its asymptotic behavior. We consider the general case where the number of samples of the two distributions may differ. We show that using the ℓ1 norm provides better testing power. Indeed, test statistics approximate such metrics and are defined as the norm of a J-dimensional vector which is the difference between the two distribution representatives at J locations. 
Under the alternative hypothesis H1: P ≠ Q, the analyticity of the kernel ensures that all the features of this vector are non-zero almost surely. We show that the ℓ1 norm captures this dense difference better than the ℓ2 norm and leads to better tests. We also show that improvements of kernel two-sample tests established with the ℓ2 norm [17] hold in the ℓ1 case: optimizing features and the choice of kernel. We adapt the construction in the frequency domain as in [5]. Finally, we show that on 4 synthetic and 3 real-life problems, our new ℓ1-based tests outperform the state of the art.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1 Prior art: kernel embeddings for two-sample tests
Given two samples X := {x_i}_{i=1}^n, Y := {y_i}_{i=1}^n ⊂ X drawn independently and identically distributed (i.i.d.) according to two probability measures P and Q on a metric space (X, d) respectively, the goal of a two-sample test is to decide whether P is different from Q on the basis of the samples. Kernel methods arise naturally in two-sample testing as they provide Euclidean norms over the space of probability measures that metrize the convergence in law. To define such a metric, we first need to introduce the notion of Integral Probability Metric (IPM):

IPM[F, P, Q] := sup_{f ∈ F} ( E_{x∼P}[f(x)] − E_{y∼Q}[f(y)] )   (1)

where F is an arbitrary class of functions. When F is the unit ball B_k in the RKHS H_k associated with a positive definite bounded kernel k: X × X → R, the IPM is known as the Maximum Mean Discrepancy (MMD) [11], and it can be shown that the MMD is equal to the RKHS distance between so-called mean embeddings [13],

MMD[P, Q] = ‖μ_P − μ_Q‖_{H_k}   (2)

where μ_P is an embedding of the probability measure P into H_k,

μ_P(t) := ∫_{R^d} k(x, t) dP(x)   (3)

and ‖.‖_{H_k} denotes the norm in the RKHS H_k. 
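To make eqs. (2)-(3) concrete, here is a minimal numerical sketch (ours, not the paper's released code) of the empirical mean embedding and a biased quadratic-time MMD² estimate, assuming an isotropic Gaussian kernel; all function names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 sigma^2)): a bounded, characteristic, analytic kernel
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mean_embedding(X, T, sigma=1.0):
    # Empirical mean embedding at locations T: mu_X(T_j) = (1/n) sum_i k(x_i, T_j)
    return gaussian_kernel(X, T, sigma).mean(axis=0)

def mmd2_biased(X, Y, sigma=1.0):
    # Biased quadratic-time estimate of MMD^2 = ||mu_P - mu_Q||^2 in the RKHS
    return (gaussian_kernel(X, X, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (500, 2))   # samples from P
Y = rng.normal(1.0, 1.0, (500, 2))   # samples from Q (mean shift)
print(mmd2_biased(X, Y))             # clearly positive under P != Q
print(mmd2_biased(X, X))             # exactly 0 on identical samples
```

The quadratic cost is visible here: `mmd2_biased` forms n × n kernel matrices, which is what motivates the linear-time location-based approximations discussed next.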
Moreover, for kernels said to be characteristic [10], e.g. Gaussian kernels, MMD[P, Q] = 0 if and only if P = Q [11]. In addition, when the kernel is bounded and X is a compact Hausdorff space, [28] show that the MMD metrizes the weak convergence. Tests between distributions can be designed using an empirical estimation of the MMD. A drawback of the MMD is the computation cost of empirical estimates, these being the sum of two U-statistics and an empirical average, with a quadratic cost in the sample size.
[5] study a related expression defined as the L2 distance between mean embeddings of Borel probability measures:

d²_{L2,μ}(P, Q) := ∫_{t∈R^d} |μ_P(t) − μ_Q(t)|² dΛ(t)   (4)

where Λ is a Borel probability measure. They estimate the integral (4) with the random variable

d²_{ℓ2,μ,J}(P, Q) := (1/J) Σ_{j=1}^J |μ_P(T_j) − μ_Q(T_j)|²   (5)

where {T_j}_{j=1}^J are sampled i.i.d. from the distribution Λ. This expression still has desirable metric-like properties, provided that the kernel is analytic:
Definition 1.1 (Analytic kernel). A positive definite kernel k: R^d × R^d → R is analytic on its domain if for all x ∈ R^d, the feature map k(x, .) is an analytic function on R^d.
Indeed, for k a positive definite, characteristic, analytic, and bounded kernel on R^d, [5] show that d_{ℓ2,μ,J} is a random metric¹ from which a consistent two-sample test can be derived. Denoting by μ̂_X and μ̂_Y respectively the empirical mean embeddings of P and Q,

μ̂_X(T) := (1/n) Σ_{i=1}^n k(x_i, T),   μ̂_Y(T) := (1/n) Σ_{i=1}^n k(y_i, T)

¹A random metric is a random process which satisfies all the conditions for a metric 'almost surely' [5].

[5] show that for {T_j}_{j=1}^J sampled from the distribution Λ, under the null hypothesis H0: P = Q, as n → ∞, the following test statistic:

d̂²_{ℓ2,μ,J}[X, Y] := n Σ_{j=1}^J |μ̂_X(T_j) − μ̂_Y(T_j)|²   (6)

converges in distribution to a sum of correlated chi-squared variables. Moreover, under the alternative hypothesis H1: P ≠ Q, d̂²_{ℓ2,μ,J}[X, Y] can be arbitrarily large as n → ∞, allowing the test to correctly reject H0. For a fixed level α, the test rejects H0 if d̂²_{ℓ2,μ,J}[X, Y] exceeds a predetermined test threshold, which is given by the (1 − α)-quantile of its asymptotic null distribution. As it is very computationally costly to obtain quantiles of this distribution, [5] normalize the differences between mean embeddings and consider instead the test statistic ME[X, Y] := ‖√n Σ_n^{-1/2} S̄_n‖²₂, where S̄_n := (1/n) Σ_{i=1}^n z_i, Σ_n := (1/(n−1)) Σ_{i=1}^n (z_i − S̄_n)(z_i − S̄_n)^T, and z_i := (k(x_i, T_j) − k(y_i, T_j))_{j=1}^J ∈ R^J. Under the null hypothesis H0, the ME statistic asymptotically follows χ²(J), a chi-squared distribution with J degrees of freedom. Moreover, for k a translation-invariant kernel, [5] derive another statistical test, called the SCF test (for Smooth Characteristic Function), whose statistic SCF[X, Y] is of the same form as the ME test statistic with a modified z_i := [f̂(x_i) sin(x_i^T T_j) − f̂(y_i) sin(y_i^T T_j), f̂(x_i) cos(x_i^T T_j) − f̂(y_i) cos(y_i^T T_j)]_{j=1}^J ∈ R^{2J}, where f̂ is the inverse Fourier transform of k; they show that under H0, SCF[X, Y] asymptotically follows χ²(2J).

2 A family of metrics that metrize the weak convergence

[5] build their ME statistic by estimating the L2 distance between mean embeddings. This metric can be generalized using any Lp distance with p ≥ 1. These metrics are well suited for the two-sample problem as they metrize the weak convergence (see proof in supp. mat. A.1):
Theorem 2.1. 
Given p ≥ 1, k a positive definite, characteristic, continuous, and bounded kernel on R^d, and μ_P and μ_Q the mean embeddings of the Borel probability measures P and Q respectively, the function defined on M¹₊(R^d) × M¹₊(R^d)

d_{Lp,μ}(P, Q) := ( ∫_{t∈R^d} |μ_P(t) − μ_Q(t)|^p dΛ(t) )^{1/p}   (7)

is a metric on the space of Borel probability measures, for Λ a Borel probability measure absolutely continuous with respect to the Lebesgue measure. Moreover, a sequence (α_n)_{n≥0} of Borel probability measures converges weakly towards α if and only if d_{Lp,μ}(α_n, α) → 0.
Therefore, like the MMD, these metrics take into account the geometry of the underlying space and metrize the convergence in law. If we assume in addition that the kernel is analytic, we will show that deriving test statistics from the L1 distance instead of the L2 distance improves the test power for two-sample testing.

3 Two-sample testing using the ℓ1 norm

3.1 A test statistic with simple asymptotic distribution
From now on, we assume that k is a positive definite, characteristic, analytic, and bounded kernel. The statistic presented in eq. 6 is based on the ℓ2 norm of a vector that captures differences between distributions in the RKHS at J locations. We will show that using an ℓ1 norm instead of an ℓ2 norm improves the test power (Proposition 3.1): it better captures the geometry of the problem. Indeed, when P ≠ Q, the differences between distributions are dense, which allows the ℓ1 norm to better reject the null hypothesis H0: P = Q.
We now build a consistent statistical test based on an empirical estimation of the L1 metric introduced in eq. 7:

d̂_{ℓ1,μ,J}[X, Y] := √n Σ_{j=1}^J |μ̂_X(T_j) − μ̂_Y(T_j)|   (8)

where {T_j}_{j=1}^J are sampled from the distribution Λ. We show that under H0, d̂_{ℓ1,μ,J}[X, Y] converges in distribution to a sum of correlated Nakagami variables², and under H1, d̂_{ℓ1,μ,J}[X, Y] can be arbitrarily large as n → ∞ (see supp. mat. C.1). For a fixed level α, the test rejects H0 if d̂_{ℓ1,μ,J}[X, Y] exceeds the (1 − α)-quantile of its asymptotic null distribution. We now compare the power of the statistics based respectively on the ℓ2 norm (eq. 6) and the ℓ1 norm (eq. 8) at the same level α > 0, and we show that the power of the test using the ℓ1 norm is better with high probability (see supp. mat. C.2):
Proposition 3.1. Let α ∈ ]0, 1[, δ > 0 and J ≥ 2. Let {T_j}_{j=1}^J be sampled i.i.d. from the distribution Λ, and let X := {x_i}_{i=1}^n and Y := {y_i}_{i=1}^n be i.i.d. samples from P and Q respectively. Let us denote γ₁ the (1 − α)-quantile of the asymptotic null distribution of d̂_{ℓ1,μ,J}[X, Y] and γ₂ the (1 − α)-quantile of the asymptotic null distribution of d̂²_{ℓ2,μ,J}[X, Y]. Under the alternative hypothesis, almost surely, there exists N ≥ 1 such that for all n ≥ N, with a probability of at least 1 − δ we have:

d̂²_{ℓ2,μ,J}[X, Y] > γ₂  ⟹  d̂_{ℓ1,μ,J}[X, Y] > γ₁

Therefore, for a fixed level α, under the alternative hypothesis, when the number of samples is large enough, with high probability the ℓ1-based test better rejects the null hypothesis. However, even for fixed {T_j}_{j=1}^J, computing the quantiles of these distributions requires a computationally costly bootstrap or permutation procedure. Thus we follow a different approach, where we allow the number of samples to differ. 
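The geometric intuition behind Proposition 3.1 can be checked numerically: for vectors of the same ℓ2 length, the ℓ1 norm is strictly larger when the mass is spread over many coordinates, which is exactly the dense-difference regime that analytic kernels guarantee under H1. A small illustration (ours, not part of the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(0)
J = 10

# A "dense" difference vector: every coordinate non-zero, as the analyticity
# of the kernel guarantees almost surely under H1.
v_dense = rng.normal(size=J)
# A "sparse" difference: the same l2 length, concentrated in one coordinate.
v_sparse = np.zeros(J)
v_sparse[0] = np.linalg.norm(v_dense)

# Both vectors have the same l2 norm ...
print(np.linalg.norm(v_dense), np.linalg.norm(v_sparse))
# ... but the l1 norm is strictly larger on the dense vector,
# by up to a factor sqrt(J) in the extreme case.
print(np.abs(v_dense).sum(), np.abs(v_sparse).sum())
```

In other words, ‖v‖₂ ≤ ‖v‖₁ ≤ √J ‖v‖₂, with the ℓ1 side gaining the most precisely when no coordinate of the difference vector vanishes.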
Let X := {x_i}_{i=1}^{N1} and Y := {y_i}_{i=1}^{N2} be i.i.d. samples according to P and Q respectively. We define, for any sequence {T_j}_{j=1}^J in R^d:

S_{N1,N2} := ( μ̂_X(T_1) − μ̂_Y(T_1), ..., μ̂_X(T_J) − μ̂_Y(T_J) )   (9)

Z^i_X := (k(x_i, T_1), ..., k(x_i, T_J)) ∈ R^J,   Z^j_Y := (k(y_j, T_1), ..., k(y_j, T_J)) ∈ R^J   (10)

And by denoting:

Σ_{N1} := (1/(N1 − 1)) Σ_{i=1}^{N1} (Z^i_X − Z̄_X)(Z^i_X − Z̄_X)^T,   Σ_{N2} := (1/(N2 − 1)) Σ_{j=1}^{N2} (Z^j_Y − Z̄_Y)(Z^j_Y − Z̄_Y)^T,   Σ_{N1,N2} := Σ_{N1}/ρ + Σ_{N2}/(1 − ρ)   (11)

we can define our new statistic as:

L1-ME[X, Y] := ‖ √t Σ_{N1,N2}^{-1/2} S_{N1,N2} ‖₁

We assume that the numbers of samples of the distributions P and Q are of the same order, i.e., with t = N1 + N2, we have N1/t → ρ and therefore N2/t → 1 − ρ, with ρ ∈ ]0, 1[. The computation of the statistic requires inverting a J × J matrix Σ_{N1,N2}, but this is fast and numerically stable: J is typically small, e.g. less than 10. The next proposition demonstrates the use of this statistic as a consistent two-sample test (see supp. mat. C.3 for the proof).
Proposition 3.2. Let {T_j}_{j=1}^J be sampled i.i.d. from the distribution Λ, and X := {x_i}_{i=1}^{N1} and Y := {y_i}_{i=1}^{N2} be i.i.d. samples from P and Q respectively. Under H0, the statistic L1-ME[X, Y] is almost surely asymptotically distributed as Naka(1/2, 1, J), a sum of J i.i.d. random variables which follow a Nakagami distribution of parameters m = 1/2 and ω = 1. Finally, under H1, almost surely the statistic can be arbitrarily large as t → ∞, enabling the test to correctly reject H0.
Statistical test of level α: Compute ‖√t Σ_{N1,N2}^{-1/2} S_{N1,N2}‖₁, choose the threshold γ corresponding to the (1 − α)-quantile of Naka(1/2, 1, J), and reject the null hypothesis whenever ‖√t Σ_{N1,N2}^{-1/2} S_{N1,N2}‖₁ is larger than γ.

²The pdf of the Nakagami distribution of parameters m ≥ 1/2 and ω > 0 is, for all x ≥ 0, f(x, m, ω) = (2 m^m)/(Γ(m) ω^m) x^{2m−1} exp(−(m/ω) x²), where Γ is the Gamma function.

3.2 Optimizing test locations to improve power
As in [17], we can optimize the test locations V and kernel parameters (jointly referred to as θ) by maximizing a lower bound on the test power, which offers a simple objective function for fast parameter tuning. We apply the same regularization as in [17] to the test statistic for stability of the matrix inverse, adding a regularization parameter γ_{N1,N2} > 0 which goes to 0 as t goes to infinity, giving L1-ME[X, Y] := ‖√t (Σ_{N1,N2} + γ_{N1,N2} I)^{-1/2} S_{N1,N2}‖₁ (see proof in supp. mat. D.1).
Proposition 3.3. Let K be a uniformly bounded family of measurable kernels k: R^d × R^d → R (i.e., there exists K < ∞ such that sup_{k∈K} sup_{(x,y)∈(R^d)²} |k(x, y)| ≤ K). Let V be a collection in which each element is a set of J test locations. Assume that c := sup_{V∈V, k∈K} ‖Σ^{-1/2}‖ < ∞. Then the test power P(λ̂_t ≥ γ) of the L1-ME test satisfies P(λ̂_t ≥ γ) ≥ L(t), where L(t) is an explicit lower bound built from exponential concentration terms in t, γ_{N1,N2}, N1, N2 and J, with positive constants K1, K2, K3 depending only on K, J and c (its full expression is given in supp. mat. D.1). The parameter λ_t := ‖√t Σ^{-1/2} S‖₁ is the population counterpart of λ̂_t := ‖√t (Σ_{N1,N2} + γ_{N1,N2} I)^{-1/2} S_{N1,N2}‖₁, where S = E_{x,y}(S_{N1,N2}) and Σ = E_{x,y}(Σ_{N1,N2}). Moreover, for large t, L(t) is increasing in λ_t.
Proposition 3.3 suggests that it is sufficient to maximize λ_t to maximize a lower bound on the L1-ME test power. The statistic λ_t for this test depends on a set of test locations V and a kernel parameter σ. We set θ* := {V, σ} = arg max_θ λ_t = arg max_θ ‖√t Σ^{-1/2} S‖₁. 
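As a rough sketch of how the resulting test could be run in practice, the following is our simplified illustration with a Gaussian kernel: the small ridge stands in for the regularization parameter γ_{N1,N2}, and the Naka(1/2, 1, J) quantile is estimated by Monte Carlo rather than the closed-form density or the CDF estimate used in the paper. Names and constants are ours.

```python
import numpy as np

def z_features(S, T, sigma=1.0):
    # Z_i = (k(s_i, T_1), ..., k(s_i, T_J)) with a Gaussian kernel
    sq = ((S[:, None, :] - T[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def l1_me_statistic(X, Y, T, sigma=1.0, reg=1e-5):
    # ||sqrt(t) Sigma^{-1/2} S_{N1,N2}||_1, with `reg` playing the role
    # of the regularization parameter gamma_{N1,N2}
    ZX, ZY = z_features(X, T, sigma), z_features(Y, T, sigma)
    N1, N2 = len(X), len(Y)
    t = N1 + N2
    rho = N1 / t
    S = ZX.mean(0) - ZY.mean(0)
    Sigma = np.cov(ZX, rowvar=False) / rho + np.cov(ZY, rowvar=False) / (1 - rho)
    Sigma += reg * np.eye(len(T))
    w, V = np.linalg.eigh(Sigma)            # Sigma^{-1/2} via eigendecomposition
    root_inv = (V / np.sqrt(w)) @ V.T
    return np.abs(np.sqrt(t) * root_inv @ S).sum()

def naka_threshold(J, alpha=0.01, n_mc=200_000, seed=0):
    # Under H0 the statistic tends to Naka(1/2, 1, J): a sum of J i.i.d.
    # Nakagami(m=1/2, omega=1) variables, i.e. absolute values of N(0, 1);
    # estimate its (1 - alpha)-quantile by Monte Carlo
    rng = np.random.default_rng(seed)
    sums = np.abs(rng.standard_normal((n_mc, J))).sum(axis=1)
    return np.quantile(sums, 1 - alpha)

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (1000, 2))          # P
Y = rng.normal(1.0, 1.0, (1200, 2))          # Q: mean shift, with N1 != N2
T = rng.normal(0.5, 1.0, (3, 2))             # J = 3 random test locations
stat, thr = l1_me_statistic(X, Y, T), naka_threshold(J=3)
print(stat > thr)                            # the test rejects H0 here
```

The J × J eigendecomposition makes the cost of the whitening step negligible next to the O(Jt) feature computation, matching the linear-time claim.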
As proposed in [14], we can maximize a proxy test power to optimize θ: it does not affect H0 and H1 as long as the data used for parameter tuning and for testing are disjoint.

3.3 Using smooth characteristic functions (SCF)

Like the ME statistic, the SCF statistic estimates the L2 distance between well-chosen distribution representatives. Here, the representatives of the distributions are the convolutions of their characteristic functions with the kernel k, assumed translation-invariant. [5] use them to detect differences between distributions in the frequency domain. We show that the L1 version (denoted d_{L1,Λ}) is a metric on the space of Borel probability measures with integrable characteristic functions, such that if α_n converges weakly towards α, then d_{L1,Λ}(α_n, α) → 0 (see supp. mat. A.2). Let us introduce the test statistics in the frequency domain based respectively on the ℓ2 norm and on the ℓ1 norm, which lead to consistent tests:

d̂²_{ℓ2,Λ,J}[X, Y] := ‖√n S̄_n‖²₂   and   d̂_{ℓ1,Λ,J}[X, Y] := ‖√n S̄_n‖₁   (12)

where S̄_n := (1/n) Σ_{i=1}^n z_i, z_i := [f̂(x_i) sin(x_i^T T_j) − f̂(y_i) sin(y_i^T T_j), f̂(x_i) cos(x_i^T T_j) − f̂(y_i) cos(y_i^T T_j)]_{j=1}^J ∈ R^{2J}, and f̂ is the inverse Fourier transform of k. We show that, at the same level α, using the ℓ1 norm in the frequency domain provides a better power with high probability (see supp. mat. E.1):
Proposition 3.4. Let α ∈ ]0, 1[, δ > 0 and J ≥ 2. Let {T_j}_{j=1}^J be sampled i.i.d. from the distribution Λ, and let X := {x_i}_{i=1}^n and Y := {y_i}_{i=1}^n be i.i.d. samples from P and Q respectively. Let us denote γ₁ the (1 − α)-quantile of the asymptotic null distribution of d̂_{ℓ1,Λ,J}[X, Y] and γ₂ the (1 − α)-quantile of the asymptotic null distribution of d̂²_{ℓ2,Λ,J}[X, Y]. Under the alternative hypothesis, almost surely, there exists N ≥ 1 such that for all n ≥ N, with a probability of at least 1 − δ we have:

d̂²_{ℓ2,Λ,J}[X, Y] > γ₂  ⟹  d̂_{ℓ1,Λ,J}[X, Y] > γ₁   (13)

We now adapt the construction of the L1-ME test to the frequency domain to avoid computational issues with the quantiles of the asymptotic null distribution:

L1-SCF[X, Y] := ‖ √t Σ_{N1,N2}^{-1/2} S_{N1,N2} ‖₁   (14)

with Σ_{N1,N2} and S_{N1,N2} defined as in the L1-ME statistic, with a new expression for Z^i_X (and Z^j_Y):

Z^i_X = ( cos(T_1^T x_i) f̂(x_i), ..., cos(T_J^T x_i) f̂(x_i), sin(T_1^T x_i) f̂(x_i), ..., sin(T_J^T x_i) f̂(x_i) ) ∈ R^{2J}

From this statistic, we build a consistent test. Indeed, a proof analogous to that of Proposition 3.2 gives that under H0, L1-SCF[X, Y] is a.s. asymptotically distributed as Naka(1/2, 1, 2J), and under H1, the test statistic can be arbitrarily large as t goes to infinity. Finally, a proof analogous to that of Proposition 3.3 shows that we can optimize the test locations and the kernel parameter to improve the power as well.

4 Experimental study

We now run empirical comparisons of our ℓ1-based tests to their ℓ2 counterparts, the state-of-the-art kernel-based two-sample tests. We study both toy and real problems. We use the isotropic Gaussian kernel class K_g. We call L1-opt-ME and L1-opt-SCF the tests based respectively on mean embeddings and smooth characteristic functions proposed in this paper, when optimizing test locations and the Gaussian width on a separate training set of the same size as the test set. We also denote L1-grid-ME and L1-grid-SCF the versions where only the Gaussian width is optimized, by a grid search, and locations are randomly drawn from a multivariate normal distribution. We write ME-full and SCF-full for the tests of [17], also fully optimized according to their criteria. MMD-quad (quadratic-time) and MMD-lin (linear-time) refer to the MMD-based tests of [11], where, to ensure a fair comparison, the kernel width is also set to maximize the test power following [14]. For MMD-quad, as its null distribution is an infinite sum of weighted chi-squared variables (no closed-form quantiles), we approximate the null distribution with 200 random permutations in each trial.
In all the following experiments, we repeat each problem 500 times. For synthetic problems, we generate new samples from the specified P, Q distributions in each trial. For the first real problem (Higgs dataset), as the dataset is big enough, we use new samples from the two distributions for each trial. For the second and third real problems (fast food and text datasets), samples are split randomly into train and test sets in each trial. In all the simulations we report an empirical estimate of the type-I error when H0 holds and of the type-II error when H1 holds. We set α = 0.01. The code is available at https://github.com/meyerscetbon/l1_two_sample_test.
How to realize ℓ1-based tests? The asymptotic distributions of the statistics are sums of i.i.d. Nakagami variables. [8] give a closed form for the probability density function. As the formula is not simple, we can also derive an estimate of the CDF (see supp. mat. F.1).
Optimization. For a fair comparison between our tests and those of [17], we use the same initialization of the test locations³. For the ME-based tests, we initialize the test locations with realizations from two multivariate normal distributions fitted to samples from P and Q, and for the initialization of the SCF-based tests, we use the standard normal distribution. The regularization parameter is set to γ_{N1,N2} = 10⁻⁵. The computational costs for our proposed tests are the same as those of [17]: with t samples, optimization is O(J³ + dJt) per gradient-ascent iteration and testing is O(J³ + Jt + dJt) (see supp. mat. Table 3).
The experiments on synthetic problems mirror those of [17] to make a fair comparison between the prior art and the proposed methods.
Test power vs. 
sample size: We consider four synthetic problems: Same Gaussian (SG, dim = 50), Gaussian mean difference (GMD, dim = 100), Gaussian variance difference (GVD, dim = 30), and Blobs. Table 1 summarizes the specifications of P and Q. In the Blobs problem, P and Q are mixtures of Gaussian distributions on a 4 × 4 grid in R². This problem is challenging, as the difference between P and Q is encoded at a much smaller length scale than the global structure, as explained in [14]. We set J = 5 in this experiment.

Table 1: Synthetic problems. H0 holds only in SG.
Data | P | Q
SG | N(0, Id) | N(0, Id)
GMD | N(0, Id) | N((1, 0, ..., 0)^T, Id)
GVD | N(0, Id) | N(0, diag(2, 1, ..., 1))
Blobs | Mixture of 16 Gaussians in R² as in [17]

³[17]: github.com/wittawatj/interpretable-test

Figure 1: Plots of type-I/type-II errors against the test sample size nte in the four synthetic problems.

Figure 1 shows the type-I error (for the SG problem) and the test power (for the GMD, GVD and Blobs problems) as a function of test sample size. In the SG problem, the type-I error roughly stays at the specified level α for all tests except the L1-ME tests, which reject the null at a rate below the specified level α. These tests are therefore more conservative here.
GMD with 100 dimensions is an easy problem for L1-opt-ME, L1-opt-SCF, ME-full and MMD-quad, while the SCF-full test requires many samples to achieve optimal test power. In the GMD, GVD and Blobs cases, L1-opt-ME and L1-opt-SCF achieve substantially higher test power than L1-grid-ME and L1-grid-SCF, respectively: optimizing the test locations brings a clear benefit. Remarkably, L1-opt-SCF consistently outperforms the quadratic-time MMD-quad up to 2 500 samples in the GVD case. SCF variants perform significantly better than ME variants on the Blobs problem, as the difference between P and Q is localized in the frequency domain. 
For the same reason, L1-opt-SCF does much better than the quadratic-time MMD up to 3 000 samples, as the latter represents a weighted distance between characteristic functions integrated across the frequency domain, as explained in [29]. We also consider a more difficult GMD problem to distinguish the power of the proposed tests from ME-full, as all reach maximal power. L1-opt-ME then performs better than ME-full, its ℓ2 counterpart, as it needs less data to achieve good power (see supp. mat. F.3).
Test power vs. dimension d: In fig. 2, we study how the dimension of the problem affects the type-I error and test power of our tests. We consider the same synthetic problems SG, GMD and GVD, fix the test sample size to 10 000, set J = 5, and vary the dimension. Given that these experiments explore large dimensions and a large number of samples, computing MMD-quad was too expensive.
In the SG problem, we observe that the L1-ME tests become more conservative as dimension increases, while the other tests maintain the type-I error at roughly the specified significance level α = 0.01. In the GMD problem, we note that the proposed tests achieve maximal test power, without type-II error, whatever the dimension, while SCF-full loses power as dimension increases. However, this holds only with optimization of the test locations, as shown by the test power of L1-grid-ME and L1-grid-SCF, which drops as dimension increases. Moreover, the performance of MMD-lin degrades quickly with increasing dimension, as expected from [25]. Finally, in the GVD problem, all tests fail to keep a good test power as the dimension increases, except L1-opt-SCF, which has a very low type-II error for all dimensions. These results echo those obtained by [34]. 
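For reference, the reported error rates are rejection frequencies over repeated trials. A self-contained toy version of that protocol (ours), using a simplified unnormalized ℓ1 statistic in the spirit of eq. 8, with fixed random locations and a permutation threshold instead of the asymptotic Nakagami quantiles:

```python
import numpy as np

def stat_l1(X, Y, T, sigma=1.0):
    # sqrt(n) * sum_j |mu_X(T_j) - mu_Y(T_j)| with a Gaussian kernel
    def emb(S):
        sq = ((S[:, None, :] - T[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2)).mean(0)
    return np.sqrt(len(X)) * np.abs(emb(X) - emb(Y)).sum()

def rejects(X, Y, T, alpha=0.05, n_perm=100, rng=None):
    # Permutation threshold: re-split the pooled sample to emulate H0
    rng = rng if rng is not None else np.random.default_rng(0)
    pooled, n = np.vstack([X, Y]), len(X)
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        null.append(stat_l1(pooled[perm[:n]], pooled[perm[n:]], T))
    return stat_l1(X, Y, T) > np.quantile(null, 1 - alpha)

rng = np.random.default_rng(0)
T = rng.normal(size=(5, 2))              # J = 5 fixed random locations
# Type-II error estimate: share of trials where a true difference is missed
misses, trials = 0, 20
for _ in range(trials):
    X = rng.normal(0.0, 1.0, (200, 2))
    Y = rng.normal(1.0, 1.0, (200, 2))   # GMD-like mean shift
    misses += not rejects(X, Y, T, rng=rng)
print(misses / trials)                   # low type-II error on this easy problem
```

The paper's protocol differs in scale (500 repetitions, α = 0.01, optimized locations and widths), but the accounting of type-I/type-II errors is the same.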
Indeed, [34] study a class of two-sample test statistics based on inter-point distances, and they show the benefits of using the ℓ1 norm over the Euclidean distance and the Maximum Mean Discrepancy (MMD) when the dimensionality goes to infinity. For this class of test statistics, they characterize the asymptotic power loss w.r.t. the dimension and show that the ℓ1 norm is beneficial compared to the ℓ2 norm, provided that the summation of discrepancies between marginal univariate distributions is large enough.

Figure 2: Plots of type-I/type-II error against the dimension in three synthetic problems: SG (Same Gaussian), GMD (Gaussian Mean Difference), and GVD (Gaussian Variance Difference).

Informative features: In Figure 3 we replicate the experiment of [17], showing that the selected locations capture multiple modes in the ℓ1 case, as in the ℓ2 case (details in supp. mat. F.4). The figure shows that the objective function λ̂ʳ_t(T1, T2) used to position the second test location T2 has a maximum far from the chosen position of the first test location T1.
Real Data 1, Higgs: The first real problem is the Higgs dataset [21], described in [2]: distinguishing signatures of Higgs bosons from the background. We use a two-sample test on 4 derived features as in [5]. We compare, for various sample sizes, the performance of the proposed tests with those of [17]. We do not study the MMD-quad test, as its computation is too expensive with 10 000 samples. To make the problem harder, we only consider J = 3 locations. Fig. 4 shows a clear benefit of the optimized ℓ1-based tests, in particular for SCF (L1-opt-SCF) compared to its ℓ2 counterpart (SCF-full). 
Optimizing the locations is important, as L1-opt-SCF and L1-opt-ME perform much better than their grid versions (which are comparable to the tests of [5]).

Figure 3: Illustrating interpretable features, replicating in the ℓ1 case the figure of [17]. A contour plot of λ̂ʳ_t(T1, T2) as a function of T2, when J = 2 and T1 is fixed. The red and black dots represent the samples from the P and Q distributions, and the big black triangle the position of T1 (complete figure in supp. mat. F.4).

Real Data 2, Fast food: We use a Kaggle dataset listing the locations of over 10,000 fast food restaurants across America⁴. We consider the 6 most frequent brands in mainland USA: Mc Donald's, Burger King, Taco Bell, Wendy's, Arby's and KFC. We benchmark the various two-sample tests to test whether the spatial distribution (in R²) of restaurants differs across brands. This is a non-trivial question, as it depends on the marketing strategy of each brand. We compare the distribution of Mc Donald's restaurants with the others. We also compare the distribution of Mc Donald's restaurants with itself to evaluate the level of the tests (see supp. mat. Table 5). The number of samples differs across the distributions; hence, to perform the tests from [17], we randomly subsample the largest distribution. We use J = 3 as the number of locations.

⁴www.kaggle.com/datafiniti/fast-food-restaurants

Figure 4: Higgs dataset: plots of type-II errors against the test sample size nte.

Table 2: Fast food dataset: type-II errors for distinguishing the distribution of fast food restaurants. α = 0.01, J = 3. The number in brackets denotes the sample size of the distribution on the right. We consider MMD-quad as the gold standard.
Problem | L1-opt-ME | L1-grid-ME | L1-opt-SCF | L1-grid-SCF | ME-full | SCF-full | MMD-quad
McDo vs Burger King (1141) | 0.112 | 0.428 | 0.426 | 0.960 | 0.170 | 0.094 | 0.184
McDo vs Taco Bell (877) | 0.554 | 0.710 | 0.624 | 0.834 | 0.684 | 0.638 | 0.666
McDo vs Wendy's (733) | 0.156 | 0.752 | 0.246 | 0.942 | 0.416 | 0.624 | 0.208
McDo vs Arby's (517) | 0.000 | 0.006 | 0.004 | 0.468 | 0.004 | 0.012 | 0.004
McDo vs KFC (429) | 0.912 | 1.00 | 0.990 | 0.998 | 0.996 | 0.856 | 0.980

Table 2 summarizes the type-II errors of the tests. Note that it is not clear that the distributions must differ, as two brands sometimes compete directly and target similar locations. We consider MMD-quad as the gold standard to decide whether distributions differ or not. The three cases for which there seems to be a difference are Mc Donald's vs Burger King, Mc Donald's vs Wendy's, and Mc Donald's vs Arby's. Overall, we find that the optimized L1-opt-ME agrees best with this gold standard. The Mc Donald's vs Arby's problem seems to be an easy one, as all tests reach maximal power, except for the L1-grid-SCF test, which shows the gain of power brought by the optimization. In the Mc Donald's vs Wendy's problem, the L1-opt-ME test outperforms the ℓ2 tests and even the quadratic-time MMD. Finally, all the tests fail to discriminate Mc Donald's vs KFC. The data provide no evidence that these brands pursue different strategies to choose locations.
In the Mc Donald's vs Burger King and Mc Donald's vs Wendy's problems, the optimized version of the proposed test based on mean embeddings outperforms the grid version. This success implies that the learned locations are each informative, and we plot them (see supp. mat. Figure 8) to investigate the interpretability of the L1-opt-ME test. 
The \ufb01gure shows that the procedure narrows on speci\ufb01c\nregions of the USA to \ufb01nd differences between distributions of restaurants.\nReal Data 3, text: For a high-dimension problem, we consider the problem of distinguishing the\nnewsgroups text dataset [18] (details in supp. Mat. F.5). Compared to their `2 counterpart, `1-\noptimized tests bring clear bene\ufb01ts and separate all topics of articles based on their word distribution.\nDiscussion: Our theoretical results suggest it is always bene\ufb01cial for statistical power to build tests on\n`1 norms rather than `2 norm of differences between kernel distribution representatives (Propositions\n3.1, 3.4). In practice, however, optimizing test locations with `1-norm tests leads to non-smooth\nobjective functions that are harder to optimize. Our experiments con\ufb01rm the theoretical bene\ufb01t of the\n`1-based framework. The bene\ufb01t is particularly pronounced for a large number J of test locations\n\u2013as the difference between `1 and `2 norms increases with dimension (see in supp. mat. Lemmas\n8, 12)\u2013 as well as for large dimension of the native space (Figure 2). The bene\ufb01t of `1 distances\nfor two-sample testing in high dimension has also been reported by [34], though their framework\ndoes not link to kernel embeddings or to the convergence of probability measures. Further work\nshould consider extending these results to goodness-of-\ufb01t testing, where the L1 geometry was shown\nempirically to provide excellent performance [16].\n\n5 Conclusion\n\nIn this paper, we show that statistics derived from the Lp distances between well-chosen distribution\nrepresentatives are well suited for the two-sample problem as these distances metrize the weak conver-\ngence (Theorem 2.1). 
We then compare the power of the tests introduced in [5] and their ℓ1 counterparts, and we show that ℓ1-based statistics have better power with high probability (Propositions 3.1, 3.4). As with state-of-the-art Euclidean approaches, the framework leads to tractable computations and learns interpretable locations of where the distributions differ. Empirically, on all 4 synthetic and 3 real problems investigated, the ℓ1 geometry gives clear benefits compared to the Euclidean geometry. The L1 distance is known to be well suited for densities, to control differences or estimation errors [7]. It is also beneficial for kernel embeddings of distributions.

Acknowledgments This work was funded by the DirtyDATA ANR grant (ANR-17-CE23-0018). We would also like to thank Zoltán Szabó from École Polytechnique for crucial suggestions, and acknowledge hardware donations from NVIDIA Corporation.

References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[2] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

[3] L. Baringhaus and C. Franz. On a new multivariate two-sample test. Journal of Multivariate Analysis, 88(1):190–206, 2004.

[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.

[5] K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981–1989, 2015.

[6] F. Cucker and S. Smale. On the mathematical foundations of learning.
Bulletin of the American Mathematical Society, 39(1):1–49, 2002.

[7] L. Devroye and L. Györfi. Nonparametric Density Estimation: The L1 View. Wiley, 1985.

[8] P. Dharmawansa, N. Rajatheva, and K. Ahmed. On the distribution of the sum of Nakagami-m random variables. IEEE Transactions on Communications, 55(7):1407–1416, 2007.

[9] M. Fromont, M. Lerasle, P. Reynaud-Bouret, et al. Kernels based tests with non-asymptotic bootstrap approaches for two-sample problems. In Conference on Learning Theory, page 23, 2012.

[10] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems, pages 489–496, 2008.

[11] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513–520, 2007.

[12] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems, pages 673–681, 2009.

[13] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[14] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, page 1205, 2012.

[15] Z. Harchaoui, E. Moulines, and F. R. Bach. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems, page 609, 2008.

[16] J. Huggins and L. Mackey. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pages 1899–1909, 2018.

[17] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton.
Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pages 181–189, 2016.

[18] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339, 1995.

[19] Q. Le, T. Sarlós, and A. Smola. Fastfood: Computing Hilbert space expansions in loglinear time. In International Conference on Machine Learning, pages 244–252, 2013.

[20] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2203–2213, 2017.

[21] M. Lichman et al. UCI machine learning repository, 2013.

[22] J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In Advances in Neural Information Processing Systems, pages 829–837, 2015.

[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.

[24] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[25] A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. A. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, pages 3571–3577, 2015.

[26] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, pages 2263–2291, 2013.

[27] B. Simon. Trace Ideals and Their Applications. Number 120. American Mathematical Society, 2005.

[28] C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions.
arXiv preprint arXiv:1604.05251, 2016.

[29] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517, 2010.

[30] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389, 2011.

[31] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(Nov):67–93, 2001.

[32] G. J. Székely and M. L. Rizzo. Testing for equal distributions in high dimension. InterStat, 5(16.10):1249–1272, 2004.

[33] W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, pages 755–763, 2013.

[34] C. Zhu and X. Shao. Interpoint distance based two sample tests in high dimension. arXiv preprint arXiv:1902.07279, 2019.