{"title": "Testing and Learning on Distributions with Symmetric Noise Invariance", "book": "Advances in Neural Information Processing Systems", "page_first": 1343, "page_last": 1353, "abstract": "Kernel embeddings of distributions and the Maximum Mean Discrepancy (MMD), the resulting distance between distributions, are useful tools for fully nonparametric two-sample testing and learning on distributions. However, it is rare that all possible differences between samples are of interest -- discovered differences can be due to different types of measurement noise, data collection artefacts or other irrelevant sources of variability. We propose distances between distributions which encode invariance to additive symmetric noise, aimed at testing whether the assumed true underlying processes differ. Moreover, we construct invariant features of distributions, leading to learning algorithms robust to the impairment of the input distributions with symmetric additive noise.", "full_text": "Testing and Learning on Distributions with\n\nSymmetric Noise Invariance\n\nHo Chung Leon Law\nDepartment of Statistics\nUniversity of Oxford\n\nhlaw@stats.ox.ac.uk\n\nChristopher Yau\n\nCentre for Computational Biology\n\nUniversity of Birmingham\nc.yau@bham.ac.uk\n\nDino Sejdinovic\n\nDepartment of Statistics\nUniversity of Oxford\n\ndino.sejdinovic@stats.ox.ac.uk\n\nAbstract\n\nKernel embeddings of distributions and the Maximum Mean Discrepancy (MMD),\nthe resulting distance between distributions, are useful tools for fully nonparametric\ntwo-sample testing and learning on distributions. However, it is rare that all possible\ndifferences between samples are of interest \u2013 discovered differences can be due to\ndifferent types of measurement noise, data collection artefacts or other irrelevant\nsources of variability. 
We propose distances between distributions which encode\ninvariance to additive symmetric noise, aimed at testing whether the assumed\ntrue underlying processes differ. Moreover, we construct invariant features of\ndistributions, leading to learning algorithms robust to the impairment of the input\ndistributions with symmetric additive noise.\n\n1\n\nIntroduction\n\nThere are many sources of variability in data, and not all of them are pertinent to the questions that\na data analyst may be interested in. Consider, for example, a nonparametric two-sample testing\nproblem, which has recently been attracting significant research interest, especially in the context\nof kernel embeddings of distributions [2, 5, 7]. We observe samples {X1j}_{j=1}^{N1} and {X2j}_{j=1}^{N2} from\ntwo data generating processes P1 and P2, respectively, and would like to test the null hypothesis that\nP1 = P2 without making any parametric assumptions on these distributions. With a large sample size,\nthe minutiae of the two data generating processes are uncovered (e.g. slightly different calibration\nof the data collecting equipment, different numerical precision), and we ultimately reject the null\nhypothesis, even if the sources of variation across the two samples may be irrelevant for the analysis.\nSimilarly, we may be interested in learning on distributions [14, 23, 24], where the appropriate\nlevel of granularity in the data is distributional. For example, each label yi in supervised learning\nis associated with a whole bag of observations Bi = {Xij}_{j=1}^{Ni}, assumed to come from a probability\ndistribution Pi, or we may be interested in clustering such bags of observations. 
Again, nonparametric\ndistances used in such contexts to facilitate a learning algorithm on distributions, such as the Maximum\nMean Discrepancy (MMD) [5], can be sensitive to irrelevant sources of variation and may lead to\nsuboptimal or even misleading results, in which case building predictors which are invariant to noise\nis of interest.\nWhile it may be tempting to revert to a parametric setup and work with simple, easy-to-interpret\nmodels, we argue that a different approach is possible: we stay within a nonparametric framework,\nexploit the irregular and complicated nature of real-life distributions, and encode invariances to sources\nof variation assumed to be irrelevant. In this contribution, we focus on invariances to symmetric\nadditive noise on each of the data generating distributions. Namely, assume that the i-th sample\n{Xij}_{j=1}^{Ni} we observe does not follow the distribution Pi of interest but instead its convolution Pi \u22c6 Ei\nwith some unknown noise distribution Ei assumed to be symmetric about 0 (we also require that it\nhas a positive characteristic function). We would like to assess the differences between Pi and Pi\u2032\nwhile allowing Ei and Ei\u2032 to differ in an arbitrary way. We investigate two approaches to this problem:\n(1) measuring the degree of asymmetry of the paired differences {Xij \u2212 Xi\u2032j}, and (2) comparing\nthe phase functions of the corresponding samples. While the first approach is simpler and presents\na sensible solution for the two-sample testing problem, we demonstrate that phase functions give a\nmuch better gauge on the relative comparisons between bags of observations, as required for learning\non distributions.\nThe paper is outlined as follows. In section 2, we provide an overview of the background. In section 3,\nwe provide details of the construction and implementation of phase features. 
In section 4, we discuss\nthe approach based on asymmetry in paired differences for two-sample testing with invariances.\nSection 5 provides experiments on synthetic and real data, before concluding in section 6.\n\n2 Background and Setup\n\nWe will say that a random vector E on Rd is a symmetric positive definite (SPD) component if its\ncharacteristic function is positive, i.e. \u03c6E(\u03c9) = E[exp(i\u03c9\u22a4E)] > 0, \u2200\u03c9 \u2208 Rd. This means\nthat E is (1) symmetric about zero, i.e. E and \u2212E have the same distribution, and (2) if it has a\ndensity, this density must be a positive definite function [20]. Note that many distributions used to\nmodel additive noise, including the spherical zero-mean Gaussian distribution, as well as multivariate\nLaplace, Cauchy or Student's t (but not uniform), are all SPD components.\nFollowing terminology similar to that of [3], we will say that a random vector X on Rd is\ndecomposable if its characteristic function can be written as \u03c6X = \u03c6X0 \u03c6E, with \u03c6E > 0. Thus,\nif X can be written in the form X = X0 + E, where X0 and E are independent and E is an\nSPD noise component, then X is decomposable. We will say that X is indecomposable if it is\nnot decomposable. In this paper, we will assume that mostly the indecomposable components of\ndistributions are of interest, and will construct tools to directly measure differences between these\nindecomposable components, encoding invariance to other sources of variability. The class of Borel\nprobability measures on Rd will be denoted M1+(Rd), while the class of indecomposable probability\nmeasures will be denoted by I(Rd) \u2286 M1+(Rd).\n\n2.1 Kernel Embeddings, Fourier Features and Learning on Distributions\n\nFor any positive definite function k : X \u00d7 X \u2192 R, there exists a unique reproducing kernel Hilbert\nspace (RKHS) Hk of real-valued functions on X. 
Function k(\u00b7, x) is an element of Hk and represents\nevaluation at x, i.e. \u27e8f, k(\u00b7, x)\u27e9Hk = f(x), \u2200f \u2208 Hk, \u2200x \u2208 X. The kernel mean embedding\n(cf. [15] for a recent review) of a probability measure P is defined by \u03bcP = E_{X\u223cP}[k(\u00b7, X)] =\n\u222b_X k(\u00b7, x) dP(x). The Maximum Mean Discrepancy (MMD) between probability measures P and Q\nis then given by \u2016\u03bcP \u2212 \u03bcQ\u2016Hk. For shift-invariant kernels on Rd, using Bochner's characterisation of\npositive definiteness [26, 6.2], the squared MMD can be written as a weighted L2-distance between\ncharacteristic functions [22, Corollary 4]\n\n\u2016\u03bcP \u2212 \u03bcQ\u2016\u00b2Hk = \u222b_{Rd} |\u03c6P(\u03c9) \u2212 \u03c6Q(\u03c9)|\u00b2 d\u039b(\u03c9),    (1)\n\nwhere \u039b is the non-negative spectral measure (inverse Fourier transform) of kernel k as a function of\nx \u2212 y, while \u03c6P(\u03c9) and \u03c6Q(\u03c9) are the characteristic functions of probability measures P and Q.\nBochner's theorem is also used to construct random Fourier features (RFF) [19] for fast approximations to kernel methods, in order to approximate a pre-specified shift-invariant kernel by a finite-dimensional explicit feature map. If we can draw samples from its spectral measure \u039b, we can\napproximate k by\u00b9\n\nk\u0302(x, y) = (1/m) \u03a3_{j=1}^m [cos(\u03c9j\u22a4x) cos(\u03c9j\u22a4y) + sin(\u03c9j\u22a4x) sin(\u03c9j\u22a4y)] = \u27e8\u03c6(x), \u03c6(y)\u27e9_{R2m},\n\nwhere \u03c91, . . . , \u03c9m \u223c \u039b and \u03c6(x) := \u221a(1/m) [cos(\u03c91\u22a4x), sin(\u03c91\u22a4x), . . . , cos(\u03c9m\u22a4x), sin(\u03c9m\u22a4x)]. Thus, the\nexplicit computation of the kernel matrix is not needed and the computational complexity is\nreduced. 
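As a concrete sketch of this RFF construction (illustrative only, not the authors' code; the Gaussian kernel, the bandwidth sigma = 1, and m = 2000 frequencies are arbitrary choices made here for the example):

```python
import numpy as np

def rff_features(X, omegas):
    """Map each row x of X to sqrt(1/m) * [cos(w1.x), sin(w1.x), ..., cos(wm.x), sin(wm.x)]."""
    m = omegas.shape[0]
    proj = X @ omegas.T                          # inner products omega_j^T x, shape (n, m)
    feats = np.empty((X.shape[0], 2 * m))
    feats[:, 0::2] = np.cos(proj)
    feats[:, 1::2] = np.sin(proj)
    return feats / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, sigma = 3, 2000, 1.0
# The spectral measure of the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)) is N(0, I / sigma^2).
omegas = rng.normal(scale=1.0 / sigma, size=(m, d))

x = rng.normal(size=d)
y = rng.normal(size=d)
exact_k = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
phi = rff_features(np.stack([x, y]), omegas)
approx_k = float(phi[0] @ phi[1])              # <phi(x), phi(y)> approximates k(x, y)
```

Since cos(a)cos(b) + sin(a)sin(b) = cos(a − b), the inner product of the feature maps is a Monte Carlo average of cos(ωj⊤(x − y)), whose expectation under the spectral measure recovers the kernel.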
This also allows computation with the approximate, finite-dimensional embeddings\n\u03bc\u0303P = \u03a6(P) = E_{X\u223cP} \u03c6(X) \u2208 R2m, which can be understood as the evaluations (real and complex\npart stacked together) of the characteristic function \u03c6P at frequencies \u03c91, . . . , \u03c9m. We will refer to\nthe approximate embeddings \u03a6(P) as Fourier features of distribution P.\nKernel embeddings can be used for supervised learning on distributions. Assume we have a training\nset {Bi, yi}_{i=1}^n, where input Bi = {xij}_{j=1}^{Ni} is a bag of samples taking values in X, and yi is\na response. Given a kernel k : X \u00d7 X \u2192 R, we first map each Bi to the empirical embedding\n\u03bcP\u0302i = (1/Ni) \u03a3_{j=1}^{Ni} k(\u00b7, xij) \u2208 Hk, and then can apply any positive definite kernel on Hk as the kernel\non bag inputs, e.g. the linear kernel K\u0303(Bi, Bi\u2032) = \u27e8\u03bcP\u0302i, \u03bcP\u0302i\u2032\u27e9Hk, in order to perform classification [14]\nor regression [24]. Approximate kernel embeddings have also been applied in this context [23].\n\n3 Phase Discrepancy and Phase Features\n\nWhile MMD and kernel embeddings are related to characteristic functions, and indeed the same\nconnection forms a basis for fast approximations to kernel methods using random Fourier features\n[19], the relevant notion in our context is the phase function of a probability measure, recently used\nfor nonparametric deconvolution by [3]. In this section, we overview this formalism. Based on\nthe empirical phase functions, we will then derive and investigate a hypothesis testing and learning\nframework using phase features of distributions.\nIn nonparametric deconvolution [3], the goal is to estimate the density function f0 of a univariate r.v.\nX0, but in general we only have noisy data samples X1, . . . , Xn iid\u223c X = X0 + E, where E denotes\nan independent noise term. 
Even though the distribution of E is unknown, making the assumption\nthat E is an SPD noise component, and that X0 is indecomposable, i.e. X0 itself does not contain\nany SPD noise components, [3] show that it is possible to obtain consistent estimates of f0.\nThey distinguish between the symmetric noise and the underlying indecomposable component by\nmatching phase functions, defined as\n\n\u03c1X(\u03c9) = \u03c6X(\u03c9) / |\u03c6X(\u03c9)|,\n\nwhere \u03c6X(\u03c9) denotes the characteristic function of X. Observe that |\u03c1X(\u03c9)| = 1, and thus we\nare effectively removing the amplitude information from the characteristic function. For an SPD\nnoise component E, the phase function is \u03c1E(\u03c9) \u2261 1. But then, since \u03c6X = \u03c6X0 \u03c6E, we have that\n\u03c1X0 = \u03c1X = \u03c6X/|\u03c6X|, i.e. the phase function is invariant to additive SPD noise components. This\nmotivates us to construct explicit feature maps of distributions with the same property and, similarly\nto the motivation of [3], we argue that real-world distributions of interest often exhibit a certain amount\nof irregularity, and it is exactly this irregularity which is exploited in our methodology.\nIn analogy to the MMD, we first define the phase discrepancy (PhD) as a weighted L2-distance\nbetween the phase functions:\n\nPhD(X, Y) = \u222b_{Rd} |\u03c1X(\u03c9) \u2212 \u03c1Y(\u03c9)|\u00b2 d\u039b(\u03c9)    (2)\n\nfor some non-negative measure \u039b (w.l.o.g. a probability measure). Now suppose we write X =\nX0 + U, Y = Y0 + V, where U and V are SPD noise components. This then implies \u03c1X = \u03c1X0\nand \u03c1Y = \u03c1Y0 \u039b-everywhere, so that PhD(X, Y) = PhD(X0, Y0). It is clear then that the PhD is\n\n\u00b9 A complex feature map \u03c6(x) = \u221a(1/m) [exp(i\u03c91\u22a4x), . . .
, exp(i\u03c9m\u22a4x)] can also be used, but we follow the convention of real-valued Fourier features, since kernels of interest are typically real-valued.\n\nnot affected by additive SPD noise components, so it captures the desired invariance. However, the\nPhD for \u039b supported everywhere is in fact not a proper metric on the indecomposable probability\nmeasures I(Rd), as one can find indecomposable random variables X and Y s.t. \u03c1X = \u03c1Y and thus\nPhD(X, Y) = 0. An example is given in Appendix A.\nWhile such cases appear contrived, we hence restrict attention to a subset of indecomposable\nprobability measures P(Rd) \u2282 I(Rd) which are uniquely determined by phase functions, i.e.\n\u2200P, Q \u2208 P(Rd) : \u03c1P = \u03c1Q \u21d2 P = Q.\nWe now have the two following propositions (proofs are given in Appendix B).\nProposition 1.\n\nPhD(X, Y) = 2 \u2212 2 \u222b (E\u03be\u03c9(X) / \u2016E\u03be\u03c9(X)\u2016)\u22a4 (E\u03be\u03c9(Y) / \u2016E\u03be\u03c9(Y)\u2016) d\u039b(\u03c9),\n\nwhere \u03be\u03c9(x) = [cos(\u03c9\u22a4x), sin(\u03c9\u22a4x)]\u22a4 and \u2016\u00b7\u2016 denotes the standard L2 norm.\nProposition 2.\n\nK(PX, PY) = \u222b (E\u03be\u03c9(X) / \u2016E\u03be\u03c9(X)\u2016)\u22a4 (E\u03be\u03c9(Y) / \u2016E\u03be\u03c9(Y)\u2016) d\u039b(\u03c9)\n\nis a positive definite kernel on probability measures.\nNow, we can construct an approximate explicit feature map for kernel K. Taking a sample {\u03c9i}_{i=1}^m \u223c\n\u039b, we define \u03a8 : PX \u21a6 R2m given by \u03a8(PX) = \u221a(1/m) [E\u03be\u03c91(X)/\u2016E\u03be\u03c91(X)\u2016, . . . , E\u03be\u03c9m(X)/\u2016E\u03be\u03c9m(X)\u2016]. We will refer\nto \u03a8(\u00b7) as the phase features. 
Note that these are very similar to Fourier features, but the cos, sin-pair\ncorresponding to each frequency is normalised to have unit L2 norm. In other words, \u03a8(\u00b7) can be\nthought of as evaluations of the phase function at the selected frequencies. By construction, phase\nfeatures are invariant to additive SPD noise components. For an empirical measure, we simply have\nthe following:\n\n\u03a8(P\u0302X) = \u221a(1/m) [\u00ca\u03be\u03c91(X)/\u2016\u00ca\u03be\u03c91(X)\u2016, . . . , \u00ca\u03be\u03c9m(X)/\u2016\u00ca\u03be\u03c9m(X)\u2016],    (3)\n\nwhere we have replaced the expectations by their empirical estimates. Because \u2016\u03a8(P\u0302X)\u2016 = 1, we\ncan construct\n\nPhD\u0302(P\u0302X, P\u0302Y) = \u2016\u03a8(P\u0302X) \u2212 \u03a8(P\u0302Y)\u2016\u00b2 = 2 \u2212 2 \u03a8(P\u0302X)\u22a4\u03a8(P\u0302Y),    (4)\n\nwhich is a Monte Carlo estimator of PhD(P\u0302X, P\u0302Y). In summary, \u03a8(P\u0302) \u2208 R2m is an explicit feature\nvector of the empirical distribution which encodes invariance to additive SPD noise components\npresent in P\u00b2, as demonstrated in Figure F.1 in the Appendix. It can now be directly applied to (1)\ntwo-sample testing up to SPD components, where the distance between the phase features, i.e. an\nestimate (4) of the PhD, can be used as a test statistic, with details given in section 5.1, and (2) learning\non distributions, where we use phase features as the explicit feature map for a bag of samples.\nAlthough we have assumed an indecomposable underlying distribution so far, this assumption is\nnot strict. 
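A minimal sketch of the empirical phase features (3) and the PhD estimate (4) (not the authors' code; the Gaussian-sampled frequencies, their scale, the noise level, and the chi-squared example distributions are all illustrative choices):

```python
import numpy as np

def phase_features(bag, omegas):
    """Empirical phase features (3): per-frequency (cos, sin) means, each pair rescaled to unit norm."""
    proj = bag @ omegas.T                               # shape (N, m)
    c = np.cos(proj).mean(axis=0)
    s = np.sin(proj).mean(axis=0)
    norms = np.sqrt(c ** 2 + s ** 2)                    # modulus of the empirical characteristic function
    m = omegas.shape[0]
    return np.concatenate([c / norms, s / norms]) / np.sqrt(m)

def phd_estimate(bag_x, bag_y, omegas):
    """Estimate (4): squared distance between phase features, equal to 2 - 2 <psi_x, psi_y>."""
    px = phase_features(bag_x, omegas)
    py = phase_features(bag_y, omegas)
    return 2.0 - 2.0 * float(px @ py)

rng = np.random.default_rng(2)
omegas = rng.normal(scale=0.5, size=(300, 1))
x0 = rng.chisquare(4, size=(5000, 1))                   # an asymmetric "signal" distribution
noisy = x0 + rng.normal(scale=0.5, size=(5000, 1))      # same signal plus symmetric (SPD) noise
other = rng.chisquare(8, size=(5000, 1))                # a genuinely different signal

phd_noise_only = phd_estimate(x0, noisy, omegas)        # stays small: noise only changes the amplitude
phd_different = phd_estimate(x0, other, omegas)         # large: the phases genuinely differ
```

Dividing each (cos, sin) pair by its modulus discards the amplitude of the empirical characteristic function, which is where symmetric additive noise lives, so the signal-plus-noise bag stays close to the clean bag while a genuinely different distribution does not.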
For distribution regression, if the indecomposable assumption is invalid, given that the\nunderlying distribution is irregular, it may still be useful to encode invariance as long as the benefit\nof removing the SPD components irrelevant for learning outweighs the signal in the SPD part of\nthe distribution, i.e. there is a trade-off between SPD noise and SPD signal. In practice, the phase\nfeatures we propose can be used to encode such invariance where appropriate, or in conjunction with\nother features which do not encode invariance.\nIn order to construct the approximate mean embeddings for learning, we first compute an\nexplicit feature map by taking averages of the Fourier features, as given by \u03a6(P\u0302X) =\n\u221a(1/m) [\u00ca\u03be\u03c91(X), . . . , \u00ca\u03be\u03c9m(X)]. For phase features, we need to compute an additional normalisation term over each frequency, as in (3). To obtain the set of frequencies {\u03c9i}_{i=1}^m, we can draw\nsamples from a probability measure \u039b corresponding to an inverse Fourier transform of a shift-invariant kernel, e.g. the Gaussian kernel. However, given a supervised signal, we can also optimise a set\nof frequencies {\u03c9i}_{i=1}^m that will give us a useful representation and good discriminative performance.\nIn other words, we no longer focus on a specific shift-invariant kernel k, but are learning discriminative Fourier/phase features. \n\n\u00b2 Note that, unlike the population expression \u03a8(P), the empirical estimator \u03a8(P\u0302) will in general have a\ndistribution affected by the noise components and is thus only approximately invariant, but we observe that it\ncaptures invariance very well as long as the signal-to-noise regime remains relatively high (Section 5.1).\n\n
To do this, we can construct a neural network (NN) with special\nactivation functions and pooling layers, as shown in Algorithm D.1 and Figure D.1 in the Appendix.\n\n4 Asymmetry in Paired Differences\n\nWe now consider a separate approach to the nonparametric two-sample test, where we wish to test the\nnull hypothesis H0 : P =d Q vs. the general alternative, but we only have iid samples arising from\nX \u223c P \u22c6 E1 and Y \u223c Q \u22c6 E2, i.e.\n\nX = X0 + U,   Y = Y0 + V,\n\nwhere X0 \u223c P, Y0 \u223c Q lie in the space P(Rd) of indecomposable distributions uniquely\ndetermined by phase functions, and U and V are SPD noise components. With this setting (proof in\nAppendix B):\nProposition 3. Under the null hypothesis H0, X \u2212 Y is SPD \u27fa X0 =d Y0.\nThis motivates us to simply perform a two-sample test on X \u2212 Y and Y \u2212 X, since its rejection would\nimply rejection of X0 =d Y0, as it tests for symmetry. However, note that this is a test for symmetry\nonly, and that for consistency against all alternatives, positivity of the characteristic function would need\nto be checked separately. Now, given two i.i.d. samples {Xi}_{i=1}^n and {Yi}_{i=1}^n with n even, we split\nthe two samples into two halves and compute Wi = Xi \u2212 Yi on one half and Zi = Yi \u2212 Xi on the\nother half, and perform a nonparametric two-sample test on W and Z (which are, by construction,\nindependent of each other). The advantage of this regime is that we can use any two-sample test; in\nparticular, in this paper we will focus on the linear-time mean embedding (ME) test [7], which\nwas found to have performance similar to or better than the original MMD two-sample test [5], and\nexplicitly formulates a criterion which maximises the test power. 
We will refer to the resulting test on\npaired differences as the Symmetric Mean Embedding (SME) test.\nAlthough we have assumed here that X0, Y0 lie in the space P(Rd) of indecomposable distributions,\nin practice, the SME test would not reject if the underlying distributions of interest differ only in their\nsymmetric components (or in the SPD components for the PhD test). We argue this to be unlikely, due\nto real-life distributions being complex in nature, with interesting differences often having a degree of\nasymmetry. In practice, we recommend using the ME and SME or PhD tests together to provide\nan exploratory tool for understanding the underlying differences, as demonstrated in the Higgs data\nexperiment in section 5.1.\nIt is tempting to also consider learning on distributions with invariances using this formalism. However,\nnote that the MMD on paired differences is not invariant to the additive SPD noise components under\nthe alternative, i.e. in general MMD(X \u2212 Y, Y \u2212 X) \u2260 MMD(X0 \u2212 Y0, Y0 \u2212 X0). This means that\nthe paired-differences approach to learning is sensitive to the actual type and scale of the additive\nSPD noise components, and is hence not suitable for learning. The mathematical details and empirical\nexperiments showing this are presented in Appendices C and F.1.
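The paired-differences construction above can be sketched as follows (a minimal illustration, not the authors' implementation: here a plain quadratic-time Gaussian-kernel MMD statistic is substituted for the linear-time ME test of [7], and the exponential/Laplace distributions and all scales are arbitrary example choices):

```python
import numpy as np

def rbf_mmd2(a, b, sigma=1.0):
    """Biased quadratic-time squared-MMD estimate between 1-d samples, Gaussian kernel."""
    def gram(u, v):
        return np.exp(-((u[:, None] - v[None, :]) ** 2) / (2 * sigma ** 2))
    return gram(a, a).mean() + gram(b, b).mean() - 2 * gram(a, b).mean()

def sme_statistic(x, y):
    """Split the paired samples in half: W = X - Y on one half, Z = Y - X on the other."""
    n = len(x) // 2
    w = x[:n] - y[:n]
    z = y[n:2 * n] - x[n:2 * n]
    return rbf_mmd2(w, z)          # compares W and Z, i.e. tests symmetry of X - Y

rng = np.random.default_rng(3)
n = 2000
x0 = rng.exponential(size=n)                         # asymmetric "signal"
# Null holds: same signal distribution, but different symmetric noises on X and Y.
x = x0 + rng.normal(scale=0.3, size=n)
y = rng.exponential(size=n) + rng.laplace(scale=0.8, size=n)
stat_null = sme_statistic(x, y)                      # small: X - Y is symmetric

# Alternative: the underlying signals differ (mirrored exponential is skewed the other way).
y_alt = -rng.exponential(size=n) + rng.normal(scale=0.3, size=n)
stat_alt = sme_statistic(x, y_alt)                   # large: X - Y_alt is clearly asymmetric
```

Under the null, X − Y is symmetric about zero, so W and Z = −(X − Y) share a distribution regardless of how the two SPD noise components differ; any two-sample test can then be plugged in on (W, Z).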
The SME test follows the setup in [7], but applied to {Xi \u2212 Yi}_{i=1}^{N/2} and\n{Yi \u2212 Xi}_{i=N/2+1}^{N}. For the PhD test, we use as the test statistic the estimate PhD\u0302(P\u0302X, P\u0302Y) of (2). It\nis unclear what the exact form of the null distribution is, so we use a permutation test, by recomputing\nthis statistic on the samples which are first merged and then randomly split in the original proportions.\nWhile we are combining samples with different distributions, the permutation test is still justified\nsince, under the null hypothesis X0 =d Y0, the resulting characteristic function \u03c6null of the mixture\ncan be written as\n\n\u03c6null = (1/2) \u03c6X0 \u03c6U + (1/2) \u03c6X0 \u03c6V = \u03c6X0 ((1/2) \u03c6U + (1/2) \u03c6V),\n\nand since the mixture of the SPD noise terms is also SPD, we have that \u03c1null = \u03c1X0 = \u03c1Y0.\n\nFigure 1: Type I error and power under various additional symmetric noise in the synthetic \u03c7\u00b2 dataset.\nThe dashed line is the 99% Wald interval. Left: Type I error; n11 denotes the noise-to-signal ratio\nfor the first set of samples and n12 for the second set. Right: power; n1 denotes the noise-to-signal\nratio for the X set of samples and n2 the noise-to-signal ratio for the Y set of samples.\n\nFor our experiments, we denote by N the sample size and by d the dimension of the samples, and we take \u03b1 = 0.05\nto be the significance level. In the SME test, we take the number of test locations J to be 10, and\nuse 20% of the samples to optimise the test locations. 
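The merge-and-resplit permutation procedure for the PhD statistic described above can be sketched as follows (illustrative only: the statistic is the phase-feature estimate (4), the Gaussian-sampled frequencies and the noisy chi-squared example mirror the synthetic setup only loosely, and all scales are arbitrary):

```python
import numpy as np

def phd_stat(bag_x, bag_y, omegas):
    """PhD estimate (4): squared distance between per-frequency unit-normalised (cos, sin) means."""
    def phase(bag):
        proj = bag @ omegas.T
        c, s = np.cos(proj).mean(axis=0), np.sin(proj).mean(axis=0)
        norm = np.sqrt(c ** 2 + s ** 2)
        return np.concatenate([c / norm, s / norm]) / np.sqrt(omegas.shape[0])
    px, py = phase(bag_x), phase(bag_y)
    return 2.0 - 2.0 * float(px @ py)

def permutation_test(bag_x, bag_y, omegas, n_perm=100, rng=None):
    """Merge the two samples, re-split in the original proportions, recompute the statistic."""
    rng = rng or np.random.default_rng(0)
    obs = phd_stat(bag_x, bag_y, omegas)
    pooled = np.vstack([bag_x, bag_y])
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        exceed += phd_stat(pooled[idx[:len(bag_x)]], pooled[idx[len(bag_x):]], omegas) >= obs
    return (exceed + 1) / (n_perm + 1)   # permutation p-value

rng = np.random.default_rng(4)
omegas = rng.normal(scale=0.5, size=(100, 1))
x = rng.chisquare(4, size=(800, 1)) / 4.0
y = rng.chisquare(8, size=(800, 1)) / 8.0 + rng.normal(scale=0.2, size=(800, 1))
p_value = permutation_test(x, y, omegas, rng=rng)
```

The `(exceed + 1) / (n_perm + 1)` convention keeps the p-value strictly positive and valid under the null.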
All experimental results are averaged over\n1000 runs, where each run repeats the simulation or randomly samples without replacement from the\ndataset.\n\n5.1.1 Synthetic example: Noisy \u03c7\u00b2\n\nWe start by demonstrating our tests with invariances on a simulated dataset where X0 and Y0 are\nrandom vectors with d = 5; each dimension is the same in distribution and follows \u03c7\u00b2(4)/4 and\n\u03c7\u00b2(8)/8 respectively, i.e. chi-squared random variables with different degrees of freedom, rescaled to\nhave the same mean 1 (but different variances, 1/2 and 1/4 respectively). An illustration of the\ntrue and empirical phase and characteristic functions with noise for these two distributions can be found\nin Appendix F.2. We construct samples {Xn1,i}_{i=1}^N and {Yn2,i}_{i=1}^N such that Xn1 \u223c X0 + U, where\nU \u223c N(0, \u03c31\u00b2I), and similarly Yn2 \u223c Y0 + V, where V \u223c N(0, \u03c32\u00b2I); ni denotes the noise-to-signal\nratio given by the ratio of variances in each dimension, i.e. n1 = 2\u03c31\u00b2 and n2 = 4\u03c32\u00b2.\nWe first verify that the Type I error is indeed controlled at our design level of \u03b1 = 0.05 under various\nadditive SPD noise components. This is shown in Figure 1 (left), where X0 =d Y0, both constructed\nusing \u03c7\u00b2(4)/4, with the noiseless case found in Figure F.6 in the Appendix. It is noted here that the\nME test rejects the null hypothesis for even a small difference in noise levels, hence it is unable to\nlet us target the underlying distributions we are concerned with. This is unlike the SME test, which\ncontrols the Type I error even for large differences in noise levels. The PhD test, on the other hand,\nwhile correctly controlling the Type I error at small noise levels, was found to have inflated Type I error rates\nfor large noise, with more results and explanation provided in Figure F.6 in the Appendix. 
Namely,\nthe test relies on the invariance to SPD components of the population expression of the PhD, but the estimator of the\nnull distribution of the corresponding test statistic will in general be affected by the differing noise\nlevels.\nNext, we investigate the power, shown in Figure 1 (right). For a fair comparison, we have included\nthe PhD test power only for small noise levels, for which the Type I error is controlled at the design\nlevel. In these cases, the PhD test has better power than the SME test. This is not surprising, as for the\nSME we have to halve the sample size in order to construct a valid test. However, recall that the PhD\ntest has an inflated Type I error for large noise, which means that its results should be considered\nwith caution in practice. The ME test rejects at all levels at all sample sizes, as it picks up all possible\ndifferences.\n\nFigure 2: Rejection ratio vs. sample size for extremely low-level features of the Higgs dataset. The\ndashed line is the 99% Wald interval for 1000 repetitions at \u03b1 = 0.05. Note the PhD test is not used\nhere, due to its expensive computational cost.\n\nFigure 3: RMSE on the Aerosol test set, corrupted by various levels of noise, averaged over 100 runs,\nwith the 5th and the 95th percentile. The noiseless case is shown with one run. RMSE from the mean\nis 0.206.\n\n
The SME and PhD tests are by construction more conservative tests, whose rejection provides a\nmuch stronger statement: two samples differ even when all arbitrary additive SPD components have\nbeen stripped off.\n\n5.1.2 Higgs Dataset\n\nThe UCI Higgs dataset [1, 11] is a dataset with 11 million observations, where the problem is to\ndistinguish between the signal process where Higgs bosons are produced and the background process\nthat does not produce Higgs bosons. In particular, we will consider a two-sample test with the ME\nand SME tests on the high-level features derived by physicists, as well as a two-sample test on four\nextremely low-level features (azimuthal angular momentum \u03c6 measured by four particle jets in the\ndetector). The high-level features here (in R7) have been shown to have good discriminative properties\nin [1]. Thus, we expect them to have different distributions across the two processes. Denoting by X the\nhigh-level features of the process without the Higgs boson, and by Y the corresponding distribution for\nthe processes where Higgs bosons are produced, we test the null hypothesis that the indecomposable\nparts of X and Y agree. The results can be found in Table F.1 in the Appendix, which shows that the\nhigh-level features differ even up to additive SPD components, with high power for the SME and\nME tests even at small sample sizes (rejection rate of 0.94 at N = 500). We now perform the same\nexperiment, but with the low-level features in R4, noted in [1] to carry very little discriminating\ninformation, using the setup from [2].\nThe results for the ME and SME tests can be found in Figure 2. Here we observe that, while the ME\ntest clearly rejects and finds the difference between the two distributions, there is no evidence that\nthe indecomposable parts of the joint distributions of the angular momentum actually differ. 
In\nfact, the test rejection rate remains around the chosen design level of \u03b1 = 0.05 for all sample sizes.\nThis highlights the significance of using the SME test, suggesting that the nature of the difference\nbetween the two processes can potentially be explained by some additive symmetric noise components\nwhich may be irrelevant for discrimination, providing an insight into the dataset. Furthermore, this\nalso highlights the argument that, given two samples from complex data collection and generation\nprocesses, a nonparametric two-sample test like ME will likely reject given sufficient sample sizes,\neven if the discovered difference may not be of interest. With the SME test, however, we can ask a\nmuch more subtle question about the differences between the assumed true underlying processes.\nFigures showing that the Type I error is controlled at the design level of \u03b1 = 0.05 for both low- and\nhigh-level features can be found in Figure F.7 in the Appendix.\n\n5.2 Learning with Phase Features\n\n5.2.1 Aerosol Dataset\n\nTo demonstrate the phase features' invariance to SPD noise components, we use the Aerosol MISR1\ndataset also studied by [24] and [25], and consider a situation with covariate shift [18] on distribution\ninputs: the testing data is impaired by additive SPD components different to those in the training data.\n\nTable 1: Mean Square Error (MSE) on the dark matter dataset for 500 runs, with 5th and 95th percentiles.\n\nAlgorithm  MSE\nMean       0.16\nPLRR       0.021 (0.018, 0.024)\nGLRR       0.033 (0.030, 0.037)\nLGRR       0.032 (0.028, 0.036)\nPGRR       0.021 (0.017, 0.024)\nGGRR       0.018 (0.015, 0.019)\n\nFigure 4: MSE with various levels of noise added on the test set, with 5th and 95th percentiles.\n\nHere, we have an aerosol optical depth (AOD) multi-instance learning problem with 800 bags, where\neach bag contains 100 randomly selected multispectral (potentially cloudy) pixels within a 20km radius\naround an AOD sensor. 
The label yi for each bag is given by the AOD sensor measurements, and each\nsample xij is 16-dimensional. This can be understood as a distribution regression problem where each\nbag is treated as a set of samples from some distribution.\nWe use 640 bags for training and 160 bags for testing. In the bags for testing only, we add\nvarying levels of Gaussian noise \u03b5 \u223c N(0, Z) to each bag, where Z is a diagonal matrix with\ndiagonal components zi \u223c U[0, \u03c3vi], with vi being the empirical variance in dimension i across all\nsamples, accounting for different scales across dimensions. For comparisons, we consider linear\nridge regression on embeddings with respect to a Gaussian kernel, approximated with RFF (GLRR),\nas described in section 2.1 (i.e. a linear kernel is applied on approximate embeddings), linear ridge\nregression on phase features (PLRR) (i.e. the normalisation step is applied to obtain (3)), and also the\nphase and Fourier neural networks (NN) described in Appendix D, tuning all hyperparameters with\n3-fold cross-validation. With the same model, we now measure the Root Mean Square Error (RMSE)\n100 times with various noise-corrupted test sets; results are shown in Figure 3. It is also noted that\na second-level non-linear kernel K\u0303 does not improve performance significantly on this problem [24].\nWe see that GLRR and PLRR are competitive (see Appendix Table F.2) in the noiseless case, and\nthese clearly outperform both the Fourier NN and Phase NN (likely due to the small size of the\ndataset). 
For increasing noise, the performance of GLRR degrades significantly, and while the performance of PLRR also degrades, the model is much more robust under additional SPD noise. In comparison, the Phase NN implementation is almost insensitive to covariate shift in the test sets, unlike PLRR, highlighting the importance of learning discriminative frequencies w in a very low signal-to-noise setting.

It is noted that the Fourier NN performs similarly to the Phase NN on this example. Interestingly, discriminative frequencies learnt on the training data correspond to Fourier features that are nearly normalised (i.e. they are close to unit norm; see Figure F.8 in the Appendix). This means that the Fourier NN has learned to be approximately invariant based on the training data, indicating that the original Aerosol data potentially has irrelevant SPD noise components. This is reinforced by the nature of the dataset (each bag contains 100 randomly selected, potentially cloudy pixels, known to be noisy [25]) and by the absence of any loss of performance in going from GLRR to PLRR. The results highlight that phase features are stable under additive SPD noise.

5.2.2 Dark Matter Dataset

We now study the use of phase features on the dark matter dataset, comprising a catalog of galaxy clusters. In this setting, we would like to predict the total mass of galaxy clusters, using the dispersion of velocities in the direction along our line of sight. In particular, we will use the 'ML1' dataset, as obtained from the authors of [16, 17], who constructed a catalog of massive halos from the MultiDark mdpl simulation [9]. The dataset contains 5028 bags, where each bag consists of sub-object velocities and carries a mass label in R. By viewing each galaxy cluster at multiple lines of sight, we obtain 15,000 bags, using the same experimental setup as in [10].
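The GLRR and PLRR pipelines used in these experiments share a two-stage recipe: represent each bag by a fixed-length feature vector (the real and imaginary parts of the empirical characteristic function at shared random frequencies, phase-normalised for the PLRR variant) and fit linear ridge regression on these vectors. A minimal sketch of this pipeline on a hypothetical synthetic task (bag labels given by bag locations; all names and parameters below are our own, not taken from the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

def bag_features(bag, W, phase=False):
    """Bag-level features: [Re, Im] of the ECF at frequencies W;
    with phase=True each ECF entry is normalised to unit modulus."""
    ecf = np.exp(1j * (bag @ W)).mean(axis=0)
    if phase:
        ecf = ecf / np.abs(ecf)
    return np.concatenate([ecf.real, ecf.imag])

def ridge_fit(X, y, lam=1e-2):
    # closed-form ridge regression: w = (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def make_bags(n_bags, d=1, bag_size=500, noise=0.0):
    labels = rng.uniform(-2, 2, size=n_bags)           # label = bag location
    bags = [mu + rng.normal(size=(bag_size, d)) for mu in labels]
    if noise > 0:                                      # symmetric test-time noise
        bags = [b + rng.laplace(scale=noise, size=b.shape) for b in bags]
    return bags, labels

W = 0.5 * rng.normal(size=(1, 25))                     # shared random frequencies
train_bags, y_tr = make_bags(400)
test_bags,  y_te = make_bags(200, noise=1.0)           # covariate shift on inputs

for phase in (False, True):                            # GLRR-like, then PLRR-like
    X_tr = np.stack([bag_features(b, W, phase) for b in train_bags])
    X_te = np.stack([bag_features(b, W, phase) for b in test_bags])
    w = ridge_fit(X_tr, y_tr)
    print("phase" if phase else "fourier", np.mean((X_te @ w - y_te) ** 2))
```

Since Laplace noise has a real positive characteristic function, the population phase features of the corrupted test bags match those of clean bags, so the phase variant is expected to degrade less, whereas the Fourier variant sees its test features damped by the noise.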
For experiments, we use approximately 9000 bags for training, and 3000 bags each for validation and testing, keeping bags corresponding to different lines of sight of the same cluster within the same set. As before, we use GLRR and PLRR, and we also include in the comparisons methods with a second level Gaussian kernel (with RFF) applied to phase features (PGRR) and to approximate embeddings (GGRR). As a baseline, we also include a first level linear kernel (equivalent to representing each bag by its mean) before applying a second level Gaussian kernel (LGRR). We use the same set of randomly sampled frequencies across the methods, tuning the scale of the frequencies and the regularisation parameters.

Table 1 shows the results of the methods across 10 different data splits, with 50 sets of randomised frequencies for each data split. We see that PLRR is significantly better than GLRR. This suggests that under this model structure, by removing SPD components from each bag, we can target the underlying signal and obtain superior performance, highlighting the applicability of phase features. Considering a second level Gaussian kernel, we see that GGRR has a slight advantage over PGRR, with PGRR performing similarly to PLRR. This suggests that the SPD components of the distribution of sub-object velocities may be useful for predicting the mass of a galaxy cluster if an additional nonlinearity is applied to embeddings; without this additional nonlinearity, however, the benefits of removing them outweigh the signal present in them. To show that the phase features are indeed robust to SPD components, we perform the same covariate shift experiment as on the aerosol dataset, with results given in Figure 4. Note that LGRR is robust to noise, as each bag is represented by its mean.

6 Conclusion

No dataset is immune from measurement noise, and often this noise differs across different data generation and collection processes.
When measuring distances between distributions, can we disentangle the differences in noise from the differences in the signal? We considered two different ways to encode invariances to additive symmetric noise in those distances, each with different strengths: a nonparametric measure of asymmetry in paired sample differences, and a weighted distance between the empirical phase functions. The former was used to construct a hypothesis test on whether the difference between the two generating processes can be explained away by the difference in postulated noise, whereas the latter allowed us to introduce a flexible framework for invariant feature construction and learning algorithms on distribution inputs which are robust to measurement noise and target the underlying signal distributions.

Acknowledgements

We thank Dougal Sutherland for suggesting the use of the dark matter dataset, Michelle Ntampaka for providing the catalog, as well as Ricardo Silva, Hyunjik Kim and Kaspar Martens for useful discussions. This work was supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). C.Y. and H.C.L.L. also acknowledge the support of the MRC Grant No. MR/L001411/1.

The CosmoSim database used in this paper is a service by the Leibniz-Institute for Astrophysics Potsdam (AIP). The MultiDark database was developed in cooperation with the Spanish MultiDark Consolider Project CSD2009-00064. The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) and the Partnership for Advanced Supercomputing in Europe (PRACE, www.prace-ri.eu) for funding the MultiDark simulation project by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de).

References

[1] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning.
Nature Communications, 5, 2014.

[2] Kacper P. Chwialkowski, Aaditya Ramdas, Dino Sejdinovic, and Arthur Gretton. Fast two-sample testing with analytic representations of probability measures. In Advances in Neural Information Processing Systems, pages 1981–1989, 2015.

[3] Aurore Delaigle and Peter Hall. Methodology for non-parametric deconvolution when the error distribution is unknown. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1):231–252, 2016.

[4] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.

[5] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[6] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.

[7] Wittawat Jitkrittum, Zoltán Szabó, Kacper P. Chwialkowski, and Arthur Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems 29, pages 181–189, 2016.

[8] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[9] Anatoly Klypin, Gustavo Yepes, Stefan Gottlober, Francisco Prada, and Steffen Hess. MultiDark simulations: the story of dark matter halo concentrations and density profiles. arXiv preprint arXiv:1411.4001, 2014.

[10] Ho Chung Leon Law, Dougal J. Sutherland, Dino Sejdinovic, and Seth Flaxman. Bayesian approaches to distribution regression. arXiv preprint arXiv:1705.04293, 2017.

[11] M. Lichman.
UCI Machine Learning Repository, 2013.

[12] Yu. V. Linnik and I. V. Ostrovskii. Decomposition of Random Variables and Vectors. 1977.

[13] J. Mitrovic, D. Sejdinovic, and Y. W. Teh. DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression. In International Conference on Machine Learning (ICML), pages 1482–1491, 2016.

[14] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems 25, pages 10–18, 2012.

[15] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.

[16] Michelle Ntampaka, Hy Trac, Dougal J. Sutherland, Nicholas Battaglia, Barnabás Póczos, and Jeff Schneider. A machine learning approach for dynamical mass measurements of galaxy clusters. The Astrophysical Journal, 803(2):50, 2015. arXiv:1410.0686.

[17] Michelle Ntampaka, Hy Trac, Dougal J. Sutherland, S. Fromenteau, B. Poczos, and Jeff Schneider. Dynamical mass measurements of contaminated galaxy clusters using machine learning. The Astrophysical Journal, 831(2):135, 2016. arXiv:1509.05409.

[18] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. Dataset Shift in Machine Learning. The MIT Press, 2009.

[19] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

[20] H.-J. Rossberg. Positive definite probability densities and probability distributions. Journal of Mathematical Sciences, 76(1):2181–2197, 1995.

[21] Le Song, Kenji Fukumizu, and Arthur Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models.
IEEE Signal Processing Magazine, 30(4):98–111, 2013.

[22] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, August 2010.

[23] Dougal J. Sutherland, Junier B. Oliva, Barnabás Póczos, and Jeff G. Schneider. Linear-time learning on distributions with approximate kernel embeddings. In Proc. AAAI Conference on Artificial Intelligence, pages 2073–2079, 2016.

[24] Zoltán Szabó, Arthur Gretton, Barnabás Póczos, and Bharath K. Sriperumbudur. Two-stage sampled learning theory on distributions. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

[25] Z. Wang, L. Lan, and S. Vucetic. Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50(6):2226–2237, June 2012.

[26] H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2004.