{"title": "A Wild Bootstrap for Degenerate Kernel Tests", "book": "Advances in Neural Information Processing Systems", "page_first": 3608, "page_last": 3616, "abstract": "A wild bootstrap method for nonparametric hypothesis tests based on kernel distribution embeddings is proposed. This bootstrap method is used to construct provably consistent tests that apply to random processes, for which the naive permutation-based bootstrap fails. It applies to a large group of kernel tests based on V-statistics, which are degenerate under the null hypothesis, and non-degenerate elsewhere. To illustrate this approach, we construct a two-sample test, an instantaneous independence test and a multiple lag independence test for time series. In experiments, the wild bootstrap gives strong performance on synthetic examples, on audio data, and in performance benchmarking for the Gibbs sampler. The code is available at https://github.com/kacperChwialkowski/wildBootstrap.", "full_text": "A Wild Bootstrap for Degenerate Kernel Tests\n\nKacper Chwialkowski\n\nDepartment of Computer Science\n\nUniversity College London\n\nLondon, Gower Street, WC1E 6BT\n\nkacper.chwialkowski@gmail.com\n\nDino Sejdinovic\n\nGatsby Computational Neuroscience Unit, UCL\n\n17 Queen Square, London WC1N 3AR\ndino.sejdinovic@gmail.com\n\nArthur Gretton\n\nGatsby Computational Neuroscience Unit, UCL\n\n17 Queen Square, London WC1N 3AR\n\narthur.gretton@gmail.com\n\nAbstract\n\nA wild bootstrap method for nonparametric hypothesis tests based on kernel dis-\ntribution embeddings is proposed. This bootstrap method is used to construct\nprovably consistent tests that apply to random processes, for which the naive\npermutation-based bootstrap fails.\nIt applies to a large group of kernel tests\nbased on V-statistics, which are degenerate under the null hypothesis, and non-\ndegenerate elsewhere. 
To illustrate this approach, we construct a two-sample test, an instantaneous independence test and a multiple lag independence test for time series. In experiments, the wild bootstrap gives strong performance on synthetic examples, on audio data, and in performance benchmarking for the Gibbs sampler. The code is available at https://github.com/kacperChwialkowski/wildBootstrap.

1 Introduction

Statistical tests based on distribution embeddings into reproducing kernel Hilbert spaces have been applied in many contexts, including two-sample testing [19, 15, 31], tests of independence [17, 32, 4], tests of conditional independence [14, 32], and tests for higher order (Lancaster) interactions [24]. For these tests, consistency is guaranteed if and only if the observations are independent and identically distributed. Much real-world data fails to satisfy the i.i.d. assumption: audio signals, EEG recordings, text documents, financial time series, and samples obtained when running Markov Chain Monte Carlo all show significant temporal dependence patterns.

The asymptotic behaviour of kernel test statistics becomes quite different when temporal dependencies exist within the samples. In recent work on independence testing using the Hilbert-Schmidt Independence Criterion (HSIC) [8], the asymptotic distribution of the statistic under the null hypothesis is obtained for a pair of independent time series which satisfy an absolute regularity or a φ-mixing assumption. In this case, the null distribution is shown to be an infinite weighted sum of dependent χ²-variables, as opposed to the sum of independent χ²-variables obtained in the i.i.d. setting [17]. The difference in the asymptotic null distributions has important implications in practice: under the i.i.d. assumption, an empirical estimate of the null distribution can be obtained by repeatedly permuting the time indices of one of the signals.
This breaks the temporal dependence within the permuted signal, which causes the test to return an elevated number of false positives when used for testing time series. To address this problem, an alternative estimate of the null distribution is proposed in [8], where the null distribution is simulated by repeatedly shifting one signal relative to the other. This preserves the temporal structure within each signal, while breaking the cross-signal dependence.

A serious limitation of the shift procedure in [8] is that it is specific to the problem of independence testing: there is no obvious way to generalise it to other testing contexts. For instance, we might have two time series, with the goal of comparing their marginal distributions; this is a generalization of the two-sample setting to which the shift approach does not apply.

We note, however, that many kernel tests have a test statistic with a particular structure: the Maximum Mean Discrepancy (MMD), HSIC, and the Lancaster interaction statistic each have empirical estimates which can be cast as normalized V-statistics, (1/n^{m-1}) Σ_{1≤i_1,...,i_m≤n} h(Z_{i_1}, ..., Z_{i_m}), where Z_{i_1}, ..., Z_{i_m} are samples from a random process at the time points {i_1, ..., i_m}. We show that a method of external randomization known as the wild bootstrap may be applied [22, 28] to simulate from the null distribution. In brief, the arguments of the above sum are repeatedly multiplied by random, user-defined time series. For a test of level α, the 1 - α quantile of the empirical distribution obtained using these perturbed statistics serves as the test threshold.
This approach has the important advantage over [8] that it may be applied to all kernel-based tests for which V-statistics are employed, and not just to independence tests.

The main result of this paper is to show that the wild bootstrap procedure yields consistent tests for time series, i.e., tests based on the wild bootstrap have a Type I error rate (of wrongly rejecting the null hypothesis) approaching the design parameter α, and a Type II error (of wrongly accepting the null) approaching zero, as the number of samples increases. We use this result to construct a two-sample test using MMD, and an independence test using HSIC. The latter procedure is applied both to testing for instantaneous independence, and to testing for independence across multiple time lags, for which the earlier shift procedure of [8] cannot be applied.

We begin our presentation in Section 2, with a review of the τ-mixing assumption required of the time series, as well as of V-statistics (of which MMD and HSIC are instances). We also introduce the form taken by the wild bootstrap. In Section 3, we establish a general consistency result for the wild bootstrap procedure on V-statistics, which we apply to MMD and to HSIC in Section 4. Finally, in Section 5, we present a number of empirical comparisons: in the two-sample case, we test for differences in audio signals with the same underlying pitch, and present a performance diagnostic for the output of a Gibbs sampler (the MCMC M.D.); in the independence case, we test for independence of two time series sharing a common variance (a characteristic of econometric models), and compare against the test of [4] in the case where dependence may occur at multiple, potentially unknown lags. Our tests outperform both the naive approach which neglects the dependence structure within the samples, and the approach of [4], when testing across multiple lags.
Detailed proofs are found in the appendices of an accompanying technical report [9], which we reference from the present document as needed.

2 Background

The main results of the paper are based around two concepts: τ-mixing [10], which describes the dependence within the time series, and V-statistics [27], which constitute our test statistics. In this section, we review these topics, and introduce the concept of wild bootstrapped V-statistics, which will be the key ingredient in our test construction.

τ-mixing. The notion of τ-mixing is used to characterise weak dependence. It is a less restrictive alternative to classical mixing coefficients, and is covered in depth in [10]. Let {Z_t, F_t}_{t∈N} be a stationary sequence of integrable random variables, defined on a probability space Ω with a probability measure P and a natural filtration F_t. The process is called τ-dependent if

τ(r) = sup_{l∈N} (1/l) sup_{r≤i_1≤...≤i_l} τ(F_0, (Z_{i_1}, ..., Z_{i_l})) → 0 as r → ∞, where

τ(M, X) = E( sup_{g∈Λ} | ∫ g(t) P_{X|M}(dt) - ∫ g(t) P_X(dt) | )

and Λ is the set of all one-Lipschitz continuous real-valued functions on the domain of X. τ(M, X) can be interpreted as the minimal L1 distance between X and X* such that X =_d X* and X* is independent of M ⊂ F. Furthermore, if F is rich enough, this X* can be constructed (see Proposition 4 in the Appendix). More information is provided in Appendix B.

V-statistics. The test statistics considered in this paper are always V-statistics. Given the observations Z = {Z_t}_{t=1}^n, a V-statistic of a symmetric function h taking m arguments is given by

V(h, Z) = (1/n^m) Σ_{i∈N^m} h(Z_{i_1}, ..., Z_{i_m}),   (1)

where N^m is the m-th Cartesian power of the set N = {1, ..., n}. For simplicity, we will often drop the second argument and write simply V(h).

We will refer to the function h as the core of the V-statistic V(h). While such functions are usually called kernels in the literature, in this paper we reserve the term kernel for positive-definite functions taking two arguments. A core h is said to be j-degenerate if for each z_1, ..., z_j, E h(z_1, ..., z_j, Z*_{j+1}, ..., Z*_m) = 0, where Z*_{j+1}, ..., Z*_m are independent copies of Z_1. If h is j-degenerate for all j ≤ m - 1, we will say that it is canonical. For a one-degenerate core h, we define an auxiliary function h_2, called the second component of the core, given by h_2(z_1, z_2) = E h(z_1, z_2, Z*_3, ..., Z*_m). Finally, we say that nV(h) is a normalized V-statistic, and that a V-statistic with a one-degenerate core is a degenerate V-statistic. This degeneracy is common to many kernel statistics when the null hypothesis holds [15, 17, 24].

Our main results will rely on the fact that h_2 governs the asymptotic behaviour of normalized degenerate V-statistics. Unfortunately, the limiting distribution of such V-statistics is quite complicated: it is an infinite sum of dependent χ²-distributed random variables, with a dependence determined by the temporal dependence structure within the process {Z_t} and by the eigenfunctions of a certain integral operator associated with h_2 [5, 8]. Therefore, we propose a bootstrapped version of the V-statistics which will allow a consistent approximation of this difficult limiting distribution.

Bootstrapped V-statistic.
We will study two versions of the bootstrapped V-statistics,

V_b1(h, Z) = (1/n^m) Σ_{i∈N^m} W_{i_1,n} W_{i_2,n} h(Z_{i_1}, ..., Z_{i_m}),   (2)

V_b2(h, Z) = (1/n^m) Σ_{i∈N^m} W̃_{i_1,n} W̃_{i_2,n} h(Z_{i_1}, ..., Z_{i_m}),   (3)

where {W_{t,n}}_{1≤t≤n} is an auxiliary wild bootstrap process and W̃_{t,n} = W_{t,n} - (1/n) Σ_{j=1}^n W_{j,n}. This auxiliary process, proposed by [28, 22], satisfies the following assumption:

Bootstrap assumption: {W_{t,n}}_{1≤t≤n} is a row-wise strictly stationary triangular array independent of all Z_t such that E W_{t,n} = 0 and sup_n E|W_{t,n}^{2+σ}| < ∞ for some σ > 0. The autocovariance of the process is given by E W_{s,n} W_{t,n} = ρ(|s - t|/l_n) for some function ρ, such that lim_{u→0} ρ(u) = 1 and Σ_{r=1}^{n-1} ρ(|r|/l_n) = O(l_n). The sequence {l_n} is taken such that l_n = o(n) but lim_{n→∞} l_n = ∞. The variables W_{t,n} are τ-weakly dependent with coefficients τ(r) ≤ C ζ^{r/l_n} for r = 1, ..., n, ζ ∈ (0, 1) and C ∈ R.

As noted in [22, Remark 2], a simple realization of a process that satisfies this assumption is W_{t,n} = e^{-1/l_n} W_{t-1,n} + sqrt(1 - e^{-2/l_n}) ε_t, where W_{0,n} and ε_1, ..., ε_n are independent standard normal random variables. For simplicity, we will drop the index n and write W_t instead of W_{t,n}. A process that fulfils the bootstrap assumption will be called a bootstrap process. Further discussion of the wild bootstrap is provided in Appendix A. The versions of the bootstrapped V-statistics in (2) and (3) were previously studied in [22] for the case of canonical cores of degree m = 2. We extend their results to higher-degree cores (common within the kernel testing framework), which are not necessarily one-degenerate.
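To make the construction concrete, here is a minimal Python sketch (assuming numpy; an illustration, not the authors' released code) of the AR(1) bootstrap process from [22, Remark 2] and of the statistics (2) and (3) for a degree-2 core, evaluated through the matrix H[i, j] = h(Z_i, Z_j):

```python
import numpy as np

def bootstrap_process(n, ln, rng):
    # Stationary Gaussian AR(1) from [22, Remark 2]:
    # W_t = e^{-1/ln} W_{t-1} + sqrt(1 - e^{-2/ln}) eps_t,
    # so that E W_t = 0, Var W_t = 1, Cov(W_s, W_t) = e^{-|s-t|/ln}.
    a = np.exp(-1.0 / ln)
    W = np.empty(n)
    W[0] = rng.standard_normal()      # start from the stationary law
    for t in range(1, n):
        W[t] = a * W[t - 1] + np.sqrt(1.0 - a * a) * rng.standard_normal()
    return W

def vb1(H, W):
    # Eq. (2) for a degree-2 core: (1/n^2) sum_{i,j} W_i W_j h(Z_i, Z_j).
    n = len(W)
    return W @ H @ W / n**2

def vb2(H, W):
    # Eq. (3): the same quadratic form with the centred process W-tilde.
    return vb1(H, W - W.mean())

rng = np.random.default_rng(0)
Z = rng.standard_normal(100)
H = np.outer(Z, Z)                    # toy one-degenerate core h(z, z') = z z'
W = bootstrap_process(len(Z), ln=10, rng=rng)
draw1, draw2 = vb1(H, W), vb2(H, W)
```

Repeating the last two lines with fresh draws of W yields the bootstrap sample from which a test threshold can be taken.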
When stating a fact that applies to both V_b1 and V_b2, we will simply write V_b, and the argument Z will be dropped when there is no ambiguity.

3 Asymptotics of wild bootstrapped V-statistics

In this section, we present the main theorems that describe the asymptotic behaviour of V-statistics. In the next section, these results will be used to construct kernel-based statistical tests applicable to dependent observations. Tests are constructed so that the V-statistic is degenerate under the null hypothesis and non-degenerate under the alternative. Theorem 1 guarantees that the bootstrapped V-statistic will converge to the same limiting null distribution as the simple V-statistic. Following [22], we will establish the convergence of the bootstrapped distribution to the desired asymptotic distribution in the Prokhorov metric φ [13, Section 11.3], and ensure that this distance approaches zero in probability as n → ∞. This two-part convergence statement is needed due to the additional randomness introduced by the W_{j,n}.

Theorem 1. Assume that the stationary process {Z_t} is τ-dependent with τ(r) = O(r^{-6-ε}) for some ε > 0. If the core h is a Lipschitz continuous, one-degenerate, and bounded function of m arguments and its h_2-component is a positive definite kernel, then φ(n (m choose 2) V_b(h, Z), nV(h, Z)) → 0 in probability as n → ∞, where φ is the Prokhorov metric.

Proof. By Lemma 3 and Lemma 2 respectively, φ(nV_b(h), nV_b(h_2)) and φ(nV(h), n (m choose 2) V(h_2)) converge to zero. By [22, Theorem 3.1], nV_b(h_2) and nV(h_2, Z) have the same limiting distribution, i.e., φ(nV_b(h_2), nV(h_2, Z)) → 0 in probability under certain assumptions. Thus, it suffices to check that these assumptions hold. Assumption A2: (i) h_2 is one-degenerate and symmetric, which follows from Lemma 1; (ii) h_2 is a kernel, which is one of the assumptions of this theorem; (iii) E h_2(Z_1, Z_1) < ∞; by Lemma 7, h_2 is bounded and therefore has a finite expected value; (iv) h_2 is Lipschitz continuous, which follows from Lemma 7. Assumption B1: Σ_{r=1}^n r² sqrt(τ(r)) < ∞. Since τ(r) = O(r^{-6-ε}), then Σ_{r=1}^n r² sqrt(τ(r)) ≤ C Σ_{r=1}^n r^{-1-ε/2} < ∞. Assumption B2: this assumption about the auxiliary process {W_t} is the same as our Bootstrap assumption.

On the other hand, if the V-statistic is not degenerate, which is usually true under the alternative, it converges to some non-zero constant. In this setting, Theorem 2 guarantees that the bootstrapped V-statistic will converge to zero in probability. This property is necessary in testing, as it implies that the test thresholds computed using the bootstrapped V-statistics will also converge to zero, and so will the corresponding Type II error. The following theorem is due to Lemmas 4 and 5.

Theorem 2. Assume that the process {Z_t} is τ-dependent with a coefficient τ(r) = O(r^{-6-ε}). If the core h is a Lipschitz continuous, symmetric and bounded function of m arguments, then nV_b2(h) converges in distribution to some non-zero random variable with finite variance, and V_b1(h) converges to zero in probability.

Although both V_b2 and V_b1 converge to zero, the rate and the type of convergence are not the same: nV_b2 converges in law to some random variable, while the behaviour of nV_b1 is unspecified. As a consequence, tests that utilize V_b2 usually give a lower Type II error than the ones that use V_b1. On the other hand, V_b1 seems to better approximate the V-statistic distribution under the null hypothesis.
This agrees with our experiments in Section 5, as well as with those in [22, Section 5].

4 Applications to Kernel Tests

In this section, we describe how the wild bootstrap for V-statistics can be used to construct kernel tests for independence and the two-sample problem, which are applicable to weakly dependent observations. We start by reviewing the main concepts underpinning the kernel testing framework.

For every symmetric, positive definite function, i.e., kernel k : X × X → R, there is an associated reproducing kernel Hilbert space H_k [3, p. 19]. The kernel embedding of a probability measure P on X is an element μ_k(P) ∈ H_k, given by μ_k(P) = ∫ k(·, x) dP(x) [3, 29]. If a measurable kernel k is bounded, the mean embedding μ_k(P) exists for all probability measures on X, and for many interesting bounded kernels k, including the Gaussian, Laplacian and inverse multiquadrics, the kernel embedding P ↦ μ_k(P) is injective. Such kernels are said to be characteristic [30]. The RKHS distance ||μ_k(P_x) - μ_k(P_y)||²_{H_k} between embeddings of two probability measures P_x and P_y is termed the Maximum Mean Discrepancy (MMD), and its empirical version serves as a popular statistic for non-parametric two-sample testing [15]. Similarly, given a sample of paired observations {(X_i, Y_i)}_{i=1}^n ~ P_xy, and kernels k and l respectively on the X and Y domains, the RKHS distance ||μ_κ(P_xy) - μ_κ(P_x P_y)||²_{H_κ} between embeddings of the joint distribution and of the product of the marginals measures the dependence between X and Y. Here, κ((x, y), (x', y')) = k(x, x') l(y, y') is the kernel on the product space of the X and Y domains. This quantity is called the Hilbert-Schmidt Independence Criterion (HSIC) [16, 17]. When characteristic RKHSs are used, the HSIC is zero iff X and Y are independent: this follows from [18]. The empirical statistic is written HSIC_κ = (1/n²) Tr(KHLH) for kernel matrices K and L and the centering matrix H = I - (1/n) 1 1'.

4.1 Wild Bootstrap For MMD

Denote the observations by {X_i}_{i=1}^{n_x} ~ P_x and {Y_j}_{j=1}^{n_y} ~ P_y. Our goal is to test the null hypothesis H_0 : P_x = P_y vs. the alternative H_1 : P_x ≠ P_y. In the case where the samples have equal sizes, i.e., n_x = n_y, application of the wild bootstrap to MMD-based tests on dependent samples is straightforward: the empirical MMD can be written as a V-statistic with a core of degree two on pairs z_i = (x_i, y_i) given by h(z_1, z_2) = k(x_1, x_2) - k(x_1, y_2) - k(x_2, y_1) + k(y_1, y_2). It is clear that whenever k is Lipschitz continuous and bounded, so is h. Moreover, h is a valid positive definite kernel, since it can be represented as an RKHS inner product <k(·, x_1) - k(·, y_1), k(·, x_2) - k(·, y_2)>_{H_k}. Under the null hypothesis, h is also one-degenerate, i.e., E h((x_1, y_1), (X_2, Y_2)) = 0. Therefore, we can use the bootstrapped statistics in (2) and (3) to approximate the null distribution and attain a desired test level.

When n_x ≠ n_y, however, it is no longer possible to write the empirical MMD as a one-sample V-statistic.
We will therefore require the following bootstrapped version of the MMD:

MMD_{k,b} = (1/n_x²) Σ_{i=1}^{n_x} Σ_{j=1}^{n_x} W̃^{(x)}_i W̃^{(x)}_j k(x_i, x_j) + (1/n_y²) Σ_{i=1}^{n_y} Σ_{j=1}^{n_y} W̃^{(y)}_i W̃^{(y)}_j k(y_i, y_j) - (2/(n_x n_y)) Σ_{i=1}^{n_x} Σ_{j=1}^{n_y} W̃^{(x)}_i W̃^{(y)}_j k(x_i, y_j),   (4)

where W̃^{(x)}_t = W^{(x)}_t - (1/n_x) Σ_{i=1}^{n_x} W^{(x)}_i, W̃^{(y)}_t = W^{(y)}_t - (1/n_y) Σ_{j=1}^{n_y} W^{(y)}_j, and {W^{(x)}_t} and {W^{(y)}_t} are two auxiliary wild bootstrap processes that are independent of {X_t} and {Y_t} and also independent of each other, both satisfying the bootstrap assumption in Section 2. The following proposition shows that the bootstrapped statistic has the same asymptotic null distribution as the empirical MMD. The proof follows that of [22, Theorem 3.1], and is given in the Appendix.

Proposition 1. Let k be bounded and Lipschitz continuous, and let {X_t} and {Y_t} both be τ-dependent with coefficients τ(r) = O(r^{-6-ε}), but independent of each other. Further, let n_x = ρ_x n and n_y = ρ_y n, where n = n_x + n_y. Then, under the null hypothesis P_x = P_y, φ(ρ_x ρ_y n MMD_k, ρ_x ρ_y n MMD_{k,b}) → 0 in probability as n → ∞, where φ is the Prokhorov metric and MMD_k is the MMD between the empirical measures.

4.2 Wild Bootstrap For HSIC

Using HSIC in the context of random processes is not new in the machine learning literature. For a 1-approximating functional of an absolutely regular process [6], convergence in probability of the empirical HSIC to its population value was shown in [33]. No asymptotic distributions were obtained, however, nor was a statistical test constructed. The asymptotics of a normalized V-statistic were obtained in [8] for absolutely regular and φ-mixing processes [12].
Due to the intractability of the null distribution for the test statistic, the authors propose a procedure to approximate its null distribution using circular shifts of the observations, leading to tests of instantaneous independence, i.e., of X_t ⊥ Y_t, for all t. This was shown to be consistent under the null (i.e., leading to the correct Type I error); however, consistency of the shift procedure under the alternative is a challenging open question (see [8, Section A.2] for further discussion). In contrast, as shown below in Propositions 2 and 3 (which are direct consequences of Theorems 1 and 2), the wild bootstrap guarantees test consistency under both hypotheses, null and alternative, which is a major advantage. In addition, the wild bootstrap can be used in constructing a test for the harder problem of determining independence across multiple lags simultaneously, similar to the one in [4].

Following symmetrisation, it is shown in [17, 8] that the empirical HSIC can be written as a degree four V-statistic with core given by

h(z_1, z_2, z_3, z_4) = (1/4!) Σ_{π∈S_4} k(x_{π(1)}, x_{π(2)}) [l(y_{π(1)}, y_{π(2)}) + l(y_{π(3)}, y_{π(4)}) - 2 l(y_{π(2)}, y_{π(3)})],

where we denote by S_n the group of permutations over n elements. Thus, we can directly apply the theory developed for higher-order V-statistics in Section 3. We consider two types of tests: instantaneous independence and independence at multiple time lags.

Table 1: Rejection rates for two-sample experiments. MCMC: sample size = 500; a Gaussian kernel with bandwidth σ = 1.7 is used; every second Gibbs sample is kept (i.e., after a pass through both dimensions). Audio: sample sizes are (n_x, n_y) = {(300, 200), (600, 400), (900, 600)}; a Gaussian kernel with bandwidth σ = 14 is used. Both: the wild bootstrap uses a block size of l_n = 20; results are averaged over at least 200 trials. The Type II error for all tests was zero.

experiment \ method            permutation         MMD_{k,b}           V_b1    V_b2
MCMC:  i.i.d. vs i.i.d. (H0)   .025                .040                .012    .070
MCMC:  i.i.d. vs Gibbs (H0)    .100                .528                .052    .105
MCMC:  Gibbs vs Gibbs (H0)     .110                .680                .060    .100
Audio: H0                      {.970, .965, .995}  {.145, .120, .114}  -       -
Audio: H1                      {1, 1, 1}           {.600, .898, .995}  -       -

Test of instantaneous independence. Here, the null hypothesis H_0 is that X_t and Y_t are independent at all times t, and the alternative hypothesis H_1 is that they are dependent.

Proposition 2. Under the null hypothesis, if the stationary process Z_t = (X_t, Y_t) is τ-dependent with a coefficient τ(r) = O(r^{-6-ε}) for some ε > 0, then φ(6nV_b(h), nV(h)) → 0 in probability, where φ is the Prokhorov metric.

Proof. Since k and l are bounded and Lipschitz continuous, the core h is bounded and Lipschitz continuous. One-degeneracy under the null hypothesis was stated in [17, Theorem 2], and that h_2 is a kernel is shown in [17, Section A.2, following eq. (11)]. The result follows from Theorem 1.

The following proposition holds by Theorem 2, since the core h is Lipschitz continuous, symmetric and bounded.

Proposition 3. If the stationary process Z_t is τ-dependent with a coefficient τ(r) = O(r^{-6-ε}) for some ε > 0, then under the alternative hypothesis nV_b2(h) converges in distribution to some random variable with a finite variance, and V_b1 converges to zero in probability.

Lag-HSIC. Propositions 2 and 3 also allow us to construct a test of time series independence that is similar to one designed by [4]. Here, we will be testing against a broader null hypothesis: X_t and Y_{t'} are independent for |t - t'| < M, for an arbitrarily large but fixed M.
In the Appendix, we show how to construct a test when M → ∞, although this requires an additional assumption about the uniform convergence of cumulative distribution functions.

Since the time series Z_t = (X_t, Y_t) is stationary, it suffices to check whether there exists a dependency between X_t and Y_{t+m} for -M ≤ m ≤ M. Since each lag corresponds to an individual hypothesis, we will require a Bonferroni correction to attain a desired test level α. We therefore define q = 1 - α/(2M + 1). The shifted time series will be denoted Z^m_t = (X_t, Y_{t+m}). Let S_{m,n} = nV(h, Z^m) denote the value of the normalized HSIC statistic calculated on the shifted process Z^m_t. Let F_{b,n} denote the empirical cumulative distribution function obtained by the bootstrap procedure using nV_b(h, Z). The test will then reject the null hypothesis if the event A_n = { max_{-M≤m≤M} S_{m,n} > F^{-1}_{b,n}(q) } occurs. By a simple application of the union bound, it is clear that the asymptotic probability of the Type I error will be lim_{n→∞} P_{H_0}(A_n) ≤ α. On the other hand, if the alternative holds, there exists some m with |m| ≤ M for which V(h, Z^m) = n^{-1} S_{m,n} converges to a non-zero constant. In this case

P_{H_1}(A_n) ≥ P_{H_1}(S_{m,n} > F^{-1}_{b,n}(q)) = P_{H_1}(n^{-1} S_{m,n} > n^{-1} F^{-1}_{b,n}(q)) → 1   (5)

as long as n^{-1} F^{-1}_{b,n}(q) → 0, which follows from the convergence of V_b to zero in probability shown in Proposition 3. Therefore, the Type II error of the multiple lag test is guaranteed to converge to zero as the sample size increases. Our experiments in the next section demonstrate that while this procedure is defined over a finite range of lags, it results in tests more powerful than the procedure for an infinite number of lags proposed in [4].
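The decision rule above can be sketched end to end. The snippet below (a simplified Python illustration, assuming numpy; not the released code) computes the normalized HSIC statistic via Tr(KHLH) at every lag |m| ≤ M and compares the maximum against the bootstrap quantile at level q = 1 - α/(2M + 1). For brevity, it bootstraps a quadratic form in the entrywise product of the centred Gram matrices, which corresponds to the dominant h_2-type part of the statistic, rather than fitting a Pareto tail as in the experiments:

```python
import numpy as np

def gauss_gram(x, sigma=1.0):
    sq = (x[:, None] - x[None, :]) ** 2
    return np.exp(-sq / (2.0 * sigma**2))

def hsic_stat(x, y):
    # Normalized statistic computed as (1/n) Tr(K Hc L Hc).
    n = len(x)
    Hc = np.eye(n) - 1.0 / n
    return np.trace(gauss_gram(x) @ Hc @ gauss_gram(y) @ Hc) / n

def ar1_w(n, ln, rng):
    # Centred wild bootstrap process (the W-tilde of V_b2).
    a = np.exp(-1.0 / ln)
    W = np.empty(n)
    W[0] = rng.standard_normal()
    for t in range(1, n):
        W[t] = a * W[t - 1] + np.sqrt(1.0 - a * a) * rng.standard_normal()
    return W - W.mean()

def lag_hsic_test(x, y, M=3, ln=20, alpha=0.05, n_boot=300, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(x)
    q = 1.0 - alpha / (2 * M + 1)          # Bonferroni over 2M + 1 lags
    Hc = np.eye(n) - 1.0 / n
    C = (Hc @ gauss_gram(x) @ Hc) * (Hc @ gauss_gram(y) @ Hc)
    boot = np.empty(n_boot)
    for b in range(n_boot):                # wild bootstrap null sample
        w = ar1_w(n, ln, rng)
        boot[b] = w @ C @ w / n
    threshold = np.quantile(boot, q)
    stats = []
    for m in range(-M, M + 1):             # statistic on each shifted series
        xs, ys = (x[:n - m], y[m:]) if m >= 0 else (x[-m:], y[:n + m])
        stats.append(hsic_stat(xs, ys))
    return max(stats) > threshold

rng = np.random.default_rng(1)
x = rng.standard_normal(300)
y = np.roll(x, 2)                          # dependence at lag 2
reject = lag_hsic_test(x, y, M=3, rng=rng)
```

With dependence at a lag inside the examined window, the maximum shifted statistic exceeds the bootstrap threshold and the test rejects.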
We note that a procedure that works for an infinite number of lags, although possible to construct, does not add much practical value under the present assumptions. Indeed, since the τ-mixing assumption applies to the joint sequence Z_t = (X_t, Y_t), dependence between X_t and Y_{t+m} is bound to disappear at a rate of o(m^{-6}), i.e., the variables both within and across the two series are assumed to become gradually independent at large lags.

Figure 1: Comparison of Shift-HSIC and tests based on V_b1 and V_b2. The left panel shows the performance under the null hypothesis, where a larger AR coefficient implies a stronger temporal dependence. The right panel shows the performance under the alternative hypothesis, where a larger extinction rate implies a greater dependence between processes.

Figure 2: In both panels, the Type II error is plotted. The left panel presents the error of the lag-HSIC and KCSD algorithms for a process following the dynamics given by equation (6). The errors for a process with dynamics given by equations (7) and (8) are shown in the right panel. The X axis is indexed by the time series length, i.e., sample size. The Type I error was around 5%.

5 Experiments

The MCMC M.D. We employ MMD in order to diagnose how far an MCMC chain is from its stationary distribution [26, Section 5], by comparing the MCMC sample to a benchmark sample. A hypothesis test of whether the sampler has converged based on the standard permutation-based bootstrap leads to too many rejections of the null hypothesis, due to dependence within the chain. Thus, one would require heavily thinned chains, which is wasteful of samples and computationally burdensome.
Our experiments indicate that the wild bootstrap approach allows consistent tests directly on the chains, as it attains the desired number of false positives.

To assess the performance of the wild bootstrap in determining MCMC convergence, we consider the situation where the samples {X_i} and {Y_i} are bivariate, and both have the identical marginal distribution given by an elongated normal P = N([0, 0], [[15.5, 14.5], [14.5, 15.5]]). However, they could have arisen either as independent samples, or as outputs of the Gibbs sampler with stationary distribution P. Table 1 shows the rejection rates under the significance level α = 0.05. It is clear that in the case where at least one of the samples is a Gibbs chain, the permutation-based test has a Type I error much larger than α. The wild bootstrap using V_b1 (without artificial degeneration) yields the correct Type I error control in these cases. Consistent with the findings in [22, Section 5], V_b1 mimics the null distribution better than V_b2. The bootstrapped statistic MMD_{k,b} in (4), which also relies on the artificially degenerated bootstrap processes, behaves similarly to V_b2. In the alternative scenario, where {Y_i} was taken from a distribution with the same covariance structure but with the mean set to μ = [2.5, 0], the Type II error for all tests was zero.

Pitch-evoking sounds. Our second experiment is a two-sample test on sounds studied in the field of pitch perception [20]. We synthesise the sounds with the fundamental frequency parameter of treble C, subsampled at 10.46 kHz. Each i-th period of length Ω contains d = 20 audio samples at times 0 = t_1 < ... < t_d < Ω; we treat this whole vector as a single observation X_i or Y_i, i.e., we are comparing distributions on R^20. Sounds are generated based on the AR process a_i = λ a_{i-1} + sqrt(1 - λ²) ε_i, where a_0, ε_i ~ N(0, I_d), with X_{i,r} = Σ_j Σ_{s=1}^d a_{j,s} exp(-(t_r - t_s - (j - i)Ω)² / (2σ²)). Thus, a given pattern (a smoothed version of a_0) slowly varies, and hence the sound deviates from periodicity, but still evokes a pitch. We take X with σ = 0.1Ω and λ = 0.8, and Y is either an independent copy of X (null scenario), or has σ = 0.05Ω (alternative scenario); variation in the smoothness parameter changes the width of the spectral envelope, i.e., the brightness of the sound. n_x is taken to be different from n_y. Results in Table 1 demonstrate that the approach using the wild bootstrapped statistic in (4) allows control of the Type I error and reduction of the Type II error with increasing sample size, while the permutation test virtually always rejects the null hypothesis.
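The unequal-sample-size statistic (4) used in this experiment reduces to three weighted quadratic forms in the kernel matrices. A minimal Python sketch of one bootstrap draw (assuming numpy and a Gaussian kernel; an illustration, not the released code):

```python
import numpy as np

def gram(A, B, sigma=1.0):
    # Gaussian kernel matrix between row-samples A and B.
    sq = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-sq / (2.0 * sigma**2))

def ar1_w(n, ln, rng):
    # Centred wild bootstrap process W-tilde from eq. (4).
    a = np.exp(-1.0 / ln)
    W = np.empty(n)
    W[0] = rng.standard_normal()
    for t in range(1, n):
        W[t] = a * W[t - 1] + np.sqrt(1.0 - a * a) * rng.standard_normal()
    return W - W.mean()

def mmd_wb_draw(X, Y, ln, rng, sigma=1.0):
    # One draw of MMD_{k,b} in eq. (4); X and Y get independent processes.
    nx, ny = len(X), len(Y)
    wx, wy = ar1_w(nx, ln, rng), ar1_w(ny, ln, rng)
    return (wx @ gram(X, X, sigma) @ wx / nx**2
            + wy @ gram(Y, Y, sigma) @ wy / ny**2
            - 2.0 * wx @ gram(X, Y, sigma) @ wy / (nx * ny))

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))     # stand-ins for the audio vectors
Y = rng.standard_normal((200, 20))
draws = np.array([mmd_wb_draw(X, Y, ln=20, rng=rng) for _ in range(100)])
threshold = np.quantile(draws, 0.95)   # test threshold at alpha = 0.05
```

Each draw is a squared RKHS norm of a randomly weighted difference of embeddings, so it is nonnegative; the observed empirical MMD is compared against the quantile of such draws.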
As in [22] and the MCMC example, the artificial degeneration of the wild bootstrap process causes the Type I error to remain above the design parameter of 0.05, although it can be observed to drop with increasing sample size.
Instantaneous independence To examine the performance of the instantaneous independence test, we compare it with the Shift-HSIC procedure [8] on the 'Extinct Gaussian' autoregressive process proposed in [8, Section 4.1]. Using exactly the same setting, we compute the Type I error as a function of the temporal dependence and the Type II error as a function of the extinction rate. Figure 1 shows that all three tests (Shift-HSIC and the tests based on Vb1 and Vb2) perform similarly.
Lag-HSIC The KCSD [4] is, to our knowledge, the only test procedure to reject the null hypothesis if there exist t, t′ such that Zt and Zt′ are dependent. In the experiments, we compare lag-HSIC with KCSD on two kinds of processes: one inspired by econometrics and one from [4].
In lag-HSIC, the number of lags under examination was equal to max{10, log n}, where n is the sample size. We used Gaussian kernels with widths estimated by the median heuristic. The cumulative distribution of the V-statistics was approximated by samples from nVb2. To model the tail of this distribution, we fitted the generalized Pareto distribution to the bootstrapped samples ([23] shows that such an approximation is valid for a large class of underlying distribution functions).
The first process is a pair of time series which share a common variance,
X_t = ε_{1,t}σ_t,  Y_t = ε_{2,t}σ_t,  σ_t² = 1 + 0.45(X²_{t−1} + Y²_{t−1}),  ε_{i,t} i.i.d.∼ N(0, 1),  i ∈ {1, 2}. (6)
The above set of equations is an instance of the VEC dynamics [2] used in econometrics to model market volatility. The left panel of Figure 2 presents the Type II error rate: for KCSD it remains at 90%, while for lag-HSIC it gradually drops to zero.
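A direct simulation of (6) can be sketched as follows (the function name and the sample size are illustrative):

```python
import numpy as np

def vec_process(n, coupling=0.45, rng=None):
    """Shared-variance pair of Eq. (6): X and Y are uncorrelated at
    every lag, yet dependent through the common volatility sigma_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    X, Y = np.zeros(n), np.zeros(n)
    for t in range(1, n):
        # sigma_t^2 = 1 + 0.45 * (X_{t-1}^2 + Y_{t-1}^2)
        sigma = np.sqrt(1.0 + coupling * (X[t - 1]**2 + Y[t - 1]**2))
        X[t] = sigma * rng.standard_normal()  # X_t = eps_{1,t} * sigma_t
        Y[t] = sigma * rng.standard_normal()  # Y_t = eps_{2,t} * sigma_t
    return X, Y

X, Y = vec_process(2000)
```

Since the noise signs are symmetric, all cross-covariances between X and Y vanish; the dependence enters only through the squares feeding the next step's common variance, which is what makes this process a demanding benchmark for independence tests.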
The Type I error, which we calculated by sampling two independent copies (X(1)_t, Y(1)_t) and (X(2)_t, Y(2)_t) of the process and performing the tests on the pair (X(1)_t, Y(2)_t), was around 5% for both of the tests.
Our next experiment is a process sampled according to the dynamics proposed by [4],
X_t = cos(φ_{t,1}),  φ_{t,1} = φ_{t−1,1} + 0.1ε_{1,t} + 2πf_1 T_s,  ε_{1,t} i.i.d.∼ N(0, 1), (7)
Y_t = [2 + C sin(φ_{t,1})] cos(φ_{t,2}),  φ_{t,2} = φ_{t−1,2} + 0.1ε_{2,t} + 2πf_2 T_s,  ε_{2,t} i.i.d.∼ N(0, 1), (8)
with parameters C = 0.4, f_1 = 4Hz, f_2 = 20Hz, and sampling frequency 1/T_s = 100Hz. We compared the performance of the KCSD algorithm, with parameters set to the values recommended in [4], and the lag-HSIC algorithm. The Type II error of lag-HSIC, presented in the right panel of Figure 2, is substantially lower than that of KCSD. The Type I error (C = 0) is equal to or lower than 5% for both procedures. Curiously, the KCSD error seems to converge to zero in steps. This may be due to the method relying on a spectral decomposition of the signals across a fixed set of bands. As the number of samples increases, the quality of the spectrogram will improve, and dependence will become apparent in bands where it was undetectable at shorter signal lengths.

References
[1] M.A. Arcones. The law of large numbers for U-statistics under absolute regularity. Electron. Comm. Probab., 3:13–19, 1998.
[2] L. Bauwens, S. Laurent, and J.V.K. Rombouts. Multivariate GARCH models: a survey. J. Appl. Econ., 21(1):79–109, January 2006.
[3] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, 2004.
[4] M. Besserve, N.K. Logothetis, and B. Schölkopf. Statistical analysis of coupled time series with kernel cross-spectral density operators.
In NIPS, pages 2535–2543, 2013.
[5] I.S. Borisov and N.V. Volodko. Orthogonal series and limit theorems for canonical U- and V-statistics of stationary connected observations. Siberian Adv. Math., 18(4):242–257, 2008.
[6] S. Borovkova, R. Burton, and H. Dehling. Limit theorems for functionals of mixing processes with applications to U-statistics and dimension estimation. Trans. Amer. Math. Soc., 353(11):4261–4318, 2001.
[7] R. Bradley et al. Basic properties of strong mixing conditions. A survey and some open questions. Probability Surveys, 2:107–144, 2005.
[8] K. Chwialkowski and A. Gretton. A kernel independence test for random processes. In ICML, 2014.
[9] K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. Tech. report, arXiv preprint arXiv:1408.5404, 2014.
[10] J. Dedecker, P. Doukhan, G. Lang, S. Louhichi, and C. Prieur. Weak dependence: with examples and applications, volume 190. Springer, 2007.
[11] J. Dedecker and C. Prieur. New dependence coefficients. Examples and applications to statistics. Probability Theory and Related Fields, 132(2):203–236, 2005.
[12] P. Doukhan. Mixing. Springer, 1994.
[13] R.M. Dudley. Real analysis and probability, volume 74. Cambridge University Press, 2002.
[14] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In NIPS, volume 20, pages 489–496, 2007.
[15] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, 2012.
[16] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer, 2005.
[17] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola. A kernel statistical test of independence.
In NIPS, volume 20, pages 585–592, 2007.
[18] A. Gretton. A simpler condition for consistency of a kernel independence test. arXiv:1501.06103, 2015.
[19] Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In NIPS, 2008.
[20] P. Hehrmann. Pitch Perception as Probabilistic Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2011.
[21] A. Leucht. Degenerate U- and V-statistics under weak dependence: Asymptotic theory and bootstrap consistency. Bernoulli, 18(2):552–585, 2012.
[22] A. Leucht and M.H. Neumann. Dependent wild bootstrap for degenerate U- and V-statistics. Journal of Multivariate Analysis, 117:257–280, 2013.
[23] J. Pickands III. Statistical inference using extreme order statistics. Ann. Statist., pages 119–131, 1975.
[24] D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In NIPS, pages 1124–1132, 2013.
[25] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Statist., 41(5):2263–2291, 2013.
[26] D. Sejdinovic, H. Strathmann, M. Lomeli Garcia, C. Andrieu, and A. Gretton. Kernel Adaptive Metropolis-Hastings. In ICML, 2014.
[27] R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
[28] X. Shao. The dependent wild bootstrap. J. Amer. Statist. Assoc., 105(489):218–235, 2010.
[29] A.J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, volume LNAI 4754, pages 13–31, Berlin/Heidelberg, 2007. Springer-Verlag.
[30] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561, 2010.
[31] M. Sugiyama, T.
Suzuki, Y. Itoh, T. Kanamori, and M. Kimura. Least-squares two-sample test. Neural Networks, 24(7):735–751, 2011.
[32] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf. Kernel-based conditional independence test and application in causal discovery. In UAI, pages 804–813, 2011.
[33] X. Zhang, L. Song, A. Gretton, and A. Smola. Kernel measures of independence for non-iid data. In NIPS, volume 22, 2008.