{"title": "Conditional Independence Testing using Generative Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2202, "page_last": 2211, "abstract": "We consider the hypothesis testing problem of detecting conditional dependence, with a focus on high-dimensional feature spaces. Our contribution is a new test statistic based on samples from a generative adversarial network designed to approximate directly a conditional distribution that encodes the null hypothesis, in a manner that maximizes power (the rate of true negatives). We show that such an approach requires only that density approximation be viable in order to ensure that we control type I error (the rate of false positives); in particular, no assumptions need to be made on the form of the distributions or feature dependencies. Using synthetic simulations with high-dimensional data we demonstrate significant gains in power over competing methods. In addition, we illustrate the use of our test to discover causal markers of disease in genetic data.", "full_text": "Conditional Independence Testing using Generative\n\nAdversarial Networks\n\n1University of Cambridge, 2The Alan Turing Institute, 3University of California Los Angeles\n\nAlexis Bellot1,2 Mihaela van der Schaar1,2,3\n\n[abellot,mschaar]@turing.ac.uk\n\nAbstract\n\nWe consider the hypothesis testing problem of detecting conditional dependence,\nwith a focus on high-dimensional feature spaces. Our contribution is a new test\nstatistic based on samples from a generative adversarial network designed to\napproximate directly a conditional distribution that encodes the null hypothesis, in\na manner that maximizes power (the rate of true negatives). 
We show that such an approach requires only that density approximation be viable in order to ensure that we control type I error (the rate of false positives); in particular, no assumptions need to be made on the form of the distributions or feature dependencies. Using synthetic simulations with high-dimensional data we demonstrate significant gains in power over competing methods. In addition, we illustrate the use of our test to discover causal markers of disease in genetic data.\n\n1 Introduction\n\nConditional independence tests are concerned with the question of whether two variables X and Y behave independently of each other, after accounting for the effect of confounders Z. Such questions can be written as a hypothesis testing problem:\n\nH0 : X ⊥⊥ Y | Z versus H1 : X ⊥̸⊥ Y | Z\n\nTests for this problem have recently become increasingly popular in the machine learning literature [19, 24, 18, 17, 6] and find natural applications in causal discovery studies in all areas of science [12, 14]. One area of research where such tests are important is genetics, where one problem is to find genomic mutations directly linked to disease for the design of personalized therapies [26, 11]. In this case, researchers have a limited number of data samples with which to test relationships, even though they expect complex dependencies between variables and often high-dimensional confounding variables Z. In settings like this, existing tests may be ineffective because the accumulation of spurious correlations from a large number of variables makes it difficult to discriminate between the hypotheses. As an example, the work in [16] shows empirically that kernel-based tests have rapidly decreasing power with increasing data dimensionality.\nIn this paper, we present a test for conditional independence that relies on a different set of assumptions that we show to be more robust for testing in high-dimensional samples (X, Y, Z). 
In particular, we show that given only a viable approximation to a conditional distribution, one can derive conditional independence tests that are approximately valid in finite samples and that have non-trivial power. Our test is based on a modification of generative adversarial networks (GANs) [8] that simulates from a distribution under the assumption of conditional independence, while maintaining good power in high-dimensional data. In our procedure, after training, the first step involves simulating from our network to generate data sets consistent with H0. We then define a test statistic to capture the X − Y dependency in each sample, and compute an empirical distribution which approximates the behaviour of the statistic under H0 and can be directly compared to the statistic observed on the real data to make a decision.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe paper is outlined as follows. In section 2, we provide an overview of conditional hypothesis testing and related work. In section 3, we provide details of our test and give our main theoretical results. Sections 4 and 5 provide experiments on synthetic and real data respectively, before concluding in section 6.\n\n2 Background\n\nWe start by introducing our notation and defining central notions of hypothesis testing. Throughout, we will assume the observed data consist of n i.i.d. tuples (Xi, Yi, Zi), defined in a potentially high-dimensional space X × Y × Z, typically R^dx × R^dy × R^dz. A conditional independence test statistic T : X × Y × Z → R summarizes the evidence in the observational data against the hypothesis H0 : X ⊥⊥ Y | Z in a real-valued scalar. Its value on the observed data, compared to a defined threshold, then determines a decision of whether to reject the null hypothesis H0 or not reject H0. 
Hypothesis tests can fail in two ways:\n\n• Type I error: rejecting H0 when it is true.\n• Type II error: not rejecting H0 when it is false.\n\nWe define the p-value of a test as the probability, under H0, of observing a statistic at least as extreme as the one computed from the data, and the power of a test as the probability of correctly rejecting H0 (that is, 1 − Type II error). A good test rejects H0 when the p-value falls below a user-defined significance level α (typically α = 0.05), thereby controlling the type I error at level α, and seeks maximum power. Testing for conditional independence is a challenging problem. Shah et al. [20] showed that no conditional independence test maintains non-trivial power while controlling type I error over every null distribution. In high-dimensional samples (relative to sample size), the problem of maintaining good power is exacerbated by spurious correlations, which tend to make X and Y appear independent (conditional on Z) when they are not.\n\n2.1 Related work\n\nA recent favoured line of research has characterized conditional independence in a reproducing kernel Hilbert space (RKHS) [24, 6]. The dependence between variables is assessed by considering all moments of the joint distributions, which potentially captures finer differences between them. [24] uses a measure of partial association in an RKHS to define the KCIT test, with provable control on type I error asymptotically in the number of samples. Numerous extensions have also been proposed to remedy high computational costs, such as [21], which approximates the KCIT with random Fourier features, making it significantly faster. The limiting distribution of the test becomes harder to estimate accurately in practice [24], and different bandwidth parameters give widely divergent results with dimensionality [16], which affects power.\nTo avoid tests that rely on asymptotic null distributions, sampling strategies consider explicitly estimating the data distribution under the null assumption H0. 
Permutation-based methods [6, 17, 3, 19] follow this approach. To induce conditional independence, they select permutations of the data that preserve the marginal structure between X and Z, and between Y and Z. For a set of continuous conditioning variables, and for sizes of the conditioning set above a few variables, the \"similar\" examples (in Z) that they seek to permute are hard to define, as common notions of distance increase exponentially in magnitude with the number of variables. The approximated permutation will be inaccurate, and its computational complexity will not be manageable in practical scenarios. As an example, [6] constructs a permutation P that enforces invariance in Z (P Z ≈ Z), while [17] uses nearest neighbors to define suitable permutation sets.\nWe propose a different sampling strategy, building on the ideas of [4], who introduced the conditional randomization test (CRT). The CRT assumes that the conditional distribution of X given Z is known under the null hypothesis (in our experiments we will assume it to be Gaussian for use in practice). It then compares the known conditional distribution to the distribution of the observed samples of the original data using summary statistics. Instead, we require a weaker assumption, namely access to a viable approximation, and give an approximately valid test that does not depend on the dimensionality of the data or the distribution of the response Y, resulting in a non-parametric alternative to the CRT. [3] also extends the CRT by proposing a permutation-based approach to density estimation. Generative adversarial networks have been used for hypothesis\n\nFigure 1: Illustration of conditional independence testing with the GCIT. A generator G is optimized by adversarial training to estimate the conditional distribution X|Z under H0. We then use G to generate synthetic samples of X̃ under the estimated conditional distribution. 
Multiple draws are taken for each configuration Z, and a measure of dependence between the generated X̃ and Y, ρ̂, is computed. The sequence of synthetic ρ̂ is subsequently compared to the original sample statistic ρ to get a p-value.\n\ntesting in [18]. In this work, the authors use GANs to model the data distribution and fit a classification model to discriminate between the true and estimated samples. The difference with our test is that they provide only a loose characterization of their test statistic's distribution under H0 using Hoeffding's inequality. As an example of how this might impact performance, Hoeffding's inequality does not account for the variance in the data sample, which biases the resulting test. A second contrast with our work is that we avoid estimating the distribution exactly, and rather use the generating mechanism directly to inform our test.\n\n3 Generative Conditional Independence Test\n\nOur test for conditional independence, the GCIT (short for Generative Conditional Independence Test), compares an observed sample with a generated sample equal in distribution if and only if the null hypothesis holds. We use the following representation under H0,\n\nPr(X|Z, Y) = Pr(X|Z) ∼ qH0(X)   (1)\n\nOn the right-hand side, the null model preserves the dependence structure of Pr(X, Z) but breaks any dependency between X and Y. If there actually exists a direct causal link between X and Y, then replacing X with a null sample X̃ ∼ qH0 is likely to break this relationship.\nSampling X̃ repeatedly, conditioned on the observed confounders Z, results in an exchangeable sequence of generated triples (X̃, Y, Z) and original data (X, Y, Z) under H0. In this context, any function ρ (such as a statistic ρ : X × Y × Z → R) chosen independently of the values of X and applied to the real and generated samples preserves exchangeability. 
Hence the sequence\n\nρ(X, Y, Z), ρ(X̃(1), Y, Z), ..., ρ(X̃(M), Y, Z)   (2)\n\nis exchangeable under the null hypothesis H0, deriving from the fact that the observed data are equally likely to have arisen from any of the above. Without loss of generality, we assume that larger values of ρ are more extreme. The p-value of the test can be approximated by comparing the generated samples with the observed sample,\n\n(1/M) Σ_{m=1}^M 1{ρ(X̃(m), Y, Z) ≥ ρ(X, Y, Z)}   (3)\n\nwhich can be made arbitrarily close to the true probability, E_{X̃∼qH0} 1{ρ(X̃, Y, Z) ≥ ρ(X, Y, Z)}, by sampling additional features X̃ from qH0; 1{·} denotes the indicator function. Figure 1 gives a graphical overview of the GCIT.\n\n3.1 Generating samples from qH0\n\nIn this section we describe a sampling algorithm that adapts generative adversarial networks [8] to generate samples X̃ conditional on high-dimensional confounding variables Z. GANs provide a powerful method for general-purpose generative modeling of datasets, by designing a discriminator D explicitly used as an adversary to train a generator G responsible for estimating qH0 := Pr(X|Z). Over successive iterations both functions improve based on the performance of the adversarial player. Our implementation is based on the energy-based generative neural networks introduced in [25], which, if trained optimally, can be shown to minimize a measure of divergence between probability measures that relates directly to the theoretical bound, shown in this section, that underlies our method. Pseudo-code for the GCIT and full details of the implementation are given in Supplement D.\n\nDiscriminator. We define the discriminator as a function Dη : X × Z → [0, 1], parameterized by η, that judges whether a generated sample X̃ from G is likely to be distributed as its real counterpart X or not, conditional on Z. 
We train the discriminator by gradient descent to minimize the following loss function,\n\nLD := E_{x∼qH0} Dη(x, z) + E_{v∼p(v)} (1 − Dη(Gφ(v, z), z))   (4)\n\nwhere Gφ(v, z), v ∼ p(v), is a synthetic sample from the generator (described below) and x ∼ qH0 is a sample from the data distribution under H0. Note that in contrast to [25] we set the image of D to lie in (0, 1) and include conditional data generation.\n\nGenerator. The generator, G, takes (realizations of) Z and a noise variable, V, as inputs and returns X̃, a sample from an estimated distribution X|Z. Formally, we define G : Z × [0, 1]^d → X to be a measurable function (specifically a neural network) parameterized by φ, and V to be a d-dimensional noise variable (independent of all other variables). For the remainder of the paper, let us denote by x̃ ∼ q̂H0 the generated sample under the model distribution implicitly defined by x̃ = Gφ(v, z), v ∼ p(v). In opposition to the discriminator, G is trained to minimize\n\nLG(D) := E_{x̃∼q̂H0} Dη(x̃, z) − E_{x∼qH0} Dη(x, z)   (5)\n\nWe estimate the expectations empirically from real and generated samples.\n\n3.2 Validity of the GCIT\n\nThe following result ensures that our sampling mechanism leads to a valid test for the null hypothesis of conditional independence.\n\nProposition 1 (Exchangeability) Under the assumption that X ⊥⊥ Y | Z, any sequence of statistics (ρi)_{i=1}^M, functions of the generated triples (X̃(m), Y, Z)_{m=1}^M, is exchangeable.\n\nProof. All proofs are given in Supplement C.\nGenerating conditionally independent samples with a neural network preserves exchangeability of input samples and thus leads to a valid p-value, defined in eq. (3), for the hypothesis of conditional independence. 
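As a concrete illustration, the p-value in eq. (3) reduces to a short Monte Carlo computation once a trained generator is available. The sketch below is illustrative only: `generator` (returning one draw of X̃ per row of Z) and `rho` (the chosen statistic of section 3.4) are hypothetical callables standing in for the trained GAN and dependence measure, not part of the released implementation.

```python
import numpy as np

def gcit_p_value(x, y, z, generator, rho, M=1000):
    """Monte Carlo p-value of eq. (3): the fraction of null draws
    X~(m) ~ q_H0 whose statistic is at least as extreme as the one
    observed on the real data.  `generator` and `rho` are assumed,
    user-supplied callables (hypothetical names)."""
    rho_obs = rho(x, y, z)                      # statistic on observed data
    rho_null = np.array([rho(generator(z), y, z) for _ in range(M)])
    return float(np.mean(rho_null >= rho_obs))  # eq. (3)
```

One would then reject H0 at level α whenever the returned value falls below α.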
Under the assumption that the conditional distribution qH0 can be estimated exactly, this implies that we maintain exact control of the type I error in finite samples. In practice, however, limited amounts of data and noise will prevent us from learning the conditional distribution exactly. In such circumstances we show below that the excess type I error (that is, the proportion of false positives reported above a specified tolerated level α) is bounded by the loss function LG, which, moreover, can be made arbitrarily close to 0 for a generator with sufficient capacity. We give this second result as a corollary of the GAN's convergence properties in Supplement C.\n\nTheorem 1 An optimal discriminator D* minimizing LD exists; and, for any statistic ρ̂ = ρ(X, Y, Z), the excess type I error over a desired level α is bounded by LG(D*),\n\nPr(ρ̂ > cα|H0) − α ≤ LG(D*)   (6)\n\nwhere cα := inf{c ∈ R : Pr(ρ̂ > c) ≤ α} is the critical value of the test's distribution and Pr(ρ̂ > cα|H0) is the probability of making a type I error.\nTheorem 1 shows that the GCIT incurs an increase in type I error that depends only on the quality of our conditional density approximation, given by the loss function with respect to the generator, even in the worst case under any statistic ρ. For reasonable choices of ρ, robust to errors in the estimation of the conditional distribution, this bound is expected to be tighter. The key assumption needed to ensure control of the type I error, and therefore the validity of the GCIT, thus rests solely on our ability to find a viable approximation to the conditional distribution of X|Z. 
The capacity of deep neural networks and their success in estimating heterogeneous conditional distributions, even in high-dimensional samples, make this a reasonable assumption, and the GCIT applicable in a large number of scenarios previously unexplored.\n\n3.3 Maximizing power\n\nFor a fixed sample size, conditional dependence, H1 : X ⊥̸⊥ Y | Z, is increasingly difficult to detect with larger conditioning sets (Z), as spurious correlations due to sample size make X and Y appear independent. To maximize power it is desirable that differences between generated samples X̃ (under the model Pr(X|Z)) and observed samples X (distributed according to Pr(X|Z, Y)) be as apparent as possible. To achieve this we will encourage X̃ and X to have low mutual information because, irrespective of dimensionality, the mutual information between distributions in the null and alternative relates directly to the hardness of hypothesis testing problems, which can be seen for example via Fano's inequality (section 2.11 in [23]). To do so, we investigate the use of the information network proposed in [2] and used in the context of feature selection in [10]. [2] proposes a neural architecture and training procedure for estimating the mutual information between two random variables. We approximate the mutual information with a neural network Tθ : X × X → R, parameterized by θ, with the following objective function (to be maximized),\n\nLInfo := sup_θ E_{p(n)_{x,x̃}} [Tθ] − log E_{p(n)_x × p(n)_x̃} [exp(Tθ)]   (7)\n\nWe estimate Tθ in alternation with the discriminator and generator, given samples from the generator in every iteration. 
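For a fixed statistics network, the objective in eq. (7) is estimated from a minibatch of paired samples (the "joint" term) and a minibatch in which the pairing is broken by permutation (the "marginal" term). A minimal numpy sketch, where the callable `T` is a hypothetical stand-in for the trained network Tθ of [2]:

```python
import numpy as np

def info_objective(T, x, x_tilde, rng=None):
    """Empirical estimate of eq. (7) for a fixed statistics network T:
    E_joint[T] - log E_marginal[exp(T)].  The marginal term pairs x
    with an independent permutation of x_tilde, emulating samples from
    the product of marginals p(x) p(x_tilde)."""
    rng = rng or np.random.default_rng()
    joint = T(x, x_tilde)                      # paired (joint) samples
    marginal = T(x, rng.permutation(x_tilde))  # pairing broken
    return joint.mean() - np.log(np.exp(marginal).mean())
```

In the GCIT, this quantity is maximized over the parameters θ of T in alternation with the objectives in eqs. (4) and (5).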
We modify the loss function for the generator to include the mutual information and perform gradient descent to optimize the generator on the objective\n\nLG(D) + λ LInfo   (8)\n\nwhere λ > 0 is a hyperparameter controlling the influence of the information network. This additional term (λ LInfo) encourages the generation of samples X̃ as independent as possible from the observed variables X, such that the resulting differences (between X̃ and X) are truly a consequence of the direct dependence between X and Y rather than of spurious correlations with the confounders Z.\nTo provide some further intuition, one can see why generating data different from the sample observed under the alternative H1 might be beneficial by considering the following bound (proven in Supplement C),\n\nType I error + Type II error ≥ 1 − δTV(q̂H0, qH1)   (9)\n\nwhere q̂H0 is the null distribution estimated with the GCIT, qH1 is the distribution under H1, and δTV is the total variation distance between probability measures. This result suggests that emphasizing the differences between the estimated samples and the true samples from H1, which increases the total variation, can improve the overall performance profile of our test by reducing a lower bound on the sum of type I and type II errors.\nRemark. The GCIT aims at generating samples whose conditional distribution matches that of their real counterparts, but which can be independent of them otherwise. It is this gap that the power-maximizing procedure intends to exploit. In practice, there is a trade-off between the objectives of the discriminator and the information network, but we found that setting λ = 10 in our experiments achieved good performance. It should be noted also that hyperparameter selection cannot be performed using cross-validation, as we do not have access to the ground truth, and so the hyperparameters must typically be fixed a priori. 
However, we can consider artificially inducing conditional independence (X ⊥⊥ Y | Z) (by permuting the variables X and Y so as to preserve the marginal dependence in (X, Z) and (Y, Z)) and choosing the hyperparameters that best control the type I error. We explore this further in Supplement A and test configurations of λ with synthetic data in section 4.2.\n\n3.4 Choice of statistic ρ\n\nThe bound on the type I error given in Theorem 1 holds for any choice of statistic ρ, as it depends solely on the estimation of the conditional distribution. For choices of ρ less sensitive to spurious differences between generated and true samples when the null H0 holds, the type I error is expected to fall below this bound. We experimented with various dependence measures (between two samples) as choices for ρ. We considered the Maximum Mean Discrepancy [9], Pearson's correlation coefficient, the distance correlation (which measures both linear and nonlinear association, in contrast to Pearson's correlation), the Kolmogorov-Smirnov distance between two samples, and the randomized dependence coefficient [13]. In our experiments we use the distance correlation and analyze performance using all other measures in Supplement A.\n\n4 Synthetic data example\n\nIn this section we analyse the performance of the GCIT1 in a controlled fashion with synthetic data against a wide range of competing algorithms, illustrating the effects of the different components of our method. 
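For reference, the distance correlation adopted in section 3.4 as our default ρ can be computed directly from double-centered pairwise distance matrices (the standard biased sample estimator); a minimal numpy sketch for univariate samples, not taken from the released code:

```python
import numpy as np

def _centered_dist(a):
    """Double-centered pairwise distance matrix of a 1-D sample."""
    d = np.abs(a[:, None] - a[None, :])
    return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation: detects both linear and nonlinear
    association, in contrast to Pearson's correlation."""
    A = _centered_dist(np.asarray(x, dtype=float))
    B = _centered_dist(np.asarray(y, dtype=float))
    dcov2 = (A * B).mean()                          # squared distance covariance
    dvar = np.sqrt((A * A).mean() * (B * B).mean())  # product of distance variances
    return float(np.sqrt(max(dcov2, 0.0) / dvar)) if dvar > 0 else 0.0
```

Invariance under affine maps of either argument (e.g. `distance_correlation(x, 2*x + 3)` equals 1 for non-constant `x`) makes it a convenient default statistic.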
We consider the CRT [4] with a pre-specified Gaussian sampling distribution whose parameters are estimated from data; the kernel methods KCIT [24] and RCoT [21], with bandwidth parameter estimated as the median of all pairwise distances between X and Y, a common choice in the literature; and the CCIT [19], which does not make prior assumptions on the data distributions but was also not specifically designed for high-dimensional data.\nWhen testing at level α, the type I error should be as close as possible to α, even though this might not be the case because of violated assumptions or approximations. An important consideration in our discussion of power, as we increase the dimensionality of Z, is the choice of alternatives H1. For instance, if the strength of the dependency between X and Y increases, the hypothesis testing problem is made artificially easier, biasing our conclusions with regard to data dimensionality, as observed also in [16]. In every synthetic experiment, we keep the mutual information between X and Y approximately constant by first generating data and then estimating the mutual information before deciding either to draw a new dataset, if the mutual information disagrees with the previous draw, or otherwise to proceed with testing. 
We estimate the mutual information with a Gaussian approximation, MI(X, Y) = −(1/2) log(1 − ρ̂²), where ρ̂ is the linear correlation between X and Y.\n\n4.1 Setup\n\nWe generate synthetic data according to the \"post non-linear noise model\", similarly to [24, 6, 21], which defines (X, Y, Z) under H0 and H1 as follows,\n\nH0 : X = f(Af Z + εf),  Y = g(Ag Z + εg)   (10)\nH1 : Y = h(Ah Z + αX + εh)   (11)\n\nThe matrix dimensions of A(·) are such that X and Y are univariate, the matrix entries as well as the parameter α are generated at random in the interval [0, 1], and, lastly, the noise variables ε(·) are 0 on average with variance 0.025. The distributions of X, Y and ε, and the complexity of the dependencies via f, g and h, will be tuned carefully to make performance comparisons in three settings:\n(1) Multivariate Gaussian\nWe set f, g and h to be the identity functions, which induces linear dependencies, Z ∼ N(0, σ²), and X ∼ N(0, σ²) under H1, which results in jointly Gaussian data under the null and the alternative. Such a setting matches the assumptions of all methods, and the interest of this study is to provide a baseline for more complex scenarios.\n(2) Multivariate Laplace\nKernel choice has a large impact on power, as we demonstrate in this setting. In this case, we set f, g and h as before but use a Laplace distribution to generate Z and X. The RBF kernel in this case overestimates the \"smoothness\" of the data. This study highlights the robustness of the GCIT in comparison to kernel-based methods, which is important since hyperparameters cannot be tuned by cross-validation.\n(3) Arbitrary distributions\nWe set f, g and h to be randomly sampled from {x³, tanh x, exp(−x)}, resulting in more complex distributions and variable dependencies. Here Z ∼ N(0, σ²), and X ∼ N(0, σ²) under H1. 
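The simulation design above can be reproduced in a few lines; a minimal numpy sketch of setting (1) (identity link functions) together with the Gaussian mutual-information approximation. Function names, default dimensions and σ are illustrative, not the released implementation:

```python
import numpy as np

def simulate(n=500, dz=100, sigma=1.0, null=True, seed=0):
    """Post non-linear noise model, setting (1): f, g, h = identity.
    H0: X = Af Z + eps_f, Y = Ag Z + eps_g           (eq. 10)
    H1: X ~ N(0, sigma^2), Y = Ah Z + alpha X + eps_h (eq. 11)
    Matrix entries and alpha are drawn uniformly in [0, 1]; the noise
    variance is 0.025, as in the paper."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(0.0, sigma, size=(n, dz))
    Af, Ah, alpha = rng.uniform(0, 1, dz), rng.uniform(0, 1, dz), rng.uniform()
    noise = lambda: rng.normal(0.0, np.sqrt(0.025), n)
    if null:
        X, Y = Z @ Af + noise(), Z @ Ah + noise()
    else:
        X = rng.normal(0.0, sigma, n)
        Y = Z @ Ah + alpha * X + noise()
    return X, Y, Z

def gaussian_mi(x, y):
    """Gaussian approximation MI(X, Y) = -1/2 log(1 - rho_hat^2)."""
    r = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - r ** 2)
```

In the experiments, a draw is kept only when its estimated mutual information agrees with the previous draw, so that power comparisons across dimensions are not confounded by dependence strength.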
This is our most general setting and most faithfully resembles the complexities we can expect in real applications.\n\n1 An implementation of our test and a tutorial are available at https://bitbucket.org/mvdschaar/mlforhealthlabpub/src/master/alg/gcit/.\n\nFigure 2: Power results of the synthetic simulations (higher is better). Left panel: (1) Multivariate Gaussian; middle panel: (2) Multivariate Laplace; right panel: (3) Arbitrary distributions.\n\nResults: Power as a function of the dimensionality of Z is shown in Figure 2. Each point on the curves is computed by taking averages over 1000 random experiments with sample size equal to 500 examples. The results from scenario (1) are consistent with our expectations; all methods perform comparably, with the CRT and kernel-based methods achieving high power in lower dimensions while slightly under-performing in higher dimensions. In scenarios (2) and (3), the failure of the CRT and kernel-based methods is apparent, while the GCIT maintains high power even with increasing dimensionality, which demonstrates the robustness of our sampling mechanism to arbitrarily complex data distributions. The CCIT outperforms kernel-based methods in these cases also. An important contrast of the GCIT with respect to the CCIT is our addition of the information network, which we argue contributes to the higher power observed across all experiments. We analyze this empirically below.\nFigure 2 in Supplement B shows that the type I error is approximately controlled at a level α for all methods. 
Observe also that, even though the GCIT requires training a new GAN in every iteration, Figure 3 in Supplement B shows empirically that running times for the GCIT scale much better with dimensionality and sample size than those of the best benchmark, the CCIT: its running times are prohibitive in practice with more than 1000 samples or 500 dimensions in Z, with each test taking over 600s, versus 60s for the GCIT.\n\nFigure 3: Type I error and power for different values of λ.\n\n4.2 Source of gain: consequences of the information network\n\nThe information network aims to encourage maximum power in high-dimensional data. We control for its influence by varying λ in the loss function of the GCIT given in eq. (8). Higher values of λ encourage the generation of independent samples, which improves power, even though it might decrease the accuracy of the density approximation in the GAN optimization when the null in fact holds. We observe this trade-off between power and type I error for higher values of λ in Figure 3. The underlying data were generated from setting (1); each curve in the two panels corresponds to a different value of λ. Lastly, we computed the lower bound in eq. (9) from GCIT-generated samples and observed samples (by numerical integration) and conclude that higher values of λ did decrease the lower bound, as expected.\n\n5 Genetic data example\n\nThere is compelling evidence that the likelihood of a patient's cancer responding to treatment can be strongly influenced by alterations in the cancer genome [7]. 
We study the response of cancer cell lines to an anti-cancer drug, where the problem is to distinguish between genetic mutations that directly influence the cancer cell line response and those that are not directly relevant [1, 22]. We use the subset of the CCLE data [1] relating to the drug PLX4720; it contains 474 cancer cell lines described by 466 genetic mutations. More details on the data can be found in Supplement E.\n\nFigure 4: Genetic experiment results. Each cell gives the p-value or importance rank (where appropriate) indicating the dependency between a mutation and drug response.\n\nEvaluating conditional independence relations from real data is difficult, as we do not have access to the ground-truth causal links. Instead we give our results in comparison to those of [1], who proceeded by reporting discriminative features returned by the parameter values of a fitted elastic net regression model (EN). This is common practice in genetic studies; see for example also [7]. In addition, we compare with the rank of each feature given by the importance scores of a random forest model (RF) and the p-value assigned by the CRT. The results for 10 selected mutations can be found in Figure 4. The first two rows give the ranks of the heuristic methods and the last two rows give the p-values of the conditional independence tests. We distinguish between the mutations on which all methods agree (in the leftmost columns) and the mutations on which not all methods agree (in the rightmost columns).\nThe mutations on genes PIP5K1A and MAP3K5 are recognized as discriminative by the random forest model (high rank) and the GCIT (low p-value), which highlights the significance of the GCIT for conditional independence testing, suggesting that non-linear dependencies occur which are not captured by the elastic net or the CRT. For further evaluation, in this case we were able to cross-reference with a previous study to find evidence that the PIP5K1A gene has a differential response on cancer cell lines when PLX4720 is applied [22]. The MAP3K5 gene has not previously been reported in the literature as being directly linked to the PLX4720 drug response; however, [15] did find a proliferation of these gene mutations to be of BRAF type in cancer patients. This is interesting because PLX4720 is precisely designed as a BRAF inhibitor, and thus we would expect it to have an impact also on MAP3K5 mutations of the BRAF type. FLT3 is an interesting gene, found to be dependent on cancer response by the EN, RF and CRT, but not by the GCIT. This finding by the GCIT was confirmed, however, by a posterior genetic study [5] that established no link between cancer response and FLT3 mutations in the presence of PLX4720. Such results encourage us to believe that the GCIT is better able to detect dependence for these problems.\n\n6 Conclusions and future perspectives\n\nWe propose a generative approach to conditional independence testing using generative adversarial networks. We show that this approach results in an approximately valid test for an arbitrary data distribution, irrespective of the number of variables observed. We have demonstrated through simulated data significant gains in statistical power, and we illustrated the application of our method to discover genetic markers for cancer drug response on real high-dimensional data.\nFrom a practical perspective, algorithms based on other generative models can be constructed based on our proposed procedure and may be more adequate for different data modalities. 
In a general sense, this work opens the door to principled statistical testing with more heterogeneous data, and expands our ability to reason about and test variable relationships in more challenging scenarios.

7 Acknowledgements

We thank the anonymous reviewers for valuable feedback. This work was supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1, the ONR, and the NSF grants 1462245 and 1533983.

References

[1] Jordi Barretina, Giordano Caponigro, Nicolas Stransky, Kavitha Venkatesan, Adam A Margolin, Sungjoon Kim, Christopher J Wilson, Joseph Lehár, Gregory V Kryukov, Dmitriy Sonkin, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483(7391):603, 2012.

[2] Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. MINE: Mutual information neural estimation. In International Conference on Machine Learning, 2018.

[3] Thomas B Berrett, Yi Wang, Rina Foygel Barber, and Richard J Samworth. The conditional permutation test. arXiv preprint arXiv:1807.05405, 2018.

[4] Emmanuel Candes, Yingying Fan, Lucas Janson, and Jinchi Lv. Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(3):551–577, 2018.

[5] Anindya Chatterjee, Joydeep Ghosh, Baskar Ramdas, Raghuveer Singh Mali, Holly Martin, Michihiro Kobayashi, Sasidhar Vemula, Victor H Canela, Emily R Waskow, Valeria Visconte, et al. Regulation of Stat5 by FAK and PAK1 in oncogenic FLT3- and KIT-driven leukemogenesis. Cell Reports, 9(4):1333–1348, 2014.

[6] Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Schölkopf. A permutation-based kernel conditional independence test.
In UAI, pages 132–141, 2014.

[7] Mathew J Garnett, Elena J Edelman, Sonja J Heidorn, Chris D Greenman, Anahita Dastur, King Wai Lau, Patricia Greninger, I Richard Thompson, Xi Luo, Jorge Soares, et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature, 483(7391):570, 2012.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[9] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[10] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. KnockoffGAN: Generating knockoffs for feature selection using generative adversarial networks. In ICLR, 2019.

[11] Amit V Khera and Sekar Kathiresan. Genetics of coronary artery disease: discovery, biology and clinical translation. Nature Reviews Genetics, 18(6):331, 2017.

[12] Steffen L Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.

[13] David Lopez-Paz, Philipp Hennig, and Bernhard Schölkopf. The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9, 2013.

[14] Judea Pearl et al. Causal inference in statistics: An overview. Statistics Surveys, 3:96–146, 2009.

[15] Todd D Prickett, Brad Zerlanko, Jared J Gartner, Stephen CJ Parker, Ken Dutton-Regester, Jimmy C Lin, Jamie K Teer, Xiaomu Wei, Jiji Jiang, Guo Chen, et al. Somatic mutations in MAP3K5 attenuate its proapoptotic function in melanoma through increased binding to thioredoxin. Journal of Investigative Dermatology, 134(2):452–460, 2014.

[16] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry A Wasserman.
On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, pages 3571–3577, 2015.

[17] Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In AISTATS, 2018.

[18] Rajat Sen, Karthikeyan Shanmugam, Himanshu Asnani, Arman Rahimzamani, and Sreeram Kannan. Mimic and classify: A meta-algorithm for conditional independence testing. arXiv preprint arXiv:1806.09708, 2018.

[19] Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, pages 2951–2961, 2017.

[20] Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203, 2018.

[21] Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional independence tests for fast non-parametric causal discovery. arXiv preprint arXiv:1702.03877, 2017.

[22] Wesley Tansey, Victor Veitch, Haoran Zhang, Raul Rabadan, and David M Blei. The holdout randomization test: Principled and easy black box feature selection. arXiv preprint arXiv:1811.00645, 2018.

[23] Thomas M Cover and Joy A Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

[24] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. In UAI, 2012.

[25] Junbo Zhao, Michael Mathieu, and Yann LeCun.
Energy-based generative adversarial network. In ICLR, 2017.

[26] Zhihong Zhu, Zhili Zheng, Futao Zhang, Yang Wu, Maciej Trzaskowski, Robert Maier, Matthew R Robinson, John J McGrath, Peter M Visscher, Naomi R Wray, et al. Causal associations between risk factors and common diseases inferred from GWAS summary data. Nature Communications, 9(1):224, 2018.