{"title": "Model-Powered Conditional Independence Test", "book": "Advances in Neural Information Processing Systems", "page_first": 2951, "page_last": 2961, "abstract": "We consider the problem of non-parametric Conditional Independence testing (CI testing) for continuous random variables. Given i.i.d samples from the joint distribution $f(x,y,z)$ of continuous random vectors $X,Y$ and $Z,$ we determine whether $X \\independent Y \\vert Z$. We approach this by converting the conditional independence test into a classification problem.  This allows us to harness very powerful classifiers like gradient-boosted trees and deep neural networks.  These models can handle complex probability distributions and allow us to perform significantly better compared to the prior state of the art, for high-dimensional CI testing. The main technical challenge in the classification problem is the need for samples from the conditional product distribution $f^{CI}(x,y,z) = f(x|z)f(y|z)f(z)$ -- the joint distribution if and only if $X \\independent Y \\vert Z.$ -- when given access only to i.i.d.  samples from the true joint distribution $f(x,y,z)$.  To tackle this problem we propose a novel nearest neighbor bootstrap procedure and theoretically show that our generated samples are indeed close to $f^{CI}$ in terms of total variational distance. We then develop theoretical results regarding the generalization bounds for classification for our problem, which translate into error bounds for CI testing. We provide a novel analysis of Rademacher type classification bounds in the presence of non-i.i.d \\textit{near-independent} samples. We empirically validate the performance of our algorithm on simulated and real datasets and show performance gains over previous methods.", "full_text": "Model-Powered Conditional Independence Test\n\nRajat Sen1,*, Ananda Theertha Suresh2,*, Karthikeyan Shanmugam3,*, Alexandros G. 
Dimakis1, and\n\nSanjay Shakkottai1\n\n1The University of Texas at Austin\n\n2Google, New York\n\n3IBM Research, Thomas J. Watson Center\n\nAbstract\n\nWe consider the problem of non-parametric Conditional Independence testing (CI testing) for continuous random variables. Given i.i.d. samples from the joint distribution $f(x, y, z)$ of continuous random vectors $X$, $Y$ and $Z$, we determine whether $X \\independent Y \\vert Z$. We approach this by converting the conditional independence test into a classification problem. This allows us to harness very powerful classifiers like gradient-boosted trees and deep neural networks. These models can handle complex probability distributions and allow us to perform significantly better compared to the prior state of the art, for high-dimensional CI testing. The main technical challenge in the classification problem is the need for samples from the conditional product distribution $f^{CI}(x, y, z) = f(x|z)f(y|z)f(z)$ \u2013 the joint distribution if and only if $X \\independent Y \\vert Z$ \u2013 when given access only to i.i.d. samples from the true joint distribution $f(x, y, z)$. To tackle this problem we propose a novel nearest neighbor bootstrap procedure and theoretically show that our generated samples are indeed close to $f^{CI}$ in terms of total variational distance. We then develop theoretical results regarding the generalization bounds for classification for our problem, which translate into error bounds for CI testing. We provide a novel analysis of Rademacher type classification bounds in the presence of non-i.i.d. near-independent samples. 
We empirically validate the performance of our algorithm on simulated and real datasets and show performance gains over previous methods.\n\n1 Introduction\n\nTesting datasets for Conditional Independence (CI) has significant applications in several statistical/learning problems; examples include discovering/testing for edges in Bayesian networks [15, 27, 7, 9], causal inference [23, 14, 29, 5] and feature selection through Markov Blankets [16, 31]. Given a triplet of random variables/vectors $(X, Y, Z)$, we say that $X$ is conditionally independent of $Y$ given $Z$ (denoted by $X \\independent Y \\vert Z$) if the joint distribution $f_{X,Y,Z}(x, y, z)$ factorizes as $f_{X,Y,Z}(x, y, z) = f_{X|Z}(x|z) f_{Y|Z}(y|z) f_Z(z)$. The problem of Conditional Independence Testing (CI Testing) can be defined as follows: given $n$ i.i.d. samples from $f_{X,Y,Z}(x, y, z)$, distinguish between the two hypotheses $H_0 : X \\independent Y \\vert Z$ and $H_1 : X \\not\\independent Y \\vert Z$.\nIn this paper we propose a data-driven Model-Powered CI test. The central idea in a model-driven approach is to convert a statistical testing or estimation problem into a pipeline that utilizes the power of supervised learning models like classifiers and regressors; such pipelines can then leverage recent advances in classification/regression in high-dimensional settings. In this paper, we take such a model-powered approach (illustrated in Fig. 1), which reduces the problem of CI testing to Binary Classification. 
Specifically, the key steps of our procedure are as follows:\n\n* Equal Contribution\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n[Figure 1 (diagram): of the 3n original samples, n are kept as $U_1$ (label 1) and 2n go through the Nearest Neighbor Bootstrap to give n samples $U_2'$ close to $f^{CI}$ (label 0); the labeled samples are shuffled and split into a training set $D_r$, on which a classifier $\\hat{g} \\in G$ is trained, and a test set $D_e$, on which the test error $L(\\hat{g}, D_e)$ is measured.]\n\nFigure 1: Illustration of our methodology. A part of the original samples are kept aside in $U_1$. The rest of the samples are used in our nearest neighbor boot-strap to generate a data-set $U_2'$ which is close to $f^{CI}$ in distribution. The samples are labeled as shown and a classifier is trained on a training set. The test error is measured on a test set there-after. If the test-error is close to 0.5, then $H_0$ is not rejected, however if the test error is low then $H_0$ is rejected.\n\n(i) Suppose we are provided $3n$ i.i.d samples from $f_{X,Y,Z}(x, y, z)$. We keep aside $n$ of these original samples in a set $U_1$ (refer to Fig. 1). The remaining $2n$ of the original samples are processed through our first module, the nearest-neighbor bootstrap (Algorithm 1 in our paper), which produces $n$ simulated samples stored in $U_2'$. In Section 3, we show that these generated samples in $U_2'$ are in fact close in total variational distance (defined in Section 3) to the conditionally independent distribution $f^{CI}(x, y, z) \\triangleq f_{X|Z}(x|z) f_{Y|Z}(y|z) f_Z(z)$. (Note that only under $H_0$ does the equality $f^{CI}(\\cdot) = f_{X,Y,Z}(\\cdot)$ hold; our method generates samples close to $f^{CI}(x, y, z)$ under both hypotheses).\n(ii) Subsequently, the original samples kept aside in $U_1$ are labeled 1 while the new samples simulated from the nearest-neighbor bootstrap (in $U_2'$) are labeled 0. 
The labeled samples ($U_1$ with label 1 and $U_2'$ with label 0) are aggregated into a data-set $D$. This set $D$ is then broken into training and test sets $D_r$ and $D_e$, containing $n$ samples each.\n(iii) Given the labeled training data-set (from step (ii)), we train powerful classifiers such as gradient boosted trees [6] or deep neural networks [17] which attempt to learn the classes of the samples. If the trained classifier has good accuracy over the test set, then intuitively it means that the joint distribution $f_{X,Y,Z}(\\cdot)$ is distinguishable from $f^{CI}$ (note that the generated samples labeled 0 are close in distribution to $f^{CI}$). Therefore, we reject $H_0$. On the other hand, if the classifier has accuracy close to random guessing, then $f_{X,Y,Z}(\\cdot)$ is in fact close to $f^{CI}$, and we fail to reject $H_0$.\nFor independence testing (i.e. whether $X \\independent Y$), classifiers were recently used in [19]. Their key observation was that given i.i.d. samples $(X, Y)$ from $f_{X,Y}(x, y)$, if the $Y$ coordinates are randomly permuted then the resulting samples exactly emulate the distribution $f_X(x) f_Y(y)$. Thus the problem can be converted to a two-sample test between a subset of the original samples and the other subset which is permuted; binary classifiers were then harnessed for this two-sample testing (for details see [19]). However, in the case of CI testing we need to emulate samples from $f^{CI}$. This is harder because the permutation of the samples needs to be $Z$-dependent (and $Z$ can be high-dimensional). One of our key technical contributions is in proving that our nearest-neighbor bootstrap in step (i) achieves this task.\nThe advantage of this modular approach is that we can harness the power of classifiers (in step (iii) above), which have good accuracies in high dimensions. Thus, any improvements in the field of binary classification imply an advancement in our CI test. 
Moreover, there is added flexibility in choosing the best classifier based on domain knowledge about the data-generation process. Finally, our bootstrap is also efficient owing to fast algorithms for identifying nearest-neighbors [24].\n\n1.1 Main Contributions\n\n(i) (Classification based CI testing) We reduce the problem of CI testing to Binary Classification as detailed in steps (i)-(iii) above and in Fig. 1. We simulate samples that are close to $f^{CI}$ through a novel nearest-neighbor bootstrap (Algorithm 1) given access to i.i.d. samples from the joint distribution. The problem of CI testing then reduces to a two-sample test between the original samples in $U_1$ and the bootstrapped samples in $U_2'$, which can be effectively done by binary classifiers.\n(ii) (Guarantees on Bootstrapped Samples) As mentioned in steps (i)-(iii), if the samples generated by the bootstrap (in $U_2'$) are close to $f^{CI}$, then the CI testing problem reduces to testing whether the data-sets $U_1$ and $U_2'$ are distinguishable from each other. We theoretically justify that this is indeed true. Let $\\phi_{X,Y,Z}(x, y, z)$ denote the distribution of a sample produced by Algorithm 1, when it is supplied with $2n$ i.i.d. samples from $f_{X,Y,Z}(\\cdot)$. In Theorem 1, we prove that $d_{TV}(\\phi, f^{CI}) = O(1/n^{1/d_z})$ under appropriate smoothness assumptions. Here $d_z$ is the dimension of $Z$ and $d_{TV}$ denotes total variational distance (Def. 1).\n(iii) (Generalization Bounds for Classification under near-independence) The samples generated from the nearest-neighbor bootstrap do not remain i.i.d. but they are close to i.i.d. We quantify this property and go on to show generalization risk bounds for the classifier. Let us denote the class of functions encoded by the classifier as $G$. Let $\\hat{R}$ denote the probability of error of the optimal classifier $\\hat{g} \\in G$ trained on the training set (Fig. 1). 
We prove that under appropriate assumptions, we have\n\n$$r_0 - O(1/n^{1/d_z}) \\le \\hat{R} \\le r_0 + O(1/n^{1/d_z}) + O\\left(\\sqrt{V}\\left(n^{-1/3} + \\sqrt{2^{d_z}/n}\\right)\\right)$$\n\nwith high probability, up to log factors. Here $r_0 = 0.5(1 - d_{TV}(f, f^{CI}))$ and $V$ is the VC dimension [30] of the class $G$. Thus when $f$ is equivalent to $f^{CI}$ ($H_0$ holds) the error rate of the classifier is close to 0.5. But when $H_1$ holds the loss is much lower. We provide a novel analysis of Rademacher complexity bounds [4] under near-independence which is of independent interest.\n(iv) (Empirical Evaluation) We perform extensive numerical experiments where our algorithm outperforms the state of the art [32, 28]. We also apply our algorithm for analyzing CI relations in the protein signaling network data from the flow cytometry data-set [26]. In practice we observe that the performance with respect to dimension of $Z$ scales much better than expected from our worst case theoretical analysis. This is because powerful binary classifiers perform well in high dimensions.\n\n1.2 Related Work\n\nIn this paper we address the problem of non-parametric CI testing when the underlying random variables are continuous. The literature on non-parametric CI testing is vast. We will review some of the recent work in this field that is most relevant to our paper.\nMost of the recent work in CI testing is kernel based [28, 32, 10]. Many of these works build on the study in [11], where non-parametric CI relations are characterized using covariance operators for Reproducing Kernel Hilbert Spaces (RKHS) [11]. KCIT [32] uses the partial association of regression functions relating $X$, $Y$, and $Z$. RCIT [28] is an approximate version of KCIT that attempts to improve running times when the number of samples is large. KCIPT [10] is perhaps most relevant to our work. In [10], a specific permutation of the samples is used to simulate data from $f^{CI}$. 
An expensive linear program needs to be solved in order to calculate the permutation. On the other hand, we use a simple nearest-neighbor bootstrap and further we provide theoretical guarantees about the closeness of the samples to $f^{CI}$ in terms of total variational distance. Finally the two-sample test in [10] is based on a kernel method [3], while we use binary classifiers for the same purpose. There has also been recent work on entropy estimation [13] using nearest neighbor techniques (used for density estimation); this can subsequently be used for CI testing by estimating the conditional mutual information $I(X; Y|Z)$.\nBinary classification has been recently used for two-sample testing, in particular for independence testing [19]. Our analysis of generalization guarantees of classification is aimed at recovering guarantees similar to [4], but in a non-i.i.d. setting. In this regard (non-i.i.d. generalization guarantees), there has been recent work in proving Rademacher complexity bounds for $\\beta$-mixing stationary processes [21]. This work also falls in the category of machine learning reductions, where the general philosophy is to reduce various machine learning settings like multi-class regression [2], ranking [1], reinforcement learning [18], structured prediction [8] to that of binary classification.\n\n2 Problem Setting and Algorithms\n\nIn this section we describe the algorithmic details of our CI testing procedure. We first formally define our problem. Then we describe our bootstrap algorithm for generating the data-set that mimics samples from $f^{CI}$. We give a detailed pseudo-code for our CI testing process which reduces the problem to that of binary classification. 
Finally, we suggest further improvements to our algorithm.\nProblem Setting: The problem setting is that of non-parametric Conditional Independence (CI) testing given i.i.d. samples from the joint distributions of random variables/vectors [32, 10, 28]. We are given $3n$ i.i.d. samples from a continuous joint distribution $f_{X,Y,Z}(x, y, z)$ where $x \\in \\mathbb{R}^{d_x}$, $y \\in \\mathbb{R}^{d_y}$ and $z \\in \\mathbb{R}^{d_z}$. The goal is to test whether $X \\independent Y \\vert Z$, i.e. whether $f_{X,Y,Z}(x, y, z)$ factorizes as\n\n$$f_{X,Y,Z}(x, y, z) = f_{X|Z}(x|z) f_{Y|Z}(y|z) f_Z(z) \\triangleq f^{CI}(x, y, z).$$\n\nThis is essentially a hypothesis testing problem where $H_0 : X \\independent Y \\vert Z$ and $H_1 : X \\not\\independent Y \\vert Z$.\nNote: For notational convenience, we will drop the subscripts when the context is evident. For instance we may use $f(x|z)$ in place of $f_{X|Z}(x|z)$.\nNearest-Neighbor Bootstrap: Algorithm 1 is a procedure to generate a data-set $U'$ consisting of $n$ samples given a data-set $U$ of $2n$ i.i.d. samples from the distribution $f_{X,Y,Z}(x, y, z)$. The data-set $U$ is broken into two equally sized partitions $U_1$ and $U_2$. Then for each sample in $U_1$, we find the nearest neighbor in $U_2$ in terms of the $Z$ coordinates. The $Y$-coordinates of the sample from $U_1$ are exchanged with the $Y$-coordinates of its nearest neighbor (in $U_2$); the modified sample is added to $U'$.\n\nAlgorithm 1 DataGen - Given data-set $U = U_1 \\cup U_2$ of $2n$ i.i.d. samples from $f(x, y, z)$ ($|U_1| = |U_2| = n$), returns a new data-set $U'$ having $n$ samples.\n1: function DATAGEN($U_1$, $U_2$, $2n$)\n2:   $U' = \\emptyset$\n3:   for $u = (x, y, z)$ in $U_1$ do\n4:     Let $v = (x', y', z') \\in U_2$ be the sample such that $z'$ is the 1-Nearest Neighbor (1-NN) of $z$ (in $\\ell_2$ norm) in the whole data-set $U_2$.\n5:     Let $u' = (x, y', z)$ and $U' = U' \\cup \\{u'\\}$.\n6:   end for\n7:   return $U'$\n8: end function\n\nOne of our main results is that the samples in $U'$ generated in Algorithm 1 mimic samples coming from the distribution $f^{CI}$. Suppose $u = (x, y, z) \\in U_1$ is a sample such that $f_Z(z)$ is not too small. In this case $z'$ (the 1-NN sample from $U_2$) will not be far from $z$. 
Therefore given a fixed $z$, under appropriate smoothness assumptions, $y'$ will be close to an independent sample coming from $f_{Y|Z}(y|z') \\sim f_{Y|Z}(y|z)$. On the other hand if $f_Z(z)$ is small, then $z$ is a rare occurrence and will not contribute adversely.\nCI Testing Algorithm: Now we introduce our CI testing algorithm, which uses Algorithm 1 along with binary classifiers. The pseudo-code is in Algorithm 2 (Classifier CI Test - CCIT).\n\nAlgorithm 2 CCITv1 - Given data-set $U$ of $3n$ i.i.d. samples from $f(x, y, z)$, returns whether $X \\independent Y \\vert Z$.\n1: function CCIT($U$, $3n$, $\\tau$, $G$)\n2:   Partition $U$ into three disjoint partitions $U_1$, $U_2$ and $U_3$ of size $n$ each, randomly.\n3:   Let $U_2' = $ DataGen($U_2$, $U_3$, $2n$) (Algorithm 1). Note that $|U_2'| = n$.\n4:   Create labeled data-set $D := \\{(u, \\ell = 1)\\}_{u \\in U_1} \\cup \\{(u', \\ell' = 0)\\}_{u' \\in U_2'}$.\n5:   Divide data-set $D$ into train and test sets $D_r$ and $D_e$ respectively. Note that $|D_r| = |D_e| = n$.\n6:   Let $\\hat{g} = \\arg\\min_{g \\in G} \\hat{L}(g, D_r)$, where $\\hat{L}(g, D_r) := \\frac{1}{|D_r|} \\sum_{(u, \\ell) \\in D_r} 1\\{g(u) \\neq \\ell\\}$. This is Empirical Risk Minimization for training the classifier (finding the best function in the class $G$).\n7:   If $\\hat{L}(\\hat{g}, D_e) > 0.5 - \\tau$, then conclude $X \\independent Y \\vert Z$, otherwise, conclude $X \\not\\independent Y \\vert Z$.\n8: end function\n\nIn Algorithm 2, the original samples in $U_1$ and the nearest-neighbor bootstrapped samples in $U_2'$ should be almost indistinguishable if $H_0$ holds. However, if $H_1$ holds, then the classifier trained in Line 6 should be able to easily distinguish between the samples corresponding to different labels. In Line 6, $G$ denotes the space of functions over which risk minimization is performed in the classifier. We will show (in Theorem 1) that the variational distance between the distribution of one of the samples in $U_2'$ and $f^{CI}(x, y, z)$ is very small for large $n$. However, the samples in $U_2'$ are not exactly i.i.d. but close to i.i.d. 
Therefore, in practice for finite $n$, there is a small bias $b > 0$, i.e. $\\hat{L}(\\hat{g}, D_e) \\sim 0.5 - b$, even when $H_0$ holds. The threshold $\\tau$ needs to be greater than $b$ in order for Algorithm 2 to function. In the next section, we present an algorithm where this bias is corrected.\nAlgorithm with Bias Correction: We present an improved bias-corrected version of our algorithm as Algorithm 3. As mentioned in the previous section, in Algorithm 2, the optimal classifier may be able to achieve a loss slightly less than 0.5 in the case of finite $n$, even when $H_0$ is true. However, the classifier is expected to distinguish between the two data-sets only based on the $Y, Z$ coordinates, as the joint distribution of $X$ and $Z$ remains the same in the nearest-neighbor bootstrap. The key idea in Algorithm 3 is to train a classifier only using the $Y$ and $Z$ coordinates, denoted by $\\hat{g}'$. As before we also train another classifier using all the coordinates, which is denoted by $\\hat{g}$. The test loss of $\\hat{g}'$ is expected to be roughly $0.5 - b$, where $b$ is the bias mentioned in the previous section. Therefore, we can just subtract this bias. Thus, when $H_0$ is true $\\hat{L}(\\hat{g}', D_e') - \\hat{L}(\\hat{g}, D_e)$ will be close to 0. However, when $H_1$ holds, then $\\hat{L}(\\hat{g}, D_e)$ will be much lower, as the classifier $\\hat{g}$ has been trained leveraging the information encoded in all the coordinates.\n\nAlgorithm 3 CCITv2 - Given data-set $U$ of $3n$ i.i.d. samples, returns whether $X \\independent Y \\vert Z$.\n1: function CCIT($U$, $3n$, $\\tau$, $G$)\n2:   Perform Steps 1-5 as in Algorithm 2.\n3:   Let $D_r' = \\{((y, z), \\ell)\\}_{(u = (x, y, z), \\ell) \\in D_r}$. Similarly, let $D_e' = \\{((y, z), \\ell)\\}_{(u = (x, y, z), \\ell) \\in D_e}$. These are the training and test sets without the $X$-coordinates.\n4:   Let $\\hat{g} = \\arg\\min_{g \\in G} \\hat{L}(g, D_r)$, where $\\hat{L}(g, D_r) := \\frac{1}{|D_r|} \\sum_{(u, \\ell) \\in D_r} 1\\{g(u) \\neq \\ell\\}$. Compute test loss: $\\hat{L}(\\hat{g}, D_e)$.\n5:   Let $\\hat{g}' = \\arg\\min_{g \\in G} \\hat{L}(g, D_r')$, where $\\hat{L}(g, D_r') := \\frac{1}{|D_r'|} \\sum_{(u, \\ell) \\in D_r'} 1\\{g(u) \\neq \\ell\\}$. Compute test loss: $\\hat{L}(\\hat{g}', D_e')$.\n6:   If $\\hat{L}(\\hat{g}, D_e) < \\hat{L}(\\hat{g}', D_e') - \\tau$, then conclude $X \\not\\independent Y \\vert Z$, otherwise, conclude $X \\independent Y \\vert Z$.\n7: end function\n\n
3 Theoretical Results\n\nIn this section, we provide our main theoretical results. We first show that the distribution of any one of the samples generated in Algorithm 1 closely resembles that of a sample coming from $f^{CI}$. This result holds for a broad class of distributions $f_{X,Y,Z}(x, y, z)$ which satisfy some smoothness assumptions. However, the samples generated by Algorithm 1 ($U'$ in the algorithm) are not exactly i.i.d. but close to i.i.d. We quantify this and go on to show that empirical risk minimization over a class of classifier functions generalizes well using these samples. Before we formally state our results, we provide some useful definitions.\nDefinition 1. The total variational distance between two continuous probability distributions $f(\\cdot)$ and $g(\\cdot)$ defined over a domain $\\mathcal{X}$ is $d_{TV}(f, g) = \\sup_{p \\in B} |E_f[p(X)] - E_g[p(X)]|$, where $B$ is the set of all measurable functions from $\\mathcal{X}$ to $[0, 1]$. Here, $E_f[\\cdot]$ denotes expectation under distribution $f$.\nWe first prove that the distribution of any one of the samples generated in Algorithm 1 is close to $f^{CI}$ in terms of total variational distance. We make the following assumptions on the joint distribution of the original samples, i.e. $f_{X,Y,Z}(x, y, z)$:\nSmoothness assumption on $f(y|z)$: We assume a smoothness condition on $f(y|z)$ that is a generalization of boundedness of the max. eigenvalue of the Fisher Information matrix of $y$ w.r.t. $z$.\nAssumption 1. For $z \\in \\mathbb{R}^{d_z}$ and $a$ such that $\\|a - z\\|_2 \\le \\epsilon_1$, the generalized curvature matrix $I_a(z)$ is\n\n$$I_a(z)_{ij} = \\left. \\left( \\frac{\\partial^2}{\\partial z_i' \\partial z_j'} \\int \\log \\frac{f(y|z)}{f(y|z')} f(y|z) \\, dy \\right) \\right|_{z' = a} = E\\left[ \\left. - \\frac{\\partial^2 \\log f(y|z')}{\\partial z_i' \\partial z_j'} \\right|_{z' = a} \\, \\middle| \\, Z = z \\right] \\quad (1)$$\n\nWe require that for all $z \\in \\mathbb{R}^{d_z}$ and all $a$ such that $\\|a - z\\|_2 \\le \\epsilon_1$, $\\lambda_{\\max}(I_a(z)) \\le \\beta$. Analogous assumptions have been made on the Hessian of the density in the context of entropy estimation [12].\n\n
Smoothness assumptions on $f(z)$: We assume some smoothness properties of the probability density function $f(z)$. The smoothness assumptions (in Assumption 2) are a subset of the assumptions made in [13] (Assumption 1, Page 5) for entropy estimation.\nDefinition 2. For any $\\delta > 0$, we define $G(\\delta) = P(f(Z) \\le \\delta)$. This is the probability mass of the distribution of $Z$ in the areas where the p.d.f. is less than $\\delta$.\nDefinition 3. (Hessian Matrix) Let $H_f(z)$ denote the Hessian matrix of the p.d.f. $f(z)$ with respect to $z$, i.e. $H_f(z)_{ij} = \\partial^2 f(z) / \\partial z_i \\partial z_j$, provided it is twice continuously differentiable at $z$.\nAssumption 2. The probability density function $f(z)$ satisfies the following:\n(1) $f(z)$ is twice continuously differentiable and the Hessian matrix $H_f$ satisfies $\\|H_f(z)\\|_2 \\le c_{d_z}$ almost everywhere, where $c_{d_z}$ is only dependent on the dimension.\n(2) $\\int f(z)^{1 - 1/d} \\, dz \\le c_3$, $\\forall d \\ge 2$, where $c_3$ is a constant.\nTheorem 1. Let $(X, Y', Z)$ denote a sample in $U_2'$ produced by Algorithm 1 by modifying the original sample $(X, Y, Z)$ in $U_1$, when supplied with $2n$ i.i.d. samples from the original joint distribution $f_{X,Y,Z}(x, y, z)$. Let $\\phi_{X,Y,Z}(x, y, z)$ be the distribution of $(X, Y', Z)$. Under smoothness assumptions (1) and (2), for any $\\epsilon < \\epsilon_1$ and $n$ large enough, we have:\n\n$$d_{TV}(\\phi, f^{CI}) \\le b(n) \\triangleq \\frac{1}{2} \\sqrt{ \\frac{\\beta}{4} \\frac{c_3 \\cdot 2^{1/d_z} \\Gamma(1/d_z)}{(n \\gamma_{d_z})^{1/d_z} d_z} + \\frac{\\beta \\epsilon G(2 c_{d_z} \\epsilon^2)}{4} } + \\exp\\left( -\\frac{1}{2} n \\gamma_{d_z} c_{d_z} \\epsilon^{d_z + 2} \\right) + G\\left( 2 c_{d_z} \\epsilon^2 \\right).$$\n\nHere, $\\gamma_d$ is the volume of the unit radius $\\ell_2$ ball in $\\mathbb{R}^d$.\nTheorem 1 characterizes the variational distance of the distribution of a sample generated in Algorithm 1 with that of the conditionally independent distribution $f^{CI}$. We defer the proof of Theorem 1 to Appendix A. 
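
The bootstrap analyzed in Theorem 1 (Algorithm 1) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `nn_bootstrap`, the sample layout (x, y, z) with z a list of floats, the toy data generator and the brute-force neighbor search are all assumptions made here for illustration; a practical version would rely on a fast nearest-neighbor method of the kind cited in [24].

```python
import random


def sq_dist(a, b):
    # Squared Euclidean (l2) distance between two equal-length coordinate lists.
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nn_bootstrap(u1, u2):
    # u1, u2: lists of samples (x, y, z), where z is a list of floats.
    # For each sample in u1, keep x and z but take y from the sample in u2
    # whose z is the 1-nearest neighbor of this sample's z.
    out = []
    for x, y, z in u1:
        _, y_nn, _ = min(u2, key=lambda v: sq_dist(v[2], z))
        out.append((x, y_nn, z))
    return out


def make_sample(rng):
    # Hypothetical toy data for the demo: y depends on z only.
    z = [rng.gauss(0.0, 1.0)]
    x = rng.gauss(0.0, 1.0)
    y = z[0] + rng.gauss(0.0, 0.1)
    return (x, y, z)


rng = random.Random(0)
u1 = [make_sample(rng) for _ in range(50)]
u2 = [make_sample(rng) for _ in range(50)]
u_prime = nn_bootstrap(u1, u2)
```

The brute-force search costs O(n^2) distance evaluations, which is acceptable for a demonstration but not for large n.
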
Now, our goal is to characterize the misclassification error of the trained classifier in Algorithm 2 under both $H_0$ and $H_1$. Consider the distribution of the samples in the data-set $D_r$ used for classification in Algorithm 2. Let $q(x, y, z | \\ell = 1)$ be the marginal distribution of each sample with label 1. Similarly, let $q(x, y, z | \\ell = 0)$ denote the marginal distribution of the label 0 samples. Note that under our construction,\n\n$$q(x, y, z | \\ell = 1) = f_{X,Y,Z}(x, y, z) \\; \\begin{cases} = f^{CI}(x, y, z) & \\text{if } H_0 \\text{ holds} \\\\ \\neq f^{CI}(x, y, z) & \\text{if } H_1 \\text{ holds} \\end{cases} \\qquad q(x, y, z | \\ell = 0) = \\phi_{X,Y,Z}(x, y, z) \\quad (2)$$\n\nwhere $\\phi_{X,Y,Z}(x, y, z)$ is as defined in Theorem 1.\nNote that even though the marginal of each sample with label 0 is $\\phi_{X,Y,Z}(x, y, z)$ (Equation (2)), they are not exactly i.i.d. owing to the nearest neighbor bootstrap. We will go on to show that they are actually close to i.i.d. and therefore classification risk minimization generalizes similar to the i.i.d. results for classification [4]. First, we review standard definitions and results from classification theory [4].\nIdeal Classification Setting: We consider an ideal classification scenario for CI testing and in the process define standard quantities in learning theory. Recall that $G$ is the set of classifiers under consideration. Let $\\tilde{q}$ be our ideal distribution for $q$ given by $\\tilde{q}(x, y, z | \\ell = 1) = f_{X,Y,Z}(x, y, z)$, $\\tilde{q}(x, y, z | \\ell = 0) = f^{CI}_{X,Y,Z}(x, y, z)$ and $\\tilde{q}(\\ell = 1) = \\tilde{q}(\\ell = 0) = 0.5$. In other words this is the ideal classification scenario for testing CI. Let $L(g(u), \\ell)$ be our loss function for a classifying function $g \\in G$, for a sample $u \\triangleq (x, y, z)$ with true label $\\ell$. In our algorithms the loss function is the 0-1 loss, but our results hold for any bounded loss function s.t. $|L(g(u), \\ell)| \\le |L|$. For a distribution $\\tilde{q}$ and a classifier $g$ let $R_{\\tilde{q}}(g) \\triangleq E_{u, \\ell \\sim \\tilde{q}}[L(g(u), \\ell)]$ be the expected risk of the function $g$. 
The risk optimal classifier $g^*_{\\tilde{q}}$ under $\\tilde{q}$ is given by $g^*_{\\tilde{q}} \\triangleq \\arg\\min_{g \\in G} R_{\\tilde{q}}(g)$. Similarly for a set of samples $S$ and a classifier $g$, let $R_S(g) \\triangleq \\frac{1}{|S|} \\sum_{(u, \\ell) \\in S} L(g(u), \\ell)$ be the empirical risk on the set of samples. We define $g_S$ as the classifier that minimizes the empirical loss on the observed set of samples $S$, that is, $g_S \\triangleq \\arg\\min_{g \\in G} R_S(g)$.\nIf the samples in $S$ are generated independently from $\\tilde{q}$, then standard results from learning theory state that with probability $\\ge 1 - \\delta$,\n\n$$R_{\\tilde{q}}(g_S) \\le R_{\\tilde{q}}(g^*_{\\tilde{q}}) + C \\sqrt{\\frac{V}{n}} + \\sqrt{\\frac{2 \\log(1/\\delta)}{n}}, \\quad (3)$$\n\nwhere $V$ is the VC dimension [30] of the classification model, $C$ is a universal constant and $n = |S|$.\nGuarantees under near-independent samples: Our goal is to prove a result like (3) for the classification problem in Algorithm 2. However, in this case we do not have access to i.i.d. samples because the samples in $U_2'$ do not remain independent. We will see that they are close to independent in some sense. This brings us to one of our main results in Theorem 2.\nTheorem 2. Assume that the joint distribution $f(x, y, z)$ satisfies the conditions in Theorem 1. Further assume that $f(z)$ has a bounded Lipschitz constant. Consider the classifier $\\hat{g}$ in Algorithm 2 trained on the set $D_r$. Let $S = D_r$. Then according to our definition $g_S = \\hat{g}$. For $\\epsilon > 0$ we have:\n(i) $$R_q(g_S) - R_q(g^*_q) \\le \\gamma_n \\triangleq C|L| \\left( \\left( \\sqrt{V} + \\sqrt{\\log \\frac{1}{\\delta}} \\right) \\left( \\left( \\frac{\\log(n/\\delta)}{n} \\right)^{1/3} + \\sqrt{\\frac{4^{d_z} \\log(n/\\delta) + o_n(1/\\epsilon)}{n}} \\right) + G(\\epsilon) \\right),$$\nwith probability at least $1 - 8\\delta$. Here $V$ is the V.C. dimension of the classification function class, $G$ is as defined in Def. 2, $C$ is a universal constant and $|L|$ is the bound on the absolute value of the loss.\n(ii) Suppose the loss is $L(g(u), \\ell) = 1_{g(u) \\neq \\ell}$ (s.t. $|L| \\le 1$). Further suppose the class of classifying functions is such that $R_q(g^*_q) \\le r_0 + \\eta$. 
Here, $r_0 \\triangleq 0.5(1 - d_{TV}(q(x, y, z | 1), q(x, y, z | 0)))$ is the risk of the Bayes optimal classifier when $q(\\ell = 1) = q(\\ell = 0)$. This is the best loss that any classifier can achieve for this classification problem [4]. Under this setting, w.p. at least $1 - 8\\delta$ we have:\n\n$$\\frac{1}{2}\\left(1 - d_{TV}(f, f^{CI})\\right) - \\frac{b(n)}{2} \\le R_q(g_S) \\le \\frac{1}{2}\\left(1 - d_{TV}(f, f^{CI})\\right) + \\frac{b(n)}{2} + \\eta + \\gamma_n$$\n\nwhere $b(n)$ is as defined in Theorem 1.\n\nWe prove Theorem 2 as Theorem 3 and Theorem 4 in the appendix. In part (i) of the theorem we prove that generalization bounds hold even when the samples are not exactly i.i.d. Intuitively, consider two sample inputs $u_i, u_j \\in U_1$, such that the corresponding $Z$ coordinates $z_i$ and $z_j$ are far away. Then we expect the resulting samples $u_i'$ and $u_j'$ (in $U_2'$) to be nearly-independent. By carefully capturing this notion of spatial near-independence, we prove generalization errors in Theorem 3. Part (ii) of the theorem essentially implies that the error of the trained classifier will be close to 0.5 (l.h.s.) when $f \\sim f^{CI}$ (under $H_0$). On the other hand under $H_1$, if $d_{TV}(f, f^{CI}) > 1 - \\gamma$, the error will be less than $0.5(\\gamma + b(n)) + \\eta + \\gamma_n$, which is small.\n\n4 Empirical Results\n\nIn this section we provide empirical results comparing our proposed algorithm and other state of the art algorithms. The algorithms under comparison are: (i) CCIT - Algorithm 3 in our paper where we use XGBoost [6] as the classifier. In our experiments, for each data-set we bootstrap the samples and run our algorithm $B$ times. The results are averaged over $B$ bootstrap runs (footnote 1). (ii) KCIT - Kernel CI test from [32]. We use the Matlab code available online. 
(iii) RCIT - Randomized CI Test from [28]. We use the R package that is publicly available.\n\n1 The python package for our implementation can be found here (https://github.com/rajatsen91/CCIT).\n\n4.1 Synthetic Experiments\nWe perform the synthetic experiments in the regime of post-nonlinear noise similar to [32]. In our experiments $X$ and $Y$ are of dimension 1, and the dimension of $Z$ scales (motivated by causal settings and also used in [32, 28]). $X$ and $Y$ are generated according to the relation $G(F(Z) + \\eta)$ where $\\eta$ is a noise term and $G$ is a non-linear function, when $H_0$ holds. In our experiments, the data is generated as follows: (i) when $X \\independent Y \\vert Z$, each coordinate of $Z$ is a Gaussian with unit mean and variance, $X = \\cos(a^T Z + \\eta_1)$ and $Y = \\cos(b^T Z + \\eta_2)$. Here, $a, b \\in \\mathbb{R}^{d_z}$ and $\\|a\\| = \\|b\\| = 1$. $a, b$ are fixed while generating a single dataset. $\\eta_1$ and $\\eta_2$ are zero-mean Gaussian noise variables, which are independent of everything else. We set $Var(\\eta_1) = Var(\\eta_2) = 0.25$. (ii) when $X \\not\\independent Y \\vert Z$, everything is identical to (i) except that $Y = \\cos(b^T Z + cX + \\eta_2)$ for a randomly chosen constant $c \\in [0, 2]$.\nIn Fig. 2a, we plot the performance of the algorithms when the dimension of $Z$ scales. For generating each point in the plot, 300 data-sets were generated with the appropriate dimensions. Half of them are according to $H_0$ and the other half are from $H_1$. Then each of the algorithms is run on these data-sets, and the ROC AUC (Area Under the Receiver Operating Characteristic curve) score is calculated from the true labels (CI or not CI) for each data-set and the predicted scores. We observe that the accuracy of CCIT is close to 1 for dimensions up to 70, while all the other algorithms do not scale as well. 
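
The data-generation recipe in cases (i) and (ii) above can be sketched as follows. This is one illustrative reading of the description, not the authors' code: the function name `gen_dataset`, obtaining the unit vectors a and b by normalizing a Gaussian draw, and drawing c uniformly from [0, 2] are assumptions.

```python
import math
import random


def unit_vector(rng, d):
    # Random direction with Euclidean norm 1 (assumed way of choosing a and b).
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def gen_dataset(n, dz, ci, seed=0):
    # Returns n samples (x, y, z); ci=True follows case (i), ci=False case (ii).
    rng = random.Random(seed)
    a, b = unit_vector(rng, dz), unit_vector(rng, dz)  # fixed per dataset
    c = 0.0 if ci else rng.uniform(0.0, 2.0)           # coupling of Y to X
    sd = math.sqrt(0.25)                               # Var(eta_1) = Var(eta_2) = 0.25
    data = []
    for _ in range(n):
        z = [rng.gauss(1.0, 1.0) for _ in range(dz)]   # unit mean, unit variance
        x = math.cos(sum(ai * zi for ai, zi in zip(a, z)) + rng.gauss(0.0, sd))
        y = math.cos(sum(bi * zi for bi, zi in zip(b, z)) + c * x + rng.gauss(0.0, sd))
        data.append((x, y, z))
    return data


null_data = gen_dataset(n=1000, dz=5, ci=True)
alt_data = gen_dataset(n=1000, dz=5, ci=False, seed=1)
```

With ci=True the coupling constant c is zero, so Y depends on X only through Z; with ci=False the term c * x creates the direct dependence of case (ii).
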
In these experiments the number of bootstraps per data-set for CCIT was set to B = 50.\nWe set the threshold in Algorithm 3 to \u2327 = 1/pn, which is an upper-bound on the expected variance\nof the test-statistic when H0 holds.\n4.2 Flow-Cytometry Dataset\nWe use our CI testing algorithm to verify CI relations in the protein network data from the \ufb02ow-\ncytometry dataset [26], which gives expression levels of 11 proteins under various experimental\nconditions. The ground truth causal graph is not known with absolute certainty in this data-set,\nhowever this dataset has been widely used in the causal structure learning literature. We take three\npopular learned causal structures that are recovered by causal discovery algorithms, and we verify\nCI relations assuming these graphs to be the ground truth. The three graph are: (i) consensus graph\nfrom [26] (Fig. 1(a) in [22]) (ii) reconstructed graph by Sachs et al. [26] (Fig. 1(b) in [22]) (iii)\nreconstructed graph in [22] (Fig. 1(c) in [22]).\nFor each graph we generate CI relations as follows: for each node X in the graph, identify the set Z\nconsisting of its parents, children and parents of children in the causal graph. Conditioned on this\nset Z, X is independent of every other node Y in the graph (apart from the ones in Z). We use this\nto create all CI conditions of these types from each of the three graphs. In this process we generate\nover 60 CI relations for each of the graphs. In order to evaluate false positives of our algorithms, we\nalso need relations such that X 6?? Y |Z. For, this we observe that if there is an edge between two\nnodes, they are never CI given any other conditioning set. For each graph we generate 50 such non-CI\nrelations, where an edge X $ Y is selected at random and a conditioning set of size 3 is randomly\nselected from the remaining nodes. We construct 50 such negative examples for each graph. In Fig. 
2,\nwe display the performance of all three algorithms based on considering each of the three graphs\nas ground truth. The algorithms are given access to observational data for verifying CI and non-CI\nrelations. In Fig. 2b we display the ROC plot for all three algorithms for the data-set generated by\nconsidering graph (ii). In Table 2c we display the ROC AUC score for the algorithms for the three\ngraphs. It can be seen that our algorithm outperforms the others in all three cases, even when the\ndimensionality of Z is fairly low (less than 10 in all cases). An interesting thing to note is that the\nedges (pkc-raf), (pkc-mek) and (pka-p38) are present in all three graphs. However, all three CI\ntesters CCIT, KCIT and RCIT are fairly confident that these edges should be absent. These edges\nmay be discrepancies in the ground-truth graphs and therefore the ROC AUC of the algorithms is\nlower than expected.\n\n(a)\n\n(b)\n\nAlgo.  Graph (i)  Graph (ii)  Graph (iii)\nCCIT   0.6848     0.7156      0.7778\nRCIT   0.6448     0.6928      0.7168\nKCIT   0.6528     0.6610      0.7416\n(c)\n\nFigure 2: In (a) we plot the performance of CCIT, KCIT and RCIT in the post-nonlinear noise synthetic data.\nIn generating each point in the plots, 300 data-sets are generated where half of them are according to H0 while\nthe rest are according to H1. The algorithms are run on each of them, and the ROC AUC score is plotted. In (a)\nthe number of samples n = 1000, while the dimension of Z varies. In (b) we plot the ROC curve for all three\nalgorithms based on the data from Graph (ii) for the flow-cytometry dataset. The ROC AUC scores for each of\nthe algorithms are provided in (c), considering each of the three graphs as ground truth.\n\n5 Conclusion\n\nIn this paper we present a model-powered approach for CI testing by converting it into binary classification,\nthus empowering CI testing with powerful supervised learning tools like gradient boosted\ntrees. 
We provide an efficient nearest-neighbor bootstrap which makes the reduction to classification\npossible. We provide theoretical guarantees on the bootstrapped samples, and also risk generalization\nbounds for our classification problem, under non-i.i.d. near-independent samples. In conclusion, we\nbelieve that model-driven, data-dependent approaches can be extremely useful in general statistical\ntesting and estimation problems as they enable us to use powerful supervised learning tools.\n\nAcknowledgments\nThis work is partially supported by NSF grants CNS 1320175, NSF SaTC 1704778, ARO grants\nW911NF-17-1-0359, W911NF-16-1-0377 and the US DoT supported D-STOP Tier 1 University\nTransportation Center.\n\nReferences\n[1] Maria-Florina Balcan, Nikhil Bansal, Alina Beygelzimer, Don Coppersmith, John Langford,\nand Gregory Sorkin. Robust reductions from ranking to classification. Learning Theory, pages\n604\u2013619, 2007.\n\n[2] Alina Beygelzimer, John Langford, Yuri Lifshits, Gregory Sorkin, and Alex Strehl. Conditional\nprobability tree estimation analysis and algorithms. In Proceedings of the Twenty-Fifth\nConference on Uncertainty in Artificial Intelligence, pages 51\u201358. AUAI Press, 2009.\n\n[3] Karsten M Borgwardt, Arthur Gretton, Malte J Rasch, Hans-Peter Kriegel, Bernhard Sch\u00f6lkopf,\nand Alex J Smola. Integrating structured biological data by kernel maximum mean discrepancy.\nBioinformatics, 22(14):e49\u2013e57, 2006.\n\n[4] St\u00e9phane Boucheron, Olivier Bousquet, and G\u00e1bor Lugosi. Theory of classification: A survey\nof some recent advances. ESAIM: Probability and Statistics, 9:323\u2013375, 2005.\n\n[5] Eliot Brenner and David Sontag. SparsityBoost: A new scoring function for learning Bayesian\nnetwork structure. arXiv preprint arXiv:1309.6820, 2013.\n\n[6] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. 
In Proceedings of\nthe 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,\npages 785\u2013794. ACM, 2016.\n\n[7] Jie Cheng, David Bell, and Weiru Liu. Learning Bayesian networks from data: An efficient\napproach based on information theory. On World Wide Web at\nhttp://www.cs.ualberta.ca/~jcheng/bnpc.htm, 1998.\n\n[8] Hal Daum\u00e9, John Langford, and Daniel Marcu. Search-based structured prediction. Machine\nLearning, 75(3):297\u2013325, 2009.\n\n[9] Luis M De Campos and Juan F Huete. A new approach for learning belief networks using\nindependence criteria. International Journal of Approximate Reasoning, 24(1):11\u201337, 2000.\n\n[10] Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Sch\u00f6lkopf. A permutation-based\nkernel conditional independence test. In UAI, pages 132\u2013141, 2014.\n\n[11] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised\nlearning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research,\n5(Jan):73\u201399, 2004.\n\n[12] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometrical\nadaptive entropy estimation. In Advances in Neural Information Processing Systems, pages\n2460\u20132468, 2016.\n\n[13] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor\ninformation estimators. arXiv preprint arXiv:1604.03006, 2016.\n\n[14] Markus Kalisch and Peter B\u00fchlmann. Estimating high-dimensional directed acyclic graphs with\nthe PC-algorithm. Journal of Machine Learning Research, 8(Mar):613\u2013636, 2007.\n\n[15] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques.\nMIT Press, 2009.\n\n[16] Daphne Koller and Mehran Sahami. Toward optimal feature selection. 
Technical report,\nStanford InfoLab, 1996.\n\n[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep\nconvolutional neural networks. In Advances in Neural Information Processing Systems, pages\n1097\u20131105, 2012.\n\n[18] John Langford and Bianca Zadrozny. Reducing t-step reinforcement learning to classification.\nIn Proc. of the Machine Learning Reductions Workshop, 2003.\n\n[19] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint\narXiv:1610.06545, 2016.\n\n[20] Colin McDiarmid. On the method of bounded differences. Surveys in Combinatorics,\n141(1):148\u2013188, 1989.\n\n[21] Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-iid processes.\nIn Advances in Neural Information Processing Systems, pages 1097\u20131104, 2009.\n\n[22] Joris Mooij and Tom Heskes. Cyclic causal discovery from continuous equilibrium data. arXiv\npreprint arXiv:1309.6849, 2013.\n\n[23] Judea Pearl. Causality. Cambridge University Press, 2009.\n\n[24] V Ramasubramanian and Kuldip K Paliwal. Fast k-dimensional tree algorithms for nearest\nneighbor search with application to vector quantization encoding. IEEE Transactions on Signal\nProcessing, 40(3):518\u2013531, 1992.\n\n[25] Bero Roos. On the rate of multivariate Poisson convergence. Journal of Multivariate Analysis,\n69(1):120\u2013134, 1999.\n\n[26] Karen Sachs, Omar Perez, Dana Pe\u2019er, Douglas A Lauffenburger, and Garry P Nolan.\nCausal protein-signaling networks derived from multiparameter single-cell data. Science,\n308(5721):523\u2013529, 2005.\n\n[27] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT\nPress, 2000.\n\n[28] Eric V Strobl, Kun Zhang, and Shyam Visweswaran. Approximate kernel-based conditional\nindependence tests for fast non-parametric causal discovery. 
arXiv preprint arXiv:1702.03877,\n2017.\n\n[29] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing\nBayesian network structure learning algorithm. Machine Learning, 65(1):31\u201378, 2006.\n\n[30] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies\nof events to their probabilities. In Measures of Complexity, pages 11\u201330. Springer, 2015.\n\n[31] Eric P Xing, Michael I Jordan, Richard M Karp, et al. Feature selection for high-dimensional\ngenomic microarray data. In ICML, volume 1, pages 601\u2013608. Citeseer, 2001.\n\n[32] Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Sch\u00f6lkopf. Kernel-based conditional\nindependence test and application in causal discovery. arXiv preprint arXiv:1202.3775, 2012.\n", "award": [], "sourceid": 1707, "authors": [{"given_name": "Rajat", "family_name": "Sen", "institution": "University of Texas at Austin"}, {"given_name": "Ananda Theertha", "family_name": "Suresh", "institution": "Google"}, {"given_name": "Karthikeyan", "family_name": "Shanmugam", "institution": "IBM Research, NY"}, {"given_name": "Alexandros", "family_name": "Dimakis", "institution": "University of Texas, Austin"}, {"given_name": "Sanjay", "family_name": "Shakkottai", "institution": "The University of Texas at Austin"}]}