{"title": "Causal Inference via Kernel Deviance Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 6986, "page_last": 6994, "abstract": "Discovering the causal structure among a set of variables is a fundamental problem in many areas of science. In this paper, we propose Kernel Conditional Deviance for Causal Inference (KCDC), a fully nonparametric causal discovery method based on purely observational data. From a novel interpretation of the notion of asymmetry between cause and effect, we derive a corresponding asymmetry measure using the framework of reproducing kernel Hilbert spaces. Based on this, we propose three decision rules for causal discovery. We demonstrate the wide applicability and robustness of our method across a range of diverse synthetic datasets. Furthermore, we test our method on real-world time series data and the real-world benchmark dataset T\u00fcbingen Cause-Effect Pairs, where we outperform state-of-the-art approaches.", "full_text": "Causal Inference via Kernel Deviance Measures\n\nJovana Mitrovic\u2217\n\nDino Sejdinovic\n\nYee Whye Teh\u2217\n\nDepartment of Statistics, University of Oxford\n\n[mitrovic, dino.sejdinovic, y.w.teh]@stats.ox.ac.uk\n\nAbstract\n\nDiscovering the causal structure among a set of variables is a fundamental problem in many areas of science. In this paper, we propose Kernel Conditional Deviance for Causal Inference (KCDC), a fully nonparametric causal discovery method based on purely observational data. From a novel interpretation of the notion of asymmetry between cause and effect, we derive a corresponding asymmetry measure using the framework of reproducing kernel Hilbert spaces. Based on this, we propose three decision rules for causal discovery. We demonstrate the wide applicability and robustness of our method across a range of diverse synthetic datasets.
Furthermore, we test our method on real-world time series data and the real-world benchmark dataset T\u00fcbingen Cause-Effect Pairs, where we outperform state-of-the-art approaches.\n\n1 Introduction\n\nIn many areas of science, we strive to answer questions that are fundamentally causal in nature. For example, in medicine one is often interested in the genetic drivers of diseases, while in commerce one might want to identify the motives behind customers\u2019 purchasing behaviour. Furthermore, it is of the utmost importance to thoroughly understand the underlying causal structure of the data-generating process if we are to predict, with reasonable accuracy, the consequences of interventions or answer counterfactual questions about what would have happened had we acted differently. While most machine learning methods excel at prediction tasks by successfully inferring statistical dependencies, there are still many open questions when it comes to uncovering the causal dependencies between the variables driving the underlying data-generating process. Given the growing interest in using data to guide decisions in areas where interventional and counterfactual questions abound, causal discovery methods have attracted considerable research interest [9, 25, 13, 16].\n\nWhile causal inference is preferably performed on data coming from randomized controlled experiments, this kind of data is often not available due to a combination of ethical, technical and financial considerations. These real-world limitations have motivated research into inferring causal relationships from purely observational data. While methods that attempt to recover the causal structure by analyzing conditional independencies present in the data [20, 23] are mathematically well-founded, they are not robust to the choice of conditional independence testing methodology.
Another group of methods [9, 25, 14] postulates that there is some inherent asymmetry between cause and effect and proposes different asymmetry measures that form the basis for causal discovery. In order to facilitate causal inference, these approaches typically assume a particular functional form for the interaction between the variables and a particular noise structure, which limits their applicability. We intend our contribution to be a step towards a method that can deal with highly complex data-generating processes, relies on only observational data and whose inference can easily be extended without the need to develop novel, specifically tailored algorithms for each new model class.\n\nIn this work, we develop a fully nonparametric causal inference method to automatically discover causal relationships from purely observational data. In particular, our proposed method does not require any a priori assumptions on the functional form of the interaction between the variables or the noise structure.\n\n\u2217Now at DeepMind, UK.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFurthermore, we propose a novel interpretation of the notion of asymmetry between cause and effect [4]. Before we introduce our proposed interpretation, we motivate it with the following example. Let y = x^3 + x + \u03b5 with \u03b5 \u223c N(0, 1), where we consider the correct causal direction to be x \u2192 y. Figure 1 visualizes the conditional distributions p(y|x) and p(x|y) for different values of x and y, respectively.\n\nFigure 1: Conditional distributions p(y|x) and p(x|y) for different values of the conditioning variables x and y, respectively.
These represent the causal and anticausal direction, respectively.\n\nNote that the conditional distributions in the anticausal direction exhibit a larger structural variability across different values of the conditioning variable than the conditional distributions in the causal direction. It is important to note here that structural variability does not only refer to variability in the scale and location parameters, but should be understood more broadly as variability in the \u201cparametric\u201d form, e.g. differences in the number of modes and in higher order moments. If one thinks of conditional distributions as programs generating y from x and vice versa, we see that in the causal direction the structure of the program remains unchanged although different input arguments are provided. In the anticausal direction, the program requires structural modification across different values of the input in order to account for the differing behaviour of the conditional densities.\n\nMotivated by the above observation, we propose a novel interpretation of the notion of asymmetry between cause and effect in terms of the shortest description length, i.e. Kolmogorov complexity [8], of the data-generating process. Whereas previous work [11, 10, 4, 2] quantifies the asymmetry in terms of the Kolmogorov complexity of the factorization of the joint distribution, we propose to interpret the asymmetry based on the Kolmogorov complexity of the conditional distribution. Specifically, we propose that this asymmetry is realized by the Kolmogorov complexity of the mechanism in the causal direction being independent of the input value of the cause. On the other hand, in the anticausal direction, there will be a dependence between the shortest description length of the mechanism and the particular input value of the effect. This (in)dependence can be measured
This (in)dependence can be measured\nby looking at the variability of Kolmogorov complexities of the mechanism for particular of the input.\nUnfortunately, as computing the Kolmogorov complexity is an intractable problem, we resort to\nconditional distributions as approximations of the corresponding programs. Thus, we can infer the\ncausal direction by comparing the description length variability of conditional distributions across\ndifferent values of the conditioning variable with the causal direction being the less variable. In\nparticular, we propose three decision rules for causal inference based on this criterion. For measuring\nthis variability, we use the framework of reproducing kernel Hilbert spaces (RKHS). This allows\nus to represent conditional distributions in a compact, yet expressive way and ef\ufb01ciently capture\ntheir many nuanced aspects thus enabling more accurate causal inference. In particular, by way\nof the kernel trick, we can ef\ufb01ciently compute the variability of in\ufb01nite-dimensional objects using\n\ufb01nite-dimensional quantities that can be easily estimated from data. 
Using the RKHS framework makes our method readily applicable also in situations when trying to infer the causal direction between pairs of random variables taking values in structured or non-Euclidean domains on which a kernel can be defined.\n\nThe main contributions of this paper are:\n\n\u2022 an interpretation of the notion of asymmetry between cause and effect in terms of the independence of the description length of the mechanism on the value of the cause,\n\u2022 an approximation to the intractable description length in terms of conditional distributions,\n\u2022 a flexible asymmetry measure based on RKHS embeddings of conditional distributions,\n\u2022 a fully nonparametric method for causal inference that does not impose a priori any assumptions on the functional relationship between the variables or the noise structure.\n\n2 Related Work\n\nMost approaches to causal inference from purely observational data can be grouped into three categories. Constraint-based methods assume that the true causal structure can be represented with a directed acyclic graph (DAG) G, which they try to infer by analyzing conditional independencies present in the observational data distribution P. Under some technical assumptions [17], these methods can determine G only up to its Markov equivalence class2, which usually contains structurally very diverse DAGs and still has many unoriented edges. Examples of this methodology include [23, 26], which rely on kernel-based conditional independence criteria, and the PC algorithm [20], which builds a graph skeleton by successively removing unnecessary connections between the variables and then orienting the remaining edges if possible. Although mathematically well-founded, the performance of these methods is highly dependent on the utilized conditional independence methodology, whose performance usually depends on the amount of available data.
Furthermore, these methods are not robust, as small errors in building the graph skeleton (e.g. a missing independence relation) can lead to significant errors in the inferred Markov equivalence class. As conditional independence tests require at least three variables, they are not applicable in the two variable case.\n\nScore-based methods search the space of all DAGs of a certain size by scoring their fit to the observed data using a predefined score function. An example of this approach is Greedy Equivalent Search [3], which combines greedy search with the Bayesian information criterion. As the search space grows super-exponentially with the number of variables, these methods quickly become computationally intractable. One answer to this shortcoming is hybrid methods, which use constraint-based approaches to decrease the search space that can then be effectively explored with score-based methods, e.g. [24]. DAGs have also been represented using generative neural networks and scored according to how well the generated data matches the observed data, e.g. [7]. A major shortcoming of this hybrid methodology is that there exists no principled way of choosing problem-specific combinations of scoring functions and search strategies; this is a significant problem, as different search strategies in combination with different scoring rules can potentially lead to very different results.\n\nThe third category of methods assumes that there exists some inherent asymmetry between cause and effect. So-called functional causal models or structural equation models assume a particular functional form for the causal interactions between the variables and a particular noise structure. In these models, each variable is a deterministic function of its causes and some independent noise, with all noise variables assumed to be jointly independent.
Examples of this methodology assume linearity and additive non-Gaussian noise [19], nonlinear additive noise [9, 14] and invertible interactions between the covariates and the noise [25]. In order to perform causal discovery in these models, the special structural assumptions placed on the interaction between the covariates and on the noise are of crucial importance, thus limiting their applicability. A second strand of research interprets the asymmetry between cause and effect through an information-theoretic lens by examining the complexity of the factorization of the joint distribution [11]. [10] argue that if X causes Y, then the factorization in the causal direction, i.e. p(X, Y) = p(Y|X)p(X), should have a shorter description in terms of the Kolmogorov complexity than the factorization in the anticausal direction, i.e. p(X, Y) = p(X|Y)p(Y). In [4], instead of computing the intractable Kolmogorov complexity, the correlation between the input and the conditional distribution is measured, whereas [2] use the minimum description length principle. The approach of [22] measures the complexity of conditional distributions by RKHS seminorms computed on the logarithms of their densities.\n\nLastly, causal discovery has also been framed as a learning problem. RCC [13] uses feature representations of the data based on RKHS embeddings of the joint and marginal distributions within a random forest classifier. In [5], the feature representation includes quantities describing the joint, marginal and conditional distributions. In particular, the conditional distributions are represented with conditional entropy, mutual information and a quantification of their variability in terms of the spread of the entropy, variance and skewness for different values of the conditioning variable.
This differs from our approach, where we base our causal inference method on a novel interpretation of the asymmetry between cause and effect and, based on this, derive three decision rules, one of which relies on classifying feature representations. In particular, we consider feature representations based only on conditional distributions, which we argue to be more discriminative for inferring the causal direction.\n\n2All DAGs that encode the same set of conditional independence relations constitute a Markov equivalence class.\n\n3 Kernel Conditional Deviance for Causal Inference\n\n3.1 Background\n\nLet (X, BX) and (Y, BY) be measurable spaces with BX and BY the associated Borel \u03c3-algebras. Denote by (HX, k) and (HY, l) the RKHSs of functions defined on X and Y, respectively, and their corresponding kernels. Given a probability distribution p on X, the mean embedding \u00b5p3 [18] is a representation of p in HX given by \u00b5p = Ep[k(\u00b7, X)] with X \u223c p. It can be unbiasedly estimated by \u02c6\u00b5p = (1/n) \u2211_{i=1}^n k(\u00b7, xi) with {xi}_{i=1}^n iid \u223c p. Furthermore, if k is a characteristic kernel [18], then this representation yields a metric on probability measures, i.e. ||\u00b5p \u2212 \u00b5q||_{Hk} = 0 \u21d4 p = q. A conditional distribution p(X|Y = y) can be encoded using the conditional mean embedding \u00b5_{X|Y=y} [18], which is an element of HX that satisfies E[h(X)|Y = y] = \u27e8h, \u00b5_{X|Y=y}\u27e9_{HX} \u2200h \u2208 HX. Using the equivalence between conditional mean embeddings and vector-valued regressors [12], we can estimate \u00b5_{X|Y=y} from a sample {(xi, yi)}_{i=1}^n iid \u223c p(x, y) as \u02c6\u00b5_{X|Y=y} = \u2211_{i=1}^n \u03b1i(y) k(\u00b7, xi) with regularization parameter \u03bb and identity matrix I, where \u03b1(y) = (L + n\u03bbI)^{\u22121} ly, L = [l(yi, yj)]_{i,j=1}^n, ly = [l(y1, y), . . . , l(yn, y)]^T and \u03b1(\u00b7) = [\u03b11(\u00b7), . . . , \u03b1n(\u00b7)]^T. For a more detailed discussion, see [18].\n\n3.2 Method\n\nFor simplicity, we restrict our attention to the two variable problem of causal discovery, i.e. distinguishing between cause and effect. Possible extensions to the multivariable setting are discussed in Section 5. Following the usual approach in the literature, we derive our method under the assumption of causal sufficiency of the data. In particular, we ignore the potential existence of confounders, i.e. all causal conclusions should be understood with respect to the set of observed variables. Nevertheless, in Section 4, we see that our method also performs well in settings where the noise has positive mean, which can be interpreted as accounting for potential confounders.\n\nGiven observations {(xi, yi)}_{i=1}^n of a pair of random variables (X, Y), our goal is to infer the causal direction, i.e. decide whether X causes Y (i.e. X \u2192 Y) or Y causes X (i.e. Y \u2192 X). To this end, we develop a fully nonparametric causal discovery method that relies only on observational data. In particular, our method does not a priori postulate a particular functional model for the interactions between the variables or a particular noise structure. Our approach, Kernel Conditional Deviance for Causal Inference (KCDC), is based on the assumption that there exists an asymmetry between cause and effect that is inherent in the data-generating process. While there are many interpretations of how this asymmetry might be realized, two of the more prominent ideas phrase it in terms of the independence of cause and mechanism [4] and in terms of the complexity of the factorization of the joint distribution [11, 10].\n\nMotivated by these two ideas, we propose a novel interpretation of the notion of asymmetry between cause and effect.
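As a hedged sketch of the background machinery (our own illustrative code, with an RBF kernel and an arbitrary regularization value rather than the paper's reported settings), the RKHS norms of the estimated conditional mean embeddings follow directly from alpha(x) = (K + n*lambda*I)^{-1} k_x:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between 1-d samples a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def cme_norms(x, y, lam=1e-3, sigma=1.0):
    """RKHS norms ||mu_hat_{Y|X=x_i}||_{H_Y} at each observed input x_i.

    With alpha(x) = (K + n*lam*I)^{-1} k_x, the embedding is
    mu_hat_{Y|X=x} = sum_i alpha_i(x) l(., y_i), so its squared norm
    is alpha(x)^T L alpha(x) -- all finite-dimensional quantities.
    """
    n = len(x)
    K = rbf(x, x, sigma)                              # kernel on the input
    L = rbf(y, y, sigma)                              # kernel on the response
    A = np.linalg.solve(K + n * lam * np.eye(n), K)   # column j holds alpha(x_j)
    sq_norms = np.einsum("ij,ik,kj->j", A, L, A)      # alpha_j^T L alpha_j
    return np.sqrt(np.maximum(sq_norms, 0.0))

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x ** 3 + x + rng.normal(size=100)
norms = cme_norms(x, y)
```

The spread of these norms across the inputs x_i is exactly the quantity the method below turns into an asymmetry measure.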
First, we take an information-theoretic approach to reasoning about the complexity of distributions, similar to [11, 10]. In particular, we reason about it in terms of algorithmic complexity, i.e. Kolmogorov complexity [8], which is the description length of the shortest program that implements the sampling process of the distribution. For a distribution p(Y), the Kolmogorov complexity is\n\nK(p(Y)) = min_s {|s| : |U(s, y, q) \u2212 p(y)| \u2264 q \u2200y}\n\nwith q a precision parameter and U(\u00b7) extracting the output of applying program s to a realization y of the random variable Y. Analogously, for a conditional distribution p(Y|X), the Kolmogorov complexity is\n\nK(p(Y|X)) = min_s {|s| : |U(s, y, x, q) \u2212 p(y|x)| \u2264 q \u2200x, y}.\n\nAssuming X \u2192 Y, the asymmetry notion specified in terms of factorization complexity can be expressed as\n\nK(p(X)) + K(p(Y|X)) \u2264 K(p(Y)) + K(p(X|Y)),\n\nwhich holds up to an additive constant [21]. Further, the independence of cause and mechanism can be interpreted as algorithmic independence [10], i.e. knowing the distribution of the cause p(X) does not enable a shorter description of the mechanism p(Y|X).\n\n3\u00b5p and \u00b5X will be used interchangeably if it does not lead to confusion.\n\nBased on this, we argue that not only does knowing the distribution of the cause not enable a shorter description of the mechanism, but knowing any particular value of the cause also does not provide any information that can be used to construct a shorter description of the mechanism. To formalize this, we introduce the notation\n\nK(p(Y|X = x)) = min_s {|s| : |U(s, y, x, q) \u2212 p(y|X = x)| \u2264 q \u2200y}\n\nto denote the Kolmogorov complexity of the conditional distribution p(Y|X) when the conditioning variable takes on the value X = x.
From our argument above, we see that in the causal direction the Kolmogorov complexity of p(Y|X = x) is independent of the particular value x of the cause X, i.e.\n\nK(p(Y|X = xi)) = K(p(Y|X = xj)) \u2200i, j.\n\nOn the other hand, this will not hold in the anticausal direction, as the input and mechanism are not algorithmically independent in that direction, i.e.\n\nK(p(X|Y = yi)) \u2260 K(p(X|Y = yj)) \u2200i \u2260 j.\n\nThis motivates our interpretation of the notion of asymmetry between cause and effect, which is summarized as follows.\n\nPostulate. (Minimal description length independence)\nIf X \u2192 Y, the minimal description length of the mechanism mapping X to Y is independent of the value of X, whereas the minimal description length of the mechanism mapping Y to X is dependent on the value of Y.\n\nBuilding on this, we can infer the causal direction by comparing how much the description length of the minimal program implementing the mechanism varies across different values of its input arguments. In particular, in the causal direction, we expect to see less variability than in the anticausal direction. As computing the Kolmogorov complexity is an intractable problem, we use the norm of RKHS embeddings of the corresponding conditional distributions as a proxy for it. Thus, we recast causal inference in terms of comparing the variability in RKHS norm of embeddings of sets of conditional distributions indexed by values of the conditioning variable. In order to perform causal inference, we use the framework of reproducing kernel Hilbert spaces.
This allows us to construct highly expressive, yet compact approximations of the potentially highly complex programs and circumvent the challenges of density estimation when trying to represent conditional distributions. Furthermore, using the RKHS framework allows us to efficiently capture the many nuanced aspects of distributions, thus enabling more accurate causal inference. For example, using non-linear kernels allows us to capture more comprehensive distributional properties, including higher order moments. Furthermore, using the RKHS framework makes our method readily applicable in situations where we are trying to infer the causal direction between two random vectors (treated as single variables) or pairs of other types of random variables taking values in structured or non-Euclidean domains on which a kernel can be defined. Examples of such types of data include discrete data, genetic data, phylogenetic trees, strings, graphs and other structured data [6].\n\nWe represent conditional distributions in the RKHS using conditional mean embeddings [18]. In particular, given observations {(xi, yi)}_{i=1}^n of a pair of random variables (X, Y), we construct the embeddings of the two sets of conditional distributions, {p(X|Y = yi)}_{i=1}^n and {p(Y|X = xi)}_{i=1}^n. Furthermore, if we choose a characteristic kernel [18], the conditional mean embeddings of two distinct distributions will not coincide. Next, we compute the variability in RKHS norm of a set of conditional mean embeddings as the deviance of the RKHS norms of that set. Thus, using the KCDC
Thus, using the KCDC\nmeasure SX\u2192Y with\n\n(cid:32)(cid:13)(cid:13)\u00b5Y |X=xi\n\nn(cid:88)\n\ni=1\n\n(cid:13)(cid:13)HY \u2212 1\n\nn\n\nn(cid:88)\n\nj=1\n\n(cid:13)(cid:13)\u00b5Y |X=xj\n\n(cid:13)(cid:13)HY\n\n(cid:33)2\n\nSX\u2192Y =\n\n1\nn\n\n,\n\n(1)\n\nwe compute the deviance in RKHS norm of the set {p(Y |X = xi)}n\nanalogously compute the KCDC measure SY \u2192X.\nBased on our proposed interpretation of the notion of asymmetry between cause and effect, we can\ndetermine the causal direction between X and Y . Furthermore, we derive a con\ufb01dence measure\nT KCDC for the inferred causal direction as\nT KCDC =\n\ni=1. For {p(X|Y = yi)}n\n\ni=1, we\n\n.\n\n|SX\u2192Y \u2212 SY \u2192X|\nmin(SX\u2192Y , SY \u2192X )\n\n5\n\n\fTo determine the causal direction, we propose three decision rules. In particular, we can determine\nthe causal direction by directly comparing the KCDC measures for the two directions, i.e.\n\n(cid:26)X \u2192 Y,\n\nY \u2192 X,\n\nD1(X, Y ) =\n\nif SX\u2192Y < SY \u2192X\nif SX\u2192Y > SY \u2192X\n\nbut leave the causal direction undetermined if the KCDC measures are too close in value to determine\nthe causal direction, i.e. T KCDC < \u03b4 with \u03b4 some \ufb01xed decision threshold. This situation might come\nabout due to numerical errors or non-identi\ufb01ability. We can also determine the causal direction based\non majority voting of an ensemble constructed using different model hyperparameters, i.e.\n\nD2(X, Y ) = Majority({DHj\n\n1 (X, Y )}j)\n\nwhere the dependence on the model hyperparameters Hj has been made explicit. Lastly, the KCDC\nmeasures can also be used for constructing feature representations of the data which can then be used\nwithin a classi\ufb01cation method. In particular, we can infer the causal relationship between X and Y\nusing\n\nD3(X, Y ) = Classifier({SHj\n\nX\u2192Y , SHj\n\nY \u2192X}j)\n\nwhere Classifier is a classi\ufb01cation algorithm that classi\ufb01es X \u2192 Y against Y \u2192 X. 
For training\nthe classi\ufb01er, we generate synthetic data, e.g. as in [13]. Algorithms summarizing our causal inference\nmethodology are given in the supplementary material.\nIdenti\ufb01ability. In order to ensure the identi\ufb01ability of the model in KCDC, we need to ensure that\nwe apply the same kernel to the response variable when computing the KCDC measures SX\u2192Y\nand SY \u2192X. Speci\ufb01cally, we \ufb01x a kernel kr with some \ufb01xed hyperparameters and apply it both to\nY when computing SX\u2192Y and to X when computing SY \u2192X. This ensures that we are measuring\nthe variability in both directions in the same space, thus making them comparable. Furthermore,\nwe need to also apply the same \ufb01xed kernel kin (with some \ufb01xed hyperparameters) to the input\nvariable, i.e. to X forSX\u2192Y and to Y for SY \u2192X. This ensures that the set of possible functional\ndependencies between input and response is the same in both directions. In particular, due to the\nequivalence of conditional mean embeddings and vector-valued regressors [12], this is analogous\nto the requirement for constraining the assumed functional class in structural equation models in\norder to ensure identi\ufb01ability [25]. Given the postulate of minimal description length independence\nwhich is the basis for causal discovery in KCDC, the only case when the causal direction will not\nbe identi\ufb01able for KCDC is the situation where the description length of conditional distributions\nin both the causal and anticausal direction does not vary with the value of the cause and effect,\nrespectively. This happens when in both directions the functional form of the mechanism can be\ndescribed by one family of distributions for all its input arguments. One example of this is linear\nGaussian dependence which is non-identi\ufb01able for most other causal discovery methods too. 
Another example is the case of independent variables, which is usually not considered in the literature, but can be easily mitigated with an independence test. Note that using characteristic kernels eliminates any potential non-identifiability that might arise as a consequence of the non-injectivity of the embedding.\n\n4 Experimental Results\n\n4.1 Synthetic Data\n\nIn order to showcase the wide applicability and robustness of our proposed approach, we test it extensively on several synthetic datasets spanning a wide range of functional dependencies between cause and effect and different interaction patterns with different kinds of noise. Table 1 summarizes the different models used to generate synthetic data. In all of the below experiments, we sample 100 datasets of 100 observations each with x \u223c N(0, 1) and test three different noise regimes \u2013 either \u03b5 \u223c N(0, 1), \u03b5 \u223c U(0, 1) or \u03b5 \u223c Exp(1)4.\n\n4Note that the exponential noise has positive mean which can be interpreted as accounting for confounders.\n\nTable 1: Summary of the different functional models used for generating synthetic data.\n\n     Additive Noise                 Multiplicative Noise            Complex Noise\n(A)  y = x^3 + x + \u03b5               y = (x^3 + x)e^\u03b5                y = (log(x + 10) + x^2)\u03b5\n(B)  y = log(x + 10) + x^6 + \u03b5     y = (log(x + 10) + x^6)e^\u03b5      y = log(x + 10) + x^2|\u03b5|\n(C)  y = sin(10x) + e^{3x} + \u03b5     y = (sin(10x) + e^{3x})e^\u03b5      y = log(x^7 + 5) + x^5 \u2212 sin(x^2|\u03b5|)\n\nWe compare our approach to LiNGAM [19], IGCI [4], ANM [16] with Gaussian Process regression and HSIC test [18] on the residual, and the post-nonlinear model (PNL) [25] with HSIC test. In all experiments, we apply the decision rule based on direct comparison for KCDC. We tested across different combinations of characteristic kernels (radial basis function (RBF), log and rational quadratic kernels5), which yielded fairly consistent performance.
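For concreteness, the KCDC measure of Eq. (1) and decision rule D1 can be sketched as follows; this is a simplified illustration using an RBF kernel for both input and response and a hand-picked regularization value, not the exact kernel configuration used in the experiments:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def kcdc_score(x, y, lam=1e-3, sigma=1.0):
    """S_{X->Y} of Eq. (1): variance of ||mu_hat_{Y|X=x_i}||_{H_Y} over i."""
    n = len(x)
    K = rbf(x, x, sigma)                              # kernel on the input
    L = rbf(y, y, sigma)                              # kernel on the response
    A = np.linalg.solve(K + n * lam * np.eye(n), K)   # column j holds alpha(x_j)
    norms = np.sqrt(np.maximum(np.einsum("ij,ik,kj->j", A, L, A), 0.0))
    return float(norms.var())

def kcdc_d1(x, y, **kw):
    """Decision rule D1: the direction with the smaller deviance is causal."""
    s_xy, s_yx = kcdc_score(x, y, **kw), kcdc_score(y, x, **kw)
    confidence = abs(s_xy - s_yx) / min(s_xy, s_yx)   # T^KCDC
    return ("X->Y" if s_xy < s_yx else "Y->X"), confidence

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = x ** 3 + x + rng.normal(size=200)                 # model (A), additive noise
direction, t_kcdc = kcdc_d1(x, y)
```

Note that, per the identifiability discussion above, the same kernels must be applied to input and response in both directions; D2 and D3 then aggregate such scores across hyperparameter settings.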
We report the results when using the log kernel on the input and the rational quadratic kernel on the response. A table summarizing the results is given in the supplementary material. LiNGAM performs badly across most of the settings, which is to be expected given its assumption of linear dependence. ANM performs very well under additive noise across the different noise settings, but displays poor performance under complex and, especially, multiplicative noise due to its assumption of additive noise. On the other hand, PNL does not perform well under additive noise, which is probably due to overfitting. It performs slightly better under multiplicative noise, but does not surpass chance level in half the settings. Under complex noise, PNL, which assumes an invertible interaction between the covariates and noise, performs at or above chance level in almost all cases, with very good performance under periodic noise. In additive noise settings, IGCI performs well for model (C) and under exponential noise across all models, while in the multiplicative and complex noise settings, it shows excellent performance. Our proposed method achieves perfect performance in all settings of additive and multiplicative noise across all noise regimes. Under complex noise, it achieves perfect performance in all cases except under Gaussian and uniform noise for (A).\n\n4.2 T\u00fcbingen Cause-Effect Pairs\n\nNext, we discuss the performance of our method on real-world data. For this purpose, we test KCDC on the only widely used benchmark dataset, T\u00fcbingen Cause-Effect Pairs (TCEP) [15]. This dataset is comprised of real-world cause-effect samples that are collected across very diverse subject areas, with the true causal direction provided by human experts.
Due to the heterogeneous origins of the data pairs, many diverse functional dependencies are expected to be present in TCEP.\n\nIn order to show the flexibility and capacity of KCDC when dealing with many diverse functional dependencies simultaneously, we test it using both the direct comparison decision rule and the majority decision rule. We use TCEP version 1.0, which consists of 100 cause-effect pairs. Each pair is assigned a weight in order to account for potential sources of bias, given that different pairs are sometimes selected from the same multivariable dataset. Following the widespread approach in the literature of testing only on scalar-valued pairs, we remove the multivariate pairs 52, 53, 54, 55 and 71 from TCEP in order to ensure a fair comparison to previous work. Note that, contrary to some methods in the literature, this is not necessary for our approach. For the majority approach, we choose the best settings of the kernel hyperparameters as inferred from the synthetic experiments. The direct approach represents the single best performing hyperparameter configuration on TCEP.\n\nFrom the summary of classification accuracies of KCDC and related methods in Table 2, we see that KCDC is competitive with the state-of-the-art methods even when only one setting of kernel hyperparameters is used, i.e. when the direct comparison decision rule is used. When we combine multiple kernel hyperparameters under the majority vote approach, we see that our method outperforms other methods by a significant margin. Note that the review [16] discusses additive noise models [9] and information-geometric causal inference [4]. In particular, an extensive experimental evaluation of these methods across a wide range of hyperparameter settings is performed.
In the fourth row of Table 2, we report the most favourable outcome across both types of methods from their large-scale experimental analysis. For testing RCC on TCEP v1.0, we use the code provided in [13].

Table 2: Classification Accuracy on TCEP

ANM     PNL     RCC      Best from [16]   CGNN [7]   KCDC-D1   KCDC-D2
59.5%   66.2%   64.67%   74.4%            ≈ 74%      72.87%    78.71%

(Footnote 5: The RBF, log and rational quadratic kernels are defined as k(x, x') = exp(−‖x − x'‖² / (2σ²)) with bandwidth σ, k(x, x') = −log(‖x − x'‖² + 1), and k(x, x') = 1 − ‖x − x'‖² / (‖x − x'‖² + 1), respectively.)

4.3 Inferring the Arrow of Time

In addition to the many real-world pairs above, we also test our method on the task of inferring the direction of time on causal time series. Given a time series {X_i}_{i=1}^N, the task is to infer whether X_i → X_{i+1} or X_i ← X_{i+1}. We use a dataset containing quarterly growth rates of the real gross domestic product (GDP) of the UK, Canada and the USA from 1980 to 2011, as in [1]. The resulting multivariate time series has length 124 and dimension three. Guided by the above selection of hyperparameters on the synthetic datasets, we chose a wide range of hyperparameters on which to test KCDC. In particular, on both the response and the input we used either a log kernel k(x, x') = −log(‖x − x'‖^q + 1) with q ∈ {2, 3, 4} or an RBF kernel with bandwidth 1, 1.5 or 2 times the median heuristic. Across all of these hyperparameters, KCDC correctly identifies the causal direction, with the confidence measure T^KCDC, which measures the absolute relative difference between the KCDC measures, varying between 2.45 and 44565.6. We compare our approach to methods readily applicable to causal inference on multivariable time series.
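As an implementation aside, the kernels used throughout these experiments are straightforward to code. The following NumPy sketch (the function names are ours, not from the paper) spells out the RBF, log and rational quadratic kernels from footnote 5, together with the median heuristic that the RBF bandwidth multipliers 1, 1.5 and 2 above scale:

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    """RBF kernel exp(-||x - y||^2 / (2*sigma^2)) with bandwidth sigma."""
    d = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(y))
    return np.exp(-d**2 / (2.0 * sigma**2))

def log_kernel(x, y, q=2):
    """Log kernel -log(||x - y||^q + 1); q in {2, 3, 4} in the experiments above."""
    d = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(y))
    return -np.log(d**q + 1.0)

def rq_kernel(x, y):
    """Rational quadratic kernel 1 - ||x - y||^2 / (||x - y||^2 + 1)."""
    d2 = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(y))**2
    return 1.0 - d2 / (d2 + 1.0)

def median_heuristic(X):
    """Median of the pairwise Euclidean distances of the sample X,
    a standard default bandwidth for the RBF kernel."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=-1))
    return np.median(dists[np.triu_indices(len(X), k=1)])
```

This is only a reference sketch of the kernel formulas; the paper's own experiments may of course compute full Gram matrices rather than single kernel evaluations.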
In particular, LiNGAM does not identify the correct direction. On the other hand, the method developed in [1], which models the data as an autoregressive moving average model with non-Gaussian noise, correctly identifies the causal direction.

5 Extensions to the Multivariable Case

While we present and discuss our method for the case of pairs of variables, it can be extended to the setting of more than two variables. Assuming we have d variables with d ≥ 2, i.e. X = {X_1, . . . , X_d}, we can apply KCDC to every pair of variables {X_i, X_j} ⊆ X with i ≠ j while conditioning on all of the remaining variables in X \ {X_i, X_j}. This corresponds to inferring the causal relationship between X_i and X_j while accounting for the confounding effect of all the remaining variables.

Another way of dealing with the multivariable setting is to use KCDC in conjunction with, for example, the PC algorithm [20]. In particular, one would first apply the PC algorithm to the data. The resulting DAG skeleton, containing potentially many unoriented edges, can then be processed with KCDC. In particular, our method can be applied sequentially to every pair of variables that is connected by an unoriented edge while conditioning on the remaining variables in the DAG.

Yet another approach to the multivariable case is to use KCDC measures as features in a multiclass classification problem for d-dimensional distributions. However, as noted in [13], this approach quickly becomes rather cumbersome as the number of labels grows super-exponentially in the number of variables due to the rapid increase of the number of DAGs that can be constructed from d variables.

6 Conclusion

In this paper, we proposed a fully nonparametric causal inference method that uses purely observational data and does not postulate a priori assumptions on the functional relationship between the variables or the noise structure.
We proposed a novel interpretation of the notion of asymmetry between cause and effect in terms of the variability, across different values of the input, of the minimal description length of programs implementing the data-generating process of conditional distributions. In order to quantify the description length variability, we proposed a flexible measure in terms of the within-set deviance of the RKHS norms of conditional mean embeddings and presented three decision rules for causal inference based on direct comparison, ensembling and classification, respectively. We extensively tested our proposed method across a wide range of diverse synthetic datasets, showcasing its wide applicability and robustness. Furthermore, we tested our method on real-world time series data and the real-world benchmark dataset Tübingen Cause-Effect Pairs, where we outperformed existing state-of-the-art methods by a significant margin.

Acknowledgments

JM acknowledges the financial support of The Clarendon Fund of the University of Oxford. DS's and YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007–2013) ERC grant agreement no. 617071.

References

[1] S. Bauer, B. Schölkopf, and J. Peters. The arrow of time in multivariate time series. In International Conference on Machine Learning, pages 2043–2051, 2016.
[2] K. Budhathoki and J. Vreeken. Causal inference by stochastic complexity. arXiv:1702.06776, 2017.
[3] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3(Nov):507–554, 2002.
[4] P. Daniusis, D. Janzing, J. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Schölkopf. Inferring deterministic causal relations. arXiv preprint arXiv:1203.3475, 2012.
[5] J. A. Fonollosa.
Conditional distribution variability measures for causality detection. arXiv preprint arXiv:1601.06680, 2016.
[6] T. Gärtner, J. W. Lloyd, and P. A. Flach. Kernels for structured data. In International Conference on Inductive Logic Programming, pages 66–83. Springer, 2002.
[7] O. Goudet, D. Kalainathan, P. Caillou, D. Lopez-Paz, I. Guyon, M. Sebag, A. Tritas, and P. Tubaro. Learning functional causal models with generative neural networks. arXiv preprint arXiv:1709.05321, 2017.
[8] P. D. Grünwald and P. M. Vitányi. Algorithmic information theory. Handbook of the Philosophy of Information, pages 281–320, 2008.
[9] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems, pages 689–696, 2009.
[10] D. Janzing and B. Schölkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168–5194, 2010.
[11] J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems, 2006.
[12] G. Lever, L. Baldassarre, S. Patterson, A. Gretton, M. Pontil, and S. Grünewälder. Conditional mean embeddings as regressors. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 1823–1830, 2012.
[13] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In International Conference on Machine Learning, pages 1452–1461, 2015.
[14] J. Mooij, D. Janzing, J. Peters, and B. Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 745–752. ACM, 2009.
[15] J. M. Mooij, D. Janzing, J. Zscheischler, and B. Schölkopf. Cause-effect pairs repository.
2015. http://webdav.tuebingen.mpg.de/cause-effect/.
[16] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. Journal of Machine Learning Research, 17(32):1–102, 2016.
[17] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[18] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[19] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(Oct):2003–2030, 2006.
[20] P. Spirtes, C. Glymour, R. Scheines, et al. Causation, Prediction, and Search. MIT Press, 2000.
[21] O. Stegle, D. Janzing, K. Zhang, J. M. Mooij, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems, pages 1687–1695, 2010.
[22] X. Sun, D. Janzing, and B. Schölkopf. Distinguishing between cause and effect via kernel-based complexity measures for conditional distributions. In ESANN, pages 441–446, 2007.
[23] X. Sun, D. Janzing, B. Schölkopf, and K. Fukumizu. A kernel-based causal learning algorithm. In Proceedings of the 24th International Conference on Machine Learning, pages 855–862. ACM, 2007.
[24] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.
[25] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 647–655. AUAI Press, 2009.
[26] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf.
Kernel-based conditional independence test and application in causal discovery. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2011.