{"title": "Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions", "book": "Advances in Neural Information Processing Systems", "page_first": 10846, "page_last": 10856, "abstract": "An important goal common to domain adaptation and causal inference is to make accurate predictions when the distributions for the source (or training) domain(s) and target (or test) domain(s) differ. In many cases, these different distributions can be modeled as different contexts of a single underlying system, in which each distribution corresponds to a different perturbation of the system, or in causal terms, an intervention. We focus on a class of such causal domain adaptation problems, where data for one or more source domains are given, and the task is to predict the distribution of a certain target variable from measurements of other variables in one or more target domains. We propose an approach for solving these problems that exploits causal inference and does not rely on prior knowledge of the causal graph, the type of interventions or the intervention targets. We demonstrate our approach by evaluating a possible implementation on simulated and real world data.", "full_text": "Domain Adaptation by Using Causal Inference to\n\nPredict Invariant Conditional Distributions\n\nSara Magliacane\n\nMIT-IBM Watson AI Lab, IBM Research\u2217\n\nsara.magliacane@gmail.com\n\nThijs van Ommen\n\nUniversity of Amsterdam\n\nthijsvanommen@gmail.com\n\nTom Claassen\n\nRadboud University Nijmegen\n\ntomc@cs.ru.nl\n\nStephan Bongers\n\nUniversity of Amsterdam\nsrbongers@gmail.com\n\nPhilip Versteeg\n\nUniversity of Amsterdam\n\np.j.j.p.versteeg@uva.nl\n\nJoris M. Mooij\n\nUniversity of Amsterdam\n\nj.m.mooij@uva.nl\n\nAbstract\n\nAn important goal common to domain adaptation and causal inference is to make\naccurate predictions when the distributions for the source (or training) domain(s)\nand target (or test) domain(s) differ. 
In many cases, these different distributions\ncan be modeled as different contexts of a single underlying system, in which each\ndistribution corresponds to a different perturbation of the system, or in causal terms,\nan intervention. We focus on a class of such causal domain adaptation problems,\nwhere data for one or more source domains are given, and the task is to predict the\ndistribution of a certain target variable from measurements of other variables in one\nor more target domains. We propose an approach for solving these problems that\nexploits causal inference and does not rely on prior knowledge of the causal graph,\nthe type of interventions or the intervention targets. We demonstrate our approach\nby evaluating a possible implementation on simulated and real world data.\n\n1\n\nIntroduction\n\nPredicting unknown values based on observed data is a problem central to many sciences, and well\nstudied in statistics and machine learning. This problem becomes signi\ufb01cantly harder if the training\nand test data do not have the same distribution, for example because they come from different\ndomains. Such a distribution shift can happen whenever the circumstances under which the training\ndata were gathered are different from those for which the predictions are to be made. A rich literature\nexists on this problem of domain adaptation, a particular task in the \ufb01eld of transfer learning; see\ne.g. Qui\u00f1onero-Candela et al. [2009], Pan and Yang [2010] for overviews.\nWhen the domain changes, so may the relations between the different variables under consideration.\nWhile for some sets of variables A, a function f : A \u2192 Y learned in one domain may continue\nto offer good predictions for Y \u2208 Y in a different domain, this may not be true of other sets A(cid:48) of\nvariables. 
Causal graphs [e.g., Pearl, 2009, Spirtes et al., 2000] allow us to reason about this in a principled way when the domains correspond to different external interventions on the system, or more generally, to different contexts in which a system has been measured. Knowledge of the causal graph that describes the data generating mechanism, and of which parts of the model are invariant across the different domains, allows one to transfer knowledge from one domain to the other in order to address the problem of domain adaptation [Spirtes et al., 2000, Storkey, 2009, Sch\u00f6lkopf et al., 2012, Bareinboim and Pearl, 2016].\n\n\u2217Most of the work was performed while at the University of Amsterdam.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n[Figure 1 panels: (a) causal graph over C1, X1, X2, X3; (b) no distribution shift for {X1}: P(Y | X1, C1 = 0) = P(Y | X1, C1 = 1); (c) strong distribution shift for {X3}: P(Y | X3, C1 = 0) \u2260 P(Y | X3, C1 = 1).]\nFigure 1: In this scenario, an intervention C1 leads to a shift of distribution between source domain and target domain (see also Example 1). Green crosses show source domain data (C1 = 0), blue circles show target domain data (C1 = 1). A standard feature selection method that does not take into account the causal structure, but would use X3 to predict Y := X2 (because X3 is a good predictor of Y in the source domain), would obtain extremely biased predictions in the target domain. Using X1 instead yields less accurate predictions in the source domain, but much more accurate ones in the target domain.\n\nOver the last years, various methods have been proposed to exploit the causal structure of the data generating process in order to address certain domain adaptation problems, each relying on different assumptions. 
For example, Bareinboim and Pearl [2016] provide theory for identi\ufb01ability under\ntransfer (\u201ctransportability\u201d) assuming that the causal graph is known, that interventions are perfect,\nand that the intervention targets are known. Hyttinen et al. [2015] also assume perfect interventions\nwith known targets but do not rely on complete knowledge of the causal graph, instead inferring\nthe relevant aspects of it from the data. Rojas-Carulla et al. [2018] make the assumption that if the\nconditional distribution of the target given some subset of covariates is invariant across different\nsource domains, then this conditional distribution must also be the same in the target domain. The\nmethods proposed in [Sch\u00f6lkopf et al., 2012, Zhang et al., 2013, 2015, Gong et al., 2016] all address\nchallenging settings in which conditional independences that follow from the usual Markov and\nfaithfulness assumptions alone do not suf\ufb01ce to solve the problem, but additional assumptions on the\ndata generating process have to be made.\nIn this work, we will make no such additional assumptions, and address the setting in which both the\ncausal graph and the intervention types and targets may be (partially) unknown. Our contributions are\nthe following. We consider a set of relatively weak assumptions that make the problem well-posed.\nWe propose an approach to solve this class of causal domain adaptation problems that can deal with\nthe presence of latent confounders. The main idea is to select the subset of features A that leads to\nthe best predictions of Y in the source domains, while satisfying invariance (i.e., P(Y | A) is the\nsame in the source and target domains). To test whether the invariance condition is satis\ufb01ed, we\napply the recently proposed Joint Causal Inference (JCI) framework [Mooij et al., 2018] to exploit\nthe information provided by multiple domains corresponding to different interventions. The basic\nidea is as follows. 
First, a standard feature selection method is applied to source domains data to find sets of features that are predictive of a target variable, trading off bias and variance, but unaware of changes in the distribution across domains. A causal inference method then draws conclusions from all given data about the possible causal graphs, avoiding sets of features for which the predictions would not transfer to the target domains. We propose a proof-of-concept implementation of our approach building on a causal discovery algorithm by Hyttinen et al. [2014]. We evaluate the method on synthetic data and a real-world example.\n\n2 Theory\n\nBefore giving a precise definition of the class of domain adaptation problems that we consider in this work, we begin with a motivating example.\n\nExample 1. We are given three variables X1, X2, X3 describing different aspects of a system (for example, certain blood cell phenotypes in mice). We have observational measurements of these three variables (the source domain, designated with C1 = 0), and in addition, measurements of X1 and X3 under an intervention (the target domain, designated with C1 = 1), e.g., in which the mice have been exposed to a certain drug. The domain adaptation task is to predict the values of Y := X2 in the interventional target domain (i.e., when C1 = 1). Let us assume for this example that the causal graph in Figure 1a applies, i.e., we assume that X2 is affected by X1 and affects X3, while C1 affects both X1 and X3 (i.e., the intervention targets the variables X1 and X3). This causal graph implies P(Y | X1, C1 = 0) = P(Y | X1, C1 = 1). 
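This invariance can be checked numerically. Below is a minimal linear-Gaussian simulation in the spirit of Example 1; the functional forms and coefficients are illustrative assumptions, not taken from the paper. Predicting Y from X3 looks best on source data but breaks under the intervention, while the predictor based on X1 transfers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(c1, n=20000):
    # Hypothetical SCM for Example 1: C1 intervenes on X1 and X3;
    # X1 -> X2 (noisy), X2 -> X3 (almost noiseless).
    x1 = 2.0 * c1 + rng.normal(0.0, 1.0, n)
    x2 = x1 + rng.normal(0.0, 1.0, n)            # Y := X2
    x3 = x2 + 3.0 * c1 + rng.normal(0.0, 0.1, n)
    return x1, x2, x3

x1s, ys, x3s = sample(0)   # source domain (C1 = 0)
x1t, yt, x3t = sample(1)   # target domain (C1 = 1)

def fit(x, y):
    # Least-squares line fitted on source-domain data only.
    slope, intercept = np.polyfit(x, y, 1)
    return lambda z: slope * z + intercept

mse = lambda f, x, y: float(np.mean((y - f(x)) ** 2))
f1, f3 = fit(x1s, ys), fit(x3s, ys)

# X3 wins on source data but suffers a large transfer bias in the target
# domain; the invariant conditional P(Y | X1) transfers without extra bias.
print(mse(f3, x3s, ys), mse(f3, x3t, yt))  # small vs. large
print(mse(f1, x1s, ys), mse(f1, x1t, yt))  # comparable in both domains
```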
Suppose further that the relation between\nX1 and X2 is about equally strong as the relation between X2 and X3, but considerably more noisy.\nThen a feature selection method using only available source domain data, and aiming to select the\nbest subset of features to use for prediction of Y will prefer both {X3} and {X1, X3} over {X1}\n(because predicting Y from X1 leads to larger variance than predicting Y from X3, and to a larger\nbias than predicting Y from both X1 and X3). However, under the intervention (C1 = 1), P(Y | X3)\nand P(Y | X1, X3) both change,2 so that using those features to predict Y in the target domain could\nlead to extreme bias, as illustrated in Figure 1c. Because the conditional distribution of Y given X1\nis invariant across domains, as illustrated in Figure 1b, predictions of Y based only on X1 can be\nsafely transferred to the target domain.\n\nThis example provides an instance of a domain adaptation problem where feature selection methods\nthat do not take into account the causal structure would pick a set of features that does not generalize\nto the target domain, and may lead to arbitrarily bad predictions (even asymptotically, as the number\nof data points tends to in\ufb01nity). On the other hand, correctly taking into account the causal structure\nand the possible distribution shift from source to target domain allows to upper bound the prediction\nerror in the target domain, as we will see in Section 2.3.\n\n2.1 Problem Setting\n\nWe now formalize the domain adaptation problems that we address in this paper. We will make use of\nthe terminology of the recently proposed Joint Causal Inference (JCI) framework [Mooij et al., 2018].\nLet us consider a system of interest described by a set of system variables {Xj}j\u2208J . In addition, we\nmodel the domain in which the system has been measured by context variables {Ci}i\u2208I (we will use\n\u201ccontext\u201d as a synonym for \u201cdomain\u201d). 
We will denote the tuple of all system and context variables\nas V = ((Xj)j\u2208J , (Ci)i\u2208I). System and context variables can be discrete or continuous. As a\nconcrete example, the system of interest could be a mouse. The system variables could be blood cell\nphenotypes such as the concentration of red blood cells, the concentration of white blood cells, and\nthe mean red blood cell volume. The context variables could indicate for example whether a certain\ngene has been knocked out, the dosage of a certain drug administered to the mice, the age and gender\nof the mice, or the lab in which the measurements were done. The important underlying assumption\nis that context variables are exogenous to the system, whereas system variables are endogenous. The\ninterventions are not limited to the perfect (\u201csurgical\u201d) interventions modeled by the do-operator of\nPearl [2009], but can also be other types of interventions such as mechanism changes [Tian and Pearl,\n2001], soft interventions [Markowetz et al., 2005], fat-hand interventions [Eaton and Murphy, 2007],\nactivity interventions [Mooij and Heskes, 2013], and stochastic versions of all these. Knowledge\nof the intervention targets is not necessary (but is certainly helpful). For example, administering\na drug to the mice may have a direct causal effect on an unknown subset of the system variables,\nbut we can simply model it as a binary exogenous variable (indicating whether or not the drug was\nadministered) or a continuous exogenous variable (describing the dosage of the administered drug)\nwithout specifying in advance on which variables it has a direct effect. We can now formally state the\ndomain adaptation task that we address in this work:\nTask 1 (Domain Adaptation Task). 
We are given data for a single or for multiple source domains, in each of which C1 = 0, and for a single or for multiple target domains, in each of which C1 = 1. Assume the source domains data is complete (i.e., no missing values), and the target domains data is complete with the exception of all values of a certain target variable Y = Xj. The task is to predict these missing values of the target variable Y given the available source and target domains data.\n\n2More precisely, we should say that P(Y | X3, C1 = 1) may differ from P(Y | X3, C1 = 0), and similarly when conditioning on {X1, X3}.\n\nContext variables: C1, C2; system variables: X1, X2, X3.\n\nC1   C2   X1    X2    X3\n0    0.1  0.1   0.2   0.5     (source domains)\n0    0.2  0.49  0.21  0.13\n0    0.4  0.51  0.21  0.23\n0    1.5  0.52  0.19  0.5\n0    1.7  0.6   0.18  0.51\n1    0.2  0.92  ?     0.2     (target domains)\n1    0.1  0.99  ?     0.23\n1    1.6  0.95  ?     0.53\n1    1.8  0.90  ?     0.61\n1    1.7  0.55  ?     0.97\n\n[causal graph over context variables C1, C2 and system variables X1, X2, X3]\nFigure 2: Example of a causal domain adaptation problem. The causal graph is depicted on the right, the corresponding data on the left. The task is to predict the missing values of Y = X2 in the target domains (C1 = 1), based on the observed data from the source domains and the target domains, without knowledge of the causal graph. See also Example 2.\n\nAn example is provided in Figure 2. 
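In code, the inputs of Task 1 amount to a fully observed source array plus a target array whose Y column is masked. A schematic sketch (illustrative values and a hypothetical column order C1, C2, X1, X2, X3; not the Figure 2 data):

```python
import numpy as np

# Schematic Task 1 data layout. Y = X2 is fully observed in the source
# domains (C1 = 0) and entirely missing in the target domains (C1 = 1).
data = np.array([
    [0, 0.1, 0.10, 0.20,   0.50],
    [0, 1.5, 0.52, 0.19,   0.50],
    [1, 0.2, 0.92, np.nan, 0.20],
    [1, 1.8, 0.90, np.nan, 0.61],
])
source = data[data[:, 0] == 0]
target = data[data[:, 0] == 1]
y_missing = np.isnan(target[:, 3])
print(y_missing.all())  # every target-domain Y value is to be predicted
```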
In the next subsection, we will formalize our assumptions to turn this task into a well-posed problem.\n\n2.2 Assumptions\n\nOur first main assumption is that the data generating process (on both system and context variables) can be represented as a Structural Causal Model (SCM) (see e.g., [Pearl, 2009]):\n\nM : Ci = gi(E_{PA(i) \u2229 K}),  i \u2208 I,\n    Xj = fj(X_{PA(j) \u2229 J}, C_{PA(j) \u2229 I}, E_{PA(j) \u2229 K}),  j \u2208 J,\n    p(E) = \u220f_{k \u2208 K} p(Ek).    (1)\n\nHere, we introduced exogenous latent independent \u201cnoise\u201d variables (Ek)k\u2208K that model latent causes of the context and system variables. The parents of each variable are denoted by PA(\u00b7). Each context and system variable is related to its parent variables by a structural equation. In addition, we assume a factorizing probability distribution on the exogenous variables. There could be cyclic dependencies, for example due to feedback loops, but for simplicity of exposition we will discuss only the acyclic case here, noting that the extension to the cyclic case is straightforward given recent theoretical advances on cyclic SCMs [Bongers et al., 2018]. This SCM provides a causal model for the distributions of the various domains, and in particular, it induces a joint distribution P(V) on the context and system variables. 
Note that we will assume that the data generating process can be\nmodeled by some model of this form, but we do not rely on knowing the precise model.\nThe SCM M can be represented graphically by its causal graph G(M), a graph with nodes I \u222a J\n(i.e., the labels of both system and context variables), directed edges l1 \u2192 l2 for l1, l2 \u2208 I \u222a J iff\nl1 \u2208 PA(l2), and bidirected edges l1 \u2194 l2 for l1, l2 \u2208 I \u222aJ iff there exists a k \u2208 PA(l1)\u2229 PA(l2)\u2229K.\nIn the acyclic case, this causal graph is an Acyclic Directed Mixed Graph (ADMG), and M is also\nknown as a Semi-Markov Causal Model (see e.g., [Pearl, 2009]). The directed edges represent direct\ncausal relationships, and the bidirected edges may represent hidden confounders (both relative to\nthe set of variables in the ADMG). The (causal) Markov assumption holds [Richardson, 2003], i.e.,\nany d-separation A \u22a5 B | S [G(M)] between sets of random variables A, B, S \u2286 V in the ADMG\nG(M) implies a conditional independence A\u22a5\u22a5 B | S [P(V )] in the distribution P(V ) induced by\nthe SCM M. A standard assumption in causal discovery is that the joint distribution P(V ) is faithful\nwith respect to the ADMG G(M), i.e., that there are no other conditional independences in the joint\ndistribution than those implied by d-separation.\nWe will make the following assumptions on the causal structure (where henceforth we will simply\nwrite G instead of G(M)), which are discussed in detail by Mooij et al. [2018]:\nAssumption 1 (JCI Assumptions). 
Let G be a causal graph with variables V (consisting of system variables {Xj}j\u2208J and context variables {Ci}i\u2208I).\n\n(i) No system variable directly causes any context variable (\u201cexogeneity\u201d): \u2200j \u2208 J, \u2200i \u2208 I : Xj \u2192 Ci \u2209 G;\n\n(ii) No system variable is confounded with a context variable (\u201crandomization\u201d): \u2200j \u2208 J, \u2200i \u2208 I : Xj \u2194 Ci \u2209 G;\n\n(iii) Every pair of context variables is purely confounded (\u201cgenericity\u201d): \u2200i, i\u2032 \u2208 I : Ci \u2194 Ci\u2032 \u2208 G \u2227 Ci \u2192 Ci\u2032 \u2209 G.\n\nThe first assumption is the most crucial one that captures what we mean by \u201ccontext\u201d. The other two assumptions are less crucial and could be omitted, depending on the application. For a more in-depth discussion of these modeling assumptions and on how they compare with other possible causal modeling approaches, we refer the reader to [Mooij et al., 2018]. Any causal discovery method can in principle be used in the JCI setting, but identifiability greatly benefits from taking into account the background knowledge on the causal graph from Assumption 1.\n\nIn addition, in order to be able to address the causal domain adaptation task, we will assume:\n\nAssumption 2. Let G be a causal graph with variables V (consisting of system variables {Xj}j\u2208J and context variables {Ci}i\u2208I), and P(V) be the corresponding distribution on V. Let C1 be the source/target domains indicator and Y = Xj the target variable.\n\n(i) The distribution P(V) is Markov and faithful w.r.t. G;\n\n(ii) Any conditional independence involving Y in the source domains also holds in the target domains, i.e., if A \u222a B \u222a S contains Y but not C1 then:3\n\nA \u22a5\u22a5 B | S [C1 = 0] =\u21d2 A \u22a5\u22a5 B | S [C1 = 1];\n\n(iii) C1 has no direct effect on Y w.r.t. 
V , i.e., C1 \u2192 Y /\u2208 G.\n\nThe Markov and faithfulness assumptions are standard in constraint-based causal discovery on a\nsingle domain; we apply them here on the \u201cmeta-system\u201d composed of system and context.\nAssumption 2(ii) may seem non-intuitive, but as we show in the Supplementary Material, it follows\nfrom more intuitive (but stronger) assumptions, for example if both the pooled source domains\ndistribution P(V | C1 = 0) and the pooled target domains distribution P(V | C1 = 1) are Markov\nand faithful to the subgraph of G which excludes C1. These stronger assumptions imply that the\ncausal structure (i.e., presence or absence of causal relationships and confounders) of the other\nvariables is invariant when going from source to target domains. Assumption 2(ii) is a weakened\nversion of these more natural assumptions, allowing additional independences to hold in the target\ndomains compared to the source domains, e.g., when C1 models a perfect surgical intervention.\nAssumption 2(iii) is strong, yet some assumption of that type seems necessary to make the task\nwell-de\ufb01ned. Without any information at all about the target(s) of C1, or the causal mechanism\nthat determines the values of Y in the target domains, predicting the values of Y for the target\ndomains seems generally impossible. Note that the assumption is more likely to be satis\ufb01ed if the\ninterventions are believed to be precisely targeted, and gets weaker the more relevant system variables\nare observed.4\nAs one example of a real-world setting in which these assumptions are reasonable, consider a\ngenomics experiment, in which gene expression levels of many different genes are measured in\nresponse to knockouts of single genes. Given our present-day understanding of the biology of gene\nexpression, it is very reasonable to assume that the knockout of gene Xi only has a direct effect on the\nexpression level of gene Xi itself. 
As long as we do not ask to predict the expression level of Xi under a knockout of Xi, but only the expression level of other genes Y = Xj with j \u2260 i, Assumption 2(iii) seems justified. It is also reasonable (based on present-day understanding of biology) to expect that a single gene knockout does not change the causal mechanisms in the rest of the system. This justifies Assumption 2(ii) in this setting if one is willing to assume faithfulness.\n\nIn the next subsections, we will discuss how these assumptions enable us to address the domain adaptation task.\n\n3Here, with A \u22a5\u22a5 B | S [C1 = 0] we mean A \u22a5\u22a5 B | S [P(V | C1 = 0)], i.e., the conditional independence of A from B given S in the mixture of the source domains P(V | C1 = 0), and similarly for the target domains.\n4This assumption can be weakened further: in some circumstances one can infer from the data and the other assumptions that C1 cannot have a direct effect on Y. For example: if there exists a descendant D \u2208 DE(Y), and if there exists a set S \u2286 V \\ ({C1, Y} \u222a DE(Y)), such that C1 \u22a5\u22a5 D | S, then C1 is not a direct cause of Y w.r.t. V. For some proposals on alternative assumptions that can be made when this assumption is violated, see e.g., [Sch\u00f6lkopf et al., 2012, Zhang et al., 2013, 2015, Gong et al., 2016].\n\n2.3 Separating Sets of Features\n\nOur approach to addressing Task 1 is based on finding a separating set A \u2286 V \\ {C1, Y} of (context and system) variables that satisfies C1 \u22a5 Y | A [G]. If such a separating set A can be found, then the distribution of Y conditional on A is invariant under transferring from the source domains to the target domains, i.e., P(Y | A, C1 = 0) = P(Y | A, C1 = 1). As the former conditional distribution can be estimated from the source domains data, we directly obtain a prediction for the latter, which then enables us to predict the values of Y from the observed values of A in the target domains.5\n\nWe will now discuss the effect of the choice of A on the quality of the predictions. For simplicity of the exposition, we make use of the squared loss function and look at the asymptotic case, ignoring finite-sample issues. When predicting Y from a subset of features A \u2286 V \\ {Y, C1} (that may or may not be separating), the optimal predictor is defined as the function \u02c6Y mapping from the range of possible values of A to the range of possible values of Y that minimizes the target domains risk E[(Y \u2212 \u02c6Y(A))^2 | C1 = 1], and is given by the conditional expectation (regression function) \u02c6Y^1_A(a) := E(Y | A = a, C1 = 1). Since Y is not observed in the target domains, we cannot directly estimate this regression function from the data.\n\nOne approach that is often used in practice is to ignore the difference in distribution between source and target domains, and use instead the predictor \u02c6Y^0_A(a) := E(Y | A = a, C1 = 0), which minimizes the source domains risk E[(Y \u2212 \u02c6Y)^2 | C1 = 0]. This approximation introduces a bias \u02c6Y^1_A \u2212 \u02c6Y^0_A that we will refer to as the transfer bias (when predicting Y from A). When ignoring that source domains and target domains have different distributions, any standard machine learning method can be used to predict Y from A. As the transfer bias can become arbitrarily large (as we have seen in Example 1), the prediction accuracy of this solution strategy may be arbitrarily bad (even in the infinite-sample limit).\n\nInstead, we propose to only predict Y from A when the set A of features satisfies the following separating set property:\n\nC1 \u22a5 Y | A [G],    (2)\n\ni.e., it d-separates C1 from Y in G. By the Markov assumption, this implies C1 \u22a5\u22a5 Y | A [P(V)]. In other words (as already mentioned above), for separating sets, the distribution of Y conditional on A is invariant under transferring from the source domains to the target domains, i.e., P(Y | A, C1 = 0) = P(Y | A, C1 = 1). By virtue of this invariance, regression functions are identical for the source domains and target domains, i.e., \u02c6Y^0_A = \u02c6Y^1_A, and hence also the source domains and target domains risks are identical when using the predictor \u02c6Y^0_A:\n\nC1 \u22a5 Y | A [G] =\u21d2 E[(Y \u2212 \u02c6Y^0_A)^2 | C1 = 1] = E[(Y \u2212 \u02c6Y^0_A)^2 | C1 = 0].    (3)\n\nThe r.h.s. can be estimated from the source domains data, and the l.h.s. equals the generalization error to the target domains when using the predictor \u02c6Y^0_A trained on the source domains (which equals the predictor \u02c6Y^1_A that one could obtain if all target domains data, including the values of Y, were observed).6 Although this approach leads to zero transfer bias, it introduces another bias: by using only a subset of the features A, rather than all available features V \\ {C1, Y}, we may miss relevant information to predict Y. We refer to this bias as the incomplete information bias, \u02c6Y^1_{V\\{Y,C1}} \u2212 \u02c6Y^1_A.\n\nThe total bias when using \u02c6Y^0_A to predict Y is the sum of the transfer bias and the incomplete information bias:\n\n\u02c6Y^1_{V\\{Y,C1}} \u2212 \u02c6Y^0_A = (\u02c6Y^1_A \u2212 \u02c6Y^0_A) + (\u02c6Y^1_{V\\{Y,C1}} \u2212 \u02c6Y^1_A),\n\nwhere the l.h.s. is the total bias, the first term on the r.h.s. is the transfer bias, and the second term is the incomplete information bias.\n\n5This trivial observation is not novel; see e.g. [Ch. 7, p. 164, Spirtes et al., 2000]. It also follows as a special case of [Theorem 2, Pearl and Bareinboim, 2011]. The main novelty of this work is the proposed strategy to identify such separating sets.\n6Note that this equation only holds asymptotically; for finite samples, in addition to the transfer from source domains to target domains, we have to deal with the generalization from empirical to population distributions and from the covariate shift if P(A | C1 = 1) \u2260 P(A | C1 = 0) [see e.g. Mansour et al., 2009].\n\nFor some problems, one may be better off by simply ignoring the transfer bias and minimizing the incomplete information bias, while for other problems, it is crucial to take the transfer into account to 
We conclude that this strategy of selecting a subset\nA to predict Y may yield an asymptotic guarantee on the prediction error by (3), whereas simply\nignoring the shift in distribution may lead to unbounded prediction error, since the transfer bias could\nbe arbitrarily large in the worst case scenario.\n\n2.4\n\nIdenti\ufb01ability of Separating Feature Sets\n\nFor the strategy of selecting the best separating sets of features as discussed in Section 2.3, we\nneed to \ufb01nd one or more sets A \u2286 V \\ {C1, Y } that satisfy (2). Of course, the problem is that we\ncannot directly test this in the data, because the values of Y are missing for C1 = 1. Note that also\nAssumption 2(ii) cannot be directly used here, because it only applies when C1 is not in A\u222aB. When\nthe causal graph G is known, it is easy to verify whether (2) holds directly using d-separation. Here\nwe address the more challenging setting in which the causal graph and the targets of the interventions\nare (partially) unknown.7 Conceptually, one could estimate a set of possible causal graphs by using a\ncausal discovery algorithm (for example, extending any standard method to deal with the missing\nconditional independence tests in C1 = 1), and then read off separating sets from these graphs. In\npractice, it is not necessary to estimate completely these causal graphs: we only need to know enough\nabout them to verify or falsify whether a given set of features separates C1 from Y . The following\nexample (with details in the Supplementary Material) illustrates a case where such reasoning allows\nus to identify a separating set.\nExample 2. Assume that Assumptions 1 and 2 hold for two context variables C1, C2 and three system\nvariables X1, X2, X3 with Y := X2. 
If the following conditional (in)dependences all hold in the source domains:\n\nC2 \u22a5\u22a5 X2 | X1 [C1 = 0],   C2 \u22a5\u0338\u22a5 X2 | \u2205 [C1 = 0],   C2 \u22a5\u22a5 X3 | X2 [C1 = 0],    (4)\n\nthen C1 \u22a5 X2 | X1 [G], i.e., {X1} is a separating set for C1 and X2. One possible causal graph leading to those (in)dependences is provided in Figure 2 (the others are shown in Figure 1c in the Supplementary Material). For that ADMG, and given enough data, feature selection applied to the source domains data will generically select {X1, X3} as the optimal set of features for predicting Y := X2, which can lead to an arbitrarily large prediction error. On the other hand, the set {X1} is separating in any ADMG satisfying (4), so using it to predict Y leads to zero transfer bias, and therefore provides a guarantee on the target domains risk (i.e., it provides an upper bound on the optimal target domains risk, which can be estimated from the source domains data).\n\nRather than characterizing by hand all possible situations in which a separating set can be identified (like in Example 2), in this work we delegate the causal inference to an automatic theorem prover. Intuitively, the idea is to provide the automatic theorem prover with the conditional (in)dependences that hold in the data, in combination with an encoding of Assumptions 1 and 2 into logical rules, and ask the theorem prover whether it can prove that C1 \u22a5 Y | A holds for a candidate set A from the assumptions and provided conditional (in)dependences. 
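The separating-set statements of Example 2 can also be verified mechanically. The sketch below enumerates paths in one ADMG that is consistent with (4) (an assumed graph, for illustration only; the bidirected edge C1 \u2194 C2 is modeled through an explicit latent node L) and applies the d-separation blocking rules:

```python
# Minimal d-separation check by path enumeration, for a small assumed DAG.
parents = {
    "C1": {"L"}, "C2": {"L"}, "L": set(),
    "X1": {"C2"}, "X2": {"X1"}, "X3": {"X2", "C1"},
}
children = {v: {w for w, ps in parents.items() if v in ps} for v in parents}

def descendants(v):
    out, stack = set(), [v]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def paths(a, b, path=None):
    """All simple undirected paths from a to b."""
    path = path or [a]
    if a == b:
        yield path
        return
    for n in parents[a] | children[a]:
        if n not in path:
            yield from paths(n, b, path + [n])

def blocked(path, z):
    # A path is blocked if some non-collider on it is in z, or some
    # collider has neither itself nor any descendant in z.
    for i in range(1, len(path) - 1):
        prev, node, nxt = path[i - 1], path[i], path[i + 1]
        if prev in parents[node] and nxt in parents[node]:  # collider
            if node not in z and not (descendants(node) & z):
                return True
        elif node in z:
            return True
    return False

def d_separated(a, b, z):
    return all(blocked(p, z) for p in paths(a, b))

print(d_separated("C1", "X2", {"X1"}))  # True: {X1} is a separating set
print(d_separated("C1", "X2", set()))   # False: C1 <- L -> C2 -> X1 -> X2 is open
```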
There are three possibilities: either it\ncan prove the query (and then we can proceed to predict Y from A and get an estimate of the target\ndomains risk), or it can disprove the query (and then we know A will generically give predictions that\nsuffer from an arbitrarily large transfer bias), or it can do neither (in which case hopefully another\nsubset A can be found that does provably satisfy (2)).\n\n2.5 Algorithm\n\nA simple (brute-force) algorithm that \ufb01nds the best separating set as described in Section 2.3 is\nthe following. By using a standard feature selection method, produce a ranked list of subsets\nA \u2286 V \\ {Y, C1}, ordered ascendingly with respect to the empirical source domains risks. Going\nthrough this list of subsets (starting with the one with the smallest empirical source domains risk),\n7Another option, proposed by Rojas-Carulla et al. [2018], is to assume that if p(Y | A) is invariant across all\nsource domains (i.e., p(Y | A, C1 = 0, C\\1 = c) = p(Y | A, C1 = 0) for all c), then the same holds across all\nsource and target domains (i.e., p(Y | A, C1 = 1) = p(Y | A, C1 = 0, C\\1 = c) for all c). This assumption\ncan be violated in some simple cases, e.g. see Example 2.\n\n7\n\n\ftest whether the separating set property can be inferred from the data by querying the automated\ntheorem prover. If (2) can be shown to hold, use that subset A for prediction of Y and stop; if not,\ncontinue with the next candidate subset A in the list. If no subset satis\ufb01es (2), abstain from making a\nprediction.8\nAn important consequence of Assumption 2(ii) is that it enables us to transfer conditional indepen-\ndence involving the target variable from the source domains to the target domains (proof provided in\nthe Supplementary Material):\nProposition 1. 
Under Assumption 2,

A ⊥⊥ B | S [C1 = 0]  ⟺  A ⊥⊥ B | S ∪ {C1}  ⟺  A ⊥ B | S ∪ {C1} [G]

for subsets A, B, S ⊆ V such that their union contains Y but not C1.

To test the separating set condition (2), we use the approach proposed by Hyttinen et al. [2014], where we simply add the JCI assumptions (Assumption 1) as constraints on the optimization problem, in addition to the domain-adaptation specific assumption that C1 → Y ∉ G (Assumption 2(iii)). As inputs we use all directly testable conditional independence test p-values p_{A ⊥⊥ B | S} in the pooled data (when Y ∉ A ∪ B ∪ S) and all those resulting from Proposition 1 applied to the source domains data only (when Y ∈ A ∪ B ∪ S). If background knowledge on intervention targets or the causal graph is available, it can easily be added as well. We use the method proposed by Magliacane et al. [2016] to query for the confidence of whether some statement (e.g., Y ⊥⊥ C1 | A) is true or false. The results of Magliacane et al. [2016] show that this approach is sound under oracle inputs, and asymptotically consistent whenever the statistical conditional independence tests used are asymptotically consistent. In other words, in this way the probability of wrongly deciding whether a subset A is a separating set converges to zero as the sample size increases. We chose this approach because it is simple to implement on top of existing open source code.9 Note that the computational cost increases quickly with the number of variables, limiting the number of variables that can be considered simultaneously.

One remaining issue is how to predict Y once an optimal separating set A has been found. As the distribution of A may shift when transferring from the source domains to the target domains, there is a covariate shift to be taken into account when predicting Y.
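For instance, such a shift can be handled by importance weighting. The sketch below is purely illustrative and not the method used in the paper: it estimates the density ratio p_target(a)/p_source(a) with a logistic-regression classifier that discriminates source from target samples (a simple alternative to the direct importance estimation of Sugiyama et al. [2008]); all function names are ours.

```python
# Illustrative sketch (not the paper's method): importance-weighted regression
# to account for covariate shift in the separating set A.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression


def importance_weights(A_source, A_target):
    """Estimate w(a) proportional to p_target(a) / p_source(a) via a classifier."""
    Z = np.vstack([A_source, A_target])
    t = np.concatenate([np.zeros(len(A_source)), np.ones(len(A_target))])
    clf = LogisticRegression().fit(Z, t)
    p = clf.predict_proba(A_source)[:, 1]  # P(sample comes from target | a)
    # Density-ratio identity: p_t(a)/p_s(a) = P(t|a)/(1-P(t|a)) * n_s/n_t
    return p / (1.0 - p) * (len(A_source) / len(A_target))


def predict_under_covariate_shift(A_source, y_source, A_target):
    """Fit a regressor on source data, reweighted towards the target distribution."""
    w = importance_weights(A_source, A_target)
    reg = LinearRegression().fit(A_source, y_source, sample_weight=w)
    return reg.predict(A_target)
```

Any probabilistic classifier and any regressor accepting sample weights could be substituted here.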
Any method (e.g., least-squares regression) could in principle be used to predict Y from a given set of covariates, but it is advisable to use a prediction method that works well under covariate shift, e.g., [Sugiyama et al., 2008].

3 Evaluation

We perform an evaluation on both synthetic data and a real-world dataset based on a causal inference challenge.10 The latter dataset consists of hematology-related measurements from the International Mouse Phenotyping Consortium (IMPC), which collects measurements of phenotypes of mice with different single-gene knockouts.

In both evaluations we compare a standard feature selection method (which uses Random Forests) with our method, which builds on top of it and selects from its output the best separating set. First, we score all possible subsets of features by their out-of-bag score, using the implementation of RandomForestRegressor from scikit-learn [Pedregosa et al., 2011] with default parameters. For the baseline we then select the best performing subset and predict Y. For our proposed method, instead, we try to find a subset of features A that is also a separating set, starting from the subsets with the best scores. To test whether A is a separating set, we use the method described in Section 2.5 with the ASP solver clingo 4.5.4 [Gebser et al., 2014]. As inputs we provide the independence test results from a partial correlation test with significance level α = 0.05, combined with the weighting scheme from Magliacane et al. [2016]. We then use the first subset A in the ranked list of predictive sets of features found by the Random Forest method for which the confidence that C1 ⊥ Y | A holds is positive. If there is no set A that satisfies this criterion, we abstain from making a prediction.

8 Abstaining from predictions can be advantageous when trading off recall and precision. If a prediction has to be made, we can fall back on some other method or simply accept the risk that the transfer bias may be large.
9 We build on the source code provided by Magliacane et al. [2016], which in turn extends the source code provided by Hyttinen et al. [2014]. The full source code of our implementation and the experiments is available online at https://github.com/caus-am/dom_adapt.
10 Part of the CRM workshop on Statistical Causal Inference and Applications to Genetics, Montreal, Canada (2016). See also http://www.crm.umontreal.ca/2016/Genetics16/competition_e.php

Figure 3: Evaluation results (see main text and Supplementary Material for details). (a) Synthetic data with N = 1000 samples and a large perturbation. (b) Real-world data.

For the synthetic data, we randomly generate 200 linear acyclic models with latent variables and Gaussian noise, each with three system variables, and sample N data points each for the observational and two experimental domains, where we simulate soft interventions on randomly selected targets, with different sizes of perturbations. We randomly select which of the two context variables will be C1 and which of the three system variables will be Y. We disallow direct effects of C1 on Y, and enforce that no intervention can directly affect all variables simultaneously. More details on how the data were simulated are provided in the Supplementary Material. Figure 3a shows a boxplot of the L2 loss of the predicted Y values with respect to the true values for both the baseline and our method, considering the 121 cases out of 200 in which our method does produce an answer. In particular, Figure 3a considers the case of N = 1000 samples per regime and interventions that all produce a large perturbation.
In the Supplementary Material we show that results improve with more samples, both for the baseline and, even more so, for our method, since the quality of the conditional independence tests improves. We also show that, as expected, if the target distribution is very similar to the source distributions, i.e., the transfer bias is small, our method does not provide any benefit and seems to perform worse than the baseline. Conversely, the larger the intervention effect, the bigger the advantage of using our method.

For the real-world dataset, we select a subset of the variables considered in the CRM Causal Inference Challenge. Specifically, for simplicity we focus on 16 phenotypes that are not deterministically related to each other. The dataset contains measurements for 441 "wild type" mice and for about 10 "mutant" mice for each of 13 different single-gene knockouts. We then generate 1000 datasets by randomly selecting subsets of 3 variables and 2 gene knockout contexts, always also including the "wild type" mice. For each dataset we randomly choose Y and C1, and leave out the observed values of Y for C1 = 1. Figure 3b shows a boxplot of the L2 loss of the predicted Y values with respect to the real values for the baseline and our method. Given the small size of the datasets, this is a very challenging problem. In this case, our method abstains from making a prediction in 170 cases out of 1000, but performs similarly to the baseline on the remaining cases.

4 Discussion and Conclusion

We have defined a general class of causal domain adaptation problems and proposed a method that can identify sets of features that lead to transferable predictions. Our assumptions are quite general and in particular do not require the causal graph or the intervention targets to be known. The method gives promising results on simulated data.
It is straightforward to extend our method to the cyclic case by making use of the results of Forré and Mooij [2018]. More work remains to be done on the implementation side, for example, scaling up to more variables. Currently, our approach can handle about seven variables on a laptop computer, and with recent advances in exact causal discovery algorithms [e.g., Rantanen et al., 2018], a few more variables would be feasible. For scaling up to dozens of variables, we plan to adapt constraint-based causal discovery algorithms like FCI [Spirtes et al., 2000] to deal with the missing-data aspect of the domain adaptation task. We hope that this work will also inspire further research on the interplay between bias, variance and causality from a statistical learning theory perspective.

Acknowledgments

We thank Patrick Forré for proofreading a draft of this work. We thank Renée van Amerongen and Lucas van Eijk for sharing their domain knowledge about the hematology-related measurements from the International Mouse Phenotyping Consortium (IMPC). SM, TC, SB, and PV were supported by NWO, the Netherlands Organization for Scientific Research (VIDI grant 639.072.410). SM was also supported by the Dutch programme COMMIT/ under the Data2Semantics project. TC was also supported by NWO grant 612.001.202 (MoCoCaDi), and EU-FP7 grant agreement n. 603016 (MATRICS). TvO and JMM were supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 639466).

References

E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352, 2016.

S. Bongers, J. Peters, B. Schölkopf, and J. M. Mooij.
Theoretical aspects of cyclic structural causal models. arXiv.org preprint, arXiv:1611.06221v2 [stat.ME], Aug. 2018. URL https://arxiv.org/abs/1611.06221v2.

D. Eaton and K. Murphy. Exact Bayesian structure learning from uncertain interventions. In Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-07), volume 2 of Proceedings of Machine Learning Research, pages 107–114, 2007.

P. Forré and J. M. Mooij. Constraint-based causal discovery for non-linear structural causal models with cycles and latent confounders. In Proceedings of the 34th Annual Conference on Uncertainty in Artificial Intelligence (UAI-18), 2018.

M. Gebser, R. Kaminski, B. Kaufmann, and T. Schaub. Clingo = ASP + control: Extended report. Technical report, University of Potsdam, 2014. URL http://www.cs.uni-potsdam.de/wv/pdfformat/gekakasc14a.pdf.

M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference Proceedings, pages 2839–2848, 2016.

A. Hyttinen, F. Eberhardt, and M. Järvisalo. Constraint-based causal discovery: Conflict resolution with answer set programming. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI-14), pages 340–349, 2014.

A. Hyttinen, F. Eberhardt, and M. Järvisalo. Do-calculus when the true graph is unknown. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI 2015), pages 395–404, 2015.

S. Magliacane, T. Claassen, and J. M. Mooij. Ancestral causal inference. In Proceedings of Advances in Neural Information Processing Systems (NIPS-16), pages 4466–4474, 2016.

Y. Mansour, M. Mohri, and A. Rostamizadeh.
Domain adaptation: Learning bounds and algorithms. In Proceedings of the Twenty-Second Annual Conference on Learning Theory (COLT 2009), 2009.

F. Markowetz, S. Grossmann, and R. Spang. Probabilistic soft interventions in conditional Gaussian networks. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS-05), pages 214–221, 2005.

J. M. Mooij and T. Heskes. Cyclic causal discovery from continuous equilibrium data. In Proceedings of the 29th Annual Conference on Uncertainty in Artificial Intelligence (UAI-13), pages 431–439, 2013.

J. M. Mooij, S. Magliacane, and T. Claassen. Joint causal inference from multiple contexts. arXiv.org preprint, arXiv:1611.10351v3 [cs.LG], Mar. 2018. URL https://arxiv.org/abs/1611.10351v3.

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, Oct. 2010.

J. Pearl. Causality: models, reasoning and inference. Cambridge University Press, 2009.

J. Pearl and E. Bareinboim. Transportability of causal and statistical relations: A formal approach. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, pages 247–254, 2011.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

J. Quiñonero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, editors. Dataset Shift in Machine Learning. MIT Press, 2009.

K. Rantanen, A. Hyttinen, and M. Järvisalo. Learning optimal causal graphs with exact search.
In Proceedings of the 9th International Conference on Probabilistic Graphical Models (PGM 2018), volume 72 of Proceedings of Machine Learning Research, pages 344–355, 2018.

T. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30:145–157, 2003.

M. Rojas-Carulla, B. Schölkopf, R. Turner, and J. Peters. Invariant models for causal transfer learning. Journal of Machine Learning Research, 19(36):1–34, 2018.

B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. M. Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), pages 1255–1262, 2012.

P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2nd edition, 2000.

A. Storkey. When training and test sets are different: Characterizing learning transfer. In Dataset Shift in Machine Learning, chapter 1, pages 3–28. MIT Press, 2009.

M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Proceedings of Advances in Neural Information Processing Systems (NIPS-08), pages 1433–1440, 2008.

J. Tian and J. Pearl. Causal discovery from changes. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence (UAI-01), 2001.

K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 819–827, 2013.

K. Zhang, M. Gong, and B. Schölkopf. Multi-source domain adaptation: A causal view.
In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pages 3150–3157, 2015.