{"title": "Causal Effect Inference with Deep Latent-Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 6446, "page_last": 6456, "abstract": "Learning individual-level causal effects from observational data, such as inferring the most effective medication for a specific patient, is a problem of growing importance for policy makers. The most important aspect of inferring causal effects from observational data is the handling of confounders, factors that affect both an intervention and its outcome. A carefully designed observational study attempts to measure all important confounders. However, even if one does not have direct access to all confounders, there may exist noisy and uncertain measurements of proxies for confounders. We build on recent advances in latent variable modeling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders (VAE) which follow the causal structure of inference with proxies. We show our method is significantly more robust than existing methods, and matches the state-of-the-art on previous benchmarks focused on individual treatment effects.", "full_text": "Causal Effect Inference with Deep Latent-Variable Models

Christos Louizos, University of Amsterdam, TNO Intelligent Imaging, c.louizos@uva.nl
Uri Shalit, New York University, CIMS, uas1@nyu.edu
Joris Mooij, University of Amsterdam, j.m.mooij@uva.nl
David Sontag, Massachusetts Institute of Technology, CSAIL & IMES, dsontag@mit.edu
Richard Zemel, University of Toronto, CIFAR*, zemel@cs.toronto.edu
Max Welling, University of Amsterdam, CIFAR*, m.welling@uva.nl

Abstract

Learning individual-level causal effects from observational data, such as inferring the most effective medication for a specific patient, is a problem of growing importance for policy makers.
The most important aspect of inferring causal effects from observational data is the handling of confounders, factors that affect both an intervention and its outcome. A carefully designed observational study attempts to measure all important confounders. However, even if one does not have direct access to all confounders, there may exist noisy and uncertain measurements of proxies for confounders. We build on recent advances in latent variable modeling to simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Our method is based on Variational Autoencoders (VAE) which follow the causal structure of inference with proxies. We show our method is significantly more robust than existing methods, and matches the state-of-the-art on previous benchmarks focused on individual treatment effects.

1 Introduction

Understanding the causal effect of an intervention t on an individual with features X is a fundamental problem across many domains. Examples include understanding the effect of medications on a patient's health, or of teaching methods on a student's chance of graduation. With the availability of large datasets in domains such as healthcare and education, there is much interest in developing methods for learning individual-level causal effects from observational data [42, 53, 25, 43].
The most crucial aspect of inferring causal relationships from observational data is confounding. A variable which affects both the intervention and the outcome is known as a confounder of the effect of the intervention on the outcome. On the one hand, if such a confounder can be measured, the standard way to account for its effect is by "controlling" for it, often through covariate adjustment or propensity score re-weighting [39]. On the other hand, if a confounder is hidden or unmeasured, it is impossible in the general case (i.e.
without further assumptions) to estimate the effect of the intervention on the outcome [40]. For example, socio-economic status can affect both the medication a patient has access to, and the patient's general health. Therefore socio-economic status acts as a confounder between the medication and health outcomes, and without measuring it we cannot in general isolate the causal effect of medications on health measures. Henceforth we will denote observed potential confounders2 by X, and unobserved confounders by Z.

*Canadian Institute For Advanced Research

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Example of a proxy variable. t is a treatment, e.g. medication; y is an outcome, e.g. mortality. Z is an unobserved confounder, e.g. socio-economic status; and X consists of noisy views of the hidden confounder Z, say income in the last year and place of residence.

In most real-world observational studies we cannot hope to measure all possible confounders. For example, in many studies we cannot measure variables such as personal preferences or most genetic and environmental factors. An extremely common practice in these cases is to rely on so-called "proxy variables" [38, 6, 36, Ch. 11]. For example, we cannot measure the socio-economic status of patients directly, but we might be able to get a proxy for it by knowing their zip code and job type. One of the promises of using big data for causal inference is the existence of myriad proxy variables for unmeasured confounders.
How should one use these proxy variables? The answer depends on the relationship between the hidden confounders, their proxies, the intervention and outcome [31, 37]. Consider for example the causal graph in Figure 1: it is well known [20, 15, 18, 31, 41] that it is often incorrect to treat the proxies X as if they are ordinary confounders, as this would induce bias.
See the Appendix for a simple example of this phenomenon. The aforementioned papers give methods which are guaranteed to recover the true causal effect when proxies are observed. However, the strong guarantees these methods enjoy rely on strong assumptions. In particular, it is assumed that the hidden confounder is either categorical with a known number of categories, or that the model is linear-Gaussian.
In practice, we cannot know the exact nature of the hidden confounder Z: whether it is categorical or continuous, or if categorical how many categories it includes. Consider socio-economic status (SES) and health. Should we conceive of SES as a continuous or ordinal variable? Perhaps SES as a confounder is comprised of two dimensions, the economic one (related to wealth and income) and the social one (related to education and cultural capital). Z might even be a mix of continuous and categorical, or be high-dimensional itself. This uncertainty makes causal inference a very hard problem even with proxies available. We propose an alternative approach to causal effect inference tailored to the surrogate-rich setting where many proxies are available: estimation of a latent-variable model where we simultaneously discover the hidden confounders and infer how they affect treatment and outcome. Specifically, we focus on (approximate) maximum-likelihood based methods.
Although in many cases learning latent-variable models is computationally intractable [50, 7], the machine learning community has made significant progress in the past few years developing computationally efficient algorithms for latent-variable modeling. These include methods with provable guarantees, typically based on the method-of-moments (e.g. Anandkumar et al.
[4]); as well as robust, fast heuristics such as variational autoencoders (VAEs) [27, 46], based on stochastic optimization of a variational lower bound on the likelihood, using so-called recognition networks for approximate inference.
Our paper builds upon VAEs. This has the disadvantage that little theory is currently available to justify when learning with VAEs can identify the true model. However, they have the significant advantage that they make substantially weaker assumptions about the data generating process and the structure of the hidden confounders. Since their recent introduction, VAEs have been shown to be remarkably successful in capturing latent structure across a wide range of previously difficult problems, such as modeling images [19], volumes [24], time-series [10] and fairness [34].

2Including observed covariates which do not affect the intervention or outcome, and therefore are not truly confounders.

We show that in the presence of noisy proxies, our method is more robust against hidden confounding, in experiments where we successively add noise to known confounders. Towards that end we introduce a new causal inference benchmark using data about twin births and mortalities in the USA. We further show that our method is competitive on two existing causal inference benchmarks. Finally, we note that our method does not currently deal with the related problem of selection bias, and we leave this to future work.
Related work. Proxy variables and the challenges of using them correctly have long been considered in the causal inference literature [54, 14]. Understanding the best way to derive and measure possible proxy variables is an important part of many observational studies [13, 29, 55]. Recent work by Cai and Kuroki [9], Greenland and Lash [18], building on the work of Greenland and Kleinbaum [17], Selén [47], has studied conditions for causal identifiability using proxy variables.
The general idea is that in many cases one should first attempt to infer the joint distribution p(X, Z) between the proxy and the hidden confounders, and then use that knowledge to adjust for the hidden confounders [55, 41, 32, 37, 12]. For the example in Figure 1, Cai and Kuroki [9], Greenland and Lash [18], Pearl [41] show that if Z and X are categorical, with X having at least as many categories as Z, and with the matrix p(X, Z) being full-rank, one could identify the causal effect of t on y using a simple matrix inversion formula, an approach called "effect restoration". Conditions under which one could identify more general and complicated proxy models were recently given by [37].

2 Identification of causal effect

Throughout this paper we assume the causal model in Figure 1. For simplicity and compatibility with prior benchmarks we assume that the treatment t is binary, but our proposed method does not rely on that. We further assume that the joint distribution p(Z, X, t, y) of the latent confounders Z and the observed confounders X can be approximately recovered solely from the observations (X, t, y). While this is impossible if the hidden confounder has no relation to the observed variables, there are many cases where this is possible, as mentioned in the introduction. For example, if X includes three independent views of Z [4, 22, 16, 2]; if Z is categorical and X is a Gaussian mixture model with components determined by Z [5]; or if Z is comprised of binary variables and X are so-called "noisy-or" functions of Z [23, 8].
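The matrix-inversion idea behind "effect restoration" can be illustrated with a small worked example. The following is a minimal numpy sketch, not the exact procedure of the papers cited above: it assumes a binary hidden confounder Z and a binary proxy X whose noise matrix p(X | Z) is known exactly (in practice it must be estimated), and all probability tables are made-up numbers.

```python
import numpy as np

# Hypothetical ground truth: binary Z, binary proxy X with KNOWN, full-rank
# noise matrix M[x, z] = p(x | z).
p_z = np.array([0.6, 0.4])                 # p(z)
M = np.array([[0.8, 0.3],                  # p(x | z), columns indexed by z
              [0.2, 0.7]])
p_t_given_z = np.array([0.2, 0.7])         # p(t=1 | z)
p_y_given_tz = np.array([[0.3, 0.6],       # p(y=1 | t, z), rows t=0 and t=1
                         [0.5, 0.9]])

# Observed joint p(x, t, y) implied by the model, with Z marginalized out.
p_xty = np.zeros((2, 2, 2))
for z in range(2):
    for t in range(2):
        pt = p_t_given_z[z] if t == 1 else 1 - p_t_given_z[z]
        for y in range(2):
            py = p_y_given_tz[t, z] if y == 1 else 1 - p_y_given_tz[t, z]
            p_xty[:, t, y] += p_z[z] * M[:, z] * pt * py

# Effect restoration: invert the proxy noise to recover p(z, t, y).
p_zty = np.einsum('zx,xty->zty', np.linalg.inv(M), p_xty)

# Adjust on the restored Z: ATE = sum_z p(z) (p(y=1|t=1,z) - p(y=1|t=0,z)).
p_z_rec = p_zty.sum(axis=(1, 2))
p_y1_rec = p_zty[:, :, 1] / p_zty.sum(axis=2)       # recovered p(y=1 | t, z), [z, t]
ate_restored = float(np.dot(p_z_rec, p_y1_rec[:, 1] - p_y1_rec[:, 0]))
ate_true = float(np.dot(p_z, p_y_given_tz[1] - p_y_given_tz[0]))
```

Because M is invertible, the restored joint matches the true one up to floating-point error, and the adjusted effect equals the true causal effect; with a singular or misspecified M the inversion step fails, which is exactly the full-rank condition in the text.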
Recent results show that certain VAEs can recover a very large class of latent-variable models [51] as a minimizer of an optimization problem; the caveat is that the optimization process is not guaranteed to achieve the true minimum even if it is within the capacity of the model, similar to the case of classic universal approximation results for neural networks.

2.1 Identifying individual treatment effect

Our goal in this paper is to recover the individual treatment effect (ITE), also known as the conditional average treatment effect (CATE), of a treatment t, as well as the average treatment effect (ATE):

ITE(x) := E[y | X = x, do(t = 1)] - E[y | X = x, do(t = 0)],    ATE := E[ITE(x)].

Identification in our case is an immediate result of Pearl's back-door adjustment formula [40]:
Theorem 1. If we recover p(Z, X, t, y) then we recover the ITE under the causal model in Figure 1.
Proof. We will prove that p(y|X, do(t = 1)) is identifiable under the premise of the theorem. The case for t = 0 is identical, and the expectations in the definition of ITE above are readily recovered from the probability function. ATE is identified if ITE is identified. We have that:

p(y|X, do(t = 1)) = ∫_Z p(y|X, do(t = 1), Z) p(Z|X, do(t = 1)) dZ  (i)=  ∫_Z p(y|X, t = 1, Z) p(Z|X) dZ,    (1)

where equality (i) is by the rules of do-calculus applied to the causal graph in Figure 1 [40]. This completes the proof, since the quantities in the final expression of Eq. (1) can be identified from the distribution p(Z, X, t, y) which we know by the Theorem's premise.

Note that the proof and the resulting estimator in Eq. (1) would be identical whether or not there is an edge from X to t. This is because we intervene on t.
Also note that for the model in Figure 1, y is independent of X given Z, and we obtain: p(y|X, do(t = 1)) = ∫_Z p(y|t = 1, Z) p(Z|X) dZ. In the next section we will show how we estimate p(Z, X, t, y) from observations of (X, t, y).

3 Causal effect variational autoencoder

Figure 2: Overall architecture of the model and inference networks for the Causal Effect Variational Autoencoder (CEVAE). (a) Inference network, q(z, t, y|x). (b) Model network, p(x, z, t, y). White nodes correspond to parametrized deterministic neural network transitions, gray nodes correspond to drawing samples from the respective distribution and white circles correspond to switching paths according to the treatment t.

The approach we take in this paper to the problem of learning the latent variable causal model is by using variational autoencoders [27, 46] to infer the complex non-linear relationships between X and (Z, t, y) and approximately recover p(Z, X, t, y). Recent work has dramatically increased the range and type of distributions which can be captured by VAEs [51, 45, 28]. The drawback of these methods is that because of the difficulty of guaranteeing global optima of neural net optimization, one cannot ensure that any given instance will find the true model even if it is within the model class. We believe this drawback is offset by the strong empirical performance of deep neural networks in general, and VAEs in particular, across many domains. Specifically, we propose to parametrize the causal graph of Figure 1 as a latent variable model with neural net functions connecting the variables of interest.
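For a discrete confounder, the adjustment of Eq. (1) reduces to a finite sum, p(y | x, do(t)) = Σ_z p(y | t, z) p(z | x). A minimal sketch with hypothetical probability tables (in CEVAE these quantities are outputs of learned networks, not fixed tables):

```python
import numpy as np

# Hypothetical tables for a binary confounder z and binary covariate x.
p_y1_given_tz = np.array([[0.2, 0.7],   # p(y=1 | t=0, z)
                          [0.4, 0.8]])  # p(y=1 | t=1, z)
p_z_given_x = np.array([[0.9, 0.1],     # p(z | x=0)
                        [0.3, 0.7]])    # p(z | x=1)

def ite(x):
    """ITE(x) = E[y | X=x, do(t=1)] - E[y | X=x, do(t=0)] via the adjustment sum."""
    p_do = p_y1_given_tz @ p_z_given_x[x]   # [E[y|x,do(t=0)], E[y|x,do(t=1)]]
    return float(p_do[1] - p_do[0])

ate = 0.5 * ite(0) + 0.5 * ite(1)           # ATE = E[ITE(x)], here with p(x) uniform
```

The ITE differs across x exactly because p(z | x) differs, which is what makes recovering p(Z | X) the crux of the problem.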
The flexible non-linear nature of neural nets will hopefully allow us to approximate well the true interactions between the treatment and its effect.
Our design choices are mostly typical for VAEs: we assume the observations factorize conditioned on the latent variables, and use an inference network [27, 46] which follows a factorization of the true posterior. For the generative model we use an architecture inspired by TARnet [48], but instead of conditioning on observations we condition on the latent variables z; see details below. In the following, xi corresponds to an input datapoint (e.g. the feature vector of a given subject), ti corresponds to the treatment assignment, yi to the outcome of the particular treatment, and zi corresponds to the latent hidden confounder. Each of the corresponding factors is described as:

p(zi) = ∏_{j=1}^{Dz} N(zij | 0, 1);    p(xi|zi) = ∏_{j=1}^{Dx} p(xij|zi);    p(ti|zi) = Bern(σ(f1(zi))),    (2)

with p(xij|zi) being an appropriate probability distribution for the covariate j and σ(·) being the logistic function, Dx the dimension of x and Dz the dimension of z. For a continuous outcome we parametrize the probability distribution as a Gaussian with its mean given by a TARnet [48] architecture, i.e. a treatment-specific function, and its variance fixed to v̂, whereas for a discrete outcome we use a Bernoulli distribution similarly parametrized by a TARnet:

p(yi|ti, zi) = N(µ = µ̂i, σ² = v̂),    µ̂i = ti f2(zi) + (1 - ti) f3(zi)    (3)
p(yi|ti, zi) = Bern(π = π̂i),    π̂i = σ(ti f2(zi) + (1 - ti) f3(zi)).    (4)

Note that each of the fk(·) is a neural network parametrized by its own parameters θk for k = 1, 2, 3. As we do not a-priori know the hidden confounder z we have to marginalize over it in order to learn the parameters of the model θk. Since the non-linear neural network functions make inference intractable we will employ variational inference along with inference networks; these are neural networks that output the parameters of a fixed-form posterior approximation over the latent variables z, e.g. a Gaussian, given the observed variables. By the definition of the model in Figure 1 we can see that the true posterior over Z depends on X, t and y. Therefore we employ the following posterior approximation:

q(zi|xi, ti, yi) = ∏_{j=1}^{Dz} N(µj = µ̄ij, σ²j = σ̄²ij),
µ̄i = ti µ_{t=1,i} + (1 - ti) µ_{t=0,i},    σ̄²i = ti σ²_{t=1,i} + (1 - ti) σ²_{t=0,i},    (5)
µ_{t=0,i}, σ²_{t=0,i} = g2 ∘ g1(xi, yi),    µ_{t=1,i}, σ²_{t=1,i} = g3 ∘ g1(xi, yi),

where we similarly use a TARnet [48] architecture for the inference network, i.e. split it for each treatment group in t after a shared representation g1(xi, yi), and each gk(·) is a neural network with variational parameters φk. We can now form a single objective for the inference and model networks, the variational lower bound of this graphical model [27, 46]:

L = Σ_{i=1}^{N} E_{q(zi|xi,ti,yi)}[log p(xi, ti|zi) + log p(yi|ti, zi) + log p(zi) - log q(zi|xi, ti, yi)].    (6)

Notice that for out-of-sample predictions, i.e. new subjects, we need to know the treatment assignment t along with its outcome y before inferring the distribution over z. For this reason we will introduce two auxiliary distributions that will help us predict ti, yi for new samples.
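The generative model of Eqs. (2)-(4) can be sketched by ancestral sampling. Below is a minimal numpy sketch in which tiny random MLPs are hypothetical stand-ins for the learned networks f1, f2, f3 (and a random linear map stands in for p(x|z)):

```python
import numpy as np

rng = np.random.default_rng(0)
Dz, Dx, n = 5, 3, 1000

def mlp(W1, W2):
    # One-hidden-layer network with scalar output; weights are arbitrary here.
    return lambda z: np.tanh(z @ W1) @ W2

f1 = mlp(rng.normal(size=(Dz, 8)), rng.normal(size=(8,)))   # treatment logits
f2 = mlp(rng.normal(size=(Dz, 8)), rng.normal(size=(8,)))   # outcome head for t=1
f3 = mlp(rng.normal(size=(Dz, 8)), rng.normal(size=(8,)))   # outcome head for t=0

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

z = rng.normal(size=(n, Dz))                        # p(z)   = N(0, I), Eq. (2)
x = rng.normal(loc=z @ rng.normal(size=(Dz, Dx)))   # p(x|z): Gaussian covariates
t = rng.binomial(1, sigmoid(f1(z)))                 # p(t|z) = Bern(sigma(f1(z)))
mu = t * f2(z) + (1 - t) * f3(z)                    # TARnet-style split heads, Eq. (3)
y = rng.normal(loc=mu, scale=1.0)                   # continuous outcome, variance fixed
```

The key structural point mirrored here is that x, t and y are all generated from z alone, so conditioning on a recovered z renders the treatment unconfounded.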
More specifically, we will employ the following distributions for the treatment assignment t and outcomes y:

q(ti|xi) = Bern(π = σ(g4(xi)))    (7)
q(yi|xi, ti) = N(µ = µ̄i, σ² = v̄),    µ̄i = ti (g6 ∘ g5(xi)) + (1 - ti)(g7 ∘ g5(xi))    (8)
q(yi|xi, ti) = Bern(π = π̄i),    π̄i = ti (g6 ∘ g5(xi)) + (1 - ti)(g7 ∘ g5(xi)),    (9)

where we choose Eq. (8) for continuous and Eq. (9) for discrete outcomes. To estimate the parameters of these auxiliary distributions we add two extra terms to the variational lower bound:

F_CEVAE = L + Σ_{i=1}^{N} ( log q(ti = t*_i | x*_i) + log q(yi = y*_i | x*_i, t*_i) ),    (10)

with x*_i, t*_i, y*_i being the observed values of the input, treatment and outcome random variables in the training set. We coin the name Causal Effect Variational Autoencoder (CEVAE) for our method.

4 Experiments

Evaluating causal inference methods is always challenging because we usually lack ground truth for the causal effects. Common evaluation approaches include creating synthetic or semi-synthetic datasets, where real data is modified in a way that allows us to know the true causal effect, or real-world data where a randomized experiment was conducted. Here we compare on two existing benchmark datasets where there is no need to model proxies, IHDP [21] and Jobs [33], often used for evaluating individual-level causal inference. In order to specifically explore the role of proxy variables, we create a synthetic toy dataset, and introduce a new benchmark based on data of twin births and deaths in the USA.
For the implementation of our model we used TensorFlow [1] and Edward [52].
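For intuition, the objective F_CEVAE of Eq. (10) (the bound of Eq. (6) plus the two auxiliary terms) can be written out for a toy one-dimensional model. In this sketch every density is a hand-fixed stand-in for a learned network, and a single reparametrized sample approximates the expectation over q(z | x, t, y):

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal(v, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def log_bern(v, p):
    return v * np.log(p) + (1 - v) * np.log(1 - p)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def objective(x, t, y):
    # q(z | x, t, y): parameters would come from the inference network;
    # a fixed affine rule stands in for it here.
    mu_q, var_q = 0.5 * x + 0.2 * y, 0.5
    z = mu_q + np.sqrt(var_q) * rng.normal()              # reparametrized sample
    elbo = (log_normal(x, z, 1.0)                         # log p(x | z)
            + log_bern(t, sigmoid(z))                     # log p(t | z)
            + log_normal(y, t * z - (1 - t) * z, 1.0)     # log p(y | t, z), split heads
            + log_normal(z, 0.0, 1.0)                     # log p(z)
            - log_normal(z, mu_q, var_q))                 # - log q(z | x, t, y)
    aux = (log_bern(t, sigmoid(0.8 * x))                  # log q(t | x)
           + log_normal(y, 0.3 * x * (2 * t - 1), 1.0))   # log q(y | x, t)
    return elbo + aux                                     # single-sample F_CEVAE term

val = objective(x=0.7, t=1, y=1.2)
```

Training maximizes the sum of such terms over the dataset with respect to all network parameters; the auxiliary piece is what lets the model score t and y for a new subject from x alone.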
For the neural network architecture choices we closely followed [48]; unless otherwise specified we used 3 hidden layers with ELU [11] nonlinearities for the approximate posterior over the latent variables q(Z|X, t, y), the generative model p(X|Z) and the outcome models p(y|t, Z), q(y|t, X). For the treatment models p(t|Z), q(t|X) we used a single hidden layer neural network with ELU nonlinearities. Unless mentioned otherwise, we used a 20-dimensional latent variable z and used a small weight decay term for all of the parameters with λ = .0001. Optimization was done with Adamax [26] and a learning rate of 0.01, which was annealed with an exponential decay schedule. We further performed early stopping according to the lower bound on a validation set. To compute the outcomes p(y|X, do(t = 1)) and p(y|X, do(t = 0)) we averaged over 100 samples from the approximate posterior q(Z|X) = Σ_t ∫ q(Z|t, y, X) q(y|t, X) q(t|X) dy.
Throughout this section we compare with several baseline methods. LR1 is logistic regression, LR2 is two separate logistic regressions fit to treated (t = 1) and control (t = 0). TARnet is a feed-forward neural network architecture for causal inference [48].

4.1 Benchmark datasets

For the first benchmark task we consider estimating the individual and population causal effects on a benchmark dataset introduced by [21]; it is constructed from data obtained from the Infant Health and Development Program (IHDP). Briefly, the confounders x correspond to collected measurements of the children and their mothers used during a randomized experiment that studied the effect of home visits by specialists on future cognitive test scores. The treatment assignment is then "de-randomized" by removing from the treated set children with non-white mothers; for each unit a treated and a control outcome are then simulated, thus allowing us to know the "true" individual causal effects of the treatment. We follow [25, 48] and use 1000 replications of the simulated outcome, along with the same train/validation/testing splits. To measure the accuracy of the individual treatment effect estimation we use the Precision in Estimation of Heterogeneous Effect (PEHE) [21], PEHE = (1/N) Σ_{i=1}^{N} ((yi1 - yi0) - (ŷi1 - ŷi0))², where y1, y0 correspond to the true outcomes under t = 1 and t = 0, respectively, and ŷ1, ŷ0 correspond to the outcomes estimated by our model. For the population causal effect we report the absolute error on the Average Treatment Effect (ATE). The results can be seen in Table 1. As we can see, CEVAE has decent performance, comparable to the Balancing Neural Network (BNN) of [25].

Table 1: Within-sample and out-of-sample mean and standard errors for the metrics for the various models on the IHDP dataset.

Method  √PEHE within-s.  ATE err. within-s.  √PEHE out-of-s.  ATE err. out-of-s.
OLS-1   5.8±.3   .73±.04   5.8±.3   .94±.06
OLS-2   2.4±.1   .14±.01   2.5±.1   .31±.02
BLR     5.8±.3   .72±.04   5.8±.3   .93±.05
k-NN    2.1±.1   .14±.01   4.1±.2   .79±.05
TMLE    5.0±.2   .30±.01   -        -
BART    2.1±.1   .23±.01   2.3±.1   .34±.02
RF      4.2±.2   .73±.05   6.6±.3   .96±.06
CF      3.8±.2   .18±.01   3.8±.2   .40±.03
BNN     2.2±.1   .37±.03   2.1±.1   .42±.03
CFRW    .71±.0   .25±.01   .76±.0   .27±.01
CEVAE   2.7±.1   .34±.01   2.6±.1   .46±.02

Table 2: Within-sample and out-of-sample policy risk and error on the average treatment effect on the treated (ATT) for the various models on the Jobs dataset.

Method  R_pol within-s.  ATT err. within-s.  R_pol out-of-s.  ATT err. out-of-s.
LR-1    .22±.0   .01±.00   .23±.0   .08±.04
LR-2    .21±.0   .01±.01   .24±.0   .08±.03
BLR     .22±.0   .01±.01   .25±.0   .08±.03
k-NN    .02±.0   .21±.01   .26±.0   .13±.05
TMLE    .22±.0   .02±.01   -        -
BART    .23±.0   .02±.00   .25±.0   .08±.03
RF      .23±.0   .03±.01   .28±.0   .09±.04
CF      .19±.0   .03±.01   .20±.0   .07±.03
BNN     .20±.0   .04±.01   .24±.0   .09±.04
CFRW    .17±.0   .04±.01   .21±.0   .09±.03
CEVAE   .15±.0   .02±.01   .26±.0   .03±.01

For the second benchmark we consider the task described in [48] and follow their procedure closely. It uses a dataset obtained from the study of [33, 49], which concerns the effect of job training (treatment) on employment after training (outcome). Because part of the dataset comes from a randomized control trial we can estimate the "true" causal effect. Following [48] we report the absolute error on the Average Treatment effect on the Treated (ATT), which is E[ITE(X) | t = 1]. For the individual causal effect we use the policy risk, which acts as a proxy to the individual treatment effect. The results after averaging over 10 train/validation/test splits can be seen in Table 2. As we can observe, CEVAE is competitive with the state of the art, while overall achieving the best estimate on the out-of-sample ATT.

4.2 Synthetic experiment on toy data

To illustrate that our model better handles hidden confounders we experiment on a toy simulated dataset where the marginal distribution of X is a mixture of Gaussians, with the hidden variable Z determining the mixture component.
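This mixture-of-Gaussians construction can be simulated directly. A minimal sketch following the process of Eq. (11) below (constants σ_z0 = 3, σ_z1 = 5; the seed and sample size are arbitrary), which also computes the true ATE in closed form and the naive difference-of-means estimate for contrast:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30000
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

z = rng.binomial(1, 0.5, size=n)                          # hidden confounder
x = rng.normal(loc=z, scale=np.where(z == 1, 5.0, 3.0))   # proxy: Gaussian mixture
t = rng.binomial(1, 0.75 * z + 0.25 * (1 - z))            # confounded treatment
y = rng.binomial(1, sigmoid(3 * (z + 2 * (2 * t - 1))))   # outcome

# True ATE: average over z of p(y=1 | t=1, z) - p(y=1 | t=0, z).
ate_true = np.mean([sigmoid(3 * (zv + 2)) - sigmoid(3 * (zv - 2)) for zv in (0, 1)])

# Naive difference of means, which ignores the confounding through z.
naive = y[t == 1].mean() - y[t == 0].mean()
```

Because z raises both the treatment probability and the outcome probability, the naive estimate is biased relative to the true ATE, which is what the methods compared in Figure 3 must overcome.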
We generate synthetic data by the following process:

zi ∼ Bern(0.5);
xi|zi ∼ N(zi, σ²_{z1} zi + σ²_{z0}(1 - zi));
ti|zi ∼ Bern(0.75 zi + 0.25(1 - zi));
yi|ti, zi ∼ Bern(Sigmoid(3(zi + 2(2ti - 1)))),    (11)

where σ_{z0} = 3, σ_{z1} = 5 and Sigmoid is the logistic sigmoid function. This generation process introduces hidden confounding between t and y, as t and y both depend on the mixture assignment z for x. Since there is significant overlap between the two Gaussian mixture components we expect that methods which do not model the hidden confounder z will not produce accurate estimates for the treatment effects. We experiment with both a binary z for CEVAE, which is close to the true model, as well as a 5-dimensional continuous z in order to investigate the robustness of CEVAE w.r.t. model misspecification. We evaluate across sample sizes N ∈ {1000, 3000, 5000, 10000, 30000} and provide the results in Figure 3. We see that no matter how many samples are given, LR1, LR2 and TARnet are not able to improve their error in estimating ATE directly from the proxies. On the other hand, CEVAE achieves significantly lower error. When the latent model is correctly specified (CEVAE bin) we do better even with a small sample size; when it is not (CEVAE cont) we require more samples for the latent space to imitate more closely the true binary latent variable.

Figure 3: Absolute error of estimating ATE on samples from the generative process (11). CEVAE bin and CEVAE cont are CEVAE with respectively binary or continuous 5-dim latent z. See text above for description of the other methods.

4.3 Binary treatment outcome on Twins

We introduce a new benchmark task that utilizes data from twin births in the USA between 1989-1991 [3]3.
The treatment t = 1 is being born the heavier twin, whereas the outcome corresponds to the mortality of each of the twins in their first year of life. Since we have records for both twins, their outcomes could be considered as the two potential outcomes with respect to the treatment of being born heavier. We only chose twins of the same sex. Since the outcome is thankfully quite rare (3.5% first-year mortality), we further focused on twins such that both were born weighing less than 2kg. We thus have a dataset of 11984 pairs of twins. The mortality rate for the lighter twin is 18.9%, and for the heavier 16.4%, for an average treatment effect of -2.5%. For each twin-pair we obtained 46 covariates relating to the parents, the pregnancy and birth: mother and father education, marital status, race and residence; number of previous births; pregnancy risk factors such as diabetes, renal disease, smoking and alcohol use; quality of care during pregnancy; whether the birth was at a hospital, clinic or home; and number of gestation weeks prior to birth.
In this setting, for each twin pair we observed both the case t = 0 (lighter twin) and t = 1 (heavier twin). In order to simulate an observational study, we selectively hide one of the two twins; if we were to choose at random this would be akin to a randomized trial. In order to simulate the case of hidden confounding with proxies, we based the treatment assignment on a single variable which is highly correlated with the outcome: GESTAT10, the number of gestation weeks prior to birth. It is ordinal with values from 0 to 9 indicating birth before 20 weeks gestation, birth after 20-27 weeks of gestation and so on4. We then set ti|xi, zi ∼ Bern(σ(w_o⊤ x + w_h(z/10 - 0.1))), w_o ∼ N(0, 0.1 · I), w_h ∼ N(5, 0.1), where z is GESTAT10 and x are the 45 other features.
We created proxies for the hidden confounder as follows: We coded the 10 GESTAT10 categories with one-hot encoding, replicated 3 times. We then randomly and independently flipped each of these 30 bits. We varied the probabilities of flipping from 0.05 to 0.5, the latter indicating there is no direct information about the confounder. We chose three replications following the well-known result that three independent views of a latent feature are what is needed to guarantee that it can be recovered [30, 2, 5]. We note that there might still be proxies for the confounder in the other variables, such as the incompetent cervix covariate which is a known risk factor for early birth. Having created the dataset, we focus our attention on two tasks: inferring the mortality of the unobserved twin (counterfactual), and inferring the average treatment effect. We compare with TARnet, LR1 and LR2. We vary the number of hidden layers for TARnet and CEVAE (nh in the figures). We note that while TARnet with 0 hidden layers is equivalent to LR2, CEVAE with 0 hidden layers still infers a latent space and is thus different. The results are given respectively in Figures 4(a) (higher is better) and 4(b) (lower is better).
For the counterfactual task, we see that for small proxy noise all methods perform similarly. This is probably due to the gestation length feature being very informative; for LR1, the noisy codings of this feature form 6 of the top 10 most predictive features for mortality, the others being sex (males are more at risk), and 3 risk factors: incompetent cervix, mother lung disease, and abnormal amniotic fluid.

3Data taken from the denominator file at http://www.nber.org/data/linked-birth-infant-death-data-vital-statistics-data.html
4The partition is given in the original dataset from NBER.
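The proxy construction described above (one-hot GESTAT10, three replications, independent bit flips) can be sketched as follows; the function name and the random category draws are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_proxies(gestat10, noise):
    """One-hot encode the 10 GESTAT10 categories, replicate 3 times,
    then flip each of the 30 bits independently with probability `noise`."""
    onehot = np.eye(10, dtype=int)[gestat10]           # (n, 10)
    proxies = np.tile(onehot, 3)                       # three replications, (n, 30)
    flips = rng.binomial(1, noise, size=proxies.shape)
    return proxies ^ flips                             # XOR flips the selected bits

z = rng.integers(0, 10, size=5)                        # hypothetical GESTAT10 values
p = make_proxies(z, noise=0.2)
```

At noise = 0.5 each bit is an unbiased coin regardless of z, matching the statement that the proxies then carry no direct information about the confounder.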
For higher noise, TARnet, LR1 and LR2 see roughly similar degradation in performance; CEVAE, on the other hand, is much more robust to increasing proxy noise because of its ability to infer a cleaner latent state from the noisy proxies. Of particular interest is CEVAE nh = 0, which does much better for counterfactual inference than the equivalent LR2, probably because LR2 is forced to rely directly on the noisy proxies instead of the inferred latent state. For inference of the average treatment effect, we see that at the low noise levels CEVAE does slightly worse than the other methods, with CEVAE nh = 0 doing noticeably worse. However, similar to the counterfactual case, CEVAE is significantly more robust to proxy noise, achieving quite a low error even when the direct proxies are completely useless at noise level 0.5.

Figure 4: Results on the Twins dataset. (a) Area under the curve (AUC) for predicting the mortality of the unobserved twin in a hidden confounding experiment; higher is better. (b) Absolute error of the ATE estimate; lower is better. Dashed black line indicates the error of using the naive ATE estimator: the difference between the average treated and average control outcomes. LR1 is logistic regression, LR2 is two separate logistic regressions fit on the treated and control. "nh" is the number of hidden layers used. TARnet with nh = 0 is identical to LR2 and not shown, whereas CEVAE with nh = 0 has a latent space component.

5 Conclusion

In this paper we draw a connection between causal inference with proxy variables and the ground-breaking work in the machine learning community on latent variable models.
Since almost all observational studies rely on proxy variables, this connection is highly relevant.

We introduce a model which is the first attempt at tying these two ideas together: the Causal Effect Variational Autoencoder (CEVAE), a neural network latent variable model used for estimating individual and population causal effects. In extensive experiments we showed that it is competitive with the state-of-the-art on benchmark datasets, and more robust to hidden confounding, both on a toy artificial dataset and on modifications of real datasets, such as the newly introduced Twins dataset. For future work, we plan to employ the expanding set of tools available for latent variable models (e.g. Kingma et al. [28], Tran et al. [51], Maaløe et al. [35], Ranganath et al. [44]), as well as to further explore connections between method-of-moments approaches such as Anandkumar et al. [5] and the methods for effect restoration given by Kuroki and Pearl [32] and Miao et al. [37].

Acknowledgements

We would like to thank Fredrik D. Johansson for valuable discussions, feedback, and for providing the data for IHDP and Jobs. We would also like to thank Maggie Makar for helping with the Twins dataset. Christos Louizos and Max Welling were supported by TNO, NWO and Google. Joris Mooij was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement 639466).

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems.
arXiv preprint arXiv:1603.04467, 2016.

[2] E. S. Allman, C. Matias, and J. A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, pages 3099–3132, 2009.

[3] D. Almond, K. Y. Chay, and D. S. Lee. The costs of low birth weight. The Quarterly Journal of Economics, 120(3):1031–1083, 2005.

[4] A. Anandkumar, D. J. Hsu, and S. M. Kakade. A method of moments for mixture models and hidden Markov models. In COLT, volume 1, page 4, 2012.

[5] A. Anandkumar, R. Ge, D. J. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[6] J. D. Angrist and J.-S. Pischke. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press, 2008.

[7] S. Arora and R. Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, pages 69–92, 2005.

[8] S. Arora, R. Ge, T. Ma, and A. Risteski. Provable learning of noisy-or networks. CoRR, abs/1612.08795, 2016. URL http://arxiv.org/abs/1612.08795.

[9] Z. Cai and M. Kuroki. On identifying total effects in the presence of latent variables and selection bias. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 62–69. AUAI Press, 2008.

[10] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.

[11] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

[12] J. K. Edwards, S. R. Cole, and D. Westreich. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework.
International Journal of Epidemiology, 44(4):1452, 2015.

[13] D. Filmer and L. H. Pritchett. Estimating wealth effects without expenditure data—or tears: an application to educational enrollments in states of India. Demography, 38(1):115–132, 2001.

[14] P. A. Frost. Proxy variables and specification bias. The Review of Economics and Statistics, pages 323–325, 1979.

[15] W. Fuller. Measurement Error Models. Wiley Series in Probability and Mathematical Statistics, 1987.

[16] L. A. Goodman. Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika, 61(2):215–231, 1974.

[17] S. Greenland and D. G. Kleinbaum. Correcting for misclassification in two-way tables and matched-pair studies. International Journal of Epidemiology, 12(1):93–97, 1983.

[18] S. Greenland and T. Lash. Bias analysis. In Modern Epidemiology, 3rd ed., pages 345–380. Lippincott Williams and Wilkins, 2008.

[19] K. Gregor, I. Danihelka, A. Graves, D. Jimenez Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. ArXiv e-prints, Feb. 2015.

[20] Z. Griliches and J. A. Hausman. Errors in variables in panel data. Journal of Econometrics, 31(1):93–118, 1986.

[21] J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240, 2011.

[22] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5):1460–1480, 2012.

[23] Y. Jernite, Y. Halpern, and D. Sontag. Discovering hidden variables in noisy-or networks using quartet tests. In Advances in Neural Information Processing Systems, pages 2355–2363, 2013.

[24] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images.
ArXiv e-prints, July 2016.

[25] F. D. Johansson, U. Shalit, and D. Sontag. Learning representations for counterfactual inference. International Conference on Machine Learning (ICML), 2016.

[26] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), San Diego, 2015.

[27] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

[28] D. P. Kingma, T. Salimans, and M. Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

[29] S. Kolenikov and G. Angeles. Socioeconomic status measurement with discrete proxy variables: Is principal component analysis a reliable answer? Review of Income and Wealth, 55(1):128–165, 2009.

[30] J. B. Kruskal. More factors than subjects, tests and treatments: an indeterminacy theorem for canonical decomposition and individual differences scaling. Psychometrika, 41(3):281–293, 1976.

[31] M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Technical report, DTIC Document, 2011.

[32] M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423, 2014.

[33] R. J. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.

[34] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel. The variational fair autoencoder. International Conference on Learning Representations (ICLR), 2016.

[35] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.

[36] G. S. Maddala and K. Lahiri. Introduction to Econometrics, volume 2. Macmillan, New York, 1992.

[37] W. Miao, Z. Geng, and E. Tchetgen Tchetgen.
Identifying causal effects with proxy variables of an unmeasured confounder. arXiv preprint arXiv:1609.08816, 2016.

[38] M. R. Montgomery, M. Gragnolati, K. A. Burke, and E. Paredes. Measuring living standards with proxy variables. Demography, 37(2):155–174, 2000.

[39] S. L. Morgan and C. Winship. Counterfactuals and Causal Inference. Cambridge University Press, 2014.

[40] J. Pearl. Causality. Cambridge University Press, 2009.

[41] J. Pearl. On measurement bias in causal inference. arXiv preprint arXiv:1203.3504, 2012.

[42] J. Pearl. Detecting latent heterogeneity. Sociological Methods & Research, page 0049124115600597, 2015.

[43] A. Peysakhovich and A. Lada. Combining observational and experimental data to find heterogeneous treatment effects. arXiv preprint arXiv:1611.02385, 2016.

[44] R. Ranganath, D. Tran, J. Altosaar, and D. Blei. Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504, 2016.

[45] D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

[46] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 1278–1286, 2014.

[47] J. Selén. Adjusting for errors in classification and measurement in the analysis of partly and purely categorical data. Journal of the American Statistical Association, 81(393):75–81, 1986.

[48] U. Shalit, F. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. ArXiv e-prints, June 2016.

[49] J. A. Smith and P. E. Todd. Does matching overcome LaLonde's critique of nonexperimental estimators? Journal of Econometrics, 125(1):305–353, 2005.

[50] B. Thiesson, C. Meek, D. M.
Chickering, and D. Heckerman. Learning mixtures of DAG models. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pages 504–513. Morgan Kaufmann Publishers Inc., 1998.

[51] D. Tran, R. Ranganath, and D. M. Blei. The variational Gaussian process. International Conference on Learning Representations (ICLR), 2015.

[52] D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

[53] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. arXiv preprint arXiv:1510.04342, 2015.

[54] M. R. Wickens. A note on the use of proxy variables. Econometrica: Journal of the Econometric Society, pages 759–761, 1972.

[55] J. M. Wooldridge. On estimating firm-level production functions using proxy variables to control for unobservables. Economics Letters, 104(3):112–114, 2009.