{"title": "Debiased Bayesian inference for average treatment effects", "book": "Advances in Neural Information Processing Systems", "page_first": 11952, "page_last": 11962, "abstract": "Bayesian approaches have become increasingly popular in causal inference problems due to their conceptual simplicity, excellent performance and in-built uncertainty quantification ('posterior credible sets'). We investigate Bayesian inference for average treatment effects from observational data, which is a challenging problem due to the missing counterfactuals and selection bias. Working in the standard potential outcomes framework, we propose a data-driven modification to an arbitrary (nonparametric) prior based on the propensity score that corrects for the first-order posterior bias, thereby improving performance. We illustrate our method for Gaussian process (GP) priors using (semi-)synthetic data. Our experiments demonstrate significant improvement in both estimation accuracy and uncertainty quantification compared to the unmodified GP, rendering our approach highly competitive with the state-of-the-art.", "full_text": "Debiased Bayesian inference for average treatment effects

Kolyan Ray
Department of Mathematics
King's College London
kolyan.ray@kcl.ac.uk

Botond Szabó
Mathematical Institute
Leiden University
b.t.szabo@math.leidenuniv.nl

Abstract

Bayesian approaches have become increasingly popular in causal inference problems due to their conceptual simplicity, excellent performance and in-built uncertainty quantification ('posterior credible sets'). We investigate Bayesian inference for average treatment effects from observational data, which is a challenging problem due to the missing counterfactuals and selection bias.
Working in the standard potential outcomes framework, we propose a data-driven modification to an arbitrary (nonparametric) prior based on the propensity score that corrects for the first-order posterior bias, thereby improving performance. We illustrate our method for Gaussian process (GP) priors using (semi-)synthetic data. Our experiments demonstrate significant improvement in both estimation accuracy and uncertainty quantification compared to the unmodified GP, rendering our approach highly competitive with the state-of-the-art.

1 Introduction

Inferring the causal effect of a treatment or condition is an important problem in many applications, such as healthcare [11, 17, 38], education [20], economics [16], marketing [6] and survey sampling [13] amongst others. While carefully designed experiments are the gold standard for measuring causal effects, these are often impractical due to ethical, financial or time constraints. For example, when evaluating the effectiveness of a new medicine it may not be ethically feasible to randomly assign a patient to a particular treatment irrespective of their particular circumstances. An alternative is to use observational data which, while typically much easier to obtain, requires careful analysis.

A common framework for causal inference is the potential outcomes setup [19], where every individual possesses two 'potential outcomes' corresponding to the individual's outcomes with and without treatment. For every subject in the observation cohort we thus observe only one of these two outcomes and not the 'missing' counterfactual outcome, without which we cannot observe the true treatment effect.
This problem differs from standard supervised learning in that we must thus account for the missing counterfactuals, which is the well-known missing data problem in causal inference. A further complication is that in practice, particularly in observational studies, individuals are often assigned treatments in a biased manner [36], so that a simple comparison of the two groups may be misleading. A common way to deal with selection bias is to measure features, called confounders, that are believed to influence both the treatment assignment and outcomes. The discrepancy in feature distributions for the treated and control subject groups can be expressed via the propensity score, which is then used to apply a correction to the estimate. Under the assumption of unconfoundedness, namely that the treatment assignment and outcome are conditionally independent given the features, one can then identify the causal effect. Widely used methods include propensity score matching [32, 34, 36] and double robust methods [5, 31, 33].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In recent years, Bayesian methods have become increasingly popular for causal inference due to their excellent performance, for example Gaussian processes [1-4, 11, 23, 38] and BART [13, 15, 17, 18, 20, 35] amongst other priors [6]. Apart from excellent estimation precision, advantages of the Bayesian approach are its conceptual simplicity, ability to incorporate prior knowledge and access to uncertainty quantification via posterior credible sets.

In this work we are interested in Bayesian inference for the (population) average treatment effect (ATE) of a causal intervention, which is relevant when policy makers are interested in evaluating whether to apply a single intervention to the entire population. This may be the case when one no longer observes feature measurements of new individuals outside the dataset.
This problem is an example of estimating a one-dimensional functional (the ATE) of a complex Bayesian model (the full response surface). In such situations, the induced marginal posterior for the functional can often contain a significant bias in its centering, leading to poor estimation and uncertainty quantification [7, 8, 27]. This is indeed the case in our setting, where it is known that a naive choice of prior can yield badly biased inference for the ATE in causal inference/missing data problems [14, 26, 30]. For instance, Gaussian process (GP) priors will typically not be correctly centered, see Figure 1 below. Correcting for this is a delicate issue since even when the prior is perfectly calibrated (i.e. all tuning parameters are set optimally to recover the treatment response surface), the posterior can still induce a large bias in the marginal posterior for the ATE [25].

Our main contribution is to propose a data-driven modification to an arbitrary nonparametric prior based on the estimated propensity score that corrects for the first-order posterior bias for the ATE. By correctly centering the posterior for the ATE, this improves performance for both estimation accuracy and uncertainty quantification. We numerically illustrate our method on simulated and semi-synthetic data using GP priors, where our prior correction corresponds to a simple data-driven alteration to the covariance kernel. Our experiments demonstrate significant improvement in performance from this debiasing. This method should be viewed as a way to increase the efficiency of a given Bayesian prior, selected for modelling or computational reasons, when estimating the ATE.

Our method provides the same benefits for inference on the conditional average treatment effect (CATE). We further show that randomization of the feature distribution is not necessary for accurate uncertainty quantification for the CATE, but is helpful for the ATE.
Since this approach provides similar estimation accuracy irrespective of whether the feature distribution is randomized, this highlights that care must be taken when using finer properties of the posterior, such as uncertainty quantification.

Organization: in Section 2 we present the causal inference problem, in Section 3 our main idea for debiasing an arbitrary Bayesian prior, with the specific case of GPs treated in Section 4. Simulations and further discussion are in Sections 5 and 6, respectively. Additional technical details, some motivation based on semiparametric statistics and further simulation results are in the supplement.

2 Problem setup

Consider the situation where a binary treatment with heterogeneous treatment effects is applied to a population. Working in the potential outcomes setup [19], every individual i possesses a d-dimensional feature X_i ∈ R^d and two 'potential outcomes' Y_i^(1) and Y_i^(0), corresponding to the individual's outcomes with and without treatment, respectively. We wish to make inference on the treatment effect Y_i^(1) − Y_i^(0), but since we only observe one out of each pair of outcomes, and not the corresponding (missing) counterfactual outcome, we do not directly observe samples of the treatment effect. In this paper we are interested in estimating the average treatment effect (ATE) ψ = E[Y^(1) − Y^(0)].

For R_i ∈ {0, 1} the treatment assignment indicator, we observe outcome Y_i^(R_i), which can also be expressed as observing Y_i = R_i Y_i^(1) + (1 − R_i) Y_i^(0). The treatment assignment policy generally depends on the features X_i and is expressed by the conditional probability π(x) = P(R = 1 | X = x), called the propensity score (PS). We assume unconfoundedness, namely Y_i^(1), Y_i^(0) ⊥⊥ R_i | X_i for all X_i ∈ R^d, which is a standard assumption in the potential outcomes framework [19].
Unconfoundedness (or strong ignorability) says that the outcomes Y_i^(1), Y_i^(0) are independent of the treatment assignment R_i given the measured features X_i, i.e. any dependence can be fully explained through X_i. Without such an assumption the ATE is typically not even identifiable [32].

We work in the standard nonparametric regression framework for causal inference with mean-zero additive errors [2, 14, 15, 18, 23],

    Y_i = m(X_i, R_i) + ε_i,    (1)

where ε_i ∼iid N(0, σ_n²), R_i ∈ {0, 1} is the indicator variable for whether treatment is applied and X_i ∈ R^d represents measured feature information about individual i. We assume the general feature information is unbiased, X_i ∼iid F, but the treatment assignment π(x) = P(R = 1 | X = x) may be heavily biased. Our goal is to estimate the average treatment effect (ATE)

    ψ = E[Y^(1) − Y^(0)] = ∫_{R^d} E[Y | R = 1, X = x] − E[Y | R = 0, X = x] dF(x) = ∫_{R^d} m(x, 1) − m(x, 0) dF(x)    (2)

based on an observational dataset D_n consisting of n i.i.d. samples of the triplet (X_i, R_i, Y_i). A related quantity is the conditional average treatment effect (CATE)

    ψ_c = ψ_c(X_1, ..., X_n) = (1/n) Σ_{i=1}^n E[Y_i^(1) − Y_i^(0) | X_i] = (1/n) Σ_{i=1}^n m(X_i, 1) − m(X_i, 0),    (3)

which represents the average treatment effect over the measured individuals.
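As a quick numerical illustration of (2) and (3): the ATE integrates m(x, 1) − m(x, 0) against F, while the CATE averages it over the observed features. A minimal Monte Carlo sketch, with a made-up response surface m and standard Gaussian features (both assumptions for illustration only, not the surfaces used in the experiments), where the population ATE equals 2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical response surface m(x, r) -- an assumption for illustration.
def m(x, r):
    return x[:, 0] + 2.0 * r * (1.0 + x[:, 1])

# Features X_i ~iid F, here taken to be standard Gaussian.
X = rng.normal(size=(10_000, 2))

# CATE (3): empirical average of m(X_i, 1) - m(X_i, 0) over observed features.
cate = np.mean(m(X, 1) - m(X, 0))

# ATE (2): the integral of m(x, 1) - m(x, 0) dF(x); here it equals
# E[2(1 + X_2)] = 2, which the CATE approximates for large n.
print(cate)
```

With n = 10,000 draws the Monte Carlo average concentrates tightly around the population value 2.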
Compared to the ATE, this quantity ignores the randomness in the feature data, replacing the true population feature distribution F in the definition (2) of ψ with its empirical counterpart n^{-1} Σ_{i=1}^n δ_{X_i}, with δ_x the Dirac measure (point mass) at x.

3 Bayesian causal inference for average treatment effects

We fit a nonparametric prior to the model (F, π, m) and consider the ATE ψ as a functional of these three components, studying the one-dimensional marginal posterior for ψ induced by the full nonparametric posterior. More concretely, one can sample from the marginal posterior for ψ by drawing a full posterior sample (F, π, m) and computing the corresponding draw ψ according to the formula (2). Note that this yields the full posterior for the ATE ψ, which is much more informative than simply the posterior mean, for instance also providing credible intervals for ψ. This is the natural Bayesian approach to modelling ψ and it is indeed typically necessary to fully model (F, π, m) rather than ψ directly when considering heterogeneous treatment effects.

Assuming the distribution F has a density f, the likelihood for data D_n arising from model (1) is

    ∏_{i=1}^n f(X_i) π(X_i)^{R_i} (1 − π(X_i))^{1−R_i} (1/(√(2π) σ_n)) exp(−R_i(Y_i − m(X_i, 1))²/(2σ_n²) − (1 − R_i)(Y_i − m(X_i, 0))²/(2σ_n²)).

Since this factorizes in the model parameters (f, π, m), placing a product prior on these three parameters yields a product posterior, i.e. f, π, m are (conditionally) independent under the posterior. As this is particularly computationally efficient, we pursue this approach. In this case, since π does not appear in the ATE ψ, the π terms will cancel from the marginal posterior for ψ and the prior on π is irrelevant for estimating the ATE.
We thus need not specify the π component of the prior. These properties hold even when F has no density and so a likelihood cannot be defined, see the supplement.

A Bayesian will typically endow the response surface m with a nonparametric prior for either modelling or computational reasons. As already mentioned, the induced marginal posterior for ψ will then often have a significant bias term in its centering, see Figure 1 for an example arising from a standard GP prior. Our main idea is to augment a given Bayesian prior for m by efficiently using an estimate π̂ of the PS, since it is well-known that using PS information can improve estimation of the ATE [32]. We model (F, m) using the following prior:

    m(x, r) = W(x, r) + ν_n λ (r/π̂(x) − (1 − r)/(1 − π̂(x))),    F ∼ DP,    (4)

Figure 1: Plot of marginal posterior distributions for the ATE with true ATE (red), histogram of 10,000 posterior draws (blue), posterior mean (solid black), 90% credible interval (dotted black) and best fitting Gaussian distribution (orange). Data arises from the synthetic simulation (HOM) in Section 5 with n = 500 and Gaussian process prior described in Section 4. Left/right: without/with bias correction. Note the incorrect centering on the left-hand side.

where W : R^d × {0, 1} → R is a stochastic process, DP denotes the Dirichlet process with a finite base measure [12], ν_n > 0 is a scaling parameter and λ is a real-valued random variable, with W, F, λ independent. Estimating the PS is a standard binary classification problem and one can use any suitable estimator π̂, from logistic regression to more advanced machine learning methods.
It may be practically advantageous to truncate the estimator π̂ away from 0 and 1 for numerical stability. For estimating the CATE, we propose the same prior (4) but with the Dirichlet process prior for F replaced by a plug-in estimate consisting of the empirical distribution F_n = n^{-1} Σ_{i=1}^n δ_{X_i}.

The prior (4) increases/decreases the prior correlation within/across treatment groups in a heterogeneous manner compared to the unmodified prior (ν_n = 0). For example, in regions with few observations in the treatment group (small π(x)), (4) significantly increases the prior correlation with other treated individuals, thereby borrowing more information across individuals to account for the lack of data. Conversely, in observation-rich areas (large π(x)), (4) borrows less information, instead using the (relatively) numerous local observations.

Using an unmodified prior (ν_n = 0), the posterior will make a bias-variance tradeoff aimed at estimating the full regression function m rather than the smooth one-dimensional functional ψ. In particular, the bias for the ATE ψ will dominate, leading to poor estimation and uncertainty quantification unless the true m and f = F′ are especially easy to estimate. The idea behind the prior (4) is to use a data-driven correction to (first-order) debias the resulting marginal posterior for ψ. The quantity r/π(x) − (1 − r)/(1 − π(x)) corresponds in a specific technical sense to the 'derivative' of the ATE ψ with respect to the model (1), the so-called 'least favorable direction' of ψ. Heuristically, Taylor expanding ψ|D_n − ψ_0, where ψ|D_n and ψ_0 are the posterior and 'true' ATE, the hyperparameter λ is introduced to help the posterior remove the first-order (bias) term in this expansion, see Figure 1 for an illustration.
Since the true π is unknown, the natural approach is to replace it with an estimator π̂. A more technical explanation can be found in the supplement. Such a bias correction will help most when (F, m) are difficult to estimate, for instance in high-dimensional feature settings. Higher-order bias corrections have also been considered using estimating equations [28, 29], but it is unclear how to extend this to the Bayesian setting. A similar idea has been investigated theoretically in [25], where it is shown that in a related idealized model, priors correctly calibrated to the unknown true functions (i.e. non-adaptive) satisfy a semiparametric Bernstein-von Mises theorem, i.e. the marginal posterior for the ATE is asymptotically normal with optimal variance in the large data setting. Figure 1 suggests the shape also holds in the present setting.

A good choice of prior for W is still essential, since poor modelling of m can also induce bias. In particular, ν_n should be picked so that the second term in (4) is of smaller order than W in order to have relatively little effect on the full posterior for m. If the Bernstein-von Mises theorem holds, the marginal posterior for ψ fluctuates on a 1/√n scale (see [25] for a related model), which suggests taking ν_n ∼ 1/√n. On this scale the bias correction is sufficiently large to meaningfully affect the marginal posterior, but not so large as to dominate. Simulations indicate that taking ν_n significantly larger than this can cause the bias correction to dominate in small data situations, reducing performance. In a data-rich situation, larger values of ν_n are also admissible since the posterior can calibrate the value of λ based on the data.
Thus correct calibration of ν_n is mainly important for small or moderate sample performance, see Section 4.

One can also take a fully Bayesian approach by placing a prior on π in (4). While such an approach may be philosophically appealing, it can cause computational difficulties since the priors for (π, m), and hence also the corresponding posteriors, are no longer independent. For Gaussian processes (GPs), considered in detail in Section 4, one can then only sample from the fully Bayesian posterior using a Metropolis-Hastings-within-Gibbs-sampling algorithm, which is far slower in practice. In contrast, the 'empirical Bayes' approach we advocate in (4) maintains this independence and is thus computationally more efficient, e.g. in the GP case, the resulting prior for m remains a GP.

It is known that for estimating a smooth one-dimensional functional of a nonparametric model, selecting an undersmoothing prior can be advantageous [8]. As well as being computationally efficient due to conjugacy, the choice of Dirichlet process for F is thus also theoretically motivated, since it can be viewed as a considerable undersmoothing (f = F′ does not even exist as F is a discrete probability measure with prior probability one).

One can also directly plug in an estimator F_n of F in (2), such as the empirical distribution, and randomize only m from its posterior. This provides an estimate of both the ATE (2) and CATE (3), but is only suitable for uncertainty quantification regarding the CATE. Not randomizing F causes the posterior to ignore the uncertainty in the features, leading to an underestimation of the variance for ψ. The resulting credible intervals will then be too narrow, giving wrong uncertainty quantification as we see in the supplementary material.
The message here is that even when different (empirical) Bayes methods give equally good estimation, as these two do, one must be careful about assuming that finer aspects of the posteriors behave similarly well, for example uncertainty quantification.

In summary, we view the prior modification (4) as a way to increase the efficiency of a given Bayesian prior for estimating the ATE and CATE.

A related approach is Bayesian Causal Forests (BCF) [15], where the estimated PS is directly added as an additional input feature to a BART model, yielding better performance. This approach is designed to improve nonparametric estimation of the entire response surface (i.e. the heterogeneous treatment effects themselves), which will also lead to some improvement when estimating the ATE. However, it is known that even when the prior is perfectly calibrated (i.e. all tuning parameters are set optimally) and recovers the entire response surface at the optimal rate, the posterior can still induce a bias in the marginal posterior for the ATE ψ that prevents efficient estimation and destroys uncertainty quantification (see e.g. [25]).

As discussed above, the specific form in which we include the PS in our prior (4) is very deliberate, being motivated by semiparametric statistical theory and specifically designed for estimating the ATE. When either the PS or response surface are especially difficult to estimate, we expect that incorporating the PS as a feature as in BCF will still induce a bias for the ATE (the theory in [25] predicts this). We emphasize, however, that the main goal of BCF is to estimate the entire response surface, which is a different problem to estimating the ATE we consider here.
An alternative Bayesian approach to estimating the ATE is to reparametrize the model to force π into the likelihood [14, 26].

4 Gaussian process priors

In recent years, Gaussian process (GP) priors have found particular uptake in causal inference problems [1-4, 23], for example in healthcare [11, 38]. We therefore concretely illustrate the prior (4) for W a mean-zero GP with covariance kernel K, λ ∼ N(0, 1) independent and scaling parameter ν_n > 0 to be defined below. Under the prior (4), m is again a mean-zero GP with data-driven covariance kernel

    E m(x, r) m(x′, r′) = K((x, r), (x′, r′)) + ν_n² (r/π̂(x) − (1 − r)/(1 − π̂(x))) (r′/π̂(x′) − (1 − r′)/(1 − π̂(x′))).    (5)

For GPs, our debiasing corresponds to a simple and easy to implement modification to the covariance kernel. One should use the original covariance kernel K that was considered suitable for estimating m (e.g. squared exponential, Matérn), since accurately modelling the regression surface is also necessary.

For our simulations in Section 5, we compute π̂ using logistic regression based on the same data, truncating our estimator to [0.1, 0.9] for numerical stability in (5). We take K equal to the squared exponential kernel (also called radial basis function) with automatic relevance determination (ARD),

    K((x, r), (x′, r′)) = ρ_m² exp(−(1/2) Σ_{i=1}^d (x_i − x′_i)²/ℓ_i²) exp(−(1/2)(r − r′)²/ℓ_{d+1}²),

with (ℓ_i)_{i=1}^{d+1} the length scale parameters and ρ_m² > 0 the kernel variance [24].
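A minimal sketch of the kernel modification in (5), assuming the direction w(z) = r/π̂(x) − (1 − r)/(1 − π̂(x)) has already been evaluated at the inputs (the helper names are ours):

```python
import numpy as np

def ard_se_kernel(Z1, Z2, lengthscales, rho_m):
    """ARD squared exponential kernel over augmented inputs z = (x, r),
    with the treatment indicator r treated as a (d+1)-th input dimension."""
    diff = (Z1[:, None, :] - Z2[None, :, :]) / lengthscales
    return rho_m**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def debiased_kernel(Z1, w1, Z2, w2, lengthscales, rho_m, nu):
    """Kernel (5): the original kernel plus the rank-one correction
    nu^2 * w(z) w(z'), with w the precomputed propensity-score direction."""
    return ard_se_kernel(Z1, Z2, lengthscales, rho_m) + nu**2 * np.outer(w1, w2)
```

The correction is a rank-one, positive semi-definite addition, so the modified kernel remains a valid covariance function.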
The data-driven length scales ℓ_i can be interpreted as the relevance of the i-th feature to the regression surface m and are particularly important for high-dimensional data, where some features may play little role. ARD has been used successfully for removing irrelevant inputs by several authors (see Chapter 5.1 of [24]) and can thus be viewed as a form of automatic (causal) feature selection. We optimize the hyperparameters (ℓ_i)_{i=1}^{d+1}, ρ_m and σ_n (noise variance) by maximizing the marginal likelihood (using the scaled conjugate gradient method option in the GPy package). We set ν_n = 0.2 ρ_m/(√n M_n) for M_n = n^{-1} Σ_{i=1}^n [R_i/π̂(X_i) + (1 − R_i)/(1 − π̂(X_i))] the average absolute value of the last part of (5). This places the second term in (5) on the same scale as the original covariance kernel K.

We assign F|D_n the Bayesian bootstrap (BB) distribution [12], namely a Dirichlet process with base measure equal to the rescaled empirical measure Σ_{i=1}^n δ_{X_i} of the observations. When n is moderate or large, the BB distribution will be very close to that of the true DP posterior. The advantage of the BB is that samples are particularly easy to generate: using that F|D_n can be represented as Σ_{i=1}^n V_i δ_{X_i} for (V_1, ..., V_n) ∼ Dir(n; 1, ..., 1) and that m and F are independent under the posterior, the posterior mean and draws for the ATE can be written as

    E[ψ|D_n] = (1/n) Σ_{i=1}^n E[m(X_i, 1) − m(X_i, 0) | D_n]  and  ψ|D_n = Σ_{i=1}^n V_i (m(X_i, 1) − m(X_i, 0)),    (6)

respectively. Using the representation V_i = U_i/Σ_{j=1}^n U_j for U_i ∼iid exp(1), sampling (V_1, ..., V_n) is particularly simple.
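The sampling scheme for the draws in (6) can be sketched as follows, assuming the posterior mean vector and covariance matrix of (m(X_i, 1) − m(X_i, 0))_{i=1}^n have already been computed (the function name is ours):

```python
import numpy as np

def sample_ate_posterior(mu_diff, Sigma_diff, n_draws, rng):
    """Draws from the marginal posterior of psi as in (6):
    psi = sum_i V_i (m(X_i,1) - m(X_i,0)), with Bayesian-bootstrap weights
    (V_1,...,V_n) ~ Dir(n; 1,...,1) generated via V_i = U_i / sum_j U_j
    for U_i ~iid exp(1). mu_diff and Sigma_diff are the posterior mean and
    covariance of the vector (m(X_i,1) - m(X_i,0))_i, assumed precomputed."""
    n = len(mu_diff)
    # Cholesky factor computed once and reused for every Gaussian draw.
    L = np.linalg.cholesky(Sigma_diff + 1e-10 * np.eye(n))
    draws = np.empty(n_draws)
    for l in range(n_draws):
        U = rng.exponential(size=n)
        V = U / U.sum()                          # Dirichlet(1,...,1) weights
        m_diff = mu_diff + L @ rng.normal(size=n)
        draws[l] = V @ m_diff
    return draws
```

Credible intervals for ψ are then read off from the empirical quantiles of the returned draws.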
One also needs to generate an n-dimensional multivariate Gaussian random variable (m(X_i, 1) − m(X_i, 0))_{i=1}^n, whose covariance can be directly obtained from the posterior GP process (m(x, r) : x ∈ R^d, r ∈ {0, 1}) | D_n evaluated at the observations and their counterfactual values. This follows from the usual formula for the mean and covariance of a posterior GP in regression with Gaussian noise (Chapter 2.2 of [24]) and the whole procedure is summarized in Algorithm 1 (see footnote 1). Using this scheme, we may sample directly from the marginal posterior for the ATE ψ.

To show the importance of randomizing F for uncertainty quantification, we also consider the posterior where one plugs in the empirical measure n^{-1} Σ_{i=1}^n δ_{X_i} for F in (2). This yields the same posterior mean as in (6), while sampling ψ|D_n corresponds to the right-hand side of (6) with V_i replaced by 1/n. We expect this to yield similar prediction to the posterior mean in (6) but worse uncertainty quantification for the ATE (but not CATE). This is indeed what we see in the supplement.

5 Simulations

We numerically illustrate the improved performance of our debiased GP method (GP + PS) versus the original GP approach, both with (GP and GP + PS) and without randomization (GP (noRand) and GP + PS (noRand)) of the feature distribution F. The methods are implemented as described in Section 4. Credible intervals are computed by sampling 2,000 posterior draws and taking the empirical 95% credible interval, see Figure 1. We measure estimation accuracy via the absolute error between the posterior mean and true (C)ATE.
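The posterior GP mean and covariance computations referenced above are the standard GP regression formulas (Chapter 2.2 of [24]); a generic sketch, where a direct linear solve stands in for the Cholesky factorization one would reuse in practice:

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test, Y, sigma2):
    """Standard GP regression posterior at test inputs given noisy data Y:
        mu    = K_* (K + sigma^2 I)^{-1} Y
        Sigma = K_** - K_* (K + sigma^2 I)^{-1} K_*^T
    A direct solve is used for brevity; in practice one factorizes
    K + sigma^2 I once (e.g. by Cholesky) and reuses the factor."""
    n = len(Y)
    A = K_train + sigma2 * np.eye(n)
    mu = K_cross @ np.linalg.solve(A, Y)
    Sigma = K_test - K_cross @ np.linalg.solve(A, K_cross.T)
    return mu, Sigma
```

In Algorithm 1 below, the test inputs are the 2n factual and counterfactual points, so mu and Sigma are 2n-dimensional and directly yield the vector (m(X_i, 1) − m(X_i, 0))_i.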
We also report the average size and coverage of the resulting credible/confidence intervals (CI) and the Type II error, which measures the fraction of times the method does not identify a statistically significant (C)ATE.

We further compare their performance with standard state-of-the-art methods for estimating the ATE and CATE, namely Bayesian Additive Regression Trees (BART) [10, 18, 15] both with and without using the PS as a feature, Bayesian Causal Forests (BCF) [15], Causal Forests (CF) with average inverse propensity weighting (AIPW) and targeted maximum likelihood estimation (TMLE) [39], Propensity Score Matching (PSM) [36], ordinary least squares (OLS), and Covariate Balancing (CB) with the standard inverse PS weights and weights computed by constrained minimization (CM) [9]. Details of these benchmarks are provided in the supplementary material. We ran all simulations 200 times and report average values.

Algorithm 1 Debiased GP with PS correction
1: Input: X (features), R (treatment assignments), Y (outcomes), K (covariance kernel)
2: Run logistic regression on (X_1, R_1), ..., (X_n, R_n) and return estimates π̂(X_1), ..., π̂(X_n)
3: wf = (R_1/π̂(X_1) − (1 − R_1)/(1 − π̂(X_1)), ..., R_n/π̂(X_n) − (1 − R_n)/(1 − π̂(X_n))) (factual)
4: wc = ((1 − R_1)/π̂(X_1) − R_1/(1 − π̂(X_1)), ..., (1 − R_n)/π̂(X_n) − R_n/(1 − π̂(X_n))) (counterfactual)
5: Optimize hyperparameters of K (including σ_n²) and then ν² (see Section 4)
6: Z = (X R) and Z* = (X R; X 1 − R), stacking the factual and counterfactual inputs
7: K^{f,c} = K(Z*, Z) + ν² (wf wc)^T wf
8: μ = K^{f,c} [K(Z, Z) + ν² wf^T wf + σ_n² I_n]^{-1} Y
9: Σ = K(Z*, Z*) + ν² (wf wc)^T (wf wc) − K^{f,c} [K(Z, Z) + ν² wf^T wf + σ_n² I_n]^{-1} (K^{f,c})^T
10: Compute ψ̂ = E[ψ|D_n] from μ according to the left hand side of (6)
11: for l = 1, ..., P (# posterior samples) do
12:   Generate (V_1, ..., V_n) ∼ Dir(n; 1, ..., 1)
13:   Generate m ∼ N_{2n}(μ, Σ)
14:   Compute ψ_l from m and V_1, ..., V_n according to the right hand side of (6)
15: Compute credible interval (CI) based on quantiles of ψ_1, ..., ψ_P
16: Output: ψ̂ (posterior mean), CI (credible interval), ψ_1, ..., ψ_P (posterior samples)

Footnote 1: Lines 8-9 in Algorithm 1 are the usual predictive mean and covariance computations for a posterior GP. In particular, these can be more efficiently solved using for example Cholesky factorization, see Chapter 2.2 of [24]. Similarly, m can be efficiently generated by once taking the Cholesky factor L_Σ of Σ, generating W ∼ N_{2n}(0, I_{2n}) and setting m = μ + L_Σ W.

Synthetic dataset. We consider two versions of synthetic data generated following the protocol used in [21, 22, 37]. We take sample sizes n = 500, 1000 and d = 100 features x_1, x_2, ..., x_100 ∼iid N(0, 1). The response surface and treatment assignments are defined via the following ten functions: g_1(x) = x − 0.5, g_2(x) = (x − 0.5)² + 2, g_3(x) = x² − 1/3, g_4(x) = −2 sin(2x), g_5(x) = e^{−x} − e^{−1} − 1, g_6(x) = e^{−x}, g_7(x) = x², g_8(x) = x, g_9(x) = I_{x>0}, g_10(x) = cos(x). A subject with features x = (x_1, ..., x_100) is assigned (non-randomly) to the treatment group if Σ_{k=1}^5 g_k(x_k) > 0 and otherwise to the control group. Given the features and treatment assignment, in case (HOM) the outcome Y is generated as Y | X = x, R = r ∼ N(Σ_{k=1}^5 g_{k+5}(x_k) + r, 1), which models a homogeneous treatment effect. In case (HET), Y is generated as Y | X = x, R = r ∼ N(Σ_{k=1}^5 g_{k+5}(x_k) + r(1 + 2x_2 x_5), 1), which models heterogeneous treatment effects. In both cases, the first five features affect both the treatment and outcome, representing confounders, while the remaining 95 features are noise. The ATE is 1 in both cases. Some results are in Table 1 with the remainder in the supplement.

IHDP dataset with simulated outcomes. Since simulated covariates often do not accurately represent "real world" examples, we consider a semi-synthetic dataset with real features and treatment assignments from the Infant Health and Development Program (IHDP), but simulated responses. The IHDP consisted of a randomized experiment studying whether low-birth-weight and premature infants benefited from intensive high-quality child care. The data contains d = 25 pretreatment variables per subject. Following [18] (also used in [2, 21]), an observational study is created by removing a non-random portion of the treatment group, namely all children with non-white mothers. This leaves a dataset of 747 subjects, with 139 in the treatment group and 608 in the control group. We consider a slight modification of the non-linear "Response Surface B" of [18], taking

    Y(0) | X = x ∼ N(e^{(x+w)β}, 1)  and  Y(1) | X = x ∼ N(x^T β − ω_β, 1),

where x ∈ R^d are the features, w = (0.5, . . .
, 0.5) is an offset vector, and β is a vector of regression coefficients with each entry randomly sampled from {0, 0.1, 0.2, 0.3, 0.4} with probabilities (0.6, 0.1, 0.1, 0.1, 0.1). For each simulation of β, ωβ is then selected so that the CATE equals 4. Here, we can only measure estimation quality of the CATE and not the ATE, since the true feature distribution F is unknown. Results are in Table 2.

Table 1: Results for synthetic dataset (HET) with n = 1000.

Method            Abs. error ± sd   Size CI ± sd    Coverage   Type II error
GP                0.321 ± 0.027     0.613 ± 0.027   0.38       0.00
GP (noRand)       0.321 ± 0.027     0.427 ± 0.017   0.00       0.00
GP + PS           0.063 ± 0.042     0.883 ± 0.040   1.00       0.00
GP + PS (noRand)  0.063 ± 0.042     0.766 ± 0.037   1.00       0.00
BART              0.228 ± 0.186     1.723 ± 0.490   1.00       0.50
BART (PS)         0.134 ± 0.092     0.741 ± 0.079   0.99       0.00
BCF               0.144 ± 0.109     0.535 ± 0.066   0.87       0.00
CF (AIPW)         0.138 ± 0.097     0.695 ± 0.103   0.96       0.00
CF (TMLE)         0.136 ± 0.099     0.891 ± 0.156   0.99       0.01
OLS               0.725 ± 0.160     0.361 ± 0.034   0.00       0.26
CB (IPW)          0.606 ± 0.324     1.467 ± 0.418   0.68       0.01
PSM               0.234 ± 0.178     1.282 ± 0.158   0.97       0.06

Table 2: Results for semi-synthetic IHDP dataset.

Method            Abs. error ± sd   Size CI ± sd    Coverage   Type II error
GP                0.246 ± 0.398     1.383 ± 1.458   0.95       0.01
GP (noRand)       0.246 ± 0.398     1.096 ± 1.305   0.89       0.01
GP + PS           0.189 ± 0.234     1.445 ± 1.013   0.97       0.01
GP + PS (noRand)  0.189 ± 0.234     1.162 ± 0.822   0.93       0.01
BART              0.234 ± 0.282     0.945 ± 0.745   0.91       0.00
BART (PS)         0.238 ± 0.342     0.906 ± 0.682   0.89       0.00
BCF               0.108 ± 0.106     0.526 ± 0.151   0.95       0.00
CF (AIPW)         0.245 ± 0.236     1.052 ± 0.811   0.91       0.01
CF (TMLE)         0.242 ± 0.240     1.087 ± 0.842   0.91       0.01
OLS               0.127 ± 0.101     0.815 ± 0.537   0.98       0.00
CB (IPW)          0.238 ± 0.180     1.200 ± 0.860   0.91       0.00
CB (CM)           0.134 ± 0.117     0.961 ± 0.765   0.93       0.00
PSM               0.136 ± 0.108     2.052 ± 1.701   1.00       0.01

Both of these simulations contain unbalanced treatment groups, with roughly 90% and 20% of subjects in the treatment group in the synthetic and IHDP simulations, respectively. Like PS reweighting-based methods, our bias-corrected GP method (5) is designed with problems satisfying the standard overlap assumption [19] (namely 0 < P(R = 1|X = x) < 1 for all x ∈ R^d) in mind. In the synthetic simulation the treatment assignment is fully deterministic, so this condition is not satisfied; in particular, the data generation process was not selected to favour our method.

Results. We see from Tables 1 and 2 that our methods (GP + PS and GP + PS (noRand)) substantially improve upon the performance of the vanilla GP methods (GP and GP (noRand)) [also true in the additional simulations in the supplement]. In both cases we obtain significantly improved estimation accuracy and uncertainty quantification.
As an example of what can go wrong, in the synthetic simulation the absolute errors of the vanilla GP methods barely decrease as the sample size increases (Table 1 and the supplementary tables), since the posterior for the ATE contains a non-vanishing bias. Moreover, since the posterior variance shrinks rapidly with the sample size (at rate 1/n for the ATE [25]), the posterior will concentrate tightly around the wrong value, giving poor uncertainty quantification that actually worsens with increasing data; see Figure 1 and Table 1. This is a typical feature of causal inference problems with difficult-to-estimate PS and response surfaces, particularly in high feature dimensions. In contrast, our debiased method explicitly corrects for this bias at the expense of a (smaller) increase in variance, as can be seen from the average CI length. The substantially improved coverage from our method is the result of the debiasing rather than of the increase in posterior variance.
Asymptotic theory predicts that the frequentist coverage of our method should converge to exactly 0.95 as the sample size increases, due to the semiparametric Bernstein-von Mises theorem [25]. However, it is a subtle question as to when the asymptotic regime applies, and our examples seem insufficiently data rich for this to be the case (e.g. d = 100 input features, but only n = 1000 observations).
We see that our method makes the previously underperforming GP method highly competitive with state-of-the-art methods, even outperforming them in certain cases. In the synthetic simulation, our method performs best, yielding substantially better estimation accuracy. It further provides reliable and informative uncertainty quantification, performing similarly to BART (PS), CF (AIPW) and CF (TMLE). On the IHDP dataset, our debiased methods outperform the widely used BART and CF for estimation accuracy, but BCF performs best.
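The credible intervals discussed above come from the quantile-based sampling scheme in steps 11-15 of the algorithm. Below is a minimal NumPy sketch of that scheme. It is a hedged illustration, not the authors' implementation: the reduction of a posterior draw m and the Dirichlet weights to an ATE draw follows equation (6), which is not reproduced in this excerpt, so the per-subject effect computation (treated-minus-control difference of factual and counterfactual function values) is our reading of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def ate_credible_interval(mu, Sigma, R, n_draws=1000, level=0.95):
    """Steps 11-15: draw m ~ N_{2n}(mu, Sigma) at the factual (first n)
    and counterfactual (last n) design points, average per-subject
    treatment effects under Bayesian-bootstrap Dirichlet weights, and
    return the ATE draws with a quantile credible interval."""
    n = len(R)
    # Small jitter keeps the Cholesky factorization numerically stable.
    L = np.linalg.cholesky(Sigma + 1e-9 * np.eye(2 * n))
    psi = np.empty(n_draws)
    for l in range(n_draws):
        m = mu + L @ rng.standard_normal(2 * n)   # step 13: m ~ N_{2n}(mu, Sigma)
        m_f, m_c = m[:n], m[n:]                   # factual / counterfactual values
        # Hedged reading of (6): outcome under treatment minus under control.
        tau = np.where(R == 1, m_f - m_c, m_c - m_f)
        V = rng.dirichlet(np.ones(n))             # step 12: Dir(n; 1, ..., 1)
        psi[l] = V @ tau                          # step 14: one ATE draw
    lo, hi = np.quantile(psi, [(1 - level) / 2, (1 + level) / 2])  # step 15
    return psi, (lo, hi)
```

Here mu and Sigma are the posterior mean and covariance from steps 8-9; in practice one would draw all P samples of m jointly via a single matrix multiplication rather than in a Python loop.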
While OLS and CB (CM) also performed well here, we note that in the synthetic simulation, OLS performed especially badly while CB (CM) did not even run. Regarding uncertainty quantification, our method provides excellent coverage, though with larger CIs than BART (whose coverage is slightly lower); BCF again performs best.
We lastly note that not randomizing the feature distribution (noRand) yields narrower CIs and lower coverage, as expected. This does not make a substantial difference in Tables 1 and 2, but can have a significant impact; see the tables in the supplement. We recall that randomization is generally helpful for uncertainty quantification for the ATE, but is conservative for the CATE.

6 Discussion

We have introduced a general data-driven modification that can be applied to any given prior and that corrects for first-order posterior bias when estimating (conditional) average treatment effects (ATEs) in a causal inference regression model. We illustrated this experimentally on both simulated and semi-synthetic data for the example of Gaussian process (GP) priors. We showed that by correctly incorporating an estimate of the propensity score into the covariance kernel, one can substantially improve the precision of both the posterior mean and the posterior uncertainty quantification. In particular, this makes the modified GP method highly competitive with state-of-the-art methods.
There are many avenues for future work. First, GP methods scale poorly with data size and there has been extensive research on scalable alternatives, including sparse GP approximations, variational Bayes and distributed computing approaches. Since in the GP case our approach simply returns a GP with a modified covariance kernel, all these existing methods should be directly applicable and can be investigated.
Second, it would be particularly interesting to see if our prior correction can be efficiently implemented to improve the already excellent performance of BART and its derivatives in causal inference problems [15, 18]. Third, it is unclear if and how one can perform higher-order bias corrections using Bayes for especially difficult problems, as has been done using estimating equations [28, 29].

Acknowledgements: Botond Szabó received funding from the Netherlands Organization for Scientific Research (NWO) under project number 639.031.654. We thank the three reviewers for their useful comments, which helped improve the presentation of this work.

References

[1] ALAA, A., AND VAN DER SCHAAR, M. Limits of estimating heterogeneous treatment effects: Guidelines for practical algorithm design. In Proceedings of the 35th International Conference on Machine Learning (2018), pp. 129-138.

[2] ALAA, A. M., AND VAN DER SCHAAR, M. Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems 30. 2017, pp. 3424-3432.

[3] ALAA, A. M., AND VAN DER SCHAAR, M. Deep multi-task Gaussian processes for survival analysis with competing risks. In Advances in Neural Information Processing Systems 30. 2017, pp. 2329-2337.

[4] ANTONELLI, J., AND DOMINICI, F. A Bayesian semiparametric framework for causal inference in high-dimensional data. arXiv e-prints (May 2018), arXiv:1805.04899.

[5] ATHEY, S., IMBENS, G., PHAM, T., AND WAGER, S. Estimating average treatment effects: Supplementary analyses and remaining challenges. American Economic Review 107, 5 (May 2017), 278-281.

[6] BRODERSEN, K. H., GALLUSSER, F., KOEHLER, J., REMY, N., AND SCOTT, S. L. Inferring causal impact using Bayesian structural time-series models. Ann. Appl. Stat. 9, 1 (2015), 247-274.

[7] CASTILLO, I.
Semiparametric Bernstein-von Mises theorem and bias, illustrated with Gaussian process priors. Sankhya A 74, 2 (2012), 194-221.

[8] CASTILLO, I., AND ROUSSEAU, J. A Bernstein-von Mises theorem for smooth functionals in semiparametric models. Ann. Statist. 43, 6 (2015), 2353-2383.

[9] CHAN, K. C. G., YAM, S. C. P., AND ZHANG, Z. Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78, 3 (2016), 673-700.

[10] CHIPMAN, H. A., GEORGE, E. I., AND MCCULLOCH, R. E. BART: Bayesian additive regression trees. Ann. Appl. Stat. 4, 1 (2010), 266-298.

[11] FUTOMA, J., HARIHARAN, S., AND HELLER, K. Learning to detect sepsis with a multitask Gaussian process RNN classifier. In Proceedings of the 34th International Conference on Machine Learning (2017), pp. 1174-1182.

[12] GHOSAL, S., AND VAN DER VAART, A. W. Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2017.

[13] GREEN, D. P., AND KERN, H. L. Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly 76, 3 (2012), 491-511.

[14] HAHN, P. R., CARVALHO, C. M., PUELZ, D., AND HE, J. Regularization and confounding in linear regression for treatment effect estimation. Bayesian Anal. 13, 1 (2018), 163-182.

[15] HAHN, P. R., MURRAY, J. S., AND CARVALHO, C. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. arXiv e-prints (Jun 2017), arXiv:1706.09523.

[16] HECKMAN, J. J., ICHIMURA, H., AND TODD, P. Matching as an econometric evaluation estimator. The Review of Economic Studies 65, 2 (1998), 261-294.

[17] HILL, J., AND SU, Y.-S.
Assessing lack of common support in causal inference using Bayesian nonparametrics: implications for evaluating the effect of breastfeeding on children's cognitive outcomes. Ann. Appl. Stat. 7, 3 (2013), 1386-1420.

[18] HILL, J. L. Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Statist. 20, 1 (2011), 217-240. Supplementary material available online.

[19] IMBENS, G. W., AND RUBIN, D. B. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press, New York, NY, USA, 2015.

[20] KERN, H. L., STUART, E. A., HILL, J., AND GREEN, D. P. Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness 9, 1 (2016), 103-127. PMID: 27668031.

[21] LI, S., AND FU, Y. Matching on balanced nonlinear representations for treatment effects estimation. In Advances in Neural Information Processing Systems 30. 2017, pp. 929-939.

[22] LI, S., VLASSIS, N., KAWALE, J., AND FU, Y. Matching via dimensionality reduction for estimation of treatment effects in digital marketing campaigns. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (2016), pp. 3768-3774.

[23] MUELLER, J., RESHEF, D., DU, G., AND JAAKKOLA, T. Learning optimal interventions. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (2017), pp. 1039-1047.

[24] RASMUSSEN, C. E., AND WILLIAMS, C. K. I. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2006.

[25] RAY, K., AND VAN DER VAART, A. W. Semiparametric Bayesian causal inference. Ann. Statist., to appear.

[26] RITOV, Y., BICKEL, P. J., GAMST, A. C., AND KLEIJN, B. J. K. The Bayesian analysis of complex, high-dimensional models: can it be CODA? Statist. Sci.
29, 4 (2014), 619\u2013639.\n\n[27] RIVOIRARD, V., AND ROUSSEAU, J. Bernstein-von Mises theorem for linear functionals of\n\nthe density. Ann. Statist. 40, 3 (2012), 1489\u20131523.\n\n[28] ROBINS, J., LI, L., TCHETGEN, E., AND VAN DER VAART, A. Higher order in\ufb02uence\nfunctions and minimax estimation of nonlinear functionals. In Probability and statistics: essays\nin honor of David A. Freedman, vol. 2 of Inst. Math. Stat. (IMS) Collect. Inst. Math. Statist.,\nBeachwood, OH, 2008, pp. 335\u2013421.\n\n[29] ROBINS, J. M., LI, L., MUKHERJEE, R., TCHETGEN, E. T., AND VAN DER VAART, A.\nMinimax estimation of a functional on a structured high-dimensional model. Ann. Statist. 45, 5\n(2017), 1951\u20131987.\n\n[30] ROBINS, J. M., AND RITOV, Y. Toward a curse of dimensionality appropriate (CODA)\nasymptotic theory for semi-parametric models. Statistics in Medicine 16, 3 (1997), 285\u2013319.\n\n[31] ROBINS, J. M., AND ROTNITZKY, A. Semiparametric ef\ufb01ciency in multivariate regression\n\nmodels with missing data. J. Amer. Statist. Assoc. 90, 429 (1995), 122\u2013129.\n\n[32] ROSENBAUM, P. R., AND RUBIN, D. B. The central role of the propensity score in observa-\n\ntional studies for causal effects. Biometrika 70, 1 (1983), 41\u201355.\n\n[33] ROTNITZKY, A., AND ROBINS, J. M. Semi-parametric estimation of models for means and\n\ncovariances in the presence of missing data. Scand. J. Statist. 22, 3 (1995), 323\u2013333.\n\n[34] RUBIN, D. B. Bayesian inference for causal effects: the role of randomization. Ann. Statist. 6,\n\n1 (1978), 34\u201358.\n\n[35] SIVAGANESAN, S., M\u00dcLLER, P., AND HUANG, B. Subgroup \ufb01nding via Bayesian additive\n\nregression trees. Stat. Med. 36, 15 (2017), 2391\u20132403.\n\n[36] STUART, E. A. Matching methods for causal inference: a review and a look forward. Statist.\n\nSci. 25, 1 (2010), 1\u201321.\n\n[37] SUN, W., WANG, P., YIN, D., YANG, J., AND CHANG, Y. 
Causal inference via sparse additive models with application to online advertising. In Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).

[38] URTEAGA, I., ALBERS, D. J., WHEELER, M. V., DRUET, A., RAFFAUF, H., AND ELHADAD, N. Towards personalized modeling of the female hormonal cycle: experiments with mechanistic models and Gaussian processes. NIPS 2017 Workshop: "Machine Learning for Health" (2017).

[39] WAGER, S., AND ATHEY, S. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113, 523 (2018), 1228-1242.