{"title": "Policy Evaluation with Latent Confounders via Optimal Balance", "book": "Advances in Neural Information Processing Systems", "page_first": 4826, "page_last": 4836, "abstract": "Evaluating novel contextual bandit policies using logged data is crucial in applications where exploration is costly, such as medicine. But it usually relies on the assumption of no unobserved confounders, which is bound to fail in practice. We study the question of policy evaluation when we instead have proxies for the latent confounders and develop an importance weighting method that avoids fitting a latent outcome regression model. Surprisingly, we show that there exist no single set of weights that give unbiased evaluation regardless of outcome model, unlike the case with no unobserved confounders where density ratios are sufficient. Instead, we propose an adversarial objective and weights that minimize it, ensuring sufficient balance in the latent confounders regardless of outcome model. We develop theory characterizing the consistency of our method and tractable algorithms for it. Empirical results validate the power of our method when confounders are latent.", "full_text": "Policy Evaluation with Latent Confounders via\n\nOptimal Balance\n\nAndrew Bennett\u02da\nCornell University\n\nawb222@cornell.edu\n\nNathan Kallus\u02da\nCornell University\n\nkallus@cornell.edu\n\nAbstract\n\nEvaluating novel contextual bandit policies using logged data is crucial in appli-\ncations where exploration is costly, such as medicine. But it usually relies on the\nassumption of no unobserved confounders, which is bound to fail in practice. We\nstudy the question of policy evaluation when we instead have proxies for the latent\nconfounders and develop an importance weighting method that avoids \ufb01tting a\nlatent outcome regression model. 
We show that unlike the unconfounded case\nno single set of weights can give unbiased evaluation for all outcome models,\nyet we propose a new algorithm that can still provably guarantee consistency by\ninstead minimizing an adversarial balance objective. We further develop tractable\nalgorithms for optimizing this objective and demonstrate empirically the power of\nour method when confounders are latent.\n\n1\n\nIntroduction\n\nPersonalized intervention policies are of increasing importance in education [32], healthcare [3],\nand public policy [26]. In many of these domains exploration is costly or otherwise prohibitive,\nand so it is crucial to evaluate new policies using existing observational data. Usually, this relies\non an assumption of no unobserved confounding (aka unconfoundedness or ignorability): that\nconditioned on observables, interventions are independent of idiosyncrasies that affect outcomes,\nso that counterfactuals can be reliably and correctly predicted. In particular, this enables the use\nof inverse propensity score (IPS) estimators of policy value [4, 21, 29] that eschew the need to\nactually \ufb01t outcome prediction models and doubly robust estimators that work even if such models\nare misspeci\ufb01ed [8].\nIn practice, however, it may be unlikely that we observe confounders exactly. Nonetheless, if we\nobserve very many features they may serve as good proxies for the true confounders, which can enable\nan alternative route to identi\ufb01cation [22, 30]. In particular, noisy observations of true confounders can\nserve as valid proxies. For example, if intelligence is latent but affects both selection and outcome, we\ncan instead use many noisy observations of intelligence such as school grades, IQ test, etc. 
Similarly, many medical measurements taken together can serve as proxies for underlying healthfulness.
In this paper, we study the problem of policy evaluation from observational data where we observe proxies instead of true confounders, and we develop new weighting estimators based on optimizing balance in the latent confounders. Unlike the unconfounded setting, where IPS weights ensure balance regardless of outcome model, we show that in this new setting there cannot exist any single set of weights that ensures such unbiasedness regardless of outcome model. Instead, we develop an adversarial objective that bounds the conditional mean square error (CMSE) of any weighted estimator and, by appealing to game-theoretic and empirical process arguments, we show that this objective can actually be driven to zero by a single set of weights. We therefore propose a novel policy evaluation method that minimizes this objective, thus provably ensuring consistent estimation in the face of latent confounders. We develop tractable algorithms for this optimization problem. Finally, we provide empirical evidence demonstrating our method's consistent evaluation compared to standard evaluation methods and its improved performance compared to using fitted latent outcome models.

2 Problem

2.1 Setting and Assumptions

We consider a contextual decision-making setting with m possible treatments (aka actions or interventions). Each unit is associated with a set of potential outcomes Y(1), . . . , Y(m) ∈ R corresponding to the reward/loss for each treatment, an observed treatment T ∈ {1, . . . , m}, an observed outcome Y = Y(T), true but latent confounders Z ∈ Z ⊆ R^p, and observed covariates X ∈ X ⊆ R^q. Our data consists of iid observations X_i, T_i, Y_i of X, T, Y.
Both the latent confounders and the potential outcomes of unassigned treatments are unobserved. Note that Y_i = Y_i(T_i) encapsulates the assumptions of consistency between observed and potential outcomes and of non-interference between units.
A policy is a rule for assigning the probability of each treatment option given the observed covariates X. Given a policy π, we use the notation π_t(x) to indicate the probability of assigning treatment t when the observed covariates are x. We define the value of a policy, τ^π, as the expected outcome that would be obtained from following the policy in the population. Formally:

Definition 1 (Policy Value). τ^π = E[Σ_{t=1}^m π_t(X) Y(t)].

We encapsulate the assumption that Z is sufficient for unconfoundedness and that X is a proxy for Z in the following assumption. Figure 1 provides a representation of this setting using a causal DAG [36]. Note importantly that we do not assume ignorability given X.

Assumption 1 (Z are true confounders). For every t ∈ {1, . . . , m}, Y(t) is independent of (X, T), given Z.

[Figure 1: DAG representation of problem, with nodes Z, X, T, Y.]

We next define the mean outcome given Z and its conditional expectations given observables:

μ_t(z) = E[Y(t) | Z = z],
κ_t(x, t') = E[μ_t(Z) | X = x, T = t'] = E[Y(t) | X = x, T = t'],
ρ_t(x) = E[μ_t(Z) | X = x] = E[Y(t) | X = x].

We further define the propensity function and its conditional expectation given observables:

e_t(z) = P(T = t | Z = z),
η_t(x) = P(T = t | X = x) = E[e_t(Z) | X = x].

Finally, we denote by φ(z; x, t) the conditional density of Z given X = x, T = t. This density represents the latent variable model underlying the observables.
For example, this can be a Gaussian mixture model, a PCA-type model as in [22], or a deep variational autoencoder as in [30]. Because we focus on how one might use such a latent model rather than on the estimation of this model, we simply assume we have some approximate oracle φ̂ for calculating its values, such that φ̂ = φ + O_p(1/√n) in L1. (Note that for fair comparison, in the experiments in Section 5, we similarly let the outcome regression methods use this oracle.)
We further make the following regularity assumptions:

Assumption 2 (Weak Overlap). For every t ∈ {1, . . . , m}, E[e_t(Z)^{-2}] < ∞.

Remark 1. Given Assumption 2, it trivially follows that for every t ∈ {1, . . . , m}, x ∈ X, and z ∈ Z we have e_t(z) > 0 and η_t(x) > 0.

Assumption 3 (Bounded Variance). The conditional variance of our potential outcomes given X, T is bounded: V[Y(t) | X, T] ≤ σ².

2.2 The Policy Evaluation Task

The problem we consider is to estimate the policy value τ^π given a policy π and data X_{1:n}, T_{1:n}, Y_{1:n}. One standard approach to this is the direct method [38], which, given an estimate ρ̂_t of ρ_t, predicts the policy value as

τ̂^π_ρ̂ = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) ρ̂_t(X_i).   (1)

However, this method is known to be biased and not to generalize well [4]. Furthermore, given Assumption 1, ρ̂ is not straightforward to estimate, since the mean value of Y observed in our logged data given X = x and T = t is κ_t(x, t), not ρ_t(x), so fitting ρ̂ would require controlling for the effects of the unobserved Z.
An alternative to this is to come up with weights W_{1:n} = f_W(X_{1:n}, T_{1:n}) according to some function f_W of the observed covariates and treatments, in order to re-weight the outcomes to look more like those that would be observed under π.
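As a concrete sketch, the direct estimator of Eq. (1) is a few lines of numpy. The callable interface below (π and ρ̂ each mapping an (n, q) covariate array to an (n, m) array of per-treatment values) is an illustrative assumption, not the paper's code:

```python
import numpy as np

def direct_estimate(pi, rho_hat, X):
    """Direct method, Eq. (1): (1/n) * sum_i sum_t pi_t(X_i) * rho_hat_t(X_i).
    Both pi and rho_hat map an (n, q) covariate array to an (n, m) array."""
    probs = pi(X)       # policy probabilities pi_t(X_i)
    rho = rho_hat(X)    # estimated mean outcomes rho_hat_t(X_i)
    return float(np.mean(np.sum(probs * rho, axis=1)))
```

Of course, as just discussed, fitting ρ̂ from the logged data is itself confounded in this setting, since the observable regression target is κ_t(x, t) rather than ρ_t(x).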
Using these weights we can define the weighted estimator

τ̂^π_W = (1/n) Σ_{i=1}^n W_i Y_i.   (2)

This weighted estimator has the advantage that it does not require modeling the outcome distributions. Furthermore, we could combine the weights W_{1:n} with an outcome model ρ̂_t to calculate the doubly robust estimator [8], which is defined as

τ̂^π_{W,ρ̂} = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) ρ̂_t(X_i) + (1/n) Σ_{i=1}^n W_i (Y_i − ρ̂_{T_i}(X_i)).   (3)

The doubly robust estimator is known to be consistent when either the weighted or the direct estimator is consistent, and it can attain local efficiency [19, 39].
Various approaches exist for coming up with weights for either the weighted or doubly robust estimators, which we discuss below. However, none of these methods are applicable given Assumption 1, and so we develop a theory for weighting using proxy variables in Section 3.

2.3 Related Work

One of the most standard approaches to policy evaluation is to use the weighted or doubly robust estimator defined in Eqs. (2) and (3) with inverse propensity score (IPS) weights. These are given by W_i = π_{T_i}(X_i) / e_{T_i}(Z_i) [5], where e_t are known or estimated logging probabilities. Since these weights can be extreme, both normalization [2, 31, 41] and clipping [10, 12, 40] are often employed. Some other approaches include recursive partitioning [14]. None of these methods are applicable to our setting, however, since we do not know the true confounders Z_{1:n}.
An alternative to approaches based on fixed formulae for computing the importance weights is to compute weights that optimize an imbalance objective function [1, 15, 17, 18].
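A minimal numpy sketch of the estimators in Eqs. (2) and (3), together with IPS-style weights computed from proxies alone (π, ρ̂, and η̂ are passed as callables returning (n, m) arrays — an interface assumed for illustration; treatments are 0-indexed in code):

```python
import numpy as np

def weighted_estimate(W, Y):
    """Weighted estimator, Eq. (2): (1/n) * sum_i W_i * Y_i."""
    return float(np.mean(W * Y))

def doubly_robust_estimate(W, Y, T, X, pi, rho_hat):
    """Doubly robust estimator, Eq. (3): direct term plus weighted residuals."""
    n = len(Y)
    probs, rho = pi(X), rho_hat(X)                  # each (n, m)
    direct = np.mean(np.sum(probs * rho, axis=1))
    residual = np.mean(W * (Y - rho[np.arange(n), T]))
    return float(direct + residual)

def ips_weights_from_proxies(pi, eta_hat, X, T):
    """Benchmark IPS weights based on X only: W_i = pi_{T_i}(X_i) / eta_hat_{T_i}(X_i).
    Under latent confounding these are biased (see Section 3.1)."""
    idx = np.arange(len(T))
    return pi(X)[idx, T] / eta_hat(X)[idx, T]
```

When ρ̂ is identically zero, the doubly robust estimate reduces to the weighted estimate, which is a convenient sanity check for any implementation.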
For policy evaluation,\nKallus [16] propose to choose weights that adversarially minimize the conditional mean squared\nerror of policy evaluation in the worst case of possible mean outcome functions in some reproducing\nkernel Hilbert space (RKHS), by solving a linearly constraint quadratic program (LCQP). Our work\nfollows a very similar style to this, however instead of using the true confounders we only assume\naccess to proxies, and we prove our theory for more general families of functions. This, we show,\nprovides a unique imperative \u2013 separate from stability and variance control \u2013 to obtain importance\nweights via optimal balancing: while no single unbiased importance weights exist when we have\nproxies instead of true confounders, optimal balancing obtains weights in a model-robust manner that\nensures consistency.\nFinally there has been a long history of work in causal inference using proxies for true confounders\n[11, 42]. As in our problem setup, much of this work is based on the model of using an identi\ufb01ed latent\nvariable model for the proxies [9, 27, 34, 37, 43]. Some recent work on this problem involves using\ntechniques such as matrix completion [22] or variational autoencoders [30] to infer confounders from\nthe proxies. Additionally, there is recent work on robustness to unidenti\ufb01ability due to unobserved\nconfounding [20, 23, 24]; in contrast we focus on a setting where, while confounders are unobserved,\neffects are identi\ufb01able via proxies. In particular, there is a variety of work that studies suf\ufb01cient\nconditions for the identi\ufb01ability of latent confounder models [6, 34, 37]. Our work is complementary\nto this line of research in that we assume access to an accurate latent confounder model, but do not\nstudy how to estimate such models. 
Furthermore, our work is novel in combining proxy variable models with optimal balancing and applying this to finding importance weights for policy evaluation.

3 Weight-Balancing Objectives

3.1 Infeasibility of IPS-Style Unbiased Weighting

If we had unconfoundedness given X (i.e., Y(t) ⊥ T | X), the IPS weights π_T(X)/η_T(X) are immediately obtained as the solution to making every term in the weighted sum Eq. (2) unbiased:

E[W(X, T) 1{T = t} Y(t)] = E[π_t(X) Y(t)].   (4)

Notably, the IPS weights do not depend on the outcome function. However, without unconfoundedness given X, and given only Assumptions 1 to 3, this approach fails.

Theorem 1. If W(x, t) satisfies Eq. (4) then for any t ∈ {1, . . . , m}

W(X, t) = [π_t(X) Σ_{t'=1}^m η_{t'}(X) κ_t(X, t') + Ω_t(X)] / [η_t(X) κ_t(X, t)],   (5)

for some Ω_t(x) such that E[Ω_t(X)] = 0 for all t.

The proof of Theorem 1 is given in Appendix A.1.
Note that if we had unconfoundedness given X then κ_t(x, t') = κ_t(x, t) = ρ_t(x), so that choosing Ω_t(X) = 0 would recover the standard IPS weights. However, in our setting we generally have κ_t(x, t') ≠ κ_t(x, t), and so Theorem 1 tells us that we cannot do unbiased IPS-style weighted evaluation without knowing the mean outcome functions κ_t(x, t'). In particular, there exists no single weight function that is simultaneously unbiased for all outcome functions.
On the other hand, Theorem 1 tells us that there do exist some weights that give unbiased and consistent policy evaluation via Eq. (2) or Eq. (3): we just may not be able to calculate them.
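To see concretely how the unbiased weights of Eq. (5) depend on the outcome model, the following toy computation (with Ω_t = 0, at a single covariate value x; all numbers are made up for illustration) shows that changing κ while holding π and η fixed changes the required weight, whereas in the unconfounded case, where κ_t(x, t') is constant in t', the familiar IPS weight π_t/η_t is recovered:

```python
import numpy as np

def eq5_weight(pi_x, eta_x, kappa_x, t):
    """Eq. (5) with Omega_t = 0 at a single covariate value x (treatments 0-indexed).
    pi_x, eta_x: (m,) arrays of pi_t(x) and eta_t(x);
    kappa_x: (m, m) array with kappa_x[t, t1] = kappa_t(x, t1)."""
    numerator = pi_x[t] * (eta_x * kappa_x[t, :]).sum()
    return numerator / (eta_x[t] * kappa_x[t, t])

pi_x, eta_x = np.array([0.3, 0.7]), np.array([0.5, 0.5])
kappa_unconf = np.array([[2.0, 2.0], [1.0, 1.0]])  # kappa_t(x, t') constant in t'
kappa_conf = np.array([[2.0, 4.0], [1.0, 1.0]])    # confounded: depends on t'
print(eq5_weight(pi_x, eta_x, kappa_unconf, 0))    # 0.6 = pi_0/eta_0, the IPS weight
print(eq5_weight(pi_x, eta_x, kappa_conf, 0))      # 0.9: a different outcome model
                                                   # demands a different weight
```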
The existence of such weights motivates our subsequent approach, which seeks weights that mimic these weights for a wide class of possible outcome functions.

3.2 Adversarial Error Objective

Over all weights that are functions of X_{1:n}, T_{1:n}, the optimal choice of weights for estimating τ^π via Eq. (2) would minimize the (unknown) conditional MSE (CMSE):

E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}].   (6)

In particular, the weights in Eq. (5) achieve O_p(1/n) control on this CMSE for many outcome functions, as long as the denominator is well behaved, which can be seen by applying concentration inequalities to Eq. (6). However, as discussed above, the outcome function is unknown and these weights are therefore practically infeasible. Our aim is to find weights with similar near-optimal behavior that do not depend on the particular unknown outcome function. To do this, we will find an upper bound for Eq. (6) that we can actually compute.
Let f_it = W_i 1{T_i = t} − π_t(X_i) and

J*(W, μ) = ( (1/n) Σ_{i=1}^n Σ_{t=1}^m f_it κ_t(X_i, T_i) )² + (2σ²/n²) ‖W‖₂²,

where we embedded the dependence on μ inside κ_t(x, t') = E[μ_t(Z) | X = x, T = t'].

Theorem 2. E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}] ≤ 2 J*(W, μ) + O_p(1/n).

Note that J* above is defined in terms of the true posterior φ, which is infeasible to compute in practice. Therefore, we define J by replacing the conditional measure of Z given X and T used to compute κ_t in J* with the approximate measure given by φ̂. Note that in the remainder of this section and the corresponding proofs we make a slight abuse of notation by defining J in terms of κ_t; these κ_t terms should be interpreted below as defined in terms of the approximate conditional measure given by φ̂.
Then, applying the fact that φ̂ = φ + O_p(1/√n) in L1, we obtain the following corollary, which follows trivially from Theorem 2 and Slutsky's theorem:

Corollary 1. E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}] ≤ 2 J(W, μ) + O_p(1/√n) √(J(W, μ)) + O_p(1/n).

Therefore, if we find weights that obtain O_p(1/n) control on J(W, μ), we can ensure that we also have O_p(1/n) control on E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}]. Combined with the following result, which follows from [13, Lemma 31], this would give root-n consistent estimation.

Lemma 1. If E[(τ̂^π_W − τ^π)² | X_{1:n}, T_{1:n}] = O_p(1/n) then τ̂^π_W = τ^π + O_p(1/√n).

It remains to find weights that control J(W, μ). The key obstacle is that μ is unknown. Instead, we show how we can obtain weights that control J(W, μ) over a whole class of given functions μ. Suppose we are given a set F of functions mapping Z to R^m, where each μ ∈ F corresponds to a vector of mean outcome functions μ = (μ_1, . . . , μ_m). Then, motivated by Theorem 2 and Lemma 1, we define our adversarial optimization problem as

W* = argmin_{W ∈ W} sup_{μ ∈ F} J(W, μ).   (7)

One question the reader might ask at this point is why not solve the above optimization problem by ignoring the hidden confounders and directly balancing the conditional mean outcome functions κ_t(x, t')? The problem is that this would be impossible to do over any kind of generic flexible function space, since we have no data corresponding to terms of the form κ_t(x, t') when t ≠ t', so this is akin to an overlap problem.
Conversely, if we were to ignore the conditioning on t and balance against functions of the form κ_t(x) = κ_t(x, t), this would be inadequate, as we could not hope for such a space to cover the true μ since we do not assume ignorability given X.
In light of these limitations, we can view what we are doing in optimizing Eq. (7), using an identified model φ̂(z; x, t), as implicitly balancing some controlled space of functions κ_t(x, t') that do not have this overlap issue between different t values. The following lemma makes this explicit, as it implies that the terms κ_t(x, t') are all mutually bounded by each other for fixed x and t:

Lemma 2. Assuming ‖μ_t‖_∞ ≤ b, under Assumption 2, for all x ∈ X and t, t', t'' ∈ {1, . . . , m} we have

|κ_t(x, t'')| ≤ (η_{t'}(x) / η_{t''}(x)) √(8b E[e_t^{-2}(Z) | X = x, T = t']) |κ_t(x, t')|.

3.3 Consistency of Adversarially Optimal Estimator

Now we analyze the consistency of our weighted estimator based on Eq. (7). Given Lemma 1, all we need to justify to prove consistency is that μ ∈ F and that inf_W sup_{μ∈F} J(W, μ) = O_p(1/n). Define F_t as the space of all functions for treatment level t allowed by F; that is, F_t = {μ_t : ∃(μ'_1, . . . , μ'_m) ∈ F with μ'_t = μ_t}. We will use the following assumptions about F to prove control of J:

Assumption 4 (Normed). For each t ∈ {1, . . . , m} there exists a norm ‖·‖_t on span(F_t), and there exists a norm ‖·‖ on span(F) which is defined, given some R^m norm, as ‖μ‖ = ‖(‖μ_1‖_1, . . . , ‖μ_m‖_m)‖.

Assumption 5 (Absolutely Star Shaped). For every μ ∈ F and |λ| ≤ 1, we have λμ ∈ F.

Assumption 6 (Convex Compact). F is convex and compact.

Assumption 7 (Square Integrable). For each t ∈ {1, . . . , m} the space F_t is a subset of L²(Z), and its norm dominates the L² norm (i.e., inf_{μ_t ∈ F_t} ‖μ_t‖ / ‖μ_t‖_{L²} > 0).

Assumption 8 (Nondegeneracy). Define B(δ) = {μ ∈ span(F) : ‖μ‖ ≤ δ}. Then we have B(δ) ⊆ F for some δ > 0.

Assumption 9 (Boundedness). sup_{μ ∈ F} ‖μ‖_∞ < ∞.

Definition 2 (Rademacher Complexity). R_n(F) = E[sup_{f ∈ F} (1/n) Σ_{i=1}^n ε_i f(Z_i)], where the ε_i are iid Rademacher random variables.

Assumption 10 (Complexity). For each t ∈ {1, . . . , m} we have R_n(F_t) = o(1).

These assumptions are satisfied by many commonly-used families of functions, such as RKHSs and families of neural networks. We shall prove this claim for RKHSs in Section 4.
In order to justify that we can control J, we first show that these assumptions allow us to reverse the order of minimization and maximization in our optimization problem. This means we can reduce the problem to finding weights that control any particular μ rather than controlling all of F.

Lemma 3. Let B(W, μ) = (1/n) Σ_{i=1}^n Σ_{t=1}^m f_it κ_t(X_i, T_i). Then, under Assumptions 5 to 7, for every M > 0 we have the bound

min_W sup_{μ∈F} J(W, μ) ≤ sup_{μ∈F} min_{‖W‖₂ ≤ M} B(W, μ)² + (2σ²/n²) M².

Next, we note that Lemma 3 means that we can choose weights given μ to set B(W, μ) = 0, and therefore we have our desired control as long as we can justify that these weights have controlled Euclidean norm. Using this strategy and optimizing for the minimum-norm weights of this kind, we are able to prove the following:

Lemma 4. Under Assumptions 4 to 10 we have inf_W sup_{μ∈F} J(W, μ) = O_p(1/n).

This is the key lemma in proving our main consistency theorem:

Theorem 3. Under Assumptions 4 to 10, and assuming that μ ∈ F, we have τ̂^π_{W*} = τ^π + O_p(1/√n).

This theorem follows immediately from our previous results, since μ ∈ F and Lemma 4 imply that J(W*, μ) = O_p(1/n). This combined with Corollary 1 implies that E[(τ̂^π_{W*} − τ^π)² | X_{1:n}, T_{1:n}] = O_p(1/n), which in turn combined with Lemma 1 gives us our result. Furthermore, given some additional assumptions, we can take advantage of L∞ universal approximation of μ to obtain the following corollary, which does not depend on the assumption that μ ∈ F:

Assumption 11 (Continuous Mean Outcome). For each t, μ_t is a continuous function of Z.

Assumption 12 (Universal Approximation). F_m is a universally approximating sequence of function classes. That is, for every vector of continuous functions μ and every m, there exists f ∈ F_m such that ‖f_t − μ_t‖_∞ ≤ ε_m for each t, with ε_m → 0.

Corollary 2. Under Assumptions 4 to 12, and using a universally approximating sequence of function classes F_n to compute W*, we have τ̂^π_{W*} = τ^π + o_p(1).

This corollary follows from observing that, by the universally approximating property of F_n, we can obtain an O_p(1/√n) + ε error bound for every ε > 0 (where the O_p term's constants can depend on ε). The o_p(1) error bound then follows trivially. Note that the universal approximation property of Assumption 12 is obtainable for many classes of functions, such as Gaussian RKHSs [35].

4 Algorithms for Optimal Kernel Balancing

4.1 Kernel Function Class

We now provide an algorithm for optimal balancing when our function class consists of vectors of RKHS functions. Formally, given a kernel K and corresponding RKHS norm ‖·‖_K, we define the space F_K as follows:

Definition 3 (Kernel Class).
F_K = {μ : ‖μ‖ ≤ 1}, where ‖(μ_1, . . . , μ_m)‖ = √(Σ_{t=1}^m ‖μ_t‖²_K).

Theorem 4. Assuming K is a Mercer kernel [44] and is bounded, F_K satisfies Assumptions 4 to 10.

We remark that the commonly used Gaussian kernel is both Mercer and bounded, so it satisfies the conditions of Theorem 4. Given this, and assuming that F_K covers the real mean outcome function μ, we can apply Theorem 3 to see that solving Eq. (7) using F_K gives consistent evaluation.
Note that F_K having maximum norm 1 is without loss of generality, because if we wanted the maximum norm to instead be m, we could replace the standard deviation σ by σ/m in our objective function, resulting in an equivalent re-scaled optimization problem. To make this explicit, we will replace σ² in the objective with γ, where γ is a freely chosen hyperparameter that allows for varying regularization. For ease of notation below, we define Γ to be the diagonal matrix such that Γ_ii = γ for every i.

4.2 Kernel Balancing Algorithm

In order to optimize Eq. (7) over a class of kernel functions as defined by Definition 3, we can observe that the definition of J(W, μ) looks very similar to the adversarial objective of Kallus [16], except that we have κ_t(X_i, T_i) terms instead of μ_t(X_i) terms. This motivates the idea that, given our identified posterior model φ̂(z; x, t), we may be able to employ a similar quadratic programming (QP)-based approach. The following theorem makes this explicit, by defining a QP objective for W that we can approximate by sampling from φ̂:

Theorem 5. Define Q_ij = E[K(Z_i, Z'_j)], G_ij = (1/n²)(Q_ij 1{T_i = T_j} + Γ_ij), and a_i = (2/n²) Σ_{j=1}^n Q_ij π_{T_i}(X_j), where for each i, Z_i and Z'_i are iid shadow variables, and the expectation is defined conditional on the observed data using the approximate posterior φ̂. Then for some c that is constant in W we have the identity

sup_{μ ∈ F_K} J(W, μ) = WᵀGW − aᵀW + c.

Given this, our balancing algorithm is natural and straightforward, and is summarized in Algorithm 1. Note that we provide an optional weight space constraint W in this algorithm, since standard weighted estimator approaches to policy evaluation regularize by forcing constraints such as W ∈ nΔ_n. Under this kind of constraint our unconstrained QP becomes an LCQP. However, our theory does not support this constraint, and we find that it hurts performance in practice, especially when γ is large, so we do not use this constraint in our main experiments.

Algorithm 1 Optimal Kernel Balancing
Input: Data (X_{1:n}, T_{1:n}), policy π, kernel function K, posterior density φ̂, regularization matrix Γ, number of samples B, optional weight space W (defaults to R^n if not provided)
Output: Optimal balancing weights W_{1:n}
1: for i ∈ {1, . . . , n} do
2:   Sample Data. Draw B data points Z_i^b from the posterior φ̂(·; X_i, T_i)
3: end for
4: Estimate Q. Calculate Q_ij = (1/B²) Σ_{b=1}^B Σ_{c=1}^B K(Z_i^b, Z_j^c)
5: Calculate QP Inputs. Calculate G_ij = Q_ij 1{T_i = T_j} + Γ_ij and a_i = 2 Σ_{j=1}^n Q_ij π_{T_i}(X_j)
6: Solve Quadratic Program. Calculate W = argmin_{W ∈ W} WᵀGW − aᵀW

5 Experiments

5.1 Experimental Setup

We now present a brief set of experiments to explore our methodology. The aim of these experiments is to be a proof of concept of our theory. We seek to show that, given an identified posterior model φ̂ as discussed in Section 2.1, evaluation using the weights defined by Eq. (7) can give unbiased policy evaluation even in the face of sufficiently strong confounding where standard benchmark approaches that rely on ignorability given X fail.
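Treating the posterior φ̂ as a black-box sampler and using a Gaussian kernel, the unconstrained case of Algorithm 1 can be sketched as below; the final step uses the closed form argmin_W WᵀGW − aᵀW = ½G⁻¹a, valid since G is positive definite here. This is a sketch under these interface assumptions, not the authors' implementation:

```python
import numpy as np

def gaussian_gram(A, B, bandwidth=1.0):
    """Gaussian kernel Gram matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def optimal_kernel_balancing(X, T, pi, posterior_sampler, gamma=0.2, B=50, seed=0):
    """Sketch of Algorithm 1 with unconstrained weight space (W = R^n)."""
    rng = np.random.default_rng(seed)
    n = len(T)
    # Steps 1-3: draw B posterior samples Z_i^b ~ phi_hat(. ; X_i, T_i) per unit.
    Zs = np.stack([posterior_sampler(X[i], T[i], B, rng) for i in range(n)])  # (n, B, p)
    # Step 4: Q_ij = (1/B^2) sum_{b,c} K(Z_i^b, Z_j^c). (Builds an (nB, nB) Gram
    # matrix, so this brute-force version is only meant for small n.)
    flat = Zs.reshape(n * B, -1)
    Q = gaussian_gram(flat, flat).reshape(n, B, n, B).mean(axis=(1, 3))
    # Step 5: G_ij = Q_ij 1{T_i = T_j} + Gamma_ij and a_i = 2 sum_j Q_ij pi_{T_i}(X_j).
    G = Q * (T[:, None] == T[None, :]) + gamma * np.eye(n)
    probs = pi(X)                                # (n, m) policy probabilities
    a = 2.0 * (Q * probs[:, T].T).sum(axis=1)    # probs[:, T].T[i, j] = pi_{T_i}(X_j)
    # Step 6: closed-form minimizer of the strictly convex unconstrained QP.
    return 0.5 * np.linalg.solve(G, a)
```

The resulting weights can then be plugged into the weighted estimator of Eq. (2) or the doubly robust estimator of Eq. (3).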
We experiment with the following generalized linear model-style scenario:

Z ~ N(0, 1)
X ~ N(αᵀZ + α₀, Σ_X)
P_T = βᵀZ + β₀
T ~ softmax(P_T)
W(t) ~ N(ζ(t)ᵀZ + ζ₀(t), Σ_Y)
Y(t) = g(W(t))

In our experiments Z is 1-dimensional, X is 10-dimensional, and we have two possible treatment levels (m = 2). We experiment with a parametric policy π and multiple link functions g as follows:

π_t(X) = exp(θ_tᵀX) / (exp(θ_1ᵀX) + exp(θ_2ᵀX))

linear: g(w) = w    cubic: g(w) = w³    exp: g(w) = exp(w)    step: g(w) = 3·1{w ≥ 0} − 6

We experiment with the following methods in this evaluation:

1. OptZ: Our method, using Γ = γ·Identity(n) for γ ∈ {0.001, 0.2, 1.0, 5.0}.
2. IPS: IPS weights based on X using estimated η̂_t.
3. OptX: The optimal weighting method of Kallus [16] with the same values of γ as our method.
4. DirX: Direct method fitting ρ̂_t(x), incorrectly assuming ignorability given X.
5. DirZ: Direct method first fitting μ̂_t using posterior samples from φ̂, then using the estimate ρ̂_t(x) = (1/D) Σ_{i=1}^D μ̂_t(z'_i), where the z'_i are sampled from φ̂(·; x, t).
6. D:W: Doubly robust estimation using direct estimator D and weighted estimator W.

Finally, we detail all choices of scenario parameters in Appendix B.1 and provide implementation details of our methods in Appendix B.2.²

5.2 Results

We display results for our experiments using the step link function in Tables 1 and 2. For each of n ∈ {200, 500, 1000, 2000} we estimate the RMSE of policy evaluation using each method, as well as doubly robust evaluation using our best performing weights, by averaging over 64 runs. In addition, in Tables 3 and 4 we display the estimated bias from the evaluations.
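The generative scenario of Section 5.1 can be simulated in a few lines; this is a hedged sketch in which the coefficient distributions and values are made up for illustration (the paper's actual parameter choices are in Appendix B.1):

```python
import numpy as np

def simulate_scenario(n, g=lambda w: w, seed=0):
    """Sketch of the Section 5.1 scenario with m = 2 treatments, 1-dimensional Z,
    and 10-dimensional X. All coefficients are arbitrary placeholders; g is the
    link function (linear by default; the paper also uses cubic, exp, and step)."""
    rng = np.random.default_rng(seed)
    q, m = 10, 2
    alpha, alpha0 = rng.normal(size=q), rng.normal(size=q)
    beta, beta0 = rng.normal(size=m), rng.normal(size=m)
    zeta, zeta0 = rng.normal(size=m), rng.normal(size=m)
    Z = rng.normal(size=n)                                       # Z ~ N(0, 1), latent
    X = alpha * Z[:, None] + alpha0 + rng.normal(size=(n, q))    # proxies of Z
    logits = beta * Z[:, None] + beta0                           # P_T = beta^T Z + beta_0
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    T = (rng.random(n) < probs[:, 1]).astype(int)                # T ~ softmax(P_T)
    W_lat = zeta * Z[:, None] + zeta0 + rng.normal(size=(n, m))  # W(t)
    Y_all = g(W_lat)                                             # Y(t) = g(W(t))
    Y = Y_all[np.arange(n), T]                                   # only Y(T) is logged
    return Z, X, T, Y
```

Note that only X, T, and Y would be handed to an evaluation method; Z is returned here solely so that oracle baselines can be checked against it.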
It is clear that the\nnaive methods that assume ignorability given X all hit a performance ceiling, where bias converges\nto some non-zero value. In particular for IPS we separately ran it on up to one million data points\nand found that the bias converged to 0.418 \u02d8 0.001. On the other hand, for our method it appears\nlike we have consistency. This is particularly evident when we look at Table 3, as bias seems to be\napproximately converging to zero with vanishing variance. We can also observe that doubly robust\nestimation using either direct method does not appear to improve performance.\nIt is noteworthy that the DirZ benchmark method fails horribly, despite being a correctly speci\ufb01ed\nregression estimate. From our experience we observed that it is dif\ufb01cult to train the \u00b5t functions\naccurately if there is a high amount of overlap in the \u02c6'p\u00a8; x, tq posteriors for \ufb01xed t. Therefore we\npostulate that in highly confounded settings this benchmark is inherently dif\ufb01cult to train using a\n\ufb01nite number of samples from \u02c6'p\u00a8; x, tq, and the result seems to collapse to degenerate solutions.\nNext we note that we observed similar trends to this using our other link functions, and other doubly\nrobust estimators. We present more extensive tables of results in Appendix C.1. In addition we\npresent some results there on the negative impact on our method\u2019s performance using the constraint\nW P nn, as mentioned in Section 4.2.\nFinally we provide some more detailed experiments investigating the impact of changing the dimen-\nsionality of Z and the level of confounding by replacing Z with X in Appendix C.2 and Appendix C.3\nrespectively. 
In brief the results of these experiments are as expected: increasing the dimensionality\nof Z gives the same overall pattern of results but with slower convergence, and increasing the level of\nconfounding strongly decreases the performance of benchmark methods, while our method appears\nto maintain its unbiasedness.\n\n6 Conclusion\n\nWe presented theory for how to do optimal balancing for policy evaluation when we only have\nproxies for the true confounders, given an already identi\ufb01ed model for the confounders, treatment,\nand proxies, but not for the outcomes. We provided an adversarial objective for selecting optimal\nweights given some class of mean outcome functions to be balanced, and proved that under mild\nconditions these optimal weights result in consistent policy evaluation. In addition, we presented\na tractable algorithm for minimizing this objective when our function class is an RKHS, and we\nconducted a series of experiments to demonstrate that our method can achieve consistent evaluation\neven under suf\ufb01cient levels of confounding where standard approaches fail.\nFor future work we note that the adversarial objective and theory presented here is fairly general,\nand could be used to develop new algorithms for balancing different function classes such as neural\nnetworks. Indeed neural networks with weight decay easily satisfy the conditions of both Theorem 3\nand Corollary 2, and thus it might be possible to learn balancing weights by optimizing a GAN-like\nobjective. 
An alternative direction would be to study how best to apply this methodology when an identified model is not already given.

²Code available online at https://github.com/CausalML/LatentConfounderBalancing.

n    | OptZ0.001 | OptZ0.2   | OptZ1.0   | OptZ5.0   | DirX:OptZ0.001 | DirZ:OptZ0.001
200  | .39 ± .07 | .24 ± .02 | .36 ± .02 | .81 ± .02 | .57 ± .06      | .41 ± .07
500  | .19 ± .02 | .18 ± .02 | .23 ± .02 | .49 ± .02 | .55 ± .03      | .20 ± .02
1000 | .11 ± .01 | .11 ± .01 | .13 ± .01 | .27 ± .01 | .49 ± .02      | .11 ± .01
2000 | .08 ± .01 | .08 ± .01 | .09 ± .01 | .17 ± .01 | .48 ± .01      | .08 ± .01

Table 1: Convergence of RMSE for policy evaluation using our weights.

n    | IPS       | OptX0.001 | OptX0.2   | OptX1.0   | OptX5.0   | DirX      | DirZ
200  | .47 ± .03 | 2.0 ± .03 | 2.1 ± .03 | 2.3 ± .02 | 2.5 ± .02 | .52 ± .02 | 2.6 ± .02
500  | .48 ± .03 | 2.0 ± .02 | 2.1 ± .02 | 2.3 ± .02 | 2.6 ± .02 | .48 ± .02 | 2.6 ± .01
1000 | .39 ± .02 | 2.0 ± .01 | 2.1 ± .01 | 2.3 ± .01 | 2.5 ± .01 | .48 ± .02 | 2.6 ± .01
2000 | .40 ± .01 | 2.0 ± .01 | 2.1 ± .01 | 2.3 ± .01 | 2.5 ± .01 | .45 ± .02 | 2.6 ± .01

Table 2: Convergence of RMSE for benchmark methods.

n    | OptZ0.001 | OptZ0.2   | OptZ1.0   | OptZ5.0   | DirX:OptZ0.001 | DirZ:OptZ0.001
200  | .03 ± .39 | .11 ± .21 | .29 ± .21 | .78 ± .18 | .43 ± .38      | .05 ± .40
500  | .09 ± .17 | .10 ± .15 | .17 ± .16 | .47 ± .15 | .51 ± .19      | .10 ± .18
1000 | .02 ± .11 | .05 ± .09 | .08 ± .09 | .25 ± .09 | .47 ± .13      | .04 ± .11
2000 | .03 ± .07 | .05 ± .06 | .07 ± .07 | .16 ± .07 | .47 ± .09      | .03 ± .07

Table 3: Convergence of bias for policy evaluation using our weights.

n    | IPS       | OptX0.001 | OptX0.2   | OptX1.0   | OptX5.0   | DirX      | DirZ
200  | .40 ± .25 | 1.9 ± .21 | 2.1 ± .20 | 2.3 ± .19 | 2.5 ± .18 | .49 ± .18 | 2.6 ± .14
500  | .43 ± .21 | 2.0 ± .16 | 2.1 ± .15 | 2.3 ± .14 | 2.6 ± .13 | .45 ± .16 | 2.6 ± .12
1000 | .37 ± .12 | 2.0 ± .10 | 2.1 ± .09 | 2.3 ± .09 | 2.5 ± .08 | .46 ± .15 | 2.6 ± .11
2000 | .39 ± .10 | 2.0 ± .08 | 2.1 ± .07 | 2.3 ± .07 | 2.5 ± .07 | .42 ± .17 | 2.6 ± .11

Table 4: Convergence of bias for benchmark methods.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1846210. This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.

References

[1] S. Athey, G. W. Imbens, and S. Wager. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623, 2018.

[2] P. C. Austin and E. A. Stuart. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28):3661–3679, 2015.

[3] D. Bertsimas, N. Kallus, A. M. Weinstein, and Y. D. Zhuo. Personalized diabetes management using electronic medical records.
Diabetes Care, 40(2):210–217, 2017.

[4] A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138. ACM, 2009.

[5] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chickering, E. Portugaly, D. Ray, P. Y. Simard, and E. Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.

[6] Z. Cai and M. Kuroki. On identifying total effects in the presence of latent variables and selection bias. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 62–69. AUAI Press, 2008.

[7] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

[8] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the 28th International Conference on Machine Learning, pages 1097–1104. Omnipress, 2011.

[9] J. K. Edwards, S. R. Cole, and D. Westreich. All your data are always missing: incorporating bias due to measurement error into the potential outcomes framework. International Journal of Epidemiology, 44(4):1452–1459, 2015.

[10] M. R. Elliott. Model averaging methods for weight trimming. Journal of Official Statistics, 24(4):517, 2008.

[11] P. A. Frost. Proxy variables and specification bias. The Review of Economics and Statistics, pages 323–325, 1979.

[12] E. L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.

[13] N. Kallus. Generalized optimal matching methods for causal inference.
arXiv preprint arXiv:1612.08321, 2016.

[14] N. Kallus. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning (ICML), pages 1789–1798, 2017.

[15] N. Kallus. A framework for optimal matching for causal inference. In Artificial Intelligence and Statistics (AISTATS), pages 372–381, 2017.

[16] N. Kallus. Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pages 8895–8906, 2018.

[17] N. Kallus. Optimal a priori balance in the design of controlled experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):85–112, 2018.

[18] N. Kallus. Discussion: "Entropy learning for dynamic treatment regimes". Statistica Sinica, 29(4):1697–1705, 2019.

[19] N. Kallus and M. Uehara. Intrinsically efficient, stable, and bounded off-policy evaluation for reinforcement learning. In Advances in Neural Information Processing Systems, 2019.

[20] N. Kallus and A. Zhou. Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pages 9269–9279, 2018.

[21] N. Kallus and A. Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243–1251, 2018.

[22] N. Kallus, X. Mao, and M. Udell. Causal inference with noisy and missing covariates via matrix factorization. In Advances in Neural Information Processing Systems, pages 6921–6932, 2018.

[23] N. Kallus, A. M. Puli, and U. Shalit. Removing hidden confounding by experimental grounding. In Advances in Neural Information Processing Systems, pages 10888–10897, 2018.

[24] N. Kallus, X. Mao, and A. Zhou. Interval estimation of individual-level causal effects under unobserved confounding.
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2281–2290, 2019.

[25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] A. Kube, S. Das, and P. J. Fowler. Allocating interventions based on predicted outcomes: A case study on homelessness services. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019.

[27] M. Kuroki and J. Pearl. Measurement bias and effect restoration in causal inference. Biometrika, 101(2):423–437, 2014.

[28] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[29] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297–306. ACM, 2011.

[30] C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. Causal effect inference with deep latent-variable models. In Advances in Neural Information Processing Systems, pages 6446–6456, 2017.

[31] J. K. Lunceford and M. Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19):2937–2960, 2004.

[32] T. Mandel, Y.-E. Liu, S. Levine, E. Brunskill, and Z. Popovic. Offline policy evaluation across representations with applications to educational games. In Proceedings of the International Conference on Autonomous Agents and Multi-agent Systems, pages 1077–1084. International Foundation for Autonomous Agents and Multiagent Systems, 2014.

[33] S. Mendelson. On the performance of kernel classes. Journal of Machine Learning Research, 4(Oct):759–771, 2003.

[34] W. Miao, Z. Geng, and E. T. Tchetgen.
Identifying causal effects with proxy variables of an unmeasured confounder. arXiv preprint arXiv:1609.08816, 2016.

[35] C. A. Micchelli, Y. Xu, and H. Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.

[36] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

[37] J. Pearl. On measurement bias in causal inference. arXiv preprint arXiv:1203.3504, 2012.

[38] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.

[39] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

[40] A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, pages 814–823, 2015.

[41] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231–3239, 2015.

[42] M. R. Wickens. A note on the use of proxy variables. Econometrica: Journal of the Econometric Society, pages 759–761, 1972.

[43] J. M. Wooldridge. On estimating firm-level production functions using proxy variables to control for unobservables. Economics Letters, 104(3):112–114, 2009.

[44] D.-X. Zhou. The covering number in learning theory. Journal of Complexity, 18(3):739–767, 2002.