{"title": "Balanced Policy Evaluation and Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 8895, "page_last": 8906, "abstract": "We present a new approach to the problems of evaluating and learning personalized decision policies from observational data of past contexts, decisions, and outcomes. Only the outcome of the enacted decision is available and the historical policy is unknown. These problems arise in personalized medicine using electronic health records and in internet advertising. Existing approaches use inverse propensity weighting (or, doubly robust versions) to make historical outcome (or, residual) data look like it were generated by a new policy being evaluated or learned. But this relies on a plug-in approach that rejects data points with a decision that disagrees with the new policy, leading to high variance estimates and ineffective learning. We propose a new, balance-based approach that too makes the data look like the new policy but does so directly by finding weights that optimize for balance between the weighted data and the target policy in the given, finite sample, which is equivalent to minimizing worst-case or posterior conditional mean square error. Our policy learner proceeds as a two-level optimization problem over policies and weights. We demonstrate that this approach markedly outperforms existing ones both in evaluation and learning, which is unsurprising given the wider support of balance-based weights. 
We establish extensive theoretical consistency guarantees and regret bounds that support this empirical success.", "full_text": "Balanced Policy Evaluation and Learning

Nathan Kallus

Cornell University and Cornell Tech

kallus@cornell.edu

Abstract

We present a new approach to the problems of evaluating and learning personalized decision policies from observational data of past contexts, decisions, and outcomes. Only the outcome of the enacted decision is available and the historical policy is unknown. These problems arise in personalized medicine using electronic health records and in internet advertising. Existing approaches use inverse propensity weighting (or, doubly robust versions) to make historical outcome (or, residual) data look like it were generated by a new policy being evaluated or learned. But this relies on a plug-in approach that rejects data points with a decision that disagrees with the new policy, leading to high variance estimates and ineffective learning. We propose a new, balance-based approach that too makes the data look like the new policy but does so directly by finding weights that optimize for balance between the weighted data and the target policy in the given, finite sample, which is equivalent to minimizing worst-case or posterior conditional mean square error. Our policy learner proceeds as a two-level optimization problem over policies and weights. We demonstrate that this approach markedly outperforms existing ones both in evaluation and learning, which is unsurprising given the wider support of balance-based weights. We establish extensive theoretical consistency guarantees and regret bounds that support this empirical success.

1 Introduction

Using observational data with partially observed outcomes to develop new and effective personalized decision policies has received increased attention recently [1, 7, 8, 13, 23, 29, 41–43, 45]. 
The aim is to transform electronic health records to personalized treatment regimes [6], transactional records to personalized pricing strategies [5], and click- and “like”-streams to personalized advertising campaigns [8] – problems of great practical significance. Many of the existing methods rely on a reduction to weighted classification via a rejection and importance sampling technique related to inverse propensity weighting and to doubly robust estimation. However, inherent in this reduction are several shortcomings that lead to reduced personalization efficacy: it involves a naïve plug-in estimation of a denominator nuisance parameter, leading either to high variance or to scarcely-motivated stopgaps; it necessarily rejects a significant number of observations, leading in effect to smaller datasets; and it proceeds in a two-stage approach that is unnatural for the single learning task.
In this paper, we attempt to ameliorate these issues using a new approach that directly optimizes for the balance that is achieved only on average or asymptotically by the rejection and importance sampling approach. We demonstrate that this new approach provides improved performance and explain why. And, we provide extensive theory to characterize the behavior of the new methods. The proofs are given in the supplementary material.

1.1 Setting, Notation, and Problem Description

The problem we consider is how to choose the best of m treatments based on an observation of covariates x ∈ X ⊆ R^d (also known as a context). An instance is characterized by the random variables X ∈ X and Y(1), ..., Y(m) ∈ R, where X denotes the covariates and Y(t) for t ∈ [m] = {1, ..., m} is the outcome that would be derived from applying treatment t.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We always assume that smaller outcomes are preferable, i.e., Y(t) corresponds to costs or negative rewards.
A policy is a map π : X → Δ_m from observations of covariates to a probability vector in the m-simplex Δ_m = {p ∈ R^m_+ : Σ_{t=1}^m p_t = 1}. Given an observation of covariates x, the policy π specifies that treatment t should be applied with probability π_t(x). There are two key tasks of interest: policy evaluation and policy learning. In policy evaluation, we wish to evaluate the performance of a given policy based on historical data. This is also known as off-policy evaluation, highlighting the fact that the historical data was not necessarily generated by the policy in question. In policy learning, we wish to determine a policy that has good performance.
We consider doing both tasks based on data consisting of n passive, historical observations of covariate, treatment, and outcome: S_n = {(X_1, T_1, Y_1), ..., (X_n, T_n, Y_n)}, where the observed outcome Y_i = Y_i(T_i) corresponds only to the treatment T_i historically applied. We use the notation X_{1:n} to denote the data tuple (X_1, ..., X_n). The data is assumed to be iid. That is, the data is generated by drawing from a stationary population of instances (X, T, Y(1), ..., Y(m)) and observing a censored form of this draw given by (X, T, Y(T)).¹ From the (unknown) joint distribution of (X, T) in the population, we define the (unknown) propensity function φ_t(x) = P(T = t | X = x) = E[δ_{Tt} | X = x], where δ_{st} = I[s = t] is the Kronecker delta. And, from the (unknown) joint distribution of (X, Y(t)) in the population, we define the (unknown) mean-outcome function µ_t(x) = E[Y(t) | X = x]. We use the notation φ(x) = (φ_1(x), ..., φ_m(x)) and µ(x) = (µ_1(x), ..., µ_m(x)).
Apart from being iid, we also assume the data satisfies unconfoundedness:
Assumption 1. For each t ∈ [m]: Y(t) is independent of T given X, i.e., Y(t) ⊥⊥
T | X.
This assumption is equivalent to there being a logging policy φ that generated the data by prescribing treatment t with probability φ_t(X_i) to each instance i and recording the outcome Y_i = Y_i(T_i). Therefore, especially in the case where the logging policy φ_t is in fact known to the user, the problem is often called learning from logged bandit feedback [41, 42].
In policy evaluation, given a policy π, we wish to estimate its sample-average policy effect (SAPE),

SAPE(π) = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) µ_t(X_i),

by an estimator τ̂(π) = τ̂(π; X_{1:n}, T_{1:n}, Y_{1:n}) that depends only on the observed data and the policy π. The SAPE quantifies the average outcome that a policy π would induce in the sample and hence measures its risk. SAPE is strongly consistent for the population-average policy effect (PAPE):

PAPE(π) = E[SAPE(π)] = E[Σ_{t=1}^m π_t(X) µ_t(X)] = E[Y(T̃_π(X))],

where T̃_π(x) is defined as π's random draw of treatment when X = x, T̃_π(x) ∼ Multinomial(π(x)). Moreover, if π* is such that π*_t(x) > 0 ⟺ t ∈ argmin_{s∈[m]} µ_s(x), then R̂(π) = SAPE(π) − SAPE(π*) is the regret of π [10]. The policy evaluation task is closely related to causal effect estimation [19] where, for m = 2, one is interested in estimating the sample and population average treatment effects: SATE = (1/n) Σ_{i=1}^n (µ_2(X_i) − µ_1(X_i)), PATE = E[SATE] = E[Y(2) − Y(1)].
In policy learning, we wish to find a policy π̂ that achieves small outcomes, i.e., small SAPE and PAPE. The optimal policy π* minimizes both SAPE(π) and PAPE(π) over all functions X → Δ_m.

1.2 Existing Approaches and Related Work

The so-called “direct” approach fits regression estimates µ̂_t of µ_t on each dataset {(X_i, Y_i) : T_i = t}, t ∈ [m]. 
Given these estimates, it estimates SAPE in a plug-in fashion:

τ̂_direct(π) = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) µ̂_t(X_i).

A policy is learned either by π̂_direct(x) = argmin_{t∈[m]} µ̂_t(x) or by minimizing τ̂_direct(π) over π ∈ Π [33]. However, direct approaches may not generalize as well as weighting-based approaches [7].
Weighting-based approaches seek weights based on covariate and treatment data, W(π) = W(π; X_{1:n}, T_{1:n}), that make the outcome data, when reweighted, look as though it were generated by the policy being evaluated or learned, giving rise to estimators that have the form

τ̂_W = (1/n) Σ_{i=1}^n W_i Y_i.

¹Thus, although the data is iid, the t-treated sample {i : T_i = t} may differ systematically from the t′-treated sample {i : T_i = t′} for t ≠ t′, i.e., not necessarily just by chance as in a randomized controlled trial (RCT).

Bottou et al. [8], e.g., propose to use inverse propensity weighting (IPW). Noting that SAPE(π) = E[(1/n) Σ_{i=1}^n Y_i × π_{T_i}(X_i)/φ_{T_i}(X_i) | X_{1:n}] [17, 18], one first fits a probabilistic classification model φ̂ to {(X_i, T_i) : i ∈ [n]} and then estimates SAPE in an alternate but also plug-in fashion:

τ̂_IPW(π) = τ̂_{W^IPW(π)}, where W^IPW_i(π) = π_{T_i}(X_i)/φ̂_{T_i}(X_i).

For a deterministic policy, π_t(x) ∈ {0, 1}, this can be interpreted as a rejection and importance sampling approach [29, 41]: reject samples where the observed treatment does not match π's recommendation and up-weight those that do by the inverse (estimated) propensity. For deterministic policies π_t(x) ∈ {0, 1}, we have that π_T(X) = δ_{T, T̃_π(X)} is the complement of the 0-1 loss of π(X) in predicting T. 
By scaling and constant shifts, one can therefore reduce minimizing τ̂_IPW(π) over policies π ∈ Π to minimizing a weighted classification loss over classifiers π ∈ Π, providing a reduction to weighted classification [7, 45].
Given both µ̂(x) and φ̂(x) estimates, Dudík et al. [13] propose a weighting-based approach that combines the direct and IPW approaches by adapting the doubly robust (DR) estimator [11, 34, 35, 38]:

τ̂_DR(π) = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) µ̂_t(X_i) + (1/n) Σ_{i=1}^n (Y_i − µ̂_{T_i}(X_i)) π_{T_i}(X_i)/φ̂_{T_i}(X_i).

τ̂_DR(π) can be understood either as debiasing the direct estimator via the reweighted residuals ε̂_i = Y_i − µ̂_{T_i}(X_i) or as denoising the IPW estimator by subtracting the conditional mean from Y_i. As its bias is multiplicative in the biases of the regression and propensity estimates, the estimator is consistent so long as one of the estimates is consistent. For policy learning, [1, 13] minimize this estimator via weighted classification. Athey and Wager [1] provide a tight and favorable analysis of the corresponding uniform consistency (and hence regret) of the DR approach to policy learning.
Based on the fact that 1 = E[π_T(X)/φ_T(X)], a normalized IPW (NIPW) estimator is given by normalizing the weights so they sum to n, a common practice in causal effect estimation [2, 31]:

τ̂_NIPW(π) = τ̂_{W^NIPW(π)}, where W^NIPW_i(π) = n W^IPW_i(π) / Σ_{i′=1}^n W^IPW_{i′}(π).

Any IPW approach is subject to considerable variance because the plugged-in propensities appear in the denominator, so that small errors can have outsize effects on the total estimate. 
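To make these weighting estimators concrete, the following sketch computes IPW, normalized, and clipped weights and a DR estimate on synthetic logged data. This is purely illustrative: the data-generating process and all names are invented here (it is not the paper's Example 1), true propensities are used in place of a fitted φ̂, and an oracle regression stands in for a fitted µ̂ just to exhibit the DR form.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic logged data (illustrative only): m = 2 treatments, logistic
# logging propensities phi_t(x), and outcomes with known mu_t(x).
n, m = 2000, 2
X = rng.normal(size=n)
p1 = 1.0 / (1.0 + np.exp(-X))                  # phi_1(x)
phi = np.column_stack([1.0 - p1, p1])          # phi_t(x), shape (n, m)
T = (rng.random(n) < p1).astype(int)
mu = np.column_stack([X**2, 1.0 - X])          # true mean outcomes mu_t(x)
Y = mu[np.arange(n), T] + 0.1 * rng.normal(size=n)

# Deterministic target policy pi: treat with t = 1 iff x > 0.
P = np.zeros((n, m)); P[np.arange(n), (X > 0).astype(int)] = 1.0

sape = np.mean(np.sum(P * mu, axis=1))         # SAPE(pi), computable since mu is known

pi_T = P[np.arange(n), T]                      # pi_{T_i}(X_i): the 0/1 rejection pattern
phi_T = phi[np.arange(n), T]

w_ipw = pi_T / phi_T                           # IPW weights (true propensities)
w_nipw = n * w_ipw / w_ipw.sum()               # NIPW: renormalized to sum to n
w_cipw = pi_T / np.maximum(0.05, phi_T)        # CIPW: propensities clipped at M = 0.05

tau_ipw = np.mean(w_ipw * Y)
mu_hat = mu                                    # oracle regression, only to show the DR form
tau_dr = (np.mean(np.sum(P * mu_hat, axis=1))
          + np.mean(w_ipw * (Y - mu_hat[np.arange(n), T])))

# With true propensities, IPW is unbiased; note how many points are rejected
# outright because pi_{T_i}(X_i) = 0 (shortcoming (2) discussed below).
assert abs(tau_ipw - sape) < 0.3
assert abs(tau_dr - sape) < 0.05
assert (w_ipw == 0).mean() > 0.2
```

The zero-weight fraction in the last assertion is exactly the rejection behavior that motivates the balance-based approach: every sample whose logged treatment disagrees with the policy contributes nothing.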
Another stopgap measure is to clip the propensities [14, 20], resulting in the clipped IPW (CIPW) estimator:

τ̂_M-CIPW(π) = τ̂_{W^M-CIPW(π)}, where W^M-CIPW_i(π) = π_{T_i}(X_i)/max{M, φ̂_{T_i}(X_i)}.

While effective in reducing variance, the practice remains ad hoc, loses the unbiasedness of IPW (with true propensities), and requires the tuning of M. For policy learning, Swaminathan and Joachims [42] propose to minimize over π ∈ Π the M-CIPW estimator plus a regularization term given by the sample variance of the estimator, which they term POEM. The sample variance scales with the level of overlap between π and T_{1:n}, i.e., the prevalence of π_{T_i}(X_i) > 0. Indeed, when the policy class Π is very flexible relative to n and if outcomes are nonnegative, then the anti-logging policy π_{T_i}(X_i) = 0 minimizes any of the above estimates. POEM avoids learning the anti-logging policy by regularizing overlap, reducing variance but limiting the novelty of π. A refinement, SNPOEM [43], uses a normalized and clipped IPW (NCIPW) estimator (and regularizes variance):

τ̂_M-NCIPW(π) = τ̂_{W^M-NCIPW(π)}, where W^M-NCIPW_i(π) = n W^M-CIPW_i(π) / Σ_{i′=1}^n W^M-CIPW_{i′}(π).

Kallus and Zhou [26] generalize the IPW approach to a continuum of treatments. Kallus and Zhou [25] suggest a minimax approach to perturbations of the weights to account for confounding factors. Kallus [23] proposes a recursive partitioning approach to policy learning, the Personalization Tree (PT) and Personalization Forest (PF), that dynamically learns both weights and policy, but still uses within-partition IPW with dynamically estimated propensities.

1.3 A Balance-Based Approach

Shortcomings in existing approaches. 
All of the above weighting-based approaches seek to reweight the historical data so that it looks as though it were generated by the policy being evaluated or learned. Similarly, the DR approach seeks to make the historical residuals look like those that would be generated under the policy in question so as to remove bias from the estimated regression model of the direct approach. However, the way these methods achieve this, through various forms and versions of inverse propensity weighting, has three critical shortcomings:

(1) By taking a simple plug-in approach for a nuisance parameter (propensities) that appears in the denominator, existing weighting-based methods are either subject to very high variance or must rely on scarcely-motivated stopgap measures such as clipping (see also [27]).

(2) In the case of deterministic policies (such as an optimal policy), existing methods all have weights that are multiples of π_{T_i}(X_i), which means that one necessarily throws away every data point i whose treatment T_i does not agree with the new policy recommendation T̃_π(X_i). This means that one is essentially using a much smaller dataset than is available, leading again to higher variance.²

(3) The existing weighting-based methods all proceed in two stages: first estimate propensities and then plug these in to a derived estimator (when the logging policy is unknown). On the one hand, this raises model specification concerns, and on the other, it is unsatisfactory when the task at hand is not inherently two-staged – we wish only to evaluate or learn policies, not to learn propensities.

A new approach. 
We propose a balance-based approach that, like the existing weighting-based methods, also reweights the historical data to make it look as though it were generated by the policy being evaluated or learned, and potentially denoises outcomes in a doubly robust fashion. But rather than doing so circuitously via a plug-in approach, we do it directly by finding weights that optimize for balance between the weighted data and the target policy in the given, finite sample. In particular, we formalize balance as a discrepancy between the reweighted historical covariate distribution and that induced by the target policy and prove that it is directly related to the worst-case conditional mean square error (CMSE) of any weighting-based estimator. Given a policy π, we then propose to choose (policy-dependent) weights W*(π) that optimize the worst-case CMSE and therefore achieve excellent balance while controlling for variance. For evaluation, we use these optimal weights to evaluate the performance of π by the estimator τ̂_{W*(π)} as well as a doubly robust version. For learning, we propose a bilevel optimization problem: minimize over π ∈ Π the estimated risk τ̂_{W*(π)} (or a doubly robust version thereof, potentially plus a regularization term), given by the weights W*(π) that minimize the estimation error. Our empirical results show the stark benefit of this approach while our main theoretical results (Thm. 6, Cor. 7) establish vanishing regret bounds.

2 Balanced Evaluation
2.1 CMSE and Worst-Case CMSE

We begin by presenting the approach in the context of evaluation. Given a policy π, consider any weights W = W(π; X_{1:n}, T_{1:n}) that are based on the covariate and treatment data. 
Given these weights we can consider both a simple weighted estimator as well as a W-weighted doubly robust estimator given a regression estimate µ̂:

τ̂_W = (1/n) Σ_{i=1}^n W_i Y_i,    τ̂_{W,µ̂} = (1/n) Σ_{i=1}^n Σ_{t=1}^m π_t(X_i) µ̂_t(X_i) + (1/n) Σ_{i=1}^n W_i (Y_i − µ̂_{T_i}(X_i)).

We can measure the risk of either such estimator as the conditional mean square error (CMSE), conditioned on all of the data upon which the chosen weights depend:

CMSE(τ̂, π) = E[(τ̂ − SAPE(π))² | X_{1:n}, T_{1:n}].

Minimal CMSE is the target of choosing weights for weighting-based policy evaluation. Basic manipulations under the unconfoundedness assumption decompose the CMSE of any weighting-based policy evaluation estimator into its conditional bias and variance:
Theorem 1. Let ε_i = Y_i − µ_{T_i}(X_i) and Σ = diag(E[ε²_1 | X_1, T_1], ..., E[ε²_n | X_n, T_n]). Define

B_t(W, π_t; f_t) = (1/n) Σ_{i=1}^n (W_i δ_{T_i t} − π_t(X_i)) f_t(X_i)    and    B(W, π; f) = Σ_{t=1}^m B_t(W, π_t; f_t).

Then we have that: τ̂_W − SAPE(π) = B(W, π; µ) + (1/n) Σ_{i=1}^n W_i ε_i.
Moreover, under Asn. 1: CMSE(τ̂_W, π) = B²(W, π; µ) + (1/n²) W^T Σ W.

²This problem is unique to policy evaluation and learning – in causal effect estimation, the IPW estimator for SATE has nonzero weights on all of the data points. For policy learning with m = 2, Athey and Wager [1] and Beygelzimer and Langford [7] minimize estimates of the form ½(τ̂(π) − τ̂(1(·) − π)) with τ̂(π) = τ̂_IPW(π) or = τ̂_DR(π). This evaluates π relative to the uniformly random policy and the resulting total weighted sums over Y_i or ε̂_i have nonzero weights whether π_{T_i}(X_i) = 0 or not. While a useful approach for reduction to weighted classification [7] or invoking semi-parametric theory [1], it only works for m = 2, has no effect on learning as the centering correction is constant in π, and, for evaluation, is not an estimator for SAPE.

Figure 1: The setting in Ex. 1. (a) X_{1:n}, T_{1:n}. (b) µ_1(x). (c) π*(x).

Table 1: Policy evaluation performance in Ex. 1

Weights W | Vanilla τ̂_W (RMSE / Bias / SD) | Doubly robust τ̂_{W,µ̂} (RMSE / Bias / SD) | ‖W‖₀
IPW, φ | 2.209 / 0.005 / 2.209 | 4.196 / 0.435 / 4.174 | 13.6 ± 2.9
IPW, φ̂ | 0.568 / 0.514 / 0.242 | 0.428 / 0.230 / 0.361 | 13.6 ± 2.9
.05-CIPW, φ | 0.581 / 0.491 / 0.310 | 0.520 / 0.259 / 0.451 | 13.6 ± 2.9
.05-CIPW, φ̂ | 0.568 / 0.514 / 0.242 | 0.428 / 0.230 / 0.361 | 13.6 ± 2.9
NIPW, φ | 0.519 / 0.181 / 0.487 | 0.754 / 0.408 / 0.634 | 13.6 ± 2.9
NIPW, φ̂ | 0.463 / 0.251 / 0.390 | 0.692 / 0.467 / 0.511 | 13.6 ± 2.9
.05-NCIPW, φ | 0.485 / 0.250 / 0.415 | 0.724 / 0.471 / 0.550 | 13.6 ± 2.9
.05-NCIPW, φ̂ | 0.463 / 0.251 / 0.390 | 0.692 / 0.467 / 0.511 | 13.6 ± 2.9
Balanced eval | 0.280 / 0.227 / 0.163 | 0.251 / 0.006 / 0.251 | 90.7 ± 3.2

Corollary 2. Let µ̂ be given such that µ̂ ⊥⊥ Y_{1:n} | X_{1:n}, T_{1:n} (e.g., trained on a split sample). Then we have that: τ̂_{W,µ̂} − SAPE(π) = B(W, π; µ − µ̂) + (1/n) Σ_{i=1}^n W_i ε_i.
Moreover, under Asn. 1: CMSE(τ̂_{W,µ̂}, π) = B²(W, π; µ − µ̂) + (1/n²) W^T Σ W.
In Thm. 1 and Cor. 2, B(W, π; µ) and B(W, π; µ − µ̂) are precisely the conditional bias in evaluating π for τ̂_W and τ̂_{W,µ̂}, respectively, and (1/n²) W^T Σ W the conditional variance for both. 
In particular, B_t(W, π_t; µ_t) or B_t(W, π_t; µ_t − µ̂_t) is the conditional bias in evaluating the effect on the instances where π assigns t. Note that for any function f_t, B_t(W, π_t; f_t) corresponds to the discrepancy between the f_t(X)-moments of the measure ν_{t,π}(A) = (1/n) Σ_{i=1}^n π_t(X_i) I[X_i ∈ A] on X and the measure ν_{t,W}(A) = (1/n) Σ_{i=1}^n W_i δ_{T_i t} I[X_i ∈ A]. The sum B(W, π; f) corresponds to the sum of moment discrepancies over the components of f = (f_1, ..., f_m) between these measures. The moment discrepancy of interest is that of f = µ or f = µ − µ̂, but neither of these is known.
Balanced policy evaluation seeks weights W that minimize a combination of imbalance, given by the worst-case value of B(W, π; f) over functions f, and variance, given by the norm of weights W^T Λ W for a specified positive semidefinite (PSD) matrix Λ. This follows a general approach introduced by [22, 24] of finding optimal balancing weights that optimize a given CMSE objective directly rather than via a plug-in approach. Any choice of ‖·‖ gives rise to a worst-case CMSE objective for policy evaluation:

E²(W, π; ‖·‖, Λ) = sup_{‖f‖≤1} B²(W, π; f) + (1/n²) W^T Λ W.

Here, we focus on ‖·‖ given by the direct product of reproducing kernel Hilbert spaces (RKHS):

‖f‖_{p,K_{1:m},γ_{1:m}} = (Σ_{t=1}^m ‖f_t‖^p_{K_t} / γ^p_t)^{1/p},

where ‖·‖_{K_t} is the norm of the RKHS given by the PSD kernel K_t(·,·) : X² → R, i.e., the unique completion of span(K_t(x,·) : x ∈ X) endowed with ⟨K_t(x,·), K_t(x′,·)⟩ = K_t(x, x′) [see 39]. We say ‖f‖_{K_t} = ∞ if f is not in the RKHS. One example of a kernel is the Mahalanobis RBF kernel: K_s(x, x′) = exp(−(x − x′)^T Ŝ^{−1}(x − x′)/s²), where Ŝ is the sample covariance of X_{1:n} and s is a parameter. 
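The per-treatment imbalance can be computed in closed form for an RKHS ball: by the standard RKHS/MMD argument, the supremum of B_t over the unit ball of the kernel's RKHS satisfies B_t² = (1/n²) v^T K_t v with v_i = W_i I[T_i = t] − π_t(X_i). A minimal sketch on synthetic data (a plain, non-Mahalanobis RBF kernel; all data invented here):

```python
import numpy as np

def rbf_gram(X, s=1.0):
    # Plain RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / s^2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / s**2)

def imbalance_sq(W, T, P, K, t):
    # Worst-case squared imbalance B_t^2 over the RKHS unit ball:
    # v_i = W_i * I[T_i = t] - pi_t(X_i), B_t^2 = (1/n^2) v^T K v.
    n = len(W)
    v = W * (T == t) - P[:, t]
    return float(v @ K @ v) / n**2

rng = np.random.default_rng(0)
n, m = 50, 2
X = rng.normal(size=(n, 2))
T = rng.integers(m, size=n)
K = rbf_gram(X)
W = np.ones(n)

# A policy reproducing the observed treatments is perfectly balanced at W = 1 ...
P_obs = np.eye(m)[T]
assert abs(imbalance_sq(W, T, P_obs, K, 0)) < 1e-10

# ... while "always assign t = 0" is imbalanced against uniform weights.
P0 = np.zeros((n, m)); P0[:, 0] = 1.0
assert imbalance_sq(W, T, P0, K, 0) > 0.0
```

The second check illustrates the point of the balancing weights: to drive this quantity down, the weights on the t-treated points must redistribute mass so their covariate distribution matches the one π induces.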
For such an RKHS product norm, we can decompose the worst-case objective into the discrepancies in each treatment as well as characterize it as a posterior (rather than worst-case) risk.
Lemma 1. Let B²_t(W, π_t; ‖·‖_{K_t}) = Σ_{i,j=1}^n (W_i δ_{T_i t} − π_t(X_i))(W_j δ_{T_j t} − π_t(X_j)) K_t(X_i, X_j) and 1/p + 1/q = 1. Then

E²(W, π; ‖·‖_{p,K_{1:m},γ_{1:m}}, Λ) = (Σ_{t=1}^m γ^q_t B^q_t(W, π_t; ‖·‖_{K_t}))^{2/q} + (1/n²) W^T Λ W.

Moreover, if p = 2 and µ_t has a Gaussian process prior [44] with mean f_t and covariance γ²_t K_t, then

CMSE(τ̂_{W,f}, π) = E²(W, π; ‖·‖_{p,K_{1:m},γ_{1:m}}, Σ),

where the CMSE marginalizes over µ. This gives the CMSE of τ̂_W for f constant or of τ̂_{W,µ̂} for f = µ̂.

The second statement in Lemma 1 suggests that, in practice, model selection of γ_{1:m}, Λ, and kernel hyperparameters such as s or even Ŝ can be done by the marginal likelihood method [see 44, Ch. 5].

2.2 Evaluation Using Optimal Balancing Weights
Our policy evaluation estimates are given by either the estimator τ̂_{W*(π;‖·‖,Λ)} or τ̂_{W*(π;‖·‖,Λ),µ̂}, where W*(π) = W*(π; ‖·‖, Λ) is the minimizer of E²(W, π; ‖·‖, Λ) over the space of all weights W that sum to n, W = {W ∈ R^n_+ : Σ_{i=1}^n W_i = n} = nΔ_n. Specifically,

W*(π; ‖·‖, Λ) ∈ argmin_{W∈W} E²(W, π; ‖·‖, Λ).

When ‖·‖ = ‖·‖_{p,K_{1:m},γ_{1:m}}, this problem is a quadratic program for p = 2 and a second-order cone program for p = 1, ∞. Both are efficiently solvable [9]. In practice, we solve these using Gurobi 7.0.
In Lemma 1, B_t(W, π_t; ‖·‖_{K_t}) measures the imbalance between ν_{t,π} and ν_{t,W} as the worst-case discrepancy in means over functions in the unit ball of an RKHS. 
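In place of a QP solver (the paper uses Gurobi), the inner problem min_{W∈nΔ_n} E²(W, π) for p = 2, γ_t = 1, and Λ = λI can be sketched with simple projected gradient descent over the scaled simplex. This is an illustrative substitute, not the paper's implementation, and all function names here are invented:

```python
import numpy as np

def project_scaled_simplex(w, total):
    # Euclidean projection onto {w >= 0, sum(w) = total}, by the sorting method.
    u = np.sort(w)[::-1]
    css = np.cumsum(u) - total
    rho = np.nonzero(u - css / (np.arange(len(w)) + 1.0) > 0)[0][-1]
    return np.maximum(w - css[rho] / (rho + 1.0), 0.0)

def esq(W, K, T, P, lam):
    # n^2 * E^2(W, pi) for p = 2, gamma_t = 1, Lambda = lam * I.
    val = lam * (W @ W)
    for t in range(P.shape[1]):
        v = W * (T == t) - P[:, t]
        val += v @ K @ v
    return float(val)

def balance_weights(K, T, P, lam=1.0, iters=1000):
    # Projected gradient on the convex quadratic, with a 1/L step so the
    # objective decreases monotonically from the uniform start W = 1.
    n = len(T)
    L = 2.0 * (np.linalg.eigvalsh(K).max() + lam)   # smoothness constant
    W = np.ones(n)
    for _ in range(iters):
        g = 2.0 * lam * W
        for t in range(P.shape[1]):
            g += 2.0 * (T == t) * (K @ (W * (T == t) - P[:, t]))
        W = project_scaled_simplex(W - g / L, n)
    return W

rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 2))
T = rng.integers(2, size=n)
K = np.exp(-(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)))
P = np.zeros((n, 2)); P[:, 0] = 1.0   # target policy: always assign t = 0

W = balance_weights(K, T, P)
assert abs(W.sum() - n) < 1e-8 and W.min() >= -1e-12          # feasible in n*simplex
assert esq(W, K, T, P, 1.0) <= esq(np.ones(n), K, T, P, 1.0)  # improved objective
```

Note that, unlike the IPW-style weights, nothing here forces W_i = 0 on points with π_{T_i}(X_i) = 0, which is the source of the wider support observed in Table 1.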
In fact, as a distributional distance metric, it is the maximum mean discrepancy (MMD) used, for example, for testing whether two samples come from the same distribution [16]. Thus, minimizing E²(W, π; ‖·‖_{p,K_{1:m},γ_{1:m}}, Λ) simply seeks the weights W that balance ν_{t,π} and ν_{t,W} subject to variance regularization in W.
Example 1. We demonstrate balanced evaluation with a mixture of m = 5 Gaussians: X | T ∼ N(X̄_T, I_{2×2}), X̄_1 = (0, 0), X̄_t = (Re, Im)(e^{i2π(t−2)/(m−1)}) for t = 2, ..., m, and T ∼ Multinomial(1/5, ..., 1/5). Fix a draw of X_{1:n}, T_{1:n} with n = 100 shown in Fig. 1a (numpy seed 0). Color denotes T_i and size denotes φ_{T_i}(X_i). The centers X̄_t are marked by a colored number. Next, we let µ_t(x) = exp(1 − 1/‖x − ζ_t‖₂), where ζ_t = (Re, Im)(e^{i2πt/m}/√2) for t ∈ [m], ε_i ∼ N(0, σ²), and σ = 1. Fig. 1b plots µ_1(x). Fig. 1c shows the corresponding optimal policy π*.
Next we consider evaluating π*. Fixing X_{1:n} as in Fig. 1a, we have SAPE(π*) = 0.852. With X_{1:n} fixed, we draw 1000 replications of T_{1:n}, Y_{1:n} from their conditional distribution. For each replication, we fit φ̂ by estimating the (well-specified) Gaussian mixture by maximum likelihood and fit µ̂ using m separate gradient-boosted tree models (sklearn defaults). We consider evaluating π* either using the vanilla estimator τ̂_W or the doubly robust estimator τ̂_{W,µ̂} for W either chosen in the 4 different standard ways laid out in Sec. 1.2, using either the true φ or the estimated φ̂, or chosen by the balanced evaluation approach with untuned parameters (rather than fit by marginal likelihood): the standard (s = 1) Mahalanobis RBF kernel for K_t, ‖f‖² = Σ_{t=1}^m ‖f_t‖²_{K_t}, and Λ = I. (Note that this misspecifies the outcome model, ‖µ_t‖_{K_t} = ∞.) We tabulate the results in Tab. 1.
We note a few observations on the standard approaches: vanilla IPW with true φ has zero bias but large SD (standard deviation) and hence RMSE (root mean square error); a DR approach improves on vanilla IPW with φ̂ by reducing bias; clipping and normalizing IPW reduces SD. The balanced evaluation approach achieves the best RMSE by a clear margin, with the vanilla estimator beating all standard vanilla and DR estimators and the DR estimator providing a further improvement by nearly eliminating bias (but increasing SD). The marked success of the balanced approach is unsurprising when considering the support ‖W‖₀ = Σ_{i=1}^n I[W_i > 0] of the weights. All standard approaches use weights that are multiples of π_{T_i}(X_i), limiting support to the overlap between π and T_{1:n}, which hovers around 10–16 over replications. The balanced approach uses weights that have significantly wider support, around 88–94. In light of this, the success of the balanced approach is expected.

2.3 Consistent Evaluation

Next we consider the question of consistent evaluation: under what conditions can we guarantee that τ̂_{W*(π)} − SAPE(π) and τ̂_{W*(π),µ̂} − SAPE(π) converge to zero, and at what rates.
One key requirement for consistent evaluation is a weak form of overlap between the historical data and the target policy to be evaluated using this data:
Assumption 2 (Weak overlap). P(φ_t(X) > 0 ∨ π_t(X) = 0) = 1 ∀t ∈ [m], and E[π²_T(X)/φ²_T(X)] < ∞.

Figure 2: Policy learning results in Ex. 
2; numbers denote regret. (Panel regrets: IPW 0.50; DR 0.26; Gauss Proc 0.29; Grad Boost 0.20; IPW-SVM 0.34; DR-SVM 0.18; SNPOEM 0.28; PF 0.23; Balanced policy learner 0.06.)

This ensures that if π can assign treatment t to X then the data will have some examples of units with similar covariates being given treatment t; otherwise, we can never say what the outcome might look like. Another key requirement is specification. If the mean-outcome function is well-specified, in that it is in the RKHS product used to compute W*(π), then convergence at rate 1/√n is guaranteed. Otherwise, for a doubly robust estimator, if the regression estimate is well-specified then consistency is still guaranteed. In lieu of specification, consistency is also guaranteed if the RKHS product consists of C₀-universal kernels, defined below, such as the RBF kernel [40].
Definition 1. A PSD kernel K on a Hausdorff X (e.g., R^d) is C₀-universal if, for any continuous function g : X → R with compact support (i.e., for some C compact, {x : g(x) ≠ 0} ⊆ C) and η > 0, there exist m, α_1, x_1, ..., α_m, x_m such that sup_{x∈X} |Σ_{j=1}^m α_j K(x_j, x) − g(x)| ≤ η.
Theorem 3. Fix π and let W*_n(π) = W*_n(π; ‖·‖_{p,K_{1:m},γ_{n,1:m}}, Λ_n) with 0 ⪯ λI ⪯ Λ_n ⪯ λ̄I and 0 < γ ≤ γ_{n,t} ≤ γ̄ ∀t ∈ [m] for each n. Suppose Asns. 1 and 2 hold, Var(Y | X) is a.s. bounded, E[√(K_t(X, X))] < ∞, and E[K_t(X, X) π²_T(X)/φ²_T(X)] < ∞. Then the following two results hold:
(a) If ‖µ_t‖_{K_t} < ∞ for all t ∈ [m]: τ̂_{W*_n(π)} − SAPE(π) = O_p(1/√n).
(b) If K_t is C₀-universal for all t ∈ [m]: τ̂_{W*_n(π)} − SAPE(π) = o_p(1).
The key assumptions of Thm. 3 are unconfoundedness, overlap, and bounded variance. The other conditions simply guide the choice of method parameters. The two conditions on the kernel are trivial for bounded kernels like the RBF kernel. An analogous result for the DR estimator is a corollary.
Corollary 4. Suppose the assumptions of Thm. 3 hold. Then
(a) If ‖µ̂_{nt} − µ_t‖_{K_t} = o_p(1) ∀t ∈ [m]: τ̂_{W*_n(π),µ̂_n} − SAPE(π) = ((1/n²) Σ_{i=1}^n W*²_{ni} Var(Y_i | X_i))^{1/2} + o_p(1/√n).
(b) If ‖µ̂_n(X) − µ(X)‖₂ = O_p(r(n)) with r(n) = Ω(1/√n): τ̂_{W*_n(π),µ̂_n} − SAPE(π) = O_p(r(n)).
(c) If ‖µ_t‖_{K_t} < ∞ and ‖µ̂_{nt}‖_{K_t} = O_p(1) for all t ∈ [m]: τ̂_{W*_n(π),µ̂_n} − SAPE(π) = O_p(1/√n).
(d) If K_t is C₀-universal for all t ∈ [m]: τ̂_{W*_n(π),µ̂_n} − SAPE(π) = o_p(1).
Cor. 4(a) is the case where both the balancing weights and the regression function are well-specified, in which case the multiplicative bias disappears faster than o_p(1/√n), leaving us only with the irreducible residual variance and leading to an efficient evaluation. The other cases concern the “doubly robust” nature of the balanced DR estimator: Cor. 4(b) requires only that the regression be consistent, and Cor. 4(c)-(d) require only that the balancing weights be consistent.

3 Balanced Learning

Next we consider a balanced approach to policy learning. Given a policy class Π ⊂ [X → Δ_m], we let the balanced policy learner yield the policy π ∈ Π that minimizes the balanced policy evaluation using either a vanilla or DR estimator, plus a potential regularization term in the worst-case/posterior CMSE of the evaluation. 
We formulate this as a bilevel optimization problem:

$\hat\pi^{\mathrm{bal}}\in\operatorname{argmin}_\pi\{\hat\tau_W+\delta\,\mathcal E(W,\pi;\|\cdot\|,\Lambda):\pi\in\Pi,\ W\in\operatorname{argmin}_{W'\in\mathcal W}\mathcal E^2(W',\pi;\|\cdot\|,\Lambda)\}$ (1)
$\hat\pi^{\mathrm{bal\text{-}DR}}\in\operatorname{argmin}_\pi\{\hat\tau_{W,\hat\mu}+\delta\,\mathcal E(W,\pi;\|\cdot\|,\Lambda):\pi\in\Pi,\ W\in\operatorname{argmin}_{W'\in\mathcal W}\mathcal E^2(W',\pi;\|\cdot\|,\Lambda)\}$ (2)

The regularization term regularizes both the balance (i.e., worst-case/posterior bias) that is achievable for $\pi$ and the variance in evaluating $\pi$. We include this regularizer for completeness and motivated by the results of [42] (which regularize variance), but find that it is not necessary to include it in practice.

3.1 Optimizing the Balanced Policy Learner

Unlike [1, 7, 13, 41, 45], our (nonconvex) policy optimization problem does not reduce to weighted classification precisely because our weights are not multiples of $\pi_{T_i}(X_i)$ (but therefore our weights also lead to better performance). Instead, like [42], we use gradient descent. For that, we need to be able to differentiate our bilevel optimization problem. We focus on $p=2$ for brevity.

Theorem 5. Let $\|\cdot\|=\|\cdot\|_{2,K_{1:m},\gamma_{1:m}}$.
Then $\exists W^*(\pi)\in\operatorname{argmin}_{W\in\mathcal W}\mathcal E^2(W,\pi;\|\cdot\|,\Lambda)$ such that

$\nabla_{\pi_t(X_1),\dots,\pi_t(X_n)}\,\hat\tau_{W^*(\pi)}=\tfrac1n Y_{1:n}^\top\tilde H\bigl(I-(A+(I-A)\tilde H)^{-1}(I-A)\tilde H\bigr)J_t$
$\nabla_{\pi_t(X_1),\dots,\pi_t(X_n)}\,\hat\tau_{W^*(\pi),\hat\mu}=\tfrac1n\hat\epsilon_{1:n}^\top\tilde H\bigl(I-(A+(I-A)\tilde H)^{-1}(I-A)\tilde H\bigr)J_t+\tfrac1n\hat\mu_t(X_{1:n})$
$\nabla_{\pi_t(X_1),\dots,\pi_t(X_n)}\,\mathcal E(W^*(\pi),\pi;\|\cdot\|,\Lambda)=D_t/\mathcal E(W^*(\pi),\pi;\|\cdot\|,\Lambda)$

where $\tilde H=F(F^\top HF)^{-1}F^\top$, $F_{ij}=\delta_{ij}-\tfrac1n$ for $i\in[n]$, $j\in[n-1]$, $A_{ij}=\delta_{ij}I[W^*_i(\pi)>0]$, $H_{ij}=2\sum_{t=1}^m\gamma_t^2T_{it}T_{jt}K_t(X_i,X_j)+2\Lambda_{ij}$, $D_{ti}=2\gamma_t^2\sum_{j=1}^nK_t(X_i,X_j)(W_jT_{jt}-\pi_t(X_j))$, and $J_{tij}=2\delta_{ij}\gamma_t^2T_{it}K_t(X_i,X_j)$.

To leverage this result, we use a parameterized policy class such as $\Pi_{\mathrm{logit}}=\{\pi_t(x;\theta)\propto\exp(\theta_{t0}+\theta_t^\top x)\}$ (or kernelized versions thereof), apply the chain rule to differentiate the objective in the parameters $\theta$, and use BFGS [15] with random starts. The logistic parametrization allows us to smooth the problem even while the solution ends up being deterministic (extreme $\theta$).

This approach requires solving a quadratic program for each objective gradient evaluation. While this can be made faster by using the previous solution as a warm start, it is still computationally intensive, especially as the bilevel problem is nonconvex and both it and each quadratic program are solved in "batch" mode. This is a limitation of the current optimization algorithm that we hope to improve on in the future using specialized methods for bilevel optimization [4, 32, 37].

Example 2. We return to Ex. 1 and consider policy learning. We use the fixed draw shown in Fig. 1a and set $\delta$ to 0. We consider a variety of policy learners and plot the policies in Fig. 2 along with their population regret $\mathrm{PAPE}(\pi)-\mathrm{PAPE}(\pi^*)$.
The policy learners we consider are: minimizing standard IPW and DR evaluations over $\Pi_{\mathrm{logit}}$ with $\hat\varphi,\hat\mu$ as in Ex. 1 (versions with combinations of normalized, clipped, and/or true $\varphi$, not shown, all have regret 0.26-0.5); the direct method with Gaussian process regression or gradient boosted trees (both sklearn defaults); weighted SVM classification using IPW and DR weights (details in supplement); SNPOEM [43]; PF [23]; and our balanced policy learner (1) with parameters as in Ex. 1, $\Pi=\Pi_{\mathrm{logit}}$, $\delta=\Lambda=0$ (the DR version (2), not shown, has regret 0.08).

Example 3. Next, we consider two UCI multi-class classification datasets [30], Glass ($n=214$, $d=9$, $m=6$) and Ecoli ($n=336$, $d=7$, $m=8$), and use a supervised-to-contextual-bandit transformation [7, 13, 42] to compare different policy learning algorithms. Given a supervised multi-class dataset, we draw $T$ as per a multilogit model with random $\pm1$ coefficients in the normalized covariates $X$. Further, we set $Y$ to 0 if $T$ matches the label and 1 otherwise. We then split the data 75-25 into training and test samples. Using 100 replications of this process, we evaluate the performance of learned linear policies by comparing the linear policy learners as in Ex. 2. For IPW-based approaches, we estimate $\hat\varphi$ by a multilogit regression (well-specified by construction). For DR approaches, we estimate $\hat\mu$ using gradient boosted trees (sklearn defaults). We compare these to our balanced policy learner in both vanilla and DR forms with all parameters fit by marginal likelihood using the RBF kernel with an unspecified length scale after normalizing the data. We tabulate the results in Tab. 2. They first demonstrate that employing the various stopgap fixes to IPW-based policy learning, as in SNPOEM, indeed provides a critical edge.
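The supervised-to-contextual-bandit transformation used here can be sketched as follows (a hedged reading of the description above; function and variable names are our own):

```python
import numpy as np

def to_bandit(X, labels, m, rng):
    """Turn a supervised multi-class dataset into logged bandit data:
    draw T from a multilogit model with random +/-1 coefficients on the
    (already normalized) covariates, and set Y = 0 iff T hits the label."""
    n, d = X.shape
    coef = rng.choice([-1.0, 1.0], size=(d, m))
    logits = X @ coef
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)       # historical policy probabilities
    T = np.array([rng.choice(m, p=p[i]) for i in range(n)])
    Y = (T != labels).astype(float)         # 0 on a match, 1 otherwise
    return T, Y, p

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
labels = rng.integers(0, 4, size=40)
T, Y, p = to_bandit(X, labels, 4, rng)

# 75-25 train/test split, as in the example.
perm = rng.permutation(len(X))
train, test = perm[: int(0.75 * len(X))], perm[int(0.75 * len(X)):]
```

Because the logging policy is a multilogit model by construction, a multilogit fit of $\hat\varphi$ is well-specified, which is what the IPW baselines rely on.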
This is further improved upon by using a balanced approach to policy learning, which gives the best results. In this example, DR approaches do worse than vanilla ones, suggesting that XGBoost provided a bad outcome model and/or that the additional variance of DR was not compensated for by sufficiently less bias.

3.2 Uniform Consistency and Regret Bounds

Next, we establish consistency results uniformly over policy classes. This allows us to bound the regret of the balanced policy learner. We define the population and sample regret, respectively, as

$R_\Pi(\hat\pi)=\mathrm{PAPE}(\hat\pi)-\min_{\pi\in\Pi}\mathrm{PAPE}(\pi),\qquad \hat R_\Pi(\hat\pi)=\mathrm{SAPE}(\hat\pi)-\min_{\pi\in\Pi}\mathrm{SAPE}(\pi)$

Table 2: Policy learning results in Ex. 3

         IPW    DR     IPW-SVM  DR-SVM  POEM   SNPOEM  Balanced  Balanced-DR
Glass    0.726  0.755  0.641    0.731   0.851  0.615   0.584     0.660
Ecoli    0.488  0.501  0.332    0.509   0.431  0.331   0.298     0.371

A key requirement for these to converge is that the best-in-class policy is learnable. We quantify that using Rademacher complexity [3] and later extend our results to VC dimension. Let us define

$\hat R_n(\mathcal F)=\tfrac1{2^n}\sum_{\rho\in\{-1,+1\}^n}\sup_{f\in\mathcal F}\tfrac1n\sum_{i=1}^n\rho_if(X_i),\qquad R_n(\mathcal F)=E[\hat R_n(\mathcal F)].$

E.g., for linear policies $\hat R_n(\mathcal F)=O(1/\sqrt n)$ [21]. If $\mathcal F\subseteq[\mathcal X\to\mathbb R^m]$ let $\mathcal F_t=\{(f(\cdot))_t:f\in\mathcal F\}$ and set $R_n(\mathcal F)=\sum_{t=1}^mR_n(\mathcal F_t)$ and same for $\hat R_n(\mathcal F)$. We also strengthen the overlap assumption.

Assumption 3 (Strong overlap). $\exists\alpha\ge1$ such that $P(\varphi_t(X)\ge1/\alpha)=1$ $\forall t\in[m]$.

Theorem 6. Fix $\Pi\subseteq[\mathcal X\to\Delta^m]$ and let $W^*_n(\pi)=W^*_n(\pi;\|f\|_{p,K_{1:m},\gamma_{n,1:m}},\Lambda_n)$ with $0\preceq\underline\lambda I\preceq\Lambda_n\preceq\bar\lambda I$, $0<\underline\gamma\le\gamma_{n,t}\le\bar\gamma$ $\forall t\in[m]$ for each $n$ and $\pi\in\Pi$. Suppose Asns. 1 and 3 hold, $|\epsilon_i|\le B$ a.s., and $\sqrt{K_t(x,x)}\le\kappa$ $\forall t\in[m]$ for some $\kappa\ge1$.
Then the following two results hold:
(a) If $\|\mu_t\|_{K_t}<\infty$ $\forall t\in[m]$, then for $n$ sufficiently large ($n\ge2\log(4m/\nu)/(1/(2\alpha)-R_n(\Pi))^2$), we have that, with probability at least $1-\nu$,

$\sup_{\pi\in\Pi}|\hat\tau_{W^*(\pi)}-\mathrm{SAPE}(\pi)|\le 8\alpha m\bigl(\|\mu\|+\sqrt{2\log(4m/\nu)}\,\underline\gamma^{-1}B\bigr)R_n(\Pi)$
$\quad+\tfrac1{\sqrt n}\bigl(2\alpha\kappa\|\mu\|+12\alpha^2m\|\mu\|+6\alpha m\underline\gamma^{-1}B\bigr)\log\tfrac{4m}{\nu}$
$\quad+\tfrac1{\sqrt n}\bigl(2\alpha\kappa\underline\gamma^{-1}B+12\alpha^2m\underline\gamma^{-1}B+3\alpha m\|\mu\|\bigr)\sqrt{2\log\tfrac{4m}{\nu}}$

(b) If $K_t$ is $C_0$-universal for all $t\in[m]$ and either $R_n(\Pi)=o(1)$ or $\hat R_n(\Pi)=o_p(1)$, then $\sup_{\pi\in\Pi}|\hat\tau_{W^*(\pi)}-\mathrm{SAPE}(\pi)|=o_p(1)$.

The proof crucially depends on simultaneously handling the functional complexities of both the policy class $\Pi$ and the space of functions $\{f:\|f\|<\infty\}$ being balanced against. Again, the key assumptions of Thm. 6 are unconfoundedness, overlap, and bounded residuals. The other conditions simply guide the choice of method parameters. Regret bounds follow as a corollary.

Corollary 7. Suppose the assumptions of Thm. 6 hold.
If $\hat\pi^{\mathrm{bal}}_n$ is as in (1) then:
(a) If $\|\mu_t\|_{K_t}<\infty$ for all $t\in[m]$: $R_\Pi(\hat\pi^{\mathrm{bal}}_n)=O_p(R_n(\Pi)+1/\sqrt n)$.
(b) If $K_t$ is $C_0$-universal for all $t\in[m]$: $R_\Pi(\hat\pi^{\mathrm{bal}}_n)=o_p(1)$.
If $\hat\pi^{\mathrm{bal\text{-}DR}}_n$ is as in (2) then:
(c) If $\|\hat\mu_{nt}-\mu_t\|_{K_t}=o_p(1)$ for all $t\in[m]$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n)=O_p(R_n(\Pi)+1/\sqrt n)$.
(d) If $\|\hat\mu_n(X)-\mu(X)\|_2=O_p(r(n))$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n)=O_p(r(n)+R_n(\Pi)+1/\sqrt n)$.
(e) If $\|\mu_t\|_{K_t}<\infty$, $\|\hat\mu_{nt}\|_{K_t}=O_p(1)$ for all $t\in[m]$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n)=O_p(R_n(\Pi)+1/\sqrt n)$.
(f) If $K_t$ is $C_0$-universal for all $t\in[m]$: $R_\Pi(\hat\pi^{\mathrm{bal\text{-}DR}}_n)=o_p(1)$.
And all the same results hold when replacing $R_n(\Pi)$ with $\hat R_n(\Pi)$ and/or replacing $R_\Pi$ with $\hat R_\Pi$.

4 Conclusion
Considering the policy evaluation and learning problems using observational or logged data, we presented a new method that is based on finding optimal balancing weights that make the data look like the target policy and that is aimed at ameliorating the shortcomings of existing methods, which include having to deal with near-zero propensities, using too few positive weights, and using an awkward two-stage procedure. The new approach showed promising signs of fixing these issues in some numerical examples. However, the new learning method is more computationally intensive than existing approaches, solving a QP at each gradient step. Therefore, in future work, we plan to explore faster algorithms that can implement the balanced policy learner, perhaps using alternating descent, and use these to investigate comparative numerics in much larger datasets.

Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1656996.

References
[1] S. Athey and S.
Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.

[2] P. C. Austin and E. A. Stuart. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in Medicine, 34(28):3661-3679, 2015.

[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463-482, 2003.

[4] K. P. Bennett, G. Kunapuli, J. Hu, and J.-S. Pang. Bilevel optimization and machine learning. In IEEE World Congress on Computational Intelligence, pages 25-47. Springer, 2008.

[5] D. Bertsimas and N. Kallus. The power and limits of predictive approaches to observational-data-driven optimization. arXiv preprint arXiv:1605.02347, 2016.

[6] D. Bertsimas, N. Kallus, A. M. Weinstein, and Y. D. Zhuo. Personalized diabetes management using electronic medical records. Diabetes Care, 40(2):210-217, 2017.

[7] A. Beygelzimer and J. Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129-138. ACM, 2009.

[8] L. Bottou, J. Peters, J. Q. Candela, D. X. Charles, M. Chickering, E. Portugaly, D. Ray, P. Y. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(1):3207-3260, 2013.

[9] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

[10] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[11] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, and C. Hansen.
Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060, 2016.

[12] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2(Dec):265-292, 2001.

[13] M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.

[14] M. R. Elliott. Model averaging methods for weight trimming. Journal of Official Statistics, 24(4):517, 2008.

[15] R. Fletcher. Practical Methods of Optimization. John Wiley & Sons, 2013.

[16] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pages 513-520, 2006.

[17] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663-685, 1952.

[18] G. W. Imbens. The role of the propensity score in estimating dose-response functions. Biometrika, 87(3), 2000.

[19] G. W. Imbens and D. B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.

[20] E. L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295-311, 2008.

[21] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pages 793-800, 2009.

[22] N. Kallus. Generalized optimal matching methods for causal inference. arXiv preprint arXiv:1612.08321, 2016.

[23] N. Kallus. Recursive partitioning for personalization using observational data. In International Conference on Machine Learning (ICML), pages 1789-1798, 2017.

[24] N. Kallus.
Optimal a priori balance in the design of controlled experiments. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(1):85-112, 2018.

[25] N. Kallus and A. Zhou. Confounding-robust policy improvement. 2018.

[26] N. Kallus and A. Zhou. Policy evaluation and optimization with continuous treatments. In International Conference on Artificial Intelligence and Statistics, pages 1243-1251, 2018.

[27] J. D. Kang, J. L. Schafer, et al. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523-539, 2007.

[28] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[29] L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 297-306. ACM, 2011.

[30] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

[31] J. K. Lunceford and M. Davidian. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23(19):2937-2960, 2004.

[32] P. Ochs, R. Ranftl, T. Brox, and T. Pock. Techniques for gradient-based bilevel optimization with non-smooth lower level problems. Journal of Mathematical Imaging and Vision, 56(2):175-194, 2016.

[33] M. Qian and S. A. Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.

[34] J. M. Robins. Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, pages 6-10, 1999.

[35] J. M. Robins, A. Rotnitzky, and L. P. Zhao.
Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846-866, 1994.

[36] H. L. Royden. Real Analysis. Prentice Hall, 1988.

[37] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640-660, 2017.

[38] D. O. Scharfstein, A. Rotnitzky, and J. M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096-1120, 1999.

[39] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[40] B. K. Sriperumbudur, K. Fukumizu, and G. R. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. arXiv preprint arXiv:1003.0887, 2010.

[41] A. Strehl, J. Langford, L. Li, and S. M. Kakade. Learning from logged implicit exploration data. In Advances in Neural Information Processing Systems, pages 2217-2225, 2010.

[42] A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, pages 814-823, 2015.

[43] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems, pages 3231-3239, 2015.

[44] C. K. Williams and C. E. Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[45] X. Zhou, N. Mayer-Hamblett, U. Khan, and M. R. Kosorok. Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112(517):169-187, 2017.