{"title": "Semi-Parametric Efficient Policy Learning with Continuous Actions", "book": "Advances in Neural Information Processing Systems", "page_first": 15065, "page_last": 15075, "abstract": "We consider off-policy evaluation and optimization with continuous action spaces. We focus on observational data where the data collection policy is unknown and needs to be estimated from data. We take a semi-parametric approach where the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts. We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this doubly robust estimate is robust to estimation errors of the policy function or the regression model. We also show that the variance of our off-policy estimate achieves the semi-parametric efficiency bound. Our results also apply if the model does not satisfy our semi-parametric form but rather we measure regret in terms of the best projection of the true value function to this functional space. Our work extends prior approaches of policy optimization from observational data that only considered discrete actions. We provide an experimental evaluation of our method in a synthetic data example motivated by optimal personalized pricing.", "full_text": "Semi-Parametric Ef\ufb01cient Policy Learning with\n\nContinuous Actions\n\nMert Demirer\n\nMIT\n\nmdemirer@mit.edu\n\nVasilis Syrgkanis\nMicrosoft Research\n\nvasy@microsoft.com\n\nGreg Lewis\n\nMicrosoft Research\n\nVictor Chernozhukov\n\nMIT\n\nglewis@microsoft.com\n\nvchern@mit.edu\n\nAbstract\n\nWe consider off-policy evaluation and optimization with continuous action spaces.\nWe focus on observational data where the data collection policy is unknown and\nneeds to be estimated. We take a semi-parametric approach where the value\nfunction takes a known parametric form in the treatment, but we are agnostic\non how it depends on the observed contexts. 
We propose a doubly robust off-policy estimate for this setting and show that off-policy optimization based on this estimate is robust to estimation errors of the policy function or the regression model. Our results also apply if the model does not satisfy our semi-parametric form, but rather we measure regret in terms of the best projection of the true value function to this functional space. Our work extends prior approaches of policy optimization from observational data that only considered discrete actions. We provide an experimental evaluation of our method in a synthetic data example motivated by optimal personalized pricing and costly resource allocation.

1 Introduction

We consider off-policy evaluation and optimization with continuous action spaces from observational data, where the data collection (logging) policy is unknown. We take a semi-parametric approach where we assume that the value function takes a known parametric form in the treatment, but we are agnostic on how it depends on the observed contexts/features. In particular, we assume that:

V(a, z) = ⟨θ0(z), φ(a, z)⟩    (1)

for some known feature functions φ but unknown functions θ0. We assume that we are given a set of n observational data points (x1, ..., xn) that consist of i.i.d. copies of the random vector x = (y, a, z) ∈ Y × A × Z, such that E[y | a, z] = V(a, z).¹

Our goal is to estimate a policy π̂ : Z → A from a space of policies Π that achieves good regret:

sup_{π∈Π} E[V(π(z), z)] − E[V(π̂(z), z)] ≤ R(Π, n)    (2)

for some regret rate that depends on the policy space Π and the sample size n.

¹In most of the paper, we can allow for the case where z is endogenous, in the sense that E[y | a, z] = V(a, z) + f0(z). 
In other words, the noise in the random variable y can be potentially correlated with z. However, we assume that conditional on z, there is no remaining endogeneity in the choice of the action in our data. The latter is typically referred to as conditional ignorability/exogeneity [11].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The semi-parametric value assumption allows us to formulate a doubly robust estimate V_DR of the value function from the observational data, which depends on first stage regression estimates of the coefficients θ0(z) and of the conditional covariance of the features Σ0(z) = E[φ(a, z) φ(a, z)ᵀ | z]. The latter is the analogue of the propensity function when actions are discrete. Our estimate is doubly robust in that it is unbiased if either θ0 or Σ0 is correct. Then we optimize this estimate:

π̂ = arg max_{π∈Π} V_DR(π)    (3)

Main contributions. We show that the double robustness property implies that our objective function satisfies a Neyman orthogonality criterion, which in turn implies that our regret rates depend only in a second order manner on the estimation errors of the first stage regression estimates of the functions θ0, Σ0. Moreover, we prove a regret rate whose leading term depends on the variance of the difference of our estimated values between any two policies within a "small-regret slice" and on the entropy integral of the policy space. We achieve this with a computationally efficient variant of the empirical risk minimization (ERM) algorithm (of independent interest) that uses a validation set to construct a preliminary policy and uses it to regularize the policy computed on the training set. 
Based on this result, we manage to achieve variance-based regret bounds without the need for the variance or moment penalization [15, 20, 9] used in prior work, which can render a computationally tractable policy learning problem non-convex. Therefore, our method provides a computationally efficient alternative to variance penalization when the original ERM problem is convex.² We also show that the asymptotic variance of our off-policy estimate (which governs the scale of the leading regret term) is asymptotically minimax optimal, in the sense that it achieves the semi-parametric efficiency lower bound.

Robustness to mis-specification. Notably, our approach provides meaningful guarantees even when our semi-parametric value function assumption is violated. Suppose that the true value function does not take the form of Equation (1), but rather takes some other form V0(a, z). Then one can consider the projection of this value function onto the forms of Equation (1), as:

θp(z) = arg inf_θ E[ (V0(a, z) − ⟨θ(z), φ(a, z)⟩)² | z ]    (4)

where the expectation is taken over the distribution of the observed data. Then our approach takes the interpretation of achieving good regret bounds with respect to this best linear semi-parametric approximation. This is an alternative to the kernel smoothing approximation proposed by [20] in the contextual bandit setting as a regret target, and related to [12]. 
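To make the projection target of Equation (4) concrete, here is a small self-contained sketch. All functional forms, the logging distribution, and the function names below are our own toy choices for illustration, not from the paper: for a scalar action with features φ(a, z) = (a, a²), θp(z) at a fixed context is simply the least-squares fit of V0(·, z) onto those features under the observed action distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: the true value V0(a, z) = sin(a) * z is NOT of the
# semi-parametric form <theta(z), phi(a, z)> for phi(a, z) = (a, a^2).
def V0(a, z):
    return np.sin(a) * z

def phi(a):
    return np.stack([a, a ** 2], axis=-1)  # features (a, a^2)

# Projection theta_p(z) = arg min_theta E[(V0(a, z) - <theta, phi(a)>)^2 | z],
# approximated by sampling actions from a (toy) logging distribution a|z ~ N(1, 1).
def theta_p(z, n=100_000):
    a = rng.normal(1.0, 1.0, size=n)
    return np.linalg.lstsq(phi(a), V0(a, z), rcond=None)[0]

theta = theta_p(z=2.0)

# Regret is then measured against the best action for the projected value
# <theta_p(z), phi(a)>, not for the unknown V0.
a_grid = np.linspace(-3, 3, 601)
best_a = a_grid[np.argmax(phi(a_grid) @ theta)]
print(theta, best_a)
```

Under this toy logging distribution the projected model and the true V0 can disagree on the best action; the regret target in the paper is defined with respect to the former.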
If there is some rough domain knowledge of the form of how the action affects the reward, then our semi-parametric approximate target should achieve better performance when the dimension of the action space is large, as kernel methods will typically incur a bias that is exponential in the dimension.

Double robustness. In cases where the collection policy is known, our doubly robust approach can be used for variance reduction via fitting first stage regression estimates of the policy value, whilst maintaining unbiasedness. Thus we can apply our approach to improve regret in the counterfactual risk minimization framework [20], [12] and as a variance reduction method in contextual bandit algorithms with continuous actions [20].

The problem we study in this paper is different in that we consider optimizing over continuous action spaces, rather than infinitesimal nudges, under a semi-parametric functional form. This assumption is without loss of generality if the treatment is binary or multi-valued. Hence, our results are a generalization of binary treatments to arbitrary continuous action spaces, subject to our semi-parametric value assumption. In fact, we show formally in the Appendix how one can recover the result of [1] for the binary setting from our main regret bound.

Related Literature. Our work builds on the recent work at the intersection of semi-parametric inference and policy learning from observational data. The important work of [1] analyzes binary treatments and infinitesimal nudges to continuous treatments. They also take a doubly robust approach so as to obtain regret bounds whose leading term depends on the semi-parametric efficient variance and the entropy integral, and which is robust to first stage estimation errors. 
The problem we study in this paper is different in that we consider optimization over continuous action spaces, under a semi-parametric functional form assumption on the payoff function.

²We provide two examples of convex problems in our experiments.

This paper is complementary to the work of [1], who allow for more general payoff functions, but restrict attention to the optimization of "infinitesimal nudges"³ from the current actions. In the case of discrete actions the payoff function assumption is without loss of generality, and thus we are able to show formally in the Appendix that we can recover the result of [1] from our main regret bound. In turn, our work builds on a long line of work on policy learning and counterfactual risk minimization [17, 26, 27, 1, 13, 28, 2, 7, 20, 12, 14]. Notably, the work of [28] extends the work of [1] to many discrete actions, but only proves a second moment based regret bound, which can be much larger than the variance. Our setting also subsumes the setting of many discrete actions and hence our regularized ERM offers an improvement over the rates in [28]. [9] formulates a general framework of statistical learning with a nuisance component. Our method falls into this framework and we build upon some of the results in [9]. However, for the case of policy learning the implications of [9] provide a variance based regret only when invoking second moment penalization, which can be intractable. We side-step this need and provide a computationally efficient alternative. Finally, most of the work on policy learning in machine learning assumes that the current policy (equiv. Σ0(z)) is known. Hence, double robustness is used mostly as a variance reduction technique. 
Even for this literature, as we discuss above, our method can be seen as an alternative to recent work on policy learning with continuous actions [12, 14] that makes use of non-parametric kernel methods.

Our work also connects to the semi-parametric estimation literature in econometrics and statistics. Our model is an extension of the partially linear model which has been extensively studied in econometrics [8, 19]. By considering context-specific coefficients (random coefficients) and modeling a value function that is non-linear in the treatment, we substantially extend the partially linear model. [24, 10] studied a special case of our model where the output depends linearly on the treatment given the context, with the aim of estimating the average treatment effect. [10] constructed the doubly robust estimator and showed its semi-parametric efficiency under the linear-in-treatment assumption. We extend their results to a more general functional form and use the double-robustness property and semi-parametric efficiency for policy evaluation and optimization rather than treatment effect estimation. Our work is also connected to the recent and rapidly growing literature on orthogonal/locally robust/debiased estimation [5, 6, 21].

2 Orthogonal Off-Policy Evaluation and Optimization

Let θ̂ be a first stage estimate of θ0(z), which can be obtained by minimizing the square loss:

θ̂ = arg inf_{θ∈Θ} E_n[ (y − ⟨θ(z), φ(a, z)⟩)² ]    (5)

where Θ is an appropriate parameter space for the parameters θ(z). Let Σ0(z) denote the conditional covariance matrix:

Σ0(z) = E[φ(a, z) φ(a, z)ᵀ | z]

This is the analogue of the propensity model in discrete treatment settings. An estimate Σ̂(z) can be obtained by running a multi-task regression problem for each entry of the matrix, i.e.:

Σ̂_ij = arg inf_{Σ_ij ∈ S_ij} E[ (φ_i(a, z) φ_j(a, z) − Σ_ij(z))² ]    (6)

where S_ij is some appropriate hypothesis space for these regressions. Finally, the doubly robust estimate of the off-policy value takes the form:

V_DR(π) = E_n[ v_DR(y, a, z; π) ]    (7)

where:

v_DR(y, a, z; π) = ⟨θ_DR(y, a, z), φ(π(z), z)⟩    (8)
θ_DR(y, a, z) = θ̂(z) + Σ̂(z)⁻¹ φ(a, z) (y − ⟨θ̂(z), φ(a, z)⟩)    (9)

The quantity θ_DR(y, a, z) can be viewed as an estimate of θ0(z) based on a single observation. In fact, if the matrix Σ̂ were equal to Σ0, then one can see that θ_DR(y, a, z) is an unbiased estimate of θ0(z). Our estimate v_DR also satisfies a doubly robust property, i.e. it is correct if either θ̂ is unbiased or Σ̂⁻¹ is unbiased (see Appendix F for a formal statement). Finally, we will denote with θ⁰_DR(y, a, z) the version of θ_DR where the nuisance quantities θ and Σ are replaced by their true values, and correspondingly define v⁰_DR(y, a, z; π). We perform policy optimization based on this doubly robust estimate:

π̂ = arg max_{π∈Π} V_DR(π)    (10)

Moreover, we let π⁰_* be the optimal policy: π⁰_* = arg max_{π∈Π} V(π).

³Infinitesimal nudges are defined as ε positive or negative perturbations over a baseline policy. In the limit of ε going to zero, the latter is essentially a binary policy problem where the binary choice is whether to ε-increase or ε-decrease the current treatment level.

Remark 1 (Multi-Action Policy Learning). A special case of our setup is the setting where the number of actions is finitely many. 
This can be encoded as a ∈ {e1, . . . , en} and φ(a, z) = a. In that case, observe that the covariance matrix becomes a diagonal matrix: Σ0(z) = diag(p1(z), . . . , pn(z)), with p_i(z) = Pr[a = e_i | z]. In this case, we simply recover the standard doubly robust estimate that combines the direct regression part with the inverse propensity weights part, i.e.:

θ_DR,i(y, a, z) = θ̂_i(z) + (1{a = e_i} / p̂_i(z)) (y − θ̂_i(z))

Thus our estimator is an extension of the doubly robust estimate from discrete to continuous actions.

Remark 2 (Finitely Many Possible Actions: Linear Contextual Bandits). Another interesting special case of our approach is a generalization of the linear contextual bandit setting. In particular, suppose that there is only a finite (albeit potentially large) set of N > p possible actions A = {a1, . . . , aN} with a_i ∈ R^p. However, unlike the multi-action setting, where these actions are the orthonormal basis vectors, in this setting each action a_i ∈ A maps to a feature vector φ_i(z) := φ(a_i, z). Then the reward y that we observe satisfies E[y | z, a] = ⟨θ(z), φ(a, z)⟩. This is a generalization of the linear contextual bandit setting, in which the coefficient vector θ(z) is a constant parameter θ as opposed to varying with z. In this case observe that: Σ0(z) = Σ_{i=1}^N p_i(z) φ_i(z) φ_i(z)ᵀ = U D Uᵀ, i.e. it is the sum of N rank one matrices, where D = diag(p1(z), . . . , pN(z)), p_i(z) = Pr[a = a_i | z] and U = [φ1(z), . . . , φN(z)]. The doubly robust estimate of the parameter takes the form:

θ_DR(y, a, z) = θ̂(z) + (U D Uᵀ)⁻¹ φ(a, z) (y − ⟨θ̂(z), φ(a, z)⟩)

This approach leverages the functional form assumption to get an estimate that avoids a large variance that depends on the number of actions N but rather mostly depends on the number of parameters p. 
This is achieved by sharing reward information across actions.

Remark 3 (Linear-in-Treatment Value). Consider the case where the value is linear in the action: φ(a, z) = a ∈ R^p. In this case observe that: Σ0(z) = Var(a | z). For instance, suppose that we assume that experimentation is independent across actions in the observed data. Then Σ0(z) = diag(σ_1²(z), . . . , σ_p²(z)), where σ_i²(z) = Var(a_i | z). Then the doubly robust estimate of the parameter takes the form:

θ_DR,i(y, a, z) = θ̂_i(z) + (a_i / σ̂_i²(z)) (y − ⟨θ̂(z), a⟩)    (11)

3 Theoretical Analysis

Our main regret bounds are derived for a slight variation of the ERM algorithm that we presented in the preliminary section. In particular, we crucially need to augment the ERM algorithm with a "validation" step, where we split our data into a training and a validation set, and we restrict attention to policies that achieve small regret on the training data, while still maintaining small regret on the validation set. This extra modification enabled us to prove variance based regret bounds and is reminiscent of standard approaches in machine learning, like k-fold cross-validation and early stopping, hence could be of independent interest.

Algorithm 1: Out-of-Sample Regularized ERM with Nuisance Estimates

1. The inputs are given by the sample of data S = (x1, . . . , xn), which we randomly split in two parts S1, S2. Moreover, we randomly split S2 into validation and training samples S_2^v and S_2^t.
2. Estimate the nuisance functions θ̂(z) and Σ̂(z) using Equations (5) and (6) on S1.
3. Use the output of Step 2 to construct the doubly robust moment in Equation (7) on S_2^v and S_2^t. Run the ERM given in Equation (10) over policy space Π1 on S_2^v to learn a policy function π1.
4. Use the output of Step 3 to construct a function class Π2 defined as

Π2 = {π ∈ Π : E_{S^v}[v_DR(y, a, z; π1) − v_DR(y, a, z; π)] ≤ μ_n}

for some μ_n, where E_{S^v} denotes the empirical expectation over the validation sample.
5. Use the output of Step 2 to construct the doubly robust moment in Equation (7) on S_2^t. Run a constrained ERM on S_2^t over Π2.

We note that we present our theoretical results for the simpler case where the nuisance estimates are trained on a separate split of the data. However, our results qualitatively extend to the case where we use the cross-fitting idea of [5] (i.e. train a model on one half and predict on the other and vice versa).

Regret bound. To show the properties of this algorithm, we first show that the regret of the doubly robust algorithm is impacted in a second order manner by the errors in the first stage estimates. We will also make the following preliminary definitions. For any function f we denote with ‖f‖_2 = √(E[f(x)²]) the standard L2 norm and with ‖f‖_{2,n} = √(E_n[f(x)²]) its empirical analogue. Furthermore, we define the empirical entropy H_2(ε, F, n) of a function class F as the largest value, over the choice of n samples, of the logarithm of the size of the smallest empirical ε-cover of F on the samples with respect to the ‖·‖_{2,n} norm. Finally, we consider the empirical entropy integral:

κ(r, F) = inf_{α≥0} ( 4α + 10 ∫_α^r √( H_2(ε, F, n) / n ) dε )

Our statistical learning problem corresponds to learning over the function space:

F_Π = {v_DR(·; π) : π ∈ Π}

where the data is x = (y, a, z). We will also make a very benign assumption on the entropy integral:

ASSUMPTION 1. The function class F_Π satisfies that for any constant r, κ(r, F_Π) → 0 as n → ∞.

Theorem 1 (Variance-Based Oracle Policy Regret). 
Suppose that the nuisance estimates satisfy that their mean squared error is upper bounded w.p. 1 − δ/2 by h²_{n,δ}, i.e. w.p. 1 − δ/2 over the randomness of the nuisance sample:

max{ E[‖θ̂(z) − θ0(z)‖²], E[‖Σ̂(z) − Σ0(z)‖²_Fro] } ≤ h²_{n,δ}    (12)

Let r = sup_{π∈Π} √(E[v_DR(x; π)²]) and

μ_n = Θ( κ(r, F_Π) + r √(log(1/δ)/n) )    (13)

Moreover, let

Π_*(ε) = {π ∈ Π : V(π⁰_*) − V(π) ≤ ε}    (14)

denote an ε-regret slice of the policy space. Let ε_n = O(μ_n + h²_{n,δ}) and

V⁰_2 = sup_{π,π′ ∈ Π_*(ε_n)} Var( v⁰_DR(x; π) − v⁰_DR(x; π′) )    (15)

denote the variance of the difference between any two policies in an ε_n-regret slice, evaluated at the true nuisance quantities. Then the policy π2 returned by the out-of-sample regularized ERM satisfies w.p. 1 − δ over the randomness of S:

V(π⁰_*) − V(π2) = O( κ(√V⁰_2, F_Π) + √(V⁰_2 log(1/δ)/n) + h²_{n,δ} )    (16)

Expected regret is O( κ(√V⁰_2, F_Π) + √(V⁰_2/n) + h²_n ), with h²_n the expected MSE of the nuisance functions.

We provide a proof of this Theorem in Appendix C. The regret result contains two main contributions: 1) first, the impact of the nuisance estimation error is of second order (i.e. h²_{n,δ} instead of h_{n,δ}); 2) the leading regret term depends on the variance of small-regret policy differences and the entropy integral of the policy space. The first property stems from the Neyman orthogonality property of the doubly robust estimate of the policy. The second property stems from the out-of-sample regularization step that we added to the ERM algorithm. Typically, we will have h²_{n,δ} = o(1/√n) and thereby this term is of lower order than the leading term. Moreover, for many policy spaces κ(0, F_Π) = 0, in which case we see that if the setting satisfies a "margin" condition (i.e. the best policy is better by a constant Δ margin), then eventually the variance of small regret policies is 0, since the slice only contains the best policy. In that case, our bound leads to fast rates of log(n)/n as opposed to 1/√n, since the leading term vanishes (similar to the log(n)/n achieved in bandit problems with such a margin condition). The dependence on the quantity V⁰_2 is quite intuitive: if two policies have almost equivalent regret up to a μ_n rate, then it will be very easy to be misled among them if one has much higher variance than the other. For some classes of problems, the above also implies a regret rate that only depends on the variance of the optimal policy (e.g. when all policies with low regret have a variance that is not much larger than the variance of the optimal policy). In Appendix G we show that the latter is always the case for the setting of binary treatment studied in [1], and therefore applying our main result, we recover exactly their bound for binary treatments.

Semi-parametric efficient variance. Our regret bound depends on the variance of our doubly robust estimate of the value function. One then wonders if there are other estimates of the value function that could achieve better variance than V_DR(π). However, we show that at least asymptotically, and without further assumptions on the functions θ0(z) and Σ0(z), this cannot be the case. In particular, we show that our estimator achieves what is known as the semi-parametric efficient variance limit for our setting. More importantly for our regret result, this is also true for the semi-parametric efficient variance of the policy differences. 
This is the case in our main setup, where the model is mis-specified and the target is only a projection of the true value; and it is also the case if we assume that our model is correct, but make the extra assumption of homoskedasticity, i.e., the conditional variance of the residuals of the outcomes y does not depend on (a, z). Homoskedasticity is needed for efficiency because otherwise one can optimally re-weight the moments to obtain a more efficient estimator. Such optimally re-weighted estimators are typically avoided in practice as they heavily rely on the well-specification of the model.

Theorem 2 (Semi-parametric Efficiency). If the model is mis-specified, i.e., V0(a, z) ≠ V(a, z), the asymptotic variance of V_DR(π) is equal to the semi-parametric efficiency bound for the policy value E[⟨θp(z), φ(π(z), z)⟩] with θp defined in Equation (4). If the model is correctly specified, V_DR(π) is semi-parametrically efficient under homoskedasticity, i.e. E[(y − V(a, z))² | a, z] = E[(y − V(a, z))²].

We provide a proof for the value function, but this result also extends to the difference of values. We conclude the section by providing concrete examples of rates for policy classes of interest.

Example 1 (VC Policies). As a concrete example, consider the case when the class F_Π is a VC-subgraph class of VC dimension d (e.g. the policy space has small VC-dimension or pseudo-dimension), and let S = E[sup_π v_DR(x; π)²]. Then Theorem 2.6.7 of [22] shows that: H_2(ε, F_Π, n) = O(d(1 + log(S/ε))) (see also discussion in Appendix G). 
This implies that

κ(r, F_Π) = O( ∫_0^r √(d(1 + log(S/ε))) dε ) = O( r √d √(1 + log(S/r)) ).

Hence, we can conclude that regret is O( √(V⁰_2 (1 + log(S/V⁰_2)) d/n) + √(V⁰_2 log(1/δ)/n) + h²_{n,δ} ). For the case of binary action policies (as we discuss in Appendix G) this result recovers the result of [1] for binary treatments up to constants and extends it to arbitrary action spaces and VC-subgraph policies.

Example 2 (High-Dimensional Index Policies). As an example, we consider the class of policies characterized by a constant number of ℓ1- or ℓ0-bounded linear indices:

Π1 = {z → Γ(⟨β1, z⟩, . . . , ⟨βd, z⟩) : β_i ∈ R^p, ‖β_i‖_1 ≤ s}    (18)

where Γ : R^d → R^m is a fixed L-Lipschitz function of the indices, with d, m constants, while p ≫ n (and similarly for Π0, where we use ‖β_i‖_0 ≤ s). Assuming v_DR(y, a, z; π) is a Lipschitz function of π(z), and since Γ is a Lipschitz function of ⟨β, z⟩, we have by a standard multi-variate Lipschitz contraction argument (and since d, m are constants) that the entropy of F_Π is of the same order as the maximum entropy of each of the linear index spaces: B1 := {z → ⟨β, z⟩ : β ∈ R^p, ‖β‖_1 ≤ s}. Moreover, by known covering arguments (see e.g. [25], Theorem 3), if ‖z‖_∞ ≤ 1, then: H_2(ε, B, n) = O(s² log(d)/ε²). Thus we get κ(r, F_Π) = O( s log(n) √(log(d)/n) ), which leads to regret O( s log(n) √(log(d)/n) + √(V⁰_2 log(1/δ)/n) + h²_{n,δ} ). In this setting, we observe that the policy space is too large for the variance to drive the asymptotic regret. There is a leading term that remains even if the worst-case variance of policies in a small-regret slice is 0. Intuitively this stems from the high-dimensionality of the linear indices, which introduces an extra dimension of error, namely bias due to regularization. On the contrary, for exactly sparse policies B0 := {z → ⟨β, z⟩ : β ∈ R^p, ‖β‖_0 ≤ s}, we have that since for any possible support the entropy at scale ε is at most O(s log(1/ε)), we can take a union over all (p choose s) ≤ (ep/s)^s possible sparse supports, which implies H_2(ε, F_Π, n) = O(s(log(d/s) + log(1/ε))). Thus κ(r, F_Π) = O( r √(log(1/r)) √(s log(d/s)/n) ), leading to policy regret similar to the VC classes: O( √(V⁰_2 log(1/V⁰_2)) √(s log(d/s)/n) + √(V⁰_2 log(1/δ)/n) + h²_{n,δ} ).

Remark 4 (Instrumental Variable Estimation). Our main regret results extend to instrumental variable settings where treatments are endogenous but we have a vector of instrumental variables w satisfying

E[(y − ⟨θ0(z), φ(a, z)⟩) w | z] = 0,

and Σ^I_0(z) = E[w φ(a, z)ᵀ | z] is invertible. Then we can use the following doubly robust moment:

θ_DR,I(y, a, z, w) = θ̂(z) + Σ̂^I(z)⁻¹ w (y − ⟨θ̂(z), φ(a, z)⟩).

Remark 5 (Estimating the First Stages). Bounds on first stage errors as a function of sample complexity measures can be obtained by standard results on the MSE achievable by regression problems (see e.g. [18, 23]). Essentially these are bounds for the regression estimates θ̂(z) and Σ̂(z), as a function of the complexity of their assumed hypothesis spaces. 
Since the latter is a standard statistical learning problem that is orthogonal to our main contribution, we omit technical details.⁴ Since the square loss is a strongly convex objective, the rates achievable for these problems are typically fast rates on the MSE (e.g. h²_{n,δ} is of the order 1/n for the case of parametric hypothesis spaces, and typically o(1/√n) for reproducing kernel Hilbert spaces with fast eigendecay (see e.g. [23])). Thus the term h²_{n,δ} is of lower order. For instance, the rates required for the term h²_{n,δ} to be of second order in our regret bounds are achievable if these nuisance regressions are ℓ1-penalized linear regressions and several regularity assumptions are satisfied by the data distribution, even when the dimension p of z is growing with n.

Extension: Semi-Bandit Feedback. Suppose that our value function takes the form: V(a, z) = φ(a, z)ᵀ Θ0(z) φ(a, z), where Θ0(z) is a p × p matrix and we observe semi-bandit feedback, i.e. we observe a vector Y s.t.: E[Y | a, z] = Θ0(z)ᵀ φ(a, z). Then we can apply our DR approach to each coordinate of Y separately:

V_DR(π) = E_n[ φ(π(z), z)ᵀ ( Θ̂(z) + Σ̂(z)⁻¹ φ(a, z) (Yᵀ − φ(a, z)ᵀ Θ̂(z)) ) φ(π(z), z) ]

All the theorems in this section extend to this case, which will prove useful in our pricing application where a is the price of a set of products and Y is the vector of observed demands for each product.

4 Application: Personalized Pricing

Consider the personalized pricing of a single product. 
The objective is the revenue:

V(p, z) = p (a(z) − b(z) p)

where b(z) ≥ γ > 0 and a(z) − b(z) p gives the unknown, context-specific demand function. We assume that we observe an unbiased estimate d of demand:

E[d | z, p] = a(z) − b(z) p

We want to optimize over a space of personalized pricing policies Π. If, for instance, the observational policy was homoskedastic (i.e. the exploration component was independent of the context z), we show in Appendix H that doubly robust estimators for a(z) and b(z) are

a_DR(z) = â(z) + (1 + ĝ(z) (ĝ(z) − p) / σ̂²) (d − â(z) + b̂(z) p)
b_DR(z) = b̂(z) + ((ĝ(z) − p) / σ̂²) (d − â(z) + b̂(z) p)

where ĝ(z) estimates g(z) = E[p | z] and σ̂² estimates the exploration variance σ². Thus, in this example, we only need to estimate the mean treatment policy E[p | z] and the variance σ².

4Also, as we show in the pricing application, in some problems simpler estimators arise for the covariance matrix.

Figure 1: (a) Black line shows the true value of the policy and each line shows the mean and standard deviation of the value of the corresponding policy over 100 simulations. (b) Each line shows the mean and standard deviation of regret over 100 simulations. The top half reports the regret for a constant policy, the bottom half reports regret for a linear policy. (Plot panels omitted; legend: Direct, IPS, Doubly Robust, Oracle.)

Experimental evaluation. We empirically evaluate our framework on the personalized pricing application with synthetic data.
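The doubly robust corrections for a(z) and b(z) above are simple to compute once the nuisance estimates are in hand. The following minimal numpy sketch (function and argument names are ours, not from the paper's code; the nuisance estimates â, b̂, ĝ, σ̂² are taken as given) applies the two formulas to logged data:

```python
import numpy as np

def dr_demand_coefficients(d, p, a_hat, b_hat, g_hat, sigma2_hat):
    """Doubly robust corrections for the demand model E[d | z, p] = a(z) - b(z) p
    under a homoskedastic logging policy with mean g(z) = E[p | z] and
    exploration variance sigma^2. Arguments are per-observation arrays (or
    scalars) of nuisance estimates evaluated at each context z."""
    # Residual of the fitted demand model at the logged price p.
    resid = d - (a_hat - b_hat * p)
    a_dr = a_hat + (1.0 + g_hat * (g_hat - p) / sigma2_hat) * resid
    b_dr = b_hat + ((g_hat - p) / sigma2_hat) * resid
    return a_dr, b_dr
```

Averaging a_dr and b_dr over a held-out fold remains unbiased for E[a(z)] and E[b(z)] even when â or b̂ is biased, provided ĝ and σ̂² are consistent; this is one side of the double robustness guarantee.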
In particular, we use simulations to assess our estimator's ability to evaluate and optimize personalized pricing functions. To do this, we compare the performance of our doubly robust estimator with (1) the direct estimator, ⟨θ̂(z), φ(a, z)⟩, (2) the inverse propensity score estimator,5 and (3) the oracle orthogonal estimator, v^o_DR(x, π).

Data Generating Process. Our simulation design considers a sparse model. We assume that there are k continuous context variables distributed uniformly, z_i ∼ U(1, 2) for i = 1, . . . , k, but only l of them affect demand. Let z̄ = (z₁ + ··· + z_l)/l. Price p and demand d are generated as p ∼ N(z̄, 1), d = a(z̄) − b(z̄) p + ε with ε ∼ N(0, 1). We consider four functional forms for the demand model: (i) (Quadratic) a(z) = 2z², b(z) = 0.6z; (ii) (Step) a(z) = 5·1{z < 1.5} + 6·1{z > 1.5}, b(z) = 0.7·1{z < 1.5} + 1.2·1{z > 1.5}; (iii) (Sigmoid) a(z) = 1/(1 + exp(z)) + 3, b(z) = 2/(1 + exp(z)) + 0.1; (iv) (Linear) a(z) = 6z, b(z) = z.

These functions and the data generating process ensure that the conditional expectation function of demand given z is non-negative for all z, the observed prices are positive with high probability, and the optimal prices are in the support of the observed prices. In each experiment, we generate 1000, 2000, 5000, and 10000 data points, and report results over 100 simulations.

5θ_IPS(y, a, z) = Σ̂(z)⁻¹ φ(a, z) y

Figure 2: Quadratic, Low Dimensional Regime: (a) Black line shows the true value of the policy, each line shows the mean and standard deviation of the value of the corresponding policy over 100 simulations. (b) Mean and standard deviation of regret reported over 100 simulations. We omit the results for the inverse propensity score method since they are too large to report together with the other estimates in the high dimensional regime. (Plot panels omitted; legend: Direct, Doubly Robust, Oracle.)
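For concreteness, the sampling scheme just described can be sketched as follows (a hypothetical helper of our own naming, shown for the quadratic demand specification only):

```python
import numpy as np

def simulate_pricing_data(n, k=2, l=1, seed=0):
    """Synthetic design sketch: k contexts z_i ~ U(1, 2), of which only the
    first l affect demand through their mean z_bar; logged prices carry
    N(0, 1) exploration noise around z_bar."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(1.0, 2.0, size=(n, k))
    z_bar = z[:, :l].mean(axis=1)
    a, b = 2.0 * z_bar ** 2, 0.6 * z_bar     # (i) quadratic demand functions
    p = rng.normal(loc=z_bar, scale=1.0)     # logging policy: p ~ N(z_bar, 1)
    d = a - b * p + rng.standard_normal(n)   # demand d = a(z_bar) - b(z_bar) p + eps
    return z, p, d
```

The low dimensional regime corresponds to k = 2, l = 1 and the high dimensional regime to k = 10, l = 3.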
We estimate the nuisance functions using a 5-fold cross-validated lasso model with polynomials of degree up to 3 and all two-way interactions of the context variables. We present results for two regimes: (i) low dimensional, with k = 2, l = 1; (ii) high dimensional, with k = 10, l = 3.

Policy Evaluation. For policy evaluation we consider four pricing functions: (i) Constant, π(z) = 1; (ii) Linear, π(z) = z; (iii) Threshold, π(z) = 1 + 1{z > 1.5}; (iv) Sin, π(z) = sin(z). The results for the low dimensional regime are summarized in Figure 1(a), where each row and column corresponds to a different demand function and policy function, respectively.6 The results show that, as expected, the performance of our method is very similar to that of the oracle estimator and significantly better than that of the direct and inverse propensity score methods, which suffer from large biases. These results also support our claim that the asymptotic variance of the doubly robust estimate is the same as the variance of the oracle method. It is also worth pointing out that we obtain similar performance across the two regimes.

Regret. To investigate the regret performance of our method, we consider a constant pricing function, π(z) = γ, and a linear policy, π(z) = γ₁z₁ + ··· + γ_k z_k. We compute the optimal pricing functions in these function spaces and report the distribution of regret in Figure 1(b) under the low dimensional regime and in Appendix I under the high dimensional regime. Across the four demand functions and two pricing functions, our method achieves small regret, comparable to the oracle. The direct and inverse propensity methods, depending on the demand function, yield large regrets.

Quadratic Model. Finally, we consider the same simulation exercise under the assumption that an unbiased estimate of revenue rather than demand is observed.
Since revenue depends on p², the model is now quadratic: r = a(z) p − b(z) p² + ε. For the data generating process we use the same functions a(z) and b(z) as in the personalized pricing example.7 Figures 2 and 5 in Appendix I summarize the results for policy evaluation and optimization. The overall performance of our doubly robust estimator is similar to that in the demand model, and it performs better than the direct model.

6The results are very similar for the high dimensional model, which are reported in Figure 4(a) in the appendix.
7We provide the calculation of the doubly robust estimators for this example in Appendix H.

References

[1] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2018.

[2] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138. ACM, 2009.

[3] Peter J. Bickel, Chris A. J. Klaassen, Ya'acov Ritov, and Jon A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models, volume 4. Johns Hopkins University Press, Baltimore, 1993.

[4] Gary Chamberlain. Efficiency bounds for semiparametric regression. Econometrica: Journal of the Econometric Society, pages 567–596, 1992.

[5] Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters.
The\nEconometrics Journal, 21(1):C1\u2013C68, 2018.\n\n[6] Victor Chernozhukov, Juan Carlos Escanciano, Hidehiko Ichimura, Whitney K Newey, and James M\n\nRobins. Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033, 2016.\n\n[7] Miroslav Dud\u00edk, John Langford, and Lihong Li. Doubly robust policy evaluation and learning.\n\nIn\nProceedings of the 28th International Conference on International Conference on Machine Learning, pages\n1097\u20131104. Omnipress, 2011.\n\n[8] Robert F Engle, Clive WJ Granger, John Rice, and Andrew Weiss. Semiparametric estimates of the relation\nbetween weather and electricity sales. Journal of the American statistical Association, 81(394):310\u2013320,\n1986.\n\n[9] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036,\n\n2019.\n\n[10] Bryan S Graham and Cristine Campos de Xavier Pinto. Semiparametrically ef\ufb01cient estimation of the\n\naverage linear regression function. Working paper, National Bureau of Economic Research, 2018.\n\n[11] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences.\n\nCambridge University Press, 2015.\n\n[12] Nathan Kallus and Angela Zhou. Policy evaluation and optimization with continuous treatments. arXiv\n\npreprint arXiv:1802.06037, 2018.\n\n[13] Toru Kitagawa and Aleksey Tetenov. Who should be treated? empirical welfare maximization methods for\n\ntreatment choice. Econometrica, 86(2):591\u2013616, 2018.\n\n[14] Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits\n\nwith continuous actions: Smoothing, zooming, and adapting. arXiv preprint arXiv:1902.01520, 2019.\n\n[15] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization.\n\nIn The 22nd Conference on Learning Theory (COLT), 2009.\n\n[16] Whitney K Newey. Semiparametric ef\ufb01ciency bounds. 
Journal of Applied Econometrics, 5(2):99–135, 1990.

[17] Min Qian and Susan A Murphy. Performance guarantees for individualized treatment rules. Annals of Statistics, 39(2):1180, 2011.

[18] Alexander Rakhlin, Karthik Sridharan, Alexandre B Tsybakov, et al. Empirical entropy, minimax regret and minimax risk. Bernoulli, 23(2):789–824, 2017.

[19] Peter M Robinson. Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931–954, 1988.

[20] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.

[21] Mark J Van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media, 2011.

[22] A. W. Van Der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series, March 1996.

[23] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

[24] Jeffrey M Wooldridge. Estimating average partial effects under conditional moment independence assumptions. cemmap working paper, 2004.

[25] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.

[26] Yingqi Zhao, Donglin Zeng, A John Rush, and Michael R Kosorok. Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106–1118, 2012.

[27] Xin Zhou, Nicole Mayer-Hamblett, Umer Khan, and Michael R Kosorok. Residual weighted learning for estimating individualized treatment rules.
Journal of the American Statistical Association, 112(517):169–187, 2017.

[28] Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. arXiv preprint arXiv:1810.04778, 2018.