{"title": "Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3325, "page_last": 3334, "abstract": "Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates. DR and its variants ensure semiparametric local efficiency if Q-functions are well-specified, but if they are not they can be worse than both IS and SNIS. DR also does not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. On the way, we categorize various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.", "full_text": "Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Nathan Kallus
Cornell University
New York, NY
kallus@cornell.edu

Masatoshi Uehara *
Harvard University
Cambridge, MA
uehara_m@g.harvard.edu

Abstract

Off-policy evaluation (OPE) in both contextual bandits and reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. The problem's importance has attracted many proposed solutions, including importance sampling (IS), self-normalized IS (SNIS), and doubly robust (DR) estimates.
DR and its variants ensure semiparametric local efficiency if Q-functions are well-specified, but if they are not they can be worse than both IS and SNIS. DR also does not enjoy SNIS's inherent stability and boundedness. We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS. On the way, we categorize various properties and classify existing estimators by them. Besides the theoretical guarantees, empirical studies suggest the new estimators provide advantages.

1 Introduction

Off-policy evaluation (OPE) is the problem of evaluating a given policy (the evaluation policy) using logged data generated by another policy (the behavior policy). OPE is a key problem in both reinforcement learning (RL) [7, 13–15, 17, 23, 32] and contextual bandits (CB) [5, 19, 28], and it finds applications as varied as healthcare [18] and education [16].
Methods for OPE can be roughly categorized into three types. The first approach is the direct method (DM), wherein we estimate the Q-function by regression and plug it in to estimate the value of the evaluation policy. The problem with this approach is that if the model is wrong (misspecified), the estimator is no longer consistent.
The second approach is importance sampling (IS; aka Horvitz-Thompson), which averages the data weighted by the density ratio of the evaluation and behavior policies. Although IS gives an unbiased and consistent estimate, its variance tends to be large. Therefore, self-normalized IS (SNIS; aka Hájek) is often used [28], which divides IS by the average of the density ratios. SNIS has two important properties: (1) its value is bounded within the support of the rewards and (2) its conditional variance given action and state is bounded by the conditional variance of the rewards.
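To make the contrast concrete, here is a small self-contained Python sketch (ours, not from the paper) of IS and SNIS on synthetic logged bandit data; the policies, rewards, and all variable names are illustrative assumptions:

```python
import numpy as np

def is_estimate(w, r):
    # Importance sampling (Horvitz-Thompson): unbiased, but the estimate can
    # leave the reward range when the density ratios w are large.
    return np.mean(w * r)

def snis_estimate(w, r):
    # Self-normalized IS (Hajek): dividing by the average ratio makes the
    # estimate a convex combination of the observed rewards, so it always
    # lies within the reward range (the boundedness property above).
    return np.sum(w * r) / np.sum(w)

# Toy logged bandit data: behavior policy is uniform over two actions,
# evaluation policy prefers action 1 on positive contexts.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)              # logged actions, pi_b = 1/2
p_e1 = 1.0 / (1.0 + np.exp(-3.0 * x))         # pi_e(a=1 | x)
w = np.where(a == 1, p_e1, 1.0 - p_e1) / 0.5  # density ratio pi_e / pi_b
r = rng.uniform(0.0, 1.0, size=n)             # rewards in [0, R_max], R_max = 1
print(is_estimate(w, r), snis_estimate(w, r))
```

Because the SNIS estimate is a weighted average of observed rewards, it stays in [0, R_max] no matter how extreme the ratios become, while the IS estimate need not.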
This leads to increased stability\ncompared with IS, especially when the density ratios are highly variable due to low overlap.\nThe third approach is the doubly robust (DR) method, which combines DM and IS and is given by\nadding the estimated Q-function as a control variate [5, 7, 25]. If the Q-function is well speci\ufb01ed, DR\nis locally ef\ufb01cient in the sense that its asymptotic MSE achieves the semiparametric lower bound\n[33]. However, if the Q-function is misspeci\ufb01ed, DR can actually have worse MSE than IS and/or\nSNIS [12]. In addition, it does not have the boundedness property.\nTo address these de\ufb01ciencies, we propose novel OPE estimators for both CB and RL that are\nguaranteed to improve over both (SN)IS and DR in terms of asymptotic MSE (termed intrinsic\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fTable 1: Comparison of policy evaluation methods. The notation (*) means proposed estimator.\nThe notation # means partially satis\ufb01ed, as discussed in the text. (S)IS and SN(S)IS refer either to\nstepwise or non-stepwise.\n\nConsistency\nLocal ef\ufb01ciency\nIntrinsic ef\ufb01ciency\nBoundedness\nStability\n\nDM (S)IS SN(S)IS DR SNDR MDR REG(*) SNREG(*) EMP(*)\n\n1\n#\n\n1\n\n#\n\n2\n#\n\n#\n2\n\n1\n\nef\ufb01ciency) and at the same time also satisfy the same boundedness and stability properties of SNIS,\nin addition to the consistency and local ef\ufb01ciency of existing DR methods. See Table 1. Our general\nstrategy to obtain these estimators is to (1) make a parametrized class of estimators that includes IS,\nSNIS, and DR and (2) choose the parameter using either a regression way (REG) or an empirical\nlikelihood way (EMP). 
The bene\ufb01t of these new properties in practice is con\ufb01rmed by experiments in\nboth CB and RL settings.\n2 Sequential Decision Processes and Off Policy Evaluation\nA sequential decision process is de\ufb01ned by a tuple (X ,A, P, R, P0, \u03b3), where X and A are the state\nand action spaces, Pr(x, a) is the distribution of the bounded random variable r(x, a) \u2208 [0, Rmax]\nbeing the immediate reward of taking action a in state x, P (\u00b7|x, a) is the transition probability\ndistribution, P0 is the initial state distribution, and \u03b3 \u2208 [0, 1] is the discounting factor. A policy\n\u03c0 : X \u00d7 A \u2192 [0, 1] assigns each state x \u2208 X a distribution over actions with \u03c0(a|x) being the\nprobability of taking actions a into x. We denote HT\u22121 = (x0, a0, r0,\u00b7\u00b7\u00b7 , xT\u22121, aT\u22121, rT\u22121) as a\nt=0 \u03b3trt, which is the return\n\nT-step trajectory generated by policy \u03c0, and de\ufb01ne RT\u22121(HT\u22121) =(cid:80)T\u22121\n\nof trajectory. Our task is to estimate\n\nT = E[RT\u22121(HT\u22121)]\n\u03b2\u03c0\n\n(policy value).\n\nWe further de\ufb01ne the value function V \u03c0(x) and Q-function Q\u03c0(x, a) of a policy \u03c0, respectively, as\nthe expectation of the return of a T -step trajectory generated by starting at state x and state-action\npair (x, a). Note that the contextual bandit setting is a special case when T = 1.\nThe off-policy evaluation (OPE) problem is to estimate \u03b2\u2217 = \u03b2\u03c0e\nT for the evaluation policy \u03c0e from n\nobservation of T -step trajectories D = {H(i)\ni=1 independently generated by the behavior policy\n\u03c0b. Here, we assume an overlap condition: for all state-action pair (x, a) \u2208 X \u00d7 A if \u03c0b(a|x) = 0\nthen \u03c0e(a|x) = 0. Throughout, expectations E[\u00b7] are taken with respect to a behavior policy. 
For any\nfunction of the trajectory, we let\n\nT\u22121}n\n\nEn[f (HT\u22121)] = n\u22121(cid:80)n\n\u03c9t1:t2 =(cid:81)t2\n\nt=t1\n\ni=1 f (H(i)\n\nT\u22121).\n\nAsmse[\u00b7] denotes asymptotic MSE in terms of the \ufb01rst order; i.e., Asmse[ \u02c6\u03b2] = MSE[ \u02c6\u03b2] + o(n\u22121).\nThe cumulative importance ratio from time step t1 to time step t2 is\n\u03c0e(at|xt)/\u03c0b(at|xt),\n\nwhere the empty product is 1. We assume that this weight is bounded for simplicity.\n2.1 Existing Estimators and Properties\n\nWe summarize three types of estimators. Some estimators depend on a model q(x, a; \u03c4 ) with\nparameter \u03c4 \u2208 \u0398\u03c4 for the Q-function Q\u03c0e(x, a). We say the model is correct or well-speci\ufb01ed if\nthere is some \u03c40 such that Q\u03c0e(x, a) = q(x, a; \u03c40) and otherwise we say it is wrong or misspeci\ufb01ed.\nThroughout, we make the following assumption about the model\nAssumption 2.1. (a1) \u0398\u03c4 is compact, (a2) |q(x, a; \u03c4 )| \u2264 Rmax.\nDirect estimator: DM is given by \ufb01tting \u02c6\u03c4, e.g., by least squares, and then plugging this into\n\n\u02c6\u03b2dm = En\n\n.\n\n(1)\n\n(cid:34)(cid:88)\n\na\u2208A\n\n(cid:35)\n\n\u03c0e(a|x(i)\n\n0 )q(x(i)\n\n0 , a; \u02c6\u03c4 )\n\n2\n\n\fWhen this model is correct, \u02c6\u03b2dm is both consistent for \u03b2\u2217 and locally ef\ufb01cient in that its asymptotic\nMSE is minimized among the class of all estimators consistent for \u03b2\u2217 [19, 33].\nDe\ufb01nition 2.1 (Local ef\ufb01ciency). When the model q(x, a; \u03c4 ) is well-speci\ufb01ed, the estimator achieves\nthe ef\ufb01ciency bound.\n\nHowever, all of models are wrong to some extent. In this sense, even if the sample size goes to\nin\ufb01nity, \u02c6\u03b2dm might not be consistent.\nDe\ufb01nition 2.2 (Consistency). 
The estimator is consistent for \u03b2\u2217 irrespective of model speci\ufb01cation.\nImportance sampling estimators: IS and step-wise IS (SIS) are de\ufb01ned\nrespectively as\n\u02c6\u03b2is = En\n\n(cid:104)(cid:80)T\u22121\n\n(cid:80)T\u22121\n\nEMP = REG = DR\n\n, \u02c6\u03b2sis = En\n\n\u03c90:T\u22121\n\n(cid:69)\n\nt=0 \u03c90:t\u03b3trt\n\n.\n\nt=0 \u03b3trt\n\nIS, SNIS\n\n(cid:104)\n\n(cid:105)\n\n(cid:105)\n\nThe weights \u03c90:t are assumed known here as is common in RL; otherwise\nthey can either be estimated directly or chosen by optimal balance [1, 8, 9].\nBoth IS and SIS satisfy consistency but the MSE of SIS estimator is\nsmaller than regular IS estimator by the law of total variance [27]. The\nself-normalized versions of these estimators are:\n\n(cid:104)\n\n(cid:80)T\u22121\n\n(cid:105)\n\n(cid:35)\n\nEn\n\n\u02c6\u03b2snis =\n\n\u03c90:T\u22121\n\nt=0 \u03b3trt\n\nEn[\u03c90:T\u22121]\n\n, \u02c6\u03b2snsis = En\n\n\u03c90:t\n\nEn[\u03c90:t]\n\n\u03b3trt\n\n.\n\nSN(S)IS have two advantages over (S)IS. First, they are both 1-bounded\nin that they are bounded by the theoretical upper bound of reward.\n\nDe\ufb01nition 2.3 (\u03b1-Boundedness). The estimator is bounded by \u03b1(cid:80)T\u22121\n\n(cid:34)T\u22121(cid:88)\n\nt=0\n\n(a) Well-speci\ufb01ed\n\nEMP = REG\n\n(cid:69)\n\nIS, SNIS, DR\n(b) Misspeci\ufb01ed\n\nFigure 1:\nasymptotic MSEs\n\nOrder of\n\nt=0 \u03b3tRmax.\n\ndata. If the conditional variance of(cid:80)T\u22121\n\n1-boundedness is the best we can achieve where \u03b1-boundedness for any \u03b1 > 1 is a weaker property.\nSecond, their conditional variance given state and action data are no larger than the conditional\nvariance of any reward, to which we refer as stability.\nDe\ufb01nition 2.4 (Stability). Let Dx,a = {(x(i)\nt=1 \u03b3tr(i)\nvariance of the estimator, given Dx,a, is also bounded by \u03c32.\nUnlike ef\ufb01ciency, boundedness and stability are \ufb01nite-sample properties. 
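The trajectory-level estimators above can be sketched as follows (an illustrative implementation under our own array conventions, not from the paper: each of pi_e, pi_b, rewards is an n × T array holding per-step action probabilities and rewards along the logged trajectories):

```python
import numpy as np

def cumulative_ratios(pi_e, pi_b):
    # omega_{0:t} = prod_{s <= t} pi_e(a_s|x_s) / pi_b(a_s|x_s) for every prefix t.
    return np.cumprod(pi_e / pi_b, axis=1)

def is_trajectory_estimate(pi_e, pi_b, rewards, gamma):
    # Plain IS: the whole discounted return is weighted by omega_{0:T-1}.
    T = rewards.shape[1]
    omega_full = np.prod(pi_e / pi_b, axis=1)
    disc = gamma ** np.arange(T)
    return np.mean(omega_full * np.sum(disc * rewards, axis=1))

def sis_estimate(pi_e, pi_b, rewards, gamma):
    # Step-wise IS: the reward at step t is weighted by omega_{0:t} only,
    # removing the variance contributed by future action ratios.
    T = rewards.shape[1]
    omega = cumulative_ratios(pi_e, pi_b)
    disc = gamma ** np.arange(T)
    return np.mean(np.sum(omega * disc * rewards, axis=1))

def snsis_estimate(pi_e, pi_b, rewards, gamma):
    # Self-normalized step-wise IS: each omega_{0:t} is divided by its
    # empirical mean across trajectories before weighting.
    T = rewards.shape[1]
    omega = cumulative_ratios(pi_e, pi_b)
    omega = omega / omega.mean(axis=0, keepdims=True)
    disc = gamma ** np.arange(T)
    return np.mean(np.sum(omega * disc * rewards, axis=1))
```

On-policy data (pi_e equal to pi_b) makes every ratio 1, so all three reduce to the average discounted return, a useful sanity check.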
Notably (S)IS lacks both of\nthese properties, which explains its unstable performance in practice, especially when density ratios\ncan be very large. While boundedness can be achieved by a simple truncation, stability cannot.\nDoubly robust estimators: A DR estimator for RL [7, 32] is given by \ufb01tting \u02c6\u03c4 and plugging it into\n\nt ) : i \u2264 n, t \u2264 T \u2212 1} denote that action-state\n, a(i)\n, given Dx,a, is bounded by \u03c32, then the conditional\n\nt\n\nt\n\n\u02c6\u03b2dr = \u02c6\u03b2d({q(x, a; \u02c6\u03c4 )}T\u22121\nt=0 ),\n\nwhere for any collection of functions {mt}T\u22121\n\u02c6\u03b2d({mt}T\u22121\n\n\u03b3t\u03c90:trt \u2212 \u03b3t\n\nt=0 ) = En\n\nt=0 (known as control variates) we let\n\u03c90:tmt(xt, at) \u2212 \u03c90:t\u22121\n\nmt(xt, a)\u03c0e(a|xt)\n\n(cid:32)\n\n(cid:33)(cid:33)(cid:35)\n\n.\n\n(cid:34)T\u22121(cid:88)\n\nt=0\n\n(cid:32)(cid:88)\n\na\u2208A\n\n(2)\nThe DR estimator is both consistent and locally ef\ufb01cient. Recently, this approach to ef\ufb01cient OPE\nhas been extended to Markov decision process [10] and in\ufb01nite-horizon problems [11]; in this paper\nwe focus on the more general sequential decision process. Instead of using a plug-in estimate of \u03c4\nin eq. 2, [4, 6, 26] further suggest that to pick \u02c6\u03c4 to minimize an estimate of the asymptotic variance\nof \u02c6\u03b2d({q(x, a; \u03c4 )}T\u22121\nt=0 ), leading to the MDR estimator [6] for OPE. However, DR and MDR satisfy\nneither boundedness nor stability. Replacing, \u03c90:t with its self-normalized version \u03c90:t/En [\u03c90:t]\nin (2) leads to SNDR [24, 32] (aka WDR), but it only satis\ufb01es these properties partially: it\u2019s only\n2-bounded and partially stable (see Appendix B).\nMoreover, if the model is incorrectly speci\ufb01ed, (M)DR may have MSE that is worse than any of the\nfour (SN)(S)IS estimators. 
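For reference, the control-variate form (2) can be implemented directly. The sketch below is ours, not the paper's code; array shapes and names are assumptions, and the empty product ω_{0:−1} is taken to be 1 as in the text:

```python
import numpy as np

def dr_estimate(pi_e, pi_b, rewards, q_taken, q_all, pi_e_all, gamma):
    # pi_e, pi_b, rewards: (n, T) arrays along logged trajectories.
    # q_taken[i, t] = m_t(x_t, a_t) at the logged action;
    # q_all[i, t, a] = m_t(x_t, a); pi_e_all[i, t, a] = pi_e(a | x_t).
    n, T = rewards.shape
    omega = np.cumprod(pi_e / pi_b, axis=1)                # omega_{0:t}
    omega_prev = np.concatenate(
        [np.ones((n, 1)), omega[:, :-1]], axis=1)          # omega_{0:t-1}, empty product = 1
    v = np.sum(q_all * pi_e_all, axis=2)                   # sum_a m_t(x_t, a) pi_e(a|x_t)
    disc = gamma ** np.arange(T)
    terms = disc * (omega * rewards - (omega * q_taken - omega_prev * v))
    return np.mean(np.sum(terms, axis=1))
```

Setting the control variates m_t identically to zero makes the correction term vanish, recovering the step-wise IS estimate.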
[12] also experimentally showed that the performance of β̂_dr can be very bad in practice when the model is wrong.
We therefore define intrinsic efficiency as an additional desideratum, which rules this out.

Definition 2.5 (Intrinsic efficiency). The asymptotic MSE of the estimator is smaller than that of any of β̂_sis, β̂_is, β̂_snsis, β̂_snis, β̂_dr, irrespective of model specification.

MDR can be seen as motivated by a variant of intrinsic efficiency against only DR (hence the # in Table 1). Although this is not precisely proven in [6], it arises as a corollary of our results. Nonetheless, MDR does not achieve full intrinsic efficiency against all of the above estimators.

3 REG and EMP for Contextual Bandit

None of the estimators above simultaneously satisfies all desired properties, Definitions 2.1–2.5. In the next sections, we develop new estimators that do. For clarity we first consider the simpler CB setting, where we write (x, a, r) and w instead of (x0, a0, r0) and ω_{0:0}. We then start by showing how a modification to MDR ensures intrinsic efficiency. To obtain the other desiderata, we have to change how we choose the parameters. For an intuitive, detailed explanation, see Appendix A.

3.1 REG: Intrinsic Efficiency

When T = 1, β̂_d(m) in (2) becomes simply

β̂_d(m) = En[wr − F(m)],    (3)

where F(m(x, a)) = w m(x, a) − Σ_{a∈A} m(x, a) πe(a|x). By construction, E[F(m)] = 0 for every m. (M)DR, for example, uses m(x, a; τ) = q(x, a; τ).
Instead, we let

m(x, a; ζ1, ζ2, τ) = ζ1 + ζ2 q(x, a; τ),

for parameters τ and ζ = (ζ1, ζ2). This new choice has a special property: it includes both IS and DR estimators.
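In code, the expanded control-variate class looks as follows (an illustrative sketch with our own names). Since ζ = (0, 0) makes m ≡ 0 and hence F(m) ≡ 0, β̂_d reduces to En[wr], the IS estimate, while ζ = (0, 1) recovers the DR control variate:

```python
import numpy as np

def F(m_taken, m_all, pi_e_all, w):
    # F(m) = w * m(x, a) - sum_a m(x, a) * pi_e(a|x); E[F(m)] = 0 for any m.
    return w * m_taken - np.sum(m_all * pi_e_all, axis=1)

def beta_d(w, r, m_taken, m_all, pi_e_all):
    # The control-variate estimator beta_d(m) = En[w*r - F(m)].
    return np.mean(w * r - F(m_taken, m_all, pi_e_all, w))

def m_family(zeta1, zeta2, q_taken, q_all):
    # The parametrized control variate m = zeta1 + zeta2 * q(x, a),
    # evaluated at the logged action and at all actions.
    return zeta1 + zeta2 * q_taken, zeta1 + zeta2 * q_all
```

Here `q_taken` holds q(x, a) at the logged action and `q_all` holds q(x, a) for every action; both names are ours.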
Given any \u03c4, setting \u03b61 = 0, \u03b62 = 0 yields IS and setting \u03b61 = 0, \u03b62 = 1 gives (M)DR.\nThis gives a simple recipe for intrinsic ef\ufb01ciency: estimate the variance of \u02c6\u03b2d(\u03b61 + \u03b62q(x, a; \u03c4 )) and\n\nminimize it over \u03c4, \u03b6. Because \u02c6\u03b2d(m) is unbiased, its variance is simply E(cid:2){wr \u2212 F(m)}2(cid:3) \u2212 \u03b2\u22172.\n\nTherefore, over the parameter spaces \u0398\u03c4 the (unknown) minimal variance choice is\n\n(cid:104){wr \u2212 F(\u03b61 + \u03b62q(x, a; \u03c4 ))}2(cid:105)\n(cid:104){wr \u2212 F(\u03b61 + \u03b62q(x, a; \u03c4 ))}2(cid:105)\n\n.\n\n(\u03b6\u2217, \u03c4\u2217) = arg min\n\u03b6\u2208R2,\u03c4\u2208\u0398\u03c4\n\nE\n\n(\u02c6\u03b6, \u02c6\u03c4 ) = arg min\n\u03b6\u2208R2,\u03c4\u2208\u0398\u03c4\n\nEn\n\nWe let the REG estimator be \u02c6\u03b2reg = \u02c6\u03b2d(\u02c6\u03b61 + \u02c6\u03b62q(x, a; \u02c6\u03c4 )) where we choose the parameters by\nminimizing the estimated variance:\n\n.\n\n(4)\n\n(5)\n\nAsmse[ \u02c6\u03b2reg] = n\u22121 min\n\n\u03b6\u2208R2,\u03c4\u2208\u0398\u03c4\n\nE\n\nTo establish desired ef\ufb01ciencies, we prove the following theorem indicating that our choice of\nparameters does not in\ufb02ate the variance. Note that it is not obvious because the plug-in some\nparameters generally causes an in\ufb02ation of the variance.\nTheorem 3.1. When the optimal solution (\u03b6\u2217, \u03c4\u2217) in (4) is unique,\n\n(cid:104){wr \u2212 F(\u03b61 + \u03b62q(x, a; \u03c4 ))}2 \u2212 \u03b2\u22172(cid:105)\n\n.\n\n(cid:104){wr \u2212 F(\u03b61 + \u03b62q(x, a; \u03c4 ))}2 \u2212 \u03b2\u22172(cid:105)\n\nRemark 3.1. For \u03b6 = (\u03b2\u2217, 0), this asymptotic MSE is the same as the one of SNIS, var[w(r \u2212 \u03b2\u2217)].\nFrom Theorem 3.1 we obtain the desired ef\ufb01ciencies.\nImportantly, to prove this, we note\nhow the asymptotic MSEs of each of (SN)(S)IS and DR can be represented in the form\nn\u22121E\nCorollary 3.1. 
The estimator \u02c6\u03b2reg has local and intrinsic ef\ufb01ciency.\nRemark 3.2 (Comparison to MDR). REG is like MDR with an expanded model class. This class is\ncarefully chosen to guarantee intrinsic ef\ufb01ciency. In addition, as another corollary, we have proven\npartial intrinsic ef\ufb01ciency for MDR against DR (just \ufb01x \u03b6 = (0, 1) in (5)) where [6] only proved\nconsistency of MDR. However, neither MDR nor REG satis\ufb01es boundedness and stability.\nRemark 3.3 (SNREG). Replacing weights w by their self-normalized version w/En[w] in REG\nleads to SNREG. We explore this estimator in Appendix A and show it only gives 2-boundedness, does\nnot give stability, and limits REG\u2019s intrinsic ef\ufb01ciency to be only against SN(S)IS and SNDR.\n\nfor some \u03b6 and \u03c4.\n\n4\n\n\f3.2 EMP: Intrinsic Ef\ufb01ciency, Boundedness, and Stability\n\nWe next construct an estimator satisfying intrinsic ef\ufb01ciency as well as boundedness and stability. The\nkey idea is to use empirical likelihood to choose the parameters [29\u201331]. Empirical likelihood is a\nnonparametric MLE commonly used in statistics [21]. We consider the control variate m(x, a; \u03be; \u03c4 ) =\n\u03be + q(x, a; \u03c4 ) with parameters \u03be, \u03c4 and q(x, a; \u03c4 ) = t(x, a)(cid:62)\u03c4, where t(x, a) is a d\u03c4 -dimensional\nvector of linear independent basis functions not including a constant. 
Then, an estimator for \u03b2 is\nde\ufb01ned as\n\n(cid:2)\u02c6c\u22121\u02c6\u03ba(x, a)\u03c0e(a|x)r(cid:3) , where\n\n\u02c6\u03b2emp = En\n\u02c6\u03ba(x, a) = {\u03c0b(a|x)[1 + F(m(x, a; \u02c6\u03be, \u02c6\u03c4 ))]}\u22121, \u02c6c = En\n\n(cid:104){1 + F(m(x, a; \u02c6\u03be, \u02c6\u03c4 ))}\u22121(cid:105)\n\n,\n\n(6)\n\n\u02c6\u03be, \u02c6\u03c4 = arg max\n\u03be\u2208R,\u03c4\u2208\u0398\u03c4\n\nEn[log{1 + F(m(x, a; \u03be, \u03c4 ))}].\n\nn(cid:88)\n\nn(cid:88)\n\nn(cid:88)\n\nThis is motivated by solving the dual problem of the following optimization problem formulated by\nthe empirical likelihood:\n\nmax\n\n\u03ba\n\nlog \u03ba(i), s.t.\n\ni=1\n\ni=1\n\ni=1\n\n\u03ba(i)\u03c0b(a(i)|x(i)) = 1,\n\n\u03ba(i)\u03c0b(a(i)|x(i))F(m(x(i), a(i); \u03be, \u03c4 )) = 0.\n\nThe objective in an optimization problem (6) is a convex function; therefore, it is easy to solve. Then,\nthe estimator \u02c6\u03b2emp has all the desirable \ufb01nite-sample and asymptotic properties.\nLemma 3.1. The estimator \u02c6\u03b2emp satis\ufb01es 1-boundedness and stability.\nTheorem 3.2. The estimator \u02c6\u03b2emp has local and intrinsic ef\ufb01ciency, and\n\n(cid:104){wr \u2212 F(\u03b6 + q(x, a; \u03c4 ))}2 \u2212 \u03b2\u22172(cid:105)\n\n.\n\n(7)\n\nAsmse[ \u02c6\u03b2emp] = n\u22121 min\n\n\u03b6\u2208R,\u03c4\u2208Rd\u03c4\n\nE\n\nHere, we have assumed the model is linear in \u03c4. Without this assumption, Theorem 3.2 may not\nhold. In the following section, we consider how to relax this assumption while maintaining local and\nintrinsic ef\ufb01ciency.\n3.3 Practical REG and EMP\n\nWhile REG and EMP have desirable theoretical properties, both have some practical issues. First, for\nREG, the optimization problem in (5) may be non-convex if q(x, a; \u03c4 ) is not linear in \u03c4, as is the case\nin our experiment in Sec. 5.1 where we use a logistic model with 216 parameters. (The same issue\nexists for MDR.) 
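Since the maximization in (6) is a smooth problem with no spurious optima once q is fixed (the log of an affine function is concave, so its maximization is equivalent to the convex minimization mentioned above), a damped Newton ascent suffices. Below is an illustrative sketch, not the paper's code; `q_taken` holds q(x, a) at the logged action and `q_bar` holds Σ_a q(x, a) πe(a|x), both names ours:

```python
import numpy as np

def emp_estimate(w, r, q_taken, q_bar, n_iter=100):
    # Empirical-likelihood OPE for the CB case with control variate
    # m = xi1 + xi2 * q, so that F(m) = h @ xi with
    # h1 = w - 1 and h2 = w * q(x, a) - sum_a q(x, a) * pi_e(a|x).
    h = np.stack([w - 1.0, w * q_taken - q_bar], axis=1)
    xi = np.zeros(2)
    for _ in range(n_iter):  # damped Newton ascent on the dual objective
        s = 1.0 + h @ xi
        grad = np.mean(h / s[:, None], axis=0)
        hess = -(h / s[:, None] ** 2).T @ h / len(w)
        step = np.linalg.solve(hess, grad)
        t = 1.0
        while np.any(1.0 + h @ (xi - t * step) <= 1e-8):
            t *= 0.5  # stay inside the region where the log is defined
        xi = xi - t * step
        if np.linalg.norm(grad) < 1e-12:
            break
    s = 1.0 + h @ xi
    # kappa-weighted estimate with the normalization c-hat, as in (6)
    return np.mean(w * r / s) / np.mean(1.0 / s)
```

A useful check of both the optimizer and the estimator: the first-order condition in xi1 forces En[(w − 1)/s] = 0 at the optimum, so a constant reward c is estimated as exactly c.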
Similarly, EMP estimator has the problem that there is no theoretical guarantee for\nintrinsic ef\ufb01ciency when q(x, a; \u03c4 ) is not linear in \u03c4. Therefore, we suggest the following uni\ufb01ed\npractical approach to selecting \u03c4 in a way that maintains the desired properties.\nFirst, we estimate a parameter \u03c4 in q(x, a; \u03c4 ) as in DM to obtain \u02c6\u03c4, which we assume has a limit,\np\u2192 \u03c4\u2020 . Then, we consider solving the following optimization problems instead of (5) and (6) for\n\u02c6\u03c4\nREG and EMP, respectively\n\n(cid:104){wr \u2212 F(m(x, a; \u03b6, \u02c6\u03c4 ))}2(cid:105)\n\n, \u02c6\u03be = arg max\n\n\u03be\u2208R2\n\nEn[log{1 + F(m(x, a; \u03be, \u02c6\u03c4 ))}],\n\n\u02c6\u03b6 = arg min\n\n\u03b6\u2208R2\n\nEn\n\nwhere m(x, a; \u03b6, \u02c6\u03c4 ) = \u03b61 + \u03b62q(x, a; \u02c6\u03c4 ) or m(x, a; \u03be, \u02c6\u03c4 ) = \u03be1 + \u03be2q(x, a; \u02c6\u03c4 ). This is a convex\noptimization problem with two dimensional parameters; thus, it is easy to solve.\nHere, the asymptotic MSE of practical \u02c6\u03b2reg and \u02c6\u03b2emp are as follows.\nTheorem 3.3. The above plug-in-\u03c4 versions of \u02c6\u03b2reg and \u02c6\u03b2emp still satisfy local and intrinsic ef\ufb01ciency,\nand \u02c6\u03b2emp satis\ufb01es 1-boundedness and partial stability. Their asymptotic MSEs are\n\n(cid:104)(cid:8)wr \u2212 F(\u03b61 + \u03b62q(x, a; \u03c4\u2020))(cid:9)2 \u2212 \u03b2\u22172(cid:105)\n\n.\n\n(8)\n\nn\u22121 min\n\u03b6\u2208R2\n\nE\n\nAs a simple extension, we may consider multiple models for the Q-function. E.g, we can have two\nmodels q1(x, a; \u03c41) and q2(x, a; \u03c41) and let m(x, a; \u03b6, \u02c6\u03c4 ) = \u03b61 + \u03b62q1(x, a; \u02c6\u03c41) + \u03b63q2(x, a; \u02c6\u03c42). 
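With τ̂ plugged in, minimizing En[{wr − F(m(x, a; ζ, τ̂))}²] over ζ is plain linear least squares, because F is linear in ζ. A sketch for a single model q̂ (our names, not the paper's code; `q_bar` again denotes Σ_a q̂(x, a) πe(a|x)):

```python
import numpy as np

def reg_estimate(w, r, q_taken, q_bar):
    # Practical REG: with q-hat fixed, F(zeta1 + zeta2 * q) = h @ zeta is
    # linear in zeta, so minimizing the empirical second moment of
    # (w * r - F) is ordinary least squares without an intercept.
    h = np.stack([w - 1.0, w * q_taken - q_bar], axis=1)  # columns F(1), F(q-hat)
    target = w * r
    zeta, *_ = np.linalg.lstsq(h, target, rcond=None)
    # beta_d(m) = En[w * r - F(m)] evaluated at the fitted zeta
    return np.mean(target - h @ zeta)
```

By the least-squares optimality of ζ̂, the empirical second moment of the residual wr − F(m) can never exceed that of wr itself, mirroring the guarantee that REG is no worse than IS.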
Our\nresults easily extend to provide intrinsic ef\ufb01ciency with respect to DR using any of these models.\n\n5\n\n\f4 REG and EMP for Reinforcement learning\n\nWe next present how REG and EMP extend to the RL setting. Some complications arise because of\nthe multi-step horizon. For example, IS and SIS are different as opposed to the case T = 1.\n\n4.1 REG for RL\nWe consider an extension of REG to a RL setting. First, we derive the variance of \u02c6\u03b2d({mt}T\u22121\nt=0 ).\nTheorem 4.1. The variance of \u02c6\u03b2d({mt}T\u22121\n\nt=0 ) is n\u22121E[v({mt}T\u22121\n\n(cid:40)\n\n\u03c9t:tmt(xt, at) \u2212(cid:88)\n\nt=0 )], where v({mt}T\u22121\nt=0 ) is\nmt(xt, a)\u03c0e(a|xt)\n\n(cid:41)\n\n\u03b3k\u2212t\u03c9t:krk\u2212t|Ht] \u2212\n\n(cid:32)\n\nT\u22121(cid:88)\n\nT\u22121(cid:88)\n\n\u03b32t\u03c92\n\n0:t\u22121var\n\nE[\n\nt=0\n\nk=t\n\na\u2208A\n\n(cid:33)\n\n.\n\n|Ht\u22121\n\n(9)\n\nTo derive REG, we consider the class of estimators \u02c6\u03b2d({mt}T\u22121\n\u03b61t + \u03b62tq(xt, at; \u02c6\u03c4 ) for all 0 \u2264 t \u2264 T \u2212 1. Then, we de\ufb01ne an estimator \u02c6\u03b6 and the optimal \u03b6\u2217 as\n\nt=0 ) where mt is mt(xt, at; \u03b6) =\n\nE[v({mt(xt, at; \u03b6)}T\u22121\n\nt=0 )].\n\n(10)\n\n\u02c6\u03b6 = arg min\n\n\u03b6\u2208R2\n\nEn[v({mt(xt, at; \u03b6)}T\u22121\n\n\u03b6\u2217 = arg min\n\u03b6\u2208R2\nreg = \u02c6\u03b2d({\u02c6\u03b61t + \u02c6\u03b62tq(x, a; \u02c6\u03c4 )}T\u22121\n\nt=0 )],\n\nREG is then de\ufb01ned as \u02c6\u03b2T\u22121\nt=0 ), where following our discussion in\nSection 3.3, \u02c6\u03c4 is given by \ufb01tting as in DM/DR. Theoretically, we could also choose \u03c4 to minimize\neq. (9), but that can be computationally intractable.\nA similar argument to that in Section 3.1 shows that a data-driven parameter choice induces no\nin\ufb02ation in asymptotic MSE. 
Therefore, the asymptotic MSE of the estimator \u02c6\u03b2reg is minimized\namong the class of estimators \u02c6\u03b2d({\u03b61t + \u03b62tq(xt, at; \u02c6\u03c4 )}T\u22121\nt=0 )). This implies that the asymptotic\nMSE of \u02c6\u03b2reg is smaller than \u02c6\u03b2sis and \u02c6\u03b2dr because \u02c6\u03b2sis corresponds to the case \u03b6t = (0, 0) and \u02c6\u03b2dr\ncorresponds to the case \u03b6t = (0, 1). In addition, we can prove that the estimator \u02c6\u03b2T\u22121\nis more\nef\ufb01cient than \u02c6\u03b2snsis. To prove this, we introduce the following lemma.\nLemma 4.1.\nAsmse[ \u02c6\u03b2snis] = n\u22121\n\n\u03b3k\u2212t\u03c9t+1:krk\u2212t|Ht\n\n(cid:34)T\u22121(cid:88)\n\nT\u22121(cid:88)\n\n(cid:33)(cid:35)\n\n|Ht\u22121\n\n\u2212 \u03b2\u2217\n\n(cid:33)\n\n\u03b32t\u03c92\n\n0:t\u22121var\n\n\u03c9t:t\n\nE\n\n(cid:32)\n\n(cid:32)\n\n(cid:34)\n\n(cid:35)\n\nreg\n\n,\n\nt\n\nk=t\n\nt = E[\u03c90:trt].\n\nt , 0) in eq. (9) recovers the above. This suggests the following theorem.\n\nwhere \u03b2\u2217\nWe note that setting \u03b6t = (\u03b2\u2217\nTheorem 4.2. The estimator \u02c6\u03b2T\u22121\nRemark 4.1. Practically, when the horizon is long, there may be too many parameters to optimize,\nwhich can causes over\ufb01tting. That is, although there is no in\ufb02ation in MSE asymptotically, there\nmay be issues in \ufb01nite samples. To avoid this problem, some constraint or regularization should be\nreg (0 \u2264 k \u2264 T \u2212 1) given by\nimposed on the parameters. 
Here we will consider the estimator \u02c6\u03b2k\n\u02c6\u03b2d({mt(xt, at; \u02c6\u03b6)}T\u22121\n\nt=0 ) for the constrained control variates:\n\nis locally and intrinsically ef\ufb01cient.\n\nreg\n\nE\n\nt=0\n\n(cid:26)\u03b6t1 + \u03b6t2q(xt, at; \u02c6\u03c4 ) (0 \u2264 t < k),\n\n\u03b6k1 + \u03b6k2q(xt, at; \u02c6\u03c4 ) (k \u2264 t \u2264 T \u2212 1).\n\nmt(xt, at; \u03b6) =\n\nThe estimator \u02c6\u03b2T\u22121\nguarantees of \u02c6\u03b2k\n\nreg corresponds to the originally introduced estimator. We can also obtain theoretical\nreg for k (cid:54)= T \u2212 1. For details, see Appendix C.\n\n4.2 EMP for RL\n\nFirst, we de\ufb01ne a control variate:\n\n(cid:32)\n\nT\u22121(cid:88)\n\ng(Dx,a; \u03be, \u02c6\u03c4 ) =\n\n\u03b3t\n\n\u03c90:tmt(xt, at; \u03be, \u02c6\u03c4 ) \u2212 \u03c90:t\u22121\n\nt=0\n\n6\n\n(cid:41)(cid:33)\n\nmt(xt, a; \u03be, \u02c6\u03c4 )\u03c0e(a|xt)\n\n.\n\n(cid:40)(cid:88)\n\na\u2208A\n\n\fTable 2: SatImage (RMSE\u00d71000 )\n\nDM2\n12.2\n30.5\n71.7\n\nIS\n6.7\n12.0\n26.0\n\nSNIS\n4.0\n5.6\n12.7\n\nDR\n3.0\n5.0\n18.0\n\nMDR\n3.8\n5.3\n14.4\n\nTable 3: Pageblock (RMSE\u00d71000 )\n\nDM2\n2.6\n5.6\n16.0\n\nIS\n8.5\n13.4\n27.2\n\nSNIS\n3.4\n4.0\n6.5\n\nDR\n1.4\n2.7\n7.2\n\nTable 4: PenDigits (RMSE\u00d71000 )\n\nDM2\n8.2\n17.4\n56.0\n\nIS\n6.1\n10.7\n29.6\n\nSNIS\n2.8\n3.9\n9.9\n\nDR\n1.5\n2.2\n11.1\n\nMDR\n2.3\n3.4\n6.4\n\nMDR\n2.2\n3.4\n9.4\n\nDM1\n18.1\n49.2\n128.6\n\nDM1\n21.8\n32.4\n62.0\n\nDM1\n8.1\n19.4\n58.6\n\nREG\n2.8\n4.4\n13.6\n\nREG\n1.5\n2.5\n4.9\n\nREG\n1.4\n2.1\n9.4\n\nEMP\n2.8\n4.4\n13.7\n\nEMP\n1.4\n2.4\n4.9\n\nEMP\n1.4\n2.0\n9.5\n\nBehavior policy\n\n0.7\u03c0d + 0.3\u03c0u\n0.4\u03c0d + 0.6\u03c0u\n0.0\u03c0d + 1.0\u03c0u\n\nBehavior policy\n\n0.7\u03c0d + 0.3\u03c0u\n0.4\u03c0d + 0.6\u03c0u\n0.0\u03c0d + 1.0\u03c0u\n\nBehavior policy\n\n0.7\u03c0d + 0.3\u03c0u\n0.4\u03c0d + 0.6\u03c0u\n0.0\u03c0d + 1.0\u03c0u\n\nBy setting mt(xt, at; \u03be, \u02c6\u03c4 ) = \u03be1t + \u03be2tq(xt, at; \u02c6\u03c4 ), 
we define ξ̂ as

ξ̂(τ̂) = argmax_{ξ∈R²} En[log{1 + g(Dx,a; ξ, τ̂)}].

Then, the estimator β̂^{T-1}_emp is defined as

β̂^{T-1}_emp = En[ ĉ^{-1} (Σ_{t=0}^{T-1} ω_{0:t} γ^t r_t) / (1 + g(Dx,a; ξ̂, τ̂)) ],   ĉ = En[ 1 / (1 + g(Dx,a; ξ̂, τ̂)) ].

This estimator has the same efficiencies as β̂^{T-1}_reg because their asymptotic MSEs coincide. Importantly, the estimator β̂^{T-1}_emp also satisfies 1-boundedness and stability.

Theorem 4.3. The asymptotic MSE of the estimator β̂^{T-1}_emp is the same as that of β̂^{T-1}_reg. Hence, it is also locally and intrinsically efficient. It also satisfies 1-boundedness and stability.

5 Experiments

5.1 Contextual Bandit

We evaluate the OPE algorithms using standard classification data sets from the UCI repository. Here, we follow the same procedure for transforming a classification data set into a contextual bandit data set as in [5, 6]. Additional details of the experimental setup are given in Appendix E.
We first split the data into training and evaluation sets. We make a deterministic policy πd by training a logistic regression classifier on the training data set. Then, we construct evaluation and behavior policies as mixtures of πd and the uniform random policy πu. The evaluation policy πe is fixed at 0.9πd + 0.1πu. Three different behavior policies are investigated by varying the mixture parameter.
Here, we compare the (practical) REG and EMP with DM, SIS, SNIS, DR, and MDR on the evaluation data set. First, two Q-functions q̂1(x, a), q̂2(x, a) are constructed by fitting a logistic regression in two ways, with an ℓ1 or ℓ2 regularization term.
We refer to them as DM1 and DM2. Then, in DR, we use the mixture of Q-functions 0.5q̂1 + 0.5q̂2 as m(x, a). For MDR, we use a logistic function as m(x, a) and we use SGD to solve the resulting non-convex, high-dimensional optimization (e.g., for SatImage we have 6 (number of actions) × 36 (number of covariates) parameters). We use m(x, a; ζ) = ζᵀ(1, q̂1, q̂2) in REG and m(x, a; ξ) = ξᵀ(1, q̂1, q̂2) in EMP.
The resulting estimation RMSEs (root mean squared errors) over 200 replications of each experiment are given in Tables 2–4, where we highlight in bold the best two methods in each case. We first find that REG and EMP generally have the best overall performance. Second, we see that this arises because they achieve similar RMSE to SNIS when SNIS performs well and similar RMSE to (M)DR when (M)DR performs well, thanks to the intrinsic efficiency property.

Table 5: Windy GridWorld (RMSE)

Size | DM | SIS | SNSIS | DR | MDR | REG | EMP
250 | 2.9 | 0.64 | 0.49 | 0.17 | 0.28 | 0.09 | 0.09
500 | 2.8 | 0.53 | 0.34 | 0.11 | 0.21 | 0.06 | 0.06
750 | 2.6 | 0.39 | 0.29 | 0.09 | 0.14 | 0.05 | 0.05

Table 6: Cliff Walking (RMSE)

Size | DM | SIS | SNSIS | DR | MDR | REG | EMP
1000 | 7.7 | 3.6 | 2.9 | 2.5 | 2.3 | 2.1 | 2.1
2000 | 6.0 | 3.2 | 2.4 | 2.3 | 2.2 | 1.6 | 1.5
3000 | 6.8 | 3.1 | 2.2 | 2.2 | 2.0 | 1.2 | 1.1

Table 7: Mountain Car (RMSE)

Size | DM | SIS | SNSIS | DR | MDR | REG | EMP
1000 | 9.8 | 4.2 | 3.7 | 1.9 | 1.9 | 1.7 | 1.7
2000 | 10.6 | 3.3 | 2.9 | 1.6 | 1.6 | 1.2 | 1.2
3000 | 8.2 | 2.4 | 1.8 | 1.4 | 1.5 | 1.0 | 1.0
Whereas REG's and EMP's intrinsic efficiency is visible, MDR still often does slightly worse than DR despite its partial intrinsic efficiency, which can be attributed to optimizing too many parameters, leading to overfitting at the sample sizes studied.

5.2 Reinforcement Learning

We next compare the OPE algorithms in three standard RL settings from OpenAI Gym [3]: Windy GridWorld, Cliff Walking, and Mountain Car. For further detail on each, see Appendix E. We again split the data into training and evaluation sets. In each setting we consider varying evaluation dataset sizes. In each setting, a policy πd is computed as the optimal policy based on the training data using Q-learning. The evaluation policy πe is then set to be (1 − α)πd + απu, where α = 0.1. The behavior policy is defined similarly with α = 0.2 for Windy GridWorld and Cliff Walking and with α = 0.15 for Mountain Car. We set the discounting factor to 1.0 as in [6].
We compare the (practical) REG and EMP with k = 2 with DM, SIS, SNSIS, DR, and MDR on the evaluation data set generated by the behavior policy. A Q-function model is constructed using off-policy TD learning [27]. This is used in DM, DR, REG, and EMP. For MDR, we use a linear function for m(x, a) in order to enable tractable optimization given the many parameters due to long horizons.
We report the resulting estimation RMSEs over 200 replications of each experiment in Tables 5–7. We find that the modest benefits we gained in one time step in the CB setting translate to significant outright benefits in the longer-horizon RL setting. REG and EMP consistently outperform the other methods. Their RMSEs are indistinguishable except for one setting where EMP has slightly better RMSE.
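The policy construction used in these experiments amounts to an α-soft mixture of a greedy policy with the uniform policy; a minimal sketch (our names, not the paper's code):

```python
import numpy as np

def mix_policy(greedy_actions, n_actions, alpha):
    # pi(a|x) = (1 - alpha) * pi_d(a|x) + alpha * pi_u(a|x), where pi_d is
    # deterministic (given by greedy_actions, one action index per state)
    # and pi_u is uniform over the n_actions actions.
    n_states = len(greedy_actions)
    pi = np.full((n_states, n_actions), alpha / n_actions)
    pi[np.arange(n_states), greedy_actions] += 1.0 - alpha
    return pi
```

Each row is a valid distribution, and the greedy action receives probability (1 − α) + α / |A|, so evaluation and behavior policies differ only through α.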
These results highlight how the theoretical properties of intrinsic efficiency, stability, and
boundedness can translate to improved performance in practice.

6 Conclusion and Discussion

We studied various desirable properties for OPE in CB and RL. Finding that no existing estimator
satisfies all of them, we proposed two new estimators, REG and EMP, that satisfy consistency,
local efficiency, intrinsic efficiency, 1-boundedness, and stability. These theoretical properties also
translated to improved comparative performance in a variety of CB and RL experiments.
In practice, there may be additional modifications that can further improve these estimators. For
example, [32, 35] propose hybrid estimators that blend or switch to DM when importance weights are
very large. This reportedly works very well in practice but may make the estimator inconsistent under
misspecification unless the blending vanishes with n. In this paper, we focused on consistent estimators.
Moreover, such hybrid estimators do not satisfy intrinsic efficiency, 1-boundedness, or stability. Achieving these properties
with blending estimators remains an important next step.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No.
1846210.

References

[1] A. Bennett and N. Kallus. Policy evaluation with latent confounders via optimal balance. In
Advances in Neural Information Processing Systems, 2019.

[2] C. G. Bowsher and P. S. Swain. Identifying sources of variation and the flow of information in
biochemical networks. Proceedings of the National Academy of Sciences, 109, 2012.

[3] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba.
OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

[4] W. Cao, A. A. Tsiatis, and M. Davidian. Improving efficiency and robustness of the doubly
robust estimator for a population mean with incomplete data. 
Biometrika, 96:723–734, 2009.

[5] M. Dudík, D. Erhan, J. Langford, and L. Li. Doubly robust policy evaluation and optimization.
Statistical Science, 29:485–511, 2014.

[6] M. Farajtabar, Y. Chow, and M. Ghavamzadeh. More robust doubly robust off-policy evaluation.
In Proceedings of the 35th International Conference on Machine Learning, pages 1447–1456,
2018.

[7] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In
Proceedings of the 33rd International Conference on Machine Learning, pages 652–661, 2016.

[8] N. Kallus. Balanced policy evaluation and learning. In Advances in Neural Information
Processing Systems, pages 8895–8906, 2018.

[9] N. Kallus. Discussion: "entropy learning for dynamic treatment regimes". Statistica Sinica, 29
(4):1697–1705, 2019.

[10] N. Kallus and M. Uehara. Double reinforcement learning for efficient off-policy evaluation in
Markov decision processes. arXiv preprint arXiv:1908.08526, 2019.

[11] N. Kallus and M. Uehara. Efficiently breaking the curse of horizon: Double reinforcement
learning in infinite-horizon processes. arXiv preprint arXiv:1909.05850, 2019.

[12] J. D. Y. Kang and J. L. Schafer. Demystifying double robustness: A comparison of alternative
strategies for estimating a population mean from incomplete data. Statistical Science, 22:
523–539, 2007.

[13] L. Li, R. Munos, and C. Szepesvari. Toward minimax off-policy value estimation. In Proceedings
of the 18th International Conference on Artificial Intelligence and Statistics, pages 608–616,
2015.

[14] Q. Liu, L. Li, Z. Tang, and D. Zhou. Breaking the curse of horizon: Infinite-horizon off-policy
estimation. In Advances in Neural Information Processing Systems 31, pages 5356–5366, 2018.

[15] A. R. Mahmood, H. P. van Hasselt, and R. S. Sutton. 
Weighted importance sampling for
off-policy learning with linear function approximation. In Advances in Neural Information
Processing Systems 27, pages 3014–3022, 2014.

[16] T. Mandel, Y. Liu, S. Levine, E. Brunskill, and Z. Popovic. Off-policy evaluation across
representations with applications to educational games. In Proceedings of the 13th International
Conference on Autonomous Agents and Multi-agent Systems, pages 1077–1084, 2014.

[17] R. Munos, T. Stepleton, A. Harutyunyan, and M. Bellemare. Safe and efficient off-policy
reinforcement learning. In Advances in Neural Information Processing Systems 29, pages
1054–1062, 2016.

[18] S. A. Murphy. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 65:331–355, 2003.

[19] Y. Narita, S. Yasui, and K. Yata. Efficient counterfactual learning from bandit feedback. AAAI,
2019.

[20] W. K. Newey and D. L. McFadden. Large sample estimation and hypothesis testing. Handbook
of Econometrics, IV:2113–2245, 1994.

[21] A. Owen. Empirical Likelihood. Monographs on Statistics and Applied Probability 92.
Chapman & Hall/CRC, 2001.

[22] A. Owen and Y. Zhou. Safe and effective importance sampling. Journal of the American
Statistical Association, 95(449):135–143, 2000.

[23] D. Precup, R. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In
Proceedings of the 17th International Conference on Machine Learning, pages 759–766, 2000.

[24] J. Robins, M. Sued, Q. Lei-Gomez, and A. Rotnitzky. Comment: Performance of double-robust
estimators when "inverse probability" weights are highly variable. Statistical Science, 22:
544–559, 2007.

[25] J. M. Robins, A. Rotnitzky, and L. P. Zhao. Estimation of regression coefficients when some
regressors are not always observed. 
Journal of the American Statistical Association, 89:846–866,
1994.

[26] D. B. Rubin and M. J. van der Laan. Empirical efficiency maximization: Improved locally
efficient covariate adjustment in randomized experiments and survival analysis. International
Journal of Biostatistics, 4:Article 5, 2008.

[27] R. S. Sutton. Reinforcement learning: An introduction. MIT Press, Cambridge, MA, 2018.

[28] A. Swaminathan and T. Joachims. The self-normalized estimator for counterfactual learning. In
Advances in Neural Information Processing Systems 28, pages 3231–3239, 2015.

[29] Z. Tan. On a likelihood approach for Monte Carlo integration. Journal of the American Statistical
Association, 99:1027–1036, 2004.

[30] Z. Tan. A distributional approach for causal inference using propensity scores. Journal of the
American Statistical Association, 101:1619–1637, 2006.

[31] Z. Tan. Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika,
97:661–682, 2010.

[32] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement
learning. In Proceedings of the 33rd International Conference on Machine Learning, pages
2139–2148, 2016.

[33] A. Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.

[34] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, UK, 1998.

[35] Y.-X. Wang, A. Agarwal, and M. Dudik. Optimal and adaptive off-policy evaluation in contextual
bandits. In Proceedings of the 34th International Conference on Machine Learning, pages
3589–3597, 2017.