{"title": "Surrogate Objectives for Batch Policy Optimization in One-step Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 8827, "page_last": 8837, "abstract": "We investigate batch policy optimization for cost-sensitive classification and contextual bandits---two related tasks that obviate exploration but require generalizing from observed rewards to action selections in unseen contexts. When rewards are fully observed, we show that the expected reward objective exhibits suboptimal plateaus and exponentially many local optima in the worst case. To overcome the poor landscape, we develop a convex surrogate that is calibrated with respect to entropy regularized expected reward. We then consider the partially observed case, where rewards are recorded for only a subset of actions. Here we generalize the surrogate to partially observed data, and uncover novel objectives for batch contextual bandit training. We find that surrogate objectives remain provably sound in this setting and empirically demonstrate state-of-the-art performance.", "full_text": "Surrogate Objectives for Batch Policy Optimization\n\nin One-step Decision Making\n\nMinmin Chen\u2217 Ramki Gummadi\u2217 Chris Harris\u2217 Dale Schuurmans\u2217\u2020\n\u2020 University of Alberta\n\n\u2217Google\n\nAbstract\n\nWe investigate batch policy optimization for cost-sensitive classi\ufb01cation and con-\ntextual bandits\u2014two related tasks that obviate exploration but require generalizing\nfrom observed rewards to action selections in unseen contexts. When rewards are\nfully observed, we show that the expected reward objective exhibits suboptimal\nplateaus and exponentially many local optima in the worst case. To overcome\nthe poor landscape, we develop a convex surrogate that is calibrated with respect\nto entropy regularized expected reward. We then consider the partially observed\ncase, where rewards are recorded for only a subset of actions. 
Here we generalize\nthe surrogate to partially observed data, and uncover novel objectives for batch\ncontextual bandit training. We \ufb01nd that surrogate objectives remain provably sound\nin this setting and empirically demonstrate state-of-the-art performance.\n\n1\n\nIntroduction\n\nCost-sensitive classi\ufb01cation [1] and batch contextual bandits [34\u201336] are two problems that share the\ngoal of inferring, given a batch of training data, a policy that chooses high reward actions in potentially\nunseen contexts. The problems differ in the assumed completeness of the data: in cost-sensitive\nclassi\ufb01cation, rewards are given (or inferable [8]) for every action, whereas in batch contextual\nbandits, rewards are only observed for a small subset of actions (typically one). The batch contextual\nbandit problem is more prevalent in practice, since massive data logs routinely record contexts\nencountered, actions taken in response, and the outcomes that resulted [18]. Rarely, if ever, are\ncounterfactual outcomes recorded for actions that might have been taken instead [3]. Nevertheless, we\n\ufb01nd it helpful to reconsider cost-sensitive classi\ufb01cation, since a core learning challenge is orthogonal\nto reward incompleteness: both tasks create dif\ufb01cult optimization landscapes.\nThere is an extensive literature on cost-sensitive classi\ufb01cation. Problems with two actions have been\nparticularly well studied [5, 8], and subsequent work has sought to reduce multiple-action learning\nto learning binary decisions [1, 19]. A reduction strategy has also been used to convert simply\ntrained stochastic policies to cost-sensitive variants via post-processing [24]. 
Unfortunately, such\nreductions do not compose well with current policy learning methods, which are gradient based and\nbest formulated as optimizing a single policy model over a well formed optimization objective.\nIn this paper we investigate cost-sensitive classi\ufb01cation with stochastic policy representations, to\nensure the developments are compatible with current deep learning methods. Our \ufb01rst result is\nnegative: for natural policy representations, the expected reward objective generates a poor opti-\nmization landscape that exhibits plateaus and potentially an exponential number of local maxima.\nIn response, we develop surrogate objectives for training [28]. Supervised learning research has\nobserved that solution quality can be ensured by using surrogates that satisfy \u201ccalibration\u201d with\nrespect to a dif\ufb01cult to optimize target loss [32, 37, 42]. This idea has also recently been applied to\ncost-sensitive classi\ufb01cation [26]. We extend this approach to stochastic policies and deep models by\nconsidering expected reward augmented with entropy regularization. This allows a convex surrogate\nto be developed that improves trainability while approximating expected cost to controllable accuracy.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWe then consider batch contextual bandits, where rewards are observed only for a subset (typically\none) of the available actions in each training context. Current work has focused on direct maximization\nof expected reward, using importance correction to provide unbiased (or nearly unbiased) estimates of\ntarget gradients [14, 16, 41, 43]. Unfortunately, as we illustrate, such an objective creates an extremely\ndif\ufb01cult optimization landscape, even if variance can be reduced to zero [6, 9]. Alternatively, we\nextend the calibrated surrogate to partially observed rewards, through the introduction of imputed\nestimates. 
We prove soundness of the approach and demonstrate empirical performance benefits.

2 Cost-sensitive Classification

We first consider cost-sensitive classification. For simplicity assume a finite set of actions A = {1, ..., K}, and thus we are given training data D = {(x_i, r_i)}_{i=1}^T, where r_i ∈ R^K is a vector that specifies the reward for each action in context x_i. The goal is to infer a mapping h : X → A that specifies a high reward action a ∈ A for a given context x ∈ X. Notation: We let Δ_K denote the K dimensional simplex, 1 the vector of all 1s, and 1_a the vector of 0s except for 1 in coordinate a.

Much of the literature on cost-sensitive classification has focused on deterministic classifiers h, but we consider stochastic policies π : X → Δ_K. Any deterministic classifier h can be equivalently expressed by π(x) = 1_{h(x)}. We seek a policy that maximizes expected reward, or equivalently minimizes expected cost. If we assume the data source is i.i.d. with a joint distribution p(x, r), the true risk of a policy π and its empirical risk on a data set D can be defined respectively by

R(π) = −E[π(x) · r]   and   R̂(π, D) = −(1/T) ∑_{(x_i,r_i)∈D} π(x_i) · r_i.   (1)

Since expected cost is the target, one might presume that directly minimizing empirical risk would be a reasonable approach; unfortunately, this proves problematic [13]. In practice, it is nearly universal to train an unconstrained model q : X → R^K that is converted to a policy via a “softmax” transfer; that is, policies are normally represented with the composition π(x) = f(q(x)), where the model output q(x) is converted to a probability vector via

f(q) = e^{q − F(q)}   with   F(q) = log(1 · e^q).   (2)

The true and empirical risk can then be re-expressed in terms of q by

R(f ∘ q) = −E[f(q(x)) · r]   and   R̂(f ∘ q, D) = −(1/T) ∑_{(x_i,r_i)∈D} f(q(x_i)) · r_i.   (3)

Unfortunately, the dot product r · f(q(x)) creates significant difficulty, as this interacts poorly with the softmax transfer f. A well known consequence is that the expected cost plateaus whenever the corresponding policy probabilities are nearly deterministic. A potentially greater challenge, however, is that the softmax transfer can also induce exponentially many local optima.

Theorem 1 Even for a single context x, a deterministic reward vector r, and a linear model q(x) = W φ(x), the function r · f(q(x)) can have a number of local maxima in W that is exponential in the number of actions K and the number of features in φ. (All proofs given in the appendix.¹)

It is therefore unsurprising that empirical risk minimization with stochastic policies is not considered viable in the cost-sensitive classification literature. Nevertheless, it remains the dominant approach for batch contextual bandits. We seek to bridge the apparent disconnect between these two settings.

2.1 Calibrated Strongly Convex Surrogate

A key idea in cost-sensitive classification has been the development of convex surrogate objectives that exhibit “calibration” with respect to the target risk [2, 32].
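For concreteness, the softmax transfer (2) and the empirical risk (3) can be sketched numerically as follows; this is a minimal illustration (function names, shapes, and toy values are our choices), not the paper's code:

```python
import numpy as np

def softmax_policy(q):
    """Softmax transfer (2): f(q) = exp(q - F(q)) with F(q) = log(1 . e^q),
    computed stably by subtracting the max logit per row."""
    z = q - q.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def empirical_risk(q_outputs, rewards):
    """Empirical risk (3): -(1/T) * sum_i f(q(x_i)) . r_i."""
    pi = softmax_policy(q_outputs)
    return -np.mean(np.sum(pi * rewards, axis=-1))

# Toy check: with rewards favouring action 2, logits concentrated on
# action 2 achieve lower (more negative) risk than logits on action 0.
r = np.array([[0.0, 0.0, 1.0]])
good = empirical_risk(np.array([[0.0, 0.0, 5.0]]), r)
bad = empirical_risk(np.array([[5.0, 0.0, 0.0]]), r)
```

Even this toy setting hints at the landscape problem: once the softmax saturates, moving the logits further changes the risk only negligibly.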
We require additional definitions. Let Q denote the set of measurable functions X → R^K, and define the minimum risk by R* = inf_{q∈Q} R(f ∘ q). Note that the minimum is generally achieved at a deterministic policy, which cannot be represented by q ∈ Q; however, the infimum can be arbitrarily well approximated within Q. It will be convenient to expand the risk definition through a notion of pointwise risk: define the local risk as R(π, r, x) = −r · π(x), which is related to the true risk via R(π) = E[R(π, r, x)], with the expectation taken over pairs (x, r) ∼ p(x, r). For each (x, r) define the minimal risk by

R*(r, x) = inf_{π∈P} R(π, r, x) = inf_{q∈Q} R(f ∘ q, r, x).   (4)

¹Appendix and code available at https://www.cs.ualberta.ca/~dale/neurips19/supplement

Consider a surrogate loss function L : (Q, R^K, X) → R and let L*(r, x) = inf_{q∈Q} L(q, r, x). We say that a surrogate L is calibrated with respect to the target risk R if there exists a calibration function δ(ε, x) ≥ 0 such that for all ε > 0, all x ∈ X, all r ∈ R^K and all q ∈ Q:

L(q, r, x) − L*(r, x) < δ(ε, x)   implies   R(f ∘ q, r, x) < R*(r, x) + ε.   (5)

Although calibrated convex surrogates have been developed for cost-sensitive classification [26], these do not consider stochastic policies. Rather than extending these constructions to stochastic policies, which is not straightforward, we develop a new surrogate for the stochastic case.
Consider an entropy regularized version of the target risk [25], which we call the smoothed risk:

S_τ(π, r, x) = −r · π(x) + τ π(x) · log π(x)   and   S_τ(π) = E[S_τ(π, r, x)].   (6)

The smoothed risk approximates the true risk, with a discrepancy that can be made arbitrarily small.

Proposition 2 Let π̃_τ = argmin_{π∈P} S_τ(π). Then π̃_τ(x) = exp(E[r|x]/τ − F(E[r|x]/τ)) and R(π̃_τ) < R* + τ log K. Hence for any ε > 0, setting τ < ε/log K ensures R(π̃_τ) < R* + ε.

Note that the smoothed risk is not convex in q due to the softmax transfer π(x) = f(q(x)). Nevertheless, it is possible to develop a convex surrogate that is calibrated for the smoothed risk as follows. First we need a few properties of Bregman divergences in general and the KL divergence in particular. The Bregman divergence D_F, specified by the convex differentiable potential F, satisfies:

D_F(q ‖ r) = F(q) − F(r) − f(r) · (q − r) = F(q) − q · p + F*(p) = D_{F*}(p ‖ π),   (7)

where f = ∇F, F*(p) is the convex conjugate of F, p = f(r) and π = f(q) [27]. Clearly, D_F is convex in its first argument q, but not necessarily in the second. For the KL divergence in particular we have F(q) = log(1 · e^q), f(q) = e^{q − F(q)}, F*(p) = p · log p, hence

D_KL(π ‖ p) = π · (log π − log p) = D_{F*}(π ‖ p) = D_F(r ‖ q).   (8)

This means that the local smoothed risk (6) can be shown to be equivalent to

S_τ(π, r, x) = −τ( (r/τ) · π(x) − π(x) · log π(x) ) = −τ F(r/τ) + τ D_F(r/τ ‖ q(x)).   (9)

Later, in Section 3, we will find it helpful to consider a shift v of the expected cost; i.e. R(π, r − v, x) = v − r · π(x), noting this does not affect the location of the minimizer in q. The above characterization then allows us to formulate a convex calibrated surrogate by reversing the divergence.

Theorem 3 For an arbitrary baseline v and τ > 0, let

L(q, r, x) = τ D_F( q(x) + v/τ ‖ r/τ ) + (τ/4) ‖ q(x) − (r − v)/τ ‖².   (10)

Then, for any fixed v, L is strongly convex in q and calibrated with respect to the smoothed (shifted) risk S_τ(f ∘ q, r − v, x) = S_τ(f ∘ q, r, x) − v, with calibration function δ(ε, x) = ε ∀x.

Therefore, any desired level of accuracy in minimizing empirical smoothed risk can be achieved by approximately minimizing the surrogate loss L to appropriate accuracy.

2.2 Experimental Evaluation

To first assess the overall approach, we evaluate how well optimizing the surrogate (10) minimizes true risk, using a separate test set for evaluation. As baselines, we compare to directly minimizing empirical risk R̂(π) (1), and the standard supervised objectives: log-likelihood, −E_p[log π], and squared error, ‖q(x) − (r−v)/τ‖². Empirically, we found it beneficial to relax (10) to a tunable combination between the components and empirical risk. We refer to such a tuned loss as “Composite” in all experimental results. Since the surrogate objective is a combination of the “reversed KL” objective D_{F*}(p ‖ π) (7) and the squared error, we also evaluate D_{F*}(p ‖ π) alone to isolate its effect.

MNIST We first consider MNIST data, training a fully connected model with one hidden layer of 512 ReLU units.
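As a concrete illustration, the tuned “Composite” loss just described might be sketched as follows; the weighting scheme, the τ/4 coefficient (our reading of (10)), and all names are illustrative assumptions rather than the released implementation:

```python
import numpy as np

def F(z):
    """F(q) = log(1 . e^q) (log-sum-exp), computed stably for a vector z."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def f(z):
    """Softmax transfer f = grad F."""
    e = np.exp(z - z.max())
    return e / e.sum()

def bregman_F(a, b):
    """D_F(a || b) = F(a) - F(b) - f(b) . (a - b); convex and >= 0 in a."""
    return F(a) - F(b) - f(b) @ (a - b)

def composite_loss(q, r, tau=1.0, v=0.0, w=(1.0, 1.0, 0.0)):
    """Tunable combination of the surrogate components of (10) and the
    empirical risk:
      w0 * tau * D_F(q + v/tau || r/tau)
    + w1 * (tau/4) * ||q - (r - v)/tau||^2
    + w2 * ( -f(q) . r ).
    The tau/4 coefficient and the weights are our assumptions."""
    w0, w1, w2 = w
    surrogate = (w0 * tau * bregman_F(q + v / tau, r / tau)
                 + w1 * (tau / 4.0) * np.sum((q - (r - v) / tau) ** 2))
    return surrogate - w2 * f(q) @ r
```

With w = (1, 1, 0) and v = 0 this reduces to the pure surrogate, which is nonnegative and attains zero at q = r/τ; raising w2 mixes the (nonconvex) empirical risk back in.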
The original training data was partitioned into the first 55K examples for training and the last 5K examples for validation. We use the validation data to select hyperparameters, including learning rate, mini-batch size, and combination weights (details in appendix). The policy was trained by minimizing each objective using SGD with momentum fixed at 0.9 [33] for 100 epochs.

[Figure 1 shows bar charts of train and test classification error (%) for each objective (−E_p[log π], R̂(π), ‖q − (r−v)/τ‖², D_{F*}(p‖π), Composite) on panels (a) MNIST and (b) CIFAR10.]
Figure 1: Training with full reward feedback across all actions (see appendix for additional results).

CIFAR-10 Next we considered the CIFAR-10 data set [15] and trained a Resnet-20 architecture [12], using the standard 50K training, 10K validation split. We set any unspecified model hyperparameters to the defaults for resnet in the open source tensor2tensor library [39] and tuned learning rate and the composite loss combination weights on validation data. All objectives were trained using the Momentum optimizer with cosine decay learning rate for 250 epochs (details in appendix).

The results in Figure 1 confirm that directly minimizing R̂(π) (1) is not always competitive: it yields the highest training error on both MNIST and CIFAR10, as well as poor test error.
For MNIST, it is\nstriking that generalization can still be improved with respect to the standard log-likelihood baseline.\nFor CIFAR-10, as shown in Figure 1b, minimizing the reverse KL DF \u2217 (p(cid:107)\u03c0) achieved 10.7% test\nerror, which signi\ufb01cantly improves directly optimizing empirical risk \u02c6R(\u03c0), which obtained 19.9%\ntest error. The reverse KL was competitive even against the baseline log-likelihood, which achieved\n10.4% test error. The results for squared error are worse than log-likelihood, while the DF \u2217 (p(cid:107)\u03c0)\nobjective performs better in both data sets. This suggests that the generalization improvements are\ncoming from better minimization of DF \u2217 (p(cid:107)\u03c0), while the squared error term is helping improve the\noptimization landscape. To further investigate whether \u02c6R(\u03c0) suffers from a dif\ufb01cult optimization\nlandscape, we ran a much longer training experiment (see appendix), \ufb01nding that every method\nexcept squared loss is eventually able to achieve about 6.5% test error, but at signi\ufb01cant cost.\n\n3 Batch Contextual Bandits\n\nWe now extend these developments to the contextual bandit case. To focus on the most challenging\nand practical scenario, we assume a single action has been observed in each context. Therefore, the\ntraining data consists of tuples D = {(xi, ai, ri, \u03b2i)}, where xi \u2208 X is a context, ai \u2208 {1, ..., K}\nis an action, ri \u2208 R is a reward, and \u03b2i is the proposal probability of ai. For simplicity, we\nassume a stationary behaviour (logging) policy \u03b2 : X \u2192 \u2206K was used to select the actions, hence\n\u03b2i = \u03b2(ai|xi). Although \u03b2 might not be known [20], estimating it from D has proved effective\n[41, 43, 4]. We continue to assume contexts and rewards are generated i.i.d. 
from a joint distribution p(x, r), but the distribution of rewards r(x) ∼ p(r|x) and actions a ∼ β(a|x) are conditionally independent given the context x [43]. Other more elaborate models for missing data are possible, but require committing to stronger assumptions about the data generation and behavior process [3, 21].

As before, the goal is to infer a policy π : X → Δ_K that maximizes expected reward. Here we define the true risk of a policy π as in (1), but the empirical risk, also defined in (1), is no longer directly observable because it requires rewards for all actions. The standard solution is to formulate an unbiased estimate of the full empirical risk (which is itself an unbiased estimate of the true risk), then use this as a policy optimization objective. In fact, the current literature is dominated by such an approach, where an unbiased (or nearly unbiased) estimate of the empirical risk (1) is first formulated via importance correction then used as a training objective [14, 16, 17, 29, 34–36]. Unfortunately, importance correction introduces significant variance in gradient estimates, even using standard variance reduction techniques. Also, as identified in Section 2, even if variance could be completely eliminated, the underlying optimization landscape presents difficulties.

3.1 Reward Estimation

Before focusing on policy optimization, we first need to address the problem of estimating rewards from incomplete data. We here adopt a simple approach of imputing missing values with a model q : X → R^K. That is, for a context x, observed action a and observed reward r_a, we estimate the full reward vector r by

r̂(x) = τ q(x) + 1_a λ(x, a)(r_a − τ q(x)_a),   (11)

with parameters λ(x, a) and τ.
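A minimal sketch of the imputation rule (11); the function name and the numeric values below are illustrative choices, not part of the paper:

```python
import numpy as np

def imputed_reward(q_x, a, r_a, tau, lam):
    """Estimate (11): r_hat(x) = tau*q(x) + 1_a * lambda(x,a) * (r_a - tau*q(x)_a).
    Only coordinate a is corrected toward the observed reward r_a; all other
    coordinates are filled in from the model logits q(x)."""
    r_hat = tau * np.asarray(q_x, dtype=float).copy()
    r_hat[a] += lam * (r_a - tau * q_x[a])
    return r_hat

# Example: K = 3 actions, observed action a = 1 with reward 1.0, and a
# behaviour propensity beta = 0.4 (all values illustrative).
q_x, a, r_a, beta = np.array([0.2, 0.5, 0.1]), 1, 1.0, 0.4
r_ips = imputed_reward(q_x, a, r_a, tau=0.0, lam=1.0 / beta)  # importance weighting
r_dr = imputed_reward(q_x, a, r_a, tau=1.0, lam=1.0 / beta)   # doubly-robust style
```

Different settings of τ and λ recover the standard estimators discussed next.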
This construction allows the local risk of any policy π to be estimated by R(π, r̂, x) = −π(x) · r̂(x). Although (11) seems simplistic, it is able to express most estimators in the literature by suitable choices of τ and λ(x, a). For example, choosing τ = 0 and λ(x, a) = β(a|x)⁻¹ yields importance weighting, R(π, r̂, x) = π(x)_a r_a / β(a|x); choosing τ = 1 and λ(x, a) = 0 yields the “direct method”, R(π, r̂, x) = π(x) · q(x); and choosing τ = 1 and λ(x, a) = β(a|x)⁻¹ yields the “doubly robust” estimate, R(π, r̂, x) = π(x) · q(x) + π(x)_a (r_a − q(x)_a) / β(a|x) [6]. The “switch” estimator [40] can be expressed for a given threshold θ > 0 by setting τ = 1 and λ(x, a) = 0 if π(x)_a/θ > β(a|x), otherwise τ = 0 and λ(x, a) = β(a|x)⁻¹. The “switch” estimator generalizes the trimmed importance estimator [3], and is argued in [40] to be superior to the “magic” estimator [38]. The “self-normalized” importance estimator [36] can also be expressed by setting τ = 0 and λ(x, a) = β(a|x)⁻¹ / ∑_{(x_i,a_i,r_i)∈D} π(x_i)_{a_i} β(a_i|x_i)⁻¹.

For any fixed q(x) and τ it is easy to show that λ(x, a) = β(a|x)⁻¹ implies E_{a∼β(·|x)}[r̂(x)] = r(x). A key question is the provenance of q. The standard approach is to recover q by regressing to the observed rewards, q = argmin_{q∈H} ∑_{(x_i,a_i,r_i)∈D} (r_i − q(x_i)_{a_i})², for a class of models H. Note that this is equivalent to conducting a policy evaluation step for the policy β. Stochastic contextual bandits are a restricted case of reinforcement learning where every policy has the same action value function, E[r(x)].
Hence, a single policy evaluation, yielding q, can in principle be used to evaluate any policy,\nsince evaluating \u03c0 instead of \u03b2 does not change action values but only introduces covariate shift.\n\n\u03bb(x, a) = \u03b2(a|x)\u22121/(cid:80)\nobserved rewards q = arg minq\u2208H(cid:80)\n\n3.2 Policy Optimization\n\nFor policy optimization, one could adopt the least squares estimate q and optimize a separate policy,\nbut if \u03c0 uses the same architecture the optimum is simply \u03c0 = f \u25e6 q (under S). In Section 2, we saw\nthat least squares estimation of q did not perform well, nor do we expect so here. We would like to\ngain the advantages realized in Section 2, but an actor-critic approach obviates policy optimization.\nInstead, to couple the value estimator to policy optimization, we consider a uni\ufb01ed approach where\nthe actor and the critic are the same model. That is, we use the policy transformation \u03c0 = f \u25e6 q from\nSection 2, but now explicitly treat the logits as action value estimates. A uni\ufb01ed actor-critic model has\nbeen considered previously [22]. In the partially observed case, we propose to replace the observed\nreward vector with the estimate \u02c6r derived from q, allowing any loss to be applied. Although such an\napproach seems naive, we \ufb01nd that maintaining this form of strict mutual consistency between the\nvalue estimates and policy, combined with the estimator \u02c6r and surrogate losses, leads to effective\nempirical performance. 
Moreover, we will find that this approach is theoretically justified.

3.3 Calibrated Surrogate

Given (x, a, r_a), define the optimal imputed local risk and the suboptimality gap respectively by

S*_τ(r̂, x) = inf_{q∈Q} S_τ(f ∘ q, r̂, x)   and   G_τ(π, r̂, x) = S_τ(π, r̂, x) − S*_τ(r̂, x).   (12)

Equality (9) can then be used to show the divergence D_F(r̂/τ ‖ q) characterizes the suboptimality gap:

Proposition 4 For any q, τ > 0 and observation (x, a, r_a): τ D_F( r̂(x)/τ ‖ q(x) ) = G_τ(f ∘ q, r̂, x).

If we consider the imputed form of the surrogate objective L(q, r̂, x) defined in Theorem 3, we then find that the surrogate remains calibrated for the imputed smoothed risk.

Theorem 5 For any model q, τ > 0, observation (x, a, r_a), and baseline v:

L(q, r̂, x) ≥ τ D_F( r̂(x)/τ ‖ q(x) + v/τ ) = G_τ(f ∘ q, r̂, x) ≥ 0.   (13)

Moreover, L is calibrated with respect to S_τ(f ∘ q, r̂ − v, x) with calibration function δ(ε, x) = ε.

This result suggests a simple algorithmic approach for policy optimization: given the data D = {(x_i, a_i, r_i, β_i)}, minimize the imputed empirical surrogate objective with respect to the model q:

min_{q∈Q} L̂(q, D)   where   L̂(q, D) = (1/T) ∑_{(x_i,a_i,r_i,β_i)∈D} L(q, r̂, x_i).   (14)

That is, we combine the estimate r̂ from Section 3.1, (11), with the surrogate L from Section 2, (10).

3.4 Analysis

The expected smoothed risk quantities we seek to control are defined by:

S_τ(π) = E[S_τ(π, r, x)],   S*_τ = inf_{q∈Q} S_τ(f ∘ q)   and   G_τ(π) = S_τ(π) − S*_τ.   (15)

For the purposes of analysis, we
assume training data consists of tuples drawn from (x, a, r_a) ∼ p(x, r) β(a|x), and that the estimate r̂ is unbiased, i.e., E[r̂|x] = E[r|x], using λ(x, a) = β(a|x)⁻¹. First, observe that, in expectation, the surrogate objective upper bounds the divergence in Theorem 5, which, in turn, by Jensen's inequality, bounds the suboptimality gap in the expected smoothed risk.

Theorem 6 For any model q, any r̂ such that E[r̂|x] = E[r|x], and any baseline v:

E[L(q, r̂, x)] ≥ E[ τ D_F( r̂(x)/τ ‖ q(x) + v/τ ) ] ≥ G_τ(f ∘ q) ≥ 0.   (16)

Therefore, minimizing (14), in expectation, minimizes the true smoothed risk (15). This result can be made stronger by observing that, under mild assumptions, the empirical divergence D̂_F(q, D) = (1/T) ∑_i D_F( r̂(x_i)/τ ‖ q(x_i) ) also concentrates to its expectation, uniformly over q ∈ H, for a well behaved model class H. In the appendix, we specify the conditions on H, β, and p(x, r) that, in addition to r̂ being unbiased (i.e. E[r̂|x] = E[r|x]), ensure finite sample concentration. We refer to a collection H, β, p(x, r) and r̂ that satisfies these conditions as “well behaved”.

Lemma 7 Assume H, β, p(x, r) and r̂ are “well behaved”. Then for any τ, δ > 0 there exists a constant C such that with probability at least 1 − δ:

E[ D_F( r̂(x)/τ ‖ q(x) ) ] ≤ D̂_F(q, D) + C/√T   ∀q ∈ H.   (17)

Combining Theorem 6 with Lemma 7 it can be shown that for finite sample size T, with high probability, the empirical surrogate (14) is approximately calibrated with respect to smoothed risk.

Theorem 8 Assume H, β, p(x, r) and r̂ are “well behaved”. Then for any v and τ, δ > 0, there exists a C such that with probability at least 1 − δ: if L̂(q, D) < τC/√T for q ∈ H then G_τ(f ∘ q) ≤ 2τC/√T.

That is, if L̂(q, D) can be sufficiently minimized within H, the suboptimality gap achieved by q will be near-optimal with high probability, with a bound diminishing to zero for large sample size.

3.5 Discussion

If we let τ = 0 in the definition of r̂, (11), then r̂ exhibits no dependence on q, making L(q, r̂, x) convex in q. However, we have found that empirical results are improved by choosing τ > 0, since this compels the logits q to also model observed rewards. In addition, even though using an unbiased r̂ enables the theory above, achieving unbiasedness via importance correction increases variance, degrades the quality of the reward estimate, and yields inferior results. In our experiments, we considered τ to be a hyperparameter, and also considered different choices for λ, including λ(x, a) = β(a|x)⁻¹ and λ(x, a) = 1. We also introduced tunable combination weights between the Bregman divergence and the squared error terms in (14), similar to the relaxation in Section 2.2.
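Putting (11) and (10) together, the training objective (14) can be sketched end-to-end for a linear model. The choices λ = 1/β and v = 0, the τ/4 reading of Theorem 3, and the finite-difference training loop are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def F(z):
    """F(q) = log(1 . e^q) (log-sum-exp), computed stably."""
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def f(z):
    """Softmax transfer f = grad F."""
    e = np.exp(z - z.max())
    return e / e.sum()

def example_loss(W, x, a, r_a, beta_a, tau=1.0):
    # q(x) = W x; impute r_hat via (11) with lambda = 1/beta, then apply the
    # surrogate with v = 0: tau * D_F(q || r_hat/tau) + (tau/4)*||q - r_hat/tau||^2
    # (the tau/4 coefficient is our reading of Theorem 3).
    q = W @ x
    r_hat = tau * q.copy()
    r_hat[a] += (r_a - tau * q[a]) / beta_a
    b = r_hat / tau
    d = F(q) - F(b) - f(b) @ (q - b)  # Bregman divergence D_F(q || b) >= 0
    return tau * d + (tau / 4.0) * ((q - b) ** 2).sum()

def objective(W, data):
    """Imputed empirical surrogate (14), averaged over the batch."""
    return np.mean([example_loss(W, *ex) for ex in data])

# Toy batch of (context, action, reward, propensity) tuples, then a few
# steps of finite-difference gradient descent; all settings illustrative.
rng = np.random.default_rng(0)
data = [(rng.normal(size=4), int(rng.integers(3)), float(rng.integers(2)), 1 / 3)
        for _ in range(16)]
W = np.zeros((3, 4))
before = objective(W, data)
for _ in range(50):
    base = objective(W, data)
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp = W.copy()
        Wp[idx] += 1e-5
        grad[idx] = (objective(Wp, data) - base) / 1e-5
    W -= 0.01 * grad
after = objective(W, data)
```

Note that with τ > 0 the imputed target r̂ moves with the logits, so the same model plays both the actor and critic roles, as described in Section 3.2.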
In all cases, we chose hyperparameters from validation data only.

Note that the approach developed in this paper differs fundamentally from recent trust-region and proximal methods in reinforcement learning [30, 31], which still directly optimize expected return, possibly with entropy regularization [23]. These methods use proximal constraints/regularization to improve the stability of optimization, but apply a “surrogate” as a local not a global modification of the objective. By contrast, we are changing the entire optimization objective globally, not locally, and train to maximize a target that is different from expected return.

[Figure 2 shows bar charts of train and test classification error (%) for each objective (D_{F*}(p‖π), Composite, ‖q − (r−v)/τ‖², R̂(π)) on panels (a) MNIST and (b) CIFAR10.]
Figure 2: Training with partial (single action) reward feedback (see appendix for additional results).

Unlike previous entropy regularized approaches [11, 22], which generally consider split actor-critic models, we achieve success with a single model that serves as both.

Another subtlety with optimizing importance corrected objectives, such as R(π, r̂, x) = π(x)_a r_a / β(a|x), is that this does not account for the policy's data coverage [20]; that is, a policy might minimize such an objective by moving mass π(x_i)_{a_i} away from the training observations (x_i, a_i, r_i), leading to a phenomenon known as “propensity overfitting” [34–36].
This effect can be countered by adding coverage-dependent confidence intervals to the estimates [34–36], or constraining [10] or regularizing [20] toward the logging policy choices. Although such regularization is helpful, it is orthogonal to the aim of the current investigation, as any objective can be augmented in this way.

3.6 Experimental Evaluation

As is standard in the field [34–36], we form a partially observed version of a supervised learning task by sampling actions from a behaviour policy π₀, assigning a reward of 1 when the action chosen matches the correct label and a reward of 0 when it does not. Reward on all counterfactual actions is therefore missing. For the MNIST and CIFAR-10 experiments, we used the same architecture, optimizer and model configurations used in the fully observed label experiments. For the empirical risk estimator R̂(π) we used importance correction λ(x, a) = β(a|x)⁻¹; however λ(x, a) = 1 proved to be more effective for D_{F*}(p‖π), which is equivalent to replacing the counterfactual rewards with the model estimate.

MNIST We evaluate the results when data is collected by the uniform behavior policy, π₀(x) = (1/10) 1. The hyperparameters for all objectives were re-optimized on validation data, using the same optimization algorithm as before. Details are given in the appendix.

CIFAR-10 Here we also evaluate the results when data is collected by the uniform behavior policy, π₀(x) = (1/10) 1. However, in addition, we also evaluate the proposed objectives using data released in a recently published benchmark on CIFAR-10 [14]. (Note that the behavior policy itself was not released in this benchmark; instead different sized training sets of size 50k, 100k, 150k, and 250k were generated using this policy.)
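The supervised-to-bandit conversion just described can be sketched as follows (the function name and defaults are ours):

```python
import numpy as np

def bandit_feedback(labels, n_actions=10, seed=0):
    """Convert supervised labels to partial feedback: sample an action from
    the uniform behaviour policy pi_0, record reward 1 iff it matches the
    label, and log the propensity beta = 1/n_actions for each example."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    actions = rng.integers(0, n_actions, size=labels.shape[0])
    rewards = (actions == labels).astype(float)
    propensities = np.full(labels.shape[0], 1.0 / n_actions)
    return actions, rewards, propensities

acts, rews, props = bandit_feedback([3, 1, 4, 1, 5, 9, 2, 6], n_actions=10)
```

Only the chosen action's reward survives; rewards for the other nine actions per context are discarded, which is exactly the missing-data structure the estimators above must cope with.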
We used this alternative data to produce each column in Table 1. For the CIFAR-10 experiments we simply set τ = 1. Additional details are given in the appendix.

Figure 2 shows the results for training on MNIST and CIFAR-10 given data collected by the random behavior policy. In both cases, the composite objective yields improvements over optimizing R̂(π) directly. To investigate whether this difficulty is due to plateaus, we again conduct significantly longer training in the appendix, finding that the Composite and DF*(p‖π) objectives remain advantageous. Table 1 then shows results on CIFAR-10 using the alternative behavior data from [14]. This data appears to be more conducive to optimizing R̂(π) directly, although even in this scenario the composite objective is still competitive, significantly improving on the results reported in [14].

Table 1: CIFAR-10: Test error % for the bandit feedback data sets from [14] with increasing number of training examples.

Examples            50k     100k    150k    250k
R̂(π)               8.71    7.51    6.92    6.66
DF*(p‖π)            21.85   18.26   14.65   12.86
‖q − (r−v)/τ‖²      16.00   9.85    8.68    8.75
Composite           8.34    6.92    6.57    6.36

Table 2: Criteo: Importance sampling estimated reward on test. Error bars are 99% confidence intervals under a normal distribution.

Objectives          R̂(π) × 10⁴
Random              43.68 ± 2.11
Behavior            53.55
DRO R̂(π) [7]       53.07 ± 2.27
POEM [34]           51.89 ± 1.73²
‖q − (r−v)/τ‖²      51.72 ± 1.42
R̂(π)               52.00 ± 1.28
DF*(p‖π)            52.30 ± 0.83
Composite           55.09 ± 2.86

Criteo We also test the proposed surrogate objective on the Criteo data set [18], a large-scale test-bed for evaluating batch contextual bandit methods [3, 34]. Here again the behavior policy was not released, only its generated data. Following [18], we use only banners with a single slot (i.e., where only a single item is chosen) in our learning and evaluation. These banners are randomly split into training, validation and test sets, each containing 7 million records, using the script provided by [18]. There are 35 features used to describe the context and candidate actions (2 continuous and the rest categorical). We encode the discrete features using one-hot encoding, and build linear models using different learning losses. For evaluation, we report the importance sampling based estimate of reward (user clicks on banner) R̂(π) on the test set, as in [18].

We compare the proposed surrogates with several state-of-the-art methods on this data set. Hyperparameters of the different methods were tuned on validation data, and all objectives were optimized by SGD with momentum and batch sizes between 1K and 5K; more details regarding the experimental setup and hyperparameter choices are given in the appendix. All methods use the same input encoding to map inputs x to φ(x). In particular, we evaluated the following:

Random: Choose a candidate banner (i.e. an action) uniformly at random to display.

Behavior: Simply report the observed reward on the test set when acting according to the logging policy β.

Squared ‖q − (r−v)/τ‖²: We set v = 0, which is effective since the expected reward is close to 0.

R̂(π): Directly optimize R̂(π) using importance correction; i.e., λ(x, a) = β(a|x)⁻¹ and τ = 0 in (11).

DRO R̂(π): Optimize the doubly robust estimator [7]; i.e., λ(x, a) = β(a|x)⁻¹ and τ > 0 in (11).

POEM: Combines importance corrected empirical risk estimation, R̂(π), with a regularizer that penalizes the variance of the estimated R̂(π) [34].
We tuned the additional regularization factor λ. (To keep a fair comparison with the other methods, we did not impose capping on the importance weights, which was additionally tuned in [34].)

DF*(p‖π): The imputation strategy for the reward uses λ(x, a) = β(a|x)⁻¹ and we tune τ > 0.

Composite: We combine the R̂(π) objective with DF*(p‖π). In addition to the scaling factor τ, we tuned the combination weights.

Table 2 reports the estimated reward obtained by each method on test data. Here we can see that training with the proposed surrogate performs competitively against previous state-of-the-art methods.

4 Conclusion

We investigated alternative objectives for policy optimization in cost-sensitive classification and contextual bandits. The formulations developed are directly applicable to deep learning and improve the underlying optimization landscape. The empirical results in both the cost-sensitive classification and batch contextual bandit scenarios replicate or surpass training with state-of-the-art baseline objectives, merely through the optimization of the non-standard loss functions. There remain several opportunities for further development of surrogate training objectives for sequential decision making tasks (i.e. in planning and reinforcement learning).

² The number reported here is lower than that reported in the original paper [34]. One hypothesis is that this is due to the removal of the cap on importance weights. A similar result to what we obtain was reported in [20].

References

[1] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class cost-sensitive learning. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), pages 3–11, 2004.

[2] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds.
Journal of the American Statistical Association, 101(473), 2006.\n\n[3] L\u00e9on Bottou, Jonas Peters, Joaquin Qui\u00f1onero Candela, Denis Xavier Charles, Max Chickering,\nElon Portugaly, Dipankar Ray, Patrice Y. Simard, and Ed Snelson. Counterfactual reasoning\nand learning systems: the example of computational advertising. Journal of Machine Learning\nResearch, 14(1):3207\u20133260, 2013.\n\n[4] Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. Top-k\noff-policy correction for a reinforce recommender system. In Proceedings of the Twelfth ACM\nInternational Conference on Web Search and Data Mining, pages 456\u2013464, 2019.\n\n[5] Jacek P. Dmochowski, Paul Sajda, and Lucas C. Parra. Maximum likelihood in cost-sensitive\nlearning: Model speci\ufb01cation, approximations, and upper bounds. Journal of Machine Learning\nResearch, 11:3313\u20133332, 2010.\n\n[6] Miroslav Dud\u00edk, Dumitru Erhan, John Langford, and Lihong Li. Doubly robust policy evaluation\n\nand optimization. Statistical Science, 29(4):458\u2013511, 2014.\n\n[7] Miroslav Dud\u00edk, John Langford, and Lihong Li. Doubly robust policy evaluation and learning.\nIn Proceedings of the International Conference on Machine Learning (ICML), pages 1097\u20131104,\n2011.\n\n[8] Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the International\n\nJoint Conference on Arti\ufb01cial Intelligence (IJCAI), pages 973\u2013978, 2001.\n\n[9] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust\noff-policy evaluation. In Proceedings of the International Conference on Machine Learning\n(ICML), pages 1446\u20131455, 2018.\n\n[10] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning\nwithout exploration. In Proceedings of the International Conference on Machine Learning\n(ICML), pages 2052\u20132062, 2019.\n\n[11] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning (ICML), pages 1856–1865, 2018.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[13] Klaus-Uwe Höffgen, Hans Ulrich Simon, and Kevin S. Van Horn. Robust trainability of single neurons. Journal of Computer and System Sciences, 50(1):114–125, 1995.

[14] Thorsten Joachims, Adith Swaminathan, and Maarten de Rijke. Deep learning with logged bandit feedback. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[15] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[16] Carolin Lawrence and Stefan Riezler. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 1820–1830, 2018.

[17] Carolin Lawrence, Artem Sokolov, and Stefan Riezler. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2566–2576, 2017.

[18] Damien Lefortier, Adith Swaminathan, Xiaotao Gu, Thorsten Joachims, and Maarten de Rijke. Large-scale validation of counterfactual learning methods: A test-bed. CoRR, abs/1612.00367, 2016.

[19] Hsuan-Tien Lin. Reduction from cost-sensitive multiclass classification to one-versus-one binary classification. In Proceedings of the Asian Conference on Machine Learning (ACML), 2014.

[20] Yifei Ma, Yu-Xiang Wang, and Balakrishnan (Murali) Narayanaswamy. Imitation-regularized offline learning.
In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[21] Karthika Mohan, Judea Pearl, and Jin Tian. Graphical models for inference with missing data. In Advances in Neural Information Processing Systems 26, pages 1277–1285, 2013.

[22] Ofir Nachum, Mohammad Norouzi, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 31, pages 2772–2782, 2017.

[23] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-PCL: An off-policy trust region method for continuous control. In Proceedings of the International Conference on Learning Representations (ICLR), 2018.

[24] Deirdre B. O'Brien, Maya R. Gupta, and Robert M. Gray. Cost-sensitive multi-class classification from probability estimates. In Proceedings of the International Conference on Machine Learning (ICML), pages 712–719, 2008.

[25] Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey E. Hinton. Regularizing neural networks by penalizing confident output distributions. CoRR, abs/1701.06548, 2017.

[26] Bernardo Ávila Pires, Csaba Szepesvári, and Mohammad Ghavamzadeh. Cost-sensitive multi-class classification risk bounds. In Proceedings of the International Conference on Machine Learning (ICML), pages 1391–1399, 2013.

[27] Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, 2011.

[28] Lorenzo Rosasco, Ernesto De Vito, Andrea Caponnetto, Michele Piana, and Alessandro Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.

[29] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation.
In Proceedings of the International Conference on Machine Learning (ICML), pages 1670–1679, 2016.

[30] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 1889–1897, 2015.

[31] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.

[32] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26(2):225–287, 2007.

[33] Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1139–1147, 2013.

[34] Adith Swaminathan and Thorsten Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16:1731–1755, 2015.

[35] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the International Conference on Machine Learning (ICML), pages 814–823, 2015.

[36] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In Advances in Neural Information Processing Systems 28, pages 3231–3239, 2015.

[37] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.

[38] Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 2139–2148, 2016.

[39] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N.
Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018.

[40] Yu-Xiang Wang, Alekh Agarwal, and Miroslav Dudík. Optimal and adaptive off-policy evaluation in contextual bandits. In Proceedings of the International Conference on Machine Learning (ICML), pages 3589–3597, 2017.

[41] Yuan Xie, Boyi Liu, Qiang Liu, Zhaoran Wang, Yuan Zhou, and Jian Peng. Off-policy evaluation and learning from logged bandit feedback: Error reduction via surrogate policy. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.

[42] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 2004.

[43] Zhengyuan Zhou, Susan Athey, and Stefan Wager. Offline multi-action policy learning: Generalization and optimization. CoRR, abs/1810.04778, 2018.