{"title": "The Self-Normalized Estimator for Counterfactual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3231, "page_last": 3239, "abstract": "This paper identifies a severe problem of the counterfactual risk estimator typically used in batch learning from logged bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem.In the BLBF setting, the learner does not receive full-information feedback like in supervised learning, but observes feedback only for the actions taken by a historical policy.This makes BLBF algorithms particularly attractive for training online systems (e.g., ad placement, web search, recommendation) using their historical logs.The Counterfactual Risk Minimization (CRM) principle offers a general recipe for designing BLBF algorithms. It requires a counterfactual risk estimator, and virtually all existing works on BLBF have focused on a particular unbiased estimator.We show that this conventional estimator suffers from apropensity overfitting problem when used for learning over complex hypothesis spaces.We propose to replace the risk estimator with a self-normalized estimator, showing that it neatly avoids this problem.This naturally gives rise to a new learning algorithm -- Normalized Policy Optimizer for Exponential Models (Norm-POEM) --for structured output prediction using linear rules.We evaluate the empirical effectiveness of Norm-POEM on severalmulti-label classification problems, finding that it consistently outperforms the conventional estimator.", "full_text": "The Self-Normalized Estimator for Counterfactual\n\nLearning\n\nAdith Swaminathan\n\nDepartment of Computer Science\n\nCornell University\n\nadith@cs.cornell.edu\n\nThorsten Joachims\n\nDepartment of Computer Science\n\nCornell University\n\ntj@cs.cornell.edu\n\nAbstract\n\nThis paper identi\ufb01es a severe problem of the counterfactual risk estimator typi-\ncally used in batch learning from logged 
bandit feedback (BLBF), and proposes the use of an alternative estimator that avoids this problem. In the BLBF setting, the learner does not receive full-information feedback like in supervised learning, but observes feedback only for the actions taken by a historical policy. This makes BLBF algorithms particularly attractive for training online systems (e.g., ad placement, web search, recommendation) using their historical logs. The Counterfactual Risk Minimization (CRM) principle [1] offers a general recipe for designing BLBF algorithms. It requires a counterfactual risk estimator, and virtually all existing works on BLBF have focused on a particular unbiased estimator. We show that this conventional estimator suffers from a propensity overfitting problem when used for learning over complex hypothesis spaces. We propose to replace the risk estimator with a self-normalized estimator, showing that it neatly avoids this problem. This naturally gives rise to a new learning algorithm – Normalized Policy Optimizer for Exponential Models (Norm-POEM) – for structured output prediction using linear rules. We evaluate the empirical effectiveness of Norm-POEM on several multi-label classification problems, finding that it consistently outperforms the conventional estimator.

1 Introduction

Most interactive systems (e.g. search engines, recommender systems, ad platforms) record large quantities of log data which contain valuable information about the system's performance and user experience. For example, the logs of an ad-placement system record which ad was presented in a given context and whether the user clicked on it. While these logs contain information that should inform the design of future systems, the log entries do not provide supervised training data in the conventional sense. This prevents us from directly employing supervised learning algorithms to improve these systems.
In particular, each entry only provides bandit feedback since the loss/reward is only observed for the particular action chosen by the system (e.g. the presented ad) but not for all the other actions the system could have taken. Moreover, the log entries are biased since actions that are systematically favored by the system will be over-represented in the logs.

Learning from historical log data can be formalized as batch learning from logged bandit feedback (BLBF) [2, 1]. Unlike the well-studied problem of online learning from bandit feedback [3], this setting does not require the learner to have interactive control over the system. Learning in such a setting is closely related to the problem of off-policy evaluation in reinforcement learning [4] – we would like to know how well a new system (policy) would perform if it had been used in the past. This motivates the use of counterfactual estimators [5]. Following an approach analogous to Empirical Risk Minimization (ERM), it was shown that such estimators can be used to design learning algorithms for batch learning from logged bandit feedback [6, 5, 1].

However, the conventional counterfactual risk estimator used in prior works on BLBF exhibits severe anomalies that can lead to degeneracies when used in ERM. In particular, the estimator exhibits a new form of Propensity Overfitting that causes severely biased risk estimates for the ERM minimizer. By introducing multiplicative control variates, we propose to replace this risk estimator with a Self-Normalized Risk Estimator that provably avoids these degeneracies. An extensive empirical evaluation confirms that the desirable theoretical properties of the Self-Normalized Risk Estimator translate into improved generalization performance and robustness.

2 Related work

Batch learning from logged bandit feedback is an instance of causal inference.
Classic inference techniques like propensity score matching [7] are, hence, immediately relevant. BLBF is closely related to the problem of learning under covariate shift (also called domain adaptation or sample bias correction) [8] as well as off-policy evaluation in reinforcement learning [4]. Lower bounds for domain adaptation [8] and impossibility results for off-policy evaluation [9], hence, also apply to propensity score matching [7], costing [10] and other importance sampling approaches to BLBF. Several counterfactual estimators have been developed for off-policy evaluation [11, 6, 5]. All these estimators are instances of importance sampling for Monte Carlo approximation and can be traced back to What-If simulations [12]. Learning (upper) bounds have been developed recently [13, 1, 14] that show that these estimators can work for BLBF. We additionally show that importance sampling can overfit in hitherto unforeseen ways with the capacity of the hypothesis space during learning. We call this new kind of overfitting Propensity Overfitting.

Classic variance reduction techniques for importance sampling are also useful for counterfactual evaluation and learning. For instance, importance weights can be “clipped” [15] to trade off bias against variance in the estimators [5]. Additive control variates give rise to regression estimators [16] and doubly robust estimators [6]. Our proposal uses multiplicative control variates. These are widely used in financial applications (see [17] and references therein) and policy iteration for reinforcement learning (e.g. [18]). In particular, we study the self-normalized estimator [12], which is superior to the vanilla estimator when fluctuations in the weights dominate the variance [19].
We additionally show that the self-normalized estimator neatly addresses propensity overfitting.

3 Batch learning from logged bandit feedback

Following [1], we focus on the stochastic, cardinal, contextual bandit setting and recap the essence of the CRM principle. The inputs of a structured prediction problem $x \in \mathcal{X}$ are drawn i.i.d. from a fixed but unknown distribution $\Pr(\mathcal{X})$. The outputs are denoted by $y \in \mathcal{Y}$. The hypothesis space $\mathcal{H}$ contains stochastic hypotheses $h(\mathcal{Y} \mid x)$ that define a probability distribution over $\mathcal{Y}$. A hypothesis $h \in \mathcal{H}$ makes predictions by sampling from the conditional distribution $y \sim h(\mathcal{Y} \mid x)$. This definition of $\mathcal{H}$ also captures deterministic hypotheses. For notational convenience, we denote the probability distribution $h(\mathcal{Y} \mid x)$ by $h(x)$, and the probability assigned by $h(x)$ to $y$ as $h(y \mid x)$. We use $(x, y) \sim h$ to refer to samples of $x \sim \Pr(\mathcal{X})$, $y \sim h(x)$, and when clear from the context, we will drop $(x, y)$.

Bandit feedback means we only observe the feedback $\delta(x, y)$ for the specific $y$ that was predicted, but not for any of the other possible predictions $\mathcal{Y} \setminus \{y\}$. The feedback is just a number, called the loss $\delta : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$. Smaller numbers are desirable. In general, the loss is the (noisy) realization of a stochastic random variable. The following exposition can be readily extended to the general case by setting $\delta(x, y) = \mathbb{E}[\delta \mid x, y]$. The expected loss – called risk – of a hypothesis, $R(h)$, is

$$R(h) = \mathbb{E}_{x \sim \Pr(\mathcal{X})} \mathbb{E}_{y \sim h(x)} [\delta(x, y)] = \mathbb{E}_h [\delta(x, y)] . \quad (1)$$

The aim of learning is to find a hypothesis $h \in \mathcal{H}$ that has minimum risk.

Counterfactual estimators. We wish to use the logs of a historical system to perform learning. To ensure that learning will not be impossible [9], we assume the historical algorithm whose predictions we record in our logged data is a stationary policy $h_0(x)$ with full support over $\mathcal{Y}$.
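To make the setting concrete, here is a minimal sketch of how such logged data accumulates: the logging policy samples an action, and only that action's loss and propensity are recorded. This is our own toy illustration; the problem size, the uniform logging policy, and the 0/1 mismatch loss are assumptions for the sketch, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: k contexts and k possible outputs (names are ours).
k = 10

def h0_probs(x):
    """Logging policy h0(Y | x): uniform over the k outputs (full support, as assumed)."""
    return np.full(k, 1.0 / k)

def delta(x, y):
    """Bandit feedback: a loss observed only for the chosen y (here, 0/1 mismatch loss)."""
    return 0.0 if y == x else 1.0

def log_bandit_data(n):
    """Collect D = {(x_i, y_i, delta_i, p_i)}: run h0 and record the propensity p_i = h0(y_i | x_i)."""
    D = []
    for _ in range(n):
        x = rng.integers(k)        # x ~ Pr(X), here uniform
        p = h0_probs(x)
        y = rng.choice(k, p=p)     # y ~ h0(x)
        D.append((x, y, delta(x, y), p[y]))
    return D

D = log_bandit_data(500)
```

Note that the losses of the unchosen actions are never written to the log, which is exactly what rules out the standard supervised risk estimator below.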
For a new hypothesis $h \neq h_0$, we cannot use the empirical risk estimator used in supervised learning [20] to directly approximate $R(h)$, because the data contains samples drawn from $h_0$ while the risk from Equation (1) requires samples from $h$. Importance sampling fixes this distribution mismatch,

$$R(h) = \mathbb{E}_h [\delta(x, y)] = \mathbb{E}_{h_0} \left[ \delta(x, y) \frac{h(y \mid x)}{h_0(y \mid x)} \right] .$$

So, with data collected from the historical system

$$D = \{(x_1, y_1, \delta_1, p_1), \ldots, (x_n, y_n, \delta_n, p_n)\},$$

where $(x_i, y_i) \sim h_0$, $\delta_i \equiv \delta(x_i, y_i)$ and $p_i \equiv h_0(y_i \mid x_i)$, we can derive an unbiased estimate of $R(h)$ via Monte Carlo approximation,

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{h(y_i \mid x_i)}{p_i} . \quad (2)$$

This classic inverse propensity estimator [7] has unbounded variance: $p_i \simeq 0$ in $D$ can cause $\hat{R}(h)$ to be arbitrarily far away from the true risk $R(h)$. To remedy this problem, several thresholding schemes have been proposed and studied in the literature [15, 8, 5, 11]. The straightforward option is to cap the propensity weights [15, 1], i.e. pick $M > 1$ and set

$$\hat{R}_M(h) = \frac{1}{n} \sum_{i=1}^{n} \delta_i \min \left\{ M, \frac{h(y_i \mid x_i)}{p_i} \right\} .$$

Smaller values of $M$ reduce the variance of $\hat{R}_M(h)$ but induce a larger bias.

Counterfactual Risk Minimization. Importance sampling also introduces variance in $\hat{R}_M(h)$ through the variability of $\frac{h(y_i \mid x_i)}{p_i}$. This variance can be drastically different for different $h \in \mathcal{H}$. The CRM principle is derived from a generalization error bound that reasons about this variance using an empirical Bernstein argument [1, 13]. Let $\delta(\cdot, \cdot) \in [-1, 0]$ and consider the random variable $u_h = \delta(x, y) \min \left\{ M, \frac{h(y \mid x)}{h_0(y \mid x)} \right\}$. Note that $D$ contains $n$ i.i.d. observations $u_h^i$. Denote the empirical variance of $u_h$ by $\widehat{Var}(u_h)$.

Theorem 1. With probability at least $1 - \gamma$ in the random vector $(x_i, y_i) \sim h_0$, for a stochastic hypothesis space $\mathcal{H}$ with capacity $C(\mathcal{H})$ and $n \geq 16$,

$$\forall h \in \mathcal{H} : R(h) \leq \hat{R}_M(h) + \sqrt{\frac{18 \, \widehat{Var}(u_h) \log(\frac{10 C(\mathcal{H})}{\gamma})}{n}} + M \, \frac{15 \log(\frac{10 C(\mathcal{H})}{\gamma})}{n - 1} .$$

Proof. Refer to Theorem 1 of [1] and the proof of Theorem 6 of [13].

Following Structural Risk Minimization [20], this bound motivates the CRM principle for designing algorithms for BLBF. A learning algorithm should jointly optimize the estimate $\hat{R}_M(h)$ as well as its empirical standard deviation, where the latter serves as a data-dependent regularizer:

$$\hat{h}^{CRM} = \operatorname*{argmin}_{h \in \mathcal{H}} \left\{ \hat{R}_M(h) + \lambda \sqrt{\frac{\widehat{Var}(u_h)}{n}} \right\} . \quad (3)$$

$M > 1$ and $\lambda \geq 0$ are regularization hyper-parameters.

4 The Propensity Overfitting problem

The CRM objective in Equation (3) penalizes those $h \in \mathcal{H}$ that are “far” from the logging policy $h_0$ (as measured by their empirical variance $\widehat{Var}(u_h)$). This can be intuitively understood as a safeguard against overfitting. However, overfitting in BLBF is more nuanced than in conventional supervised learning. In particular, the unbiased risk estimator of Equation (2) has two anomalies. Even if $\delta(\cdot, \cdot) \in [\triangledown, \triangle]$, the value of $\hat{R}(h)$ estimated on a finite sample need not lie in that range. Furthermore, if $\delta(\cdot, \cdot)$ is translated by a constant, $\delta(\cdot, \cdot) + C$, then $R(h)$ becomes $R(h) + C$ by linearity of expectation – but the unbiased estimator on a finite sample need not equal $\hat{R}(h) + C$. In short, this risk estimator is not equivariant [19]. The various thresholding schemes for importance sampling only exacerbate this effect. These anomalies leave us vulnerable to a peculiar kind of overfitting, as
These anomalies leave us vulnerable to a peculiar kind of over\ufb01tting, as\nwe see in the following example.\n\n3\n\n\fExample 1. For the input space of integers X = {1..k} and the output space Y = {1..k}, de\ufb01ne\n\nThe hypothesis space H is the set of all deterministic functions f : X (cid:55)\u2192 Y.\n\n(cid:26)\u22122\n(cid:26)1\n\n\u22121\n\n0\n\n\u03b4(x, y) =\n\nif y = x\notherwise.\n\nhf (y|x) =\n\nif f (x) = y\notherwise.\n\nf with f\u2217(x) = x, which has risk R(h\u2217) = \u22122.\n\nData is drawn uniformly, x \u223c U(X ) and h0(Y|x) = U(Y) for all x. The hypothesis h\u2217 with\nminimum true risk is h\u2217\nWhen drawing a training sample D = ((x1, y1, \u03b41, p1), ..., (xn, yn, \u03b4n, pn)), let us \ufb01rst consider the\nspecial case where all xi in the sample are distinct. This is quite likely if n is small relative to k. In\nthis case H contains a hypothesis hoverf it, which assigns f (xi) = yi for all i. This hypothesis has\nthe following empirical risk as estimated by Equation (2):\n\n\u02c6R(hoverf it) =\n\n1\nn\n\nhoverf it(yi | xi)\n\npi\n\n=\n\n1\nn\n\n\u03b4i\n\n1\n1/k\n\n\u2264 1\nn\n\n\u22121\n\n1\n1/k\n\n= \u2212k.\n\nn(cid:88)\n\ni=1\n\n\u03b4i\n\nn(cid:88)\n\ni=1\n\nn(cid:88)\n\ni=1\n\nClearly this risk estimate shows severe over\ufb01tting, since it can be arbitrarily lower than the true risk\nR(h\u2217) = \u22122 of the best hypothesis h\u2217 with appropriately chosen k (or, more generally, the choice\nof h0). This is in stark contrast to over\ufb01tting in full-information supervised learning, where at least\nthe over\ufb01tted risk is bounded by the lower range of the loss function. Note that the empirical risk\n\u02c6R(h\u2217) of h\u2217 concentrates around \u22122. ERM will, hence, almost always select hoverf it over h\u2217.\nEven if we are not in the special case of having a sample with all distinct xi, this type of over\ufb01tting\nstill exists. 
In particular, if there are only $l$ distinct $x_i$ in $D$, then there still exists an $h_{overfit}$ with $\hat{R}(h_{overfit}) \leq -k \frac{l}{n}$. Finally, note that this type of overfitting behavior is not an artifact of this example. Section 7 shows that it is ubiquitous in all the datasets we explored.

Maybe this problem could be avoided by transforming the loss? For example, let's translate the loss by adding 2 to $\delta$ so that all loss values become non-negative. This results in a new loss function $\delta'(x, y)$ taking values 0 and 1. In conventional supervised learning an additive translation of the loss does not change the empirical risk minimizer. Suppose we draw a sample $D$ in which not all possible values $y$ are observed for each $x_i$ in the sample (again, such a sample is likely for sufficiently large $k$). Now there are many hypotheses $h_{overfit'}$ that predict one of the unobserved $y$ for each $x_i$, basically avoiding the training data:

$$\hat{R}(h_{overfit'}) = \frac{1}{n} \sum_{i=1}^{n} \delta'_i \frac{h_{overfit'}(y_i \mid x_i)}{p_i} = \frac{1}{n} \sum_{i=1}^{n} \delta'_i \frac{0}{1/k} = 0 .$$

Again we are faced with overfitting, since many overfit hypotheses are indistinguishable from the true risk minimizer $h^*$ with true risk $R(h^*) = 0$ and empirical risk $\hat{R}(h^*) = 0$.

These examples indicate that this overfitting occurs regardless of how the loss is transformed. Intuitively, this type of overfitting occurs since the risk estimate according to Equation (2) can be minimized not only by putting large probability mass $h(y \mid x)$ on the examples with low loss $\delta(x, y)$, but also by maximizing (for negative losses) or minimizing (for positive losses) the sum of the weights

$$\hat{S}(h) = \frac{1}{n} \sum_{i=1}^{n} \frac{h(y_i \mid x_i)}{p_i} . \quad (4)$$

For this reason, we call this type of overfitting Propensity Overfitting.
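Example 1 is easy to reproduce numerically. The sketch below is our own illustration: a representative sample with distinct $x_i$ and $y_i \neq x_i$ is written out deterministically for reproducibility, and the estimate of Equation (2) and the weight sum of Equation (4) are compared for $h^*$ and $h_{overfit}$.

```python
import numpy as np

k, n = 1024, 20                  # n small relative to k, so distinct x_i are representative
xs = np.arange(n)                # distinct contexts (a representative draw from U(X))
ys = xs + 1                      # logged predictions with y_i != x_i, so every delta_i = -1
deltas = np.where(ys == xs, -2.0, -1.0)
ps = np.full(n, 1.0 / k)         # uniform logging policy h0: every propensity is 1/k

def ips_and_weight_sum(h_prob):
    """Return (R_hat, S_hat): the Eq. (2) risk estimate and the Eq. (4) mean importance weight."""
    w = np.array([h_prob(x, y) for x, y in zip(xs, ys)]) / ps
    return float(np.mean(deltas * w)), float(np.mean(w))

# h*: f*(x) = x, the true risk minimizer with R(h*) = -2.
risk_star, S_star = ips_and_weight_sum(lambda x, y: 1.0 if y == x else 0.0)

# h_overfit: memorizes the sample, f(x_i) = y_i.
memo = dict(zip(xs.tolist(), ys.tolist()))
risk_over, S_over = ips_and_weight_sum(lambda x, y: 1.0 if memo.get(x) == y else 0.0)

# risk_over == -k although no individual loss is below -2, because S_over == k instead of ~1;
# risk_star is 0 on this sample because h0 happened never to log y = x.
```

The diagnostic is visible in the weight sums: the memorizing hypothesis inflates $\hat{S}(h)$ to $k$, which is exactly what Section 5 exploits.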
This is in stark contrast to overfitting in supervised learning, which we call Loss Overfitting. Intuitively, Loss Overfitting occurs because the capacity of $\mathcal{H}$ fits spurious patterns of low $\delta(x, y)$ in the data. In Propensity Overfitting, the capacity in $\mathcal{H}$ allows overfitting of the propensity weights $p_i$ – for positive $\delta$, hypotheses that avoid $D$ are selected; for negative $\delta$, hypotheses that overrepresent $D$ are selected.

The variance regularization of CRM combats both Loss Overfitting and Propensity Overfitting by optimizing a more informed generalization error bound. However, the empirical variance estimate is also affected by Propensity Overfitting – especially for positive losses. Can we avoid Propensity Overfitting more directly?

5 Control variates and the Self-Normalized estimator

To avoid Propensity Overfitting, we must first detect when and where it is occurring. For this, we draw on diagnostic tools used in importance sampling. Note that for any $h \in \mathcal{H}$, the sum of propensity weights $\hat{S}(h)$ from Equation (4) always has expected value 1 under the conditions required for the unbiased estimator of Equation (2):

$$\mathbb{E}\left[\hat{S}(h)\right] = \frac{1}{n} \sum_{i=1}^{n} \int \!\! \int \frac{h(y_i \mid x_i)}{h_0(y_i \mid x_i)} \, h_0(y_i \mid x_i) \Pr(x_i) \, dy_i \, dx_i = \frac{1}{n} \sum_{i=1}^{n} \int 1 \cdot \Pr(x_i) \, dx_i = 1 . \quad (5)$$

This means that we can identify hypotheses that suffer from Propensity Overfitting based on how far $\hat{S}(h)$ deviates from its expected value of 1. Since $\frac{h(y \mid x)}{h_0(y \mid x)}$ is likely correlated with $\delta(x, y) \frac{h(y \mid x)}{h_0(y \mid x)}$, a large deviation in $\hat{S}(h)$ suggests a large deviation in $\hat{R}(h)$ and consequently a bad risk estimate.

How can we use the knowledge that $\forall h \in \mathcal{H} : \mathbb{E}[\hat{S}(h)] = 1$ to avoid degenerate risk estimates in a principled way?
While one could use concentration inequalities to explicitly detect and eliminate overfit hypotheses based on $\hat{S}(h)$, we use control variates to derive an improved risk estimator that directly incorporates this knowledge.

Control variates. Control variates – random variables whose expectation is known – are a classic tool used to reduce the variance of Monte Carlo approximations [21]. Let $V(X)$ be a control variate with known expectation $\mathbb{E}_X[V(X)] = v \neq 0$, and let $\mathbb{E}_X[W(X)]$ be an expectation that we would like to estimate based on independent samples of $X$. Employing $V(X)$ as a multiplicative control variate, we can write $\mathbb{E}_X[W(X)] = \frac{\mathbb{E}[W(X)]}{\mathbb{E}[V(X)]} v$. This motivates the ratio estimator

$$\hat{W}^{SN} = \frac{\sum_{i=1}^{n} W(X_i)}{\sum_{i=1}^{n} V(X_i)} \, v , \quad (6)$$

which is called the Self-Normalized estimator in the importance sampling literature [12, 22, 23]. This estimator has substantially lower variance if $W(X)$ and $V(X)$ are correlated.

Self-Normalized risk estimator. Let us use $S(h)$ as a control variate for $R(h)$, yielding

$$\hat{R}^{SN}(h) = \frac{\sum_{i=1}^{n} \delta_i \frac{h(y_i \mid x_i)}{p_i}}{\sum_{i=1}^{n} \frac{h(y_i \mid x_i)}{p_i}} . \quad (7)$$

Hesterberg reports that this estimator tends to be more accurate than the unbiased estimator of Equation (2) when fluctuations in the sampling weights dominate the fluctuations in $\delta(x, y)$ [19]. Observe that the estimate is just a convex combination of the $\delta_i$ observed in the sample. If $\delta(\cdot, \cdot)$ is now translated by a constant, $\delta(\cdot, \cdot) + C$, both the true risk $R(h)$ and the finite sample estimate $\hat{R}^{SN}(h)$ get shifted by $C$. Hence $\hat{R}^{SN}(h)$ is equivariant, unlike $\hat{R}(h)$ [19]. Moreover, $\hat{R}^{SN}(h)$ is always bounded within the range of $\delta$.
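Both properties can be checked in a few lines. The numbers below are our own toy values: the weights stand in for hypothetical importance ratios $h(y_i \mid x_i)/p_i$ that do not average to 1, which is exactly when the two estimators diverge.

```python
import numpy as np

def ips(deltas, weights):
    """Conventional unbiased estimator, Eq. (2)."""
    return float(np.mean(deltas * weights))

def snips(deltas, weights):
    """Self-normalized estimator, Eq. (7): a weight-normalized convex combination of the delta_i."""
    return float(np.sum(deltas * weights) / np.sum(weights))

# Toy logged sample: losses in [-1, 0] and importance weights h(y_i|x_i)/p_i (made-up values).
deltas = np.array([-1.0, -0.5, 0.0, -0.25])
w = np.array([4.0, 0.5, 2.0, 1.5])

r_ips, r_sn = ips(deltas, w), snips(deltas, w)
r_ips_shift, r_sn_shift = ips(deltas + 2.0, w), snips(deltas + 2.0, w)

# r_sn stays within [min(delta), max(delta)] and shifts by exactly +2 under translation;
# the vanilla estimate falls outside [-1, 0] and does not shift by +2, since the weights
# do not average to 1.
```

Here `r_ips` is -1.15625, outside the loss range, while `r_sn` is -0.578125 and moves to exactly `r_sn + 2` after the translation.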
So, the overfitted risk due to ERM will now be bounded by the lower range of the loss, analogous to full-information supervised learning.

Finally, while the self-normalized risk estimator is not unbiased ($\mathbb{E}\big[\hat{R}(h)/\hat{S}(h)\big] \neq R(h)/\mathbb{E}\big[\hat{S}(h)\big]$ in general), it is strongly consistent and approaches the desired expectation when $n$ is large.

Theorem 2. Let $D$ be drawn $(x_i, y_i)$ i.i.d. $\sim h_0$, from an $h_0$ that has full support over $\mathcal{Y}$. Then,

$$\forall h \in \mathcal{H} : \Pr\big( \lim_{n \to \infty} \hat{R}^{SN}(h) = R(h) \big) = 1 .$$

Proof. The numerator of $\hat{R}^{SN}(h)$ in (7) consists of i.i.d. observations with mean $R(h)$, so the strong law of large numbers gives $\Pr\big( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \delta_i \frac{h(y_i \mid x_i)}{p_i} = R(h) \big) = 1$. Similarly, the denominator consists of i.i.d. observations with mean 1, so the strong law of large numbers implies $\Pr\big( \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \frac{h(y_i \mid x_i)}{p_i} = 1 \big) = 1$. Hence, $\Pr\big( \lim_{n \to \infty} \hat{R}^{SN}(h) = R(h) \big) = 1$.

In summary, the self-normalized risk estimator $\hat{R}^{SN}(h)$ in Equation (7) resolves all the problems of the unbiased estimator $\hat{R}(h)$ from Equation (2) identified in Section 4.

6 Learning method: Norm-POEM

We now derive a learning algorithm, called Norm-POEM, for structured output prediction. The algorithm is analogous to POEM [1] in its choice of hypothesis space and its application of the CRM principle, but it replaces the conventional estimator (2) with the self-normalized estimator (7).

Hypothesis space.
Following [1, 24], Norm-POEM learns stochastic linear rules $h_w \in \mathcal{H}_{lin}$ parametrized by $w$ that operate on a $d$-dimensional joint feature map $\phi(x, y)$:

$$h_w(y \mid x) = \exp(w \cdot \phi(x, y)) / Z(x) ,$$

where $Z(x) = \sum_{y' \in \mathcal{Y}} \exp(w \cdot \phi(x, y'))$ is the partition function.

Variance estimator. In order to instantiate the CRM objective from Equation (3), we need an empirical variance estimate $\widehat{Var}(\hat{R}^{SN}(h))$ for the self-normalized risk estimator. Following [23, Section 4.3], we use an approximate variance estimate for the ratio estimator of Equation (6). Using the Normal approximation argument [21, Equation 9.9],

$$\widehat{Var}(\hat{R}^{SN}(h)) = \frac{\sum_{i=1}^{n} \big( \delta_i - \hat{R}^{SN}(h) \big)^2 \left( \frac{h(y_i \mid x_i)}{p_i} \right)^2}{\left( \sum_{i=1}^{n} \frac{h(y_i \mid x_i)}{p_i} \right)^2} . \quad (8)$$

Using the delta method to approximate the variance [22] yields the same formula. To invoke asymptotic normality of the estimator (and indeed, for reliable importance sampling estimates) we require the true variance of the self-normalized estimator $Var(\hat{R}^{SN}(h))$ to exist. We can guarantee this by thresholding the importance weights, analogous to $\hat{R}_M(h)$.

The benefits of the self-normalized estimator come at a computational cost. The risk estimator of POEM had a simpler variance estimate which could be approximated by Taylor expansion and optimized using stochastic gradient descent. The variance of Equation (8) does not admit stochastic optimization. Surprisingly, in our experiments in Section 7 we find that the improved robustness of Norm-POEM permits fast convergence during training even without stochastic optimization.

Training objective of Norm-POEM. The objective is now derived by substituting the self-normalized risk estimator of Equation (7) and its sample variance estimate from Equation (8) into the CRM objective (3) for the hypothesis space $\mathcal{H}_{lin}$.
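As a rough sketch of that substitution, the snippet below is our own heavily simplified version with made-up data: single-label actions with per-action weight vectors instead of a structured joint feature map, a fixed uniform logging policy, and no attempt at the paper's L-BFGS optimization. It only shows how Equations (7) and (8) slot into the CRM objective (3).

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_actions, n = 5, 4, 50

# Hypothetical logged data: features, logged actions, losses in [-1, 0], propensities.
X = rng.normal(size=(n, d))
ys = rng.integers(n_actions, size=n)
deltas = -rng.random(n)
ps = np.full(n, 1.0 / n_actions)          # uniform logging policy

def hw(W, X):
    """Stochastic softmax (exponential-model) policy h_w(y | x); W is (n_actions, d)."""
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def norm_poem_objective(W, lam=1.0, M=100.0):
    """CRM objective (3) using the self-normalized risk (7) and its variance estimate (8)."""
    probs = hw(W, X)[np.arange(n), ys]
    w_imp = np.minimum(M, probs / ps)             # clipped importance weights
    r_sn = np.sum(deltas * w_imp) / np.sum(w_imp)                       # Eq. (7)
    var_sn = np.sum((deltas - r_sn) ** 2 * w_imp ** 2) / np.sum(w_imp) ** 2  # Eq. (8)
    return r_sn + lam * np.sqrt(var_sn / n)

obj0 = norm_poem_objective(np.zeros((n_actions, d)))   # objective at w = 0 (uniform policy)
```

At $w = 0$ the policy equals the logging policy, every clipped weight is 1, and the self-normalized risk reduces to the plain average of the logged losses.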
By design, $h_w$ lies in the exponential family of distributions, so the gradient of the resulting objective can be tractably computed whenever the partition functions $Z(x_i)$ are tractable. Doing so yields a non-convex objective in the parameters $w$, which we optimize using L-BFGS. The choice of L-BFGS for non-convex and non-smooth optimization is well supported [25, 26]. Analogous to POEM, the hyper-parameters $M$ (clipping to prevent unbounded variance) and $\lambda$ (strength of variance regularization) can be calibrated via counterfactual evaluation on a held-out validation set. In summary, the per-iteration cost of optimizing the Norm-POEM objective has the same complexity as the per-iteration cost of POEM with L-BFGS. It requires the same set of hyper-parameters. And it can be done tractably whenever the corresponding supervised CRF can be learnt efficiently. Software implementing Norm-POEM is available at http://www.cs.cornell.edu/~adith/POEM.

7 Experiments

We will now empirically verify whether the self-normalized estimator as used in Norm-POEM can indeed guard against propensity overfitting and attain robust generalization performance. We follow the Supervised ↦ Bandit methodology [2, 1] to test the limits of counterfactual learning in a well-controlled environment. As in prior work [1], the experiment setup uses supervised datasets for multi-label classification from the LibSVM repository. In these datasets, the inputs $x \in \mathbb{R}^p$. The predictions $y \in \{0, 1\}^q$ are bitvectors indicating the labels assigned to $x$. The datasets have a range of features $p$, labels $q$ and instances $n$:

Name    p (# features)    q (# labels)    n_train    n_test
Scene   294               6               1211       1196
Yeast   103               14              1500       917
TMC     30438             22              21519      7077
LYRL    47236             4               23149      781265

POEM uses the CRM principle instantiated with the unbiased estimator while Norm-POEM uses the self-normalized estimator.
Both use a hypothesis space isomorphic to a Conditional Random Field (CRF) [24]. We therefore report the performance of a full-information CRF (essentially, logistic regression for each of the $q$ labels independently) as a “skyline” for what we can possibly hope to reach by partial-information batch learning from logged bandit feedback. The joint feature map is $\phi(x, y) = x \otimes y$ for all approaches. To simulate a bandit feedback dataset $D$, we use a CRF with default hyper-parameters trained on 5% of the supervised dataset as $h_0$, and replay the training data 4 times to collect sampled labels from $h_0$. This is inspired by the observation that supervised labels are typically hard to collect relative to bandit feedback. The BLBF algorithms only have access to the Hamming loss $\Delta(y^*, y)$ between the supervised label $y^*$ and the sampled label $y$ for input $x$. Generalization performance $R$ is measured by the expected Hamming loss on the held-out supervised test set. Lower is better. Hyper-parameters $\lambda, M$ were calibrated as recommended and validated on a 25% hold-out of $D$ – in summary, our experimental setup is identical to POEM [1]. We report performance of BLBF approaches without $\ell_2$-regularization here; we observed that Norm-POEM dominated POEM even with $\ell_2$-regularization. Since the choice of optimization method could be a confounder, we use L-BFGS for all methods and experiments.

What is the generalization performance of Norm-POEM? The key question is whether the appealing theoretical properties of the self-normalized estimator actually lead to better generalization performance. In Table 1, we report the test set loss for Norm-POEM and POEM averaged over 10 runs. On each run, $h_0$ has varying performance (trained on random 5% subsets) but Norm-POEM consistently beats POEM.

Table 1: Test set Hamming loss averaged over 10 runs.
Norm-POEM significantly outperforms POEM on all four datasets (one-tailed paired difference t-test at significance level of 0.05).

             Scene    Yeast    TMC      LYRL
h0           1.511    5.577    3.442    1.459
POEM         1.200    4.520    2.152    0.914
Norm-POEM    1.045    3.876    2.072    0.799
CRF          0.657    2.830    1.187    0.222

The plot below (Figure 1) shows how generalization performance improves with more training data for a single run of the experiment on the Yeast dataset. We achieve this by varying the number of times we replay the training set to collect samples from $h_0$ (ReplayCount). Norm-POEM consistently outperforms POEM for all training sample sizes.

[Plot omitted; axes: ReplayCount from $2^0$ to $2^8$ (x), test set Hamming loss $R$ (y); curves for $h_0$, CRF, POEM and Norm-POEM.]

Figure 1: Test set Hamming loss as $n \to \infty$ on the Yeast dataset. All approaches will converge to CRF performance in the limit, but the rate of convergence is slow since $h_0$ is thin-tailed.

Does Norm-POEM avoid Propensity Overfitting? While the previous results indicate that Norm-POEM achieves better performance, it remains to be verified that this improved performance is indeed due to improved control over Propensity Overfitting. Table 2 (left) shows the average $\hat{S}(\hat{h})$ for the hypothesis $\hat{h}$ selected by each approach. Indeed, $\hat{S}(\hat{h})$ is close to its known expectation of 1 for Norm-POEM, while it is severely biased for POEM. Furthermore, the value of $\hat{S}(\hat{h})$ depends heavily on how the losses $\delta$ are translated for POEM, as predicted by theory. As anticipated by our earlier observation that the self-normalized estimator is equivariant, Norm-POEM is unaffected by translations of $\delta$. Table 2 (right) shows that the same is true for the prediction error on the test set.
Norm-POEM is consistently good while POEM fails catastrophically (for instance, on the TMC dataset, POEM is worse than random guessing).

Table 2: Mean of the unclipped weights $\hat{S}(\hat{h})$ (left) and test set Hamming loss $R$ (right), averaged over 10 runs. $\delta > 0$ and $\delta < 0$ indicate whether the loss was translated to be positive or negative.

                          |        $\hat{S}(\hat{h})$        |           $R(\hat{h})$
                          | Scene   Yeast   TMC     LYRL     | Scene   Yeast   TMC      LYRL
POEM ($\delta > 0$)       | 0.274   0.028   0.000   0.175    | 2.059   5.441   17.305   2.399
POEM ($\delta < 0$)       | 1.782   5.352   2.802   1.230    | 1.200   4.520   2.152    0.914
Norm-POEM ($\delta > 0$)  | 0.981   0.840   0.941   0.945    | 1.058   3.881   2.079    0.799
Norm-POEM ($\delta < 0$)  | 0.981   0.821   0.938   0.945    | 1.045   3.876   2.072    0.799

Is CRM variance regularization still necessary? It may be possible that the improved self-normalized estimator no longer requires variance regularization. The loss of the unregularized estimator (Norm-IPS) is reported in Table 3. We see that variance regularization still helps.

Table 3: Test set Hamming loss for Norm-POEM and the variance-agnostic Norm-IPS averaged over the same 10 runs as Table 1. On Scene, TMC and LYRL, Norm-POEM is significantly better than Norm-IPS (one-tailed paired difference t-test at significance level of 0.05).

             Scene    Yeast    TMC      LYRL
Norm-IPS     1.072    3.905    3.609    0.806
Norm-POEM    1.045    3.876    2.072    0.799

How computationally efficient is Norm-POEM? Surprisingly, the runtime of Norm-POEM is faster than that of POEM. Even though normalization increases the per-iteration computation cost, optimization tends to converge in fewer iterations than for POEM. We find that POEM picks a hypothesis with large $\|w\|$, attempting to assign a probability of 1 to all training points with negative losses. However, Norm-POEM converges to a much shorter $\|w\|$.
The loss of an instance relative to the others in a sample D governs how hard Norm-POEM tries to fit it. This is another nice consequence of the fact that the overfitted risk R̂SN(h) is bounded and small. Overall, the runtime of Norm-POEM is on the same order of magnitude as that of a full-information CRF, and is competitive with the runtimes reported for POEM with stochastic optimization and early stopping [1], while providing substantially better generalization performance.

Table 4: Time in seconds, averaged across validation runs. CRF is implemented by scikit-learn [27].

Time(s)    Scene  Yeast  TMC     LYRL
POEM       98.65  78.69  716.51  617.30
Norm-POEM  10.15  7.28   227.88  142.50
CRF        4.94   3.43   89.24   72.34

We observe the same trends for Norm-POEM when different properties of h0 are varied (e.g. stochasticity and quality), as reported for POEM [1].

8 Conclusions

We identify the problem of propensity overfitting when using the conventional unbiased risk estimator for ERM in batch learning from bandit feedback. To remedy this problem, we propose the use of a multiplicative control variate that leads to the self-normalized risk estimator, which provably avoids the anomalies of the conventional estimator. Using the new estimator within the CRM principle, we derive a new learning algorithm, Norm-POEM, and show that the improved estimator leads to significantly better generalization performance.

Acknowledgement

This research was funded in part through NSF Awards IIS-1247637, IIS-1217686, IIS-1513692, the JTCII Cornell-Technion Research Fund, and a gift from Bloomberg.

References
[1] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, 2015.
[2] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In KDD, pages 129–138, 2009.
[3] Nicolò Cesa-Bianchi and Gábor Lugosi.
Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[4] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[5] Léon Bottou, Jonas Peters, Joaquin Q. Candela, Denis X. Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Y. Simard, and Ed Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.
[6] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In ICML, pages 1097–1104, 2011.
[7] P. Rosenbaum and D. Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
[8] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In NIPS, pages 442–450, 2010.
[9] John Langford, Alexander Strehl, and Jennifer Wortman. Exploration scavenging. In ICML, pages 528–535, 2008.
[10] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, pages 435–, 2003.
[11] Alexander L. Strehl, John Langford, Lihong Li, and Sham Kakade. Learning from logged implicit exploration data. In NIPS, pages 2217–2225, 2010.
[12] H. F. Trotter and J. W. Tukey. Conditional Monte Carlo for normal samples. In Symposium on Monte Carlo Methods, pages 64–79, 1956.
[13] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample-variance penalization. In COLT, 2009.
[14] Philip S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High-confidence off-policy evaluation. In AAAI, pages 3000–3006, 2015.
[15] Edward L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.
[16] Lihong Li, R. Munos, and C.
Szepesvári. Toward minimax off-policy value estimation. In AISTATS, 2015.
[17] Phelim Boyle, Mark Broadie, and Paul Glasserman. Monte Carlo methods for security pricing. Journal of Economic Dynamics and Control, 21(8-9):1267–1321, 1997.
[18] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015.
[19] Tim Hesterberg. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37:185–194, 1995.
[20] V. Vapnik. Statistical Learning Theory. Wiley, Chichester, GB, 1998.
[21] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
[22] Augustine Kong. A note on importance sampling using standardized weights. Technical Report 348, Department of Statistics, University of Chicago, 1992.
[23] R. Rubinstein and D. Kroese. Simulation and the Monte Carlo Method. Wiley, 2nd edition, 2008.
[24] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289, 2001.
[25] Adrian S. Lewis and Michael L. Overton. Nonsmooth optimization via quasi-Newton methods. Mathematical Programming, 141(1-2):135–163, 2013.
[26] Jin Yu, S. V. N. Vishwanathan, S. Günter, and N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. JMLR, 11:1145–1200, 2010.
[27] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.