{"title": "Learning from Logged Implicit Exploration Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2217, "page_last": 2225, "abstract": "We provide a sound and consistent foundation for the use of \\emph{nonrandom} exploration data in ``contextual bandit'' or ``partially labeled'' settings where only the value of a chosen action is learned. The primary challenge in a variety of settings is that the exploration policy, in which ``offline'' data is logged, is not explicitly known. Prior solutions here require either control of the actions during the learning process, recorded random exploration, or actions chosen obliviously in a repeated manner. The techniques reported here lift these restrictions, allowing the learning of a policy for choosing actions given features from historical data where no randomization occurred or was logged. We empirically verify our solution on two reasonably sized sets of real-world data obtained from an Internet %online advertising company.", "full_text": "Learning from Logged Implicit Exploration Data\n\nAlexander L. Strehl \u2217\n\nFacebook Inc.\n\n1601 S California Ave\nPalo Alto, CA 94304\n\nJohn Langford\nYahoo! Research\n\n111 West 40th Street, 9th Floor\n\nNew York, NY, USA 10018\n\nastrehl@facebook.com\n\njl@yahoo-inc.com\n\nLihong Li\n\nYahoo! Research\n\n4401 Great America Parkway\nSanta Clara, CA, USA 95054\nlihong@yahoo-inc.com\n\nSham M. Kakade\n\nDepartment of Statistics\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA, 19104\n\nskakade@wharton.upenn.edu\n\nAbstract\n\nWe provide a sound and consistent foundation for the use of nonrandom explo-\nration data in \u201ccontextual bandit\u201d or \u201cpartially labeled\u201d settings where only the\nvalue of a chosen action is learned. The primary challenge in a variety of settings\nis that the exploration policy, in which \u201cof\ufb02ine\u201d data is logged, is not explic-\nitly known. 
Prior solutions here require either control of the actions during the\nlearning process, recorded random exploration, or actions chosen obliviously in a\nrepeated manner. The techniques reported here lift these restrictions, allowing the\nlearning of a policy for choosing actions given features from historical data where\nno randomization occurred or was logged. We empirically verify our solution on\ntwo reasonably sized sets of real-world data obtained from Yahoo!.\n\n1 Introduction\n\nConsider the advertisement display problem, where a search engine company chooses an ad to display\nwhich is intended to interest the user. Revenue is typically provided to the search engine from\nthe advertiser only when the user clicks on the displayed ad. This problem is of intrinsic economic\ninterest, resulting in a substantial fraction of income for several well-known companies such as\nGoogle, Yahoo!, and Facebook.\n\nBefore discussing the proposed approach, we formalize the problem and then explain why more\nconventional approaches can fail.\n\nThe warm-start problem for contextual exploration: Let X be an arbitrary input space, and\nA = {1, . . . , k} be a set of actions. An instance of the contextual bandit problem is specified by a\ndistribution D over tuples $(x, \\vec{r})$ where x \u2208 X is an input and $\\vec{r} \\in [0, 1]^k$ is a vector of rewards [6].\nEvents occur on a round-by-round basis where on each round t:\n\n1. The world draws $(x, \\vec{r}) \\sim D$ and announces x.\n2. The algorithm chooses an action a \u2208 A, possibly as a function of x and historical information.\n3. The world announces the reward r_a of action a, but not r_{a\u2032} for a\u2032 \u2260 a.\n\n\u2217Part of this work was done while A. Strehl was at Yahoo! 
Research.\n\nIt is critical to understand that this is not a standard supervised-learning problem, because the reward\nof other actions a\u2032 \u2260 a is not revealed.\nThe standard goal in this setting is to maximize the sum of rewards r_a over the rounds of interaction.\nIn order to do this well, it is essential to use previously recorded events to form a good policy on the\nfirst round of interaction. Thus, this is a \u201cwarm start\u201d problem. Formally, given a dataset of the form\nS = (x, a, r_a)^\u2217 generated by the interaction of an uncontrolled logging policy, we want to construct\na policy h maximizing (either exactly or approximately)\n\n$$V^h := E_{(x,\\vec{r}) \\sim D}[r_{h(x)}].$$\n\nApproaches that fail: There are several approaches that may appear to solve this problem, but\nturn out to be inadequate:\n\n1. Supervised learning. We could learn a regressor s : X \u00d7 A \u2192 [0, 1] which is trained to\npredict the reward, on observed events conditioned on the action a and other information\nx. From this regressor, a policy is derived according to $h(x) = \\mathrm{argmax}_{a \\in A}\\, s(x, a)$. A\nflaw of this approach is that the argmax may extend over a set of choices not included\nin the training data, and hence may not generalize at all (or only poorly). This can be\nverified by considering some extreme cases. Suppose that there are two actions a and b\nwith action a occurring 10^6 times and action b occurring 10^2 times. Since action b occurs\nonly a 10^{\u22124} fraction of the time, a learning algorithm forced to trade off between predicting\nthe expected value of r_a and r_b overwhelmingly prefers to estimate r_a well at the expense of\naccurate estimation for r_b. And yet, in application, action b may be chosen by the argmax.\nThis problem is only worse when action b occurs zero times, as might commonly occur in\nexploration situations.\n\n2. Bandit approaches. 
In the standard setting these approaches suffer from the curse of dimensionality,\nbecause they must be applied conditioned on X. In particular, applying them\nrequires data linear in X \u00d7 A, which is extraordinarily wasteful. In essence, this is a failure\nto take advantage of generalization.\n\n3. Contextual Bandits. Existing approaches to contextual bandits such as EXP4 [1] or Epoch\nGreedy [6] require either interaction to gather data or knowledge of the probability\nthat the logging policy chose the action a. In our case the probability is unknown, and it may in\nfact always be 1.\n\n4. Exploration Scavenging. It is possible to recover exploration information from action visitation\nfrequency when a logging policy chooses actions independent of the input x (but\npossibly dependent on history) [5]. This doesn\u2019t fit our setting, where the logging policy is\nsurely dependent on the input.\n\n5. Propensity Scores, naively. When conducting a survey, a question about income might be\nincluded, and then the proportion of responders at various income levels can be compared\nto census data to estimate a probability conditioned on income that someone chooses to\npartake in the survey. Given this estimated probability, results can be importance-weighted\nto estimate average survey outcomes on the entire population [2]. This approach is problematic\nhere, because the policy making decisions when logging the data may be deterministic\nrather than probabilistic. In other words, accurately predicting the probability of\nthe logging policy choosing an ad implies always predicting 0 or 1 which is not useful for\nour purposes. Although the straightforward use of propensity scores does not work, the approach\nwe take can be thought of as a more clever use of a propensity score, as discussed\nbelow. 
Lambert and Pregibon [4] provide a good explanation of propensity scoring in an\nInternet advertising setting.\n\nOur Approach: The approach proposed in the paper naturally breaks down into three steps.\n\n1. For each event (x, a, ra), estimate the probability \u02c6\u03c0(a|x) that the logging policy chooses\naction a using regression. Here, the \u201cprobability\u201d is over time\u2014we imagine taking a uni-\nform random draw from the collection of (possibly deterministic) policies used at different\npoints in time.\n\n2. For each event (x, a, ra), create a synthetic controlled contextual bandit event accord-\ning to (x, a, ra, 1/ max{\u02c6\u03c0(a|x), \u03c4}) where \u03c4 > 0 is some parameter. The quantity,\n1/ max{\u02c6\u03c0(a|x), \u03c4}, is an importance weight that speci\ufb01es how important the current event\nis for training. As will be clear, the parameter \u03c4 is critical for numeric stability.\n\n2\n\n\f3. Apply an of\ufb02ine contextual bandit algorithm to the set of synthetic contextual bandit events.\nIn our second set of experimental results (Section 4.2) a variant of the argmax regressor is\nused with two critical modi\ufb01cations: (a) We limit the scope of the argmax to those actions\nwith positive probability; (b) We importance weight events so that the training process\nemphasizes good estimation for each action equally. It should be emphasized that the the-\noretical analysis in this paper applies to any algorithm for learning on contextual bandit\nevents\u2014we chose this one because it is a simple modi\ufb01cation on existing (but fundamen-\ntally broken) approaches.\n\nThe above approach is most similar to the Propensity Score approach mentioned above. Relative to\nit, we use a different de\ufb01nition of probability which is not necessarily 0 or 1 when the logging policy\nis completely deterministic.\n\nThree critical questions arise when considering this approach.\n\n1. 
What does \u02c6\u03c0(a|x) mean, given that the logging policy may be deterministically choosing\nan action (ad) a given features x? The essential observation is that a policy which deter-\nministically chooses action a on day 1 and then deterministically chooses action b on day\n2 can be treated as randomizing between actions a and b with probability 0.5 when the\nnumber of events is the same each day, and the events are IID. Thus \u02c6\u03c0(a|x) is an estimate\nof the expected frequency with which action a would be displayed given features x over\nthe timespan of the logged events. In section 3 we show that this approach is sound in the\nsense that in expectation it provides an unbiased estimate of the value of new policy.\n\n2. How do the inevitable errors in \u02c6\u03c0(a|x) in\ufb02uence the process? It turns out they have an\neffect which is dependent on \u03c4. For very small values of \u03c4, the estimates of \u02c6\u03c0(a|x) must\nbe extremely accurate to yield good performance while for larger values of \u03c4 less accuracy\nis required. In Section 3.1, we prove this robustness property.\n\n3. What in\ufb02uence does the parameter \u03c4 have on the \ufb01nal result? While creating a bias in the\nestimation process, it turns out that the form of this bias is mild and relatively reasonable\u2014\nactions which are displayed with low frequency conditioned on x effectively have an under-\nestimated value. This is exactly as expected for the limit where actions have no frequency.\nIn section 3.1 we prove this.\n\nWe close with a generalization from policy evaluation to policy selection with a sample complexity\nbound in section 3.2 and then experimental results in section 4 using real data.\n\n2 Formal Problem Setup and Assumptions\n\nLet \u03c01, ..., \u03c0T be T policies, where, for each t, \u03c0t is a function mapping an input from X to a\n(possibly deterministic) distribution over A. 
The learning algorithm is given a dataset of T samples,\neach of the form (x, a, r_a) \u2208 X \u00d7 A \u00d7 [0, 1], where (x, r) is drawn from D as described in Section 1,\nand the action a \u223c \u03c0_t(x) is chosen according to the t-th policy. We denote this random process by\n(x, a, r_a) \u223c (D, \u03c0_t(\u00b7|x)). Similarly, interaction with the T policies results in a sequence S of T\nsamples, which we denote $S \\sim (D, \\pi_i(\\cdot|x))_{i=1}^T$. The learner is not given prior knowledge of the \u03c0_t.\n\nOffline policy estimator: Given a dataset of the form\n\n$$S = \\{(x_t, a_t, r_{t,a_t})\\}_{t=1}^T, \\qquad (1)$$\n\nwhere \u2200t, x_t \u2208 X, a_t \u2208 A, r_{t,a_t} \u2208 [0, 1], we form a predictor \u02c6\u03c0 : X \u00d7 A \u2192 [0, 1] and then use it\nwith a threshold \u03c4 \u2208 [0, 1] to form an offline estimator for the value of a policy h.\nFormally, given a new policy h : X \u2192 A and a dataset S, define the estimator:\n\n$$\\hat{V}^h_{\\hat{\\pi}}(S) = \\frac{1}{|S|} \\sum_{(x,a,r) \\in S} \\frac{r_a I(h(x) = a)}{\\max\\{\\hat{\\pi}(a|x), \\tau\\}}, \\qquad (2)$$\n\nwhere I(\u00b7) denotes the indicator function. The shorthand $\\hat{V}^h_{\\hat{\\pi}}$ will be used if there is no ambiguity.\nThe purpose of \u03c4 is to upper-bound the individual terms in the sum and is similar to previous methods\nlike robust importance sampling [10].\n\n3 Theoretical Results\n\nWe now present our algorithm and main theoretical results. 
The main idea is twofold: \ufb01rst, we have\na policy estimation step, where we estimate the (unknown) logging policy (Subsection 3.1); second,\nwe have a policy optimization step, where we utilize our estimated logging policy (Subsection 3.2).\nOur main result, Theorem 3.2, provides a generalization bound\u2014addressing the issue of how both\nthe estimation and optimization error contribute to the total error.\nThe logging policy \u03c0t may be deterministic, implying that conventional approaches relying on ran-\ndomization in the logging policy are not applicable. We show next that this is ok when the world\nis IID and the policy varies over its actions. We effectively substitute the standard approach of\nrandomization in the algorithm for randomization in the world.\n\nA basic claim is that the estimator is equivalent, in expectation, to a stochastic policy de\ufb01ned by:\n\n\u03c0(a|x) = E\n\nt\u223cUNIF(1,...,T )[\u03c0t(a|x)],\n\n(3)\nwhere UNIF(\u00b7\u00b7\u00b7 ) denotes the uniform distribution. The stochastic policy \u03c0 chooses an action uni-\nformly at random over the T policies \u03c0t. Our \ufb01rst result is that the expected value of our estimator\nis the same when the world chooses actions according to either \u03c0 or to the sequence of policies \u03c0t.\nAlthough this result and its proof are straightforward, it forms the basis for the rest of the results in\nour paper. Note that the policies \u03c0t may be arbitrary but we have assumed that they do not depend\non the data used for evaluation. This assumption is only necessary for the proofs and can often be\nrelaxed in practice, as we show in Section 4.1.\nTheorem 3.1. 
For any contextual bandit problem D with identical draws over T rounds, for any\nsequence of possibly stochastic policies \u03c0_t(a|x) with \u03c0 derived as above, and for any predictor \u02c6\u03c0,\n\n$$E_{S \\sim (D, \\pi_i(\\cdot|x))_{i=1}^T}\\left[\\hat{V}^h_{\\hat{\\pi}}(S)\\right] = E_{(x,\\vec{r}) \\sim D,\\, a \\sim \\pi(\\cdot|x)}\\left[\\frac{r_a I(h(x) = a)}{\\max\\{\\hat{\\pi}(a|x), \\tau\\}}\\right]. \\qquad (4)$$\n\nThis theorem relates the expected value of our estimator when T policies are used to the much\nsimpler and more standard setting where a single fixed stochastic policy is used.\n\n3.1 Policy Estimation\n\nIn this section we show that for a suitable choice of \u03c4 and \u02c6\u03c0 our estimator is sufficiently accurate\nfor evaluating new policies h. We aggressively use the simplification of the previous section, which\nshows that we can think of the data as generated by a fixed stochastic policy \u03c0, i.e. \u03c0_t = \u03c0 for all t.\nFor a given estimate \u02c6\u03c0 of \u03c0 define the \u201cregret\u201d to be a function reg : X \u2192 [0, 1] by\n\n$$\\mathrm{reg}(x) = \\max_{a \\in A}\\left[(\\pi(a|x) - \\hat{\\pi}(a|x))^2\\right]. \\qquad (5)$$\n\nWe do not use \u2113_1 or \u2113_\u221e loss above because they are harder to minimize than \u2113_2 loss. Our next result\nis that the new estimator is consistent. In the following theorem statement, I(\u00b7) denotes the indicator\nfunction, \u03c0(a|x) the probability that the logging policy chooses action a on input x, and $\\hat{V}^h_{\\hat{\\pi}}$ our\nestimator as defined by Equation 2 based on parameter \u03c4.\nLemma 3.1. Let \u02c6\u03c0 be any function from X to distributions over actions A. Let h : X \u2192 A be any\ndeterministic policy. Let $V^h(x) = E_{r \\sim D(\\cdot|x)}[r_{h(x)}]$ denote the expected value of executing policy h\non input x. 
We have that\n\n$$E_x\\left[I(\\pi(h(x)|x) \\geq \\tau) \\cdot \\left(V^h(x) - \\frac{\\sqrt{\\mathrm{reg}(x)}}{\\tau}\\right)\\right] \\leq E[\\hat{V}^h_{\\hat{\\pi}}] \\leq V^h + E_x\\left[I(\\pi(h(x)|x) \\geq \\tau) \\cdot \\frac{\\sqrt{\\mathrm{reg}(x)}}{\\tau}\\right].$$\n\nIn the above, the expectation $E[\\hat{V}^h_{\\hat{\\pi}}]$ is taken over all sequences of T tuples (x, a, r) where (x, r) \u223c\nD and a \u223c \u03c0(\u00b7|x).^1\nThis lemma bounds the bias in our estimate of V^h(x). There are two sources of bias: one from the\nerror of \u02c6\u03c0(a|x) in estimating \u03c0(a|x), and the other from threshold \u03c4. For the first source, it\u2019s crucial\nthat we analyze the result in terms of the squared loss rather than (say) \u2113_\u221e loss, as reasonable sample\ncomplexity bounds on the regret of squared loss estimates are achievable.^2\n\n^1 Note that varying T does not change the expectation of our estimator, so T has no effect in the theorem.\n^2 Extending our results to log loss would be interesting future work, but is made difficult by the fact that log\nloss is unbounded.\n\nLemma 3.1 shows that the expected value of our estimate $\\hat{V}^h_{\\pi}$ of a policy h is an approximation to a\nlower bound of the true value of the policy h where the approximation is due to errors in the estimate\n\u02c6\u03c0 and the lower bound is due to the threshold \u03c4. When \u02c6\u03c0 = \u03c0, then the statement of Lemma 3.1\nsimplifies to\n\n$$E_x\\left[I(\\pi(h(x)|x) \\geq \\tau) \\cdot V^h(x)\\right] \\leq E[\\hat{V}^h_{\\hat{\\pi}}] \\leq V^h.$$\n\nThus, with a perfect predictor of \u03c0, the expected value of the estimator $\\hat{V}^h_{\\hat{\\pi}}$ is a guaranteed lower\nbound on the true value of policy h. However, as the left-hand side of this statement suggests, it may\nbe a very loose bound, especially if the action chosen by h often has a small probability of being\nchosen by \u03c0.\nThe dependence on 1/\u03c4 in Lemma 3.1 is somewhat unsettling, but unavoidable. 
Consider an\ninstance of the bandit problem with a single input x and two actions a_1, a_2. Suppose that\n\u03c0(a_1|x) = \u03c4 + \u03b5 for some positive \u03b5 and h(x) = a_1 is the policy we are evaluating. Suppose\nfurther that the rewards are always 1 and that \u02c6\u03c0(a_1|x) = \u03c4. Then, the estimator satisfies\n$E[\\hat{V}^h_{\\hat{\\pi}}] = \\pi(a_1|x)/\\hat{\\pi}(a_1|x) = (\\tau + \\varepsilon)/\\tau$. Thus, the expected error in the estimate is\n$E[\\hat{V}^h_{\\hat{\\pi}}] - V^h = |(\\tau + \\varepsilon)/\\tau - 1| = \\varepsilon/\\tau$, while the regret of \u02c6\u03c0 is $(\\pi(a_1|x) - \\hat{\\pi}(a_1|x))^2 = \\varepsilon^2$.\n\n3.2 Policy Optimization\n\nThe previous section proves that we can effectively evaluate a policy h by observing a stochastic\npolicy \u03c0, as long as the actions chosen by h have adequate support under \u03c0, specifically \u03c0(h(x)|x) \u2265\n\u03c4 for all inputs x. However, we are often interested in choosing the best policy h from a set of\npolicies H after observing logged data. Furthermore, as described in Section 2, the logged data are\ngenerated from T fixed, possibly deterministic, policies \u03c0_1, . . . , \u03c0_T rather\nthan a single stochastic policy. As in Section 3 we define the stochastic policy \u03c0 by Equation 3,\n\n$$\\pi(a|x) = E_{t \\sim \\mathrm{UNIF}(1,\\ldots,T)}[\\pi_t(a|x)].$$\n\nThe results of Section 3.1 apply to the policy optimization problem. However, note that the data are\nnow assumed to be drawn from the execution of a sequence of T policies \u03c0_1, . . . , \u03c0_T, rather than by\nT draws from \u03c0.\nNext, we show that it is possible to compete well with the best hypothesis in H that has adequate\nsupport under \u03c0 (even though the data are not generated from \u03c0).\nTheorem 3.2. Let \u02c6\u03c0 be any function from X to distributions over actions A. Let H be any set of deterministic\npolicies. 
Define $\\tilde{H} = \\{h \\in H \\mid \\pi(h(x)|x) > \\tau, \\forall x \\in X\\}$ and $\\tilde{h} = \\mathrm{argmax}_{h \\in \\tilde{H}}\\{V^h\\}$.\nLet $\\hat{h} = \\mathrm{argmax}_{h \\in H}\\{\\hat{V}^h_{\\hat{\\pi}}\\}$ be the hypothesis that maximizes the empirical value estimator defined\nin Equation 2. Then, with probability at least 1 \u2212 \u03b4,\n\n$$V^{\\hat{h}} \\geq V^{\\tilde{h}} - \\frac{2}{\\tau}\\left(\\sqrt{E_x[\\mathrm{reg}(x)]} + \\sqrt{\\frac{\\ln(2|H|/\\delta)}{2T}}\\right), \\qquad (6)$$\n\nwhere reg(x) is defined, with respect to \u03c0, in Equation 5.\n\nThe proof of Theorem 3.2 relies on the lower-bound property of our estimator (the left-hand side\nof the inequality stated in Lemma 3.1).\nIn other words, if H contains a very good policy that has\nlittle support under \u03c0, we will not be able to detect that by our estimator. On the other hand, our\nestimation is safe in the sense that we will never drastically overestimate the value of any policy in H.\nThis \u201cunderestimate, but don\u2019t overestimate\u201d property is critical to the application of optimization\ntechniques, as it implies we can use an unrestrained learning algorithm to derive a warm start policy.\n\n4 Empirical Evaluation\n\nWe evaluated our method on two real-world datasets obtained from Yahoo!. The first dataset consists\nof uniformly random exploration data, from which an unbiased estimate of any policy can be\nobtained. This dataset is thus used to verify accuracy of our offline evaluator (2). The second dataset\nthen demonstrates how policy optimization can be done from nonrandom offline data.\n\n4.1 Experiment I\n\nThe first experiment involves news article recommendation in the \u201cToday Module\u201d, on the Yahoo!\nfront page. For every user visit, this module displays a high-quality news article out of a small\ncandidate pool, which is hand-picked by human editors. The pool contains about 20 articles at\nany given time. 
We seek to maximize the click probability (aka click-through rate, or CTR) of the\nhighlighted article. This problem is modeled as a contextual bandit problem, where the context\nconsists of both user and article information, the arms correspond to articles, and the reward of a\ndisplayed article is 1 if there is a click and 0 otherwise. Therefore, the value of a policy is exactly\nits overall CTR. To protect business-sensitive information, we only report normalized CTR (nCTR)\nwhich is defined as the ratio of the true CTR and the CTR of a random policy.\nOur dataset, denoted D_0, was collected from real traffic on the Yahoo! front page during a two-week\nperiod in June 2009. It contains T = 64.7M events in the form of triples (x, a, r), where\nthe context x contains user/article features, arm a was chosen uniformly at random from a dynamic\ncandidate pool A, and r is a binary reward signal indicating whether the user clicked on a. Since\nactions are chosen randomly, we have \u02c6\u03c0(a|x) = \u03c0(a|x) \u2261 1/|A| and reg(x) \u2261 0. Consequently,\nLemma 3.1 implies $E[\\hat{V}^h_{\\hat{\\pi}}] = V^h$ provided \u03c4 < 1/|A|. Furthermore, a straightforward application\nof Hoeffding\u2019s inequality guarantees that $\\hat{V}^h_{\\hat{\\pi}}$ concentrates to V^h at the rate of O(1/\u221aT) for any\npolicy h, which is also verified empirically [9]. Given the size of our dataset, therefore, we used this\ndataset to calculate $\\hat{V}_0 = \\hat{V}^h_{\\hat{\\pi}}$ using \u02c6\u03c0(a|x) = 1/|A| in (2). The result $\\hat{V}_0$ was then treated as \u201cground\ntruth\u201d, with which we can evaluate how accurate the offline evaluator (2) is when non-random log\ndata are used instead.\n\nTo obtain non-random log data, we ran the LinUCB algorithm using the offline bandit simulation\nprocedure, both from [8], on our random log data D_0 and recorded events (x, a, r) for which LinUCB\nchose arm a for context x. 
Note that \u03c0 is a deterministic learning algorithm, and may choose\ndifferent arms for the same context at different timesteps. We call this subset of recorded events D_\u03c0.\nIt is known that the set of recorded events has the same distribution as if we ran LinUCB on real\nuser visits to the Yahoo! front page. We used D_\u03c0 as non-random log data for evaluation.\nTo define the policy h for evaluation, we used D_0 to estimate each article\u2019s overall CTR across all\nusers, and then h was defined as selecting the article with the highest estimated CTR.\nWe then evaluated h on D_\u03c0 using the offline evaluator (2). Since the set A of articles changes over\ntime (with news articles being added and old articles retiring), \u03c0(a|x) is very small due to the large\nnumber of articles over the two-week period, resulting in large variance. To resolve this problem,\nwe split the dataset D_\u03c0 into subsets so that in each subset the candidate pool remains constant,^3 and\nthen estimate \u03c0(a|x) for each subset separately using ridge regression on features x. We note that\nmore advanced conditional probability estimation techniques can be used.\n\nFigure 1 plots $\\hat{V}^h_{\\hat{\\pi}}$ with varying \u03c4 against the ground truth $\\hat{V}_0$. As expected, as \u03c4 becomes larger,\nour estimate can become more (downward) biased. For a large range of \u03c4 values, our estimates\nare reasonably accurate, suggesting the usefulness of our proposed method. In contrast, a naive\napproach, which assumes \u03c0(a|x) = 1/|A|, gives a very poor estimate of 2.4.\nFor extremely small values of \u03c4, however, there appears to be a consistent trend of over-estimating\nthe policy value. 
This is due to the fact that negative moments of a positive random variable are\noften larger than the corresponding moments of its expectation [7].\n\nNote that the logging policy we used, \u03c0, violates one of the assumptions used to prove Lemma 3.1,\nnamely that the exploration policy at timestep t not be dependent on an earlier event. Our offline\nevaluator is accurate in this setting, which suggests that the assumption may be relaxable in practice.\n\n^3 We could do so because we know A for every event in D_0.\n\nFigure 1: Accuracy of offline evaluator with varying \u03c4 values. [Plot of nCTR (vertical axis, 1.3 to 1.7) against \u03c4 (horizontal axis, 1E\u22124 to 1E\u22121), comparing the offline estimate with the ground truth.]\n\n4.2 Experiment II\n\nFigure 2: Results of various algorithms on the ad display dataset. Note these numbers were computed\nusing a not-necessarily-uniform sample of data.\n\nMethod   \u03c4     Estimate  Interval\nLearned  0.01  0.0193    [0.0187, 0.0206]\nRandom   0.01  0.0154    [0.0149, 0.0166]\nLearned  0.05  0.0132    [0.0129, 0.0137]\nRandom   0.05  0.0111    [0.0109, 0.0116]\nNaive    0.05  0.0       [0, 0.0071]\n\nIn the second experiment, we investigate our approach to the warm-start problem. The dataset was\nprovided by Yahoo!, covering a period of one month in 2008. The data are comprised of logs of\nevents (x, a, y), where each event represents a visit by a user to a particular web page x, from a set\nof web pages X. From a large set of advertisements A, the commercial system chooses a single ad\na for the topmost, or most prominent position. It also chooses additional ads to display, but these\nwere ignored in our test. The output y is an indicator of whether the user clicked on the ad or not.\nThe total number of ads in the data set is approximately 880,000. The training data consist of 35\nmillion events. 
The test data contain 19 million events occurring after the events in the training data.\nThe total number of distinct web pages is approximately 3.4 million.\nWe trained a policy h to choose an ad, based on the current page, to maximize the probability of\nclick. For the purposes of learning, each ad and page was represented internally as a sparse high-dimensional\nfeature vector. The features correspond to the words that appear in the page or ad,\nweighted by the frequency with which they appear. Each ad contains, on average, 30 ad features and\neach page, approximately 50 page features. The particular form of f was linear over features of its\ninput (x, a).^4\nThe particular policy that was optimized had an argmax form: $h(x) = \\mathrm{argmax}_{a \\in C(X)}\\{f(x, a)\\}$,\nwith a crucial distinction from previous approaches in how f(x, a) was trained. Here f : X \u00d7 A \u2192\n[0, 1] is a regression function that is trained to estimate probability of click, and C(X) = {a \u2208\nA | \u02c6\u03c0(a|x) > 0} is a set of feasible ads.\nThe training samples were of the form (x, a, y), where y = 1 if the ad a was clicked after being\nshown on page x or y = 0 otherwise. The regressor f was chosen to approximately minimize\nthe weighted squared loss: $\\frac{(y - f(x,a))^2}{\\max\\{\\hat{\\pi}(a_t|x_t), \\tau\\}}$. Stochastic gradient descent was used to minimize the\nsquared loss on the training data.\nDuring the evaluation, we computed the estimator on the test data (x_t, a_t, y_t):\n\n$$\\hat{V}^h_{\\hat{\\pi}} = \\frac{1}{T} \\sum_{t=1}^{T} \\frac{y_t I(h(x_t) = a_t)}{\\max\\{\\hat{\\pi}(a_t|x_t), \\tau\\}}. \\qquad (7)$$\n\nAs mentioned in the introduction, this estimator is biased due to the use of the parameter \u03c4 > 0. 
As\nshown in the analysis of Section 3, this bias typically underestimates the true value of the policy h.\nWe experimented with different thresholds \u03c4 and parameters of our learning algorithm.^5 Results are\nsummarized in the table in Figure 2.\n\nThe Interval column is computed using the relative entropy form of the Chernoff bound with\n\u03b4 = 0.05 which holds under the assumption that variables, in our case the samples used in the\ncomputation of the estimator (Equation 7), are IID. Note that this computation is slightly complicated\nbecause the range of the variables is [0, 1/\u03c4] rather than [0, 1] as is typical. This is handled by\nrescaling by \u03c4, applying the bound, and then rescaling the results by 1/\u03c4.\n\n^4 Technically the feature vector that the regressor uses is the Cartesian product of the page and ad vectors.\n^5 For stochastic gradient descent, we varied the learning rate over 5 fixed numbers (0.2, 0.1, 0.05, 0.02, 0.01)\nusing 1 pass over the data. We report on the test results for the value with the best training error.\n\nThe \u201cRandom\u201d policy is the policy that chooses randomly from the set of feasible ads:\nRandom(x) = a \u223c UNIF(C(X)), where UNIF(\u00b7) denotes the uniform distribution.\nThe \u201cNaive\u201d policy corresponds to the theoretically flawed supervised learning approach detailed in\nthe introduction. The evaluation of this policy is quite expensive, requiring one evaluation per ad\nper example, so the size of the test set is reduced to 8373 examples with a click, which reduces the\nsignificance of the results. We bias the results towards the naive policy by choosing the chronologically\nfirst events in the test set (i.e. the events most similar to those in the training set). Nevertheless,\nthe naive policy receives 0 reward, which is significantly less than all other approaches. 
A possible\nfear with the evaluation here is that the naive policy is always finding good ads that simply weren\u2019t\nexplored. A quick check shows that this is not correct: the naive argmax simply makes implausible\nchoices. Note that we report only evaluation against \u03c4 = 0.05, as the evaluation against \u03c4 = 0.01 is\nnot significant, although the reward obviously remains 0.\nThe \u201cLearned\u201d policies do depend on \u03c4. As suggested by Theorem 3.2, as \u03c4 is decreased, the\neffective set of hypotheses we compete with is increased, thus allowing for better performance of\nthe learned policy. Indeed, the estimates for both the learned policy and the random policy improve\nwhen we decrease \u03c4 from 0.05 to 0.01.\nThe empirical click-through rate on the test set was 0.0213, which is slightly larger than the estimate\nfor the best learned policy. However, this number is not directly comparable since the estimator\nprovides a lower bound on the true value of the policy due to the bias introduced by a nonzero \u03c4 and\nbecause any deployed policy chooses from only the set of ads which are available to display rather\nthan the set of all ads which might have been displayable at other points in time.\n\nThe empirical results are generally consistent with the theoretical approach outlined here: they provide\na consistently pessimal estimate of policy value which nevertheless has sufficient dynamic range\nto distinguish learned policies from random policies, learned policies over larger spaces (smaller\n\u03c4) from smaller spaces (larger \u03c4), and the theoretically unsound naive approach from sounder approaches\nwhich choose amongst the explored space of ads. 
It would be interesting future work\nto compare our approach to a full-fledged production online advertising system.\n\n5 Conclusion\n\nWe stated, justified, and evaluated theoretically and empirically the first method for solving the\nwarm-start problem for exploration from logged data with controlled bias and estimation. This problem\nis of obvious interest to applications for internet companies that recommend content (such as ads,\nsearch results, news stories, etc.) to users.\n\nHowever, we believe this may also be of interest for other application domains within machine\nlearning. For example, in reinforcement learning, the standard approach to offline policy evaluation\nis based on importance weighted samples [3, 11]. The basic results stated here could be applied to\nRL settings, eliminating the need to know the probability of a chosen action explicitly and allowing an\nRL agent to learn from external observations of other agents.\n\nReferences\n\n[1] Peter Auer, Nicol\u00f2 Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48\u201377, 2002.\n\n[2] D. Horvitz and D. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 1952.\n\n[3] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In NIPS, 2000.\n\n[4] Diane Lambert and Daryl Pregibon. More bang for their bucks: Assessing new features for online advertisers. In ADKDD 2007, 2007.\n\n[5] John Langford, Alexander L. Strehl, and Jenn Wortman. Exploration scavenging. In ICML-08: Proceedings of the 25th International Conference on Machine Learning, 2008.\n\n[6] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. 
In Advances in Neural Information Processing Systems 20, pages 817\u2013824, 2008.\n\n[7] Robert A. Lew. Bounds on negative moments. SIAM Journal on Applied Mathematics, 30(4):728\u2013731, 1976.\n\n[8] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on World Wide Web (WWW-10), pages 661\u2013670, 2010.\n\n[9] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM-11), 2011.\n\n[10] Art Owen and Yi Zhou. Safe and effective importance sampling. Journal of the American Statistical Association, 95:135\u2013143, 1998.\n\n[11] Doina Precup, Rich Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In ICML, 2000.\n", "award": [], "sourceid": 775, "authors": [{"given_name": "Alex", "family_name": "Strehl", "institution": null}, {"given_name": "John", "family_name": "Langford", "institution": null}, {"given_name": "Lihong", "family_name": "Li", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}]}