{"title": "Contextual Bandits with Cross-Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9679, "page_last": 9688, "abstract": "In the classical contextual bandits problem, in each round $t$, a learner observes some\ncontext $c$, chooses some action $a$ to perform, and receives some reward $r_{a,t}(c)$. We consider the variant of this problem where in addition to receiving the reward $r_{a,t}(c)$, the learner also learns the values of $r_{a,t}(c')$ for all other contexts $c'$; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions (in this setting the context is the decision maker's private valuation for each auction). We call this problem the contextual bandits problem with cross-learning.\n \nThe best algorithms for the classical contextual bandits problem achieve $\\tilde{O}(\\sqrt{CKT})$ regret against all stationary policies, where $C$ is the number of contexts, $K$ the number of actions, and $T$ the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on $C$ and achieve regret $\\tilde{O}(\\sqrt{KT})$ (when contexts are stochastic with known distribution), $\\tilde{O}(K^{1/3}T^{2/3})$ (when contexts are stochastic with unknown distribution), and $\\tilde{O}(\\sqrt{KT})$ (when contexts are adversarial but rewards are stochastic). 
We simulate our algorithms on real auction data from an ad exchange running first-price auctions (showing that they outperform traditional contextual bandit algorithms).", "full_text": "Contextual Bandits with Cross-Learning

Santiago Balseiro (Columbia Business School, srb2155@columbia.edu)
Negin Golrezaei (MIT Sloan School of Management, golrezae@mit.edu)
Mohammad Mahdian (Google Research, mahdian@google.com)
Vahab Mirrokni (Google Research, mirrokni@google.com)
Jon Schneider (Google Research, jschnei@google.com)

Abstract

In the classical contextual bandits problem, in each round t, a learner observes some context c, chooses some action a to perform, and receives some reward r_{a,t}(c). We consider the variant of this problem where in addition to receiving the reward r_{a,t}(c), the learner also learns the values of r_{a,t}(c′) for all other contexts c′; i.e., the rewards that would have been achieved by performing that action under different contexts. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions, which has gained a lot of attention lately as many platforms have switched to running first-price auctions. We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classical contextual bandits problem achieve Õ(√(CKT)) regret against all stationary policies, where C is the number of contexts, K the number of actions, and T the number of rounds. We demonstrate algorithms for the contextual bandits problem with cross-learning that remove the dependence on C and achieve regret Õ(√(KT)).
We simulate our algorithms on real auction data from an ad exchange running first-price auctions (showing that they outperform traditional contextual bandit algorithms).

1 Introduction

In the contextual bandits problem, a learner repeatedly observes some context, takes some action depending on that context, and receives some reward depending on both the context and the action. The learner's goal is to maximize their total reward over some number of rounds. The contextual bandits problem is a fundamental problem in online learning: it is a simplified (yet analyzable) variant of reinforcement learning, and it captures a large class of repeated decision problems. In addition, the algorithms developed for the contextual bandits problem have been successfully applied in domains like ad placement, news recommendation, and clinical trials [12, 16, 23].

Ideally, one would like an algorithm for the contextual bandits problem which performs approximately as well as the best stationary strategy (i.e., the best fixed mapping from contexts to actions). This can be accomplished by running a separate instance of some low-regret algorithm for the non-contextual bandits problem (e.g., EXP3) for every context. This algorithm achieves regret Õ(√(CKT)), where C is the number of contexts, K the number of actions, and T the number of rounds. This bound can be shown to be tight [6].
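As a concrete illustration (not taken from the paper), the per-context reduction described above can be sketched as follows: a standard EXP3 learner, plus a wrapper that lazily spawns one independent instance per observed context. The class names and the particular learning-rate choice are my own assumptions.

```python
import math
import random

class EXP3:
    """Standard EXP3 for a K-armed adversarial bandit."""
    def __init__(self, n_arms, horizon):
        self.k = n_arms
        # one common tuning of the exploration/learning rate for a known horizon
        self.gamma = min(1.0, math.sqrt(self.k * math.log(self.k) /
                                        ((math.e - 1) * horizon)))
        self.weights = [1.0] * n_arms

    def probs(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def sample(self):
        return random.choices(range(self.k), weights=self.probs())[0]

    def update(self, arm, reward):
        # importance-weighted unbiased estimate of the pulled arm's reward
        p = self.probs()[arm]
        self.weights[arm] *= math.exp(self.gamma * (reward / p) / self.k)

class SEXP3:
    """The naive reduction: one independent EXP3 instance per context."""
    def __init__(self, n_arms, horizon):
        self.n_arms, self.horizon = n_arms, horizon
        self.instances = {}

    def act(self, context):
        inst = self.instances.setdefault(context, EXP3(self.n_arms, self.horizon))
        return inst.sample()

    def update(self, context, arm, reward):
        self.instances[context].update(arm, reward)
```

Since each context gets its own learner that sees only a 1/C fraction of the rounds, the regret of this reduction scales with √C, which is exactly the dependence the cross-learning algorithms in this paper remove.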
Since the number of contexts can be very large, these algorithms can be impractical to use, and much recent research on the contextual bandits problem instead aims to achieve low regret with respect to some smaller set of policies [2, 15, 4].

However, some settings possess additional structure between the rewards and contexts which allows one to achieve less than Õ(√(CKT)) regret while still competing with the best stationary strategy.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we look at a specific type of structure we call cross-learning between contexts that is particularly common in strategic settings. In variants of the contextual bandits problem with this structure, playing an action a in some context c at round t not only reveals the reward r_{a,t}(c) of playing this action in this context (which the learner receives), but also reveals to the learner the rewards r_{a,t}(c′) for every other context c′. Some settings where this structure appears include:

• Bidding in non-truthful auctions: Consider a bidder trying to learn how to bid in a repeated non-truthful auction (such as a first-price auction). Every round, the bidder receives a (private, independent) value for the current item, and based on this must submit a bid for the item. The auctioneer then collects the bids from all participants, and decides whether to allocate the item to our bidder, and if so, how much to charge the bidder. This can be seen as a contextual bandits problem for the bidder where the context c is the bidder's value for the item, the action a is their bid, and their reward is their net utility from the auction: 0 if they do not win, and their value for the item minus their payment p if they do win.
Note that this problem also allows for cross-learning between contexts: the net utility r_{a,t}(c′) that would have been received if they had value c′ instead of value c is just (c′ − p) · 1(win item), which is computable from the outcome of the auction. The problem of bidding in non-truthful auctions has gained a lot of attention recently, as many online advertising platforms have recently switched from running second-price to first-price auctions.¹ In a first-price auction, the highest bidder is the winner and pays their bid (as opposed to second-price auctions, where the winner pays the second-highest bid). First-price auctions are non-truthful mechanisms, as bidders have incentives to shade bids so that they enjoy a positive utility when they win [22].

• Multi-armed bandits with exogenous costs: Consider a multi-armed bandit problem where at the beginning of each round t, a cost s_{i,t} of playing arm i at this round is publicly announced. That is, choosing arm i this round results in a net reward of r_{i,t} − s_{i,t}. This captures settings where, for example, a buyer must choose every round to buy one of K substitutable goods: he is aware of the price of each good (which might change from round to round) but must learn over time the utility each type of good brings him. This is a contextual bandits problem where the context in round t is the K costs s_{i,t} at this time. Cross-learning between contexts is present in this setting: given the net utility of playing action i with a given up-front cost s_i, one can infer the net utility of playing i with any other up-front cost s′_i.

• Dynamic pricing with variable cost: Consider a dynamic pricing problem where a firm offers a service (or sells a product) to a stream of customers who arrive sequentially over time.
Consumers have private and independent willingness-to-pay, and the cost of serving a customer is exogenously given and customer-dependent. After observing the cost, the firm decides what price to offer to the consumer, who decides whether to accept the service at the offered price. The optimal price for each consumer is contingent on the cost; for example, when demand is relatively inelastic, consumers that are more costly to serve should be quoted higher prices. This extends dynamic pricing problems to cases where the firm has exogenous costs (see, e.g., [8] for an overview of dynamic pricing problems). This is a special case of the multi-armed bandits with exogenous costs problem, and hence an instance of contextual bandits with cross-learning.

• Sleeping bandits: Consider the following variant of "sleeping bandits", where there are K arms and in each round some subset S_t of these arms is awake. The learner can play any arm and observe its reward, but only receives this reward if they play an awake arm. This problem was originally proposed in [13], where one of the motivating applications is e-commerce settings where not all sellers or items (and hence "arms") might be available every round. This is a contextual bandits problem where the context is the set S_t of awake arms. Again, cross-learning between contexts is present in this setting: given the observation of the reward of arm i, one can infer the received reward for any context S′_t by just checking whether i ∈ S′_t.

¹ See https://www.blog.google/products/admanager/simplifying-programmatic-first-price-auctions-google-ad-manager/

• Repeated Bayesian games with private types: Consider a player participating in a repeated Bayesian game with private, independent types.
Each round the player receives some type for the current game, performs some action, and receives some utility (which depends on their type, their action, and the other players' actions). Again, this can be viewed as a contextual bandits problem where types are contexts, actions are actions, and utilities are rewards; once again, this problem allows for cross-learning between contexts (as long as the player can compute their utility based on their type and all players' actions).

Note that in many of these settings, the number of possible contexts C can be huge: exponential in K, or uncountably infinite. This makes the naive O(√(CKT))-regret algorithm undesirable in these settings. We show that in contextual bandits problems with cross-learning, it is possible to design algorithms which completely remove the dependence on the number of contexts C in their regret bound. We consider both settings where the contexts are generated stochastically (from some distribution D that may or may not be known to the learner) and settings where the contexts are chosen adversarially. Similarly, we also consider settings where the rewards are generated stochastically and settings where they are chosen adversarially.
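To make the first-price-auction example above concrete, here is a minimal sketch (my own illustration, not code from the paper) of the cross-learning computation r_{a,t}(c′) = (c′ − p) · 1(win item): from a single observed auction outcome, the bidder can fill in the utility it would have received under every alternative private value.

```python
def cross_learned_utilities(bid, payment, won, values):
    """Given the outcome of one first-price auction (whether we won, and the
    payment p we made -- in a first-price auction, payment == bid on a win),
    return the utility we WOULD have received for each possible private value
    v, had the outcome been the same: (v - p) if we won, else 0."""
    return {v: (v - payment) if won else 0.0 for v in values}
```

For example, after winning with a bid of 0.7, `cross_learned_utilities(0.7, 0.7, True, grid)` gives the counterfactual reward for every value on the grid in one shot, which is exactly the extra feedback the cross-learning model assumes. (This holds only when the bidder's value is independent of the other bids, a caveat the paper returns to in Section 4.)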
Our results include:

• Stochastic rewards, stochastic or adversarial contexts: We design an algorithm called UCB1.CL with regret Õ(√(KT)).

• Adversarial rewards, stochastic contexts with known distribution: We design an algorithm called EXP3.CL with regret Õ(√(KT)).

• Adversarial rewards, stochastic contexts with unknown distribution: We design an algorithm called EXP3.CL-U with regret Õ(K^{1/3}T^{2/3}).

• Lower bound for adversarial rewards, adversarial contexts: We show that when both rewards and contexts are controlled by an adversary, any algorithm must obtain regret at least Ω̃(√(CKT)).

All of these algorithms are easy to implement, in the sense that they can be obtained via simple modifications of existing multi-armed bandit algorithms like EXP3 and UCB1, and efficient, in the sense that all algorithms run in time at most O(C + K) per round (and for many of the settings mentioned above, this can be further improved to O(K) time per round). Our main technical contribution is our analysis of UCB1.CL, which requires arguing that UCB1 can effectively use the information from cross-learning despite it being drawn from a distribution that differs from the desired exploration distribution. We accomplish this by constructing a linear program whose value upper bounds (one of the terms in) the regret of UCB1.CL, and bounding the value of this linear program.

We then apply our results to some of the applications listed above. In each case, our algorithms obtain optimal regret bounds with asymptotically less regret than a naive application of contextual bandits algorithms. In particular:

• For the problem of learning to bid in a first-price auction, standard contextual bandit algorithms get regret O(T^{3/4}). Our algorithms achieve regret O(T^{2/3}).
This is optimal even when there is only a single context (value).

• For the problem of multi-armed bandits with exogenous costs, standard contextual bandit algorithms get regret O(T^{(K+1)/(K+2)} K^{1/(K+2)}). Our algorithms get regret Õ(√(KT)), which is tight.

• For our variant of sleeping bandits, standard contextual bandit algorithms get regret Õ(√(2^K KT)). Our algorithms get regret Õ(√(KT)), which is tight.

Finally, we test the performance of these algorithms on real auction data from a first-price ad exchange, showing that our algorithms outperform traditional bandit algorithms.

1.1 Related Work

For a general overview of research on the multi-armed bandit problem, we refer the reader to the survey by Bubeck and Cesa-Bianchi [6]. Our algorithms build off of pre-existing algorithms in the bandits literature, such as EXP3 [2] and UCB1 [20, 14]. Contextual bandits were first introduced under that name in [15], although similar ideas were present in previous works (e.g., the EXP4 algorithm was proposed in [2]).

One line of research related to ours studies bandit problems under other structural assumptions on the problem instances which allow for improved regret bounds. Slivkins [21] studies a setting where contexts and actions belong to a joint metric space, and context/action pairs that are close to each other give similar rewards, thus allowing for some amount of "cross-learning". Several works [17, 1] study a partial-feedback variant of the (non-contextual) multi-armed bandit problem where performing some action provides some information on the rewards of performing other actions (thus interpolating between the bandits and experts settings). Our setting can be thought of as a contextual version of this variant. However, since the learner cannot choose the context each round, these two settings are qualitatively different.
As far as we are aware, the specific problem of contextual bandits with cross-learning between contexts has not appeared in the literature before.

Recently there has been a surge of interest in applying methods from online learning and bandits to auction design. While the majority of the work in this area has been from the perspective of the auctioneer [19, 18, 7, 9, 11] (learning how to design an auction over time based on bidder behavior), some recent work studies this problem from the perspective of a buyer learning how to bid [24, 10, 5]. In particular, [24] studies the problem of learning to bid in a first-price auction over time, but where the bidder's value remains constant (so there is no context).

2 Model and Preliminaries

2.1 Multi-armed bandits

In the classic multi-armed bandit problem, a learner chooses one of K arms per round over the course of T rounds. On round t, the learner receives some reward r_{i,t} ∈ [0, 1] for pulling arm i (where the rewards r_{i,t} may be chosen adversarially). The learner's goal is to maximize their total reward.

Let I_t denote the arm pulled by the decision maker at round t. The regret of an algorithm A for the learner is the random variable Reg(A) = max_i Σ_{t=1}^T r_{i,t} − Σ_{t=1}^T r_{I_t,t}. We say an algorithm A for the multi-armed bandit problem is δ-low-regret if E[Reg(A)] ≤ δ (where the expectation is taken over the randomness of A). We say an algorithm A is low-regret if it is δ-low-regret for some δ = o(T). There exist simple multi-armed bandit algorithms which are Õ(√(KT))-low-regret (e.g., EXP3 when rewards are adversarial, and UCB1 when rewards are stochastic).

2.2 Contextual bandits

In our model, we consider a contextual bandits problem.
In the contextual bandits problem, in each round t the learner is additionally provided with a context c_t, and the learner now receives reward r_{i,t}(c) if he pulls arm i on round t while having context c. The contexts c_t are either chosen adversarially at the beginning of the game or drawn independently each round from some distribution D. Similarly, the rewards r_{i,t}(c) are either chosen adversarially or each independently drawn from some distribution F_i(c). We assume, as is standard, that r_{i,t}(c) is always bounded in [0, 1].

In the contextual bandits setting, we now define the regret of an algorithm A as regret against the best stationary policy π; that is, Reg(A) = max_{π:[C]→[K]} Σ_{t=1}^T r_{π(c_t),t}(c_t) − Σ_{t=1}^T r_{I_t,t}(c_t), where I_t is the arm pulled by A on round t. The definition of the best stationary policy π depends slightly on how contexts and rewards are chosen:

• When rewards are stochastic (r_{i,t}(c) drawn independently from F_i(c) with mean μ_i(c)), we define π(c) = argmax_i μ_i(c).

• When rewards are adversarial but contexts are stochastic, we define π to be the stationary policy which maximizes E_{c_t∼D}[Σ_t r_{π(c_t),t}(c_t)].

• When both rewards and contexts are adversarial, we define π to be the stationary policy which maximizes Σ_t r_{π(c_t),t}(c_t).

These choices are unified in the following way: in all of the above cases, π is the best stationary policy in expectation for someone who knows all the decisions of the adversary and details of the system ahead of time, but not the randomness in the instantiations of contexts/rewards from distributions. This matches commonly studied notions of regret in the contextual bandits literature; see Appendix A.1 of the Full Version [3] for further discussion.
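The stochastic-rewards benchmark above (π(c) = argmax_i μ_i(c)) can be made concrete with a small helper (my own toy illustration, not from the paper) that, given the mean-reward table, computes the best stationary policy and the expected regret of a realized action sequence against it.

```python
def best_stationary_policy(mu):
    """mu[c][i] = mean reward of arm i in context c (stochastic-reward case).
    The benchmark policy plays argmax_i mu[c][i] in every round with context c."""
    return {c: max(range(len(arms)), key=arms.__getitem__)
            for c, arms in mu.items()}

def expected_regret(mu, contexts, pulls):
    """Expected regret of a realized sequence of pulls against that policy:
    sum over rounds of the gap between the best arm's mean and the pulled
    arm's mean in the realized context."""
    pi = best_stationary_policy(mu)
    return sum(mu[c][pi[c]] - mu[c][a] for c, a in zip(contexts, pulls))
```

Note that the benchmark is allowed a different arm per context, which is what makes the naive reduction pay the √C factor: it must learn the argmax separately in every context.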
As before, we say an algorithm is δ-low-regret if E[Reg(A)] ≤ δ, and say an algorithm is low-regret if it is δ-low-regret for some δ = o(T).

There is a simple way to construct a low-regret algorithm A′ for the contextual bandits problem from a low-regret algorithm A for the classic bandits problem: simply maintain a separate instance of A for every different context c. In the contextual bandits literature, this is sometimes referred to as the S-EXP3 algorithm when A is EXP3 [6]. This algorithm is Õ(√(CKT))-low-regret. We define the S-UCB1 algorithm similarly, which is also Õ(√(CKT))-low-regret when rewards are generated stochastically.

We consider a variant of the contextual bandits problem we call contextual bandits with cross-learning. In this variant, whenever the learner pulls arm i in round t while having context c and receives reward r_{i,t}(c), they also learn the value of r_{i,t}(c′) for all other contexts c′. We define the notions of regret and low-regret similarly for this problem.

3 Cross-learning between contexts

In this section, we present two algorithms for the contextual bandits problem with cross-learning: UCB1.CL, for stochastic rewards and adversarial contexts (Section 3.1), and EXP3.CL, for adversarial rewards and stochastic contexts (Section 3.2). Then, in Section 3.3, we show that it is impossible to achieve regret better than Õ(√(CKT)) when both rewards and contexts are controlled by an adversary (in particular, when both rewards and contexts are adversarial, cross-learning may not be beneficial at all).

3.1 Stochastic rewards

In this section we present an O(√(KT log K)) algorithm for the contextual bandits problem with cross-learning in the stochastic reward setting: i.e., every reward r_{i,t}(c) is drawn independently from an unknown distribution F_i(c) supported on [0, 1].
Importantly, this algorithm works even when the contexts are chosen adversarially, unlike our algorithms for the adversarial reward setting. We call this algorithm UCB1.CL (Algorithm 1).

Algorithm 1 O(√(KT log K))-regret algorithm (UCB1.CL) for the contextual bandits problem with cross-learning where rewards are stochastic and contexts are adversarial.

1: Define the function ω(s) = √((2 log T)/s).
2: Pull each arm i ∈ [K] once (pulling arm i in round i).
3: Maintain a counter τ_{i,t}, equal to the number of times arm i has been pulled up to round t (so τ_{i,K} = 1 for all i).
4: For all i ∈ [K] and c ∈ [C], initialize variable σ_{i,K}(c) to r_{i,i}(c). Write r̄_{i,t}(c) as shorthand for σ_{i,t}(c)/τ_{i,t}.
5: for t = K + 1 to T do
6:   Receive context c_t.
7:   Let I_t be the arm i which maximizes r̄_{i,t−1}(c_t) + ω(τ_{i,t−1}).
8:   Pull arm I_t, receiving reward r_{I_t,t}(c_t), and learning the value of r_{I_t,t}(c) for all c.
9:   for each c in [C] do
10:     Set σ_{I_t,t}(c) = σ_{I_t,t−1}(c) + r_{I_t,t}(c).
11:   end for
12:   Set τ_{I_t,t} = τ_{I_t,t−1} + 1.
13: end for

The UCB1.CL algorithm is a straightforward generalization of S-UCB1; both algorithms maintain a mean and upper confidence bound for each action in each context, and always choose the action with the highest upper confidence bound (the difference being that UCB1.CL uses cross-learning to update the appropriate means and confidence bounds for all contexts each round). The analysis of UCB1.CL, however, requires new ideas to deal with the fact that the observations of rewards may be drawn from a very different distribution than the desired exploration distribution.

Very roughly, the analysis is structured as follows. Since rewards are stochastic, in every context c, there is a "best arm" i*(c) that the optimal policy always plays.
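A minimal Python sketch of UCB1.CL (Algorithm 1), under my own simplifying assumptions: the context set is finite and known upfront, and the class/method names are illustrative. The key structural point it captures is that there is one empirical-mean table per (arm, context) but only a single pull counter per arm, since every pull yields a sample of that arm's reward in every context.

```python
import math

class UCB1CL:
    """Sketch of UCB1.CL: UCB1 with cross-learning across contexts."""
    def __init__(self, n_arms, contexts, horizon):
        self.n_arms = n_arms
        self.contexts = list(contexts)
        self.T = horizon
        # sigma_{i}(c): running reward sums, one per (arm, context)
        self.sums = {(i, c): 0.0 for i in range(n_arms) for c in self.contexts}
        # tau_i: a SINGLE counter per arm, shared by all contexts
        self.pulls = [0] * n_arms

    def bonus(self, i):
        # omega(s) = sqrt(2 log T / s), the confidence-bound width
        return math.sqrt(2.0 * math.log(self.T) / self.pulls[i])

    def act(self, context):
        # pull each arm once first, then be greedy w.r.t. the UCB index
        for i in range(self.n_arms):
            if self.pulls[i] == 0:
                return i
        return max(range(self.n_arms),
                   key=lambda i: self.sums[(i, context)] / self.pulls[i]
                                 + self.bonus(i))

    def update(self, arm, rewards_all_contexts):
        # cross-learning: one pull reveals the arm's reward in EVERY context
        for c, r in rewards_all_contexts.items():
            self.sums[(arm, c)] += r
        self.pulls[arm] += 1
```

Because the pull counter is shared across contexts, every context's confidence bound shrinks at the full rate 1/√(τ_i), regardless of which contexts the adversary actually presents; this is the mechanism that removes the C dependence.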
Every other arm i is some amount Δ_i(c) worse in expectation than the best arm. After observing this arm m_i(c) = O(log(T)/Δ_i(c)²) times, one can be confident that this arm is not the best arm. We can decompose the regret into the regret incurred "before" and "after" the algorithm is confident that an arm is not optimal in a specific context. The regret "after" can be bounded using standard techniques from the bandit literature. Our main contribution is the bound of the regret "before."

Fix an arm i and let X_i(c) be the number of times the algorithm pulls arm i in context c before pulling arm i a total of m_i(c) times across all contexts. Because once arm i is pulled m_i(c) times we are confident about the optimality of pulling that arm in context c, we only need to control the number of pulls before m_i(c). Therefore, the regret "before" of arm i is roughly Σ_c X_i(c)Δ_i(c).

We control the regret "before" by setting up a linear program in the variables X_i(c) with objective Σ_{c,i} X_i(c)Δ_i(c). Because X_i(c) counts all pulls of arm i before m_i(c), we have that X_i(c) ≤ m_i(c). This inequality, while valid, does not lead to a tight bound. To obtain a tighter inequality we first sort the contexts in terms of the samples needed to learn whether an arm is optimal, i.e., in increasing order of m_i(c). Because a different context is realized in every round, we can consider the inequality Σ_{c′ : m_i(c′) ≤ m_i(c)} X_i(c′) ≤ m_i(c), which counts the subset of first m_i(c) pulls of arm i. Bounding the value of this objective (by effectively taking the dual), we can show that the total regret is at most O(√(KT log K)).

Theorem 1 (Regret of UCB1.CL). UCB1.CL (Algorithm 1) has expected regret O(√(KT log K)) for the contextual bandits problem with cross-learning in the setting with stochastic rewards and adversarial contexts.

As a consequence of the proof of Theorem 1, we have the following gap-dependent bound on the regret of UCB1.CL.

Corollary 2 (Gap-dependent Bound for UCB1.CL). Let Δ_min = min_{i,c} μ*(c) − μ_i(c) (where μ*(c) = max_i μ_i(c)). Then UCB1.CL (Algorithm 1) has expected regret of O(K log T / Δ_min) for the contextual bandits problem with cross-learning in the setting with stochastic rewards and adversarial contexts.

3.2 Adversarial rewards and stochastic contexts

We now present an O(√(KT log K)) regret algorithm for the contextual bandits problem with cross-learning when the rewards are adversarially chosen but contexts are stochastically drawn from some distribution D. We call this algorithm EXP3.CL (Algorithm 2). For now, we assume the learner knows the distribution over contexts D.

Algorithm 2 O(√(KT log K))-regret algorithm (EXP3.CL) for the contextual bandits problem with cross-learning and stochastic contexts.

1: Choose α = β = √(log K / (KT)).
2: Initialize K · C weights, one for each pair of action i and context c, letting w_{i,t}(c) be the value of the ith weight for context c at round t. Initially, set all w_{i,0}(c) = 1.
3: for t = 1 to T do
4:   Draw context c_t from D.
5:   For all i ∈ [K] and c ∈ [C], let p_{i,t}(c) = (1 − Kα) · w_{i,t−1}(c) / Σ_{j=1}^K w_{j,t−1}(c) + α.
6:   Sample an arm I_t from the distribution p_t(c_t).
7:   Pull arm I_t, receiving reward r_{I_t,t}(c_t), and learning the value of r_{I_t,t}(c) for all c.
8:   for each c in [C] do
9:     Set w_{I_t,t}(c) = w_{I_t,t−1}(c) · exp(β · r_{I_t,t}(c) / Σ_{c′=1}^C Pr[c′] · p_{I_t,t}(c′)).
10:   end for
11: end for

Both EXP3.CL and S-EXP3 maintain a weight for each action in each context, and update the weights via multiplicative updates by an exponential of an unbiased estimator of the reward. We modify S-EXP3 by changing the unbiased estimator in the update rule to take advantage of the information from cross-learning. To minimize regret, we wish to choose an unbiased estimator with minimal variance (as the expected variance of this estimator shows up in the final regret bound). The new estimator in question is

r̂_{i,t}(c) = r_{i,t}(c) / (Σ_{c′=1}^C Pr_D[c′] · p_{i,t}(c′)) · 1_{I_t = i}.

There are two ways of thinking about this estimator. The first is to note that the denominator of this estimator is exactly the probability of pulling arm i on round t before you learn the realization of c_t (and hence this estimator is unbiased).
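This unbiasedness can be checked numerically with a small sketch (my own illustration; function name and argument layout are assumptions): dividing by the marginal probability of pulling arm i, averaged over the context distribution, makes the estimator average out to the true reward.

```python
def exp3cl_estimator(i, c, reward, pulled_arm, ctx_dist, probs):
    """Cross-learning estimator r-hat_{i,t}(c) with known context distribution D:
    the observed reward of arm i in context c (learned via cross-learning
    whenever arm i is pulled in ANY context), divided by the MARGINAL
    probability of pulling arm i, i.e. sum_{c'} Pr_D[c'] * p_{i,t}(c').
    ctx_dist: dict context -> Pr_D[context]; probs: dict context -> list of
    per-arm pull probabilities in that context."""
    if pulled_arm != i:
        return 0.0
    marginal = sum(ctx_dist[cp] * probs[cp][i] for cp in ctx_dist)
    return reward / marginal
```

Averaging this estimator over the joint draw of (context, pulled arm) recovers the true reward exactly, because the indicator fires with probability equal to the marginal in the denominator.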
The second way is to note that for every context c′, it is possible to construct an estimator of the form

r̂_{i,t}(c) = r_{i,t}(c) / (Pr_D[c′] · p_{i,t}(c′)) · 1_{I_t = i, c_t = c′}.

The estimator used in EXP3.CL is the linear combination of these estimators which minimizes variance (i.e., the estimator obtained from importance sampling over this class of estimators). We can show that the total expected variance of this estimator is on the order of O(√(KT)), independent of C, implying the following regret bound.

Theorem 3. EXP3.CL (Algorithm 2) has regret O(√(TK log K)) for the contextual bandits problem with cross-learning when rewards are adversarial and contexts are stochastic.

Calculating this estimator r̂_{i,t}(c) requires the learner to know the distribution D. What can we do if the learner does not know the distribution D? Unlike distributions of rewards (where the learner must actively choose which reward distribution to receive a sample from), the learner receives exactly one sample from D every round regardless of their action. This suggests the following approach: learn an approximation D̂ to D by observing the context for some number of rounds, and run EXP3.CL using D̂ to compute estimators. Unfortunately, a straightforward analysis of this approach gives regret that scales as T^{2/3} due to the approximation error in D̂.

In Appendix A.2 of the Full Version [3], we design a learning algorithm EXP3.CL-U which achieves regret Õ(K^{1/3}T^{2/3}) even when the distribution D is unknown, by using a much simpler (but higher variance) estimator that does not require knowledge of D to compute. It is an interesting open problem whether it is possible to obtain Õ(√(KT)) regret when D is unknown.

3.3 Adversarial rewards, adversarial contexts

A natural question is whether we can achieve low regret when both the rewards and contexts are chosen adversarially (but where we still can cross-learn between different contexts). A positive answer to this question would subsume the results of the previous sections. Unfortunately, we will show in this section that any learning algorithm for the contextual bandits problem with cross-learning must necessarily incur Ω(√(CKT)) regret (which is achieved by S-EXP3).

We will need the following regret lower bound for the (non-contextual) multi-armed bandits problem.

Lemma 4 (see [2]). There exists a distribution over instances of the multi-armed bandit problem where any algorithm must incur an expected regret of at least Ω(√(KT)).

With this lemma, we can construct the following lower bound for the contextual bandits problem with cross-learning by connecting C independent copies of these hard instances in sequence with one another so that cross-learning between instances is not possible.

Theorem 5. There exists a distribution over instances of the contextual bandit problem with cross-learning where any algorithm must incur a regret of at least Ω(√(CKT)).

4 Empirical evaluation

In this section, we empirically evaluate the performance of our contextual bandit algorithms on the problem of learning how to bid in a first-price auction.

Recall that our cross-learning algorithms rely on cross-learning between contexts being possible: if the outcome of the auction remains the same, the bidder can compute the net utility they would receive given any value they could have had for the item. This is true if the bidder's value for the item is independent of the other bidders' values for the item.
Of course, this assumption (while common in much research in auction theory) does not necessarily hold in practice. We can nonetheless run our contextual bandit algorithms as if this were the case, and compare them to existing contextual bandit algorithms which do not make this assumption.

Our basic experimental setup is as follows. We take existing first-price auction data from a large ad exchange that runs first-price auctions on a significant fraction of traffic, remove one participant (whose true values we have access to), substitute in one of our bandit algorithms for this participant, and replay the auction, hence answering the question "how well would this (now removed) participant do if they instead ran this bandit algorithm?".

We collected anonymized data from 10 million consecutive auctions from this ad exchange, which were then divided into 100 groups of 10^5 auctions. To remove outliers, bids and values above the 90% quantile were removed, and remaining bids/values were normalized to fit in the [0, 1] interval. We then replayed each group of 10^5 auctions, comparing the performance of our three algorithms with cross-learning (EXP3.CL-U, EXP3.CL, and UCB1.CL) and the performance of classic contextual bandits algorithms that take no advantage of cross-learning (S-EXP3 and S-UCB1). Since all algorithms require a discretized set of actions, allowable bids were discretized to multiples of 0.01. Parameters for each of these algorithms (including the level of discretization of contexts for S-EXP3 and S-UCB1) were optimized via cross-validation on a separate data set of 10^5 auctions from the same ad exchange.

Figure 1: Graph of average cumulative regrets of various learning algorithms (y-axis) versus time (x-axis). Grey regions indicate 95% confidence intervals.

The results of this evaluation are summarized in Figure 1, which plots the average cumulative regret of these algorithms over the 10^5 rounds.
The three algorithms which take advantage of cross-learning (EXP3.CL-U, EXP3.CL, and UCB1.CL) significantly outperform the two algorithms which do not (S-EXP3 and S-UCB1). Of these, EXP3.CL-U performs the worst, followed by EXP3.CL, followed by UCB1.CL, which vastly outperforms both EXP3.CL-U and EXP3.CL.

What is surprising about these results is that cross-learning works at all, let alone gives an advantage, given that the basic assumption necessary for cross-learning – that your value is independent of the other players' bids, so that you can predict what would have happened if your value had been different – does not hold. Indeed, for this data, the Pearson correlation coefficient between the values v and the maximum bids r of the other bidders is approximately 0.4. This suggests that these algorithms are somewhat robust to errors in the cross-learning hypothesis. It is an interesting open question to understand this phenomenon theoretically.

References

[1] Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In Conference on Learning Theory, pages 23–35, 2015.

[2] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, January 2003.

[3] Santiago Balseiro, Negin Golrezaei, Mohammad Mahdian, Vahab Mirrokni, and Jon Schneider. Contextual bandits with cross-learning. arXiv preprint arXiv:1809.09582, 2018.

[4] Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19–26, 2011.

[5] Mark Braverman, Jieming Mao, Jon Schneider, and S. Matthew Weinberg. Selling to a no-regret buyer.
arXiv preprint arXiv:1711.09176, 2017.

[6] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.

[7] Yang Cai and Constantinos Daskalakis. Learning multi-item auctions with (or without) samples. In FOCS, 2017.

[8] Arnoud V. den Boer. Dynamic pricing and learning: Historical origins, current research, and new directions. Surveys in Operations Research and Management Science, 20(1):1–18, 2015.

[9] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient learning and auction design. In FOCS, 2017.

[10] Zhe Feng, Chara Podimata, and Vasilis Syrgkanis. Learning to bid without knowing your value. In Proceedings of the 2018 ACM Conference on Economics and Computation, pages 505–522. ACM, 2018.

[11] Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. 2018.

[12] Satyen Kale, Lev Reyzin, and Robert E. Schapire. Non-stochastic bandit slate problems. In Advances in Neural Information Processing Systems, pages 1054–1062, 2010.

[13] Robert Kleinberg, Alexandru Niculescu-Mizil, and Yogeshwer Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245–272, 2010.

[14] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math., 6(1):4–22, March 1985.

[15] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817–824. Curran Associates, Inc., 2008.

[16] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire.
A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[17] Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.

[18] Mehryar Mohri and Andrés Munoz Medina. Learning algorithms for second-price auctions with reserve. The Journal of Machine Learning Research, 17(1):2632–2656, 2016.

[19] Jamie Morgenstern and Tim Roughgarden. Learning simple auctions. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1298–1318. PMLR, 23–26 Jun 2016.

[20] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

[21] Aleksandrs Slivkins. Contextual bandits with similarity information. In Proceedings of the 24th Annual Conference on Learning Theory, pages 679–702, 2011.

[22] William Vickrey. Counterspeculation, auctions, and competitive sealed tenders. The Journal of Finance, 16(1):8–37, 1961.

[23] Sofía S. Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science, 30(2):199, 2015.

[24] Jonathan Weed, Vianney Perchet, and Philippe Rigollet.
Online learning in repeated auctions. In Conference on Learning Theory, pages 1562–1583, 2016.