{"title": "Contextual semibandits via supervised learning oracles", "book": "Advances in Neural Information Processing Systems", "page_first": 2388, "page_last": 2396, "abstract": "We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.", "full_text": "Contextual semibandits via supervised learning oracles\n\nAkshay Krishnamurthy\u2020\nakshay@cs.umass.edu\n\nAlekh Agarwal\u2021\n\nalekha@microsoft.com\n\n\u2020College of Information and Computer Sciences\n\nUniversity of Massachusetts, Amherst, MA\n\nMiroslav Dud\u00edk\u2021\n\nmdudik@microsoft.com\n\u2021Microsoft Research\n\nNew York, NY\n\nAbstract\n\nWe study an online decision making problem where on each round a learner chooses\na list of items based on some side information, receives a scalar feedback value for\neach individual item, and a reward that is linearly related to this feedback. These\nproblems, known as contextual semibandits, arise in crowdsourcing, recommen-\ndation, and many other domains. 
This paper reduces contextual semibandits to\nsupervised learning, allowing us to leverage powerful supervised learning methods\nin this partial-feedback setting. Our \ufb01rst reduction applies when the mapping from\nfeedback to reward is known and leads to a computationally ef\ufb01cient algorithm\nwith near-optimal regret. We show that this algorithm outperforms state-of-the-art\napproaches on real-world learning-to-rank datasets, demonstrating the advantage of\noracle-based algorithms. Our second reduction applies to the previously unstudied\nsetting when the linear mapping from feedback to reward is unknown. Our regret\nguarantees are superior to prior techniques that ignore the feedback.\n\n1\n\nIntroduction\n\nDecision making with partial feedback, motivated by applications including personalized\nmedicine [21] and content recommendation [16], is receiving increasing attention from the ma-\nchine learning community. These problems are formally modeled as learning from bandit feedback,\nwhere a learner repeatedly takes an action and observes a reward for the action, with the goal of\nmaximizing reward. While bandit learning captures many problems of interest, several applications\nhave additional structure: the action is combinatorial in nature and more detailed feedback is provided.\nFor example, in internet applications, we often recommend sets of items and record information about\nthe user\u2019s interaction with each individual item (e.g., click). This additional feedback is unhelpful\nunless it relates to the overall reward (e.g., number of clicks), and, as in previous work, we assume a\nlinear relationship. This interaction is known as the semibandit feedback model.\nTypical bandit and semibandit algorithms achieve reward that is competitive with the single best \ufb01xed\naction, i.e., the best medical treatment or the most popular news article for everyone. 
This is often inadequate for recommendation applications: while the most popular articles may get some clicks, personalizing content to the users is much more effective. A better strategy is therefore to leverage contextual information to learn a rich policy for selecting actions, and we model this as contextual semibandits. In this setting, the learner repeatedly observes a context (user features), chooses a composite action (list of articles), which is an ordered tuple of simple actions, and receives reward for the composite action (number of clicks), but also feedback about each simple action (click). The goal of the learner is to find a policy for mapping contexts to composite actions that achieves high reward. We typically consider policies in a large but constrained class, for example, linear learners or tree ensembles. Such a class enables us to learn an expressive policy, but introduces a computational challenge of finding a good policy without direct enumeration. We build on the supervised learning literature, which has developed fast algorithms for such policy classes, including logistic regression and SVMs for linear classifiers and boosting for tree ensembles. We access the policy class exclusively through a supervised learning algorithm, viewed as an oracle.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

| Algorithm               | Regret                          | Oracle Calls              | Weights w* |
| VCEE (Thm. 1)           | √(KLT log N)                    | T^{3/2} √(K/(L log N))    | known      |
| ε-Greedy (Thm. 3)       | (LT)^{2/3} (K log N)^{1/3}      | √(KLT/log N)              | known      |
| Kale et al. [12]        | √(KLT log N)                    | not oracle-based          | known      |
| EELS (Thm. 2)           | (LT)^{2/3} (K log N)^{1/3}      | 1                         | unknown    |
| Agarwal et al. [1]      | √(K^L · LT log N)               | T^{3/2} √(K^L/(L log N))  | unknown    |
| Swaminathan et al. [22] | L^{4/3} T^{2/3} (K log N)^{1/3} | 1                         | unknown    |

Table 1: Comparison of contextual semibandit algorithms for arbitrary policy classes, assuming all rankings are valid composite actions.
The reward is semibandit feedback weighted according to w*. For known weights, we consider w* = 1; for unknown weights, we assume ‖w*‖₂ ≤ O(√L).

In this paper, we develop and evaluate oracle-based algorithms for the contextual semibandits problem. We make the following contributions:

1. In the more common setting where the linear function relating the semibandit feedback to the reward is known, we develop a new algorithm, called VCEE, that extends the oracle-based contextual bandit algorithm of Agarwal et al. [1]. We show that VCEE enjoys a regret bound between Õ(√(KLT log N)) and Õ(L√(KT log N)), depending on the combinatorial structure of the problem, when there are T rounds of interaction, K simple actions, N policies, and composite actions have length L.¹ VCEE can handle structured action spaces and makes Õ(T^{3/2}) calls to the supervised learning oracle.

2. We empirically evaluate this algorithm on two large-scale learning-to-rank datasets and compare with other contextual semibandit approaches. These experiments comprehensively demonstrate that effective exploration over a rich policy class can lead to significantly better performance than existing approaches. To our knowledge, this is the first thorough experimental evaluation of not only oracle-based semibandit methods, but of oracle-based contextual bandits as well.

3. When the linear function relating the feedback to the reward is unknown, we develop a new algorithm called EELS. Our algorithm first learns the linear function by uniform exploration and then, adaptively, switches to act according to an empirically optimal policy. We prove an Õ((LT)^{2/3}(K log N)^{1/3}) regret bound by analyzing when to switch. We are not aware of other computationally efficient procedures with a matching or better regret bound for this setting.

See Table 1 for a comparison of our results with existing applicable bounds.

Related work.
There is a growing body of work on combinatorial bandit optimization [2, 4] with considerable attention on semibandit feedback [6, 10, 12, 13, 19]. The majority of this research focuses on the non-contextual setting with a known relationship between semibandit feedback and reward, and a typical algorithm here achieves an Õ(√(KLT)) regret against the best fixed composite action. To our knowledge, only the work of Kale et al. [12] and Qin et al. [19] considers the contextual setting, again with known relationship. The former generalizes the Exp4 algorithm [3] to semibandits, and achieves Õ(√(KLT)) regret,² but requires explicit enumeration of the policies. The latter generalizes the LinUCB algorithm of Chu et al. [7] to semibandits, assuming that the simple action feedback is linearly related to the context. This differs from our setting: we make no assumptions about the simple action feedback. In our experiments, we compare VCEE against this LinUCB-style algorithm and demonstrate substantial improvements.

We are not aware of attempts to learn a relationship between the overall reward and the feedback on simple actions as we do with EELS. While EELS uses least squares, as in LinUCB-style approaches, it does so without assumptions on the semibandit feedback. Crucially, the covariates for its least squares problem are observed after predicting a composite action and not before, unlike in LinUCB. Supervised learning oracles have been used as a computational primitive in many settings including active learning [11], contextual bandits [1, 9, 20, 23], and structured prediction [8].

¹Throughout the paper, the Õ(·) notation suppresses factors polylogarithmic in K, L, T and log N. We analyze finite policy classes, but our work extends to infinite classes by standard discretization arguments.
²Kale et al.
[12] consider the favorable setting where our bounds match, when uniform exploration is valid.

2 Preliminaries

Let X be a space of contexts and A a set of K simple actions. Let Π ⊆ (X → A^L) be a finite set of policies, |Π| = N, mapping contexts to composite actions. Composite actions, also called rankings, are tuples of L distinct simple actions. In general, there are K!/(K−L)! possible rankings, but they might not be valid in all contexts. The set of valid rankings for a context x is defined implicitly through the policy class as {π(x)}_{π∈Π}.

Let Δ(Π) be the set of distributions over policies, and Δ≤(Π) be the set of non-negative weight vectors over policies, summing to at most 1, which we call subdistributions. Let 1(·) be the 0/1 indicator equal to 1 if its argument is true and 0 otherwise.

In stochastic contextual semibandits, there is an unknown distribution D over triples (x, y, ξ), where x is a context, y ∈ [0, 1]^K is the vector of reward features, with entries indexed by simple actions as y(a), and ξ ∈ [−1, 1] is the reward noise, E[ξ | x, y] = 0. Given y ∈ R^K and A = (a_1, ..., a_L) ∈ A^L, we write y(A) ∈ R^L for the vector with entries y(a_ℓ). The learner plays a T-round game. In each round, nature draws (x_t, y_t, ξ_t) ~ D and reveals the context x_t. The learner selects a valid ranking A_t = (a_{t,1}, a_{t,2}, ..., a_{t,L}) and gets reward

    r_t(A_t) = Σ_{ℓ=1}^L w*_ℓ y_t(a_{t,ℓ}) + ξ_t,

where w* ∈ R^L is a possibly unknown but fixed weight vector. The learner is shown the reward r_t(A_t) and the vector of reward features for the chosen simple actions y_t(A_t), jointly referred to as semibandit feedback.

The goal is to achieve cumulative reward competitive with all π ∈ Π. For a policy π, let R(π) := E_{(x,y,ξ)~D}[r(π(x))] denote its expected reward, and let π* := argmax_{π∈Π} R(π) be the maximizer of expected reward. We measure performance of an algorithm via cumulative empirical regret,

    Regret := Σ_{t=1}^T [ r_t(π*(x_t)) − r_t(A_t) ].    (1)

The performance of a policy π is measured by its expected regret, Reg(π) := R(π*) − R(π).

Example 1. In personalized search, a learning system repeatedly responds to queries with rankings of search items. This is a contextual semibandit problem where the query and user features form the context, the simple actions are search items, and the composite actions are their lists. The semibandit feedback is whether the user clicked on each item, while the reward may be the click-based discounted cumulative gain (DCG), which is a weighted sum of clicks, with position-dependent weights. We want to map contexts to rankings to maximize DCG and achieve a low regret.

We assume that our algorithms have access to a supervised learning oracle, also called an argmax oracle, denoted AMO, that can find a policy with the maximum empirical reward on any appropriate dataset. Specifically, given a dataset D = {(x_i, y_i, v_i)}_{i=1}^n of contexts x_i, reward feature vectors y_i ∈ R^K with rewards for all simple actions, and weight vectors v_i ∈ R^L, the oracle computes

    AMO(D) := argmax_{π∈Π} Σ_{i=1}^n ⟨v_i, y_i(π(x_i))⟩ = argmax_{π∈Π} Σ_{i=1}^n Σ_{ℓ=1}^L v_{i,ℓ} y_i(π(x_i)_ℓ),    (2)

where π(x)_ℓ is the ℓth simple action that policy π chooses on context x. The oracle is supervised as it assumes known features y_i for all simple actions whereas we only observe them for chosen actions. This oracle is the structured generalization of the one considered in contextual bandits [1, 9] and can be implemented by any structured prediction approach such as CRFs [14] or SEARN [8].

Our algorithms choose composite actions by sampling from a distribution, which allows us to use importance weighting to construct unbiased estimates for the reward features y.
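The unbiasedness of this importance weighting is easy to check numerically. Below is a minimal sketch, not the paper's implementation: the toy sizes, the explicit distribution Q over rankings, and all variable names are illustrative. Averaging ŷ(a) = y(a)·1(a ∈ A)/Q(a ∈ A) over A ~ Q recovers y exactly.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)
K, L = 4, 2
y = rng.uniform(size=K)                 # true reward features for one round

# A toy explicit distribution Q over rankings: uniform over all ordered pairs.
rankings = list(permutations(range(K), L))
Q = np.full(len(rankings), 1.0 / len(rankings))

# Marginal probability that each simple action a appears in the chosen ranking.
marginal = np.zeros(K)
for prob, A in zip(Q, rankings):
    for a in A:
        marginal[a] += prob

def importance_weighted_estimate(A):
    """y_hat(a) = y(a) * 1(a in A) / Q(a in A); only observed entries are used."""
    y_hat = np.zeros(K)
    for a in A:
        y_hat[a] = y[a] / marginal[a]
    return y_hat

# Averaging y_hat over the sampling distribution recovers y exactly (unbiasedness).
expected = np.zeros(K)
for prob, A in zip(Q, rankings):
    expected += prob * importance_weighted_estimate(A)
```

Note that the estimator divides by the marginal probability Q(a ∈ A) of each simple action, not by the probability Q(A) of the whole ranking; the marginals are much larger, which keeps the variance of the estimates manageable.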
If on round t, a composite action A_t is chosen with probability Q_t(A_t), we construct the importance weighted feature vector ŷ_t with components ŷ_t(a) := y_t(a) 1(a ∈ A_t)/Q_t(a ∈ A_t), which are unbiased estimators of y_t(a). For a policy π, we then define empirical estimates of its reward and regret, resp., as

    η_t(π, w) := (1/t) Σ_{i=1}^t ⟨w, ŷ_i(π(x_i))⟩    and    Reĝ_t(π, w) := max_{π'} η_t(π', w) − η_t(π, w).

By construction, η_t(π, w*) is an unbiased estimate of the expected reward R(π), but Reĝ_t(π, w*) is not an unbiased estimate of the expected regret Reg(π). We use Ê_{x~H}[·] to denote empirical expectation over contexts appearing in the history of interaction H.

Algorithm 1 VCEE (Variance-Constrained Explore-Exploit) Algorithm
Require: Allowed failure probability δ ∈ (0, 1).
1: Q_0 = 0, the all-zeros vector. H_0 = ∅. Define: μ_t = min{ 1/(2K), √(ln(16t²N/δ)/(K t p_min)) }.
2: for round t = 1, ..., T do
3:   Let π_{t−1} = argmax_{π∈Π} η_{t−1}(π, w*) and Q̃_{t−1} = Q_{t−1} + (1 − Σ_π Q_{t−1}(π)) 1_{π_{t−1}}.
4:   Observe x_t ∈ X, play A_t ~ Q̃_{t−1}^{μ_{t−1}}(· | x_t) (see Eq. (3)), and observe y_t(A_t) and r_t(A_t).
5:   Define q_t(a) = Q̃_{t−1}^{μ_{t−1}}(a ∈ A | x_t) for each a.
6:   Obtain Q_t by solving OP with H_t = H_{t−1} ∪ {(x_t, y_t(A_t), q_t(A_t))} and μ_t.
7: end for

Semi-bandit Optimization Problem (OP)
With history H and μ ≥ 0, define b_π := (‖w*‖₁/‖w*‖₂²) · Reĝ_t(π)/(ψ μ p_min) and ψ := 100. Find Q ∈ Δ≤(Π) (a subdistribution) such that:

    Σ_{π∈Π} Q(π) b_π ≤ 2KL/p_min    (4)

    ∀π ∈ Π:  Ê_{x~H}[ Σ_{ℓ=1}^L 1/Q^μ(π(x)_ℓ ∈ A | x) ] ≤ 2KL/p_min + b_π    (5)

Finally, we introduce projections and smoothing of distributions.
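To preview how the smoothing of Eq. (3) operates, here is a toy sketch; it is not the paper's code, all names and constants are illustrative, and U_x is taken to be uniform over all rankings (the unrestricted case, where p_min = L). With probability Kμ we play an exploration ranking; otherwise we follow a policy drawn from P.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(4)
K, L, mu = 4, 2, 0.05
rankings = list(permutations(range(K), L))   # all valid rankings (no restrictions)

# A toy distribution P over three fixed "policies" (each always plays one ranking).
support = [rankings[0], rankings[5], rankings[9]]
P = np.array([0.5, 0.3, 0.2])

def smoothed_sample(x=None):
    """Sample from P^mu: w.p. K*mu play a ranking from the uniform exploration
    distribution U_x; otherwise follow a policy drawn from P (context x is
    ignored in this toy, since each toy policy is constant)."""
    if rng.random() < K * mu:
        return rankings[rng.integers(len(rankings))]
    return support[rng.choice(len(P), p=P)]

# With U_x uniform over all rankings, U_x(a in A) = L/K, so every simple action
# is shown with probability at least K*mu * (L/K) = mu*L, even actions that no
# policy in the support of P ever plays.
n = 20_000
freq = np.zeros(K)
for _ in range(n):
    for a in smoothed_sample():
        freq[a] += 1 / n
```

This guaranteed floor on each simple action's marginal probability is exactly what bounds the variance of the importance-weighted estimates ŷ.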
For any μ ∈ [0, 1/K] and any subdistribution P ∈ Δ≤(Π), the smoothed and projected conditional subdistribution P^μ(A | x) is

    P^μ(A | x) := (1 − Kμ) Σ_{π∈Π} P(π) 1(π(x) = A) + Kμ U_x(A),    (3)

where U_x is a uniform distribution over a certain subset of valid rankings for context x, designed to ensure that the probability of choosing each valid simple action is large. By mixing U_x into our action selection, we limit the variance of reward feature estimates ŷ. The lower bound on the simple action probabilities under U_x appears in our analysis as p_min, which is the largest number satisfying

    U_x(a ∈ A) ≥ p_min/K

for all x and all simple actions a valid for x. Note that p_min = L when there are no restrictions on the action space, as we can take U_x to be the uniform distribution over all rankings and verify that U_x(a ∈ A) = L/K. In the worst case, p_min = 1, since we can always find one valid ranking for each valid simple action and let U_x be the uniform distribution over this set. Such a ranking can be found efficiently by a call to AMO for each simple action a, with the dataset of a single point (x, 1_a ∈ R^K, 1 ∈ R^L), where 1_a(a′) = 1(a = a′).

3 Semibandits with known weights

We begin with the setting where the weights w* are known, and present an efficient oracle-based algorithm (VCEE, see Algorithm 1) that generalizes the algorithm of Agarwal et al. [1].

The algorithm, before each round t, constructs a subdistribution Q_{t−1} ∈ Δ≤(Π), which is used to form the distribution Q̃_{t−1} by placing the missing mass on the maximizer of empirical reward. The composite action for the context x_t is chosen according to the smoothed distribution Q̃_{t−1}^{μ_{t−1}} (see Eq. (3)). The subdistribution Q_{t−1} is any solution to the feasibility problem (OP), which balances exploration and exploitation via the constraints in Eqs. (4) and (5). Eq.
(4) ensures that the distribution has low empirical regret. Simultaneously, Eq. (5) ensures that the variance of the reward estimates ŷ remains sufficiently small for each policy π, which helps control the deviation between empirical and expected regret, and implies that Q_{t−1} has low expected regret. For each π, the variance constraint is based on the empirical regret of π, guaranteeing sufficient exploration amongst all good policies.

OP can be solved efficiently using AMO and a coordinate descent procedure obtained by modifying the algorithm of Agarwal et al. [1]. While the full algorithm and analysis are deferred to Appendix E, several key differences between VCEE and the algorithm of Agarwal et al. [1] are worth highlighting.

One crucial modification is that the variance constraint in Eq. (5) involves the marginal probabilities of the simple actions rather than the composite actions as would be the most obvious adaptation to our setting. This change, based on using the reward estimates ŷ_t for simple actions, leads to substantially lower variance of reward estimates for all policies and, consequently, an improved regret bound. Another important modification is the new mixing distribution U_x and the quantity p_min. For structured composite action spaces, uniform exploration over the valid composite actions may not provide sufficient coverage of each simple action and may lead to dependence on the composite action space size, which is exponentially worse than when U_x is used.

The regret guarantee for Algorithm 1 is the following:

Theorem 1. For any δ ∈ (0, 1), with probability at least 1 − δ, VCEE achieves regret Õ( (‖w*‖₂²/‖w*‖₁) · L √(KT log(N/δ)/p_min) ). Moreover, VCEE can be efficiently implemented with Õ( T^{3/2} √(K/(p_min log(N/δ))) ) calls to a supervised learning oracle AMO.

In Table 1, we compare this result to other applicable regret bounds in the most common setting, where w* = 1 and all rankings are valid (p_min = L). VCEE enjoys a Õ(√(KLT log N)) regret bound, which is the best bound amongst oracle-based approaches, representing an exponentially better L-dependence over the purely bandit feedback variant [1] and a polynomially better T-dependence over an ε-greedy scheme (see Theorem 3 in Appendix A). This improvement over ε-greedy is also verified by our experiments. Additionally, our bound matches that of Kale et al. [12], who consider the harder adversarial setting but give an algorithm that requires an exponentially worse running time, Ω(NT), and cannot be efficiently implemented with an oracle.

Other results address the non-contextual setting, where the optimal bounds for both stochastic [13] and adversarial [2] semibandits are Θ(√(KLT)). Thus, our bound may be optimal when p_min = Ω(L). However, these results apply even without requiring all rankings to be valid, so they improve on our bound by a √L factor when p_min = 1. This √L discrepancy may not be fundamental, but it seems unavoidable with some degree of uniform exploration, as in all existing contextual bandit algorithms. A promising avenue to resolve this gap is to extend the work of Neu [18], which gives high-probability bounds in the noncontextual setting without uniform exploration.

To summarize, our regret bound is similar to existing results on combinatorial (semi)bandits but represents a significant improvement over existing computationally efficient approaches.

4 Semibandits with unknown weights

We now consider a generalization of the contextual semibandit problem with a new challenge: the weight vector w* is unknown.
This setting is substantially more difficult than the previous one, as it is no longer clear how to use the semibandit feedback to optimize for the overall reward. Our result shows that the semibandit feedback can still be used effectively, even when the transformation is unknown. Throughout, we assume that the true weight vector w* has bounded norm, i.e., ‖w*‖₂ ≤ B. One restriction required by our analysis is the ability to play any ranking. Thus, all rankings must be valid in all contexts, which is a natural restriction in domains such as information retrieval and recommendation. The uniform distribution over all rankings is denoted U.

We propose an algorithm that explores first and then, adaptively, switches to exploitation. In the exploration phase, we play rankings uniformly at random, with the goal of accumulating enough information to learn the weight vector w* for effective policy optimization. Exploration lasts for a variable length of time governed by two parameters n* and λ*. The n* parameter controls the minimum number of rounds of the exploration phase and is O(T^{2/3}), similar to ε-greedy style schemes [15]. The adaptivity is implemented by the λ* parameter, which imposes a lower bound on the eigenvalues of the 2nd-moment matrix of reward features observed during exploration. As a result, we only transition to the exploitation phase after this matrix has suitably large eigenvalues. Since we make no assumptions about the reward features, there is no bound on how many rounds this may take. This is a departure from previous explore-first schemes, and captures the difficulty of learning w* when we observe the regression features only after taking an action.

After the exploration phase of t rounds, we perform least-squares regression using the observed reward features and the rewards to learn an estimate ŵ of w*.
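As a sanity check of this least-squares step, the following sketch recovers a hypothetical weight vector from uniformly explored rankings by solving ŵ = Σ^{−1} Σ_t y_t(A_t) r_t(A_t). It is a toy simulation: the sizes, noise level, feature distribution, and all names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L, n = 8, 3, 5000
w_star = rng.uniform(0.5, 1.5, size=L)   # hypothetical unknown weight vector

Sigma = np.zeros((L, L))                 # 2nd-moment matrix of observed features
b = np.zeros(L)
for _ in range(n):
    y = rng.uniform(size=K)                    # reward features for this round
    A = rng.choice(K, size=L, replace=False)   # uniform exploration over rankings
    z = y[A]                                   # observed covariates y_t(A_t)
    r = float(w_star @ z) + rng.uniform(-0.05, 0.05)  # noisy linear reward
    Sigma += np.outer(z, z)
    b += z * r

w_hat = np.linalg.solve(Sigma, b)        # w_hat = Sigma^{-1} sum_t y_t(A_t) r_t(A_t)
```

Note that the covariates z are only revealed after A is played, which is why the algorithm must keep exploring until Σ is well conditioned rather than planning the design in advance.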
Algorithm 2 EELS (Explore-Exploit Least Squares)
Require: Allowed failure probability δ ∈ (0, 1). Assume ‖w*‖₂ ≤ B.
1: Set n* ← T^{2/3} (K ln(N/δ)/L)^{1/3} max{1, (B√L)^{2/3}}.
2: for t = 1, ..., n* do
3:   Observe x_t, play A_t ~ U (U is uniform over all rankings), observe y_t(A_t) and r_t(A_t).
4: end for
5: Let V̂ = (1/(2 n* K²)) Σ_{t=1}^{n*} Σ_{a,b∈A} (y_t(a) − y_t(b))² 1(a, b ∈ A_t)/U(a, b ∈ A_t).
6: Ṽ ← V̂ + 3 ln(2/δ)/(2 n*).
7: Set λ* ← max{ 6L² ln(4LT/δ), (T Ṽ/B)^{2/3} (L ln(2/δ))^{1/3} }.
8: Set Σ ← Σ_{t=1}^{n*} y_t(A_t) y_t(A_t)^T.
9: while λ_min(Σ) ≤ λ* do
10:   t ← t + 1. Observe x_t, play A_t ~ U, observe y_t(A_t) and r_t(A_t).
11:   Set Σ ← Σ + y_t(A_t) y_t(A_t)^T.
12: end while
13: Estimate weights ŵ ← Σ^{−1} (Σ_{i=1}^t y_i(A_i) r_i(A_i)) (Least Squares).
14: Optimize policy π̂ ← argmax_{π∈Π} η_t(π, ŵ) using importance weighted features.
15: For every remaining round: observe x_t, play A_t = π̂(x_t).

We use ŵ and importance-weighted reward features from the exploration phase to find a policy π̂ with maximum empirical reward, η_t(·, ŵ). The remaining rounds comprise the exploitation phase, where we play according to π̂.

The remaining question is how to set λ*, which governs the length of the exploration phase. The ideal setting uses the unknown parameter V := E_{(x,y)~D} Var_{a~Unif(A)}[y(a)] of the distribution D, where Unif(A) is the uniform distribution over all simple actions. We form an unbiased estimator V̂ of V and derive an upper bound Ṽ. While the optimal λ* depends on V, the upper bound Ṽ suffices.

For this algorithm, we prove the following regret bound.

Theorem 2. For any δ ∈ (0, 1) and T ≥ K ln(N/δ)/min{L, (BL)²}, with probability at least 1 − δ, EELS has regret Õ( T^{2/3} (K log(N/δ))^{1/3} max{B^{1/3} L^{1/2}, B L^{1/6}} ).
EELS can be implemented efficiently with one call to the optimization oracle.

The theorem shows that we can achieve sublinear regret without dependence on the composite action space size even when the weights are unknown. The only applicable alternatives from the literature are displayed in Table 1, specialized to B = Θ(√L). First, oracle-based contextual bandits [1] achieve a better T-dependence, but both the regret and the number of oracle calls grow exponentially with L. Second, the deviation bound of Swaminathan et al. [22], which exploits the reward structure but not the semibandit feedback, leads to an algorithm with regret that is polynomially worse in its dependence on L and B (see Appendix B). This observation is consistent with non-contextual results, which show that the value of semibandit information is only in L factors [2].

Of course EELS has a sub-optimal dependence on T, although this is the best we are aware of for a computationally efficient algorithm in this setting. It is an interesting open question to achieve poly(K, L) √(T log N) regret with unknown weights.

5 Proof sketches

We next sketch the arguments for our theorems. Full proofs are deferred to the appendices.

Proof of Theorem 1: The result generalizes Agarwal et al. [1], and the proof structure is similar. For the regret bound, we use Eq. (5) to control the deviation of the empirical reward estimates which make up the empirical regret Reĝ_t. A careful inductive argument leads to the following bounds:

    Reg(π) ≤ 2 Reĝ_t(π) + c₀ (‖w*‖₂²/‖w*‖₁) KL μ_t    and    Reĝ_t(π) ≤ 2 Reg(π) + c₀ (‖w*‖₂²/‖w*‖₁) KL μ_t.

Here c₀ is a universal constant and μ_t is defined in the pseudocode. Eq. (4) guarantees low empirical regret when playing according to Q̃_t^{μ_t}, and the above inequalities also ensure small population regret. The cumulative regret is bounded by (‖w*‖₂²/‖w*‖₁) KL Σ_{t=1}^T μ_t, which grows at the rate given in Theorem 1. The number of oracle calls is bounded by the analysis of the number of iterations of coordinate descent used to solve OP, via a potential argument similar to Agarwal et al. [1].

Proof of Theorem 2: We analyze the exploration and exploitation phases individually, and then optimize n* and λ* to balance these terms. For the exploration phase, the expected per-round regret can be bounded by either ‖w*‖₂ √(KV) or ‖w*‖₂ √L, but the number of rounds depends on the minimum eigenvalue λ_min(Σ), with Σ defined in Steps 8 and 11. However, the expected per-round 2nd-moment matrix, E_{(x,y)~D, A~U}[y(A) y(A)^T], has all eigenvalues at least V. Thus, after t rounds, we expect λ_min(Σ) ≥ tV, so exploration lasts about λ*/V rounds, yielding roughly

    Exploration Regret ≤ (λ*/V) · ‖w*‖₂ min{√(KV), √L}.

Now our choice of λ* produces a benign dependence on V and yields a T^{2/3} bound.

For the exploitation phase, we bound the error between the empirical reward estimates η_t(π, ŵ) and the true reward R(π). Since we know λ_min(Σ) ≥ λ* in this phase, we obtain

    Exploitation Regret ≤ T ‖w*‖₂ √(K log N / n*) + T √(L/λ*) · min{√(KV), √L}.

The first term captures the error from using the importance-weighted ŷ vector, while the second uses a bound on the error ‖ŵ − w*‖₂ from the analysis of linear regression (assuming λ_min(Σ) ≥ λ*).

This high-level argument ignores several important details. First, we must show that using Ṽ instead of the optimal choice V in the setting of λ* does not affect the regret.
Secondly, since the termination condition for the exploration phase depends on the random variable Σ, we must derive a high-probability bound on the number of exploration rounds to control the regret. Obtaining this bound requires a careful application of the matrix Bernstein inequality to certify that Σ has large eigenvalues.

6 Experimental Results

Our experiments compare VCEE with existing alternatives. As VCEE generalizes the algorithm of Agarwal et al. [1], our experiments also provide insights into oracle-based contextual bandit approaches and this is the first detailed empirical study of such algorithms. The weight vector w* in our datasets was known, so we do not evaluate EELS. This section contains a high-level description of our experimental setup, with details on our implementation, baseline algorithms, and policy classes deferred to Appendix C. Software is available at http://github.com/akshaykr/oracle_cb.

Data: We used two large-scale learning-to-rank datasets: MSLR [17] and all folds of the Yahoo! Learning-to-Rank dataset [5]. Both datasets have over 30k unique queries each with a varying number of documents that are annotated with a relevance in {0, ..., 4}. Each query-document pair has a feature vector (d = 136 for MSLR and d = 415 for Yahoo!) that we use to define our policy class. For MSLR, we choose K = 10 documents per query and set L = 3, while for Yahoo!, we set K = 6 and L = 2. The goal is to maximize the sum of relevances of shown documents (w* = 1) and the individual relevances are the semibandit feedback. All algorithms make a single pass over the queries.

Algorithms: We compare VCEE, implemented with an epoch schedule for solving OP after 2^{i/2} rounds (justified by Agarwal et al. [1]), with two baselines. First is the ε-GREEDY approach [15], with a constant but tuned ε. This algorithm explores uniformly with probability ε and follows the empirically best policy otherwise.
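A minimal sketch of such an ε-greedy loop is below. It is not the tuned baseline from the experiments: the tiny fixed-ranking "policy class", the Bernoulli click model, and all names are hypothetical stand-ins for the oracle-backed version, and policies are evaluated with importance weighting on the uniform exploration rounds only.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)
K, L, T, eps = 5, 2, 4000, 0.1
n_rankings = factorial(K) // factorial(K - L)   # number of valid rankings

# A tiny explicit policy class standing in for the supervised learning oracle:
# each "policy" ignores the context and always plays one fixed ranking.
policies = [tuple(rng.choice(K, size=L, replace=False)) for _ in range(8)]
ipw_sum = np.zeros(len(policies))    # importance-weighted reward totals

means = rng.uniform(size=K)          # hypothetical click probability per action

for t in range(T):
    if rng.random() < eps:           # explore: uniformly random ranking
        A = tuple(rng.choice(K, size=L, replace=False))
        explored = True
    else:                            # exploit: empirically best policy so far
        A = policies[int(np.argmax(ipw_sum))]
        explored = False
    clicks = rng.binomial(1, means[list(A)])    # semibandit feedback
    r = float(clicks.sum())                     # reward with w* = 1
    if explored:                     # unbiased off-policy credit: weight by 1/P(A)
        for i, pi in enumerate(policies):
            if pi == A:
                ipw_sum[i] += r * n_rankings

chosen = policies[int(np.argmax(ipw_sum))]
```

Since every policy's estimate is accumulated over the same exploration rounds, comparing the importance-weighted totals is equivalent to comparing the normalized estimates η_t(π, 1).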
The empirically best policy is updated with the same 2^{i/2} schedule. We also compare against a semibandit version of LINUCB [19]. This algorithm models the semibandit feedback as linearly related to the query-document features and learns this relationship, while selecting composite actions using an upper-confidence bound strategy. Specifically, the algorithm maintains a weight vector θ_t ∈ R^d formed by solving a ridge regression problem with the semibandit feedback y_t(a_{t,ℓ}) as regression targets. At round t, the algorithm uses document features {x_a}_{a∈A} and chooses the L documents with highest x_a^T θ_t + α √(x_a^T Σ_t^{−1} x_a) value. Here, Σ_t is the feature 2nd-moment matrix and α is a tuning parameter. For computational reasons, we only update Σ_t and θ_t every 100 rounds.

Oracle implementation: LINUCB only works with a linear policy class. VCEE and ε-GREEDY work with arbitrary classes. Here, we consider three: linear functions and depth-2 and depth-5 gradient boosted regression trees (abbreviated Lin, GB2 and GB5). Both GB classes use 50 trees. Precise details of how we instantiate the supervised learning oracle can be found in Appendix C.

Figure 1: Average reward as a function of number of interactions T for VCEE, ε-GREEDY, and LINUCB on MSLR (left) and Yahoo (right) learning-to-rank datasets.

Parameter tuning: Each algorithm has a parameter governing the explore-exploit tradeoff. For VCEE, we set μ_t = c √(1/(KLT)) and tune c, in ε-GREEDY we tune ε, and in LINUCB we tune α. We ran each algorithm for 10 repetitions, for each of ten logarithmically spaced parameter values.

Results: In Figure 1, we plot the average reward (cumulative reward up to round t divided by t) on both datasets. For each t, we use the parameter that achieves the best average reward across the 10 repetitions at that t.
Thus, for each t, we are showing the performance of each algorithm tuned to maximize reward over t rounds. We found that VCEE was fairly stable to parameter tuning, so for VC-GB5 we use a single parameter value (c = 0.008) for all t on both datasets. To simplify the plot, we show confidence bands at twice the standard error only for LINUCB and VC-GB5.
Qualitatively, both datasets reveal similar phenomena. First, when using the same policy class, VCEE consistently outperforms ε-GREEDY. This agrees with our theory, as VCEE achieves √T-type regret, while a tuned ε-GREEDY achieves at best a T^{2/3} rate.
Secondly, if we use a rich policy class, VCEE can significantly improve on LINUCB, the empirical state-of-the-art and one of the few practical alternatives to ε-GREEDY. Of course, since ε-GREEDY does not outperform LINUCB, the tailored exploration of VCEE is critical. Thus, the combination of these two properties is key to improved performance on these datasets. VCEE is the only contextual semibandit algorithm we are aware of that performs adaptive exploration and is agnostic to the policy representation. Note that LINUCB is quite effective and outperforms VCEE with a linear class. One possible explanation for this behavior is that LINUCB, by directly modeling the reward, searches the policy space more effectively than VCEE, which uses an approximate oracle implementation.

7 Discussion

This paper develops oracle-based algorithms for contextual semibandits with both known and unknown weights. In both cases, our algorithms achieve the best known regret bounds for computationally efficient procedures. Our empirical evaluation of VCEE clearly demonstrates the advantage of sophisticated oracle-based approaches over both parametric approaches and naive exploration. To our knowledge, this is the first detailed empirical evaluation of oracle-based contextual bandit or semibandit learning.
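For concreteness, the composite-action selection rules of the two baselines described above can be sketched as follows. This is an illustrative sketch only: the function and variable names are ours, and the authors' actual implementation details appear in their Appendix C.

```python
import numpy as np

def linucb_select(X, theta, Sigma_inv, alpha, L):
    """Semibandit LINUCB-style selection: score each document by its
    estimated relevance plus an exploration bonus derived from the
    inverse feature second-moment matrix, and return the top-L indices."""
    means = X @ theta                                           # x_a^T theta_t
    bonus = alpha * np.einsum('ad,de,ae->a', X, Sigma_inv, X)   # alpha * x_a^T Sigma_t^{-1} x_a
    return np.argsort(-(means + bonus))[:L]

def epsilon_greedy_select(scores, eps, L, rng):
    """Epsilon-greedy selection: with probability eps show L uniformly
    random documents, otherwise follow the empirically best policy."""
    K = len(scores)
    if rng.random() < eps:
        return rng.choice(K, size=L, replace=False)
    return np.argsort(-scores)[:L]
```

In the LINUCB baseline, theta and Sigma_inv would come from ridge regression on the observed semibandit feedback, refreshed every 100 rounds as described above; in ε-GREEDY, scores would come from the empirically best policy returned by the supervised learning oracle.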
We close with some promising directions for future work:
1. With known weights, can we obtain Õ(√(KLT log N)) regret even with structured action spaces?
2. With unknown weights, can we achieve a √T dependence while exploiting semibandit feedback? This may require a new contextual bandit algorithm that does not use uniform smoothing.

Acknowledgements

This work was carried out while AK was at Microsoft Research.

References

[1] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In ICML, 2014.
[2] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Math of OR, 2014.
[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.
[4] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. JCSS, 2012.
[5] O. Chapelle and Y. Chang. Yahoo! learning to rank challenge overview. In Yahoo! Learning to Rank Challenge, 2011.
[6] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, 2013.
[7] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual bandits with linear payoff functions. In AISTATS, 2011.
[8] H. Daumé III, J. Langford, and D. Marcu. Search-based structured prediction. MLJ, 2009.
[9] M. Dudík, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient optimal learning for contextual bandits. In UAI, 2011.
[10] A. György, T. Linder, G. Lugosi, and G. Ottucsák. The on-line shortest path problem under partial monitoring. JMLR, 2007.
[11] D. J. Hsu. Algorithms for active learning.
PhD thesis, University of California, San Diego, 2010.
[12] S. Kale, L. Reyzin, and R. E. Schapire. Non-stochastic bandit slate problems. In NIPS, 2010.
[13] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In AISTATS, 2015.
[14] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[15] J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In NIPS, 2008.
[16] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, 2010.
[17] MSLR: Microsoft Learning to Rank dataset. http://research.microsoft.com/en-us/projects/mslr/.
[18] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In NIPS, 2015.
[19] L. Qin, S. Chen, and X. Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In ICDM, 2014.
[20] A. Rakhlin and K. Sridharan. Bistro: An efficient relaxation-based method for contextual bandits. In ICML, 2016.
[21] J. M. Robins. The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In Health Service Research Methodology: A Focus on AIDS, 1989.
[22] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, and I. Zitouni. Off-policy evaluation for slate recommendation. arXiv:1605.04812v2, 2016.
[23] V. Syrgkanis, A. Krishnamurthy, and R. E. Schapire. Efficient algorithms for adversarial contextual learning.
In ICML, 2016.