{"title": "Model Selection for Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 14741, "page_last": 14752, "abstract": "We introduce the problem of model selection for contextual bandits, where a\nlearner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. Our main result is a new model selection guarantee for linear contextual bandits. We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \\ldots$,\nwhere the $m^\\star$-th class contains the optimal policy, and we design an\nalgorithm that achieves $\\tilde{O}(T^{2/3}d^{1/3}_{m^\\star})$\nregret with no prior knowledge of the optimal dimension\n$d_{m^\\star}$. The algorithm also achieves regret $\\tilde{O}(T^{3/4} + \\sqrt{Td_{m^\\star}})$,\nwhich is optimal for $d_{m^{\\star}}\\geq{}\\sqrt{T}$. This is the first model selection result for contextual bandits with non-vacuous regret for\nall values of $d_{m^\\star}$, and to the best of our knowledge is the first positive result of this type for any online learning setting with partial information. The core of the algorithm is a new estimator for the gap in the best loss\nachievable by two linear policy classes, which we show admits a\nconvergence rate faster than the rate required to learn the parameters for either class.", "full_text": "Model selection for contextual bandits\n\nDylan J. Foster\nMassachusetts Institute of Technology\ndylanf@mit.edu\n\nAkshay Krishnamurthy\nMicrosoft Research NYC\nakshay@cs.umass.edu\n\nHaipeng Luo\nUniversity of Southern California\nhaipengl@usc.edu\n\nAbstract\n\nWe introduce the problem of model selection for contextual bandits, where a learner must adapt to the complexity of the optimal policy while balancing exploration and exploitation. Our main result is a new model selection guarantee for linear contextual bandits. 
We work in the stochastic realizable setting with a sequence of nested linear policy classes of dimension $d_1 < d_2 < \\ldots$, where the $m^\\star$-th class contains the optimal policy, and we design an algorithm that achieves $\\tilde{O}(T^{2/3} d_{m^\\star}^{1/3})$ regret with no prior knowledge of the optimal dimension $d_{m^\\star}$. The algorithm also achieves regret $\\tilde{O}(T^{3/4} + \\sqrt{T d_{m^\\star}})$, which is optimal for $d_{m^\\star} \\geq \\sqrt{T}$. This is the first model selection result for contextual bandits with non-vacuous regret for all values of $d_{m^\\star}$, and to the best of our knowledge is the first positive result of this type for any online learning setting with partial information. The core of the algorithm is a new estimator for the gap in the best loss achievable by two linear policy classes, which we show admits a convergence rate faster than the rate required to learn the parameters for either class.\n\n1 Introduction\n\nModel selection is the fundamental statistical task of choosing a hypothesis class using data. The choice of hypothesis class modulates a tradeoff between approximation error and estimation error, as a small class can be learned with less data, but may have worse asymptotic performance than a richer class. In the classical statistical learning setting, model selection algorithms provide the following luckiness guarantee: If the class of models decomposes as a nested sequence $\\mathcal{F}_1 \\subset \\mathcal{F}_2 \\subset \\cdots \\subset \\mathcal{F}_m \\subset \\mathcal{F}$, the sample complexity of the algorithm scales with the statistical complexity of the smallest subclass $\\mathcal{F}_{m^\\star}$ containing the true model, even though $m^\\star$ is not known in advance. Such guarantees date back to Vapnik\u2019s structural risk minimization principle and are by now well-known (Vapnik, 1982, 1992; Devroye et al., 1996; Birg\u00e9 and Massart, 1998; Shawe-Taylor et al., 1998; Lugosi and Nobel, 1999; Koltchinskii, 2001; Bartlett et al., 2002; Massart, 2007). 
In practice, one may use cross-validation, the de-facto model selection procedure, to decide whether to use, for example, a linear model, a decision tree, or a neural network. That cross-validation appears in virtually every machine learning pipeline highlights the necessity of model selection for successful ML deployments.\n\nWe investigate model selection in contextual bandits, a simple interactive learning setting. Our main question is: Can model selection guarantees be achieved in contextual bandit learning, where a learner must balance exploration and exploitation to make decisions online?\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nContextual bandit learning is more challenging than statistical learning on two fronts: First, decisions must be made online without seeing the entire dataset, and second, the learner\u2019s actions influence what data is observed (\u201cbandit feedback\u201d). Between these extremes is full-information online learning, where the learner does not have to deal with bandit feedback, but still makes decisions online. Even in this simpler setting, model selection is challenging, since the learner must attempt to identify the appropriate model class while making irrevocable decisions and incurring regret. Nevertheless, several prior works on so-called parameter-free online learning (McMahan and Abernethy, 2013; Orabona, 2014; Koolen and Van Erven, 2015; Luo and Schapire, 2015; Foster et al., 2017; Cutkosky and Boahen, 2017) provide algorithms for online model selection with guarantees analogous to those in statistical learning. With bandit feedback, however, the learner must carefully balance exploration and exploitation, which presents a substantial challenge for model selection. 
At an intuitive level, the reason is that different hypothesis classes require different amounts of exploration, but either over- or under-exploring can incur significant regret. (A detailed discussion requires a formal setup and is deferred to Section 2.) At this point, it suffices to say that prior to this work, we are not aware of any adequate model selection guarantee that adapts results from statistical learning to any online learning setting with partial information.\n\nWe provide a new model selection guarantee for the linear stochastic contextual bandit setting (Chu et al., 2011; Abbasi-Yadkori et al., 2011). We consider a sequence of feature maps into $d_1 < d_2 < \\ldots < d_M$ dimensions and assume that the losses are linearly related to the contexts via the $m^\\star$-th feature map, so that the optimal policy is a $d_{m^\\star}$-dimensional linear policy. We design an algorithm that achieves $\\tilde{O}(T^{2/3} d_{m^\\star}^{1/3})$ regret to this optimal policy over $T$ rounds, with no prior knowledge of $d_{m^\\star}$. As this bound has no dependence on the maximum dimensionality $d_M$, we say that the algorithm adapts to the complexity of the optimal policy. All prior approaches suffer linear regret for non-trivial values of $d_{m^\\star}$, whereas the regret of our algorithm is sublinear whenever $d_{m^\\star}$ is such that the problem is learnable. Our algorithm can also be tuned to achieve $\\tilde{O}(T^{3/4} + \\sqrt{T d_{m^\\star}})$ regret, which matches the optimal rate when $d_{m^\\star} \\geq \\sqrt{T}$.\n\nAt a technical level, we design a sequential test to determine whether the optimal square loss for a large linear class is substantially better than that of a smaller linear class. We show that this test has sublinear sample complexity: while learning a near-optimal predictor in $d$ dimensions requires at least $\\Omega(d)$ labeled examples, we can estimate the improvement in value of the optimal loss using only $O(\\sqrt{d})$ examples, analogous to so-called variance estimation results in statistics (Dicker, 2014; Kong and Valiant, 2018). Crucially, this implies that we can test whether or not to use the larger class without over-exploring for the smaller class.\n\n2 Preliminaries\n\nWe work in the stochastic contextual bandit setting (Langford and Zhang, 2008; Beygelzimer et al., 2011; Agarwal et al., 2014). The setting is defined by a context space $\\mathcal{X}$, a finite action space $\\mathcal{A} := \\{1, \\ldots, K\\}$ and a distribution $\\mathcal{D}$ supported over $(x, \\ell)$ pairs, where $x \\in \\mathcal{X}$ and $\\ell \\in \\mathbb{R}^{\\mathcal{A}}$ is a loss vector. The learner interacts with nature for $T$ rounds, where in round $t$: (1) nature samples $(x_t, \\ell_t) \\sim \\mathcal{D}$, (2) the learner observes $x_t$ and chooses action $a_t$, (3) the learner observes $\\ell_t(a_t)$. The goal of the learner is to choose actions to minimize the cumulative loss.\n\nFollowing several prior works (Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agarwal et al., 2012; Russo and Van Roy, 2013; Li et al., 2017), we study a variant of the contextual bandit setting where the learner has access to a class of regression functions $\\mathcal{F}: \\mathcal{X} \\times \\mathcal{A} \\to \\mathbb{R}$ containing the Bayes optimal regressor\n\n$$f^\\star(x, a) := \\mathbb{E}[\\ell(a) \\mid x] \\quad \\forall x, a. \\qquad (1)$$\n\nWe refer to this assumption ($f^\\star \\in \\mathcal{F}$) as realizability. For each regression function $f$ we define the induced policy $\\pi_f(x) := \\operatorname{argmin}_a f(x, a)$. Note that $\\pi^\\star := \\pi_{f^\\star}$ is the globally optimal policy, and chooses the best action on every context. We measure performance via regret to $\\pi^\\star$:\n\n$$\\mathrm{Reg} := \\sum_{t=1}^{T} \\ell_t(a_t) - \\sum_{t=1}^{T} \\ell_t(\\pi^\\star(x_t)).$$\n\nLow regret is tractable here due to the realizability assumption, and it is well known that the optimal regret is $\\tilde{\\Theta}(\\sqrt{T \\cdot \\mathrm{comp}(\\mathcal{F})})$, where $\\mathrm{comp}(\\mathcal{F})$ measures the statistical complexity of $\\mathcal{F}$. 
For example, $\\mathrm{comp}(\\mathcal{F}) = \\log|\\mathcal{F}|$ for finite classes, and $\\mathrm{comp}(\\mathcal{F}) = d$ for $d$-dimensional linear classes (Agarwal et al., 2012; Chu et al., 2011).1\n\n1We suppress dependence on $K$ and logarithmic dependence on $T$ for this discussion.\n\nModel selection for contextual bandits. We aim for refined guarantees that scale with the complexity of the optimal regressor $f^\\star$ rather than the worst-case complexity of the class $\\mathcal{F}$. To this end, we assume that $\\mathcal{F}$ is structured as a nested sequence of classes $\\mathcal{F}_1 \\subset \\mathcal{F}_2 \\subset \\ldots \\subset \\mathcal{F}_M = \\mathcal{F}$, and we define $m^\\star := \\min\\{m : f^\\star \\in \\mathcal{F}_m\\}$. The model selection problem for contextual bandits asks:\n\nGiven that $m^\\star$ is not known in advance, can we achieve regret scaling as $\\tilde{O}(\\sqrt{T \\cdot \\mathrm{comp}(\\mathcal{F}_{m^\\star})})$, rather than the less favorable $\\tilde{O}(\\sqrt{T \\cdot \\mathrm{comp}(\\mathcal{F})})$?\n\nA slightly weaker model selection problem is to achieve $\\tilde{O}(T^{\\alpha} \\cdot \\mathrm{comp}(\\mathcal{F}_{m^\\star})^{1-\\alpha})$ for some $\\alpha \\in [1/2, 1)$, again without knowing $m^\\star$. Crucially, the exponents on $T$ and $\\mathrm{comp}(\\mathcal{F}_{m^\\star})$ sum to one, implying that we can achieve sublinear regret whenever $\\mathrm{comp}(\\mathcal{F}_{m^\\star})$ is sublinear in $T$, which is precisely whenever the optimal model class is learnable. This implies that the bound, in spite of having worse dependence on $T$, adapts to the complexity of the optimal class with no prior knowledge.\n\nWe achieve this type of guarantee for linear contextual bandits. We assume that each regressor class $\\mathcal{F}_m$ consists of linear functions of the form\n\n$$\\mathcal{F}_m := \\{(x, a) \\mapsto \\langle \\beta, \\phi^m(x, a) \\rangle \\mid \\beta \\in \\mathbb{R}^{d_m}\\},$$\n\nwhere $\\phi^m: \\mathcal{X} \\times \\mathcal{A} \\to \\mathbb{R}^{d_m}$ is a fixed feature map. To obtain a nested sequence of classes, and to ensure the complexity is monotonically increasing, we assume that $d_1 < d_2 < \\ldots < d_M = d$ and that for each $m$, the feature map $\\phi^m$ contains the map $\\phi^{m-1}$ as its first $d_{m-1}$ coordinates.2 If $\\phi^{m^\\star}$ is the smallest feature map that realizes the optimal regressor, we can write\n\n$$f^\\star(x, a) = \\langle \\beta^\\star, \\phi^{m^\\star}(x, a) \\rangle,$$\n\nwhere $\\beta^\\star \\in \\mathbb{R}^{d_{m^\\star}}$ is the optimal coefficient vector. In this setup, the optimal rate if $m^\\star$ is known is $\\tilde{O}(\\sqrt{T d_{m^\\star}})$, obtained by LinUCB (Chu et al., 2011).3 Our main result achieves both $\\tilde{O}(T^{2/3} d_{m^\\star}^{1/3})$ regret (i.e., $\\alpha = 2/3$) and $\\tilde{O}(T^{3/4} + \\sqrt{T d_{m^\\star}})$ regret without knowing $m^\\star$ in advance.\n\nRelated work. The model selection guarantee we seek is straightforward for full information online learning and statistical learning. A simple strategy for the former setting is to use a low-regret online learner for each sub-class $\\mathcal{F}_m$ and aggregate these base learners with a master Hedge instance (Freund and Schapire, 1997). Other strategies include parameter-free methods like AdaNormalHedge (Luo and Schapire, 2015) and Squint (Koolen and Van Erven, 2015). Unfortunately, none of these methods appear to adequately handle bandit feedback. For example, the regret bounds of parameter-free methods do not depend on the so-called \u201clocal norms\u201d, which are essential for achieving $\\sqrt{T}$-regret in the bandit setting via the usual importance weighting approach (Auer et al., 2002). See Appendix B for further discussion.\n\nIn the bandit setting, two approaches we are aware of also fail: the Corral algorithm of Agarwal et al. (2017b), and an adaptive version of the classical $\\epsilon$-greedy strategy (Langford and Zhang, 2008). Unfortunately, both algorithms require tuning parameters in terms of the unknown index $m^\\star$, and naive tuning gives a guarantee of the form $\\tilde{O}(T^{\\alpha} \\mathrm{comp}(\\mathcal{F}_{m^\\star})^{\\beta})$ where $\\alpha + \\beta > 1$. For example, for finite classes Corral gives regret $\\sqrt{T} \\log|\\mathcal{F}_{m^\\star}|$. 
This guarantee is quite weak, since it is vacuous when $\\log|\\mathcal{F}_{m^\\star}| = \\Theta(\\sqrt{T})$ even though such a class admits sublinear regret if $m^\\star$ were known in advance (see Appendix B). The conceptual takeaway from these examples is that model selection for contextual bandits appears to require new algorithmic ideas, even when we are satisfied with $O(T^{\\alpha} \\mathrm{comp}(\\mathcal{F}_{m^\\star})^{1-\\alpha})$-type rates.\n\nSeveral recent papers have developed adaptive guarantees for various contextual bandit settings. These include: (1) adaptivity to easy data, where the optimal policy achieves low loss (Allenberg et al., 2006; Agarwal et al., 2017a; Lykouris et al., 2018; Allen-Zhu et al., 2018), (2) adaptivity to smoothness in settings with continuous action spaces (Locatelli and Carpentier, 2018; Krishnamurthy et al., 2019), and (3) adaptivity in non-stationary environments, where distribution drift parameters are unknown (Luo et al., 2018; Cheung et al., 2019; Auer et al., 2018; Chen et al., 2019). The latter results can be cast as model selection with appropriate nested classes of time-dependent policies, but these results are incomparable to our own, since they are specialized to the non-stationary setting.\n\n2This is without loss of generality in a certain quantitative sense, since we can concatenate features without significantly increasing the complexity of $\\mathcal{F}_m$. See Corollary 5.\n3Regret scaling as $\\tilde{O}(\\sqrt{dT})$ is optimal for the finite action setting we work in. Results for the infinite action case, where regret scales as $\\tilde{\\Theta}(d\\sqrt{T})$, are incomparable to ours.\n\nInterestingly, for multi-armed (non-contextual) bandits, several lower bounds demonstrate that model selection is not possible. 
The simplest of these results is Lattimore\u2019s Pareto frontier (Lattimore, 2015), which states that for multi-armed bandits, if we want to ensure $O(\\sqrt{T})$ regret against a single fixed arm instead of the usual $O(\\sqrt{KT})$ rate, we must incur $\\Omega(K\\sqrt{T})$ regret to the remaining $K - 1$ arms. This precludes a model selection guarantee of the form $\\sqrt{T \\cdot \\mathrm{comp}(\\mathcal{A})}$ since for bandits, the statistical complexity is simply the number of arms. Related lower bounds are known for Lipschitz bandits (Locatelli and Carpentier, 2018; Krishnamurthy et al., 2019). Our results show that model selection is possible for contextual bandits, and thus highlight an important gap between the two settings.\n\nIn concurrent work, Chatterji et al. (2019) studied a similar model selection problem with two classes, where the first class consists of all $K$ constant policies and the second is a $d$-dimensional linear class. They obtain logarithmic regret to the first class and $O(\\sqrt{Td})$ regret to the second, but their assumptions on the context distribution are strictly stronger than our own. A detailed discussion is deferred to the end of the section.\n\nTechnical preliminaries and assumptions. For a matrix $A$, $A^\\dagger$ denotes the pseudoinverse and $\\|A\\|_2$ denotes the spectral norm. 
$I_d$ denotes the identity matrix in $\\mathbb{R}^{d \\times d}$ and $\\|\\cdot\\|_p$ denotes the $\\ell_p$ norm. We use non-asymptotic big-O notation, and use $\\tilde{O}$ to hide terms logarithmic in $K$, $d_M$, $M$, and $T$.\n\nFor a real-valued random variable $z$, we use the following notation to indicate if $z$ is subgaussian or subexponential, following Vershynin (2012):\n\n$$z \\sim \\mathrm{subG}(\\sigma^2) \\Leftrightarrow \\sup_{p \\geq 1}\\{p^{-1/2}(\\mathbb{E}|z|^p)^{1/p}\\} \\leq \\sigma, \\qquad z \\sim \\mathrm{subE}(\\lambda) \\Leftrightarrow \\sup_{p \\geq 1}\\{p^{-1}(\\mathbb{E}|z|^p)^{1/p}\\} \\leq \\lambda. \\qquad (2)$$\n\nWhen $z$ is a random variable in $\\mathbb{R}^d$, we write $z \\sim \\mathrm{subG}_d(\\sigma^2)$ if $\\langle \\theta, z \\rangle \\sim \\mathrm{subG}(\\sigma^2)$ for all $\\|\\theta\\|_2 = 1$ and $z \\sim \\mathrm{subE}_d(\\lambda)$ if $\\langle \\theta, z \\rangle \\sim \\mathrm{subE}(\\lambda)$ for all $\\|\\theta\\|_2 = 1$. These definitions are equivalent to many other familiar definitions for subgaussian/subexponential random variables; see Appendix C.1.\n\nWe assume that for each $m$ and $a \\in \\mathcal{A}$, $\\phi^m(x, a) \\sim \\mathrm{subG}(\\tau_m^2)$ under $x \\sim \\mathcal{D}$. Nestedness implies that $\\tau_1 \\leq \\tau_2 \\leq \\ldots$, and we define $\\tau = \\tau_M$. We also assume that $\\ell(a) - \\mathbb{E}[\\ell(a) \\mid x] \\sim \\mathrm{subG}(\\sigma^2)$ for all $x \\in \\mathcal{X}$ and $a \\in \\mathcal{A}$. Finally, we assume that $\\|\\beta^\\star\\|_2 \\leq B$. To keep notation clean, we use the convention that $\\sigma \\leq \\tau$ and $B \\leq 1$, which ensures that $\\ell(a) \\sim \\mathrm{subG}(4\\tau^2)$.\n\nWe require a lower bound on the eigenvalues of the second moment matrices for the feature vectors. For each $m$, define $\\Sigma_m := \\frac{1}{K}\\sum_{a \\in \\mathcal{A}} \\mathbb{E}_{x \\sim \\mathcal{D}}[\\phi^m(x, a)\\phi^m(x, a)^\\top]$. We let $\\gamma_m^2 := \\lambda_{\\min}(\\Sigma_m)$, where $\\lambda_{\\min}(\\cdot)$ denotes the smallest eigenvalue; nestedness implies $\\gamma_1 \\geq \\gamma_2 \\geq \\ldots$. We assume $\\gamma_m \\geq \\gamma > 0$ for all $m$, and our regret bounds scale inversely proportional to $\\gamma$.\n\nNote that prior linear contextual bandit algorithms (Chu et al., 2011; Abbasi-Yadkori et al., 2011) do not require lower bounds on the second moment matrices. As discussed earlier, the work of Chatterji et al. (2019) obtains stronger model selection guarantees in the case of two classes, but their result requires a lower bound on $\\lambda_{\\min}(\\mathbb{E}[\\phi(x, a)\\phi(x, a)^\\top])$ for all actions. Previous work suggests that advanced exploration is not needed under such assumptions (Bastani et al., 2017; Kannan et al., 2018; Raghavan et al., 2018), which considerably simplifies the problem.4 As such, the result should be seen as complementary to our own. Whether model selection can be achieved without some type of eigenvalue condition is an important open question.\n\n4It appears that exploration is still required for linear contextual bandits under our average eigenvalue assumption. Consider the case $d = 2$ and $\\beta^\\star = (1/2, 1)$. Suppose there are four actions, and that at the first round, $\\phi(x, \\cdot) = \\{e_1, -e_1, e_2, -e_2\\}$. We can ensure that with probability $1/2$, the first action played will be one of the first two. At this point a greedy strategy will always choose $e_1$, but the average context distribution has minimum eigenvalue 1.\n\n3 Model selection for linear contextual bandits\n\nWe now present our algorithm for model selection in linear contextual bandits, ModCB (\u201cModel Selection for Contextual Bandits\u201d). Pseudocode is displayed in Algorithm 1. The algorithm maintains an \u201cactive\u201d policy class index $\\hat{m} \\in [M]$, which it updates over the $T$ rounds starting from $\\hat{m} = 1$. The algorithm updates $\\hat{m}$ only when it can prove that $\\hat{m} \\neq m^\\star$, which is achieved through a statistical test called EstimateResidual (Algorithm 2). When $\\hat{m}$ is not being updated, the algorithm operates as if $\\hat{m} = m^\\star$ by running a low-regret contextual bandit algorithm with the policies induced by $\\mathcal{F}_{\\hat{m}}$; we call this policy class $\\Pi_m := \\{x \\mapsto \\operatorname{argmin}_{a \\in \\mathcal{A}} \\langle \\beta, \\phi^m(x, a) \\rangle \\mid \\|\\beta\\|_2 \\leq \\tau/\\gamma\\}$.5 Note that the low-regret algorithm we run for $\\Pi_{\\hat{m}}$ cannot be based on realizability, since $\\mathcal{F}_{\\hat{m}}$ will not contain the true regressor $f^\\star$ until we reach $m^\\star$. This rules out the usual linear contextual bandit algorithms such as LinUCB. Instead we use a variant of Exp4-IX (Neu, 2015), which is an agnostic contextual bandit algorithm and does not depend on realizability. To deal with infinite classes, unbounded losses, and other technical issues, we require some simple modifications to Exp4-IX; pseudocode and analysis are deferred to Appendix C.3.\n\nAlgorithm 1 ModCB (Model Selection for Contextual Bandits)\n\ninput:\n\u2022 Feature maps $\\{\\phi^m(\\cdot, \\cdot)\\}_{m \\in [M]}$, where $\\phi^m(x, a) \\in \\mathbb{R}^{d_m}$, and time $T \\in \\mathbb{N}$.\n\u2022 Subgaussian parameter $\\tau > 0$, second moment parameter $\\gamma > 0$.\n\u2022 Failure probability $\\delta \\in (0, 1)$, exploration parameter $\\kappa \\in (0, 1)$, confidence parameters $C_1, C_2 > 0$.\ndefinitions:\n\u2022 Define $\\delta_0 = \\delta/(10M^2T^2)$ and $\\mu_t = (K/t)^{\\kappa} \\wedge 1$.\n\u2022 Define $\\alpha_{m,t} = C_1 \\cdot \\left( \\frac{\\tau^6}{\\gamma^4} \\cdot \\frac{d_m^{1/2}\\log^2(2d_m/\\delta_0)}{K^{\\kappa}t^{1-\\kappa}} + \\frac{\\tau^{10}}{\\gamma^8} \\cdot \\frac{d_m\\log(2/\\delta_0)}{t} \\right)$.\n\u2022 Define $T_m^{\\min} = C_2 \\cdot \\left( \\frac{\\tau^4}{\\gamma^2} \\cdot d_m\\log(2/\\delta_0) + \\log^{\\frac{1}{1-\\kappa}}(2/\\delta_0) + K \\right) + 1$.\ninitialization:\n\u2022 $\\hat{m} \\leftarrow 1$. // Index of candidate policy class.\n\u2022 Exp4-IX$_1$ $\\leftarrow$ Exp4-IX$(\\Pi_1, T, \\delta_0)$.\n\u2022 $S \\leftarrow \\{\\emptyset\\}$. // Times at which uniform exploration takes place.\nfor $t = 1, \\ldots, T$ do\n  Receive $x_t$.\n  with probability $1 - \\mu_t$\n    Feed $x_t$ into Exp4-IX$_{\\hat{m}}$ and take $a_t$ to be the predicted action.\n    Update Exp4-IX$_{\\hat{m}}$ with $(x_t, a_t, \\ell_t(a_t))$.\n  otherwise\n    Sample $a_t$ uniformly from $\\mathcal{A}$ and let $S \\leftarrow S \\cup \\{t\\}$.\n  /* Test whether we should move on to a larger policy class. */\n  $\\hat{\\Sigma}_i \\leftarrow \\frac{1}{tK}\\sum_{a \\in \\mathcal{A}}\\sum_{s=1}^{t} \\phi^i(x_s, a)\\phi^i(x_s, a)^\\top$ for each $i \\geq \\hat{m}$. // Empirical second moment.\n  $H_i \\leftarrow \\{(\\phi^i(x_s, a_s), \\ell_s(a_s))\\}_{s \\in S}$ for each $i > \\hat{m}$.\n  $\\hat{\\mathcal{E}}_{\\hat{m},i} \\leftarrow$ EstimateResidual$(H_i, \\hat{\\Sigma}_{\\hat{m}}, \\hat{\\Sigma}_i)$ for each $i > \\hat{m}$. // Gap estimate.\n  if there exists $i > \\hat{m}$ such that $\\hat{\\mathcal{E}}_{\\hat{m},i} \\geq 2\\alpha_{i,t}$ and $t \\geq T_i^{\\min}$ then\n    Let $\\hat{m}$ be the smallest such $i$. Re-initialize Exp4-IX$_{\\hat{m}}$ $\\leftarrow$ Exp4-IX$(\\Pi_{\\hat{m}}, T, \\delta_0)$.\n\n3.1 Key idea: Estimating prediction error with sublinear sample complexity\n\nBefore stating the main result, we elaborate on the statistical test (EstimateResidual) used in Algorithm 1. EstimateResidual estimates an upper bound on the gap between the best-in-class loss for two policy classes $\\Pi_i$ and $\\Pi_j$, which we define as $\\Delta_{i,j} := L_i^\\star - L_j^\\star$, where $L_i^\\star := \\min_{\\pi \\in \\Pi_i} L(\\pi)$. At each round, Algorithm 1 uses EstimateResidual to estimate the gap $\\Delta_{\\hat{m},i}$ for all $i > \\hat{m}$. If the estimated gap is sufficiently large for some $i$, the algorithm sets $\\hat{m}$ to the smallest such $i$ for the next round. This approach is based on the following observation: For all $m \\geq m^\\star$, $L_m^\\star = L_{m^\\star}^\\star$. 
Hence, if $\\Delta_{\\hat{m},i} > 0$, it must be the case that $\\hat{m} \\neq m^\\star$, and we should move on to a larger class.\n\nThe key challenge is to estimate $\\Delta_{\\hat{m},i}$ while ensuring low regret. Naively, we could use uniform exploration and find a policy in $\\Pi_i$ that has minimal empirical loss, which gives an estimate of $L_i^\\star$. Unfortunately, this requires tuning the exploration parameter in terms of $d_i$ and would compromise the regret if $\\hat{m} = m^\\star$. Similar tuning issues arise with other approaches and are discussed further in Appendix B.\n\n5The norm constraint $\\tau/\\gamma$ guarantees that $\\Pi_m$ contains parameter vectors arising from a certain square loss minimization problem under our assumption that $\\|\\beta^\\star\\|_2 \\leq 1$; see Proposition 20.\n\nRather than estimating the policy gap $\\Delta_{\\hat{m},i}$ directly, we estimate the square loss gap $\\mathcal{E}_{\\hat{m},i}$ between the two classes,6 which by Lemma 1 upper bounds the policy gap (up to a square root) and equals zero when both classes contain $f^\\star$. Observe that $\\mathcal{E}_{\\hat{m},i}$ depends on the optimal predictors $\\beta_{\\hat{m}}^\\star, \\beta_i^\\star$ in the two classes. A natural approach to estimate $\\mathcal{E}_{\\hat{m},i}$ is to solve regression problems over both classes to estimate the predictors, then plug them into the expression for $\\mathcal{E}$; we call this the plug-in approach. As this relies on linear regression, it gives an $O(d_i/n)$ error rate for estimating $\\mathcal{E}_{\\hat{m},i}$ from $n$ uniform exploration samples. Unfortunately, since the error scales linearly with the size of the larger class, we must over-explore to ensure low error, and this compromises the regret if the smaller class is optimal.\n\nAs a key technical contribution, we design more efficient estimators for the square loss gap $\\mathcal{E}_{\\hat{m},i}$. We work in the following slightly more general gap estimation setup: we receive pairs $\\{(x_s, y_s)\\}_{s=1}^n$ i.i.d. from a distribution $\\mathcal{D} \\in \\Delta(\\mathbb{R}^d \\times \\mathbb{R})$, where $x \\sim \\mathrm{subG}(\\tau^2)$ and $y \\sim \\mathrm{subG}(\\sigma^2)$. We partition the feature space into $x = (x^{(1)}, x^{(2)})$, where $x^{(1)} \\in \\mathbb{R}^{d_1}$, and define\n\n$$\\beta^\\star := \\operatorname{argmin}_{\\beta \\in \\mathbb{R}^d} \\mathbb{E}(\\langle \\beta, x \\rangle - y)^2, \\qquad \\beta_1^\\star := \\operatorname{argmin}_{\\beta \\in \\mathbb{R}^{d_1}} \\mathbb{E}(\\langle \\beta, x^{(1)} \\rangle - y)^2.$$\n\nThese are, respectively, the optimal linear predictor and the optimal linear predictor restricted to the first $d_1$ dimensions. The square loss gap for the two predictors is defined as $\\mathcal{E} := \\mathbb{E}(\\langle \\beta^\\star, x \\rangle - \\langle \\beta_1^\\star, x^{(1)} \\rangle)^2$. Our problem of estimating $\\mathcal{E}_{\\hat{m},i}$ clearly falls into this general setup if we uniformly explore the actions for $n$ rounds, then set $\\{x_s\\}_{s=1}^n$ to be the features obtained through the feature map $\\phi^i$ and $\\{y_s\\}_{s=1}^n$ to be the observed losses.\n\nAlgorithm 2 EstimateResidual\n\ninput: Examples $\\{(x_s, y_s)\\}_{s=1}^n$, second moment matrix estimates $\\hat{\\Sigma} \\in \\mathbb{R}^{d \\times d}$ and $\\hat{\\Sigma}_1 \\in \\mathbb{R}^{d_1 \\times d_1}$.\nDefine $d_2 = d - d_1$ and $\\hat{R} = \\hat{D}^\\dagger - \\hat{\\Sigma}^\\dagger$, where $\\hat{D} = \\begin{pmatrix} \\hat{\\Sigma}_1 & 0_{d_1 \\times d_2} \\\\ 0_{d_2 \\times d_1} & 0_{d_2 \\times d_2} \\end{pmatrix}$.\nReturn estimator $\\hat{\\mathcal{E}} = \\binom{n}{2}^{-1}\\sum_{s<t}\\langle \\hat{\\Sigma}^{1/2}\\hat{R}x_sy_s, \\hat{\\Sigma}^{1/2}\\hat{R}x_ty_t \\rangle$.\n\nThe pseudocode for our estimator EstimateResidual is displayed in Algorithm 2. In addition to the $n$ labeled samples, it takes as input two empirical second moment matrices $\\hat{\\Sigma}$ and $\\hat{\\Sigma}_1$ constructed via an extra set of $m$ i.i.d. unlabeled samples; these serve as estimates for $\\Sigma := \\mathbb{E}[xx^\\top]$ and $\\Sigma_1 := \\mathbb{E}[x^{(1)}x^{(1)\\top}]$. The intuition is that one can write the square loss gap as $\\mathcal{E} = \\|\\Sigma^{1/2}R\\,\\mathbb{E}[xy]\\|_2^2$ where $R \\in \\mathbb{R}^{d \\times d}$ is a certain function of $\\Sigma$ and $\\Sigma_1$. EstimateResidual simply replaces the second moment matrices with their empirical counterparts and then uses the labeled examples to estimate the weighted norm of $\\mathbb{E}[xy]$ through a U-statistic. The main guarantee for the estimator is as follows.\n\n6In Appendix D we show that $\\beta_m^\\star$ and consequently $\\mathcal{E}_{i,j}$ are always uniquely defined.\n\nTheorem 2. Suppose we take $\\hat{\\Sigma}$ and $\\hat{\\Sigma}_1$ to be the empirical second moment matrices formed from $m$ i.i.d. unlabeled samples. 
Then when $m \\geq C(d + \\log(2/\\delta))\\tau^4/\\lambda_{\\min}(\\Sigma)$, EstimateResidual, given $n$ labeled samples, guarantees that with probability at least $1 - \\delta$,\n\n$$|\\hat{\\mathcal{E}} - \\mathcal{E}| \\leq \\frac{1}{2}\\mathcal{E} + O\\left( \\frac{\\sigma^2\\tau^4}{\\lambda_{\\min}^2(\\Sigma)} \\cdot \\frac{d^{1/2}\\log^2(2d/\\delta)}{n} + \\frac{\\tau^6}{\\lambda_{\\min}^4(\\Sigma)} \\cdot \\frac{d\\log(2/\\delta)}{m} \\cdot \\|\\mathbb{E}[xy]\\|_2^2 \\right). \\qquad (5)$$\n\nTo compare with the plug-in approach, we focus on the dependence between $d$ and $n$. When EstimateResidual is applied within ModCB we have plenty of unlabeled data, so the dependence on $m$ is less important. The dominant term in Theorem 2 is $\\tilde{O}(\\sqrt{d}/n)$, a significant improvement over the $\\tilde{O}(d/n)$ rate for the plug-in estimator. In particular, the dependence on the larger ambient dimension is much milder: we can achieve constant error with $n \\approx \\sqrt{d}$, or in other words the estimator has sublinear sample complexity. This property is crucial for our model selection result.\n\nThe result generalizes and is inspired by the variance estimation method of Dicker (Dicker, 2014; Kong and Valiant, 2018), which obtains a rate of $O(\\sqrt{d}/n + 1/\\sqrt{n})$ to estimate the optimal square loss $\\min_{\\beta \\in \\mathbb{R}^d}\\mathbb{E}(\\langle \\beta, x \\rangle - y)^2$ when the second moments are known. By estimating the square loss gap instead of the loss itself, we avoid the $1/\\sqrt{n}$ term, which is critical for achieving $\\tilde{O}(T^{2/3}d_{m^\\star}^{1/3})$ regret.\n\n3.2 Main result\n\nEquipped with EstimateResidual, we can now sketch the approach behind ModCB in a bit more detail. Recall that the algorithm maintains an index $\\hat{m}$ denoting the current guess for $m^\\star$. We run Exp4-IX over the induced policy class $\\Pi_{\\hat{m}}$, mixing in some additional uniform exploration (with probability $\\mu_t$ at round $t$). We use all of the data to estimate the second moment matrices of all classes, and we pass only the exploration data into the subroutine EstimateResidual. We check if there exists some $i > \\hat{m}$ such that the estimated gap satisfies $\\hat{\\mathcal{E}}_{\\hat{m},i} \\geq 2\\alpha_{i,t}$ and $t \\geq T_i^{\\min}$, which (based on the deviation bound in Theorem 2) implies that $\\mathcal{E}_{\\hat{m},i} > 0$ and thus $\\hat{m} \\neq m^\\star$. If this is the case, we advance $\\hat{m}$ to the smallest such $i$, and if not, we continue with our current guess. This leads to the following guarantee.\n\nTheorem 3. When $C_1$ and $C_2$ are sufficiently large absolute constants and $\\kappa = 1/3$, ModCB guarantees that with probability at least $1 - \\delta$,\n\n$$\\mathrm{Reg} \\leq \\tilde{O}\\left( \\frac{\\tau^4}{\\gamma^3} \\cdot (Tm^\\star)^{2/3}(Kd_{m^\\star})^{1/3}\\log(2/\\delta) \\right). \\qquad (6)$$\n\nWhen $\\kappa = 1/4$, ModCB guarantees that with probability at least $1 - \\delta$,\n\n$$\\mathrm{Reg} \\leq \\tilde{O}\\left( \\frac{\\tau^3}{\\gamma^2} \\cdot K^{1/4}(Tm^\\star)^{3/4}\\log(2/\\delta) + \\frac{\\tau^5}{\\gamma^4} \\cdot \\sqrt{K(Tm^\\star)d_{m^\\star}}\\log(2/\\delta) \\right). \\qquad (7)$$\n\nA few remarks are in order.\n\n\u2022 The two stated bounds are incomparable in general. Tracking only dependence on $T$ and $d_{m^\\star}$, the first is $\\tilde{O}(T^{2/3}d_{m^\\star}^{1/3})$ while the latter is $\\tilde{O}(T^{3/4} + \\sqrt{Td_{m^\\star}})$. The former is better when $d_{m^\\star} \\leq T^{1/4}$. There is a more general Pareto frontier that can be explored by choosing $\\kappa \\in [1/4, 1/3]$, but no choice for $\\kappa$ dominates the others for all values of $d_{m^\\star}$.\n\u2022 Recall that if we had known $d_{m^\\star}$ in advance, we could have simply run LinUCB to achieve $\\tilde{O}(\\sqrt{Td_{m^\\star}})$ regret. The bound (7) matches this oracle rate when $d_{m^\\star} > \\sqrt{T}$, but otherwise our guarantee is slightly worse than the oracle rate. Nevertheless, both bounds are $o(T)$ whenever the oracle rate is $o(T)$ (that is, when $d_{m^\\star} = o(T)$), so the algorithm has sublinear regret whenever the optimal model class is learnable. 
It remains open whether there is a model selection algorithm that can match the oracle rate for all values of $d_{m^\\star}$ simultaneously.\n\u2022 We have not optimized dependence on the condition number $\\tau/\\gamma$ or the logarithmic factors.\n\u2022 If the individual distribution parameters $\\{\\tau_m\\}_{m \\in [M]}$ and $\\{\\gamma_m\\}_{m \\in [M]}$ are known, the algorithm can be modified slightly so that regret scales in terms of $\\tau_{m^\\star}$ and $\\gamma_{m^\\star}^{-1}$. However, the current model, in which we assume access only to uniform upper and lower bounds on these parameters, is more realistic.\n\nAdditionally, Appendix A contains a validation experiment, where we demonstrate that ModCB compares favorably to LinUCB in simulations.\n\nImproving the dependence on $m^\\star$. Theorem 3 obtains the desired model selection guarantee for linear classes, but the bound includes a polynomial dependence on the optimal index $m^\\star$. This contrasts with the logarithmic dependence on $m^\\star$ that can be obtained through structural risk minimization in statistical learning (Vapnik, 1992). However, this $\\mathrm{poly}(m^\\star)$ dependence can be replaced by a single $\\log(T)$ factor with a simple preprocessing step: Given feature maps $\\{\\phi^m(\\cdot,\\cdot)\\}_{m \\in [M]}$, we construct a new collection of maps $\\{\\bar{\\phi}^m(\\cdot,\\cdot)\\}_{m \\in [\\bar{M}]}$, where $\\bar{M} \\leq \\log T$, as follows. First, for $i = 1, \\ldots, \\log T$, take $\\bar{\\phi}^i$ to be the largest feature map $\\phi^m$ for which $d_m \\leq e^i$. Second, remove any duplicates. This preprocessing reduces the number of feature maps to at most $\\log(T)$ while ensuring that a map of dimension $O(d_{m^\\star})$ that contains $\\phi^{m^\\star}$ is always available. Specifically, the preprocessing step yields the following improved regret bounds.\n\nTheorem 4. ModCB with preprocessing guarantees that with probability at least $1-\\delta$,\n$$\\mathrm{Reg} \\leq \\begin{cases} \\tilde{O}\\left(\\frac{\\tau^4}{\\gamma^3} \\cdot T^{2/3} (K d_{m^\\star})^{1/3} \\log(2/\\delta)\\right), & \\kappa = 1/3, \\\\ \\tilde{O}\\left(\\frac{\\tau^3}{\\gamma^2} \\cdot K^{1/4} T^{3/4} \\log(2/\\delta) + \\frac{\\tau^5}{\\gamma^4} \\cdot \\sqrt{K T d_{m^\\star}} \\log(2/\\delta)\\right), & \\kappa = 1/4. \\end{cases}$$\n\nNon-nested feature maps. As a final variant, we note that the algorithm easily extends to the case where feature maps are not nested. Suppose we have non-nested feature maps $\\{\\phi^m\\}_{m \\in [M]}$, where $d_1 \\leq d_2 \\leq \\ldots \\leq d_M$; note that the inequality is no longer strict. In this case, we can obtain a nested collection by concatenating $\\phi^1, \\ldots, \\phi^{m-1}$ to the map $\\phi^m$ for each $m$. This process increases the dimension of the optimal map from $d_{m^\\star}$ to at most $d_{m^\\star} m^\\star$, so we have the following corollary.\n\nCorollary 5. For non-nested feature maps, ModCB with preprocessing guarantees that with probability at least $1-\\delta$,\n$$\\mathrm{Reg} \\leq \\begin{cases} \\tilde{O}\\left(\\frac{\\tau^4}{\\gamma^3} \\cdot T^{2/3} (K d_{m^\\star} m^\\star)^{1/3} \\log(2/\\delta)\\right), & \\kappa = 1/3, \\\\ \\tilde{O}\\left(\\frac{\\tau^3}{\\gamma^2} \\cdot K^{1/4} T^{3/4} \\log(2/\\delta) + \\frac{\\tau^5}{\\gamma^4} \\cdot \\sqrt{K T d_{m^\\star} m^\\star} \\log(2/\\delta)\\right), & \\kappa = 1/4. \\end{cases}$$\n\n3.3 Proof sketch\nWe now sketch the proof of the main theorem, with the full proof deferred to Appendix D. We focus on the case where there are just two classes, so $M = 2$. We only track dependence on $T$ and $d_m$, as this preserves the relevant details but simplifies the argument. The analysis has two cases depending on whether $f^\\star$ belongs to $\\mathcal{F}_1$ or $\\mathcal{F}_2$.\n\nFirst, if $f^\\star \\in \\mathcal{F}_1$ then by Lemma 1 we have that $\\mathcal{E}_{1,2} = 0$. Further, via Theorem 2, we can guarantee that we never advance to $\\widehat{m} = 2$ with high probability. The result then follows from the regret guarantee for Exp4-IX using policy class $\\Pi_1$, and by accounting for uniform exploration.\n\nThe more challenging case is when $f^\\star \\in \\mathcal{F}_2$. Let $\\widehat{T}$ denote the first round where $\\widehat{m} = 2$, or $T$ if the algorithm never advances. We can bound regret as\n$$\\mathrm{Reg} \\leq O\\left(T^{1-\\kappa}\\right) + \\tilde{O}\\left(\\sqrt{\\widehat{T} d_1}\\right) + \\widehat{T} \\Delta_{1,2} + \\tilde{O}\\left(\\sqrt{(T - \\widehat{T}) d_2}\\right).$$\nThe four terms correspond to: (1) uniform exploration with probability $\\mu_t \\propto t^{-\\kappa}$ in round $t$, (2) the Exp4-IX regret bound for class $\\Pi_1$ until time $\\widehat{T}$, (3) the policy gap between the best policy in $\\Pi_1$ and the optimal policy $\\pi^\\star \\in \\Pi_2$, and (4) the Exp4-IX bound over class $\\Pi_2$ until time $T$. The two regret bounds (the second and fourth terms) clearly contribute $O(\\sqrt{T d_2})$ to the overall regret, and the first term is controlled by our choice of $\\kappa \\in \\{1/4, 1/3\\}$. We are left to bound the third term. For this term, observe that in round $\\widehat{T} - 1$, since the algorithm did not advance, we must have $\\widehat{\\mathcal{E}}_{1,2} \\leq 2\\alpha_{2,\\widehat{T}-1}$. Appealing to Theorem 2, this implies that $\\mathcal{E}_{1,2} \\leq O(\\widehat{\\mathcal{E}}_{1,2} + \\alpha_{2,\\widehat{T}-1}) \\leq O(\\alpha_{2,\\widehat{T}-1})$. Plugging in the definition of $\\alpha_{2,t}$ and applying Lemma 1, this gives\n$$\\widehat{T} \\Delta_{1,2} \\leq O\\left(\\widehat{T} \\sqrt{\\mathcal{E}_{1,2}}\\right) \\leq O\\left(\\widehat{T} \\sqrt{\\alpha_{2,\\widehat{T}-1}}\\right) = O\\left(T^{\\frac{1+\\kappa}{2}} d_2^{1/4}\\right). \\tag{8}$$\nThis establishes the result. In particular, with $\\kappa = 1/3$ we obtain $\\tilde{O}(T^{2/3} + \\sqrt{T d_2} + T^{2/3} d_2^{1/4}) \\leq \\tilde{O}(T^{2/3} d_2^{1/3})$ regret, and with $\\kappa = 1/4$ we obtain $\\tilde{O}(T^{3/4} + \\sqrt{T d_2} + T^{5/8} d_2^{1/4}) = \\tilde{O}(T^{3/4} + \\sqrt{T d_2})$.$^7$\n\nThe sublinear estimation rate for EstimateResidual (Theorem 2) plays a critical role in this argument. With the $\\tilde{O}(d/n)$ rate for the na\u00efve plug-in estimator, we could at best set $\\alpha_{2,t} = d_2 / t^{1-\\kappa}$, but this would degrade the dimension dependence in (8) from $d_2^{1/4}$ to $\\sqrt{d_2}$. Unfortunately, this results in $T^{1-\\kappa} + \\sqrt{d_2 T^{1+\\kappa}}$ regret, which is not a meaningful model selection result, since there is no choice for $\\kappa \\in (0, 1)$ for which the exponents on $d_2$ and $T$ sum to one for both terms.\n\n4 Discussion\n\nThis paper initiates the study of model selection tradeoffs in contextual bandits. We provide the first model selection algorithm for the linear contextual bandit setting, which we show achieves $\\tilde{O}(T^{2/3} d_{m^\\star}^{1/3})$ regret when the optimal model is a $d_{m^\\star}$-dimensional linear function. This is the first contextual bandit algorithm that adapts to the complexity of the optimal policy class with no prior knowledge, and raises a number of intriguing questions:\n\n1. Is it possible to achieve $\\tilde{O}(\\sqrt{T d_{m^\\star}})$ regret in our setup? This would show that the price for model selection is negligible.\n\n2. Is it possible to generalize our results beyond linear classes? Specifically, given regressor classes $\\mathcal{F}_1 \\subset \\mathcal{F}_2 \\subset \\ldots \\subset \\mathcal{F}_M$ and assuming the optimal model $f^\\star$ belongs to $\\mathcal{F}_{m^\\star}$ for some unknown $m^\\star$, is there a contextual bandit algorithm that achieves $\\tilde{O}(T^{\\alpha} \\cdot \\mathrm{comp}^{1-\\alpha}(\\mathcal{F}_{m^\\star}))$ regret for some $\\alpha \\in [1/2, 1)$? More ambitiously, can we achieve $\\tilde{O}(\\sqrt{T \\cdot \\mathrm{comp}(\\mathcal{F}_{m^\\star})})$?\n\n3. We have conducted a validation experiment with ModCB (see Appendix A), demonstrating that the algorithm performs favorably in simulations. 
While this synthetic experiment is encouraging, ModCB may not be immediately useful for practical deployments for several reasons, including its reliance on linear realizability and its computational complexity. Are there more robust algorithmic principles for theoretically-sound and practically-effective model selection in contextual bandits?\n\n4. For what classes $\\mathcal{F}$ can we estimate the loss at a sublinear rate, and is this necessary for contextual bandit model selection? Any sublinear guarantee will lead to non-trivial model selection guarantees through a strategy similar to ModCB. Interestingly, it is already known that for certain (e.g., sparse linear) classes, sublinear loss estimation is not possible (Verzelen and Gassiat, 2018). On the other hand, positive results are available for certain nonparametric classes (Brown et al., 2007; Wang et al., 2008).\n\nModel selection is instrumental to the success of ML deployments, yet few positive results exist for partial feedback settings. We believe these questions are technically challenging and practically important, and we are hopeful that positive results of the type we provide here will extend to interactive learning settings beyond contextual bandits.\n\nAcknowledgements\n\nWe thank Ruihao Zhu for working with us at the early stages of this project, and for many helpful discussions. AK is supported by NSF IIS-1763618. HL is supported by NSF IIS-1755781.\n\nReferences\n\nYasin Abbasi-Yadkori, D\u00e1vid P\u00e1l, and Csaba Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.\n\nAlekh Agarwal, Miroslav Dud\u00edk, Satyen Kale, John Langford, and Robert E Schapire. Contextual bandit learning with predictable rewards. 
In International Conference on Artificial Intelligence and Statistics, 2012.\n\n7 Note that if $d_2 \\leq \\sqrt{T}$ then $T^{5/8} d_2^{1/4} \\leq T^{3/4}$, but if $d_2 \\geq \\sqrt{T}$ then $T^{5/8} d_2^{1/4} \\leq \\sqrt{T d_2}$.\n\nAlekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert E. Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, 2014.\n\nAlekh Agarwal, Akshay Krishnamurthy, John Langford, Haipeng Luo, and Robert E. Schapire. Open problem: First-order regret bounds for contextual bandits. In Conference on Learning Theory, 2017a.\n\nAlekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. Conference on Learning Theory, 2017b.\n\nZeyuan Allen-Zhu, S\u00e9bastien Bubeck, and Yuanzhi Li. Make the minority great again: First-order regret bound for contextual bandits. International Conference on Machine Learning, 2018.\n\nChamy Allenberg, Peter Auer, L\u00e1szl\u00f3 Gy\u00f6rfi, and Gy\u00f6rgy Ottucs\u00e1k. Hannan consistency in on-line learning in case of unbounded losses under partial monitoring. In International Conference on Algorithmic Learning Theory. Springer, 2006.\n\nPeter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 2002.\n\nPeter Auer, Pratik Gajane, and Ronald Ortner. Adaptively tracking the best arm with an unknown number of distribution changes. In European Workshop on Reinforcement Learning, 2018.\n\nPeter L Bartlett, St\u00e9phane Boucheron, and G\u00e1bor Lugosi. Model selection and error estimation. Machine Learning, 2002.\n\nHamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. arXiv:1704.09011, 2017.\n\nShai Ben-David, Nicolo Cesa-Bianchi, David Haussler, and Philip M Long. 
Characterizations of learnability for classes of (0,..., n)-valued functions. Journal of Computer and System Sciences, 1995.\n\nAlina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In International Conference on Artificial Intelligence and Statistics, 2011.\n\nLucien Birg\u00e9 and Pascal Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 1998.\n\nLawrence D Brown, Michael Levine, et al. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 2007.\n\nNiladri S. Chatterji, Vidya Muthukumar, and Peter L. Bartlett. Osom: A simultaneously optimal algorithm for multi-armed and linear contextual bandits. arXiv:1905.10040, 2019.\n\nYifang Chen, Chung-Wei Lee, Haipeng Luo, and Chen-Yu Wei. A new algorithm for non-stationary contextual bandits: Efficient, optimal, and parameter-free. Conference on Learning Theory, 2019.\n\nWang Chi Cheung, David Simchi-Levi, and Ruihao Zhu. Learning to optimize under non-stationarity. International Conference on Artificial Intelligence and Statistics, 2019.\n\nWei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, 2011.\n\nAshok Cutkosky and Kwabena A. Boahen. Online learning without prior information. Conference on Learning Theory, 2017.\n\nAmit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclass learnability and the ERM principle. The Journal of Machine Learning Research, 2015.\n\nVictor de la Pe\u00f1a and Evarist Gin\u00e9. Decoupling: From Dependence to Independence. Springer, 1998.\n\nLuc Devroye, L\u00e1szl\u00f3 Gy\u00f6rfi, and G\u00e1bor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.\n\nLee H Dicker. 
Variance estimation in high-dimensional linear models. Biometrika, 2014.\n\nDylan J. Foster, Satyen Kale, Mehryar Mohri, and Karthik Sridharan. Parameter-free online learning via model selection. In Advances in Neural Information Processing Systems, 2017.\n\nDylan J. Foster, Alekh Agarwal, Miroslav Dudik, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, 2018.\n\nYoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 1997.\n\nPierre Gaillard, Gilles Stoltz, and Tim Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, 2014.\n\nDavid Haussler and Philip M Long. A generalization of Sauer\u2019s lemma. Journal of Combinatorial Theory, Series A, 1995.\n\nSampath Kannan, Jamie H Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems, 2018.\n\nVladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 2001.\n\nWeihao Kong and Gregory Valiant. Estimating learnability in the sublinear data regime. In Advances in Neural Information Processing Systems, 2018.\n\nWouter M Koolen and Tim Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, 2015.\n\nAkshay Krishnamurthy, Alekh Agarwal, and Miro Dudik. Contextual semibandits via supervised learning oracles. In Advances In Neural Information Processing Systems, 2016.\n\nAkshay Krishnamurthy, Zhiwei Steven Wu, and Vasilis Syrgkanis. Semiparametric contextual bandits. In International Conference on Machine Learning, 2018.\n\nAkshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. 
Contextual bandits with continuous actions: Smoothing, zooming, and adapting. Conference on Learning Theory, 2019.\n\nJohn Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in neural information processing systems, 2008.\n\nTor Lattimore. The pareto regret frontier for bandits. In Advances in Neural Information Processing Systems, 2015.\n\nLihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, 2017.\n\nAndrea Locatelli and Alexandra Carpentier. Adaptivity to smoothness in x-armed bandits. In Conference on Learning Theory, 2018.\n\nG\u00e1bor Lugosi and Andrew B Nobel. Adaptive model selection using empirical complexities. Annals of Statistics, 1999.\n\nHaipeng Luo and Robert E Schapire. Achieving all with no parameters: Adanormalhedge. In Conference on Learning Theory, 2015.\n\nHaipeng Luo, Chen-Yu Wei, Alekh Agarwal, and John Langford. Efficient contextual bandits in non-stationary worlds. Conference on Learning Theory, 2018.\n\nThodoris Lykouris, Karthik Sridharan, and \u00c9va Tardos. Small-loss bounds for online learning with partial information. Conference on Learning Theory, 2018.\n\nPascal Massart. Concentration inequalities and model selection. Springer, 2007.\n\nBrendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, 2013.\n\nGergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, 2015.\n\nFrancesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, 2014.\n\nManish Raghavan, Aleksandrs Slivkins, Jennifer Wortman Vaughan, and Zhiwei Steven Wu. 
The externalities of exploration and how data diversity helps exploitation. Conference on Learning Theory, 2018.\n\nDaniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, 2013.\n\nJohn Shawe-Taylor, Peter L. Bartlett, Robert C Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 1998.\n\nVladimir Vapnik. Estimation of dependences based on empirical data. Springer-Verlag, 1982.\n\nVladimir Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, 1992.\n\nRoman Vershynin. Introduction to the non-asymptotic analysis of random matrices. Cambridge University Press, 2012.\n\nNicolas Verzelen and Elisabeth Gassiat. Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli, 2018.\n\nLie Wang, Lawrence D Brown, T Tony Cai, Michael Levine, et al. Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics, 2008.", "award": [], "sourceid": 8339, "authors": [{"given_name": "Dylan", "family_name": "Foster", "institution": "MIT"}, {"given_name": "Akshay", "family_name": "Krishnamurthy", "institution": "Microsoft"}, {"given_name": "Haipeng", "family_name": "Luo", "institution": "University of Southern California"}]}