{"title": "Thompson Sampling for Multinomial Logit Contextual Bandits", "book": "Advances in Neural Information Processing Systems", "page_first": 3151, "page_last": 3161, "abstract": "We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments that maximizes the expected cumulative revenue, or alternatively, minimize the expected regret. The feedback here is the item that the user picks from the assortment.  The distinguishing feature in this work is that this feedback has a multinomial logistic distribution. The utility of each item is a dynamic function of contextual information of both the item and the user.\nWe propose two Thompson sampling algorithms for this multinomial logit contextual bandit. Our first algorithm maintains a posterior distribution of the true parameter and establishes  $\\tilde{O}(d\\sqrt{T})$ Bayesian regret over $T$ rounds with $d$ dimensional context vector. The worst-case computational complexity of this algorithm could be high when the prior distribution is not a conjugate.  The second algorithm approximates the posterior by a Gaussian distribution, and uses a new optimistic sampling procedure to address the issues that arise in worst-case regret analysis. This algorithm achieves $\\tilde{O}(d^{3/2}\\sqrt{T})$ worst-case (frequentist) regret bound. The numerical experiments show that the practical performance of both methods is in line with the theoretical guarantees.", "full_text": "Thompson Sampling for Multinomial Logit\n\nContextual Bandits\n\nMin-hwan Oh\n\nColumbia University\n\nNew York, NY\n\nm.oh@columbia.edu\n\nGarud Iyengar\n\nColumbia University\n\nNew York, NY\n\ngarud@ieor.columbia.edu\n\nAbstract\n\nWe consider a dynamic assortment selection problem where the goal is to offer\na sequence of assortments that maximizes the expected cumulative revenue, or\nalternatively, minimize the expected regret. The feedback here is the item that the\nuser picks from the assortment. 
The distinguishing feature in this work is that this feedback is given by a multinomial logit choice model. The utility of each item is a dynamic function of contextual information of both the item and the user. We refer to this problem as the multinomial logit contextual bandit. We propose two Thompson sampling algorithms for this multinomial logit contextual bandit. Our first algorithm maintains a posterior distribution of the unknown parameter and establishes $\tilde{O}(d\sqrt{T})$^1 Bayesian regret over $T$ rounds with a $d$-dimensional context vector. The second algorithm approximates the posterior by a Gaussian distribution and uses a new optimistic sampling procedure to address the issues that arise in worst-case regret analysis. This algorithm achieves an $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case (frequentist) regret bound. The numerical experiments show that the practical performance of both methods is in line with the theoretical guarantees.

1 Introduction

In the stochastic multi-armed bandit (MAB) problem [10, 27], the learning agent selects one of $N$ actions (or items) and receives a revenue feedback corresponding to the chosen action in each round. The objective is to maximize the cumulative revenue over a finite horizon of length $T$, or alternatively, to minimize the cumulative regret, defined as the difference between the cumulative revenues of the optimal strategy and the agent's strategy. The main challenge in MAB problems is to appropriately balance the trade-off between exploitation, i.e., pulling the best empirical arm, and exploration, i.e., experimenting with arms which have not been pulled sufficiently often. Balancing strategies for this exploration-exploitation trade-off typically fall into two categories: upper confidence bound (UCB) methods [9, 18] and Thompson sampling (TS) based methods [42].
(Besides UCB and TS, one may also consider the $\epsilon$-greedy approach [24].)

UCB methods maintain a confidence set for the unknown true parameter, and in each step, choose the most optimistic parameter from this set, and pull the optimal arm corresponding to this optimistic parameter value. The confidence set is updated based on the revenue feedback which is revealed after an arm is pulled. TS assumes a prior distribution over the parameters defining the reward distribution. At each step, a parameter value is sampled from the posterior distribution, and an optimal arm corresponding to the sampled parameter is pulled. Upon observing the reward for each round, the posterior distribution is updated via Bayes' rule. TS has been successfully applied in a wide range of settings [40, 13, 38].

^1 $\tilde{O}$ suppresses logarithmic dependence.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

While UCB algorithms have simple implementations and good theoretical regret bounds [29], TS has been shown to achieve better empirical performance in many simulated and real-world settings without sacrificing simplicity [13, 23]. In order to bridge this gap, many recent studies have focused on the analysis of worst-case regret and Bayesian regret in TS approaches for both contextual bandit and reinforcement learning settings [5, 7, 38, 3]. The main technical difficulty in analyzing regret in TS lies in controlling the deviation introduced by the randomness of the algorithm.

In this paper, we consider dynamic assortment selection with contextual information, which is a combinatorial variant of the contextual bandit problem. The goal is to offer a sequence of assortments of at most $K$ items from a set of $N$ possible items that minimizes regret. The feedback here is the particular item chosen by the user from the offered assortment.
This problem arises in many real-world applications such as online retailing, streaming services, news feeds, online advertising, etc. We assume that the item choice is given by a multinomial logit (MNL) choice model [33]. This is one of the most widely used models in the dynamic assortment optimization literature [12, 37, 39, 6, 7, 14]. The utility of each item that defines the MNL choice probability is assumed to be a linear function of $d$-dimensional contextual information, or a set of $d$ features. This contextual information can combine information about both the item and the user, and is allowed to change over time.

The MNL contextual bandit is a multinomial generalization of generalized linear contextual bandits [23, 30], particularly logistic bandits, that reduces to generalized linear bandits when the assortment contains a single item. However, this extension is non-trivial since the MNL model cannot be expressed in the form of a generalized linear model [15]; hence, the results for generalized linear bandits do not directly apply. Also, in contrast to standard contextual bandit problems, in the MNL contextual bandit the item choice (feedback) is a function of the entire offered assortment. Thus, the regret analysis is more complicated. Furthermore, we allow the context vector to vary arbitrarily over time; thus, offering the same assortment repeatedly several times to learn the parameter values [6, 7] is no longer an effective strategy.

We propose two Thompson sampling algorithms for this multinomial logit contextual bandit. To our knowledge, these are the first TS algorithms for this problem.

(a) The first algorithm maintains a posterior distribution of the true parameter and establishes $\tilde{O}(d\sqrt{T})$ Bayesian regret.

(b) The second algorithm approximates the posterior by a Gaussian distribution and uses a new optimistic sampling procedure to address the issues that arise in worst-case regret analysis. We establish an $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case (frequentist) regret bound for this algorithm.

The additional $\sqrt{d}$ factor in the regret of the second algorithm is due to the deviation from the random sampling in TS, which is addressed in the worst-case regret analysis and is consistent with the results for TS methods for linear bandits [5, 3]. Both regret bounds are free of the candidate item set size $N$, which implies that our TS algorithms can be applied to a large item set. The TS algorithms we propose are efficient to implement as long as the assortment optimization step is solved efficiently, for which our TS algorithms can exploit efficient polynomial-time algorithms [36, 20]. This is a significant advantage over the previously proposed UCB method in [15], which computes the confidence bound for each assortment (i.e., for each of the total $N$ choose $K$ assortments). Furthermore, the numerical experiments show that the practical performance of the proposed methods is in line with the theoretical guarantees.

2 Related Work

The MNL model [34, 33, 32] is one of the most widely used choice models for assortment selection problems. The problem of computing the optimal assortment (the static assortment optimization problem), when the MNL parameters, i.e., user preferences, are known a priori, is well-studied [41, 21, 22]. Our work belongs to the literature on dynamic assortment optimization. [12] consider the setting where the demand for items in an assortment is independent. [37] and [39] consider the problem of minimizing regret under the MNL choice model and present an "explore first, then exploit" approach. [37] showed an $O(N^2 \log^2 T)$ regret bound, where $N$ is the number of total candidate items. [39] later improved the bound to $O(N \log T)$.
However, these methods require a priori knowledge of the "separability" between the true optimal assortment and the other sub-optimal alternatives.

More recent work by [6, 7, 16, 14, 15] also incorporated MNL models into dynamic assortment optimization and formulated the problem as an online regret minimization problem without requiring a priori knowledge of separability. [6] proposed a UCB-style algorithm which shows an $\tilde{O}(\sqrt{NT})$ regret bound. [7] achieve the same order of regret bound $\tilde{O}(\sqrt{NT})$ using a TS approach with improved empirical performance. [14] show a matching lower bound of $\Omega(\sqrt{NT})$. All of this previous work on MNL bandits assumes each item is associated with a unique parameter, i.e., one cannot learn across items. In our proposed MNL contextual bandits, the utility of item $i$ at round $t$ is of the form $x_{ti}^\top \theta^*$ for some fixed but unknown utility parameter $\theta^*$; hence, we can learn across items. When the feature dimension $d \ll \sqrt{N}$, learning across items allows one to reduce the regret bound from $\tilde{O}(\sqrt{NT})$ to $\tilde{O}(d\sqrt{T})$. However, one cannot directly incorporate (time-varying) contextual information into the previous work (see, e.g., [6, 7]) since these methods require that the same assortment be offered repeatedly for a random number of rounds until an outside choice (no purchase) is observed. [15] proposed a UCB method which establishes an $\tilde{O}(d\sqrt{T})$ regret bound for an MNL contextual bandit similar to our setting. Apart from the fact that their method is UCB based, there is another fundamental difference between [15] and our work. [15] enumerates the exponentially many ($N$ choose $K$) assortments and builds confidence bounds for each of them. In contrast, our methods only maintain uncertainty estimates for each of the $N$ different items.

It is also worth mentioning work on the personalized MNL-bandit problem [25, 17, 11].
These works consider each item utility separately and learn $N$ different parameters; hence there is no generalization across different items, which differs from our setting. Perhaps the most relevant among these personalized MNL bandit methods is [17], which proposed a TS algorithm for their problem. However, they only provide the Bayesian regret, which is relatively easier to control than the worst-case regret (we discuss this aspect in Section 5), and again their method (as well as the other personalized MNL bandit methods) still learns $N$ separate parameters, one per item; hence it is not scalable to a large item set (i.e., large $N$).

Linear contextual bandits [2, 9, 19, 36, 1, 18, 5] have been widely studied. [23] and [30] extend the linear contextual bandit to the scalar, monotone, generalized linear bandit using a UCB-type approach. In most of these linear bandits or generalized linear bandits, balancing exploitation and exploration can be done simply by taking an action that maximizes the sum of the mean reward and the variance. [5] define TS for the linear contextual bandit as a Bayesian algorithm where a Gaussian prior over $\theta^*$ is updated according to the observed rewards, a random sample is drawn from the posterior, and the corresponding optimal arm is selected at each step. They show an $\tilde{O}(d^{3/2}\sqrt{T})$ worst-case regret bound. Following the work of [5], [3] show that TS does not need to sample from an actual Bayesian posterior distribution and that any distribution satisfying suitable concentration and anti-concentration properties guarantees a small regret, and they provide an alternative proof of TS achieving the same regret bound $\tilde{O}(d^{3/2}\sqrt{T})$. However, these results on (generalized) linear contextual bandits (either UCB or TS) do not apply directly to our MNL contextual bandit problem, since the choice probability of an item in an assortment is non-linear and non-monotone in the MNL parameter $\theta^*$.
It is also worthwhile to mention a line of work on other combinatorial bandit problems [35, 43, 26], mostly with semi-bandit feedback or cascading feedback. Our work is distinct from these combinatorial bandit problems since, in cascading or semi-bandit settings, the mapping from the item context to the user feedback is still independent of the other items in an offered set; hence it does not take the substitution effect into account. On the other hand, MNL choice feedback is a function of the entire assortment, which makes our analysis more challenging.

3 Problem Formulation

3.1 Notations

For a vector $x \in \mathbb{R}^d$, we use $\|x\|$ to denote its $\ell_2$-norm and $x^\top$ its transpose. The weighted $\ell_2$-norm associated with a positive-definite matrix $V$ is defined by $\|x\|_V := \sqrt{x^\top V x}$. The minimum and maximum singular values of a matrix $V$ are written as $\sigma_{\min}(V)$ and $\|V\|$, respectively. The trace of a matrix $V$ is $\mathrm{trace}(V)$. For two symmetric matrices $V$ and $W$ of the same dimensions, $V \succeq W$ means that $V - W$ is positive semi-definite. For a positive integer $n$, we define $[n]$ to be the set containing the positive integers up to $n$, i.e., $\{1, 2, \ldots, n\}$. Finally, we define $\mathcal{S}$ to be the set of candidate assortments with size constraint at most $K$, i.e., $\mathcal{S} = \{S \subseteq [N] : |S| \le K\}$.

3.2 MNL Contextual Bandits

We formulate the problem of the MNL contextual bandit as follows. The decision-making agent can choose an assortment as a subset of the item set containing $N$ distinct items, indexed by $i \in [N]$. At round $t$, feature vectors $x_{ti} \in \mathbb{R}^d$ for every item $i \in [N]$ are revealed to the agent. Each feature vector combines the information of the user and the corresponding item $i$.
For example, suppose the user at round $t$ is characterized by a feature vector $v_t$ and item $i$ has a feature vector $w_{ti}$ (note that we allow the feature vectors of an item and a user to change over time); then we can use $x_{ti} = \mathrm{vec}(v_t w_{ti}^\top)$, the vectorized outer product of $v_t$ and $w_{ti}$, as the combined feature vector of item $i$ at round $t$. If $v_t$ is not available, we can use item-dependent features only, $x_{ti} = w_{ti}$. Given this contextual information, at every round $t$, the agent selects an assortment $S_t \in \mathcal{S}$ and observes the user choice, represented as a binary vector $y_t \in \{0,1\}^{|S_t|}$ where $y_{ti} = 1$ if the $i$-th item in assortment $S_t$ is chosen by the user and $y_{tj} = 0$ for all non-chosen items $j \in S_t$. Note that $\sum_{i \in S_t} y_{ti} \le 1$, and we allow an "outside option" ($i = 0$), which means the user does not choose any of the items offered in $S_t$, i.e., $y_{ti} = 0$ for all $i \in S_t$. This user choice is given by the MNL choice model. Under this model, the probability that a user chooses item $i \in S_t$ is given by

$$p_{ti}(S_t, \theta^*) = \frac{\exp\{x_{ti}^\top \theta^*\}}{1 + \sum_{j \in S_t} \exp\{x_{tj}^\top \theta^*\}}$$

where $\theta^* \in \mathbb{R}^d$ is an unknown time-invariant parameter and the 1 in the denominator accounts for the outside option, with $p_{t0}(S_t, \theta^*) = 1/(1 + \sum_{j \in S_t} \exp\{x_{tj}^\top \theta^*\})$. Then, the choice response variable $y_t = (y_{t0}, y_{t1}, \ldots, y_{tK})$ is a sample from this multinomial distribution:

$$y_t \sim \mathrm{multinomial}\big(1,\; p_{t0}(S_t, \theta^*), p_{t1}(S_t, \theta^*), \ldots, p_{tK}(S_t, \theta^*)\big)$$

where the 1 indicates that $y_t$ is a single-trial sample. Also, we define the noise $\epsilon_{ti} := y_{ti} - p_{ti}(S_t, \theta^*)$. Since $\epsilon_{ti}$ is bounded in an interval of length 1, $\epsilon_{ti}$ is $\sigma^2$-sub-Gaussian with $\sigma^2 = 1/4$. It is important to note that $\epsilon_{ti}$ is not independent across $i \in S_t$ due to the substitution effect in the MNL model.

The revenue parameter for each item $i$ is also revealed at round $t$, denoted by $r_{ti}$. Note that $r_{ti}$ is the revenue incurred by item $i$ if item $i$ is chosen by the user at round $t$.
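As an illustration, the MNL choice probabilities and the single-trial multinomial feedback described above can be computed directly from the item features. The following is a minimal NumPy sketch; the function and variable names are our own, not the paper's.

```python
import numpy as np

def mnl_choice_probs(X_S, theta):
    """Choice probabilities under the MNL model for an offered assortment.

    X_S   : (|S|, d) array of feature vectors x_{ti} for the items in S.
    theta : (d,) utility parameter.
    Returns (p_0, p_1, ..., p_{|S|}), where index 0 is the outside option
    whose utility is normalized to 0 (weight exp(0) = 1).
    """
    weights = np.exp(X_S @ theta)          # exp{x_{ti}^T theta} for i in S
    denom = 1.0 + weights.sum()            # the "1 +" accounts for the outside option
    return np.concatenate(([1.0 / denom], weights / denom))

def sample_choice(X_S, theta, rng):
    """Draw a single-trial multinomial sample y_t from the choice probabilities."""
    probs = mnl_choice_probs(X_S, theta)
    return rng.multinomial(1, probs)       # one-hot over {outside option} + S
```

Because a single draw is one-hot over the outside option and the offered items, the substitution effect is visible directly: adding an item to `X_S` lowers every other item's probability.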
Without loss of generality, we assume $|r_{ti}| \le 1$ for all $i$ and $t$. Then, the expected revenue corresponding to assortment $S_t$ is given by

$$R_t(S_t, \theta^*) = \sum_{i \in S_t} \frac{r_{ti}\, \exp\{x_{ti}^\top \theta^*\}}{1 + \sum_{j \in S_t} \exp\{x_{tj}^\top \theta^*\}}\,.$$

Let $S_t^*$ be the offline optimal assortment at round $t$ under full information, i.e., when $\theta^*$ is known and hence the true MNL probabilities $p_{ti}(S, \theta^*)$ are known a priori:

$$S_t^* = \arg\max_{S \in \mathcal{S}} R_t(S, \theta^*).$$

Consider a planning horizon $T$, where assortments can be offered at rounds $t = 1, \ldots, T$. The agent does not know the value of $\theta^*$ (hence $p_{ti}(S, \theta^*)$ is not known) and can only make sequential assortment decisions $S_1, \ldots, S_T$ at rounds $1, \ldots, T$, respectively. Hence, the main challenge is to construct an algorithm that simultaneously learns the unknown parameter $\theta^*$ and sequentially decides on the offered assortments based on past choices and observed responses, so as to maximize the cumulative expected revenue over the planning horizon. The performance of an algorithm is usually measured by the regret, which is the gap between the expected revenue generated by the assortments chosen by the algorithm and that of the offline optimal assortments. We define the (worst-case) cumulative expected regret as

$$R(T, \theta^*) = \sum_{t=1}^{T} \mathbb{E}\big[ R_t(S_t^*, \theta^*) - R_t(S_t, \theta^*) \mid \theta^* \big]$$

where $R_t(S_t^*, \theta^*)$ is the expected revenue of the offline optimal assortment at round $t$, and the expectation is taken over the random parameters and any randomization in the learning algorithm. When it is clear that we condition on a fixed $\theta^*$, we write $R(T) := R(T, \theta^*)$ in the rest of the paper.
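The expected revenue $R_t(S, \theta)$ is straightforward to evaluate for any candidate assortment. A small sketch of the formula above, with our own naming:

```python
import numpy as np

def expected_revenue(X_S, r_S, theta):
    """R_t(S, theta) = sum_{i in S} r_ti * exp(x_ti^T theta) / (1 + sum_j exp(x_tj^T theta)).

    X_S : (|S|, d) feature vectors of the items in S.
    r_S : (|S|,) per-item revenues, assumed |r_ti| <= 1.
    """
    weights = np.exp(X_S @ theta)                  # MNL preference weights
    return float((r_S * weights).sum() / (1.0 + weights.sum()))
```

The per-round regret is then simply `expected_revenue` of the offline optimal assortment minus that of the offered assortment, both evaluated at the true parameter.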
In Bayesian settings, i.e., when $\theta^*$ is randomly generated or the learning agent has a prior belief over $\theta^*$, the Bayesian cumulative regret [38] over horizon $T$ is defined as

$$R_{\mathrm{Bayes}}(T) = \mathbb{E}_{\theta^*}\big[ R(T, \theta^*) \big] = \sum_{t=1}^{T} \mathbb{E}\big[ R_t(S_t^*, \theta^*) - R_t(S_t, \theta^*) \big]$$

where the expectation is taken also over the distribution of $\theta^*$. In other words, $R_{\mathrm{Bayes}}(T)$ is a weighted average of $R(T, \theta^*)$ under the prior on $\theta^*$.

3.3 Assumptions

We introduce general assumptions on the structure of the problem.

Assumption 1. $\|x_{ti}\| \le 1$ for all $t$ and $i$. Also, $\|\theta^*\| \le 1$.

This assumption is used to make the regret bounds scale-free for convenience and is in fact standard in the bandit literature. If instead $\|x_{ti}\| \le C$ and $\|\theta^*\| \le C$ for some constant $C$, then our regret bounds would increase by a factor of $C$.

Assumption 2. There exists $\kappa > 0$ such that for every item $i \in S$, any $S \in \mathcal{S}$, and all rounds $t$, $\inf_{S \in \mathcal{S},\, \theta \in \mathbb{R}^d} p_{ti}(S, \theta)\, p_{t0}(S, \theta) \ge \kappa$.

Note that this is equivalent to a standard assumption in the generalized linear contextual bandit literature [23, 30] to ensure that the Fisher information matrix is invertible, adapted to suit our MNL setting. We discuss the need for this assumption in detail in Appendix A.

4 Algorithm: TS-MNL

In this section, we describe TS-MNL, our first TS algorithm for the MNL contextual bandit problem, and present its Bayesian regret bound. We first provide the definition of the posterior distribution $Q_t$ of the unknown parameter $\theta^*$. At the beginning of the learning phase, the agent knows that $\theta^*$ is distributed according to $Q_0$, the prior distribution. Then, at each round $t$, the agent has access to the observations up to round $t$, $\mathcal{D}_t = \{X_\tau, y_\tau\}_{\tau=1}^{t-1}$ where $X_\tau = \{x_{\tau i}\}_{i \in S_\tau}$.
Then the agent combines $Q_0$ and $\mathcal{D}_t$ to define the posterior distribution $Q_t(\theta)$:

$$Q_t(\theta) \propto Q_0(\theta)\, p(\mathcal{D}_t \mid \theta), \quad \text{where } p(\mathcal{D}_t \mid \theta) = \prod_{\tau=1}^{t-1} \prod_{i \in S_\tau} \big( p_{\tau i}(S_\tau, \theta) \big)^{y_{\tau i}} \quad (1)$$

and the "$\propto$" notation hides the partition function $\int Q_0(\vartheta)\, p(\mathcal{D}_t \mid \vartheta)\, d\vartheta$ in the denominator. In other words, the posterior distribution is proportional to the product of the prior distribution and the likelihood function. Note that there is no conjugate prior for the MNL model. Hence, sampling from $Q_t$ is intractable. In order to overcome this intractability, one may draw approximate samples using Markov chain Monte Carlo [8]. For ease of exposition, we assume the following in this section and in the Bayesian regret analysis. We will later provide a remedy for this intractability in the modification of our algorithm for the worst-case regret analysis.

Assumption 3. We can sample from $Q_t(\theta)$.

In each round $t$, the TS-MNL algorithm consists of three major steps. First, it randomly samples a parameter $\tilde{\theta}_t$ from the posterior distribution $Q_t$. Second, it computes the assortment choice $S_t$ under this sampled parameter $\tilde{\theta}_t$. Finally, $S_t$ is offered to the user and the feedback $y_t$ is observed. The pseudocode of TS-MNL is presented in Algorithm 1.

Algorithm 1 TS-MNL
1: Input: prior distribution $Q_0$
2: for all $t = 1$ to $T$ do
3:   Observe $x_{ti}$ and $r_{ti}$ for all $i \in [N]$
4:   Sample $\tilde{\theta}_t$ from the posterior distribution $Q_t$ in Eq. (1)
5:   Compute $S_t = \arg\max_{S \in \mathcal{S}} R_t(S, \tilde{\theta}_t)$
6:   Offer $S_t$ and observe $y_t$ (user choice at round $t$)
7: end for

Combinatorial Optimization. Algorithm 1 has a combinatorial optimization step in Line 5. There are efficient polynomial-time algorithms available to solve this combinatorial optimization problem [37, 20] for given utility estimates under the sampled parameter.
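For small instances, Line 5 of Algorithm 1 can also be solved by exhaustive search over all assortments of size at most $K$. The sketch below (brute force with our own helper names, not the polynomial-time LP of [20]) makes the step concrete:

```python
import numpy as np
from itertools import combinations

def best_assortment(X, r, theta, K):
    """Brute-force version of Line 5: argmax over S with |S| <= K of R_t(S, theta).

    X : (N, d) item feature matrix at round t; r : (N,) per-item revenues.
    Exponential in K -- only sensible for small N and K; use the LP of [20] otherwise.
    """
    N = X.shape[0]
    weights = np.exp(X @ theta)                    # exp(x_ti^T theta), precomputed once
    best_S, best_rev = (), 0.0                     # the empty assortment earns 0
    for k in range(1, K + 1):
        for S in combinations(range(N), k):
            idx = list(S)
            rev = (r[idx] * weights[idx]).sum() / (1.0 + weights[idx].sum())
            if rev > best_rev:
                best_S, best_rev = S, rev
    return best_S, best_rev
```

This also shows why a naive search is costly: the loop visits all $\binom{N}{k}$ subsets for $k \le K$, which is exactly the enumeration the LP formulation avoids.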
In particular, we can use the solution of the linear programming (LP) formulation presented in [20] for this optimization step.

4.1 Bayesian Regret of TS-MNL

We state the Bayesian cumulative regret bound for Algorithm 1 in Theorem 1. We also provide an overview of how the regret bound is established.

Theorem 1. Suppose we run TS-MNL (Algorithm 1) for a total of $T$ rounds with assortment size constraint $K$. Then the Bayesian regret of the algorithm is upper-bounded by

$$R_{\mathrm{Bayes}}(T) \le O(1) + \left[ \frac{1}{\kappa}\sqrt{2d \log\Big(1 + \frac{TK}{d^2}\Big)} + 2\log T + \frac{\sqrt{d}}{\kappa} \right] \cdot \sqrt{2dT \log\Big(1 + \frac{TK}{d^2}\Big)} = O\left( d\sqrt{T}\, \log\Big(1 + \frac{TK}{d^2}\Big) \right).$$

Theorem 1 establishes $\tilde{O}(d\sqrt{T})$ Bayesian regret. [15] established the lower bound $\Omega(d\sqrt{T}/K)$ for MNL contextual bandits under almost identical settings. When $K$ is small and fixed (which is typically true in many applications), Theorem 1 demonstrates that TS-MNL is almost optimal. Furthermore, the regret bound is completely free of $N$; hence TS-MNL is applicable to the case of a large number of items (large $N$). Also, if $K \le d^2$, the regret bound becomes free of $K$. In Section 6, we introduce modifications to TS-MNL for the worst-case regret analysis, which include the explicit use of a regularized MLE for parameter estimation and sampling from a Gaussian distribution, instead of maintaining the actual posterior, to overcome the intractability. The concentration results derived for the Bayesian regret analysis in this section serve as a building block for the worst-case regret analysis of the modified algorithm.

The proof outline of Theorem 1 is motivated by [38, 43]. Given $\mathcal{F}_t$, which contains all available information up to round $t$, $\tilde{\theta}_t$ and $\theta^*$ are i.i.d. with the posterior distribution $Q_t$ from the Bayesian perspective. Also, the optimization step is a fixed combinatorial optimization, and $\{x_{ti}\}_{i \in [N]}$ are fixed given $\mathcal{F}_t$.
Hence, conditioning on $\mathcal{F}_t$, $S_t$ and $S_t^*$ are also i.i.d. Therefore, the expected regret pertaining to the random sampling is 0. Then, we control the estimation error of $\theta^*$, for which we utilize finite-sample concentration results for the MNL parameter. The proofs are left to Appendix B.

5 Worst-Case Regret

Algorithm 1 is still valid in a frequentist setting, i.e., when the true parameter is not a random variable but a fixed parameter. However, when analyzing the worst-case regret (also known as frequentist regret) of the algorithm, the main technical difficulty lies in controlling the deviation in performance due to the random sampling of the algorithm. Note that in the Bayesian regret analysis, controlling this sampling deviation is not addressed because of the assumption that $\tilde{\theta}_t$ and $\theta^*$ are i.i.d. conditioned on $\mathcal{F}_t$. However, this no longer holds when $\theta^*$ is fixed; hence the worst-case regret analysis needs to ensure that the deviation due to sampling is small enough. To see this, we decompose the worst-case immediate regret into a few components:

$$\begin{aligned} R(t) &= \mathbb{E}\big[R_t(S_t^*, \theta^*) - R_t(S_t, \theta^*)\big] \\ &= \mathbb{E}\big[R_t(S_t^*, \theta^*) - R_t(S_t^*, \tilde{\theta}_t)\big] + \mathbb{E}\big[R_t(S_t^*, \tilde{\theta}_t) - R_t(S_t, \tilde{\theta}_t)\big] + \mathbb{E}\big[R_t(S_t, \tilde{\theta}_t) - R_t(S_t, \theta^*)\big] \\ &\le \mathbb{E}\big[R_t(S_t^*, \theta^*) - R_t(S_t^*, \tilde{\theta}_t)\big] + \mathbb{E}\big[R_t(S_t, \tilde{\theta}_t) - R_t(S_t, \theta^*)\big] \end{aligned} \quad (2)$$

The inequality comes from the fact that our assortment choice at round $t$, $S_t$, is optimal under $\tilde{\theta}_t$; hence $R_t(S_t^*, \tilde{\theta}_t) \le R_t(S_t, \tilde{\theta}_t)$. The second term $\mathbb{E}[R_t(S_t, \tilde{\theta}_t) - R_t(S_t, \theta^*)]$ in Eq. (2) is relatively easy to control. We can show that this term can be bounded by combining an upper bound on the estimation error $|x^\top(\hat{\theta}_t - \theta^*)|$ with the concentration of the sampling distribution of $\tilde{\theta}_t$.
However, controlling the first term $\mathbb{E}[R_t(S_t^*, \theta^*) - R_t(S_t^*, \tilde{\theta}_t)]$ in Eq. (2) is more challenging in the frequentist analysis. First, note that $\mathbb{E}[R_t(S_t^*, \theta^*) - R_t(S_t^*, \tilde{\theta}_t)] = 0$ in the Bayesian regret, by the assumption that $\theta^*$ and $\tilde{\theta}_t$ are i.i.d. conditioned on $\mathcal{F}_t$, as mentioned earlier. However, this is no longer true in the worst-case regret analysis. In the worst-case regret analysis of TS, this term is controlled by showing that the sampled parameter is optimistic frequently enough. In other words, we need to lower-bound the probability of the sampled parameter being optimistic, i.e., $\mathbb{P}\big( R_t(S_t^*, \tilde{\theta}_t) \ge R_t(S_t^*, \theta^*) \mid \mathcal{F}_t \big) \ge p$ for some parameter-free $p > 0$.

To describe the challenge in our MNL contextual bandit problem, we present the following lemma, which shows that the expected revenue of the optimal assortment is monotonically increasing with an increase in the utility estimates.

Lemma 1 ([6], Lemma 4.2). Suppose $S_t^*$ is the optimal assortment under the true parameter $\theta^*$ at round $t$, i.e., $S_t^* = \arg\max_{S \in \mathcal{S}} R_t(S, \theta^*)$. Also suppose that $x_{ti}^\top \theta^* \le x_{ti}^\top \theta'$ for all $i \in S_t^*$. Then $R_t(S_t^*, \theta^*) \le R_t(S_t^*, \theta')$.

Note that Lemma 1 shows the monotonicity of the expected revenue only for the optimal assortment; it does not claim that the expected revenue is a monotone function for all assortments in general.
This lemma implies that we can lower-bound the probability of having an optimistic expected revenue under the sampled parameter:

$$\mathbb{P}\big( R_t(S_t^*, \tilde{\theta}_t) \ge R_t(S_t^*, \theta^*) \mid \mathcal{F}_t \big) \ge \mathbb{P}\big( x_{ti}^\top \tilde{\theta}_t \ge x_{ti}^\top \theta^*, \; \forall i \in S_t^* \mid \mathcal{F}_t \big)$$

However, this makes the probability of being optimistic exponentially small in the size of the assortment $S_t^*$, i.e., exponentially small in $O(K)$, which in turn results in an exponential dependence on $O(K)$ in the worst-case regret bound. In order to overcome this issue, we adopt a few modifications to the algorithm, which we discuss in the following section.

6 TS-MNL with Optimistic Sampling

Sampling from a Gaussian Distribution. We modify our TS algorithm into a generic randomized algorithm built on the regularized MLE, rather than sampling from an actual Bayesian posterior. [3] show that TS does not need to sample from an actual posterior distribution and that any distribution satisfying suitable concentration and anti-concentration properties guarantees a small regret. Specifically, instead of sampling from the posterior $Q_t$, we sample $\tilde{\theta}_t$ from the Gaussian distribution $\mathcal{N}(\hat{\theta}_t, \alpha_t^2 V_t^{-1})$, where $\hat{\theta}_t$ is the regularized MLE, the minimizer of Eq. (3), and $\alpha_t$ is the confidence radius. This ensures tractability of the sampling distribution. Furthermore, this Gaussian approximation allows us to adopt optimistic sampling (which we discuss below) in an efficient manner.

Optimistic Sampling. The optimistic sampling we present here is the key ingredient in avoiding the theoretical challenges present in the worst-case regret analysis. For optimistic sampling, instead of drawing a single sample $\tilde{\theta}_t$, we draw $M$ independent samples $\{\tilde{\theta}_t^{(j)}\}_{j=1}^{M}$ from $\mathcal{N}(\hat{\theta}_t, \alpha_t^2 V_t^{-1})$ (the exact value of $M$ is specified in Theorem 2).
Then we compute the optimistic utility estimate $\tilde{u}_{ti}$ for each $i \in [N]$:

$$\tilde{u}_{ti} = \max_{j} \; x_{ti}^\top \tilde{\theta}_t^{(j)}.$$

We define $\tilde{R}_t(S)$ to be the expected revenue of assortment $S$ based on $\tilde{u}_{ti}$:

$$\tilde{R}_t(S) = \frac{\sum_{i \in S} r_{ti} \exp\{\tilde{u}_{ti}\}}{1 + \sum_{j \in S} \exp\{\tilde{u}_{tj}\}}$$

Note that this optimistic sampling scheme is different from that proposed in [7]. The setting in [7] is non-contextual, and they use a one-dimensional Gaussian random variable to correlate the samples of the utilities of the $K$ items, in order to ensure that the probability that all samples are simultaneously optimistic is a constant. This correlated sampling reduces the overall variance severely; hence they propose taking $K$ samples instead of a single sample to increase the variance. In contrast, we take multiple samples of the multivariate Gaussian distribution to directly ensure that the probability of an optimistic sample is sufficiently large.

The pseudocode of the modified algorithm is presented in Algorithm 2. As before, we can utilize the LP solution [20] for the optimization step in Line 6. The modified algorithm now explicitly maintains the matrix $V_t$ and computes the regularized MLE $\hat{\theta}_t$. Note that $\alpha_T$ can be replaced by
Note that \u21b5T can be replaced by\n\n\u21b5t = O\u21e3qd log1 + tK\n\nanalysis holds for either case.\n\nd + 4 log t\u2318 at round t, if the planning horizon T is not known and the\n\n7\n\n\fAlgorithm 2 TS-MNL with Optimistic Sampling\n1: Input: sample size M, con\ufb01dence radius \u21b5T , penalty parameter \n2: for all t = 1 to T do\n3:\n4:\n5:\n6:\n7:\n8:\n9:\n\nObserve xti and rti for all i 2 [N ]\nSample {e\u2713(j)\nComputeeuti = maxj x>tie\u2713(j)\nCompute St = arg maxS2S eRt(S)\nUpdate Vt+1 Vt +Pi2St\n\nj=1 independently from N (\u02c6\u2713t,\u21b5 2\n\nCompute the regularized MLE \u02c6\u2713t by minimizing\n\nOffer St and observe yt (user choice at round t)\n\nfor all i 2 [N ]\n\nt }M\n\nT V 1\n\nxtix>ti\n\nt\n\n)\n\nt\n\n\n\ntX\u2327 =1Xi2S\u2327\n\ny\u2327i log p\u2327i (S\u2327 ,\u2713 ) +\n\n\n2k\u2713k2.\n\n(3)\n\n10: end for\n\n6.1 Worst-Case Regret of TS-MNL with Optimistic Sampling\n\nTheorem 2. Suppose we run TS-MNL with \u201coptimistic sampling\u201d (Algorithm 2) for a total of T\nrounds with optimistic sample size M = d1 \nlog(11/(4pe\u21e1))e, the penalty parameter   1 and\nassortment size constraint K. Then the worst-case regret of the algorithm is upper-bounded by\n\nlog K\n\nR(T ) \uf8ffO (1) + 16pe\u21e1T s2dT log\u27131 +\nd \u25c6\n\n+ (\u21b5T + T )s2dT log\u27131 +\n\nT K\n\nT K\n\nd \u25c6 +r 8T\n\n\n\nlog 2T!\n\nwhere \u21b5T = 1\n\n2\uf8ffqd log1 + T K\n\nd + 4 log T +\n\np\n\n\uf8ff and T = \u21b5Tp2d log(M T ).\n\nTheorem 2 establishes eO(d3/2pT ) worst-case regret, which matches the regret bounds of TS methods\nfor linear contextual bandits [5, 3] up to logarithmic factor. The regret bound shows no dependence\non N, and has an additional O(plog log K) dependence due to optimistic sampling which is very\nsmall for any reasonable assortment size K. 
Compared to Theorem 1, the additional factor $\sqrt{d}$ comes from the deviation of the random sampling, which is addressed in the worst-case regret analysis.

The proof of Theorem 2 utilizes the anti-concentration property of the maximum of Gaussian random variables to ensure frequent optimism. In particular, we show in the following lemma that the proposed optimistic sampling ensures a constant probability of optimism.

Lemma 2. Suppose $\|\hat{\theta}_t - \theta^*\|_{V_t} \leq \frac{1}{2\kappa}\sqrt{d \log(1 + \frac{tK}{d}) + 4 \log t} + \frac{\sqrt{\lambda}}{\kappa}$ and we take optimistic samples of size $M = \big\lceil 1 - \frac{\log K}{\log(1 - 1/(4\sqrt{e\pi}))} \big\rceil$. Then we have
$$P\Big(\tilde{R}_t(S_t) > R_t(S_t^*, \theta^*) \;\Big|\; \mathcal{F}_t\Big) \geq \frac{1}{4\sqrt{e\pi}}.$$

The inverse of the lower-bounding probability, $4\sqrt{e\pi}$, can be interpreted as the expected time between any two optimistic assortment selections. In other words, our modified algorithm is optimistic at least with a constant frequency. Using this frequent optimism, we can bound the cumulative regret due to the random sampling. Along with this result, we show the concentration of both the regularized MLE and the TS samples to establish the regret bound in Theorem 2. The proofs are left to Appendix D.

7 Numerical Study

In this section, we perform numerical evaluations to analyze two variants of our proposed algorithm: TS-MNL with optimistic sampling (Algorithm 2) and TS-MNL with the Gaussian approximation for the posterior distribution. We perform both synthetic experiments and simulated experiments using a real-world dataset, the MovieLens dataset.² We simulate instances of the MNL contextual bandit problem with varying parameter values.

Figure 1: Regret growth with $T$ for a UCB method and TS-MNL variants on MNL contextual bandits.

We report the worst-case cumulative expected regret for each of the experiments.
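The MNL feedback in such a simulated instance (the user picks one offered item, or the outside no-purchase option) can be generated as in the following minimal sketch; the function name and dimensions are our own illustration, not the paper's experimental code.

```python
import numpy as np

def sample_mnl_choice(X_S, theta_star, rng):
    """Sample the user's pick from an offered assortment under the MNL model:
    item i in S is chosen with probability
    exp(x_i^T theta*) / (1 + sum_{j in S} exp(x_j^T theta*)),
    and the returned index 0 denotes the outside (no-purchase) option."""
    w = np.concatenate(([1.0], np.exp(X_S @ theta_star)))  # utility 0 outside
    return int(rng.choice(len(w), p=w / w.sum()))
```

Since $\theta^*$ is known in such a simulation, the per-round expected revenue of any offered assortment, and hence the expected regret against the optimal assortment, can be computed exactly from these choice probabilities.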
For the synthetic experiments, we randomly draw $\theta^*$ for each instance, and hence we can directly compute the expected regret using $\theta^*$. For the experiments using the MovieLens dataset, we use offline regression on the entire dataset to estimate the unknown parameter $\theta^*$ and compare with the estimates from the online experiments. The details of the experimental setup and additional experimental results are presented in Appendix G.

Figure 1 shows the performance averaged over 40 independent instances for each experiment. For comparison, we evaluate our TS-MNL algorithms along with the UCB method proposed in [15]. The two proposed variants of TS-MNL outperform the UCB method on the synthetic data in our experiments, which is consistent with other empirical evidence for TS methods in the literature. The experiments with the MovieLens dataset (and the additional experiments in Appendix G) suggest that our methods are effective for problem instances with a large number of items, i.e., large $N$. Furthermore, TS-MNL with optimistic sampling consistently performs better than TS-MNL with the Gaussian approximation only. The results of these experiments support our theoretical analysis: TS-MNL with optimistic sampling takes advantage of the MNL structure and can guarantee worst-case statistical efficiency.

8 Discussion

In this paper, we study the dynamic assortment selection problem under an MNL model with contextual information. We propose two TS algorithms for MNL contextual bandits which learn the parameters of the underlying choice model while simultaneously maximizing the cumulative revenue. We provide theoretical performance bounds and show attractive numerical performance in our experiments.
We also discuss the challenges that arise in the worst-case regret analysis for this combinatorial action selection problem under the MNL model. We believe that these challenges are potentially present in many other problems involving combinatorial action selection with context/feature information beyond the MNL model. To our knowledge, the worst-case regret analysis in this work is the first frequentist regret guarantee for contextual bandits with combinatorial action selection of any kind. We believe that our proposed optimistic sampling framework can be useful for other combinatorial contextual bandit problems.

² https://grouplens.org/datasets/movielens/

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.

[2] Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In International Conference on Machine Learning, pages 3–11, 1999.

[3] Marc Abeille and Alessandro Lazaric. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.

[4] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, volume 55. Courier Corporation, 1965.

[5] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.

[6] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. MNL-bandit: A dynamic learning approach to assortment selection. arXiv preprint arXiv:1706.03880, 2017.

[7] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Thompson sampling for the MNL-bandit.
In Conference on Learning Theory, pages 76–78, 2017.

[8] Christophe Andrieu, Nando de Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

[9] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.

[10] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[11] Fernando Bernstein, Sajad Modaresi, and Denis Sauré. A dynamic clustering approach to data-driven assortment personalization. Management Science, 2018.

[12] Felipe Caro and Jérémie Gallien. Dynamic assortment with demand learning for seasonal consumer goods. Management Science, 53(2):276–292, 2007.

[13] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

[14] Xi Chen and Yining Wang. A note on tight lower bound for MNL-bandit assortment selection models. arXiv preprint arXiv:1709.06109, 2017.

[15] Xi Chen, Yining Wang, and Yuan Zhou. Dynamic assortment optimization with changing contextual information. arXiv preprint arXiv:1810.13069, 2018.

[16] Wang Chi Cheung and David Simchi-Levi. Assortment optimization under unknown multinomial logit choice models. arXiv preprint arXiv:1704.00108, 2017.

[17] Wang Chi Cheung and David Simchi-Levi. Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models. Available at SSRN 3075658, 2017.

[18] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions.
In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.

[19] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory, pages 355–366, 2008.

[20] James Davis, Guillermo Gallego, and Huseyin Topaloglu. Assortment planning under the multinomial logit model with totally unimodular constraint structures. 2013.

[21] James M. Davis, Guillermo Gallego, and Huseyin Topaloglu. Assortment optimization under variants of the nested logit model. Operations Research, 62(2):250–273, 2014.

[22] Antoine Désir, Vineet Goyal, and Jiawei Zhang. Near-optimal algorithms for capacity constrained assortment optimization. Available at SSRN 2543309, 2014.

[23] Sarah Filippi, Olivier Cappé, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.

[24] Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.

[25] Nathan Kallus and Madeleine Udell. Dynamic assortment personalization in high dimensions. arXiv preprint arXiv:1610.05604, 2016.

[26] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In International Conference on Machine Learning, pages 767–776, 2015.

[27] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press (preprint), 2019.

[28] Erich L. Lehmann and George Casella. Theory of Point Estimation. Springer Science & Business Media, 2006.

[29] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation.
In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.

[30] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071–2080, 2017.

[31] Shuai Li, Tor Lattimore, and Csaba Szepesvari. Online learning to rank with features. In International Conference on Machine Learning, pages 3856–3865, 2019.

[32] R. Duncan Luce. Individual Choice Behavior: A Theoretical Analysis. Courier Corporation, 2012.

[33] Daniel McFadden. Modeling the choice of residential location. Transportation Research Record, (673), 1978.

[34] Robin L. Plackett. The analysis of permutations. Applied Statistics, 24(2):193–202, 1975.

[35] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 461–469. SIAM, 2014.

[36] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

[37] Paat Rusmevichientong, Zuo-Jun Max Shen, and David B. Shmoys. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations Research, 58(6):1666–1680, 2010.

[38] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.

[39] Denis Sauré and Assaf Zeevi. Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management, 15(3):387–404, 2013.

[40] Malcolm Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000.

[41] Kalyan Talluri and Garrett Van Ryzin.
Revenue management under a general discrete choice model of consumer behavior. Management Science, 50(1):15–33, 2004.

[42] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.

[43] Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pages 1113–1122, 2015.