{"title": "Fighting Boredom in Recommender Systems with Linear Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1757, "page_last": 1768, "abstract": "A common assumption in recommender systems (RS) is the existence of a best fixed recommendation strategy. Such strategy may be simple and work at the item level (e.g., in multi-armed bandit it is assumed one best fixed arm/item exists) or implement more sophisticated RS (e.g., the objective of A/B testing is to find the\nbest fixed RS and execute it thereafter). We argue that this assumption is rarely verified in practice, as the recommendation process itself may impact the user\u2019s\npreferences. For instance, a user may get bored by a strategy, while she may gain interest again, if enough time passed since the last time that strategy was used. In\nthis case, a better approach consists in alternating different solutions at the right frequency to fully exploit their potential. In this paper, we first cast the problem as\na Markov decision process, where the rewards are a linear function of the recent history of actions, and we show that a policy considering the long-term influence\nof the recommendations may outperform both fixed-action and contextual greedy policies. We then introduce an extension of the UCRL algorithm (LINUCRL) to\neffectively balance exploration and exploitation in an unknown environment, and we derive a regret bound that is independent of the number of states. 
Finally,\nwe empirically validate the model assumptions and the algorithm in a number of realistic scenarios.", "full_text": "Fighting Boredom in Recommender Systems with\n\nLinear Reinforcement Learning\n\nRomain Warlop\n\n\ufb01fty-\ufb01ve, Paris, France\n\nSequeL Team, Inria Lille, France\n\nromain@fifty-five.com\n\nAlessandro Lazaric\nFacebook AI Research\n\nParis, France\n\nlazaric@fb.com\n\nJ\u00e9r\u00e9mie Mary\nCriteo AI Lab\nParis, France\n\nj.mary@criteo.com\n\nAbstract\n\nA common assumption in recommender systems (RS) is the existence of a best\n\ufb01xed recommendation strategy. Such strategy may be simple and work at the item\nlevel (e.g., in multi-armed bandit it is assumed one best \ufb01xed arm/item exists) or\nimplement more sophisticated RS (e.g., the objective of A/B testing is to \ufb01nd the\nbest \ufb01xed RS and execute it thereafter). We argue that this assumption is rarely\nveri\ufb01ed in practice, as the recommendation process itself may impact the user\u2019s\npreferences. For instance, a user may get bored by a strategy, while she may gain\ninterest again, if enough time passed since the last time that strategy was used. In\nthis case, a better approach consists in alternating different solutions at the right\nfrequency to fully exploit their potential. In this paper, we \ufb01rst cast the problem as\na Markov decision process, where the rewards are a linear function of the recent\nhistory of actions, and we show that a policy considering the long-term in\ufb02uence\nof the recommendations may outperform both \ufb01xed-action and contextual greedy\npolicies. We then introduce an extension of the UCRL algorithm (LINUCRL) to\neffectively balance exploration and exploitation in an unknown environment, and\nwe derive a regret bound that is independent of the number of states. 
Finally,\nwe empirically validate the model assumptions and the algorithm in a number of\nrealistic scenarios.\n\n1 Introduction\n\nConsider a movie recommendation problem, where the recommender system (RS) selects the genre\nto suggest to a user. A basic strategy is to estimate the user\u2019s preferences and then recommend movies of\nthe preferred genres. While this strategy is sensible in the short term, it overlooks the dynamics of the\nuser\u2019s preferences caused by the recommendation process. For instance, the user may get bored of\nthe proposed genres and then reduce her ratings. This effect is due to the recommendation strategy\nitself and not to an actual evolution of the user\u2019s preferences, as she would still like the same genres, if\nonly they were not proposed so often.1\n\n1In this paper, we do not study non-stationary preferences, as this is a somewhat orthogonal problem to the\nissue that we consider.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThe existence of an optimal fixed strategy is often assumed in RS using, e.g., matrix factorization to\nestimate users\u2019 ratings and the best (fixed) item/genre [16]. Similarly, multi-armed bandit (MAB)\nalgorithms [4] effectively trade off exploration and exploitation in unknown environments, but still\nassume that rewards are independent of the sequence of arms selected over time and they try\nto select the (fixed) optimal arm as often as possible. Even when comparing more sophisticated\nrecommendation strategies, as in A/B testing, we implicitly assume that once the better option\n(either A or B) is found, it should be constantly executed, thus ignoring how its performance may\ndeteriorate if used too often. An alternative approach is to estimate the state of the user (e.g., her\nlevel of boredom) as a function of the movies recently watched and estimate how her preferences are\naffected by that. 
We could then learn a contextual strategy that recommends the best genre depending\non the actual state of the user (e.g., using LINUCB [17]). While this could partially address the\nprevious issue, we argue that in practice it may not be satisfactory. As the preferences depend on\nthe sequence of recommendations, a successful strategy should \u201cdrive\u201d the user\u2019s state into the most\nfavorable condition to gain as much reward as possible in the long term, instead of selecting the best\n\u201cinstantaneous\u201d action at each step. Consider a user with preferences 1) action, 2) drama, 3) comedy.\nAfter being shown a few action and drama movies, the user may get bored. A greedy contextual strategy\nwould then move to recommending comedy, but as soon as it estimates that action or drama are\nbetter again (i.e., their potential value reverts to its initial value as they are not watched), it would\nimmediately switch back to them. On the other hand, a more farsighted strategy may prefer to stick\nto comedy for a little longer to increase the preference of the user for action to its higher level and\nfully exploit its potential.\nIn this paper, we propose to use a reinforcement learning (RL) [23] model to capture this dynamical\nstructure, where the reward (e.g., the average rating of a genre) depends on a state that summarizes the\neffect of the recent recommendations on the user\u2019s preferences. We introduce a novel learning algorithm\nthat effectively trades off exploration and exploitation and we derive theoretical guarantees for it.\nFinally, we validate our model and algorithm in synthetic and real-data based environments.\nRelated Work. While in the MAB model, regret minimization [2] and best-arm identification\nalgorithms [11, 22] have often been proposed to learn effective RS, they all rely on the assumption\nthat one best fixed arm exists. 
[8] study settings with time-varying rewards, where each time an arm\nis pulled, its reward decreases due to loss of interest, but, unlike our scenario, it never increases again,\neven if not selected for a long time. [14] also consider rewards that continuously decrease over time\nwhether the arm is selected or not (e.g., modeling novelty effects, where new products naturally lose\ninterest over time). This model fits into the more general case of restless bandit [e.g., 6, 25, 20],\nwhere each arm has a partially observable internal state that evolves as a Markov chain independently\nfrom the arms selected over time. Time-varying preferences have also been widely studied in RS.\n[25, 15] consider a time-dependent bias to capture seasonality and trend effects, but do not consider\nthe effects on users\u2019 state. More related to our model is the setting proposed by [21], who consider an\nMDP-based RS at the item level, where the next item reward depends on the k previously selected\nitems. Working at the item level without any underlying model assumption prevents their algorithm\nfrom learning in large state spaces, as every single combination of k items should be considered\n(in their approach this is partially mitigated by state aggregation). Finally, they do not consider\nthe exploration-exploitation trade-off and they directly solve an estimated MDP. This may lead to\nan overall linear regret, i.e., failing to learn the optimal policy. Somewhat similarly, [12] propose a\nsemi-Markov model to decide what item to recommend to a user based on her latent psychological\nstate toward this item. They assume two possible states: sensitization, a state during which she is\nhighly engaged with the item, and boredom, a state during which she is not interested in the item.\nThanks to the use of a semi-Markov model, the next state of the user depends on how long she has\nbeen in the current state. 
Our work is also related to the linear bandit model [17, 1], where rewards\nare a linear function of a context and an unknown target vector. Despite producing context-dependent\npolicies, this model does not consider the influence that the actions may have on the state and thus\noverlooks the potential of long-term reward maximization.\n\n2 Problem Formulation\nWe consider a finite set of actions $a \\in \\{1, \\ldots, K\\} = [K]$. Depending on the application, actions\nmay correspond to simple items or complex RS. We define the state $s_t$ at time $t$ as the history of the\nlast $w$ actions, i.e., $s_t = (a_{t-1}, \\cdots, a_{t-w})$, where for $w = 0$ the state reduces to the empty history.\nAs described in the introduction, we expect the reward of an action $a$ to depend on how often $a$ has\nbeen recently selected (e.g., a user may get bored the more an RS is used). We introduce the recency\nfunction $\\rho(s_t, a) = \\sum_{\\tau=1}^{w} \\mathbb{1}\\{a_{t-\\tau} = a\\}/\\tau$, where the effect of an action fades as $1/\\tau$, so that the\nrecency is large if an action is often selected and it decreases as it is not selected for a while. We\ndefine the (expected) reward function associated to an action $a$ in state $s_t$ as\n\n$$r(s_t, a) = \\sum_{j=0}^{d} \\theta^*_{a,j} \\, \\rho(s_t, a)^j = x_{s_t,a}^{\\top} \\theta^*_a, \\tag{1}$$\n\nwhere $x_{s,a} = [1, \\rho(s, a), \\cdots, \\rho(s, a)^d] \\in \\mathbb{R}^{d+1}$ is the context vector associated to action $a$ in\nstate $s$ and $\\theta^*_a \\in \\mathbb{R}^{d+1}$ is an unknown vector. 
In practice, the reward observed when selecting $a$ at\n$s_t$ is $r_t = r(s_t, a) + \\varepsilon_t$, with $\\varepsilon_t$ a zero-mean noise. For $d = 0$ or $w = 0$, this model reduces to the\nstandard MAB setting, where $\\theta^*_{a,0}$ is the expected reward of action $a$. Eq. 1 extends the MAB model\nby summing the \u201cstationary\u201d component $\\theta^*_{a,0}$ to a polynomial function of the recency $\\rho(s_t, a)$. While\nalternative and more complicated functions of $s_t$ may be used to model the reward, in the next section\nwe show that a small degree polynomial of the recency is rich enough to model real data.\n\nFigure 1: Average rating as a function of the recency for different genres of movies ($w = 10$) and predictions of\nour model for $d = 5$ in red. From left to right: drama, comedy, action and thriller. The confidence intervals are\nconstructed based on the number of samples available at each state $s$ and the red curves are obtained by fitting\nthe data with the model in Eq. 1.\n\nThe formulation in Eq. 1 may suggest that this is an instance of a linear bandit problem, where\n$x_{s_t,a}$ is the context for action $a$ at time $t$ and $\\theta^*_a$ is the unknown vector. Nonetheless, in linear\nbandit the sequence of contexts $\\{x_{s_t,a}\\}_t$ is independent of the actions selected over time and the\noptimal action at time $t$ is $a^*_t = \\arg\\max_{a \\in [K]} x_{s_t,a}^{\\top} \\theta^*_a$,2 while in our model, $x_{s_t,a}$ actually depends\non the state $s_t$, which summarizes the last $w$ actions. As a result, an optimal policy should take into\naccount its effect on the state to maximize the long-term average reward. 
We thus introduce the\ndeterministic Markov decision process (MDP) $M = \\langle S, [K], f, r \\rangle$ with state space $S$ enumerating the\npossible sequences of $w$ actions, action space $[K]$, noisy reward function in Eq. 1, and a deterministic\ntransition function $f : S \\times [K] \\to S$ that simply drops the action selected $w$ steps ago and appends\nthe last action to the state. A policy $\\pi : S \\to [K]$ is evaluated according to its long-term average\nreward as $\\eta^{\\pi} = \\lim_{n \\to \\infty} \\mathbb{E}\\big[ \\frac{1}{n} \\sum_{t=1}^{n} r_t \\big]$, where $r_t$ is the (random) reward of state $s_t$ and action\n$a_t = \\pi(s_t)$. 
The optimal policy is thus $\\pi^* = \\arg\\max_{\\pi} \\eta^{\\pi}$, with optimal average reward $\\eta^* = \\eta^{\\pi^*}$.\nWhile an explicit form for $\\pi^*$ cannot be obtained in general, an optimal policy may select an action\nwith suboptimal instantaneous reward (i.e., action $a_t = \\pi(s_t)$ is s.t. $r(s_t, a_t) < \\max_a r(s_t, a)$) so as\nto let other (potentially more rewarding) actions \u201crecharge\u201d and select them later on. This results in\na policy that alternates actions with a fixed schedule (see Sec. 5 for more insights).3 If the parameters\n$\\theta^*_a$ were known, we could compute the optimal policy by using value iteration, where a value function\n$u_0 \\in \\mathbb{R}^{S}$ is iteratively updated as\n\n$$u_{i+1}(s) = \\max_{a \\in [K]} \\big[ r(s, a) + u_i\\big(f(s, a)\\big) \\big], \\tag{2}$$\n\nand a nearly-optimal policy is obtained after $n$ iterations as $\\pi(s) = \\arg\\max_{a \\in [K]} [r(s, a) + u_n(f(s, a))]$.\nAlternatively, algorithms to compute the maximum reward cycle for deterministic MDPs could be\nused [see e.g., 13, 5]. The objective of a learning algorithm is to approach the performance of the\noptimal policy as quickly as possible. This is measured by the regret, which compares the reward\ncumulated over $T$ steps by a learning algorithm and by the optimal policy, i.e.,\n\n$$\\Delta(T) = T \\eta^* - \\sum_{t=1}^{T} r(s_t, a_t), \\tag{3}$$\n\nwhere $(s_t, a_t)$ is the sequence of states and actions observed and selected by the algorithm.\n\n3 Model Validation on Real Data\n\nIn order to provide a preliminary validation of our model, we use the movielens-100k dataset [9, 7].\nWe consider a simple scenario where a RS directly recommends a genre to a user. 
\n\n2We will refer to this strategy as \u201cgreedy\u201d policy thereafter.\n3In deterministic MDPs the optimal policy is a recurrent sequence of actions inducing a maximum-reward\ncycle over states.\n\nGenre | d = 1 | d = 2 | d = 3 | d = 4 | d = 5 | d = 6\naction | 0.55 | 0.74 | 0.79 | 0.81 | 0.81 | 0.82\ncomedy | 0.77 | 0.85 | 0.88 | 0.90 | 0.90 | 0.91\ndrama | 0.0 | 0.77 | 0.80 | 0.83 | 0.86 | 0.87\nthriller | 0.74 | 0.81 | 0.83 | 0.91 | 0.91 | 0.91\n\nTable 1: $R^2$ for the different genres and values of $d$ on movielens-100k and a window $w = 10$.\n\nIn practice, one may prefer to use collaborative filtering algorithms (e.g., matrix factorization) and apply our proposed\nalgorithm on top of them to find the optimal cadence to maximize long-term performance. However,\nwhen dealing with very sparse information, as in retargeting, it may happen that a RS just focuses\non performing recommendations from a very limited set of items.4 Once applied to this scenario, our\nmodel predicts that the user\u2019s preferences change with the number of movies of the same genre a user\nhas recently watched (e.g., she may get bored after seeing too many movies of a genre and then\nget interested again as time goes by without watching that genre). In order to verify this intuition,\nwe sort ratings for each user using their timestamps to produce an ordered sequence of ratings.5 For\ndifferent genres observed more than 10,000 times, we compute the average rating for each value of\nthe recency function $\\rho(s_t, a)$ at the states $s_t$ encountered in the dataset. The charts of Fig. 1 provide\na first qualitative support for our model. 
The rating for comedy, action, and thriller genres is a\nmonotonically decreasing function of the recency, hinting to the existence of a boredom-effect, so\nthat the rating of a genre decreases with how many movies of that kind have been recently watched.\nOn the other hand, drama shows a more sophisticated behavior where users \u201cdiscover\u201d the genre and\nincrease the ratings as they watch more movies, but get bored if they recently watched \u201ctoo many\u201d\ndrama movies. This suggests that in this case there is a critical frequency at which users enjoy movies\nof this genre. In order to capture the dependency between rating and recency for different genres, in\nEq. 1 we de\ufb01ned the reward as a polynomial of \u03c1(st, a) with coef\ufb01cients that are speci\ufb01c for each\naction a. In Table 1 we report the coef\ufb01cient of determination R2 of \ufb01tting the model of Eq. 1 to\nthe dataset for different genres and values of d. The results show how our model becomes more and\nmore accurate as we increase its complexity. We also notice that even polynomials of small degree\n(from d = 4 the R2 tends to plateau) actually produce accurate reward predictions, suggesting that\nthe recency does really capture the key elements of the state s and that a relatively simple function of\n\u03c1 is enough to accurately predict the rating. This result also suggests that standard approaches in RS,\nsuch as matrix factorization, where the rating is contextual (as it depends on features of both users\nand movies/genres) but static, potentially ignore a critical dimension of the problem that is related to\nthe dynamics of the recommendation process itself.\n\n4 Linear Upper-Con\ufb01dence bound for Reinforcement Learning\nThe Learning Algorithm. LINUCRL directly builds on the UCRL algorithm [10] and exploits the\nlinear structure of the reward function and the deterministic and known transition function f. 
The\ncore idea of LINUCRL is to construct confidence intervals on the reward function and apply the\noptimism-in-face-of-uncertainty principle to compute an optimistic policy. The structure of LINUCRL\nis illustrated in Alg. 1. Consider an episode $k$ starting at time $t$. LINUCRL first uses the\ncurrent samples collected for each action $a$ separately to compute an estimate $\\hat{\\theta}_{t,a}$ by regularized\nleast squares, i.e.,\n\n$$\\hat{\\theta}_{t,a} = \\arg\\min_{\\theta} \\sum_{\\tau < t : a_\\tau = a} \\big( x_{s_\\tau,a}^{\\top} \\theta - r_\\tau \\big)^2 + \\lambda \\|\\theta\\|^2, \\tag{4}$$\n\nwhere $x_{s_\\tau,a}$ is the context vector corresponding to state $s_\\tau$ and $r_\\tau$ is the (noisy) reward observed\nat time $\\tau$. Let $R_{t,a}$ be the vector of rewards obtained up to time $t$ when $a$ was executed and $X_{t,a}$\nthe feature matrix corresponding to the contexts observed so far; then $V_{t,a} = \\big( X_{t,a}^{\\top} X_{t,a} + \\lambda I \\big) \\in \\mathbb{R}^{(d+1) \\times (d+1)}$ is the design matrix. The closed-form solution of the estimate is $\\hat{\\theta}_{t,a} = V_{t,a}^{-1} X_{t,a}^{\\top} R_{t,a}$,\nwhich gives an estimated reward function $\\hat{r}_t(s, a) = x_{s,a}^{\\top} \\hat{\\theta}_{t,a}$. 
Instead of computing the optimal\npolicy according to the estimated reward, we compute the upper-confidence bound\n\n$$\\tilde{r}_t(s, a) = \\hat{r}_t(s, a) + c_{t,a} \\|x_{s,a}\\|_{V_{t,a}^{-1}}, \\tag{5}$$\n\nwhere $c_{t,a}$ is a scaling factor whose explicit form is provided in Eq. 6. Since the transition function $f$\nis deterministic and known, we then simply apply the value iteration scheme in Eq. 2 to the MDP\n$\\widetilde{M}_k = \\langle S, [K], f, \\tilde{r}_k \\rangle$ and compute the corresponding optimal (optimistic) policy $\\tilde{\\pi}_k$. It is simple\nto verify that $(\\widetilde{M}_k, \\tilde{\\pi}_k)$ is the pair of MDP and policy that maximizes the average reward over all\n\u201cplausible\u201d MDPs that are within the confidence intervals over the reward function. More formally, let\n$\\mathcal{M}_k = \\{ M = \\langle S, [K], f, r \\rangle : |r(s, a) - \\hat{r}_t(s, a)| \\leq c_{t,a} \\|x_{s,a}\\|_{V_{t,a}^{-1}}, \\forall s, a \\}$, then with high probability\nwe have that\n\n$$\\eta^{\\tilde{\\pi}_k}(\\widetilde{M}_k) \\geq \\max_{\\pi, M \\in \\mathcal{M}_k} \\eta^{\\pi}(M).$$\n\nFinally, LINUCRL executes $\\tilde{\\pi}_k$ until the number of samples for an action is doubled w.r.t. the beginning\nof the episode. 
The specific structure of the problem makes LINUCRL more efficient than UCRL, since\neach iteration of Eq. 2 has $O(dSK)$ computational complexity compared to $O(S^2 K)$ of extended\nvalue iteration (used in UCRL) due to the randomness of the transitions and the optimism over $f$.\n\nAlgorithm 1 The LINUCRL algorithm.\nInit: Set $t = 0$, $T_a = 0$, $\\hat{\\theta}_a = 0 \\in \\mathbb{R}^{d+1}$, $V_a = \\lambda I$\nfor rounds $k = 1, 2, \\cdots$ do\n  Set $t_k = t$, $\\nu_a = 0$\n  Compute $\\hat{\\theta}_a = V_a^{-1} X_a^{\\top} R_a$\n  Set optimistic reward $\\tilde{r}_k(s, a) = x_{s,a}^{\\top} \\hat{\\theta}_a + c_{t,a} \\|x_{s,a}\\|_{V_a^{-1}}$\n  Compute optimal policy $\\tilde{\\pi}_k$ for MDP $(S, [K], f, \\tilde{r}_k)$\n  while $\\forall a \\in [K]$, $T_a < \\nu_a$ do\n    Choose action $a_t = \\tilde{\\pi}_k(s_t)$\n    Observe reward $r_t$ and next state $s_{t+1}$\n    Update $X_{a_t} \\leftarrow [X_{a_t}, x_{s_t,a_t}]$, $R_{a_t} \\leftarrow [R_{a_t}, r_t]$, $V_{a_t} \\leftarrow V_{a_t} + x_{s_t,a_t} x_{s_t,a_t}^{\\top}$\n    Set $\\nu_{a_t} \\leftarrow \\nu_{a_t} + 1$, $t \\leftarrow t + 1$\n  end while\n  Set $T_a \\leftarrow T_a + \\nu_a$, $\\forall a \\in [K]$\nend for\n\n4See Sect. 5 for further discussion on the difficulty of finding suitable datasets for the validation of time-varying models.\n5In the movielens dataset a timestamp does not correspond to the moment the user saw the movie but when\nthe rating is actually submitted. Yet, this does not cancel potential dependencies of future rewards on past\nactions.\n\nTheoretical Analysis. We prove that LINUCRL successfully exploits the structure of the problem to\nreduce its cumulative regret w.r.t. basic UCRL. We first make explicit the confidence interval in Eq. 5.\nLet us assume that there exist (known) constants $B$ and $R$ such that $\\|\\theta^*_a\\|_2 \\leq B$ for all actions $a \\in [K]$\nand the noise is sub-Gaussian with parameter $R$. Let $\\ell_w = \\log(w) + 1$, where $w$ is the length of the\nwindow in the state definition, and $L_w^2 = \\frac{1 - \\ell_w^{d+1}}{1 - \\ell_w}$, where $d$ is the degree of the polynomial describing\nthe reward function. Then, we run LINUCRL with the scaling factor\n\n$$c_{t,a} = R \\sqrt{(d + 1) \\log\\Big( K t^{\\alpha} \\Big( 1 + \\frac{T_{t,a} L_w^2}{\\lambda} \\Big) \\Big)} + \\lambda^{1/2} B, \\tag{6}$$\n\nwhere $T_{t,a}$ is the number of samples collected from action $a$ up to $t$. Then we can prove the following.\nTheorem 1. If LINUCRL runs with the scaling factor in Eq. 6 over $T$ rounds, then its cumulative\nregret is\n\n$$\\Delta(\\text{LINUCRL}, T) \\leq K w \\log_2\\Big( \\frac{8T}{K} \\Big) + 2 c_{\\max} \\sqrt{2 K T (d + 1) \\log\\Big( 1 + \\frac{T L_w^2}{\\lambda (d + 1)} \\Big)},$$\n\nwhere $c_{\\max} = \\max_{t,a} c_{t,a}$.\n\nWe first notice that the per-step regret $\\Delta/T$ decreases to zero as $1/\\sqrt{T}$, showing that as time increases,\nthe reward approaches the optimal average reward. Furthermore, by leveraging the specific structure\nof our problem, LINUCRL greatly improves the dependency on other elements characterizing the\nMDP. 
In the general MDP case, UCRL suffers from a regret $O(DS\\sqrt{KT})$, where $D$ is the diameter\nof the MDP, which in our case is equal to the history window $w$. In the regret bound of LINUCRL the\ndependency on the number of states (which is exponential in the history window, $S = K^w$) disappears\nand it is replaced by the number of parameters $d + 1$ in the reward model. Furthermore, since the\ndynamics is deterministic and known, the only dependency on the diameter $w$ is in a lower-order\nlogarithmic term. This result suggests that we can take a large window $w$ and a complex polynomial\nexpression for the reward (i.e., large $d$) without compromising the overall regret. Note that in\nMDPs, the worst-case regret lower bound also exhibits a $\\sqrt{T}$ dependency ([10]), so there is not\nmuch hope to improve it. The interesting part of these bounds is actually in the problem-specific terms.\n\nFigure 2: Optimal policy vs. greedy and fixed-action. Panels: (a) sequence of actions, (b) average reward, (c) sequence of actions, (d) average reward. The fixed-action policy selects the action with the largest\n\u201cconstant\u201d reward (i.e., ignoring the effects of the recommendation). The greedy policy selects the action with the\nhighest immediate reward (depending on the state). The optimal policy is computed with value iteration. (a-b):\nparameters $c_1 = 0.3$, $c_2 = 0.4$, $\\alpha = 1.5$ (limited boredom effect). (c-d): parameters $c_1 = 2$, $c_2 = 0.01$, $\\alpha = 2$\n(strong boredom effect).\n\nFurthermore, LINUCRL compares favorably with a linear bandit approach. First, $\\eta^*$ is in general\nmuch larger than the optimal average reward of a greedy policy selecting the best instantaneous action\nat each step. Second, apart from the $\\log(T)$ term, the regret is the same as that of a linear bandit algorithm\n(e.g., LINUCB). 
This means that LINUCRL approaches a better target performance $\\eta^*$ almost at the\nsame speed as linear bandit algorithms reach a worse greedy policy. Finally, [19] developed a specific\ninstance of UCRL for deterministic MDPs, whose final regret is of order $O(\\lambda A \\log(T)/\\Delta)$, where $\\lambda$\nis the length of the largest simple cycle that can be generated in the MDP and $\\Delta$ is the gap between\nthe reward of the optimal and second-optimal policy. While the regret in this bound only scales\nas $O(\\log T)$, in our setting $\\lambda$ can be as large as $S = K^w$, which is exponentially worse than the\ndiameter $w$, and $\\Delta$ can be arbitrarily small, thus making an $O(\\sqrt{T})$ bound often preferable. We leave\nthe integration of our linear reward assumption into the algorithm proposed by [19] as future work.\n\n5 Experiments\n\nIn order to validate our model on real datasets, we need persistent information about a user identification\nnumber to follow the user through time and evaluate how preferences evolve over time\nin response to the recommendations. This also requires datasets where several RSs are used for\nthe same user with different cadence and for which it is possible to associate a user-item feedback\nwith the system that actually performed that recommendation. Unfortunately, these requirements\nmake most publicly available datasets unsuitable for this validation. As a result, we propose to\nuse both synthetic and dataset-based experiments to empirically validate our model and compare\nLINUCRL to existing baselines. We consider three different scenarios. Toy experiment: A simulated\nenvironment with two actions and different parameters, with the objective of illustrating when the\noptimal policy could outperform fixed-action and greedy strategies. Movielens: We derive model\nparameters from the movielens dataset and we compare the learning performance (i.e., cumulative\nreward) of LINUCRL to baseline algorithms. 
Real-world data from A/B testing: this dataset provides\nenough information to test our algorithm and although our model assumptions are no longer satisfied,\nwe can still investigate how a long-term policy alternating A and B on the basis of past choices can\noutperform each solution individually.\n\nFigure 3: Results of learning experiment based on movielens dataset. (a) Last 40 actions. (b) Avg. rwd. at T = 200. (c) Avg. rwd. at the end.\n\nOptimal vs. fixed-action and greedy policy. We first illustrate the potential improvement coming\nfrom a non-static policy that takes into consideration the recent sequence of actions and maximizes\nthe long-term reward, compared to a greedy policy that selects the action with the higher immediate\nreward at each step. Intuitively, the gap may be large whenever an action has a large instantaneous\nreward that decreases very fast as it is selected (e.g., boredom effect). A long-term strategy may\nprefer to stick to selecting a sub-optimal action for a while, until the better action goes back to its\ninitial value. We consider the simple case $K = 2$ and $d = 1$. Let $\\theta^*_1 = (1, c_1)$, $\\theta^*_2 = (1/\\alpha, c_2)$.\nWe study the optimal policy maximizing the average reward $\\eta$, a greedy policy that always selects\n$a_t = \\arg\\max_a r(s_t, a)$, and a fixed-action policy $a_t = \\arg\\max\\{1, 1/\\alpha\\}$. We first set $c_1 = 0.3 \\approx\nc_2 = 0.4$ and $\\alpha = 1.5$, for which the \u201cboredom\u201d effect (i.e., the decrease in reward) is very mild. In\nthis case (see Fig. 2-(left)), the fixed-action policy performs very poorly, while the greedy and optimal\npolicies smartly alternate between actions so as to avoid decreasing the reward of the \u201cbest\u201d action\ntoo much. 
In this case, the difference between greedy and optimal policy is very narrow. However, in\nFig. 2-(right), with $c_1 = 2 \\gg c_2 = 0.01$ and $\\alpha = 2$, we see that the greedy policy switches to action\n1 too soon to gain immediate reward (plays action 1 for 66% of the time) whereas the optimal policy\nsticks to action 2 longer (plays action 1 for 57% of the time) so as to allow action 1 to regain reward\nand then go back to select it again. As a result, the optimal policy exploits the full potential of action\n1 better and eventually gains higher average reward. While here we only illustrate the \u201cboredom\u201d\neffect (i.e., the reward linearly decreases with the recency), we can imagine a large range of scenarios\nwhere the greedy policy is highly suboptimal compared to the optimal policy.\nLearning on movielens dataset. In order to overcome the difficulty of creating full complex RS\nand evaluating them on offline datasets, we focus on a relatively simple scenario where a RS directly\nrecommends movies from one chosen genre, for which we have already validated our model in\nSec. 3. One strategy could be to apply a bandit algorithm to find the optimal genre and then always\nrecommend movies of this genre. On the other hand, our algorithm tries to identify an optimal\nsequence of those genres to keep the user interested. The standard offline evaluation of a learning\nalgorithm on historical data is to use a replay or counterfactual strategy [18, 24], which consists in\nupdating the model whenever the learning algorithm takes the same action as in the logged data,\nand only updating the state (but not the model) otherwise. In our case this replay strategy cannot be\napplied because the reward depends on the history of selected actions and we could not evaluate\nthe reward of an action if the algorithm generated a sequence that is not available in the dataset\n(which is quite likely). 
Thus, in order to compare the learning performance of LINUCRL to existing\nbaselines, we use the movielens-100k dataset to estimate the parameters of our model and construct\nthe corresponding \u201csimulator\u201d. Unlike a fully synthetic experiment, this gives a configuration which\nis \u201clikely\u201d to appear in practice, as the parameters are directly estimated from real data. We choose\n$K = 10$ actions corresponding to different genres of movies, and we set $d = 5$ and $w = 5$, which\nresults in $K^w = 10^5$ states. We recall that $w$ has a mild impact on the learning performance of\nLINUCRL as it does not need to repeatedly try the same action in each state (as UCRL does) to be able\nto estimate its reward. This is also confirmed by the regret analysis, which shows that the regret only\ndepends on $w$ in the lower-order logarithmic term. Given this number of states, UCRL\nwould need at least one million iterations to observe each state 10 times, which is far too large\nfor the application we consider. The parameters that describe the dependency of the reward function\non the recency (i.e., $\\theta^*_{a,j}$) are computed by using the ratings averaged over all users for each state\nencountered and for ten different genres in the dataset. The first component of the vectors $\\theta^*_a$ is chosen\nto simulate different users\u2019 preferences and to create complex dynamics in the reward functions. The\nresulting parameters and reward functions are reported in App. B. Finally, the observed reward is\nobtained by adding a small random Gaussian noise to the linear function. 
In this setting, a constant strategy would always pull the comedy genre since it is the one with the highest \u201cstatic\u201d reward, while other genres are also highly rewarding and a suitable alternation between them may provide a much higher reward.\n\n[Figure 3 (plot residue). Sequence panels: genres (Action, Comedy, Adventure, Thriller, Drama, Children, Crime, Horror, SciFi, Animation) selected by oracle greedy, linUCRL, linUCB and oracle optimal, with x-axis ticks from 0 to 40. First bar chart of average reward: linUCB 3.284, UCRL 3.33, linUCRL 3.486, oracle greedy 3.538, oracle optimal 3.551. Second bar chart of average reward: UCRL 3.327, linUCB 3.43, oracle greedy 3.536, linUCRL 3.54, oracle optimal 3.555.]\n\nAlgorithm | on the T steps | on the last steps\nonly B | 46.0% | 46.0%\nUCRL | 46.5% | 46.0%\nLINUCRL | 66.7% | 75.8%\noracle greedy | 61.3% | 61.3%\noracle optimal | 95.2% | 95.2%\n\nTable 2: Relative improvement over only A of learning experiment based on large scale A/B testing dataset.\n\nWe compare LINUCRL to the following algorithms: oracle optimal (\u03c0\u2217), oracle greedy (greedy contextual policy), LINUCB [1] (learn the parameters using LINUCB for each action and select the one with largest instantaneous reward), UCRL [3] (considering each action and state independently). The results are obtained by averaging 4 independent runs. Fig. 3(b-c) shows the average reward at T = 200 and after T = 2000 steps. We first notice that, as in the previous experiment, the oracle greedy policy is suboptimal compared to the optimal policy that maximizes the long-term reward. Despite the fact that UCRL targets this better performance, the learning process is very slow as the number of states is too large. 
Indeed, this number of steps is lower than the number of states, so UCRL did not have the chance to update its policy since on average no state has been visited twice. On the other hand, at early learning stages LINUCRL is already better than LINUCB, and its performance keeps improving until, at 2000 steps, it actually performs better than the oracle greedy strategy and is close to the optimal policy.\nLarge scale A/B testing dataset. We also validate our approach on a real-world A/B testing dataset. We collected 15 days of ad-click history from a CRITEO test, where users have been shown two variations of the display, denoted as A and B. The two displays are the outputs of two real-world collaborative-filtering recommender strategies; precise information on how these algorithms are constructed is not relevant for our analysis. Unlike in classical A/B testing, each unique user has been exposed to both A and B but with different frequencies. This dataset is formed of 350M tuples (user id, timestamp, version, click) and will be released publicly as soon as possible. Note that the system is already heavily optimized and that even a small improvement in the click-rate is very desirable. As in the movielens experiment, we do not have enough data to evaluate a learning algorithm on the historical events (not enough samples per state would be available), so we first compute a simulator based on the data and then run LINUCRL (which does not know the parameters of the simulator and must try to estimate them) and compare it to simple baselines. Unlike the previous experiment, we do not impose any linear assumption on the simulator (as in Eq. 1) and we compute the click probability for actions A and B independently in each state (we set w = 10, for a total of 2^10 = 1024 states) and whenever that state-action pair is executed we draw a Bernoulli with the corresponding probability. 
Using this simulator, we compute the oracle greedy and optimal policies, and we compare LINUCB, LINUCRL (which is no longer able to learn the \u201ctrue\u201d model, since the simulator does not satisfy the linear assumption), and UCRL (which may suffer from the large number of states but targets a model with potentially better performance, as it can correctly estimate the actual reward function and not just a linear approximation of it). We report the results (averaged over 5 runs) as a relative improvement over the worst fixed option (i.e., in this case A). Tab. 2 shows the average reward over T = 2,000 steps and that of the learned policy at the end of the experiment. Despite the fact that the simulator does not satisfy our modeling assumptions, LINUCRL is still the most competitive algorithm, as it achieves the best performance among the learning algorithms and it outperforms the oracle greedy policy.\n6 Conclusion\n\nWe showed that estimating the influence of the recommendation strategy on the reward and computing a policy maximizing the long-term reward may significantly outperform fixed-action or greedy contextual policies. We introduced a novel learning algorithm, LINUCRL, to effectively learn such a policy and we proved that its regret is much smaller than for standard reinforcement learning algorithms (UCRL). We validated our model and its usefulness on the movielens dataset and on a novel A/B testing dataset. Our results illustrate how the optimal policy effectively alternates between different options, in order to keep the interest of the users as high as possible. Furthermore, we compared LINUCRL to a series of learning baselines on simulators satisfying our linearity assumptions (movielens) or not (A/B testing). An avenue for future work is to extend the current model to take into consideration correlations between actions. 
Furthermore, given its speed of convergence, it could be interesting to run a different instance of LINUCRL per user (or group of users) in order to offer personalized \u201cboredom\u201d curves. Finally, different models of the reward as a function of the recency (e.g., logistic regression) could be used in the case of binary rewards.\n\nReferences\n[1] Y. Abbasi-yadkori, D. P\u00e1l, and C. Szepesv\u00e1ri. Improved algorithms for linear stochastic bandits. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2312\u20132320. Curran Associates, Inc., 2011.\n\n[2] P. Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397\u2013422, Mar. 2003.\n\n[3] P. Auer, T. Jaksch, and R. Ortner. Near-optimal regret bounds for reinforcement learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 89\u201396. Curran Associates, Inc., 2009.\n\n[4] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends\u00ae in Machine Learning, 5(1):1\u2013122, 2012.\n\n[5] A. Dasdan, S. S. Irani, and R. K. Gupta. Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems. In Proceedings of the 36th Annual ACM/IEEE Design Automation Conference, DAC \u201999, pages 37\u201342, New York, NY, USA, 1999. ACM.\n\n[6] S. Filippi, O. Capp\u00e9, and A. Garivier. Optimally Sensing a Single Channel Without Prior Information: The Tiling Algorithm and Regret Bounds. IEEE Journal of Selected Topics in Signal Processing, 5(1):68\u201376, Feb. 2010.\n\n[7] F. M. Harper and J. A. Konstan. The movielens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):19:1\u201319:19, Dec. 2015.\n\n[8] H. Heidari, M. Kearns, and A. Roth. 
Tight policy regret bounds for improving and decaying bandits. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI\u201916, pages 1562\u20131570. AAAI Press, 2016.\n\n[9] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval, 1999.\n\n[10] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. J. Mach. Learn. Res., 11:1563\u20131600, Aug. 2010.\n\n[11] K. G. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In AISTATS, 2016.\n\n[12] K. Kapoor, K. Subbian, J. Srivastava, and P. Schrater. Just in time recommendations: Modeling the dynamics of boredom in activity streams. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM \u201915, pages 233\u2013242, New York, NY, USA, 2015. ACM.\n\n[13] R. M. Karp. A characterization of the minimum cycle mean in a digraph. 23:309\u2013311, 12 1978.\n\n[14] J. Komiyama and T. Qin. Time-Decaying Bandits for Non-stationary Systems, pages 460\u2013466. Springer International Publishing, Cham, 2014.\n\n[15] Y. Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201909, pages 447\u2013456, New York, NY, USA, 2009. ACM.\n\n[16] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30\u201337, Aug. 2009.\n\n[17] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In M. Rappa, P. Jones, J. Freire, and S. Chakrabarti, editors, WWW, pages 661\u2013670. ACM, 2010.\n\n[18] L. Li, W. Chu, J. Langford, and X. Wang. 
Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM \u201911, pages 297\u2013306, New York, NY, USA, 2011. ACM.\n\n[19] R. Ortner. Online regret bounds for markov decision processes with deterministic transitions. In Y. Freund, L. Gy\u00f6rfi, G. Tur\u00e1n, and T. Zeugmann, editors, Algorithmic Learning Theory, pages 123\u2013137, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.\n\n[20] R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless markov bandits. Theor. Comput. Sci., 558:62\u201376, 2014.\n\n[21] G. Shani, D. Heckerman, and R. I. Brafman. An mdp-based recommender system. J. Mach. Learn. Res., 6:1265\u20131295, Dec. 2005.\n\n[22] M. Soare, A. Lazaric, and R. Munos. Best-Arm Identification in Linear Bandits. In NIPS - Advances in Neural Information Processing Systems 27, Montreal, Canada, Dec. 2014.\n\n[23] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.\n\n[24] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. J. Mach. Learn. Res., 16(1):1731\u20131755, Jan. 2015.\n\n[25] C. Tekin and M. Liu. Online Learning of Rested and Restless Bandits. IEEE Transactions on Information Theory, 58(8), Aug. 2012.\n\nA Proof of Theorem 1\n\nProof. In order to prove Thm. 1, we first need the following proposition about the confidence intervals used in computing the optimistic reward $\\tilde{r}(s, a)$.\n\nProposition 2. Assume $\\|\\theta^*_a\\|_2 \\le B$. If $\\hat{\\theta}_{t,a}$ is computed as in Eq. 4 and $c_{t,a}$ is defined as in Eq. 6, then\n\n$$\\mathbb{P}\\Big( r(s, a) > \\hat{r}(s, a) + c_{t,a} \\|x_{s,a}\\|_{V_{t,a}^{-1}} \\Big) \\le \\frac{t^{-\\alpha}}{K}.$$\n\nProof. 
By definition of $\\rho(s, a)$ we have $0 \\le \\rho(s, a) \\le \\sum_{\\tau=1}^{w} \\frac{1}{\\tau} < \\log(w) + 1 = \\ell_w$. Thus $1 \\le \\|x_{s,a}\\|_2^2 \\le \\sum_{j=0}^{d} \\ell_w^j = \\frac{1 - \\ell_w^{d+1}}{1 - \\ell_w} = L_w^2$. Using Thm. 2 of [1], we have with probability $1 - \\delta$,\n\n$$\\|\\hat{\\theta}_{t,a} - \\theta^*_a\\|_{V_{t,a}} \\le R \\sqrt{(d + 1) \\log\\Big( \\frac{1 + T_{t,a} L_w^2 / \\lambda}{\\delta} \\Big)} + \\lambda^{1/2} B.$$\n\nThus for all $s \\in S$ we have\n\n$$|r(s, a) - \\hat{r}(s, a)| = |x_{s,a}^T \\hat{\\theta}_{t,a} - x_{s,a}^T \\theta^*_a| \\le \\|x_{s,a}\\|_{V_{t,a}^{-1}} \\|\\hat{\\theta}_{t,a} - \\theta^*_a\\|_{V_{t,a}}.$$\n\nUsing $\\delta = t^{-\\alpha}/K$ concludes the proof.\n\nAn immediate result of Prop. 2 is that the estimated average reward of $\\tilde{\\pi}_k$ in the optimistic MDP $\\widetilde{M}_k$ is an upper-confidence bound on the optimal average reward, i.e., for any $t$ (the probability follows by a union bound over actions)\n\n$$\\mathbb{P}\\big( \\eta^* > \\eta_{\\tilde{\\pi}_k}(\\widetilde{M}_k) \\big) \\le t^{-\\alpha}. \\qquad (7)$$\n\nWe are now ready to prove the main result.\n\nProof of Thm. 1. We follow similar steps as in [10]. We split the regret over episodes as\n\n$$\\Delta(A, T) = \\sum_{k=1}^{m} \\sum_{t=t_k}^{t_{k+1}-1} \\big( \\eta^* - r(s_t, a_t) \\big) = \\sum_{k=1}^{m} \\Delta_k.$$\n\nLet $\\mathcal{T}_{k,a} = \\{ t_k \\le t < t_{k+1} : a_t = a \\}$ be the steps when action $a$ is selected during episode $k$. We upper bound the per-episode regret as\n\n$$\\Delta_k = \\sum_{a \\in [K]} \\sum_{t \\in \\mathcal{T}_{k,a}} \\big( \\eta^* - r(s_t, a) \\big) \\le \\sum_{t=t_k}^{t_{k+1}-1} \\big( \\tilde{\\eta}_k - \\tilde{r}_k(s_t, a_t) \\big) + \\sum_{a \\in [K]} \\sum_{t \\in \\mathcal{T}_{k,a}} \\big( \\tilde{r}_k(s_t, a) - r(s_t, a) \\big), \\qquad (8)$$\n\nwhere the inequality directly follows from the event that $\\tilde{\\eta}_k \\ge \\eta^*$ (Eq. 7) with probability $1 - T^{-\\alpha}$. Notice that the low-probability event of failing confidence intervals can be treated as in [10].\nWe proceed by bounding the first term of Eq. 8. Unlike in the general online learning scenario, in our setting the transition function $f$ is known and thus the regret incurred from bad estimates of the dynamics is reduced to zero. Furthermore, since we are dealing with deterministic MDPs, the optimal policy converges to a loop over states. When starting a new policy, we may start from a state outside its loop. Nonetheless, it is easy to verify that starting from any state $s$, it is always possible to reach any desired state $s'$ in at most $w$ steps (i.e., the size of the history window). As a result, within each episode $k$ the difference between the cumulative reward ($\\sum_t \\tilde{r}_k(s_t, a_t)$) and the (optimistic) average reward ($(t_{k+1} - t_k) \\tilde{\\eta}_k$) in the loop never exceeds $w$. Furthermore, since episodes terminate when one action doubles its number of samples, using a similar proof as [10], we have that the number of episodes is bounded as $m \\le K \\log_2(8T/K)$. As a result, the contribution of the first term of Eq. 8 to the overall regret is bounded as\n\n$$\\sum_{k=1}^{m} \\sum_{t=t_k}^{t_{k+1}-1} \\big( \\tilde{\\eta}_k - \\tilde{r}_k(s_t, a_t) \\big) \\le K w \\log_2\\Big( \\frac{8T}{K} \\Big).$$\n\nThe second term in Eq. 
8 refers to the (cumulative) reward estimation error and it can be decomposed as\n\n$$|\\tilde{r}_k(s_t, a) - r(s_t, a)| \\le |\\tilde{r}_k(s_t, a) - \\hat{r}_k(s_t, a)| + |\\hat{r}_k(s_t, a) - r(s_t, a)|.$$\n\nWe can bound the cumulative sum of the second term as (similar for the first, since $\\tilde{r}_k$ belongs to the confidence interval of $\\hat{r}_k$ by construction)\n\n$$\\sum_{k=1}^{m} \\sum_{a \\in [K]} \\sum_{t \\in \\mathcal{T}_{k,a}} |\\hat{r}_k(s_t, a) - r(s_t, a)| \\le \\sum_{k=1}^{m} \\sum_{a \\in [K]} \\sum_{t \\in \\mathcal{T}_{k,a}} c_{t,a} \\|x_{s_t,a}\\|_{V_{t,a}^{-1}} \\le c_{\\max} \\sum_{a \\in [K]} \\sqrt{\\sum_{k=1}^{m} \\sum_{t \\in \\mathcal{T}_{k,a}} \\|x_{s_t,a}\\|^2_{V_{t,a}^{-1}}} \\sqrt{T_a},$$\n\nwhere the first inequality follows from Prop. 2 with probability $1 - T^{-\\alpha}$, and $T_a$ is the total number of times $a$ has been selected up to step $T$. Let $\\mathcal{T}_a = \\cup_k \\mathcal{T}_{k,a}$, then using Lemma 11 of [1], we have\n\n$$\\sum_{t \\in \\mathcal{T}_a} \\|x_{s_t,a}\\|^2_{V_{t,a}^{-1}} \\le 2 \\log \\frac{\\det(V_{T,a})}{\\det(\\lambda I)},$$\n\nand by Lem. 10 of [1], we have\n\n$$\\det(V_{t,a}) \\le \\big( \\lambda + t L_w^2 / (d + 1) \\big)^{d+1},$$\n\nwhich leads to\n\n$$\\sum_{k=1}^{m} \\sum_{a \\in [K]} \\sum_{t \\in \\mathcal{T}_{k,a}} |\\hat{r}_k(s_t, a) - r(s_t, a)| \\le c_{\\max} \\sum_{a \\in [K]} \\sqrt{T_a} \\sqrt{2 (d + 1) \\log\\Big( \\frac{\\lambda + t L_w^2}{\\lambda (d + 1)} \\Big)} \\le c_{\\max} \\sqrt{2 K T (d + 1) \\log\\Big( \\frac{\\lambda + t L_w^2}{\\lambda (d + 1)} \\Big)}.$$\n\nBringing all the terms together gives the regret bound.\n\nB Experiments Details\n\nGenre | $\\theta^*_{a,0}$ | $\\theta^*_{a,1}$ | $\\theta^*_{a,2}$ | $\\theta^*_{a,3}$ | $\\theta^*_{a,4}$ | $\\theta^*_{a,5}$\nAction | 3.1 | 0.54 | -1.08 | 0.78 | -0.22 | 0.02\nComedy | 3.34 | 0.54 | -1.08 | 0.78 | -0.22 | 0.02\nAdventure | 3.51 | 0.86 | -2.7 | 3.06 | -1.46 | 0.24\nThriller | 3.4 | 1.26 | -2.9 | 2.76 | -1.14 | 0.16\nDrama | 2.75 | 1.0 | 0.94 | -1.86 | 0.94 | -0.16\nChildren | 3.52 | 0.1 | 0.0 | -0.3 | 0.2 | -0.04\nCrime | 3.37 | 0.32 | 1.12 | -3.0 | 2.26 | -0.54\nHorror | 3.54 | -0.68 | 1.84 | -2.04 | 0.82 | -0.12\nSciFi | 3.3 | 0.64 | -1.32 | 1.1 | -0.38 | 0.02\nAnimation | 3.4 | 1.38 | -3.44 | 3.62 | -1.62 | 0.24\n\nTable 3: Reward parameters of each genre for the movielens experiment.\n\nThe parameters used in the MovieLens experiment are reported in Table 3.", "award": [], "sourceid": 883, "authors": [{"given_name": "Romain", "family_name": "WARLOP", "institution": "Inria"}, {"given_name": "Alessandro", "family_name": "Lazaric", "institution": "INRIA"}, {"given_name": "J\u00e9r\u00e9mie", "family_name": "Mary", "institution": null}]}