{"title": "Multi-armed Bandits with Compensation", "book": "Advances in Neural Information Processing Systems", "page_first": 5114, "page_last": 5122, "abstract": "We propose and study the known-compensation multi-arm bandit (KCMAB) problem, where a system controller offers a set of arms to many short-term players for $T$ steps. In each step, one short-term player arrives to the system. Upon arrival, the player greedily selects an arm with the current best average reward and receives a stochastic reward associated with the arm. In order to incentivize players to explore other arms, the controller provides proper payment compensation to players. The objective of the controller is to maximize the total reward collected by players while minimizing the compensation. We first give a compensation lower bound $\\Theta(\\sum_i {\\Delta_i\\log T\\over KL_i})$, where $\\Delta_i$ and $KL_i$ are the expected reward gap and Kullback-Leibler (KL) divergence between distributions of arm $i$ and the best arm, respectively. We then analyze three algorithms to solve the KCMAB problem, and obtain their regrets and compensations. We show that the algorithms all achieve $O(\\log T)$ regret and $O(\\log T)$ compensation that match the theoretical lower bound. Finally, we use experiments to show the behaviors of those algorithms.", "full_text": "Multi-armed Bandits with Compensation\n\nSiwei Wang\n\nIIIS, Tsinghua University\n\nwangsw15@mails.tsinghua.edu.cn\n\nLongbo Huang\u21e4\n\nIIIS, Tsinghua University\n\nlongbohuang@tsinghua.edu.cn\n\nAbstract\n\nWe propose and study the known-compensation multi-armed bandit (KCMAB)\nproblem, where a system controller offers a set of arms to many short-term players\nfor T steps. In each step, one short-term player arrives at the system. Upon\narrival, the player aims to select an arm with the current best average reward and\nreceives a stochastic reward associated with the arm. 
In order to incentivize players to explore other arms, the controller provides proper payment compensations to players. The objective of the controller is to maximize the total reward collected by players while minimizing the total compensation. We first provide a compensation lower bound Θ(Σi Δi log T/KLi), where Δi and KLi are the expected reward gap and the Kullback-Leibler (KL) divergence between the distributions of arm i and the best arm, respectively. We then analyze three algorithms for solving the KCMAB problem, and obtain their regrets and compensations. We show that the algorithms all achieve O(log T) regret and O(log T) compensation, matching the theoretical lower bounds. Finally, we present experimental results to demonstrate the performance of the algorithms.

1 Introduction

Multi-armed bandit (MAB) is a game that lasts for an unknown time horizon T [4, 17]. In each time slot, the controller pulls one out of N arms, and pulling different arms results in different feedback. In the stochastic MAB model [12], the feedback from each arm follows a corresponding distribution, which is unknown to the controller. These feedback values are random variables independent of any other events. After pulling an arm, the controller collects a reward that depends on the feedback. The controller aims to maximize the sum of rewards during the game by choosing a proper arm to pull in each time slot, and the decision can depend on all available information, i.e., past chosen arms and feedback.
The common metric for evaluating the performance of a policy is the value of regret, defined as the expected difference between the controller's total reward and the reward obtained by always pulling the arm with the largest expected reward.

The MAB formulation models the trade-off between exploration and exploitation, where exploration concerns finding the potentially best arms, but can result in pulling sub-optimal arms, while exploitation aims at choosing arms with the current best performance and can lose reward if that arm is in fact sub-optimal. Thus, optimizing this trade-off is very important for any controller seeking to minimize regret. However, in many real-world applications, arms are not pulled by a controller that cares about long-term performance. Instead, actions are taken by short-term players interested in optimizing their instantaneous rewards. In this case, an important means is to provide monetary compensations to players, so that they act as if they were pulling arms on behalf of the controller, to jointly minimize regret, e.g., [8]. This is precisely our focus in this paper, i.e., we aim to design an efficient incentivizing policy, so as to minimize regret while not giving away too much compensation.

*This work is supported in part by the National Natural Science Foundation of China Grants 61672316, 61303195, the Tsinghua Initiative Research Grant, and the China Youth 1000-Talent Grant.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

As a concrete example, consider the scenario where an e-commerce website recommends goods to consumers. When a consumer chooses to purchase a certain good, he receives the reward of that good. The website similarly collects the same reward as a recognition of the recommendation quality. In this model, the website acts as a controller that decides how to provide recommendations.
Yet,\nthe actual product selection is made by consumers, who are not interested in exploration and will\nchoose to optimize their reward greedily. However, being a long-term player, the website cares more\nabout maximizing the total reward throughout the game. As a result, he needs to devise a scheme to\nin\ufb02uence the choices of short-term consumers, so that both the consumers and website can maximize\ntheir bene\ufb01ts. One common way to achieve this goal in practice is that the website offers customized\ndiscounts for certain goods to consumers, i.e., by offering a compensation to pay for part of the goods.\nIn this case, each customer, upon arrival, will choose the good with largest expected reward plus\nthe compensation. The goal of the e-commerce site is to \ufb01nd an optimal compensation policy to\nminimize his regret, while not spending too much additional payment.\nIt is important to notice the difference between regret and compensation. In particular, regret comes\nfrom pulling a sub-optimal arm, while compensation comes from pulling an arm with poor past\nbehavior. For example, consider two arms with expected rewards 0.9 for arm 1 and 0.1 for arm 2.\nSuppose in the \ufb01rst twenty observations, arm 1 has an empirical mean 0.1 but arm 2 has an empirical\nmean 0.9. Then, in the next time slot, pulling arm 2 will cause regret 0.8, since its expected gain is 0.8\nless than arm 1. But in a short-term player\u2019s view, arm 2 behaves better than arm 1. Thus, pulling arm\n2 does not require any compensation, while pulling arm 1 needs 0.8 for compensation. As a result,\nthe two measures can behave differently and require different analysis, i.e., regret depends heavily on\nlearning the arms well, while compensation is largely affected by how the reward dynamics behaves.\nThere is a natural trade-off between regret and compensation. 
If one does not offer any compensation, the resulting user selection policy is greedy, which will lead to a Θ(T) regret. On the other hand, if one is allowed to offer arbitrary compensation, one can achieve an O(log T) regret with many existing algorithms. The key challenge in obtaining the best trade-off between regret and compensation lies in the fact that the compensation value depends on the random history. As a consequence, different random histories not only lead to different compensation values, but also result in different arm selections. Moreover, in practice, the compensation budget may be limited, e.g., a company hopes to maximize its total income, which equals the reward minus the compensation. These factors make the compensation hard to analyze.

1.1 Related works

The incentivized learning model has been investigated in prior works, e.g., [8, 14, 15]. In [8], the model contains a prior distribution for each arm's mean reward at the beginning. As time goes on, observations from each arm update the posterior distributions, and subsequent decisions are made based on the posterior distributions. The objective is to optimize the total discounted reward. Following their work, [14] considered the case where the rewards are not discounted, and presented an algorithm that achieves a regret upper bound of O(√T). In [15], instead of a simple game, there is a complex game in each time slot that contains more players and actions. These incentivization formulations can model many practical applications, including crowdsourcing and recommendation systems [6, 16].

In this paper, we focus on the non-Bayesian setting and consider non-discounted rewards. As pointed out in [14], the definition of user expectation is different in this case. Specifically, in our setting, each player selects arms based on their empirical means, whereas in the Bayesian setting, it is possible for a player to also consider the posterior distributions of arms for decision making.
We propose three algorithms for solving our problem, which adapt ideas from existing policies for stochastic MAB, i.e., Upper Confidence Bound (UCB) [2, 9], Thompson Sampling (TS) [18] and ε-greedy [20]. These algorithms guarantee O(log T) regret upper bounds (matching the regret lower bound Θ(log T) [12]).

Another related bandit model is the contextual bandit, where each time slot comes with a context [3, 5, 13]. The context is given before a decision is made, and the reward depends on the context. As a result, arm selection also depends on the given context. In incentivized learning, the short-term players can view the compensation as a context, and their decisions are influenced by the context. However, different from contextual bandits, where the context is often exogenous and the controller focuses on identifying the best arm under given contexts, in our case, the context is given by the controller and is itself influenced by player actions. Moreover, the controller needs to pay for obtaining a desired context. What he needs is the best way to construct a context in every time slot, so that the total cost is minimized.

In the budgeted MAB model, e.g., [7, 19, 21], players also need to pay for pulling arms. In this model, pulling each arm costs a certain budget. The goal of budgeted MAB is to maximize the total reward subject to the budget constraint. The main difference from our work is that in budgeted MAB, the cost budget for pulling each arm is pre-determined and does not change with the reward history. In incentivized learning, however, different reward sample paths lead to different costs for pulling the same arm.

1.2 Our contributions

The main contributions of our paper are summarized as follows:

1. We propose and study the Known-Compensation MAB problem (KCMAB).
In KCMAB, a long-term controller aims to optimize the accumulated reward but has to offer compensation to a set of short-term players for pulling arms. Short-term players, on the other hand, arrive at the system and make greedy decisions to maximize their expected reward plus compensation. The objective of the long-term controller is to design a proper compensation policy, so as to minimize his regret with minimum compensation. KCMAB is a non-Bayesian and non-discounted extension of the model in [8].

2. In KCMAB, subject to the algorithm having an o(T^α) regret for any α ∈ (0, 1), we provide a Θ(log T) lower bound for the compensation. This compensation lower bound has the same order as the regret lower bound, which means that one cannot expect the compensation to be much less than the regret, if the regret is already small.

3. We propose algorithms to solve the KCMAB problem and present their compensation analysis. Specifically, we provide the analyses of compensation for the UCB policy, a modified ε-greedy policy and a modified TS policy. All these algorithms have O(log T) regret upper bounds while using compensation upper bounded by O(log T), which matches the lower bound (in order).

4. We provide experimental results to demonstrate the performance of our algorithms. In the experiments, we find that the modified TS policy behaves better than the UCB policy, while the modified ε-greedy policy has regret and compensation slightly larger than those under the modified TS policy. We also compare the classic TS algorithm and our modified TS policy. The results show that our modification is not only effective in the analysis, but also impactful on actual performance. Our results also demonstrate the trade-off between regret and compensation.

2 Model and notations

In the Known-Compensation Multi-Armed Bandit (KCMAB) problem, a central controller has N arms {1, ···, N}.
Each arm i has a reward distribution denoted by Di with support [0, 1] and mean μi. Without loss of generality, we assume 1 ≥ μ1 > μ2 ≥ ··· ≥ μN ≥ 0 and set Δi = μ1 − μi for all i ≥ 2. The game is played for T time steps. In each time slot t, a short-term player arrives at the system and chooses an arm a(t) to pull. After the player pulls arm a(t), the player and the controller each receive a reward drawn from the distribution Da(t), denoted by Xa(t)(t) ∼ Da(t), which is an independent random variable every time arm a(t) is pulled.

Different from the classic MAB model, e.g., [12], where the only control decision is arm selection, the controller can also choose to offer a compensation to a player for pulling a particular arm, so as to incentivize the player to explore an arm favored by the controller. We denote the offered compensation by c(t) = ca(t)(t), and assume that it can depend on all the previous information, i.e., it depends on Ft−1 = {(a(τ), X(τ), c(τ)) | 1 ≤ τ ≤ t − 1}. Each player, if he pulls arm i at time t, collects income μ̂i(t) + ci(t), where μ̂i(t) ≜ Mi(t)/Ni(t) is the empirical mean reward of arm i, with Ni(t) = Στ<t 1{a(τ) = i} the number of pulls of arm i before time t.

3 Compensation lower bound

0.9 and D1 is a Bernoulli distribution, one can prove that E[μ̂1(ti(k))] ≥ μ1 − 2δ(T) with a probabilistic argument, where δ(T) converges to 0 as T goes to infinity. Thus, for large μ1 and small μ2 (and similarly small μi for i ≥ 2), we have that E[μ̂1(ti(k))] − μi = Ω(μ1 − μi) holds for any i and k ≥ 2. This means that the compensation we need to pay for pulling arm i once is about Θ(μ1 − μi) = Ω(Δi). Thus, the total compensation is Ω(Σ_{i=2}^N Δi log T/KL(Di, D1)). □

4 Compensation upper bound

In this section, we propose three algorithms that can be applied to solve the KCMAB problem and present their analyses.
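To make the interaction just described concrete, here is a minimal Python sketch of the KCMAB loop (our own illustration with assumed names such as `pick_target`, not the authors' code): in every slot the controller selects a target arm and pays the myopic player the gap between the best empirical mean and the target's empirical mean, which is exactly the compensation needed to make the target the player's greedy choice.

```python
import math
import random

def simulate_kcmab(mu, T, pick_target, seed=0):
    """Simulate the KCMAB interaction: in each slot a myopic player pulls
    the arm maximizing empirical mean + compensation, so the controller
    pays max_j mu_hat_j - mu_hat_target to steer the player toward the
    target arm it wants pulled.  (Illustrative sketch, not the paper's code.)"""
    rng = random.Random(seed)
    n = len(mu)
    counts = [0] * n            # N_i(t): pulls of arm i so far
    sums = [0.0] * n            # M_i(t): total observed reward of arm i
    total_reward = total_comp = 0.0
    for t in range(1, T + 1):
        if t <= n:
            target = t - 1      # initialization: pull each arm once
        else:
            target = pick_target(t, counts, sums)
            means = [sums[i] / counts[i] for i in range(n)]
            total_comp += max(means) - means[target]   # payment c_target(t)
        x = 1.0 if rng.random() < mu[target] else 0.0  # Bernoulli reward
        counts[target] += 1
        sums[target] += x
        total_reward += x
    return total_reward, total_comp

def ucb_target(t, counts, sums):
    """A UCB-style target rule: argmax of mu_hat_i + sqrt(2 log t / N_i)."""
    return max(range(len(counts)),
               key=lambda i: sums[i] / counts[i]
               + math.sqrt(2.0 * math.log(t) / counts[i]))
```

For example, plugging in a UCB-style target rule keeps both the number of sub-optimal pulls and the total payment logarithmic in T, which is the behavior analyzed in Section 4.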
Specifically, we consider the Upper Confidence Bound (UCB) Policy [2], and propose a modified ε-Greedy Policy and a modified Thompson Sampling Policy. Note that while these algorithms have been extensively analyzed for their regret performance, the compensation metric is significantly different from regret. Thus, the analyses are different and require new arguments.

1: for t = 1, 2, ···, N do
2:   Choose arm a(t) = t.
3: end for
4: for t = N + 1, ··· do
5:   For all arms i, compute ri(t) = √(2 log t/Ni(t)) and ui(t) = μ̂i(t) + ri(t)
6:   Choose arm a(t) = argmaxi ui(t) (with compensation maxj μ̂j(t) − μ̂a(t)(t))
7: end for
Algorithm 1: The UCB algorithm for KCMAB.

4.1 The Upper Confidence Bound policy

We start with the UCB policy shown in Algorithm 1. In the view of the long-term controller, Algorithm 1 is the same as the UCB policy in [2], and its regret has been proven to be O(Σ_{i=2}^N log T/Δi). Thus, we focus on the compensation upper bound, which is shown in Theorem 2.

Theorem 2. In Algorithm 1, we have that

Com(T) ≤ Σ_{i=2}^N 16 log T/Δi + 2Nπ²/3.

Proof Sketch: First of all, it can be shown that each sub-optimal arm is pulled at most 8 log T/Δi² times in Algorithm 1 with high probability. Since in every time slot t the long-term controller chooses the arm a(t) = argmaxj μ̂j(t) + rj(t), we must have μ̂a(t)(t) + ra(t)(t) = maxj(μ̂j(t) + rj(t)) ≥ maxj μ̂j(t). This implies that the compensation is at most ra(t)(t). Moreover, if arm a(t) has been pulled the maximum number of times, i.e., Na(t)(t) = maxj Nj(t), then ra(t)(t) = minj rj(t) (by definition).
Thus, μ̂a(t)(t) = maxj μ̂j(t), which means that the controller does not need to pay any compensation.

Next, for any sub-optimal arm i, with high probability, the compensation that the long-term controller pays for it can be upper bounded by:

Comi(T) ≤ E[Σ_{τ=1}^{Ni(T)} √(2 log T/τ)] ≤ E[√(8Ni(T) log T)] ≤(a) √(8E[Ni(T)] log T) ≤ 8 log T/Δi.

Here inequality (a) holds because √x is concave. As for the optimal arm, when N1(t) ≥ Σ_{i=2}^N 8 log T/Δi², with high probability N1(t) = maxj Nj(t). Thus, the controller does not need to pay compensation in time slots with a(t) = 1 and N1(t) ≥ Σ_{i=2}^N 8 log T/Δi². Using the same argument, the compensation for arm 1 is upper bounded by Com1(T) ≤ Σ_{i=2}^N 8 log T/Δi with high probability. Therefore, the overall compensation upper bound is given by Com(T) ≤ Σ_{i=2}^N 16 log T/Δi with high probability. □

4.2 The modified ε-greedy policy

The second algorithm we propose is a modified ε-greedy policy, whose details are presented in Algorithm 2. The modified ε-greedy algorithm, though it appears similar to the classic ε-greedy algorithm, has a critical difference. In particular, instead of randomly choosing an arm to explore, we use the round-robin method to explore the arms. This guarantees that, given the number of total explorations, each arm will be explored a deterministic number of times. This facilitates the analysis of the compensation upper bound.

In the regret analysis of the ε-greedy algorithm, the random exploration ensures that at time slot t, the expected number of explorations of each arm is about (ε/N) log t.
Thus, the probability that its empirical mean has a large error is small. In our algorithm, the number of explorations of each single arm is almost the same as in the classic ε-greedy algorithm in expectation (with only a small constant difference). Hence, adapting the analysis of the ε-greedy algorithm gives the same regret upper bound, i.e., O(Σ_{i=2}^N Δi log T/Δ2²) when ε = cN/Δ2².

1: Input: ε
2: for t = 1, 2, ···, N do
3:   Choose arm a(t) = t.
4: end for
5: ae ← 1
6: for t = N + 1, ··· do
7:   With probability min{1, ε/t}, choose arm a(t) = ae and set ae ← (ae mod N) + 1 (with compensation maxj μ̂j(t) − μ̂a(t)(t)).
8:   Else, choose the arm a(t) = argmaxi μ̂i(t).
9: end for
Algorithm 2: The modified ε-greedy algorithm for KCMAB.

Next, we provide a compensation upper bound for our modified ε-greedy algorithm.

Theorem 3. In Algorithm 2, if we have ε = cN/Δ2², then

Com(T) ≤ Σ_{i=2}^N cΔi log T/Δ2² + N²√(c log T)/(2Δ2). (1)

Proof Sketch: Firstly, our modified ε-greedy algorithm chooses the arm with the largest empirical mean in non-exploration steps. Thus, we only need to consider the exploration steps, i.e., steps during which we choose to explore arms according to round-robin. Let t_i^ε(k) be the time slot in which we explore arm i for the k-th time.
Then the compensation the controller has to pay in this time slot is E[maxj μ̂j(t_i^ε(k)) − μ̂i(t_i^ε(k))]. Since the rewards are independent of whether we choose to explore, one sees that E[μ̂i(t_i^ε(k))] = μi. Thus, we can decompose E[maxj μ̂j(t_i^ε(k)) − μ̂i(t_i^ε(k))] as follows:

E[maxj μ̂j(t_i^ε(k)) − μ̂i(t_i^ε(k))] = E[maxj (μ̂j(t_i^ε(k)) − μi)] ≤ E[maxj (μ̂j(t_i^ε(k)) − μj)] + E[maxj (μj − μi)]. (2)

The second term in (2) is bounded by Δi = μ1 − μi. Summing over all these steps and all arms, we obtain the first term Σ_{i=2}^N cΔi log T/Δ2² in our bound (1).

We turn to the first term in (2), i.e., E[maxj (μ̂j(t_i^ε(k)) − μj)]. We see that it is upper bounded by

E[maxj (μ̂j(t_i^ε(k)) − μj)] ≤ E[maxj (μ̂j(t_i^ε(k)) − μj)+] ≤ Σj E[(μ̂j(t_i^ε(k)) − μj)+],

where (⋆)+ = max{⋆, 0}. When arm i has been explored k times (line 7 in Algorithm 2), we know that all other arms have at least k observations (in the first N time slots, there is one observation for each arm). Hence, E[(μ̂j(t_i^ε(k)) − μj)+] = (1/2)E[|μ̂j(t_i^ε(k)) − μj|] ≤ 1/(4√k) (the equality is due to the fact that E[|x|] = 2E[x+] if E[x] = 0).

Suppose arm i is explored in the time set Ti = {t_i^ε(1), ···}. Then,

Σ_{k≤|Ti|} E[maxj (μ̂j(t_i^ε(k)) − μj)+] ≤ Σ_{k≤|Ti|} N/(4√k) ≤ N√|Ti|/2.

Since E[|Ti|] = (c/Δ2²) log T, we can bound the first term in (2) by N²√(c log T)/(2Δ2). Summing this with the bound above for the second term, we obtain the compensation upper bound in (1). □

1: Init: αi = 1, βi = 1 for each arm i.
2: for t = 1, 2, ···, N do
3:   Choose arm a(t) = t and receive the observation X(t).
4:   Update(αa(t), βa(t), X(t))
5: end for
6: for t = N + 1, N + 3, ··· do
7:   For all i, sample values θi(t) from the Beta distribution B(αi, βi);
8:   Play action a1(t) = argmaxi μ̂i(t), get the observation X(t). Update(αa1(t), βa1(t), X(t))
9:   Play action a2(t + 1) = argmaxi θi(t) (with compensation maxj μ̂j(t + 1) − μ̂a2(t+1)(t + 1)), receive the observation X(t + 1). Update(αa2(t+1), βa2(t+1), X(t + 1))
10: end for
Algorithm 3: The Modified Thompson Sampling Algorithm for KCMAB.

1: Input: αi, βi, X(t)
2: Output: updated αi, βi
3: Y(t) ← 1 with probability X(t), 0 with probability 1 − X(t)
4: αi ← αi + Y(t); βi ← βi + 1 − Y(t)
Algorithm 4: Procedure Update

4.3 The Modified Thompson Sampling policy

The third algorithm we propose is a Thompson Sampling (TS) based policy. Due to the complexity of the analysis of the traditional TS algorithm, we propose a modified TS policy and derive its compensation bound. Our modification is motivated by the LUCB algorithm [10]. Specifically, we divide time into rounds containing two time steps each, and pull not only the arm with the largest sample value, but also the arm with the largest empirical mean in each round. The modified TS policy is presented in Algorithm 3, and we have the following theorem about its regret and compensation.

Theorem 4. In Algorithm 3, we have

Reg(T) ≤ Σi 2Δi/(Δi − ε)² log T + O(N/ε⁴) + F1(μ)

for some small ε < Δ2, where F1(μ) does not depend on (T, ε). As for compensation, we have:

Com(T) ≤ Σi 8/(Δi − ε) log T + N log T + O(N/ε⁴) + F2(μ),

where F2(μ) does not depend on (T, ε) as well.

Proof Sketch: In round (t, t + 1), we assume that we first run the arm with the largest empirical mean at time slot t and call t an empirical step. Then we run the arm with the largest sample at time slot t + 1 and call t + 1 a sample step.

We can bound the number of sample steps during which we pull a sub-optimal arm using existing results in [1], since the sample steps form an approximation of the classic TS algorithm. Moreover, [11] shows that in sample steps, the optimal arm is pulled many times (at least t^b at time t, with a constant b ∈ (0, 1)). Thus, after several steps, the empirical mean of the optimal arm will be accurate enough. Then, if we choose to pull a sub-optimal arm i during empirical steps, arm i must have an inaccurate empirical mean. Since the pull will update its empirical mean, it becomes harder and harder for the arm's empirical mean to remain inaccurate. As a result, it cannot be pulled many times during the empirical steps either.

Next, we discuss how to bound the compensation. It can be shown that, with high probability, we always have |θi(t) − μ̂i(t)| ≤ ri(t), where ri(t) = √(2 log t/Ni(t)) is defined as in Algorithm 1. Thus, we can focus on the case that |θi(t) − μ̂i(t)| ≤ ri(t) for all i and t. Note that we do not need to pay compensation in empirical steps. In sample steps, suppose we pull arm i and the largest empirical mean is in arm j ≠ i at the beginning of this round.
Then, we need to pay maxk μ̂k(t + 1) − μ̂i(t + 1), which is upper bounded by μ̂j(t) − μ̂i(t) + (μ̂j(t + 1) − μ̂j(t))+ ≤ μ̂j(t) − μ̂i(t) + 1/Nj(t) (here μ̂i(t + 1) = μ̂i(t)). As θi(t) ≥ θj(t), we must have μ̂i(t) + ri(t) ≥ θi(t) ≥ θj(t) ≥ μ̂j(t) − rj(t), which implies μ̂j(t) − μ̂i(t) ≤ ri(t) + rj(t). Thus, what we need to pay is at most ri(t) + rj(t) + 1/Nj(t) if i ≠ j, in which case we can safely assume that we pay rj(t) + 1/Nj(t) during empirical steps, and ri(t) during sample steps.

For a sub-optimal arm i, we have Comi(T) ≤ 4/(Δi − ε) log T + O(1/ε⁴) + F1(μ) + log T (summing over ri(t) gives the same result as in the UCB case, and summing over 1/Ni(t) is upper bounded by log T). As for arm 1, when a1(t) = a2(t + 1) = 1, we do not need to pay r1(t) twice. In fact, we only need to pay at most 1/N1(t). Then, the number of time steps in which a1(t) = a2(t + 1) = 1 does not happen is upper bounded by Σ_{i=2}^N 2/(Δi − ε)² log T + O(N/ε⁴) + F1(μ), which is given by the regret analysis. Thus, the compensation we need to pay for arm 1 is upper bounded by Σ_{i=2}^N 4/(Δi − ε) log T + O(1/ε⁴) + F1(μ) + log T. Combining the above, we have the compensation bound Com(T) ≤ Σi 8/(Δi − ε) log T + N log T + O(N/ε⁴) + F2(μ). □

5 Experiments

In this section, we present experimental results for the three algorithms, i.e., the UCB policy, the modified ε-greedy policy and the modified TS policy. We also compare our modified TS policy with the original TS policy to evaluate their difference. In our experiments, there are a total of nine arms with expected reward vector μ = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]. We run the game for T = 10000 time steps. The experiment is repeated 1000 times and we take the average over these results. The suffix “-R” represents the regret of a policy, and “-C” represents its compensation.

Figure 1: Regret and Compensation of the three policies.
Figure 2: Regret and Compensation of TS and modified-TS.
Figure 3: Regret and Compensation of modified ε-greedy.

The comparison of the three policies in this paper is shown in Figure 1. We can see that modified-TS performs best in both regret and compensation, compared to the other algorithms. As for the modified ε-greedy policy, when the parameter ε is chosen properly, it can also achieve good performance. In our experiment, we choose ε = 20.

In Figure 2, we see that modified-TS performs better than TS in both compensation and regret, which means that our modification is effective. Figure 3 shows the performance of the modified ε-greedy policies with different ε values. Here we choose ε to be 10, 15 and 20.
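As a hedged illustration of how such an ε sweep can be reproduced (a short simulation sketch under assumed parameters, not the authors' experiment code), the modified ε-greedy policy of Algorithm 2 can be simulated as follows:

```python
import random

def modified_eps_greedy(mu, T, eps, seed=0):
    """Sketch of the modified epsilon-greedy policy (Algorithm 2): with
    probability min(1, eps/t) explore arms in round-robin order, paying
    compensation max_j mu_hat_j - mu_hat_a; otherwise exploit the
    empirical best arm.  Returns (pseudo-regret, total compensation)."""
    rng = random.Random(seed)
    n = len(mu)
    counts, sums = [0] * n, [0.0] * n
    next_arm = 0                      # round-robin exploration pointer
    regret = comp = 0.0
    for t in range(1, T + 1):
        if t <= n:
            a = t - 1                 # initialization: one pull per arm
        elif rng.random() < min(1.0, eps / t):
            a = next_arm              # exploration step
            next_arm = (next_arm + 1) % n
            means = [sums[i] / counts[i] for i in range(n)]
            comp += max(means) - means[a]
        else:                         # exploitation step
            a = max(range(n), key=lambda i: sums[i] / counts[i])
        x = 1.0 if rng.random() < mu[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        sums[a] += x
        regret += max(mu) - mu[a]
    return regret, comp
```

Running it with the nine-arm instance above and ε ∈ {10, 15, 20} should reproduce the qualitative picture of Figure 3: a larger ε buys more exploration, hence more compensation and (typically) less regret.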
From the experiments,\nwe see the trade-off between regret and compensation: low compensation leads to high regret, and\nhigh compensation leads to low regret.\n\n6 Conclusion\n\nWe propose and study the known-compensation multi-armed bandit (KCMAB) problem where\na controller offers compensation to incentivize players for arm exploration. We \ufb01rst establish a\ncompensation lower bound achieved by regret-minimizing algorithms. Then, we consider three\nalgorithms, namely, UCB, modi\ufb01ed \"-greedy and modi\ufb01ed TS. We show that all three algorithms\nachieve good regret bounds, while keeping order-optimal compensation. We also conduct experiments\nand the results validate our theoretical \ufb01ndings.\n\n8\n\n\fReferences\n[1] S. Agrawal and N. Goyal. Further optimal regret bounds for thompson sampling. In AISTATS,\n\npages 99\u2013107, 2013.\n\n[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.\n\nMachine learning, 47(2-3):235\u2013256, 2002.\n\n[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The non-stochastic multi-armed bandit\n\nproblem. Siam Journal on Computing, 32(1):48\u201377, 2002.\n\n[4] D. A. Berry and B. Fristedt. Bandit problems: sequential allocation of experiments (Monographs\n\non statistics and applied probability). Springer, 1985.\n\n[5] A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. Schapire. Contextual bandit algorithms\nwith supervised learning guarantees. In Proceedings of the Fourteenth International Conference\non Arti\ufb01cial Intelligence and Statistics, pages 19\u201326, 2011.\n\n[6] Y.-K. Che and J. H\u00f6rner. Recommender systems as mechanisms for social learning. The\n\nQuarterly Journal of Economics, 133(2):871\u2013925, 2017.\n\n[7] R. Combes, C. Jiang, and R. Srikant. Bandits with budgets: Regret lower bounds and optimal\n\nalgorithms. ACM SIGMETRICS Performance Evaluation Review, 43(1):245\u2013257, 2015.\n\n[8] P. Frazier, D. Kempe, J. 
Kleinberg, and R. Kleinberg. Incentivizing exploration. In Fifteenth\n\nACM Conference on Economics and Computation, pages 5\u201322, 2014.\n\n[9] J. Gittins. Multi-armed bandit allocation indices. wiley-interscience series in systems and\n\noptimization. 1989.\n\n[10] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. Pac subset selection in stochastic\n\nmulti-armed bandits. In International Conference on Machine Learning, 2012.\n\n[11] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal \ufb01nite-\ntime analysis. In Proceedings of the 23rd international conference on Algorithmic Learning\nTheory, pages 199\u2013213, 2012.\n\n[12] T. L. Lai and H. Robbins. Asymptotically ef\ufb01cient adaptive allocation rules. Advances in\n\napplied mathematics, 6(1):4\u201322, 1985.\n\n[13] O.-A. Maillard and R. Munos. Adaptive bandits: Towards the best history-dependent strategy.\n\nIn International Conference on Arti\ufb01cial Intelligence and Statistics, pages 570\u2013578, 2011.\n\n[14] Y. Mansour, A. Slivkins, and V. Syrgkanis. Bayesian incentive-compatible bandit exploration. In\nProceedings of the Sixteenth ACM Conference on Economics and Computation, pages 565\u2013582.\nACM, 2015.\n\n[15] Y. Mansour, A. Slivkins, V. Syrgkanis, and Z. S. Wu. Bayesian exploration: Incentivizing\n\nexploration in bayesian games. arXiv preprint arXiv:1602.07570, 2016.\n\n[16] Y. Papanastasiou, K. Bimpikis, and N. Savva. Crowdsourcing exploration. Management Science,\n\n2017.\n\n[17] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press\n\nCambridge, 1998.\n\n[18] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of\n\nthe evidence of two samples. Biometrika, 25(3/4):285\u2013294, 1933.\n\n[19] L. Tran-Thanh, A. C. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal\n\npolicies for budget-limited multi-armed bandits. In AAAI, 2012.\n\n[20] C. J. C. H. Watkins. 
Learning from delayed rewards. PhD thesis, King's College, University of Cambridge, 1989.

[21] Y. Xia, H. Li, T. Qin, N. Yu, and T.-Y. Liu. Thompson sampling for budgeted multi-armed bandits. In IJCAI, pages 3960\u20133966, 2015.
", "award": [], "sourceid": 2459, "authors": [{"given_name": "Siwei", "family_name": "Wang", "institution": "IIIS, Tsinghua University"}, {"given_name": "Longbo", "family_name": "Huang", "institution": "IIIS, Tsinghua University"}]}