{"title": "Distributed Multi-Player Bandits - a Game of Thrones Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 7222, "page_last": 7232, "abstract": "We consider a multi-armed bandit game where N players compete for K arms for T turns. Each player has different expected rewards for the arms, and the instantaneous rewards are independent and identically distributed. Performance is measured using the expected sum of regrets, compared to the optimal assignment of arms to players. We assume that each player only knows her actions and the reward she received each turn. Players cannot observe the actions of other players, and no communication between players is possible. We present a distributed algorithm and prove that it achieves an expected sum of regrets of near-O\\left(\\log^{2}T\\right). This is the first algorithm to achieve a poly-logarithmic regret in this fully distributed scenario. All other works have assumed that either all players have the same vector of expected rewards or that communication between players is possible.", "full_text": "Distributed Multi-Player Bandits - a Game of\n\nThrones Approach\n\nIlai Bistritz\n\nStanford University\n\nAmir Leshem\n\nBar Ilan University\n\nbistritz@stanford.edu\n\nAmir.Leshem@biu.ac.il\n\nAbstract\n\nWe consider a multi-armed bandit game where N players compete for K arms\nfor T turns. Each player has different expected rewards for the arms, and the in-\nstantaneous rewards are independent and identically distributed. Performance is\nmeasured using the expected sum of regrets, compared to the optimal assignment\nof arms to players. We assume that each player only knows her actions and the\nreward she received each turn. Players cannot observe the actions of other players,\nand no communication between players is possible. 
We present a distributed algorithm and prove that it achieves an expected sum of regrets of near-O\\left(\\log^{2}T\\right). This is the first algorithm to achieve a poly-logarithmic regret in this fully distributed scenario. All other works have assumed that either all players have the same vector of expected rewards or that communication between players is possible.\n\n1 Introduction\n\nIn online learning problems, an agent needs to learn on the run how to behave optimally. The crux of these problems is the trade-off between exploration and exploitation. This trade-off is well captured by the multi-armed bandit problem, which has attracted enormous attention from the research community. Recently, there has been growing interest in the multi-player multi-armed bandit. In the multi-player scenario, the nature of the interaction between the players can take many forms. Players may want to solve the problem of finding the best mutual arm as a team [1\u20136], or may compete over the arms as resources they all individually require [7\u201319].\nThe notion of regret in the competitive multi-player multi-armed bandit problem is the expected sum of regrets, defined as the performance loss compared to the optimal assignment of arms to players. The rationale for this notion of regret comes from the perspective of the designer, who wants the distributed system of individuals to converge to a globally good solution.\nMany works have considered a scenario where all the players have the same expectations for the rewards of all arms. Some of these works assume that communication between players is possible [10\u201312, 14, 19], whereas others consider a fully distributed scenario [7, 13, 15].\nOne of the main reasons for studying resource allocation bandits is their applications in cognitive radio or wireless networks in general. 
In these scenarios, the channels are interpreted as arms and the channel gains as the rewards. However, since users are scattered in space, the physical reality dictates that different arms have different expected channel gains for different players.\nThis essential generalization to a matrix of expectations introduces a famous combinatorial optimization problem known as the assignment problem [20]. Achieving a sublinear expected sum of regrets in a distributed manner requires a distributed solution to the assignment problem, which has by itself been explored extensively, e.g., [21, 22].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nThis generalization was first considered in [9], and later enhanced in [8], where an algorithm that achieves an expected sum of regrets of near-O\\left(\\log T\\right) was presented. However, this algorithm requires communication between players. It is based on the distributed auction algorithm in [21], which is not fully distributed: it requires that players can observe the bids of other players. This was possible in [8, 9] since it was assumed that the players could observe the actions of other players, which allows them to communicate by using the arm choices as a signaling method. In [19], the authors suggest an algorithm that only assumes that users can sense all the channels, without knowing which channel was chosen by whom. This algorithm requires less communication than [8], but has no regret guarantees. In wireless networks, assuming that each user can hear all other transmissions (a fully connected network) is very demanding in practice. In a fully distributed scenario, players only have access to their previous actions and rewards. However, to date there is no completely distributed algorithm that converges to the exact optimal solution of the assignment problem. 
The fully distributed multi-armed bandit problem thus remains unresolved.\nOur work generalizes [7] to different expectations for different players, and [8, 9, 19] to a fully distributed scenario with no communication between players.\nRecently, very powerful payoff-based dynamics were introduced [23\u201325]. These dynamics only require each player to know her own action and the reward she received for that action. Specifically, the dynamics in [24] guarantee that the strategy profile with the optimal sum of utilities will be played a sufficiently large portion of the time, even if it is not a Nash equilibrium. The crucial issue in applying these results to our case is that they all assume interdependent games. In an interdependent game, each group of players can always influence at least one player from outside this group. In the multi-player multi-armed bandit collision model, this does not hold: a player in a collision receives zero reward, and nothing that other players (who chose other arms) can do will change that.\nIn this paper, we suggest novel modified dynamics that behave similarly to [24], but in our non-interdependent game. Specifically, they guarantee that the optimal solution to the assignment problem is played a considerable amount of time. We present a fully distributed multi-player multi-armed bandit algorithm for the resource allocation and collision scenario, based on these modified dynamics. By fully distributed we mean that players only have access to their own actions and rewards. This is the first algorithm that achieves a poly-logarithmic expected sum of regrets, near-O\\left(\\log^{2}T\\right), with a matrix of expected rewards and no communication at all between players.\n\n2 Problem Formulation\n\nWe consider a stochastic game with the set of players N = {1, ..., N} and a finite time horizon T. The horizon T is not known in advance by any of the players. 
The discrete turn index is denoted by t. The strategy space of each player is a set of K arms with indices denoted by i, j = 1, ..., K. We assume that K \u2265 N. At each turn t, all players simultaneously pick one arm each. The arm that player n chooses at time t is a_{n}(t) and the strategy profile at time t is a(t). Players do not know which arms the other players chose, and need not even know how many other players there are.\nDefine the set of players that chose arm i in strategy profile a as\n\nN_{i}(a) = \\left\\{n \\,\\middle|\\, a_{n} = i\\right\\}.   (1)\n\nDefine the no-collision indicator of arm i in strategy profile a as\n\n\\eta_{i}(a) = \\begin{cases} 0 & \\left|N_{i}(a)\\right| > 1 \\\\ 1 & \\text{o.w.} \\end{cases}   (2)\n\nThe instantaneous utility of player n in strategy profile a(t) at time t is\n\n\\upsilon_{n}(a(t)) = r_{n,a_{n}(t)}(t)\\,\\eta_{a_{n}(t)}(a(t))   (3)\n\nwhere r_{n,a_{n}(t)}(t) is a random reward, assumed to have a continuous distribution on [0, 1]. The sequence of rewards \\{r_{n,i}(t)\\}_{t} of arm i for player n is i.i.d. (\u201cin time\u201d) with expectation \\mu_{n,i}.\nNext we define the expected total regret, which we want our distributed algorithm to minimize.\nDefinition 1. Denote the expected utility of player n in strategy profile a by g_{n}(a) = E\\{\\upsilon_{n}(a)\\}. The total regret is defined as the random variable\n\nR = \\sum_{t=1}^{T}\\sum_{n=1}^{N}\\upsilon_{n}(a^{*}) - \\sum_{t=1}^{T}\\sum_{n=1}^{N} r_{n,a_{n}(t)}(t)\\,\\eta_{a_{n}(t)}(a(t))   (4)\n\nwhere\n\na^{*} = \\arg\\max_{a}\\sum_{n=1}^{N} g_{n}(a).   (5)\n\nThe expected total regret \\bar{R} \\triangleq E\\{R\\} is the average of (4) over the randomness of the rewards \\{r_{n,i}(t)\\}_{t}, which dictate the random actions \\{a_{n}(t)\\}.\nThe problem in (5) is none other than the famous assignment problem [20] on the N \u00d7 K matrix of expectations \\{\\mu_{n,i}\\}. 
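As a concrete illustration of the benchmark in (5) (illustrative only: the paper's point is to reach this solution distributedly, whereas here it is computed centrally with full knowledge of the expectations), the optimal assignment a^{*} is an instance of the assignment problem and can be computed with the Hungarian method, e.g., via SciPy; the example matrix below is hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_assignment(mu):
    """Solve (5): assign each of the N players a distinct arm (K >= N)
    so that the sum of expected rewards J1 is maximized."""
    n_players, n_arms = mu.shape
    assert n_arms >= n_players
    rows, cols = linear_sum_assignment(mu, maximize=True)
    a_star = {player: arm for player, arm in zip(rows, cols)}
    j1 = mu[rows, cols].sum()
    return a_star, j1

# hypothetical 2-player, 3-arm matrix of expectations {mu_{n,i}}
mu = np.array([[0.9, 0.4, 0.3],
               [0.8, 0.7, 0.2]])
a_star, j1 = optimal_assignment(mu)
# greedily giving both players their best arm would collide on arm 0;
# the optimal assignment is player 0 -> arm 0, player 1 -> arm 1 (J1 = 1.6)
```

Note that the distributed difficulty comes precisely from the coupling visible here: player 1's best arm must be conceded to player 0 to maximize the sum.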
In this sense, our problem is a generalization of the distributed assignment\nproblem to an online learning framework.\nAssuming continuously distributed rewards is well justi\ufb01ed in wireless networks. Given no collision,\nthe quality of an arm (channel) always has a continuous measure like the SNR or the channel gain.\nHowever, this assumption is only used in two arguments and can be easily replaced without changing\nthe analysis in this paper. The \ufb01rst argument is that since the probability for zero reward in a non-\ncollision is zero, players can safely rule out collisions in their estimation of the expected reward.\nIn the case where the probability for a zero reward is not zero, we can assume instead that each\nplayer can observe her collision indicator in addition to her reward. Knowing whether other players\nchose the same arm is a very modest requirement compared to assuming that players can observe\nthe actions of other players. The second argument is that the continuity of the rewards\u2019 distributions\nmakes the solution of (5), with the estimated expectations, unique with probability 1. We can assume\ninstead that {\u00b5n,i} are generated at random using a continuous distribution, so the optimal solution\nis unique with probability 1 (i.e., \u201cfor almost all games\u201d), with arbitrary distributions for the rewards\nthat have expectations {\u00b5n,i}.\nAccording to the seminal work in [26], the optimal regret of the single-player case is logarithmic;\ni.e., O (log T ). Players do not help each other; hence, we expect the expected total regret lower\nbound to be logarithmic at best. The next proposition shows that this is indeed the case.\nProposition 1. The expected total regret is at least \u2126 (log T ).\n\nProof. See Section 8 (supplementary material).\n\n3 The Game of Thrones Algorithm\n\nWhen all players have the same arm expectations, the exploration phase is used to identify the N\nbest arms. 
Once the best arms are identified, players need to coordinate to be sure that each of them will sit on a different \u201cchair\u201d (see the Musical Chairs algorithm in [7]). When players have different arm expectations, a non-cooperative game is induced where the estimated expected rewards serve as utilities. In this game, each player has a specific chair (throne) she must sit on to avoid causing a linear regret. This throne is the unique arm a player must play in the allocation that maximizes the sum of the expected rewards of all players. Any other solution will result in linear (in T) expected total regret. Note that our assignment problem has a unique optimal allocation with probability 1 (as shown in Lemma 4).\nThe total time needed for exploration increases with T since the cost of being wrong becomes higher. When T is known to the players, a long enough exploration can be accomplished at the beginning of the game. In order to maintain the right balance between exploration and exploitation when T is not known in advance to the players, we divide the T turns into epochs, one starting immediately after the other. Each epoch is further divided into three phases - exploration, Game of Thrones (GoT) and exploitation. During the exploration phase, players estimate the expected reward of each arm. The goal of the GoT phase is to let the players distributedly identify the optimal solution for the assignment problem on the estimated expected rewards from the exploration phase. It is done by playing a game with the estimated expectations as utilities, using random dynamics that probabilistically prefer strategy profiles with a higher sum of utilities. In the exploitation phase, each player plays the constant action she deduced from the GoT phase. The division into epochs is depicted in Fig. 1. The GoT Algorithm and GoT Dynamics are described in Algorithm 1 and Algorithm 2, respectively.\n\nFigure 1: Epochs structure. 
Depicted are the first and the k-th epochs.\n\nAlgorithm 1 Game of Thrones Algorithm\nInitialization - Set o_{n,i} = 0 and s_{n,i}(0) = 0 for all i. Set \\delta > 0, 0 < \\rho < 1 and \\varepsilon > 0. Define k_{T} as the index of the last epoch where the horizon is T.\nFor each epoch k = 1, ..., k_{T}\n\n1. Exploration Phase - for the next c_{1} turns:\n(a) Sample an arm i uniformly at random from all K arms.\n(b) Receive r_{n,i}(t) and set \\eta_{i}(a(t)) = 0 if r_{n,i}(t) = 0 and \\eta_{i}(a(t)) = 1 otherwise.\n(c) If \\eta_{i}(a(t)) = 1, then update o_{n,i} = o_{n,i} + 1 and s_{n,i}(t) = s_{n,i}(t-1) + r_{n,i}(t).\n(d) Estimate the expectation of arm i by \\mu^{k}_{n,i} = \\frac{s_{n,i}(t)}{o_{n,i}}, for each i = 1, ..., K.   (6)\n\n2. GoT Phase - for the next c_{2}k^{1+\\delta} turns, play according to Algorithm 2 with \\varepsilon and \\rho.\n(a) Starting from the d_{g} = \\left\\lceil \\rho c_{2}k^{1+\\delta} \\right\\rceil-th turn inside the GoT Phase, keep track of the number of times each action was played that resulted in being content:\n\nF^{n}_{t}(i) = \\sum_{l=d_{g}}^{c_{2}k^{1+\\delta}} I\\left(a_{n}(l) = i, M_{n}(l) = C\\right)   (7)\n\nwhere I is the indicator function.\n\n3. Exploitation Phase - for the next c_{3}2^{k} turns, play\n\na^{k}_{n} = \\arg\\max_{i} F^{n}_{t}(i).   (8)\n\nEnd\n\nAlgorithm 2 Game of Thrones Dynamics\nInitialization - Let c \u2265 N. Each player n has a personal state M_{n}, either content (C) or discontent (D), which determines her mixed strategy. Each player also keeps a baseline action \\bar{a}_{n} and her baseline utility u_{n}. Denote u_{n,max} = \\max_{a} u_{n}(a).\nIn each turn during the GoT Phase\n\n\u2022 A content player chooses an action according to\n\np^{a_{n}}_{n} = \\begin{cases} \\frac{\\varepsilon^{c}}{\\left|A_{n}\\right|-1} & a_{n} \\neq \\bar{a}_{n} \\\\ 1-\\varepsilon^{c} & a_{n} = \\bar{a}_{n} \\end{cases}.   (9)\n\n\u2022 A discontent player chooses an action uniformly at random; i.e.,\n\np^{a_{n}}_{n} = \\frac{1}{\\left|A_{n}\\right|}, \\quad \\forall a_{n} \\in A_{n}.   (10)\n\nThe transitions between C and D are determined as follows:\n\n\u2022 If a_{n} = \\bar{a}_{n} and u_{n} > 0, then a content player remains content with probability 1: [\\bar{a}_{n}, C] \\to [\\bar{a}_{n}, C].\n\n\u2022 If a_{n} \\neq \\bar{a}_{n} or u_{n} = 0 or M_{n} = D, then (C/D denoting either C or D)\n\n[a_{n}, C/D] \\to \\begin{cases} [a_{n}, C] & \\text{w.p. } \\frac{u_{n}}{u_{n,max}}\\varepsilon^{u_{n,max}-u_{n}} \\\\ [a_{n}, D] & \\text{w.p. } 1-\\frac{u_{n}}{u_{n,max}}\\varepsilon^{u_{n,max}-u_{n}} \\end{cases}.   (11)\n\nEnd\n\nIn this paper, we prove the following main result.\nTheorem 1 (Main Theorem). Assume that the rewards \\{r_{n,i}(t)\\}_{t} are independent in n and i.i.d. in time t, with continuous distributions on [0, 1] with positive expectations \\{\\mu_{n,i}\\}. Let the game have a finite horizon T, unknown to the players. Denote the optimal objective by J_{1} = \\max_{a}\\sum_{n=1}^{N} g_{n}(a) and the second best one by J_{2}. Let each player play according to Algorithm 1, with a small enough \\varepsilon, an exploration phase length of c_{1} > \\frac{16N^{2}K}{(J_{1}-J_{2})^{2}} and \\delta > 0. Then, for large enough T, the expected total regret is upper bounded by\n\n\\bar{R} \\leq 3c_{2}N\\log_{2}^{2+\\delta}\\left(\\frac{T}{c_{3}}+2\\right) = O\\left(\\log^{2+\\delta}T\\right).   (12)\n\nProof. Let \\delta > 0. Denote the number of epochs that start within T turns by E. Since\n\nT \\geq \\sum_{k=1}^{E-1}\\left(c_{1}+c_{2}k^{1+\\delta}+c_{3}2^{k}\\right) \\geq c_{3}\\left(2^{E}-2\\right)   (13)\n\nE is upper bounded by E \\leq \\log_{2}\\left(\\frac{T}{c_{3}}+2\\right). Denote by P_{e,k} and P_{c,k} the error probabilities of the exploration and GoT phases of epoch k, respectively. 
Observe that if none of these errors occurred, the optimal solution to (5) is played in the k-th exploitation phase, which adds no additional regret to the total regret. We will prove in Lemma 2 and Lemma 5 that P_{e,k} \\leq 4K^{2}e^{-k} and P_{c,k} \\leq A_{0}e^{-\\frac{c_{2}(1-\\rho)}{1728T_{m}(1/8)}k^{1+\\delta}}, where A_{0} is a constant and T_{m}\\left(\\frac{1}{8}\\right) is the mixing time of the Markov chain of the GoT Dynamics. Note that T_{m}\\left(\\frac{1}{8}\\right) depends on N, K and \\varepsilon, so there exists a k_{0} such that for all k > k_{0} we have\n\ne^{-\\frac{c_{2}(1-\\rho)}{1728T_{m}(1/8)}k^{\\delta}} < \\frac{1}{2}.   (14)\n\nWe now bound the expected total regret of epoch k > k_{0}, denoted by \\bar{R}_{k}, as follows:\n\n\\bar{R}_{k} \\leq \\left(c_{1}+c_{2}k^{1+\\delta}\\right)N + \\left(4K^{2}e^{-k} + A_{0}e^{-\\frac{c_{2}(1-\\rho)}{1728T_{m}(1/8)}k^{1+\\delta}}\\right)c_{3}2^{k}N \\leq c_{1}N + 2A_{0}c_{3}N\\beta^{k} + c_{2}k^{1+\\delta}N   (15)\n\nfor some constant \\beta < 1. We conclude that, for some additive constant C,\n\n\\bar{R} = \\sum_{k=1}^{E}\\bar{R}_{k} \\overset{(a)}{\\leq} C + 2c_{2}N\\sum_{k=k_{0}+1}^{E}k^{1+\\delta} \\leq C + 2c_{2}NE^{2+\\delta} \\overset{(b)}{\\leq} C + 2c_{2}N\\log_{2}^{2+\\delta}\\left(\\frac{T}{c_{3}}+2\\right)   (16)\n\nwhere (a) follows since completing the last epoch to a full epoch increases \\bar{R}_{k}, and (b) is (13).\nIf either the exploration or the GoT phase fails, the regret becomes linear in T. Like many other online learning algorithms, we avoid a linear expected regret by ensuring that the error probabilities vanish with T. By using instead a single epoch with a constant duration for the first two phases, we obtain that with high probability (in T) our algorithm achieves a constant regret (as in [7]). 
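To make the epoch bookkeeping in (13) concrete, a small sketch (our own illustration, not part of the algorithm; c_1, c_2, c_3 take the values chosen later in Section 6, and \delta = 0.1 is an arbitrary choice satisfying \delta > 0) enumerates the epochs that start within a horizon T and checks the bound E \leq \log_2(T/c_3 + 2):

```python
import math

def epoch_schedule(T, c1=1000, c2=6000, c3=6000, delta=0.1):
    """List the (index, length) of every epoch of Algorithm 1 that starts
    within T turns: epoch k spends c1 turns on exploration,
    c2 * k^(1+delta) on the GoT phase, and c3 * 2^k on exploitation."""
    epochs, used, k = [], 0, 1
    while used < T:
        length = c1 + c2 * k ** (1 + delta) + c3 * 2 ** k
        epochs.append((k, length))
        used += length
        k += 1
    return epochs

T = 10 ** 7
E = len(epoch_schedule(T))
# the bound used in the proof: the number of epochs is at most log2(T/c3 + 2)
assert E <= math.log2(T / 6000 + 2)
```

The doubling exploitation phase is what keeps the number of epochs logarithmic in T, which is where the extra \log factor in the regret bound comes from.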
However, our main result is formulated using the more conservative formulation of the expected regret.\n\n4 Exploration Phase - Estimation of the Expected Rewards\n\nIn this section, we describe the exploration phase and analyze its addition to the expected total regret. At the beginning of the game, players do not yet have any evaluation of the K different arms. They estimate these values on the run, based on the rewards they get. We propose a pure exploration phase where each player picks an arm uniformly at random, similar to the one suggested in [7]. Note that in contrast to [7], we do not assume that T is known to the players. Hence, the exploration phase is repeated in each epoch. In each epoch, only a constant number c_{1} of turns is dedicated to exploration. However, the estimation uses all the previous exploration phases, so the number of samples used for estimation grows linearly with time.\nThe estimation of the expected rewards is never perfect. Hence, the optimal solution to the assignment problem given the estimated expectations might differ from the optimal solution with the correct expectations. However, if the uncertainty in the true value of each expectation is small enough, we expect both of these optimal assignments to coincide. This is exactly the precision we require from the estimation, as formulated in the following lemma.\nLemma 1. Assume that \\{\\mu_{n,i}\\} are known up to an uncertainty of \\Delta; i.e., \\left|\\hat{\\mu}_{n,i}-\\mu_{n,i}\\right| \\leq \\Delta for each n and i, for some \\{\\hat{\\mu}_{n,i}\\}. Denote the optimal assignment by a_{1} = \\arg\\max_{a}\\sum_{n=1}^{N}g_{n}(a) and its objective by J_{1} = \\sum_{n=1}^{N}g_{n}(a_{1}). Denote the second best objective and the corresponding assignment by J_{2} and a_{2}, respectively. 
If \\Delta < \\frac{J_{1}-J_{2}}{2N}, then\n\n\\arg\\max_{a}\\sum_{n=1}^{N}g_{n}(a) = \\arg\\max_{a}\\sum_{n=1}^{N}\\hat{\\mu}_{n,a_{n}}\\eta_{a_{n}}(a)   (17)\n\nso that the optimal assignment does not change due to the uncertainty.\n\nProof. See Section 8 (supplementary material).\n\nIf the exploration phase is long enough, players know their arm expectations accurately enough, with a very small failure probability. The following lemma concludes this section by providing an upper bound for the probability that the estimation for epoch k failed.\nLemma 2 (Exploration Error Probability). Let \\{\\mu^{k}_{n,i}\\} be the estimated reward expectations using all the exploration phases up to epoch k. Denote a^{*} = \\arg\\max_{a}\\sum_{n=1}^{N}g_{n}(a) and a^{k*} = \\arg\\max_{a}\\sum_{n=1}^{N}\\mu^{k}_{n,a_{n}}\\eta_{a_{n}}(a). Also denote J_{1} = \\sum_{n=1}^{N}g_{n}(a^{*}) and the second best\u00b9 objective by J_{2}. If the length of the exploration phase satisfies c_{1} > \\frac{16N^{2}K}{(J_{1}-J_{2})^{2}}, then after the k-th epoch we have\n\nP_{e,k} \\triangleq \\Pr\\left(a^{*} \\neq a^{k*}\\right) \\leq 4K^{2}e^{-k}.   (18)\n\nProof. See Section 8 (supplementary material).\n\n5 Game of Thrones Dynamics Phase\n\nIn this section, we analyze the Game of Thrones (GoT) dynamics between players. These dynamics guarantee that the optimal state will be played a significant amount of time, and only require the players to know their own action and the received payoff in each turn. Note that these dynamics assume deterministic utilities. We use the estimated expected reward of each arm as the utility for this step, and zero if a collision occurred. This means that players ignore the numerical reward they receive by choosing the arm, as long as it is positive.\nDefinition 2. The game of thrones G of epoch k has the N players of the original multi-armed bandit game. Each player can choose from among the K arms, so A_{n} = \\{1, ..., K\\} for each n. 
The utility of player n in the strategy profile a = (a_{1}, ..., a_{N}) is\n\nu_{n}(a) = \\mu^{k}_{n,a_{n}}\\eta_{a_{n}}(a)   (19)\n\nwhere \\mu^{k}_{n,a_{n}} is the estimation of the expected reward of arm a_{n}, from all the exploration phases that have ended, up to epoch k. Also denote u_{n,max} = \\max_{a} u_{n}(a).\nOur dynamics belong to the family introduced in [23\u201325]. These dynamics guarantee that the strategy profiles with the optimal sum of utilities will be played a sufficiently large portion of the turns. However, they all rely on the following structural property of the game, called interdependence.\n\n\u00b9Note that this is the second best objective and not the second best allocation, so J_{2} < J_{1}. If all allocations have the same objective, then this lemma trivially holds with c_{1} \\geq 1.\n\nDefinition 3. A game G with finite action spaces A_{1}, ..., A_{N} is interdependent if for every strategy profile a \\in A_{1} \\times ... \\times A_{N} and every set of players J \\subset N, there exists a player n \\notin J and a choice of actions a'_{J} \\in \\prod_{m \\in J} A_{m} such that u_{n}(a'_{J}, a_{-J}) \\neq u_{n}(a_{J}, a_{-J}).\nOur GoT is not interdependent. To see this, pick any strategy profile a such that some players are in a collision while others are not. Choose J as the set of all players that are not in a collision. All players outside this set are in a collision, and there does not exist any colliding player whose utility the actions of the non-colliding players can make non-zero.\nThe GoT Dynamics in Algorithm 2 modify [24] such that interdependence is no longer needed. Note that in comparison with [24], our dynamics assign zero probability to the event that a player with u_{n} = 0 (in a collision) is content. Additionally, we do not need to keep the benchmark utility as part of the state. 
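To make the modified dynamics concrete, here is a minimal simulation sketch of one turn of Algorithm 2 (our own illustrative implementation, not the authors' code; the state encoding and the example matrix below are assumptions). Note how a colliding player, whose utility is zero, can never become content:

```python
import random

def got_step(state, mu, K, eps, c):
    """One turn of the GoT dynamics for all N players.
    state[n] = (baseline_arm, mood), mood in {'C', 'D'};
    mu[n][i] plays the role of the estimated expectation mu^k_{n,i}."""
    actions = []
    for base, mood in state:
        if mood == 'C':
            # content: baseline w.p. 1 - eps^c, any other arm w.p. eps^c / (K - 1)
            if random.random() < eps ** c:
                actions.append(random.choice([i for i in range(K) if i != base]))
            else:
                actions.append(base)
        else:
            # discontent: uniformly at random over all K arms
            actions.append(random.randrange(K))
    new_state = []
    for n, (base, mood) in enumerate(state):
        a = actions[n]
        collided = sum(1 for b in actions if b == a) > 1
        u = 0.0 if collided else mu[n][a]
        if mood == 'C' and a == base and u > 0:
            new_state.append((base, 'C'))  # undisturbed content player stays content
        else:
            u_max = max(mu[n])
            # accept the new baseline w.p. (u / u_max) * eps^(u_max - u);
            # a player with u = 0 (collision) becomes discontent with probability 1
            if u > 0 and random.random() < (u / u_max) * eps ** (u_max - u):
                new_state.append((a, 'C'))
            else:
                new_state.append((a, 'D'))
    return new_state, actions
```

Running this step repeatedly and counting, from turn d_g on, how often each player is content with each arm gives the F^n_t(i) statistic of (7).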
A player knows with probability 1 whether there was a collision, and if there was not, she gets the same utility for the same arm. Our dynamics require that each player uses c \u2265 N. The number of players N might be unknown. In this case, players can use c \u2265 K, since the number of arms is known and K \u2265 N by the definition of the problem.\nThe GoT dynamics induce a Markov chain over the state space Z = \\prod_{n=1}^{N}\\left(A_{n} \\times M\\right), where M = \\{C, D\\}. The transition matrix of this Markov chain is denoted by P^{\\varepsilon}. The following lemma characterizes the recurrence classes of the unperturbed chain P^{0} (with \\varepsilon = 0). In [24], interdependence was used to prove the same result; this is the sole reason interdependence was required in the first place. We provide an alternative proof that does not require interdependence, but instead uses the fact that in our modified dynamics, no player can be content with u_{n} = 0. Note that this proof exploits the structure of the GoT, and cannot be applied to a more general game.\nLemma 3. Denote by D^{0} the set of all the discontent states (all players are discontent) and by C^{0} the set of all singleton content states (all players are content). The recurrence classes of the unperturbed process P^{0} are D^{0} and all z \\in C^{0}.\n\nProof. In P^{0}, there is no path between the discontent states and the content ones. Moreover, all the discontent states are connected, and all the content states are absorbing (i.e., singletons). Now assume there is a different recurrence class. In any state in this class, denoted z_{C/D}, not all the players are content; otherwise this is a z \\in C^{0} singleton. Denote one of the discontent players by n. Since she chooses her action at random, there is a positive probability that she will pick the same arm as any of the content players. By doing so, she changes the state of this player to discontent with probability 1. 
With \\varepsilon = 0, every discontent player remains so with probability one. On the next turn, a discontent player may again choose the arm of a content player with a positive probability. By repeating this process, we conclude that there is a positive probability that all players become discontent. Hence, z_{C/D} is connected to D^{0} in P^{0}. We conclude that this different recurrence class is in fact connected to D^{0}, which is a contradiction.\n\nThe process Z of the GoT dynamics is a regular perturbed Markov chain, defined as follows.\nDefinition 4. P^{\\varepsilon} is called a regular perturbed Markov process if P^{\\varepsilon} is ergodic for all sufficiently small \\varepsilon > 0 and for every z, z' \\in Z we have\n\n\\lim_{\\varepsilon \\to 0^{+}} P^{\\varepsilon}_{zz'} = P^{0}_{zz'}   (20)\n\nand if P^{\\varepsilon}_{zz'} > 0 for some \\varepsilon > 0, then\n\n0 < \\lim_{\\varepsilon \\to 0^{+}} \\frac{P^{\\varepsilon}_{zz'}}{\\varepsilon^{r(z \\to z')}} < \\infty   (21)\n\nfor some real non-negative r(z \\to z') that is called the resistance of the transition z \\to z'.\nNext we define stochastic stability, which is a powerful convergence analysis tool.\nDefinition 5. Let P^{\\varepsilon} be a regular perturbed Markov process and \\mu^{\\varepsilon} its unique stationary distribution that exists for \\varepsilon > 0. A state z \\in Z is stochastically stable if and only if\n\n\\lim_{\\varepsilon \\to 0^{+}} \\mu^{\\varepsilon}(z) > 0.   (22)\n\nIn [24], it is shown for their dynamics that only the states with the maximal sum of utilities are stochastically stable. For a small enough \\varepsilon, the dynamics will visit the stochastically stable states very often. However, there might be several stochastically stable states, and the dynamics might fluctuate between them. Fortunately, in our case, as shown in the following lemma, there is a unique optimal state with probability one. 
For a small enough \\varepsilon, the unique optimal state is played more than half of the time, which allows the players to distributedly agree on the optimal solution. This uniqueness is due to the continuous distribution of the rewards, which makes the distribution of the empirical estimates of the expectations continuous as well.\nLemma 4. The optimal solution to \\max_{a}\\sum_{n=1}^{N}u_{n}(a) is unique with probability 1.\n\nProof. First note that an optimal solution must not contain any collisions, since otherwise it can be improved, as K \\geq N. Let \\{\\mu^{k}_{n,i}\\} be the estimated reward expectations in epoch k. For two different solutions \\tilde{a} \\neq a^{*} to both be optimal, we must have \\sum_{n=1}^{N}\\mu^{k}_{n,\\tilde{a}_{n}} = \\sum_{n=1}^{N}\\mu^{k}_{n,a^{*}_{n}}. However, \\tilde{a} and a^{*} must differ in at least one assignment. Since the distributions of the rewards r_{n,a_{n}} are continuous, so are the distributions of \\sum_{n=1}^{N}\\mu^{k}_{n,a_{n}} (as a sum of averages of the rewards). Hence \\Pr\\left(\\sum_{n=1}^{N}\\mu^{k}_{n,\\tilde{a}_{n}} = \\sum_{n=1}^{N}\\mu^{k}_{n,a^{*}_{n}}\\right) = 0, and the result follows.\n\nNext we show that only the unique optimal state is stochastically stable. This means that after enough time, the action that a player has played most of the time is highly likely to be part of the unique optimal solution. This is crucial for the success of the exploitation phase.\nTheorem 2. Define a^{k*} = \\arg\\max_{a}\\sum_{n=1}^{N}u_{n}(a). Under the GoT dynamics, the unique stochastically stable state is z^{*} = \\left[a^{k*}, C^{N}\\right] with probability 1.\n\nProof. See Section 8 (supplementary material).\n\nNow we can prove the main lemma of this section, which gives an upper bound for the probability that the GoT phase does not lead to the optimal solution.\nLemma 5 (GoT Error Probability). 
Let \\delta > 0. Define a^{k*} = \\arg\\max_{a}\\sum_{n=1}^{N}u_{n}(a) and \\tilde{a} = (\\tilde{a}_{1}, ..., \\tilde{a}_{N}), where \\tilde{a}_{n} = \\arg\\max_{i}F^{n}_{t}(i) for all n. For a small enough \\varepsilon, the error probability of the k-th GoT phase, which is the probability that a strategy profile other than a^{k*} will be played in the exploitation phase, is bounded as follows:\n\nP_{c,k} \\triangleq \\Pr\\left(\\tilde{a} \\neq a^{k*}\\right) \\leq A_{0}e^{-\\frac{c_{2}(1-\\rho)}{1728T_{m}(1/8)}k^{1+\\delta}}   (23)\n\nwhere A_{0} is a constant with respect to t (or k), and may depend on N, K, \\varepsilon and the initial state.\n\nProof. See Section 8 (supplementary material).\n\n6 Numerical Simulations and Practical Considerations\n\nThe total regret compares the sum of utilities to the ideal one that could have been achieved in a centralized scenario. With no communication between players and with a matrix of expected rewards, the gap to this ideal naturally increases. In this scenario, converging to the exact optimal solution might take a long time, even for the (unknown) optimal algorithm. Our main result provides theoretical guarantees for the asymptotic performance of our algorithm, which suggest that performance improves with time on its way to converging to the optimal solution. The simulations in this section complete the picture by showing how the sum of utilities behaves in the non-asymptotic regime.\nWe simulated a multi-armed bandit game with \\{\\mu_{n,i}\\} that are chosen independently and uniformly at random in [0.05, 0.95]. The rewards are generated as r_{n,i}(t) = \\mu_{n,i} + z_{n,i}(t), where \\{z_{n,i}(t)\\} are independent and uniformly distributed on [-0.05, 0.05] for each n, i.\nIn the simulations presented here we use \\delta = 0, since it yields good results in practice. We conjecture that the bound (23) is not tight for our particular Markov chain and indicator function, since it applies to all Markov chains with the same mixing time and to all functions on the states. 
This explains why modest choices of $c_{2}$ are large enough and the $k^{\delta}$ factor in the exponent is not needed in practice.

The lengths of the phases should be chosen so that the exploitation phase occupies most of the turns already in early epochs, while still allowing for a considerable GoT phase. Note that the exploration phase ($c_{1}$) is much easier than the GoT phase ($c_{2}$) and achieves good accuracy relatively fast. Hence we choose $c_{1}=1000$ and $c_{2}=c_{3}=6000$. We use $\rho=\frac{1}{2}$ in the simulations we present, since the performance is very similar for $\rho$ values not too close to zero or one. We use $c=N$, which gives the highest possible escape probability of $\varepsilon^{c}$ from a content state.

In Fig. 2, we present the sample mean of the accumulated sum of utilities $\frac{1}{t}\sum_{n=1}^{N}\sum_{\tau=1}^{t}u_{n}\left(a\left(\tau\right)\right)$ as a function of time $t$, averaged over 100 experiments. The performance was normalized by the optimal solution to the assignment problem (for each experiment). In the left graph we compare our sum of utilities for $N=K=5$ to that of the selfish algorithm, reported to achieve good performance for this problem in [17], and to a random choice of arms. The selfish algorithm consists of each player playing a standard upper confidence bound (UCB) algorithm, treating collisions as any other value of the reward. Both algorithms perform much better than random selection. Our sum of utilities is slightly better and increases with time. More importantly, our algorithm has provable performance guarantees, whereas [17] provides none. On its way to converging to the optimal solution, our algorithm performs very well from the start. While visiting near-optimal solutions inflicts linear regret at the beginning, this behavior is very satisfactory in practice, considering that players cannot communicate and have player-dependent expected rewards. Similar results were obtained for different choices of $c_{1},c_{2},c_{3}$.
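As a concrete reference point, the selfish baseline described above can be sketched as follows. This is a minimal illustration, not the code behind Fig. 2: the UCB1 exploration constant, the random tie-breaking, and the convention that a collision yields a reward of 0 are all assumptions, since the paper only states that collisions are treated as any other reward value.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)

# Small instance in the spirit of the left graph of Fig. 2.
N, K, T = 5, 5, 20_000  # players, arms, horizon

# Reward model from the simulations: expectations mu_{n,i} drawn uniformly
# in [0.05, 0.95]; instantaneous reward is mu plus uniform noise on [-0.05, 0.05].
mu = rng.uniform(0.05, 0.95, size=(N, K))

# Selfish baseline: every player independently runs UCB1 and treats a
# collision like an ordinary reward sample (reward 0 here -- an assumption).
counts = np.zeros((N, K))
means = np.zeros((N, K))
total_utility = 0.0

for t in range(1, T + 1):
    ucb = np.where(counts > 0,
                   means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1)),
                   np.inf)  # unplayed arms get priority
    # Random tie-breaking, so that identically initialized players
    # do not keep choosing the same arm and collide forever.
    actions = np.array([rng.choice(np.flatnonzero(row == row.max()))
                        for row in ucb])
    chosen, occupancy = np.unique(actions, return_counts=True)
    collided = set(chosen[occupancy > 1])
    for n in range(N):
        a = actions[n]
        r = 0.0 if a in collided else mu[n, a] + rng.uniform(-0.05, 0.05)
        total_utility += r
        counts[n, a] += 1
        means[n, a] += (r - means[n, a]) / counts[n, a]

# Normalize by the optimal assignment, as done for Fig. 2 (K = N, so the
# optimum is a maximum-weight perfect matching; brute force suffices here).
best = max(sum(mu[n, p[n]] for n in range(N)) for p in permutations(range(K)))
print(total_utility / (T * best))  # normalized sample mean of the sum of utilities
```

The brute-force normalization is only viable for small $N$; for larger instances a Hungarian-algorithm solver would compute the optimal assignment.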
In the right graph, we present the median and the best 90% of the sample mean of the sum of utilities for $K=N=6$ and $\varepsilon=0.01,0.001,0.0001$. It is evident that our algorithm behaves very similarly in all 100 experiments, indicating that it is robust and rarely fails. Additionally, our algorithm behaves very similarly over a wide range of $\varepsilon$ values (two orders of magnitude). This supports the intuition that there is no threshold phenomenon in $\varepsilon$ (becoming “small enough”), since the dynamics prefer states with a higher sum of utilities for all $\varepsilon<1$.

Figure 2: Sample mean of the sum of utilities as a function of time, averaged over 100 experiments.

7 Conclusions and Future Work

In this paper, we considered a multi-player multi-armed bandit game where players compete over the arms as resources. In contrast to all existing multi-player bandit problems, we allow the expected rewards of the arms to differ between players and assume that each player only knows her own actions and rewards. We proposed a novel fully distributed algorithm that achieves a poly-logarithmic expected total regret of near-$O\left(\log^{2}T\right)$ when the horizon $T$ is unknown to the players.

Our simulations suggest that tuning the parameters of our algorithm is a relatively easy task in practice. The algorithm designer can do so by simulating a random model for the unknown environment and varying the parameters, knowing that only very slack accuracy is needed for the tuning.

It is still an open question whether the lower bound $\Omega\left(\log T\right)$ on the expected total regret is tight for a fully distributed algorithm.

Our game is not a general one but has a structure that allowed us to modify the dynamics so that the interdependence assumption can be dropped.
We conjecture that the same structure can be exploited to accelerate the convergence rate of the GoT dynamics, specifically by relaxing the $c\geq N$ condition.

References

[1] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Multi-armed bandits in multi-agent networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, 2017, pp. 2786–2790.

[2] E. Hillel, Z. S. Karnin, T. Koren, R. Lempel, and O. Somekh, “Distributed exploration in multi-armed bandits,” in Advances in Neural Information Processing Systems, 2013, pp. 854–862.

[3] P. Landgren, V. Srivastava, and N. E. Leonard, “Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms,” in Decision and Control (CDC), 2016 IEEE 55th Conference on, 2016, pp. 167–172.

[4] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora, “Delay and cooperation in nonstochastic bandits,” in Conference on Learning Theory, 2016, pp. 605–622.

[5] B. Szörényi, R. Busa-Fekete, I. Hegedűs, R. Ormándi, M. Jelasity, and B. Kégl, “Gossip-based distributed stochastic bandit algorithms,” in International Conference on Machine Learning, 2013, pp. 19–27.

[6] N. Korda, B. Szörényi, and L. Shuai, “Distributed clustering of linear bandits in peer to peer networks,” in International Conference on Machine Learning, vol. 48, 2016, pp. 1301–1309.

[7] J. Rosenski, O. Shamir, and L. Szlak, “Multi-player bandits – a musical chairs approach,” in International Conference on Machine Learning, 2016, pp. 155–163.

[8] N. Nayyar, D. Kalathil, and R. Jain, “On regret-optimal learning in decentralized multi-player multi-armed bandits,” IEEE Transactions on Control of Network Systems, vol. PP, no. 99, pp. 1–1, 2016.

[9] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.

[10] K. Liu and Q. Zhao, “Distributed learning in multi-armed bandit with multiple players,” IEEE Transactions on Signal Processing, vol. 58, no. 11, pp. 5667–5681, 2010.

[11] S. Vakili, K. Liu, and Q. Zhao, “Deterministic sequencing of exploration and exploitation for multi-armed bandit problems,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 759–767, 2013.

[12] L. Lai, H. Jiang, and H. V. Poor, “Medium access in cognitive radio networks: A competitive multi-armed bandit framework,” in Signals, Systems and Computers, 2008 42nd Asilomar Conference on, 2008, pp. 98–102.

[13] A. Anandkumar, N. Michael, A. K. Tang, and A. Swami, “Distributed algorithms for learning and cognitive medium access with logarithmic regret,” IEEE Journal on Selected Areas in Communications, vol. 29, no. 4, pp. 731–745, 2011.

[14] H. Liu, K. Liu, and Q. Zhao, “Learning in a changing world: Restless multiarmed bandit with unknown dynamics,” IEEE Transactions on Information Theory, vol. 59, no. 3, pp. 1902–1916, 2013.

[15] O. Avner and S. Mannor, “Concurrent bandits and cognitive radio networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014, pp. 66–81.

[16] N. Evirgen and A. Kose, “The effect of communication on noncooperative multiplayer multi-armed bandit problems,” arXiv preprint arXiv:1711.01628, 2017.

[17] L. Besson and E. Kaufmann, “Multi-player bandits revisited,” in Algorithmic Learning Theory, 2018, pp. 56–92.

[18] J. Cohen, A. Héliou, and P. Mertikopoulos, “Learning with bandit feedback in potential games,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

[19] O. Avner and S. Mannor, “Multi-user lax communications: a multi-armed bandit approach,” in INFOCOM 2016 – The 35th Annual IEEE International Conference on Computer Communications, 2016, pp. 1–9.

[20] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Courier Corporation, 1998.

[21] D. P. Bertsekas, “The auction algorithm: A distributed relaxation method for the assignment problem,” Annals of Operations Research, vol. 14, no. 1, pp. 105–123, 1988.

[22] M. M. Zavlanos, L. Spesivtsev, and G. J. Pappas, “A distributed auction algorithm for the assignment problem,” in Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, 2008, pp. 1212–1217.

[23] B. S. Pradelski and H. P. Young, “Learning efficient Nash equilibria in distributed systems,” Games and Economic Behavior, vol. 75, no. 2, pp. 882–897, 2012.

[24] J. R. Marden, H. P. Young, and L. Y. Pao, “Achieving Pareto optimality through distributed learning,” SIAM Journal on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, 2014.

[25] A. Menon and J. S. Baras, “Convergence guarantees for a decentralized algorithm achieving Pareto optimality,” in American Control Conference (ACC), 2013.

[26] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.

[27] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher, “Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified,” in 29th International Symposium on Theoretical Aspects of Computer Science, 2012, p. 124.