{"title": "An Information-Theoretic Analysis for Thompson Sampling with Many Actions", "book": "Advances in Neural Information Processing Systems", "page_first": 4157, "page_last": 4165, "abstract": "Information-theoretic Bayesian regret bounds of Russo and Van Roy capture the dependence of regret on prior uncertainty. However, this dependence is through entropy, which can become arbitrarily large as the number of actions increases. We establish new bounds that depend instead on a notion of rate-distortion. Among other things, this allows us to recover through information-theoretic arguments a near-optimal bound for the linear bandit. We also offer a bound for the logistic bandit that dramatically improves on the best previously available, though this bound depends on an information-theoretic statistic that we have only been able to quantify via computation.", "full_text": "An Information-Theoretic Analysis for\nThompson Sampling with Many Actions\n\nShi Dong\n\nStanford University\n\nsdong15@stanford.edu\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nInformation-theoretic Bayesian regret bounds of Russo and Van Roy [8] capture\nthe dependence of regret on prior uncertainty. However, this dependence is through\nentropy, which can become arbitrarily large as the number of actions increases. We\nestablish new bounds that depend instead on a notion of rate-distortion. Among\nother things, this allows us to recover through information-theoretic arguments a\nnear-optimal bound for the linear bandit. We also offer a bound for the logistic\nbandit that dramatically improves on the best previously available, though this\nbound depends on an information-theoretic statistic that we have only been able to\nquantify via computation.\n\n1\n\nIntroduction\n\n(cid:113)\n\n(cid:113)\n\nThompson sampling [11] has proved to be an effective heuristic across a broad range of online\ndecision problems [2, 10]. 
Russo and Van Roy [8] provided an information-theoretic analysis that yields insight into the algorithm's broad applicability and establishes a bound of √(Γ̄·H(A*)·T) on cumulative expected regret over T time periods of any algorithm and online decision problem. The information ratio Γ̄ is a statistic that captures the manner in which an algorithm trades off between immediate reward and information acquisition; Russo and Van Roy [8] bound the information ratio of Thompson sampling for particular classes of problems. The entropy H(A*) of the optimal action quantifies the agent's initial uncertainty.

If the prior distribution of A* is uniform, the entropy H(A*) is the logarithm of the number of actions. As such, √(Γ̄·H(A*)·T) grows arbitrarily large with the number of actions. On the other hand, even for problems with infinite action sets, like the linear bandit with a polytopic action set, Thompson sampling is known to obey graceful regret bounds [6]. This suggests that the dependence on entropy leaves room for improvement.

In this paper, we establish bounds that depend on a notion of rate-distortion instead of entropy. Our new line of analysis is inspired by rate-distortion theory, a branch of information theory that quantifies the amount of information required to learn an approximation [3]. This concept was also leveraged in recent work of Russo and Van Roy [9], which develops an alternative to Thompson sampling that aims to learn satisficing actions. An important difference is that the results of this paper apply to Thompson sampling itself.

We apply our analysis to linear and generalized linear bandits and establish Bayesian regret bounds that remain sharp with large action spaces. For the d-dimensional linear bandit setting, our bound is O(d·√(T log T)), which is tighter than the O(d·√T·log T) bound of [7]. Our bound also improves on the previous O(√(d·T·H(A*))) information-theoretic bound of [8] since it does not depend on the number of actions. Our Bayesian regret bound is within a factor of O(√(log T)) of the Ω(d·√T) worst-case regret lower bound of [4].

For the logistic bandit, previous bounds for Thompson sampling [7] and upper-confidence-bound algorithms [5] scale linearly with sup_x φ'(x)/inf_x φ'(x), where φ is the logistic function φ(x) = e^(βx)/(1 + e^(βx)). These bounds explode as β → ∞, since lim_{β→∞} sup_x φ'(x) = ∞. This does not make sense because, as β grows, the reward of each action approaches a deterministic binary value, which should simplify learning. Our analysis addresses this gap in understanding by establishing a bound that decays as β becomes large, converging to 2d·√(T·log 3) for any fixed T. However, this analysis relies on a conjecture about the information ratio of Thompson sampling for the logistic bandit, which we only support through computational results.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2 Problem Formulation

We consider an online decision problem in which over each time period t = 1, 2, ..., an agent selects an action A_t from a finite action set A and observes an outcome Y_{A_t} ∈ Y, where Y denotes the set of possible outcomes. A fixed and known system function g associates outcomes with actions according to

Y_a = g(a, θ*, W),

where a ∈ A is the action, W is an exogenous noise term, and θ* is the "true" model unknown to the agent. Here we adopt the Bayesian setting, in which θ* is a random variable taking value in a space of parameters Θ. The randomness of θ* stems from the prior uncertainty of the agent. To keep notation succinct and avoid measure-theoretic issues, we assume that Θ = {θ_1, ..., θ_m} is a finite set; our analysis extends to cases where both A and Θ are infinite.

The reward function R : Y → ℝ assigns a real-valued reward to each outcome. As a shorthand we define

μ(a, θ) = E[R(Y_a) | θ* = θ],  ∀a ∈ A, θ ∈ Θ.

Simply stated, μ(a, θ) is the expected reward of action a when the true model is θ. We assume that, conditioned on the true model parameter and the selected action, the reward is bounded¹, i.e.

sup_{y∈Y} R(y) − inf_{y∈Y} R(y) ≤ 1.

In addition, for each parameter θ, let α(θ) be the optimal action under model θ, i.e.

α(θ) = argmax_{a∈A} μ(a, θ).

Note that ties induced by the argmax can be circumvented by expanding Θ with identical elements. Let A* = α(θ*) be the "true" optimal action and let R* = μ(A*, θ*) be the corresponding maximum reward.

Before making her decision at the beginning of period t, the agent has access to the history up to time t−1, which we denote by

H_{t−1} = (A_1, Y_{A_1}, ..., A_{t−1}, Y_{A_{t−1}}).

A policy π = (π_1, π_2, ...) is defined as a sequence of functions mapping histories and exogenous noise to actions, which can be written as

A_t = π_t(H_{t−1}, ξ_t),  t = 1, 2, ...,

where ξ_t is a random variable which characterizes the algorithmic randomness. 
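To make the formulation concrete, here is a small, entirely illustrative instance in Python (the Bernoulli outcome model, the particular Θ, and the prior are our own toy choices, not taken from the paper): outcomes are Bernoulli, R is the identity, and Θ has three elements.

```python
import numpy as np

# Illustrative finite instance of the formulation above: outcomes are
# Bernoulli, R is the identity, and Theta = {theta_1, theta_2, theta_3}.
# Row k holds the success probability of each action under theta_k.
Theta = np.array([
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
    [0.3, 0.4, 0.7],
])
rng = np.random.default_rng(0)

def g(a, theta_row, w):
    """System function: Y_a = g(a, theta*, W), with exogenous noise W ~ U[0, 1]."""
    return int(w < theta_row[a])

mu = Theta                    # mu(a, theta_k) = E[R(Y_a) | theta* = theta_k]
alpha = mu.argmax(axis=1)     # alpha(theta_k) = argmax_a mu(a, theta_k)

# Draw theta* from a prior; then A* = alpha(theta*) and R* = mu(A*, theta*).
prior = np.array([0.5, 0.3, 0.2])
k_star = rng.choice(3, p=prior)
A_star, R_star = alpha[k_star], mu[k_star, alpha[k_star]]
print(A_star, R_star)
```

Because Θ is finite, posterior updates and all the information quantities used later reduce to finite sums over this table.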
The performance of policy π is evaluated by the finite-horizon Bayesian regret, defined by

BayesRegret(T; π) = E[ Σ_{t=1}^T (R* − R(Y_{A_t})) ],

where the actions are chosen by policy π, and the expectation is taken over the randomness in both R* and (A_t)_{t=1}^T.

¹The boundedness assumption allows application of a basic version of Pinsker's inequality. Since there exists a version of Pinsker's inequality that applies to sub-Gaussian random variables (see Lemma 3 of [8]), all of our results hold without change for 1/4-sub-Gaussian rewards, i.e.

E[exp{λ·[R(g(a, θ, W)) − μ(a, θ)]}] ≤ exp(λ²/8),  ∀λ ∈ ℝ, a ∈ A, θ ∈ Θ.

3 Thompson Sampling and Information Ratio

The Thompson sampling policy π^TS is defined such that at each period, the agent samples the next action according to her posterior belief of the optimal action, i.e.

P(π^TS_t(H_{t−1}, ξ_t) = a | H_{t−1}) = P(A* = a | H_{t−1}),  a.s. ∀a ∈ A, t = 1, 2, ....

An equivalent definition, which we use throughout our analysis, is that over period t the agent samples a parameter θ_t from the posterior of the true parameter θ*, and plays the action A_t = α(θ_t). The history available to the agent is thus

H̃_t = (θ_1, Y_{α(θ_1)}, ..., θ_t, Y_{α(θ_t)}).

The information ratio, first proposed in [8], quantifies the trade-off between exploration and exploitation. Here we adopt the simplified definition in [9], which integrates over all randomness. Let θ, θ' be two Θ-valued random variables. 
Over period t, the information ratio of θ' with respect to θ is defined by

Γ_t(θ; θ') = ( E[R(Y_{α(θ)}) − R(Y_{α(θ')})] )² / I(θ; (θ', Y_{α(θ')}) | H̃_{t−1}),   (1)

where the denominator is the mutual information between θ and (θ', Y_{α(θ')}), conditioned on the σ-algebra generated by H̃_{t−1}. We can interpret θ as a benchmark model parameter that the agent wants to learn and θ' as the model parameter that she selects. When Γ_t(θ; θ') is small, the agent would only incur large regret over period t if she were expected to learn a lot of information about θ.

We restate a result proven in [6], which bounds the regret of any policy in terms of the worst-case information ratio.

Proposition 1. For all T > 0 and policy π, let (θ_t)_{t=1}^T be such that α(θ_t) = π_t(H_{t−1}, ξ_t) for each t = 1, 2, ..., T. Then

BayesRegret(T; π) ≤ √( Γ̄_T · H(θ*) · T ),

where H(θ*) is the entropy of θ* and

Γ̄_T = max_{1≤t≤T} Γ_t(θ*; θ_t).

The bound given by Proposition 1 is loose in the sense that it depends implicitly on the cardinality of Θ. When Θ is large, knowing exactly what θ* is requires a lot of information. Nevertheless, because of the correlation between actions, it suffices for the agent to learn a "blurry" version of θ*, which conveys far less information, to achieve low regret. In the following section we make this argument concrete.

4 A Rate-Distortion Analysis of Thompson Sampling

In this section we develop a sharper bound for Thompson sampling. 
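As an aside, for finite Θ the information ratio (1) can be computed exactly by enumerating joint distributions. The sketch below (a toy Bernoulli instance with our own illustrative numbers, not an experiment from the paper) evaluates Γ_1(θ*; θ_1) at t = 1, where θ_1 is an independent draw from the prior, as in Thompson sampling's first step.

```python
import numpy as np

# Exact information ratio (eq. (1)) at t = 1 for a toy Bernoulli bandit
# with a finite parameter set; all numbers below are illustrative.
Theta = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # success prob per action, per theta
p = np.array([0.5, 0.3, 0.2])                           # prior over Theta
alpha = Theta.argmax(axis=1)                            # optimal action per parameter
n = len(p)

# Numerator: (E[R(Y_alpha(theta*))] - E[R(Y_alpha(theta_1))])^2, where
# theta_1 is an independent draw from the same prior.
r_star = np.sum(p * Theta[np.arange(n), alpha])
r_ts = sum(p[i] * p[j] * Theta[i, alpha[j]] for i in range(n) for j in range(n))
numerator = (r_star - r_ts) ** 2

# Denominator: I(theta*; (theta_1, Y_alpha(theta_1))), by direct enumeration
# of the joint distribution P(theta* = i, theta_1 = j, Y = y) (in nats).
denominator = 0.0
for i in range(n):
    for j in range(n):
        for y in (0, 1):
            q = Theta[i, alpha[j]] if y == 1 else 1.0 - Theta[i, alpha[j]]
            joint = p[i] * p[j] * q
            marg = p[j] * sum(p[k] * (Theta[k, alpha[j]] if y == 1 else 1.0 - Theta[k, alpha[j]])
                              for k in range(n))
            denominator += joint * np.log(joint / (p[i] * marg))

gamma_1 = numerator / denominator
print(gamma_1)
```

With two actions and rewards of range 1, the value stays below the |A|/2 level established for Thompson sampling in [8], as expected.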
At a high level, the argument relies on the existence of a statistic ψ of θ* such that:

(i) The statistic ψ is less informative than θ*;
(ii) In each period, if the agent aims to learn ψ instead of θ*, the regret incurred can be bounded in terms of the information gained about ψ; we refer to this approximate learning as "compressed Thompson sampling";
(iii) The regret of Thompson sampling is close to that of compressed Thompson sampling based on the statistic ψ, and at the same time, compressed Thompson sampling yields no more information about ψ than Thompson sampling does.

Following the above line of analysis, we can bound the regret of Thompson sampling by the mutual information between the statistic ψ and θ*. When ψ can be chosen to be far less informative than θ*, we obtain a significantly tighter bound.

To develop the argument, we first quantify the amount of distortion that we incur if we replace one parameter with another. For two parameters θ, θ' ∈ Θ, the distortion of θ with respect to θ' is defined as

d(θ, θ') = μ(α(θ'), θ') − μ(α(θ), θ').   (2)

In other words, the distortion is the price we pay if we deem θ to be the true parameter while the actual true parameter is θ'. Notice that from the definition of α, we always have d(θ, θ') ≥ 0. Let {Θ_k}_{k=1}^K be a partition of Θ, i.e. ∪_{k=1}^K Θ_k = Θ and Θ_i ∩ Θ_j = ∅ for all i ≠ j, such that

d(θ, θ') ≤ ε,  ∀θ, θ' ∈ Θ_k, k = 1, ..., K,   (3)

where ε > 0 is a positive distortion tolerance. Let ψ be the random variable taking values in {1, ..., K} that records the index of the cell in which θ* lies, i.e.

ψ = k ⇔ θ* ∈ Θ_k.   (4)

Then we have H(ψ) ≤ log K. If the structure of Θ allows for a partition with a small number of cells, ψ carries much less information than θ*. Let subscript t−1 denote corresponding values under the posterior measure P_{t−1}(·) = P(· | H̃_{t−1}); in other words, E_{t−1}[·] and I_{t−1}(·;·) are random variables that are functions of H̃_{t−1}. We claim the following.

Proposition 2. Let ψ be defined as in (4). For each t = 1, 2, ..., there exists a Θ-valued random variable θ̃*_t that satisfies the following:

(i) θ̃*_t is independent of θ*, conditioned on ψ.
(ii) E_{t−1}[R* − R(Y_{α(θ_t)})] − E_{t−1}[R(Y_{α(θ̃*_t)}) − R(Y_{α(θ̃_t)})] ≤ ε, a.s.
(iii) I_{t−1}(ψ; (θ̃_t, Y_{α(θ̃_t)})) ≤ I_{t−1}(ψ; (θ_t, Y_{α(θ_t)})), a.s.,

where in (ii) and (iii), θ̃_t is independent from and distributed identically with θ̃*_t.

According to Proposition 2, over period t, if the agent deviated from her original Thompson sampling scheme and applied a "one-step" compressed Thompson sampling to learn θ̃*_t by sampling θ̃_t, the extra regret that she would incur can be bounded (as is guaranteed by (ii)). 
Meanwhile, from (i), (iii) and the data-processing inequality, we have that

I_{t−1}(θ̃*_t; (θ̃_t, Y_{α(θ̃_t)})) ≤ I_{t−1}(ψ; (θ̃_t, Y_{α(θ̃_t)})) ≤ I_{t−1}(ψ; (θ_t, Y_{α(θ_t)})), a.s.,   (5)

which implies that the information gain of compressed Thompson sampling toward ψ will not exceed that of the original Thompson sampling. Therefore, the regret of the original Thompson sampling can be bounded in terms of the total information gain toward ψ and the worst-case information ratio of the one-step compressed Thompson sampling. Formally, we have the following.

Theorem 1. Let {Θ_k}_{k=1}^K be any partition of Θ such that for any k = 1, ..., K and θ, θ' ∈ Θ_k, d(θ, θ') ≤ ε. Let ψ be defined as in (4) and let θ̃*_t and θ̃_t satisfy the conditions in Proposition 2. We have

BayesRegret(T; π^TS) ≤ √( Γ̄ · I(θ*; ψ) · T ) + ε · T,   (6)

where

Γ̄ = max_{1≤t≤T} Γ_t(θ̃*_t; θ̃_t).

Proof. We have that

BayesRegret(T; π^TS)
  = Σ_{t=1}^T E[R* − R(Y_{A_t})]
  = Σ_{t=1}^T E{ E_{t−1}[R* − R(Y_{A_t})] }
  (a)≤ Σ_{t=1}^T E{ E_{t−1}[R(Y_{α(θ̃*_t)}) − R(Y_{α(θ̃_t)})] } + ε·T
  = Σ_{t=1}^T √( Γ_t(θ̃*_t; θ̃_t) · I(θ̃*_t; (θ̃_t, Y_{α(θ̃_t)}) | H̃_{t−1}) ) + ε·T
  (b)≤ Σ_{t=1}^T √( Γ̄ · I(ψ; (θ_t, Y_{α(θ_t)}) | H̃_{t−1}) ) + ε·T
  (c)≤ √( Γ̄ · T · Σ_{t=1}^T I(ψ; (θ_t, Y_{α(θ_t)}) | H̃_{t−1}) ) + ε·T
  (d)= √( Γ̄ · T · I(ψ; H̃_T) ) + ε·T
  (e)≤ √( Γ̄ · T · I(θ*; ψ) ) + ε·T,   (7)

where (a) follows from Proposition 2 (ii); (b) follows from (5); (c) results from the Cauchy-Schwarz inequality; (d) is the chain rule for mutual information; and (e) comes from

I(ψ; H̃_T) ≤ I(ψ; (θ*, H̃_T)) = I(ψ; θ*) + I(ψ; H̃_T | θ*) = I(ψ; θ*),

where we use the fact that ψ is independent of H̃_T, conditioned on θ*. Thence we arrive at our desired result.

Remark. The bound given in Theorem 1 dramatically improves on the bound in Proposition 1, since I(θ*; ψ) can be bounded by H(ψ), which, when Θ is large, can be much smaller than H(θ*). The new bound also characterizes the tradeoff between the preserved information I(θ*; ψ) and the distortion tolerance ε, which is the essence of rate-distortion theory. In fact, we can define the distortion between θ* and ψ as

D(θ*, ψ) = max_{1≤t≤T} esssup{ E_{t−1}[R* − R(Y_{α(θ_t)})] − E_{t−1}[R(Y_{α(θ̃*_t)}) − R(Y_{α(θ̃_t)})] },

where θ̃*_t and θ̃_t depend on ψ through Proposition 2. By taking the infimum over all possible choices of ψ, the bound (6) can be written as

BayesRegret(T; π^TS) ≤ √( Γ̄ · ρ(ε) · T ) + ε · T,  ∀ε > 0,   (8)

where

ρ(ε) = min_ψ I(θ*; ψ)  s.t. D(θ*, ψ) ≤ ε

is the rate-distortion function with respect to the distortion D.

To obtain explicit bounds for specific problem instances, we use the fact that I(θ*; ψ) ≤ H(ψ) ≤ log K. 
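To see the rate-distortion tradeoff numerically, the sketch below assumes a covering with K(ε) = (1/ε + 1)^d cells (the form a covering argument yields for the unit ball; see the next section) and a hypothetical worst-case information ratio Γ̄ = d/2, then optimizes the resulting bound over the distortion tolerance ε. The numbers are our own illustrative choices.

```python
import numpy as np

# Illustration of the tradeoff in (6)/(8): with a covering of Theta into
# K(eps) = (1/eps + 1)^d cells of distortion <= eps, the bound becomes
# sqrt(Gamma * d * log(1/eps + 1) * T) + eps * T. Gamma = d/2 is assumed
# for illustration; d, T are arbitrary toy values.
d, T = 10, 100_000
gamma = d / 2

def bound(eps):
    log_K = d * np.log(1.0 / eps + 1.0)   # I(theta*; psi) <= H(psi) <= log K
    return np.sqrt(gamma * log_K * T) + eps * T

eps_grid = np.logspace(-4, 0, 200)
vals = bound(eps_grid)
best = eps_grid[np.argmin(vals)]
print(f"best eps ~ {best:.4f}, bound ~ {vals.min():.1f}")
```

Small ε shrinks the ε·T term but inflates log K, and vice versa; the optimized value scales like d·√(T log T), matching the rates derived in the next section.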
In the following section we introduce a broad range of problems in which both K and Γ̄ can be effectively bounded.

5 Specializing to Structured Bandit Problems

We now apply the analysis of Section 4 to common bandit settings and show that our bounds are significantly sharper than previous bounds. In these models, the observation of the agent is the received reward. Hence we can let R be the identity function and use R_a as a shorthand for R(Y_a).

5.1 Linear Bandits

Linear bandits are a class of problems in which each action is parametrized by a finite-dimensional feature vector, and the mean reward of playing each action is the inner product between the feature vector and the model parameter vector. Formally, let A, Θ ⊂ ℝ^d, where d < ∞, and Y ⊆ [−1/2, 1/2]. The reward of playing action a satisfies

E[R_a | θ* = θ] = μ(a, θ) = (1/2)·aᵀθ,  ∀a ∈ A, θ ∈ Θ.

Note that we apply a normalizing factor 1/2 to make the setting consistent with our assumption that sup_y R(y) − inf_y R(y) ≤ 1.

A similar line of analysis as in [8] allows us to bound the information ratio of the one-step compressed Thompson sampling.

Proposition 3. Under the linear bandit setting, for each t = 1, 2, ..., letting θ̃*_t and θ̃_t satisfy the conditions in Proposition 2, we have

Γ_t(θ̃*_t; θ̃_t) ≤ d/2.

At the same time, with the help of a covering argument, we can also bound the number of cells required to achieve distortion tolerance ε.

Proposition 4. Under the linear bandit setting, suppose that A, Θ ⊆ B_d(0, 1), where B_d(0, 1) is the d-dimensional closed Euclidean unit ball. Then for any ε > 0 there exists a partition {Θ_k}_{k=1}^K of Θ such that for all k = 1, ..., K and θ, θ' ∈ Θ_k, we have d(θ, θ') ≤ ε and

K ≤ (1/ε + 1)^d.

Combining Theorem 1 and Propositions 3 and 4, we arrive at the following bound.

Theorem 2. Under the linear bandit setting, if A, Θ ⊆ B_d(0, 1), then

BayesRegret(T; π^TS) ≤ d·√( T·log(3 + 3√(2T)/d) ).

This is the first information-theoretic bound that does not depend on the number of available actions. It significantly improves on the bound O(√(d·T·H(A*))) in [8] and the bound O(√(|A|·T·log|A|)) in [1], in that it drops the dependence on the cardinality of the action set and imposes no assumption on the reward distribution. Compared with the confidence-set-based analysis in [7], which results in the bound O(d·√T·log T), our argument is much simpler and cleaner and yields a tighter bound. This bound suggests that Thompson sampling is near-optimal in this context, since it exceeds the minimax lower bound Ω(d·√T) proposed in [4] by only a √(log T) factor.

5.2 Generalized Linear Bandits with iid Noise

In generalized linear models, there is a fixed and strictly increasing link function φ : ℝ → [0, 1] such that

E[R_a | θ* = θ] = μ(a, θ) = φ(aᵀθ).

Let

L̲ = inf_{a∈A, θ∈Θ} aᵀθ,  L̄ = sup_{a∈A, θ∈Θ} aᵀθ.

We make the following assumptions.

Assumption 1. The reward noise is iid, i.e.

R_a = μ(a, θ*) + W_a = φ(aᵀθ*) + W_a,  ∀a ∈ A,

where W_a is a zero-mean noise term with a fixed and known distribution for all a ∈ A.

Assumption 2. 
The link function φ is continuously differentiable on [L̲, L̄], with

C(φ) = sup_{x∈[L̲,L̄]} φ'(x) < ∞.

Under these assumptions, both the information ratio of the compressed Thompson sampling and the number of cells can be bounded.

Proposition 5. Under the generalized linear bandit setting and Assumptions 1 and 2, for each t = 1, 2, ..., letting θ̃*_t and θ̃_t satisfy the conditions in Proposition 2, we have

Γ_t(θ̃*_t; θ̃_t) ≤ 2·C(φ)²·d.

Proposition 6. Under the generalized linear bandit setting and Assumption 2, suppose that A, Θ ⊆ B_d(0, 1). Then for any ε > 0 there exists a partition {Θ_k}_{k=1}^K of Θ such that for each k = 1, ..., K and θ, θ' ∈ Θ_k we have d(θ, θ') ≤ ε and

K ≤ (2C(φ)/ε + 1)^d.

Combining Theorem 1 and Propositions 5 and 6, we have the following.

Theorem 3. Under the generalized linear bandit setting and Assumptions 1 and 2, if A, Θ ⊆ B_d(0, 1), then

BayesRegret(T; π^TS) ≤ 2C(φ)·d·√( T·log(3 + 3√(2T)/d) ).

Note that the optimism-based algorithm in [5] achieves O(r·d·√T·log T) regret, and the bound for Thompson sampling given in [7] is O(r·d·√T·log^{3/2} T), where r = sup_x φ'(x)/inf_x φ'(x). Theorem 3 yields a sharper bound.

5.3 Logistic Bandits

Logistic bandits are special cases of generalized linear bandits in which the agent only observes binary rewards, i.e. Y = {0, 1}. The link function is given by φ_L(x) = e^(βx)/(1 + e^(βx)), where β ∈ (0, ∞) is a fixed and known parameter. 
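As a quick numerical illustration of why bounds scaling with the slope ratio of φ_L degrade here (the β values below are our own toy choices, not from the paper): over [−1, 1], the ratio r = sup φ'_L / inf φ'_L explodes as β grows, while the slope at any fixed x > 0 eventually decays to zero.

```python
import numpy as np

# Slope of the logistic link phi_L(x) = e^{beta x} / (1 + e^{beta x});
# illustrative computation with toy beta values.
def dphi(x, beta):
    e = np.exp(beta * x)
    return beta * e / (1.0 + e) ** 2

x = np.linspace(-1.0, 1.0, 2001)   # A, Theta in the unit ball => a^T theta in [-1, 1]
x0 = 0.25                          # a fixed positive point, standing in for a margin

for beta in (0.1, 1.0, 10.0):
    slopes = dphi(x, beta)
    r = slopes.max() / slopes.min()   # r = sup phi' / inf phi' explodes with beta
    print(f"beta={beta}: r={r:.3g}, slope at x0={dphi(x0, beta):.3g}")
```

Previous bounds scale with r, which blows up with β, whereas the factor β·e^(βδ)/(1 + e^(βδ))² appearing in Theorem 4 below vanishes as β → ∞.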
Conditioned on θ* = θ, the reward of playing action a is Bernoulli distributed with parameter φ_L(aᵀθ).

The preexisting upper bounds for logistic bandit problems all scale linearly with

r = sup_x φ'_L(x) / inf_x φ'_L(x),

which explodes when β → ∞. However, when β is large, the rewards of actions are clearly bifurcated by a hyperplane, and we expect Thompson sampling to perform better. The regret bound given by our analysis addresses this point and has a finite limit as β increases. Since the logistic bandit setting is incompatible with Assumption 1, we propose the following conjecture, which is supported with numerical evidence.

Conjecture 1. Under the logistic bandit setting, let the link function be φ_L(x) = e^(βx)/(1 + e^(βx)), and for each t = 1, 2, ..., let θ̃*_t and θ̃_t satisfy the conditions in Proposition 2. Then for all β ∈ (0, ∞),

Γ_t(θ̃*_t; θ̃_t) ≤ d/2.

To provide evidence for Conjecture 1, for each β and d we randomly generate 100 actions and parameters and compute the exact information ratio under a randomly selected distribution over the parameters. The results are given in Figure 1. As the figure shows, the simulated information ratio is always smaller than the conjectured upper bound d/2. We suspect that for every link function φ there exists an upper bound on the information ratio that depends only on d and φ and is independent of the cardinality of the parameter space. This opens an interesting topic for future research.

We further make the following assumption, which posits the existence of a classification margin that applies uniformly over θ ∈ Θ.

Figure 1: Simulated information ratio values for dimensions d = 2, 3, ..., 20 and (a) β = 0.1, (b) β = 1, (c) β = 10 and (d) β = 100. The diagonal black dashed line is the upper bound Γ = d/2.

Assumption 3. We have that inf_{θ∈Θ} |μ(α(θ), θ) − 1/2| > 0. Equivalently,

inf_{θ∈Θ} |α(θ)ᵀθ| > 0.

The following theorem gives our bound for the logistic bandit.

Theorem 4. Under the logistic bandit setting where A, Θ ⊆ B_d(0, 1), for all β > 0, if the link function is given by φ_L(x) = e^(βx)/(1 + e^(βx)), Assumption 3 holds with inf_{θ∈Θ} |α(θ)ᵀθ| = δ > 0, and Conjecture 1 holds, then for all sufficiently large T,

BayesRegret(T; π^TS) ≤ 2d·√( T·log( 3 + (3√(2T)/d) · (β·e^(βδ)/(1 + e^(βδ))²) · min{δ⁻¹, β} ) ).   (9)

For fixed d and T, when β → ∞ the right-hand side of (9) converges to 2d·√(T·log 3). Thus (9) is substantially sharper than previous bounds when β is large.

6 Conclusion

Through an analysis based on rate-distortion, we established a new information-theoretic regret bound for Thompson sampling that scales gracefully to large action spaces. Our analysis yields an O(d·√(T log T)) regret bound for the linear bandit problem, which strengthens state-of-the-art bounds. The same regret bound applies to the logistic bandit problem if a conjecture about the information ratio that agrees with computational results holds. 
We expect that our new line of analysis applies to a wide range of online decision algorithms.

Acknowledgments

This work was supported by a grant from the Boeing Corporation and the Herb and Jane Dwight Stanford Graduate Fellowship. We would also like to thank Daniel Russo, David Tse and Xiuyuan Lu for useful conversations.

References

[1] Shipra Agrawal and Navin Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):30, 2017.
[2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
[3] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[4] Varsha Dani, Thomas P. Hayes, and Sham M. Kakade. Stochastic linear optimization under bandit feedback. In 21st Annual Conference on Learning Theory, pages 355–366, 2008.
[5] Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pages 2071–2080, 2017.
[6] Daniel Russo and Benjamin Van Roy. Learning to optimize via information-directed sampling. In Advances in Neural Information Processing Systems, pages 1583–1591, 2014.
[7] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
[8] Daniel Russo and Benjamin Van Roy. An information-theoretic analysis of Thompson sampling. The Journal of Machine Learning Research, 17(1):2442–2471, 2016.
[9] Daniel Russo and Benjamin Van Roy. Satisficing in time-sensitive bandit learning. arXiv preprint arXiv:1803.02855, 2018.
[10] Daniel J. Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, and Zheng Wen. A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1):1–96, 2018.
[11] William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
", "award": [], "sourceid": 2053, "authors": [{"given_name": "Shi", "family_name": "Dong", "institution": "Stanford University"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}