{"title": "Ensemble Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 3258, "page_last": 3266, "abstract": "Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for simple special cases. This paper develops ensemble sampling, which aims to approximate Thompson sampling while maintaining tractability even in the face of complex models such as neural networks. Ensemble sampling dramatically expands on the range of applications for which Thompson sampling is viable. We establish a theoretical basis that supports the approach and present computational results that offer further insight.", "full_text": "Ensemble Sampling\n\nXiuyuan Lu\n\nStanford University\nlxy@stanford.edu\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nThompson sampling has emerged as an effective heuristic for a broad range of\nonline decision problems. In its basic form, the algorithm requires computing\nand sampling from a posterior distribution over models, which is tractable only\nfor simple special cases. This paper develops ensemble sampling, which aims to\napproximate Thompson sampling while maintaining tractability even in the face\nof complex models such as neural networks. Ensemble sampling dramatically\nexpands on the range of applications for which Thompson sampling is viable. We\nestablish a theoretical basis that supports the approach and present computational\nresults that offer further insight.\n\n1\n\nIntroduction\n\nThompson sampling [8] has emerged as an effective heuristic for trading off between exploration\nand exploitation in a broad range of online decision problems. 
To select an action, the algorithm samples a model of the system from the prevailing posterior distribution and then determines which action maximizes expected immediate reward according to the sampled model. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for simple special cases.

With complex models such as neural networks, exact computation of posterior distributions becomes intractable. One can resort to the Laplace approximation, as discussed, for example, in [2, 5], but this approach is suitable only when posterior distributions are unimodal, and computations become an obstacle with complex models like neural networks because compute time requirements grow quadratically with the number of parameters. An alternative is to leverage Markov chain Monte Carlo methods, but those are computationally onerous, especially when the model is complex.

A practical approximation to Thompson sampling that can address complex models and problems requiring frequent decisions should facilitate fast incremental updating. That is, the time required per time period to learn from new data and generate a new sample model should be small and should not grow with time. Such a fast incremental method that builds on the Laplace approximation concept is presented in [5]. In this paper, we study a fast incremental method that applies more broadly, without relying on unimodality. As a sanity check, we offer theoretical assurances that apply to the special case of linear bandits. We also present computational results involving simple bandit problems as well as complex neural network models that demonstrate efficacy of the approach.

Our approach is inspired by [6], which applies a similar concept to the more complex context of deep reinforcement learning, but without any theoretical analysis.
The essential idea is to maintain and incrementally update an ensemble of statistically plausible models, and to sample uniformly from this set in each time period as an approximation to sampling from the posterior distribution. Each model is initially sampled from the prior, and then updated in a manner that incorporates data and random perturbations that diversify the models. The intention is for the ensemble to approximate the posterior distribution and the variance among models to diminish as the posterior concentrates. We refine this methodology and bound the incremental regret relative to exact Thompson sampling for a broad class of online decision problems. Our bound indicates that it suffices to maintain a number of models that grows only logarithmically with the horizon of the decision problem, ensuring computational tractability of the approach.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Problem formulation

We consider a broad class of online decision problems to which Thompson sampling could, in principle, be applied, though that would typically be hindered by intractable computational requirements. We will define random variables with respect to a probability space (Ω, F, P) endowed with a filtration (Ft : t = 0, . . . , T). As a convention, random variables we index by t will be Ft-measurable, and we use Pt and Et to denote probabilities and expectations conditioned on Ft. The decision-maker chooses actions A0, . . . , AT−1 ∈ A and observes outcomes Y1, . . . , YT ∈ Y. There is a random variable θ, which represents a model index. Conditioned on (θ, At−1), Yt is independent of Ft−1. Further, P(Yt = y|θ, At−1) does not depend on t.
This can be thought of as a Bayesian formulation, where randomness in θ reflects prior uncertainty about which model corresponds to the true nature of the system.

We assume that A is finite and that each action At is chosen by a randomized policy π = (π0, . . . , πT−1). Each πt is Ft-measurable, and each realization is a probability mass function over actions A; At is sampled independently from πt.

The agent associates a reward R(y) with each outcome y ∈ Y, where the reward function R is fixed and known. Let Rt = R(Yt) denote the reward realized at time t. Let Rθ(a) = E[R(Yt)|θ, At−1 = a]. Uncertainty about θ induces uncertainty about the true optimal action, which we denote by A∗ ∈ arg max_{a∈A} Rθ(a). Let R∗ = Rθ(A∗). The T-period conditional regret when the actions (A0, . . . , AT−1) are chosen according to π is defined by

$$\mathrm{Regret}(T, \pi, \theta) = \mathbb{E}\left[\sum_{t=1}^{T} (R^* - R_t) \,\middle|\, \theta\right], \tag{1}$$

where the expectation is taken over the randomness in actions At and outcomes Yt, conditioned on θ. We illustrate with a couple of examples that fit our formulation.

Example 1. (linear bandit) Let θ be drawn from R^N and distributed according to a N(μ0, Σ0) prior. There is a set of K actions A ⊆ R^N. At each time t = 0, 1, . . . , T − 1, an action At ∈ A is selected, after which a reward $R_{t+1} = Y_{t+1} = \theta^\top A_t + W_{t+1}$ is observed, where $W_{t+1} \sim N(0, \sigma_w^2)$.

Example 2. (neural network) Let gθ : R^N → R^K denote a mapping induced by a neural network with weights θ. Suppose there are K actions A ⊆ R^N, which serve as inputs to the neural network, and the goal is to select inputs that yield desirable outputs. At each time t = 0, 1, . . . , T − 1, an action At ∈ A is selected, after which $Y_{t+1} = g_\theta(A_t) + W_{t+1}$ is observed, where $W_{t+1} \sim N(0, \sigma_w^2 I)$. A reward Rt+1 = R(Yt+1) is associated with each observation. Let θ be distributed according to a N(μ0, Σ0) prior. The idea here is that data pairs (At, Yt+1) can be used to fit a neural network model, while actions are selected to trade off between generating data pairs that reduce uncertainty in neural network weights and those that offer desirable immediate outcomes.

3 Algorithms

Thompson sampling offers a heuristic policy for selecting actions. In each time period, the algorithm samples an action from the posterior distribution pt(a) = Pt(A∗ = a) of the optimal action. In other words, Thompson sampling uses a policy πt = pt. It is easy to see that this is equivalent to sampling a model index $\hat\theta_t$ from the posterior distribution of models and then selecting an action $A_t = \arg\max_{a\in\mathcal{A}} R_{\hat\theta_t}(a)$ that optimizes the sampled model.

Thompson sampling is computationally tractable for some problem classes, like the linear bandit problem, where the posterior distribution is Gaussian with parameters (μt, Σt) that can be updated incrementally and efficiently via Kalman filtering as outcomes are observed. However, when dealing with complex models, like neural networks, computing the posterior distribution becomes intractable. Ensemble sampling serves as an approximation to Thompson sampling for such contexts.

Algorithm 1 EnsembleSampling
1: Sample: $\tilde\theta_{0,1}, \dots, \tilde\theta_{0,M} \sim p_0$
2: for t = 0, . . . , T − 1 do
3:   Sample: m ∼ unif({1, . . . , M})
4:   Act: $A_t = \arg\max_{a\in\mathcal{A}} R_{\tilde\theta_{t,m}}(a)$
5:   Observe: $Y_{t+1}$
6:   Update: $\tilde\theta_{t+1,1}, \dots, \tilde\theta_{t+1,M}$
7: end for

The posterior can be interpreted as a distribution of "statistically plausible" models, by which we mean models that are sufficiently consistent with prior beliefs and the history of observations. With this interpretation in mind, Thompson sampling can be thought of as randomly drawing from the range of statistically plausible models. Ensemble sampling aims to maintain, incrementally update, and sample from a finite set of such models. In the spirit of particle filtering, this set of models approximates the posterior distribution. The workings of ensemble sampling are in some ways more intricate than conventional uses of particle filtering, however, because interactions between the ensemble of models and selected actions can skew the distribution.

While elements of ensemble sampling require customization, a general template is presented as Algorithm 1. The algorithm begins by sampling M models from the prior distribution. Then, over each time period, a model is sampled uniformly from the ensemble, an action is selected to maximize expected reward under the sampled model, the resulting outcome is observed, and each of the M models is updated. To produce an explicit algorithm, we must specify a model class, prior distribution, and algorithms for sampling from the prior and updating models.

For a concrete illustration, let us consider the linear bandit (Example 1). Though ensemble sampling is unwarranted in this case, since Thompson sampling is efficient, the linear bandit serves as a useful context for understanding the approach. Standard algorithms can be used to sample models from the N(μ0, Σ0) prior.
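To make the template concrete, the loop of Algorithm 1 can be sketched in code as follows. This is an illustrative skeleton, not the authors' implementation; the callables `sample_prior`, `reward_fn`, `update`, and `env` are hypothetical interfaces standing in for the model class, the mean-reward function Rθ(a), the model update rule, and the environment.

```python
import random

def ensemble_sampling(sample_prior, reward_fn, update, env, actions, M, T):
    """Generic ensemble sampling loop in the spirit of Algorithm 1."""
    models = [sample_prior() for _ in range(M)]   # M models drawn from the prior
    for t in range(T):
        m = random.randrange(M)                   # sample a model uniformly at random
        a = max(actions, key=lambda act: reward_fn(models[m], act))  # greedy action
        y = env(a)                                # observe the resulting outcome
        models = [update(mod, a, y) for mod in models]  # update every model
    return models
```

For the linear bandit of Example 1, `update` would be the perturbed incremental update described in the text; for the neural network of Example 2, it would be a few stochastic gradient steps per model.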
One possible procedure for updating models maintains a covariance matrix, updating it according to

$$\Sigma_{t+1} = \left(\Sigma_t^{-1} + A_t A_t^\top / \sigma_w^2\right)^{-1},$$

and generates model parameters incrementally according to

$$\tilde\theta_{t+1,m} = \Sigma_{t+1}\left(\Sigma_t^{-1}\tilde\theta_{t,m} + A_t (R_{t+1} + \tilde W_{t+1,m}) / \sigma_w^2\right),$$

for m = 1, . . . , M, where $(\tilde W_{t,m} : t = 1, \dots, T,\ m = 1, \dots, M)$ are independent $N(0, \sigma_w^2)$ random samples drawn by the updating algorithm. It is easy to show that the resulting parameter vectors satisfy

$$\tilde\theta_{t,m} = \arg\min_{\nu} \left( \frac{1}{\sigma_w^2} \sum_{\tau=0}^{t-1} (R_{\tau+1} + \tilde W_{\tau+1,m} - A_\tau^\top \nu)^2 + (\nu - \tilde\theta_{0,m})^\top \Sigma_0^{-1} (\nu - \tilde\theta_{0,m}) \right),$$

which admits an intuitive interpretation: each $\tilde\theta_{t,m}$ is a model fit to a randomly perturbed prior and randomly perturbed observations. As we establish in the appendix, for any deterministic sequence A0, . . . , At−1, conditioned on Ft, the models $\tilde\theta_{t,1}, \dots, \tilde\theta_{t,M}$ are independent and identically distributed according to the posterior distribution of θ. In this sense, the ensemble approximates the posterior. It is not a new observation that, for deterministic action sequences, such a scheme generates exact samples of the posterior distribution (see, e.g., [7]). However, for stochastic action sequences selected by Algorithm 1, it is not immediately clear how well the ensemble approximates the posterior distribution.
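A minimal NumPy sketch of this perturbed incremental update, assuming the model parameters are stored as an (M, N) array sharing one covariance matrix (the array layout and function name are our own, not from the paper):

```python
import numpy as np

def perturbed_update(theta, Sigma, a, r, sigma_w, rng):
    """One incremental ensemble update for the linear bandit.

    theta: (M, N) array, row m holds model parameters theta~_{t,m}
    Sigma: (N, N) covariance Sigma_t shared across models
    a: (N,) selected action A_t;  r: observed reward R_{t+1}
    Returns (theta_{t+1}, Sigma_{t+1}).
    """
    M, _ = theta.shape
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_next = np.linalg.inv(Sigma_inv + np.outer(a, a) / sigma_w**2)
    w_tilde = rng.normal(0.0, sigma_w, size=M)        # independent perturbation per model
    rhs = theta @ Sigma_inv.T + np.outer(r + w_tilde, a) / sigma_w**2
    return rhs @ Sigma_next.T, Sigma_next             # left-multiply each row by Sigma_{t+1}
```

Note that the covariance update is shared, so the per-period cost grows with M only through the cheap per-model vector operations.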
We will provide a bound in the next section which establishes that, as the number of models M increases, the regret of ensemble sampling quickly approaches that of Thompson sampling.

The ensemble sampling algorithm we have described for the linear bandit problem motivates an analogous approach for the neural network model of Example 2. This approach would again begin with M models, with connection weights $\tilde\theta_{0,1}, \dots, \tilde\theta_{0,M}$ sampled from a N(μ0, Σ0) prior. It could be natural here to let μ0 = 0 and Σ0 = σ0²I for some variance σ0² chosen so that the range of probable models spans plausible outcomes. To incrementally update parameters, at each time t, each model m applies some number of stochastic gradient descent iterations to reduce a loss function of the form

$$\mathcal{L}_t(\nu) = \frac{1}{\sigma_w^2} \sum_{\tau=0}^{t-1} (Y_{\tau+1} + \tilde W_{\tau+1,m} - g_\nu(A_\tau))^2 + (\nu - \tilde\theta_{0,m})^\top \Sigma_0^{-1} (\nu - \tilde\theta_{0,m}).$$

We present computational results in Section 5.2 that demonstrate viability of this approach.

4 Analysis of ensemble sampling for the linear bandit

Past analyses of Thompson sampling have relied on independence between models sampled over time periods. Ensemble sampling introduces dependencies that may adversely impact performance. It is not immediately clear whether the degree of degradation should be tolerable and how that depends on the number of models in the ensemble. In this section, we establish a bound for the linear bandit context. Our result serves as a sanity check for ensemble sampling and offers insight that should extend to broader model classes, though we leave formal analysis beyond the linear bandit for future work.

Consider the linear bandit problem described in Example 1.
Let π^TS and π^ES denote the Thompson and ensemble sampling policies for this problem, with the latter based on an ensemble of M models, generated and updated according to the procedure described in Section 3. Let $R_* = \min_{a\in\mathcal{A}} \theta^\top a$ denote the worst mean reward and let $\Delta(\theta) = R^* - R_*$ denote the gap between maximal and minimal mean rewards. The following result bounds the difference in regret as a function of the gap, ensemble size, and number of actions.

Theorem 3. For all ε > 0, if

$$M \ge \frac{4|\mathcal{A}|}{\epsilon^2} \log \frac{4|\mathcal{A}| T}{\epsilon^3},$$

then

$$\mathrm{Regret}(T, \pi^{ES}, \theta) \le \mathrm{Regret}(T, \pi^{TS}, \theta) + \epsilon \Delta(\theta) T.$$

This inequality bounds the regret realized by ensemble sampling by a sum of the regret realized by Thompson sampling and an error term εΔ(θ)T. Since we are talking about cumulative regret, the error term bounds the per-period degradation relative to Thompson sampling by εΔ(θ). The value of ε can be made arbitrarily small by increasing M. Hence, with a sufficiently large ensemble, the per-period loss will be small. This supports the viability of ensemble sampling.

An important implication of this result is that it suffices for the ensemble size to grow logarithmically in the horizon T. Since Thompson sampling requires independence between models sampled over time, in a sense, it relies on T models – one per time period. So to be useful, ensemble sampling should operate effectively with a much smaller number, and the logarithmic dependence is suitable. The bound also grows with |A| log |A|, which is manageable when there are a modest number of actions.
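For intuition about the constants, the ensemble size prescribed by Theorem 3 is easy to tabulate. The helper below uses the bound exactly as stated; the function name is ours.

```python
import math

def sufficient_ensemble_size(eps, num_actions, horizon):
    """Smallest integer M with M >= (4|A|/eps^2) * log(4|A|T / eps^3), per Theorem 3."""
    return math.ceil(4 * num_actions / eps**2
                     * math.log(4 * num_actions * horizon / eps**3))
```

As the horizon grows, the required M increases only logarithmically, consistent with the discussion above; shrinking ε is far more expensive, since M scales as 1/ε².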
We conjecture that a similar bound holds that depends instead on a multiple of N log N, where N is the linear dimension, which would offer a stronger guarantee when the number of actions becomes large or infinite, though we leave proof of this alternative bound for future work.

The bound of Theorem 3 is on a notion of regret conditioned on the realization of θ. A Bayesian regret bound that removes dependence on this realization can be obtained by taking an expectation, integrating over θ:

$$\mathbb{E}\left[\mathrm{Regret}(T, \pi^{ES}, \theta)\right] \le \mathbb{E}\left[\mathrm{Regret}(T, \pi^{TS}, \theta)\right] + \epsilon\, \mathbb{E}[\Delta(\theta)]\, T.$$

We provide a complete proof of Theorem 3 in the appendix. Due to space constraints, we only offer a sketch here.

Sketch of Proof. Let A denote an Ft-adapted action process (A0, . . . , AT−1). Our procedure for generating and updating models with ensemble sampling is designed so that, for any deterministic A, conditioned on the history of rewards (R1, . . . , Rt), the models $\tilde\theta_{t,1}, \dots, \tilde\theta_{t,M}$ that comprise the ensemble are independent and identically distributed according to the posterior distribution of θ. This can be verified via some algebra, as is done in the appendix.

Recall that pt(a) denotes the posterior probability Pt(A∗ = a) = P(A∗ = a|A0, R1, . . . , At−1, Rt). To explicitly indicate dependence on the action process, we will use a superscript: $p_t(a) = p^A_t(a)$. Let $\hat p^A_t$ denote an approximation to $p^A_t$, given by

$$\hat p^A_t(a) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}\left(a = \arg\max_{a'\in\mathcal{A}} \tilde\theta_{t,m}^\top a'\right).$$

Note that given an action process A, at time t Thompson sampling would sample the next action from $p^A_t$, while ensemble sampling would sample the next action from $\hat p^A_t$. If A is deterministic then, since $\tilde\theta_{t,1}, \dots, \tilde\theta_{t,M}$, conditioned on the history of rewards, are i.i.d. and distributed as θ, $\hat p^A_t$ represents an empirical distribution of samples drawn from $p^A_t$. It follows from this and Sanov's Theorem that, for any deterministic A,

$$\mathbb{P}\left(d_{\mathrm{KL}}(\hat p^A_t \,\|\, p^A_t) \ge \epsilon \,\middle|\, \theta\right) \le (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$

A naive application of the union bound over all deterministic action sequences would establish that, for any A (deterministic or stochastic),

$$\mathbb{P}\left(d_{\mathrm{KL}}(\hat p^A_t \,\|\, p^A_t) \ge \epsilon \,\middle|\, \theta\right) \le \mathbb{P}\left(\max_{a\in\mathcal{A}^t} d_{\mathrm{KL}}(\hat p^a_t \,\|\, p^a_t) \ge \epsilon \,\middle|\, \theta\right) \le |\mathcal{A}|^t (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$

However, our proof takes advantage of the fact that, for any deterministic A, $p^A_t$ and $\hat p^A_t$ do not depend on the ordering of past actions and observations. To make this precise, we encode the sequence of actions in terms of action counts c0, . . . , cT−1. In particular, let ct,a = |{τ ≤ t : Aτ = a}| be the number of times that action a has been selected by time t. We apply a coupling argument that introduces dependencies between the noise terms Wt and action counts, without changing the distributions of any observable variables. We let (Zn,a : n ∈ N, a ∈ A) be i.i.d. N(0, 1) random variables, and let $W_{t+1} = Z_{c_{t,A_t}, A_t}$. Similarly, we let (Z̃n,a,m : n ∈ N, a ∈ A, m = 1, . . . , M) be i.i.d. N(0, 1) random variables, and let $\tilde W_{t+1,m} = \tilde Z_{c_{t,A_t}, A_t, m}$. To make explicit the dependence on A, we will use a superscript and write $c^A_t$ to denote the action counts at time t when the action process is given by A. It is not hard to verify, as is done in the appendix, that if a, a′ ∈ A^T are two deterministic action sequences such that $c^a_{t-1} = c^{a'}_{t-1}$, then $p^a_t = p^{a'}_t$ and $\hat p^a_t = \hat p^{a'}_t$. This allows us to apply the union bound over action counts, instead of action sequences, and we get that for any A (deterministic or stochastic),

$$\mathbb{P}\left(d_{\mathrm{KL}}(\hat p^A_t \,\|\, p^A_t) \ge \epsilon \,\middle|\, \theta\right) \le \mathbb{P}\left(\max_{c^a_{t-1} :\, a \in \mathcal{A}^t} d_{\mathrm{KL}}(\hat p^a_t \,\|\, p^a_t) \ge \epsilon \,\middle|\, \theta\right) \le (t+1)^{|\mathcal{A}|} (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$

Now, we specialize the action process A to the action sequence $A_t = A^{ES}_t$ selected by ensemble sampling, and we will omit the superscripts in $p^A_t$ and $\hat p^A_t$. We can decompose the per-period regret of ensemble sampling as

$$\mathbb{E}[R^* - \theta^\top A_t \,|\, \theta] = \mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}(d_{\mathrm{KL}}(\hat p_t \| p_t) \ge \epsilon) \,|\, \theta\right] + \mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}(d_{\mathrm{KL}}(\hat p_t \| p_t) < \epsilon) \,|\, \theta\right]. \tag{2}$$

The first term can be bounded by

$$\mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}(d_{\mathrm{KL}}(\hat p_t \| p_t) \ge \epsilon) \,|\, \theta\right] \le \Delta(\theta)\, \mathbb{P}(d_{\mathrm{KL}}(\hat p_t \| p_t) \ge \epsilon \,|\, \theta) \le \Delta(\theta)(t+1)^{|\mathcal{A}|}(M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$

To bound the second term, we will use another coupling argument that couples the actions that would be selected by ensemble sampling with those that would be selected by Thompson sampling. Let $A^{TS}_t$ denote the action that Thompson sampling would select at time t. On $\{d_{\mathrm{KL}}(\hat p_t \| p_t) \le \epsilon\}$, we have $\|\hat p_t - p_t\|_{TV} \le \sqrt{2\epsilon}$ by Pinsker's inequality. Conditioning on $\hat p_t$ and $p_t$, if $d_{\mathrm{KL}}(\hat p_t \| p_t) \le \epsilon$, we can construct random variables $\tilde A^{ES}_t$ and $\tilde A^{TS}_t$ such that they have the same distributions as $A^{ES}_t$ and $A^{TS}_t$, respectively. Using maximal coupling, we can make $\tilde A^{ES}_t = \tilde A^{TS}_t$ with probability at least $1 - \frac{1}{2}\|\hat p_t - p_t\|_{TV} \ge 1 - \sqrt{\epsilon/2}$. Then, the second term of the sum in (2) can be decomposed into

$$\mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}(d_{\mathrm{KL}}(\hat p_t \| p_t) \le \epsilon) \,|\, \theta\right] = \mathbb{E}\left[\mathbb{E}\left[(R^* - \theta^\top \tilde A^{ES}_t)\, \mathbb{I}\left(d_{\mathrm{KL}}(\hat p_t \| p_t) \le \epsilon,\, \tilde A^{ES}_t = \tilde A^{TS}_t\right) \,\middle|\, \hat p_t, p_t, \theta\right] \,\middle|\, \theta\right] + \mathbb{E}\left[\mathbb{E}\left[(R^* - \theta^\top \tilde A^{ES}_t)\, \mathbb{I}\left(d_{\mathrm{KL}}(\hat p_t \| p_t) \le \epsilon,\, \tilde A^{ES}_t \ne \tilde A^{TS}_t\right) \,\middle|\, \hat p_t, p_t, \theta\right] \,\middle|\, \theta\right],$$

which, after some algebraic manipulations, leads to

$$\mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}(d_{\mathrm{KL}}(\hat p_t \| p_t) < \epsilon) \,|\, \theta\right] \le \mathbb{E}\left[R^* - \theta^\top A^{TS}_t \,|\, \theta\right] + \sqrt{\epsilon/2}\, \Delta(\theta).$$

The result then follows from some straightforward algebra.

5 Computational results

In this section, we present computational results that demonstrate viability of ensemble sampling. We will start with a simple case of independent Gaussian bandits in Section 5.1 and move on to more complex models of neural networks in Section 5.2. Section 5.1 serves as a sanity check for the empirical performance of ensemble sampling, as Thompson sampling can be efficiently applied in this case and we are able to compare the performances of these two algorithms.
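The sanity check of Section 5.1 can be sketched end to end: for independent Gaussian arms, both the exact posterior (used by Thompson sampling) and the perturbed least-squares ensemble estimates have closed forms. The code below is our own illustrative simulation under that setup (unit prior and noise variances), not the authors' experiment code.

```python
import numpy as np

def gaussian_bandit_regret(K=10, T=500, M=10, seed=0):
    """Cumulative regret of Thompson sampling vs. ensemble sampling on a
    Gaussian bandit with K independent arms, theta_k ~ N(0, 1), unit noise."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=K)                 # true mean rewards
    best = theta.max()

    # Thompson sampling: exact per-arm posterior N(s/(1+n), 1/(1+n)).
    n_ts = np.zeros(K); s_ts = np.zeros(K)
    # Ensemble sampling: per-model arm estimate (theta0 + perturbed sum)/(1+n).
    theta0 = rng.normal(size=(M, K))           # prior samples theta~_{0,m}
    pert = np.zeros((M, K))                    # per-model sums of perturbed rewards
    n_es = np.zeros(K)
    reg_ts = reg_es = 0.0

    for _ in range(T):
        # --- Thompson sampling step ---
        draw = s_ts / (1 + n_ts) + rng.normal(size=K) / np.sqrt(1 + n_ts)
        a = int(draw.argmax())
        r = theta[a] + rng.normal()
        n_ts[a] += 1; s_ts[a] += r; reg_ts += best - theta[a]

        # --- ensemble sampling step ---
        m = int(rng.integers(M))               # pick one model uniformly
        vals = (theta0 + pert) / (1 + n_es)    # every model's arm estimates
        a = int(vals[m].argmax())
        r = theta[a] + rng.normal()
        n_es[a] += 1; reg_es += best - theta[a]
        pert[:, a] += r + rng.normal(size=M)   # each model adds its own noise
    return reg_ts, reg_es
```

Averaging such runs over many realizations and sweeping M reproduces the qualitative behavior reported in Figure 1: larger ensembles track Thompson sampling more closely.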
In addition, we provide simulation results that demonstrate how the ensemble size grows with the number of actions. Section 5.2 goes beyond our theoretical analysis in Section 4 and gives computational evidence of the efficacy of ensemble sampling when applied to more complex models such as neural networks. We show that ensemble sampling, even with a few models, achieves efficient learning and outperforms ε-greedy and dropout on the example neural networks.

5.1 Gaussian bandits with independent arms

We consider a Gaussian bandit with K actions, where action k has mean reward θk. Each θk is drawn i.i.d. from N(0, 1). During each time step t = 0, . . . , T − 1, we select an action k ∈ {1, . . . , K} and observe reward Rt+1 = θk + Wt+1, where Wt+1 ∼ N(0, 1). Note that this is a special case of Example 1. Since the posterior distribution of θ can be explicitly computed in this case, we use it as a sanity check for the performance of ensemble sampling.

Figure 1a shows the per-period regret of Thompson sampling and ensemble sampling applied to a Gaussian bandit with 50 independent arms. We see that as the number of models increases, ensemble sampling better approximates Thompson sampling. The results were averaged over 2,000 realizations.

Figure 1b shows the minimum number of models required so that the expected per-period regret of ensemble sampling is no more than ε plus the expected per-period regret of Thompson sampling at some large time horizon T across different numbers of actions. All results are averaged over 10,000 realizations. We chose T = 2000 and ε = 0.03. The plot shows that the number of models needed seems to grow sublinearly with the number of actions, which is stronger than the bound proved in Section 4.

Figure 1: (a) Ensemble sampling compared with Thompson sampling on a Gaussian bandit with 50 independent arms.
(b) Minimum number of models required so that the expected per-period regret of ensemble sampling is no more than ε = 0.03 plus the expected per-period regret of Thompson sampling at T = 2000 for Gaussian bandits across different numbers of arms.

5.2 Neural networks

In this section, we follow Example 2 and show computational results of ensemble sampling applied to neural networks. Figure 2 shows ε-greedy and ensemble sampling applied to a bandit problem where the mapping from actions to expected rewards is represented by a neuron. More specifically, we have a set of K actions A ⊆ R^N. The mean reward of selecting an action a ∈ A is given by $g_\theta(a) = \max(0, \theta^\top a)$, where weights θ ∈ R^N are drawn from N(0, λI). During each time period, we select an action At ∈ A and observe reward Rt+1 = gθ(At) + Zt+1, where Zt+1 ∼ N(0, σz²). We set the input dimension N = 100, number of actions K = 100, prior variance λ = 10, and noise variance σz² = 100. Each dimension of each action was sampled uniformly from [−1, 1], except for the last dimension, which was set to 1.

In Figure 3, we consider a bandit problem where the mapping from actions to expected rewards is represented by a two-layer neural network with weights θ ≡ (W1, W2), where W1 ∈ R^{D×N} and W2 ∈ R^D. Each entry of the weight matrices is drawn independently from N(0, λ). There is a set of K actions A ⊆ R^N.
The mean reward of choosing an action a ∈ A is $g_\theta(a) = W_2^\top \max(0, W_1 a)$. During each time period, we select an action At ∈ A and observe reward Rt+1 = gθ(At) + Zt+1, where Zt+1 ∼ N(0, σz²). We used N = 100 for the input dimension, D = 50 for the dimension of the hidden layer, number of actions K = 100, prior variance λ = 1, and noise variance σz² = 100. Each dimension of each action was sampled uniformly from [−1, 1], except for the last dimension, which was set to 1.

Ensemble sampling with M models starts by sampling $\tilde\theta_m$ from the prior distribution independently for each model m. At each time step, we pick a model m uniformly at random and apply the greedy action with respect to that model. We update the ensemble incrementally. During each time period, we apply a few steps of stochastic gradient descent for each model m with respect to the loss function

$$\mathcal{L}_t(\theta) = \frac{1}{\sigma_z^2} \sum_{\tau=0}^{t-1} (R_{\tau+1} + \tilde Z_{\tau+1,m} - g_\theta(A_\tau))^2 + \frac{1}{\lambda} \|\theta - \tilde\theta_m\|_2^2,$$

where perturbations $(\tilde Z_{t,m} : t = 1, \dots, T,\ m = 1, \dots, M)$ are drawn i.i.d. from N(0, σz²).

Besides ensemble sampling, there are other heuristics for sampling from an approximate posterior distribution over neural networks, which may be used to develop approximate Thompson sampling. Gal and Ghahramani proposed an approach based on dropout [4] to approximately sample from a posterior over neural networks. In Figure 3, we include results from using dropout to approximate Thompson sampling on the two-layer neural network bandit.

To facilitate gradient flow, we used leaky ReLUs of the form max(0.01x, x) internally in all agents, while the target neural nets still use regular ReLUs as described above. We took 3 stochastic gradient steps with a minibatch size of 64 for each model update.
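One minibatch SGD step on the perturbed loss Lt can be sketched for the single-neuron agent, using the leaky-ReLU gradient mentioned above. This is an illustrative NumPy sketch under our own naming, not the authors' code; `R_pert` is assumed to hold past rewards already perturbed with this model's noise Z̃.

```python
import numpy as np

def sgd_step(theta, theta0_m, A, R_pert, sigma_z2, lam, lr, rng, batch=64):
    """One minibatch SGD step on L_t for ensemble member m (single-neuron model).

    theta: (N,) current weights; theta0_m: (N,) this model's prior sample
    A: (t, N) past actions; R_pert: (t,) perturbed past rewards for model m
    """
    t = len(R_pert)
    idx = rng.integers(t, size=min(batch, t))     # sample a minibatch of past steps
    x = A[idx] @ theta
    g = np.where(x > 0, x, 0.01 * x)              # leaky-ReLU forward pass
    dg = np.where(x > 0, 1.0, 0.01)               # its derivative
    resid = g - R_pert[idx]
    # Minibatch estimate of the sum over tau (rescaled by t) plus regularizer gradient.
    grad = (2.0 / sigma_z2) * t * (A[idx] * (resid * dg)[:, None]).mean(axis=0)
    grad += (2.0 / lam) * (theta - theta0_m)
    return theta - lr * grad
```

Repeating a few such steps per model per time period keeps the update cost constant over time, which is the incremental property motivated in the introduction.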
We used a learning rate of 1e-1 for ε-greedy and ensemble sampling, and a learning rate of 1e-2, 1e-2, 2e-2, and 5e-2 for dropout with dropping probabilities 0.25, 0.5, 0.75, and 0.9, respectively. All results were averaged over around 1,000 realizations.

Figure 2 plots the per-period regret of ε-greedy and ensemble sampling on the single neuron bandit. We see that ensemble sampling, even with 10 models, performs better than ε-greedy with the best tuned parameters. Increasing the size of the ensemble further improves the performance. An ensemble of size 50 achieves orders of magnitude lower regret than ε-greedy.

Figures 3a and 3b show different versions of ε-greedy applied to the two-layer neural network model. We see that ε-greedy with an annealing schedule tends to perform better than a fixed ε. Figure 3c plots the per-period regret of the dropout approach with different dropping probabilities, which seems to perform worse than ε-greedy. Figure 3d plots the per-period regret of ensemble sampling on the neural net bandit. Again, we see that ensemble sampling, with a moderate number of models, outperforms the other approaches by a significant amount.

6 Conclusion

Ensemble sampling offers a potentially efficient means to approximate Thompson sampling when using complex models such as neural networks. We have provided an analysis that offers theoretical assurances for the case of linear bandit models and computational results that demonstrate efficacy with complex neural network models.

We are motivated largely by the need for effective exploration methods that can efficiently be applied in conjunction with complex models such as neural networks.
Ensemble sampling offers one approach to representing uncertainty in neural network models, and there are others that might also be brought to bear in developing approximate versions of Thompson sampling [1, 4]. The analysis of various other forms of approximate Thompson sampling remains open.

Figure 2: (a) ε-greedy and (b) ensemble sampling applied to a single neuron bandit.

Figure 3: (a) Fixed ε-greedy, (b) annealing ε-greedy, (c) dropout, and (d) ensemble sampling applied to a two-layer neural network bandit.

Ensemble sampling loosely relates to ensemble learning methods [3], though an important difference in motivation lies in the fact that the latter learns multiple models for the purpose of generating a more accurate model through their combination, while the former learns multiple models to reflect uncertainty in the posterior distribution over models. That said, combining the two related approaches may be fruitful. In particular, there may be practical benefit to learning many forms of models (neural networks, tree-based models, etc.)
and viewing the ensemble as representing uncertainty from which one can sample.

Acknowledgments

This work was generously supported by a research grant from Boeing and a Marketing Research Award from Adobe.

References

[1] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15, pages 1613–1622. JMLR.org, 2015.

[2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.

[3] Thomas G. Dietterich. Ensemble learning. The Handbook of Brain Theory and Neural Networks, 2:110–125, 2002.

[4] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York, USA, 20–22 Jun 2016. PMLR.

[5] Carlos Gómez-Uribe.
Online algorithms for parameter mean and variance estimation in dynamic regression. arXiv preprint arXiv:1605.05697v1, 2016.

[6] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4026–4034. Curran Associates, Inc., 2016.

[7] George Papandreou and Alan L. Yuille. Gaussian sampling by local perturbations. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1858–1866. Curran Associates, Inc., 2010.

[8] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
", "award": [], "sourceid": 1854, "authors": [{"given_name": "Xiuyuan", "family_name": "Lu", "institution": "Stanford University"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}