{"title": "Learning to Optimize via Information-Directed Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 1583, "page_last": 1591, "abstract": "We propose information-directed sampling -- a new algorithm for online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between the square of expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. For the widely studied Bernoulli and linear bandit models, we demonstrate simulation performance surpassing popular approaches, including upper confidence bound algorithms, Thompson sampling, and knowledge gradient. Further, we present simple analytic examples illustrating that information-directed sampling can dramatically outperform upper confidence bound algorithms and Thompson sampling due to the way it measures information gain.", "full_text": "Learning to Optimize via\n\nInformation-Directed Sampling\n\nDaniel Russo\n\nStanford University\nStanford, CA 94305\n\nBenjamin Van Roy\nStanford University\nStanford, CA 94305\n\ndjrusso@stanford.edu\n\nbvr@stanford.edu\n\nAbstract\n\nWe propose information-directed sampling \u2013 a new algorithm for online optimiza-\ntion problems in which a decision-maker must balance between exploration and\nexploitation while learning from partial feedback. 
Each action is sampled in a\nmanner that minimizes the ratio between the square of expected single-period\nregret and a measure of information gain: the mutual information between the\noptimal action and the next observation.\nWe establish an expected regret bound for information-directed sampling that ap-\nplies across a very general class of models and scales with the entropy of the opti-\nmal action distribution. For the widely studied Bernoulli and linear bandit models,\nwe demonstrate simulation performance surpassing popular approaches, including\nupper con\ufb01dence bound algorithms, Thompson sampling, and knowledge gradi-\nent. Further, we present simple analytic examples illustrating that information-\ndirected sampling can dramatically outperform upper con\ufb01dence bound algo-\nrithms and Thompson sampling due to the way it measures information gain.\n\n1\n\nIntroduction\n\nThere has been signi\ufb01cant recent interest in extending multi-armed bandit techniques to address\nproblems with more complex information structures, in which sampling one action can inform the\ndecision-maker\u2019s assessment of other actions. Effective algorithms must take advantage of the in-\nformation structure to learn more ef\ufb01ciently. Recent work has extended popular algorithms for\nthe classical multi-armed bandit problem, such as upper con\ufb01dence bound (UCB) algorithms and\nThompson sampling, to address such contexts.\n\nFor some cases, such as classical and linear bandit problems, strong performance guarantees have\nbeen established for UCB algorithms (e.g. [4, 8, 9, 13, 21, 23, 29]) and Thompson sampling (e.g. [1,\n15, 19, 24]). However, as we will demonstrate through simple analytic examples, these algorithms\ncan perform very poorly when faced with more complex information structures. 
The shortcoming lies in the fact that these algorithms do not adequately assess the information gain from selecting an action.\n\nIn this paper, we propose a new algorithm \u2013 information-directed sampling (IDS) \u2013 that preserves numerous guarantees of Thompson sampling for problems with simple information structures while offering strong performance in the face of more complex problems that daunt alternatives like Thompson sampling or UCB algorithms. IDS quanti\ufb01es the amount learned by selecting an action through an information theoretic measure: the mutual information between the true optimal action and the next observation. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and this measure of information gain.\n\nAs we will show through simple analytic examples, the way in which IDS assesses information gain allows it to dramatically outperform UCB algorithms and Thompson sampling. Further, we establish an expected regret bound for IDS that applies across a very general class of models and scales with the entropy of the optimal action distribution. We then specialize this bound to several widely studied problem classes. Finally, we benchmark the performance of IDS through simulations of the widely studied Bernoulli and linear bandit problems, for which UCB algorithms and Thompson sampling are known to be very effective. We \ufb01nd that even in these settings, IDS outperforms UCB algorithms, Thompson sampling, and knowledge gradient.\n\nIDS solves a single-period optimization problem as a proxy to an intractable multi-period problem. Solution of this single-period problem can itself be computationally demanding, especially in cases where the number of actions is enormous or mutual information is dif\ufb01cult to evaluate. To carry out computational experiments, we develop numerical methods for particular classes of online optimization problems. 
More broadly, we feel this work provides a compelling proof of concept and\nhope that our development and analysis of IDS facilitate the future design of ef\ufb01cient algorithms\nthat capture its bene\ufb01ts.\n\nRelated literature. Two other papers [17, 30] have used the mutual information between the op-\ntimal action and the next observation to guide action selection. Both focus on the optimization of\nexpensive-to-evaluate, black-box functions. Each proposes sampling points so as to maximize the\nmutual information between the algorithm\u2019s next observation and the true optimizer. Several fea-\ntures distinguish our work. First, these papers focus on pure exploration problems: the objective is\nsimply to learn about the optimum \u2013 not to attain high cumulative reward. Second, and more im-\nportantly, they focus only on problems with Gaussian process priors and continuous action spaces.\nFor such problems, simpler approaches like UCB algorithms, Probability of Improvement, and Ex-\npected Improvement are already extremely effective (See [6]). By contrast, a major motivation of\nour work is that a richer information measure is needed in order to address problems with more\ncomplicated information structures. Finally, we provide a variety of general theoretical guarantees\nfor IDS, whereas Villemonteix et al. [30] and Hennig and Schuler [17] propose their algorithms only\nas heuristics. The full-length version of this paper [26] shows our theoretical guarantees extend to\npure exploration problems.\n\nThe knowledge gradient (KG) algorithm uses a different measure of information to guide action\nselection: the algorithm computes the impact of a single observation on the quality of the decision\nmade by a greedy algorithm, which simply selects the action with highest posterior expected reward.\nThis measure has been thoroughly studied (see e.g. [22, 27]). KG seems natural since it explicitly\nseeks information that improves decision quality. 
Computational studies suggest that for problems with Gaussian priors, Gaussian rewards, and relatively short time horizons, KG performs very well. However, even in some simple settings, KG may not converge to optimality. In fact, it may select a suboptimal action in every period, even as the time horizon tends to in\ufb01nity.\n\nOur work also connects to a much larger literature on Bayesian experimental design (see [10] for a review). Recent work has demonstrated the effectiveness of greedy or myopic policies that always maximize the information gain from the next sample. Jedynak et al. [18] consider problem settings in which this greedy policy is optimal. Another recent line of work [14] shows that information gain based objectives sometimes satisfy a decreasing returns property known as adaptive sub-modularity, implying the greedy policy is competitive with the optimal policy. Our algorithm also considers only the information gain due to the next sample, even though the goal is to acquire information over many periods. Our results establish that the manner in which IDS encourages information gain leads to an effective algorithm, even for the different objective of maximizing cumulative reward.\n\n2 Problem formulation\n\nWe consider a general probabilistic, or Bayesian, formulation in which uncertain quantities are modeled as random variables. The decision\u2013maker sequentially chooses actions (At)t\u2208N from the \ufb01nite action set A and observes the corresponding outcomes (Yt(At))t\u2208N. There is a random outcome Yt(a) \u2208 Y associated with each a \u2208 A and time t \u2208 N. Let Yt \u2261 (Yt(a))a\u2208A be the vector of outcomes at time t \u2208 N. The \u201ctrue outcome distribution\u201d p\u2217 is a distribution over Y^{|A|} that is itself randomly drawn from the family of distributions P. We assume that, conditioned on p\u2217, (Yt)t\u2208N is an iid sequence with each element Yt distributed according to p\u2217. 
Let p\u2217_a be the marginal distribution corresponding to Yt(a).\n\nThe agent associates a reward R(y) with each outcome y \u2208 Y, where the reward function R : Y \u2192 R is \ufb01xed and known. We assume R(y) \u2212 R(y\u2032) \u2264 1 for any y, y\u2032 \u2208 Y. Uncertainty about p\u2217 induces uncertainty about the true optimal action, which we denote by A\u2217 \u2208 arg max_{a\u2208A} E_{y\u223cp\u2217_a}[R(y)]. The T period regret is the random variable,\n\nRegret(T ) := \u2211_{t=1}^{T} [R(Yt(A\u2217)) \u2212 R(Yt(At))] ,\n\n(1)\n\nwhich measures the cumulative difference between the reward earned by an algorithm that always chooses the optimal action, and actual accumulated reward up to time T . In this paper we study expected regret E [Regret(T )], where the expectation is taken over the randomness in the actions At and the outcomes Yt, and over the prior distribution over p\u2217. This measure of performance is sometimes called Bayesian regret or Bayes risk.\n\nRandomized policies. We de\ufb01ne all random variables with respect to a probability space (\u2126, F, P). Fix the \ufb01ltration (Ft)t\u2208N where Ft\u22121 \u2282 F is the sigma\u2013algebra generated by the history of observations (A1, Y1(A1), ..., At\u22121, Yt\u22121(At\u22121)). Actions are chosen based on the history of past observations, and possibly some external source of randomness.1 It is useful to think of the actions as being chosen by a randomized policy \u03c0, which is an Ft\u2013predictable sequence (\u03c0t)t\u2208N. An action is chosen at time t by randomizing according to \u03c0t(\u00b7) = P(At = \u00b7|Ft\u22121), which speci\ufb01es a probability distribution over A. We denote the set of probability distributions over A by D(A). We explicitly display the dependence of regret on the policy \u03c0, letting E [Regret(T, \u03c0)] denote the expected value of (1) when the actions (A1, ..., AT ) are chosen according to \u03c0.\n\nFurther notation. 
We set \u03b1t(a) = P (A\u2217 = a|Ft\u22121) to be the posterior distribution of A\u2217. For a probability distribution P over a \ufb01nite set X, the Shannon entropy of P is de\ufb01ned as H(P ) = \u2212\u2211_{x\u2208X} P (x) log P (x). For two probability measures P and Q over a common measurable space, if P is absolutely continuous with respect to Q, the Kullback-Leibler divergence between P and Q is\n\nDKL(P ||Q) = \u222b log(dP/dQ) dP ,\n\n(2)\n\nwhere dP/dQ is the Radon\u2013Nikodym derivative of P with respect to Q. The mutual information under the posterior distribution between random variables X1 : \u2126 \u2192 X1 and X2 : \u2126 \u2192 X2, denoted by\n\nIt(X1; X2) := DKL (P ((X1, X2) \u2208 \u00b7|Ft\u22121) || P (X1 \u2208 \u00b7|Ft\u22121) P (X2 \u2208 \u00b7|Ft\u22121)) ,\n\n(3)\n\nis the Kullback-Leibler divergence between the joint posterior distribution of X1 and X2 and the product of the marginal distributions. Note that It(X1; X2) is a random variable because of its dependence on the conditional probability measure P (\u00b7|Ft\u22121).\n\nTo simplify notation, we de\ufb01ne the information gain from an action a to be gt(a) := It(A\u2217; Yt(a)). As shown for example in Lemma 5.5.6 of Gray [16], this is equal to the expected reduction in entropy of the posterior distribution of A\u2217 due to observing Yt(a):\n\ngt(a) = E [H(\u03b1t) \u2212 H(\u03b1t+1)|Ft\u22121, At = a] ,\n\n(4)\n\nwhich plays a crucial role in our results. Let \u2206t(a) := E [R(Yt(A\u2217)) \u2212 R(Yt(a))|Ft\u22121] denote the expected instantaneous regret of action a at time t. We overload the notation gt(\u00b7) and \u2206t(\u00b7). For \u03c0 \u2208 D(A), de\ufb01ne gt(\u03c0) = \u2211_{a\u2208A} \u03c0(a)gt(a) and \u2206t(\u03c0) = \u2211_{a\u2208A} \u03c0(a)\u2206t(a).\n\n3 Information-directed sampling\n\nIDS explicitly balances between having low expected regret in the current period and acquiring new information about which action is optimal. 
It does this by minimizing over all action sampling distributions \u03c0 \u2208 D(A) the ratio between the square of expected regret \u2206t(\u03c0)^2 and information gain gt(\u03c0) about the optimal action A\u2217. In particular, the policy \u03c0IDS = (\u03c0IDS_1, \u03c0IDS_2, ...) is de\ufb01ned by:\n\n\u03c0IDS_t \u2208 arg min_{\u03c0\u2208D(A)} { \u03a8t(\u03c0) := \u2206t(\u03c0)^2 / gt(\u03c0) } .\n\n(5)\n\n1Formally, At is measurable with respect to the sigma\u2013algebra generated by (Ft\u22121, \u03bet) where (\u03bet)t\u2208N are random variables representing this external source of randomness, and are jointly independent of p\u2217 and (Yt)t\u2208N.\n\nWe call \u03a8t(\u03c0) the information ratio of a sampling distribution \u03c0 and \u03a8\u2217_t = min_\u03c0 \u03a8t(\u03c0) = \u03a8t(\u03c0IDS_t) the minimal information ratio. Each roughly measures the \u201ccost\u201d per bit of information acquired.\n\nOptimization problem. Suppose that there are K = |A| actions, and that the posterior expected regret and information gain are stored in the vectors \u2206 \u2208 R^K_+ and g \u2208 R^K_+. Assume g \u2260 0, so that the optimal action is not known with certainty. The optimization problem (5) can be written as\n\nminimize \u03a8(\u03c0) := (\u03c0T \u2206)^2 / \u03c0T g subject to \u03c0T e = 1, \u03c0 \u2265 0.\n\n(6)\n\nThe following result shows this is a convex optimization problem, and surprisingly, has an optimal solution with only two non-zero components. Therefore, while IDS is a randomized policy, it randomizes over at most two actions. Algorithm 1, presented in the supplementary material, solves (6) by looping over all pairs of actions, and solving a one dimensional convex optimization problem.\n\nProposition 1. 
The function \u03a8 : \u03c0 \u21a6 (\u03c0T \u2206)^2 / \u03c0T g is convex on {\u03c0 \u2208 R^K | \u03c0T g > 0}. Moreover, there is an optimal solution \u03c0\u2217 to (6) with |{i : \u03c0\u2217_i > 0}| \u2264 2.\n\n4 Regret bounds\n\nThis section establishes regret bounds for IDS that scale with the entropy of the optimal action distribution. The next proposition shows that bounds on a policy\u2019s information ratio imply bounds on expected regret. We then provide several bounds on the information ratio of IDS.\n\nProposition 2. Fix a deterministic \u03bb \u2208 R and a policy \u03c0 = (\u03c01, \u03c02, ...) such that \u03a8t(\u03c0t) \u2264 \u03bb almost surely for each t \u2208 {1, ..., T}. Then, E [Regret(\u03c0, T )] \u2264 \u221a(\u03bbH(\u03b11)T ).\n\nBounds on the information ratio. We establish upper bounds on the minimal information ratio \u03a8\u2217_t = \u03a8t(\u03c0IDS_t) in several important settings. These bounds show that, in any period, the algorithm\u2019s expected regret can only be large if it is expected to acquire a lot of information about which action is optimal. It effectively balances between exploration and exploitation in every period.\n\nThe proofs of these bounds essentially follow from a very recent analysis of Thompson sampling, and the implied regret bounds are the same as those established for Thompson sampling. In particular, since \u03a8\u2217_t \u2264 \u03a8t(\u03c0TS) where \u03c0TS is the Thompson sampling policy, it is enough to bound \u03a8t(\u03c0TS). Several such bounds were provided by Russo and Van Roy [25].2 While the analysis is similar in the cases considered here, IDS outperforms Thompson sampling in simulation, and, as we will highlight in the next section, is sometimes provably much more informationally ef\ufb01cient.\n\nWe brie\ufb02y describe each of these bounds below and then provide a more complete discussion for linear bandit problems. 
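Proposition 1 licenses a simple exact solver for the IDS distribution: scan all action pairs and, for each pair, minimize the one-dimensional convex information ratio over the mixing weight, as Algorithm 1 in the supplementary material does. A minimal sketch of that scheme (the function names and the ternary-search tolerance are our choices for illustration, not from the paper):

```python
import itertools

def information_ratio(q, d1, d2, g1, g2):
    """Psi for playing action 1 w.p. q and action 2 w.p. 1-q."""
    delta = q * d1 + (1 - q) * d2
    gain = q * g1 + (1 - q) * g2
    return float("inf") if gain <= 0 else delta ** 2 / gain

def ids_distribution(deltas, gains, tol=1e-9):
    """Minimize (pi.delta)^2 / (pi.gain) over the simplex by scanning
    action pairs (an optimal pi mixes at most two actions).  The inner
    problem is convex in the mixing weight q, so ternary search suffices."""
    K = len(deltas)
    best_val, best_pi = float("inf"), None
    for i, j in itertools.combinations_with_replacement(range(K), 2):
        lo, hi = 0.0, 1.0
        while hi - lo > tol:  # ternary search on a convex function
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            if information_ratio(m1, deltas[i], deltas[j], gains[i], gains[j]) \
                    <= information_ratio(m2, deltas[i], deltas[j], gains[i], gains[j]):
                hi = m2
            else:
                lo = m1
        q = (lo + hi) / 2
        val = information_ratio(q, deltas[i], deltas[j], gains[i], gains[j])
        if val < best_val:
            pi = [0.0] * K
            pi[i] += q
            pi[j] += 1 - q
            best_val, best_pi = val, pi
    return best_val, best_pi
```

For two actions with regrets (1.0, 0.5) and gains (1.0, 0.1), the minimizer mixes both actions (weight 7/9 on the first), beating either pure action, which illustrates why IDS randomizes.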
For each of the other cases, more formal propositions, their proofs, and a discussion of lower bounds can be found in the supplementary material or the full version of this paper [26].\n\nFinite action space: With no additional assumption, we show \u03a8\u2217_t \u2264 |A|/2.\n\nLinear bandit: Each action is associated with a d dimensional feature vector, and the mean reward generated by an action is the inner product between its known feature vector and some unknown parameter vector. We show \u03a8\u2217_t \u2264 d/2.\n\nFull information: Upon choosing an action, the agent observes the reward she would have received had she chosen any other action. We show \u03a8\u2217_t \u2264 1/2.\n\nCombinatorial action sets: At time t, project i \u2208 {1, ..., d} yields a random reward \u03b8t,i, and the reward from selecting a subset of projects a \u2208 A \u2282 {a\u2032 \u2282 {0, 1, ..., d} : |a\u2032| \u2264 m} is m^{\u22121} \u2211_{i\u2208a} \u03b8t,i. The outcome of each selected project (\u03b8t,i : i \u2208 a) is observed, which is sometimes called \u201csemi\u2013bandit\u201d feedback [3]. We show \u03a8\u2217_t \u2264 d/(2m^2).\n\n2\u03a8t(\u03c0TS) is exactly equal to the term \u0393^2_t that is bounded in [25].\n\nLinear optimization under bandit feedback. The stochastic linear bandit problem has been widely studied (e.g. [13, 23]) and is one of the most important examples of a multi-armed bandit problem with \u201ccorrelated arms.\u201d In this setting, each action is associated with a \ufb01nite dimensional feature vector, and the mean reward generated by an action is the inner product between its known feature vector and some unknown parameter vector. The next result bounds \u03a8\u2217_t for such problems.\n\nProposition 3. If A \u2282 R^d and for each p \u2208 P there exists \u03b8p \u2208 R^d such that for all a \u2208 A, E_{y\u223cp_a}[R(y)] = aT \u03b8p, then for all t \u2208 N, \u03a8\u2217_t \u2264 d/2 almost surely.\n\nThis result shows that E[Regret(T, \u03c0IDS)] \u2264 \u221a((1/2)H(\u03b11)dT ) \u2264 \u221a((1/2) log(|A|)dT ) for linear bandit problems. Dani et al. [12] show this bound is order optimal, in the sense that for any time horizon T and dimension d, if the action set is A = {0, 1}^d, there exists a prior distribution over p\u2217 such that inf_\u03c0 E [Regret(T, \u03c0)] \u2265 c0\u221a(log(|A|)dT ), where c0 is a constant that is independent of d and T. The bound here improves upon this worst case bound since H(\u03b11) can be much smaller than log(|A|).\n\n5 Beyond UCB and Thompson sampling\n\nUpper con\ufb01dence bound algorithms (UCB) and Thompson sampling are two of the most popular approaches to balancing between exploration and exploitation. In some cases, these algorithms are empirically effective, and have strong theoretical guarantees. But we will show that, because they do not quantify the information provided by sampling actions, they can be grossly suboptimal in other cases. We demonstrate this through two examples \u2013 each designed to be simple and transparent. To set the stage for our discussion, we now introduce UCB algorithms and Thompson sampling.\n\nThompson sampling. The Thompson sampling algorithm simply samples actions according to the posterior probability that they are optimal. In particular, actions are chosen randomly at time t according to the sampling distribution \u03c0TS_t = \u03b1t. By de\ufb01nition, this means that for each a \u2208 A, P(At = a|Ft\u22121) = P(A\u2217 = a|Ft\u22121) = \u03b1t(a). This algorithm is sometimes called probability matching because the action selection distribution is matched to the posterior distribution of the optimal action. 
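In code, probability matching reduces to a single posterior draw per arm; a minimal Beta-Bernoulli sketch (our illustration of the general scheme, not an implementation from the paper):

```python
import random

def thompson_step(successes, failures):
    """One round of probability matching for a Beta-Bernoulli bandit:
    draw a mean reward for each arm from its Beta posterior (uniform
    Beta(1,1) prior assumed) and play the argmax.  This is equivalent
    to sampling A_t from alpha_t, the posterior law of A*."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)
```

With lopsided counts such as `successes=[100, 0]`, `failures=[0, 100]`, the first arm is selected with overwhelming probability, since its posterior concentrates near 1.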
Note that Thompson sampling draws actions only from the support of the posterior distribution of A\u2217. That is, it never selects an action a if P (A\u2217 = a) = 0. Put differently, this implies that it only selects actions that are optimal under some p \u2208 P.\n\nUCB algorithms. UCB algorithms select actions through two steps. First, for each action a \u2208 A an upper con\ufb01dence bound Bt(a) is constructed. Then, an action At \u2208 arg max_{a\u2208A} Bt(a) with maximal upper con\ufb01dence bound is chosen. Roughly, Bt(a) represents the greatest mean reward value that is statistically plausible. In particular, Bt(a) is typically constructed so that Bt(a) \u2192 E_{y\u223cp\u2217_a}[R(y)] as data about action a accumulates, but with high probability E_{y\u223cp\u2217_a}[R(y)] \u2264 Bt(a).\n\nLike Thompson sampling, many UCB algorithms only select actions that are optimal under some p \u2208 P. Consider an algorithm that constructs at each time t a con\ufb01dence set Pt \u2282 P containing the set of distributions that are statistically plausible given observed data (e.g. [13]). Upper con\ufb01dence bounds are then set to be the highest expected reward attainable under one of the plausible distributions:\n\nBt(a) = max_{p\u2208Pt} E_{y\u223cp_a}[R(y)] .\n\nAny action At \u2208 arg max_a Bt(a) must be optimal under one of the outcome distributions p \u2208 Pt. An alternative method involves choosing Bt(a) to be a particular quantile of the posterior distribution of the action\u2019s mean reward under p\u2217 [20]. In each of the examples we construct, such an algorithm chooses actions from the support of A\u2217 unless the quantiles are so low that max_{a\u2208A} Bt(a) < E [R(Yt(A\u2217))].\n\n5.1 Example: sparse linear bandits\n\nConsider a linear bandit problem where A \u2282 R^d and the reward from an action a \u2208 A is aT \u03b8\u2217. The true parameter \u03b8\u2217 is known to be drawn uniformly at random from the set of 1\u2013sparse vectors \u0398 = {\u03b8 \u2208 {0, 1}^d : \u2016\u03b8\u2016_0 = 1}. For simplicity, assume d = 2^m for some m \u2208 N. The action set is taken to be the set of vectors in {0, 1}^d normalized to be a unit vector in the L1 norm: A = {x/\u2016x\u2016_1 : x \u2208 {0, 1}^d, x \u2260 0}. We will show that the expected number of time steps for Thompson sampling (or a UCB algorithm) to identify the optimal action grows linearly with d, whereas IDS requires only log2(d) time steps.\n\nWhen an action a is selected and y = aT \u03b8\u2217 \u2208 {0, 1/\u2016a\u2016_0} is observed, each \u03b8 \u2208 \u0398 with aT \u03b8 \u2260 y is ruled out. Let \u0398t denote the parameters in \u0398 that are consistent with the observations up to time t and let It = {i \u2208 {1, ..., d} : \u03b8i = 1, \u03b8 \u2208 \u0398t} be the set of possible positive components.\n\nFor this problem, A\u2217 = \u03b8\u2217. That is, if \u03b8\u2217 were known, the optimal action would be to choose the action \u03b8\u2217. 
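The log2(d) count for IDS, versus order d for strategies restricted to the support of A\u2217, is easy to check with a toy simulation of the halving strategy (our construction; the reward is collapsed to the single bit of which half of the candidate set contains the positive component):

```python
import math

def binary_search_rounds(d, star):
    """Count the observations an IDS-style binary search needs to pin
    down the support of a 1-sparse theta* in {0,1}^d (toy model of the
    sparse linear bandit example).  Each round plays the action spread
    uniformly over half of the remaining candidate components; the
    observed reward reveals which half contains the positive entry."""
    candidates = list(range(d))
    rounds = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        rounds += 1
        # reward is positive iff theta* puts its mass inside `half`
        candidates = half if star in half else candidates[len(candidates) // 2:]
    return rounds

# identifies theta* in log2(d) rounds, versus up to d - 1 one-hot probes
assert binary_search_rounds(64, star=17) == int(math.log2(64))
```

By contrast, probing one coordinate at a time eliminates a single candidate per zero reward, so on average about d/2 rounds pass before the positive component is found.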
Thompson sampling and UCB algorithms only choose actions from the support of A\u2217 and therefore will only sample actions a \u2208 A that have only a single positive component. Unless that is also the positive component of \u03b8\u2217, the algorithm will observe a reward of zero and rule out only one possible value for \u03b8\u2217. The algorithm may require d samples to identify the optimal action.\n\nConsider an application of IDS to this problem. It essentially performs binary search: it selects a \u2208 A with ai > 0 for half of the components i \u2208 It and ai = 0 for the other half as well as for any i \u2209 It. After just log2(d) time steps the true support of \u03b8\u2217 is identi\ufb01ed.\n\nTo see why this is the case, \ufb01rst note that all parameters in \u0398t are equally likely and hence the expected reward of an action a is (1/|It|) \u2211_{i\u2208It} ai. Since ai \u2265 0 and \u2211_i ai = 1 for each a \u2208 A, every action whose positive components are in It yields the highest possible expected reward of 1/|It|. Therefore, binary search minimizes expected regret in period t for this problem. At the same time, binary search is assured to rule out half of the parameter vectors in \u0398t at each time t. This is the largest possible expected reduction, and also leads to the largest possible information gain about A\u2217. Since binary search both minimizes expected regret in period t and uniquely maximizes expected information gain in period t, it is the sampling strategy followed by IDS.\n\n5.2 Example: recommending products to a customer of unknown type\n\nConsider the problem of repeatedly recommending an assortment of products to a customer. The customer has unknown type c\u2217 \u2208 C where |C| = n. Each product is geared toward customers of a particular type, and the assortment a \u2208 A = C^m of m products offered is characterized by the vector of product types a = (c1, ..., cm). 
We model customer responses through a random utility model in which customers are a priori more likely to derive high value from a product geared toward their type. When offered an assortment of products a, the customer associates with the ith product utility U^(t)_{ci}(a) = \u03b2 1{ai = c} + W^(t)_{ci}, where W^(t)_{ci} follows an extreme\u2013value distribution and \u03b2 \u2208 R is a known constant. This is a standard multinomial logit discrete choice model. The probability a customer of type c chooses product i is given by exp{\u03b2 1{ai = c}} / \u2211_{j=1}^{m} exp{\u03b2 1{aj = c}}. When an assortment a is offered at time t, the customer makes a choice It = arg max_i U^(t)_{ci}(a) and leaves a review U^(t)_{cIt}(a) indicating the utility derived from the product, both of which are observed by the recommendation system. The system\u2019s reward is the normalized utility of the customer, (1/\u03b2) U^(t)_{cIt}(a).\n\nIf the type c\u2217 of the customer were known, then the optimal recommendation would be A\u2217 = (c\u2217, c\u2217, ..., c\u2217), which consists only of products targeted at the customer\u2019s type. Therefore, both Thompson sampling and UCB algorithms would only offer assortments consisting of a single type of product. Because of this, each type of algorithm requires order n samples to learn the customer\u2019s true type. IDS will instead offer a diverse assortment of products to the customer, allowing it to learn much more quickly.\n\nTo make the presentation more transparent, suppose that c\u2217 is drawn uniformly at random from C and consider the behavior of each type of algorithm in the limiting case where \u03b2 \u2192 \u221e. 
In this regime, the probability a customer chooses a product of type c\u2217 if it is available tends to 1, and the review U^(t)_{cIt}(a) tends to 1{aIt = c\u2217}, an indicator for whether the chosen product had type c\u2217. The initial assortment offered by IDS will consist of m different and previously untested product types. Such an assortment maximizes both the algorithm\u2019s expected reward in the next period and the algorithm\u2019s information gain, since it has the highest probability of containing a product of type c\u2217. The customer\u2019s response almost perfectly indicates whether one of those items was of type c\u2217. The algorithm continues offering assortments containing m unique, untested product types until a review near U^(t)_{cIt}(a) \u2248 1 is received. With extremely high probability, this takes at most \u2308n/m\u2309 time periods. By diversifying the m products in the assortment, the algorithm learns m times faster.\n\n6 Computational experiments\n\nSection 5 showed that, for some complicated information structures, popular approaches like UCB algorithms and Thompson sampling are provably outperformed by IDS. Our computational experiments focus instead on simpler settings where these algorithms are extremely effective. We \ufb01nd that even for these widely studied settings, IDS displays performance exceeding state of the art. For each experiment, the algorithm used to implement IDS is presented in Appendix C.\n\nMean-based IDS. Some of our numerical experiments use an approximate form of IDS that is suitable for some problems with bandit feedback, satis\ufb01es our regret bounds for such problems, and can sometimes facilitate design of more ef\ufb01cient numerical methods. More details can be found in the appendix, or in the full version of this paper [26].\n\nBeta-Bernoulli experiment. Our \ufb01rst experiment involves a multi-armed bandit problem with independent arms. 
The action ai \u2208 {a1, ..., aK} yields in each time period a reward that is 1 with probability \u03b8i and 0 otherwise. The \u03b8i are drawn independently from Beta(1, 1), which is the uniform distribution. Figure 1a presents the results of 1000 independent trials of an experiment with 10 arms and a time horizon of 1000. We compare IDS to six other algorithms, and \ufb01nd that it has the lowest average regret of 18.16. Our results indicate that the variation of IDS \u03c0IDSME presented in Section 6 has extremely similar performance to standard IDS for this problem.\n\n[Figure 1: Cumulative regret vs. time period. (a) Binary rewards, comparing Knowledge Gradient, IDS, Mean-based IDS, Thompson Sampling, Bayes UCB, UCB Tuned, MOSS, and KL UCB; (b) Asymptotic performance, comparing IDS, Thompson Sampling, Bayes UCB, and the lower bound.]\n\nIn this experiment, the famous UCB1 algorithm of Auer et al. [4] had average regret 131.3, which is dramatically larger than that of IDS. For this reason UCB1 is omitted from Figure 1a. The con\ufb01dence bounds of UCB1 are constructed to facilitate theoretical analysis. For practical performance Auer et al. [4] proposed using a heuristic algorithm called UCB-Tuned. The MOSS algorithm of Audibert and Bubeck [2] is similar to UCB1 and UCB\u2013Tuned, but uses slightly different con\ufb01dence bounds. It is known to satisfy regret bounds for this problem that are minimax optimal up to a constant factor.\n\nIn previous numerical experiments [11, 19, 20, 28], Thompson sampling and Bayes UCB exhibited state-of-the-art performance for this problem. Unsurprisingly, they are the closest competitors to IDS. The Bayes UCB algorithm, studied in Kaufmann et al. [20], uses upper con\ufb01dence bounds at time step t that are the 1 \u2212 1/t quantile of the posterior distribution of each action.3\n\nThe knowledge gradient (KG) policy of Ryzhov et al. [27] uses the one\u2013step value of information to incentivize exploration. However, for this problem, KG does not explore suf\ufb01ciently to identify the optimal arm, and therefore its expected regret grows linearly with time. It should be noted that KG is particularly poorly suited to problems with discrete observations and long time horizons. It can perform very well in other types of experiments.\n\nAsymptotic optimality. That IDS outperforms Bayes UCB and Thompson sampling in our last experiment is particularly surprising, as each of these algorithms is known, in a sense we will soon formalize, to be asymptotically optimal for these problems. We now present simulation results over a much longer time horizon that suggest IDS scales in the same asymptotically optimal way.\n\n3Their theoretical guarantees require choosing a somewhat higher quantile, but the authors suggest choosing this quantile, and use it in their own numerical experiments.\n\nThe seminal work of Lai and Robbins [21] provides the following asymptotic frequentist lower bound on the regret of any policy \u03c0. When applied with an independent uniform prior over \u03b8, both Bayes UCB and Thompson sampling are known to attain this frequentist lower bound [19, 20]:\n\nlim inf_{T\u2192\u221e} E [Regret(T, \u03c0)|\u03b8] / log T \u2265 \u2211_{a\u2260A\u2217} (\u03b8A\u2217 \u2212 \u03b8a) / DKL(\u03b8a || \u03b8A\u2217) := c(\u03b8).\n\nOur next numerical experiment \ufb01xes a problem with three actions and with \u03b8 = (.3, .2, .1). We compare algorithms over 10,000 time periods. Due to the computational expense of this experiment, we only ran 200 independent trials. Each algorithm uses a uniform prior over \u03b8. 
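The constant c(\u03b8) in the Lai\u2013Robbins bound is straightforward to evaluate for Bernoulli arms; a quick sketch for the \u03b8 = (.3, .2, .1) instance above (our code, using the standard Lai\u2013Robbins orientation DKL(\u03b8a || \u03b8A\u2217) of the divergence):

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def lai_robbins_constant(theta):
    """c(theta) in the asymptotic lower bound c(theta) * log(T):
    sum over suboptimal arms of gap / KL(arm dist, optimal dist)."""
    star = max(theta)
    return sum((star - t) / bernoulli_kl(t, star) for t in theta if t != star)
```

For theta = (.3, .2, .1) this evaluates to roughly 5.6, so asymptotically optimal algorithms should accumulate about 5.6 log(T) regret on this instance.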
Our results, along with the asymptotic lower bound of c(θ) log(T), are presented in Figure 1b.

Linear bandit problems. Our final numerical experiment treats a linear bandit problem. Each action a ∈ R^5 is defined by a 5-dimensional feature vector. The reward of action a at time t is a^T θ + ε_t, where θ ∼ N(0, 10I) is drawn from a multivariate Gaussian prior distribution, and ε_t ∼ N(0, 1) is independent Gaussian noise. In each period, only the reward of the selected action is observed. In our experiment, the action set A contains 30 actions, each with features drawn uniformly at random from [−1/√5, 1/√5]. The results displayed in Figure 2 are averaged over 1000 independent trials.

[Figure 2: Regret in linear–Gaussian model. Cumulative regret for Bayes UCB, knowledge gradient, Thompson sampling, mean-based IDS, GP-UCB, and GP-UCB Tuned.]

We compare the regret of five algorithms. Three of these – GP-UCB, Thompson sampling, and IDS – satisfy strong regret bounds for this problem.4 Both GP-UCB and Thompson sampling are significantly outperformed by IDS. Bayes UCB [20] and a version of GP-UCB that was tuned to minimize its average regret are each competitive with IDS. These algorithms are heuristics, in the sense that their confidence bounds differ significantly from those of linear UCB algorithms known to satisfy theoretical guarantees.

7 Conclusion

This paper has proposed information-directed sampling – a new algorithm for balancing between exploration and exploitation. We establish a general regret bound for the algorithm, and specialize this bound to several widely studied classes of online optimization problems.
We show that the way in which IDS assesses information gain allows it to dramatically outperform UCB algorithms and Thompson sampling in some settings. Finally, for two simple and widely studied classes of multi-armed bandit problems, we demonstrate state-of-the-art performance in simulation experiments. In these ways, we feel this work provides a compelling proof of concept.

Many important open questions remain, however. IDS solves a single-period optimization problem as a proxy to an intractable multi-period problem. Solution of this single-period problem can itself be computationally demanding, especially in cases where the number of actions is enormous or mutual information is difficult to evaluate. An important direction for future research concerns the development of computationally elegant procedures to implement IDS in important cases. Even when the algorithm cannot be directly implemented, however, one may hope to develop simple algorithms that capture its main benefits. Proposition 2 shows that any algorithm with small information ratio satisfies strong regret bounds. Thompson sampling is a very tractable algorithm that, we conjecture, sometimes has nearly minimal information ratio. Perhaps simple schemes with small information ratio could be developed for other important problem classes, like the sparse linear bandit problem.

4 Regret analysis of GP-UCB can be found in [29]; regret analysis for Thompson sampling can be found in [1, 24, 25].

References

[1] S. Agrawal and N. Goyal. Thompson sampling for contextual bandits with linear payoffs. In ICML, 2013.

[2] J.-Y. Audibert and S. Bubeck. Minimax policies for bandits games. In COLT, 2009.

[3] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Regret in online combinatorial optimization. Mathematics of Operations Research, 2013.

[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.
Machine Learning, 47(2):235–256, 2002.

[5] S.P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[6] E. Brochu, V.M. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.

[7] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.

[8] S. Bubeck, R. Munos, G. Stoltz, and Cs. Szepesvári. X-armed bandits. JMLR, 12:1655–1695, June 2011.

[9] O. Cappé, A. Garivier, O.-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41(3):1516–1541, 2013.

[10] K. Chaloner and I. Verdinelli. Bayesian experimental design: a review. Statistical Science, 10(3):273–304, 1995.

[11] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In NIPS, 2011.

[12] V. Dani, S.M. Kakade, and T.P. Hayes. The price of bandit information for online optimization. In NIPS, pages 345–352, 2007.

[13] V. Dani, T.P. Hayes, and S.M. Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.

[14] D. Golovin and A. Krause. Adaptive submodularity: theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42(1):427–486, 2011.

[15] A. Gopalan, S. Mannor, and Y. Mansour. Thompson sampling for complex online problems. In ICML, 2014.

[16] R.M. Gray. Entropy and Information Theory. Springer, 2011.

[17] P. Hennig and C.J. Schuler. Entropy search for information-efficient global optimization. JMLR, 13:1809–1837, 2012.

[18] B. Jedynak, P.I. Frazier, and R. Sznitman.
Twenty questions with noise: Bayes optimal policies for entropy loss. Journal of Applied Probability, 49(1):114–136, 2012.

[19] E. Kaufmann, N. Korda, and R. Munos. Thompson sampling: an asymptotically optimal finite-time analysis. In ALT, 2012.

[20] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In AISTATS, 2012.

[21] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

[22] W.B. Powell and I.O. Ryzhov. Optimal Learning. John Wiley & Sons, 2012.

[23] P. Rusmevichientong and J.N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411, 2010.

[24] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. arXiv preprint arXiv:1301.2609, 2013.

[25] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson sampling. arXiv preprint arXiv:1403.5341, 2014.

[26] D. Russo and B. Van Roy. Learning to optimize via information directed sampling. arXiv preprint arXiv:1403.5556, 2014.

[27] I.O. Ryzhov, W.B. Powell, and P.I. Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60(1):180–195, 2012.

[28] S.L. Scott. A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.

[29] N. Srinivas, A. Krause, S.M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, May 2012.

[30] J. Villemonteix, E. Vazquez, and E. Walter. An informational approach to the global optimization of expensive-to-evaluate functions.
Journal of Global Optimization, 44(4):509–534, 2009.