{"title": "Adaptive Learning with Unknown Information Flows", "book": "Advances in Neural Information Processing Systems", "page_first": 7473, "page_last": 7482, "abstract": "An agent facing sequential decisions that are characterized by partial feedback needs to strike a balance between maximizing immediate payoffs based on available information, and acquiring new information that may be essential for maximizing future payoffs. This trade-off is captured by the multi-armed bandit (MAB) framework that has been studied and applied when at each time epoch payoff observations are collected on the actions that are selected at that epoch. In this paper we introduce a new, generalized MAB formulation in which additional information on each arm may appear arbitrarily throughout the decision horizon, and study the impact of such information flows on the achievable performance and the design of efficient decision-making policies. By obtaining matching lower and upper bounds, we characterize the (regret) complexity of this family of MAB problems as a function of the information flows. We introduce an adaptive exploration policy that, without any prior knowledge of the information arrival process, attains the best performance (in terms of regret rate) that is achievable when the information arrival process is a priori known. 
Our policy uses dynamically customized virtual time indexes to endogenously control the exploration rate based on the realized information arrival process.", "full_text": "Adaptive Learning with Unknown Information Flows\n\nYonatan Gur\n\nGraduate School of Business\n\nStanford University\nStanford, CA 94305\nygur@stanford.edu\n\nAhmadreza Momeni\n\nElectrical Engineering Department\n\nStanford University\nStanford, CA 94305\n\namomenis@stanford.edu\n\nAbstract\n\nAn agent facing sequential decisions that are characterized by partial feedback\nneeds to strike a balance between maximizing immediate payoffs based on available\ninformation, and acquiring new information that may be essential for maximiz-\ning future payoffs. This trade-off is captured by the multi-armed bandit (MAB)\nframework that has been studied and applied when at each time epoch payoff ob-\nservations are collected on the actions that are selected at that epoch. In this paper\nwe introduce a new, generalized MAB formulation in which additional information\non each arm may appear arbitrarily throughout the decision horizon, and study the\nimpact of such information \ufb02ows on the achievable performance and the design\nof ef\ufb01cient decision-making policies. By obtaining matching lower and upper\nbounds, we characterize the (regret) complexity of this family of MAB problems\nas a function of the information \ufb02ows. We introduce an adaptive exploration policy\nthat, without any prior knowledge of the information arrival process, attains the\nbest performance (in terms of regret rate) that is achievable when the information\narrival process is a priori known. Our policy uses dynamically customized virtual\ntime indexes to endogenously control the exploration rate based on the realized\ninformation arrival process.\n\n1\n\nIntroduction\n\nBackground and motivation. 
In the presence of uncertainty and partial feedback on payoffs, an agent that faces a sequence of decisions needs to strike a balance between maximizing instantaneous performance and collecting valuable information that is essential for optimizing future decisions. A well-studied framework that captures this trade-off between acquiring new information (exploration) and optimizing payoffs based on available information (exploitation) is that of multi-armed bandits (MAB), which first emerged in [20] in the context of drug testing and was later extended by [15] to a more general setting. In this framework, an agent repeatedly chooses between K arms, where at each trial the agent pulls one arm and then receives a reward. In this formulation (known as the stochastic MAB setting), rewards are assumed to be identically distributed for each arm and independent across trials and arms. The objective of the agent is to maximize the cumulative return over a certain time horizon, and the performance criterion is the so-called regret: the expected difference between the cumulative reward received by the agent and the reward accumulated by a hypothetical benchmark, referred to as an oracle, which holds prior information about the reward distribution of each arm (and thus repeatedly selects the arm with the highest expected reward). A sharp regret characterization for this traditional framework was first established by [13], followed by analysis of important policies such as ε-greedy, UCB1, and Thompson sampling; see, e.g., [3], as well as [1]. The MAB framework focuses on balancing exploration and exploitation, typically under very few assumptions on the distribution of rewards, but with very specific assumptions on the future information collection process.
In particular, optimal policy design is typically predicated on the assumption that at each period a reward observation is collected only on the arm that is selected by the policy at that time period (exceptions to this common information structure are discussed below). In that sense, such policy design does not account for information that may be available in many practical settings (e.g., information that arrives between pulls) and that might be fundamental for achieving good performance.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In the current paper we relax the information structure of these classical frameworks by allowing arbitrary information arrival processes. Our focus is on: (i) studying the impact of the information arrival characteristics (such as frequency and timing) on the achievable performance and on the manner in which a decision maker should balance exploration and exploitation; and (ii) adapting in real time to an a priori unknown sample path of information arrivals. In that respect, we identify conditions on the information arrival process that guarantee the optimality of myopic policies (e.g., ones that at each period pull the arm with the highest estimated mean reward), and further identify adaptive MAB policies that guarantee the “best of all worlds” in the sense of establishing (near) optimal performance without prior knowledge of the information collection process.

Main contributions. On the modeling front, we introduce a new, generalized MAB formulation that relaxes strong assumptions that classical MAB settings typically make on the information collection process.
Our formulation considers information flows that correspond to the different arms and allows information to arrive at arbitrary rates and times; it therefore captures a large variety of real-world phenomena, yet maintains mathematical tractability.

On the analysis front, we establish lower bounds on the performance that is achievable by any non-anticipating policy in the presence of unknown information flows. We further show that our lower bounds can be achieved through suitable policy design. These results identify the minimax regret rate associated with the MAB problem with arbitrary information flows, as a function of the horizon length, the information arrival process, the number of arms, and the parameters of the family of reward distributions. In particular, we obtain a spectrum of minimax regret rates that ranges from the classical regret rates that appear in the stochastic MAB literature, when there is no or very little auxiliary information, to a constant regret (independent of the decision horizon length) when information arrives frequently and/or early enough.

We introduce an adaptive exploration policy that, without any prior knowledge of the auxiliary information flows, approximates the best performance that is achievable when the information arrival process is known in advance. This “best of all worlds” type of guarantee implies that rate-optimality is achieved uniformly over the general class of information flows at hand (including the case with no information flows, where classical guarantees are recovered). Our approach relies on using endogenous exploration rates that depend on the amount of information that becomes available over time. In particular, it is based on adjusting the effective exploration rate in real time through virtual time indexes that are dynamically updated based on information arrivals.

Related work.
Several MAB settings have identified cases where exploration is unnecessary (as a myopic decision-making policy may achieve optimal performance), as well as cases where exploration should be performed at a higher rate in order to maintain optimality. For example, [6] consider the contextual bandit framework and show that if the distribution of the contextual information guarantees sufficient diversity, then exploration becomes unnecessary, and greedy policies can benefit from the natural exploration that is embedded in the information diversity to achieve asymptotic optimality. On the other hand, [7] consider a general MAB framework where the underlying reward distribution may change over time according to a budget of variation, and characterize the manner in which optimal exploration rates increase as a function of said budget. In addition, [18] consider a platform in which the preferences of arriving users may be biased by the experience of previous users, and show that classical MAB policies may under-explore in this setting. These studies demonstrate that the extent of exploration that is required to maintain optimality strongly depends on particular problem characteristics that may often be a priori unknown to the decision maker. This introduces the challenge of endogenizing exploration to identify the appropriate rate of exploration and to approximate the best performance that is achievable under ex ante knowledge of the underlying problem characteristics. We address this challenge from an information collection perspective.
We identify conditions on the information arrival process that guarantee the optimality of myopic policies (e.g., ones that at each period pull the arm with the highest estimated mean reward), and further identify adaptive MAB policies that guarantee (near) optimal performance (“best of all worlds”) without prior knowledge of the information arrival process.

In addition, a few papers have considered regulating exploration based on a priori known characteristics in settings that are different from ours. For example, [21] consider regulating exploration and exploitation in a setting where rewards are scaled by an exogenous multiplier that evolves over time in an a priori known manner, and show that in such a setting the performance of known MAB policies can be improved if exploitation is enhanced during periods of high reward and more exploration occurs in periods of low reward. Another approach to regulating exploration is studied by [12] in a setting that includes lock-up periods in which the agent cannot change her actions.

While in traditional MAB formulations observations are obtained in each time period only for the arm that was selected at that period, there are MAB formulations (and other sequential decision frameworks) in which more information can be observed in each time period. One important MAB framework where at each round the decision maker may collect some information on arms that were not pulled is the so-called contextual MAB setting, also referred to as the bandit problem with side observations [23] or the associative bandit problem [19], where at each trial the decision maker observes a context carrying information about the arms. Another important example is the full-information adversarial MAB setting, where rewards are not characterized by a stationary stochastic process but are rather arbitrary and can even be selected by an adversary ([4], and [9]).
In the full-information setting, at each time period, after pulling an arm, the agent observes the rewards generated by all the arms. While the adversarial nature of the latter makes it fundamentally different from the stochastic formulation considered here in terms of achievable performance, analysis, and policy design, it is important to note that the settings above still consider very specific information structures that are a priori known to the agent, as opposed to our setting, where information flows are arbitrary and a priori unknown.

One of the challenges we address is to design a policy that adapts to unknown problem characteristics, in the sense of achieving ex post performance that is as good (or nearly as good) as what is achievable under ex ante knowledge of the information arrival process. This challenge dates back to studies in the statistics literature (see [22] and references therein), and has seen recent interest in the machine learning literature; examples include [17], which presents an algorithm that achieves (near) optimal performance in both stochastic and adversarial MAB regimes without prior knowledge of the nature of the environment; [16], which considers an online convex optimization setting and derives algorithms that are rate optimal regardless of whether the target function is weakly or strongly convex; and [11], which studies the design of an optimal adaptive algorithm competing against dynamic benchmarks.

2 Formulation

Let K = {1, . . . , K} be a set of arms and let T = {1, . . . , T} denote a sequence of decision epochs. At each time period t, a decision maker selects one of the K arms. When selecting arm k ∈ K at time t ∈ T, a reward X_{k,t} ∈ R is realized and observed.
For each t ∈ T and k ∈ K, the reward X_{k,t} is assumed to be independently drawn from some σ²-sub-Gaussian distribution with mean μ_k.¹ We denote the profile of rewards at time t by X_t = (X_{1,t}, . . . , X_{K,t})^⊤ and the profile of mean rewards by μ = (μ_1, . . . , μ_K)^⊤. We further denote by ν = (ν_1, . . . , ν_K)^⊤ the distribution of the rewards profile X_t. We assume that rewards are independent across time periods and arms. We denote the highest expected reward and the best arm by μ* = max_{k∈K} μ_k and k* = arg max_{k∈K} μ_k, respectively.² We denote by Δ_k = μ* − μ_k the difference between the expected reward of the best arm and that of arm k. We assume an a priori known positive lower bound 0 < Δ ≤ min_{k∈K\{k*}} Δ_k, as well as a positive number σ > 0 for which all the reward distributions are σ²-sub-Gaussian, and denote by S = S(Δ, σ²) the class of Δ-separated σ²-sub-Gaussian distribution profiles:

S(Δ, σ²) := { ν | Δ · 1{k ≠ k*} ≤ Δ_k and E[ e^{λ(X_{k,1} − μ_k)} ] ≤ e^{σ²λ²/2} ∀k ∈ K, ∀λ ∈ R }.

¹A real-valued random variable X is said to be sub-Gaussian if there is some σ > 0 such that for every λ ∈ R one has E e^{λ(X − E X)} ≤ e^{σ²λ²/2}.

²For the sake of simplicity, in the formulation and hereafter, when using the arg min and arg max operators we assume that ties are broken in favor of the smaller index.

Auxiliary information flows. Before each round t, the agent may or may not observe reward realizations of some of the arms without pulling them. Let η_{k,t} ∈ {0, 1} denote the indicator of observing auxiliary information on arm k at time t. We denote by η_t = (η_{1,t}, . . . , η_{K,t})^⊤ the vector of indicators η_{k,t} at time step t, and by H = (η_1, . . . , η_T) the information arrival matrix with columns η_t; we assume that this matrix is independent of the policy's actions and observations. If η_{k,t} = 1, then a random variable Y_{k,t} ∼ ν_k is observed. We denote Y_t = (Y_{1,t}, . . . , Y_{K,t})^⊤, and assume that the random variables Y_{k,t} are independent across time periods and arms and are also independent of the reward realizations X_{k,t}. We denote the vector of information received at time t by Z_t = (Z_{1,t}, . . . , Z_{K,t})^⊤, where for any k one has Z_{k,t} = η_{k,t} · Y_{k,t}.

Admissible policies, performance, and regret. Let U be a random variable defined over a probability space (U, U, P_u). Let π_t : R^{t−1} × R^{K×t} × {0, 1}^{K×t} × U → K for t = 1, 2, 3, . . . be measurable functions; with some abuse of notation we also denote by π_t ∈ K the action at time t, given by π_t = π_t(X_{π_{t−1},t−1}, . . . , X_{π_1,1}, Z_t, . . . , Z_1, η_t, . . . , η_1, U) for t = 1, 2, 3, . . . .

The mappings {π_t : t = 1, . . . , T}, together with the distribution P_u, define the class of admissible policies. We denote this class by P. Note that policies in P depend only on the past history of actions and observations as well as auxiliary information arrivals, and allow for randomized strategies via their dependence on U.
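To make these primitives concrete, the following minimal simulation sketch generates reward profiles X_t, an arrival matrix H, and auxiliary observations Z_t = η_t · Y_t. All names and parameter values here are hypothetical, and Gaussian rewards stand in for a generic σ²-sub-Gaussian distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

K, T, sigma = 3, 1000, 1.0
mu = np.array([0.5, 0.3, 0.1])  # mean rewards; arm 0 is the best arm

# Information arrival matrix H: eta[k, t] = 1 if an auxiliary observation
# on arm k arrives before round t (here: i.i.d. coin flips, for illustration).
eta = (rng.random((K, T)) < 0.05).astype(int)

# Auxiliary observations Y_{k,t} ~ nu_k, independent of the rewards X_{k,t}.
Y = rng.normal(mu[:, None], sigma, size=(K, T))
Z = eta * Y  # the agent observes Z_{k,t} = eta_{k,t} * Y_{k,t}

# Rewards X[k, t]; the agent only sees the entry of the arm it pulls.
X = rng.normal(mu[:, None], sigma, size=(K, T))

assert np.all(Z[eta == 0] == 0)  # no arrival implies no auxiliary observation
```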
To evaluate the guaranteed performance of a policy π ∈ P under an information arrival process H, we measure the worst-case expected regret it incurs relative to the performance of an oracle that selects the arm with the highest expected reward; that is, we define the regret as

R^π_S(H, T) = sup_{ν∈S} E^π_ν[ Σ_{t=1}^T (μ* − μ_{π_t}) ],

where the expectation E^π_ν[·] is taken with respect to the noisy rewards as well as the policy's actions (throughout the paper we denote by P^π_ν, E^π_ν, and R^π_ν the probability, expectation, and regret when arms are selected according to policy π and rewards are distributed according to ν).

Discussion of model assumptions. For the sake of simplicity, our model is based on a simple and well-studied MAB framework [13]; however, we note that our methods and analysis can be directly applied to more general MAB frameworks, such as the contextual MAB framework where mean rewards depend linearly on context vectors; see, e.g., [10] and references therein.

For simplicity of the model description, we assume that at most one information arrival can occur before each time step for each arm (that is, for each time t and arm k, one has η_{k,t} ∈ {0, 1}). Notably, all our results hold for the case with more than one information arrival per time step per arm.

In our formulation we focus on auxiliary observations that have the same distribution as reward observations, but all our results hold for a broad family of information structures, as long as unbiased estimators of mean rewards can be constructed from the auxiliary observations; that is, when there exists a mapping φ(·) such that E[φ(Y_{k,t})] = μ_k for each k.

3 The impact of information flows on achievable performance

In this section we study the impact of
auxiliary information flows on the performance that one could aspire to achieve. Our first result formalizes what cannot be achieved, establishing a lower bound on the best achievable performance as a function of the information arrival process.

Theorem 1. (Lower bound on the best achievable performance) For any T ≥ 1 and information arrival matrix H, the worst-case regret of any admissible policy π ∈ P is bounded from below as follows:

R^π_S(H, T) ≥ (C1/Δ) Σ_{k=1}^K log( (C2 Δ²/K) Σ_{t=1}^T exp( −C3 Δ² Σ_{s=1}^t η_{k,s} ) ),

where C1, C2, and C3 are positive constants that depend only on σ.

Theorem 1 establishes a lower bound on the achievable performance in the presence of auxiliary information flows. The theorem provides a spectrum of bounds on achievable performance, mapping many potential information arrival trajectories to the best performance they may allow. In particular, when H = 0, we recover a lower bound of order (K/Δ) log T that coincides with the lower bounds established in [13] and [8] for the classical MAB setting. Theorem 1 further establishes that in the presence of auxiliary information flows regret rates may be lower relative to classical regret rates, and that the impact of information arrivals on the achievable performance depends on the frequency of these arrivals and on the times at which they occur; we discuss these observations in §3.1.

Key ideas in the proof. The proof of Theorem 1 adapts to our framework ideas for identifying a worst-case nature “strategy”; see, e.g., the proof of Theorem 6 in [8]. While the full proof appears in the full version of the paper, we next illustrate its key ideas using the special case of two arms.
Consider two possible profiles of reward distributions, ν and ν′, that are “close” enough that it is hard to distinguish between the two, but “separated” enough that a considerable regret may be incurred when the “correct” profile of distributions is misidentified. In particular, assume that the decision maker is a priori informed that the first arm generates rewards according to a normal distribution with standard deviation σ and a mean that is either −Δ (according to ν) or +Δ (according to ν′), and that the second arm is known to generate rewards according to a normal distribution with standard deviation σ and mean zero under both ν and ν′. To quantify a notion of distance between the possible profiles of reward distributions we use the Kullback-Leibler (KL) divergence. Using Lemma 2.6 from [22], which connects the KL divergence to error probabilities, we establish that at each period t the probability of selecting a suboptimal arm must be at least

p^sub_t = (1/4) exp( −(2Δ²/σ²) ( E_ν[ñ_{1,t}] + Σ_{s=1}^t η_{1,s} ) ),

where ñ_{1,t} denotes the number of times the first arm is pulled up to time t by the policy. Each selection of a suboptimal arm contributes Δ to the regret, and therefore the cumulative regret must be at least Δ Σ_{t=1}^T p^sub_t. We observe that if arm 1 has mean reward −Δ, the cumulative regret must also be at least Δ · E_ν[ñ_{1,T}]. Therefore the regret is lower bounded by (Δ/2)( Σ_{t=1}^T p^sub_t + E_ν[ñ_{1,T}] ), which is greater than

(σ²/(4Δ)) log( (Δ²/(2σ²)) Σ_{t=1}^T exp( −(2Δ²/σ²) Σ_{s=1}^t η_{1,s} ) ).

The argument can be repeated by switching arms 1 and 2. For K arms, we follow the above lines to establish K lower bounds; taking the average of these bounds establishes the result.

3.1 Discussion and subclasses of information flows

Theorem 1 demonstrates that auxiliary information flows may be leveraged to improve performance and reduce regret rates, and that their impact on the achievable performance increases when information arrivals are more frequent and occur earlier. This observation is consistent with the following intuition: (i) at early time periods only a few observations have been collected, and therefore the marginal impact of an additional observation on the stochastic error probabilities is relatively large; and (ii) when information appears early on, there are more decision periods to come in which this information can be used. To emphasize this observation, we next demonstrate the implications for achievable performance of two concrete information arrival processes of natural interest: a process with a fixed arrival rate, and a process with a decreasing arrival rate.

Stationary information flows. Assume that the η_{k,t} are i.i.d. Bernoulli random variables with mean λ. Then, for any T ≥ 1 and admissible policy π ∈ P, one obtains the following lower bound for the achievable performance.
If λ ≤ σ²/(4Δ²T), then

E_H[R^π_S(H, T)] ≥ (σ²(K−1)/(4Δ)) log( (1 − e^{−1/2}) Δ² T / (σ² K) ),

and if λ ≥ σ²/(4Δ²T), then

E_H[R^π_S(H, T)] ≥ (σ²(K−1)/(4Δ)) log( (1 − e^{−1/2}) / (2λK) ).

This class considers stationary information flows in which information arrives at a constant rate λ throughout the horizon. Analyzing this arrival process reveals two different regimes. When the arrival rate of information is small enough, auxiliary observations become essentially ineffective, and one recovers the performance bounds that were established for the classical stochastic MAB problem. In particular, as long as there are no more than order Δ⁻² information arrivals over T time periods, this information does not impact achievable regret rates.³ When Δ is fixed and independent of the horizon length T, the lower bound scales logarithmically with T. When Δ can scale with T, a bound of order √T is recovered when Δ is of order T^{−1/2}. In both cases, there are known policies (such as UCB1) that guarantee rate-optimal performance; for more details see the policies, analysis, and discussion in [3]. On the other hand, when there are more than order Δ⁻² observations over T periods, the lower bound on the regret becomes a function of the arrival rate λ. When the arrival rate is independent of the horizon length T, the regret is bounded by a constant that is independent of T, and a myopic policy is optimal.

³This coincides with the observation that one requires order Δ⁻² samples to distinguish between two distributions that are Δ-separated; see, e.g., [2].
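The two regimes can be made concrete by evaluating the bounds above at sample parameter values. The following sketch is our transcription of the two displayed bounds; the function name and the numbers are illustrative, not from the paper:

```python
import numpy as np

def stationary_lower_bound(lam, K, T, Delta, sigma):
    """Two-regime lower bound on expected regret under i.i.d. Bernoulli(lam)
    information arrivals (illustrative transcription of the bounds above)."""
    c = sigma**2 * (K - 1) / (4 * Delta)
    if lam <= sigma**2 / (4 * Delta**2 * T):
        # sparse arrivals: the bound grows logarithmically with T
        return c * np.log((1 - np.exp(-0.5)) * Delta**2 * T / (sigma**2 * K))
    # frequent arrivals: the bound is independent of the horizon length T
    return c * np.log((1 - np.exp(-0.5)) / (2 * lam * K))

K, T, Delta, sigma = 2, 10_000, 0.2, 1.0
# The regime switch occurs at lam = sigma^2 / (4 * Delta^2 * T) = 6.25e-4 here.
sparse = stationary_lower_bound(1e-4, K, T, Delta, sigma)
frequent = stationary_lower_bound(1e-2, K, T, Delta, sigma)
```

With these values the sparse-regime bound exceeds the frequent-regime one, in line with the discussion above: once arrivals are frequent enough, the bound no longer grows with T.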
For more details, see the full version of the paper.

Diminishing information flows. Fix some κ > 0, and assume that the η_{k,t} are random variables such that for each arm k ∈ K and at each time step t, E[ Σ_{s=1}^t η_{k,s} ] = ⌊ (σ²κ/(2Δ²)) log t ⌋. Then, for any T ≥ 1 and admissible policy π ∈ P, one obtains the following lower bound for the achievable performance. If κ < 1, then

E_H[R^π_S(H, T)] ≥ (σ²(K−1)/(4Δ)) log( (Δ²/(Kσ²)) · ((T+1)^{1−κ} − 1) / (1−κ) ),

and if κ > 1, then

E_H[R^π_S(H, T)] ≥ (σ²(K−1)/(4Δ)) log( (Δ²/(Kσ²)) · (1 − (T+1)^{−(κ−1)}) / (κ−1) ).

This class considers diminishing information flows under which the expected number of information arrivals up to time t is of order log t. The example illustrates the impact of the timing of information arrivals on the achievable performance, and suggests that a constant regret may be achievable even when the rate of information arrivals is decreasing. Whenever κ < 1, the lower bound on the regret is logarithmic in T, and there are well-studied MAB policies (e.g., UCB1; see Auer et al. [3]) that guarantee rate-optimal performance. When κ > 1, the lower bound on the regret is a constant, and one may observe that when κ is large enough a myopic policy is asymptotically optimal. (In the limit κ → 1 the lower bound is of order log log T.) For more details, see the full version of the paper.

Discussion. One may contrast the subclasses of information flows described above by selecting κ = 2Δ²λT/(σ² log T).
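This selection can be checked numerically by plugging arrival schedules directly into the lower-bound expression of Theorem 1. The sketch below uses expected arrival counts in place of realized indicators and illustrative placeholder values for the unspecified constants C1, C2, C3, so the magnitudes are meaningless and only the comparison between the two schedules is informative:

```python
import numpy as np

def theorem1_bound(eta, Delta, C1=1.0, C2=1.0, C3=2.0):
    """Evaluate (C1/Delta) * sum_k log((C2*Delta^2/K) * sum_t exp(-C3*Delta^2*cum_kt))
    for an arrival matrix eta of shape (K, T). C1, C2, C3 are placeholders."""
    K, _ = eta.shape
    cum = np.cumsum(eta, axis=1)                      # sum_{s<=t} eta[k, s]
    inner = np.exp(-C3 * Delta**2 * cum).sum(axis=1)  # inner sum over t, per arm
    return (C1 / Delta) * np.log(C2 * Delta**2 / K * inner).sum()

K, T, Delta, sigma, kappa = 2, 10_000, 0.2, 1.0, 2.0
t = np.arange(1, T + 1)

# Diminishing flows: cumulative arrivals track floor((sigma^2*kappa/(2*Delta^2)) log t)
# (increments may exceed one early on; the model allows several arrivals per step).
cum_target = np.floor(sigma**2 * kappa / (2 * Delta**2) * np.log(t))
dim_eta = np.tile(np.diff(cum_target, prepend=0.0), (K, 1))

# Stationary flows with the same per-arm total, spread uniformly over the horizon.
stat_eta = np.full((K, T), cum_target[-1] / T)
```

Front-loading the same total number of arrivals yields a strictly smaller value of the lower-bound expression than spreading them uniformly, consistent with the intuition that earlier information is more valuable.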
Then, in both settings the total number of information arrivals for each arm is λT. However, while in the first example the information arrival rate is fixed over the horizon, in the second example the arrival rate is higher at the beginning of the horizon and gradually decreases over time. By further selecting λ = (σ² log T)/(Δ²T), one obtains κ = 2. The lower bound under stationary information flows is then logarithmic in T (establishing the impossibility of constant regret in that setting), but the lower bound under diminishing information flows is constant and independent of T (in the next section we will observe that constant regret is indeed achievable in that setting). This observation echoes the intuition that earlier observations have a larger impact on achievable performance: at early periods only little information is available, so the marginal impact of an additional observation is larger, and earlier information can be used over more decision periods (as the remaining horizon is longer).⁴

The analysis above demonstrates that optimal policy design and the best achievable performance depend on the information arrival process: while policies such as UCB1 and ε-greedy may be rate-optimal in some cases, a myopic policy can achieve rate-optimal performance in other cases. However, the identification of a rate-optimal policy relies on prior knowledge of the information flow. Therefore, an important question one may ask is: how can a decision maker adapt to an arbitrary and unknown information arrival process, in the sense of achieving (near) optimal performance without any prior knowledge of the information flow? We address this question in §4.

⁴This observation can be generalized by noting that the described subclasses are special cases of the following setting.
Let the η_{k,t} be independent random variables such that for each arm k and every time period t, the expected number of information arrivals up to time t satisfies

E[ Σ_{s=1}^t η_{k,s} ] = λT · (t^{1−γ} − 1)/(T^{1−γ} − 1).

The expected total number of information arrivals for each arm, λT, is determined by the parameter λ. The concentration of arrivals, however, is governed by the parameter γ. When γ = 0 the arrival rate is constant, which corresponds to the subclass of stationary information flows. As γ increases, information arrivals concentrate in the beginning of the horizon, and γ → 1 leads to E[ Σ_{s=1}^t η_{k,s} ] = λT · (log t)/(log T), which corresponds to the subclass of diminishing information flows. One may then apply similar analysis to observe that when λT is of order T^{1−γ} or more, the lower bound is a constant independent of T.

4 A near-optimal adaptive policy

In this section we suggest a policy that adapts to an a priori unknown information flow. Before laying out the policy, we first demonstrate that classical policy design may fail to achieve the lower bound of Theorem 1 in the presence of unknown information flows.

The inefficiency of naive adaptations of MAB policies. Consider the simple approach of adapting classical MAB policies by accounting for the information that has arrived so far when calculating the estimates of mean rewards, while otherwise maintaining the structure of the policy. Such an approach can be implemented easily using well-known MAB policies such as UCB1 or ε-greedy. A first observation is that the performance bounds established for these policies (e.g., in Auer et al. [3]) do not improve (as a function of the horizon length T) in the presence of unknown information flows.
Moreover, it is possible to show through lower bounds on the guaranteed performance that these policies indeed achieve sub-optimal performance. To demonstrate this, consider the subclass of stationary information flows described in §3.1, with an arrival rate λ that is very large compared to σ²/(4Δ²T). In that case, we have seen that the regret lower bound becomes constant whenever the arrival rate λ is independent of T. However, the ε-greedy policy employs an exploration rate that is independent of the number of observations obtained for each arm, and therefore effectively incurs regret of order log T due to performing unnecessary exploration.

A simple rate-optimal policy. We provide a simple and deterministic adaptive exploration policy that includes the key elements that are essential for appropriately adjusting the exploration rate and achieving good performance in the presence of auxiliary information flows. In what follows, we denote by n_{k,t} and X̄_{k,n_{k,t}} the number of times a sample from arm k has been observed and the empirical average reward of arm k up to time t, respectively, that is,

n_{k,t} = η_{k,t} + ∑_{s=1}^{t−1} (η_{k,s} + 1{π_s = k}),    X̄_{k,n_{k,t}} = ( η_{k,t} Y_{k,t} + ∑_{s=1}^{t−1} (η_{k,s} Y_{k,s} + 1{π_s = k} X_{k,s}) ) / n_{k,t}.

Consider the following policy:

Adaptive exploration policy. Input: a tuning parameter c > 0.

1. Set initial virtual times τ_{k,0} = 0 for all k ∈ K, and an exploration set W_0 = K.
2. At each period t = 1, 2, ..., T:
   (a) Observe the vectors η_t and Z_t.
       • Advance virtual times: τ_{k,t} = (τ_{k,t−1} + 1) · exp( η_{k,t}Δ² / (cσ²) ) for all k ∈ K.
       • Update the exploration set: W_t = { k ∈ K | n_{k,t} < (cσ²/Δ²) log τ_{k,t} }.
   (b) If W_t is not empty, select an arm from W_t with the fewest observations (exploration):
       π_t = arg min_{k∈W_t} n_{k,t}.
       Otherwise, select an arm with the highest estimated reward (exploitation):
       π_t = arg max_{k∈K} X̄_{k,n_{k,t}}.
   (c) Receive and observe a reward X_{π_t,t}.

Clearly π ∈ P. At each time step t, the adaptive exploration policy checks whether, for each arm k, the number of observations that has been collected so far (through arm pulls and auxiliary information together) exceeds a dynamic threshold that depends logarithmically on the virtual time τ_{k,t}, that is, whether arm k satisfies the condition n_{k,t} ≥ (cσ²/Δ²) log τ_{k,t}. If yes, the arm with the highest reward estimator X̄_{k,n_{k,t}} is pulled (exploitation). Otherwise, the arm with the fewest observations is pulled (exploration).
This approach guarantees that enough observations have been collected from each arm, such that a suboptimal arm will be selected with a probability of order t^{−c/8} or less.

The adaptive exploration policy endogenizes a common principle of balancing exploration and exploitation, by which the exploration rate should be set to guarantee that the overall loss due to exploration equals the expected loss due to misidentification of the best arm; see, e.g., [3] and references therein, the related concept of forced sampling in [14], as well as related discussions in [10] and [5]. In the absence of information flows, an exploration rate of order 1/t guarantees that the arm with the highest estimated mean reward can be suboptimal only with a probability of order 1/t; see, e.g., the analysis of the ε-greedy policy in [3], where at each time period t exploration occurs uniformly at random with probability 1/t. The manner in which the appropriate exploration rate decays captures the extent to which the value of new information deflates over time: at early time periods new information is more valuable, as estimates are based on little information and there are many remaining decision epochs in which this information could be used; as time goes by, new information becomes less valuable.

Figure 1: Illustration of the adaptive exploration policy. (Left) The virtual time index τ is advanced using multiplicative factors whenever auxiliary information is observed. (Right) The exploration rate f(τ) = 1/τ decreases as a function of τ and, in particular, exhibits discrete "jumps" whenever auxiliary information is observed.
In the presence of auxiliary information flows, stochastic error rates decrease due to the additional observations. Our policy dynamically reacts to the information flows by reducing the exploration rate so as to guarantee that the loss due to exploration is balanced throughout the horizon with the expected loss due to misidentification of the best arm. It does so by adjusting the exploration rate of each arm based on a virtual time index τ_{k,t} associated with that arm, rather than based on the actual time period t (which is appropriate in the absence of information flows). In particular, the adaptive exploration policy explores arm k at a rate that would have been appropriate, without auxiliary information flows, at a future time step τ_{k,t}. Every time additional information on arm k is observed, a multiplicative factor is used to further advance the virtual time step τ_{k,t} by τ_{k,t} = (τ_{k,t−1} + 1) · exp(δ · η_{k,t}) for some suitably chosen δ. The general idea of adapting the exploration rate of a policy by advancing a virtual time index as a function of the information arrival process is illustrated in Figure 1.

Theorem 2. (Near optimality of the adaptive exploration policy) Let π be the adaptive exploration policy with tuning parameter c > 8. For any T ≥ 1 and auxiliary information arrival matrix H:

R^π_S(H, T) ≤ ∑_{k∈K} Δ_k ( (C₄/Δ²) · log( ∑_{t=0}^{T} exp( −(Δ²/C₄) ∑_{s=1}^{t} η_{k,s} ) ) + C₅ ),

where C₄ and C₅ are positive constants that depend only on σ.

Key ideas in the proof. We decompose the overall regret into the regret over exploration time steps and the regret over exploitation time steps. To bound the regret at exploration time periods, we note that the virtual times can be expressed as τ_{k,t} = ∑_{s=1}^{t} exp( (Δ²/(cσ²)) ∑_{τ=s}^{t} η_{k,τ} ), and that the expected number of observations from arm k due to exploration and information flows together is at most (cσ²/Δ²) log τ_{k,T} + 1. Subtracting the number of information arrivals ∑_{t=1}^{T} η_{k,t}, one obtains the first term in the upper bound. To bound the regret at exploitation time periods, we use the Chernoff-Hoeffding inequality to bound the probability that a sub-optimal arm has the highest estimated reward, given the minimal number of observations that must be collected on each arm.

The upper bound in Theorem 2 holds for any arbitrary sample path of information arrivals that is captured by the matrix H, and matches the lower bound in Theorem 1 with respect to the dependence on the time horizon T, as well as on the sample path of information arrivals η_{k,t}, the number of arms K, and the minimum expected reward difference Δ. In particular, this establishes a minimax regret rate of order (1/Δ) ∑_{k=1}^{K} log( ∑_{t=0}^{T} exp( −c · ∑_{s=1}^{t} η_{k,s} ) ) for the MAB problem with auxiliary information that is formulated here, where c is a constant that may depend on problem parameters such as K, Δ, and σ. Theorem 2 also implies that the adaptive exploration policy guarantees the best achievable regret (up to some multiplicative constant) under any arbitrary sample path of auxiliary information (and, in particular, under the subclasses discussed in §3.1 for any values of λ and κ).

5 Concluding remarks

In this study we considered a generalization of the stationary multi-armed bandit problem in the presence of unknown and arbitrary information flows on each arm.
We studied the impact of such auxiliary information on the design of efficient learning policies and on the performance that can be achieved. In particular, we introduced an adaptive MAB policy that adapts in real time to the unknown information arrival process by endogenously controlling the exploration rate through advancing virtual time indexes that are customized for each arm every time information on that arm arrives. We established that, using this policy, one may guarantee the best performance (in terms of minimax regret) that is achievable under prior knowledge of the information arrival process.

References

[1] Agrawal, S. and N. Goyal (2013). Further optimal regret bounds for Thompson sampling. In Artificial Intelligence and Statistics, pp. 99–107.

[2] Audibert, J.-Y. and S. Bubeck (2010). Best arm identification in multi-armed bandits. In COLT - 23rd Conference on Learning Theory.

[3] Auer, P., N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3), 235–256.

[4] Auer, P., N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pp. 322–331. IEEE.

[5] Bastani, H. and M. Bayati (2015). Online decision-making with high-dimensional covariates. Preprint, available at SSRN: http://ssrn.com/abstract=2661896.

[6] Bastani, H., M. Bayati, and K. Khosravi (2017). Exploiting the natural exploration in contextual bandits. arXiv preprint arXiv:1704.09011.

[7] Besbes, O., Y. Gur, and A. Zeevi (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in Neural Information Processing Systems 27, 199–207.

[8] Bubeck, S., V. Perchet, and P. Rigollet (2013). Bounded regret in stochastic multi-armed bandits. In Conference on Learning Theory, pp. 122–134.

[9] Freund, Y. and R. E. Schapire (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55(1), 119–139.

[10] Goldenshluger, A. and A. Zeevi (2013). A linear response bandit problem. Stochastic Systems 3(1), 230–261.

[11] Jadbabaie, A., A. Rakhlin, S. Shahrampour, and K. Sridharan (2015). Online optimization: Competing with dynamic comparators. In Artificial Intelligence and Statistics, pp. 398–406.

[12] Komiyama, J., I. Sato, and H. Nakagawa (2013). Multi-armed bandit problem with lock-up periods. In Asian Conference on Machine Learning, pp. 100–115.

[13] Lai, T. L. and H. Robbins (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), 4–22.

[14] Langford, J. and T. Zhang (2008). The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pp. 817–824.

[15] Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5), 527–535.

[16] Sani, A., G. Neu, and A. Lazaric (2014). Exploiting easy data in online optimization. In Advances in Neural Information Processing Systems, pp. 810–818.

[17] Seldin, Y. and A. Slivkins (2014). One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pp. 1287–1295.

[18] Shah, V., J. Blanchet, and R. Johari (2018). Bandit learning with positive externalities. arXiv preprint arXiv:1802.05693.

[19] Strehl, A. L., C. Mesterharm, M. L. Littman, and H. Hirsh (2006). Experience-efficient learning in associative bandit problems. In Proceedings of the 23rd International Conference on Machine Learning, pp. 889–896. ACM.

[20] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294.

[21] Tracà, S. and C. Rudin (2015). Regulating greed over time. arXiv preprint arXiv:1505.05629.

[22] Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation (1st ed.). Springer.

[23] Wang, C.-C., S. R. Kulkarni, and H. V. Poor (2005). Bandit problems with side observations. IEEE Transactions on Automatic Control 50(3), 338–355.