{"title": "Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments", "book": "Advances in Neural Information Processing Systems", "page_first": 409, "page_last": 416, "abstract": null, "full_text": " Exploration-Exploitation Tradeoffs for
 Experts Algorithms in Reactive Environments


 Daniela Pucci de Farias                    Nimrod Megiddo
 Department of Mechanical Engineering       IBM Almaden Research Center
 Massachusetts Institute of Technology      650 Harry Road, K53-B2
 Cambridge, MA 02139                        San Jose, CA 95120
 pucci@mit.edu                              megiddo@almaden.ibm.com


 Abstract

 A reactive environment is one that responds to the actions of an agent rather than
 evolving obliviously. In reactive environments, experts algorithms must balance
 exploration and exploitation of experts more carefully than in oblivious ones. In
 addition, a more subtle definition of a learnable value of an expert is required. A
 general exploration-exploitation experts method is presented along with a proper
 definition of value. The method is shown to asymptotically perform as well as
 the best available expert. Several variants are analyzed from the viewpoint of the
 exploration-exploitation tradeoff, including explore-then-exploit, polynomially
 vanishing exploration, constant-frequency exploration, and constant-size explo-
 ration phases. Complexity and performance bounds are proven.


1 Introduction

Real-world environments require agents to choose actions sequentially. For example, a driver has to choose a route from one point to another every day, based on past experience and perhaps some current information. In another example, an airline company has to set prices dynamically, also based on past experience and current information. One important difference between these two examples is that the effect of the driver's decision on the future traffic patterns is negligible, whereas prices set by one airline can affect future market prices significantly. 
In this sense the decisions of the airline are made in a reactive environment, whereas the driver performs in a non-reactive one. For this reason, the driver's problem is essentially a problem of prediction, while the airline's problem has an additional element of control.

In the decision problems we consider, an agent has to repeatedly choose among currently feasible actions. The agent then observes a reward, which depends both on the chosen action and on the current state of the environment. The state of the environment may depend both on the agent's past choices and on choices made by the environment independently of the agent's current choice. There are various known approaches to sequential decision making under uncertainty. In this paper we focus on the so-called experts algorithm approach. An "expert" (or "oracle") is simply a particular strategy recommending actions based on the past history of the process. An experts algorithm is a method that combines the recommendations of several given experts into another strategy of choosing actions (e.g., [4, 1, 3]).

Many learning algorithms can be interpreted as "exploration-exploitation" methods. Roughly speaking, such algorithms blend choices of exploration, aimed at acquiring knowledge, with choices of exploitation, which capitalize on gained knowledge to accumulate rewards. In particular, some experts algorithms can be interpreted as blending the testing of all experts with following those experts that were observed to be more rewarding. Our previous paper [2] presented a specific exploration-exploitation experts algorithm. The reader is referred to [2] for further definitions, examples and discussion. That algorithm was designed especially for learning in reactive environments. 
The difference between our algorithm and previous experts algorithms is that our algorithm tests each expert for multiple consecutive stages of the decision process, in order to acquire knowledge about how the environment reacts to the expert. We pointed out that the "Minimum Regret" criterion often used for evaluating experts algorithms was not suitable for reactive environments, since it ignored the possibility that different experts may induce different states of the environment. The previous paper, however, did not attempt to optimize the exploration-exploitation tradeoff. It rather focused on one particular possibility, which was shown to perform in the long run as well as the best expert.

In this paper, we present a more general exploration-exploitation experts method and provide results about the convergence of several of its variants. We develop performance guarantees showing that the method achieves an average payoff comparable to that achieved by the best expert. We characterize convergence rates that hold both in expected value and with high probability. We also introduce a definition of the long-term value of an expert, which captures the reactions of the environment to the expert's actions, as well as the fact that any learning algorithm commits mistakes. Finally, we characterize how fast the method learns the value of each expert. An important aspect of our results is that they provide an explicit characterization of the tradeoff between exploration and exploitation.

The paper is organized as follows. The method is described in section 2. Convergence rates based on actual expert performance are presented in section 3. In section 4, we define the experts' long-run values, whereas in section 5 we address the question of how fast the method learns the values of the experts. Finally, in section 6 we analyze various exploration schemes. 
These results assume that the number of stages in each phase grows with the number of phases in which the chosen expert has previously been followed; section 6.4 considers phases of constant length.

2 The Exploration-Exploitation Method

The problem we consider in this paper can be described as follows. At times t = 1, 2, . . ., an agent has to choose an action a_t \in A. At the same times the environment also "chooses" b_t \in B, and then the agent receives a reward R(a_t, b_t). Rewards are assumed to be bounded; we denote by u a bound on their magnitude, so that |R(a, b)| \le u for all a \in A and b \in B. The choices of the environment may depend on various factors, including the past choices of the agent.

As in the particular algorithm of [2], the general method follows chosen experts for multiple stages rather than picking a different expert each time. A maximal set of consecutive stages during which the same expert is followed is called a phase. Phase numbers are denoted by i. The number of phases during which expert e has been followed is denoted by N_e, the total number of stages during which expert e has been followed is denoted by S_e, and the average payoff from phases in which expert e has been followed is denoted by M_e. The general method is stated as follows.

 Exploration. An exploration phase consists of picking a random expert e (i.e., from the uniform distribution over {1, . . . , r}), and following e's recommendations for a certain number of stages, depending on the variant of the method.

 Exploitation. An exploitation phase consists of picking an expert e with maximum M_e, breaking ties at random, and following e's recommendations for a certain number of stages, depending on the variant of the method.

A general Exploration-Exploitation Experts (EEE) Method:

 1. Initialize M_e = N_e = S_e = 0 (e = 1, . . . , r) and i = 1.
 2. With probability p_i, perform an exploration phase, and with probability 1 - p_i perform an exploitation phase; denote by e the expert chosen to be followed and by n the number of stages chosen for the current phase.
 3. Follow expert e's instructions for the next n stages. Increment N_e = N_e + 1 and update S_e = S_e + n. 
Denote by \tilde R the average payoff accumulated during the current phase of n stages, and update

    M_e \leftarrow M_e + \frac{n}{S_e}\big(\tilde R - M_e\big)\,.

 4. Increment i = i + 1 and go to step 2.

We denote stage numbers by s and phase numbers by i. We denote by M_1(i), . . . , M_r(i) the values of the registers M_1, . . . , M_r, respectively, at the end of phase i. Similarly, we denote by N_1(i), . . . , N_r(i) the values of the registers N_1, . . . , N_r, and by S_1(i), . . . , S_r(i) the values of the registers S_1, . . . , S_r, at the end of phase i. In sections 3 and 5, we present performance bounds for the EEE method when the length of each phase is n = N_e. In section 6.4 we consider the case where n = L for a fixed L. Due to space limitations, proofs are omitted; they can be found in the online appendix.

3 Bounds Based on Actual Expert Performance

The original variant of the EEE method [2] used p_i = 1/i and n = N_e. The following was proven:

    Pr\Big(\liminf_{s\to\infty} M(s) \ge \max_e \liminf_{i\to\infty} M_e(i)\Big) = 1\,. (1)

In words, the algorithm asymptotically achieves an average reward that is as large as that of the best expert. In this section we generalize this result. We present several bounds characterizing the relationship between M(i) and M_e(i). These bounds are valuable in several ways. First, they provide worst-case guarantees on the performance of the EEE method. Second, they provide a starting point for analyzing the behavior of the method under various assumptions about the environment. Third, they quantify the relationship between the amount of exploration, represented by the exploration probabilities p_i, and the loss of performance. 
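To make the EEE method of section 2 concrete, here is a minimal Python sketch (our own illustration, not the authors' implementation). The expert/environment interaction is abstracted into a single per-stage reward function, phase lengths follow the n = N_e rule used in sections 3 and 5, and the exploration schedule p_i is a parameter:

```python
import random

def eee(r, reward, phases, p, rng):
    """Sketch of the exploration-exploitation experts (EEE) method.

    r       -- number of experts
    reward  -- reward(e, s): payoff of following expert e's advice at stage s
               (abstracts the expert/environment interaction)
    phases  -- number of phases to run
    p       -- p(i): exploration probability for phase i
    rng     -- a random.Random instance
    """
    M = [0.0] * r  # M_e: average payoff over phases in which e was followed
    N = [0] * r    # N_e: number of phases in which e was followed
    S = [0] * r    # S_e: total number of stages in which e was followed
    stage = 0
    for i in range(1, phases + 1):
        if rng.random() < p(i):
            # exploration phase: expert drawn uniformly at random
            e = rng.randrange(r)
        else:
            # exploitation phase: expert with maximal M_e, ties broken at random
            best = max(M)
            e = rng.choice([k for k in range(r) if M[k] == best])
        N[e] += 1
        n = N[e]  # phase-length rule n = N_e
        phase_total = 0.0
        for _ in range(n):
            stage += 1
            phase_total += reward(e, stage)
        S[e] += n
        # step 3 update: M_e <- M_e + (n / S_e) * (Rtilde - M_e)
        M[e] += (n / S[e]) * (phase_total / n - M[e])
    return M, N, S

if __name__ == "__main__":
    rng = random.Random(0)
    M, N, S = eee(2, lambda e, s: (0.2, 0.8)[e], 200, lambda i: 1.0 / i, rng)
    # with constant payoffs, M_e equals each expert's payoff once e is followed
    assert all(abs(M[e] - (0.2, 0.8)[e]) < 1e-9 for e in range(2) if N[e] > 0)
```

With two constant-payoff experts, M_e converges to each expert's payoff and exploitation concentrates on the better one; in a genuinely reactive environment, the reward function would instead depend on the history induced during the phase.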
Together with the analysis of Section 5, which characterizes how fast the EEE method learns the value of each expert, the bounds derived here describe explicitly the tradeoff between exploration and exploitation.

We denote by Z_{ej} the event "phase j performs exploration with expert e," and let Z_j = \cup_e Z_{ej} and

    \bar{Z}_{i_0 i} = E\Big[\sum_{j=i_0+1}^{i} Z_j\Big] = \sum_{j=i_0+1}^{i} p_j\,.

Note that \bar{Z}_{i_0 i} denotes the expected number of exploration phases between phases i_0 + 1 and i.

The first theorem establishes that, with high probability, after a finite number of iterations, the EEE method performs comparably to the best expert. The performance of each expert is defined as the smallest average reward achieved by that expert in the interval between an (arbitrary) phase i_0 and the current phase i. It can be shown via a counterexample that this bound cannot be extended to a (somewhat more natural) comparison between the average reward of the EEE method and the average reward of each expert at iteration i.

Theorem 3.1. For all i_0 \le i and \delta such that \bar{Z}_{i_0 i} \le \frac{i\delta^2}{4ru^2} - \frac{i_0\delta}{4u},

    Pr\Big(M(i) \ge \max_e \min_{i_0+1\le j\le i} M_e(j) - \delta\Big) \ge 1 - 2\exp\Big(-\frac{1}{2i}\Big(\frac{i\delta^2}{4ru^2} - \frac{i_0\delta}{4u} - \bar{Z}_{i_0 i}\Big)^2\Big)\,.

The following theorem characterizes the expected difference between the average reward of the EEE method and that of the best expert.

Theorem 3.2. For all i_0 \le i and \delta > 0,

    E\Big[M(i) - \max_e \min_{i_0+1\le j\le i} M_e(j)\Big] \ge -\delta - u\Big(\frac{i_0(i_0+1)(3u+2\delta)}{i\,(\delta(i/r+1)-2u)} + \frac{2\bar{Z}_{i_0 i}}{i}\Big)\,.

It follows from Theorem 3.1 that, under certain assumptions on the exploration probabilities, the EEE method performs asymptotically at least as well as the expert that did best. Corollary 3.1 generalizes the asymptotic result established in [2].

Corollary 3.1. If \lim_{i\to\infty} \bar{Z}_{0i}/i = 0, then

    Pr\Big(\liminf_{s\to\infty} M(s) \ge \max_e \liminf_{i\to\infty} M_e(i)\Big) = 1\,. (2)

Note that here the average reward obtained by the EEE method is compared with the reward actually achieved by each expert during the same run of the method. 
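Since \bar{Z}_{i_0 i} is simply a running sum of exploration probabilities, its growth under different exploration schedules is easy to check numerically. The small sketch below (our own illustration; the three schedules anticipate the analysis of Section 6) confirms the logarithmic, polynomial, and linear growth rates:

```python
import math

def zbar(p, i0, i):
    """Expected number of exploration phases between phases i0+1 and i:
    the sum of the exploration probabilities p(j) for j = i0+1, ..., i."""
    return sum(p(j) for j in range(i0 + 1, i + 1))

i = 1000
harmonic = zbar(lambda j: 1.0 / j, 0, i)   # p_j = 1/j: grows like log i
poly = zbar(lambda j: j ** -0.5, 0, i)     # p_j = j^(-1/2): grows like 2*sqrt(i)
constant = zbar(lambda j: 0.1, 0, i)       # constant rate: grows linearly in i

# growth bounds matching Section 6.2 (alpha = 1/2 in the polynomial case)
assert harmonic <= math.log(i) + 1
assert ((i + 1) ** 0.5 - 1) / 0.5 <= poly <= i ** 0.5 / 0.5
assert abs(constant - 0.1 * i) < 1e-6
```

The contrast between the logarithmic and polynomial cases is what drives the complexity gap between the exploration schemes of Section 6.2.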
This comparison has no implication for the behavior of M_e(i), which is analyzed in the next section.

4 The Value of an Expert

In this section we analyze the behavior of the average reward M_e(i) that is computed by the EEE method for each expert e. This average reward serves as the method's estimate of the value of expert e. The question is thus whether the EEE method is indeed capable of learning the value of the best experts. We first discuss what a "learnable value" of an expert is. This concept is not trivial, especially when the environment is reactive. The obvious definition of the value as the expected average reward the expert could achieve, if followed exclusively, does not work. The previous paper presented an example (see Section 4 in [2]) of a repeated Matching Pennies game, which proved this impossibility. That example shows that an algorithm that attempts to learn what an expert would achieve, if played exclusively, cannot avoid committing fatal "mistakes." In certain environments, every non-trivial learning algorithm must commit such fatal mistakes. Hence, such mistakes cannot, in general, be considered a weakness of the algorithm. A more realistic concept of value, relative to a certain environment policy \pi, is defined as follows, using a real parameter \eta.

Definition 4.1.

 (i) Achievable \eta-Value. A real number \mu is called an achievable \eta-value for expert e against an environment policy \pi if there exists a constant c \ge 0 such that, for every stage s_0, every possible history h_{s_0} at stage s_0, and any number of stages s,

    E\Big[\frac{1}{s}\sum_{s'=s_0+1}^{s_0+s} R\big(a_e(s'), b(s')\big) \,:\, a_e(s') = e(h_{s'}),\ b(s') = \pi(h_{s'})\Big] \ge \mu - \frac{c}{s^{\eta}}\,.

 (ii) \eta-Value. The \eta-value \mu_e of expert e with respect to \pi is the largest achievable \eta-value of e:

    \mu_e = \sup\{\mu : \mu \text{ is an achievable } \eta\text{-value}\}\,. 
(3)

In words, a value \mu is achievable by expert e if the expert can secure an expected average reward over the s stages between stage s_0 and stage s_0 + s that is asymptotically at least \mu, regardless of the history of the play prior to stage s_0. In [2], we introduced the notion of flexibility as a way of reasoning about the value of an expert and when it can be learned. The \eta-value can be viewed as a relaxation of the previous assumptions, and hence the results here strengthen those of [2]. We note, however, that flexibility does hold when the environment reacts with bounded memory or as a finite automaton.

5 Bounds Based on Expected Expert Performance

In this section we characterize how fast the EEE method learns the \eta-value of each expert. We can derive the rate at which the average reward achieved by the EEE method approaches the \eta-value of the best expert.

Theorem 5.1. Denote \beta = \min(\eta, 1). For all \varepsilon > 0 and i, if

    \frac{4r}{\varepsilon^3(2-\beta)}\Big(\frac{4c}{\varepsilon}\Big)^{1/\beta} \le \bar{Z}_{0i}\,,

then

    Pr\Big(\inf_{j\ge i} M_e(j) < \mu_e - \varepsilon\Big) \le \exp\Big(-\frac{\varepsilon^3 \bar{Z}_{0i}}{2\cdot 4^3\cdot 3u^2 r}\Big)\,.

Note from the definition of \eta-values that we can only expect the average reward of expert e to be close to \mu_e if the phase lengths when the expert is chosen are sufficiently large. This is necessary to ensure that the bias term c/s^{\eta}, present in the definition of the \eta-value, is small. The condition on \bar{Z}_{0i} reflects this observation. It ensures that each expert is chosen in sufficiently many phases; since phase lengths grow proportionally to the number of phases in which an expert has been chosen, this implies that phase lengths are large enough.

We can combine Theorems 3.1 and 5.1 to provide an overall bound on the difference between the average reward achieved by the EEE method and the \eta-value of the best expert.

Corollary 5.1. 
For all \varepsilon > 0 and i_0 \le i, if

    (i)\ \frac{4r}{\varepsilon^3(2-\beta)}\Big(\frac{4c}{\varepsilon}\Big)^{1/\beta} \le \bar{Z}_{0 i_0}\,, \qquad (ii)\ \bar{Z}_{i_0 i} \le \frac{i\varepsilon^2}{4ru^2} - \frac{i_0\varepsilon}{4u}\,,

then

    Pr\Big(M(i) \ge \max_e \mu_e - \frac{3\varepsilon}{2}\Big) \ge 1 - \exp\Big(-\frac{\varepsilon^3 \bar{Z}_{0 i_0}}{2\cdot 4^3\cdot 3u^2 r}\Big) - 2\exp\Big(-\frac{1}{2i}\Big(\frac{i\varepsilon^2}{4ru^2} - \frac{i_0\varepsilon}{4u} - \bar{Z}_{i_0 i}\Big)^2\Big)\,. (4)

Corollary 5.1 explicitly quantifies the tradeoff between exploration and exploitation. In particular, one would like to choose p_j such that \bar{Z}_{0 i_0} is large enough to make the first exponential term in the bound small, and \bar{Z}_{i_0 i} as small as possible. In Section 6, we analyze several exploration schemes and their effect on the convergence rate of the EEE method. Theorems 3.1 and 5.1 also yield asymptotic guarantees for the EEE method.

Corollary 5.2. If \lim_{i\to\infty} \bar{Z}_{0i} = \infty, then Pr\big(\liminf_{i\to\infty} M_e(i) \ge \mu_e\big) = 1.

The following is an immediate result from Corollaries 3.1 and 5.2:

Corollary 5.3. If \lim_{i\to\infty} \bar{Z}_{0i} = \infty and \lim_{i\to\infty} \bar{Z}_{0i}/i = 0, then

    Pr\Big(\liminf_{i\to\infty} M(i) \ge \max_e \mu_e\Big) = 1\,.

6 Exploration Schemes

The results of the previous sections hold under generic choices of the probabilities p_i. Here, we discuss how various particular choices affect the speed of exploiting accumulated information, gathering new information, and adapting to changes in the environment.

6.1 Explore-then-Exploit

One approach to determining an exploration scheme is to minimize the upper bound provided in Corollary 5.1. This gives rise to a scheme where all exploration takes place before any exploitation. Indeed, according to expression (4), for any fixed number of iterations i, it is optimal to let \bar{Z}_{0 i_0} = i_0 (i.e., p_j = 1 for all j \le i_0) and \bar{Z}_{i_0 i} = 0 (i.e., p_j = 0 for all j > i_0). Let U denote the upper bound given by (4). It can be shown that the smallest number of phases i such that U \le \gamma is bounded between two polynomials in 1/\varepsilon, u, and r. 
Moreover, its dependence on the total number of experts r is asymptotically O(r^{1.5}). The main drawback of explore-then-exploit is its inability to adapt to changes in the policy of the environment: since all exploration occurs first, any change that occurs after exploration has ended cannot be learned. Moreover, the choice of the last exploration phase i_0 depends on parameters of the problem that may not be observable. Finally, the scheme requires fixing the tolerance parameters \varepsilon and \gamma a priori, and can only achieve optimality within these tolerances.

6.2 Polynomially Decreasing Exploration

In [2], asymptotic results equivalent to Corollaries 3.1 and 5.3 were established for p_j = 1/j. This choice of exploration probabilities satisfies

    \lim_{i\to\infty} \bar{Z}_{0i} = \infty \qquad \text{and} \qquad \lim_{i\to\infty} \bar{Z}_{0i}/i = 0\,,

so the corollaries apply. We have, however,

    \bar{Z}_{0 i_0} \le \log(i_0) + 1\,.

It follows that the total number of phases required to ensure U \le \gamma grows exponentially in 1/\varepsilon, u, and r. An alternative scheme, leading to polynomial complexity, can be developed by choosing p_j = j^{-\alpha} for some \alpha \in (0,1). In this case,

    \bar{Z}_{0 i_0} \ge \frac{(i_0+1)^{1-\alpha} - 1}{1-\alpha} \qquad \text{and} \qquad \bar{Z}_{0i} \le \frac{i^{1-\alpha}}{1-\alpha}\,.

It follows that the smallest number of phases guaranteeing U \le \gamma is polynomial in 1/\varepsilon, u, and r, with exponents on the order of 1/(1-\alpha).

6.3 Constant-Rate Exploration

The previous exploration schemes have the property that the frequency of exploration vanishes as the number of phases grows. This property is required in order to achieve the asymptotic optimality results described in Corollaries 3.1 and 5.3. However, it also makes the EEE method increasingly slower in tracking changes in the policy of the environment. An alternative approach is to use a constant exploration frequency p_j = \bar{p} \in (0,1).

Constant-rate exploration does not satisfy the conditions of Corollaries 3.1 and 5.3. 
However, for any given tolerance level \varepsilon, the value of \bar{p} can be chosen so that

    Pr\Big(\liminf_{i\to\infty} M(i) \ge \max_e \mu_e - \varepsilon\Big) = 1\,.

Moreover, constant-rate exploration yields complexity results similar to those of the explore-then-exploit scheme. For example, given any tolerance level \varepsilon, if

    p_j = \frac{\varepsilon^2}{8ru^2} \qquad (j = 1, 2, \ldots)\,,

then U \le \gamma holds once the number of phases i is on the order of

    i = O\Big(\frac{r^2u^5}{\varepsilon^5}\,\log\frac{u^2}{\varepsilon^2}\Big)\,.

6.4 Constant Phase Lengths

In all the variants of the EEE method considered so far, the number of stages per phase increases linearly as a function of the number of phases during which the same expert has been followed previously. This growth is used to ensure that, as long as the policy of the environment exhibits some regularity, that regularity is captured by the algorithm. For instance, if that policy is cyclic, then the EEE method correctly learns the long-term value of each expert, regardless of the lengths of the cycles.

For practical purposes, it may be necessary to slow down the growth of phase lengths in order to obtain meaningful results in reasonable time. In this section, we consider the possibility of a constant number L of stages in each phase. Following the same steps used to prove Theorems 3.1, 3.2 and 5.1, we can derive the following results.

Theorem 6.1. If the EEE method is implemented with phases of fixed length L, then for all i_0 \le i and \delta such that

    \bar{Z}_{i_0 i} \le \frac{i\delta^2}{2u^2} - \frac{i_0\delta}{2u}\,,

the following bound holds:

    Pr\Big(M(i) \ge \max_e \min_{i_0+1\le j\le i} M_e(j) - \delta\Big) \ge 1 - 2\exp\Big(-\frac{1}{2i}\Big(\frac{i\delta^2}{2u^2} - \frac{i_0\delta}{2u} - \bar{Z}_{i_0 i}\Big)^2\Big)\,.

We can also characterize the expected difference between the average reward of the EEE method and that of the best expert.

Theorem 6.2. If the EEE method is implemented with phases of fixed length L, then for all i_0 \le i and \delta > 0,

    E\Big[M(i) - \max_e \min_{i_0+1\le j\le i} M_e(j)\Big] \ge -\delta - u\Big(\frac{2u\,i_0}{\delta\, i} + \frac{2\bar{Z}_{i_0 i}}{i}\Big)\,.

Theorem 6.3. 
If the EEE method is implemented with phases of fixed length L \ge 2, then for all \varepsilon > 0,

    Pr\Big(\inf_{j\ge i} M_e(j) < \mu_e - \frac{c}{L^{\eta}} - \varepsilon\Big) \le 2\exp\Big(-\frac{\varepsilon^2 \bar{Z}_{0i}}{4L^2u^2 r}\Big)\,.

An important qualitative difference between fixed-length phases and increasing-length ones is the absence of the number of experts r in the bound given in Theorem 6.2. This implies that, in the explore-then-exploit or constant-rate exploration schemes, the algorithm requires a number of phases that grows only linearly with r to ensure that Pr\big(M(i) \ge \max_e \mu_e - c/L^{\eta} - \varepsilon\big) is close to one. Note, however, that we cannot ensure performance better than \max_e \mu_e - c/L^{\eta}.

References

[1] Auer, P., Cesa-Bianchi, N., Freund, Y. and Schapire, R.E. (1995) Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proc. 36th Annual IEEE Symp. on Foundations of Computer Science, pp. 322-331. Los Alamitos, CA: IEEE Computer Society Press.

[2] de Farias, D.P. and Megiddo, N. (2004) How to Combine Expert (and Novice) Advice when Actions Impact the Environment. In Advances in Neural Information Processing Systems 16, S. Thrun, L. Saul and B. Schölkopf, Eds., Cambridge, MA: MIT Press. http://books.nips.cc/papers/files/nips16/NIPS2003 CN09.pdf

[3] Freund, Y. and Schapire, R.E. (1999) Adaptive game playing using multiplicative weights. Games and Economic Behavior 29:79-103.

[4] Littlestone, N. and Warmuth, M.K. (1994) The weighted majority algorithm. Information and Computation 108(2):212-261.
", "award": [], "sourceid": 2734, "authors": [{"given_name": "Daniela", "family_name": "Farias", "institution": null}, {"given_name": "Nimrod", "family_name": "Megiddo", "institution": null}]}