{"title": "Local Bandit Approximation for Optimal Learning Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1019, "page_last": 1025, "abstract": null, "full_text": "Local Bandit Approximation for Optimal Learning Problems\n\nMichael O. Duff\n\nAndrew G. Barto\n\nDepartment of Computer Science\n\nUniversity of Massachusetts\n\nAmherst, MA 01003\n\n{duff,barto}@cs.umass.edu\n\nAbstract\n\nIn general, procedures for determining Bayes-optimal adaptive controls for Markov decision processes (MDP's) require a prohibitive amount of computation: the optimal learning problem is intractable. This paper proposes an approximate approach in which bandit processes are used to model, in a certain \"local\" sense, a given MDP. Bandit processes constitute an important subclass of MDP's, and have optimal learning strategies (defined in terms of Gittins indices) that can be computed relatively efficiently. Thus, one scheme for achieving approximately-optimal learning for general MDP's proceeds by taking actions suggested by strategies that are optimal with respect to local bandit models.\n\n1 INTRODUCTION\n\nWatkins [1989] has defined optimal learning as \"the process of collecting and using information during learning in an optimal manner, so that the learner makes the best possible decisions at all stages of learning: learning itself is regarded as a multistage decision process, and learning is optimal if the learner adopts a strategy that will yield the highest possible return from actions over the whole course of learning.\"\n\nFor example, suppose a decision-maker is presented with two biased coins (the decision-maker does not know precisely how the coins are biased) and asked to allocate twenty flips between them so as to maximize the number of observed heads. 
Although the decision-maker is certainly interested in determining which coin has a higher probability of heads, his principal concern is with optimizing performance en route to this determination. An optimal learning strategy typically intersperses \"exploitation\" steps, in which the coin currently thought to have the highest probability of heads is flipped, with \"exploration\" steps in which, on the basis of observed flips, a coin that would be deemed inferior is flipped anyway to further resolve its true potential for turning up heads. The coin-flip problem is a simple example of a (two-armed) bandit problem. A key feature of these problems, and of adaptive control processes in general, is the so-called \"exploration-versus-exploitation trade-off\" (or problem of \"dual control\" [Fel'dbaum, 1965]).\n\nFigure 1: A simple example: dynamics/rewards under (a) action 1 and (b) action 2. (c) The decision problem in hyperstate space.\n\nAs another example, consider the MDP depicted in Figures 1(a) and (b). This is a 2-state/2-action process; transition probabilities label arcs, and quantities within circles denote expected rewards for taking particular actions in particular states. The goal is to assign actions to states so as to maximize, say, the expected infinite horizon discounted sum of rewards (the value function) over all states. 
For the case considered in this paper, the transition probabilities are not known. Given that the process is in some state, one action may be optimal with respect to currently-perceived point-estimates of unknown parameters, while another action may result in greater information gain. Optimal learning is concerned with striking a balance between these two criteria.\n\nWhile reinforcement learning approaches have recognized the dual effects of control, at least in the sense that one must occasionally deviate from a greedy policy to ensure a search of sufficient breadth, many exploration procedures appear not to be motivated by real notions of optimal learning; rather, they aspire to be practical schemes for avoiding unrealistic levels of sampling and search that would be required if one were to strictly adhere to the theoretical sufficient conditions for convergence, namely that all state-action pairs must be considered infinitely many times.\n\nIf one is willing to adopt a Bayesian perspective, then the exploration-versus-exploitation issue has already been resolved, in principle. A solution was recognized by Bellman and Kalaba nearly forty years ago [Bellman & Kalaba, 1959]; their dynamic programming algorithm for computing Bayes-optimal policies begins by regarding \"state\" as an ordered pair, or \"hyperstate,\" $(x, I)$, where $x$ is a point in phase-space (Markov-chain state) and $I$ is the \"information pattern,\" which summarizes past history as it relates to modeling the transitional dynamics of $x$. Computation grows increasingly burdensome with problem size, however, so one is compelled to seek approximate solutions, some of which ignore the effects of information gain entirely. 
In contrast, the approach suggested in this paper explicitly acknowledges that there is an information-gain component to the optimal learning problem; if certain salient aspects of the value of information can be captured, even approximately, then one may be led to a reasonable method for approximating optimal learning policies.\n\nHere is the basic idea behind the approach suggested in this paper: First note that there exists a special class of problems, namely multi-armed bandit problems, in which the information pattern is the sole component of the hyperstate. These special problems have the important feature that their optimal policies can be defined concisely in terms of \"Gittins indices,\" and these indices can be computed in a relatively efficient way. This paper is an attempt to make use of the fact that this special subclass of MDP's has tractably-computable optimal learning strategies. Actions for general MDP's are derived by, first, attaching to a given general MDP in a given state a \"local\" n-armed bandit process that captures some aspect of the value of information gain as well as explicit reward. Indices for the local bandit model can be computed relatively efficiently; the largest index suggests the best action in an optimal-learning sense. The resulting algorithm has a receding-horizon flavor in that a new local-bandit process is constructed after each transition; it makes use of a mean-process model as in some previously-suggested approximation schemes, but here the value of information gain is explicitly taken into account, in part, through index calculations.\n\n2 THE BAYES-BELLMAN APPROACH FOR ADAPTIVE MDP'S\n\nConsider the two-state, two-action process shown in Figure 1, and suppose that one is uncertain about the transition probabilities. 
If the process is in a given state and an action is taken, then the result is that the process either stays in the state it is in or jumps to the other state; one observes a Bernoulli process with unknown parameter, just as in the coin-flip example. But in this case one observes four Bernoulli processes: the result of taking action 1 in state 1, action 1 in state 2, action 2 in state 1, and action 2 in state 2. So if the prior probability for staying in the current state, for each of these state-action pairs, is represented by a beta distribution (the appropriate conjugate family of distributions with regard to Bernoulli sampling; i.e., a Bayesian update of a beta prior remains beta), then one may perform dynamic programming in a space of \"hyperstates,\" in which the components are four pairs of parameters specifying the beta distributions describing the uncertainty in the transition probabilities, along with the Markov chain state: $(x, (\\alpha_1^1, \\beta_1^1), (\\alpha_2^1, \\beta_2^1), (\\alpha_1^2, \\beta_1^2), (\\alpha_2^2, \\beta_2^2))$, where for example $(\\alpha_1^1, \\beta_1^1)$ denotes the parameters specifying the beta distribution that represents uncertainty in the transition probability $P_{11}^1$. Figure 1(c) shows part of the associated decision tree; an optimality equation may be written in terms of the hyperstates. MDP's with more than two states pose no special problem (there exists an appropriate generalization of the beta distribution). What is a problem is what Bellman calls the \"problem of the expanding grid\": the number of hyperstates that must be examined grows exponentially with the horizon.\n\nHow does one proceed if one is constrained to practical amounts of computation and is willing to settle for an approximate solution? 
One could truncate the decision tree at some shorter and more manageable horizon, compute approximate terminal values by replacing the distributions with their means, and proceed with a receding-horizon approach: Starting from the approximate terminal values at the horizon, perform a backward sweep of dynamic programming, computing an optimal policy. Take the initial action of the policy, then shift the entire computational window forward one level and repeat. One can imagine a sort of limiting, degenerate version of this receding horizon approach in which the horizon is zero; that is, use the means of the current distributions to calculate an optimal policy, take an \"optimal\" action, observe a transition, perform a Bayesian modification of the prior, and repeat. This (certainty-equivalence) heuristic was suggested by [Cozzolino et al., 1965], and has recently reappeared in [Dayan & Sejnowski, 1996]. However, as was noted in [Cozzolino et al., 1965], \"... the trade-off between immediate gain and information does not exist in this heuristic. There is no mechanism which explicitly forces unexplored policies to be observed in early stages. Therefore, if it should happen that there is some very good policy which a priori seemed quite bad, it is entirely possible that this heuristic will never provide the information needed to recognize the policy as being better than originally thought .. 
.\" This comment and others seem to refer to what is now regarded as a problem of \"identifiability\" associated with certainty-equivalence controllers in which a closed-loop system evolves identically for both true and false values of the unknown parameters; that is, certainty-equivalence control may make some of the unknown parameters invisible to the identification process and lead one to repeatedly choose the wrong action (see [Borkar & Varaiya, 1979], and also Watkins' discussion of \"metastable policies\" in [Watkins, 1989]).\n\n3 BANDIT PROBLEMS AND INDEX COMPUTATION\n\nOne basic version of the bandit problem may be described as follows: There are some number of statistically independent reward processes: Markov chains with an imposed reward structure associated with the chain's arcs. At each discrete time-step, a decision-maker selects one of these processes to activate. The activated process yields an immediate reward and then changes state. The other processes remain frozen and yield no reward. The goal is to splice together the individual reward streams into one sequence having maximal expected discounted value.\n\nThe special Cartesian structure of the bandit problem turns out to imply that there are functions that map process-states to scalars (or \"indices\"), such that optimal policies consist simply of activating the task with the largest index. Consider one of the reward processes, let $S$ be its state space, and let $\\mathcal{B}$ be the set of all subsets of $S$. Suppose that $x(k)$ is the state of the process at time $k$ and, for $B \\in \\mathcal{B}$, let $\\tau(B)$ be the number of transitions until the process first enters the set $B$. Let $\\nu(i; B)$ be the expected discounted reward per unit of discounted time starting from state $i$ until the stopping time $\\tau(B)$:\n\n$$\\nu(i; B) = \\frac{E\\left[\\sum_{k=0}^{\\tau(B)-1} \\gamma^k R(x(k)) \\mid x(0) = i\\right]}{E\\left[\\sum_{k=0}^{\\tau(B)-1} \\gamma^k \\mid x(0) = i\\right]},$$\n\nwhere $\\gamma$ denotes the discount factor and $R(x(k))$ the reward received at time $k$. Then the Gittins index of state $i$ for the process under consideration is\n\n$$\\nu(i) = \\max_{B \\in \\mathcal{B}} \\nu(i; B). \\tag{1}$$\n\n[Gittins & Jones, 1979] shows that the indices may be obtained by solving a set of functional equations. Other algorithms that have been suggested include those by Beale (see the discussion section following [Gittins & Jones, 1979]), [Robinson, 1981], [Varaiya et al., 1985], and [Katehakis & Veinott, 1987]. [Duff, 1995] provides a reinforcement learning approach that gradually learns indices through online/model-free interaction with bandit processes. The details of these algorithms would require more space than is available here. The algorithm proposed in the next section makes use of the approach of [Varaiya et al., 1985].\n\n4 LOCAL BANDIT APPROXIMATION AND AN APPROXIMATELY-OPTIMAL LEARNING ALGORITHM\n\nThe most obvious difference between the optimal learning problem for an MDP and the multi-armed bandit problem is that the MDP has a phase-space component (Markov chain state) to its hyperstate. A first step in bandit-based approximation, then, proceeds by \"removing\" this phase-space component. This can be achieved by viewing the process on a time-scale defined by the recurrence time of a given state. That is, suppose the process is in some state, $z$. In response to some given action, two things can happen: (1) The process can transition, in one time-step, into $z$ again with some immediate reward, or (2) The process can transition into some state that is not $z$ and experience some \"sojourn\" path of states and rewards before returning to $z$. On a time-scale defined by sojourn-time, one can view the process in a sort of \"state-$z$-centric\" way (if state $z$ never recurs, then the sojourn-time is \"infinite\" and there is no value-of-information component of the local bandit model to acknowledge). 
From this perspective, the process appears to have only one state, and is semi-Markov; that is, the time between transitions is a random variable. Some other action taken in state $z$ would give rise to a different sojourn reward process. For both processes (sojourn-processes initiated by different actions applied to state $z$), the sojourn path/reward will depend upon the policy for states encountered along sojourn paths, but suppose that this policy is fixed for the moment. By viewing the original process on a time-scale of sojourn-time, one has effectively collapsed the phase-space component of the hyperstate. The new process has one state, $z$, and the problem of choosing an action, given that one is uncertain about the transition probabilities, presents itself as a semi-Markov bandit problem.\n\nThe preceding discussion suggests an algorithm for approximately-optimal learning:\n\n(0) Suppose that the uncertainty in transition probabilities is expressed in terms of sufficient statistics $\\langle \\alpha, \\beta \\rangle$, and that the process is currently in state $z_t$.\n\n(1) Compute the optimal policy for the mean process, $\\pi^*[\\bar{P}(\\alpha, \\beta)]$; that is, compute the policy that is optimal for the MDP whose transition probabilities are taken to be the mean values associated with $\\langle \\alpha, \\beta \\rangle$; this defines a nominal (certainty-equivalent) policy for sojourn states.\n\n(2) Construct a local bandit model at state $z_t$; that is, the decision-maker must choose between some number (the number of admissible actions) of sojourn reward processes; this is a semi-Markov multi-armed bandit problem.\n\n(3) Compute the Gittins indices for the local bandit model.\n\n(4) Take the action with the largest index.\n\n(5) Observe a transition to $z_{t+1}$ in the underlying MDP.\n\n(6) Update $\\langle \\alpha, \\beta \\rangle$ accordingly (Bayes update). 
\n(7) Go to step (1).\n\nThe local semi-Markov bandit process associated with state 1 / action 1 for the 2-state example MDP of Figure 1 is shown in Figure 2. The sufficient statistics for $P_{11}^1$ are denoted by $(\\alpha, \\beta)$, and $\\frac{\\alpha}{\\alpha+\\beta}$ and $\\frac{\\beta}{\\alpha+\\beta}$ are the expected probabilities for transition into state 1 and state 2, respectively. $\\tau$ and R121 are random variables signifying sojourn time and reward.\n\nThe goal is to compute the index for the root information-state labeled $\\langle \\alpha, \\beta \\rangle$ and to compare it with that computed for a similar diagram associated with the bandit process for taking action 2. The approximately-optimal action is suggested by the process having the largest root-node index. Indices for semi-Markov bandits can be obtained by considering the bandits as Markov, but performing the optimization in Equation 1 over a restricted set of stopping times. The algorithm suggested in [Tsitsiklis, 1993], which in turn makes use of methods described in [Varaiya et al., 1985], proceeds by \"reducing\" the graph through a sequence of node-excisions and modifications of rewards and transition probabilities; [Duff, 1997] details how these steps may be realized for the special semi-Markov processes associated with problems of optimal learning.\n\nFigure 2: A local semi-Markov bandit process associated with state 1 / action 1 for the 2-state example MDP of Figure 1.\n\n5 DISCUSSION\n\nIn summary, this paper has presented the problem of optimal learning, in which a decision-maker is obliged to enjoy or endure the consequences of its actions in quest of the asymptotically-learned optimal policy. A Bayesian formulation of the problem leads to a clear concept of a solution whose computation, however, appears to entail an examination of an intractably-large number of hyperstates. 
This paper has suggested extending the Gittins index approach (which applies with great power and elegance to the special class of multi-armed bandit processes) to general adaptive MDP's. The hope has been that if certain salient features of the value of information could be captured, even approximately, then one could be led to a reasonable method for avoiding certain defects of certainty-equivalence approaches (problems with identifiability, \"metastability\"). Obviously, positive evidence, in the form of empirical results from simulation experiments, would lend support to these ideas; work along these lines is underway.\n\nLocal bandit approximation is but one approximate computational approach for problems of optimal learning and dual control. Most prominent in the literature of control theory is the \"wide-sense\" approach of [Bar-Shalom & Tse, 1976], which utilizes local quadratic approximations about nominal state/control trajectories. For certain problems, this method has demonstrated superior performance compared to a certainty-equivalence approach, but it is computationally very intensive and unwieldy, particularly for problems with controller dimension greater than one.\n\nOne could revert to the view of the bandit problem, or general adaptive MDP, as simply a very large MDP defined over hyperstates, and then consider a somewhat direct approach in which one performs approximate dynamic programming with function approximation over this domain; details of function-approximation, feature-selection, and \"training\" all become important design issues. [Duff, 1997] provides further discussion of these topics, as well as a consideration of action-elimination procedures [MacQueen, 1966] that could result in substantial pruning of the hyperstate decision tree. 
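As a concrete illustration of the index computations referred to in Section 3, the following is a minimal sketch (not the authors' implementation) of a largest-index-first calculation in the spirit of [Varaiya et al., 1985] for an ordinary finite Markov reward process. It assumes known transition probabilities P, rewards r that depend only on the current state, and a discount factor gamma; the function name gittins_indices and the fixed-point solver are hypothetical choices made for this example, and the semi-Markov local bandits of Section 4 would need the further modifications described in [Duff, 1997].

```python
def gittins_indices(P, r, gamma, tol=1e-12):
    '''Gittins indices of a finite Markov reward process (illustrative sketch).

    P[i][j] is the probability of moving from state i to state j, r[i] is the
    expected reward earned on leaving state i, and 0 < gamma < 1 is the
    discount factor. All three are assumptions of this example.
    '''
    n = len(r)
    index = [None] * n
    cont = set()  # continuation set: states whose index is already known

    def discounted_totals(payoff):
        # Fixed-point iteration for v = payoff + gamma * P_C v, where P_C keeps
        # only transitions INTO the continuation set; entering any other state
        # stops the process, so those transitions contribute nothing further.
        v = list(payoff)
        while True:
            new = [payoff[i] + gamma * sum(P[i][j] * v[j] for j in cont)
                   for i in range(n)]
            if max(abs(a - b) for a, b in zip(new, v)) < tol:
                return new
            v = new

    while len(cont) < n:
        num = discounted_totals(r)           # expected discounted reward
        den = discounted_totals([1.0] * n)   # expected discounted time
        # The remaining state with the largest ratio has that ratio as its index.
        best = max((i for i in range(n) if i not in cont),
                   key=lambda i: num[i] / den[i])
        index[best] = num[best] / den[best]
        cont.add(best)
    return index
```

For a two-state chain in which state 0 pays nothing and moves to an absorbing state 1 that pays 1 per step, with gamma = 0.5, this sketch assigns index 1 to state 1 and index 0.5 to state 0, which agrees with the supremum of the ratio in Equation 1 over stopping times.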
\n\nAcknowledgements\n\nThis research was supported, in part, by the National Science Foundation under grant ECS-9214866 to Andrew G. Barto.\n\nReferences\n\nBar-Shalom, Y. & Tse, E. (1976) Caution, probing and the value of information in the control of uncertain systems. Ann. Econ. Soc. Meas. 5:323-337.\n\nBellman, R. & Kalaba, R. (1959) On adaptive control processes. IRE Trans. 4:1-9.\n\nBorkar, V. & Varaiya, P.P. (1979) Adaptive control of Markov chains I: finite parameter set. IEEE Trans. Auto. Control 24:953-958.\n\nCozzolino, J.M., Gonzalez-Zubieta, R., & Miller, R.L. (1965) Markov decision processes with uncertain transition probabilities. Tech. Rpt. 11, Operations Research Center, MIT.\n\nDayan, P. & Sejnowski, T. (1996) Exploration Bonuses and Dual Control. Machine Learning (in press).\n\nDuff, M.O. (1995) Q-learning for bandit problems. In Machine Learning: Proceedings of the Twelfth International Conference on Machine Learning, pp. 209-217.\n\nDuff, M.O. (1997) Approximate computational methods for optimal learning and dual control. Technical Report, Department of Computer Science, Univ. of Massachusetts, Amherst.\n\nFel'dbaum, A. (1965) Optimal Control Systems. Academic Press.\n\nGittins, J.C. & Jones, D. (1979) Bandit processes and dynamic allocation indices (with discussion). J. R. Statist. Soc. B 41:148-177.\n\nKatehakis, M.H. & Veinott, A.F. (1987) The multi-armed bandit problem: decomposition and computation. Math. OR 12:262-268.\n\nMacQueen, J. (1966) A modified dynamic programming method for Markov decision problems. J. Math. Anal. Appl. 14:38-43.\n\nRobinson, D.R. (1981) Algorithms for evaluating the dynamic allocation index. Research Report No. 80/DRR/4, Manchester-Sheffield School of Probability and Statistics.\n\nTsitsiklis, J. (1993) A short proof of the Gittins index theorem. Proc. 32nd Conf. Dec. 
and Control: 389-390.\n\nVaraiya, P.P., Walrand, J.C., & Buyukkoc, C. (1985) Extensions of the multiarmed bandit problem: the discounted case. IEEE Trans. Auto. Control 30(5):426-439.\n\nWatkins, C. (1989) Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University.", "award": [], "sourceid": 1230, "authors": [{"given_name": "Michael", "family_name": "Duff", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}