{"title": "Experts in a Markov Decision Process", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": null, "full_text": " Experts in a Markov Decision Process\n\n\n\n Eyal Even-Dar Sham M. Kakade Yishay Mansour \n Computer Science Computer and Information Science Computer Science\n Tel-Aviv University University of Pennsylvania Tel-Aviv University\n evend@post.tau.ac.il skakade@linc.cis.upenn.edu mansour@post.tau.ac.il\n\n\n\n Abstract\n\n We consider an MDP setting in which the reward function is allowed to\n change during each time step of play (possibly in an adversarial manner),\n yet the dynamics remain fixed. Similar to the experts setting, we address\n the question of how well can an agent do when compared to the reward\n achieved under the best stationary policy over time. We provide efficient\n algorithms, which have regret bounds with no dependence on the size of\n state space. Instead, these bounds depend only on a certain horizon time\n of the process and logarithmically on the number of actions. We also\n show that in the case that the dynamics change over time, the problem\n becomes computationally hard.\n\n\n1 Introduction\n\nThere is an inherent tension between the objectives in an expert setting and those in a re-\ninforcement learning setting. In the experts problem, during every round a learner chooses\none of n decision making experts and incurs the loss of the chosen expert. The setting is\ntypically an adversarial one, where Nature provides the examples to a learner. The stan-\ndard objective here is a myopic, backwards looking one -- in retrospect, we desire that our\nperformance is not much worse than had we chosen any single expert on the sequence of\nexamples provided by Nature. 
In contrast, a reinforcement learning setting typically makes the much stronger assumption of a fixed environment, typically a Markov decision process (MDP), and the forward-looking objective is to maximize some measure of the future reward with respect to this fixed environment.

The motivation of this work is to understand how to efficiently incorporate the benefits of existing experts algorithms into a more adversarial reinforcement learning setting, where certain aspects of the environment could change over time. A naive way to implement an experts algorithm is to simply associate an expert with each fixed policy. The running time of such algorithms is polynomial in the number of experts, and the regret (the difference from the optimal reward) is logarithmic in the number of experts. For our setting the number of policies is huge, namely (#actions)^(#states), which renders the naive experts approach computationally infeasible.

Furthermore, straightforward applications of standard regret algorithms produce regret bounds which are logarithmic in the number of policies, so they have linear dependence on the number of states. We might hope for a more effective regret bound which has no dependence on the size of the state space (which is typically large).

 This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778, by a grant from the Israel Science Foundation and an IBM faculty award. This publication only reflects the authors' views.

The setting we consider is one in which the dynamics of the environment are known to the learner, but the reward function can change over time. We assume that after each time step the learner has complete knowledge of the previous reward functions (over the entire environment), but does not know the future reward functions.

As a motivating example one can consider taking a long road-trip over some period of time T. 
The dynamics, namely the roads, are fixed, but the road conditions may change frequently. By listening to the radio, one can get (effectively) instant updates of the road and traffic conditions. Here, the task is to minimize the cost during the period of time T. Note that at each time step we select one road segment, suffer a certain delay, and need to plan ahead with respect to our current position.

This example is similar to an adversarial shortest path problem considered in Kalai and Vempala [2003]. In fact, Kalai and Vempala [2003] address the computational difficulty of handling a large number of experts under certain linear assumptions on the reward functions. However, their algorithm is not directly applicable to our setting, due to the fact that in our setting decisions must be made with respect to the current state of the agent (and the reward could be changing frequently), while in their setting decisions are only made with respect to a single state.

McMahan et al. [2003] also considered a similar setting: they also assume that the reward function is chosen by an adversary and that the dynamics are fixed. However, they assume that the cost functions come from a finite set (but are not observable) and the goal is to find a min-max solution for the related stochastic game.

In this work, we provide efficient ways to incorporate existing best experts algorithms into the MDP setting. Furthermore, our loss bounds (compared to the best constant policy) have no dependence on the number of states and depend only on a certain horizon time of the environment and log(#actions). There are two sensible extensions of our setting. The first is where we allow Nature to change the dynamics of the environment over time. Here, we show that it becomes NP-Hard to develop a low regret algorithm even for an oblivious adversary. 
The second extension is to consider one in which the agent only observes the rewards for the states it actually visits (a generalization of the multi-arm bandits problem). We leave this interesting direction for future work.

2 The Setting

We consider an MDP with state space S; initial state distribution d_1 over S; action space A; state transition probabilities {P_sa(·)} (here, P_sa is the next-state distribution on taking action a in state s); and a sequence of reward functions r_1, r_2, ..., r_T, where r_t is the (bounded) reward function at time step t mapping S × A into [0, 1].

The goal is to maximize the sum of undiscounted rewards over a T step horizon. We assume the agent has complete knowledge of the transition model P, but at time t, the agent only knows the past reward functions r_1, r_2, ..., r_{t-1}. Hence, an algorithm A is a mapping from S and the previous reward functions r_1, ..., r_{t-1} to a probability distribution over actions, so A(a|s, r_1, ..., r_{t-1}) is the probability of taking action a at time t.

We define the return of an algorithm A as:

    V_{r_1,r_2,...,r_T}(A) = (1/T) E[ Σ_{t=1}^T r_t(s_t, a_t) | d_1, A ]

where a_t ~ A(a|s_t, r_1, ..., r_{t-1}) and s_t is the random variable which represents the state at time t, starting from initial state s_1 ~ d_1 and following actions a_1, a_2, ..., a_{t-1}. Note that we keep track of the expectation and not of a specific trajectory (and our algorithm specifies a distribution over actions at every state and at every time step t).

Ideally, we would like to find an A which achieves a large reward V_{r_1,...,r_T}(A) regardless of how the adversary chooses the reward functions. In general, this of course is not possible, and, as in the standard experts setting, we desire that our algorithm competes favorably against the best fixed stationary policy π(a|s) in hindsight.

3 An MDP Experts Algorithm

3.1 Preliminaries

Before we provide our algorithm a few definitions are in order. 
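As a concrete reference point, the return V_{r_1,...,r_T}(A) defined in Section 2 can be computed exactly for a fixed stationary policy by propagating the state distribution forward, since the dynamics are known. A minimal sketch (the 2-state MDP, reward sequence, and policy below are made-up illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, T = 2, 2, 50

# Hypothetical transition model: P[a, s, s'] = P_sa(s') (rows sum to 1)
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.7, 0.3]]])
rewards = rng.random((T, n_states, n_actions))   # r_t(s, a) in [0, 1]
pi = np.array([[1.0, 0.0],                       # pi(a|s): a fixed stationary policy
               [0.5, 0.5]])

d = np.array([1.0, 0.0])                         # initial state distribution d_1
total = 0.0
for t in range(T):
    total += np.sum(d[:, None] * pi * rewards[t])   # adds E[r_t(s_t, a_t)]
    # next-state distribution: d'(s') = sum_{s,a} d(s) pi(a|s) P_sa(s')
    d = np.einsum('s,sa,asx->x', d, pi, P)
V = total / T                                       # the T-step return V(A)
```

The same propagation works for a time-varying policy by swapping in pi_t at step t, which is how the expectation in the definition of V is tracked for the algorithm A below.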
For every stationary policy π(a|s), we define P^π to be the transition matrix induced by π, where the component [P^π]_{s,s'} is the transition probability from s to s' under π. Also, define d_{π,t} to be the state distribution at time t when following π, i.e.

    d_{π,t} = d_1 (P^π)^t

where we are treating d_1 as a row vector here.

Assumption 1 (Mixing) We assume the transition model over states, as determined by π, has a well defined stationary distribution, which we call d_π. More formally, for every initial state s, d_{π,t} converges to d_π as t tends to infinity, and d_π P^π = d_π. Furthermore, this implies there exists some τ such that for all policies π, and distributions d and d',

    ‖d P^π - d' P^π‖_1 ≤ e^{-1/τ} ‖d - d'‖_1

where ‖x‖_1 denotes the l_1 norm of a vector x. We refer to τ as the mixing time and assume that τ > 1.

The parameter τ provides a bound on the planning horizon timescale, since it implies that every policy achieves close to its average reward in O(τ) steps^1. This parameter also governs how long it effectively takes to switch from one policy to another (after O(τ) steps there is little information in the state distribution about the previous policy).

This assumption allows us to define the average reward of policy π in an MDP with reward function r as:

    η_r(π) = E_{s~d_π, a~π(a|s)}[ r(s, a) ]

and the value, Q_{π,r}(s, a), is defined as

    Q_{π,r}(s, a) = E[ Σ_{t=1}^∞ ( r(s_t, a_t) - η_r(π) ) | s_1 = s, a_1 = a ]

where s_t and a_t are the state and actions at time t, after starting from state s_1 = s, then deviating with an immediate action of a_1 = a, and following π onwards. We slightly abuse notation by writing Q_{π,r}(s, π) = E_{a~π(a|s)}[ Q_{π,r}(s, a) ]. 
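With P known, all of these quantities are directly computable. A minimal numerical sketch (hypothetical 2-state MDP with made-up numbers; the fixed-point iteration starts from Q = 0, which matches the definition of Q_{π,r} as an accumulated sum of E[r - η] terms and converges under the mixing assumption):

```python
import numpy as np

# Hypothetical illustrative MDP: P[a, s, s'] = P_sa(s'), r(s, a), pi(a|s)
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.7, 0.3]]])
r = np.array([[1.0, 0.0],
              [0.2, 0.6]])
pi = np.array([[0.8, 0.2],
               [0.3, 0.7]])

# Transition matrix induced by pi: [P_pi]_{s,s'} = sum_a pi(a|s) P_sa(s')
P_pi = np.einsum('sa,asx->sx', pi, P)

# Stationary distribution d_pi via power iteration (Assumption 1 gives convergence)
d_pi = np.full(2, 0.5)
for _ in range(500):
    d_pi = d_pi @ P_pi

# Average reward eta_r(pi) = E_{s~d_pi, a~pi}[r(s, a)]
eta = np.sum(d_pi[:, None] * pi * r)

# Q_{pi,r}: iterate Q <- r - eta + E_{s'~P_sa}[Q(s', pi)] from Q = 0; the k-th
# iterate is the k-term partial sum of the defining series
Q = np.zeros((2, 2))
for _ in range(500):
    v = np.sum(pi * Q, axis=1)            # Q(s, pi) = E_{a~pi(a|s)}[Q(s, a)]
    Q = r - eta + np.einsum('asx,x->sa', P, v)
```

This is only a sketch for intuition; the analysis below never needs to run such an iteration, it only uses the recurrence that the fixed point satisfies.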
These values satisfy the well known recurrence equation:

    Q_{π,r}(s, a) = r(s, a) - η_r(π) + E_{s'~P_sa}[ Q_{π,r}(s', π) ]     (1)

where Q_{π,r}(s', π) is the next state value (without deviation).

 ^1 If this timescale is unreasonably large for some specific MDP, then one could artificially impose some horizon time and attempt to compete with those policies which mix in this horizon time, as done in Kearns and Singh [1998].

If π* is an optimal policy (with respect to r), then, as usual, we define Q*_r(s, a) to be the value of the optimal policy, i.e. Q*_r(s, a) = Q_{π*,r}(s, a).

We now provide two useful lemmas. It is straightforward to see that the previous assumption implies a rate of convergence to the stationary distribution that is O(τ), for all policies. The following lemma states this more precisely.

Lemma 2 For all policies π,

    ‖d_{π,t} - d_π‖_1 ≤ 2 e^{-t/τ}.

Proof. Since d_π is stationary, we have d_π P^π = d_π, and so

    ‖d_{π,t} - d_π‖_1 = ‖d_{π,t-1} P^π - d_π P^π‖_1 ≤ e^{-1/τ} ‖d_{π,t-1} - d_π‖_1

which implies ‖d_{π,t} - d_π‖_1 ≤ e^{-t/τ} ‖d_1 - d_π‖_1. The claim now follows since, for all distributions d and d', ‖d - d'‖_1 ≤ 2.

The following derives a bound on the Q values as a function of the mixing time.

Lemma 3 For all reward functions r, Q_{π,r}(s, a) ≤ 3τ.

Proof. First, let us bound Q_{π,r}(s, π), where π is used from the first step. For all t, including t = 1, let d_{π,s,t} be the state distribution at time t starting from state s and following π. Hence, we have

    Q_{π,r}(s, π) = Σ_{t=1}^∞ ( E_{s'~d_{π,s,t}, a~π}[ r(s', a) ] - η_r(π) )
                  ≤ Σ_{t=1}^∞ ( E_{s'~d_π, a~π}[ r(s', a) ] - η_r(π) + 2 e^{-t/τ} )
                  = Σ_{t=1}^∞ 2 e^{-t/τ} ≤ ∫_0^∞ 2 e^{-t/τ} dt = 2τ.

Using the recurrence relation for the values, we know Q_{π,r}(s, a) can be at most 1 more than the above. The result follows since 1 + 2τ ≤ 3τ.

3.2 The Algorithm

Now we provide our main result showing how to use any generic experts algorithm in our setting. We associate each state with an experts algorithm, and the expert for each state is responsible for choosing the actions at that state. The immediate question is what loss function should we feed to each expert. 
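Whatever that loss turns out to be, the architecture is one no-regret learner per state. A minimal sketch of such a per-state learner, using a standard multiplicative-weights (Hedge) update as the black box (the class name, learning rate, and loss values are illustrative assumptions, not from the paper):

```python
import numpy as np

class StateExpert:
    """One multiplicative-weights (Hedge) learner, kept per state."""

    def __init__(self, n_actions, eta=0.1):
        self.w = np.ones(n_actions)   # one weight per action
        self.eta = eta                # learning rate (illustrative value)

    def distribution(self):
        return self.w / self.w.sum()  # q_t, i.e. the policy pi_t(.|s) at this state

    def update(self, loss):           # loss: per-action losses for this round
        self.w = self.w * np.exp(-self.eta * np.asarray(loss))

# The time-t policy is read off state by state:
experts = {s: StateExpert(n_actions=2) for s in range(3)}
policy_t = {s: e.distribution() for s, e in experts.items()}

# After this round's per-action losses are revealed at each state, update:
for s, e in experts.items():
    e.update([0.5, 1.0])              # made-up losses; action 0 looks better here
```

Because the weights move multiplicatively by a small factor each round, consecutive distributions q_t and q_{t+1} are close, which is exactly the 'slow change' property required of the black box below.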
It turns out that Q_{π_t, r_t} is appropriate. We now assume that our experts algorithm achieves a performance comparable to the best constant action.

Assumption 4 (Black Box Experts) We assume access to an optimized best expert algorithm which guarantees that for any sequence of loss functions c_1, c_2, ..., c_T over actions A, the algorithm selects a distribution q_t over A (using only the previous loss functions c_1, c_2, ..., c_{t-1}) such that for every action a,

    Σ_{t=1}^T E_{a'~q_t}[ c_t(a') ] ≤ Σ_{t=1}^T c_t(a) + M √(T log|A|),

where c_t(a) ≤ M. Furthermore, we also assume that the decision distributions do not change quickly:

    ‖q_t - q_{t+1}‖_1 ≤ √( log|A| / t ).

These assumptions are satisfied by the multiplicative weights algorithms. For instance, the algorithm in Freund and Schapire [1999] is such that for each decision a, | log q_t(a) - log q_{t+1}(a) | changes by O(√(log|A|/t)), which implies the weaker l_1 condition above.

In our setting, we have an experts algorithm associated with every state s, which at time t is fed the function Q_{π_t, r_t}(s, ·) (treated as a gain; equivalently, the loss fed to the expert is its negation). The above assumption then guarantees that at every state s, for every action a, we have that

    Σ_{t=1}^T Q_{π_t, r_t}(s, π_t) ≥ Σ_{t=1}^T Q_{π_t, r_t}(s, a) - 3τ √(T log|A|)

since the function Q_{π_t, r_t} is bounded by 3τ, and that

    ‖π_t(·|s) - π_{t+1}(·|s)‖_1 ≤ √( log|A| / t ).

As we shall see, it is important that this 'slow change' condition be satisfied. Intuitively, our experts algorithms will be using a similar policy for significantly long periods of time.

Also note that since the experts algorithms are associated with each state, and each of the N experts chooses decisions out of A actions, the algorithm is efficient (polynomial in N and A, assuming that the black box uses a reasonable experts algorithm).

We now state our main theorem.

Theorem 5 Let A be the MDP experts algorithm. Then for all reward functions r_1, r_2, ..., 
r_T and for all stationary policies π,

    V_{r_1,r_2,...,r_T}(A) ≥ V_{r_1,r_2,...,r_T}(π) - 8τ² √( log|A| / T ) - 3τ √( log|A| / T ) - 4τ / T.

As expected, the regret goes to 0 at the rate O(1/√T), as is the case with experts algorithms. Importantly, note that the bound does not depend on the size of the state space.

3.3 The Analysis

The analysis is naturally divided into two parts. First, we analyze the performance of the algorithm in an idealized setting, where the algorithm instantaneously obtains the average reward of its current policy at each step. Then we take into account the slow change of the policies to show that the actual performance is similar to the instantaneous performance.

An Idealized Setting: Let us examine the case in which at each time t, when the algorithm uses π_t, it immediately obtains reward η_{r_t}(π_t). The following theorem compares the performance of our algorithm to that of a fixed constant policy in this setting.

Theorem 6 For all sequences r_1, r_2, ..., r_T, the MDP experts algorithm has the following performance bound. For all π,

    Σ_{t=1}^T η_{r_t}(π_t) ≥ Σ_{t=1}^T η_{r_t}(π) - 3τ √(T log|A|)

where π_1, π_2, ..., π_T is the sequence of policies generated by A in response to r_1, r_2, ..., r_T.

Next we provide a technical lemma, which is a variant of a result in Kakade [2003].

Lemma 7 For all policies π' and π,

    η_r(π') - η_r(π) = E_{s~d_{π'}}[ Q_{π,r}(s, π') - Q_{π,r}(s, π) ].

Proof. Note that by definition of stationarity, if the state distribution is at d_{π'}, then the next state distribution is also d_{π'} if π' is followed. More formally, if s ~ d_{π'}, a ~ π'(a|s), and s' ~ P_sa, then s' ~ d_{π'}. Using this and equation 1, we have:

    E_{s~d_{π'}}[ Q_{π,r}(s, π') ] = E_{s~d_{π'}, a~π'}[ Q_{π,r}(s, a) ]
        = E_{s~d_{π'}, a~π'}[ r(s, a) - η_r(π) + E_{s'~P_sa}[ Q_{π,r}(s', π) ] ]
        = E_{s~d_{π'}, a~π'}[ r(s, a) ] - η_r(π) + E_{s'~d_{π'}}[ Q_{π,r}(s', π) ]
        = η_r(π') - η_r(π) + E_{s~d_{π'}}[ Q_{π,r}(s, π) ].

Rearranging terms leads to the result.

The lemma shows why our choice to feed each experts algorithm Q_{π_t, r_t} was appropriate. Now we complete the proof of the above theorem.

Proof. 
Using the assumed regret in Assumption 4,

    Σ_{t=1}^T ( η_{r_t}(π) - η_{r_t}(π_t) ) = Σ_{t=1}^T E_{s~d_π}[ Q_{π_t,r_t}(s, π) - Q_{π_t,r_t}(s, π_t) ]
        = E_{s~d_π}[ Σ_{t=1}^T ( Q_{π_t,r_t}(s, π) - Q_{π_t,r_t}(s, π_t) ) ]
        ≤ E_{s~d_π}[ 3τ √(T log|A|) ]
        = 3τ √(T log|A|)

where we used the fact that d_π does not depend on the time in the second step.

Taking Mixing Into Account: This subsection relates the values V to the sums of average reward used in the idealized setting.

Theorem 8 For all sequences r_1, r_2, ..., r_T and for all A,

    | V_{r_1,r_2,...,r_T}(A) - (1/T) Σ_{t=1}^T η_{r_t}(π_t) | ≤ 4τ² √( log|A| / T ) + 2τ / T

where π_1, π_2, ..., π_T is the sequence of policies generated by A in response to r_1, r_2, ..., r_T. Since the above holds for all A (including those A which are the constant policy π), combining this with Theorem 6 (once with A and once with π) completes the proof of Theorem 5. We now prove the above.

The following simple lemma is useful and we omit the proof. It shows how close the next state distributions are when following π_t rather than π_{t+1}.

Lemma 9 Let π and π' be such that ‖π(·|s) - π'(·|s)‖_1 ≤ ε for all states s. Then for any state distribution d, we have ‖d P^π - d P^{π'}‖_1 ≤ ε.

Analogous to the definition of d_{π,t}, we define d_{A,t},

    d_{A,t}(s) = Pr[ s_t = s | d_1, A ]

which is the probability that the state at time t is s given that A has been followed.

Lemma 10 Let π_1, π_2, ..., π_T be the sequence of policies generated by A in response to r_1, r_2, ..., r_T. We have

    ‖d_{A,t} - d_{π_t}‖_1 ≤ 2τ² √( log|A| / t ) + 2 e^{-t/τ}.

Proof. Let k ≤ t. Using our experts assumption, it is straightforward to see that the change in the policy between steps k and t is ‖π_k(·|s) - π_t(·|s)‖_1 ≤ 2(t - k) √(log|A|/t). Using this with d_{A,k} = d_{A,k-1} P^{π_k} and d_{π_t} P^{π_t} = d_{π_t}, we have

    ‖d_{A,k} - d_{π_t}‖_1 = ‖d_{A,k-1} P^{π_k} - d_{π_t}‖_1
        ≤ ‖d_{A,k-1} P^{π_t} - d_{π_t}‖_1 + ‖d_{A,k-1} P^{π_k} - d_{A,k-1} P^{π_t}‖_1
        ≤ ‖d_{A,k-1} P^{π_t} - d_{π_t} P^{π_t}‖_1 + 2(t - k) √(log|A|/t)
        ≤ e^{-1/τ} ‖d_{A,k-1} - d_{π_t}‖_1 + 2(t - k) √(log|A|/t)

where we have used the last lemma in the second step and our contraction assumption 1 in the second to last step. 
Recursing on the above equation leads to:

    ‖d_{A,t} - d_{π_t}‖_1 ≤ 2 √(log|A|/t) Σ_{k=1}^{t} (t - k) e^{-(t-k)/τ} + e^{-t/τ} ‖d_1 - d_{π_t}‖_1
        ≤ 2 √(log|A|/t) Σ_{k=1}^{∞} k e^{-k/τ} + 2 e^{-t/τ}.

The sum is bounded by an integral from 0 to ∞, which evaluates to τ².

We are now ready to complete the proof of Theorem 8.

Proof. By definition of V,

    V_{r_1,r_2,...,r_T}(A) = (1/T) Σ_{t=1}^T E_{s~d_{A,t}, a~π_t}[ r_t(s, a) ]
        ≤ (1/T) Σ_{t=1}^T E_{s~d_{π_t}, a~π_t}[ r_t(s, a) ] + (1/T) Σ_{t=1}^T ‖d_{A,t} - d_{π_t}‖_1
        ≤ (1/T) Σ_{t=1}^T η_{r_t}(π_t) + (1/T) Σ_{t=1}^T ( 2τ² √(log|A|/t) + 2 e^{-t/τ} )
        ≤ (1/T) Σ_{t=1}^T η_{r_t}(π_t) + 4τ² √( log|A| / T ) + 2τ / T

where we have bounded the sums by integration in the second to last step. A symmetric argument leads to the result.

4 A More Adversarial Setting

In this section we explore a different setting, the changing dynamics model. Here, in each timestep t, an oblivious adversary is allowed to choose both the reward function r_t and the transition model P_t -- the model that determines the transitions to be used at timestep t. After each timestep, the agent receives complete knowledge of both r_t and P_t. Furthermore, we assume that P_t is deterministic, so we do not concern ourselves with mixing issues. In this setting, we have the following hardness result. We let R_t(M) be the optimal average reward obtained by a stationary policy for times [1, t].

Theorem 11 In the changing dynamics model, if there exists a polynomial time online algorithm (polynomial in the problem parameters) which, for any MDP M, has an expected average reward larger than (0.875 + ε) R_t(M) for some ε > 0 and t, then P = NP.

The following lemma is useful in the proof and uses the fact that it is hard to approximate MAX3SAT within any factor better than 0.875 (Hastad [2001]).

Lemma 12 Computing a stationary policy in the changing dynamics model with average reward larger than (0.875 + ε) R(M), for some ε > 0, is NP-Hard.

Proof: We prove it by reduction from 3-SAT. Suppose that the 3-SAT formula φ has m clauses, C_1, ..., 
C_m, and n variables, x_1, ..., x_n. We reduce it to an MDP with n + 1 states, s_1, ..., s_n, s_{n+1}, two actions 0, 1 in each state, and fixed dynamics for 3m steps, which will be described later. We prove that a policy with average reward p/3 translates to an assignment that satisfies a p fraction of φ, and vice versa. Next we describe the dynamics. Suppose that C_1 is (x_1 ∨ ¬x_2 ∨ x_7) and C_2 is (x_4 ∨ x_1 ∨ x_7). The initial state is s_1; the reward for action 0 is 0 and the agent moves to state s_2, while for action 1 the reward is 1 and it moves to state s_{n+1}. In the second timestep the reward in s_{n+1} is 0 for every action and the agent stays in it; in state s_2, if the agent performs action 0 then it obtains reward 1 and moves to state s_{n+1}, otherwise it obtains reward 0 and moves to state s_7. In the next timestep the reward in s_{n+1} is 0 for every action and the agent moves to s_4; the reward in s_7 is 1 for action 1 and zero for action 0, and the agent moves to s_4 for both actions. The rest of the construction is done identically. Note that the time interval [3(ℓ - 1) + 1, 3ℓ] corresponds to C_ℓ and that the reward obtained in this interval is at most 1. We note that φ has an assignment y_1, ..., y_n, where y_i ∈ {0, 1}, that satisfies a p fraction of it if and only if the policy π which takes action y_i in s_i has average reward p/3. We prove this by looking at each interval separately and noting that if a reward of 1 is obtained then there is an action a that we take in one of the states which has reward 1, and this action corresponds to a satisfying assignment for this clause.

We are now ready to prove Theorem 11.

Proof: In this proof we make a few changes to the construction given in Lemma 12. 
We allow the same clause to repeat a few times, and its dynamics are described in n steps rather than 3, where in the k-th step we move from s_k to s_{k+1} and obtain reward 0, unless the action "satisfies" the chosen clause; if it does, then we obtain an immediate reward of 1, move to s_{n+1}, and stay there for the remaining n - k - 1 steps. After n steps the adversary chooses the next clause uniformly at random. In the analysis we define the n steps related to a clause as an iteration. The strategy defined by the algorithm at the k-th iteration is the probability it assigns to actions 0/1 at each state s just before arriving at s. Note that the strategy at each iteration is actually a stationary policy for M. Thus the strategy in each iteration defines an assignment for the formula. We also note that before an iteration, the expected reward of the optimal stationary policy in the iteration is k/(nm), where k is the maximal number of satisfiable clauses and there are m clauses, and we have E[R(M)] = k/(nm). If we choose an iteration at random, then the strategy defined in that iteration has an expected reward larger than (0.875 + ε) R(M), which implies that we can satisfy more than a 0.875 fraction of the satisfiable clauses, but this is impossible unless P = NP.

References

Y. Freund and R. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29:79-103, 1999.

J. Hastad. Some optimal inapproximability results. J. ACM, 48(4):798-859, 2001.

S. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.

A. Kalai and S. Vempala. Efficient algorithms for on-line optimization. In Proceedings of COLT, 2003.

M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. In Proceedings of ICML, 1998.

H. McMahan, G. Gordon, and A. Blum. Planning in the presence of cost functions controlled by an adversary. 
In Proceedings of the 20th ICML, 2003.
", "award": [], "sourceid": 2730, "authors": [{"given_name": "Eyal", "family_name": "Even-dar", "institution": null}, {"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Yishay", "family_name": "Mansour", "institution": null}]}