{"title": "Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 1505, "page_last": 1512, "abstract": "We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates: a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time $T$ is within $C(P)\\log T$ of the reward obtained by the optimal policy, where $C(P)$ is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities and the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP.", "full_text": "Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs\n\nAmbuj Tewari\n\nComputer Science Division\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720, USA\n\nambuj@cs.berkeley.edu\n\nPeter L. Bartlett\n\nComputer Science Division and Department of Statistics\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720, USA\n\nbartlett@cs.berkeley.edu\n\nAbstract\n\nWe present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. 
It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P) log T of the reward obtained by the optimal policy, where C(P) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities, the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP.\n\n1 Introduction\n\nDecision making under uncertainty is one of the principal concerns of Artificial Intelligence and Machine Learning. Assuming that the decision maker or agent is able to perfectly observe its own state, uncertain systems are often modeled as Markov decision processes (MDPs). Given complete knowledge of the parameters of an MDP, there are standard algorithms to compute optimal policies, i.e., rules of behavior such that some performance criterion is maximized. A frequent criticism of these algorithms is that they assume an explicit description of the MDP which is seldom available. The parameters constituting the description are themselves estimated by simulation or experiment and are thus not known with complete reliability. Taking this into account brings us to the well-known exploration vs. exploitation trade-off. On one hand, we would like to explore the system as well as we can to obtain reliable knowledge about the system parameters. 
On the other hand, if we keep exploring and never exploit the knowledge accumulated, we will not behave optimally.\n\nGiven a policy \u03c0, how do we measure its ability to handle this trade-off? Suppose the agent gets a numerical reward at each time step and we measure performance by the accumulated reward over time. Then, a meaningful quantity to evaluate the policy \u03c0 is its regret over time. To understand what regret means, consider an omniscient agent who knows all parameters of the MDP accurately and behaves optimally. Let V_T be the expected reward obtained by this agent up to time T. Let V^\u03c0_T denote the corresponding quantity for \u03c0. Then the regret R^\u03c0_T = V_T \u2212 V^\u03c0_T measures how much \u03c0 is hurt due to its incomplete knowledge of the MDP up to time T. If we can show that the regret R^\u03c0_T grows slowly with time T, for all MDPs in a sufficiently big class, then we can safely conclude that \u03c0 is making a judicious trade-off between exploration and exploitation. It is rather remarkable that for this notion of regret, logarithmic bounds have been proved in the literature [1,2]. This means that there are policies \u03c0 with R^\u03c0_T = O(log T). Thus the per-step regret R^\u03c0_T/T goes to zero very quickly.\n\nBurnetas and Katehakis [1] proved that for any policy \u03c0 (satisfying certain reasonable assumptions) R^\u03c0_T \u2265 C_B(P) log T where they identified the constant C_B(P). This constant depends on the transition function P of the MDP\u00b9. They also gave an algorithm (we call it BKA) that achieves this rate and is therefore optimal in a very strong sense. However, besides assuming that the MDP is irreducible (see Assumption 1 below) they assumed that the support sets of the transition distributions p_i(a) are known for all state-action pairs. 
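To make the quantities V_T, V^\u03c0_T and R^\u03c0_T concrete, here is a minimal sketch on a hypothetical two-state, two-action MDP (the MDP and all numbers below are our own illustration, not from the paper): both values are computed by backward induction over a finite horizon, and the regret is their difference.

```python
import numpy as np

# A toy 2-state, 2-action MDP (hypothetical example).
# P[i][a] is the next-state distribution p_i(a); r[i][a] is the reward r(i, a).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.9, 0.1], [0.1, 0.9]]])
r = np.array([[0.0, 0.0], [1.0, 1.0]])  # reward 1 for being in state 1

def value_T(T, policy=None):
    """Expected total reward over T steps via backward induction.

    policy=None computes the optimal value V_T (the omniscient agent);
    otherwise V^pi_T for a fixed deterministic policy (array of actions)."""
    v = np.zeros(2)
    for _ in range(T):
        q = r + P @ v  # q[i, a] = r(i, a) + <p_i(a), v>
        v = q.max(axis=1) if policy is None else q[np.arange(2), policy]
    return v

T = 3
V_opt = value_T(T)                           # omniscient agent
V_pi = value_T(T, policy=np.array([0, 0]))   # a policy that always plays action 0
regret = V_opt[0] - V_pi[0]                  # regret from start state 0
```

Here the fixed policy never moves toward the rewarding state, so its regret grows with the horizon; a learning algorithm such as OLP aims to keep this gap logarithmic in T.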
In this paper, we not only get rid of this assumption but our optimistic linear programming (OLP) algorithm is also computationally simpler. At each step, OLP considers certain parameters in the vicinity of the estimates. Like BKA, OLP makes optimistic choices among these. But now, making these choices only involves solving linear programs (LPs) to maximize linear functions over L1 balls. BKA instead required solving non-linear (though convex) programs due to the use of KL-divergence. Another benefit of using the L1 distance is that it greatly simplifies a significant part of the proof. The price we pay for these advantages is that the regret of OLP is C(P) log T asymptotically, for a constant C(P) \u2265 C_B(P). We should note here that a number of algorithms in the literature have been inspired by the \u201coptimism in the face of uncertainty\u201d principle [3]\u2013[7].\n\nThe algorithm of Auer and Ortner (we refer to it as AOA) is another logarithmic regret algorithm for irreducible\u00b2 MDPs. AOA does not solve an optimization problem at every time step but only when a confidence interval is halved. But then the optimization problem they solve is more complicated because they find a policy to use in the next few time steps by optimizing over a set of MDPs. The regret of AOA is C_A(P) log T where\n\nC_A(P) = c |S|^5 |A| T_w(P) \u03ba(P)^2 / \u2206^*(P)^2 ,    (1)\n\nfor some universal constant c. Here |S|, |A| denote the state and action space size, T_w(P) is the worst case hitting time over deterministic policies (see Eqn. (12)) and \u2206^*(P) is the difference between the long term average return of the best policy and that of the next best policy. The constant \u03ba(P) is also defined in terms of hitting times. Under Auer and Ortner\u2019s assumption of bounded rewards, we can show that the constant for OLP satisfies\n\nC(P) \u2264 2|S||A| T(P)^2 / \u03a6^*(P) .    (2)\n\nHere T(P) is the hitting time of an optimal policy and is therefore necessarily smaller than T_w(P). We get rid of the dependence on \u03ba(P) while replacing T_w(P) with T(P)^2. Most importantly, we significantly improve the dependence on the state space size. The constant \u03a6^*(P) can roughly be thought of as the minimum (over states) difference between the quality of the best and the second best action (see Eqn. (9)). The constants \u2206^*(P) and \u03a6^*(P) are similar though not directly comparable. Nevertheless, note that C(P) depends inversely on \u03a6^*(P) not \u03a6^*(P)^2.\n\n2 Preliminaries\n\nConsider an MDP (S, A, R, P) where S is the set of states, A = \u222a_{i\u2208S} A(i) is the set of actions (A(i) being the actions available in state i), R = {r(i, a)}_{i\u2208S, a\u2208A(i)} are the rewards and P = {p_{i,j}(a)}_{i,j\u2208S, a\u2208A(i)} are the transition probabilities. For simplicity of analysis, we assume that the rewards are known to us beforehand. We do not assume that we know the support sets of the distributions p_i(a).\n\nThe history \u03c3_t up to time t is a sequence i_0, k_0, . . . , i_{t\u22121}, k_{t\u22121}, i_t such that k_s \u2208 A(i_s) for all s < t. A policy \u03c0 is a sequence {\u03c0_t} of probability distributions on A given \u03c3_t such that \u03c0_t(A(s_t)|\u03c3_t) = 1 where s_t denotes the random variable representing the state at time t. The set of all policies is denoted by \u03a0. A deterministic policy is simply a function \u00b5 : S \u2192 A such that \u00b5(i) \u2208 A(i). Denote the set of deterministic policies by \u03a0_D. 
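These objects can be represented concretely with plain arrays. The sketch below (ours, on a hypothetical two-state MDP, not from the paper) builds the transition matrix P^\u00b5 induced by a deterministic policy \u00b5 and computes its stationary distribution, which is what underlies the long term average rewards defined below.

```python
import numpy as np

# A hypothetical 2-state, 2-action MDP: p[i, a] is the next-state
# distribution p_i(a); r[i, a] is the reward r(i, a).
p = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.9, 0.1], [0.1, 0.9]]])
r = np.array([[0.0, 0.0], [1.0, 1.0]])

# A deterministic policy mu : S -> A is just an array of actions; it
# induces the transition matrix P^mu = (p_{i,j}(mu(i)))_{i,j}.
mu = np.array([1, 1])
P_mu = p[np.arange(2), mu]

# Under irreducibility (Assumption 1), P^mu has a unique stationary
# distribution pi_mu, found here by solving pi (P^mu - I) = 0, sum(pi) = 1.
A_lin = np.vstack([P_mu.T - np.eye(2), np.ones(2)])
b_lin = np.array([0.0, 0.0, 1.0])
pi_mu, *_ = np.linalg.lstsq(A_lin, b_lin, rcond=None)

# Long-run average reward of mu: sum_i pi_mu(i) * r(i, mu(i)).
gain = pi_mu @ r[np.arange(2), mu]
```

The least-squares solve is exact here because the stationarity equations plus the normalization constraint form a consistent system.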
If D is a subset of A, let \u03a0(D) denote the set of policies that take actions in D. Probability and expectation under a policy \u03c0, transition function P and starting state i_0 will be denoted by P^{\u03c0,P}_{i_0} and E^{\u03c0,P}_{i_0} respectively. Given history \u03c3_t, let N_t(i), N_t(i, a) and N_t(i, a, j) denote the number of occurrences of the state i, the pair (i, a) and the triplet (i, a, j) respectively in \u03c3_t.\n\n\u00b9Notation for MDP parameters is defined in Section 2 below.\n\n\u00b2Auer & Ortner prove claims for unichain MDPs but their usage seems non-standard. The MDPs they call unichain are called irreducible in standard textbooks (for example, see [9, p. 348]).\n\nWe make the following irreducibility assumption regarding the MDP.\n\nAssumption 1. For all \u00b5 \u2208 \u03a0_D, the transition matrix P^\u00b5 = (p_{i,j}(\u00b5(i)))_{i,j\u2208S} is irreducible (i.e. it is possible to reach any state from any other state).\n\nConsider the rewards accumulated by the policy \u03c0 before time T,\n\nV^\u03c0_T(i_0, P) := E^{\u03c0,P}_{i_0} [ \u03a3_{t=0}^{T\u22121} r(s_t, a_t) ] ,\n\nwhere a_t is the random variable representing the action taken by \u03c0 at time t. Let V_T(i_0, P) be the maximum possible sum of expected rewards before time T,\n\nV_T(i_0, P) := sup_{\u03c0\u2208\u03a0} V^\u03c0_T(i_0, P) .\n\nThe regret of a policy \u03c0 at time T is a measure of how well the expected rewards of \u03c0 compare with the above quantity,\n\nR^\u03c0_T(i_0, P) := V_T(i_0, P) \u2212 V^\u03c0_T(i_0, P) .\n\nDefine the long term average reward of a policy \u03c0 as\n\n\u03bb^\u03c0(i_0, P) := liminf_{T\u2192\u221e} V^\u03c0_T(i_0, P)/T .\n\nUnder assumption 1, the above limit exists and is independent of the starting state i_0. Given a restricted set D \u2286 A of actions, the gain or the best long term average performance is\n\n\u03bb(P, D) := sup_{\u03c0\u2208\u03a0(D)} \u03bb^\u03c0(i_0, P) .\n\nAs a shorthand, define \u03bb^*(P) := \u03bb(P, A).\n\n2.1 Optimality Equations\n\nA restricted problem (P, D) is obtained from the original MDP by choosing subsets D(i) \u2286 A(i) and setting D = \u222a_{i\u2208S} D(i). The transition and reward functions of the restricted problems are simply the restrictions of P and r to D. Assumption 1 implies that there is a bias vector h(P, D) = {h(i; P, D)}_{i\u2208S} such that the gain \u03bb(P, D) and bias h(P, D) are the unique solutions to the average reward optimality equations:\n\n\u2200i \u2208 S, \u03bb(P, D) + h(i; P, D) = max_{a\u2208D(i)} [r(i, a) + \u27e8p_i(a), h(P, D)\u27e9] .    (3)\n\nWe will use h^*(P) to denote h(P, A). Also, denote the infinity norm ||h^*(P)||_\u221e by H^*(P). Note that if h^*(P) is a solution to the optimality equations and e is the vector of ones, then h^*(P) + ce is also a solution for any scalar c. We can therefore assume \u2203i^* \u2208 S, h^*(i^*; P) = 0 without any loss of generality.\n\nIt will be convenient to have a way to denote the quantity inside the \u2018max\u2019 that appears in the optimality equations. 
Accordingly, define\n\nL(i, a, p, h) := r(i, a) + \u27e8p, h\u27e9 ,\n\nL^*(i; P, D) := max_{a\u2208D(i)} L(i, a, p_i(a), h(P, D)) .\n\nTo measure the degree of suboptimality of actions available at a state, define\n\n\u03c6^*(i, a; P) = L^*(i; P, A) \u2212 L(i, a, p_i(a), h^*(P)) .\n\nNote that the optimal actions are precisely those for which the above quantity is zero. Define\n\nO(i; P, D) := {a \u2208 D(i) : \u03c6^*(i, a; P) = 0} ,\n\nO(P, D) := \u03a0_{i\u2208S} O(i; P, D) .\n\nAny policy in O(P, D) is an optimal policy, i.e.,\n\n\u2200\u00b5 \u2208 O(P, D), \u03bb^\u00b5(P) = \u03bb(P, D) .\n\n2.2 Critical pairs\n\nFrom now on, \u2206_+ will denote the probability simplex of dimension determined by context. For a suboptimal action a \u2209 O(i; P, A), the following set contains probability distributions q such that if p_i(a) is changed to q, the quality of action a comes within \u03b5 of an optimal action. Thus, q makes a look almost optimal:\n\nMakeOpt(i, a; P, \u03b5) := {q \u2208 \u2206_+ : L(i, a, q, h^*(P)) \u2265 L^*(i; P, A) \u2212 \u03b5} .    (4)\n\nThose suboptimal state-action pairs for which MakeOpt is never empty, no matter how small \u03b5 is, play a crucial role in determining the regret. We call these critical state-action pairs,\n\nCrit(P) := {(i, a) : a \u2209 O(i; P, A) \u2227 (\u2200\u03b5 > 0, MakeOpt(i, a; P, \u03b5) \u2260 \u2205)} .    (5)\n\nDefine the function,\n\nJ_{i,a}(p; P, \u03b5) := inf{||p \u2212 q||_1^2 : q \u2208 MakeOpt(i, a; P, \u03b5)} .    (6)\n\nTo make sense of this definition, consider p = p_i(a). 
The above infimum is then the least distance (in the L1 sense) one has to move away from p_i(a) to make the suboptimal action a look \u03b5-optimal. Taking the limit of this as \u03b5 decreases gives us a quantity that also plays a crucial role in determining the regret,\n\nK(i, a; P) := lim_{\u03b5\u21920} J_{i,a}(p_i(a); P, \u03b5) .    (7)\n\nIntuitively, if K(i, a; P) is small, it is easy to confuse a suboptimal action with an optimal one and so it should be difficult to achieve small regret. The constant that multiplies log T in the regret bound of our algorithm OLP (see Algorithm 1 and Theorem 4 below) is the following:\n\nC(P) := \u03a3_{(i,a)\u2208Crit(P)} 2\u03c6^*(i, a; P)/K(i, a; P) .    (8)\n\nThis definition might look a bit hard to interpret, so we give an upper bound on C(P) just in terms of the infinity norm H^*(P) of the bias and \u03a6^*(P). This latter quantity is defined below to be the minimum degree of suboptimality of a critical action.\n\nProposition 2. Suppose A(i) = A for all i \u2208 S. Define\n\n\u03a6^*(P) := min_{(i,a)\u2208Crit(P)} \u03c6^*(i, a; P) .    (9)\n\nThen, for any P,\n\nC(P) \u2264 2|S||A| H^*(P)^2 / \u03a6^*(P) .\n\nSee the appendix for a proof.\n\n2.3 Hitting times\n\nIt turns out that we can bound the infinity norm of the bias in terms of the hitting time of an optimal policy. For any policy \u00b5 define its hitting time to be the worst case expected time to reach one state from another:\n\nT_\u00b5(P) := max_{i\u2260j} E^{\u00b5,P}_j [min{t > 0 : s_t = i}] .    (10)\n\nThe following constant is the minimum hitting time among optimal policies:\n\nT(P) := min_{\u00b5\u2208O(P,A)} T_\u00b5(P) .    (11)\n\nThe following constant is defined just for comparison with results in [2]. 
It is the worst case hitting time over all policies:\n\nT_w(P) := max_{\u00b5\u2208\u03a0_D} T_\u00b5(P) .    (12)\n\nWe can now bound C(P) just in terms of the hitting time T(P) and \u03a6^*(P).\n\nProposition 3. Suppose A(i) = A for all i \u2208 S and that r(i, a) \u2208 [0, 1] for all i \u2208 S, a \u2208 A. Then for any P,\n\nC(P) \u2264 2|S||A| T(P)^2 / \u03a6^*(P) .\n\nSee the appendix for a proof.\n\n3 The optimistic LP algorithm and its regret bound\n\nAlgorithm 1 Optimistic Linear Programming\n\n1: for t = 0, 1, 2, . . . do\n2:   s_t \u2190 current state\n3:   \u25b7 Compute solution for \u201cempirical MDP\u201d excluding \u201cundersampled\u201d actions\n4:   \u2200i, j \u2208 S, a \u2208 A(i), \u02c6p^t_{i,j}(a) \u2190 (1 + N_t(i, a, j))/(|A(i)| + N_t(i, a))\n5:   \u2200i \u2208 S, D_t(i) \u2190 {a \u2208 A(i) : N_t(i, a) \u2265 log^2 N_t(i)}\n6:   \u02c6h_t, \u02c6\u03bb_t \u2190 solution of the optimality equations (3) with P = \u02c6P^t, D = D_t\n7:   \u25b7 Compute indices of all actions for the current state\n8:   \u2200a \u2208 A(s_t), U_t(s_t, a) \u2190 sup_{q\u2208\u2206_+} {r(s_t, a) + \u27e8q, \u02c6h_t\u27e9 : ||\u02c6p^t_{s_t}(a) \u2212 q||_1 \u2264 \u221a(2 log t/N_t(s_t, a))}\n9:   \u25b7 Optimal actions (for the current problem) that are about to become \u201cundersampled\u201d\n10:  \u0393^1_t \u2190 {a \u2208 O(s_t; \u02c6P^t, D_t) : N_t(s_t, a) < log^2(N_t(s_t) + 1)}\n11:  \u25b7 The index maximizing actions\n12:  \u0393^2_t \u2190 arg max_{a\u2208A(s_t)} U_t(s_t, a)\n13:  if \u0393^1_t = O(s_t; \u02c6P^t, D_t) then\n14:    a_t \u2190 any action in \u0393^1_t\n15:  else\n16:    a_t \u2190 any action in \u0393^2_t\n17:  end if\n18: end for\n\nAlgorithm 1 is the Optimistic Linear Programming algorithm. It is inspired by the algorithm of Burnetas and Katehakis [1] but uses L1 distance instead of KL-divergence. At each time step t, the algorithm computes the empirical estimates for transition probabilities. 
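The inner maximization defining the index U_t(s_t, a), a linear function of q maximized over the intersection of an L1 ball with the probability simplex, has a simple greedy solution: move as much of the allowed L1 budget as possible onto the state with the largest bias estimate, taking that mass from the states with the smallest. The sketch below is ours, not the authors' implementation, and assumes numpy; the function and variable names are hypothetical.

```python
import numpy as np

def optimistic_index(reward, p_hat, h, radius):
    """Maximize reward + <q, h> over distributions q with
    ||q - p_hat||_1 <= radius (a greedy solution of the LP)."""
    q = p_hat.astype(float).copy()
    best = np.argmax(h)
    # Add up to radius/2 mass to the best state (capped at total mass 1)...
    add = min(radius / 2.0, 1.0 - q[best])
    q[best] += add
    # ...and remove the same amount from the states with the smallest h.
    for j in np.argsort(h):
        if j == best:
            continue
        take = min(q[j], add)
        q[j] -= take
        add -= take
        if add <= 0:
            break
    return reward + q @ h, q

# Hypothetical numbers: empirical estimate [0.5, 0.5], bias estimate [0, 1],
# confidence radius 0.4, so optimism shifts 0.2 mass onto the second state.
index, q = optimistic_index(1.0, np.array([0.5, 0.5]), np.array([0.0, 1.0]), 0.4)
```

Equivalently, since the L1 ball has only 2|S| vertices, the exact LP could be solved by enumerating them, as the text points out.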
It then forms a restricted problem ignoring relatively undersampled actions. An action a \u2208 A(i) is considered \u201cundersampled\u201d if N_t(i, a) < log^2 N_t(i). The solutions \u02c6h_t, \u02c6\u03bb_t might be misleading due to estimation errors. To avoid being misled by empirical samples we compute optimistic \u201cindices\u201d U_t(s_t, a) for all legal actions a \u2208 A(s_t) where s_t is the current state. The index for action a is computed by looking at an L1-ball around the empirical estimate \u02c6p^t_{s_t}(a) and choosing a probability distribution q that maximizes L(i, a, q, \u02c6h_t). Note that if the estimates were perfect, we would take an action maximizing L(i, a, \u02c6p^t_{s_t}(a), \u02c6h_t). Instead, we take an action that maximizes the index. There is one case where we are forced not to take an index-maximizing action. It is when all the optimal actions of the current problem are about to become undersampled at the next time step. In that case, we take one of these actions (the first branch of the if statement in Algorithm 1). Note that both solving the optimality equations and computing the indices can be done by solving LPs. The LP for solving optimality equations can be found in several textbooks (see, for example, [9, p. 391]). The LP for the indices is even simpler: the L1 ball has only 2|S| vertices and so we can maximize over them efficiently.\n\nLike the original Burnetas-Katehakis algorithm, the modified one also satisfies a logarithmic regret bound as stated in the following theorem. Unlike the original algorithm, OLP does not need to know the support sets of the transition distributions.\n\nTheorem 4. Let \u03b2 denote the policy implemented by Algorithm 1. Then we have, for all i_0 \u2208 S and for all P satisfying Assumption 1,\n\nlimsup_{T\u2192\u221e} R^\u03b2_T(i_0, P)/log T \u2264 C(P) ,\n\nwhere C(P) is the MDP-dependent constant defined in (8).\n\nProof. 
From Proposition 1 in [1], it follows that\n\nR^\u03b2_T(i_0, P) = \u03a3_{i\u2208S} \u03a3_{a\u2209O(i;P,A)} E^{\u03b2,P}_{i_0}[N_T(i, a)] \u03c6^*(i, a; P) + O(1) .    (13)\n\nDefine the event\n\nA_t := {||\u02c6h_t \u2212 h^*(P)||_\u221e \u2264 \u03b5 \u2227 O(\u02c6P^t, D_t) \u2286 O(P)} .    (14)\n\nDefine,\n\nN^1_T(i, a; \u03b5) := \u03a3_{t=0}^{T\u22121} 1[(s_t, a_t) = (i, a) \u2227 A_t \u2227 U_t(i, a) \u2265 L^*(i; P, A) \u2212 2\u03b5] ,\n\nN^2_T(i, a; \u03b5) := \u03a3_{t=0}^{T\u22121} 1[(s_t, a_t) = (i, a) \u2227 A_t \u2227 U_t(i, a) < L^*(i; P, A) \u2212 2\u03b5] ,\n\nN^3_T(\u03b5) := \u03a3_{t=0}^{T\u22121} 1[\u00afA_t] ,\n\nwhere \u00afA_t denotes the complement of A_t. For all \u03b5 > 0,\n\nN_T(i, a) \u2264 N^1_T(i, a; \u03b5) + N^2_T(i, a; \u03b5) + N^3_T(\u03b5) .    (15)\n\nThe result then follows by combining (13) and (15) with the following three propositions and then letting \u03b5 \u2192 0 sufficiently slowly.\n\nProposition 5. For all P and i_0 \u2208 S, we have\n\nlim_{\u03b5\u21920} limsup_{T\u2192\u221e} (\u03a3_{i\u2208S} \u03a3_{a\u2209O(i;P,A)} E^{\u03b2,P}_{i_0}[N^1_T(i, a; \u03b5)] \u03c6^*(i, a; P)) / log T \u2264 C(P) .\n\nProposition 6. For all P, i_0, i \u2208 S, a \u2209 O(i; P, A) and \u03b5 sufficiently small, we have\n\nE^{\u03b2,P}_{i_0}[N^2_T(i, a; \u03b5)] = o(log T) .\n\nProposition 7. For all P satisfying Assumption 1, i_0 \u2208 S and \u03b5 > 0, we have\n\nE^{\u03b2,P}_{i_0}[N^3_T(\u03b5)] = o(log T) .\n\n4 Proofs of auxiliary propositions\n\nWe prove Propositions 5 and 6. The proof of Proposition 7 is almost the same as that of Proposition 5 in [1] and therefore omitted (for details, see Chapter 6 in the first author\u2019s thesis [8]). The proof of Proposition 6 is considerably simpler (because of the use of L1 distance rather than KL-divergence) than the analogous Proposition 4 in [1].\n\nProof of Proposition 5. 
There are two cases depending on whether (i, a) \u2208 Crit(P) or not. If (i, a) \u2209 Crit(P), there is an \u03b5_0 > 0 such that MakeOpt(i, a; P, \u03b5_0) = \u2205. On the event A_t (recall the definition given in (14)), we have |\u27e8q, \u02c6h_t\u27e9 \u2212 \u27e8q, h^*(P)\u27e9| \u2264 \u03b5 for any q \u2208 \u2206_+. Therefore,\n\nU_t(i, a) \u2264 sup_{q\u2208\u2206_+} {r(i, a) + \u27e8q, \u02c6h_t\u27e9}\n\n\u2264 sup_{q\u2208\u2206_+} {r(i, a) + \u27e8q, h^*(P)\u27e9} + \u03b5\n\n< L^*(i; P, A) \u2212 \u03b5_0 + \u03b5    [\u2235 MakeOpt(i, a; P, \u03b5_0) = \u2205]\n\n< L^*(i; P, A) \u2212 2\u03b5    provided that 3\u03b5 < \u03b5_0 .\n\nTherefore for \u03b5 < \u03b5_0/3, N^1_T(i, a; \u03b5) = 0.\n\nNow suppose (i, a) \u2208 Crit(P). The event U_t(i, a) \u2265 L^*(i; P, A) \u2212 2\u03b5 is equivalent to\n\n\u2203q \u2208 \u2206_+ s.t. (||\u02c6p^t_i(a) \u2212 q||_1^2 \u2264 2 log t/N_t(i, a)) \u2227 (r(i, a) + \u27e8q, \u02c6h_t\u27e9 \u2265 L^*(i; P, A) \u2212 2\u03b5) .\n\nOn the event A_t, we have |\u27e8q, \u02c6h_t\u27e9 \u2212 \u27e8q, h^*(P)\u27e9| \u2264 \u03b5 and thus the above implies\n\n\u2203q \u2208 \u2206_+ s.t. (||\u02c6p^t_i(a) \u2212 q||_1^2 \u2264 2 log t/N_t(i, a)) \u2227 (r(i, a) + \u27e8q, h^*(P)\u27e9 \u2265 L^*(i; P, A) \u2212 3\u03b5) .\n\nRecalling the definition (6) of J_{i,a}(p; P, \u03b5), we see that this implies\n\nJ_{i,a}(\u02c6p^t_i(a); P, 3\u03b5) \u2264 2 log t/N_t(i, a) .\n\nWe therefore have,\n\nN^1_T(i, a; \u03b5) \u2264 \u03a3_{t=0}^{T\u22121} 1[(s_t, a_t) = (i, a) \u2227 J_{i,a}(\u02c6p^t_i(a); P, 3\u03b5) \u2264 2 log t/N_t(i, a)]\n\n\u2264 \u03a3_{t=0}^{T\u22121} 1[(s_t, a_t) = (i, a) \u2227 J_{i,a}(p_i(a); P, 3\u03b5) \u2264 2 log t/N_t(i, a) + \u03b4] + \u03a3_{t=0}^{T\u22121} 1[(s_t, a_t) = (i, a) \u2227 J_{i,a}(p_i(a); P, 3\u03b5) > J_{i,a}(\u02c6p^t_i(a); P, 3\u03b5) + \u03b4]    (16)\n\nwhere \u03b4 > 0 is arbitrary. 
Each time the pair (i, a) occurs N_t(i, a) increases by 1, so the first count is no more than\n\n2 log T / (J_{i,a}(p_i(a); P, 3\u03b5) \u2212 \u03b4) .    (17)\n\nTo control the expectation of the second sum, note that continuity of J_{i,a} in its first argument implies that there is a function f such that f(\u03b4) > 0 for \u03b4 > 0, f(\u03b4) \u2192 0 as \u03b4 \u2192 0 and J_{i,a}(p_i(a); P, 3\u03b5) > J_{i,a}(\u02c6p^t_i(a); P, 3\u03b5) + \u03b4 implies that ||p_i(a) \u2212 \u02c6p^t_i(a)||_1 > f(\u03b4). By a Chernoff-type bound, we have, for some constant C_1,\n\nP^{\u03b2,P}_{i_0}[||p_i(a) \u2212 \u02c6p^t_i(a)||_1 > f(\u03b4) | N_t(i, a) = m] \u2264 C_1 exp(\u2212m f(\u03b4)^2) ,\n\nand so the expectation of the second sum is no more than\n\nE^{\u03b2,P}_{i_0}[\u03a3_{t=0}^{T\u22121} C_1 exp(\u2212N_t(i, a) f(\u03b4)^2)] \u2264 \u03a3_{m=1}^{\u221e} C_1 exp(\u2212m f(\u03b4)^2) = C_1/(1 \u2212 exp(\u2212f(\u03b4)^2)) .    (18)\n\nCombining the bounds (17) and (18) and plugging them into (16), we get\n\nE^{\u03b2,P}_{i_0}[N^1_T(i, a; \u03b5)] \u2264 2 log T / (J_{i,a}(p_i(a); P, 3\u03b5) \u2212 \u03b4) + C_1/(1 \u2212 exp(\u2212f(\u03b4)^2)) .\n\nLetting \u03b4 \u2192 0 sufficiently slowly, we get that for all \u03b5 > 0,\n\nE^{\u03b2,P}_{i_0}[N^1_T(i, a; \u03b5)] \u2264 2 log T / J_{i,a}(p_i(a); P, 3\u03b5) + o(log T) .\n\nTherefore,\n\nlim_{\u03b5\u21920} limsup_{T\u2192\u221e} E^{\u03b2,P}_{i_0}[N^1_T(i, a; \u03b5)] / log T \u2264 lim_{\u03b5\u21920} 2/J_{i,a}(p_i(a); P, 3\u03b5) = 2/K(i, a; P) ,\n\nwhere the last equality follows from the definition (7) of K(i, a; P). The result now follows by summing over (i, a) pairs in Crit(P).\n\nProof of Proposition 6. 
Define the event\n\nA\u2032_t(i, a; \u03b5) := {(s_t, a_t) = (i, a) \u2227 A_t \u2227 U_t(i, a) < L^*(i; P, A) \u2212 2\u03b5} ,\n\nso that we can write\n\nN^2_T(i, a; \u03b5) = \u03a3_{t=0}^{T\u22121} 1[A\u2032_t(i, a; \u03b5)] .    (19)\n\nNote that on A\u2032_t(i, a; \u03b5), we have \u0393^1_t \u2286 O(i; \u02c6P^t, D_t) \u2286 O(i; P, A). So, a \u2209 O(i; P, A). But a was taken at time t, so it must have been in \u0393^2_t which means it maximized the index. Therefore, for all optimal actions a^* \u2208 O(i; P, A), we have, on the event A\u2032_t(i, a; \u03b5),\n\nU_t(i, a^*) \u2264 U_t(i, a) < L^*(i; P, A) \u2212 2\u03b5 .\n\nSince L^*(i; P, A) = r(i, a^*) + \u27e8p_i(a^*), h^*(P)\u27e9, this implies\n\n\u2200q \u2208 \u2206_+, ||q \u2212 \u02c6p^t_i(a^*)||_1 \u2264 \u221a(2 log t/N_t(i, a^*)) \u21d2 \u27e8q, \u02c6h_t\u27e9 < \u27e8p_i(a^*), h^*(P)\u27e9 \u2212 2\u03b5 .\n\nMoreover, on the event A_t, |\u27e8q, \u02c6h_t\u27e9 \u2212 \u27e8q, h^*(P)\u27e9| \u2264 \u03b5. We therefore have, for any a^* \u2208 O(i; P, A),\n\nA\u2032_t(i, a; \u03b5) \u2286 {\u2200q \u2208 \u2206_+, ||q \u2212 \u02c6p^t_i(a^*)||_1 \u2264 \u221a(2 log t/N_t(i, a^*)) \u21d2 \u27e8q, h^*(P)\u27e9 < \u27e8p_i(a^*), h^*(P)\u27e9 \u2212 \u03b5}\n\n\u2286 {\u2200q \u2208 \u2206_+, ||q \u2212 \u02c6p^t_i(a^*)||_1 \u2264 \u221a(2 log t/N_t(i, a^*)) \u21d2 ||q \u2212 p_i(a^*)||_1 > \u03b5/||h^*(P)||_\u221e}\n\n\u2286 {||\u02c6p^t_i(a^*) \u2212 p_i(a^*)||_1 > \u03b5/||h^*(P)||_\u221e + \u221a(2 log t/N_t(i, a^*))}\n\n\u2286 \u222a_{m=1}^{t} {N_t(i, a^*) = m \u2227 ||\u02c6p^t_i(a^*) \u2212 p_i(a^*)||_1 > \u03b5/||h^*(P)||_\u221e + \u221a(2 log t/m)} .\n\nUsing a Chernoff-type bound, we have, for some constant C_1,\n\nP^{\u03b2,P}_{i_0}[||\u02c6p^t_i(a^*) \u2212 p_i(a^*)||_1 > \u03b4 | N_t(i, a^*) = m] \u2264 C_1 exp(\u2212m\u03b4^2/2) .\n\nUsing a union bound, we therefore have,\n\nP^{\u03b2,P}_{i_0}[A\u2032_t(i, a; \u03b5)] \u2264 \u03a3_{m=1}^{t} C_1 exp(\u2212(m/2)(\u03b5/||h^*(P)||_\u221e + \u221a(2 log t/m))^2) \u2264 (C_1/t) \u03a3_{m=1}^{\u221e} exp(\u2212m\u03b5^2/(2||h^*(P)||_\u221e^2) \u2212 (\u03b5/||h^*(P)||_\u221e)\u221a(2m log t)) = o(1/t) .\n\nCombining this with (19) proves the result.\n\nReferences\n\n[1] Burnetas, A.N. & Katehakis, M.N. (1997) Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research 22(1):222\u2013255.\n\n[2] Auer, P. & Ortner, R. (2007) Logarithmic online regret bounds for undiscounted reinforcement learning. Advances in Neural Information Processing Systems 19. Cambridge, MA: MIT Press.\n\n[3] Lai, T.L. & Robbins, H. (1985) Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1):4\u201322.\n\n[4] Brafman, R.I. & Tennenholtz, M. (2002) R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213\u2013231.\n\n[5] Auer, P. (2002) Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397\u2013422.\n\n[6] Auer, P., Cesa-Bianchi, N. & Fischer, P. (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235\u2013256.\n\n[7] Strehl, A.L. & Littman, M. (2005) A theoretical analysis of model-based interval estimation. In Proceedings of the Twenty-Second International Conference on Machine Learning, pp. 857\u2013864. ACM Press.\n\n[8] Tewari, A. (2007) Reinforcement Learning in Large or Unknown MDPs. PhD thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley.\n\n[9] Puterman, M.L. (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: John Wiley and Sons.", "award": [], "sourceid": 673, "authors": [{"given_name": "Ambuj", "family_name": "Tewari", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}