{"title": "Efficient Exploration and Value Function Generalization in Deterministic Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 3021, "page_last": 3029, "abstract": "We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function lies within the given hypothesis class, OCP selects optimal actions over all but at most K episodes, where K is the eluder dimension of the given hypothesis class. We establish further efficiency and asymptotic performance guarantees that apply even if the true value function does not lie in the given hypothesis space, for the special case where the hypothesis space is the span of pre-specified indicator functions over disjoint sets.", "full_text": "Efficient Exploration and Value Function Generalization in Deterministic Systems\n\nZheng Wen\nStanford University\nzhengwen@stanford.edu\n\nBenjamin Van Roy\nStanford University\nbvr@stanford.edu\n\nAbstract\n\nWe consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class 𝒬, OCP selects optimal actions over all but at most dim_E[𝒬] episodes, where dim_E denotes the eluder dimension. 
We establish further efficiency and asymptotic performance guarantees that apply even if Q* does not lie in 𝒬, for the special case where 𝒬 is the span of pre-specified indicator functions over disjoint sets.\n\n1 Introduction\n\nA growing body of work on efficient reinforcement learning provides algorithms with guarantees on sample and computational efficiency [13, 6, 2, 22, 4, 9]. This literature highlights the point that an effective exploration scheme is critical to the design of any efficient reinforcement learning algorithm. In particular, popular exploration schemes such as ε-greedy, Boltzmann, and knowledge gradient can require learning times that grow exponentially in the number of states and/or the planning horizon.\n\nThe aforementioned literature focuses on tabula rasa learning; that is, algorithms aim to learn with little or no prior knowledge about transition probabilities and rewards. Such algorithms require learning times that grow at least linearly with the number of states. Despite the valuable insights that have been generated through their design and analysis, these algorithms are of limited practical import because state spaces in most contexts of practical interest are enormous. There is a need for algorithms that generalize from past experience in order to learn how to make effective decisions in reasonable time.\n\nThere has been much work on reinforcement learning algorithms that generalize (see, e.g., [5, 23, 24, 18] and references therein). Most of these algorithms do not come with statistical or computational efficiency guarantees, though there are a few noteworthy exceptions, which we now discuss. A number of results treat policy-based algorithms (see [10, 3] and references therein), in which the goal is to select high-performers among a pre-specified collection of policies as learning progresses. 
Though interesting results have been produced in this line of work, each entails quite restrictive assumptions or does not make strong guarantees. Another body of work focuses on model-based algorithms. An algorithm is proposed in [12] that fits a factored model to observed data and makes decisions based on the fitted model. The authors establish a sample complexity bound that is polynomial in the number of model parameters rather than the number of states, but the algorithm is computationally intractable because of the difficulty of solving factored MDPs. A recent paper [14] proposes a novel algorithm for the case where the true environment is known to belong to a finite or compact class of models, and shows that its sample complexity is polynomial in the cardinality of the model class if the model class is finite, or the ε-covering-number if the model class is compact. Though this result is theoretically interesting, for most model classes of interest, the ε-covering-number is enormous since it typically grows exponentially in the number of free parameters. Another recent paper [17] establishes a regret bound for an algorithm that applies to problems with continuous state spaces and Hölder-continuous rewards and transition kernels. Though the results represent an interesting contribution to the literature, a couple of features of the regret bound weaken its practical implications. First, regret grows linearly with the Hölder constant of the transition kernel, which for most contexts of practical relevance grows exponentially in the number of state variables. Second, the dependence on time becomes arbitrarily close to linear as the dimension of the state space grows. Reinforcement learning in linear systems with quadratic cost is treated in [1]. The method proposed is shown to realize regret that grows with the square root of time. 
The result is interesting and the property is desirable, but to the best of our knowledge, expressions derived for regret in the analysis exhibit an exponential dependence on the number of state variables, and further, we are not aware of a computationally efficient way of implementing the proposed method. This work was extended by [8] to address linear systems with sparse structure. Here, there are efficiency guarantees that scale gracefully with the number of state variables, but only under sparsity and other technical assumptions.\n\nThe most popular approach to generalization in the applied reinforcement learning literature involves fitting parameterized value functions. Such approaches relate closely to supervised learning in that they learn functions from state to value, though a difference is that value is influenced by action and observed only through delayed feedback. One advantage over model learning approaches is that, given a fitted value function, decisions can be made without solving a potentially intractable control problem. We see this as a promising direction, though there currently is a lack of theoretical results that provide attractive bounds on learning time with value function generalization. A relevant paper along this research line is [15], which studies efficient reinforcement learning with value function generalization in the KWIK framework (see [16]), and reduces the efficient reinforcement learning problem to the efficient KWIK online regression problem. However, the authors do not show how to solve the general KWIK online regression problem efficiently, and it is not even clear whether this is possible. 
Thus, though the result of [15] is interesting, it does not provide a provably efficient algorithm.\n\nAn important challenge that remains is to couple exploration and value function generalization in a provably effective way, and in particular, to establish sample and computational efficiency guarantees that scale gracefully with the planning horizon and model complexity. In this paper, we aim to make progress in this direction. To start with a simple context, we restrict our attention to deterministic systems that evolve over finite time horizons, and we consider episodic learning, in which an agent repeatedly interacts with the same system. As a solution to the problem, we propose optimistic constraint propagation (OCP), a computationally efficient reinforcement learning algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class 𝒬, OCP selects optimal actions over all but at most dim_E[𝒬] episodes. Here, dim_E denotes the eluder dimension, which quantifies complexity of the hypothesis class. A corollary of this result is that regret is bounded by a function that is constant over time and linear in the problem horizon and eluder dimension.\n\nTo put our aforementioned result in perspective, it is useful to relate it to other lines of work. Consider first the broad area of reinforcement learning algorithms that fit value functions, such as SARSA [19]. Even with the most commonly used sort of hypothesis class 𝒬, which is made up of linear combinations of fixed basis functions, and even when the hypothesis class contains the true value function Q*, there are no guarantees that these algorithms will efficiently learn to make near-optimal decisions. 
On the other hand, our result implies that OCP attains near-optimal performance in time that scales linearly with the number of basis functions. Now consider the more specialized context of a deterministic linear system with quadratic cost and a finite time horizon. The analysis of [1] can be leveraged to produce regret bounds that scale exponentially in the number of state variables. On the other hand, using a hypothesis space 𝒬 consisting of quadratic functions of state-action pairs, the results of this paper show that OCP behaves near optimally within time that scales quadratically in the number of state and action variables.\n\nWe also establish efficiency and asymptotic performance guarantees that apply to agnostic reinforcement learning, where Q* does not necessarily lie in 𝒬. In particular, we consider the case where 𝒬 is the span of pre-specified indicator functions over disjoint sets. Our results here add to the literature on agnostic reinforcement learning with such a hypothesis class [21, 25, 7, 26]. Prior work in this area has produced interesting algorithms and insights, as well as bounds on performance loss associated with potential limits of convergence, but no convergence or efficiency guarantees.\n\n2 Reinforcement Learning in Deterministic Systems\n\nIn this paper, we consider an episodic reinforcement learning (RL) problem in which an agent repeatedly interacts with a discrete-time finite-horizon deterministic system, and refer to each interaction as an episode. The system is identified by a sextuple M = (𝒮, 𝒜, H, F, R, S), where 𝒮 is the state space, 𝒜 is the action space, H is the horizon, F is a system function, R is a reward function and S is a sequence of states. If action a ∈ 𝒜 is selected while the system is in state x ∈ 𝒮 at period t = 0, 1, ..., H − 1, a reward of R_t(x, a) is realized; furthermore, if t < H − 1, the state transitions to F_t(x, a). 
Each episode terminates at period H − 1, and then a new episode begins. The initial state of episode j is the jth element of S.\n\nTo represent the history of actions and observations over multiple episodes, we will often index variables by both episode and period. For example, x_{j,t} and a_{j,t} denote the state and action at period t of episode j, where j = 0, 1, ... and t = 0, 1, ..., H − 1. To count the total number of steps since the agent started learning, we say period t in episode j is time jH + t.\n\nA (deterministic) policy μ = (μ_0, ..., μ_{H−1}) is a sequence of functions, each mapping 𝒮 to 𝒜. For each policy μ, define a value function V^μ_t(x) = Σ_{τ=t}^{H−1} R_τ(x_τ, a_τ), where x_t = x, x_{τ+1} = F_τ(x_τ, a_τ), and a_τ = μ_τ(x_τ). The optimal value function is defined by V*_t(x) = sup_μ V^μ_t(x). A policy μ* is said to be optimal if V^{μ*} = V*. Throughout this paper, we will restrict attention to systems M = (𝒮, 𝒜, H, F, R, S) that admit optimal policies. Note that this restriction incurs no loss of generality when the action space is finite.\n\nIt is also useful to define an action-contingent optimal value function: Q*_t(x, a) = R_t(x, a) + V*_{t+1}(F_t(x, a)) for t < H − 1, and Q*_{H−1}(x, a) = R_{H−1}(x, a). Then, a policy μ* is optimal if μ*_t(x) ∈ arg max_{a∈𝒜} Q*_t(x, a) for all (x, t).\n\nA reinforcement learning algorithm generates each action a_{j,t} based on observations made up to the tth period of the jth episode, including all states, actions, and rewards observed in previous episodes and earlier in the current episode, as well as the state space 𝒮, action space 𝒜, horizon H, and possible prior information. In each episode, the algorithm realizes reward R^(j) = Σ_{t=0}^{H−1} R_t(x_{j,t}, a_{j,t}). Note that R^(j) ≤ V*_0(x_{j,0}) for each episode j. 
One way to quantify performance of a reinforcement learning algorithm is in terms of the number of episodes J_L for which R^(j) < V*_0(x_{j,0}) − ε, where ε ≥ 0 is a pre-specified performance loss threshold. If the reward function R is bounded, with |R_t(x, a)| ≤ R̄ for all (x, a, t), then this also implies a bound on regret over episodes experienced prior to time T, defined by Regret(T) = Σ_{j=0}^{⌊T/H⌋−1} (V*_0(x_{j,0}) − R^(j)). In particular, Regret(T) ≤ 2R̄HJ_L + ε⌊T/H⌋.\n\n3 Optimistic Constraint Propagation\n\nAt a high level, our reinforcement learning algorithm, optimistic constraint propagation (OCP), selects actions based on the optimism in the face of uncertainty principle and, based on observed rewards and state transitions, propagates constraints backwards through time. Specifically, it takes as input the state space 𝒮, the action space 𝒜, the horizon H, and a hypothesis class 𝒬 of candidates for Q*. The algorithm maintains a sequence of subsets of 𝒬 and a sequence of scalar “upper bounds”, which summarize constraints that past experience suggests for ruling out hypotheses. Each constraint in this sequence is specified by a state x ∈ 𝒮, an action a ∈ 𝒜, a period t = 0, ..., H − 1, and an interval [L, U] ⊆ ℝ, and takes the form {Q ∈ 𝒬 : L ≤ Q_t(x, a) ≤ U}. The upper bound of the constraint is U. Given a sequence 𝒞 = (𝒞_1, ..., 𝒞_{|𝒞|}) of such constraints and upper bounds 𝒰 = (𝒰_1, ..., 𝒰_{|𝒞|}), a set 𝒬_𝒞 is defined constructively by Algorithm 1. Note that if the constraints do not conflict then 𝒬_𝒞 = 𝒞_1 ∩ ... ∩ 𝒞_{|𝒞|}. 
When constraints do conflict, priority is assigned first based on upper bound, with smaller upper bound preferred, and then, in the event of ties in upper bound, based on position in the sequence, with more recent experience preferred.\n\nAlgorithm 1 Constraint Selection\nRequire: 𝒬, 𝒞\n  𝒬_𝒞 ← 𝒬, u ← min 𝒰\n  while u ≤ ∞ do\n    for τ = |𝒞| to 1 do\n      if 𝒰_τ = u and 𝒬_𝒞 ∩ 𝒞_τ ≠ ∅ then\n        𝒬_𝒞 ← 𝒬_𝒞 ∩ 𝒞_τ\n      end if\n    end for\n    if {u′ ∈ 𝒰 : u′ > u} = ∅ then\n      return 𝒬_𝒞\n    end if\n    u ← min{u′ ∈ 𝒰 : u′ > u}\n  end while\n\nOCP, presented below as Algorithm 2, at each time t computes for the current state x_{j,t} and each action a the greatest state-action value Q_t(x_{j,t}, a) among functions in 𝒬_𝒞 and selects an action that attains the maximum. In other words, an action is chosen based on the most optimistic feasible outcome subject to constraints. The subsequent reward and state transition give rise to a new constraint that is used to update 𝒞. Note that the update of 𝒞 is postponed until one episode is completed.\n\nAlgorithm 2 Optimistic Constraint Propagation\nRequire: 𝒮, 𝒜, H, 𝒬\n  Initialize 𝒞 ← ∅\n  for episode j = 0, 1, ... do\n    Set 𝒞′ ← 𝒞\n    for period t = 0, 1, ..., H − 1 do\n      Apply a_{j,t} ∈ arg max_{a∈𝒜} sup_{Q∈𝒬_𝒞} Q_t(x_{j,t}, a)\n      if t < H − 1 then\n        U_{j,t} ← sup_{Q∈𝒬_𝒞} (R_t(x_{j,t}, a_{j,t}) + sup_{a∈𝒜} Q_{t+1}(x_{j,t+1}, a))\n        L_{j,t} ← inf_{Q∈𝒬_𝒞} (R_t(x_{j,t}, a_{j,t}) + sup_{a∈𝒜} Q_{t+1}(x_{j,t+1}, a))\n      else\n        U_{j,t} ← R_t(x_{j,t}, a_{j,t})\n        L_{j,t} ← R_t(x_{j,t}, a_{j,t})\n      end if\n      𝒞′ ← 𝒞′ ⌢ {Q ∈ 𝒬 : L_{j,t} ≤ Q_t(x_{j,t}, a_{j,t}) ≤ U_{j,t}}\n    end for\n    Update 𝒞 ← 𝒞′\n  end for\n\nNote that if Q* ∈ 𝒬 then each constraint appended to 𝒞 does not rule out Q*, and therefore, the sequence of sets 𝒬_𝒞 generated as the algorithm progresses is decreasing and contains Q* in its intersection. 
In the agnostic case, where Q* may not lie in 𝒬, new constraints can be inconsistent with previous constraints, in which case selected previous constraints are relaxed as determined by Algorithm 1.\n\nLet us briefly discuss several contexts of practical relevance and/or theoretical interest in which OCP can be applied.\n\n• Finite state/action tabula rasa case. With finite state and action spaces, Q* can be represented as a vector, and without special prior knowledge, it is natural to let 𝒬 = ℝ^{|𝒮|·|𝒜|·H}.\n• Polytopic prior constraints. Consider the aforementioned example, but suppose that we have prior knowledge that Q* lies in a particular polytope. Then we can let 𝒬 be that polytope and again apply OCP.\n• Linear systems with quadratic cost (LQ). In this classical control model, if 𝒮 = ℝ^n, 𝒜 = ℝ^m, and R is a positive semidefinite quadratic, then for each t, Q*_t is known to be a positive semidefinite quadratic, and it is natural to let 𝒬 = 𝒬_0^H with 𝒬_0 denoting the set of positive semidefinite quadratics.\n• Finite hypothesis class. Consider a context when we have prior knowledge that Q* can be well approximated by some element in a finite hypothesis class. Then we can let 𝒬 be that finite hypothesis class and apply OCP. This scenario is of particular interest from the perspective of learning theory. Note that this context entails agnostic learning, which is accommodated by OCP.\n• Linear combination of features. It is often effective to hand-select a set of features φ_1, ..., φ_K, each mapping 𝒮 × 𝒜 to ℝ, and then, for each t, aim to compute weights θ^(t) ∈ ℝ^K so that Σ_k θ^(t)_k φ_k approximates Q*_t without knowing for sure that Q*_t lies in the span of the features. To apply OCP here, we would let 𝒬 = 𝒬_0^H with 𝒬_0 = span(φ_1, ..., φ_K). Note that this context also entails agnostic learning.\n• Sigmoid. 
If it is known that rewards are only received upon transitioning to the terminal state and take values between 0 and 1, it might be appropriate to use a variation of the aforementioned feature based model that applies a sigmoidal function to the linear combination. In particular, we could have 𝒬 = 𝒬_0^H with 𝒬_0 = {ψ(Σ_k θ_k φ_k(·)) : θ ∈ ℝ^K}, where ψ(z) = e^z/(1 + e^z).\n\nIt is worth mentioning that OCP, as we have defined it, assumes that an action a maximizing sup_{Q∈𝒬_𝒞} Q_t(x_{j,t}, a) exists in each iteration. It is not difficult to modify the algorithm so that it addresses cases where this is not true. But we have not presented the more general form of OCP in order to avoid complicating this short paper.\n\n4 Sample Efficiency of Optimistic Constraint Propagation\n\nWe now establish results concerning the sample efficiency of OCP. Our results bound the time it takes OCP to learn, and this must depend on the complexity of the hypothesis class. As such, we begin by defining the eluder dimension, as introduced in [20], which is the notion of complexity we will use.\n\n4.1 Eluder Dimension\n\nLet Z = {(x, a, t) : x ∈ 𝒮, a ∈ 𝒜, t = 0, ..., H − 1} be the set of all state-action-period triples, and let 𝒬 denote a nonempty set of functions mapping Z to ℝ. For all (x, a, t) ∈ Z and Z̃ ⊆ Z, (x, a, t) is said to be dependent on Z̃ with respect to 𝒬 if any pair of functions Q, Q̃ ∈ 𝒬 that are equal on Z̃ are equal at (x, a, t). Further, (x, a, t) is said to be independent of Z̃ with respect to 𝒬 if (x, a, t) is not dependent on Z̃ with respect to 𝒬.\n\nThe eluder dimension dim_E[𝒬] of 𝒬 is the length of the longest sequence of elements in Z such that every element is independent of its predecessors. Note that dim_E[𝒬] can be zero or infinity, and it is straightforward to show that if 𝒬_1 ⊆ 𝒬_2 then dim_E[𝒬_1] ≤ dim_E[𝒬_2]. 
Based on results of [20], we can characterize the eluder dimensions of various hypothesis classes presented in the previous section.\n\n• Finite state/action tabula rasa case. If 𝒬 = ℝ^{|𝒮|·|𝒜|·H}, then dim_E[𝒬] = |𝒮| · |𝒜| · H.\n• Polytopic prior constraints. If 𝒬 is a polytope of dimension d in ℝ^{|𝒮|·|𝒜|·H}, then dim_E[𝒬] = d.\n• Linear systems with quadratic cost (LQ). If 𝒬_0 is the set of positive semidefinite quadratics with domain ℝ^{m+n} and 𝒬 = 𝒬_0^H, then dim_E[𝒬] = (m + n + 1)(m + n)H/2.\n• Finite hypothesis space. If |𝒬| < ∞, then dim_E[𝒬] ≤ |𝒬| − 1.\n• Linear combination of features. If 𝒬 = 𝒬_0^H with 𝒬_0 = span(φ_1, ..., φ_K), then dim_E[𝒬] ≤ KH.\n• Sigmoid. If 𝒬 = 𝒬_0^H with 𝒬_0 = {ψ(Σ_k θ_k φ_k(·)) : θ ∈ ℝ^K}, then dim_E[𝒬] ≤ KH.\n\n4.2 Learning with a Coherent Hypothesis Class\n\nWe now present results that apply when OCP is presented with a coherent hypothesis class; that is, where Q* ∈ 𝒬. Our first result establishes that OCP can deliver less than optimal performance in no more than dim_E[𝒬] episodes.\n\nTheorem 1 For any system M = (𝒮, 𝒜, H, F, R, S), if OCP is applied with Q* ∈ 𝒬, then |{j : R^(j) < V*_0(x_{j,0})}| ≤ dim_E[𝒬].\n\nThis theorem follows from an “exploration-exploitation lemma”, which asserts that in each episode, OCP either delivers optimal reward (exploitation) or introduces a constraint that reduces the eluder dimension of the hypothesis class by one (exploration). Consequently, OCP will experience sub-optimal performance in at most dim_E[𝒬] episodes. A complete proof is provided in the appendix. An immediate corollary bounds regret.\n\nCorollary 1 For any R̄, any system M = (𝒮, 𝒜, H, F, R, S) with sup_{(x,a,t)} |R_t(x, a)| ≤ R̄, and any T, if OCP is applied with Q* ∈ 𝒬, then Regret(T) ≤ 2R̄H dim_E[𝒬].\n\nNote that the regret bound in Corollary 1 does not depend on time T; thus, it is an O(1) bound. 
Furthermore, this regret bound is linear in R̄, H and dim_E[𝒬], and does not directly depend on |𝒮| or |𝒜|. The following results demonstrate that the bounds of the above theorem and corollary are sharp.\n\nTheorem 2 For any reinforcement learning algorithm that takes as input a state space, an action space, a horizon, and a hypothesis class, there exists a system M = (𝒮, 𝒜, H, F, R, S) and a hypothesis class 𝒬 ∋ Q* such that |{j : R^(j) < V*_0(x_{j,0})}| ≥ dim_E[𝒬].\n\nTheorem 3 For any R̄ ≥ 0 and any reinforcement learning algorithm that takes as input a state space, an action space, a horizon, and a hypothesis class, there exists a system M = (𝒮, 𝒜, H, F, R, S) with sup_{(x,a,t)} |R_t(x, a)| ≤ R̄ and a hypothesis class 𝒬 ∋ Q* such that sup_T Regret(T) ≥ 2R̄H dim_E[𝒬].\n\nA constructive proof of these lower bounds is provided in the appendix. Following our discussion in previous sections, we discuss several interesting contexts in which the agent knows a coherent hypothesis class 𝒬 with finite eluder dimension.\n\n• Finite state/action tabula rasa case. If we apply OCP in this case, then it will deliver sub-optimal performance in at most |𝒮| · |𝒜| · H episodes. Furthermore, if sup_{(x,a,t)} |R_t(x, a)| ≤ R̄, then for any T, Regret(T) ≤ 2R̄|𝒮||𝒜|H^2.\n• Polytopic prior constraints. If we apply OCP in this case, then it will deliver sub-optimal performance in at most d episodes. Furthermore, if sup_{(x,a,t)} |R_t(x, a)| ≤ R̄, then for any T, Regret(T) ≤ 2R̄Hd.\n• Linear systems with quadratic cost (LQ). If we apply OCP in this case, then it will deliver sub-optimal performance in at most (m + n + 1)(m + n)H/2 episodes.\n• Finite hypothesis class case. Assume that the agent has prior knowledge that Q* ∈ 𝒬, where 𝒬 is a finite hypothesis class. If we apply OCP in this case, then it will deliver sub-optimal performance in at most |𝒬| − 1 episodes. 
Furthermore, if sup_{(x,a,t)} |R_t(x, a)| ≤ R̄, then for any T, Regret(T) ≤ 2R̄H[|𝒬| − 1].\n\n4.3 Agnostic Learning\n\nAs we have discussed in Section 3, OCP can also be applied in agnostic learning cases, where Q* may not lie in 𝒬. For such cases, the performance of OCP should depend on not only the complexity of 𝒬, but also the distance between 𝒬 and Q*. We now present results when OCP is applied in a special agnostic learning case, where 𝒬 is the span of pre-specified indicator functions over disjoint subsets. We henceforth refer to this case as the state aggregation case.\n\nSpecifically, we assume that for any t = 0, 1, ..., H − 1, the state-action space at period t, Z_t = {(x, a, t) : x ∈ 𝒮, a ∈ 𝒜}, can be partitioned into K_t disjoint subsets Z_{t,1}, Z_{t,2}, ..., Z_{t,K_t}, and use φ_{t,k} to denote the indicator function for partition Z_{t,k} (i.e. φ_{t,k}(x, a, t) = 1 if (x, a, t) ∈ Z_{t,k}, and φ_{t,k}(x, a, t) = 0 otherwise). We define K = Σ_{t=0}^{H−1} K_t, and 𝒬 as\n\n𝒬 = span{φ_{0,1}, φ_{0,2}, ..., φ_{0,K_0}, φ_{1,1}, ..., φ_{H−1,K_{H−1}}}.    (4.1)\n\nNote that dim_E[𝒬] = K. We define the distance between Q* and the hypothesis class 𝒬 as\n\nρ = min_{Q∈𝒬} ‖Q − Q*‖_∞ = min_{Q∈𝒬} sup_{(x,a,t)} |Q_t(x, a) − Q*_t(x, a)|.    (4.2)\n\nThe following result establishes that with 𝒬 and ρ defined above, the performance loss of OCP is larger than 2ρH(H + 1) in at most K episodes.\n\nTheorem 4 For any system M = (𝒮, 𝒜, H, F, R, S), if OCP is applied with 𝒬 defined in Eqn(4.1), then\n\n|{j : R^(j) < V*_0(x_{j,0}) − 2ρH(H + 1)}| ≤ K,\n\nwhere K is the number of partitions and ρ is defined in Eqn(4.2).\n\nSimilar to Theorem 1, this theorem also follows from an “exploration-exploitation lemma”, which asserts that in each episode, OCP either delivers near-optimal reward (exploitation), or approximately determines Q*_t(x, a)’s for all the (x, a, t)’s in a disjoint subset (exploration). A complete proof for Theorem 4 is provided in the appendix. An immediate corollary bounds regret.\n\nCorollary 2 For any R̄ ≥ 0, any system M = (𝒮, 𝒜, H, F, R, S) with sup_{(x,a,t)} |R_t(x, a)| ≤ R̄, and any time T, if OCP is applied with 𝒬 defined in Eqn(4.1), then Regret(T) ≤ 2R̄KH + 2ρ(H + 1)T, where K is the number of partitions and ρ is defined in Eqn(4.2).\n\nNote that the regret bound in Corollary 2 is O(T), and the coefficient of the linear term is 2ρ(H + 1). Consequently, if Q* is close to 𝒬, then the regret will increase slowly with T. Furthermore, the regret bound in Corollary 2 does not directly depend on |𝒮| or |𝒜|.\n\nWe further notice that the threshold performance loss in Theorem 4 is O(ρH^2). 
The following proposition provides a condition under which the performance loss in one episode is O(ρH).\n\nProposition 1 For any episode j, if ∀t = 0, 1, ..., H − 1,\n\n𝒬_𝒞 ⊆ {Q ∈ 𝒬 : L_{j,t} ≤ Q_t(x_{j,t}, a_{j,t}) ≤ U_{j,t}},\n\nthen we have V*_0(x_{j,0}) − R^(j) ≤ 6ρH = O(ρH).\n\nThat is, if all the new constraints in an episode are redundant, then the performance loss in that episode is O(ρH). Note that if the condition for Proposition 1 holds in an episode, then 𝒬_𝒞 will not be modified at the end of that episode. Furthermore, if the system has a fixed initial state and the condition for Proposition 1 holds in one episode, then it will hold in all the subsequent episodes, and consequently, the performance losses in all the subsequent episodes are O(ρH).\n\n5 Computational Efficiency of Optimistic Constraint Propagation\n\nWe now briefly discuss the computational complexity of OCP. As is typical in the complexity analysis of optimization algorithms, we assume that basic operations include the arithmetic operations, comparisons, and assignment, and measure computational complexity in terms of the number of basic operations (henceforth referred to as operations) per period.\n\nFirst, it is worth pointing out that for a general hypothesis class 𝒬 and general action space 𝒜, the per period computations of OCP can be intractable. This is because:\n\n• Computing sup_{Q∈𝒬_𝒞} Q_t(x_{j,t}, a), U_{j,t} and L_{j,t} requires solving possibly intractable optimization problems.\n• Selecting an action that maximizes sup_{Q∈𝒬_𝒞} Q_t(x_{j,t}, a) can be intractable.\n\nFurther, the number of constraints in 𝒞, and with it the number of operations per period, can grow over time.\n\nHowever, if |𝒜| is tractably small and 𝒬 has some special structures (e.g. 
𝒬 is a finite set or a linear subspace or, more generally, a polytope), then by discarding some “redundant” constraints in 𝒞, OCP with a variant of Algorithm 1 will be computationally efficient, and the sample efficiency results developed in Section 4 will still hold. Due to space limitations, we only discuss the scenario where 𝒬 is a polytope of dimension d. Note that the finite state/action tabula rasa case, the linear-quadratic case, and the case with linear combinations of disjoint indicator functions are all special cases of this scenario.\n\nSpecifically, if 𝒬 is a polytope of dimension d (i.e., within a d-dimensional subspace), then any Q ∈ 𝒬 can be represented by a weight vector θ ∈ ℝ^d, and 𝒬 can be characterized by a set of linear inequalities of θ. Furthermore, the new constraints of the form L_{j,t} ≤ Q_t(x_{j,t}, a_{j,t}) ≤ U_{j,t} are also linear inequalities of θ. Hence, in each episode, 𝒬_𝒞 is characterized by a polyhedron in ℝ^d, and sup_{Q∈𝒬_𝒞} Q_t(x_{j,t}, a), U_{j,t} and L_{j,t} can be computed by solving linear programming (LP) problems. If we assume that all the encountered numerical values can be represented with B bits, and LPs are solved by Karmarkar’s algorithm [11], then the following proposition bounds the computational complexity.\n\nProposition 2 If 𝒬 is a polytope of dimension d, each numerical value in the problem data or observed in the course of learning can be represented with B bits, and OCP uses Karmarkar’s algorithm to solve linear programs, then the computational complexity of OCP is O([|𝒜| + |𝒞|]|𝒞|d^{4.5}B) operations per period.\n\nThe proof of Proposition 2 is provided in the appendix. Notice that the computational complexity is polynomial in d, B, |𝒞| and |𝒜|, and thus, OCP will be computationally efficient if all these parameters are tractably small. Note that the bound in Proposition 2 is a worst-case bound, and the O(d^{4.5}) term is incurred by the need to solve LPs. 
For some special cases, the computational complexity is much less. For instance, in the state aggregation case, the computational complexity is O(|𝒞| + |𝒜| + d) operations per period.\n\nAs we have discussed above, one can ensure that |𝒞| remains bounded by using variants of Algorithm 1 that discard the redundant constraints and/or update 𝒬_𝒞 more efficiently. Specifically, it is straightforward to design such constraint selection algorithms if 𝒬 is a coherent hypothesis class, or if 𝒬 is the span of pre-specified indicator functions over disjoint sets. Furthermore, if the notion of redundant constraints is properly defined, the sample efficiency results derived in Section 4 will still hold.\n\n6 Conclusion\n\nWe have proposed a novel reinforcement learning algorithm, called optimistic constraint propagation (OCP), that synthesizes efficient exploration and value function generalization for reinforcement learning in deterministic systems. We have shown that OCP is sample efficient if Q* lies in the given hypothesis class, or if the given hypothesis class is the span of pre-specified indicator functions over disjoint sets.\n\nIt is worth pointing out that for more general reinforcement learning problems, how to design provably sample efficient algorithms with value function generalization is currently still open. For instance, it is not clear how to establish such algorithms for the general agnostic learning case discussed in this paper, as well as for reinforcement learning in MDPs. One interesting direction for future research is to extend OCP, or a variant of it, to these two problems.\n\nReferences\n\n[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. Journal of Machine Learning Research - Proceedings Track, 19:1–26, 2011.\n\n[2] Peter Auer and Ronald Ortner. 
Logarithmic online regret bounds for undiscounted reinforcement learning. In NIPS, pages 49\u201356, 2006.\n\n[3] Mohammad Gheshlaghi Azar, Alessandro Lazaric, and Emma Brunskill. Regret bounds for reinforcement learning with policy advice. CoRR, abs/1305.1027, 2013.\n\n[4] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pages 35\u201342, June 2009.\n\n[5] Dimitri P. Bertsekas and John Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, September 1996.\n\n[6] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213\u2013231, 2002.\n\n[7] Geoffrey Gordon. Online fitted reinforcement learning. In Advances in Neural Information Processing Systems 8, pages 1052\u20131058. MIT Press, 1995.\n\n[8] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient reinforcement learning for high dimensional linear quadratic systems. In NIPS, 2012.\n\n[9] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563\u20131600, 2010.\n\n[10] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.\n\n[11] Narendra Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4(4):373\u2013396, 1984.\n\n[12] Michael J. Kearns and Daphne Koller. Efficient reinforcement learning in factored MDPs. In IJCAI, pages 740\u2013747, 1999.\n\n[13] Michael J. Kearns and Satinder P. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209\u2013232, 2002.\n\n[14] Tor Lattimore, Marcus Hutter, and Peter Sunehag. 
The sample-complexity of general reinforcement learning. In ICML, 2013.\n\n[15] Lihong Li and Michael Littman. Reducing reinforcement learning to KWIK online regression. Annals of Mathematics and Artificial Intelligence, 2010.\n\n[16] Lihong Li, Michael L. Littman, and Thomas J. Walsh. Knows what it knows: a framework for self-aware learning. In ICML, pages 568\u2013575, 2008.\n\n[17] Ronald Ortner and Daniil Ryabko. Online regret bounds for undiscounted continuous reinforcement learning. In NIPS, 2012.\n\n[18] Warren Powell and Ilya Ryzhov. Optimal Learning. John Wiley and Sons, 2011.\n\n[19] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical report, 1994.\n\n[20] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. CoRR, abs/1301.2609, 2013.\n\n[21] Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Reinforcement learning with soft state aggregation. In NIPS, pages 361\u2013368, 1994.\n\n[22] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881\u2013888, 2006.\n\n[23] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, March 1998.\n\n[24] Csaba Szepesv\u00e1ri. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.\n\n[25] John N. Tsitsiklis and Benjamin Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59\u201394, 1996.\n\n[26] Benjamin Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. 
Res., 31(2):234\u2013244, 2006.", "award": [], "sourceid": 1379, "authors": [{"given_name": "Zheng", "family_name": "Wen", "institution": "Stanford University"}, {"given_name": "Benjamin", "family_name": "Van Roy", "institution": "Stanford University"}]}