{"title": "Efficient Exploration and Value Function Generalization in Deterministic Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 3021, "page_last": 3029, "abstract": "We consider the problem of reinforcement learning over episodes of a finite-horizon deterministic system and as a solution propose optimistic constraint propagation (OCP), an algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function lies within the given hypothesis class, OCP selects optimal actions over all but at most K episodes, where K is the eluder dimension of the given hypothesis class. We establish further efficiency and asymptotic performance guarantees that apply even if the true value function does not lie in the given hypothesis space, for the special case where the hypothesis space is the span of pre-specified indicator functions over disjoint sets.", "full_text": "Ef\ufb01cient Exploration and Value Function\nGeneralization in Deterministic Systems\n\nZheng Wen\n\nStanford University\n\nzhengwen@stanford.edu\n\nBenjamin Van Roy\nStanford University\n\nbvr@stanford.edu\n\nAbstract\n\nWe consider the problem of reinforcement learning over episodes of a \ufb01nite-\nhorizon deterministic system and as a solution propose optimistic constraint prop-\nagation (OCP), an algorithm designed to synthesize ef\ufb01cient exploration and\nvalue function generalization. We establish that when the true value function Q\u21e4\nlies within the hypothesis class Q, OCP selects optimal actions over all but at most\ndimE[Q] episodes, where dimE denotes the eluder dimension. 
We establish further efficiency and asymptotic performance guarantees that apply even if Q* does not lie in Q, for the special case where Q is the span of pre-specified indicator functions over disjoint sets.\n\n1 Introduction\n\nA growing body of work on efficient reinforcement learning provides algorithms with guarantees on sample and computational efficiency [13, 6, 2, 22, 4, 9]. This literature highlights the point that an effective exploration scheme is critical to the design of any efficient reinforcement learning algorithm. In particular, popular exploration schemes such as ε-greedy, Boltzmann, and knowledge gradient can require learning times that grow exponentially in the number of states and/or the planning horizon.\nThe aforementioned literature focuses on tabula rasa learning; that is, algorithms aim to learn with little or no prior knowledge about transition probabilities and rewards. Such algorithms require learning times that grow at least linearly with the number of states. Despite the valuable insights that have been generated through their design and analysis, these algorithms are of limited practical import because state spaces in most contexts of practical interest are enormous. There is a need for algorithms that generalize from past experience in order to learn how to make effective decisions in reasonable time.\nThere has been much work on reinforcement learning algorithms that generalize (see, e.g., [5, 23, 24, 18] and references therein). Most of these algorithms do not come with statistical or computational efficiency guarantees, though there are a few noteworthy exceptions, which we now discuss. A number of results treat policy-based algorithms (see [10, 3] and references therein), in which the goal is to select high performers among a pre-specified collection of policies as learning progresses.
Though interesting results have been produced in this line of work, each entails quite restrictive assumptions or does not make strong guarantees. Another body of work focuses on model-based algorithms. An algorithm is proposed in [12] that fits a factored model to observed data and makes decisions based on the fitted model. The authors establish a sample complexity bound that is polynomial in the number of model parameters rather than the number of states, but the algorithm is computationally intractable because of the difficulty of solving factored MDPs. A recent paper [14] proposes a novel algorithm for the case where the true environment is known to belong to a finite or compact class of models, and shows that its sample complexity is polynomial in the cardinality of the model class if the model class is finite, or in the ε-covering number if the model class is compact. Though this result is theoretically interesting, for most model classes of interest the ε-covering number is enormous, since it typically grows exponentially in the number of free parameters. Another recent paper [17] establishes a regret bound for an algorithm that applies to problems with continuous state spaces and Hölder-continuous rewards and transition kernels. Though the results represent an interesting contribution to the literature, a couple of features of the regret bound weaken its practical implications. First, regret grows linearly with the Hölder constant of the transition kernel, which for most contexts of practical relevance grows exponentially in the number of state variables. Second, the dependence on time becomes arbitrarily close to linear as the dimension of the state space grows. Reinforcement learning in linear systems with quadratic cost is treated in [1]. The method proposed is shown to realize regret that grows with the square root of time.
The result is interesting and the property is desirable, but to the best of our knowledge, expressions derived for regret in the analysis exhibit an exponential dependence on the number of state variables, and further, we are not aware of a computationally efficient way of implementing the proposed method. This work was extended by [8] to address linear systems with sparse structure. Here, there are efficiency guarantees that scale gracefully with the number of state variables, but only under sparsity and other technical assumptions.\nThe most popular approach to generalization in the applied reinforcement learning literature involves fitting parameterized value functions. Such approaches relate closely to supervised learning in that they learn functions from state to value, though a difference is that value is influenced by action and observed only through delayed feedback. One advantage over model-learning approaches is that, given a fitted value function, decisions can be made without solving a potentially intractable control problem. We see this as a promising direction, though there is currently a lack of theoretical results that provide attractive bounds on learning time with value function generalization. A relevant paper along this line of research is [15], which studies efficient reinforcement learning with value function generalization in the KWIK framework (see [16]), and reduces the efficient reinforcement learning problem to the efficient KWIK online regression problem. However, the authors do not show how to solve the general KWIK online regression problem efficiently, and it is not even clear whether this is possible.
Thus, though the result of [15] is interesting, it does not provide a provably efficient algorithm.\nAn important challenge that remains is to couple exploration and value function generalization in a provably effective way, and in particular, to establish sample and computational efficiency guarantees that scale gracefully with the planning horizon and model complexity. In this paper, we aim to make progress in this direction. To start with a simple context, we restrict our attention to deterministic systems that evolve over finite time horizons, and we consider episodic learning, in which an agent repeatedly interacts with the same system. As a solution to the problem, we propose optimistic constraint propagation (OCP), a computationally efficient reinforcement learning algorithm designed to synthesize efficient exploration and value function generalization. We establish that when the true value function Q* lies within the hypothesis class Q, OCP selects optimal actions over all but at most dim_E[Q] episodes. Here, dim_E denotes the eluder dimension, which quantifies the complexity of the hypothesis class. A corollary of this result is that regret is bounded by a function that is constant over time and linear in the problem horizon and eluder dimension.\nTo put this result in perspective, it is useful to relate it to other lines of work. Consider first the broad area of reinforcement learning algorithms that fit value functions, such as SARSA [19]. Even with the most commonly used sort of hypothesis class Q, which is made up of linear combinations of fixed basis functions, and even when the hypothesis class contains the true value function Q*, there are no guarantees that these algorithms will efficiently learn to make near-optimal decisions.
On the other hand, our result implies that OCP attains near-optimal performance in time that scales linearly with the number of basis functions. Now consider the more specialized context of a deterministic linear system with quadratic cost and a finite time horizon. The analysis of [1] can be leveraged to produce regret bounds that scale exponentially in the number of state variables. On the other hand, using a hypothesis space Q consisting of quadratic functions of state-action pairs, the results of this paper show that OCP behaves near-optimally within time that scales quadratically in the number of state and action variables.\nWe also establish efficiency and asymptotic performance guarantees that apply to agnostic reinforcement learning, where Q* does not necessarily lie in Q. In particular, we consider the case where Q is the span of pre-specified indicator functions over disjoint sets. Our results here add to the literature on agnostic reinforcement learning with such a hypothesis class [21, 25, 7, 26]. Prior work in this area has produced interesting algorithms and insights, as well as bounds on performance loss associated with potential limits of convergence, but no convergence or efficiency guarantees.\n\n2 Reinforcement Learning in Deterministic Systems\n\nIn this paper, we consider an episodic reinforcement learning (RL) problem in which an agent repeatedly interacts with a discrete-time finite-horizon deterministic system, and we refer to each interaction as an episode. The system is identified by a sextuple M = (S, A, H, F, R, S), where S is the state space, A is the action space, H is the horizon, F is a system function, R is a reward function, and S is a sequence of initial states. If action a ∈ A is selected while the system is in state x ∈ S at period t = 0, 1, ..., H-1, a reward of R_t(x, a) is realized; furthermore, if t < H-1, the state transitions to F_t(x, a).
Each episode terminates at period H-1, and then a new episode begins. The initial state of episode j is the jth element of S.\nTo represent the history of actions and observations over multiple episodes, we will often index variables by both episode and period. For example, x_{j,t} and a_{j,t} denote the state and action at period t of episode j, where j = 0, 1, ... and t = 0, 1, ..., H-1. To count the total number of steps since the agent started learning, we say period t of episode j is time jH + t.\nA (deterministic) policy µ = (µ_0, ..., µ_{H-1}) is a sequence of functions, each mapping S to A. For each policy µ, define a value function V^µ_t(x) = Σ_{τ=t}^{H-1} R_τ(x_τ, a_τ), where x_t = x, x_{τ+1} = F_τ(x_τ, a_τ), and a_τ = µ_τ(x_τ). The optimal value function is defined by V*_t(x) = sup_µ V^µ_t(x). A policy µ* is said to be optimal if V^{µ*} = V*. Throughout this paper, we will restrict attention to systems M = (S, A, H, F, R, S) that admit optimal policies. Note that this restriction incurs no loss of generality when the action space is finite.\nIt is also useful to define an action-contingent optimal value function: Q*_t(x, a) = R_t(x, a) + V*_{t+1}(F_t(x, a)) for t < H-1, and Q*_{H-1}(x, a) = R_{H-1}(x, a). Then, a policy µ* is optimal if µ*_t(x) ∈ argmax_{a∈A} Q*_t(x, a) for all (x, t).\nA reinforcement learning algorithm generates each action a_{j,t} based on observations made up to the tth period of the jth episode, including all states, actions, and rewards observed in previous episodes and earlier in the current episode, as well as the state space S, action space A, horizon H, and possible prior information. In each episode, the algorithm realizes reward R^{(j)} = Σ_{t=0}^{H-1} R_t(x_{j,t}, a_{j,t}). Note that R^{(j)} ≤ V*_0(x_{j,0}) for each jth episode.
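The backward recursion defining Q* and V* above can be computed directly when S and A are small finite sets. The following is a minimal Python sketch, assuming the system function F and reward function R are encoded as lists of dictionaries (an illustrative encoding, not one used in the paper):

```python
def optimal_q(states, actions, H, F, R):
    """Compute Q*_t(x, a) and V*_t(x) by backward induction.

    F[t][(x, a)] gives the successor state F_t(x, a) for t < H - 1, and
    R[t][(x, a)] gives the reward R_t(x, a); both are illustrative
    dictionary encodings of the system and reward functions.
    """
    Q = [dict() for _ in range(H)]
    V = [dict() for _ in range(H)]
    for t in reversed(range(H)):
        for x in states:
            for a in actions:
                if t < H - 1:
                    # Q*_t(x, a) = R_t(x, a) + V*_{t+1}(F_t(x, a))
                    Q[t][(x, a)] = R[t][(x, a)] + V[t + 1][F[t][(x, a)]]
                else:
                    # Q*_{H-1}(x, a) = R_{H-1}(x, a)
                    Q[t][(x, a)] = R[t][(x, a)]
            V[t][x] = max(Q[t][(x, a)] for a in actions)
    return Q, V


# A two-period toy system: the action chosen at t = 0 becomes the next
# state, and the terminal reward depends only on the state reached.
states, actions, H = [0, 1], [0, 1], 2
F = [{(x, a): a for x in states for a in actions}]
R = [{(x, a): float(a) for x in states for a in actions},
     {(x, a): 2.0 * x for x in states for a in actions}]
Q, V = optimal_q(states, actions, H, F, R)
```

In this toy system V*_0(x) = 3 for both initial states: taking a = 1 at t = 0 earns reward 1 and moves the system to state 1, which is worth 2 at the terminal period.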
One way to quantify performance of a reinforcement learning algorithm is in terms of the number of episodes J_L for which R^{(j)} < V*_0(x_{j,0}) - ε, where ε ≥ 0 is a pre-specified performance loss threshold. If the reward function R is bounded, with |R_t(x, a)| ≤ R̄ for all (x, a, t), then this also implies a bound on regret over episodes experienced prior to time T, defined by Regret(T) = Σ_{j=0}^{⌊T/H⌋-1} (V*_0(x_{j,0}) - R^{(j)}). In particular, Regret(T) ≤ 2R̄HJ_L + ε⌊T/H⌋.\n\n3 Optimistic Constraint Propagation\n\nAt a high level, our reinforcement learning algorithm – optimistic constraint propagation (OCP) – selects actions based on the optimism-in-the-face-of-uncertainty principle and, based on observed rewards and state transitions, propagates constraints backwards through time. Specifically, it takes as input the state space S, the action space A, the horizon H, and a hypothesis class Q of candidates for Q*. The algorithm maintains a sequence of subsets of Q and a sequence of scalar “upper bounds”, which summarize constraints that past experience suggests for ruling out hypotheses. Each constraint in this sequence is specified by a state x ∈ S, an action a ∈ A, a period t = 0, ..., H-1, and an interval [L, U] ⊆ ℝ, and takes the form {Q ∈ Q : L ≤ Q_t(x, a) ≤ U}. The upper bound of the constraint is U. Given a sequence C = (C_1, ..., C_{|C|}) of such constraints with upper bounds U = (U_1, ..., U_{|C|}), a set Q^C is defined constructively by Algorithm 1. Note that if the constraints do not conflict, then Q^C = C_1 ∩ ··· ∩ C_{|C|}.
When constraints do conflict, priority is assigned first based on upper bound, with smaller upper bounds preferred, and then, in the event of ties in upper bound, based on position in the sequence, with more recent experience preferred.\n\nAlgorithm 1 Constraint Selection\nRequire: Q, C\n  Q^C ← Q, u ← min U\n  while u ≤ ∞ do\n    for τ = |C| to 1 do\n      if U_τ = u and Q^C ∩ C_τ ≠ ∅ then\n        Q^C ← Q^C ∩ C_τ\n      end if\n    end for\n    if {u′ ∈ U : u′ > u} = ∅ then\n      return Q^C\n    end if\n    u ← min{u′ ∈ U : u′ > u}\n  end while\n\nOCP, presented below as Algorithm 2, at each time t computes, for the current state x_{j,t} and each action a, the greatest state-action value Q_t(x_{j,t}, a) among functions in Q^C and selects an action that attains the maximum. In other words, an action is chosen based on the most optimistic feasible outcome subject to constraints. The subsequent reward and state transition give rise to a new constraint that is used to update C. Note that the update of C is postponed until the episode is completed.\n\nAlgorithm 2 Optimistic Constraint Propagation\nRequire: S, A, H, Q\n  Initialize C ← ∅\n  for episode j = 0, 1, ... do\n    Set C′ ← C\n    for period t = 0, 1, ..., H-1 do\n      Apply a_{j,t} ∈ argmax_{a∈A} sup_{Q∈Q^C} Q_t(x_{j,t}, a)\n      if t < H-1 then\n        U_{j,t} ← sup_{Q∈Q^C} (R_t(x_{j,t}, a_{j,t}) + sup_{a∈A} Q_{t+1}(x_{j,t+1}, a))\n        L_{j,t} ← inf_{Q∈Q^C} (R_t(x_{j,t}, a_{j,t}) + sup_{a∈A} Q_{t+1}(x_{j,t+1}, a))\n      else\n        U_{j,t} ← R_t(x_{j,t}, a_{j,t})\n        L_{j,t} ← R_t(x_{j,t}, a_{j,t})\n      end if\n      Append {Q ∈ Q : L_{j,t} ≤ Q_t(x_{j,t}, a_{j,t}) ≤ U_{j,t}} to C′\n    end for\n    Update C ← C′\n  end for\n\nNote that if Q* ∈ Q then each constraint appended to C does not rule out Q*, and therefore, the sequence of sets Q^C generated as the algorithm progresses is decreasing and contains Q* in its intersection.
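For intuition, consider the special case where S and A are finite and Q is all of ℝ^{|S|·|A|·H}. There each constraint involves a single coordinate Q_t(x, a), constraints never conflict, and Q^C is a box representable by one upper bound per (t, x, a); action selection needs only these upper bounds, so the sketch below omits the lower bounds. This is an illustrative simplification under those assumptions, not the general algorithm, and the dictionary encodings of F and R are assumptions of the sketch:

```python
def ocp_tabula_rasa(states, actions, H, F, R, init_states):
    """OCP specialized to Q = R^{|S||A|H}: Q^C is a box, so the optimistic
    value sup over Q^C of Q_t(x, a) is the stored upper bound U[t][(x, a)].

    F[t][(x, a)] and R[t][(x, a)] are illustrative dictionary encodings.
    Returns the reward realized in each episode.
    """
    INF = float("inf")
    U = [{(x, a): INF for x in states for a in actions} for t in range(H)]
    rewards = []
    for x0 in init_states:
        x, traj = x0, []
        for t in range(H):
            # Optimistic action: maximize the upper bound on Q_t(x, .).
            a = max(actions, key=lambda b: U[t][(x, b)])
            traj.append((t, x, a, R[t][(x, a)]))
            if t < H - 1:
                x = F[t][(x, a)]
        rewards.append(sum(r for (_, _, _, r) in traj))
        # Constraint propagation, postponed until the episode completes;
        # bounds at period t use the pre-episode bounds at period t + 1.
        for (t, x, a, r) in traj:
            if t < H - 1:
                nxt = F[t][(x, a)]
                U[t][(x, a)] = r + max(U[t + 1][(nxt, b)] for b in actions)
            else:
                U[t][(x, a)] = r
    return rewards


# Same two-period toy system as before: the action at t = 0 becomes the
# next state, and the terminal reward depends only on the state reached.
states, actions, H = [0, 1], [0, 1], 2
F = [{(x, a): a for x in states for a in actions}]
R = [{(x, a): float(a) for x in states for a in actions},
     {(x, a): 2.0 * x for x in states for a in actions}]
rewards = ocp_tabula_rasa(states, actions, H, F, R, [0] * 6)
```

On this toy system the optimal per-episode reward is 3; the sketch acts suboptimally in the first three episodes while upper bounds are refined and optimally thereafter, consistent with the bound of at most dim_E[Q] = |S|·|A|·H = 8 suboptimal episodes.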
In the agnostic case, where Q* may not lie in Q, new constraints can be inconsistent with previous constraints, in which case selected previous constraints are relaxed as determined by Algorithm 1.\nLet us briefly discuss several contexts of practical relevance and/or theoretical interest in which OCP can be applied.\n\n• Finite state/action tabula rasa case. With finite state and action spaces, Q* can be represented as a vector, and without special prior knowledge, it is natural to let Q = ℝ^{|S|·|A|·H}.\n• Polytopic prior constraints. Consider the aforementioned example, but suppose that we have prior knowledge that Q* lies in a particular polytope. Then we can let Q be that polytope and again apply OCP.\n• Linear systems with quadratic cost (LQ). In this classical control model, if S =