{"title": "On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 1484, "page_last": 1492, "abstract": "The ability to learn a policy for a sequential decision problem with continuous state space using on-line data is a long-standing challenge. This paper presents a new reinforcement-learning algorithm, called iKBSF, which extends the benefits of kernel-based learning to the on-line scenario. As a kernel-based method, the proposed algorithm is stable and has good convergence properties. However, unlike other similar algorithms, iKBSF's space complexity is independent of the number of sample transitions, and as a result it can process an arbitrary amount of data. We present theoretical results showing that iKBSF can approximate (to any level of accuracy) the value function that would be learned by an equivalent batch non-parametric kernel-based reinforcement learning approximator. In order to show the effectiveness of the proposed algorithm in practice, we apply iKBSF to the challenging three-pole balancing task, where the ability to process a large number of transitions is crucial for achieving a high success rate.", "full_text": "On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

André M. S. Barreto, Doina Precup, Joelle Pineau
School of Computer Science, McGill University, Montreal, Canada
amsb@cs.mcgill.ca, dprecup@cs.mcgill.ca, jpineau@cs.mcgill.ca

Abstract

Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions.
What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF's memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF's MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging three-pole balancing task, in which the ability to process a large number of transitions is crucial for success.

1 Introduction

The task of learning a policy for a sequential decision problem with continuous state space is a long-standing challenge that has attracted the attention of the reinforcement learning community for years. Among the many approaches that have been proposed to solve this problem, kernel-based reinforcement learning (KBRL) stands out for its good theoretical guarantees [1, 2]. KBRL solves a continuous state-space Markov decision process (MDP) using a finite model constructed based on sample transitions only. By casting the problem as a non-parametric approximation, it provides a statistically consistent way of approximating an MDP's value function.
Moreover, since it comes down to the solution of a finite model, KBRL always converges to a unique solution.

Unfortunately, the good theoretical properties of kernel-based learning come at a price: since the model constructed by KBRL grows with the amount of sample transitions, the number of operations performed by this algorithm quickly becomes prohibitively large as more data become available. Such a computational burden severely limits the applicability of KBRL to real reinforcement learning (RL) problems. Realizing that, many researchers have proposed ways of turning KBRL into a more practical tool [3, 4, 5]. In this paper we focus on our own approach to leveraging KBRL, an algorithm called kernel-based stochastic factorization (KBSF) [4].

KBSF uses KBRL's kernel-based strategy to perform a soft aggregation of the states of its MDP. By doing so, our algorithm is able to summarize the information contained in KBRL's model in an MDP whose size is independent of the number of sample transitions. KBSF enjoys good theoretical guarantees and has shown excellent performance on several tasks [4]. The main limitation of the algorithm is the fact that, in order to construct its model, it uses an amount of memory that grows linearly with the number of sample transitions. Although this is a significant improvement over KBRL, it still hinders the application of KBSF in scenarios in which a large amount of data must be processed, such as in complex domains or in on-line reinforcement learning.

In this paper we show that it is possible to construct KBSF's MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. In order to distinguish it from its original, batch counterpart, we call this new version of our algorithm incremental KBSF, or iKBSF for short.
As will be seen, iKBSF is able to process an arbitrary number of sample transitions. This results in a model-based RL algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes.

A second important contribution of this paper is a theoretical analysis showing that it is possible to control the error in the value-function approximation performed by KBSF. In our previous experiments with KBSF, we defined the model used by this algorithm by clustering the sample transitions and then using the clusters' centers as the representative states in the reduced MDP [4]. However, we did not provide a theoretical justification for such a strategy. In this paper we fill this gap by showing that we can approximate KBRL's value function at any desired level of accuracy by minimizing the distance from a sampled state to the nearest representative state. Besides its theoretical interest, the bound is also relevant from a practical point of view, since it can be used in iKBSF to guide the on-line selection of representative states.

Finally, a third contribution of this paper is an empirical demonstration of the performance of iKBSF in a new, challenging control problem: the triple pole-balancing task, an extension of the well-known double pole-balancing domain. Here, iKBSF's ability to process a large number of transitions is crucial for achieving a high success rate, which cannot be easily replicated with batch methods.

2 Background

In reinforcement learning, an agent interacts with an environment in order to find a policy that maximizes the discounted sum of rewards [6]. As usual, we assume that such an interaction can be modeled as a Markov decision process (MDP, [7]). An MDP is a tuple $M \equiv (S, A, P^a, r^a, \gamma)$, where $S$ is the state space and $A$ is the (finite) action set.
In this paper we are mostly concerned with MDPs with continuous state spaces, but our strategy will be to approximate such models as finite MDPs. In a finite MDP the matrix $P^a \in \mathbb{R}^{|S| \times |S|}$ gives the transition probabilities associated with action $a \in A$ and the vector $r^a \in \mathbb{R}^{|S|}$ stores the corresponding expected rewards. The discount factor $\gamma \in [0, 1)$ is used to give smaller weights to rewards received further in the future.

Consider an MDP $M$ with continuous state space $S \subset [0, 1]^d$. Kernel-based reinforcement learning (KBRL) uses sample transitions to derive a finite MDP that approximates the continuous model [1, 2]. Let $S^a = \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = 1, 2, \ldots, n_a\}$ be the sample transitions associated with action $a \in A$, where $s^a_k, \hat{s}^a_k \in S$ and $r^a_k \in \mathbb{R}$. Let $\phi : \mathbb{R}^+ \mapsto \mathbb{R}^+$ be a Lipschitz continuous function and let $k_\tau(s, s')$ be a kernel function defined as $k_\tau(s, s') = \phi(\|s - s'\|/\tau)$, where $\|\cdot\|$ is a norm in $\mathbb{R}^d$ and $\tau > 0$. Finally, define the normalized kernel function associated with action $a$ as $\kappa^a_\tau(s, s^a_i) = k_\tau(s, s^a_i) / \sum_{j=1}^{n_a} k_\tau(s, s^a_j)$. The model constructed by KBRL has the following transition and reward functions:

$$\hat{P}^a(s' \mid s) = \begin{cases} \kappa^a_\tau(s, s^a_i), & \text{if } s' = \hat{s}^a_i, \\ 0, & \text{otherwise,} \end{cases} \qquad \hat{R}^a(s, s') = \begin{cases} r^a_i, & \text{if } s' = \hat{s}^a_i, \\ 0, & \text{otherwise.} \end{cases} \quad (1)$$

Since only transitions ending in the states $\hat{s}^a_i$ have a non-zero probability of occurrence, one can define a finite MDP $\hat{M}$ composed solely of these $n = \sum_a n_a$ states [2, 3]. After $\hat{V}^*$, the optimal value function of $\hat{M}$, has been computed, the value of any state-action pair can be determined as $Q(s, a) = \sum_{i=1}^{n_a} \kappa^a_\tau(s, s^a_i)\left[r^a_i + \gamma \hat{V}^*(\hat{s}^a_i)\right]$, where $s \in S$ and $a \in A$.
Ormoneit and Sen [1] proved that, if $n_a \to \infty$ for all $a \in A$ and the widths of the kernels $\tau$ shrink at an "admissible" rate, the probability of choosing a suboptimal action based on $Q(s, a)$ converges to zero.

Using dynamic programming, one can compute the optimal value function of $\hat{M}$, but the time and space required to do so grow fast with the number of states $n$ [7, 8]. Therefore, the use of KBRL leads to a dilemma: on the one hand, one wants to use as many transitions as possible to capture the dynamics of $M$, but on the other hand one wants to have an MDP $\hat{M}$ of manageable size.

Kernel-based stochastic factorization (KBSF) provides a practical way of weighing these two conflicting objectives [4]. Our algorithm compresses the information contained in KBRL's model $\hat{M}$ in an MDP $\bar{M}$ whose size is independent of the number of transitions $n$. The fundamental idea behind KBSF is the "stochastic-factorization trick", which we now summarize. Let $P \in \mathbb{R}^{n \times n}$ be a transition-probability matrix and let $P = DK$ be a factorization in which $D \in \mathbb{R}^{n \times m}$ and $K \in \mathbb{R}^{m \times n}$ are stochastic matrices. Then, swapping the factors $D$ and $K$ yields another transition matrix $\bar{P} = KD$ that retains the basic topology of $P$, that is, the number of recurrent classes and their respective reducibilities and periodicities [9]. The insight is that, in some cases, one can work with $\bar{P}$ instead of $P$; when $m \ll n$, this replacement significantly reduces the memory usage and computing time.

KBSF results from the application of the stochastic-factorization trick to $\hat{M}$. Let $\bar{S} \equiv \{\bar{s}_1, \bar{s}_2, \ldots, \bar{s}_m\}$ be a set of representative states in $S$.
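The factor-swapping trick above can be checked numerically. The following sketch (with made-up dimensions $n = 4$, $m = 2$; all names are ours) builds random stochastic factors and verifies that both $DK$ and the smaller swapped product $KD$ are again stochastic matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2

# Random stochastic factors: D is n x m, K is m x n, rows sum to 1.
D = rng.random((n, m)); D /= D.sum(axis=1, keepdims=True)
K = rng.random((m, n)); K /= K.sum(axis=1, keepdims=True)

P = D @ K        # n x n transition matrix
P_bar = K @ D    # m x m transition matrix obtained by swapping the factors

# Both products are stochastic matrices; P_bar is much smaller when m << n.
assert np.allclose(P.sum(axis=1), 1.0)
assert np.allclose(P_bar.sum(axis=1), 1.0)
```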
KBSF computes matrices $\dot{D}^a \in \mathbb{R}^{n_a \times m}$ and $\dot{K}^a \in \mathbb{R}^{m \times n_a}$ with elements $\dot{d}^a_{ij} = \kappa_{\bar\tau}(\hat{s}^a_i, \bar{s}_j)$ and $\dot{k}^a_{ij} = \kappa^a_\tau(\bar{s}_i, s^a_j)$, where $\kappa_{\bar\tau}$ is defined as $\kappa_{\bar\tau}(s, \bar{s}_i) = k_{\bar\tau}(s, \bar{s}_i) / \sum_{j=1}^{m} k_{\bar\tau}(s, \bar{s}_j)$. The basic idea of the algorithm is to replace the MDP $\hat{M}$ with $\bar{M} \equiv (\bar{S}, A, \bar{P}^a, \bar{r}^a, \gamma)$, where $\bar{P}^a = \dot{K}^a \dot{D}^a$ and $\bar{r}^a = \dot{K}^a r^a$ ($r^a \in \mathbb{R}^{n_a}$ is the vector composed of the sample rewards $r^a_i$). Thus, instead of solving an MDP with $n$ states, one solves a model with $m$ states only. Let $D^\top \equiv [(\dot{D}^1)^\top (\dot{D}^2)^\top \cdots (\dot{D}^{|A|})^\top] \in \mathbb{R}^{m \times n}$ and let $K \equiv [\dot{K}^1\, \dot{K}^2 \cdots \dot{K}^{|A|}] \in \mathbb{R}^{m \times n}$. Based on $\bar{Q}^* \in \mathbb{R}^{m \times |A|}$, the optimal action-value function of $\bar{M}$, one can obtain an approximate value function for $\hat{M}$ as $\tilde{v} = \Gamma D \bar{Q}^*$, where $\Gamma$ is the 'max' operator applied row-wise, that is, $\tilde{v}_i = \max_a (D\bar{Q}^*)_{ia}$.
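The construction of $\bar{P}^a$ and $\bar{r}^a$ for a single action can be sketched as follows; this is a minimal illustration assuming a Gaussian kernel $\phi(u) = \exp(-u^2)$ and random data, with function and variable names of our choosing:

```python
import numpy as np

def kernel(X, Y, width):
    """k_tau(x, y) for all pairs, with phi(u) = exp(-u^2)."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return np.exp(-(d / width) ** 2)

def kbsf_model(S, S_hat, r, S_bar, tau, tau_bar):
    """Build P_bar = K_dot @ D_dot and r_bar = K_dot @ r for one action."""
    K_raw = kernel(S_bar, S, tau)             # m x n_a
    K_dot = K_raw / K_raw.sum(axis=1, keepdims=True)
    D_raw = kernel(S_hat, S_bar, tau_bar)     # n_a x m
    D_dot = D_raw / D_raw.sum(axis=1, keepdims=True)
    return K_dot @ D_dot, K_dot @ r

rng = np.random.default_rng(1)
S, S_hat = rng.random((50, 2)), rng.random((50, 2))   # start / end states
r = rng.random(50)                                    # sampled rewards
S_bar = rng.random((5, 2))                            # representative states
P_bar, r_bar = kbsf_model(S, S_hat, r, S_bar, tau=0.3, tau_bar=0.3)
assert np.allclose(P_bar.sum(axis=1), 1.0)            # P_bar is stochastic
```

Note that the reduced model has $m = 5$ states regardless of the $n_a = 50$ transitions used to build it.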
We have shown that the error in $\tilde{v}$ is bounded by

$$\|\hat{v}^* - \tilde{v}\|_\infty \le \frac{1}{1-\gamma} \max_a \|\hat{r}^a - D\bar{r}^a\|_\infty + \frac{1}{(1-\gamma)^2}\left(\bar{C} \max_i \left(1 - \max_j d_{ij}\right) + \frac{\hat{C}\gamma}{2} \max_a \left\|\hat{P}^a - DK^a\right\|_\infty\right), \quad (2)$$

where $\|\cdot\|_\infty$ is the infinity norm, $\hat{v}^* \in \mathbb{R}^n$ is the optimal value function of KBRL's MDP, $\hat{C} = \max_{a,i} \hat{r}^a_i - \min_{a,i} \hat{r}^a_i$, $\bar{C} = \max_{a,i} \bar{r}^a_i - \min_{a,i} \bar{r}^a_i$, and $K^a$ is matrix $K$ with all elements equal to zero except for those corresponding to matrix $\dot{K}^a$ (see [4] for details).

3 Incremental kernel-based stochastic factorization

In the batch version of KBSF, described in Section 2, the matrices $\bar{P}^a$ and vectors $\bar{r}^a$ are determined using all the transitions in the corresponding sets $S^a$ simultaneously. This has two undesirable consequences. First, the construction of the MDP $\bar{M}$ requires an amount of memory of $O(n_{\max} m)$, where $n_{\max} = \max_a n_a$. Although this is a significant improvement over KBRL's memory usage, which is $O(n_{\max}^2)$, in more challenging domains even a linear dependence on $n_{\max}$ may be impractical. Second, with batch KBSF the only way to incorporate new data into the model $\bar{M}$ is to recompute the multiplication $\bar{P}^a = \dot{K}^a \dot{D}^a$ for all actions $a$ for which there are new sample transitions available. Even if we ignore the issue of memory usage, this is clearly inefficient in terms of computation. In this section we present an incremental version of KBSF that circumvents these important limitations.

Suppose we split the set of sample transitions $S^a$ in two subsets $S_1$ and $S_2$ such that $S_1 \cap S_2 = \emptyset$ and $S_1 \cup S_2 = S^a$.
Without loss of generality, suppose that the sample transitions are indexed so that $S_1 \equiv \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = 1, 2, \ldots, n_1\}$ and $S_2 \equiv \{(s^a_k, r^a_k, \hat{s}^a_k) \mid k = n_1 + 1, n_1 + 2, \ldots, n_1 + n_2 = n_a\}$. Let $\bar{P}^{S_1}$ and $\bar{r}^{S_1}$ be matrix $\bar{P}^a$ and vector $\bar{r}^a$ computed by KBSF using only the $n_1$ transitions in $S_1$ (if $n_1 = 0$, we define $\bar{P}^{S_1} = 0 \in \mathbb{R}^{m \times m}$ and $\bar{r}^{S_1} = 0 \in \mathbb{R}^m$ for all $a \in A$). We want to compute $\bar{P}^{S_1 \cup S_2}$ and $\bar{r}^{S_1 \cup S_2}$ from $\bar{P}^{S_1}$, $\bar{r}^{S_1}$, and $S_2$, without using the set of sample transitions $S_1$.

We start with the transition matrices $\bar{P}^a$. We know that

$$\bar{p}^{S_1}_{ij} = \sum_{t=1}^{n_1} \dot{k}^a_{it}\, \dot{d}^a_{tj} = \sum_{t=1}^{n_1} \frac{k_\tau(\bar{s}_i, s^a_t)}{\sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)} \cdot \frac{k_{\bar\tau}(\hat{s}^a_t, \bar{s}_j)}{\sum_{l=1}^{m} k_{\bar\tau}(\hat{s}^a_t, \bar{s}_l)} = \frac{1}{\sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)} \sum_{t=1}^{n_1} \frac{k_\tau(\bar{s}_i, s^a_t)\, k_{\bar\tau}(\hat{s}^a_t, \bar{s}_j)}{\sum_{l=1}^{m} k_{\bar\tau}(\hat{s}^a_t, \bar{s}_l)}.$$

To simplify the notation, define $w^{S_1}_i = \sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)$, $w^{S_2}_i = \sum_{l=n_1+1}^{n_1+n_2} k_\tau(\bar{s}_i, s^a_l)$, and $c^t_{ij} = k_\tau(\bar{s}_i, s^a_t)\, k_{\bar\tau}(\hat{s}^a_t, \bar{s}_j) / \sum_{l=1}^{m} k_{\bar\tau}(\hat{s}^a_t, \bar{s}_l)$, with $t \in \{1, 2, \ldots, n_1 + n_2\}$. Then,

$$\bar{p}^{S_1 \cup S_2}_{ij} = \frac{1}{w^{S_1}_i + w^{S_2}_i}\left(\sum_{t=1}^{n_1} c^t_{ij} + \sum_{t=n_1+1}^{n_1+n_2} c^t_{ij}\right) = \frac{1}{w^{S_1}_i + w^{S_2}_i}\left(\bar{p}^{S_1}_{ij}\, w^{S_1}_i + \sum_{t=n_1+1}^{n_1+n_2} c^t_{ij}\right).$$

Now, defining $b^{S_2}_{ij} = \sum_{t=n_1+1}^{n_1+n_2} c^t_{ij}$, we have the simple update rule:

$$\bar{p}^{S_1 \cup S_2}_{ij} = \frac{1}{w^{S_1}_i + w^{S_2}_i}\left(b^{S_2}_{ij} + \bar{p}^{S_1}_{ij}\, w^{S_1}_i\right). \quad (3)$$

We can apply similar reasoning to derive an update rule for the rewards $\bar{r}^a_i$.
We know that

$$\bar{r}^{S_1}_i = \frac{1}{\sum_{l=1}^{n_1} k_\tau(\bar{s}_i, s^a_l)} \sum_{t=1}^{n_1} k_\tau(\bar{s}_i, s^a_t)\, r^a_t = \frac{1}{w^{S_1}_i} \sum_{t=1}^{n_1} k_\tau(\bar{s}_i, s^a_t)\, r^a_t.$$

Let $h^t_i = k_\tau(\bar{s}_i, s^a_t)\, r^a_t$, with $t \in \{1, 2, \ldots, n_1 + n_2\}$. Then,

$$\bar{r}^{S_1 \cup S_2}_i = \frac{1}{w^{S_1}_i + w^{S_2}_i}\left(\sum_{t=1}^{n_1} h^t_i + \sum_{t=n_1+1}^{n_1+n_2} h^t_i\right) = \frac{1}{w^{S_1}_i + w^{S_2}_i}\left(w^{S_1}_i\, \bar{r}^{S_1}_i + \sum_{t=n_1+1}^{n_1+n_2} h^t_i\right).$$

Defining $e^{S_2}_i = \sum_{t=n_1+1}^{n_1+n_2} h^t_i$, we have the following update rule:

$$\bar{r}^{S_1 \cup S_2}_i = \frac{1}{w^{S_1}_i + w^{S_2}_i}\left(e^{S_2}_i + \bar{r}^{S_1}_i\, w^{S_1}_i\right). \quad (4)$$

Since $b^{S_2}_{ij}$, $e^{S_2}_i$, and $w^{S_2}_i$ can be computed based on $S_2$ only, we can discard the sample transitions in $S_1$ after computing $\bar{P}^{S_1}$ and $\bar{r}^{S_1}$. To do that, we only have to keep the variables $w^{S_1}_i$. These variables can be stored in $|A|$ vectors $w^a \in \mathbb{R}^m$, resulting in a modest memory overhead. Note that we can apply the ideas above recursively, further splitting the sets $S_1$ and $S_2$ into subsets of smaller size. Thus, we have a fully incremental way of computing KBSF's MDP which requires almost no extra memory.

Algorithm 1 shows a step-by-step description of how to update $\bar{M}$ based on a set of sample transitions. Using this method to update its model, KBSF's space complexity drops from $O(nm)$ to $O(m^2)$.
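Update rules (3) and (4) can be sketched in a few lines; the code below (names and the Gaussian kernel are our illustrative choices) processes a dataset once in a single chunk and once in two chunks, and checks that both orders produce the same model:

```python
import numpy as np

def gaussian_kernel(X, Y, width):
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)
    return np.exp(-(d / width) ** 2)

def incremental_update(P_bar, r_bar, w, S, S_hat, r, S_bar, tau, tau_bar):
    """Apply update rules (3) and (4) for one chunk (S, S_hat, r)."""
    kt = gaussian_kernel(S_bar, S, tau)          # m x chunk: k_tau(s_bar_i, s_t)
    kb = gaussian_kernel(S_hat, S_bar, tau_bar)  # chunk x m: k_taubar(s_hat_t, s_bar_j)
    z = kb.sum(axis=1)                           # normalizers of the D rows
    w_new = kt.sum(axis=1)                       # w^{S2}_i
    b = kt @ (kb / z[:, None])                   # b^{S2}_ij
    e = kt @ r                                   # e^{S2}_i
    P_bar = (b + P_bar * w[:, None]) / (w + w_new)[:, None]   # rule (3)
    r_bar = (e + r_bar * w) / (w + w_new)                     # rule (4)
    return P_bar, r_bar, w + w_new

rng = np.random.default_rng(2)
m, n = 5, 40
S_bar = rng.random((m, 2))
S, S_hat, r = rng.random((n, 2)), rng.random((n, 2)), rng.random(n)

# Process all n transitions in one chunk, then in two chunks of n/2.
P1, r1, w1 = incremental_update(np.zeros((m, m)), np.zeros(m), np.zeros(m),
                                S, S_hat, r, S_bar, 0.3, 0.3)
P2, r2, w2 = np.zeros((m, m)), np.zeros(m), np.zeros(m)
for lo, hi in [(0, n // 2), (n // 2, n)]:
    P2, r2, w2 = incremental_update(P2, r2, w2, S[lo:hi], S_hat[lo:hi],
                                    r[lo:hi], S_bar, 0.3, 0.3)
assert np.allclose(P1, P2) and np.allclose(r1, r2)   # same model either way
```

Only the $m \times m$ model, the $m$-vector of rewards, and the $m$-vector $w$ survive between chunks; the transitions themselves can be discarded.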
Since the amount of memory used by KBSF is now independent of $n$, it can process an arbitrary number of sample transitions.

Algorithm 1: Update KBSF's MDP
Input: $\bar{P}^a$, $\bar{r}^a$, $w^a$ for all $a \in A$; sets $S^a$ for all $a \in A$
Output: updated $\bar{M}$ and $w^a$
for each $a \in A$:
    $n_a \leftarrow |S^a|$
    for $t = 1, \ldots, n_a$: $z_t \leftarrow \sum_{l=1}^{m} k_{\bar\tau}(\hat{s}^a_t, \bar{s}_l)$
    for $i = 1, 2, \ldots, m$:
        $w' \leftarrow \sum_{t=1}^{n_a} k_\tau(\bar{s}_i, s^a_t)$
        for $j = 1, 2, \ldots, m$:
            $b \leftarrow \sum_{t=1}^{n_a} k_\tau(\bar{s}_i, s^a_t)\, k_{\bar\tau}(\hat{s}^a_t, \bar{s}_j) / z_t$
            $\bar{p}_{ij} \leftarrow \frac{1}{w^a_i + w'}(b + \bar{p}_{ij}\, w^a_i)$
        $e \leftarrow \sum_{t=1}^{n_a} k_\tau(\bar{s}_i, s^a_t)\, r^a_t$
        $\bar{r}_i \leftarrow \frac{1}{w^a_i + w'}(e + \bar{r}_i\, w^a_i)$
        $w^a_i \leftarrow w^a_i + w'$

Algorithm 2: Incremental KBSF (iKBSF)
Input: representative states $\bar{s}_i$, $i = 1, 2, \ldots, m$; interval $t_m$ to update the model; interval $t_v$ to update the value function; total number of sample transitions $n$
Output: approximate value function $\tilde{Q}(s, a)$
$\bar{Q} \leftarrow$ arbitrary matrix in $\mathbb{R}^{m \times |A|}$
$\bar{P}^a \leftarrow 0 \in \mathbb{R}^{m \times m}$, $\bar{r}^a \leftarrow 0 \in \mathbb{R}^m$, $w^a \leftarrow 0 \in \mathbb{R}^m$, for all $a \in A$
for $t = 1, 2, \ldots, n$:
    select $a$ based on $\tilde{Q}(s_t, a) = \sum_{i=1}^{m} \kappa_{\bar\tau}(s_t, \bar{s}_i)\, \bar{q}_{ia}$
    execute $a$ in $s_t$ and observe $r_t$ and $\hat{s}_t$
    $S^a \leftarrow S^a \cup \{(s_t, r_t, \hat{s}_t)\}$
    if $t \bmod t_m = 0$:
        add new representative states to $\bar{M}$ using the sets $S^a$
        update $\bar{M}$ and $w^a$ using Algorithm 1 and the sets $S^a$
        $S^a \leftarrow \emptyset$ for all $a \in A$
    if $t \bmod t_v = 0$: update $\bar{Q}$

Instead of assuming that $S_1$ and $S_2$ are a partition of a fixed dataset $S^a$, we can consider that $S_2$ was generated based on the policy learned by KBSF using the transitions in $S_1$. Thus, Algorithm 1 provides a flexible framework for integrating learning and planning within KBSF.
A general description of the incremental version of KBSF is given in Algorithm 2. iKBSF updates the model $\bar{M}$ and the value function $\bar{Q}$ at fixed intervals $t_m$ and $t_v$, respectively. When $t_m = t_v = n$, we recover the batch version of KBSF; when $t_m = t_v = 1$, we have an on-line method which stores no sample transitions. Note that Algorithm 2 also allows for the inclusion of new representative states in the model $\bar{M}$. Using Algorithm 1 this is easy to do: given a new representative state $\bar{s}_{m+1}$, it suffices to set $w^a_{m+1} = 0$, $\bar{r}^a_{m+1} = 0$, and $\bar{p}_{m+1,j} = \bar{p}_{j,m+1} = 0$ for $j = 1, 2, \ldots, m+1$ and all $a \in A$. Then, in the following applications of Eqns (3) and (4), the dynamics of $\bar{M}$ will naturally reflect the existence of state $\bar{s}_{m+1}$.

4 Theoretical Results

Our previous experiments with KBSF suggest that, at least empirically, the algorithm's performance improves as $m \to n$ [4]. In this section we present theoretical results that confirm this property. The results below are particularly useful for iKBSF because they provide practical guidance towards where and when to add new representative states.

Suppose we have a fixed set of sample transitions $S^a$. We will show that, if we are free to define the representative states, then we can use KBSF to approximate KBRL's solution to any desired level of accuracy. To be more precise, let $d^* \equiv \max_{a,i} \min_j \|\hat{s}^a_i - \bar{s}_j\|$, that is, $d^*$ is the maximum distance from a sampled state $\hat{s}^a_i$ to the closest representative state. We will show that, by minimizing $d^*$, we can make $\|\hat{v}^* - \tilde{v}\|_\infty$ as small as desired (cf.
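The zero-padding step described above for a new representative state can be sketched as follows (for a single action; the function name is ours):

```python
import numpy as np

def add_representative_state(P_bar, r_bar, w):
    """Grow the model from m to m+1 representative states (one action).

    The new state starts with w_{m+1} = 0, r_{m+1} = 0, and zero transition
    probabilities to and from every state; subsequent applications of update
    rules (3) and (4) fill in its dynamics."""
    m = len(w)
    P_new = np.zeros((m + 1, m + 1))
    P_new[:m, :m] = P_bar
    return P_new, np.append(r_bar, 0.0), np.append(w, 0.0)

P, r, w = np.full((2, 2), 0.5), np.array([1.0, 2.0]), np.array([3.0, 4.0])
P, r, w = add_representative_state(P, r, w)
assert P.shape == (3, 3) and r[-1] == 0.0 and w[-1] == 0.0
```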
Eqn (2)).

Let $\hat{s}^a_* \equiv \hat{s}^a_k$ with $k = \arg\max_i \min_j \|\hat{s}^a_i - \bar{s}_j\|$, and let $\bar{s}^a_* \equiv \bar{s}_h$ where $h = \arg\min_j \|\hat{s}^a_* - \bar{s}_j\|$; that is, $\hat{s}^a_*$ is the sampled state in $S^a$ whose distance to the closest representative state is maximal, and $\bar{s}^a_*$ is the representative state that is closest to $\hat{s}^a_*$. Using these definitions, we can select the pair $(\hat{s}^a_*, \bar{s}^a_*)$ that maximizes $\|\hat{s}^a_* - \bar{s}^a_*\|$: let $\hat{s}_* \equiv \hat{s}^b_*$ and $\bar{s}_* \equiv \bar{s}^b_*$, where $b = \arg\max_a \|\hat{s}^a_* - \bar{s}^a_*\|$. Obviously, $\|\hat{s}_* - \bar{s}_*\| = d^*$.

We make the following simple assumptions: (i) $\hat{s}^a_*$ and $\bar{s}^a_*$ are unique for all $a \in A$; (ii) $\int_0^\infty \phi(x)\,dx \le L_\phi < \infty$; (iii) $\phi(x) \ge \phi(y)$ if $x < y$; (iv) $\exists\, A_\phi, \lambda_\phi > 0$ and $\exists\, B_\phi \ge 0$ such that $A_\phi \exp(-x) \le \phi(x) \le \lambda_\phi A_\phi \exp(-x)$ if $x \ge B_\phi$. Assumption (iv) implies that the kernel function $\phi$ will eventually decay exponentially. We start by introducing the following definition:

Definition 1. Given $\alpha \in (0, 1]$ and $s, s' \in S$, the $\alpha$-radius of $k_\tau$ with respect to $s$ and $s'$ is defined as $\rho(k_\tau, s, s', \alpha) = \max\{x \in \mathbb{R}^+ \mid \phi(x/\tau) = \alpha k_\tau(s, s')\}$.

The existence of $\rho(k_\tau, s, s', \alpha)$ is guaranteed by assumptions (ii) and (iii) and the fact that $\phi$ is continuous [1]. To provide some intuition on the meaning of the $\alpha$-radius of $k_\tau$, suppose that $\phi$ is strictly decreasing and let $c = \phi(\|s - s'\|/\tau)$.
Then, there is a $s'' \in S$ such that $\phi(\|s - s''\|/\tau) = \alpha c$. The radius of $k_\tau$ in this case is $\|s - s''\|$. It should thus be obvious that $\rho(k_\tau, s, s', \alpha) \ge \|s - s'\|$. We can show that $\rho$ has the following properties (proved in the supplementary material):

Property 1. If $\|s - s'\| \le \|t - t'\|$, then $\rho(k_\tau, s, s', \alpha) \le \rho(k_\tau, t, t', \alpha)$.
Property 2. If $\alpha < \alpha'$, then $\rho(k_\tau, s, s', \alpha) > \rho(k_\tau, s, s', \alpha')$.
Property 3. For $\alpha \in (0, 1)$ and $\varepsilon > 0$, there is a $\delta > 0$ such that $\rho(k_\tau, s, s', \alpha) - \|s - s'\| < \varepsilon$ if $\tau < \delta$.

We now introduce a notion of dissimilarity between two states $s, s' \in S$ which is induced by a specific set of sample transitions $S^a$ and the choice of kernel function:

Definition 2. Given $\beta > 0$, the $\beta$-dissimilarity between $s$ and $s'$ with respect to $\kappa^a_\tau$ is defined as

$$\psi(\kappa^a_\tau, s, s', \beta) = \begin{cases} \sum_{k=1}^{n_a} |\kappa^a_\tau(s, s^a_k) - \kappa^a_\tau(s', s^a_k)|, & \text{if } \|s - s'\| \le \beta, \\ 0, & \text{otherwise.} \end{cases}$$

The parameter $\beta$ defines the volume of the ball within which we want to compare states. As we will see, this parameter links Definitions 1 and 2. Note that $\psi(\kappa^a_\tau, s, s', \beta) \in [0, 2]$. It is possible to show that $\psi$ satisfies the following property (see supplementary material):

Property 4. For $\beta > 0$ and $\varepsilon > 0$, there is a $\delta > 0$ such that $\psi(\kappa^a_\tau, s, s', \beta) < \varepsilon$ if $\|s - s'\| < \delta$.

Definitions 1 and 2 allow us to enunciate the following result:

Lemma 1. For any $\alpha \in (0, 1]$ and any $t \ge m - 1$, let $\rho^a = \rho(k_{\bar\tau}, \hat{s}^a_*, \bar{s}^a_*, \alpha/t)$, let $\psi^a_\rho = \max_{i,j} \psi(\kappa^a_\tau, \hat{s}^a_i, \bar{s}_j, \rho^a)$, and let $\psi^a_{\max} = \max_{i,j} \psi(\kappa^a_\tau, \hat{s}^a_i, \bar{s}_j, \infty)$. Then,

$$\|\hat{P}^a - DK^a\|_\infty \le \frac{1}{1+\alpha}\, \psi^a_\rho + \frac{\alpha}{1+\alpha}\, \psi^a_{\max}. \quad (5)$$

Proof. See supplementary material.

Since $\psi^a_{\max} \ge \psi^a_\rho$, one might think at first that the right-hand side of Eqn (5) decreases monotonically as $\alpha \to 0$. This is not necessarily true, though, because $\psi^a_\rho \to \psi^a_{\max}$ as $\alpha \to 0$ (see Property 2). We are finally ready to prove the main result of this section.
From Lemma 1 we know that, for any set of m \u2264 n\n\nif maxa(cid:13)(cid:13)\nsmall enough, then maxa(cid:13)(cid:13)\n\nrepresentative states, and for any \u03b1 \u2208 (0, 1], the following must hold:\n\n\u02c6Pa \u2212 DKa(cid:13)(cid:13)\u221e k\u02c7rk\u221e .\n\n(6)\n\nkPa \u2212 DKak\u221e \u2264 (1 + \u03b1)\u22121\u03c8\u03c1 + \u03b1(1 + \u03b1)\u22121\u03c8MAX,\n\nmax\n\na\n\n\u2217, \u00afsa\n\n\u03c4 , \u02c6sa\n\n\u03c1 = maxa,i, j \u03c8(\u03ba a\n\ni , s, \u221e) and \u03c8\u03c1 = maxa \u03c8 a\n\nwhere \u03c8MAX = maxa,i,s \u03c8(k\u03c4 , \u02c6sa\ni , \u00afs j, \u03c1 a), with \u03c1 a =\n\u03c1(k \u00af\u03c4 , \u02c6sa\n\u2217, \u03b1/(n \u2212 1)). Note that \u03c8MAX is independent of the representative states. De\ufb01ne \u03b1 such\nthat \u03b1/(1 + \u03b1)\u03c8MAX < \u03b7. We have to show that, if we de\ufb01ne the representative states in such a way\nthat d\u2217 is small enough, and set \u00af\u03c4 accordingly, then we can make \u03c8\u03c1 < (1 \u2212 \u03b1)\u03b7 \u2212 \u03b1\u03c8MAX \u2261 \u03b7 \u2032.\nFrom Property 4 we know that there is a \u03b41 > 0 such that \u03c8\u03c1 < \u03b7 \u2032 if \u03c1 a < \u03b41 for all a \u2208 A. From\nProperty 1 we know that \u03c1 a \u2264 \u03c1(k \u00af\u03c4 , \u02c6s\u2217, \u00afs\u2217, \u03b1/(n \u2212 1)) for all a \u2208 A. From Property 3 we know that,\nfor any \u03b5 \u2032 > 0, there is a \u03b4 \u2032 > 0 such that \u03c1(k \u00af\u03c4 , \u02c6s\u2217, \u00afs\u2217, \u03b1/(n \u2212 1)) < d\u2217 + \u03b5 \u2032 if \u00af\u03c4 < \u03b4 \u2032. Therefore, if\nd\u2217 < \u03b41, we can take any \u03b5 \u2032 < \u03b41 \u2212 d\u2217 to have an upper bound \u03b4 \u2032 for \u00af\u03c4. It remains to show that there\nis a \u03b4 > 0 such that mini max j di j > 1 \u2212 \u03b7 if \u00af\u03c4 < \u03b4 . Recalling that \u02d9da\ni , \u00afsk),\nlet h = argmax jk \u00af\u03c4 ( \u02c6sa\ni = max j6=h k \u00af\u03c4 ( \u02c6sa\ni , \u00afs j). 
Then, for any i,\nmax j \u02d9da\ni ). From Assump. (i) and Prop. 3 we know\nthat there is a \u03b4 a\ni . Thus, by making \u03b4 = mina,i \u03b4 a\ni ,\nwe can guarantee that mini max j di j > 1 \u2212 \u03b7. If we take \u03b42 = min(\u03b4 , \u03b4 \u2032), the result follows.\n\ni , \u00afs j)(cid:1) \u2265 ya\ni > (m \u2212 1)(1 \u2212 \u03b7) \u02c7ya\n\ni , \u00afsh) and \u02c7ya\ni + (m \u2212 1) \u02c7ya\n\ni = k \u00af\u03c4 ( \u02c6sa\ni /(ya\n\ni > 0 such that ya\n\ni , \u00afs j), and let ya\n\ni + \u2211 j6=h k \u00af\u03c4 ( \u02c6sa\n\ni j = k \u00af\u03c4 ( \u02c6sa\n\ni , \u00afs j)/\u2211m\n\nk=1 k \u00af\u03c4 ( \u02c6sa\n\ni j = ya\n\ni /(cid:0)ya\n\ni /\u03b7 if \u00af\u03c4 < \u03b4 a\n\nProposition 1 tells us that, regardless of the speci\ufb01c reinforcement-learning problem at hand, if the\ndistances between sampled states and the respective nearest representative states are small enough,\nthen we can make KBSF\u2019s approximation of KBRL\u2019s value function as accurate as desired by setting\n\u00af\u03c4 to a small value. How small d\u2217 and \u00af\u03c4 should be depends on the particular choice of kernel k\u03c4 and on\nthe characteristics of the sets of transitions Sa. Of course, a \ufb01xed number m of representative states\nimposes a minimum possible value for d\u2217, and if this value is not small enough decreasing \u00af\u03c4 may\nactually hurt the approximation. Again, the optimal value for \u00af\u03c4 in this case is problem-dependent.\n\nOur result supports the use of a local approximation based on representative states spread over\nthe state space S. This is in line with the quantization strategies used in batch-mode kernel-based\nreinforcement learning to de\ufb01ne the states \u00afs j [4, 5]. In the case of on-line learning, we have to\nadaptively de\ufb01ne the representative states \u00afs j as the sample transitions come in. One can think of\nseveral ways of doing so [10]. 
In the next section we show a simple strategy for adding representative states which is based on the theoretical results presented in this section.

5 Empirical Results

We now investigate the empirical performance of the incremental version of KBSF. We start with a simple task in which iKBSF is contrasted with batch KBSF. Next we exploit the scalability of iKBSF to solve a difficult control task that, to the best of our knowledge, has never been solved before.

We use the "puddle world" problem as a proof of concept [11]. In this first experiment we show that iKBSF is able to recover the model that would be computed by its batch counterpart. In order to do so, we applied Algorithm 2 to the puddle-world task using a random policy to select actions. Figure 1a shows the result of such an experiment when we vary the parameters $t_m$ and $t_v$. Note that the case in which $t_m = t_v = 8000$ corresponds to the batch version of KBSF. As expected, the performance of KBSF decision policies improves gradually as the algorithm goes through more sample transitions, and in general the intensity of the improvement is proportional to the amount of data processed. More important, the performance of the decision policies after all sample transitions have been processed is essentially the same for all values of $t_m$ and $t_v$, which shows that iKBSF can be used as a tool to circumvent KBSF's memory demand (which is linear in $n$). Thus, if one has a batch of sample transitions that does not fit in the available memory, it is possible to split the data in chunks of smaller sizes and still get the same value-function approximation that would
As shown in Figure 1b, there is only a small computational overhead associated with such a strategy (this results from unnormalizing and normalizing the elements of P̄^a and r̄^a several times through update rules (3) and (4)).

(a) Performance   (b) Run times

Figure 1: Results on the puddle-world task averaged over 50 runs. Panel (a) shows return and panel (b) shows run time in seconds, both as a function of the number of sample transitions, for ι ∈ {1000, 2000, 4000, 8000}. iKBSF used 100 representative states evenly distributed over the state space and t_m = t_v = ι. Sample transitions were collected by a random policy. The agents were tested on two sets of states surrounding the "puddles": a 3 × 3 grid over [0.1, 0.3] × [0.3, 0.5] and the four states {0.1, 0.3} × {0.9, 1.0}.

But iKBSF is more than just a tool for avoiding the memory limitations associated with batch learning. We illustrate this fact with a more challenging RL task. Pole balancing has a long history as a benchmark problem because it represents a rich class of unstable systems [12, 13, 14]. The objective in this task is to apply forces to a wheeled cart moving along a limited track in order to keep one or more poles hinged to the cart from falling over [15]. There are several variations of the problem with different levels of difficulty; among them, balancing two poles at the same time is particularly hard [16]. In this paper we raise the bar, and add a third pole to the pole-balancing task.
We performed our simulations using the parameters usually adopted with the double-pole task, except that we added a third pole with the same length and mass as the longer pole [15]. This results in a problem with an 8-dimensional state space S.

In our experiments with the double-pole task, we used 200 representative states and 10^6 sample transitions collected by a random policy [4]. Here we start our experiment with triple pole-balancing using exactly the same configuration, and then we let KBSF refine its model M̄ by incorporating more sample transitions through update rules (3) and (4). Specifically, we used Algorithm 2 with a 0.3-greedy policy, t_m = t_v = 10^6, and n = 10^7. Policy iteration was used to compute Q̄* at each value-function update. As for the kernels, we adopted Gaussian functions with widths τ = 100 and τ̄ = 1 (to improve efficiency, we used a KD-tree to compute only the 50 largest values of k_τ(s̄_i, ·) and the 10 largest values of k_τ̄(ŝ^a_i, ·)). Representative states were added to the model on-line every time the agent encountered a sample state ŝ^a_i for which k_τ̄(ŝ^a_i, s̄_j) < 0.01 for all j ∈ {1, 2, ..., m} (this corresponds to setting the maximum allowed distance d* from a sampled state to the closest representative state).

We compare iKBSF with fitted Q-iteration using an ensemble of 30 trees generated by Ernst et al.'s extra-trees algorithm [17]. We chose this algorithm because it has shown excellent performance in both benchmark and real-world reinforcement-learning tasks [17, 18].¹ Since this is a batch-mode learning method, we used its result on the initial set of 10^6 sample transitions as a baseline for our empirical evaluation.
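The on-line criterion for adding representative states described above can be sketched as follows. This is our own illustration: the 0.01 threshold follows the text, but the 2-D state stream and the kernel width are toy assumptions, not the values used in the 8-dimensional experiment:

```python
import numpy as np

def maybe_add_representative(s_hat, reps, tau_bar, threshold=0.01):
    """Add s_hat as a new representative state if every existing one is
    too far away in kernel space, i.e. k_tau_bar(s_hat, s_j) < threshold
    for all j; this caps the distance d* to the closest representative."""
    if reps:
        sq_dists = np.sum((np.asarray(reps) - s_hat) ** 2, axis=1)
        if np.exp(-sq_dists / (2.0 * tau_bar ** 2)).max() >= threshold:
            return False  # some representative already covers s_hat
    reps.append(np.asarray(s_hat, dtype=float))
    return True

reps = []
rng = np.random.default_rng(0)
added = [maybe_add_representative(rng.uniform(-1, 1, size=2), reps, tau_bar=0.3)
         for _ in range(1000)]

# Additions are frequent early on and become rare once the visited region
# is covered, so the number of representatives stabilizes (cf. Figure 2c).
print(len(reps), "representatives;", sum(added[:50]), "added in first 50 steps")
```

By construction, any two representatives kept by this rule have pairwise kernel value below the threshold, which is what keeps the model size bounded regardless of how many transitions stream in.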
To build the trees, the number of cut-directions evaluated at each node was fixed at dim(S) = 8, and the minimum number of elements required to split a node, denoted here by η_min, was first set to 1000 and then to 100. The algorithm was run for 50 iterations, with the structure of the trees fixed after the 10th iteration.

As shown in Figure 2a, both fitted Q-iteration and batch KBSF perform poorly in the triple pole-balancing task, with average success rates below 55%. This suggests that the amount of data used by these algorithms is insufficient to describe the dynamics of the control task. Of course, we could give more sample transitions to fitted Q-iteration and batch KBSF. Note however that, since they are batch-learning methods, there is an inherent limit on the amount of data that these algorithms can use to construct their approximation. In contrast, the amount of memory required by iKBSF is independent of the number of sample transitions n. This fact, together with the fact that KBSF's computational complexity is only linear in n, allows our algorithm to process a large amount of data within a reasonable time. This can be observed in Figure 2b, which shows that iKBSF can build an approximation using 10^7 transitions in under 20 minutes.

¹Another reason for choosing fitted Q-iteration was that some of the most natural competitors of iKBSF have already been tested on the simpler double pole-balancing task, with disappointing results [19, 4].
As a reference for comparison, fitted Q-iteration using η_min = 1000 took an average of 1 hour and 18 minutes to process one tenth as much data.

(a) Performance   (b) Run times   (c) Size of KBSF's MDP

Figure 2: Results on the triple pole-balancing task averaged over 50 runs. The values correspond to the fraction of episodes initiated from the test states in which the 3 poles could be balanced for 3000 steps (one minute of simulated time). The test set was composed of 256 states equally distributed over the hypercube defined by ±[1.2 m, 0.24 m/s, 18°, 75°/s, 18°, 150°/s, 18°, 75°/s]. Shadowed regions represent 99% confidence intervals.

As shown in Figure 2a, the ability of iKBSF to process a large number of sample transitions allows our algorithm to achieve a success rate of approximately 80%.
This is similar to the performance of batch KBSF on the double-pole version of the problem [4]. The good performance of iKBSF on the triple pole-balancing task is especially impressive when we recall that the decision policies were evaluated on a set of test states representing all possible directions of inclination of the three poles. In order to achieve the same level of performance with KBSF, approximately 2 GB of memory would be necessary, even using sparse kernels, whereas iKBSF used less than 0.03 GB of memory.

To conclude, observe in Figure 2c how the number of representative states m grows as a function of the number of sample transitions processed by KBSF. As expected, in the beginning of the learning process m grows fast, reflecting the fact that some relevant regions of the state space have not been visited yet. As more and more data come in, the number of representative states starts to stabilize.

6 Conclusion

This paper presented two contributions, one practical and one theoretical. The practical contribution is iKBSF, the incremental version of KBSF. iKBSF retains all the nice properties of its precursor: it is simple, fast, and enjoys good theoretical guarantees. However, since its memory complexity is independent of the number of sample transitions, iKBSF can be applied to datasets of any size, and it can also be used on-line. To show how iKBSF's ability to process large amounts of data can be useful in practice, we used the proposed algorithm to learn how to simultaneously balance three poles, a difficult control task that had never been solved before.

As for the theoretical contribution, we showed that KBSF can approximate KBRL's value function at any level of accuracy by minimizing the distance between sampled states and the closest representative state.
This supports the quantization strategies usually adopted in kernel-based RL, and also offers guidance towards where and when to add new representative states in on-line learning.

Acknowledgments  The authors would like to thank Amir-massoud Farahmand for helpful discussions regarding this work. Funding for this research was provided by the National Institutes of Health (grant R21 DA019800) and the NSERC Discovery Grant program.

References

[1] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3):161-178, 2002.

[2] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10):1624-1636, 2002.

[3] N. Jong and P. Stone. Kernel-based models for reinforcement learning in continuous state spaces. In Proceedings of the International Conference on Machine Learning (ICML) Workshop on Kernel Machines and Reinforcement Learning, 2006.

[4] A. M. S. Barreto, D. Precup, and J. Pineau. Reinforcement learning using kernel-based stochastic factorization. In Advances in Neural Information Processing Systems (NIPS), pages 720-728, 2011.

[5] B. Kveton and G. Theocharous. Kernel-based reinforcement learning on representative states. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 124-131, 2012.

[6] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[7] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

[8] M. L. Littman, T. L. Dean, and L. P. Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 394-402, 1995.

[9] A. M. S. Barreto and M. D. Fragoso.
Computing the stationary distribution of a finite Markov chain through stochastic factorization. SIAM Journal on Matrix Analysis and Applications, 32:1513-1523, 2011.

[10] Y. Engel, S. Mannor, and R. Meir. The kernel recursive least squares algorithm. IEEE Transactions on Signal Processing, 52:2275-2285, 2003.

[11] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems (NIPS), pages 1038-1044, 1996.

[12] D. Michie and R. Chambers. BOXES: An experiment in adaptive control. Machine Intelligence 2, pages 125-133, 1968.

[13] C. W. Anderson. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, Computer and Information Science, University of Massachusetts, 1986.

[14] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13:834-846, 1983.

[15] F. J. Gomez. Robust non-linear control through neuroevolution. PhD thesis, The University of Texas at Austin, 2003.

[16] A. P. Wieland. Evolving neural network controllers for unstable systems. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pages 667-673, 1991.

[17] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503-556, 2005.

[18] D. Ernst, G. B. Stan, J. Gonçalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the IEEE Conference on Decision and Control (CDC), pages 124-131, 2006.

[19] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Efficient non-linear control through neuroevolution.
In Proceedings of the European Conference on Machine Learning (ECML), pages 654-662, 2006.