{"title": "Reinforcement Learning using Kernel-Based Stochastic Factorization", "book": "Advances in Neural Information Processing Systems", "page_first": 720, "page_last": 728, "abstract": "Kernel-based reinforcement-learning (KBRL) is a method for learning a decision policy from a set of sample transitions which stands out for its strong theoretical guarantees. However, the size of the approximator grows with the number of transitions, which makes the approach impractical for large problems.  In this paper we introduce a novel algorithm to improve the scalability of KBRL. We resort to a special decomposition of a transition matrix, called stochastic factorization, to fix the size of the approximator while at the same time incorporating all the information contained in the data. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster but still converges to a unique solution. We derive a theoretical upper bound for the distance between the value functions computed by KBRL and KBSF. The effectiveness of our method is illustrated with computational experiments on four reinforcement-learning problems, including a difficult task in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures in epileptic rat brains. We empirically demonstrate that the proposed approach is able to compress the information contained in KBRL's model. Also, on the tasks studied, KBSF outperforms two of the most prominent reinforcement-learning algorithms, namely least-squares policy iteration and fitted Q-iteration.", "full_text": "Reinforcement Learning using Kernel-Based\n\nStochastic Factorization\n\nAndr\u00b4e M. S. 
Barreto\n\nDoina Precup\n\nJoelle Pineau\n\nSchool of Computer Science\n\nSchool of Computer Science\n\nSchool of Computer Science\n\nMcGill University\nMontreal, Canada\n\nMcGill University\nMontreal, Canada\n\nMcGill University\nMontreal, Canada\n\namsb@cs.mcgill.ca\n\ndprecup@cs.mcgill.ca\n\njpineau@cs.mcgill.ca\n\nAbstract\n\nKernel-based reinforcement-learning (KBRL) is a method for learning a decision\npolicy from a set of sample transitions which stands out for its strong theoretical\nguarantees. However, the size of the approximator grows with the number of tran-\nsitions, which makes the approach impractical for large problems. In this paper\nwe introduce a novel algorithm to improve the scalability of KBRL. We resort\nto a special decomposition of a transition matrix, called stochastic factorization,\nto \ufb01x the size of the approximator while at the same time incorporating all the\ninformation contained in the data. The resulting algorithm, kernel-based stochas-\ntic factorization (KBSF), is much faster but still converges to a unique solution.\nWe derive a theoretical upper bound for the distance between the value functions\ncomputed by KBRL and KBSF. The effectiveness of our method is illustrated with\ncomputational experiments on four reinforcement-learning problems, including a\ndif\ufb01cult task in which the goal is to learn a neurostimulation policy to suppress\nthe occurrence of seizures in epileptic rat brains. We empirically demonstrate that\nthe proposed approach is able to compress the information contained in KBRL\u2019s\nmodel. Also, on the tasks studied, KBSF outperforms two of the most promi-\nnent reinforcement-learning algorithms, namely least-squares policy iteration and\n\ufb01tted Q-iteration.\n\n1\n\nIntroduction\n\nRecent years have witnessed the emergence of several reinforcement-learning techniques that make\nit possible to learn a decision policy from a batch of sample transitions. 
Among them, Ormoneit and Sen\u2019s kernel-based reinforcement learning (KBRL) stands out for two reasons [1]. First, unlike other approximation schemes, KBRL always converges to a unique solution. Second, KBRL is consistent in the statistical sense, meaning that adding more data always improves the quality of the resulting policy and eventually leads to optimal performance.\n\nDespite its nice theoretical properties, KBRL has not been widely adopted by the reinforcement learning community. One possible explanation for this is its high computational complexity. As discussed by Ormoneit and Glynn [2], KBRL can be seen as the derivation of a finite Markov decision process whose number of states coincides with the number of sample transitions collected to perform the approximation. This gives rise to a dilemma: on the one hand one wants as much data as possible to describe the dynamics of the decision problem, but on the other hand the number of transitions should be small enough to allow for the numerical solution of the resulting model.\n\nIn this paper we describe a practical way of weighting the relative importance of these two conflicting objectives. We rely on a special decomposition of a transition matrix, called stochastic factorization, to rewrite it as the product of two stochastic matrices of smaller dimension. As we will see, the stochastic factorization possesses a very useful property: if we swap its factors, we obtain another transition matrix which retains some fundamental characteristics of the original one. We exploit this property to fix the size of KBRL\u2019s model. The resulting algorithm, kernel-based stochastic factorization (KBSF), is much faster than KBRL but still converges to a unique solution. We derive a theoretical bound on the distance between the value functions computed by KBRL and KBSF. 
We also present experiments on four reinforcement-learning domains, including the double pole-balancing task, a difficult control problem representative of a wide class of unstable dynamical systems, and a model of epileptic rat brains in which the goal is to learn a neurostimulation policy to suppress the occurrence of seizures. We empirically show that the proposed approach is able to compress the information contained in KBRL\u2019s model, outperforming both the least-squares policy iteration algorithm and fitted Q-iteration on the tasks studied [3, 4].\n\n2 Background\n\nThe KBRL algorithm solves a continuous state-space Markov Decision Process (MDP) using a finite model approximation. A finite MDP is defined by a tuple $M \\equiv (S, A, P^a, r^a, \\gamma)$ [5]. The finite sets $S$ and $A$ are the state and action spaces. The matrix $P^a \\in \\mathbb{R}^{|S| \\times |S|}$ gives the transition probabilities associated with action $a \\in A$ and the vector $r^a \\in \\mathbb{R}^{|S|}$ stores the corresponding expected rewards. The discount factor $\\gamma \\in [0, 1)$ is used to give smaller weights to rewards received further in the future.\n\nIn the case of a finite MDP, we can use dynamic programming to find an optimal decision policy $\\pi^* \\in A^{|S|}$ in polynomial time [5]. As is well known, this is done using the concept of a value function. Throughout the paper, we use $v \\in \\mathbb{R}^{|S|}$ to denote the state-value function and $Q \\in \\mathbb{R}^{|S| \\times |A|}$ to refer to the action-value function. 
Let the operator $\\Gamma : \\mathbb{R}^{|S| \\times |A|} \\mapsto \\mathbb{R}^{|S|}$ be given by $\\Gamma Q = v$, with $v_i = \\max_j q_{ij}$, and define $\\Delta : \\mathbb{R}^{|S|} \\mapsto \\mathbb{R}^{|S| \\times |A|}$ as $\\Delta v = Q$, where the $a$th column of $Q$ is given by $q^a = r^a + \\gamma P^a v$. A fundamental result in dynamic programming states that, starting from $v^{(0)} = 0$, the expression $v^{(t)} = \\Gamma \\Delta v^{(t-1)}$ gives the optimal $t$-step value function, and as $t \\to \\infty$ the vector $v^{(t)}$ approaches $v^*$, from which any optimal decision policy $\\pi^*$ can be derived [5].\n\nConsider now an MDP with continuous state space $S \\subset \\mathbb{R}^d$ and let $S^a = \\{(s^a_k, r^a_k, \\hat{s}^a_k) \\mid k = 1, 2, \\ldots, n_a\\}$ be a set of sample transitions associated with action $a \\in A$, where $s^a_k, \\hat{s}^a_k \\in S$ and $r^a_k \\in \\mathbb{R}$. The model constructed by KBRL has the following transition and reward functions:\n\n$$\\hat{P}^a(s_j | s_i) = \\begin{cases} \\kappa^a(s_i, s^a_k), & \\text{if } s_j = \\hat{s}^a_k, \\\\ 0, & \\text{otherwise} \\end{cases} \\quad \\text{and} \\quad \\hat{R}^a(s_i, s_j) = \\begin{cases} r^a_k, & \\text{if } s_j = \\hat{s}^a_k, \\\\ 0, & \\text{otherwise,} \\end{cases}$$\n\nwhere $\\kappa^a(\\cdot, s^a_k)$ is a weighting kernel centered at $s^a_k$ and defined in such a way that $\\sum_{k=1}^{n_a} \\kappa^a(s_i, s^a_k) = 1$ for all $s_i \\in S$ (for example, $\\kappa^a$ can be a normalized Gaussian function; see [1] and [2] for a formal definition and other examples of valid kernels). Since only transitions ending in the states $\\hat{s}^a_k$ have a non-zero probability of occurrence, one can solve a finite MDP $\\hat{M}$ whose space is composed solely of these $n = \\sum_a n_a$ states [2, 6]. After the optimal value function of $\\hat{M}$ has been found, the value of any state $s_i \\in S$ can be computed as $Q(s_i, a) = \\sum_{k=1}^{n_a} \\kappa^a(s_i, s^a_k) \\left[ r^a_k + \\gamma \\hat{V}^*(\\hat{s}^a_k) \\right]$. Ormoneit and Sen [1] proved that, if $n_a \\to \\infty$ for all $a \\in A$ and the widths of the kernels $\\kappa^a$ shrink at an \u201cadmissible\u201d rate, the probability of choosing a suboptimal action based on $Q(s_i, a)$ converges to zero.\n\nAs discussed in the introduction, the problem with the practical application of KBRL is that, as $n$ increases, so does the cost of solving the MDP derived by this algorithm. To alleviate this problem, Jong and Stone [6] propose growing incrementally the set of sample transitions, using a prioritized sweeping approach to guide the exploration of the state space. In this paper we present a new method for addressing this problem, using stochastic factorization.\n\n3 Stochastic factorization\n\nA stochastic matrix has only non-negative elements and each of its rows sums to 1. That said, we can introduce the concept that will serve as a cornerstone for the rest of the paper:\n\nDefinition 1 Given a stochastic matrix $P \\in \\mathbb{R}^{n \\times p}$, the relation $P = DK$ is called a stochastic factorization of $P$ if $D \\in \\mathbb{R}^{n \\times m}$ and $K \\in \\mathbb{R}^{m \\times p}$ are also stochastic matrices. The integer $m > 0$ is the order of the factorization.\n\nThis mathematical concept has been explored before. For example, Cohen and Rothblum [7] briefly discuss it as a special case of non-negative matrix factorization, while Cutler and Breiman [8] focus on slightly modified versions of the stochastic factorization for statistical data analysis. However, in this paper we will focus on a useful property of this type of factorization that seems to have passed unnoticed thus far. 
We call it the \u201cstochastic-factorization trick\u201d:\n\nGiven a stochastic factorization of a square matrix, $P = DK$, swapping the factors of the factorization yields another transition matrix $\\bar{P} = KD$, potentially much smaller than the original, which retains the basic topology and properties of $P$.\n\nThe stochasticity of $\\bar{P}$ follows immediately from the same property of $D$ and $K$. What is perhaps more surprising is the fact that this matrix shares some fundamental characteristics with the original matrix $P$. Specifically, it is possible to show that: (i) for each recurrent class in $P$ there is a corresponding class in $\\bar{P}$ with the same period and, given some simple assumptions about the factorization, (ii) $P$ is irreducible if and only if $\\bar{P}$ is irreducible and (iii) $P$ is regular if and only if $\\bar{P}$ is regular (see [9] for details).\n\nGiven the strong connection between $P \\in \\mathbb{R}^{n \\times n}$ and $\\bar{P} \\in \\mathbb{R}^{m \\times m}$, the idea of replacing the former by the latter comes almost inevitably. The motivation for this would be, of course, to save computational resources when $m < n$. In this paper we are interested in exploiting the stochastic-factorization trick to reduce the computational cost of dynamic programming. The idea is straightforward: given stochastic factorizations of the transition matrices $P^a$, we can apply our trick to obtain a reduced MDP that will be solved in place of the original one. In the most general scenario, we would have one independent factorization $P^a = D^a K^a$ for each action $a \\in A$. However, in the current work we will focus on a particular case which will prove to be convenient both mathematically and computationally. When all factorizations share the same matrix $D$, it is easy to derive theoretical guarantees regarding the quality of the solution of the reduced MDP:\n\nProposition 1 Let $M \\equiv (S, A, P^a, r^a, \\gamma)$ be a finite MDP with $|S| = n$ and $0 \\leq \\gamma < 1$. 
Let $DK^a = P^a$ be $|A|$ stochastic factorizations of order $m$ and let $\\bar{r}^a$ be vectors in $\\mathbb{R}^m$ such that $D\\bar{r}^a = r^a$ for all $a \\in A$. Define the MDP $\\bar{M} \\equiv (\\bar{S}, A, \\bar{P}^a, \\bar{r}^a, \\gamma)$, with $|\\bar{S}| = m$ and $\\bar{P}^a = K^a D$. Then,\n\n$$\\|v^* - \\tilde{v}\\|_\\infty \\leq \\frac{\\bar{C}}{(1 - \\gamma)^2} \\max_i \\left(1 - \\max_j d_{ij}\\right), \\qquad (1)$$\n\nwhere $\\tilde{v} = \\Gamma D \\bar{Q}^*$, $\\bar{C} = \\max_{a,k} \\bar{r}^a_k - \\min_{a,k} \\bar{r}^a_k$, and $\\|\\cdot\\|_\\infty$ is the maximum norm.\n\nProof. Since $r^a = D\\bar{r}^a$ and $D\\bar{P}^a = DK^a D = P^a D$ for all $a \\in A$, the stochastic matrix $D$ satisfies Sorg and Singh\u2019s definition of a soft homomorphism between $M$ and $\\bar{M}$ (see equations (25)\u2013(28) in [10]). Applying Theorem 1 by the same authors, we know that $\\|\\Gamma(Q^* - D\\bar{Q}^*)\\|_\\infty \\leq (1 - \\gamma)^{-1} \\sup_{i,t} (1 - \\max_j d_{ij}) \\bar{\\delta}_i^{(t)}$, where $\\bar{\\delta}_i^{(t)} = \\max_{j: d_{ij} > 0, k} \\bar{q}_{jk}^{(t)} - \\min_{j: d_{ij} > 0, k} \\bar{q}_{jk}^{(t)}$ and $\\bar{q}_{jk}^{(t)}$ are elements of the optimal $t$-step action-value function of $\\bar{M}$, $\\bar{Q}^{(t)} = \\Delta \\bar{v}^{(t-1)}$. In order to obtain our bound, we note that $\\|\\Gamma Q^* - \\Gamma D\\bar{Q}^*\\|_\\infty \\leq \\|\\Gamma(Q^* - D\\bar{Q}^*)\\|_\\infty$ and, for all $t > 0$, $\\bar{\\delta}_i^{(t)} \\leq (1 - \\gamma)^{-1} (\\max_{a,k} \\bar{r}^a_k - \\min_{a,k} \\bar{r}^a_k)$. $\\Box$\n\nProposition 1 elucidates the basic mechanism through which one can exploit the stochastic-factorization trick to reduce the number of states in an MDP. However, in order to apply this idea in practice, one must actually compute the factorizations. This computation can be expensive, exceeding the computational effort necessary to calculate $v^*$ [11, 9]. In the next section we discuss a strategy to reduce the computational cost of the proposed approach.\n\n4 Kernel-based stochastic factorization\n\nIn Section 2 we presented KBRL, an approximation scheme for reinforcement learning whose main drawback is its high computational complexity. In Section 3 we discussed how the stochastic-factorization trick can in principle be useful to reduce an MDP, as long as one circumvents the computational burden imposed by the calculation of the matrices involved in the process. We now show how to leverage these two components to produce an algorithm called kernel-based stochastic factorization (KBSF) that overcomes these computational limitations.\n\nAs outlined in Section 2, KBRL defines the probability of a transition from state $\\hat{s}^b_i$ to state $\\hat{s}^a_k$ via kernel-averaging, formally denoted $\\kappa^a(\\hat{s}^b_i, s^a_k)$, where $a, b \\in A$. So for each action $a \\in A$, the state $\\hat{s}^b_i$ has an associated stochastic vector $\\hat{p}^a_j \\in \\mathbb{R}^{1 \\times n}$ whose non-zero entries correspond to the function $\\kappa^a(\\hat{s}^b_i, \\cdot)$ evaluated at $s^a_k$, $k = 1, 2, \\ldots, n_a$. Recall that we are dealing with a continuous state space, so it is possible to compute an analogous vector for any $s_i \\in S$. Therefore, we can link each state of the original MDP with $|A|$ $n$-dimensional stochastic vectors. The core strategy of KBSF is to find a set of $m$ representative states associated with vectors $k^a_i \\in \\mathbb{R}^{1 \\times n}$ whose convex combination can approximate the rows of the corresponding $\\hat{P}^a$.\n\nKBRL\u2019s matrices $\\hat{P}^a$ have a very specific structure, since only transitions ending in states $\\hat{s}^a_i$ associated with action $a$ have a non-zero probability of occurrence. 
Suppose now we want to apply the stochastic-factorization trick to KBRL\u2019s MDP. Assuming that the matrices $K^a$ have the same structure as $\\hat{P}^a$, when computing $\\bar{P}^a = K^a D$ we only have to look at the submatrices of $K^a$ and $D$ corresponding to the $n_a$ non-zero columns of $K^a$. We call these matrices $\\dot{K}^a \\in \\mathbb{R}^{m \\times n_a}$ and $\\dot{D}^a \\in \\mathbb{R}^{n_a \\times m}$. Let $\\{\\bar{s}_1, \\bar{s}_2, \\ldots, \\bar{s}_m\\}$ be a set of representative states in $S$. KBSF computes matrices $\\dot{D}^a$ and $\\dot{K}^a$ with elements $\\dot{d}^a_{ij} = \\bar{\\kappa}(\\hat{s}^a_i, \\bar{s}_j)$ and $\\dot{k}^a_{ij} = \\kappa^a(\\bar{s}_i, s^a_j)$, where $\\bar{\\kappa}$ is also a kernel. Obviously, once we have $\\dot{D}^a$ and $\\dot{K}^a$ it is trivial to compute $D$ and $K^a$. Depending on how the states $\\bar{s}_i$ and the kernels $\\bar{\\kappa}$ are defined, we have $DK^a \\approx \\hat{P}^a$ for all $a \\in A$. The important point here is that the matrices $P^a = DK^a$ are never actually computed, but instead we solve an MDP with $m$ states whose dynamics are given by $\\bar{P}^a = K^a D = \\dot{K}^a \\dot{D}^a$. Algorithm 1 gives a step-by-step description of KBSF.\n\nAlgorithm 1 KBSF\n\nInput: $S^a$ for all $a \\in A$, $m$\nSelect a set of representative states $\\{\\bar{s}_1, \\bar{s}_2, \\ldots, \\bar{s}_m\\}$\nfor each $a \\in A$ do\n    Compute matrix $\\dot{D}^a$: $\\dot{d}^a_{ij} = \\bar{\\kappa}(\\hat{s}^a_i, \\bar{s}_j)$\n    Compute matrix $\\dot{K}^a$: $\\dot{k}^a_{ij} = \\kappa^a(\\bar{s}_i, s^a_j)$\n    Compute vector $\\bar{r}^a$: $\\bar{r}^a_i = \\sum_j \\dot{k}^a_{ij} r^a_j$\nend for\nSolve $\\bar{M} \\equiv (\\bar{S}, A, \\bar{P}^a, \\bar{r}^a, \\gamma)$, with $\\bar{P}^a = \\dot{K}^a \\dot{D}^a$\nReturn $\\tilde{v} = \\Gamma D \\bar{Q}^*$, where $D^\\top = \\left[ (\\dot{D}^{a_1})^\\top \\; (\\dot{D}^{a_2})^\\top \\; \\ldots \\; (\\dot{D}^{a_{|A|}})^\\top \\right]$\n\nObserve that we did not describe how to define the representative states $\\bar{s}_i$. Ideally, these states would be linked to vectors $k^a_i$ forming a convex hull which contains the rows of $\\hat{P}^a$. In practice, we can often resort to simple methods to pick states $\\bar{s}_i$ in strategic regions of $S$. In Section 5 we give an example of how to do so. Also, the reader might have noticed that the stochastic factorizations computed by KBSF are in fact approximations of the matrices $\\hat{P}^a$. The following proposition extends the result of the previous section to the approximate case:\n\nProposition 2 Let $\\hat{M} \\equiv (S, A, \\hat{P}^a, \\hat{r}^a, \\gamma)$ be the finite MDP derived by KBRL and let $D$, $K^a$, and $\\bar{r}^a$ be the matrices and vector computed by KBSF. Then,\n\n$$\\|\\hat{v}^* - \\tilde{v}\\|_\\infty \\leq \\frac{1}{1 - \\gamma} \\max_a \\|\\hat{r}^a - D\\bar{r}^a\\|_\\infty + \\frac{1}{(1 - \\gamma)^2} \\left( \\bar{C} \\max_i \\left(1 - \\max_j d_{ij}\\right) + \\frac{\\hat{C}\\gamma}{2} \\max_a \\|\\hat{P}^a - DK^a\\|_\\infty \\right), \\qquad (2)$$\n\nwhere $\\hat{C} = \\max_{a,i} \\hat{r}^a_i - \\min_{a,i} \\hat{r}^a_i$.\n\nProof. Let $M \\equiv (S, A, DK^a, D\\bar{r}^a, \\gamma)$. It is obvious that\n\n$$\\|\\hat{v}^* - \\tilde{v}\\|_\\infty \\leq \\|\\hat{v}^* - v^*\\|_\\infty + \\|v^* - \\tilde{v}\\|_\\infty. \\qquad (3)$$\n\nIn order to provide a bound for $\\|\\hat{v}^* - v^*\\|_\\infty$, we apply Whitt\u2019s Theorem 3.1 and Corollary (b) of his Theorem 6.1 [12], with all mappings between $\\hat{M}$ and $M$ taken to be identities, to obtain\n\n$$\\|\\hat{v}^* - v^*\\|_\\infty \\leq \\frac{1}{1 - \\gamma} \\left( \\max_a \\|\\hat{r}^a - D\\bar{r}^a\\|_\\infty + \\frac{\\hat{C}\\gamma}{2(1 - \\gamma)} \\max_a \\|\\hat{P}^a - DK^a\\|_\\infty \\right). \\qquad (4)$$\n\nResorting to Proposition 1, we can substitute (1) and (4) in (3) to obtain (2). $\\Box$\n\nNotice that when $D$ is deterministic (that is, when all its non-zero elements are 1), expression (2) reduces to Whitt\u2019s classical result regarding state aggregation in dynamic programming [12]. 
On the other hand, when the stochastic factorizations are exact, we recover (1), which is a computable version of Sorg and Singh\u2019s bound for soft homomorphisms [10]. Finally, if we have exact deterministic factorizations, the right-hand side of (2) reduces to zero. This also makes sense, since in this case the stochastic-factorization trick gives rise to an exact homomorphism [13].\n\nAs shown in Algorithm 1, KBSF is very simple to understand and to implement. It is also fast, requiring only $O(nm^2|A|)$ operations to build a reduced version of an MDP. Finally, and perhaps most importantly, KBSF always converges to a unique solution whose distance to the optimal one is bounded. In the next section we show how all these qualities turn into practical benefits.\n\n5 Experiments\n\nWe now present a series of computational experiments designed to illustrate the behavior of KBSF in a variety of challenging domains. We start with a simple problem showing that KBSF is indeed capable of compressing the information contained in KBRL\u2019s model. We then move to more difficult tasks, and compare KBSF with other state-of-the-art reinforcement-learning algorithms.\n\nAll problems considered in this section have a continuous state space and a finite number of actions and were modeled as discounted tasks with $\\gamma = 0.99$. The algorithms\u2019 results correspond to the performance of the greedy decision policy derived from the final value function computed. In all cases, the decision policies were evaluated on a set of test states from which the tasks cannot be easily solved. 
This makes the tasks considerably harder, since the algorithms must provide a good approximation of the value function over a larger region of the state space.\n\nThe experiments were carried out in the same way for all tasks: first, we collected a set of $n$ sample transitions $(s^a_k, r^a_k, \\hat{s}^a_k)$ using a uniformly-random exploration policy. Then the states $\\hat{s}^a_k$ were grouped by the k-means algorithm into $m$ clusters and a Gaussian kernel $\\bar{\\kappa}$ was positioned at the center of each resulting cluster [14]. These kernels defined the models used by KBSF to approximate the value function. This process was repeated 50 times for each task.\n\nWe adopted the same width for all kernels. The algorithms were executed on each task with the following values for this parameter: $\\{1, 0.1, 0.01\\}$. The results reported represent the best performance of the algorithms over the 50 runs; that is, for each $n$ and each $m$ we picked the width that generated the maximum average return. Throughout this section we use the following convention to refer to specific instances of each method: the first number enclosed in parentheses after an algorithm\u2019s name is $n$, the number of sample transitions used in the approximation, and the second one is $m$, the size of the model used to approximate the value function. Note that for KBRL $n$ and $m$ coincide.\n\nFigure 1 shows the results obtained by KBRL and KBSF on the puddle-world task [15]. In Figures 1a and 1b we observe the effect of fixing the number of transitions $n$ and varying the number of representative states $m$. As expected, the results of KBSF improve as $m \\to n$. More surprising is the fact that KBSF has essentially the same performance as KBRL using models one order of magnitude smaller. This indicates that KBSF is summarizing well the information contained in the data. 
Depending on the values of $n$ and $m$, this compression may represent a significant reduction of computational resources. For example, by replacing KBRL(8000) with KBSF(8000, 90), we obtain a decrease of more than 99% in the number of operations performed to find a policy, as shown in Figure 1b (the cost of constructing KBSF\u2019s MDP is included in all reported run times).\n\nIn Figures 1c and 1d we fix $m$ and vary $n$. Observe in Figure 1c how KBRL and KBSF have similar performances, and both improve as $n \\to \\infty$. However, since KBSF is using a model of fixed size, its computational cost depends only linearly on $n$, whereas KBRL\u2019s cost grows with $n^3$. This explains the huge difference in the algorithms\u2019 run times shown in Figure 1d.\n\nNext we evaluate how KBSF compares to other reinforcement-learning approaches. We first contrast our method with Lagoudakis and Parr\u2019s least-squares policy iteration algorithm (LSPI) [3]. LSPI is a natural candidate here because it also builds an approximator of fixed size out of a batch of sample transitions. 
In all experiments LSPI used the same data and approximation architectures as KBSF (to be fair, we fixed the width of KBSF\u2019s kernel $\\kappa^a$ at 1 in the comparisons).\n\n[Figure 1, plots omitted. Panels: (a) Performance as a function of m (return vs. m) for KBRL(8000) and KBSF(8000, m); (b) Run time as a function of m (seconds, log scale, vs. m) for KBRL(8000) and KBSF(8000, m); (c) Performance as a function of n (return vs. n) for KBRL(n) and KBSF(n, 100); (d) Run time as a function of n (seconds, log scale, vs. n) for KBRL(n) and KBSF(n, 100).]\n\nFigure 1: Results on the puddle-world task averaged over 50 runs. The algorithms were evaluated on two sets of states distributed over the region of the state space surrounding the \u201cpuddles\u201d. The first set was a 3 \u00d7 3 grid over [0.1, 0.3] \u00d7 [0.3, 0.5] and the second one was composed of four states: {0.1, 0.3} \u00d7 {0.9, 1.0}. The shadowed regions represent 99% confidence intervals.\n\nFigure 2 shows the results of LSPI and KBSF on the single and double pole-balancing tasks [16]. We call attention to the fact that the version of the problems used here is significantly harder than the more commonly-used variants in which the decision policies are evaluated on a single state close to the origin. This is probably the reason why LSPI achieves a success rate of no more than 60% on the single pole-balancing task, as shown in Figure 2a. In contrast, KBSF\u2019s decision policies are able to balance the pole in 90% of the attempts, on average, using as few as $m = 30$ representative states.\n\nThe results of KBSF on the double pole-balancing task are still more impressive. As Wieland [17] rightly points out, this version of the problem is considerably more difficult than its single pole variant, and previous attempts to apply reinforcement-learning techniques to this domain resulted in disappointing performance [18]. As shown in Figure 2c, KBSF($10^6$, 200) is able to achieve a success rate of more than 80%. To put this number in perspective, recall that some of the test states are quite challenging, with the two poles inclined and falling in opposite directions.\n\nThe good performance of KBSF comes at a relatively low computational cost. A conservative estimate reveals that, were KBRL($10^6$) run on the same computer used for these experiments, we would have to wait for more than 6 months to see the results. KBSF($10^6$, 200) delivers a decision policy in less than 7 minutes. KBSF\u2019s computational cost also compares well with that of LSPI, as shown in Figures 2b and 2d. LSPI\u2019s policy-evaluation step involves the update and solution of a linear system of equations, which take $O(nm^2)$ and $O(m^3|A|^3)$, respectively. In addition, the policy-update stage requires the definition of $\\pi(\\hat{s}^a_k)$ for all $n$ states in the set of sample transitions. 
In contrast, KBSF only performs $O(m^3)$ operations to evaluate a decision policy and $O(m^2|A|)$ operations to update it.\n\n[Figure 2, plots omitted. Panels: (a) Performance on single pole-balancing (successful episodes vs. m) for LSPI($5 \\times 10^4$, m) and KBSF($5 \\times 10^4$, m); (b) Run time on single pole-balancing (seconds, log scale, vs. m) for LSPI($5 \\times 10^4$, m) and KBSF($5 \\times 10^4$, m); (c) Performance on double pole-balancing (successful episodes vs. m) for LSPI($10^6$, m) and KBSF($10^6$, m); (d) Run time on double pole-balancing (seconds, log scale, vs. m) for LSPI($10^6$, m) and KBSF($10^6$, m).]\n\nFigure 2: Results on the pole-balancing tasks averaged over 50 runs. The values correspond to the fraction of episodes initiated from the test states in which the pole(s) could be balanced for 3000 steps (one minute of simulated time). The test sets were regular grids defined over the hypercube centered at the origin and covering 50% of the state-space axes in each dimension (we used a resolution of 3 and 2 cells per dimension for the single and double versions of the problem, respectively). Shadowed regions represent 99% confidence intervals.\n\nWe conclude our empirical evaluation of KBSF by using it to learn a neurostimulation policy for the treatment of epilepsy. In order to do so, we use a generative model developed by Bush et al. [19] based on real data collected from epileptic rat hippocampus slices. This model was shown to reproduce the seizure pattern of the original dynamical system and was later validated through the deployment of a learned treatment policy on a real brain slice [20]. The associated decision problem has a five-dimensional continuous state space and extremely non-linear dynamics. At each state the agent must choose whether or not to apply an electrical pulse. The goal is to suppress seizures while minimizing the total amount of stimulation needed to do so.\n\nWe use as a baseline for our comparisons the fixed-frequency stimulation policies usually adopted in standard in vitro clinical studies [20]. In particular, we considered policies that apply electrical pulses at frequencies of 0 Hz, 0.5 Hz, 1 Hz, and 1.5 Hz. For this task we ran LSPI and KBSF with sparse kernels, that is, we only computed the value of the Gaussian function at the 6-nearest neighbors of a given state (thus defining a simplex with the same dimension as the state space). This modification made it possible to use $m = 50,000$ representative states with KBSF. Since for LSPI the reduction on the computational cost was not very significant, we fixed $m = 50$ to keep its run time within reasonable bounds.\n\nWe compare the decision policies returned by KBSF and LSPI with those computed by fitted Q-iteration using Ernst et al.\u2019s extra-trees algorithm [4]. This approach has shown excellent performance on several reinforcement-learning tasks [4]. We used the extra-trees algorithm to build an ensemble of 30 trees. The algorithm was run for 50 iterations, with the structure of the trees fixed after the 10th one. 
The number of cut-directions evaluated at each node was fixed at $\\dim(S) = 5$, and the minimum number of elements required to split a node, denoted here by $\\eta_{\\min}$, was selected from the set $\\{20, 30, \\ldots, 50, 100, 150, \\ldots, 200\\}$. In general, we observed that the performance of the tree-based method improved with smaller values for $\\eta_{\\min}$, with an expected increase in the computational cost. Thus, in order to give an overall characterization of the performance of fitted Q-iteration, we report the results obtained with the extreme values of $\\eta_{\\min}$. The respective instances of the tree-based approach are referred to as T20 and T200.\n\nFigure 3 shows the results on the epilepsy-suppression task. In order to obtain different compromises of the problem\u2019s two conflicting objectives, we varied the relative magnitude of the penalties associated with the occurrence of seizures and with the application of an electrical pulse [19, 20]. In particular, we fixed the latter at $-1$ and varied the former with values in $\\{-10, -20, -40\\}$. This appears in the plots as subscripts next to the algorithms\u2019 names. As shown in Figure 3a, LSPI\u2019s policies seem to prioritize reduction of stimulation at the expense of higher seizure occurrence, which is clearly sub-optimal from a clinical point of view. T200 also performs poorly, with solutions representing no advance over the fixed-frequency stimulation strategies. In contrast, T20 and KBSF are both able to generate decision policies superior to the 1 Hz policy, which is the most efficient stimulation regime known to date in the clinical literature [21]. 
However, as shown in Figure 3b,\nKBSF does so at least 100 times faster than the tree-based method.\n\n[Figure 3 plots omitted. Panel (a) axes: fraction of stimulation (horizontal) versus fraction of seizures (vertical), with points for the 0 Hz, 0.5 Hz, 1 Hz, and 1.5 Hz policies and for LSPI, T200, T20, and KBSF at seizure penalties \u221210, \u221220, and \u221240. Panel (b) axis: seconds (log scale), for the same algorithms.]\n\n(a) Performance. The lengths of the rectangles\u2019 edges represent\n99% confidence intervals.\n\n(b) Run times (confidence intervals\ndo not show up in logarithmic scale)\n\nFigure 3: Results on the epilepsy-suppression problem averaged over 50 runs. The algorithms used\nn = 500,000 sample transitions to build the approximations. The decision policies were evaluated\non episodes of 10^5 transitions starting from a fixed set of 10 test states drawn uniformly at random.\n\n6 Conclusions\nWe presented KBSF, a reinforcement-learning algorithm that emerges from the application of the\nstochastic-factorization trick to KBRL. As discussed, our algorithm is simple, fast, has good\ntheoretical guarantees, and always converges to a unique solution. Our empirical results show that KBSF\nis able to learn very good decision policies with relatively low computational cost. It also has\npredictable behavior, generally improving its performance as the number of sample transitions or the\nsize of its approximation model increases. In the future, we intend to investigate more principled\nstrategies to select the representative states, based on the large body of literature available on kernel\nmethods. 
We also plan to extend KBSF to the on-line scenario, where the intermediate decision\npolicies generated during the learning process guide the collection of new sample transitions.\n\nAcknowledgments\n\nThe authors would like to thank Keith Bush for making the epilepsy simulator available and also Yuri\nGrinberg for helpful discussions regarding this work. Funding for this research was provided by the\nNational Institutes of Health (grant R21 DA019800) and the NSERC Discovery Grant program.\n\nReferences\n\n[1] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2\u20133):161\u2013178, 2002.\n\n[2] D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10):1624\u20131636, 2002.\n\n[3] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107\u20131149, 2003.\n\n[4] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503\u2013556, 2005.\n\n[5] M. L. Puterman. Markov Decision Processes\u2014Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.\n\n[6] N. Jong and P. Stone. Kernel-based models for reinforcement learning in continuous state spaces. In Proceedings of the International Conference on Machine Learning\u2014Workshop on Kernel Machines and Reinforcement Learning, 2006.\n\n[7] J. E. Cohen and U. G. Rothblum. Nonnegative ranks, decompositions and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149\u2013168, 1991.\n\n[8] A. Cutler and L. Breiman. Archetypal analysis. Technometrics, 36(4):338\u2013347, 1994.\n\n[9] A. M. S. Barreto and M. D. Fragoso. Computing the stationary distribution of a finite Markov chain through stochastic factorization. SIAM Journal on Matrix Analysis and Applications. In press.\n\n[10] J. Sorg and S. Singh. 
Transfer via soft homomorphisms. In Autonomous Agents & Multiagent Systems / Agent Theories, Architectures, and Languages, pages 741\u2013748, 2009.\n\n[11] S. A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20:1364\u20131377, 2009.\n\n[12] W. Whitt. Approximations of dynamic programs, I. Mathematics of Operations Research, 3(3):231\u2013243, 1978.\n\n[13] B. Ravindran. An Algebraic Approach to Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 2004.\n\n[14] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley and Sons, 1990.\n\n[15] R. S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems, volume 8, pages 1038\u20131044, 1996.\n\n[16] F. J. Gomez. Robust Non-linear Control Through Neuroevolution. PhD thesis, The University of Texas at Austin, 2003.\n\n[17] A. P. Wieland. Evolving neural network controllers for unstable systems. In Proceedings of the International Joint Conference on Neural Networks, volume 2, pages 667\u2013673, 1991.\n\n[18] F. Gomez, J. Schmidhuber, and R. Miikkulainen. Efficient non-linear control through neuroevolution. In Proceedings of the 17th European Conference on Machine Learning, pages 654\u2013662, 2006.\n\n[19] K. Bush, J. Pineau, and M. Avoli. Manifold embeddings for model-based reinforcement learning of neurostimulation policies. In Proceedings of the ICML / UAI / COLT Workshop on Abstraction in Reinforcement Learning, 2009.\n\n[20] K. Bush and J. Pineau. Manifold embeddings for model-based reinforcement learning under partial observability. In Advances in Neural Information Processing Systems, volume 22, pages 189\u2013197, 2009.\n\n[21] K. Jerger and S. J. Schiff. Periodic pacing an in vitro epileptic focus. 
Journal of Neurophysiology, (2):876\u2013879, 1995.\n", "award": [], "sourceid": 496, "authors": [{"given_name": "Andre", "family_name": "Barreto", "institution": ""}, {"given_name": "Doina", "family_name": "Precup", "institution": null}, {"given_name": "Joelle", "family_name": "Pineau", "institution": null}]}