{"title": "Optimal Reinforcement Learning for Gaussian Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 325, "page_last": 333, "abstract": "The exploration-exploitation trade-off is among the central challenges of reinforcement learning. The optimal Bayesian solution is intractable in general. This paper studies to what extent analytic statements about optimal learning are possible if all beliefs are Gaussian processes. A first order approximation of learning of both loss and dynamics, for nonlinear, time-varying systems in continuous time and space, subject to a relatively weak restriction on the dynamics, is described by an infinite-dimensional partial differential equation. An approximate finite-dimensional projection gives an impression for how this result may be helpful.", "full_text": "Optimal Reinforcement Learning\n\nfor Gaussian Systems\n\nPhilipp Hennig\n\nMax Planck Institute for Intelligent Systems\n\nDepartment of Empirical Inference\n\nSpemannstra\u00dfe 38, 72070 T\u00a8ubingen, Germany\n\nphennig@tuebingen.mpg.de\n\nAbstract\n\nThe exploration-exploitation trade-off is among the central challenges of rein-\nforcement learning. The optimal Bayesian solution is intractable in general. This\npaper studies to what extent analytic statements about optimal learning are possible\nif all beliefs are Gaussian processes. A \ufb01rst order approximation of learning of\nboth loss and dynamics, for nonlinear, time-varying systems in continuous time\nand space, subject to a relatively weak restriction on the dynamics, is described\nby an in\ufb01nite-dimensional partial differential equation. An approximate \ufb01nite-\ndimensional projection gives an impression for how this result may be helpful.\n\n1\n\nIntroduction \u2013 Optimal Reinforcement Learning\n\nReinforcement learning is about doing two things at once: Optimizing a function while learning\nabout it. 
These two objectives must be balanced: ignorance precludes efficient optimization; time spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known as the exploration-exploitation trade-off. Classic reinforcement learning often considers time cheap; the trade-off then plays a subordinate role to the desire for learning a "correct" model or policy. Many classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such as "ε-greedy" [1], or "Thompson sampling" [2]. However, at least since a thesis by Duff [3] it has been known that Bayesian inference allows optimal balance between exploration and exploitation. It requires integration over every possible future trajectory under the current belief about the system's dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken along the way. This amounts to optimization and integration over a tree, of exponential cost in the size of the state space [4]. The situation is particularly dire for continuous space-times, where both depth and branching factor of the "tree" are uncountably infinite. Several authors have proposed approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to be in some sense close to the Bayes-optimal policy [9].

In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to reinforcement learning long commonplace in other areas of machine learning: structural assumptions, while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins et al. [12] showed that it is actually possible to solve the exploration-exploitation trade-off locally, by constructing a linear approximation using a Kalman filter. 
Simpkins and colleagues further assumed the loss function to be known, and the dynamics known up to Brownian drift. Here, I use their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning algorithms are restricted to time-invariant systems). The core assumption is that all uncertain variables are known up to Gaussian process uncertainty. The main result is a first-order description of optimal reinforcement learning in the form of infinite-dimensional differential statements. This kind of description opens up new approaches to reinforcement learning. As an initial example of such treatments, Section 4 presents an approximate Ansatz that affords an explicit reinforcement learning algorithm, tested in some simple but instructive experiments (Section 5).

An intuitive description of the paper's results is this: from the prior and a corresponding choice of learning machinery (Section 2), we construct statements about the dynamics of the learning process (Section 3). The learning machine itself provides a probabilistic description of the dynamics of the physical system. Combining both dynamics yields a joint system, which we aim to control optimally. Doing so amounts to simultaneously controlling exploration (controlling the learning system) and exploitation (controlling the physical system).

Because large parts of the analysis rely on concepts from optimal control theory, this paper will use notation from that field. Readers more familiar with the reinforcement learning literature may wish to mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions p(s' | s, a), and utilities q with losses (negative rewards) −r. 
The latter is potentially confusing, so note that optimal control in this paper will attempt to minimize values, rather than to maximize them, as usual in reinforcement learning (these two descriptions are, of course, equivalent).

2 A Class of Learning Problems

We consider the task of optimally controlling an uncertain system whose states s ≡ (x, t) ∈ K ≡ R^D × R lie in a (D+1)-dimensional Euclidean phase space-time: a cost Q (cumulated loss) is acquired at (x, t) with rate dQ/dt = q(x, t), and the first inference problem is to learn this analytic function q. A second, independent learning problem concerns the dynamics of the system. We assume the dynamics separate into free and controlled terms, affine to the control:

dx(t) = [f(x, t) + g(x, t) u(x, t)] \, dt    (1)

where u(x, t) is the control function we seek to optimize, and f, g are analytic functions. To simplify our analysis, we will assume that either f or g is known, while the other may be uncertain (or, alternatively, that it is possible to obtain independent samples from both functions). See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let f be uncertain and g known. Information about both q(x, t) and f(x, t) = [f_1, ..., f_D] is acquired stochastically: a Poisson process of constant rate λ produces mutually independent samples

y_q(x, t) = q(x, t) + \epsilon_q \quad \text{and} \quad y_{fd}(x, t) = f_d(x, t) + \epsilon_{fd}, \quad \text{where} \quad \epsilon_q \sim N(0, \sigma_q^2); \; \epsilon_{fd} \sim N(0, \sigma_{fd}^2).    (2)

The noise levels σ_q and σ_f are presumed known. Let our initial beliefs about q and f be given by Gaussian processes GP_{k_q}(q; µ_q, Σ_q), and independent Gaussian processes \prod_{d} GP_{k_{fd}}(f_d; µ_{fd}, Σ_{fd}), respectively, with kernels k_q, k_{f1}, ..., k_{fD} over K, and mean / covariance functions µ / Σ. In other words, samples over the belief can be drawn using an infinite vector Ω of i.i.d. Gaussian variables, as

\tilde{f}_d([x, t]) = \mu_{fd}([x, t]) + \int \Sigma_{fd}^{1/2}([x, t], [x', t']) \, \Omega(x', t') \, dx' \, dt' = \mu_{fd}([x, t]) + (\Sigma_{fd}^{1/2} \Omega)([x, t]);    (3)

the second equation demonstrates a compact notation for inner products that will be used throughout. It is important to note that f, q are unknown, but deterministic. At any point during learning, we can use the same samples Ω to describe uncertainty, while µ, Σ change during the learning process.

To ensure continuous trajectories, we also need to regularize the control. Following control custom, we introduce a quadratic control cost ρ(u) = ½ uᵀR⁻¹u with control cost scaling matrix R. Its units [R] = [x/t]/[Q/x] relate the cost of changing location to the utility gained by doing so. The overall task is to find the optimal discounted-horizon value

v(x, t) = \min_u \int_t^\infty e^{-(\tau - t)/\gamma} \left[ q[\chi[\tau, u(\chi, \tau)], \tau] + \tfrac{1}{2} u(\chi, \tau)^\top R^{-1} u(\chi, \tau) \right] d\tau    (4)

where χ(τ, u) is the trajectory generated by the dynamics defined in Equation (1), using the control law (policy) u(x, t). The exponential definition of the discount γ > 0 gives the unit of time to γ.

Before beginning the analysis, consider the relative generality of this definition: we allow for a continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form, and may change over time. The specific choice of a Poisson process for the generation of samples is somewhat ad-hoc, but some measure is required to quantify the flow of information through time. The Poisson process is in some sense the simplest such measure, assigning uniform probability density. An alternative is to assume that datapoints are acquired at regular intervals of width λ. This results in a quite similar model but, since the system's dynamics still proceed in continuous time, can complicate notation. A downside is that we had to restrict the form of the dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13].

3 Optimal Control for the Learning Process

The optimal solution to the exploration-exploitation trade-off is formed by the dual control [14] of a joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem. This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem for the system described in Sec. 2, and analytically solves the equation for the optimal control. This necessitates a description of the learning algorithm's dynamics:

At time t = τ, let the system be at phase space-time s_τ = (x(τ), τ) and have the Gaussian process belief GP(q; µ^τ(s), Σ^τ(s, s')) over the function q (all derivations in this section will focus on q, and we will drop the subscript q from many quantities for readability. The forms for f, or g, are entirely analogous, with independent Gaussian processes for each dimension d = 1, ..., D). 
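Before the update is derived, it may help to see the sampling statement of Eq. (3) in finite dimensions. The following sketch is my own illustration, not part of the paper; all function and variable names are invented, and the kernel choice is arbitrary. On a grid, the square-root operator Σ^{1/2} can be realised as a Cholesky factor of the kernel Gram matrix, and Ω becomes a finite vector of i.i.d. standard normals:

```python
import numpy as np

def se_kernel(a, b, theta=1.0, ell=1.0):
    """Square-exponential kernel k(a, b) = theta^2 exp(-(a - b)^2 / (2 ell^2))."""
    return theta**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def sample_belief(grid, mu, n_samples=3, jitter=1e-9, seed=0):
    """Finite-dimensional analogue of Eq. (3): f = mu + Sigma^{1/2} Omega.
    Sigma^{1/2} is realised as a Cholesky factor of the Gram matrix,
    Omega as a vector of i.i.d. standard-normal variables."""
    rng = np.random.default_rng(seed)
    K = se_kernel(grid, grid) + jitter * np.eye(len(grid))
    sqrt_Sigma = np.linalg.cholesky(K)                   # plays the role of Sigma^{1/2}
    Omega = rng.standard_normal((len(grid), n_samples))  # the i.i.d. Gaussian vector
    return mu[:, None] + sqrt_Sigma @ Omega

grid = np.linspace(0.0, 10.0, 100)
samples = sample_belief(grid, mu=np.zeros_like(grid))
print(samples.shape)  # (100, 3)
```

The same Ω can be reused as µ and Σ change, which is exactly the property the derivation below exploits.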
This belief stems from a finite number N of samples y_0 = [y_1, ..., y_N]ᵀ ∈ R^N collected at space-times S_0 = [(x_1, t_1), ..., (x_N, t_N)]ᵀ ≡ [s_1, ..., s_N]ᵀ ∈ K^N (note that t_1 to t_N need not be equally spaced, ordered, or < τ). For arbitrary points s* = (x*, t*) ∈ K, the belief over q(s*) is a Gaussian with mean function µ^τ, and covariance function Σ^τ [15]

\mu^\tau(s^*_i) = k(s^*_i, S_0)[K(S_0, S_0) + \sigma_q^2 I]^{-1} y_0
\Sigma^\tau(s^*_i, s^*_j) = k(s^*_i, s^*_j) - k(s^*_i, S_0)[K(S_0, S_0) + \sigma_q^2 I]^{-1} k(S_0, s^*_j)    (5)

where K(S_0, S_0) is the Gram matrix with elements K_ab = k(s_a, s_b). We will abbreviate K_0 ≡ [K(S_0, S_0) + σ_q²I] from here on. The co-vector k(s*, S_0) has elements k_i = k(s*, s_i) and will be shortened to k_0. How does this belief change as time moves from τ to τ + dt? If dt ↘ 0, the chance of acquiring a datapoint y_τ in this time is λ dt. Marginalising over this Poisson stochasticity, we expect one sample with probability λ dt, two samples with (λ dt)², and so on. So the mean after dt is expected to be

\mu^{\tau + dt} = \lambda \, dt \, (k_0, k_\tau) \begin{pmatrix} K_0 & \xi_\tau \\ \xi_\tau^\top & \kappa_\tau \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_\tau \end{pmatrix} + (1 - \lambda \, dt - O(\lambda \, dt)^2) \cdot k_0^\top K_0^{-1} y_0 + O(\lambda \, dt)^2    (6)

where we have defined the map k_τ = k(s*, s_τ), the vector ξ_τ with elements ξ_{τ,i} = k(s_i, s_τ), and the scalar κ_τ = k(s_τ, s_τ) + σ_q². Algebraic re-formulation yields

\mu^{\tau + dt} = k_0^\top K_0^{-1} y_0 + \lambda (k_\tau - k_0^\top K_0^{-1} \xi_\tau)(\kappa_\tau - \xi_\tau^\top K_0^{-1} \xi_\tau)^{-1}(y_\tau - \xi_\tau^\top K_0^{-1} y_0) \, dt.    (7)

Note that ξ_τᵀK_0⁻¹y_0 = µ(s_τ), the mean prediction at s_τ, and (κ_τ − ξ_τᵀK_0⁻¹ξ_τ) = σ_q² + Σ(s_τ, s_τ), the marginal variance there. Hence, we can define scalars Σ̄, σ̄ and write

\frac{y_\tau - \xi_\tau^\top K_0^{-1} y_0}{(\kappa_\tau - \xi_\tau^\top K_0^{-1} \xi_\tau)^{1/2}} = \frac{[\Sigma^{1/2}\Omega](s_\tau) + \sigma\omega}{[\Sigma(s_\tau, s_\tau) + \sigma^2]^{1/2}} \equiv \bar\Sigma_\tau^{1/2} \Omega + \bar\sigma_\tau \omega \quad \text{with } \omega \sim N(0, 1).    (8)

So the change to the mean consists of a deterministic but uncertain change whose effects accumulate linearly in time, and a stochastic change, caused by the independent noise process, whose variance accumulates linearly in time (in truth, these two points are considerably subtler; a detailed proof is left out for lack of space). We use the Wiener [16] measure dω to write

d\mu_{s_\tau}(s^*) = \lambda \, \frac{k_\tau - k_0^\top K_0^{-1} \xi_\tau}{(\kappa_\tau - \xi_\tau^\top K_0^{-1} \xi_\tau)^{1/2}} \cdot \frac{[\Sigma^{1/2}\Omega](s_\tau) + \sigma\omega}{[\Sigma(s_\tau, s_\tau) + \sigma^2]^{1/2}} \, dt \equiv \lambda L_{s_\tau}(s^*)\left[\bar\Sigma_\tau^{1/2} \Omega \, dt + \bar\sigma_\tau \, d\omega\right]    (9)

where we have implicitly defined the innovation function L. Note that L is a function of both s* and s_τ. 
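The identities noted after Eq. (7) admit a direct numerical check. The sketch below is my own illustration (names, noise level, and kernel are arbitrary choices, not the paper's): it computes the posterior of Eq. (5) and verifies that ξᵀK₀⁻¹y₀ equals the posterior mean at s_τ, and κ − ξᵀK₀⁻¹ξ equals σ_q² plus the marginal posterior variance there:

```python
import numpy as np

def k(a, b, theta=1.0, ell=1.0):
    """Square-exponential kernel on scalar inputs (stand-in for space-times)."""
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return theta**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(1)
S0 = rng.uniform(0.0, 5.0, size=8)                 # N observed "space-times"
y0 = np.sin(S0) + 0.1 * rng.standard_normal(8)
sigma_q = 0.1
K0 = k(S0, S0) + sigma_q**2 * np.eye(len(S0))      # K0 = K(S0, S0) + sigma_q^2 I

def posterior(s_star):
    """Eq. (5): posterior mean and marginal variance of q(s*) given (S0, y0)."""
    ks = k(s_star, S0)
    mu = ks @ np.linalg.solve(K0, y0)
    var = k(s_star, s_star) - ks @ np.linalg.solve(K0, ks.T)
    return mu.item(), var.item()

# Check the identities used after Eq. (7), at a new observation site s_tau:
s_tau = 2.3
xi = k(S0, s_tau).ravel()                          # xi_i = k(s_i, s_tau)
kappa = k(s_tau, s_tau).item() + sigma_q**2        # kappa = k(s_tau, s_tau) + sigma_q^2
mu_tau, var_tau = posterior(s_tau)
assert np.isclose(xi @ np.linalg.solve(K0, y0), mu_tau)
assert np.isclose(kappa - xi @ np.linalg.solve(K0, xi), sigma_q**2 + var_tau)
```

Under these identities, the prefactor of Eq. (9) is the (normalised) change that one observation at s_τ induces in the posterior mean everywhere, which is why L is called the innovation function.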
A similar argument finds the change of the covariance function to be the deterministic rate

d\Sigma_{s_\tau}(s^*_i, s^*_j) = -\lambda L_{s_\tau}(s^*_i) L_{s_\tau}(s^*_j) \, dt.    (10)

So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic and stochastic changes to the mean, both of which are samples of Gaussian processes with covariance function proportional to LLᵀ. This separation is a fundamental characteristic of GPs (it is the nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for data with known noise magnitude).

We introduce the belief-augmented space H containing states z(τ) ≡ [x(τ), τ, µ^τ_q(s), µ^τ_{f1}, ..., µ^τ_{fD}, Σ^τ_q(s, s'), Σ^τ_{f1}, ..., Σ^τ_{fD}]. Since the means and covariances are functions, H is infinite-dimensional. Under our beliefs, z(τ) obeys a stochastic differential equation of the form

dz = [A(z) + B(z)u + C(z)\Omega] \, dt + D(z) \, d\omega    (11)

with free dynamics A, controlled dynamics Bu, uncertainty operator C, and noise operator D

A = \left[\mu_f^\tau(z_x, z_t)^\top, \, 1, \, 0, \, \dots, \, 0, \, -\lambda L_q L_q^\top, \, -\lambda L_{f1} L_{f1}^\top, \, \dots, \, -\lambda L_{fD} L_{fD}^\top\right]^\top;
B = [g(s^*), 0, 0, 0, \dots]^\top; \quad C = \mathrm{diag}\left(\Sigma_{f\tau}^{1/2}, \, 0, \, \lambda L_q \bar\Sigma_q^{1/2}, \, \lambda L_{f1} \bar\Sigma_{f1}^{1/2}, \, \dots, \, \lambda L_{fD} \bar\Sigma_{fD}^{1/2}, \, 0, \, \dots, \, 0\right);    (12)
D = \mathrm{diag}\left(0, \, 0, \, \lambda L_q \bar\sigma_q, \, \lambda L_{f1} \bar\sigma_{f1}, \, \dots, \, \lambda L_{fD} \bar\sigma_{fD}, \, 0, \, \dots, \, 0\right)    (13)

The value – the expected cost to go – of any state s* is given by the Hamilton-Jacobi-Bellman equation, which follows from Bellman's principle and a first-order expansion, using Eq. (4):

v(z_\tau) = \min_u \left\{ \iint \left[ \left( \mu_q(s_\tau) + \Sigma_{q\tau}^{1/2}\Omega_q + \sigma_q\omega_q + \tfrac{1}{2} u^\top R^{-1} u \right) dt + v(z_{\tau + dt}) \right] d\omega \, d\Omega \right\}
= \min_u \left\{ \int \left[ \left( \mu_q^\tau + \Sigma_{q\tau}^{1/2}\Omega_q + \tfrac{1}{2} u^\top R^{-1} u \right) dt + v(z_\tau) + \frac{\partial v}{\partial t} \, dt + [A + Bu + C\Omega]^\top \nabla v \, dt + \tfrac{1}{2} \mathrm{tr}\left[D^\top (\nabla^2 v) D\right] dt \right] d\Omega \right\}    (14)

Integration over ω can be performed with ease, and removes the stochasticity from the problem; the uncertainty over Ω is a lot more challenging. Because the distribution over future losses is correlated through space and time, ∇v, ∇²v are functions of Ω, and the integral is nontrivial. But there are some obvious approximate approaches. For example, if we (inexactly) swap integration and minimisation, draw samples Ω_i and solve for the value for each sample, we get an "average optimal controller". This over-estimates the actual sum of future rewards by assuming the controller has access to the true system. It has the potential advantage of considering the actual optimal controller for every possible system, the disadvantage that the average of optima need not be optimal for any actual solution. On the other hand, if we ignore the correlation between Ω and ∇v, we can integrate (17) locally, all terms in Ω drop out and we are left with an "optimal average controller", which assumes that the system locally follows its average (mean) dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy in a simplistic sense – it does take the effect of learning into account. It amounts to a "global one-step look-ahead". One could imagine extensions that consider the influence of Ω on ∇v to a higher order, but these will be left for future work. Under this first-order approximation, analytic minimisation over u can be performed in closed form, and bears

u(z) = -R B(z)^\top \nabla v(z) = -R g(x, t)^\top \nabla_x v(z).    (15)

The optimal Hamilton-Jacobi-Bellman equation is then

\gamma^{-1} v(z) = \mu_q^\tau + A^\top \nabla v - \tfrac{1}{2} [\nabla v]^\top B R B^\top \nabla v + \tfrac{1}{2} \mathrm{tr}\left[D^\top (\nabla^2 v) D\right].    (16)

A more explicit form emerges upon re-inserting the definitions of Eq. (12) into Eq. (16):

\gamma^{-1} v(z) = \mu_q^\tau + \underbrace{\left[\mu_f^\tau(z_x, z_t)^\top \nabla_x + \nabla_t\right] v(z)}_{\text{free drift cost}} - \underbrace{\tfrac{1}{2} [\nabla_x v(z)]^\top g(z_x, z_t) R \, g(z_x, z_t)^\top \nabla_x v(z)}_{\text{control benefit}} - \underbrace{\lambda \sum_{c = q, f_1, \dots, f_D} \left[L_c L_c^\top \nabla_{\Sigma_c}\right] v(z)}_{\text{exploration bonus}} + \underbrace{\tfrac{1}{2} \sum_{c = q, f_1, \dots, f_D} \lambda^2 \bar\sigma_c^2 \left[L_c^\top \left(\nabla^2_{\mu_c} v(z)\right) L_c\right]}_{\text{diffusion cost}}    (17)

Equation (17) is the central result: given Gaussian priors on nonlinear control-affine dynamic systems, up to a first order approximation, optimal reinforcement learning is described by an infinite-dimensional second-order partial differential equation. 
It can be interpreted as follows (labels in the equation; note the negative signs of "beneficial" terms): the value of a state comprises the immediate utility rate; the effect of the free drift through space-time and the benefit of optimal control; an exploration bonus of learning; and a diffusion cost engendered by the measurement noise. The first two lines of the right hand side describe effects from the phase space-time subspace of the augmented space, while the last line describes effects from the belief part of the augmented space. The former will be called exploitation terms, the latter exploration terms, for the following reason: if the first two lines dominate the right hand side of Equation (17) in absolute size, then future losses are governed by the physical sub-space – caused by exploiting knowledge to control the physical system. On the other hand, if the last line dominates the value function, exploration is more important than exploitation – the algorithm controls the physical space to increase knowledge. To my knowledge, this is the first differential statement about reinforcement learning's two objectives. Finally, note the role of the sampling rate λ: if λ is very low, exploration is useless over the discount horizon.

Even after these approximations, solving Equation (17) for v remains nontrivial for two reasons: first, although the vector product notation is pleasingly compact, the mean and covariance functions are of course infinite-dimensional, and what looks like straightforward inner vector products are in fact integrals. 
For example, the average exploration bonus for the loss, writ large, reads

-\lambda L_q L_q^\top \nabla_{\Sigma_q} v(z) = -\iint_K \lambda L^{(q)}_{s_\tau}(s^*_i) L^{(q)}_{s_\tau}(s^*_j) \frac{\partial v(z)}{\partial \Sigma(s^*_i, s^*_j)} \, ds^*_i \, ds^*_j    (18)

(note that this object remains a function of the state s_τ). For general kernels k, these integrals may only be solved numerically. However, for at least one specific choice of kernel (square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This analytic structure is so interesting, and the square-exponential kernel so widely used, that the "numerical" part of the paper (Section 4) will restrict the choice of kernel to this class.

The other problem, of course, is that Equation (17) is a nontrivial differential equation. Section 4 presents one, initial attempt at a numerical solution that should not be mistaken for a definitive answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement learning: it replaces the intractable definition of the value in terms of future trajectories with a differential equation. This raises hope for new approaches to reinforcement learning, based on numerical analysis rather than sampling.

Digression: Relaxing Some Assumptions

This paper only applies to the specific problem class of Section 2. Any generalisations and extensions are future work, and I do not claim to solve them. But it is instructive to consider some easier extensions, and some harder ones: for example, it is intractable to simultaneously learn both g and f nonparametrically, if only the actual transitions are observed, because the beliefs over the two functions become infinitely dependent when conditioned on data. But if the belief on either g or f is parametric (e.g. a general linear model), a joint belief on g and f is tractable [see 15, §2.7], in fact straightforward. Both the quadratic control cost ∝ uᵀR⁻¹u and the control-affine form (g(x, t)u) are relaxable assumptions – other parametric forms are possible, as long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for Gaussian process regression on q and f or g, it is clear that standard ways of inferring kernels [15, 18] can be used without complication, but that they are not covered by the notion of optimal learning as addressed here.

4 Numerically Solving the Hamilton-Jacobi-Bellman Equation

Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical methods may be considered. This section reports on one specific Ansatz, a Galerkin-type projection analogous to the one used in [12]. For this we break with the generality of previous sections and assume that the kernels k are given by square exponentials k(a, b) = k_SE(a, b; θ, S) = θ² exp(−½ (a − b)ᵀS⁻¹(a − b)) with parameters θ, S. As discussed above, we approximate by setting Ω = 0. We find an approximate solution through a factorizing parametric Ansatz: let the value of any point z ∈ H in the belief space be given through a set of parameters w and some nonlinear functionals φ, such that their contributions separate over phase space, mean, and covariance functions:

v(z) = \sum_{e = x, \Sigma_q, \mu_q, \Sigma_f, \mu_f} w_e^\top \phi_e(z_e) \quad \text{with } \phi_e, w_e \in \mathbb{R}^{N_e}    (19)

This projection is obviously restrictive, but it should be compared to the use of radial basis functions for function approximation, a similarly restrictive framework widely used in reinforcement learning. The functionals φ have to be chosen conducive to the form of Eq. (17). 
For square exponential kernels, one convenient choice is

\phi^a_s(z_s) = k(s_z, s_a; \theta_a, S_a),    (20)
\phi^b_\Sigma(z_\Sigma) = \iint_K \left[\Sigma_z(s^*_i, s^*_j) - k(s^*_i, s^*_j)\right] k(s^*_i, s_b; \theta_b, S_b) \, k(s^*_j, s_b; \theta_b, S_b) \, ds^*_i \, ds^*_j, \quad \text{and}    (21)
\phi^c_\mu(z_\mu) = \iint_K \mu_z(s^*_i) \, \mu_z(s^*_j) \, k(s^*_i, s_c; \theta_c, S_c) \, k(s^*_j, s_c; \theta_c, S_c) \, ds^*_i \, ds^*_j    (22)

(the subtracted term in the first integral serves only numerical purposes). With this choice, the integrals of Equation (17) can be solved analytically (solutions left out due to space constraints). The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in w_x, linear in all other w_e:

\tfrac{1}{2} w_x^\top \Psi(z_x) w_x - q(z_x) + \sum_{e = x, \mu_q, \Sigma_q, \mu_f, \Sigma_f} \Xi_e(z_e) w_e = 0    (23)

using co-vectors Ξ and a matrix Ψ with elements

\Xi^x_a(z_s) = \gamma^{-1} \phi^a_s(z_s) - f(z_x)^\top \nabla_x \phi^a_s(z_s) - \nabla_t \phi^a_s(z_s)
\Xi^\Sigma_a(z_\Sigma) = \gamma^{-1} \phi^a_\Sigma(z_\Sigma) + \lambda \iint_K L_{s_\tau}(s^*_i) L_{s_\tau}(s^*_j) \frac{\partial \phi^a_\Sigma(z_\Sigma)}{\partial \Sigma_z(s^*_i, s^*_j)} \, ds^*_i \, ds^*_j
\Xi^\mu_a(z_\mu) = \gamma^{-1} \phi^a_\mu(z_\mu) - \frac{\lambda^2 \bar\sigma^2_{s_\tau}}{2} \iint_K L_{s_\tau}(s^*_i) L_{s_\tau}(s^*_j) \frac{\partial^2 \phi^a_\mu(z_\mu)}{\partial \mu_z(s^*_i) \, \partial \mu_z(s^*_j)} \, ds^*_i \, ds^*_j    (24)
\Psi(z)_{k\ell} = [\nabla_x \phi^k_s(z)]^\top g(z_x) R \, g(z_x)^\top [\nabla_x \phi^\ell_s(z)]

Note that Ξ^µ and Ξ^Σ are both functions of the physical state, through s_τ. 
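The phase-space part of these definitions is easy to make concrete: for the features of Eq. (20), the spatial gradient is analytic, ∇_xφ = −S⁻¹(s − s_a)φ, so the matrix Ψ can be assembled directly. The sketch below is illustrative code of mine, not the paper's implementation; g, R, and the feature centers are placeholder choices for a two-dimensional state:

```python
import numpy as np

def se_feature(s, center, theta=1.0, S=None):
    """phi_s^a(z_s) = k(s_z, s_a; theta, S), a square-exponential feature (Eq. 20)."""
    S = np.eye(len(s)) if S is None else S
    d = s - center
    return theta**2 * np.exp(-0.5 * d @ np.linalg.solve(S, d))

def se_feature_grad(s, center, theta=1.0, S=None):
    """Analytic spatial gradient: grad phi = -S^{-1}(s - s_a) * phi."""
    S = np.eye(len(s)) if S is None else S
    return -np.linalg.solve(S, s - center) * se_feature(s, center, theta, S)

def psi_matrix(s, centers, g, R):
    """Psi_{kl} = [grad phi^k]^T (g R g^T) [grad phi^l], the quadratic
    control-benefit term of Eq. (24), assembled at state s."""
    grads = np.stack([se_feature_grad(s, c) for c in centers])  # shape (N, D)
    M = g @ R @ g.T                                             # D x D, PSD for PSD R
    return grads @ M @ grads.T

s = np.array([0.3, 0.1])
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
g, R = np.eye(2), 0.5 * np.eye(2)
Psi = psi_matrix(s, centers, g, R)
print(Psi.shape)  # (3, 3), symmetric positive semidefinite
```

The belief-space terms Ξ^µ and Ξ^Σ additionally require the closed-form double integrals over products of square-exponentials, which the paper leaves out for space; they are not reproduced here.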
It is through this functional dependency that the value of information is associated with the physical phase space-time. To solve for w, we simply choose a number of evaluation points z_eval sufficient to constrain the resulting system of quadratic equations, and then find the least-squares solution w_opt by function minimisation, using standard methods, such as Levenberg-Marquardt [19]. A disadvantage of this approach is that it has a number of degrees of freedom Θ, such as the kernel parameters, and the number and locations x_a of the feature functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting results simply by choosing these parameters heuristically.

5 Experiments

5.1 Illustrative Experiment on an Artificial Environment

As a simple example system with a one-dimensional state space, f, q were sampled from the model described in Section 2, and g set to the unit function. The state space was tiled regularly, in a bounded region, with 231 square exponential ("radial") basis functions (Equation 20), initially all with weight w^i_x = 0. For the information terms, only a single basis function was used for each term (i.e. one single φ^Σq, one single φ^µq, and equally for f, all with very large length scales S, covering the entire region of interest). As pointed out above, this does not imply a trivial structure for these terms, because of the functional dependency on L_{s_τ}. Five times the number of parameters, i.e. N_eval = 1175 evaluation points z_eval were sampled, at each time step, uniformly over the same region. It is not intuitively clear whether each z_e should have its own belief (i.e. whether the points must cover the belief space as well as the phase space), but anecdotal evidence from the experiments suggests that it suffices to use the current beliefs for all evaluation points. 
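The least-squares step described above can be sketched in isolation. The toy below is entirely my own construction (Ψ_m, Ξ_m and q_m are random stand-ins for the evaluated functionals, and the iteration is a simple damped Gauss-Newton loop in the Levenberg-Marquardt spirit, not the reference implementation); it reduces the squared residual of Eq. (23) over a set of evaluation points:

```python
import numpy as np

rng = np.random.default_rng(0)
N_w, N_eval = 6, 30  # parameters w and evaluation points z_eval (here 5x oversampled)

# Random placeholders for the evaluated functionals of Eqs. (23)-(24):
Psi = rng.standard_normal((N_eval, N_w, N_w))
Psi = 0.5 * (Psi + Psi.transpose(0, 2, 1))   # symmetrise the quadratic terms
Xi = rng.standard_normal((N_eval, N_w))
q = rng.standard_normal(N_eval)

def residuals(w):
    """One residual per evaluation point: 0.5 w^T Psi_m w - q_m + Xi_m . w (Eq. 23)."""
    quad = 0.5 * np.einsum('i,mij,j->m', w, Psi, w)
    return quad - q + Xi @ w

def jacobian(w):
    """d r_m / d w = Psi_m w + Xi_m (Psi_m symmetric)."""
    return np.einsum('mij,j->mi', Psi, w) + Xi

def levenberg_marquardt(w, n_iter=50, damping=1e-2):
    for _ in range(n_iter):
        r, J = residuals(w), jacobian(w)
        step = np.linalg.solve(J.T @ J + damping * np.eye(len(w)), J.T @ r)
        w_new = w - step
        if np.sum(residuals(w_new)**2) < np.sum(r**2):
            w, damping = w_new, damping * 0.5    # accept step, relax damping
        else:
            damping *= 10.0                      # reject step, increase damping
    return w

w_opt = levenberg_marquardt(np.zeros(N_w))
```

Because there are more evaluation points than parameters, the residual generally does not vanish; the iteration only finds the best fit in the least-squares sense, which is all Eq. (23) asks for.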
A more comprehensive evaluation of such aspects will be the subject of a future paper. The discount factor was set to γ = 50 s, the sampling rate at λ = 2/s, the control cost at 10 m²/($ s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5 s.

[Figure 1: four panels plotted over phase space-time (x against t); the plotted surfaces are not recoverable from the extraction.]
Figure 1: State after 50 time steps, plotted over phase space-time. Top left: µ_q (blue is good). The belief over f is not shown, but has similar structure. Top right: value estimate v at current belief; compare to the next two panels to note that the approximation is relatively coarse. Bottom left: exploration terms. Bottom right: exploitation terms. At its current state (black diamond), the system is in the process of switching from exploitation to exploration (the blue region in the bottom right panel is roughly cancelled by the red, forward cone in the bottom left one).

Figure 1 shows the situation 50 s after initialisation. The most noteworthy aspect is the nontrivial structure of exploration and exploitation terms. Despite the simplistic parameterisation of the corresponding functionals, their functional dependence on s_τ induces a complex shape. The system constantly balances exploration and exploitation, and the optimal balance depends nontrivially on location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. 
This is an important insight that casts doubt on the usefulness of simple, local exploration bonuses, used in many reinforcement learning algorithms.

Secondly, note that the system's trajectory does not necessarily follow what would be the optimal path under full information. The value estimate reflects this, by assigning low (good) value to regions behind the system's trajectory. This amounts to a sense of "remorse": if the learner had known about these regions earlier, it would have strived to reach them. But this is not a sign of sub-optimality: remember that the value is defined on the augmented space. The plots in Figure 1 are merely a slice through that space at some level set in the belief space.

5.2 Comparative Experiment – The Furuta Pendulum

The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [20]. The task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the average loss of a controller with access to the true f, g, q, but otherwise using Algorithm 1, to that of an ε-greedy TD(λ) learner with linear function approximation, Simpkins et al.'s [12] Kalman method, and the Gaussian process learning controller (Fig. 2). The linear function approximation of TD(λ) used the same radial basis functions as the three other methods. None of these methods is free of assumptions: note that the sampling frequency influences TD in nontrivial ways rarely studied (for example through the coarseness of the ε-greedy policy). The parameters were set to γ = 5 s, λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. 
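The distinction matters quantitatively. A toy discretisation (my own illustration, not the paper's evaluation code) of the discounted integral of Eq. (4) against plain accumulation, for a constant unit loss rate:

```python
import numpy as np

def discounted_loss(q_traj, dt, gamma):
    """Discrete approximation of Eq. (4): sum_k exp(-k dt / gamma) q_k dt."""
    k = np.arange(len(q_traj))
    return np.sum(np.exp(-k * dt / gamma) * q_traj * dt)

def cumulative_loss(q_traj, dt):
    """Total accumulated loss, as often quoted in RL experiments."""
    return np.sum(q_traj * dt)

q_traj = np.ones(1000)   # constant unit loss rate over a 100 s horizon
dt, gamma = 0.1, 5.0     # gamma = 5 s, as in the Section 5.2 experiment
print(cumulative_loss(q_traj, dt))         # ~100: grows with the horizon
print(discounted_loss(q_traj, dt, gamma))  # ~5: the discount gives the unit of time to gamma
```

For a long horizon the discounted loss of a constant rate saturates near γ, so the two measures are not comparable across horizons, which is why the table in Figure 2 reports the discounted quantity.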
The GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of the tested methods, not even the informed controller, achieves a stable controlled balance, although the GP learner does swing up the pendulum. This is due to the random, non-optimal location of basis functions, which means resolution is not necessarily available where it is needed (in regions of high curvature of the value function), and demonstrates a need for better solution methods for Eq. (17).

Method                            cumulative loss
Full Information (baseline)       4.4   ± 0.3
TD(λ)                             6.401 ± 0.001
Kalman filter optimal learner     6.408 ± 0.001
Gaussian process optimal learner  4.6   ± 1.4

Figure 2: The Furuta pendulum system: a pendulum of length ℓ2 (joint angle θ2) is attached to a rotatable arm of length ℓ1 (joint angle θ1). The control input u is the torque applied to the arm. Right: cost to go achieved by the different methods; lower is better. Error measures are one standard deviation over five experiments.

There is of course a large number of other algorithms to potentially compare to, and these results are anything but exhaustive. They should not be misunderstood as a critique of any other method. But they highlight the need for units of measure on every quantity, and show how hard optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems, there is no "conservative" option that could be adopted in place of the Bayesian answer.

6 Conclusion

Gaussian process priors provide a nontrivial class of reinforcement learning problems for which optimal reinforcement learning reduces to solving differential equations.
Of course, this fact alone\ndoes not make the problem easier, as solving nonlinear differential equations is in general intractable.\nHowever, the ubiquity of differential descriptions in other \ufb01elds raises hope that this insight opens\nnew approaches to reinforcement learning. For intuition on how such solutions might work, one\nspeci\ufb01c approximation was presented, using functionals to reduce the problem to \ufb01nite least-squares\nparameter estimation.\nThe critical reader will have noted how central the prior is for the arguments in Section 3: The\ndynamics of the learning process are predictions of future data, thus inherently determined exclusively\nby prior assumptions. One may \ufb01nd this unappealing, but there is no escape from it. Minimizing\nfuture loss requires predicting future loss, and predictions are always in danger of falling victim to\nincorrect assumptions. A \ufb01nite initial identi\ufb01cation phase may mitigate this problem by replacing\nprior with posterior uncertainty \u2013 but even then, predictions and decisions will depend on the model.\nThe results of this paper raise new questions, theoretical and applied. The most pressing questions\nconcern better solution methods for Eq. (14), in particular better means for taking the expectation\nover the uncertain dynamics to more than \ufb01rst order. There are also obvious probabilistic issues: Are\nthere other classes of priors that allow similar treatments? (Note some conceptual similarities between\nthis work and the BEETLE algorithm [4]). 
To what extent can approximate inference methods – widely studied in combination with Gaussian process regression – be used to broaden the utility of these results?

Acknowledgments

The author wishes to express his gratitude to Carl Rasmussen, Jan Peters, Zoubin Ghahramani, Peter Dayan, and an anonymous reviewer, whose thoughtful comments uncovered several errors and crucially improved this paper.

References
[1] R.S. Sutton and A.G. Barto. Reinforcement Learning. MIT Press, 1998.
[2] W.R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285–294, 1933.
[3] M.O.G. Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, U of Massachusetts, Amherst, 2002.
[4] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006.
[5] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In Uncertainty in Artificial Intelligence, pages 150–159, 1999.
[6] M. Strens. A Bayesian framework for reinforcement learning. In International Conference on Machine Learning, pages 943–950, 2000.
[7] T. Wang, D. Lizotte, M. Bowling, and D. Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning, pages 956–963, 2005.
[8] J. Asmuth, L. Li, M.L. Littman, A. Nouri, and D. Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence, 2009.
[9] J.Z. Kolter and A.Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning. Morgan Kaufmann, 2009.
[10] E. Todorov. Linearly-solvable Markov decision problems.
Advances in Neural Information Processing Systems, 19, 2007.
[11] H.J. Kappen. An introduction to stochastic control theory, path integrals and reinforcement learning. In 9th Granada Seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems, pages 149–181, 2007.
[12] A. Simpkins, R. De Callafon, and E. Todorov. Optimal trade-off between exploration and exploitation. In American Control Conference, 2008, pages 33–38, 2008.
[13] I. Fantoni and R. Lozano. Non-linear Control for Underactuated Mechanical Systems. Springer, 2002.
[14] A.A. Feldbaum. Dual control theory. Automation and Remote Control, 21(9):874–880, April 1961.
[15] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[16] N. Wiener. Differential space. Journal of Mathematical Physics, 2:131–174, 1923.
[17] T. Kailath. An innovations approach to least-squares estimation – part I: Linear filtering in additive white noise. IEEE Transactions on Automatic Control, 13(6):646–655, 1968.
[18] I. Murray and R.P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. arXiv:1006.0868, 2010.
[19] D.W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963.
[20] K. Furuta, M. Yamakita, and S. Kobayashi. Swing-up control of inverted pendulum using pseudo-state feedback. Journal of Systems and Control Engineering, 206(6):263–269, 1992.