{"title": "Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 1185, "page_last": 1192, "abstract": null, "full_text": "Fast Online Policy Gradient Learning\nwith SMD Gain Vector Adaptation\n\nNicol N. Schraudolph\n\nJin Yu Douglas Aberdeen\n\nStatistical Machine Learning, National ICT Australia, Canberra\n{nic.schraudolph,douglas.aberdeen}@nicta.com.au\n\nAbstract\n\nReinforcement learning by direct policy gradient estimation is attractive\nin theory but in practice leads to notoriously ill-behaved optimization\nproblems. We improve its robustness and speed of convergence with\nstochastic meta-descent, a gain vector adaptation method that employs\nfast Hessian-vector products. In our experiments the resulting algorithms\noutperform previously employed online stochastic, of\ufb02ine conjugate, and\nnatural policy gradient methods.\n\n1\n\nIntroduction\n\nPolicy gradient reinforcement learning (RL) methods train controllers by estimating the\ngradient of a long-term reward measure with respect to the parameters of the controller [1].\nThe advantage of policy gradient methods, compared to value-based RL, is that we avoid\nthe often redundant step of accurately estimating a large number of values. Policy gradient\nmethods are particularly appealing when large state spaces make representing the exact\nvalue function infeasible, or when partial observability is introduced. However, in practice\npolicy gradient methods have shown slow convergence [2], not least due to the stochastic\nnature of the gradients being estimated.\nThe stochastic meta-descent (SMD) gain adaptation algorithm [3, 4] can considerably ac-\ncelerate the convergence of stochastic gradient descent. In contrast to other gain adaptation\nmethods, SMD copes well not only with stochasticity, but also with non-i.i.d. sampling of\nobservations, which necessarily occurs in RL. 
In this paper we derive SMD in the context of policy gradient RL, and obtain over an order of magnitude improvement in convergence rate compared to previously employed policy gradient algorithms.

2 Stochastic Meta-Descent

2.1 Gradient-based gain vector adaptation
Let R be a scalar objective function we wish to maximize with respect to its adaptive parameter vector θ ∈ Rⁿ, given a sequence of observations xt ∈ X at time t = 1, 2, . . . Where R is not available or expensive to compute, we use the stochastic approximation Rt : Rⁿ×X → R of R instead, and maximize the expectation Et[Rt(θt, xt)]. Assuming that Rt is twice differentiable wrt. θ, with gradient and Hessian given by

gt = ∂/∂θ Rt(θ, xt)|θ=θt   and   Ht = ∂²/(∂θ ∂θᵀ) Rt(θ, xt)|θ=θt ,   (1)

respectively, we maximize Et[Rt(θ)] by the stochastic gradient ascent

θt+1 = θt + γt · gt ,   (2)

where · denotes element-wise (Hadamard) multiplication. The gain vector γt ∈ (R⁺)ⁿ serves as a diagonal conditioner, providing each element of θ with its own positive gradient step size. We adapt γ by a simultaneous meta-level gradient ascent in the objective Rt. A straightforward implementation of this idea is the delta-delta algorithm [5], which would update γ via

γt+1 = γt + µ ∂Rt+1(θt+1)/∂γt = γt + µ [∂Rt+1(θt+1)/∂θt+1] · [∂θt+1/∂γt] = γt + µ gt+1 · gt ,   (3)

where µ ∈ R is a scalar meta-step size. In a nutshell, gains are decreased where a negative autocorrelation of the gradient indicates oscillation about a local optimum, and increased otherwise. Unfortunately such a simplistic approach has several problems: Firstly, (3) allows gains to become negative. 
This can be avoided by updating \u03b3 multiplicatively, e.g.\nvia the exponentiated gradient algorithm [6].\nSecondly, delta-delta\u2019s cure is worse than the disease: individual gains are meant to address\nill-conditioning, but (3) actually squares the condition number. The autocorrelation of the\ngradient must therefore be normalized before it can be used. A popular (if extreme) form of\nnormalization is to consider only the sign of the autocorrelation. Such sign-based methods\n[5, 7\u20139], however, do not cope well with stochastic approximation of the gradient since the\nnon-linear sign function does not commute with the expectation operator [10]. More recent\nalgorithms [3, 4, 10] therefore use multiplicative (hence linear) normalization factors to\ncondition the meta-level update.\nFinally, (3) fails to take into account that gain changes affect not only the current, but also\nfuture parameter updates. In recognition of this shortcoming, gt in (3) is often replaced\nwith a running average of past gradients. Though such ad-hoc smoothing does improve\nperformance, it does not properly capture long-term dependences, the average still being\none of immediate, single-step effects. By contrast, Sutton [11] modeled the long-term effect\nof gains on future parameter values in a linear system by carrying the relevant partials\nforward in time, and found that the resulting gain adaptation can outperform a less than\nperfectly matched Kalman \ufb01lter. Stochastic meta-descent (SMD) extends this approach to\narbitrary twice-differentiable nonlinear systems, takes into account the full Hessian instead\nof just the diagonal, and applies a decay to the partials being carried forward.\n\n2.2 The SMD Algorithm\nSMD employs two modi\ufb01cations to address the problems described above: it adjusts gains\nin log-space, and optimizes over an exponentially decaying trace of gradients. 
Thus ln γ is updated as follows:

ln γt+1 = ln γt + µ Σ_{i=0}^t λ^i ∂R(θt+1)/∂ln γt−i = ln γt + µ [∂R(θt+1)/∂θt+1] · Σ_{i=0}^t λ^i ∂θt+1/∂ln γt−i =: ln γt + µ gt+1 · vt+1 ,   (4)

where the vector v ∈ Rⁿ characterizes the long-term dependence of the system parameters on their gain history over a time scale governed by the decay factor 0 ≤ λ ≤ 1. Element-wise exponentiation of (4) yields the desired multiplicative update

γt+1 = γt · exp(µ gt+1 · vt+1) ≈ γt · max(1/2, 1 + µ gt+1 · vt+1) .   (5)

The linearization e^u ≈ max(1/2, 1+u) eliminates an expensive exponentiation for each gain update, improves its robustness by reducing the effect of outliers (|u| ≫ 0), and ensures that γ remains positive. To compute the gradient trace v efficiently, we expand θt+1 in terms of its recursive definition (2):

vt+1 = Σ_{i=0}^t λ^i ∂θt+1/∂ln γt−i = Σ_{i=0}^t λ^i ∂θt/∂ln γt−i + Σ_{i=0}^t λ^i ∂(γt · gt)/∂ln γt−i ≈ λvt + γt · gt + γt · (∂gt/∂θt) [Σ_{i=0}^t λ^i ∂θt/∂ln γt−i] .   (6)

Noting that ∂gt/∂θt is the Hessian Ht of Rt(θt), we arrive at the simple iterative update

vt+1 = λvt + γt · (gt + λHtvt) ;   v0 = 0 .   (7)

Although the Hessian of a system with n parameters has O(n²) entries, efficient indirect methods from algorithmic differentiation are available to compute its product with an arbitrary vector in the same time as 2–3 gradient evaluations [12, 13]. 
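For concreteness, a single SMD iteration, comprising the gain update (5), the parameter step (2), and the trace update (7), might be sketched in NumPy as follows. This is a minimal sketch under our own naming assumptions: the function name, argument order, and default values are not from the paper, and the Hessian-vector product Hv is assumed to be supplied by the caller (e.g. via the algorithmic-differentiation techniques of [12, 13]).

```python
import numpy as np

def smd_step(theta, gamma, v, g, Hv, mu=1e-3, lam=0.99):
    """One SMD iteration on length-n vectors.

    g  -- stochastic gradient g_t at theta_t
    Hv -- Hessian-vector product H_t v_t, assumed supplied externally
    """
    # (5): multiplicative gain update, linearized so gamma stays positive
    gamma = gamma * np.maximum(0.5, 1.0 + mu * g * v)
    # (2): stochastic gradient ascent with per-parameter gains
    theta = theta + gamma * g
    # (7): exponentially decaying trace of the gains' long-term effect
    v = lam * v + gamma * (g + lam * Hv)
    return theta, gamma, v
```

Note that with v0 = 0 the first step leaves the gains unchanged; gain adaptation only engages once the trace v carries gradient history.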
To improve stability, SMD employs an extended Gauss-Newton approximation of Ht for which a similar (even faster) technique is available [4]. An iteration of SMD — comprising (5), (2), and (7) — thus requires less than 3 times the floating-point operations of simple gradient ascent. The extra computation is typically more than compensated for by the faster convergence of SMD. Fast convergence minimizes the number of expensive world interactions required, which in RL is typically of greater concern than computational cost.

3 Policy Gradient Reinforcement Learning
A Markov decision process (MDP) consists of a finite¹ set of states s ∈ S of the world, actions a ∈ A available to the agent in each state, and a (possibly stochastic) reward function r(s) for each state s. In a partially observable MDP (POMDP), the controller sees only an observation x ∈ X of the current state, sampled stochastically from an unknown distribution P(x|s). Each action a determines a stochastic matrix P(a) = [P(s′|s, a)] of transition probabilities from state s to state s′ given action a. The methods discussed in this paper do not assume explicit knowledge of P(a) or of the observation process. All policies are stochastic, choosing action a in state s with probability P(a|θ, s), parameterised by θ ∈ Rⁿ. The evolution of the state s is Markovian, governed by an |S| × |S| transition probability matrix P(θ) = [P(s′|θ, s)] with entries given by

P(s′|θ, s) = Σ_{a∈A} P(a|θ, s) P(s′|s, a) .   (8)

3.1 GPOMDP Monte Carlo estimates of gradient and Hessian
GPOMDP is an infinite-horizon policy gradient method [1] to compute the gradient of the long-term average reward

R(θ) := lim_{T→∞} (1/T) E_θ[Σ_{t=1}^T r(st)] ,   (9)

with respect to the policy parameters θ. 
The expectation E_θ is over the distribution of state trajectories {s0, s1, . . .} induced by P(θ).

Theorem 1 ([1]) Let I be the identity matrix, and u a column vector of ones. The gradient of the long-term average reward wrt. a policy parameter θi is

∇θi R(θ) = π(θ)ᵀ ∇θi P(θ) [I − P(θ) + uπ(θ)ᵀ]⁻¹ r ,   (10)

where π(θ) is the stationary distribution of states induced by θ.

¹For uncountably infinite state spaces, the derivation becomes more complex without substantially altering the resulting algorithms.

Note that (10) requires knowledge of the underlying transition probabilities P(θ), and the inversion of a potentially large matrix. The GPOMDP algorithm instead computes a Monte Carlo approximation of (10): the agent interacts with the environment, producing an observation, action, reward sequence {x1, a1, r1, x2, . . . , xT, aT, rT}.² Under mild technical assumptions, including ergodicity and bounding all the terms involved, Baxter and Bartlett [1] obtain

∇̂θR = (1/T) Σ_{t=0}^{T−1} ∇θ ln P(at|θ, st) Σ_{τ=t+1}^T β^{τ−t−1} r(sτ) ,   (11)

where a discount factor β ∈ [0, 1) implicitly assumes that rewards are exponentially more likely to be due to recent actions. Without it, rewards would be assigned over a potentially infinite horizon, resulting in gradient estimates with infinite variance. As β decreases, so does the variance, but the bias of the gradient estimate increases [1]. 
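For intuition, the estimator (11) can be evaluated naively in O(T²) time once a trajectory has been collected. The sketch below is illustrative only, not the paper's implementation: it uses 0-based arrays and our own function and variable names, crediting each action's log-gradient with the β-discounted rewards that follow it.

```python
import numpy as np

def gpomdp_gradient(log_grads, rewards, beta):
    """Naive evaluation of the GPOMDP estimate (11).

    log_grads[t] -- the log-action gradient at step t, an (n,)-vector
    rewards[t]   -- the reward observed at step t (0-based indexing here)
    """
    T = len(rewards)
    grad = np.zeros_like(log_grads[0], dtype=float)
    for t in range(T):
        # discounted sum of rewards following the action taken at step t
        disc = sum(beta ** (tau - t - 1) * rewards[tau]
                   for tau in range(t + 1, T))
        grad += log_grads[t] * disc
    return grad / T
```

The double sum costs O(T²) per trajectory; the paper's discounted eligibility trace computes the same credit assignment online, one step at a time.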
In practice, (11) is implemented efficiently via the discounted eligibility trace

et = βet−1 + δt , where δt := ∇θ P(at|θ, st) / P(at|θ, st) .   (12)

Now gt = rtet is the gradient of R(θ) arising from assigning the instantaneous reward to all log action gradients, where β gives exponentially more credit to recent actions. Likewise, Baxter and Bartlett [1] give the Monte Carlo estimate of the Hessian as Ht = rt(Et + etetᵀ), using an eligibility trace matrix

Et = βEt−1 + Gt − δtδtᵀ , where Gt := ∇²θ P(at|θ, st) / P(at|θ, st) .   (13)

Maintaining E would be O(n²), thus computationally expensive for large policy parameter spaces. Noting that SMD only requires the product of Ht with a vector v, we instead use

Htv = rt[dt + et(etᵀv)] , where dt = βdt−1 + Gtv − δt(δtᵀv)   (14)

is an eligibility trace vector that can be maintained in O(n). We describe the efficient computation of Gtv in (14) for a specific action selection method in Section 3.3 below.

3.2 GPOMDP-based optimization algorithms
Baxter et al. [2] proposed two optimization algorithms using GPOMDP's policy gradient estimates gt: OLPOMDP is a simple online stochastic gradient descent (2) with scalar gain γt. Alternatively, CONJPOMDP performs Polak-Ribière conjugation of search directions, using a noise-tolerant line search to find the approximately best scalar step size in a given search direction. Since conjugate gradient methods are very sensitive to noise [14], CONJPOMDP must average gt over many steps to obtain a reliable gradient measurement; this makes the algorithm inherently inefficient (cf. Section 4).

OLPOMDP, on the other hand, is robust to noise but converges only very slowly. 
We can, however, employ SMD's gain vector adaptation to greatly accelerate it while retaining the benefits of high noise tolerance and online learning. Experiments (Section 4) show that the resulting SMDPOMDP algorithm can greatly outperform OLPOMDP and CONJPOMDP.

Kakade [15] has applied natural gradient [16] to GPOMDP, premultiplying the policy gradient by the inverse of the online estimate

Ft = (1 − 1/t) Ft−1 + (1/t)(δtδtᵀ + εI)   (15)

of the Fisher information matrix for the parameter update: θt+1 = θt + γ0 · rtFt⁻¹et. This approach can yield very fast convergence on small problems, but in our experience does not scale well at all to larger, more realistic tasks; see our experiments in Section 4.

²We use rt as shorthand for r(st), making it clear that only the reward value is known, not the underlying state st.

3.3 Softmax action selection
For discrete action spaces, a vector of action probabilities zt := P(at|yt) can be generated from the output yt := f(θt, xt) of a parameterised function f : Rⁿ×X → R^|A| (such as a neural network) via the softmax function:

zt := softmax(yt) = e^{yt} / Σ_{m=1}^{|A|} [e^{yt}]m .   (16)

Given action at ∼ zt, GPOMDP's instantaneous log-action gradient wrt. y is then

g̃t := ∇y[zt]at / [zt]at = uat − zt ,   (17)

where ui is the unit vector in direction i. The action gradient wrt. θ is obtained by backpropagating g̃t through f's adjoint system [13], performing an efficient multiplication by the transposed Jacobian of f. The resulting gradient δt := Jfᵀg̃t is then accumulated in the eligibility trace (12). 
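As a concrete illustration of (16) and (17), a minimal NumPy sketch follows. The function names are ours, and the max-subtraction is a standard numerical-stability trick not mentioned in the text.

```python
import numpy as np

def softmax(y):
    # (16): action probabilities from controller output y
    e = np.exp(y - y.max())  # shift for numerical stability
    return e / e.sum()

def log_action_gradient(y, a):
    """(17): instantaneous log-action gradient wrt. y for chosen action a,
    i.e. the unit vector u_a minus the softmax probabilities z."""
    g = -softmax(y)
    g[a] += 1.0  # add the unit vector u_a
    return g
```

A useful sanity check: since the expectation of u_a over a ∼ z is z itself, the gradient (17) averages to zero over actions.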
GPOMDP's instantaneous Hessian for softmax action selection is

H̃t := ∇²y[zt]at / [zt]at = (uat − zt)(uat − zt)ᵀ + ztztᵀ − diag(zt) .   (18)

It is indefinite but reasonably well-behaved: the Gerschgorin circle theorem can be employed to show that its eigenvalues must all lie in the interval [−1/4, 2]. Furthermore, its expectation over possible actions is zero:

Ezt(H̃t) = [diag(zt) − 2ztztᵀ + ztztᵀ] + ztztᵀ − diag(zt) = 0 .   (19)

The extended Gauss-Newton matrix-vector product [4] employed by SMD is then given by

Gtvt := Jfᵀ H̃t Jf vt ,   (20)

where the multiplication by the Jacobian of f (resp. its transpose) is implemented efficiently by propagating vt through f's tangent linear (resp. adjoint) system [13].

Algorithm 1 SMDPOMDP with softmax action selection

1. Given (a) an ergodic POMDP with observations xt ∈ X, actions at ∈ A, bounded rewards rt ∈ R, and softmax action selection
   (b) a differentiable parametric map f : Rⁿ×X → R^|A| (neural network)
   (c) f's adjoint (u → Jfᵀu) and tangent linear (v → Jf v) maps
   (d) free parameters: µ ∈ R⁺; β, λ ∈ [0, 1]; γ0 ∈ (R⁺)ⁿ; θ1 ∈ Rⁿ
2. Initialize in Rⁿ: e0 = d0 = v0 = 0
3. For t = 1 to ∞:
   (a) interact with POMDP:
      i. observe feature vector xt
      ii. compute zt := softmax(f(θt, xt))
      iii. perform action at ∼ zt
      iv. observe reward rt
   (b) maintain eligibility traces:
      i. δt := Jfᵀ(uat − zt)
      ii. pt := Jf vt
      iii. qt := (uat − zt)(δtᵀvt) + zt(ztᵀpt) − zt · pt
      iv. et = βet−1 + δt
      v. dt = βdt−1 + Jfᵀqt − δt(δtᵀvt)
   (c) update SMD parameters:
      i. γt = γt−1 · max(1/2, 1 + µ rtet · vt)
      ii. θt+1 = θt + rtγt · et
      iii. vt+1 = λvt + rtγt · [(1 + λetᵀvt)et + λdt]

Fig. 1: Left: Baxter et al.'s simple 3-state POMDP. States are labelled with their observable features and instantaneous reward r; arrows indicate the 80% likely transition for the first (solid) resp. second (dashed) action. Right: our modified, more difficult 3-state POMDP.

4 Experiments

4.1 Simple Three-State POMDP
Fig. 1 (left) depicts the simple 3-state POMDP used by Baxter et al. [2, Tables 1&2]. Of the two possible transitions from each state, the preferred one occurs with 80% probability, the other with 20%. The preferred transition is determined by the action of a simple probabilistic adaptive controller that receives two state-dependent feature values as input, and is trained to maximize the expected average reward by policy gradient methods.

Using the original code of Baxter et al. [2], we replicated their experimental results for the OLPOMDP and CONJPOMDP algorithms on this simple POMDP. We can accurately reproduce all essential features of their graphed results on this problem [2, Figures 7&8]. We then implemented SMDPOMDP (Algorithm 1), and ran a comparison of algorithms, using the best free parameter settings found by Baxter et al. [2] (in particular: β = 0, γ0 = 1), and µ = λ = 1 for SMDPOMDP. We always match random seeds across algorithms.

Baxter et al. [2] collect and plot results for CONJPOMDP in terms of its T parameter, which specifies the number of Markov chain iterations per gradient evaluation. 
For a fair comparison of convergence speed we added code to record the total number of Markov chain iterations consumed by CONJPOMDP, and plot performance for all three algorithms in those terms, with error bars along both axes for CONJPOMDP.

The results are shown in Fig. 2 (left), averaged over 500 runs. While early on CONJPOMDP on average reaches a given level of performance about three times faster than OLPOMDP, it does so at the price of far higher variance. Moreover, CONJPOMDP is the only algorithm that fails to asymptotically approach optimal performance (R = 0.8; Fig. 2 left, inset). Once its step size adaptation gets going, SMDPOMDP converges asymptotically to the optimal policy about three times faster than OLPOMDP in terms of Markov chain iterations, making the two algorithms roughly equal in terms of computational expense.

Fig. 2: Left: The POMDP of Fig. 1 (left) is easy to learn. CONJPOMDP converges faster but to asymptotically inferior solutions (see inset) than the two online algorithms. Right: SMDPOMDP outperforms OLPOMDP and CONJPOMDP on the difficult POMDP of Fig. 1 (right). Natural policy gradient has rapid early convergence but diverges asymptotically.

CONJPOMDP on average performs less than two iterations of conjugate gradient in each run. 
While this is perfectly understandable — the controller only has two trainable parameters — it bears keeping in mind that the performance of CONJPOMDP here is almost entirely governed by the line search rather than the conjugation of search directions.

4.2 Modified Three-State POMDP
The three-state POMDP employed by Baxter et al. [2] has the property that greedy maximization of instantaneous reward leads to the optimal policy. Non-trivial temporal credit assignment — the hallmark of reinforcement learning — is not needed. The best results are obtained with the eligibility trace turned off (β = 0). To create a more challenging problem, we rearranged the POMDP's state transitions and reward structure so that the instantaneous reward becomes deceptive (Fig. 1, right). We also multiplied one state feature by 18 to create an ill-conditioned input to the controller, while leaving the actions and relative transition probabilities (80% resp. 20%) unchanged. In our modified POMDP, the high-reward state can only be reached through an intermediate state with negative reward.

Fig. 2 (right) shows our experimental results for this harder POMDP, averaged over 100 runs. Free parameters were tuned to θ1 ∈ [−0.1, 0.1], β = 0.6, γ0 = 0.001; T = 10⁵ for CONJPOMDP; µ = 0.002, λ = 1 for SMDPOMDP. CONJPOMDP now performs the worst, which is expected because conjugation of directions is known to collapse in the presence of noise [14]. SMDPOMDP converges about 20 times faster than OLPOMDP because its adjustable gains compensate for the ill-conditioned input. Kakade's natural gradient (using ε = 0.01) performs extremely well early on, taking 2–3 times fewer iterations than SMDPOMDP to reach optimal performance (R = 2.6). It does, however, diverge asymptotically.

4.3 Puck World
We also implemented the Puck World benchmark of Baxter et al. 
[2], with the free parameter settings θ1 ∈ [−0.1, 0.1], β = 0.95, γ0 = 2·10⁻⁶; T = 10⁶ for CONJPOMDP; µ = 100, λ = 0.999 for SMDPOMDP; ε = 0.01 for natural policy gradient. To improve its stability, we modified SMD here to track instantaneous log-action gradients δt instead of noisy rtet estimates of ∇θR. CONJPOMDP used a quadratic weight penalty of initially 0.5, with the adaptive reduction schedule described by Baxter et al. [2, page 369]; the online algorithms did not require a weight penalty.

Fig. 3 shows our results averaged over 100 runs, except for natural policy gradient where only a single typical run is shown. This is because its O(n³) time complexity per iteration³ makes natural policy gradient intolerably slow for this task, where n = 88. Moreover, its convergence is quite poor here in terms of the number of iterations required as well.

³The Sherman-Morrison formula cannot be used here because of the diagonal term in (15).

Fig. 3: The action-gradient version of SMDPOMDP yields better asymptotic results on PuckWorld than OLPOMDP; CONJPOMDP is inefficient; natural policy gradient even more so.

CONJPOMDP is again inferior to the best online algorithms by over an order of magnitude. Early on, SMDPOMDP matches OLPOMDP, but then reaches superior solutions with small variance. SMDPOMDP-trained controllers achieve a long-term average reward of −6.5, significantly above the optimum of −8 hypothesized by Baxter et al. [2, page 369] based on their experiments with CONJPOMDP.

5 Conclusion

On several non-trivial RL problems we find that our SMDPOMDP consistently outperforms OLPOMDP, which in turn outperforms CONJPOMDP. 
Natural policy gradient can converge\nrapidly, but is too unstable and computationally expensive for all but very small controllers.\n\nAcknowledgements\nWe are indebted to John Baxter for his code and helpful comments. National ICT Australia\nis funded by the Australian Government\u2019s Backing Australia\u2019s Ability initiative, in part\nthrough the Australian Research Council. This work is also supported by the IST Program\nof the European Community, under the Pascal Network of Excellence, IST-2002-506778.\n\nReferences\n[1] J. Baxter and P. L. Bartlett. In\ufb01nite-horizon policy-gradient estimation. Journal of Arti\ufb01cial\n\nIntelligence Research, 15:319\u2013350, 2001.\n\n[2] J. Baxter, P. L. Bartlett, and L. Weaver. Experiments with in\ufb01nite-horizon, policy-gradient\n\nestimation. Journal of Arti\ufb01cial Intelligence Research, 15:351\u2013381, 2001.\n\n[3] N. N. Schraudolph. Local gain adaptation in stochastic gradient descent. In Proc. Intl. Conf.\n\nArti\ufb01cial Neural Networks, pages 569\u2013574, Edinburgh, Scotland, 1999. IEE, London.\n\n[4] N. N. Schraudolph. Fast curvature matrix-vector products for second-order gradient descent.\n\nNeural Computation, 14(7):1723\u20131738, 2002.\n\n[5] R. Jacobs. Increased rates of convergence through learning rate adaptation. Neural Networks,\n\n1:295\u2013307, 1988.\n\n[6] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear\nprediction. In Proc. 27th Annual ACM Symposium on Theory of Computing, pages 209\u2013218.\nACM Press, New York, NY, 1995.\n\n[7] T. Tollenaere. SuperSAB: Fast adaptive back propagation with good scaling properties. Neural\n\nNetworks, 3:561\u2013573, 1990.\n\n[8] F. M. Silva and L. B. Almeida. Acceleration techniques for the backpropagation algorithm.\nIn L. B. Almeida and C. J. Wellekens, editors, Neural Networks: Proc. EURASIP Workshop,\nvolume 412 of Lecture Notes in Computer Science, pages 110\u2013119. Springer Verlag, 1990.\n\n[9] M. 
Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The\n\nRPROP algorithm. In Proc. Intl. Conf. Neural Networks, pages 586\u2013591. IEEE, 1993.\n\n[10] L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov. Parameter adaptation in stochastic\noptimization. In D. Saad, editor, On-Line Learning in Neural Networks, Publications of the\nNewton Institute, chapter 6, pages 111\u2013134. Cambridge University Press, 1999.\n\n[11] R. S. Sutton. Gain adaptation beats least squares? In Proceedings of the 7th Yale Workshop on\n\nAdaptive and Learning Systems, pages 161\u2013166, 1992.\n\n[12] B. A. Pearlmutter. Fast exact multiplication by the Hessian. Neural Comput., 6(1):147\u201360, 1994.\n[13] A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentia-\n\ntion. Frontiers in Applied Mathematics. SIAM, Philadelphia, 2000.\n\n[14] N. N. Schraudolph and T. Graepel. Combining conjugate direction methods with stochastic\napproximation of gradients. In C. M. Bishop and B. J. Frey, editors, Proc. 9th Intl. Workshop\nArti\ufb01cial Intelligence and Statistics, pages 7\u201313, Key West, Florida, 2003.\n\n[15] S. Kakade. A natural policy gradient. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors,\n\nAdvances in Neural Information Processing Systems 14, pages 1531\u20131538. MIT Press, 2002.\n\n[16] S. Amari. Natural gradient works ef\ufb01ciently in learning. Neural Comput., 10(2):251\u2013276, 1998.\n\n\f", "award": [], "sourceid": 2825, "authors": [{"given_name": "Jin", "family_name": "Yu", "institution": null}, {"given_name": "Douglas", "family_name": "Aberdeen", "institution": null}, {"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}]}