{"title": "Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1515, "page_last": 1522, "abstract": "", "full_text": "Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning\n\nGregory Z. Grudic\nUniversity of Colorado, Boulder\ngrudic@cs.colorado.edu\n\nLyle H. Ungar\nUniversity of Pennsylvania\nungar@cis.upenn.edu\n\nAbstract\n\nWe address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state action value function, $Q$. Theory is presented showing that linear function approximation representations of $Q$ can degrade the rate of convergence of performance gradient estimates by a factor of $O(ML)$ relative to when no function approximation of $Q$ is used, where $M$ is the number of possible actions and $L$ is the number of basis functions in the function approximation representation. The second concerns the use of a bias term in estimating the state action value function. Theory is presented showing that a non-zero bias term can improve the rate of convergence of performance gradient estimates by $O(1 - (1/M))$, where $M$ is the number of possible actions. Experimental evidence is presented showing that these theoretical results lead to significant improvement in the convergence properties of Policy Gradient Reinforcement Learning algorithms.\n\n1 Introduction\n\nPolicy Gradient Reinforcement Learning (PGRL) algorithms have recently received attention because of their potential usefulness in addressing large continuous reinforcement learning (RL) problems. 
However, there is still no widespread agreement on how PGRL algorithms should be implemented. In PGRL, the agent\u2019s policy is characterized by a set of parameters which in turn implies a parameterization of the agent\u2019s performance metric. Thus if $\\theta \\in \\mathbb{R}^d$ represents a $d$ dimensional parameterization of the agent\u2019s policy and $\\rho$ is a performance metric the agent is meant to maximize, then the performance metric must have the form $\\rho(\\theta)$ [6]. PGRL algorithms work by first estimating the performance gradient (PG) $\\partial\\rho/\\partial\\theta$ and then using this gradient to update the agent\u2019s policy using:\n\n$$\\theta_{t+1} = \\theta_t + \\alpha \\frac{\\partial\\rho}{\\partial\\theta}$$ (1)\n\nwhere $\\alpha$ is a small positive step size. If the estimate of $\\partial\\rho/\\partial\\theta$ is accurate, then the agent can climb the performance gradient in the $\\theta$ parameter space, toward locally optimal policies. In practice, $\\partial\\rho/\\partial\\theta$ is estimated using samples of the state action value function $Q$. The PGRL formulation is attractive because 1) the parameterization $\\theta$ of the policy can directly imply a generalization over the agent\u2019s state space (e.g., $\\theta$ can represent the adjustable weights in a neural network approximation), which suggests that PGRL algorithms can work well on very high dimensional problems [3]; 2) the computational cost of estimating $\\partial\\rho/\\partial\\theta$ is linear in the number of parameters $d$, which contrasts with the computational cost for most RL algorithms which grows exponentially with the dimension of the state space; and 3) PG algorithms exist which are guaranteed to give unbiased estimates of $\\partial\\rho/\\partial\\theta$ [6, 5, 4, 2, 1].\n\nThis paper addresses two open theoretical questions in 
PGRL formulations. In PGRL formulations performance gradient estimates typically have the following form:\n\n$$\\frac{\\partial\\rho}{\\partial\\theta} = f\\left( (\\hat{Q}(s_1, a_1) - b(s_1)), \\ldots, (\\hat{Q}(s_T, a_T) - b(s_T)) \\right)$$ (2)\n\nwhere $\\hat{Q}(s_i, a_i)$ is the estimate of the value of executing action $a_i$ in state $s_i$ (i.e. the state action value function), $b(s_i)$ is the bias subtracted from $\\hat{Q}(s_i, a_i)$ in state $s_i$, $T$ is the number of steps the agent takes before estimating $\\partial\\rho/\\partial\\theta$, and the form of the function $f(\\cdot)$ depends on the PGRL algorithm being used (see Section 2, equation (3) for the form being considered here). The effectiveness of PGRL algorithms strongly depends on how $\\hat{Q}(s_i, a_i)$ is obtained and the form of $b(s_i)$. The aim of this work is to address these questions.\n\nThe first open theoretical question addressed here is concerned with the use of function approximation (FA) to represent the state action value function $Q$, which is in turn used to estimate the performance gradient. The original formulation of PGRL [6], the REINFORCE algorithm, has been largely ignored because of the slow rate of convergence of the PG estimate. The use of FA techniques to represent $Q$ based on its observations has been suggested as a way of improving convergence properties. It has been proven that specific linear FA formulations can be incorporated into PGRL algorithms, while still guaranteeing convergence to locally optimal solutions [5, 4]. However, whether linear FA representations actually improve the convergence properties of PGRL is an open question. 
We present theory showing that using linear basis function representations of $Q$, rather than direct observations of it, can slow the rate of convergence of PG estimates by a factor of $O(ML)$ (see Theorem 1 in Section 3.1). This result suggests that PGRL formulations should avoid the use of linear FA techniques to represent $Q$. In Section 4, experimental evidence is presented supporting this conjecture.\n\nThe second open theoretical question addressed here is: can a non-zero bias term $b(s)$ in (2) improve the convergence properties of PG estimates? There has been speculation that an appropriate choice of $b(s)$ can improve convergence properties [6, 5], but theoretical support has been lacking. This paper presents theory showing that if $b(s) = (1/M) \\sum_{i=1}^{M} Q(s, a_i)$, where $M$ is the number of actions, then the rate of convergence of the PG estimate is improved by $O(1 - (1/M))$ (see Theorem 2 in Section 3.2). This suggests that the convergence properties of PGRL algorithms can be improved by using a bias term that is the average of $Q$ values in each state. Section 4 gives experimental evidence supporting this conjecture.\n\n2 The RL Formulation and Assumptions\n\nThe RL problem is modeled as a Markov Decision Process (MDP). The agent\u2019s state at time $t$ is given by $s_t \\in S$. At each time step the agent chooses from a finite set of $M$ actions $a_t \\in A = \\{a_1, \\ldots, a_M\\}$ and receives a reward $r_t$. The dynamics of the environment are characterized by transition probabilities $P^{a}_{ss'} = \\Pr\\{s_{t+1} = s' \\mid s_t = s, a_t = a\\}$ and expected rewards $R^{a}_{s} = E\\{r_{t+1} \\mid s_t = s, a_t = a\\}$, $\\forall s, s' \\in S$, $a \\in A$. The policy followed by the agent is characterized by a parameter vector $\\theta \\in \\mathbb{R}^d$, and is defined by the probability distribution $\\pi(s, a; \\theta) = \\Pr\\{a_t = a \\mid s_t = s; \\theta\\}$, $\\forall s \\in S$, $a \\in A$. We assume that $\\pi(s, a; \\theta)$ is differentiable with respect to $\\theta$.\n\nWe use the Policy Gradient Theorem of Sutton et al. [5] and limit our analysis to the start state discount reward formulation. Here the reward function $\\rho(\\pi)$ and state action value function $Q^{\\pi}(s, a)$ are defined as:\n\n$$\\rho(\\pi) = E\\left\\{ \\sum_{t=1}^{\\infty} \\gamma^{t-1} r_t \\,\\Big|\\, s_0, \\pi \\right\\}, \\qquad Q^{\\pi}(s, a) = E\\left\\{ \\sum_{k=1}^{\\infty} \\gamma^{k-1} r_{t+k} \\,\\Big|\\, s_t = s, a_t = a, \\pi \\right\\}$$\n\nwhere $0 < \\gamma \\le 1$. Then the exact expression for the performance gradient is:\n\n$$\\frac{\\partial\\rho}{\\partial\\theta} = \\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\left( Q^{\\pi}(s, a_i) - b(s) \\right)$$ (3)\n\nwhere $d^{\\pi}(s) = \\sum_{t=0}^{\\infty} \\gamma^{t} \\Pr\\{s_t = s \\mid s_0, \\pi\\}$ and $b(s) \\in \\mathbb{R}$.\n\nThis policy gradient formulation requires that the state-action value function, $Q^{\\pi}$, under the current policy be estimated. This estimate, $\\hat{Q}^{\\pi}$, is derived using the observed value $Q^{\\pi}_{obs}(s, a_i)$. We assume that $Q^{\\pi}_{obs}(s, a_i)$ has the following form:\n\n$$Q^{\\pi}_{obs}(s, a_i) = Q^{\\pi}(s, a_i) + \\varepsilon(s, a_i)$$\n\nwhere $\\varepsilon(s, a_i)$ has zero mean and finite variance $\\sigma^2_{s,a_i}$. Therefore, if $\\hat{Q}^{\\pi}(s, a_i)$ is an estimate of $Q^{\\pi}(s, a_i)$ obtained by averaging $N$ observations of $Q^{\\pi}_{obs}(s, a_i)$, then the mean and variance are given by:\n\n$$E[\\hat{Q}^{\\pi}(s, a_i)] = Q^{\\pi}(s, a_i), \\qquad V[\\hat{Q}^{\\pi}(s, a_i)] = \\frac{\\sigma^2_{s,a_i}}{N}$$ (4)\n\nIn addition, we assume that the $Q^{\\pi}_{obs}(s, a_i)$ are independently distributed. This is consistent with the MDP assumption.\n\n3 Rate of Convergence Results\n\nBefore stating the convergence theorems, we define the following:\n\n$$\\sigma^2_{max} = \\max_{s \\in S,\\; i \\in \\{1, \\ldots, M\\}} \\sigma^2_{s,a_i}, \\qquad \\sigma^2_{min} = \\min_{s \\in S,\\; i \\in \\{1, \\ldots, M\\}} \\sigma^2_{s,a_i}$$ (5)\n\nwhere $\\sigma^2_{s,a_i}$ is defined in (4) and\n\n$$C_{min} = \\sigma^2_{min} \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2, \\qquad C_{max} = \\sigma^2_{max} \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2$$ (6)\n\n3.1 Rate of Convergence of PIFA Algorithms\n\nConsider the PIFA algorithm [5] which uses a basis function representation for the estimated state action value function, $\\hat{Q}^{\\pi}$, of the following form:\n\n$$\\hat{Q}^{\\pi}(s, a_i) = f^{\\pi}_{a_i}(s) = \\sum_{l=1}^{L} w_{a_i,l}\\, \\phi_{a_i,l}(s)$$ (7)\n\nwhere $w_{a_i,l}$ are weights and $\\phi_{a_i,l}(s)$ are basis functions defined over the state space. If the weights $w_{a_i,l}$ are chosen using the observed $Q^{\\pi}_{obs}(s, a_i)$, and the basis functions, $\\phi_{a_i,l}(s)$, satisfy the conditions defined in [5, 4], then the performance gradient is given by:\n\n$$\\left( \\frac{\\partial\\rho}{\\partial\\theta} \\right)_F = \\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} f^{\\pi}_{a_i}(s)$$ (8)\n\nThe following theorem establishes bounds on the rate of convergence for this representation of the performance gradient.\n\nTheorem 1: Let $\\widehat{\\partial\\rho/\\partial\\theta}_F$ be an estimate of (8) obtained using the PIFA algorithm and the basis function representation (7). Then, given the assumptions defined in Section 2 and equations (5) and (6), the rate of convergence of a PIFA algorithm is bounded below and above by:\n\n$$C_{min} \\frac{ML}{N} \\le V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_F \\right] \\le C_{max} \\frac{ML}{N}$$ (9)\n\nwhere $L$ is the number of basis functions, $M$ is the number of possible actions, and $N$ is the number of independent estimates of the performance gradient.\nProof: See Appendix.\n\n3.2 Rate of Convergence of Direct Sampling Algorithms\n\nIn the previous section, the observed $Q^{\\pi}_{obs}(s, a_i)$ are used to build a linear basis function representation of the state action value function, $Q^{\\pi}(s, a_i)$, which is in turn used to estimate the performance gradient. In this section we establish rate of convergence bounds for performance gradient estimates that directly use the observed $Q^{\\pi}_{obs}(s, a_i)$ without the intermediate step of building the FA representation. These bounds are established for the conditions $b(s) = 0$ and $b(s) = (1/M) \\sum_{i=1}^{M} Q^{\\pi}(s, a_i)$ in (3).\n\nTheorem 2: Let $\\widehat{\\partial\\rho/\\partial\\theta}_Q$ be an estimate of (3) obtained using direct samples of $Q^{\\pi}$. Then, if $b(s) = 0$, and given the assumptions defined in Section 2 and equations (5) and (6), the rate of convergence of $\\widehat{\\partial\\rho/\\partial\\theta}_Q$ is bounded by:\n\n$$\\frac{C_{min}}{N} \\le V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_Q \\right] \\le \\frac{C_{max}}{N}$$ (10)\n\nwhere $N$ is the number of independent estimates of the performance gradient. If $b(s)$ is defined as:\n\n$$b(s) = \\frac{1}{M} \\sum_{i=1}^{M} Q^{\\pi}(s, a_i)$$ (11)\n\nthen the rate of convergence of the performance gradient estimate $\\widehat{\\partial\\rho/\\partial\\theta}_b$ is bounded by:\n\n$$\\frac{C_{min}}{N} \\left( 1 - \\frac{1}{M} \\right) \\le V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_b \\right] \\le \\frac{C_{max}}{N} \\left( 1 - \\frac{1}{M} \\right)$$ (12)\n\nwhere $M$ is the number of possible actions.\nProof: See Appendix.\n\nThus comparing (12) and (10) to (9) one can see that policy gradient algorithms such as PIFA which build FA representations of $Q$ converge by a factor of $O(ML)$ slower than algorithms which directly sample $Q$. Furthermore, if the bias term is as defined in (11), the bounds on the variance are further reduced by $O(1 - (1/M))$. In the next section experimental evidence is given showing that these theoretical considerations can be used to improve the convergence properties of PGRL algorithms.\n\n4 Experiments\n\nThe Simulated Environment: The experiments simulate an agent episodically interacting in a continuous two dimensional environment. The agent starts each episode in the same state $s_0$, and executes a finite number of steps following a policy to a fixed goal state $s_G$. The stochastic policy is defined by a finite set of Gaussians, each associated with a specific action. The Gaussian associated with action $a_m$ is defined as:\n\n$$g_m(s) = \\exp\\left[ -\\sum_{d=1}^{D} \\frac{(s_d - c_{md})^2}{v_{md}} \\right]$$\n\nwhere $s = (s_1, \\ldots, s_D)$ is the agent\u2019s state, $c_{m1}, \\ldots, c_{mD}$ is the Gaussian center, and $v_{m1}, \\ldots, v_{mD}$ is the variance along each state space dimension. The probability of executing action $a_m$ in state $s$ is\n\n$$\\pi(s, a_m; \\theta) = \\frac{g_m(s)}{\\sum_{j=1}^{M} g_j(s)}$$\n\nwhere $\\theta = (c_{11}, \\ldots, c_{1D}, v_{11}, \\ldots, v_{1D}, \\ldots, c_{M1}, \\ldots, c_{MD}, v_{M1}, \\ldots, v_{MD})$ defines the policy parameters that dictate the agent\u2019s actions. Action $a_1$ directs the agent toward the goal state $s_G$, while the remaining actions $a_m$ (for $m = 2, \\ldots, M$) direct the agent towards the corresponding Gaussian center $c_{m1}, \\ldots, c_{mD}$.\n\nNoise is modeled using a uniform random distribution applied in each state space dimension and scaled by the magnitude of the noise; the agent observes the resulting noisy state and uses it to choose actions, which differs from the actual state of the agent. [equation not recoverable from the extracted text]\n\nThe agent receives a reward of +1 when it reaches the goal state, otherwise it receives a negative reward that depends on its state. [equation not recoverable from the extracted text] Thus the agent gets negative rewards the closer it gets to the origin of the state space, and a positive reward whenever it reaches the goal state.\n\nFigure 1: Simulation Results. a) Convergence of Algorithms: $\\rho(\\pi)$ versus Number of Policy Updates, with curves labeled Biased Q, No Bias, and Linear FA Q. b) $V[\\widehat{\\partial\\rho/\\partial\\theta}_F] / V[\\widehat{\\partial\\rho/\\partial\\theta}_Q]$ versus Number of Possible Actions (M). c) $V[\\widehat{\\partial\\rho/\\partial\\theta}_Q] / V[\\widehat{\\partial\\rho/\\partial\\theta}_b]$ versus Number of Possible Actions (M).\n\nImplementation of the PGRL algorithms: All the PGRL formulations studied here require observations (i.e. samples) of the state action value function. $Q^{\\pi}_{obs}(s, a_i)$ is sampled by executing action $a_i$ in state $s$ and thereafter following the policy. 
In the episodic formulation, where the agent executes a maximum of $T$ steps during each episode, at the end of each episode, $Q^{\\pi}_{obs}(s_t, a_t)$ for step $t$ can be evaluated as follows:\n\n$$Q^{\\pi}_{obs}(s_t, a_t) = \\sum_{k=1}^{T-t} \\gamma^{k-1} r_{t+k}$$\n\nThus, given that the agent executes a complete episode $((s_1, a_1), \\ldots, (s_T, a_T))$ following the policy $\\pi$, at the completion of the episode we can calculate $(Q^{\\pi}_{obs}(s_1, a_1), \\ldots, Q^{\\pi}_{obs}(s_T, a_T))$. This gives $T$ samples of state action value pairs. Equation (3) tells us that we require a total of $MT$ state action value function observations to estimate a performance gradient (assuming the agent can execute $M$ actions). Therefore, we can obtain the remaining $(M-1)T$ observations of $Q^{\\pi}_{obs}$ by sending the agent out on $(M-1)T$ episodes, each time allowing it to follow the policy $\\pi$ for all $T$ steps, with the exception that action $a_i$ is executed when $Q^{\\pi}_{obs}(s_t, a_i)$ is being observed. This sampling procedure requires a total of $(M-1)T + 1$ episodes and gives a complete set of $Q^{\\pi}_{obs}$ state action pairs for any path $((s_1, a_1), \\ldots, (s_T, a_T))$. For the direct sampling algorithms in Section 3.2, these observations are directly used to estimate the performance gradient. For the linear basis function based PGRL algorithm in Section 3.1, these observations are first used to calculate the $w_{a_i,l}$ as defined in [5, 4], and then the performance gradient is calculated using (8).\n\nExperimental Results: Figure 1b shows a plot of average $V[\\widehat{\\partial\\rho/\\partial\\theta}_F] / V[\\widehat{\\partial\\rho/\\partial\\theta}_Q]$ values over 10,000 estimates of the performance gradient. For each estimate, the goal state, start state, and Gaussian centers are all chosen using a uniform random distribution; the Gaussian variances are sampled from a uniform distribution (the ranges of both distributions are not recoverable from the extracted text). As predicted by Theorem 1 in Section 3.1 and Theorem 2 in Section 3.2, as the number of actions $M$ increases, this ratio also increases. Note that Figure 1b plots average variance ratios, not the bounds in variance given in Theorem 1 and Theorem 2 (which have not been experimentally sampled), so the $ML$ ratio predicted by the theorems is supported by the increase in the ratio as $M$ increases. Figure 1c shows a plot of average $V[\\widehat{\\partial\\rho/\\partial\\theta}_Q] / V[\\widehat{\\partial\\rho/\\partial\\theta}_b]$ values over 10,000 estimates of the performance gradient. As above, for each estimate, the goal state, start state, and Gaussian centers are all chosen using a uniform random distribution; the Gaussian variances are sampled from a uniform distribution. This also follows the predicted trends of Theorem 1 and Theorem 2. Finally, Figure 1a shows the average reward over 100 runs as the three algorithms converge on a two action problem. Each algorithm is given the same number of $Q^{\\pi}_{obs}$ samples to estimate the gradient before each update. Because $\\widehat{\\partial\\rho/\\partial\\theta}_b$ has the least variance, it allows the policy to converge to the highest reward value $\\rho(\\pi)$. Similarly, because $\\widehat{\\partial\\rho/\\partial\\theta}_F$ has the highest variance, its policy updates converge to the worst $\\rho(\\pi)$. Note that because all three algorithms will converge to the same locally optimal policy given enough samples of $Q^{\\pi}_{obs}$, Figure 1a simply demonstrates that $\\widehat{\\partial\\rho/\\partial\\theta}_F$ requires more samples than $\\widehat{\\partial\\rho/\\partial\\theta}_Q$, which in turn requires more samples than $\\widehat{\\partial\\rho/\\partial\\theta}_b$.\n\n5 Conclusion\n\nThe theoretical and experimental results presented here indicate that how PGRL algorithms are implemented can substantially affect the number of observations of the state action value function ($Q^{\\pi}_{obs}$) needed to obtain good estimates of the performance gradient. Furthermore, they suggest that an appropriately chosen bias term, specifically the average value of $Q$ over all actions, and the direct use of observed $Q$ values can improve the convergence of PGRL algorithms. In practice linear basis function representations of $Q$ can significantly degrade the convergence properties of policy gradient algorithms. This leaves open the question of whether any (i.e. 
nonlinear) function approximation representation of value functions can be used to improve convergence of such algorithms.\n\nReferences\n\n[1] Jonathan Baxter and Peter L. Bartlett, Reinforcement learning in POMDP\u2019s via direct gradient ascent, Proceedings of the Seventeenth International Conference on Machine Learning (ICML\u20192000) (Stanford University, CA), June 2000, pp. 41\u201348.\n\n[2] G. Z. Grudic and L. H. Ungar, Localizing policy gradient estimates to action transitions, Proceedings of the Seventeenth International Conference on Machine Learning, vol. 17, Morgan Kaufmann, June 29 - July 2 2000, pp. 343\u2013350.\n\n[3] G. Z. Grudic and L. H. Ungar, Localizing search in reinforcement learning, Proceedings of the Seventeenth National Conference on Artificial Intelligence, vol. 17, Menlo Park, CA: AAAI Press / Cambridge, MA: MIT Press, July 30 - August 3 2000, pp. 590\u2013595.\n\n[4] V. R. Konda and J. N. Tsitsiklis, Actor-critic algorithms, Advances in Neural Information Processing Systems (Cambridge, MA) (S. A. Solla, T. K. Leen, and K.-R. M\u00fcller, eds.), vol. 12, MIT Press, 2000.\n\n[5] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Advances in Neural Information Processing Systems (Cambridge, MA) (S. A. Solla, T. K. Leen, and K.-R. M\u00fcller, eds.), vol. 12, MIT Press, 2000.\n\n[6] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992), no. 3, 229\u2013256.\n\nAppendix: Proofs of Theorems 1 and 2\n\nProof of Theorem 1: Consider the definition of $f^{\\pi}_{a_i}$ given in (7). In [5] it is shown that there exist $w_{a_i,l}$ and $\\phi_{a_i,l}$ such that:\n\n$$\\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\left[ Q^{\\pi}(s, a_i) - f^{\\pi}_{a_i}(s) \\right] = 0$$ (13)\n\nLet $(\\partial\\rho/\\partial\\theta)_{obs}$ be the observation of (3) after a single episode. Using (13), we get the following:\n\n$$\\left( \\frac{\\partial\\rho}{\\partial\\theta} \\right)_{obs} = \\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} Q^{\\pi}_{obs}(s, a_i) = \\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} f^{\\pi}_{a_i}(s) + \\bar{\\varepsilon} = \\sum_{i=1}^{M} \\sum_{l=1}^{L} w_{a_i,l}\\, \\Psi_{a_i,l} + \\bar{\\varepsilon}$$ (14)\n\nwhere the basis functions $\\Psi_{a_i,l}$ have the form\n\n$$\\Psi_{a_i,l} = \\sum_{s} d^{\\pi}(s) \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\phi_{a_i,l}(s)$$\n\nand $\\bar{\\varepsilon} = \\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\varepsilon(s, a_i)$ has zero mean, with variance\n\n$$V[\\bar{\\varepsilon}] = \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2 \\sigma^2_{s,a_i}$$\n\nDenoting $\\widehat{\\partial\\rho/\\partial\\theta}_F$ as the least squares (LS) estimate of (3), its form is given by:\n\n$$\\widehat{\\partial\\rho/\\partial\\theta}_F = \\sum_{i=1}^{M} \\sum_{l=1}^{L} \\hat{w}_{a_i,l}\\, \\Psi_{a_i,l}$$\n\nwhere $\\hat{w}_{a_i,l}$ are LS estimates of the weights $w_{a_i,l}$ and $\\Psi_{a_i,l}$ correspond to the basis functions. Then, it can be shown that any linear system of the type given in (14) has a rate of convergence given by:\n\n$$V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_F \\right] = \\frac{ML}{N} V[\\bar{\\varepsilon}] = \\frac{ML}{N} \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2 \\sigma^2_{s,a_i}$$\n\nSubstituting (5) and (6) into the above equation completes the proof.\n\nProof of Theorem 2: We prove equation (10) first. For $N$ estimates of the performance gradient, we get $N$ independent samples of each $Q^{\\pi}_{obs}(s, a_i)$. These examples are averaged and therefore:\n\n$$\\widehat{\\partial\\rho/\\partial\\theta}_Q = \\sum_{s} d^{\\pi}(s) \\sum_{i=1}^{M} \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\hat{Q}^{\\pi}(s, a_i)$$\n\nBecause each $Q^{\\pi}_{obs}(s, a_i)$ is independently distributed, the variance of the estimate is given by\n\n$$V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_Q \\right] = \\frac{1}{N} \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2 \\sigma^2_{s,a_i}$$ (15)\n\nGiven (5) the worst rate of convergence is bounded by:\n\n$$V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_Q \\right] \\le \\frac{1}{N} \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2 \\sigma^2_{max} = \\frac{C_{max}}{N}$$\n\nA similar argument applies to the lower bound on convergence, completing the proof for (10). Following the same argument for (12), we have\n\n$$V\\left[ \\widehat{\\partial\\rho/\\partial\\theta}_b \\right] = \\frac{1}{N} \\sum_{s} \\left( d^{\\pi}(s) \\right)^2 \\sum_{i=1}^{M} \\left( \\frac{\\partial \\pi(s, a_i; \\theta)}{\\partial\\theta} \\right)^2 V\\left[ Q^{\\pi}_{obs}(s, a_i) - \\frac{1}{M} \\sum_{j=1}^{M} Q^{\\pi}_{obs}(s, a_j) \\right]$$ (16)\n\nwhere\n\n$$V\\left[ Q^{\\pi}_{obs}(s, a_i) - \\frac{1}{M} \\sum_{j=1}^{M} Q^{\\pi}_{obs}(s, a_j) \\right] = \\left( 1 - \\frac{1}{M} \\right)^2 \\sigma^2_{s,a_i} + \\frac{1}{M^2} \\sum_{j \\ne i} \\sigma^2_{s,a_j}$$\n\nGiven (5) this variance term in (16) is bounded by:\n\n$$\\left( 1 - \\frac{1}{M} \\right)^2 \\sigma^2_{s,a_i} + \\frac{1}{M^2} \\sum_{j \\ne i} \\sigma^2_{s,a_j} \\le \\left( 1 - \\frac{1}{M} \\right)^2 \\sigma^2_{max} + \\frac{M - 1}{M^2} \\sigma^2_{max} = \\left( 1 - \\frac{1}{M} \\right) \\sigma^2_{max}$$\n\nPlugging the above into (16) and inserting $C_{max}$ from (6) completes the proof for the upper bound. The proof for the lower bound in the variance follows similar reasoning.", "award": [], "sourceid": 2053, "authors": [{"given_name": "Gregory", "family_name": "Grudic", "institution": null}, {"given_name": "Lyle", "family_name": "Ungar", "institution": null}]}