{"title": "Adaptive Step-Size for Policy Gradient Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 1394, "page_last": 1402, "abstract": "In the last decade, policy gradient methods have significantly grown in popularity in the reinforcement--learning field. In particular, they have been largely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partial observable problems. Policy gradient researches have been mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is determined not only by the gradient direction, since convergence properties are strongly influenced by the choice of the step size: small values imply slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. Step--size value is usually chosen by hand tuning and still little attention has been paid to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is second--order polynomial of the step size, and we show how a simplified version of such lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear--quadratic regulator problem.", "full_text": "Adaptive Step\u2013Size for Policy Gradient Methods\n\nMatteo Pirotta\n\nDept. Elect., Inf., and Bio.\n\nPolitecnico di Milano, ITALY\nmatteo.pirotta@polimi.it\n\nMarcello Restelli\n\nDept. Elect., Inf., and Bio.\n\nPolitecnico di Milano, ITALY\nmarcello.restelli@polimi.it\n\nLuca Bascetta\n\nDept. 
Elect., Inf., and Bio.\n\nPolitecnico di Milano, ITALY\n\nluca.bascetta@polimi.it\n\nAbstract\n\nIn the last decade, policy gradient methods have grown significantly in popularity in the reinforcement\u2013learning field. In particular, they have been widely employed in motor control and robotic applications, thanks to their ability to cope with continuous state and action domains and partially observable problems. Policy gradient research has mainly focused on the identification of effective gradient directions and the proposal of efficient estimation algorithms. Nonetheless, the performance of policy gradient methods is not determined by the gradient direction alone, since convergence properties are strongly influenced by the choice of the step size: small values imply a slow convergence rate, while large values may lead to oscillations or even divergence of the policy parameters. The step size is usually chosen by hand tuning, and little attention has been paid so far to its automatic selection. In this paper, we propose to determine the learning rate by maximizing a lower bound to the expected performance gain. Focusing on Gaussian policies, we derive a lower bound that is a second\u2013order polynomial of the step size, and we show how a simplified version of such a lower bound can be maximized when the gradient is estimated from trajectory samples. The properties of the proposed approach are empirically evaluated in a linear\u2013quadratic regulator problem.\n\n1 Introduction\n\nPolicy gradient methods have established themselves as the most effective reinforcement\u2013learning techniques in robotic applications. Such methods perform a policy search to maximize the expected return of a policy in a parameterized policy class. The reasons for their success are many. 
Compared to several traditional reinforcement\u2013learning approaches, policy gradients scale well to high\u2013dimensional continuous state and action problems, and no changes to the algorithms are needed to face uncertainty in the state due to limited and noisy sensors. Furthermore, the policy representation can be designed for the given task, which allows domain knowledge to be incorporated into the algorithm, both to speed up the learning process and to prevent the unexpected execution of dangerous policies that may harm the system. Finally, they are guaranteed to converge to locally optimal policies.\n\nThanks to these advantages, since the 1990s policy gradient methods have been widely used to learn complex control tasks [1]. Research in these years has focused on obtaining good model\u2013free estimators of the policy gradient using data generated during task execution. The oldest policy gradient approaches are finite\u2013difference methods [2], which estimate the gradient direction by solving a regression problem based on the performance evaluation of policies associated with different small perturbations of the current parameterization. Finite\u2013difference methods have some advantages: they are easy to implement, do not need assumptions on the differentiability of the policy w.r.t. the policy parameters, and are efficient in deterministic settings. On the other hand, when used on real systems, the choice of parameter perturbations may be difficult and critical for system safety. Furthermore, the presence of uncertainties may significantly slow down the convergence rate. Such drawbacks have been overcome by likelihood ratio methods [3, 4, 5], since they do not need to generate policy parameter variations and converge quickly even in highly stochastic systems. Several studies have addressed the problem of finding minimum\u2013variance estimators through the computation of optimal baselines [6]. 
To further improve the efficiency of policy gradient methods, natural gradient approaches (where the steepest ascent is computed w.r.t. the Fisher information metric) have been considered [7, 8]. Natural gradients still converge to locally optimal policies, are independent of the policy parameterization, need less data to attain a good gradient estimate, and are less affected by plateaus.\n\nOnce an accurate estimate of the gradient direction is obtained, policy parameters are updated by $\\theta_{t+1} = \\theta_t + \\alpha_t \\nabla_\\theta J|_{\\theta=\\theta_t}$, where $\\alpha_t \\in \\mathbb{R}^+$ is the step size in the direction of the gradient. Although, given an unbiased gradient estimate, convergence to a local optimum can be guaranteed under mild conditions on the learning\u2013rate values [9], their choice may significantly affect the convergence speed or the behavior during the transient. Updating the policy with large step sizes may lead to policy oscillations or even divergence [10], while trying to avoid such phenomena by using small learning rates increases the number of iterations to a level that is unbearable in most real\u2013world applications. In general unconstrained programming, the optimal step size for gradient ascent methods is determined through line\u2013search algorithms [11], which require trying different values for the learning rate and evaluating the function value at the corresponding updated points. Such an approach is infeasible for policy gradient methods, since it would require a large number of policy evaluations. Despite these difficulties, up to now, little attention has been paid to the study of step\u2013size computation for policy gradient algorithms. 
Nonetheless, some policy search methods based on expectation\u2013maximization have been recently proposed; such methods have properties similar to those of policy gradients, but the policy update does not require tuning the step size [12, 13].\n\nIn this paper, we propose a new approach to compute the step size in policy gradient methods that guarantees an improvement at each step, thus avoiding oscillation and divergence issues. Starting from a lower bound to the difference of performance between two policies, in Section 3 we derive a lower bound in the case where the new policy is obtained from the old one by changing its parameters along the gradient direction. Such a new bound is a (polynomial) function of the step size that, for positive values of the step size, presents a single, positive maximum (i.e., it guarantees improvement), which can be computed in closed form. In Section 4, we show how the bound simplifies to a quadratic function of the step size when Gaussian policies are considered, and Section 5 studies how the bound needs to be changed in approximated settings (e.g., the model\u2013free case) where the policy gradient needs to be estimated directly from experience.\n\n2 Preliminaries\n\nA discrete\u2013time continuous Markov decision process (MDP) is defined as a 6-tuple $\\langle \\mathcal{S}, \\mathcal{A}, \\mathcal{P}, \\mathcal{R}, \\gamma, D \\rangle$, where $\\mathcal{S}$ is the continuous state space, $\\mathcal{A}$ is the continuous action space, $\\mathcal{P}$ is a Markovian transition model where $\\mathcal{P}(s'|s, a)$ defines the transition density between state $s$ and $s'$ under action $a$, $\\mathcal{R} : \\mathcal{S} \\times \\mathcal{A} \\to [0, R]$ is the reward function, such that $\\mathcal{R}(s, a)$ is the expected immediate reward for the state\u2013action pair $(s, a)$ and $R$ is the maximum reward value, $\\gamma \\in [0, 1)$ is the discount factor for future rewards, and $D$ is the initial state distribution. 
The policy of an agent is characterized by a density distribution $\\pi(\\cdot|s)$ that specifies for each state $s$ the density distribution over the action space $\\mathcal{A}$. To measure the distance between two policies we will use this norm:\n\n$$\\|\\pi' - \\pi\\|_\\infty = \\sup_{s \\in \\mathcal{S}} \\int_{\\mathcal{A}} |\\pi'(a|s) - \\pi(a|s)| \\, da,$$\n\nthat is, the supremum over the state space of the total variation between the distributions over the action space of policies $\\pi'$ and $\\pi$.\n\nWe consider infinite\u2013horizon problems where the future rewards are exponentially discounted with $\\gamma$. For each state $s$, we define the utility of following a stationary policy $\\pi$ as:\n\n$$V^\\pi(s) = \\mathbb{E}_{a_t \\sim \\pi,\\, s_t \\sim \\mathcal{P}} \\left[ \\sum_{t=0}^{\\infty} \\gamma^t \\mathcal{R}(s_t, a_t) \\,\\Big|\\, s_0 = s \\right].$$\n\nIt is known that $V^\\pi$ solves the following recursive (Bellman) equation:\n\n$$V^\\pi(s) = \\int_{\\mathcal{A}} \\pi(a|s) \\left( \\mathcal{R}(s, a) + \\gamma \\int_{\\mathcal{S}} \\mathcal{P}(s'|s, a) V^\\pi(s') \\, ds' \\right) da.$$\n\nPolicies can be ranked by their expected discounted reward starting from the state distribution $D$:\n\n$$J_D^\\pi = \\int_{\\mathcal{S}} D(s) V^\\pi(s) \\, ds = \\int_{\\mathcal{S}} d_D^\\pi(s) \\int_{\\mathcal{A}} \\pi(a|s) \\mathcal{R}(s, a) \\, da \\, ds,$$\n\nwhere $d_D^\\pi(s) = (1 - \\gamma) \\sum_{t=0}^{\\infty} \\gamma^t \\Pr(s_t = s | \\pi, D)$ is the $\\gamma$\u2013discounted future state distribution for a starting state distribution $D$ [5]. Solving an MDP means finding a policy $\\pi^*$ that maximizes the expected long\u2013term reward: $\\pi^* \\in \\arg\\max_{\\pi \\in \\Pi} J_D^\\pi$. For any MDP there exists at least one deterministic optimal policy that simultaneously maximizes $V^\\pi(s)$, $\\forall s \\in \\mathcal{S}$. 
For control purposes, it is better to consider action values $Q^\\pi(s, a)$, i.e., the value of taking action $a$ in state $s$ and following a policy $\\pi$ thereafter:\n\n$$Q^\\pi(s, a) = \\mathcal{R}(s, a) + \\gamma \\int_{\\mathcal{S}} \\mathcal{P}(s'|s, a) \\int_{\\mathcal{A}} \\pi(a'|s') Q^\\pi(s', a') \\, da' \\, ds'.$$\n\nFurthermore, we define the advantage function:\n\n$$A^\\pi(s, a) = Q^\\pi(s, a) - V^\\pi(s),$$\n\nthat quantifies the advantage (or disadvantage) of taking action $a$ in state $s$ instead of following policy $\\pi$. In particular, for each state $s$, we define the advantage of a policy $\\pi'$ over policy $\\pi$ as $A_\\pi^{\\pi'}(s) = \\int_{\\mathcal{A}} \\pi'(a|s) A^\\pi(s, a) \\, da$ and, following [14], we define its expected value w.r.t. an initial state distribution $\\mu$ as $A_{\\pi,\\mu}^{\\pi'} = \\int_{\\mathcal{S}} d_\\mu^\\pi(s) A_\\pi^{\\pi'}(s) \\, ds$.\n\nWe consider the problem of finding a policy that maximizes the expected discounted reward over a class of parameterized policies $\\Pi_\\theta = \\{\\pi_\\theta : \\theta \\in \\mathbb{R}^m\\}$, where $\\pi_\\theta$ is a compact representation of $\\pi(a|s, \\theta)$. The exact gradient of the expected discounted reward w.r.t. the policy parameters [5] is:\n\n$$\\nabla_\\theta J_\\mu(\\theta) = \\frac{1}{1 - \\gamma} \\int_{\\mathcal{S}} d_\\mu^{\\pi_\\theta}(s) \\int_{\\mathcal{A}} \\nabla_\\theta \\pi(a|s, \\theta) Q^{\\pi_\\theta}(s, a) \\, da \\, ds.$$\n\nThe policy parameters can be updated by following the direction of the gradient of the expected discounted reward: $\\theta' = \\theta + \\alpha \\nabla_\\theta J_\\mu(\\theta)$. 
In the following, we will denote with $\\|\\nabla_\\theta J_\\mu(\\theta)\\|_1$ and $\\|\\nabla_\\theta J_\\mu(\\theta)\\|_2$ the L1\u2013 and L2\u2013norm of the policy gradient vector, respectively.\n\n3 Policy Gradient Formulation\n\nIn this section we provide a lower bound to the improvement obtained by updating the policy parameters along the gradient direction as a function of the step size. The idea is to start from the general lower bound on the performance difference between any pair of policies introduced in [15] and specialize it to the policy gradient framework.\n\nLemma 3.1 (Continuous MDP version of Corollary 3.6 in [15]). For any pair of stationary policies corresponding to parameters $\\theta$ and $\\theta'$ and for any starting state distribution $\\mu$, the difference between the performance of policy $\\pi_{\\theta'}$ and policy $\\pi_\\theta$ can be bounded as follows:\n\n$$J_\\mu(\\theta') - J_\\mu(\\theta) \\ge \\frac{1}{1 - \\gamma} \\int_{\\mathcal{S}} d_\\mu^{\\pi_\\theta}(s) A_{\\pi_\\theta}^{\\pi_{\\theta'}}(s) \\, ds - \\frac{\\gamma}{2(1 - \\gamma)^2} \\|\\pi_{\\theta'} - \\pi_\\theta\\|_\\infty^2 \\|Q^{\\pi_\\theta}\\|_\\infty, \\tag{1}$$\n\nwhere $\\|Q^{\\pi_\\theta}\\|_\\infty$ is the supremum norm of the Q\u2013function: $\\|Q^{\\pi_\\theta}\\|_\\infty = \\sup_{s \\in \\mathcal{S}, a \\in \\mathcal{A}} Q^{\\pi_\\theta}(s, a)$.\n\nAs we can notice from the above bound, to maximize the performance improvement, we need to find a new policy $\\pi_{\\theta'}$ that is associated with a large average advantage $A_{\\pi_\\theta,\\mu}^{\\pi_{\\theta'}}$ but, at the same time, is not too different from the current policy $\\pi_\\theta$. Policy gradient approaches provide search directions characterized by increasing advantage values and, through the step\u2013size value, make it possible to control the difference between the new policy and the target one. 
Exploiting a lower bound to the first\u2013order Taylor\u2019s expansion, we can bound the difference between the current policy and the new policy, whose parameters are adjusted along the gradient direction, as a function of the step size $\\alpha$.\n\nLemma 3.2. Let the update of the policy parameters be $\\theta' = \\theta + \\alpha \\nabla_\\theta J_\\mu(\\theta)$. Then\n\n$$\\pi(a|s, \\theta') - \\pi(a|s, \\theta) \\ge \\alpha \\nabla_\\theta \\pi(a|s, \\theta)^T \\nabla_\\theta J_\\mu(\\theta) + \\alpha^2 \\inf_{c \\in (0,1)} \\left( \\sum_{i,j=1}^{m} \\left. \\frac{\\partial^2 \\pi(a|s, \\theta)}{\\partial \\theta_i \\partial \\theta_j} \\right|_{\\theta + c \\Delta\\theta} \\frac{\\Delta\\theta_i \\Delta\\theta_j}{1 + I(i = j)} \\right),$$\n\nwhere $\\Delta\\theta = \\alpha \\nabla_\\theta J_\\mu(\\theta)$.\n\nBy combining the two previous lemmas, it is possible to derive the policy performance improvement obtained following the gradient direction.\n\nTheorem 3.3. 
Let the update of the parameters be $\\theta' = \\theta + \\alpha \\nabla_\\theta J_\\mu(\\theta)$. Then for any stationary policy $\\pi(a|s, \\theta)$ and any starting state distribution $\\mu$, the difference in performance between $\\pi_\\theta$ and $\\pi_{\\theta'}$ is lower bounded by:\n\n$$J_\\mu(\\theta') - J_\\mu(\\theta) \\ge \\alpha \\|\\nabla_\\theta J_\\mu(\\theta)\\|_2^2 + \\frac{\\alpha^2}{1 - \\gamma} \\int_{\\mathcal{S}} d_\\mu^{\\pi_\\theta}(s) \\int_{\\mathcal{A}} \\inf_{c \\in (0,1)} \\left( \\sum_{i,j=1}^{m} \\left. \\frac{\\partial^2 \\pi(a|s, \\theta)}{\\partial \\theta_i \\partial \\theta_j} \\right|_{\\theta + c \\Delta\\theta} \\frac{\\Delta\\theta_i \\Delta\\theta_j}{1 + I(i = j)} \\right) Q^{\\pi_\\theta}(s, a) \\, da \\, ds - \\frac{\\gamma \\|Q^{\\pi_\\theta}\\|_\\infty}{2(1 - \\gamma)^2} \\left( \\alpha \\sup_{s \\in \\mathcal{S}} \\int_{\\mathcal{A}} \\left| \\nabla_\\theta \\pi(a|s, \\theta)^T \\nabla_\\theta J_\\mu(\\theta) \\right| da + \\alpha^2 \\sup_{s \\in \\mathcal{S}} \\int_{\\mathcal{A}} \\left| \\sup_{c \\in (0,1)} \\left( \\sum_{i,j=1}^{m} \\left. \\frac{\\partial^2 \\pi(a|s, \\theta)}{\\partial \\theta_i \\partial \\theta_j} \\right|_{\\theta + c \\Delta\\theta} \\frac{\\Delta\\theta_i \\Delta\\theta_j}{1 + I(i = j)} \\right) \\right| da \\right)^2.$$\n\nThe above bound is a fourth\u2013order polynomial of the step size, whose stationary points, being the roots of a third\u2013order polynomial $ax^3 + bx^2 + cx + d$, can be expressed in closed form. It is worth noting that, for positive values of $\\alpha$, the bound presents a single stationary point that corresponds to a local maximum. 
In fact, since $a, b \\le 0$ and $d \\ge 0$, Descartes\u2019 rule of signs gives the existence and uniqueness of the real positive root.\n\nIn the following section, we will show, in the case of Gaussian policies, how the bound in Theorem 3.3 can be reduced to a second\u2013order polynomial in $\\alpha$, thus obtaining a simpler closed\u2013form solution for the optimal (w.r.t. the bound) step size.\n\n4 The Gaussian Policy Model\n\nIn this section we consider the Gaussian policy model with fixed standard deviation $\\sigma$, whose mean is a linear combination of the state feature vector $\\phi(\\cdot)$ using a parameter vector $\\theta$ of size $m$:\n\n$$\\pi(a|s, \\theta) = \\frac{1}{\\sqrt{2\\pi\\sigma^2}} \\exp\\left( -\\frac{1}{2} \\left( \\frac{a - \\theta^T \\phi(s)}{\\sigma} \\right)^2 \\right).$$\n\nIn the case of Gaussian policies, each second\u2013order derivative of policy $\\pi_\\theta$ can be easily bounded.\n\nLemma 4.1. For any Gaussian policy $\\pi(a|s, \\theta) \\sim \\mathcal{N}(\\theta^T \\phi(s), \\sigma^2)$, the second\u2013order derivative of the policy can be bounded as follows:\n\n$$\\left| \\frac{\\partial^2 \\pi(a|s, \\theta)}{\\partial \\theta_i \\partial \\theta_j} \\right| \\le \\frac{|\\phi_i(s) \\phi_j(s)|}{\\sqrt{2\\pi} \\sigma^3}, \\quad \\forall \\theta \\in \\mathbb{R}^m, \\forall a \\in \\mathcal{A}.$$\n\nThis result allows us to restate Lemma 3.2 in the case of Gaussian policies:\n\n$$\\pi(a|s, \\theta') - \\pi(a|s, \\theta) \\ge \\alpha \\nabla_\\theta \\pi(a|s, \\theta)^T \\nabla_\\theta J_\\mu(\\theta) - \\frac{\\alpha^2}{\\sqrt{2\\pi} \\sigma^3} \\left( |\\nabla_\\theta J_\\mu(\\theta)|^T |\\phi(s)| \\right)^2.$$\n\nIn the following we will assume that features $\\phi$ are uniformly bounded:\n\nAssumption 4.1. All the basis functions are uniformly bounded by $M_\\phi$: $|\\phi_i(s)| < M_\\phi$, $\\forall s \\in \\mathcal{S}$, $\\forall i = 1, \\ldots, m$.\n\n
Exploiting Pinsker\u2019s inequality [16] (which upper bounds the total variation between two distributions with their Kullback\u2013Leibler divergence), it is possible to provide the following upper bound to the supremum norm between two Gaussian policies.\n\nLemma 4.2. For any pair of stationary policies $\\pi_\\theta$ and $\\pi_{\\theta'}$, so that $\\theta' = \\theta + \\alpha \\nabla_\\theta J_\\mu(\\theta)$, the supremum norm of their difference can be upper bounded as follows:\n\n$$\\|\\pi_{\\theta'} - \\pi_\\theta\\|_\\infty \\le \\frac{\\alpha M_\\phi}{\\sigma} \\|\\nabla_\\theta J_\\mu(\\theta)\\|_1.$$\n\nBy plugging the results of Lemmas 4.1 and 4.2 into Equation (1) we can obtain a lower bound to the performance difference between a Gaussian policy $\\pi_\\theta$ and another policy along the gradient direction that is quadratic in the step size $\\alpha$.\n\nTheorem 4.3. For any starting state distribution $\\mu$, and any pair of stationary Gaussian policies $\\pi_\\theta \\sim \\mathcal{N}(\\theta^T \\phi(s), \\sigma^2)$ and $\\pi_{\\theta'} \\sim \\mathcal{N}(\\theta'^T \\phi(s), \\sigma^2)$, so that $\\theta' = \\theta + \\alpha \\nabla_\\theta J_\\mu(\\theta)$ and under Assumption 4.1, the difference between the performance of $\\pi_{\\theta'}$ and the one of $\\pi_\\theta$ can be lower bounded as follows:\n\n$$J_\\mu(\\theta') - J_\\mu(\\theta) \\ge \\alpha \\|\\nabla_\\theta J_\\mu(\\theta)\\|_2^2 - \\alpha^2 \\left( \\frac{1}{(1 - \\gamma) \\sqrt{2\\pi} \\sigma^3} \\int_{\\mathcal{S}} d_\\mu^{\\pi_\\theta}(s) \\left( |\\nabla_\\theta J_\\mu(\\theta)|^T |\\phi(s)| \\right)^2 \\int_{\\mathcal{A}} Q^{\\pi_\\theta}(s, a) \\, da \\, ds + \\frac{\\gamma M_\\phi^2}{2(1 - \\gamma)^2 \\sigma^2} \\|\\nabla_\\theta J_\\mu(\\theta)\\|_1^2 \\|Q^{\\pi_\\theta}\\|_\\infty \\right).$$\n\nSince the linear coefficient is positive and the quadratic one is negative, 
the bound in Theorem 4.3 has a single maximum attained for some positive value of $\\alpha$.\n\nCorollary 4.4. The performance lower bound provided in Theorem 4.3 is maximized by choosing the following step size:\n\n$$\\alpha^* = \\frac{(1 - \\gamma)^2 \\sqrt{2\\pi} \\sigma^3 \\|\\nabla_\\theta J_\\mu(\\theta)\\|_2^2}{\\gamma \\sqrt{2\\pi} \\sigma M_\\phi^2 \\|\\nabla_\\theta J_\\mu(\\theta)\\|_1^2 \\|Q^{\\pi_\\theta}\\|_\\infty + 2(1 - \\gamma) \\int_{\\mathcal{S}} d_\\mu^{\\pi_\\theta}(s) \\left( |\\nabla_\\theta J_\\mu(\\theta)|^T |\\phi(s)| \\right)^2 \\int_{\\mathcal{A}} Q^{\\pi_\\theta}(s, a) \\, da \\, ds},$$\n\nthat guarantees the following policy performance improvement:\n\n$$J_\\mu(\\theta') - J_\\mu(\\theta) \\ge \\frac{1}{2} \\alpha^* \\|\\nabla_\\theta J_\\mu(\\theta)\\|_2^2.$$\n\n5 Approximate Framework\n\nThe solution for the tuning of the step size presented in the previous section depends on some constants (e.g., the discount factor and the variance of the Gaussian policy) and requires the ability to compute some quantities (e.g., the policy gradient and the supremum value of the Q\u2013function). In many real\u2013world applications such quantities cannot be computed (e.g., when the state\u2013transition model is unknown or too large for exact methods) and need to be estimated from experience samples. In this section, we study how the step size can be chosen when the gradient is estimated through sample trajectories to guarantee a performance improvement with high probability.\n\nFor the sake of simplicity, we consider a simplified version of the bound in Theorem 4.3, in order to obtain a bound where the only element that needs to be estimated is the policy gradient $\\nabla_\\theta J_\\mu(\\theta)$.\n\nCorollary 5.1. 
For any starting state distribution $\\mu$, and any pair of stationary Gaussian policies $\\pi_\\theta \\sim \\mathcal{N}(\\theta^T \\phi(s), \\sigma^2)$ and $\\pi_{\\theta'} \\sim \\mathcal{N}(\\theta'^T \\phi(s), \\sigma^2)$, so that $\\theta' = \\theta + \\alpha \\nabla_\\theta J_\\mu(\\theta)$ and under Assumption 4.1, the difference between the performance of $\\pi_{\\theta'}$ and $\\pi_\\theta$ is lower bounded by:\n\n$$J_\\mu(\\theta') - J_\\mu(\\theta) \\ge \\alpha \\|\\nabla_\\theta J_\\mu(\\theta)\\|_2^2 - \\alpha^2 \\frac{R M_\\phi^2 \\|\\nabla_\\theta J_\\mu(\\theta)\\|_1^2}{(1 - \\gamma)^2 \\sigma^2} \\left( \\frac{|\\mathcal{A}|}{\\sqrt{2\\pi} \\sigma} + \\frac{\\gamma}{2(1 - \\gamma)} \\right),$$\n\nthat is maximized by the following step size value:\n\n$$\\tilde{\\alpha}^* = \\frac{(1 - \\gamma)^3 \\sqrt{2\\pi} \\sigma^3 \\|\\nabla_\\theta J_\\mu(\\theta)\\|_2^2}{\\left( \\gamma \\sqrt{2\\pi} \\sigma + 2(1 - \\gamma) |\\mathcal{A}| \\right) R M_\\phi^2 \\|\\nabla_\\theta J_\\mu(\\theta)\\|_1^2}.$$\n\nSince we are assuming that the policy gradient $\\nabla_\\theta J_\\mu(\\theta)$ is estimated through trajectory samples, the lower bound in Corollary 5.1 must take into consideration the associated approximation error. Given a set of trajectories obtained following policy $\\pi_\\theta$, we can produce an estimate $\\widehat{\\nabla}_\\theta J_\\mu(\\theta)$ of the policy gradient and we assume to be able to produce a vector $\\epsilon = [\\epsilon_1, \\ldots, \\epsilon_m]^T$, so that the i\u2013th component of the approximation error is bounded at least with probability $1 - \\delta$:\n\n$$P\\left( \\left| \\nabla_{\\theta_i} J_\\mu(\\theta) - \\widehat{\\nabla}_{\\theta_i} J_\\mu(\\theta) \\right| \\ge \\epsilon_i \\right) \\le \\delta.$$\n\n
Given the approximation error vector $\\epsilon$, we can adjust the bound in Corollary 5.1 to produce a new bound that holds at least with probability $(1 - \\delta)^m$. In particular, to preserve the inequality sign, the estimated approximation error must be used to decrease the L2\u2013norm of the policy gradient in the first term (the one that provides the positive contribution to the performance improvement) and to increase the L1\u2013norm in the penalization term. To lower bound the L2\u2013norm, we introduce the vector $\\underline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta)$ whose components are a lower bound to the absolute value of the policy gradient built on the basis of the approximation error $\\epsilon$:\n\n$$\\underline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta) = \\max\\left( |\\widehat{\\nabla}_\\theta J_\\mu(\\theta)| - \\epsilon, \\mathbf{0} \\right),$$\n\nwhere $\\mathbf{0}$ denotes the m\u2013size vector with all zeros, and max denotes the component\u2013wise maximum. Similarly, to upper bound the L1\u2013norm of the policy gradient, we introduce the vector $\\overline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta)$:\n\n$$\\overline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta) = |\\widehat{\\nabla}_\\theta J_\\mu(\\theta)| + \\epsilon.$$\n\nTheorem 5.2. 
Under the same assumptions as Corollary 5.1, and provided that a policy gradient estimate $\\widehat{\\nabla}_\\theta J_\\mu(\\theta)$ is available, so that $P\\left( \\left| \\nabla_{\\theta_i} J_\\mu(\\theta) - \\widehat{\\nabla}_{\\theta_i} J_\\mu(\\theta) \\right| \\ge \\epsilon_i \\right) \\le \\delta$, the difference between the performance of $\\pi_{\\theta'}$ and $\\pi_\\theta$ can be lower bounded at least with probability $(1 - \\delta)^m$:\n\n$$J_\\mu(\\theta') - J_\\mu(\\theta) \\ge \\alpha \\left\\| \\underline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta) \\right\\|_2^2 - \\alpha^2 \\frac{R M_\\phi^2 \\left\\| \\overline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta) \\right\\|_1^2}{(1 - \\gamma)^2 \\sigma^2} \\left( \\frac{|\\mathcal{A}|}{\\sqrt{2\\pi} \\sigma} + \\frac{\\gamma}{2(1 - \\gamma)} \\right),$$\n\nthat is maximized by the following step size value:\n\n$$\\widehat{\\alpha}^* = \\frac{(1 - \\gamma)^3 \\sqrt{2\\pi} \\sigma^3 \\left\\| \\underline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta) \\right\\|_2^2}{\\left( \\gamma \\sqrt{2\\pi} \\sigma + 2(1 - \\gamma) |\\mathcal{A}| \\right) R M_\\phi^2 \\left\\| \\overline{\\widehat{\\nabla}}_\\theta J_\\mu(\\theta) \\right\\|_1^2}.$$\n\nIn the following, we will discuss how the approximation error of the policy gradient can be bounded. Among the several methods that have been proposed over the years, we focus on two well\u2013understood policy\u2013gradient estimation approaches: REINFORCE [3] and G(PO)MDP [4]/policy gradient theorem (PGT) [5].\n\n5.1 Approximation with REINFORCE gradient estimator\n\nThe REINFORCE approach [3] is the main exponent of the likelihood\u2013ratio family. 
The episodic REINFORCE gradient estimator is given by:\n\n$$\\widehat{\\nabla}_\\theta J_\\mu^{RF}(\\theta) = \\frac{1}{N} \\sum_{n=1}^{N} \\left( \\sum_{k=1}^{H} \\nabla_\\theta \\log \\pi(a_k^n; s_k^n, \\theta) \\left( \\sum_{l=1}^{H} \\gamma^{l-1} r_l^n - b \\right) \\right),$$\n\nwhere $N$ is the number of H\u2013step trajectories generated from a system by roll\u2013outs and $b \\in \\mathbb{R}$ is a baseline that can be chosen arbitrarily, but usually with the goal of minimizing the variance of the gradient estimator. The main drawback of REINFORCE is its variance, which is strongly affected by the length of the trajectory horizon $H$.\n\nThe goal is to determine the number of trajectories $N$ needed to obtain the desired accuracy of the gradient estimate. To achieve this, we exploit the upper bound to the variance of the episodic REINFORCE gradient estimator introduced in [17] for Gaussian policies.\n\nLemma 5.3 (Adapted from Theorem 2 in [17]). Given a Gaussian policy $\\pi(a|s, \\theta) \\sim \\mathcal{N}(\\theta^T \\phi(s), \\sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4.1), we have the following upper bound to the variance of the i\u2013th component of the episodic REINFORCE gradient estimate $\\widehat{\\nabla}_{\\theta_i} J_\\mu^{RF}(\\theta)$:\n\n$$Var\\left( \\widehat{\\nabla}_{\\theta_i} J_\\mu^{RF}(\\theta) \\right) \\le \\frac{R^2 M_\\phi^2 H \\left( 1 - \\gamma^H \\right)^2}{N \\sigma^2 (1 - \\gamma)^2}.$$\n\nThe result in the previous lemma, combined with Chebyshev\u2019s inequality, allows us to provide a high\u2013probability upper bound to the gradient approximation error using the episodic REINFORCE gradient estimator.\n\nTheorem 5.4. 
Given a Gaussian policy $\\pi(a|s, \\theta) \\sim \\mathcal{N}(\\theta^T \\phi(s), \\sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4.1), using the following number of H\u2013step trajectories:\n\n$$N = \\frac{R^2 M_\\phi^2 H \\left( 1 - \\gamma^H \\right)^2}{\\delta \\epsilon_i^2 \\sigma^2 (1 - \\gamma)^2},$$\n\nthe gradient estimate $\\widehat{\\nabla}_{\\theta_i} J_\\mu^{RF}(\\theta)$ generated by REINFORCE is such that with probability $1 - \\delta$:\n\n$$\\left| \\widehat{\\nabla}_{\\theta_i} J_\\mu^{RF}(\\theta) - \\nabla_{\\theta_i} J_\\mu(\\theta) \\right| \\le \\epsilon_i.$$\n\n5.2 Approximation with G(PO)MDP/PGT gradient estimator\n\nAlthough the REINFORCE method is guaranteed to converge to the true gradient at the fastest possible pace, its large variance can be problematic in practice. Advances in the likelihood ratio gradient estimators have produced new approaches that significantly reduce the variance of the estimate. Focusing on the class of \u201cvanilla\u201d gradient estimators, two main approaches have been proposed: policy gradient theorem (PGT) [5] and G(PO)MDP [4]. In [6], the authors show that, while the algorithms look different, their gradient estimates are equal, i.e., $\\widehat{\\nabla}_\\theta J_\\mu^{PGT}(\\theta) = \\widehat{\\nabla}_\\theta J_\\mu^{G(PO)MDP}(\\theta)$. For this reason, we can limit our attention to the PGT formulation:\n\n$$\\widehat{\\nabla}_\\theta J_\\mu^{PGT}(\\theta) = \\frac{1}{N} \\sum_{n=1}^{N} \\left( \\sum_{k=1}^{H} \\nabla_\\theta \\log \\pi(a_k^n; s_k^n, \\theta) \\left( \\sum_{l=k}^{H} \\gamma^{l-1} r_l^n - b_l^n \\right) \\right),$$\n\nwhere the baselines $b_l^n \\in \\mathbb{R}$ have the objective of reducing the variance of the gradient estimate. Following the procedure used to bound the approximation error of REINFORCE, we need an upper bound to the variance of the gradient estimate of PGT, which is provided by the following lemma (whose proof is similar to the one used in [17] for the REINFORCE case).\n\nLemma 5.5. 
Given a Gaussian policy $\\pi(a|s, \\theta) \\sim \\mathcal{N}(\\theta^T \\phi(s), \\sigma^2)$, under the assumption of uniformly bounded rewards and basis functions (Assumption 4.1), we have the following upper bound to the variance of the i\u2013th component of the PGT gradient estimate $\\widehat{\\nabla}_{\\theta_i} J_\\mu^{PGT}(\\theta)$:\n\n$$Var\\left( \\widehat{\\nabla}_{\\theta_i} J_\\mu^{PGT}(\\theta) \\right) \\le \\frac{R^2 M_\\phi^2}{N (1 - \\gamma)^2 \\sigma^2} \\left[ \\frac{1 - \\gamma^{2H}}{1 - \\gamma^2} + H \\gamma^{2H} - 2 \\gamma^H \\frac{1 - \\gamma^H}{1 - \\gamma} \\right].$$\n\nAs expected, since the variance of the gradient estimate obtained with PGT is smaller than the one with REINFORCE, the upper bound of the PGT variance is also smaller than the REINFORCE one. In particular, while the variance with REINFORCE grows linearly with the time horizon, using PGT the dependence on the time horizon is significantly smaller. Finally, we can derive the upper bound for the approximation error of the gradient estimate of PGT.\n\nTheorem 5.6. 
Given a Gaussian policy \u03c0(a|s, \u03b8) \u223c N(\u03b8^T \u03c6(s), \u03c3^2), under the assumption of\nuniformly bounded rewards and basis functions (Assumption 4.1), using the following number of\nH\u2013step trajectories:\n\nN = R^2 M_\u03c6^2 / (\u03b4 \u03b5_i^2 \u03c3^2 (1 \u2212 \u03b3)^2) \u00b7 [ (1 \u2212 \u03b3^{2H}) / (1 \u2212 \u03b3^2) + H \u03b3^{2H} \u2212 2 \u03b3^H (1 \u2212 \u03b3^H) / (1 \u2212 \u03b3) ],\n\nthe gradient estimate \u2207\u0302_{\u03b8_i} J_\u00b5^{PGT}(\u03b8) generated by PGT is such that with probability 1 \u2212 \u03b4:\n\n|\u2207\u0302_{\u03b8_i} J_\u00b5^{PGT}(\u03b8) \u2212 \u2207_{\u03b8_i} J_\u00b5(\u03b8)| \u2264 \u03b5_i.\n\n                        \u03c3\nStep size      Value    0.50    0.75    1.00    1.25    1.50    1.75    2.00    5.00    7.50\n\u03b1 const        1e\u221207    itmax   itmax   itmax   itmax   itmax   itmax   itmax   21888   9740\n\u03b1 const        1e\u221206    itmax   itmax   itmax   itmax   23651   17516   13480   2163    849\n\u03b1 const        1e\u221205    17138   8669    5120    3348    2342    1714    1287    \u22a5       \u22a5\n\u03b1 const        1e\u221204    1675    697     499     \u22a5       \u22a5       \u22a5       \u22a5       \u22a5       \u22a5\n\u03b1 const        1e\u221203    \u22a5       \u22a5       \u22a5       \u22a5       \u22a5       \u22a5       \u22a5       \u22a5       \u22a5\n\u03b1_t = \u03b1_0/t    1e\u221205    itmax   itmax   itmax   itmax   itmax   itmax   itmax   \u22a5       \u22a5\n\u03b1_t = \u03b1_0/t    1e\u221204    itmax   itmax   itmax   \u22a5       \u22a5       \u22a5       \u22a5       \u22a5       \u22a5\n\u03b1\u2217                      24106   7271    3279    1838    1172    813     598     1       58\n\nTable 1: Convergence speed in exact LQG scenario with \u03b3 = 0.95. The table reports the number of\niterations required by the exact gradient approach, starting from \u03b8 = 0, to learn the optimal policy\nparameter \u03b8\u2217 = \u22120.6037 with an accuracy of 0.01, for different step\u2013size values. Three different\nsets of experiments are shown: constant step size, decreasing step size, and the step size proposed in\nCorollary 4.4. 
The table contains itmax when no convergence happens in 30,000 iterations, and \u22a5\nwhen the algorithm diverges (\u03b8 < \u22121 or \u03b8 > 0). Best performances are reported in boldface.\n\n         Number of trajectories\n         10,000             100,000            500,000\n         it       \u03b8         it       \u03b8         it       \u03b8\nRF       822      \u22120.0030   51,731   \u22120.3068   75,345   \u22120.4088\nPGT      29,761   \u22120.2176   63,985   \u22120.4013   83,983   \u22120.4558\n\nTable 2: Convergence speed in approximate LQG scenario with \u03b3 = 0.9. The table reports, starting\nfrom \u03b8 = 0 and fixed \u03c3 = 1, the number of iterations performed before the proposed step size \u03b1\u0302\nbecomes 0 and the last value of the policy parameter. Results are shown for different numbers of\ntrajectories (of 20 steps each) used in the gradient estimation by REINFORCE and PGT.\n\n6 Numerical Simulations and Discussion\n\nIn this section we show results of numerical simulations of policy gradient in the\nlinear\u2013quadratic Gaussian regulation (LQG) problem as formulated in [6]. The LQG problem is\ncharacterized by a transition model s_{t+1} \u223c N(s_t + a_t, \u03c3^2), Gaussian policy a_t \u223c N(\u03b8 \u00b7 s, \u03c3^2)\nand quadratic reward r_t = \u22120.5(s_t^2 + a_t^2). The range of state and action spaces is bounded to\nthe interval [\u22122, 2] and the initial state is drawn uniformly at random. This scenario is particularly\ninstructive since it allows all terms involved in the bounds to be computed exactly. We first present\nresults in the exact scenario and then move to the approximate one.\nTable 1 shows how the number of iterations required to learn a near\u2013optimal value of the policy\nparameter changes according to the standard deviation of the Gaussian policy and the step\u2013size\nvalue. 
As expected, very small values of the step size avoid divergence, but the learning\nprocess needs many iterations to reach good performance (this can be observed both when the step\nsize is kept constant and when it decreases). On the other hand, larger step\u2013size values may lead to\ndivergence. In this example, the higher the policy variance, the lower the step\u2013size value needed\nto avoid divergence, since, in LQG, higher policy variance implies larger policy gradient values.\nUsing the step size \u03b1\u2217 from Corollary 4.4, the policy gradient algorithm avoids divergence (since\nit guarantees an improvement at each iteration), and the speed of convergence is strongly affected\nby the variance of the Gaussian policy. In general, when the policies are nearly deterministic (small\nvariance in the Gaussian case), small changes in the parameters lead to large distances between\nthe policies, thus negatively affecting the lower bound in Equation 1. As we can notice from the\nexpression of \u03b1\u2217 in Corollary 4.4, considering policies with high variance (which might be a problem in\nreal\u2013world applications) makes it safe to take larger step sizes, thus speeding up the learning process.\nNonetheless, increasing the variance beyond some threshold (making policies nearly random) produces\nvery bad policies, so that changing the policy parameter has a small impact on the performance,\nwhich in turn slows down the learning process. How to identify an optimal variance value is\nan interesting future research direction. Table 2 provides numerical results in the approximate\nsetting, showing the effect of varying the number of trajectories used to estimate the gradient by\nREINFORCE and PGT. Increasing the number of trajectories reduces the uncertainty in the gradient\nestimates, thus allowing larger step sizes and leading to better performance. Furthermore, the\nsmaller variance of PGT w.r.t. 
REINFORCE allows the former to achieve better performance.\nHowever, even with a large number of trajectories, the approximation errors are still quite large,\npreventing the algorithm from reaching very high performance. For this reason, future studies will try to derive tighter\nbounds. Further developments include extending these results to other policy models (e.g., Gibbs\npolicies) and to other policy gradient approaches (e.g., natural gradient).\n\nReferences\n[1] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 2219\u20132225. IEEE, 2006.\n\n[2] James C Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. Automatic Control, IEEE Transactions on, 37(3):332\u2013341, 1992.\n\n[3] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229\u2013256, May 1992.\n\n[4] Jonathan Baxter and Peter L. Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319\u2013350, 2001.\n\n[5] Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12(22), 2000.\n\n[6] Jan Peters and Stefan Schaal. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682\u2013697, 2008.\n\n[7] Sham Kakade. A natural policy gradient. Advances in neural information processing systems, 14:1531\u20131538, 2001.\n\n[8] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7):1180\u20131190, 2008.\n\n[9] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400\u2013407, 1951.\n\n[10] P. Wagner. 
A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. Advances in Neural Information Processing Systems, 24, 2011.\n\n[11] Jorge J Mor\u00e9 and David J Thuente. Line search algorithms with guaranteed sufficient decrease. ACM Transactions on Mathematical Software (TOMS), 20(3):286\u2013307, 1994.\n\n[12] J. Kober and J. Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems 22 (NIPS 2008), Cambridge, MA: MIT Press, 2009.\n\n[13] Nikos Vlassis, Marc Toussaint, Georgios Kontes, and Savas Piperidis. Learning model-free robot control by a monte carlo em algorithm. Autonomous Robots, 27(2):123\u2013130, 2009.\n\n[14] S.M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.\n\n[15] Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 307\u2013315. JMLR Workshop and Conference Proceedings, May 2013.\n\n[16] S. Pinsker. Information and Information Stability of Random Variable and Processes. Holden-Day Series in Time Series Analysis. Holden-Day, Inc., 1964.\n\n[17] Tingting Zhao, Hirotaka Hachiya, Gang Niu, and Masashi Sugiyama. Analysis and improvement of policy gradient estimation. Neural Networks, 26(0):118\u2013129, 2012.\n", "award": [], "sourceid": 704, "authors": [{"given_name": "Matteo", "family_name": "Pirotta", "institution": "Politecnico di Milano"}, {"given_name": "Marcello", "family_name": "Restelli", "institution": "Politecnico di Milano"}, {"given_name": "Luca", "family_name": "Bascetta", "institution": "Politecnico di Milano"}]}