{"title": "Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result", "book": "Advances in Neural Information Processing Systems", "page_first": 1592, "page_last": 1600, "abstract": "Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD($\\lambda$)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward any optimal policy, except in a certain pathological case. Consequently, in the context of approximations, the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality.", "full_text": "Optimistic policy iteration and natural actor-critic:\n\nA unifying view and a non-optimality result\n\nPaul Wagner\n\nAalto University\n\nDepartment of Information and Computer Science\n\nFI-00076 Aalto, Finland\n\npaul.wagner@aalto.fi\n\nAbstract\n\nApproximate dynamic programming approaches to the reinforcement learning\nproblem are often categorized into greedy value function methods and value-based\npolicy gradient methods. 
As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be doomed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality.

1 Introduction

We consider the reinforcement learning problem in which one attempts to find an approximately optimal policy for controlling a stochastic nonlinear dynamical system. We focus on the setting in which the target system is actively sampled during the learning process. Here the sampling policy changes during the learning process in a manner that depends on the main policy being optimized. This learning setting is often called interactive learning [e.g., 23, §3]. Many approaches to the problem are value-based and build on the methodology of simulation-based approximate dynamic programming [23, 4, 9, 19, 8, 21].
The majority of these methods are often categorized into greedy value function methods (critic-only) and value-based policy gradient methods (actor-critic) [e.g., 23, 13].

Within this interactive setting, the policy gradient approach has better convergence guarantees, with the strongest case being for Monte Carlo evaluation with 'compatible' value function approximation. In this case, convergence with probability one (w.p.1) to a local optimum can be established for arbitrary differentiable policy classes under mild assumptions [22, 13, 19]. On the other hand, while the greedy value function approach is often considered to possess practical advantages in terms of convergence speed and representational flexibility, its behavior in the proximity of an optimum is currently not well understood. (An extended version of this paper with full proofs and additional background material is available at http://books.nips.cc/ and http://users.ics.aalto.fi/pwagner/.) It is well known that interactively operated approximate hard-greedy value function methods can fail to converge to any single policy and instead become trapped in sustained policy oscillation or policy chattering, which is currently a poorly understood phenomenon [6, 7]. This applies to both non-optimistic and optimistic policy iteration (value iteration being a special case of the latter). In general, the best guarantees for this methodology exist in the form of sub-optimality bounds [6, 7]. The practical value of these bounds, however, is under question (e.g., [2; 7, §6.2.2]), as they can permit very bad solutions. Furthermore, it has been shown that these bounds are tight [7, §6.2.3; 12, §3.2].

A hard-greedy policy is a discontinuous function of its parameters, which has been identified as a key source of problems [18, 10, 17, 22]. In addition to the observation that the class of stochastic policies may often permit much simpler solutions [cf.
20], it is known that continuously stochastic policies can also re-gain convergence: both non-optimistic and optimistic soft-greedy approximate policy iteration using, for example, the Gibbs/Boltzmann policy class, is known to converge with enough softness, 'enough' being problem-specific. This has been shown by Perkins & Precup [18] and Melo et al. [14], respectively, although with no consideration of the quality of the obtained solutions nor with an interpretation of how 'enough' relates to the problem at hand. Unfortunately, the aforementioned sub-optimality bounds are also lost in this case (consider temperature τ → ∞); while convergence is re-gained, the properties of the obtained solutions are rather unknown.

To summarize, there are considerable shortcomings in the current understanding of the learning dynamics at the very heart of the approximate dynamic programming methodology. We share the belief of Bertsekas [5, 6], expressed in the context of the policy oscillation phenomenon, that a better understanding of these issues "has the potential to alter in fundamental ways our thinking about approximate DP."

In this paper, we provide insight into the convergence behavior and optimality of the generalized optimistic form of the greedy value function methodology by reflecting it against the policy gradient approach. While these two approaches are considered in the literature mostly separately, we are motivated by the belief that it is eventually possible to fully unify them, so as to have the benefits and insights from both in a single framework with no artificial (or historical) boundaries, and that such a unification can eventually resolve the issues outlined above.
These issues revolve mainly around the greedy methodology, while at the same time, solid convergence results exist for the policy gradient methodology; connecting these methodologies more firmly might well lead to a fuller understanding of both.

After providing background in Section 2, we take the following steps in this direction. First, we show that natural actor-critic methods from the policy gradient side are, in fact, a limiting special case of optimistic policy iteration (Sec. 3). Second, we show that while having the potential to avoid policy oscillation and chattering, a substantial subset of soft-greedy value function approaches can never converge to an optimal policy, except in a certain pathological case (Sec. 4). We then conclude with a discussion in a broader context and use the results to complete a high-level convergence and optimality property map of the variants of the considered methodology (Sec. 5).

2 Background

A Markov decision process (MDP) is defined by a tuple M = (S, A, P, r), where S and A denote the state and action spaces. St ∈ S and At ∈ A denote random variables at time t. s, s′ ∈ S and a, b ∈ A denote state and action instances. P(s, a, s′) = P(St+1 = s′ | St = s, At = a) defines the transition dynamics and r(s, a) ∈ R defines the expected immediate reward function. Non-Markovian aggregate states, i.e., subsets of S, are denoted by y. A policy π(a|s, θk) ∈ Π is a stochastic mapping from states to actions, parameterized by θk ∈ Θ. Improvement is performed with respect to the performance metric J(θ) = (1/H) Σ_{t=1}^{H} E[r(St, At) | π(θ)]. ∇θJ(θk) ∈ Θ denotes a parameter gradient at θk. ∇πJ(θk) ∈ Π denotes the corresponding policy gradient in the selected policy space.
We define the policy distance ‖πu − πv‖ as some p-norm of the action probability differences, (Σ_s Σ_a |πu(a|s) − πv(a|s)|^p)^{1/p}. Action value functions Q̄(s, a, ŵk) and Q(s, a, ŵk), parameterized by ŵk, are estimators of the γ-discounted cumulative reward Σ_t γ^t E[r(St, At) | S0 = s, A0 = a, π(θk)] for some (s, a) when following some policy π(θk). The state value function V(s, ŵk) is an estimator of such cumulative reward that follows some s. We use ε to denote a small positive infinitesimal quantity.

We focus on the Gibbs (Boltzmann) policy class with a linear combination of basis functions φ:

π(a|s, θk) = exp(θk⊤φ(s, a)) / Σ_b exp(θk⊤φ(s, b)) .   (1)

We shall use the term 'semi-uniformly stochastic policy' for referring to a policy for which π(a|s) = cs ∨ π(a|s) = 0, ∀s, a, where ∀s ∃cs ∈ [0, 1]. Note that both the uniformly stochastic policy and all deterministic policies are special cases of semi-uniformly stochastic policies.

For the value function, we focus on least-squares linear-in-parameters approximation with the same basis φ as in (1). We consider both advantage values [see 22, 19]

Q̄k(s, a, ŵk) = ŵk⊤ (φ(s, a) − Σ_b π(b|s, θk) φ(s, b))   (2)

and absolute action values

Qk(s, a, ŵk) = ŵk⊤φ(s, a) .   (3)

Evaluation can be based on either Monte Carlo or temporal difference estimation.
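For concreteness, the Gibbs policy class (1) and the compatible advantage features underlying (2) can be sketched in code; the array layout (φ indexed as `phi[s, a]`) and the function names are illustrative choices, not from the paper:

```python
import numpy as np

def gibbs_policy(theta, phi, s):
    """Eq. (1): pi(a|s, theta) proportional to exp(theta^T phi(s, a))."""
    logits = np.array([theta @ phi[s, a] for a in range(phi.shape[1])])
    logits -= logits.max()              # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def advantage_features(theta, phi, s, a):
    """Feature part of eq. (2): phi(s, a) - sum_b pi(b|s) phi(s, b)."""
    pi = gibbs_policy(theta, phi, s)
    return phi[s, a] - (pi[:, None] * phi[s]).sum(axis=0)
```

With tabular φ (one indicator feature per state-action pair), θ = 0 yields the uniformly stochastic policy, and the advantage features are zero-mean under π, which is what makes ŵ in (2) a centered (advantage) estimate.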
We focus on optimistic policy iteration, which contains both non-optimistic policy iteration and value iteration as special cases, and on the policy gradient counterparts of these.

In the general form of optimistic approximate policy iteration (e.g., [7, §6.4]; see also [6, §3.3]), a value function parameter vector w is gradually interpolated toward the most recent evaluation ŵ:

wk+1 = wk + κk(ŵk − wk) ,   κk ∈ (0, 1] .   (4)

Non-optimistic policy iteration is obtained with κk = 1, ∀k and 'complete' evaluations ŵk (see below). The corresponding Gibbs soft-greedy policy is obtained by combining (1) and a temperature (softness) parameter τ with

θk+1 = wk+1/τk ,   τk ∈ (0, ∞) .   (5)

Hard-greedy iteration is obtained in the limit as τ → 0.

In optimistic policy iteration, policy improvement is based on an incomplete evaluation. We distinguish between two dimensions of completeness, which are evaluation depth and evaluation accuracy. By evaluation depth, we refer to the look-ahead depth after which truncation with the previous value function estimate occurs. For example, LSPE(0) and LSTD(0) [e.g., 15] implement shallow and deep evaluation, respectively. With shallow evaluation, the current value function parameter vector wk is required for look-ahead truncation when computing ŵk+1. Inaccurate (noisy) evaluation necessitates additional caution in the policy improvement process and is the usual motivation for using (4) with κ < 1.

It is well known that greedy policy iteration can be non-convergent under approximations [4]. The widely used projected equation approach can manifest convergence behavior that is complex and not well understood, including bounded but potentially severe sustained policy oscillations [6, 7] (see the extended version for further details).
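The interpolated update (4)–(5) can be sketched as a generic loop; `evaluate` below is a hypothetical placeholder for whatever policy evaluation procedure (Monte Carlo, an LSTD-style solver, etc.) produces ŵk for the current policy:

```python
import numpy as np

def optimistic_policy_iteration(evaluate, w0, kappa, tau, n_iters):
    """Generic Gibbs soft-greedy optimistic policy iteration, eqs. (4)-(5).

    evaluate(theta) must return the value-function parameters w_hat for the
    policy pi(theta); it stands in for a concrete (possibly incomplete)
    policy evaluation step.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        theta = w / tau                    # eq. (5): soft-greedy policy parameters
        w_hat = evaluate(theta)            # (incomplete) policy evaluation
        w = w + kappa * (w_hat - w)        # eq. (4): interpolation toward w_hat
    return w / tau                         # parameters of the final policy
```

Setting kappa = 1 recovers non-optimistic policy iteration (each step jumps fully to ŵk), while letting tau → 0 approaches the hard-greedy variant.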
Similar consequences arise in the context of partial observability for approximate or incomplete state estimation [e.g., 20, 16]. A novel explanation of the phenomenon in the non-optimistic case was recently proposed in [24, 25], where policy oscillation was re-cast as sustained overshooting over an attractive stochastic policy. Policy convergence can be established under various restrictions (see the extended version for further details). Most importantly for this paper, convergence can be established with continuously soft-greedy action selection [18, 14], in which case, however, the quality of the obtained solutions is unknown.

In policy gradient reinforcement learning [22, 13, 19, 8], improvement is obtained via stochastic gradient ascent:

θk+1 = θk + αk G(θk)⁻¹ ∂J(θk)/∂θ = θk + αk ηk ,   (6)

where αk ∈ (0, ∞), G is a Riemannian metric tensor that ideally encodes the curvature of the policy parameterization, and ηk is some estimate of the gradient. With value-based policy gradient methods, using (1) together with either (2) or (3) fulfills the 'compatibility condition' [22, 13]. With (2), the value function parameter vector ŵk becomes the natural gradient estimate for the evaluated policy π(θk), leading to natural actor-critic algorithms [11, 19], for which

ηk = ŵk .   (7)

For policy gradient learning with a 'compatible' value function and Monte Carlo evaluation, convergence w.p.1 to a local optimum is established under standard assumptions [22, 13]. Temporal difference evaluation can lead to sub-optimal results with a known sub-optimality bound [13, 8].

3 Forgetful natural actor-critic

In this section, we show that an important subset of natural actor-critic algorithms is a limiting special case of optimistic policy iteration.
A related connection was recently shown in [24, 25], where a modified form of the natural actor-critic algorithm by Peters & Schaal [19] was shown to correspond to non-optimistic policy iteration. In the following, we generalize and simplify this result: by starting from the more general setting of optimistic policy iteration, we arrive at a unifying view that both encompasses a broader range of greedy methods and permits interpolation between the approaches directly with existing (unmodified) methodology.

We consider the Gibbs policy class from (1) and the linear-in-parameters advantage function from (2), which form a 'compatible' actor-critic setup. We assume deep policy evaluation (cf. Section 2). We begin with the natural actor-critic (NAC) algorithm by Peters & Schaal [19] (cf. (6) and (7)) and generalize it by adding a forgetting term:

θk+1 = θk + αk ηk − κk θk ,   (8)

where αk ∈ (0, ∞), κk ∈ (0, 1]. We refer to this generalized algorithm as the forgetful natural actor-critic algorithm, or NAC(κ). In the following, we show that this algorithm is, within the discussed context, equivalent to the general form of optimistic policy iteration in (4) and (5), with the following translation of the parameterization:

τk = κk/αk ,   or   αk = κk/τk .   (9)

Taking the forgetting factor κ in (8) toward zero leads back toward the original natural actor-critic algorithm, with the implication that the original algorithm is a limiting special case of optimistic policy iteration.

Theorem 1. For the case of deep policy evaluation (Section 2), the natural actor-critic algorithm for the Gibbs policy class ((6), (7), (1), (2)) is a limiting special case of Gibbs soft-greedy optimistic policy iteration ((4), (5), (1), (2)).

Proof. The update rule for Gibbs soft-greedy optimistic policy iteration is given in (4) and (5).
By moving the temperature to scale ŵ (assume w0 to be scaled accordingly), we obtain

w′k+1 = w′k + κk(ŵk/τk − w′k)
θk+1 = w′k+1 ,   (10)

again with κk ∈ (0, 1], τk ∈ (0, ∞). Such a re-formulation effectively re-scales w and is possible only with deep policy evaluation (cf. Section 2), with which the non-scaled w is not needed by the policy evaluation process. We can now remove the redundant second line and rename w′ to θ:

θk+1 = θk + κk(ŵk/τk − θk) .   (11)

Finally, we open up the last term and encapsulate κ/τ into α:

θk+1 = θk + κk(ŵk/τk) − κk θk   (12)
     = θk + αk ŵk − κk θk ,   (13)

with αk = κk/τk. Based on (7), we observe that (13) is equivalent to (8). The original natural actor-critic algorithm is obtained in the limit as κk → 0, which causes the forgetting term κk θk to vanish (the effective step size α can still be controlled with τ).

This result has some interesting implications. First, it becomes apparent that the implicit effective step size in optimistic policy iteration is, in fact, α = κ/τ, i.e., it is inversely related to the temperature τ. If the interpolation factor κ is held fixed, a low temperature, which can lead to policy oscillation, equals a long effective step size. This agrees with the interpretation of policy oscillation as overshooting in [24, 25]. Likewise, a high temperature equals a short effective step size. In [18], convergence is established for a high enough constant temperature.
This result now becomes translated to showing that convergence is established with a short enough constant effective step size,¹ which creates an interesting and more direct connection to convergence results for (batch) steepest descent methods with a constant step size [e.g., 1, 3]. In addition, this connection might permit the application of the results in the aforementioned literature to establish, in the considered context, a constant step size convergence result for the natural actor-critic methodology.

Second, we see that the interpolation scheme in optimistic policy iteration, while originally introduced for the sake of countering an inaccurate value function estimate, actually goes in the direction of the policy gradient methodology. Smooth interpolation between policy gradient and greedy value function learning turns out to be possible by simply adjusting the interpolation factor κ while treating the temperature τ as an inverse of the step size (we return to provide an interpretation of the role of κ at a later point). Contrary to the related result in [24], no modifications to existing algorithms are needed. This connection also allows the convergence results from the policy gradient literature to be brought in (see Section 2): convergence w.p.1, under standard assumptions from the referred literature, to an optimal solution is established in the limit for this class of approximate optimistic policy iteration as the interpolation factor κ is taken toward zero and the step size requirements are inversely enforced on the temperature τ.

Third, we observe that in non-optimistic policy iteration (κ = 1), the forgetting term resets the parameter vector to the origin at the beginning of every iteration, with the implication that solutions that are not within the range of a single step from the origin in the direction of the natural gradient cannot be reached in any number of iterations.
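The algebraic core of Theorem 1, namely that one step of the forgetful update (8) with αk = κk/τk coincides exactly with one step of the rewritten optimistic update (11), is easy to check numerically; the function names below are illustrative:

```python
import numpy as np

def nac_kappa_step(theta, w_hat, kappa, tau):
    """Forgetful natural actor-critic step, eq. (8), with eta = w_hat (7)."""
    alpha = kappa / tau                    # translation of parameters, eq. (9)
    return theta + alpha * w_hat - kappa * theta

def optimistic_step(theta, w_hat, kappa, tau):
    """Optimistic soft-greedy update in policy-parameter form, eq. (11)."""
    return theta + kappa * (w_hat / tau - theta)

rng = np.random.default_rng(0)
theta, w_hat = rng.normal(size=3), rng.normal(size=3)
assert np.allclose(nac_kappa_step(theta, w_hat, 0.3, 0.7),
                   optimistic_step(theta, w_hat, 0.3, 0.7))
```

The identity θ + (κ/τ)ŵ − κθ = θ + κ(ŵ/τ − θ) holds for any θ, ŵ, κ, and τ, so the two formulations differ only in how the hyperparameters are exposed.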
The choice of the effective step size, which is inversely controlled by the temperature, becomes again decisive: a step size that is too short (the temperature is too high) will cause the algorithm to permanently undershoot the desired optimum, thus trapping it in sustained sub-optimality, while a step size that is too long (the temperature is too low) will cause it to overshoot, which can additionally trap it in sustained oscillation. Unfortunately, even hitting the target exactly with a perfect step size will fail to lead to convergence and optimality at the same time. Our next section examines these issues more closely.

4 Systematic non-optimality of soft-greedy methods

For greedy value function methods, using the hard-greedy policy class trivially prevents convergence to anything other than deterministic policies. Furthermore, the proximity of an attractive stochastic policy can prevent convergence altogether and trap the process in oscillation (cf. Section 2). The Gibbs soft-greedy policy class, on the other hand, can represent stochastic policies, fixed points do exist [10, 17], and convergence toward some policy is guaranteed with sufficient softness [18, 14]. While convergence toward deterministic optimal decisions is trivially lost as soon as any softness is introduced (τ ↛ 0, and assuming a bounded value function), one might hope that convergence toward stochastic optimal decisions could still occur in some cases.
Unfortunately, as we show in the following, this is not the case: in the presence of any softness, this approach can never converge toward any optimal policy (i.e., convergence and optimality become mutually exclusive), except in a certain pathological case.

At this point, we wish to make clear that we are not arguing against the practical value of the greedy value function methodology in (interactively) approximated problems; the methodology has some clear merits, and the sub-optimality and oscillations could well be negligible in a given task. Instead, we take the following result, together with the existing literature on policy oscillations, as an indication of a fundamental theoretical incompatibility of this methodology with this context: the way in which this methodology deals with stochastic optima seems to be fundamentally flawed, and we believe that a thorough understanding of this flaw will have, in addition to facilitating sound theoretical advances, also immediate practical value by permitting correctly informed trade-off decisions.

Theorem 2. Assume an unbiased value function estimator (e.g., Monte Carlo evaluation). Now, for Gibbs soft-greedy policy iteration ((1), (4) and (5)) using a linear-in-parameters value function approximator ((2) or (3)), including optimistic and non-optimistic variants (any κ in (4)), there cannot exist a fixed point at an optimum, except for the uniformly stochastic policy.

¹Note that the diminishing step size αt in [18, Fig. 1] concerns policy evaluation, not policy improvement.

Proof outline.
A fixed point of the update rule (4) must satisfy

ŵk = wk ,   (14)

i.e., at a fixed point, the policy evaluation step ŵk := eval(π(wk/τk)) for the current parameter vector must yield the same parameter vector as its result:

eval(π(wk/τk)) = wk .   (15)

By applying (14) and (7), we have

wk = ŵk = ηk = G(θk)⁻¹∇θJ(θk) ,   (16)

which shows that the fixed-point policy π(wk/τk) in (15) is defined solely by its own (scaled) performance gradient.

For an optimal policy and an unbiased estimator, this parameter gradient must, by definition, map to the zero policy gradient, i.e., to ∇πJ(θk) = 0. Consequently, an optimal policy at a fixed point is defined solely by the zero policy gradient, making the policy equal to π(0), which is the uniformly stochastic policy. For the full proof, see the extended version.

Theorem 3. Consider the family of methods from Theorem 2. Assume a smooth policy gradient field (‖∇πJ(πu) − ∇πJ(πv)‖ → 0 as ‖πu − πv‖ → 0) and τ ↛ 0. First, the policy distance between a fixed point policy πf and an optimal policy π⋆ cannot be vanishingly small (‖πf − π⋆‖ ≮ ε), except if the optimal policy π⋆ is a semi-uniformly stochastic policy. Second, for bounded returns (γ ↛ 1 and r(s, a) ↛ ±∞, ∀s, a), the policy distance between a fixed point policy πf and an optimal policy π⋆ cannot be vanishingly small (‖πf − π⋆‖ ≮ ε), except if the optimal policy π⋆ is the uniformly stochastic policy.

Proof outline.
For a policy π̄ = π(wk/τk) that is vanishingly close to an optimum, an unbiased parameter gradient ηk must, assuming a smooth gradient field, map to a policy gradient that is vanishingly close to zero, i.e., ηk must have a vanishingly small effect on π̄ with any finite step size:

‖π(wk/τk + α ηk) − π(wk/τk)‖ < ε ,   ∀α > 0, α ↛ ∞ .   (17)

If π̄ is also a fixed point, then, by (16), we can substitute both wk and ηk in (17) with ŵk:

‖π(ŵk/τk + α ŵk) − π(ŵk/τk)‖ < ε ,   ∀α > 0, α ↛ ∞
⇔ ‖π((1/τk + α) ŵk) − π((1/τk) ŵk)‖ < ε ,   ∀α > 0, α ↛ ∞ .   (18)

We now see that π̄ is defined solely by a temperature-scaled version of a vanishingly small policy gradient, and that the condition in (17) is equivalent to stating that any finite decrease of the temperature must not have a non-vanishing effect on π̄. As only semi-uniformly stochastic policies are invariant to such temperature decreases, it follows that π̄ must be vanishingly close to such a policy. Furthermore, if assuming bounded returns, then no dimension of the term ŵ⊤φ(s, a) can approach positive or negative infinity when ŵ is estimated using (2) or (3). Consequently, for τ ↛ 0, the uniformly stochastic policy π(0) becomes the only semi-uniformly stochastic policy that the Gibbs policy class in (1) can approach, with the implication that π̄ must be vanishingly close to the uniformly stochastic policy.
For the full proof, see the extended version.

To interpret the preceding theorems, we observe that the gist of them is that, assuming a well-behaved gradient field, the closer the evaluated policy is to an optimum, the closer the target point of the next greedy update will be to the origin (in policy parameter space). At a fixed point, the policy parameter vector must equal the target point of the next update, causing convergence to or toward a policy that is exactly optimal but not at the origin to be a contradiction (Theorem 2). Convergence to or toward a policy that is vanishingly close to an optimum is also impossible, except if the optimum is (semi-)uniformly stochastic (Theorem 3).

In practical terms, Theorem 2 states that even if the task at hand and the chosen hyperparameters would allow convergence to some policy in a finite number of iterations, the resulting policy can never contain optimal decisions, except for uniformly stochastic ones. Theorem 3 generalizes this result to the case of asymptotic convergence toward some limiting policy: for unbounded returns and any τ ↛ 0, it is impossible to have asymptotic convergence toward any optimal decision in any state, except for semi-uniformly stochastic decisions, and for bounded returns and any τ ↛ 0, it is impossible to have asymptotic convergence toward any non-uniform optimal decision in any state.

If convergence is to occur, then the limiting policy must reside "between" the origin and an optimum, i.e., the result must always undershoot the optimum that the learning process was influenced by. However, we can see in (15) that by decreasing the temperature τ, it is possible to shift this point of convergence further away from the origin and closer to the optimum: in the limit of τ → 0, (15) can permit the parameter vector ŵ to converge toward a point that approaches the origin while, at
the same time, allowing the corresponding policy π(ŵ/τ) to converge toward a policy that is arbitrarily close to a distant optimum (one can also see that with τ → 0, the inequality in (18) becomes satisfied for any ŵk, due to α ↛ ∞). Unfortunately, as we already know, such manipulation of the distance of the fixed point from an optimum by adjusting τ can ruin convergence altogether in non-Markovian problems. Perkins & Precup [18] report negative convergence results for non-optimistic iteration (κ = 1) with a too low τ, while for optimistic iteration (κ < 1), Melo et al. [14] report a lack of positive results. Interestingly, this latter case is exactly what Theorem 1 addressed, showing that there actually is a way out and that it is by moving toward natural policy gradient iteration: decreasing the temperature τ toward zero causes the sub-optimality to vanish, while decreasing the interpolation factor κ at the same rate prevents the effective step size from exploding.

Finally, we provide a brief discussion of some questions that may have occurred to the reader by now. First, how does the preceding fit with the well-known soundness of greedy value function methods in the Markovian case? The crucial difference between the Markovian case (fully observable and tabular) and the non-Markovian case (partially observable or non-tabular) follows from the standard result for MDPs that states that in the former, all optima must be deterministic (with the possibility of redundant stochastic optima) [e.g., 23, §A.2]. For the Gibbs policy class, deterministic policies reside at infinity in some direction in the parameter space, with two implications for the Markovian case. First, the distance to an optimum never decreases. Consequently, the value function, being a correction toward an optimum, never vanishes toward a 'neutral' state.
Second, only the direction of an optimum is relevant, as the distance can always be assumed to be infinite. This implies that in, and only in, Markovian problems, the value function never ceases to retain all necessary information about the current solution, while in non-Markovian problems, relying solely on the value function can lead to losing track of the current solution.

Second, when moving toward an optimum at infinity, how can the value function / natural gradient (encoded by ŵ = η) stay non-zero and continue to properly represent action values while the corresponding policy gradient ∇πJ(θ) must approach zero at the same time? We note that the equivalence in (7) is between a value function and a natural gradient η. We then recall that the curvature of the Gibbs policy class turns into a plateau at infinity, onto which the policy becomes pushed when moving toward a deterministic optimum. The increasing discrepancy between η = G(θ)⁻¹∇θJ(θ) ↛ 0 and ∇πJ(θ) → 0 can be consumed by G(θ)⁻¹ as it captures the curvature of this plateau.

5 Common ground

Figure 1 shows a map of relevant variants of optimistic policy iteration, parameterized as in (4). As is well known, the hard-greedy variants of this methodology (seen on the left edge of the map) can become trapped in non-converging cycles over potentially non-optimal policies (see Section 2 for references and exceptions). For a continuously soft-greedy policy class (toward the right on the map), convergence can be established with enough softness [18, 14]. The natural actor-critic algorithm, which is convergent and optimal, is placed in the lower left corner by Theorem 1, while the inevitable non-optimality of soft-greedy variants toward the right follows from Theorems 2 and 3.
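The undershooting fixed point described above can be caricatured in one dimension (this toy model is ours, not the paper's construction): let the 'unbiased evaluation' return the gradient of a concave surrogate J(θ) = −(θ − θ⋆)²/2 at θ = w/τ. The fixed point of (4) then sits at θf = θ⋆/(1 + τ), strictly between the origin and the optimum for every τ > 0, and approaches θ⋆ only as τ → 0, with κ shrunk alongside τ for stability, echoing the κ → 0 limit of Theorem 1:

```python
def soft_greedy_fixed_point(theta_star, tau, n_iters=2000):
    """Iterate the optimistic update (4) on a 1-D surrogate J(th) = -(th - th*)^2 / 2,
    whose unbiased 'evaluation' returns the gradient w_hat = J'(w / tau)."""
    kappa = 0.5 * tau                  # kappa must shrink with tau to keep the
    w = 0.0                            # effective step size alpha = kappa/tau bounded
    for _ in range(n_iters):
        w_hat = theta_star - w / tau   # gradient of the surrogate at theta = w / tau
        w = w + kappa * (w_hat - w)    # eq. (4)
    return w / tau                     # limiting policy parameter theta_f

theta_star = 1.0
for tau in (1.0, 0.1, 0.01):
    theta_f = soft_greedy_fixed_point(theta_star, tau)
    # the fixed point undershoots: theta_f = theta_star / (1 + tau) < theta_star
    assert abs(theta_f - theta_star / (1.0 + tau)) < 1e-6
```

In this caricature, solving ŵ = w gives w = θ⋆τ/(1 + τ), hence θf = θ⋆/(1 + τ): convergence is obtained for every τ > 0, but always short of the optimum, mirroring the convergence-versus-optimality trade-off of Theorems 2 and 3.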
The exact (problem-dependent) place and shape of the line separating non-convergent and convergent soft-greedy variants (dashed line on the map) remains an open problem.

Figure 1: The hyperparameter space of the general form of (approximate) optimistic policy iteration in (4), with known convergence and optimality properties (see text for assumptions). The map is spanned by the interpolation factor κ (from 1 down to 0) and the temperature τ (from 0 to ∞):
- Non-optimistic hard-greedy (κ = 1, τ → 0): oscillation and non-optimality (Bertsekas, ...).
- Optimistic hard-greedy (κ < 1, τ → 0): chattering and non-optimality (Bertsekas, ...).
- Non-optimistic soft-greedy, small τ: non-convergence (Perkins & Precup) and non-optimality (Theorems 2–3).
- Non-optimistic soft-greedy, large τ: convergence (Perkins & Precup) but non-optimality (Theorems 2–3).
- Optimistic soft-greedy, large τ: convergence (Melo et al.) but non-optimality (Theorems 2–3).
- Natural actor-critic (κ → 0, τ → 0): convergence and optimality (Theorem 1).

Figure 2: Empirical illustration of the behavior of optimistic policy iteration ((1), (2), (4) and (5), with tabular φ) in the proximity of a stochastic optimum. (a) A non-Markovian problem (adapted from [24]), with states s1, s2 (observation y1), actions a_l, a_r, and terminal rewards 1, 1/4, and 0; the incoming arrow indicates the start state, and arrows leading out indicate termination with the shown reward. (b) Non-optimality or oscillation with κ ↛ 0 (κ = 0.2 with τ ∈ {1, 0.2, 0.05}); the variants are marked in Fig. 1 (schematic). (c) Interpolation toward NAC (α = 1) with κ → 0 and τ → 0 (κ = τ ∈ {0.2, 0.05, 0.01}); the variants are marked in Fig. 1 (schematic). In (b) and (c), the vertical axis is θ(left) − θ(right) over 20 iterations; the optimum at θ(left) − θ(right) = log(2) is denoted by a solid green line, and the uniformly stochastic policy by a dashed red line.

The main value of Theorem 1 is in bringing the greedy value function and policy gradient methodologies closer to each other. In our context, the unifying NAC(κ) formulation in (8) permits interpolation between the methodologies using the κ parameter. As discussed at the end of Section 4, the policy-forgetting term requires a Markovian problem to be justified: a greedy update implicitly stands on a Markov assumption, and the κ parameter in (8) can be interpreted as adjusting the strength of this assumption. In this respect, the policy improvement parameter κ in NAC(κ) can be seen (inversely) as a dual in spirit to the policy evaluation parameter λ in TD(λ)-style algorithms.
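The natural-gradient limit itself can be checked numerically. Writing the κ-interpolated soft-greedy update in the schematic form θ_{k+1} = (1 − κ)θ_k + κŵ_k/τ (our shorthand, not the paper's exact forms (4) and (8); the values of ŵ and α below are illustrative), setting κ = ατ gives θ_{k+1} = θ_k + α(ŵ_k − τθ_k), so letting τ → 0 at a fixed effective step size α = κ/τ recovers the natural-gradient step θ_k + αŵ_k:

```python
# Schematic kappa-interpolated update (our shorthand for the forms discussed
# in the text; w_hat and alpha are illustrative numbers, not from the paper).
def interpolated_update(theta, w_hat, kappa, tau):
    # Mix the current parameters with the soft-greedy target w_hat / tau.
    return (1.0 - kappa) * theta + kappa * (w_hat / tau)

theta, w_hat, alpha = 0.3, 0.5, 1.0
for tau in [1.0, 0.1, 0.01, 0.001]:
    kappa = alpha * tau  # decrease kappa at the same rate as tau
    print(tau, interpolated_update(theta, w_hat, kappa, tau))
# As tau -> 0, the update approaches the NAC step theta + alpha * w_hat = 0.8,
# while the effective step size kappa / tau stays fixed at alpha.
```

Note that with κ = 1 (the first iterate above), a single update jumps all the way to the soft-greedy target ŵ/τ, which is the greedy behavior discussed next.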
On the policy evaluation side, setting λ = 0 obtains variance reduction by assuming and exploiting the Markovianity of the problem, while λ = 1 obtains unbiased estimates also in non-Markovian problems. On the policy improvement side, with κ = 1, we have strictly greedy updates that gain in speed, as the policy can respond instantly to new opportunities appearing in the value function (for empirical observations of such a speed gain, see [11, 25]), and in representational flexibility, due to the lack of continuity constraints between successive policies (for a canonical example, consider fitted Q-iteration). This comes at the price of either oscillation or non-optimality if the Markov assumption fails to hold, as illustrated in Figure 2b for the problem in Figure 2a. With κ → 0, we approach natural gradient updates that remain sound also in non-Markovian settings, as illustrated in Figure 2c. The possibility of interpolating between the approaches might turn out useful in problems with partial Markovianity: a large κ in the NAC(κ) formulation can be used to quickly find the rough direction of the strongest attractors, after which gradually decreasing κ allows a convergent final ascent toward an optimum.

Acknowledgments

This work has been financially supported by the Academy of Finland through project no. 254104, and by the Foundation of Nokia Corporation.

References

[1] Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1), 1–3.
[2] Baxter, J., & Bartlett, P. L. (2000). Reinforcement learning in POMDP's via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, (pp. 41–48).
[3] Bertsekas, D. P. (1997). A new class of incremental gradient methods for least squares problems.
SIAM Journal on Optimization, 7(4), 913–926.
[4] Bertsekas, D. P. (2005). Dynamic Programming and Optimal Control. Athena Scientific.
[5] Bertsekas, D. P. (2010). Pathologies of temporal difference methods in approximate dynamic programming. In 49th IEEE Conference on Decision and Control, (pp. 3034–3039).
[6] Bertsekas, D. P. (2011). Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9(3), 310–335.
[7] Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
[8] Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., & Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11), 2471–2482.
[9] Buşoniu, L., Babuška, R., De Schutter, B., & Ernst, D. (2010). Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press.
[10] De Farias, D. P., & Van Roy, B. (2000). On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization Theory and Applications, 105(3), 589–608.
[11] Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
[12] Kakade, S. M. (2003). On the Sample Complexity of Reinforcement Learning. Ph.D. thesis, University College London.
[13] Konda, V. R., & Tsitsiklis, J. N. (2004). On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4), 1143–1166.
[14] Melo, F. S., Meyn, S. P., & Ribeiro, M. I. (2008). An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, (pp. 664–671).
[15] Nedić, A., & Bertsekas, D. P. (2003). Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2), 79–110.
[16] Pendrith, M. D., & McGarity, M. J.
(1998). An analysis of direct reinforcement learning in non-Markovian domains. In Proceedings of the Fifteenth International Conference on Machine Learning.
[17] Perkins, T. J., & Pendrith, M. D. (2002). On the existence of fixed points for Q-learning and sarsa in partially observable domains. In Proceedings of the Nineteenth International Conference on Machine Learning, (pp. 490–497).
[18] Perkins, T. J., & Precup, D. (2003). A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems.
[19] Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7–9), 1180–1190.
[20] Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning.
[21] Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
[22] Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
[23] Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool Publishers.
[24] Wagner, P. (2011). A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. In Advances in Neural Information Processing Systems 24, (pp. 2573–2581).
[25] Wagner, P. (to appear). Policy oscillation is overshooting. Neural Networks. Author manuscript available at http://users.ics.aalto.fi/pwagner/.