{"title": "A reinterpretation of the policy oscillation phenomenon in approximate policy iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 2573, "page_last": 2581, "abstract": "A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view to this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use it to derive the constrained natural actor-critic algorithm that can interpolate between the aforementioned approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance in the Tetris benchmark problem of all attempted approximate dynamic programming methods. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores in the Tetris problem that improve on existing dynamic programming based results.", "full_text": "A reinterpretation of the policy oscillation\nphenomenon in approximate policy iteration\n\nDepartment of Information and Computer Science\n\nPaul Wagner\n\nAalto University School of Science\n\nPO Box 15400, FI-00076 Aalto, Finland\n\npwagner@cis.hut.fi\n\nAbstract\n\nA majority of approximate dynamic programming approaches to the reinforce-\nment learning problem can be categorized into greedy value function methods and\nvalue-based policy gradient methods. The former approach, although fast, is well\nknown to be susceptible to the policy oscillation phenomenon. 
We take a fresh\nview to this phenomenon by casting a considerable subset of the former approach\nas a limiting special case of the latter. We explain the phenomenon in terms of this\nview and illustrate the underlying mechanism with arti\ufb01cial examples. We also\nuse it to derive the constrained natural actor-critic algorithm that can interpolate\nbetween the aforementioned approaches. In addition, it has been suggested in the\nliterature that the oscillation phenomenon might be subtly connected to the grossly\nsuboptimal performance in the Tetris benchmark problem of all attempted approx-\nimate dynamic programming methods. We report empirical evidence against such\na connection and in favor of an alternative explanation. Finally, we report scores in\nthe Tetris problem that improve on existing dynamic programming based results.\n\n1\n\nIntroduction\n\nWe consider the reinforcement learning problem in which one attempts to \ufb01nd a good policy for\ncontrolling a stochastic nonlinear dynamical system. Many approaches to the problem are value-\nbased and build on the methodology of simulation-based approximate dynamic programming [1, 2,\n3, 4, 5]. In this setting, there is no \ufb01xed set of data to learn from, but instead the target system, or\ntypically a simulation of it, is actively sampled during the learning process. This learning setting is\noften described as interactive learning (e.g., [1, \u00a73]).\nThe majority of these methods can be categorized into greedy value function methods (critic-only)\nand value-based policy gradient methods (actor-critic) (e.g., [1, 6]). The former approach, although\nfast, is susceptible to potentially severe policy oscillations in presence of approximations. This phe-\nnomenon is known as the policy oscillation (or policy chattering) phenomenon [7, 8]. 
The latter\napproach has better convergence guarantees, with the strongest case being for Monte Carlo evaluation\nwith \u2018compatible\u2019 value function approximation. In this case, convergence w.p.1 to a local\noptimum can be established under mild assumptions [9, 6, 4].\nBertsekas has recently called attention to the currently not well understood policy oscillation\nphenomenon [7]. He suggests that a better understanding of it is needed and that such understanding\n\u201chas the potential to alter in fundamental ways our thinking about approximate DP.\u201d He also notes\nthat little progress has been made on this topic in the past decade. In this paper, we will try to shed\nmore light on this topic. The motivation is twofold. First, the policy oscillation phenomenon is intimately\nconnected to some aspects of the learning dynamics at the very heart of approximate dynamic\nprogramming; the lack of understanding in the former implies a lack of understanding in the latter.\nIn the long run, this state might well be holding back important theoretical developments in the field.\nSecond, methods not susceptible to oscillations have a much better suboptimality bound [7], which\ngives also immediate value to a better understanding of oscillation-predisposing conditions.\n\nAn extended version of this paper is available at http://users.ics.tkk.fi/pwagner/.\n\nThe policy oscillation phenomenon is strongly associated in the literature with the popular Tetris\nbenchmark problem. This problem has been used in numerous studies to evaluate different learning\nalgorithms (see [10, 11]). Several studies, including [12, 13, 14, 11, 15, 16, 17], have been conducted\nusing a standard set of features that were originally proposed in [12]. This setting has posed\nconsiderable difficulties to some approximate dynamic programming methods. 
Impressively fast\ninitial improvement followed by severe degradation was reported in [12] using a greedy approxi-\nmate policy iteration method. This degradation has been taken in the literature as a manifestation of\nthe policy oscillation phenomenon [12, 8].\nPolicy gradient and greedy approximate value iteration methods have shown much more stable be-\nhavior in the Tetris problem [13, 14], although it has seemed that this stability tends to come at the\nprice of speed (see esp. [13]). Still, the performance levels reached by even these methods fall way\nshort of what is known to be possible. The typical performance levels obtained with approximate\ndynamic programming methods have been around 5,000 points [12, 8, 13, 16], while an improve-\nment to around 20,000 points has been obtained in [14] by considerably lowering the discount factor.\nOn the other hand, performance levels between 300,000 and 900,000 points were obtained recently\nwith the very same features using the cross-entropy method [11, 15]. It has been hypothesized in\n[7] that this grossly suboptimal performance of even the best-performing approximate dynamic pro-\ngramming methods might also have some subtle connection to the oscillation phenomenon. In this\npaper, we will also brie\ufb02y look into these potential connections.\nThe structure of the paper is as follows. After providing background in Section 2, we discuss the\npolicy oscillation phenomenon in Section 3 along with three examples, one of which is novel and\ngeneralizes the others. We develop a novel view to the policy oscillation phenomenon in Sections 4\nand 5. 
We also validate this view empirically in Section 6 and proceed to look for the suggested\nconnection between the oscillation phenomenon and the convergence issues in the Tetris problem.\nWe report empirical evidence that indeed suggests a shared explanation for the policy degradation\nobserved in [12, 8] and the early stagnation of all the rest of the attempted approximate dynamic\nprogramming methods. However, it seems that this explanation is not primarily related to the\noscillation phenomenon but to numerical instability.\n\n2 Background\nA Markov decision process (MDP) is defined by a tuple M = (S, A, P, r), where S and A denote\nthe state and action spaces. St \u2208 S and At \u2208 A denote random variables at time t, and s, s\u2032 \u2208 S\nand a, b \u2208 A denote state and action instances. P(s, a, s\u2032) = P(St+1 = s\u2032|St = s, At = a)\ndefines the transition dynamics and r(s, a) \u2208 R defines the expected immediate reward function. A\n(soft-)greedy policy \u03c0\u2217(a|s, Q) is a (stochastic) mapping from states to actions and is based on the\nvalue function Q. A parameterized policy \u03c0(a|s, \u03b8) is a stochastic mapping from states to actions\nand is based on the parameter vector \u03b8. Note that we use \u03c0\u2217 to denote a (soft-)greedy policy, not an\noptimal policy. The action value functions Q(s, a) and A(s, a) are estimators of the \u03b3-discounted\ncumulative reward \u2211_t \u03b3^t E[r(St, At)|S0 = s, A0 = a, \u03c0] that follows some (s, a) under some \u03c0.\nThe state value function V is an estimator of such cumulative reward that follows some s.\nIn policy iteration, the current policy is fully evaluated, after which a policy improvement step is\ntaken based on this evaluation. In optimistic policy iteration, policy improvement is based on an\nincomplete evaluation. 
In value iteration, just a one-step lookahead improvement is made at a time.\nIn greedy value function reinforcement learning (e.g., [2, 3]), the current policy on iteration k is\nusually implicit and is greedy (and thus deterministic) with respect to the value function Qk\u22121 of\nthe previous policy:\n\n\u03c0\u2217(a|s, Qk\u22121) = 1 if a = arg max_b Qk\u22121(s, b), and 0 otherwise.    (1)\n\nImprovement is obtained by estimating a new value function Qk for this policy, after which the\nprocess repeats. Soft-greedy iteration is obtained by slightly softening \u03c0\u2217 in some way so that\n\u03c0\u2217(a|s, Qk\u22121) > 0, \u2200a, s, the Gibbs soft-greedy policy class with a temperature \u03c4 (Boltzmann\nexploration) being a common choice:\n\n\u03c0\u2217(a|s, Qk\u22121) \u221d e^(Qk\u22121(s,a)/\u03c4) .    (2)\n\nWe note that (1) becomes approximated by (2) arbitrarily closely as \u03c4 \u2192 0 and that this corresponds\nto scaling the action values toward infinity.\nA common choice for approximating Q is to obtain a least-squares fit using a linear-in-parameters\napproximator \u02dcQ with the feature basis \u03c6\u2217:\n\n\u02dcQk(s, a, wk) = wk\u22a4 \u03c6\u2217(s, a) \u2248 Qk(s, a) .    (3)\n\nFor the soft-greedy case, one option is to use an approximator that will obtain an approximation of\nan advantage function (see [9]):1\n\n\u02dcAk(s, a, wk) = wk\u22a4 (\u03c6\u2217(s, a) \u2212 \u2211_b \u03c0\u2217(b|s, \u02dcAk\u22121) \u03c6\u2217(s, b)) \u2248 Ak(s, a) .    (4)\n\nConvergence properties depend on how the estimation is performed and on the function approximator\nclass with which Q is being approximated. For greedy approximate policy iteration in the\ngeneral case, policy convergence is guaranteed only up to bounded sustained oscillation [2]. 
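
As an illustrative aside that is not part of the original experiments, the relationship between the greedy policy (1) and the Gibbs soft-greedy policy (2) can be sketched numerically for a single state. The action values and the function names below are hypothetical and chosen only for illustration.

```python
import math

def soft_greedy(q_values, tau):
    # Gibbs (Boltzmann) distribution over the actions of one state, as in (2).
    # Subtracting the maximum before exponentiating avoids overflow and
    # leaves the normalized probabilities unchanged.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def greedy(q_values):
    # Deterministic greedy policy, as in (1); ties go to the first maximizer.
    best = q_values.index(max(q_values))
    return [1.0 if a == best else 0.0 for a in range(len(q_values))]

q = [1.0, 1.5, 0.2]  # hypothetical action values for a single state
for tau in (1.0, 0.1, 0.01):
    print(tau, [round(p, 4) for p in soft_greedy(q, tau)])
print('greedy', greedy(q))
```

As the temperature is lowered, the printed distributions concentrate on the maximizing action, which is the sense in which (2) approximates (1) for small temperatures.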
Optimistic\nvariants can permit asymptotic convergence in parameters, although the corresponding policy\ncan manifest sustained oscillation even then [8, 2, 7]. For the case of greedy approximate value\niteration, a line of research has provided solid (although restrictive) conditions for the approximator\nclass for having asymptotic parameter convergence (reviewed in, e.g., [3]), whereas the question of\npolicy convergence in these cases has been left quite open. In the rest of the paper, our focus will be\non non-optimistic approximate policy iteration.\nIn policy gradient reinforcement learning (e.g., [9, 6, 4, 5]), the current policy on iteration k is\nexplicitly represented using some differentiable stochastic policy class \u03c0(\u03b8), the Gibbs policy with\nsome basis \u03c6 being a common choice:\n\n\u03c0(a|s, \u03b8) \u221d e^(\u03b8\u22a4 \u03c6(s,a)) .    (5)\n\nImprovement is obtained via stochastic gradient ascent: \u03b8k+1 = \u03b8k + \u03b1k\u2202J(\u03b8k)/\u2202\u03b8.\nIn actor-critic (value-based policy gradient) methods that implement a policy gradient based approximate\npolicy iteration scheme, the so-called \u2018compatibility condition\u2019 is fulfilled if the value function is\napproximated using (4) with \u03c6\u2217 = \u03c6 and \u03c0(\u03b8k) in place of \u03c0\u2217( \u02dcAk\u22121) (e.g., [9]). In this case, the\nvalue function parameter vector w becomes the natural gradient estimate \u03b7 for the policy \u03c0(a|s, \u03b8),\nleading to the natural actor-critic algorithm [13, 4]:\n\n\u03b7 = w .    (6)\n\nHere, convergence w.p.1 to a local optimum is established for Monte Carlo evaluation under standard\nassumptions (properly diminishing step-sizes and ergodicity of the sampling process, roughly\nspeaking) [9, 6]. 
Convergence into bounded suboptimality is obtained under temporal difference\nevaluation [6, 5].\n\n3 The policy oscillation phenomenon\n\nIt is well known that greedy policy iteration can be non-convergent under approximations. The\nwidely used projected equation approach can manifest convergence behavior that is complex and not\nwell understood, including bounded but potentially severe sustained policy oscillations [7, 8, 18].\nSimilar consequences arise in the context of partial observability for approximate or incomplete state\nestimation (e.g., [19, 20, 21]).\nIt is important to remember that sustained policy oscillation can take place even under (asymptotic)\nvalue function convergence (e.g., [7, 8]). Policy convergence can be established under various\nrestrictions. Continuously soft-greedy action selection (which is essentially a step toward the policy\ngradient approach) has been found to have a beneficial effect in cases of both value function\napproximation and partial observability [22]. A notable approach is introduced in [7] wherein it is\nalso shown that the suboptimality bound for a converging policy sequence is much better. Interestingly,\nfor the special case of Monte Carlo estimation of action values, it is also possible to establish\nconvergence by solely modifying the exploration scheme, which is known as consistent exploration\n[23] or MCESP [24].\n\nFootnote 1: The approach in [4] is needed to permit temporal difference evaluation in this case.\n\nFigure 1: Oscillatory examples. Boxes marked with yk denote observations (aggregated states).\nCircles marked with wk illustrate receptive fields of the basis functions. Only non-zero rewards are\nshown. Start states: s1 (1.1), s0 (1.2), and s1 (1.3). Arrows leading out indicate termination.\n\nThe setting likely to be the simplest possible in which oscillation occurs even with Monte Carlo\npolicy evaluation is depicted in Figure 1.1 (adapted from [21]). 
The actions al and ar are available\nin the decision states s1 and s2. Both states are observed as y1. The only reward is obtained\nwith the decision sequence (s1, al; s2, ar). Greedy value function methods that operate without\nstate estimation will oscillate between the policies \u03c0(y1) = al and \u03c0(y1) = ar, excluding the\nexceptions mentioned above. This example can also be used to illustrate how local optima can\narise in the presence of approximations by changing the sign of the reward that follows (s2, ar)\n(see [20]). Figure 1.2 (adapted from [25]) shows a more complex case in which a deterministic\noptimal solution is attainable. The actions a[1,3] are available in the only decision state s0, which\nis observed as y0. Oscillation will occur when using temporal difference evaluation but not with\nMonte Carlo evaluation. These two POMDP examples are trivially equivalent to value function\napproximation using hard aggregation. Figure 1.3 (a novel example inspired by the classical XOR\nproblem) shows how similar counterexamples can be constructed also for the case of softer value\nfunction approximation. The action values are approximated with \u02dcQ(s1, al) = w1,l, \u02dcQ(s2, al) =\n0.5w1,l + 0.5w2,l, \u02dcQ(s3, al) = w2,l, \u02dcQ(s1, ar) = w1,r, \u02dcQ(s2, ar) = 0.5w1,r + 0.5w2,r, and\n\u02dcQ(s3, ar) = w2,r. The only reward is obtained with the decision sequence (s1, al; s2, ar; s3, al).\nOscillation will occur even with Monte Carlo evaluation. For other examples, see e.g. [8, 19].\nA detailed description of the oscillation phenomenon can be found in [8, \u00a76.4] (see also [12, 7]),\nwhere it is described in terms of cyclic sequences in the so-called greedy partition of the value\nfunction parameter space. 
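
As a rough sketch rather than a reproduction of the original example, the oscillation in the Figure 1.1 setting can be traced with a few lines of arithmetic. Everything concrete below is an assumption made for illustration: the reward magnitude of +1, the zero-reward terminations after (s1, ar) and (s2, al), the epsilon-soft exploration scheme, and the use of large-sample every-visit Monte Carlo averages over the shared observation y1. The function names are hypothetical.

```python
# Assumed dynamics: start in s1; (s1, al) leads to s2; (s1, ar) and (s2, al)
# terminate with zero reward; (s2, ar) terminates with reward +1.
# Both states are observed as y1, so a single action preference is shared.

def mc_action_values(p_al):
    # Large-sample every-visit Monte Carlo estimates of Q(y1, al) and Q(y1, ar)
    # when the behavior policy picks al with probability p_al (0 < p_al < 1).
    p_ar = 1.0 - p_al
    # (y1, al) is visited in s1 (prob p_al) and again in s2 (prob p_al * p_al).
    # Only the s1 visit can be followed by reward, when ar is then picked in s2.
    q_al = (p_al * p_ar) / (p_al + p_al * p_al)
    # (y1, ar) is visited in s1 (prob p_ar, return 0) and in s2
    # (prob p_al * p_ar, return 1).
    q_ar = (p_al * p_ar) / (p_ar + p_al * p_ar)
    return q_al, q_ar

def greedy_iteration(first_action, epsilon, iterations):
    # Greedy policy iteration; exploration picks the non-greedy action
    # with probability epsilon (0 < epsilon < 0.5).
    action, history = first_action, [first_action]
    for _ in range(iterations):
        p_al = 1.0 - epsilon if action == 'al' else epsilon
        q_al, q_ar = mc_action_values(p_al)
        action = 'al' if q_al > q_ar else 'ar'
        history.append(action)
    return history

print(greedy_iteration('al', epsilon=0.1, iterations=6))
```

Under these assumptions the greedy action flips on every iteration, and the two estimates coincide only when both actions are equally likely, which is consistent with a stochastic policy being the attractor.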
Although this line of research has provided a concise description of\nthe phenomenon, it has not fully answered the question of why approximations can introduce such\ncyclic sequences in greedy policy iteration and why strong convergence guarantees exist for the\npolicy gradient based counterpart of this methodology. We will proceed by taking a different view\nby casting a considerable subset of the former approach as a special case of the latter.\n\n4 Approximations and attractive stochastic policies\n\nIn this section, we briefly and informally examine how policy oscillation arises in the examples in\nSection 3. In all cases, oscillation is caused by the presence of an attractive stochastic policy, these\nattractors being induced by approximations. In the case of partial observability without proper state\nestimation (Figure 1.1), the policy class is incapable of representing differing action distributions\nfor the same observation with differing histories. This makes the optimal sequence (y1, al; y1, ar)\ninexpressible for deterministic policies, whereas a stochastic policy can still emit it every now and\nthen by chance. In Figure 1.3, the same situation is arrived at due to the insufficient capacity of the\napproximator: the specified value function approximator cannot express such value estimates that\nwould lead to an implicit greedy policy that attains the optimal sequence (s1, al; s2, ar; s3, al).\nGenerally speaking, in these cases, oscillation follows from a mismatch between the main policy class\nand the exploration policy class: stochastic exploration can occasionally reach the reward, but the\ndeterministic main policy is incapable of exploiting this opportunity. The opportunity nevertheless\nkeeps appearing in the value function, leading to repeated failing attempts to exploit it. 
Consistent\nexploration avoids the problem by limiting exploration to only expressible sequences.\nTemporal difference evaluation effectively solves for an implicit Markov model [26], i.e., it gains\nvariance reduction in policy evaluation by making the Markov assumption. When this assumption\nfails, the value function shows non-existent improvement opportunities. In Figure 1.2, an incorrect\nMarkov assumption leads to a TD solution that corresponds to a model in which, e.g., the actually\nimpossible sequence (y0, a2, r = 0; y2,\u2212, r = +1; end) becomes possible and attractive. Generally\nspeaking, oscillation results in this case from perceived but non-existent improvement opportunities\nthat vanish once an attempt is made to exploit them. This vanishing is caused by changes in the\nsampling distribution that leads to a different implicit Markov model and, consequently, to a different\n\ufb01xed point (see [27, 18]).\nIn summary, stochastic policies can become attractive due to deterministically unreachable or com-\npletely non-existing improvement opportunities that appear in the value function. In all cases, the\nclass of stochastic policies allows gradually increasing the attempt of exploitation of such an oppor-\ntunity until it is either optimally exploited or it has vanished enough so as to have no advantage over\nalternatives, at which point a stochastic equilibrium is reached.\n\n5 Policy oscillation as sustained overshooting\n\nIn this section, we focus more carefully on how attractive stochastic policies lead to sustained policy\noscillation when viewed within the policy gradient framework. We begin by looking at a natural\nactor-critic algorithm that uses the Gibbs policy class (5). 
We iterate by fully estimating \u02dcAk in (4)\nfor the current policy \u03c0(\u03b8k), as shown in [4], and then a gradient update is performed using (6):\n\n\u03b8k+1 = \u03b8k + \u03b1\u03b7k .    (7)\n\nNow let us consider some policy \u03c0(\u03b8k) from such a policy sequence generated by (7) and denote the\ncorresponding value function estimate by \u02dcAk and the natural gradient estimate by \u03b7k. It is shown in\n[13] that taking a very long step in the direction of the natural gradient \u03b7k will approach in the limit\na greedy update (1) for the value function \u02dcAk:\n\n\u03c0(a|s, \u03b8k + \u03b1\u03b7k) =lim \u03c0\u2217(a|s, \u02dcAk) , \u03b1 \u2192 \u221e, \u03b8k \u219b \u221e, \u03b7 \u2260 0, \u2200s, a .    (8)\n\nThe resulting policy will have the form\n\n\u03c0(a|s, \u03b8k + \u03b1\u03b7k) \u221d e^(\u03b8k\u22a4 \u03c6(s,a) + \u03b1\u03b7k\u22a4 \u03c6(s,a)) .    (9)\n\nThe proof in [13] is based on the term \u03b1\u03b7k\u22a4 \u03c6(s, a) dominating the sum when \u03b1 \u2192 \u221e. Thus, this type\nof a greedy update is a special case of a natural gradient update in which the step-size approaches\ninfinity.\nHowever, the requirement that \u03b8k \u219b \u221e will hold only during the first step using a constant \u03b1 \u2192 \u221e,\nassuming a bounded initial \u03b8. Thus, natural gradient based policy iteration using such a very large\nbut constant step-size does not approach greedy value function based policy iteration after the first\nsuch iteration. Little is needed, however, to make the equality apply in the case of full policy\niteration. The cleanest way, in theory, is to use a steeply increasing step-size schedule.\nTheorem 1. Let \u03c0(\u03b8k) denote the kth policy obtained from (7) using the step-sizes \u03b1[0,k\u22121] and\nnatural gradients \u03b7[0,k\u22121]. 
Let \u03c0\u2217(wk) denote the kth policy obtained from (1) with infinitely small\nadded softness and using a value function (4), with \u03c6\u2217 = \u03c6 and \u02dcA(w0) being evaluated for \u03c0(\u03b80).\nAssume \u03b80 to be bounded from infinity. Assume all \u03b7k to be bounded from zero and infinity.\nIf \u03b10 \u2192 \u221e and \u03b1k/\u03b1k\u22121 \u2192 \u221e, \u2200k > 0, then \u03c0(\u03b8k+1) =lim \u03c0\u2217(wk).\n\nProof. The equivalence after the first iteration is proven in [13] with the requirement that the sum\nin (9) is dominated by the last term \u03b10\u03b70\u22a4 \u03c6(s, a). For \u03b10 \u2192 \u221e, this holds if \u03b80 is bounded and\n\u03b70 \u2260 0. By writing the parameter vector after the second iteration as \u03b82 = \u03b80 + \u03b10\u03b70 + \u03b11\u03b71, the\nsum becomes\n\n\u03b80\u22a4 \u03c6(s, a) + \u03b10\u03b70\u22a4 \u03c6(s, a) + \u03b11\u03b71\u22a4 \u03c6(s, a) .    (10)\n\nThe requirement for the result in [13] to still apply is that the last term keeps dominating the sum.\nAssuming \u03b80 \u219b \u221e, \u03b70 \u219b \u221e, and \u03b71 \u2260 0, this condition is maintained if \u03b11 \u2192 \u221e and\n\u03b11/\u03b10 \u2192 \u221e. That is, the step-size in the second iteration needs to approach infinity much faster\nthan the step-size in the first iteration. The rest follows by induction.\n\nHowever, once the first update is performed using such a very large step-size, it is no longer possible\nto revert to more conventional step-size schedules: once \u03b8 has become very large, any update\nperformed using a much smaller \u03b1 will have virtually no effect on the policy. In the following, a\nmore practical alternative is discussed that both avoids the related numerical issues and allows\ngradual interpolation back toward conventional policy gradient iteration. 
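
As a small numerical illustration of the limiting behavior in (8) and (9), and not a reproduction of the original experiments, the sketch below takes increasingly long steps along a fixed direction for a Gibbs policy over three actions in a single state. The feature vectors, the parameter values, and the direction eta (standing in for a natural gradient estimate) are hypothetical.

```python
import math

def gibbs_policy(theta, features):
    # Gibbs policy (5) for one state; features[a] is the vector phi(s, a).
    scores = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    top = max(scores)
    weights = [math.exp(x - top) for x in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical two-parameter example with three actions in a single state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.3]  # bounded current parameters
eta = [0.2, -0.1]    # stand-in for a natural gradient estimate

for alpha in (1.0, 100.0, 10000.0):
    stepped = [t + alpha * e for t, e in zip(theta, eta)]
    print(alpha, [round(p, 4) for p in gibbs_policy(stepped, features)])
```

With these numbers, the inner product of eta and the features is largest for the first action, and the printed policies concentrate on that action as the step-size grows, mirroring how the last term in (9) comes to dominate.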
It also makes it easier to\nillustrate the resulting process, which we will do shortly. However, a slight modification to the\nnatural actor-critic algorithm is required.\nMore precisely, we constrain the magnitude of \u03b8 by enforcing \u2016\u03b8\u2016 \u2264 c after each update, where \u2016\u03b8\u2016\nis some measure of the magnitude of \u03b8 and c is some positive constant. Here the update equation (7)\nis replaced by:\n\n\u03b8k+1 = \u03b8k + \u03b1\u03b7k if \u2016\u03b8k + \u03b1\u03b7k\u2016 \u2264 c, and \u03c4c(\u03b8k + \u03b1\u03b7k) otherwise,    (11)\n\nwhere \u03c4c = c/\u2016\u03b8k + \u03b1\u03b7k\u2016.\nTheorem 2. Let \u03c0(\u03b8k) and \u03c0\u2217(wk) be as previously, except that (7) is replaced with (11). Let\n\u02dcA(w0) be evaluated for \u03c0(\u03b80). Assume \u03b80 \u219b \u221e and all \u03b7k \u2260 0. If c \u2192 \u221e and \u03b1/c \u2192 \u221e, then\n\u03c0(\u03b8k+1) =lim \u03c0\u2217(wk).\n\nProof. The proof in [13] for a single iteration of the unmodified algorithm requires that the last term\nof the sum in (9) dominates. This holds if \u03b1/\u2016\u03b8k\u2016 \u2192 \u221e and \u03b7k \u2260 0. This is ensured during the first\niteration by having \u03b80 \u219b \u221e. After the kth iteration, \u2016\u03b8k\u2016 \u2264 c due to the constraint in (11), and the\nlast term will dominate as long as \u03b1/c \u2192 \u221e and \u03b7k \u2260 0.\nThe constraint affects the policy \u03c0(\u03b8k+1) only when \u2016\u03b8k + \u03b1\u03b7k\u2016 > c, in which case the magnitude\nof the parameter vector is scaled down with a factor \u03c4c so that it becomes equal to c. 
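
A minimal sketch of the constrained update (11) follows. The Euclidean norm is an assumption, since the text only requires some measure of the magnitude of the parameter vector, and the function name and numerical values are hypothetical.

```python
import math

def constrained_update(theta, eta, alpha, c):
    # Update (11): take the step, then rescale onto the radius-c ball if needed.
    # The Euclidean norm used here is an assumed choice of magnitude measure.
    stepped = [t + alpha * e for t, e in zip(theta, eta)]
    norm = math.sqrt(sum(x * x for x in stepped))
    if norm <= c:
        return stepped
    scale = c / norm  # the factor tau_c in (11)
    return [scale * x for x in stepped]

theta = constrained_update([0.5, -0.3], [0.2, -0.1], alpha=1e10, c=50.0)
print(theta, math.hypot(*theta))
```

With a very large step-size the result always lands on the boundary of the ball, so only the direction of the step survives; a step that stays inside the ball is left unchanged.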
This has\na diminishing effect on the resulting policy as c \u2192 \u221e because the Gibbs distribution becomes\nincreasingly insensitive to scaling of the parameter vector when its magnitude approaches infinity:\n\n\u03c0(\u03c4c\u03b8) =lim \u03c0(\u03b8) , \u2200\u03c4c, \u03b8 such that \u2016\u03c4c\u03b8\u2016 \u2192 \u221e, \u2016\u03b8\u2016 \u2192 \u221e .    (12)\n\nWith a constant \u03b1 \u2192 \u221e and finite c, the resulting constrained natural actor-critic algorithm (CNAC)\nis analogous to soft-greedy iteration in which on-policy Boltzmann exploration with temperature\n\u03c4 = 1/c is used: constraining the magnitude of \u03b8 will effectively ensure some minimum level\nof stochasticity in the corresponding policy (there is a mismatch between the algorithms even for\n\u03c4 = 1/c whenever \u2016\u03b7\u2016 \u2260 1). If the soft-greedy method uses (4) for policy evaluation, then exact\nequivalence in the limit is obtained when c \u2192 \u221e while maintaining \u03b1/c \u2192 \u221e. Lowering \u03b1\ninterpolates toward a conventional natural gradient method. These considerations apply also for (3)\nin place of (4) in the soft-greedy method if the indices of the maximizing actions become estimated\nequally in both cases: arg max_a \u02dcA(s, a, wA) = arg max_b \u02dcQ(s, b, wQ), \u2200s.\nGreedy policy iteration searches in the space of deterministic policies. As noted, the sequence of\ngreedy policies that is generated by such a process can be approximated arbitrarily closely with the\nGibbs policy class (2) with \u03c4 \u2192 0. For this class, the parameters of all deterministic policies lie at\ninfinity in different directions in the parameter space, whereas stochastic policies are obtained with\nfinite parameter values (except for vanishingly narrow directions along diagonals). 
From this point\nof view, the search is conducted on the surface of an \u221e-radius sphere: each iteration performs a\njump from infinity in one direction to infinity in some other direction. Based on Theorems 1 and 2,\nwe observe that the policy sequence that results from these jumps can be approximated arbitrarily\nclosely with a natural actor-critic method using very large step-sizes.\n\nThe soundness of such a process obviously requires some special structure for the gradient landscape.\nIn informal terms, what suffices is that the performance landscape has a monotonically\nincreasing profile up to infinity in the direction of a gradient that is estimated at any point. This\ncondition is established if all attractors in the parameter space reside at infinity and if the gradient\nfield is not, loosely speaking, too \u2018curled\u2019. Although we ignore the latter requirement, we note\nthat the former requirement is satisfied when only deterministic attractors exist in the Gibbs policy\nspace. This condition holds when the problem is fully Markovian and the value function is\nrepresented exactly, which leads to the standard result for MDPs stating that there always exists a\ndeterministic policy that is globally optimal, that there are no locally optimal policies and that any\npotential stochastic optimal policies are not attractive (e.g., [1, \u00a7A.2]). 
However, when these condi-\ntions do not hold and there is an attractor in the policy space that corresponds to a stochastic policy,\nthere is a \ufb01nite attractor in the parameter space that resides inside the \u221e-radius sphere.\n\nFigure 2: Performance landscapes and estimated gradient \ufb01elds for the examples in Figure 1.\n\nThe required special structure is clearly visible in Figure 2.1a, in which the performance landscape\nand the gradient \ufb01eld are shown for the fully observable (and thus Markovian) counterpart of the\nproblem from Figure 1.1. This structure can be seen also in Figure 2.2a, in which the problem from\nFigure 1.2 is evaluated using Monte Carlo evaluation. The redundant parameters \u03b8s1,al and \u03b8s2,al in\nthe former and \u03b8y0,a3 in the latter were \ufb01xed to zero. In these cases, movement in the direction of\nthe natural gradient keeps improving performance up to in\ufb01nity, i.e., there are no \ufb01nite optima in the\nway. This structure is clearly lost in Figure 2.1b, which shows the evaluation for the non-Markovian\nproblem from Figure 1.1. The same holds for the temporal differences based gradient \ufb01eld for the\nproblem from Figure 1.2 that is shown in Figure 2.2b. In essence, the sustained policy oscillation\nthat results from using very large step-sizes or greedy updates in the latter two cases (2.1b and 2.2b)\nis caused by sustained overshooting over the \ufb01nite attractor in the policy parameter space.\nAnother implication of the equivalence between very long natural gradient updates and greedy up-\ndates is that, contrary to what is sometimes suggested in the literature, the natural actor-critic ap-\nproach has an inherent capability for a speed that is comparable to parametric greedy approaches\nwith linear-in-parameters value function approximation. 
This is because whatever initial improvement\nspeed can be achieved with the latter due to greedy updates, the same speed can be also\nachieved with the former using the same basis together with very long steps and constraining. This\neffectively corresponds to an attempt to exploit whatever remains of the special structure of a Markovian\nproblem, making the use of a very large \u03b1 in constrained policy improvement analogous to\nusing a small \u03bb in policy evaluation. Constraining \u2016\u03b8\u2016 also enables interpolating back toward conventional\nnatural policy gradient learning (in addition to offering a crude way of maintaining\nexplorativity): in cases of partial Markovianity, very long (near-infinite) natural gradient steps can be\nused to quickly find the rough direction of the strongest attractors, after which gradually decreasing\nthe step-size allows an ascent toward some finite attractor.\n\n6 Empirical results\n\nIn this section, we apply several variants of the natural actor-critic algorithm and some greedy policy\niteration algorithms to the Tetris problem using the standard features from [12]. For policy improvement,\nwe use the original natural actor-critic (NAC) from [4], a constrained one (CNAC) that uses\n(11) and a very large \u03b1, and a soft-greedy policy iteration algorithm (SGPI) that uses (2). For policy\nevaluation, we use LSPE [28] and an SVD-based batch solver (pinv). The B matrix in LSPE was\ninitialized to 0.5I and the policy was updated after every 100th episode. We used the advantage\nestimator from (4) unless otherwise stated. We used a simple initial policy (\u03b8maxh = \u03b8holes = \u22123)\nthat scores around 20 points.\n\n[Figure 2 panel axes: 1a, \u03b8s1,ar and \u03b8s2,ar; 1b, \u03b8y1,al and \u03b8y1,ar; 2a and 2b, \u03b8y0,a1 and \u03b8y0,a2; each axis spans \u22122 to 2, and shading indicates return.]\n\nFigure 3: Empirical results for the Tetris problem. 
See the text for details.\n\nFigure 3.1 shows that with soft-greedy policy iteration (SGPI), it is in fact possible to avoid policy\ndegradation by using a suitable amount of softening. Results for constrained natural actor-critic\n(CNAC) with \u03b1 = 10^10 are shown in Figure 3.2. The algorithm can indeed emulate greedy updates\n(SGPI) and the associated policy degradation. Unconstrained natural actor-critic (NAC), shown in\nFigure 3.3, failed to match the behavior and speed of SGPI and CNAC with any step-size (only\nselected step-sizes are shown). Results for all algorithms when using the Q estimator in (3) are\nshown in Figure 3.4 (technically, CNAC and NAC are not using a natural gradient now). SGPI and\nCNAC match perfectly while transiently reaching a level around 50,000 points in just 2 iterations.\nWe did observe the presence of oscillation-predisposing structure during several runs. There were\noptima at finite parameter values along several consecutively estimated gradient directions, but these\noptima did not usually form closed basins of attraction in the full parameter space. At such points,\nthe performance landscape was reminiscent of what was illustrated in Figure 2.1b, except that there\nwas a tendency for a slope toward an open end of the valley (ridge) at finite distance. As a result,\noscillations were mainly transient with suitably chosen learning parameter values.\nHowever, a commonality among all the algorithms was that the relevant matrices quickly became\nhighly ill-conditioned. This was the case especially when using (4), with which condition numbers\nwere typically above 10^9 upon convergence/stagnation. Figures 3.5 and 3.6 show performance levels\nand typical condition numbers for NAC with different discount factors. It can be seen that the inferior\nresults obtained with too high a \u03b3 (cf. 
[14, 12]) are associated with serious ill-conditioning.\nIn contrast to typical approximate dynamic programming methods, the cross-entropy method involves\nnumerically more stable computations and, moreover, the computations are based on information\nfrom a distribution of policies. Currently, we expect that the policy oscillation or chattering\nphenomenon is not the main cause of either policy degradation or stagnation in this problem. Instead,\nit seems that, for both greedy and gradient approaches, the explanation is related to numerical\ninstabilities that possibly stem both from the involved estimators and from insufficient exploration.\n\nAcknowledgments\n\nWe thank Lucian Bu\u015foniu and Dimitri Bertsekas for valuable discussions. This work has been\nfinancially supported by the Academy of Finland through the Centre of Excellence Programme.\n\n[Figure 3 panels, each plotted against iteration: (1) average score, SGPI, \u03c4 = 0.1, 0.01, 0.001; (2) average score, CNAC, c = 10, 50, 500; (3) average score, NAC, \u03b1 = 50, 75, 1000, 10000; (4) average score, Q estimator from (3): SGPI, \"CNAC\", \"NAC\" (\u03b1 = 500), \"NAC\" (\u03b1 = 10^10); (5) average score, NAC, \u03b3 = 0.8, 0.9, 0.975, 1; (6) condition number, NAC, \u03b3 = 0.8, 0.9, 0.975, 1.]\n\nReferences\n\n[1] C. Szepesv\u00e1ri. Algorithms for reinforcement learning. Morgan & Claypool Publishers, 2010.\n[2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2005.\n[3] L. Bu\u015foniu, R. Babu\u0161ka, B. De Schutter, and D. Ernst. Reinforcement learning and dynamic programming using function approximators. CRC Press, 2010.\n[4] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180\u20131190, 2008.\n[5] S. Bhatnagar, R. S. Sutton, M. 
Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471\u20132482, 2009.\n[6] V. R. Konda and J. N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143\u20131166, 2004.\n[7] D. P. Bertsekas. Approximate policy iteration: A survey and some new methods. Technical report, Massachusetts Institute of Technology, Cambridge, US, 2010.\n[8] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, 1996.\n[9] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.\n[10] C. Thiery and B. Scherrer. Building Controllers for Tetris. ICGA Journal, 32(1):3\u201311, 2009.\n[11] I. Szita and A. L\u0151rincz. Learning Tetris using the noisy cross-entropy method. Neural Computation, 18(12):2936\u20132941, 2006.\n[12] D. P. Bertsekas and S. Ioffe. Temporal differences-based policy iteration and applications in neuro-dynamic programming. Technical report, Massachusetts Institute of Technology, Cambridge, US, 1996.\n[13] S. M. Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, 2002.\n[14] M. Petrik and B. Scherrer. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, 2008.\n[15] C. Thiery and B. Scherrer. Improvements on learning Tetris with cross entropy. ICGA Journal, 32(1):23\u201333, 2009.\n[16] V. Farias and B. Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty, pages 189\u2013201. Springer, 2006.\n[17] V. Desai, V. Farias, and C. Moallemi. A smoothed approximate linear program. In Advances in Neural Information Processing Systems, 2009.\n[18] G. J. Gordon. 
Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems, 2001.\n[19] S. P. Singh, T. Jaakkola, and M. I. Jordan. Learning without state-estimation in partially observable markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, volume 31, page 37, 1994.\n[20] M. D. Pendrith and M. J. McGarity. An analysis of direct reinforcement learning in non-markovian domains. In Proceedings of the Fifteenth International Conference on Machine Learning, 1998.\n[21] T. J. Perkins. Action value based reinforcement learning for POMDPs. Technical report, University of Massachusetts, Amherst, MA, USA, 2001.\n[22] T. J. Perkins and D. Precup. A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems, 2003.\n[23] P. A. Crook and G. Hayes. Consistent exploration improves convergence of reinforcement learning on POMDPs. In AAMAS 2007 Workshop on Adaptive and Learning Agents, 2007.\n[24] T. J. Perkins. Reinforcement learning for POMDPs based on action values and stochastic optimization. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, pages 199\u2013204. American Association for Artificial Intelligence, 2002.\n[25] G. J. Gordon. Chattering in SARSA(\u03bb). Technical report, Carnegie Mellon University, Pittsburgh, PA, USA, 1996.\n[26] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L. Littman. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pages 752\u2013759. ACM, 2008.\n[27] D. P. Bertsekas and H. Yu. Q-learning and enhanced policy iteration in discounted dynamic programming. In Decision and Control (CDC), 2010 49th IEEE Conference on, pages 1409\u20131416. IEEE, 2010.\n[28] A. 
Nedi\u0107 and D. P. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems: Theory and Applications, 13(1\u20132):79\u2013110, 2003.\n", "award": [], "sourceid": 1398, "authors": [{"given_name": "Paul", "family_name": "Wagner", "institution": null}]}