{"title": "Policy Gradient Coagent Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1944, "page_last": 1952, "abstract": "We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module's input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods.", "full_text": "Policy Gradient Coagent Networks

Philip S. Thomas
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01002
pthomas@cs.umass.edu

Abstract

We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module's input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods.

1 Introduction

Methods for solving sequential decision problems with delayed reward, where the problems are formulated as Markov decision processes (MDPs), have been compared to the learning mechanisms of animal brains [3, 4, 9, 10, 13, 20, 22]. These comparisons stem from similarities between activation of dopaminergic neurons and reward prediction error [19], also called the temporal difference (TD) error [21]. 
Dopamine is broadcast to large portions of the human brain, suggesting that it may be used in a similar manner to the TD error in reinforcement learning (RL) [23] systems, i.e., to facilitate improvements to the brain's decision rules.

Systems with a critic that computes and broadcasts the TD error to another module called the actor, which stores the current decision rule, are called actor-critic architectures. Chang et al. [7] present a compelling argument that the fly brain is an actor-critic by finding the neurons making up the critic and then artificially activating them to train the actor portions of the brain. However, current actor-critic methods in the artificial intelligence community remain biologically implausible because each component of the actor can only be updated with detailed knowledge of the entire actor. This forces computational neuroscientists to either create novel methods [14] or alter existing methods from the artificial intelligence community in order to enforce locality constraints (e.g., [16]).

Figure 1: Example modular actor.

The actor in an actor-critic maintains a decision rule, π, called a policy, parameterized by a vector θ, that computes the probability of an action (decision), a, given an estimate of the current state of the world, s_t, and the current parameters, θ_t. In some cases, an actor can be broken into multiple interacting modules, each of which computes an action given some input, x, which may contain elements of s as well as the outputs of other modules. An example of such a modular actor is provided in Figure 1. This actor consists of three modules, A_1, A_2, and A_3, with parameters θ^1, θ^2, and θ^3, respectively. 
The ith module takes input x^i, which is a subset of the state features and the outputs of other modules. It then produces its action a^i according to its policy, π^i(x^i, a^i, θ^i) = Pr(a^i | x^i, θ^i). The output, a, of the whole modular actor is one of the module outputs; in this case, a = a^3. Later we modify this to allow the action a to follow any distribution with the state and module outputs as parameters. This modular policy can also be written as a non-modular policy that is a function of θ = {θ^1, θ^2, θ^3}, i.e., π(s, a, θ) = Pr(a | s, θ). We assume that the modular policy is not recurrent. Such modular policies appear frequently in models of the human brain, with modules corresponding to neurons or collections thereof [12, 16].

Current actor-critic methods (e.g., [11, 15, 23, 24]) require knowledge of ∂π/∂θ^i in order to update θ^i. However, ∂π/∂θ^i often depends on the current values of all other parameters as well as the structure defining how the parameters are combined to produce the decision rule. This is akin to assuming that a neuron (or cluster of neurons), A_i, must know its influence on the final decision rule implemented. Were another module to modify its policy such that ∂π/∂θ^i changes, a message must be sent to alert A_i of the exact changes so that it can update its estimate of ∂π/∂θ^i, which is biologically implausible.

Rather than keeping a current estimate of ∂π/∂θ^i, one might attempt to compute it on the fly via the error backpropagation learning algorithm [17]. In this algorithm, each module, A_i, beginning with the output modules, computes its own update and then sends a message containing ∂π/∂a^j to each A_j that A_i uses as input (we call these A_j parents, and A_i a child of A_j). 
Once all of A_i's children have updated, it will have all of the information required to compute ∂π/∂θ^i. Though an improvement upon the naive message passing scheme, backpropagation remains biologically implausible because it would require rapid transmission of information backwards along the axon, which has not been observed [8, 28]. However, gradient descent remains one of the most frequently used methods. For example, Rivest et al. [16] use gradient descent to update a modular actor, and are forced to assume that certain derivatives are always one in order to maintain realistic locality constraints.

This raises the question: could each module update given only local information that does not include explicit knowledge of ∂π/∂θ^i? We assume that a critic exists that broadcasts the TD error, so a module's local information would consist of its input x^i, which is not necessarily a Markov state representation, its output a^i, and the TD error. Though this has been achieved for tasks with immediate rewards [3, 26, 27], we are not aware of any such methods for tasks with delayed rewards. In this paper we present a class of algorithms, called policy gradient coagent networks (PGCNs), that do exactly this: they allow modules to update given only local information.

PGCNs are also a viable technique for non-biological reinforcement learning applications in which ∂π/∂θ is prohibitively difficult to compute. For example, consider an artificial neural network where the output of each neuron follows some probability distribution over the reals. Though this would allow for exploration at every level, rather than just at the level of primitive actions of the output layer, expressions for π(s, a, θ) would require a nested integral for every node and ∂π/∂θ would be difficult to compute or approximate for networks with many neurons and layers. 
Because PGCNs do not require knowledge of ∂π/∂θ, they remain simple even in such cases, making them a practical choice for complex parameterized policies.

2 Background

An MDP is a tuple M = (S, A, P, R, d_{s0}), where S and A are the sets of possible states and actions respectively; P gives state transition probabilities: P(s, a, s') = Pr(s_{t+1}=s' | s_t=s, a_t=a), where t is the current time step; R(s, a) = E[r_t | s_t=s, a_t=a] is the expected reward when taking action a in state s; and d_{s0}(s) = Pr(s_0=s). An agent A with time-variant parameters θ_t ∈ Θ (typically function approximator weights, learning rates, etc.) observes the current state s_t and selects an action, a_t, based on s_t and θ_t, which is used to update the state according to P. It then observes the resulting state, s_{t+1}, receives uniformly bounded reward r_t according to R, and updates its parameters to θ_{t+1}.

A policy is a mapping from states to probabilities of selecting each possible action. A's policy π may be parameterized by a vector, θ, such that π(s, a, θ) = Pr(a_t=a | s_t=s, θ_t=θ). We assume that ∂π(s, a, θ)/∂θ exists for all s, a, and θ. Let d^θ_M(s) denote the stationary distribution over states under the policy induced by θ. 
We can then write the average reward for θ as

    J_M(θ) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r_t | M, θ ].   (1)

The state-value function, which maps each state to the expected sum of differences between the rewards received when following the policy induced by θ, starting in the provided state, and the average reward, is

    V^θ_M(s) = Σ_{t=1}^{∞} E[ r_t − J(θ) | M, s_0 = s, θ ].   (2)

Lastly, we define the TD error to be δ_t = r_t − J_M(θ) + V^θ_M(s_{t+1}) − V^θ_M(s_t).

2.1 Policy Gradient

One approach to improving a policy for an MDP is to adjust the parameters θ to ascend the policy gradient, ∇_θ J_M(θ). For reviews of policy gradient methods, see [5, 15, 24]. A common variable in policy gradient methods is the compatible features, ψ_{sa} = ∇_θ log π(s, a, θ). Bhatnagar et al. [5] showed that δ_t ψ_{sa} is an unbiased estimate of ∇_θ J_M(θ) if s ∼ d^θ_M(·) and a ∼ π(s, ·, θ). This results 
This results\nin a simple actor-critic algorithm, which we reproduce from [5]:\n\n\u02c6Jt+1 =(1 \u2212 c\u03b1t) \u02c6Jt + c\u03b1trt\n\n\u03b4t =rt \u2212 \u02c6Jt+1 + vt \u00b7 \u03c6(st+1) \u2212 vt \u00b7 \u03c6(st)\n\n(3)\n(4)\n(5)\n(6)\nwhere \u02c6J is a scalar estimate of J, \u03b4t remains the scalar TD error, \u03c6 is any function taking S to a\nfeature space for linear value function approximation, v is a vector of weights for the approximation\nv \u00b7 \u03c6(s) \u2248 V \u03b8\n\nM (s), c is a constant, and \u03b1t and \u03b2t are learning rate schedules such that\n\nvt+1 =vt + \u03b1t\u03b4t\u03c6(st)\n\u03b8t+1 =\u03b8t + \u03b2t\u03b4t\u03c8stat,\n\n\u221e(cid:88)\n\n\u221e(cid:88)\n\n\u221e(cid:88)\n\n\u03b1t =\n\n\u03b2t = \u221e,\n\n(\u03b12\n\nt + \u03b22\n\nt ) < \u221e, \u03b1t = o(\u03b2t).\n\n(7)\n\nt=0\n\nt=0\n\nt=0\n\n(cid:101)\u2207\u03b8JM (\u03b8) = G(\u03b8)\u22121\u2207\u03b8JM (\u03b8),\n\nOne example of such a schedule would be \u03b1t = \u03b1\u03b1c\n\u03b2C +t, for some constants\n\u03b1, \u03b1C, \u03b2, and \u03b2C. We call this algorithm the vanilla actor-critic (VAC). Bhatnagar et al. [5] show\nthat under certain mild assumptions and in the limit as t \u2192 \u221e, VAC will converge to a \u03b8t that is\nwithin a small neighborhood of a local maximum of JM (\u03b8).\nSome more advanced actor-critic methods ascend the natural policy gradient [1, 5, 15],\n\n\u03b1c+t2/3 and \u03b2t = \u03b2\u03b2C\n\n(8)\nM (\u00b7),a\u223c\u03c0(s,\u00b7,\u03b8)[\u2207\u03b8 log \u03c0(s, a, \u03b8)\u2207\u03b8 log \u03c0(s, a, \u03b8)T ] is the Fisher information\nwhere G(\u03b8) = Es\u223cd\u03b8\nmatrix of the policy. To help differentiate between the two types of policy gradients, we refer to\nthe non-natural policy gradient as the vanilla policy gradient hereafter. One view of the natural\ngradient is that it corrects for the skewing of the vanilla gradient that is induced by a particular\nparameterization of the policy [2]. 
Empirical studies have found that ascending the natural gradient results in faster convergence [1, 5, 15]. One algorithm for ascending the natural policy gradient is the Natural-Gradient Actor-Critic with Advantage Parameters [5], which we abbreviate as NAC and use in our case study.

VAC and NAC have a property, which we reference later as Property 1, that is common to almost all other actor-critic methods: if the policy is a function of x = f(s), for any f, such that π(s, a, θ) can be written as π(x, a, θ) or π(f(s), a, θ), then updates to the policy parameters θ are independent of s given x, a, and δ_t. For example, if s = (s_1, s_2) and f(s) = s_1 so that the policy is a function of only s_1, then the update to θ requires knowledge of only s_1, a, and δ_t, and not s_2. This is the crucial property that will allow the actor to update given only local information.

VAC and NAC, as well as all other algorithms referenced, require computation of ∇_θ log π(s, a, θ). Hence, none of these methods allow for local updates to modular policies, which makes them undesirable from a biological standpoint, and impractical for policies for which this derivative is prohibitively difficult to compute. However, by combining these methods with the CoMDP framework reviewed in Section 2.2 and by taking advantage of Property 1, the updates to the actor can be modified to satisfy the locality constraint.

2.2 Conjugate Markov Decision Processes

In this section we review the aspects of the conjugate Markov decision process (CoMDP) framework that are relevant to this work. Though Thomas and Barto [25] present the CoMDP framework for the discounted reward setting with finite state, action, and reward spaces, the extension to the average reward and infinite setting used here is straightforward. To solve M, one may create a network of agents A_1, A_2, ..., A_n, where A_i has output a^i ∈ A^i, where A^i is any space, though typically the reals or integers. All agents receive the same reward. We focus on the case where A_i = {A_i, C_i} are all actor-critics, i.e., they contain an actor, A_i, and a critic, C_i. The action a_t ∈ A for M is computed as a_t ∼ Γ(s, a^1, a^2, ..., a^n), for some distribution Γ. Each agent A_i has parameters θ^i defining its policy. We define θ̄^i = ∪_{j ∈ {1,2,...,n}−{i}} θ^j to be the parameters of all agents other than A_i. Each agent takes as input s^i, which contains the state of M and the outputs of an arbitrary number of other agents: s^i ∈ S × ∏_j A^j, where ∏_j A^j is the Cartesian product of the output sets of all the A_j whose output is used as input to A_i. Notice that the s^i are not the components of s, but rather s is the state of M, while s^i is the input to A_i. We require the graph with nodes for each A_i, and a directed edge from A_i to A_j if A_j takes a^i as part of its input, to be acyclic. Thus, the network of agents must be feed-forward, so we can assume an ordering of the A_i such that if a^j is part of s^i, then j < i. When executing the modular policy, the policies of the A_i can be executed in this order so that all requisite information for computing a module's output is always available. Thomas and Barto [25] call each A_i a coagent and the entire network a coagent network.

An agent A_i may treat the rest of the network and M as its environment, where it sees states s^i_t and takes actions a^i_t resulting in reward r_t (the same for all A_i) and a transition to state s^i_{t+1}. This environment is called a conjugate Markov decision process (CoMDP), which is an MDP M^i = (S × ∏_j A^j, A^i, P^i, R^i, d^i_{s0}), where S × ∏_j A^j is the state space, A^i is the action space, P^i(s^i, a^i, ŝ^i) = Pr{s^i_{t+1} = ŝ^i | s^i_t = s^i, a^i_t = a^i, M, θ̄^i}, R^i(s^i, a^i) = E[r_t | s^i_t = s^i, a^i_t = a^i, M, θ̄^i] gives the expected reward when taking action a^i in state s^i, and d^i_{s0} is the distribution over initial states of M^i. We write π^i(s^i, a^i, θ^i) to denote A_i's policy for M^i. Notice that M^i depends on θ̄^i. Thus, as the policies of other coagents change, so too does the CoMDP with which A_i interacts. While [25] considers generic methods for handling this nonstationarity, we focus on the special case in which all A_i are policy gradient methods.

Theorem 3 of [25] states that the policy gradient of M can be decomposed into the policy gradients for all of the CoMDPs, M^i:

    ∂J_M(θ^1, θ^2, ..., θ^n) / ∂[θ^1, θ^2, ..., θ^n]
        = [ ∂J_M(θ^1, ..., θ^n)/∂θ^1, ∂J_M(θ^1, ..., θ^n)/∂θ^2, ..., ∂J_M(θ^1, ..., θ^n)/∂θ^n ]
        = [ ∂J_{M^1}(θ^1)/∂θ^1, ∂J_{M^2}(θ^2)/∂θ^2, ..., ∂J_{M^n}(θ^n)/∂θ^n ].   (9)

Thus, if each coagent computes and follows the policy gradient based on the local environment that it sees, the coagent network will follow its policy gradient on M.

Thomas and Barto [25] also show that the value functions for M and all the CoMDPs are the same for all s_t, if the additional state components of M^i are drawn according to the modular policy:

    V^{θ^1}_{M^1}(s^1_t) = V^{θ^2}_{M^2}(s^2_t) = ... = V^{θ^n}_{M^n}(s^n_t) = V^θ_M(s_t).   (10)

The state-value based TD error is therefore the same as well:

    δ_t = r_t − J_M(θ) + V^θ_M(s_{t+1}) − V^θ_M(s_t) = r_t − J_{M^i}(θ^i) + V^{θ^i}_{M^i}(s^i_{t+1}) − V^{θ^i}_{M^i}(s^i_t),  ∀i.   (11)

This means that, if the coagents require δ_t, we can maintain a global critic, C, that keeps an estimate of V^θ_M, which can be used to replace every C_i by computing δ_t and broadcasting it to each A_i. Because all A_i share a global critic, C, all that remains of each module is the actor A_i. We therefore refer to each A_i as a module.

Notice that the CoMDPs, M^i, and thus the coagents, A_i, have S as part of their state space. This is required for M^i to remain Markov. However, if the actor's policy is a function of some x^i = f(s^i) for any f, i.e., the policy can be written as π^i(x^i, a^i, θ^i), then, by Property 1, updates to the actor's policy require only the TD error, a^i, and x^i. Hence, the full Markovian state representation is only needed by the global critic, C. The modules, A_i, will be able to perform their updates given only their input: the x^i portion of the state of M^i.

3 Methods

The CoMDP framework tells us that, if each module is an actor that computes the policy gradient for its local environment (CoMDP), then the entire modular actor will ascend its policy gradient. 
Actor-critics satisfying Property 1 are able to perform their policy updates given only local information: the policy's input x_t, the most recent action a_t, and the TD error δ_t. Combining these two results, each module A_i can compute its update given only its local input x^i_t, most recent action a^i_t, and the TD error δ_t. We call any network of coagents, each using policy gradient methods, a policy gradient coagent network (PGCN). One PGCN is the vanilla coagent network (VCN), which uses VAC for all modules (coagents), and maintains a global critic that computes and broadcasts δ_t. The VCN algorithm is depicted diagrammatically in Figure 2, where ψ^i_{x^i a^i} = ∇_{θ^i} log π^i(x^i, a^i, θ^i) are the compatible features for the ith module. Notice that δ_t ψ^i_{x^i_t a^i_t} is an unbiased estimate of the policy gradient for M^i [5], which is an unbiased estimate of part of the policy gradient for M by Equation 9.

Figure 2: Diagram of the vanilla coagent network (VCN) algorithm. The global critic observes s_t, r_t, s_{t+1} tuples, updates its estimate Ĵ of the average reward, which it uses to compute the TD error δ_t, which is then broadcast to all of the modules, A_i. Lastly, it updates the parameters, v, of its state-value estimate. Each module A_i draws its actions from π^i(x^i_t, ·, θ^i_t) and then computes updates to θ^i given its input x^i_t, action a^i_t, and the TD error, δ_t, which was broadcast by the global critic.

To implement VCN, observe the current state s_t, compute the module outputs a^i_t, and then a_t = Γ(s_t, a^1_t, a^2_t, ..., a^n_t). This action will result in a transition to s_{t+1} with reward r_t. Given s_t, r_t, and s_{t+1}, the global critic can execute to produce δ_t, which can then be used to train each module A_i. Notice that the A_i can update concurrently. This process then repeats.

4 The Decomposed Natural Policy Gradient

Another interesting PGCN, which we call a natural coagent network (NCN), would use coagents that ascend the natural policy gradient, e.g., NAC. However, Equation 9 does not hold for natural gradients:

    ∇̃_θ J_M(θ) ≠ [ ∇̃_{θ^1} J_{M^1}(θ^1), ∇̃_{θ^2} J_{M^2}(θ^2), ..., ∇̃_{θ^n} J_{M^n}(θ^n) ] ≡ ∇̂_θ J_M(θ),   (12)

where θ = {θ^1, θ^2, ..., θ^n} and ∇̂_θ J_M(θ) is an estimate of the natural policy gradient that we call the decomposed natural policy gradient, which has an implicit dependence on how θ is partitioned into n components. Hence, a PGCN where each module computes its natural policy gradient would not follow the natural policy gradient, but rather ∇̂_θ J_M(θ) = Ĝ(θ)^{−1} ∇_θ J_M(θ), an approximation thereto, where Ĝ(θ) is an approximation of G(θ), constructed by:

    Ĝ(θ)_{ij} = 0              if the ith and jth elements of θ are in different modules,
    Ĝ(θ)_{ij} = G(θ^k)_{ij}    if the ith and jth elements of θ are both in module A_k,   (13)

where G(θ^k) is the Fisher information matrix of the kth module's policy:

    G(θ^k) = E_{s^k ∼ d^{θ^k}_{M^k}(·), a^k ∼ π^k(x^k, ·, θ^k)}[ ∇_{θ^k} log π^k(x^k, a^k, θ^k) ∇_{θ^k} log π^k(x^k, a^k, θ^k)^T ],   (14)

where G(θ^k)_{ij} in Equation 13 denotes the entry corresponding to the ith and jth elements of θ, which are elements of θ^k.

The decomposed natural policy gradient is intuitively a trade-off between the natural policy gradient and the vanilla policy gradient, depending on the granularity of modularization. For example, if the policy is one module, A_1, and Γ(s, a^1) = a^1, then the decomposed natural policy gradient is trivially the same as the natural policy gradient. On the other hand, as the policy is broken into more and more modules, the gradient begins to differ more and more from the natural policy gradient, because the structure of the modular policy begins to influence the direction of the gradient. With finer granularity, Ĝ(θ) will tend to a diagonal approximation of the identity matrix. If the modular actor contains one parameter per module and the module inputs are normalized, it is possible for Ĝ(θ)^{−1} = I, in which case the decomposed natural policy gradient will be equivalent to the vanilla policy gradient. 
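Because the cross-module entries of Ĝ(θ) in Equation 13 are zero, computing Ĝ(θ)^{−1} ∇_θ J_M(θ) reduces to inverting each module's Fisher block separately. The following sketch illustrates this block-by-block computation; the function name and the assumption that per-module Fisher estimates are supplied as a list are ours, not the paper's.

```python
import numpy as np

def decomposed_natural_gradient(grad, module_fishers):
    # grad: vanilla policy gradient for the full parameter vector theta.
    # module_fishers: per-module Fisher matrices G(theta^k), in the same
    # order and with the same sizes as the modules' slices of theta.
    # Since G_hat (Equation 13) is block diagonal, G_hat^{-1} grad is
    # computed one module block at a time.
    out, i = np.empty_like(grad), 0
    for G_k in module_fishers:
        d = G_k.shape[0]
        out[i:i + d] = np.linalg.solve(G_k, grad[i:i + d])
        i += d
    assert i == len(grad), 'module blocks must cover all of theta'
    return out
```

With a single module, this reduces to the full natural gradient; with one-parameter modules whose Fisher entries are one, it reduces to the vanilla gradient, mirroring the granularity trade-off described above.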
Hence, the coarser the modularization (fewer modules), the closer the decomposed natural policy gradient is to the natural policy gradient, while the finer the modularization (more modules), the closer the decomposed natural policy gradient may come to the vanilla policy gradient.

Each term of the decomposed natural policy gradient is within ninety degrees of the vanilla policy gradient, so a system will converge to a local optimum if it follows the decomposed natural policy gradient and the step size is decayed appropriately.

5 Variance of Gradient Estimates

Let ψ_{s,a,i} = ∇_{θ^i} log π(s, a, θ) be the components of ψ_{s,a} that correspond to the parameters of A_i. Both δ_t ψ^i_{x^i a^i}, the update to the parameters of A_i by VCN, and δ_t ψ_{s,a,i}, the update by VAC, are unbiased estimates of ∇_{θ^i} J_{M^i}(θ^i) = ∇_{θ^i} J_M(θ). This means that E[δ_t ψ_{s,a,i}] = E[δ_t ψ^i_{x^i a^i}], which is particularly interesting because δ_t is the same for both, so the only difference between the two is the compatible features used. Whereas ψ_{s,a,i} requires computation of the derivative of the entire modular policy, π, ψ^i_{x^i a^i} only requires differentiation of π^i. Thus, the latter satisfies the locality constraint, and is also easier to compute. However, this benefit comes at the cost of higher variance.

This increase in variance appears regardless of the actor-critic method used. In this section we focus on VAC due to its simplicity, though the argument that stochasticity in the CoMDP is the root cause of the variance of gradient estimates carries over to PGCNs using other actor-critic methods as well. This increase in variance has also been observed in multi-agent reinforcement learning research as additional stochasticity in one agent's environment when another explores [18].

Consider using VAC on any MDP. Bhatnagar et al. 
[5] show that E[δ_t | s_t = s, a_t = a, M, θ] can be viewed as the advantage of taking action a_t in state s_t over following the policy induced by θ. If it is positive, taking a_t in s_t is better than following π; if it is negative, then a_t is worse. So, following E[δ_t ψ_{s_t a_t}] increases the likelihood of a_t if it is advantageous, and decreases the likelihood of a_t if it is disadvantageous. However, our updates use samples rather than the expected value, so an action a_t that is actually worse could, due to stochasticity in the environment, result in a TD error that suggests it is advantageous. Thus, the gradient estimates are influenced by the stochasticity of the transition function P and reward function R. If P or R is very stochastic, the same s, a pair will result in seemingly random TD errors, which manifests as large variance in the δ_t ψ_{s_t a_t} samples.

Now consider the stochasticity in M and M^i. The state transitions of M^i depend not only on M's transition function, but may also depend on the actions selected by some or all A_j, j ≠ i. Consider the modular actor from Figure 1 in the case where the transitions and rewards of M are deterministic. The transition function for M^3, the CoMDP for A_3, remains relatively deterministic because its actions completely determine the transitions of M. We therefore expect the variance in the gradient estimate for the parameters of A_3 to be only slightly higher for VCN than it is for VAC. However, the actions of A_1 and A_2 influence the transitions of M indirectly through the actions of A_3, which adds a layer of stochasticity to their transition functions. We therefore expect policy gradient estimates for their parameters to have higher variance. 
In summary, the stochasticity in the CoMDPs is responsible for VCN's policy gradient estimates having higher variance than those of VAC.

We performed a simple study using the modular actor from Figure 1 on a 10 × 10 gridworld with deterministic actions {up, down, left, right}, a reward of −1 for all transitions, factored state (x̄, ȳ), and a terminal state at (10, 10). For the modular actor, A^1 = A^2 = {0, 1}, A^3 = {up, down, left, right}, A_1 and A_2 both received the full state (x̄, ȳ), and all modules used linear function approximation rather than a tabular state representation. All modules also used softmax action selection:

    π^i(x^i, a, θ^i) = e^{τ θ^i_a · x^i} / Σ_{â ∈ A^i} e^{τ θ^i_â · x^i},   (15)

where τ is a constant scaling the amount of exploration, and where the parameters θ^i for the ith module contain a weight vector θ^i_a for each action a ∈ A^i. The critic is common to both methods, and our goal is not to compare methods for value function approximation, so we used a tabular critic. With all actor weights fixed and selected randomly with uniform distribution from (−1, 1), we first observed that the means of the updates δ_t ψ_{s_t, a_t, i} and δ_t ψ^i_{x^i a^i} are approximately equal, as expected, and then computed the variance of both updates. The results are shown in Figure 3(a).

Figure 3: (a) Variance of the VAC and VCN updates for weights in each of the three modules. (b) Variance of updates using VCN with various ε. Standard error bars are provided (n = 100). 
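The module policy in Equation 15 can be sketched directly; keying the weight vectors by action in a dictionary is our illustrative choice, not the paper's representation.

```python
import numpy as np

def module_action_probs(theta, x, actions, tau):
    # Softmax action selection for one module (Equation 15).
    # theta: maps each action a in `actions` to its weight vector theta^i_a.
    # x: the module's input x^i (linear features, not a tabular state).
    # tau: temperature scaling exploration; larger tau -> greedier module.
    prefs = np.array([tau * theta[a].dot(x) for a in actions])
    prefs = prefs - prefs.max()          # numerical stability
    p = np.exp(prefs)
    return p / p.sum()
```

Note the role of τ: because it multiplies the action preferences inside the exponent, increasing τ concentrates probability on the preferred action, which is how exploration is reduced in Sections 6 and 7.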
As predicted, the variance of the gradient estimates for each parameter of A_1 and A_2 is larger for VCN, though the variance of the gradient estimate for each parameter of A_3 is similar for VCN and VAC.

6 Variance Mitigation

To mitigate the increase in the variance of gradient estimates, we observe that, in general, the additional variance due to the other modules can be completely removed for a module A_i if every other module is made to be deterministic. This is not practical because every module must explore in order to learn. However, we can approximate it by decreasing the exploration of each module, making its policy less stochastic and more greedy. For example, every module could take a deterministic greedy action without performing any updates with probability 1 − ε for some ε ∈ [0, 1). With probability ε the module would act using softmax action selection and update its parameters. As ε → 0, the probability of two modules exploring simultaneously goes to zero, decreasing the variance in M^i but also decreasing the percent of time steps during which each module trains. When ε = 1, every module explores and updates on every step, so the algorithm is the original PGCN algorithm (VCN if using VAC for each module).

We repeated the gridworld study of the variance in gradient estimates for various ε. The results, shown in Figure 3(b), show that smaller ε can be effective in reducing the variance of gradient estimates. Notice that VCN using ε = 1 is equivalent to VCN as described previously, so the points for ε = 1 in Figure 3(b) correspond exactly to the VCN data in Figure 3(a). Thus, if the variance in gradient estimates precludes learning, we suggest making the policies of the modules more deterministic by decreasing exploration and increasing exploitation.

Several questions remain. 
First, though the variance decreases, the amount of exploration also decreases, so what is the net effect on learning speed? Second, how does PGCN compare to an actor-critic where ∇_θ π(s, a, θ) is known? Lastly, is there a significant loss in performance when using the decomposed natural policy gradient as opposed to the true natural policy gradient? We attempt to answer these questions in the following section.

Algorithm | α    | β    | c    | τ12  | τ3  | Average Reward | Standard Error
VAC       | 0.75 | 0.25 | 0.13 | 0.5  | 2.5 | −23.13         | 0.09
VCN       | 0.25 | 0.1  | 0.04 | 0.1  | 3.5 | −29.15         | 0.09
NAC       | 0.5  | 0.1  | 0.02 | 0.05 | 1   | −24.91         | 0.08
NCN       | 0.5  | 0.1  | 0.02 | 0.05 | 1   | −28.32         | 0.14

Table 1: Best parameters found for each algorithm. The average reward per episode and standard error are computed using 10000 samples (each a lifetime of 75 episodes). The optimization tested each parameter set for 300 lifetimes, so the best parameters found still occasionally perform poorly. We found the above parameters to perform poorly (average reward less than −200) approximately one in 500 lifetimes. These outliers were removed for the average reward calculations. Random policy parameters average less than −5000 reward per episode.

7 Case Study

In this section we compare the learning speed of VAC, VCN, NAC, and NCN. Our goal is to determine whether VCN and NCN perform similarly to VAC and NAC, which are established methods [6], even though VCN and NCN's modules do not have access to ∂π/∂θ^i. To perform a thorough analysis, we again use the modular actor depicted in Figure 1, as in Section 5. We therefore require a problem with a simple optimal policy. 
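Such a problem is the 10 × 10 gridworld from Section 5, reused in the case study. A minimal sketch of its dynamics follows; the 1-based coordinates, the start state, and the clipping of off-grid moves are our assumptions, since the text specifies only the actions, the −1 reward, and the terminal state at (10, 10).

```python
class GridWorld:
    # 10 x 10 gridworld: deterministic moves, reward -1 per transition,
    # factored state (x, y), terminal state at (10, 10).
    ACTIONS = ('up', 'down', 'left', 'right')
    MOVES = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

    def __init__(self):
        self.state = (1, 1)   # assumed start state (not specified in the text)

    def step(self, action):
        dx, dy = self.MOVES[action]
        # Moves that would leave the grid are assumed to leave the agent in place.
        x = min(10, max(1, self.state[0] + dx))
        y = min(10, max(1, self.state[1] + dy))
        self.state = (x, y)
        done = self.state == (10, 10)
        return self.state, -1.0, done
```

Because every transition costs −1 and episodes end at (10, 10), the optimal policy simply moves toward the far corner, which is why the average rewards in Table 1 can be read as (negative) path lengths plus exploration overhead.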
We select the gridworld from Section 5, and again use a tabular critic in order to focus on the difference in policy improvements. To decrease the size of the parameter space, we did not decay α or β. For all four algorithms, we performed a grid search for the α, β, c, τ12, and τ3 that maximize the average reward over 75 episodes, where τ12 is the τ used by A1 and A2, while τ3 is that of A3. The best parameters are provided in Table 1. Recall that the increased variance in VCN updates arises because A1 and A2's actions influence the transitions of M only indirectly, through the actions of A3. Though decreased exploration is beneficial in general, for this particular modular policy it is therefore particularly important that A3's exploration be decreased by increasing τ3. The optimization does just this, balancing the trade-off between exploration and the variance of gradient estimates by selecting a larger τ3 for VCN than for VAC. The mean ratio τ3/τ12 for the top 25 of the 202,300 parameter sets tested was 5.48 for VAC and 31.04 for VCN, further emphasizing the relatively smaller exploration of A3. For NAC and NCN, the exploration parameters are identical, suggesting that the additional variance of gradient estimates was not significant.
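The parameter optimization just described is, in essence, an exhaustive grid search over the cross product of candidate values. A minimal sketch follows; run_lifetime is a toy stand-in for the actual experiment (running a 75-episode lifetime, repeated 300 times per setting in the paper, and returning average reward), and all names and grid values here are illustrative.

```python
import itertools

def run_lifetime(alpha, beta, c, tau12, tau3):
    """Toy stand-in for evaluating one parameter setting; the real
    evaluation would run the gridworld experiment and return the
    average reward over a lifetime of 75 episodes."""
    return -(abs(alpha - 0.5) + abs(beta - 0.1) + abs(c - 0.02)
             + abs(tau12 - 0.05) + abs(tau3 - 1.0))

def grid_search(grids):
    """Test every combination in the cross product of the parameter
    grids and keep the setting with the highest score."""
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*grids.values()):
        params = dict(zip(grids, values))
        score = run_lifetime(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

grids = {
    "alpha": [0.25, 0.5, 0.75],
    "beta":  [0.1, 0.25],
    "c":     [0.02, 0.04, 0.13],
    "tau12": [0.05, 0.1, 0.5],
    "tau3":  [1.0, 2.5, 3.5],
}
best, score = grid_search(grids)
```

With separate grids for τ12 and τ3, a search of this form can discover the asymmetric exploration (larger τ3) that the text reports for VCN.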
This is likely due to the policy gradient estimates being filtered before being used. The average rewards during a lifetime are similar, suggesting that, even though the variance of gradient estimates can be orders of magnitude larger for VCN with τ12 = τ3 = 1 (Figure 3(a)), exploration can be tuned such that learning speed is not significantly diminished.

8 Conclusion

We have devised a class of algorithms, policy gradient coagent networks (PGCNs), and two specific instantiations thereof, the natural coagent network (NCN) and the vanilla coagent network (VCN), which allow modules within an actor to update given only local information. We showed that NCN ascends the decomposed natural policy gradient, an approximation to the natural policy gradient, while VCN ascends the vanilla policy gradient. We discussed the theoretical properties of both the decomposed natural policy gradient and the increase in the variance of gradient estimates when using PGCNs. Lastly, we presented a case study to compare NCN and VCN to two existing actor-critic methods, NAC and VAC. We showed that, even though NAC and VAC are provided with additional non-local information, VCN and NCN perform comparably. We pointed out how VCN's similar performance is achieved by decreasing exploration in order to decrease the stochasticity of each module's CoMDP, and thus the variance of the gradient estimates.

Acknowledgements

We would like to thank Scott Kuindersma, Scott Niekum, Bruno Castro da Silva, Andrew Barto, Sridhar Mahadevan, the members of the Autonomous Learning Laboratory, and the reviewers for their feedback and contributions to this paper.

References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[2] S. Amari and S. Douglas. Why natural gradient?
In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), volume 2, pages 1213–1216, 1998.

[3] A. G. Barto. Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229–256, 1985.

[4] A. G. Barto. Adaptive critics and the basal ganglia. Models of Information Processing in the Basal Ganglia, pages 215–232, 1995.

[5] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009.

[6] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Technical Report TR09-10, University of Alberta Department of Computing Science, June 2009.

[7] A. Claridge-Chang, R. Roorda, E. Vrontou, L. Sjulson, H. Li, J. Hirsh, and G. Miesenbock. Writing memories with light-addressable reinforcement circuitry. Cell, 139(2):405–415, 2009.

[8] F. H. C. Crick. The recent excitement about neural networks. Nature, 337:129–132, 1989.

[9] N. Daw and K. Doya. The computational neurobiology of learning and reward. Current Opinion in Neurobiology, 16:199–204, 2006.

[10] K. Doya. What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12:961–974, 1999.

[11] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.

[12] M. J. Frank and E. D. Claus. Anatomy of a decision: Striato-orbitofrontal interactions in reinforcement learning, decision making, and reversal. Psychological Review, 113(2):300–326, 2006.

[13] E. Ludvig, R. Sutton, and E. Kehoe. Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20:3034–3054, 2008.

[14] R. C. O'Reilly. The LEABRA model of neural interactions and learning in the neocortex.
PhD thesis, Carnegie Mellon University, 1996.

[15] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71:1180–1190, 2008.

[16] F. Rivest, Y. Bengio, and J. Kalaska. Brain inspired reinforcement learning. In Advances in Neural Information Processing Systems, pages 1129–1136, 2005.

[17] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing. Volume 1: Foundations. MIT Press, Cambridge, MA, 1986.

[18] T. W. Sandholm and R. H. Crites. Multiagent reinforcement learning in the iterated prisoner's dilemma. Biosystems, 37:147–166, 1996.

[19] W. Schultz, P. Dayan, and P. Montague. A neural substrate of prediction and reward. Science, 275:1593–1599, 1997.

[20] A. Stocco, C. Lebiere, and J. Anderson. Conditional routing of information to the cortex: A model of the basal ganglia's role in cognitive coordination. Psychological Review, 117(2):541–574, 2010.

[21] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

[22] R. Sutton and A. Barto. Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88:135–140, 1981.

[23] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[24] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12, pages 1057–1063, 2000.

[25] P. Thomas and A. Barto. Conjugate Markov decision processes. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.

[26] R. J. Williams. A class of gradient-estimating algorithms for reinforcement learning in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks, 1987.

[27] R. J. Williams.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, 1992.

[28] D. Zipser and R. A. Andersen. A back propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331:679–684, 1988.