{"title": "DAC: The Double Actor-Critic Architecture for Learning Options", "book": "Advances in Neural Information Processing Systems", "page_first": 2012, "page_last": 2022, "abstract": "We reformulate the option framework as two parallel augmented MDPs. Under this novel formulation, all policy optimization algorithms can be used off the shelf to learn intra-option policies, option termination conditions, and a master policy over options. We apply an actor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture. Furthermore, we show that, when state-value functions are used as critics, one critic can be expressed in terms of the other, and hence only one critic is necessary. We conduct an empirical study on challenging robot simulation tasks. In a transfer learning setting, DAC outperforms both its hierarchy-free counterpart and previous gradient-based option learning algorithms.", "full_text": "DAC: The Double Actor-Critic\n\nArchitecture for Learning Options\n\nShangtong Zhang, Shimon Whiteson\n\nDepartment of Computer Science\n\n{shangtong.zhang, shimon.whiteson}@cs.ox.ac.uk\n\nUniversity of Oxford\n\nAbstract\n\nWe reformulate the option framework as two parallel augmented MDPs. Under\nthis novel formulation, all policy optimization algorithms can be used off the shelf\nto learn intra-option policies, option termination conditions, and a master policy\nover options. We apply an actor-critic algorithm on each augmented MDP, yielding\nthe Double Actor-Critic (DAC) architecture. Furthermore, we show that, when\nstate-value functions are used as critics, one critic can be expressed in terms of the\nother, and hence only one critic is necessary. We conduct an empirical study on\nchallenging robot simulation tasks. In a transfer learning setting, DAC outperforms\nboth its hierarchy-free counterpart and previous gradient-based option learning\nalgorithms.\n\n1\n\nIntroduction\n\nTemporal abstraction (i.e., hierarchy) is a key component in reinforcement learning (RL). A good\ntemporal abstraction usually improves exploration (Machado et al., 2017b) and enhances the inter-\npretability of agents\u2019 behavior (Smith et al., 2018). The option framework (Sutton et al., 1999), which\nis commonly used to formulate temporal abstraction, gives rise to two problems: learning options\n(i.e., temporally extended actions) and learning a master policy (i.e., a policy over options, a.k.a. an\ninter-option policy).\nA Markov Decision Process (MDP, Puterman 2014) with options can be interpreted as a Semi-MDP\n(SMDP, Puterman 2014), and a master policy is used in this SMDP for option selection. While\nin principle, any SMDP algorithm can be used to learn a master policy, such algorithms are data\ninef\ufb01cient as they cannot update a master policy during option execution. To address this issue, Sutton\net al. (1999) propose intra-option algorithms, which can update a master policy at every time step\nduring option execution. Intra-option Q-Learning (Sutton et al., 1999) is a value-based intra-option\nalgorithm and has enjoyed great success (Bacon et al., 2017; Riemer et al., 2018; Zhang et al., 2019b).\nHowever, in the MDP setting, policy-based methods are often preferred to value-based ones because\nthey can cope better with large action spaces and enjoy better convergence properties with function\napproximation. Unfortunately, theoretical study for learning a master policy with policy-based\nintra-option methods is limited (Daniel et al., 2016; Bacon, 2018) and its empirical success has not\nbeen witnessed. This is the \ufb01rst issue we address in this paper.\nRecently, gradient-based option learning algorithms have enjoyed great success (Levy and Shimkin,\n2011; Bacon et al., 2017; Smith et al., 2018; Riemer et al., 2018; Zhang et al., 2019b). However,\nmost require algorithms that are customized to the option-based SMDP. Consequently, we cannot\ndirectly leverage recent advances in gradient-based policy optimization from MDPs (e.g., Schulman\net al. 2015, 2017; Haarnoja et al. 2018). This is the second issue we address in this paper.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fTo address these issues, we reformulate the SMDP of the option framework as two augmented\nMDPs. Under this novel formulation, all policy optimization algorithms can be used for option\nlearning and master policy learning off the shelf and the learning remains intra-option. We apply an\nactor-critic algorithm on each augmented MDP, yielding the Double Actor-Critic (DAC) architecture.\nFurthermore, we show that, when state-value functions are used as critics, one critic can be expressed\nin terms of the other, and hence only one critic is necessary. Finally, we empirically study the\ncombination of DAC and Proximal Policy Optimization (PPO, Schulman et al. 2017) in challenging\nrobot simulation tasks. In a transfer learning setting, DAC+PPO outperforms both its hierarchy-free\ncounterpart, PPO, and previous gradient-based option learning algorithms.\n\n2 Background\nWe consider an MDP consisting of a state space S, an action space A, a reward function r : S \u00d7A \u2192\nR, a transition kernel p : S \u00d7 S \u00d7 A \u2192 [0, 1], an initial distribution p0 : S \u2192 [0, 1] and a discount\nfactor \u03b3 \u2208 [0, 1). We refer to this MDP as M\n= (S,A, r, p, p0, \u03b3) and consider episodic tasks. In the\n.\noption framework (Sutton et al., 1999), an option o is a triple of (Io, \u03c0o, \u03b2o), where Io is an initiation\nset indicating where the option can be initiated, \u03c0o : A \u00d7 S \u2192 [0, 1] is an intra-option policy, and\n\u03b2o : S \u2192 [0, 1] is a termination function. In this paper, we consider Io \u2261 S following Bacon et al.\n(2017); Smith et al. (2018). We use O to denote the option set and assume all options are Markov.\nWe use \u03c0 : O \u00d7 S \u2192 [0, 1] to denote a master policy and consider the call-and-return execution\nmodel (Sutton et al., 1999). Time-indexed capital letters are random variables. At time step t, an\nagent at state St either terminates the previous option Ot\u22121 w.p. \u03b2Ot\u22121(St) and initiates a new option\nOt according to \u03c0(\u00b7|St), or proceeds with the previous option Ot\u22121 w.p. 1 \u2212 \u03b2Ot\u22121 (St) and sets\n= Ot\u22121. Then an action At is selected according to \u03c0Ot(\u00b7|St). The agent gets a reward Rt+1\n.\nOt\nsatisfying E[Rt+1] = r(St, At) and proceeds to a new state St+1 according to p(\u00b7|St, At). Under\nthis execution model, we have\n\n(cid:88)\n\np(St+1|St, Ot) =\np(Ot|St, Ot\u22121) = (1 \u2212 \u03b2Ot\u22121 (St))IOt\u22121=Ot + \u03b2Ot\u22121(St)\u03c0(Ot|St),\n\n\u03c0Ot(a|St)p(St+1|St, a),\n\na\n\nq\u03c0(s, o, a)\n\n.\n\n(s, o) and an action a, we de\ufb01ne\n\np(St+1, Ot+1|St, Ot) = p(St+1|St, Ot)p(Ot+1|St+1, Ot),\n\n= E\u03c0,O,p,r[(cid:80)\u221e\n\ni=1 \u03b3i\u22121Rt+i | St = s, Ot = o, At = a].\n\nwhere I is the indicator function. With a slight abuse of notations, we de\ufb01ne r(s, o)\n\n(cid:80)\n.\n=\na \u03c0o(s, a)r(s, a). The MDP M and the options O form an SMDP. For each state-option pair\nThe state-option value of \u03c0 on the SMDP is q\u03c0(s, o) =(cid:80)\non the SMDP is v\u03c0(s) =(cid:80)\nq\u03c0(s, o) = r(s, o) + \u03b3(cid:80)\n\na \u03c0o(a|s)q\u03c0(s, o, a). The state value of \u03c0\nu\u03c0(o, s(cid:48)) = [1 \u2212 \u03b2o(s(cid:48))]q\u03c0(s(cid:48), o) + \u03b2o(s(cid:48))v\u03c0(s(cid:48)),\nwhere u\u03c0(o, s(cid:48)) is the option-value upon arrival (Sutton et al., 1999). Correspondingly, we have the\noptimal master policy \u03c0\u2217 satisfying v\u03c0\u2217 (s) \u2265 v\u03c0(s)\u2200(s, \u03c0). We use q\u2217 to denote the state-option\nvalue function of \u03c0\u2217.\nMaster Policy Learning: To learn the optimal master policy \u03c0\u2217 given a \ufb01xed O, one value-based\napproach is to learn q\u2217 \ufb01rst and derive \u03c0\u2217 from q\u2217. We can use SMDP Q-Learning to update an\nestimate Q for q\u2217 as\n\no \u03c0(o|s)q\u03c0(s, o). They are related as\ns(cid:48) p(s(cid:48)|s, o)u\u03c0(o, s(cid:48)),\n\ni=t \u03b3i\u2212tRi+1 + \u03b3k\u2212t maxo Q(Sk, o) \u2212 Q(St, Ot)(cid:1),\n\nQ(St, Ot) \u2190 Q(St, Ot) + \u03b1(cid:0)(cid:80)k\n\nwhere we assume the option Ot initiates at time t and terminates at time k (Sutton et al., 1999). Here\nthe option Ot lasts k\u2212t steps. However, SMDP Q-Learning performs only one single update, yielding\nsigni\ufb01cant data inef\ufb01ciency. This is because SMDP algorithms simply interpret the option-based\nSMDP as a generic SMDP, ignoring the presence of options. By contrast, Sutton et al. (1999) propose\nto exploit the fact that the SMDP is generated by options, yielding an update rule:\nQ(St, Ot) \u2190 Q(St, Ot) + \u03b1[Rt+1 + \u03b3U (Ot, St+1) \u2212 Q(St, Ot)]\n\n=(cid:0)1 \u2212 \u03b2Ot(St+1)(cid:1)Q(St+1, Ot) + \u03b2Ot(St+1) max\n\nU (Ot, St+1)\n\nQ(St+1, o).\n\n(1)\n\n.\n\no\n\n2\n\n\fThis update rule is ef\ufb01cient in that it updates Q every time step. However, it is still inef\ufb01cient in that it\nonly updates Q for the executed option Ot. We refer to this property as on-option. Sutton et al. (1999)\nfurther propose Intra-option Q-Learning, where the update (1) is applied to every option o satisfying\n\u03c0o(At|St) > 0. We refer to this property as off-option. Intra-option Q-Learning is theoretically\njusti\ufb01ed only when all intra-option policies are deterministic (Sutton et al., 1999). The convergence\nanalysis of Intra-option Q-Learning with stochastic intra-option policies remains an open problem\n(Sutton et al., 1999). The update (1) and the Intra-option Q-Learning can also be applied to off-policy\ntransitions.\nThe Option-Critic Architecture: Bacon et al. (2017) propose a gradient-based option learning\nalgorithm, the Option-Critic (OC) architecture. Assuming {\u03c0o}o\u2208O is parameterized by \u03bd and\n{\u03b2o}o\u2208O is parameterized by \u03c6, Bacon et al. (2017) prove that\n\n\u2207\u03bdv\u03c0(S0) =\n\n(cid:88)\n\u2207\u03c6v\u03c0(S0) = \u2212(cid:88)\n\ns,o\n\ns(cid:48),o\n\n(cid:88)\n\n\u03c1(s, o|S0, O0)\n\nq\u03c0(s, o, a)\u2207\u03bd\u03c0o(a|s),\n\n\u03c1(s(cid:48), o|S1, O0)(cid:0)q\u03c0(s(cid:48), o) \u2212 v\u03c0(s(cid:48))(cid:1)\u2207\u03c6\u03b2o(s(cid:48)),\n\na\n\n.\n\n.\n\nwhere \u03c1 de\ufb01ned in Bacon et al. (2017) is the unnormalized discounted state-option pair occupancy\nmeasure. OC is on-option in that given a transition (St, Ot, At, Rt+1, St+1), it updates only parame-\nters of the executed option Ot. OC provides the gradient for {\u03c0o, \u03b2o} and can be combined with any\nmaster policy learning algorithm. In particular, Bacon et al. (2017) combine OC with (1). Hence, in\nthis paper, we use OC to indicate this exact combination. OC has also been extended to multi-level\noptions (Riemer et al., 2018) and deterministic intra-option policies (Zhang et al., 2019b).\n= {\u03b8, \u03bd, \u03c6}.\n.\nInferred Option Policy Gradient: We assume \u03c0 is parameterized by \u03b8 and de\ufb01ne \u03be\n= (S0, A0, S1, A1, . . . , ST ) to denote a trajectory from {\u03c0,O, M}, where ST is a terminal\n.\nWe use \u03c4\nt=1 \u03b3t\u22121Rt | \u03c4 ] to denote the total expected discounted rewards along\nstate. We use r(\u03c4 )\n\u03c4. Our goal is to maximize J\nalong the trajectory as latent variables and marginalize over them when computing \u2207\u03beJ. In the\nInferred Option Policy Gradient (IOPG), Smith et al. (2018) show\n\n=(cid:82) r(\u03c4 )p(\u03c4 )d\u03c4. Smith et al. (2018) propose to interpret the options\no mt(o)\u03c0o(At|St)(cid:1)],\n\n= E[(cid:80)T\n\u2207\u03beJ = E\u03c4 [r(\u03c4 )(cid:80)T\nt=0 \u2207\u03be log p(At|Ht)] = E\u03c4 [r(\u03c4 )(cid:80)T\n\nt=0 \u2207\u03be log(cid:0)(cid:80)\n\n.\n= (S0, A0, . . . , St\u22121, At\u22121, St) is the state-action history and mt(o)\n\n= p(Ot = o|Ht)\n.\nwhere Ht\nis the probability of occupying an option o at time t. Smith et al. (2018) further show that mt can\nbe expressed recursively via (mt\u22121,{\u03c0o, \u03b2o}, \u03c0), allowing ef\ufb01cient computation of \u2207\u03beJ. IOPG\nis an off-line algorithm in that it has to wait for a complete trajectory before computing \u2207\u03beJ. To\nadmit online updates, Smith et al. (2018) propose to store \u2207\u03bemt\u22121 at each time step and use the\nstored \u2207\u03bemt\u22121 for computing \u2207\u03bemt, yielding the Inferred Option Actor Critic (IOAC). IOAC is\nbiased in that a stale approximation of \u2207\u03bemt\u22121 is used for computing \u2207\u03bemt. The longer a trajectory\nis, the more biased the IOAC gradient is. IOPG and IOAC are off-option in that given a transition\n(St, Ot, At, Rt+1, St+1), all options contribute to the gradient explicitly.\nAugmented Hierarchical Policy: Levy and Shimkin (2011) propose the Augmented Hierarchical\nPolicy (AHP) architecture. AHP reformulates the SMDP of the option framework as an augmented\n= B \u00d7 O \u00d7 A,\nMDP. The new state space is SAHP .\nwhere B .\n= {stop, continue} indicates whether to terminate the previous option or not. All policy\noptimization algorithms can be used to learn an augmented policy\n\n= O \u00d7 S. The new action space is AAHP .\n\n\u03c0AHP(cid:0)(Bt, Ot, At)|(Ot\u22121, St)(cid:1) .\n\n(cid:16)IBt=contIOt=Ot\u22121 + IBt=stop\u03c0(Ot|St)\n(cid:17)(cid:16)\n(cid:17)\n=\u03c0Ot(At|St)\nIBt=cont(1 \u2212 \u03b2Ot\u22121(St)) + IBt=stop\u03b2Ot\u22121 (St)\n\nunder this new MDP, which learns \u03c0 and {\u03c0o, \u03b2o} implicitly. Here Bt \u2208 B is a binary random\nvariable. In the formulation of \u03c0AHP, the term \u03c0(Ot|St) is gated by IBt=stop. Consequently, the\ngradient for the master policy \u03c0 is non-zero only when an option terminates (also see Equation 23\nin Levy and Shimkin (2011)). This suggests that the master policy learning in AHP is SMDP-style.\nMoreover, as suggested by the term \u03c0Ot(At|St) in \u03c0AHP, the resulting gradient for an intra-option\npolicy \u03c0o is non-zero only when the option o is being executed (also see Equation 24 in Levy and\nShimkin (2011)). This suggests that the option learning in AHP is on-option. Similar augmented\nMDP formulation is also used in Daniel et al. (2016).\n\n3\n\n\f3 Two Augmented MDPs\nIn this section, we reformulate the SMDP as two augmented MDPs: the high-MDP MH and the\nlow-MDP ML. The agent makes high-level decisions (i.e., option selection) in MH according to\n\u03c0,{\u03b2o} and thus optimizes \u03c0,{\u03b2o}. The agent makes low-level decisions (i.e., action selection) in\nML according to {\u03c0o} and thus optimizes {\u03c0o}. Both augmented MDPs share the same samples\nwith the SMDP {O, M}.\nWe \ufb01rst de\ufb01ne a dummy option # and O+ .\n= O \u222a {#}. This dummy option is only for simplifying\nnotations and is never executed. In the high-MDP, we interpret a state-option pair in the SMDP as a\nnew state and an option in the SMDP as a new action. Formally speaking, we de\ufb01ne\n= O+ \u00d7 S, AH .\n= I\n\n= pH(cid:0)(Ot, St+1)|(Ot\u22121, St), AH\nt )(cid:1) .\n= rH(cid:0)(Ot\u22121, St), Ot\n\n= {SH,AH, pH, pH\nMH .\nt , AH\n.\nt )\n= p0(S0)IO\u22121=#,\n\n= O,\n(cid:1) .\nt =Otp(St+1|St, Ot),\nAH\n\n(cid:0)(O\u22121, S0)(cid:1) .\n\n0 , rH, \u03b3}, SH .\n\npH(SH\n= pH\n.\n\nt+1|SH\n\n= r(St, Ot)\n\nrH(SH\n\nt , AH\nt )\n\n0\n\n.\n\n0 (SH\npH\n0 )\nWe de\ufb01ne a Markov policy \u03c0H on MH as\n= \u03c0H(Ot|(Ot\u22121, St))\n.\n\n\u03c0H(AH\n\nt |SH\nt )\n\n= p(Ot|St, Ot\u22121)IOt\u22121(cid:54)=# + \u03c0(St, Ot)IOt\u22121=#\n.\n\nIn the low-MDP, we interpret a state-option pair in the SMDP as a new state and leave the action\nspace unchanged. Formally speaking, we de\ufb01ne\n\n= pL(cid:0)(St+1, Ot+1)|(St, Ot), At\n\n(cid:1) .\n0 , rL, \u03b3}, SL .\n\n= {SL,AL, pL, pL\nML .\nt+1|SL\npL(SL\nt , AL\n.\nt )\npL\n0 (SL\n.\n0 )\n\n= pL(cid:0)(S0, O0)(cid:1) .\n\n= p0(S0)\u03c0(S0, O0),\n\n= A,\n\n= S \u00d7 O, AL .\n= p(St+1|St, At)p(Ot+1|St+1, Ot),\nrL(SL\n\n= rL(cid:0)(St, Ot), At\n\n(cid:1) .\n\nt , AL\nt )\n\n.\n\n= r(St, At)\n\nWe de\ufb01ne a Markov policy \u03c0L on ML as\n.\n\n\u03c0L(AL\n\nt |SL\nt )\n\n= \u03c0L(cid:0)At|(St, Ot)(cid:1) .\n\n= \u03c0Ot(At|St)\n\nt\n\n1 , AH\n\n.\n= Ot, O\u22121\n\n1 , . . . , SH\n\nT ), where SH\n\n= (Ot\u22121, St), AH\n.\n\n= {\u03c4L | p(\u03c4L|\u03c0L, ML) > 0}. With \u03c4\n\n= {\u03c4H | p(\u03c4H|\u03c0H, MH) > 0}, \u2126L .\n0 , SH\n\n= {\u03c4 | p(\u03c4|\u03c0,O, M ) > 0},\n.\nWe consider trajectories with nonzero probabilities and de\ufb01ne \u2126\n\u2126H .\n.\n=\n(S0, O0, S1, O1, . . . , ST ), we de\ufb01ne a function fH : \u2126 \u2192 \u2126H, which maps \u03c4 to \u03c4H .\n=\n(SH\n0 , AH\nLemma 1 p(\u03c4|\u03c0,O, M ) = p(\u03c4H|\u03c0H, MH), r(\u03c4 ) = r(\u03c4H), and fH is a bijection.\nProof. See supplementary materials.\n.\nWe now take action into consideration. With \u03c4\n= (S0, O0, A0, S1, O1, A1 . . . , ST ), we de\ufb01ne\na function fL : \u2126 \u2192 \u2126L, which maps \u03c4 to \u03c4L .\n= (SL\n.\n=\n(St, Ot), AL\nLemma 2 p(\u03c4|\u03c0,O, M ) = p(\u03c4L|\u03c0L, ML), r(\u03c4 ) = r(\u03c4L), and fL is a bijection.\nProof. See supplementary materials.\n\nT ), where SL\n\n.\n= At. We have:\n\n.\n= #. We have:\n\n1 , . . . , SL\n\n0 , AL\n\n1 , AL\n\n0 , SL\n\nt\n\nt\n\nProposition 1\n\n=(cid:82) r(\u03c4 )p(\u03c4|\u03c0,O, M )d\u03c4 =(cid:82) r(\u03c4H)p(\u03c4H|\u03c0H, MH)d\u03c4H =(cid:82) r(\u03c4L)p(\u03c4L|\u03c0L, ML)d\u03c4L.\n\n.\n\nJ\n\nt\n\nProof. Follows directly from Lemma 1 and Lemma 2.\nLemma 1 and Lemma 2 indicate that sampling from {\u03c0,O, M} is equivalent to sampling from\n{\u03c0H, MH} and {\u03c0L, ML}. Proposition 1 indicates that optimizing \u03c0,O in M is equivalent to\noptimizing \u03c0H in MH and optimizing \u03c0L in ML. We now make two observations:\nObservation 1 MH depends on {\u03c0o} while \u03c0H depends on \u03c0 and {\u03b2o}.\nObservation 2 ML depends on \u03c0,{\u03b2o} while \u03c0L depends on {\u03c0o}.\n\n4\n\n\fObservation 1 suggests that when we keep the intra-option policies {\u03c0o} \ufb01xed and optimize \u03c0H,\nwe are implicitly optimizing \u03c0 and {\u03b2o} (i.e., \u03b8 and \u03c6). Observation 2 suggests that when we keep\nthe master policy \u03c0 and the termination conditions {\u03b2o} \ufb01xed and optimize \u03c0L, we are implicitly\noptimizing {\u03c0o} (i.e., \u03bd). All policy optimization algorithms for MDPs can be used off the shelf\nto optimize the two actors \u03c0H and \u03c0L with samples from {\u03c0,O, M}, yielding a new family of\nalgorithms for master policy learning and option learning, which we refer to as the Double Actor-\nCritic (DAC) architecture. Theoretically, we should optimize \u03c0H and \u03c0L alternatively with different\nsamples to make sure MH and ML are stationary. In practice, optimizing \u03c0H and \u03c0L with the same\nsamples improves data ef\ufb01ciency. The pseudocode of DAC is provided in the supplementary materials.\nWe present a thorough comparison of DAC, OC, IOPG and AHP in Table 1. DAC combines the\nadvantages of both AHP (i.e., compatibility) and OC (intra-option learning). Enabling off-option\nlearning of intra-option policies in DAC as IOPG is a possibility for future work.\n\nLearning {\u03c0o, \u03b2o} Online Learning Compatibility\n\nLearning \u03c0\n\nSMDP\n\nAHP\nintra-option\nOC\nIOPG intra-option\nDAC\nintra-option\n\non-option\non-option\noff-option\non-option\n\nyes\nyes\nno\nyes\n\nyes\nno\nno\nyes\n\nTable 1: A comparison of AHP, OC, IOPG and DAC. (1) For learning {\u03c0o, \u03b2o}, all four are intra-\noption. (2) IOAC is online with bias introduced and consumes extra memory. (3) Compatibility\nindicates whether a framework can be combined with any policy optimization algorithm off-the-shelf.\n\nIn general, we need two critics in DAC, which can be learned via all policy evaluation algorithms.\nHowever, when state value functions are used as critics, Proposition 2 shows that the state value\nfunction in the high-MDP (v\u03c0H) can be expressed by the state value function in the low-MDP (v\u03c0L),\nand hence only one critic is needed.\n\nProposition 2 v\u03c0H(cid:0)(o, s(cid:48))(cid:1) =(cid:80)\nv\u03c0H(cid:0)(o, s(cid:48))(cid:1) .\nv\u03c0L(cid:0)(s(cid:48), o(cid:48))(cid:1) .\n\no(cid:48) \u03c0H(cid:0)o(cid:48)|(o, s(cid:48))(cid:1)v\u03c0L(cid:0)(s(cid:48), o(cid:48))(cid:1), where\n= E\u03c0H,MH[(cid:80)\u221e\n= E\u03c0L,ML [(cid:80)\u221e\n\ni=1 \u03b3i\u22121RH\ni=1 \u03b3i\u22121RL\n\nt+i | SH\nt+i | SL\n\nt = (o, s(cid:48))],\nt = (s(cid:48), o(cid:48))],\n\nProof. See supplementary materials.\nOur two augmented MDP formulation differs from the one augmented MDP formulation in AHP\nmainly in that we do not need to introduce the binary variable Bt. It is this elimination of Bt that leads\nto the intra-option master policy learning in DAC and yields a useful observation: the call-and-return\nexecution model (with Markov options) is similar to the polling execution model (Dietterich, 2000),\nwhere an agent reselects an option every time step according to \u03c0H. This observation opens the\npossibility for more intra-option master policy learning algorithms. Note the introduction of Bt is\nnecessary if one would want to formulate the option SMDP as a single augmented MDP and apply\nstandard control methods from the MDP setting. Otherwise, the augmented MDP will have \u03c0H in\nboth the augmented policy and the new transition kernel. By contrast, in a canonical MDP setting, a\npolicy does not overlap with the transition kernel.\nBeyond Intra-option Q-Learning: In terms of learning \u03c0 with a \ufb01xed O, Observation 1 suggests\nwe optimize \u03c0H on MH. This immediately yields a family of policy-based algorithms for learning a\nmaster policy, all of which are intra-option. Particularly, when we use Off-Policy Expected Policy\nGradients (Off-EPG, Ciosek and Whiteson 2017) for optimizing \u03c0H, we get all the merits of both\nIntra-option Q-Learning and policy gradients for free. (1) By de\ufb01nition of MH and \u03c0H, Off-EPG\noptimizes \u03c0 in an intra-option manner and is as data ef\ufb01cient as Intra-option Q-Learning. (2) Off-EPG\nis an off-policy algorithm, so off-policy transitions can also be used, as in Intra-option Q-Learning.\n(3) Off-EPG is off-option in that all the options, not only the executed one, explicitly contribute to the\npolicy gradient at every time step. Particularly, this off-option approach does not require deterministic\nintra-option policies like Intra-option Q-Learning. (4) Off-EPG uses a policy for decision making,\nwhich is more robust than value-based decision making. We leave an empirical study of this particular\napplication for future work and focus in this paper on the more general problem, learning \u03c0 and O\nsimultaneously. When O is not \ufb01xed, the MDP (MH) for learning \u03c0 becomes non-stationary. We,\ntherefore, prefer on-policy methods to off-policy methods.\n\n5\n\n\f4 Experimental Results\n\nWe design experiments to answer the following questions: (1) Can DAC outperform existing gradient-\nbased option learning algorithms (e.g., AHP, OC, IOPG)? (2) Can options learned in DAC translate\ninto a performance boost over its hierarchy-free counterparts? (3) What options does DAC learn?\nDAC can be combined with any policy optimization algorithm, e.g., policy gradient (Sutton et al.,\n2000), Natural Actor Critic (NAC, Peters and Schaal 2008), PPO, Soft Actor Critic (Haarnoja et al.,\n2018), Generalized Off-Policy Actor Critic (Zhang et al., 2019a). In this paper, we focus on the\ncombination of DAC and PPO, given the great empirical success of PPO (OpenAI, 2018). Our PPO\nimplementation uses the same architecture and hyperparameters reported by Schulman et al. (2017).\nLevy and Shimkin (2011) combine AHP with NAC and present an empirical study on an inverted\npendulum domain. In our experiments, we also combine AHP with PPO for a fair comparison. To\nthe best of our knowledge, this is the \ufb01rst time that AHP has been evaluated with state-of-the-art\npolicy optimization algorithms in prevailing deep RL benchmarks. We also implemented IOPG and\nOC as baselines. Previously, Klissarov et al. (2017) also combines OC and PPO in PPOC. PPOC\nupdates {\u03c0o} with a PPO loss and updates {\u03b2o} in the same manner as OC. PPOC applies vanilla\npolicy gradients directly to train \u03c0 in an intra-option manner, which is not theoretically justi\ufb01ed. We\nuse 4 options for all algorithms, following Smith et al. (2018). We report the online training episode\nreturn, smoothed by a sliding window of size 20. All curves are averaged over 10 independent runs\nand shaded regions indicate standard errors. All implementations are made publicly available 1. More\ndetails about the experiments are provided in the supplementary materials.\n\n4.1 Single Task Learning\n\nWe consider four robot simulation tasks used by Smith et al. (2018) from OpenAI gym (Brockman\net al., 2016). We also include the combination of DAC and A2C (Clemente et al., 2017) for reference.\nThe results are reported in Figure 1.\n\nFigure 1: Online performance on a single task\n\nResults: (1) Our implementations of OC and IOPG reach similar performance to that reported by\nSmith et al. (2018), which is signi\ufb01cantly outperformed by both vanilla PPO and option-based PPO\n(i.e., DAC+PPO, AHP+PPO). However, the performance of DAC+A2C is similar to OC and IOPG.\nThese results indicate that the performance boost of DAC+PPO and AHP+PPO mainly comes from\nthe more advanced policy optimization algorithm (PPO). This is exactly the major advantage of\nDAC and AHP. They allow all state-of-the-art policy optimization algorithms to be used off the shelf\nto learn options. (2) The performance of DAC+PPO is similar to vanilla PPO in 3 out of 4 tasks.\nDAC+PPO outperforms PPO in Swimmer by a large margin. This performance similarity between an\noption-based algorithm and a hierarchy-free algorithm is expected and is also reported by Harb et al.\n(2018); Smith et al. (2018); Klissarov et al. (2017). Within a single task, it is usually hard to translate\nthe automatically discovered options into a performance boost, as primitive actions are enough to\nexpress the optimal policy and learning the additional structure, the options, may be overhead. (3)\nThe performance of DAC+PPO is similar to AHP+PPO, as expected. The main advantage of DAC\nover AHP is its data ef\ufb01ciency in learning the master policy. Within a single task, it is possible that\nan agent focuses on a \u201cmighty\u201d option and ignores other specialized options, making master policy\nlearning less important. By contrast, when we switch tasks, cooperation among different options\n\n1https://github.com/ShangtongZhang/DeepRL\n\n6\n\n\fbecomes more important. We, therefore, expect that the data ef\ufb01ciency in learning the master policy\nin DAC translates into a performance boost over AHP in a transfer learning setting.\n\n4.2 Transfer Learning\n\nWe consider a transfer learning setting where after the \ufb01rst 1M training steps, we switch to a new task\nand train the agent for other 1M steps. The agent is not aware of the task switch. The two tasks are\ncorrelated and we expect learned options from the \ufb01rst task can be used to accelerate learning of the\nsecond task.\nWe use 6 pairs of tasks from DeepMind Control Suite (DMControl, Tassa et al. 2018): CartPole =\n(balance, balance_sparse), Reacher = (easy, hard), Cheetah = (run, backward), Fish =\n(upright, downleft), Walker1 = (squat, stand), Walker2 = (walk, backward). Most of them\nare provided by DMControl and some of them we constructed similarly as Hafner et al. (2018). The\nmaximum score is always 1000. More details are provided in the supplementary materials. There are\nother possible paired tasks in DMControl but we found that in such pairs, PPO hardly learns anything\nin the second task. Hence, we omit those pairs from our experiments. The results are reported in\nFigure 2.\n\nFigure 2: Online performance for transfer learning\n\nResults: (1) During the \ufb01rst task, DAC+PPO consistently outperforms OC and IOPG by a large\nmargin and maintains a similar performance to PPO, PPOC, and AHP+PPO. These results are\nconsistent with our previous observations in the single task learning setting. (2) After the task switch,\nthe advantage of DAC+PPO becomes clear. DAC+PPO outperforms all other baselines by a large\nmargin in 3 out of 6 tasks and is among the best algorithms in the other 3 tasks. This satis\ufb01es our\nprevious expectation about DAC and AHP in Section 4.1. (3) We further study the in\ufb02uence of the\nnumber of options in Walker2. Results are provided in the supplementary materials. We \ufb01nd 8\noptions are slightly better than 4 options and 2 options are worse. We conjecture that 2 options are\nnot enough for transferring the knowledge from the \ufb01rst task to the second.\n\n4.3 Option Structures\n\nWe visualize the learned options and option occupancy of DAC+PPO on Cheetah in Figure 3. There\nare 4 options in total, displayed via different colors. The upper strip shows the option occupancy\nduring an episode at the end of the training of the \ufb01rst task (run). The lower strip shows the option\noccupancy during an episode at the end of the training of the second task (backward). Both episodes\nlast 1000 steps.2 The four options are distinct. The blue option is mainly used when the cheetah is\n\n2The video of the two episodes is available at https://youtu.be/K0ZP-HQtx6M\n\n7\n\n\f\u201c\ufb02ying\u201d. The green option is mainly used when the cheetah pushes its left leg to move right. The\nyellow option is mainly used when the cheetah pushes its left leg to move left. The red option is\nmainly used when the cheetah pushes its right leg to move left. During the \ufb01rst task, the red option\nis rarely used. The cheetah uses the green and yellow options for pushing its left leg and uses the\nblue option for \ufb02ying. The right leg rarely touches the ground during the \ufb01rst episode. After the task\nswitch, the \ufb02ying option (blue) transfers to the second task, the yellow option specializes for moving\nleft, and the red option is developed for pushing the right leg to the left.\n\nFigure 3: Learned options and option occupancy of DAC+PPO in Cheetah\n\n5 Related Work\n\nMany components in DAC are not new. The idea of an augmented MDP is suggested by Levy and\nShimkin (2011); Daniel et al. (2016). The augmented state spaces SH and SL are also used by\nBacon et al. (2017) to simplify the derivation. Applying vanilla policy gradient to \u03c0L and ML leads\nimmediately to the Intra-Option Policy Gradient Theorem (Bacon et al., 2017). The augmented policy\n\u03c0H is also used by Smith et al. (2018) to simplify the derivation and is discussed in Bacon (2018)\nunder the name of mixture distribution. Bacon (2018) discusses two mechanisms for sampling from\nthe mixture distribution: a two-step sampling method (sampling Bt \ufb01rst then Ot) and a one-step\nsampling method (sampling Ot directly), where the latter can be viewed as an expected version of\nthe former. The two-step one is implemented by the call-and-return model and is explicitly modelled\nby Levy and Shimkin (2011) via introducing Bt, which is not used in either Bacon (2018) or our\nwork. Bacon (2018) mentions that the one-step modelling can lead to reduced variance compared to\nthe two-step one. However, there is another signi\ufb01cant difference: the one-step modelling is more\ndata ef\ufb01cient than the two-step one. The two-step one (e.g., AHP) yields SMDP learning while the\none-step one (e.g., our approach) yields intra-option learning (for learning the master policy). This\ndifference is not recognized in Bacon (2018) and we are the \ufb01rst to establish it, both conceptually\nand experimentally. Although the underlying chain of DAC is the same as that of Levy and Shimkin\n(2011); Daniel et al. (2016); Bacon et al. (2017); Bacon (2018), DAC is the \ufb01rst to formulate the two\naugmented MDPs explicitly. It is this explicit formulation that allows the off-the-shelf application of\nall state-of-the-art policy optimizations algorithm and combines advantages from both OC and AHP.\nThe gradient of the master policy \ufb01rst appeared in Levy and Shimkin (2011). However, due to the\nintroduction of Bt, that gradient is nonzero only if an option terminates. It is, therefore, SMDP-\nlearning. The gradient of the master policy later appeared in Daniel et al. (2016) in the probabilistic\ninference method for learning options, which, however, assumes a linear structure and is off-line\nlearning. The gradient of the master policy also appeared in Bacon (2018), which is mixed with all\nother gradients. Unless we work on the augmented MDP directly, we cannot easily drop in other\npolicy optimization techniques for learning the master policy, which is our main contribution and is\nnot done by Bacon (2018). Furthermore, that policy gradient is never used in Bacon (2018). All the\nempirical study uses Q-Learning for the master policy. By contrast, our explicit formulation of the\ntwo augmented MDPs generates a family of online policy-based intra-option algorithms for master\npolicy learning, which are compatible with general function approximation.\nBesides gradient-based option learning, there are also other option learning approaches based on\n\ufb01nding bottleneck states or subgoals (Stolle and Precup, 2002; McGovern and Barto, 2001; Silver\nand Ciosek, 2012; Niekum and Barto, 2011; Machado et al., 2017a). In general, these approaches are\nexpensive in terms of both samples and computation (Precup, 2018).\n\n8\n\n\fBesides the option framework, there are also other frameworks to describe hierarchies in RL. Di-\netterich (2000) decomposes the value function in the original MDP into value functions in smaller\nMDPs in the MAXQ framework. Dayan and Hinton (1993) employ multiple managers on different\nlevels for describing a hierarchy. Vezhnevets et al. (2017) further extend this idea to FeUdal Networks,\nwhere a manager module sets abstract goals for workers. This goal-based hierarchy description is\nalso explored by Schmidhuber and Wahnsiedler (1993); Levy et al. (2017); Nachum et al. (2018).\nMoreover, Florensa et al. (2017) use stochastic neural networks for hierarchical RL. We leave a\ncomparison between the option framework and other hierarchical RL frameworks for future work.\n\n6 Conclusions\n\nIn this paper, we reformulate the SMDP of the option framework as two augmented MDPs, allowing\nin an off-the-shelf application of all policy optimization algorithms in option learning and master\npolicy learning in an intra-option manner.\nIn DAC, there is no clear boundary between option termination functions and the master policy. They\nare different internal parts of the augmented policy \u03c0H. We observe that the termination probability\nof the active option becomes high as training progresses, although \u03c0H still selects the same option.\nThis is also observed by Bacon et al. (2017). To encourage long options, Harb et al. (2018) propose a\ncost model for option switching. Including this cost model in DAC is a possibility for future work.\n\nAcknowledgments\n\nSZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This\nproject has received funding from the European Research Council under the European Union\u2019s Hori-\nzon 2020 research and innovation programme (grant agreement number 637713). The experiments\nwere made possible by a generous equipment grant from NVIDIA. The authors thank Matthew Smith\nand Gregory Farquhar for insightful discussions. The authors also thank anonymous reviewers for\ntheir valuable feedbacks.\n\nReferences\nBacon, P.-L. (2018). Temporal Representation Learning. PhD thesis, McGill University.\n\nBacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Proceedings of the\n\n31st AAAI Conference on Arti\ufb01cial Intelligence.\n\nBrockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W.\n\n(2016). Openai gym. arXiv preprint arXiv:1606.01540.\n\nCiosek, K. and Whiteson, S. (2017). Expected policy gradients. arXiv preprint arXiv:1706.05374.\n\nClemente, A. V., Castej\u00f3n, H. N., and Chandra, A. (2017). Ef\ufb01cient parallel methods for deep\n\nreinforcement learning. arXiv preprint arXiv:1705.04862.\n\nDaniel, C., Van Hoof, H., Peters, J., and Neumann, G. (2016). Probabilistic inference for determining\n\noptions in reinforcement learning. Machine Learning.\n\nDayan, P. and Hinton, G. E. (1993). Feudal reinforcement learning. In Advances in Neural Information\n\nProcessing Systems.\n\nDietterich, T. G. (2000). Hierarchical reinforcement learning with the maxq value function decompo-\n\nsition. Journal of Arti\ufb01cial Intelligence Research.\n\nFlorensa, C., Duan, Y., and Abbeel, P. (2017). Stochastic neural networks for hierarchical reinforce-\n\nment learning. arXiv preprint arXiv:1704.03012.\n\nHaarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum\n\nentropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.\n\nHafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. (2018). Learning\n\nlatent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551.\n\n9\n\n\fHarb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2018). When waiting is not an option: Learning\nIn Proceedings of the 32nd AAAI Conference on Arti\ufb01cial\n\noptions with a deliberation cost.\nIntelligence.\n\nKlissarov, M., Bacon, P.-L., Harb, J., and Precup, D. (2017). Learnings options end-to-end for\n\ncontinuous action tasks. arXiv preprint arXiv:1712.00004.\n\nLevy, A., Platt, R., and Saenko, K. (2017). Hierarchical actor-critic. arXiv preprint arXiv:1712.00948.\n\nLevy, K. Y. and Shimkin, N. (2011). Uni\ufb01ed inter and intra options learning using policy gradient\n\nmethods. In Proceedings of the 2011 European Workshop on Reinforcement Learning.\n\nMachado, M. C., Bellemare, M. G., and Bowling, M. (2017a). A laplacian framework for option\n\ndiscovery in reinforcement learning. arXiv preprint arXiv:1703.00956.\n\nMachado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. (2017b).\nEigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089.\n\nMcGovern, A. and Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning\n\nusing diverse density. Proceedings of the 18th International Conference on Machine Learning.\n\nNachum, O., Gu, S. S., Lee, H., and Levine, S. (2018). Data-ef\ufb01cient hierarchical reinforcement\n\nlearning. In Advances in Neural Information Processing Systems.\n\nNiekum, S. and Barto, A. G. (2011). Clustering via dirichlet process mixture models for portable\n\nskill discovery. In Advances in neural information processing systems.\n\nOpenAI (2018). Openai \ufb01ve. https://openai.com/five/.\n\nPeters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing.\n\nPrecup, D. (2018). Temporal abstraction. URL: http://videolectures.net/site/normal_dl/\n\ntag=1199094/DLRLsummerschool2018_precup_temporal_abstraction_01.pdf.\n\nPuterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John\n\nWiley & Sons.\n\nRiemer, M., Liu, M., and Tesauro, G. (2018). Learning abstract options. In Advances in Neural\n\nInformation Processing Systems.\n\nSchmidhuber, J. and Wahnsiedler, R. (1993). Planning simple trajectories using neural subgoal\ngenerators. In Proceedings of the Second International Conference on Simulation of Adaptive\nBehavior.\n\nSchulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimiza-\n\ntion. In Proceedings of the 32nd International Conference on Machine Learning.\n\nSchulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy\n\noptimization algorithms. arXiv preprint arXiv:1707.06347.\n\nSilver, D. and Ciosek, K. (2012). Compositional planning using optimal option models. arXiv\n\npreprint arXiv:1206.6473.\n\nSmith, M., Hoof, H., and Pineau, J. (2018). An inference-based policy gradient method for learning\n\noptions. In Proceedings of the 35th International Conference on Machine Learning.\n\nStolle, M. and Precup, D. (2002). Learning options in reinforcement learning. In International\n\nSymposium on abstraction, reformulation, and approximation.\n\nSutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods\nIn Advances in Neural Information\n\nfor reinforcement learning with function approximation.\nProcessing Systems.\n\nSutton, R. S., Precup, D., and Singh, S. (1999). Between mdps and semi-mdps: A framework for\n\ntemporal abstraction in reinforcement learning. Arti\ufb01cial Intelligence.\n\n10\n\n\fTassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A.,\nMerel, J., Lefrancq, A., et al. (2018). Deepmind control suite. arXiv preprint arXiv:1801.00690.\n\nVezhnevets, A. S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., and Kavukcuoglu,\nK. (2017). Feudal networks for hierarchical reinforcement learning. In Proceedings of the 34th\nInternational Conference on Machine Learning.\n\nZhang, S., Boehmer, W., and Whiteson, S. (2019a). Generalized off-policy actor-critic. In Advances\n\nin Neural Information Processing Systems.\n\nZhang, S., Chen, H., and Yao, H. (2019b). Ace: An actor ensemble algorithm for continuous control\n\nwith tree search. Proceedings of the 33rd AAAI Conference on Arti\ufb01cial Intelligence.\n\n11\n\n\f", "award": [], "sourceid": 1187, "authors": [{"given_name": "Shangtong", "family_name": "Zhang", "institution": "University of Oxford"}, {"given_name": "Shimon", "family_name": "Whiteson", "institution": "University of Oxford"}]}