{"title": "The Option Keyboard: Combining Skills in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13052, "page_last": 13062, "abstract": "The ability to combine known skills to create new ones may be crucial in the solution of complex reinforcement learning problems that unfold over extended periods. We argue that a robust way of combining skills is to define and manipulate them in the space of pseudo-rewards (or \"cumulants\"). Based on this premise, we propose a framework for combining skills using the formalism of options. We show that every deterministic option can be unambiguously represented as a cumulant defined in an extended domain. Building on this insight and on previous results on transfer learning, we show how to approximate options whose cumulants are linear combinations of the cumulants of known options. This means that, once we have learned options associated with a set of cumulants, we can instantaneously synthesise options induced by any linear combination of them, without any learning involved. We describe how this framework provides a hierarchical interface to the environment whose abstract actions correspond to combinations of basic skills. We demonstrate the practical benefits of our approach in a resource management problem and a navigation task involving a quadrupedal simulated robot.", "full_text": "The Option Keyboard\n\nCombining Skills in Reinforcement Learning\n\nAndr\u00e9 Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Ayg\u00fcn,\n\nPhilippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup\n\n{andrebarreto,borsa,shaobohou,gcomanici,eser}@google.com\n\n{hamelphi,kenjitoyama,jjhunt,shibl,davidsilver,doinap}@google.com\n\nDeepMind\n\nAbstract\n\nThe ability to combine known skills to create new ones may be crucial in the\nsolution of complex reinforcement learning problems that unfold over extended\nperiods. 
We argue that a robust way of combining skills is to de\ufb01ne and manipulate\nthem in the space of pseudo-rewards (or \u201ccumulants\u201d). Based on this premise, we\npropose a framework for combining skills using the formalism of options. We show\nthat every deterministic option can be unambiguously represented as a cumulant\nde\ufb01ned in an extended domain. Building on this insight and on previous results\non transfer learning, we show how to approximate options whose cumulants are\nlinear combinations of the cumulants of known options. This means that, once we\nhave learned options associated with a set of cumulants, we can instantaneously\nsynthesise options induced by any linear combination of them, without any learning\ninvolved. We describe how this framework provides a hierarchical interface to the\nenvironment whose abstract actions correspond to combinations of basic skills.\nWe demonstrate the practical bene\ufb01ts of our approach in a resource management\nproblem and a navigation task involving a quadrupedal simulated robot.\n\n1\n\nIntroduction\n\nIn reinforcement learning (RL) an agent takes actions in an environment in order to maximise the\namount of reward received in the long run [25]. This textbook de\ufb01nition of RL treats actions as\natomic decisions made by the agent at every time step. Recently, Sutton [23] proposed a new view\non action selection. In order to illustrate the potential bene\ufb01ts of his proposal Sutton resorts to the\nfollowing analogy. Imagine that the interface between agent and environment is a piano keyboard,\nwith each key corresponding to a possible action. Conventionally the agent plays one key at a time\nand each note lasts exactly one unit of time. If we expect our agents to do something akin to playing\nmusic, we must generalise this interface in two ways. First, we ought to allow notes to be arbitrarily\nlong\u2014that is, we must replace actions with skills. 
Second, we should be able to also play chords.\nThe argument in favour of temporally-extended courses of actions has repeatedly been made in the\nliterature: in fact, the notion that agents should be able to reason at multiple temporal scales is one of\nthe pillars of hierarchical RL [7, 18, 26, 8, 17]. The insight that the agent should have the ability to\ncombine the resulting skills is a far less explored idea. This is the focus of the current work.\nThe possibility of combining skills replaces a monolithic action set with a combinatorial counterpart:\nby learning a small set of basic skills ("keys") the agent should be able to perform a potentially very\nlarge number of combined skills ("chords"). For example, an agent that can both walk and grasp an\nobject should be able to walk while grasping an object without having to learn a new skill.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAccording to Sutton [23], this combinatorial action selection process "could be the key to generating behaviour\nwith a good mix of preplanned coherence and sensitivity to the current situation."\nBut how exactly should one combine skills? One possibility is to combine them in the space of\npolicies: for example, if we look at policies as distributions over actions, a combination of skills can\nbe defined as a mixture of the corresponding distributions. One can also combine parametric policies\nby manipulating the corresponding parameters. Although these are feasible solutions, they fail to\ncapture possible intentions behind the skills. Suppose the agent is able to perform two skills that can\nbe associated with the same objective—distinct ways of grasping an object, say. It is not difficult\nto see how combinations of the corresponding behaviours can completely fail to accomplish the\ncommon goal. 
We argue that a more robust way of combining skills is to do so directly in the goal\nspace, using pseudo-rewards or cumulants [25]. If we associate each skill with a cumulant, we can\ncombine the former by manipulating the latter. This allows us to go beyond the direct prescription of\nbehaviours, working instead in the space of intentions.\nCombining skills in the space of cumulants poses two challenges. First, we must establish a well-\nde\ufb01ned mapping between cumulants and skills. Second, once a combined cumulant is de\ufb01ned, we\nmust be able to perform the associated skill without having to go through the slow process of learning\nit. We propose to tackle the former by adopting options as our formalism to de\ufb01ne skills [26]. We\nshow that there is a large subset of the space of options, composed of deterministic options, in which\nevery element can be unambiguously represented as a cumulant de\ufb01ned in an extended domain.\nBuilding on this insight, we extend Barreto et al.\u2019s [3, 4] previous results on transfer learning to\nshow how to approximate options whose cumulants are linear combinations of the cumulants of\nknown options. This means that, once the agent has learned options associated with a collection\nof cumulants, it can instantaneously synthesise options induced by any linear combination of them,\nwithout any learning involved. Thus, by learning a small set of options, the agent instantaneously\nhas at its disposal a potentially enormous number of combined options. Since we are combining\ncumulants, and not policies, the resulting options will be truly novel, meaning that they cannot, in\ngeneral, be directly implemented as a simple alternation of their constituents.\nWe describe how our framework provides a \ufb02exible interface with the environment whose abstract\nactions correspond to combinations of basic skills. As a reference to the motivating analogy described\nabove, we call this interface the option keyboard. 
We discuss the merits of the option keyboard at the\nconceptual level and demonstrate its practical benefits in two experiments: a resource management\nproblem and a realistic navigation task involving a quadrupedal robot simulated in MuJoCo [30, 21].\n\n2 Background\n\nAs usual, we assume the interaction between agent and environment can be modelled as a Markov\ndecision process (MDP) [19]. An MDP is a tuple M ≡ (S, A, p, r, γ), where S and A are the state and\naction spaces, p(·|s, a) gives the next-state distribution upon taking action a in s, r : S × A × S ↦ R\nspecifies the reward associated with the transition from s to s' under action a, and γ ∈ [0, 1) is the\ndiscount factor. The objective of the agent is to find a policy π : S ↦ A that maximises the expected\nreturn G_t ≡ Σ_{i=0}^∞ γ^i R_{t+i}, where R_t = r(S_t, A_t, S_{t+1}). A principled way to address this\nproblem is to use methods derived from dynamic programming, which usually compute the action-value\nfunction of a policy π as Q^π(s, a) ≡ E_π[G_t | S_t = s, A_t = a], where E_π[·] denotes expectation over\nthe transitions induced by π [19]. The computation of Q^π(s, a) is called policy evaluation. Once π has\nbeen evaluated, we can compute a greedy policy\n\nπ'(s) ∈ argmax_a Q^π(s, a) for all s ∈ S.   (1)\n\nIt can be shown that Q^{π'}(s, a) ≥ Q^π(s, a) for all (s, a) ∈ S × A, and hence the computation of π'\nis referred to as policy improvement. The alternation between policy evaluation and policy improvement\nis at the core of many RL algorithms, which usually carry out these steps approximately. Here we will\nuse a tilde over a symbol to indicate that the associated quantity is an approximation (e.g., Q̃^π ≈ Q^π).\n\n2.1 Generalising policy evaluation and policy improvement\n\nFollowing Sutton and Barto [25], we call any signal defined as c : S × A × S ↦ R a cumulant.\nAnalogously to the conventional value function Q^π, we define Q^π_c as the expected discounted sum of\ncumulant c under policy π [27]. Given a policy π and a set of cumulants C, we call the evaluation of\nπ under all c ∈ C generalised policy evaluation (GPE) [2]. Barreto et al. [3, 4] propose an efficient\nform of GPE based on successor features: they show that, given cumulants c_1, c_2, ..., c_d, for any\nc = Σ_i w_i c_i with w ∈ R^d,\n\nQ^π_c(s, a) ≡ E_π[ Σ_{k=0}^∞ γ^k Σ_{i=1}^d w_i C_{i,t+k} | S_t = s, A_t = a ] = Σ_{i=1}^d w_i Q^π_{c_i}(s, a),   (2)\n\nwhere C_{i,t} ≡ c_i(S_t, A_t, S_{t+1}). Thus, once we have computed Q^π_{c_1}, ..., Q^π_{c_d}, we can\ninstantaneously evaluate π under any cumulant in the set C ≡ {c = Σ_i w_i c_i | w ∈ R^d}.\n\nPolicy improvement can also be generalised. In Barreto et al.'s [3] generalised policy improvement\n(GPI) the improved policy is computed based on a set of value functions. Let Q^{π_1}_c, Q^{π_2}_c, ..., Q^{π_n}_c\nbe the action-value functions of n policies π_i under cumulant c, and let Q^max_c(s, a) = max_i Q^{π_i}_c(s, a)\nfor all (s, a) ∈ S × A. If we define\n\nπ(s) ∈ argmax_a Q^max_c(s, a) for all s ∈ S,   (3)\n\nthen Q^π_c(s, a) ≥ Q^max_c(s, a) for all (s, a) ∈ S × A. This is a strict generalisation of standard\npolicy improvement (1). The guarantee extends to the case in which GPI uses approximations Q̃^{π_i}_c [3].\n\n2.2 Temporal abstraction via options\n\nAs discussed in the introduction, one way to get temporal abstraction is through the concept of\noptions [26]. Options are temporally-extended courses of actions. In their more general formulation,\noptions can depend on the entire history between the time t when they were initiated and the current\ntime step t + k, h_{t:t+k} ≡ s_t a_t s_{t+1} ... a_{t+k-1} s_{t+k}. Let H be the space of all possible\nhistories; a semi-Markov option is a tuple o ≡ (I_o, π_o, β_o), where I_o ⊂ S is the set of states where\nthe option can be initiated, π_o : H ↦ A is a policy over histories, and β_o : H ↦ [0, 1] gives the\nprobability that the option terminates after history h has been observed [26]. It is worth emphasising\nthat semi-Markov options depend on the history since their initiation, but not before.\n\n3 Combining options\n\nIn the previous section we discussed how several key concepts in RL can be generalised: rewards\nwith cumulants, policy evaluation with GPE, policy improvement with GPI, and actions with options.\nIn this section we discuss how these concepts can be used to combine skills.\n\n3.1 The relation between options and cumulants\n\nWe start by showing that there is a subset of the space of options in which every option can be\nunequivocally represented as a cumulant defined in an extended domain.\nFirst we look at the relation between policies and cumulants. Given an MDP (S, A, p, ·, γ), we say\nthat a cumulant c_π : S × A × S ↦ R induces a policy π : S ↦ A if π is optimal for the MDP\n(S, A, p, c_π, γ). We can always define a cumulant c_π that induces a given policy π. 
For instance, if\nwe make\n\nc_π(s, a, ·) = { 0 if a = π(s); z otherwise,   (4)\n\nwhere z < 0, it is clear that π is the only policy that achieves the maximum possible value Q^π(s, a) =\nQ*(s, a) = 0 on all (s, a) ∈ S × A. In general, the relation between policies and cumulants is a\nmany-to-many mapping. First, there is more than one cumulant that induces the same policy: for\nexample, any z < 0 in (4) will clearly lead to the same policy π. There is thus an infinite set of\ncumulants C_π associated with π. Conversely, although this is not the case in (4), the same cumulant\ncan give rise to multiple policies if more than one action achieves the maximum in (1).\nGiven the above, we can use any cumulant c_π ∈ C_π to refer to policy π. In order to extend this\npossibility to options o = (I_o, π_o, β_o) we need two things. First, we must define cumulants in the\nspace of histories H. This will allow us to induce semi-Markov policies π_o : H ↦ A in a way that is\nanalogous to (4). Second, we need cumulants that also induce the initiation set I_o and the termination\nfunction β_o. We propose to accomplish this by augmenting the action space.\nLet τ be a termination action that terminates option o much like the termination function β_o. We can\nthink of τ as a fictitious action and model it by defining an augmented action space A+ ≡ A ∪ {τ}.\nWhen the agent is executing an option o, selecting action τ immediately terminates it. We now\nshow that if we extend the definition of cumulants to also include τ we can have the resulting\ncumulant induce not only the option's policy but also its initiation set and termination function. Let\ne : H × A+ × S ↦ R be an extended cumulant. 
Since e is defined over the augmented action space,\nfor each h ∈ H we now have a termination bonus e(h, τ, s) = e(h, τ) that determines the value of\ninterrupting option o after having observed h. The extended cumulant e induces an augmented policy\nω_e : H ↦ A+ in the same sense that a standard cumulant induces a policy (that is, ω_e is an optimal\npolicy for the derived MDP whose state space is H and whose action space is A+; see Appendix A for\ndetails). We argue that ω_e is equivalent to an option o_e ≡ (I_e, π_e, β_e) whose components are defined\nas follows. The policy π_e : H ↦ A coincides with ω_e whenever the latter selects an action in A. The\ntermination function is given by\n\nβ_e(h) = { 1 if e(h, τ) > max_{a≠τ} Q^{ω_e}_e(h, a); 0 otherwise.   (5)\n\nIn words, the agent will terminate after h if the instantaneous termination bonus e(h, τ) is larger than\nthe maximum expected discounted sum of cumulant e under policy ω_e. Note that when h is a single\nstate s, no concrete action has been executed by the option yet; if the option terminates with τ at this\npoint, it does so immediately after its initiation. The initiation set is precisely the set of states where\nthis does not happen: I_e ≡ {s | β_e(s) = 0}.\nTermination functions like (5) are always deterministic. This means that extended cumulants e can\nonly represent options o_e in which β_e is a mapping H ↦ {0, 1}. In fact, it is possible to show that\nall options of this type, which we will call deterministic options, are representable as an extended\ncumulant e, as formalised in the following proposition (proof in Appendix A):\nProposition 1. 
Every extended cumulant induces at least one deterministic option, and every\ndeterministic option can be unambiguously induced by an infinite number of extended cumulants.\n\n3.2 Synthesising options using GPE and GPI\n\nIn the previous section we looked at the relation between extended cumulants and deterministic\noptions; we now build on this connection to use GPE and GPI to combine options.\nLet E ≡ {e_1, e_2, ..., e_d} be a set of extended cumulants. We know that e_i : H × A+ × S ↦ R is\nassociated with deterministic option o_{e_i} ≡ ω_{e_i}. As with any other cumulant, the extended cumulants\ne_i can be linearly combined; it then follows that, for any w ∈ R^d, e = Σ_i w_i e_i defines a new\ndeterministic option o_e ≡ ω_e. Interestingly, the termination function of o_e has the form (5) with\ntermination bonuses defined as e(h, τ) = Σ_i w_i e_i(h, τ). This means that the combined option o_e\n"inherits" its termination function from its constituents o_{e_i}. Since any w ∈ R^d defines an option o_e,\nthe set E can give rise to a very large number of combined options.\nThe problem is of course that for each w ∈ R^d we have to actually compute the resulting option ω_e.\nThis is where GPE and GPI come to the rescue. Suppose we have the values of options ω_{e_i} under all\nthe cumulants e_1, e_2, ..., e_d. With this information, and analogously to (2), we can use the fast form\nof GPE provided by successor features to compute the value of ω_{e_j} with respect to e:\n\nQ^{ω_{e_j}}_e(h, a) = Σ_i w_i Q^{ω_{e_j}}_{e_i}(h, a).   (6)\n\nNow that we have all the options ω_{e_j} evaluated under e, we can merge them to generate a new option\nthat does at least as well as, and in general better than, all of them. 
This is done by applying GPI over\nthe value functions Q^{ω_{e_j}}_e:\n\nω̃_e(h) ∈ argmax_{a∈A+} max_j Q^{ω_{e_j}}_e(h, a).   (7)\n\nFrom previous theoretical results we know that max_j Q^{ω_{e_j}}_e(h, a) ≤ Q^{ω̃_e}_e(h, a) ≤ Q^{ω_e}_e(h, a)\nfor all (h, a) ∈ H × A+ [3]. In words, this means that, even though the GPI option ω̃_e is not necessarily\noptimal, following it will in general result in a higher return in terms of cumulant e than if the agent\nwere to execute any of the known options ω_{e_j}. Thus, we can use ω̃_e as an approximation to ω_e that\nrequires no additional learning. It is worth mentioning that the action selected by the combined\noption in (7), ω̃_e(h), can be different from ω_{e_i}(h) for all i—that is, the resulting policy cannot, in\ngeneral, be implemented as an alternation of its constituents. This highlights the fact that combining\ncumulants is not the same as defining a higher-level policy over the associated options.\nIn summary, given a set of cumulants E, we can combine them by picking weights w and computing\nthe resulting cumulant e = Σ_i w_i e_i. This can be interpreted as determining how desirable or\nundesirable each cumulant is. Going back to the example in the introduction, suppose that e_1 is\nassociated with walking and e_2 is associated with grasping an object. Then, cumulant e_1 + e_2 will\nreinforce both behaviours, and will be particularly rewarding when they are executed together. In\ncontrast, cumulant e_1 − e_2 will induce an option that avoids grasping objects, favouring the walking\nbehaviour in isolation and possibly even inhibiting grasping altogether. Since the resulting option aims at\nmaximising a combination of the cumulants e_i, it can itself be seen as a combination of the options o_{e_i}.\n\nFigure 1: OK mediates the interaction between player\nand environment. 
The exchange of information between\nOK and the environment happens at every time step.\nThe interaction between player and OK only happens\n"inside" the agent, when the termination action τ is\nselected by GPE and GPI (see Algorithms 1 and 2).\n\n4 Learning with combined options\n\nGiven a set of extended cumulants E, in order to be able to combine the associated options using\nGPE and GPI one only needs the value functions Q_E ≡ { Q̃^{ω_{e_i}}_{e_j} | ∀(i, j) ∈ {1, 2, ..., d}^2 }.\nThe set Q_E can be constructed using standard RL methods; for an illustration of how to do it with\nQ-learning see Algorithm 3 in App. B.\nAs discussed, once Q_E has been computed we can use GPE and GPI to synthesise options on the fly.\nIn this case the newly-generated options are fully determined by the vector of weights w ∈ R^d.\nConceptually, we can think of this process as an interface between an RL algorithm and the\nenvironment: the algorithm selects a vector w, hands it over to GPE and GPI, and "waits" until the\naction returned by (7) is the termination action τ. Once τ has been selected, the algorithm picks a\nnew w, and so on. The RL method is thus interacting with the environment at a higher level of\nabstraction in which actions are combined skills defined by the vectors of weights w. Returning to\nthe analogy with a piano keyboard described in the introduction, we can think of each option ω_{e_i}\nas a "key" that can be activated by an instantiation of w whose only non-zero entry is w_i > 0.\nCombined options associated with more general instantiations of w would correspond to "chords".\nWe will thus call the layer of temporal abstraction between algorithm and environment the option\nkeyboard (OK). We will also generically refer to the RL method interacting with OK as the "player". 
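In code, the GPE and GPI computations that OK performs for a given w reduce to a weighted contraction over a tensor of value functions followed by a max. A minimal NumPy sketch of this selection step follows; the array layout and the function name ok_action are our own illustrative assumptions, not part of the paper's implementation:

```python
import numpy as np

def ok_action(q_values, w):
    """One GPE + GPI action-selection step for an option keyboard.

    q_values[j, i, a] approximates the value of option j's policy under
    cumulant i for the current history (illustrative layout); the last
    action index can play the role of the termination action tau.
    w is the weight vector defining the combined cumulant e = sum_i w_i e_i.
    """
    # GPE (Eq. 6): evaluate every known option under the combined cumulant e
    # by taking the w-weighted sum over the cumulant axis.
    q_e = np.tensordot(w, q_values, axes=([0], [1]))  # shape (n_options, n_actions)
    # GPI (Eq. 7): pick the action whose best evaluation across options is highest.
    return int(q_e.max(axis=0).argmax())
```

When the returned index corresponds to the termination action, control goes back to the player, which then supplies the next w.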
Figure 1 shows how an RL agent can be broken into a player and an OK.\nAlgorithm 1 shows a generic implementation of OK. Given a set of value functions Q_E and a vector\nof weights w, OK will execute the actions selected by GPE and GPI until the termination action is\npicked or a terminal state is reached. During this process OK keeps track of the discounted reward\naccumulated in the interaction with the environment (line 6), which will be returned to the player\nwhen the interaction terminates (line 10). As the options ω_{e_i} may depend on the entire trajectory\nsince their initiation, OK uses an update function u(h, a, s') that retains the parts of the history that\nare actually relevant for decision making (line 8). For example, if OK is based on Markov options\nonly, one can simply use the update function u(h, a, s') = s'.\nThe set Q_E defines a specific instantiation of OK; once an OK is in place any conventional RL\nmethod can interact with it as if it were the environment. As an illustration, Algorithm 2 shows\nhow a keyboard player that uses a finite set of combined options W ≡ {w_1, w_2, ..., w_n} can be\nimplemented using standard Q-learning by simply replacing the environment with OK. It is worth\npointing out that if we substitute any other set of weight vectors W' for W we can still use the same\nOK, without the need to relearn the value functions in Q_E. We can even use sets of abstract actions\nW that are infinite—as long as the OK player can deal with continuous action spaces [33, 24, 22].\nAlthough the clear separation between OK and its player is instructive, in practice the boundary\nbetween the two may be more blurry. 
For example, if the player is allowed to intervene in all interactions\nbetween OK and environment, one can implement useful strategies like option interruption [26].\nFinally, note that although we have been treating the construction of OK (Algorithm 3) and its use\n(Algorithms 1 and 2) as events that do not overlap in time, nothing keeps us from carrying out the\ntwo procedures in parallel, like in similar methods in the literature [1, 32].\n\nAlgorithm 1 Option Keyboard (OK)\nRequire: s ∈ S (current state), w ∈ R^d (vector of weights), Q_E (value functions), γ ∈ [0, 1) (discount rate)\n1: h ← s; r' ← 0; γ' ← 1\n2: repeat\n3:   a ← argmax_{a'} max_i [ Σ_j w_j Q̃^{ω_{e_i}}_{e_j}(h, a') ]\n4:   if a ≠ τ then\n5:     execute action a and observe r and s'\n6:     r' ← r' + γ' r\n7:     if s' is terminal then γ' ← 0 else γ' ← γ' γ\n8:     h ← u(h, a, s')\n9: until a = τ or s' is terminal\n10: return s', r', γ'\n\nAlgorithm 2 Q-learning keyboard player\nRequire: OK (option keyboard), W (combined options), Q_E (value functions), α, ε, γ ∈ R (hyper-parameters)\n1: create Q̃(s, w) parametrised by θ_Q\n2: select initial state s ∈ S\n3: repeat forever\n4:   if Bernoulli(ε) = 1 then w ← Uniform(W)\n5:   else w ← argmax_{w'∈W} Q̃(s, w')\n6:   (s', r', γ') ← OK(s, w, Q_E, γ)\n7:   δ ← r' + γ' max_{w'} Q̃(s', w') − Q̃(s, w)\n8:   θ_Q ← θ_Q + α δ ∇_{θ_Q} Q̃(s, w)   // update Q̃\n9:   if s' is terminal then select initial s ∈ S\n10:  else s ← s'\n\n5 Experiments\n\nWe now present our experimental results illustrating the benefits of OK in practice. Additional details,\nalong with further results and analysis, can be found in Appendix C.\n\n5.1 Foraging world\n\nThe goal in the foraging world is to manage a set of resources by navigating in a grid world and\npicking up items containing the resources in different proportions. For illustrative purposes we will\nconsider that the resources are nutrients and the items are food. The agent's challenge is to stay\nhealthy by keeping its nutrients within certain bounds. The agent navigates in the grid world using\nthe four usual actions: up, down, left, and right. Upon collecting a food item the agent's nutrients are\nincreased according to the type of food ingested. Importantly, the quantity of each nutrient decreases\nby a fixed amount at every step, so the desirability of different types of food changes even if no food is\nconsumed. Observations are images representing the configuration of the grid plus a vector indicating\nhow much of each nutrient the agent currently has (see Appendix C.1 for a technical description).\nWhat makes the foraging world particularly challenging is the fact that the agent has to travel towards\nthe items to pick them up, adding a spatial aspect to an already complex management problem. The\ndual nature of the problem also makes it potentially amenable to being tackled with options, since we\ncan design skills that seek specific nutrients and then treat the problem as a management task in\nwhich actions are preferences over nutrients. However, the number of options needed can increase\nexponentially fast. If at any given moment the agent wants, does not want, or does not care about\neach nutrient, we need 3^m options to cover the entire space of preferences, where m is the number of\nnutrients. 
This is a typical situation where being able to combine skills can be invaluable.\nAs an illustration, in our experiments we used m = 2 nutrients and 3 types of food. We defined\na cumulant e_i ∈ E associated with each nutrient as follows: e_i(h, a, s) = 0 until a food item is\nconsumed, when it becomes the increase in the associated nutrient. After a food item is consumed we\nhave that e_i(h, a, s) = −1{a ≠ τ}, where 1{·} is the indicator function—this forces the induced\noption to terminate, and also illustrates how the definition of cumulants over histories h can be useful\n(since single states would not be enough to determine whether the agent has consumed a food item).\nWe used Algorithm 3 in Appendix B to compute the 4 value functions in Q_E. We then defined an\nabstract action space with 8 elements covering the space of preferences, W ≡ {−1, 0, 1}^2 − {[0, 0]},\nand used it with the Q-learning player in Algorithm 2. We also considered Q-learning using only the 2\noptions maximising each nutrient and a "flat" Q-learning agent that does not use options at all.\nBy modifying the target range of each nutrient we can create distinct scenarios with very different\ndynamics. Figure 2 shows results in two such scenarios. Note how the relative performance of the\ntwo baselines changes dramatically from one scenario to the other, illustrating how the usefulness\nof options is highly context-dependent. Importantly, as shown by the results of the OK player, the\n\nFigure 2: Results on the foraging world. The two plots correspond to different configurations of the\nenvironment (see Appendix C.1). 
Shaded regions are one standard deviation over 10 runs.\n\nability to combine options in cumulant space makes it possible to synthesise useful behaviour from a\ngiven set of options even when they are not useful in isolation.\n\n5.2 Moving-target arena\n\nAs the name suggests, in the moving-target arena the goal is to get to a target region whose location\nchanges every time the agent reaches it. The arena is implemented as a square room with realistic\ndynamics defined in the MuJoCo physics engine [30]. The agent is a quadrupedal simulated robot\nwith 8 actuated degrees of freedom; actions are thus vectors in [−1, 1]^8 indicating the torque applied to\neach joint [21]. Observations are 29-dimensional vectors with spatial and proprioception information\n(Appendix C.2). The reward is always 0 except when the agent reaches the target, when it is 1.\nWe defined cumulants in order to encourage the agent's displacement in certain directions. Let v(h)\nbe the vector of (x, y) velocities of the agent after observing history h (the velocity is part of the\nobservation). Then, if we want the agent to travel in a certain direction w for k steps, we can define:\n\ne_w(h, a, ·) = { w⊤v(h) if length(h) ≤ k; −1{a ≠ τ} otherwise.   (8)\n\nThe induced option will terminate after k = 8 steps, as a negative reward is incurred for all histories\nof length greater than k and actions other than τ. It turns out that even if a larger number of directions\nw is to be learned, we only need to compute two value functions for each cumulant e_w. Since for\nall e_w ∈ E we have that e_w = w_1 e_x + w_2 e_y, where x = [1, 0] and y = [0, 1], we can use (2) to\ndecompose the value function of any option ω as Q^ω_{e_w}(h, a) = w_1 Q^ω_{e_x}(h, a) + w_2 Q^ω_{e_y}(h, a).\nHence, |Q_E| = 2|E|, resulting in a 2-dimensional space W in which w ∈ R^2 indicates the intended direction\nof locomotion. 
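Given this decomposition, synthesising a new direction of travel requires no learning: the value of every basic option under e_w follows from the two stored value functions, and GPI then picks the action. A small NumPy sketch, where the function name and array shapes are our own illustrative assumptions:

```python
import numpy as np

def directional_action(q_x, q_y, w):
    """Select an action for the combined directional cumulant e_w.

    q_x[j, a] and q_y[j, a] approximate option j's value under e_x and e_y
    for the current history (here j would range over the basic options
    trained on a few fixed directions); w = (w1, w2) is the intended
    direction of locomotion.
    """
    # GPE via the decomposition Q_{e_w} = w1 * Q_{e_x} + w2 * Q_{e_y}
    q_w = w[0] * q_x + w[1] * q_y      # shape (n_options, n_actions)
    # GPI: greedy action w.r.t. the best option evaluation
    return int(q_w.max(axis=0).argmax())
```

Because w enters only through this weighted sum, the same stored value functions serve every direction in the plane.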
Thus, by learning a few options that move along specific directions, the agent is potentially able to synthesise options that travel in any direction.
For our experiments, we defined cumulants e_w corresponding to the directions 0°, 120°, and 240°. To compute the set of value functions Q_E we used Algorithm 3 with Q-learning replaced by the deterministic policy gradient (DPG) algorithm [22]. We then used the resulting OK with both discrete and continuous abstract-action spaces W. For finite W we adopted a Q-learning player (Algorithm 2); in this case the abstract actions w_i correspond to n ∈ {4, 6, 8} directions evenly spaced on the unit circle. For continuous W we used a DPG player. We compare OK's results with those of DPG applied directly in the original action space and also with Q-learning using only the three basic options.
Figure 3 shows our results on the moving-target arena. As DPG's results indicate, solving the problem in the original action space is difficult because the occurrence of non-zero rewards may depend on a long sequence of lucky actions. When we replace actions with options we see a clear speed-up in learning, even if we take into account the training of the options. If in addition we allow for combined options, we observe a significant boost in performance, as shown by the OK players' results. Here we see the expected trend: as we increase |W| the OK player takes longer to learn but achieves better final performance, as larger numbers of directional options allow for finer control.
These results clearly illustrate the benefits of being able to combine skills, but how much is the agent actually using this ability? In Figure 3 we show a histogram indicating how often combined options are used by OK to implement directions w ∈ R^2 across the state space (details in App. C.2).
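The finite abstract-action sets used by the Q-learning players consist of n directions evenly spaced on the unit circle; a small utility along these lines (our own, hypothetical construction, not taken from the paper's code) might look like:

```python
import math

def unit_circle_directions(n):
    """Return n unit vectors w_i evenly spaced around the unit circle,
    serving as a discrete abstract-action set W."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

# e.g. the n = 8 action set used by one of the Q-learning players
W8 = unit_circle_directions(8)
```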
As shown, for abstract actions w close to 0°, 120° and 240° the agent relies mostly on the 3 options trained to navigate along these directions, but as the intended direction of locomotion gets farther from these reference points combined options become crucial. This shows how the ability to combine skills can extend the range of behaviours available to an agent without the need for additional learning.1

Figure 3: Left: Results on the moving-target arena. All players used the same keyboard, so they share the same OK training phase. Shaded regions are one standard deviation over 10 runs. Right: Histogram of options used by OK to implement directions w. Black lines are the three basic options.

Even if one accepts the premise of this paper that skills should be combined in the space of cumulants, it is natural to ask whether other strategies could be used instead of GPE and GPI. Although we are not aware of any other algorithm that explicitly attempts to combine skills in the space of cumulants, there are methods that do so in the space of value functions [29, 6, 13, 16]. Haarnoja et al. [13] propose a way of combining skills based on entropy-regularised value functions. Given a set of cumulants e_1, e_2, ..., e_d, they propose to compute a skill associated with e = Σ_i w_i e_i as follows: ω̂_e(h) ∈ argmax_{a ∈ A+} Σ_j w_j Q̂^{ω_{e_j}}_{e_j}(h, a), where the Q̂^{ω_{e_j}}_{e_j}(h, a) are entropy-regularised value functions and w_j ∈ [−1, 1]. We will refer to this method as additive value composition (AVC).
How well does AVC perform as compared to GPE and GPI?
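To make the two composition rules concrete, they can be contrasted schematically as below, in our own notation (entropy regularisation omitted from the AVC values; the array shapes and names are assumptions for illustration):

```python
import numpy as np

def avc_action(w, Q_own):
    """Additive value composition: act greedily w.r.t. the weighted sum of
    each base option's own value function.
    Q_own[j, a] ~ Q^{omega_{e_j}}_{e_j}(h, a); shape (num_cumulants, num_actions)."""
    return int(np.argmax(w @ Q_own))

def gpe_gpi_action(w, Q_all):
    """GPE + GPI: evaluate every base option under the combined cumulant
    (GPE), then act greedily w.r.t. the best option (GPI).
    Q_all[i, j, a] ~ Q^{omega_i}_{e_j}(h, a);
    shape (num_options, num_cumulants, num_actions)."""
    combined = np.einsum("j,ija->ia", w, Q_all)   # GPE: one value row per option
    return int(np.argmax(combined.max(axis=0)))   # GPI: max over base options
```

The key structural difference: AVC only ever consults each option's value under its own cumulant, whereas GPE re-evaluates every option under the combined cumulant before GPI picks among them.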
In order to answer this question we reran the previous experiments but now using ω̂_e(h) as defined above instead of the option ω̃_e(h) computed through (6) and (7). In order to adhere more closely to the assumptions underlying AVC, we also repeated the experiment using an entropy-regularised OK [14] (App. C.2). Figure 4 shows the results. As indicated in the figure, GPE and GPI outperform AVC both with the standard and the entropy-regularised OK. A possible explanation for this is given in the accompanying polar scatter chart in Figure 4, which illustrates how much progress each method makes, over the state space, in all directions w (App. C.2). The plot suggests that, in this domain, the directional options implemented through GPI and GPE are more effective in navigating along the desired directions (also see [16]).

6 Related work
Previous work has used GPI and successor features, the linear form of GPE considered here, in the context of transfer [3, 4, 5]. A crucial assumption underlying these works is that the reward can be well approximated as r(s, a, s′) ≈ Σ_i w_i c_i(s, a, s′). By solving a regression problem, the agent finds a w ∈ R^d that leads to a good approximation of r(s, a, s′) and uses it to apply GPE and GPI (equations (2) and (3), respectively). In terms of the current work, this is equivalent to having a keyboard player that is only allowed to play one endless "chord". Through the introduction of a termination action, in this work we replace policies with options that may eventually halt. Since policies are options that never terminate, the previous framework is a special case of OK. Unlike in the previous framework, with OK we can also chain a sequence of options, resulting in more flexible behaviour.
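The regression step just described can be sketched as an ordinary least-squares fit over a batch of transitions (a generic sketch; the cited works learn these quantities with function approximation rather than this exact call):

```python
import numpy as np

def fit_task_weights(cumulants, rewards):
    """Find w such that r(s, a, s') ~ sum_i w_i * c_i(s, a, s'), by least
    squares over a batch of transitions.
    cumulants: (num_transitions, d) matrix of c_i values;
    rewards:   (num_transitions,) observed rewards."""
    w, *_ = np.linalg.lstsq(cumulants, rewards, rcond=None)
    return w
```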
Importantly, this allows us to completely remove the linearity assumption on the rewards.

1 A video of the quadrupedal simulated robot being controlled by the DPG player can be found at the following link: https://youtu.be/39Ye8cMyelQ.

Figure 4: Left: Comparison of GPE and GPI with AVC on the moving-target arena. Results were obtained by a DPG player using a standard OK and an entropy-regularised counterpart (ENT-OK). We trained several ENT-OK with different regularisation parameters and picked the one leading to the best AVC performance. The same player and keyboards were used for both methods. Shaded regions are one standard deviation over 10 runs. Right: Polar scatter chart showing the average distance travelled by the agent along directions w when combining options using the two competing methods.

We now turn our attention to previous attempts to combine skills with no additional learning. As discussed, one way to do so is to work directly in the space of policies. Many policy-based methods first learn a parametric representation of a lower-level policy, π(·| s; θ), and then use θ ∈ R^d as the actions for a higher-level policy µ : S ↦ R^d [15, 10, 12]. One of the central arguments of this paper is that combining skills in the space of cumulants may be advantageous because it corresponds to manipulating the goals underlying the skills. This can be seen if we think of w ∈ R^d as a way of encoding skills and compare its effect on behaviour with that of θ: although the option induced by w1 + w2 through (6) and (7) will seek a combination of both its constituents' goals, the same cannot be said about a skill analogously defined as π(·| s; θ1 + θ2).
More generally, though, one should\nexpect both policy- and cumulant-based approaches to have advantages and disadvantages.\nInterestingly, most of the previous attempts to combine skills in the space of value functions are based\non entropy-regularised RL, like the already discussed AVC [34, 9, 11, 13]. Hunt et al. [16] propose a\nway of combining skills which can in principle lead to optimal performance if one knows in advance\nthe weights of the intended combinations. They also extend GPE and GPI to entropy-regularised\nRL. Todorov [28] focuses on entropy-regularised RL on linearly solvable MDPs. Todorov [29] and\nda Silva et al. [6] have shown how, in this scenario, one can compute optimal skills corresponding\nto linear combinations of other optimal skills\u2014a property later explored by Saxe et al. [20] to\npropose a hierarchical approach. Along similar lines, Van Niekerk et al. [31] have shown how\noptimal value function composition can be obtained in entropy-regularised shortest-path problems\nwith deterministic dynamics, with the non-regularised setup as a limiting case.\n\n7 Conclusion\nThe ability to combine skills makes it possible for an RL agent to learn a small set of skills and\nthen use them to generate a potentially very large number of distinct behaviours. A robust way of\ncombining skills is to do so in the space of cumulants, but in order to accomplish this one needs\nto solve two problems: (1) establish a well-de\ufb01ned mapping between cumulants and skills and (2)\nde\ufb01ne a mechanism to implement the combined skills without having to learn them.\nThe two main technical contributions of this paper are solutions for these challenging problems. First,\nwe have shown that every deterministic option can be induced by a cumulant de\ufb01ned in an extended\ndomain. This novel theoretical result provides a way of thinking about options whose interest may\ngo beyond the current work. 
Second, we have described how to use GPE and GPI to synthesise combined options on the fly, with no learning involved. To the best of our knowledge, this is the only method to do so in general MDPs with performance guarantees for the combined options.
We used the above formalism to introduce OK, an interface to an RL problem in which actions correspond to combined skills. Since OK is compatible with essentially any RL method, it can be readily used to endow our agents with the ability to combine skills. In describing the analogy with a keyboard that inspired our work, Sutton [23] calls for "something larger than actions, but more combinatorial than the conventional notion of options." We believe OK provides exactly that.

Acknowledgements

We thank Joseph Modayil for first bringing the subgoal keyboard idea to our attention, and also for the subsequent discussions on the subject. We are also grateful to Richard Sutton, Tom Schaul, Daniel Mankowitz, Steven Hansen, and Tuomas Haarnoja for the invaluable conversations that helped us develop our ideas and improve the paper. Finally, we thank the anonymous reviewers for their comments and suggestions.

References

[1] P. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2017.

[2] A. Barreto, S. Hou, D. Borsa, D. Silver, and D. Precup. Fast reinforcement learning with generalized policy updates. Manuscript in preparation.

[3] A. Barreto, W. Dabney, R. Munos, J. Hunt, T. Schaul, H. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

[4] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zidek, and R. Munos.
Transfer in deep reinforcement learning using successor features and generalised policy improvement. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[5] D. Borsa, A. Barreto, J. Quan, D. J. Mankowitz, H. van Hasselt, R. Munos, D. Silver, and T. Schaul. Universal successor features approximators. In International Conference on Learning Representations (ICLR), 2019.

[6] M. da Silva, F. Durand, and J. Popović. Linear Bellman combination for control of character animation. ACM Transactions on Graphics, 28(3):82:1–82:10, 2009.

[7] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 1993.

[8] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227–303, 2000.

[9] R. Fox, A. Pakman, and N. Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

[10] K. Frans, J. Ho, X. Chen, P. Abbeel, and J. Schulman. Meta learning shared hierarchies. In International Conference on Learning Representations (ICLR), 2018.

[11] T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

[12] T. Haarnoja, K. Hartikainen, P. Abbeel, and S. Levine. Latent space policies for hierarchical reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[13] T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine. Composable deep reinforcement learning for robotic manipulation. In IEEE International Conference on Robotics and Automation (ICRA), 2018.

[14] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine.
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning (ICML), 2018.

[15] N. Heess, G. Wayne, Y. Tassa, T. P. Lillicrap, M. A. Riedmiller, and D. Silver. Learning and transfer of modulated locomotor controllers. CoRR, abs/1610.05182, 2016. URL http://arxiv.org/abs/1610.05182.

[16] J. J. Hunt, A. Barreto, T. P. Lillicrap, and N. Heess. Entropic policy composition with generalized policy improvement and divergence correction. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[17] L. P. Kaelbling. Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the International Conference on Machine Learning (ICML), 1993.

[18] R. Parr and S. Russell. Reinforcement learning with hierarchies of machines. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), 1997.

[19] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

[20] A. M. Saxe, A. C. Earle, and B. Rosman. Hierarchy through composition with multitask LMDPs. In Proceedings of the International Conference on Machine Learning (ICML), 2017.

[21] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

[22] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning (ICML), 2014.

[23] R. Sutton. Toward a new view of action selection: The subgoal keyboard. Slides presented at the Barbados Workshop on Reinforcement Learning, 2016.
URL http://barbados2016.rl-community.org/RichSutton2016.pdf?attredirects=0&d=1.

[24] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 2000.

[25] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[26] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[27] R. S. Sutton, J. Modayil, M. Delp, T. Degris, P. M. Pilarski, A. White, and D. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2011.

[28] E. Todorov. Linearly-solvable Markov decision problems. In Advances in Neural Information Processing Systems (NIPS), 2007.

[29] E. Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems (NIPS), 2009.

[30] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012.

[31] B. Van Niekerk, S. James, A. Earle, and B. Rosman. Composing value functions in reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 2019.

[32] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 3540–3549, 2017.

[33] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.

[34] B. D. Ziebart.
Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.