{"title": "Universal Option Models", "book": "Advances in Neural Information Processing Systems", "page_first": 990, "page_last": 998, "abstract": "We consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed. We introduce a new model for an option that is independent of any reward function, called the {\\it universal option model (UOM)}. We prove that the UOM of an option can construct a traditional option model given a reward function, and the option-conditional return is computed directly by a single dot-product of the UOM with the reward function. We extend the UOM to linear function approximation, and we show it gives the TD solution of option returns and value functions of policies over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is document recommendation, where each user query defines a new reward function and a document's relevance is the expected return of a simulated random-walk through the document's references. The second domain is a real-time strategy game, where the controller must select the best game unit to accomplish dynamically-specified tasks. 
Our experiments show that UOMs are substantially more efficient in evaluating option returns and policies than previously known methods.", "full_text": "Universal Option Models\n\nHengshuai Yao, Csaba Szepesvári, Rich Sutton, Joseph Modayil\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nEdmonton, AB, Canada, T6H 4M5\n\nhengshua,szepesva,sutton,jmodayil@cs.ualberta.ca\n\nShalabh Bhatnagar\n\nDepartment of Computer Science and Automation\n\nIndian Institute of Science\n\nBangalore-560012, India\n\nshalabh@csa.iisc.ernet.in\n\nAbstract\n\nWe consider the problem of learning models of options for real-time abstract planning, in the setting where reward functions can be specified at any time and their expected returns must be efficiently computed. We introduce a new model for an option that is independent of any reward function, called the universal option model (UOM). We prove that the UOM of an option can construct a traditional option model given a reward function, and also supports efficient computation of the option-conditional return. We extend the UOM to linear function approximation, and we show the UOM gives the TD solution of option returns and the value function of a policy over options. We provide a stochastic approximation algorithm for incrementally learning UOMs from data and prove its consistency. We demonstrate our method in two domains. The first domain is a real-time strategy game, where the controller must select the best game unit to accomplish a dynamically-specified task. 
The second domain is article recommendation, where each user query defines a new reward function and an article's relevance is the expected return of a policy that follows the citations between articles. Our experiments show that UOMs are substantially more efficient than previously known methods for evaluating option returns and policies over options.

1 Introduction

Conventional methods for real-time abstract planning over options in reinforcement learning require a single pre-specified reward function, and these methods are not efficient in settings with multiple reward functions that can be specified at any time. Multiple reward functions arise in several contexts. In inverse reinforcement learning and apprenticeship learning there is a set of reward functions from which a good reward function is extracted [Abbeel et al., 2010, Ng and Russell, 2000, Syed, 2010]. Some system designers iteratively refine their provided reward functions to obtain desired behavior, and will re-plan in each iteration. In real-time strategy games, several units on a team can share the same dynamics but have different time-varying capabilities, so selecting the best unit for a task requires knowledge of the expected performance for many units. Even article recommendation can be viewed as a multiple-reward planning problem, where each user query has an associated reward function and the relevance of an article is given by walking over the links between the articles [Page et al., 1998, Richardson and Domingos, 2002]. 
We propose to unify the study of such problems within the setting of real-time abstract planning, where a reward function can be specified at any time and the expected option-conditional return for a reward function must be efficiently computed.

Abstract planning, or planning with temporal abstractions, enables one to make abstract decisions that involve sequences of low level actions. Options are often used to specify action abstraction [Precup, 2000, Sorg and Singh, 2010, Sutton et al., 1999]. An option is a course of temporally extended actions, which starts execution at some states, and follows a policy in selecting actions until it terminates. When an option terminates, the agent can start executing another option. The traditional model of an option takes in a state and predicts the sum of the rewards in the course till termination, and the probability of terminating the option at any state. When the reward function is changed, abstract planning with the traditional option model has to start from scratch.

We introduce universal option models (UOMs) as a solution to this problem. The UOM of an option has two parts. A state prediction part, as in the traditional option model, predicts the states where the option terminates. An accumulation part, new to the UOM, predicts the occupancies of all the states by the option after it starts execution. We also extend UOMs to linear function approximation, which scales to problems with a large state space. We show that the UOM outperforms existing methods in two domains.

2 Background

The terminology, ideas and results in this section are based on the work of [Sutton et al., 1999] unless otherwise stated.

A finite Markov Decision Process (MDP) is defined by a discount factor γ ∈ (0, 1), the state set S, the action set A, the immediate rewards {R^a}, and transition probabilities {P^a}. We assume that the number of states and actions are both finite. We also assume the states are indexed by integers, i.e., S = {1, 2, . . . , N}, where N is the number of states. The immediate reward function R^a: S × S → R for a given action a ∈ A and a pair of states (s, s′) ∈ S × S gives the mean immediate reward underlying the transition from s to s′ while using a. The transition probability function is a function P^a: S × S → [0, 1] and for (s, s′) ∈ S × S, a ∈ A, P^a(s, s′) gives the probability of arriving at state s′ given that action a is executed at state s.

A (stationary, Markov) policy π is defined as π: S × A → [0, 1], where Σ_{a∈A} π(s, a) = 1 for any s ∈ S. The value of a state s under a policy π is defined as the expected return given that one starts executing π from s:

V^π(s) = E_{s,π}[ r_1 + γ r_2 + γ^2 r_3 + ⋯ ].

Here (r_1, r_2, . . .) is a process with the following properties: s_0 = s and for k ≥ 0, s_{k+1} is sampled from P^{a_k}(s_k, ·), where a_k is the action selected by policy π and r_{k+1} is such that its conditional mean, given s_k, a_k, s_{k+1}, is R^{a_k}(s_k, s_{k+1}). The definition also works in the case when at any time step t the policy is allowed to take into account the history s_0, a_1, r_1, s_1, a_2, r_2, . . . , s_k in coming up with a_k. We will also assume that the conditional variance of r_{k+1} given s_k, a_k and s_{k+1} is bounded.

An option, o ≡ o⟨π, β⟩, has two components, a policy π and a continuation function β: S → [0, 1]. The latter maps a state into the probability of continuing the option from the state. An option o is executed as follows. At time step k, when visiting state s_k, the next action a_k is selected according to π(s_k, ·). The environment then transitions to the next state s_{k+1}, and a reward r_{k+1} is observed.¹ The option terminates at the new state s_{k+1} with probability 1 − β(s_{k+1}). Otherwise it continues, a new action is chosen from the policy of the option, etc. When one option terminates, another option can start.

The option model for option o helps with planning. Formally, the model of option o is a pair ⟨R^o, p^o⟩, where R^o is the so-called option return and p^o is the so-called (discounted) terminal distribution of option o. In particular, R^o: S → R is a mapping such that for any state s, R^o(s) gives the total expected discounted return until the option terminates:

R^o(s) = E_{s,o}[ r_1 + γ r_2 + ⋯ + γ^{T−1} r_T ],

where T is the random termination time of the option, assuming that the process (s_0, r_1, s_1, r_2, . . .) starts at time 0 at state s_0 = s (initiation), and every time step the policy underlying o is followed to get the reward and the next state until termination. The mapping p^o: S × S → [0, ∞) is a function that, for any given s, s′ ∈ S, gives the discounted probability of terminating at state s′ provided that the option is followed from the initial state s:

p^o(s, s′) = E_{s,o}[ γ^T I{s_T = s′} ] = Σ_{k=1}^{∞} γ^k P_{s,o}{ s_T = s′, T = k }.   (1)

Here, I{·} is the indicator function, and P_{s,o}{ s_T = s′, T = k } is the probability of terminating the option at s′ after k steps away from s.

¹ Here, s_{k+1} is sampled from P^{a_k}(s_k, ·) and the mean of r_{k+1} is R^{a_k}(s_k, s_{k+1}).

A semi-MDP (SMDP) is like an MDP, except that it allows multi-step transitions between states. An MDP with a fixed set of options gives rise to an SMDP, because the execution of options lasts multiple time steps. Given a set of options O, an option policy is a mapping h: S × O → [0, 1] such that h(s, o) is the probability of selecting option o at state s (provided the previous option has terminated). We shall also call these policies high-level policies. Note that a high-level policy selects options which in turn select actions. Thus a high-level policy gives rise to a standard MDP policy (albeit one that needs to remember which option was selected the last time, i.e., a history dependent policy). Let flat(h) denote the standard MDP policy of a high-level policy h. The value function underlying h is defined as that of flat(h): V^h(s) = V^{flat(h)}(s), s ∈ S. The process of constructing flat(h) given h and the options O is the flattening operation. The model of options is constructed in such a way that if we think of the option return as the immediate reward obtained when following the option and if we think of the terminal distribution as transition probabilities, then Bellman's equations will formally hold for the tuple ⟨γ = 1, S, O, {R^o}, {p^o}⟩.

3 Universal Option Models (UOMs)

In this section, we define the UOM for an option, and prove a universality theorem stating that the traditional model of an option can be constructed from the UOM and a reward vector of the option. The goal of UOMs is to make models of options that are independent of the reward function. We use the adjective “universal” because the option model becomes universal with respect to the rewards. In the case of MDPs, it is well known that the value function of a policy π can be obtained from the so-called discounted occupancy function underlying π, e.g., see [Barto and Duff, 1994]. This technique has been used in inverse reinforcement learning to compute a value function with basis reward functions [Ng and Russell, 2000]. The generalization to options is as follows. First we introduce the discounted state occupancy function, u^o, of option o⟨π, β⟩:

u^o(s, s′) = E_{s,o}[ Σ_{k=0}^{T−1} γ^k I{s_k = s′} ].   (2)

Then,

R^o(s) = Σ_{s′∈S} r_π(s′) u^o(s, s′),

where r_π is the expected immediate reward vector under π and {R^a}, i.e., for any s ∈ S, r_π(s) = E_{s,π}[r_1]. For convenience, we shall also treat u^o(s, ·) as a vector and write u^o(s) to denote it as a vector. To clarify the independence of u^o from the reward function, it is helpful to first note that every MDP can be viewed as the combination of an immediate reward function, {R^a}, and a reward-less MDP, M = ⟨γ, S, A, {P^a}⟩.

Definition 1 The UOM of option o in a reward-less MDP is defined by ⟨u^o, p^o⟩, where u^o is the option's discounted state occupancy function, defined by (2), and p^o is the option's discounted terminal state distribution, defined by (1).

The main result of this section is the following theorem. All the proofs of the theorems in this paper can be found in an extended paper.

Theorem 1 Fix an option o = o⟨π, β⟩ in a reward-less MDP M, and let u^o be the occupancy function underlying o in M. Let {R^a} be some immediate reward function. Then, for any state s ∈ S, the return of option o with respect to M and {R^a} is given by R^o(s) = (u^o(s))^⊤ r_π.

4 UOMs with Linear Function Approximation

In this section, we introduce linear universal option models, which use linear function approximation to compactly represent reward-independent option models over a potentially large state space. In particular, we build upon previous work where the approximate solution has been obtained by solving the so-called projected Bellman equations. We assume that we are given a function φ: S → R^n, which maps any state s ∈ S into its n-dimensional feature representation φ(s). Let V_θ: S → R be defined by V_θ(s) = θ^⊤ φ(s), where the vector θ is a so-called weight vector.² Fix an initial distribution µ over the states and an option o = o⟨π, β⟩. 
Given a reward function R = {R^a}, the TD(0) approximation V_{θ(TD,R)} to R^o is defined as the solution to the following projected Bellman equations [Sutton and Barto, 1998]:

E_{µ,o}[ Σ_{k=0}^{T−1} { r_{k+1} + γ V_θ(s_{k+1}) − V_θ(s_k) } φ(s_k) ] = 0.   (3)

Here s_0 is sampled from µ, and the random variables (r_1, s_1, r_2, s_2, . . .) and T (the termination time) are obtained by following o from this initial state until termination. It is easy to see that if γ = 0 then V_{θ(TD,R)} becomes the least-squares approximation V_{f(LS,R)} to the immediate rewards R under o given the features φ. The least-squares approximation to R is given by f(LS,R) = argmin_f J(f) = E_{µ,o}[ Σ_{k=0}^{T−1} { r_{k+1} − f^⊤ φ(s_k) }^2 ]. We restrict our attention to this TD(0) solution in this paper, and refer to f as an (approximate) immediate reward model.

The TD(0)-based linear UOM (in short, linear UOM) underlying o (and µ) is a pair of n × n matrices (U^o, M^o), which generalize the tabular model (u^o, p^o). Given the same sequence as used in defining the approximation to R^o (equation 3), U^o is the solution to the following system of linear equations:

E_{µ,o}[ Σ_{k=0}^{T−1} { φ(s_k) + γ U^o φ(s_{k+1}) − U^o φ(s_k) } φ(s_k)^⊤ ] = 0.

Let (U^o)^⊤ = [u_1, . . . , u_n], u_i ∈ R^n. If we introduce an artificial “reward” function, r̃_i = φ_i, which is the i-th feature, then u_i is the weight vector such that V_{u_i} is the TD(0)-approximation to the return of o for the artificial reward function. Note that if we use a tabular representation, then u_{i,s} = u^o(s, i) holds for all s, i ∈ S. Therefore our extension to linear function approximation is backward consistent with the UOM definition in the tabular case. However, this alone would not be a satisfactory justification of this choice of linear UOMs. The following theorem shows that, just like the UOMs of the previous section, the U^o matrix allows the separation of the reward from the option models without losing information.

Theorem 2 Fix an option o = o⟨π, β⟩ in a reward-less MDP, M = ⟨γ, S, A, {P^a}⟩, an initial state distribution µ over the states S, and a function φ: S → R^n. Let U be the linear UOM of o w.r.t. φ and µ. Pick some reward function R and let V_{θ(TD,R)} be the TD(0) approximation to the return R^o. Then, for any s ∈ S,

V_{θ(TD,R)}(s) = (f(LS,R))^⊤ (U φ(s)).

The significance of this result is that it shows that to compute the TD approximation of an option return corresponding to a reward function R, it suffices to find f(LS,R) (the least-squares approximation of the expected one-step reward under the option and the reward function R), provided one is given the U matrix of the option. We expect that finding a least-squares approximation (solving a regression problem) is easier than solving a TD fixed-point equation. Note that the result also holds for standard policies, but we do not explore this direction in this paper.

The definition of M^o. The matrix M^o serves as a state predictor, and we call M^o the transient matrix associated with option o. Given a feature vector φ, M^o φ predicts the (discounted) expected feature vector where the option stops. When option o is started from state s and stopped at state s_T in T time steps, we update an estimate of M^o by

M^o ← M^o + η ( γ^T φ(s_T) − M^o φ(s) ) φ(s)^⊤.

Formally, M^o is the solution to the associated linear system,

E_{µ,o}[ γ^T φ(s_T) φ(s)^⊤ ] = M^o E_{µ,o}[ φ(s) φ(s)^⊤ ].   (4)

Notice that M^o is thus just the least-squares solution of the problem when γ^T φ(s_T) is regressed on φ(s), given that we know that option o is executed. Again, this way we obtain the terminal distribution of option o in the tabular case.

A high-level policy h defines a Markov chain over S × O. Assume that this Markov chain has a unique stationary distribution, µ_h. Let (s, o) ∼ µ_h be a draw from this stationary distribution. Our goal is to find an option model that can be used to compute a TD approximation to the value function of a high-level policy h (flattened) over a set of options O. The following theorem shows that the value function of h can be computed from option returns and transient matrices.

Theorem 3 Let V_θ(s) = φ(s)^⊤ θ. Under the above conditions, if θ solves

E_{µ_h}[ ( R^o(s) + (M^o φ(s))^⊤ θ − φ(s)^⊤ θ ) φ(s) ] = 0   (5)

then V_θ is the TD(0) approximation to the value function of h.

² Note that the subscript in V_· always means the TD weight vector throughout this paper.

Recall that Theorem 2 states that the U matrices can be used to compute the option returns given an arbitrary reward function. Thus given a reward function, the U and M matrices are all that one would need to solve the TD solution of the high-level policy. The merit of U and M is that they are reward independent. 
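This reward independence can be illustrated concretely. Below is a minimal tabular numpy sketch (not from the paper; the three-state dynamics, the continuation probabilities, and the reward vectors are made-up illustration values): the occupancy matrix is computed once from the reward-less dynamics, and the option return for any new reward vector is then a single matrix-vector product, in the spirit of Theorem 1.

```python
import numpy as np

# Hypothetical 3-state reward-less MDP with one option (illustration only).
gamma = 0.9
# P_pi[s, s'] : state-transition matrix induced by the option's policy pi
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
# beta[s] : probability of continuing the option from state s
beta = np.array([1.0, 1.0, 0.0])  # the option runs until it reaches state 2

# Tabular occupancy matrix: row s is u^o(s), which satisfies
#   u^o(s) = e_s + gamma * sum_{s'} P_pi(s, s') * beta(s') * u^o(s'),
# i.e., U = (I - gamma * P_pi * diag(beta))^{-1}.
U = np.linalg.inv(np.eye(3) - gamma * P_pi * beta[None, :])

# Given ANY immediate reward vector r_pi, the option return is one product:
r_a = np.array([0.0, 1.0, 0.0])
r_b = np.array([1.0, 0.0, 5.0])
R_a = U @ r_a   # option returns R^o(s) under reward r_a, for every s at once
R_b = U @ r_b   # ... and under r_b, with no re-planning
```

Changing the reward function here costs one matrix-vector product; the dynamics-dependent matrix U is never recomputed.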
Once they are learned, they can be saved and used for different reward functions for different situations at different times.

5 Learning and Planning with UOMs

In this section we give incremental, TD-style algorithms for learning and planning with linear UOMs. We start by describing the learning of UOMs while following some high-level policy h, and then describe a Dyna-like algorithm that estimates the value function of h with learned UOMs and an immediate reward model.

5.1 Learning Linear UOMs

Assume that we are following a high-level policy h over a set of options O, and that we want to estimate linear UOMs for the options in O. Let the trajectory generated by following this high-level policy be . . . , s_k, q_k, o_k, a_k, s_{k+1}, q_{k+1}, . . .. Here, q_k = 1 is the indicator for the event that option o_{k−1} is terminated at state s_k, and so o_k ∼ h(s_k, ·). Also, when q_k = 0, o_k = o_{k−1}. Upon the transition from s_k to s_{k+1}, q_{k+1}, the matrix U^{o_k} is updated as follows:

U^{o_k}_{k+1} = U^{o_k}_k + η^{o_k}_k δ_{k+1} φ(s_k)^⊤, where
δ_{k+1} = φ(s_k) + γ U^{o_k}_k φ(s_{k+1}) I{q_{k+1}=0} − U^{o_k}_k φ(s_k),

and η^{o_k}_k ≥ 0 is the learning rate at time k associated with option o_k. Note that when option o_k is terminated, the temporal difference δ_{k+1} is modified so that the next predicted value is zero.

The {M^o} matrices are updated using the least-mean-square algorithm. In particular, matrix M^{o_k} is updated when option o_k is terminated at time k + 1, i.e., when q_{k+1} = 1. In the update we need the feature (φ̃_·) of the state which was visited at the time option o_k was selected, and also the time elapsed since this time (τ_·):

M^{o_k}_{k+1} = M^{o_k}_k + η̃^{o_k}_k I{q_{k+1}=1} ( γ^{τ_k} φ(s_{k+1}) − M^{o_k}_k φ̃_k ) φ̃_k^⊤,
φ̃_{k+1} = I{q_{k+1}=0} φ̃_k + I{q_{k+1}=1} φ(s_{k+1}),
τ_{k+1} = I{q_{k+1}=0} τ_k + 1.

These variables are initialized to τ_0 = 0 and φ̃_0 = φ(s_0).

The following theorem states the convergence of the algorithm.

Theorem 4 Assume that the stationary distribution of h is unique, that all options in O terminate with probability one, and that all options in O are selected at some state with positive probability.³ If the step-sizes of the options are decreased towards zero so that the Robbins-Monro conditions hold for them, i.e., Σ_{i(k)} η^o_{i(k)} = ∞, Σ_{i(k)} (η^o_{i(k)})^2 < ∞, and Σ_{j(k)} η̃^o_{j(k)} = ∞, Σ_{j(k)} (η̃^o_{j(k)})^2 < ∞,⁴ then for any o ∈ O, M^o_k → M^o and U^o_k → U^o with probability one, where (U^o, M^o) are defined in the previous section.

5.2 Learning Reward Models

In conventional settings, a single reward signal will be contained in the trajectory when following the high-level policy, . . . , s_k, q_k, o_k, a_k, r_{k+1}, s_{k+1}, q_{k+1}, . . .. We can learn for each option an immediate reward model for this reward signal. For example, f^{o_k} is updated using the least-mean-square rule:

f^{o_k}_{k+1} = f^{o_k}_k + η̃^{o_k}_k I{q_{k+1}=0} ( r_{k+1} − (f^{o_k}_k)^⊤ φ(s_k) ) φ(s_k).

In other settings, immediate reward models can be constructed in different ways. 
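The incremental updates of Sections 5.1 and 5.2 above can be sketched as follows (a minimal numpy sketch for a single option; the feature dimension, step-sizes, and the one-hot transitions are hypothetical placeholders, not the paper's code):

```python
import numpy as np

n = 4                          # feature dimension (hypothetical)
gamma, eta, eta_t = 0.9, 0.1, 0.1
U = np.zeros((n, n))           # expected feature-occupancy matrix U^o
M = np.zeros((n, n))           # transient (terminal-prediction) matrix M^o
f = np.zeros(n)                # immediate reward model f^o

def update(phi_k, r_next, phi_next, terminated, phi_start, tau):
    """One transition s_k -> s_{k+1} while a single option is executing.

    phi_start is the feature of the state where the option was initiated,
    and tau the elapsed time used to discount the terminal feature.
    """
    global U, M, f
    # TD-style update of U^o; the bootstrap term is zeroed on termination.
    cont = 0.0 if terminated else 1.0
    delta = phi_k + gamma * cont * (U @ phi_next) - U @ phi_k
    U += eta * np.outer(delta, phi_k)
    if terminated:
        # LMS regression of gamma^tau * phi(s_T) on the initiation feature.
        M += eta_t * np.outer(gamma**tau * phi_next - M @ phi_start, phi_start)
    else:
        # LMS update of the one-step reward model (Section 5.2).
        f += eta_t * (r_next - f @ phi_k) * phi_k
```

In a full implementation one such triple (U, M, f) would be kept per option and the step-sizes annealed per the Robbins-Monro conditions of Theorem 4.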
For example, more than one reward signal can be of interest, so multiple immediate reward models can be learned in parallel. Moreover, such additional reward signals might be provided at any time. In some settings, an immediate reward model for a reward function can be provided directly from knowledge of the environment and features, where the immediate reward model is independent of the option.

5.3 Policy Evaluation with UOMs and Reward Models

Consider the process of policy evaluation for a high-level policy over options from a given set of UOMs when learning a reward model. When starting from a state s with feature vector φ(s) and following option o, the return R^o(s) is estimated from the reward model f^o and the expected feature-occupancy matrix U^o by R^o(s) ≈ (f^o)^⊤ U^o φ(s). The TD(0) approximation to the value function of a high-level policy h can then be estimated online from Theorem 3. Interleaving updates of the reward model learning with these planning steps for h gives a Dyna-like algorithm.

6 Empirical Results

In this section, we provide empirical results on choosing game units to execute specific policies in a simplified real-time strategy game and on recommending articles in a large academic database with more than one million articles. We compare the UOM method with a method of Sorg and Singh (2010), who introduced the linear-option expectation model (LOEM), which is applicable for evaluating a high-level policy over options. Their method estimates (M^o, b^o) from experience, where b^o is equal to (U^o)^⊤ f^o in our formulation. This term b^o is the expected return from following the option, and can be computed incrementally from experience once a reward signal or an immediate reward model is available.

A simplified Star Craft 2 mission. 
We examined the use of the UOMs and LOEMs for policy evaluation in a simplified variant of the real-time strategy game Star Craft 2, where the task for the player was to select the best game unit to move to a particular goal location. We assume that the player has access to a black-box game simulator. There are four game units with the same constant dynamics. The internal status of these units dynamically changes during the game, and this affects the reward they receive in enemy controlled territory. We evaluated these units when their rewards are as listed in the table below (the rewards are associated with the previous state and are not action-contingent). A game map is shown in Figure 1 (a). The four actions could move a unit left, right, up, or down. With probability 2/3, the action moved the unit one grid in the intended direction. With probability 1/3, the action failed, and the agent was moved in a random direction chosen uniformly from the other three directions. If an action would move a unit into the boundary, it remained in the original location (with probability one). The discount factor was 0.9. Features were a lookup table over the 11 × 11 grid. For all algorithms, only one step of planning was applied per action selection. The planning step-size for each algorithm was chosen from 0.001, 0.01, 0.1, 1.0. Only the best one was reported for an algorithm. All data reported were averaged over 30 runs.

³ Otherwise, we can drop the options in O which are never selected by h.
⁴ The index i(k) is advanced for η^o_{i(k)} when following option o, and the index j(k) is advanced for η̃^o_{j(k)} when o is terminated. Note that in the algorithm, we simply wrote η^o_{i(k)} as η^o_k and η̃^o_{j(k)} as η̃^o_k.

Figure 1: (a) A Star Craft local mission map, consisting of four bridged regions, and nine options for the mission. (b) A high-level policy h = ⟨o_1, o_2, o_3, o_6⟩ initiates the options in the regions, with deterministic policies in the regions as given by the arrows: o_1 (green), o_2 (yellow), o_3 (purple), and o_6 (white). Outside these regions, the policies select actions uniformly at random. (c) The expected performance of different units can be learned by simulating trajectories (with the standard deviation shown by the bars), and the UOM method reduces the error faster than the LOEM method.

Rewards of the game units at the enemy locations:

Enemy Locations       | Battlecruiser | Reapers | SCV  | Thor
fortress (yellow)     | 0.3           | -1.0    | -1.0 | 1.0
ground forces (green) | 1.0           | 0.3     | -1.0 | 1.0
viking (red)          | -1.0          | -1.0    | -1.0 | 1.0
cobra (pink)          | 1.0           | 0.5     | -1.0 | -1.0
minerals (blue)       | 0             | 0       | 1.0  | 0

We defined a set of nine options and their corresponding policies, shown in Figure 1 (a), (b). These options are specified by the locations where they terminate, and the policies. The termination location is the square pointed to by each option's arrows. Four of these are “bridges” between regions, and one is the position labeled “B” (which is the player's base at position (1, 1)). Each of the options could be initiated from anywhere in the region in which the policy was defined. The policies for these options were defined by a shortest path traversal from the initial location to the terminal location, as shown in the figure. These policies were not optimized for the reward functions of the game units or the enemy locations.

To choose among units for a mission in real time, a player must be able to efficiently evaluate many options for many units, compute the value functions of the various high-level policies, and select the best unit for a particular high-level goal. A high-level policy for dispatching the game units is defined by initiating different options from different states. For example, a policy for moving units from the base “B” to position “G” can be h = ⟨o_1, o_2, o_3⟩. Another high-level policy could move another unit from the upper left terrain to “G” by a different route with h′ = ⟨o_8, o_5, o_6⟩.

We evaluated policy h for the Reaper unit above using UOMs and LOEMs. We first pre-learned the U^o and M^o models using the experience from 3000 trajectories. Using the reward function described in the above table, we then learned f^o for the UOM and b^o for the LOEM over 100 simulated trajectories, and concurrently learned θ. As shown in Figure 1 (c), the UOM model learns a more accurate estimate of the value function from fewer episodes, when the best performance is taken across the planning step-size. Learning f^o is easier than learning b^o because the stochastic dynamics of the environment is factored out through the pre-learned U^o. These constructed value functions can be used to select the best game unit for the task of moving to the goal location.

This approach is computationally efficient for multiple units. We compared the computation time of LOEMs and UOMs with linear Dyna on a modern PC with an Intel 1.7GHz processor and 8GB RAM in a MATLAB implementation. Learning U^o took 81 seconds. We used a recursive least-squares update to learn M^o, which took 9.1 seconds. Thus, learning an LOEM model is faster than learning a UOM for a single fixed reward function, but the UOM can produce an accurate option return quickly for each new reward function. Learning the value function incrementally from the 100 trajectories took 0.44 seconds for the UOM and 0.61 seconds for the LOEM. The UOM is slightly more efficient as f^o is more sparse than b^o, but it is substantially more accurate, as shown in Figure 1 (c). 
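The per-reward reuse described above can be sketched as follows (a hypothetical numpy sketch, not the experiment code: the feature dimension matches Section 6 but the random stand-in for a pre-learned U^o is made up). Once U^o is fixed, each new reward model costs only one matrix-vector product, per Theorem 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 58                              # feature dimension (as in Section 6)
U_o = rng.standard_normal((n, n))   # stand-in for a pre-learned U^o

def option_return_weights(f_o):
    """theta = (U^o)^T f^o, so that R^o(s) is approximated by theta^T phi(s)."""
    return U_o.T @ f_o

# A new (binary, 3-feature) reward model arrives at evaluation time:
f_q = np.zeros(n)
f_q[[3, 10, 42]] = 1.0
theta_q = option_return_weights(f_q)    # one matrix-vector product

phi_s = np.zeros(n); phi_s[3] = 1.0     # features of some state s
estimated_return = theta_q @ phi_s
```

Evaluating many units or many reward functions then amounts to repeating the cheap last step against the same pre-learned matrices.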
We evaluated all the units and the results are similar.\nArticle recommendation. Recommending relevant articles for a given user query can be thought of\nas predicting an expected return of an option for a dynamically speci\ufb01ed reward model. Ranking\nan article as a function of the links between articles in the database has proven to be a successful\napproach to article recommendation, with PageRank and other link analysis algorithms using a ran-\ndom surfer model [Page et al., 1998]. We build on this idea, by mapping a user query to a reward\nmodel and pre-speci\ufb01ed option for how a reader might transition between articles. The ranking of\nan article is then the expected return from following references in articles according to the option.\nConsider the policy of performing a random-walk between articles in a database by following a ref-\nerence from an article that is selected uniformly at random. An article receives a positive reward if it\nmatches a user query (and is otherwise zero), and the value of the article is the expected discounted\nreturn from following the random-walk policy over articles. More focused reader policies can be\nspeci\ufb01ed as following references from an article with a common author or keyword.\nWe experimented with a collection from DBLP that has about 1.5 million articles, 1 million authors,\nand 2 millions citations [Tang et al., 2008]. We assume that a user query q is mapped directly to an\noption o and an immediate reward model f o\nq . For simplicity in our experiment, the reward models\nare all binary, with three non-zero features drawn uniformly at random. In total we used about 58\nfeatures, and the discount factor was 0.9. There were three policies. The \ufb01rst followed a reference\nselected uniformly at random, the second selected a reference written by an author of the current\narticle (selected at random), and the third selected a reference with a keyword in common with the\ncurrent article. 
Three options were defined from these policies, with termination probability β = 1.0 if no suitable outgoing reference was available and β = 0.25 otherwise. High-level policies over sequences of these options could also be applied, but were not tested here. We used bibliometric features for the articles, extracted from the author, title, and venue fields.

We generated queries q at random, where each query specified an associated option o and an option-independent immediate reward model f^o_q = f_q, and we then computed their value functions. The immediate reward model is naturally constructed for these problems: the reward comes from the starting article based on its features, so it does not depend on the action taken (and thus not on the option). This approach is appropriate for article recommendation, as a query can provide both terms for relevant features (such as the venue) and a specification of how the reader intends to follow references.

For the UOM-based approach, we pre-learned U^o and then computed U^o f^o_q for each query. For the LOEM approach, we learned a b_q for each query by simulating 3000 trajectories in the database (the simulated trajectories were shared across queries). The computation times (in seconds) for the two approaches are shown in the table below; UOMs are much more computationally efficient than LOEMs.

Number of reward functions:   10     100    500    1,000   10,000
LOEM                          0.03   0.09   0.47   0.86    9.65
UOM                           0.01   0.04   0.07   0.12    1.21

7 Conclusion

We proposed a new way of modeling options, in both the tabular representation and with linear function approximation, called the universal option model. We showed how to learn UOMs and how to use them to construct the TD solutions of option returns and the value functions of policies over options, and we proved theoretical guarantees for these constructions. UOMs are advantageous in large online systems.
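The per-query costs in the timing table above reflect this amortization: U^o is learned once, and each newly specified reward model costs only a matrix–vector product, with no new trajectories simulated. A minimal sketch, where the sizes and the random stand-in for a pre-learned U^o are illustrative assumptions:

```python
import numpy as np

# Toy illustration of amortizing one pre-learned UOM across many queries.
# U below is random stand-in data, not a model learned from DBLP.
rng = np.random.default_rng(0)
n = 500                                     # number of articles (toy size)
U = np.eye(n) + 0.01 * rng.random((n, n))   # stands in for a pre-learned U^o

def value_for_query(U, feature_idx):
    """Build the binary reward model f_q and return its values U @ f_q."""
    f_q = np.zeros(len(U))
    f_q[feature_idx] = 1.0     # three non-zero reward features, as above
    return U @ f_q

# 100 queries share the same pre-learned U; each costs one product.
values = [value_for_query(U, rng.choice(n, size=3, replace=False))
          for _ in range(100)]
```

A LOEM-style model would instead re-learn a return vector b_q from simulated trajectories for every one of these queries, which is where the gap in the table comes from.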
Estimating the return of an option under a new reward function with the option's UOM reduces to a one-step regression, and computing option returns that depend on many reward functions, as in large online games and search systems, is much faster with UOMs than with previous methods for learning option models.

Acknowledgments

We thank the reviewers for their comments. This work was supported by grants from Alberta Innovates Technology Futures, NSERC, and the Department of Science and Technology, Government of India.

References

Abbeel, P., Coates, A., and Ng, A. Y. (2010). Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research, 29(13):1608–1639.

Barto, A. and Duff, M. (1994). Monte Carlo matrix inversion and reinforcement learning. NIPS, pages 687–694.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Athena Scientific.

Jaakkola, T., Jordan, M., and Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201.

Ng, A. Y. and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. ICML, pages 663–670.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University.

Precup, D. (2000). Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst.

Richardson, M. and Domingos, P. (2002). The intelligent surfer: Probabilistic combination of link and content information in PageRank. NIPS.

Sorg, J. and Singh, S. (2010). Linear options. AAMAS, pages 31–38.

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.

Syed, U. A. (2010). Reinforcement Learning Without Rewards. PhD thesis, Princeton University.

Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z. (2008). ArnetMiner: Extraction and mining of academic social networks. SIGKDD, pages 990–998.