{"title": "Learning Robust Options by Conditional Value at Risk Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2619, "page_last": 2629, "abstract": "Options are generally learned by using an inaccurate environment model (or simulator), which contains uncertain model parameters. \nWhile there are several methods to learn options that are robust against the uncertainty of model parameters, these methods only consider either the worst case or the average (ordinary) case for learning options. \nThis limited consideration of the cases often produces options that do not work well in the unconsidered case. \nIn this paper, we propose a conditional value at risk (CVaR)-based method to learn options that work well in both the average and worst cases. \nWe extend the CVaR-based policy gradient method proposed by Chow and Ghavamzadeh (2014) to deal with robust Markov decision processes and then apply the extended method to learning robust options. \nWe conduct experiments to evaluate our method in multi-joint robot control tasks (HopperIceBlock, Half-Cheetah, and Walker2D). 
\nExperimental results show that our method produces options that 1) give better worst-case performance than the options learned only to minimize the average-case loss, and 2) give better average-case performance than the options learned only to minimize the worst-case loss.", "full_text": "Learning Robust Options\n\nby Conditional Value at Risk Optimization\n\nTakuya Hiraoka 1,2,3, Takahisa Imagawa 2, Tatsuya Mori 1,2,3, Takashi Onishi 1,2,\n\nYoshimasa Tsuruoka 2,4\n\n2National Institute of Advanced Industrial Science and Technology\n\n3RIKEN Center for Advanced Intelligence Project\n\n1NEC Corporation\n\n4The University of Tokyo\n\n{takuya-h1, tmori, takashi.onishi}@nec.com\n\nimagawa.t@aist.go.jp, tsuruoka@logos.t.u-tokyo.ac.jp\n\nAbstract\n\nOptions are generally learned by using an inaccurate environment model (or simu-\nlator), which contains uncertain model parameters. While there are several methods\nto learn options that are robust against the uncertainty of model parameters, these\nmethods only consider either the worst case or the average (ordinary) case for learn-\ning options. This limited consideration of the cases often produces options that do\nnot work well in the unconsidered case. In this paper, we propose a conditional\nvalue at risk (CVaR)-based method to learn options that work well in both the aver-\nage and worst cases. We extend the CVaR-based policy gradient method proposed\nby Chow and Ghavamzadeh (2014) to deal with robust Markov decision processes\nand then apply the extended method to learning robust options. We conduct experi-\nments to evaluate our method in multi-joint robot control tasks (HopperIceBlock,\nHalf-Cheetah, and Walker2D). 
Experimental results show that our method produces\noptions that 1) give better worst-case performance than the options learned only to\nminimize the average-case loss, and 2) give better average-case performance than\nthe options learned only to minimize the worst-case loss.\n\n1\n\nIntroduction\n\nIn the reinforcement learning context, an Option means a temporally extended sequence of ac-\ntions [30], and is regarded as useful for many purposes, such as speeding up learning, transferring\nskills across domains, and solving long-term planning problems. Because of its usefulness, the\nmethods for discovering (learning) options have been actively studied (e.g., [1, 17, 18, 20, 28, 29]).\nLearning options successfully in practical domains requires a large amount of training data, but\ncollecting data from a real-world environment is often prohibitively expensive [15]. In such cases,\nenvironment models (i.e., simulators) are used for learning options instead, but such models usually\ncontain uncertain parameters (i.e., a reality gap) since they are constructed based on insu\ufb03cient\nknowledge of a real environment. Options learned with an inaccurate model can lead to a signi\ufb01cant\ndegradation of the performance of the agent when deployed in the real environment [19]. This\nproblem is a severe obstacle to applying the options framework to practical tasks in the real world,\nand has driven the need for methods that can learn robust options against inaccuracy of models.\nSome previous work has addressed the problem of learning options that are robust against inaccuracy\nof models. Mankowitz et al. [19] proposed a method to learn options that minimize the expected loss\n(or maximize an expected return) on the environment model with the worst-case model parameter\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fvalues, which are selected so that the expected loss is maximized. Frans et al. 
[10] proposed a method\nto learn options that minimize the expected loss on the environment model with the average-case\nmodel parameter values.\nHowever, these methods only consider either the worst-case model parameter values or the average-\ncase model parameter values when learning options, and this causes learned options to work poorly in\nan environment with the unconsidered model parameter values. Mankowitz\u2019s [19] method produces\noptions that are overly conservative in the environment with an average-case model parameter\nvalue [9]. In contrast, Frans\u2019s method [10] produces options that can cause a catastrophic failure in\nan environment with the worst-case model parameter value.\nFurthermore, Mankowitz\u2019s [19] method does not use a distribution of the model parameters for\nlearning options. Generally, the distribution can be built on the basis of prior knowledge. If the\ndistribution can be used, as in Derman et al. [9] and Rajeswaran et al. [24], one can use it to adjust\nthe importance of the model parameter values in learning options. However, Mankowitz\u2019s method\ndoes not consider the distribution and always produces the policy to minimize the loss in the worst\ncase model parameter value. Therefore, even if the worst case is extremely unlikely to happen, the\nlearned policy is adapted to the case and results in being overly conservative.\nTo mitigate the aforementioned problems, we propose a conditional value at risk (CVaR)-based\nmethod to learn robust options optimized for the expected loss in both the average and worst cases.\nIn our method, an option is learned so that it 1) makes the expected loss lower than a given threshold\nin the worst case and also 2) decreases the expected loss in the average case as much as possible.\nSince our method considers both the worst and average cases when learning options, it can mitigate\nthe aforementioned problems with the previous methods for option learning. 
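The tension between the two criteria can be made concrete with a toy numerical sketch (the numbers below are illustrative, not taken from the paper): a policy that is best on average may hide a rare catastrophic loss that only a tail measure such as CVaR exposes.

```python
# Toy illustration (hypothetical numbers): two policies evaluated on losses
# observed under 10 sampled model parameter values.
losses_avg_optimal = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 9.0]   # rare blow-up
losses_conservative = [3.0, 3.1, 2.9, 3.0, 3.2, 2.8, 3.0, 3.1, 2.9, 3.3]  # uniformly mediocre

def mean(xs):
    return sum(xs) / len(xs)

def cvar(xs, eps=0.1):
    """Average of the worst eps-fraction of losses (the upper tail)."""
    k = max(1, int(round(eps * len(xs))))
    return mean(sorted(xs)[-k:])

# The average-optimal policy has the lower mean loss but a far worse tail;
# the conservative policy is safe in the tail but poor on average.
print(mean(losses_avg_optimal), cvar(losses_avg_optimal))    # mean ~1.8, CVaR = 9.0
print(mean(losses_conservative), cvar(losses_conservative))  # mean ~3.0, CVaR = 3.3
```

Under a CVaR constraint with, say, ζ = 4 and ε = 0.1, the first policy is infeasible (its CVaR is 9.0) even though its mean loss is lower; a criterion of the kind proposed here instead searches for the lowest average loss among policies whose tail loss stays below ζ.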
In addition, unlike\nMankowitz\u2019s [19] method, our method evaluates the expected loss in the worst case on the basis\nof CVaR (see Section 3). This enables us to re\ufb02ect the model parameter distribution in learning\noptions, and thus prevents the learned options from over\ufb01tting an environment with an extremely rare\nworst-case parameter value.\nOur main contributions are as follows: (1) We introduce CVaR optimization to learning robust options\nand illustrate its e\ufb00ectiveness (see Section 3, Section 4, and Appendices). Although some previous\nworks have proposed CVaR-based reinforcement learning methods (e.g., [6, 24, 32]), their methods\nhave been applied to learning robust \u201c\ufb02at\u201d policies, and option learning has not been discussed. (2)\nWe extend Chow and Ghavamzadeh\u2019s CVaR-based policy gradient method [6] so that it can deal\nwith robust Markov decision processes (robust MDPs) and option learning. Their method is for\nordinary MDPs and does not take the uncertainty of model parameters into account. We extend their\nmethod to robust MDPs to deal with the uncertain model parameters (see Section 3). Further, we\napply the extended method to learn robust options. For this application, we derive the option policy\ngradient theorem [1] to minimise soft robust loss [9] (see Appendices). (3) We evaluate our robust\noption learning methods in several nontrivial continuous control tasks (see Section 4 and Appendices).\nRobust option learning methods have not been thoroughly evaluated on such tasks. Mankowitz et\nal. [19] evaluated their methods only on simple control tasks (Acrobot and CartPole with discrete\naction spaces). Our research is a \ufb01rst attempt to evaluate robust option learning methods on more\ncomplex tasks (multi-joint robot control tasks) where state\u2013action spaces are high dimensional and\ncontinuous.\nThis paper is organized as follows. 
First we define our problem and introduce preliminaries (Section 2), and then propose our method by extending Chow and Ghavamzadeh's method (Section 3). We evaluate our method on multi-joint robot control tasks (Section 4). We also discuss related work (Section 5). Finally, we conclude the paper.

2 Problem Description

We consider a robust MDP [34], which is a tuple ⟨S, A, C, γ, P, T_p, P_0⟩, where S, A, C, γ, and P are the states, the actions, a cost function, a discount factor, and an ambiguity set, respectively; T_p and P_0 are a state transition function parameterized by p ∈ P and an initial state distribution, respectively. The transition function is used in the form s_{t+1} ∼ T_p(s_t, a_t), where s_{t+1} is a random variable. Similarly, the initial state distribution is used in the form s_0 ∼ P_0, and the cost function is used in the form c_{t+1} ∼ C(s_t, a_t), where c_{t+1} is a random variable. The sum of the costs discounted by γ is called a loss: C = Σ_{t=0}^{T} γ^t c_t.
As in standard robust reinforcement learning settings [9, 10, 24], p is generated by a model parameter distribution P(p) that captures our subjective belief about the parameter values of a real environment. Using this distribution, we define a soft robust loss [9]:

    E_{C,p}[C] = Σ_p P(p) E_C[C | p],    (1)

where E_C[C | p] is the expected loss on a parameterized MDP ⟨S, A, C, γ, T_p, P_0⟩, in which the transition function is parameterized by p.
As in Derman et al. [9] and Xu and Mannor [35], we make the rectangularity assumption on P and P(p). That is, P is assumed to be structured as a Cartesian product ⊗_{s∈S} P_s, and P(p) is also assumed to be a Cartesian product ⊗_{s∈S} P_s(p_s) with p_s ∈ P_s. These assumptions enable us to define a model parameter distribution independently for each state.
We also assume that the learning agent interacts with an environment on the basis of its policy, which takes the form of the call-and-return option [29]. An option ω ∈ Ω is a temporally extended action represented as a tuple ⟨I_ω, π_ω, β_ω⟩, in which I_ω ⊆ S is an initiation set, π_ω is an intra-option policy, and β_ω : S → [0, 1] is a termination function. In the call-and-return model, an agent picks option ω in accordance with its policy over options π_Ω and follows the intra-option policy π_ω until termination (as dictated by β_ω), at which point this procedure is repeated. Here π_Ω, π_ω, and β_ω are parameterized by θ_πΩ, θ_πω, and θ_βω, respectively.
Our objective is to optimize the parameters θ = (θ_πΩ, θ_πω, θ_βω)¹ so that the learned options work well in both the average and worst cases. The optimization criterion is described in the next section.

3 Option Critic with the CVaR Constraint

To achieve our objective, we need a method that produces option policies with parameters θ that work well in both the average and worst cases. Chow and Ghavamzadeh [6] proposed a CVaR-based policy gradient method to find such parameters in ordinary MDPs.
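For intuition, the quantities that appear in the objective below — the value at risk, CVaR, and the Rockafellar–Uryasev reformulation — can be checked numerically on synthetic loss samples (an illustrative sketch, not code from the paper):

```python
import random

random.seed(0)
eps = 0.1  # tail probability level
# Synthetic stand-ins for loss samples C observed under p ~ P(p).
losses = sorted(random.gauss(1.0, 0.5) for _ in range(1000))

# Upper-tail value at risk: the largest z with P(C >= z) >= eps.
var_eps = losses[int((1 - eps) * len(losses))]

# CVaR: the mean loss over the worst eps-fraction of samples.
tail = losses[int((1 - eps) * len(losses)):]
cvar_direct = sum(tail) / len(tail)

# Rockafellar-Uryasev form: CVaR = min over v of  v + E[max(0, C - v)] / eps,
# with the minimum attained at v = VaR_eps.
def ru_objective(v):
    return v + sum(max(0.0, c - v) for c in losses) / (eps * len(losses))

cvar_ru = min(ru_objective(v) for v in losses)
# var_eps <= cvar_direct, and cvar_ru matches cvar_direct up to float error.
```

This identity is what allows the tail constraint to be rewritten with an auxiliary variable v and then relaxed with a Lagrange multiplier, as done in the derivation that follows.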
In this section, we extend their policy gradient method to robust MDPs, and then apply the extended method to option learning.
First, we define the optimization objective:

    min_θ E_{C,p}[C]    s.t.    E_{C,p}[C | C ≥ VaR_ε(C)] ≤ ζ,    (2)
    VaR_ε(C) = max{z ∈ ℝ | P(C ≥ z) ≥ ε},    (3)

where ζ is the loss tolerance and VaR_ε is the upper ε-percentile of the loss (also known as an upper-tail value at risk). The minimized term is the soft robust loss (Eq. 1). The expectation in the constraint is CVaR, i.e., the expected value of the loss exceeding the value at risk (CVaR thus evaluates the expected loss in the worst case). The constraint requires that CVaR be equal to or less than ζ. Eq. 2 extends the optimization objective of Chow and Ghavamzadeh [6] to the soft robust loss (Eq. 1), and thus uses the model parameter distribution to adjust the importance of the expected loss on each parameterized MDP. In the remainder of this section, we derive an algorithm for robust option policy learning that minimizes Eq. 2, following Chow and Ghavamzadeh [6].
By Theorem 16 in [25], Eq. 2 can be shown to be equivalent to the following objective:

    min_{θ, v∈ℝ} E_{C,p}[C]    s.t.    v + (1/ε) E_{C,p}[max(0, C − v)] ≤ ζ.    (4)

To solve Eq. 4, we use the Lagrangian relaxation method [3] to convert it into the following unconstrained problem:

    max_{λ≥0} min_{θ,v} L(θ, v, λ)    s.t.    L(θ, v, λ) = E_{C,p}[C] + λ ( v + (1/ε) E_{C,p}[max(0, C − v)] − ζ ),    (5)

¹For simplicity, we assume that the parameters θ_πΩ, θ_πω, and θ_βω are scalar variables.
All the discussion in this paper can be extended to the case of multi-dimensional variables by making the following changes: 1) define θ as the concatenation of the multidimensional extensions of the parameters, and 2) replace all partial derivatives with respect to scalar parameters with vector derivatives with respect to the multidimensional parameters.

where λ is the Lagrange multiplier. The goal here is to find a saddle point of L(θ, v, λ), i.e., a point (θ*, v*, λ*) that satisfies L(θ, v, λ*) ≥ L(θ*, v*, λ*) ≥ L(θ*, v*, λ) for all θ, v and all λ ≥ 0. This is achieved by descending in θ and v and ascending in λ using the gradients of L(θ, v, λ):

    ∇_θ L(θ, v, λ) = ∇_θ E_{C,p}[C] + (λ/ε) ∇_θ E_{C,p}[max(0, C − v)],    (6)
    ∂L(θ, v, λ)/∂v = λ ( 1 + (1/ε) ∂E_{C,p}[max(0, C − v)]/∂v ),    (7)
    ∂L(θ, v, λ)/∂λ = v + (1/ε) E_{C,p}[max(0, C − v)] − ζ.    (8)

3.1 Gradient with respect to the option policy parameters θ_πΩ, θ_πω, and θ_βω

In this section, based on Eq. 6, we propose policy gradients with respect to θ_πΩ, θ_πω, and θ_βω (Eq. 12, Eq. 13, and Eq. 14), which are useful for developing an option-critic style algorithm. First, we show that Eq. 6 can be simplified into the gradient of a soft robust loss in an augmented MDP, and then derive its policy gradients.
By applying Corollary 1 in Appendices and Eq. 1, Eq. 6 can be rewritten as

    ∇_θ L(θ, v, λ) = Σ_p P(p) ∇_θ ( E_C[C | p] + (λ/ε) E_C[max(0, C − v) | p] ).    (9)

In Eq. 9, the expected soft robust loss term and the constraint term are both contained inside the gradient operator. If we proceeded with the derivation of the gradient while treating these terms as they are, the derivation would become complicated. Therefore, as in Chow and Ghavamzadeh [6], we merge them into a single term by extending the optimization problem. For this, we augment the parameterized MDP by adding an extra cost factor whose discounted expectation matches the constraint term. We also extend the state with an additional state factor x ∈ ℝ for calculating the extra cost factor. Formally, given the original parameterized MDP ⟨S, A, C, γ, T_p, P_0⟩, we define an augmented parameterized MDP ⟨S′, A, C′, γ, T′_p, P′_0⟩, where S′ = S × ℝ, P′_0 = P_0 · 1(x_0 = v)², and

    C′(s′, a) = C(s, a) + λ max(0, −x)/ε   (if s is a terminal state),
    C′(s′, a) = C(s, a)                    (otherwise),

    T′_p(s′, a) = T_p(s, a)   (if x′ = (x − C(s, a))/γ),
    T′_p(s′, a) = 0           (otherwise).

Here the augmented state transition and cost functions are used in the forms s′_{t+1} ∼ T′_p(s′_t, a_t) and c′_{t+1} ∼ C′(s′_t, a_t), respectively. On this augmented parameterized MDP, the loss C′ can be written as³

    C′ = Σ_{t=0}^{T} γ^t c′_t = Σ_{t=0}^{T} γ^t c_t + (λ/ε) max( 0, Σ_{t=0}^{T} γ^t c_t − v ).    (10)

²1(•) is an indicator function, which returns one if the argument • is true and zero otherwise.
³Note that x_T = −Σ_{t=0}^{T} γ^{−T+t} c_t + γ^{−T} v, and thus γ^T max(0, −x_T) = max(0, Σ_{t=0}^{T} γ^t c_t − v).

From Eq. 10, it is clear that, by extending the optimization problem from the parameterized MDP to the augmented parameterized MDP, Eq. 9 can be rewritten as

    ∇_θ L(θ, v, λ) = Σ_p P(p) ∇_θ E_{C′}[C′ | p].    (11)

By Theorems 1, 2, and 3 in Appendices⁴, Eq. 11 with respect to θ_πΩ, θ_πω, and θ_βω can be written as

    ∂L(θ, v, λ)/∂θ_πΩ = E_{s′}[ Σ_ω (∂π_Ω(ω|s′)/∂θ_πΩ) Q_Ω(s′, ω) | p̄ ],    (12)
    ∂L(θ, v, λ)/∂θ_πω = E_{s′,ω}[ Σ_a (∂π_ω(a|s′)/∂θ_πω) Q_ω(s′, ω, a) | p̄ ],    (13)
    ∂L(θ, v, λ)/∂θ_βω = E_{s′,ω}[ (∂β_ω(s′)/∂θ_βω) ( V_Ω(s′) − Q_Ω(s′, ω) ) | p̄ ].    (14)

Here Q_Ω, Q_ω, and V_Ω are the average value (i.e., the soft robust loss) of executing option ω, the value of executing an action in the context of a state–option pair, and the state value, respectively. In addition, E_{s′}[• | p̄] and E_{s′,ω}[• | p̄] are the expectations of the argument • with respect to the probability of trajectories generated from the average augmented MDP ⟨S′, A, C′, γ, E_p[T′_p], P′_0⟩, which follows the average transition function E_p[T′_p] = Σ_p P(p) T′_p.

3.2 Gradient with respect to λ and v

As in the last section, we extend the parameter optimization problem into an augmented MDP. We define an augmented parameterized MDP ⟨S′, A, C″, γ, T′_p, P′_0⟩, where

    C″(s′, a) = max(0, −x)   (if s is a terminal state);   0   (otherwise),    (15)

and the other functions and variables are the same as those in the augmented parameterized MDP of the last section. Here the augmented cost function is used in the form c″_{t+1} ∼ C″(s′_t, a_t). On this augmented parameterized MDP, the loss C″ can be written as

    C″ = Σ_{t=0}^{T} γ^t c″_t = max( 0, Σ_{t=0}^{T} γ^t c_t − v ).    (16)

From Eq. 16, it is clear that, by extending the optimization problem from the parameterized MDP to the augmented parameterized MDP, E_{C,p}[max(0, C − v)] in Eq. 7 and Eq. 8 can be replaced by E_{C″,p}[C″]. Further, by Corollary 2 in Appendices, this term can be rewritten as E_{C″,p}[C″] = E_{C″}[C″ | p̄]. Therefore, Eq. 7 and Eq. 8 can be rewritten as

    ∂L(θ, v, λ)/∂v = λ ( 1 + (1/ε) ∂E_{C″}[C″ | p̄]/∂v ) = λ ( 1 − (1/ε) E_{C″}[ 1(C″ ≥ 0) | p̄ ] ),    (17)
    ∂L(θ, v, λ)/∂λ = v + (1/ε) E_{C″}[C″ | p̄] − ζ.    (18)

3.3 Algorithmic representation of Option Critic with CVaR Constraints (OC3)

In the last two sections, we derived the gradients of L(θ, v, λ) with respect to the option policy parameters, v, and λ. In this section, on the basis of these derivations, we propose an algorithm to optimize the option policy parameters.
Algorithm 1 shows pseudocode for learning options with the CVaR constraint. In lines 2–5, we sample trajectories, which are used to estimate the state–action values and to update the parameters. In this sampling phase, analytically finding the average transition function may be difficult if the transition function is given as a black-box simulator. In such cases, sampling approaches can be used to approximate the average transition function. One possible approach is to sample p from P(p) and approximate E_p[T′_p] by a sampled transition function. Another approach is to, for each trajectory sampling, sample p from

⁴To save space in the main content pages, we describe the details of these theorems and their proofs in Appendices.
Note that the soft robust loss (i.e., model parameter uncertainty) is not considered in the policy gradient theorems for the vanilla option-critic framework [1], and that the policy gradient theorems for the soft robust loss (Theorems 1, 2, and 3) are newly proposed in our paper.

Algorithm 1 Option Critic with CVaR Constraint (OC3)
Input: θ_πω,0, θ_βω,0, θ_πΩ,0, v_0, λ_0, N_iter, N_epi, ζ, Ω
1: for iteration i = 0, 1, ..., N_iter do
2:   for k = 0, 1, ..., N_epi do
3:     Sample a trajectory τ_k = {s′_t, ω_t, a_t, (c′_{t+1}, c″_{t+1})}_{t=0}^{T−1} from the average augmented MDP ⟨S′, A, (C′, C″), γ, E_p[T′_p], P′_0⟩ by the option policies (π_ω, β_ω, π_Ω) with θ_πω,i, θ_βω,i, and θ_πΩ,i.
4:   end for
5:   Update Q_Ω and V_Ω with {τ_0, ..., τ_{N_epi}}.
6:   for ω ∈ Ω do
7:     Update Q_ω with {τ_0, ..., τ_{N_epi}}.
8:     θ_πω,i+1 ← θ_πω,i − α ∂L(θ_i, v_i, λ_i)/∂θ_πω,i
9:     θ_βω,i+1 ← θ_βω,i − α ∂L(θ_i, v_i, λ_i)/∂θ_βω,i
10:  end for
11:  θ_πΩ,i+1 ← θ_πΩ,i − α ∂L(θ_i, v_i, λ_i)/∂θ_πΩ,i
12:  v_{i+1} ← v_i − α ∂L(θ_i, v_i, λ_i)/∂v_i
13:  λ_{i+1} ← λ_i + α ∂L(θ_i, v_i, λ_i)/∂λ_i
14: end for

P(p) and then generate a trajectory from the augmented parameterized MDP ⟨S′, A, (C′, C″), γ, T′_p, P′_0⟩.
In lines 6–11, we update the value functions, and then update the option policy parameters on the basis of Eq. 12, Eq. 13, and Eq. 14. The implementation of this update part differs depending on the base reinforcement learning method; in this paper, we adapt the proximal policy option critic (PPOC) [16] to this part⁵. In lines 12–14, we update v and λ. In this part, the gradients with respect to v and λ are calculated by Eq. 17 and Eq. 18.
Our algorithm differs from the Actor-Critic Algorithm for CVaR Optimization [6] mainly in the following points:
Policy structure: In the algorithm in [6], a flat policy is optimized, while in our algorithm the option policies, which are more general than a flat policy⁶, are optimized.
The type of augmented MDP: In the algorithm in [6], the augmented MDP takes the form ⟨S′, A, C′, γ, T′, P′_0⟩, while in our algorithm it takes the form ⟨S′, A, C′, γ, E_p[T′_p], P′_0⟩. Here T′ is the transition function without model parameters. This difference (T′ vs. E_p[T′_p]) comes from the difference of optimization objectives: in [6], the optimization objective is the expected loss without model parameter uncertainty (E_{C′}[C′]), while in our optimization objective (Eq. 2) the model parameter uncertainty is considered by taking the expectation over the model parameter distribution (E_{C′,p}[C′] = Σ_p P(p) E_{C′}[C′ | p]).

4 Experiments

In this section, we conduct an experiment to evaluate our method (OC3) on multi-joint robot control tasks with model parameter uncertainty⁷.
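To make "model parameter uncertainty" concrete: in this setting, uncertain physical parameters are resampled from the model parameter distribution P(p) at the start of each episode. A minimal environment wrapper in that spirit might look as follows (a sketch with a gym-style interface; `set_params` is a hypothetical hook, and the numbers are illustrative, not those of Tables 1 and 2 in Appendices):

```python
import random

class ModelParameterUncertaintyWrapper:
    """Resamples uncertain physics parameters from P(p) at every reset.

    Sketch only: `env.set_params` is a hypothetical hook; real MuJoCo-based
    environments expose such parameters through their model objects.
    """

    def __init__(self, env, discrete=False, seed=None):
        self.env = env
        self.discrete = discrete
        self.rng = random.Random(seed)
        self.params = None

    def _sample_params(self):
        if self.discrete:
            # Discrete case: Bernoulli switch between a nominal and a rare
            # slippery value (probabilities are illustrative).
            friction = 2.0 if self.rng.random() < 0.9 else 0.2
            torso_mass = 6.0
        else:
            # Continuous case: truncated Gaussian around nominal values
            # (approximated here by clipping; means/stds are illustrative).
            friction = min(max(self.rng.gauss(2.0, 0.25), 1.5), 2.5)
            torso_mass = min(max(self.rng.gauss(6.0, 1.5), 3.0), 9.0)
        return {"friction": friction, "torso_mass": torso_mass}

    def reset(self):
        self.params = self._sample_params()
        self.env.set_params(**self.params)  # hypothetical hook
        return self.env.reset()

    def step(self, action):
        return self.env.step(action)
```

With such a wrapper, the same training loop can be run under either the continuous (truncated Gaussian) or the discrete (Bernoulli) parameter distribution.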
Through this experiment, we elucidate answers to the following questions:
Q1: Can our method (OC3) successfully produce options that satisfy the constraint in Eq. 2 in non-trivial cases? Note that since ζ in Eq. 2 is a hyper-parameter, we can set it to an arbitrary value. If we use a very high value for ζ, the learned options easily satisfy the constraint. Our interest lies in the non-trivial cases, in which ζ is set to a reasonably low value.
Q2: Can our method successfully produce options that work better in the worst case than options learned only to minimize the average-case loss (i.e., the soft robust loss, Eq. 1)?

⁵In PPOC, the Proximal Policy Optimization method is applied to update parameters [27]. We decided to use PPOC since it produced high-performance option policies in continuous control tasks.
⁶By using a single option, the options framework specializes to a flat policy.
⁷Source code to replicate the experiments is available at https://github.com/TakuyaHiraoka/Learning-Robust-Options-by-Conditional-Value-at-Risk-Optimization

Q3: Can our method successfully produce options that work better in the average case than options learned only to minimize worst-case losses? Here, a worst-case loss is an expected loss evaluated under the worst case; the expectation term of the constraint in Eq. 2 is one instance of such losses. Another instance is the loss in an environment following the worst-case model parameter:

    E_C[C | p̃]    s.t.    p̃ = arg max_{p∈P} E_C[C | p].    (19)

In this experiment, we compare our proposed method with three baselines. All methods are implemented by extending PPOC [16].
SoftRobust: In this method, the option policies are learned to minimize the expected soft robust loss (Eq. 1), the same as in Frans et al. [10]. We implemented this method by adapting the meta-learning shared hierarchies (MLSH) algorithm [10]⁸ to PPOC.
WorstCase: In this method, the option policies are learned to minimize the expected loss in the worst case (i.e., Eq. 19), as in Mankowitz et al. [19]. This method is implemented by modifying the temporal difference error evaluation in the generalized advantage estimation [26] part of PPOC as −V(s_t) + c_{t+1} + γ max_{p∈P} E_{s_{t+1}∼T_p(s_t, a_t)}[V(s_{t+1})]. If the model parameters are continuous, P becomes an infinite set and evaluating this error becomes intractable. To mitigate this intractability for continuous model parameters, we discretize the parameters by tile-coding and use the discretized parameter set as P.
EOOpt-ε: In this method, the option policies are learned to minimize the expected loss in the worst case. Unlike WorstCase, in this method the expectation term of the constraint in Eq. 2 is used as the optimization objective. This method is implemented by adapting the EPOpt-ε algorithm [24] to PPOC (see Algorithm 2 in Appendices). In this experiment, we set the value of ε to 0.1.
OC3: In this method, the option policies are learned to minimize the expected loss on the average augmented MDP while satisfying the CVaR constraint (i.e., Eq. 4). This method is implemented by applying Algorithm 1 to PPOC. In this experiment, ε is set to 0.1, and ζ is determined so that the produced options achieve a CVaR score that is equal to or better than the score achieved by the best options produced by the other baselines (we will explain this in a later paragraph).
For all of the aforementioned methods, we set the hyper-parameters (e.g., the policy and value network architectures and the learning rate) for PPOC to the same values as in the original paper [16]. The parameters of the policy network and the value network are updated when the total number of trajectories reaches 10240.
This parameter update is repeated 977 times for each learning trial.
The experiments are conducted in robust MDP extensions of the following environments:
Half-Cheetah: In this environment, options are used to control the half-cheetah, a planar biped robot with eight rigid links, including two legs and a torso, along with six actuated joints [33].
Walker2D: In this environment, options are used to control the walker, a planar biped robot consisting of seven links, corresponding to two legs and a torso, along with six actuated joints [5].
HopperIceBlock: In this environment, options are used to control the hopper, a robot with four rigid links, corresponding to the torso, upper leg, lower leg, and foot, along with three actuated joints [13, 16]. The hopper has to pass blocks either by jumping completely over them or by sliding on their surfaces. This environment is more suitable for evaluating the options framework than standard MuJoCo environments [5] since it contains explicit task compositionality.
To extend these environments to the robust MDP setting (introduced in Section 2), we change each environment so that some model parameters (e.g., the mass of the robot's torso and the ground friction) are randomly initialized in accordance with model parameter distributions. For the model parameter distribution, we prepare two types: continuous and discrete. For the continuous distribution, as in Rajeswaran et al. [24], we use a truncated Gaussian distribution, which follows the hyperparameters described in Table 1 in Appendices. For the discrete distribution, we use a Bernoulli distribution, which follows the hyperparameters described in Table 2 in Appendices. We define the cost as the negative of the reward retrieved from the environment.
Regarding Q1, we find that OC3 can produce feasible options that keep CVaR lower than a given ζ (i.e., options that satisfy the constraint in Eq. 2).
The negative CVaR of each method in each environment is shown in Figure 1 (b). To focus on the non-trivial cases, we choose the best CVaR of the baselines (SoftRobust, WorstCase, and EOOpt-ε) and set ζ to it. For example, in the "Walker2D-disc" environment, the CVaR of WorstCase is used for ζ. The "OC3" in Figure 1 (b) is the negative CVaR with the given ζ. We can see that the negative CVaR score of OC3 is higher than the best negative

8To make a fair comparison, the "warmup period" and the reinitialization of θΩ just after "repeat" in MLSH are omitted.

(a) The average-case performance: average cumulative reward (the negative of the soft robust loss, Eq. 1).
(b) The worst-case performance: negative CVaR with ε = 0.1. Here, CVaR is calculated as the average of the losses included in the upper 10th percentile, and its negative is shown in the figure.
Figure 1: Comparison of methods in the environments. In (a), the vertical axis represents the average cumulative reward (the negative of the soft robust loss) of each method. In (b), the vertical axis represents the negative CVaR of each method. In both (a) and (b), the horizontal axis represents the environments where the methods are evaluated. The environments with the suffix "-cont" have continuous model parameter distributions, and those with "-disc" have discrete model parameter distributions. Methods with higher scores can be regarded as better. Each score is averaged over 36 learning trials with different initial random seeds, and the 95% confidence interval is attached to the score.

Figure 2: Learned options in HopperIceBlock-disc. Left: the option for walking on the slippery ground.
Right: the option for jumping onto the box.

CVaR score of the baselines in each environment (i.e., the CVaR score of OC3 is lower than the given ζ). These results clearly demonstrate that our method successfully produces options satisfying the constraint. In addition, the numbers of successful learning trials for OC3 also support our claim (see Table 4 in Appendices).
Regarding Q2, Figure 1 (b) also shows that the negative CVaRs of OC3 are higher than those of SoftRobust, in which the options are learned only to minimize the loss in the average case. Therefore, our method can be said to successfully produce options that work better in the worst case than options learned only for the average case.
Regarding Q3, we compare the average-case (soft robust) loss (Eq. 1) of OC3 with those of the worst-case methods (WorstCase and EOOpt-0.1). The soft robust loss of each method in each environment is shown in Figure 1 (a). We can see that the scores of OC3 are higher than those of WorstCase and EOOpt-0.1. These results indicate that our method can successfully produce options that work better in the average case than options learned only for the worst case.
In the analysis of learned options, we find that OC3 and SoftRobust successfully produce options corresponding to the decomposed skills required for solving the tasks in most environments. In particular, OC3 produces robust options. For example, in HopperIceBlock-disc, OC3 produces options corresponding to walking on slippery ground and jumping onto a box (Figure 2). In addition, in Half-Cheetah-disc, OC3 produces an option for running (highlighted in green in Figure 3) and an option for stabilizing the cheetah-bot's body (highlighted in red in Figure 3), which is used mainly in the rare-case model parameter setups.
Acquisition of such decomposed robust skills is useful when transferring to a different domain, and they can also be used for post-hoc human analysis and maintenance, which is an important advantage of option-based reinforcement learning over flat-policy reinforcement learning.

Figure 3: Learned options (shown in green when option 1 is taking control and in red otherwise) in Half-Cheetah-cont. Left: the case with an ordinary model parameter setup. Right: the case with a rare model parameter setup.

5 Related Work

Robust reinforcement learning (worst case): One of the main approaches to robust reinforcement learning is to learn policies by minimizing the expected loss in the worst case. "Worst case" is used in two different contexts: 1) the worst case under coherent uncertainty and 2) the worst case under parameter uncertainty. Coherent uncertainty is induced by the stochastic nature of MDPs (i.e., stochastic rewards and transitions), whereas parameter uncertainty is induced by inaccurate environment models. For the worst case under coherent uncertainty, many robust reinforcement learning approaches have been proposed (e.g., [7, 8, 12]). Many robust reinforcement learning methods have also been proposed for the worst case under parameter uncertainty [2, 14, 22, 23, 31, 34]. These works focus on learning robust "flat" policies (i.e., non-hierarchical policies), whereas we focus on learning robust "option" policies (i.e., hierarchical policies).
Robust reinforcement learning (CVaR): CVaR has previously been applied to reinforcement learning. Boda and Filar [4] used a dynamic programming approach to optimize CVaR. Morimura et al. [21] introduced CVaR into the exploration part of the SARSA algorithm. Rajeswaran et al. [24] introduced CVaR into model-based Bayesian reinforcement learning. Chow and Ghavamzadeh [6] and Tamar et al. [32] presented policy gradient theorems for CVaR optimization.
These research efforts focus on learning robust flat policies, whereas we focus on learning robust options.
Learning robust options: Learning robust options is a relatively recent topic that not much work has addressed. Mankowitz et al. [19] are pioneers of this research topic. They proposed a robust option policy iteration to learn options that minimize the expected loss in the worst case under parameter uncertainty. Frans et al. [10] proposed a method to learn options that minimize the expected loss in the average case9. Both methods consider the loss in either the worst or the average case only. Mankowitz's method considers only the loss in the worst case, so the produced options are overly adapted to the worst case and do not work well in the average case. Conversely, Frans's method considers only the loss in the average case, so the produced options work poorly in the worst case. In contrast to these methods, our method considers the losses in both the average and worst cases and thus mitigates the aforementioned problems.

6 Conclusion

In this paper, we proposed a conditional value at risk (CVaR)-based method to learn options so that they 1) keep the expected loss lower than a given threshold (loss tolerance ζ) in the worst case and 2) decrease the expected loss in the average case as much as possible. To achieve this, we extended Chow and Ghavamzadeh's CVaR policy gradient method [6] to deal with robust Markov decision processes (robust MDPs) and then applied the extended method to learn robust options. For this application, we derived a theorem for an option policy gradient for soft robust loss minimization [9]. We conducted experiments to evaluate our method in multi-joint robot control tasks.
Experimental results show that our method produces options that 1) work better in the worst case than options learned only to minimize the loss in the average case and 2) work better in the average case than options learned only to minimize the loss in the worst case.
In general, the model parameter distribution is not necessarily correct and thus needs to be made more precise by reflecting observations retrieved from the real environment [24]; our current method does not consider such model adaptation. One interesting direction for future work is to introduce this model adaptation by extending our method to Bayes-Adaptive Markov decision processes [11].

9Although their method is presented as a transfer learning method, their optimization objective is essentially equivalent to the "soft robust" objective in the robust reinforcement learning literature [9]. Thus, we regard their method as a robust option learning framework.

References

[1] Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Proc. AAAI, 2017.

[2] Bagnell, J. A., Ng, A. Y., and Schneider, J. G. Solving uncertain Markov decision processes. Technical Report CMU-RI-TR-01-25, 2001.

[3] Bertsekas, D. P. Nonlinear programming. Athena Scientific, Belmont, 1999.

[4] Boda, K. and Filar, J. A. Time consistent dynamic risk measures. Mathematical Methods of Operations Research, 63(1):169–186, 2006.

[5] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv:1606.01540 [cs.LG], 2016.

[6] Chow, Y. and Ghavamzadeh, M. Algorithms for CVaR optimization in MDPs. In Proc. NIPS, pp. 3509–3517, 2014.

[7] Coraluppi, S. P. and Marcus, S. I. Risk-sensitive, minimax, and mixed risk-neutral/minimax control of Markov decision processes. Stochastic Analysis, Control, Optimization and Applications: A Volume in Honor of W.H. Fleming, pp.
21–40, 1999.

[8] Coraluppi, S. P. and Marcus, S. I. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 35(2):301–309, 1999.

[9] Derman, E., Mankowitz, D. J., Mann, T. A., and Mannor, S. Soft-robust actor-critic policy-gradient. In Proc. UAI, 2018.

[10] Frans, K., Ho, J., Chen, X., Abbeel, P., and Schulman, J. Meta learning shared hierarchies. In Proc. ICLR, 2018.

[11] Ghavamzadeh, M., Mannor, S., Pineau, J., Tamar, A., et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.

[12] Heger, M. Consideration of risk in reinforcement learning. In Proc. ML, pp. 105–111. Elsevier, 1994.

[13] Henderson, P., Chang, W., Shkurti, F., Hansen, J., Meger, D., and Dudek, G. Benchmark environments for multitask learning in continuous domains. arXiv:1708.04352 [cs.AI], 2017.

[14] Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[15] James, S., Wohlhart, P., Kalakrishnan, M., Kalashnikov, D., Irpan, A., Ibarz, J., Levine, S., Hadsell, R., and Bousmalis, K. Sim-to-real via sim-to-sim: Data-efficient robotic grasping via randomized-to-canonical adaptation networks. arXiv:1812.07252 [cs.RO], 2018.

[16] Klissarov, M., Bacon, P., Harb, J., and Precup, D. Learnings options end-to-end for continuous action tasks. arXiv:1712.00004 [cs.LG], 2017.

[17] Konidaris, G. and Barto, A. G. Skill discovery in continuous reinforcement learning domains using skill chaining. In Proc. NIPS, pp. 1015–1023, 2009.

[18] Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proc. NIPS, pp. 3675–3683, 2016.

[19] Mankowitz, D. J., Mann, T. A., Bacon, P.-L., Precup, D., and Mannor, S. Learning robust options. In Proc.
AAAI, 2018.

[20] McGovern, A. and Barto, A. G. Automatic discovery of subgoals in reinforcement learning using diverse density. Computer Science Department Faculty Publication Series, 8, 2001.

[21] Morimura, T., Sugiyama, M., Kashima, H., Hachiya, H., and Tanaka, T. Nonparametric return distribution approximation for reinforcement learning. In Proc. ICML, pp. 799–806, 2010.

[22] Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

[23] Pinto, L., Davidson, J., Sukthankar, R., and Gupta, A. Robust adversarial reinforcement learning. In Proc. ICML, pp. 2817–2826, 2017.

[24] Rajeswaran, A., Ghotra, S., Levine, S., and Ravindran, B. EPOpt: Learning robust neural network policies using model ensembles. In Proc. ICLR, 2017.

[25] Rockafellar, R. T. and Uryasev, S. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471, 2002.

[26] Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proc. ICLR, 2016.

[27] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG], 2017.

[28] Silver, D. and Ciosek, K. Compositional planning using optimal option models. In Proc. ICML, 2012.

[29] Stolle, M. and Precup, D. Learning options in reinforcement learning. In Proc. SARA, pp. 212–223. Springer, 2002.

[30] Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[31] Tamar, A., Mannor, S., and Xu, H. Scaling up robust MDPs using function approximation. In Proc. ICML, pp. 181–189, 2014.

[32] Tamar, A., Glassner, Y., and Mannor, S.
Optimizing the CVaR via sampling. In Proc. AAAI, pp. 2993–2999, 2015.

[33] Wawrzyński, P. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22(10):1484–1497, 2009.

[34] Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

[35] Xu, H. and Mannor, S. Distributionally robust Markov decision processes. In Proc. NIPS, pp. 2505–2513, 2010.