{"title": "Policy Gradient for Coherent Risk Measures", "book": "Advances in Neural Information Processing Systems", "page_first": 1468, "page_last": 1476, "abstract": "Several authors have recently developed risk-sensitive policy gradient methods that augment the standard expected cost minimization problem with a measure of variability in cost. These studies have focused on specific risk-measures, such as the variance or conditional value at risk (CVaR). In this work, we extend the policy gradient method to the whole class of coherent risk measures, which is widely accepted in finance and operations research, among other fields. We consider both static and time-consistent dynamic risk measures. For static risk measures, our approach is in the spirit of policy gradient algorithms and combines a standard sampling approach with convex programming. For dynamic risk measures, our approach is actor-critic style and involves explicit approximation of value function. Most importantly, our contribution presents a unified approach to risk-sensitive reinforcement learning that generalizes and extends previous results.", "full_text": "Policy Gradient for Coherent Risk Measures\n\nAviv Tamar\nUC Berkeley\n\navivt@berkeley.edu\n\nMohammad Ghavamzadeh\nAdobe Research & INRIA\n\nmohammad.ghavamzadeh@inria.fr\n\nYinlam Chow\n\nStanford University\n\nychow@stanford.edu\n\nShie Mannor\n\nTechnion\n\nshie@ee.technion.ac.il\n\nAbstract\n\nSeveral authors have recently developed risk-sensitive policy gradient methods\nthat augment the standard expected cost minimization problem with a measure of\nvariability in cost. These studies have focused on speci\ufb01c risk-measures, such as\nthe variance or conditional value at risk (CVaR). In this work, we extend the pol-\nicy gradient method to the whole class of coherent risk measures, which is widely\naccepted in \ufb01nance and operations research, among other \ufb01elds. We consider\nboth static and time-consistent dynamic risk measures. 
For static risk measures, our approach is in the spirit of policy gradient algorithms and combines a standard sampling approach with convex programming. For dynamic risk measures, our approach is actor-critic style and involves an explicit approximation of the value function. Most importantly, our contribution presents a unified approach to risk-sensitive reinforcement learning that generalizes and extends previous results.

1 Introduction

Risk-sensitive optimization considers problems in which the objective involves a risk measure of the random cost, in contrast to the typical expected cost objective. Such problems are important when the decision-maker wishes to manage the variability of the cost, in addition to its expected outcome, and are standard in various applications of finance and operations research. In reinforcement learning (RL) [27], risk-sensitive objectives have gained popularity as a means to regularize the variability of the total (discounted) cost/reward in a Markov decision process (MDP).

Many risk objectives have been investigated in the literature and applied to RL, such as the celebrated Markowitz mean-variance model [16], Value-at-Risk (VaR), and Conditional Value at Risk (CVaR) [18, 29, 21, 10, 8, 30]. The view taken in this paper is that the preference of one risk measure over another is problem-dependent and depends on factors such as the cost distribution, sensitivity to rare events, ease of estimation from data, and computational tractability of the optimization problem. However, the highly influential paper of Artzner et al. [2] identified a set of natural properties that are desirable for a risk measure to satisfy. Risk measures that satisfy these properties are termed coherent and have obtained widespread acceptance in financial applications, among others.
We focus on such coherent measures of risk in this work.

For sequential decision problems, such as MDPs, another desirable property of a risk measure is time consistency. A time-consistent risk measure satisfies a "dynamic programming" style property: if a strategy is risk-optimal for an n-stage problem, then the component of the policy from the t-th time until the end (where t < n) is also risk-optimal (see the principle of optimality in [5]). The recently proposed class of dynamic Markov coherent risk measures [24] satisfies both the coherence and time consistency properties.

In this work, we present policy gradient algorithms for RL with a coherent risk objective. Our approach applies to the whole class of coherent risk measures, thereby generalizing and unifying previous approaches that have focused on individual risk measures. We consider both static coherent risk of the total discounted return from an MDP and time-consistent dynamic Markov coherent risk. Our main contribution is formulating the risk-sensitive policy gradient under the coherent-risk framework. More specifically, we provide:

• A new formula for the gradient of static coherent risk that is convenient for approximation using sampling.
• An algorithm for the gradient of general static coherent risk that involves sampling with convex programming, and a corresponding consistency result.
• A new policy gradient theorem for Markov coherent risk, relating the gradient to a suitable value function, and a corresponding actor-critic algorithm.

Several previous results are special cases of the results presented here; our approach allows us to rederive them with greater generality and simplicity.

Related Work. Risk-sensitive optimization in RL for specific risk functions has been studied recently by several authors.
[6] studied exponential utility functions, [18], [29], [21] studied mean-variance models, [8], [30] studied CVaR in the static setting, and [20], [9] studied dynamic coherent risk for systems with linear dynamics. Our paper presents a general method for the whole class of coherent risk measures (both static and dynamic) and is not limited to a specific choice within that class, nor to particular system dynamics.

Reference [19] showed that an MDP with a dynamic coherent risk objective is essentially a robust MDP. Planning for large-scale MDPs with this objective was considered in [31], using an approximation of the value function. For many problems, approximation in the policy space is more suitable (see, e.g., [15]). Our sampling-based RL-style approach is suitable for approximations both in the policy and value function, and scales up to large or continuous MDPs. We do, however, make use of a technique of [31] in a part of our method.

Optimization of coherent risk measures was thoroughly investigated by Ruszczynski and Shapiro [25] (see also [26]) for the stochastic programming case, in which the policy parameters do not affect the distribution of the stochastic system (i.e., the MDP trajectory) but only the reward function; thus, this approach is not suitable for most RL problems. For the case of MDPs and dynamic risk, [24] proposed a dynamic programming approach. This approach does not scale up to large MDPs, due to the "curse of dimensionality". For further motivation of risk-sensitive policy gradient methods, we refer the reader to [18, 29, 21, 8, 30].

2 Preliminaries

Consider a probability space (Ω, F, P_θ), where Ω is the set of outcomes (sample space), F is a σ-algebra over Ω representing the set of events we are interested in, and P_θ ∈ B is a probability measure over F parameterized by some tunable parameter θ ∈ R^K, where B := {ξ : Σ_{ω∈Ω} ξ(ω) = 1, ξ ≥ 0} is the set of probability distributions over Ω. In the following, we suppress the notation of θ in θ-dependent quantities.

To ease the technical exposition, in this paper we restrict our attention to finite probability spaces, i.e., Ω has a finite number of elements. Our results can be extended to L_p-normed spaces without loss of generality, but the details are omitted for brevity.

Denote by Z the space of random variables Z : Ω → (−∞, ∞) defined over the probability space (Ω, F, P_θ). In this paper, a random variable Z ∈ Z is interpreted as a cost, i.e., the smaller the realization of Z, the better. For Z, W ∈ Z, we denote by Z ≤ W the point-wise partial order, i.e., Z(ω) ≤ W(ω) for all ω ∈ Ω. We denote by E_ξ[Z] := Σ_{ω∈Ω} P_θ(ω) ξ(ω) Z(ω) a ξ-weighted expectation of Z.

An MDP is a tuple M = (X, A, C, P, γ, x0), where X and A are the state and action spaces; C(x) ∈ [−C_max, C_max] is a bounded, deterministic, and state-dependent cost; P(·|x, a) is the transition probability distribution; γ is a discount factor; and x0 is the initial state.1 Actions are chosen according to a θ-parameterized stationary Markov2 policy μ_θ(·|x). We denote by x0, a0, ..., x_T, a_T a trajectory of length T drawn by following the policy μ_θ in the MDP.

1 Our results may easily be extended to random costs, state-action dependent costs, and random initial states.
2 For Markov coherent risk, the class of optimal policies is stationary Markov [24], while this is not necessarily true for static risk. Our results can be extended to history-dependent policies or stationary Markov policies on a state space augmented with the accumulated cost. The latter has been shown to be sufficient for optimizing the CVaR risk [4].

2.1 Coherent Risk Measures

A risk measure is a function ρ : Z → R that maps an uncertain outcome Z to the extended real line R ∪ {+∞, −∞}, e.g., the expectation E[Z] or the conditional value-at-risk (CVaR) min_{ν∈R} {ν + (1/α) E[(Z − ν)_+]}. A risk measure is called coherent if it satisfies the following conditions for all Z, W ∈ Z [2]:

A1 Convexity: ∀λ ∈ [0, 1], ρ(λZ + (1 − λ)W) ≤ λρ(Z) + (1 − λ)ρ(W);
A2 Monotonicity: if Z ≤ W, then ρ(Z) ≤ ρ(W);
A3 Translation invariance: ∀a ∈ R, ρ(Z + a) = ρ(Z) + a;
A4 Positive homogeneity: if λ ≥ 0, then ρ(λZ) = λρ(Z).

Intuitively, these conditions ensure the "rationality" of single-period risk assessments: A1 ensures that diversifying an investment will reduce its risk; A2 guarantees that an asset with a higher cost in every possible scenario is indeed riskier; A3, also known as "cash invariance", means that the deterministic part of an investment portfolio does not contribute to its risk; the intuition behind A4 is that doubling a position in an asset doubles its risk. We further refer the reader to [2] for a more detailed motivation of coherent risk.

The following representation theorem [26] shows an important property of coherent risk measures that is fundamental to our gradient-based approach.

Theorem 2.1. A risk measure ρ : Z → R is coherent if and only if there exists a convex, bounded, and closed set U ⊂ B such that3

ρ(Z) = max_{ξ : ξP_θ ∈ U(P_θ)} E_ξ[Z].    (1)

3 When we study risk in MDPs, the risk envelope U(P_θ) in Eq. 1 also depends on the state x.

The result essentially states that any coherent risk measure is an expectation w.r.t. a worst-case density function ξP_θ, i.e., a re-weighting of P_θ by ξ, chosen adversarially from a suitable set of test density functions U(P_θ), referred to as the risk envelope. Moreover, a coherent risk measure is uniquely represented by its risk envelope. In the sequel, we shall interchangeably refer to coherent risk measures either by their explicit functional representation or by their corresponding risk envelope.

In this paper, we assume that the risk envelope U(P_θ) is given in a canonical convex programming formulation and satisfies the following conditions.

Assumption 2.2 (The General Form of Risk Envelope). For each given policy parameter θ ∈ R^K, the risk envelope U of a coherent risk measure can be written as

U(P_θ) = { ξP_θ : g_e(ξ, P_θ) = 0 ∀e ∈ E, f_i(ξ, P_θ) ≤ 0 ∀i ∈ I, Σ_{ω∈Ω} ξ(ω)P_θ(ω) = 1, ξ(ω) ≥ 0 },    (2)

where each constraint g_e(ξ, P_θ) is an affine function in ξ, each constraint f_i(ξ, P_θ) is a convex function in ξ, and there exists a strictly feasible point ξ. E and I here denote the sets of equality and inequality constraints, respectively. Furthermore, for any given ξ ∈ B, f_i(ξ, p) and g_e(ξ, p) are twice differentiable in p, and there exists an M > 0 such that

max{ max_{i∈I} |df_i(ξ, p)/dp(ω)|, max_{e∈E} |dg_e(ξ, p)/dp(ω)| } ≤ M, ∀ω ∈ Ω.

Assumption 2.2 implies that the risk envelope U(P_θ) is known in an explicit form. From Theorem 6.6 of [26], in the case of a finite probability space, ρ is a coherent risk measure if and only if U(P_θ) is a convex and compact set. This justifies the affine assumption on g_e and the convexity assumption on f_i. Moreover, the additional assumption on the smoothness of the constraints holds for many popular coherent risk measures, such as the CVaR, the mean-semideviation, and spectral risk measures [1].

2.2 Dynamic Risk Measures

The risk measures defined above do not take into account any temporal structure that the random variable might have, such as when it is associated with the return of a trajectory in the case of MDPs. In this sense, such risk measures are called static. Dynamic risk measures, on the other hand, explicitly take into account the temporal nature of the stochastic outcome. A primary motivation for considering such measures is the issue of time consistency, usually defined as follows [24]: if a certain outcome is considered less risky in all states of the world at stage t + 1, then it should also be considered less risky at stage t. Example 2.1 in [13] shows the importance of time consistency in the evaluation of risk in a dynamic setting. It illustrates that for multi-period decision-making, optimizing a static measure can lead to "time-inconsistent" behavior. Similar paradoxical results could be obtained with other risk metrics; we refer the reader to [24] and [13] for further insights.

Markov Coherent Risk Measures.
Markov risk measures were introduced in [24] and constitute a useful class of dynamic time-consistent risk measures that are important to our study of risk in MDPs. For a T-length horizon and MDP M, the Markov coherent risk measure ρ_T(M) is

ρ_T(M) = C(x0) + γρ( C(x1) + ... + γρ( C(x_{T−1}) + γρ(C(x_T)) ) ),    (3)

where ρ is a static coherent risk measure that satisfies Assumption 2.2, and x0, ..., x_T is a trajectory drawn from the MDP M under policy μ_θ. It is important to note that in (3), each static coherent risk ρ at state x ∈ X is induced by the transition probability P_θ(x'|x) = Σ_{a∈A} P(x'|x, a) μ_θ(a|x). We also define ρ_∞(M) := lim_{T→∞} ρ_T(M), which is well-defined since γ < 1 and the cost is bounded. We further assume that ρ in (3) is a Markov risk measure, i.e., the evaluation of each static coherent risk measure ρ is not allowed to depend on the whole past.

3 Problem Formulation

In this paper, we are interested in solving two risk-sensitive optimization problems. Given a random variable Z and a static coherent risk measure ρ as defined in Section 2, the static risk problem (SRP) is given by

min_θ ρ(Z).    (4)

For example, in an RL setting, Z may correspond to the cumulative discounted cost Z = C(x0) + γC(x1) + ... + γ^T C(x_T) of a trajectory induced by an MDP with a policy parameterized by θ. For an MDP M and a dynamic Markov coherent risk measure ρ_T as defined by Eq. 3, the dynamic risk problem (DRP) is given by

min_θ ρ_∞(M).    (5)

Except for very limited cases, there is no reason to hope that either the SRP in (4) or the DRP in (5) is a tractable problem, since the dependence of the risk measure on θ may be complex and non-convex. In this work, we aim at a more modest goal and search for a locally optimal θ. Thus, the main problem that we are trying to solve in this paper is how to calculate the gradients of the SRP's and DRP's objective functions,

∇_θ ρ(Z)  and  ∇_θ ρ_∞(M).

We are interested in non-trivial cases in which the gradients cannot be calculated analytically. In the static case, this would correspond to a non-trivial dependence of Z on θ. For dynamic risk, we also consider cases where the state space is too large for a tractable computation. Our approach for dealing with such difficult cases is through sampling. We assume that in the static case, we may obtain i.i.d. samples of the random variable Z. For the dynamic case, we assume that for each state and action (x, a) of the MDP, we may obtain i.i.d. samples of the next state x' ~ P(·|x, a). We show that sampling may indeed be used in both cases to devise suitable estimators for the gradients. To finally solve the SRP and DRP problems, a gradient estimate may be plugged into a standard stochastic gradient descent (SGD) algorithm for learning a locally optimal solution to (4) and (5).

From the structure of the dynamic risk in Eq. 3, one may think that a gradient estimator for ρ(Z) may help us estimate the gradient ∇_θ ρ_∞(M). Indeed, we follow this idea and begin with estimating the gradient in the static risk case.

4 Gradient Formula for Static Risk

In this section, we consider a static coherent risk measure ρ(Z) and propose sampling-based estimators for ∇_θ ρ(Z). We make the following assumption on the policy parametrization, which is standard in the policy gradient literature [15].

Assumption 4.1.
The likelihood ratio ∇_θ log P(ω) is well-defined and bounded for all ω ∈ Ω.

Moreover, our approach implicitly assumes that given some ω ∈ Ω, ∇_θ log P(ω) may be easily calculated. This is also a standard requirement for policy gradient algorithms [15] and is satisfied in various applications such as queueing systems, inventory management, and financial engineering (see, e.g., the survey by Fu [11]).

Using Theorem 2.1 and Assumption 2.2, for each θ we have that ρ(Z) is the solution to the convex optimization problem (1) (for that value of θ). The Lagrangian function of (1), denoted by L_θ(ξ, λ^P, λ^E, λ^I), may be written as

L_θ(ξ, λ^P, λ^E, λ^I) = Σ_{ω∈Ω} ξ(ω)P_θ(ω)Z(ω) − λ^P( Σ_{ω∈Ω} ξ(ω)P_θ(ω) − 1 ) − Σ_{e∈E} λ^E(e) g_e(ξ, P_θ) − Σ_{i∈I} λ^I(i) f_i(ξ, P_θ).    (6)

The convexity of (1) and its strict feasibility due to Assumption 2.2 imply that L_θ(ξ, λ^P, λ^E, λ^I) has a non-empty set of saddle points S. The next theorem presents a formula for the gradient ∇_θ ρ(Z). As we shall subsequently show, this formula is particularly convenient for devising sampling-based estimators for ∇_θ ρ(Z).

Theorem 4.2. Let Assumptions 2.2 and 4.1 hold. For any saddle point (ξ*_θ, λ*^P_θ, λ*^E_θ, λ*^I_θ) ∈ S of (6), we have

∇_θ ρ(Z) = E_{ξ*_θ}[ ∇_θ log P(ω)(Z − λ*^P_θ) ] − Σ_{e∈E} λ*^E_θ(e) ∇_θ g_e(ξ*_θ; P_θ) − Σ_{i∈I} λ*^I_θ(i) ∇_θ f_i(ξ*_θ; P_θ).

The proof of this theorem, given in the supplementary material, involves an application of the Envelope theorem [17] and a standard "likelihood-ratio" trick. We now demonstrate the utility of Theorem 4.2 with several examples, in which we show that it generalizes previously known results and also enables deriving new useful gradient formulas.

4.1 Example 1: CVaR

The CVaR at level α ∈ [0, 1] of a random variable Z, denoted by ρ_CVaR(Z; α), is a very popular coherent risk measure [23], defined as

ρ_CVaR(Z; α) := inf_{t∈R} { t + α^{−1} E[(Z − t)_+] }.

When Z is continuous, ρ_CVaR(Z; α) is well-known to be the mean of the α-tail distribution of Z, E[Z | Z > q_α], where q_α is a (1 − α)-quantile of Z. Thus, selecting a small α makes CVaR particularly sensitive to rare, but very high, costs.

The risk envelope for CVaR is known [26] to be U = { ξP_θ : ξ(ω) ∈ [0, α^{−1}], Σ_{ω∈Ω} ξ(ω)P_θ(ω) = 1 }. Furthermore, [26] show that the saddle points of (6) satisfy ξ*_θ(ω) = α^{−1} when Z(ω) > λ*^P_θ and ξ*_θ(ω) = 0 when Z(ω) < λ*^P_θ, where λ*^P_θ is any (1 − α)-quantile of Z. Plugging this result into Theorem 4.2, we can easily show that

∇_θ ρ_CVaR(Z; α) = E[ ∇_θ log P(ω)(Z − q_α) | Z(ω) > q_α ].

This formula was recently proved in [30] for the case of continuous distributions by an explicit calculation of the conditional expectation, and under several additional smoothness assumptions. Here we show that it holds regardless of these assumptions and in the discrete case as well. Our proof is also considerably simpler.

4.2 Example 2: Mean-Semideviation

The semi-deviation of a random variable Z is defined as SD[Z] := ( E[(Z − E[Z])_+^2] )^{1/2}. The semi-deviation captures the variation of the cost only above its mean, and is an appealing alternative to the standard deviation, which does not distinguish between the variability of upside and downside deviations. For some α ∈ [0, 1], the mean-semideviation risk measure is defined as ρ_MSD(Z; α) := E[Z] + α SD[Z], and is a coherent risk measure [26]. We have the following result:

Proposition 4.3. Under Assumption 4.1, with ∇_θ E[Z] = E[∇_θ log P(ω) Z], we have

∇_θ ρ_MSD(Z; α) = ∇_θ E[Z] + α E[ (Z − E[Z])_+ ( ∇_θ log P(ω)(Z − E[Z]) − ∇_θ E[Z] ) ] / SD[Z].

This proposition can be used to devise a sampling-based estimator for ∇_θ ρ_MSD(Z; α) by replacing all the expectations with sample averages. The algorithm, along with the proof of the proposition, is in the supplementary material.
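As a concrete sketch of such an estimator (the function name and array layout below are our own, not the paper's), every expectation in Proposition 4.3 can be replaced by an empirical average over N sampled costs Z(ω_i) and score vectors ∇_θ log P(ω_i):

```python
import numpy as np

def msd_gradient_estimate(z, dlogp, alpha):
    """Sample-average version of Proposition 4.3.

    z     : shape (N,) sampled costs Z(omega_i)
    dlogp : shape (N, K) scores grad_theta log P(omega_i)
    alpha : semideviation weight in rho_MSD(Z) = E[Z] + alpha * SD[Z]
    """
    z = np.asarray(z, dtype=float)
    mean = z.mean()
    excess = np.maximum(z - mean, 0.0)             # (Z - E[Z])_+
    sd = np.sqrt(np.mean(excess ** 2))             # empirical SD[Z]
    grad_mean = (dlogp * z[:, None]).mean(axis=0)  # grad E[Z] = E[score * Z]
    if sd == 0.0:                                  # no upside variability
        return grad_mean
    inner = dlogp * (z - mean)[:, None] - grad_mean
    grad_sd = (excess[:, None] * inner).mean(axis=0) / sd
    return grad_mean + alpha * grad_sd
```

With α = 0 the estimator reduces to the vanilla likelihood-ratio policy gradient of E[Z], which is a useful sanity check.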
In Section 6 we provide a numerical illustration of optimization with a mean-semideviation objective.

4.3 General Gradient Estimation Algorithm

In the two previous examples, we obtained a gradient formula by analytically calculating the Lagrangian saddle point (6) and plugging it into the formula of Theorem 4.2. We now consider a general coherent risk ρ(Z) for which, in contrast to the CVaR and mean-semideviation cases, the Lagrangian saddle point is not known analytically. We only assume that we know the structure of the risk envelope as given by (2). We show that in this case, ∇_θ ρ(Z) may be estimated using a sample average approximation (SAA; [26]) of the formula in Theorem 4.2.

Assume that we are given N i.i.d. samples ω_i ~ P_θ, i = 1, ..., N, and let P_{θ;N}(ω) := (1/N) Σ_{i=1}^N I{ω_i = ω} denote the corresponding empirical distribution. Also, let the sample risk envelope U(P_{θ;N}) be defined according to Eq. 2 with P_θ replaced by P_{θ;N}. Consider the following SAA version of the optimization in Eq. 1:

ρ_N(Z) = max_{ξ : ξP_{θ;N} ∈ U(P_{θ;N})} Σ_{i∈1,...,N} P_{θ;N}(ω_i) ξ(ω_i) Z(ω_i).    (7)

Note that (7) defines a convex optimization problem with O(N) variables and constraints. In the following, we assume that a solution to (7) may be computed efficiently using standard convex programming tools such as interior point methods [7]. Let ξ*_{θ;N} denote a solution to (7) and λ*^P_{θ;N}, λ*^E_{θ;N}, λ*^I_{θ;N} denote the corresponding KKT multipliers, which can be obtained from the convex programming algorithm [7]. We propose the following estimator for the gradient, based on Theorem 4.2:

∇_{θ;N} ρ(Z) = Σ_{i=1}^N P_{θ;N}(ω_i) ξ*_{θ;N}(ω_i) ∇_θ log P(ω_i)(Z(ω_i) − λ*^P_{θ;N}) − Σ_{e∈E} λ*^E_{θ;N}(e) ∇_θ g_e(ξ*_{θ;N}; P_{θ;N}) − Σ_{i∈I} λ*^I_{θ;N}(i) ∇_θ f_i(ξ*_{θ;N}; P_{θ;N}).    (8)

Thus, our gradient estimation algorithm is a two-step procedure involving both sampling and convex programming. In the following, we show that under some conditions on the set U(P_θ), ∇_{θ;N} ρ(Z) is a consistent estimator of ∇_θ ρ(Z). The proof is reported in the supplementary material.

Proposition 4.4. Let Assumptions 2.2 and 4.1 hold. Suppose there exists a compact set C = C_ξ × C_λ such that: (I) the set of Lagrangian saddle points S ⊂ C is non-empty and bounded; (II) the functions g_e(ξ, P_θ) for all e ∈ E and f_i(ξ, P_θ) for all i ∈ I are finite-valued and continuous (in ξ) on C_ξ; (III) for N large enough, the set S_N is non-empty and S_N ⊂ C w.p. 1. Further assume that: (IV) if ξ_N P_{θ;N} ∈ U(P_{θ;N}) and ξ_N converges w.p. 1 to a point ξ, then ξP_θ ∈ U(P_θ). We then have that lim_{N→∞} ρ_N(Z) = ρ(Z) and lim_{N→∞} ∇_{θ;N} ρ(Z) = ∇_θ ρ(Z) w.p. 1.

The set of assumptions for Proposition 4.4 is large, but rather mild. Note that (I) is implied by the Slater condition of Assumption 2.2. For satisfying (III), we need the risk to be well-defined for every empirical distribution, which is a natural requirement. Since P_{θ;N} always converges to P_θ uniformly on Ω, (IV) essentially requires smoothness of the constraints.
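As an illustration of the two-step procedure (our own sketch, not the paper's implementation), for CVaR the sample saddle point is available in closed form, so the convex-programming step collapses to an empirical quantile: λ* is an empirical (1 − α)-quantile of the samples, and ξ* puts weight α^{-1} on the samples above it. Ties at the quantile, which the full convex program resolves with fractional weights, are ignored here:

```python
import numpy as np

def cvar_saa_gradient(z, dlogp, alpha):
    """Plug-in version of estimator (8) for CVaR at level alpha.

    z     : shape (N,) sampled costs Z(omega_i)
    dlogp : shape (N, K) scores grad_theta log P(omega_i)
    """
    z = np.asarray(z, dtype=float)
    q = np.quantile(z, 1.0 - alpha)      # lambda*: empirical (1-alpha)-quantile
    xi = (z > q).astype(float) / alpha   # worst-case density: 1/alpha above q
    weights = xi / len(z)                # P_{theta;N}(omega_i) * xi*(omega_i)
    return (weights[:, None] * dlogp * (z - q)[:, None]).sum(axis=0)
```

On a sample where exactly an α-fraction of the points exceeds the quantile, this coincides with the conditional-expectation formula of Section 4.1.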
We remark that, in particular, conditions (I) to (IV) are satisfied for the popular CVaR, mean-semideviation, and spectral risk measures.

It is interesting to compare the performance of the SAA estimator (8) with the analytical-solution-based estimators of Sections 4.1 and 4.2. In the supplementary material, we report an empirical comparison between the two approaches for the case of CVaR risk, which showed that the two approaches performed very similarly. This is well-expected, since in general both SAA and standard likelihood-ratio based estimators obey a law-of-large-numbers variance bound of order 1/√N [26].

To summarize this section, we have seen that by exploiting the special structure of coherent risk measures in Theorem 2.1 and the envelope-theorem style result of Theorem 4.2, we are able to derive sampling-based, likelihood-ratio style algorithms for estimating the policy gradient ∇_θ ρ(Z) of static coherent risk measures. The gradient estimation algorithms developed here for static risk measures will be used as a sub-routine in our subsequent treatment of dynamic risk measures.

5 Gradient Formula for Dynamic Risk

In this section, we derive a new formula for the gradient of the Markov coherent dynamic risk measure, ∇_θ ρ_∞(M). Our approach is based on combining the static gradient formula of Theorem 4.2 with a dynamic programming decomposition of ρ_∞(M).

The risk-sensitive value function for an MDP M under the policy θ is defined as V_θ(x) = ρ_∞(M | x0 = x), where, with a slight abuse of notation, ρ_∞(M | x0 = x) denotes the Markov coherent dynamic risk in (3) when the initial state x0 is x. It is shown in [24] that, due to the structure of the Markov dynamic risk ρ_∞(M), the value function is the unique solution to the risk-sensitive Bellman equation

V_θ(x) = C(x) + γ max_{ξP_θ(·|x) ∈ U(x, P_θ(·|x))} E_ξ[V_θ(x')],    (9)

where the expectation is taken over the next state transition. Note that by definition we have ρ_∞(M) = V_θ(x0), and thus ∇_θ ρ_∞(M) = ∇_θ V_θ(x0).

We now develop a formula for ∇_θ V_θ(x); this formula extends the well-known "policy gradient theorem" [28, 14], developed for the expected return, to Markov coherent dynamic risk measures. We make a standard assumption, analogous to Assumption 4.1 of the static case.

Assumption 5.1. The likelihood ratio ∇_θ log μ_θ(a|x) is well-defined and bounded for all x ∈ X and a ∈ A.

For each state x ∈ X, let (ξ*_{θ,x}, λ*^P_{θ,x}, λ*^E_{θ,x}, λ*^I_{θ,x}) denote a saddle point of (6) corresponding to the state x, with P_θ(·|x) replacing P_θ in (6) and V_θ replacing Z. The next theorem presents a formula for ∇_θ V_θ(x); the proof is in the supplementary material.

Theorem 5.2. Under Assumptions 2.2 and 5.1, we have

∇V_θ(x) = E_{ξ*_θ}[ Σ_{t=0}^∞ γ^t ∇_θ log μ_θ(a_t|x_t) h_θ(x_t, a_t) | x0 = x ],

where E_{ξ*_θ}[·] denotes the expectation w.r.t. trajectories generated by the Markov chain with transition probabilities P_θ(·|x) ξ*_{θ,x}(·), and the stage-wise cost function h_θ(x, a) is defined as

h_θ(x, a) = C(x) + Σ_{x'∈X} P(x'|x, a) ξ*_{θ,x}(x') [ γV_θ(x') − λ*^P_{θ,x} − Σ_{e∈E} λ*^E_{θ,x}(e) dg_e(ξ*_{θ,x}, p)/dp(x') − Σ_{i∈I} λ*^I_{θ,x}(i) df_i(ξ*_{θ,x}, p)/dp(x') ].

Theorem 5.2 may be used to develop an actor-critic style [28, 14] sampling-based algorithm for solving the DRP problem (5), composed of two interleaved procedures:

Critic: for a given policy θ, calculate the risk-sensitive value function V_θ; and
Actor: using the critic's V_θ and Theorem 5.2, estimate ∇_θ ρ_∞(M) and update θ.

Space limitations restrict us from specifying the full details of our actor-critic algorithm and its analysis. In the following, we highlight only the key ideas and results. For the full details, we refer the reader to the full paper version, provided in the supplementary material.

For the critic, the main challenge is calculating the value function when the state space X is large and dynamic programming cannot be applied due to the "curse of dimensionality".
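For intuition, when X is small the Bellman equation (9) can be solved exactly by fixed-point iteration; the sketch below (our own toy implementation, with CVaR as the one-step risk and a fixed policy already folded into the transition matrix) is the tabular baseline that does not scale:

```python
import numpy as np

def cvar_discrete(values, probs, alpha):
    """CVaR_alpha of a discrete cost variable: mean of the worst alpha-tail."""
    order = np.argsort(values)[::-1]          # largest (worst) costs first
    acc, mass = 0.0, alpha
    for v, p in zip(values[order], probs[order]):
        take = min(p, mass)
        acc += take * v
        mass -= take
        if mass <= 1e-12:
            break
    return acc / alpha

def risk_sensitive_value_iteration(C, P, gamma, alpha, iters=500):
    """Tabular fixed point of V(x) = C(x) + gamma * CVaR_alpha(V(x')),
    x' ~ P(. | x); alpha = 1 recovers standard policy evaluation."""
    V = np.zeros(len(C))
    for _ in range(iters):
        V = C + gamma * np.array([cvar_discrete(V, P[x], alpha)
                                  for x in range(len(C))])
    return V
```

Setting α = 1 recovers risk-neutral policy evaluation, and shrinking α can only increase each state's value, since the CVaR of a cost upper-bounds its expectation.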
To overcome this, we exploit the fact that $V_\theta$ is equivalent to the value function in a robust MDP [19] and modify a recent algorithm in [31] to estimate it using function approximation.

For the actor, the main challenge is that in order to estimate the gradient using Thm. 5.2, we need to sample from an MDP with $\xi^*_\theta$-weighted transitions. In addition, $h_\theta(x,a)$ involves an expectation for each $x$ and $a$. Therefore, we propose a two-phase sampling procedure to estimate $\nabla V_\theta$: we first use the critic's estimate of $V_\theta$ to derive $\xi^*_\theta$, and sample a trajectory from an MDP with $\xi^*_\theta$-weighted transitions. For each state in the trajectory, we then sample several next states to estimate $h_\theta(x,a)$.

The convergence analysis of the actor-critic algorithm and the gradient error incurred from function approximation of $V_\theta$ are reported in the supplementary material. We remark that our actor-critic algorithm requires a simulator for sampling multiple state transitions from each state. Extending our approach to work with a single trajectory roll-out is an interesting direction for future research.

6 Numerical Illustration

In this section, we illustrate our approach with a numerical example. The purpose of this illustration is to emphasize the importance of flexibility in designing risk criteria: the selected risk measure should suit both the user's risk preference and the problem-specific properties.

Figure 1: Numerical illustration - selection between 3 assets. A: Probability density of asset return. B, C, D: Bar plots of the probability of selecting each asset vs. training iterations, for policies π1, π2, and π3, respectively. At each iteration, 10,000 samples were used for gradient estimation.

We consider a trading agent that can invest in one of three assets (see Figure 1 for their distributions). The returns of the first two assets, A1 and A2, are normally distributed: A1 ∼ N(1, 1) and A2 ∼ N(4, 6). The return of the third asset A3 has a Pareto distribution, f(z) = α/z^(α+1) for z > 1, with α = 1.5. The mean of the return from A3 is 3 and its variance is infinite; such heavy-tailed distributions are widely used in financial modeling [22]. The agent selects an asset randomly, with probability P(Ai) ∝ exp(θi), where θ ∈ R^3 is the policy parameter. We trained three different policies π1, π2, and π3. Policy π1 is risk-neutral, i.e., max_θ E[Z], and it was trained using standard policy gradient [15]. Policy π2 is risk-averse and had a mean-semideviation objective, max_θ E[Z] − SD[Z], and was trained using the algorithm in Section 4. Policy π3 is also risk-averse, with a mean-standard-deviation objective as proposed in [29, 21], max_θ E[Z] − √Var[Z], and was trained using the algorithm of [29]. For each of these policies, Figure 1 shows the probability of selecting each asset vs. training iterations. Although A2 has the highest mean return, the risk-averse policy π2 chooses A3, since it has a lower downside, as expected. However, because of the heavy upper tail of A3, policy π3 opted to choose A1 instead. This is counter-intuitive, as a rational investor should not avert high returns; in fact, in this case A3 stochastically dominates A1 [12].

7 Conclusion

We presented algorithms for estimating the gradient of both static and dynamic coherent risk measures, using two new policy-gradient-style formulas that combine sampling with convex programming.
Thereby, our approach extends risk-sensitive RL to the whole class of coherent risk measures, and generalizes several recent studies that focused on specific risk measures.

On the technical side, an important future direction is to improve the convergence rate of gradient estimates using importance sampling methods. This is especially important for risk criteria that are sensitive to rare events, such as the CVaR [3].

From a more conceptual point of view, the coherent-risk framework explored in this work provides the decision maker with flexibility in designing risk preferences. As our numerical example shows, such flexibility is important for selecting appropriate problem-specific risk measures for managing cost variability. However, we believe that our approach has much more potential than that. In almost every real-world application, uncertainty emanates from stochastic dynamics, but also, and perhaps more importantly, from modeling errors (model uncertainty). A prudent policy should protect against both types of uncertainty. The representation duality of coherent risk (Theorem 2.1) naturally relates risk to model uncertainty. In [19], a similar connection was made between model uncertainty in MDPs and dynamic Markov coherent risk. We believe that by carefully shaping the risk criterion, the decision maker may be able to take uncertainty into account in a broad sense. Designing a principled procedure for such risk-shaping is not trivial, and is beyond the scope of this paper. However, we believe that there is much potential in risk shaping, as it may be the key to handling model misspecification in dynamic decision making.

Acknowledgments

The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement n. 306638.
Yinlam Chow is partially supported by the Croucher Foundation Doctoral Scholarship.

References

[1] C. Acerbi. Spectral measures of risk: a coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.
[2] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.
[3] O. Bardou, N. Frikha, and G. Pagès. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 15(3):173–210, 2009.
[4] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
[5] D. Bertsekas. Dynamic programming and optimal control. Athena Scientific, 4th edition, 2012.
[6] V. Borkar. A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems & Control Letters, 44(5):339–346, 2001.
[7] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2009.
[8] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In NIPS 27, 2014.
[9] Y. Chow and M. Pavone. A unifying framework for time-consistent, risk-averse model predictive control: theory and algorithms. In American Control Conference, 2014.
[10] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
[11] M. Fu. Gradient estimation. In Simulation, volume 13 of Handbooks in Operations Research and Management Science, pages 575–616. Elsevier, 2006.
[12] J. Hadar and W. R. Russell. Rules for ordering uncertain prospects. The American Economic Review, pages 25–34, 1969.
[13] D. Iancu, M. Petrik, and D. Subramanian. Tight approximations of dynamic risk measures. arXiv:1106.6102, 2011.
[14] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In NIPS, 2000.
[15] P. Marbach and J. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
[16] H. Markowitz. Portfolio selection: Efficient diversification of investment. John Wiley & Sons, 1959.
[17] P. Milgrom and I. Segal. Envelope theorems for arbitrary choice sets. Econometrica, 70(2):583–601, 2002.
[18] J. Moody and M. Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.
[19] T. Osogami. Robustness and risk-sensitivity in Markov decision processes. In NIPS, 2012.
[20] M. Petrik and D. Subramanian. An approximate solution method for large risk-averse Markov decision processes. In UAI, 2012.
[21] L. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In NIPS 26, 2013.
[22] S. Rachev and S. Mittnik. Stable Paretian models in finance. John Wiley & Sons, New York, 2000.
[23] R. Rockafellar and S. Uryasev. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
[24] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
[25] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433–452, 2006.
[26] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming, chapter 6, pages 253–332. SIAM, 2009.
[27] R. Sutton and A. Barto. Reinforcement learning: An introduction. MIT Press, 1998.
[28] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS 12, 2000.
[29] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In International Conference on Machine Learning, 2012.
[30] A. Tamar, Y. Glassner, and S. Mannor. Optimizing the CVaR via sampling. In AAAI, 2015.
[31] A. Tamar, S. Mannor, and H. Xu. Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, 2014.