{"title": "Actor-Critic Algorithms for Risk-Sensitive MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 252, "page_last": 260, "abstract": "In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.", "full_text": "Actor-Critic Algorithms for Risk-Sensitive MDPs\n\nPrashanth L.A.\n\nINRIA Lille - Team SequeL\n\nMohammad Ghavamzadeh\u2217\n\nINRIA Lille - Team SequeL & Adobe Research\n\nAbstract\n\nIn many sequential decision-making problems we may want to manage risk by\nminimizing some measure of variability in rewards in addition to maximizing a\nstandard criterion. Variance-related risk measures are among the most common\nrisk-sensitive criteria in \ufb01nance and operations research. However, optimizing\nmany such criteria is known to be a hard problem. In this paper, we consider both\ndiscounted and average reward Markov decision processes. For each formulation,\nwe \ufb01rst de\ufb01ne a measure of variability for a policy, which in turn gives us a set of\nrisk-sensitive criteria to optimize. 
For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms for estimating the gradient and updating the policy parameters in the ascent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.

1 Introduction

The usual optimization criteria for an infinite horizon Markov decision process (MDP) are the expected sum of discounted rewards and the average reward. Many algorithms have been developed to maximize these criteria both when the model of the system is known (planning) and unknown (learning). These algorithms can be categorized into value-function-based methods, which are mainly based on the two celebrated dynamic programming algorithms value iteration and policy iteration; and policy gradient methods, which are based on updating the policy parameters in the direction of the gradient of a performance measure (the value function of the initial state or the average reward). However, in many applications we may prefer to minimize some measure of risk in addition to maximizing a usual optimization criterion. In such cases, we would like to use a criterion that incorporates a penalty for the variability induced by a given policy. This variability can be due to two types of uncertainties: 1) uncertainties in the model parameters, which is the topic of robust MDPs (e.g., [12, 7, 24]), and 2) the inherent uncertainty related to the stochastic nature of the system, which is the topic of risk-sensitive MDPs (e.g., [10]).
In risk-sensitive sequential decision-making, the objective is to maximize a risk-sensitive criterion such as the expected exponential utility [10], a variance-related measure [19, 8], or the percentile performance [9]. 
The issue of how to construct such criteria in a manner that is both conceptually meaningful and mathematically tractable is still an open question. Although risk-sensitive sequential decision-making has a long history in operations research and finance, it has only recently attracted attention in the machine learning community. This is why most of the work on this topic (including the works mentioned above) has been in the context of MDPs (when the model is known) and much less work has been done within the reinforcement learning (RL) framework. In risk-sensitive RL, we can mention the work by Borkar [4, 5], who considered the expected exponential utility, and the one by Tamar et al. [22] on several variance-related measures. Tamar et al. [22] study stochastic shortest path problems, and in this context, propose a policy gradient algorithm for maximizing several risk-sensitive criteria that involve both the expectation and variance of the return random variable (defined as the sum of rewards received in an episode).

*Mohammad Ghavamzadeh is at Adobe Research, on leave of absence from INRIA Lille - Team SequeL.

In this paper, we develop actor-critic algorithms for optimizing variance-related risk measures in both discounted and average reward MDPs. Our contributions can be summarized as follows:
• In the discounted reward setting we define the measure of variability as the variance of the return (similar to [22]). We formulate a constrained optimization problem with the aim of maximizing the mean of the return subject to its variance being bounded from above. We employ the Lagrangian relaxation procedure [1] and derive a formula for the gradient of the Lagrangian. 
Since this requires the gradient of the value function at every state of the MDP (see the discussion in Sections 3 and 4), we estimate the gradient of the Lagrangian using two simultaneous perturbation methods: simultaneous perturbation stochastic approximation (SPSA) [20] and smoothed functional (SF) [11], resulting in two separate discounted reward actor-critic algorithms.¹
• In the average reward formulation, we first define the measure of variability as the long-run variance of a policy, and using a constrained optimization problem similar to the discounted case, derive an expression for the gradient of the Lagrangian. We then develop an actor-critic algorithm with compatible features [21, 13] to estimate the gradient and to optimize the policy parameters.
• Using the ordinary differential equations (ODE) approach, we establish the asymptotic convergence of our algorithms to locally risk-sensitive optimal policies. Further, we demonstrate the usefulness of our algorithms in a traffic signal control problem.
In comparison to [22], which is the closest related work, we would like to remark that while the authors there develop policy gradient methods for stochastic shortest path problems, we devise actor-critic algorithms for both discounted and average reward settings. Moreover, the discounted formulation requires estimating the gradient of the value function at every state of the MDP, a difficulty that motivated us to employ simultaneous perturbation techniques.

2 Preliminaries

We consider problems in which the agent's interaction with the environment is modeled as an MDP. An MDP is a tuple (X, A, R, P, P0), where X = {1, ..., n} and A = {1, ..., m} are the state and action spaces; R(x, a) is the reward random variable whose expectation is denoted by r(x, a) = E[R(x, a)]; P(·|x, a) is the transition probability distribution; and P0(·) is the initial state distribution. We also need to specify the rule according to which the agent selects actions at each state. A stationary policy μ(·|x) is a probability distribution over actions, conditioned on the current state. In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies {μ(·|x; θ), x ∈ X, θ ∈ Θ ⊆ R^κ1}, estimate the gradient of a performance measure w.r.t. the policy parameters θ from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. Since in this setting a policy μ is represented by its κ1-dimensional parameter vector θ, policy dependent functions can be written as a function of θ in place of μ. So, we use μ and θ interchangeably in the paper.
We denote by d^μ(x) and π^μ(x, a) = d^μ(x)μ(a|x) the stationary distribution of state x and state-action pair (x, a) under policy μ, respectively. 
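To make the parameterized-policy machinery concrete, here is a minimal sketch of a Boltzmann (softmax) policy class of the kind used later in the experiments, together with its score function ∇ log μ(a|x; θ). The feature matrix and parameter values are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def softmax_policy(theta, features):
    """mu(a|x; theta) for a Boltzmann policy over per-action features.
    features: (num_actions, k) matrix phi(x, a); theta: (k,) parameter vector."""
    prefs = features @ theta
    prefs -= prefs.max()              # shift for numerical stability
    w = np.exp(prefs)
    return w / w.sum()

def score(theta, features, a):
    """psi(x, a) = grad_theta log mu(a|x; theta).
    For the Boltzmann policy this is phi(x, a) - E_{a' ~ mu}[phi(x, a')]."""
    mu = softmax_policy(theta, features)
    return features[a] - mu @ features

# Sanity check: the score function has zero mean under the policy itself,
# the property that makes likelihood-ratio gradient estimates unbiased.
theta = np.array([0.3, -0.2])
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 actions, k = 2
mu = softmax_policy(theta, phi)
mean_score = sum(mu[a] * score(theta, phi, a) for a in range(3))
```

The zero-mean property of ψ is what allows baselines (and later, advantage functions) to be subtracted without biasing gradient estimates.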
In the discounted formulation, we also define the discounted visiting distribution of state x and state-action pair (x, a) under policy μ as

d^μ_γ(x|x^0) = (1 − γ) Σ_{t=0}^∞ γ^t Pr(x_t = x | x_0 = x^0; μ)   and   π^μ_γ(x, a|x^0) = d^μ_γ(x|x^0) μ(a|x).

3 Discounted Reward Setting

For a given policy μ, we define the return of a state x (state-action pair (x, a)) as the sum of discounted rewards encountered by the agent when it starts at state x (state-action pair (x, a)) and then follows policy μ, i.e.,

D^μ(x) = Σ_{t=0}^∞ γ^t R(x_t, a_t) | x_0 = x, μ,      D^μ(x, a) = Σ_{t=0}^∞ γ^t R(x_t, a_t) | x_0 = x, a_0 = a, μ.

The expected values of these two random variables are the value and action-value functions of policy μ, i.e., V^μ(x) = E[D^μ(x)] and Q^μ(x, a) = E[D^μ(x, a)]. The goal in the standard discounted reward formulation is to find an optimal policy μ* = argmax_μ V^μ(x^0), where x^0 is the initial state of the system. This can be easily extended to the case that the system has more than one initial state: μ* = argmax_μ Σ_{x∈X} P0(x) V^μ(x).

¹We note here that our algorithms can be easily extended to other variance-related risk criteria such as the Sharpe ratio, which is popular in financial decision-making [18] (see Appendix D of [17]).

The most common measure of the variability in the stream of rewards is the variance of the return

Λ^μ(x) = E[D^μ(x)²] − V^μ(x)² = U^μ(x) − V^μ(x)²,      (1)

first introduced by Sobel [19]. Note that U^μ(x) ≜ E[D^μ(x)²] is the square reward value function of state x under policy μ. 
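The decomposition Λ = U − V² in (1) can be checked empirically by Monte Carlo. The sketch below is a hypothetical stand-in for an MDP rollout: it draws truncated samples of the return D for an i.i.d. ±reward stream (reward 1 or 0 with equal probability), for which V = 0.5/(1 − γ) and Var(D) = 0.25/(1 − γ²) are known in closed form; the chain itself is an assumption for illustration only.

```python
import random

def sample_return(gamma, horizon, rng):
    """One truncated sample of the return D = sum_t gamma^t R_t, where the
    per-step reward is 1 or 0 with equal probability (a stand-in for a
    rollout under a fixed policy mu)."""
    d, g = 0.0, 1.0
    for _ in range(horizon):
        d += g * (1.0 if rng.random() < 0.5 else 0.0)
        g *= gamma
    return d

def return_stats(gamma=0.9, n=20000, horizon=100, seed=0):
    rng = random.Random(seed)
    ds = [sample_return(gamma, horizon, rng) for _ in range(n)]
    v = sum(ds) / n                    # estimate of V(x0) = E[D]
    u = sum(d * d for d in ds) / n     # estimate of U(x0) = E[D^2]
    return v, u, u - v * v             # Lambda(x0) = U - V^2, as in (1)

v, u, lam = return_stats()
```

For γ = 0.9 the estimates should be close to V = 5 and Λ = 0.25/(1 − 0.81) ≈ 1.316, illustrating why both V and the square reward value function U are needed to control the variance.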
Although Λ^μ of (1) satisfies a Bellman equation, unfortunately, it lacks the monotonicity property of dynamic programming (DP), and thus, it is not clear how the related risk measures can be optimized by standard DP algorithms [19]. This is why policy gradient and actor-critic algorithms are good candidates to deal with this risk measure. We consider the following risk-sensitive measure for discounted MDPs: for a given α > 0,

max_θ V^θ(x^0)   subject to   Λ^θ(x^0) ≤ α.      (2)

To solve (2), we employ the Lagrangian relaxation procedure [1] to convert it to the following unconstrained problem:

max_λ min_θ ( L(θ, λ) ≜ −V^θ(x^0) + λ(Λ^θ(x^0) − α) ),      (3)

where λ is the Lagrange multiplier. The goal here is to find the saddle point of L(θ, λ), i.e., a point (θ*, λ*) that satisfies L(θ, λ*) ≥ L(θ*, λ*) ≥ L(θ*, λ), ∀θ, ∀λ > 0. This is achieved by descending in θ and ascending in λ using the gradients

∇_θ L(θ, λ) = −∇_θ V^θ(x^0) + λ ∇_θ Λ^θ(x^0)   and   ∇_λ L(θ, λ) = Λ^θ(x^0) − α,

respectively. Since ∇Λ^θ(x^0) = ∇U^θ(x^0) − 2V^θ(x^0)∇V^θ(x^0), in order to compute ∇Λ^θ(x^0), we need to calculate ∇U^θ(x^0) and ∇V^θ(x^0). 
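The descend-in-θ / ascend-in-λ scheme behind (3) can be exercised on a deterministic toy problem. The sketch below is an assumption-laden analogue, not the paper's MDP objective: it takes V(θ) = −(θ − 2)² (unconstrained optimum θ = 2) and a "variance" Var(θ) = θ² with bound α = 1, for which the KKT conditions give θ* = 1, λ* = 1.

```python
def saddle_point(alpha=1.0, steps=4000, lr_theta=0.05, lr_lam=0.005):
    """Toy analogue of the Lagrangian relaxation (3): maximize V(theta)
    subject to Var(theta) <= alpha by forming L = -V + lam * (Var - alpha),
    then descending in theta and ascending in lam (with lam projected to
    stay nonnegative, and a slower step size, as in the paper's scheme)."""
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        grad_theta = 2 * (theta - 2) + lam * 2 * theta  # d/dtheta [-V + lam*Var]
        grad_lam = theta ** 2 - alpha                   # dL/dlam = Var - alpha
        theta -= lr_theta * grad_theta
        lam = max(0.0, lam + lr_lam * grad_lam)         # projection to [0, inf)
    return theta, lam

theta_star, lam_star = saddle_point()
```

With the multiplier step size an order of magnitude smaller than the parameter step size, the iterates settle near the constrained optimum (θ*, λ*) ≈ (1, 1), mimicking the two-timescale behavior used throughout the paper.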
From the Bellman equation of Λ^μ(x), proposed by Sobel [19], it is straightforward to derive Bellman equations for U^μ(x) and the square reward action-value function W^μ(x, a) ≜ E[D^μ(x, a)²] (see Appendix B.1 of [17]). Using these definitions and notations we are now ready to derive expressions for the gradients of V^θ(x^0) and U^θ(x^0) that are the main ingredients in calculating ∇_θ L(θ, λ).

Lemma 1 Assuming for all (x, a), μ(a|x; θ) is continuously differentiable in θ, we have

(1 − γ) ∇V^θ(x^0) = Σ_{x,a} π^θ_γ(x, a|x^0) ∇ log μ(a|x; θ) Q^θ(x, a),

(1 − γ²) ∇U^θ(x^0) = Σ_{x,a} π̃^θ_γ(x, a|x^0) ∇ log μ(a|x; θ) W^θ(x, a) + 2γ Σ_{x,a,x'} π̃^θ_γ(x, a|x^0) P(x'|x, a) r(x, a) ∇V^θ(x'),

where π̃^θ_γ(x, a|x^0) = d̃^θ_γ(x|x^0) μ(a|x) and d̃^θ_γ(x|x^0) = (1 − γ²) Σ_{t=0}^∞ γ^{2t} Pr(x_t = x | x_0 = x^0; θ).

The proof of the above lemma is available in Appendix B.2 of [17]. It is challenging to devise an efficient method to estimate ∇_θ L(θ, λ) using the gradient formulas of Lemma 1. This is mainly because 1) two different sampling distributions (π^θ_γ and π̃^θ_γ) are used for ∇V^θ(x^0) and ∇U^θ(x^0), and 2) ∇V^θ(x') appears in the second sum of the ∇U^θ(x^0) equation, which implies that we need to estimate the gradient of the value function V^θ at every state of the MDP. 
These are the main motivations behind using simultaneous perturbation methods for estimating ∇_θ L(θ, λ) in Section 4.

4 Discounted Reward Algorithms

In this section, we present actor-critic algorithms for optimizing the risk-sensitive measure (2) that are based on two simultaneous perturbation methods: simultaneous perturbation stochastic approximation (SPSA) and smoothed functional (SF) [3]. The idea in these methods is to estimate the gradients ∇V^θ(x^0) and ∇U^θ(x^0) using two simulated trajectories of the system corresponding to policies with parameters θ and θ⁺ = θ + βΔ. Here β > 0 is a positive constant and Δ is a perturbation random variable, i.e., a κ1-vector of independent Rademacher (for SPSA) and Gaussian N(0, 1) (for SF) random variables. In our actor-critic algorithms, the critic uses linear approximation for the value and square value functions, i.e., V̂(x) ≈ v⊤φ_v(x) and Û(x) ≈ u⊤φ_u(x), where the features φ_v(·) and φ_u(·) are from low-dimensional spaces R^κ2 and R^κ3, respectively.

Figure 1: The overall flow of our simultaneous perturbation based actor-critic algorithms.

SPSA-based gradient estimates were first proposed in [20] and have been widely studied and found to be highly efficient in various settings, especially those involving high-dimensional parameters. The SPSA-based estimate for ∇V^θ(x^0), and similarly for ∇U^θ(x^0), is given by:

∂_{θ(i)} V̂^θ(x^0) ≈ ( V̂^{θ+βΔ}(x^0) − V̂^θ(x^0) ) / ( β Δ(i) ),   i = 1, ..., κ1,      (4)

where Δ is a vector of independent Rademacher random variables. 
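The one-sided estimate (4) can be exercised on a simple known function. In the sketch below, a deterministic quadratic stands in for V^θ(x^0) (an assumption for illustration; in the paper the two evaluations come from simulated trajectories), and averaging the estimator over many Rademacher draws recovers the true gradient.

```python
import random

def spsa_gradient(f, theta, beta, rng):
    """One-sided SPSA estimate, cf. (4): perturb every coordinate at once with
    a single Rademacher vector Delta and divide the common difference
    f(theta + beta*Delta) - f(theta) by beta*Delta_i per coordinate. Only two
    function evaluations are needed, regardless of the dimension."""
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [th + beta * d for th, d in zip(theta, delta)]
    diff = f(theta_plus) - f(theta)
    return [diff / (beta * d) for d in delta]

def f(theta):
    """Toy stand-in for V^theta(x0); the true gradient is (2*t0, 4*t1)."""
    return theta[0] ** 2 + 2.0 * theta[1] ** 2

rng = random.Random(1)
theta = [1.0, -1.0]          # true gradient here is (2, -4)
n = 20000
avg = [0.0, 0.0]
for _ in range(n):
    g = spsa_gradient(f, theta, beta=0.01, rng=rng)
    avg = [a + gi / n for a, gi in zip(avg, g)]
```

A single estimate is noisy (the cross-coordinate terms have zero mean but nonzero variance), which is why the actor updates below average these estimates implicitly through small step sizes.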
The advantage of this estimator is that it perturbs all directions at the same time (the numerator is identical in all κ1 components). So, the number of function measurements needed for this estimator is always two, independent of the dimension κ1. However, unlike the SPSA estimates in [20] that use two-sided balanced estimates (simulations with parameters θ − βΔ and θ + βΔ), our gradient estimates are one-sided (simulations with parameters θ and θ + βΔ) and resemble those in [6]. The use of one-sided estimates is primarily because the updates of the Lagrangian parameter λ require a simulation with the running parameter θ. Using a balanced gradient estimate would therefore come at the cost of an additional simulation (the resulting procedure would then require three simulations), which we avoid by using one-sided gradient estimates.
The SF-based method estimates not the gradient of a function H(θ) itself, but rather the convolution of ∇H(θ) with the Gaussian density function N(0, β²I), i.e.,

C_β H(θ) = ∫ G_β(θ − z) ∇_z H(z) dz = ∫ ∇_z G_β(z) H(θ − z) dz = (1/β) ∫ −z' G_1(z') H(θ − βz') dz',

where G_β is a κ1-dimensional probability density function. The first equality above follows by using integration by parts and the second one by using the fact that ∇_z G_β(z) = (−z/β²) G_β(z) and by substituting z' = z/β. As β → 0, it can be seen that C_β H(θ) converges to ∇_θ H(θ) (see Chapter 6 of [3]). Thus, a one-sided SF estimate of ∇V^θ(x^0) is given by

∂_{θ(i)} V̂^θ(x^0) ≈ ( Δ(i)/β ) ( V̂^{θ+βΔ}(x^0) − V̂^θ(x^0) ),   i = 1, ..., κ1,      (5)

where Δ is a vector of independent Gaussian N(0, 1) random variables.
The overall flow of our proposed actor-critic algorithms is illustrated in Figure 1 and involves the following main steps at each time step t:
(1) Take action a_t ∼ μ(·|x_t; θ_t), observe the reward r(x_t, a_t) and next state x_{t+1} in the first trajectory.
(2) Take action a⁺_t ∼ μ(·|x⁺_t; θ⁺_t), observe the reward r(x⁺_t, a⁺_t) and next state x⁺_{t+1} in the second trajectory.
(3) Critic Update: Calculate the temporal difference (TD)-errors δ_t, δ⁺_t for the value and ε_t, ε⁺_t for the square value functions using (7), and update the critic parameters v_t, v⁺_t for the value and u_t, u⁺_t for the square value functions as follows:

v_{t+1} = v_t + ζ3(t) δ_t φ_v(x_t),      v⁺_{t+1} = v⁺_t + ζ3(t) δ⁺_t φ_v(x⁺_t),
u_{t+1} = u_t + ζ3(t) ε_t φ_u(x_t),      u⁺_{t+1} = u⁺_t + ζ3(t) ε⁺_t φ_u(x⁺_t),      (6)

where the TD-errors δ_t, δ⁺_t, ε_t, ε⁺_t in (6) are computed as

δ_t = r(x_t, a_t) + γ v_t⊤ φ_v(x_{t+1}) − v_t⊤ φ_v(x_t),
δ⁺_t = r(x⁺_t, a⁺_t) + γ v⁺_t⊤ φ_v(x⁺_{t+1}) − v⁺_t⊤ φ_v(x⁺_t),
ε_t = r(x_t, a_t)² + 2γ r(x_t, a_t) v_t⊤ φ_v(x_{t+1}) + γ² u_t⊤ φ_u(x_{t+1}) − u_t⊤ φ_u(x_t),
ε⁺_t = r(x⁺_t, a⁺_t)² + 2γ r(x⁺_t, a⁺_t) v⁺_t⊤ φ_v(x⁺_{t+1}) + γ² u⁺_t⊤ φ_u(x⁺_{t+1}) − u⁺_t⊤ φ_u(x⁺_t).      (7)

This TD algorithm to learn the value and square value functions is a straightforward 
extension of the algorithm proposed in [23] to the discounted setting. Note that the TD-error ε for the square value function U comes directly from the Bellman equation for U (see Appendix B.1 of [17]).
(4) Actor Update: Estimate the gradients ∇V^θ(x^0) and ∇U^θ(x^0) using SPSA (4) or SF (5) and update the policy parameter θ and the Lagrange multiplier λ as follows: For i = 1, ..., κ1,

θ(i)_{t+1} = Γ_i( θ(i)_t + ( ζ2(t)/(β Δ(i)_t) ) [ (1 + 2λ_t v_t⊤ φ_v(x^0)) (v⁺_t − v_t)⊤ φ_v(x^0) − λ_t (u⁺_t − u_t)⊤ φ_u(x^0) ] ),   SPSA (8)

θ(i)_{t+1} = Γ_i( θ(i)_t + ( ζ2(t) Δ(i)_t / β ) [ (1 + 2λ_t v_t⊤ φ_v(x^0)) (v⁺_t − v_t)⊤ φ_v(x^0) − λ_t (u⁺_t − u_t)⊤ φ_u(x^0) ] ),   SF (9)

λ_{t+1} = Γ_λ( λ_t + ζ1(t) [ u_t⊤ φ_u(x^0) − ( v_t⊤ φ_v(x^0) )² − α ] ).      (10)

Note that 1) the λ-update is the same for both SPSA and SF methods, 2) the Δ(i)_t's are independent Rademacher and Gaussian N(0, 1) random variables in the SPSA and SF updates, respectively, 3) Γ is an operator that projects a vector θ ∈ R^κ1 to the closest point in a compact and convex set C ⊂ R^κ1, and Γ_λ is a projection operator to [0, λ_max]. 
These projection operators are necessary to ensure convergence of the algorithms, and 4) the step-size schedules {ζ3(t)}, {ζ2(t)}, and {ζ1(t)} are chosen such that the critic updates are on the fastest time-scale, the policy parameter update is on the intermediate time-scale, and the Lagrange multiplier update is on the slowest time-scale (see Appendix A of [17] for the conditions on the step-size schedules). A proof of convergence of the SPSA and SF algorithms to a (local) saddle point of the risk-sensitive objective function L̂(θ, λ) ≜ −V̂^θ(x^0) + λ(Λ̂^θ(x^0) − α) is given in Appendix B.3 of [17].

5 Average Reward Setting

The average reward per step under policy μ is defined as (see Sec. 2 for the definitions of d^μ and π^μ)

ρ(μ) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} R_t | μ ] = Σ_{x,a} d^μ(x) μ(a|x) r(x, a).

The goal in the standard (risk-neutral) average reward formulation is to find an average optimal policy, i.e., μ* = argmax_μ ρ(μ). Here a policy μ is assessed according to the expected differential reward associated with states or state-action pairs. 
For all states x ∈ X and actions a ∈ A, the differential action-value and value functions of policy μ are defined as

Q^μ(x, a) = Σ_{t=0}^∞ E[ R_t − ρ(μ) | x_0 = x, a_0 = a, μ ],      V^μ(x) = Σ_a μ(a|x) Q^μ(x, a).

In the context of risk-sensitive MDPs, different criteria have been proposed to define a measure of variability, among which we consider the long-run variance of μ [8] defined as

Λ(μ) = Σ_{x,a} π^μ(x, a) [ r(x, a) − ρ(μ) ]² = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} ( R_t − ρ(μ) )² | μ ].      (11)

This notion of variability is based on the observation that it is the frequency of occurrence of state-action pairs that determines the variability in the average reward. It is easy to show that

Λ(μ) = η(μ) − ρ(μ)²,   where   η(μ) = Σ_{x,a} π^μ(x, a) r(x, a)².

We consider the following risk-sensitive measure for average reward MDPs in this paper:

max_θ ρ(θ)   subject to   Λ(θ) ≤ α,      (12)

for a given α > 0. As in the discounted setting, we employ the Lagrangian relaxation procedure to convert (12) to the unconstrained problem

max_λ min_θ ( L(θ, λ) ≜ −ρ(θ) + λ(Λ(θ) − α) ).

Similar to the discounted case, we descend in θ using ∇_θ L(θ, λ) = −∇_θ ρ(θ) + λ ∇_θ Λ(θ) and ascend in λ using ∇_λ L(θ, λ) = Λ(θ) − α, to find the saddle point of L(θ, λ). 
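The identity Λ(μ) = η(μ) − ρ(μ)² can be checked empirically from a single long trajectory. The sketch below uses a hypothetical two-state Markov chain (states 0/1, reward 2·state, symmetric flip probability 0.3), chosen only because its stationary distribution is uniform, giving ρ = 1, η = 2, and Λ = 1 in closed form.

```python
import random

def long_run_variance(steps=200000, seed=7):
    """Empirical check of Lambda(mu) = eta(mu) - rho(mu)^2 on a toy two-state
    chain whose stationary distribution is uniform (symmetric flips), so that
    rho = 1, eta = 2, and Lambda = 2 - 1 = 1 exactly."""
    rng = random.Random(seed)
    x = 0
    s1 = s2 = 0.0
    for _ in range(steps):
        r = 2.0 * x                  # reward depends only on the state
        s1 += r
        s2 += r * r
        if rng.random() < 0.3:       # symmetric flip => uniform stationary dist.
            x = 1 - x
    rho = s1 / steps                 # running estimate of rho(mu)
    eta = s2 / steps                 # running estimate of eta(mu)
    return rho, eta, eta - rho ** 2  # Lambda(mu) via the identity

rho, eta, lam = long_run_variance()
```

This is exactly the quantity the λ-ascent step of the average reward algorithm monitors: the constraint Λ(θ) ≤ α is checked through running estimates of η and ρ.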
Since ∇Λ(θ) = ∇η(θ) − 2ρ(θ)∇ρ(θ), in order to compute ∇Λ(θ) it would be enough to calculate ∇η(θ). Let U^μ and W^μ denote the differential value and action-value functions associated with the square reward under policy μ, respectively. These two quantities satisfy the following Poisson equations:

η(μ) + U^μ(x) = Σ_a μ(a|x) [ r(x, a)² + Σ_{x'} P(x'|x, a) U^μ(x') ],
η(μ) + W^μ(x, a) = r(x, a)² + Σ_{x'} P(x'|x, a) U^μ(x').      (13)

We calculate the gradients of ρ(θ) and η(θ) as (see Lemma 5 of Appendix C.1 in [17]):

∇ρ(θ) = Σ_{x,a} π(x, a; θ) ∇ log μ(a|x; θ) Q(x, a; θ),      (14)

∇η(θ) = Σ_{x,a} π(x, a; θ) ∇ log μ(a|x; θ) W(x, a; θ).      (15)

Note that (15) for calculating ∇η(θ) has a close resemblance to (14) for ∇ρ(θ), and thus, similar to what we have for (14), any function b : X → R can be added or subtracted to W(x, a; θ) on the RHS of (15) without changing the result of the integral (see e.g., [2]). So, we can replace W(x, a; θ) with the square reward advantage function B(x, a; θ) = W(x, a; θ) − U(x; θ) on the RHS of (15), in the same manner as we can replace Q(x, a; θ) with the advantage function A(x, a; θ) = Q(x, a; θ) − V(x; θ) on the RHS of (14), without changing the result of the integral. 
We define the temporal difference (TD) errors δ_t and ε_t for the differential value and square value functions as

δ_t = R(x_t, a_t) − ρ̂_{t+1} + V̂(x_{t+1}) − V̂(x_t),
ε_t = R(x_t, a_t)² − η̂_{t+1} + Û(x_{t+1}) − Û(x_t).

If V̂, Û, ρ̂, and η̂ are unbiased estimators of V^μ, U^μ, ρ(μ), and η(μ), respectively, then we can show that δ_t and ε_t are unbiased estimates of the advantage functions A^μ and B^μ, i.e., E[δ_t | x_t, a_t, μ] = A^μ(x_t, a_t) and E[ε_t | x_t, a_t, μ] = B^μ(x_t, a_t) (see Lemma 6 in Appendix C.2 of [17]). From this, we notice that δ_t ψ_t and ε_t ψ_t are unbiased estimates of ∇ρ(μ) and ∇η(μ), respectively, where ψ_t = ψ(x_t, a_t) = ∇ log μ(a_t|x_t) is the compatible feature (see e.g., [21, 13]).

6 Average Reward Algorithm

We now present our risk-sensitive actor-critic algorithm for average reward MDPs. Algorithm 1 presents the complete structure of the algorithm along with update rules for the average rewards ρ̂_t, η̂_t; TD errors δ_t, ε_t; critic v_t, u_t; and actor θ_t, λ_t parameters. The projection operators Γ and Γ_λ are as defined in Section 4, and similar to the discounted setting, are necessary for the convergence proof of the algorithm. The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the average and critic updates are on the (same) fastest time-scales {ζ4(t)} and {ζ3(t)}, the policy parameter update is on the intermediate time-scale {ζ2(t)}, and the Lagrange multiplier update is on the slowest time-scale {ζ1(t)}. This results in a three time-scale stochastic approximation algorithm. As in the discounted setting, the critic uses linear approximation for the differential value and square value functions, i.e., V̂(x) = v⊤φ_v(x) and Û(x) = u⊤φ_u(x), where φ_v(·) and φ_u(·) are feature vectors of size κ2 and κ3, respectively. Although our estimates of ρ(θ) and η(θ) are unbiased, since we use biased estimates for V^θ and U^θ (linear approximations in the critic), our gradient estimates of ∇ρ(θ) and ∇η(θ), and as a result of ∇L(θ, λ), are biased. Lemma 7 in Appendix C.2 of [17] shows the bias in our estimate of ∇L(θ, λ). We prove that our actor-critic algorithm converges to a (local) saddle point of the risk-sensitive objective function L(θ, λ) (see Appendix C.3 of [17]).

Algorithm 1 Template of the Average Reward Risk-Sensitive Actor-Critic Algorithm

Input: parameterized policy μ(·|·; θ) and value function feature vectors φ_v(·) and φ_u(·)
Initialization: policy parameters θ = θ0; value function weight vectors v = v0 and u = u0; initial state x0 ∼ P0(x)
for t = 0, 1, 2, ... do
    Draw action a_t ∼ μ(·|x_t; θ_t); observe next state x_{t+1} ∼ P(·|x_t, a_t) and reward R(x_t, a_t)
    Average Updates:  ρ̂_{t+1} = (1 − ζ4(t)) ρ̂_t + ζ4(t) R(x_t, a_t),   η̂_{t+1} = (1 − ζ4(t)) η̂_t + ζ4(t) R(x_t, a_t)²      (16)
    TD Errors:  δ_t = R(x_t, a_t) − ρ̂_{t+1} + v_t⊤ φ_v(x_{t+1}) − v_t⊤ φ_v(x_t),
                ε_t = R(x_t, a_t)² − η̂_{t+1} + u_t⊤ φ_u(x_{t+1}) − u_t⊤ φ_u(x_t)
    Critic Updates:  v_{t+1} = v_t + ζ3(t) δ_t φ_v(x_t),   u_{t+1} = u_t + ζ3(t) ε_t φ_u(x_t)
    Actor Updates:  θ_{t+1} = Γ( θ_t − ζ2(t) ( −δ_t ψ_t + λ_t ( ε_t ψ_t − 2ρ̂_{t+1} δ_t ψ_t ) ) )      (17)
                    λ_{t+1} = Γ_λ( λ_t + ζ1(t) ( η̂_{t+1} − ρ̂²_{t+1} − α ) )      (18)
end for
return policy and value function parameters θ, λ, v, u

7 Experimental Results

We evaluate our algorithms in the context of a traffic signal control application. The objective in our formulation is to minimize the total number of vehicles in the system, which indirectly minimizes the delay experienced by the system. The motivation behind using a risk-sensitive control strategy is to reduce the variations in the delay experienced by road users.
We consider both infinite horizon discounted as well as average settings for the traffic signal control MDP, formulated as in [15]. We briefly recall their formulation here: The state at each time t, x_t, is the vector of queue lengths and elapsed times and is given by x_t = (q1(t), ..., qN(t), t1(t), ..., tN(t)). Here q_i and t_i denote the queue length and the elapsed time since the signal turned to red on lane i. The actions a_t belong to the set of feasible sign configurations. The single-stage cost function h(x_t) is defined as follows:

h(x_t) = r1 [ Σ_{i∈Ip} r2·q_i(t) + Σ_{i∉Ip} s2·q_i(t) ] + s1 [ Σ_{i∈Ip} r2·t_i(t) + Σ_{i∉Ip} s2·t_i(t) ],      (19)

where r_i, s_i ≥ 0 such that r_i + s_i = 1 for i = 1, 2, and r2 > s2. The set Ip is the set of prioritized lanes in the road network considered. 
While the weights $r_1, s_1$ are used to differentiate between the queue length and elapsed time factors, the weights $r_2, s_2$ help in the prioritization of traffic.

Given the above traffic control setting, we aim to minimize both the long-run discounted as well as average sum of the cost function $h(x_t)$. The underlying policy for all the algorithms is a parameterized Boltzmann policy (see Appendix F of [17]). We implement the following algorithms in the discounted setting:

(i) Risk-neutral SPSA and SF algorithms with the actor update as follows:

$$\theta^{(i)}_{t+1} = \Gamma_i\Big(\theta^{(i)}_t + \frac{\zeta_2(t)}{\beta\Delta^{(i)}_t}\,(v^+_t - v_t)^\top \phi_v(x_0)\Big) \quad \text{SPSA}, \qquad \theta^{(i)}_{t+1} = \Gamma_i\Big(\theta^{(i)}_t + \frac{\zeta_2(t)\Delta^{(i)}_t}{\beta}\,(v^+_t - v_t)^\top \phi_v(x_0)\Big) \quad \text{SF},$$

where the critic parameters $v^+_t, v_t$ are updated according to (6). Note that these are two-timescale algorithms, with a TD critic on the faster timescale and the actor on the slower timescale.

(ii) Risk-sensitive SPSA and SF algorithms (RS-SPSA and RS-SF) of Section 4 that attempt to solve (2) and update the policy parameter according to (8) and (9), respectively.

In the average setting, we implement (i) the risk-neutral AC algorithm from [14] that incorporates an actor-critic scheme, and (ii) the risk-sensitive algorithm of Section 6 (RS-AC) that attempts to solve (12) and updates the policy parameter according to (17).

All our algorithms incorporate function approximation owing to the curse of dimensionality associated with larger road networks. For instance, assuming only 20 vehicles per lane of a 2x2-grid network, the cardinality of the state space is approximately of the order $10^{32}$, and the situation is aggravated as the size of the road network increases. The choice of features used in each of our algorithms is as described in Section V-B of [16].
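The SPSA actor update above estimates every partial derivative from a single simultaneous Rademacher perturbation of all coordinates of $\theta$. A minimal sketch of this estimator, where a known test function `f` stands in for the critic's value estimate (in RS-SPSA, the difference `f(theta + beta*delta) - f(theta)` corresponds to $(v^+_t - v_t)^\top\phi_v(x_0)$); the function and variable names are illustrative assumptions:

```python
import numpy as np

def spsa_gradient(f, theta, beta, rng):
    """One-sided SPSA estimate of grad f(theta): perturb all
    coordinates at once with a Rademacher vector delta, then divide
    the single function-value difference by beta * delta_i to get
    the i-th partial derivative estimate."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # Only two evaluations of f, regardless of the dimension of theta
    return (f(theta + beta * delta) - f(theta)) / (beta * delta)
```

Because $\mathbb{E}[\delta_j/\delta_i] = 0$ for $j \neq i$, the cross-coordinate terms vanish in expectation, so the estimate is unbiased up to $O(\beta)$ terms; a single perturbation per actor step suffices since the noise averages out over iterations.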
We perform the experiments on a 2x2-grid network. The list of parameters and step-sizes chosen for our algorithms is given in Appendix F of [17].

Figures 2(a) and 2(b) show the distribution of the discounted cumulative reward $D^\theta(x_0)$ for the SPSA and SF algorithms, respectively. Figure 3(a) shows the distribution of the average reward $\rho$ for the algorithms in the average setting.

Figure 2: Performance comparison in the discounted setting using the distribution of $D^\theta(x_0)$: (a) SPSA vs. RS-SPSA, (b) SF vs. RS-SF.

Figure 3: Comparison of AC vs. RS-AC in the average setting using two different metrics: (a) distribution of $\rho$, (b) average junction waiting time.

From these plots, we notice that the risk-sensitive algorithms that we propose result in a long-term (discounted or average) reward that is marginally lower than that of their risk-neutral variants. However, from the perspective of the empirical variance of the reward (both discounted and average), the risk-sensitive algorithms outperform their risk-neutral variants.

We use the average junction waiting time (AJWT) to compare the algorithms from a traffic signal control application standpoint. Figure 3(b) presents the AJWT plots for the algorithms in the average setting (see Appendix F of [17] for similar results for the SPSA and SF algorithms in the discounted setting). We observe that the performance of our risk-sensitive algorithms is not significantly worse than that of their risk-neutral counterparts. This, coupled with the observation that our algorithms exhibit low variance, makes them a suitable choice in risk-constrained systems.

8 Conclusions and Future Work

We proposed novel actor-critic algorithms for control in risk-sensitive discounted and average reward MDPs. All our algorithms involve a TD critic on the fast timescale, a policy gradient (actor) on the intermediate timescale, and dual ascent for Lagrange multipliers on the slowest timescale.
In the discounted setting, we pointed out the difficulty in estimating the gradient of the variance of the return, and incorporated simultaneous perturbation-based SPSA and SF approaches for gradient estimation in our algorithms. The average setting, on the other hand, allowed the actor to employ compatible features to estimate the gradient of the variance. We provided proofs of convergence (in the appendix of [17]) to locally (risk-sensitive) optimal policies for all the proposed algorithms. Further, using a traffic signal control application, we observed that our algorithms empirically resulted in lower variance compared to their risk-neutral counterparts.

In this paper, we established asymptotic limits for our discounted and average reward risk-sensitive actor-critic algorithms. To the best of our knowledge, there are no convergence rate results available for multi-timescale stochastic approximation schemes, and hence for actor-critic algorithms. This is true even for actor-critic algorithms that do not incorporate any risk criterion. It would be an interesting research direction to obtain finite-time bounds on the quality of the solution obtained by these algorithms.

References

[1] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[2] S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, 45(11):2471-2482, 2009.
[3] S. Bhatnagar, H. Prasad, and L.A. Prashanth. Stochastic Recursive Algorithms for Optimization, volume 434. Springer, 2013.
[4] V. Borkar.
A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44:339-346, 2001.
[5] V. Borkar. Q-learning for risk-sensitive control. Mathematics of Operations Research, 27:294-311, 2002.
[6] H. Chen, T. Duncan, and B. Pasik-Duncan. A Kiefer-Wolfowitz algorithm with randomized differences. IEEE Transactions on Automatic Control, 44(3):442-453, 1999.
[7] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203-213, 2010.
[8] J. Filar, L. Kallenberg, and H. Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147-161, 1989.
[9] J. Filar, D. Krass, and K. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2-10, 1995.
[10] R. Howard and J. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356-369, 1972.
[11] V. Katkovnik and Y. Kulchitsky. Convergence of a class of random search algorithms. Automation and Remote Control, 8:81-87, 1972.
[12] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780-798, 2005.
[13] J. Peters, S. Vijayakumar, and S. Schaal. Natural actor-critic. In Proceedings of the Sixteenth European Conference on Machine Learning, pages 280-291, 2005.
[14] L.A. Prashanth and S. Bhatnagar. Reinforcement learning with average cost for adaptive control of traffic lights at intersections. In Proceedings of the Fourteenth International IEEE Conference on Intelligent Transportation Systems, pages 1640-1645. IEEE, 2011.
[15] L.A. Prashanth and S. Bhatnagar. Reinforcement learning with function approximation for traffic signal control.
IEEE Transactions on Intelligent Transportation Systems, 12(2):412-421, June 2011.
[16] L.A. Prashanth and S. Bhatnagar. Threshold tuning using stochastic optimization for graded signal control. IEEE Transactions on Vehicular Technology, 61(9):3865-3880, November 2012.
[17] L.A. Prashanth and M. Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. Technical report inria-00794721, INRIA, 2013.
[18] W. Sharpe. Mutual fund performance. Journal of Business, 39(1):119-138, 1966.
[19] M. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19:794-802, 1982.
[20] J. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332-341, 1992.
[21] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of Advances in Neural Information Processing Systems 12, pages 1057-1063, 2000.
[22] A. Tamar, D. Di Castro, and S. Mannor. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, pages 387-396, 2012.
[23] A. Tamar, D. Di Castro, and S. Mannor. Temporal difference methods for the variance of the reward to go. In Proceedings of the Thirtieth International Conference on Machine Learning, pages 495-503, 2013.
[24] H. Xu and S. Mannor. Distributionally robust Markov decision processes. Mathematics of Operations Research, 37(2):288-300, 2012.