{"title": "A Block Coordinate Ascent Algorithm for Mean-Variance Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1065, "page_last": 1075, "abstract": "Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare. The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune, and have only asymptotic convergence proof. In this paper, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. Both the asymptotic convergence guarantee of the last iteration's solution and the convergence rate of the randomly picked solution are provided, and their applicability is demonstrated on several benchmark domains.", "full_text": "A Block Coordinate Ascent Algorithm for\n\nMean-Variance Optimization\n\nTengyang Xie\u2217\nUMass Amherst\n\ntxie@cs.umass.edu\n\nBo Liu\u2217\n\nAuburn University\nboliu@auburn.edu\n\nYangyang Xu\n\nRensselaer Polytechnic Institute\n\nxuy21@rpi.edu\n\nMohammad Ghavamzadeh\n\nFacebook AI Research\n\nYinlam Chow\n\nGoogle DeepMind\n\nDaoming Lyu\n\nAuburn University\n\nmgh@fb.com\n\nyinlamchow@google.com\n\ndaoming.lyu@auburn.edu\n\nDaesub Yoon\n\nETRI\n\neyetracker@etri.re.kr\n\nAbstract\n\nRisk management in dynamic decision problems is a primary concern in many\n\ufb01elds, including \ufb01nancial investment, autonomous driving, and healthcare. 
The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune, and which have only asymptotic convergence proofs. In this paper, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Legendre-Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. Both the asymptotic convergence guarantee of the last iteration's solution and the convergence rate of the randomly picked solution are provided, and their applicability is demonstrated on several benchmark domains.\n\n1 Introduction\n\nRisk management plays a central role in sequential decision-making problems, common in fields such as portfolio management [Lai et al., 2011], autonomous driving [Maurer et al., 2016], and healthcare [Parker, 2009]. A common risk measure is the variance of the sum of rewards/costs, and the mean-variance trade-off function [Sobel, 1982; Mannor and Tsitsiklis, 2011] is one of the most widely used objective functions in risk-sensitive decision-making. Other risk-sensitive objectives have also been studied: for example, Borkar [2002] studied exponential utility functions, Tamar et al. [2012] experimented with the Sharpe Ratio measure, Chow et al. [2018] studied value at risk (VaR) and mean-VaR optimization, Chow and Ghavamzadeh [2014], Tamar et al. [2015b], and Chow et al. [2018] investigated conditional value at risk (CVaR) and mean-CVaR optimization in a static setting, and Tamar et al. [2015a] investigated coherent risk for both linear and nonlinear system dynamics. 
Compared with other widely used performance measures, such as the Sharpe Ratio and CVaR, the mean-variance measure has explicit interpretability and computational advantages [Markowitz et al., 2000; Li and Ng, 2000].\n\n∗Equal contribution. Corresponding author: boliu@auburn.edu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFor example, the Sharpe Ratio tends to lead to solutions with a lower mean return [Tamar et al., 2012]. Existing mean-variance reinforcement learning (RL) algorithms [Tamar et al., 2012; Prashanth and Ghavamzadeh, 2013, 2016] often suffer from heavy computational cost, slow convergence, and difficulties in tuning their learning rate schedules. Moreover, all their analyses are asymptotic, and no rigorous finite-sample complexity analysis has been reported. Recently, Dalal et al. [2018] provided a general approach to finite-sample analysis for linear multi-time-scale stochastic approximation problems. However, existing multi-time-scale algorithms such as that of Tamar et al. [2012] contain nonlinear terms in their updates, and cannot be analyzed via the method of Dalal et al. [2018]. All of this makes these algorithms difficult to use in real-world problems. The goal of this paper is to propose a mean-variance optimization algorithm that is both computationally efficient and equipped with finite-sample analysis guarantees. This paper makes the following contributions: 1) We develop a computationally efficient RL algorithm for mean-variance optimization. By reformulating the mean-variance function with its Legendre-Fenchel dual [Boyd and Vandenberghe, 2004], we propose a new formulation for mean-variance optimization and use it to derive a computationally efficient algorithm that is based on stochastic cyclic block coordinate descent. 2) We provide the sample complexity analysis of our proposed algorithm. 
This result is novel because, although cyclic block coordinate descent algorithms usually have empirically better performance than randomized block coordinate descent algorithms, almost all the reported analyses of these algorithms are asymptotic [Xu and Yin, 2015].\n\nHere is a roadmap for the rest of the paper. Section 2 offers a brief background on risk-sensitive RL and stochastic variance reduction. In Section 3, the problem is reformulated using the Legendre-Fenchel duality and a novel algorithm is proposed based on stochastic block coordinate descent. Section 4 contains the theoretical analysis of the paper, which includes both an asymptotic convergence guarantee and a finite-sample error bound. The experimental results of Section 5 validate the effectiveness of the proposed algorithms.\n\n2 Background\n\nThis section offers a brief overview of risk-sensitive RL, including the objective functions and algorithms. We then introduce block coordinate descent methods. Finally, we introduce the Legendre-Fenchel duality, the key ingredient in formulating our new algorithms.\n\n2.1 Risk-Sensitive Reinforcement Learning\n\nReinforcement Learning (RL) [Sutton and Barto, 1998] is a class of learning problems in which an agent interacts with an unfamiliar, dynamic, and stochastic environment, where the agent's goal is to optimize some measure of its long-term performance. 
This interaction is conventionally modeled as a Markov decision process (MDP), defined as the tuple (S, A, P0, P^a_{ss'}, r, γ), where S and A are the sets of states and actions, P0 is the initial state distribution, P^a_{ss'} is the transition kernel that specifies the probability of transitioning from state s ∈ S to state s' ∈ S by taking action a ∈ A, r(s, a) : S × A → R is the reward function bounded by Rmax, and 0 ≤ γ < 1 is a discount factor. A parameterized stochastic policy π_θ(a|s) : S × A → [0, 1] is a probabilistic mapping from states to actions, where θ is the tunable parameter and π_θ(a|s) is a differentiable function w.r.t. θ.\n\nOne commonly used performance measure for policies in episodic MDPs is the return, or cumulative sum of rewards from the starting state, i.e., R = Σ_{k=1}^τ r(s_k, a_k), where s_1 ∼ P0 and τ is the first passage time to the recurrent state s∗ [Puterman, 1994; Tamar et al., 2012], and thus, τ := min{k > 0 | s_k = s∗}. In risk-neutral MDPs, the algorithms aim at finding a near-optimal policy that maximizes the expected sum of rewards J(θ) := E_{π_θ}[R] = E_{π_θ}[Σ_{k=1}^τ r(s_k, a_k)]. We also define the square-return M(θ) := E_{π_θ}[R²] = E_{π_θ}[(Σ_{k=1}^τ r(s_k, a_k))²]. In the following, we sometimes drop the subscript π_θ to simplify the notation.\n\nIn risk-sensitive mean-variance optimization MDPs, the objective is often to maximize J(θ) with a variance constraint, i.e.,\n\nmax_θ J(θ) = E_{π_θ}[R]   s.t. 
Var_{π_θ}(R) ≤ ζ,   (1)\n\nwhere Var_{π_θ}(R) = M(θ) − J²(θ) measures the variance of the return random variable R, and ζ > 0 is a given risk parameter [Tamar et al., 2012; Prashanth and Ghavamzadeh, 2013]. Using the Lagrangian relaxation procedure [Bertsekas, 1999], we can transform the optimization problem (1) into maximizing the following unconstrained objective function:\n\nJ_λ(θ) := E_{π_θ}[R] − λ(Var_{π_θ}(R) − ζ) = J(θ) − λ(M(θ) − J(θ)² − ζ).   (2)\n\nIt is important to note that the mean-variance objective function is NP-hard to optimize in general [Mannor and Tsitsiklis, 2011]. The main reason for the hardness of this optimization problem is that although the variance satisfies a Bellman equation [Sobel, 1982], it unfortunately lacks the monotonicity property of dynamic programming (DP), and thus, it is not clear how the related risk measures can be optimized by standard DP algorithms [Sobel, 1982].\n\nThe existing methods to maximize the objective function (2) are mostly based on stochastic approximation, which often converges to an equilibrium point of an ordinary differential equation (ODE) [Borkar, 2008]. For example, Tamar et al. [2012] proposed a policy gradient algorithm, a two-time-scale stochastic approximation, to maximize (2) for a fixed value of λ (they optimize over λ by selecting its best value in a finite set), while the algorithm in Prashanth and Ghavamzadeh [2013] to maximize (2) is actor-critic and is a three-time-scale stochastic approximation algorithm (the third time-scale optimizes over λ). The stochastic compositional optimization method [Wang et al., 2017] also needs two-time-scale stepsize tuning for mean-variance optimization, and dual embeddings [Dai et al., 2017] assume the embedded problem can be solved exactly. 
These approaches suffer from certain drawbacks: 1) Most of the analyses of ODE-based methods are asymptotic, with no sample complexity analysis. 2) It is well known that multi-time-scale approaches are sensitive to the choice of the stepsize schedules, which is a non-trivial burden in real-world problems. 3) The ODE approach does not allow extra penalty functions. Adding penalty functions can often strengthen the robustness of the algorithm, encourage sparsity, and incorporate prior knowledge into the problem [Hastie et al., 2001].\n\n2.2 Coordinate Descent Optimization\n\nCoordinate descent (CD)1 and the more general block coordinate descent (BCD) algorithms solve a minimization problem by iteratively updating variables along coordinate directions or coordinate hyperplanes [Wright, 2015]. At each iteration of BCD, the objective function is (approximately) minimized w.r.t. a coordinate or a block of coordinates while fixing the remaining ones, and thus an easier, lower-dimensional subproblem needs to be solved. A number of comprehensive studies on BCD have already been carried out, such as Luo and Tseng [1992] and Nesterov [2012] for convex problems, and Tseng [2001], Xu and Yin [2013], and Razaviyayn et al. [2013] for nonconvex cases (see also Wright 2015 for a review). For stochastic problems with a block structure, Dang and Lan [2015] proposed stochastic block mirror descent (SBMD) by combining BCD with stochastic mirror descent [Beck and Teboulle, 2003; Nemirovski et al., 2009]. Another line of research on this topic is the block stochastic gradient (BSG) method [Xu and Yin, 2015]. The key difference between SBMD and BSG is that at each iteration, SBMD randomly picks one block of variables to update, while BSG cyclically updates all blocks of variables.\n\nIn this paper, we develop mean-variance optimization algorithms based on both nonconvex stochastic BSG and SBMD. 
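To make the SBMD/BSG distinction concrete, here is a minimal sketch (ours, not from the paper; all function names and constants are our own, and we use minimization on a toy two-block quadratic for simplicity):

```python
import random

def grad_x(x, y):
    # partial gradient of f(x, y) = (x - 1)^2 + (y + 2)^2 w.r.t. x
    return 2.0 * (x - 1.0)

def grad_y(x, y):
    # partial gradient of f w.r.t. y
    return 2.0 * (y + 2.0)

def cyclic_bcd(steps=200, lr=0.1):
    """BSG-style scheme: every iteration updates all blocks in a fixed order,
    each block update seeing the latest value of the other block."""
    x, y = 0.0, 0.0
    for _ in range(steps):
        x -= lr * grad_x(x, y)
        y -= lr * grad_y(x, y)  # uses the freshly updated x
    return x, y

def randomized_bcd(steps=400, lr=0.1, seed=0):
    """SBMD-style scheme: every iteration picks one random block to update."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for _ in range(steps):
        if rng.random() < 0.5:
            x -= lr * grad_x(x, y)
        else:
            y -= lr * grad_y(x, y)
    return x, y
```

Both schemes drive (x, y) toward the minimizer (1, −2); the cyclic scheme touches every block in each iteration, while the randomized scheme updates each block only about half the time.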
Since it has been shown that BSG-based methods usually have better empirical performance than their SBMD counterparts, the main algorithm we report, analyze, and evaluate in this paper is BSG-based. We report our SBMD-based algorithm in Appendix C and use it as a baseline in the experiments of Section 5. The finite-sample analysis of our BSG-based algorithm reported in Section 4 is novel because, although such an analysis exists for convex stochastic BSG methods [Xu and Yin, 2015], we are not aware of similar results for their nonconvex version.\n\n3 Algorithm Design\n\nIn this section, we first discuss the difficulties of using regular stochastic gradient ascent to maximize the mean-variance objective function. We then propose a new formulation of the mean-variance objective function that is based on its Legendre-Fenchel dual and derive novel algorithms that are based on recent results in stochastic nonconvex block coordinate descent. We conclude this section with an asymptotic analysis of a version of our proposed algorithm.\n\n1Note that since our problem is maximization, our proposed algorithms are block coordinate ascent.\n\n3.1 Problem Formulation\n\nIn this section, we describe why the vanilla stochastic gradient cannot be used to maximize J_λ(θ) defined in Eq. (2), which is possibly a nonconvex function. Taking the gradient of J_λ(θ) w.r.t. θ, we have\n\n∇_θ J_λ(θ_t) = ∇_θ J(θ_t) − λ∇_θ Var(R) = ∇_θ J(θ_t) − λ(∇_θ M(θ_t) − 2J(θ_t)∇_θ J(θ_t)).   (3)\n\nComputing ∇_θ J_λ(θ_t) in (3) involves computing three quantities: ∇_θ J(θ), ∇_θ M(θ), and J(θ)∇_θ J(θ). 
We can obtain unbiased estimates of ∇_θ J(θ) and ∇_θ M(θ) from a single trajectory generated by the policy π_θ using the likelihood ratio method [Williams, 1992], as ∇_θ J(θ) = E[R_t ω_t(θ)] and ∇_θ M(θ) = E[R_t² ω_t(θ)]. Note that R_t = Σ_{k=1}^{τ_t} r_k is the cumulative reward of the t-th episode, and ω_t(θ) = Σ_{k=1}^{τ_t} ∇_θ ln π_θ(a_k|s_k) is the likelihood ratio derivative. In the setting considered in this paper, an episode is the trajectory between two visits to the recurrent state s∗. For example, the t-th episode refers to the trajectory between the (t−1)-th and the t-th visits to s∗. We denote by τ_t the length of this episode.\n\nHowever, it is not possible to compute an unbiased estimate of J(θ)∇_θ J(θ) without having access to a generative model of the environment that allows us to sample at least two next states s' for each state-action pair (s, a). As also noted by Tamar et al. [2012] and Prashanth and Ghavamzadeh [2013], computing an unbiased estimate of J(θ)∇_θ J(θ) requires double sampling (sampling from two different trajectories), and thus, it cannot be done using a single trajectory. To circumvent the double-sampling problem, these papers proposed multi-time-scale stochastic approximation algorithms, the former a policy gradient algorithm and the latter an actor-critic algorithm that uses simultaneous perturbation methods [Bhatnagar et al., 2013]. However, as discussed in Section 2.1, the multi-time-scale stochastic approximation approach suffers from several weaknesses, such as the lack of finite-sample analysis and difficult-to-tune stepsize schedules. 
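As an illustration of these single-trajectory estimators (a sketch under our own naming, not the paper's code), given one episode's rewards and per-step score vectors ∇_θ ln π_θ(a_k|s_k), the estimates R_t ω_t(θ) and R_t² ω_t(θ) can be formed as:

```python
def likelihood_ratio_estimates(rewards, step_scores):
    """Single-episode likelihood-ratio estimates:
    grad J(theta) is estimated by R * omega and grad M(theta) by R^2 * omega,
    where R is the sum of rewards and omega is the sum of the per-step
    score vectors grad_theta log pi_theta(a_k | s_k)."""
    R = sum(rewards)
    # omega: component-wise sum of the score vectors over the episode
    omega = [sum(parts) for parts in zip(*step_scores)]
    grad_J_hat = [R * w for w in omega]
    grad_M_hat = [R * R * w for w in omega]
    return grad_J_hat, grad_M_hat
```

Note that J(θ)∇_θ J(θ) cannot be estimated this way from the same episode: multiplying two quantities formed from a single trajectory biases the product, which is exactly the double-sampling problem described above.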
To overcome these weaknesses, we reformulate the mean-variance objective function and use it to present novel algorithms with in-depth analysis in the rest of the paper.\n\n3.2 Block Coordinate Reformulation\n\nIn this section, we present a new formulation for J_λ(θ) that is later used to derive our algorithms and does not suffer from the double-sampling problem in estimating J(θ)∇_θ J(θ). We begin with the following lemma.\n\nLemma 1. For the quadratic function f(z) = z², z ∈ R, its Legendre-Fenchel dual representation is f(z) = z² = max_{y∈R} (2zy − y²).\n\nThis is a special case of the Legendre-Fenchel duality [Boyd and Vandenberghe, 2004] that has been used in several recent RL papers (e.g., Liu et al. 2015; Du et al. 2017; Liu et al. 2018). Let F_λ(θ) := (J(θ) + 1/(2λ))² − M(θ), which satisfies F_λ(θ) = J_λ(θ)/λ + 1/(4λ²) − ζ. Since λ > 0 is a constant, maximizing J_λ(θ) is equivalent to maximizing F_λ(θ). Using Lemma 1 with z = J(θ) + 1/(2λ), we may reformulate F_λ(θ) as\n\nF_λ(θ) = max_y ( 2y(J(θ) + 1/(2λ)) − y² ) − M(θ).   (4)\n\nUsing (4), the maximization problem max_θ F_λ(θ) is equivalent to\n\nmax_{θ,y} f̂_λ(θ, y),   where   f̂_λ(θ, y) := 2y(J(θ) + 1/(2λ)) − y² − M(θ).   (5)\n\nOur optimization problem is now formulated as the standard nonconvex coordinate ascent problem (5). We use three stochastic solvers to solve (5): the SBMD method [Dang and Lan, 2015], the BSG method [Xu and Yin, 2015], and the vanilla stochastic gradient ascent (SGA) method [Nemirovski et al., 2009]. We report our BSG-based algorithm in Section 3.3 and leave the details of the SBMD- and SGA-based algorithms to Appendix C. 
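Lemma 1 is easy to verify numerically; this small check (our own illustration, not the paper's code) evaluates the dual maximization over a grid of candidate y values and recovers z², with the maximum attained at y = z:

```python
def dual_quadratic(z, ys):
    """Evaluate max over candidate y of 2*z*y - y**2.
    Since 2*z*y - y**2 = z**2 - (y - z)**2, the maximum is z**2, at y = z."""
    return max(2.0 * z * y - y * y for y in ys)
```

This identity is what lets the square J(θ)² be traded for a linear term in J(θ) (at the price of the extra variable y), eliminating the double-sampling problem.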
In the following sections, we denote by β^θ_t and β^y_t the stepsizes of θ and y, respectively, and by the subscripts t and k the episode and time-step numbers.\n\n3.3 Mean-Variance Policy Gradient\n\nWe now present our main algorithm, which is based on a block coordinate update to maximize (5). Let g^θ_t and g^y_t be the block gradients and g̃^θ_t and g̃^y_t be their sample-based estimates, defined as\n\ng^y_t = E[g̃^y_t] = 2J(θ_t) + 1/λ − 2y_t,   g̃^y_t = 2R_t + 1/λ − 2y_t,   (6)\n\ng^θ_t = E[g̃^θ_t] = 2y_{t+1}∇_θ J(θ_t) − ∇_θ M(θ_t),   g̃^θ_t = (2y_{t+1}R_t − R_t²) ω_t(θ_t).   (7)\n\nThe block coordinate updates are\n\ny_{t+1} = y_t + β^y_t g̃^y_t,   θ_{t+1} = θ_t + β^θ_t g̃^θ_t.\n\nTo obtain unbiased estimates of g^y_t and g^θ_t, we shall update y (to obtain y_{t+1}) prior to computing g^θ_t at each iteration. We are now ready to introduce the Mean-Variance Policy Gradient algorithm (MVP, Algorithm 1). Before presenting our theoretical analysis, we first introduce the assumptions needed for these results.\n\nAlgorithm 1 Mean-Variance Policy Gradient (MVP)\n1: Input: Stepsizes {β^θ_t} and {β^y_t}, and number of iterations N\n   Option I: {β^θ_t} and {β^y_t} satisfy the Robbins-Monro condition\n   Option II: β^θ_t and β^y_t are set to be constants\n2: for episode t = 1, . . . , N do\n3:   Generate the initial state s_1 ∼ P0\n4:   while s_k ≠ s∗ do\n5:     Take the action a_k ∼ π_{θ_t}(a|s_k) and observe the reward r_k and next state s_{k+1}\n6:   end while\n7:   Update the parameters\n       R_t = Σ_{k=1}^{τ_t} r_k,   ω_t(θ_t) = Σ_{k=1}^{τ_t} ∇_θ ln π_{θ_t}(a_k|s_k)\n       y_{t+1} = y_t + β^y_t (2R_t + 1/λ − 2y_t)\n       θ_{t+1} = θ_t + β^θ_t (2y_{t+1}R_t − R_t²) ω_t(θ_t)\n8: end for\n9: Output x̄_N:\n   Option I: Set x̄_N = x_N = [θ_N, y_N]⊤\n   Option II: Set x̄_N = x_z = [θ_z, y_z]⊤, where z is uniformly drawn from {1, 2, . . . , N}\n\nAssumption 1 (Bounded Gradient and Variance). There exist constants G and σ such that ‖∇_y f̂_λ(x)‖₂ ≤ G, ‖∇_θ f̂_λ(x)‖₂ ≤ G, E[‖Δ^y_t‖₂²] ≤ σ², and E[‖Δ^θ_t‖₂²] ≤ σ² for any t and x, where ‖·‖₂ denotes the Euclidean norm, Δ^y_t := g̃^y_t − g^y_t, and Δ^θ_t := g̃^θ_t − g^θ_t.\n\nAssumption 1 is standard in nonconvex coordinate descent algorithms [Xu and Yin, 2015; Dang and Lan, 2015]. We also need the following assumption that is standard in the policy gradient literature.\n\nAssumption 2 (Ergodicity). The Markov chains induced by all the policies generated by the algorithm are ergodic, i.e., irreducible, aperiodic, and recurrent.\n\nIn practice, we can choose either Option I, with the result of the final iteration as output, or Option II, with the result of a randomly selected iteration as output. 
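To illustrate Algorithm 1, here is a minimal self-contained sketch (our own toy construction, not the paper's implementation): MVP with constant stepsizes on a one-step episodic problem with two actions that have the same mean reward but different variances; we simply read out the final softmax policy. Under the updates (6)-(7), the learned policy should come to prefer the low-variance action. All parameter values below are our own arbitrary choices.

```python
import math
import random

def mvp_two_armed_bandit(lam=1.0, beta_y=0.1, beta_theta=0.05,
                         episodes=5000, seed=0):
    """MVP sketch on a toy one-step episodic problem: action 0 pays 0.5
    deterministically, action 1 pays 1.0 w.p. 0.5 (same mean, higher
    variance). With lam > 0 the mean-variance objective favors action 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # softmax policy parameters
    y = 0.0             # dual variable introduced by the reformulation
    for _ in range(episodes):
        exps = [math.exp(v) for v in theta]
        total = sum(exps)
        pi = [e / total for e in exps]
        a = 0 if rng.random() < pi[0] else 1
        R = 0.5 if a == 0 else (1.0 if rng.random() < 0.5 else 0.0)
        # update y first (cf. Eq. (6)), then theta (cf. Eq. (7))
        y += beta_y * (2.0 * R + 1.0 / lam - 2.0 * y)
        omega = [(1.0 if i == a else 0.0) - p for i, p in enumerate(pi)]
        coeff = beta_theta * (2.0 * y * R - R * R)
        theta = [th + coeff * w for th, w in zip(theta, omega)]
    exps = [math.exp(v) for v in theta]
    total = sum(exps)
    return [e / total for e in exps]
```

Running `mvp_two_armed_bandit()` yields a policy that places most of its probability on the deterministic arm, i.e., the risk-averse choice, even though both arms have the same expected reward.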
In what follows in this section, we report an asymptotic convergence analysis of MVP with Option I, and in Section 4, we derive a finite-sample analysis of MVP with Option II.\n\nTheorem 1 (Asymptotic Convergence). Let {x_t = (θ_t, y_t)} be the sequence of outputs generated by Algorithm 1 with Option I. If {β^θ_t} and {β^y_t} are time-diminishing real positive sequences satisfying the Robbins-Monro condition, i.e., Σ_{t=1}^∞ β^θ_t = ∞, Σ_{t=1}^∞ (β^θ_t)² < ∞, Σ_{t=1}^∞ β^y_t = ∞, and Σ_{t=1}^∞ (β^y_t)² < ∞, then Algorithm 1 converges such that lim_{t→∞} E[‖∇f̂_λ(x_t)‖²] = 0.\n\nThe proof of Theorem 1 follows from the analysis in Xu and Yin [2013]. Due to space constraints, we report it in Appendix A.\n\nAlgorithm 1 is a special case of nonconvex block stochastic gradient (BSG) methods. To the best of our knowledge, no finite-sample analysis has been reported for this class of algorithms. Motivated by the recent papers by Nemirovski et al. [2009], Ghadimi and Lan [2013], Xu and Yin [2015], and Dang and Lan [2015], in Section 4, we provide a finite-sample analysis for general nonconvex block stochastic gradient methods and apply it to Algorithm 1 with Option II.\n\n4 Finite-Sample Analysis\n\nIn this section, we first present a finite-sample analysis for the general class of nonconvex BSG algorithms [Xu and Yin, 2013], for which there are no established results, in Section 4.1. We then use these results and prove a finite-sample bound for our MVP algorithm with Option II, which belongs to this class, in Section 4.2. 
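As a generic illustration of the algorithm class analyzed here (our own sketch; not the paper's algorithm, problem, or constants), the following cyclic BSG loop minimizes a noisy separable quadratic with the Robbins-Monro stepsizes β_t = 1/t, updating each block with a stochastic partial gradient evaluated at the latest iterate:

```python
import random

def cyclic_bsg(block_grads, x0, iters=3000, noise=0.1, seed=0):
    """Cyclic block stochastic gradient (BSG) sketch: each iteration sweeps
    over all blocks in a fixed order; block i is updated with a noisy partial
    gradient that already sees the blocks updated earlier in this sweep.
    The stepsize beta_t = 1/t satisfies the Robbins-Monro condition."""
    rng = random.Random(seed)
    x = list(x0)
    for t in range(1, iters + 1):
        beta_t = 1.0 / t
        for i in range(len(x)):
            noisy_grad = block_grads[i](x) + noise * rng.gauss(0.0, 1.0)
            x[i] -= beta_t * noisy_grad
    return x

# toy objective f(x) = sum_i (x_i - c_i)^2, with partial gradients 2*(x_i - c_i)
c = [1.0, -2.0, 3.0]
block_grads = [lambda x, i=i: 2.0 * (x[i] - c[i]) for i in range(len(c))]
```

With these stepsizes, `cyclic_bsg(block_grads, [0.0, 0.0, 0.0])` returns a point close to the minimizer (1, −2, 3) despite the gradient noise.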
Due to space constraints, we report the detailed proofs in Appendix A.\n\n4.1 Finite-Sample Analysis of Nonconvex BSG Algorithms\n\nIn this section, we provide a finite-sample analysis of the general nonconvex block stochastic gradient (BSG) method, where the problem formulation is given by\n\nmin_{x∈R^n} f(x) = E_ξ[F(x, ξ)],   (8)\n\nwhere ξ is a random vector, and F(·, ξ) : R^n → R is continuously differentiable and possibly nonconvex for every ξ. The variable x ∈ R^n can be partitioned into b disjoint blocks as x = {x_1, x_2, . . . , x_b}, where x_i ∈ R^{n_i} denotes the i-th block of variables and Σ_{i=1}^b n_i = n. For simplicity, we use x_{<i} to denote the blocks preceding block i, and x_{≤i}, x_{>i}, and x_{≥i} are defined correspondingly. We also use ∇_{x_i} to denote ∂/∂x_i, the partial gradient with respect to x_i. Ξ_t is the sample set generated at the t-th iteration, and Ξ_[t] = (Ξ_1, . . . , Ξ_t) denotes the history of sample sets from the first through the t-th iteration. {β^i_t : i = 1, · · · , b}_{t=1}^∞ are the stepsizes; let β^max_t = max_i β^i_t and β^min_t = min_i β^i_t. Similar to Algorithm 1, the BSG algorithm cyclically updates all blocks of variables in each iteration; the detailed BSG method is presented in Appendix B.\n\nWithout loss of generality, we assume a fixed update order in the BSG algorithm. Let Ξ_t = {ξ_{t,1}, . . . , ξ_{t,m_t}} be the samples in the t-th iteration, with size m_t ≥ 1. Therefore, the stochastic partial gradient for block i is computed at the latest iterate, i.e., g̃^i_t = (1/m_t) Σ_{l=1}^{m_t} ∇_{x_i} F(x^{t+1}_{<i}, x^t_{≥i}; ξ_{t,l}), using the already-updated blocks x^{t+1}_{<i} and the not-yet-updated blocks x^t_{≥i}. Similar to Section 3, we define g^i_t = ∇_{x_i} f(x^{t+1}_{<i}, x^t_{≥i}) and Δ^i_t := g̃^i_t − g^i_t. We also assume that the gradient of f is block-wise Lipschitz, i.e., there exists a constant L > 0 such that ‖∇_{x_i} f(x) − ∇_{x_i} f(y)‖₂ ≤ L‖x − y‖₂, ∀i ∈ {1, . . . , b} and ∀x, y ∈ R^n. Each block gradient of f is also bounded, i.e., there exists a positive constant G such that ‖∇_{x_i} f(x)‖₂ ≤ G, for any i ∈ {1, . . . 
, b} and any x ∈ R^n. We also need Assumption 1 to hold for all block variables, i.e., E[‖Δ^i_t‖₂²] ≤ σ² for all i and t.\n\nLemma 2. For any i and t, there exists a positive constant A such that E[‖g^i_t − ∇_{x_i} f(x^t)‖₂] ≤ A β^max_t.\n\nTheorem 2. Let {x^t} be generated by the cyclic BSG method applied to problem (8), and let x̄_N = x^z, where z is uniformly drawn from {1, 2, . . . , N}. If the stepsizes satisfy β^min_t > (L/2)(β^max_t)² for t = 1, · · · , N, then we have\n\nE[‖∇f(x̄_N)‖₂²] ≤ ( f(x^1) − f∗ + Σ_{t=1}^N (β^max_t)² C_t ) / ( Σ_{t=1}^N (β^min_t − (L/2)(β^max_t)²) ),\n\nwhere f∗ = min_x f(x) and C_t is a constant depending only on L, G, σ, and A, whose explicit form is given in Appendix A.\n\n4.2 Finite-Sample Analysis of MVP\n\nApplying these results to Algorithm 1 with Option II (constant stepsizes) yields the following bound:\n\nE[‖∇f̂_λ(x̄_N)‖₂²] ≤ ( f̂∗_λ − f̂_λ(x_1) + N(β^max_t)² C ) / ( N(β^min_t − (L/2)(β^max_t)²) ),   (12)\n\nwhere f̂∗_λ = max_x f̂_λ(x), and\n\nC = (1 − (L/2)β^max_t)(L²β^max_t(G² + σ²) + L(2G² + σ²)) + AG + Lσ² + 2L(1 + Lβ^max_t)(3σ² + 2G²).\n\nProof Sketch. The proof follows these major steps.\n\n(I). First, we bound each block coordinate gradient, i.e., E[‖g^θ_t‖₂²] and E[‖g^y_t‖₂²], via\n\n(β^min_t − (L/2)(β^max_t)²) E[‖g^θ_t‖₂² + ‖g^y_t‖₂²] ≤ E[f̂_λ(x_{t+1})] − E[f̂_λ(x_t)] + (β^max_t)² AG + L(β^max_t)² σ² + 2Lβ^max_t(β^max_t + L(β^max_t)²)(3σ² + 2G²).\n\nSumming up over t, we have\n\nΣ_{t=1}^N (β^min_t − (L/2)(β^max_t)²) E[‖g^θ_t‖₂² + ‖g^y_t‖₂²] ≤ f̂∗_λ − f̂_λ(x_1) + Σ_{t=1}^N [ (β^max_t)² AG + L(β^max_t)² σ² + 2Lβ^max_t(β^max_t + L(β^max_t)²)(3σ² + 2G²) ].\n\n(II). Next, we bound E[‖∇f̂_λ(x_t)‖₂²] using E[‖g^θ_t‖₂² + ‖g^y_t‖₂²], which is proven to satisfy\n\nE[‖∇f̂_λ(x_t)‖₂²] ≤ L²(β^max_t)²(G² + σ²) + Lβ^max_t(2G² + σ²) + E[‖g^θ_t‖₂² + ‖g^y_t‖₂²].\n\n(III). Finally, combining (I) and (II) and rearranging the terms, Eq. (12) is obtained as a special case of Theorem 2, which completes the proof.\n\n(a) Portfolio management domain   (b) American-style option domain   (c) Optimal stopping domain\nFigure 1: Empirical results of the distributions of the return (cumulative rewards) random variable. Note that markers only indicate different methods.\n\n5 Experimental Study\n\nIn this section, we evaluate our MVP algorithm with Option I in three risk-sensitive domains: portfolio management [Tamar et al., 2012], the American-style option [Tamar et al., 2014], and optimal stopping [Chow and Ghavamzadeh, 2014; Chow et al., 2018]. The baseline algorithms are the vanilla policy gradient (PG), the mean-variance policy gradient of Tamar et al. [2012], stochastic gradient ascent (SGA) applied to our optimization problem (5), and the randomized coordinate ascent policy gradient (RCPG), i.e., the SBMD-based version of our algorithm. Details of SGA and RCPG can be found in Appendix C. For each algorithm, we optimize its Lagrangian parameter λ by grid search and report the mean and variance of its return random variable as a Gaussian.² Since the algorithms presented in the paper (MVP and RCPG) are policy gradient methods, we only compare them with Monte-Carlo based policy gradient algorithms and do not use any actor-critic algorithms, such as those in Prashanth and Ghavamzadeh [2013] and TRPO [Schulman et al., 2015], in the experiments.\n\n5.1 Portfolio Management\n\nThe portfolio domain [Tamar et al., 2012] is composed of liquid and non-liquid assets. 
A liquid asset has a fixed interest rate r_l and can be sold at any time-step k ≤ τ. A non-liquid asset can be sold only after a fixed period of W time-steps, with a time-dependent interest rate r_nl(k) that can take either a low value r_nl^low or a high value r_nl^high, where the transition between the two follows a switching probability p_switch. The non-liquid asset also suffers a default risk (i.e., not being paid) with probability p_risk. All investments are in liquid assets at the initial time-step k = 0. At the k-th step, the state is denoted by x(k) ∈ R^{W+2}, where x_1 ∈ [0, 1] is the portion of the investment in liquid assets, x_2, · · · , x_{W+1} ∈ [0, 1] are the portions in non-liquid assets with time to maturity of 1, · · · , W time-steps, respectively, and x_{W+2}(k) = r_nl(k) − E[r_nl(k)]. The investor can choose to invest a fixed portion η (0 < η < 1) of the total available cash in the non-liquid asset or do nothing. More details about this domain can be found in Tamar et al. [2012]. Figure 1(a) shows the results of the algorithms. PG has a large variance, and the method of Tamar et al. [2012] has the lowest mean return. The results indicate that MVP yields a higher mean return with less variance compared to the competing algorithms.\n\n5.2 American-style Option\n\nAn American-style option [Tamar et al., 2014] is a contract that gives the buyer the right to buy or sell the asset at a strike price W at or before the maturity time τ. 
The initial price of the option is x_0, and the buyer has bought a put option with the strike price W_put < x_0 and a call option with the strike price W_call > x_0. At the k-th step (k ≤ τ), the state is {x_k, k}, where x_k is the current price of the option. The action a_k is either executing the option or holding it. x_{k+1} is f_u x_k w.p. p and f_d x_k w.p. 1 − p, where f_u and f_d are constants. The reward is 0 unless an option is executed, and the reward for executing an option is r_k = max(0, W_put − x_k) + max(0, x_k − W_call). More details about this domain can be found in Tamar et al. [2014]. Figure 1(b) shows the performance of the algorithms. The results suggest that MVP yields a higher mean return with less variance compared to the other algorithms.\n\n²Note that the return random variables are not necessarily Gaussian; we only use a Gaussian for presentation purposes.\n\n5.3 Optimal Stopping\n\nThe optimal stopping problem [Chow and Ghavamzadeh, 2014; Chow et al., 2018] is a continuous-state domain. At the k-th time-step (k ≤ τ, where τ is the stopping time), the state is {x_k, k}, where x_k is the cost. The buyer decides either to accept the present cost or to wait. 
If the buyer accepts, or when k = T, the system reaches a terminal state and the cost xk is received; otherwise, the buyer receives the cost ph and the new state is {xk+1, k + 1}, where xk+1 is fu·xk with probability p and fd·xk with probability 1 − p (fu > 1 and fd < 1 are constants). More details about this domain can be found in Chow and Ghavamzadeh [2014]. Figure 1(c) shows the performance of the algorithms. The results indicate that MVP is able to yield much less variance without affecting its mean return. We also summarize the performance of these algorithms on all three risk-sensitive domains in Table 1, where Std is short for standard deviation.

            Portfolio Management     American-style Option    Optimal Stopping
            Mean      Std            Mean      Std            Mean       Std
MVP         29.754    0.325          0.2478    0.00456        -1.4767    0.00482
PG          29.170    1.177          0.2477    0.00754        -1.4769    0.00922
Tamar       28.575    0.857          0.2240    0.00415        -2.8553    0.00694
SGA         29.679    0.658          0.2470    0.00583        -1.4805    0.00679
RCPG        29.340    0.789          0.2447    0.00721        -1.4872    0.00819

Table 1: Performance comparison among the algorithms (Std = standard deviation).

6 Conclusion

This paper provides a risk-sensitive policy search algorithm with provable sample complexity analysis for maximizing the mean-variance objective function. To this end, the objective function is reformulated based on the Legendre-Fenchel duality, and a novel stochastic block coordinate ascent algorithm is proposed with in-depth analysis. There are many interesting future directions for this research topic. Besides stochastic policy gradient, deterministic policy gradient [Silver et al., 2014] has shown great potential in problems with large action spaces, and it is interesting to design a risk-sensitive deterministic policy gradient method.
Secondly, other reformulations of the mean-variance objective function are also worth exploring, which would lead to new families of algorithms. Thirdly, distributional RL [Bellemare et al., 2017] is strongly related to risk-sensitive policy search, and it is interesting to investigate the connections between risk-sensitive policy gradient methods and distributional RL. Last but not least, it is interesting to test the performance of the proposed algorithms together with other risk-sensitive RL algorithms on highly complex risk-sensitive tasks, such as autonomous driving and other challenging domains.

Acknowledgments

Bo Liu, Daoming Lyu, and Daesub Yoon were partially supported by a grant (18TLRP-B131486-02) from the Transportation and Logistics R&D Program funded by the Ministry of Land, Infrastructure and Transport of the Korean government. Yangyang Xu was partially supported by NSF grant DMS-1719549.

References

Beck, A. and Teboulle, M. (2003). Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31:167–175.

Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. In International Conference on Machine Learning.

Bertsekas, D. P. (1999). Nonlinear Programming. Athena Scientific, Belmont.

Bhatnagar, S., Prasad, H., and Prashanth, L. (2013). Stochastic Recursive Algorithms for Optimization, volume 434. Springer.

Borkar, V. (2002). Q-learning for risk-sensitive control. Mathematics of Operations Research, 27(2):294–311.

Borkar, V. (2008). Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.

Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Chow, Y. and Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs.
In Advances in Neural Information Processing Systems, pages 3509–3517.

Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2018). Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research.

Dai, B., He, N., Pan, Y., Boots, B., and Song, L. (2017). Learning from conditional distributions via dual embeddings. In The 20th International Conference on Artificial Intelligence and Statistics.

Dalal, G., Thoppe, G., Szörényi, B., and Mannor, S. (2018). Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. In Proceedings of the 31st Conference on Learning Theory, pages 1199–1233. PMLR.

Dang, C. D. and Lan, G. (2015). Stochastic block mirror descent methods for nonsmooth and stochastic optimization. SIAM Journal on Optimization, 25(2):856–881.

Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. (2017). Stochastic variance reduction methods for policy evaluation. arXiv preprint arXiv:1702.07944.

Ghadimi, S. and Lan, G. (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368.

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer.

Lai, T., Xing, H., and Chen, Z. (2011). Mean-variance portfolio optimization when means and covariances are unknown. The Annals of Applied Statistics, pages 798–823.

Li, D. and Ng, W. (2000). Optimal dynamic portfolio selection: Multiperiod mean-variance formulation. Mathematical Finance, 10(3):387–406.

Liu, B., Gemp, I., Ghavamzadeh, M., Liu, J., Mahadevan, S., and Petrik, M. (2018). Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity. Journal of Artificial Intelligence Research.

Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S., and Petrik, M. (2015).
Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.

Luo, Z. and Tseng, P. (1992). On the convergence of the coordinate descent method for convex differentiable minimization. Journal of Optimization Theory and Applications, 72(1):7–35.

Mairal, J. (2013). Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, pages 2283–2291.

Mannor, S. and Tsitsiklis, J. (2011). Mean-variance optimization in Markov decision processes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).

Markowitz, H. M., Todd, G. P., and Sharpe, W. F. (2000). Mean-Variance Analysis in Portfolio Choice and Capital Markets, volume 66. John Wiley & Sons.

Maurer, M., Gerdes, C., Lenz, B., and Winner, H. (2016). Autonomous Driving: Technical, Legal and Social Aspects. Springer.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.

Nesterov, Y. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362.

Parker, D. (2009). Managing risk in healthcare: understanding your safety culture using the Manchester patient safety framework. Journal of Nursing Management, 17(2):218–222.

Prashanth, L. A. and Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, pages 252–260.

Prashanth, L. A. and Ghavamzadeh, M. (2016). Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning Journal, 105(3):367–417.

Puterman, M. L. (1994). Markov Decision Processes. Wiley Interscience, New York, USA.

Razaviyayn, M., Hong, M., and Luo, Z.
(2013). A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In ICML, pages 387–395.

Sobel, M. J. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802.

Sutton, R. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Tamar, A., Castro, D., and Mannor, S. (2012). Policy gradients with variance related risk criteria. In ICML, pages 935–942.

Tamar, A., Chow, Y., Ghavamzadeh, M., and Mannor, S. (2015a). Policy gradient for coherent risk measures. In NIPS, pages 1468–1476.

Tamar, A., Glassner, Y., and Mannor, S. (2015b). Optimizing the CVaR via sampling. In AAAI Conference on Artificial Intelligence.

Tamar, A., Mannor, S., and Xu, H. (2014). Scaling up robust MDPs using function approximation. In International Conference on Machine Learning, pages 181–189.

Tseng, P. (2001). Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494.

Wang, M., Fang, E. X., and Liu, H. (2017). Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449.

Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Wright, S. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1):3–34.

Xu, Y. and Yin, W. (2013).
A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789.

Xu, Y. and Yin, W. (2015). Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization, 25(3):1686–1716.