{"title": "Provably Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost", "book": "Advances in Neural Information Processing Systems", "page_first": 8353, "page_last": 8365, "abstract": "Despite the empirical success of the actor-critic algorithm, its theoretical understanding lags behind. In a broader context, actor-critic can be viewed as an online alternating update algorithm for bilevel optimization, whose convergence is known to be fragile. To understand the instability of actor-critic, we focus on its application to linear quadratic regulators, a simple yet fundamental setting of reinforcement learning. We establish a nonasymptotic convergence analysis of actor- critic in this setting. In particular, we prove that actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence. Our analysis may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.", "full_text": "Provably Global Convergence of Actor-Critic: A Case\n\nfor Linear Quadratic Regulator with Ergodic Cost\n\nZhuoran Yang\n\nPrinceton University\nzy6@princeton.edu\n\nYongxin Chen\n\nGeorgia Institute of Technology\n\nyongchen@gatech.edu\n\nMingyi Hong\n\nUniversity of Minnesota\n\nmhong@umn.edu\n\nZhaoran Wang\n\nNorthwestern University\n\nzhaoran.wang@northwestern.edu\n\nAbstract\n\nDespite the empirical success of the actor-critic algorithm, its theoretical under-\nstanding lags behind. In a broader context, actor-critic can be viewed as an online\nalternating update algorithm for bilevel optimization, whose convergence is known\nto be fragile. To understand the instability of actor-critic, we focus on its application\nto linear quadratic regulators, a simple yet fundamental setting of reinforcement\nlearning. 
We establish a nonasymptotic convergence analysis of actor-critic in this setting. In particular, we prove that actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence. Our analysis may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.

1 Introduction

The actor-critic algorithm [36] is one of the most used algorithms in reinforcement learning [46]. Compared with the classical policy gradient algorithm [73], actor-critic tracks the action-value function (critic) in policy gradient in an online manner, and alternately updates the policy (actor) and the critic. On the one hand, the online update of the critic significantly reduces the variance of policy gradient and hence leads to faster convergence. On the other hand, it also introduces algorithmic instability, which is often observed in practice [33] and parallels the notoriously unstable training of generative adversarial networks [50]. Such instability of actor-critic originates from several intertwining challenges, including (i) function approximation of actor and critic, (ii) improper choice of stepsizes, (iii) the noise arising from stochastic approximation, (iv) the asynchrony between actor and critic, and (v) possibly off-policy data used in the update of the critic.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

As a result, the convergence of actor-critic remains much less well understood than that of policy gradient, whose own convergence analysis itself remains open. Consequently, the practical use of actor-critic often lacks theoretical guidance.

In this paper, we aim to theoretically understand the algorithmic instability of actor-critic. In particular, under a bilevel optimization framework, we establish the global rate of convergence and sample complexity of actor-critic for linear quadratic regulators (LQR) with ergodic cost, a simple yet fundamental setting of reinforcement learning [54], which captures all of the above challenges. Compared with the classical two-timescale analysis of actor-critic [11], which is asymptotic in nature and requires a finite action space, our analysis is fully nonasymptotic and allows for continuous action spaces. Moreover, beyond the convergence to a stable equilibrium obtained by the classical two-timescale stochastic approximation via ordinary differential equations, we for the first time establish a linear rate of convergence to a globally optimal pair of actor and critic. In addition, we characterize the required sample complexity. As a technical ingredient and byproduct, we for the first time establish the sublinear rate of convergence of the gradient temporal difference algorithm [66, 67] under ergodic cost and dependent data, which is of independent interest. Furthermore, although we only focus on the setting of LQR, our theoretical analysis framework can be readily extended to general RL problems with other policy optimization methods for the actor (e.g.,
trust-region policy optimization (TRPO) [57] and proximal policy optimization (PPO) [58]) and other policy evaluation methods for the critic, such as TD(0) [65].

Our work adds to two lines of work in machine learning, stochastic analysis, and optimization:

(i) Actor-critic falls into the more general paradigm of bilevel optimization [42, 24, 4]. Bilevel optimization is defined by two nested optimization problems, where the upper-level optimization problem relies on the output of the lower-level one. As a special case of bilevel optimization, minimax optimization is prevalent in machine learning. Recent instances include training generative adversarial networks [27], (distributionally) robust learning [63], and imitation learning [31, 15]. Such instances of minimax optimization remain challenging as they lack convexity-concavity in general [25, 56, 16, 53, 38, 17, 18, 19, 41]. The more general paradigm of bilevel optimization is even more challenging, as there does not exist a unified objective function for simultaneous minimization and maximization. In particular, actor-critic couples the nonconvex optimization of the actor (policy gradient) as its upper level and the convex-concave minimax optimization of the critic (gradient temporal difference) as its lower level, each of which is challenging to analyze by itself. Most existing convergence analysis of bilevel optimization is based on two-timescale analysis [10]. However, as two-timescale analysis abstracts away most technicalities via the lens of ordinary differential equations, which is asymptotic in nature, it often lacks the resolution to capture the nonasymptotic rate of convergence and sample complexity, which are obtained via our analysis.

(ii) As a proxy for analyzing more general reinforcement learning settings, LQR is studied in a recent line of works [12, 54, 26, 69, 70, 22, 23, 61, 21, 30].
In particular, a part of our analysis is based on the breakthrough of [26], which gives the global convergence of the population-version policy gradient algorithm for LQR, together with a finite-sample version that relies on zeroth-order estimation of the policy gradient from cumulative rewards or costs. However, such zeroth-order estimation of the policy gradient often suffers from large variance, as it involves the randomness of an entire trajectory. In contrast, actor-critic updates the critic in an online manner, which reduces this variance but also introduces instability and complicates the convergence analysis. In particular, as the update of the critic interleaves with the update of the actor, the policy gradient used to update the actor is biased due to the inexactness of the critic. Meanwhile, the update of the critic has a "moving target", as it attempts to evaluate an actor that evolves along the iterations. A key to our analysis is to handle such asynchrony between actor and critic, which is a ubiquitous challenge in bilevel optimization. We hope our analysis may serve as the first step towards analyzing actor-critic in more general reinforcement learning settings.

Notation. For any integer n > 0, we write [n] = {1, ..., n}. For any symmetric matrix X, let svec(X) denote the vectorization of the upper triangular part of X with the off-diagonal entries weighted by √2. Hence, for any symmetric matrices X and Y, we have tr(XY) = ⟨X, Y⟩ = svec(X)^T svec(Y). Meanwhile, let smat(·) be the inverse operation of svec(·), which maps a vector back to a symmetric matrix. Besides, we denote by A ⊗_s B the symmetric Kronecker product of A and B. We use ‖v‖₂ to denote the ℓ₂-norm of a vector v. Finally, for a matrix A, we use ‖A‖, ‖A‖_fro, and ρ(A) to denote its operator norm, Frobenius norm, and spectral radius, respectively.

2 Background

In the following, we introduce the background of actor-critic and LQR.
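The svec/smat pair and the trace identity above are easy to verify numerically. Below is a minimal sketch in NumPy; the function names and the √2 weighting convention follow the notation above, but the implementation itself is ours:

```python
import numpy as np

def svec(X):
    """Vectorize the upper triangle of a symmetric matrix X, weighting the
    off-diagonal entries by sqrt(2), so that tr(XY) = <X, Y> = svec(X) @ svec(Y)."""
    i, j = np.triu_indices(X.shape[0])
    return np.where(i == j, 1.0, np.sqrt(2.0)) * X[i, j]

def smat(v):
    """Inverse of svec: map a vector of length d(d+1)/2 back to a symmetric d x d matrix."""
    d = int((np.sqrt(8 * len(v) + 1) - 1) / 2)
    i, j = np.triu_indices(d)
    X = np.zeros((d, d))
    X[i, j] = np.where(i == j, 1.0, 1.0 / np.sqrt(2.0)) * v
    return X + np.triu(X, 1).T
```

With this weighting, the Euclidean inner product of the vectorizations reproduces the trace inner product exactly, which is what makes the quadratic Q-function representable as a linear function of svec features later on.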
In particular, we show that actor-critic can be cast as a first-order online alternating update algorithm for a bilevel optimization problem [42, 24, 4].

2.1 Actor-Critic Algorithm

We consider a Markov decision process defined by (X, U, P, c, D₀). Here X and U are the state and action spaces, respectively, P: X × U → P(X) is the Markov transition kernel, c: X × U → R is the cost function, and D₀ ∈ P(X) is the distribution of the initial state x₀. For any t ≥ 0, at the t-th time step, the agent takes action u_t ∈ U at state x_t ∈ X, which incurs a cost c(x_t, u_t) and moves the environment into a new state x_{t+1} ∼ P(·| x_t, u_t). A policy specifies how the action u_t is taken at a given state x_t. Specifically, in order to handle infinite state and action spaces, we focus on a parametrized policy class {π_ω: X → P(U), ω ∈ Ω}, where ω is the parameter of policy π_ω, and the agent takes action u ∼ π_ω(·| x) at a given state x ∈ X. The agent aims to find a policy that minimizes the infinite-horizon time-average cost, that is,

    minimize_{ω∈Ω} J(ω) = lim_{T→∞} E[(1/T)·Σ_{t=0}^{T} c(x_t, u_t)], where x₀ ∼ D₀ and u_t ∼ π_ω(·| x_t) for all t ≥ 0.   (2.1)

Moreover, for policy π_ω, we define the (advantage) action-value and state-value functions respectively as

    Q_ω(x, u) = Σ_{t≥0} { E_ω[c(x_t, u_t) | x₀ = x, u₀ = u] − J(ω) },  V_ω(x) = E_{u∼π_ω(·|x)}[Q_ω(x, u)],   (2.2)

where we use E_ω to indicate that the state-action pairs {(x_t, u_t)}_{t≥1} are obtained from policy π_ω. Actor-critic is based on the idea of solving the minimization problem in (2.1) via first-order optimization, which uses an estimator of ∇_ω J(ω). In detail, by the policy gradient theorem [68, 6, 36], we have

    ∇_ω J(ω) = E_{x∼ρ_ω, u∼π_ω(·|x)}[∇_ω log π_ω(u| x) · Q_ω(x, u)],   (2.3)

where ρ_ω
∈ P(X) is the stationary distribution induced by π_ω. Based on (2.3), actor-critic [36] consists of two steps: (i) a policy evaluation step that estimates the action-value function Q_ω (critic) via temporal difference learning [20], where Q_ω is estimated using a parametrized function class {Q_θ: θ ∈ Θ}, and (ii) a policy improvement step that updates the parameter ω of policy π_ω (actor) using a stochastic version of the policy gradient in (2.3), where Q_ω is replaced by the corresponding estimator Q_θ.

As shown in [74], actor-critic can be cast as solving a bilevel optimization problem, which takes the form

    minimize_{ω∈Ω} E_{x∼ρ_ω, u∼π_ω(·|x)}[Q_θ(x, u)],   (2.4)
    subject to (θ, J) = argmin_{θ∈Θ, J∈R} E_{x∼ρ_ω, u∼π_ω(·|x)}{ [Q_θ(x, u) + J − c(x, u) − (B_ω Q_θ)(x, u)]² },   (2.5)

where B_ω is an operator that depends on π_ω. In this problem, the actor and critic correspond to the upper-level and lower-level variables, respectively. Under this framework, the policy update can be viewed as a stochastic gradient step for the upper-level problem in (2.4). The objective in (2.5) is usually the mean-squared Bellman error or mean-squared projected Bellman error [20]. Moreover, when B_ω is the Bellman evaluation operator associated with π_ω and we solve the lower-level problem in (2.5) via stochastic semi-gradient descent, we obtain the TD(0) update for policy evaluation [65]. Similarly, when B_ω is the projected Bellman evaluation operator associated with π_ω, solving the lower-level problem naturally recovers the GTD2 and TDC algorithms for policy evaluation [8]. Therefore, the actor-critic algorithm is a first-order online algorithm for the bilevel optimization problem in (2.4) and (2.5). We remark that bilevel optimization contains a family of extremely challenging problems.
Even when the objective functions are linear, bilevel programming is NP-hard [29]. In practice, various heuristic algorithms are applied to solve such problems approximately [62].

2.2 Linear Quadratic Regulator

As the simplest optimal control problem, the linear quadratic regulator serves as a perfect baseline for examining the performance of reinforcement learning methods. Viewing LQR through the lens of an MDP, the state and action spaces are X = R^d and U = R^k, respectively. The state transition dynamics and cost function are specified by

    x_{t+1} = A x_t + B u_t + ε_t,  c(x, u) = x^T Q x + u^T R u,   (2.6)

where ε_t ∼ N(0, Ψ) is random noise that is i.i.d. across t ≥ 0, and A, B, Q, R, Ψ are matrices of proper dimensions with Q, R, Ψ ≻ 0. Moreover, we assume that the dimensions d and k are fixed throughout this paper. For the problem of minimizing the infinite-horizon time-average cost limsup_{T→∞} T^{-1} Σ_{t=0}^{T} E[c(x_t, u_t)] with x₀ ∼ D₀, it is known that the optimal actions are linear in the corresponding states [76, 3, 7]. Specifically, the optimal actions {u*_t}_{t≥0} satisfy u*_t = −K* x_t for all t ≥ 0, where K* ∈ R^{k×d} can be written as K* = (R + B^T P* B)^{-1} B^T P* A, with P* being the solution to the discrete algebraic Riccati equation

    P* = Q + A^T P* A − A^T P* B (R + B^T P* B)^{-1} B^T P* A.   (2.7)

In the optimal control literature, it is common to solve LQR by first estimating the matrices A, B, Q, R and then solving the Riccati equation in (2.7) with these matrices replaced by their estimates. Such an approach is known as model-based, as it requires estimating the model parameters, and the performance of the planning step in (2.7) hinges on how well the true model is estimated.
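For concreteness, the planning step in (2.7) can be sketched in a few lines: iterate the Riccati map to a fixed point P* and read off the gain K* = (R + B^T P* B)^{-1} B^T P* A. The following NumPy sketch is our own illustration of this model-based route (the iteration count is an arbitrary choice), not an algorithm from the paper:

```python
import numpy as np

def dare_fixed_point(A, B, Q, R, iters=500):
    """Iterate P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA, i.e. the discrete
    algebraic Riccati map in (2.7), starting from P = Q."""
    P = Q.copy()
    for _ in range(iters):
        APB = A.T @ P @ B
        P = Q + A.T @ P @ A - APB @ np.linalg.solve(R + B.T @ P @ B, APB.T)
        P = (P + P.T) / 2  # symmetrize against numerical drift
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain, u* = -K* x
    return P, K
```

Under stabilizability of (A, B), this value-iteration-style recursion converges to the positive definite solution P*, and the closed-loop matrix A − BK* is stable.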
See, e.g., [21, 70] for theoretical guarantees for model-based methods. In contrast, from a purely data-driven perspective, the framework of model-free reinforcement learning offers a general treatment of optimal control problems without prior knowledge of the model. Thanks to its simple structure, LQR enables us to assess the performance of reinforcement learning algorithms from a theoretical perspective. Specifically, it has been shown that policy iteration [12, 14, 45], adaptive dynamic programming [52], and policy gradient methods [26, 44, 70] are all able to obtain the optimal policy of LQR. See also [54] for a thorough review of reinforcement learning methods in the setting of LQR.

3 Actor-Critic Algorithm for LQR

In this section, we establish the actor-critic algorithm for the LQR problem introduced in §2.2. Recall that the optimal policy of LQR is a linear function of the state. Throughout the rest of this paper, we focus on the family of linear-Gaussian policies

    π_K(·| x) = N(−Kx, σ²·I_k), K ∈ R^{k×d},   (3.1)

where σ > 0 is a fixed constant. That is, for any t ≥ 0, at state x_t, we can write the action u_t as u_t = −K x_t + σ·η_t, where η_t ∼ N(0, I_k). We note that if σ = 0, then the optimal policy u = −K* x belongs to our policy class. Here, instead of focusing on deterministic policies, we adopt Gaussian policies to encourage exploration. For policy π_K, the corresponding time-average cost J(K), state-value function V_K, and action-value function Q_K are specified as in (2.1) and (2.2), respectively.

In the following, we first establish the policy gradient and value functions for the ergodic LQR in §3.1.
Then, in §3.2, we present the on-policy natural actor-critic algorithm, which is further extended to the off-policy setting in §B.

3.1 Policy Gradient Theorem for Ergodic LQR

For any policy π_K, by (2.6), the state dynamics are given by the linear dynamical system

    x_{t+1} = (A − BK) x_t + ε̄_t, where ε̄_t = ε_t + σ·B η_t ∼ N(0, Ψ_σ).   (3.2)

Here we define Ψ_σ := Ψ + σ²·BB^T in (3.2) to simplify the notation. It is known that, when ρ(A − BK) < 1, the Markov chain in (3.2) has stationary distribution N(0, Σ_K), denoted by ν_K hereafter, where Σ_K is the unique positive definite solution to the Lyapunov equation

    Σ_K = Ψ_σ + (A − BK) Σ_K (A − BK)^T.   (3.3)

In the following proposition, we establish J(K), the value functions, and the gradient ∇_K J(K).

Proposition 3.1. For any K ∈ R^{k×d} such that ρ(A − BK) < 1, let P_K be the unique positive definite solution to the Bellman equation

    P_K = (Q + K^T R K) + (A − BK)^T P_K (A − BK).   (3.4)

In the setting of LQR, for policy π_K, both the state- and action-value functions are quadratic. Specifically, we have

    V_K(x) = x^T P_K x − tr(P_K Σ_K),   (3.5)
    Q_K(x, u) = x^T Θ¹¹_K x + x^T Θ¹²_K u + u^T Θ²¹_K x + u^T Θ²²_K u − σ²·tr(R + P_K BB^T) − tr(P_K Σ_K),   (3.6)

where Σ_K is specified in (3.3), and we define the matrix Θ_K by

    Θ_K = ( Θ¹¹_K  Θ¹²_K ; Θ²¹_K  Θ²²_K ) = ( Q + A^T P_K A   A^T P_K B ; B^T P_K A   R + B^T P_K B ).   (3.7)

Moreover, the time-average cost J(K) and its gradient are given by

    J(K) = tr[(Q + K^T R K) Σ_K] + σ²·tr(R) = tr(P_K Ψ_σ) + σ²·tr(R),   (3.8)
    ∇_K J(K) = 2[(R + B^T P_K B) K − B^T P_K A] Σ_K = 2 E_K Σ_K,   (3.9)

where we define E_K := (R + B^T P_K B) K − B^T P_K A.

Proof.
See §D.1 for a detailed proof.

To see the connection between (3.9) and the policy gradient theorem in (2.3), note that by direct computation we have

    ∇_K log π_K(u| x) = ∇_K[−(2σ²)^{-1}·‖u + Kx‖²₂] = −σ^{-2}·(u + Kx) x^T.   (3.10)

Thus, combining (3.6), (3.10), and the fact that u = −Kx + σ·η under π_K, the right-hand side of (2.3) can be written as

    E_{x∼ν_K, u∼π_K}[∇_K log π_K(u| x)·Q_K(x, u)] = −σ^{-2}·E_{x∼ν_K, η∼N(0, I_k)}[σ·η x^T·Q_K(x, −Kx + σ·η)].

Recall that for η ∼ N(0, I_k), Stein's identity [64], E[η·f(η)] = E[∇f(η)], holds for all differentiable functions f: R^k → R, which implies that

    ∇_K J(K) = −E_{x∼ν_K, η∼N(0, I_k)}[(∇_u Q_K)(x, −Kx + σ·η)·x^T]
             = −2·E_{x∼ν_K, η∼N(0, I_k)}{[(R + B^T P_K B)(−Kx + σ·η) + B^T P_K A x]·x^T}
             = 2[(R + B^T P_K B) K − B^T P_K A] Σ_K = 2 E_K Σ_K.   (3.11)

Thus, (3.9) is exactly the policy gradient theorem (2.3) in the setting of LQR. Moreover, it is worth noting that Proposition 3.1 also holds for σ = 0. Thus, setting σ = 0 in (3.11), we obtain

    ∇_K J(K) = −E_{x∼ν_K}[(∇_u Q_K)(x, −Kx)·x^T] = E_{x∼ν_K}[(∇_u Q_K)(x, u)|_{u=π_K(x)}·∇_K π_K(x)],

where π_K(x) = −Kx. Thus we obtain the deterministic policy gradient theorem [60] for LQR. Although the optimal policy for LQR is deterministic, due to the lack of exploration, behaving according to a deterministic policy may lead to suboptimal solutions.
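Proposition 3.1 is straightforward to sanity-check numerically: solve (3.3) and (3.4) by fixed-point iteration, form J(K) and 2 E_K Σ_K, and compare the latter against a finite-difference gradient of J. The NumPy sketch below uses an arbitrary small stable system of our own choosing:

```python
import numpy as np

def dlyap(M, S, iters=2000):
    """Fixed-point iteration for X = S + M X M' (converges when rho(M) < 1)."""
    X = S.copy()
    for _ in range(iters):
        X = S + M @ X @ M.T
    return X

def lqr_cost_and_grad(A, B, Q, R, Psi, K, sigma):
    """Ergodic cost J(K) of the policy u = -Kx + sigma*eta and the policy
    gradient 2 E_K Sigma_K from Proposition 3.1."""
    M = A - B @ K
    Psi_sigma = Psi + sigma**2 * (B @ B.T)     # closed-loop noise covariance
    Sigma_K = dlyap(M, Psi_sigma)              # Lyapunov equation (3.3)
    P_K = dlyap(M.T, Q + K.T @ R @ K)          # Bellman equation (3.4)
    J = np.trace(P_K @ Psi_sigma) + sigma**2 * np.trace(R)   # (3.8)
    E_K = (R + B.T @ P_K @ B) @ K - B.T @ P_K @ A
    return J, 2.0 * E_K @ Sigma_K              # (3.9)
```

Perturbing each entry of K and differencing J reproduces 2 E_K Σ_K entrywise, which is a quick way to catch sign errors in the closed-form gradient.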
Thus, we focus on the family of stochastic policies in which Gaussian noise σ·η is added to the action so as to promote exploration.

3.2 Natural Actor-Critic Algorithm

Natural policy gradient updates the variable along the steepest descent direction with respect to the Fisher metric. For the Gaussian policies defined in (3.1), by (3.10), the Fisher information of policy π_K, denoted by I(K), is given by

    [I(K)]_{(i,j),(i',j')} = E_{x∼ν_K, u∼π_K}[∇_{K_{ij}} log π_K(u| x)·∇_{K_{i'j'}} log π_K(u| x)]
                          = σ^{-2}·E_{x∼ν_K, η∼N(0, I_k)}[η_i x_j·η_{i'} x_{j'}] = σ^{-2}·1{i = i'}·[Σ_K]_{jj'},   (3.12)

where i, i' ∈ [k], j, j' ∈ [d], K_{ij} and K_{i'j'} are the (i, j)- and (i', j')-th entries of K, respectively, and [Σ_K]_{jj'} is the (j, j')-th entry of Σ_K. Thus, in view of (3.9) in Proposition 3.1 and (3.12), the natural policy gradient algorithm updates the policy parameter in the direction of

    [I(K)]^{-1} ∇_K J(K) = σ²·∇_K J(K) Σ_K^{-1} = 2σ²·E_K.

By (3.7), we can write E_K as Θ²²_K K − Θ²¹_K, where Θ_K is the coefficient matrix of the quadratic component of Q_K. Such a connection lays the foundation of the natural actor-critic algorithm. Specifically, in each iteration of the algorithm, the actor updates the policy via K ← K − γ·(Θ̂²² K − Θ̂²¹), where γ is the stepsize and Θ̂ is an estimator of Θ_K returned by any policy evaluation algorithm. We present such a general natural actor-critic method in Algorithm 1 of §A, under the assumption that we are given a stable policy K₀ for initialization. Such an assumption is standard in the literature on model-free methods for LQR [22, 26, 44].

To obtain an online actor-critic algorithm, in the sequel, we propose an online policy evaluation algorithm for ergodic LQR based on temporal difference learning.
Let π_K be the policy of interest. For notational simplicity, for any state-action pair (x, u) ∈ R^{d+k}, we define the feature function

    φ(x, u) = svec( (x; u)(x; u)^T ),   (3.13)

and denote svec(Θ_K) by θ*_K. Using this notation, the quadratic component of Q_K can be written as φ(x, u)^T θ*_K, and the Bellman equation for Q_K becomes

    ⟨φ(x, u), θ*_K⟩ = c(x, u) − J(K) + ⟨E[φ(x', u')| x, u], θ*_K⟩, for all (x, u) ∈ R^{d+k}.   (3.14)

To further simplify the notation, hereafter we define ϑ*_K = (J(K), θ*_K^T)^T, denote by E_{(x,u)} the expectation with respect to x ∼ ν_K and u ∼ π_K(·| x), and let (x', u') be the state-action pair subsequent to (x, u). Furthermore, to estimate J(K) and θ*_K in (3.14) simultaneously, we define

    Ξ_K = E_{(x,u)}{ φ(x, u)·[φ(x, u) − φ(x', u')]^T },  b_K = E_{(x,u)}[c(x, u)·φ(x, u)].   (3.15)

Notice that J(K) = E_{(x,u)}[c(x, u)]. By direct computation, it can be shown that ϑ*_K satisfies the linear equation

    ( 1  0 ; E_{(x,u)}[φ(x, u)]  Ξ_K )·( ϑ¹ ; ϑ² ) = ( J(K) ; b_K ),   (3.16)

whose solution is unique if and only if Ξ_K in (3.15) is invertible. The following lemma shows that, when π_K is a stable policy, Ξ_K is indeed invertible.

Lemma 3.2. When π_K is stable in the sense that ρ(A − BK) < 1, Ξ_K defined in (3.15) is invertible, and thus ϑ*_K is the unique solution to the linear equation (3.16). Furthermore, the minimum singular value of the matrix on the left-hand side of (3.16) is lower bounded by a constant κ*_K > 0, where κ*_K depends only on ρ(A − BK), σ, and λ_min(Ψ).

Proof. See §D.2 for a detailed proof.

By this lemma, when ρ(A − BK) < 1, policy evaluation for π_K reduces to finding the unique solution to a linear equation.
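The Bellman equation (3.14) can be checked exactly, since for Gaussian policies the conditional expectation E[φ(x', u')| x, u] has a closed form: with m = Ax + Bu and S = mm^T + Ψ, the matrix E[z' z'^T | x, u] for z' = (x', u') has blocks S, −S K^T, −K S, and K S K^T + σ² I_k. A self-contained NumPy sketch of this check, on an illustrative system of our own choosing:

```python
import numpy as np

def svec(X):
    i, j = np.triu_indices(X.shape[0])
    return np.where(i == j, 1.0, np.sqrt(2.0)) * X[i, j]

def dlyap(M, S, iters=2000):
    X = S.copy()
    for _ in range(iters):
        X = S + M @ X @ M.T
    return X

def bellman_residual(A, B, Q, R, Psi, K, sigma, x, u):
    """Residual of (3.14) at (x, u):
    <phi(x,u), theta*> - [c(x,u) - J(K) + <E[phi(x',u')|x,u], theta*>],
    which should vanish for theta* = svec(Theta_K)."""
    k = B.shape[1]
    P_K = dlyap((A - B @ K).T, Q + K.T @ R @ K)            # (3.4)
    Theta = np.block([[Q + A.T @ P_K @ A, A.T @ P_K @ B],
                      [B.T @ P_K @ A, R + B.T @ P_K @ B]]) # (3.7)
    J = np.trace(P_K @ (Psi + sigma**2 * B @ B.T)) + sigma**2 * np.trace(R)  # (3.8)
    theta = svec(Theta)
    c = x @ Q @ x + u @ R @ u
    m = A @ x + B @ u                                      # mean of x' given (x, u)
    S = np.outer(m, m) + Psi                               # E[x' x'^T | x, u]
    Ephi = svec(np.block([[S, -S @ K.T],
                          [-K @ S, K @ S @ K.T + sigma**2 * np.eye(k)]]))
    z = np.concatenate([x, u])
    return svec(np.outer(z, z)) @ theta - (c - J + Ephi @ theta)
```

The residual vanishes (up to floating-point error) at any state-action pair, which confirms that θ* = svec(Θ_K), together with J(K), solves the linear system (3.16).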
Instead of solving the equation directly, it is equivalent to minimize the least-squares loss

    minimize_ϑ { [ϑ¹ − J(K)]² + ‖ϑ¹·E_{(x,u)}[φ(x, u)] + Ξ_K ϑ² − b_K‖²₂ },   (3.17)

where ϑ¹ ∈ R and ϑ², which has the same shape as svec(Θ_K), are the two components of ϑ. It is clear that the global minimizer of (3.17) is ϑ*_K. Note that we have Fenchel's duality x² = sup_y {2x·y − y²}. By this relation, we further write (3.17) as the minimax optimization problem

    min_{ϑ∈X_Θ} max_{ω∈X_Ω} F(ϑ, ω) = [ϑ¹ − J(K)]·ω¹ + ⟨ϑ¹·E_{(x,u)}[φ(x, u)] + Ξ_K ϑ² − b_K, ω²⟩ − (1/2)·‖ω‖²₂,   (3.18)

where the dual variable ω = (ω¹, ω²) has the same shape as ϑ. Here we restrict the primal and dual variables to compact sets X_Θ and X_Ω for algorithmic stability, which will be specified in the next section. Note that the objective in (3.18) can be estimated unbiasedly using two consecutive state-action pairs (x, u) and (x', u'). Solving the minimax optimization in (3.18) using the stochastic gradient method, we obtain the gradient-based temporal difference (GTD) algorithm for policy evaluation [66, 67]. See Algorithm 2 for details. More specifically, by direct computation, we have

    ∇_{ϑ¹} F(ϑ, ω) = ω¹ + ⟨E_{(x,u)}[φ], ω²⟩,  ∇_{ϑ²} F(ϑ, ω) = E_{(x,u)}[(φ − φ')·φ^T ω²],   (3.19)
    ∇_{ω¹} F(ϑ, ω) = ϑ¹ − J(K) − ω¹,  ∇_{ω²} F(ϑ, ω) = ϑ¹·E_{(x,u)}[φ] + Ξ_K ϑ² − b_K − ω²,   (3.20)

where we denote φ(x, u) and φ(x', u') by φ and φ', respectively. In the GTD algorithm, we update ϑ and ω in gradient directions, with the gradients in (3.19) and (3.20) replaced by their sample estimates. After T iterations of the algorithm, we output the averaged update ϑ̂² = (Σ_{t=1}^{T} α_t·ϑ²_t)/(Σ_{t=1}^{T} α_t) and use Θ̂ = smat(ϑ̂²) to estimate Θ_K in (3.7), which is further used in Algorithm 1 to update the current policy π_K. Therefore, we obtain the online natural actor-critic algorithm [9] for ergodic LQR.

Meanwhile, using the perspective of bilevel optimization, similar to (2.4) and (2.5), our actor-critic algorithm can be viewed as a first-order online algorithm for

    minimize_{K∈R^{k×d}} E_{x∼ν_K, u∼π_K}[⟨φ(x, u), ϑ²⟩],
    subject to (ϑ, ω) = argmin_{ϑ∈X_Θ} argmax_{ω∈X_Ω} F(ϑ, ω),

where F(ϑ, ω) is defined in (3.18) and depends on π_K. In our algorithm, we solve the upper-level problem via natural gradient descent and solve the lower-level saddle-point optimization problem using stochastic gradient updates.

Furthermore, we emphasize that our method, defined by Algorithms 1 and 2, is online in the sense that each update only requires a single transition. More specifically, let {(x_n, u_n, c_n)}_{n≥0} be the sequence of transitions experienced by the agent. Combining Algorithms 1 and 2 and neglecting the projections, we can write the updating rule as

    K_{n+1} = K_n − γ_n·{ [smat(ϑ²_n)]²² K_n − [smat(ϑ²_n)]²¹ },   (3.21)
    ϑ_{n+1} = ϑ_n − α_n·g_ϑ(x_n, u_n, c_n, x_{n+1}, u_{n+1}),
    ω_{n+1} = ω_n + α_n·g_ω(x_n, u_n, c_n, x_{n+1}, u_{n+1}),   (3.22)

where g_ϑ and g_ω are the update directions of ϑ and ω, whose definitions are clear from Algorithm 2, and {γ_n} and {α_n} are the stepsizes. Moreover, there exists a monotone increasing sequence {N_t}_{t≥0} such that γ_n = γ if n = N_t for some t and γ_n = 0 otherwise. Such a choice of stepsizes reflects the intuition that, although both the actor and the critic are updated simultaneously, the critic should be updated at a faster pace.
From the same viewpoint, classical actor-critic algorithms [36, 9, 28] establish convergence results under the assumption that

    Σ_{n≥0} γ_n = Σ_{n≥0} α_n = ∞,  Σ_{n≥0} (γ²_n + α²_n) < ∞,  lim_{n→∞} γ_n/α_n = 0.

The condition lim_{n→∞} γ_n/α_n = 0 ensures that the critic updates on a faster timescale, which enables an asymptotic analysis via two-timescale stochastic approximation [10, 37]. However, such an approach uses two ordinary differential equations (ODEs) to approximate the updates in (3.22) and thus only offers asymptotic convergence results. In contrast, as shown in §4, our choice of stepsizes yields nonasymptotic convergence results, which show that the natural actor-critic algorithm converges at a linear rate to the global optimum.

In addition, we note that in Algorithm 2 we assume that the initial state x₀ is sampled from the stationary distribution ν_K. Such an assumption is made only to simplify the theoretical analysis. In practice, we could start the algorithm after sampling a sufficient number of transitions so that the Markov chain induced by π_K approximately mixes. Moreover, as shown in [69], when π_K is a stable policy such that ρ(A − BK) < 1, the Markov chain induced by π_K is geometrically β-mixing and thus mixes rapidly.

Finally, we remark that the minimax formulation of the policy evaluation problem was first proposed in [40], which studies the sample complexity of the GTD algorithm for discounted MDPs with i.i.d. data. Using the same formulation, [72] establishes finite-sample bounds with data generated from a Markov process. Our optimization problem in (3.18) can be viewed as the extension of their minimax formulation to the ergodic setting.
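To make the critic updates concrete, here is a hedged NumPy sketch of a GTD-style loop in the spirit of Algorithm 2: simulate one transition per step under π_K, plug the sampled features into the stochastic versions of the gradients (3.19)-(3.20), project for stability, and return the weighted averages. The stepsize schedule, projection radii, and the initialization of x₀ from N(0, Ψ) (rather than the exact stationary distribution) are our own simplifications:

```python
import numpy as np

def svec(X):
    i, j = np.triu_indices(X.shape[0])
    return np.where(i == j, 1.0, np.sqrt(2.0)) * X[i, j]

def smat(v):
    d = int((np.sqrt(8 * len(v) + 1) - 1) / 2)
    i, j = np.triu_indices(d)
    X = np.zeros((d, d))
    X[i, j] = np.where(i == j, 1.0, 1.0 / np.sqrt(2.0)) * v
    return X + np.triu(X, 1).T

def gtd_policy_evaluation(A, B, Q, R, Psi, K, sigma, T=20000, alpha0=0.5,
                          radius=50.0, seed=0):
    """One-transition-per-step primal-dual (GTD) updates for the
    saddle-point problem (3.18); returns (J_hat, Theta_hat)."""
    rng = np.random.default_rng(seed)
    d, k = A.shape[0], B.shape[1]
    p = (d + k) * (d + k + 1) // 2
    theta1, theta2 = 0.0, np.zeros(p)   # primal variables (J and svec(Theta) estimates)
    omega1, omega2 = 0.0, np.zeros(p)   # dual variables
    avg1, avg2, wsum = 0.0, np.zeros(p), 0.0
    x = rng.multivariate_normal(np.zeros(d), Psi)
    u = -K @ x + sigma * rng.standard_normal(k)
    for t in range(1, T + 1):
        xn = A @ x + B @ u + rng.multivariate_normal(np.zeros(d), Psi)
        un = -K @ xn + sigma * rng.standard_normal(k)
        phi = svec(np.outer(np.r_[x, u], np.r_[x, u]))
        phin = svec(np.outer(np.r_[xn, un], np.r_[xn, un]))
        c = x @ Q @ x + u @ R @ u
        alpha = alpha0 / np.sqrt(t)
        # sampled gradients of F, cf. (3.19)-(3.20)
        g1 = omega1 + phi @ omega2
        g2 = (phi - phin) * (phi @ omega2)
        h1 = theta1 - c - omega1
        h2 = phi * (theta1 + (phi - phin) @ theta2 - c) - omega2
        theta1 = np.clip(theta1 - alpha * g1, 0.0, radius)
        theta2 = theta2 - alpha * g2
        omega1 = omega1 + alpha * h1
        omega2 = omega2 + alpha * h2
        for v in (theta2, omega2):      # project onto Euclidean balls
            n = np.linalg.norm(v)
            if n > radius:
                v *= radius / n
        avg1 += alpha * theta1; avg2 += alpha * theta2; wsum += alpha
        x, u = xn, un
    return avg1 / wsum, smat(avg2 / wsum)
```

With exact expectations this recursion is plain projected gradient descent-ascent on (3.18); the sampled version above is only a sketch of the structure of Algorithm 2, not a line-by-line reproduction of it.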
Besides, our GTD algorithm can be applied to general ergodic MDPs with dependent data, which may be of independent interest.

4 Theoretical Results

In this section, we establish the global convergence of the natural actor-critic algorithm. To this end, we first focus on the problem of policy evaluation by assessing the finite-sample performance of the on-policy GTD algorithm.

Note that only Θ̂, returned by the GTD algorithm, is utilized in the natural actor-critic algorithm for the policy update. Thus, in the policy evaluation problem for a linear policy π_K, we only need to study the estimation error ‖Θ̂ − Θ_K‖²_fro, which characterizes the closeness between the direction of the policy update in Algorithm 1 and the true natural policy gradient. Furthermore, recall that we restrict the primal and dual variables respectively to compact sets X_Θ and X_Ω for algorithmic stability. We make the following assumption on X_Θ and X_Ω.

Assumption 4.1. Let π_{K₀} be the initial policy in Algorithm 1. We assume that π_{K₀} is a stable policy such that ρ(A − BK₀) < 1. Consider the policy evaluation problem for π_K. We assume that J(K) ≤ J(K₀). Moreover, let X_Θ and X_Ω in (3.18) be defined as

    X_Θ = { ϑ : 0 ≤ ϑ¹ ≤ J(K₀), ‖ϑ²‖₂ ≤ R̃_Θ },   (4.1)
    X_Ω = { ω : |ω¹| ≤ J(K₀), ‖ω²‖₂ ≤ (1 + ‖K‖²_fro)²·R̃_Ω }.   (4.2)

Here, R̃_Θ and R̃_Ω are two parameters that do not depend on K. Specifically, we have

    R̃_Θ = ‖Q‖_fro + ‖R‖_fro + √d/λ_min(Ψ)·(‖A‖²_fro + ‖B‖²_fro)·J(K₀),   (4.3)
    R̃_Ω = C·R̃_Θ·λ⁻²_min(Q)·[J(K₀)]²,   (4.4)

where C > 0 is a constant.

The assumption that we have access to a stable policy K₀ for initialization is commonly made in the literature on model-free methods for LQR [22, 26, 44]. Besides, ρ(A − BK₀) < 1 implies that J(K₀) is finite. Here we assume J(K) ≤ J(K₀) for simplicity. Even if J(K) > J(K₀), we can replace J(K₀) in (4.1)-(4.4) by J(K) and the theory of policy evaluation still holds. Moreover, as we will show in Theorem 4.3, the actor-critic algorithm creates a sequence of policies whose objective values decrease monotonically. Thus, we assume J(K) ≤ J(K₀) without loss of generality. Furthermore, as shown in the proof, the construction of R̃_Θ and R̃_Ω ensures that (ϑ*_K, 0) is the saddle point of the minimax optimization in (3.18). In other words, the solution to (3.18) is the same as that of the unconstrained problem min_ϑ max_ω F(ϑ, ω). When the population problem is replaced by a sample-based optimization problem, restricting the primal and dual variables ensures that the iterates of the GTD algorithm remain bounded. Thus, this choice of R̃_Θ and R̃_Ω essentially guarantees that restricting (ϑ, ω) to X_Θ × X_Ω incurs no "bias" in the optimization problem.

We present the theoretical result for the online GTD algorithm as follows.

Theorem 4.2 (Policy evaluation). Let ϑ̂¹ and Θ̂ be the output of Algorithm 2 based on T iterations. We set the stepsize to α_t = α/√t, with α > 0 a constant. Under Assumption 4.1, for any ρ ∈ (ρ(A − BK), 1), when the number of iterations T is sufficiently large, with probability at least 1 − T^{-4}, we have

    ‖Θ̂ − Θ_K‖²_fro ≤ Υ[R̃_Θ, R̃_Ω, J(K₀), ‖K‖_fro, 1/λ_min(Q)] / [(κ*_K)²·(1 − ρ)] · log⁶T/√T,   (4.5)

where Υ[R̃_Θ, R̃_Ω, J(K₀), ‖K‖_fro, 1/λ_min(Q)] is a polynomial of R̃_Θ, R̃_Ω, J(K₀), ‖K‖_fro, and 1/λ_min(Q).

Proof. See §C.1 for a detailed proof.

This theorem establishes the statistical rate of convergence of the on-policy GTD algorithm.
Speci\ufb01-\ncally, if we regard \u2325[eR\u21e5,eR\u2326, J(K0),kKkfro, 1\nmin(Q)], \u21e2, and \uf8ff\u21e4K as constant, (4.5) implies that\nthe estimation error is of order log6 T /pT . Thus, ignoring the logarithmic term, we conclude that\nthe GTD algorithm converges in the sublinear rate O(1/pT ), which is optimal for convex-concave\n\n8\n\n\fTt  \u2325\u21e5kKtk, J(K0)\u21e4 \u00b7 \uf8ff\u21e4Kt\n\n5 \u00b7 [1  \u21e2(A  BKt)]5/2 \u00b7 \u270f5,\n\nstochastic optimization [49] and is also identical to the rate of convergence of the GTD algorithm\nin the discounted setting with bounded data [40, 72]. Note that we focus on the ergodic case and\nthe feature mapping (x, u) de\ufb01ned in (3.13) is unbounded. We believe this theorem might be\nof independent interest. Furthermore, 1/\uf8ff\u21e4K is approximately the condition number of the linear\nequation of (3.16), which re\ufb02ects the fundamental dif\ufb01culty of estimating \u21e5K. Speci\ufb01cally, when\n\uf8ff\u21e4K is close to zero, the matrix on the left-hand side of (3.16) is close to a singular matrix. In this case,\nestimating \u21e5K can be viewed as solving an ill-conditioned regression problem and thus huge sample\nsize is required for consistent estimation. Finally, 1/[1  \u21e2(A  BK)] also re\ufb02ects the intrinsic\nhardness of estimating \u21e5K. Speci\ufb01cally, for any \u21e2 2 (\u21e2(A  BK), 1), the Markov chain induced by\n\u21e1K is -mixing where the k-th mixing coef\ufb01cients is bounded by C \u00b7 \u21e2k for some constant C > 0\n[69]. Thus, when \u21e2 is close to one, this Markov chain becomes more dependent, which makes the\nestimation problem more dif\ufb01cult.\nEquipped with the \ufb01nite sample error of the policy evaluation algorithm, now we are ready to present\nthe global convergence of the actor-critic algorithm. For ease of presentation, we assume that Q, R,\nA, B, are all constant matrices.\nTheorem 4.3 (Global convergence of actor-critic). 
Let the initial policy K0 be stable. We set the\nstepsize  = [kRk + 1\nmin( ) \u00b7k Bk2 \u00b7 J(K0)] in Algorithm 1 and perform N actor updates in\nthe actor-critic algorithm. Let {Kt}0\uf8fft\uf8ffN be the sequence of policy parameters generated by the\nalgorithm. For any suf\ufb01ciently small \u270f> 0, we set N > C \u00b7k \u2303K\u21e4k/ \u00b7 log2[J(K0)  J(K\u21e4)]/\u270f \nfor some constant C > 0. Moreover, for any t 2{ 0, 1, . . . , N}, in the t-th iteration, we set the\nnumber Tt of GTD updates in Algorithm 2 to be\n\nJ(K0t+1)  J(K\u21e4) \uf8ff\u21e51  C1 \u00b7  \u00b7k \u2303K\u21e4k1\u21e4 \u00b7\u21e5J(Kt)  J(K\u21e4)\u21e4\n\nwhere \u2325[kKtk, J(K0)] is a polynomial of kKtk and J(K0). Then with probability at least 1  \u270f10,\nwe have J(KN )  J(K\u21e4) \uf8ff \u270f.\nProof Sketch. The proof of this Theorem is based on combining the convergence of the natural policy\ngradient and the \ufb01nite sample analysis of the GTD algorithm established in Theorem 4.2. Speci\ufb01cally,\nfor each Kt, we de\ufb01ne K0t+1 = Kt  \u2318 \u00b7 EKt, which is the one-step natural policy gradient update\nstarting from Kt. Similar to [26], for ergodic LQR, it can be shown that\n(4.6)\nfor some constant C1 > 0. In addition, for policy \u21e1Kt, when the number of GTD iteration Tt is\nsuf\ufb01ciently large, Kt+1 is close to K0t+1, which further implies that |J(K0t+1)  J(Kt+1)| is small.\nThus, combining this and (4.6), we obtain the linear convergence of the actor-critic algorithm. See\n\u00a7C.2 for a detailed proof.\nThis theorem shows that natural actor-critic algorithm combined with GTD converges linearly to\nthe optimal policy of LQR. Furthermore, the number of policy updates in this theorem matches\nthose obtained by natural policy gradient algorithm [26, 44]. 
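The actor recursion underlying this linear rate can be illustrated numerically. The sketch below is not the paper's Algorithm 1: the GTD critic is replaced by an exact policy-evaluation step (a Lyapunov solve), so only the natural-gradient actor update $K_{t+1} = K_t - \gamma E_{K_t}$ is exercised. The two-dimensional instance $(A, B, Q, R)$, the noise covariance $\Psi$, and the helper names are hypothetical choices for illustration; the stepsize mirrors the form $[\|R\| + \sigma_{\min}(\Psi)^{-1}\|B\|^2 J(K_0)]^{-1}$ used in the theorem.

```python
import numpy as np

# Illustrative 2-dimensional ergodic LQR instance; A, B, Q, R, Psi are
# hypothetical choices (not from the paper), with A open-loop unstable.
A = np.array([[1.0, 0.2],
              [0.0, 1.1]])
B = np.eye(2)
Q = np.eye(2)
R = np.eye(2)
Psi = 0.1 * np.eye(2)    # covariance of the i.i.d. Gaussian process noise

def dlyap(L, C):
    """Solve P = L^T P L + C (unique when rho(L) < 1) via vectorization."""
    n = L.shape[0]
    M = np.eye(n * n) - np.kron(L.T, L.T)
    return np.linalg.solve(M, C.reshape(-1)).reshape(n, n)

def cost(K):
    """Ergodic cost of u = -Kx: J(K) = tr(P_K Psi), where P_K solves
    P_K = Q + K^T R K + (A - BK)^T P_K (A - BK)."""
    P = dlyap(A - B @ K, Q + K.T @ R @ K)
    return np.trace(P @ Psi)

def natural_grad(K):
    """Natural-gradient direction E_K = (R + B^T P_K B) K - B^T P_K A,
    which vanishes at the optimal policy K*."""
    P = dlyap(A - B @ K, Q + K.T @ R @ K)
    return (R + B.T @ P @ B) @ K - B.T @ P @ A

# Reference solution K* via Riccati value iteration (used only to measure
# the optimality gap J(K_t) - J(K*)).
P = Q.copy()
for _ in range(1000):
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ G)
K_star, J_star = G, np.trace(P @ Psi)

K0 = A.copy()            # stabilizing initialization: rho(A - B K0) = 0
# Stepsize mirroring the theorem's choice, with sigma_min(Psi) = 0.1 here.
gamma = 1.0 / (np.linalg.norm(R, 2)
               + np.linalg.norm(B, 2) ** 2 * cost(K0)
               / np.min(np.linalg.eigvalsh(Psi)))

K, gaps = K0, []
for _ in range(500):
    gaps.append(cost(K) - J_star)
    K = K - gamma * natural_grad(K)   # exact natural policy gradient step
```

With exact policy evaluation, the gap $J(K_t) - J(K^*)$ contracts geometrically, which is the model-based analogue of the linear rate above; in Algorithm 1 the Lyapunov solve is replaced by the GTD estimate $\widehat{\Theta}$, and the theorem quantifies how many GTD updates $T_t$ keep each actor step close to this idealized recursion.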
To the best of our knowledge, this result appears to be the first nonasymptotic convergence result for actor-critic algorithms with function approximation, for which the existing theory is mostly asymptotic and based on ODE approximation. Furthermore, from the viewpoint of bilevel optimization, Theorem 4.3 offers theoretical guarantees for the actor-critic algorithm as a first-order online method for the bilevel program defined in (3.21), which serves as a first attempt at understanding bilevel optimization with possibly nonconvex subproblems.

Furthermore, although we only consider the problem of LQR and analyze the natural actor-critic with GTD for policy evaluation, our theoretical framework can be applied to general reinforcement learning problems with other policy optimization methods for the actor (e.g., vanilla policy gradient [68], trust-region policy optimization (TRPO) [57], and proximal policy optimization (PPO) [58]) and other policy evaluation methods for the critic, such as TD(0) [65], least-squares temporal difference (LSTD) [13], and Retrace [47]. In particular, if the critic adopts compatible features [68] for policy evaluation, then using nonconvex optimization techniques it can be shown that vanilla policy gradient converges to a local minimum of the expected total return at a sublinear rate [75]. Moreover, by leveraging the geometry of the expected total return as a functional of the policy, [1, 39, 71, 59] recently prove that natural policy gradient [34], TRPO, and PPO are all able to find the globally optimal policy. Using similar approaches, we can establish the convergence and global optimality of actor-critic methods.

References

[1] Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.

[2] Alizadeh, F., Haeberly, J.-P. A. and Overton, M. L. (1998).
Primal-dual interior-point methods for semidefinite programming: Convergence rates, stability and numerical results. SIAM Journal on Optimization, 8 746–768.

[3] Anderson, B. D. and Moore, J. B. (2007). Optimal control: Linear quadratic methods. Courier Corporation.

[4] Bard, J. F. (2013). Practical bilevel optimization: Algorithms and applications, vol. 30. Springer Science & Business Media.

[5] Başar, T. and Olsder, G. J. (1999). Dynamic noncooperative game theory, vol. 23. SIAM.

[6] Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15 319–350.

[7] Bertsekas, D. P. (2012). Dynamic programming and optimal control, Vol. II, 4th edition. Athena Scientific.

[8] Bhatnagar, S., Precup, D., Silver, D., Sutton, R. S., Maei, H. R. and Szepesvári, C. (2009). Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems.

[9] Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45 2471–2482.

[10] Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control Letters, 29 291–294.

[11] Borkar, V. S. and Konda, V. R. (1997). The actor-critic algorithm as multi-time-scale stochastic approximation. Sadhana, 22 525–543.

[12] Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In Advances in Neural Information Processing Systems.

[13] Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22 33–57.

[14] Bradtke, S. J., Ydstie, B. E. and Barto, A. G. (1994). Adaptive linear quadratic control using policy iteration. In American Control Conference, vol. 3. IEEE.

[15] Cai, Q., Hong, M., Chen, Y. and Wang, Z. (2019).
On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674.

[16] Chen, X., Wang, J. and Ge, H. (2018). Training generative adversarial networks via primal-dual subgradient methods: A Lagrangian perspective on GAN. arXiv preprint arXiv:1802.01765.

[17] Dai, B., He, N., Pan, Y., Boots, B. and Song, L. (2017). Learning from conditional distributions via dual embeddings. In International Conference on Artificial Intelligence and Statistics.

[18] Dai, B., Shaw, A., He, N., Li, L. and Song, L. (2018). Boosting the actor with dual critic. International Conference on Learning Representations.

[19] Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J. and Song, L. (2018). SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning.

[20] Dann, C., Neumann, G. and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. The Journal of Machine Learning Research, 15 809–883.

[21] Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2017). On the sample complexity of the linear quadratic regulator. arXiv preprint arXiv:1710.01688.

[22] Dean, S., Mania, H., Matni, N., Recht, B. and Tu, S. (2018). Regret bounds for robust adaptive control of the linear quadratic regulator. arXiv preprint arXiv:1805.09388.

[23] Dean, S., Tu, S., Matni, N. and Recht, B. (2018). Safely learning to control the constrained linear quadratic regulator. arXiv preprint arXiv:1809.10121.

[24] Dempe, S. (2002). Foundations of bilevel programming. Springer Science & Business Media.

[25] Du, S. S. and Hu, W. (2018). Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. arXiv preprint arXiv:1802.01504.

[26] Fazel, M., Ge, R., Kakade, S. and Mesbahi, M. (2018).
Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning.

[27] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.

[28] Grondman, I., Busoniu, L., Lopes, G. A. and Babuska, R. (2012). A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42 1291–1307.

[29] Hansen, P., Jaumard, B. and Savard, G. (1992). New branch-and-bound rules for linear bilevel programming. SIAM Journal on Scientific and Statistical Computing, 13 1194–1217.

[30] Hardt, M., Ma, T. and Recht, B. (2018). Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19 1025–1068.

[31] Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.

[32] Horn, R. A. and Johnson, C. R. (2013). Matrix analysis. Cambridge University Press.

[33] Islam, R., Henderson, P., Gomrokchi, M. and Precup, D. (2017). Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133.

[34] Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.

[35] Kirk, D. E. (1970). Optimal control theory: An introduction. Springer.

[36] Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems.

[37] Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, vol. 35. Springer.

[38] Lin, Q., Liu, M., Rafique, H. and Yang, T.
(2018). Solving weakly-convex-weakly-concave saddle-point problems as successive strongly monotone variational inequalities. arXiv preprint arXiv:1810.10207.

[39] Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.

[40] Liu, B., Liu, J., Ghavamzadeh, M., Mahadevan, S. and Petrik, M. (2015). Finite-sample analysis of proximal gradient TD algorithms. In Conference on Uncertainty in Artificial Intelligence.

[41] Lu, S., Singh, R., Chen, X., Chen, Y. and Hong, M. (2019). Understand the dynamics of GANs via primal-dual optimization. https://openreview.net/forum?id=rylIy3R9K7

[42] Luo, Z.-Q., Pang, J.-S. and Ralph, D. (1996). Mathematical programs with equilibrium constraints. Cambridge University Press.

[43] Magnus, J. R. (1978). The moments of products of quadratic forms in normal variables. Statistica Neerlandica, 32 201–210.

[44] Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P. L. and Wainwright, M. J. (2018). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305.

[45] Meyn, S. P. (1997). The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Transactions on Automatic Control, 42 1663–1680.

[46] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.

[47] Munos, R., Stepleton, T., Harutyunyan, A. and Bellemare, M. (2016). Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems.

[48] Nagar, A. L. (1959). The bias and moment matrix of the general k-class estimators of the parameters in simultaneous equations.
Econometrica: Journal of the Econometric Society 575–595.

[49] Nemirovski, A., Juditsky, A., Lan, G. and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19 1574–1609.

[50] Pfau, D. and Vinyals, O. (2016). Connecting generative adversarial networks and actor-critic methods. arXiv preprint arXiv:1610.01945.

[51] Polyak, B. T. (1963). Gradient methods for minimizing functionals. Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki, 3 643–653.

[52] Powell, W. B. and Ma, J. (2011). A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications, 9 336–352.

[53] Rafique, H., Liu, M., Lin, Q. and Yang, T. (2018). Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060.

[54] Recht, B. (2018). A tour of reinforcement learning: The view from continuous control. arXiv preprint arXiv:1806.09460.

[55] Rudelson, M. and Vershynin, R. (2013). Hanson-Wright inequality and sub-Gaussian concentration. Electronic Communications in Probability, 18.

[56] Sanjabi, M., Razaviyayn, M. and Lee, J. D. (2018). Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition. arXiv preprint arXiv:1812.02878.

[57] Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning.

[58] Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[59] Shani, L., Efroni, Y. and Mannor, S. (2019). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs.
arXiv preprint arXiv:1909.02769.

[60] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning.

[61] Simchowitz, M., Mania, H., Tu, S., Jordan, M. I. and Recht, B. (2018). Learning without mixing: Towards a sharp analysis of linear system identification. arXiv preprint arXiv:1802.08334.

[62] Sinha, A., Malo, P. and Deb, K. (2018). A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation, 22 276–295.

[63] Sinha, A., Namkoong, H. and Duchi, J. (2017). Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.

[64] Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics 1135–1151.

[65] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3 9–44.

[66] Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C. and Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning. ACM.

[67] Sutton, R. S., Maei, H. R. and Szepesvári, C. (2009). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.

[68] Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.

[69] Tu, S. and Recht, B. (2017). Least-squares temporal difference learning for the linear quadratic regulator. arXiv preprint arXiv:1712.08642.

[70] Tu, S. and Recht, B. (2018).
The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. arXiv preprint arXiv:1812.03565.

[71] Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.

[72] Wang, Y., Chen, W., Liu, Y., Ma, Z.-M. and Liu, T.-Y. (2017). Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Advances in Neural Information Processing Systems.

[73] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8 229–256.

[74] Yang, Z., Fu, Z., Zhang, K. and Wang, Z. (2018). Convergent reinforcement learning with function approximation: A bilevel optimization perspective. Manuscript.

[75] Zhang, K., Koppel, A., Zhu, H. and Başar, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.

[76] Zhou, K., Doyle, J. C. and Glover, K. (1996). Robust and optimal control, vol. 40. Prentice Hall.