{"title": "Certainty Equivalence is Efficient for Linear Quadratic Control", "book": "Advances in Neural Information Processing Systems", "page_first": 10154, "page_last": 10164, "abstract": "We study the performance of the certainty equivalent controller on Linear Quadratic (LQ) control problems with unknown transition dynamics. We show that for both the fully and partially observed settings, the sub-optimality gap between the cost incurred by playing the certainty equivalent controller on the true system and the cost incurred by using the optimal LQ controller enjoys a fast statistical rate, scaling as the square of the parameter error. To the best of our knowledge, our result is the first sub-optimality guarantee in the partially observed Linear Quadratic Gaussian (LQG) setting. Furthermore, in the fully observed Linear Quadratic Regulator (LQR), our result improves upon recent work by Dean et al., who present an algorithm achieving a sub-optimality gap linear in the parameter error. A key part of our analysis relies on perturbation bounds for discrete Riccati equations. We provide two new perturbation bounds, one that expands on an existing result from Konstantinov, and another based on a new elementary proof strategy.", "full_text": "Certainty Equivalence is Ef\ufb01cient for Linear\n\nQuadratic Control\n\nHoria Mania\n\nUniversity of California, Berkeley\n\nhmania@berkeley.edu\n\nStephen Tu\n\nUniversity of California, Berkeley\n\nstephentu@berkeley.edu\n\nBenjamin Recht\n\nUniversity of California, Berkeley\n\nbrecht@berkeley.edu\n\nAbstract\n\nWe study the performance of the certainty equivalent controller on Linear Quadratic\n(LQ) control problems with unknown transition dynamics. 
We show that for both the fully and partially observed settings, the sub-optimality gap between the cost incurred by playing the certainty equivalent controller on the true system and the cost incurred by using the optimal LQ controller enjoys a fast statistical rate, scaling as the square of the parameter error. To the best of our knowledge, our result is the first sub-optimality guarantee in the partially observed Linear Quadratic Gaussian (LQG) setting. Furthermore, in the fully observed Linear Quadratic Regulator (LQR), our result improves upon recent work by Dean et al. [11], who present an algorithm achieving a sub-optimality gap linear in the parameter error. A key part of our analysis relies on perturbation bounds for discrete Riccati equations. We provide two new perturbation bounds, one that expands on an existing result from Konstantinov et al. [25], and another based on a new elementary proof strategy.

1 Introduction

One of the most straightforward methods for controlling a dynamical system with unknown transitions is based on the certainty equivalence principle: a model of the system is fit by observing its time evolution, and a control policy is then designed by treating the fitted model as the truth [6]. Despite the simplicity of this method, it is challenging to guarantee its efficiency because small modeling errors may propagate to large, undesirable behaviors on long time horizons.
As a result, most work on controlling systems with unknown dynamics has explicitly incorporated robustness against model uncertainty [11, 12, 23, 30, 41, 42].

In this work, we show that for the standard baseline of controlling an unknown linear dynamical system with a quadratic objective function, known as Linear Quadratic (LQ) control, certainty equivalent control synthesis achieves better cost than prior methods that account for model uncertainty. Our results hold for both the fully observed Linear Quadratic Regulator (LQR) and the partially observed Linear Quadratic Gaussian (LQG) setting. For offline control, where one collects some data and then designs a fixed control policy to be run on an infinite time horizon, we show that the gap between the performance of the certainty equivalent controller and the optimal control policy scales quadratically with the error in the model parameters for both LQR and LQG. To the best of our knowledge, we provide the first sub-optimality guarantee for LQG. Moreover, in the LQR setting our work improves upon the recent result of Dean et al. [11], who present an algorithm that achieves a sub-optimality gap linear in the parameter error. In the case of online LQR control, where one adaptively improves the control policy as new data comes in, our offline result implies that a simple, polynomial time algorithm using $\varepsilon$-greedy exploration suffices for nearly optimal $\tilde{O}(\sqrt{T})$ regret.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Main Results for the Linear Quadratic Regulator

An instance of the linear quadratic regulator (LQR) is defined by four matrices: two matrices $A_\star \in \mathbb{R}^{n \times n}$ and $B_\star \in \mathbb{R}^{n \times d}$ that define the linear dynamics and two positive semidefinite matrices $Q \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{d \times d}$ that define the cost function.
Given these matrices, the goal of LQR is to solve the optimization problem

$$\min_{u_0, u_1, \ldots} \ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=0}^{T} x_t^\top Q x_t + u_t^\top R u_t \right] \quad \text{s.t.} \quad x_{t+1} = A_\star x_t + B_\star u_t + w_t, \qquad (1)$$

where $x_t$, $u_t$ and $w_t$ denote the state, input (or action), and noise at time $t$, respectively. The expectation is over the initial state $x_0 \sim \mathcal{N}(0, I_n)$ and the i.i.d. noise $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$. When the problem parameters $(A_\star, B_\star, Q, R)$ are known, the optimal policy is given by linear feedback, $u_t = K_\star x_t$, where $K_\star = -(R + B_\star^\top P_\star B_\star)^{-1} B_\star^\top P_\star A_\star$ and $P_\star$ is the (positive definite) solution to the discrete Riccati equation

$$P_\star = A_\star^\top P_\star A_\star - A_\star^\top P_\star B_\star (R + B_\star^\top P_\star B_\star)^{-1} B_\star^\top P_\star A_\star + Q, \qquad (2)$$

which can be computed efficiently [see e.g. 4]. Problem (1) considers an average cost over an infinite horizon. The optimal controller for the finite horizon variant is also static and linear, but time-varying. The LQR solution in this case can be computed efficiently via dynamic programming.

In this work we are interested in the control of a linear dynamical system with unknown transition parameters $(A_\star, B_\star)$ based on estimates $(\widehat{A}, \widehat{B})$. The cost matrices $Q$ and $R$ are assumed known. We analyze the certainty equivalence approach: use the estimates $(\widehat{A}, \widehat{B})$ to solve the optimization problem (1) while disregarding the modeling error, and use the resulting controller on the true system $(A_\star, B_\star)$. We interchangeably refer to the resulting policy as the certainty equivalent controller or, following Dean et al. [11], the nominal controller. We denote by $\widehat{P}$ the solution to the Riccati equation (2) associated with the parameters $(\widehat{A}, \widehat{B})$ and let $\widehat{K}$ be the corresponding controller. We denote by $J(A, B, K)$ the cost (1) obtained by using the actions $u_t = K x_t$ on the system $(A, B)$.
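For concreteness, the Riccati solution $P_\star$ in (2) and the corresponding optimal gain $K_\star$ can be computed numerically, for example with SciPy's `solve_discrete_are`. The sketch below uses arbitrary illustrative matrices that are not from the paper:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Example system matrices (arbitrary illustrative values, not from the paper).
A = np.array([[1.01, 0.1], [0.0, 0.99]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)  # state cost
R = np.eye(1)  # input cost

# Solve the discrete algebraic Riccati equation (2):
#   P = A'PA - A'PB (R + B'PB)^{-1} B'PA + Q
P = solve_discrete_are(A, B, Q, R)

# Optimal static feedback gain for u_t = K x_t:
#   K = -(R + B'PB)^{-1} B'PA
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# The optimal closed-loop matrix A + BK is stable (spectral radius < 1).
rho = max(abs(np.linalg.eigvals(A + B @ K)))
print(P, K, rho)
```

The same routine applied to estimated matrices $(\widehat{A}, \widehat{B})$ yields the certainty equivalent gain $\widehat{K}$ discussed above.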
We use $\widehat{J}$ and $J_\star$ to denote $J(A_\star, B_\star, \widehat{K})$ and $J(A_\star, B_\star, K_\star)$, respectively. Let $\varepsilon \geq 0$ be such that $\|A_\star - \widehat{A}\| \leq \varepsilon$ and $\|B_\star - \widehat{B}\| \leq \varepsilon$. (Here and throughout this work we use $\|\cdot\|$ to denote the Euclidean norm for vectors as well as the spectral (operator) norm for matrices.) Dean et al. [11] introduced a robust controller that achieves $\widehat{J} - J_\star \leq C_1(A_\star, B_\star, Q, R)\,\varepsilon$ for some complexity term $C_1(A_\star, B_\star, Q, R)$ that depends on the problem parameters. We show that the nominal controller $u_t = \widehat{K} x_t$ achieves $\widehat{J} - J_\star \leq C_2(A_\star, B_\star, Q, R)\,\varepsilon^2$. Both results require $\varepsilon$ to be sufficiently small (as a function of the problem parameters), and it is important to note that $\varepsilon$ must be much smaller for the nominal controller to be guaranteed to stabilize the system than for the robust controller proposed by Dean et al. [11]. However, our result shows that once the estimation error $\varepsilon$ is small enough, the nominal controller performs better: the sub-optimality gap scales as $O(\varepsilon^2)$ versus $O(\varepsilon)$. Both the more stringent requirement on $\varepsilon$ and the better performance of nominal control compared to robust control, when the estimation error is sufficiently small, were observed empirically by Dean et al. [11].

Before we can formally state our result we need to introduce a few more concepts and assumptions. It is common to assume that the cost matrices $Q$ and $R$ are positive definite. Under an additional observability assumption, this condition can be relaxed to $Q$ being positive semidefinite.

Assumption 1. The cost matrices $Q$ and $R$ are positive definite. Since scaling both $Q$ and $R$ does not change the optimal controller $K_\star$, we can assume without loss of generality that $\sigma_{\min}(R) \geq 1$, where $\sigma_{\min}(\cdot)$ denotes the minimum singular value.

A square matrix $M$ is stable if its spectral radius $\rho(M)$ is (strictly) smaller than one. Recall that the spectral radius is defined as $\rho(M) = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } M\}$. A linear dynamical system $(A, B)$ in feedback with $K$ is fully described by the closed loop matrix $A + BK$. More precisely, in this case $x_{t+1} = (A + BK)x_t + w_t$.
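As an illustrative sketch (with arbitrary example matrices, not from the paper, and $\sigma_w = 1$), the cost $J(A, B, K)$ of any stabilizing static controller can be evaluated by solving a discrete Lyapunov equation for the stationary state covariance, which makes the gap $\widehat{J} - J_\star$ between the nominal and optimal controllers directly computable:

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

def lqr_gain(A, B, Q, R):
    """LQR gain obtained from the Riccati solution (certainty equivalence)."""
    P = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def lqr_cost(A, B, Q, R, K, sigma_w=1.0):
    """J(A, B, K): average cost of u_t = K x_t on x_{t+1} = A x_t + B u_t + w_t.

    Requires the closed loop L = A + BK to be stable; the stationary covariance
    X then solves X = L X L' + sigma_w^2 I and J = tr((Q + K'RK) X).
    """
    L = A + B @ K
    assert max(abs(np.linalg.eigvals(L))) < 1, "closed loop must be stable"
    X = solve_discrete_lyapunov(L, sigma_w**2 * np.eye(A.shape[0]))
    return np.trace((Q + K.T @ R @ K) @ X)

# True system vs. a slightly perturbed estimate (illustrative values only).
A_star = np.array([[1.01, 0.1], [0.0, 0.99]])
B_star = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
eps = 0.01
A_hat = A_star + eps * np.ones_like(A_star)  # hypothetical estimation error

K_star = lqr_gain(A_star, B_star, Q, R)  # optimal controller
K_hat = lqr_gain(A_hat, B_star, Q, R)    # certainty equivalent (nominal) controller

J_star = lqr_cost(A_star, B_star, Q, R, K_star)
J_hat = lqr_cost(A_star, B_star, Q, R, K_hat)  # nominal controller on the true system
print(J_hat - J_star)  # small sub-optimality gap
```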
For a static linear controller $u_t = K x_t$ to achieve finite LQR cost it is necessary and sufficient that the closed loop matrix be stable.

In order to quantify the growth or decay of powers of a square matrix $M$, we define

$$\tau(M, \rho) := \sup\{\|M^k\|\rho^{-k} : k \geq 0\}. \qquad (3)$$

In other words, $\tau(M, \rho)$ is the smallest value such that $\|M^k\| \leq \tau(M, \rho)\rho^k$ for all $k \geq 0$. We note that $\tau(M, \rho)$ might be infinite, depending on the value of $\rho$, and it is always greater or equal than one. If $\rho$ is larger than $\rho(M)$, we are guaranteed to have a finite $\tau(M, \rho)$ (this is a consequence of Gelfand's formula). In particular, if $M$ is a stable matrix, we can choose $\rho < 1$ such that $\tau(M, \rho)$ is finite. Also, we note that $\tau(M, \rho)$ is a decreasing function of $\rho$; if $\rho \geq \|M\|$, we have $\tau(M, \rho) = 1$. At a high level, the quantity $\tau(M, \rho)$ measures the degree of transient response of the linear system $x_{t+1} = M x_t + w_t$. In particular, when $M$ is stable, $\tau(M, \rho)$ can be upper bounded by the $\mathcal{H}_\infty$-norm of the system defined by $M$, which is the $\ell_2$ to $\ell_2$ operator norm of the system and a fundamental quantity in robust control [see 40, for more details].

Throughout this work we use the quantities $\Gamma_\star := 1 + \max\{\|A_\star\|, \|B_\star\|, \|P_\star\|, \|K_\star\|\}$ and $L_\star := A_\star + B_\star K_\star$. We use $\Gamma_\star$ as a uniform upper bound on the spectral norms of the relevant matrices for the sake of algebraic simplicity. We are ready to state our meta theorem. The proofs for all the results can be found in the full version of the paper [28].

Theorem 1. Suppose $d \leq n$. Let $\gamma > 0$ be such that $\rho(L_\star) \leq \gamma < 1$. Also, let $\varepsilon > 0$ be such that $\|\widehat{A} - A_\star\| \leq \varepsilon$ and $\|\widehat{B} - B_\star\| \leq \varepsilon$, and assume $\|\widehat{P} - P_\star\| \leq f(\varepsilon)$ for some function $f$ such that $f(\varepsilon) \geq \varepsilon$.
Then, under Assumption 1, the certainty equivalent controller $u_t = \widehat{K} x_t$ achieves

$$\widehat{J} - J_\star \leq 200\, \sigma_w^2\, d\, \Gamma_\star^9\, \frac{\tau(L_\star, \gamma)^2}{1 - \gamma^2}\, f(\varepsilon)^2, \qquad (4)$$

as long as $f(\varepsilon)$ is small enough so that the right hand side is smaller than $\sigma_w^2$.

In Section 4 we present two upper bounds $f(\varepsilon)$ on $\|\widehat{P} - P_\star\|$: one based on a proof technique proposed by Konstantinov et al. [25] and one based on our direct approach. Both of these upper bounds satisfy $f(\varepsilon) = O(\varepsilon)$ for $\varepsilon$ sufficiently small. For simplicity, in this section we only specialize our meta-theorem (Theorem 1) using the perturbation result from our direct approach.

To state a specialization of Theorem 1 we need a few more concepts. A linear system $(A, B)$ is called controllable when the controllability matrix $[B \ \ AB \ \ A^2 B \ \ \ldots \ \ A^{n-1} B]$ has full row rank. Controllability is a fundamental concept in control theory; it states that there exists a sequence of inputs to the system $(A, B)$ that moves it from any starting state to any final state in at most $n$ steps. In this work we quantify how controllable a linear system is. We denote, for any integer $\ell \geq 1$, the matrix $C_\ell := [B \ \ AB \ \ \ldots \ \ A^{\ell-1} B]$ and call the system $(\ell, \nu)$-controllable if the $n$-th singular value of $C_\ell$ is greater or equal than $\nu$, i.e. $\sigma_{\min}(C_\ell) = \sqrt{\lambda_{\min}(C_\ell C_\ell^\top)} \geq \nu$. Intuitively, the larger $\nu$ is, the less control effort is needed to move the system between two different states.

Assumption 2. We assume the unknown system $(A_\star, B_\star)$ is $(\ell, \nu)$-controllable, with $\nu > 0$.

Assumption 2 was used in a different context by Cohen et al. [9]. For any controllable system and any $\ell \geq n$ there exists $\nu > 0$ such that the system is $(\ell, \nu)$-controllable. Therefore, $(\ell, \nu)$-controllability is really not much stronger of an assumption than controllability.
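Both instance-dependent quantities introduced above, $\tau(M, \rho)$ from (3) and the $(\ell, \nu)$-controllability level, can be estimated numerically. The sketch below uses arbitrary example matrices (not from the paper) and approximates the supremum in (3) by truncating it at a finite horizon, which is valid when $\rho > \rho(M)$ so that the tail of the sequence decays:

```python
import numpy as np

def tau(M, rho, k_max=200):
    """Approximate tau(M, rho) = sup_k ||M^k|| / rho^k from equation (3),
    truncating the supremum at k_max terms."""
    n = M.shape[0]
    vals, Mk = [], np.eye(n)
    for _ in range(k_max + 1):
        vals.append(np.linalg.norm(Mk, 2) / rho**len(vals) if False else None)
        break
    # Recompute cleanly: track M^k and the ratio ||M^k|| / rho^k.
    vals, Mk = [], np.eye(n)
    for k in range(k_max + 1):
        vals.append(np.linalg.norm(Mk, 2) / rho**k)
        Mk = Mk @ M
    return max(vals)

def controllability_level(A, B, ell):
    """Largest nu such that (A, B) is (ell, nu)-controllable: the n-th
    singular value of C_ell = [B, AB, ..., A^{ell-1} B], computed as
    sqrt(lambda_min(C_ell C_ell'))."""
    blocks, AkB = [], B
    for _ in range(ell):
        blocks.append(AkB)
        AkB = A @ AkB
    C_ell = np.hstack(blocks)
    return np.sqrt(max(np.linalg.eigvalsh(C_ell @ C_ell.T)[0], 0.0))

M = np.array([[0.9, 1.0], [0.0, 0.9]])  # stable but with transient growth
print(tau(M, 0.95))                      # finite since 0.95 > rho(M) = 0.9

A = np.array([[1.01, 0.1], [0.0, 0.99]])
B = np.array([[0.0], [1.0]])
print(controllability_level(A, B, 2))    # positive since (A, B) is controllable
```

Note that with a single input ($d = 1$) and $\ell = 1$, the matrix $C_1 = B$ has fewer than $n$ columns, so the $n$-th singular value is zero, consistent with the requirement $\ell \geq n/d$ implicit in the definition.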
As $\ell$ grows, the minimum singular value $\sigma_{\min}(C_\ell)$ also grows, and therefore a larger $\nu$ can be chosen so that the system is still $(\ell, \nu)$-controllable. Note that controllability is not necessary for LQR to have a well-defined solution: the weaker requirement is that of stabilizability, in which there exists a feedback matrix $K$ so that $A_\star + B_\star K$ is stable. The result of Dean et al. [11] only requires stabilizability. While our upper bound on $\|\widehat{P} - P_\star\|$ requires controllability, the result of Konstantinov et al. [25] only requires stabilizability. However, our upper bound on $\|\widehat{P} - P_\star\|$ is sharper for some classes of systems (see Section 4). A direct plug in of our perturbation result, presented in Section 4, into Theorem 1 yields the following guarantee.

Theorem 2. Suppose that $d \leq n$. Let $\rho$ and $\gamma$ be two real values such that $\rho(A_\star) \leq \rho$ and $\rho(L_\star) \leq \gamma < 1$. Also, let $\varepsilon > 0$ be such that $\|\widehat{A} - A_\star\| \leq \varepsilon$ and $\|\widehat{B} - B_\star\| \leq \varepsilon$, and define $\zeta = \max\{1, \varepsilon\tau(A_\star, \rho) + \rho\}$. Under Assumptions 1 and 2, the certainty equivalent controller $u_t = \widehat{K} x_t$ satisfies the suboptimality gap

$$\widehat{J} - J_\star \leq O(1)\, \sigma_w^2\, d\, \ell^5\, \Gamma_\star^{15}\, \tau(A_\star, \rho)^6\, \zeta^{4(\ell-1)}\, \frac{\tau(L_\star, \gamma)^2}{1 - \gamma^2}\, \frac{\max\{\|Q\|^2, \|R\|^2\}}{\min\{\sigma_{\min}(Q)^2, \sigma_{\min}(R)^2\}} \left(1 + \frac{1}{\nu}\right)^2 \varepsilon^2, \qquad (5)$$

as long as the right hand side is smaller than $\sigma_w^2$. Here, $O(1)$ denotes a universal constant.

The exact form of Equation 5, such as the polynomial dependence on $\ell$, $\Gamma_\star$, etc., can be improved at the expense of conciseness of the expression. In our proof we optimized for the latter. The factor $\max\{\|Q\|^2, \|R\|^2\}/\min\{\sigma_{\min}(Q)^2, \sigma_{\min}(R)^2\}$ is the squared condition number of the cost function, a natural quantity in the context of the optimization problem (1), which can be seen as an infinite dimensional quadratic program with a linear constraint. The term $\tau(L_\star, \gamma)^2/(1 - \gamma^2)$ quantifies the rate at which the optimal controller drives the state towards zero.
Generally speaking, the less stable the optimal closed loop system is, the larger this term becomes.

An interesting trade-off arises between the factor $\ell^5 \zeta^{4(\ell-1)}$ (which arises from upper bounding perturbations of powers of $A_\star$ on a time interval of length $\ell$) and the factor $\nu$ (the lower bound on $\sigma_{\min}(C_\ell)$), which is increasing in $\ell$. Hence, the parameter $\ell$ should be seen as a free parameter that can be tuned to minimize the right hand side of (5). Now, we specialize Theorem 2 to a few cases.

Case: $A_\star$ is contractive, i.e. $\|A_\star\| < 1$. In this case, we can choose $\rho = \|A_\star\|$ and $\varepsilon$ small enough so that $\varepsilon \leq 1 - \|A_\star\|$. Then, (5) simplifies to:

$$\widehat{J} - J_\star \leq O(1)\, d\, \sigma_w^2\, \ell^5\, \Gamma_\star^{15}\, \frac{\tau(L_\star, \gamma)^2}{1 - \gamma^2}\, \frac{\max\{\|Q\|^2, \|R\|^2\}}{\min\{\sigma_{\min}(Q)^2, \sigma_{\min}(R)^2\}} \left(1 + \frac{1}{\nu}\right)^2 \varepsilon^2.$$

Case: $B_\star$ has rank $n$. In this case, we can choose $\ell = 1$. Then, (5) simplifies to:

$$\widehat{J} - J_\star \leq O(1)\, d\, \sigma_w^2\, \Gamma_\star^{15}\, \tau(A_\star, \rho)^6\, \frac{\tau(L_\star, \gamma)^2}{1 - \gamma^2}\, \frac{\max\{\|Q\|^2, \|R\|^2\}}{\min\{\sigma_{\min}(Q)^2, \sigma_{\min}(R)^2\}} \left(1 + \frac{1}{\nu}\right)^2 \varepsilon^2.$$

2.1 Comparison to Theorem 4.1 of Dean et al. [11].

Dean et al. [11] show that when their robust synthesis procedure is run with estimates $(\widehat{A}, \widehat{B})$ satisfying $\max\{\|\widehat{A} - A_\star\|, \|\widehat{B} - B_\star\|\} \leq \varepsilon \leq [5(1 + \|K_\star\|)\Phi_\star]^{-1}$, the resulting controller satisfies:

$$\widehat{J} - J_\star \leq 10(1 + \|K_\star\|)\,\Phi_\star\, J_\star\, \varepsilon + O(\varepsilon^2). \qquad (6)$$

Here, the quantity $\Phi_\star := \sup_{z \in \mathbb{T}} \|(zI_n - L_\star)^{-1}\|$ is the $\mathcal{H}_\infty$-norm of the optimal closed loop system $L_\star$. In order to compare Equation 6 to Equation 5, we upper bound the quantity $\Phi_\star$ in terms of $\tau(L_\star, \gamma)$ and $\gamma$. In particular, by an infinite series expansion of the inverse $(zI_n - L_\star)^{-1}$ we can show $\Phi_\star \leq \tau(L_\star, \gamma)/(1 - \gamma)$. Also, we have $J_\star = \sigma_w^2 \operatorname{tr}(P_\star) \leq \sigma_w^2\, n\, \Gamma_\star$. Therefore, Equation 6 gives us that:

$$\widehat{J} - J_\star \leq O(1)\, n\, \sigma_w^2\, \Gamma_\star^2\, \frac{\tau(L_\star, \gamma)}{1 - \gamma}\, \varepsilon + O(\varepsilon^2).$$

We see that the dependence on the parameters $\Gamma_\star$ and $\tau(L_\star, \gamma)$ is significantly milder compared to Equation 5.
Furthermore, this upper bound is valid for larger $\varepsilon$ than the upper bound given in Theorem 2. Comparing these upper bounds suggests that there is a price to pay for obtaining a fast rate, and that in regimes of moderate uncertainty (moderate size of $\varepsilon$), being robust to model uncertainty is important. This observation is supported by the empirical results of Dean et al. [11].

A similar trade-off between slow and fast rates arises in the setting of first-order convex stochastic optimization. The convergence rate $O(1/\sqrt{T})$ of the stochastic gradient descent method can be improved to $O(1/T)$ under a strong convexity assumption. However, in the strongly convex regime, where stochastic gradient descent can achieve the $O(1/T)$ rate, its performance is sensitive to poorly estimated problem parameters [29]. Similarly, in the case of LQR, the nominal controller achieves a fast rate, but it is much more sensitive to estimation error than the robust controller of Dean et al. [11].

End-to-end guarantees. Theorem 2 can be combined with finite sample learning guarantees (e.g. [11, 15, 33, 34]) to obtain an end-to-end guarantee similar to Proposition 1.2 of Dean et al. [11]. In general, estimating the transition parameters from $N$ samples yields an estimation error that scales as $O(1/\sqrt{N})$. Therefore, Theorem 2 implies that $\widehat{J} - J_\star \leq O(1/N)$ instead of the $\widehat{J} - J_\star \leq O(1/\sqrt{N})$ rate from Proposition 1.2 of Dean et al. [11]. This is similar to the case of linear regression, where $O(1/\sqrt{N})$ estimation error for the parameters translates to a $O(1/N)$ fast rate for prediction error. Furthermore, Simchowitz et al. [34] and Sarkar and Rakhlin [33] showed that faster estimation rates are possible for some linear dynamical systems. Theorem 2 translates such rates into control suboptimality guarantees in a transparent way.

Our result explains the behavior observed in Figure 4 of Dean et al. [11].
The authors propose two procedures for synthesizing robust controllers for LQR with unknown transitions: one which guarantees robustness of the performance gap $\widehat{J} - J_\star$, and one which only guarantees the stability of the closed loop system. Dean et al. [11] observed that the latter performs better in the small estimation error regime, which happens because the robustness constraint of the synthesis procedure becomes inactive when the estimation error is small enough. Then, the second robust synthesis procedure effectively outputs the certainty equivalent controller, which we now know to achieve a fast rate.

2.2 Nearly optimal $\tilde{O}(\sqrt{T})$ regret in the adaptive setting

The regret formulation of adaptive LQR was first proposed by Abbasi-Yadkori and Szepesvári [1]. The task is to design an adaptive algorithm $\{u_t\}_{t \geq 0}$ to minimize regret, as defined by $\mathrm{Regret}(T) := \sum_{t=1}^{T} (x_t^\top Q x_t + u_t^\top R u_t) - T J_\star$. Abbasi-Yadkori and Szepesvári [1] study the performance of optimism in the face of uncertainty (OFU) and show that it has $\tilde{O}(\sqrt{T})$ regret, which is nearly optimal for this problem formulation. However, the OFU algorithm requires repeated solutions to a non-convex optimization problem for which no known efficient algorithm exists.

To deal with the computational issues of OFU, Dean et al. [12] propose to analyze the behavior of $\varepsilon$-greedy exploration using the suboptimality gap results of Dean et al. [11]. In the context of continuous control, $\varepsilon$-greedy exploration refers to the application of the control law $u_t = \pi(x_t, x_{t-1}, \ldots, x_0) + \eta_t$ with $\eta_t \sim \mathcal{N}(0, \sigma_{\eta,t}^2 I_d)$, where $\pi$ is the policy, updated in epochs, and $\sigma_{\eta,t}^2$ is the variance of the exploration noise. Dean et al. [12] set the variance of the exploration noise as $\sigma_{\eta,t}^2 \propto t^{-1/3}$, and show that their method achieves $\tilde{O}(T^{2/3})$ regret. They use epochs of size $2^i$ and decompose
Since the estimation error of the\nthe regret roughly as Regret(T ) = O\u21e3T (bJ J?) + T 2\nmodel parameters scales as O((\u2318,T pT )1), and since the suboptimality gap bJ J? of the robust\ncontroller is linear in the estimation error, we have Regret(T ) = O\u21e3 pT\n\u2318,T\u2318. Then, setting\n\u2318,t \u21e0 t1/3 balances these two terms and yields eO(T 2/3) regret. However, Theorem 2, which states\nthat the gap bJ J? for the nominal controller depends quadratically on the estimation rate, implies\nthat online certainty equivalent control achieves Regret(T ) = O\u21e3 1\n\u2318,T\u2318. Here, the optimal\n\u2318,t \u21e0 t1/2, yielding eO(pT ) regret. We note that the\nobservation that certainty equivalence coupled with \"-greedy exploration achieves eO(pT ) regret was\ncertainty equivalent control yields an adaptive LQR algorithm with regret bounded as eO(pT ).\n\n\ufb01rst made by Faradonbeh et al. [16].\nCorollary 1. (Informal) \"-greedy exploration with exploration schedule 2\n\nvariance of the exploration noise scales as 2\n\n\u2318,t \u21e0 t1/2 combined with\n\n3 Main Results for the Linear Quadratic Gaussian Problem\n\n+ T 2\n\n2\n\n\u2318,T\n\n+ T 2\n\n2\n\n\u2318,T\n\nNow we consider partially observable systems. In this case the system dynamics have the form:\n\nxt+1 = A?xt + B?ut + wt , wt \u21e0N (0, 2\n\nyt = C?xt + vt , vt \u21e0N (0, 2\n\nvI) .\n\nwI) ,\n\nIn (7), only the output process yt is observed. The LQG problem is de\ufb01ned as1:\n\nmin\n\nu0,u1,...\n\nlim\nT!1\n\nE\" 1\n\nT\n\nTXt=0\n\ny>t Qyt + u>t Rut# s.t. (7a), (7b) .\n\n(7a)\n(7b)\n\n(8)\n\n1Note that many texts de\ufb01ne the LQG cost in terms of xT\n\nt Qxt instead of yT\n\nt Qyt. We choose the latter\n\nbecause we do not want the cost to be tied to a particular (unknown) state representation.\n\n5\n\n\fHere, the input ut is allowed to depend on the history2 Ht := (u0, ..., ut1, y0, ..., yt1). The opti-\nmal solution to (8) is to set ut = K?bxt, with K? 
the optimal LQR solution to $(A_\star, B_\star, C_\star^\top Q C_\star, R)$ and $\widehat{x}_t := \mathbb{E}[x_t \mid H_t]$. The MSE estimate $\widehat{x}_t$ can be computed efficiently via Kalman filtering:

$$\widehat{x}_{t+1} = A_\star \widehat{x}_t + B_\star u_t + L_\star(y_t - C_\star \widehat{x}_t), \qquad (9a)$$
$$L_\star = A_\star \Sigma_\star C_\star^\top (C_\star \Sigma_\star C_\star^\top + \sigma_v^2 I)^{-1}, \qquad (9b)$$
$$\Sigma_\star = A_\star \Sigma_\star A_\star^\top + \sigma_w^2 I - A_\star \Sigma_\star C_\star^\top (C_\star \Sigma_\star C_\star^\top + \sigma_v^2 I)^{-1} C_\star \Sigma_\star A_\star^\top. \qquad (9c)$$

There is an inherent ambiguity in the dynamics (7a)-(7b) which makes LQG more delicate than LQR. In particular, for any invertible $T$, the LQG problem (8) with parameters $(A_\star, B_\star, C_\star, Q, R)$ is equivalent to the LQG problem with parameters $(T A_\star T^{-1}, T B_\star, C_\star T^{-1}, Q, R)$ and appropriately rescaled noise processes. To deal with this ambiguity, we assume that we have estimates $(\widehat{A}, \widehat{B}, \widehat{C}, \widehat{L})$ such that there exists a unitary $T$ such that:

$$\max\{\|\widehat{A} - T A_\star T^{-1}\|, \|\widehat{B} - T B_\star\|, \|\widehat{C} - C_\star T^{-1}\|, \|\widehat{L} - T L_\star\|\} \leq \varepsilon. \qquad (10)$$

Recent work [32, 35, 38] has shown how to obtain this style of estimates with guarantees from input/output data. As in Section 2, we assume that the cost matrices $(Q, R)$ are known. Then, we study the performance of the certainty equivalence controller defined by:

$$\widehat{x}_{t+1} = \widehat{A} \widehat{x}_t + \widehat{B} u_t + \widehat{L}(y_t - \widehat{C} \widehat{x}_t), \quad u_t = \widehat{K} \widehat{x}_t, \quad \widehat{K} = \mathrm{LQR}(\widehat{A}, \widehat{B}, \widehat{C}^\top Q \widehat{C}, R). \qquad (11)$$

Similarly to Theorem 1 for LQR, we state a meta theorem for LQG. Unlike Theorem 1, however, we need a stronger type of Riccati perturbation guarantee which also allows for perturbation of the $Q$ matrix. Specifically, we suppose there exists $\Delta_0 \geq 0$ such that for any $\Delta \leq \Delta_0$ and $(\widehat{A}, \widehat{B}, \widehat{Q})$ with $\max\{\|\widehat{A} - A\|, \|\widehat{B} - B\|, \|\widehat{Q} - Q\|\} \leq \Delta$, the solutions $P$ and $\widehat{P}$ of the Riccati equations with parameters $(A, B, Q, R)$ and $(\widehat{A}, \widehat{B}, \widehat{Q}, R)$ satisfy

$$\|P - \widehat{P}\| \leq f(\Delta), \qquad (12)$$

for an increasing function $f$ with $f(\Delta) \geq \Delta$. The constant $\Delta_0$ and function $f$ are allowed to depend on the parameters $(A, B, Q, R)$. In Section 4, we present a perturbation bound (Proposition 1) that satisfies these properties. Similarly to Section 2, we define $\Gamma_\star := 1 + \max\{\|A_\star\|, \|B_\star\|, \|C_\star\|, \|K_\star\|, \|L_\star\|, \|P_\star\|\}$.
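For concreteness (an illustrative sketch with arbitrary matrices, not code from the paper), the steady-state covariance $\Sigma_\star$ in (9c) is the solution of a Riccati equation for the "dual" system $(A_\star^\top, C_\star^\top)$, so it and the Kalman gain $L_\star$ in (9b) can be obtained with the same standard solver used for LQR:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative partially observed system (arbitrary values, not from the paper).
A = np.array([[0.95, 0.2], [0.0, 0.9]])
C = np.array([[1.0, 0.0]])
sigma_w, sigma_v = 1.0, 0.5
W = sigma_w**2 * np.eye(2)  # process noise covariance
V = sigma_v**2 * np.eye(1)  # observation noise covariance

# Equation (9c) is the filter Riccati equation; it coincides with the control
# Riccati equation for the dual system (A', C'), so solve_discrete_are applies:
#   Sigma = A Sigma A' + W - A Sigma C' (C Sigma C' + V)^{-1} C Sigma A'
Sigma = solve_discrete_are(A.T, C.T, W, V)

# Kalman gain from (9b): L = A Sigma C' (C Sigma C' + sigma_v^2 I)^{-1}.
L = A @ Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T + V)

# The estimation error dynamics A - LC are stable when (C, A) is observable.
print(max(abs(np.linalg.eigvals(A - L @ C))))
```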
The following theorem is our main result for LQG.

Theorem 3. Suppose that $(A_\star, B_\star)$ is stabilizable, $(C_\star, A_\star)$ is observable, and that Assumption 1 holds. Let $\varepsilon$ be an upper bound on $\|\widehat{A} - T A_\star T^{-1}\|$, $\|\widehat{B} - T B_\star\|$, $\|\widehat{C} - C_\star T^{-1}\|$, and $\|\widehat{L} - T L_\star\|$ for some unitary transformation $T$. Suppose that assumption (12) holds with parameters $(T A_\star T^{-1}, T B_\star, T^{-\top} C_\star^\top Q C_\star T^{-1}, R)$ and that $\varepsilon$ is sufficiently small so that $3\|C_\star\|_+\|Q\|_+\varepsilon \leq \Delta_0$ and $\bar{\varepsilon} \leq 1$, where $\bar{\varepsilon} := \frac{7\Gamma_\star^3}{\sigma_{\min}(R)} f(3\|C_\star\|_+^2\|Q\|_+\varepsilon)$ and $\|\cdot\|_+ := \|\cdot\| + 1$. Let $\widehat{K}$ be defined as in (11), and define $N_\star$ as

$$N_\star := \begin{bmatrix} A_\star + B_\star K_\star & B_\star K_\star \\ 0 & A_\star - L_\star C_\star \end{bmatrix}, \qquad (13)$$

where the pair $(K_\star, L_\star)$ is optimal for the LQG problem defined by $(A_\star, B_\star, C_\star, Q, R)$. Let $\gamma > 0$ be such that $\rho(N_\star) \leq \gamma < 1$. Then as long as $\bar{\varepsilon} \leq \frac{1 - \gamma}{20\,\Gamma_\star\,\tau(N_\star, \gamma)}$, the interconnection of (11) with (7) using $(\widehat{A}, \widehat{B}, \widehat{C}, \widehat{K}, \widehat{L})$ is stable. Furthermore, the cost $J(\widehat{A}, \widehat{B}, \widehat{C}, \widehat{K}, \widehat{L})$ satisfies:

$$J(\widehat{A}, \widehat{B}, \widehat{C}, \widehat{K}, \widehat{L}) - J_\star \leq O(1)\, \max\{\sigma_w^2, \sigma_v^2\}\, \left(\operatorname{tr}(C_\star^\top Q C_\star) + \operatorname{tr}(R)\right)\, \Gamma_\star^6\, \frac{\tau(N_\star, \gamma)^6}{(1 - \gamma^2)^3}\, \bar{\varepsilon}^2.$$

The proof of Theorem 3 appears in Appendix F. We note that such a $\gamma$ exists since $\rho(N_\star) < 1$; by the stability and observability assumptions in Theorem 3, we have that both $A_\star + B_\star K_\star$ and $A_\star - L_\star C_\star$ are stable (c.f. Appendix E of Kailath et al. [24]). Theorem 3 is a meta-theorem showing how perturbation bounds on the solution of Riccati equations translate into suboptimality bounds on the performance of certainty equivalent control. Combining Theorem 3 with Proposition 1, we have the following explicit result, an analogue of Theorem 2 for LQG. To simplify notation we denote by $\mathrm{dare}(A, B, Q, R)$ the solution to the discrete algebraic Riccati equation defined by the parameters $A$, $B$, $Q$, and $R$.

²The one step delay in $y_t$ is a standard assumption in controls which slightly simplifies the Kalman filtering expressions. Our results generalize to the setting where the history also contains the current observation $y_t$.
Theorem 4. Suppose that $(A_\star, B_\star)$ is stabilizable, $(C_\star, A_\star)$ is observable, and that Assumption 1 holds. Let $\varepsilon$ be an upper bound on $\|\widehat{A} - T A_\star T^{-1}\|$, $\|\widehat{B} - T B_\star\|$, $\|\widehat{C} - C_\star T^{-1}\|$, and $\|\widehat{L} - T L_\star\|$ for some unitary transformation $T$. Let $P_\star = \mathrm{dare}(A_\star, B_\star, C_\star^\top Q C_\star, R)$ and suppose that $\sigma_{\min}(P_\star) \geq 1$. Let $N_\star$ be as in (13) and fix $\gamma$ such that $\rho(N_\star) \leq \gamma < 1$. As long as $\varepsilon$ satisfies $\varepsilon \leq \frac{(1 - \gamma^2)^2}{\Gamma_\star^{11}\,\|Q\|\,\tau(N_\star, \gamma)^4}$, we have the following sub-optimality bound:

$$J(\widehat{A}, \widehat{B}, \widehat{C}, \widehat{K}, \widehat{L}) - J_\star \leq O(1)\, \max\{\sigma_w^2, \sigma_v^2\}\, \left(\operatorname{tr}(C_\star^\top Q C_\star) + \operatorname{tr}(R)\right)\, \frac{\|Q\|^2}{\sigma_{\min}(R)^2}\, \Gamma_\star^{26}\, \frac{\tau(N_\star, \gamma)^{10}}{(1 - \gamma^2)^5}\, \varepsilon^2.$$

Several remarks are in order. First, the assumption that $\sigma_{\min}(P_\star) \geq 1$ is without loss of generality, since we can always rescale $Q$ and $R$ without affecting the control solution. Next, we compare our results here to a classic result from Doyle [13], which states that there are no gain margins for LQG. We remark that the notion of a gain margin is a robustness property that holds uniformly over a class of perturbations of varying degree. Our results do not hold uniformly; we use quantities such as $\tau(N_\star, \gamma)$ and $\Gamma_\star$ to quantify how much mismatch a given LQG instance can tolerate.

4 Riccati Perturbation Theory

As discussed in Sections 2 and 3, a key piece of our analysis is bounding the change in solutions to discrete Riccati equations as we perturb the problem parameters. Specifically, we are interested in quantities $b$, $L$ such that $\|\widehat{P} - P_\star\| \leq L\varepsilon$ if $\varepsilon < b$, where $\varepsilon$ represents a bound on the perturbation. We note that it is not possible to find universal values $b$, $L$. Consider the systems $(A_\star, B_\star) = (1, \varepsilon)$ and $(\widehat{A}, \widehat{B}) = (1, 0)$; the latter system is not stabilizable and hence $\widehat{P}$ does not even exist.
Therefore, $b$ and $L$ must depend on the system parameters. While there is a long line of work analyzing perturbations of Riccati equations, we are not aware of any result that offers explicit and easily interpretable $b$ and $L$ for a fixed $(A_\star, B_\star, Q, R)$; see Konstantinov et al. [26] for an overview of this literature. In this section, we present two new results for Riccati perturbation which offer interpretable bounds. The first one expands upon the operator-theoretic proof of Konstantinov et al. [25]; its proof can be found in Appendix B.1. In this result we assume the cost matrix $Q$ can also be perturbed, which is needed for our LQG guarantee. In order to be consistent we denote the true cost function by $Q_\star$ and the estimated one by $\widehat{Q}$.

Proposition 1. Let $\gamma \geq \rho(L_\star)$ and also let $\varepsilon$ be such that $\|\widehat{A} - A_\star\|$, $\|\widehat{B} - B_\star\|$, and $\|\widehat{Q} - Q_\star\|$ are at most $\varepsilon$. Let $\|\cdot\|_+ = \|\cdot\| + 1$. We assume that $R \succ 0$, $(A_\star, B_\star)$ is stabilizable, $(Q^{1/2}, A_\star)$ is observable, and $\sigma_{\min}(P_\star) \geq 1$. Then,

$$\|\widehat{P} - P_\star\| \leq O(1)\, \varepsilon\, \frac{\tau(L_\star, \gamma)^2}{1 - \gamma^2}\, \|A_\star\|_+^2\, \|P_\star\|_+^2\, \|B_\star\|_+\, \|R^{-1}\|_+,$$

as long as

$$\varepsilon \leq O(1)\, \frac{(1 - \gamma^2)^2}{\tau(L_\star, \gamma)^4}\, \|A_\star\|_+^{-2}\, \|P_\star\|_+^{-2}\, \|B_\star\|_+^{-3}\, \|R^{-1}\|_+^{-2}\, \min\{\|L_\star\|_+^{-2}, \|P_\star\|_+^{-1}\}.$$

We note that the assumption $\sigma_{\min}(P_\star) \geq 1$ can be made without loss of generality when the other assumptions are satisfied. Since $R \succ 0$ and $(Q^{1/2}, A)$ is observable, the value function matrix $P_\star$ is guaranteed to be positive definite. Then, by rescaling $Q$ and $R$ we can ensure that $\sigma_{\min}(P_\star) \geq 1$.

We now present our direct approach, which uses Assumption 2 to give a bound which is sharper for some systems $(A_\star, B_\star)$ than the one provided by Proposition 1. Recall that any controllable system is always $(\ell, \nu)$-controllable for some $\ell$ and $\nu$.

Proposition 2. Let $\rho \geq \rho(A_\star)$ and also let $\varepsilon \geq 0$ be such that $\|\widehat{A} - A_\star\| \leq \varepsilon$ and $\|\widehat{B} - B_\star\| \leq \varepsilon$. Let $\zeta := \max\{1, \varepsilon\tau(A_\star, \rho) + \rho\}$.
Under Assumptions 1 and 2 we have

$$\|\widehat{P} - P_\star\| \leq 32\, \varepsilon\, \ell^{5/2}\, \tau(A_\star, \rho)^3\, \zeta^{2(\ell-1)} \left(1 + \frac{1}{\nu}\right) (1 + \|B_\star\|)^2\, \|P_\star\|\, \frac{\max\{\|Q\|, \|R\|\}}{\min\{\sigma_{\min}(R), \sigma_{\min}(Q)\}},$$

as long as $\varepsilon$ is small enough so that the right hand side is smaller or equal than one.

The proof of this result is deferred to Appendix B.2. We note that Proposition 2 can also be extended to handle perturbations in the cost matrix $Q$. Proposition 2 requires an $(\ell, \nu)$-controllable system $(A_\star, B_\star)$, whereas Proposition 1 only requires a stabilizable system, which is a milder assumption. However, Proposition 2 can offer a sharper guarantee. For example, consider the linear system with two dimensional states ($n = 2$) given by $A_\star = 1.01 \cdot I_2$ and $B_\star = \begin{bmatrix} 1 & 0 \\ 0 & \nu \end{bmatrix}$. Both $Q$ and $R$ are chosen to be the identity matrix $I_2$. This system $(A_\star, B_\star)$ is readily checked to be $(1, \nu)$-controllable. It is also straightforward to verify that as $\nu$ tends to zero, Proposition 1 gives a bound of $\|\widehat{P} - P_\star\| = O(\varepsilon/\nu^4)$, whereas Proposition 2 gives a sharper bound of $\|\widehat{P} - P_\star\| = O(\varepsilon/\nu^3)$.

5 Related Work

For the offline LQR batch setting, Fiechter [18] proved that the sub-optimality gap $\widehat{J} - J_\star$ scales as $O(\varepsilon)$ for certainty equivalent control. A crucial assumption of his analysis is that the nominal controller stabilizes the true unknown system. We give bounds on when this assumption is valid. Recently, Dean et al. [11] proposed a robust controller synthesis procedure which takes model uncertainty into account and whose suboptimality gap scales as $O(\varepsilon)$. Tu and Recht [39] show that the gap $\widehat{J} - J_\star$ of certainty equivalent control scales asymptotically as $O(\varepsilon^2)$; we provide a non-asymptotic analogue of this result. Fazel et al. [17] and Malik et al. [27] analyze a model-free approach to policy optimization for LQR, in which the controller is directly optimized from sampled rollouts. Malik et al.
[27] showed that, after collecting $N$ rollouts, a derivative-free method achieves a discounted cost gap that scales as $\mathcal{O}(1/\sqrt{N})$ or $\mathcal{O}(1/N)$, depending on the oracle model used.
In the online LQR adaptive setting it is well understood that using the certainty equivalence principle without adequate exploration can result in a lack of parameter convergence [see e.g. 5]. Abbasi-Yadkori and Szepesvári [1] showed that optimism in the face of uncertainty (OFU), when applied to online LQR, yields $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Faradonbeh et al. [14] removed some unnecessary assumptions of the previous analysis. Ibrahimi et al. [22] showed that when the underlying system is sparse, the dimension-dependent constants in the regret bound can be improved. The main issue with OFU for LQR is that there are no known computationally tractable ways of implementing it. In order to deal with this, both Dean et al. [12] and Abbasi-Yadkori et al. [2] propose polynomial-time algorithms for adaptive LQR based on $\varepsilon$-greedy exploration which achieve $\widetilde{\mathcal{O}}(T^{2/3})$ regret. Only recently has progress been made on offering $\widetilde{\mathcal{O}}(\sqrt{T})$ regret guarantees for computationally tractable algorithms. Abeille and Lazaric [3] show that Thompson sampling achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ (frequentist) regret for the case when the state and inputs are both scalars. In a Bayesian setting, Ouyang et al. [31] showed that Thompson sampling achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ expected regret. Faradonbeh et al. [16] argue that certainty equivalence control with an epsilon-greedy-like scheme achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret, though their work does not provide any explicit dependencies on instance parameters. Finally, Cohen et al. [10] also give an efficient algorithm based on semidefinite programming that achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret. Their main result requires the initial parameter error to scale as $\mathcal{O}(1/T^{1/4})$.
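As a quick numerical sanity check (not part of our formal results), the linear dependence of the Riccati perturbation $\|\widehat{P} - P_\star\|$ on the parameter error $\varepsilon$ predicted by Propositions 1 and 2 can be observed directly. The sketch below assumes SciPy's `solve_discrete_are` and uses a small unstable system in the spirit of the two-dimensional example above (with $\gamma$ set to 1); the random perturbation directions are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_discrete_are


def dare_gap(A, B, Q, R, eps, seed=0):
    """Operator-norm error of the DARE solution when (A, B) is replaced by a
    random perturbation of operator norm eps in each factor."""
    rng = np.random.default_rng(seed)
    P_star = solve_discrete_are(A, B, Q, R)
    dA = rng.standard_normal(A.shape)
    dA *= eps / np.linalg.norm(dA, 2)   # rescale to spectral norm eps
    dB = rng.standard_normal(B.shape)
    dB *= eps / np.linalg.norm(dB, 2)
    P_hat = solve_discrete_are(A + dA, B + dB, Q, R)
    return np.linalg.norm(P_hat - P_star, 2)


# Unstable but easily stabilizable two-dimensional system.
A = 1.01 * np.eye(2)
B, Q, R = np.eye(2), np.eye(2), np.eye(2)

# Each tenfold decrease in eps should shrink the gap roughly tenfold,
# matching the O(eps) scaling of the perturbation bounds.
for eps in (1e-2, 1e-3, 1e-4):
    print(eps, dare_gap(A, B, Q, R, eps))
```

For ill-conditioned instances such as the example above with $\gamma \to 0$, the constant multiplying $\varepsilon$ blows up, which is exactly the regime where the two propositions give different rates.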
While Cohen et al. [10] propose an $\mathcal{O}(\sqrt{T})$-length warmup period to get around this requirement, our analysis of $\varepsilon$-greedy control does not require $o_T(1)$ accuracy of the initial parameters. Moreover, there are specialized algorithms for solving Riccati equations that are more efficient than general semidefinite programming solvers.
The literature for LQG is less complete, with most of the focus on the estimation side. Hardt et al. [19] show that gradient descent can be used to learn a model with good predictive performance, under strong technical assumptions on the $A$ matrix. A line of work [20, 21] has focused on using spectral filtering techniques to learn a predictive model with low regret. Beyond predictive performance, several works [32, 35, 38] show how to learn the system dynamics up to a similarity transform from input/output data. Finally, we remark that Boczar et al. [8] give sub-optimality guarantees for output-feedback control of a single-input-single-output (SISO) linear system with no process noise.
A key part of our analysis involves bounding the perturbation of solutions to the discrete algebraic Riccati equation. While there is a rich line of work studying perturbations of Riccati equations [25, 26, 36, 37], the results in the literature are either asymptotic in nature or difficult to use and interpret. We clarify the operator-theoretic result of Konstantinov et al. [25] and provide an explicit upper bound on the perturbation based on their proof strategy. We also take a new, direct approach and use an extended notion of controllability to give a constructive and simpler result. While the result of Konstantinov et al. [25] applies more generally to systems that are merely stabilizable, we give examples of linear systems for which our new perturbation result is tighter.
Finally, while we focus on a continuous control problem, we note that the performance of certainty equivalence has been studied in the context of tabular MDPs; e.g., Azar et al.
[7] derived matching upper and lower bounds on the performance of value iteration and policy iteration with estimated transition probabilities.

6 Conclusion

Though a naïve Taylor expansion suggests that the fast rates we derive here must be achievable, precisely computing such rates has been open since the 80s. All of the pieces we used here have existed in the literature for some time, and perhaps it has just required a bit of time to align contemporary rate-analyses in learning theory with earlier operator-theoretic work in optimal control. There remain many possible extensions to this work. The robust control approach of Dean et al. [11] applies to many different objective functions besides quadratic costs, such as $\mathcal{H}_\infty$ and $L_1$ control. It would be interesting to know whether fast rates for control are possible for other objective functions. Finally, determining the optimal minimax rate for both LQR and LQG would allow us to understand the tradeoffs between nominal and robust control at a more fine-grained level.

Acknowledgements

We thank the anonymous reviewers for their valuable feedback. We also thank Elad Hazan and Martin Wainwright, who both independently asked whether or not it was possible to show a fast rate for LQR. As part of the RISE lab, HM is generally supported in part by NSF CISE Expeditions Award CCF-1730628, DHS Award HSHQDC-16-3-00083, and gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM, Microsoft, Scotiabank, Splunk and VMware. ST is supported by a Google PhD fellowship. BR is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, a Siemens Futuremakers Fellowship, and an Amazon AWS AI Research Award.

References

[1] Y. Abbasi-Yadkori and C. Szepesvári.
Regret Bounds for the Adaptive Control of Linear Quadratic Systems. In Conference on Learning Theory, 2011.

[2] Y. Abbasi-Yadkori, N. Lazić, and C. Szepesvári. Model-Free Linear Quadratic Control via Reduction to Expert Prediction. In AISTATS, 2019.

[3] M. Abeille and A. Lazaric. Improved Regret Bounds for Thompson Sampling in Linear Quadratic Control Problems. In International Conference on Machine Learning, 2018.

[4] B. D. O. Anderson and J. B. Moore. Optimal Control: Linear Quadratic Methods. 2007.

[5] K. J. Åström and B. Wittenmark. On Self Tuning Regulators. Automatica, 9:185–199, 1973.

[6] K. J. Åström and B. Wittenmark. Adaptive Control. 2013.

[7] M. G. Azar, R. Munos, and H. J. Kappen. Minimax PAC Bounds on the Sample Complexity of Reinforcement Learning with a Generative Model. Machine Learning, 91(3):325–349, 2013.

[8] R. Boczar, N. Matni, and B. Recht. Finite-Data Performance Guarantees for the Output-Feedback Control of an Unknown System. In 57th IEEE Conference on Decision and Control, 2018.

[9] A. Cohen, A. Hassidim, T. Koren, N. Lazic, Y. Mansour, and K. Talwar. Online Linear Quadratic Control. In International Conference on Machine Learning, 2018.

[10] A. Cohen, T. Koren, and Y. Mansour. Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret. arXiv:1902.06223, 2019.

[11] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the Sample Complexity of the Linear Quadratic Regulator. arXiv:1710.01688, 2017.

[12] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator. In Neural Information Processing Systems, 2018.

[13] J. C. Doyle. Guaranteed Margins for LQG Regulators. IEEE Transactions on Automatic Control, 23(4):756–757, 1978.

[14] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis.
Optimism-Based Adaptive Regulation of Linear-Quadratic Systems. arXiv:1711.07230, 2017.

[15] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis. Finite Time Identification in Unstable Linear Systems. Automatica, 96:342–353, 2018.

[16] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis. Input Perturbations for Adaptive Regulation and Learning. arXiv:1811.04258, 2018.

[17] M. Fazel, R. Ge, S. M. Kakade, and M. Mesbahi. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.

[18] C.-N. Fiechter. PAC Adaptive Control of Linear Systems. In Conference on Learning Theory, 1997.

[19] M. Hardt, T. Ma, and B. Recht. Gradient Descent Learns Linear Dynamical Systems. Journal of Machine Learning Research, 19(29):1–44, 2018.

[20] E. Hazan, K. Singh, and C. Zhang. Learning Linear Dynamical Systems via Spectral Filtering. In Neural Information Processing Systems, 2017.

[21] E. Hazan, H. Lee, K. Singh, C. Zhang, and Y. Zhang. Spectral Filtering for General Linear Dynamical Systems. In Neural Information Processing Systems, 2018.

[22] M. Ibrahimi, A. Javanmard, and B. V. Roy. Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems. In Neural Information Processing Systems, 2012.

[23] G. N. Iyengar. Robust Dynamic Programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[24] T. Kailath, A. H. Sayed, and B. Hassibi. Linear Estimation. 2000.

[25] M. M. Konstantinov, P. H. Petkov, and N. D. Christov. Perturbation Analysis of the Discrete Riccati Equation. Kybernetika, 29(1):18–29, 1993.

[26] M. M. Konstantinov, D.-W. Gu, V. Mehrmann, and P. H. Petkov. Perturbation Theory for Matrix Equations, volume 9. Gulf Professional Publishing, 2003.

[27] D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. L. Bartlett, and M. J. Wainwright.
Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems. In AISTATS, 2019.

[28] H. Mania, S. Tu, and B. Recht. Certainty Equivalence is Efficient for Linear Quadratic Control. arXiv:1902.07826, 2019.

[29] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust Stochastic Approximation Approach to Stochastic Programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

[30] A. Nilim and L. El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53(5):780–798, 2005.

[31] Y. Ouyang, M. Gagrani, and R. Jain. Control of Unknown Linear Systems with Thompson Sampling. In Allerton, 2017.

[32] S. Oymak and N. Ozay. Non-asymptotic Identification of LTI Systems from a Single Trajectory. arXiv:1806.05722, 2018.

[33] T. Sarkar and A. Rakhlin. Near Optimal Finite Time Identification of Arbitrary Linear Dynamical Systems. In International Conference on Machine Learning, 2019.

[34] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht. Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification. In Conference on Learning Theory, 2018.

[35] M. Simchowitz, R. Boczar, and B. Recht. Learning Linear Dynamical Systems with Semi-Parametric Least Squares. In Conference on Learning Theory, 2019.

[36] J.-g. Sun. Perturbation Theory for Algebraic Riccati Equations. SIAM Journal on Matrix Analysis and Applications, 19(1):39–65, 1998.

[37] J.-g. Sun. Sensitivity Analysis of the Discrete-Time Algebraic Riccati Equation. Linear Algebra and its Applications, 275:595–615, 1998.

[38] A. Tsiamis and G. J. Pappas. Finite Sample Analysis of Stochastic System Identification. arXiv:1903.09122, 2019.

[39] S. Tu and B. Recht. The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint.
In Conference on Learning Theory, 2019.

[40] S. Tu, R. Boczar, A. Packard, and B. Recht. Non-Asymptotic Analysis of Robust Control from Coarse-Grained Identification. arXiv:1707.04791, 2017.

[41] H. Xu and S. Mannor. Distributionally Robust Markov Decision Processes. Mathematics of Operations Research, 37(2):288–300, 2012.

[42] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control. 1995.