Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator

Advances in Neural Information Processing Systems, pp. 4188-4197

Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, Stephen Tu
University of California, Berkeley

Abstract

We consider adaptive control of the Linear Quadratic Regulator (LQR), where an unknown linear system is controlled subject to quadratic costs. Leveraging recent developments in the estimation of linear systems and in robust controller synthesis, we present the first provably polynomial time algorithm that provides high probability guarantees of sub-linear regret on this problem. We further study the interplay between regret minimization and parameter estimation by proving a lower bound on the expected regret in terms of the exploration schedule used by any algorithm.
Finally, we conduct a numerical study comparing our robust adaptive algorithm to other methods from the adaptive LQR literature, and demonstrate the flexibility of our proposed method by extending it to a demand forecasting problem subject to state constraints.

1 Introduction

The problem of adaptively controlling an unknown dynamical system has a rich history, with classical asymptotic results of convergence and stability dating back decades [12, 13]. Of late, there has been a renewed interest in the study of a particular instance of such problems, namely the adaptive Linear Quadratic Regulator (LQR), with an emphasis on non-asymptotic guarantees of stability and performance. Initiated by Abbasi-Yadkori and Szepesvári [1], there have since been several works analyzing the regret suffered by various adaptive algorithms on LQR; here the regret incurred by an algorithm is thought of as a measure of deviations in performance from optimality over time. These results can be broadly divided into two categories: those providing high-probability guarantees for a single execution of the algorithm [1, 4, 8, 11], and those providing bounds on the expected Bayesian regret incurred over a family of possible systems [2, 16]. As we discuss in more detail, these methods all suffer from one or several of the following limitations: restrictive and unverifiable assumptions, limited applicability, and computationally intractable subroutines. In this paper, we provide, to the best of our knowledge, the first polynomial-time algorithm for the adaptive LQR problem that provides high probability guarantees of sub-linear regret, and that does not require unverifiable or unrealistic assumptions.

Related Work. There is a rich body of work on the estimation of linear systems as well as on the robust and adaptive control of unknown systems.
We target our discussion to works on non-asymptotic guarantees for the LQR control of an unknown system, broadly divided into three categories.

Offline estimation and control synthesis: In a non-adaptive setting, i.e., when system identification can be done offline prior to controller synthesis and implementation, the first work to provide end-to-end guarantees for the LQR optimal control problem is that of Fiechter [10], who shows that the discounted LQR problem is PAC-learnable. Dean et al. [6] improve on this result, and provide the first end-to-end sample complexity guarantees for the infinite horizon average cost LQR problem.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Optimism in the Face of Uncertainty (OFU): Abbasi-Yadkori and Szepesvári [1], Faradonbeh et al. [8], and Ibrahimi et al. [11] employ the Optimism in the Face of Uncertainty (OFU) principle [5], which optimistically selects model parameters from a confidence set by choosing those that lead to the best closed-loop (infinite horizon) control performance, and then plays the corresponding optimal controller, repeating this process online as the confidence set shrinks. While OFU in the LQR setting has been shown to achieve optimal regret Õ(√T), its implementation requires solving a non-convex optimization problem to precision Õ(T^{-1/2}), for which no provably efficient implementation exists.

Thompson Sampling (TS): To circumvent the computational roadblock of OFU, recent works replace the intractable OFU subroutine with a random draw from the model uncertainty set, resulting in Thompson Sampling (TS) based policies [2, 4, 16]. Abeille and Lazaric [4] show that such a method achieves Õ(T^{2/3}) regret with high probability for scalar systems. However, their proof does not extend to the non-scalar setting. Abbasi-Yadkori and Szepesvári [2] and Ouyang et al. [16] consider expected regret in a Bayesian setting, and provide TS methods which achieve Õ(√T) regret. Although not directly comparable to our result, we remark on the computational challenges of these algorithms. Whereas the proof of Abbasi-Yadkori and Szepesvári [2] was shown to be incorrect [15], Ouyang et al. [16] make the restrictive assumption that there exists a (known) initial compact set Θ describing the uncertainty in the system parameters, such that for any system θ_1 ∈ Θ, the optimal controller K(θ_1) is stabilizing when applied to any other system θ_2 ∈ Θ. No means of constructing such a set are provided, and there is no known tractable algorithm to verify if a given set satisfies this property. Also, it is implicitly assumed that projecting onto this set can be done efficiently.

Contributions. To develop the first polynomial-time algorithm that provides high probability guarantees of sub-linear regret, we leverage recent results from the estimation of linear systems [17], robust controller synthesis [14, 19], and coarse-ID control [6]. We show that our robust adaptive control algorithm: (i) guarantees stability and near-optimal performance at all times; (ii) achieves a regret up to time T bounded by Õ(T^{2/3}); and (iii) is based on finite-dimensional semidefinite programs of size logarithmic in T. Furthermore, our method estimates the system parameters at a Õ(T^{-1/3}) rate in operator norm.

Although system parameter identification is not necessary for optimal control performance, an accurate system model is often desirable in practice. Motivated by this, we study the interplay between regret minimization and parameter estimation, and identify fundamental limits connecting the two.
We show that the expected regret of our algorithm is lower bounded by Ω(T^{2/3}), proving that our analysis is sharp up to logarithmic factors. Moreover, our lower bound suggests that the estimation rate achievable by any algorithm with O(T^α) regret is Ω(T^{-α/2}).

Finally, we conduct a numerical study of the adaptive LQR problem, in which we implement our algorithm, and compare its performance to heuristic implementations of OFU and TS based methods. We show on several examples that the regret incurred by our algorithm is comparable to that of the OFU and TS based methods. Furthermore, the infinite horizon cost achieved by our algorithm at any given time on the true system is consistently lower than that attained by OFU and TS based algorithms. Finally, we use a demand forecasting example to show how our algorithm naturally generalizes to incorporate environmental uncertainty and safety constraints. The full version of this paper is [7].

2 Problem Statement and Preliminaries

In this work we consider adaptive control of the following discrete-time linear system

    x_{k+1} = A_* x_k + B_* u_k + w_k ,   w_k ~ i.i.d. N(0, σ_w² I) ,   (2.1)

where x_k ∈ R^n is the state, u_k ∈ R^p is the control input, and w_k ∈ R^n is the process noise. We assume that the state variables are observed exactly and, for simplicity, that x_0 = 0. We consider the Linear Quadratic Regulator optimal control problem, given by cost matrices Q ⪰ 0 and R ≻ 0,

    J_* = min_u lim_{T→∞} E[ (1/T) Σ_{k=1}^T x_k^T Q x_k + u_k^T R u_k ]   s.t. dynamics (2.1) ,   (2.2)

where the minimum is taken over measurable functions u = {u_k(·)}_{k≥1}, with each u_k adapted to the history x_k, x_{k-1}, ..., x_1, and possible additional randomness independent of future states. Given
Given\nknowledge of (A(cid:63), B(cid:63)), the optimal policy is a static state-feedback law uk = K(cid:63)xk, where K(cid:63) is\nderived from the solution to a discrete algebraic Riccati equation.\nWe are interested in algorithms which operate without knowledge of the true system transition\nmatrices (A(cid:63), B(cid:63)). We measure the performance of such algorithms via their regret, de\ufb01ned as\n\nRegret(T ) :=\n\n(x(cid:62)\n\nk Qxk + u(cid:62)\n\nk Ruk \u2212 J(cid:63)) .\n\n(2.3)\n\nThe regret of any algorithm is lower-bounded by \u2126(\u221aT ), a bound matched by OFU up to logarithmic\nfactors [8]. However, after each epoch, OFU requires optimizing a non-convex objective to O(T \u22121/2)\nprecision. Instead, our method uses a subroutine based on quasi-convex optimization and robust\ncontrol.\n\n2.1 Preliminaries: System Level Synthesis\n\nWe brie\ufb02y describe the necessary background on robust control and System Level Synthesis [19]\n(SLS). These tools were recently used by Dean et al. [6] to provide non-asymptotic bounds for LQR\nin the of\ufb02ine \u201cestimate-and-then-control\u201d setting. In the appendix of the full version [7] we expand\non these preliminaries.\nConsider the dynamics (2.1), and \ufb01x a static state-feedback control policy K, i.e., let uk = Kxk.\nThen, the closed loop map from the disturbance process {w0, w1, . . .} to the state xk and control\ninput uk at time k is given by\n\nLetting \u03a6x(k) := (A(cid:63) + B(cid:63)K)k\u22121 and \u03a6u(k) := K(A(cid:63) + B(cid:63)K)k\u22121, we can rewrite Eq. (2.4) as\n\nxk = (cid:80)k\nuk = (cid:80)k\nt=1(A(cid:63) + B(cid:63)K)k\u2212twt\u22121 ,\nt=1 K(A(cid:63) + B(cid:63)K)k\u2212twt\u22121 .\n(cid:20)xk\nk(cid:88)\n\n(cid:20)\u03a6x(k \u2212 t + 1)\n\nwt\u22121 ,\n\n(cid:21)\n\n(cid:21)\n\n=\n\nuk\n\n\u03a6u(k \u2212 t + 1)\n\nt=1\n\n(2.4)\n\n(2.5)\n\nwhere {\u03a6x(k), \u03a6u(k)} are called the closed loop system response elements induced by the controller\nK. 
The SLS framework shows that for any elements {Φ_x(k), Φ_u(k)} constrained to obey

    Φ_x(k + 1) = A_* Φ_x(k) + B_* Φ_u(k) ,   Φ_x(1) = I ,   ∀ k ≥ 1 ,   (2.6)

there exists some controller that achieves the desired system responses (2.5). The state-feedback parameterization result in Theorem 1 of Wang et al. [19] formalizes this observation: the SLS framework therefore allows for any optimal control problem over linear systems to be cast as an optimization problem over elements {Φ_x(k), Φ_u(k)}, constrained to satisfy the affine equations (2.6). Comparing equations (2.4) and (2.5), we see that the former is non-convex in the controller K, whereas the latter is affine in the elements {Φ_x(k), Φ_u(k)}, enabling solutions to previously difficult optimal control problems.

As we work with infinite horizon problems, it is notationally more convenient to work with transfer function representations of the above objects, which can be obtained by taking a z-transform of their time-domain representations. The frequency domain variable z can be informally thought of as the time-shift operator, i.e., z{x_k, x_{k+1}, ...} = {x_{k+1}, x_{k+2}, ...}, allowing for a compact representation of LTI dynamics. We use boldface letters to denote such transfer functions, e.g., Φ_x(z) = Σ_{k=1}^∞ Φ_x(k) z^{−k}. Then, the constraints (2.6) can be rewritten as

    [zI − A_*  −B_*] [Φ_x; Φ_u] = I ,   (2.7)

and the corresponding (not necessarily static) control law u = Kx is given by K = Φ_u Φ_x^{−1}. Although other approaches to optimal controller design exist, we argue now that the SLS parameterization has some appealing properties when applied to the control of uncertain systems. In particular,
In particular,\n\n3\n\n(cid:20)\u03a6x\n\n(cid:21)\n\n\u03a6u\n\n\f= I\n\nif and only if\n\n\u03a6u\n\n(cid:21)\n\n(cid:2)zI \u2212 (cid:98)A \u2212(cid:98)B(cid:3)(cid:20)\u03a6x\n\nsuppose that rather than having access to the true system transition matrices (A(cid:63), B(cid:63)), we instead\n\nonly have access to estimates ((cid:98)A, (cid:98)B). The SLS framework allows us to characterize the system\nresponses achieved by a controller, computed using only the estimates ((cid:98)A, (cid:98)B), on the true system\n(A(cid:63), B(cid:63)). Speci\ufb01cally, if we denote (cid:98)\u2206 := ((cid:98)A \u2212 A(cid:63))\u03a6x + ((cid:98)B \u2212 B(cid:63))\u03a6u, simple algebra shows that\nThe robust stability result in Theorem 2 of Matni et al. [14] shows that if (I + (cid:98)\u2206)\u22121 exists, then the\nx , computed using only the estimates ((cid:98)A, (cid:98)B), achieves the following response\nFurther, if K stabilizes the system ((cid:98)A, (cid:98)B), and (I + (cid:98)\u2206)\u22121 is stable (simple suf\ufb01cient conditions can\n\ncontroller K = \u03a6u\u03a6\u22121\non the true system (A(cid:63), B(cid:63)):\n\n(I + (cid:98)\u2206)\u22121w .\n\n= I + (cid:98)\u2206 .\n\n[zI \u2212 A(cid:63) \u2212B(cid:63)]\n\n(cid:20)\u03a6x\n\n(cid:21)\n\n\u03a6u\n\n(cid:20)\u03a6x\n\n(cid:21)\n\n\u03a6u\n\n(cid:21)\n\n(cid:20)x\n\nu\n\nbe derived to ensure this, see [6]), then K is also stabilizing for the true system. 
It is this transparency between system uncertainty and controller performance that we exploit in our algorithm.

We end this discussion with the definition of a function space that we use extensively throughout:

    S(C, ρ) := { M = Σ_{k=1}^∞ M(k) z^{−k}  :  ‖M(k)‖ ≤ C ρ^k , k = 1, 2, ... } .

The space S(C, ρ) consists of (strictly proper) stable transfer functions that satisfy a certain decay rate in the spectral norm of their impulse response elements. We denote the restriction of S(C, ρ) to the space of F-length finite impulse response (FIR) filters by S_F(C, ρ), i.e., M ∈ S_F(C, ρ) if M ∈ S(C, ρ), and M(k) = 0 for all k > F.

We equip S(C, ρ) with the H∞ and H₂ norms, which are infinite horizon analogs of the spectral and Frobenius norms of a matrix, respectively: ‖M‖_{H∞} = sup_{‖w‖₂=1} ‖Mw‖₂ and ‖M‖_{H₂} = √(Σ_{k=1}^∞ ‖M(k)‖_F²). The H∞ and H₂ norms have distinct interpretations. The H∞ norm of a system M is equal to its ℓ₂ ↦ ℓ₂ operator norm, and can be used to measure the robustness of a system to unmodelled dynamics [20]. The H₂ norm has a direct interpretation as the energy transferred to the system by a white noise process, and is hence closely related to the LQR optimal control problem. Unsurprisingly, the H₂ norm appears in the objective function of our optimization problem, whereas the H∞ norm appears in the constraints to ensure robust stability and performance.

3 Algorithm and Guarantees

Our proposed robust adaptive control algorithm for LQR is shown in Algorithm 1.
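Both norms are straightforward to compute for a state-space system M(z) = C(zI − A)^{-1}B, whose impulse response elements are M(k) = C A^{k−1} B. The sketch below (with illustrative matrices) cross-checks the H₂ norm computed from the impulse-response sum against the Gramian formula, and grids the unit circle for the H∞ norm:

```python
import numpy as np

# Illustrative stable system; M(k) = C A^{k-1} B decays geometrically,
# so M lies in S(C, rho) for suitable constants.
A = np.array([[0.5, 0.2], [0.0, 0.4]])
B = np.eye(2)
C = np.eye(2)

# ||M||_H2^2 as the (truncated) sum of squared Frobenius norms of M(k).
H2_sq_sum = sum(
    np.linalg.norm(C @ np.linalg.matrix_power(A, k - 1) @ B, "fro") ** 2
    for k in range(1, 200)
)

# Same quantity via the controllability Gramian W = A W A' + B B',
# computed here by fixed-point iteration: ||M||_H2^2 = trace(C W C').
W = np.zeros((2, 2))
for _ in range(500):
    W = A @ W @ A.T + B @ B.T
H2_sq_gram = np.trace(C @ W @ C.T)
assert abs(H2_sq_sum - H2_sq_gram) < 1e-8

# ||M||_Hinf as the sup of the spectral norm of M(z) over the unit circle.
grid = np.exp(1j * np.linspace(0, 2 * np.pi, 1024))
Hinf = max(np.linalg.norm(C @ np.linalg.inv(z * np.eye(2) - A) @ B, 2) for z in grid)

# Generic relation ||M||_H2^2 <= rank * ||M||_Hinf^2 (rank <= 2 here).
assert Hinf >= np.sqrt(H2_sq_gram / 2)
```

This mirrors the roles the two norms play in Algorithm 1: the H₂ quantity is an average-case (white noise) cost, while the H∞ quantity is a worst-case gain.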
We note that while Line 8 of Algorithm 1 is written as an infinite-dimensional optimization problem, it can be formulated in terms of finite-dimensional decision variables {Φ_x(k), Φ_u(k)}_{k=1}^{F_i} due to the restriction to FIR filters. In this formulation, the H₂ cost can be written as a Frobenius norm and the H∞ constraint reduces to a linear matrix inequality. Therefore, the inner optimization can be equivalently written as a semidefinite program over O(F_i(n² + np)) decision variables. We describe this transformation in detail in appendix Section G of the full version [7]. We also note that the outer optimization over γ can be performed efficiently by bisection search because the objective is jointly quasi-convex in the decision variables and is smooth with respect to γ in the feasible domain.

Some remarks on practice are in order. First, in Line 6, only the trajectory data collected during the i-th epoch is used for the least squares estimate. Second, the epoch lengths we use grow exponentially in the epoch index. These settings are chosen primarily to simplify the analysis; in practice all the data collected should be used, and it may be preferable to use a slower growing epoch schedule (such as T_i = C_T (i + 1)). Additionally, for storage considerations, instead of performing a batch least squares update of the model, a recursive least squares (RLS) estimator rule can be used to update the parameters in an online manner. Furthermore, many constants in Algorithm 1 depend on the unknown system to be consistent with our data-independent analysis. In practice, these parameters
can be estimated from collected data.

Finally, we note that the proofs for all results in this section can be found in the full version [7].

Algorithm 1 Robust Adaptive Control Algorithm
Require: Stabilizing controller K^(0), failure probability δ ∈ (0, 1), and constants (C_*, ρ_*, ‖K_*‖).
1: Set C_x ← O(1) C_* / (1 − ρ_*)³, C_u ← ‖K_*‖ C_x, and ρ ← .999 + .001 ρ_*.
2: Set C_T ← Õ( (n + p) C_*⁴ (1 + ‖K_*‖)⁴ / (1 − ρ_*)⁸ ).
3: for i = 0, 1, 2, ... do
4:   Set T_i ← C_T 2^i and σ_{η,i}² ← σ_w² (T_i / C_T)^{−1/3}.
5:   Set D_i = {(x_k^(i), u_k^(i))}_{k=1}^{T_i} ← evolve system forward T_i steps, where each action u^(i) is obtained from the controller K^(i) plus an additional noise term for exploration. More precisely, u^(i) = K^(i) x^(i) + η^(i), where each entry of η^(i) is drawn i.i.d. from N(0, σ_{η,i}² I_p).
6:   Set (Â_i, B̂_i) ← argmin_{A,B} Σ_{k=1}^{T_i − 1} ½ ‖x_{k+1}^(i) − A x_k^(i) − B u_k^(i)‖₂².
7:   Set ε_i ← Õ( (σ_w ‖K_*‖ C_*) / (σ_{η,i} (1 − ρ_*)³) · √((n + p)/T_i) ) and F_i ← Õ(1)(i + 1) / (1 − ρ_*).
8:   Set K^(i+1) = Φ_u Φ_x^{−1}, where (Φ_x, Φ_u) are the solution to

       minimize_{γ ∈ [0,1)}  (1/(1 − γ)) · min_{Φ_x, Φ_u, V}  ‖ [Q^{1/2} 0; 0 R^{1/2}] [Φ_x; Φ_u] ‖_{H₂}
       s.t.  [zI − Â_i  −B̂_i] [Φ_x; Φ_u] = I + (1/z^{F_i}) V ,   ‖ √2 ε_i [Φ_x; Φ_u] ‖_{H∞} ≤ γ ,
             ‖V‖ ≤ C_x ρ^{F_i + 1} / (1 − C_x ρ^{F_i + 1}) ,   Φ_x ∈ S_{F_i}(C_x, ρ) ,   Φ_u ∈ S_{F_i}(C_u, ρ) .

9: end for

3.1 Regret Upper Bounds

Our guarantees for Algorithm 1 are stated in terms of certain system specific constants, which we define here. We let K_* denote the static feedback solution to the LQR problem for (A_*, B_*, Q, R). Next, we define (C_*, ρ_*) such that the closed loop system A_* + B_* K_* belongs to S(C_*, ρ_*). Our main assumption is stated as follows.

Assumption 3.1. We are given a controller K^(0) that stabilizes the true system (A_*, B_*).
Furthermore, letting (Φ_x, Φ_u) denote the response of K^(0) on (A_*, B_*), we assume that Φ_x ∈ S(C_x, ρ) and Φ_u ∈ S(C_u, ρ), where the constants C_x, C_u, ρ are defined in Algorithm 1.

The requirement of an initial stabilizing controller K^(0) is not restrictive; Dean et al. [6] provide an offline strategy for finding such a controller. Furthermore, in practice Algorithm 1 can be initialized with no controller, with random inputs applied instead to the system in the first epoch to estimate (A_*, B_*) within an initial confidence set for which the synthesis problem becomes feasible.

Our first guarantee is on the rate of estimation of (A_*, B_*) as the algorithm progresses through time. This result builds on recent progress [17] for estimation along trajectories of a linear dynamical system. For what follows, the notation Õ(·) hides absolute constants and polylog(T, 1/δ, C_*, 1/(1 − ρ_*), n, p, ‖B_*‖, ‖K_*‖) factors.

Theorem 3.2. Fix a δ ∈ (0, 1) and suppose that Assumption 3.1 holds. With probability at least 1 − δ the following statement holds. Suppose that T is at an epoch boundary. Let (Â(T), B̂(T)) denote the current estimate of (A_*, B_*) computed by Algorithm 1 at the end of time T. Then, this estimate satisfies the guarantee

    max{ ‖Â(T) − A_*‖ , ‖B̂(T) − B_*‖ } ≤ Õ( (C_* ‖K_*‖ / (1 − ρ_*)³) · √(n + p) / T^{1/3} ) .

Theorem 3.2 shows that Algorithm 1 achieves a consistent estimate of the true dynamics (A_*, B_*), and learns at a rate of Õ(T^{−1/3}).
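The least-squares identification step (Line 6 of Algorithm 1) is simple to sketch on simulated data. In the snippet below (system matrices, horizon, and noise levels are illustrative, and for simplicity the input is pure exploration noise rather than feedback plus noise), regressing x_{k+1} on [x_k; u_k] recovers (A_*, B_*) with error shrinking in the trajectory length:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative stable 2-state system.
A_star = np.array([[0.9, 0.1], [0.0, 0.8]])
B_star = np.array([[0.0], [1.0]])
n, p, T, sigma_w, sigma_eta = 2, 1, 5000, 0.1, 1.0

# Roll out the dynamics with pure exploration input u_k ~ N(0, sigma_eta^2).
X = np.zeros((T + 1, n))
U = sigma_eta * rng.standard_normal((T, p))
for k in range(T):
    X[k + 1] = A_star @ X[k] + B_star @ U[k] + sigma_w * rng.standard_normal(n)

# Least squares: stack regressors z_k = [x_k; u_k] and solve for [A B].
Z = np.hstack([X[:-1], U])                       # shape (T, n + p)
theta_hat, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = theta_hat[:n].T, theta_hat[n:].T

err = max(np.linalg.norm(A_hat - A_star, 2), np.linalg.norm(B_hat - B_star, 2))
assert err < 0.05  # error shrinks as the trajectory length T grows
```

As noted in the practical remarks above, a recursive least squares update would avoid storing the full trajectory.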
We note that consistency of parameter estimates is not a guarantee provided by OFU or TS based approaches.

Next, we state an upper bound on the regret incurred by Algorithm 1.

Theorem 3.3. Fix a δ ∈ (0, 1) and suppose that Assumption 3.1 holds. With probability at least 1 − δ the following statement holds. For all T ≥ 0 we have that Algorithm 1 satisfies

    Regret(T) ≤ Õ( (n + p) · (C_*⁴ (1 + ‖K_*‖)⁴ (1 + ‖B_*‖)² J_*) / (1 − ρ_*)^{16} · T^{2/3} ) .

Here, the notation Õ(·) also hides o(T^{2/3}) terms.

Our proof strategy works as follows. We first decompose regret by epochs as follows:

    Regret(T) = Σ_{i=0}^{O(log₂ T)} Σ_{k=1}^{T_i} ( (x_k^(i))^T Q x_k^(i) + (u_k^(i))^T R u_k^(i) − J_* ) ,

where x_k^(i) denotes the state at the k-th timestep in the i-th epoch (and similarly for u_k^(i)). By standard concentration of measure arguments, we can upper bound w.h.p. the per-epoch regret Σ_{k=1}^{T_i} ((x_k^(i))^T Q x_k^(i) + (u_k^(i))^T R u_k^(i) − J_*) by its expected value plus a deviation term that involves the norm of x_0^(i). Because we constrain the impulse response coefficients of the SLS response {Φ_x, Φ_u} in Algorithm 1, this allows us to easily bound ‖x_0^(i)‖₂ w.h.p. again by using standard concentration arguments. We then use the SLS machinery to quantify the difference between the expected cost over the horizon T_i minus J_*, which yields that the regret incurred during epoch i is bounded by Õ(T_i (σ_{η,i}²/σ_w² + ε_{i−1}) J_*), where ε_{i−1} is the estimation error, and the O(σ_{η,i}²/σ_w²) contribution is the additional cost incurred from injecting exploration noise. We then bound our estimation error by ε_i = Õ((σ_w/σ_{η,i}) T_i^{−1/2}) using Theorem 3.2. Setting σ_{η,i}² = σ_w² T_i^{−α}, we have that the per-epoch regret is bounded by Õ(T_i^{1−α} + T_i^{1−(1−α)/2}). Choosing α = 1/3 to balance these competing powers of T_i and summing over the logarithmic number of epochs, we obtain a final regret of Õ(T^{2/3}).

The main difficulty in the proof is ensuring that the transient behavior of the resulting controllers is uniformly bounded when applied to the true system. Prior works sidestep this issue by assuming that the true dynamics lie within a (known) compact set for which the Heine-Borel theorem asserts the existence of finite constants that capture this behavior. We go a step further and work through the perturbation analysis which allows us to give a regret bound that depends only on simple quantities of the true system (A_*, B_*). The full proof is given in the appendix.

Finally, we remark that the dependence on 1/(1 − ρ_*) in our results is an artifact of our perturbation analysis, and we leave sharpening this dependence to future work.

3.2 Regret Lower Bounds and Parameter Estimation Rates

We saw that Algorithm 1 achieves Õ(T^{2/3}) regret with high probability. Now we provide a matching algorithmic lower bound on the expected regret, showing that the analysis presented in Section 3.1 is sharp as a function of T. Moreover, our lower bound characterizes how much regret must be accrued in order to achieve a specified estimation rate for the system parameters (A_*, B_*).

Theorem 3.4.
Let the initial state x_0 be distributed according to the steady state distribution N(0, P_∞) of the optimal closed loop system, and let {u_t}_{t≥0} be any sequence of inputs as in Section 2. Furthermore, let f : R → R be any function such that with probability 1 − δ we have

    λ_min( Σ_{k=0}^{T−1} [x_k; u_k] [x_k^T  u_k^T] ) ≥ f(T) .   (3.1)

Then, there exist positive values T_0 and C_0 such that for all T ≥ T_0 we have

    Σ_{k=0}^{T} E[ x_k^T Q x_k + u_k^T R u_k − J_* ] ≥ ½ (1 − δ) λ_min(R) (1 + σ_min(K_*)²) f(T − T_0) − C_0 ,

where T_0 and C_0 are functions of A_*, B_*, Q, R, σ_w², and n. We note the specific forms of T_0 and C_0 are given in the proof.

The proof of the estimation error Theorem 3.2 shows that Algorithm 1 satisfies Eq. (3.1) with f(T) = Õ(T σ_{η,Θ(log₂(T))}²). Since the exploration variance σ_{η,i}² used by Algorithm 1 during the i-th epoch is given by σ_{η,i}² = O(σ_w² 2^{−i/3}), we obtain the following corollary which demonstrates the sharpness of our regret analysis with respect to the scaling of T.

Corollary 3.5. For T > C_1(n, δ, σ_w², A_*, B_*, Q, R) the expected regret of Algorithm 1 satisfies

    Σ_{k=1}^{T} E[ x_k^T Q x_k + u_k^T R u_k − J_* ] ≥ Ω̃( λ_min(R) (1 + σ_min(K_*)²) T^{2/3} ) .

A natural question to ask is how much regret any algorithm must accrue in order to achieve estimation error ‖Â − A_*‖ ≤ ε and ‖B̂ − B_*‖ ≤ ε. From Theorem 3.2 we know that Algorithm 1 estimates (A_*, B_*) at rate Õ(T^{−1/3}).
Therefore, in order to achieve ε estimation error, T must be Ω̃(ε^{−3}). Hence, Theorem 3.3 implies that the regret of Algorithm 1 to achieve ε estimation error is Õ(ε^{−2}). Interestingly, let us consider any other algorithm achieving O(T^α) regret for some 0 < α < 1. Then, Theorem 3.4 suggests that the best estimation rate achievable by such an algorithm is O(T^{−α/2}), since the minimum eigenvalue condition Eq. (3.1) governs the signal-to-noise ratio. In the case of linear regression with independent data it is known that the minimax estimation rate is lower bounded by the square root of the inverse of the minimum eigenvalue (3.1). We conjecture that the same result holds in our case. Therefore, to achieve ε estimation error, any algorithm would likely require Ω(ε^{−2}) regret, showing that Algorithm 1 is optimal up to logarithmic factors in this sense. Finally, we note that while Algorithm 1 estimates (A_*, B_*) at a rate Õ(T^{−1/3}), Theorem 3.4 suggests that any algorithm achieving the optimal O(√T) regret would estimate (A_*, B_*) at a rate Ω(T^{−1/4}).

4 Experiments

Regret Comparison. We illustrate the performance of several adaptive schemes empirically. We compare the proposed robust adaptive method with non-Bayesian Thompson sampling (TS) as in Abeille and Lazaric [4] and a heuristic projected gradient descent (PGD) implementation of OFU. As a simple baseline, we use the nominal control method, which synthesizes the optimal infinite-horizon LQR controller for the estimated system and injects noise with the same schedule as the robust approach.
Computational burden varies across adaptive methods due to differences in both cost and frequency of controller synthesis; implementation details and computational considerations for all methods are in Section G of the full version [7].

The comparison experiments are carried out on the following LQR problem:

    A_* = [1.01 0.01 0; 0.01 1.01 0.01; 0 0.01 1.01] ,   B_* = I ,   Q = 10 I ,   R = I ,   σ_w = 1 .   (4.1)

This system corresponds to a marginally unstable Laplacian system where adjacent nodes are weakly connected; these dynamics were also studied by [3, 6, 18]. The cost is such that input size is penalized relatively less than state. This problem setting is amenable to robust methods due to both the cost ratio and the marginal instability, which are factors that may hurt optimistic methods. In Section H of the full version [7], we show similar results for an unstable system with large transients.

To standardize the initialization of the various adaptive methods, we use a rollout of length T_0 = 100 where the input is a stabilizing controller plus Gaussian noise with fixed variance σ_u = 1. This trajectory is not counted towards the regret, but the recorded states and inputs are used to initialize parameter estimates. In each experiment, the system starts from x_0 = 0 to reduce variance over runs. For all methods, the actual errors Â_t − A_* and B̂_t − B_* are used rather than bounds or bootstrapped estimates. The effect of this choice on regret is small, as examined empirically in Section H of [7].

The performance of the various adaptive methods over time is compared in Figure 1. The median and 90th percentile cumulative regret over 500 instances is displayed in Figure 1a, which gives an idea of both typical and worst-case behavior. The regret of the optimal LQR controller for the true system is displayed as a baseline.
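The example system (4.1) can be reproduced directly; the short sketch below builds it, confirms the marginal open-loop instability noted above, and computes the optimal baseline cost J_* via the Riccati solution (the reported regret curves themselves depend on the adaptive methods and are not reproduced here):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# The marginally unstable Laplacian example of Eq. (4.1).
A_star = np.array([[1.01, 0.01, 0.0],
                   [0.01, 1.01, 0.01],
                   [0.0, 0.01, 1.01]])
B_star = np.eye(3)
Q, R, sigma_w = 10 * np.eye(3), np.eye(3), 1.0

# Open loop is unstable: the spectral radius exceeds 1.
assert max(abs(np.linalg.eigvals(A_star))) > 1

# The optimal LQR controller stabilizes it.
P = solve_discrete_are(A_star, B_star, Q, R)
K_star = -np.linalg.solve(R + B_star.T @ P @ B_star, B_star.T @ P @ A_star)
assert max(abs(np.linalg.eigvals(A_star + B_star @ K_star))) < 1

# Optimal infinite-horizon average cost J_* = sigma_w^2 * trace(P), the
# baseline against which Regret(T) is measured.
J_star = sigma_w**2 * np.trace(P)
assert J_star > 0
```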
Overall, the methods have very similar performance. One benefit of robustness is the guaranteed stability and bounded infinite-horizon cost at every point during operation. In Figure 1b, this infinite-horizon LQR cost is plotted for the controllers played during each epoch. This value measures the cost of using each epoch's controller indefinitely, rather than continuing to update its parameters. The robust adaptive method performs relatively better than the other adaptive algorithms, indicating that it is more amenable to early stopping, i.e., to turning off the adaptive component of the algorithm and playing the current controller indefinitely.

Figure 1 [(a) Cumulative Regret; (b) Infinite Horizon LQR Cost]: A comparison of different adaptive methods on 500 experiments of the marginally unstable Laplacian example in Eq. (4.1). In (a), the median and 90th percentile cumulative regret are plotted over time. In (b), the median and 90th percentile infinite-horizon LQR cost of each epoch's controller are shown.

Figure 2 [(a) Demand Forecasting; (b) Constraint Satisfaction]: The addition of constraints in the robust synthesis problem can guarantee the safe execution of adaptive systems. We consider an example inspired by demand forecasting, as illustrated in (a), where the left hand side of the diagram represents unknown dynamics. The median and maximum values of $\|x_t\|_\infty$ over 500 trials are plotted for both the unconstrained and constrained synthesis problems in (b).

Extension to Uncertain Environment with State Constraints. The proposed robust adaptive method naturally generalizes beyond the standard LQR problem. We consider a disturbance forecasting example which incorporates environmental uncertainty and safety constraints. Consider a system with known dynamics driven by stochastic disturbances that are now correlated in time.
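Disturbances of this kind can be produced by passing white noise through a stable linear filter. A minimal sketch, using for illustration the stable matrix $A_d$ defined later in this section (the simulation length and seed are assumptions):

```python
# Sketch (illustrative): temporally correlated disturbances generated as the
# state of a stable autonomous LTI filter d_{t+1} = A_d d_t + w_t.
import numpy as np

rng = np.random.default_rng(0)
Ad = np.array([[0.5, 0.1, 0.0],
               [0.0, 0.5, 0.1],
               [0.0, 0.0, 0.5]])  # stable: all eigenvalues equal 0.5

T = 20_000
d = np.zeros(3)
samples = np.zeros((T, 3))
for t in range(T):
    samples[t] = d
    d = Ad @ d + rng.standard_normal(3)  # white-noise input w_t

# Unlike i.i.d. noise, the filtered process has clearly nonzero
# lag-1 autocorrelation (roughly the 0.5 diagonal of A_d).
x = samples[:, 0]
lag1 = np.corrcoef(x[:-1], x[1:])[0, 1]
```

The nonzero autocorrelation is precisely the structure a forecasting controller can exploit once the filter dynamics are identified.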
We model the disturbance process as the output of an unknown autonomous LTI system, as illustrated in Figure 2a. This setting can be interpreted as a demand forecasting problem, where, for example, the system is a server farm and the disturbances represent changes in the amount of incoming jobs. If the dynamics of the correlated disturbance process are known, this knowledge can be used for more cost-effective temperature control.
We let the system $(A_\star, B_\star)$ with known dynamics be described by the graph Laplacian dynamics as in Eq. (4.1). The disturbance dynamics are unknown and are governed by a stable system transition matrix $A_d$, resulting in the following dynamics for the full system:

$$\begin{bmatrix} x_{t+1} \\ d_{t+1} \end{bmatrix} = \begin{bmatrix} A_\star & I \\ 0 & A_d \end{bmatrix} \begin{bmatrix} x_t \\ d_t \end{bmatrix} + \begin{bmatrix} B_\star \\ 0 \end{bmatrix} u_t + \begin{bmatrix} 0 \\ I \end{bmatrix} w_t\,, \qquad A_d = \begin{bmatrix} 0.5 & 0.1 & 0 \\ 0 & 0.5 & 0.1 \\ 0 & 0 & 0.5 \end{bmatrix}.$$

The costs are set to model expensive inputs, with $Q = I$ and $R = 1 \times 10^3 I$. The controller synthesis problem in Line 8 of Algorithm 1 is modified to reflect the problem structure, and crucially, we add a constraint on the system response $\Phi_x$. Further details of the formulation are explained in Section H of [7]. Figure 2b illustrates the effect: while the unconstrained synthesis results in trajectories with large state values, the constrained synthesis results in much more moderate behavior.

5 Conclusions and Future Work

We presented a polynomial-time algorithm for the adaptive LQR problem that provides high-probability guarantees of sub-linear regret.
In contrast to other approaches to this problem, our robust adaptive method guarantees stability, robust performance, and parameter estimation. We also explored the interplay between regret minimization and parameter estimation, identifying fundamental limits connecting the two.
Several questions remain to be answered. It is an open question whether a polynomial-time algorithm can achieve a regret of $\widetilde{O}(\sqrt{T})$. In our implementation of OFU, we observed that PGD performed quite effectively. Interesting future work is to see if the techniques of Fazel et al. [9] for policy gradient optimization on LQR can be applied to prove convergence of PGD on the OFU subroutine, which would provide an optimal polynomial-time algorithm. Moreover, we observed that the OFU and TS methods in practice gave estimates of system parameters that were comparable with those of our method, which explicitly adds excitation noise. It seems that the switching of control policies at epoch boundaries provides more excitation for system identification than is currently understood by the theory. Furthermore, practical issues that remain to be addressed include satisfying safety constraints and dealing with nonlinear dynamics; in both settings, finite-sample parameter estimation/system identification and adaptive control remain open problems.

Acknowledgments

We thank the anonymous reviewers for their feedback, which improved the clarity of our presentation. SD is supported by an NSF Graduate Research Fellowship under Grant No. DGE 1752814.
As part\nof the RISE lab, HM is generally supported in part by NSF CISE Expeditions Award CCF-1730628,\nDHS Award HSHQDC-16-3-00083, and gifts from Alibaba, Amazon Web Services, Ant Financial,\nCapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM, Microsoft, Scotiabank, Splunk and VMware.\nBR is generously supported in part by NSF award CCF-1359814, ONR awards N00014-17-1-2191,\nN00014-17-1-2401, and N00014-17-1-2502, the DARPA Fundamental Limits of Learning (Fun LoL)\nand Lagrange Programs, and an Amazon AWS AI Research Award.\n\nReferences\n[1] Yasin Abbasi-Yadkori and Csaba Szepesv\u00e1ri. Regret Bounds for the Adaptive Control of Linear\n\nQuadratic Systems. In Conference on Learning Theory, 2011.\n\n[2] Yasin Abbasi-Yadkori and Csaba Szepesv\u00e1ri. Bayesian Optimal Control of Smoothly Parame-\nterized Systems: The Lazy Posterior Sampling Algorithm. In Conference on Uncertainty in\nArti\ufb01cial Intelligence, 2015.\n\n[3] Yasin Abbasi-Yadkori, Nevena Lazic, and Csaba Szepesv\u00e1ri. Model-Free Linear Quadratic\n\nControl via Reduction to Expert Prediction. arXiv:1804.06021, 2018.\n\n[4] Marc Abeille and Alessandro Lazaric. Thompson Sampling for Linear-Quadratic Control\n\nProblems. In AISTATS, 2017.\n\n[5] S. Bittanti and M. C. Campi. Adaptive control of linear time invariant systems: the \u201cbet on the\n\nbest\u201d principle. Communications in Information and Systems, 6(4), 2006.\n\n[6] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the Sample\n\nComplexity of the Linear Quadratic Regulator. arXiv:1710.01688, 2017.\n\n[7] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret Bounds for\n\nRobust Adaptive Control of the Linear Quadratic Regulator. arXiv:1805.09388, 2018.\n\n[8] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite Time\nAnalysis of Optimal Adaptive Policies for Linear-Quadratic Systems. arXiv:1711.07230, 2017.\n\n[9] Maryam Fazel, Rong Ge, Sham M. 
Kakade, and Mehran Mesbahi. Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.

[10] Claude-Nicolas Fiechter. PAC Adaptive Control of Linear Systems. In Conference on Learning Theory, 1997.

[11] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems. In Neural Information Processing Systems, 2012.

[12] Petros A. Ioannou and Jing Sun. Robust Adaptive Control, volume 1. PTR Prentice-Hall, Upper Saddle River, NJ, 1996.

[13] Miroslav Krstic, Ioannis Kanellakopoulos, and Petar V. Kokotovic. Nonlinear and Adaptive Control Design. Wiley, 1995.

[14] Nikolai Matni, Yuh-Shyang Wang, and James Anderson. Scalable System Level Synthesis for Virtually Localizable Systems. In IEEE Conference on Decision and Control, 2017.

[15] Ian Osband and Benjamin Van Roy. Posterior Sampling for Reinforcement Learning Without Episodes. arXiv:1608.02731, 2016.

[16] Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based Control of Unknown Linear Systems with Thompson Sampling. arXiv:1709.04047, 2017.

[17] Max Simchowitz, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. Learning Without Mixing: Towards a Sharp Analysis of Linear System Identification. In Conference on Learning Theory, 2018.

[18] Stephen Tu and Benjamin Recht. Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator. In International Conference on Machine Learning, 2018.

[19] Yuh-Shyang Wang, Nikolai Matni, and John C. Doyle. A System Level Approach to Controller Synthesis. arXiv:1610.04815, 2016.

[20] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control.
1995.