{"title": "Stable adaptive control with online learning", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": null, "full_text": " Stable adaptive control with online learning\n\n\n\n Andrew Y. Ng H. Jin Kim\n Stanford University Seoul National University\n Stanford, CA 94305, USA Seoul, Korea\n\n Abstract\n Learning algorithms have enjoyed numerous successes in robotic control\n tasks. In problems with time-varying dynamics, online learning methods\n have also proved to be a powerful tool for automatically tracking and/or\n adapting to the changing circumstances. However, for safety-critical ap-\n plications such as airplane flight, the adoption of these algorithms has\n been significantly hampered by their lack of safety, such as \"stability,\"\n guarantees. Rather than trying to show difficult, a priori, stability guar-\n antees for specific learning methods, in this paper we propose a method\n for \"monitoring\" the controllers suggested by the learning algorithm on-\n line, and rejecting controllers leading to instability. We prove that even if\n an arbitrary online learning method is used with our algorithm to control\n a linear dynamical system, the resulting system is stable.\n\n1 Introduction\nOnline learning algorithms provide a powerful set of tools for automatically fine-tuning a\ncontroller to optimize performance while in operation, or for automatically adapting to the\nchanging dynamics of a control problem. [2] Although one can easily imagine many com-\nplex learning algorithms (SVMs, gaussian processes, ICA, . . . ,) being powerfully applied\nto online learning for control, for these methods to be widely adopted for applications such\nas airplane flight, it is critical that they come with safety guarantees, specifically stability\nguarantees. In our interactions with industry, we also found stability to be a frequently\nraised concern for online learning. 
We believe that the lack of safety guarantees represents
a significant barrier to the wider adoption of many powerful learning algorithms for online
adaptation and control. It is also typically infeasible to replace formal stability guarantees
with only empirical testing: For example, to convincingly demonstrate that we can safely
fly a fleet of 100 aircraft for 10000 hours would require 10^6 hours of flight-tests.
The control literature contains many examples of ingenious stability proofs for various online
learning schemes. It is impossible to do this literature justice here, but some examples
include [10, 7, 12, 8, 11, 5, 4, 9]. However, most of this work addresses only very specific
online learning methods, and usually quite simple ones (such as ones that switch between
only a finite number of parameter values using a specific, simple decision rule, e.g., [4]).
In this paper, rather than trying to show difficult a priori stability guarantees for specific
algorithms, we propose a method for \"monitoring\" an arbitrary learning algorithm being
used to control a linear dynamical system. By rejecting control values online that appear to
be leading to instability, our algorithm ensures that the resulting controlled system is stable.

2 Preliminaries
Following most work in control [6], we will consider control of a linear dynamical system.
Let x_t ∈ R^{n_x} be the n_x-dimensional state at time t. The system is initialized to x_0 = 0. At
each time t, we select a control action u_t ∈ R^{n_u}, as a result of which the state transitions to

 x_{t+1} = A x_t + B u_t + w_t.    (1)

Here, A ∈ R^{n_x×n_x} and B ∈ R^{n_x×n_u} govern the dynamics of the system, and w_t is a
disturbance term. We will not make any distributional assumptions about the source of the
disturbances w_t for now (indeed, we will consider a setting where an adversary chooses
them from some bounded set). For many applications, the controls are chosen as a linear
function of the state:

 u_t = K_t x_t.
(2)

Here, the K_t ∈ R^{n_u×n_x} are the control gains. If the goal is to minimize the expected value
of a quadratic cost function over the states and actions, J = (1/T) Σ_{t=1}^T (x_t^T Q x_t + u_t^T R u_t),
and the w_t are Gaussian, then we are in the LQR (linear quadratic regulation) control setting.
Here, Q ∈ R^{n_x×n_x} and R ∈ R^{n_u×n_u} are positive semi-definite matrices. In the infinite
horizon setting, under mild conditions there exists an optimal steady-state (or stationary)
gain matrix K, so that setting K_t = K for all t minimizes the expected value of J. [1]
We consider a setting in which an online learning algorithm (also called an adaptive control
algorithm) is used to design a controller. Thus, on each time step t, an online algorithm
may (based on the observed states and action sequence so far) propose some new gain matrix
K_t. If we follow the learning algorithm's recommendation, then we will start choosing
controls according to u = K_t x. More formally, an online learning algorithm is a function
f : ∪_{t=1}^∞ (R^{n_x} × R^{n_u})^t → R^{n_u×n_x} mapping from finite sequences of states and actions
(x_0, u_0, . . . , x_{t−1}, u_{t−1}) to controller gains K_t. We assume that f's outputs are bounded
(||K_t||_F ≤ κ for some constant κ > 0, where ||·||_F is the Frobenius norm).

2.1 Stability
In classical control theory [6], probably the most important desideratum of a controlled
system is that it must be stable. Given a fixed adaptive control algorithm f and a fixed
sequence of disturbance terms w_0, w_1, . . ., the sequence of states x_t visited is exactly
determined by the equations

 K_t = f(x_0, u_0, . . . , x_{t−1}, u_{t−1});  x_{t+1} = A x_t + B K_t x_t + w_t,  t = 0, 1, 2, . . .    (3)

Thus, for fixed f, we can think of the (controlled) dynamical system as a mapping from
the sequence of disturbance terms w_t to the sequence of states x_t. We now give the most
commonly-used definition of stability, called BIBO stability (see, e.g., [6]).

Definition.
A system controlled by f is bounded-input bounded-output (BIBO) stable if,
given any constant c_1 > 0, there exists some constant c_2 > 0 so that for all sequences of
disturbance terms satisfying ||w_t||_2 ≤ c_1 (for all t), the resulting state sequence
satisfies ||x_t||_2 ≤ c_2 (for all t).
Thus, a system is BIBO stable if, under bounded disturbances to it (possibly chosen by an
adversary), the state remains bounded and does not diverge.

We also define the t-th step dynamics matrix D_t to be D_t = A + B K_t. Note therefore that
the state transition dynamics of the system (right half of Equation 3) may now be written
x_{t+1} = D_t x_t + w_t. Further, the dependence of x_t on the w_t's can be expressed as follows:

 x_t = w_{t−1} + D_{t−1} x_{t−1} = w_{t−1} + D_{t−1}(w_{t−2} + D_{t−2} x_{t−2}) = · · ·    (4)
     = w_{t−1} + D_{t−1} w_{t−2} + D_{t−1} D_{t−2} w_{t−3} + · · · + D_{t−1} · · · D_1 w_0.    (5)

Since the number of terms in the sum above grows linearly with t, to ensure BIBO stability
of a system--i.e., that x_t remains bounded for all t--it is usually necessary for the terms in
the sum to decay rapidly, so that the sum remains bounded. For example, if it were true that
||D_{t−1} · · · D_{t−k+1} w_{t−k}||_2 ≤ (1 − ε)^k for some 0 < ε < 1, then the terms in the sequence
above would be norm bounded by a geometric series, and thus the sum is bounded. More
generally, the disturbance w_t contributes a term D_{t+k−1} · · · D_{t+1} w_t to the state x_{t+k}, and
we would like D_{t+k−1} · · · D_{t+1} w_t to become small rapidly as k becomes large (or, in the
control parlance, for the effects of the disturbance w_t on x_{t+k} to be attenuated quickly).
If K_t = K for all t, then we say that we are using a (nonadaptive) stationary controller K. In
this setting, it is straightforward to check if our system is stable. Specifically, it is BIBO
stable if and only if the magnitudes of all the eigenvalues of D = A + BK are strictly
less than 1. [6] To informally see why, note that the effect of w_t on x_{t+k} can be written
D^{k−1} w_t (as in Equation 5).
Moreover, writing λ_max(D) for the largest-magnitude eigenvalue of D,
|λ_max(D)| < 1 implies D^{k−1} w_t → 0 as k → ∞.
Thus, the disturbance w_t has a negligible influence on x_{t+k} for large k. More precisely, it
is possible to show that, under the assumption that ||w_t|| ≤ c_1, the sequence on the right
hand side of (5) is upper-bounded by a geometrically decreasing sequence, and thus its sum
must also be bounded. [6]

It was easy to check for stability when K_t was stationary, because the mapping from the
w_t's to the x_t's was linear. In more general settings, if K_t depends in some complex way
on x_1, . . . , x_{t−1} (which in turn depend on w_0, . . . , w_{t−2}), then x_{t+1} = A x_t + B K_t x_t + w_t
will be a nonlinear function of the sequence of disturbances.¹ This makes it significantly
more difficult to check for BIBO stability of the system.

Further, unlike the stationary case, it is well-known that |λ_max(D_t)| < 1 (for all t) is insufficient
to ensure stability. For example, consider a system where D_t = D_odd if t is odd, and
D_t = D_even otherwise, where²

 D_odd = [ 0.9    0  ]      D_even = [ 0.9   10  ]
         [ 10   0.9  ],              [  0   0.9  ].    (6)

Note that |λ_max(D_t)| = 0.9 < 1 for all t. However, if we pick w_0 = [1 0]^T and w_1 = w_2 =
. . . = 0, then (following Equation 5) we have

 x_{2t+1} = D_{2t} D_{2t−1} D_{2t−2} . . . D_2 D_1 w_0    (7)
          = (D_even D_odd)^t w_0    (8)
          = [ 100.81    9   ]^t
            [   9     0.81  ]   w_0.    (9)

Thus, even though the w_t's are bounded, we have ||x_{2t+1}||_2 ≥ (100.81)^t, showing that the
state sequence is not bounded. Hence, this system is not BIBO stable.

3 Checking for stability
If f is a complex learning algorithm, it is typically very difficult to guarantee that the
resulting system is BIBO stable. Indeed, even if f switches between only two specific sets
of gains K, and if w_0 is the only non-zero disturbance term, it can still be undecidable to
determine whether the state sequence remains bounded.
[3] Rather than try to give a priori
guarantees on f, we instead propose a method for ensuring BIBO stability of a system
by \"monitoring\" the control gains proposed by f, and rejecting gains that appear to be
leading to instability. We start computing controls according to a set of gains K̂_t only if it
is accepted by the algorithm.

From the discussion in Section 2.1, the criterion for accepting or rejecting a set of gains K̂_t
cannot simply be to check if |λ_max(A + B K̂_t)| = |λ_max(D̂_t)| < 1. Specifically, λ_max(D_2 D_1)
is not bounded by λ_max(D_2) λ_max(D_1), and so even if |λ_max(D_t)| is small for all t--which
would be the case if the gains K_t for any fixed t could be used to obtain a stable stationary
controller--the quantity |λ_max(Π_{τ=1}^t D_τ)| can still be large, and thus (Π_{τ=1}^t D_τ) w_0 can
be large. However, the following holds for the largest singular value σ_max of matrices.
Though the result is quite standard, for the sake of completeness we include a proof.³

Proposition 3.1: Let any matrices P ∈ R^{l×m} and Q ∈ R^{m×n} be given. Then
σ_max(P Q) ≤ σ_max(P) σ_max(Q).

Proof. σ_max(P Q) = max_{u,v : ||u||_2 = ||v||_2 = 1} u^T P Q v. Let u and v be a pair of vectors
attaining the maximum in the previous equation. Then σ_max(P Q) = u^T P Q v ≤
||u^T P||_2 ||Q v||_2 ≤ max_{v,u : ||v||_2 = ||u||_2 = 1} ||u^T P||_2 ||Q v||_2 = σ_max(P) σ_max(Q).
Thus, if we could ensure that σ_max(D_t) ≤ 1 − ε for all t, we would find that the influence
of w_0 on x_t has norm bounded by ||D_{t−1} D_{t−2} . . . D_1 w_0||_2 = σ_max(D_{t−1} . . . D_1 w_0) ≤

 ¹ Even if f is linear in its inputs so that K_t is linear in x_1, . . . , x_{t−1}, the state sequence's
dependence on (w_0, w_1, . . .) is still nonlinear because of the multiplicative term K_t x_t in the
dynamics (Equation 3).
 ² Clearly, such a system can be constructed with appropriate choices of A, B and K_t.
 ³ The largest singular value of M is σ_max(M) = σ_max(M^T) = max_{u,v : ||u||_2 = ||v||_2 = 1} u^T M v =
max_{u : ||u||_2 = 1} ||M u||_2. If x is a vector, then σ_max(x) is just the L2-norm of x.

σ_max(D_{t−1}) · · ·
σ_max(D_1) ||w_0||_2 ≤ (1 − ε)^{t−1} ||w_0||_2 (since ||v||_2 = σ_max(v) if v is a
vector). Thus, the influence of w_t on x_{t+k} goes to 0 as k → ∞.
However, it would be an overly strong condition to demand that σ_max(D_t) ≤ 1 − ε for every
t. Specifically, there are many stable, stationary controllers that do not satisfy this. For
example, either one of the matrices D_t in (6), if used as the stationary dynamics, is stable
(since |λ_max| = 0.9 < 1). Thus, it should be acceptable for us to use a controller with either
of these D_t (so long as we do not switch between them on every step). But, these D_t have
σ_max ≈ 10.1 > 1, and thus would be rejected if we were to demand that σ_max(D_t) ≤ 1 − ε
for every t. Thus, we will instead ask only for a weaker condition, that for all t,

 σ_max(D_t D_{t−1} · · · D_{t−N+1}) ≤ 1 − ε.    (10)

This is motivated by the following, which shows that any stable, stationary controller meets
this condition (for sufficiently large N):

Proposition 3.2: Let any 0 < ε < 1 and any D with |λ_max(D)| < 1 be given. Then there
exists N_0 > 0 so that for all N ≥ N_0, we have that σ_max(D^N) ≤ 1 − ε.

The proof follows from the fact that |λ_max(D)| < 1 implies D^N → 0 as N → ∞. Thus,
given any fixed, stable controller, if N is sufficiently large, it will satisfy (10). Further,
if (10) holds, then w_0's influence on x_{kN+1} is bounded by

 ||D_{kN} D_{kN−1} · · · D_1 w_0||_2 ≤ σ_max(D_{kN} D_{kN−1} · · · D_1) ||w_0||_2
                                    ≤ Π_{i=0}^{k−1} σ_max(D_{iN+N} D_{iN+N−1} · · · D_{iN+1}) ||w_0||_2
                                    ≤ (1 − ε)^k ||w_0||_2,    (11)

which goes to 0 geometrically quickly as k → ∞. (The first and second inequalities above
follow from Proposition 3.1.) Hence, the disturbances' effects are attenuated quickly.
To ensure that (10) holds, we propose the following algorithm. Below, N > 0 and 0 < ε <
1 are parameters of the algorithm.

 1. Initialization: Assume we have some initial stable controller K_0, so that
    |λ_max(D_0)| < 1, where D_0 = A + B K_0. Also assume that σ_max(D_0^N) ≤ 1 − ε.⁴
    Finally, for all values of t < 0, define K_t = K_0 and D_t = D_0.
 2. For t = 1, 2, . .
.

 (a) Run the online learning algorithm f to compute the next set of proposed
     gains K̂_t = f(x_0, u_0, . . . , x_{t−1}, u_{t−1}).
 (b) Let D̂_t = A + B K̂_t, and check whether

      σ_max(D̂_t D_{t−1} D_{t−2} D_{t−3} . . . D_{t−N+1}) ≤ 1 − ε    (12)
      σ_max(D̂_t² D_{t−1} D_{t−2} . . . D_{t−N+2}) ≤ 1 − ε    (13)
      σ_max(D̂_t³ D_{t−1} . . . D_{t−N+3}) ≤ 1 − ε    (14)
       . . .
      σ_max(D̂_t^N) ≤ 1 − ε    (15)

 (c) If all of the σ_max's above are at most 1 − ε, we ACCEPT K̂_t, and set
     K_t = K̂_t. Otherwise, REJECT K̂_t, and set K_t = K_{t−1}.
 (d) Let D_t = A + B K_t, and pick our action at time t to be u_t = K_t x_t.

We begin by showing that, if we use this algorithm to \"filter\" the gains output by the online
learning algorithm, Equation (10) holds.

Lemma 3.3: Let f and w_0, w_1, . . . be arbitrary, and let K_0, K_1, K_2, . . . be the sequence
of gains selected using the algorithm above. Let D_t = A + B K_t be the corresponding
dynamics matrices. Then for every −∞ < t < ∞, we have⁵

 σ_max(D_t D_{t−1} · · · D_{t−N+1}) ≤ 1 − ε.    (16)

 ⁴ From Proposition 3.2, it must be possible to choose N satisfying this.
 ⁵ As in the algorithm description, D_t = D_0 for t < 0.

Proof. Let any t be fixed, and let τ = max({0} ∪ {t' : 1 ≤ t' ≤ t, K̂_{t'} was accepted}).
Thus, τ is the index of the time step at which we most recently accepted a set of gains from
f (or 0 if no such gains exist). So, K_τ = K_{τ+1} = . . . = K_t, since the gains stay the same
in every time step on which we do not accept a new one. This also implies

 D_τ = D_{τ+1} = . . . = D_t.    (17)

We will treat the cases (i) τ = 0, (ii) 1 ≤ τ ≤ t − N + 1 and (iii) τ > t − N + 1,
τ ≥ 1 separately. In case (i), τ = 0, and we did not accept any gains after time 0. Thus
K_t = · · · = K_{t−N+1} = K_0, which implies D_t = · · · = D_{t−N+1} = D_0. But from
Step 1 of the algorithm, we had chosen N sufficiently large that σ_max(D_0^N) ≤ 1 − ε. This
shows (16). In case (ii), τ ≤ t − N + 1 (and τ > 0).
Together with (17), this implies

 D_t D_{t−1} · · · D_{t−N+1} = D_τ^N.    (18)

But σ_max(D_τ^N) ≤ 1 − ε, because at time τ, when we accepted K̂_τ, we would have checked
that Equation (15) holds. In case (iii), τ > t − N + 1 (and τ > 0). From (17) we have

 D_t D_{t−1} · · · D_{t−N+1} = D_τ^{t−τ+1} D_{τ−1} D_{τ−2} · · · D_{t−N+1}.    (19)

But when we accepted K̂_τ, we would have checked that (12-15) hold, and the (t − τ + 1)-st
equation in (12-15) is exactly that the largest singular value of (19) is at most 1 − ε.

Theorem 3.4: Let an arbitrary learning algorithm f be given, and suppose we use f to
control a system, but using our algorithm to accept/reject gains selected by f. Then, the
resulting system is BIBO stable.

Proof. Suppose ||w_t||_2 ≤ c_1 for all t. For convenience also define w_{−1} = w_{−2} = · · · = 0,
and let γ = ||A||_F + κ ||B||_F. From (5),

 ||x_t||_2 = || Σ_{k≥0} D_{t−1} D_{t−2} · · · D_{t−k} w_{t−k−1} ||_2
           ≤ c_1 Σ_{k≥0} ||D_{t−1} D_{t−2} · · · D_{t−k}||_2
           = c_1 Σ_{j≥0} Σ_{k=0}^{N−1} σ_max(D_{t−1} D_{t−2} · · · D_{t−jN−k})
           ≤ c_1 Σ_{j≥0} Σ_{k=0}^{N−1} (Π_{l=0}^{j−1} σ_max(D_{t−lN−1} D_{t−lN−2} · · · D_{t−lN−N})) σ_max(D_{t−jN−1} · · · D_{t−jN−k})
           ≤ c_1 Σ_{j≥0} (1 − ε)^j Σ_{k=0}^{N−1} σ_max(D_{t−jN−1} · · · D_{t−jN−k})
           ≤ c_1 Σ_{j≥0} (1 − ε)^j Σ_{k=0}^{N−1} γ^k
           ≤ c_1 (1/ε) N (1 + γ)^N.

The third inequality follows from Lemma 3.3, and the fourth inequality follows from our
assumption that ||K_t||_F ≤ κ, so that σ_max(D_t) ≤ ||D_t||_F ≤ ||A||_F + ||B||_F ||K_t||_F ≤
||A||_F + κ ||B||_F = γ. Hence, ||x_t||_2 remains uniformly bounded for all t.
Theorem 3.4 guarantees that, using our algorithm, we can safely apply any adaptive control
algorithm f to our system. As discussed previously, it is difficult to exactly characterize
the class of BIBO-stable controllers, and thus the set of controllers that we can safely
accept. However, it is possible to show a partial converse to Theorem 3.4 that certain
large, \"reasonable\" classes of adaptive control methods will always have their proposed
controllers accepted by our method.
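Before turning to such converses, note that the accept/reject test in Step 2(b) is simple to implement. The sketch below is ours (the names accept_gains and D_hist, and the use of numpy, are our choices, not the paper's); it checks Equations (12)-(15) directly, using the fact that the matrix 2-norm equals σ_max:

```python
import numpy as np

def accept_gains(D_hat, D_hist, N, eps):
    # Step 2(b): for i = 1..N, check that
    #   sigma_max(D_hat^i  D_{t-1} ... D_{t-N+i}) <= 1 - eps,
    # where D_hist holds the dynamics matrices actually used so far,
    # most recent first: [D_{t-1}, D_{t-2}, ...].
    for i in range(1, N + 1):
        M = np.linalg.matrix_power(D_hat, i)
        for D in D_hist[:N - i]:            # the N - i most recent past matrices
            M = M @ D
        if np.linalg.norm(M, 2) > 1 - eps:  # matrix 2-norm = sigma_max
            return False                    # REJECT: keep K_{t-1}
    return True                             # ACCEPT K_hat

# The individually stable matrices from Equation (6) are rejected as soon as
# their product enters the test, while a contractive matrix is accepted:
D_odd = np.array([[0.9, 0.0], [10.0, 0.9]])
D_even = np.array([[0.9, 10.0], [0.0, 0.9]])
print(accept_gains(D_even, [D_odd, D_even, D_odd], N=4, eps=0.1))          # False
print(accept_gains(0.5 * np.eye(2), [0.5 * np.eye(2)] * 3, N=4, eps=0.1))  # True
```

Note that the test at i = N (Equation 15) involves only D̂_t, so a proposed gain is vetted both on its own long-run behavior and in combination with the recently used dynamics.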
For example, it is a folk theorem in control that if we
use only stable sets of gains (K : |λ_max(A + BK)| < 1), and if we switch \"sufficiently
slowly\" between them, then the system will be stable. For our specific algorithm, we can show
the following:

Theorem 3.5: Let any 0 < ε < 1 be fixed, and let K ⊂ R^{n_u×n_x} be a finite set of controller
gains, so that for all K ∈ K, we have |λ_max(A + BK)| < 1. Then there exist constants N_0
and k so that for all N ≥ N_0, if (i) our algorithm is run with parameters N, ε, and (ii)
the adaptive control algorithm f picks only gains in K, and moreover switches gains no
more than once every k steps (i.e., K̂_t ≠ K̂_{t+1} implies K̂_{t+1} = K̂_{t+2} = · · · = K̂_{t+k}), then all
controllers proposed by f will be accepted.

Figure 1: (a) Typical state sequence (first component x_{t,1} of state vector) using switching controllers
from Equation (6). (Note log-scale on vertical axis.) (b) Typical state sequence using our algorithm
and the same controller f. (N = 150, ε = 0.1) (c) Index of the controller used over time, when using
our algorithm.

The proof is omitted due to space constraints. A similar result also holds if K is infinite
(but there is some c > 0 so that for all K ∈ K, |λ_max(A + BK)| ≤ 1 − c), and if the proposed
gains change on every step but the differences ||K̂_t − K̂_{t+1}||_F between successive values
are small.

4 Experiments
We now present experimental results illustrating the behavior of our algorithm. In the first
experiment, we apply the switching controller given in (6).
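The instability of this alternating controller is easy to reproduce in a few lines; the simulation below is our own sketch (with an arbitrary seed and horizon), not the paper's exact experiment:

```python
import numpy as np

# Simulate x_{t+1} = D_t x_t + w_t, alternating between the two matrices of
# Equation (6).  Each D_t alone is stable (|lambda_max| = 0.9 < 1), but the
# product D_even @ D_odd has sigma_max over 100, so the state diverges.
rng = np.random.default_rng(0)
D_odd = np.array([[0.9, 0.0], [10.0, 0.9]])
D_even = np.array([[0.9, 10.0], [0.0, 0.9]])

x = np.zeros(2)
for t in range(60):
    D = D_odd if t % 2 == 1 else D_even
    x = D @ x + rng.standard_normal(2)  # bounded (here Gaussian) disturbances
print(np.linalg.norm(x))                # astronomically large: not BIBO stable
```

The state norm roughly squares its growth factor every two steps, mirroring the ||x_{2t+1}||_2 ≥ (100.81)^t bound derived from Equation (9).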
Figure 1a shows a typical state
sequence resulting from using this controller without using our algorithm to monitor it (and
with w_t's drawn IID from a standard Normal distribution). Even though |λ_max(D_t)| < 1 for all t, the
controlled system is unstable, and the state rapidly diverges. In contrast, Figure 1b shows
the result of rerunning the same experiment, but using our algorithm to accept or reject
controllers. The resulting system is stable, and the states remain small. Figure 1c also
shows which of the two controllers in (6) is being used at each time, when our algorithm
is used. (If we did not use our algorithm, so that the controller switches on every time step,
this figure would switch between 0 and 1 on every time step.) We see that our algorithm is
rejecting most of the proposed switches to the controller; specifically, it is permitting f to
switch between the two controllers only every 140 steps or so. By slowing down the rate at
which we switch controllers, our algorithm causes the system to become stable (compare Theorem 3.5).

In our second example, we will consider a significantly more complex setting representative
of a real-world application. We consider controlling a Boeing 747 aircraft in a setting
where the states are only partially observable. We have a four-dimensional state vector x_t
consisting of the sideslip angle β, bank angle φ, yaw rate, and roll rate of the aircraft in
cruise flight. The two-dimensional controls u_t are the rudder and aileron deflections. The
state transition dynamics are given as in Equation (1),⁶ with IID Gaussian disturbance terms
w_t. But instead of observing the states directly, on each time step t we observe only

 y_t = C x_t + v_t,    (20)

where y_t ∈ R^{n_y}, and the disturbances v_t ∈ R^{n_y} are distributed Normal(0, Σ_v).
If the system
is stationary (i.e., if A, B, C, Σ_v, Σ_w were fixed), then this is a standard LQG problem,
and optimal estimates x̂_t of the hidden states x_t are obtained using a Kalman filter:

 x̂_{t+1} = L_t (y_{t+1} − C(A x̂_t + B u_t)) + A x̂_t + B u_t,    (21)

where L_t ∈ R^{n_x×n_y} is the Kalman filter gain matrix. Further, it is known that, in LQG,
the optimal steady state controller is obtained by picking actions according to u_t = K_t x̂_t,
where K_t are appropriate control gains. Standard algorithms exist for solving for the optimal
steady-state gain matrices L and K. [1]

In our aircraft control problem,

 C = [ 0 1 0 0 ]
     [ 0 0 0 1 ],

so that only two of the four state variables, the bank angle φ and the roll rate, are observed
directly. Further, the noise in the observations varies over time.
Specifically, sometimes the variance of the first observation is Σ_v,11 = Var(v_{t,1}) = 2,
while the variance of the second observation is Σ_v,22 = Var(v_{t,2}) = 0.5; and sometimes
the values of the variances are reversed, Σ_v,11 = 0.5, Σ_v,22 = 2. (Σ_v ∈ R^{2×2} is diagonal in
all cases.) This models a setting in which, at various times, either of the two sensors may
be the more reliable/accurate one.

 ⁶ The parameters A ∈ R^{4×4} and B ∈ R^{4×2} are obtained from a standard 747 (\"yaw damper\")
model, which may be found in, e.g., the Matlab control toolbox, and various texts such as [6].

Figure 2: (a) Typical evolution of the true Σ_v,11 over time (straight lines) and the online approximation to
it. (b) Same as (a), but showing an example in which the learned variance estimate became negative.

Since the reliability of the sensors changes over time, one might want to apply an online
learning algorithm (such as online stochastic gradient ascent) to dynamically estimate the
values of Σ_v,11 and Σ_v,22.
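The exact gradient update is not spelled out above; one plausible instantiation (an assumption on our part) is stochastic gradient ascent on the Gaussian log-likelihood of each observation residual r, whose gradient with respect to the variance s is −1/(2s) + r²/(2s²). As discussed further below, an unguarded step of this kind can overshoot zero:

```python
import numpy as np

def sgd_var_step(s, r, lr):
    # One gradient-ascent step on log N(r; 0, s) with respect to the
    # variance s.  (A plausible update; assumed, not taken from the paper.)
    grad = -0.5 / s + r**2 / (2.0 * s**2)
    return s + lr * grad

# Tracking a piecewise-constant variance, as in Figure 2a:
rng = np.random.default_rng(1)
s_est = 1.0
for t in range(20000):
    true_var = 2.0 if (t // 5000) % 2 == 0 else 0.5  # the variances swap
    r = rng.normal(0.0, np.sqrt(true_var))
    s_est = sgd_var_step(s_est, r, lr=0.02)          # hovers near true_var

# The failure mode discussed in the text: from a small current estimate,
# a single step can overshoot and leave a negative variance.
print(sgd_var_step(0.05, 0.0, lr=0.1))               # negative
```

The learning rate, segment lengths, and seed here are arbitrary choices for illustration.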
Figure 2 shows a typical evolution of Σ_v,11 over time, and the
result of using a stochastic gradient ascent learning algorithm to estimate Σ_v,11. Empirically,
a stochastic gradient algorithm seems to do fairly well at tracking the true Σ_v,11.
Thus, one simple adaptive control scheme would be to take the current estimate of Σ_v at
each time step t, apply a standard LQG solver to this estimate (together with A, B, C, Σ_w)
to obtain the optimal steady-state Kalman filter and control gains, and use the values
obtained as our proposed gains L̂_t and K̂_t for time t. This gives a simple method for adapting
our controller and Kalman filter parameters to the varying noise parameters.

The adaptive control algorithm that we have described is sufficiently complex that it is
extremely difficult to prove that it gives a stable controller. Thus, to guarantee BIBO stability
of the system, one might choose to run it with our algorithm. To do so, note that the \"state\"
of the controlled system at each time step is fully characterized by the true world state x_t
and the internal state estimate x̂_t of the Kalman filter. So, we can define an augmented
state vector x̃_t = [x_t; x̂_t] ∈ R^8. Because x_{t+1} is linear in u_t (which is in turn linear in x̂_t),
and similarly x̂_{t+1} is linear in x_t and u_t (substitute (20) into (21)), for a fixed set of gains
K_t and L_t, we can express x̃_{t+1} as a linear function of x̃_t plus a disturbance:

 x̃_{t+1} = D̃_t x̃_t + w̃_t.    (22)

Here, D̃_t depends implicitly on A, B, C, L_t and K_t. (The details are not complex, but are
omitted due to space.) Thus, if a learning algorithm is proposing new K̂_t and L̂_t matrices
on each time step, we can ensure that the resulting system is BIBO stable by computing
the corresponding D̃_t as a function of K̂_t and L̂_t, and running our algorithm (with D̃_t's
replacing the D_t's) to decide if the proposed gains should be accepted.
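One way to recover the omitted form of D̃_t is to substitute u_t = K_t x̂_t and y_{t+1} = C x_{t+1} + v_{t+1} into Equations (1) and (21). The sketch below is our derivation (one consistent choice under those equations, not necessarily the authors' exact bookkeeping), with a numerical check against the original recursions:

```python
import numpy as np

def augmented_dynamics(A, B, C, K, L):
    # With u = K x_hat and y_{t+1} = C x_{t+1} + v_{t+1}, Equations (1)
    # and (21) give (dropping the noise terms, which enter additively):
    #   x_{t+1}     = A x + B K x_hat
    #   x_hat_{t+1} = L C A x + (A + B K - L C A) x_hat
    top = np.hstack([A, B @ K])
    bot = np.hstack([L @ C @ A, A + B @ K - L @ C @ A])
    return np.vstack([top, bot])

# Sanity check: iterate the pair of recursions directly (zero noise) and
# compare with the augmented linear map.  Matrices here are arbitrary.
rng = np.random.default_rng(0)
nx, nu, ny = 4, 2, 2
A = 0.2 * rng.standard_normal((nx, nx))
B = rng.standard_normal((nx, nu))
C = rng.standard_normal((ny, nx))
K = rng.standard_normal((nu, nx))
L = rng.standard_normal((nx, ny))
D_tilde = augmented_dynamics(A, B, C, K, L)

x, x_hat = rng.standard_normal(nx), rng.standard_normal(nx)
z = np.concatenate([x, x_hat])
for _ in range(5):
    u = K @ x_hat
    x_next = A @ x + B @ u                                  # Equation (1)
    x_hat = L @ (C @ x_next - C @ (A @ x_hat + B @ u)) + A @ x_hat + B @ u  # Eq. (21)
    x = x_next
    z = D_tilde @ z
assert np.allclose(z, np.concatenate([x, x_hat]))
```

With such a D̃_t in hand, the monitoring algorithm of Section 3 applies unchanged to the partially observed problem.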
In the event that
they are rejected, we set K_t = K_{t−1}, L_t = L_{t−1}.
It turns out that there is a very subtle bug in the online learning algorithm. Specifically,
we were using standard stochastic gradient ascent to estimate Σ_v,11 (and Σ_v,22), and on
every step there is a small chance that the gradient update overshoots zero, causing Σ_v,11
to become negative. While the probability of this occurring on any particular time step is
small, a Boeing 747 flown for sufficiently many hours using this algorithm will eventually
encounter this bug and obtain an invalid, negative, variance estimate. When this occurs, the
Matlab LQG solver for the steady-state gains outputs L = 0 on this and all successive time
steps.⁷ If this were implemented on a real 747, this would cause it to ignore all observations
(Equation 21), enter divergent oscillations (see Figure 3a), and crash. However, using our
algorithm, the behavior of the system is shown in Figure 3b. When the learning algorithm

 ⁷ Even if we had anticipated this specific bug and clipped Σ_v,11 to be non-negative, the LQG
solver (from the Matlab controls toolbox) still outputs invalid gains, since it expects a nonsingular Σ_v.

Figure 3: (a) Typical plot of state (x_{t,1}) using the (buggy) online learning algorithm in a sequence
in which L was set to zero part-way through the sequence. (Note scale on vertical axis; this plot is
typical of a linear system entering divergent/unstable oscillations.)
(b) Results on the same sequence of
disturbances as in (a), but using our algorithm.

encounters the bug, our algorithm successfully rejects the changes to the gains that lead to
instability, thereby keeping the system stable.

5 Discussion
Space constraints preclude a full discussion, but these ideas can also be applied to verifying
the stability of certain nonlinear dynamical systems. For example, if the A (and/or B)
matrix depends on the current state but is always expressible as a convex combination
of some fixed A_1, . . . , A_k, then we can guarantee BIBO stability by ensuring that (10)
holds for all combinations of D_t = A_i + B K_t defined using any A_i (i = 1, . . . , k).⁸ The
same idea also applies to settings where A may be changing (perhaps adversarially) within
some bounded set, or if the dynamics are unknown so that we need to verify stability
with respect to a set of possible dynamics. In simulation experiments of the Stanford
autonomous helicopter, by using a linearization of the non-linear dynamics, our algorithm
was also empirically successful at stabilizing an adaptive control algorithm that normally
drives the helicopter into unstable oscillations.

References

 [1] B. Anderson and J. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, 1989.
 [2] Karl Åström and Björn Wittenmark. Adaptive Control (2nd Edition). Addison-Wesley, 1994.
 [3] V. D. Blondel and J. N. Tsitsiklis. The boundedness of all products of a pair of matrices is
     undecidable. Systems and Control Letters, 41(2):135-140, 2000.
 [4] Michael S. Branicky. Analyzing continuous switching systems: Theory and examples. In Proc.
     American Control Conference, 1994.
 [5] Michael S. Branicky. Stability of switched and hybrid systems. In Proc. 33rd IEEE Conf.
     Decision and Control, 1994.
 [6] G. Franklin, J. Powell, and A. Emami-Naeini. Feedback Control of Dynamic Systems. Addison-Wesley, 1995.
 [7] M. Johansson and A. Rantzer.
On the computation of piecewise quadratic Lyapunov functions.
     In Proceedings of the 36th IEEE Conference on Decision and Control, 1997.
 [8] H. Khalil. Nonlinear Systems (3rd ed). Prentice Hall, 2001.
 [9] Daniel Liberzon, João Hespanha, and A. S. Morse. Stability of switched linear systems: A
     Lie-algebraic condition. Systems & Control Letters, 37(3):117-122, 1999.
[10] J. Nakanishi, J. A. Farrell, and S. Schaal. A locally weighted learning composite adaptive
     controller with structure adaptation. In International Conference on Intelligent Robots, 2002.
[11] T. J. Perkins and A. G. Barto. Lyapunov design for safe reinforcement learning control. In Safe
     Learning Agents: Papers from the 2002 AAAI Symposium, pages 23-30, 2002.
[12] Jean-Jacques Slotine and Weiping Li. Applied Nonlinear Control. Prentice Hall, 1990.

 ⁸ Checking all k^N such combinations takes time exponential in N, but it is often possible to use
very small values of N, sometimes including N = 1, if the states x_t are linearly reparameterized
(x̄_t = M x_t) to minimize σ_max(D_0).
", "award": [], "sourceid": 2623, "authors": [{"given_name": "H.", "family_name": "Kim", "institution": null}, {"given_name": "Andrew", "family_name": "Ng", "institution": null}]}