{"title": "Logarithmic Regret for Online Control", "book": "Advances in Neural Information Processing Systems", "page_first": 10175, "page_last": 10184, "abstract": "We study optimal regret bounds for control in linear dynamical systems under adversarially changing strongly convex cost functions, given the knowledge of transition dynamics. This includes several well studied and influential frameworks such as the Kalman filter and the linear quadratic regulator. State of the art methods achieve regret which scales as T^0.5, where T is the time horizon. \n\nWe show that the optimal regret in this fundamental setting can be significantly smaller, scaling as polylog(T). This regret bound is achieved by two different efficient iterative methods, online gradient descent and online natural gradient.", "full_text": "Logarithmic Regret for Online Control\n\nNaman Agarwal1\n\nElad Hazan1 2\n\nKaran Singh1 2\n\n1 Google AI Princeton\n\nnamanagarwal@google.com, {ehazan,karans}@princeton.edu\n\n2 Computer Science, Princeton University\n\nAbstract\n\nWe study optimal regret bounds for control in linear dynamical systems under\nadversarially changing strongly convex cost functions, given the knowledge of tran-\nsition dynamics. This includes several well studied and fundamental frameworks\n\u221a\nsuch as the Kalman \ufb01lter and the linear quadratic regulator. State of the art methods\nachieve regret which scales as O(\nWe show that the optimal regret in this setting can be signi\ufb01cantly smaller, scaling\nas O(poly(log T )). This regret bound is achieved by two different ef\ufb01cient iterative\nmethods, online gradient descent and online natural gradient.\n\nT ), where T is the time horizon.\n\n1\n\nIntroduction\n\nAlgorithms for regret minimization typically attain one of two performance guarantees. For general\nconvex losses, regret scales as square root of the number of iterations, and this is tight. 
However, if the loss functions exhibit more curvature, such as quadratic loss functions, there exist algorithms that attain poly-logarithmic regret. This distinction is also known as "fast rates" in statistical estimation.

Despite their ubiquitous use in online learning and statistical estimation, logarithmic regret algorithms are almost non-existent in the control of dynamical systems. This can be attributed to fundamental challenges in computing the optimal controller in the presence of noise.

Time-varying cost functions in dynamical systems can be used to model unpredictable dynamic resource constraints, as well as the tracking of a desired sequence of exogenous states. However, with changing (even strongly) convex loss functions, the optimal controller for a linear dynamical system is not immediately computable via a convex program. For the special case of quadratic loss, some previous works [9] remedy the situation by taking a semi-definite relaxation, and thereby obtain a controller which has provable guarantees on regret and computational requirements. However, this semi-definite relaxation reduces the problem to regret minimization over linear costs, and removes the curvature which is necessary to obtain logarithmic regret.

In this paper we give the first efficient poly-logarithmic regret algorithms for controlling a linear dynamical system with noise in the dynamics (i.e. the standard model).
Our results apply to general convex loss functions that are strongly convex, and not only to quadratics.

Reference   Noise         Regret           Loss functions
[1]         none          $O(\log^2 T)$    quadratic (fixed Hessian)
[4]         adversarial   $O(\sqrt{T})$    convex
[9]         stochastic    $O(\sqrt{T})$    quadratic
here        stochastic    $O(\log^7 T)$    strongly convex

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Our Results

The setting we consider is a linear dynamical system, a continuous state Markov decision process with linear transitions, described by the following equation:

$$x_{t+1} = A x_t + B u_t + w_t. \qquad (1.1)$$

Here $x_t$ is the state of the system, $u_t$ is the action (or control) taken by the controller, and $w_t$ is the noise. In each round $t$, the learner outputs an action $u_t$ upon observing the state $x_t$ and incurs a cost of $c_t(x_t, u_t)$, where $c_t$ is convex. The objective is to choose a sequence of adaptive controls $u_t$ so as to incur a minimum total cost.

The approach taken by [9] and other previous works is to use a semi-definite relaxation for the controller. However, this removes the properties associated with the curvature of the loss functions, by reducing the problem to an instance of online linear optimization. It is known that without curvature, $O(\sqrt{T})$ regret bounds are tight (see [13]).

Therefore we take a different approach, initiated by [4]. We consider controllers that depend on the previous noise terms, and take the form $u_t = \sum_{i=1}^{H} M_i w_{t-i}$. While this resulting convex relaxation does not remove the curvature of the loss functions altogether, it results in an overparametrized representation of the controller, and it is not a priori clear that the loss functions are strongly convex with respect to the parameterization.
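To make the dynamics (1.1) and the disturbance-based controller concrete, the following minimal sketch simulates a scalar instance of the system under an action of the form $u_t = -k x_t + \sum_i M^{[i]} w_{t-1-i}$. All concrete numbers (the system values $a, b, k$, the weights $M$, and the noise sequence) are illustrative assumptions, not taken from the paper.

```python
# Sketch: a scalar linear dynamical system x_{t+1} = a*x_t + b*u_t + w_t,
# controlled by a disturbance-action policy u_t = -k*x_t + sum_i M[i]*w_{t-1-i}.

def run_lds(a, b, k, M, noise, x0=0.0):
    """Roll out the system under the disturbance-action controller.

    M[i] multiplies the disturbance observed i+1 steps ago.
    Returns the state and action trajectories.
    """
    H = len(M)
    past_w = [0.0] * H          # w_{t-1}, w_{t-2}, ..., w_{t-H}
    x, xs, us = x0, [], []
    for w in noise:
        u = -k * x + sum(M[i] * past_w[i] for i in range(H))
        xs.append(x)
        us.append(u)
        x = a * x + b * u + w   # dynamics (1.1)
        past_w = [w] + past_w[:-1]
    return xs, us

# Example: a stable closed loop (|a - b*k| < 1) with a short noise sequence.
noise = [0.5, -0.3, 0.2, 0.1, -0.4]
xs, us = run_lds(a=0.9, b=1.0, k=0.5, M=[0.2, 0.1], noise=noise)
print(xs[1])  # x_1 = 0.9*0 + 1.0*0 + 0.5 = 0.5
```

The point of the parameterization is visible here: the action is linear in the entries of $M$, which is what makes the later convex relaxation possible.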
We demonstrate the appropriate conditions on the linear dynamical system under which strong convexity is retained.

Henceforth we present two methods that attain poly-logarithmic regret. They differ in the regret bounds they afford and in the computational cost of their execution. The online gradient descent (OGD) update requires only gradient computation and update, whereas the online natural gradient (ONG) update, in addition, requires the computation of the preconditioner, which is the expected Gram matrix of the Jacobian, denoted $J$, and its inverse. However, the natural gradient update admits an instance-dependent upper bound on the regret, which, while being at least as good as the regret bound on OGD, offers better guarantees on benign instances (see Corollary 4.5, for example).

Algorithm   Update rule (simplified)
OGD         $M_{t+1} \leftarrow M_t - \eta_t \nabla f_t(M_t)$
ONG         $M_{t+1} \leftarrow M_t - \eta_t (\mathbb{E}[J^\top J])^{-1} \nabla f_t(M_t)$

Applicability (both methods): $\exists K$ and diagonal $L$ s.t. $A - BK = Q L Q^{-1}$ with $\|L\| \le 1 - \delta$ and $\|Q\|, \|Q^{-1}\| \le \kappa$.

1.2 Related Work

For a survey of linear dynamical systems (LDS), as well as learning, prediction and control problems, see [17]. Recently, there has been a renewed interest in learning dynamical systems in the machine learning literature. For fully-observable systems, sample complexity and regret bounds for control (under Gaussian noise) were obtained in [3, 10, 2]. The technique of spectral filtering for learning and open-loop control of partially observable systems was introduced and studied in [15, 7, 14]. Provable control in the Gaussian noise setting via the policy gradient method was also studied in [11].

The closest work to ours is that of [1] and [9], aimed at controlling LDS with adversarial loss functions.
The authors in [1] obtain an $O(\log^2 T)$ regret algorithm for changing quadratic costs (with a fixed Hessian), but for dynamical systems that are noise-free. In contrast, our results apply to the full (noisy) LDS setting, which presents the main challenges as discussed before. Cohen et al. [9] consider changing quadratic costs with stochastic noise and achieve an $O(\sqrt{T})$ regret bound.

We make extensive use of techniques from online learning [8, 16, 13]. Of particular interest to our study is the setting of online learning with memory [5]. We also build upon the recent control work of [4], who use online learning techniques and convex relaxation to obtain provable bounds for LDS with adversarial perturbations.

2 Problem Setting

We consider a linear dynamical system as defined in (1.1) with costs $c_t(x_t, u_t)$, where $c_t$ is strongly convex. In this paper we assume that the noise $w_t$ is a random variable generated independently at every time step. For any algorithm $\mathcal{A}$, we attribute a cost defined as

$$J_T(\mathcal{A}) = \mathbb{E}_{\{w_t\}}\left[\sum_{t=1}^{T} c_t(x_t, u_t)\right],$$

where $x_{t+1} = A x_t + B u_t + w_t$, $u_t = \mathcal{A}(x_1, \ldots, x_t)$, and $\mathbb{E}_{\{w_t\}}$ represents the expectation over the entire noise sequence. For the rest of the paper we drop the subscript $\{w_t\}$ from the expectation, as it is the only source of randomness. Overloading notation, we shall use $J_T(K)$ to denote the cost of a linear controller $K$ which chooses the action $u_t = -K x_t$.

In the paper we assume that $x_1 = 0$,¹ as well as the following conditions.

Assumptions.
Assumption 2.1. We assume that $\|B\| \le \kappa_B$.
Furthermore, the perturbation introduced per time step is bounded, i.i.d., and zero-mean with a lower-bounded covariance, i.e.,

$$\forall t:\quad w_t \sim \mathcal{D}_w,\quad \mathbb{E}[w_t] = 0,\quad \mathbb{E}[w_t w_t^\top] \succeq \sigma^2 I \quad\text{and}\quad \|w_t\| \le W.$$

This may be adapted to the case of sub-Gaussian noise by conditioning on the event that none of the noise vectors are ever large. Such an adaptation introduces a multiplicative $\log(T)$ factor in the regret.

Assumption 2.2. The costs $c_t(x, u)$ are $\alpha$-strongly convex. Whenever $\|x\|, \|u\| \le D$, it holds that

$$\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le GD.$$

The class of linear controllers we work with is defined as follows; see Section A for a detailed note.

Definition 2.3 (Diagonal Strong Stability). Given dynamics $(A, B)$, a linear controller $K$ is $(\kappa, \gamma)$-diagonal strongly stable for real numbers $\kappa \ge 1$, $\gamma < 1$, if there exists a complex diagonal matrix $L$ and a non-singular complex matrix $Q$ such that $A - BK = Q L Q^{-1}$, with the following being true:

1. The spectral norm of $L$ is strictly smaller than one, i.e., $\|L\| \le 1 - \gamma$.
2. The controller and transforming matrices are bounded, i.e., $\|K\| \le \kappa$ and $\|Q\|, \|Q^{-1}\| \le \kappa$.

Regret Formulation. Let $\mathcal{K} = \{K : K \text{ is } (\kappa, \gamma)\text{-diagonal strongly stable}\}$. For an algorithm $\mathcal{A}$, the notion of regret we consider is pseudo-regret, i.e. the sub-optimality of its cost with respect to the cost of the best linear controller:

$$\text{Regret} = J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K).$$

3 Preliminaries

Notation. We reserve the letters $x, y$ for states and $u, v$ for actions. We denote by $d_x, d_u$ the dimensionality of the state and the control space respectively. Let $d = \max(d_x, d_u)$. We reserve capital letters $A, B, K, M$ for matrices associated with the system and the policy.
Other capital letters are reserved for universal constants in the paper. We use the shorthand $M_{i:j}$ to denote a subsequence $\{M_i, \ldots, M_j\}$. For any matrix $U$, define $U_{vec}$ to be a flattening of the matrix where we stack the columns upon each other. Further, for a collection of matrices $M = \{M^{[i]}\}$, let $M_{vec}$ be the flattening defined by stacking the flattenings of the $M^{[i]}$ upon each other. We use $\|x\|_U^2 = x^\top U x$ to denote the matrix-induced norm. The rest of this section provides a recap of the relevant definitions and concepts introduced in [4].

3.1 Reference Policy Class

For the rest of the paper, we fix a $(\kappa, \gamma)$-diagonally strongly stable matrix $\mathbf{K}$ (the bold notation is to stress that we treat this matrix as fixed and not as a parameter). Note that this can be any such matrix, and it can be computed via a semi-definite feasibility program [9] given the knowledge of the dynamics, before the start of the game. We work with the following class of policies.

¹ This is only for convenience of presentation. The case with a bounded $x_1$ can be handled similarly.

Definition 3.1 (Disturbance-Action Policy). A disturbance-action policy $M = (M^{[0]}, \ldots, M^{[H-1]})$, for horizon $H \ge 1$, is defined as the policy which at every time $t$ chooses the recommended action $u_t$ at a state $x_t$, defined² as

$$u_t(M) \triangleq -K x_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}.$$

For notational convenience, here it may be considered that $w_i = 0$ for all $i < 0$.

The policy applies a linear transformation to the disturbances observed in the past $H$ steps. Since $(x, u)$ is a linear function of the past disturbances under a linear controller $K$, formulating the policy this way can be seen as a relaxation of the class of linear policies. Note that $K$ is a fixed matrix and is not part of the parameterization of the policy.
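A minimal scalar sketch (illustrative numbers, not from the paper) of why this policy is executable online: the controller never observes $w_t$ directly, but can reconstruct each past disturbance exactly from the observed states and the actions it played.

```python
# Sketch: the disturbance-action policy is executable because w_{t-1} is
# recoverable once x_t is observed: w_{t-1} = x_t - a*x_{t-1} - b*u_{t-1}.
# Scalar system; all concrete numbers are illustrative assumptions.

def recover_disturbances(a, b, xs, us):
    """Given observed states x_0..x_T and played actions u_0..u_{T-1},
    return the realized disturbances w_0..w_{T-1}."""
    return [xs[t + 1] - a * xs[t] - b * us[t] for t in range(len(us))]

# Forward-simulate with known noise, then recover it from (x, u) alone.
a, b = 0.8, 1.2
true_w = [0.3, -0.1, 0.25, 0.0]
xs, us, x = [0.0], [], 0.0
for w in true_w:
    u = 0.1 * x            # any control rule measurable in the past works here
    us.append(u)
    x = a * x + b * u + w
    xs.append(x)

print(recover_disturbances(a, b, xs, us))  # ≈ [0.3, -0.1, 0.25, 0.0] (up to float rounding)
```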
As was established in [4] (and we include the proof for completeness), with the appropriate choice of parameters, superimposing such a $K$ onto the policy class allows it to approximate any linear policy, in terms of the total cost suffered, with a finite horizon parameter $H$.

We refer to the policy played at time $t$ as $M_t = \{M_t^{[i]}\}$, where the subscript $t$ refers to the time index and the superscript $[i-1]$ refers to the action of $M_t$ on $w_{t-i}$. Note that such a policy can be executed because $w_{t-1}$ is perfectly determined upon the specification of $x_t$ as $w_{t-1} = x_t - A x_{t-1} - B u_{t-1}$.

3.2 Evolution of State

This section describes the evolution of the state of the linear dynamical system under a non-stationary policy composed of a sequence of $T$ policies, where at each time the policy is specified by $M_t = (M_t^{[0]}, \ldots, M_t^{[H-1]})$. We will use $M_{0:T-1}$ to denote such a non-stationary policy. The following definitions ease the burden of notation.

1. Define $\tilde{A} = A - BK$. $\tilde{A}$ is helpful in describing the evolution of the state starting from a non-zero state in the absence of disturbances.

2. For any sequence of matrices $M_{0:H}$, define $\Psi_i$ as a linear function that describes the effect of $w_{t-i}$ on the state $x_t$, formally defined below.

Definition 3.2. For any sequence of matrices $M_{0:H}$, define the disturbance-state transfer matrix $\Psi_i$, for $i \in \{0, 1, \ldots, H\}$, to be a function with $H + 1$ inputs defined as

$$\Psi_i(M_{0:H}) \triangleq \tilde{A}^i \mathbf{1}_{i \le H} + \sum_{j=0}^{H} \tilde{A}^j B M^{[i-j-1]}_{H-j} \mathbf{1}_{i-j \in [1, H]}.$$

It will be important to note that $\Psi_i$ is a linear function of its argument.

3.3 Surrogate State and Surrogate Cost

This section introduces a couple of definitions required to describe our main algorithm. In essence
In essence\nthey describe a notion of state, its derivative and the expected cost if the system evolved solely under\nthe past H steps of a non-stationary policy.\nDe\ufb01nition 3.3 (Surrogate State & Surrogate Action). Given a sequence of matrices M0:H+1 and 2H\nindependent invocations of the random variable w given by {wj \u223c Dw}2H\u22121\nj=0 , de\ufb01ne the following\nrandom variables denoting the surrogate state and the surrogate action:\n\n2H(cid:88)\n\ny(M0:H ) =\n\n\u03a8i(M0:H )w2H\u2212i\u2212i,\n\ni=0\n\nv(M0:H+1) = \u2212Ky(M0:H ) +\n\nM [i\u22121]\n\nH+1 w2H\u2212i.\n\nH(cid:88)\n\nWhen M is the same across all arguments we compress the notation to y(M ) and v(M ) respectively.\n2xt is completely determined given w0 . . . wt\u22121. Hence, the use of xt only serves to ease the burden of\n\npresentation.\n\ni=1\n\n4\n\n\fAlgorithm 1 Online Control Algorithm\n1: Input: Step size schedule \u03b7t, Parameters \u03baB, \u03ba, \u03b3, T .\n2: De\ufb01ne H = \u03b3\u22121 log(T \u03ba2)\n3: De\ufb01ne M = {M = {M [0] . . . M [H\u22121]} : (cid:107)M [i\u22121](cid:107) \u2264 \u03ba3\u03baB(1 \u2212 \u03b3)i}.\n4: Initialize M0 \u2208 M arbitrarily.\n5: for t = 0, . . . , T \u2212 1 do\n6:\n\nChoose the action:\n\nH(cid:88)\n\nut = \u2212Kxt +\n\nM [i\u22121]\n\nt\n\nwt\u2212i.\n\nObserve the new state xt+1 and record wt = xt+1 \u2212 Axt \u2212 But.\n\ni=1\n\n7:\n8: Online Gradient Update:\n\nMt+1 = \u03a0M(Mt \u2212 \u03b7t\u2207ft(Mt))\n\n9: Online Natural Gradient Update:\n\nMvec,t+1 = \u03a0M(Mvec,t \u2212 \u03b7t(E[J T J])\u22121\u2207Mvec,tft(Mt))\n\n10: end for\n\nDe\ufb01nition 3.4 (Surrogate Cost). De\ufb01ne the surrogate cost function ft to be the cost associated with\nthe surrogate state-action pair de\ufb01ned above, i.e.,ft(M0:H+1) = E [ct(y(M0:H ), v(M0:H+1))] .\nWhen M is the same across all arguments we compress the notation to ft(M ).\n\nDe\ufb01nition 3.5 (Jacobian). Let z(M ) =\n\n. 
$z(M) = \begin{bmatrix} y(M) \\ v(M) \end{bmatrix}$. Since $y(M), v(M)$ are random linear functions of $M$, $z(M)$ can be reparameterized as $z(M) = J M_{vec} = \begin{bmatrix} J_y \\ J_v \end{bmatrix} M_{vec}$, where $J$ is a random matrix which derives its randomness from the random perturbations $w_i$.

3.4 OCO with Memory

We now describe the setting of online convex optimization with memory, introduced in [5]. In this setting, at every step $t$, an online player chooses some point $x_t \in \mathcal{K} \subset \mathbb{R}^d$; a loss function $f_t : \mathcal{K}^{H+1} \mapsto \mathbb{R}$ is then revealed, and the learner suffers a loss of $f_t(x_{t-H:t})$. We assume a coordinate-wise Lipschitz regularity on $f_t$, of the form that, for any $j \in \{0, \ldots, H\}$ and any $x_{0:H}, \tilde{x}_j \in \mathcal{K}$,

$$|f_t(x_{0:j-1}, x_j, x_{j+1:H}) - f_t(x_{0:j-1}, \tilde{x}_j, x_{j+1:H})| \le L \|x_j - \tilde{x}_j\|. \qquad (3.1)$$

In addition, we define $f_t(x) = f_t(x, \ldots, x)$, and we let

$$G_f = \sup_{t \in \{0, \ldots, T\},\, x \in \mathcal{K}} \|\nabla f_t(x)\|, \qquad D = \sup_{x, y \in \mathcal{K}} \|x - y\|. \qquad (3.2)$$

The resulting goal is to minimize the policy regret [6], which is defined as

$$\text{PolicyRegret} = \sum_{t=H}^{T} f_t(x_{t-H:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} f_t(x).$$

4 Algorithms & Statement of Results

The two variants of our method are spelled out in Algorithm 1. Theorems 4.1 and 4.3 provide the main guarantees for the two algorithms.

Online Gradient Update
Theorem 4.1 (Online Gradient Update). Suppose Algorithm 1 (Online Gradient Update) is executed with $K$ being any $(\kappa, \gamma)$-diagonal strongly stable matrix and $\eta_t = \Theta\left((\alpha \sigma^2 t)^{-1}\right)$, on an LDS satisfying Assumption 2.1 with control costs satisfying Assumption 2.2.
Then, it holds true that

$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le \tilde{O}\left( \frac{G^2 W^4}{\alpha \sigma^2} \log^7(T) \right).$$

The above result leverages the following lemma, which shows that the function $f_t(\cdot)$ is strongly convex with respect to its argument $M$. Note that strong convexity of the cost functions $c_t$ over the state-action space does not by itself imply strong convexity of the surrogate cost $f_t$ over the space of controllers $\mathcal{M}$. This is because, in the surrogate cost $f_t$, $c_t$ is applied to $y(M), v(M)$, which are themselves linear functions of $M$; this linear map is necessarily column-rank-deficient, as it takes a space of dimensionality $H \times \dim(x) \times \dim(u)$ to one of dimensionality $\dim(x) + \dim(u)$. The next lemma, which forms the core of our analysis, shows that strong convexity is nevertheless preserved, using the inherent stochastic nature of the dynamical system.

Lemma 4.2. If the cost functions $c_t(\cdot, \cdot)$ are $\alpha$-strongly convex, $K$ is a $(\kappa, \gamma)$-diagonal strongly stable matrix, and Assumption 2.1 is met, then the idealized functions $f_t(M)$ are $\lambda$-strongly convex with respect to $M$, where

$$\lambda = \frac{\alpha \sigma^2 \gamma^2}{36 \kappa^{10}}.$$

We present the proof for simple cases in Section 6, deferring the general proof to Section F.

Online Natural Gradient Update
Theorem 4.3 (Online Natural Gradient Update). Suppose Algorithm 1 (Online Natural Gradient Update) is executed with $\eta_t = \Theta\left((\alpha t)^{-1}\right)$, on an LDS satisfying Assumption 2.1 and with control costs satisfying Assumption 2.2. Then, it holds true that

$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le \tilde{O}\left( \frac{G W^2}{\alpha \mu} \log^7(T) \right), \quad \text{where } \mu^{-1} \triangleq \max_{M \in \mathcal{M}} \|(\mathbb{E}[J^\top J])^{-1} \nabla_{M_{vec}} f_t(M)\|.$$

In Theorem 4.3, the regret guarantee depends on an instance-dependent parameter $\mu$, which is a measure of the hardness of the problem.
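To make the two updates concrete, here is a toy sketch of one OGD step and one ONG step on a quadratic surrogate loss $f(m) = \frac{1}{2} m^\top P m$, with $P$ standing in for $\mathbb{E}[J^\top J]$ and a Euclidean norm-ball projection standing in for $\Pi_{\mathcal{M}}$. The loss, preconditioner, and all numbers are illustrative assumptions, not the paper's $f_t$ or constraint set.

```python
# Sketch of the two updates in Algorithm 1 on a toy quadratic surrogate loss
# f(m) = 0.5 * m^T P m, with P standing in for E[J^T J]; illustrative only.

def project(m, radius):
    """Euclidean projection of the parameter vector onto a ball of given radius."""
    norm = sum(v * v for v in m) ** 0.5
    if norm <= radius:
        return list(m)
    return [v * radius / norm for v in m]

def grad(P, m):                      # gradient of 0.5 * m^T P m is P m
    return [sum(P[i][j] * m[j] for j in range(len(m))) for i in range(len(m))]

def ogd_step(m, P, eta, radius):
    # M_{t+1} = Pi(M_t - eta * grad f_t(M_t))
    g = grad(P, m)
    return project([mi - eta * gi for mi, gi in zip(m, g)], radius)

def ong_step(m, P, P_inv, eta, radius):
    # M_{t+1} = Pi(M_t - eta * (E[J^T J])^{-1} grad f_t(M_t)); for this
    # quadratic loss the preconditioned gradient is P^{-1} P m = m.
    g = grad(P, m)
    pg = [sum(P_inv[i][j] * g[j] for j in range(len(g))) for i in range(len(g))]
    return project([mi - eta * gi for mi, gi in zip(m, pg)], radius)

P = [[2.0, 0.0], [0.0, 0.5]]         # a fixed positive-definite preconditioner
P_inv = [[0.5, 0.0], [0.0, 2.0]]
m = [1.0, 1.0]
print(ogd_step(m, P, 0.1, 10.0))     # [0.8, 0.95]: step scaled by the curvature
print(ong_step(m, P, P_inv, 0.1, 10.0))  # [0.9, 0.9]: uniform contraction
```

The example also illustrates why ONG is more expensive: each step multiplies by the inverse expected Gram matrix, which OGD never forms.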
First, we note that the proof of Lemma 4.2 establishes that the Gram matrix of the Jacobian (Definition 3.5) is strictly positive definite, and hence we recover the logarithmic regret guarantee achieved by the Online Gradient Update, with the constants preserved.

Corollary 4.4. In addition to the assumptions in Theorem 4.3, if $K$ is a $(\kappa, \gamma)$-diagonal strongly stable matrix, then for the natural gradient update

$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le \tilde{O}\left( \frac{G^2 W^4}{\alpha \sigma^2} \log^7(T) \right).$$

Proof. The conclusion follows from Lemma 5.2 and Lemma 6.1, the core component in the proof of Lemma 4.2, showing that $\mathbb{E}[J^\top J] \succeq \frac{\gamma^2 \sigma^2}{36 \kappa^{10}} \cdot I$.

Secondly, we note that, being instance-dependent, the guarantee the natural gradient update offers can potentially be stronger than that of the online gradient method. A case in point is the following corollary involving spherically symmetric quadratic costs, in which case the natural gradient update yields a regret guarantee under demonstrably more general conditions, in that the bound does not depend on the minimum eigenvalue $\sigma^2$ of the covariance of the disturbances, unlike OGD.

Corollary 4.5. Under the assumptions of Theorem 4.3, if the cost functions are of the form $c_t(x, u) = r_t(\|x\|^2 + \|u\|^2)$, where $r_t \in [\alpha, \beta]$ is an adversarially chosen sequence of numbers, and $K$ is chosen to be a $(\kappa, \gamma)$-diagonal strongly stable matrix, then the natural gradient update guarantees

$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le \tilde{O}\left( \frac{\beta^2 W^2}{\alpha} \log^7(T) \right).$$

Proof.
Note that $\|\nabla_{M_{vec}} f_t(M)\|_{(\mathbb{E}[J^\top J])^{-2}} = \|\mathbb{E}[J^\top (r_t \cdot I) J M_{vec}]\|_{(\mathbb{E}[J^\top J])^{-2}} \le \beta \|M_{vec}\|$.

5 Regret Analysis

The next section is a condensation of the results from [4], which we present in this form to highlight the reduction to OCO with memory.

5.1 Reduction to Low Regret with Memory

The next lemma shows that achieving low policy regret on the memory-based functions $f_t$ is sufficient to ensure low regret on the overall dynamical system. Since the proof is essentially provided by [4], we provide it in the Appendix for completeness. Define

$$\mathcal{M} \triangleq \{M = \{M^{[0]}, \ldots, M^{[H-1]}\} : \|M^{[i-1]}\| \le \kappa^3 \kappa_B (1 - \gamma)^i\}.$$

Lemma 5.1. Let the dynamical system satisfy Assumption 2.1 and let $K$ be any $(\kappa, \gamma)$-diagonal strongly stable matrix. Consider a sequence of loss functions $c_t(x, u)$ satisfying Assumption 2.2 and a sequence of policies $M_0, \ldots, M_T$ satisfying

$$\text{PolicyRegret} = \sum_{t=0}^{T} f_t(M_{t-H-1:t}) - \min_{M \in \mathcal{M}} \sum_{t=0}^{T} f_t(M) \le R(T)$$

for some function $R(T)$ and $f_t$ as defined in Definition 3.4. Let $\mathcal{A}$ be an online algorithm that plays the non-stationary controller sequence $\{M_0, \ldots, M_T\}$. Then, as long as $H$ is chosen to be larger than $\gamma^{-1} \log(T \kappa^2)$, we have that

$$J(\mathcal{A}) - \min_{K^* \in \mathcal{K}} J(K^*) \le R(T) + O(G W^2 \log(T)).$$

Here $O(\cdot), \Theta(\cdot)$ contain polynomial factors in $\gamma^{-1}, \kappa_B, \kappa, d$.

Lemma 5.2.
The function $f_t$ as defined in Definition 3.4 is coordinate-wise $L$-Lipschitz, and the norm of the gradient is bounded by $G_f$, where

$$L = \frac{2 D G W \kappa_B \kappa^3}{\gamma}, \qquad G_f \le G D W H d \left( H + \frac{2 \kappa_B \kappa^3}{\gamma} \right), \qquad \text{where } D \triangleq \frac{W \kappa^2 (1 + H \kappa_B^2 \kappa^3)}{\gamma (1 - \kappa^2 (1 - \gamma)^{H+1})} + \frac{\kappa_B \kappa^3 W}{\gamma}.$$

The proof of this lemma is identical to the analogous lemma in [4] and hence is omitted.

5.2 Analysis for Online Gradient Descent

In the setting of online convex optimization with memory, as shown by [5], running a memory-based OGD bounds the policy regret via the following theorem, proven in the appendix.

Theorem 5.3. Consider the OCO with memory setting defined in Section 3.4. Let $\{f_t\}_{t=H}^{T}$ be Lipschitz loss functions with memory such that the $f_t(x)$ are $\lambda$-strongly convex, and let $L$ and $G_f$ be as defined in (3.1) and (3.2). Then, there exists an algorithm which generates a sequence $\{x_t\}_{t=0}^{T}$ such that

$$\sum_{t=H}^{T} f_t(x_{t-H:t}) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} f_t(x) \le \left( \frac{G_f^2}{\lambda} + L H^2 G_f \right) (1 + \log(T)).$$

Proof of Theorem 4.1. Setting $H = \gamma^{-1} \log(T \kappa^2)$, Theorem 5.3, in conjunction with Lemma 5.2, implies that the policy regret is bounded by $\tilde{O}\left( \frac{G^2 W^4 H^6}{\alpha \sigma^2} \log T \right)$. An invocation of Lemma 5.1 now suffices to conclude the proof of the claim.

5.3 Analysis for Online Natural Gradient Descent

Consider structured loss functions of the form $f_t(M_{0:H+1}) = \mathbb{E}[c_t(z)]$, where $z = \sum_{i=0}^{H+1} J_i [M_i]_{vec}$, the $J_i$ are random matrices, and the $c_t$ are adversarially chosen strongly convex loss functions. In a similar vein, define $f_t(M)$ to be the specialization of $f_t$ when input the same argument $M$ in every slot. Define $J = \sum_{i=0}^{H+1} J_i$. The proof of the following theorem may be found in the appendix.

Theorem 5.4.
In the setting described in this subsection, let $c_t$ be $\alpha$-strongly convex, let $f_t$ satisfy equation (3.1) with constant $L$, and let $G_f = \max_{M \in \mathcal{M}} \|(\mathbb{E}[J^\top J])^{-1} \nabla_{M_{vec}} f_t(M)\|$. Then, the online natural gradient update generates a sequence $\{M_t\}_{t=0}^{T}$ such that

$$\sum_{t=H}^{T} f_t(M_{t-H:t}) - \min_{M \in \mathcal{M}} \sum_{t=H}^{T} f_t(M) \le \frac{\max_{M \in \mathcal{M}} \|\nabla_{M_{vec}} f_t(M)\|^2_{(\mathbb{E}[J^\top J])^{-1}}}{\alpha} + L H^2 G_f (1 + \log(T)).$$

Proof of Theorem 4.3. First observe that $\|\nabla_{M_{vec}} f_t(M)\|^2_{(\mathbb{E}[J^\top J])^{-1}} \le \mu^{-1} \|\nabla_{M_{vec}} f_t(M)\|$. Setting $H = \gamma^{-1} \log(T \kappa^2)$, Theorem 5.4, in conjunction with Lemma 5.2, implies the stated bound on the policy regret. An invocation of Lemma 5.1 suffices to conclude the proof of the claim.

6 Proof of Strong Convexity in Simpler Cases

We will need some definitions and preliminaries, outlined below. By definition we have that $f_t(M) = \mathbb{E}[c_t(y_t(M), v_t(M))]$. Since we know that $c_t$ is strongly convex, we have that

$$\nabla^2 f_t(M) = \mathbb{E}_{\{w_k\}_{k=0}^{2H-1}} \left[ J^\top \nabla^2 c_t(y(M), v(M)) \, J \right] \succeq \alpha \, \mathbb{E}_{\{w_k\}_{k=0}^{2H-1}} [J_y^\top J_y + J_v^\top J_v].$$

$J_y, J_v$ are random matrices dependent on the noise $\{w_k\}_{k=0}^{2H-1}$. The next lemma implies Lemma 4.2.

Lemma 6.1. If Assumption 2.1 is satisfied and $K$ is chosen to be a $(\kappa, \gamma)$-diagonal strongly stable matrix, then the following holds:

$$\mathbb{E}_{\{w_k\}_{k=0}^{2H-1}} [J_y^\top J_y + J_v^\top J_v] \succeq \frac{\gamma^2 \sigma^2}{36 \kappa^{10}} \cdot I.$$

To analyze $J_y, J_v$, we will need to rearrange the definition of $y(M)$ to make the dependence on each individual $M^{[i]}$ explicit.
To this end, consider the following definition for all $k \in [H + 1]$:

$$\tilde{v}_k(M) \triangleq \sum_{i=1}^{H} M^{[i-1]} w_{2H-i-k}.$$

Under this definition it follows that

$$y(M) = \sum_{k=1}^{H} (A - BK)^{k-1} B \tilde{v}_k(M) + \sum_{k=1}^{H} (A - BK)^{k-1} w_{2H-k}, \qquad v(M) = -K y(M) + \tilde{v}_0(M).$$

From the above definitions, $(J_y, J_v)$ may be characterized in terms of the Jacobian of $\tilde{v}_k$ with respect to $M$, which we denote for the rest of the section by $J_{\tilde{v}_k}$. Defining $M_{vec}$ as the vertical stacking of the rows of each $M^{[i]}$, i.e. stacking the columns of $(M^{[i]})^\top$, it can be observed that for all $k$,

$$J_{\tilde{v}_k} = \frac{\partial \tilde{v}_k(M)}{\partial M} = \begin{bmatrix} I_{d_u} \otimes w^\top_{2H-k-1} & I_{d_u} \otimes w^\top_{2H-k-2} & \cdots & I_{d_u} \otimes w^\top_{H-k} \end{bmatrix},$$

where $d_u$ is the dimension of the controls. We are now ready to analyze the two simpler cases. Further on in the section, we drop the subscripts $\{w_k\}_{k=0}^{2H-1}$ from the expectations for brevity.

6.1 Proof of Lemma 6.1: $K = 0$

In this section we assume that $K = 0$ is a $(\kappa, \gamma)$-diagonal strongly stable policy for $(A, B)$. By definition, we have $v(M) = \tilde{v}_0(M)$. One may conclude the proof with the following observation:

$$\mathbb{E}[J_y^\top J_y + J_v^\top J_v] \succeq \mathbb{E}[J_v^\top J_v] = \mathbb{E}[J_{\tilde{v}_0}^\top J_{\tilde{v}_0}] = I_{d_u} \otimes \Sigma \succeq \sigma^2 I.$$

6.2 Proof of Lemma 6.1: one-dimensional case

Here, we specialize Lemma 4.2 to one-dimensional state and one-dimensional control. This case highlights the difficulty caused in the proof by choosing a non-zero $K$, and presents the main ideas of the proof in simplified notation.

Note that in the one-dimensional case, the policy given by $M = \{M^{[i]}\}_{i=0}^{H-1}$ is an $H$-dimensional vector with each $M^{[i]}$ a scalar.
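As a sanity check, the Jacobian expression for $\tilde{v}_k$ above can be verified numerically in this scalar setting by finite differences; since $\tilde{v}_k$ is linear in $M$, the two should agree up to floating-point error. The horizon and noise values below are illustrative assumptions.

```python
# Sketch: numerically checking the Jacobian of v~_k in the scalar case.
# v~_k(M) = sum_{i=1..H} M[i-1] * w[2H-i-k], so the partial derivative with
# respect to M[i-1] is w[2H-i-k], i.e. J_{v~_k} = [w[2H-k-1], ..., w[H-k]].

H = 3
w = [0.5, -0.2, 0.3, 0.7, -0.1, 0.4]      # w_0, ..., w_{2H-1}, illustrative

def v_tilde(M, k):
    return sum(M[i - 1] * w[2 * H - i - k] for i in range(1, H + 1))

def fd_jacobian(M, k, eps=1e-6):
    # Central finite-difference approximation of d v~_k / d M.
    out = []
    for i in range(H):
        Mp = list(M); Mp[i] += eps
        Mm = list(M); Mm[i] -= eps
        out.append((v_tilde(Mp, k) - v_tilde(Mm, k)) / (2 * eps))
    return out

M = [0.3, -0.4, 0.25]
for k in range(0, H + 1):                  # k ranges over {0, ..., H}
    exact = [w[2 * H - i - k] for i in range(1, H + 1)]
    approx = fd_jacobian(M, k)
    assert all(abs(e - a) < 1e-6 for e, a in zip(exact, approx))
print("Jacobian matches finite differences for all k")
```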
Furthermore, $y(M), v(M), \tilde{v}_k(M)$ are scalars, and hence their Jacobians $J_y, J_v, J_{\tilde{v}_k}$ with respect to $M$ are $1 \times H$ vectors. In particular, we have that

$$J_{\tilde{v}_k} = \frac{\partial \tilde{v}_k(M)}{\partial M} = [w_{2H-k-1} \;\; w_{2H-k-2} \;\; \cdots \;\; w_{H-k}].$$

Therefore, using the fact that $\mathbb{E}[w_i w_j] = 0$ for $i \ne j$ and $\mathbb{E}[w_i^2] = \sigma^2$, it can be observed that for any $k_1, k_2$,

$$\mathbb{E}[J_{\tilde{v}_{k_1}}^\top J_{\tilde{v}_{k_2}}] = T_{k_1 - k_2} \cdot \sigma^2, \qquad (6.1)$$

where $T_m$ is defined as an $H \times H$ matrix with $[T_m]_{ij} = 1$ if and only if $i - j = m$, and $0$ otherwise. This in particular immediately gives us that

$$\mathbb{E}[J_y^\top J_y] = \underbrace{\left( \sum_{k_1=1}^{H} \sum_{k_2=1}^{H} T_{k_1 - k_2} \cdot (A - BK)^{k_1 - 1 + k_2 - 1} \right)}_{\triangleq G} \cdot B^2 \cdot \sigma^2, \qquad (6.2)$$

$$\mathbb{E}[J_{\tilde{v}_0}^\top J_y] = \underbrace{\left( \sum_{k=1}^{H} T_{-k} (A - BK)^{k-1} \right)}_{\triangleq Y} \cdot B \cdot \sigma^2. \qquad (6.3)$$

First, we prove a few spectral properties of the matrices $G$ and $Y$ defined above. From Gershgorin's circle theorem, and the fact that $K$ is $(\kappa, \gamma)$-diagonal strongly stable, we have

$$\|Y + Y^\top\| \le \left\| \sum_{k=1}^{H} (T_{-k} + T_k)(A - BK)^{k-1} \right\| \le 2 \gamma^{-1}. \qquad (6.4)$$

The spectral properties of $G$ summarized in the lemma below form the core of our analysis.

Lemma 6.2. $G$ is a symmetric positive definite matrix.
In particular, $G \succeq \frac{1}{4} \cdot I$.

Now consider the following statements, which follow from the respective definitions:

$$\mathbb{E}[J_v^\top J_v] = K^2 \cdot \mathbb{E}[J_y^\top J_y] - K \cdot \mathbb{E}[J_y^\top J_{\tilde{v}_0}] - K \cdot \mathbb{E}[J_{\tilde{v}_0}^\top J_y] + \mathbb{E}[J_{\tilde{v}_0}^\top J_{\tilde{v}_0}] = \sigma^2 \cdot \underbrace{\left( B^2 K^2 \cdot G - BK \cdot (Y + Y^\top) + I \right)}_{\triangleq F}.$$

Now $F \succeq 0$. We finish the proof by considering two cases. The first case is when $3|B|\gamma^{-1}\kappa \ge 1$. Noting $\kappa \ge 1$, in this case Lemma 6.2 immediately implies that

$$m^\top (F + B^2 \cdot G) m \ge m^\top (B^2 \cdot G) m \ge \frac{1}{4} \cdot \frac{\|m\|^2}{9 \gamma^{-2} \kappa^2} \ge \frac{\gamma^2 \|m\|^2}{36 \kappa^{10}}.$$

In the second case (when $3|B|\gamma^{-1}\kappa \le 1$), (6.4) implies that

$$m^\top (F + B^2 \cdot G) m \ge m^\top (I - BK \cdot (Y + Y^\top)) m \ge (1/3) \|m\|^2 \ge \frac{\gamma^2 \|m\|^2}{36 \kappa^{10}}.$$

6.2.1 Proof of Lemma 6.2

Recall that $T_m$ is defined as an $H \times H$ matrix with $[T_m]_{ij} = 1$ if and only if $i - j = m$, and $0$ otherwise. Define the following matrix for any complex number $|\psi| < 1$:

$$G(\psi) = \sum_{k_1=1}^{H} \sum_{k_2=1}^{H} T_{k_1 - k_2} \left( \psi^\dagger \right)^{k_1 - 1} \psi^{k_2 - 1}.$$

Note that $G$ in Lemma 6.2 is equal to $G(A - BK)$. The following lemma, proven in Section E, provides a lower bound on the spectrum of the matrix $G(\psi)$. The lemma addresses the more general case of a complex $\psi$, which aids the multi-dimensional case. A special case, $\psi = 1$, was proven in [12], and we follow a similar approach relying on the inverse.

Lemma 6.3. Let $\psi$ be a complex number such that $|\psi| \le 1$.
We have that G(\psi) \succeq \frac{1}{4} \cdot I_H.

7 Conclusion

We presented two algorithms for controlling linear dynamical systems with strongly convex costs, with regret that scales poly-logarithmically with time. This improves upon state-of-the-art known regret bounds, which scale as O(√T). It remains open to extend the poly-log regret guarantees to more general systems and loss functions, such as exp-concave losses, or, alternatively, to show that this is impossible.

Acknowledgements

The authors thank Sham Kakade and Cyril Zhang for various thoughtful discussions. Elad Hazan acknowledges funding from NSF grant #CCF-1704860.

References

[1] Yasin Abbasi-Yadkori, Peter Bartlett, and Varun Kanade. Tracking adversarial targets. In International Conference on Machine Learning, pages 369–377, 2014.

[2] Yasin Abbasi-Yadkori, Nevena Lazic, and Csaba Szepesvári. Model-free linear quadratic control via reduction to expert prediction. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3108–3117, 2019.

[3] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.

[4] Naman Agarwal, Brian Bullins, Elad Hazan, Sham Kakade, and Karan Singh. Online control with adversarial disturbances. In Proceedings of the 36th International Conference on Machine Learning, pages 111–119, 2019.

[5] Oren Anava, Elad Hazan, and Shie Mannor. Online learning for adversaries with memory: price of past mistakes. In Advances in Neural Information Processing Systems, pages 784–792, 2015.

[6] Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret.
In Proceedings of the 29th International Conference on Machine Learning, pages 1503–1510, 2012.

[7] Sanjeev Arora, Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Towards provable control for unknown linear dynamical systems. 2018.

[8] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.

[9] Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. In International Conference on Machine Learning, pages 1028–1037, 2018.

[10] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.

[11] Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1466–1475, 2018.

[12] Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In Proceedings of the 35th International Conference on Machine Learning, pages 1783–1791, 2018.

[13] Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[14] Elad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear dynamical systems. In Advances in Neural Information Processing Systems, pages 4634–4643, 2018.

[15] Elad Hazan, Karan Singh, and Cyril Zhang. Learning linear dynamical systems via spectral filtering. In Advances in Neural Information Processing Systems, pages 6702–6712, 2017.

[16] Shai Shalev-Shwartz et al. Online learning and online convex optimization.
Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.

[17] Robert F. Stengel. Optimal control and estimation. Courier Corporation, 1994.

[18] Gilbert Strang. Introduction to linear algebra, volume 3.
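As an editorial aside (not part of the original manuscript): the lower bound of Lemma 6.3 is straightforward to sanity-check numerically. The sketch below constructs T_m and G(ψ) exactly as defined in Section 6.2.1 and verifies empirically that the smallest eigenvalue of the Hermitian matrix G(ψ) stays above 1/4 for randomly sampled ψ with |ψ| ≤ 1; variable names and dimensions are illustrative choices.

```python
import numpy as np

def T(m, H):
    # H x H matrix with [T_m]_{ij} = 1 iff i - j = m, and 0 otherwise
    return np.array([[1.0 if i - j == m else 0.0 for j in range(H)]
                     for i in range(H)])

def G_psi(psi, H):
    # G(psi) = sum_{k1,k2=1}^{H} T_{k1-k2} * conj(psi)^{k1-1} * psi^{k2-1}
    G = np.zeros((H, H), dtype=complex)
    for k1 in range(1, H + 1):
        for k2 in range(1, H + 1):
            G += T(k1 - k2, H) * np.conj(psi) ** (k1 - 1) * psi ** (k2 - 1)
    return G

rng = np.random.default_rng(0)
worst = np.inf
for H in (2, 4, 8):
    for _ in range(200):
        r, theta = rng.uniform(0, 1), rng.uniform(0, 2 * np.pi)
        psi = r * np.exp(1j * theta)  # sampled uniformly with |psi| <= 1
        # G(psi) is Hermitian by construction, so eigvalsh applies
        lam_min = np.linalg.eigvalsh(G_psi(psi, H)).min()
        worst = min(worst, lam_min)
print(f"smallest eigenvalue observed: {worst:.3f}")
assert worst >= 0.25  # consistent with Lemma 6.3
```

Note that G(ψ) is unitarily equivalent to G(|ψ|) via conjugation with diag(1, e^{iθ}, …, e^{iθ(H−1)}), so its spectrum depends only on |ψ|; sampling complex ψ merely exercises the Hermitian bookkeeping.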