{"title": "Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 2636, "page_last": 2644, "abstract": "We study the problem of adaptive control of a high dimensional linear quadratic (LQ) system.  Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. More recently, an asymptotic regret bound of $\\tilde{O}(\\sqrt{T})$ was shown for $T \\gg p$ where $p$ is the dimension of the state space. In this work we consider the case where the matrices describing the dynamic of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that for $p \\gg 1$ and $T \\gg \\polylog(p)$ achieves a regret bound of $\\tilde{O}(p \\sqrt{T})$. In particular, our algorithm has an average cost of $(1+\\eps)$ times the optimum cost after $T = \\polylog(p) O(1/\\eps^2)$. This is in comparison to previous work on the dense dynamics where the algorithm needs $\\Omega(p)$ samples before it can estimate the unknown dynamic with any significant accuracy. We believe our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.", "full_text": "Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems\n\nMorteza Ibrahimi\nStanford University\nStanford, CA 94305\n\nibrahimi@stanford.edu\n\nAdel Javanmard\nStanford University\nStanford, CA 94305\n\nadelj@stanford.edu\n\nBenjamin Van Roy\nStanford University\nStanford, CA 94305\n\nbvr@stanford.edu\n\nAbstract\n\nWe study the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Previous work established the asymptotic convergence to an optimal controller for various adaptive control schemes. 
More recently, for the average cost LQ problem, a regret bound of $O(\\sqrt{T})$ was shown, apart from logarithmic factors. However, this bound scales exponentially with $p$, the dimension of the state space. In this work we consider the case where the matrices describing the dynamics of the LQ system are sparse and their dimensions are large. We present an adaptive control scheme that achieves a regret bound of $O(p\\sqrt{T})$, apart from logarithmic factors. In particular, our algorithm has an average cost of $(1 + \\epsilon)$ times the optimum cost after $T = \\mathrm{polylog}(p)\\, O(1/\\epsilon^2)$. This is in comparison to previous work on the dense dynamics where the algorithm requires time that scales exponentially with dimension in order to achieve regret of $\\epsilon$ times the optimal cost. We believe that our result has prominent applications in the emerging area of computational advertising, in particular targeted online advertising and advertising in social networks.\n\n1 Introduction\n\nIn this paper we address the problem of adaptive control of a high dimensional linear quadratic (LQ) system. Formally, the dynamics of a linear quadratic system are given by\n\n$$x(t+1) = A^0 x(t) + B^0 u(t) + w(t+1), \\qquad c(t) = x(t)^T Q x(t) + u(t)^T R u(t), \\tag{1}$$\n\nwhere $u(t) \\in \\mathbb{R}^r$ is the control (action) at time $t$, $x(t) \\in \\mathbb{R}^p$ is the state at time $t$, $c(t) \\in \\mathbb{R}$ is the cost at time $t$, and $\\{w(t+1)\\}_{t \\ge 0}$ is a sequence of random vectors in $\\mathbb{R}^p$ with i.i.d. standard Normal entries. The matrices $Q \\in \\mathbb{R}^{p \\times p}$ and $R \\in \\mathbb{R}^{r \\times r}$ are positive semi-definite (PSD) matrices that determine the cost at each step. The evolution of the system is described through the matrices $A^0 \\in \\mathbb{R}^{p \\times p}$ and $B^0 \\in \\mathbb{R}^{p \\times r}$. 
Finally, by a high dimensional system we mean the case where $p, r \\gg 1$.\nA celebrated fundamental theorem in control theory asserts that the above LQ system can be optimally controlled by a simple linear feedback if the pair $(A^0, B^0)$ is controllable and the pair $(A^0, Q^{1/2})$ is observable. The optimal controller can be explicitly computed from the matrices describing the dynamics and the cost. Throughout this paper we assume that the controllability and observability conditions hold.\nWhen the matrix $\\Theta^0 \\equiv [A^0, B^0]$ is unknown, the task is that of adaptive control, where the system is to be learned and controlled at the same time. Early works on the adaptive control of LQ systems relied on the certainty equivalence principle [2]. In this scheme, at each time $t$ the unknown parameter $\\Theta^0$ is estimated based on the observations collected so far and the optimal controller for the estimated system is applied. Such controllers are shown to converge to an optimal controller in the case of minimum variance cost; however, in general they may converge to a suboptimal controller [11]. Subsequently, it has been shown that introducing random exploration by adding noise to the control signal, e.g., [14], solves the problem of converging to suboptimal estimates.\nAll the aforementioned works have been concerned with the asymptotic convergence of the controller to an optimal controller. In order to achieve regret bounds, cost-biased parameter estimation [12, 8, 1], in particular the optimism in the face of uncertainty (OFU) principle [13], has been shown to be effective. In this method a confidence set $S$ is found such that $\\Theta^0 \\in S$ with high probability. The system is then controlled using the most optimistic parameter estimate, i.e., $\\hat{\\Theta} \\in S$ with the smallest optimum cost. The asymptotic convergence of the average cost of OFU for the LQR problem was shown in [6]. 
This asymptotic result was extended in [1] by providing a bound for the cumulative regret. Assume $x(0) = 0$ and for a control policy $\\pi$ define the average cost\n\n$$J_\\pi = \\limsup_{T \\to \\infty} \\frac{1}{T} \\sum_{t=0}^{T} \\mathbb{E}[c_t]. \\tag{2}$$\n\nFurther, define the cumulative regret as\n\n$$R(T) = \\sum_{t=0}^{T} \\left( c_\\pi(t) - J_* \\right), \\tag{3}$$\n\nwhere $c_\\pi(t)$ is the cost of control policy $\\pi$ at time $t$ and $J_* = J(\\Theta^0)$ is the optimal average cost. The algorithm proposed in [1] is shown to have cumulative regret $\\tilde{O}(\\sqrt{T})$ where $\\tilde{O}$ is hiding the logarithmic factors. While no lower bound was provided for the regret, comparison with the multi-armed bandit problem, where a lower bound of $\\Omega(\\sqrt{T})$ was shown for the general case [9], suggests that this scaling with time for the cumulative regret is optimal.\nThe focus of [1] was on the scaling of the regret with the time horizon $T$. However, the regret of the proposed algorithm scales poorly with dimension. More specifically, the analysis in [1] proves a regret bound of $R(T) < C p^{p+r+2} \\sqrt{T}$. The current paper focuses on (many) applications where the state and control dimensions are much larger than the time horizon of interest. A powerful reinforcement learning algorithm for these applications should have regret which depends gracefully on dimension.\nIn general, there is little to be achieved when $T < p$, as the number of degrees of freedom ($pr + p^2$) is larger than the number of observations ($Tp$) and any estimator can be arbitrarily inaccurate. However, when there is prior knowledge about the unknown parameters $A^0, B^0$, e.g., when $A^0, B^0$ are sparse, accurate estimation can be feasible. In particular, [3] proved that under suitable conditions the unknown parameters of a noise driven system (i.e., no control) whose dynamics are modeled by linear stochastic differential equations can be estimated accurately with as few as $O(\\log(p))$ samples. 
However, the result of [3] is not directly applicable here since for a general feedback gain $L$, even if $A^0$ and $B^0$ are sparse, the closed loop gain $A^0 - B^0 L$ need not be sparse. Furthermore, the system dynamics would be correlated with past observations through the estimated gain matrix $L$. Finally, there is no notion of cost in [3], while here we have to obtain bounds on the cost and its scaling with $p$.\nIn this work we extend the result of [3] by showing that under suitable conditions, unknown parameters of sparse high dimensional LQ systems can be accurately estimated with as few as $O(\\log(p + r))$ observations. Equipped with this efficient learning method, we show that sparse high dimensional LQ systems can be adaptively controlled with regret $\\tilde{O}(p\\sqrt{T})$.\nTo put this result in perspective, note that even when $x(t) = 0$, the expected cost at time $t + 1$ is $\\Omega(p)$ due to the noise. Therefore, the cumulative cost at time $T$ is bounded as $\\Omega(pT)$. Comparing this to our regret bound, we see that for $T = \\mathrm{polylog}(p)\\, O(1/\\epsilon^2)$, the cumulative cost of our algorithm is bounded by $(1 + \\epsilon)$ times the optimum cumulative cost. In other words, our algorithm performs close to optimal after $\\mathrm{polylog}(p)$ steps. This is in contrast with the result of [1] where the algorithm needs $\\Omega(p^{2p})$ steps in order to achieve regret of $\\epsilon$ times the optimal cost.\nSparse high dimensional LQ systems appear in many engineering applications. Here we are particularly motivated by an emerging field of applications in marketing and advertising. The use of dynamical optimal control models in advertising has a history of at least four decades, cf. [17, 10] for a survey. In these models, often a partial differential equation is used to describe how advertising expenditure over time translates into sales. The basic problem is to find the advertising expenditure that maximizes the net profit. The focus of these works is to model the temporal dynamics of the advertising expenditure (the control variable) and the variables of interest (sales, goodwill level, etc.). There also exists a rich literature studying the spatial interdependence of consumers\u2019 and firms\u2019 behavior to devise marketing schemes [7]. In these models space can be generalized beyond geographies to include notions like demographies and psychometry.\nCombinations of spatial interdependence and temporal dynamics models for optimal advertising have also been considered [16, 15]. A simple temporal dynamics model is extended in [15] by allowing state and control variables to have spatial dependence and introducing a diffusive component in the controlled PDE which describes the spatial dynamics. The controlled PDE is then shown to be equivalent to an abstract linear control system of the form\n\n$$\\frac{dx(t)}{dt} = A x(t) + B u(t). \\tag{4}$$\n\nBoth [15] and [7] are concerned with the optimal control, and the interactions are either dictated by the model or assumed known. Our work deals with a discrete and noisy version of (4) where the dynamics are to be estimated but are known to be sparse. In the model considered in [15] the state variable $x$ lives in an infinite dimensional space. Spatial models in marketing [7] usually consider state variables which have a large number of dimensions, e.g., the number of zip codes in the US ($\\sim$50K). High dimensional state space and control is a recurring theme in these applications.\nIn particular, with modern social networks customers are classified in a highly granular way, potentially with each customer representing his own class. With the number of classes and the complexity of their interactions, it is unlikely that we could formulate an effective model a priori for how classes interact. 
Further, the nature of these interactions changes over time with the changing landscape of Internet services and information available to customers. This makes it important to efficiently learn from real-time data about the nature of these interactions.\nNotation: We bundle the unknown parameters into one variable $\\Theta^0 = [A^0, B^0] \\in \\mathbb{R}^{p \\times q}$ where $q = p + r$ and call it the interaction matrix. For $v \\in \\mathbb{R}^n$, $M \\in \\mathbb{R}^{m \\times n}$ and $p \\ge 1$, we denote by $\\|v\\|_p$ the standard $p$-norm and by $\\|M\\|_p$ the corresponding operator norm. For $1 \\le i \\le m$, $M_i$ represents the $i$th row of matrix $M$. For $S \\subseteq [m]$, $J \\subseteq [n]$, $M_{SJ}$ is the submatrix of $M$ formed by the rows in $S$ and columns in $J$. For a set $S$ denote by $|S|$ its cardinality. For an integer $n$ denote by $[n]$ the set $\\{1, \\dots, n\\}$.\n\n2 Algorithm\n\nOur algorithm employs the Optimism in the Face of Uncertainty (OFU) principle in an episodic fashion. At the beginning of episode $i$ the algorithm constructs a confidence set $\\Omega^{(i)}$ which is guaranteed to include the unknown parameter $\\Theta^0$ with high probability. The algorithm then chooses $\\tilde{\\Theta}^{(i)} \\in \\Omega^{(i)}$ that has the smallest expected cost as the estimated parameter for episode $i$ and applies the optimal control for the estimated parameter during episode $i$.\nThe confidence set is constructed using observations from the last episode only, but the lengths of the episodes are chosen to increase geometrically, allowing for more accurate estimates and shrinkage of the confidence set by a constant factor at each episode. The details of each step and the pseudo code for the algorithm follow.\nConstructing confidence set: Define $\\tau_i$ to be the start of episode $i$ with $\\tau_0 = 0$. Let $L^{(i)}$ be the controller that has been chosen for episode $i$. 
For $t \\in [\\tau_i, \\tau_{i+1})$ the system is controlled by $u(t) = -L^{(i)} x(t)$ and the system dynamics can be written as $x(t+1) = (A^0 - B^0 L^{(i)}) x(t) + w(t+1)$. At the beginning of episode $i+1$, first an initial estimate $\\hat{\\Theta}$ is obtained by solving the following convex optimization problem for each row $\\Theta_u \\in \\mathbb{R}^q$ separately:\n\n$$\\hat{\\Theta}_u^{(i+1)} \\in \\operatorname{argmin} \\; \\mathcal{L}(\\Theta_u) + \\lambda \\|\\Theta_u\\|_1, \\tag{5}$$\n\nwhere\n\n$$\\mathcal{L}(\\Theta_u) = \\frac{1}{2 \\Delta\\tau_{i+1}} \\sum_{t=\\tau_i}^{\\tau_{i+1}-1} \\{ x_u(t+1) - \\Theta_u \\tilde{L}^{(i)} x(t) \\}^2, \\qquad \\Delta\\tau_{i+1} = \\tau_{i+1} - \\tau_i, \\tag{6}$$\n\nand $\\tilde{L}^{(i)} = [I, -L^{(i)T}]^T$.\n\nALGORITHM: Reinforcement learning algorithm for LQ systems.\nInput: Precision $\\epsilon$, failure probability $4\\delta$, initial $(\\rho, C_{\\min}, \\alpha)$ identifiable controller $L^{(0)}$, $\\ell(\\Theta^0, \\epsilon)$\nOutput: Series of estimates $\\tilde{\\Theta}^{(i)}$, confidence sets $\\Omega^{(i)}$ and controllers $L^{(i)}$\n1: Let $\\ell_0 = \\max(1, \\max_{j \\in [r]} \\|L^{(0)}_j\\|_2)$, and\n$$n_0 = \\frac{4 \\cdot 10^3 \\, k^2 \\ell_0^2}{\\alpha (1-\\rho) C_{\\min}^2} \\left( \\frac{1}{\\epsilon^2} + \\frac{k}{(1-\\rho)^2} \\right) \\log\\left( \\frac{4kq}{\\delta} \\right), \\qquad n_1 = \\frac{4 \\cdot 10^3 \\, k^2 \\ell(\\Theta^0, \\epsilon)^2}{(1-\\rho) C_{\\min}^2} \\left( \\frac{1}{\\epsilon^2} + \\frac{k}{(1-\\rho)^2} \\right) \\log\\left( \\frac{4kq}{\\delta} \\right).$$\nLet $\\Delta\\tau_0 = n_0$, $\\Delta\\tau_i = 4^i (1 + i/\\log(q/\\delta)) n_1$ for $i \\ge 1$, and $\\tau_i = \\sum_{j=0}^{i} \\Delta\\tau_j$.\n2: for $i = 0, 1, 2, \\dots$ do\n3: Apply the control $u(t) = -L^{(i)} x(t)$ until $\\tau_{i+1} - 1$ and observe the trace $\\{x(t)\\}_{\\tau_i \\le t < \\tau_{i+1}}$.\n4: Calculate the estimate $\\hat{\\Theta}^{(i+1)}$ from (5) and construct the confidence set $\\Omega^{(i+1)}$.\n5: Calculate $\\tilde{\\Theta}^{(i+1)}$ from (9) and set $L^{(i+1)} \\leftarrow L(\\tilde{\\Theta}^{(i+1)})$.\n\nThe estimator $\\hat{\\Theta}_u$ is known as the LASSO estimator. 
The first term in the cost function is the normalized negative log likelihood, which measures the fidelity to the observations, while the second term imposes the sparsity constraint on $\\Theta_u$. $\\lambda$ is the regularization parameter.\nFor $\\Theta^{(1)}, \\Theta^{(2)} \\in \\mathbb{R}^{p \\times q}$ define the distance $d(\\Theta^{(1)}, \\Theta^{(2)})$ as\n\n$$d(\\Theta^{(1)}, \\Theta^{(2)}) = \\max_{u \\in [p]} \\|\\Theta^{(1)}_u - \\Theta^{(2)}_u\\|_2, \\tag{7}$$\n\nwhere $\\Theta_u$ is the $u$th row of the matrix $\\Theta$.\nIt is worth noting that for $k$-sparse matrices with $k$ constant, this distance does not scale with $p$ or $q$. In particular, if the absolute values of the elements of $\\Theta^{(1)}$ and $\\Theta^{(2)}$ are bounded by $\\Theta_{\\max}$ then $d(\\Theta^{(1)}, \\Theta^{(2)}) \\le 2\\sqrt{k}\\, \\Theta_{\\max}$.\nHaving the estimator $\\hat{\\Theta}^{(i)}$ the algorithm constructs the confidence set for episode $i$ as\n\n$$\\Omega^{(i)} = \\{ \\Theta \\in \\mathbb{R}^{p \\times q} \\mid d(\\Theta, \\hat{\\Theta}^{(i)}) \\le 2^{-i} \\epsilon \\}, \\tag{8}$$\n\nwhere $\\epsilon > 0$ is an input parameter to the algorithm. For any fixed $\\delta > 0$, by choosing $\\tau_i$ judiciously we ensure that with probability at least $1 - \\delta$, $\\Theta^0 \\in \\Omega^{(i)}$ for all $i \\ge 1$ (see Theorem 3.2).\nDesign of the controller: Let $J(\\Theta)$ be the minimum expected cost if the interaction matrix is $\\Theta = [A, B]$ and denote by $L(\\Theta)$ the optimal controller that achieves the expected cost $J(\\Theta)$. The algorithm implements the OFU principle by choosing, at the beginning of episode $i$, an estimate $\\tilde{\\Theta}^{(i)} \\in \\Omega^{(i)}$ such that\n\n$$\\tilde{\\Theta}^{(i)} \\in \\operatorname{argmin}_{\\Theta \\in \\Omega^{(i)}} J(\\Theta). \\tag{9}$$\n\nThe optimal control corresponding to $\\tilde{\\Theta}^{(i)}$ is then applied during episode $i$, i.e., $u(t) = -L(\\tilde{\\Theta}^{(i)}) x(t)$ for $t \\in [\\tau_i, \\tau_{i+1})$. Recall that for $\\Theta = [A, B]$, the optimal controller is given through the following relations\n\n$$K(\\Theta) = Q + A^T K(\\Theta) A - A^T K(\\Theta) B (B^T K(\\Theta) B + R)^{-1} B^T K(\\Theta) A, \\qquad \\text{(Riccati equation)}$$\n$$L(\\Theta) = (B^T K(\\Theta) B + R)^{-1} B^T K(\\Theta) A.$$\n\nThe pseudo code for the algorithm is summarized in the table.\n\n3 Main Results\n\nIn this section we present performance guarantees in terms of cumulative regret and learning accuracy for the presented algorithm. In order to state the theorems, we first need to present some assumptions on the system.\nGiven $\\Theta \\in \\mathbb{R}^{p \\times q}$ and $L \\in \\mathbb{R}^{r \\times p}$, define $\\tilde{L} = [I, -L^T]^T \\in \\mathbb{R}^{q \\times p}$ and let $\\Lambda \\in \\mathbb{R}^{p \\times p}$ be a solution to the following Lyapunov equation\n\n$$\\Lambda - \\Theta \\tilde{L} \\Lambda \\tilde{L}^T \\Theta^T = I. \\tag{10}$$\n\nIf the closed loop system $(A^0 - B^0 L)$ is stable then the solution to the above equation exists and the state vector $x(t)$ has a Normal stationary distribution with covariance $\\Lambda$.\nWe proceed by introducing an identifiable regulator.\nDefinition 3.1. For a $k$-sparse matrix $\\Theta^0 = [A^0, B^0] \\in \\mathbb{R}^{p \\times q}$ and $L \\in \\mathbb{R}^{r \\times p}$, define $\\tilde{L} = [I, -L^T]^T \\in \\mathbb{R}^{q \\times p}$ and let $H = \\tilde{L} \\Lambda \\tilde{L}^T$ where $\\Lambda$ is the solution of Eq. (10) with $\\Theta = \\Theta^0$. Define $L$ to be $(\\rho, C_{\\min}, \\alpha)$ identifiable (with respect to $\\Theta^0$) if it satisfies the following conditions for all $S \\subseteq [q]$, $|S| \\le k$.\n\n$$(1)\\ \\|A^0 - B^0 L\\|_2 \\le \\rho < 1, \\qquad (2)\\ \\lambda_{\\min}(H_{SS}) \\ge C_{\\min}, \\qquad (3)\\ \\|H_{S^c S} H_{SS}^{-1}\\|_\\infty \\le 1 - \\alpha.$$\n\nThe first condition simply states that if the system is controlled using the regulator $L$ then the closed loop autonomous system is asymptotically stable. 
The second and third conditions are similar to what is referred to in the sparse signal recovery literature as the mutual incoherence or irrepresentable conditions. Various examples and results exist for the matrix families that satisfy these conditions [18]. Let $S$ be the set of indices of the nonzero entries in a specific row of $\\Theta^0$. The second condition states that the corresponding entries in the extended state variable $y = [x^T, u^T]$ are sufficiently distinguishable from each other. In other words, if the trajectories corresponding to this group of state variables are observed, none of them can be well approximated as a linear combination of the others. The third condition can be thought of as a quantification of the first vs. higher order dependencies. Consider entry $j$ in the extended state variable. Then, the dynamic of $y_j$ is directly influenced by entries $y_S$. However, it is also influenced indirectly by other entries of $y$. The third condition roughly states that the indirect influences are sufficiently weaker than the direct influences. There exists a vast literature on the applicability of these conditions and scenarios in which they are known to hold. These conditions are almost necessary for the successful recovery by $\\ell_1$ relaxation. For a discussion on these and other similar conditions imposed for sparse signal recovery we refer the reader to [19] and [20] and the references therein.\nDefine $\\Theta_{\\min} = \\min_{i \\in [p], j \\in [q], \\Theta^0_{ij} \\neq 0} |\\Theta^0_{ij}|$. Our first result states that the system can be learned efficiently from its trajectory observations when it is controlled by an identifiable regulator.\nTheorem 3.2. Consider the LQ system of Eq. (1) and assume $\\Theta^0 = [A^0, B^0]$ is $k$-sparse. Let $u(t) = -L x(t)$ where $L$ is a $(\\rho, C_{\\min}, \\alpha)$ identifiable regulator with respect to $\\Theta^0$ and define $\\ell = \\max(1, \\max_{j \\in [r]} \\|L_j\\|_2)$. Let $n$ denote the number of samples of the trajectory that is observed. For any $0 < \\epsilon < \\min(\\Theta_{\\min}, \\frac{\\ell}{2}, \\frac{3}{1-\\rho})$, there exists $\\lambda$ such that, if\n\n$$n \\ge \\frac{4 \\cdot 10^3 \\, k^2 \\ell^2}{\\alpha^2 (1-\\rho) C_{\\min}^2} \\left( \\frac{1}{\\epsilon^2} + \\frac{k}{(1-\\rho)^2} \\right) \\log\\left( \\frac{4kq}{\\delta} \\right), \\tag{11}$$\n\nthen the $\\ell_1$-regularized least squares solution $\\hat{\\Theta}$ of Eq. (5) satisfies $d(\\hat{\\Theta}, \\Theta^0) \\le \\epsilon$ with probability larger than $1 - \\delta$. In particular, this is achieved by taking $\\lambda = 6\\ell \\sqrt{\\log(4q/\\delta) / (n \\alpha^2 (1-\\rho))}$.\n\nOur second result states that, equipped with an efficient learning algorithm, the LQ system of Eq. (1) can be controlled with regret $\\tilde{O}(p \\sqrt{T} \\log^{3/2}(1/\\delta))$ under suitable assumptions.\nDefine an $\\epsilon$-neighborhood of $\\Theta^0$ as $\\mathcal{N}_\\epsilon(\\Theta^0) = \\{ \\Theta \\in \\mathbb{R}^{p \\times q} \\mid d(\\Theta^0, \\Theta) \\le \\epsilon \\}$. Our assumption asserts the identifiability of $L(\\Theta)$ for $\\Theta$ close to $\\Theta^0$.\nAssumption: There exist $\\epsilon, C > 0$ such that for all $\\Theta \\in \\mathcal{N}_\\epsilon(\\Theta^0)$, $L(\\Theta)$ is identifiable w.r.t. $\\Theta^0$ and\n\n$$\\sigma_L(\\Theta^0, \\epsilon) = \\sup_{\\Theta \\in \\mathcal{N}_\\epsilon(\\Theta^0)} \\|L(\\Theta)\\|_2 \\le C, \\qquad \\sigma_K(\\Theta^0, \\epsilon) = \\sup_{\\Theta \\in \\mathcal{N}_\\epsilon(\\Theta^0)} \\|K(\\Theta)\\|_2 \\le C.$$\n\nAlso define\n\n$$\\ell(\\Theta^0, \\epsilon) = \\sup_{\\Theta \\in \\mathcal{N}_\\epsilon(\\Theta^0)} \\max(1, \\max_{j \\in [r]} \\|L_j(\\Theta)\\|_2).$$\n\nNote that $\\ell(\\Theta^0, \\epsilon) \\le \\max(C, 1)$, since $\\max_{j \\in [r]} \\|L_j(\\Theta)\\|_2 \\le \\|L(\\Theta)\\|_2$.\n\nTheorem 3.3. Consider the LQ system of Eq. (1). 
For some constants $\\epsilon$, $C_{\\min}$ and $0 < \\alpha, \\rho < 1$, assume that an initial $(\\rho, C_{\\min}, \\alpha)$ identifiable regulator $L^{(0)}$ is given. Further, assume that for any $\\Theta \\in \\mathcal{N}_\\epsilon(\\Theta^0)$, $L(\\Theta)$ is $(\\rho, C_{\\min}, \\alpha)$ identifiable. Then, with probability at least $1 - \\delta$ the cumulative regret of ALGORITHM (cf. the table) is bounded as\n\n$$R(T) \\le \\tilde{O}(p \\sqrt{T} \\log^{3/2}(1/\\delta)), \\tag{12}$$\n\nwhere $\\tilde{O}$ is hiding the logarithmic factors.\n\n4 Analysis\n\n4.1 Proof of Theorem 3.2\n\nTo prove Theorem 3.2 we first state a set of sufficient conditions for the solution of the $\\ell_1$-regularized least squares to be within some distance, as defined by $d(\\cdot, \\cdot)$, of the true parameter. Subsequently, we prove that these conditions hold with high probability.\nDefine $X = [x(0), x(1), \\dots, x(n-1)] \\in \\mathbb{R}^{p \\times n}$ and let $W = [w(1), \\dots, w(n)] \\in \\mathbb{R}^{p \\times n}$ be the matrix containing the Gaussian noise realization. Further, let $W_u$ denote the $u$th row of $W$.\nDefine the normalized gradient and Hessian of the likelihood function (6) as\n\n$$\\hat{G} = -\\nabla \\mathcal{L}(\\Theta^0_u) = \\frac{1}{n} \\tilde{L} X W_u^T, \\qquad \\hat{H} = \\nabla^2 \\mathcal{L}(\\Theta^0_u) = \\frac{1}{n} \\tilde{L} X X^T \\tilde{L}^T. \\tag{13}$$\n\nThe following proposition, a proof of which can be found in [20], provides a set of sufficient conditions for the accuracy of the $\\ell_1$-regularized least squares solution.\nProposition 4.1. Let $S$ be the support of $\\Theta^0_u$ with $|S| < k$, and $H$ be defined per Definition 3.1. Assume there exist $0 < \\alpha < 1$ and $C_{\\min} > 0$ such that\n\n$$\\lambda_{\\min}(H_{S,S}) \\ge C_{\\min}, \\qquad \\|H_{S^c,S} H_{S,S}^{-1}\\|_\\infty \\le 1 - \\alpha. \\tag{14}$$\n\nFor any $0 < \\epsilon < \\Theta_{\\min}$, if the following conditions hold\n\n$$\\|\\hat{G}\\|_\\infty \\le \\frac{\\lambda \\alpha}{3}, \\qquad \\|\\hat{G}_S\\|_\\infty \\le \\frac{\\epsilon C_{\\min}}{4k} - \\lambda, \\tag{15}$$\n$$\\|\\hat{H}_{S^c S} - H_{S^c S}\\|_\\infty \\le \\frac{\\alpha}{12} \\frac{C_{\\min}}{\\sqrt{k}}, \\qquad \\|\\hat{H}_{SS} - H_{SS}\\|_\\infty \\le \\frac{\\alpha}{12} \\frac{C_{\\min}}{\\sqrt{k}}, \\tag{16}$$\n\nthe $\\ell_1$-regularized least squares solution (5) satisfies $d(\\hat{\\Theta}_u, \\Theta^0_u) \\le \\epsilon$.\n\nIn the sequel, we prove that the conditions in Proposition 4.1 hold with high probability given that the assumptions of Theorem 3.2 are satisfied. A few lemmas are in order, proofs of which are deferred to the Appendix.\n\nThe first lemma states that $\\hat{G}$ concentrates in infinity norm around its mean of zero.\nLemma 4.2. Assume $\\rho = \\|A^0 - B^0 L\\|_2 < 1$ and let $\\ell = \\max(1, \\max_{i \\in [r]} \\|L_i\\|_2)$. Then, for any $S \\subseteq [q]$ and $0 < \\epsilon < \\frac{\\ell}{2}$,\n\n$$\\mathbb{P}\\{ \\|\\hat{G}_S\\|_\\infty > \\epsilon \\} \\le 2|S| \\exp\\left( -\\frac{n(1-\\rho)\\epsilon^2}{4\\ell^2} \\right). \\tag{17}$$\n\nTo prove the conditions in Eq. (16) we first bound in the following lemma the absolute deviations of the elements of $\\hat{H}$ from their mean $H$, i.e., $|\\hat{H}_{ij} - H_{ij}|$.\nLemma 4.3. Let $i, j \\in [q]$, $\\rho = \\|A^0 - B^0 L\\|_2 < 1$, and $0 < \\epsilon < \\frac{3}{1-\\rho} < n$. Then,\n\n$$\\mathbb{P}(|\\hat{H}_{ij} - H_{ij}| > \\epsilon) \\le 2 \\exp\\left( -\\frac{n(1-\\rho)^3 \\epsilon^2}{24 \\ell^2} \\right). \\tag{18}$$\n\nThe following corollary of Lemma 4.3 bounds $\\|\\hat{H}_{JS} - H_{JS}\\|_\\infty$ for $J, S \\subseteq [q]$.\nCorollary 4.4. Let $J, S \\subseteq [q]$, $\\rho = \\|A^0 - B^0 L\\|_2 < 1$, $\\epsilon < \\frac{3|S|}{1-\\rho}$, and $n > \\frac{3}{1-\\rho}$. Then,\n\n$$\\mathbb{P}(\\|\\hat{H}_{JS} - H_{JS}\\|_\\infty > \\epsilon) \\le 2|J||S| \\exp\\left( -\\frac{n(1-\\rho)^3 \\epsilon^2}{24 |S|^2 \\ell^2} \\right). \\tag{19}$$\n\nThe proof of Corollary 4.4 is by applying the union bound as\n\n$$\\mathbb{P}(\\|\\hat{H}_{JS} - H_{JS}\\|_\\infty > \\epsilon) \\le |J||S| \\max_{i \\in J, j \\in S} \\mathbb{P}(|\\hat{H}_{ij} - H_{ij}| > \\epsilon/|S|). \\tag{20}$$\n\nProof of Theorem 3.2. We show that the conditions given by Proposition 4.1 hold. The conditions in Eq. (14) are true by the assumption of identifiability of $L$ with respect to $\\Theta^0$. In order to make the first constraint on $\\hat{G}$ imply the second constraint on $\\hat{G}$, we assume that $\\lambda\\alpha/3 \\le \\epsilon C_{\\min}/(4k) - \\lambda$, which is ensured to hold if $\\lambda \\le \\epsilon C_{\\min}/(6k)$. By Lemma 4.2, $\\mathbb{P}(\\|\\hat{G}\\|_\\infty > \\lambda\\alpha/3) \\le \\delta/2$ if\n\n$$\\lambda^2 = \\frac{36 \\ell^2}{n(1-\\rho)\\alpha^2} \\log\\left( \\frac{4q}{\\delta} \\right). \\tag{21}$$\n\nRequiring $\\lambda \\le \\epsilon C_{\\min}/(6k)$, we obtain\n\n$$n \\ge \\frac{36^2 \\, k^2 \\ell^2}{\\epsilon^2 \\alpha^2 C_{\\min}^2 (1-\\rho)} \\log\\left( \\frac{4q}{\\delta} \\right). \\tag{22}$$\n\nThe conditions on $\\hat{H}$ can also be aggregated as $\\|\\hat{H}_{[q],S} - H_{[q],S}\\|_\\infty \\le \\alpha C_{\\min}/(12\\sqrt{k})$. By Corollary 4.4, $\\mathbb{P}(\\|\\hat{H}_{[q]S} - H_{[q]S}\\|_\\infty > \\alpha C_{\\min}/(12\\sqrt{k})) \\le \\delta/2$ if\n\n$$n \\ge \\frac{3456 \\, k^3 \\ell^2}{\\alpha^2 (1-\\rho)^3 C_{\\min}^2} \\log\\left( \\frac{4kq}{\\delta} \\right). \\tag{23}$$\n\nMerging the conditions in Eq. (22) and (23) we conclude that the conditions in Proposition 4.1 hold with probability at least $1 - \\delta$ if\n\n$$n \\ge \\frac{4 \\cdot 10^3 \\, k^2 \\ell^2}{\\alpha^2 (1-\\rho) C_{\\min}^2} \\left( \\frac{1}{\\epsilon^2} + \\frac{k}{(1-\\rho)^2} \\right) \\log\\left( \\frac{4kq}{\\delta} \\right). \\tag{24}$$\n\nThis finishes the proof of Theorem 3.2.\n\n4.2 Proof of Theorem 3.3\n\nThe high-level idea of the proof is similar to the proof of the main theorem in [1]. First, we give a decomposition for the gap between the cost obtained by the algorithm and the optimal cost. We then upper bound each term of the decomposition separately.\n\n4.2.1 Cost Decomposition\n\nWriting the Bellman optimality equations [5, 4] for average cost dynamic programming, we get\n\n$$J(\\tilde{\\Theta}_t) + x(t)^T K(\\tilde{\\Theta}_t) x(t) = \\min_u \\left\\{ x(t)^T Q x(t) + u^T R u + \\mathbb{E}\\left[ z(t+1)^T K(\\tilde{\\Theta}_t) z(t+1) \\mid \\mathcal{F}_t \\right] \\right\\},$$\n\nwhere $\\tilde{\\Theta}_t = [\\tilde{A}_t, \\tilde{B}_t]$ is the estimate used at time $t$, $z(t+1) = \\tilde{A}_t x(t) + \\tilde{B}_t u + w(t+1)$, and $\\mathcal{F}_t$ is the $\\sigma$-field generated by the variables $\\{(z_\\tau, x_\\tau)\\}_{\\tau=0}^{t}$. Notice that the left-hand side is the average cost incurred with initial state $x(t)$ [5, 4]. Therefore,\n\n$$\\begin{aligned} J(\\tilde{\\Theta}_t) + x(t)^T K(\\tilde{\\Theta}_t) x(t) &= x(t)^T Q x(t) + u(t)^T R u(t) \\\\ &\\quad + \\mathbb{E}\\left[ (\\tilde{A}_t x(t) + \\tilde{B}_t u(t) + w(t+1))^T K(\\tilde{\\Theta}_t) (\\tilde{A}_t x(t) + \\tilde{B}_t u(t) + w(t+1)) \\mid \\mathcal{F}_t \\right] \\\\ &= x(t)^T Q x(t) + u(t)^T R u(t) + \\mathbb{E}\\left[ (\\tilde{A}_t x(t) + \\tilde{B}_t u(t))^T K(\\tilde{\\Theta}_t) (\\tilde{A}_t x(t) + \\tilde{B}_t u(t)) \\mid \\mathcal{F}_t \\right] \\\\ &\\quad + \\mathbb{E}\\left[ w(t+1)^T K(\\tilde{\\Theta}_t) w(t+1) \\mid \\mathcal{F}_t \\right] \\\\ &= x(t)^T Q x(t) + u(t)^T R u(t) + \\mathbb{E}\\left[ x(t+1)^T K(\\tilde{\\Theta}_t) x(t+1) \\mid \\mathcal{F}_t \\right] \\\\ &\\quad + \\left( (\\tilde{A}_t x(t) + \\tilde{B}_t u(t))^T K(\\tilde{\\Theta}_t) (\\tilde{A}_t x(t) + \\tilde{B}_t u(t)) - (A^0 x(t) + B^0 u(t))^T K(\\tilde{\\Theta}_t) (A^0 x(t) + B^0 u(t)) \\right). \\end{aligned}$$\n\nConsequently\n\n$$\\sum_{t=0}^{T} \\left( x(t)^T Q x(t) + u(t)^T R u(t) \\right) = \\sum_{t=0}^{T} J(\\tilde{\\Theta}_t) + C_1 + C_2 + C_3, \\tag{25}$$\n\nwhere\n\n$$C_1 = \\sum_{t=0}^{T} \\left( x(t)^T K(\\tilde{\\Theta}_t) x(t) - \\mathbb{E}\\left[ x(t+1)^T K(\\tilde{\\Theta}_{t+1}) x(t+1) \\mid \\mathcal{F}_t \\right] \\right), \\tag{26}$$\n$$C_2 = -\\sum_{t=0}^{T} \\mathbb{E}\\left[ x(t+1)^T (K(\\tilde{\\Theta}_t) - K(\\tilde{\\Theta}_{t+1})) x(t+1) \\mid \\mathcal{F}_t \\right], \\tag{27}$$\n$$C_3 = -\\sum_{t=0}^{T} \\left( (\\tilde{A}_t x(t) + \\tilde{B}_t u(t))^T K(\\tilde{\\Theta}_t) (\\tilde{A}_t x(t) + \\tilde{B}_t u(t)) - (A^0 x(t) + B^0 u(t))^T K(\\tilde{\\Theta}_t) (A^0 x(t) + B^0 u(t)) \\right). \\tag{28}$$\n\n4.2.2 Good events\n\nWe proceed by defining the following two events in the probability space under which we can bound the terms $C_1, C_2, C_3$. We then provide a lower bound on the probability of these events.\n\n$$E_1 = \\{ \\Theta^0 \\in \\Omega^{(i)}, \\text{ for } i \\ge 1 \\}, \\qquad E_2 = \\{ \\|w(t)\\| \\le 2\\sqrt{p \\log(T/\\delta)}, \\text{ for } 1 \\le t \\le T+1 \\}.$$\n\n4.2.3 Technical lemmas\n\nThe following lemmas establish upper bounds on $C_1, C_2, C_3$.\nLemma 4.5. Under the event $E_1 \\cap E_2$, the following holds with probability at least $1 - \\delta$.\n\n$$C_1 \\le \\frac{128\\, C}{(1-\\rho)^2} \\sqrt{T}\\, p \\log\\left( \\frac{T}{\\delta} \\right) \\sqrt{\\log\\left( \\frac{1}{\\delta} \\right)}. \\tag{29}$$\n\nLemma 4.6. Under the event $E_1 \\cap E_2$, the following holds.\n\n$$C_2 \\le \\frac{8C}{(1-\\rho)^2}\\, p \\log\\left( \\frac{T}{\\delta} \\right) \\log T. \\tag{30}$$\n\nLemma 4.7. Under the event $E_1 \\cap E_2$, the following holds with probability at least $1 - \\delta$.\n\n$$|C_3| \\le 800\\, k \\left( \\frac{C}{1-\\rho} \\right)^{5/2} \\sqrt{\\left( 1 + \\frac{k\\epsilon^2}{(1-\\rho)^2} \\right)} \\cdot \\frac{1+C}{C_{\\min}} \\cdot \\log\\left( \\frac{pT}{\\delta} \\right) \\sqrt{\\log\\left( \\frac{4kq}{\\delta} \\right)} \\cdot p \\log T \\sqrt{T}. \\tag{31}$$\n\nLemma 4.8. 
The following holds true.\n\n$$\\mathbb{P}(E_1) \\ge 1 - \\delta, \\qquad \\mathbb{P}(E_2) \\ge 1 - \\delta. \\tag{32}$$\n\nTherefore, $\\mathbb{P}(E_1 \\cap E_2) \\ge 1 - 2\\delta$.\nWe are now in position to prove Theorem 3.3.\nProof (Theorem 3.3). Using the cost decomposition (Eq. (25)), under the event $E_1 \\cap E_2$, we have\n\n$$\\sum_{t=0}^{T} \\left( x(t)^T Q x(t) + u(t)^T R u(t) \\right) = \\sum_{t=0}^{T} J(\\tilde{\\Theta}_t) + C_1 + C_2 + C_3 \\le T J(\\Theta^0) + C_1 + C_2 + C_3,$$\n\nwhere the last inequality stems from the choice of $\\tilde{\\Theta}_t$ by the algorithm (cf. Eq. (9)) and the fact that $\\Theta^0 \\in \\Omega_t$ for all $t$ under the event $E_1$. Hence, $R(T) \\le C_1 + C_2 + C_3$. Now using the bounds on $C_1, C_2, C_3$, we get the desired result.\n\nAcknowledgments\n\nThe authors thank the anonymous reviewers for their insightful comments. A.J. is supported by a Caroline and Fabian Pease Stanford Graduate Fellowship.\n\nReferences\n\n[1] Y. Abbasi-Yadkori and C. Szepesv\u00e1ri. Regret bounds for the adaptive control of linear quadratic systems. Proceedings of the 24th Annual Conference on Learning Theory, pages 1\u201326, 2011.\n\n[2] Y. Bar-Shalom and E. Tse. Dual effect, certainty equivalence, and separation in stochastic control. Automatic Control, IEEE Transactions on, 19(5):494\u2013500, 1974.\n\n[3] J. Bento, M. Ibrahimi, and A. Montanari. Learning networks of stochastic differential equations. Advances in Neural Information Processing Systems 23, pages 172\u2013180, 2010.\n\n[4] D. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, 1987.\n\n[5] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 3rd edition, 2007.\n\n[6] S. Bittanti and M. Campi. Adaptive control of linear time invariant systems: the bet on the best principle. Communications in Information and Systems, 6(4):299\u2013320, 2006.\n\n[7] E. Bradlow, B. Bronnenberg, G. Russell, N. Arora, D. Bell, S. Duvvuri, F. Hofstede, C. 
Sismeiro, R. Thomadsen, and S. Yang. Spatial models in marketing. Marketing Letters, 16(3):267\u2013278, 2005.\n\n[8] M. Campi. Achieving optimality in adaptive control: the bet on the best approach. In Decision and Control, 1997., Proceedings of the 36th IEEE Conference on, volume 5, pages 4671\u20134676. IEEE, 1997.\n\n[9] V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In Proceedings of the 21st Annual Conference on Learning Theory (COLT), 2008.\n\n[10] G. Feichtinger, R. Hartl, and S. Sethi. Dynamic optimal control models in advertising: recent developments. Management Science, pages 195\u2013226, 1994.\n\n[11] L. Guo and H. Chen. The \u00c5str\u00f6m-Wittenmark self-tuning regulator revisited and ELS-based adaptive trackers. Automatic Control, IEEE Transactions on, 36(7):802\u2013812, 1991.\n\n[12] P. Kumar and A. Becker. A new family of optimal adaptive controllers for Markov chains. Automatic Control, IEEE Transactions on, 27(1):137\u2013146, 1982.\n\n[13] T. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4\u201322, 1985.\n\n[14] T. Lai and C. Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154\u2013166, 1982.\n\n[15] C. Marinelli and S. Savin. Optimal distributed dynamic advertising. Journal of Optimization Theory and Applications, 137(3):569\u2013591, 2008.\n\n[16] T. Seidman, S. Sethi, and N. Derzko. Dynamics and optimization of a distributed sales-advertising model. Journal of Optimization Theory and Applications, 52(3):443\u2013462, 1987.\n\n[17] S. Sethi. Dynamic optimal control models in advertising: a survey. SIAM Review, pages 685\u2013725, 1977.\n\n[18] J. Tropp. 
Just relax: Convex programming methods for identifying sparse signals in noise. Information Theory, IEEE Transactions on, 52(3):1030\u20131051, 2006.\n\n[19] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using \u21131-constrained quadratic programming (Lasso). Information Theory, IEEE Transactions on, 55(5):2183\u20132202, 2009.\n\n[20] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541\u20132563, 2006.", "award": [], "sourceid": 1237, "authors": [{"given_name": "Morteza", "family_name": "Ibrahimi", "institution": null}, {"given_name": "Adel", "family_name": "Javanmard", "institution": null}, {"given_name": "Benjamin", "family_name": "Roy", "institution": null}]}