{"title": "Random Quadratic Forms with Dependence: Applications to Restricted Isometry and Beyond", "book": "Advances in Neural Information Processing Systems", "page_first": 12599, "page_last": 12609, "abstract": "Several important families of computational and statistical results in machine learning and randomized algorithms rely on uniform bounds on quadratic forms of random vectors or matrices. Such results include the Johnson-Lindenstrauss (J-L) Lemma, the Restricted Isometry Property (RIP), randomized sketching algorithms, and approximate linear algebra. The existing results critically depend on statistical independence, e.g., independent entries for random vectors, independent rows for random matrices, etc., which prevents their usage in dependent or adaptive modeling settings. In this paper, we show that such independence is in fact not needed for such results, which continue to hold under fairly general dependence structures. In particular, we present uniform bounds on random quadratic forms of stochastic processes which are conditionally independent and sub-Gaussian given another (latent) process. Our setup allows general dependencies of the stochastic process on the history of the latent process and allows the latent process to be influenced by realizations of the stochastic process. The results are thus applicable to adaptive modeling settings and also allow for sequential design of random vectors and matrices.
We also discuss stochastic process based forms of J-L, RIP, and sketching, to illustrate the generality of the results.", "full_text": "Random Quadratic Forms with Dependence:\nApplications to Restricted Isometry and Beyond\n\nArindam Banerjee\n\nQilong Gu\n\nVidyashankar Sivakumar\n\nZhiwei Steven Wu\n\nDepartment of Computer Science & Engineering, University of Minnesota, Twin Cities\n\nMinneapolis, MN 55455, USA\n\nAbstract\n\nSeveral important families of computational and statistical results in machine learning and randomized algorithms rely on uniform bounds on quadratic forms of random vectors or matrices. Such results include the Johnson-Lindenstrauss (J-L) Lemma, the Restricted Isometry Property (RIP), randomized sketching algorithms, and approximate linear algebra. The existing results critically depend on statistical independence, e.g., independent entries for random vectors, independent rows for random matrices, etc., which prevents their usage in dependent or adaptive modeling settings. In this paper, we show that such independence is in fact not needed for such results, which continue to hold under fairly general dependence structures. In particular, we present uniform bounds on random quadratic forms of stochastic processes which are conditionally independent and sub-Gaussian given another (latent) process. Our setup allows general dependencies of the stochastic process on the history of the latent process and allows the latent process to be influenced by realizations of the stochastic process. The results are thus applicable to adaptive modeling settings and also allow for sequential design of random vectors and matrices. We also discuss stochastic process based forms of J-L, RIP, and sketching, to illustrate the generality of the results.
1 Introduction\n\nOver the past few decades, a set of key developments in machine learning and randomized algorithms have relied on uniform large deviation bounds on quadratic forms involving random vectors or matrices. The Restricted Isometry Property (RIP) is a well known and widely studied result of this type, which has had a major impact in high-dimensional statistics [35, 5, 45, 46]. The Johnson-Lindenstrauss (J-L) Lemma is another well known result of this type, which has led to major statistical and algorithmic advances in the context of random projections [25, 2, 23]. Similar substantial developments have been made in several other contexts, including sketching algorithms based on random matrices [49, 26] and advances in approximate linear algebra [32, 20], among others. Such existing developments in one way or another rely on uniform bounds on quadratic forms of random vectors or matrices. Let A be a set of (m × n) matrices and ξ ∈ R^n be a sub-Gaussian random vector [45, 46]. The existing results stem from large deviation bounds of the following random variable [28]:\n\nC_A(ξ) = sup_{A ∈ A} | ‖Aξ‖_2^2 − E‖Aξ‖_2^2 | .   (1)\n\nResults such as RIP and J-L can then be obtained in a straightforward manner from such bounds by converting the matrix A into a vector θ = vec(A) and converting ξ into a suitable random matrix X to get bounds on\n\nC_Θ(X) = sup_{θ ∈ Θ} | ‖Xθ‖_2^2 − E‖Xθ‖_2^2 | ,   (2)\n\nwhere Θ = {vec(A) | A ∈ A}.\n\n1 The full version of this paper is available at https://arxiv.org/abs/1910.04930 [6].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
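As a quick numerical illustration of the quantity in (1) (our own sketch, not code from the paper), the following Monte Carlo snippet draws ξ with i.i.d. standard normal entries, for which E‖Aξ‖_2^2 = ‖A‖_F^2, and tracks the supremum deviation over a small, hypothetical finite set A:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: A is a small finite set of (m x n) matrices.
m, n, trials = 3, 8, 2000
mats = [rng.standard_normal((m, n)) for _ in range(5)]

# For xi with i.i.d. zero-mean unit-variance entries, E||A xi||_2^2 = ||A||_F^2.
expected = [np.sum(A ** 2) for A in mats]

# Monte Carlo draws of C_A(xi) = sup_{A in A} | ||A xi||_2^2 - E||A xi||_2^2 |.
samples = []
for _ in range(trials):
    xi = rng.standard_normal(n)
    samples.append(max(abs(np.sum((A @ xi) ** 2) - e)
                       for A, e in zip(mats, expected)))

# The empirical distribution of C_A(xi) has a rapidly decaying tail,
# consistent with the large deviation bounds discussed in the paper.
print(np.quantile(samples, 0.95))
```

The dimensions and the size of the set are arbitrary choices for illustration; the paper's results concern general, possibly infinite sets A and dependent ξ.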
Results on other domains such as sketching [49, 26] and approximate linear algebra [32, 20] can be similarly obtained. Further, note that such bounds are considerably more general than the popular Hanson-Wright inequality [39, 22] for quadratic forms of random vectors, which focuses on a fixed matrix A instead of a uniform bound over a set A.\n\nThe key assumption in all existing results is that the entries ξ_j of ξ need to be statistically independent. Such an independence assumption shows up as element-wise independence of the random vector ξ in quadratic forms like C_A(ξ) and row-wise or element-wise independence of the random matrix X in quadratic forms like C_Θ(X). Existing analysis techniques, typically based on advanced tools from empirical processes [45, 30], rely on such independence to get the large deviation bound.\n\nIn this paper, we consider a generalization of such existing results by allowing for statistical dependence in ξ. In particular, we assume ξ = {ξ_j} to be a stochastic process where the marginal random variables ξ_j are conditionally independent and sub-Gaussian given some other (latent) process F = {F_j}. While hidden Markov models (HMMs) [7] are a simple example of such a setup, with F being the latent variable sequence and ξ being the observations, our setup, described in detail in Section 2, allows for far more complex dependencies and allows for many different types of graphical models connecting ξ and F. In Section 2 we discuss two key conditions such graphical models need to satisfy and give a set of concrete examples of graphical models which satisfy the conditions, illustrating the flexibility of the setup.
Our main result is to establish a uniform large deviation bound for C_A(ξ) in (1) where ξ is any stochastic process following the setup outlined in Section 2.\n\nThere are two broad implications of our results allowing for dependence in random quadratic forms. First, there are several emerging domains where data collection, modeling, and estimation take place adaptively, including bandit learning, active learning, and time-series analysis [4, 40, 31]. The dependence in such adaptive settings is hard to handle, and existing analyses for specific cases go to great lengths to work with or around such dependence [36, 18, 34]. The general tool we provide for such settings has the potential to simplify and generalize results in adaptive data collection; e.g., our results are applicable to the smoothed analysis of contextual linear bandits considered in [27]. Second, since our results allow for sequential construction of random vectors and matrices by considering what has happened so far, algorithmic approaches such as J-L and sketching would arguably be able to take advantage of such extra flexibility, possibly leading to adaptive and more computationally efficient algorithms. In Section 4, we illustrate what such basic results on adaptive regression, RIP, and J-L look like when allowing for dependence in the random vectors or matrices.\n\nThe technical analysis for our main result is a significant generalization of prior analysis on the tail behavior of chaos processes [3, 28, 43] for random vectors with i.i.d. elements. To construct a uniform bound on C_A(ξ) in (1) for a stochastic process ξ with statistically dependent entries, we decompose the analysis into two parts: 1) bounding the off-diagonal terms of A^T A, and 2) bounding the diagonal terms of A^T A.
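The off-diagonal/diagonal split of the quadratic form can be sketched numerically (an illustration with toy dimensions of our own choosing, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
A = rng.standard_normal((m, n))
xi = rng.standard_normal(n)

G = A.T @ A                      # Gram matrix A^T A
quad = xi @ G @ xi               # ||A xi||_2^2 = xi^T (A^T A) xi

# Split into the diagonal and off-diagonal contributions of A^T A.
diag_part = np.sum(np.diag(G) * xi ** 2)
offdiag_part = quad - diag_part

# The two parts recombine exactly; the paper's analysis bounds each separately.
assert np.isclose(diag_part + offdiag_part, np.sum((A @ xi) ** 2))
```

The decomposition is exact for any A and ξ; the analytical work lies entirely in bounding the two parts uniformly over the set A.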
Our analysis for the off-diagonal terms is based on two key tools: decoupling [38] and generic chaining [43], both with suitable generalizations from their i.i.d. counterparts to stochastic processes ξ. For decoupling, we present a new result on decoupling of quadratic forms of sub-Gaussian stochastic processes ξ satisfying the conditions of our setup. Our result generalizes the classical decoupling result for vectors with i.i.d. entries [38, 28]. For generic chaining, we develop new results of interest in our context as well as generalize certain existing results for i.i.d. random vectors to stochastic processes. While generic chaining, as a technique, does not rely on statistical independence [43], an execution of the chaining argument does rely on an atomic large deviation bound such as the Hoeffding bound for independent elements [28]. In our setting, the atomic deviation bound in generic chaining carefully utilizes the conditional independence satisfied by the stochastic process ξ. Our analysis for the diagonal terms is based on suitable use of symmetrization, de-symmetrization, and contraction inequalities [8, 29]. However, we cannot use the standard forms of symmetrization and de-symmetrization, which are based on i.i.d. elements. We generalize the classical symmetrization and de-symmetrization results [8] to stochastic processes ξ in our setup, and subsequently utilize these inequalities to bound the diagonal terms. We present a gentle exposition of the analysis in Section 3; the technical proofs are all in [6, Appendix]. We have tried to make the exposition self-contained beyond certain key definitions and concepts such as Talagrand's γ-function and admissible sequences in generic chaining [43].\n\nNotation.
Our results are for stochastic processes ξ = {ξ_j} adapted to another stochastic process F = {F_i}, with both moment and conditional independence assumptions outlined in detail in Section 2. We will consider the conditional random variables X_j = ξ_j | f_{1:j}, where f_{1:j} is a realization of F_{1:j}, and assume X_j to be zero-mean L-sub-Gaussian, i.e., P(|X_j| > τ) ≤ 2 exp(−τ^2/L^2) for some constant L > 0 and all τ ≥ τ_0, a constant [45, 46]. For the exposition, we will call a random variable sub-Gaussian without explicitly referring to the constant L. With n denoting the length of the stochastic process, we will abuse notation and consider a random vector ξ = [ξ_j] ∈ R^n corresponding to the stochastic process ξ = {ξ_j}, where the usage will be clear from the context. Our results are based on two classes of complexity measures of a set of (m × n) matrices A. The first class, denoted by d_F(A) and d_{2→2}(A), are the radii of A in the Frobenius norm ‖A‖_F = √(Tr(A^T A)) and the operator norm ‖A‖_{2→2} = sup_{‖x‖_2 ≤ 1} ‖Ax‖_2: for the set A, we have d_F(A) = sup_{A ∈ A} ‖A‖_F and d_{2→2}(A) = sup_{A ∈ A} ‖A‖_{2→2}. The second class is Talagrand's γ_2(A, ‖·‖_{2→2}) functional, defined in Section 3 [43, 42].\n\nFigure 1: Graphical Model 1 (GM1) structure for stochastic process {ξ_i} adapted to {F_i}; it satisfies (SP-2) by construction. While we show arrows only from one random variable, e.g., F_{i−1} → ξ_i, the conditional random variable ξ_i | F_{1:(i−1)} can have dependence on the entire history F_{1:(i−1)}. All these arrows are not depicted in this and other figures to avoid clutter.
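For concreteness, both radius-type complexity measures are straightforward to compute for a small finite set of matrices; the sketch below uses a hypothetical random set:

```python
import numpy as np

rng = np.random.default_rng(2)
# A small hypothetical set of (m x n) matrices standing in for the set A.
mats = [rng.standard_normal((3, 5)) for _ in range(4)]

# d_F(A) = sup_{A in A} ||A||_F, the Frobenius-norm radius of the set.
d_F = max(np.linalg.norm(A, 'fro') for A in mats)

# d_{2->2}(A) = sup_{A in A} ||A||_{2->2}, the operator-norm radius.
d_op = max(np.linalg.norm(A, 2) for A in mats)

# The operator norm never exceeds the Frobenius norm, so d_{2->2}(A) <= d_F(A).
assert 0 < d_op <= d_F
```

For infinite sets A (e.g., all unit-norm s-sparse vectors viewed as matrices), these suprema must be computed analytically rather than by enumeration.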
Recent literature has used the notion of Gaussian width: w(A) = E sup_{A ∈ A} | Tr(G^T A) |, where G = [g_{i,j}] ∈ R^{m×n} has i.i.d. standard normal entries, i.e., g_{i,j} ∼ N(0, 1). It can be shown [43] that γ_2(A, ‖·‖_{2→2}) can be bounded by the Gaussian width w(A), i.e., γ_2(A, ‖·‖_{2→2}) ≤ c w(A) for some constant c > 0. Our analysis will be based on bounding L_p-norms of suitable random variables. For a random variable X, its L_p-norm is ‖X‖_{L_p} = (E|X|^p)^{1/p}.\n\n2 Setup\n\nWe describe the formal setup of stochastic processes for which we provide large deviation bounds. Let ξ = {ξ_i} = {ξ_1, . . . , ξ_n} be a sub-Gaussian stochastic process which is decoupled when conditioned on another stochastic process F = {F_i} = {F_1, . . . , F_n}. In particular, we assume:\n\n(SP-1) for each i = 1, . . . , n, ξ_i | f_{1:i} is a zero-mean sub-Gaussian random variable [46] for all realizations f_{1:i} of F_{1:i}; and\n\n(SP-2) for each i = 1, . . . , n, there exists an index Δ(i) ≤ i which is non-decreasing, i.e., Δ(j) ≤ Δ(i) for j < i, such that ξ_i ⊥ ξ_j | F_{1:Δ(i)} for j < i, and ξ_i ⊥ F_k | F_{1:Δ(i)} for k > Δ(i),\n\nwhere ⊥ denotes (conditional) independence. The stochastic process ξ = {ξ_i} is said to be adapted to the process F = {F_i} satisfying (SP-1) and (SP-2).\n\n(SP-1) is an assumption on the moments of the distributions ξ_i | f_{1:i}. Note that the assumption allows the specifics of the distribution to depend on the history. (SP-2) is an assumption on the conditional independence structure of ξ. The assumption allows ξ_i to depend on the history F_{1:Δ(i)}. Further, we can have F_{i−1} depending on ξ_{i−1} and ξ_i depending on F_{i−1}. Graphical models GM1 (Figure 1), GM2 (Figure 2) and GM3 (Figure 3) are examples of graphical models satisfying (SP-2).
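As an illustration of (SP-1) and (SP-2), the following sketch simulates a GM1-style pair of processes with dynamics of our own choosing (not prescribed by the paper): the latent chain F evolves on its own history, while each ξ_i is zero-mean Gaussian given the latent history, with a history-dependent scale:

```python
import numpy as np

def sample_gm1(n, rng):
    """Simulate an illustrative GM1-style adapted pair (xi, F): the latent
    chain F depends only on its own history, while xi_i given F_{1:(i-1)}
    is zero-mean Gaussian with a scale that is a bounded function of the
    latent history, so the process stays conditionally sub-Gaussian."""
    F = np.zeros(n + 1)                      # F[0] acts as the prior F_0
    xi = np.zeros(n)
    for i in range(1, n + 1):
        scale = 1.0 + 0.5 * np.tanh(F[:i].mean())      # depends on F_{0:(i-1)}
        xi[i - 1] = scale * rng.standard_normal()      # zero mean given history
        F[i] = 0.9 * F[i - 1] + rng.standard_normal()  # F ignores xi (GM1)
    return xi, F

rng = np.random.default_rng(3)
xi, F = sample_gm1(5000, rng)
# Conditionally centered implies unconditionally centered as well.
print(abs(xi.mean()))
```

Any other bounded, history-measurable scale function would do equally well; the point is only that the conditional distribution may depend arbitrarily on the latent history while remaining zero-mean and sub-Gaussian.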
For GM1, Δ(i) = i − 1 and F_i depends on F_{1:(i−1)}, but not on ξ_i. Further, ξ_i can depend on the entire history F_{1:(i−1)}. GM2 is a variant of GM1 and structurally resembles an HMM (hidden Markov model), with Δ(i) = i, F_i depending on F_{i−1} (or the entire history F_{1:(i−1)}), and ξ_i depending on F_i (or the entire history F_{1:i}). GM3 is a more complex model with Δ(i) = i in which F_i depends both on F_{1:(i−1)} and ξ_i. For GM1 and GM3, we consider an additional 'prior' F_0, and the properties (SP-1) and (SP-2) can be naturally extended to include such a prior. We also give concrete examples of potential interest in the context of machine learning in Section 4. For certain graphical models, it may at times be more natural to first construct a stochastic process {Z_i} respecting the graphical model structure governed by (SP-2), and then construct the sequence {ξ_i} by conditional centering, i.e., ξ_i | F_{1:i} = Z_i | F_{1:i} − E[Z_i | F_{1:i}], so that E[ξ_i | F_{1:i}] = 0 as required by (SP-1). Such a centered construction is inspired by how one can construct martingale difference sequences (MDS) from martingales [48].\n\nFigure 2: Graphical Model 2 (GM2) structure for stochastic process {ξ_i} adapted to {F_i}; it satisfies (SP-2) by construction. While we show arrows only from one random variable, e.g., F_i → ξ_i, the conditional random variable ξ_i | F_{1:i} can have dependence on the entire history F_{1:i}.\n\nFigure 3: Graphical Model 3 (GM3) structure for stochastic process {ξ_i} adapted to {F_i}; it satisfies (SP-2) by construction. Note that there is no restriction on the conditional distribution F_i | (F_{1:(i−1)}, ξ_i), so that F_i can have arbitrary dependence on F_{1:(i−1)} and Z_i. While we show arrows only to one random variable, e.g., F_{i−1} → ξ_i, the conditional random variable ξ_i | F_{1:(i−1)} can have dependence on the entire history F_{1:(i−1)}. Similarly, F_i | F_{1:(i−1)}, Z_i is illustrated only with arrows from F_{i−1}, Z_i to F_i to avoid clutter.\n\n3 Main Results\n\nLet A be a set of (m × n) matrices and let ξ be an L-sub-Gaussian random vector. The random variable of interest for the current analysis is:\n\nC_A(ξ) ≜ sup_{A ∈ A} | ‖Aξ‖_2^2 − E‖Aξ‖_2^2 | .   (3)\n\nBased on the literature on empirical processes and generic chaining [43, 30], the random variable C_A(ξ) can be referred to as an order-2 sub-Gaussian chaos [43, 28]. While widely used results like the restricted isometry property (RIP) [10, 19] and the Johnson-Lindenstrauss (J-L) lemma [25, 49] do not explicitly appear in the above form, getting such results from a large deviation bound on C_A(ξ) is straightforward [28, 33]. For ease of exposition, we will refer to such converted but otherwise equivalent forms as the random matrix form of C_A(ξ).\n\n3.1 The Main Result: Warm-up\n\nThe main technical result in the paper is a large deviation bound on C_A(ξ) for the setting when ξ is a stochastic process adapted to F satisfying (SP-1) and (SP-2), as defined in Section 2. To develop large deviation bounds on C_A(ξ), we decompose the quadratic form into terms depending on the off-diagonal and the diagonal elements of A^T A, respectively. First note that the contribution from the off-diagonal terms of A^T A to E‖Aξ‖_2^2 is 0.
To see this, with A_j denoting the jth column of A, by linearity of expectation we have\n\nE_ξ [ Σ_{j,k=1; j≠k}^n ξ_j ξ_k ⟨A_j, A_k⟩ ] = Σ_{j,k=1; j≠k}^n E_{ξ_j, ξ_k}[ξ_j ξ_k] ⟨A_j, A_k⟩ = Σ_{j,k=1; j≠k}^n E_{F_{1:n}} [ E_{ξ_j, ξ_k | F_{1:n}}[ξ_j ξ_k] ] ⟨A_j, A_k⟩ (a)= Σ_{j,k=1; j≠k}^n E_{F_{1:n}} [ E_{ξ_j | F_{1:n}}[ξ_j] E_{ξ_k | F_{1:n}}[ξ_k] ] ⟨A_j, A_k⟩ (b)= 0 ,\n\nwhere (a) follows since ξ_j ⊥ ξ_k | F_{1:n} by (SP-2), and (b) follows since E_{ξ_j | F_{1:n}}[ξ_j] = E_{ξ_k | F_{1:n}}[ξ_k] = 0 by (SP-1).\n\nNow, by definition and Jensen's inequality, we have\n\nC_A(ξ) = sup_{A ∈ A} | ‖Aξ‖_2^2 − E‖Aξ‖_2^2 | = sup_{A ∈ A} | Σ_{j,k=1; j≠k}^n ξ_j ξ_k ⟨A_j, A_k⟩ + Σ_{j=1}^n (|ξ_j|^2 − E|ξ_j|^2) ‖A_j‖_2^2 | ≤ sup_{A ∈ A} | Σ_{j,k=1; j≠k}^n ξ_j ξ_k ⟨A_j, A_k⟩ | + sup_{A ∈ A} | Σ_{j=1}^n (|ξ_j|^2 − E|ξ_j|^2) ‖A_j‖_2^2 | = B_A(ξ) + D_A(ξ) .\n\nTherefore, for any p ∈ [1, ∞), we have\n\n‖C_A(ξ)‖_{L_p} ≤ ‖B_A(ξ)‖_{L_p} + 
‖D_A(ξ)‖_{L_p} .   (4)\n\nOur approach to getting a large deviation bound for C_A(ξ) is based on bounding ‖C_A(ξ)‖_{L_p}, which in turn is based on bounding ‖B_A(ξ)‖_{L_p} and ‖D_A(ξ)‖_{L_p}. For convenience, we will refer to B_A(ξ) as the off-diagonal term and D_A(ξ) as the diagonal term. Such bounds lead to a bound on ‖C_A(ξ)‖_{L_p} of the form\n\n‖C_A(ξ)‖_{L_p} ≤ a + √p · b + p · c ,  ∀p ≥ 1 ,   (5)\n\nwhere a, b, c are constants which do not depend on p. Note that by using the moment-generating function and Markov's inequality [48, 44], these L_p-norm bounds imply, for all u > 0,\n\nP( |C_A(ξ)| ≥ a + b·√u + c·u ) ≤ e^{−u} ,   (6)\n\nor, equivalently,\n\nP( |C_A(ξ)| ≥ a + u ) ≤ exp( − min{ u^2/(4b^2), u/(2c) } ) ,   (7)\n\nwhich yields the desired large deviation bound.\n\n3.2 Upper Bounding B_A(ξ) and D_A(ξ)\n\nThe bound on ‖B_A(ξ)‖_{L_p} is based on two techniques: decoupling [38] and generic chaining [43]. Our main result in decoupling extends the classical result for ξ with i.i.d. entries to stochastic processes ξ satisfying (SP-1) and (SP-2). The second part of the analysis uses generic chaining [43, 42], which is arguably one of the most powerful tools for such analysis. Since we use generic chaining, the results are in terms of Talagrand's γ-functions [43], defined below.\n\nDefinition 1 For a metric space (T, d), an admissible sequence of T is a collection of subsets of T, {T_r : r ≥ 0}, with |T_0| = 1 and |T_r| ≤ 2^{2^r} for all r ≥ 1. For β ≥ 1, the γ_β functional is defined by\n\nγ_β(T, d) = inf sup_{t ∈ T} Σ_{r=0}^∞ 2^{r/β} d(t, T_r) ,   (8)\n\nwhere the infimum is over all admissible sequences of T.\n\nIn particular, our results are in terms of γ_2(A, ‖·‖_{2→2}), which is related to the Gaussian width of the set by the majorizing measure theorem [42, Theorem 2.1.1][43, Theorem 2.4.1]. Recent years have seen major advances in using Gaussian width for both statistical and computational analysis in the context of high-dimensional statistics and related areas [11, 5, 37, 13]. Hence, recent tools for bounding Gaussian width [11, 13] can be applied to our setting to get concrete bounds for cases of interest. For example, if A is a set of s-sparse (m × n) matrices, γ_2(A, ‖·‖_{2→2}) ≤ c√(s log(mn)) for some constant c [30, 45] (also see Section 4).\n\nWhile the diagonal term D_A(ξ) does not have any interaction terms of the form ξ_j ξ_k, the term depends on the centered random variables |ξ_j|^2 − E|ξ_j|^2. Our analysis relies on three key results: symmetrization, de-symmetrization, and contraction [8, 29]. Our overall approach reduces to showing that upper bounds on D_A(ξ) can be derived from upper bounds on D_A(g), where g has i.i.d. normal entries, plus additional terms which can be bounded using generic chaining [43]. Bounds on D_A(g) can be obtained using existing results [28].\n\n3.3 The Main Result\n\nBased on the analysis above, we have our main result as stated below.\n\nTheorem 1 Let A be a set of (m × n) matrices and let ξ be a stochastic process adapted to F satisfying (SP-1) and (SP-2). Let\n\nM = γ_2(A, ‖·‖_{2→2}) · ( γ_2(A, ‖·‖_{2→2}) + d_F(A) ) ,   (9)\n\nV = d_{2→2}(A) · ( γ_2(A, ‖·‖_{2→2}) + d_F(A) ) ,   (10)\n\nU = d_{2→2}^2(A) .   (11)\n\nThen, for any ε > 0,\n\nP( sup_{A ∈ A} | ‖Aξ‖_2^2 − E‖Aξ‖_2^2 | ≥ c_1 M + ε ) ≤ 2 exp( −c_2 min{ ε^2/V^2, ε/U } ) ,   (12)\n\nwhere c_1, c_2 are constants which depend on the support.\n\nIt is instructive to compare our bounds for stochastic processes ξ satisfying (SP-1) and (SP-2) to the sharpest existing bound on C_A(ξ) for the special case when ξ has i.i.d. sub-Gaussian entries [28]. For this i.i.d. sub-Gaussian case, [28] showed a large deviation bound based on\n\nM′ = γ_2(A, ‖·‖_{2→2}) · ( γ_2(A, ‖·‖_{2→2}) + d_F(A) ) + d_F(A) · d_{2→2}(A) ,   (13)\n\nV′ = d_{2→2}(A) · ( γ_2(A, ‖·‖_{2→2}) + d_F(A) ) ,   (14)\n\nU′ = d_{2→2}^2(A) .   (15)\n\nBy comparing these terms with those in Theorem 1, we note that U = U′ and V = V′, and while M′ has an extra additive term d_F(A) · d_{2→2}(A), for symmetric sets A with A = −A we have d_{2→2}(A) ≤ γ_2(A, ‖·‖_{2→2}), so the terms are of the same order. Thus, the generalization to the stochastic process ξ yields a bound of the same order as in the i.i.d. case, which allows seamless extension of applications of the result to random vectors/matrices with statistical dependence.\n\nFinally, our results can be extended to the case of non-zero mean stochastic processes.
In particular, with x = ξ + μ, where ξ is the stochastic process satisfying (SP-1) and (SP-2) and μ is the mean vector, i.e., E[x] = μ, we have ‖Ax‖_2^2 − E‖Ax‖_2^2 = (‖Aξ‖_2^2 − E‖Aξ‖_2^2) + ⟨ξ, 2A^T Aμ⟩, where the first term is what we analyze and bound in Theorem 1, and the second term is a linear form of ξ. For the uniform bound, the two terms can be separated using Jensen's inequality; the first term can be bounded using Theorem 1, and the second term can be bounded using a standard application of generic chaining using (SP-1) and (SP-2). Thus, mean-shifted versions of our results also hold.\n\n4 Implications of the Main Results\n\nWe show several applications of our results, including the Johnson-Lindenstrauss (J-L) lemma, the Restricted Isometry Property (RIP), and sketching. All proofs can be found in [6, Section 4].\n\n4.1 Johnson-Lindenstrauss with Stochastic Processes\n\nLet X ∈ R^{n×p}, n < p, and let A be any set of N vectors in R^p. X is a Johnson-Lindenstrauss transform (JLT) [25, 2] if for any ε > 0,\n\n(1 − ε)‖u‖_2^2 ≤ ‖Xu‖_2^2 ≤ (1 + ε)‖u‖_2^2  for all u ∈ A .   (16)\n\nA JLT is a random projection which embeds high-dimensional data into a lower-dimensional space while approximately preserving all pairwise distances [49, 32, 24]. The JLT has found numerous applications, including searching for an approximate nearest neighbor in high-dimensional Euclidean space [23], dimension reduction in databases [1], learning mixtures of Gaussians [15], and sketching [49]. It is well known that X = (1/√n) X̃, where X̃ contains standard i.i.d. normal elements, is a JLT with high probability when n = Ω(log N) [25].
Now denote the element in the i-th row and j-th column of X̃ as X̃_{i,j}, and the i-th row as X̃_{i,:}. Let the entries X̃_{i,j} be sequentially generated as follows:\n\n1. Initially, draw the first element X̃_{1,1} from a zero-mean sub-Gaussian distribution.\n\n2. X̃_{i,j} is a conditionally 1-sub-Gaussian random variable satisfying E[X̃_{i,j} | f_{i,j}] = 0. The f_{i,j} are realizations of a stochastic process which can possibly depend on the entries {{X̃_{i′,:}}_{i′<i}, {X̃_{i,j′}}_{j′<j}}.\n\n3. X̃_{i,j} ⊥ {{X̃_{i′,:}}_{i′<i}, {X̃_{i,j′}}_{j′<j}} | f_{i,j} and X̃_{i,j} ⊥ {{f_{i,j′}}_{j′>j}, {f_{i′,:}}_{i′>i}} | f_{i,j}.\n\nThe following result is an immediate consequence of Theorem 1.\n\nCorollary 1 (JL) Let X ∈ R^{n×p} be a matrix constructed as X = (1/√n) X̃. If we choose n = Ω(ε^{−2} log N), then X is a JLT with probability at least 1 − 1/N^c for a constant c > 0.\n\n4.2 Restricted Isometry Property (RIP) with Stochastic Processes\n\nMatrices satisfying the Restricted Isometry Property (RIP) are approximately orthonormal on sparse vectors [10, 9]. Let X ∈ R^{n×p} and let A be the set of all s-sparse vectors in R^p. We say the matrix X satisfies RIP with restricted isometry constant δ_s ∈ (0, 1) if for all u ∈ A,\n\n(1 − δ_s)‖u‖_2^2 ≤ (1/n)‖Xu‖_2^2 ≤ (1 + δ_s)‖u‖_2^2 .   (17)\n\nMatrices satisfying RIP are of interest in high-dimensional statistics and compressed sensing problems where the goal is to recover a sparse signal θ* ∈ R^p from limited noisy linear measurements [47, 46]. Sub-Gaussian random matrices with i.i.d. rows, e.g., rows sampled from N(0, σ^2 I_{p×p}), satisfy RIP [10, 9, 35, 5] when n = Ω(s log p). But the i.i.d. 
rows assumption is violated in many practical settings where data is generated adaptively/sequentially. Examples include time-series regression and bandit problems [31, 27], active learning [40, 21], and volume sampling [16, 17]. An application of our new results shows that the i.i.d. assumption is not necessary and design matrices generated from dependent elements also satisfy RIP when n = Ω(s log p). For example, the following result holds for matrices X generated similarly to the matrix X̃ in Section 4.1.\n\nCorollary 2 (RIP) Let X ∈ R^{n×p} be a matrix generated from the process outlined earlier. Then for any ε > 0, if we choose n = Ω(ε^{−2} s log(2p/s)), then δ_s ≤ ε with probability at least 1 − (s/2p)^{cs} for a constant c > 0.\n\nRIP for adaptively generated rows. Sequential learning problems like linear contextual bandits involve estimating a parameter with a design matrix whose rows are adversarially generated based on previously observed rows and rewards, which are linear functions of the rows. An example is the linear contextual bandit problem considered, e.g., in [27, 41]. The data in any time step t is generated as follows [27, 41]:\n\n1. Let H_{t−1} denote the historical data observed until time t − 1. In time step t − 1, an adaptive adversary A_{t−1} maps the history to k contexts μ_t^1, . . . , μ_t^k in R^p with ‖μ_t^i‖_2 ≤ 1, i.e., A_{t−1} : H_{t−1} → (B_2^p)^k, where B_2^p represents the unit ball in p dimensions. Nature perturbs the contexts with random Gaussian noise, i.e., x_t^i = μ_t^i + g_t^i with g_t^i ∼ N(0, σ^2 I_{p×p}). In the context of GM3, H_{t−1} ∪ {x_t^1, . . . , x_t^k} represents F_{1:t−1}.\n\n2. In time step t, a learner chooses one among the k contexts {x_t^1, . . . , x_t^k} based on the historical data H_{t−1}. Let x_t^{i_t} denote the selected context and g_t^{i_t} the corresponding Gaussian perturbation. In the context of GM3, we denote the centered Gaussian perturbation g_t^{i_t} − E[g_t^{i_t}] by ξ_t. The learner receives the noisy reward y_t = ⟨x_t^{i_t}, θ*⟩ + ω_t, where ω_t is an unknown sub-Gaussian noise.
History at time step t is now augmented with the new data, i.e., H_t = H_{t−1} ∪ {{x_t^1, . . . , x_t^k}, x_t^{i_t}, y_t}.\n\nThe data generation process mirrors GM3, with F_t being a sub-Gaussian process which is influenced by F_{t−1} and ξ_t but generated adaptively by an adversary; ξ_t is a sub-Gaussian random vector chosen by the learner using the historical data H_{t−1}, satisfying (SP-2). The algorithm proposed in [41, 27] involves a parameter estimation step in each time step t using the observed contexts and the corresponding rewards {x_{t′}^{i_{t′}}, y_{t′}}, 1 ≤ t′ ≤ t. With X_t the matrix which has the centered Gaussian perturbations ξ_1, . . . , ξ_{t−1} as rows, [41, 27] show that inf_{u ∈ R^p} E[‖X_t u‖_2^2] ≥ tκ‖u‖_2^2 for some constant κ depending on the problem parameters, and require the following lower bound, a non-asymptotic RIP condition, for efficient parameter estimation:\n\ninf_{u ∈ R^p} ( E[‖X_t u‖_2^2] − ε‖u‖_2^2 ) ≤ inf_{u ∈ R^p} ‖X_t u‖_2^2 .   (18)\n\nSince the data generation follows graphical model GM3, the following Corollary 3 is a direct consequence of Theorem 1.\n\nCorollary 3 Let X_t be a design matrix generated from the process described above.
Then for any $\epsilon > 0$, if we choose $t = \Omega(\epsilon^{-2} \kappa^{-2} p)$, then with probability at least $1 - \exp(-cp)$ for a constant $c > 0$, the following condition is satisfied:

$$\inf_{u \in \mathbb{R}^p} \|X_t u\|_2^2 \;\geq\; t \kappa (1 - \epsilon) \|u\|_2^2. \qquad (19)$$

4.3 CountSketch

CountSketch, or the sparse J-L transform, is used in real-world applications like data streaming and dimensionality reduction [12, 49]. Every column of an $(n \times p)$ CountSketch matrix $X$ has only $d$ ($d \ll n$) non-zero elements, therefore for any vector $u \in \mathbb{R}^p$, computing $Xu$ takes only $O(dp)$ instead of $O(np)$. Each entry of a CountSketch matrix $X$ is given by $X_{i,j} = \eta_{i,j} \delta_{i,j} / \sqrt{d}$, where $\delta_{i,j}$ is an independent Rademacher random variable, and $\eta_{i,j}$ is a random variable sampled adaptively. The $\eta_{i,j}$ satisfy $\sum_{i=1}^n \eta_{i,j} = d$, $\eta_{i,j} \in \{0, 1\}$, that is, each column has exactly $d$ non-zero elements. For every column $j$ of $X$, the $\eta_{i,j}$ can be generated by sampling $d$ indices from $\{1, 2, \ldots, n\}$ adaptively given the previous columns, then setting the corresponding $X_{i,j}$ to be a Rademacher random variable, so that $X_{i,j}$ depends on $X_{1,j}, X_{2,j}, \ldots, X_{i-1,j}$. The data generation process of the CountSketch matrix follows graphical model GM1. The variance of $X_{i,j}$ is $\frac{1}{n}$, and since all the entries of $X$ are bounded by $1$, $X$ is a JLT over $N$ points when the number of rows satisfies $n = \Omega(\epsilon^{-2} \log N)$. Unlike [14, 26], our bound does not depend on the choice of $d$. Our bound also matches the state of the art [26].

5 Conclusions

Several existing results in machine learning and randomized algorithms, e.g., RIP, J-L, sketching, etc., rely on uniform large deviation bounds of random quadratic forms based on random vectors or matrices. Such results are uniform over suitable sets of matrices or vectors, and have found wide-ranging applications over the past few decades.
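The adaptively sampled CountSketch construction of Section 4.3 can be illustrated with a short sketch. The specific adaptive rule below (biasing each column's $d$ nonzero rows toward rows used less by earlier columns) is our own illustrative choice, not the rule from [14, 26]; any rule measurable with respect to the previously generated columns fits the GM1 setup.

```python
import numpy as np

def adaptive_countsketch(n, p, d, rng):
    """n x p CountSketch matrix with exactly d nonzeros per column:
    X[i,j] = eta[i,j] * delta[i,j] / sqrt(d), where the d nonzero row
    indices (eta) of column j are chosen adaptively given columns < j."""
    X = np.zeros((n, p))
    load = np.zeros(n)  # how often each row was used so far (the "history")
    for j in range(p):
        # illustrative adaptive rule: prefer rows used less by earlier columns
        w = 1.0 / (1.0 + load)
        rows = rng.choice(n, size=d, replace=False, p=w / w.sum())
        X[rows, j] = rng.choice([-1.0, 1.0], size=d) / np.sqrt(d)  # Rademacher signs
        load[rows] += 1
    return X

rng = np.random.default_rng(0)
n, p, d = 256, 50, 8
X = adaptive_countsketch(n, p, d, rng)
assert all(np.count_nonzero(X[:, j]) == d for j in range(p))  # d nonzeros per column
assert abs(np.var(X) - 1.0 / n) < 1e-3                        # entrywise variance 1/n

# computing Xu using only the d nonzeros per column: O(dp) work instead of O(np)
u = rng.standard_normal(p)
y = np.zeros(n)
for j in range(p):
    nz = np.nonzero(X[:, j])[0]
    y[nz] += X[nz, j] * u[j]
assert np.allclose(y, X @ u)
```

Note that, consistent with GM1, each column's support depends on the realizations of the previous columns, so the entries of $X$ are dependent even though each column in isolation is a valid sparse Rademacher column.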
Growing interest in adaptive data collection, modeling, and estimation in modern machine learning is revealing a key limitation of such results: the need for statistical independence, e.g., elementwise independence of random vectors, row-wise independence of random matrices, etc. In this paper, we have presented a generalization of such results that allows for statistical dependence on the history. We have also given examples for certain cases of interest, including RIP, J-L, and sketching, illustrating that in spite of allowing for dependence, our bounds are of the same order as those for the case of independent random vectors. We anticipate that our results will simplify and help make advances in the analysis of learning settings based on adaptive data collection. Further, the added flexibility of designing random matrices sequentially may lead to computationally and/or statistically efficient random projection based algorithms. In future work, we plan to investigate applications of these results in adaptive data collection and modeling settings.

Acknowledgements: The research was supported by NSF grants OAC-1934634, IIS-1908104, IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Mozilla research grant. Part of this work was completed while ZSW was visiting the Simons Institute for the Theory of Computing at UC Berkeley.

References

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003. Special Issue on PODS 2001.

[2] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.

[3] M. Arcones and E. Giné. On decoupling, series expansions, and tail behavior of chaos processes. Journal of Theoretical Probability, 6(1):101-122, 1993.

[4] Peter Auer.
Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Res., 3:397-422, March 2003.

[5] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems (NIPS), 2014.

[6] Arindam Banerjee, Qilong Gu, Vidyashankar Sivakumar, and Zhiwei Steven Wu. Random quadratic forms with dependence: Applications to restricted isometry and beyond. arXiv:1910.04930, 2019.

[7] D. Barber. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

[8] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

[9] E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313-2351, 2007.

[10] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51:4203-4215, 2005.

[11] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805-849, 2012.

[12] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In ICALP, 2002.

[13] Sheng Chen and Arindam Banerjee. Structured estimation with atomic norms: General bounds and applications. In Advances in Neural Information Processing Systems 28, 2015.

[14] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In STOC, 2010.

[15] Sanjoy Dasgupta. Learning mixtures of Gaussians. In FOCS, pages 634-, 1999.

[16] Michał Dereziński and Manfred Warmuth. Subsampling for ridge regression via regularized volume sampling. In AISTATS, pages 716-725, 2018.

[17] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Leveraged volume sampling for linear regression.
In NIPS, pages 2510-2519, 2018.

[18] Yash Deshpande, Lester W. Mackey, Vasilis Syrgkanis, and Matt Taddy. Accurate inference for adaptive linear models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 2018.

[19] D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289-1306, 2006.

[20] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36(1):132-157, 2006.

[21] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131-309, 2014.

[22] D. L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. Ann. Math. Statist., 42(3):1079-1083, 1971.

[23] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604-613, 1998.

[24] Piotr Indyk, Rajeev Motwani, Prabhakar Raghavan, and Santosh Vempala. Locality-preserving hashing in multidimensional spaces. In STOC, 1997.

[25] William Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.

[26] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. J. ACM, 61(1):4:1-4:23, January 2014.

[27] Sampath Kannan, Jamie Morgenstern, Aaron Roth, Bo Waggoner, and Zhiwei Steven Wu. A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. arXiv:1801.04323, 2018.

[28] F. Krahmer, S. Mendelson, and H. Rauhut. Suprema of chaos processes and the restricted isometry property. Communications on Pure and Applied Mathematics, 67(11):1877-1904, 2014.

[29] M. Ledoux and M. Talagrand.
Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

[30] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 2013.

[31] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer, 2005.

[32] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123-224, 2011.

[33] Shahar Mendelson, Holger Rauhut, and Rachel Ward. Improved bounds for sparse recovery from subsampled random convolutions. Ann. Appl. Probab., 28(6):3491-3527, 2018.

[34] Seth Neel and Aaron Roth. Mitigating bias in adaptive data gathering via differential privacy. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, 2018.

[35] S. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4):538-557, 2012.

[36] Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, 2018.

[37] S. Oymak, B. Recht, and M. Soltanolkotabi. Sharp time-data tradeoffs for linear inverse problems. IEEE Transactions on Information Theory, 64(6):4129-4158, June 2018.

[38] V. H. de la Peña and E. Giné. Decoupling: From Dependence to Independence. Springer, 1999.

[39] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-Gaussian concentration. Electron. Commun. Probab., 18:9 pp., 2013.

[40] Burr Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1-114, 2012.

[41] Vidyashankar Sivakumar, Zhiwei Steven Wu, and Arindam Banerjee.
Structured linear contextual bandits: A sharp and geometric smoothed analysis. In preparation, 2019.

[42] M. Talagrand. The Generic Chaining. Springer, 2005.

[43] M. Talagrand. Upper and Lower Bounds for Stochastic Processes. Springer, 2014.

[44] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, chapter 5, pages 210-268. Cambridge University Press, 2012.

[45] R. Vershynin. Estimation in high dimensions: A geometric perspective, pages 3-66. Springer International Publishing, Cham, 2014.

[46] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[47] Martin Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press (To appear), 2019.

[48] David Williams. Probability with Martingales. Cambridge University Press, 1991.

[49] David P. Woodruff. Sketching as a tool for numerical linear algebra. Found. Trends Theor. Comput. Sci., 10(1-2):1-157, October 2014.