{"title": "New Subsampling Algorithms for Fast Least Squares Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 360, "page_last": 368, "abstract": "We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data ($n \\gg p$). We propose three methods which solve the big data problem by subsampling the covariance matrix using either a single or two stage estimation. All three run in the order of size of input i.e. O($np$) and our best method, {\\it Uluru}, gives an error bound of $O(\\sqrt{p/n})$ which is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d., sub-Gaussian then one can directly subsample without the expensive Randomized Hadamard preconditioning without loss of accuracy.", "full_text": "New Subsampling Algorithms for Fast Least Squares\n\nRegression\n\nParamveer S. Dhillon1 Yichao Lu2 Dean Foster2\n\nLyle Ungar1\n\n1Computer & Information Science, 2Statistics (Wharton School)\n\nUniversity of Pennsylvania, Philadelphia, PA, U.S.A\n\n{dhillon|ungar}@cis.upenn.edu\n\nfoster@wharton.upenn.edu, yichaolu@sas.upenn.edu\n\nAbstract\n\nWe address the problem of fast estimation of ordinary least squares (OLS) from\nlarge amounts of data (n (cid:29) p). We propose three methods which solve the big\ndata problem by subsampling the covariance matrix using either a single or two\nstage estimation. All three run in the order of size of input i.e. O(np) and our best\n\nmethod, Uluru, gives an error bound of O((cid:112)p/n) which is independent of the\n\namount of subsampling as long as it is above a threshold. 
We provide theoretical bounds for our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as the sub-Gaussian random design setting. We also compare the performance of our methods on synthetic and real-world datasets and show that if observations are i.i.d. sub-Gaussian, one can skip the expensive Randomized Hadamard preconditioning and directly subsample without loss of accuracy.

1 Introduction

Ordinary Least Squares (OLS) is one of the oldest and most widely studied statistical estimation methods, with origins tracing back over two centuries. It is the workhorse of fields as diverse as machine learning, statistics, econometrics, computational biology and physics. To keep pace with growing amounts of data, ever faster ways of estimating OLS are sought. This paper focuses on the setting n ≫ p, where n is the number of observations and p is the number of covariates or features, a common regime for web scale data.

Numerous approaches to this problem have been proposed [1, 2, 3, 4, 5]. The predominant approach to big data OLS estimation uses some kind of random projection, for instance, transforming the data with a randomized Hadamard transform [6] or Fourier transform, then uniformly sampling observations from the resulting transformed matrix and estimating OLS on this smaller data set. The intuition behind this approach is that these frequency domain transformations uniformize the data and smear the signal across all the observations, so that there are no longer any high leverage points whose omission could unduly influence the parameter estimates. Hence, uniform sampling in the transformed space suffices. Another way of looking at this approach is as preconditioning the design matrix with a carefully constructed data-independent random matrix before subsampling. 
This approach has been used by a variety of papers proposing methods such as the Subsampled Randomized Hadamard Transform (SRHT) [1, 4] and the Subsampled Randomized Fourier Transform (SRFT) [2, 3]. There is also publicly available software implementing these ideas [7]. It is worth noting that these approaches assume a fixed design setting.

Following this line of work, in this paper we provide two main contributions:

1. Novel subsampling algorithms for OLS: We propose three novel¹ algorithms for fast estimation of OLS which work by subsampling the covariance matrix. Some recent results in [8] allow us to bound the difference between the parameter vector (ŵ) we estimate from the subsampled data and the true underlying parameter (w0) which generates the data. We provide theoretical analysis of our algorithms in the fixed design (with Randomized Hadamard preconditioning) as well as the sub-Gaussian random design setting. The error bound of our best algorithm, Uluru, is independent of the fraction of data subsampled (above a minimum threshold of subsampling) and depends only on the characteristics of the data/design matrix X.

2. Randomized Hadamard preconditioning not always needed: We show that the error bounds for all three algorithms are similar in both the fixed design and the sub-Gaussian random design settings. In other words, one can either transform the data/design matrix via the Randomized Hadamard transform (fixed design setting) and then use any of our three algorithms or, if the observations are i.i.d. and sub-Gaussian, one can directly use any of our three algorithms. Thus, another contribution of this paper is to show that if the observations are i.i.d. 
and sub-Gaussian, then one does not need the slow Randomized Hadamard preconditioning step and can get similar accuracies much faster.

The remainder of the paper is organized as follows: in the next section we formally define notation for the regression problem; in Sections 3 and 4 we describe our algorithms and provide theorems characterizing their performance; finally, we compare the empirical performance of our methods on synthetic and real world data.

2 Notation and Preliminaries

Let X be the n × p design matrix. For the random design case we assume the rows of X are n i.i.d. samples of the 1 × p vector of independent variables (a.k.a. "covariates" or "predictors") X. Y is the real valued n × 1 response vector which contains the n corresponding values of the dependent variable Y (in general we use bold letters for samples and normal letters for random variables or vectors). ε is the n × 1 homoskedastic noise vector with common variance σ². We want to infer w0, i.e. the p × 1 population parameter vector that generated the data. More formally, we can write the true model as

Y = Xw0 + ε,   ε ∼iid N(0, σ²).

The sample solution (in matrix notation) to the equation above is given by ŵsample = (X⊤X)⁻¹X⊤Y, and by consistency of the OLS estimator we know that ŵsample →d w0 as n → ∞. Classical algorithms for estimating ŵsample use QR decomposition or bidiagonalization [9], and they require O(np²) floating point operations.

Since our algorithms are based on subsampling the covariance matrix, we need some extra notation. Let r = nsubs/n (< 1) be the subsampling ratio, giving the ratio of the number of observations (nsubs) in the subsampled matrix Xsubs to the number of observations (n) in the original X matrix. 
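As a concrete sketch of this setup, the following hypothetical numpy code (variable names are ours, not from the paper) draws data from the model above, computes the classical O(np²) OLS solve, and forms the r-subsampling split used throughout:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 4096, 8, 1.0            # n >> p; noise standard deviation sigma
w0 = rng.standard_normal(p)            # true population parameter vector
X = rng.standard_normal((n, p))        # i.i.d. sub-Gaussian (here Gaussian) rows
Y = X @ w0 + sigma * rng.standard_normal(n)   # Y = X w0 + eps

# Classical OLS: w_hat = (X^T X)^{-1} X^T Y, costing O(n p^2) FLOPS.
w_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Subsampling split with ratio r = n_subs / n: the first n_subs rows form
# (X_subs, Y_subs); the remaining n - n_subs rows form (X_rem, Y_rem).
r = 0.1
n_subs = int(r * n)
X_subs, Y_subs = X[:n_subs], Y[:n_subs]
X_rem, Y_rem = X[n_subs:], Y[n_subs:]
```

With i.i.d. rows, taking the first nsubs rows is equivalent to a uniform subsample; for an arbitrary fixed design one would first precondition as described in Section 2.1.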
I.e., r is the fraction of the observations sampled. Let Xrem, Yrem denote the data and response vector for the remaining n − nsubs observations; in other words, X⊤ = [X⊤subs ; X⊤rem] and Y⊤ = [Y⊤subs ; Y⊤rem].

Also, let ΣXX be the covariance of X and ΣXY be the covariance between X and Y. For the fixed design setting ΣXX = X⊤X/n and ΣXY = X⊤Y/n, and for the random design setting ΣXX = E(X⊤X) and ΣXY = E(X⊤Y).

The bounds presented in this paper are expressed in terms of the Mean Squared Error (or Risk) for the ℓ2 loss. For the fixed design setting,

MSE = (w0 − ŵ)⊤X⊤X(w0 − ŵ)/n = (w0 − ŵ)⊤ΣXX(w0 − ŵ).

For the random design setting,

MSE = E_X‖Xw0 − Xŵ‖² = (w0 − ŵ)⊤ΣXX(w0 − ŵ).

¹One of our algorithms (FS) is similar to [4], as we describe in Related Work. However, even for that algorithm, our theoretical analysis is novel.

2.1 Design Matrix and Preconditioning

Thus far we have not made any assumptions about the design matrix X. In fact, our algorithms and analysis work in both the fixed design and random design settings. As mentioned earlier, our algorithms involve subsampling the observations, so we have to ensure that we do not leave behind any observations which are outliers/high leverage points; this is done differently for fixed and random designs. In the fixed design setting the design matrix X is arbitrary and may contain high leverage points. Therefore, before subsampling, we precondition the matrix by a Randomized Hadamard/Fourier Transform [1, 4]; after preconditioning, the probability of having high leverage points in the new design matrix becomes very small. On the other hand, if we assume X comes from a random design whose rows are i.i.d. 
draws from some nice distribution like sub-Gaussian, then the probability of having high leverage points is very small and we can happily subsample X without preconditioning.

In this paper we analyze both the fixed and the sub-Gaussian random design settings. Since the fixed design analysis involves transforming the design matrix with a preconditioner before subsampling, some background on SRHT is warranted.

Subsampled Randomized Hadamard Transform (SRHT): In the fixed design setting we precondition and subsample the data with an nsubs × n randomized Hadamard transform matrix Θ (= √(n/nsubs) RHD), as Θ·X. The matrices R, H, and D are defined as:

• R ∈ R^{nsubs×n} is a set of nsubs rows from the n × n identity matrix, where the rows are chosen uniformly at random without replacement.

• D ∈ R^{n×n} is a random diagonal matrix whose entries are independent random signs, i.e. random variables uniformly distributed on {±1}.

• H ∈ R^{n×n} is a normalized Walsh–Hadamard matrix, defined recursively as Hn = [Hn/2, Hn/2 ; Hn/2, −Hn/2], with H2 = [+1, +1 ; +1, −1]. H = (1/√n) Hn is the rescaled version of Hn.

It is worth noting that HD is the preconditioning matrix and R is the subsampling matrix. The running time of SRHT is n p log(p) floating point operations (FLOPS) [4]. [4] mention fixing nsubs = O(p); however, in our experiments we vary the amount of subsampling, which is not something recommended by their theory. With varying subsampling, the run time becomes O(n p log(nsubs)).

3 Three subsampling algorithms for fast linear regression

All our algorithms subsample the X matrix followed by a single or two stage fitting, and are described below. The algorithms given below are for the random design setting. 
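Before turning to the estimators, the SRHT preconditioner of Section 2.1 can be sketched as follows (a hypothetical numpy sketch; it builds H by Sylvester's recursion in O(n²) for clarity, whereas a real implementation would use an O(n log n) fast Walsh–Hadamard transform):

```python
import numpy as np

def srht(X, Y, n_subs, rng):
    """Apply Theta = sqrt(n / n_subs) * R H D to (X, Y).

    D: random +/-1 diagonal; H: normalized Walsh-Hadamard matrix
    (requires n to be a power of two); R: uniform row subsample
    without replacement.  Illustrative sketch, not optimized.
    """
    n = X.shape[0]
    assert n & (n - 1) == 0, "Walsh-Hadamard requires n to be a power of two"
    # Build the Hadamard matrix by Sylvester's recursion, then normalize.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    H /= np.sqrt(n)
    d = rng.choice([-1.0, 1.0], size=n)                # D as a sign vector
    rows = rng.choice(n, size=n_subs, replace=False)   # R
    scale = np.sqrt(n / n_subs)
    HDX = H @ (d[:, None] * X)                         # H D X
    HDY = H @ (d * Y)                                  # H D Y
    return scale * HDX[rows], scale * HDY[rows]
```

Since HD is orthogonal, taking nsubs = n recovers the original Gram matrix exactly; for nsubs < n it yields the preconditioned subsample used by the fixed-design variants below.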
The algorithms for the fixed design are exactly the same as below, except that Xsubs, Ysubs are replaced by Θ·X, Θ·Y and Xrem, Yrem by Θrem·X, Θrem·Y, where Θ is the SRHT matrix defined in the previous section and Θrem is the same as Θ except that R is of size nrem × n. Still, for the sake of completeness, the algorithms are described in detail in the Supplementary material.

Full Subsampling (FS): Full subsampling provides a baseline for comparison; in it we simply r-subsample (X, Y) as (Xsubs, Ysubs) and use the subsampled data to estimate both the ΣXX and ΣXY covariance matrices.

Covariance Subsampling (CovS): In Covariance Subsampling we r-subsample X as Xsubs only to estimate the ΣXX covariance matrix; we use all n observations to compute the ΣXY covariance matrix.

Uluru: Uluru² is a two stage fitting algorithm. In the first stage it uses the r-subsampled (X, Y) to get an initial estimate of ŵ (i.e., ŵFS) via the Full Subsampling (FS) algorithm. In the second stage it uses the remaining data (Xrem, Yrem) to estimate the bias of the first stage estimator, wcorrect = w0 − ŵFS. The final estimate (ŵUluru) is taken to be a weighted combination (generally just the sum) of the FS estimator and the second stage estimator (ŵcorrect). Uluru is described in Algorithm 1.

In the second stage, since ŵFS is known, on the remaining data we have Yrem = Xrem w0 + εrem, hence

Rrem = Yrem − Xrem·ŵFS = Xrem(w0 − ŵFS) + εrem.

The above formula shows that we can estimate wcorrect = w0 − ŵFS with another regression, i.e. ŵcorrect = (X⊤rem Xrem)⁻¹ X⊤rem Rrem. Since computing X⊤rem Xrem takes too many FLOPS, we use X⊤subs Xsubs instead (which has already been computed). Finally we add ŵcorrect to ŵFS to get ŵUluru. The estimate ŵcorrect can be seen as an almost unbiased estimate of the error w0 − ŵFS, so we correct almost all the error, hence reducing the bias.

Algorithm 1: Uluru Algorithm
Input: X, Y, r
Output: ŵ
ŵFS = (X⊤subs Xsubs)⁻¹ X⊤subs Ysubs;
Rrem = Yrem − Xrem·ŵFS;
ŵcorrect = (nsubs/nrem) · (X⊤subs Xsubs)⁻¹ X⊤rem Rrem;
ŵUluru = ŵFS + ŵcorrect;
return ŵ = ŵUluru;

4 Theory

In this section we provide the theoretical guarantees of the three algorithms discussed in the previous sections, in the fixed as well as the random design setting. All the theorems assume the OLS setting of Section 2. Without loss of generality we assume that X is whitened, i.e. ΣXX = Ip (see the Supplementary Material for justification). In both cases we bound the square root of the Mean Squared Error, which becomes ‖w0 − ŵ‖, as described in Section 2.

4.1 Fixed Design Setting

Here we assume preconditioning and subsampling with SRHT as described in the previous sections. (Please see the Supplementary Material for all the proofs.)

Theorem 1 Assume X ∈ R^{n×p} and X⊤X = n·Ip. 
Let Y = Xw0 + ε where ε ∈ R^{n×1} is i.i.d. Gaussian noise with standard deviation σ. If we use algorithm FS, then with failure probability at most 2n/e^p + 2δ,

‖w0 − ŵFS‖ ≤ Cσ √( ln(nr + 1/δ) · p/(nr) ).   (1)

Theorem 2 Assuming our data comes from the same model as in Theorem 1 and we use CovS, then with failure probability at most 3δ + 3n/e^p,

‖w0 − ŵCovS‖ ≤ (1 − r) ( C1 √( ln(2p/δ) · p/(nr) ) + C2 √( ln(2p/δ) · p/(n(1 − r)) ) ) ‖w0‖ + C3 σ √( log(n + 1/δ) · p/n ).   (2)

²Uluru is a rock that is shaped like a quadratic and is solid. So, if your estimate of the quadratic term is as solid as Uluru, you do not need to use more data to make it more accurate.

Theorem 3 Assuming our data comes from the same model as in Theorem 1 and we use Uluru, then with failure probability at most 5δ + 5n/e^p,

‖w0 − ŵUluru‖ ≤ σ √( ln(nr + 1/δ) · p/(nr) ) · ( C1 √( ln(2p/δ) · p/(nr) ) + C2 √( ln(2p/δ) · p/(n(1 − r)) ) ) + σ C3 √( ln(n(1 − r) + 1/δ) · p/(n(1 − r)) ).

Remark 1 The probability n/e^p becomes really small for large p, hence it can be ignored and the ln terms can be viewed as constants. Let us consider the case nsubs ≪ nrem, since only in this situation does subsampling reduce computational cost significantly. 
Then, keeping only the dominating terms, the results of the above three theorems can be summarized as follows: with failure probability less than some fixed number, the error of the FS algorithm is O(σ√(p/(nr))), the error of the CovS algorithm is O(√(p/(nr))·‖w0‖ + σ√(p/n)), and the error of the Uluru algorithm is O(σ·p/(nr) + σ√(p/n)).

4.2 Sub-gaussian Random Design Setting

4.2.1 Definitions

The following two definitions from [10] characterize what it means to be sub-gaussian.

Definition 1 A random variable X is sub-gaussian with sub-gaussian norm ‖X‖ψ2 if and only if

(E|X|^p)^{1/p} ≤ ‖X‖ψ2 √p  for all p ≥ 1.   (3)

Here ‖X‖ψ2 is the minimal constant for which the above condition holds.

Definition 2 A random vector X ∈ R^n is sub-gaussian if the one dimensional marginals x⊤X are sub-gaussian for all x ∈ R^n. The sub-gaussian norm of the random vector X is defined as

‖X‖ψ2 = sup_{‖x‖2 = 1} ‖x⊤X‖ψ2.   (4)

Remark 2 Since the sum of two sub-gaussian variables is sub-gaussian, it is easy to conclude that a random vector X = (X1, ..., Xp)⊤ is a sub-gaussian random vector when the components X1, ..., Xp are sub-gaussian variables.

4.2.2 Sub-gaussian Bounds

Under the assumption that the rows of the design matrix X are i.i.d. draws from a p dimensional sub-Gaussian random vector X with ΣXX = Ip, we have the following bounds (please see the Supplementary Material for all the proofs):

Theorem 4 If we use the FS algorithm, then with failure probability at most δ,

‖w0 − ŵFS‖ ≤ Cσ √( p·ln(2p/δ) / (nr) ).   (5)

Theorem 5 If we use the CovS algorithm, then with failure probability at most δ,

‖w0 − ŵCovS‖ ≤ (1 − r) ( C1 √(p/(n·r)) + C2 √(p/(n(1 − r))) ) ‖w0‖ + C3 σ √( p·ln(2(p + 2)/δ) / n ).   (6)

Theorem 6 If we use Uluru, then with failure probability at most δ,

‖w0 − ŵUluru‖ ≤ C1 σ √( p·ln(2(2p + 2)/δ) / (n·r) ) · ( C2 √(p/(n·r)) + C3 √(p/((1 − r)·n)) ) + C4 σ √( p·ln(2(2p + 2)/δ) / ((1 − r)·n) ).

Remark 3 Here also the ln terms can be viewed as constants. Consider the case r ≪ 1, since this is the only case where subsampling reduces computational cost significantly. Keeping only the dominating terms, the results of the above three theorems can be summarized as follows: with failure probability less than some fixed number, the error of the FS algorithm is O(σ√(p/(rn))), the error of the CovS algorithm is O(√(p/(rn))·‖w0‖ + σ√(p/n)), and the error of the Uluru algorithm is O(σ·p/(rn) + σ√(p/n)). These errors are exactly the same as in the fixed design case.

4.3 Discussion

We can make a few salient observations from the error expressions for the algorithms presented in Remarks 1 & 3.

The second term in the error of the Uluru algorithm does not contain r at all. If it is the dominating term, which is the case if

r > O(√(p/n)),   (7)

then the error of Uluru is approximately O(σ√(p/n)), which is completely independent of r. Thus, if r is not too small (i.e., when Eq. 7 holds), the error bound for Uluru is not a function of r. In other words, when Eq. 7 holds, we do not increase the error by using less data in estimating the covariance matrix in Uluru. The FS algorithm does not have this property, since its error is proportional to 1/√r.

Similarly, for the CovS algorithm, when

r > O(‖w0‖²/σ²),   (8)

the second term dominates and we can conclude that the error does not change with r. However, Eq. 8 depends on how large the standard deviation σ of the noise is. We can assume ‖w0‖² = O(p) since w0 is p dimensional. Hence if σ ≤ O(√p), Eq. 8 fails, since it implies r > O(1), and the error bound of the CovS algorithm then varies with r.

To sum this up, Uluru has the nice property that its error bound does not increase as r gets smaller, as long as r is greater than a threshold. This threshold is completely independent of how noisy the data is and depends only on the characteristics of the design/data matrix (n, p).

4.4 Run Time complexity

Table 1 summarizes the run time complexity and the theoretically predicted error bounds for all the methods. We use these theoretical run times (FLOPS) in our plots.

Methods      | Running Time O(FLOPS)                      | Error bound
OLS          | O(n p²)                                    | O(√(p/n))
FS           | O(nsubs p²)                                | O(√(p/nsubs))
CovS         | O(nsubs p² + n p)                          | *
Uluru        | O(nsubs p² + n p)                          | O(√(p/n))
SRHT-FS      | O(max(n p log(p), nsubs p²))               | O(√(p/nsubs))
SRHT-CovS    | O(max(n p log(p), nsubs p² + n p))         | *
SRHT-Uluru   | O(max(n p log(p), nsubs p² + n p))         | O(√(p/n))

Table 1: Runtime complexity. nsubs is the number of observations in the subsample, n is the number of observations, and p is the number of predictors. * indicates that no uniform error bounds are known.

5 Experiments

In this section we elucidate the relative merits of our methods by comparing their empirical performance on both synthetic and real world datasets.

5.1 Methodology

We compare our algorithms by allowing each about O(np) CPU time (ignoring log factors). This is of the same order as the time it takes to read the data. Our target accuracy is √(p/n), namely what a full least squares algorithm would generate. We will assume n ≫ p. The subsample size, nsubs, for FS should be O(n/p) to keep the CPU time O(np), which leads to an accuracy of √(p²/n). For the CovS method, the accuracy depends on how noisy our data is (i.e. how big σ is). When σ is large, it performs as well as √(p/n), which is the same as full least squares. When σ is small, it performs as poorly as √(p²/n). For Uluru, to keep the CPU time O(np), nsubs should be O(n/p), or equivalently r = O(1/p). As stated in the discussion after the theorems, when r ≥ O(√(p/n)) (in this setup we want r = O(1/p), which implies n ≥ O(p³)), Uluru has error bound O(√(p/n)) no matter what signal to noise ratio the problem has.

5.2 Synthetic Datasets

We generated synthetic data by distributing the signal uniformly across all the p singular values, picking the p singular values to be λi = 1/i², i = 1 : p, and further varying the amount of signal.

5.3 Real World Datasets

We also compared the performance of the algorithms on two UCI datasets³: CPUSMALL (n=8192, p=12) and CADATA (n=20640, p=8), and the PERMA sentiment analysis dataset described in [11] (n=1505, p=30), which uses LR-MVL word embeddings [12] as features.⁴

5.4 Results

The results for synthetic data are shown in Figure 1 (top row) and for the real world datasets in Figure 1 (bottom row). To generate the plots, we vary the amount of data used in the subsampling, nsubs, from 1.1p to n. For FS, this simply means using a fraction of the data; for CovS and Uluru, only the data for the covariance matrix is subsampled. We report the Mean Squared Error (MSE), which in the case of squared loss is the same as the risk, as was described in Section 2. 
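To make these comparisons concrete, the three estimators of Section 3 and the reported risk can be sketched as follows (a hypothetical numpy sketch of Algorithm 1 and its baselines; function names are ours, not from the paper's code):

```python
import numpy as np

def fs(X_s, Y_s):
    # Full Subsampling: both covariance estimates come from the subsample.
    return np.linalg.solve(X_s.T @ X_s, X_s.T @ Y_s)

def covs(X, Y, X_s):
    # Covariance Subsampling: Sigma_XX from the subsample, Sigma_XY from all data.
    n, n_s = X.shape[0], X_s.shape[0]
    return np.linalg.solve(X_s.T @ X_s / n_s, X.T @ Y / n)

def uluru(X_s, Y_s, X_r, Y_r):
    # Two-stage fit (Algorithm 1): FS estimate, then a bias correction from the
    # remaining data, reusing (X_s^T X_s)^{-1} instead of (X_r^T X_r)^{-1}.
    n_s, n_r = X_s.shape[0], X_r.shape[0]
    G = X_s.T @ X_s
    w_fs = np.linalg.solve(G, X_s.T @ Y_s)
    R = Y_r - X_r @ w_fs                              # second-stage residuals
    w_corr = (n_s / n_r) * np.linalg.solve(G, X_r.T @ R)
    return w_fs + w_corr

def risk(w0, w, Sigma=None):
    # (w0 - w)^T Sigma_XX (w0 - w); Sigma_XX = I_p for whitened X.
    d = w0 - w
    return d @ d if Sigma is None else d @ Sigma @ d
```

Note that in the noiseless case FS and Uluru recover w0 exactly (the residuals vanish), whereas CovS retains a deviation of order √(p/nsubs)·‖w0‖, matching the ‖w0‖ term in Theorem 5.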
For the real datasets we do not know the true population parameter, w0, so we replace it with its consistent estimator wMLE, which is computed using standard OLS on the entire dataset.

The horizontal gray line in the figures is the overfitting point; it is the error generated by a ŵ vector of all zeros. The vertical gray line is the n·p point; thus anything which is faster than that must look at only some of the data.

Looking at the results, we can see two trends for the synthetic data. Firstly, our algorithms with no preconditioning are much faster than their counterparts with preconditioning and give similar accuracies. Secondly, as we had expected, CovS performs best in the high noise setting, being slightly better than Uluru, and Uluru is significantly better in the low noise setting. For the real world datasets also, Uluru is almost always better than the other algorithms, both with and without preconditioning. As earlier, the preconditioned alternatives are slower.

³http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html
⁴We also compared our approaches against coordinate ascent methods from [13], and our algorithms outperform them. Due to paucity of space we relegated that comparison to the supplementary material.

Figure 1: Results for synthetic datasets (n=4096, p=8) in the top row and for (PERMA, CPUSMALL, CADATA, left to right) in the bottom row. The three columns in the top row have different amounts of signal: 2, √p and √(n/p), respectively. In all settings, we varied the amount of subsampling from 1.1p to n in multiples of 2. Color scheme: + (Green) FS, + (Blue) CovS, + (Red) Uluru. The solid lines indicate no preconditioning (i.e. random design) and dashed lines indicate fixed design with Randomized Hadamard preconditioning. The FLOPS reported are the theoretical values (see Supp. material); the actual values were noisy due to varying load settings on the CPUs.

6 Related Work

The work that comes closest to ours is the set of approaches which precondition the matrix by either the Subsampled Randomized Hadamard Transform (SRHT) [1, 4] or the Subsampled Randomized Fourier Transform (SRFT) [2, 3] before subsampling uniformly from the resulting transformed matrix. However, this line of work differs from our work in several ways. They carry out their analysis in a mathematical setup, i.e. solving an overdetermined linear system (ŵ = arg min_{w∈R^p} ‖Xw − Y‖²), while we work in a statistical setup (a regression problem Y = Xβ + ε), which leads to a different error analysis.

Our FS algorithm is essentially the same as the subsampling algorithm proposed by [4]. However, our theoretical analysis of it is novel, and furthermore they only consider it in the fixed design setting with Hadamard preconditioning. CovS and Uluru are entirely new algorithms and, as we have seen, differ from FS in a key sense, namely that CovS and Uluru make use of all the data while FS uses only a small proportion of it.

7 Conclusion

In this paper we proposed three subsampling methods for faster least squares regression. All three run in O(size of input) = O(np). Our best method, Uluru, gave an error bound which is independent of the amount of subsampling as long as it is above a threshold. Furthermore, we argued that for problems arising from linear regression, the Randomized Hadamard transformation is often not needed. In linear regression, observations are generally i.i.d. If one further assumes that they are sub-Gaussian (perhaps as a result of a preprocessing step, or simply because they are 0/1 or Gaussian), then subsampling methods without a Randomized Hadamard transformation suffice. 
As shown in our experiments, dropping the Randomized Hadamard transformation significantly speeds up the algorithms and, in i.i.d. sub-Gaussian settings, does so without loss of accuracy.

References

[1] Boutsidis, C., Gittens, A.: Improved matrix algorithms via the subsampled randomized Hadamard transform. CoRR abs/1204.0062 (2012)

[2] Tygert, M.: A fast algorithm for computing minimal-norm solutions to underdetermined systems of linear equations. CoRR abs/0905.4745 (2009)

[3] Rokhlin, V., Tygert, M.: A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences 105(36) (September 2008) 13212-13217

[4] Drineas, P., Mahoney, M.W., Muthukrishnan, S., Sarlós, T.: Faster least squares approximation. CoRR abs/0710.1435 (2007)

[5] Mahoney, M.W.: Randomized algorithms for matrices and data. (April 2011)

[6] Ailon, N., Chazelle, B.: Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In: STOC. (2006) 557-563

[7] Avron, H., Maymounkov, P., Toledo, S.: Blendenpik: Supercharging LAPACK's least-squares solver. SIAM J. Sci. Comput. 32(3) (April 2010) 1217-1236

[8] Vershynin, R.: How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability 25(3) (September 2012) 655-686

[9] Golub, G.H., Van Loan, C.F.: Matrix Computations (Johns Hopkins Studies in Mathematical Sciences). 3rd edn. 
The Johns Hopkins University Press (October 1996)

[10] Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. CoRR abs/1011.3027 (2010)

[11] Dhillon, P.S., Rodu, J., Foster, D., Ungar, L.: Two step CCA: A new spectral method for estimating vector models of words. In: Proceedings of the 29th International Conference on Machine Learning. ICML'12 (2012)

[12] Dhillon, P.S., Foster, D., Ungar, L.: Multi-view learning of word embeddings via CCA. In: Advances in Neural Information Processing Systems (NIPS). Volume 24. (2011)

[13] Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. CoRR abs/1209.1873 (2012)
", "award": [], "sourceid": 247, "authors": [{"given_name": "Paramveer", "family_name": "Dhillon", "institution": "University of Pennsylvania"}, {"given_name": "Yichao", "family_name": "Lu", "institution": "University of Pennsylvania"}, {"given_name": "Dean", "family_name": "Foster", "institution": "University of Pennsylvania"}, {"given_name": "Lyle", "family_name": "Ungar", "institution": "University of Pennsylvania"}]}