{"title": "Faster Ridge Regression via the Subsampled Randomized Hadamard Transform", "book": "Advances in Neural Information Processing Systems", "page_first": 369, "page_last": 377, "abstract": "We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations ($p \\gg n$). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of $O(n^2p)$. Our algorithm (SRHT-DRR) runs in time $O(np\\log(n))$ and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets.", "full_text": "Faster Ridge Regression via the Subsampled Randomized Hadamard Transform

Yichao Lu¹, Paramveer S. Dhillon², Dean Foster¹, Lyle Ungar²
¹Statistics (Wharton School), ²Computer & Information Science
University of Pennsylvania, Philadelphia, PA, U.S.A.
{dhillon|ungar}@cis.upenn.edu, foster@wharton.upenn.edu, yichaolu@sas.upenn.edu

Abstract

We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n²p). Our algorithm, Subsampled Randomized Hadamard Transform Dual Ridge Regression (SRHT-DRR), runs in time O(np log(n)) and works by preconditioning the design matrix with a Randomized Walsh-Hadamard Transform and a subsequent subsampling of features.
We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets.

1 Introduction

Ridge Regression (RR), which penalizes the ℓ₂ norm of the weight vector and shrinks it towards zero, is the most widely used penalized regression method. It is of particular interest in the p > n case (p is the number of features and n is the number of observations), since standard ordinary least squares (OLS) regression breaks down in this setting. The setting is even more relevant in today's age of 'Big Data', where it is common to have p ≫ n. Efficient algorithms for solving ridge regression are therefore highly desirable.

The current method of choice for efficiently solving RR is that of [19], which works in the dual space and has a running time of O(n²p); this can be slow for huge p. As the runtime suggests, the bottleneck is the computation of XX⊤, where X is the design matrix. An obvious way to speed up the algorithm is to subsample the columns of X. For example, suppose X has rank k; if we randomly subsample psubs of the p features (k < psubs ≪ p), then the matrix multiplication can be performed in O(n²psubs) time, which is very fast. However, this speed-up comes with a big caveat: if all the signal in the problem were carried in just one of the p features, and we missed this feature while sampling, we would miss all the signal.

A parallel and recently popular line of research for solving large-scale regression involves random projections, for instance transforming the data with a randomized Hadamard transform [1] or Fourier transform, uniformly sampling observations from the resulting transformed matrix, and estimating OLS on this smaller data set. The intuition behind this approach is that these frequency-domain transformations uniformize the data and smear the signal across all the observations, so that there are no longer any high-leverage points whose omission could unduly influence the parameter estimates. Hence, uniform sampling in the transformed space suffices. This approach can also be viewed as preconditioning the design matrix with a carefully constructed, data-independent random matrix. The transformation followed by subsampling has been used in a variety of variants, including the Subsampled Randomized Hadamard Transform (SRHT) [4, 6] and the Subsampled Randomized Fourier Transform (SRFT) [22, 17].

In this paper, we build on the above line of research and provide a fast algorithm for ridge regression which applies a Randomized Hadamard Transform to the columns of the X matrix and then samples psubs = O(n) columns. This allows the bottleneck matrix multiplication in dual RR to be computed in O(np log(n)) time, so we call our algorithm Subsampled Randomized Hadamard Transform Dual Ridge Regression (SRHT-DRR).

In addition to being computationally efficient, we also prove that in the fixed design setting SRHT-DRR only increases the risk by a factor of (1 + C√(k/psubs)) (where k is the rank of the data matrix) w.r.t. the true RR solution.

1.1 Related Work

Using randomized algorithms to handle large matrices is an active area of research and has been used in a variety of setups. Most of these algorithms involve a step that randomly projects the original large matrix down to lower dimensions [9, 16, 8]. [14] uses a matrix of i.i.d. Gaussian elements to construct a preconditioner for least squares which makes the problem well conditioned. However, computing such a random projection is still expensive, as it requires multiplying a huge data matrix by another random dense matrix. [18] introduced the idea of using structured random projections to make matrix multiplication substantially faster.

Recently, several randomized algorithms have been developed for kernel approximation. [3] provided a fast method for low-rank kernel approximation by randomly selecting q samples to construct a rank-q approximation of the original kernel matrix; this reduces the cost to O(nq²). [15] introduced a random sampling scheme to approximate shift-invariant kernels, and [12] accelerates [15] by applying the Walsh-Hadamard transform. Although our paper and these papers can all be understood from a kernel approximation point of view, we work in the p ≫ n ≫ 1 case, while they focus on large n.

It is also worth distinguishing our setup from standard kernel learning. Kernel methods let learning models take into account a much richer feature space than the original one, while computing inner products in this high dimensional space efficiently. In our p ≫ n ≫ 1 setup, we already have a rich enough feature space, and it suffices to consider the linear kernel XX⊤.¹ Therefore, in this paper we propose a randomized scheme to reduce the dimension of X and accelerate the computation of XX⊤.

2 Faster Ridge Regression via SRHT

In this section we first review the traditional dual solution of RR and its computational cost, and then introduce our algorithm SRHT-DRR for faster estimation of RR.

2.1 Ridge Regression

Let X be the n × p design matrix containing n i.i.d. samples of the p-dimensional independent variable (a.k.a. "covariates" or "predictors") X, such that p ≫ n. Y is the real-valued n × 1 response vector containing the n corresponding values of the dependent variable Y. ε is the n × 1 homoskedastic noise vector with common variance σ².
Let β̂λ be the solution of the RR problem, i.e.

    β̂λ = argmin over β ∈ ℝ^{p×1} of (1/n)‖Y − Xβ‖² + λ‖β‖²    (1)

The solution to Equation (1) is β̂λ = (X⊤X + nλIp)⁻¹X⊤Y. The step that dominates the computational cost is the matrix inversion, which takes O(p³) FLOPS and will be extremely slow when p ≫ n ≫ 1. A straightforward improvement is to solve Equation (1) in the dual space. By the change of variables β = X⊤α, with α ∈ ℝ^{n×1}, and further letting K = XX⊤, the optimization problem becomes

    α̂λ = argmin over α ∈ ℝ^{n×1} of (1/n)‖Y − Kα‖² + λα⊤Kα    (2)

and the solution is α̂λ = (K + nλIn)⁻¹Y, which directly gives β̂λ = X⊤α̂λ. Please see [19] for a detailed derivation of this dual solution. In the p ≫ n case the step that dominates the computational cost of the dual solution is computing the linear kernel matrix K = XX⊤, which takes O(n²p) FLOPS. This is regarded as the computational cost of the true RR solution in our setup.

¹For this reason, it is standard in natural language processing applications to just use linear kernels.

Since our algorithm SRHT-DRR uses the Subsampled Randomized Hadamard Transform (SRHT), some introduction to the SRHT is warranted.

2.2 Definition and Properties of SRHT

Following [20], for p = 2^q where q is any positive integer, an SRHT can be defined as a psubs × p (p > psubs) matrix of the form

    Θ = √(p/psubs) RHD

where

• R is a random psubs × p matrix whose rows are psubs uniform samples (without replacement) from the standard basis of ℝ^p.
• H ∈ ℝ^{p×p} is a normalized Walsh-Hadamard matrix. The Walsh-Hadamard matrix of size p × p is defined recursively: Hp = [Hp/2, Hp/2; Hp/2, −Hp/2], with H₂ = [+1, +1; +1, −1]. H = (1/√p)Hp is a rescaled version of Hp.
• D is a p × p diagonal matrix whose diagonal elements are i.i.d. Rademacher random variables.

There are two key features that make the SRHT a nice candidate for accelerating RR when p ≫ n. Firstly, due to the recursive structure of the H matrix, it takes only O(p log(psubs)) FLOPS to compute Θv for a generic dense p × 1 vector v, whereas for an arbitrary unstructured psubs × p dense matrix A, computing Av costs O(psubs p) FLOPS. Secondly, after projecting any matrix W ∈ ℝ^{p×k} with orthonormal columns down to low dimensions with the SRHT, the columns of ΘW ∈ ℝ^{psubs×k} are still approximately orthonormal. The following lemma characterizes this property:

Lemma 1. Let W be a p × k (p > k) matrix with W⊤W = Ik. Let Θ be a psubs × p SRHT matrix with p > psubs > k. Then with probability at least 1 − (δ + p/e^k),

    ‖(ΘW)⊤ΘW − Ik‖₂ ≤ √(c k log(2k/δ)/psubs)    (3)

The bound is in terms of the spectral norm of the matrix. The proof of this lemma is in the Appendix; the tools for the random-matrix-theory part of the proof come from [20] and [21].
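Lemma 1 is easy to check numerically. The following is a minimal NumPy/SciPy sketch (our own illustration, not the authors' code): the helper name srht, the RNG seed, and the sizes p = 1024, k = 20, psubs = 512 are arbitrary choices, and a dense Hadamard matrix stands in for a fast Walsh-Hadamard transform purely for clarity.

```python
import numpy as np
from scipy.linalg import hadamard

def srht(X, p_subs, rng):
    """Return X @ Theta.T for Theta = sqrt(p/p_subs) * R H D as defined above.

    X has shape (n, p) with p a power of 2. A dense normalized Hadamard
    matrix is used for clarity; a fast Walsh-Hadamard transform would
    bring the cost down to O(n p log(p_subs))."""
    n, p = X.shape
    D = rng.choice([-1.0, 1.0], size=p)               # Rademacher diagonal of D
    H = hadamard(p) / np.sqrt(p)                      # normalized Walsh-Hadamard H
    rows = rng.choice(p, size=p_subs, replace=False)  # R: uniform rows, no replacement
    return np.sqrt(p / p_subs) * ((X * D) @ H)[:, rows]

rng = np.random.default_rng(0)
p, k, p_subs = 1024, 20, 512
W, _ = np.linalg.qr(rng.standard_normal((p, k)))   # p x k with orthonormal columns
TW = srht(W.T, p_subs, rng).T                      # Theta @ W, shape (p_subs, k)
# Spectral-norm deviation of (Theta W)^T (Theta W) from I_k, as in Equation (3).
err = np.linalg.norm(TW.T @ TW - np.eye(k), 2)
print("spectral deviation:", round(err, 3))
```

On typical draws the deviation is a small fraction of 1, consistent with the √(k log(2k/δ)/psubs) scaling of the bound; increasing psubs shrinks it at roughly the square-root rate.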
[10] also provided similar results.

2.3 The Algorithm

Our fast algorithm, SRHT-DRR, is described below:

SRHT-DRR
Input: dataset X ∈ ℝ^{n×p}, response Y ∈ ℝ^{n×1}, and subsampling size psubs.
Output: the weight parameter βH,λ ∈ ℝ^{psubs×1}.

• Compute the SRHT of the data: XH = XΘ⊤.
• Compute KH = XH XH⊤.
• Compute αH,λ = (KH + nλIn)⁻¹Y, which is the solution of Equation (2) with K replaced by KH.
• Compute βH,λ = XH⊤ αH,λ.

Since the SRHT is only defined for p = 2^q with integer q, if the dimension p is not a power of 2 we can concatenate a block of zeros to the feature matrix X to make the dimension a power of 2.

Remark 1. Let us look at the computational cost of SRHT-DRR. Computing XH takes O(np log(psubs)) FLOPS [2, 6]. Once we have XH, computing αH,λ costs O(n²psubs) FLOPS, the dominating step being the computation of KH = XH XH⊤. So the total computational cost for computing αH,λ is O(np log(psubs) + n²psubs), compared with O(n²p) for the true RR. We discuss how large psubs should be after stating the main theorem.

3 Theory

In this section we bound the risk of SRHT-DRR and compare it with the risk of the true dual ridge estimator in the fixed design setting. As earlier, let X be an arbitrary n × p design matrix such that p ≫ n, and let Y = Xβ + ε, where ε is the n × 1 homoskedastic noise vector with common mean 0 and variance σ². [5] and [3] did similar analyses of the risk of RR under similar fixed design setups.

Firstly, we provide a corollary to Lemma 1 which will be helpful in the subsequent theory.

Corollary 1. Let k be the rank of X. 
With probability at least 1 − (δ + p/e^k),

    (1 − Δ)K ≼ KH ≼ (1 + Δ)K    (4)

where Δ = C√(k log(2k/δ)/psubs). (For p.s.d. matrices, G ≽ L means G − L is p.s.d.)

Proof. Let X = UDV⊤ be the SVD of X, where U ∈ ℝ^{n×k} and V ∈ ℝ^{p×k} have orthonormal columns and D ∈ ℝ^{k×k} is diagonal. Then KH = UD(V⊤Θ⊤ΘV)DU⊤. Lemma 1 directly implies (1 − Δ)Ik ≼ V⊤Θ⊤ΘV ≼ (1 + Δ)Ik with probability at least 1 − (δ + p/e^k). Left-multiplying by UD and right-multiplying by DU⊤ completes the proof.

3.1 Risk Function for Ridge Regression

Let Z = E_ε(Y) = Xβ. The risk of any prediction Ŷ ∈ ℝ^{n×1} is (1/n)E_ε‖Ŷ − Z‖². For any n × n symmetric positive definite matrix M, define the following risk function:

    R(M) = (σ²/n)Tr[M²(M + nλIn)⁻²] + nλ²Z⊤(M + nλIn)⁻²Z    (5)

Lemma 2. Under the fixed design setting, the risk of the true RR solution is R(K) and the risk of SRHT-DRR is R(KH).

Proof. The risk of the SRHT-DRR estimator is

    (1/n)E_ε‖KH αH,λ − Z‖²
      = (1/n)E_ε‖KH(KH + nλIn)⁻¹Y − Z‖²
      = (1/n)E_ε‖KH(KH + nλIn)⁻¹Y − E_ε[KH(KH + nλIn)⁻¹Y]‖² + (1/n)‖E_ε[KH(KH + nλIn)⁻¹Y] − Z‖²
      = (1/n)E_ε‖KH(KH + nλIn)⁻¹ε‖² + (1/n)‖KH(KH + nλIn)⁻¹Z − Z‖²
      = (σ²/n)Tr[KH²(KH + nλIn)⁻²] + (1/n)Z⊤(In − KH(KH + nλIn)⁻¹)²Z
      = (σ²/n)Tr[KH²(KH + nλIn)⁻²] + nλ²Z⊤(KH + nλIn)⁻²Z    (6)

where the last step uses In − KH(KH + nλIn)⁻¹ = nλ(KH + nλIn)⁻¹. Note that the expectation here is only over the random noise ε and is conditional on the Randomized Hadamard Transform. The calculation is the same for the ordinary estimator. In the risk function, the first term is the variance and the second term is the bias.

3.2 Risk Inflation Bound

The following theorem bounds the risk inflation of SRHT-DRR compared with the true RR solution.

Theorem 1. Let k be the rank of the X matrix. With probability at least 1 − (δ + p/e^k),

    R(KH) ≤ (1 − Δ)⁻²R(K)    (7)

where Δ = C√(k log(2k/δ)/psubs).

Proof. Define

    B(M) = nλ²Z⊤(M + nλIn)⁻²Z,    V(M) = (σ²/n)Tr[M²(M + nλIn)⁻²]

for any p.s.d. matrix M ∈ ℝ^{n×n}, so that R(M) = V(M) + B(M). By [3], B(M) is non-increasing in M and V(M) is non-decreasing in M. When Equation (4) holds,

    R(KH) = V(KH) + B(KH)
          ≤ V((1 + Δ)K) + B((1 − Δ)K)
          ≤ (1 + Δ)²V(K) + (1 − Δ)⁻²B(K)
          ≤ (1 − Δ)⁻²(V(K) + B(K))
          = (1 − Δ)⁻²R(K)

Remark 2. Theorem 1 gives us an idea of how large psubs should be. 
Assuming Δ (the risk inflation ratio) is fixed, we get psubs = C k log(2k/δ)/Δ² = O(k). If we further assume that X is full rank, i.e. k = n, then it suffices to choose psubs = O(n). Combining this with Remark 1, we see that the cost of computing XH is O(np log(n)). Hence, under the ideal setup where p is huge, so that the dominating step of SRHT-DRR is computing XH, the computational cost of SRHT-DRR is O(np log(n)) FLOPS.

Comparison with PCA. Another way to handle high-dimensional features is to use PCA and run the regression only on the top few principal components (a procedure called PCR), as illustrated by [13] and many other papers. RR falls in the family of "shrinkage" estimators, as it shrinks the weight parameter towards zero; PCA, on the other hand, is a "keep-or-kill" estimator, as it kills components with smaller eigenvalues. Recently, [5] showed that the risks of PCR and RR are related, and that the risk of PCR is bounded by four times the risk of RR. However, we believe PCR and RR are parallel approaches, and one can be better than the other depending on the structure of the problem, so it is hard to compare SRHT-DRR with PCR theoretically.

Moreover, PCA under our p ≫ n ≫ 1 setup is itself a non-trivial problem, both statistically and computationally. Firstly, in the p ≫ n case we do not have enough samples to estimate the huge p × p covariance matrix, so the eigenvectors of the sample covariance matrix obtained by PCA may be very different from the truth. (See [11] for a theoretical study of the consistency of the principal directions in the high-p, low-n case.) Secondly, PCA requires computing an SVD of the X matrix, which is extremely slow when p ≫ n ≫ 1. An alternative is to use a randomized algorithm such as [16] or [9] to compute PCA. 
Again, whether randomized PCA is better than our SRHT-DRR algorithm depends on the problem. With that in mind, we compare SRHT-DRR against standard as well as randomized PCA in our experiments section; we find that SRHT-DRR beats both of them in speed as well as accuracy.

4 Experiments

In this section we show experimental results on synthetic as well as real-world data highlighting the merits of SRHT-DRR, namely lower computational cost than the true Ridge Regression (RR) solution without any significant loss of accuracy. We also compare our approach against "standard" PCA as well as randomized PCA [16]. In all our experiments, we choose the regularization constant λ via cross-validation on the training set. As far as the PCA algorithms are concerned, we implemented standard PCA using the built-in SVD function in MATLAB, and for randomized PCA we used the block power iteration approach proposed by [16]. We always achieved convergence within three power iterations of randomized PCA.

4.1 Measures of Performance

Since we know the true β which generated the synthetic data, we report MSE/Risk in the fixed design setting (they are equivalent for squared loss) as the measure of accuracy. It is computed as ‖Ŷ − Xβ‖², where Ŷ is the prediction of the method being compared. For real-world data we report the classification error on the test set.

In order to compare the computational cost of SRHT-DRR with true RR, we need to estimate the number of FLOPS used by each. As reported by other papers, e.g. [4, 6], the theoretical cost of applying the Randomized Hadamard Transform is O(np log(psubs)). However, the MATLAB implementation we used took about np log(p) FLOPS to compute XH. So, for SRHT-DRR, the total computational cost is np log(p) FLOPS for getting XH plus a further 2n²psubs FLOPS to compute KH. As mentioned earlier, the true dual RR solution takes ≈ 2n²p FLOPS. In our experiments we therefore report the relative computational cost, computed as the ratio of the two:

    Relative Computational Cost = (np log(p) + 2n²psubs) / (2n²p)

4.2 Synthetic Data

We generated synthetic data with p = 8192 and varied the number of observations, n = 20, 100, 200. We generated an n × n matrix R ∼ MVN(0, I), where MVN(μ, Σ) is the multivariate normal distribution with mean vector μ and variance-covariance matrix Σ, and βj ∼ N(0, 1) ∀j = 1, …, p. The final X matrix was generated by rotating R with a randomly generated n × p rotation matrix. Finally, we generated the Ys as Y = Xβ + ε, where εi ∼ N(0, 1) ∀i = 1, …, n.

Figure 1: Left to right, n = 20, 100, 200. The boxplots show the median error rates for SRHT-DRR for different psubs. The solid red line is the median error rate for the true RR using all the features. The green line is the median error rate for PCR when PCA is computed by SVD in MATLAB. The black dashed line is the median error rate for PCR when PCA is computed by randomized PCA.

For PCA and randomized PCA, we tried keeping r PCs in the range 10 to n and finally chose the value of r which gave the minimum error on the training set. We tried 10 different values of psubs, from n + 10 to 2000. All results were averaged over 50 random trials.

The results are shown in Figure 1. Two main things are worth noticing. Firstly, in all cases SRHT-DRR gets very close in accuracy to the true RR with only ≈ 30% of its computational cost; SRHT-DRR also costs far fewer FLOPS than randomized PCA in our experiments. Secondly, as mentioned earlier, RR and PCA are parallel approaches: either one might be better than the other depending on the structure of the problem.
As can be seen, for our data, the RR approaches are always better than the PCA-based approaches. We hypothesize that PCA might perform better relative to RR for larger n.

4.3 Real-world Data

We took the UCI ARCENE dataset, which has 200 samples with 10000 features, as our real-world dataset. ARCENE is a binary classification dataset consisting of 88 cancer individuals and 112 healthy individuals (see [7] for more details about this dataset). We split the dataset into 100 training and 100 testing samples and repeated this procedure 50 times (so n = 100, p = 10000 for this dataset). For PCA and randomized PCA, we tried keeping r = 10, 20, 30, 40, 50, 60, 70, 80, 90 PCs and finally chose the value of r which gave the minimum error on the training set (r = 30). As earlier, we tried 10 different values of psubs: 150, 250, 400, 600, 800, 1000, 1200, 1600, 2000, 2500. Standard PCA is known to be slow for datasets of this size, so the comparison with it is for accuracy only. Randomized PCA is fast but less accurate than standard ("true") PCA; its computational cost for r = 30 can be approximated as about 240np FLOPS (see [9] for details), which in this case is roughly the same as computing XX⊤ (≈ 2n²p).

The results are shown in Figure 2. As can be seen, SRHT-DRR comes very close in accuracy to the true RR solution with just ≈ 30% of its computational cost. SRHT-DRR beats PCA and randomized PCA even more comprehensively, achieving the same or better accuracy at just ≈ 18% of their computational cost.

5 Conclusion

In this paper we proposed a fast algorithm, SRHT-DRR, for ridge regression in the p ≫ n ≫ 1 setting. SRHT-DRR preconditions the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. In addition to being significantly faster than the true dual ridge regression solution, SRHT-DRR only inflates the risk w.r.t. the true solution by a small amount. Experiments on both synthetic and real data show that SRHT-DRR gives significant speed-ups with only a small loss of accuracy. We believe similar techniques can be developed for other statistical methods, such as logistic regression.

Figure 2: The boxplots show the median error rates for SRHT-DRR for different psubs. The solid red line is the median error rate for the true RR using all the features. The green line is the median error rate for PCR with the top 30 PCs when PCA is computed by SVD in MATLAB. The black dashed line is the median error rate for PCR with the top 30 PCs computed by randomized PCA.

References

[1] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, pages 557–563, 2006.
[2] Nir Ailon and Edo Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Technical report, 2007.
[3] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. CoRR, abs/1208.2015, 2012.
[4] Christos Boutsidis and Alex Gittens. Improved matrix algorithms via the subsampled randomized Hadamard transform. CoRR, abs/1204.0062, 2012.
[5] Paramveer S. Dhillon, Dean P. Foster, Sham M. Kakade, and Lyle H. Ungar. A risk comparison of ordinary least squares vs ridge regression. Journal of Machine Learning Research, 14:1505–1511, 2013.
[6] Petros Drineas, Michael W. Mahoney, S. Muthukrishnan, and Tamás Sarlós. Faster least squares approximation. CoRR, abs/0710.1435, 2007.
[7] Isabelle Guyon. Design of experiments for the NIPS 2003 variable selection benchmark. 2003.
[8] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, May 2011.
[9] Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. An algorithm for the principal component analysis of large data sets. SIAM J. Scientific Computing, 33(5):2580–2594, 2011.
[10] Daniel Hsu, Sham M. Kakade, and Tong Zhang. Analysis of a randomized approximation scheme for matrix multiplication. CoRR, abs/1211.5414, 2012.
[11] S. Jung and J. S. Marron. PCA consistency in high dimension, low sample size context. Annals of Statistics, 37:4104–4130, 2009.
[12] Quoc Le, Tamas Sarlos, and Alex Smola. Fastfood - approximating kernel expansions in loglinear time. ICML, 2013.
[13] W. F. Massy. Principal components regression in exploratory statistical research. Journal of the American Statistical Association, 60:234–256, 1965.
[14] Xiangrui Meng, Michael A. Saunders, and Michael W. Mahoney. LSRN: A parallel iterative solver for strongly over- or under-determined systems. CoRR, abs/1109.5981, 2011.
[15] Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2007.
[16] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal component analysis. SIAM J. Matrix Analysis Applications, 31(3):1100–1124, 2009.
[17] Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212–13217, September 2008.
[18] Tamas Sarlos. Improved approximation algorithms for large matrices via random projections. In Proc. 47th Annu. IEEE Sympos. Found. Comput. Sci., pages 143–152. IEEE Computer Society, 2006.
[19] G. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proc. 15th International Conf. on Machine Learning, pages 515–521. Morgan Kaufmann, San Francisco, CA, 1998.
[20] Joel A. Tropp. Improved analysis of the subsampled randomized Hadamard transform. CoRR, abs/1011.1595, 2010.
[21] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
[22] Mark Tygert. A fast algorithm for computing minimal-norm solutions to underdetermined systems of linear equations. CoRR, abs/0905.4745, 2009.
", "award": [], "sourceid": 248, "authors": [{"given_name": "Yichao", "family_name": "Lu", "institution": "University of Pennsylvania"}, {"given_name": "Paramveer", "family_name": "Dhillon", "institution": "University of Pennsylvania"}, {"given_name": "Dean", "family_name": "Foster", "institution": "University of Pennsylvania"}, {"given_name": "Lyle", "family_name": "Ungar", "institution": "University of Pennsylvania"}]}