{"title": "Sketching Structured Matrices for Faster Nonlinear Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2994, "page_last": 3002, "abstract": "Motivated by the desire to extend fast randomized techniques to nonlinear $l_p$ regression, we consider a class of structured regression problems. These problems involve Vandermonde matrices which arise naturally in various statistical modeling settings, including classical polynomial fitting problems and recently developed randomized techniques for scalable kernel methods. We show that this structure can be exploited to further accelerate the solution of the regression problem, achieving running times that are faster than input sparsity''. We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression.\"", "full_text": "Sketching Structured Matrices for\n\nFaster Nonlinear Regression\n\nHaim Avron\nVikas Sindhwani\nIBM T.J. Watson Research Center\n\nYorktown Heights, NY 10598\n\n{haimav,vsindhw}@us.ibm.com\n\nDavid P. Woodruff\n\nIBM Almaden Research Center\n\nSan Jose, CA 95120\n\ndpwoodru@us.ibm.com\n\nAbstract\n\nMotivated by the desire to extend fast randomized techniques to nonlinear lp re-\ngression, we consider a class of structured regression problems. These problems\ninvolve Vandermonde matrices which arise naturally in various statistical model-\ning settings, including classical polynomial \ufb01tting problems, additive models and\napproximations to recently developed randomized techniques for scalable kernel\nmethods. We show that this structure can be exploited to further accelerate the\nsolution of the regression problem, achieving running times that are faster than\n\u201cinput sparsity\u201d. 
We present empirical results confirming both the practical value of our modeling framework, as well as speedup benefits of randomized regression.

1 Introduction

Recent literature has advocated the use of randomization as a key algorithmic device with which to dramatically accelerate statistical learning with $l_p$ regression or low-rank matrix approximation techniques [12, 6, 8, 10]. Consider the following class of regression problems,

$$\arg\min_{x \in C} \|Zx - b\|_p, \quad \text{where } p = 1, 2, \qquad (1)$$

where $C$ is a convex constraint set, $Z \in \mathbb{R}^{n \times k}$ is a sample-by-feature design matrix, and $b \in \mathbb{R}^n$ is the target vector. We assume henceforth that the number of samples is large relative to the data dimensionality ($n \gg k$). The setting $p = 2$ corresponds to classical least squares regression, while $p = 1$ leads to the least absolute deviations fit, which is of significant interest due to its robustness properties. The constraint set $C$ can incorporate regularization. When $C = \mathbb{R}^k$ and $p = 2$, an $\varepsilon$-optimal solution can be obtained in time $O(nk \log k) + \mathrm{poly}(k\varepsilon^{-1})$ using randomization [6, 19], which is much faster than an $O(nk^2)$ deterministic solver when $\varepsilon$ is not too small (the dependence on $\varepsilon$ can be improved to $O(\log(1/\varepsilon))$ if higher accuracy is needed [17]). Similarly, a randomized solver for $l_1$ regression runs in time $O(nk \log n) + \mathrm{poly}(k\varepsilon^{-1})$ [5].

In many settings, what makes such acceleration possible is the existence of a suitable oblivious subspace embedding (OSE).
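To make the sketch-and-solve idea above concrete, here is a minimal numpy illustration for $p = 2$ using a dense Gaussian sketch, the simplest OSE; the fast structured and hashing-based maps discussed below would replace $S$. All sizes are toy choices of ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, t = 5000, 20, 400  # t = number of sketch rows, chosen well above k

# Synthetic overdetermined regression instance Z x ~ b.
Z = rng.standard_normal((n, k))
b = Z @ rng.standard_normal(k) + 0.1 * rng.standard_normal(n)

# Exact least-squares solution and its residual norm.
x_opt, *_ = np.linalg.lstsq(Z, b, rcond=None)
r_opt = np.linalg.norm(Z @ x_opt - b)

# Gaussian sketch S: solve the much smaller t x k problem instead.
S = rng.standard_normal((t, n)) / np.sqrt(t)
x_sk, *_ = np.linalg.lstsq(S @ Z, S @ b, rcond=None)
r_sk = np.linalg.norm(Z @ x_sk - b)

# The sketched residual is only slightly worse than the optimum.
ratio = r_sk / r_opt
```

Since $x_{\mathrm{opt}}$ is the exact minimizer, `ratio` is at least 1, and for $t \gg k$ it is close to 1, which is exactly the $(1+\varepsilon)$-approximation behavior the OSE property guarantees.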
An OSE can be thought of as a data-independent random "sketching" matrix $S \in \mathbb{R}^{t \times n}$ whose approximate isometry properties over a subspace (e.g., over the column space of $[Z, b]$) imply that

$$\|S(Zx - b)\|_p \approx \|Zx - b\|_p \quad \text{for all } x \in C,$$

which in turn allows $x$ to be optimized over a "sketched" dataset of much smaller size without losing solution quality. Sketching matrices include Gaussian random matrices, structured random matrices which admit fast matrix multiplication via FFT-like operations, and others.

This paper is motivated by two questions which in our context turn out to be complementary:

◦ Can additional structure in $Z$ be non-trivially exploited to further accelerate runtime? Clarkson and Woodruff have recently shown that when $Z$ is sparse and has $\mathrm{nnz}(Z) \ll nk$ non-zeros, it is possible to achieve much faster "input-sparsity" runtime using hashing-based sketching matrices [7]. Is it possible to further beat this time in the presence of additional structure on $Z$?

◦ Can faster and more accurate sketching techniques be designed for nonlinear and nonparametric regression? To see that this is intertwined with the previous question, consider the basic problem of fitting a polynomial model, $b = \sum_{i=1}^{q} \beta_i z^i$, to a set of samples $(z_i, b_i) \in \mathbb{R} \times \mathbb{R}$, $i = 1, \ldots, n$. Then the design matrix $Z$ has Vandermonde structure which can potentially be exploited in a regression solver. It is particularly appealing to estimate non-parametric models on large datasets. Sketching algorithms have recently been explored in the context of kernel methods for non-parametric function estimation [16, 11].

To be able to precisely describe the structure on $Z$ that we consider in this paper, and outline our contributions, we need the following definitions.

Definition 1 (Vandermonde Matrix) Let $x_0, x_1, \ldots, x_{n-1}$ be real numbers. The Vandermonde matrix, denoted $V_{q,n}(x_0, x_1, \ldots, x_{n-1})$, has the form:

$$V_{q,n}(x_0, x_1, \ldots, x_{n-1}) = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_0 & x_1 & \cdots & x_{n-1} \\ \vdots & \vdots & & \vdots \\ x_0^{q-1} & x_1^{q-1} & \cdots & x_{n-1}^{q-1} \end{pmatrix}$$

Vandermonde matrices of dimension $q \times n$ require only $O(n)$ implicit storage and admit $O((n + q)\log^2 q)$ matrix-vector multiplication time. We also define the following matrix operator $T_q$ which maps a matrix $A$ to a block-Vandermonde structured matrix.

Definition 2 (Matrix Operator) Given a matrix $A \in \mathbb{R}^{n \times d}$, we define the following matrix:

$$T_q(A) = \begin{pmatrix} V_{q,n}(A_{1,1}, \ldots, A_{n,1})^T & V_{q,n}(A_{1,2}, \ldots, A_{n,2})^T & \cdots & V_{q,n}(A_{1,d}, \ldots, A_{n,d})^T \end{pmatrix}$$

In this paper, we consider regression problems, Eqn. 1, where $Z$ can be written as

$$Z = T_q(A) \qquad (2)$$

for an $n \times d$ matrix $A$, so that $k = dq$. The operator $T_q$ expands each feature (column) of the original dataset $A$ to $q$ columns of $Z$ by applying monomial transformations up to degree $q - 1$. This lends a block-Vandermonde structure to $Z$. Such structure naturally arises in polynomial regression problems, but also applies more broadly to non-parametric additive models and kernel methods, as we discuss below. With this setup, the goal is to solve the following problem:

Structured Regression: Given $A$ and $b$, with constant probability output a vector $x' \in C$ for which

$$\|T_q(A)x' - b\|_p \le (1 + \varepsilon)\|T_q(A)x^\star - b\|_p,$$

for an accuracy parameter $\varepsilon > 0$, where $x^\star = \arg\min_{x \in C} \|T_q(A)x - b\|_p$.

Our contributions in this paper are as follows:

◦ For $p = 2$, we provide an algorithm that solves the structured regression problem above in time $O(\mathrm{nnz}(A)\log^2 q) + \mathrm{poly}(dq\varepsilon^{-1})$.
By combining our sketching methods with preconditioned iterative solvers, we can also obtain logarithmic dependence on $\varepsilon$. For $p = 1$, we provide an algorithm with runtime $O(\mathrm{nnz}(A)\log n \log^2 q) + \mathrm{poly}(dq\varepsilon^{-1}\log n)$. This implies that moving from linear (i.e., $Z = A$) to nonlinear regression ($Z = T_q(A)$) incurs only a mild additional $\log^2 q$ runtime cost, while requiring no extra storage! Since $\mathrm{nnz}(T_q(A)) = q\,\mathrm{nnz}(A)$, this provides, to our knowledge, the first sketching approach that operates faster than "input-sparsity" time, i.e., we sketch $T_q(A)$ in time faster than $\mathrm{nnz}(T_q(A))$.

◦ Our algorithms apply to a broad class of nonlinear models for both least squares regression and their robust $l_1$ regression counterparts. While polynomial regression and additive models with monomial basis functions are immediately covered by our methods, we also show that under a suitable choice of the constraint set $C$, the structured regression problem with $Z = T_q(AG)$ for a Gaussian random matrix $G$ approximates non-parametric regression using the Gaussian kernel. We argue that our approach provides a more flexible modeling framework when compared to randomized Fourier maps for kernel methods [16, 11].

◦ Empirical results confirm both the practical value of our modeling framework, as well as the speedup benefits of sketching.

2 Polynomial Fitting, Additive Models and Random Fourier Maps

Our primary goal in this section is to motivate sketching approaches for a versatile class of block-Vandermonde structured regression problems by showing that these problems arise naturally in various statistical modeling settings.

The most basic application is one-dimensional ($d = 1$) polynomial regression.

In multivariate additive regression models, a continuous target variable $y \in \mathbb{R}$ and input variables $z \in \mathbb{R}^d$ are related through the model $y = \mu + \sum_{i=1}^{d} f_i(z_i) + \epsilon_i$, where $\mu$ is an intercept term, $\epsilon_i$ are zero-mean Gaussian error terms and $f_i$ are smooth univariate functions. The basic idea is to expand each function as $f_i(\cdot) = \sum_{t=1}^{q} \beta_{i,t} h_{i,t}(\cdot)$ using basis functions $h_{i,t}(\cdot)$ and estimate the unknown parameter vector $x = [\beta_{11} \ldots \beta_{1q} \ldots \beta_{dq}]^T$, typically by a constrained or penalized least squares model, $\arg\min_{x \in C} \|Zx - b\|_2^2$, where $b = (y_1 \ldots y_n)^T$ and $Z = [H_1 \ldots H_q] \in \mathbb{R}^{n \times dq}$ with $(H_i)_{j,t} = h_{i,t}(z_j)$, on a training sample $(z_i, y_i)$, $i = 1, \ldots, n$. The constraint set $C$ typically imposes smoothing, sparsity or group sparsity constraints [2]. It is easy to see that choosing a monomial basis $h_{i,s}(u) = u^s$ immediately maps the design matrix $Z$ to the structured regression form of Eqn. 2. For $p = 1$, our algorithms also provide fast solvers for robust polynomial additive models.

Additive models impose a restricted form of univariate nonlinearity which ignores interactions between covariates. Let us denote an interaction term as $z^\alpha = z_1^{\alpha_1} \cdots z_d^{\alpha_d}$, $\alpha = (\alpha_1 \ldots \alpha_d)$, where $\sum_i \alpha_i = q$, $\alpha_i \in \{0, \ldots, q\}$. A degree-$q$ multivariate polynomial function space $P_q$ is spanned by $\{z^\alpha,\ \alpha \in \{0, \ldots, q\}^d,\ \sum_i \alpha_i \le q\}$. $P_q$ admits all possible degree-$q$ interactions but has dimensionality $d^q$, which is computationally infeasible to work with explicitly except for low degrees and low-dimensional or sparse datasets [3]. Kernel methods with polynomial kernels $k(z, z') = (z^T z')^q = \sum_\alpha z^\alpha z'^\alpha$ provide an implicit mechanism to compute inner products in the feature space associated with $P_q$. However, they require $O(n^3)$ computation for solving the associated kernelized (ridge) regression problems and $O(n^2)$ storage of dense $n \times n$ Gram matrices $K$ (given by $K_{ij} = k(z_i, z_j)$), and therefore do not scale well.

For a $d \times D$ matrix $G$, let $S_G$ be the subspace spanned by

$$\left\{ \left( \sum_{i=1}^{d} G_{ij} z_i \right)^t,\ t = 1 \ldots q,\ j = 1 \ldots D \right\}.$$

Assuming $D = d^q$ and that $G$ is a random matrix of i.i.d. Gaussian variables, then almost surely we have $S_G = P_q$. An intuitively appealing, explicitly scalable approach is then to use $D \ll d^q$; in that case $S_G$ essentially spans a random subspace of $P_q$. The design matrix for solving the multivariate polynomial regression restricted to $S_G$ has the form $Z = T_q(AG)$, where $A = [z_1^T \ldots z_n^T]^T$.

This scheme can in fact be related to the idea of random Fourier features introduced by Rahimi and Recht [16] in the context of approximating shift-invariant kernel functions, with the Gaussian kernel $k(z, z') = \exp(-\|z - z'\|_2^2 / 2\sigma^2)$ as the primary example. By appealing to Bochner's Theorem [18], it is shown that the Gaussian kernel is the Fourier transform of a zero-mean multivariate Gaussian distribution with covariance matrix $\sigma^{-2} I_d$, where $I_d$ denotes the $d$-dimensional identity matrix:

$$k(z, z') = \exp(-\|z - z'\|_2^2 / 2\sigma^2) = E_{\omega \sim N(0_d, \sigma^{-2} I_d)}[\phi_\omega(z)\phi_\omega(z')^*],$$

where $\phi_\omega(z) = e^{i\omega^T z}$. An empirical approximation to this expectation can be obtained by sampling $D$ frequencies $\omega \sim N(0_d, \sigma^{-2} I_d)$ and setting $k(z, z') \approx \frac{1}{D}\sum_{i=1}^{D} \phi_{\omega_i}(z)\phi_{\omega_i}(z')^*$. This implies that the Gram matrix of the Gaussian kernel, $K_{ij} = \exp(-\|z_i - z_j\|_2^2 / 2\sigma^2)$, may be approximated with high concentration as $K \approx RR^T$, where $R = \frac{1}{\sqrt{D}}[\cos(AG)\ \sin(AG)] \in \mathbb{R}^{n \times 2D}$ (sine and cosine are applied elementwise as scalar functions).
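The Gram-matrix approximation $K \approx RR^T$ is easy to check numerically. The toy snippet below is our own illustration: frequencies are drawn with standard deviation $1/\sigma$ per coordinate, and the $1/\sqrt{D}$ scaling is folded into $R$ so that $RR^T$ is exactly the empirical average over the sampled features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D, sigma = 200, 5, 2000, 1.0

Z = rng.standard_normal((n, d))

# Exact Gaussian-kernel Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2)).
sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))

# Random Fourier features: omega columns ~ N(0, sigma^{-2} I_d).
G = rng.standard_normal((d, D)) / sigma
R = np.hstack([np.cos(Z @ G), np.sin(Z @ G)]) / np.sqrt(D)
K_hat = R @ R.T  # empirical average of cos(omega_i^T (z - z')) terms

err = np.abs(K - K_hat).max()
```

The diagonal of `K_hat` is exactly 1 (since $\cos^2 + \sin^2 = 1$ in every feature), and the off-diagonal entries concentrate around the true kernel values at rate roughly $1/\sqrt{D}$.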
This randomized explicit feature mapping for the Gaussian kernel implies that standard linear regression, with $R$ as the design matrix, can then be used to obtain a solution in time $O(nD^2)$. By taking the Maclaurin series expansion of sine and cosine up to degree $q$, we can see that a restricted structured regression problem of the form $\arg\min_{x \in \mathrm{range}(Q)} \|T_q(AG)x - b\|_p$, where the matrix $Q \in \mathbb{R}^{2Dq \times 2D}$ contains the appropriate coefficients of the Maclaurin series, will closely approximate the randomized Fourier features construction of [16]. By dropping or modifying the constraint set $x \in \mathrm{range}(Q)$, the setup above can, in principle, define a richer class of models. A full error analysis of this approach is the subject of a separate paper.

3 Fast Structured Regression with Sketching

We now develop our randomized solvers for block-Vandermonde structured $l_p$ regression problems. In the theoretical developments below, we consider unconstrained regression, though our results generalize straightforwardly to convex constraint sets $C$. For simplicity, we state all our results for constant failure probability. One can always repeat the regression procedure $O(\log(1/\delta))$ times, each time with independent randomness, and choose the best solution found. This reduces the failure probability to $\delta$.

3.1 Background

We begin by giving some notation and then provide the necessary technical background. Given a matrix $M \in \mathbb{R}^{n \times d}$, let $M_1, \ldots, M_d$ be the columns of $M$, and $M^1, \ldots, M^n$ be the rows of $M$. Define $\|M\|_1$ to be the element-wise $\ell_1$ norm of $M$; that is, $\|M\|_1 = \sum_{i \in [d]} \|M_i\|_1$. Let $\|M\|_F = \left(\sum_{i \in [n], j \in [d]} M_{i,j}^2\right)^{1/2}$ be the Frobenius norm of $M$. Let $[n] = \{1, \ldots, n\}$.

3.1.1 Well-Conditioning and Sampling of a Matrix

Definition 3 ($(\alpha, \beta, 1)$-well-conditioning [8]) Given a matrix $M \in \mathbb{R}^{n \times d}$, we say $M$ is $(\alpha, \beta, 1)$-well-conditioned if (1) $\|x\|_\infty \le \beta \|Mx\|_1$ for any $x \in \mathbb{R}^d$, and (2) $\|M\|_1 \le \alpha$.

Lemma 4 (Implicit in [20]) Suppose $S$ is an $r \times n$ matrix so that for all $x \in \mathbb{R}^d$,
$$\|Mx\|_1 \le \|SMx\|_1 \le \kappa \|Mx\|_1.$$
Let $Q \cdot R$ be a QR-decomposition of $SM$, so that $QR = SM$ and $Q$ has orthonormal columns. Then $MR^{-1}$ is $(d\sqrt{r}, \kappa, 1)$-well-conditioned.

Theorem 5 (Theorem 3.2 of [8]) Suppose $U$ is an $(\alpha, \beta, 1)$-well-conditioned basis of an $n \times d$ matrix $A$. For each $i \in [n]$, let $p_i \ge \min\left(1, \frac{t\,\|U^i\|_1}{\|U\|_1}\right)$, where $t \ge 32\alpha\beta\left(d\ln\left(\frac{12}{\varepsilon}\right) + \ln\left(\frac{2}{\delta}\right)\right)/\varepsilon^2$. Suppose we independently sample each row with probability $p_i$, and create a diagonal matrix $S$ where $S_{i,i} = 0$ if $i$ is not sampled, and $S_{i,i} = 1/p_i$ if $i$ is sampled. Then with probability at least $1 - \delta$, simultaneously for all $x \in \mathbb{R}^d$ we have:
$$\left| \|SAx\|_1 - \|Ax\|_1 \right| \le \varepsilon \|Ax\|_1.$$

We also need the following method of quickly obtaining approximations to the $p_i$'s in Theorem 5, which was originally given in Mahoney et al. [13].

Theorem 6 Let $U \in \mathbb{R}^{n \times d}$ be an $(\alpha, \beta, 1)$-well-conditioned basis of an $n \times d$ matrix $A$. Suppose $G$ is a $d \times O(\log n)$ matrix of i.i.d. Gaussians. Let $p_i = \min\left(1, t^2\,\frac{\|U^i G\|_1}{\sqrt{d}\,\|UG\|_1}\right)$ for all $i$, where $t$ is as in Theorem 5. Then with probability $1 - 1/n$, over the choice of $G$, the following occurs. If we sample each row with probability $p_i$, and create $S$ as in Theorem 5, then with probability at least $1 - \delta$, over our choice of sampled rows, simultaneously for all $x \in \mathbb{R}^d$ we have:
$$\left| \|SAx\|_1 - \|Ax\|_1 \right| \le \varepsilon \|Ax\|_1.$$

3.1.2 Oblivious Subspace Embeddings

Let $A \in \mathbb{R}^{n \times d}$. We assume that $n > d$. Let $\mathrm{nnz}(A)$ denote the number of non-zero entries of $A$. We can assume $\mathrm{nnz}(A) \ge n$ and that there are no all-zero rows or columns in $A$.

$\ell_2$ norm. The following family of matrices is due to Charikar et al. [4] (see also [9]). For a parameter $t$, define a random linear map $\Phi D: \mathbb{R}^n \to \mathbb{R}^t$ as follows:
• $h: [n] \to [t]$ is a random map so that for each $i \in [n]$, $h(i) = t'$ for $t' \in [t]$ with probability $1/t$.
• $\Phi \in \{0,1\}^{t \times n}$ is a $t \times n$ binary matrix with $\Phi_{h(i),i} = 1$, and all remaining entries 0.
• $D$ is an $n \times n$ random diagonal matrix, with each diagonal entry independently chosen to be $+1$ or $-1$ with equal probability.

We will refer to $\Pi = \Phi D$ as a sparse embedding matrix. For certain $t$, it was recently shown [7] that with probability at least .99 over the choice of $\Phi$ and $D$, for any fixed $A \in \mathbb{R}^{n \times d}$, we have simultaneously for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax\|_2 \le \|\Pi Ax\|_2 \le (1 + \varepsilon)\|Ax\|_2,$$
that is, the entire column space of $A$ is preserved [7]. The best known value of $t$ is $t = O(d^2/\varepsilon^2)$ [14, 15].

We will also use an oblivious subspace embedding known as the subsampled randomized Hadamard transform, or SRHT.
See Boutsidis and Gittens's recent article for a state-of-the-art analysis [1].

Theorem 7 (Lemma 6 in [1]) There is a distribution over linear maps $\Pi'$ such that with probability .99 over the choice of $\Pi'$, for any fixed $A \in \mathbb{R}^{n \times d}$, we have simultaneously for all $x \in \mathbb{R}^d$,
$$(1 - \varepsilon)\|Ax\|_2 \le \|\Pi' Ax\|_2 \le (1 + \varepsilon)\|Ax\|_2,$$
where the number of rows of $\Pi'$ is $t' = O(\varepsilon^{-2}(\log d)(\sqrt{d} + \sqrt{\log n})^2)$, and the time to compute $\Pi' A$ is $O(nd \log t')$.

$\ell_1$ norm. The results can be generalized to subspace embeddings with respect to the $\ell_1$-norm [7, 14, 21]. The best known bounds are due to Woodruff and Zhang [21], so we use their family of embedding matrices in what follows. Here the goal is to design a distribution over matrices $\Psi$ so that with probability at least .99, for any fixed $A \in \mathbb{R}^{n \times d}$, simultaneously for all $x \in \mathbb{R}^d$,
$$\|Ax\|_1 \le \|\Psi Ax\|_1 \le \kappa \|Ax\|_1,$$
where $\kappa > 1$ is a distortion parameter. The best known value of $\kappa$, independent of $n$, for which $\Psi A$ can be computed in $O(\mathrm{nnz}(A))$ time is $\kappa = O(d^2 \log^2 d)$ [21]. Their family of matrices $\Psi$ is chosen to be of the form $\Pi \cdot E$, where $\Pi$ is as above with parameter $t = d^{1+\gamma}$ for an arbitrarily small constant $\gamma > 0$, and $E$ is a diagonal matrix with $E_{i,i} = 1/u_i$, where $u_1, \ldots, u_n$ are independent standard exponentially distributed random variables.

Recall that an exponential distribution has support $x \in [0, \infty)$, probability density function $f(x) = e^{-x}$ and cumulative distribution function $F(x) = 1 - e^{-x}$.
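Both families of embeddings reduce to signed bucket sums, so neither $\Pi$ nor $\Psi$ ever needs to be formed explicitly. The numpy sketch below (toy sizes of ours) builds both; note that the $\ell_1$ guarantee is one-sided up to the factor $\kappa$ and holds only with constant probability, so we only sanity-check the $\ell_2$ map quantitatively.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 20000, 6, 2000

A = rng.standard_normal((n, d))

# Shared ingredients of Pi = Phi * D: hash h : [n] -> [t] and random signs.
h = rng.integers(0, t, size=n)
s = rng.choice([-1.0, 1.0], size=n)

def bucket_sum(weights, M, t):
    """Apply the t x n map Phi * diag(weights) to M in O(nnz(M)) time:
    row i of M, scaled by weights[i], is accumulated into bucket h(i)."""
    out = np.zeros((t, M.shape[1]))
    np.add.at(out, h, weights[:, None] * M)  # unbuffered scatter-add
    return out

# l2 sparse embedding: Pi * A = Phi * D * A.
PiA = bucket_sum(s, A, t)

# l1 embedding of Woodruff and Zhang: Psi * A = Phi * D * E * A,
# with E = diag(1/u_i) for i.i.d. standard exponentials u_i.
u = rng.exponential(size=n)
PsiA = bucket_sum(s / u, A, t)

x = rng.standard_normal(d)
l2_distortion = np.linalg.norm(PiA @ x) / np.linalg.norm(A @ x)
l1_distortion = np.linalg.norm(PsiA @ x, 1) / np.linalg.norm(A @ x, 1)
```

For a single fixed $x$ and $t \gg d^2$, `l2_distortion` concentrates near 1; the heavy-tailed $1/u_i$ weights make `l1_distortion` much more variable, which is exactly why the $\ell_1$ analysis tolerates a $\kappa$-factor distortion.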
We say a random variable $X$ is exponential if $X$ is chosen from the exponential distribution.

3.1.3 Fast Vandermonde Multiplication

Lemma 8 Let $x_0, \ldots, x_{n-1} \in \mathbb{R}$ and $V = V_{q,n}(x_0, \ldots, x_{n-1})$. For any $y \in \mathbb{R}^n$ and $z \in \mathbb{R}^q$, the matrix-vector products $Vy$ and $V^T z$ can be computed in $O((n + q)\log^2 q)$ time.

3.2 Main Lemmas

We handle $\ell_2$ and $\ell_1$ separately. Our algorithms use the subroutines given by the next lemmas.

Lemma 9 (Efficient Multiplication of a Sparse Sketch and $T_q(A)$) Let $A \in \mathbb{R}^{n \times d}$. Let $\Pi = \Phi D$ be a sparse embedding matrix for the $\ell_2$ norm with associated hash function $h: [n] \to [t]$ for an arbitrary value of $t$, and let $E$ be any diagonal matrix. There is a deterministic algorithm to compute the product $\Phi \cdot D \cdot E \cdot T_q(A)$ in $O((\mathrm{nnz}(A) + dtq)\log^2 q)$ time.

Proof: By the definition of $T_q(A)$, it suffices to prove this when $d = 1$. Indeed, if we can prove for a column vector $a$ that the product $\Phi \cdot D \cdot E \cdot T_q(a)$ can be computed in $O((\mathrm{nnz}(a) + tq)\log^2 q)$ time, then by linearity it will follow that the product $\Phi \cdot D \cdot E \cdot T_q(A)$ can be computed in $O((\mathrm{nnz}(A) + dtq)\log^2 q)$ time for general $d$. Hence, in what follows, we assume that $d = 1$ and our matrix $A$ is a column vector $a$. Notice that if $a$ is just a column vector, then $T_q(a)$ is equal to $V_{q,n}(a_1, \ldots, a_n)^T$. For each $k \in [t]$, define the ordered list $L_k = \{i : a_i \ne 0 \text{ and } h(i) = k\}$. Let $\ell_k = |L_k|$. We define an $\ell_k$-dimensional vector $\sigma^k$ as follows: if $p_k(i)$ is the $i$-th element of $L_k$, we set $\sigma^k_i = D_{p_k(i),p_k(i)} \cdot E_{p_k(i),p_k(i)}$. Let $V^k$ be the submatrix of $V_{q,n}(a_1, \ldots, a_n)^T$ whose rows are in the set $L_k$. Notice that $V^k$ is itself the transpose of a Vandermonde matrix, where the number of rows of $V^k$ is $\ell_k$. By Lemma 8, the product $\sigma^k V^k$ can be computed in $O((\ell_k + q)\log^2 q)$ time. Notice that $\sigma^k V^k$ is equal to the $k$-th row of the product $\Phi D E T_q(a)$. Therefore, the entire product $\Phi D E T_q(a)$ can be computed in $O\left(\sum_k (\ell_k + q)\log^2 q\right) = O((\mathrm{nnz}(a) + tq)\log^2 q)$ time.

Lemma 10 (Efficient Multiplication of $T_q(A)$ on the Right) Let $A \in \mathbb{R}^{n \times d}$. For any vector $z$, there is a deterministic algorithm to compute the matrix-vector product $T_q(A) \cdot z$ in $O((\mathrm{nnz}(A) + dq)\log^2 q)$ time. The proof is provided in the supplementary material.

Lemma 11 (Efficient Multiplication of $T_q(A)$ on the Left) Let $A \in \mathbb{R}^{n \times d}$. For any vector $z$, there is a deterministic algorithm to compute the matrix-vector product $z^T \cdot T_q(A)$ in $O((\mathrm{nnz}(A) + dq)\log^2 q)$ time. The proof is provided in the supplementary material.

3.3 Fast $\ell_2$-regression

Algorithm 1 StructRegression-2
1: Input: An $n \times d$ matrix $A$ with $\mathrm{nnz}(A)$ non-zero entries, an $n \times 1$ vector $b$, an integer degree $q$, and an accuracy parameter $\varepsilon > 0$.
2: Output: With probability at least .98, a vector $x' \in \mathbb{R}^{dq}$ for which $\|T_q(A)x' - b\|_2 \le (1 + \varepsilon)\min_x \|T_q(A)x - b\|_2$.
3: Let $\Pi = \Phi D$ be a sparse embedding matrix for the $\ell_2$ norm with $t = O((dq)^2/\varepsilon^2)$.
4: Compute $\Pi T_q(A)$ using the efficient algorithm of Lemma 9 with $E$ set to the identity matrix.
5: Compute $\Pi b$.
6: Compute $\Pi'(\Pi T_q(A))$ and $\Pi'\Pi b$, where $\Pi'$ is a subsampled randomized Hadamard transform of Theorem 7 with $t' = O(\varepsilon^{-2}(\log(dq))(\sqrt{dq} + \sqrt{\log t})^2)$ rows.
7: Output the minimizer $x'$ of $\|\Pi'\Pi T_q(A)x' - \Pi'\Pi b\|_2$.

We start by considering the structured regression problem in the case $p = 2$.
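The fast products of Lemma 8 rely on fast polynomial arithmetic, which we do not reproduce here. As a correctness reference only, note that $V^T z$ is simply evaluation of the polynomial with coefficient vector $z$ at the nodes $x_0, \ldots, x_{n-1}$, which Horner's rule computes in $O(nq)$ time (the function name below is ours):

```python
import numpy as np

def vandermonde_T_matvec(x, z):
    """Compute V_{q,n}(x)^T z, i.e. evaluate the degree-(q-1) polynomial
    with coefficients z at every node x[i], via Horner's rule.
    O(n q) time; Lemma 8's fast version achieves O((n+q) log^2 q)."""
    acc = np.zeros_like(x)
    for c in z[::-1]:      # from the leading coefficient down
        acc = acc * x + c  # Horner step
    return acc

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
z = rng.standard_normal(5)            # q = 5 coefficients
V_T = np.vander(x, 5, increasing=True)  # rows: 1, x_i, ..., x_i^4
```

Here `V_T` is exactly $V_{q,n}(x)^T$, so `vandermonde_T_matvec(x, z)` must agree with the explicit product `V_T @ z`.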
We give an algorithm for this problem in Algorithm 1.

Theorem 12 Algorithm StructRegression-2 solves w.h.p. the structured regression problem with $p = 2$ in time
$$O(\mathrm{nnz}(A)\log^2 q) + \mathrm{poly}(dq/\varepsilon).$$

Proof: By the properties of a sparse embedding matrix (see Section 3.1.2), with probability at least .99, for $t = O((dq)^2/\varepsilon^2)$, we have simultaneously for all $y$ in the span of the columns of $T_q(A)$ adjoined with $b$,
$$(1 - \varepsilon)\|y\|_2 \le \|\Pi y\|_2 \le (1 + \varepsilon)\|y\|_2,$$
since this subspace has dimension at most $dq + 1$. By Theorem 7, we further have that with probability .99, for all vectors $z$ in the span of the columns of $\Pi(T_q(A) \circ b)$,
$$(1 - \varepsilon)\|z\|_2 \le \|\Pi' z\|_2 \le (1 + \varepsilon)\|z\|_2.$$
It follows that for all vectors $x \in \mathbb{R}^{dq}$,
$$(1 - O(\varepsilon))\|T_q(A)x - b\|_2 \le \|\Pi'\Pi(T_q(A)x - b)\|_2 \le (1 + O(\varepsilon))\|T_q(A)x - b\|_2.$$
It follows by a union bound that with probability at least .98, the output of StructRegression-2 is a $(1 + \varepsilon)$-approximation.

For the time complexity, $\Pi T_q(A)$ can be computed in $O((\mathrm{nnz}(A) + dtq)\log^2 q)$ time by Lemma 9, while $\Pi b$ can be computed in $O(n)$ time.
The remaining steps can be performed in $\mathrm{poly}(dq/\varepsilon)$ time, and therefore the overall time is $O(\mathrm{nnz}(A)\log^2 q) + \mathrm{poly}(dq/\varepsilon)$.

3.3.1 Logarithmic Dependence on $1/\varepsilon$

The StructRegression-2 algorithm can be modified to obtain a running time with a logarithmic dependence on $\varepsilon$ by combining sketching-based methods with iterative ones.

Theorem 13 There is an algorithm which solves the structured regression problem with $p = 2$ in time $O((\mathrm{nnz}(A) + dq)\log(1/\varepsilon)) + \mathrm{poly}(dq)$ w.h.p.

Due to space limitations, the proof is provided in the supplementary material.

3.4 Fast $\ell_1$-regression

Algorithm 2 StructRegression-1
1: Input: An $n \times d$ matrix $A$ with $\mathrm{nnz}(A)$ non-zero entries, an $n \times 1$ vector $b$, an integer degree $q$, and an accuracy parameter $\varepsilon > 0$.
2: Output: With probability at least .98, a vector $x' \in \mathbb{R}^{dq}$ for which $\|T_q(A)x' - b\|_1 \le (1 + \varepsilon)\min_x \|T_q(A)x - b\|_1$.
3: Let $\Psi = \Pi E = \Phi DE$ be a subspace embedding matrix for the $\ell_1$ norm with $t = (dq + 1)^{1+\gamma}$ for an arbitrarily small constant $\gamma > 0$.
4: Compute $\Psi T_q(A) = \Pi E T_q(A)$ using the efficient algorithm of Lemma 9.
5: Compute $\Psi b = \Pi E b$.
6: Compute a QR-decomposition of $\Psi(T_q(A) \circ b)$, where $\circ$ denotes the adjoining of the column vector $b$ to $T_q(A)$.
7: Let $G$ be a $(dq + 1) \times O(\log n)$ matrix of i.i.d. Gaussians.
8: Compute $R^{-1} \cdot G$.
9: Compute $(T_q(A) \circ b) \cdot (R^{-1}G)$ using the efficient algorithm of Lemma 10 applied to each of the columns of $R^{-1}G$.
10: Let $S$ be the diagonal matrix of Theorem 6 formed by sampling $\tilde{O}(q^{1+\gamma/2} d^{4+\gamma/2} \varepsilon^{-2})$ rows of $T_q(A)$ and the corresponding entries of $b$ using the scheme of Theorem 6.
11: Output the minimizer $x'$ of $\|S T_q(A)x' - Sb\|_1$.

We now consider the structured regression problem in the case $p = 1$.
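Before turning to $p = 1$, the $p = 2$ pipeline can be mimicked end to end in a few lines of numpy. This toy version of ours forms $T_q(A)$ explicitly and uses a single sparse embedding (skipping the fast implicit products of Lemma 9 and the SRHT compression of Algorithm 1), so it illustrates only the statistical behavior, not the fast running time:

```python
import numpy as np

def T_q(A, q):
    """Block-Vandermonde expansion: column A[:, j] becomes the q columns
    A[:, j]**0, ..., A[:, j]**(q-1), as in Definition 2."""
    return np.hstack([A[:, j:j+1] ** np.arange(q) for j in range(A.shape[1])])

rng = np.random.default_rng(0)
n, d, q, t = 20000, 4, 4, 4000
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

Z = T_q(A, q)  # n x dq (formed explicitly here, only for checking)

# Sparse embedding Pi = Phi * D applied to [Z, b] via signed bucket sums.
h = rng.integers(0, t, size=n)
s = rng.choice([-1.0, 1.0], size=n)
SZ = np.zeros((t, Z.shape[1]))
np.add.at(SZ, h, s[:, None] * Z)
Sb = np.zeros(t)
np.add.at(Sb, h, s * b)

# Solve the small sketched problem, then compare residuals on the full data.
x_sk, *_ = np.linalg.lstsq(SZ, Sb, rcond=None)
x_opt, *_ = np.linalg.lstsq(Z, b, rcond=None)
ratio = np.linalg.norm(Z @ x_sk - b) / np.linalg.norm(Z @ x_opt - b)
```

Note that $T_q$ produces a repeated all-ones column per feature block (the degree-0 monomials), so `Z` is rank-deficient; `lstsq` handles this by returning a minimum-norm solution, and the residual comparison remains valid.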
The algorithm in this case is more complicated than that for $p = 2$, and is given in Algorithm 2.

Theorem 14 Algorithm StructRegression-1 solves w.h.p. the structured regression problem with $p = 1$ in time
$$O(\mathrm{nnz}(A)\log n \log^2 q) + \mathrm{poly}(dq\varepsilon^{-1}\log n).$$

The proof is provided in the supplementary material.

We note that when there is a convex constraint set $C$, the only change in the above algorithms is to optimize over $x' \in C$.

4 Experiments

We report two sets of experiments on classification and regression datasets. The first set of experiments compares the generalization performance of our structured nonlinear least squares regression models against standard linear regression and against nonlinear regression with random Fourier features [16]. The second set of experiments focuses on the scalability benefits of sketching. We used Regularized Least Squares Classification (RLSC) for classification.

Generalization performance is reported in Table 1. As expected, ordinary $\ell_2$ linear regression is very fast, especially if the matrix is sparse. However, it delivers only mediocre results. The results improve somewhat with additive polynomial regression. Additive polynomial regression maintains the sparsity structure, so it can exploit fast sparse solvers. Once we introduce random features, thereby introducing interaction terms, results improve considerably. When compared with random Fourier features, for the same number of random features $D$, additive polynomial regression with random features gets better results than regression with random Fourier features. If the number of random features is not the same, i.e., if $D_{Fourier} = D_{Poly} \cdot q$ (where $D_{Fourier}$ is the number of Fourier features, and $D_{Poly}$ is the number of random features in the additive polynomial regression), then regression with random Fourier features seems to outperform additive polynomial regression with random features.
However, computing the random features is one of the most expensive steps, so computing better approximations with fewer random features is desirable.

Figure 1 reports the benefit of sketching in terms of running times, and the trade-off in terms of accuracy. In this experiment we use a larger sample of the MNIST dataset with 300,000 examples.

Table 1: Comparison of testing error and training time of the different methods. In the table, n is the number of training instances, d is the number of features per instance, and k is the number of instances in the test set. "Ord. Reg." stands for ordinary $\ell_2$ regression. "Add. Poly. Reg." stands for additive polynomial $\ell_2$ regression. For classification tasks, the percent of testing points incorrectly predicted is reported.

Dataset (task, n, d, k) | Ord. Reg. | Add. Poly. Reg. | Add. Poly. Reg. w/ Random Features | Ord. Reg. w/ Fourier Features
MNIST (classification, n = 60,000, d = 784, k = 10,000) | 14%, 3.9 sec | 11%, 19.1 sec (q = 4) | 6.9%, 5.5 sec (D = 300, q = 4) | 7.8%, 6.8 sec (D = 500)
CPU (regression, n = 6,554, d = 21, k = 819) | 12%, 0.01 sec | 3.3%, 0.07 sec (q = 4) | 2.8%, 0.13 sec (D = 60, q = 4) | 2.8%, 0.14 sec (D = 180)
ADULT (classification, n = 32,561, d = 123, k = 16,281) | 15.5%, 0.17 sec | 15.5%, 0.55 sec (q = 4) | 15.0%, 3.9 sec (D = 500, q = 4) | 15.1%, 3.6 sec (D = 1000)
CENSUS (regression, n = 18,186, d = 119, k = 2,273) | 7.1%, 0.3 sec | 7.0%, 1.4 sec (q = 4, lambda = 0.2) | 6.85%, 1.9 sec (D = 500, q = 4, lambda = 0.1) | 6.5%, 2.1 sec (D = 500, lambda = 0.1)
FOREST COVER (classification, n = 522,910, d = 54, k = 58,102) | 25.7%, 3.3 sec | 23.7%, 7.8 sec (q = 4) | 20.0%, 14.0 sec (D = 200, q = 4) | 21.3%, 15.5 sec (D = 400)
For regression tasks, we report ‖yp − y‖2/‖y‖2, where yp is the vector of predicted values and y is the ground truth.

Figure 1: Examining the performance of sketching versus sampling: (a) speedup over the exact solution, (b) suboptimality of the residual, and (c) classification error on the test set, each as a function of sketch size (% of examples).

We compute 1,500 random features, and then solve the corresponding additive polynomial regression problem with q = 4, both exactly and with sketching to different numbers of rows. We also tested a sampling-based approach which simply samples a random subset of the rows (no sketching). Figure 1 (a) plots the speedup of the sketched method relative to the exact solution. In these experiments we use a non-optimized, straightforward implementation that does not exploit fast Vandermonde multiplication or parallel processing. Therefore, running times were measured using a sequential execution, and we measured only the time required to solve the regression problem. For this experiment we used a machine with two quad-core Intel E5410 processors at 2.33GHz and 32GB of DDR2 800MHz RAM. Figure 1 (b) explores the sub-optimality in solving the regression problem. More specifically, we plot (‖Yp − Y‖F − ‖Y⋆p − Y‖F)/‖Y⋆p − Y‖F, where Y is the labels matrix, Y⋆p is the best approximation (the exact solution), and Yp is the sketched solution. We see that indeed the error decreases as the sketch size grows, and that with a sketch size that is not too big we get to about a 10% larger objective. In Figure 1 (c) we see that this translates to an increase in error rate. Encouragingly, a sketch as small as 15% of the number of examples is enough to incur only a very small increase in error rate, while still solving the regression problem more than 5 times faster (the speedup is expected to grow for larger datasets).

Acknowledgements

The authors acknowledge support from the XDATA program of the Defense Advanced Research Projects Agency (DARPA), administered through Air Force Research Laboratory contract FA8750-12-C-0323.

References

[1] C. Boutsidis and A. Gittens. Improved matrix algorithms via the Subsampled Randomized Hadamard Transform. ArXiv e-prints, Mar. 2012. To appear in the SIAM Journal on Matrix Analysis and Applications.

[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.

[3] Y. Chang, C. Hsieh, K. Chang, M. Ringgaard, and C. Lin. Low-degree polynomial mapping of data for SVM. JMLR, 11, 2010.

[4] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004.

[5] K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff. The Fast Cauchy Transform and faster robust regression. CoRR, abs/1207.4684, 2012. Also in SODA 2013.

[6] K. L. Clarkson and D. P. Woodruff.
Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC '09, pages 205–214, New York, NY, USA, 2009. ACM.

[7] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, STOC '13, pages 81–90, New York, NY, USA, 2013. ACM.

[8] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

[9] A. Gilbert and P. Indyk. Sparse recovery using sparse matrices. Proceedings of the IEEE, 98(6):937–947, 2010.

[10] N. Halko, P. G. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[11] Q. Le, T. Sarlós, and A. Smola. Fastfood: Computing Hilbert space expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, ICML '13, 2013.

[12] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.

[13] M. W. Mahoney, P. Drineas, M. Magdon-Ismail, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. In Proceedings of the 29th International Conference on Machine Learning, ICML '12, 2012.

[14] X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the 45th Annual ACM Symposium on Theory of Computing, STOC '13, pages 91–100, New York, NY, USA, 2013. ACM.

[15] J. Nelson and H. L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. CoRR, abs/1211.1002, 2012.

[16] A. Rahimi and B. Recht.
Random features for large-scale kernel machines. In Proceedings of Neural Information Processing Systems, NIPS '07, 2007.

[17] V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212, 2008.

[18] W. Rudin. Fourier Analysis on Groups. Wiley Classics Library. Wiley-Interscience, New York, 1994.

[19] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In Proceedings of the IEEE Symposium on Foundations of Computer Science, FOCS '06, pages 143–152, 2006.

[20] C. Sohler and D. P. Woodruff. Subspace embeddings for the ℓ1-norm with applications. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, STOC '11, pages 755–764, 2011.

[21] D. P. Woodruff and Q. Zhang. Subspace embeddings and ℓp regression using exponential random variables. In COLT, 2013.
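As an illustration of the sketch-and-solve paradigm evaluated in Figure 1, the following self-contained NumPy sketch (our own illustration, not the paper's implementation; it uses a generic CountSketch transform on synthetic Gaussian data, with all problem sizes chosen arbitrarily) compares an exact least-squares solve with a sketched one and reports the relative suboptimality of the residual:

```python
import numpy as np

def countsketch(Z, b, m, rng):
    """Apply a CountSketch transform S with m rows to Z and b.

    Each input row is hashed to one of m output rows and multiplied by a
    random sign, so S*Z can be computed in time proportional to nnz(Z).
    """
    n = Z.shape[0]
    rows = rng.integers(0, m, size=n)        # hash bucket for each input row
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign for each input row
    SZ = np.zeros((m, Z.shape[1]))
    Sb = np.zeros(m)
    np.add.at(SZ, rows, signs[:, None] * Z)  # accumulate signed rows per bucket
    np.add.at(Sb, rows, signs * b)
    return SZ, Sb

rng = np.random.default_rng(0)
n, k = 20000, 50
Z = rng.standard_normal((n, k))
b = Z @ rng.standard_normal(k) + 0.1 * rng.standard_normal(n)

# Exact least-squares solution.
x_exact, *_ = np.linalg.lstsq(Z, b, rcond=None)

# Sketched solve: sketch size of 10% of the rows.
m = n // 10
SZ, Sb = countsketch(Z, b, m, rng)
x_sketch, *_ = np.linalg.lstsq(SZ, Sb, rcond=None)

# Relative suboptimality of the residual, as in Figure 1 (b).
r_opt = np.linalg.norm(Z @ x_exact - b)
r_skt = np.linalg.norm(Z @ x_sketch - b)
print("relative suboptimality:", (r_skt - r_opt) / r_opt)
```

Because each input row touches a single output row, applying the sketch is cheap, and increasing m trades running time for a smaller residual gap, mirroring the trade-off between panels (a) and (b) of Figure 1.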