{"title": "Gaussian Quadrature for Kernel Features", "book": "Advances in Neural Information Processing Systems", "page_first": 6107, "page_last": 6117, "abstract": "Kernel methods have recently attracted resurgent interest, showing performance competitive with deep neural networks in tasks such as speech recognition. The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that $O(\\epsilon^{-2})$ samples are required to achieve an approximation error of at most $\\epsilon$. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any $\\gamma > 0$, to achieve error $\\epsilon$ with $O(e^{e^\\gamma} + \\epsilon^{-1/\\gamma})$ samples as $\\epsilon$ goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.", "full_text": "Gaussian Quadrature for Kernel Features\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nChristopher De Sa\n\nCornell University\nIthaca, NY 14853\n\ncdesa@cs.cornell.edu\n\nTri Dao\n\nStanford University\nStanford, CA 94305\ntrid@stanford.edu\n\nChristopher R\u00e9\n\nDepartment of Computer Science\n\nStanford University\nStanford, CA 94305\n\nchrismre@cs.stanford.edu\n\nAbstract\n\nKernel methods have recently attracted resurgent interest, showing performance\ncompetitive with deep neural networks in tasks such as speech recognition. 
The random Fourier features map is a technique commonly used to scale up kernel machines, but employing the randomized feature map means that $O(\epsilon^{-2})$ samples are required to achieve an approximation error of at most $\epsilon$. We investigate some alternative schemes for constructing feature maps that are deterministic, rather than random, by approximating the kernel in the frequency domain using Gaussian quadrature. We show that deterministic feature maps can be constructed, for any $\gamma > 0$, to achieve error $\epsilon$ with $O(e^{e^\gamma} + \epsilon^{-1/\gamma})$ samples as $\epsilon$ goes to 0. Our method works particularly well with sparse ANOVA kernels, which are inspired by the convolutional layer of CNNs. We validate our methods on datasets in different domains, such as MNIST and TIMIT, showing that deterministic features are faster to generate and achieve accuracy comparable to the state-of-the-art kernel methods based on random Fourier features.

1 Introduction

Kernel machines are frequently used to solve a wide variety of problems in machine learning [26]. They have gained resurgent interest and have recently been shown [13, 18, 21, 19, 22] to be competitive with deep neural networks in some tasks such as speech recognition on large datasets. A kernel machine is one that handles input $x_1, \ldots, x_n$, represented as vectors in $\mathbb{R}^d$, only in terms of some kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ of pairs of data points $k(x_i, x_j)$.
This representation is attractive for classification problems because one can learn non-linear decision boundaries directly on the input without having to extract features before training a linear classifier.

One well-known downside of kernel machines is the fact that they scale poorly to large datasets. Naive kernel methods, which operate on the Gram matrix $G_{i,j} = k(x_i, x_j)$ of the data, can take a very long time to run because the Gram matrix itself requires $O(n^2)$ space and many operations on it (e.g., the singular value decomposition) take up to $O(n^3)$ time. Rahimi and Recht [23] proposed a solution to this problem: approximating the kernel with an inner product in a higher-dimensional space. Specifically, they suggest constructing a feature map $z : \mathbb{R}^d \to \mathbb{R}^D$ such that $k(x, y) \approx \langle z(x), z(y) \rangle$. This approximation enables kernel machines to use scalable linear methods for solving classification problems and to avoid the pitfalls of naive kernel methods by not materializing the Gram matrix.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In the case of shift-invariant kernels, one technique that was proposed for constructing the function $z$ is random Fourier features [23]. This data-independent method approximates the Fourier transform integral (1) of the kernel by averaging Monte-Carlo samples, which allows for arbitrarily-good estimates of the kernel function $k$. Rahimi and Recht [23] proved that if the feature map has dimension $D = \tilde{\Omega}\left(d/\epsilon^2\right)$ then, with constant probability, the approximation $\langle z(x), z(y) \rangle$ is uniformly $\epsilon$-close to the true kernel on a bounded set. While the random Fourier features method has proven to be effective in solving practical problems, it comes with some caveats.
Most importantly, the accuracy guarantees are only probabilistic, and there is no way to easily check, for a particular random sample, whether the desired accuracy is achieved.

Our aim is to understand to what extent randomness is necessary to approximate a kernel. We thus propose a fundamentally different scheme for constructing the feature map $z$. While still approximating the kernel's Fourier transform integral (1) with a discrete sum, we select the sample points and weights deterministically. This gets around the issue of probabilistic-only guarantees by removing the randomness from the algorithm. For small dimension, deterministic maps yield significantly lower error. As the dimension increases, some random sampling may become necessary, and our theoretical insights provide a new approach to sampling. Moreover, for a particular class of kernels called sparse ANOVA kernels (also known as convolutional kernels, as they are similar to the convolutional layer in CNNs), which have shown state-of-the-art performance in speech recognition [22], deterministic maps require fewer samples than random Fourier features, both in terms of the desired error and the kernel size. We make the following contributions:

• In Section 3, we describe how to deterministically construct a feature map $z$ for the class of subgaussian kernels (which can approximate any kernel well) that has exponentially small (in $D$) approximation error.
• In Section 4, for sparse ANOVA kernels, we show that our method produces good estimates using only $O(d)$ samples, whereas random Fourier features requires $O(d^3)$ samples.
• In Section 5, we validate our results experimentally. We demonstrate that, for real classification problems on the MNIST and TIMIT datasets, our method combined with random sampling yields up to 3 times lower kernel approximation error.
With sparse ANOVA kernels, our method slightly improves classification accuracy compared to the state-of-the-art kernel methods based on random Fourier features (which have already been shown to match the performance of deep neural networks), all while speeding up the feature generation process.

2 Related Work

Much work has been done on extracting features for kernel methods. The random Fourier features method has been analyzed in the context of several learning algorithms, and its generalization error has been characterized and compared to that of other kernel-based algorithms [24]. It has also been compared to the Nyström method [35], which is data-dependent and thus can sometimes outperform random Fourier features. Other recent work has analyzed the generalization performance of the random Fourier features algorithm [17], and improved the bounds on its maximum error [29, 31].

While we focus here on deterministic approximations to the Fourier transform integral and compare them to Monte Carlo estimates, these are not the only two methods available to us. A possible middle-ground method is quasi-Monte Carlo estimation, in which low-discrepancy sequences, rather than the fully-random samples of Monte Carlo estimation, are used to approximate the integral. This approach was analyzed in Yang et al. [34] and shown to achieve an asymptotic error of $\epsilon = O\left(D^{-1} (\log D)^d\right)$. While this is asymptotically better than the random Fourier features method, the complexity of the quasi-Monte Carlo method, coupled with its larger constant factors, prevents it from being strictly better than its predecessor. Our method still requires asymptotically fewer samples as $\epsilon$ goes to 0.

Our deterministic approach here takes advantage of a long line of work on numerical quadrature for estimating integrals.
Bach [1] analyzed in detail the connection between quadrature and random feature expansions, thus deriving bounds for the number of samples required to achieve a given average approximation error (though they did not present complexity results regarding maximum error, nor suggest new feature maps). This connection allows us to leverage longstanding deterministic numerical integration methods such as Gaussian quadrature [6, 33] and sparse grids [2].

Unlike many other kernels used in machine learning, such as the Gaussian kernel, the sparse ANOVA kernel allows us to encode prior information about the relationships among the input variables into the kernel itself. Sparse ANOVA kernels have been shown [30] to work well for many classification tasks, especially in structural modeling problems that benefit from both the good generalization of a kernel machine and the representational advantage of a sparse model [9].

3 Kernels and Quadrature

We start with a brief overview of kernels. A kernel function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ encodes the similarity between pairs of examples. In this paper, we focus on shift-invariant kernels (those which satisfy $k(x, y) = k(x - y)$, where we overload the definition of $k$ to also refer to a function $k : \mathbb{R}^d \to \mathbb{R}$) that are positive definite and properly scaled. A kernel is positive definite if its Gram matrix is always positive definite for all non-trivial inputs, and it is properly scaled if $k(x, x) = 1$ for all $x$. In this setting, our results make use of a theorem [25] that also provides the "key insight" behind the random Fourier features method.

Theorem 1 (Bochner's theorem).
A continuous shift-invariant properly-scaled kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is positive definite if and only if $k$ is the Fourier transform of a proper probability distribution.

We can then write $k$ in terms of its Fourier transform $\Lambda$ (which is a proper probability distribution):

$$k(x - y) = \int_{\mathbb{R}^d} \Lambda(\omega) \exp(j\omega^\top (x - y)) \, d\omega. \tag{1}$$

For $\omega$ distributed according to $\Lambda$, this is equivalent to writing

$$k(x - y) = \mathbb{E}\left[\exp(j\omega^\top (x - y))\right] = \mathbb{E}\left[\langle \exp(j\omega^\top x), \exp(j\omega^\top y) \rangle\right],$$

where we use the usual Hermitian inner product $\langle x, y \rangle = \sum_i x_i \bar{y}_i$. The random Fourier features method proceeds by estimating this expected value using Monte Carlo sampling averaged across $D$ random selections of $\omega$. Equivalently, we can think of this as approximating (1) with a discrete sum at randomly selected sample points.

Our objective is to choose some points $\omega_i$ and weights $a_i$ to uniformly approximate the integral (1) with $\tilde{k}(x - y) = \sum_{i=1}^D a_i \exp(j\omega_i^\top (x - y))$. To obtain a feature map $z : \mathbb{R}^d \to \mathbb{C}^D$ where $\tilde{k}(x - y) = \sum_{i=1}^D a_i z_i(x) \bar{z}_i(y)$, we can define

$$z(x) = \left[\sqrt{a_1} \exp(j\omega_1^\top x) \quad \ldots \quad \sqrt{a_D} \exp(j\omega_D^\top x)\right]^\top.$$

We aim to bound the maximum error for $x, y$ in a region $\mathcal{M}$ with diameter $M = \sup_{x, y \in \mathcal{M}} \|x - y\|$:

$$\epsilon = \sup_{(x, y) \in \mathcal{M}} \left|k(x - y) - \tilde{k}(x - y)\right| = \sup_{\|u\| \le M} \left|\int_{\mathbb{R}^d} \Lambda(\omega) e^{j\omega^\top u} \, d\omega - \sum_{i=1}^D a_i e^{j\omega_i^\top u}\right|. \tag{2}$$

A quadrature rule is a choice of $\omega_i$ and $a_i$ to minimize this maximum error.
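As an illustrative aside (not from the paper), the feature map $z$ defined above can be sketched in a few lines. The function name and the sanity check below are our own; random Fourier features arise as the special case of Monte Carlo points with equal weights $a_i = 1/D$:

```python
import numpy as np

def quadrature_feature_map(X, omegas, weights):
    """z(x)_i = sqrt(a_i) * exp(j * omega_i^T x), so that
    k(x - y) ~= <z(x), z(y)> under the Hermitian inner product."""
    proj = X @ omegas.T                    # (n, D) matrix of omega_i^T x
    return np.sqrt(weights) * np.exp(1j * proj)

# Sanity check: Monte Carlo points (i.e., random Fourier features) for the
# Gaussian kernel k(u) = exp(-||u||^2 / 2), whose spectrum is N(0, I).
rng = np.random.default_rng(0)
d, D = 3, 5000
omegas = rng.standard_normal((D, d))       # samples from the spectrum
weights = np.full(D, 1.0 / D)              # equal quadrature weights
x, y = rng.standard_normal(d), rng.standard_normal(d)
Z = quadrature_feature_map(np.stack([x, y]), omegas, weights)
approx = np.real(np.vdot(Z[1], Z[0]))      # Hermitian <z(x), z(y)>
exact = np.exp(-0.5 * np.linalg.norm(x - y) ** 2)
```

With deterministic rules, only `omegas` and `weights` change; the map itself is identical.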
To evaluate a quadrature rule, we are concerned with the sample complexity (for a fixed diameter $M$).

Definition 1. For any $\epsilon > 0$, a quadrature rule has sample complexity $D_{SC}(\epsilon) = D$, where $D$ is the smallest value such that the rule, when instantiated with $D$ samples, has maximum error at most $\epsilon$.

We will now examine ways to construct deterministic quadrature rules and their sample complexities.

3.1 Gaussian Quadrature

Gaussian quadrature is one of the most popular techniques in one-dimensional numerical integration. The main idea is to approximate integrals of the form $\int \Lambda(\omega) f(\omega) \, d\omega \approx \sum_{i=1}^D a_i f(\omega_i)$ such that the approximation is exact for all polynomials below a certain degree; $D$ points are sufficient for polynomials of degree up to $2D - 1$. While the points and weights used by Gaussian quadrature depend both on the distribution $\Lambda$ and the parameter $D$, they can be computed efficiently using orthogonal polynomials [10, 32].
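As a concrete sketch (ours, not the paper's), when the one-dimensional spectrum is the standard normal density, the Gaussian quadrature rule is the classical Gauss-Hermite rule after a change of variables, and `numpy` provides the nodes and weights directly:

```python
import numpy as np

def gauss_hermite_rule(L):
    """One-dimensional Gaussian quadrature for the standard normal density.

    Returns points w_i and weights a_i with sum_i a_i f(w_i) ~= E_{w~N(0,1)}[f(w)],
    exact for all polynomials f of degree up to 2L - 1.
    """
    t, w = np.polynomial.hermite.hermgauss(L)    # rule for weight e^{-t^2}
    return np.sqrt(2.0) * t, w / np.sqrt(np.pi)  # change of variables to N(0, 1)

points, weights = gauss_hermite_rule(5)
# With L = 5 points the rule integrates monomials up to degree 9 exactly;
# for a standard normal, E[w^2] = 1 and E[w^4] = 3.
m2 = np.sum(weights * points ** 2)
m4 = np.sum(weights * points ** 4)
```

The change of variables $\omega = \sqrt{2}\,t$ converts the Gauss-Hermite weight $e^{-t^2}$ into the $N(0,1)$ density, which is why the weights are divided by $\sqrt{\pi}$.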
Gaussian quadrature produces accurate results when integrating functions that are well-approximated by polynomials, which include all subgaussian densities.

[Figure 1: Error comparison (empirical maximum over 10^6 uniformly-distributed samples) of different quadrature schemes and the random Fourier features method, plotted against the region diameter M: (a) polynomially-exact quadrature vs. RFFs, (b) sparse grid quadrature vs. RFFs, (c) subsampled dense grid vs. RFFs.]

Definition 2 (Subgaussian Distribution). We say that a distribution $\Lambda : \mathbb{R}^d \to \mathbb{R}$ is subgaussian with parameter $b$ if for $X \sim \Lambda$ and for all $t \in \mathbb{R}^d$, $\mathbb{E}\left[\exp(\langle t, X \rangle)\right] \le \exp\left(\frac{1}{2} b^2 \|t\|^2\right)$.

We subsequently assume that the distribution $\Lambda$ is subgaussian, which is a technical restriction compared to random Fourier features. Many of the kernels encountered in practice have subgaussian spectra, including the ubiquitous Gaussian kernel. More importantly, we can approximate any kernel by convolving it with the Gaussian kernel, resulting in a subgaussian kernel.
The approximation error can be made much smaller than the inherent noise in the data generation process.

3.2 Polynomially-Exact Rules

Since Gaussian quadrature is so successful in one dimension, as is commonly done in the numerical analysis literature [14], we might consider using quadrature rules that are multidimensional analogues of Gaussian quadrature — rules that are accurate for all polynomials up to a certain degree $R$. In higher dimensions, this is equivalent to saying that our quadrature rule satisfies

$$\int_{\mathbb{R}^d} \Lambda(\omega) \prod_{l=1}^d (e_l^\top \omega)^{r_l} \, d\omega = \sum_{i=1}^D a_i \prod_{l=1}^d (e_l^\top \omega_i)^{r_l} \quad \text{for all } r \in \mathbb{N}^d \text{ such that } \sum_l r_l \le R, \tag{3}$$

where $e_l$ are the standard basis vectors.

To test the accuracy of polynomially-exact quadrature, we constructed a feature map for a Gaussian kernel, $\Lambda(\omega) = (2\pi)^{-d/2} \exp\left(-\frac{1}{2}\|\omega\|^2\right)$, in $d = 25$ dimensions with $D = 1000$ and accurate for all polynomials up to degree $R = 2$. In Figure 1a, we compared this to a random Fourier features rule with the same number of samples, over a range of region diameters $M$ that captures most of the data points in practice (as the kernel is properly scaled). For small regions in particular, a polynomially-exact scheme can have a significantly lower error than a random Fourier feature map.

This experiment motivates us to investigate theoretical bounds on the behavior of this method. For subgaussian kernels, it is straightforward to bound the maximum error of a polynomially-exact feature map using the Taylor series approximation of the exponential function in (2).

Theorem 2. Let $k$ be a kernel with $b$-subgaussian spectrum, and let $\tilde{k}$ be its estimation under some quadrature rule with non-negative weights that is exact up to some even degree $R$. Let $\mathcal{M} \subset \mathbb{R}^d$ be some region of diameter $M$.
Then, for all $x, y \in \mathcal{M}$, the error of the quadrature features approximation is bounded by

$$\left|k(x - y) - \tilde{k}(x - y)\right| \le 3 \left(\frac{e b^2 M^2}{R}\right)^{R/2}.$$

All the proofs are found in the Appendix.

To bound the sample complexity of polynomially-exact quadrature, we need to determine how many quadrature samples we will need to satisfy the conditions of Theorem 2. There are $\binom{d+R}{d}$ constraints in (3), so a series of polynomially-exact quadrature rules that use only about this many sample points can yield a bound on the sample complexity of this quadrature rule.

Corollary 1. Assume that we are given a class of feature maps that satisfy the conditions of Theorem 2, and that all have a number of samples $D \le \beta \binom{d+R}{d}$ for some fixed constant $\beta$. Then, for any $\gamma > 0$, the sample complexity of feature maps in this class can be bounded by

$$D(\epsilon) \le \beta 2^d \max\left(\exp\left(e^{2\gamma+1} b^2 M^2\right), \left(\frac{3}{\epsilon}\right)^{1/\gamma}\right).$$

In particular, for a fixed dimension $d$, this means that for any $\gamma$, $D(\epsilon) = O\left(\epsilon^{-1/\gamma}\right)$.

The result of this corollary implies that, in terms of the desired error $\epsilon$, the sample complexity increases asymptotically slower than any negative power of $\epsilon$. Compared to the result for random Fourier features, which had $D(\epsilon) = O(\epsilon^{-2})$, this has a much weaker dependence on $\epsilon$.
While this weaker dependence does come at the cost of an additional factor of $2^d$, it is a constant cost of operating in dimension $d$, and is not dependent on the error $\epsilon$.

The more pressing issue, when comparing polynomially-exact features to random Fourier features, is the fact that we have no way of efficiently constructing quadrature rules that satisfy the conditions of Theorem 2. One possible construction involves selecting random sample points $\omega_i$, and then solving (3) for the values of $a_i$ using a non-negative least squares (NNLS) algorithm. While this construction works in low dimensions — it is the method we used for the experiment in Figure 1a — it rapidly becomes infeasible to solve for higher values of $d$ and $R$.

We will now show how to overcome this issue by introducing quadrature rules that can be rapidly constructed using grid-based quadrature rules. These rules are constructed directly from products of a one-dimensional quadrature rule, such as Gaussian quadrature, and so avoid the construction-difficulty problems encountered in this section. Although grid-based quadrature rules can be constructed for any kernel function [2], they are easier to conceptualize when the kernel $k$ factors along the dimensions, as $k(u) = \prod_{i=1}^d k_i(u_i)$. For simplicity we will focus on this factorizable case.

3.3 Dense Grid Quadrature

The simplest way to do this is with a dense grid (also known as tensor product) construction. A dense grid construction starts by factoring the integral (1) into $k(u) = \prod_{i=1}^d \left(\int_{-\infty}^{\infty} \Lambda_i(\omega) \exp(j\omega e_i^\top u) \, d\omega\right)$, where $e_i$ are the standard basis vectors. Since each of the factors is an integral over a single dimension, we can approximate them all with a one-dimensional quadrature rule. In this paper, we focus on Gaussian quadrature, although we could also use other methods such as Clenshaw-Curtis [3].
Taking tensor products of the points and weights results in the dense grid quadrature. The detailed construction is given in Appendix A.

The individual Gaussian quadrature rules are exact for all polynomials up to degree $2L - 1$, so the dense grid is also accurate for all such polynomials. Theorem 2 then yields a bound on its sample complexity.

Corollary 2. Let $k$ be a kernel with a spectrum that is subgaussian with parameter $b$. Then, for any $\gamma > 0$, the sample complexity of dense grid features can be bounded by

$$D(\epsilon) \le \max\left(\exp\left(d e^{\gamma d} \, \frac{e b^2 M^2}{2}\right), \left(\frac{3}{\epsilon}\right)^{1/\gamma}\right).$$

In particular, as was the case with polynomially-exact features, for a fixed $d$, $D(\epsilon) = O\left(\epsilon^{-1/\gamma}\right)$.

Unfortunately, this scheme suffers heavily from the curse of dimensionality, since the sample complexity is doubly-exponential in $d$. This means that, even though they are easy to compute, dense grid features do not represent a useful solution to the issue posed in Section 3.2.

3.4 Sparse Grid Quadrature

The curse of dimensionality for quadrature in high dimensions has been studied in the numerical integration setting for decades. One of the more popular existing techniques for getting around the curse is called sparse grid or Smolyak quadrature [28], originally developed to solve partial differential equations. Instead of taking the tensor product of the one-dimensional quadrature rule, we only include points up to some fixed total level $A$, thus constructing a linear combination of dense grid quadrature rules that achieves a similar error with exponentially fewer points than a single larger quadrature rule. The detailed construction is given in Appendix B.
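To make the dense-grid (tensor product) construction described above concrete, here is a minimal sketch of our own, assuming a product $N(0, I_d)$ spectrum; the function name is hypothetical:

```python
import itertools
import numpy as np

def dense_grid_rule(d, L):
    """Tensor-product (dense grid) quadrature for a product N(0, I_d) spectrum.

    Forms the d-fold tensor product of a one-dimensional L-point Gauss-Hermite
    rule: L**d points, each weight the product of the 1-D weights.
    """
    t, w = np.polynomial.hermite.hermgauss(L)
    t, w = np.sqrt(2.0) * t, w / np.sqrt(np.pi)  # 1-D rule for N(0, 1)
    points = np.array(list(itertools.product(t, repeat=d)))            # (L**d, d)
    weights = np.prod(np.array(list(itertools.product(w, repeat=d))), axis=1)
    return points, weights

omegas, a = dense_grid_rule(d=2, L=3)   # 3**2 = 9 points in 2 dimensions
# The weights still sum to 1, and mixed moments factor: E[w1^2 w2^2] = 1.
total = a.sum()
mixed = np.sum(a * omegas[:, 0] ** 2 * omegas[:, 1] ** 2)
```

The $L^d$ growth of the point count is exactly the curse of dimensionality discussed above; the sparse grid keeps only a subset of these products, and we omit that construction here.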
Compared to polynomially-exact rules, sparse grid quadrature can be computed quickly and easily (see Algorithm 4.1 from Holtz [12]).

To measure the performance of sparse grid quadrature, we constructed a feature map for the same Gaussian kernel analyzed in the previous section, with $d = 25$ dimensions and up to level $A = 2$. We compared this to a random Fourier features rule with the same number of samples, $D = 1351$, and plot the results in Figure 1b. As was the case with polynomially-exact quadrature, this sparse grid scheme has tiny error for small-diameter regions, but this error unfortunately increases to be even larger than that of random Fourier features as the region diameter increases.

The sparse grid construction yields a bound on the sample count: $D \le 3^A \binom{d+A}{A}$, where $A$ is the bound on the total level. By extending known bounds on the error of Gaussian quadrature, we can similarly bound the error of the sparse grid feature method.

Theorem 3. Let $k$ be a kernel with a spectrum that is subgaussian with parameter $b$, and let $\tilde{k}$ be its estimation under the sparse grid quadrature rule up to level $A$. Let $\mathcal{M} \subset \mathbb{R}^d$ be some region of diameter $M$, and assume that $A \ge 24 e b^2 M^2$. Then, for all $x, y \in \mathcal{M}$, the error of the quadrature features approximation is bounded by

$$\left|k(x - y) - \tilde{k}(x - y)\right| \le 2^d \left(\frac{12 e b^2 M^2}{A}\right)^A.$$

This, along with our above upper bound on the sample count, yields a bound on the sample complexity.

Corollary 3. Let $k$ be a kernel with a spectrum that is subgaussian with parameter $b$.
Then, for any $\gamma > 0$, the sample complexity of sparse grid features can be bounded by

$$D(\epsilon) \le 2^d \max\left(\exp\left(24 e^{2\gamma+1} b^2 M^2\right), 2^{d/\gamma} \epsilon^{-1/\gamma}\right).$$

As was the case with all our previous deterministic feature maps, for a fixed $d$, $D(\epsilon) = O\left(\epsilon^{-1/\gamma}\right)$.

Subsampled grids. One of the downsides of the dense/sparse grids analyzed above is the difficulty of tuning the number of samples extracted in the feature map. As the only parameter we can typically set is the degree of polynomial exactness, even a small change in this (e.g., from 2 to 4) can produce a significant increase in the number of features. However, we can always subsample the grid points according to the distribution determined by their weights, both to tame the curse of dimensionality and to have fine-grained control over the number of samples. For simplicity, we focus on subsampling the dense grid. In Figure 1c, we compare the empirical errors of the subsampled dense grid and random Fourier features, noting that they are essentially the same across all diameters.

3.5 Reweighted Grid Quadrature

Both random Fourier features and dense/sparse grid quadratures are data-independent. We now describe a data-adaptive method to choose a quadrature for a pre-specified number of samples: reweighting the grid points to minimize the difference between the approximate and the exact kernel on a small subset of data. Adjusting the grid to the data distribution yields better kernel approximation. We approximate the kernel $k(x - y)$ with

$$\tilde{k}(x - y) = \sum_{i=1}^D a_i \exp(j\omega_i^\top (x - y)) = \sum_{i=1}^D a_i \cos(\omega_i^\top (x - y)),$$

where $a_i \ge 0$, as $k$ is real-valued. We first choose the set of potential grid points $\omega_1, \ldots, \omega_D$ by sampling from a dense grid of Gaussian quadrature points.
To solve for the weights $a_1, \ldots, a_D$, we independently sample $n$ pairs $(x_1, y_1), \ldots, (x_n, y_n)$ from the dataset, then minimize the empirical mean squared error (with variables $a_1, \ldots, a_D$):

minimize $\frac{1}{n} \sum_{l=1}^n \left(k(x_l - y_l) - \tilde{k}(x_l - y_l)\right)^2$
subject to $a_i \ge 0$, for $i = 1, \ldots, D$.

For an appropriately defined matrix $M$ and vector $b$, this is an NNLS problem of minimizing $\frac{1}{n} \|Ma - b\|^2$ subject to $a \ge 0$, with variable $a \in \mathbb{R}^D$. The solution is often sparse, due to the active elementwise constraints $a \ge 0$. Hence we can pick a larger set of potential grid points $\omega_1, \ldots, \omega_{D'}$ (with $D' > D$) and solve the above problem to obtain a smaller set of grid points (those with $a_j > 0$). To get an even sparser solution, we add an $\ell_1$-penalty term with parameter $\lambda \ge 0$:

minimize $\frac{1}{n} \|Ma - b\|^2 + \lambda \mathbf{1}^\top a$
subject to $a_i \ge 0$, for $i = 1, \ldots, D'$.

Bisecting on $\lambda$ yields the desired number of grid points.

As this is a data-dependent quadrature, we empirically evaluate its performance on the TIMIT dataset, which we will describe in more detail in Section 5. In Figure 2b, we compare the estimated root mean squared error on the dev set of different feature generation schemes against the number of features $D$ (mean and standard deviation over 10 runs). Random Fourier features, quasi-Monte Carlo (QMC) with the Halton sequence, and the subsampled dense grid have very similar approximation error, while reweighted quadrature has much lower approximation error. Reweighted quadrature achieves 2–3 times lower error for the same number of features, and requires 3–5 times fewer features for a fixed threshold of approximation error, compared to random Fourier features.
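The reweighting step above reduces to a standard NNLS fit. Here is a minimal sketch of our own, using SciPy's `nnls`; the one-dimensional Gaussian kernel and the `linspace` grid (a stand-in for Gaussian-quadrature points) are toy choices for illustration:

```python
import numpy as np
from scipy.optimize import nnls

def reweight_grid(omegas, X, Y, kernel):
    """Data-adaptive reweighting: fit non-negative weights a so that
    sum_i a_i cos(omega_i^T (x - y)) matches kernel(x - y) on sampled pairs.

    omegas: (D, d) candidate grid points; X, Y: (n, d) sampled input pairs;
    kernel: callable returning the exact kernel value k(u) for each row u.
    """
    U = X - Y                                  # (n, d) pair differences
    M = np.cos(U @ omegas.T)                   # (n, D) design matrix
    b = kernel(U)                              # (n,) exact kernel values
    a, _ = nnls(M, b)                          # least squares subject to a >= 0
    return a                                   # often sparse: many a_i end up 0

# Toy example: fit weights for a 1-D Gaussian kernel on random pairs.
rng = np.random.default_rng(0)
omegas = np.linspace(-3, 3, 25).reshape(-1, 1)
X, Y = rng.standard_normal((200, 1)), rng.standard_normal((200, 1))
gauss = lambda U: np.exp(-0.5 * np.sum(U ** 2, axis=1))
a = reweight_grid(omegas, X, Y, gauss)
resid = np.linalg.norm(np.cos((X - Y) @ omegas.T) @ a - gauss(X - Y))
```

The $\ell_1$-penalized variant in the text can be emulated by appending $\sqrt{\lambda}$ rows to `M` (one per weight) with zero targets, then bisecting on $\lambda$ to hit a desired number of nonzero weights.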
Moreover, reweighted features have extremely low variance, even though the weights are adjusted based only on a very small fraction of the dataset (500 samples out of 1 million data points).

Faster feature generation. Not only does grid-based quadrature yield better statistical performance than random Fourier features, it also has some notable systems benefits. Generating quadrature features requires a much smaller number of multiplies, as the grid points only take on a finite set of values across all dimensions (assuming an isotropic kernel). For example, a Gaussian quadrature that is exact up to polynomials of degree 21 only requires 11 grid points for each dimension. To generate the features, we multiply the input with these 11 numbers before adding the results to form the deterministic features. The savings in multiplies may be particularly significant in architectures such as application-specific integrated circuits (ASICs). In our experiment on the TIMIT dataset in Section 5, this specialized matrix multiplication procedure (on CPU) cuts the feature generation time in half.

4 Sparse ANOVA Kernels

One type of kernel that is commonly used in machine learning, for example in structural modeling, is the sparse ANOVA kernel [11, 8]. These kernels are also called convolutional kernels, as they operate similarly to the convolutional layer in CNNs, and they have achieved state-of-the-art performance on large real-world datasets [18, 22], as we will see in Section 5. A kernel of this type can be written as

$$k(x, y) = \sum_{S \in \mathcal{S}} \prod_{i \in S} k_1(x_i - y_i),$$

where $\mathcal{S}$ is a set of subsets of the variables in $\{1, \ldots, d\}$, and $k_1$ is a one-dimensional kernel. (Straightforward extensions, which we will not discuss here, include using different one-dimensional kernels for each element of the products, and weighting the sum.) Sparse ANOVA kernels are used to encode sparse dependencies among the variables: two variables are related if they appear together in some $S \in \mathcal{S}$.
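As a small illustration of the ANOVA kernel definition above (our own toy example; the subsets below are overlapping "patches", loosely analogous to a convolutional layer's local receptive fields):

```python
import numpy as np

def anova_kernel(x, y, subsets, k1):
    """Sparse ANOVA kernel: k(x, y) = sum_{S in subsets} prod_{i in S} k1(x_i - y_i).

    subsets: list of index tuples (the hyperedges S); k1: one-dimensional kernel.
    """
    return sum(np.prod([k1(x[i] - y[i]) for i in S]) for S in subsets)

# Toy example with a 1-D Gaussian base kernel on d = 5 inputs.
k1 = lambda u: np.exp(-0.5 * u ** 2)
subsets = [(0, 1, 2), (1, 2, 3), (2, 3, 4)]   # rank r = 3, size m = 3
x = np.zeros(5)
val = anova_kernel(x, x, subsets, k1)          # k(x, x) = m, since k1(0) = 1
y2 = np.full(5, 1.0)
val_xy = anova_kernel(x, y2, subsets, k1)      # strictly smaller than k(x, x)
```

Note that $k(x, x) = m$ rather than 1 here, since the unweighted sum over hyperedges is not normalized.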
These sparse dependencies are typically problem-specific: each $S$ could correspond to a factor in the graph if we are analyzing a distribution modeled with a factor graph. Equivalently, we can think of the set $\mathcal{S}$ as a hypergraph, where each $S \in \mathcal{S}$ corresponds to a hyperedge. Using this notion, we define the rank of an ANOVA kernel to be $r = \max_{S \in \mathcal{S}} |S|$, the degree as $\Delta = \max_{i \in \{1, \ldots, d\}} |\{S \in \mathcal{S} \mid i \in S\}|$, and the size of the kernel to be the number of hyperedges $m = |\mathcal{S}|$. For sparse models, it is common for both the rank and the degree to be small, even as the number of dimensions $d$ becomes large, so $m = O(d)$. This is the case we focus on in this section.

It is straightforward to apply the random Fourier features method to construct feature maps for ANOVA kernels: construct feature maps for each of the (at most $r$-dimensional) sub-kernels $k_S(x - y) = \prod_{i \in S} k_1(x_i - y_i)$ individually, and then combine the results. To achieve overall error $\epsilon$, it suffices for each of the sub-kernel feature maps to have error $\epsilon/m$; this can be achieved by random Fourier features using $D_S = \tilde{\Omega}\left(r(\epsilon m^{-1})^{-2}\right) = \tilde{\Omega}\left(r m^2 \epsilon^{-2}\right)$ samples each, where the notation $\tilde{\Omega}$ hides the $\log(1/\epsilon)$ factor. Summed across all the $m$ sub-kernels, this means that the random Fourier features map can achieve error $\epsilon$ with constant probability with a sample complexity of $D(\epsilon) = \tilde{\Omega}\left(r m^3 \epsilon^{-2}\right)$ samples. While it is nice to be able to tackle this problem using random features, the cubic dependence on $m$ in this expression is undesirable: it is significantly larger than the $D = \tilde{\Omega}(d\epsilon^{-2})$ we get in the non-ANOVA case.

Can we construct a deterministic feature map that has a better error bound?
It turns out that we can.

Theorem 4. Assume that we use polynomially-exact quadrature to construct features for each of the sub-kernels $k_S$, under the conditions of Theorem 2, and then combine the resulting feature maps to produce a feature map for the full ANOVA kernel. For any $\gamma > 0$, the sample complexity of this method is
$$D(\epsilon) \le \beta m 2^r \max\!\left( \exp\!\left(e^{2\gamma+1} b^2 M^2\right),\; (3\Delta)^{1/\gamma} \epsilon^{-1/\gamma} \right).$$

Compared to random Fourier features, this rate depends only linearly on $m$. For fixed parameters $\beta$, $b$, $M$, $\Delta$, $r$, and for any $\gamma > 0$, we can bound the sample complexity as $D(\epsilon) = O(m \epsilon^{-1/\gamma})$, which is better than random Fourier features both in terms of the kernel size $m$ and the desired error $\epsilon$.

5 Experiments

To evaluate the performance of deterministic feature maps, we analyzed the accuracy of a sparse ANOVA kernel on the MNIST digit classification task [16] and the TIMIT speech recognition task [5].

Digit classification on MNIST  This task consists of 70,000 examples (60,000 in the training dataset and 10,000 in the test dataset) of hand-written digits which need to be classified. Each example is a 28 × 28 gray-scale image. Clever kernel-based SVM techniques are known to achieve very low error rates (e.g., 0.79%) on this problem [20]. We do not attempt to compare ourselves with these rates; rather, we compare random Fourier features and subsampled dense grid features that both approximate the same ANOVA kernel. The ANOVA kernel we construct is designed to have a similar structure to the first layer of a convolutional neural network [27].
Just as a filter is run on each 5 × 5 square of the image, for our ANOVA kernel, each of the sub-kernels is chosen to run on a 5 × 5 square of the original image (note that there are many, (28 − 5 + 1)² = 576, such squares). We choose the simple Gaussian kernel as our one-dimensional kernel.

Figure 2a compares the dense grid subsampling method to random Fourier features across a range of feature counts. The deterministic feature map with subsampling performs better than the random Fourier feature map across most large feature counts, although its performance degrades for very small feature counts. The deterministic feature map is also somewhat faster to compute, taking 320 seconds for the 28,800-feature map vs. 384 seconds for the random Fourier features, a savings of 17%.

Speech recognition on TIMIT  This task requires producing accurate transcripts from raw audio recordings of conversations in English, involving 630 speakers, for a total of 5.4 hours of speech. We use the kernel features in the acoustic modeling step of speech recognition. Each data point corresponds to a frame (10 ms) of audio data, preprocessed using the standard feature-space Maximum Likelihood Linear Regression (fMLLR) [4]. The input x has dimension 40. After generating kernel features z(x) from this input, we model the corresponding phonemes y by a multinomial logistic regression model. Again, we use a sparse ANOVA kernel, which is a sum of 50 sub-kernels of the form $\exp(-\gamma \|x_S - y_S\|^2)$, each acting on a subset S of 5 indices. These subsets are randomly chosen a priori. To reweight the quadrature features, we sample 500 data points out of 1 million. We plot the phone error rates (PER) of a speech recognizer trained based on different feature generation schemes against the number of features D in Figure 2c (mean and standard deviation over 10 runs).
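As a sketch of the quadrature idea behind these features (a one-dimensional illustration assuming a unit-bandwidth Gaussian kernel; it is not the multidimensional subsampled or reweighted construction evaluated in these experiments), Gauss-Hermite nodes replace the random frequencies of random Fourier features with deterministic ones:

```python
import numpy as np

def gauss_hermite_features(x, n_nodes=11):
    """Deterministic Fourier features for the 1-D Gaussian kernel
    k(x, y) = exp(-(x - y)^2 / 2) = E_{w ~ N(0,1)}[cos(w (x - y))],
    with the expectation computed by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    freqs = np.sqrt(2.0) * nodes   # change of variables to the N(0, 1) density
    w = weights / np.sqrt(np.pi)   # normalized quadrature weights (sum to 1)
    # One cosine and one sine feature per node, so that the inner product
    # z(x) . z(y) collapses to sum_j w_j * cos(freqs_j * (x - y)).
    return np.concatenate([np.sqrt(w) * np.cos(freqs * x),
                           np.sqrt(w) * np.sin(freqs * x)])

x, y = 0.3, 1.1
approx = gauss_hermite_features(x) @ gauss_hermite_features(y)
exact = np.exp(-(x - y) ** 2 / 2)
print(approx, exact)  # the two values agree closely
```

With 11 nodes the rule is exact for polynomials up to degree 21, consistent with the grid sizes mentioned earlier, and the frequencies take only 11 distinct values, which is what makes the specialized matrix multiplication cheap.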
Again, subsampled dense grid performs similarly to random Fourier features, QMC yields slightly higher error, while reweighted features achieve slightly lower phone error rates. All four methods have relatively high variability in their phone error rates due to the stochastic nature of the training and decoding steps in the speech recognition pipeline. The quadrature-based features (subsampled dense grids and reweighted quadrature) are about twice as fast to generate, compared to random Fourier features, due to the small number of multiplies required. We use the same setup as May et al. [22], and the performance here matches both that of random Fourier features and deep neural networks in May et al. [22].

[Figure 2: Performance of different feature generation schemes on MNIST and TIMIT. (a) Test accuracy on MNIST, comparing random Fourier features and subsampled dense grid across feature counts. (b) Kernel RMS approximation error on TIMIT for random Fourier, quasi-Monte Carlo, subsampled dense grid, and reweighted quadrature features. (c) Phone error rate on TIMIT for the same four methods.]

6 Conclusion

We presented deterministic feature maps for kernel machines.
We showed that we can achieve better scaling in the desired accuracy $\epsilon$ compared to the state-of-the-art method, random Fourier features. We described several ways to construct these feature maps, including polynomially-exact quadrature, dense grid construction, sparse grid construction, and reweighted grid construction. Our results apply well to the case of sparse ANOVA kernels, achieving significant improvements (in the dependency on the dimension d) over random Fourier features. Finally, we evaluated our results experimentally, and showed that ANOVA kernels with deterministic feature maps can produce accuracy comparable to the state-of-the-art methods based on random Fourier features on real datasets.

ANOVA kernels are an example of how structure can be used to define better kernels. Resembling the convolutional layers of convolutional neural networks, they induce the necessary inductive bias in the learning process. Given CNNs' recent success in domains besides images, such as sentence classification [15] and machine translation [7], we hope that our work on deterministic feature maps will enable kernel methods such as ANOVA kernels to find new areas of application.

Acknowledgments

This material is based on research sponsored by the Defense Advanced Research Projects Agency (DARPA) under agreement number FA8750-17-2-0095. We gratefully acknowledge the support of the DARPA SIMPLEX program under No. N66001-15-C-4043, DARPA FA8750-12-2-0335 and FA8750-13-2-0039, DOE 108845, the National Institutes of Health (NIH) under U54EB020405, the National Science Foundation (NSF) under award No. CCF-1563078, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, the Moore Foundation, the Okawa Research Grant, American Family Insurance, Accenture, Toshiba, and Intel.
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project: Intel, Microsoft, Teradata, and VMware. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA or the U.S. Government. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, AFRL, NSF, NIH, ONR, or the U.S. government.

References

[1] Francis Bach. On the equivalence between quadrature rules and random features. arXiv preprint arXiv:1502.06800, 2015.

[2] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. Acta Numerica, 13:147–269, 2004.

[3] Charles W Clenshaw and Alan R Curtis. A method for numerical integration on an automatic computer. Numerische Mathematik, 2(1):197–205, 1960.

[4] Mark JF Gales. Maximum likelihood linear transformations for HMM-based speech recognition. Computer Speech & Language, 12(2):75–98, 1998.

[5] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus CDROM, 1993. URL http://www.ldc.upenn.edu/Catalog/LDC93S1.html.

[6] Carl Friedrich Gauss. Methodus nova integralium valores per approximationem inveniendi. Apvd Henricvm Dieterich, 1815.

[7] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.

[8] S. R. Gunn and J. S. Kandola. Structural modelling with sparse kernels. Machine Learning, 48(1-3):137–163, July 2002. ISSN 0885-6125, 1573-0565.
doi: 10.1023/A:1013903804720.\nURL https://link.springer.com/article/10.1023/A:1013903804720.\n\n[9] Steve R. Gunn and Jaz S. Kandola. Structural modelling with sparse kernels. Machine learning,\n\n48(1-3):137\u2013163, 2002.\n\n[10] Nicholas Hale and Alex Townsend. Fast and accurate computation of Gauss\u2013Legendre and\nGauss\u2013Jacobi quadrature nodes and weights. SIAM Journal on Scienti\ufb01c Computing, 35(2):\nA652\u2013A674, 2013.\n\n[11] Thomas Hofmann, Bernhard Sch\u00f6lkopf, and Alexander J Smola. Kernel methods in machine\n\nlearning. The annals of statistics, pages 1171\u20131220, 2008.\n\n[12] Markus Holtz. Sparse grid quadrature in high dimensions with applications in \ufb01nance and\n\ninsurance, volume 77. Springer Science & Business Media, 2010.\n\n[13] Po-Sen Huang, Haim Avron, Tara N Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran.\nKernel methods match deep neural networks on TIMIT. In Acoustics, Speech and Signal\nProcessing (ICASSP), 2014 IEEE International Conference on, pages 205\u2013209. IEEE, 2014.\n\n[14] Eugene Isaacson and Herbert Bishop Keller. Analysis of numerical methods. Courier Corpora-\n\ntion, 1994.\n\n[15] Yoon Kim. Convolutional neural networks for sentence classi\ufb01cation. In Proceedings of the\n2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages\n1746\u20131751.\n\n[16] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\n\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[17] Ming Lin, Shifeng Weng, and Changshui Zhang. On the sample complexity of random Fourier\nfeatures for online learning: How many random Fourier features do we need? ACM Trans.\nKnowl. Discov. Data, 2014.\n\n[18] Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aur\u00e9lien Bellet, Linxi\nFan, Michael Collins, Brian Kingsbury, Michael Picheny, and Fei Sha. 
How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, November 2014. URL http://arxiv.org/abs/1411.4000.

[19] Zhiyun Lu, Dong Guo, Alireza Bagheri Garakani, Kuan Liu, Avner May, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, et al. A comparison between deep neural nets and kernel acoustic models for speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 5070–5074. IEEE, 2016.

[20] Subhransu Maji and Jitendra Malik. Fast and accurate digit classification. EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-159, 2009.

[21] Avner May, Michael Collins, Daniel Hsu, and Brian Kingsbury. Compact kernel models for acoustic modeling via random feature selection. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 2424–2428. IEEE, 2016.

[22] Avner May, Alireza Bagheri Garakani, Zhiyun Lu, Dong Guo, Kuan Liu, Aurélien Bellet, Linxi Fan, Michael Collins, Daniel Hsu, Brian Kingsbury, et al. Kernel approximation methods for speech recognition. arXiv preprint arXiv:1701.03577, 2017.

[23] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2007.

[24] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313–1320, 2009.

[25] Walter Rudin. Fourier analysis on groups. Number 12. John Wiley & Sons, 1990.

[26] Bernhard Schölkopf and Alexander J Smola. Learning with kernels: Support vector machines, regularization, optimization, and beyond. MIT Press, 2002.

[27] Patrice Y Simard, Dave Steinkraus, and John C Platt.
Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, page 958. IEEE, 2003.

[28] S. A. Smolyak. Quadrature and interpolation formulas for tensor products of certain classes of functions. Dokl. Akad. Nauk SSSR, 148(5):1042–1053, 1963. Transl.: Soviet Math. Dokl. 4:240–243, 1963.

[29] Bharath Sriperumbudur and Zoltan Szabo. Optimal rates for random Fourier features. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc., 2015.

[30] M Stitson, Alex Gammerman, Vladimir Vapnik, Volodya Vovk, Chris Watkins, and Jason Weston. Support vector regression with ANOVA decomposition kernels. Advances in kernel methods: support vector learning, pages 285–292, 1999.

[31] Dougal J. Sutherland and Jeff Schneider. On the error of random Fourier features. In Proceedings of the 31st Annual Conference on Uncertainty in Artificial Intelligence (UAI-15). AUAI Press, 2015.

[32] Alex Townsend, Thomas Trogdon, and Sheehan Olver. Fast computation of Gauss quadrature nodes and weights on the whole real line. IMA Journal of Numerical Analysis, page drv002, 2015.

[33] Lloyd N Trefethen. Is Gauss quadrature better than Clenshaw–Curtis? SIAM Review, 50(1):67–87, 2008.

[34] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of The 31st International Conference on Machine Learning (ICML-14), pages 485–493, 2014.

[35] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison.
In Advances in Neural Information Processing Systems, pages 476–484, 2012.