{"title": "Orthogonal Random Features", "book": "Advances in Neural Information Processing Systems", "page_first": 1975, "page_last": 1983, "abstract": "We present an intriguing discovery related to Random Fourier Features: replacing multiplication by a random Gaussian matrix with multiplication by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for its effectiveness. Motivated by the discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from $\\mathcal{O}(d^2)$ to $\\mathcal{O}(d \\log d)$, where $d$ is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of kernels and applications.", "full_text": "Orthogonal Random Features\n\nFelix Xinnan Yu Ananda Theertha Suresh Krzysztof Choromanski\n\nDaniel Holtmann-Rice Sanjiv Kumar\n\nGoogle Research, New York\n\n{felixyu, theertha, kchoro, dhr, sanjivk}@google.com\n\nAbstract\n\nWe present an intriguing discovery related to Random Fourier Features: in Gaussian\nkernel approximation, replacing the random Gaussian matrix by a properly scaled\nrandom orthogonal matrix signi\ufb01cantly decreases kernel approximation error. We\ncall this technique Orthogonal Random Features (ORF), and provide theoretical\nand empirical justi\ufb01cation for this behavior. Motivated by this discovery, we further\npropose Structured Orthogonal Random Features (SORF), which uses a class of\nstructured discrete orthogonal matrices to speed up the computation. 
The method reduces the time cost from O(d^2) to O(d log d), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of applications.

1 Introduction

Kernel methods are widely used in nonlinear learning [8], but they are computationally expensive for large datasets. Kernel approximation is a powerful technique to make kernel methods scalable, by mapping input features into a new space where dot products approximate the kernel well [19]. With accurate kernel approximation, efficient linear classifiers can be trained in the transformed space while retaining the expressive power of nonlinear methods [10, 21].
Formally, given a kernel K(·, ·) : R^d × R^d → R, kernel approximation methods seek to find a nonlinear transformation φ(·) : R^d → R^D such that, for any x, y ∈ R^d,

K(x, y) ≈ K̂(x, y) = φ(x)^T φ(y).

Random Fourier Features [19] are used widely in approximating smooth, shift-invariant kernels. This technique requires the kernel to exhibit two properties: 1) shift-invariance, i.e. K(x, y) = K(Δ) where Δ = x − y; and 2) positive semi-definiteness of K(Δ) on R^d. The second property guarantees that the Fourier transform of K(Δ) is a nonnegative function [3]. Let p(w) be the Fourier transform of K(Δ). Then,

K(x − y) = ∫_{R^d} p(w) e^{j w^T (x − y)} dw.

This means that one can treat p(w) as a density function and use Monte-Carlo sampling to derive the following nonlinear map for a real-valued kernel:

φ(x) = √(1/D) [sin(w_1^T x), ..., sin(w_D^T x), cos(w_1^T x), ..., cos(w_D^T x)]^T,

where each w_i is sampled i.i.d. from a probability distribution with density p(w). Let W = [w_1, ..., w_D]^T. The linear transformation Wx is central to the above computation since:

• The choice of matrix W determines how well the estimated kernel converges to the actual kernel;
• The computation of Wx has space and time costs of O(Dd). This is expensive for high-dimensional data, especially since D is often required to be larger than d to achieve low approximation error.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

[Figure 1: three panels, (a) USPS, (b) MNIST, (c) CIFAR, plotting MSE against D/d for RFF (Random Gaussian) and ORF (Random Orthogonal).]

Figure 1: Kernel approximation mean squared error (MSE) for the Gaussian kernel K(x, y) = e^{−||x−y||^2 / 2σ^2}. D: number of rows in the linear transformation W. d: input dimension. ORF imposes orthogonality on W (Section 3).

In this work, we address both of the above issues. We first show an intriguing discovery (Figure 1): by enforcing orthogonality on the rows of W, the kernel approximation error can be significantly reduced. We call this method Orthogonal Random Features (ORF). Section 3 describes the method and provides a theoretical explanation for the improved performance.
Since both generating a d × d orthogonal matrix (O(d^3) time and O(d^2) space) and computing the transformation (O(d^2) time and space) are prohibitively expensive for high-dimensional data, we further propose Structured Orthogonal Random Features (SORF) in Section 4.
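As a concrete reference point, the map φ and the transformation Wx above can be sketched in a few lines (not the paper's code; a minimal NumPy illustration in which d, D, σ, and the number of averaged draws are arbitrary choices):

```python
import numpy as np

def rff_features(X, D, sigma, rng):
    """Random Fourier Features for the Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)).

    W has i.i.d. N(0, 1/sigma^2) entries, i.e. W = G / sigma with G standard Gaussian."""
    d = X.shape[1]
    W = rng.standard_normal((D, d)) / sigma
    WX = X @ W.T                                   # the O(Dd) transformation Wx
    # phi(x) = sqrt(1/D) [sin(Wx); cos(Wx)], so phi(x)^T phi(y)
    # averages cos(w_i^T (x - y)) over the D sampled frequencies.
    return np.hstack([np.sin(WX), np.cos(WX)]) / np.sqrt(D)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(64), rng.standard_normal(64)
sigma = 8.0
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
# Average over several independent draws of W to see the Monte-Carlo estimate converge.
estimates = [rff_features(np.stack([x, y]), 2048, sigma, rng) for _ in range(50)]
approx = float(np.mean([(phi[0] * phi[1]).sum() for phi in estimates]))
```

The dot product of the two feature rows is a Monte-Carlo estimate of the kernel value; its error over independent draws of W is what Figure 1 measures.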
The idea is to replace random orthogonal matrices by a class of special structured matrices consisting of products of binary diagonal matrices and Walsh-Hadamard matrices. SORF has fast computation time, O(D log d), and almost no extra memory cost (with an efficient in-place implementation). We show extensive experiments in Section 5. We also provide theoretical discussions in Section 6 of applying the structured matrices in a broader range of applications where random Gaussian matrices are used.

2 Related Works

Explicit nonlinear random feature maps have been constructed for many types of kernels, such as intersection kernels [15], generalized RBF kernels [22], skewed multiplicative histogram kernels [14], additive kernels [24], and polynomial kernels [11, 18]. In this paper, we focus on approximating Gaussian kernels following the seminal Random Fourier Features (RFF) framework [19], which has been extensively studied both theoretically and empirically [26, 20, 23].
Key to the RFF technique is Monte-Carlo sampling. It is well known that the convergence of Monte-Carlo methods can be largely improved by carefully choosing a deterministic sequence instead of random samples [17]. Following this line of reasoning, Yang et al. [25] proposed to use low-displacement-rank sequences in RFF. Yu et al. [28] studied optimizing the sequences in a data-dependent fashion to achieve more compact maps. In contrast to the above works, this paper is motivated by an intriguing new discovery that using orthogonal random samples provides much faster convergence. Compared to [25], the proposed SORF method achieves both lower kernel approximation error and greatly reduced computation and memory costs. Furthermore, unlike [28], the results in this paper are data-independent.
Structured matrices have been used for speeding up dimensionality reduction [1], binary embedding [27], deep neural networks [5] and kernel approximation [13, 28, 7].
For the kernel approximation works, in particular, the “structured randomness” leads to a minor loss of accuracy, but allows faster computation since the structured matrices enable the use of FFT-like algorithms. Furthermore, these matrices provide substantial model compression since they require subquadratic (usually only linear) space. In comparison with the above works, our proposed methods ORF and SORF are more effective than RFF. In particular, SORF demonstrates both lower approximation error and better efficiency than RFF. Table 1 compares the space and time costs of different techniques.

Method | Extra Memory | Time | Lower error than RFF?
Random Fourier Feature (RFF) [19] | O(Dd) | O(Dd) | -
Compact Nonlinear Map (CNM) [28] | O(Dd) | O(Dd) | Yes (data-dependent)
Quasi-Monte Carlo (QMC) [25] | O(Dd) | O(Dd) | Yes
Structured (fastfood/circulant) [28, 13] | O(D) | O(D log d) | No
Orthogonal Random Feature (ORF) | O(Dd) | O(Dd) | Yes
Structured ORF (SORF) | O(D) or O(1) | O(D log d) | Yes

Table 1: Comparison of different kernel approximation methods under the framework of Random Fourier Features [19]. We assume D ≥ d. The proposed SORF method has O(D) degrees of freedom. The computations can be efficiently implemented as in-place operations with fixed random seeds; therefore the extra space cost can be O(1).

3 Orthogonal Random Features

Our goal is to approximate a Gaussian kernel of the form

K(x, y) = e^{−||x−y||^2 / 2σ^2}.

In the paragraphs below, we assume a square linear transformation matrix W ∈ R^{D×d} with D = d. When D < d, we simply use the first D dimensions of the result. When D > d, we use multiple independently generated random features and concatenate the results.
We comment on this setting at the end of this section.
Recall that the linear transformation matrix of RFF can be written as

W_RFF = (1/σ) G,   (1)

where G ∈ R^{d×d} is a random Gaussian matrix, with every entry sampled independently from the standard normal distribution. Denote the approximate kernel based on the above W_RFF as K_RFF(x, y). For completeness, we first show the expectation and variance of K_RFF(x, y).
Lemma 1. (Appendix A.2) K_RFF(x, y) is an unbiased estimator of the Gaussian kernel, i.e., E(K_RFF(x, y)) = e^{−||x−y||²/2σ²}. Let z = ||x − y||/σ. The variance of K_RFF(x, y) is

Var(K_RFF(x, y)) = (1/(2D)) (1 − e^{−z²})².

The idea of Orthogonal Random Features (ORF) is to impose orthogonality on the linear transformation matrix G. Note that one cannot achieve unbiased kernel estimation by simply replacing G by an orthogonal matrix, since the norms of the rows of G follow the χ-distribution, while the rows of an orthogonal matrix have unit norm. The linear transformation matrix of ORF has the following form:

W_ORF = (1/σ) SQ,   (2)

where Q is a uniformly distributed random orthogonal matrix¹. The set of rows of Q forms a basis in R^d. S is a diagonal matrix, with diagonal entries sampled i.i.d. from the χ-distribution with d degrees of freedom. S makes the norms of the rows of SQ and G identically distributed.
Denote the approximate kernel based on the above W_ORF as K_ORF(x, y). The following shows that K_ORF(x, y) is an unbiased estimator of the kernel, and that it has lower variance than RFF.
Theorem 1. K_ORF(x, y) is an unbiased estimator of the Gaussian kernel, i.e.,

E(K_ORF(x, y)) = e^{−||x−y||²/2σ²}.

¹We first generate the random Gaussian matrix G in (1). Q is the orthogonal matrix obtained from the QR decomposition of G.
Q is distributed uniformly on the Stiefel manifold (the space of all orthogonal matrices) based on the Bartlett decomposition theorem [16].

[Figure 2: three panels, (a) the variance ratio when d is large, (b) the variance ratio in simulation for d = 2 to 32 and d = ∞, (c) the empirical distribution of z for the letter, forest, usps, cifar, mnist and gisette datasets.]

Figure 2: (a) Var(K_ORF(x, y))/Var(K_RFF(x, y)) when d is large and d = D. z = ||x − y||/σ. (b) Simulation of Var(K_ORF(x, y))/Var(K_RFF(x, y)) when D = d. Note that the empirical variance is the Mean Squared Error (MSE). (c) Distribution of z for several datasets, when we set σ as the mean distance to the 50th-nearest neighbor for samples from the dataset. The count is normalized such that the area under the curve for each dataset is 1. Observe that most points in all the datasets have z < 2. As shown in (a), for these values of z, ORF has much smaller variance than the standard RFF.

Let D ≤ d, and z = ||x − y||/σ. There exists a function f such that for all z, the variance of K_ORF(x, y) is bounded by

Var(K_ORF(x, y)) ≤ (1/(2D)) ((1 − e^{−z²})² − ((D − 1)/d) e^{−z²} z⁴) + f(z)/d².

Proof. We first show the proof of the unbiasedness. Let Δ = (x − y)/σ and z = ||Δ||. Then

E(K_ORF(x, y)) = E((1/D) Σ_{i=1}^D cos(w_i^T Δ)) = (1/D) Σ_{i=1}^D E[cos(w_i^T Δ)].

Based on the definition of ORF, w_1, w_2, ..., w_D are D random vectors given by w_i = s_i u_i, with u_1, u_2, ..., u_d a uniformly chosen random orthonormal basis for R^d, and the s_i independent χ-distributed random variables with d degrees of freedom. It is easy to show that for each i, w_i is distributed according to N(0, I_d), and hence by Bochner's theorem,

E[cos(w_i^T Δ)] = e^{−z²/2}.

We now show a proof sketch of the variance. Let a_i = cos(w_i^T Δ). Then

Var((1/D) Σ_i a_i) = E[((1/D) Σ_i a_i)²] − (E[(1/D) Σ_i a_i])²
= (1/D²) Σ_i (E[a_i²] − E[a_i]²) + (1/D²) Σ_i Σ_{j≠i} (E[a_i a_j] − E[a_i] E[a_j])
= (1/(2D)) (1 − e^{−z²})² + ((D − 1)/D) (E[a_1 a_2] − e^{−z²}),

where the last equality follows from symmetry. The first term in the resulting expression is exactly the variance of RFF. In order to have lower variance, E[a_1 a_2] − e^{−z²} must be negative. We use the following lemma to quantify this term.
Lemma 2. (Appendix A.3) There is a function f such that for any z,

E[a_i a_j] ≤ e^{−z²} − e^{−z²} z⁴/(2d) + f(z)/d².

Therefore, for large d and D ≤ d, the ratio of the variance of ORF to that of RFF is

Var(K_ORF(x, y)) / Var(K_RFF(x, y)) ≈ 1 − (D − 1) e^{−z²} z⁴ / (d (1 − e^{−z²})²).   (3)

Figure 2(a) shows the ratio of the variance of ORF to that of RFF when D = d and d is large.
First notice that this ratio is always smaller than 1, and hence ORF always provides improvement over the conventional RFF. Interestingly, we gain significantly for small values of z. In fact, when z → 0 and d → ∞, the ratio is roughly z² (note that e^x ≈ 1 + x when x → 0), and ORF exhibits infinitely lower error relative to RFF. Figure 2(b) shows empirical simulations of this ratio. We can see that the variance ratio is close to that of d = ∞ in (3), even when d = 32, a fairly low-dimensional setting in real-world cases.

[Figure 3: four panels, (a) the bias of ORF′, (b) the bias of SORF, (c) the variance ratio of ORF′, (d) the variance ratio of SORF, for d from 2 to 64 and d = ∞.]

Figure 3: Simulations of bias and variance of ORF′ and SORF. z = ||x − y||/σ. (a) E(K_ORF′(x, y)) − e^{−z²/2}. (b) E(K_SORF(x, y)) − e^{−z²/2}. (c) Var(K_ORF′(x, y))/Var(K_RFF(x, y)). (d) Var(K_SORF(x, y))/Var(K_RFF(x, y)). Each point on the curve is based on 20,000 choices of the random matrices and two fixed points with distance z. For both ORF and ORF′, even at d = 32, the bias is close to 0 and the variance is close to that of d = ∞ (Figure 2(a)).

Recall that z = ||x − y||/σ. This means that ORF preserves the kernel value especially well for data points that are close, thereby retaining the local structure of the dataset.
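The ORF construction described above (a QR-based orthogonal Q rescaled by χ-distributed row norms) and the variance reduction it brings can be checked numerically. A minimal sketch, not the paper's code; d, σ, and the trial counts are arbitrary, and the QR sign convention is used as-is, following the footnote's description:

```python
import numpy as np

def orf_matrix(d, sigma, rng):
    """W_ORF = (1/sigma) S Q: Q from the QR decomposition of a Gaussian matrix,
    S diagonal with chi(d)-distributed entries so row norms match the Gaussian case."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    S = np.sqrt(rng.chisquare(df=d, size=d))       # chi-distributed row scales
    return (S[:, None] * Q) / sigma

def approx_kernel(W, x, y):
    # phi(x)^T phi(y) = (1/D) sum_i [sin sin + cos cos] = (1/D) sum_i cos(w_i^T (x - y))
    wx, wy = W @ x, W @ y
    return float(np.sin(wx) @ np.sin(wy) + np.cos(wx) @ np.cos(wy)) / W.shape[0]

rng = np.random.default_rng(0)
d, sigma = 32, 6.0
x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

# Empirical MSE over independent draws of W (with D = d): ORF below RFF, as in Figure 1.
mse_rff = np.mean([(approx_kernel(rng.standard_normal((d, d)) / sigma, x, y) - exact) ** 2
                   for _ in range(2000)])
mse_orf = np.mean([(approx_kernel(orf_matrix(d, sigma, rng), x, y) - exact) ** 2
                   for _ in range(2000)])
```

Both estimators are unbiased; for moderate z the ORF estimate has markedly lower empirical MSE, consistent with the variance ratio in (3).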
Furthermore, empirically σ is typically not set too small in order to prevent overfitting—a common rule of thumb is to set σ to be the average distance of the 50th-nearest neighbors in a dataset. In Figure 2(c), we plot the distribution of z for several datasets with this choice of σ. These distributions are all concentrated in the regime where ORF yields substantial variance reduction.
The above analysis is under the assumption that D ≤ d. Empirically, for RFF, D needs to be larger than d in order to achieve low approximation error. In that case, we independently generate and apply the transformation (2) multiple times. The next corollary bounds the variance for this case.
Corollary 1. Let D = m · d for an integer m, and z = ||x − y||/σ. There exists a function f such that for all z, the variance of K_ORF(x, y) is bounded by

Var(K_ORF(x, y)) ≤ (1/(2D)) ((1 − e^{−z²})² − ((d − 1)/d) e^{−z²} z⁴) + f(z)/(dD).

4 Structured Orthogonal Random Features

In the previous section, we presented Orthogonal Random Features (ORF) and provided a theoretical explanation for their effectiveness. Since generating orthogonal matrices in high dimensions can be expensive, here we propose a fast version of ORF by imposing structure on the orthogonal matrices. This method can provide drastic memory and time savings with minimal compromise on kernel approximation quality. Note that the previous works on fast kernel approximation using structured matrices do not use structured orthogonal matrices [13, 28, 7].
Let us first introduce a simplified version of ORF: replace S in (2) by the scalar √d. Let us call this method ORF′. The transformation matrix thus has the following form:

W_ORF′ = (√d/σ) Q.   (4)

Theorem 2. (Appendix B) Let K_ORF′(x, y) be the approximate kernel computed with linear transformation matrix (4). Let D ≤ d and z = ||x − y||/σ.
There exists a function f such that the bias of K_ORF′(x, y) satisfies

|E(K_ORF′(x, y)) − e^{−z²/2}| ≤ e^{−z²/2} z⁴/(4d) + f(z)/d²,

[Figure 4: six panels, (a) LETTER (d = 16), (b) FOREST (d = 64), (c) USPS (d = 256), (d) CIFAR (d = 512), (e) MNIST (d = 1024), (f) GISETTE (d = 4096), plotting MSE against D/d for RFF, ORF, SORF, QMC (digitalnet), circulant, and fastfood.]

Figure 4: Kernel approximation mean squared error (MSE) for the Gaussian kernel K(x, y) = e^{−||x−y||²/2σ²}. D: number of transformations. d: input feature dimension. For each dataset, σ is chosen to be the mean distance of the 50th ℓ2 nearest neighbor for 1,000 sampled datapoints. Empirically, this yields good classification results.
The curves for SORF and ORF overlap.

and the variance of K_ORF′(x, y) satisfies

Var(K_ORF′(x, y)) ≤ (1/(2D)) ((1 − e^{−z²})² − ((D − 1)/d) e^{−z²} z⁴) + f(z)/d².

The above implies that when d is large, K_ORF′(x, y) is a good estimate of the kernel with low variance. Figure 3(a) shows that even for relatively small d, the estimate is almost unbiased. Figure 3(c) shows that when d ≥ 32, the variance ratio is very close to that of d = ∞. We find empirically that ORF′ also provides very similar MSE to ORF on real-world datasets.
We now introduce Structured Orthogonal Random Features (SORF). It replaces the random orthogonal matrix Q of ORF′ in (4) by a special type of structured matrix HD1HD2HD3:

W_SORF = (√d/σ) HD1HD2HD3,   (5)

where Di ∈ R^{d×d}, i = 1, 2, 3, are diagonal “sign-flipping” matrices, with each diagonal entry sampled from the Rademacher distribution, and H is the normalized Walsh-Hadamard matrix.
Computing W_SORF x takes O(d log d) time, since multiplication with Di takes O(d) time and multiplication with H takes O(d log d) time using the fast Hadamard transform. The computation of SORF can also be carried out with almost no extra memory, because both the sign flipping and the Walsh-Hadamard transform can be efficiently implemented as in-place operations [9].
Figures 3(b)(d) show the bias and variance of SORF. Note that although the curves for small d differ from those of ORF, when d is large (d > 32 in practice) the kernel estimate is almost unbiased, and the variance ratio converges to that of ORF. In other words, SORF provides almost identical kernel approximation quality to that of ORF. This is also confirmed by the experiments in Section 5.
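The O(d log d) transformation can be sketched with an explicit fast Walsh-Hadamard transform (a hedged illustration, not the paper's in-place implementation; d must be a power of 2, and d, σ, and the trial count are arbitrary):

```python
import numpy as np

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform in O(d log d); len(v) a power of 2."""
    v = v.copy()
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            a = v[i:i + h].copy()
            v[i:i + h] = a + v[i + h:i + 2 * h]
            v[i + h:i + 2 * h] = a - v[i + h:i + 2 * h]
        h *= 2
    return v

def sorf_transform(x, signs, sigma):
    """Compute W_SORF x = (sqrt(d)/sigma) H D1 H D2 H D3 x, with H normalized.

    signs: 3 x d array of Rademacher (+/-1) diagonals D1, D2, D3. Three unnormalized
    transforms contribute a factor d^{3/2}; dividing by sigma * d leaves the
    sqrt(d)/sigma scaling of Equation (5)."""
    v = x
    for D in signs[::-1]:                      # apply D3, H, D2, H, D1, H
        v = fwht(D * v)
    return v / (sigma * len(x))

rng = np.random.default_rng(0)
d, sigma = 64, 8.0
x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))
# Average the kernel estimate over fresh sign draws; it concentrates near `exact`,
# illustrating the near-unbiasedness seen in Figure 3(b).
ests = [np.mean(np.cos(sorf_transform(x - y, rng.integers(0, 2, (3, d)) * 2.0 - 1, sigma)))
        for _ in range(2000)]
est = float(np.mean(ests))
```

Note that by linearity the kernel estimate needs only W(x − y), so one transform per draw suffices here.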
In Section 6, we provide theoretical discussions to show that the structure of (5) can also be generally applied to many scenarios where random Gaussian matrices are used.

Dataset (d) | Method | D = 2d | D = 4d | D = 6d | D = 8d | D = 10d | Exact
letter (d = 16) | RFF | 76.44 ± 1.04 | 81.61 ± 0.46 | 85.46 ± 0.56 | 86.58 ± 0.99 | 87.84 ± 0.59 | 90.10
letter (d = 16) | ORF | 77.49 ± 0.95 | 82.49 ± 1.16 | 85.41 ± 0.60 | 87.17 ± 0.40 | 87.73 ± 0.63 |
letter (d = 16) | SORF | 76.18 ± 1.20 | 81.63 ± 0.77 | 84.43 ± 0.92 | 85.71 ± 0.52 | 86.78 ± 0.53 |
forest (d = 64) | RFF | 77.61 ± 0.23 | 78.92 ± 0.30 | 79.29 ± 0.24 | 79.57 ± 0.21 | 79.85 ± 0.10 | 80.43
forest (d = 64) | ORF | 77.88 ± 0.24 | 78.71 ± 0.19 | 79.38 ± 0.19 | 79.63 ± 0.21 | 79.54 ± 0.15 |
forest (d = 64) | SORF | 77.64 ± 0.20 | 78.88 ± 0.14 | 79.31 ± 0.12 | 79.50 ± 0.14 | 79.56 ± 0.09 |
usps (d = 256) | RFF | 94.27 ± 0.38 | 94.98 ± 0.10 | 95.43 ± 0.22 | 95.66 ± 0.25 | 95.71 ± 0.18 | 95.57
usps (d = 256) | ORF | 94.21 ± 0.51 | 95.26 ± 0.25 | 96.46 ± 0.18 | 95.52 ± 0.20 | 95.76 ± 0.17 |
usps (d = 256) | SORF | 94.45 ± 0.39 | 95.20 ± 0.43 | 95.51 ± 0.34 | 95.46 ± 0.34 | 95.67 ± 0.15 |
cifar (d = 512) | RFF | 73.19 ± 0.23 | 75.06 ± 0.33 | 75.85 ± 0.30 | 76.28 ± 0.30 | 76.54 ± 0.31 | 78.71
cifar (d = 512) | ORF | 73.59 ± 0.44 | 75.06 ± 0.28 | 76.00 ± 0.26 | 76.29 ± 0.26 | 76.69 ± 0.09 |
cifar (d = 512) | SORF | 73.54 ± 0.26 | 75.11 ± 0.21 | 75.76 ± 0.21 | 76.48 ± 0.24 | 76.47 ± 0.28 |
mnist (d = 1024) | RFF | 94.83 ± 0.13 | 95.48 ± 0.10 | 95.85 ± 0.07 | 96.02 ± 0.06 | 95.98 ± 0.05 | 97.14
mnist (d = 1024) | ORF | 94.95 ± 0.25 | 95.64 ± 0.06 | 95.85 ± 0.09 | 95.95 ± 0.08 | 96.06 ± 0.07 |
mnist (d = 1024) | SORF | 94.98 ± 0.18 | 95.48 ± 0.08 | 95.77 ± 0.09 | 95.98 ± 0.05 | 96.02 ± 0.07 |
gisette (d = 4096) | RFF | 97.68 ± 0.28 | 97.74 ± 0.11 | 97.66 ± 0.25 | 97.70 ± 0.16 | 97.74 ± 0.05 | 97.60
gisette (d = 4096) | ORF | 97.56 ± 0.17 | 97.72 ± 0.15 | 97.80 ± 0.07 | 97.64 ± 0.09 | 97.68 ± 0.04 |
gisette (d = 4096) | SORF | 97.64 ± 0.17 | 97.62 ± 0.04 | 97.64 ± 0.11 | 97.68 ± 0.08 | 97.70 ± 0.14 |

Table 2: Classification accuracy based on SVM. ORF and SORF provide competitive classification accuracy for a given D. Exact is based on kernel-SVM trained with the Gaussian kernel. Note that in all the settings SORF is faster than RFF and ORF by a factor of O(d/log d). For example, on gisette with D = 2d, SORF provides a 10-times speedup in comparison with RFF and ORF.

5 Experiments

Kernel Approximation. We first show kernel approximation performance on six datasets. The input feature dimension d is set to a power of 2 by padding zeros or subsampling. Figure 4 compares the mean squared error (MSE) of all methods. For fixed D, the kernel approximation MSE exhibits the following ordering:

SORF ≈ ORF < QMC [25] < RFF [19] < other fast kernel approximations [13, 28].

By imposing orthogonality on the linear transformation matrix, Orthogonal Random Features (ORF) achieves significantly lower approximation error than Random Fourier Features (RFF). Structured Orthogonal Random Features (SORF) have almost identical MSE to that of ORF. All other fast kernel approximation methods, such as circulant [28] and FastFood [13], have higher MSE. We also include DigitalNet, the best performing method among the Quasi-Monte Carlo techniques [25]. Its MSE is lower than that of RFF, but still higher than that of ORF and SORF. The order of time cost for a fixed D is

SORF ≈ other fast kernel approximations [13, 28] ≪ ORF = QMC [25] = RFF [19].

Remarkably, SORF has both better computational efficiency and higher kernel approximation quality compared to the other methods.
We also apply ORF and SORF on classification tasks.
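To make the classification pipeline concrete, here is a hedged end-to-end sketch (not the paper's experimental code): random Fourier features followed by a linear model, on a synthetic two-class problem that is not linearly separable. Ridge regression stands in for the linear SVM of Table 2, all constants are illustrative, and an ORF or SORF matrix could be dropped in for W unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR-style labels (sign of x0 * x1): not linearly separable in the input space.
# Points too close to the axes are dropped to give the classes a clear margin.
raw = rng.standard_normal((4000, 2))
keep = np.abs(raw[:, 0] * raw[:, 1]) > 0.1
X = raw[keep][:1500]
y = np.sign(X[:, 0] * X[:, 1])

# Plain RFF features for the Gaussian kernel with bandwidth sigma.
sigma, D = 1.0, 256
W = rng.standard_normal((D, 2)) / sigma
Phi = np.hstack([np.sin(X @ W.T), np.cos(X @ W.T)]) / np.sqrt(D)

# Ridge regression on the random features as a cheap stand-in for a linear SVM.
lam = 1e-3
Phi_tr, y_tr = Phi[:1000], y[:1000]
Phi_te, y_te = Phi[1000:], y[1000:]
w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(2 * D), Phi_tr.T @ y_tr)
acc = float(np.mean(np.sign(Phi_te @ w) == y_te))
```

A linear model on the raw two-dimensional inputs cannot beat chance on this task; on the random features it separates the quadrants accurately.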
Table 2 shows classification accuracy for different kernel approximation techniques with a (linear) SVM classifier. SORF is competitive with or better than RFF, and has greatly reduced time and space costs.
The Role of σ. Note that a very small σ will lead to overfitting, and a very large σ provides no discriminative power for classification. Throughout the experiments, σ for each dataset is chosen to be the mean distance of the 50th ℓ2 nearest neighbor, which empirically yields good classification results [28]. As shown in Section 3, the relative improvement over RFF is positively correlated with σ. Figures 5(a)(b) verify this on the mnist dataset. Notice that the proposed methods (ORF and SORF) consistently improve over RFF.
Simplifying SORF. The SORF transformation consists of three Hadamard-Diagonal blocks. A natural question is whether using less computation and randomness can achieve similar empirical performance. Figure 5(c) shows that reducing the number of blocks to two (HDHD) provides similar performance, while reducing to one block (HD) leads to large error.

6 Analysis and General Applicability of the Hadamard-Diagonal Structure

We provide theoretical discussions of SORF in this section. We first show that for large d, SORF is an unbiased estimator of the Gaussian kernel.

[Figure 5: three panels of MSE against D/d on mnist: (a) σ = 0.5× the 50NN distance and (b) σ = 2× the 50NN distance, comparing RFF, ORF and SORF; (c) the SORF variants HDHDHD, HDHD and HD.]

Figure 5: (a)(b) MSE on mnist with different σ. (c) Effect of using less randomness on mnist. HDHDHD is the proposed SORF method.
HDHD reduces the number of Hadamard-Diagonal blocks to two, and HD uses only one such block.

Theorem 3. (Appendix C) Let K_SORF(x, y) be the approximate kernel computed with the linear transformation matrix (√d/σ) HD1HD2HD3. Let z = ||x − y||/σ. Then

|E(K_SORF(x, y)) − e^{−z²/2}| ≤ 6z/√d.

Even though SORF is nearly unbiased, proving tight variance and concentration guarantees similar to ORF remains an open question. The following discussion provides a sketch in that direction. We first show a lemma about RFF.
Lemma 3. Let W be a random Gaussian matrix as in RFF. For a given z, the distribution of Wz is N(0, ||z||² I_d).
Note that Wz in RFF can be written as Rg, where R is a scaled orthogonal matrix such that each row has norm ||z||₂ and g is distributed according to N(0, I_d). Hence the distribution of Rg is N(0, ||z||² I_d), identical to that of Wz. The concentration results of RFF use the fact that the projections of a Gaussian vector g onto the orthogonal directions in R are independent.
We show that √d HD1HD2HD3 z has similar properties. In particular, we show that it can be written as R̃g̃, where the rows of R̃ are “near-orthogonal” (with high probability) and have norm ||z||₂, and the vector g̃ is close to Gaussian (g̃ has independent sub-Gaussian elements); hence the projections behave “near-independently”. Specifically, g̃ = vec(D1) (the vector of diagonal entries of D1), and R̃ is a function of D2, D3 and z.
Theorem 4. (Appendix D) For a given z, there exists an R̃ (a function of D2, D3, z) such that √d HD1HD2HD3 z = R̃ vec(D1). Each row of R̃ has norm ||z||₂, and for any t ≥ 1/d, with probability at least 1 − d e^{−c t^{2/3} d^{1/3}}, the inner product between any two rows of R̃ is at most t ||z||₂², where c is a constant.

The above result can also be applied to settings not limited to kernel approximation.
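The reparameterization in Theorem 4, √d HD1HD2HD3 z = R̃ vec(D1) with R̃ = √d H diag(u) for u = H D2 H D3 z, can be verified numerically (a sketch under illustrative sizes; the Hadamard matrix is built by Sylvester's recursion, which assumes a power-of-two dimension):

```python
import numpy as np

def normalized_hadamard(d):
    """Normalized Walsh-Hadamard matrix via Sylvester's recursion; d a power of 2."""
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)

rng = np.random.default_rng(0)
d = 1024
H = normalized_hadamard(d)
z = rng.standard_normal(d)
d1, d2, d3 = rng.integers(0, 2, (3, d)) * 2.0 - 1   # Rademacher diagonals

# u = H D2 H D3 z; then sqrt(d) H D1 u = R_tilde vec(D1) with R_tilde = sqrt(d) H diag(u).
u = H @ (d2 * (H @ (d3 * z)))
R = np.sqrt(d) * H * u                  # row i is sqrt(d) * H[i, :] * u
lhs = np.sqrt(d) * (H @ (d1 * u))       # sqrt(d) H D1 H D2 H D3 z
rhs = R @ d1                            # R_tilde vec(D1)

row_norms = np.linalg.norm(R, axis=1)   # each equals ||z|| exactly
G = R @ R.T
off = np.abs(G - np.diag(np.diag(G))).max()   # near-orthogonality of the rows
```

The row norms match ||z|| exactly because H and the sign diagonals are orthogonal, while the largest off-diagonal inner product is a small fraction of ||z||², illustrating the near-orthogonality the theorem quantifies.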
In the appendix, we show empirically that the same scheme can be successfully applied to angle estimation, where the nonlinear map f is the non-smooth sign(·) function [4]. We note that the HD1HD2HD3 structure has also been recently used in fast cross-polytope LSH [2, 12, 6].

7 Conclusions

We have demonstrated that imposing orthogonality on the transformation matrix can greatly reduce the kernel approximation MSE of Random Fourier Features when approximating Gaussian kernels. We further proposed a type of structured orthogonal matrices with substantially lower computation and memory cost. We provided theoretical insights indicating that the Hadamard-Diagonal block structure can be generally used to replace random Gaussian matrices in a broader range of applications. Our method can also be generalized to other types of kernels, such as general shift-invariant kernels and polynomial kernels, based on Schoenberg's characterization as in [18].

References
[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In STOC, 2006.
[2] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In NIPS, 2015.
[3] S. Bochner. Harmonic Analysis and the Theory of Probability. Dover Publications, 1955.
[4] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.
[5] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In ICCV, 2015.
[6] K. Choromanski, F. Fagan, C. Gouy-Pailler, A. Morvan, T. Sarlos, and J. Atif. TripleSpin – a generic compact paradigm for fast machine learning computations. arXiv, 2016.
[7] K. Choromanski and V. Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. In ICML, 2015.
[8] C. Cortes and V. Vapnik. Support-vector networks.
Machine Learning, 20(3):273–297, 1995.
[9] B. J. Fino and V. R. Algazi. Unified matrix treatment of the fast Walsh-Hadamard transform. IEEE Transactions on Computers, (11):1142–1146, 1976.
[10] T. Joachims. Training linear SVMs in linear time. In KDD, 2006.
[11] P. Kar and H. Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.
[12] C. Kennedy and R. Ward. Fast cross-polytope locality-sensitive hashing. arXiv, 2016.
[13] Q. Le, T. Sarlós, and A. Smola. Fastfood – approximating kernel expansions in loglinear time. In ICML, 2013.
[14] F. Li, C. Ionescu, and C. Sminchisescu. Random Fourier approximations for skewed multiplicative histogram kernels. Pattern Recognition, pages 262–271, 2010.
[15] S. Maji and A. C. Berg. Max-margin additive classifiers for detection. In ICCV, 2009.
[16] R. J. Muirhead. Aspects of Multivariate Statistical Theory, volume 197. John Wiley & Sons, 2009.
[17] H. Niederreiter. Quasi-Monte Carlo Methods. Wiley Online Library, 2010.
[18] J. Pennington, F. Yu, and S. Kumar. Spherical random features for polynomial kernels. In NIPS, 2015.
[19] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[20] A. Rudi, R. Camoriano, and L. Rosasco. Generalization properties of learning with random features. arXiv:1602.04474, 2016.
[21] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
[22] V. Sreekanth, A. Vedaldi, A. Zisserman, and C. Jawahar. Generalized RBF feature maps for efficient detection. In BMVC, 2010.
[23] B. Sriperumbudur and Z. Szabó. Optimal rates for random Fourier features. In NIPS, 2015.
[24] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480–492, 2012.
[25] J. Yang, V. Sindhwani, H. Avron, and M. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In ICML, 2014.
[26] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, 2012.
[27] F. X. Yu, S. Kumar, Y. Gong, and S.-F. Chang. Circulant binary embedding. In ICML, 2014.
[28] F. X. Yu, S. Kumar, H. Rowley, and S.-F. Chang. Compact nonlinear maps and circulant extensions. arXiv:1503.03893, 2015.
[29] X. Zhang, F. X. Yu, R. Guo, S. Kumar, S. Wang, and S.-F. Chang. Fast orthogonal projection based on Kronecker product. In ICCV, 2015.
", "award": [], "sourceid": 1068, "authors": [{"given_name": "Felix Xinnan", "family_name": "Yu", "institution": "Google Research"}, {"given_name": "Ananda Theertha", "family_name": "Suresh", "institution": "University of California, San Diego"}, {"given_name": "Krzysztof", "family_name": "Choromanski", "institution": "Google Brain Robotics"}, {"given_name": "Daniel", "family_name": "Holtmann-Rice", "institution": "Google Inc"}, {"given_name": "Sanjiv", "family_name": "Kumar", "institution": "Google"}]}