{"title": "But How Does It Work in Theory? Linear SVM with Random Features", "book": "Advances in Neural Information Processing Systems", "page_first": 3379, "page_last": 3388, "abstract": "We prove that, under low noise assumptions, the support vector machine with $N\\ll m$ random features (RFSVM) can achieve the learning rate faster than $O(1/\\sqrt{m})$ on a training set with $m$ samples when an optimized feature map is used. Our work extends the previous fast rate analysis of random features method from least square loss to 0-1 loss. We also show that the reweighted feature selection method, which approximates the optimized feature map, helps improve the performance of RFSVM in experiments on a synthetic data set.", "full_text": "But How Does It Work in Theory?\nLinear SVM with Random Features\n\nYitong Sun\n\nDepartment of Mathematics\n\nUniversity of Michigan\nAnn Arbor, MI, 48109\nsyitong@umich.edu\n\nAnna Gilbert\n\nDepartment of Mathematics\n\nUniversity of Michigan\nannacg@umich.edu\n\nAmbuj Tewari\n\nDepartment of Statistics\nUniversity of Michigan\ntewaria@umich.edu\n\nAbstract\n\n\u221a\n\nWe prove that, under low noise assumptions, the support vector machine with\nN (cid:28) m random features (RFSVM) can achieve the learning rate faster than\nm) on a training set with m samples when an optimized feature map is\nO(1/\nused. Our work extends the previous fast rate analysis of random features method\nfrom least square loss to 0-1 loss. We also show that the reweighted feature\nselection method, which approximates the optimized feature map, helps improve\nthe performance of RFSVM in experiments on a synthetic data set.\n\n1\n\nIntroduction\n\nKernel methods such as kernel support vector machines (KSVMs) have been widely and successfully\nused in classi\ufb01cation tasks (Steinwart and Christmann [2008]). 
The power of kernel methods comes from the fact that they implicitly map the data to a high dimensional, or even infinite dimensional, feature space, where points with different labels can be separated by a linear functional. It is, however, time-consuming to compute the kernel matrix, and thus KSVMs do not scale well to extremely large datasets. To overcome this challenge, researchers have developed various ways to efficiently approximate the kernel matrix or the kernel function.\n\nThe random features method, proposed by Rahimi and Recht [2008], maps the data to a finite dimensional feature space as a random approximation to the feature space of RBF kernels. With explicit finite dimensional feature vectors available, the original KSVM is converted to a linear support vector machine (LSVM) that can be trained by faster algorithms (Shalev-Shwartz et al. [2011], Hsieh et al. [2008]) and tested in constant time with respect to the number of training samples. For example, Huang et al. [2014] and Dai et al. [2014] applied RFSVM or its variants to datasets containing millions of data points and achieved performance comparable to deep neural nets.\n\nDespite solid practical performance, there is a lack of clear theoretical guarantees for the learning rate of RFSVM. Rahimi and Recht [2009] obtained a risk gap of order $O(1/\\sqrt{N})$ between the best RFSVM and KSVM classifiers, where $N$ is the number of features. Although the order of the error bound is correct for general cases, it is too pessimistic to justify or explain the actual computational benefits of the random features method in practice. Moreover, their model is formulated as a constrained optimization problem, which is rarely used in practice.\n\nCortes et al. [2010] and Sutherland and Schneider [2015] considered the performance of RFSVM as a perturbed optimization problem, using the fact that the dual form of KSVM is a constrained quadratic optimization problem. 
Although the maximizer of a quadratic function depends continuously on the quadratic form, its dependence is weak and thus both papers failed to obtain an informative bound for the excess risk of RFSVM in the classification problem. In particular, such an approach requires RFSVM and KSVM to be compared under the same hyper-parameters. This assumption is, in fact, problematic because the optimal configuration of hyper-parameters of RFSVM is not necessarily the same as that of the corresponding KSVM. In this sense, RFSVM is more like an independent learning model than just an approximation to KSVM.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn regression settings, the learning rate of the random features method was studied by Rudi and Rosasco [2017] under the assumption that the regression function is in the RKHS, namely the realizable case. They show that uniform feature sampling only requires $O(\\sqrt{m}\\log(m))$ features to achieve an $O(1/\\sqrt{m})$ risk of squared loss. They further show that a data-dependent sampling can achieve a rate of $O(1/m^{\\alpha})$, where $1/2 \\le \\alpha \\le 1$, with even fewer features, when the regression function is sufficiently smooth and the spectrum of the kernel integral operator decays sufficiently fast. However, the method leading to these results depends on the closed form of the least squares solution, and thus we cannot easily extend these results to the non-smooth loss functions used in RFSVM. Bach [2017] recently shows that for any given approximation accuracy, the number of random features required is given by the degrees of freedom of the kernel operator under such an accuracy level, when optimized features are available. 
This result is crucial for the sample complexity analysis of RFSVM, though not many details on this topic are provided in Bach's work.\n\nIn this paper, we investigate the performance of RFSVM formulated as a regularized optimization problem on classification tasks. In contrast to the slow learning rates in previous results by Rahimi and Recht [2009] and Bach [2017], we show, for the first time, that RFSVM can achieve a fast learning rate with far fewer features than the number of samples when the optimized features (see Assumption 2) are available, and thus we justify the potential computational benefits of RFSVM on classification tasks. We mainly consider two learning scenarios: the realizable case, and the unrealizable case, where the Bayes classifier does not belong to the RKHS of the feature map. In particular, our contributions are threefold:\n\n1. We prove that under Massart's low noise condition, with an optimized feature map, RFSVM can achieve a learning rate of $\\tilde{O}(m^{-\\frac{c_2}{2+c_2}})$ 1, with $\\tilde{O}(m^{\\frac{2}{2+c_2}})$ number of features, when the Bayes classifier belongs to the RKHS of a kernel whose spectrum decays polynomially ($\\lambda_i = O(i^{-c_2})$). When the decay rate of the spectrum of the kernel operator is sub-exponential, the learning rate can be improved to $\\tilde{O}(1/m)$ with only $\\tilde{O}(\\ln^d(m))$ number of features.\n\n2. When the Bayes classifier satisfies the separation condition, that is, when the two classes of points are apart by a positive distance, we prove that RFSVM using an optimized feature map corresponding to the Gaussian kernel can achieve a learning rate of $\\tilde{O}(1/m)$ with $\\tilde{O}(\\ln^{2d}(m))$ number of features.\n\n3. Our theoretical analysis suggests reweighting random features before training. 
We confirm its benefit in our experiments on synthetic data sets.\n\nWe begin in Section 2 with a brief introduction to RKHSs, random features and the problem formulation, and set up the notation we use throughout the rest of the paper. In Section 3, we provide our main theoretical results (see the appendices for the proofs), and in Section 4, we verify the performance of RFSVM in experiments. In particular, we show the improvement brought by the reweighted feature selection algorithm. The conclusion and some open questions are summarized at the end. The proofs of our main theorems follow from a combination of the sample complexity analysis scheme used by Steinwart and Christmann [2008] and the approximation error result of Bach [2017]. The fast rate is achieved due to the fact that the Rademacher complexity of the RKHS of $N$ random features and with regularization parameter $\\lambda$ is only $O(\\sqrt{N\\log(1/\\lambda)})$, while $N$ and $1/\\lambda$ need not be too large to control the approximation error when optimized features are available. Detailed proofs and more experimental results are provided in the Appendices for interested readers.\n\n1 $\\tilde{O}(n)$ represents a quantity less than $Cn\\log^k(n)$ for some $k$.\n\n2 Preliminaries and notations\n\nThroughout this paper, a labeled data point is a point $(x, y)$ in $\\mathcal{X} \\times \\{-1, 1\\}$, where $\\mathcal{X}$ is a bounded subset of $\\mathbb{R}^d$. $\\mathcal{X} \\times \\{-1, 1\\}$ is equipped with a probability distribution $P$.\n\n2.1 Kernels and Random Features\n\nA positive definite kernel function $k(x, x')$ defined on $\\mathcal{X} \\times \\mathcal{X}$ determines the unique corresponding reproducing kernel Hilbert space (RKHS), denoted by $\\mathcal{F}_k$. A map $\\phi$ from the data space $\\mathcal{X}$ to a Hilbert space $\\mathcal{H}$ such that $\\langle\\phi(x), \\phi(x')\\rangle_{\\mathcal{H}} = k(x, x')$ is called a feature map of $k$, and $\\mathcal{H}$ is called a feature space. 
For any $f \\in \\mathcal{F}_k$, there exists an $h \\in \\mathcal{H}$ such that $\\langle h, \\phi(x)\\rangle_{\\mathcal{H}} = f(x)$, and the infimum of the norms of all such $h$s is equal to $\\|f\\|_{\\mathcal{F}}$. On the other hand, given any feature map $\\phi$ into $\\mathcal{H}$, a kernel function is defined by the equation above, and we call the corresponding RKHS the RKHS of $\\phi$, denoted by $\\mathcal{F}_{\\phi}$.\n\nA common choice of feature space is the $L^2$ space of a probability space $(\\Omega, \\nu)$. An important observation is that for any probability density function $q(\\omega)$ defined on $\\Omega$, the feature map $\\phi(\\omega; x)/\\sqrt{q(\\omega)}$ with probability measure $q(\\omega)\\,d\\nu(\\omega)$ defines the same kernel function as the feature map $\\phi(\\omega; x)$ under the distribution $\\nu$. One can sample the image of $x$ under the feature map $\\phi$, an $L^2$ function $\\phi(\\omega; x)$, at points $\\{\\omega_1, \\dots, \\omega_N\\}$ drawn according to the probability distribution $\\nu$ to approximately represent $x$. The resulting vector in $\\mathbb{R}^N$ is called a random feature vector of $x$, denoted by $\\phi_N(x)$. The corresponding kernel function determined by $\\phi_N$ is denoted by $k_N$.\n\nA well-known construction of random features is the random Fourier features proposed by Rahimi and Recht [2008]. The feature map is defined as follows,\n\n$$\\phi: \\mathcal{X} \\to L^2(\\mathbb{R}^d, \\nu) \\oplus L^2(\\mathbb{R}^d, \\nu), \\quad x \\mapsto (\\cos(\\omega \\cdot x), \\sin(\\omega \\cdot x)).$$\n\nAnd the corresponding random feature vector is\n\n$$\\phi_N(x) = \\frac{1}{\\sqrt{N}}\\left(\\cos(\\omega_1 \\cdot x), \\cdots, \\cos(\\omega_N \\cdot x), \\sin(\\omega_1 \\cdot x), \\cdots, \\sin(\\omega_N \\cdot x)\\right)^{\\top},$$\n\nwhere the $\\omega_i$s are sampled according to $\\nu$. Different choices of $\\nu$ define different translation invariant kernels (see Rahimi and Recht [2008]). 
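As a concrete (non-authoritative) illustration of this construction, the random Fourier feature map and the kernel it induces can be sketched in a few lines of NumPy; the sampling distribution, bandwidth, and data below are arbitrary choices made for the demonstration:

```python
import numpy as np

def random_fourier_features(X, omegas):
    """phi_N(x) = (cos(w_1.x), ..., cos(w_N.x), sin(w_1.x), ..., sin(w_N.x)) / sqrt(N)."""
    proj = X @ omegas.T                       # inner products w_i . x, shape (n, N)
    N = omegas.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(N)

rng = np.random.default_rng(0)
d, N, gamma = 2, 2000, 1.0
# nu = N(0, gamma^{-2} I); this choice corresponds to the Gaussian kernel with bandwidth gamma
omegas = rng.normal(scale=1.0 / gamma, size=(N, d))

X = rng.uniform(-1.0, 1.0, size=(5, d))
Phi = random_fourier_features(X, omegas)
K_approx = Phi @ Phi.T                        # <phi_N(x), phi_N(x')> = k_N(x, x')
K_exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2 * gamma**2))
print(np.abs(K_approx - K_exact).max())       # Monte Carlo error, shrinks as N grows
```

The entrywise approximation error decreases at the usual Monte Carlo rate in the number of sampled frequencies, matching the $O(1/\sqrt{N})$ risk-gap order quoted in the introduction.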
When $\\nu$ is the normal distribution with mean $0$ and variance $\\gamma^{-2}$, the kernel function defined by the feature map is the Gaussian kernel with bandwidth parameter $\\gamma$,\n\n$$k_{\\gamma}(x, x') = \\exp\\left(-\\frac{\\|x - x'\\|^2}{2\\gamma^2}\\right).$$\n\nEquivalently, we may consider the feature map $\\phi_{\\gamma}(\\omega; x) := \\phi(\\omega/\\gamma; x)$ with $\\nu$ being the standard normal distribution.\n\nA more general and more abstract feature map can be constructed using an orthonormal set of $L^2(\\mathcal{X}, P_X)$. Given the orthonormal set $\\{e_i\\}$ consisting of bounded functions, and a nonnegative sequence $(\\lambda_i) \\in \\ell^1$, we can define a feature map\n\n$$\\phi(\\omega; x) = \\sum_{i=1}^{\\infty} \\sqrt{\\lambda_i}\\, e_i(x) e_i(\\omega),$$\n\nwith feature space $L^2(\\mathcal{X}, P_X)$. The corresponding kernel is given by $k(x, x') = \\sum_{i=1}^{\\infty} \\lambda_i e_i(x) e_i(x')$. The feature map and the kernel function are well defined because of the boundedness assumption on $\\{e_i\\}$. A similar representation can be obtained for a continuous kernel function on a compact set by Mercer's Theorem (Lax [2002]).\n\nEvery positive definite kernel function $k$ satisfying $\\int k(x, x)\\, dP_X(x) < \\infty$ defines an integral operator on $L^2(\\mathcal{X}, P_X)$ by\n\n$$\\Sigma: L^2(\\mathcal{X}, P_X) \\to L^2(\\mathcal{X}, P_X), \\quad f \\mapsto \\int_{\\mathcal{X}} k(x, t) f(t)\\, dP_X(t).$$\n\n$\\Sigma$ is of trace class with trace norm $\\int k(x, x)\\, dP_X(x)$. When the integral operator is determined by a feature map $\\phi$, we denote it by $\\Sigma_{\\phi}$, and the $i$th eigenvalue in descending order by $\\lambda_i(\\Sigma_{\\phi})$. Note that the regularization parameter is also denoted by $\\lambda$ but without a subscript. The decay rate of the spectrum of $\\Sigma_{\\phi}$ plays an important role in the analysis of the learning rate of the random features method.\n\n2.2 Formulation of Support Vector Machine\n\n
Given $m$ samples $\\{(x_i, y_i)\\}_{i=1}^m$ generated i.i.d. by $P$ and a function $f: \\mathcal{X} \\to \\mathbb{R}$, usually called a hypothesis in the machine learning context, the empirical and expected risks with respect to the loss function $\\ell$ are defined by\n\n$$R^{\\ell}_m(f) := \\frac{1}{m}\\sum_{i=1}^m \\ell(y_i, f(x_i)), \\quad R^{\\ell}_P(f) := \\mathbb{E}_{(x,y)\\sim P}\\,\\ell(y, f(x)),$$\n\nrespectively.\n\nThe 0-1 loss is commonly used to measure the performance of classifiers:\n\n$$\\ell_{0\\text{-}1}(y, f(x)) = \\begin{cases} 1 & \\text{if } f(x)y \\le 0; \\\\ 0 & \\text{if } f(x)y > 0. \\end{cases}$$\n\nThe function that minimizes the expected risk under 0-1 loss is called the Bayes classifier, defined by\n\n$$f^*_P(x) := \\mathrm{sgn}(\\mathbb{E}[y \\mid x]).$$\n\nThe goal of the classification task is to find a good hypothesis $f$ with small excess risk $R^{0\\text{-}1}_P(f) - R^{0\\text{-}1}_P(f^*_P)$, and to find such a hypothesis based on the samples, one minimizes the empirical risk. However, under 0-1 loss it is hard to find the global minimizer of the empirical risk, because the loss function is discontinuous and non-convex. A popular surrogate loss function in practice is the hinge loss, $\\ell_h(y, f(x)) = \\max(0, 1 - yf(x))$, which guarantees that\n\n$$R^h_P(f) - \\inf_f R^h_P(f) \\ge R^{0\\text{-}1}_P(f) - R^{0\\text{-}1}_P(f^*_P),$$\n\nwhere $R^h$ means $R^{\\ell_h}$ and $R^{0\\text{-}1}$ means $R^{\\ell_{0\\text{-}1}}$. See Steinwart and Christmann [2008] for more details.\n\nA regularizer can be added to the optimization objective with a scalar multiplier $\\lambda$ to avoid overfitting the random samples. Throughout this paper, we consider the most commonly used $\\ell^2$ regularization. Therefore, the solution of the binary classification problem is given by minimizing the following objective\n\n$$R_{m,\\lambda}(f) = R^h_m(f) + \\frac{\\lambda}{2}\\|f\\|^2_{\\mathcal{F}},$$\n\nover a hypothesis class $\\mathcal{F}$. When $\\mathcal{F}$ is the RKHS of some kernel function, the algorithm described above is called kernel support vector machine. 
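As a minimal sketch (our own illustration, not the solver used in the paper), the regularized hinge-loss objective over a finite-dimensional feature space can be minimized by Pegasos-style subgradient descent; the identity features, step size, and toy data below are hypothetical choices:

```python
import numpy as np

def svm_subgradient_descent(Phi, y, lam=0.01, steps=2000):
    """Minimize (1/m) sum_i max(0, 1 - y_i <w, Phi_i>) + (lam/2) ||w||^2.

    Phi: (m, D) array of feature vectors, y: labels in {-1, +1}.
    Uses the 1/(lam * t) step size, standard for lam-strongly convex objectives.
    """
    m, D = Phi.shape
    w = np.zeros(D)
    for t in range(1, steps + 1):
        margins = y * (Phi @ w)
        active = margins < 1                              # points where the hinge subgradient is nonzero
        grad = lam * w - (y[active, None] * Phi[active]).sum(axis=0) / m
        w -= grad / (lam * t)
    return w

# Toy linearly separable data; the raw coordinates stand in for a feature map.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1.0, -1.0)
w = svm_subgradient_descent(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

With random feature vectors in place of the raw coordinates, this is exactly the linear-SVM training step that makes RFSVM scale: the cost per iteration is linear in the feature dimension rather than quadratic in the sample size.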
Note that for technical convenience, we do not include the bias term in the formulation of the hypothesis, so that all these functions are from the RKHS instead of the product space of the RKHS and $\\mathbb{R}$ (see Chapter 1 of Steinwart and Christmann [2008] for more explanation of this convention). Note that $R_{m,\\lambda}$ is strongly convex and thus the infimum will be attained by some function in $\\mathcal{F}$. We denote it by $f_{m,\\lambda}$.\n\nWhen random features $\\phi_N$ and the corresponding RKHS are considered, we add $N$ into the subscripts of the notations defined above to indicate the number of random features, for example $\\mathcal{F}_N$ for the RKHS and $f_{N,m,\\lambda}$ for the solution of the optimization problem.\n\n3 Main Results\n\nIn this section we state our main results on the fast learning rates of RFSVM in different scenarios. First, we need the following assumption on the distribution of the data, which is required for all the results in this paper.\n\nAssumption 1. There exists $V \\ge 2$ such that\n\n$$|\\mathbb{E}_{(x,y)\\sim P}[y \\mid x]| \\ge 2/V.$$\n\nThis assumption is called Massart's low noise condition in many references (see for example Koltchinskii et al. [2011]). When $V = 2$, all the data points have deterministic labels almost surely, and therefore it is easier to learn the true classifier based on observations. In the proof, Massart's low noise condition guarantees the variance condition (Steinwart and Christmann [2008])\n\n$$\\mathbb{E}[(\\ell_h(f(x)) - \\ell_h(f^*_P(x)))^2] \\le V(R^h(f) - R^h(f^*_P)), \\quad (1)$$\n\nwhich is a common requirement for fast rate results. Massart's condition is an extreme case of a more general low noise condition, called Tsybakov's condition. For the simplicity of the theorems, we only consider Massart's condition in our work, but our main results can be generalized to Tsybakov's condition.\n\nThe second assumption is about the quality of random features. 
It was first introduced in Bach [2017]'s approximation results.\n\nAssumption 2. A feature map $\\phi: \\mathcal{X} \\to L^2(\\Omega, \\nu)$ is called optimized if there exists a small constant $\\mu_0$ such that for any $\\mu \\le \\mu_0$,\n\n$$\\sup_{\\omega \\in \\Omega} \\|(\\Sigma + \\mu I)^{-1/2}\\phi(\\omega; \\cdot)\\|^2_{L^2(P)} \\le \\mathrm{tr}(\\Sigma(\\Sigma + \\mu I)^{-1}) = \\sum_{i=1}^{\\infty} \\frac{\\lambda_i(\\Sigma)}{\\lambda_i(\\Sigma) + \\mu}.$$\n\nFor any given $\\mu$, the quantity on the left hand side of the inequality is called the leverage score with respect to $\\mu$, which is directly related to the number of features required to approximate a function in the RKHS of $\\phi$. The quantity on the right hand side is called the degrees of freedom by Bach [2017] and the effective dimension by Rudi and Rosasco [2017], denoted by $d(\\mu)$. Note that whatever the RKHS is, we can always construct an optimized feature map for it. In Appendix A we describe two examples of constructing optimized feature maps. When a feature map is optimized, it is easy to control its leverage score by the decay rate of the spectrum of $\\Sigma$, as described below.\n\nDefinition 1. We say that the spectrum of $\\Sigma: L^2(\\mathcal{X}, P) \\to L^2(\\mathcal{X}, P)$ decays at a polynomial rate if there exist $c_1 > 0$ and $c_2 > 1$ such that\n\n$$\\lambda_i(\\Sigma) \\le c_1 i^{-c_2}.$$\n\nWe say that it decays sub-exponentially if there exist $c_3, c_4 > 0$ such that\n\n$$\\lambda_i(\\Sigma) \\le c_3\\exp(-c_4 i^{1/d}).$$\n\nThe decay rate of the spectrum of $\\Sigma$ characterizes the capacity of the hypothesis space to search for the solution, which further determines the number of random features required in the learning process. Indeed, when the feature map is optimized, the number of features required to approximate a function in the RKHS with accuracy $O(\\sqrt{\\mu})$ is upper bounded by $O(d(\\mu)\\ln(d(\\mu)))$. 
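To see numerically how the spectrum's decay controls the right-hand side, the degrees of freedom $d(\mu) = \sum_i \lambda_i(\Sigma)/(\lambda_i(\Sigma) + \mu)$ can be computed directly for a given spectrum; the decay constants below ($c_2 = 2$ for the polynomial case, $c_3 = c_4 = d = 1$ for the sub-exponential case) are our own illustrative choices:

```python
import numpy as np

def degrees_of_freedom(lams, mu):
    """d(mu) = tr(Sigma (Sigma + mu I)^{-1}) = sum_i lam_i / (lam_i + mu)."""
    return float(np.sum(lams / (lams + mu)))

i = np.arange(1, 10**6 + 1, dtype=float)
poly = i ** -2.0              # lambda_i = i^{-c2} with c2 = 2
subexp = np.exp(-i)           # lambda_i = exp(-c4 * i) with c4 = d = 1

for mu in (1e-2, 1e-4):
    print(f"mu={mu:g}  poly: {degrees_of_freedom(poly, mu):7.2f}  "
          f"subexp: {degrees_of_freedom(subexp, mu):5.2f}")
```

Shrinking $\mu$ by a factor of 100 multiplies $d(\mu)$ by roughly 10 in the polynomial case but adds only an additive $\approx \ln(100)$ in the sub-exponential case, consistent with the $O(\mu^{-1/c_2})$ and $O(\ln^d(c_3/\mu))$ growth rates used in the analysis.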
When the spectrum decays polynomially, the degrees of freedom $d(\\mu)$ is $O(\\mu^{-1/c_2})$, and when it decays sub-exponentially, $d(\\mu)$ is $O(\\ln^d(c_3/\\mu))$ (see Lemma 6 in Appendix C for details). Examples of kernels with polynomial and sub-exponential spectrum decay can be found in Bach [2017]. Our proof of Lemma 8 also provides some useful discussion.\n\nWith these preparations, we can now state our first theorem.\n\nTheorem 1. Assume that $P$ satisfies Assumption 1, the feature map $\\phi$ satisfies Assumption 2, and $f^*_P \\in \\mathcal{F}_{\\phi}$ with $\\|f^*_P\\|_{\\mathcal{F}_{\\phi}} \\le R$. When the spectrum of $\\Sigma_{\\phi}$ decays polynomially, by choosing\n\n$$\\lambda = m^{-\\frac{c_2}{2+c_2}}, \\quad N = 10C_{c_1,c_2}m^{\\frac{2}{2+c_2}}\\left(\\ln(32C_{c_1,c_2}m^{\\frac{2}{2+c_2}}) + \\ln(1/\\delta)\\right),$$\n\nwe have\n\n$$R^{0\\text{-}1}_P(f_{N,m,\\lambda}) - R^{0\\text{-}1}_P(f^*_P) \\le C_{c_1,c_2,V,R}\\, m^{-\\frac{c_2}{2+c_2}}\\left(\\ln(1/\\delta) + \\ln(m)\\right),$$\n\nwith probability $1 - 4\\delta$. When the spectrum of $\\Sigma_{\\phi}$ decays sub-exponentially, by choosing\n\n$$\\lambda = 1/m, \\quad N = 25C_{d,c_4}\\ln^d(m)\\left(\\ln(80C_{d,c_4}\\ln^d(m)) + \\ln(1/\\delta)\\right),$$\n\nwe have\n\n$$R^{0\\text{-}1}_P(f_{N,m,\\lambda}) - R^{0\\text{-}1}_P(f^*_P) \\le C_{c_3,c_4,d,R,V}\\,\\frac{1}{m}\\left(\\log^{d+2}(m) + \\log(1/\\delta)\\right),$$\n\nwith probability $1 - 4\\delta$ when $m \\ge \\exp((c_4 \\vee \\frac{1}{c_4})d^2/2)$.\n\nThis theorem characterizes the learning rate of RFSVM in realizable cases, that is, when the Bayes classifier belongs to the RKHS of the feature map. For a polynomially decaying spectrum, when $c_2 > 2$, we get a learning rate faster than $1/\\sqrt{m}$. Rudi and Rosasco [2017] obtained a similar fast learning rate for kernel ridge regression with random features (RFKRR), assuming polynomial decay of the spectrum of $\\Sigma_{\\phi}$ and the existence of a minimizer of the risk in $\\mathcal{F}_{\\phi}$. Our theorem extends their result to classification problems and to exponentially decaying spectra. 
However, we have to use the stronger assumption that $f^*_P \\in \\mathcal{F}_{\\phi}$ so that the low noise condition can be applied to derive the variance condition. For RFKRR, a rate faster than $O(1/\\sqrt{m})$ is achieved whenever $c_2 > 1$, and the number of features required is only the square root of ours. We think that this is mainly caused by the fact that their surrogate loss is the squared loss. The sub-exponentially decaying spectrum case is not investigated for RFKRR, so we cannot make a comparison there. We believe that this is the first result showing that RFSVM can achieve $\\tilde{O}(1/m)$ with only $\\tilde{O}(\\ln^d(m))$ features. Note however that when $d$ is large, the sub-exponential case requires a large number of samples, possibly even larger than in the polynomial case. This is clearly an artifact of our analysis, since we can always use the polynomial case to provide an upper bound! We therefore suspect that there is considerable room for improving our analysis of high dimensional data in the sub-exponential decay case. In particular, removing the exponential dependence on $d$ under reasonable assumptions is an interesting direction for future work.\n\nTo remove the realizability assumption, we provide our second theorem, on the learning rate of RFSVM in the unrealizable case. We focus on the random features corresponding to the Gaussian kernel as introduced in Section 2. When the Bayes classifier does not belong to the RKHS, we need an approximation theorem to estimate the gap of risks. The approximation property of the RKHS of the Gaussian kernel has been studied in Steinwart and Christmann [2008], where the margin noise exponent is defined to derive the risk gap. 
Here we introduce the simpler and stronger separation condition, which leads to a strong result.\n\nThe points in $\\mathcal{X}$ can be collected into two sets according to their labels as follows,\n\n$$\\mathcal{X}_1 := \\{x \\in \\mathcal{X} \\mid \\mathbb{E}(y \\mid x) > 0\\}, \\quad \\mathcal{X}_{-1} := \\{x \\in \\mathcal{X} \\mid \\mathbb{E}(y \\mid x) < 0\\}.$$\n\nThe distance of a point $x \\in \\mathcal{X}_i$ to the set $\\mathcal{X}_{-i}$ is denoted by $\\Delta(x)$.\n\nAssumption 3. We say that the data distribution satisfies a separation condition if there exists $\\tau > 0$ such that $P_X(\\Delta(x) < \\tau) = 0$.\n\nIntuitively, Assumption 3 requires the two classes to be far apart from each other almost surely. This separation assumption is the extreme case in which the margin noise exponent goes to infinity.\n\nThe separation condition characterizes a different aspect of the data distribution from Massart's low noise condition. Massart's low noise condition guarantees that the random samples represent the distribution behind them accurately, while the separation condition guarantees the existence of a smooth function, in the sense of small derivatives, achieving the same risk as the Bayes classifier. With both assumptions imposed on $P$, we can get a fast learning rate of $\\ln^{2d+1}(m)/m$ with only $\\ln^{2d}(m)$ random features, as stated in the following theorem.\n\nTheorem 2. Assume that $\\mathcal{X}$ is bounded by radius $\\rho$, and that the data distribution has a density function upper bounded by a constant $B$ and satisfies Assumptions 1 and 3. 
Then by choosing\n\n$$\\lambda = 1/m, \\quad \\gamma = \\tau/\\sqrt{\\ln m}, \\quad N = C_{\\tau,d,\\rho}\\ln^{2d}(m)\\left(\\ln\\ln m + \\ln(1/\\delta)\\right),$$\n\nthe RFSVM using an optimized feature map corresponding to the Gaussian kernel with bandwidth $\\gamma$ achieves the learning rate\n\n$$R^{0\\text{-}1}_P(f_{N,m,\\lambda}) - R^{0\\text{-}1}_P(f^*_P) \\le C_{\\tau,V,d,\\rho,B}\\,\\frac{\\ln^{2d+1}(m)\\left(\\ln\\ln(m) + \\ln(1/\\delta)\\right)}{m},$$\n\nwith probability greater than $1 - 4\\delta$ for $m \\ge m_0$, where $m_0$ depends on $\\tau, \\rho, d$.\n\nTo the best of our knowledge, this is the first theorem on the fast learning rate of the random features method in the unrealizable case. It only assumes that the data distribution satisfies the low noise and separation conditions, and shows that with an optimized feature distribution, a learning rate of $\\tilde{O}(1/m)$ can be achieved using only $\\ln^{2d+1}(m) \\ll m$ features. This justifies the benefit of using RFSVM in binary classification problems. The assumptions of a bounded data set and a bounded distribution density function can be dropped if we instead assume that the probability density function is upper bounded by $C\\exp(-\\gamma^2\\|x\\|^2/2)$, which suffices to provide the sub-exponential decay of the spectrum of $\\Sigma_{\\phi}$. But we prefer the simpler form of the results under the current conditions. We speculate that the conclusion of Theorem 2 can be generalized to all sub-Gaussian data.\n\nThe main drawback of our two theorems is the assumption of an optimized feature distribution, which is hard to obtain in practice. Developing a data-dependent feature selection method is therefore an important problem for future work on RFSVM. Bach [2017] proposed an algorithm to approximate the optimized feature map from any feature map. Adapted to our setup, the reweighted feature selection algorithm is described as follows.\n\n1. Select $M$ i.i.d. random vectors $\\{\\omega_i\\}_{i=1}^M$ according to the distribution $d\\nu_{\\gamma}$.\n\n2. Select $L$ data points $\\{x_i\\}_{i=1}^L$ uniformly from the training set.\n\n3. Generate the matrix $\\Phi$ with columns $\\phi_M(x_i)/\\sqrt{L}$.\n\n4. Compute $\\{r_i\\}_{i=1}^M$, the diagonal of $\\Phi\\Phi^{\\top}(\\Phi\\Phi^{\\top} + \\mu I)^{-1}$.\n\n5. Resample $N$ features from $\\{\\omega_i\\}_{i=1}^M$ according to the probability distribution $p_i = r_i/\\sum_j r_j$.\n\nThe theoretical guarantees of this algorithm have not been discussed in the literature. A result in this direction would be extremely useful for guiding practitioners, but it is outside the scope of our work. Instead, we implement it in our experiments and empirically compare the performance of RFSVM using this reweighted feature selection method with the performance of RFSVM without this preprocessing step; see Section 4.\n\nFor the realizable case, if we drop the assumption of an optimized feature map, only weak results can be obtained for the learning rate and the number of features required (see Appendix E for more details). In particular, we can only show that $1/\\epsilon^2$ random features are sufficient to guarantee a learning rate less than $\\epsilon$ when $1/\\epsilon^3$ samples are available. Though not helpful for justifying the computational benefit of the random features method, this result matches the parallel result for RFKRR in Rudi and Rosasco [2017] and the approximation result in Sriperumbudur and Szabo [2015]. We conjecture that this upper bound is also optimal for RFSVM.\n\nRudi and Rosasco [2017] also compared the performance of RFKRR with the Nystrom method, which is the other popular method to scale kernel ridge regression to large data sets. We do not find any theoretical guarantees on the fast learning rate of SVM with the Nystrom method on classification problems in the literature, though there are several works on its approximation quality to the accurate model and its empirical performance (see Yang et al. [2012], Zhang et al. [2012]). 
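The five steps above can be sketched in NumPy as follows. This is our reading of the procedure, not the authors' released code: the choice of $\phi_M$ as the random Fourier map, the value of $\mu$, and summing the leverage scores of the cos and sin rows belonging to the same $\omega_i$ are our own illustrative assumptions.

```python
import numpy as np

def reweighted_feature_selection(X, N, M, L, gamma=1.0, mu=1e-3, seed=0):
    """Approximate the optimized feature distribution by leverage-score resampling."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Step 1: M i.i.d. frequencies from nu_gamma = N(0, gamma^{-2} I).
    omegas = rng.normal(scale=1.0 / gamma, size=(M, d))
    # Step 2: L data points drawn uniformly from the training set.
    sub = X[rng.choice(len(X), size=L, replace=False)]
    # Step 3: matrix Phi whose columns are phi_M(x_i) / sqrt(L); shape (2M, L).
    proj = sub @ omegas.T
    Phi = np.vstack([np.cos(proj).T, np.sin(proj).T]) / np.sqrt(M * L)
    # Step 4: r = diag(Phi Phi^T (Phi Phi^T + mu I)^{-1}), the leverage scores.
    G = Phi @ Phi.T
    r = np.diag(G @ np.linalg.inv(G + mu * np.eye(2 * M)))
    scores = r[:M] + r[M:]            # assumption: combine the cos/sin rows of each omega_i
    # Step 5: resample N frequencies with probability p_i proportional to r_i.
    p = scores / scores.sum()
    return omegas[rng.choice(M, size=N, replace=True, p=p)], p

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
selected, p = reweighted_feature_selection(X, N=20, M=200, L=150)
```

Note that the matrix inverted in step 4 is only $2M \times 2M$, so the preprocessing cost is controlled by the size of the candidate pool $M$, not the full training set.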
The tools used in this paper should also work for the learning rate analysis of SVM using the Nystrom method. We leave this analysis to future work.\n\n4 Experimental Results\n\nIn this section we evaluate the performance of RFSVM with the reweighted feature selection algorithm2. The sample points shown in Figure 3 are generated from either the inner circle or the outer annulus uniformly with equal probability, where the radius of the inner circle is 0.9 and the radius of the outer annulus ranges from 1.1 to 2. The points from the inner circle are labeled by -1 with probability 0.9, while the points from the outer annulus are labeled by 1 with probability 0.9. In such a simple case, the unit circle describes the Bayes classifier.\n\nFirst, we compared the performance of RFSVM with that of KSVM on a training set with 1000 samples, over a large range of the regularization parameter ($-7 \\le \\log\\lambda \\le 1$). The bandwidth parameter $\\gamma$ is fixed to be an estimate of the average distance among the training samples. After training, models are tested on a large testing set ($> 10^5$ points). For RFSVM, we considered the effect of the number of features by setting $N$ to be 1, 3, 5, 10 and 20, respectively. Moreover, both feature selection methods are inspected: simple random feature selection (labeled by 'unif' in the figures), which does not apply any preprocessing when drawing features, and reweighted feature selection (labeled by 'opt' in the figures). For the reweighted method, we set $M = 100N$ and $L = 0.3m$ to compute the weight of each feature. Every RFSVM is run 10 times, and the average accuracy and standard deviation are presented.\n\n2The source code is available at https://github.com/syitong/randfourier.\n\nFigure 1: RFSVM with 1 feature. Figure 2: RFSVM with 20 features. \"ksvm\" is for KSVM with Gaussian kernel, \"unif\" is for RFSVM with direct feature sampling, and \"opt\" is for RFSVM with reweighted feature sampling. Error bars represent standard deviation over 10 runs.\n\nFigure 3: Distribution of Training Samples. 50 points are shown in the graph. Blue crosses represent the points labeled by -1, and red circles the points labeled by 1. The unit circle is one of the best classifiers for these data, with 90% accuracy.\n\nFigure 4: Learning Rate of RFSVMs. The excess risks of RFSVMs with the simple random feature selection (\"unif\") and the reweighted feature selection (\"opt\") are shown for different sample sizes. The error rate is the excess risk. The error bars represent the standard deviation over 10 runs.\n\nThe results of KSVM and RFSVMs with 1 and 20 features are shown in Figure 1 and Figure 2, respectively (see the results for other numbers of features in Appendix F in the supplementary material). The performance of RFSVM is slightly worse than that of KSVM, but improves as the number of features increases. It also performs better when the reweighted method is applied to generate features.\n\nTo further compare the performance of the simple and reweighted feature selection methods, we plot the learning rate of RFSVM with $O(\\ln^2(m))$ features and the best $\\lambda$ for each sample size $m$. KSVM is not included here since, compared to RFSVM, it is too slow on training sets of size larger than $10^4$ in our experiment. The error rate in Figure 4 is the excess risk between the learned classifiers and the Bayes classifier. We can see that the excess risk decays as $m$ increases, and that RFSVM using the reweighted feature selection method outperforms the simple feature selection.\n\nAccording to Theorem 2, the benefit brought by an optimized feature map, that is, the fast learning rate, will show up when the sample size is greater than $O(\\exp(d))$ (see Appendix D). The number of random features required also depends on $d$, the dimension of the data. 
For data of small dimension and large sample size, as in our experiment, this dependence on d is not a problem. However, in applications such as image recognition, the dimension of the data is usually very large, and it is hard for our theorem to explain the performance of RFSVM. On the other hand, if we do not pursue the fast learning rate, the analysis for general feature maps, not necessarily optimized, gives a learning rate of O(m^{−1/3}) with O(m^{2/3}) random features, which does not depend on the dimension of the data (see Appendix E). Indeed, for high-dimensional data there is barely any improvement in the performance of RFSVM from the reweighted feature selection method (see Appendix F). It is important to understand the role of d in order to fully understand the power of the random features method.

5 Conclusion

Our study proves that a fast learning rate is possible for RFSVM in both realizable and unrealizable scenarios when the optimized feature map is available. In particular, the number of features required is far less than the sample size, which implies considerably faster training and testing with the random features method. Moreover, we show in the experiments that even though we can only approximate the optimized feature distribution using the reweighted feature selection method, it indeed performs better than simple random feature selection.
Considering that such a reweighted method does not rely on the label distribution at all, it should be useful in learning scenarios where multiple classification problems share the same features but differ in the class labels. We believe that a theoretical guarantee on the performance of the reweighted feature selection method, and a proper understanding of the dependence on the dimensionality of the data, are interesting directions for future work.

Acknowledgements

AT acknowledges the support of a Sloan Research Fellowship. ACG acknowledges the support of a Simons Foundation Fellowship.

References

Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(21):1–38, 2017.

Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. Journal of Machine Learning Research, 9:113–120, 2010. ISSN 1532-4435.

Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2002.

Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3041–3049. Curran Associates, Inc., 2014.

Eric Moulines, Francis R. Bach, and Zaïd Harchaoui. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems, pages 609–616, 2008.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 408–415, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-205-4.
doi: 10.1145/1390156.1390208.

P. S. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep neural networks on TIMIT. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 205–209, May 2014. doi: 10.1109/ICASSP.2014.6853587.

Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics, volume 2033. Springer-Verlag Berlin Heidelberg, Berlin, Heidelberg, 2011.

P. D. Lax. Functional Analysis. Pure and Applied Mathematics. Wiley, 2002. ISBN 9780471556046. URL https://books.google.com/books?id=-jbvAAAAMAAJ.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1313–1320. Curran Associates, Inc., 2009.

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218–3228, 2017.

Clint Scovel, Don Hush, Ingo Steinwart, and James Theiler. Radial kernels and their reproducing kernel Hilbert spaces. Journal of Complexity, 26(6):641–660, 2010.

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.
ISSN 1436-4646. doi: 10.1007/s10107-010-0420-4.

Bharath Sriperumbudur and Zoltan Szabo. Optimal rates for random Fourier features. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/5740-optimal-rates-for-random-fourier-features.pdf.

I. Steinwart and A. Christmann. Support Vector Machines. Information Science and Statistics. Springer New York, 2008. ISBN 9780387772424.

Dougal J. Sutherland and Jeff G. Schneider. On the error of random Fourier features. CoRR, abs/1506.02785, 2015.

Harold Widom. Asymptotic behavior of the eigenvalues of certain integral equations. Transactions of the American Mathematical Society, 109(2):278–295, 1963. ISSN 00029947. URL http://www.jstor.org/stable/1993907.

Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 476–484. Curran Associates, Inc., 2012.

Kai Zhang, Liang Lan, Zhuang Wang, and Fabian Moerchen. Scaling up kernel SVM on limited resources: A low-rank linearization approach. In Neil D. Lawrence and Mark Girolami, editors, Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, volume 22 of Proceedings of Machine Learning Research, pages 1425–1434, La Palma, Canary Islands, 21–23 Apr 2012. PMLR.
URL http://proceedings.mlr.press/v22/zhang12d.html.