{"title": "Scalable Kernel Methods via Doubly Stochastic Gradients", "book": "Advances in Neural Information Processing Systems", "page_first": 3041, "page_last": 3049, "abstract": "The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called ``doubly stochastic functional gradients''. Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic approximations to the functional gradient---one using random training points and another using random features associated with the kernel---and performing descent steps with this noisy functional gradient. Our algorithm is simple, need no commit to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We demonstrate that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space in rate O(1/t), and achieves a generalization bound of O(1/\\sqrt{t}). Our approach can readily scale kernel methods up to the regimes which are dominated by neural nets. 
We show competitive performance of our approach as compared to neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.", "full_text": "Scalable Kernel Methods via Doubly Stochastic Gradients

Bo Dai1, Bo Xie1, Niao He1, Yingyu Liang2, Anant Raj1, Maria-Florina Balcan3, Le Song1
1Georgia Institute of Technology, {bodai, bxie33, nhe6, araj34}@gatech.edu, lsong@cc.gatech.edu
2Princeton University, yingyul@cs.princeton.edu
3Carnegie Mellon University, ninamf@cs.cmu.edu

Abstract

The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves these optimization problems by making two unbiased stochastic approximations to the functional gradient (one using random training points and another using random features associated with the kernel) and performing descent steps with this noisy functional gradient. Our algorithm is simple, requires no commitment to a preset number of random features, and allows the flexibility of the function class to grow as we see more incoming data in the streaming setting. We demonstrate that a function learned by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at a rate of O(1/t), and achieves a generalization bound of O(1/√t). Our approach can readily scale kernel methods up to regimes that are currently dominated by neural nets. We show competitive performance of our approach as compared to neural nets on datasets such as 2.3 million energy materials from MolecularSpace, 8 million handwritten digits from MNIST, and 1 million photos from ImageNet using convolution features.

1 Introduction

The general perception is that kernel methods are not scalable. When it comes to large-scale nonlinear learning problems, the methods of choice so far are neural nets, although theoretical understanding remains incomplete. Are kernel methods really not scalable? Or is it simply because we have not tried hard enough, while neural nets have exploited sophisticated designs of feature architectures, virtual example generation for dealing with invariance, stochastic gradient descent for efficient training, and GPUs for further speedup?

A bottleneck in scaling up kernel methods comes from the storage and computation cost of the dense kernel matrix, K. Storing the matrix requires O(n²) space, and computing it takes O(n²d) operations, where n is the number of data points and d is the dimension. There have been many great attempts to scale up kernel methods, including efforts from the perspectives of numerical linear algebra, functional analysis, and numerical optimization.

A common numerical linear algebra approach is to approximate the kernel matrix using low-rank factorizations, K ≈ AᵀA, with A ∈ R^{r×n} and rank r ⩽ n. This low-rank approximation allows subsequent kernel algorithms to operate directly on A, but computing the approximation requires O(nr² + nrd) operations. Many works followed this strategy, including greedy basis selection techniques [1], Nyström approximation [2], and incomplete Cholesky decomposition [3]. In practice, one observes that kernel methods with approximated kernel matrices often lose a few percentage points of performance.
In fact, without further assumptions on the regularity of the kernel matrix, the generalization ability after using low-rank approximation is typically of order O(1/√r + 1/√n) [4, 5], which implies that the rank needs to be nearly linear in the number of data points! Thus, in order for kernel methods to achieve the best generalization ability, low-rank approximation based approaches immediately become impractical for big datasets because of their O(n³ + n²d) preprocessing time and O(n²) storage.

Random feature approximation is another popular approach for scaling up kernel methods [6, 7]. This method directly approximates the kernel function, instead of the kernel matrix, using explicit feature maps. The advantage of this approach is that the random feature matrix for n data points can be computed in time O(nrd) using O(nr) storage, where r is the number of random features. Subsequent algorithms then only need to operate on an O(nr) matrix. Similar to the low-rank kernel matrix approximation approach, the generalization ability of this approach is of order O(1/√r + 1/√n) [8, 9], which implies that the number of random features also needs to be O(n). Another common drawback of these two approaches is that adapting the solution from a small r to a larger r′ is not easy if one wants to increase the rank of the approximated kernel matrix or the number of random features for better generalization ability. Special procedures need to be designed to reuse the solution obtained from a small r, which is not straightforward.

Another approach that addresses the scalability issue arises from the optimization perspective. One general strategy is to solve the dual forms of kernel methods using block-coordinate descent (e.g., [10, 11, 12]). Each iteration of this algorithm incurs only O(nrd) computation and O(nr) storage, where r is the block size.
A second strategy is to perform functional gradient descent based on a batch of data points at each epoch (e.g., [13, 14]). The computation and storage required per iteration are thus also O(nrd) and O(nr), respectively, where r is the batch size. These approaches can straightforwardly adapt to a different r without restarting the optimization procedure, and they exhibit no generalization loss since they do not approximate the kernel matrix or function. However, a serious drawback of these approaches is that, without further approximation, all support vectors need to be stored for testing, and there can be as many of them as the entire training set (e.g., in kernel ridge regression and non-separable nonlinear classification problems).

In summary, there exists a delicate trade-off between computation, storage, and statistics when scaling up kernel methods. Inspired by various previous efforts, we propose a simple yet general strategy that scales up many kernel methods using a novel concept called "doubly stochastic functional gradients". Our method relies on the fact that most kernel methods can be expressed as convex optimization problems over functions in a reproducing kernel Hilbert space (RKHS) and solved via functional gradient descent. Our algorithm proceeds by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending using this noisy functional gradient. The key intuitions behind our algorithm originate from (i) the property of stochastic gradient descent that, as long as the stochastic gradient is unbiased, convergence of the algorithm is guaranteed [15]; and (ii) the property of pseudo-random number generators that an apparently random sequence of samples can in fact be completely determined by an initial value (a seed).
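Property (ii), that a seeded generator reproduces exactly the same "random" draws on demand, can be illustrated in a couple of lines (our own example using NumPy's generator API, not code from the paper):

```python
import numpy as np

# Two generators built from the same seed yield identical sample streams,
# so a random draw can be regenerated later instead of being stored.
draws_a = np.random.default_rng(seed=42).normal(size=5)
draws_b = np.random.default_rng(seed=42).normal(size=5)
assert np.array_equal(draws_a, draws_b)
```

This is exactly what lets the algorithm later discard its random features and recreate them from seeds at prediction time.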
We exploit these properties to enable kernel methods to achieve a better balance between computation, storage, and statistics. Our method integrates kernel methods, functional analysis, stochastic optimization, and algorithmic tricks, and it possesses a number of desiderata:

Generality and simplicity. Our approach applies to many kernel methods, such as kernel versions of ridge regression, support vector machines, logistic regression, and the two-sample test, as well as many different types of kernels, such as shift-invariant, polynomial, and general inner product kernels. The algorithm can be summarized in just a few lines of code (Algorithms 1 and 2). For a different problem and kernel, we just need to replace the loss function and the random feature generator.

Flexibility. While previous approaches based on random features typically require a pre-fixed number of features, our approach allows the number of random features, and hence the flexibility of the function class, to grow with the number of data points. Therefore, unlike previous random feature approaches, our approach applies to the data streaming setting and achieves the full potential of nonparametric methods.

Efficient computation. The key computation of our method comes from evaluating the doubly stochastic functional gradient, which involves generating the random features for specific seeds and evaluating these features on a small batch of data points. At iteration t, the computational complexity is O(td).

Small memory. While most approaches require saving all the support vectors, our algorithm avoids keeping them, since it only requires a small program to regenerate the random features and sample historical features according to specific random seeds. At iteration t, the memory needed is O(t), independent of the dimension of the data.

Theoretical guarantees.
We provide a novel and nontrivial analysis involving Hilbert space martingales and a newly proved recurrence relation, and demonstrate that the estimator produced by our algorithm, which might be outside of the RKHS, converges to the optimal RKHS function. More specifically, both in expectation and with high probability, our algorithm estimates the optimal function in the RKHS at the rate O(1/t) and achieves a generalization bound of O(1/√t), which are indeed optimal [15]. The variance of the random features, introduced in our second approximation to the functional gradient, contributes only additively to the constant in the convergence rate. These results are the first of their kind in the literature and could be of independent interest.

Strong empirical performance. Our algorithm can readily scale kernel methods up to regimes that were previously dominated by neural nets. We show that our method compares favorably to other scalable kernel methods on medium-scale datasets, and to neural nets on big datasets with millions of data points.

In the remainder, we first introduce preliminaries on kernel methods and functional gradients. We then describe our algorithm and provide both theoretical and empirical support.

2 Duality between Kernels and Random Processes

Kernel methods owe their name to the use of kernel functions, k(x, x′) : X × X → R, which are symmetric positive definite (PD), meaning that for all n > 1, x₁, . . . , xₙ ∈ X, and c₁, . . . , cₙ ∈ R, we have ∑_{i,j=1}^n c_i c_j k(x_i, x_j) ⩾ 0. There is an intriguing duality between kernels and stochastic processes which will play a crucial role in our algorithm design later. More specifically,

Theorem 1 (e.g., Devinatz [16]; Hein & Bousquet [17]) If k(x, x′) is a PD kernel, then there exists a set Ω, a measure P on Ω, and a random function φ_ω(x) : X → R from L²(Ω, P), such that k(x, x′) = ∫_Ω φ_ω(x) φ_ω(x′) dP(ω).

Essentially, the above integral representation relates the kernel function to a random process ω with measure P(ω). Note that the integral representation may not be unique. For instance, the random process can be a Gaussian process on X with sample function φ_ω(x), and k(x, x′) is simply the covariance function between two points x and x′. If the kernel is also continuous and shift-invariant, i.e., k(x, x′) = k(x − x′) for x ∈ R^d, then the integral representation specializes into a form characterized by the inverse Fourier transform (e.g., [18, Theorem 6.6]):

Theorem 2 (Bochner) A continuous, real-valued, symmetric and shift-invariant function k(x − x′) on R^d is a PD kernel if and only if there is a finite non-negative measure P(ω) on R^d, such that k(x − x′) = ∫_{R^d} e^{iω⊤(x−x′)} dP(ω) = ∫_{R^d×[0,2π]} 2 cos(ω⊤x + b) cos(ω⊤x′ + b) d(P(ω) × P(b)), where P(b) is a uniform distribution on [0, 2π], and φ_ω(x) = √2 cos(ω⊤x + b).

For the Gaussian RBF kernel, k(x − x′) = exp(−‖x − x′‖²/2σ²), this yields a Gaussian distribution P(ω) with density proportional to exp(−σ²‖ω‖²/2); for the Laplace kernel, this yields a Cauchy distribution; and for the Matérn kernel, this yields convolutions of the unit ball [19].
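To make Theorem 2 concrete, here is a small Monte Carlo check (our own illustration; the function names and the feature count m are ours, not from the paper) that averaging products of √2 cos(ω⊤x + b) features recovers the Gaussian RBF kernel:

```python
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    # exact Gaussian RBF kernel: k(x - x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def random_feature_kernel(x, xp, sigma=1.0, m=200000, seed=0):
    # Monte Carlo estimate of k via phi_w(x) = sqrt(2) cos(w^T x + b),
    # with w ~ N(0, I / sigma^2) and b ~ Uniform[0, 2pi] (Bochner's theorem)
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / sigma, size=(m, x.shape[0]))
    b = rng.uniform(0.0, 2 * np.pi, size=m)
    phi_x = np.sqrt(2.0) * np.cos(W @ x + b)
    phi_xp = np.sqrt(2.0) * np.cos(W @ xp + b)
    return np.mean(phi_x * phi_xp)
```

With m = 200,000 samples the estimate agrees with the exact kernel to within a couple of hundredths, illustrating why the number of features controls the approximation quality.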
Similar representations, where the explicit forms of φ_ω(x) and P(ω) are known, can also be derived for rotation-invariant kernels, k(x, x′) = k(⟨x, x′⟩), using the Fourier transform on the sphere [19]. For polynomial kernels, k(x, x′) = (⟨x, x′⟩ + c)^p, a random tensor sketching approach can also be used [20]. Instead of finding the random process P(ω) and functions φ_ω(x) for given kernels, one can go in the reverse direction and construct kernels from random processes and functions (e.g., Wendland [18]):

Theorem 3 If k(x, x′) = ∫_Ω φ_ω(x) φ_ω(x′) dP(ω) for a nonnegative measure P(ω) on Ω and φ_ω(x) : X → R from L²(Ω, P), then k(x, x′) is a PD kernel.

For instance, φ_ω(x) := cos(ω⊤ψ_θ(x) + b), where ψ_θ(x) can be a random convolution of the input x parametrized by θ. Another important concept is the reproducing kernel Hilbert space (RKHS). An RKHS H on X is a Hilbert space of functions from X to R. H is an RKHS if and only if there exists a k(x, x′) : X × X → R such that ∀x ∈ X, k(x, ·) ∈ H, and ∀f ∈ H, ⟨f(·), k(x, ·)⟩_H = f(x). If such a k(x, x′) exists, it is unique and it is a PD kernel. A function f ∈ H if and only if ‖f‖²_H := ⟨f, f⟩_H < ∞, and its L₂ norm is dominated by the RKHS norm, ‖f‖_{L₂} ⩽ ‖f‖_H.

3 Doubly Stochastic Functional Gradients

Many kernel methods can be written as convex optimization problems over functions in the RKHS and solved using functional gradient methods [13, 14]. Inspired by these previous works, we introduce a novel concept called "doubly stochastic functional gradients" to address the scalability issue.
Let l(u, y) be a scalar loss function that is convex in u ∈ R, and let the subgradient of l(u, y) with respect to u be l′(u, y). Given a PD kernel k(x, x′) and the associated RKHS H, many kernel methods try to find a function f* ∈ H which solves the optimization problem

argmin_{f ∈ H} R(f) := E_{(x,y)}[l(f(x), y)] + (ν/2)‖f‖²_H  ⟺  argmin_{‖f‖_H ⩽ B(ν)} E_{(x,y)}[l(f(x), y)],   (1)

where ν > 0 is a regularization parameter, B(ν) is a non-increasing function of ν, and the data (x, y) follow a distribution P(x, y). The functional gradient ∇R(f) is defined as the linear term in the change of the objective after we perturb f by ε in the direction of g, i.e.,

R(f + εg) = R(f) + ε⟨∇R(f), g⟩_H + O(ε²).   (2)

For instance, applying the above definition, we have ∇f(x) = ∇⟨f, k(x, ·)⟩_H = k(x, ·), and ∇‖f‖²_H = ∇⟨f, f⟩_H = 2f.

Stochastic functional gradient. Given a data point (x, y) ∼ P(x, y) and f ∈ H, the stochastic functional gradient of E_{(x,y)}[l(f(x), y)] with respect to f ∈ H is

ξ(·) := l′(f(x), y) k(x, ·),   (3)

which is essentially a single data point approximation to the true functional gradient. Furthermore, for any g ∈ H, we have ⟨ξ(·), g⟩_H = l′(f(x), y) g(x). Inspired by the duality between kernel functions and random processes, we can make an additional approximation to the stochastic functional gradient using a random function φ_ω(x) sampled according to P(ω). More specifically,

Doubly stochastic functional gradient. Let ω ∼ P(ω); then the doubly stochastic gradient of E_{(x,y)}[l(f(x), y)] with respect to f ∈ H is

ζ(·) := l′(f(x), y) φ_ω(x) φ_ω(·).   (4)

Note that the stochastic functional gradient ξ(·) is in the RKHS H, but ζ(·) may be outside H, since φ_ω(·) may be outside the RKHS. For instance, for the Gaussian RBF kernel, the random function φ_ω(x) = √2 cos(ω⊤x + b) is outside the RKHS associated with the kernel function. However, these functional gradients are related by ξ(·) = E_ω[ζ(·)], which leads to unbiased estimators of the original functional gradient, i.e.,

∇R(f) = E_{(x,y)}[ξ(·)] + νf(·),  and  ∇R(f) = E_{(x,y)}E_ω[ζ(·)] + νf(·).   (5)

We emphasize that the source of randomness associated with the random function is not present in the data, but is artificially introduced by us. This is crucial for the development of our scalable algorithm in the next section. Meanwhile, it also creates additional challenges in the analysis of the algorithm, which we will deal with carefully.

4 Doubly Stochastic Kernel Machines

Algorithm 1: {α_i}_{i=1}^t = Train(P(x, y))
Require: P(ω), φ_ω(x), l(f(x), y), ν.
1: for i = 1, . . . , t do
2:   Sample (x_i, y_i) ∼ P(x, y).
3:   Sample ω_i ∼ P(ω) with seed i.
4:   f(x_i) = Predict(x_i, {α_j}_{j=1}^{i−1}).
5:   α_i = −γ_i l′(f(x_i), y_i) φ_{ω_i}(x_i).
6:   α_j = (1 − γ_i ν) α_j for j = 1, . . . , i − 1.
7: end for

Algorithm 2: f(x) = Predict(x, {α_i}_{i=1}^t)
Require: P(ω), φ_ω(x).
1: Set f(x) = 0.
2: for i = 1, . . . , t do
3:   Sample ω_i ∼ P(ω) with seed i.
4:   f(x) = f(x) + α_i φ_{ω_i}(x).
5: end for

The first key intuition behind our algorithm originates from the property of stochastic gradient descent that, as long as the stochastic gradient is bounded and unbiased, convergence of the algorithm is guaranteed [15]. In our algorithm, we exploit this property and introduce two sources of randomness, one from the data and another artificial, to scale up kernel methods.

The second key intuition behind our algorithm is that the random functions used in the doubly stochastic functional gradients are sampled according to pseudo-random number generators, where a sequence of apparently random samples is in fact completely determined by an initial value (a seed). Although these samples are not "true" random samples in the purest sense of the word, they suffice for our task in practice.

To be more specific, our algorithm proceeds by making two stochastic approximations to the functional gradient in each iteration, and then descending using this noisy functional gradient. The overall training and prediction procedures are summarized in Algorithms 1 and 2. The training algorithm essentially just performs sampling of random functions and evaluation of doubly stochastic gradients, and maintains a collection of real numbers {α_i}, which is computationally efficient and memory friendly. A crucial step in the algorithm is to sample the random functions with "seed i". The seeds have to be aligned between training and prediction, and matched with the corresponding α_i obtained in each iteration. The learning rate γ_t in the algorithm needs to be chosen as O(1/t), as shown by our later analysis, to achieve the best rate of convergence. For now, we assume that we have access to the data generating distribution P(x, y).
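A minimal runnable sketch of Algorithms 1 and 2 may help; this is our own instantiation for squared loss l(u, y) = ½(u − y)² (so l′(u, y) = u − y) with the Gaussian RBF kernel and step size γ_i = θ/i. All function names and parameter values here are our assumptions, not the authors' code:

```python
import numpy as np

def sample_feature(seed, d, sigma=1.0):
    # regenerate (omega_i, b_i) from "seed i": omega ~ N(0, I/sigma^2), b ~ U[0, 2pi]
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 1.0 / sigma, size=d)
    b = rng.uniform(0.0, 2 * np.pi)
    return w, b

def phi(x, w, b):
    # random feature for the Gaussian RBF kernel: sqrt(2) cos(w^T x + b)
    return np.sqrt(2.0) * np.cos(x @ w + b)

def predict(x, alphas, d, sigma=1.0):
    # Algorithm 2: rebuild every historical feature from its seed; nothing but
    # the scalar coefficients alpha_i is ever stored
    f = 0.0
    for i, a in enumerate(alphas, start=1):
        w, b = sample_feature(i, d, sigma)
        f += a * phi(x, w, b)
    return f

def train(X, Y, t, nu=1e-3, theta=1.0, sigma=1.0, data_seed=123):
    # Algorithm 1 with squared loss and step size gamma_i = theta / i
    n, d = X.shape
    data_rng = np.random.default_rng(data_seed)
    alphas = []
    for i in range(1, t + 1):
        j = data_rng.integers(n)              # sample (x_i, y_i) ~ P(x, y)
        gamma = theta / i
        fx = predict(X[j], alphas, d, sigma)  # uses seeds 1, ..., i-1
        w, b = sample_feature(i, d, sigma)    # sample omega_i with seed i
        alphas = [(1.0 - gamma * nu) * a for a in alphas]
        alphas.append(-gamma * (fx - Y[j]) * phi(X[j], w, b))
    return alphas
```

Note how training stores only the t scalars α_i, matching the O(t) memory claim, while each prediction pays O(t) feature regenerations; this naive sketch therefore trains in O(t²d) time, consistent with the O(td) per-iteration cost stated above.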
This can be modified to sampling uniformly at random from a fixed dataset without affecting the algorithm or the later convergence analysis. Let the sampled data and random function parameters after t iterations be D_t := {(x_i, y_i)}_{i=1}^t and ω_t := {ω_i}_{i=1}^t, respectively. The function obtained by Algorithm 1 is a simple additive form of the doubly stochastic functional gradients

f_1(·) = 0,  f_{t+1}(·) = f_t(·) − γ_t(ζ_t(·) + νf_t(·)) = ∑_{i=1}^t a_t^i ζ_i(·),  ∀t > 1,   (6)

where a_t^i = −γ_i ∏_{j=i+1}^t (1 − γ_j ν) are deterministic values depending on the step sizes γ_j (i ⩽ j ⩽ t) and the regularization parameter ν. This simple form makes it easy for us to analyze its convergence. We note that our algorithm can also take a mini-batch of points and random functions at each step, and estimate an empirical covariance for preconditioning, to achieve potentially better performance.

5 Theoretical Guarantees

In this section, we show that, both in expectation and with high probability, our algorithm can estimate the optimal function in the RKHS at rate O(1/t) and achieve a generalization bound of O(1/√t). The analysis for our algorithm has a new twist compared to previous analyses of stochastic gradient descent, since the random function approximation results in an estimator which is outside the RKHS. Besides the analysis for stochastic functional gradient descent, we need to use martingales and the corresponding concentration inequalities to prove that the sequence of estimators, f_{t+1}, outside the RKHS converges to the optimal function, f*, in the RKHS. We make the following standard assumptions for later reference:

A. There exists an optimal solution, denoted f*, to the problem of interest (1).
B. The loss function ℓ(u, y) : R × R → R and its first-order derivative are L-Lipschitz continuous in terms of the first argument.
C. For any data {(x_i, y_i)}_{i=1}^t and any trajectory {f_i(·)}_{i=1}^t, there exists M > 0 such that |ℓ′(f_i(x_i), y_i)| ⩽ M. Note that in our situation M exists and M < ∞, since we assume a bounded domain and the functions f_t we generate are always bounded as well.
D. There exist κ > 0 and φ > 0 such that k(x, x′) ⩽ κ and |φ_ω(x)φ_ω(x′)| ⩽ φ, ∀x, x′ ∈ X, ω ∈ Ω. For example, when k(·, ·) is the Gaussian RBF kernel, we have κ = 1, φ = 2.

We now present our main theorems. Due to space restrictions, we provide only short proof sketches here; the full proofs of these theorems are given in the appendix.

Theorem 4 (Convergence in expectation) When γ_t = θ/t with θ > 0 such that θν ∈ (1, 2) ∪ Z₊,

E_{D_t,ω_t}[|f_{t+1}(x) − f*(x)|²] ⩽ (2C² + 2κQ₁²)/t,  for any x ∈ X,

where Q₁ = max{‖f*‖_H, (Q₀ + √(Q₀² + (2θν − 1)(1 + θν)²θ²κM²))/(2νθ − 1)}, with Q₀ = 2√2 κ^{1/2}(κ + φ)LMθ², and C² = 4(κ + φ)²M²θ².

Theorem 5 (Convergence with high probability) When γ_t = θ/t with θ > 0 such that θν ∈ Z₊, for any x ∈ X, we have with probability at least 1 − 3δ over (D_t, ω_t),

|f_{t+1}(x) − f*(x)|² ⩽ C² ln(2/δ)/t + 2κQ₂² ln(2t/δ) ln²(t)/t,

where C is as above and Q₂ = max{‖f*‖_H, Q₀ + √(Q₀² + κM²(1 + θν)²(θ² + 16θ/ν))}, with Q₀ = 4√2 κ^{1/2}Mθ(8 + (κ + φ)θL).

Proof sketch: We focus on the convergence in expectation;
the high probability bound can be established in a similar fashion. The main technical difficulty is that f_{t+1} may not lie in the RKHS H. The key of the proof is then to construct an intermediate function h_{t+1}, such that both the difference between f_{t+1} and h_{t+1} and the difference between h_{t+1} and f* can be bounded. More specifically,

h_1(·) = 0,  h_{t+1}(·) = h_t(·) − γ_t(ξ_t(·) + νh_t(·)) = ∑_{i=1}^t a_t^i ξ_i(·),  ∀t > 1,   (7)

where ξ_t(·) = E_{ω_t}[ζ_t(·)]. Then for any x, the error can be decomposed into two terms:

|f_{t+1}(x) − f*(x)|² ⩽ 2|f_{t+1}(x) − h_{t+1}(x)|²  (error due to random functions)  + 2κ‖h_{t+1} − f*‖²_H  (error due to random data).

For the error term due to random functions, h_{t+1} is constructed such that f_{t+1} − h_{t+1} is a martingale, and the step sizes are chosen such that |a_t^i| ⩽ θ/t, which allows us to bound the martingale. In other words, the choice of step sizes keeps f_{t+1} close to the RKHS. For the error term due to random data, since h_{t+1} ∈ H, we can now apply the standard arguments for stochastic approximation in the RKHS. Due to the additional randomness, the recursion is slightly more complicated: e_{t+1} ⩽ (1 − 2νθ/t) e_t + β₁√(e_t)/t + β₂/t², where e_{t+1} = E_{D_t,ω_t}[‖h_{t+1} − f*‖²_H], and β₁ and β₂ depend on the related parameters. Solving this recursion then leads to a bound for the second error term.

Theorem 6 (Generalization bound) Let the true risk be R_true(f) = E_{(x,y)}[l(f(x), y)]. Then, with probability at least 1 − 3δ over (D_t, ω_t), and with C and Q₂ defined as previously,

R_true(f_{t+1}) − R_true(f*) ⩽ (C√(ln(8√(et)/δ)) + √(2κ) Q₂ √(ln(2t/δ)) ln(t)) L / √t.

Proof: By the Lipschitz continuity of l(·, y) and Jensen's inequality, we have

R_true(f_{t+1}) − R_true(f*) ⩽ L E_x|f_{t+1}(x) − f*(x)| ⩽ L√(E_x|f_{t+1}(x) − f*(x)|²) = L‖f_{t+1} − f*‖₂.

Again, ‖f_{t+1} − f*‖₂ can be decomposed into two terms, O(‖f_{t+1} − h_{t+1}‖₂) and O(‖h_{t+1} − f*‖_H), which can be bounded similarly as in Theorem 5 (see Corollary 12 in the appendix).

Remarks. The overall rate of convergence in expectation, O(1/t), is indeed optimal. Classical complexity theory (see, e.g., references in [15]) shows that to obtain an ε-accurate solution, the number of iterations needed for stochastic approximation is Ω(1/ε) for the strongly convex case and Ω(1/ε²) for the general convex case. Different from the classical setting of stochastic approximation, our case imposes not one but two sources of randomness/stochasticity in the gradient, which, intuitively speaking, might require a higher order number of iterations for the general convex case. However, our method is still able to achieve the same rate as in the classical setting. The rate of the generalization bound is also nearly optimal, up to log factors. These bounds may be further refined with more sophisticated techniques and analysis. For example, mini-batching and preconditioning can be used to significantly reduce the constant factors in the bounds; that analysis is left for future study.
Theorem 4 also yields bounds in the L∞ and L₂ sense, as shown in Section A.2 in the appendix. The choices of step sizes γ_t and tuning parameters given in these bounds are only sufficient conditions for a simple analysis; other choices can also lead to bounds of the same order.

6 Computation, Storage and Statistics Trade-off

To investigate the computation, storage, and statistics trade-off, we fix the desired L₂ error in the function estimation to ε, i.e., ‖f − f*‖₂² ⩽ ε, and work out the dependency of other quantities on ε. These quantities include the preprocessing time, the number of samples and random features (or rank), the number of iterations of each algorithm, and the computational cost and storage requirement for learning and prediction. We assume that the number of samples, n, needed to achieve the prescribed error ε is of the order O(1/ε), the same for all methods. Furthermore, we make no other regularity assumption about margin properties or the kernel matrix, such as fast spectrum decay. Thus the required number of random features (or rank) r will be of the order O(n) = O(1/ε) [4, 5, 8, 9].

We pick a few representative algorithms for comparison, namely: (i) NORMA [13]: kernel methods trained with stochastic functional gradients; (ii) k-SDCA [12]: kernelized version of stochastic dual coordinate ascent; (iii) r-SDCA: first approximate the kernel function with random features, then run stochastic dual coordinate ascent; (iv) n-SDCA: first approximate the kernel matrix using Nyström's method, then run stochastic dual coordinate ascent; similarly, we combine the Pegasos algorithm [21] with random features and Nyström's method to obtain (v) r-Pegasos and (vi) n-Pegasos.
The comparisons are summarized in the table below. From the table, one can see that our method, r-SDCA, and r-Pegasos achieve the best dependency on the dimension d of the data. However, one is often interested in increasing the number of random features as more data points are observed, to obtain better generalization ability. Special procedures would then need to be designed to update the r-SDCA and r-Pegasos solutions, and it is not clear to us how to implement these easily and efficiently.

Algorithms       | Preprocessing | Training computation | Prediction computation | Training storage | Prediction storage
Doubly SGD       | O(1)          | O(d/ε²)              | O(d/ε)                 | O(1/ε)           | O(1/ε)
NORMA/k-SDCA     | O(1)          | O(d/ε²)              | O(d/ε)                 | O(d/ε)           | O(d/ε)
r-Pegasos/r-SDCA | O(1)          | O(d/ε²)              | O(d/ε)                 | O(1/ε)           | O(1/ε)
n-Pegasos/n-SDCA | O(1/ε³)       | O(d/ε²)              | O(d/ε)                 | O(1/ε)           | O(1/ε)

7 Experiments

We show that our method compares favorably to other kernel methods on medium-scale datasets and to neural nets on large-scale datasets. We examined both regression and classification problems with smooth and almost-smooth loss functions. Below is a summary of the datasets used¹; more detailed descriptions of these datasets and the experimental settings can be found in the appendix.

    | Name                  | Model      | # of samples | Input dim | Output range     | Virtual
(1) | Adult                 | K-SVM      | 32K          | 123       | {−1, 1}          | no
(2) | MNIST 8M 8 vs. 6 [25] | K-SVM      | 1.6M         | 784       | {−1, 1}          | yes
(3) | Forest                | K-SVM      | 0.5M         | 54        | {−1, 1}          | no
(4) | MNIST 8M [25]         | K-logistic | 8M           | 1568      | {0, . . . , 9}   | yes
(5) | CIFAR 10 [26]         | K-logistic | 60K          | 2304      | {0, . . . , 9}   | yes
(6) | ImageNet [27]         | K-logistic | 1.3M         | 9216      | {0, . . . , 999} | yes
(7) | QuantumMachine [28]   | K-ridge    | 6K           | 276       | [−800, −2000]    | yes
(8) | MolecularSpace [28]   | K-ridge    | 2.3M         | 2850      | [0, 13]          | no

Experiment settings.
For datasets (1) – (3), we compare the algorithms discussed in Section 6. For the algorithms based on low-rank kernel matrix approximation and random features, i.e., Pegasos and SDCA, we set the rank and the number of random features to 2^8. We use the same batch size for our algorithm and the competitors, and we stop all algorithms once they have passed through the entire dataset once. This stopping criterion (SC1) is designed to test our conjecture that the bottleneck in the performance of the vanilla methods with explicit features is the accuracy of the kernel approximation. To this end, we investigate the performance of these algorithms under different levels of random feature approximation but with the same number of training samples. To further investigate the computational efficiency of the proposed algorithm, we also conduct experiments where we stop all algorithms within the same time budget (SC2). Due to space limitations, the comparison on a synthetic regression dataset under SC1 and on datasets (1) – (3) under SC2 are given in Appendix B.2. We do not count the preprocessing time of Nyström's method for n-Pegasos and n-SDCA. The algorithms are executed on a machine with 16 AMD 2.4 GHz Opteron CPUs and 200 GB of memory; note that this allows NORMA and k-SDCA to keep all the data in memory.
We report our numerical results in Figure 1(1)-(8), with explanations below. For full details of our experimental setup, please refer to Section B.1 in the appendix.
Adult. The result is illustrated in Figure 1(1). NORMA and k-SDCA achieve the best error rate, 15%, while our algorithm achieves a comparable rate, 15.3%.

¹ A "yes" in the last column means that virtual examples are generated for training. K-ridge stands for kernel ridge regression; K-SVM stands for kernel SVM; K-logistic stands for kernel logistic regression.
Figure 1: Experimental results for datasets (1) – (8): (1) Adult, (2) MNIST 8M 8 vs. 6, (3) Forest, (4) MNIST 8M, (5) CIFAR 10, (6) ImageNet, (7) QuantumMachine, (8) MolecularSpace.

MNIST 8M 8 vs. 6. The result is shown in Figure 1(2). Our algorithm achieves the best test error, 0.26%. Compared to the methods using the full kernel, the methods using random/Nyström features achieve better test errors, probably because of the underlying low-rank structure of the dataset.
Forest. The result is shown in Figure 1(3). Our algorithm achieves a test error of about 15%, much better than n/r-Pegasos and n/r-SDCA. Considering the trade-off between cost and accuracy, our method is preferable in this scenario, i.e., huge datasets with sophisticated decision boundaries.
MNIST 8M. The result is shown in Figure 1(4). Our method quickly reaches an error of 0.5%, better than the 0.6% error of both the fixed and the jointly-trained neural nets.
CIFAR 10. The result is shown in Figure 1(5). We compare our algorithm to a neural net with two convolution layers (after contrast normalization and max-pooling layers) and two local layers, which achieves 11% test error; the specification is at https://code.google.com/p/cuda-convnet/. Our method achieves comparable performance, but much faster.
ImageNet. The result is shown in Figure 1(6). Our method achieves a test error of 44.5% with max-voting over 10 transformations of the test set, while the jointly-trained neural net arrives at 42% (without variations in color and illumination), and the fixed neural net only achieves 46% with max-voting.
QuantumMachine/MolecularSpace. The results are shown in Figure 1(7) & (8). On dataset (7), our method achieves a mean absolute error of 2.97 kcal/mole, outperforming neural nets (3.51 kcal/mole) and approaching the 1 kcal/mole required for chemical accuracy.
Moreover, the comparison on dataset (8) is the first in the literature, and our method remains comparable with neural nets.

Acknowledgement
M.B. is supported in part by NSF CCF-0953192, CCF-1451177, CCF-1101283, and CCF-1422910, ONR N00014-09-1-0751, and AFOSR FA9550-09-1-0538. L.S. is supported in part by NSF IIS-1116886, NSF/NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, and a Raytheon Faculty Fellowship.

References
[1] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In ICML, 2000.
[2] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS, 2000.
[3] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.
[4] P. Drineas and M. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.
[5] C. Cortes, M. Mohri, and A. Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
[6] A. Rahimi and B. Recht.
Random features for large-scale kernel machines. In NIPS, 2008.
[7] Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood — computing Hilbert space expansions in loglinear time. In ICML, 2013.
[8] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2009.
[9] D. Lopez-Paz, S. Sra, A. Smola, Z. Ghahramani, and B. Schölkopf. Randomized nonlinear component analysis. In ICML, 2014.
[10] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[11] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press.
[12] S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14(1):567–599, 2013.
[13] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), Aug 2004.
[14] N. Ratliff and J. Bagnell. Kernel conjugate gradient for fast kernel machines. In IJCAI, 2007.
[15] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, 19(4):1574–1609, January 2009.
[16] A. Devinatz. Integral representation of pd functions. Trans. AMS, 74(1):56–77, 1953.
[17] M. Hein and O. Bousquet. Kernels, associated structures, and generalizations. Technical Report 127, Max Planck Institute for Biological Cybernetics, 2004.
[18] H. Wendland. Scattered Data Approximation. Cambridge University Press, Cambridge, UK, 2005.
[19] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[20] N. Pham and R. Pagh.
Fast and scalable polynomial kernels via explicit feature maps. In KDD, 2013.
[21] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML, 2007.
[22] C. D. Dang and G. Lan. Stochastic block mirror descent methods for nonsmooth and stochastic optimization. Technical report, University of Florida, 2013.
[23] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[24] A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In ICML, 2013.
[25] G. Loosli, S. Canu, and L. Bottou. Training invariant support vector machines with selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, 2007.
[26] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[27] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[28] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, A. Lilienfeld, and K. Müller. Learning invariant representations of molecules for atomization energy prediction. In NIPS, 2012.
[29] A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, pages 449–456, 2012.
[30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner.
Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.