{"title": "Fast Prediction for Large-Scale Kernel Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 3689, "page_last": 3697, "abstract": "Kernel machines such as kernel SVM and kernel ridge regression usually construct high quality models; however, their use in real-world applications remains limited due to the high prediction cost. In this paper, we present two novel insights for improving the prediction efficiency of kernel machines. First, we show that by adding \u201cpseudo landmark points\u201d to the classical Nystr\u00a8om kernel approximation in an elegant way, we can significantly reduce the prediction error without much additional prediction cost. Second, we provide a new theoretical analysis on bounding the error of the solution computed by using Nystr\u00a8om kernel approximation method, and show that the error is related to the weighted kmeans objective function where the weights are given by the model computed from the original kernel. This theoretical insight suggests a new landmark point selection technique for the situation where we have knowledge of the original model. Based on these two insights, we provide a divide-and-conquer framework for improving the prediction speed. First, we divide the whole problem into smaller local subproblems to reduce the problem size. In the second phase, we develop a kernel approximation based fast prediction approach within each subproblem. We apply our algorithm to real world large-scale classification and regression datasets, and show that the proposed algorithm is consistently and significantly better than other competitors. For example, on the Covertype classification problem, in terms of prediction time, our algorithm achieves more than 10000 times speedup over the full kernel SVM, and a two-fold speedup over the state-of-the-art LDKL approach, while obtaining much higher prediction accuracy than LDKL (95.2% vs. 
89.53%).", "full_text": "Fast Prediction for Large-Scale Kernel Machines\n\nCho-Jui Hsieh, Si Si, and Inderjit S. Dhillon\n\n{cjhsieh,ssi,inderjit}@cs.utexas.edu\n\nAustin, TX 78712 USA\n\nDepartment of Computer Science\n\nUniversity of Texas at Austin\n\nAbstract\n\nKernel machines such as kernel SVM and kernel ridge regression usually con-\nstruct high quality models; however, their use in real-world applications remains\nlimited due to the high prediction cost. In this paper, we present two novel in-\nsights for improving the prediction ef\ufb01ciency of kernel machines. First, we show\nthat by adding \u201cpseudo landmark points\u201d to the classical Nystr\u00a8om kernel approxi-\nmation in an elegant way, we can signi\ufb01cantly reduce the prediction error without\nmuch additional prediction cost. Second, we provide a new theoretical analysis on\nbounding the error of the solution computed by using Nystr\u00a8om kernel approxima-\ntion method, and show that the error is related to the weighted kmeans objective\nfunction where the weights are given by the model computed from the original ker-\nnel. This theoretical insight suggests a new landmark point selection technique for\nthe situation where we have knowledge of the original model. Based on these two\ninsights, we provide a divide-and-conquer framework for improving the predic-\ntion speed. First, we divide the whole problem into smaller local subproblems to\nreduce the problem size. In the second phase, we develop a kernel approximation\nbased fast prediction approach within each subproblem. 
We apply our algorithm to real-world large-scale classification and regression datasets, and show that the proposed algorithm is consistently and significantly better than other competitors. For example, on the Covertype classification problem, in terms of prediction time, our algorithm achieves more than 10000 times speedup over the full kernel SVM, and a two-fold speedup over the state-of-the-art LDKL approach, while obtaining much higher prediction accuracy than LDKL (95.2% vs. 89.53%).

1 Introduction

Kernel machines have become widely used in many machine learning problems, including classification, regression, and clustering. By mapping samples to a high-dimensional feature space, kernel machines are able to capture nonlinear properties and usually achieve better performance than linear models. However, computing the decision function for new test samples is typically expensive, which limits the applicability of kernel methods in real-world applications. Therefore, speeding up the prediction time of kernel methods has become an important research topic. For example, [2, 10] recently proposed various heuristics to speed up kernel SVM prediction, and kernel approximation based methods [27, 5, 21, 16] can also be applied to speed up prediction for general kernel machines. Among them, LDKL has recently attracted much attention, as it performs much better than state-of-the-art kernel approximation and reduced-set based methods for fast prediction. Experimental results show that LDKL can reduce the prediction cost by more than three orders of magnitude with little degradation of accuracy compared with the original kernel SVM.

In this paper, we propose a novel fast prediction technique for large-scale kernel machines. Our method is built on the Nyström approximation, but with the following innovations:

1.
We show that by adding “pseudo landmark points” to the Nyström approximation, the kernel approximation error can be reduced without much additional prediction cost.

2. We provide a theoretical analysis of the model approximation error ‖ᾱ − α*‖, where ᾱ is the model (solution) computed with the Nyström approximation, and α* is the solution computed from the original kernel. Instead of bounding ‖ᾱ − α*‖ by the kernel approximation error on the entire kernel matrix, we refine the bound by taking the α* weights into consideration, which indicates that we only need to focus on approximating the columns of the kernel matrix with large α* values (e.g., support vectors in the kernel SVM problem). We further show that the error bound is connected to the α*-weighted kmeans objective function, which suggests selecting landmark points based on α* values in the Nyström approximation.

3. We combine the above two innovations in a divide-and-conquer framework for fast prediction. The framework partitions the problem using kmeans clustering to reduce the problem size, and within each subproblem we apply the above two techniques to develop a kernel approximation scheme for fast prediction.

Based on the above three innovations, we develop a fast prediction scheme for kernel methods, DC-Pred++, and apply it to speed up prediction for kernel SVM and kernel ridge regression. The experimental results show that our method outperforms state-of-the-art methods in terms of prediction time and accuracy. For example, on the Covertype classification problem, our algorithm achieves a two-fold speedup in terms of prediction time, and yields a higher prediction accuracy (95.2% vs. 89.53%) compared to the state-of-the-art fast prediction approach LDKL.
Perhaps surprisingly, our training time is usually faster than, or at least competitive with, state-of-the-art solvers.

We begin by presenting related work in Section 2, while background material is given in Section 3. In Section 4, we introduce the concept of pseudo landmark points in kernel approximation. In Section 5, we present the divide-and-conquer framework, and theoretically analyze the use of weighted kmeans to select landmark points. Experimental results on real-world data are presented in Section 6.

2 Related Work

There has been substantial work on speeding up the prediction time of kernel SVMs, and most of the approaches can be applied to other kernel methods such as kernel ridge regression. Most previous works fall into the following three categories:

Preprocessing. Reducing the size of the training set usually yields fewer support vectors in the model, and thus results in faster prediction. [20] proposed a “squashing” approach that reduces the size of the training set by clustering and grouping nearby points. [19] proposed to select the extreme points in the training set to train the kernel SVM. The Nyström method [27, 4, 29] and Random Kitchen Sinks (RKS) [21] form low-rank kernel approximations to improve both training and prediction speed. Although RKS usually requires a larger rank than the Nyström method, it can be further sped up by using the fast Hadamard transform [16]. Other kernel approximation methods [12, 18, 1] have also been proposed for different types of kernels.

Post-processing. Post-processing approaches are designed to reduce the number of support vectors in the testing phase. A comprehensive comparison of these reduced-set methods was conducted in [11], and the results show that the incremental greedy method [22] implemented in STPRtool achieves the best performance.
Another randomized algorithm to refine the solution of the kernel SVM was recently proposed in [2].

Modified Training Process. Another line of research aims to reduce the number of support vectors by modifying the training step. [13] proposed a greedy basis selection approach; [24] proposed the Core Vector Machine (CVM) solver for the L2-SVM; [9] applied a cutting plane subspace pursuit algorithm to solve the kernel SVM. The Reduced SVM (RSVM) [17] selects a subset of features in the original data and solves the primal problem of the kernel SVM. Locally Linear SVM (LLSVM) [15] represents each sample as a linear combination of its neighbors to yield efficient prediction. Instead of considering the original kernel SVM problem, [10] developed a new tree-based local kernel learning model (LDKL), where the decision value of each sample is computed by a series of inner products while traversing the tree.

3 Background

Kernel Machines. In this paper, we focus on two kernel machines – kernel SVM and kernel ridge regression. Given a set of instance-label pairs {x_i, y_i}_{i=1}^n, x_i ∈ R^d, the training process of kernel SVM and kernel ridge regression generates α* ∈ R^n by solving the following optimization problems:

Kernel SVM: α* ← argmin_α (1/2)αᵀQα − eᵀα s.t. 0 ≤ α ≤ C,    (1)
Kernel Ridge Regression: α* ← argmin_α αᵀGα + λαᵀα − 2αᵀy,    (2)

where G ∈ R^{n×n} is the kernel matrix with G_ij = K(x_i, x_j); Q is an n × n matrix with Q_ij = y_i y_j G_ij; and C, λ are regularization parameters.

In the prediction phase, the decision value of a test point x is computed as Σ_{i=1}^n α*_i K(x_i, x), which in general requires O(n̄d) time, where n̄ is the number of nonzero elements in α*.
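Problem (2) is an unconstrained quadratic, so setting its gradient to zero gives the closed-form optimality condition (G + λI)α* = y. A minimal NumPy sketch of training and of the O(n̄d) prediction rule above, on synthetic data with a Gaussian kernel (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def gauss_kernel(A, B, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # training samples x_i
y = rng.normal(size=100)         # regression targets

# Training: alpha* solves (G + lambda*I) alpha = y, the optimality condition of (2)
lam = 0.1
G = gauss_kernel(X, X)
alpha = np.linalg.solve(G + lam * np.eye(len(X)), y)

# Prediction: sum_i alpha_i K(x_i, x) -- O(n*d) kernel evaluations per test point
x_test = rng.normal(size=(1, 5))
decision = float(gauss_kernel(x_test, X) @ alpha)
```

Since every α*_i is generically nonzero for ridge regression, n̄ = n here, which is exactly the cost the rest of the paper sets out to reduce.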
Note that for the kernel SVM problem, α*_i can be thought of as weighted by y_i when computing the decision value for x. In comparison, linear models require only O(d) prediction time, but usually give lower prediction accuracy.

Nyström Approximation. Kernel machines usually do not scale to large-scale applications due to the O(n²d) operations needed to compute the kernel matrix and the O(n²) space to store it in memory. As shown in [14], low-rank approximation of the kernel matrix using the Nyström method provides an efficient way to scale kernel machines to millions of instances. Given m ≪ n landmark points {u_j}_{j=1}^m, the Nyström method first forms two matrices C ∈ R^{n×m} and W ∈ R^{m×m} based on the kernel function, where C_ij = K(x_i, u_j) and W_ij = K(u_i, u_j), and then approximates the kernel matrix as

G ≈ Ḡ := CW†Cᵀ,    (3)

where W† denotes the pseudo-inverse of W. By approximating G via the Nyström method, kernel machines are usually transformed into linear machines, which can be solved efficiently. Given the model α, in the testing phase the decision value of x is evaluated as

c(W†Cᵀα) = cβ,

where c = [K(x, u_1), . . . , K(x, u_m)], and β = W†Cᵀα can be precomputed and stored. To obtain the prediction on one test sample, the Nyström approximation needs only O(md) flops to compute c and O(m) flops to compute the decision value cβ, so it is an effective way to improve the prediction speed. However, the Nyström approximation usually needs more than 100 landmark points to achieve reasonably good accuracy, which is still expensive for large-scale applications.

4 Pseudo Landmark Points for Speeding up Prediction Time

In the Nyström approximation, there is a trade-off in selecting the number of landmark points m.
A smaller m means faster prediction, but also yields higher kernel approximation error, which results in lower prediction accuracy. Therefore we want to tackle the following problem – can we add landmark points without increasing the prediction time?

Our solution is to construct extra “pseudo landmark points” for the kernel approximation. Recall that originally we have m landmark points {u_j}_{j=1}^m; we now add p pseudo landmark points {v_t}_{t=1}^p to this set. In this paper, we consider pseudo landmark points sampled from the training dataset, although in general each pseudo landmark point can be any d-dimensional vector. The only difference between pseudo landmark points and landmark points is that the kernel values K(x, v_t) are computed in a fast but approximate manner in order to speed up prediction. We use a regression-based method to approximate {K(x, v_t)}_{t=1}^p. Assume that for each pseudo landmark point v_t there exists a function f_t : R^m → R, where the input to f_t is the computed kernel values {K(x, u_j)}_{j=1}^m, and the output is an estimate of K(x, v_t). We can either design the function for specific kernels – for example, in Section 4.1 we design f_t for stationary kernels – or learn f_t by regression for general kernels (Section 4.2).

Before introducing the design or learning process for {f_t}_{t=1}^p, we first describe how to use them to form the Nyström approximation. With p pseudo landmark points and {f_t}_{t=1}^p given, we form the following n × (m + p) matrix C̄ by adding p extra columns to C:

C̄ = [C, C′], where C′_it = f_t({K(x_i, u_j)}_{j=1}^m) ∀i = 1, . . . , n and ∀t = 1, . . . , p.    (4)

Then the kernel matrix G can be approximated by

G ≈ Ḡ = C̄W̄C̄ᵀ, with W̄ = C̄†G(C̄†)ᵀ,    (5)

where C̄† is the pseudo-inverse of C̄; this W̄ minimizes ‖G − Ḡ‖_F when Ḡ is restricted to the range space of C̄, a choice also used in [26]. Note that in our case W̄ cannot be obtained by inverting an (m + p) × (m + p) matrix as in the original Nyström approach (3), because the kernel values between x and the pseudo landmark points are approximate. As a result, forming the Nyström approximation (5) is slower than forming (3), since the whole kernel matrix G has to be computed.

If the number of samples n is too large to compute G, we can estimate the matrix W̄ by minimizing the approximation error on a submatrix of G. More specifically, we randomly select a submatrix G_sub from G with row and column indexes I. If we focus on approximating G_sub, the optimal W̄ is W̄ = (C̄_{I,:})† G_sub ((C̄_{I,:})†)ᵀ, which requires the computation of only O(|I|²) kernel elements.

Based on the approximate kernel Ḡ, we can train a model ᾱ and store the vector β̄ = W̄C̄ᵀᾱ in memory. For a test sample x, we first compute the kernel values between x and the landmark points, c = [K(x, u_1), . . . , K(x, u_m)], which usually requires O(md) flops, and then expand c to an (m + p)-dimensional vector c̄ = [c, f_1(c), . . . , f_p(c)] based on the p pseudo landmark points and the functions {f_t}_{t=1}^p. Assuming each f_t(c) can be evaluated in O(s) time, we can compute c̄ and the decision value c̄β̄ in O(md + ps) time, where s is much smaller than d.
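The construction above can be sketched end to end in NumPy. Here the f_t are illustrative degree-2 monomials of the exact kernel values (in the spirit of Section 4.2); the data, the kernel, and the stand-in model α are synthetic, not the paper's implementation:

```python
import numpy as np

def gauss_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
m, p = 8, 24
U = X[rng.choice(len(X), m, replace=False)]    # landmark points u_j

C = gauss_kernel(X, U)                         # exact kernel values K(x_i, u_j)
# Illustrative f_t: products of two of the m exact kernel values
pairs = [tuple(rng.choice(m, 2, replace=False)) for _ in range(p)]
expand = lambda K: np.hstack([K] + [K[:, [a]] * K[:, [b]] for a, b in pairs])
C_bar = expand(C)                              # eq. (4): [C, C']

G = gauss_kernel(X, X)
Cp = np.linalg.pinv(C_bar)
W_bar = Cp @ G @ Cp.T                          # eq. (5)
G_bar = C_bar @ W_bar @ C_bar.T

# Plain Nystrom with the same m landmarks, for comparison
G_nys = C @ np.linalg.pinv(gauss_kernel(U, U)) @ C.T

# Prediction: precompute beta_bar; each test point then costs O(md) kernel
# evaluations plus a cheap expansion of c to c_bar
alpha = rng.normal(size=len(X))                # stand-in for a trained model
beta_bar = W_bar @ C_bar.T @ alpha
c = gauss_kernel(rng.normal(size=(1, 10)), U)
decision = float(expand(c) @ beta_bar)
```

Because W̄ is optimal over the range of C̄, which contains the range of C, the Frobenius error of Ḡ is never worse than that of the plain Nyström approximation built from the same m landmark points.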
Overall, our algorithm is summarized in Algorithm 1.

Algorithm 1: Kernel Approximation with Pseudo Landmark Points
Kernel Approximation Steps:
  Select m landmark points {u_j}_{j=1}^m.
  Compute the n × m matrix C where C_ij = K(x_i, u_j).
  Select p pseudo landmark points {v_t}_{t=1}^p.
  Construct p functions {f_t}_{t=1}^p by the methods in Section 4.1 or Section 4.2.
  Expand C to C̄ = [C, C′] by (4), and compute W̄ by (5).
Training: Compute ᾱ based on Ḡ and precompute β̄ = W̄C̄ᵀᾱ.
Prediction for a test point x:
  Compute the m-dimensional vector c = [K(x, u_1), . . . , K(x, u_m)].
  Compute the (m + p)-dimensional vector c̄ = [c, f_1(c), . . . , f_p(c)].
  Decision value: c̄β̄.

4.1 Designing the functions for stationary kernels

Next we discuss various ways to design or learn the functions {f_t}_{t=1}^p. First we consider stationary kernels K(x, v_t) = κ(‖x − v_t‖), for which the kernel approximation problem reduces to estimating ‖x − v_t‖ at low cost. Suppose we choose the p pseudo landmark points {v_t}_{t=1}^p by randomly sampling p points from the dataset. By the triangle inequality,

max_j |‖x − u_j‖ − ‖v_t − u_j‖| ≤ ‖x − v_t‖ ≤ min_j (‖x − u_j‖ + ‖v_t − u_j‖).    (6)

Since ‖x − u_j‖ has already been evaluated for all u_j (to compute K(x, u_j)) and ‖v_t − u_j‖ can be precomputed, we can use either the left-hand side or the right-hand side of (6) to estimate K(x, v_t).
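For a Gaussian (stationary) kernel, this distance-based estimate looks as follows; the points are random and serve only to illustrate the sandwich in (6):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=8)                  # test point
U = rng.normal(size=(5, 8))             # landmark points u_j
v = rng.normal(size=8)                  # one pseudo landmark point v_t

d_xu = np.linalg.norm(x - U, axis=1)    # ||x - u_j||: already computed for K(x, u_j)
d_vu = np.linalg.norm(v - U, axis=1)    # ||v - u_j||: precomputed offline

lower = np.abs(d_xu - d_vu).max()       # left-hand side of (6)
upper = (d_xu + d_vu).min()             # right-hand side of (6)

# Either bound gives an O(m) surrogate for ||x - v||, hence for K(x, v)
gamma = 0.5
k_est = np.exp(-gamma * lower ** 2)     # here: using the lower distance bound
```

The more landmark points there are, the tighter both bounds become, since the max and min range over more triangles.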
We can see that approximating K(x, v_t) using (6) requires only O(m) flops and is more efficient than computing K(x, v_t) from scratch when m ≪ d (d is the dimensionality of the data).

4.2 Learning the functions for general kernels

Next we consider learning the functions f_t for general kernels by solving a regression problem. Assume each f_t is a degree-D polynomial function (in this paper we only use D = 2). Let Z denote the set of basis functions, Z = {(i_1, . . . , i_m) | i_1 + · · · + i_m ≤ D}, and for each element z^(q) ∈ Z denote the corresponding polynomial function by Z^(q)(c) = c_1^{z^(q)_1} c_2^{z^(q)_2} · · · c_m^{z^(q)_m}. Each f_t can then be written as f_t(c) = Σ_q a^t_q Z^(q)(c). A naive way to apply the pseudo-landmark technique with polynomial functions is to learn the optimal coefficients {a^t_q}_{q=1}^{|Z|} for each t, and then compute C̄, W̄ based on (4) and (5). However, this two-step procedure requires a huge amount of training time, and the prediction time cannot be improved if |Z| is large.

Therefore, we apply the pseudo-landmark point technique implicitly. We expand C by

Ĉ = [C, C″], where C″_iq = Z^(q)(c_i),    (7)

where c_i = [K(x_i, u_1), . . . , K(x_i, u_m)] and each Z^(q)(·) is the q-th degree-D polynomial basis, q = 1, . . . , |Z|. After forming Ĉ, we compute Ŵ = Ĉ†G(Ĉ†)ᵀ and approximate the kernel by ĈŴĈᵀ. This procedure is much more efficient than the previous two-step procedure, where we need to learn {a^t_q}_{q=1}^{|Z|}; more importantly, in the following lemma we show that it also gives a better approximation than the two-step procedure.

Lemma 1. If {f_t(·)}_{t=1}^p are degree-D polynomial functions, C̄, W̄ are computed by (4), (5), and Ĉ, Ŵ are computed by (7), (5), then ‖G − C̄W̄C̄ᵀ‖ ≥ ‖G − ĈŴĈᵀ‖.

The proof is in Appendix 7.3. In practice we do not need to form all the low-degree polynomial bases – sampling some of the bases from Z is enough.

Figure 1 ((a) USPS, (b) Protein, (c) MNIST; each panel plots prediction cost vs. approximation error): Comparison of different pseudo landmark point strategies. The relative approximation error is ‖G − Ḡ‖_F/‖G‖_F, where G and Ḡ are the real and approximate kernels respectively. Both Nys-triangle (using the triangle inequality to approximate kernel values) and Nys-dp (using the polynomial expansion with degree D = 2) dramatically reduce the approximation error at the same prediction cost.

Figure 1 compares the Nyström method with and without pseudo landmark points for approximating Gaussian kernels. For each dataset, we choose a small number of landmark points (2-30), and add pseudo landmark points according to the triangle inequality (6) or according to the polynomial function (7). We observe that the kernel approximation error is dramatically reduced under the same prediction cost.
Note that we can also apply this pseudo-landmark-point approach as a building block in other kernel approximation frameworks, e.g., the Memory Efficient Kernel Approximation (MEKA) proposed in [23].

5 Weighted Kmeans Sampling with a Divide-and-Conquer Framework

In all the related work, the Nyström approximation is treated as a preprocessing step, which does not incorporate information from the model itself. In this section, we consider the case where the model α* for kernel SVM or kernel ridge regression is given, and derive a better approach to select landmark points. The approach can be used in conjunction with divide-and-conquer SVM [8], where an approximate solution to α* can be computed efficiently.

Let α* be the optimal solution of the kernel machine computed with the original kernel matrix G, and ᾱ the approximate solution obtained with the approximate kernel matrix Ḡ. We derive the following upper bounds on ‖ᾱ − α*‖ for both kernel SVM and kernel ridge regression:

Theorem 1. Let α* be the optimal solution for kernel ridge regression with kernel matrix G, and ᾱ the solution for kernel ridge regression with kernel Ḡ obtained by the Nyström approximation (3). Then

‖ᾱ − α*‖ ≤ Δ/λ with Δ = Σ_{i=1}^n |α*_i| ‖Ḡ_{·,i} − G_{·,i}‖,

where λ is the regularization parameter in kernel ridge regression, and Ḡ_{·,i} and G_{·,i} are the i-th columns of Ḡ and G respectively.

Theorem 2.
Let α* be the optimal solution for kernel SVM with kernel G, and ᾱ the solution of kernel SVM with kernel Ḡ obtained by the Nyström approximation (3). Then

‖ᾱ − α*‖ ≤ θ²‖W‖₂(1 + ρ)Δ,    (8)

where ρ is the largest eigenvalue of Ḡ, and θ is a positive constant independent of α* and ᾱ.

The proofs are in Appendix 7.4 and 7.5. These results show that ‖ᾱ − α*‖ can be upper bounded by a weighted kernel approximation error. This result looks natural but has a significant consequence – to get a good approximate model, we do not need to minimize the kernel approximation error on all n² elements of G; instead, the quality of the solution is mostly affected by the small portion of columns of G with larger |α*_i|. For example, in the kernel SVM problem, α* is a sparse vector containing many zero elements, and the above bound indicates that we only need to approximate the columns of G with corresponding α*_i ≠ 0 accurately. Based on the error bounds, we want to select landmark points for the Nyström approximation that minimize Δ. We focus on kernel functions that satisfy

(K(a, b) − K(c, d))² ≤ C_K(‖a − c‖² + ‖b − d‖²), ∀a, b, c, d,    (9)

where C_K is a kernel-dependent constant. It has been shown in [29] that all stationary kernels (K(x_i, x_j) = κ(‖x_i − x_j‖)) satisfy (9). Next we show that the weighted kernel approximation error Δ is upper bounded by the weighted kmeans objective.

Theorem 3. If the kernel function satisfies condition (9), and u_1, . . . , u_m are the landmark points for constructing the Nyström approximation (Ḡ = CW†Cᵀ), then

Δ ≤ (n + n‖W†‖√(kγ_max)) √(C_K · D²_{α*}({u_j}_{j=1}^m)),    (10)

where γ_max is an upper bound on the kernel function,

D²_{α*}({u_j}_{j=1}^m) := Σ_{i=1}^n (α*_i)² ‖x_i − u_{π(i)}‖²,

and π(i) = argmin_s ‖u_s − x_i‖² gives the landmark point closest to x_i.

The proof is in Appendix 7.6. Note that D²_{α*}({u_j}_{j=1}^m) is the weighted kmeans objective function with {(α*_i)²}_{i=1}^n as the weights. Combining Theorems 1, 2, and 3, we conclude that for both kernel SVM and kernel ridge regression, the approximation error ‖ᾱ − α*‖ can be upper bounded by the weighted kmeans objective function. As a consequence, if α* is given, we can use weighted kmeans with weights {(α*_i)²}_{i=1}^n to find the landmark points u_1, . . . , u_m, which tends to minimize the approximation error. In Figure 4 (in the Appendix) we show that for the kernel SVM problem, selecting landmark points by weighted kmeans is a very effective strategy for fast and accurate prediction on real-world datasets.

In practice we do not know α* before training the kernel machine, and computing α* exactly is very expensive for large-scale datasets. However, using weighted kmeans to select landmark points can be combined with any approximate solver – we can use an approximate solver to quickly approximate α*, and then use it as the weights for the weighted kmeans. Next we show how to combine this approach with the divide-and-conquer framework recently proposed in [8, 7].

Divide and Conquer Approach. The divide-and-conquer SVM (DC-SVM) was proposed in [8] to solve the kernel SVM problem.
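To make the α*-weighted selection concrete, the following sketch picks landmarks by Lloyd-style weighted kmeans with weights (α*_i)² and numerically checks the Theorem 1 bound ‖ᾱ − α*‖ ≤ Δ/λ for kernel ridge regression. The data, kernel, and helper names are illustrative, not the paper's implementation:

```python
import numpy as np

def gauss_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def weighted_kmeans(X, w, m, iters=20, seed=0):
    # Lloyd iterations minimizing sum_i w_i ||x_i - u_{pi(i)}||^2
    rng = np.random.default_rng(seed)
    U = X[rng.choice(len(X), m, replace=False)]
    for _ in range(iters):
        assign = ((X[:, None, :] - U[None, :, :]) ** 2).sum(-1).argmin(1)
        for s in range(m):
            mask = assign == s
            if w[mask].sum() > 0:
                U[s] = (w[mask, None] * X[mask]).sum(0) / w[mask].sum()
    return U

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 6))
y = rng.normal(size=150)
lam, n = 1.0, len(X)

# Exact model alpha* from the full kernel
G = gauss_kernel(X, X)
alpha_star = np.linalg.solve(G + lam * np.eye(n), y)

# Landmarks via (alpha*_i)^2-weighted kmeans, then the Nystrom model alpha_bar
U = weighted_kmeans(X, alpha_star ** 2, m=15)
C, W = gauss_kernel(X, U), gauss_kernel(U, U)
G_bar = C @ np.linalg.pinv(W) @ C.T
alpha_bar = np.linalg.solve(G_bar + lam * np.eye(n), y)

# Theorem 1: ||alpha_bar - alpha*|| <= Delta / lambda
delta = (np.abs(alpha_star) * np.linalg.norm(G_bar - G, axis=0)).sum()
assert np.linalg.norm(alpha_bar - alpha_star) <= delta / lam
```

The bound follows because ᾱ − α* = (Ḡ + λI)⁻¹(G − Ḡ)α* and ‖(Ḡ + λI)⁻¹‖ ≤ 1/λ for a positive semidefinite Ḡ.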
The main idea is to divide the whole problem into several smaller subproblems, each of which can be solved independently and efficiently. [8] proposed to partition the data points by kernel clustering, but this approach is expensive in terms of prediction efficiency; we therefore use kmeans clustering in the input space to build the hierarchical clustering. With k clusters as the leaf nodes, the DC-SVM algorithm computes the solutions {(α^(i))*}_{i=1}^k for each cluster independently. For a test sample, DC-SVM uses an “early prediction” scheme: the test sample is first assigned to the nearest cluster, and the local model of that cluster is then used for prediction. This reduces the prediction time because only the kernel values between the test sample and the support vectors of one cluster are computed. However, the model in each cluster may still contain many support vectors, so we propose to approximate the kernel in each cluster by the Nyström-based kernel approximation of Section 4 to further reduce the prediction time. In the prediction step, we first go through the hierarchical tree to identify the nearest cluster, and then compute the kernel values between the test sample and the landmark points of that cluster. Finally, we compute the decision value from these kernel values and the local prediction model. The same idea applies to kernel ridge regression. Our overall algorithm – DC-Pred++ – is presented in Algorithm 2.

6 Experimental Results

In this section, we compare our proposed algorithm with other fast prediction algorithms on kernel SVM and kernel ridge regression problems.
All the experiments are conducted on a machine with an Intel 2.83GHz CPU and 32G RAM. Note that the prediction cost is reported as the actual prediction time divided by the linear model's prediction time. This measurement is more robust to the actual hardware configuration and provides a direct comparison with linear methods.

Algorithm 2: DC-Pred++: our proposed divide-and-conquer approach for fast prediction.
Input: Training samples {x_i}_{i=1}^n, kernel function K.
Output: A fast prediction model.
Training:
  Construct a hierarchical clustering tree with k leaf nodes by kmeans.
  Compute local models {(α^(i))*}_{i=1}^k for each cluster.
  For each cluster, use weighted kmeans centroids as landmark points.
  For each cluster, run the proposed kernel approximation with pseudo landmark points (Algorithm 1) and use the approximate kernel to train a local prediction model.
Prediction on x:
  Identify the nearest cluster.
  Run the prediction phase of Algorithm 1 using the local prediction model.

Table 1: Comparison of kernel SVM prediction on real datasets. Note that the actual prediction time is normalized by the linear prediction time; for example, 12.8x means the actual prediction time = 12.8 × (time for linear SVM prediction).

Dataset | Metric | DC-Pred++ | LDKL | kmeans Nyström | AESVM | STPRtool | Fastfood
Letter (ntrain = 12,000, ntest = 6,000, d = 16) | Prediction Time | 12.8x | 29x | 140x | 1542x | 50x | 50x
 | Accuracy | 95.90% | 95.78% | 87.58% | 80.97% | 85.9% | 89.9%
 | Training Time | 1.2s | 243s | 3.8s | 55.2s | 47.7s | 15s
CovType (ntrain = 522,910, ntest = 58,102, d = 54) | Prediction Time | 18.8x | 35x | 200x | 3157x | 50x | 60x
 | Accuracy | 95.19% | 89.53% | 73.63% | 75.81% | 82.14% | 66.8%
 | Training Time | 372s | 4095s | 1442s | 204s | 77400s | 256s
Usps (ntrain = 7291, ntest = 2007, d = 256) | Prediction Time | 14.4x | 12.01x | 200x | 5787x | 50x | 80x
 | Accuracy | 95.56% | 95.96% | 92.53% | 85.97% | 93.6% | 94.39%
 | Training Time | 2s | 19s | 4.8s | 55.3s | 34.5s | 12s
Webspam (ntrain = 280,000, ntest = 70,000, d = 254) | Prediction Time | 20.5x | 23x | 200x | 4375x | 50x | 80x
 | Accuracy | 98.4% | 95.15% | 95.01% | 98.4% | 91.6% | 96.7%
 | Training Time | 239s | 2158s | 181s | 909s | 32571s | 1621s
Kddcup (ntrain = 4,898,431, ntest = 311,029, d = 134) | Prediction Time | 11.8x | 26x | 200x | 604x | 50x | 80x
 | Accuracy | 92.3% | 92.2% | 87% | 92.1% | 89.8% | 91.1%
 | Training Time | 154s | 997s | 1481s | 2717s | 4925s | 970s
a9a (ntrain = 32,561, ntest = 16,281, d = 123) | Prediction Time | 12.5x | 32x | 50x | 4859x | 50x | 80x
 | Accuracy | 83.9% | 81.95% | 83.9% | 81.9% | 82.32% | 61.9%
 | Training Time | 6.3s | 490s | 1.28s | 33.17s | 69.1s | 59.9s

6.1 Kernel SVM

We use six public datasets (shown in Table 1) to compare kernel SVM prediction time. The parameters γ, C are selected by cross validation, and a detailed description of the parameters for the other competitors is given in Appendix 7.1. We compare with the following methods:

1. DC-Pred++: Our proposed framework, which uses a divide-and-conquer strategy, applies weighted kmeans to select landmark points, and then uses these landmark points to generate pseudo landmark points in the Nyström approximation for fast prediction.

2. LDKL: The Local Deep Kernel Learning method proposed in [10], which learns a tree-based primal feature embedding to achieve faster prediction speed.

3. Kmeans Nyström: The Nyström approximation using kmeans centroids as landmark points [29]. The resulting linear SVM problem is solved by LIBLINEAR [6].

4.
4. AESVM: the Approximate Extreme Points SVM solver proposed in [19], which uses a preprocessing step to filter out unimportant points and obtain a smaller model.
5. Fastfood: random Hadamard features for kernel approximation [16].
6. STPRtool: a kernel computation toolbox that implements the reduced-set post-processing approach using the greedy iterative solver proposed in [22].

Note that [10] reported that LDKL achieves much faster prediction than Locally Linear SVM [15] and reduced-set methods [9, 3, 13], so we omit those comparisons here.

The results presented in Table 1 show that DC-Pred++ achieves the best prediction efficiency and accuracy on 5 of the 6 datasets. In general, DC-Pred++ takes less than half the prediction time of LDKL while still achieving better accuracy. Interestingly, in terms of training time, DC-Pred++ is almost 10 times faster than LDKL on most of the datasets. Since LDKL is the most competitive method, we further compare against LDKL by varying the prediction cost in Figure 2. The results show that on 5 datasets DC-Pred++ achieves better prediction accuracy at the same prediction time.

Figure 2: Comparison between our proposed method and LDKL for fast prediction on the kernel SVM problem; panels (a) Letter, (b) Covtype, (c) Kddcup. The x-axis is the prediction cost and the y-axis shows the prediction accuracy. For results on more datasets, please see Figure 5 in the Appendix.

Figure 3: Kernel ridge regression results on various datasets; panels (a) Cadata, (b) YearPredictionMSD, (c) mnist2M. The x-axis is the prediction cost and the y-axis shows the test RMSE. All results are averaged over five independent runs. For results on more datasets, please see Figure 7 in the Appendix.

Note that our approach is an improvement over the divide-and-conquer SVM (DC-SVM) proposed in [8]; we therefore further compare DC-Pred++ with DC-SVM in Appendix 7.8.
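To make the two-phase prediction step of Algorithm 2 concrete, the following is a minimal sketch, not the authors' implementation: a query is routed to its nearest kmeans cluster, and that cluster's local Nyström-style model is evaluated against a small set of landmark points. The arrays `centers`, `landmarks`, and `weights` are hypothetical pre-trained quantities, and the Gaussian kernel with parameter `gamma` is an assumption.

```python
import numpy as np

def gaussian_kernel(X, U, gamma=1.0):
    # K[i, j] = exp(-gamma * ||X[i] - U[j]||^2)
    sq = ((X[:, None, :] - U[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def dc_pred(x, centers, landmarks, weights, gamma=1.0):
    """Two-phase prediction in the spirit of Algorithm 2 (hypothetical arrays):
    centers:   (k, d)    cluster centers from kmeans
    landmarks: (k, m, d) landmark points of each cluster
    weights:   (k, m)    local linear model of each cluster
    """
    c = int(np.argmin(((centers - x) ** 2).sum(axis=-1)))   # phase 1: nearest cluster
    phi = gaussian_kernel(x[None, :], landmarks[c], gamma)  # phase 2: 1 x m landmark features
    return (phi @ weights[c]).item()                        # local linear model score
```

Under this sketch the per-query cost is O(kd) to find the cluster plus O(md) to evaluate m landmark kernels, independent of the number of training points, which is why prediction time can be reported relative to a linear model.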
The results clearly demonstrate that DC-Pred++ achieves faster prediction, mainly due to the two innovations presented in this paper: adding pseudo landmark points and selecting landmark points by weighted kmeans to improve the Nyström approximation. Finally, we also present the trade-off between the two parameters of our algorithm, the number of clusters and the number of landmark points, in Appendix 7.9.

Table 2: Dataset statistics.

dataset   Cpusmall   Cadata   Census   YearPredictionMSD   mnist2M
ntrain    6,553      16,521   18,277   463,715             1,500,000
ntest     1,639      4,128    4,557    51,630              500,000
d         12         8        137      90                  800

6.2 Kernel Ridge Regression

We further demonstrate the benefits of DC-Pred++ for fast prediction in kernel ridge regression on the five public datasets listed in Table 2. Note that for mnist2M, we perform regression on two digits and set the target variables to 0 and 1. We compare DC-Pred++ with four other state-of-the-art kernel approximation methods for kernel ridge regression: standard Nyström (Nys) [5], Kmeans Nyström (KNys) [28], Random Kitchen Sinks (RKS) [21], and Fastfood [16]. All experimental results are based on the Gaussian kernel. It is unclear how to generalize LDKL to kernel ridge regression, so we do not compare with LDKL here. The parameters are chosen by five-fold cross-validation (see Appendix 7.1). Figure 3 presents the test RMSE (root mean squared error on the test data) as the prediction cost varies. To control the prediction cost, for Nys, KNys, and DC-Pred++ we vary the number of landmark points, while for RKS and Fastfood we vary the number of random features. In Figure 3, we observe that at the same prediction cost, DC-Pred++ consistently yields lower test RMSE than the other methods.

Acknowledgements
This research was supported by NSF grants CCF-1320746 and CCF-1117055.
C.-J.H. also acknowledges support from an IBM PhD fellowship.

References
[1] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin. Training and testing low-degree polynomial data mappings via linear SVM. JMLR, 11:1471–1490, 2010.
[2] M. Cossalter, R. Yan, and L. Zheng. Adaptive kernel approximation for large-scale non-linear SVM prediction. In ICML, 2011.
[3] A. Cotter, S. Shalev-Shwartz, and N. Srebro. Learning optimally sparse support vector machines. In ICML, 2013.
[4] P. Drineas, R. Kannan, and M. W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM J. Comput., 36(1):184–206, 2006.
[5] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.
[6] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[7] C.-J. Hsieh, I. S. Dhillon, P. Ravikumar, and A. Banerjee. A divide-and-conquer method for sparse inverse covariance estimation. In NIPS, 2012.
[8] C.-J. Hsieh, S. Si, and I. S. Dhillon. A divide-and-conquer solver for kernel support vector machines. In ICML, 2014.
[9] T. Joachims and C.-N. Yu. Sparse kernel SVMs via cutting-plane training. Machine Learning, 76(2):179–193, 2009.
[10] C. Jose, P. Goyal, P. Aggrwal, and M. Varma. Local deep kernel learning for efficient non-linear SVM prediction. In ICML, 2013.
[11] H. G. Jung and G. Kim. Support vector number reduction: Survey and experimental evaluations. IEEE Transactions on Intelligent Transportation Systems, 2014.
[12] P. Kar and H. Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.
[13] S. S. Keerthi, O. Chapelle, and D. DeCoste. Building support vector machines with reduced classifier complexity. JMLR, 7:1493–1515, 2006.
[14] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström methods. In NIPS, 2009.
[15] L. Ladicky and P. H. S. Torr. Locally linear support vector machines. In ICML, 2011.
[16] Q. V. Le, T. Sarlos, and A. J. Smola. Fastfood – approximating kernel expansions in loglinear time. In ICML, 2013.
[17] Y.-J. Lee and O. L. Mangasarian. RSVM: Reduced support vector machines. In SDM, 2001.
[18] S. Maji, A. C. Berg, and J. Malik. Efficient classification for additive kernel SVMs. IEEE PAMI, 35(1), 2013.
[19] M. Nandan, P. R. Khargonekar, and S. S. Talathi. Fast SVM training using approximate extreme points. JMLR, 15:59–98, 2014.
[20] D. Pavlov, D. Chudova, and P. Smyth. Towards scalable support vector machines using squashing. In KDD, pages 295–299, 2000.
[21] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.
[22] B. Schölkopf, P. Knirsch, A. J. Smola, and C. J. C. Burges. Fast approximation of support vector kernel expansions, and an interpretation of clustering as approximation in feature spaces. In Mustererkennung 1998 – 20. DAGM-Symposium, Informatik aktuell, pages 124–132, Berlin, 1998. Springer.
[23] S. Si, C.-J. Hsieh, and I. S. Dhillon. Memory efficient kernel approximation. In ICML, 2014.
[24] I. Tsang, J. Kwok, and P. Cheung. Core vector machines: Fast SVM training on very large data sets. JMLR, 6:363–392, 2005.
[25] P.-W. Wang and C.-J. Lin. Iteration complexity of feasible descent methods for convex optimization. JMLR, 15:1523–1548, 2014.
[26] S. Wang and Z. Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. JMLR, 14:2729–2769, 2013.
[27] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp, editors, NIPS, 2001.
[28] K. Zhang and J. T. Kwok. Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks, 21(10):1576–1587, 2010.
[29] K. Zhang, I. W. Tsang, and J. T. Kwok. Improved Nyström low rank approximation and error analysis. In ICML, 2008.