{"title": "Sampling Techniques for Kernel Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 335, "page_last": 342, "abstract": null, "full_text": "Sampling Techniques for Kernel Methods\n\nDimitris Achlioptas\nMicrosoft Research\noptas@microsoft.com\n\nFrank McSherry\n\nUniversity of Washington\n\nmcsherry@cs.washington.edu\n\nBernhard Sch\u00a8olkopf\n\nBiowulf Technologies NY\n\nbs@conclu.de\n\nAbstract\n\nWe propose randomized techniques for speeding up Kernel Principal\nComponent Analysis on three levels: sampling and quantization of the\nGram matrix in training, randomized rounding in evaluating the kernel\nexpansions, and random projections in evaluating the kernel itself. In all\nthree cases, we give sharp bounds on the accuracy of the obtained ap-\nproximations. Rather intriguingly, all three techniques can be viewed as\ninstantiations of the following idea: replace the kernel function  by a\n\u201crandomized kernel\u201d which behaves like \n\nin expectation.\n\n1 Introduction\n, techniques such as linear SVMs [13] and\nGiven a collection\nPCA extract features from\nby computing linear functions of this data. However, it is\noften the case that the structure present in the training data is not simply a linear function\nof the data representation. Worse, many data sets do not readily support linear operations\nsuch as addition and scalar multiplication (text, for example).\n\nof training data\n\n\u0002\u0004\u0003\u0006\u0005\b\u0007\t\u0007\b\u0007\n\u0005\u000b\u0002\r\f\n\nis \ufb01rst mapped into some dot product space\n\n.\nIn a \u201ckernel method\u201d\nThe dimension of\ncan be very large, even in\ufb01nite, and therefore it may not be practical\n(or possible) to work with the mapped data explicitly. Nonetheless, in many cases the dot\nproducts\nfor\n, \u0131.e. a function  so that \nAny algorithm whose operations can be expressed in terms of dot products can be general-\nized to an algorithm which operates on\n\n\u000f\u0011\u0010\u0012\u0001\u0014\u0013\u0015\u000e\ncan be evaluated ef\ufb01ciently using a positive de\ufb01nite kernel \n\n\u0018\u001a\u0002#\u001b$\u0005\u000b\u0002\u0012 %\u001c'&(\u0016)\u000f\u0019\u0018\u001a\u0002*\u001b+\u001c\u001e\u0005\u001f\u000f\u0019\u0018\u001a\u0002! \u0006\u001c\u000b\"\n\n, simply by presenting the Gram matrix\n\n\u0016\u0017\u000f\u0019\u0018\u001a\u0002\r\u001b\u001d\u001c\u001e\u0005\u001f\u000f\u0019\u0018\u001a\u0002! \u0006\u001c\u000b\"\n\nusing\n\n.\n\n\u000f\u0019\u0018\u0017\u0001,\u001c\n\u0010/&\n\n\u001b. \n\n\u0018\u001a\u0002\n\n\u0005$\u0002\n\nimplicitly performs the dot product calculations between mapped points.\n\nas the input covariance matrix. Note that at no point is the function\nthe kernel \nWhile this \u201ckernel trick\u201d has been extremely successful, a problem common to all kernel\nmethods is that, in general, -\n. For\n021\nexample, in Kernel PCA such a matrix has to be diagonalized, while in SVMs a quadratic\nprogram of size\nmust be solved. As the size of training sets in practical applications\nincreases, the growth of the input size rapidly poses severe computational limitations.\n\nis a dense matrix, making the input size scale as\n\nexplicitly computed;\n\nVarious methods have been proposed to deal with this issue, such as decomposition meth-\nods for SVM training (e.g., [10]), speedup methods for Kernel PCA [12], and other kernel\nmethods [2, 14]. 
While this "kernel trick" has been extremely successful, a problem common to all kernel methods is that, in general, $K$ is a dense matrix, making the input size scale as $n^2$. For example, in Kernel PCA such a matrix has to be diagonalized, while in SVMs a quadratic program of size $n^2$ must be solved. As the size of training sets in practical applications increases, the growth of the input size rapidly poses severe computational limitations.

Various methods have been proposed to deal with this issue, such as decomposition methods for SVM training (e.g., [10]), speedup methods for Kernel PCA [12], and other kernel methods [2, 14]. Our research is motivated by the need for such speedups that are also accompanied by strong, provable performance guarantees.

In this paper we give three such speedups for Kernel PCA. We start by simplifying the Gram matrix via a novel matrix sampling/quantization scheme, motivated by spectral properties of random matrices. We then move on to speeding up classification, by using randomized rounding in evaluating kernel expansions. Finally, we consider the evaluation of kernel functions themselves and show how many popular kernels can be approximated efficiently.

Our first technique relates matrix simplification to the stability of invariant subspaces. The other two are, in fact, completely general and apply to all kernel methods. What is more, our techniques suggest the notion of randomized kernels, whereby each evaluation of the kernel $k$ is replaced by an evaluation of a randomized function $\tilde{k}$ (on the same input pair). The idea is to use a function $\tilde{k}$ which for every input pair behaves like $k$ in expectation (over its internal coin-flips), yet confers significant computational benefits compared to using $k$. In fact, each one of our three techniques can be readily cast as an appropriate randomized kernel, with no other intervention.

2 Kernel PCA

Given $n$ training points, recall that $K$ is an $n \times n$ matrix with $K_{ij} = k(x_i, x_j)$. For some choice of $\ell < n$, the Kernel PCA (KPCA) method [11] computes the $\ell$ largest eigenvalues, $\lambda_1 \geq \cdots \geq \lambda_\ell$, and eigenvectors, $u^1, \ldots, u^\ell$, of $K$. Then, given an input point $x$, the method computes the value of $\ell$ nonlinear feature extraction functions

    $f_j(x) = \lambda_j^{-1/2} \sum_{i=1}^{n} u_i^j \, k(x_i, x).$

There are several methods for computing the principal components of a symmetric matrix. The choice depends on the properties of the matrix and on how many components one is seeking. In particular, if relatively few principal components are required, as is the case in KPCA, Orthogonal Iteration is a commonly used method. (Our discussion applies equally well to Lanczos Iteration which, while often preferable, is a more complicated method; here we focus on Orthogonal Iteration to simplify exposition.)

Orthogonal Iteration

1. Let $V$ be a random $n \times \ell$ matrix with orthonormal columns.
2. While not converged, do
   (a) $V \leftarrow K V$
   (b) $V \leftarrow$ Orthonormalize$(V)$
3. Return $V$.

It is worth looking closely at the complexity of performing Orthogonal Iteration on a matrix $K$. Step 1 can be done in $O(n \ell)$ steps. The orthonormalization step 2b takes time $O(n \ell^2)$ and is overwhelmed by the cost of computing $K V$ in step 2a which, generally, takes $O(n^2 \ell)$ steps, making step 2 the computational bottleneck. The number of iterations of the while loop is a somewhat complicated issue, but one can prove that the "error" in $V$ (with respect to the true principal components) decreases exponentially with the number of iterations. All in all, the running time of Orthogonal Iteration scales linearly with the cost of the matrix multiplication $K V$. If $K$ is sparse, i.e., if roughly one out of every $s$ entries of $K$ is non-zero, then the matrix multiplication $K V$ costs $O(n^2 \ell / s)$.
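A minimal NumPy sketch of the pseudocode above; the fixed iteration count stands in for the unspecified convergence test and, like the names, is an assumption of this sketch.

import numpy as np

def orthogonal_iteration(K, ell, iters=50):
    n = K.shape[0]
    V, _ = np.linalg.qr(np.random.randn(n, ell))  # step 1: random orthonormal columns
    for _ in range(iters):
        V = K @ V                                 # step 2a: the O(n^2 l) bottleneck
        V, _ = np.linalg.qr(V)                    # step 2b: orthonormalize
    return V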
As mentioned earlier, the matrix $K$ used in Kernel PCA is almost never sparse. In the next section, we will show how to sample and quantize the entries of $K$, obtaining a matrix $\hat{K}$ which is sparser and whose entries have simpler data representation, yet has essentially the same spectral structure, i.e. eigenvalues/eigenvectors, as $K$.

3 Sampling Gram Matrices

In this section we describe two general "matrix simplification" techniques and discuss their implications for Kernel PCA. In particular, under natural assumptions on the spectral structure of $K$, we will prove that applying KPCA to the simplified matrix $\hat{K}$ yields subspaces which are very close to those that KPCA would find in $K$. As a result, when we project vectors onto these spaces (as performed by the feature extractors) the results are provably close to the original ones.

First, our sparsification process works by randomly omitting entries in $K$. Precisely stated, for a sampling rate $s \in (0, 1]$, we let the matrix $\hat{K}$ be described entrywise as

    $\hat{K}_{ij} = K_{ij}/s$ with probability $s$, and $\hat{K}_{ij} = 0$ with probability $1 - s$.

Second, our quantization process rounds each entry in $K$ to one of $\{-b, +b\}$, for some $b \geq \max_{ij} |K_{ij}|$, thus reducing the representation of each entry to a single bit:

    $\hat{K}_{ij} = +b$ with probability $\frac{1}{2} + \frac{K_{ij}}{2b}$, and $\hat{K}_{ij} = -b$ with probability $\frac{1}{2} - \frac{K_{ij}}{2b}$.

Sparsification greatly accelerates the computation of eigenvectors by accelerating multiplication by $\hat{K}$. Moreover, both approaches greatly reduce the space required to store the matrix (and they can be readily combined), allowing for much bigger training sets to fit in main memory. Finally, we note that i) sampling also speeds up the construction of the Gram matrix, since we need only compute those values of $K$ that remain in $\hat{K}$, while ii) quantization allows us to replace exact kernel evaluations by coarse unbiased estimators, which can be more efficient to compute.
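Both simplifications admit short sketches; the helper names and the symmetrization via the upper triangle are our assumptions. In practice the sparsified matrix would be stored in a sparse format (e.g., scipy.sparse) to actually realize the speedup in computing $K V$.

import numpy as np

def sparsify(K, s, rng=None):
    # Keep each entry with probability s, rescaled by 1/s, so that
    # E[K_hat] = K; uniforms are drawn on the upper triangle and
    # mirrored to keep K_hat symmetric.
    rng = np.random.default_rng() if rng is None else rng
    U = rng.random(K.shape)
    U = np.triu(U) + np.triu(U, 1).T
    return np.where(U < s, K / s, 0.0)

def quantize(K, b, rng=None):
    # Round each entry to +b or -b with probabilities chosen so that
    # E[K_hat] = K; requires the float b to satisfy b >= max |K_ij|.
    rng = np.random.default_rng() if rng is None else rng
    U = rng.random(K.shape)
    U = np.triu(U) + np.triu(U, 1).T
    return np.where(U < 0.5 + K / (2 * b), b, -b)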
While the two processes above are quite different, they share one important commonality: in each case $\mathrm{E}[\hat{K}] = K$, and the entries of the error matrix, $E = \hat{K} - K$, are independent random variables having expectation zero and bounded variance. Large deviation extensions [5] of Wigner's famous semi-circle law imply that with very high probability such matrices have small L2 norm (denoted by $\|\cdot\|_2$ throughout).

Theorem 1 (Füredi and Komlós [5]) Let $E$ be an $n \times n$ symmetric matrix whose entries are independent random variables with mean 0, variance bounded above by $\sigma^2$, and magnitude bounded by $\sigma \sqrt{n}/\log^2 n$. With probability $1 - o(1)$,

    $\|E\|_2 \leq 4 \sigma \sqrt{n}.$

It is worth noting that this upper bound is within a constant factor of the lower bound on the L2 norm of any matrix where the mean squared entry equals $\sigma^2$. More precisely, it is easy to show that every matrix with Frobenius norm $\sigma n$ has L2 norm at least $\sigma \sqrt{n}$. Therefore, we see that the L2 error introduced by $E$ is within a factor of 4 of the L2 error associated with any modification to $K$ that has the same entrywise mean squared error.
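A rough numerical check of this bound, assuming the sparsify sketch above; the linear-kernel data and all constants are illustrative, and the magnitude condition of the theorem is not verified here.

import numpy as np

rng = np.random.default_rng(0)
n, s = 1000, 0.1
X = rng.standard_normal((n, 5))
K = X @ X.T                                    # a simple linear-kernel Gram matrix
E = sparsify(K, s, rng) - K
sigma = np.sqrt(np.max(K ** 2) * (1 - s) / s)  # bound on the entrywise std. dev.
print(np.linalg.norm(E, 2), 4 * sigma * np.sqrt(n))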
We will analyze three different cases of spectral stability, corresponding to progressively stronger assumptions. At the heart of these results is the stability of invariant subspaces in the presence of additive noise. This stability is very strong, but can be rather technical to express. In stating each of these results, it is important to note that the eigenvectors correspond exactly to the feature extractors associated with Kernel PCA. For an input point $x$, let $\bar{x}$ denote the vector whose $i$th coordinate is $k(x_i, x)$, and recall that the $j$th feature extractor can be written as $f_j(x) = \lambda_j^{-1/2} \, (u^j \cdot \bar{x})$.

Recall that in KPCA we associate features with the $\ell$ largest eigenvalues, where $\ell$ is typically chosen by requiring $\lambda_\ell \geq \tau$ for some threshold $\tau$. First, we consider what happens when the gap $\lambda_\ell - \lambda_{\ell+1}$ is not large. Observe that in this case we cannot hope to match the features of $K$ and $\hat{K}$ one by one, as the $\ell$th feature is very sensitive to small changes in $\lambda_\ell$. However, we can show that all features whose eigenvalues are far from $\tau$ are treated consistently in $K$ and $\hat{K}$.

Theorem 2 Let $U_\tau$ be any matrix whose columns form an orthonormal basis for the space of features (eigenvectors) of $K$ whose eigenvalue is at least $\tau$, and let $U_\tau^\perp$ be any matrix whose columns form an orthonormal basis for the orthogonal complement of that space. Let $\hat{U}_\tau$ and $\hat{U}_\tau^\perp$ be the analogous matrices for $\hat{K}$, and let $E = \hat{K} - K$. For any $\tau$ and any $\epsilon > 0$,

    $\|\hat{U}_\tau^T \, U_{\tau-\epsilon}^\perp\|_2 \leq \|E\|_2/\epsilon$   and   $\|(\hat{U}_\tau^\perp)^T \, U_{\tau+\epsilon}\|_2 \leq \|E\|_2/\epsilon.$

If we use the threshold $\tau$ for the eigenvalues of $\hat{K}$, the first equation asserts that the features KPCA recovers are not among the features of $K$ whose eigenvalues are less than $\tau - \epsilon$. Similarly, the second equation asserts that KPCA will recover all the features of $K$ whose eigenvalues are larger than $\tau + \epsilon$.

Proof: We employ the techniques of Davis and Kahan [4]. Observe that

    $\hat{U}_\tau^T E \, U_{\tau-\epsilon}^\perp = \hat{U}_\tau^T \hat{K} \, U_{\tau-\epsilon}^\perp - \hat{U}_\tau^T K \, U_{\tau-\epsilon}^\perp = \hat{\Lambda} \, \hat{U}_\tau^T U_{\tau-\epsilon}^\perp - \hat{U}_\tau^T U_{\tau-\epsilon}^\perp \, \Lambda^\perp,$

where $\hat{\Lambda}$ and $\Lambda^\perp$ are diagonal matrices whose entries (the corresponding eigenvalues of $\hat{K}$ and $K$) are at least $\tau$ and at most $\tau - \epsilon$, respectively. Since the two sets of eigenvalues are separated by at least $\epsilon$, we have $\|\hat{\Lambda} X - X \Lambda^\perp\|_2 \geq \epsilon \, \|X\|_2$ for any $X$, and therefore

    $\|E\|_2 \geq \|\hat{U}_\tau^T E \, U_{\tau-\epsilon}^\perp\|_2 \geq \epsilon \, \|\hat{U}_\tau^T U_{\tau-\epsilon}^\perp\|_2,$

which implies the first stated result. The second proof is essentially identical.
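The quantities in Theorem 2 can be estimated numerically; this sketch, with assumed names, measures how much of the large-eigenvalue feature space of $K$ leaks outside the space recovered from $\hat{K}$.

import numpy as np

def subspace_leakage(K, K_hat, tau, eps):
    # Computes ||(U_hat_perp)^T U_{tau+eps}||_2 from Theorem 2.
    w, V = np.linalg.eigh(K)            # eigenvalues in ascending order
    w_hat, V_hat = np.linalg.eigh(K_hat)
    U = V[:, w >= tau + eps]            # features of K with eigenvalue >= tau + eps
    U_hat_perp = V_hat[:, w_hat < tau]  # complement of the recovered space
    return np.linalg.norm(U_hat_perp.T @ U, 2)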
In our second result we will still not be able to isolate individual features, as the error matrix can reorder their importance by, say, interchanging $u^\ell$ and $u^{\ell+1}$. However, we can show that any such interchange will occur consistently in all test vectors. Let $\phi(x)$ be the $\ell$-dimensional vector whose $j$th coordinate is $\sqrt{\lambda_j} \, f_j(x)$, i.e., here we do not normalize features to "equal importance"; thus $\phi(x) = U^T \bar{x}$, where the columns of $U$ are $u^1, \ldots, u^\ell$. Let $\hat{\phi}(x)$ be the analogous vector for $\hat{K}$.

Theorem 3 Assume that $\lambda_\ell \geq \tau + \epsilon$ and $\lambda_{\ell+1} \leq \tau - \epsilon$, and that the threshold $\tau$ is used for the eigenvalues of $\hat{K}$. There is an orthonormal rotation matrix $R$ such that for all $x$,

    $\| R \, \hat{\phi}(x) - \phi(x) \| \leq c \, \frac{\|E\|_2}{\epsilon} \, \|\bar{x}\|,$

where $c$ is an absolute constant.

Proof: Instantiate Theorem 2 with threshold $\tau$ and gap $\epsilon$. Since no eigenvalue of $K$ lies in $(\tau - \epsilon, \tau + \epsilon)$, all principal angles between the spaces spanned by the columns of $U$ and of $\hat{U}$ are bounded in terms of $\|E\|_2/\epsilon$, and a standard argument converts this bound into the stated rotation $R$; the claim follows since $\phi(x) = U^T \bar{x}$ and $\hat{\phi}(x) = \hat{U}^T \bar{x}$.

Note that the rotation matrix becomes completely irrelevant if we are only concerned with differences, angles, or inner products of feature vectors.
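In practice, the rotation in Theorem 3 can be recovered by orthogonal Procrustes alignment; a sketch with assumed names:

import numpy as np

def align_features(U, U_hat):
    # Orthonormal R minimizing ||U_hat R - U||_F (orthogonal Procrustes);
    # then R^T applied to phi_hat(x) approximates phi(x) for every x.
    W, _, Vt = np.linalg.svd(U_hat.T @ U)
    return W @ Vt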
For each\n\nwhere we dropped the subscript \u0012\ntake values in an interval of width \u0013 and let\nfast, unbiased, small-variance estimator for \n\ncoef\ufb01cients \b\nFix \u001d\nthen let\n\n\u00171\u0017\n\u00171\u0017\n\u0017\u0001\nwith probability \u001d\u0005!\nperform \u201crandomized rounding\u201d on the (remaining) coef\ufb01cients of \b\n\notherwise.\n\n(\u0004\u0003\n\n: if\n\n; if\n\n\u001d\u001f\b\n\n\u0018\u0014\b\n\nlet\n\n. Let\n\n\u0003\t\u000f\n\n. Assume that \n\nbe any unit vector. We will devise a\n, by sampling and rounding the expansion\n\n\u0018\u001a\u0002\n\nThat is, after potentially keeping some large coef\ufb01cients deterministically, we proceed to\n\n!.\u001c\n\n(1)\n\n, i.e., dense eigenvectors.\nand suggests that using far fewer\n. In particular, for a\n\n\u0018\u001a\u0002\n\n\u0005$\u0002*\u001b\u001d\u001c\n\n. Moreover, using Hoeffding\u2019s inequality [7], we can\narising from the terms subjected to probabilistic round-\n\n\u0018\u001a\u0002#\u001c\n\u0002#\u001c\n\u0002#\u001c\n\nClearly, we have\nbound the behavior of\ning. In particular, this gives\n\n\u001c\u0006\u0005\n\n\u0018\u001a\u0002\n\n\u001c\b\u0007\n\u0002#\u001c\u000f\u0004\n\t\u000b\n\r\f\n\n\u0002\u000f\u000e\n\n\u0018\u001a\u0002\n\n\u001c+1\n\n\u0002#\u001c&\u0004\n\nkernel evaluations we can get good approximations of \n\n\u001a+*\n\u0013-,\u0011\u0010\nNote now that in Kernel PCA we typically expect \b\n\u001b\u0015\u0014\nthe natural scale for measuring \n\nThis makes \u0013\nthan\nchosen (\ufb01xed) value of\n% '\u001b(\n\n\u0018\u001a\u0002\nHaving picked some threshold\n(for SVM expansions\noffset) we want to determine whether \n\nrelative error estimate for it.\n\nlet us say that \n\n\u0002#\u001c\nis trivial if\n\n\u0001\u0019\u0018\n\n1\u0013\u0012\n\n\u0002#\u001c\n\n\u001c\u000b\u0017\u0016\n\n\u0018\u001a\u0002\n\n\u0018\u001a\u0002\n\nis related to the classi\ufb01cation\nis non-trivial and, if so, we want to get a good\n\nset \u001d\n\n\u001e+%1'\b(\n\n.\n\n\u001a\n\u001c\n\n\u001a\u0014\u0013\n\n\u0018\b\u001a\u001e\u0016\n\n. With\n\nTheorem 5 For any\nprobability at least\n\nand\n\n\u001e\u001d\u001c\n\n\u001a\u0006\u001b\n\u0003\u0005\u0004\u0006\u0003\n\n\u0018\u001a\u0002\n\n %\u0018\n\n\u0002#\u001c\n\n %\u0018\u001a\u0002\n\nnon-zero\n\nare trivial or\n\n\u0018\u001a\u0002\nand let \u001a\n\n2. Either both \n\ndenote the number of non-zero\n\n1. There are fewer than2\nand \n\u0003\u0005\u0004\u0019\u001a\n\u001c\nProof: Let\nequals\nindependent Bernoulli trials. It is not hard to show that the\nplus the sum of\n\u0004\u0012\u0017\nprobability that the event in 1 fails is bounded by the corresponding probability for the case\nwhere all coordinates of \u0017\nis a Binomial random variable with\ntrials and probability of success \u001d\n.\nThe Chernoff bound now implies that the event in 1 fails to occur with probability\n.\n\u0018\u001a0\nFor the enent in 2 it suf\ufb01ces to observe that failure occurs only if\nis at least\n\u001c\"\u0016\n\n\u000b\u001f\u001a\u001f\u001c\nand, by our choice of \u001d ,\n\n. By (1), this also occurs with probability\n\n\u0017 are equal. In that case,\n\n. 
5 Quick Batch Approximations of Kernels

In this section we devise fast approximations of the kernel function itself. We focus on kernels sharing the following two characteristics: i) they map from $d$-dimensional Euclidean space, and ii) the mapping depends only on the distance and/or inner product of the considered points. We note that this covers some of the most popular kernels, e.g., RBFs and polynomial kernels. To simplify exposition we focus on the following task: given a sequence of (test) vectors $y_1, \ldots, y_m$, determine $k(x_i, y_j)$ for each test vector $y_j$ and each vector in a fixed set of (training) vectors $x_1, \ldots, x_n$.

To get a fast batch approximation, the idea is that rather than evaluating distances and inner products directly, we will use a fast, approximately correct oracle for these quantities offering the following guarantee: it will answer all queries with small relative error.

A natural approach for creating such an oracle is to pick $p$ of the $d$ coordinates in input space and use the projection onto these coordinates to determine distances and inner products. The problem with this approach is that if the mass of $x - y$ is concentrated on a few coordinates, any coordinate sampling scheme is bound to do poorly. On the other hand, if we knew that all coordinates contributed "approximately equally" to $\|x - y\|$, then coordinate sampling would be much more appealing.
We will do just this, using the technique of random projections [8], which can be viewed as coordinate sampling preceded by a random rotation.

Imagine that we applied a spherically random rotation $R$ to the training points $x_1, \ldots, x_n$ (before training) and then applied the same random rotation $R$ to each input point $y$ as it became available. Clearly, all distances and inner products would remain the same, and we would get exactly the same results as without the rotation. The interesting part is that any fixed vector $z$ that is a linear combination of training and/or input vectors, e.g. $z = x_i - y$, after being rotated becomes a spherically random vector of length $\|z\|$. As a result, the coordinates of $R z$ are identically distributed random variables, each with variance $\|z\|^2/d$, enabling coordinate sampling.

Our oracle amounts to multiplying each training and input point by the same $d \times p$ projection matrix $P$, where $p \ll d$, and using the resulting $p$-dimensional points to estimate distances and inner products. (Think of $P$ as the result of taking a $d \times d$ rotation matrix and keeping only the first $p$ columns (sampling).) Before describing the choice of $P$ and the quality of the resulting approximations, let us go over the computational savings.

1. Rotating the $n$ training vectors takes $O(n d p)$ time. Note that: i) this rotation can be performed in the training phase, and ii) its cost will be amortized over the sequence of input vectors.

2. The kernel evaluations for each input $y$ now take $O(n p)$ instead of $O(n d)$ time.

3. Rotating each input $y$ takes $O(d p)$ time, which is dominated by the $O(n d)$ cost of the exact kernel evaluations.

Having motivated our oracle as a spherically random rotation followed by coordinate sampling, we will actually employ a simpler method to perform the projection. Namely, we will rely on a recent result of [1], asserting that we can do at least as well by taking $P_{ij} \in \{-1, +1\}$, each case having probability $1/2$. Thus, postponing the scaling by $1/\sqrt{p}$ until the end, each of the $p$ new coordinates is formed as follows: split the $d$ coordinates randomly into two groups; sum the coordinates in each group; take the difference of the two sums.
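A sketch of this oracle, with assumed names; the same matrix $P$ must be applied to training and test points, and the $1/\sqrt{p}$ scaling makes squared distances and inner products unbiased.

import numpy as np

def projection_matrix(d, p, rng=None):
    # The database-friendly projection of [1]: i.i.d. uniform +/-1 entries.
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice([-1.0, 1.0], size=(d, p))

def project(points, P):
    # Map an (m, d) array of points to (m, p); distances and inner
    # products among the projected points approximate the originals.
    return (points @ P) / np.sqrt(P.shape[1])

For an RBF kernel, say, one would then evaluate the kernel on the projected points, using the projected distance in place of the exact one.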
Regarding the quality of the resulting approximations, we get:

Theorem 6 Consider any sets of points $x_1, \ldots, x_n$ and $y_1, \ldots, y_m$ in $\mathbf{R}^d$ and, for given $\epsilon, \delta > 0$, let $p = \Theta(\epsilon^{-2} \log(n m / \delta))$. Let $P$ be a random $d \times p$ matrix defined by $P_{ij} \in \{-1, +1\}$, each case having probability $1/2$. With probability at least $1 - \delta$, for every pair of points $x_i, y_j$,

    $(1 - \epsilon) \, \|x_i - y_j\|^2 \leq \frac{1}{p} \, \|x_i P - y_j P\|^2 \leq (1 + \epsilon) \, \|x_i - y_j\|^2$    (2)

and

    $\Big| \, x_i \cdot y_j - \frac{1}{p} \, (x_i P) \cdot (y_j P) \, \Big| \leq \frac{\epsilon}{2} \, \big( \|x_i\|^2 + \|y_j\|^2 \big).$    (3)

Proof: We use Lemma 5 of [1], asserting that for any fixed vector $z \in \mathbf{R}^d$ and for $P$ as above,

    $\Pr\Big[ \, (1 - \epsilon) \, \|z\|^2 \leq \frac{1}{p} \, \|z P\|^2 \leq (1 + \epsilon) \, \|z\|^2 \, \Big] \geq 1 - 2 \exp\!\big( -\Omega(\epsilon^2 p) \big).$    (4)

By our choice of $p$, the r.h.s. of (4) is at least $1 - \delta/(2 n m)$. Thus, by the union bound, with probability at least $1 - \delta$ the lengths of all $2 n m$ vectors $x_i - y_j$ and $x_i + y_j$ are maintained within a factor of $1 \pm \epsilon$. This readily yields (2). For (3) we observe that $4 \, x \cdot y = \|x + y\|^2 - \|x - y\|^2$, and thus if the lengths of $x_i + y_j$ and $x_i - y_j$ are maintained within a factor of $1 \pm \epsilon$, then (3) holds.

6 Conclusion

We have described three techniques for speeding up kernel methods through the use of randomization. While the discussion has focused on Kernel PCA, we feel that our techniques have potential for further development and empirical evaluation in a more general setting.

Indeed, the methods for sampling kernel expansions and for speeding up the kernel evaluation are universal; also, the Gram matrix sampling is readily applicable to any kernel technique based on the eigendecomposition of the Gram matrix [3]. Furthermore, it might enable us to speed up SVM training by sparsifying the Hessian and then applying a sparse QP solver, such as the ones described in [6, 9].

Our sampling and quantization techniques, both in training and classification, amount to repeatedly replacing single kernel evaluations with independent random variables that have appropriate expectations. Note, for example, that while we have represented the sampling of the kernel expansion as randomized rounding of coefficients, this rounding is also equivalent to the following process: consider each coefficient as is, but replace every kernel invocation $k(x_i, x)$ with an invocation of a randomized kernel function, distributed as

    $\tilde{k}(x_i, x) = k(x_i, x) \cdot t/|u_i|$ with probability $|u_i|/t$, and $\tilde{k}(x_i, x) = 0$ otherwise.

Similarly, the process of sampling in training can be thought of as replacing $k$ with a randomized kernel distributed as

    $\tilde{k}(x_i, x_j) = k(x_i, x_j)/s$ with probability $s$, and $\tilde{k}(x_i, x_j) = 0$ otherwise,

while an analogous randomized kernel is the obvious choice for quantization.
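The randomized-kernel view admits a direct sketch (names assumed): wrap any kernel so that each invocation is an independent unbiased estimate.

import numpy as np

def randomized_kernel(kernel, s, rng=None):
    # The sampling kernel above: returns k(x, y)/s with probability s
    # and 0 otherwise, so that E[k_tilde(x, y)] = k(x, y).
    rng = np.random.default_rng() if rng is None else rng
    def k_tilde(x, y):
        return kernel(x, y) / s if rng.random() < s else 0.0
    return k_tilde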
We feel that this approach suggests a notion of randomized kernels, wherein kernel evaluations are no longer considered deterministic but inherently random, providing unbiased estimators for the corresponding inner products. Given bounds on the variance of these estimators, it seems that algorithms which reduce to computing weighted sums of kernel evaluations can exploit concentration of measure. Thus, randomized kernels appear promising as a general tool for speeding up kernel methods, warranting further investigation.

Acknowledgments. BS would like to thank Santosh Venkatesh for detailed discussions on sampling kernel expansions.

References

[1] D. Achlioptas, Database-friendly random projections, Proc. of the 20th Symposium on Principles of Database Systems (Santa Barbara, California), 2001, pp. 274–281.

[2] C. J. C. Burges, Simplified support vector decision rules, Proc. of the 13th International Conference on Machine Learning, Morgan Kaufmann, 1996, pp. 71–77.

[3] N. Cristianini, J. Shawe-Taylor, and H. Lodhi, Latent semantic kernels, Proc. of the 18th International Conference on Machine Learning, Morgan Kaufmann, 2001.

[4] C. Davis and W. Kahan, The rotation of eigenvectors by a perturbation. III, SIAM Journal on Numerical Analysis 7 (1970), 1–46.
[5] Z. Füredi and J. Komlós, The eigenvalues of random symmetric matrices, Combinatorica 1 (1981), no. 3, 233–241.

[6] N. I. M. Gould, An algorithm for large-scale quadratic programming, IMA Journal of Numerical Analysis 11 (1991), no. 3, 299–324.

[7] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58 (1963), 13–30.

[8] W. B. Johnson and J. Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, Conference in Modern Analysis and Probability (New Haven, Conn., 1982), American Mathematical Society, 1984, pp. 189–206.

[9] R. H. Nickel and J. W. Tolle, A sparse sequential quadratic programming algorithm, Journal of Optimization Theory and Applications 60 (1989), no. 3, 453–473.

[10] E. Osuna, R. Freund, and F. Girosi, An improved training algorithm for support vector machines, Neural Networks for Signal Processing VII, 1997, pp. 276–285.

[11] B. Schölkopf, A. J. Smola, and K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998), 1299–1319.

[12] A. J. Smola and B. Schölkopf, Sparse greedy matrix approximation for machine learning, Proc. of the 17th International Conference on Machine Learning, Morgan Kaufmann, 2000, pp. 911–918.

[13] V. Vapnik, The nature of statistical learning theory, Springer, NY, 1995.

[14] C. K. I. Williams and M. Seeger, Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems 13, MIT Press, 2001.
", "award": [], "sourceid": 2072, "authors": [{"given_name": "Dimitris", "family_name": "Achlioptas", "institution": null}, {"given_name": "Frank", "family_name": "Mcsherry", "institution": null}, {"given_name": "Bernhard", "family_name": "Schölkopf", "institution": null}]}