{"title": "An Empirical Study on The Properties of Random Bases for Kernel Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 2763, "page_last": 2774, "abstract": "Kernel machines as well as neural networks possess universal function approximation properties. Nevertheless in practice their ways of choosing the appropriate function class differ. Specifically neural networks learn a representation by adapting their basis functions to the data and the task at hand, while kernel methods typically use a basis that is not adapted during training. In this work, we contrast random features of approximated kernel machines with learned features of neural networks. Our analysis reveals how these random and adaptive basis functions affect the quality of learning. Furthermore, we present basis adaptation schemes that allow for a more compact representation, while retaining the generalization properties of kernel machines.", "full_text": "An Empirical Study on The Properties of\n\nRandom Bases for Kernel Methods\n\nMaximilian Alber, Pieter-Jan Kindermans, Kristof T. Sch\u00fctt\n\nTechnische Universit\u00e4t Berlin\n\nmaximilian.alber@tu-berlin.de\n\nKlaus-Robert M\u00fcller\n\nTechnische Universit\u00e4t Berlin\n\nKorea University\n\nMax Planck Institut f\u00fcr Informatik\n\nFei Sha\n\nUniversity of Southern California\n\nfeisha@usc.edu\n\nAbstract\n\nKernel machines as well as neural networks possess universal function approxima-\ntion properties. Nevertheless in practice their ways of choosing the appropriate\nfunction class differ. Speci\ufb01cally neural networks learn a representation by adapt-\ning their basis functions to the data and the task at hand, while kernel methods\ntypically use a basis that is not adapted during training. In this work, we contrast\nrandom features of approximated kernel machines with learned features of neural\nnetworks. 
Our analysis reveals how these random and adaptive basis functions affect the quality of learning. Furthermore, we present basis adaptation schemes that allow for a more compact representation, while retaining the generalization properties of kernel machines.\n\n1 Introduction\n\nRecent work on scaling kernel methods using random basis functions has shown that their performance on challenging tasks such as speech recognition can closely match that of deep neural networks [22, 6, 35]. However, research has also highlighted two disadvantages of random basis functions. First, a large number of basis functions, i.e., features, is needed to obtain useful representations of the data. In a recent empirical study [22], a kernel machine matching the performance of a deep neural network required a much larger number of parameters. Second, a finite number of random basis functions leads to a kernel approximation error that is data-specific [30, 32, 36].\n\nDeep neural networks learn representations that are adapted to the data using end-to-end training. Kernel methods, on the other hand, can only achieve this by selecting the optimal kernel to represent the data – a challenge that persists. Furthermore, there are interesting cases in which learning with deep architectures is advantageous, as they require exponentially fewer examples [25]. Yet arguably both paradigms have the same modeling power as the number of training examples goes to infinity. Moreover, empirical studies suggest that for real-world applications the advantage of one method over the other is somewhat limited [22, 6, 35, 37].\n\nUnderstanding the differences between approximated kernel methods and neural networks is crucial to using them optimally in practice. In particular, there are two aspects that require investigation: (1) How much performance is lost due to the kernel approximation error of the random basis? 
(2) What is the possible gain of adapting the features to the task at hand? Since these effects are expected to be data-dependent, we argue that an empirical study is needed to complement the existing theoretical contributions [30, 36, 20, 32, 8].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this work, we investigate these issues by making use of the fact that approximated kernel methods can be cast as shallow, one-hidden-layer neural networks. The bottom layers of these networks are random basis functions that are generated in a data-agnostic manner and are not adapted during training [30, 31, 20, 8]. This stands in stark contrast to neural networks, where even in the conventional single-layer case the bottom-layer parameters are optimized with respect to the data distribution and the loss function. Specifically, we designed our experiments to distinguish four cases:\n\n• Random Basis (RB): we use the (approximated) kernel machine in its traditional formulation [30, 8].\n\n• Unsupervised Adapted Basis (UAB): we adapt the basis functions to better approximate the true kernel function.\n\n• Supervised Adapted Basis (SAB): we adapt the basis functions using kernel target alignment [5] to incorporate label information.\n\n• Discriminatively Adapted Basis (DAB): we adapt the basis functions with a discriminative loss function, i.e., we optimize jointly over basis and classifier parameters. This corresponds to conventional neural network optimization.\n\nThese experiments allow us to isolate the effect of the randomness of the basis and contrast it to data- and task-dependent adaptations. 
We found that adapted bases consistently outperform random ones: an unsupervised basis adaptation leads to a better kernel approximation than a random one, and, when the task at hand is taken into account, a supervised kernel basis leads to an even more compact model with superior performance compared to the task-agnostic bases. Remarkably, this performance is retained after transferring the basis to another task, which makes this adaptation scheme a viable alternative to a discriminatively adapted basis.\n\nThe remainder is structured as follows. After a presentation of related work, we explain approximated kernel machines in the context of neural networks and describe our propositions in Sec. 3. In Sec. 4 we quantify empirically the benefit of adapted basis functions in contrast to their random counterparts. Finally, we conclude in Sec. 5.\n\n2 Related work\n\nTo overcome the limitations of kernel learning, several approximation methods have been proposed. In addition to Nyström methods [34, 7], random Fourier features [30, 31] have gained a lot of attention. Random features and their (faster) enhancements [20, 9, 39, 8] were successfully applied in many applications [6, 22, 14, 35] and analyzed theoretically [36, 32]. They inspired scalable approaches to learning kernels with Gaussian processes [35, 38, 23]. Notably, [2, 24] explore kernels in the context of neural networks, and, in the field of RBF networks, basis functions were adapted to the data by [26, 27].\n\nOur work contributes in several ways: we view kernel machines from a neural network perspective and delineate the influence of different adaptation schemes. None of the above does this. The related work [36] compares the data-dependent Nyström approximation to random features. While our approach generalizes to structured matrices, i.e., fast kernel machines, Nyström does not. Most similar to our work is [37]. 
They interpret the Fastfood kernel approximation as a neural network; their aim, however, is to reduce the number of parameters in a convolutional neural network.\n\n3 Methods\n\nIn this section we detail the relation between kernel approximations with random basis functions and neural networks. Then, we discuss the different approaches to adapting the basis that underlie our analysis.\n\n3.1 Casting kernel approximations as shallow, random neural networks\n\nKernels are pairwise similarity functions k(x, x'): R^d × R^d → R between two data points x, x' ∈ R^d. They are equivalent to inner products in an intermediate, potentially infinite-dimensional feature space produced by a function φ: R^d → R^D:\n\nk(x, x') = φ(x)^T φ(x').    (1)\n\nNon-linear kernel machines typically avoid using φ explicitly by applying the kernel trick. They work in the dual space with the (Gram) kernel matrix. This imposes a quadratic dependence on the number of samples n and prevents application in large-scale settings. Several methods have been proposed to overcome this limitation by approximating a kernel machine with the following functional form\n\nf(x) = W^T φ̂(x) + b,    (2)\n\nwhere φ̂(x) is the approximated kernel feature map. Now, we explain how to obtain this approximation for the Gaussian and the ArcCos kernel [2]. We chose the Gaussian kernel because it is the default choice for many tasks. The ArcCos kernel, on the other hand, yields an approximation consisting of rectified, piece-wise linear units (ReLUs) as used in deep learning [28, 11, 19].\n\nGaussian kernel To obtain the approximation of the Gaussian kernel, we use the following property [30]. 
Given a smooth, shift-invariant kernel k(x − x') = k(z) with Fourier transform p(w), then:\n\nk(z) = ∫_{R^d} p(w) e^{j w^T z} dw.    (3)\n\nUsing the Gaussian distribution p(w) = N(0, σ^{−1}), we obtain the Gaussian kernel\n\nk(z) = exp(−‖z‖^2 / (2σ^2)).\n\nThus, the kernel value k(x, x') can be approximated by the inner product between φ̂(x) and φ̂(x'), where φ̂ is defined as\n\nφ̂(x) = √(1/D) [sin(W_B^T x), cos(W_B^T x)]    (4)\n\nand W_B ∈ R^{d×D/2} is a random matrix with its entries drawn from N(0, σ^{−1}). The resulting features are then used to approximate the kernel machine with the implicitly infinite-dimensional feature space,\n\nk(x, x') ≈ φ̂(x)^T φ̂(x').    (5)\n\nArcCos kernel To yield a better connection to state-of-the-art neural networks we use the ArcCos kernel [2]\n\nk(x, x') = (1/π) ‖x‖ ‖x'‖ J(θ)\n\nwith J(θ) = sin θ + (π − θ) cos θ and θ = cos^{−1}(x·x' / (‖x‖ ‖x'‖)), the angle between x and x'. The approximation is not based on a Fourier transform, but is given by\n\nφ̂(x) = √(1/D) max(0, W_B^T x)    (6)\n\nwith W_B ∈ R^{d×D} being a random Gaussian matrix. This makes the approximated feature map of the ArcCos kernel closely related to ReLUs in deep neural networks.\n\nNeural network interpretation The approximated kernel features φ̂(x) can be interpreted as the output of the hidden layer in a shallow neural network. To obtain this interpretation, we rewrite Eq. 2 as\n\nf(x) = W^T h(W_B^T x) + b,    (7)\n\nwith W ∈ R^{D×c}, where c is the number of classes, and b ∈ R^c. Here, the non-linearity h corresponds to the obtained kernel approximation map: substituting z = W_B^T x in Eqs. 4 and 6 yields h(z) = √(1/D) [sin(z), cos(z)]^T for the Gaussian kernel and h(z) = √(1/D) max(0, z) for the ArcCos kernel.\n\n3.2 Adapting random kernel approximations\n\nWith the neural network interpretation of random features in place, the key difference between both methods is which parameters are trained. For a neural network, one optimizes the parameters of the bottom layer and those of the upper layers jointly. For kernel machines, however, W_B is fixed, i.e., the features are not adapted to the data. Hyper-parameters (such as σ, the bandwidth of the Gaussian kernel) are selected with cross-validation or heuristics [12, 6, 8]. Consequently, the basis is not directly adapted to the data, loss, and task at hand.\n\nIn our experiments, we consider the classification setting where, for given data X ∈ R^{n×d} containing n samples with d input dimensions, one seeks to predict the target labels Y ∈ [0, 1]^{n×c} with a one-hot encoding for c classes. We use accuracy as the performance measure and the multinomial-logistic loss as its surrogate. All our models have the same generic form shown in Eq. 7. However, we use different types of basis functions to analyze varying degrees of adaptation. In particular, we study whether data-dependent basis functions improve over data-agnostic ones. On top of that, we examine how well label-informed, and thus task-adapted, basis functions perform in contrast to the data-agnostic basis. Finally, we use end-to-end learning of all parameters to connect to neural networks.\n\nRandom Basis - RB: For the data-agnostic kernel approximation, we use the current state of the art in random features. Orthogonal random features [8, ORF] improve the convergence properties of the Gaussian kernel approximation over random Fourier features [30, 31]. 
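To make the two random feature maps concrete, here is a minimal NumPy sketch (all function names are ours). Note one labeled assumption: we use the scaling that makes the inner product an unbiased Monte-Carlo estimate of the kernel, since the constant conventions differ across papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_rff(X, W_B):
    """Random Fourier features for the Gaussian kernel (cf. Eq. 4).
    W_B: (d, m) with entries ~ N(0, 1/sigma^2); output has D = 2m features.
    Scaled so that phi(x) . phi(x') is an unbiased estimate of k(x, x')."""
    Z = X @ W_B
    m = W_B.shape[1]
    return np.sqrt(1.0 / m) * np.hstack([np.sin(Z), np.cos(Z)])

def arccos_rf(X, W_B):
    """Random ReLU features for the ArcCos kernel (cf. Eq. 6), W_B standard Gaussian."""
    D = W_B.shape[1]
    return np.sqrt(2.0 / D) * np.maximum(0.0, X @ W_B)

# Sanity check against the closed-form kernels for one pair of points.
d, sigma = 5, 1.5
x, y = rng.normal(size=d), rng.normal(size=d)

W_g = rng.normal(scale=1.0 / sigma, size=(d, 5000))  # D = 10000 features
k_gauss = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
k_gauss_hat = gaussian_rff(x[None], W_g) @ gaussian_rff(y[None], W_g).T

W_a = rng.normal(size=(d, 10000))
cos_t = np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1.0, 1.0)
theta = np.arccos(cos_t)
J = np.sin(theta) + (np.pi - theta) * np.cos(theta)
k_arc = np.linalg.norm(x) * np.linalg.norm(y) * J / np.pi
k_arc_hat = arccos_rf(x[None], W_a) @ arccos_rf(y[None], W_a).T
```

With enough features, both Monte-Carlo estimates come close to the closed-form kernel values; the Gaussian sketch uses plain i.i.d. frequencies rather than the orthogonalized variant discussed next.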
Practically, we substitute W_B with (1/σ) G_B, sample G_B ∈ R^{d×D/2} from N(0, 1), and orthogonalize the matrix as described in [8] to approximate the Gaussian kernel. The ArcCos kernel is applied as described above. We also use these features as initialization for the following adaptive approaches. When adapting the Gaussian kernel we optimize G_B while keeping the scale 1/σ fixed.\n\nUnsupervised Adapted Basis - UAB: While the introduced random bases converge towards the true kernel with an increasing number of features, it is to be expected that an optimized approximation will yield a more compact representation. We address this by optimizing the sampled parameters W_B w.r.t. the kernel approximation error (KAE):\n\nL̂(x, x') = (1/2) (k(x, x') − φ̂(x)^T φ̂(x'))^2.    (8)\n\nThis objective is kernel- and data-dependent, but agnostic to the classification task.\n\nSupervised Adapted Basis - SAB: As an intermediate step between task-agnostic kernel approximations and end-to-end learning, we propose to use kernel target alignment [5] to inject label information. This is achieved with a target kernel function k_Y, where k_Y(x, x') = +1 if x and x' belong to the same class and k_Y(x, x') = 0 otherwise. We maximize the alignment between the approximated kernel k and the target kernel k_Y for a given data set X:\n\nÂ(X, k, k_Y) = ⟨K, K_Y⟩ / √(⟨K, K⟩ ⟨K_Y, K_Y⟩)    (9)\n\nwith ⟨K_a, K_b⟩ = Σ_{i,j}^n k_a(x_i, x_j) k_b(x_i, x_j).\n\nDiscriminatively Adapted Basis - DAB: The previous approach uses label information, but is oblivious to the final classifier. A discriminatively adapted basis, on the other hand, is trained jointly with the classifier to minimize the classification objective, i.e., W_B, W, and b are optimized at the same time. 
This is the end-to-end optimization performed in neural networks.\n\n4 Experiments\n\nIn the following, we present the empirical results of our study, starting with a description of the experimental setup. Then, we proceed to the results of using data-dependent and task-dependent basis approximations. In the end, we bridge our analysis to deep learning and fast kernel machines.\n\nFigure 1: Adapting bases. The plots show the relationship between the number of features (x-axis), the KAE in logarithmic spacing (left, dashed lines) and the classification error (right, solid lines). Typically, the KAE decreases with a higher number of features, while the accuracy increases. The KAE for SAB and DAB (orange and red dotted lines) hints at how much the adaptation deviates from its initialization (blue dashed line). Best viewed in digital and color.\n\n4.1 Experimental setup\n\nWe used the following seven data sets for our study: Gisette [13], MNIST [21], CoverType [1], CIFAR10 features from [4], Adult [18], Letter [10], USPS [15]. The results for the last three can be found in the supplement. We center the data sets and scale them feature-wise into the range [−1, +1]. We use validation sets of size 1,000 for Gisette, 10,000 for MNIST, 50,000 for CoverType, 5,000 for CIFAR10, 3,560 for Adult, 4,500 for Letter, and 1,290 for USPS. We repeat every test three times and report the mean over these trials.\n\nOptimization We train all models with mini-batch stochastic gradient descent. The batch size is 64 and as update rule we use ADAM [17]. We use early stopping: we stop when the respective loss on the validation set has not decreased for ten epochs. We use Keras [3], Scikit-learn [29], NumPy [33] and SciPy [16]. We set the hyper-parameter σ for the Gaussian kernel heuristically according to [39, 8].\n\nThe UAB and SAB learning problems scale quadratically in the number of samples n. Therefore, to reduce memory requirements, we optimize by sampling mini-batches from the kernel matrix. 
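The two adaptation objectives, Eq. 8 (UAB) and Eq. 9 (SAB), are straightforward to evaluate on such mini-batches; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def kae_loss(k_batch, Phi_x, Phi_xp):
    """Eq. 8 averaged over a mini-batch of pairs (x, x'):
    0.5 * (k(x, x') - phi(x)^T phi(x'))^2."""
    k_hat = np.sum(Phi_x * Phi_xp, axis=1)  # row-wise inner products
    return 0.5 * np.mean((k_batch - k_hat) ** 2)

def target_kernel(y):
    """SAB target: K_Y[i, j] = 1 if y_i == y_j, else 0."""
    return (y[:, None] == y[None, :]).astype(float)

def alignment(K, K_Y):
    """Eq. 9: <K, K_Y> / sqrt(<K, K> <K_Y, K_Y>), Frobenius inner products."""
    inner = lambda A, B: float(np.sum(A * B))
    return inner(K, K_Y) / np.sqrt(inner(K, K) * inner(K_Y, K_Y))
```

In practice one would maximize `alignment` (or minimize `kae_loss`) with respect to the basis parameters via automatic differentiation; the sketch only states the objectives themselves.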
A batch for UAB consists of 64 sample pairs x and x' as input and the respective value of the kernel function k(x, x') as target value. Similarly, for SAB we sample 64 data points as input and generate the target kernel matrix as target value. For each training epoch we randomly generate 10,000 training and 1,000 validation batches, and, eventually, evaluate the performance on 1,000 unseen, random batches.\n\n4.2 Analysis\n\nTab. 1 gives an overview of the best performance achieved by each basis on each data set.\n\nTable 1: Best accuracy in % for different bases.\n\n            |        Gaussian         |         ArcCos\nDataset     |  RB   UAB   SAB   DAB   |  RB   UAB   SAB   DAB\nGisette     | 98.1  97.9  98.1  97.9  | 97.7  97.8  97.8  97.8\nMNIST       | 98.2  98.2  98.3  98.3  | 97.2  97.4  97.7  97.9\nCoverType   | 91.9  91.9  90.4  95.2  | 83.6  83.1  88.7  92.9\nCIFAR10     | 76.4  77.3  79.0  76.8  | 74.9  75.3  79.4  76.3\n\nData-adapted kernel approximations First, we evaluate the effect of choosing a data-dependent basis (UAB) over a random basis (RB). In Fig. 1, we show the kernel approximation error (KAE) and the classification accuracy for a range from 10 to 30,000 features (in logarithmic scale). The first striking observation is that a data-dependent basis can approximate the kernel equally well with up to two orders of magnitude fewer features than the random baseline. This holds for both the Gaussian and the ArcCos kernel. However, the advantage diminishes as the number of features increases. When we relate the kernel approximation error to the accuracy, we observe that initially a decrease in KAE correlates well with an increase in accuracy. However, once the kernel is approximated sufficiently well, using more features does not impact accuracy anymore.\n\nWe conclude that the choice between a random and a data-dependent basis strongly depends on the application. When a short training procedure is required, optimizing the basis could be too costly. 
On the other hand, if the focus lies on fast inference, we argue for optimizing the basis to obtain a compact representation. In settings with restricted resources, e.g., mobile devices, this can be a key advantage.\n\nTask-adapted kernels A key difference between kernel methods and neural networks originates from the training procedure. In kernel methods the feature representation is fixed while the classifier is optimized. In contrast, deep learning relies on end-to-end training such that the feature representation is tightly coupled to the classifier. Intuitively, this allows the representation to be tailor-made for the task at hand. Therefore, one would expect that this allows for an even more compact representation than the previously examined data-adapted basis.\n\nIn Sec. 3, we proposed a task-adapted kernel (SAB). Fig. 1 shows that this approach is comparable in terms of classification accuracy to the discriminatively trained basis (DAB). Only for the CoverType data set does SAB perform significantly worse, due to limited model capacity, which we will discuss below. Both task-adapted features improve significantly in accuracy compared to the random and data-adapted kernel approximations.\n\nTransfer learning The beauty of kernel methods is, however, that a kernel function can be used across a wide range of tasks and consistently yield good performance. Therefore, in the next experiment, we investigate whether the resulting kernel retains this generalization capability when it is task-adapted. To investigate the influence of task-dependent information, we randomly separate the classes of MNIST into two distinct subsets. The first task is to classify five randomly sampled classes and their respective data points, while the second task is to do the same with the remaining classes. We train the previously presented model variants on task 1 and transfer their bases to task 2, where we only learn the classifier. 
The experiment is repeated with five different splits and the mean accuracy is reported.\n\nFig. 2 shows that on the transfer task, the random and the data-adapted bases RB and UAB approximately retain the accuracy achieved on task 1. The performance of the end-to-end trained basis DAB drops significantly, yet still yields a better performance than the default random basis. Surprisingly, the supervised basis SAB using kernel target alignment retains its performance and achieves the highest accuracy on task 2. This shows that label information can indeed be exploited to improve the efficiency and performance of kernel approximations without sacrificing generalization, i.e., a target-driven kernel (SAB) can be an efficient and still general alternative to the universal Gaussian kernel.\n\nFigure 2: Transfer learning. We train to discriminate a random subset of 5 classes on the MNIST data set (left) and then transfer the basis functions to a new task (right), i.e., train with the fixed basis from task 1 to classify between the remaining classes.\n\nFigure 3: Deep kernel machines. The plots show the classification performance of the ArcCos kernels with respect to the kernel (first part) and with respect to the number of layers (second part). Best viewed in digital and color.\n\nDeep kernel machines We extend our analysis and draw a link to deep learning by adding two deep kernels [2]. As outlined in the aforementioned paper, stacking a Gaussian kernel is not useful; instead we use ArcCos kernels, which are related to deep learning as described below. Recall the ArcCos kernel from Sec. 3.1 as k_1(x, x'). 
Then the kernels ArcCos2 and ArcCos3 are defined by the inductive step\n\nk_{i+1}(x, x') = (1/π) [k_i(x, x) k_i(x', x')]^{1/2} J(θ_i) with θ_i = cos^{−1}(k_i(x, x') [k_i(x, x) k_i(x', x')]^{−1/2}).\n\nSimilarly, the feature map of the ArcCos kernel is approximated by a one-layer neural network with the ReLU activation function and a random weight matrix W_B,\n\nφ̂_ArcCos(x) = φ̂_B(x) = √(1/D) max(0, W_B^T x),    (10)\n\nand the feature maps of the ArcCos2 and ArcCos3 kernels are then given by a 2- or 3-layer neural network with ReLU activations, i.e., φ̂_ArcCos2(x) = φ̂_{B_1}(φ̂_{B_0}(x)) and φ̂_ArcCos3(x) = φ̂_{B_2}(φ̂_{B_1}(φ̂_{B_0}(x))). The training procedure for the ArcCos2 and ArcCos3 kernels remains identical to the training of the ArcCos kernel, i.e., the random matrices W_{B_i} are adapted simultaneously. Only now the basis consists of more than one layer, and, to remain comparable for a given number of features, we split these features evenly over two layers for a 2-layer kernel and over three layers for a 3-layer kernel.\n\nFigure 4: Fast kernel machines. The plots show how replacing the basis G_B with a fast approximation influences the performance of a Gaussian kernel, i.e., G_B is replaced by 1, 2, or 3 structured blocks HD_i. Fast approximations with 2 and 3 blocks might overlap with G_B. Best viewed in digital and color.\n\nIn the following we describe our results on the MNIST and CoverType data sets. We observe that the relationship described so far between the cases RB, UAB, SAB, and DAB also generalizes to deep models (see Fig. 3, first part, and Fig. 7 in the supplement): UAB approximates the true kernel function up to several orders of magnitude better than RB and leads to a better classification performance. Furthermore, SAB and DAB perform similarly well and clearly outperform the task-agnostic bases RB and UAB.\n\nWe now compare the results across the ArcCos kernels. Consider the third row of Fig. 3, which depicts the performance of RB and UAB on the CoverType data set. For a limited number of features, i.e., fewer than 3,000, the deeper kernels perform worse than the shallow ones. 
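For reference, the exact deep ArcCos kernels compared here can be evaluated in closed form via the inductive step; a small NumPy sketch (ours, assuming the positive square root on the prefactor, consistent with the base case k_1):

```python
import numpy as np

def arccos_deep(X, Y, layers):
    """Exact ArcCos kernel of the given depth between rows of X and Y,
    via k_{i+1}(x,x') = (1/pi) [k_i(x,x) k_i(x',x')]^{1/2} J(theta_i)."""
    def J(t):
        return np.sin(t) + (np.pi - t) * np.cos(t)

    Kxy = X @ Y.T                 # k_0 is the linear kernel
    Kxx = np.sum(X ** 2, axis=1)  # diagonal terms k_0(x, x)
    Kyy = np.sum(Y ** 2, axis=1)
    for _ in range(layers):
        norm = np.sqrt(np.outer(Kxx, Kyy))
        theta = np.arccos(np.clip(Kxy / norm, -1.0, 1.0))
        Kxy = norm * J(theta) / np.pi
        # Diagonal terms are invariant: theta = 0 and J(0) = pi,
        # so k_{i+1}(x, x) = k_i(x, x).
    return Kxy
```

For orthogonal unit vectors (theta = pi/2, J = 1) a single layer gives 1/pi, matching the closed form of k_1.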
Only given enough capacity are the deep kernels able to perform as well as or better than the single-layer bases. For the CoverType data set, on the other hand, the task-related bases, i.e., SAB and DAB, benefit significantly from a deeper structure and are thus more efficient. Comparing SAB with DAB: for the ArcCos kernel with only one layer, SAB leads to worse results than DAB. Given two layers the gap diminishes, and it vanishes with three layers (see Fig. 3). This suggests that for this data set the evaluated shallow models are not expressive enough to extract the task-related kernel information.\n\nFast kernel machines By using structured matrices one can speed up approximated kernel machines [20, 8]. We now investigate how this important technique influences the presented basis schemes. The speed-up is achieved by replacing the random Gaussian matrix with an approximation composed of diagonal and structured Hadamard matrices. The advantage of these matrix types is that they allow for low storage costs as well as fast multiplications. Recall that the input dimension is d and the number of features is D. By using the fast Hadamard transform, these algorithms only need to store O(D) instead of O(dD) parameters, and the kernel approximation can be computed in O(D log d) rather than O(Dd).\n\nWe use the approximation from [8] and replace the random Gaussian matrix W_B = (1/σ) G_B in Eq. 4 with a chain of random, structured blocks W_B ≈ (1/σ) HD_1 ... HD_i. Each block HD_i consists of a diagonal matrix D_i with entries sampled from the Rademacher distribution and a Hadamard matrix H. More blocks lead to a better approximation, but consequently require more computation. We found the optimization to be slightly more unstable and therefore stop early only after 20 epochs without improvement. 
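A minimal sketch of such a structured projection (our own recursive Walsh-Hadamard butterfly; we use the unnormalized transform, whereas practical implementations rescale by 1/sqrt(d) and use optimized routines):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: returns H @ x in O(d log d)
    without materializing H; len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def hd_block(x, d_sign):
    """One structured block HD: x -> H (D x), D = diag of Rademacher signs."""
    return fwht(d_sign * x)

def structured_projection(x, signs):
    """Chain HD_1 ... HD_i approximating a dense Gaussian projection G_B.
    `signs` is a list of Rademacher (+1/-1) vectors, one per block."""
    for d_sign in signs:
        x = hd_block(x, d_sign)
    return x
```

Each block stores only d signs instead of a d-by-d matrix, which is where the O(D) storage and O(D log d) multiplication costs quoted above come from; the exact ordering and scaling of the blocks follow [8] and are simplified here.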
When adapting a basis, we only modify the diagonal matrices.
We re-conducted our previous experiments for the Gaussian kernel on the MNIST and CoverType data sets (Fig. 4). First, one notices that in most cases the approximation exhibits no decline in performance and is a viable alternative for all basis adaptation schemes. There are two major exceptions. Consider first the left part of the second row, which depicts an approximated, random kernel machine (RB). The convergence of the kernel approximation stalls when using a random basis with only one block. As a result, the classification performance drops drastically. This is not the case when the basis is adapted unsupervised, shown in the right part of the second row. Here there is no major difference between one and more blocks. This means that for fast kernel machines an unsupervised adaptation can lead to more effective model utilization, which is crucial in resource-aware settings. Furthermore, a discriminatively trained basis, i.e., a neural network, can be affected similarly by this re-parameterization (see Fig. 4, bottom row). Here an order of magnitude more features are needed to achieve the same accuracy as an exact representation, regardless of how many blocks are used. In contrast, when the kernel is adapted in a supervised fashion, no decline in performance is noticeable. This shows that this procedure uses parameters very efficiently.

5 Conclusions

Our analysis shows how random and adaptive bases affect the quality of learning. Random features in particular require a large number of features, which suggests that two issues severely limit approximated kernel machines: the basis being (1) agnostic to the data distribution and (2) agnostic to the task.
We have found that data-dependent optimization of the kernel approximation consistently results in a more compact representation for a given kernel approximation error. Moreover, task-adapted features could further improve upon this. Even with fast, structured matrices, the adaptive features allow us to further reduce the number of required parameters. This presents a promising strategy when fast and computationally cheap inference is required, e.g., on mobile devices.
Beyond that, we have evaluated the generalization capabilities of the adapted variants on a transfer learning task. Remarkably, all adapted bases outperform the random baseline here. We have found that the kernel-task alignment works particularly well in this setting, having almost the same performance on the transfer task as on the target task. At the junction of kernel methods and deep learning, this shows that incorporating label information can indeed be beneficial for performance without having to sacrifice generalization capability. Investigating this in more detail appears to be highly promising and suggests a path for future work.

Acknowledgments

MA, KS, KRM, and FS acknowledge support by the Federal Ministry of Education and Research (BMBF) under 01IS14013A. PJK has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement NO 657679. KRM further acknowledges partial funding by the Institute for Information & Communications Technology Promotion (IITP) grant funded by the Korea government (No. 2017-0-00451), BK21 and by DFG. FS is partially supported by NSF IIS-1065243, 1451412, 1513966/1632803, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship and ARO# W911NF-12-1-0241 and W911NF-15-1-0484. This work was supported by NVIDIA with a hardware donation.

References

[1] Jock A. Blackard and Denis J. Dean.
Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 2000.

[2] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350, 2009.

[3] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[4] Adam Coates, Andrew Y. Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.

[5] Nello Cristianini, Andre Elisseeff, John Shawe-Taylor, and Jaz Kandola. On kernel-target alignment. In Advances in Neural Information Processing Systems, 2001.

[6] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.

[7] Petros Drineas and Michael W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.

[8] Felix X. Yu, Ananda Theertha Suresh, Krzysztof M. Choromanski, Daniel N. Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975–1983, 2016.

[9] Chang Feng, Qinghua Hu, and Shizhong Liao. Random feature mapping with signed circulant matrix projection. In IJCAI, pages 3490–3496, 2015.

[10] Peter W. Frey and David J. Slate. Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6(2):161–182, 1991.

[11] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks.
In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.

[12] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer, 2005.

[13] Isabelle Guyon, Steve R. Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In NIPS, volume 4, pages 545–552, 2004.

[14] Po-Sen Huang, Haim Avron, Tara N. Sainath, Vikas Sindhwani, and Bhuvana Ramabhadran. Kernel methods match deep neural networks on TIMIT. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 205–209. IEEE, 2014.

[15] Jonathan J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550–554, 1994.

[16] Eric Jones, Travis Oliphant, and Pearu Peterson. SciPy: Open source scientific tools for Python, 2014.

[17] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[18] Ron Kohavi. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202–207, 1996.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[20] Quoc Le, Tamas Sarlos, and Alexander Smola. Fastfood – computing Hilbert space expansions in loglinear time. Journal of Machine Learning Research, 28:244–252, 2013.

[21] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges.
The MNIST database of handwritten digits, 1998.

[22] Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, et al. How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, 2014.

[23] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865–1881, 2010.

[24] Grégoire Montavon, Mikio L. Braun, and Klaus-Robert Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563–2581, 2011.

[25] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[26] John Moody and Christian J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.

[27] Klaus-Robert Müller, Alex Smola, Gunnar Rätsch, Bernhard Schölkopf, Jens Kohlmorgen, and Vladimir Vapnik. Using support vector machines for time series prediction. Advances in Kernel Methods—Support Vector Learning, pages 243–254, 1999.

[28] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[30] Ali Rahimi and Benjamin Recht.
Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008.

[31] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1313–1320. Curran Associates, Inc., 2009.

[32] Dougal J. Sutherland and Jeff Schneider. On the error of random Fourier features. AUAI, 2015.

[33] Stéfan van der Walt, S. Chris Colbert, and Gaël Varoquaux. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

[34] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 13th International Conference on Neural Information Processing Systems, pages 661–667. MIT Press, 2000.

[35] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. arXiv preprint arXiv:1511.02222, 2015.

[36] Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems, pages 476–484, 2012.

[37] Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, and Ziyu Wang. Deep fried convnets, June 2015.

[38] Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la carte – learning fast kernels. Journal of Machine Learning Research, 38:1098–1106, 2015.

[39] Felix X. Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions.
arXiv preprint arXiv:1503.03893, 2015.