{"title": "Stochastic Variational Deep Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2586, "page_last": 2594, "abstract": "Deep kernel learning combines the non-parametric flexibility of kernel methods with the inductive biases of deep learning architectures. We propose a novel deep kernel learning model and stochastic variational inference procedure which generalizes deep kernel learning approaches to enable classification, multi-task learning, additive covariance structures, and stochastic gradient training. Specifically, we apply additive base kernels to subsets of output features from deep neural architectures, and jointly learn the parameters of the base kernels and deep network through a Gaussian process marginal likelihood objective. Within this framework, we derive an efficient form of stochastic variational inference which leverages local kernel interpolation, inducing points, and structure exploiting algebra. We show improved performance over stand alone deep networks, SVMs, and state of the art scalable Gaussian processes on several classification benchmarks, including an airline delay dataset containing 6 million training points, CIFAR, and ImageNet.", "full_text": "Stochastic Variational Deep Kernel Learning\n\nAndrew Gordon Wilson*\n\nCornell University\n\nZhiting Hu*\n\nCMU\n\nRuslan Salakhutdinov\n\nCMU\n\nEric P. Xing\n\nCMU\n\nAbstract\n\nDeep kernel learning combines the non-parametric \ufb02exibility of kernel methods\nwith the inductive biases of deep learning architectures. We propose a novel deep\nkernel learning model and stochastic variational inference procedure which gener-\nalizes deep kernel learning approaches to enable classi\ufb01cation, multi-task learning,\nadditive covariance structures, and stochastic gradient training. 
Speci\ufb01cally, we\napply additive base kernels to subsets of output features from deep neural archi-\ntectures, and jointly learn the parameters of the base kernels and deep network\nthrough a Gaussian process marginal likelihood objective. Within this framework,\nwe derive an ef\ufb01cient form of stochastic variational inference which leverages local\nkernel interpolation, inducing points, and structure exploiting algebra. We show\nimproved performance over stand alone deep networks, SVMs, and state of the\nart scalable Gaussian processes on several classi\ufb01cation benchmarks, including an\nairline delay dataset containing 6 million training points, CIFAR, and ImageNet.\n\n1\n\nIntroduction\n\nLarge datasets provide great opportunities to learn rich statistical representations, for accurate\npredictions and new scienti\ufb01c insights into our modeling problems. Gaussian processes are promising\nfor large data problems, because they can grow their information capacity with the amount of available\ndata, in combination with automatically calibrated model complexity [21, 25].\nFrom a Gaussian process perspective, all of the statistical structure in data is learned through a kernel\nfunction. Popular kernel functions, such as the RBF kernel, provide smoothing and interpolation,\nbut cannot learn representations necessary for long range extrapolation [22, 25]. With smoothing\nkernels, we can only use the information in a large dataset to learn about noise and length-scale\nhyperparameters, which tell us only how quickly correlations in our data vary with distance in the\ninput space. If we learn a short length-scale hyperparameter, then by de\ufb01nition we will only make\nuse of a small amount of training data near each testing point. 
If we learn a long length-scale, then\nwe could subsample the data and make similar predictions.\nTherefore, to fully use the information in large datasets, we must build kernels with great representational\npower and useful learning biases, and scale these approaches without sacrificing this\nrepresentational ability. Indeed many recent approaches have advocated building expressive kernel\nfunctions [e.g., 22, 9, 26, 25, 17, 31], and emerging research in this direction takes inspiration from\ndeep learning models [e.g., 28, 5, 3]. However, the scalability, general applicability, and interpretability\nof such approaches remain a challenge. Recently, Wilson et al. [30] proposed simple and scalable\ndeep kernels for single-output regression problems, with promising performance on many experiments.\nBut their approach does not allow for stochastic training, multiple outputs, deep architectures\nwith many output features, or classification. And it is on classification problems, in particular, where\nwe often have high dimensional input vectors, with little intuition about how these vectors should\ncorrelate, and therefore most want to learn a flexible non-Euclidean similarity metric [1].\n\n*Equal contribution. 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona,\nSpain.\n\nIn this paper, we introduce inference procedures and propose a new deep kernel learning model\nwhich enables (1) classification and non-Gaussian likelihoods; (2) multi-task learning [footnote 1]; (3) stochastic\ngradient mini-batch training; (4) deep architectures with many output features; (5) additive covariance\nstructures; and (6) greatly enhanced scalability.\nWe propose to use additive base kernels corresponding to Gaussian processes (GPs) applied to subsets\nof output features of a deep neural architecture. We then linearly mix these Gaussian processes,\ninducing correlations across multiple output variables. 
The result is a deep probabilistic neural\nnetwork, with a hidden layer composed of additive sets of infinite basis functions, linearly mixed\nto produce correlated output variables. All parameters of the deep architecture and base kernels\nare jointly learned through a marginal likelihood objective, having integrated away all GPs. For\nscalability and non-Gaussian likelihoods, we derive stochastic variational inference (SVI) which\nleverages local kernel interpolation, inducing points, and structure exploiting algebra, and a hybrid\nsampling scheme, building on Wilson and Nickisch [27], Wilson et al. [29], Titsias [24], Hensman\net al. [10], and Nickson et al. [18]. The resulting approach, SV-DKL, has a complexity of O(m^{1+1/D})\nfor m inducing points and D input dimensions, versus the standard O(m^3) for efficient stochastic\nvariational methods.\nWe achieve good predictive accuracy and scalability over a wide range of classification tasks,\nwhile retaining a straightforward, general purpose, and highly practical probabilistic non-parametric\nrepresentation, with code available at https://people.orie.cornell.edu/andrew/code.\n2 Background\nThroughout this paper, we assume we have access to vectorial input-output pairs D = {x_i, y_i},\nwhere each y_i is related to x_i through a Gaussian process and observation model. For example,\nin regression, one could model y(x) | f(x) \u223c N(y(x); f(x), \u03c3^2 I), where f(x) is a latent vector of\nindependent Gaussian processes f_j \u223c GP(0, k_j), and \u03c3^2 I is a noise covariance matrix.\nThe computational bottleneck in working with Gaussian processes typically involves computing\n(K_{X,X} + \u03c3^2 I)^{-1} y and log |K_{X,X}| over an n \u00d7 n covariance matrix K_{X,X} evaluated at n training\ninputs X. Standard procedure is to compute the Cholesky decomposition of K_{X,X}, which incurs\nO(n^3) computations and O(n^2) storage, after which predictions cost O(n^2) per test point. 
Gaussian\nprocesses are thus typically limited to at most a few thousand training points. Many promising\napproaches to scalability have been explored, for example, involving randomized methods [20, 16, 31],\nand low rank approximations [23, 19]. Wilson and Nickisch [27] recently introduced the KISS-GP\napproximate kernel matrix \\tilde{K}_{X,X'} = M_X K_{Z,Z} M_{X'}^T, which admits fast computations, given the\nexact kernel matrix K_{Z,Z} evaluated on a latent multidimensional lattice of m inducing inputs Z, and\nM_X, a sparse interpolation matrix. Without requiring any grid structure in X, K_{Z,Z} decomposes\ninto a Kronecker product of Toeplitz matrices, which can be approximated by circulant matrices [29].\nExploiting such structure in combination with local kernel interpolation enables one to use many\ninducing points, resulting in near-exact accuracy in the kernel approximation, and O(n) inference.\nUnfortunately, this approach does not typically apply to D > 5 dimensional inputs [29].\nMoreover, the Gaussian process marginal likelihood does not factorize, and thus stochastic gradient\ndescent does not ordinarily apply. To address this issue, Hensman et al. [10] extended the variational\napproach from Titsias [24] and derived a stochastic variational GP posterior over inducing points\nfor a regression model which does have the required factorization for stochastic gradient descent.\nHensman et al. [12], Hensman et al. [11], and Dezfouli and Bonilla [6] further combine this with\na sampling procedure for estimating non-conjugate expectations. These methods have O(m^3)\nsampling complexity which becomes prohibitive where many inducing points are desired for accurate\napproximation. Nickson et al. [18] consider Kronecker structure in the stochastic approximation of\nHensman et al. 
[10] for regression, but do not leverage local kernel interpolation or sampling.\nTo address these limitations, we introduce a new deep kernel learning model for multi-task classification,\nmini-batch training, and scalable kernel interpolation which does not require low dimensional\ninput spaces. In this paper, we view scalability and flexibility as two sides of one coin: we most want\nthe flexible models on the largest datasets, which contain the necessary information to discover rich\nstatistical structure. We show that the resulting approach can learn very expressive and interpretable\nkernel functions on large classification datasets, containing millions of training points.\n\n[Footnote 1] We follow the GP convention where multi-task learning involves a function mapping a single input to\nmultiple correlated output responses (class probabilities, regression responses, etc.). Unlike NNs which naturally\nhave correlated outputs by sharing hidden basis functions (and multi-task can have a more specialized meaning),\nmost GP models perform multiple binary classification, ignoring correlations between output classes. Even\napplying a GP to NN features for deep kernel learning does not naturally produce multiple correlated outputs.\n\nFigure 1: Deep Kernel Learning for Multidimensional Outputs. Multidimensional inputs x \u2208 R^D are mapped\nthrough a deep architecture, and then a series of additive Gaussian processes f_1, . . . , f_J, with base kernels\nk_1, . . . , k_J, are each applied to subsets of the network features h^{(L)}_1, . . . , h^{(L)}_Q. The thick lines indicate a\nprobabilistic mapping. The additive Gaussian processes are then linearly mixed by the matrix A and mapped to\noutput variables y_1, . . . , y_C (which are then correlated through A). All of the parameters of the deep network,\nbase kernel, and mixing layer, \u03b3 = {w, \u03b8, A} are learned jointly through the (variational) marginal likelihood of\nour model, having integrated away all of the Gaussian processes. We can view the resulting model as a Gaussian\nprocess which uses an additive series of deep kernels with weight sharing.\n\n
3 Deep Kernel Learning for Multi-task Classification\nWe propose a new deep kernel learning approach to account for classification and non-Gaussian\nlikelihoods, multiple correlated outputs, additive covariances, and stochastic gradient training.\nWe propose to build a probabilistic deep network as follows: 1) a deep non-linear transformation\nh(x, w), parametrized by weights w, is applied to the observed input variable x, to produce Q\nfeatures at the final layer L, h^{(L)}_1, . . . , h^{(L)}_Q; 2) J Gaussian processes, with base kernels k_1, . . . , k_J,\nare applied to subsets of these features, corresponding to an additive GP model [e.g., 7]. The base\nkernels can thus act on relatively low dimensional inputs, where local kernel interpolation and learning\nbiases such as similarities based on Euclidean distance are most natural; 3) these GPs are linearly\nmixed by a matrix A \u2208 R^{C\u00d7J}, and transformed by an observation model, to produce the output\nvariables y_1, . . . , y_C. The mixing of these variables through A produces correlated multiple outputs,\na multi-task property which is uncommon in Gaussian processes or SVMs. The structure of this\nnetwork is illustrated in Figure 1. Critically, all of the parameters in the model (including base kernel\nhyperparameters) are trained through optimizing a marginal likelihood, having integrated away the\nGaussian processes, through the variational inference procedures described in section 4.\nFor classification, we consider a special case of this architecture. 
Let C be the number of classes, and\nwe have data {x_i, y_i}_{i=1}^n, where y_i \u2208 {0, 1}^C is a one-hot encoding of the class label. We use the\nsoftmax observation model:\n\np(y_i | f_i, A) = exp(A(f_i)^T y_i) / \u2211_c exp(A(f_i)^T e_c),    (1)\n\nwhere f_i \u2208 R^J is a vector of independent Gaussian processes followed by a linear mixing layer\nA(f_i) = A f_i; and e_c is the indicator vector with the cth element being 1 and the rest 0.\nFor the jth Gaussian process in the additive GP layer, let f_j = {f_{ij}}_{i=1}^n be the latent functions on\nthe input data features. By introducing a set of latent inducing variables u_j indexed by m inducing\ninputs Z, we can write [e.g., 19]\n\np(f_j | u_j) = N(f_j | K^{(j)}_{X,Z} K^{(j),-1}_{Z,Z} u_j, \\tilde{K}^{(j)}),    \\tilde{K} = K_{X,X} \u2212 K_{X,Z} K_{Z,Z}^{-1} K_{Z,X}.    (2)\n\nSubstituting the local interpolation approximation K_{X,X'} = M K_{Z,Z} M^T of Wilson and Nickisch\n[27] into Eq. (2), we find \\tilde{K}^{(j)} = 0; it therefore follows that f_j = K_{X,Z} K_{Z,Z}^{-1} u = M u. In section 4\nwe exploit this deterministic relationship between f and u, governed by the sparse matrix M, to\nderive a particularly efficient stochastic variational inference procedure.\n\nEq. (1) and Eq. (2) together form the additive GP layer and the linear mixing layer of the proposed\ndeep probabilistic network in Figure 1, with all parameters (including network weights) trained jointly\nthrough the Gaussian process marginal likelihood.\n4 Structure Exploiting Stochastic Variational Inference\nExact inference and learning in Gaussian processes with a non-Gaussian likelihood is not analytically\ntractable. 
Variational inference is an appealing approximate technique due to its automatic regularization\nto avoid overfitting, and its ability to be used with stochastic gradient training, by providing a\nfactorized approximation to the Gaussian process marginal likelihood. We develop our stochastic\nvariational method equipped with a fast sampling scheme for tackling any intractable marginalization.\nLet u = {u_j}_{j=1}^J be the collection of the inducing variables of the J additive GPs. We assume a\nvariational posterior over the inducing variables q(u). By Jensen\u2019s inequality we have\n\nlog p(y) \u2265 E_{q(u)p(f|u)}[log p(y|f)] \u2212 KL[q(u) || p(u)] \u225c L(q),    (3)\n\nwhere we have omitted the mixing weights A for clarity. The KL divergence term can be interpreted\nas a regularizer encouraging the approximate posterior q(u) to be close to the prior p(u). We aim\nat tightening the marginal likelihood lower bound L(q) which is equivalent to minimizing the KL\ndivergence from q to the true posterior.\nSince the likelihood function typically factorizes over data instances: p(y|f) = \u220f_{i=1}^n p(y_i|f_i),\nwe can optimize the lower bound with stochastic gradients.\nIn particular, we specify q(u) = \u220f_j N(u_j | \u00b5_j, S_j) for the independent GPs, and iteratively update the variational parameters\n{\u00b5_j, S_j}_{j=1}^J and the kernel and deep network parameters using a noisy approximation of the gradient\nof the lower bound on minibatches of the full data. Henceforth we omit the index j for clarity.\nUnfortunately, for general non-Gaussian likelihoods the expectation in Eq (3) is usually intractable.\nWe develop a sampling method for tackling this intractability which is highly efficient with structured\nreparameterization, local kernel interpolation, and structure exploiting algebra.\nUsing local kernel interpolation, the latent function f is expressed as a deterministic local interpolation\nof the inducing variables u (section 3). 
This result allows us to work around any difficult approximate\nposteriors on f which typically occur in variational approaches for GPs. Instead, our sampler only\nneeds to account for the uncertainty on u. The direct parameterization of q(u) yields a straightforward\nand efficient sampling procedure. The latent function samples (indexed by t) are then computed\ndirectly through interpolation f^{(t)} = M u^{(t)}.\nAs opposed to conventional mean-field methods, which assume a diagonal variational covariance\nmatrix, we use the Cholesky decomposition for reparameterizing u in order to preserve structures\nwithin the covariance. Specifically, we let S = L L^T, resulting in the following sampling procedure:\n\nu^{(t)} = \u00b5 + L \u03b5^{(t)};    \u03b5^{(t)} \u223c N(0, I).\n\nEach step of the above standard sampler has complexity of O(m^2), where m is the number of\ninducing points. Due to the matrix vector product, this sampling procedure becomes prohibitive\nin the presence of many inducing points, which are required for accuracy on large datasets with\nmultidimensional inputs \u2013 particularly if we have an expressive kernel function [27].\nWe scale up the sampler by leveraging the fact that the inducing points are placed on a grid (taking\nadvantage of both Toeplitz and circulant structure), and additionally imposing a Kronecker decomposition\non L = \u2297_{d=1}^D L_d, where D is the input dimension of the base kernel. With the fast Kronecker\nmatrix-vector products, we reduce the above sampling cost of O(m^2) to O(m^{1+1/D}). Our approach\nthus greatly improves over previous stochastic variational methods which typically scale with O(m^3)\ncomplexity, as discussed shortly.\nNote that the KL divergence term between the two Gaussians in Eq (3) has a closed form without the\nneed for Monte Carlo estimation. Computing the KL term and its derivatives, with the Kronecker\nmethod, is O(D m^{3/D}). 
With T samples of u and a minibatch of data points of size B, we can estimate\nthe marginal likelihood lower bound as\n\nL \u2248 (N / (T B)) \u2211_{t=1}^T \u2211_{i=1}^B log p(y_i | f_i^{(t)}) \u2212 KL[q(u) || p(u)],    (4)\n\nand the derivatives \u2207L w.r.t. the model hyperparameters \u03b3 and the variational parameters\n{\u00b5, {L_d}_{d=1}^D} can be taken similarly. We provide the detailed derivation in the supplement.\nAlthough a small body of pioneering work has developed stochastic variational methods for Gaussian\nprocesses, our approach distinctly provides the above representation-preserving variational approximation,\nand exploits algebraic structure for significant advantages in scalability and accuracy. In\nparticular, a similar variational lower bound as in Eq (3) was proposed in [24, 10] for a sparse GP,\nwhich was extended to non-conjugate likelihoods, with the intractable integrals estimated using\nGaussian quadrature as in the KLSP-GP [11] or univariate Gaussian samples as in the SAVI-GP [6].\nHensman et al. [12] estimate nonconjugate expectations with a hybrid Monte Carlo sampler (denoted\nas MC-GP). The computations in these approaches can be costly, with O(m^3) complexity, due to\na complicated variational posterior over f as well as the expensive operations on the full inducing\npoint matrix. In addition to its increased efficiency, our sampling scheme is much simpler, without\nintroducing any additional tuning parameters. We empirically compare with these methods and show\nthe practical significance of our algorithm in section 5.\nVariational methods have also been used in GP regression for stochastic inference (e.g., [18, 10]),\nand some of the most recent work in this area applied variational auto-encoders [14] for coupled\nvariational updates (aka back constraints) [4, 2]. 
We note that these techniques are orthogonal and\ncomplementary to our inference approach, and can be leveraged for further enhancements.\n\n5 Experiments\nWe evaluate our proposed approach, stochastic variational deep kernel learning (SV-DKL), on a\nwide range of classi\ufb01cation problems, including an airline delay task with over 5.9 million data\npoints (section 5.1), a large and diverse collection of classi\ufb01cation problems from the UCI repository\n(section 5.2), and image classi\ufb01cation benchmarks (section 5.3). Empirical results demonstrate the\npractical signi\ufb01cance of our approach, which provides consistent improvements over stand-alone\nDNNs, while preserving a GP representation, and dramatic improvements in speed and accuracy over\nmodern state of the art GP models. We use classi\ufb01cation accuracy when comparing to DNNs, because\nit is a standard for evaluating classi\ufb01cation benchmarks with DNNs. However, we also compute the\nnegative log probability (NLP) values (supplement), which show similar trends.\nAll experiments were performed on a Linux machine with eight 4.0GHz CPU cores, one Tesla K40c\nGPU, and 32GB RAM. We implemented deep neural networks with Caffe [13].\n\nModel Training For our deep kernel learning model, we used deep neural networks which produce\nC-dimensional top-level features. Here C is the number of classes. We place a Gaussian process on\neach dimension of these features. We used RBF base kernels. The additive GP layer is then followed\nby a linear mixing layer A \u2208 RC\u00d7C. We initialized A to be an identity matrix, and optimized in the\njoint learning procedure to recover cross-dimension correlations from data.\nWe \ufb01rst train a deep neural network using SGD with the softmax loss objective, and recti\ufb01ed linear\nactivation functions. 
After the neural network has been pre-trained, we \ufb01t an additive KISS-GP\nlayer, followed by a linear mixing layer, using the top-level features of the deep network as inputs.\nUsing this pre-training initialization, our joint SV-DKL model of section 3 is then trained through the\nstochastic variational method of section 4 which jointly optimizes all the hyperparameters \u03b3 of the\ndeep kernel (including all network weights), as well as the variational parameters, by backpropagating\nderivatives through the proposed marginal likelihood lower bound of the additive Gaussian process in\nsection 4. In all experiments, we use a relatively large mini-batch size (speci\ufb01ed according to the\nfull data size), enabled by the proposed structure exploiting variational inference procedures. We\nachieve good performance setting the number of samples T = 1 in Eq. 4 for expectation estimation\nin variational inference, which provides additional con\ufb01rmation for a similar observation in [14].\n\n5.1 Airline Delays\n\nWe \ufb01rst consider a large airline dataset consisting of \ufb02ight arrival and departure details for all\ncommercial \ufb02ights within the US in 2008. The approximately 5.9 million records contain extensive\ninformation about the \ufb02ights, including the delay in reaching the destination. 
Following [11], we\nconsider the task of predicting whether a flight was subject to delay based on 8 features (e.g., distance\nto be covered, day of the week, etc).\n\nClassification accuracy Table 1 reports the classification accuracy of 1) KLSP-GP [11], a recent\nscalable variational GP classifier as discussed in section 4; 2) stand-alone deep neural network (DNN);\n3) DNN+, a stand-alone DNN with an extra Q \u00d7 c fully-connected hidden layer with Q, c defined as\nin Figure 1; 4) DNN+GP which is a GP applied to a pre-trained DNN (with same architecture as in 2);\nand 5) our stochastic variational DKL method (SV-DKL) (same DNN architecture as in 2). For DNN,\nwe used a fully-connected architecture with layers d-1000-1000-500-50-c [footnote 2]. The DNN component of\nthe SV-DKL model has the exact same architecture. The SV-DKL joint training was conducted using\na large minibatch size of 50,000 to reduce the variance of the stochastic gradient. We can use such a\nlarge minibatch in each iteration (which is daunting for regular GP even as a whole dataset) due to the\nefficiency of our inference strategy within each mini-batch, leveraging structure exploiting algebra.\nFrom the table we see that SV-DKL outperforms both the alternative variational GP model (KLSP-GP)\nand the stand-alone deep network. DNN+GP outperforms stand-alone DNNs, showing the\nnon-parametric flexibility of kernel methods. By combining KISS-GP with DNNs as part of a joint\nSV-DKL procedure, we obtain better results than DNN and DNN+GP. Besides, both the plain DNN\nand SV-DKL notably improve on stand-alone GPs, indicating a superior capacity of deep architectures\nto learn representations from large but finite training sets, despite the asymptotic approximation\nproperties of Gaussian processes. By contrast, adding an extra hidden layer, as in DNN+, does not\nimprove performance.\nFigure 2(a) further studies how performance changes as data size increases. 
We observe that the\nproposed SV-DKL classi\ufb01er trained on 1/50 of the data already can reach a competitive accuracy as\ncompared to the KLSP-GP model trained on the full dataset. As the number of the training points\nincreases, the SV-DKL and DNN models continue to improve. This experiment demonstrates the\nvalue of expressive kernel functions on large data problems, which can effectively capture the extra\ninformation available as seeing more training instances. Furthermore, SV-DKL consistently provides\nbetter performance over the plain DNN, through its non-parametric \ufb02exibility.\n\nScalability We next measure the scalability of our variational DKL in terms of the number of\ninducing points m in each GP. Figure 2(c) shows the runtimes in seconds, as a function of m, for\nevaluating the objective and any relevant derivatives. We compare our structure exploiting variational\nmethod with the scalable variational inference in KLSP-GP, and the MCMC-based variational method\nin MC-GP [12]. We see that our inference approach is far more ef\ufb01cient than previous scalable\nalgorithms. Moreover, when the number of inducing points is not too large (e.g., m = 70), the added\ntime for SV-DKL over DNN is reasonable (e.g., 0.39s vs. 0.27s), especially considering the gains in\nperformance and expressive power. Figure 2(d) shows the runtime scaling of different variational\nmethods as m grows. We can see that the runtime of our approach increases only slowly in a wide\nrange of m (< 2, 000), greatly enhancing the scalability over the other methods. This empirically\nvalidates the improved time complexity of our new inference method as presented in section 4.\nWe next investigate the total training time of the models. Table 1, right panel, lists the time cost of\ntraining KLSP-GP, DNN, and SV-DKL; and Figure 2(b) shows how the training time of SV-DKL and\nDNN changes as more training data is presented. 
We see that on the full dataset SV-DKL, as a GP model,\nsaves over 60% time as compared to the modern state of the art KLSP-GP, while at the same time\nachieving over an 18% improvement in predictive accuracy. Generally, the training time of SV-DKL\nincreases slowly with growing data sizes, and has only modest additional overhead compared to\nstand-alone architectures, justified by improvements in performance, and the general benefits of a\nnon-parametric probabilistic representation. Moreover, the DNN was fully trained on a GPU, while\nin SV-DKL the base kernel hyperparameters and variational parameters were optimized on a CPU.\nSince most updates of the SV-DKL parameters are computed in matrix forms, we believe the already\nmodest time gap between SV-DKL and DNNs can be almost entirely closed by deploying the whole\nSV-DKL model on GPUs.\n\n5.2 UCI Classification Tasks\n\nThe second evaluation of our proposed algorithm (SV-DKL) is conducted on a number of commonly\nused UCI classification tasks of varying sizes and properties. Table 1 (supplement) lists the classification\naccuracy of SVM, DNN, DNN+ (a stand-alone DNN with an extra Q \u00d7 c fully-connected\nhidden layer with Q, c defined as in Figure 1), DNN+GP (a GP trained on the top level features of a\ntrained DNN without the extra hidden layer), and SV-DKL (same architecture as DNN).\n\n[Footnote 2] We obtained similar results with other DNN architectures (e.g., d-1000-1000-500-50-20-c).\n\nTable 1: Classification accuracy and training time on the airline delay dataset, with n data points, d input\ndimensions, and c classes. KLSP-GP is a stochastic variational GP classifier proposed in [11]. DNN+ is the\nDNN with an extra hidden layer. DNN+GP is a GP applied to the fixed pre-trained output layer of the DNN (without\nthe extra hidden layer). Following Hensman et al. 
[11], we selected a hold-out set of 100,000 points uniformly\nat random, and the results of DNN and SV-DKL are averaged over 5 runs \u00b1 one standard deviation. Since the\ncode of KLSP-GP is not publicly available we directly show the results from [11].\n\nDataset: Airline (n = 5,934,530; d = 8; c = 2)\nMethod | Accuracy | Total Training Time (h)\nKLSP-GP [11] | \u223c0.675 | \u223c11\nDNN | 0.773\u00b10.001 | 0.53\nDNN+ | 0.722\u00b10.002 | \u2014\nDNN+GP | 0.7746\u00b10.001 | \u2014\nSV-DKL | 0.781\u00b10.001 | 3.98\n\nFigure 2: (a) Classification accuracy vs. the number of training points (n). We tested the deep models, DNN\nand SV-DKL, by training on 1/50, 1/10, 1/3, and the full dataset, respectively. For comparison, the cyan diamond\nand black dashed line show the accuracy level of KLSP-GP trained on the full data. (b) Training time vs. n. The\ncyan diamond and black dashed line show the training time of KLSP-GP on the full data. (c) Runtime vs. the\nnumber of inducing points (m) on airline task, by applying different variational methods for deep kernel learning.\nThe minibatch size is fixed to 50,000. The runtime of the stand-alone DNN does not change as m varies. (d)\nThe scaling of runtime relative to the runtime of m = 70. The black dashed line indicates a slope of 1.\n\nThe plain DNN, which learns salient features effectively from raw data, gives notably higher accuracy\ncompared to an SVM, the most widely used kernel method for classification problems. We see that\nthe extra layer in DNN+ can sometimes harm performance. By contrast, non-parametric flexibility\nof DNN+GP consistently improves upon DNN. 
And SV-DKL, by training a DNN through a GP\nmarginal likelihood objective, consistently provides further enhancements (with particularly notable\nperformance on the Connect4 and Covtype datasets).\n\n5.3\n\nImage Classi\ufb01cation\n\nWe next evaluate the proposed scalable SV-DKL procedure for ef\ufb01ciently handling high-dimensional\nhighly-structured image data. We used a minibatch size of 5,000 for stochastic gradient training of\nSV-DKL. Table 2 compares SV-DKL with the most recent scalable GP classi\ufb01ers. Besides KLSP-GP,\nwe also collected the results of the MC-GP [12] which uses a hybrid Monte Carlo sampler to tackle\nnon-conjugate likelihoods, SAVI-GP [6] which approximates with a univariate Gaussian sampler,\nas well as the distributed GP latent variable model (denoted as D-GPLVM) [8]. We see that on the\nrespective benchmark tasks, SV-DKL improves over all of the above scalable GP methods by a large\nmargin. We note that these datasets are very challenging for conventional GP methods.\nWe further compare SV-DKL to stand-alone convolutional neural networks, and GPs applied to\n\ufb01xed pre-trained CNNs (CNN+GP). On the \ufb01rst three datasets in Table 2, we used the reference\nCNN models implemented in Caffe; and for the SVHN dataset, as no benchmark architecture is\navailable, we used the CIFAR10 architecture which turned out to perform quite well. As we can see,\nthe SV-DKL model outperforms CNNs and CNN+GP on all datasets. 
By contrast, the extra\nQ \u00d7 c hidden layer in CNN+ does not consistently improve performance over CNN.\nResNet Comparison: Based on one of the best public implementations on Caffe, ResNet-20 achieves\n0.901 accuracy on CIFAR10, and SV-DKL (with this ResNet base architecture) improves to 0.910.\nImageNet: We randomly selected 20 categories of images and used an AlexNet variant [15] as the base NN,\nwhich has an accuracy of 0.6877, while SV-DKL achieves 0.7067 accuracy.\n\nTable 2: Classification accuracy on the image classification benchmarks. MNIST-Binary is the task of\ndifferentiating between odd and even digits on the MNIST dataset. We followed the standard training-test set\npartitioning of all these datasets. We have collected recently published results of a variety of scalable GPs.\nFor CNNs, we used the respective benchmark architectures (or with slight adaptations) from Caffe. CNN+ is\na stand-alone CNN with an extra Q \u00d7 c fully connected hidden layer. See the text for more details, including a\ncomparison with ResNets on CIFAR10. The method columns (MC-GP through SV-DKL) report accuracy.\n\nDatasets      n    d        c   MC-GP [12]  SAVI-GP [6]  D-GPLVM [8]  KLSP-GP [11]  CNN     CNN+    CNN+GP  SV-DKL\nMNIST-Binary  60K  28\u00d728    2   \u2014           \u2014            \u2014            0.978         0.9934  0.8838  0.9938  0.9940\nMNIST         60K  28\u00d728    10  0.9804      0.9749       0.9405       \u2014             0.9908  0.9909  0.9915  0.9920\nCIFAR10       50K  3\u00d732\u00d732  10  \u2014           \u2014            \u2014            \u2014             0.7592  0.7618  0.7633  0.7704\nSVHN          73K  3\u00d732\u00d732  10  \u2014           \u2014            \u2014            \u2014             0.9214  0.9193  0.9221  0.9228\n\n5.3.1 Interpretation\n\nIn Figure 3(a) we investigate the deep kernels learned on the MNIST dataset by randomly selecting 4\nclasses and visualizing the covariance matrices of the respective dimensions. The covariance matrices are\nevaluated on the set of test inputs, sorted in terms of the labels of the input images. We see that the\ndeep kernel on each dimension effectively discovers the correlations between the images within the\ncorresponding class. For instance, for c = 2 the data points between 2k-3k (i.e., images of digit 2) are\nstrongly correlated with each other, and carry little correlation with the rest of the images. We can\nalso clearly observe that the remaining data points form multiple \u201cblocks\u201d, rather than\nbeing crammed together without any structure. This validates that the DKL procedure and additive\nGPs do capture the correlations across different dimensions.\n\nFigure 3: (a) The induced covariance matrices on classes 2, 3, 6, and 8, on test cases of the MNIST dataset\nordered according to the labels. (b) The final mixing layer (i.e., matrix A) on MNIST digit recognition.\n\nTo further explore the learnt dependencies between the output classes and the additive GPs serving\nas the bases, we visualized in Fig. 3(b) the weights of the mixing layer (A), which enables the correlated\nmulti-output (multi-task) nature of the model. Besides the expected high weights along the diagonal,\nwe find that class 9 is also highly correlated with dimensions 0 and 6, which is consistent with the\nvisual similarity between the digit \u201c9\u201d and the digits \u201c0\u201d/\u201c6\u201d. 
Overall, the ability to interpret the learned deep\ncovariance matrix as discovering an expressive similarity metric across data instances is a distinctive\nfeature of our approach.\n\n6 Discussion\n\nWe introduced a scalable Gaussian process model which leverages deep learning, stochastic variational\ninference, structure-exploiting algebra, and additive covariance structures. The resulting deep kernel\nlearning approach, SV-DKL, allows for classification and non-Gaussian likelihoods, multi-task\nlearning, and mini-batch training. SV-DKL achieves superior performance over alternative scalable\nGP models and stand-alone deep networks on many significant benchmarks.\nSeveral fundamental themes emerge from the exposition: (1) kernel methods and deep learning\napproaches are complementary, and we can combine the advantages of each approach; (2) expressive\nkernel functions are particularly valuable on large datasets; (3) by viewing neural networks through\nthe lens of metric learning, deep learning approaches become more interpretable.\nDeep learning is able to obtain good predictive accuracy by automatically learning structure which\nwould be difficult to engineer into a model a priori. In the future, we hope deep kernel\nlearning approaches will be particularly helpful for interpreting these learned features, leading to new\nscientific insights into our modelling problems.\n\nAcknowledgements: We thank NSF IIS-1563887, ONR N000141410684, N000141310721,\nN000141512791, and ADeLAIDE FA8750-16C-0130-001 grants.\n\nReferences\n\n[1] C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high\ndimensional space. Springer, 2001.\n\n[2] T. D. Bui and R. E. Turner. 
Stochastic variational inference for Gaussian process latent variable models\nusing back constraints. Black Box Learning and Inference NIPS workshop, 2015.\n\n[3] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian processes for regression.\narXiv preprint arXiv:1402.5876, 2014.\n\n[4] Z. Dai, A. Damianou, J. Gonz\u00e1lez, and N. Lawrence. Variational auto-encoded deep Gaussian processes.\narXiv preprint arXiv:1511.06455, 2015.\n\n[5] A. Damianou and N. Lawrence. Deep Gaussian processes. Artificial Intelligence and Statistics, 2013.\n\n[6] A. Dezfouli and E. V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods.\nIn Advances in Neural Information Processing Systems, pages 1414\u20131422, 2015.\n\n[7] N. Durrande, D. Ginsbourger, and O. Roustant. Additive kernels for Gaussian process modeling. arXiv\npreprint arXiv:1103.4023, 2011.\n\n[8] Y. Gal, M. van der Wilk, and C. Rasmussen. Distributed variational inference in sparse Gaussian process\nregression and latent variable models. In Advances in Neural Information Processing Systems, pages\n3257\u20133265, 2014.\n\n[9] M. G\u00f6nen and E. Alpayd\u0131n. Multiple kernel learning algorithms. Journal of Machine Learning Research,\n12:2211\u20132268, 2011.\n\n[10] J. Hensman, N. Fusi, and N. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial\nIntelligence (UAI). AUAI Press, 2013.\n\n[11] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In\nProceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages\n351\u2013360, 2015.\n\n[12] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani. MCMC for variationally sparse Gaussian\nprocesses. In Advances in Neural Information Processing Systems, pages 1648\u20131656, 2015.\n\n[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. 
Darrell. Caffe:\nConvolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[14] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.\n\n[15] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural\nnetworks. In Advances in Neural Information Processing Systems, 2012.\n\n[16] Q. Le, T. Sarlos, and A. Smola. Fastfood - computing Hilbert space expansions in loglinear time. In\nProceedings of the 30th International Conference on Machine Learning, pages 244\u2013252, 2013.\n\n[17] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic construction and\nnatural-language description of nonparametric regression models. In Association for the Advancement of\nArtificial Intelligence (AAAI), 2014.\n\n[18] T. Nickson, T. Gunter, C. Lloyd, M. A. Osborne, and S. Roberts. Blitzkriging: Kronecker-structured\nstochastic Gaussian processes. arXiv preprint arXiv:1510.07965, 2015.\n\n[19] J. Qui\u00f1onero-Candela and C. Rasmussen. A unifying view of sparse approximate Gaussian process\nregression. The Journal of Machine Learning Research, 6:1939\u20131959, 2005.\n\n[20] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Neural Information\nProcessing Systems, 2007.\n\n[21] C. E. Rasmussen and Z. Ghahramani. Occam\u2019s razor. In Neural Information Processing Systems (NIPS),\n2001.\n\n[22] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.\n\n[23] B. W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve\nfitting. Journal of the Royal Statistical Society B, 47(1):1\u201352, 1985.\n\n[24] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International\nConference on Artificial Intelligence and Statistics, pages 567\u2013574, 2009.\n\n[25] A. G. 
Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian\nprocesses. PhD thesis, University of Cambridge, 2014.\n\n[26] A. G. Wilson and R. P. Adams. Gaussian process kernels for pattern discovery and extrapolation.\nInternational Conference on Machine Learning (ICML), 2013.\n\n[27] A. G. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP).\nInternational Conference on Machine Learning (ICML), 2015.\n\n[28] A. G. Wilson, D. A. Knowles, and Z. Ghahramani. Gaussian process regression networks. In International\nConference on Machine Learning (ICML), Edinburgh, 2012.\n\n[29] A. G. Wilson, C. Dann, and H. Nickisch. Thoughts on massively scalable Gaussian processes. arXiv\npreprint arXiv:1511.01870, 2015. https://arxiv.org/abs/1511.01870.\n\n[30] A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing. Deep kernel learning. Artificial Intelligence and\nStatistics, 2016.\n\n[31] Z. Yang, A. J. Smola, L. Song, and A. G. Wilson. A la carte - learning fast kernels. Artificial Intelligence\nand Statistics, 2015.", "award": [], "sourceid": 1349, "authors": [{"given_name": "Andrew", "family_name": "Wilson", "institution": "Carnegie Mellon University"}, {"given_name": "Zhiting", "family_name": "Hu", "institution": "Carnegie Mellon University"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "University of Toronto"}, {"given_name": "Eric", "family_name": "Xing", "institution": "Carnegie Mellon University"}]}