{"title": "Scalable Levy Process Priors for Spectral Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3940, "page_last": 3949, "abstract": "Gaussian processes are rich distributions over functions, with generalization properties determined by a kernel function. When used for long-range extrapolation, predictions are particularly sensitive to the choice of kernel parameters. It is therefore critical to account for kernel uncertainty in our predictive distributions. We propose a distribution over kernels formed by modelling a spectral mixture density with a Levy process. The resulting distribution has support for all stationary covariances---including the popular RBF, periodic, and Matern kernels---combined with inductive biases which enable automatic and data efficient learning, long-range extrapolation, and state of the art predictive performance. The proposed model also presents an approach to spectral regularization, as the Levy process introduces a sparsity-inducing prior over mixture components, allowing automatic selection over model order and pruning of extraneous components. We exploit the algebraic structure of the proposed process for O(n) training and O(1) predictions. We perform extrapolations having reasonable uncertainty estimates on several benchmarks, show that the proposed model can recover flexible ground truth covariances and that it is robust to errors in initialization.", "full_text": "Scalable L\u00b4evy Process Priors for Spectral Kernel\n\nLearning\n\nPhillip A. Jang Andrew E. Loeb Matthew B. Davidow Andrew Gordon Wilson\n\nCornell University\n\nAbstract\n\nGaussian processes are rich distributions over functions, with generalization prop-\nerties determined by a kernel function. 
When used for long-range extrapolation,\npredictions are particularly sensitive to the choice of kernel parameters.\nIt is\ntherefore critical to account for kernel uncertainty in our predictive distributions.\nWe propose a distribution over kernels formed by modelling a spectral mixture\ndensity with a L\u00b4evy process. The resulting distribution has support for all sta-\ntionary covariances\u2014including the popular RBF, periodic, and Mat\u00b4ern kernels\u2014\ncombined with inductive biases which enable automatic and data ef\ufb01cient learn-\ning, long-range extrapolation, and state of the art predictive performance. The\nproposed model also presents an approach to spectral regularization, as the L\u00b4evy\nprocess introduces a sparsity-inducing prior over mixture components, allowing\nautomatic selection over model order and pruning of extraneous components. We\nexploit the algebraic structure of the proposed process for O(n) training and O(1)\npredictions. We perform extrapolations having reasonable uncertainty estimates\non several benchmarks, show that the proposed model can recover \ufb02exible ground\ntruth covariances and that it is robust to errors in initialization.\n\nIntroduction\n\n1\nGaussian processes (GPs) naturally give rise to a function space view of modelling, whereby we\nplace a prior distribution over functions, and reason about the properties of likely functions under\nthis prior (Rasmussen & Williams, 2006). Given data, we then infer a posterior distribution over\nfunctions to make predictions. The generalisation behavior of the Gaussian process is determined\nby its prior support (which functions are a priori possible) and its inductive biases (which functions\nare a priori likely), which are in turn encoded by a kernel function. 
However, popular kernels,\nand even multiple kernel learning procedures, typically cannot extract highly expressive hidden\nrepresentations, as was envisaged for neural networks (MacKay, 1998; Wilson, 2014).\nTo discover such representations, recent approaches have advocated building more expressive ker-\nnel functions. For instance, spectral mixture kernels (Wilson & Adams, 2013b) were introduced for\n\ufb02exible kernel learning and extrapolation, by modelling a spectral density with a scale-location mix-\nture of Gaussians, with promising results. However, Wilson & Adams (2013b) specify the number of\nmixture components by hand, and do not characterize uncertainty over the mixture hyperparameters.\nAs kernel functions become increasingly expressive and parametrized, it becomes natural to also\nadopt a function space view of kernel learning\u2014to represent uncertainty over the values of the\nkernel function, and to re\ufb02ect the belief that the kernel does not have a simple form. Just as we\nuse Gaussian processes over functions to model data, we can apply the function space view a step\nfurther in a hierarchical model\u2014with a prior distribution over kernels.\nIn this paper, we introduce a scalable distribution over kernels by modelling a spectral density, the\nFourier transform of a kernel, with a L\u00b4evy process. We consider both scale-location mixtures of\nGaussians and Laplacians as basis functions for the L\u00b4evy process, to induce a prior over kernels that\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fgives rise to the sharply peaked spectral densities that often occur in practice\u2014providing a powerful\ninductive bias for kernel learning. Moreover, this choice of basis functions allows our kernel func-\ntion, conditioned on the L\u00b4evy process, to be expressed in closed form. 
This prior distribution over\nkernels also has support for all stationary covariances\u2014containing, for instance, any composition\nof the popular RBF, Mat\u00b4ern, rational quadratic, gamma-exponential, or spectral mixture kernels.\nAnd unlike the spectral mixture representation in Wilson & Adams (2013b), this proposed process\nprior allows for natural automatic inference over the number of mixture components in the spectral\ndensity model. Moreover, the priors implied by popular L\u00b4evy processes such as the gamma process\nand symmetric \u03b1-stable process result in even stronger complexity penalties than (cid:96)1 regularization,\nyielding sparse representations and removing mixture components which \ufb01t to noise.\nConditioned on this distribution over kernels, we model data with a Gaussian process. To form a\npredictive distribution, we take a Bayesian model average of GP predictive distributions over a large\nset of possible kernel functions, represented by the support of our prior over kernels, weighted by\nthe posterior probabilities of each of these kernels. This procedure leads to a non-Gaussian heavy-\ntailed predictive distribution for modelling data. We develop a reversible jump MCMC (RJ-MCMC)\nscheme (Green, 1995) to infer the posterior distribution over kernels, including inference over the\nnumber of components in the L\u00b4evy process expansion. For scalability, we pursue a structured kernel\ninterpolation (Wilson & Nickisch, 2015) approach, in our case exploiting algebraic structure in the\nL\u00b4evy process expansion, for O(n) inference and O(1) predictions, compared to the standard O(n3)\nand O(n2) computations for inference and predictions with Gaussian processes. Flexible distri-\nbutions over kernels will be especially valuable on large datasets, which often contain additional\nstructure to learn rich statistical representations.\nThe key contributions of this paper are summarized as follows:\n\n1. 
The \ufb01rst fully probabilistic approach to inference with spectral mixture kernels \u2014 to incor-\nporate kernel uncertainty into our predictive distributions, for a more realistic coverage of\nextrapolations. This feature is demonstrated in Section 5.3.\n\n2. Spectral regularization in spectral kernel learning. The L\u00b4evy process prior acts as a sparsity-\ninducing prior on mixture components, automatically pruning extraneous components.\nThis feature allows for automatic inference over model order, a key hyperparameter which\nmust be hand tuned in the original spectral mixture kernel paper.\n\n3. Reduced dependence on a good initialization, a key practical improvement over the original\n\nspectral mixture kernel paper.\n\n4. A conceptually natural and interpretable function space view of kernel learning.\n\n2 Background\nWe provide a review of Gaussian and L\u00b4evy processes as models for prior distributions over functions.\n2.1 Gaussian Processes\nA stochastic process f (x) is a Gaussian process (GP) if for any \ufb01nite collection of inputs X =\n{x1,\u00b7\u00b7\u00b7 , xn} \u2282 RD, the vector of function values [f (x1),\u00b7\u00b7\u00b7 , f (xn)]T is jointly Gaussian.\nThe distribution of a GP is completely determined by its mean function m(x), and covariance\nkernel k(x, x(cid:48)). A GP used to specify a distribution over functions is denoted as f (x) \u223c\nGP(m(x), k(x, x(cid:48))), where E[f (xi)] = m(xi) and cov(f (x), f (x(cid:48))) = k(x, x(cid:48)). The general-\nization properties of the GP are encoded by the covariance kernel and its hyperparameters.\nBy exploiting properties of joint Gaussian variables, we can obtain closed form expressions for\nconditional mean and covariance functions of unobserved function values given observed function\nvalues. 
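As a minimal numerical sketch of this closed-form conditioning (our own illustration, using an RBF kernel; all variable names are hypothetical):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0):
    """RBF kernel k(x, x') = exp(-0.5 ||x - x'||^2 / l^2) for 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, Xs, lengthscale=1.0, noise=1e-6):
    """Zero-mean GP conditioning: predictive mean and covariance at Xs."""
    K = rbf_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    Ks = rbf_kernel(Xs, X, lengthscale)      # covariances between test and train
    Kss = rbf_kernel(Xs, Xs, lengthscale)    # covariances among test points
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    mean = Ks @ alpha
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V                      # K** - K*X K^{-1} KX*
    return mean, cov

X = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * X)
mean, cov = gp_predict(X, y, np.array([0.5]), lengthscale=0.2)
```

A nonzero prior mean simply adds the m terms from the text to the predictive mean.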
Given that f (x) is observed at n training inputs X with values f = [f (x1), \u00b7\u00b7\u00b7 , f (xn)]T, the predictive distribution of the unobserved function values f\u2217 at n\u2217 testing inputs X\u2217 is given by\n\nf\u2217 | X\u2217, X, \u03b8 \u223c N(\u00aff\u2217, cov(f\u2217)), (1)\n\u00aff\u2217 = mX\u2217 + KX\u2217,X KX,X\u207b\u00b9 (f \u2212 mX), (2)\ncov(f\u2217) = KX\u2217,X\u2217 \u2212 KX\u2217,X KX,X\u207b\u00b9 KX,X\u2217, (3)\n\nwhere KX\u2217,X, for example, denotes the n\u2217 \u00d7 n matrix of covariances evaluated at X\u2217 and X.\n\nThe popular radial basis function (RBF) kernel has the following form:\n\nkRBF(x, x\u2032) = exp(\u22120.5 \u2016x \u2212 x\u2032\u2016\u00b2 / \u2113\u00b2). (4)\n\nGPs with RBF kernels are limited in their expressiveness and act primarily as smoothing interpolators, because the only covariance structure they can learn from data is the length scale \u2113, which determines how quickly covariance decays with distance.\nWilson & Adams (2013b) introduce the more expressive spectral mixture (SM) kernel, capable of extracting more complex covariance structures than the RBF kernel, formed by placing a scale-location mixture of Gaussians in the spectrum of the covariance kernel. The RBF kernel, in comparison, can only model a single Gaussian centered at the origin in frequency (spectral) space.\n2.2 L\u00e9vy Processes\nA stochastic process {L(\u03c9)}\u03c9\u2208R+ is a L\u00e9vy process if it has stationary, independent increments and it is continuous in probability. In other words, L must satisfy\n\n1. L(0) = 0,\n2. L(\u03c90), L(\u03c91) \u2212 L(\u03c90), \u00b7\u00b7\u00b7 , L(\u03c9n) \u2212 L(\u03c9n\u22121) are independent \u2200\u03c90 \u2264 \u03c91 \u2264 \u00b7\u00b7\u00b7 \u2264 \u03c9n,\n3. 
L(\u03c92) \u2212 L(\u03c91) =d L(\u03c92 \u2212 \u03c91) (equality in distribution) \u2200\u03c92 \u2265 \u03c91,\n4. lim h\u21920 P(|L(\u03c9 + h) \u2212 L(\u03c9)| \u2265 \u03b5) = 0 \u2200\u03b5 > 0, \u2200\u03c9 \u2265 0.\n\nBy the L\u00e9vy-Khintchine representation, the distribution of a (pure jump) L\u00e9vy process is completely determined by its L\u00e9vy measure. That is, the characteristic function of L(\u03c9) is given by\n\nlog E[e^{iuL(\u03c9)}] = \u03c9 \u222b_{Rd\\{0}} (e^{iu\u00b7\u03b2} \u2212 1 \u2212 iu\u00b7\u03b2 1_{|\u03b2|\u22641}) \u03bd(d\u03b2),\n\nwhere the L\u00e9vy measure \u03bd(d\u03b2) is any \u03c3-finite measure which satisfies the integrability condition\n\n\u222b_{Rd\\{0}} (1 \u2227 \u03b2\u00b2) \u03bd(d\u03b2) < \u221e.\n\nFigure 1: Annotated realization of a compound Poisson process, a special case of a L\u00e9vy process. The \u03c9j represent jump locations, and the \u03b2j represent jump magnitudes.\n\nA L\u00e9vy process can be viewed as a combination of a Brownian motion with drift and a superposition of independent Poisson processes with differing jump sizes \u03b2. The L\u00e9vy measure \u03bd(d\u03b2) determines the expected number of Poisson events per unit of time for any particular jump size \u03b2. The Brownian component of a L\u00e9vy process will not be considered for this model. For higher-dimensional input spaces \u03c9 \u2208 \u2126, one defines the more general notion of a L\u00e9vy random measure, which is also characterized by its L\u00e9vy measure \u03bd(d\u03b2 d\u03c9) (Wolpert et al., 2011). 
We will show that sample realizations of L\u00e9vy processes can be used to draw sample parameters for adaptive basis expansions.\n2.3 L\u00e9vy Process Priors over Adaptive Expansions\nSuppose we wish to specify a prior over the class of adaptive expansions\n\n{ f : X \u2192 R | f(x) = \u2211_{j=1}^{J} \u03b2j \u03c6(x, \u03c9j) }.\n\nThrough a simple manipulation, we can rewrite f(x) in the form of a stochastic integral:\n\nf(x) = \u2211_{j=1}^{J} \u03b2j \u03c6(x, \u03c9j) = \u2211_{j=1}^{J} \u03b2j \u222b_\u2126 \u03c6(x, \u03c9) \u03b4\u03c9j(\u03c9) d\u03c9 = \u222b_\u2126 \u03c6(x, \u03c9) [\u2211_{j=1}^{J} \u03b2j \u03b4\u03c9j(\u03c9)] d\u03c9,\n\nwhere the bracketed term is dL(\u03c9). Hence, by specifying a prior for the measure L(\u03c9), we can simultaneously specify a prior for all of the parameters {J, (\u03b21, \u03c91), ..., (\u03b2J, \u03c9J)} of the expansion. L\u00e9vy random measures provide a family of priors naturally suited for this purpose, as there is a one-to-one correspondence between the jump behavior of the L\u00e9vy prior and the components of the expansion.\nTo illustrate this point, suppose the basis function parameters \u03c9j are one-dimensional and consider the integral of dL(\u03c9) from 0 to \u03c9:\n\nL(\u03c9) = \u222b_0^\u03c9 dL(\u03be) = \u222b_0^\u03c9 \u2211_{j=1}^{J} \u03b2j \u03b4\u03c9j(\u03be) d\u03be = \u2211_{j=1}^{J} \u03b2j 1_{[0,\u03c9]}(\u03c9j).\n\nWe see in Figure 1 that \u2211_{j=1}^{J} \u03b2j 1_{[0,\u03c9]}(\u03c9j) resembles the sample path of a compound Poisson process, with the number of jumps J, jump sizes \u03b2j, and jump locations \u03c9j corresponding to the number of basis functions, basis function weights, and basis function parameters, respectively. 
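The compound Poisson construction above can be sketched directly; the rate and jump-size distribution below are illustrative stand-ins, not the paper's choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Compound Poisson realization: J ~ Poisson(rate * T), jump locations
# omega_j ~ Uniform(0, T), jump sizes beta_j from an assumed distribution.
T, rate = 10.0, 0.5
J = rng.poisson(rate * T)               # number of jumps / basis functions
omegas = np.sort(rng.uniform(0, T, J))  # jump locations omega_j
betas = rng.normal(0, 5, J)             # jump sizes beta_j (stand-in law)

def L(omega):
    """L(omega) = sum_j beta_j * 1_{[0, omega]}(omega_j)."""
    return betas[omegas <= omega].sum()

path = np.array([L(w) for w in np.linspace(0, T, 200)])
```

Each jump of the path corresponds to one basis function in the expansion, with its height giving the weight.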
We can use a compound Poisson process to define a prior over all such piecewise constant paths. More generally, we can use a L\u00e9vy process to define a prior for L(\u03c9).\nThrough the L\u00e9vy-Khintchine representation, the jump behavior of the prior is characterized by a L\u00e9vy measure \u03bd(d\u03b2 d\u03c9), which controls the mean number of Poisson events in every region of the parameter space, encoding the inductive biases of the model. As the number of parameters in this framework is random, we use a form of trans-dimensional reversible jump Markov chain Monte Carlo (RJ-MCMC) to sample the parameter space (Green, 2003).\nPopular L\u00e9vy processes such as the gamma process, symmetric gamma process, and the symmetric \u03b1-stable process each possess desirable properties for different situations. The gamma process is able to produce strictly positive gamma-distributed \u03b2j without transforming the output space. The symmetric gamma process can produce both positive and negative \u03b2j, and according to Wolpert et al. (2011) can achieve nearly all the commonly used isotropic geostatistical covariance functions. The symmetric \u03b1-stable process can produce heavy-tailed distributions for \u03b2j and is appropriate when one might expect the basis expansion to be dominated by a few heavily weighted functions.\nWhile one could dispense with L\u00e9vy processes and place Gaussian or Laplace priors on \u03b2j to obtain \u21132 or \u21131 regularization on the expansions, respectively, a key benefit particular to these L\u00e9vy process priors is that the implied priors on the coefficients yield even stronger complexity penalties than \u21131 regularization. This property encourages sparsity in the expansions and permits scalability of our MCMC algorithm. 
Refer to the supplementary material for an illustration of the joint priors on coefficients, which exhibit concave contours in contrast to the convex elliptical and diamond contours of \u21132 and \u21131 regularization. Furthermore, in the log posterior for the L\u00e9vy process there is a log(J!) complexity penalty term which further encourages sparsity in the expansions. Refer to Clyde & Wolpert (2007) for further details.\n3 L\u00e9vy Distributions over Kernels\nIn this section, we motivate our choice of prior over kernel functions and describe how to generate samples from this prior distribution in practice.\n3.1 L\u00e9vy Kernel Processes\nBy Bochner\u2019s Theorem (1959), a continuous stationary kernel can be represented as the Fourier dual of a spectral density:\n\nk(\u03c4) = \u222b_{RD} S(s) e^{2\u03c0i s\u22a4\u03c4} ds,  S(s) = \u222b_{RD} k(\u03c4) e^{\u22122\u03c0i s\u22a4\u03c4} d\u03c4. (5)\n\nHence, the spectral density entirely characterizes a stationary kernel. Therefore, it can be desirable to model the spectrum rather than the kernel, since we can then view kernel estimation through the lens of density estimation. 
In order to emulate the sharp peaks that characterize frequency spectra of natural phenomena, we model the spectral density with a location-scale mixture of Laplacian components:\n\n\u03c6L(s, \u03c9j) = (\u03bbj/2) e^{\u2212\u03bbj|s\u2212\u03c7j|},  \u03c9j \u2261 (\u03c7j, \u03bbj) \u2208 [0, fmax] \u00d7 R+. (6)\n\nThen the full specification of the symmetric spectral mixture is\n\nS(s) = (1/2)[\u02dcS(s) + \u02dcS(\u2212s)],  \u02dcS(s) = \u2211_{j=1}^{J} \u03b2j \u03c6L(s, \u03c9j). (7)\n\nAs Laplacian spikes have a closed-form inverse Fourier transform, the spectral density S(s) represents the following kernel function:\n\nk(\u03c4) = \u2211_{j=1}^{J} \u03b2j (\u03bbj\u00b2 / (\u03bbj\u00b2 + 4\u03c0\u00b2\u03c4\u00b2)) cos(2\u03c0\u03c7j\u03c4). (8)\n\nThe parameters J, \u03b2j, \u03c7j, \u03bbj can be interpreted through Eq. (8). The total number of terms in the mixture is J, while \u03b2j is the scale of the jth frequency contribution, \u03c7j is its central frequency, and \u03bbj governs how rapidly the term decays (a high \u03bbj results in confident, long-term periodic extrapolation).\nOther basis functions can be used in place of \u03c6L to model the spectrum as well. For example, if a Gaussian mixture is chosen, along with maximum likelihood estimation for the learning procedure, then we obtain the spectral mixture kernel (Wilson & Adams, 2013b).\nAs the spectral density S(s) takes the form of an adaptive expansion, we can define a L\u00e9vy prior over all such densities, and hence over all corresponding kernels of the above form. For a chosen basis function \u03c6(s, \u03c9) and L\u00e9vy measure \u03bd(d\u03b2 d\u03c9), we say that k(\u03c4) is drawn from a L\u00e9vy kernel process (LKP), denoted k(\u03c4) \u223c LKP(\u03c6, \u03bd). Wolpert et al. (2011) discuss the necessary regularity conditions for \u03c6 and \u03bd. 
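Eq. (8) is straightforward to evaluate once the mixture parameters are given; a minimal sketch (parameter values are illustrative, not from the paper's experiments):

```python
import numpy as np

def laplace_mixture_kernel(tau, betas, chis, lambdas):
    """k(tau) = sum_j beta_j * lambda_j^2 / (lambda_j^2 + 4 pi^2 tau^2)
    * cos(2 pi chi_j tau): the closed-form inverse Fourier transform of a
    symmetrized Laplacian spectral mixture (Eq. 8)."""
    tau = np.asarray(tau, dtype=float)[..., None]
    terms = (betas * lambdas**2 / (lambdas**2 + 4 * np.pi**2 * tau**2)
             * np.cos(2 * np.pi * chis * tau))
    return terms.sum(axis=-1)

# Illustrative parameters:
betas = np.array([1.0, 0.5])     # scales beta_j
chis = np.array([0.1, 0.35])     # central frequencies chi_j
lambdas = np.array([2.0, 5.0])   # decay rates lambda_j

k0 = laplace_mixture_kernel(0.0, betas, chis, lambdas)  # k(0) = sum_j beta_j
```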
In summary, we propose the following hierarchical model over functions:\n\nf(x) | k(\u03c4) \u223c GP(0, k(\u03c4)),  \u03c4 = x \u2212 x\u2032,  k(\u03c4) \u223c LKP(\u03c6, \u03bd). (9)\n\nFigure 2 shows three samples from the L\u00e9vy process specified through Eq. (7) and their corresponding covariance kernels. We also show one GP realization for each of the kernel functions. By placing a L\u00e9vy process prior over spectral densities, we induce a L\u00e9vy kernel process prior over stationary covariance functions.\n3.2 Sampling L\u00e9vy Priors\nWe now discuss how to generate samples from the L\u00e9vy kernel process in practice. In short, the kernel parameters are drawn according to {J, {(\u03b2j, \u03c9j)}_{j=1}^{J}} \u223c L\u00e9vy(\u03bd(d\u03b2 d\u03c9)), and then Eq. (8) is used to evaluate k \u223c LKP(\u03c6L, \u03bd) at values of \u03c4.\nRecall from Section 2.3 that the choice of L\u00e9vy measure \u03bd is completely determined by the choice of the corresponding L\u00e9vy process, and vice versa. Though the processes mentioned there produce sample paths with infinitely many jumps (and so cannot be sampled directly), almost all jumps are infinitesimally small, and therefore these processes can be approximated in L2 by a compound Poisson process with a jump size distribution truncated by \u03b5.\nOnce the desired L\u00e9vy process is chosen and the truncation bound is set, the basis expansion parameters are generated by drawing J \u223c Poisson(\u03bd\u03b5+), then drawing J i.i.d. samples \u03b21, \u00b7\u00b7\u00b7 , \u03b2J \u223c \u03c0\u03b2(d\u03b2) and J i.i.d. samples \u03c91, \u00b7\u00b7\u00b7 , \u03c9J \u223c \u03c0\u03c9(d\u03c9). 
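A sketch of this sampling procedure, substituting a simple Gamma jump-size law for the exact truncated L\u00e9vy densities given in the supplement (all rates and shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw one kernel from the LKP prior via the compound Poisson approximation.
nu_plus, f_max = 5.0, 0.5
J = rng.poisson(nu_plus)                # number of mixture components
betas = rng.gamma(2.0, 1.0, J)          # weights beta_j (stand-in for pi_beta)
chis = rng.uniform(0.0, f_max, J)       # frequencies chi_j ~ Uniform(0, f_max)
lambdas = rng.gamma(2.0, 2.0, J)        # scales lambda_j ~ Gamma(a, b)

def k(tau):
    """Evaluate the sampled kernel via Eq. (8)."""
    tau = np.atleast_1d(np.asarray(tau, float))[:, None]
    t = (betas * lambdas**2 / (lambdas**2 + 4 * np.pi**2 * tau**2)
         * np.cos(2 * np.pi * chis * tau))
    return t.sum(axis=1)

sample_kernel = k(np.linspace(0, 10, 101))
```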
Refer to the supplementary material for L2 error bounds and formulas for \u03bd\u03b5+ = \u03bd\u03b5(R \u00d7 \u2126) for the gamma, symmetric gamma, and symmetric \u03b1-stable processes.\nThe form of \u03c0\u03b2(\u03b2j) also depends on the choice of L\u00e9vy process and can be found in the supplementary material, with further details in Wolpert et al. (2011). We choose to draw \u03c7 from an uninformative uniform prior over a reasonable range in the frequency domain, and \u03bb from a gamma distribution, \u03bb \u223c Gamma(a\u03bb, b\u03bb). The choices for a\u03bb, b\u03bb, and the frequency limits are left as hyperparameters, which can have their own hyperprior distributions.\n\nFigure 2: Samples from a L\u00e9vy kernel mixture prior distribution. (top) Three spectra with Laplace components drawn from a L\u00e9vy process prior. (middle) The corresponding stationary covariance kernel functions and the prior mean with two standard deviations of the model, as determined by 10,000 samples. (bottom) GP samples with the respective covariance kernel functions.\n\nAfter drawing the 3J values that specify a L\u00e9vy process realization, the corresponding covariance function can be evaluated through the analytical expression for the inverse Fourier transform (e.g., Eq. (8) for Laplacian frequency mixture components).\n4 Scalable Inference\nGiven observed data D = {xi, yi}_{i=1}^{N}, we wish to infer p(y(x\u2217)|D, x\u2217) over some test set of inputs x\u2217 for interpolation and extrapolation. 
We model observations y(x) with a hierarchical model:\n\ny(x) | f(x) = f(x) + \u03b5(x),  \u03b5(x) iid\u223c N(0, \u03c3\u00b2), (10)\nf(x) | k(\u03c4) \u223c GP(0, k(\u03c4)),  \u03c4 = x \u2212 x\u2032, (11)\nk(\u03c4) \u223c LKP(\u03c6, \u03bd). (12)\n\nComputing the posterior distribution by marginalizing over the LKP will yield a heavy-tailed non-Gaussian process for y(x\u2217) = y\u2217, given by an infinite Gaussian mixture model:\n\np(y\u2217|D) = \u222b p(y\u2217|k, D) p(k|D) dk \u2248 (1/H) \u2211_{h=1}^{H} p(y\u2217|kh),  kh \u223c p(k|D). (13)\n\nWe compute this approximating sum using H RJ-MCMC samples (Green, 2003). Each sample draws a kernel kh from the posterior distribution p(k|D), which in turn enables us to draw a sample from the posterior predictive distribution p(y\u2217|D), from which we can estimate the predictive mean and variance.\nAlthough we have chosen a Gaussian observation model in Eq. (10) (conditioned on f(x)), all of the inference procedures we have introduced here would also apply to non-Gaussian likelihoods, such as for Poisson processes with Gaussian process intensity functions, or for classification.\nThe sum in Eq. (13) requires drawing kernels from the distribution p(k|D). This is a difficult distribution to approximate, particularly because there is not a fixed number of parameters as J varies. We employ RJ-MCMC, which extends the capability of conventional MCMC to allow sequential samples of different dimensions to be drawn (Green, 2003). Thus, a posterior distribution is not limited to coefficients and other parameters of a fixed basis expansion, but can represent a changing number of basis functions, as required by the description of L\u00e9vy processes in the previous section. 
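The Monte Carlo average of Eq. (13) can be sketched as follows; for illustration we simulate "posterior" kernel draws as RBF kernels with jittered length scales rather than true RJ-MCMC samples, so every value here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Average GP predictive distributions over H sampled kernels k_h.
X = np.linspace(0, 1, 20); y = np.sin(2 * np.pi * X); sigma2 = 1e-4
Xs = np.array([1.25])                      # one test input

def gp_mean_var(ell):
    k = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :])**2 / ell**2)
    Kinv = np.linalg.inv(k(X, X) + sigma2 * np.eye(len(X)))
    m = k(Xs, X) @ Kinv @ y
    v = k(Xs, Xs) - k(Xs, X) @ Kinv @ k(Xs, X).T + sigma2
    return m[0], v[0, 0]

H = 50
stats = np.array([gp_mean_var(ell)
                  for ell in 0.2 + 0.05 * rng.standard_normal(H)])
means, vars_ = stats[:, 0], stats[:, 1]
mixture_mean = means.mean()                  # predictive mean of the mixture
mixture_var = vars_.mean() + means.var()     # law of total variance
```

The `means.var()` term is the extra predictive variance contributed by kernel uncertainty, which is exactly what a fixed-kernel GP omits.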
Indeed, RJ-MCMC can be used to automatically learn the appropriate number of basis functions in an expansion. In the case of spectral kernel learning, inferring the number of basis functions corresponds to automatically learning the important frequency contributions to a GP kernel, which can lead to new interpretable insights into our data.\n4.1 Initialization Considerations\nThe choice of an initialization procedure is often an important practical consideration for machine learning tasks due to severe multimodality in a likelihood surface (Neal, 1996). In many cases, however, we find that spectral kernel learning with RJ-MCMC can automatically learn salient frequency contributions with a simple initialization, such as a uniform covering over a broad range of frequencies with many sharp peaks. The frequencies which are not important in describing the data are quickly attenuated or removed within RJ-MCMC learning. Typically only a few hundred RJ-MCMC iterations are needed to discover the salient frequencies in this way.\nWilson (2014) proposes an alternative structured approach to initialization in previous spectral kernel modelling work. First, pass the (squared) data through a Fourier transform to obtain an empirical spectral density, which can be treated as observed. Next, fit the empirical spectral density using a standard Gaussian mixture density estimation procedure, assuming a fixed number of mixture components. Then, use the learned parameters of the Gaussian mixture as an initialization of the spectral mixture kernel hyperparameters, for Gaussian process marginal likelihood optimization. 
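This empirical-spectrum initialization can be sketched as follows; the FFT periodogram with simple peak-picking below is our stand-in for the Gaussian mixture density fit described above (signal and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic signal with two known frequencies on exact FFT bins.
n, dt = 512, 1.0
t = np.arange(n) * dt
y = (np.sin(2 * np.pi * 0.125 * t)
     + 0.5 * np.sin(2 * np.pi * 0.3125 * t)
     + 0.1 * rng.standard_normal(n))

freqs = np.fft.rfftfreq(n, d=dt)
power = np.abs(np.fft.rfft(y - y.mean()))**2 / n  # empirical spectral density

# Take the two dominant bins as initial central frequencies chi_j.
top = freqs[np.argsort(power)[::-1][:2]]
```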
We observe successful adaptation of this procedure to our L\u00e9vy process method, replacing the Gaussian mixture approximation with Laplacian mixture terms and using the result to initialize RJ-MCMC.\n4.2 Scalability\nAs with other GP-based kernel methods, the computational bottleneck lies in the evaluation of the log marginal likelihood during MCMC, which requires computing (KX,X + \u03c3\u00b2I)\u207b\u00b9y and log|KX,X + \u03c3\u00b2I| for an n \u00d7 n kernel matrix KX,X evaluated at the n training points X. A direct approach through computing the Cholesky decomposition of the kernel matrix requires O(n\u00b3) computations and O(n\u00b2) storage, restricting the size of training sets to O(10\u2074). Furthermore, this computation must be performed at every iteration of RJ-MCMC, compounding standard computational constraints.\nHowever, this bottleneck can be readily overcome through the structured kernel interpolation approach introduced in Wilson & Nickisch (2015), which approximates the kernel matrix as \u02dcKX,X\u2032 = MX KZ,Z MX\u2032\u22a4 for an exact kernel matrix KZ,Z evaluated on a much smaller set of m \u226a n inducing points, and a sparse interpolation matrix MX which facilitates fast computations. The calculation reduces to O(n + g(m)) computations and O(n + g(m)) storage. 
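A sketch of the kind of algebraic structure exploited here: on a regular grid of inducing points, KZ,Z is Toeplitz, and a Toeplitz matrix-vector product can be computed in O(m log m) via a circulant embedding and the FFT (our own illustration, not the authors' implementation):

```python
import numpy as np

def toeplitz_matvec(c, v):
    """Multiply a symmetric Toeplitz matrix (first column c) by v in
    O(m log m) by embedding it in a (2m-2)-circulant and using the FFT."""
    m = len(c)
    circ = np.concatenate([c, c[-2:0:-1]])      # circulant first column
    ext = np.concatenate([v, np.zeros(m - 2)])  # zero-padded input
    out = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(ext)).real
    return out[:m]

# Check against the dense product on a small example.
z = np.linspace(0, 1, 8)
c = np.exp(-0.5 * (z - z[0])**2 / 0.3**2)  # first column of an RBF Gram matrix on a grid
K = np.array([[c[abs(i - j)] for j in range(8)] for i in range(8)])
v = np.arange(8.0)
fast = toeplitz_matvec(c, v)
```

Combined with sparse interpolation weights, this gives the g(m) = m log m cost quoted below.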
As described\nin Wilson & Nickisch (2015), we can impose Toeplitz structure on KZ,Z for g(m) = m log m,\nallowing our RJ-MCMC procedure to train on massive datasets.\n5 Experiments\nWe conduct four experiments in total.\nIn order to motivate our model for kernel learning in\nlater experiments, we \ufb01rst demonstrate the ability of a L\u00b4evy process to recover\u2014through direct\nregression\u2014an observed noise-contaminated spectrum that is characteristic of sharply peaked nat-\nurally occurring spectra.\nIn the second experiment we demonstrate the robustness of our RJ-\nMCMC sampler by automatically recovering the generative frequencies of a known kernel, even\nin presence of signi\ufb01cant noise contamination and poor initializations.\nIn the third experiment\nwe demonstrate the ability of our method to infer the spectrum of airline passenger data, to per-\nform long-range extrapolations on real data, and to demonstrate the utility of accounting for un-\ncertainty in the kernel.\nIn the \ufb01nal experiment we demonstrate the scalability of our method\nthrough training the model on a 100,000 data point sound waveform. Code is available at https:\n//github.com/pjang23/levy-spectral-kernel-learning.\n5.1 Explicit Spectrum Modelling\nWe begin by applying a L\u00b4evy process di-\nrectly for function modelling (known as LARK\nregression), with inference as described in\nWolpert et al. (2011), and Laplacian basis func-\ntions. We choose an out of class test function\nproposed by Donoho & Johnstone (1993) that\nis standard in wavelet literature. The spatially\ninhomogeneous function is de\ufb01ned to represent\nspectral densities that arise in scienti\ufb01c and en-\ngineering applications. Gaussian i.i.d. noise is\nadded to give a signal-to-noise ratio of 7, to be\nconsistent with previous studies of the test func-\ntion Wolpert et al. (2011).\nThe noisy test function and LARK regression \ufb01t are shown in Figure 3. 
The synthetic spectrum is well characterized by the L\u00e9vy process, with no \u201cfalse positive\u201d basis function terms fitting the noise, owing to the strong regularization properties of the L\u00e9vy prior. By contrast, GP regression with an RBF kernel learns a length scale of 0.07 through maximum marginal likelihood training: the Gaussian process posterior can fit the sharp peaks in the test function only if it also overfits to the additive noise.\nThe point of this experiment is to show that the L\u00e9vy process with Laplacian basis functions forms a natural prior over spectral densities. In other words, samples from this prior will typically look like the types of spectra that occur in practice. Thus, this process will have a powerful inductive bias when used for kernel learning, which we explore in the next experiments.\n\nFigure 3: L\u00e9vy process regression on a noisy test function (black). The fit (red) captures the locations and scales of each spike while ignoring noise, but falls slightly short at its modes since the black spikes are parameterized as (1 + |x|)\u207b\u2074 rather than Laplacian.\n\nFigure 4: Ground truth recovery of known frequency components. (left) The spectrum of the Gaussian process that was used to generate the noisy training data is shown in black. From these noisy data and the erroneous spectral initialization shown in dashed blue, the maximum a posteriori estimate of the spectral density (over 1000 RJ-MCMC steps) is shown in red. An SM kernel also identifies the salient frequencies, but with broader support, shown in magenta. (right) Noisy training data are shown with a scatterplot, with withheld testing data shown in green. 
The learned posterior\npredictive distribution (mean in black, with 95%\ncredible set in grey) captures the test data.\n\n5.2 Ground Truth Recovery\nWe next demonstrate the ability of our method\nto recover the generative frequencies of a\nknown kernel and its robustness to noise and\npoor initializations. Data are generated from a\nGP with a kernel having two spectral Laplacian\npeaks, and partitioned into training and testing\nsets containing 256 points each. Moreover, the\ntraining data are contaminated with i.i.d. Gaus-\nsian noise (signal-to-noise ratio of 85%).\nBased on these observed training data (depicted\nas black dots in Figure 4, right), we estimate\nthe kernel of the Gaussian process by inferring\nits spectral density (Figure 4, left) using 1000\nRJ-MCMC iterations. The empirical spectrum\ninitialization described in section 4.1 results in\nthe discovery of the two generative frequencies.\nCritically, we can also recover these salient fre-\nquencies even with a very poor initialization, as\nshown in Figure 4 (left).\nFor comparison, we also train a Gaussian SM\nkernel, initializing based on the empirical spec-\ntrum. The resulting kernel spectrum (Figure 4,\nmagenta curve) does recover the salient frequencies, though with less con\ufb01dence and higher over-\nhead than even a poor initialization and spectral kernel learning with RJ-MCMC.\n5.3 Spectral Kernel Learning for Long-Range Extrapolation\nWe next demonstrate the ability of our method\nto perform long-range extrapolation on real\ndata. Figure 5 shows a time series of monthly\nairline passenger data from 1949 to 1961 (Hyn-\ndman, 2005). The data show a long-term ris-\ning trend as well as a short term seasonal wave-\nform, and an absence of white noise artifacts.\nAs with Wilson & Adams (2013b), the \ufb01rst 96\nmonthly data points are used to train the model\nand the last 48 months (4 years) are withheld as\ntesting data, indicated in green. 
With an initialization from the empirical spectrum and 2500 RJ-MCMC steps, the model is able to automatically learn the necessary frequencies and the shape of the spectral density to capture both the rising trend and the seasonal waveform, allowing for accurate long-range extrapolations without pre-specifying the number of model components.

This experiment also demonstrates the impact of accounting for uncertainty in the kernel: the withheld data often lie near, or cross, the upper bound of the 95% predictive bands of the SM fit, whereas our model yields wider and more conservative predictive bands that wholly capture the test data. Since the SM extrapolations are highly sensitive to the choice of parameter values, fixing the parameters of the kernel will yield overconfident predictions. The Lévy process prior allows us to account for a range of possible kernel parameters, so we can achieve a more realistically broad coverage of possible extrapolations.

Note that the Lévy process over spectral densities induces a prior over kernel functions. Figure 6 shows a side-by-side comparison of covariance function draws from the prior and posterior distributions over kernels. We see that sample covariance functions from the prior vary quite significantly, but are concentrated in the posterior, with movement towards the empirical covariance function.

Figure 5: Learning of airline passenger data. Training data is scatter plotted, with withheld testing data shown in green. 
The learned posterior distribution with the proposed approach (mean in black, with 95% credible set in grey) captures the periodicity and the rising trend in the test data. The analogous 95% interval using a GP with a SM kernel is illustrated in magenta.

Figure 6: Covariance function draws from the kernel prior (left) and posterior (right) distributions, with the empirical covariance function shown in black. After RJ-MCMC, the covariance distribution centers upon the correct frequencies and order of magnitude.

Figure 7: Learning of a natural sound texture. A close-up of the training interval is displayed with the true waveform data scatter plotted. The learned posterior distribution (mean in black, with 95% credible set in grey) retains the periodicity of the signal within the corrupted interval. Three samples are drawn from the posterior distribution.

5.4 Scalability Demonstration
A flexible and fully Bayesian approach to kernel learning can come with some additional computational overhead. Here we demonstrate the scalability that is achieved through the integration of SKI (Wilson & Nickisch, 2015) with our Lévy process model. We consider a 100,000 data point waveform, taken from the field of natural sound modelling (Turner, 2010). A Lévy kernel process is trained on a sound texture sample of howling wind with the middle 10% removed. Training involved initialization from the signal's empirical covariance and 500 RJ-MCMC samples, and took less than one hour using an Intel i7 3.4 GHz CPU and 8 GB of memory. Four distinct mixture components in the model were automatically identified through the RJ-MCMC procedure. 
The learned kernel is then used for GP infilling with 900 training points, obtained by down-sampling the training data, and is then applied to the original 44,100 Hz natural sound file for infilling.

The GP posterior distribution over the region of interest is shown in Figure 7, along with sample realizations, which appear to capture the qualitative behavior of the waveform. This experiment demonstrates the applicability of our proposed kernel learning method to large datasets, and shows promise for extensions to higher dimensional data.

6 Discussion
We introduced a distribution over covariance kernel functions that is well suited for modelling quasi-periodic data. We have shown how to place a Lévy process prior over the spectral density of a stationary kernel. The resulting hierarchical model allows the incorporation of kernel uncertainty into the predictive distribution. Through the spectral regularization properties of Lévy process priors, we found that our trans-dimensional sampling procedure is suitable for automatically performing inference over model order, and is robust to initialization strategies. Finally, we incorporated structured kernel interpolation into our training and inference procedures for linear time scalability, enabling experiments on large datasets. The key advances over conventional spectral mixture kernels are in being able to interpretably and automatically discover the number of mixture components, and in representing uncertainty over the kernel. Here, we considered one dimensional inputs and stationary processes to most clearly elucidate the key properties of Lévy kernel processes. However, one could generalize this process to multidimensional non-stationary kernel learning by jointly inferring properties of transformations over inputs alongside the kernel hyperparameters. 
Alternatively, one could consider neural networks as basis functions in the Lévy process, inferring distributions over the parameters of the network and the number of basis functions as a step towards automating neural network architecture construction.

Acknowledgements. This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (PGS-D 502888) and the National Science Foundation DGE 1144153 and IIS-1563887 awards.

References
Bochner, S. Lectures on Fourier Integrals (AM-42), volume 42. Princeton University Press, 1959.

Clyde, Merlise A and Wolpert, Robert L. Nonparametric function estimation using overcomplete dictionaries. Bayesian Statistics, 8:91–114, 2007.

Donoho, D. and Johnstone, J.M. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1993.

Green, P.J. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732, 1995.

Green, P.J. Trans-dimensional Markov chain Monte Carlo, chapter 6. Oxford University Press, 2003.

Hyndman, R.J. Time series data library. 2005. http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/.

MacKay, David J.C. Introduction to Gaussian processes. In Bishop, Christopher M. (ed.), Neural Networks and Machine Learning, chapter 11, pp. 133–165. Springer-Verlag, 1998.

Micchelli, Charles A, Xu, Yuesheng, and Zhang, Haizhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.

Neal, R.M. Bayesian Learning for Neural Networks. Springer Verlag, 1996. ISBN 0387947248.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Turner, R. Statistical models for natural sounds. PhD thesis, University College London, 2010.

Wilson, A.G. and Adams, R.P. 
Gaussian process kernels for pattern discovery and extrapolation supplementary material and code. 2013a. http://mlg.eng.cam.ac.uk/andrew/smkernelsupp.pdf.

Wilson, Andrew Gordon. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.

Wilson, Andrew Gordon and Adams, Ryan Prescott. Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML), 2013b.

Wilson, Andrew Gordon and Nickisch, Hannes. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML), 2015.

Wolpert, R.L., Clyde, M.A., and Tu, C. Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels. The Annals of Statistics, 39(4):1916–1962, 2011.