{"title": "Non-Stationary Spectral Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 4642, "page_last": 4651, "abstract": "We propose non-stationary spectral kernels for Gaussian process regression by modelling the spectral density of a non-stationary kernel function as a mixture of input-dependent Gaussian process frequency density surfaces. We solve the generalised Fourier transform with such a model, and present a family of non-stationary and non-monotonic kernels that can learn input-dependent and potentially long-range, non-monotonic covariances between inputs. We derive efficient inference using model whitening and marginalized posterior, and show with case studies that these kernels are necessary when modelling even rather simple time series, image or geospatial data with non-stationary characteristics.", "full_text": "Non-Stationary Spectral Kernels\n\nSami Remes\n\nMarkus Heinonen\n\nSamuel Kaski\n\nsamuel.kaski@aalto.fi\n\nsami.remes@aalto.fi\n\nmarkus.o.heinonen@aalto.fi\n\nHelsinki Institute for Information Technology HIIT\nDepartment of Computer Science, Aalto University\n\nAbstract\n\nWe propose non-stationary spectral kernels for Gaussian process regression by\nmodelling the spectral density of a non-stationary kernel function as a mixture of\ninput-dependent Gaussian process frequency density surfaces. We solve the gener-\nalised Fourier transform with such a model, and present a family of non-stationary\nand non-monotonic kernels that can learn input-dependent and potentially long-\nrange, non-monotonic covariances between inputs. We derive ef\ufb01cient inference\nusing model whitening and marginalized posterior, and show with case studies that\nthese kernels are necessary when modelling even rather simple time series, image\nor geospatial data with non-stationary characteristics.\n\n1\n\nIntroduction\n\nGaussian processes are a \ufb02exible method for non-linear regression [18]. 
They define a distribution over functions, and their performance depends heavily on the covariance function that constrains the function values. Gaussian processes interpolate function values by considering the value of functions at other similar points, as defined by the kernel function. Standard kernels, such as the Gaussian kernel, lead to smooth neighborhood-dominated interpolation that is oblivious to any periodic or long-range connections within the input space, and cannot adapt the similarity metric to different parts of the input space.\nTwo key properties of covariance functions are stationarity and monotonicity. A stationary kernel K(x, x') = K(x + a, x' + a) is a function only of the distance x \u2212 x' and not directly of the value of x. Hence it encodes an identical similarity notion across the input space, while a monotonic kernel decreases over distance. Kernels that are both stationary and monotonic, such as the Gaussian and Mat\u00e9rn kernels, can encode neither input-dependent function dynamics nor long-range correlations within the input space. Non-monotonic and non-stationary functions are commonly encountered in realistic signal processing [19], time series analysis [9], bioinformatics [5, 20], and in geostatistics applications [7, 8].\nRecently, several authors have explored kernels that are either non-monotonic or non-stationary. A non-monotonic kernel can reveal informative manifolds over the input space by coupling distant points due to periodic or other effects. Non-monotonic kernels have been derived from the Fourier decomposition of kernels [13, 24, 30], which renders them inherently stationary. 
Non-stationary kernels, on the other hand, are based on generalising monotonic base kernels, such as the Mat\u00e9rn family of kernels [6, 15], by partitioning the input space [4], or by input transformations [25].\nWe propose an expressive and efficient kernel family that is \u2013 in contrast to earlier methods \u2013 both non-stationary and non-monotonic, and hence can infer long-range or periodic relations in an input-dependent manner. We derive the kernel from first principles by solving the more expressive generalised Fourier decomposition of non-stationary functions, rather than the more limited standard Fourier decomposition exploited by earlier works. We propose and solve the generalised spectral density as a mixture of Gaussian process density surfaces that model flexible input-dependent frequency patterns.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe kernel reduces to a stationary kernel with appropriate parameterisation. We show the expressivity of the kernel with experiments on time series data, image-based pattern recognition and extrapolation, and on climate data modelling.\n\n2 Related Work\n\nBochner's theorem for stationary signals, whose covariance can be written as k(\u03c4) = k(x \u2212 x') = k(x, x'), implies a Fourier dual [30]\n\nk(\u03c4) = \u222b S(s) e^{2\u03c0is\u03c4} ds,\nS(s) = \u222b k(\u03c4) e^{\u22122\u03c0is\u03c4} d\u03c4.\n\nThe dual is a special case of the more general Fourier transform (1), and has been exploited to design rich, yet stationary kernel representations [24, 32] and used for large-scale inference [17]. L\u00e1zaro-Gredilla et al. [13] proposed to directly learn the spectral density as a mixture of Dirac delta functions, leading to the sparse spectrum (SS) kernel k_SS(\u03c4) = (1/Q) \u2211_{i=1}^Q cos(2\u03c0 s_i^T \u03c4). Wilson et al. [30] derived a stationary spectral mixture (SM) kernel by modelling the univariate spectral density using a mixture of normals S_SM(s) = \u2211_i w_i [N(s|\u00b5_i, \u03c3_i\u00b2) + N(s|\u2212\u00b5_i, \u03c3_i\u00b2)]/2, corresponding to the kernel function k_SM(\u03c4) = \u2211_i w_i exp(\u22122\u03c0\u00b2\u03c3_i\u00b2\u03c4\u00b2) cos(2\u03c0\u00b5_i\u03c4), which we generalize to the non-stationary case. The SM kernel was also extended for multidimensional inputs using Kronecker structure for scalability [27]. Kernels derived from the spectral representation are particularly well suited to encoding long-range, non-monotonic or periodic kernels; however, they have so far been unable to handle non-stationarity, although [29] presented a partly non-stationary SM kernel that has input-dependent mixture weights. Kom Samo and Roberts also derived a kernel similar to our bivariate spectral mixture kernel in a recent technical report [11].\nNon-stationary kernels, on the other hand, have been constructed by non-stationary extensions of Mat\u00e9rn and Gaussian kernels with input-dependent length-scales [3, 6, 15, 16], input space warpings [22, 25], and with local stationarity with products of stationary and non-stationary kernels [2, 23]. The simplest non-stationary kernel is arguably the dot product kernel [18], which has been used as a way to assign input-dependent signal variances [26]. Non-stationary kernels are a good match for functions with transitions in their dynamics, yet are unsuitable for modelling non-monotonic properties.\nOur work can also be seen as a generalisation of wavelets, or time-dependent frequency components, into general and smooth input-dependent components. In signal processing, Hilbert-Huang transforms and Hilbert spectral analysis explore input-dependent frequencies, but with deterministic transform functions on the inputs [8, 9].\n\n3 Non-stationary spectral mixture kernels\n\nThis section introduces the main contributions. 
We employ the generalised spectral decomposition of non-stationary functions and derive a practical and efficient family of kernels based on non-stationary spectral components. Our approach relies on associating input-dependent frequencies for data inputs, and solving a kernel through the generalised spectral transform.\nThe most general family of kernels is the non-stationary kernels, which include stationary kernels as special cases [2]. A non-stationary kernel k(x, x') \u2208 R for scalar inputs x, x' \u2208 R can be characterized by its spectral density S(s, s') over frequencies s, s' \u2208 R, and the two are related via a generalised Fourier inverse transform1\n\nk(x, x') = \u222b_R \u222b_R e^{2\u03c0i(xs \u2212 x's')} \u00b5_S(ds, ds') , (1)\n\n1We focus on scalar inputs and frequencies for simplicity. An extension based on vector-valued inputs and frequencies [2, 10] is straightforward.\n\nFigure 1: (a): Spectral density surface of a single component bivariate spectral mixture kernel with 8 permuted peaks. (b): The corresponding kernel on inputs x \u2208 [\u22121, 1].\n\nwhere \u00b5_S is a Lebesgue-Stieltjes measure associated to some positive semi-definite (PSD) spectral density function S(s, s') with bounded variations [2, 14, 31], which we denote as the spectral surface since it considers the amplitude of frequency pairs (see Figure 1a).\nThe generalised Fourier transform (1) specifies that a spectral surface S(s, s') generates a PSD kernel k(x, x') that is non-stationary unless the spectral measure mass is concentrated only on the diagonal s = s'. 
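A useful sanity check of the diagonal (stationary) special case is that a symmetrised Gaussian spectral density should invert, under the transform above, to the SM-type kernel exp(\u22122\u03c0\u00b2\u03c3\u00b2\u03c4\u00b2) cos(2\u03c0\u00b5\u03c4). The following sketch (toy parameter values of our own choosing) verifies this numerically:

```python
import numpy as np

# Numerical check: the symmetrised Gaussian spectral density
# S(s) = [N(s|mu, sigma^2) + N(s|-mu, sigma^2)]/2 should Fourier-invert to
# k(tau) = exp(-2 pi^2 sigma^2 tau^2) cos(2 pi mu tau).
mu, sigma = 1.5, 0.3
s = np.linspace(-30.0, 30.0, 200001)          # dense frequency grid
ds = s[1] - s[0]
normal = lambda m: np.exp(-0.5 * ((s - m) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
S = 0.5 * (normal(mu) + normal(-mu))

tau = np.linspace(-1.0, 1.0, 9)
# Since S is symmetric in s, the complex exponential reduces to a cosine.
k_numeric = np.array([np.sum(S * np.cos(2 * np.pi * s * t)) * ds for t in tau])
k_closed = np.exp(-2 * np.pi ** 2 * sigma ** 2 * tau ** 2) * np.cos(2 * np.pi * mu * tau)
assert np.allclose(k_numeric, k_closed, atol=1e-4)
```

The two vectors agree to integration accuracy, confirming the stationary limit of the transform.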
We design a practical, efficient and flexible parameterisation of spectral surfaces that, in turn, specifies novel non-stationary kernels with input-dependent characteristics and potentially long-range non-monotonic correlation structures.\n\n3.1 Bivariate Spectral Mixture kernel\n\nNext, we introduce spectral kernels that remove the restriction of stationarity of earlier works. We start by modeling the spectral density as a mixture of Q bivariate Gaussian components\n\nS_i(s, s') = \u2211_{\u00b5_i \u2208 \u00b1{\u00b5_i, \u00b5_i'}\u00b2} N( (s, s')^T | \u00b5_i, \u03a3_i ) , \u03a3_i = [ \u03c3_i\u00b2 , \u03c1_i\u03c3_i\u03c3_i' ; \u03c1_i\u03c3_i\u03c3_i' , \u03c3_i'\u00b2 ] , (2)\n\nwith parameterisation using the correlation \u03c1_i, means \u00b5_i, \u00b5_i' and variances \u03c3_i\u00b2, \u03c3_i'\u00b2. To produce a PSD spectral density S_i as required by equation (1) we need to include the symmetries S_i(s, s') = S_i(s', s) and sufficient diagonal components S_i(s, s), S_i(s', s'). 
To additionally result in a real-valued kernel, symmetry is required with respect to the negative frequencies as well, i.e., S_i(s, s') = S_i(\u2212s, \u2212s').\nThe sum over \u00b5_i \u2208 \u00b1{\u00b5_i, \u00b5_i'}\u00b2 satisfies all three requirements by iterating over the four permutations of {\u00b5_i, \u00b5_i'}\u00b2 and the opposite signs (\u2212\u00b5_i, \u2212\u00b5_i'), resulting in eight components (see Figure 1a).\nThe generalised Fourier inverse transform (1) can be solved in closed form for a weighted spectral surface mixture S(s, s') = \u2211_{i=1}^Q w_i\u00b2 S_i(s, s') using Gaussian integral identities (see the Supplement):\n\nk(x, x') = \u2211_{i=1}^Q w_i\u00b2 exp(\u22122\u03c0\u00b2 \u02dcx^T \u03a3_i \u02dcx) \u03a8_{\u00b5_i,\u00b5_i'}(x)^T \u03a8_{\u00b5_i,\u00b5_i'}(x') , (3)\n\nwhere\n\n\u03a8_{\u00b5_i,\u00b5_i'}(x) = ( cos 2\u03c0\u00b5_ix + cos 2\u03c0\u00b5_i'x , sin 2\u03c0\u00b5_ix + sin 2\u03c0\u00b5_i'x )^T ,\n\nand where we define \u02dcx = (x, \u2212x')^T and introduce mixture weights w_i for each component. We denote the proposed kernel as the bivariate spectral mixture (BSM) kernel (see Figure 1b). The positive definiteness of the kernel is guaranteed by the spectral transform, and is also easily verified since the sinusoidal components form an inner product and the exponential component resembles an unscaled Gaussian density. A similar formulation for non-stationary spectral kernels was presented also in a technical report [11].\n\nFigure 2: (a)-(d): Examples of kernel matrices on inputs x \u2208 [\u22121, 1] for a Gaussian kernel (a), sparse spectrum kernel [13] (b), spectral mixture kernel [30] (c), and for the GSM kernel (d). (e)-(h): The corresponding generalised spectral density surfaces of the four kernels. 
(i)-(l): The corresponding spectrograms, that is, input-dependent frequency amplitudes. The GSM kernel is highlighted with a spectrogram mixture of Q = 2 Gaussian process surface functions.\n\nWe immediately notice that the BSM kernel vanishes rapidly outside the origin (x, x') = (0, 0). We would require a huge number of components centered at different points x_i to cover a reasonably-sized input space.\n\n3.2 Generalised Spectral Mixture (GSM) kernel\n\nWe extend the kernel derived in Section 3.1 further by parameterising the frequencies, length-scales and mixture weights as Gaussian processes2, which form a smooth spectrogram (see Figure 2(l)):\n\nlog w_i(x) \u223c GP(0, k_w(x, x')), (4)\nlog \u2113_i(x) \u223c GP(0, k_\u2113(x, x')), (5)\nlogit \u00b5_i(x) \u223c GP(0, k_\u00b5(x, x')). (6)\n\nHere the log transform is used to ensure that the weights w(x) and lengthscales \u2113(x) are non-negative, and the logit transform logit \u00b5(x) = log[\u00b5/(F_N \u2212 \u00b5)] limits the learned frequencies between zero and the Nyquist frequency F_N, which is defined as half of the sampling rate of the signal.\nA GP prior f(x) \u223c GP(0, k(x, x')) defines a distribution over zero-mean functions, where the covariance between function values, cov[f(x), f(x')] = k(x, x'), equals their prior kernel. For any collection of inputs x_1, . . . , x_N, the function values follow a multivariate normal distribution (f(x_1), . . . , f(x_N))^T \u223c N(0, K), where K_ij = k(x_i, x_j). The key property of Gaussian processes is that they can encode smooth functions by correlating function values of input points that are similar according to the kernel k(x, x'). 
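As a concrete illustration of the GP prior just described, the following sketch (a squared-exponential kernel with parameter values of our own choosing) draws correlated function values from the multivariate normal N(0, K):

```python
import numpy as np

# Draw samples f ~ N(0, K) from a zero-mean GP prior with a Gaussian
# (squared-exponential) kernel k(x, x') = exp(-(x - x')^2 / (2 l^2)).
def gauss_kernel(x, xp, lengthscale=0.2):
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale ** 2)

x = np.linspace(-1, 1, 100)
K = gauss_kernel(x, x)
K += 1e-8 * np.eye(len(x))         # jitter for numerical stability
rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(len(x)), K, size=3)   # three prior draws

# Nearby inputs are more strongly correlated under the prior than distant ones.
assert K[0, 1] > K[0, -1]
```

Each row of `f` is one smooth random function; the smoothness comes entirely from the off-diagonal structure of K.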
We use standard Gaussian kernels k_w, k_\u2113 and k_\u00b5.\n\n2See the Supplement for a tutorial on Gaussian processes.\n\nWe accommodate the input-dependent lengthscale by replacing the exponential part of (3) by the Gibbs kernel\n\nk_Gibbs,i(x, x') = sqrt( 2\u2113_i(x)\u2113_i(x') / (\u2113_i(x)\u00b2 + \u2113_i(x')\u00b2) ) exp( \u2212(x \u2212 x')\u00b2 / (\u2113_i(x)\u00b2 + \u2113_i(x')\u00b2) ) ,\n\nwhich is a non-stationary generalisation of the Gaussian kernel [3, 6, 15]. We propose a non-stationary generalised spectral mixture (GSM) kernel with a simple closed form (see the Supplement):\n\nk_GSM(x, x') = \u2211_{i=1}^Q w_i(x)w_i(x') k_Gibbs,i(x, x') cos(2\u03c0(\u00b5_i(x)x \u2212 \u00b5_i(x')x')) . (7)\n\nThe kernel is a product of three PSD terms. The GSM kernel encodes the similarity between two data points based on their combined signal variance w(x)w(x'), and the frequency surface based on the frequencies \u00b5(x), \u00b5(x') and frequency lengthscales \u2113(x), \u2113(x') associated with both inputs. The GSM kernel encodes the spectrogram surface mixture into a relatively simple kernel. The kernel reduces to the stationary spectral mixture (SM) kernel [30] with constant functions w_i(x) = w_i, \u00b5_i(x) = \u00b5_i and \u2113_i(x) = 1/(2\u03c0\u03c3_i) (see the Supplement).\nWe have presented the proposed kernel (7) for univariate inputs for simplicity. The kernel can be extended to multivariate inputs in a straightforward manner using the generalised Fourier transform with vector-valued inputs [2, 10]. However, in many applications multivariate inputs have a grid-like structure, for instance in geostatistics, image analysis and temporal models. 
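Given realisations of the three latent functions, the univariate GSM kernel (7) is cheap to evaluate; a minimal NumPy sketch (function and variable names are ours) also checks the stated reduction to the stationary SM kernel under constant latent functions:

```python
import numpy as np

def gsm_kernel(x, w, ell, mu):
    """Single-component GSM kernel matrix, eq. (7) with Q = 1.

    x, w, ell, mu have shape (n,): the inputs and the values of the latent
    weight, lengthscale and frequency functions at those inputs.
    """
    ell2 = ell[:, None] ** 2 + ell[None, :] ** 2
    gibbs = np.sqrt(2 * ell[:, None] * ell[None, :] / ell2) \
        * np.exp(-(x[:, None] - x[None, :]) ** 2 / ell2)
    cosine = np.cos(2 * np.pi * (mu[:, None] * x[:, None] - mu[None, :] * x[None, :]))
    return w[:, None] * w[None, :] * gibbs * cosine

# With constant w, ell and mu the kernel collapses to the stationary SM form
# w^2 exp(-tau^2 / (2 ell^2)) cos(2 pi mu tau), i.e. sigma = 1/(2 pi ell).
x = np.linspace(-1, 1, 50)
K = gsm_kernel(x, w=np.ones(50), ell=np.full(50, 0.3), mu=np.full(50, 2.0))
tau = x[:, None] - x[None, :]
K_sm = np.exp(-tau ** 2 / (2 * 0.3 ** 2)) * np.cos(2 * np.pi * 2.0 * tau)
assert np.allclose(K, K_sm)
```

Replacing the constant arrays with smoothly varying ones yields input-dependent frequencies and lengthscales, which is the non-stationary case of interest.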
We exploit this assumption and propose a multivariate extension that assumes the kernel to decompose across input dimensions [1, 27]:\n\nk_GSM(x, x'|\u03b8) = \u220f_{p=1}^P k_GSM(x_p, x_p'|\u03b8_p) . (8)\n\nHere x, x' \u2208 R^P, and \u03b8 = (\u03b8_1, . . . , \u03b8_P) collects the dimension-wise kernel parameters \u03b8_p = (w_ip, \u2113_ip, \u00b5_ip)_{i=1}^Q of the n-dimensional realisations w_ip, \u2113_ip, \u00b5_ip \u2208 R^n per dimension p. Then, the kernel matrix can be expressed using Kronecker products as K_\u03b8 = K_{\u03b8_1} \u2297 \u00b7\u00b7\u00b7 \u2297 K_{\u03b8_P}, while missing values and data not on a regular grid can be handled with standard techniques [1, 21, 28, 27].\n\n4 Inference\n\nWe use the Gaussian process regression framework and assume a Gaussian likelihood over N = n^P data points3 (x_j, y_j)_{j=1}^N with all outputs collected into a vector y \u2208 R^N,\n\ny_j = f(x_j) + \u03b5_j , \u03b5_j \u223c N(0, \u03c3_n\u00b2) , f(x) \u223c GP(0, k_GSM(x, x'|\u03b8)) , (9)\n\nwith a standard predictive GP posterior f(x*|y) for a new input point x* [18]. The posterior can be efficiently computed using Kronecker identities [21] (see the Supplement).\nWe aim to infer the noise variance \u03c3_n\u00b2 and the kernel parameters \u03b8 = (w_ip, \u2113_ip, \u00b5_ip)_{i=1,p=1}^{Q,P} that reveal the input-dependent frequency-based correlation structures in the data, while regularising the learned kernel to penalise overfitting. 
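The Kronecker identity behind the product decomposition (8) can be checked directly: on a grid, a product kernel's full matrix equals the Kronecker product of the one-dimensional kernel matrices. A small sketch, with generic Gaussian kernels standing in for the per-dimension GSM kernels:

```python
import numpy as np

def gauss_kernel(x, lengthscale):
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / lengthscale ** 2)

# A 2D grid of 4 x 3 points; the product kernel over grid points equals the
# Kronecker product of the two one-dimensional kernel matrices.
x1 = np.linspace(0, 1, 4)
x2 = np.linspace(0, 1, 3)
K1, K2 = gauss_kernel(x1, 0.5), gauss_kernel(x2, 0.3)

grid = np.array([(a, b) for a in x1 for b in x2])   # row-major grid ordering
K_full = np.array([[np.exp(-0.5 * (p[0] - q[0]) ** 2 / 0.5 ** 2)
                    * np.exp(-0.5 * (p[1] - q[1]) ** 2 / 0.3 ** 2)
                    for q in grid] for p in grid])
assert np.allclose(K_full, np.kron(K1, K2))
```

This factorisation is what allows storing and decomposing P small n x n matrices instead of one n^P x n^P matrix.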
We perform MAP inference over the log marginalized posterior log p(\u03b8|y) \u221d log p(y|\u03b8)p(\u03b8) = L(\u03b8), where the functions f(x) have been marginalised out,\n\nL(\u03b8) = log[ N(y|0, K_\u03b8 + \u03c3_n\u00b2 I) \u220f_{i,p=1}^{Q,P} N(\u02dcw_ip|0, K_{w_p}) N(\u02dc\u00b5_ip|0, K_{\u00b5_p}) N(\u02dc\u2113_ip|0, K_{\u2113_p}) ] , (10)\n\nwhere K_{w_p}, K_{\u00b5_p}, K_{\u2113_p} are n \u00d7 n prior matrices per dimension p, and \u02dcw, \u02dc\u00b5 and \u02dc\u2113 represent the log or logit transformed variables. The marginalized posterior automatically balances between parameters \u03b8 that fit the data and a model that is not overly complex [18]. We can efficiently evaluate both the marginalized posterior and its gradients in O(P N^{(P+1)/P}) instead of the usual O(N\u00b3) complexity [21, 27] (see the Supplement).\n\n3Assuming that we have an equal number of points n in all dimensions.\n\nGradient-based optimisation of (10) is likely to converge very slowly due to the parameters \u02dcw_ip, \u02dc\u00b5_ip, \u02dc\u2113_ip being highly self-correlated. We remove the correlations by whitening the variables as \u02c6\u03b8 = L\u207b\u00b9\u02dc\u03b8, where L is the Cholesky decomposition of the prior covariances. We maximize L using gradient ascent with respect to the whitened variables \u02c6\u03b8 by evaluating L(L\u02c6\u03b8) and the gradient as [6, 12]\n\n\u2202L/\u2202\u02c6\u03b8 = (\u2202L/\u2202\u03b8)(\u2202\u03b8/\u2202\u02dc\u03b8)(\u2202\u02dc\u03b8/\u2202\u02c6\u03b8) = L^T \u2202L/\u2202\u02dc\u03b8 . (11)\n\n5 Experiments\n\nWe apply our proposed kernel first on simple simulated time series, then on texture images and lastly on a land surface temperature dataset. 
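The whitened-gradient identity (11) used in Section 4 is easy to verify numerically. In this sketch the objective is a toy function of our own (not the GSM marginal posterior); the gradient with respect to the whitened variables is obtained as L^T times the gradient with respect to the raw variables, and is checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)

# A prior covariance and its Cholesky factor: theta_tilde = L @ theta_hat.
A = rng.normal(size=(5, 5))
K_prior = A @ A.T + 5 * np.eye(5)
L = np.linalg.cholesky(K_prior)

# Toy differentiable objective of the raw variables: obj = sum(sin(t) * b).
b = rng.normal(size=5)
grad_raw = lambda t: np.cos(t) * b

theta_hat = rng.normal(size=5)
theta_tilde = L @ theta_hat

# Chain rule, eq. (11): dL/d theta_hat = L^T dL/d theta_tilde.
grad_white = L.T @ grad_raw(theta_tilde)

# Finite-difference check of the whitened gradient in a random direction.
obj = lambda h: np.sum(np.sin(L @ h) * b)
v = rng.normal(size=5)
eps = 1e-6
fd = (obj(theta_hat + eps * v) - obj(theta_hat - eps * v)) / (2 * eps)
assert np.isclose(fd, grad_white @ v, atol=1e-5)
```

The whitening itself is just a fixed linear reparameterisation, so gradients transport through it by a single triangular matrix-vector product.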
With the image data, we compare our method to two stationary mixture kernels, specifically the spectral mixture (SM) [30] and sparse spectrum (SS) kernels [13], and the standard squared exponential (SE) kernel. We employ the GPML Matlab toolbox, which directly implements the SM and SE kernels, and the SS kernel as a meta kernel combining simple cosine kernels. The GPML toolbox also implements Kronecker inference automatically for these kernels.\nWe implemented the proposed GSM kernel and inference in Matlab4. For optimising the log posterior (10) we employ the L-BFGS algorithm. For both our method and the comparisons, we restart the optimisation from 10 different initialisations, each of which is chosen as the best among 100 randomly sampled hyperparameter values, as evaluating the log posterior is cheap compared to evaluating gradients or running the full optimisation.\n\n5.1 Simulated time series with a decreasing frequency component\n\nFirst we test whether the GSM kernel can find a simulated time-varying frequency pattern. We simulated a dataset where the frequency of the signal changes deterministically as \u00b5(x) = 1 + (1 \u2212 x)\u00b2 on the interval x \u2208 [\u22121, 1]. We built a single-component GSM kernel K using the specified functions \u00b5(x), \u2113(x) = \u2113 = exp(\u22121) and w(x) = w = 1. We sampled a noisy function y \u223c N(0, K + \u03c3_n\u00b2 I) with a noise variance \u03c3_n\u00b2 = 0.1. The example in Figure 3 shows the learned GSM kernel, as well as the data and the function posterior f(x). For this 1D case, we also employed the empirical spectrogram for initialising the hyperparameter values. The kernel correctly captures the increasing frequency towards negative values (towards the left in Figure 3a).\n\n5.2 Image data\n\nWe applied our kernel to two texture images. The first image, of a sheet of metal, represents a mostly stationary periodic pattern. 
The second, a wood texture, represents an example of a very non-stationary pattern, especially on the horizontal axis. We use the majority of the image as training data (the non-masked regions of Figures 4a and 4f), and use the compared kernels to predict a missing cross-section in the middle, and also to extrapolate outside the borders of the original image.\nFigure 4 shows the two texture images, and extrapolation predictions given by the proposed GSM kernel, with a comparison to the spectral mixture (SM), sparse spectrum (SS) and standard squared exponential (SE) kernels. For GSM, SM and SS we used Q = 5 mixture components for the metal texture, and Q = 10 components for the more complex wood texture.\nThe GSM kernel gives the most pleasing result visually, and fills in both patterns well, with consistent external extrapolation as well. The stationary SM kernel does capture the cross-section, but has trouble extrapolating outside the borders. The SS kernel fails to represent even the training data, as it lacks any smoothness in the frequency space. The Gaussian kernel extrapolates poorly.\n\n4Implementation available at https://github.com/sremes/nonstationary-spectral-kernels\n\nFigure 3: (a) A simulated time series with a single decreasing frequency component and a GP fitted using a GSM kernel. (b) The learned kernel shows that close to x = \u22121 the signal is highly correlated and anti-correlated with close time points, while these periodic dependencies vanish when moving towards x = 1. For visualisation, the values are scaled as K = sgn(K)\u221a|K|. (c) The spectrogram shows the decreasing frequency. (d) The learned latent frequency function \u00b5(x) correctly finds the decreasing trend. 
The length-scale \u2113(x) is almost constant, and the weights w(x) slightly decrease in time.\n\n5.3 Spatio-Temporal Analysis of Land Surface Temperatures\n\nNASA5 provides a land surface temperature dataset that we used to demonstrate our kernel in the analysis of spatio-temporal data. Our primary objective is to demonstrate the capability of the kernel in inferring long-range, non-stationary spatial and temporal covariances.\nWe took a subset of four years (February 2000 to February 2004) of North American land temperatures for training data. In total we get 407,232 data points, constituting 48 monthly temperature measurements on an 84 \u00d7 101 map grid. The grid also contains water regions, which we imputed with the mean temperature of each month. We experimented with the data by learning a generalised spectral mixture kernel using Q = 5 components.\nFigure 5 presents our results. Figure 5b highlights the training data and model fits for a winter and a summer month, respectively. Figure 5a shows the non-stationary kernel slices at two locations across both latitude and longitude, as well as indicating that the spatial covariances are remarkably non-symmetric. Figure 5c indicates five months of successive training data followed by three months of test data predictions.\n\n6 Discussion\n\nIn this paper we have introduced non-stationary spectral mixture kernels, with treatment based on the generalised Fourier transform of non-stationary functions. We first derived the bivariate spectral mixture (BSM) kernel as a mixture of non-stationary spectral components. However, we argue it has only limited practical use due to requiring an impractical number of components to cover any sufficiently sized input space. The main contribution of the paper is the generalised spectral mixture (GSM) kernel with input-dependent Gaussian process frequency surfaces. 
The Gaussian process components can cover non-trivial input spaces with just a few interpretable components. The GSM kernel is a flexible, practical and efficient kernel that can learn both local and global correlations across the input domains in an input-dependent manner. We highlighted the capability of the kernel to find interesting patterns in the data by applying it on climate data, where it is highly unrealistic to assume the same (stationary) covariance pattern for every spatial location irrespective of spatial structures.\n\n5https://neo.sci.gsfc.nasa.gov/view.php?datasetId=MOD11C1_M_LSTDA\n\nFigure 4: A metal texture data with Q = 5 components used for GSM, SM and SS kernels shown in (a)-(e) and a wood texture in (f)-(j) (with Q = 10 components). The GSM kernel performs the best, making the most believable extrapolation outside the image borders in (b) and (g). The SM kernel fills in the missing cross pattern in (c) but does not extrapolate well. In (h) the SM kernel fills in the vertical middle block only with the mean value, while GSM in (g) is able to fill in a wood-like pattern. SS is not able to discover enough structure in either texture (d) or (i), while the SE kernel overfits by using a too short length-scale in (e) and (j).\n\nEven though the proposed kernel is motivated by the generalised Fourier transform, the solution to its spectral surface\n\nS_GSM(s, s') = \u222b\u222b k_GSM(x, x') e^{\u22122\u03c0i(xs \u2212 x's')} dx dx' (12)\n\nremains unknown due to having multiple GP functions inside the integral. Figure 2h highlights a numerical integration of the surface equation (12) on an example GP frequency surface. 
Furthermore, the theoretical work of Kom Samo and Roberts [11] on generalised spectral transforms suggests that the GSM kernel may also be dense in the family of non-stationary kernels, that is, able to reproduce arbitrary non-stationary kernels.\n\nAcknowledgments\n\nThis work has been partly supported by the Finnish Funding Agency for Innovation (project Re:Know) and the Academy of Finland (COIN CoE, and grants 299915, 294238 and 292334). We acknowledge the computational resources provided by the Aalto Science-IT project.\n\nFigure 5: (a) Demonstrates the non-stationary spatial covariances in the land surface data. The vertical black lines denote the point x0 at which the kernel function k(\u00b7, x0) is centered. (b) Sample reconstructions. In all plots, only the land area temperatures are shown. (c) Posterior for the five last training months (until Jan 2004) and predictions for the three next months (February 2004 to April 2004), which the model is able to construct reasonably accurately.\n\nReferences\n\n[1] S. Flaxman, A. G. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In ICML, 2015.\n\n[2] M. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2:299\u2013312, 2001.\n\n[3] M. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.\n\n[4] R. Gramacy and H. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103:1119\u20131130, 2008.\n\n[5] M. Grzegorczyk, D. Husmeier, K. Edwards, P. Ghazal, and A. Millar. Modelling non-stationary gene regulatory processes with a non-homogeneous Bayesian network and the allocation sampler. Bioinformatics, 24:2071\u20132078, 2008.\n\n[6] M. Heinonen, H. Mannerstr\u00f6m, J. Rousu, S. 
Kaski, and H. L\u00e4hdesm\u00e4ki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In AISTATS, volume 51, pages 732\u2013740, 2016.\n\n[7] D. Higdon, J. Swall, and J. Kern. Non-stationary spatial modeling. Bayesian Statistics, 6:761\u2013768, 1999.\n\n[8] N. Huang. A review on Hilbert-Huang transform: Method and its applications to geophysical studies. Reviews of Geophysics, 46, 2008.\n\n[9] N. Huang, S. Zheng, S. Long, M. Wu, H. Shih, Q. Zheng, N.-Q. Yen, C. Tung, and H. Liu. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 454:903\u2013995, 1998.\n\n[10] Y. Kakihara. A note on harmonizable and V-bounded processes. Journal of Multivariate Analysis, 16:140\u2013156, 1985.\n\n[11] Y.-L. Kom Samo and S. Roberts. Generalized spectral kernels. Technical report, University of Oxford, 2015. arXiv:1506.02236.\n\n[12] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679\u20131704, 2005.\n\n[13] M. L\u00e1zaro-Gredilla, J. Qui\u00f1onero-Candela, C. E. Rasmussen, and A. R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865\u20131881, 2010.\n\n[14] M. Loeve. Probability Theory II, volume 46 of Graduate Texts in Mathematics. Springer, 1978.\n\n[15] C. Paciorek and M. Schervish. Nonstationary covariance functions for Gaussian process regression. In NIPS, pages 273\u2013280, 2004.\n\n[16] C. Paciorek and M. Schervish. Spatial modelling using a new class of nonstationary covariance functions. Environmetrics, 17(5):483\u2013506, 2006.\n\n[17] A. Rahimi and B. Recht. Random features for large-scale kernel machines. 
In Advances in Neural Information Processing Systems, pages 1177\u20131184, 2008.\n\n[18] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[19] O. Rioul and V. Martin. Wavelets and signal processing. IEEE Signal Processing Magazine, 8:14\u201338, 1991.\n\n[20] J. Robinson and A. Hartemink. Non-stationary dynamic Bayesian networks. In Advances in Neural Information Processing Systems, pages 1369\u20131376, 2009.\n\n[21] Y. Saat\u00e7i. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.\n\n[22] P. Sampson and P. Guttorp. Nonparametric estimation of nonstationary spatial covariance structure. Journal of the American Statistical Association, 87, 1992.\n\n[23] R. Silverman. Locally stationary random processes. IRE Transactions on Information Theory, 3:182\u2013187, 1957.\n\n[24] A. Sinha and J. Duchi. Learning kernels with random features. In NIPS, 2016.\n\n[25] J. Snoek, K. Swersky, R. Zemel, and R. Adams. Input warping for Bayesian optimization of non-stationary functions. In ICML, volume 32, pages 1674\u20131682, 2014.\n\n[26] V. Tolvanen, P. Jyl\u00e4nki, and A. Vehtari. Expectation propagation for nonstationary heteroscedastic Gaussian process regression. In Machine Learning for Signal Processing (MLSP), 2014 IEEE International Workshop on, pages 1\u20136. IEEE, 2014.\n\n[27] A. Wilson, E. Gilboa, J. P. Cunningham, and A. Nehorai. Fast kernel learning for multidimensional pattern extrapolation. In NIPS, 2014.\n\n[28] A. Wilson and H. Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775\u20131784, 2015.\n\n[29] A. G. Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge, 2014.\n\n[30] A. G. Wilson and R. Adams. 
Gaussian process kernels for pattern discovery and extrapolation. In ICML, 2013.\n\n[31] A. M. Yaglom. Correlation Theory of Stationary and Related Random Functions: Volume I: Basic Results. Springer Series in Statistics. Springer, 1987.\n\n[32] Z. Yang, A. Smola, L. Song, and A. Wilson. A la carte: Learning fast kernels. In AISTATS, 2015.\n", "award": [], "sourceid": 2426, "authors": [{"given_name": "Sami", "family_name": "Remes", "institution": "Aalto University"}, {"given_name": "Markus", "family_name": "Heinonen", "institution": "Aalto University"}, {"given_name": "Samuel", "family_name": "Kaski", "institution": "Aalto University"}]}