{"title": "Spectral Mixture Kernels for Multi-Output Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 6681, "page_last": 6690, "abstract": "Early approaches to multiple-output Gaussian processes (MOGPs) relied on linear combinations of independent, latent, single-output Gaussian processes (GPs). This resulted in cross-covariance functions with limited parametric interpretation, thus conflicting with the ability of single-output GPs to understand lengthscales, frequencies and magnitudes to name a few. On the contrary, current approaches to MOGP are able to better interpret the relationship between different channels by directly modelling the cross-covariances as a spectral mixture kernel with a phase shift. We extend this rationale and propose a parametric family of complex-valued cross-spectral densities and then build on Cram\u00e9r's Theorem (the multivariate version of Bochner's Theorem) to provide a principled approach to design multivariate covariance functions. The so-constructed kernels are able to model delays among channels in addition to phase differences and are thus more expressive than previous methods, while also providing full parametric interpretation of the relationship across channels. 
The proposed method is first validated on synthetic data and then compared to existing MOGP methods on two real-world examples.", "full_text": "Spectral Mixture Kernels for Multi-Output Gaussian Processes

Gabriel Parra
Department of Mathematical Engineering
Universidad de Chile
gparra@dim.uchile.cl

Felipe Tobar
Center for Mathematical Modeling
Universidad de Chile
ftobar@dim.uchile.cl

Abstract

Early approaches to multiple-output Gaussian processes (MOGPs) relied on linear combinations of independent, latent, single-output Gaussian processes (GPs). This resulted in cross-covariance functions with limited parametric interpretation, thus conflicting with the ability of single-output GPs to understand lengthscales, frequencies and magnitudes, to name a few. On the contrary, current approaches to MOGP are able to better interpret the relationship between different channels by directly modelling the cross-covariances as a spectral mixture kernel with a phase shift. We extend this rationale and propose a parametric family of complex-valued cross-spectral densities, and then build on Cramér's Theorem (the multivariate version of Bochner's Theorem) to provide a principled approach to the design of multivariate covariance functions. The so-constructed kernels are able to model delays among channels in addition to phase differences, and are thus more expressive than previous methods, while also providing full parametric interpretation of the relationship across channels. The proposed method is first validated on synthetic data and then compared to existing MOGP methods on two real-world examples.

1 Introduction

The extension of Gaussian processes (GPs [1]) to multiple outputs is referred to as multi-output Gaussian processes (MOGPs). 
MOGPs model temporal or spatial relationships among infinitely-many random variables, as scalar GPs do, but also account for the statistical dependence across different sources of data (or channels). This is crucial in a number of real-world applications such as fault detection, data imputation and denoising. For any two input points $x, x'$, the covariance function of an m-channel MOGP $k(x, x')$ is a symmetric positive-definite $m \times m$ matrix of scalar covariance functions. The design of this matrix-valued kernel is challenging, since we have to deal with the trade-off between (i) choosing a broad class of $m(m-1)/2$ cross-covariances and $m$ auto-covariances, while at the same time (ii) ensuring positive definiteness of the symmetric matrix containing these $m(m+1)/2$ covariance functions for any pair of inputs $x, x'$. In particular, unlike the widely available families of auto-covariance functions (e.g., [2]), cross-covariances are not bound to be positive definite and can therefore be designed freely; the construction of these functions with interpretable functional form is the main focus of this article.

A classical approach to defining cross-covariances for a MOGP is to linearly combine independent latent GPs; this is the case of the Linear Model of Coregionalization (LMC [3]) and the Convolution Model (CONV [4]). In these cases, the resulting kernel is a function of both the covariance functions of the latent GPs and the parameters of the linear operator considered; this results in symmetric and centred cross-covariances. 
While these approaches are simple, they lack interpretability of the dependencies learnt and force the auto-covariances to have similar behaviour across different channels. The LMC method has also inspired the Cross-Spectral Mixture (CSM) kernel [5], which uses the Spectral Mixture (SM) kernel of [6] within LMC and models phase differences across channels by manually introducing a shift between the cosine and exponential factors of the SM kernel. Despite exhibiting improved performance with respect to previous approaches, the addition of the shift parameter in CSM poses the following question: can the spectral design of multi-output covariance functions be even more flexible?

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

We take a different approach to extending the spectral mixture concept to multiple outputs. Recall that, for stationary scalar-valued GPs, [6] designs the power spectral density (PSD) of the process as a mixture of square-exponential functions and then, supported by Bochner's theorem [7], presents the Spectral Mixture kernel via the inverse Fourier transform of the so-constructed PSD. Along the same lines, our main contribution is to propose an expressive family of complex-valued square-exponential cross-spectral densities, and then build on Cramér's theorem [8, 9], the multivariate extension of Bochner's, to construct the Multi-Output Spectral Mixture kernel (MOSM). 
The proposed multivariate covariance function accounts for all the properties of the Cross-Spectral Mixture kernel in [5], plus a delay component across channels and variable parameters for the auto-covariances of different channels. Additionally, the proposed MOSM provides a clear interpretation of all its parameters in spectral terms. Our experimental contribution includes an illustrative example using a trivariate synthetic signal and validation against all the aforementioned literature using two real-world datasets.

2 Background

Definition 1. A Gaussian process (GP) over the input set $\mathcal{X}$ is a real-valued stochastic process $(f(x))_{x\in\mathcal{X}}$ such that for any finite subset of inputs $\{x_i\}_{i=1}^N \subset \mathcal{X}$, the random variables $\{f(x_i)\}_{i=1}^N$ are jointly Gaussian. Without loss of generality we will choose $\mathcal{X} = \mathbb{R}^n$.

A GP [1] defines a distribution over functions $f(x)$ that is uniquely determined by its mean function $m(x) := \mathbb{E}(f(x))$, typically assumed $m(x) = 0$, and its covariance function (also known as kernel) $k(x, x') := \operatorname{cov}(f(x), f(x'))$, $x, x' \in \mathcal{X}$. We now equip the reader with the necessary background to follow our proposal: we first review a spectral-based approach to the design of scalar-valued covariance kernels, and then present the definition of a multi-output GP.

2.1 The Spectral Mixture kernel

To bypass the explicit construction of positive-definite functions within the design of stationary covariance kernels, it is possible to design the power spectral density (PSD) instead [6] and then transform it into a covariance function using the inverse Fourier transform. This is motivated by the fact that the strict positivity requirement of the PSD is much easier to achieve than the positive-definiteness requirement of the covariance kernel. The theoretical support of this construction is given by the following theorem:

Theorem 1. 
(Bochner's theorem) An integrable¹ function $k : \mathbb{R}^n \to \mathbb{C}$ is the covariance function of a weakly-stationary, mean-square-continuous stochastic process $f : \mathbb{R}^n \to \mathbb{C}$ if and only if it admits the representation
$$k(\tau) = \int_{\mathbb{R}^n} e^{\iota\omega^\top\tau} S(\omega)\,d\omega \tag{1}$$
where $S(\omega)$ is a non-negative bounded function on $\mathbb{R}^n$ and $\iota$ denotes the imaginary unit.

For a proof see [9]. The above theorem gives an explicit relationship between the spectral density $S$ and the covariance function $k$ of the stochastic process $f$. In this sense, [6] proposed to model the spectral density $S$ as a weighted mixture of $Q$ square-exponential functions, with weights $w_q$, centres $\mu_q$ and diagonal covariance matrices $\Sigma_q$, that is,
$$S(\omega) = \sum_{q=1}^{Q} \frac{w_q}{(2\pi)^{n/2}|\Sigma_q|^{1/2}} \exp\left(-\tfrac{1}{2}(\omega-\mu_q)^\top\Sigma_q^{-1}(\omega-\mu_q)\right). \tag{2}$$
Relying on Theorem 1, the kernel associated to the spectral density $S(\omega)$ in eq. (2) is the spectral mixture kernel, defined as follows.

¹A function $g(x)$ is said to be integrable if $\int_{\mathbb{R}^n} |g(x)|\,dx < +\infty$.

Definition 2. A Spectral Mixture (SM) kernel is a positive-definite stationary kernel given by
$$k(\tau) = \sum_{q=1}^{Q} w_q \exp\left(-\tfrac{1}{2}\tau^\top\Sigma_q\tau\right)\cos(\mu_q^\top\tau) \tag{3}$$
where $\mu_q \in \mathbb{R}^n$, $\Sigma_q = \operatorname{diag}(\sigma_1^{(q)}, \ldots, \sigma_n^{(q)})$ and $w_q, \sigma_q \in \mathbb{R}_+$.

Due to the universal approximation property of mixtures of Gaussians (considered here in the frequency domain) and the relationship given by Theorem 1, the SM kernel is able to approximate continuous stationary kernels to arbitrary precision given enough spectral components, as shown in [10, 11]. 
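As a concrete sanity check of this construction, the following sketch (hypothetical parameter values, one input dimension) evaluates the SM kernel of eq. (3) and verifies it against a numerical inverse Fourier transform of the symmetrised spectral density of eq. (2):

```python
import numpy as np

def sm_kernel(tau, w, mu, s2):
    """Spectral Mixture kernel, eq. (3), for one input dimension.
    w, mu, s2: per-component weight, spectral mean and spectral variance."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for wq, mq, s2q in zip(w, mu, s2):
        k += wq * np.exp(-0.5 * s2q * tau**2) * np.cos(mq * tau)
    return k

# hypothetical mixture with Q = 2 components
w, mu, s2 = [1.0, 0.5], [2.0, 5.0], [0.3, 1.0]

# spectral density of eq. (2) on a symmetric frequency grid
omega = np.linspace(-30.0, 30.0, 20001)
S = sum(wq / np.sqrt(2 * np.pi * s2q) * np.exp(-(omega - mq)**2 / (2 * s2q))
        for wq, mq, s2q in zip(w, mu, s2))
S_sym = 0.5 * (S + S[::-1])   # symmetrise so the resulting kernel is real-valued

# inverse Fourier transform by quadrature: k(tau) = integral of cos(omega*tau) S_sym
tau0 = 0.7
k_quad = np.trapz(np.cos(omega * tau0) * S_sym, omega)
```

The quadrature value agrees with the closed form of eq. (3), illustrating that the cosine term is exactly the symmetrised Gaussian mixture seen through the inverse Fourier transform.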
This concept points in the direction of sidestepping the kernel selection problem in GPs, and it will be extended to cater for multivariate GPs in Section 3.

2.2 Multi-Output Gaussian Processes

A multivariate extension of GPs can be constructed by considering an ensemble of scalar-valued stochastic processes where any finite collection of values across all such processes are jointly Gaussian. We formalise this definition as follows.

Definition 3. An m-channel multi-output Gaussian process $f(x) := (f_1(x), \ldots, f_m(x))$, $x \in \mathcal{X}$, is an m-tuple of stochastic processes $f_p : \mathcal{X} \to \mathbb{R}$, $\forall p = 1, \ldots, m$, such that for any (finite) subset of inputs $\{x_i\}_{i=1}^N \subset \mathcal{X}$, the random variables $\{f_{c(i)}(x_i)\}_{i=1}^N$ are jointly Gaussian for any choice of channel indices $c(i) \in \{1, \ldots, m\}$.

Recall that the construction of scalar-valued GPs requires choosing a scalar-valued mean function and a scalar-valued covariance function. Analogously, an m-channel MOGP is defined by an $\mathbb{R}^m$-valued mean function, whose ith element denotes the mean function of the ith channel, and an $\mathbb{R}^{m\times m}$-valued covariance function, whose (i, j)th element denotes the covariance between the ith and jth channels. The symmetry and positive-definiteness conditions of the MOGP kernel are defined as follows.

Definition 4. 
A two-input matrix-valued function $K(x, x') : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{m\times m}$ defined element-wise by $[K(x, x')]_{ij} = k_{ij}(x, x')$ is a multivariate kernel (covariance function) if it is:

(i) Symmetric, i.e., $K(x, x') = K(x', x)^\top$, $\forall x, x' \in \mathcal{X}$, and

(ii) Positive definite, i.e., $\forall N \in \mathbb{N}$, $c \in \mathbb{R}^{N\times m}$, $x \in \mathcal{X}^N$ such that $[c]_{pi} = c_{pi}$, $[x]_p = x_p$, we have
$$\sum_{i,j=1}^{m} \sum_{p,q=1}^{N} c_{pi}\, c_{qj}\, k_{ij}(x_p, x_q) \geq 0. \tag{4}$$

Furthermore, we say that a multivariate kernel $K(x, x')$ is stationary if $K(x, x') = K(x - x')$ or, equivalently, $k_{ij}(x, x') = k_{ij}(x - x')$ $\forall i, j \in \{1, \ldots, m\}$; in this case, we denote $\tau = x - x'$.

The design of the MOGP covariance kernel involves jointly choosing functions that model the covariance of each channel (diagonal elements of K) and functions that model the cross-covariance between different channels at different input locations (off-diagonal elements of K). Choosing these $m(m+1)/2$ covariance functions is challenging when we want to be as expressive as possible and include, for instance, delays, phase shifts, negative correlations, or to enforce specific spectral content, while at the same time maintaining positive definiteness of K. The reader is referred to [12, 13] for a comprehensive review of MOGP models.

3 Designing Multi-Output Gaussian Processes in the Fourier Domain

We extend the spectral-mixture approach [6] to multi-output Gaussian processes relying on the multivariate version of Theorem 1, first proved by Cramér and thus referred to as Cramér's Theorem [8, 9], given below.

Theorem 2. 
(Cramér's Theorem) A family $\{k_{ij}(\tau)\}_{i,j=1}^m$ of integrable functions are the covariance functions of a weakly-stationary multivariate stochastic process if and only if they (i) admit the representation
$$k_{ij}(\tau) = \int_{\mathbb{R}^n} e^{\iota\omega^\top\tau} S_{ij}(\omega)\,d\omega \quad \forall i, j \in \{1, \ldots, m\} \tag{5}$$
where each $S_{ij}$ is an integrable complex-valued function $S_{ij} : \mathbb{R}^n \to \mathbb{C}$, known as the spectral density associated to the covariance function $k_{ij}(\tau)$, and (ii) fulfil the positive-definiteness condition
$$\sum_{i,j=1}^{m} \bar{z}_i z_j S_{ij}(\omega) \geq 0 \quad \forall \{z_1, \ldots, z_m\} \subset \mathbb{C},\ \omega \in \mathbb{R}^n \tag{6}$$
where $\bar{z}$ denotes the complex conjugate of $z \in \mathbb{C}$.

Note that eq. (5) states that each covariance function $k_{ij}$ is the inverse Fourier transform of a spectral density $S_{ij}$; therefore, we will say that these functions are Fourier pairs. Accordingly, we refer to the set of arguments of the covariance functions, $\tau \in \mathbb{R}^n$, as the time or space domain depending on the application considered, and to the set of arguments of the spectral densities, $\omega \in \mathbb{R}^n$, as the Fourier or spectral domain. Furthermore, a direct consequence of the above theorem is that, for any element $\omega$ in the Fourier domain, the matrix defined by $S(\omega) = [S_{ij}(\omega)]_{i,j=1}^m \in \mathbb{C}^{m\times m}$ is Hermitian, i.e., $S_{ij}(\omega) = \overline{S_{ji}(\omega)}$ $\forall i, j, \omega$.

Theorem 2 gives the guidelines to construct covariance functions for MOGPs by designing their corresponding spectral densities instead, i.e., the design is performed in the Fourier rather than the space domain. The simplicity of design in the Fourier domain stems from the positive-definiteness condition on the spectral densities in eq. (6), which is much easier to achieve than that on the covariance functions in eq. (4). 
This can be understood through an analogy with the univariate model: in the single-output case, the positive-definiteness condition on the kernel only requires positivity of the spectral density, whereas in the multi-output case, the positive-definiteness condition on the multivariate kernel only requires that the matrix $S(\omega)$ be positive definite for all $\omega \in \mathbb{R}^n$, with no further constraints on each function $S_{ij} : \omega \mapsto S_{ij}(\omega)$.

3.1 The Multi-Output Spectral Mixture kernel

We now propose a family of Hermitian positive-definite complex-valued functions $\{S_{ij}(\cdot)\}_{i,j=1}^m$, thus fulfilling the requirements of Theorem 2, eq. (6), to use them as cross-spectral densities within MOGPs. This family of functions is designed with the aim of providing physical parametric interpretation and closed-form covariance functions after applying the inverse Fourier transform.

Recall that complex-valued positive-definite matrices can be decomposed in the form $S(\omega) = R^H(\omega)R(\omega)$, meaning that the (i, j)th entry of $S(\omega)$ can be expressed as $S_{ij}(\omega) = R_{:i}^H(\omega)R_{:j}(\omega)$, where $R(\omega) \in \mathbb{C}^{Q\times m}$, $R_{:i}(\omega)$ is the ith column of $R(\omega)$, and $(\cdot)^H$ denotes the Hermitian (transpose and conjugate) operator. Note that this factor decomposition fulfils eq. (6) for any choice of $R(\omega) \in \mathbb{C}^{Q\times m}$:
$$\sum_{i,j=1}^{m} \bar{z}_i R_{:i}^H(\omega) R_{:j}(\omega) z_j = \left\|R(\omega)z\right\|^2 \geq 0 \quad \forall z = [z_1, \ldots, z_m]^\top \in \mathbb{C}^m,\ \omega \in \mathbb{R}^n. \tag{7}$$
We refer to Q as the rank of the decomposition, since by choosing $Q < m$ the rank of $S(\omega) = R^H(\omega)R(\omega)$ can be at most Q. For ease of notation we choose² $Q = 1$, so that the columns of $R(\omega)$ are complex-valued functions $\{R_i\}_{i=1}^m$ and $S(\omega)$ is modelled as a rank-one matrix according to $S_{ij}(\omega) = \overline{R_i(\omega)}\,R_j(\omega)$. 
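A quick numerical illustration of this decomposition (with arbitrary made-up entries for $R(\omega)$ at a single frequency) confirms that $S(\omega) = R^H(\omega)R(\omega)$ is Hermitian, positive semi-definite, and of rank at most Q:

```python
import numpy as np

rng = np.random.default_rng(0)
m, Q = 4, 2                        # channels and rank of the decomposition
# made-up complex-valued matrix R(omega) evaluated at one frequency omega
R = rng.normal(size=(Q, m)) + 1j * rng.normal(size=(Q, m))
S = R.conj().T @ R                 # S(omega) = R^H(omega) R(omega), eq. (7)

herm_gap = np.abs(S - S.conj().T).max()   # Hermitian: S equals its conjugate transpose
min_eig = np.linalg.eigvalsh(S).min()     # positive semi-definite: eigenvalues >= 0
rank = np.linalg.matrix_rank(S)           # rank bounded by Q
```

This is the whole point of designing in the Fourier domain: any choice of $R(\omega)$ whatsoever yields a valid cross-spectral matrix, so the modelling effort can go entirely into expressiveness and interpretability.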
Since Fourier transforms and multiplications of square-exponential (SE) functions are also SE, we model $R_i(\omega)$ as a complex-valued SE function so as to ensure a closed-form expression for its corresponding covariance kernel, that is,
$$R_i(\omega) = w_i \exp\left(-\tfrac{1}{4}(\omega-\mu_i)^\top\Sigma_i^{-1}(\omega-\mu_i)\right)\exp\left(-\iota(\theta_i^\top\omega + \phi_i)\right), \quad i = 1, \ldots, m \tag{8}$$
where $w_i, \phi_i \in \mathbb{R}$, $\mu_i, \theta_i \in \mathbb{R}^n$ and $\Sigma_i = \operatorname{diag}([\sigma_{i1}^2, \ldots, \sigma_{in}^2]) \in \mathbb{R}^{n\times n}$. With this choice of the functions $\{R_i\}_{i=1}^m$, the spectral densities $\{S_{ij}\}_{i,j=1}^m$ are given by
$$S_{ij}(\omega) = w_{ij} \exp\left(-\tfrac{1}{2}(\omega-\mu_{ij})^\top\Sigma_{ij}^{-1}(\omega-\mu_{ij}) + \iota(\theta_{ij}^\top\omega + \phi_{ij})\right), \quad i, j = 1, \ldots, m \tag{9}$$

²The extension to arbitrary Q is presented at the end of this section.

meaning that the cross-spectral density between channels i and j is modelled as a complex-valued SE function with the following parameters:

• covariance: $\Sigma_{ij} = 2\Sigma_i(\Sigma_i + \Sigma_j)^{-1}\Sigma_j$
• mean: $\mu_{ij} = (\Sigma_i + \Sigma_j)^{-1}(\Sigma_i\mu_j + \Sigma_j\mu_i)$
• magnitude: $w_{ij} = w_i w_j \exp\left(-\tfrac{1}{4}(\mu_i - \mu_j)^\top(\Sigma_i + \Sigma_j)^{-1}(\mu_i - \mu_j)\right)$
• delay: $\theta_{ij} = \theta_i - \theta_j$
• phase: $\phi_{ij} = \phi_i - \phi_j$

where the so-constructed magnitudes $w_{ij}$ ensure positive definiteness and, in particular, the auto-spectral densities $S_{ii}$ are real-valued SE functions (since $\theta_{ii} = \phi_{ii} = 0$), as in the standard (scalar-valued) spectral mixture approach [6].

The power spectral density in eq. (9) corresponds to a complex-valued kernel and therefore to a complex-valued GP [14, 15]. 
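These parameters follow from multiplying two (conjugated) complex SE functions; a small sketch with diagonal covariances stored as vectors of variances (all numerical values hypothetical) computes them and checks the $i = j$ case, which must recover the single-channel parameters:

```python
import numpy as np

def cross_spectral_params(wi, mui, Si, thi, phii, wj, muj, Sj, thj, phij):
    """Parameters of the cross-spectral density S_ij between channels i and j.
    Si, Sj hold the diagonals of the (diagonal) spectral covariances."""
    Ssum = Si + Sj
    Sij  = 2.0 * Si * Sj / Ssum                         # covariance: 2 Si (Si+Sj)^-1 Sj
    muij = (Si * muj + Sj * mui) / Ssum                 # mean
    wij  = wi * wj * np.exp(-0.25 * np.sum((mui - muj)**2 / Ssum))  # magnitude
    return wij, muij, Sij, thi - thj, phii - phij       # plus delay and phase

# one hypothetical channel in n = 2 input dimensions, paired with itself
w1, mu1 = 1.2, np.array([1.0, 0.5])
S1, th1, ph1 = np.array([0.4, 0.9]), np.array([0.3, 0.0]), 0.7
wii, muii, Sii, thii, phii = cross_spectral_params(w1, mu1, S1, th1, ph1,
                                                   w1, mu1, S1, th1, ph1)
```

For $i = j$ the formulas collapse to $w_{ii} = w_i^2$, $\mu_{ii} = \mu_i$, $\Sigma_{ii} = \Sigma_i$ and $\theta_{ii} = \phi_{ii} = 0$, which is exactly the real-valued auto-spectral density noted above.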
In order to restrict this generative model to real-valued GPs only, the proposed power spectral density has to be symmetric with respect to $\omega$ [16]; we therefore make $S_{ij}(\omega)$ symmetric simply by reassigning $S_{ij}(\omega) \mapsto \tfrac{1}{2}(S_{ij}(\omega) + S_{ij}(-\omega))$, which is equivalent to choosing $R_i(\omega)$ to be a vector of two mirrored complex SE functions.

The resulting (symmetric with respect to $\omega$) cross-spectral density between the ith and jth channels, $S_{ij}(\omega)$, and its corresponding real-valued kernel $k_{ij}(\tau) = \mathcal{F}^{-1}\{S_{ij}(\omega)\}(\tau)$ are the following Fourier pairs:
$$S_{ij}(\omega) = \frac{w_{ij}}{2}\left(e^{-\frac{1}{2}(\omega-\mu_{ij})^\top\Sigma_{ij}^{-1}(\omega-\mu_{ij}) + \iota(\theta_{ij}^\top\omega + \phi_{ij})} + e^{-\frac{1}{2}(\omega+\mu_{ij})^\top\Sigma_{ij}^{-1}(\omega+\mu_{ij}) - \iota(-\theta_{ij}^\top\omega + \phi_{ij})}\right)$$
$$k_{ij}(\tau) = \alpha_{ij}\exp\left(-\tfrac{1}{2}(\tau+\theta_{ij})^\top\Sigma_{ij}(\tau+\theta_{ij})\right)\cos\left((\tau+\theta_{ij})^\top\mu_{ij} + \phi_{ij}\right) \tag{10}$$
where the magnitude parameter $\alpha_{ij} = w_{ij}(2\pi)^{n/2}|\Sigma_{ij}|^{1/2}$ absorbs the constant resulting from the inverse Fourier transform.

We can again confirm that the auto-covariances ($i = j$) are real-valued and contain square-exponential and cosine factors, as in the scalar SM approach, since $\alpha_{ii} \geq 0$ and $\theta_{ii} = \phi_{ii} = 0$. Conversely, the proposed model for the cross-covariance between different channels ($i \neq j$) allows for (i) both negatively- and positively-correlated signals ($\alpha_{ij} \in \mathbb{R}$), (ii) delayed channels through the delay parameter $\theta_{ij} \neq 0$, and (iii) out-of-phase channels, where the covariance is not symmetric with respect to the delay, for $\phi_{ij} \neq 0$. Fig. 
1 shows cross-spectral densities and their corresponding kernels for different choices of the delay and phase parameters.

Figure 1: Power spectral densities and kernels generated by the proposed model in eq. (10) for different parameters. Bottom: cross-spectral densities, real part in blue and imaginary part in green. Top: cross-covariance functions in blue with a reference SE envelope in dashed line. From left to right: zero delay and zero phase; zero delay and non-zero phase; non-zero delay and zero phase; and non-zero delay and non-zero phase.

The kernel in eq. (10) resulted from a rank-one choice of the PSD matrix $S(\omega)$; therefore, increasing the rank in the proposed model for $S_{ij}$ is equivalent to considering several kernel components. Arbitrarily choosing Q of these components yields the expression for the proposed multivariate kernel:

Definition 5. The Multi-Output Spectral Mixture kernel (MOSM) has the form
$$k_{ij}(\tau) = \sum_{q=1}^{Q} \alpha_{ij}^{(q)} \exp\left(-\tfrac{1}{2}(\tau+\theta_{ij}^{(q)})^\top\Sigma_{ij}^{(q)}(\tau+\theta_{ij}^{(q)})\right)\cos\left((\tau+\theta_{ij}^{(q)})^\top\mu_{ij}^{(q)} + \phi_{ij}^{(q)}\right) \tag{11}$$
where $\alpha_{ij}^{(q)} = w_{ij}^{(q)}(2\pi)^{n/2}|\Sigma_{ij}^{(q)}|^{1/2}$ and the superindex $(\cdot)^{(q)}$ denotes the parameter of the qth component of the spectral mixture.

This multivariate covariance function has spectral-mixture positive-definite kernels as auto-covariances, while the cross-covariances are spectral mixture functions with different parameters for different output pairs, which can be (i) non-positive-definite, (ii) non-symmetric, and (iii) delayed with respect to one another. 
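To make eq. (11) concrete, the following sketch (one input dimension, hypothetical parameter values) evaluates a single-component MOSM cross-covariance and checks that the delay parameter shifts the envelope's peak to $\tau = -\theta_{ij}$, while the auto-covariance case ($\theta = \phi = 0$) is symmetric in $\tau$:

```python
import numpy as np

def mosm_kernel_1d(tau, alpha, Sig, mu, theta, phi):
    """MOSM cross-covariance k_ij(tau) of eq. (11) for one input dimension.
    Each argument is a length-Q sequence of per-component parameters."""
    tau = np.asarray(tau, dtype=float)
    k = np.zeros_like(tau)
    for a, s, m_, t, p in zip(alpha, Sig, mu, theta, phi):
        d = tau + t                      # delayed lag
        k += a * np.exp(-0.5 * s * d**2) * np.cos(m_ * d + p)
    return k

tau = np.linspace(-5.0, 5.0, 2001)
# auto-covariance: zero delay and phase, hence symmetric about tau = 0
k_auto = mosm_kernel_1d(tau, [1.0], [1.0], [2.0], [0.0], [0.0])
# cross-covariance with a hypothetical delay theta = 1.5 (zero frequency and phase)
k_cross = mosm_kernel_1d(tau, [1.0], [1.0], [0.0], [1.5], [0.0])
peak = tau[np.argmax(k_cross)]           # envelope peaks at tau = -theta
```

The shifted peak is exactly the behaviour that distinguishes MOSM from models whose cross-covariances are forced to be centred at $\tau = 0$.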
Therefore, the MOSM kernel is a multi-output generalisation of the spectral mixture approach [6], where positive definiteness is guaranteed by the factor decomposition of $S_{ij}$ as shown in eq. (7).

3.2 Training the model and computing the predictive posterior

Fitting the model to observed data follows the same rationale as for standard GPs, that is, maximising the log-probability of the data. Recall that each observation in the multi-output case consists of (i) a location $x \in \mathcal{X}$, (ii) a channel identifier $i \in \{1, \ldots, m\}$, and (iii) an observed value $y \in \mathbb{R}$; therefore, we denote N observations as the set of 3-tuples $\mathcal{D} = \{(x_c, i_c, y_c)\}_{c=1}^N$. As all observations are jointly Gaussian, we concatenate them into the three vectors $\mathbf{x} = [x_1, \ldots, x_N]^\top \in \mathcal{X}^N$, $\mathbf{i} = [i_1, \ldots, i_N]^\top \in \{1, \ldots, m\}^N$, and $\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N$, to express the negative log-likelihood (NLL) as
$$-\log p(\mathbf{y}|\mathbf{x}, \Theta) = \frac{N}{2}\log 2\pi + \frac{1}{2}\log|K_{\mathbf{xi}}| + \frac{1}{2}\mathbf{y}^\top K_{\mathbf{xi}}^{-1}\mathbf{y} \tag{12}$$
where all hyperparameters are denoted by $\Theta$, and $K_{\mathbf{xi}}$ is the covariance matrix of all observed samples, that is, the (r, s)th element $[K_{\mathbf{xi}}]_{rs}$ is the covariance between the process at (location $x_r$, channel $i_r$) and the process at (location $x_s$, channel $i_s$). Under the proposed MOSM model, this covariance $[K_{\mathbf{xi}}]_{rs}$ is given by eq. (11), that is, $k_{i_r i_s}(x_r - x_s) + \sigma^2_{i_r,\text{noise}}\,\delta_{i_r i_s}$, where $\sigma^2_{i_r,\text{noise}}$ is a diagonal term catering for uncorrelated observation noise. The NLL is then minimised with respect to $\Theta = \{w_i^{(q)}, \mu_i^{(q)}, \Sigma_i^{(q)}, \theta_i^{(q)}, \phi_i^{(q)}, \sigma^2_{i,\text{noise}}\}_{i=1,q=1}^{m,Q}$, that is, the original parameters chosen to construct $R(\omega)$ in Section 3.1, plus the noise hyperparameters.

Once the hyperparameters are optimised, computing the predictive posterior in the proposed MOSM model follows the standard GP procedure with the joint covariances given by eq. (11).

3.3 Related work

Generalising the scalar spectral mixture kernel to MOGPs can be achieved within the LMC framework, as pointed out in [5] (denoted SM-LMC). As this formulation only considers real-valued cross-spectral densities, the authors propose a multivariate covariance function that includes a complex component in the cross-spectral densities to cater for phase differences across channels, which they call the Cross-Spectral Mixture kernel (denoted CSM). This multivariate covariance function can be seen as the proposed MOSM model with $\mu_i = \mu_j$, $\Sigma_i = \Sigma_j$, $\theta_i = \theta_j$ $\forall i, j \in \{1, \ldots, m\}$ and $\phi_i = \mu_i^\top\psi_i$ for $\psi_i \in \mathbb{R}^n$. As a consequence, the SM-LMC is a particular case of the proposed MOSM model, where the parameters $\mu_i, \Sigma_i, \theta_i$ are restricted to be the same for all channels, and therefore no phase shifts and no delays are allowed, unlike the MOSM example in Fig. 1. Additionally, Cramér's theorem has also been used in a similar fashion in [17], but only with real-valued t-Student cross-spectral densities, yielding cross-covariances that are either positive-definite or negative-definite.

4 Experiments

We present two sets of experiments. First, we validated the ability of the proposed MOSM model to identify known auto- and cross-covariances from synthetic data. Second, we compared MOSM against the spectral-mixture linear model of coregionalization (SM-LMC, [3, 6, 5]), the Gaussian convolution model (CONV, [4]), and the cross-spectral mixture model (CSM, [5]) in the estimation of missing real-world data in two different settings: climate signals and metal concentrations. 
All models were implemented in TensorFlow [18] using GPflow [19], in order to make use of automatic differentiation to compute the gradients of the NLL. The performance of all models was measured by the mean absolute error, given by
$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i| \tag{13}$$
where $y_i$ denotes the true value and $\hat{y}_i$ the MOGP estimate.

4.1 Synthetic example: Learning derivatives and delayed signals

All models were used to recover the auto- and cross-covariances of a three-output GP with the following components: (i) a reference signal sampled from a GP $f(x) \sim \mathcal{GP}(0, K_{SM})$ with spectral mixture covariance kernel $K_{SM}$ and zero mean, (ii) its derivative $f'(x)$, and (iii) a delayed version $f_\delta(x) = f(x - \delta)$. The motivation for this illustrative example is that the covariances and cross-covariances of the aforementioned processes are known explicitly (see [1, Sec. 9.4]), and we can therefore compare our estimates to the true model. The derivative was computed numerically (first order, through finite differences) and the training samples were generated as follows: we chose $N_1 = 500$ samples from the reference function in the interval [-20, 20], $N_2 = 400$ samples from the derivative signal in the interval [-20, 0], and $N_3 = 400$ samples from the delayed signal in the interval [-20, 0]. All samples were chosen uniformly at random in the intervals mentioned, and Gaussian noise was added to yield realistic observations. The experiment then consisted in the reconstruction of the reference signal over the interval [-20, 20], and the imputation of the derivative and delayed signals over the interval [0, 20].

Fig. 2 shows the ground truth and the MOSM estimates for all three synthetic signals and the (normalised) covariances, and Table 1 reports the MAE for all models over ten realisations of the experiment. 
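For reference, the ground-truth cross-covariance between a GP and its derivative, against which the learnt kernels are compared, is obtained by differentiating the kernel; a minimal sketch for an SE kernel, with a finite-difference check (lengthscale value hypothetical):

```python
import numpy as np

ell = 1.3                                    # hypothetical SE lengthscale

def k_se(x, xp):
    """SE covariance cov(f(x), f(x'))."""
    return np.exp(-0.5 * (x - xp)**2 / ell**2)

def k_f_df(x, xp):
    """cov(f(x), f'(x')) = d/dx' k(x, x').  This function is antisymmetric
    in its arguments, i.e. not a valid auto-covariance, which illustrates
    why cross-covariances need not be positive definite."""
    return (x - xp) / ell**2 * k_se(x, xp)

# central finite-difference check of the analytic cross-covariance
x, xp, h = 0.4, -0.9, 1e-6
fd = (k_se(x, xp + h) - k_se(x, xp - h)) / (2 * h)
```

The same differentiation argument (see [1, Sec. 9.4]) yields the full ground-truth covariance model used in this experiment.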
Notice that the proposed model successfully learnt all cross-covariances, $\operatorname{cov}(f(x), f'(x))$ and $\operatorname{cov}(f(x), f(x-\delta))$, as well as all auto-covariances, without prior information about the derivative or delayed relationship between the channels. Furthermore, MOSM was the only model that successfully extrapolated the derivative signal and the delayed signal simultaneously; this is due to the fact that the cross-covariances needed for this setting are not linear combinations of univariate kernels, hence models based on latent processes fail in this synthetic example.

Figure 2: MOSM learning of the covariance functions of a synthetic reference signal, its derivative and a delayed version. Left: synthetic signals; middle: auto-covariances; right: cross-covariances. The dashed line is the ground truth, the solid colour lines are the MOSM estimates, and the shaded area is the 95% confidence interval. The training points are shown in green.

Table 1: Reconstruction of a synthetic signal, its derivative and delayed version: mean absolute error for all four models with one-standard-deviation error bars over ten realisations.

Model | Reference | Derivative | Delayed
CONV | 0.211 ± 0.085 | 0.759 ± 0.075 | 0.524 ± 0.097
SM-LMC | 0.166 ± 0.009 | 0.747 ± 0.101 | 0.398 ± 0.042
CSM | 0.148 ± 0.010 | 0.262 ± 0.032 | 0.368 ± 0.089
MOSM | 0.127 ± 0.011 | 0.223 ± 0.015 | 0.146 ± 0.017

4.2 Climate data

The first real-world dataset contained measurements³ from a sensor network of four climate stations in the south of England: Cambermet, Chimet, Sotonmet and Bramblemet. We considered the normalised air temperature signal from 12 March 2017 to 16 March 2017, in 5-minute intervals (5692 samples), from which we randomly chose N = 1000 samples for training. Following [4], we simulated a sensor failure by removing the second half of the measurements for one sensor while leaving the remaining three sensors operating correctly; we reproduced the same setup across all four sensors, thus producing four experiments. All models considered had five latent signals/spectral components.

For all four models considered, Fig. 3 shows the estimates of the missing data for the Cambermet-failure case. Table 2 shows the mean absolute error for all models and failure cases over the missing-data region. Observe how all models were able to capture the behaviour of the signal in the missing range; this is because the considered climate signals are very similar to one another. This shows that MOSM can also collapse to models that share parameters across pairs of outputs when required.

Figure 3: Imputation of the Cambermet sensor measurements using the remaining sensors. The red points denote the observations, the dashed black line the true signal, and the solid colour lines the predictive means. From left to right: MOSM, CONV, SM-LMC and CSM.

Table 2: Imputation of the climate sensor measurements using the remaining sensors. 
Mean absolute error for all four experiments with one-standard-deviation error bars over ten realisations.

Model     Cambermet       Chimet          Sotonmet        Bramblemet
CONV      0.098 ± 0.008   0.192 ± 0.015   0.211 ± 0.038   0.163 ± 0.009
SM-LMC    0.084 ± 0.004   0.176 ± 0.003   0.273 ± 0.001   0.134 ± 0.002
CSM       0.094 ± 0.003   0.129 ± 0.004   0.195 ± 0.011   0.130 ± 0.004
MOSM      0.097 ± 0.006   0.137 ± 0.007   0.162 ± 0.011   0.129 ± 0.003

These results do not show a significant difference between the proposed model and the models based on latent processes. To test for statistical significance, the Kolmogorov-Smirnov test [20, Ch. 7] was used with a significance level α = 0.05, concluding that for the Sotonmet sensor the MOSM model yields the best results. Conversely, for the Cambermet, Chimet and Bramblemet sensors, MOSM and CSM provided similar results, and we cannot confirm that their difference is statistically significant. However, given the high correlation of these signals and the similarity between the MOSM model and the CSM model, the close performance of these two models on this dataset is to be expected.

³The data can be obtained from www.cambermet.co.uk and the sites therein.

[Figure 3 panels, left to right: MOSM, CONV, SM-LMC and CSM imputations with 95% CI; axes: Time [days] vs Temperature [°C].]

4.3 Heavy metal concentration

The Jura dataset [3] contains, in addition to other geological data, the concentration of seven heavy metals in a region of 14.5 km² of the Swiss Jura, and it is divided into a training set (259 locations) and a validation set (100 locations). We followed [3, 4], where the motivation was to aid the prediction of a variable that is expensive to measure by using abundant measurements of correlated variables which are less expensive to acquire.
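The per-sensor comparisons above use the two-sample Kolmogorov-Smirnov test at significance level α = 0.05 on the per-realisation errors of two models. A minimal self-contained sketch of that test follows; the function names and error values are illustrative (they are not the paper's raw errors), and the critical value is the standard asymptotic KS approximation:

```python
import math

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b."""
    points = sorted(set(a) | set(b))
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

def rejects_at_005(a, b):
    # Asymptotic critical value at alpha = 0.05: c(alpha) * sqrt((n+m)/(n*m)),
    # with c(0.05) ~= 1.358 from standard KS tables.
    n, m = len(a), len(b)
    return ks_two_sample(a, b) > 1.358 * math.sqrt((n + m) / (n * m))

# Hypothetical per-realisation absolute errors for two models over ten runs
# (illustrative numbers only): the two samples do not overlap at all, so the
# test rejects the hypothesis that the error distributions are equal.
mosm_errors = [0.150, 0.155, 0.160, 0.158, 0.162, 0.165, 0.159, 0.161, 0.157, 0.163]
csm_errors = [0.190, 0.195, 0.200, 0.198, 0.192, 0.205, 0.193, 0.199, 0.196, 0.201]
different = rejects_at_005(mosm_errors, csm_errors)  # True for these samples
```

With only ten realisations per model, the asymptotic critical value is an approximation; `scipy.stats.ks_2samp` provides small-sample p-values if available.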
Specifically, we estimated Cadmium and Copper at the validation locations using measurements of related variables at the training and test locations: Nickel and Zinc for Cadmium; and Lead, Nickel and Zinc for Copper. The MAE (see eq. (13)) is shown in Table 3, where the results for the CONV model were obtained from [4] and all models considered five latent signals/spectral components, except for the independent Gaussian process (denoted IGP).

Observe how the proposed MOSM model outperforms all other models on the Cadmium data, a result that is statistically significant at significance level α = 0.05. Conversely, we cannot guarantee a statistically-significant difference between the CSM model and MOSM in the Copper case. In both cases, testing for statistical significance against the CONV model was not possible, since those results were obtained from [4]. On the other hand, the higher variability and non-Gaussianity of the Copper data may be the reason why the simplest MOGP model (SM-LMC) achieves the best results.

Table 3: Mean absolute error for the estimation of Cadmium and Copper concentrations with one-standard-deviation error bars over ten repetitions of the experiment.

Model     Cadmium         Copper
IGP       0.56 ± 0.005    16.5 ± 0.1
CONV      0.443 ± 0.006   7.45 ± 0.2
SM-LMC    0.46 ± 0.01     7.0 ± 0.1
CSM       0.47 ± 0.02     7.4 ± 0.3
MOSM      0.43 ± 0.01     7.3 ± 0.1

5 Discussion

We have proposed the multi-output spectral mixture (MOSM) kernel to model rich relationships across multiple outputs within Gaussian process regression models. This has been achieved by constructing a positive-definite matrix of complex-valued spectral densities, and then transforming them via the inverse Fourier transform according to Cramér's Theorem.
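The construction just described, a complex-valued cross-spectral density mapped to a covariance via the inverse Fourier transform, can be sketched numerically. The density below is a single illustrative Gaussian component whose parameter names and values (w, mu, sigma, theta, phi) are assumptions for this sketch rather than the paper's exact parameterisation; the point is that the delay theta reappears as the centre of the cross-covariance envelope, which is what makes it interpretable:

```python
import numpy as np

# Illustrative complex cross-spectral density: a Gaussian bump at frequency mu
# with width sigma and weight w, carrying a delay theta (linear phase term)
# and a constant phase shift phi. Parameter values are assumptions.
def cross_spectral_density(omega, w=1.0, mu=2.0, sigma=0.5, theta=1.0, phi=0.3):
    bump = w * np.exp(-0.5 * ((omega - mu) / sigma) ** 2)
    return bump * np.exp(-1j * (omega * theta + phi))  # delay and phase

def inverse_fourier(taus, omegas, density):
    # Riemann-sum inverse Fourier transform k(tau) = Re ∫ S(w) e^{i w tau} dw;
    # taking the real part amounts to symmetrising S with its conjugate.
    dw = omegas[1] - omegas[0]
    phases = np.exp(1j * np.outer(omegas, taus))
    return np.real((density[:, None] * phases).sum(axis=0) * dw)

omegas = np.linspace(-10.0, 10.0, 2001)
taus = np.linspace(-5.0, 5.0, 1001)
k = inverse_fourier(taus, omegas, cross_spectral_density(omegas))
# Analytically k(tau) is proportional to
#   exp(-sigma^2 (tau - theta)^2 / 2) * cos(mu (tau - theta) - phi),
# so the envelope is centred at the delay theta and phi shifts the oscillation.
peak_lag = taus[int(np.argmax(k))]  # near theta + phi/mu for these values
```

This mirrors the interpretability claim in the text: delay and phase for a pair of channels can be read directly off the learnt cross-covariance.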
The resulting kernel provides a clear interpretation from a spectral viewpoint, where each of its parameters can be identified with frequency, magnitude, phase and delay for a pair of channels. Furthermore, a key feature that is unique to the proposed kernel is the ability to jointly model delays and phase differences; this is possible due to the complex-valued model adopted for the cross-spectral density, and was validated experimentally on a synthetic example (see Fig. 2). The MOSM kernel has also been compared against existing MOGP models on two real-world datasets, where the proposed model performed competitively in terms of the mean absolute error. Further research should point towards a sparse implementation of the proposed MOGP, which can build on [4, 21] to design inducing variables that exploit the spectral content of the processes as in [22, 23].

Acknowledgements

We thank Cristóbal Silva (Universidad de Chile) for useful recommendations about GPU implementation, Rasmus Bonnevie from the GPflow team for his assistance on the experimental MOGP module within GPflow, and the anonymous reviewers. This work was financially supported by Conicyt Basal-CMM.

References

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, 2006.
[2] D. Duvenaud, "Automatic model construction with Gaussian processes," Ph.D. dissertation, University of Cambridge, 2014.
[3] P. Goovaerts, Geostatistics for Natural Resources Evaluation. Oxford University Press on Demand, 1997.
[4] M. A. Álvarez and N. D. Lawrence, "Sparse convolved Gaussian processes for multi-output regression," in Advances in Neural Information Processing Systems 21, 2008, pp. 57–64.
[5] K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin, "GP kernels for cross-spectrum analysis," in Advances in Neural Information Processing Systems 28, 2015, pp.
1999–2007.

[6] A. G. Wilson and R. P. Adams, "Gaussian process kernels for pattern discovery and extrapolation," in Proceedings of the 30th International Conference on Machine Learning (ICML-13), 2013, pp. 1067–1075.
[7] S. Bochner, M. Tenenbaum, and H. Pollard, Lectures on Fourier Integrals, ser. Annals of Mathematics Studies. Princeton University Press, 1959.
[8] H. Cramér, "On the theory of stationary random processes," Annals of Mathematics, pp. 215–230, 1940.
[9] A. Yaglom, Correlation Theory of Stationary and Related Random Functions, vol. 1. Springer, 1987.
[10] F. Tobar, T. D. Bui, and R. E. Turner, "Learning stationary time series using Gaussian processes with nonparametric kernels," in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 3501–3509.
[11] F. Tobar and R. E. Turner, "Modelling time series via automatic learning of basis functions," in Proc. of IEEE SAM, 2016, pp. 2209–2213.
[12] M. A. Álvarez, L. Rosasco, and N. D. Lawrence, "Kernels for vector-valued functions: A review," Found. Trends Mach. Learn., vol. 4, no. 3, pp. 195–266, Mar. 2012.
[13] M. G. Genton and W. Kleiber, "Cross-covariance functions for multivariate geostatistics," Institute of Mathematical Statistics, vol. 30, no. 2, 2015.
[14] F. Tobar and R. E. Turner, "Modelling of complex signals using Gaussian processes," in Proc. of IEEE ICASSP, 2015, pp. 2209–2213.
[15] R. Boloix-Tortosa, F. J. Payán-Somet, and J. J. Murillo-Fuentes, "Gaussian processes regressors for complex proper signals in digital communications," in Proc. of IEEE SAM, 2014, pp. 137–140.
[16] S. M. Kay, Modern Spectral Estimation: Theory and Application. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[17] T.
Gneiting, W. Kleiber, and M. Schlather, "Matérn cross-covariance functions for multivariate random fields," Journal of the American Statistical Association, vol. 105, no. 491, pp. 1167–1177, 2010.
[18] M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[19] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, "GPflow: A Gaussian process library using TensorFlow," 2016.
[20] J. W. Pratt and J. D. Gibbons, Concepts of Nonparametric Theory. Springer Science & Business Media, 2012.
[21] M. A. Álvarez, D. Luengo, M. K. Titsias, and N. D. Lawrence, "Efficient multioutput Gaussian processes through variational inducing kernels," in AISTATS, vol. 9, 2010, pp. 25–32.
[22] J. Hensman, N. Durrande, and A. Solin, "Variational Fourier features for Gaussian processes," arXiv preprint arXiv:1611.06740, 2016.
[23] F. Tobar, T. D. Bui, and R. E. Turner, "Design of covariance functions using inter-domain inducing variables," in NIPS 2015 Time Series Workshop, December 2015.