{"title": "Multi-resolution Multi-task Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 14025, "page_last": 14035, "abstract": "We consider evidence integration from potentially dependent observation processes under varying spatio-temporal sampling resolutions and noise levels. We offer a multi-resolution multi-task (MRGP) framework that allows for both inter-task and intra-task multi-resolution and multi-fidelity. We develop shallow Gaussian Process (GP) mixtures that approximate the difficult to estimate joint likelihood with a composite one and deep GP constructions that naturally handle biases. In doing so, we generalize existing approaches and offer information-theoretic corrections and efficient variational approximations. We demonstrate the competitiveness of MRGPs on synthetic settings and on the challenging problem of hyper-local estimation of air pollution levels across London from multiple sensing modalities operating at disparate spatio-temporal resolutions.", "full_text": "Multi-resolution Multi-task Gaussian Processes\n\nDepartment of Computer Science\n\nDepts. of Computer Science & Statistics\n\nTheodoros Damoulas\nThe Alan Turing Institute\n\nUniversity of Warwick\n\ntdamoulas@turing.ac.uk\n\nOliver Hamelijnck\n\nThe Alan Turing Institute\n\nUniversity of Warwick\n\nohamelijnck@turing.ac.uk\n\nKangrui Wang\n\nThe Alan Turing Institute\nDepartment of Statistics\nUniversity of Warwick\nkwang@turing.ac.uk\n\nMark Girolami\n\nThe Alan Turing Institute\nDepartment of Engineering\nUniversity of Cambridge\n\nmgirolami@turing.ac.uk\n\nAbstract\n\nWe consider evidence integration from potentially dependent observation processes\nunder varying spatio-temporal sampling resolutions and noise levels. We offer a\nmulti-resolution multi-task (MRGP) framework that allows for both inter-task and\nintra-task multi-resolution and multi-\ufb01delity. 
We develop shallow Gaussian Process (GP) mixtures that approximate the difficult-to-estimate joint likelihood with a composite one and deep GP constructions that naturally handle biases. In doing so, we generalize existing approaches and offer information-theoretic corrections and efficient variational approximations. We demonstrate the competitiveness of MRGPs on synthetic settings and on the challenging problem of hyper-local estimation of air pollution levels across London from multiple sensing modalities operating at disparate spatio-temporal resolutions.\n\n1 Introduction\n\nThe increased availability of ground and remote sensor networks coupled with new sensing modalities, arising from e.g. citizen science initiatives and mobile platforms, is creating new challenges for performing formal evidence integration. These multiple observation processes and sensing modalities can be dependent, with different signal-to-noise ratios and varying sampling resolutions across space and time. In our motivating application, London authorities measure air pollution from multiple sensor networks: high-fidelity ground sensors that provide frequent multi-pollutant readings, low-fidelity diffusion tubes that only provide monthly single-pollutant readings, hourly satellite-derived information at large spatial scales, and high-frequency medium-fidelity multi-pollutant sensor networks. Such a multi-sensor multi-resolution multi-task evidence integration setting is becoming prevalent across any real-world application of machine learning.\n\nThe current state of the art (see also Section 5) assumes independent and unbiased observation processes and cannot handle the challenges of real-world settings that are jointly non-stationary, multi-task, multi-fidelity, and multi-resolution [2, 7, 14, 22, 23, 28, 29].
The latter challenge has recently attracted the interest of the machine learning community in the context of working with aggregate, binned observations [2, 14, 29] or the special case of natural language generation at multiple levels of abstraction [28]. When the independence and unbiasedness assumptions are not satisfied they lead to posterior contraction, degradation of predictive performance and insufficient uncertainty quantification.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nIn this paper we introduce a multi-resolution multi-task GP framework that can integrate evidence from observation processes with varying support (e.g. partially overlapping in time and space), that can be dependent and biased, while allowing for both inter-task and intra-task multi-resolution and multi-fidelity. Our first contribution is a shallow GP mixture, MR-GPRN, that corrects for the dependency between observation processes through composite likelihoods and extends the Gaussian aggregation model of Law et al. [14], the multi-task GP model of Wilson et al. [33], and the variational lower bound of Nguyen and Bonilla [19]. Our second contribution is a multi-resolution deep GP composition that can additionally handle biases in the observation processes and extends the deep GP models and variational lower bounds of Damianou and Lawrence [5] and Salimbeni and Deisenroth [27] to varying-support, multi-resolution data. Lastly, we demonstrate the superiority of our models on synthetic problems and on the challenging spatio-temporal setting of predicting air pollution in London at hyper-local resolution.\n\nSections 3 and 4 introduce our shallow GP mixtures and deep GP constructions respectively. In Section 6 we demonstrate the empirical advantages of our framework versus the prior art, followed by additional related work in Section 5 and our concluding remarks.
Further analysis is provided in the Appendix with code available at https://github.com/ohamelijnck/multi_res_gps.\n\n2 Multi-resolution Multi-task Learning\n\nConsider A ∈ N observation processes Y_a ∈ IR^{N_a×P} across P tasks with N_a observations. Each process may be observed at varying resolutions that arise as the volume average over a sampling area S_a. Typically we discretise the area S_a and so we overload S_a to denote these points. We construct A datasets {(X_a, Y_a)}_{a=1}^A, ordered by resolution size (Y_1 is the highest, Y_A is the lowest), where X_a ∈ IR^{N_a×|S_a|×D_a} and D_a is the input dimension. For notational simplicity we assume that all tasks are observed across all observation processes, although this need not be the case.\n\nIn our motivating application there are multiple sensor networks (observation processes) measuring multiple air pollutants (tasks), such as CO2, NO2, PM10 and PM2.5, at different sampling resolutions. These multi-resolution observations exist both within tasks (intra-task multi-resolution), when different sensor networks measure the same pollutant, and across tasks (inter-task multi-resolution), when different sensor networks measure different but potentially correlated pollutants due to e.g. common emission sources. Our goal is to develop scalable, non-stationary, non-parametric models for air pollution while delivering accurate estimation and uncertainty quantification.\n\n3 Multi-Resolution Gaussian Process Regression Networks (MR-GPRN)\n\nWe first introduce a shallow instantiation of the multi-resolution multi-task framework. MR-GPRN is a shallow GP mixture, Fig. 1, that extends the Gaussian process regression network (GPRN) [33]. Briefly, the GPRN jointly models all tasks by introducing Q ∈ N latent GPs that act as a basis for the P tasks. These GPs are combined using task-specific weights, themselves GPs, resulting in PQ ∈ N latent weights W_{p,q}.
More formally, f_q ∼ GP(0, K^f_q), W_{p,q} ∼ GP(0, K^w_{p,q}), and each task p is modelled as Y_p = Σ_{q=1}^Q W_{p,q} ⊙ f_q + ε_p, where ⊙ is the Hadamard product and ε_p ∼ N(0, σ²_p I). The GPRN is an extension of the Linear Coregionalization Model (LCM) [3] and can enable the learning of non-stationary processes through input-dependent weights [1].\n\n3.1 Model Specification\n\nWe extend the GPRN model to handle multi-resolution observations by integrating the latent process over the sampling area for each observation. Apart from the standard inter-task dependency we would ideally want to be able to model additional dependencies between observation processes such as, for example, correlated noises. Directly modelling this additional dependency can quickly become intractable, due to the fact that it can vary in input space. If one ignores this dependency by assuming a product likelihood, as in [14, 18], then when the assumption is violated the misspecification results in severe posterior contraction (see Fig. 2). To circumvent these extremes we approximate the full likelihood using a multi-resolution composite likelihood that corrects for the misspecification [31].
Algorithm 1 Inference of MR-GPRN\nInput: A multi-resolution datasets {(X_a, Y_a)}_{a=1}^A, initial parameters θ\nθ̂ ← argmax_θ Σ_{a=1}^A ℓ(Y_a | θ)\nH ← ∇²ℓ(Y | θ̂),  J ← Σ_{a=1}^A ∇ℓ(Y_a | θ̂)(∇ℓ(Y_a | θ̂))^T\nφ ← |θ̂| / Tr[H(θ̂)^{-1} J(θ̂)]  or  φ ← Tr[H(θ̂) J(θ̂)^{-1} H(θ̂)] / Tr[H(θ̂)]\nθ_1 ← argmin_θ ( Σ_{a=1}^A φ E_q[ℓ(Y_a | θ)] + KL )\n\nFigure 1: Left: Graphical model of MR-GPRN for A observation processes each with |P_a| tasks. This allows multi-resolution learning between and across tasks. Right: Inference for MR-GPRN.\n\nThe posterior over the latent functions is now:\n\np(W, f | Y) ∝ [ ∏_{a=1}^A ∏_{p=1}^P ∏_{n=1}^{N_a} N( Y_{a,p,n} | (1/|S_{a,n}|) ∫_{S_{a,n}} Σ_{q=1}^Q W_{p,q}(x) ⊙ f_q(x) dx, σ²_{a,p} I )^φ ] · p(W, f)    (1)\n\nwhere the bracketed term is the MR-GPRN composite likelihood, p(W, f) is the GPRN prior, and φ ∈ IR_{>0} are the composite weights that are critical for inference. The integral within the multi-resolution likelihood links the underlying latent process to each of the resolutions; in general it is not available in closed form and so we approximate it by discretizing over a uniform grid.
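The uniform-grid discretization of the aggregation integral in Eq. 1 can be sketched as follows; this is a minimal single-weight illustration with placeholder latent functions, not the paper's implementation:

```python
import numpy as np

def aggregated_mean(w, f, region, n_grid=50):
    """Approximate (1/|S|) * integral_S w(x) * f(x) dx by averaging the
    weighted latent function over a uniform grid covering the region S."""
    lo, hi = region
    grid = np.linspace(lo, hi, n_grid)
    return np.mean(w(grid) * f(grid))

# Example: constant weight and a linear latent function on S = [0, 2];
# the area-averaged value of 3x over [0, 2] is 3.
w = lambda x: np.ones_like(x)
f = lambda x: 3.0 * x
print(aggregated_mean(w, f, (0.0, 2.0)))  # ≈ 3.0
```

Each observation Y_{a,p,n} is then modelled as Gaussian around such a grid average, which is how the varying sampling areas S_a enter the likelihood.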
When we only have one task and W becomes a vector of constants we denote the model as MR-GP.\n\n3.2 Composite Likelihood Weights\n\nUnder a misspecified model the asymptotic distribution of the MLE estimate converges to N(θ_0, (1/n) H(θ_0)^{-1} J(θ_0) H(θ_0)^{-1}), where θ_0 are the true parameters and H(θ_0) = (1/n) Σ_{n=1}^N ∇²ℓ(Y | θ_0) and J(θ_0) = (1/n) Σ_{n=1}^N ∇ℓ(Y | θ_0) ∇ℓ(Y | θ_0)^T are the Hessian and the Jacobian (variability) matrix respectively. The form of the asymptotic variance is the sandwich information matrix and it represents the loss of information in the MLE estimate due to the failure of Bartlett's second identity [31]. Following Lyddon et al. [16] and Ribatet [26] we write down the asymptotic posterior of MR-GPRN as N(θ_0, n^{-1} φ^{-1} H(θ_0)^{-1}). In practice we only consider a subset of parameters that are present in all likelihood terms, such as the kernel parameters. Asymptotically one would expect the contribution of the prior to vanish, causing the asymptotic posterior to match the limiting MLE. The composite weights φ can be used to bring these distributions as close together as possible. Approximating θ_0 with the MLE estimate θ̂ and setting φ^{-1} H(θ̂) = H(θ̂) J(θ̂)^{-1} H(θ̂) we can rearrange to find φ and recover the magnitude correction of Ribatet [26]. Instead, if we take traces and then rearrange, we recover the correction of Lyddon et al. [16]:\n\nφ_Ribatet = |θ̂| / Tr[H(θ̂)^{-1} J(θ̂)],    φ_Lyddon = Tr[H(θ̂) J(θ̂)^{-1} H(θ̂)] / Tr[H(θ̂)].    (2)\n\n3.3 Inference\n\nIn this section we present a closed-form variational lower bound for MR-GPRN; the full details can be found in the Appendix.
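The corrections in Eq. 2 are cheap to compute once the Hessian H and variability matrix J have been estimated at the MLE; a minimal sketch with illustrative matrices (not the paper's code):

```python
import numpy as np

def composite_weights(H, J):
    """Composite likelihood weights of Eq. 2 given the Hessian H and the
    variability (gradient outer-product) matrix J at the MLE."""
    d = H.shape[0]
    phi_ribatet = d / np.trace(np.linalg.solve(H, J))               # |theta| / Tr[H^-1 J]
    phi_lyddon = np.trace(H @ np.linalg.solve(J, H)) / np.trace(H)  # Tr[H J^-1 H] / Tr[H]
    return phi_ribatet, phi_lyddon

# Under correct specification J = H and both corrections reduce to 1.
H = np.array([[2.0, 0.3], [0.3, 1.5]])
print(composite_weights(H, H))  # both ≈ 1
```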
For computational efficiency we introduce inducing points (see [10, 30]) U = {u_q}_{q=1}^Q and V = {v_{p,q}}_{p,q=1}^{P,Q} for the latent GPs f and W respectively, where u_q ∈ IR^M and v_{p,q} ∈ IR^M. The inducing points are at the corresponding locations Z^{(u)} = {Z^{(u)}_q}_{q=1}^Q and Z^{(v)} = {Z^{(v)}_{p,q}}_{p,q=1}^{P,Q} for Z^{(·)}_· ∈ IR^{M×D}. We construct the augmented posterior and use the approximate posterior q(u, v, f, W) = p(f, W | u, v) q(u, v), where\n\nq(u, v) = Σ_{k=1}^K π_k ∏_{j=1}^Q N(m^{(u)}_{k,j}, S^{(u)}_{k,j}) · ∏_{i,j=1}^{P,Q} N(m^{(v)}_{k,i,j}, S^{(v)}_{k,i,j})    (3)\n\nis a free-form mixture of Gaussians with K components.\n\nFigure 2: Left: MR-GPRN recovers the true predictive variance whereas a product likelihood assumption leads to posterior contraction. Right: MR-DGP recovers the true predictive mean under a multi-resolution setting with scaling biases. Both VBAGG-NORMAL and MR-GPRN fail as they propagate the bias. Black crosses and lines denote observed values. Grey crosses denote observations removed for testing.\n\nWe follow the variational derivation of [13, 21] and derive our expected log-likelihood ELL = Σ_{a=1}^A Σ_{p=1}^P Σ_{n=1}^{N_a} Σ_{k=1}^K ELL_{a,p,n,k}, where\n\nELL_{a,p,n,k} = π_k log N( Y_{a,p,n} | (1/|S_{a,n}|) Σ_{x∈S_{a,n}} Σ_{q=1}^Q μ^{(w)}_{k,p,q}(x) μ^{(f)}_{k,q}(x), σ²_{a,p} ) − (π_k / 2σ²_{a,p}) (1/|S_{a,n}|²) Σ_{x_1,x_2} Σ_{q=1}^Q [ Σ^{(w)}_{k,p,q} Σ^{(f)}_{k,q} + μ^{(f)}_{k,q}(x_1) Σ^{(w)}_{k,p,q} μ^{(f)}_{k,q}(x_2) + μ^{(w)}_{k,p,q}(x_1) Σ^{(f)}_{k,q} μ^{(w)}_{k,p,q}(x_2) ]    (4)\n\nwhere each Σ^{(·)} is evaluated at the points x_1, x_2, and μ^{(f)}_k, Σ^{(f)}_k and μ^{(w)}_{k,p}, Σ^{(w)}_{k,p} are respectively the means and variances of q_k(f) and q_k(W_p). To infer the composite weights we follow [16, 26] and first obtain the MLE estimate of θ by maximizing the likelihood in Eq. 1. The weights can then be calculated and the variational lower bound optimised as in Alg. 1, with complexity O(E·(PQ + Q)NM²) for E optimization steps until convergence. Our closed-form ELBO generalizes the prior state of the art for the GPRN ([1, 13, 19]) by extending it to support multi-resolution data and by allowing a free-form mixture of Gaussians variational posterior. In the Appendix we also provide variational lower bounds for the positively-restricted GPRN form Y_p = Σ_{q=1}^Q exp(W_{p,q}) ⊙ f_q + ε that we find can improve identifiability and predictive performance.\n\n3.4 Prediction\n\nAlthough the full predictive distribution of a specific observation process is not available in closed form, using the variational posterior we derive the predictive mean and variance, avoiding Monte Carlo estimates. The mean is simply E[Y*_{a,p}] = Σ_{k=1}^K π_k E_k[W*_p] E_k[f̂*], where K is the number of components in the mixture of Gaussians variational posterior and π_k is the k'th weight.
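Both the second-order terms of Eq. 4 and these predictive moments rely on the moments of a product of independent Gaussians, E[wf] = μ_w μ_f and Var[wf] = Σ_w Σ_f + μ_f² Σ_w + μ_w² Σ_f; a quick Monte Carlo sanity check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_w, s_w = 1.5, 0.4    # mean/variance of w (illustrative)
mu_f, s_f = -0.7, 0.9   # mean/variance of f (illustrative)

# Closed-form moments of the product of two independent Gaussians.
mean_wf = mu_w * mu_f
var_wf = s_w * s_f + mu_f**2 * s_w + mu_w**2 * s_f

# Monte Carlo estimates agree with the closed form.
w = rng.normal(mu_w, np.sqrt(s_w), 200_000)
f = rng.normal(mu_f, np.sqrt(s_f), 200_000)
print(mean_wf, (w * f).mean())  # both ≈ -1.05
print(var_wf, (w * f).var())    # both ≈ 2.58
```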
We provide the predictive variance and full derivations in the Appendix.\n\nFigure 3: Left: General plate diagram of MR-DGP for A observation processes across P tasks with noise variances omitted. For notational simplicity we have assumed that the target resolution is a = 1 and we use □_p to depict each of the sub-plate diagrams defined on the LHS. Right: A specific instantiation of an MR-DGP for 2 tasks and 2 observation processes (resolutions) with a target process Y_{1,1}, as in the inter-task multi-resolution PM10, PM25 experiment in Section 6.\n\n4 Multi-Resolution Deep Gaussian Processes (MR-DGP)\n\nWe now introduce MR-DGP, a deep instantiation of the framework which extends the deep GP (DGP) model of Damianou and Lawrence [5] into a tree-structured multi-resolution construction, Fig. 3. For notational convenience henceforth we assume that p = 1 is the target task and that a = 1 is the highest resolution and the one of primary interest. We note that this need not be the case and the relevant expressions can be trivially updated accordingly.\n\n4.1 Model Specification\n\nFirst we focus on the case when P = 1 and then generalize to an arbitrary number of tasks. We place A independent "Base" GPs {f_{a,p}}_{a=1}^A on each of the A datasets within task p that model their corresponding resolution independently. Taking a = 1 to be the target observation process we now construct A − 1 DGPs that map from the base GPs {f_{a,p}}_{a=2}^A to the target process a = 1 while learning an input-dependent mapping between observation processes.
These DGPs are local experts that capture the information contained in each resolution for the target observation process. Every GP has an explicit likelihood, which enables us to estimate and predict at every resolution and task while allowing for biases between observation processes to be corrected, see Fig. 2.\n\nMore formally, the likelihood of the MR-DGP with one task is\n\np(Y_p | F_p) = ∏_{a=2}^A N( Y_{1,p} | (1/|S_a|) ∫_{S_a} f^{(2)}_{a,p}(x) dx, σ²_{a,p} ) p(f^{(2)}_{a,p} | f_{a,p}) · ∏_{a=1}^A N( Y_{a,p} | (1/|S_a|) ∫_{S_a} f_{a,p}(x) dx, σ²_{a,p} ) p(f_{a,p})    (5)\n\nwhere the first product contains the deep GPs, the second the base GPs, f_{a,p} ∼ GP(0, K_{a,p}), and we have stacked all the observations and latent GPs into Y_p and F_p respectively. Each of the likelihood components is a special case of the multi-resolution likelihood in Eq. 1 (where Q = 1 and the latent GPs W are constant) and we discretize the integral in the same fashion. Similarly to the deep multi-fidelity model of [4] we define each DGP as:\n\np(f^{(2)}_{a,p} | f_{a,p}) = N(0, K^{(2)}_{a,p}((f_{a,p}, X_1), (f_{a,p}, X_1)))    (6)\n\nwhere X_1 are the covariates of the resolution of interest in our running example and allow each DGP to learn a mapping, between any observation process a and the target one, that varies across X_1. We now have A independent DGPs modelling Y_{1,p} with separable spatio-temporal kernels at each layer. The observation processes are not only at varying resolutions, but could also be partially overlapping or disjoint. This motivates treating each GP as a local model in a mixture of GP experts [35]. Mixtures of GP experts typically combine the local GPs in two ways: either through a gating network [24] or through weighing the local GPs [6, 20]. We employ the mixing-weight approach in order to avoid the computational burden of learning the gating network. We define the mixture m_p = β_1 ⊙ f_{1,p} + Σ_{a=2}^A β_a ⊙ f^{(2)}_{a,p}, where the weights capture the reliability of the local GPs (or are set to 1 if the mixture is a singleton). The reliability is defined by the resolution and support of the base GPs and is naturally achieved by utilising the normalised log variances V_a of the base GPs as β_a = (1 − V_a) / Σ_i V_i. We provide the full justification and derivation for these weights in the Appendix.\n\nWe can now generalize to an arbitrary number of tasks. For each task we construct a mixture of experts m_p as described above. For tasks p > 1 we learn the mapping from m_p to the target observation process Y_{1,1}. This defines another set of local GP experts that is combined into a mixture with DGP experts. In our experiments we set m_p for p > 1 to be a simple average and for m_1 we use our variance-derived weights. This formulation naturally handles biases between the means of different observation processes, and each layer of the DGPs has a meaningful interpretation as it is modelling a specific observation process.\n\n4.2 Augmented Posterior\n\nDue to the non-linear forms of the parent GPs within the DGPs, marginalising out the parent GPs is generally analytically intractable. Following [27] we introduce inducing points U = {u_p}_{p=2}^P ∪ {u^{(2)}_{a,p}, u_{a,p}}_{a,p=1}^{A,P}, where each u^{(·)}_{·,·} ∈ IR^M, and inducing locations Z = {Z_p}_{p=2}^P ∪ {Z^{(2)}_{a,p}, Z_{a,p}}_{a,p=1}^{A,P}, where Z_p, Z^{(2)}_{a,p} ∈ IR^{M×(D+1)} and Z_{a,p} ∈ IR^{M×D}. The augmented posterior is now simply p(Y, F, M, U) = p(Y | F) p(F, M | U) p(U) (with slight notation abuse) where each p(u^{(·)}_{·,·}) = N(0, K^{(·)}_{·,·}).
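Before turning to inference, the variance-based expert combination behind m_p can be illustrated as follows; the exact normalisation used for β_a is our reading of the description above, so treat this as an assumption-laden sketch rather than the paper's code:

```python
import numpy as np

def combine_experts(means, variances):
    """Combine local GP experts with weights that down-weight
    high-variance (less reliable) experts, in the spirit of beta_a."""
    v = np.asarray(variances, dtype=float)
    if len(v) == 1:
        # Singleton mixture: the single expert gets weight 1.
        return float(means[0]), float(v[0])
    beta = 1.0 - v / v.sum()   # low variance -> weight near 1
    beta = beta / beta.sum()   # normalise to convex weights
    mu = float(np.sum(beta * np.asarray(means)))
    var = float(np.sum(beta**2 * v))
    return mu, var

# A confident expert (variance 0.1) dominates an uncertain one (variance 10).
print(combine_experts([1.0, 3.0], [0.1, 10.0]))  # mean close to 1.0
```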
Full details are provided in the Appendix.\n\n4.3 Inference\n\nFollowing [27] we construct an approximate augmented posterior that maintains the dependency structure between layers:\n\nq(M, F, U) = p(M, F | U) ∏_{p=2}^P q(u_p) · ∏_{p=1}^P ∏_{a=1}^A q(u^{(2)}_{a,p}) q(u_{a,p})    (7)\n\nwhere each q(u^{(·)}_{·,·}) is an independent free-form Gaussian N(m^{(·)}_{·,·}, S^{(·)}_{·,·}) and the conditional is\n\np(F, M | U) = ∏_{p=2}^P p(f_p | m_p, u_p) p(m_p | Pa(m_p)) · ∏_{p=1}^P p(f_{1,p} | u_{1,p}) ∏_{a=2}^A p(f^{(2)}_{a,p} | f_{a,p}, u^{(2)}_{a,p}) p(f_{a,p} | u_{a,p}).    (8)\n\nWe use Pa(·) to denote the set of parent GPs of a given GP and L(f) to denote the depth of DGP f. Here p(m_p | Pa(m_p)) = N( Σ_a w_{a,p} μ_{a,p}, Σ_a w_{a,p} Σ_{a,p} w_{a,p} ) and μ_{a,p}, Σ_{a,p} are the means and variances of the relevant DGPs. Note that the mixture m_1 combines all the DGPs at the top layer of the tree hierarchy and hence only appears in the predictive distribution of MR-DGP. All other terms are standard sparse GP conditionals and are provided in the Appendix. The ELBO is simply derived as\n\nL_{MR-DGP} = E_{q(M,F,U)}[log p(Y | F)] + E_{q(U)}[log p(U)/q(U)]    (9)\n\nwhere the first term is the expected log-likelihood (ELL) and the second is the KL term, which decomposes into a sum over all inducing variables u^{(·)}_{·,·}. The ELL term decomposes across all Y:\n\nELL = Σ_{p=2}^P E_{q(f_p)}[log p(Y_{1,1} | f_p)] + Σ_{p=1}^P Σ_a ( E_{q(f^{(2)}_{a,p})}[log p(Y_{1,p} | f^{(2)}_{a,p})] + E_{q(f_{a,p})}[log p(Y_{a,p} | f_{a,p})] ).    (10)\n\nFor each ELL component the marginal q(f^{(·)}_{·,·}) is required.
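A minimal numerical sketch of such sample-based marginalization through two Gaussian layers follows; the layer mean functions are illustrative placeholders, not the actual MR-DGP conditionals:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_layer(mean_fn, var, parent_samples):
    """Draw one sample per parent sample from N(mean_fn(parent), var)."""
    m = mean_fn(parent_samples)
    return m + np.sqrt(var) * rng.standard_normal(m.shape)

S = 100_000
# Base marginal q(f1) = N(2.0, 0.05); second layer is an illustrative
# affine mapping f2 | f1 ~ N(0.5 * f1 + 1.0, 0.01).
f1 = sample_layer(lambda _: np.full(S, 2.0), 0.05, np.zeros(S))
f2 = sample_layer(lambda f: 0.5 * f + 1.0, 0.01, f1)
print(f2.mean(), f2.var())  # ≈ 2.0 and ≈ 0.25 * 0.05 + 0.01 = 0.0225
```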
Because the base GPs are Gaussian, sampling is straightforward and the samples can be propagated through the layers, allowing the marginalization integral to be approximated by Monte Carlo samples. We use the reparameterization trick to draw samples from the variational posteriors [11]. The inference procedure is given in Alg. 2.\n\nAlgorithm 2 Inference procedure for MR-DGP\nInput: P multi-resolution datasets {(X_p, Y_p)}_{p=1}^P, initial parameters θ_0\nprocedure MARGINAL(f, X, l, L)\n  if l = L then return q(f | X) end if\n  q(Pa(f) | X) ← MARGINAL(Pa(f), X, l + 1, L(Pa(f)))\n  return (1/S) Σ_{s=1}^S p(f | f^{(s)}, X) where f^{(s)} ∼ q(Pa(f) | X)\nend procedure\nθ_1 ← argmin_θ −E_{{MARGINAL(f_p, X_a, 0, L(f_p))}_{p=1}^P}[log p(Y | F, X, θ)] + KL(q(U) || p(U))\n\n4.4 Prediction\n\nPredictive Density. To predict at x* ∈ IR^D in the target resolution a = 1 we simply approximate the predictive density q(m*_1) by sampling from the variational posteriors and propagating the samples f^{(s)} through all the layers of our MR-DGP structure:\n\nq(m*_1) = ∫ q(m*_1 | Pa(m*_1)) ∏_{f∈Pa(m*_1)} q(f) dPa(m*_1) ≈ (1/S) Σ_{s=1}^S q(m*_1 | {f^{(s)}}_{f∈Pa(m*_1)}).    (11)\n\nIn fact, while propagating the samples through the tree structure, the model naturally predicts at every resolution a and task p for the corresponding input location.\n\n5 Related Work\n\nGaussian processes (GPs) are the workhorse for spatio-temporal modelling in spatial statistics [9] and in machine learning [25], with the direct link between multi-task GPs and Linear Models of Coregionalisation (LCM) reviewed by Alvarez et al. [3]. Heteroscedastic GPs [15] and recently proposed deeper compositions of GPs for the multi-fidelity setting [4, 22, 23] assume that all observations are of the same resolution.
In spatial statistics the related change-of-support problem has been approached through Markov chain Monte Carlo approximations and domain discretizations [8, 9]. A recent exception to this is the work by Smith et al. [29], which solves the integral for squared exponential kernels but only considers observations from one resolution and cannot handle additional input features. Independently and concurrently, [34] have recently proposed a multi-resolution LCM model that is similar to our MR-GPRN model but without dependent observation processes and composite likelihood corrections, focusing instead on improved estimation of the area integral and non-Gaussian likelihoods. Finally, we note that the multiresolution GP work by Fox and Dunson [7] defines a DGP construction for non-stationary models that is more akin to multi-scale modelling [32]. This line of research typically focuses on learning multiple kernel lengthscales to explain both broad and fine variations in the underlying process and hence cannot handle multi-resolution observations.\n\n6 Experiments\n\nWe demonstrate and evaluate the MRGPs on synthetic experiments and on the challenging problem of estimating and forecasting air pollution in the city of London. We compare against VBAGG-NORMAL [14] and two additional baselines. The first, CENTER-POINT, is a GPRN modified to support multi-resolution data by taking the center point of each aggregation region as the input. The second, MR-CASCADE, is an MR-DGP but, instead of a tree-structured DGP as in Fig. 3, we construct a cascade to illustrate the benefits of the tree composition and the mixture-of-experts approach of MR-DGP. Experiments are coded1 in TensorFlow and we provide additional analysis in the Appendix.\n\nDependent observation processes: We provide additional details of the dependent observation processes experiment in the left of Fig.
2 in the Appendix.\n\n1Codebase and datasets to reproduce results are available at www\n\nFigure 4 (panels: MR-DGP, VBAGG-NORMAL, CENTER-POINT): Spatio-temporal estimation and forecasting of NO2 levels in London. Top Row: Spatial slices from MR-GPRN, VBAGG-NORMAL and CENTER-POINT respectively at 19/02/2019 11:00:00 using observations from both LAQN and the satellite model (low spatial resolution). Bottom Row: Spatial slices at the base resolution from the same models at 19/02/2019 17:00:00 where only observations from the satellite model are present.\n\nBiased observation processes: To demonstrate the ability of MR-DGP to handle biases across observation processes we construct 3 datasets from the function y = s · 5 sin(x)² + 0.1ε where ε ∼ N(0, 1). The first, X_1, Y_1, is at resolution S_1 = 1 in the range x = [7, 12] with scale s = 1. The second is at resolution S_2 = 5 in x = [−10, 10] with scale s = 0.5, and lastly the third is at resolution S_3 = 5 in x = [10, 20] with scale s = 0.3. The aim is to predict y across the range [−10, 20] and the results are shown in Table 2 and Fig. 2. MR-DGP significantly outperforms all four alternative approaches as it learns a forward mapping between observation processes, e.g. f^{(2)}_2 in Fig. 3, rather than just trusting and propagating the mean.\n\nTraining: When training both MR-GPRN and VBAGG-NORMAL we first jointly optimize the variational and hyper-parameters while keeping the likelihood variances fixed, and then jointly optimize all parameters together. For MR-DGP we first optimize layer by layer and then jointly optimize all parameters together, see Appendix. We find that this helps to avoid early local optima.\n\nInter-task multi-resolution: modelling of PM10 and PM25 in London: In this experiment we consider multiple tasks with different resolutions. We jointly model PM10 and PM25 at a specific LAQN location in London.
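Multi-hour aggregations such as the 2-, 5-, 10- and 24-hour PM10 series used in this experiment can be produced by block-averaging an hourly series; a minimal sketch on synthetic stand-in data (names hypothetical):

```python
import numpy as np

def block_average(y, width):
    """Aggregate an hourly series into non-overlapping `width`-hour block
    means, dropping any incomplete trailing block."""
    n = (len(y) // width) * width
    return y[:n].reshape(-1, width).mean(axis=1)

hourly = np.arange(24, dtype=float)    # stand-in for one day of hourly PM10
print(block_average(hourly, 24))       # [11.5] -- a single daily average
print(block_average(hourly, 5).shape)  # (4,) -- four complete 5-hour blocks
```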
The site we consider is RB7 in the date range 18/06/2018 to 28/06/2018. At this location we have hourly data for both PM10 and PM25. To simulate multiple resolutions we construct 2, 5, 10 and 24 hour aggregations of PM10 and remove a 2-day region of PM25, which is the test region. The results from all of our models in Table 1 demonstrate the ability to successfully learn the multi-task dependencies. Note that CENTER-POINT fails, e.g. Table 2, when the sampling area cannot be approximated by a single center point due to the scale of the underlying process.\n\nTable 1: Inter-task multi-resolution. Missing data predictive MSE on PM25 from MR-GPRN, MR-DGP and baseline CENTER-POINT for 4 different aggregation levels of PM10. VBAGG-NORMAL is inapplicable in this experiment as it is a single-task approach.\n\nModel | 2 Hours | 5 Hours | 10 Hours | 24 Hours\nCENTER-POINT | 4.67 ± 0.74 | 5.04 ± 0.45 | 5.26 ± 0.91 | 5.72 ± 0.91\nMR-GPRN | 4.54 ± 0.93 | 5.09 ± 1.04 | 4.96 ± 1.07 | 5.32 ± 1.14\nMR-DGP | 5.14 ± 1.28 | 4.81 ± 1.06 | 4.61 ± 1.43 | 5.42 ± 1.15\n\nTable 2: Intra-task multi-resolution. Left: Predicting NO2 across London (Fig. 4). Right: Synthetic experiment results (Fig. 2) with three observation processes and scaling bias.\n\nLeft: Model | RMSE | MAPE\nSingle GP | 20.55 ± 9.44 | 0.8 ± 0.16\nCENTER-POINT | 18.74 ± 12.65 | 0.65 ± 0.21\nVBAGG-NORMAL | 16.16 ± 9.44 | 0.69 ± 0.37\nMR-GPRN w/o CL | 12.97 ± 9.22 | 0.56 ± 0.32\nMR-GPRN w CL | 11.92 ± 6.8 | 0.45 ± 0.17\nMR-DGP | 6.27 ± 2.77 | 0.38 ± 0.32\n\nRight: Model | RMSE | MAPE\nMR-CASCADE | 2.12 | 0.16\nVBAGG-NORMAL | 1.68 | 0.14\nMR-GPRN | 1.6 | 0.14\nMR-DGP | 0.19 | 0.02\n\nIntra-task multi-resolution: spatio-temporal modelling of NO2 in London: In this experiment we consider the case of a single task but with multiple multi-resolution observation processes. First we use observations coming from ground point sensors of the London Air Quality Network (LAQN). These sensors provide hourly readings of NO2. Secondly we use observations arising from a global satellite model [17] that provides hourly data at a spatial resolution of 7km × 7km and 48-hour forecasts. We train on both the LAQN and satellite observations from 19/02/2018-20/02/2018 and on the satellite observations alone from 20/02/2018-21/02/2018. We then predict at the resolution of the LAQN sensors in the latter date range. To calculate errors we predict at each LAQN sensor site and report the average and standard deviation across all sites.\n\nWe find that MR-DGP substantially outperforms VBAGG-NORMAL, MR-GPRN and the baselines, Table 2 (left), as it learns the forward mapping between the low-resolution satellite observations and the high-resolution LAQN sensors while handling scaling biases. This is further highlighted in the bottom of Fig. 4, where MR-DGP retains high-resolution structure based only on satellite observations whereas VBAGG-NORMAL and CENTER-POINT over-smooth.\n\n7 Conclusion\n\nWe offer a framework for evidence integration when observation processes can have varying inter- and intra-task sampling resolutions, dependencies, and different signal-to-noise ratios.
Our motivation comes from the challenging and impactful problem of hyper-local air quality prediction in the city of London, while the underlying multi-resolution multi-sensor problem is general and pervasive across modern spatio-temporal settings and applications of machine learning. We proposed both shallow mixtures and deep learning models that generalise and outperform the prior art, correct for posterior contraction, and can handle biases in observation processes such as discrepancies in the mean. Further directions now open up to robustify the multi-resolution framework against outliers and against further model misspecification by exploiting ongoing advances in generalized variational inference [12]. Finally, an open challenge remains on developing continuous model constructions that avoid domain discretization, as in [2, 34], for more complex settings.

Acknowledgements

O.H., T.D. and K.W. are funded by the Lloyd's Register Foundation programme on Data Centric Engineering through the London Air Quality project. This work is supported by The Alan Turing Institute for Data Science and AI under EPSRC grant EP/N510129/1 in collaboration with the Greater London Authority. We would like to thank the anonymous reviewers for their feedback and Libby Rogers, Patrick O'Hara and Daniel Tait for their help on multiple aspects of this work.

References

[1] (2008). Gaussian process product models for nonparametric nonstationarity. In Proceedings of the 25th International Conference on Machine Learning.

[2] Adelsberg, M. and Schwantes, C. (2018). Binned kernels for anomaly detection in multi-timescale data using Gaussian processes. In Proceedings of the KDD 2017: Workshop on Anomaly Detection in Finance, Proceedings of Machine Learning Research.

[3] Alvarez, M. A., Rosasco, L., Lawrence, N. D., et al. (2012). Kernels for vector-valued functions: A review.
Foundations and Trends® in Machine Learning, 4(3):195–266.

[4] Cutajar, K., Pullin, M., Damianou, A., Lawrence, N., and González, J. (2019). Deep Gaussian processes for multi-fidelity modeling. arXiv e-prints, page arXiv:1903.07320.

[5] Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics.

[6] Deisenroth, M. P. and Ng, J. W. (2015). Distributed Gaussian processes. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 1481–1490. JMLR.org.

[7] Fox, E. B. and Dunson, D. B. (2012). Multiresolution Gaussian processes. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1.

[8] Fuentes, M. and Raftery, A. E. (2005). Model evaluation and spatial interpolation by Bayesian combination of observations with outputs from numerical models. Biometrics.

[9] Gelfand, A., Fuentes, M., Guttorp, P., and Diggle, P. (2010). Handbook of Spatial Statistics. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. Taylor & Francis.

[10] Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence.

[11] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

[12] Knoblauch, J., Jewson, J., and Damoulas, T. (2019). Generalized variational inference. arXiv e-prints, page arXiv:1904.02063.

[13] Krauth, K., Bonilla, E. V., Cutajar, K., and Filippone, M. (2017). AutoGP: Exploring the capabilities and limitations of Gaussian process models. In Conference on Uncertainty in Artificial Intelligence (UAI).

[14] Law, H. C. L., Sejdinovic, D., Cameron, E., Lucas, T.
C., Flaxman, S., Battle, K., and Fukumizu, K. (2018). Variational learning on aggregate outputs with Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS).

[15] Lázaro-Gredilla, M. and Titsias, M. K. (2011). Variational heteroscedastic Gaussian process regression. In Proceedings of the 28th International Conference on Machine Learning.

[16] Lyddon, S. P., Holmes, C. C., and Walker, S. G. (2019). General Bayesian updating and the loss-likelihood bootstrap. Biometrika.

[17] Marécal, V., Peuch, V.-H., Andersson, C., Andersson, S., Arteta, J., Beekmann, M., Benedictow, A., Bergström, R., Bessagnet, B., Cansado, A., Chéroux, F., Colette, A., Coman, A., Curier, R. L., Denier van der Gon, H. A. C., Drouin, A., Elbern, H., Emili, E., Engelen, R. J., Eskes, H. J., Foret, G., Friese, E., Gauss, M., Giannaros, C., Guth, J., Joly, M., Jaumouillé, E., Josse, B., Kadygrov, N., Kaiser, J. W., Krajsek, K., Kuenen, J., Kumar, U., Liora, N., Lopez, E., Malherbe, L., Martinez, I., Melas, D., Meleux, F., Menut, L., Moinat, P., Morales, T., Parmentier, J., Piacentini, A., Plu, M., Poupkou, A., Queguiner, S., Robertson, L., Rouïl, L., Schaap, M., Segers, A., Sofiev, M., Tarasson, L., Thomas, M., Timmermans, R., Valdebenito, A., van Velthoven, P., van Versendaal, R., Vira, J., and Ung, A. (2015). A regional air quality forecasting system over Europe: the MACC-II daily ensemble production. Geoscientific Model Development.

[18] Moreno-Muñoz, P., Artés-Rodríguez, A., and Álvarez, M. A. (2018). Heterogeneous multi-output Gaussian process prediction. In Proceedings of the 32nd International Conference on Neural Information Processing Systems.

[19] Nguyen, T. and Bonilla, E. (2013). Efficient variational inference for Gaussian process regression networks.
In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics.

[20] Nguyen, T. and Bonilla, E. (2014a). Fast allocation of Gaussian process experts. In Proceedings of the 31st International Conference on Machine Learning.

[21] Nguyen, T. V. and Bonilla, E. V. (2014b). Automated variational inference for Gaussian process models. In Advances in Neural Information Processing Systems 27.

[22] Perdikaris, P., Raissi, M., Damianou, A., Lawrence, N. D., and Karniadakis, G. (2017). Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science.

[23] Perdikaris, P., Venturi, D., Royset, J. O., and Karniadakis, G. E. (2015). Multi-fidelity modelling via recursive co-kriging and Gaussian–Markov random fields. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[24] Rasmussen, C. E. and Ghahramani, Z. (2002). Infinite mixtures of Gaussian process experts. In Advances in Neural Information Processing Systems 14.

[25] Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.

[26] Ribatet, M. (2012). Bayesian inference from composite likelihoods, with an application to spatial extremes. Statistica Sinica, 22:813–845.

[27] Salimbeni, H. and Deisenroth, M. (2017). Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems 30.

[28] Serban, I. V., Klinger, T., Tesauro, G., Talamadupula, K., Zhou, B., Bengio, Y., and Courville, A. (2017). Multiresolution recurrent neural networks: An application to dialogue response generation. In Thirty-First AAAI Conference on Artificial Intelligence.

[29] Smith, M. T., Alvarez, M. A., and Lawrence, N. D. (2018).
Gaussian process regression for binned data. arXiv e-prints.

[30] Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics.

[31] Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica.

[32] Walder, C., Kim, K. I., and Schölkopf, B. (2008). Sparse multiscale Gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning.

[33] Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning.

[34] Yousefi, F., Smith, M. T., and Alvarez, M. A. (2019). Multi-task learning for aggregated data using Gaussian processes.

[35] Yuan, C. and Neubauer, C. (2009). Variational mixture of Gaussian process experts. In Advances in Neural Information Processing Systems 21.