{"title": "Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 5131, "page_last": 5139, "abstract": "Often in machine learning, data are collected as a combination of multiple conditions, e.g., the voice recordings of multiple persons, each labeled with an ID. How could we build a model that captures the latent information related to these conditions and generalizes to a new one with few data? We present a new model called Latent Variable Multiple Output Gaussian Processes (LVMOGP) that jointly models multiple conditions for regression and generalizes to a new condition with a few data points at test time. LVMOGP infers the posteriors of Gaussian processes together with a latent space representing the information about different conditions. We derive an efficient variational inference method for LVMOGP for which the computational complexity is as low as that of sparse Gaussian processes. We show that LVMOGP significantly outperforms related Gaussian process methods on various tasks with both synthetic and real data.", "full_text": "Efficient Modeling of Latent Information in Supervised Learning using Gaussian Processes

Zhenwen Dai *‡ zhenwend@amazon.com
Mauricio A. Álvarez † mauricio.alvarez@sheffield.ac.uk
Neil D. Lawrence †‡ lawrennd@amazon.com

Abstract

Often in machine learning, data are collected as a combination of multiple conditions, e.g., the voice recordings of multiple persons, each labeled with an ID. How could we build a model that captures the latent information related to these conditions and generalizes to a new one with few data? We present a new model called Latent Variable Multiple Output Gaussian Processes (LVMOGP) that jointly models multiple conditions for regression and generalizes to a new condition with a few data points at test time.
LVMOGP infers the posteriors of Gaussian processes together with a latent space representing the information about different conditions. We derive an efficient variational inference method for LVMOGP for which the computational complexity is as low as that of sparse Gaussian processes. We show that LVMOGP significantly outperforms related Gaussian process methods on various tasks with both synthetic and real data.

1 Introduction

Machine learning has been very successful in providing tools for learning a function mapping from an input to an output, which is typically referred to as supervised learning. One of the most prominent examples currently is deep neural networks (DNN), which empower a wide range of applications such as computer vision, speech recognition, natural language processing and machine translation [Krizhevsky et al., 2012, Sutskever et al., 2014]. Modeling in terms of a function mapping assumes a one/many-to-one mapping between input and output. In other words, ideally the input should contain sufficient information to uniquely determine the output, apart from some sensory noise. Unfortunately, in most cases, this assumption does not hold. We often collect data as a combination of multiple scenarios, e.g., the voice recordings of multiple persons or the images taken with different models of cameras. We only have some labels to identify these scenarios in our data, e.g., we may have the names of the speakers and the specifications of the used cameras. These labels themselves do not represent the full information about these scenarios. A question therefore is how to use these labels in a supervised learning task. A common practice in this case would be to ignore the difference between scenarios, but this results in low modeling accuracy, because all the variations related to the different scenarios are treated as observation noise, as different scenarios are no longer distinguishable in the inputs.
Alternatively, we can either model each scenario separately, which often suffers from having too little training data, or use a one-hot encoding to represent each scenario. In both of these cases, generalization/transfer to a new scenario is not possible.

* Inferentia Limited.
† Dept. of Computer Science, University of Sheffield, Sheffield, UK.
‡ Amazon.com. The scientific idea and a preliminary version of code were developed prior to joining Amazon.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: A toy example of modeling the braking distance of a car. (a) A car with initial speed v0 on a flat road starts to brake due to the friction force Fr. (b) The result of a GP regression on all the data from 10 different road and tyre conditions. (c) The top plot visualizes the fitted model with respect to one of the conditions in the training data and the bottom plot shows the prediction of the trained model for a new condition with only one observation. The model treats every condition independently. (d) LVMOGP captures the correlation among different conditions; the plot shows the curve with respect to one of the conditions. By using the information from all the conditions, it is able to predict in a new condition with only one observation. (e) The learned latent variable with uncertainty corresponds to a linear transformation of the inverse of the true friction coefficient (µ). The blue error bars denote the variational posterior of the latent variables q(H).

In this paper, we address this problem by proposing a probabilistic model that can jointly consider different scenarios and enables efficient generalization to new scenarios. Our model is based on Gaussian Processes (GP) augmented with additional latent variables.
The model is able to represent the data variance related to different scenarios in the latent space, where each location corresponds to a different scenario. When encountering a new scenario, the model is able to efficiently infer the posterior distribution of the location of the new scenario in the latent space. This allows the model to efficiently and robustly generalize to a new scenario. An efficient Bayesian inference method for the proposed model is developed by deriving a closed-form variational lower bound for the model. Additionally, by assuming a Kronecker product structure in the variational posterior, the derived stochastic variational inference method achieves the same computational complexity as a typical sparse Gaussian process model with independent output dimensions.

2 Modeling Latent Information

2.1 A Toy Problem

Let us consider a toy example where we wish to model the braking distance of a car in a completely data-driven way. Assuming that we do not know the physics of cars, we could treat it as a non-parametric regression problem, where the input is the initial speed read from the speedometer and the output is the distance from the location where the car starts to brake to the point where the car is fully stopped. We know that the braking distance depends on the friction coefficient, which varies according to the condition of the tyres and road. As the friction coefficient is difficult to measure directly, we can conduct experiments with a set of different tyre and road conditions, each associated with a condition id, e.g., ten different conditions, each with five experiments at different initial speeds. How can we model the relation between the speed and distance in a data-driven way, so that we can extrapolate to a new condition with only one experiment?

Denote the speed by x, the observed braking distance by y, and the condition id by d.
A straightforward modeling choice is to ignore the difference in conditions. Then, the relation between the speed and the distance can be modeled as
$$y = f(x) + \epsilon, \qquad f \sim \mathcal{GP}, \qquad (1)$$
where $\epsilon$ represents measurement noise, and the function $f$ is modeled as a Gaussian Process (GP). Since we do not know the parametric form of the function, we model it non-parametrically. The drawback of this model is that the accuracy is very low, as all the variations caused by different conditions are modeled as measurement noise (see Figure 1b). Alternatively, we can model each condition separately, i.e., $f_d \sim \mathcal{GP}$, $d = 1, \ldots, D$, where $D$ denotes the number of considered conditions. In this case, the relation between speed and distance for each condition can be modeled cleanly if there are sufficient data for that condition. However, such modeling is not able to generalize to new conditions (see Figure 1c), because it does not consider the correlations among conditions. Ideally, we wish to model the relation together with the latent information associated with different conditions, i.e., the friction coefficient in this example. A probabilistic approach is to assume a latent variable. With a latent variable $h_d$ that represents the latent information associated with the condition $d$, the relation between speed and distance for the condition $d$ is, then, modeled as
$$y = f(x, h_d) + \epsilon, \qquad f \sim \mathcal{GP}, \quad h_d \sim \mathcal{N}(0, \mathbf{I}). \qquad (2)$$
Note that the function $f$ is shared across all the conditions as in (1), while for each condition a different latent variable $h_d$ is inferred.
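As a concrete illustration of (2), the joint model over all conditions can be sampled with a kernel that compares both the input $x$ and the latent vector $h_d$. The numpy sketch below is purely illustrative: the product of two RBF kernels, the lengthscales and the sizes are our assumptions, not a specification from the paper.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between row sets A (n, q) and B (m, q)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
D, N, QH = 4, 25, 2                       # conditions, inputs, latent dimension
H = rng.standard_normal((D, QH))          # h_d ~ N(0, I), one row per condition
x = np.linspace(0.0, 10.0, N)[:, None]

# Each row of the joint design pairs an input with a condition's latent vector;
# the product kernel k((x, h), (x', h')) = kX(x, x') * kH(h, h') couples them.
X_full = np.repeat(x, D, axis=0)          # (N*D, 1)
H_full = np.tile(H, (N, 1))               # (N*D, QH)
K = rbf(X_full, X_full, 2.0) * rbf(H_full, H_full, 1.0)

f = rng.multivariate_normal(np.zeros(N * D), K + 1e-8 * np.eye(N * D))
y = f + 0.1 * rng.standard_normal(N * D)  # one noisy curve per condition
```

Conditions with nearby latent vectors $h_d$ produce correlated curves, which is exactly the mechanism that later allows transfer to a new condition.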
As all the conditions are jointly modeled, the correlation among different conditions is correctly captured, which enables generalization to new conditions (see Figure 1d for the results of the proposed model).

This model enables us to capture the relation between the speed and distance as well as the latent information. The latent information is learned into a latent space, where each condition is encoded as a point. Figure 1e shows how the model "discovers" the concept of the friction coefficient by learning the latent variable as a linear transformation of the inverse of the true friction coefficients. With this latent representation, we are able to infer the posterior distribution of a new condition given only one observation, and it gives a reasonable prediction for the speed-distance relation with uncertainty.

2.2 Latent Variable Multiple Output Gaussian Processes

In general, we denote the set of inputs as $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]^\top$, which corresponds to the speed in the toy example, and each input $\mathbf{x}_n$ can be considered in $D$ different conditions in the training data. For simplicity, we assume that, given an input $\mathbf{x}_n$, the outputs associated with all the $D$ conditions are observed, denoted as $\mathbf{y}_n = [y_{n1}, \ldots, y_{nD}]^\top$ and $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_N]^\top$. The latent variables representing different conditions are denoted as $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_D]^\top$, $\mathbf{h}_d \in \mathbb{R}^{Q_H}$. The dimensionality of the latent space $Q_H$ needs to be pre-specified, as in other latent variable models. The more general case where each condition has a different set of inputs and outputs will be discussed in Section 4.

Unfortunately, inference of the model in (2) is challenging, because the integral for computing the marginal likelihood, $p(\mathbf{Y}|\mathbf{X}) = \int p(\mathbf{Y}|\mathbf{X}, \mathbf{H})\, p(\mathbf{H})\, \mathrm{d}\mathbf{H}$, is analytically intractable. Apart from the analytical intractability, the computation of the likelihood $p(\mathbf{Y}|\mathbf{X}, \mathbf{H})$ is also very expensive, because of its cubic complexity $O((ND)^3)$. To enable efficient inference, we propose a new model which assumes that the covariance matrix can be decomposed as a Kronecker product of the covariance matrix of the latent variables $\mathbf{K}^H$ and the covariance matrix of the inputs $\mathbf{K}^X$. We call the new model Latent Variable Multiple Output Gaussian Processes (LVMOGP) due to its connection with multiple output Gaussian processes. The probabilistic distributions of LVMOGP are defined as
$$p(\mathbf{Y}_:|\mathbf{F}_:) = \mathcal{N}\!\left(\mathbf{Y}_:|\mathbf{F}_:, \sigma^2\mathbf{I}\right), \qquad p(\mathbf{F}_:|\mathbf{X}, \mathbf{H}) = \mathcal{N}\!\left(\mathbf{F}_:|\mathbf{0}, \mathbf{K}^H \otimes \mathbf{K}^X\right), \qquad (3)$$
where the latent variables $\mathbf{H}$ have unit Gaussian priors, $\mathbf{h}_d \sim \mathcal{N}(0, \mathbf{I})$, $\mathbf{F} = [\mathbf{f}_1, \ldots, \mathbf{f}_N]^\top$, $\mathbf{f}_n \in \mathbb{R}^D$ denote the noise-free observations, the notation ":" represents the vectorization of a matrix, e.g., $\mathbf{Y}_: = \mathrm{vec}(\mathbf{Y})$, and $\otimes$ denotes the Kronecker product. $\mathbf{K}^X$ denotes the covariance matrix computed on the inputs $\mathbf{X}$ with the kernel function $k^X$, and $\mathbf{K}^H$ denotes the covariance matrix computed on the latent variables $\mathbf{H}$ with the kernel function $k^H$. Note that the definition of LVMOGP only introduces a Kronecker product structure in the kernel, which does not directly avoid the intractability of its marginal likelihood. In the next section, we will show how the Kronecker product structure can be used to derive an efficient variational lower bound.

3 Scalable Variational Inference

The exact inference of LVMOGP in (3) is analytically intractable due to the integral over the latent variables in the marginal likelihood. Titsias and Lawrence [2010] developed a variational inference method by deriving a closed-form variational lower bound for a Gaussian process model with latent variables, known as the Bayesian Gaussian process latent variable model.
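The computational appeal of the Kronecker structure in (3) can be seen with a quick numerical check: the Cholesky factor of $\mathbf{K}^H \otimes \mathbf{K}^X$ is the Kronecker product of the two small Cholesky factors, so, for example, a sample of $\mathbf{F}_:$ never requires factorizing the full $ND \times ND$ matrix. A sketch with random SPD stand-ins for the two covariance matrices (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 5, 8
A = rng.standard_normal((D, D)); KH = A @ A.T + D * np.eye(D)   # stand-in for K^H
B = rng.standard_normal((N, N)); KX = B @ B.T + N * np.eye(N)   # stand-in for K^X

# Naive route: build the ND x ND covariance and factor it, cost O((ND)^3).
L_naive = np.linalg.cholesky(np.kron(KH, KX))

# Structured route: chol(KH (x) KX) = chol(KH) (x) chol(KX),
# so only a D x D and an N x N factorization are needed.
L_kron = np.kron(np.linalg.cholesky(KH), np.linalg.cholesky(KX))
assert np.allclose(L_naive, L_kron)

# A draw of F: ~ N(0, KH (x) KX) using the structured factor.
F = L_kron @ rng.standard_normal(N * D)
```

The identity holds because $(\mathbf{L}_H \otimes \mathbf{L}_X)(\mathbf{L}_H \otimes \mathbf{L}_X)^\top = (\mathbf{L}_H\mathbf{L}_H^\top) \otimes (\mathbf{L}_X\mathbf{L}_X^\top)$ and the Kronecker product of lower-triangular factors is again lower triangular.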
Their method is applicable to a broad family of models, including the one in (2), but it is not efficient for LVMOGP because it has cubic complexity with respect to $D$.⁴ In this section, we derive a variational lower bound that has the same complexity as a sparse Gaussian process with independent outputs, by exploiting the Kronecker product structure of the kernel of LVMOGP.

We augment the model with an auxiliary variable, known as the inducing variable $\mathbf{U}$, following the same Gaussian process prior, $p(\mathbf{U}_:) = \mathcal{N}(\mathbf{U}_:|\mathbf{0}, \mathbf{K}_{uu})$. The covariance matrix $\mathbf{K}_{uu}$ is defined as $\mathbf{K}_{uu} = \mathbf{K}^H_{uu} \otimes \mathbf{K}^X_{uu}$, following the assumption of the Kronecker product decomposition in (3), where $\mathbf{K}^H_{uu}$ is computed on a set of inducing inputs $\mathbf{Z}^H = [\mathbf{z}^H_1, \ldots, \mathbf{z}^H_{M_H}]^\top$, $\mathbf{z}^H_m \in \mathbb{R}^{Q_H}$, with the kernel function $k^H$. Similarly, $\mathbf{K}^X_{uu}$ is computed on another set of inducing inputs $\mathbf{Z}^X = [\mathbf{z}^X_1, \ldots, \mathbf{z}^X_{M_X}]^\top$, $\mathbf{z}^X_m \in \mathbb{R}^{Q_X}$, with the kernel function $k^X$, where $\mathbf{z}^X_m$ has the same dimensionality as the inputs $\mathbf{x}_n$. We construct the conditional distribution of $\mathbf{F}$ as
$$p(\mathbf{F}|\mathbf{U}, \mathbf{Z}^X, \mathbf{Z}^H, \mathbf{X}, \mathbf{H}) = \mathcal{N}\!\left(\mathbf{F}_: \,\middle|\, \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{U}_:,\; \mathbf{K}_{ff} - \mathbf{K}_{fu}\mathbf{K}_{uu}^{-1}\mathbf{K}_{fu}^\top\right), \qquad (4)$$
where $\mathbf{K}_{fu} = \mathbf{K}^H_{fu} \otimes \mathbf{K}^X_{fu}$ and $\mathbf{K}_{ff} = \mathbf{K}^H_{ff} \otimes \mathbf{K}^X_{ff}$. $\mathbf{K}^X_{fu}$ is the cross-covariance computed between $\mathbf{X}$ and $\mathbf{Z}^X$ with $k^X$, and $\mathbf{K}^H_{fu}$ is the cross-covariance computed between $\mathbf{H}$ and $\mathbf{Z}^H$ with $k^H$. $\mathbf{K}^X_{ff}$ is the covariance matrix computed on $\mathbf{X}$ with $k^X$, and $\mathbf{K}^H_{ff}$ is computed on $\mathbf{H}$ with $k^H$. Note that the prior distribution of $\mathbf{F}$ after marginalizing $\mathbf{U}$ is not changed by the augmentation, because $p(\mathbf{F}|\mathbf{X}, \mathbf{H}) = \int p(\mathbf{F}|\mathbf{U}, \mathbf{Z}^X, \mathbf{Z}^H, \mathbf{X}, \mathbf{H})\, p(\mathbf{U}|\mathbf{Z}^X, \mathbf{Z}^H)\, \mathrm{d}\mathbf{U}$. Assuming variational posteriors $q(\mathbf{F}|\mathbf{U}) = p(\mathbf{F}|\mathbf{U}, \mathbf{X}, \mathbf{H})$ and $q(\mathbf{H})$, the lower bound of the log marginal likelihood can be derived as
$$\log p(\mathbf{Y}|\mathbf{X}) \geq \mathcal{F} - \mathrm{KL}\left(q(\mathbf{U})\,\|\,p(\mathbf{U})\right) - \mathrm{KL}\left(q(\mathbf{H})\,\|\,p(\mathbf{H})\right), \qquad (5)$$
where $\mathcal{F} = \langle \log p(\mathbf{Y}_:|\mathbf{F}_:) \rangle_{p(\mathbf{F}|\mathbf{U},\mathbf{X},\mathbf{H})q(\mathbf{U})q(\mathbf{H})}$. It is known that the optimal posterior distribution of $q(\mathbf{U})$ is a Gaussian distribution [Titsias, 2009, Matthews et al., 2016]. With an explicit Gaussian definition of $q(\mathbf{U}) = \mathcal{N}(\mathbf{U}|\mathbf{M}, \Sigma^U)$, the integral in $\mathcal{F}$ has a closed-form solution:
$$\mathcal{F} = -\frac{ND}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\mathbf{Y}_:^\top\mathbf{Y}_: + \frac{1}{\sigma^2}\mathbf{Y}_:^\top\Psi\mathbf{K}_{uu}^{-1}\mathbf{M}_: - \frac{1}{2\sigma^2}\mathrm{Tr}\!\left(\mathbf{K}_{uu}^{-1}\Phi\mathbf{K}_{uu}^{-1}(\mathbf{M}_:\mathbf{M}_:^\top + \Sigma^U)\right) - \frac{1}{2\sigma^2}\left(\psi - \mathrm{tr}\!\left(\mathbf{K}_{uu}^{-1}\Phi\right)\right), \qquad (6)$$
where $\psi = \langle\mathrm{tr}(\mathbf{K}_{ff})\rangle_{q(\mathbf{H})}$, $\Psi = \langle\mathbf{K}_{fu}\rangle_{q(\mathbf{H})}$ and $\Phi = \langle\mathbf{K}_{fu}^\top\mathbf{K}_{fu}\rangle_{q(\mathbf{H})}$.⁵ Note that the optimal variational posterior of $q(\mathbf{U})$ with respect to the lower bound can be computed in closed form. However, the computational complexity of the closed-form solution is $O(NDM_X^2M_H^2)$.

3.1 More Efficient Formulation

Note that the lower bound in (5-6) does not take advantage of the Kronecker product decomposition. The computational efficiency can be improved by avoiding the direct computation of the Kronecker product of the covariance matrices. Firstly, we reformulate the expectations of the covariance matrices $\psi$, $\Psi$ and $\Phi$, so that the expectation computation can be decomposed,
$$\psi = \psi^H\,\mathrm{tr}\!\left(\mathbf{K}^X_{ff}\right), \qquad \Psi = \Psi^H \otimes \mathbf{K}^X_{fu}, \qquad \Phi = \Phi^H \otimes \left((\mathbf{K}^X_{fu})^\top\mathbf{K}^X_{fu}\right), \qquad (7)$$
where $\psi^H = \langle\mathrm{tr}(\mathbf{K}^H_{ff})\rangle_{q(\mathbf{H})}$, $\Psi^H = \langle\mathbf{K}^H_{fu}\rangle_{q(\mathbf{H})}$ and $\Phi^H = \langle(\mathbf{K}^H_{fu})^\top\mathbf{K}^H_{fu}\rangle_{q(\mathbf{H})}$. Secondly, we assume a Kronecker product decomposition of the covariance matrix of $q(\mathbf{U})$, i.e., $\Sigma^U = \Sigma^H \otimes \Sigma^X$. Although this decomposition restricts the covariance matrix representation, it dramatically reduces the number of variational parameters in the covariance matrix from $M_X^2M_H^2$ to $M_X^2 + M_H^2$. Thanks to the above decomposition, the lower bound can be rearranged to speed up the computation,
$$\begin{aligned} \mathcal{F} = &-\frac{ND}{2}\log 2\pi\sigma^2 - \frac{1}{2\sigma^2}\mathbf{Y}_:^\top\mathbf{Y}_: + \frac{1}{\sigma^2}\mathbf{Y}_:^\top\mathrm{vec}\!\left(\mathbf{K}^X_{fu}(\mathbf{K}^X_{uu})^{-1}\mathbf{M}(\mathbf{K}^H_{uu})^{-1}(\Psi^H)^\top\right) \\ &- \frac{1}{2\sigma^2}\mathrm{tr}\!\left(\mathbf{M}^\top(\mathbf{K}^X_{uu})^{-1}\Phi^X(\mathbf{K}^X_{uu})^{-1}\mathbf{M}(\mathbf{K}^H_{uu})^{-1}\Phi^H(\mathbf{K}^H_{uu})^{-1}\right) \\ &- \frac{1}{2\sigma^2}\mathrm{tr}\!\left((\mathbf{K}^H_{uu})^{-1}\Phi^H(\mathbf{K}^H_{uu})^{-1}\Sigma^H\right)\mathrm{tr}\!\left((\mathbf{K}^X_{uu})^{-1}\Phi^X(\mathbf{K}^X_{uu})^{-1}\Sigma^X\right) \\ &- \frac{1}{2\sigma^2}\psi + \frac{1}{2\sigma^2}\mathrm{tr}\!\left((\mathbf{K}^H_{uu})^{-1}\Phi^H\right)\mathrm{tr}\!\left((\mathbf{K}^X_{uu})^{-1}\Phi^X\right), \end{aligned} \qquad (8)$$
where $\Phi^X = (\mathbf{K}^X_{fu})^\top\mathbf{K}^X_{fu}$ and $\mathbf{M}$ is the $M_X \times M_H$ matrix such that $\mathbf{M}_: = \mathrm{vec}(\mathbf{M})$. Similarly, the KL-divergence between $q(\mathbf{U})$ and $p(\mathbf{U})$ can also take advantage of the above decomposition:
$$\mathrm{KL}\left(q(\mathbf{U})\,\|\,p(\mathbf{U})\right) = \frac{1}{2}\left(M_X\log\frac{|\mathbf{K}^H_{uu}|}{|\Sigma^H|} + M_H\log\frac{|\mathbf{K}^X_{uu}|}{|\Sigma^X|} + \mathrm{tr}\!\left(\mathbf{M}^\top(\mathbf{K}^X_{uu})^{-1}\mathbf{M}(\mathbf{K}^H_{uu})^{-1}\right) + \mathrm{tr}\!\left((\mathbf{K}^H_{uu})^{-1}\Sigma^H\right)\mathrm{tr}\!\left((\mathbf{K}^X_{uu})^{-1}\Sigma^X\right) - M_HM_X\right). \qquad (9)$$
As shown in the above equations, the direct computation of Kronecker products is completely avoided.

⁴ Assume that the number of inducing points is proportional to $D$.
⁵ The expectation with respect to a matrix, $\langle\cdot\rangle_{q(\mathbf{H})}$, denotes the expectation with respect to every element of the matrix.
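The rearrangement behind (7)-(9) rests on standard Kronecker identities: $\mathrm{tr}(\mathbf{A} \otimes \mathbf{B}) = \mathrm{tr}(\mathbf{A})\,\mathrm{tr}(\mathbf{B})$, $(\mathbf{A} \otimes \mathbf{B})^{-1} = \mathbf{A}^{-1} \otimes \mathbf{B}^{-1}$ and $(\mathbf{A} \otimes \mathbf{B})\mathrm{vec}(\mathbf{C}) = \mathrm{vec}(\mathbf{B}\mathbf{C}\mathbf{A}^\top)$. The trace term of (6) can be sanity-checked against its factorized form in (8) with random SPD stand-ins (all names and sizes below are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)

def spd(n):
    """Random symmetric positive-definite stand-in matrix."""
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

MH, MX = 3, 4
KH_uu, KX_uu = spd(MH), spd(MX)      # stand-ins for K^H_uu, K^X_uu
PhiH, PhiX = spd(MH), spd(MX)        # stand-ins for Phi^H, Phi^X
SigH, SigX = spd(MH), spd(MX)        # Kronecker factors of Sigma^U
M = rng.standard_normal((MX, MH))    # inducing mean, M: = vec(M)

# Naive evaluation of the trace term in eq. (6): forms the Kronecker products.
Kuu = np.kron(KH_uu, KX_uu)
Phi = np.kron(PhiH, PhiX)
Sig = np.kron(SigH, SigX)
m = M.flatten(order="F")             # vec() stacks columns
naive = np.trace(np.linalg.solve(Kuu, Phi) @
                 np.linalg.solve(Kuu, np.outer(m, m) + Sig))

# Factorized evaluation as in eq. (8): only small-matrix operations.
AH = np.linalg.solve(KH_uu, PhiH) @ np.linalg.inv(KH_uu)  # (K^H_uu)^-1 Phi^H (K^H_uu)^-1
AX = np.linalg.solve(KX_uu, PhiX) @ np.linalg.inv(KX_uu)  # (K^X_uu)^-1 Phi^X (K^X_uu)^-1
fast = np.trace(M.T @ AX @ M @ AH) + np.trace(AH @ SigH) * np.trace(AX @ SigX)
assert np.allclose(naive, fast)
```

The same identities give $\mathrm{tr}(\mathbf{K}_{uu}^{-1}\Phi) = \mathrm{tr}((\mathbf{K}^H_{uu})^{-1}\Phi^H)\,\mathrm{tr}((\mathbf{K}^X_{uu})^{-1}\Phi^X)$, which is how the last term of (6) becomes the last term of (8).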
Therefore, the computational complexity of the lower bound is reduced to $O(\max(N, M_H)\max(D, M_X)\max(M_X, M_H))$, which is comparable to the complexity of sparse GPs with independent observations, $O(NM\max(D, M))$. The new formulation is significantly more efficient than the formulation described in the previous section. This makes LVMOGP applicable to real-world scenarios. It is also straightforward to extend this lower bound to mini-batch learning as in Hensman et al. [2013], which allows further scaling up.

3.2 Prediction

After estimating the model parameters and variational posterior distributions, the trained model is typically used to make predictions. In our model, a prediction can be about a new input $\mathbf{x}^*$ as well as a new scenario, which corresponds to a new value of the hidden variable $\mathbf{h}^*$. Given both a set of new inputs $\mathbf{X}^*$ and a set of new scenarios $\mathbf{H}^*$, the prediction of the noiseless observations $\mathbf{F}^*$ can be computed in closed form,
$$q(\mathbf{F}^*_:|\mathbf{X}^*, \mathbf{H}^*) = \int p(\mathbf{F}^*_:|\mathbf{U}_:, \mathbf{X}^*, \mathbf{H}^*)\, q(\mathbf{U}_:)\, \mathrm{d}\mathbf{U}_: = \mathcal{N}\!\left(\mathbf{F}^*_: \,\middle|\, \mathbf{K}_{f^*u}\mathbf{K}_{uu}^{-1}\mathbf{M}_:,\; \mathbf{K}_{f^*f^*} - \mathbf{K}_{f^*u}\mathbf{K}_{uu}^{-1}\mathbf{K}_{f^*u}^\top + \mathbf{K}_{f^*u}\mathbf{K}_{uu}^{-1}\Sigma^U\mathbf{K}_{uu}^{-1}\mathbf{K}_{f^*u}^\top\right),$$
where $\mathbf{K}_{f^*f^*} = \mathbf{K}^H_{f^*f^*} \otimes \mathbf{K}^X_{f^*f^*}$ and $\mathbf{K}_{f^*u} = \mathbf{K}^H_{f^*u} \otimes \mathbf{K}^X_{f^*u}$. $\mathbf{K}^H_{f^*f^*}$ and $\mathbf{K}^H_{f^*u}$ are the covariance matrix computed on $\mathbf{H}^*$ and the cross-covariance matrix computed between $\mathbf{H}^*$ and $\mathbf{Z}^H$. Similarly, $\mathbf{K}^X_{f^*f^*}$ and $\mathbf{K}^X_{f^*u}$ are the covariance matrix computed on $\mathbf{X}^*$ and the cross-covariance matrix computed between $\mathbf{X}^*$ and $\mathbf{Z}^X$. For a regression problem, we are often more interested in predicting for existing conditions from the training data. As the posterior distributions of the existing conditions have already been estimated as $q(\mathbf{H})$, we can approximate the prediction by integrating the above prediction equation with $q(\mathbf{H})$: $q(\mathbf{F}^*_:|\mathbf{X}^*) = \int q(\mathbf{F}^*_:|\mathbf{X}^*, \mathbf{H})\, q(\mathbf{H})\, \mathrm{d}\mathbf{H}$. The above integration is intractable; however, as suggested by Titsias and Lawrence [2010], the first and second moments of $\mathbf{F}^*_:$ under $q(\mathbf{F}^*_:|\mathbf{X}^*)$ can be computed in closed form.

4 Missing Data

The model described in Section 2.2 assumes that the $N$ different inputs are observed in all $D$ different conditions. However, in real-world problems, we often collect data at a different set of inputs for each scenario, i.e., for each condition $d$, $d = 1, \ldots, D$. Alternatively, we can view the problem as having a large set of inputs, of which, for each condition, only the outputs associated with a subset of the inputs are observed. We refer to this problem as missing data. For the condition $d$, we denote the inputs as $\mathbf{X}^{(d)} = [\mathbf{x}^{(d)}_1, \ldots, \mathbf{x}^{(d)}_{N_d}]^\top$ and the outputs as $\mathbf{Y}_d = [y_{1d}, \ldots, y_{N_d d}]^\top$, and optionally a different noise variance as $\sigma_d^2$. The proposed model can be extended to handle this case by reformulating $\mathcal{F}$ as
$$\mathcal{F} = \sum_{d=1}^{D}\left[-\frac{N_d}{2}\log 2\pi\sigma_d^2 - \frac{1}{2\sigma_d^2}\mathbf{Y}_d^\top\mathbf{Y}_d + \frac{1}{\sigma_d^2}\mathbf{Y}_d^\top\Psi_d\mathbf{K}_{uu}^{-1}\mathbf{M}_: - \frac{1}{2\sigma_d^2}\mathrm{Tr}\!\left(\mathbf{K}_{uu}^{-1}\Phi_d\mathbf{K}_{uu}^{-1}(\mathbf{M}_:\mathbf{M}_:^\top + \Sigma^U)\right) - \frac{1}{2\sigma_d^2}\left(\psi_d - \mathrm{tr}\!\left(\mathbf{K}_{uu}^{-1}\Phi_d\right)\right)\right], \qquad (10)$$
in which $\psi_d = \psi^H_d\,\mathrm{tr}(\mathbf{K}^X_{f_df_d})$, $\Psi_d = \Psi^H_d \otimes \mathbf{K}^X_{f_du}$ and $\Phi_d = \Phi^H_d \otimes \left((\mathbf{K}^X_{f_du})^\top\mathbf{K}^X_{f_du}\right)$, where $\psi^H_d = \langle\mathrm{tr}(\mathbf{K}^H_{f_df_d})\rangle_{q(\mathbf{h}_d)}$, $\Psi^H_d = \langle\mathbf{K}^H_{f_du}\rangle_{q(\mathbf{h}_d)}$ and $\Phi^H_d = \langle(\mathbf{K}^H_{f_du})^\top\mathbf{K}^H_{f_du}\rangle_{q(\mathbf{h}_d)}$. The rest of the lower bound remains unchanged because it does not depend on the inputs and outputs.
Note that, although it looks very similar to the bound in Section 3, the above lower bound is computationally more expensive, because it involves the computation of a different set of $\Phi_d$, $\Psi_d$, $\psi_d$ and the corresponding parts of the lower bound for each condition.

5 Related works

LVMOGP can be viewed as an extension of a multiple output Gaussian process. Multiple output Gaussian processes have been thoroughly studied in Álvarez et al. [2012]. LVMOGP can be seen as an intrinsic model of coregionalization [Goovaerts, 1997] or a multi-task Gaussian process [Bonilla et al., 2008] in which the coregionalization matrix $\mathbf{B}$ is replaced by the kernel $\mathbf{K}^H$. By replacing the coregionalization matrix with a kernel matrix, we endow the multiple output GP with the ability to predict new outputs or tasks at test time, which is not possible if a finite matrix $\mathbf{B}$ is used at training time. Also, by using a model for the coregionalization matrix in the form of a kernel function, we reduce the number of hyperparameters needed to fit the covariance between the different conditions, reducing overfitting when fewer data points are available for training. Replacing the coregionalization matrix by a kernel matrix has also been used in Qian et al. [2008] and more recently by Bussas et al. [2017]. However, these works do not address the computational complexity problem, and their models cannot scale to large datasets.
Furthermore, in our model, the different conditions $\mathbf{h}_d$ are treated as latent variables, which are not observed, as opposed to these two models, where we would need to provide observed data to compute $\mathbf{K}^H$.

Computational complexity in multi-output Gaussian processes has also been studied before, for convolved multiple output Gaussian processes [Álvarez and Lawrence, 2011] and for the intrinsic model of coregionalization [Stegle et al., 2011]. In Álvarez and Lawrence [2011], the idea of inducing inputs is also used, and the computational complexity reduces to $O(NDM^2)$, where $M$ refers to a generic number of inducing inputs. In Stegle et al. [2011], the covariances $\mathbf{K}^H$ and $\mathbf{K}^X$ are replaced by their respective eigenvalue decompositions, and the computational complexity reduces to $O(N^3 + D^3)$. Our method reduces the computational complexity to $O(\max(N, M_H)\max(D, M_X)\max(M_X, M_H))$ when there are no missing data. Notice that if $M_H = M_X = M$, $N > M$ and $D > M$, our method achieves a computational complexity of $O(NDM)$, which is faster than the $O(NDM^2)$ of Álvarez and Lawrence [2011]. If $N = D = M_H = M_X$, our method achieves a computational complexity of $O(N^3)$, similar to Stegle et al. [2011]. Nonetheless, the usual case is that $N \gg M_X$, improving the computational complexity over Stegle et al. [2011]. An additional advantage of our method is that it can easily be parallelized using mini-batches as in Hensman et al. [2013].
Note that we have also provided expressions for dealing with missing data, a setup which is very common nowadays but has not been taken into account in previous formulations.

The idea of modeling latent information about different conditions jointly with the modeling of data points is related to the style and content model by Tenenbaum and Freeman [2000], where the style and content separation is explicitly modeled as a bilinear model for unsupervised learning.

6 Experiments

We evaluate the performance of the proposed model with both synthetic and real data.

Figure 2: The results on two synthetic datasets. (a) The performance of GP-ind, LMC and LVMOGP evaluated on 20 randomly drawn datasets without missing data. (b) The performance evaluated on 20 randomly drawn datasets with missing data. (c) A comparison of the functions estimated by the three methods on one of the synthetic datasets with missing data. The plots show the estimated functions for one of the conditions with few training data. The red rectangles are the noisy training data and the black crosses are the test data.

Synthetic Data. We compare the performance of the proposed method with a GP with independent observations and the linear model of coregionalization (LMC) [Journel and Huijbregts, 1978, Goovaerts, 1997] on synthetic data, where the ground truth is known. We generated synthetic data by sampling from a Gaussian process, as stated in (3), assuming a two-dimensional space for the different conditions. We first generated a dataset in which all the conditions of a set of inputs are observed. The dataset contains 100 different uniformly sampled input locations (50 for training and 50 for testing), each of which corresponds to 40 different conditions. Observation noise with variance 0.3 is added to the training data.
This dataset belongs to the case of no missing data; therefore, we can apply LVMOGP with the inference method presented in Section 3. We assume a two-dimensional latent space and set $M_H = 30$ and $M_X = 10$. We compare LVMOGP with two other methods: GP with independent output dimensions (GP-ind) and LMC (with a full-rank coregionalization matrix). We repeated the experiments on 20 randomly sampled datasets. The results are summarized in Figure 2a. The means and standard deviations of all the methods over 20 repeats are: GP-ind: 0.24 ± 0.02, LMC: 0.28 ± 0.11, LVMOGP: 0.20 ± 0.02. Note that, in this case, GP-ind performs quite well, because the only gain from modeling different conditions jointly is the reduction of the estimation variance from the observation noise.

Then, we generated another dataset following the same setting, but where each condition has a different set of inputs. Often, in real data problems, the number of available data points in different conditions is quite uneven. To generate a dataset with uneven numbers of training data in different conditions, we group the conditions into 10 groups. Within each group, the numbers of training data in the four conditions are generated through a three-step stick-breaking procedure with a uniform prior distribution (200 data points in total). We apply LVMOGP with missing data (Section 4) and compare with GP-ind and LMC. The results are summarized in Figure 2b. The means and standard deviations of all the methods over 20 repeats are: GP-ind: 0.43 ± 0.06, LMC: 0.47 ± 0.09, LVMOGP: 0.30 ± 0.04. In both synthetic experiments, LMC does not perform well because of overfitting caused by estimating the full-rank coregionalization matrix. Figure 2c shows a comparison of the functions estimated by the three methods for a condition with few training data.
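The uneven allocation of training points described above can be mimicked with a stick-breaking split; since the exact variant used in the experiments is not spelled out, the sketch below (uniform breaks, three per group of four conditions) is just one plausible reading:

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking_counts(total, n_pieces, rng):
    """Split `total` points over `n_pieces` conditions by breaking the remaining
    stick at a uniformly drawn proportion, n_pieces - 1 times."""
    counts, remaining = [], total
    for _ in range(n_pieces - 1):
        piece = int(np.round(remaining * rng.uniform()))
        counts.append(piece)
        remaining -= piece
    counts.append(remaining)
    return counts

# 10 groups of 4 conditions, 20 points per group: 200 training points in total,
# spread unevenly over the 40 conditions.
counts = [c for _ in range(10) for c in stick_breaking_counts(20, 4, rng)]
```

Early breaks tend to claim large chunks of the stick, so some conditions end up with many points and others with almost none, which is the regime the missing-data experiment targets.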
Both LMC and LVMOGP can leverage the information from other conditions to make better predictions, while LMC often suffers from overfitting due to the high number of parameters in the coregionalization matrix.

Servo Data. We apply our method to a servo modeling problem, in which the task is to predict the rise time of a servomechanism in terms of two (continuous) gain settings and two (discrete) choices of mechanical linkages [Quinlan, 1992]. The two choices of mechanical linkages introduce 25 different conditions in the experiments (five types of motors and five types of lead screws). The data in each condition are scarce, which makes joint modeling necessary (see Figure 3a). We took 70% of the dataset as training data and the rest as test data, and randomly generated 20 partitions. We applied LVMOGP with a two-dimensional latent space with an ARD kernel and used five inducing points for the latent space and 10 inducing points for the function. We compared LVMOGP with a GP ignoring the different conditions (GP-WO), a GP taking each condition as an independent output (GP-ind), a GP with a one-hot encoding of conditions (GP-OH), and LMC. The means and standard deviations of the RMSE of all the methods over 20 partitions are: GP-WO: 1.03 ± 0.20, GP-ind: 1.30 ± 0.31, GP-OH: 0.73 ± 0.26, LMC: 0.69 ± 0.35, LVMOGP: 0.52 ± 0.16. Note that in some conditions the data are very scarce, e.g., there is only one training data point and one test data point (see Figure 3c). As all the conditions are jointly modeled in LVMOGP, the method is able to extrapolate a non-linear function after seeing only one data point.

Figure 3: The experimental results on servo data and sensor imputation. (a) The number of data points is scarce in each condition. (b) The performance of a list of methods on 20 different train/test partitions is shown in the box plot. (c) The function learned by LVMOGP for the condition with the smallest amount of data. With only one training data point, the method is able to extrapolate a non-linear function due to the joint modeling of all the conditions. (d) The performance of three methods on sensor imputation with 20 repeats.

Sensor Imputation. We apply our method to impute multivariate time series data with massive missing data. We take an in-house multi-sensor recording that includes a list of sensor measurements such as temperature, carbon dioxide, humidity, etc. [Zamora-Martínez et al., 2014]. The measurements are recorded every minute for roughly a month and smoothed with 15-minute means. Different measurements are normalized to zero mean and unit variance. We mimic the scenario of massive missing data by randomly removing 95% of the data entries and aim at imputing all the missing values. The performance is measured as RMSE on the imputed values. We apply LVMOGP with missing data with the settings $Q_H = 2$, $M_H = 10$ and $M_X = 100$. We compare with LMC and GP-ind. The experiments are repeated 20 times with different missing values. The results are shown in a box plot in Figure 3d. The means and standard deviations of all the methods over 20 repeats are: GP-ind: 0.85 ± 0.09, LMC: 0.59 ± 0.21, LVMOGP: 0.45 ± 0.02. The high variance of the LMC results is due to the large number of parameters in the coregionalization matrix.

7 Conclusion

In this work, we study the problem of how to model multiple conditions in supervised learning. Common practices such as one-hot encoding cannot efficiently model the relations among different conditions and are not able to generalize to a new condition at test time.
We propose to solve this problem in a principled way, where we learn the latent information of the conditions in a latent space. By exploiting the Kronecker product decomposition in the variational posterior, our inference method is able to achieve the same computational complexity as sparse GPs with independent observations, when there are no missing data. In experiments on synthetic and real data, LVMOGP outperforms common approaches such as ignoring the condition difference, using one-hot encoding, and LMC. In Figures 3b and 3d, LVMOGP delivers more reliable performance than LMC across different train/test partitions due to the marginalization of the latent variables.

Acknowledgements MAA has been financed by the Engineering and Physical Sciences Research Council (EPSRC) Research Project EP/N014162/1.

References

Mauricio A. Álvarez and Neil D. Lawrence. Computationally efficient convolved multiple output Gaussian processes. J. Mach. Learn. Res., 12:1459–1500, July 2011.

Edwin V. Bonilla, Kian Ming Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In John C. Platt, Daphne Koller, Yoram Singer, and Sam Roweis, editors, NIPS, volume 20, 2008.

Matthias Bussas, Christoph Sawade, Nicolas Kühn, Tobias Scheffer, and Niels Landwehr. Varying-coefficient models for geospatial transfer learning. Machine Learning, pages 1–22, 2017.

Pierre Goovaerts. Geostatistics For Natural Resources Evaluation. Oxford University Press, 1997.

James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, 2013.

Andre G. Journel and Charles J. Huijbregts. Mining Geostatistics. Academic Press, 1978.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Alexander G. D. G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In AISTATS, 2016.

Peter Z. G. Qian, Huaiqing Wu, and C. F. Jeff Wu. Gaussian process models for computer experiments with qualitative and quantitative factors. Technometrics, 50(3):383–396, 2008.

J. R. Quinlan. Learning with continuous classes. In Australian Joint Conference on Artificial Intelligence, pages 343–348, 1992.

Oliver Stegle, Christoph Lippert, Joris Mooij, Neil Lawrence, and Karsten Borgwardt. Efficient inference in matrix-variate Gaussian models with IID observation noise. In NIPS, pages 630–638, 2011.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.

J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12:1473–1483, 2000.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.

Michalis K. Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In AISTATS, 2010.

F. Zamora-Martínez, P. Romeu, P. Botella-Rocamora, and J. Pardo. On-line learning of indoor temperature forecasting models towards energy efficiency. Energy and Buildings, 83:162–172, 2014.

Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012. ISSN 1935-8237. doi: 10.1561/2200000036.
URL http://dx.doi.org/10.1561/2200000036.