{"title": "Linearly constrained Gaussian processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1215, "page_last": 1224, "abstract": "We consider a modification of the covariance function in Gaussian processes to correctly account for known linear constraints. By modelling the target function as a transformation of an underlying function, the constraints are explicitly incorporated in the model such that they are guaranteed to be fulfilled by any sample drawn or prediction made. We also propose a constructive procedure for designing the transformation operator and illustrate the result on both simulated and real-data examples.", "full_text": "Linearly constrained Gaussian processes\n\nCarl Jidling\n\nNiklas Wahlstr\u00f6m\n\nDepartment of Information Technology\n\nDepartment of Information Technology\n\nUppsala University, Sweden\ncarl.jidling@it.uu.se\n\nUppsala University, Sweden\n\nniklas.wahlstrom@it.uu.se\n\nAdrian Wills\n\nSchool of Engineering\n\nUniversity of Newcastle, Australia\n\nadrian.wills@newcastle.edu.au\n\nThomas B. Sch\u00f6n\n\nDepartment of Information Technology\n\nUppsala University, Sweden\nthomas.schon@it.uu.se\n\nAbstract\n\nWe consider a modi\ufb01cation of the covariance function in Gaussian processes to\ncorrectly account for known linear operator constraints. By modeling the target\nfunction as a transformation of an underlying function, the constraints are explicitly\nincorporated in the model such that they are guaranteed to be ful\ufb01lled by any\nsample drawn or prediction made. 
We also propose a constructive procedure for designing the transformation operator and illustrate the result on both simulated and real-data examples.\n\n1 Introduction\n\nBayesian non-parametric modeling has had a profound impact in machine learning due, in no small part, to the flexibility of these model structures in combination with the ability to encode prior knowledge in a principled manner [6]. These properties have been exploited within the class of Bayesian non-parametric models known as Gaussian Processes (GPs), which have received significant research attention and have demonstrated utility across a very large range of real-world applications [16].\n\nAbstracting from the myriad of these applications, it has been observed that the efficacy of GP modeling is often intimately dependent on the appropriate choice of mean and covariance functions, and on the appropriate tuning of their associated hyper-parameters. Often, the most appropriate mean and covariance functions are connected to prior knowledge of the underlying problem. For example, [10] uses functional expectation constraints to consider the problem of gene-disease association, and [13] employs a multivariate generalized von Mises distribution to produce a GP-like regression that handles circular variable problems.\n\nFigure 1: Predicted strength of a magnetic field at three heights, given measured data sampled from the trajectory shown (blue curve). The three components (x1, x2, x3) denote the Cartesian coordinates, where the x3-coordinate is the height above the floor. The magnetic field is curl-free, which can be formulated in terms of three linear constraints. 
The method proposed in this paper can exploit these constraints to improve the predictions. See Section 5.2 for details.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nAt the same time, it is not always obvious how one might construct a GP model that obeys underlying principles, such as equilibrium conditions and conservation \"laws\". One straightforward approach to this problem is to add fictitious measurements that observe the constraints at a finite number of points of interest. This has the benefit of being relatively straightforward to implement, but has the sometimes significant drawback of increasing the problem dimension and at the same time not enforcing the constraints between the points of interest.\n\nA different approach to constraining the GP model is to construct mean and covariance functions that obey the constraints. For example, curl- and divergence-free covariance functions are used in [22] to improve the accuracy for regression problems. The main benefit of this approach is that the problem dimension does not grow, and the constraints are enforced everywhere, not just pointwise. However, it is not obvious how these approaches can be scaled to an arbitrary set of linear operator constraints.\n\nThe contribution of this paper is a new way to include constraints into multivariate GPs. In particular, we develop a method that transforms a given GP into a new, derived, one that satisfies the constraints. The procedure relies upon the fact that GPs are closed under linear operators, and we propose an algorithm capable of constructing the required transformation. 
We will demonstrate the utility of this new method on both simulated examples and on a real-world application, the latter in the form of predicting the components of a magnetic field, as illustrated in Figure 1.\n\nTo make these ideas more concrete, we present a simple example that will serve as a focal point several times throughout the paper. To that end, assume that we have a two-dimensional function f(x) : R^2 → R^2 on which we put a GP prior f(x) ~ GP(µ(x), K(x, x′)). We further know that f(x) should obey the differential equation\n\n∂f1/∂x1 + ∂f2/∂x2 = 0.    (1)\n\nIn this paper we show how to modify K(x, x′) and µ(x) such that any sample from the new GP is guaranteed to obey constraints like (1), considering any kind of linear operator constraint.\n\n2 Problem formulation\n\nAssume that we are given a data set of N observations {(xk, yk)}, k = 1, ..., N, where xk denotes the input and yk the output. Both the input and output are potentially vector-valued, with xk ∈ R^D and yk ∈ R^K. We consider the regression problem where the data can be described by a non-parametric model yk = f(xk) + ek, where ek is zero-mean white noise representing the measurement uncertainty. In this work, we place a vector-valued GP prior on f,\n\nf(x) ~ GP(µ(x), K(x, x′)),    (2)\n\nwith the mean function and the covariance function\n\nµ(·) : R^D → R^K,    K(·, ·) : R^D × R^D → R^(K×K).    (3)\n\nBased on the data {(xk, yk)}, we would now like to find a posterior over the function f(x). In addition to the data, we know that the function f should fulfill certain constraints\n\nF_x[f] = 0,    (4)\n\nwhere F_x is an operator mapping the function f(x) to another function g(x) as F_x[f] = g(x). 
We further require F_x to be a linear operator, meaning that F_x[λ1 f1 + λ2 f2] = λ1 F_x[f1] + λ2 F_x[f2], where λ1, λ2 ∈ R. The operator F_x can for example be a linear transform F_x[f] = C f(x), which together with the constraint (4) forces a certain linear combination of the outputs to be linearly dependent.\n\nThe operator F_x could also include other linear operations on the function f(x). For example, we might know that the function f(x) : R^2 → R^2 should obey a certain partial differential equation, F_x[f] = ∂f1/∂x1 + ∂f2/∂x2. A few more linear operators are listed in Section 1 of the Supplementary material, including integration as one of the most well-known.\n\nThe constraints (4) can either come from known physical laws or from other prior knowledge of the process generating the data. Our objective is to encode these constraints in the mean and covariance functions (3) such that any sample from the corresponding GP prior (2) always obeys the constraint (4).\n\n3 Building a constrained Gaussian process\n\n3.1 Approach based on artificial observations\n\nJust as Gaussian distributions are closed under linear transformations, so are GPs closed under linear operations (see Section 2 in the Supplementary material). This can be used for a straightforward way of embedding linear operator constraints of the form (4) into GP regression. The idea is to treat the constraints as noise-free artificial observations {(x̃k, ỹk)}, k = 1, ..., Ñ, with ỹk = 0 for all k. The regression is then performed on the model ỹk = F_x̃k[f], where the x̃k are input points in the domain of interest. For example, one could let these artificial inputs x̃k coincide with the points of prediction. An advantage of this approach is that it allows constraints of the type (4) with a non-zero right-hand side. 
Furthermore, there is no theoretical limit on how many constraints we can include (i.e., the number of rows in F_x), although in practice, of course, there is.\n\nHowever, this approach is problematic mainly for two reasons. First of all, it makes the problem size grow. This increases memory requirements and execution time, and the numerical stability is worsened due to an increased condition number. This is especially clear from the fact that we want these observations to be noise-free, since the noise usually has a regularizing effect. Secondly, the constraints are only enforced point-wise, so a sample drawn from the posterior fulfills the constraints only in our chosen points. The obvious way of compensating for this is to increase the number of points in which the constraints are observed, but that exacerbates the first problem. Clearly, the challenge grows quickly with the dimension of the inferred function.\n\nEmbedding the constraints in the covariance function removes these issues: it makes the enforcement continuous while the problem size is left unchanged. We will now address the question of how to design such a covariance function.\n\n3.2 A new construction\n\nWe want to find a GP prior (2) such that any sample f(x) from that prior obeys the constraints (4). In turn, this leads to constraints on the mean and covariance functions (3) of that prior. However, instead of posing these constraints on the mean and covariance functions directly, we consider f(x) to be related to another function g(x) via some operator G_x,\n\nf(x) = G_x[g].    (5)\n\nThe constraints (4) then amount to\n\nF_x[G_x[g]] = 0.    (6)\n\nWe would like this relation to be true for any function g(x). To do that, we will interpret F_x and G_x as matrices and use a similar procedure to that of solving systems of linear equations. 
Since F_x and G_x are linear operators, we can think of F_x[f] and G_x[g] as matrix-vector multiplications: F_x[f] = F_x f, with (F_x f)_i = Σ_{j=1}^K (F_x)_ij f_j, where each element (F_x)_ij in the operator matrix F_x is a scalar operator. With this notation, (6) can be written as\n\nF_x G_x = 0.    (7)\n\nThis reformulation imposes constraints on the operator G_x rather than on the GP prior for f(x) directly. We can now proceed by designing a GP prior for g(x) and transforming it using the mapping (5). We further know that GPs are closed under linear operations. More specifically, if g(x) is modeled as a GP with mean µg(x) and covariance Kg(x, x′), then f(x) is also a GP with\n\nf(x) = G_x g ~ GP(G_x µg, G_x Kg G_x′^T).    (8)\n\nHere (G_x Kg G_x′^T)_ij = (G_x)_ik (G_x′)_jl (Kg)_kl, where G_x and G_x′ act on the first and second argument of Kg(x, x′), respectively. See Section 2 in the Supplementary material for further details on linear operations on GPs.\n\nThe procedure to find the desired GP prior for f can now be divided into the following three steps:\n\n1. Find an operator G_x that fulfills the condition (6).\n2. Choose a mean and covariance function for g(x).\n3. Find the mean and covariance functions for f(x) according to (8).\n\nIn addition to being resistant to the disadvantages of the approach described in Section 3.1, there are some additional strengths worth pointing out with this method. First of all, we have separated the task of encoding the constraints from that of encoding other desired properties of the kernel. 
The constraints are encoded in F_x, and the remaining properties, such as smoothness assumptions, are determined by the prior for g(x). Hence, satisfying the constraints does not sacrifice any desired behavior of the target function.\n\nSecondly, K(x, x′) is guaranteed to be a valid covariance function provided that Kg(x, x′) is, since GPs are closed under linear functional transformations. From (8), it is clear that each column of K must fulfill all constraints encoded in F_x. Possibly K could be constructed with this knowledge alone, by assuming a general form and solving the resulting equation system. However, a solution may not just be hard to find; one must also make sure that it is indeed a valid covariance function.\n\nFurthermore, this approach provides a simple and straightforward way of constructing the covariance function even if the constraints have a complicated form. It makes no difference whether the linear operators relate the components of the target function explicitly or implicitly; the procedure remains the same.\n\n3.3 Illustrating example\n\nWe will now illustrate the method using the example (1) from the introduction. Consider a function f(x) : R^2 → R^2 satisfying ∂f1/∂x1 + ∂f2/∂x2 = 0, where x = [x1, x2]^T and f(x) = [f1(x), f2(x)]^T. This equation describes all two-dimensional divergence-free vector fields. The constraint can be written as a linear constraint on the form (4) with F_x = [∂/∂x1  ∂/∂x2] and f(x) = [f1(x)  f2(x)]^T. Modeling this function with a GP and building the covariance structure as described above, we first need to find the transformation G_x such that (7) is fulfilled. 
For example, we could pick\n\nG_x = [−∂/∂x2  ∂/∂x1]^T.    (9)\n\nIf the underlying function g(x) : R^2 → R is given by g(x) ~ GP(0, kg(x, x′)), then we can make use of (8) to obtain f(x) ~ GP(0, K(x, x′)), where\n\nK(x, x′) = G_x kg(x, x′) G_x′^T = [ ∂²/∂x2∂x2′   −∂²/∂x2∂x1′ ;  −∂²/∂x1∂x2′   ∂²/∂x1∂x1′ ] kg(x, x′).\n\nUsing a covariance function with this structure, we know that the constraint will be fulfilled by any function generated from the corresponding GP.\n\n4 Finding the operator G_x\n\nIn a general setting it might be hard to find an operator G_x that fulfills the constraint (7). Ultimately, we want an algorithm that can construct G_x from a given F_x. In more formal terms, the functions G_x g constitute the nullspace of F_x. The concept of nullspaces for linear operators is well-established [11], and it relates in many ways to real-number linear algebra.\n\nHowever, an important difference is illustrated by considering a one-dimensional function f(x) subject to the constraint F_x f = 0 with F_x = ∂/∂x. The solution to this differential equation cannot be expressed in terms of an arbitrary underlying function; it requires f(x) to be constant. Hence, the nullspace of ∂/∂x consists of the set of horizontal lines. Compare this with the real-number equation ab = 0, a ≠ 0, which is true only if b = 0. Since the nullspace differs between operators, we must be careful when discussing the properties of F_x and G_x based on knowledge from real-number algebra.\n\nLet us denote the rows in F_x by f1^T, ..., fL^T. 
We now want to find all solutions g such that\n\nF_x g = 0  ⇒  fi^T g = 0,  ∀ i = 1, ..., L.    (10)\n\nThe solutions g1, ..., gP to (10) will then be the columns of G_x. Each row vector fi can be written as fi = Φi ξf, where Φi ∈ R^(K×Mf) and ξf = [ξ1, ..., ξMf]^T is a vector of the Mf scalar operators included in F_x.\n\nAlgorithm 1: Constructing G_x\nInput: Operator matrix F_x\nOutput: Operator matrix G_x where F_x G_x = 0\nStep 1: Make an ansatz g = Γ ξg for the columns in G_x.\nStep 2: Expand F_x Γ ξg and collect terms.\nStep 3: Construct A · vec(Γ) = 0 and find the vectors Γ1, ..., ΓP spanning its nullspace.\nStep 4: If P = 0, go back to Step 1 and make a new ansatz, i.e., extend the set of operators.\nStep 5: Construct G_x = [Γ1 ξg, ..., ΓP ξg].\n\nWe now assume that g can also be written in a similar form, g = Γ ξg, where Γ ∈ R^(K×Mg) and ξg = [ξ1, ..., ξMg]^T is a vector of Mg scalar operators. One may make the assumption that the same set of operators that is used to describe the fi can also be used to describe g, i.e., ξg = ξf. However, this assumption might need to be relaxed. The constraints (10) can then be written as\n\nξf^T Φi^T Γ ξg = 0,  ∀ i = 1, ..., L.    (11)\n\nWe perform the multiplication and collect the terms in ξf and ξg. 
The condition (11) then results in conditions on the parameters in Γ, yielding a homogeneous system of linear equations\n\nA · vec(Γ) = 0.    (12)\n\nThe vectors vec(Γ1), ..., vec(ΓP) spanning the nullspace of A in (12) are then used to compute the columns in G_x = [g1, ..., gP], where gp = Γp ξg. If it turns out that the nullspace of A is empty, one should start over with a new ansatz and extend the set of operators in ξg.\n\nThe procedure described above is summarized in Algorithm 1. The algorithm is based upon a parametric ansatz rather than directly upon the theory for linear operators. Not only is this more intuitive, it also removes any conceptual challenges that the theory may provide. A problem with this is that one may have to iterate before the appropriate set of operators in ξg has been found. It might be of interest to examine possible alternatives to this algorithm that do not use a parametric approach. Let us now illustrate the method with an example.\n\n4.1 Divergence-free example revisited\n\nLet us return to the example discussed in Section 3.3, and show how the solution found by visual inspection can also be found with the algorithm described above. Since F_x only contains first-order derivative operators, we assume that a column in G_x does so as well. 
Hence, let us propose the following ansatz (Step 1):\n\ng = Γ ξg = [ γ11  γ12 ;  γ21  γ22 ] [ ∂/∂x1 ;  ∂/∂x2 ].    (13)\n\nApplying the constraint, expanding and collecting terms (Step 2), we find\n\nF_x Γ ξg = [ ∂/∂x1  ∂/∂x2 ] [ γ11  γ12 ;  γ21  γ22 ] [ ∂/∂x1 ;  ∂/∂x2 ] = γ11 ∂²/∂x1² + (γ12 + γ21) ∂²/∂x1∂x2 + γ22 ∂²/∂x2²,    (14)\n\nwhere we have used the fact that ∂²/∂xi∂xj = ∂²/∂xj∂xi, assuming continuous second derivatives. The expression (14) equals zero if\n\n[ 1 0 0 0 ;  0 1 1 0 ;  0 0 0 1 ] [ γ11 ;  γ12 ;  γ21 ;  γ22 ] = A · vec(Γ) = 0.    (15)\n\nThe nullspace is spanned by a single vector (Step 3): [γ11 γ12 γ21 γ22]^T = λ [0 −1 1 0]^T, λ ∈ R. Choosing λ = 1, we get G_x = [−∂/∂x2  ∂/∂x1]^T (Step 5), which is the same as in (9).\n\n4.2 Generalization\n\nAlthough there are no conceptual problems with the algorithm introduced above, the procedure of expanding and collecting terms appears a bit informal. In a general form, the algorithm is reformulated such that the operators are completely left out of the solution process. The drawback of this is a more cumbersome notation, and we have therefore limited the presentation to this simplified version. The general algorithm is found in the Supplementary material of this paper.\n\n5 Experimental results\n\n5.1 Simulated divergence-free function\n\nConsider the example in Section 3.3. 
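Before turning to the measurements, note that Step 3 of Algorithm 1 for this example reduces to an ordinary numerical nullspace computation. A minimal sketch (the `nullspace` helper is our own, built on the SVD; A and the row-wise stacking of vec(Γ) follow (15)):

```python
import numpy as np

# A from (15), acting on vec(Gamma) = [g11, g12, g21, g22]^T (stacked row-wise)
A = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def nullspace(M, tol=1e-12):
    """Orthonormal basis for the nullspace of M, via the SVD."""
    _, s, Vt = np.linalg.svd(M)
    rank = int(np.sum(s > tol))
    return Vt[rank:].T          # columns span the nullspace of M

N = nullspace(A)                # a single column, proportional to [0, -1, 1, 0]^T
```

Reading the single nullspace vector back into Γ row-wise gives γ11 = γ22 = 0 and γ12 = −γ21, i.e., G_x proportional to [−∂/∂x2  ∂/∂x1]^T as in (9).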
An example of a function fulfilling ∂f1/∂x1 + ∂f2/∂x2 = 0 is\n\nf1(x1, x2) = e^(−a x1 x2) (a x1 sin(x1 x2) − x1 cos(x1 x2)),\nf2(x1, x2) = e^(−a x1 x2) (x2 cos(x1 x2) − a x2 sin(x1 x2)),    (16)\n\nwhere a denotes a constant. We will now study how the regression of this function differs when using the covariance function found in Section 3.3 as compared to a diagonal covariance function K(x, x′) = k(x, x′) I. The measurements generated are corrupted with Gaussian noise such that yk = f(xk) + ek, where ek ~ N(0, σ²I). The squared exponential covariance function k(x, x′) = σf² exp[−(1/2) l^(−2) ||x − x′||²] has been used for kg and k, with hyperparameters chosen by maximizing the marginal likelihood. We have used the value a = 0.01 in (16).\n\nWe have used 50 measurements randomly picked over the domain [0, 4] × [0, 4], generated with the noise level σ = 10^(−4). The points for prediction correspond to a discretization using 20 uniformly distributed points in each direction, hence a total of NP = 20² = 400. We have included the approach described in Section 3.1 for comparison. The number of artificial observations has been chosen as random subsets of the prediction points, up to and including the full set.\n\nThe comparison is made with regard to the root mean squared error erms = sqrt((1/NP) fΔ^T fΔ), where fΔ = f̂ − f̄, with f̄ a concatenated vector storing the true function values in all prediction points and f̂ its reconstructed equivalent. To decrease the impact of randomness, each error value has been formed as an average over 50 reconstructions given different sets of measurements.\n\nAn example of the true field, measured values and reconstruction errors using the different methods is seen in Figure 2. 
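The outline of this simulated setup can be sketched as follows. The block below is a simplified illustration, not the paper's code: it uses the closed-form divergence-free kernel from Section 3.3 with a squared-exponential kg, fixed hyperparameters (σf = l = 1 instead of marginal-likelihood optimization), a larger noise level for conditioning, and a plain linear solve in place of a Cholesky factorization.

```python
import numpy as np

rng = np.random.default_rng(0)
a, sigma_n = 0.01, 1e-2          # constant in (16); noise level chosen here for conditioning
sigma_f, l = 1.0, 1.0            # fixed SE hyperparameters (not optimized, unlike the paper)

def f_true(x):
    """The divergence-free test function (16)."""
    x1, x2 = x[..., 0], x[..., 1]
    e = np.exp(-a * x1 * x2)
    return np.stack([e * (a * x1 * np.sin(x1 * x2) - x1 * np.cos(x1 * x2)),
                     e * (x2 * np.cos(x1 * x2) - a * x2 * np.sin(x1 * x2))], axis=-1)

def K_block(x, xp):
    """2x2 divergence-free kernel block from Section 3.3, in closed form for the SE kernel:
    d^2 kg / (dx_i dx'_j) = (delta_ij / l^2 - r_i r_j / l^4) kg, with r = x - x'."""
    r = x - xp
    kg = sigma_f**2 * np.exp(-(r @ r) / (2 * l**2))
    H = (np.eye(2) / l**2 - np.outer(r, r) / l**4) * kg
    return np.array([[ H[1, 1], -H[1, 0]],    # K11 = +d2/dx2dx2', K12 = -d2/dx2dx1'
                     [-H[0, 1],  H[0, 0]]])   # K21 = -d2/dx1dx2', K22 = +d2/dx1dx1'

def gram(X, Xp):
    return np.block([[K_block(x, xp) for xp in Xp] for x in X])

# 50 noisy measurements over [0, 4] x [0, 4]
X = rng.uniform(0.0, 4.0, size=(50, 2))
y = (f_true(X) + sigma_n * rng.standard_normal((50, 2))).ravel()

K = gram(X, X) + sigma_n**2 * np.eye(2 * len(X))
alpha = np.linalg.solve(K, y)

# Posterior mean on the 20 x 20 prediction grid (NP = 400 points)
g1 = np.linspace(0.0, 4.0, 20)
Xs = np.array([[u, v] for u in g1 for v in g1])
f_pred = (gram(Xs, X) @ alpha).reshape(-1, 2)
```

Since every column of the constructed Gram matrix is divergence-free, the posterior mean inherits the constraint exactly, regardless of the hyperparameter values.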
The result from the experiment is seen in Figure 3a. Note that the error from the approach with artificial observations decreases as the number of observations is increased, but only to a certain point. Bear in mind, however, that the Gram matrix is growing, making the problem larger and worse conditioned. The result from our approach is clearly better, while the problem size is kept small and numerical problems are therefore avoided.\n\nFigure 2: Left: Example of field plots illustrating the measurements (red arrows) and the true field (gray arrows). Remaining three plots: reconstructed fields subtracted from the true field. The artificial observations of the constraint have been made in the same points as the predictions.\n\nFigure 3: Accuracy of the different approaches as the number of artificial observations Nc is increased. (a) Simulated experiment. (b) Real-data experiment.\n\n5.2 Real data experiment\n\nMagnetic fields can mathematically be considered as vector fields mapping a 3D position to a 3D magnetic field strength. Based on the magnetostatic equations, this can be modeled as a curl-free vector field. Following Section 3.1 in the Supplementary material, our method can be used to encode the constraints in the following covariance function (which has also been presented elsewhere [22]):\n\nKcurl(x, x′) = σf² exp(−||x − x′||²/(2l²)) (I3 − ((x − x′)/l)((x − x′)/l)^T).    (17)\n\nWith a magnetic sensor and an optical positioning system, both position and magnetic field data have been collected in a magnetically distorted indoor environment; see the Supplementary material for details of the experimental setup. 
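As an illustrative check of (17) (our own sketch, not code from the paper; the helper names are assumptions), the block below evaluates Kcurl and verifies by central finite differences that each of its columns, viewed as a vector field in x, has zero curl, which is exactly the property the constrained prior enforces:

```python
import numpy as np

def K_curl(x, xp, sigma_f=1.0, l=1.0):
    """Curl-free covariance (17): sigma_f^2 exp(-||r||^2 / (2 l^2)) (I3 - (r/l)(r/l)^T)."""
    r = (x - xp) / l
    return sigma_f**2 * np.exp(-(r @ r) / 2) * (np.eye(3) - np.outer(r, r))

def curl_of_column(j, x, xp, h=1e-4):
    """Finite-difference curl (with respect to x) of the j-th column of K_curl."""
    def col(y):
        return K_curl(y, xp)[:, j]
    # Jacobian J[i, k] = d col_i / d x_k by central differences
    J = np.empty((3, 3))
    for k in range(3):
        e = np.zeros(3)
        e[k] = h
        J[:, k] = (col(x + e) - col(x - e)) / (2 * h)
    # curl components from the antisymmetric part of the Jacobian
    return np.array([J[2, 1] - J[1, 2], J[0, 2] - J[2, 0], J[1, 0] - J[0, 1]])
```

Each column of (17) is the x-gradient of a scalar function (a derivative of the underlying potential kernel), so its curl vanishes identically; the finite-difference residual is only discretization noise.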
In Figure 1, the predicted magnitude of the magnetic field over a two-dimensional domain is displayed for three different heights above the floor. The predictions have been made based on 500 measurements sampled from the trajectory given by the blue curve.\n\nSimilar to the simulated experiment in Section 5.1, we compare the predictions of the curl-free covariance function (17) with the diagonal covariance function and with the diagonal covariance function using artificial observations. The results have been formed by averaging the error over 50 reconstructions. In each iteration, training data and test data were randomly selected from the data set collected in the experiment; 500 training data points and 1,000 test data points were used.\n\nThe result is seen in Figure 3b. We recognize the same behavior as we saw for the simulated experiment in Figure 3a. Note that the accuracy of the artificial observation approach gets very close to that of our approach for a large number of artificial observations. However, in the last step of increasing the artificial observations, the accuracy decreases. This is probably caused by the numerical errors that follow from an ill-conditioned Gram matrix.\n\n6 Related work\n\nMany problems in which GPs are used contain some kind of constraint that could be exploited to improve the quality of the solution. Since there is a variety of ways in which constraints may appear and take form, there is also a variety of methods to deal with them. The treatment of inequality constraints in GP regression has been considered for instance in [1] and [5], based on local representations in a limited set of points. The paper [12] proposes a finite-dimensional GP approximation that allows for inequality constraints in the entire domain.\n\nIt has been shown that linear constraints satisfied by the training data will be satisfied by the GP prediction as well [19]. 
The same paper shows how this result can be extended to quadratic forms through a parametric reformulation and minimization of the Frobenius norm, with application demonstrated for pose estimation. Another approach to capturing human body features is described in [18], where a face-shape model is included in the GP framework to imply anatomic correctness.\n\nA rigorous theoretical analysis of degeneracy and invariance properties of Gaussian random fields is found in [7], including application examples for one-dimensional GP problems. The concept of learning the covariance function with respect to algebraic invariances is explored in [9].\n\nAlthough constraints in most situations are formulated on the outputs of the GP, there are also situations in which they act on the inputs. An example of this is given in [21], describing a method that benefits from ordering constraints on the input to reduce the negative impact of input noise.\n\nApplications within medicine include gene-disease association through functional expectation constraints [10] and lung disease sub-type identification using a mixture of GPs and constraints encoded with Markov random fields [17]. Another way of viewing constraints is as modified prior distributions. By making use of the so-called multivariate generalized von Mises distribution, [13] arrives at a version of GP regression customized for circular variable problems. Other fields of interest include using GPs to approximately solve one-dimensional partial differential equations [8, 14, 15].\n\nGenerally speaking, the papers mentioned above consider problems in which the constraints are dealt with using some kind of external enforcement; that is, they are not explicitly incorporated into the model, but rely on approximations or finite representations. 
Therefore, the constraints may only be approximately satisfied, and not necessarily in a continuous manner, which differs from the method proposed in this paper. Of course, comparisons cannot be made directly between methods that have been developed for different kinds of constraints. The interest in this paper is multivariate problems where the constraints are linear combinations of the outputs that are known to equal zero.\n\nFor multivariate problems, constructing the covariance function is particularly challenging due to the correlation between the output components. We refer to [2] for a very useful review. The basic idea behind the so-called separable kernels is to separate the process of modeling the covariance function for each component from the process of modeling the correlation between them. The final covariance function is then chosen, for example, according to some method of regularization. Another class of covariance functions is the invariant kernels. Here, the correlation is inherited from a known mathematical relation. The curl- and divergence-free covariance functions are such examples, where the structure follows directly from the underlying physics; they have been shown to improve the accuracy notably for regression problems [22]. Another example is the method proposed in [4], where a Taylor expansion is used to construct a covariance model given a known relationship between the outputs. A very useful property of linear transformations is given in [20], based on the natural inheritance by GPs of features imposed by linear operators. This fact has for example been used in developing a method for monitoring infectious diseases [3].\n\nThe method proposed in this work exploits the transformation property to build a covariance function of the invariant kind for a multivariate GP. We show how this property can be exploited to incorporate knowledge of linear constraints into the covariance function. 
Moreover, we present an algorithm for constructing the required transformation. This way, the constraints are built into the prior and are guaranteed to be fulfilled in the entire domain.\n\n7 Conclusion and future work\n\nWe have presented a method for designing the covariance function of a multivariate Gaussian process subject to known linear operator constraints on the target function. The method will by construction guarantee that any sample drawn from the resulting process obeys the constraints in all points. Numerical simulations show the benefits of this method as compared to alternative approaches. Furthermore, it has been demonstrated to improve the performance on real data as well.\n\nAs mentioned in Section 4, it would be desirable to describe the requirements on G_x more rigorously. That might allow us to reformulate the construction algorithm for G_x in a way that allows for a more straightforward approach as compared to the parametric ansatz that we have proposed. In particular, our method relies upon the requirement that the target function can be expressed in terms of an underlying potential function g. This leads to the intriguing and nontrivial question: is it possible to mathematically guarantee the existence of such a potential? If the answer to this question is yes, the next question will of course be what it looks like and how it relates to the target function.\n\nAnother possible topic of further research is the extension to constraints including nonlinear operators, which might for example rely upon a linearization in the domain of interest. Furthermore, it may be of potential interest to study the extension to a non-zero right-hand side of (4).\n\nAcknowledgements\n\nThis research is financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (Contract number: RIT 15-0012). 
The work is also supported by the Swedish Research Council (VR) via the project Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524). We are grateful for the help and equipment provided by the UAS Technologies Lab, Artificial Intelligence and Integrated Computer Systems Division (AIICS) at the Department of Computer and Information Science (IDA), Linköping University, Sweden. The real data set used in this paper has been collected by some of the authors together with Manon Kok, Arno Solin, and Simo Särkkä. We thank them for allowing us to use this data. We also thank Manon Kok for supporting us with the data processing. Furthermore, we would like to thank Carl Rasmussen and Marc Deisenroth for fruitful discussions on constrained GPs.

References

[1] Petter Abrahamsen and Fred Espen Benth. Kriging with inequality constraints. Mathematical Geology, 33(6):719-744, 2001.

[2] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266, March 2012.

[3] Ricardo Andrade-Pacheco, Martin Mubangizi, John Quinn, and Neil Lawrence. Monitoring Short Term Changes of Infectious Diseases in Uganda with Gaussian Processes, pages 95-110. Springer International Publishing, 2016.

[4] Emil M. Constantinescu and Mihai Anitescu. Physics-based covariance models for Gaussian processes with multiple outputs. International Journal for Uncertainty Quantification, 3(1):47-71, 2013.

[5] Sébastien Da Veiga and Amandine Marrel. Gaussian process modeling with inequality constraints. Annales de la faculté des sciences de Toulouse Mathématiques, 21(3):529-555, 2012.

[6] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521:452-459, 2015.

[7] David Ginsbourger, Olivier Roustant, and Nicolas Durrande.
On degeneracy and invariances of random fields paths with applications in Gaussian process modelling. Journal of Statistical Planning and Inference, 170:117-128, 2016.

[8] Thore Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), August 2003.

[9] Franz J. Király, Andreas Ziehe, and Klaus-Robert Müller. Learning with algebraic invariances, and the invariant kernel trick. Technical report, arXiv:1411.7817, November 2014.

[10] Oluwasanmi Koyejo, Cheng Lee, and Joydeep Ghosh. Constrained Gaussian process regression for gene-disease association. In Proceedings of the IEEE 13th International Conference on Data Mining Workshops, pages 72-79, 2013.

[11] David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, Inc., 1969.

[12] Hassan Maatouk and Xavier Bay. Gaussian process emulators for computer experiments with inequality constraints. Mathematical Geosciences, 49(5):557-582, 2017.

[13] Alexandre K. W. Navarro, Jes Frellsen, and Richard E. Turner. The multivariate generalised von Mises distribution: inference and applications. Technical report, arXiv:1602.05003, February 2016.

[14] Ngoc Cuong Nguyen and Jaime Peraire. Gaussian functional regression for linear partial differential equations. Computer Methods in Applied Mechanics and Engineering, 287:69-89, 2015.

[15] Ngoc Cuong Nguyen and Jaime Peraire. Gaussian functional regression for output prediction: Model assimilation and experimental design. Journal of Computational Physics, 309:52-68, 2016.

[16] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[17] James Ross and Jennifer Dy. Nonparametric mixture of Gaussian processes with constraints.
In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1346-1354. JMLR Workshop and Conference Proceedings, 2013.

[18] Ognjen Rudovic and Maja Pantic. Shape-constrained Gaussian process regression for facial-point-based head-pose normalization. In Proceedings of the International Conference on Computer Vision (ICCV), 2011.

[19] Mathieu Salzmann and Raquel Urtasun. Implicitly constrained Gaussian process regression for monocular non-rigid pose estimation. In Neural Information Processing Systems (NIPS), 2010.

[20] Simo Särkkä. Linear operators and stochastic partial differential equations in Gaussian process regression. In Proceedings of the Artificial Neural Networks and Machine Learning (ICANN), pages 151-158. Springer, 2011.

[21] Cuong Tran, Vladimir Pavlovic, and Robert Kopp. Gaussian process for noisy inputs with ordering constraints. Technical report, arXiv:1507.00052, July 2015.

[22] Niklas Wahlström. Modeling of Magnetic Fields and Extended Objects for Localization Applications. PhD thesis, Division of Automatic Control, Linköping University, 2015.