{"title": "Extended and Unscented Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1251, "page_last": 1259, "abstract": "We present two new methods for inference in Gaussian process (GP) models with general nonlinear likelihoods. Inference is based on a variational framework where a Gaussian posterior is assumed and the likelihood is linearized about the variational posterior mean using either a Taylor series expansion or statistical linearization. We show that the parameter updates obtained by these algorithms are equivalent to the state update equations in the iterative extended and unscented Kalman filters respectively, hence we refer to our algorithms as extended and unscented GPs. The unscented GP treats the likelihood as a 'black-box' by not requiring its derivative for inference, so it also applies to non-differentiable likelihood models. We evaluate the performance of our algorithms on a number of synthetic inversion problems and a binary classification dataset.", "full_text": "Extended and Unscented Gaussian Processes\n\nDaniel M. Steinberg\n\nNICTA\n\ndaniel.steinberg@nicta.com.au\n\nEdwin V. Bonilla\n\nThe University of New South Wales\n\ne.bonilla@unsw.edu.au\n\nAbstract\n\nWe present two new methods for inference in Gaussian process (GP) models\nwith general nonlinear likelihoods.\nInference is based on a variational frame-\nwork where a Gaussian posterior is assumed and the likelihood is linearized about\nthe variational posterior mean using either a Taylor series expansion or statistical\nlinearization. We show that the parameter updates obtained by these algorithms\nare equivalent to the state update equations in the iterative extended and unscented\nKalman \ufb01lters respectively, hence we refer to our algorithms as extended and un-\nscented GPs. 
The unscented GP treats the likelihood as a \u2018black-box\u2019 by not\nrequiring its derivative for inference, so it also applies to non-differentiable like-\nlihood models. We evaluate the performance of our algorithms on a number of\nsynthetic inversion problems and a binary classi\ufb01cation dataset.\n\n1\n\nIntroduction\n\nNonlinear inversion problems, where we wish to infer the latent inputs to a system given obser-\nvations of its output and the system\u2019s forward-model, have a long history in the natural sciences,\ndynamical modeling and estimation. An example is the robot-arm inverse kinematics problem. We\nwish to infer how to drive the robot\u2019s joints (i.e. joint torques) in order to place the end-effector in a\nparticular position, given we can measure its position and know the forward kinematics of the arm.\nMost of the existing algorithms either estimate the system inputs at a particular point in time like the\nLevenberg-Marquardt algorithm [1], or in a recursive manner such as the extended and unscented\nKalman \ufb01lters (EKF, UKF) [2].\nIn many inversion problems we have a continuous process; a smooth trajectory of a robot arm for\nexample. Non-parametric regression techniques like Gaussian processes [3] seem applicable, and\nhave been used in linear inversion problems [4]. Similarly, Gaussian processes have been used to\nlearn inverse kinematics and predict the motion of a dynamical system such as robot arms [3, 5]\nand a human\u2019s gait [6, 7, 8]. However, in [3, 5] the inputs (torques) to the system are observable\n(not latent) and are used to train the GPs. Whereas [7, 8] are not concerned with inference over\nthe original latent inputs, but rather they want to \ufb01nd a low dimensional representation of high\ndimensional outputs for prediction using Gaussian process latent variable models [6]. 
In this paper we introduce inference algorithms for GPs that can infer and predict the original latent inputs to a system, without having to be explicitly trained on them.\n\n
Even if we do not need to infer the latent inputs to a system, it is still desirable to incorporate domain- or system-specific information into an algorithm in the form of a likelihood model specific to the task at hand, for example in non-parametric classification or robust regression problems. In these situations it is useful to have an inference procedure that does not require re-derivation for each new likelihood model, without having to resort to MCMC. An example of this is the variational algorithm presented in [9] for factorizing likelihood models. In this model, the expectations arising from the use of arbitrary (non-conjugate) likelihoods are only one-dimensional, and so they can be easily evaluated using sampling techniques or quadrature. We present two alternatives to this algorithm that are also underpinned by variational principles but are based on linearizing the nonlinear likelihood models about the posterior mean. These methods are straightforwardly applicable to non-factorizing likelihoods and would retain computational efficiency, unlike [9], which would require the evaluation of intractable multidimensional integrals. One of our algorithms, based on statistical linearization, does not even require derivatives of the likelihood model (like [9]), and so non-differentiable likelihoods can be incorporated.\n\n
Initially, in §2, we formulate our models for the finite Gaussian case, because there the linearization methods are more general and comparable with existing algorithms. In fact, we show that we can derive the update steps of the iterative EKF [10], and similar updates to the iterative UKF [11], using our variational inference procedures. 
Then in §3 we specifically derive a factorizing-likelihood Gaussian process model using our framework, which we use for experiments in §4.\n\n
2 Variational Inference in Nonlinear Gaussian Models with Linearization\n\n
Given some observable quantity y ∈ R^d, and a likelihood model for the system of interest, in many situations it is desirable to reason about the latent input to the system, f ∈ R^D, that generated the observations. Finding these inputs is an inversion problem, and in a probabilistic setting it can be cast as an application of Bayes' rule. The following forms are assumed for the prior and likelihood:\n\n
p(f) = N(f | µ, K) and p(y|f) = N(y | g(f), Σ),    (1)\n\n
where g(·): R^D → R^d is a nonlinear function or forward model. Unfortunately the marginal likelihood, p(y), is intractable because the nonlinear function makes the likelihood and prior non-conjugate. This also makes the posterior p(f|y), which is the solution to the inverse problem, intractable to evaluate. So, we choose to approximate the posterior with variational inference [12].\n\n
2.1 Variational Approximation\n\n
Using variational inference procedures we can put a lower bound on the log-marginal likelihood using Jensen's inequality,\n\n
log p(y) ≥ ∫ q(f) log [p(y|f) p(f) / q(f)] df,    (2)\n\n
with equality iff KL[q(f) || p(f|y)] = 0, and where q(f) is an approximation to the true posterior, p(f|y). This lower bound is often referred to as the 'free energy', and can be re-written as follows:\n\n
F = <log p(y|f)>_qf − KL[q(f) || p(f)],    (3)\n\n
where <·>_qf is an expectation with respect to the variational posterior, q(f). We assume the posterior takes a Gaussian form, q(f) = N(f | m, C), so we can evaluate the expectation and KL term in (3),\n\n
<log p(y|f)>_qf = −(1/2) [d log 2π + log|Σ| + <(y − g(f))^T Σ^-1 (y − g(f))>_qf],    (4)\n\n
KL[q(f) || p(f)] = (1/2) [tr(K^-1 C) + (µ − m)^T K^-1 (µ − m) − log|C| + log|K| − D],    (5)\n\n
where the expectation involving g(·) may be intractable. One method of dealing with these expectations is presented in [9] by assuming that the likelihood factorizes across observations. Here we provide two alternatives based on linearizing g(·) about the posterior mean, m.\n\n
2.2 Parameter Updates\n\n
To find the optimal posterior mean, m, we need to find the derivative\n\n
∂F/∂m = −(1/2) ∂/∂m <(µ − f)^T K^-1 (µ − f) + (y − g(f))^T Σ^-1 (y − g(f))>_qf,    (6)\n\n
where all terms in F independent of m have been dropped, and we have placed the quadratic and trace terms from the KL component in Equation (5) back into the expectation. We can represent this as an augmented Gaussian,\n\n
∂F/∂m = −(1/2) ∂/∂m <(z − h(f))^T S^-1 (z − h(f))>_qf,    (7)\n\n
where\n\n
z = [y; µ],  h(f) = [g(f); f],  S = [Σ, 0; 0, K].    (8)\n\n
Now we can see that solving for m is essentially a nonlinear least squares problem, but about the expected posterior value of f. Even without the expectation, there is no closed-form solution to ∂F/∂m = 0. However, we can use an iterative Newton method to find m. It begins with an initial guess, m_0, then proceeds with the iterations\n\n
m_{k+1} = m_k − α (∇_m ∇_m F)^-1 ∇_m F,    (9)\n\n
for some step length α ∈ (0, 1]. Evaluating ∇_m F is still intractable, though, because of the nonlinear term within the expectation in Equation (7). 
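The closed-form KL term in Equation (5) is the one piece of the bound that needs no linearization, so it is worth spelling out. The following is a minimal illustrative sketch in Python (our own code, not the authors' implementation):

```python
import numpy as np

def kl_gaussians(m, C, mu, K):
    # Eq. (5): KL[N(m, C) || N(mu, K)] between D-dimensional Gaussians.
    D = m.size
    diff = mu - m
    return 0.5 * (np.trace(np.linalg.solve(K, C))      # tr(K^-1 C)
                  + diff @ np.linalg.solve(K, diff)    # (mu - m)^T K^-1 (mu - m)
                  - np.linalg.slogdet(C)[1]            # - log|C|
                  + np.linalg.slogdet(K)[1]            # + log|K|
                  - D)
```

The divergence vanishes exactly when (m, C) = (µ, K), which makes a handy unit test when wiring up the bound.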
If we linearize g(f), we can evaluate the expectation. Let\n\n
g(f) ≈ Af + b,    (10)\n\n
for some linearization matrix A ∈ R^{d×D} and an intercept term b ∈ R^d. Using this we get\n\n
∇_m F ≈ A^T Σ^-1 (y − Am − b) + K^-1 (µ − m)  and  ∇_m ∇_m F ≈ −K^-1 − A^T Σ^-1 A.    (11)\n\n
Substituting (11) into (9) and using the Woodbury identity, we can derive the iterations\n\n
m_{k+1} = (1 − α) m_k + α µ + α H_k (y − b_k − A_k µ),    (12)\n\n
where H_k is usually referred to as a "Kalman gain" term,\n\n
H_k = K A_k^T (Σ + A_k K A_k^T)^-1,    (13)\n\n
and we have assumed that the linearization, A_k, and intercept, b_k, are in some way dependent on the iteration. We can find the posterior covariance by setting ∂F/∂C = 0, where\n\n
∂F/∂C = −(1/2) ∂/∂C <(z − h(f))^T S^-1 (z − h(f))>_qf + (1/2) ∂/∂C log|C|.    (14)\n\n
Again we do not have an analytic solution, so we once more apply the approximation (10) to get\n\n
C = [K^-1 + A^T Σ^-1 A]^-1 = (I_D − HA) K,    (15)\n\n
where we have once more made use of the Woodbury identity, and also the converged values of A and H. At this point it is also worth noting the relationship between Equations (15) and (11).\n\n
2.3 Taylor Series Linearization\n\n
Now we need to find expressions for the linearization terms A and b. One method is to use a first-order Taylor series expansion to linearize g(·) about the last calculation of the posterior mean, m_k,\n\n
g(f) ≈ g(m_k) + J_{m_k} (f − m_k),    (16)\n\n
where J_{m_k} is the Jacobian ∂g(m_k)/∂m_k. By linearizing the function in this way we end up with a Gauss-Newton optimization procedure for finding m. Equating coefficients with (10),\n\n
A = J_{m_k},  b = g(m_k) − J_{m_k} m_k,    (17)\n\n
and then substituting these values into Equations (12) – (15) we get\n\n
m_{k+1} = (1 − α) m_k + α µ + α H_k (y − g(m_k) + J_{m_k} (m_k − µ)),    (18)\n
H_k = K J_{m_k}^T (Σ + J_{m_k} K J_{m_k}^T)^-1,    (19)\n
C = (I_D − H J_m) K.    (20)\n\n
Here J_m and H without the k subscript are constructed about the converged posterior mean, m.\n\n
Remark 1 A single step of the iterated extended Kalman filter [10, 11] corresponds to an update in our variational framework when using the Taylor series linearization of the non-linear forward model g(·) around the posterior mean.\n\n
Having derived the updates in our variational framework, the proof of this is trivial: set α = 1 and use Equations (18) – (20) as the iterative updates.\n\n
2.4 Statistical Linearization\n\n
Another method for linearizing g(·) is statistical linearization (see e.g. [13]), which finds a least-squares best fit to g(·) about a point. The advantage of this method is that it does not require the derivatives ∂g(f)/∂f. To obtain the fit, multiple observations of the forward-model output for different input points are required. Hence, the key question is where to evaluate our forward model so as to obtain representative samples to carry out the linearization. One method of obtaining these points is the unscented transform [2], which defines 2D + 1 'sigma' points,\n\n
M_0 = m,    (21)\n
M_i = m + (√((D + κ) C))_i  for i = 1 ... D,    (22)\n
M_i = m − (√((D + κ) C))_{i−D}  for i = D + 1 ... 2D,    (23)\n\n
with Y_i = g(M_i), for a free parameter κ. Here (√·)_i refers to columns of the matrix square root; we follow [2] and use the Cholesky decomposition. Unlike the usual unscented transform, which uses the prior to create the sigma points, here we use the posterior because of the expectation in Equation (7). Using these points we can define the following statistics:\n\n
ȳ = Σ_{i=0}^{2D} w_i Y_i,    (24)\n
Γ_ym = Σ_{i=0}^{2D} w_i (Y_i − ȳ)(M_i − m)^T,    (25)\n
w_0 = κ / (D + κ),  w_i = 1 / (2(D + κ))  for i = 1 ... 2D.    (26)\n\n
According to [2], various settings of κ can capture information about the higher-order moments of the distribution of y; setting κ = 0.5 yields uniform weights. To find the linearization coefficients, statistical linearization solves the following objective:\n\n
argmin_{A,b} Σ_{i=0}^{2D} ||Y_i − (A M_i + b)||_2^2.    (27)\n\n
This is simply linear least squares and has the solution [13]:\n\n
A = Γ_ym C^-1,  b = ȳ − Am.    (28)\n\n
Substituting b back into Equation (12), we obtain\n\n
m_{k+1} = (1 − α) m_k + α µ + α H_k (y − ȳ_k + A_k (m_k − µ)).    (29)\n\n
Here H_k, A_k and ȳ_k have been evaluated using the statistics from the kth iteration. This implies that the posterior covariance, C_k, is now estimated at every iteration of (29), since we use it to form A_k and b_k. H_k and C_k have the same form as Equations (13) and (15) respectively.\n\n
Remark 2 A single step of the iterated unscented sigma-point Kalman filter (iSPKF, [11]) can be seen as an ad hoc approximation to an update in our statistically linearized variational framework.\n\n
Equations (29) and (15) are equivalent to the equations for a single update of the iterated sigma-point Kalman filter (iSPKF) for α = 1, except for the term ȳ_k appearing in Equation (29) as opposed to g(m_k). 
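The statistical linearization of Equations (21) – (28) and the mean update of Equations (12) – (13) fit together in a few lines of code. The following is an illustrative sketch under our own naming (not the paper's implementation); note that the forward model `g` is only ever evaluated, never differentiated:

```python
import numpy as np

def statistical_linearization(g, m, C, kappa=0.5):
    # Eqs. (21)-(28): fit g(f) ~ A f + b around q(f) = N(m, C) using the
    # 2D + 1 unscented sigma points; kappa = 0.5 gives uniform weights.
    D = m.size
    S = np.linalg.cholesky((D + kappa) * C)            # matrix square root
    M = np.column_stack([m]
                        + [m + S[:, i] for i in range(D)]
                        + [m - S[:, i] for i in range(D)])
    Y = np.column_stack([np.atleast_1d(g(M[:, i])) for i in range(2 * D + 1)])
    w = np.full(2 * D + 1, 1.0 / (2.0 * (D + kappa)))  # Eq. (26)
    w[0] = kappa / (D + kappa)
    ybar = Y @ w                                                    # Eq. (24)
    Gamma = (Y - ybar[:, None]) @ np.diag(w) @ (M - m[:, None]).T   # Eq. (25)
    A = Gamma @ np.linalg.inv(C)                                    # Eq. (28)
    b = ybar - A @ m                                                # Eq. (28)
    return A, b

def mean_update(m, mu, y, A, b, K, Sigma, alpha=1.0):
    # One damped step of Eq. (12), using the "Kalman gain" of Eq. (13).
    H = K @ A.T @ np.linalg.inv(Sigma + A @ K @ A.T)
    return (1.0 - alpha) * m + alpha * mu + alpha * H @ (y - b - A @ mu)
```

Because the sigma points reproduce the mean and covariance of q(f) exactly, the fit recovers A and b exactly for any affine forward model, in which case a single α = 1 step of `mean_update` lands on the exact Gaussian posterior mean. This makes a convenient unit test for both pieces.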
The main difference is that we have derived our updates from variational principles. These updates are also more similar to the regular recursive unscented Kalman filter [2] and to statistically linearized recursive least squares [13].\n\n
2.5 Optimizing the Posterior\n\n
Because of the expectations involving an arbitrary function in Equation (4), no analytical solution exists for the lower bound on the marginal likelihood, F. We can use our approximation (10) again:\n\n
F ≈ −(1/2) [d log 2π + log|Σ| − log|C| + log|K| + (µ − m)^T K^-1 (µ − m) + (y − Am − b)^T Σ^-1 (y − Am − b)].    (30)\n\n
Here the trace term from Equation (5) has cancelled with a trace term from the expected likelihood, tr(A^T Σ^-1 A C) = D − tr(K^-1 C), once we have linearized g(·) and substituted (15). Unfortunately this approximation is no longer a lower bound on the log marginal likelihood in general. In practice we only calculate this approximation to F if we need to optimize some model hyperparameters, as for a Gaussian process, described in §3. When optimizing m, the only terms of F dependent on m in the Taylor series linearization case are\n\n
−(1/2) (y − g(m))^T Σ^-1 (y − g(m)) − (1/2) (µ − m)^T K^-1 (µ − m).    (31)\n\n
This is also the maximum a-posteriori (MAP) objective. A global convergence proof exists for this objective when optimized by a Gauss-Newton procedure, like our Taylor series linearization algorithm, under some conditions on the Jacobians; see [14, p. 255]. No such guarantees exist for statistical linearization, though monitoring (31) works well in practice (see the experiment in §4.1).\n\n
A line search could be used to select an optimal value for the step length α in Equation (12). However, we find that setting α = 1, and then successively multiplying α by some number in (0, 1) until the MAP objective (31) improves, or until some maximum number of iterations is exceeded, is fast and works well in practice. If the maximum number of iterations is exceeded we call this a 'diverge' condition and terminate the search for m (returning the last good value). This only tends to happen for statistical linearization, and it does not tend to impact the algorithm's performance, since we always make sure to improve the (approximate) F.\n\n
3 Variational Inference in Gaussian Process Models with Linearization\n\n
We now present two inference methods for Gaussian process (GP) models [3] with arbitrary nonlinear likelihoods using the framework presented previously. Both Gaussian process models have the following likelihood and prior:\n\n
y ∼ N(g(f), σ² I_N),  f ∼ N(0, K).    (32)\n\n
Here y ∈ R^N are the N noisy observed values of the transformed latent function, g(f), and f ∈ R^N is the latent function we are interested in inferring. K ∈ R^{N×N} is the kernel matrix, where each element k_ij = k(x_i, x_j) is the result of applying a kernel function to each input, x ∈ R^P, in a pairwise manner. It is also important to note that the likelihood noise model is isotropic with a variance of σ². 
The isotropic noise model is not a necessary condition, and we could use a correlated-noise likelihood model; however, the factorized likelihood case is still useful and provides some computational benefits. As before, we make the approximation that the posterior is Gaussian, q(f | m, C) = N(f | m, C), where m ∈ R^N is the mean posterior latent function and C ∈ R^{N×N} is the posterior covariance. Since the likelihood is isotropic and factorizes over the N observations, we have the following expectation under our variational inference framework:\n\n
<log p(y|f)>_qf = −(N/2) log 2πσ² − (1/(2σ²)) Σ_{n=1}^{N} <(y_n − g(f_n))²>_qfn.    (33)\n\n
As a consequence, the linearization is one-dimensional, that is, g(f_n) ≈ a_n f_n + b_n. Using this we can derive the approximate gradients\n\n
∇_m F ≈ (1/σ²) A (y − Am − b) − K^-1 m,  ∇_m ∇_m F ≈ −K^-1 − A Λ^-1 A,\n\n
where A = diag([a_1, ..., a_N]) and Λ = diag([σ², ..., σ²]). Because of the factorizing likelihood we obtain C^-1 = K^-1 + A Λ^-1 A; that is, the inverse posterior covariance is just the inverse prior covariance with a modified diagonal. This means that if we were to use this inverse parameterization of the Gaussian, which is also used in [9], we would only have to infer 2N parameters (instead of N + N(N + 1)/2). We can obtain the iterative steps for m straightforwardly:\n\n
m_{k+1} = (1 − α) m_k + α H_k (y − b_k),  where  H_k = K A_k (Λ + A_k K A_k)^-1,    (34)\n\n
and also an expression for the posterior covariance,\n\n
C = (I_N − HA) K.    (35)\n\n
The values of a_n and b_n for the two linearization methods are\n\n
Taylor:  a_n = ∂g(m_n)/∂m_n,  b_n = g(m_n) − (∂g(m_n)/∂m_n) m_n,    (36)\n
Statistical:  a_n = Γ_my,n / C_nn,  b_n = ȳ_n − a_n m_n.    (37)\n\n
Here C_nn is the nth diagonal element of C, and Γ_my,n and ȳ_n are scalar versions of Equations (21) – (26). The sigma points for each observation, n, are M_n = {m_n, m_n + √((1 + κ) C_nn), m_n − √((1 + κ) C_nn)}. We refer to the Taylor-series-linearized GP as the extended GP (EGP), and to the statistically linearized GP as the unscented GP (UGP).\n\n
3.1 Prediction\n\n
The predictive distribution of a latent value, f*, given a query point, x*, requires the marginalization ∫ p(f*|f) q(f | m, C) df, where p(f*|f) is a regular GP predictive distribution. This gives f* ∼ N(m*, C*), with\n\n
m* = k*^T K^-1 m,  C* = k** − k*^T K^-1 [I_N − C K^-1] k*,    (38)\n\n
where k** = k(x*, x*) and k* = [k(x_1, x*), ..., k(x_N, x*)]^T. We can also find the predicted observations, ȳ*, by evaluating the one-dimensional integral\n\n
ȳ* = <y*>_qf* = ∫ g(f*) N(f* | m*, C*) df*,    (39)\n\n
for which we use quadrature. Alternatively, with the UGP we can use another application of the unscented transform to approximate the predictive distribution y* ∼ N(ȳ*, σ²_y*), where\n\n
ȳ* = Σ_{i=0}^{2} w_i Y*_i,  σ²_y* = Σ_{i=0}^{2} w_i (Y*_i − ȳ*)²,  with Y*_i = g(M*_i).    (40)\n\n
This works well in practice; see Figure 1 for a demonstration.\n\n
3.2 Learning the Linearized GPs\n\n
Learning the extended and unscented GPs consists of an inner and an outer loop. Much like the Laplace approximation for binary Gaussian process classifiers [3], the inner loop learns the posterior mean, m, and the outer loop optimizes the likelihood parameters (e.g. the variance σ²) and the kernel hyperparameters, k(·,·|θ). The dominant computational cost in learning the parameters is the inversion in Equation (34), so the computational complexity of the EGP and UGP is about the same as for the Laplace GP approximation. To learn the kernel hyperparameters and σ² we use numerical techniques to find the gradients, ∂F/∂θ, for both algorithms, where F is approximated as\n\n
F ≈ −(1/2) [N log 2πσ² − log|C| + log|K| + m^T K^-1 m + (1/σ²) (y − Am − b)^T (y − Am − b)].    (41)\n\n
Specifically, we use derivative-free optimization methods (e.g. BOBYQA) from the NLopt library [15], which we find fast and effective. This also has the advantage of not requiring knowledge of ∂g(f)/∂f, or of higher-order derivatives for any implicit gradient dependencies between f and θ.\n\n
4 Experiments\n\n
4.1 Toy Inversion Problems\n\n
In this experiment we generate 'latent' function data from f ∼ N(0, K), where a Matérn 5/2 kernel function is used with amplitude σ_m52 = 0.8 and length scale l_m52 = 0.6, and x ∈ R are uniformly spaced between [−2π, 2π] to build K. 
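Before turning to the experiments, a quick sanity check of the factorized updates in §3: with the identity forward model g(f) = f, the Taylor linearization of Equation (36) gives a_n = 1 and b_n = 0, and a single α = 1 step of Equation (34) should land exactly on the standard GP regression posterior mean, K (K + σ²I)⁻¹ y. A sketch (our own illustrative code, with a toy squared-exponential kernel):

```python
import numpy as np

def factorized_mean_update(m, y, a, b, K, sigma2, alpha=1.0):
    # Eq. (34) with A = diag(a_1, ..., a_N) and Lambda = sigma2 * I:
    # m_{k+1} = (1 - alpha) m_k + alpha H (y - b),
    # H = K A (Lambda + A K A)^-1.
    A = np.diag(a)
    H = K @ A @ np.linalg.inv(sigma2 * np.eye(y.size) + A @ K @ A)
    return (1.0 - alpha) * m + alpha * H @ (y - b)
```

For nonlinear g the same function is simply iterated, with a and b recomputed from the current m via Equation (36) or (37) at each step.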
Observations used to test and train the GPs are then generated as y = g(f) + ε, where ε ∼ N(0, 0.2²). 1000 points are generated in this way, and we use 5-fold cross validation to train (200 points) and test (800 points) the GPs. We use standardized mean squared error (SMSE) to test the predictions against the held-out data in both the latent and observed spaces. We also use the average negative log predictive density (NLPD) on the latent test data, which is calculated as −(1/N*) Σ_n log N(f*_n | m*_n, C*_n). All GP methods use Matérn 5/2 covariance functions with the hyperparameters and σ² initialized at 1.0 and lower-bounded at 0.1 (and 0.01 for σ²).\n\n
Table 1: The negative log predictive density (NLPD) and the standardized mean squared error (SMSE) on test data for various differentiable forward models. Lower values are better for both measures. The predicted f* and y* are the same for g(f) = f, so we do not report y* in this case.\n\n
g(f) | Algorithm | NLPD f* mean (std.) | SMSE f* mean (std.) | SMSE y* mean (std.)\n
f | UGP | -0.90046 (0.06743) | 0.01219 (0.00171) | –\n
f | EGP | -0.89908 (0.06608) | 0.01224 (0.00178) | –\n
f | [9] | -0.27590 (0.06884) | 0.01249 (0.00159) | –\n
f | GP | -0.90278 (0.06988) | 0.01211 (0.00160) | –\n
f³ + f² + f | UGP | -0.23622 (1.72609) | 0.01534 (0.00202) | 0.02184 (0.00525)\n
f³ + f² + f | EGP | -0.22325 (1.76231) | 0.01518 (0.00203) | 0.02184 (0.00528)\n
f³ + f² + f | [9] | -0.14559 (0.04026) | 0.06733 (0.01421) | 0.02686 (0.00266)\n
exp(f) | UGP | -0.75475 (0.32376) | 0.13860 (0.04833) | 0.03865 (0.00403)\n
exp(f) | EGP | -0.75706 (0.32051) | 0.13971 (0.04842) | 0.03872 (0.00411)\n
exp(f) | [9] | -0.08176 (0.10986) | 0.17614 (0.04845) | 0.05956 (0.01070)\n
sin(f) | UGP | -0.59710 (0.22861) | 0.03305 (0.00840) | 0.11513 (0.00521)\n
sin(f) | EGP | -0.59705 (0.21611) | 0.03480 (0.00791) | 0.11478 (0.00532)\n
sin(f) | [9] | -0.04363 (0.03883) | 0.05913 (0.01079) | 0.11890 (0.00652)\n
tanh(2f) | UGP | 0.01101 (0.60256) | 0.15703 (0.06077) | 0.08767 (0.00292)\n
tanh(2f) | EGP | 0.57403 (1.25248) | 0.18739 (0.07869) | 0.08874 (0.00394)\n
tanh(2f) | [9] | 0.15743 (0.14663) | 0.16049 (0.04563) | 0.09434 (0.00425)\n\n
Figure 1: Learning the UGP with a non-differentiable forward model, g(f) = 2 × sign(f) + f³, in (a), and a corresponding trace of the MAP objective function used to learn m in (b). The optimization shown terminated because of a 'divergence' condition, though the objective function value had still improved.\n\n
Table 1 shows results for multiple differentiable forward models, g(·). We test the EGP and UGP against the model in [9], which uses 10,000 samples to evaluate the one-dimensional expectations. Although this number of samples may seem excessive for these simple problems, our goal here is to have a competitive baseline algorithm. We also test against normal GP regression for a linear forward model, g(f) = f. In Figure 1 we show the results of the UGP using a forward model for which no derivative exists at the zero crossing points, as well as an objective-function trace for learning the posterior mean. We use quadrature for the predictions in observation space in Table 1, and the unscented transform, Equation (40), for the predictions in Figure 1. Interestingly, there is almost no difference in performance between the EGP and UGP, even though the EGP has access to the derivatives of the forward models and the UGP does not. Both the UGP and EGP consistently outperformed [9] in terms of NLPD and SMSE, apart from the tanh experiment for inversion. In this experiment the UGP had the best performance, but the EGP was outperformed by [9].\n\n
Table 2: Classification performance on the USPS handwritten-digits dataset for the numbers '3' and '5'. Lower values of the negative log probability (NLP) and error rate indicate better performance. The learned signal variance (σ²_se) and length scale (l_se) are also shown for consistency with [3, §3.7.3].\n\n
Algorithm | NLP y* | Error rate (%) | log(σ_se) | log(l_se)\n
GP – Laplace | 0.11528 | 2.9754 | 2.5855 | 2.5823\n
GP – EP | 0.07522 | 2.4580 | 5.2209 | 2.5315\n
GP – VB | 0.10891 | 3.3635 | 0.9045 | 2.0664\n
SVM (RBF) | 0.08055 | 2.3286 | – | –\n
Logistic Reg. | 0.11995 | 3.6223 | – | –\n
UGP | 0.07290 | 1.9405 | 1.5743 | 1.5262\n
EGP | 0.08051 | 2.1992 | 2.9134 | 1.7872\n\n
4.2 Binary Handwritten Digit Classification\n\n
For this experiment we evaluate the EGP and UGP on a classification task. We are interested only in a probabilistic prediction of the class labels, and not in the values of the latent function. We use the USPS handwritten digits dataset with the task of distinguishing between '3' and '5'; this is the same experiment as in [3, §3.7.3]. A logistic sigmoid is used as the forward model, g(·), in our algorithms. We test against Laplace, expectation propagation and variational Bayes logistic GP classifiers (from the GPML Matlab toolbox [3]), a support vector machine (SVM) with a radial basis kernel function (and probabilistic outputs [16]), and logistic regression (both from the scikit-learn Python library [17]). A squared exponential kernel with amplitude σ_se and length scale l_se is used for the GPs in this experiment. We initialize these hyperparameters at 1.0 and put a lower bound of 0.1 on them. We initialize σ² and place a lower bound at 10⁻¹⁴ for the EGP and UGP (the optimized values are near or at this value). 
The hyperparameters for the SVM are learned using grid search with three-fold cross validation.\n\n
The results are summarized in Table 2, where we report the average Bernoulli negative log-probability (NLP), the error rate, and the learned hyperparameter values for the GPs. Surprisingly, the UGP outperforms the other classifiers on this dataset, despite those classifiers being specifically formulated for the classification task.\n\n
5 Conclusion and Discussion\n\n
We have presented a variational inference framework with linearization for Gaussian models with nonlinear likelihood functions, which we show can be used to derive updates for the extended and unscented Kalman filter algorithms, the iEKF and the iSPKF. We then generalize these results and develop two inference algorithms for Gaussian processes, the EGP and UGP. The UGP does not use derivatives of the nonlinear forward model, yet performs as well as the EGP on inversion and classification problems.\n\n
Our method is similar to the warped GP (WGP) [18]; however, we wish to infer the full posterior over the latent function f, whereas the goal of the WGP is to infer a transformation of non-Gaussian observations to a space where a GP can be constructed. That is, the WGP is concerned with inferring an inverse function g^-1(·) so that the transformed (latent) function is well modeled by a GP.\n\n
As future work we would like to create multi-task EGPs and UGPs. This would extend their applicability to inversion problems where the forward models have multiple inputs and outputs, such as inverse kinematics for dynamical systems.\n\n
Acknowledgments\n\n
This research was supported by the Science Industry Endowment Fund (RP 04-174) Big Data Knowledge Discovery project. We thank F. Ramos, L. McCalman, S. O'Callaghan, A. Reid and T. Nguyen for their helpful feedback. 
NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.\n\n
References\n\n
[1] D. W. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," Journal of the Society for Industrial & Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.\n
[2] S. Julier and J. Uhlmann, "Unscented filtering and nonlinear estimation," Proceedings of the IEEE, vol. 92, no. 3, pp. 401–422, Mar 2004.\n
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. The MIT Press, Cambridge, Massachusetts, 2006.\n
[4] A. Reid, S. O'Callaghan, E. V. Bonilla, L. McCalman, T. Rawling, and F. Ramos, "Bayesian joint inversions for the exploration of Earth resources," in Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 2877–2884.\n
[5] K. M. A. Chai, C. K. I. Williams, S. Klanke, and S. Vijayakumar, "Multi-task Gaussian process learning of robot inverse dynamics," in Advances in Neural Information Processing Systems (NIPS). Curran Associates, Inc., 2009, pp. 265–272.\n
[6] N. D. Lawrence, "Gaussian process latent variable models for visualisation of high dimensional data," in Advances in Neural Information Processing Systems (NIPS), vol. 2, 2003, p. 5.\n
[7] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models," in Advances in Neural Information Processing Systems (NIPS), vol. 18, 2005, p. 3.\n
[8] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 283–298, 2008.\n
[9] M. Opper and C. Archambeau, "The variational Gaussian approximation revisited," Neural Computation, vol. 21, no. 3, pp. 786–792, 2009.\n
[10] B. M. Bell and F. W. Cathey, "The iterated Kalman filter update as a Gauss-Newton method," IEEE Transactions on Automatic Control, vol. 38, no. 2, pp. 294–297, 1993.\n
[11] G. Sibley, G. Sukhatme, and L. Matthies, "The iterated sigma point Kalman filter with applications to long range stereo," in Robotics: Science and Systems, vol. 8, no. 1, 2006, pp. 235–244.\n
[12] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.\n
[13] M. Geist and O. Pietquin, "Statistically linearized recursive least squares," in 2010 IEEE International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2010, pp. 272–276.\n
[14] J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. New York: Springer, 2006.\n
[15] S. G. Johnson, "The NLopt nonlinear-optimization package." [Online]. Available: http://ab-initio.mit.edu/wiki/index.php/Citing_NLopt\n
[16] J. Platt et al., "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods," Advances in Large Margin Classifiers, vol. 10, no. 3, pp. 61–74, 1999.\n
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.\n
[18] E. Snelson, C. E. Rasmussen, and Z. 
Ghahramani, \u201cWarped Gaussian processes,\u201d in NIPS, 2003.\n", "award": [], "sourceid": 706, "authors": [{"given_name": "Daniel", "family_name": "Steinberg", "institution": "NICTA"}, {"given_name": "Edwin", "family_name": "Bonilla", "institution": "The University of New South Wales"}]}