{"title": "Probabilistic ODE Solvers with Runge-Kutta Means", "book": "Advances in Neural Information Processing Systems", "page_first": 739, "page_last": 747, "abstract": "Runge-Kutta methods are the classic family of solvers for ordinary differential equations (ODEs), and the basis for the state of the art. Like most numerical methods, they return point estimates. We construct a family of probabilistic numerical methods that instead return a Gauss-Markov process defining a probability distribution over the ODE solution. In contrast to prior work, we construct this family such that posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting their proven good properties. Remaining degrees of freedom not identified by the match to Runge-Kutta are chosen such that the posterior probability measure fits the observed structure of the ODE. Our results shed light on the structure of Runge-Kutta solvers from a new direction, provide a richer, probabilistic output, have low computational cost, and raise new research questions.", "full_text": "Probabilistic ODE Solvers with Runge-Kutta Means\n\nMichael Schober\nMPI for Intelligent Systems\nT\u00fcbingen, Germany\nmschober@tue.mpg.de\n\nDavid Duvenaud\nDepartment of Engineering\nCambridge University\ndkd23@cam.ac.uk\n\nPhilipp Hennig\nMPI for Intelligent Systems\nT\u00fcbingen, Germany\nphennig@tue.mpg.de\n\nAbstract\n\nRunge-Kutta methods are the classic family of solvers for ordinary differential\nequations (ODEs), and the basis for the state of the art. Like most numerical methods,\nthey return point estimates. We construct a family of probabilistic numerical\nmethods that instead return a Gauss-Markov process defining a probability distribution\nover the ODE solution. In contrast to prior work, we construct this family such\nthat posterior means match the outputs of the Runge-Kutta family exactly, thus inheriting\ntheir proven good properties. 
Remaining degrees of freedom not identified\nby the match to Runge-Kutta are chosen such that the posterior probability measure\nfits the observed structure of the ODE. Our results shed light on the structure of\nRunge-Kutta solvers from a new direction, provide a richer, probabilistic output,\nhave low computational cost, and raise new research questions.\n\n1 Introduction\n\nDifferential equations are a basic feature of dynamical systems. Hence, researchers in machine\nlearning have repeatedly been interested in both the problem of inferring an ODE description from\nobserved trajectories of a dynamical system [1, 2, 3, 4], and its dual, inferring a solution (a trajectory)\nfor an ODE initial value problem (IVP) [5, 6, 7, 8]. Here we address the latter, classic numerical\nproblem. Runge-Kutta (RK) methods [9, 10] are standard tools for this purpose. Over more than a\ncentury, these algorithms have matured into a very well-understood, efficient framework [11].\nAs recently pointed out by Hennig and Hauberg [6], since Runge-Kutta methods are linear extrapolation\nmethods, their structure can be emulated by Gaussian process (GP) regression algorithms. Such\nan algorithm was envisioned by Skilling in 1991 [5], and the idea has recently attracted both theoretical\n[8] and practical [6, 7] interest. By returning a posterior probability measure over the solution\nof the ODE problem, instead of a point estimate, Gaussian process solvers extend the functionality\nof RK solvers in ways that are particularly interesting for machine learning. Solution candidates\ncan be drawn from the posterior and marginalized [7]. This can allow probabilistic solvers to stop\nearlier, and to deal (approximately) with probabilistically uncertain inputs and problem definitions\n[6]. However, current GP ODE solvers do not share the good theoretical convergence properties of\nRunge-Kutta methods. 
Specifically, they do not have high polynomial order, explained below.\nWe construct GP ODE solvers whose posterior mean functions exactly match those of the RK families\nof first, second and third order. This yields a probabilistic numerical method which combines the\nstrengths of Runge-Kutta methods with the additional functionality of GP ODE solvers. It also\nprovides a new interpretation of the classic algorithms, raising new conceptual questions.\nWhile our algorithm could be seen as a \u201cBayesian\u201d version of the Runge-Kutta framework, a\nphilosophically less loaded interpretation is that, where Runge-Kutta methods fit a single curve (a\npoint estimate) to an IVP, our algorithm fits a probability distribution over such potential solutions,\nsuch that the mean of this distribution matches the Runge-Kutta estimate exactly. We find a family of\nmodels in the space of Gaussian process linear extrapolation methods with this property, and select a\nmember of this family (fix the remaining degrees of freedom) through statistical estimation.\n\nTable 1: All consistent Runge-Kutta methods of order $p \\le 3$ and number of stages $s = p$ (see [11]).\n\n$p = 1$:\n$$\\begin{array}{c|c} 0 & 0 \\\\ \\hline & 1 \\end{array}$$\n\n$p = 2$:\n$$\\begin{array}{c|cc} 0 & 0 & 0 \\\\ \\alpha & \\alpha & 0 \\\\ \\hline & 1 - \\frac{1}{2\\alpha} & \\frac{1}{2\\alpha} \\end{array}$$\n\n$p = 3$:\n$$\\begin{array}{c|ccc} 0 & 0 & 0 & 0 \\\\ u & u & 0 & 0 \\\\ v & v - \\frac{v(v-u)}{u(2-3u)} & \\frac{v(v-u)}{u(2-3u)} & 0 \\\\ \\hline & 1 - \\frac{2-3v}{6u(u-v)} - \\frac{2-3u}{6v(v-u)} & \\frac{2-3v}{6u(u-v)} & \\frac{2-3u}{6v(v-u)} \\end{array}$$\n\n2 Background\n\nAn ODE Initial Value Problem (IVP) is to find a function $x(t): \\mathbb{R} \\to \\mathbb{R}^N$ such that the ordinary\ndifferential equation $\\dot{x} = f(x, t)$ (where $\\dot{x} = \\partial x / \\partial t$) holds for all $t \\in T = [t_0, t_H]$, and $x(t_0) = x_0$.\nWe assume that a unique solution exists. To keep notation simple, we will treat $x$ as scalar-valued;\nthe multivariate extension is straightforward (it involves $N$ separate GP models, explained in supp.).\nRunge-Kutta methods (see footnote 1) [9, 10] are carefully designed linear extrapolation methods operating on small\ncontiguous subintervals $[t_n, t_n + h] \\subset T$ of length $h$. Assume for the moment that $n = 0$. Within\n$[t_0, t_0 + h]$, an RK method of stage $s$ collects evaluations $y_i = f(\\hat{x}_i, t_0 + h c_i)$ at $s$ recursively defined\ninput locations, $i = 1, \\dots, s$, where $\\hat{x}_i$ is constructed linearly from the previously-evaluated $y_{j<i}$ as\n\n$$\\hat{x}_i = x_0 + h \\sum_{j=1}^{i-1} w_{ij} y_j, \\tag{1}$$\n\nthen returns a single prediction for the solution of the IVP at $t_0 + h$, as $\\hat{x}(t_0 + h) = x_0 + h \\sum_{i=1}^{s} b_i y_i$\n(modern variants can also construct non-probabilistic error estimates, e.g. by combining the same\nobservations into two different RK predictions [12]). In compact form,\n\n$$y_i = f\\Big(x_0 + h \\sum_{j=1}^{i-1} w_{ij} y_j,\\; t_0 + h c_i\\Big), \\quad i = 1, \\dots, s, \\qquad \\hat{x}(t_0 + h) = x_0 + h \\sum_{i=1}^{s} b_i y_i. \\tag{2}$$\n\n$\\hat{x}(t_0 + h)$ is then taken as the initial value for $t_1 = t_0 + h$ and the process is repeated until $t_n + h \\ge t_H$.\nA Runge-Kutta method is thus identified by a lower-triangular matrix $W = \\{w_{ij}\\}$, and vectors\n$c = [c_1, \\dots, c_s]$, $b = [b_1, \\dots, b_s]$, often presented compactly in a Butcher tableau [13]:\n\n$$\\begin{array}{c|ccccc} c_1 & 0 & & & & \\\\ c_2 & w_{21} & 0 & & & \\\\ c_3 & w_{31} & w_{32} & 0 & & \\\\ \\vdots & \\vdots & \\vdots & \\ddots & \\ddots & \\\\ c_s & w_{s1} & w_{s2} & \\cdots & w_{s,s-1} & 0 \\\\ \\hline & b_1 & b_2 & \\cdots & b_{s-1} & b_s \\end{array}$$\n\nAs Hennig and Hauberg [6] recently pointed out, the linear structure of the extrapolation steps in\nRunge-Kutta methods means that their algorithmic structure, the Butcher tableau, can be constructed\nnaturally from a Gaussian process regression method over $x(t)$, where the $y_i$ are treated as \u201cobservations\u201d\nof $\\dot{x}(t_0 + h c_i)$ and the $\\hat{x}_i$ are subsequent posterior estimates (more below). However,\nproper RK methods have structure that is not generally reproduced by an arbitrary Gaussian process\nprior on $x$: Their distinguishing property is that the approximation $\\hat{x}$ and the Taylor series\nof the true solution coincide at $t_0 + h$ up to the $p$-th term: their numerical error is bounded by\n$\\|x(t_0 + h) - \\hat{x}(t_0 + h)\\| \\le K h^{p+1}$ for some constant $K$ (higher orders are better, because $h$ is assumed\nto be small). The method is then said to be of order $p$ [11]. A method is consistent, if it is of order\n$p = s$. This is only possible for $p < 5$ [14, 15]. There are no methods of order $p > s$. High order is a\nstrong desideratum for ODE solvers, not currently offered by Gaussian process extrapolators.\nTable 1 lists all consistent methods of order $p \\le 3$ where $s = p$. For $s = 1$, only Euler\u2019s method (linear\nextrapolation) is consistent. For $s = 2$, there exists a family of methods of order $p = 2$, parametrized\nby a single parameter $\\alpha \\in (0, 1]$, where $\\alpha = 1/2$ and $\\alpha = 1$ mark the midpoint rule and Heun\u2019s method,\nrespectively. For $s = 3$, third order methods are parameterized by two variables $u, v \\in (0, 1]$.\n\nFootnote 1: In this work, we only address so-called explicit RK methods (shortened to \u201cRunge-Kutta methods\u201d for\nsimplicity). These are the base case of the extensive theory of RK methods. Many generalizations can be found\nin [11]. Extending the probabilistic framework discussed here to the wider Runge-Kutta class is not trivial.\n\nGaussian processes (GPs) are well-known in the NIPS community, so we omit an introduction.\nWe will use the standard notation $\\mu: \\mathbb{R} \\to \\mathbb{R}$ for the mean function, and $k: \\mathbb{R} \\times \\mathbb{R} \\to \\mathbb{R}$ for the\ncovariance function; $k_{UV}$ for Gram matrices of kernel values $k(u_i, v_j)$, and analogous for the mean\nfunction: $\\mu_T = [\\mu(t_1), \\dots, \\mu(t_N)]$. A GP prior $p(x) = \\mathcal{GP}(x; \\mu, k)$ and observations $(T, Y) =\n\\{(t_1, y_1), \\dots, (t_s, y_s)\\}$ having likelihood $\\mathcal{N}(Y; x_T, \\Lambda)$ give rise to a posterior $\\mathcal{GP}^s(x; \\mu^s, k^s)$ with\n\n$$\\mu^s_t = \\mu_t + k_{tT} (k_{TT} + \\Lambda)^{-1} (Y - \\mu_T) \\quad \\text{and} \\quad k^s_{uv} = k_{uv} - k_{uT} (k_{TT} + \\Lambda)^{-1} k_{Tv}. \\tag{3}$$\n\nGPs are closed under linear maps. In particular, the joint distribution over $x$ and its derivative is\n\n$$p\\left(\\begin{bmatrix} x \\\\ \\dot{x} \\end{bmatrix}\\right) = \\mathcal{GP}\\left(\\begin{bmatrix} x \\\\ \\dot{x} \\end{bmatrix}; \\begin{bmatrix} \\mu \\\\ \\mu^{\\partial} \\end{bmatrix}, \\begin{bmatrix} k & k^{\\partial} \\\\ {}^{\\partial}k & {}^{\\partial}k^{\\partial} \\end{bmatrix}\\right) \\tag{4}$$\n\nwith\n\n$$\\mu^{\\partial} = \\frac{\\partial \\mu(t)}{\\partial t}, \\quad k^{\\partial} = \\frac{\\partial k(t, t')}{\\partial t'}, \\quad {}^{\\partial}k = \\frac{\\partial k(t, t')}{\\partial t}, \\quad {}^{\\partial}k^{\\partial} = \\frac{\\partial^2 k(t, t')}{\\partial t \\, \\partial t'}. \\tag{5}$$\n\nA recursive algorithm analogous to RK methods can be constructed [5, 6] by setting the prior mean\nto the constant $\\mu(t) = x_0$, then recursively estimating $\\hat{x}_i$ in some form from the current posterior\nover $x$. The choice in [6] is to set $\\hat{x}_i = \\mu^i(t_0 + h c_i)$. \u201cObservations\u201d $y_i = f(\\hat{x}_i, t_0 + h c_i)$ are then\nincorporated with likelihood $p(y_i \\mid x) = \\mathcal{N}(y_i; \\dot{x}(t_0 + h c_i), \\Lambda)$. This recursively gives estimates\n\n$$\\hat{x}(t_0 + h c_i) = x_0 + \\sum_{j=1}^{i-1} \\sum_{\\ell=1}^{i-1} k^{\\partial}(t_0 + h c_i, t_0 + h c_\\ell) \\, ({}^{\\partial}K^{\\partial} + \\Lambda)^{-1}_{\\ell j} \\, y_j = x_0 + h \\sum_{j} w_{ij} y_j, \\tag{6}$$\n\nwith ${}^{\\partial}K^{\\partial}_{ij} = {}^{\\partial}k^{\\partial}(t_0 + h c_i, t_0 + h c_j)$. The final prediction is the posterior mean at this point:\n\n$$\\hat{x}(t_0 + h) = x_0 + \\sum_{i=1}^{s} \\sum_{j=1}^{s} k^{\\partial}(t_0 + h, t_0 + h c_j) \\, ({}^{\\partial}K^{\\partial} + \\Lambda)^{-1}_{ji} \\, y_i = x_0 + h \\sum_{i=1}^{s} b_i y_i. \\tag{7}$$\n\n3 Results\n\nThe described GP ODE estimate shares the algorithmic structure of RK methods (i.e. they both\nuse weighted sums of the constructed estimates to extrapolate). 
However, in RK methods, weights\nand evaluation positions are found by careful analysis of the Taylor series of $f$, such that low-order\nterms cancel. In GP ODE solvers they arise, perhaps more naturally but also with less structure,\nby the choice of the $c_i$ and the kernel. In previous work [6, 7], both were chosen ad hoc, with no\nguarantee of convergence order. In fact, as is shown in the supplements, the choices in these two\nworks (square-exponential kernel with finite length-scale, evaluations at the predictive mean) do not\neven give the first order convergence of Euler\u2019s method. Below we present three specific regression\nmodels based on integrated Wiener covariance functions and specific evaluation points. Each model is\nthe improper limit of a Gauss-Markov process, such that the posterior distribution after $s$ evaluations\nis a proper Gaussian process, and the posterior mean function at $t_0 + h$ coincides exactly with the\nRunge-Kutta estimate. We will call these methods, which give a probabilistic interpretation to RK\nmethods and extend them to return probability distributions, Gauss-Markov-Runge-Kutta (GMRK)\nmethods, because they are based on Gauss-Markov priors and yield Runge-Kutta predictions.\n\n3.1 Design choices and desiderata for a probabilistic ODE solver\n\nAlthough we are not the first to attempt constructing an ODE solver that returns a probability\ndistribution, open questions still remain about what, exactly, the properties of such a probabilistic\nnumerical method should be. Chkrebtii et al. [8] previously made the case that Gaussian measures\nare uniquely suited because solution spaces of ODEs are Banach spaces, and provided results on\nconsistency. Above, we added the desideratum for the posterior mean to have high order, i.e. to\nreproduce the Runge-Kutta estimate. Below, three additional issues become apparent:\n\nMotivation of evaluation points Both Skilling [5] and Hennig and Hauberg [6] propose to put the\n\u201cnodes\u201d $\\hat{x}(t_0 + h c_i)$ at the current posterior mean of the belief. We will find that this can be made\nconsistent with the order requirement for the RK methods of first and second order. However, our\nthird-order methods will be forced to use a node $\\hat{x}(t_0 + h c_i)$ that, albeit lying along a function $w(t)$\nin the reproducing kernel Hilbert space associated with the posterior GP covariance function, is not\nthe mean function itself. It will remain open whether the algorithm can be amended to remove this\nblemish. However, as the nodes do not enter the GP regression formulation, their choice does not\ndirectly affect the probabilistic interpretation.\n\n[Figure 1 plot: three panels, 1st order (Euler), 2nd order (midpoint), 3rd order ($u = 1/4$, $v = 3/4$); horizontal axis $t$ from $t_0$ to $t_0 + h$; vertical axis $x$ (top row) and $x - \\mu(t)$ (bottom row).]\n\nFigure 1: Top: Conceptual sketches. Prior mean in gray. Initial value at $t_0 = 1$ (filled blue).\nGradient evaluations (empty blue circles, lines). Posterior (means) after first, second and third\ngradient observation in orange, green and red respectively. Samples from the final posterior as dashed\nlines. Since, for the second and third-order methods, only the final prediction is a proper probability\ndistribution, for intermediate steps only mean functions are shown. True solution to (linear) ODE in\nblack. Bottom: For better visibility, same data as above, minus final posterior mean.\n\nExtension beyond the first extrapolation interval Importantly, the Runge-Kutta argument for\nconvergence order only holds strictly for the first extrapolation interval $[t_0, t_0 + h]$. From the second\ninterval onward, the RK step solves an estimated IVP, and begins to accumulate a global estimation\nerror not bounded by the convergence order (an effect termed \u201cLady Windermere\u2019s fan\u201d by Wanner\n[16]). 
Should a probabilistic solver aim to faithfully reproduce this imperfect chain of RK solvers, or\nrather try to capture the accumulating global error? We investigate both options below.\n\nCalibration of uncertainty A question easily posed but hard to answer is what it means for the\nprobability distribution returned by a probabilistic method to be well calibrated. For our Gaussian\ncase, requiring RK order in the posterior mean determines all but one degree of freedom of an answer.\nThe remaining parameter is the output scale of the kernel, the \u201cerror bar\u201d of the estimate. We offer a\nrelatively simple statistical argument below that fits this parameter based on observed values of $f$.\nWe can now proceed to the main results. In the following, we consider extrapolation algorithms\nbased on Gaussian process priors with vanishing prior mean function, noise-free observation model\n($\\Lambda = 0$ in Eq. (3)). All covariance functions in question are integrals over the kernel $k_0(\\tilde{t}, \\tilde{t}') =\n\\sigma^2 \\min(\\tilde{t} - \\tau, \\tilde{t}' - \\tau)$ (parameterized by scale $\\sigma^2 > 0$ and off-set $\\tau \\in \\mathbb{R}$; valid on the domain $\\tilde{t}, \\tilde{t}' > \\tau$),\nthe covariance of the Wiener process [17]. Such integrated Wiener processes are Gauss-Markov\nprocesses, of increasing order, so inference in these methods can be performed by filtering, at linear\ncost [18]. We will use the shorthands $t = \\tilde{t} - \\tau$ and $t' = \\tilde{t}' - \\tau$ for inputs shifted by $\\tau$.\n\n3.2 Gauss-Markov methods matching Euler\u2019s method\n\nTheorem 1. The once-integrated Wiener process prior $p(x) = \\mathcal{GP}(x; 0, k_1)$ with\n\n$$k_1(t, t') = \\iint_{\\tau}^{\\tilde{t}, \\tilde{t}'} k_0(u, v) \\, du \\, dv = \\sigma^2 \\left( \\frac{\\min^3(t, t')}{3} + |t - t'| \\frac{\\min^2(t, t')}{2} \\right) \\tag{8}$$\n\nand choosing evaluation nodes at the posterior mean gives rise to Euler\u2019s method.\n\nProof. 
We show that the corresponding Butcher tableau from Table 1 holds. After \u201cobserving\u201d the\ninitial value, the second observation $y_1$, constructed by evaluating $f$ at the posterior mean at $t_0$, is\n\n$$y_1 = f\\big(\\mu_{|x_0}(t_0), t_0\\big) = f\\left(\\frac{k(t_0, t_0)}{k(t_0, t_0)} x_0, t_0\\right) = f(x_0, t_0), \\tag{9}$$\n\ndirectly from the definitions. The posterior mean after incorporating $y_1$ is\n\n$$\\mu_{|x_0, y_1}(t_0 + h) = \\begin{bmatrix} k(t_0 + h, t_0) & k^{\\partial}(t_0 + h, t_0) \\end{bmatrix} \\begin{bmatrix} k(t_0, t_0) & k^{\\partial}(t_0, t_0) \\\\ {}^{\\partial}k(t_0, t_0) & {}^{\\partial}k^{\\partial}(t_0, t_0) \\end{bmatrix}^{-1} \\begin{bmatrix} x_0 \\\\ y_1 \\end{bmatrix} = x_0 + h y_1. \\tag{10}$$\n\nAn explicit linear algebraic derivation is available in the supplements.\n\n3.3 Gauss-Markov methods matching all Runge-Kutta methods of second order\n\nExtending to second order is not as straightforward as integrating the Wiener process a second time.\nThe theorem below shows that this only works after moving the onset $-\\tau$ of the process towards\ninfinity. Fortunately, this limit still leads to a proper posterior probability distribution.\n\nTheorem 2. Consider the twice-integrated Wiener process prior $p(x) = \\mathcal{GP}(x; 0, k_2)$ with\n\n$$k_2(t, t') = \\iint_{\\tau}^{\\tilde{t}, \\tilde{t}'} k_1(u, v) \\, du \\, dv = \\sigma^2 \\left( \\frac{\\min^5(t, t')}{20} + \\frac{|t - t'|}{12} \\left( (t + t') \\min^3(t, t') - \\frac{\\min^4(t, t')}{2} \\right) \\right). \\tag{11}$$\n\nChoosing evaluation nodes at the posterior mean gives rise to the RK family of second order methods\nin the limit of $\\tau \\to \\infty$.\n\n(The twice-integrated Wiener process is a proper Gauss-Markov process for all finite values of $\\tau$ and\n$\\tilde{t}, \\tilde{t}' > 0$. In the limit of $\\tau \\to \\infty$, it turns into an improper prior of infinite local variance.)\n\nProof. The proof is analogous to the previous one. We need to show all equations given by the\nButcher tableau and choice of parameters hold for any choice of $\\alpha$. The constraint for $y_1$ holds trivially\nas in Eq. (9). Because $y_2 = f(x_0 + h\\alpha y_1, t_0 + h\\alpha)$, we need to show $\\mu_{|x_0, y_1}(t_0 + h\\alpha) = x_0 + h\\alpha y_1$.\nTherefore, let $\\alpha \\in (0, 1]$ arbitrary but fixed:\n\n$$\\begin{aligned} \\mu_{|x_0, y_1}(t_0 + h\\alpha) &= \\begin{bmatrix} k(t_0 + h\\alpha, t_0) & k^{\\partial}(t_0 + h\\alpha, t_0) \\end{bmatrix} \\begin{bmatrix} k(t_0, t_0) & k^{\\partial}(t_0, t_0) \\\\ {}^{\\partial}k(t_0, t_0) & {}^{\\partial}k^{\\partial}(t_0, t_0) \\end{bmatrix}^{-1} \\begin{bmatrix} x_0 \\\\ y_1 \\end{bmatrix} \\\\ &= \\begin{bmatrix} \\frac{t_0^3 \\left( 10 (h\\alpha)^2 + 15 h\\alpha t_0 + 6 t_0^2 \\right)}{120} & \\frac{t_0^2 \\left( 6 (h\\alpha)^2 + 8 h\\alpha t_0 + 3 t_0^2 \\right)}{24} \\end{bmatrix} \\begin{bmatrix} t_0^5 / 20 & t_0^4 / 8 \\\\ t_0^4 / 8 & t_0^3 / 3 \\end{bmatrix}^{-1} \\begin{bmatrix} x_0 \\\\ y_1 \\end{bmatrix} \\\\ &= \\begin{bmatrix} 1 - \\frac{10 (h\\alpha)^2}{3 t_0^2} & h\\alpha + \\frac{2 (h\\alpha)^2}{t_0} \\end{bmatrix} \\begin{bmatrix} x_0 \\\\ y_1 \\end{bmatrix} \\xrightarrow{\\tau \\to \\infty} x_0 + h\\alpha y_1. \\end{aligned} \\tag{12}$$\n\nAs $t_0 = \\tilde{t}_0 - \\tau$, the mismatched terms vanish for $\\tau \\to \\infty$. Finally, extending the vector and matrix with\none more entry, a lengthy computation shows that $\\lim_{\\tau \\to \\infty} \\mu_{|x_0, y_1, y_2}(t_0 + h) = x_0 + h \\left(1 - \\frac{1}{2\\alpha}\\right) y_1 +\n\\frac{h}{2\\alpha} y_2$ also holds, analogous to Eq. (10). Omitted details can be found in the supplements. They also\ninclude the final-step posterior covariance. Its finite values mean that this posterior indeed defines a\nproper GP.\n\n3.4 A Gauss-Markov method matching Runge-Kutta methods of third order\n\nMoving from second to third order, additionally to the limit towards an improper prior, also requires\na departure from the policy of placing extrapolation nodes at the posterior mean.\n\nTheorem 3. 
Consider the thrice-integrated Wiener process prior $p(x) = \\mathcal{GP}(x; 0, k_3)$ with\n\n$$k_3(t, t') = \\iint_{\\tau}^{\\tilde{t}, \\tilde{t}'} k_2(u, v) \\, du \\, dv = \\sigma^2 \\left( \\frac{\\min^7(t, t')}{252} + \\frac{|t - t'| \\min^4(t, t')}{720} \\left( 5 \\max^2(t, t') + 2 t t' + 3 \\min^2(t, t') \\right) \\right). \\tag{13}$$\n\nEvaluating twice at the posterior mean and a third time at a specific element of the posterior\ncovariance functions\u2019 RKHS gives rise to the entire family of RK methods of third order, in the limit\nof $\\tau \\to \\infty$.\n\nProof. The proof progresses entirely analogously as in Theorems 1 and 2, with one exception\nfor the term where the mean does not match the RK weights exactly. This is the case for $y_3 =\nx_0 + h \\left[ \\left( v - \\frac{v(v-u)}{u(2-3u)} \\right) y_1 + \\frac{v(v-u)}{u(2-3u)} y_2 \\right]$ (see Table 1). The weights of $Y$ which give the\nposterior mean at this point are given by $k K^{-1}$ (cf. Eq. (3)), which, in the limit, has value (see supp.):\n\n$$\\begin{aligned} &\\lim_{\\tau \\to \\infty} \\begin{bmatrix} k(t_0 + hv, t_0) & k^{\\partial}(t_0 + hv, t_0) & k^{\\partial}(t_0 + hv, t_0 + hu) \\end{bmatrix} K^{-1} \\\\ &= \\begin{bmatrix} 1 & h \\left( v - \\frac{v^2}{2u} \\right) & h \\frac{v^2}{2u} \\end{bmatrix} \\\\ &= \\begin{bmatrix} 1 & h \\left( v - \\frac{v(v-u)}{u(2-3u)} - \\frac{v(3v-2)}{2(3u-2)} \\right) & h \\left( \\frac{v(v-u)}{u(2-3u)} + \\frac{v(3v-2)}{2(3u-2)} \\right) \\end{bmatrix} \\\\ &= \\begin{bmatrix} 1 & h \\left( v - \\frac{v(v-u)}{u(2-3u)} \\right) & h \\frac{v(v-u)}{u(2-3u)} \\end{bmatrix} + \\begin{bmatrix} 0 & -h \\frac{v(3v-2)}{2(3u-2)} & h \\frac{v(3v-2)}{2(3u-2)} \\end{bmatrix}. \\end{aligned} \\tag{14}$$\n\nThis means that the final RK evaluation node does not lie at the posterior mean of the regressor.\nHowever, it can be produced by adding a correction term $w(v) = \\mu(v) + \\varepsilon(v)(y_2 - y_1)$ where\n\n$$\\varepsilon(v) = \\frac{v}{2} \\, \\frac{3v - 2}{3u - 2} \\tag{15}$$\n\nis a second-order polynomial in $v$. 
Since $k$ is of third or higher order in $v$ (depending on the value\nof $u$), $w$ can be written as an element of the thrice integrated Wiener process\u2019 RKHS [19, \u00a76.1].\nImportantly, the final extrapolation weights $b$ under the limit of the Wiener process prior again match\nthe RK weights exactly, regardless of how $y_3$ is constructed.\n\nWe note in passing that Eq. (15) vanishes for $v = 2/3$. For this choice, the RK observation $y_2$ is\ngenerated exactly at the posterior mean of the Gaussian process. Intriguingly, this is also the value\nfor $\\alpha$ for which the posterior variance at $t_0 + h$ is minimized.\n\n3.5 Choosing the output scale\n\nThe above theorems have shown that the first three families of Runge-Kutta methods can be constructed\nfrom repeatedly integrated Wiener process priors, giving a strong argument for the use of such\npriors in probabilistic numerical methods. However, requiring this match to a specific Runge-Kutta\nfamily in itself does not yet uniquely identify a particular kernel to be used: The posterior mean\nof a Gaussian process arising from noise-free observations is independent of the output scale (in\nour notation: $\\sigma^2$) of the covariance function (this can also be seen by inspecting Eq. (3)). Thus, the\nparameter $\\sigma^2$ can be chosen independent of the other parts of the algorithm, without breaking the\nmatch to Runge-Kutta. Several algorithms using the observed values of $f$ to choose $\\sigma^2$ without major\ncost overhead have been proposed in the regression community before [e.g. 20, 21]. For this particular\nmodel an even more basic rule is possible: A simple derivation shows that, in all three families of\nmethods defined above, the posterior belief over $\\partial^s x / \\partial t^s$ is a Wiener process, and the posterior mean\nfunction over the $s$-th derivative after all $s$ steps is a constant function. The Gaussian model implies\nthat the expected distance of this function from the (zero) prior mean should be the marginal standard\ndeviation $\\sqrt{\\sigma^2}$. We choose $\\sigma^2$ such that this property is met, by setting $\\sigma^2 = [\\partial^s \\mu^s(t) / \\partial t^s]^2$.\nFigure 1 shows conceptual sketches highlighting the structure of GMRK methods. Interestingly, in\nboth the second- and third-order families, our proposed priors are improper, so the solver can not\nactually return a probability distribution until after the observation of all $s$ gradients in the RK step.\n\nSome observations We close the main results by highlighting some non-obvious aspects. First, it\nis intriguing that higher convergence order results from repeated integration of Wiener processes.\nThis repeated integration simultaneously adds to and weakens certain prior assumptions in the\nimplicit (improper) Wiener prior: $s$-times integrated Wiener processes have marginal variance\n$k_s(t, t) \\propto t^{2s+1}$. Since many ODEs (e.g. linear ones) have solution paths of values $O(\\exp(t))$, it\nis tempting to wonder whether there exists a limit process of \u201cinfinitely-often integrated\u201d Wiener\nprocesses giving natural coverage to this domain (the results on a linear ODE in Figure 1 show how\nthe polynomial posteriors cannot cover the exponentially diverging true solution). In this context,\nit is also noteworthy that $s$-times integrated Wiener priors incorporate the lower-order results for\n$s' < s$, so \u201chighly-integrated\u201d Wiener kernels can be used to match finite-order Runge-Kutta methods.\nSimultaneously, though, sample paths from an $s$-times integrated Wiener process are almost surely\n$s$-times differentiable. So it seems likely that achieving good performance with a Gauss-Markov-Runge-Kutta\nsolver requires trading off the good marginal variance coverage of high-order Markov\nmodels (i.e. repeatedly integrated Wiener processes) against modelling non-smooth solution paths\nwith lower degrees of integration. We leave this very interesting question for future work.\n\n[Figure 2 plot: three columns, Na\u00efve chaining, Smoothing, Probabilistic continuation; horizontal axis $t$ from $t_0$ to $t_0 + 4h$; vertical axis $x$ (top row) and $x(t) - f(t)$ in units of $10^{-2}$ (bottom row).]\n\nFigure 2: Options for the continuation of GMRK methods after the first extrapolation step (red). All\nplots use the midpoint method and $h = 1$. Posterior after two steps (same for all three options) in red\n(mean, $\\pm 2$ standard deviations). Extrapolation after 2, 3, 4 steps (gray vertical lines) in green. Final\nprobabilistic prediction as green shading. True solution to (linear) ODE in black. Observations of $x$\nand $\\dot{x}$ marked by solid and empty blue circles, respectively. Bottom row shows the same data, plotted\nrelative to true solution, at higher y-resolution.\n\n4 Experiments\n\nSince Runge-Kutta methods have been extensively studied for over a century [11], it is not necessary\nto evaluate their estimation performance again. Instead, we focus on an open conceptual question for\nthe further development of probabilistic Runge-Kutta methods: If we accept high convergence order\nas a prerequisite to choose a probabilistic model, how should probabilistic ODE solvers continue\nafter the first $s$ steps? Purely from an inference perspective, it seems unnatural to introduce new\nevaluations of $x$ (as opposed to $\\dot{x}$) at $t_0 + nh$ for $n = 1, 2, \\dots$. Also, with the exception of the Euler\ncase, the posterior covariance after $s$ evaluations is of such a form that its renewed use in the next\ninterval will not give Runge-Kutta estimates. Three options suggest themselves:\n\nNa\u00efve Chaining One could simply re-start the algorithm several times as if the previous step had\ncreated a novel IVP. This amounts to the classic RK setup. 
However, it does not produce a joint\n\u201cglobal\u201d posterior probability distribution (Figure 2, left column).\n\nSmoothing An ad-hoc remedy is to run the algorithm in the \u201cNa\u00efve chaining\u201d mode above, producing\n$N \\times s$ gradient observations and $N$ function evaluations, but then compute a joint posterior\ndistribution by using the first $s$ gradient observations and 1 function evaluation as described in Section\n3, then using the remaining $s(N - 1)$ gradients and $(N - 1)$ function values as in standard GP\ninference. The appeal of this approach is that it produces a GP posterior whose mean goes through\nthe RK points (Figure 2, center column). But from a probabilistic standpoint it seems contrived. In\nparticular, it produces a very confident posterior covariance, which does not capture global error.\n\n[Figure 3 plot: legend 2nd-order GMRK, GP with SE kernel; horizontal axis $t$ from $t_0$ to $t_0 + 4h$; vertical axis $\\mu(t) - f(t)$ in units of $10^{-2}$.]\n\nFigure 3: Comparison of a 2nd order GMRK method and the method from [6]. Shown is error\nand posterior uncertainty of GMRK (green) and SE kernel (orange). Dashed lines are +2 standard\ndeviations. The SE method shown used the best out of several evaluated parameter choices.\n\nContinuing after s evaluations Perhaps most natural from the probabilistic viewpoint is to break\nwith the RK framework after the first RK step, and simply continue to collect gradient observations,\neither at RK locations, or anywhere else. The strength of this choice is that it produces a continuously\ngrowing marginal variance (Figure 2, right). One may perceive the departure from the established RK\nparadigm as problematic. 
However, we note again that the core theoretical argument for RK methods\nis only strictly valid in the first step; the argument for iterative continuation is a lot weaker.\n\nFigure 2 shows exemplary results for these three approaches on the (stiff) linear IVP $\\dot{x}(t) = -\\tfrac{1}{2} x(t)$,\n$x(0) = 1$. Na\u00efve chaining does not lead to a globally consistent probability distribution. Smoothing\ndoes give this global distribution, but the \u201cobservations\u201d of function values create unnatural nodes of\ncertainty in the posterior. The probabilistically most appealing mode of continuing inference directly\noffers a naturally increasing estimate of global error. At least for this simple test case, it also happens\nto work better in practice (note good match to ground truth in the plots). We have found similar results\nfor other test cases, notably also for non-stiff linear differential equations. But of course, probabilistic\ncontinuation breaks with at least the traditional mode of operation for Runge-Kutta methods, so a\ncloser theoretical evaluation is necessary, which we are planning for a follow-up publication.\n\nComparison to Square-Exponential kernel Since all theoretical guarantees are given in forms of\nupper bounds for the RK methods, the application of different GP models might still be favorable in\npractice. We compared the continuation method from Fig. 2 (right column) to the ad-hoc choice of\na square-exponential (SE) kernel model, which was used by Hennig and Hauberg [6] (Fig. 3). For\nthis test case, the GMRK method surpasses the SE-kernel algorithm both in accuracy and calibration:\nits mean is closer to the true solution than the SE method, and its error bar covers the true solution,\nwhile the SE method is over-confident. 
This advantage in calibration is likely due to the more natural\nchoice of the output scale $\\sigma^2$ in the GMRK framework.\n\n5 Conclusions\n\nWe derived an interpretation of Runge-Kutta methods in terms of the limit of Gaussian process\nregression with integrated Wiener covariance functions, and a structured but nontrivial extrapolation\nmodel. The result is a class of probabilistic numerical methods returning Gaussian process posterior\ndistributions whose means can match Runge-Kutta estimates exactly.\nThis class of methods has practical value, particularly to machine learning, where previous work has\nshown that the probability distribution returned by GP ODE solvers adds important functionality over\nthose of point estimators. But these results also raise pressing open questions about probabilistic\nODE solvers. This includes the question of how the GP interpretation of RK methods can be extended\nbeyond the 3rd order, and how ODE solvers should proceed after the first stage of evaluations.\n\nAcknowledgments\n\nThe authors are grateful to Simo S\u00e4rkk\u00e4 for a helpful discussion.\n\nReferences\n\n[1] T. Graepel. \u201cSolving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations\u201d. In: International Conference on Machine Learning (ICML). 2003.\n\n[2] B. Calderhead, M. Girolami, and N. Lawrence. \u201cAccelerating Bayesian inference over nonlinear differential equations with Gaussian processes\u201d. In: Advances in Neural Information Processing Systems (NIPS). 2008.\n\n[3] F. Dondelinger et al. \u201cODE parameter inference using adaptive gradient matching with Gaussian processes\u201d. In: Artificial Intelligence and Statistics (AISTATS). 2013, pp. 216\u2013228.\n\n[4] Y. Wang and D. Barber. \u201cGaussian Processes for Bayesian Estimation in Ordinary Differential Equations\u201d. In: International Conference on Machine Learning (ICML). 2014.\n\n[5] J. Skilling. \u201cBayesian solution of ordinary differential equations\u201d. In: Maximum Entropy and Bayesian Methods, Seattle (1991).\n\n[6] P. Hennig and S. Hauberg. \u201cProbabilistic Solutions to Differential Equations and their Application to Riemannian Statistics\u201d. In: Proc. of the 17th int. Conf. on Artificial Intelligence and Statistics (AISTATS). Vol. 33. JMLR, W&CP, 2014.\n\n[7] M. Schober et al. \u201cProbabilistic shortest path tractography in DTI using Gaussian Process ODE solvers\u201d. In: Medical Image Computing and Computer-Assisted Intervention\u2013MICCAI 2014. Springer, 2014.\n\n[8] O. Chkrebtii et al. \u201cBayesian Uncertainty Quantification for Differential Equations\u201d. In: arXiv preprint 1306.2365 (2013).\n\n[9] C. Runge. \u201c\u00dcber die numerische Aufl\u00f6sung von Differentialgleichungen\u201d. In: Mathematische Annalen 46 (1895), pp. 167\u2013178.\n\n[10] W. Kutta. \u201cBeitrag zur n\u00e4herungsweisen Integration totaler Differentialgleichungen\u201d. In: Zeitschrift f\u00fcr Mathematik und Physik 46 (1901), pp. 435\u2013453.\n\n[11] E. Hairer, S. N\u00f8rsett, and G. Wanner. Solving Ordinary Differential Equations I \u2013 Nonstiff Problems. Springer, 1987.\n\n[12] J. R. Dormand and P. J. Prince. \u201cA family of embedded Runge-Kutta formulae\u201d. In: Journal of computational and applied mathematics 6.1 (1980), pp. 19\u201326.\n\n[13] J. Butcher. \u201cCoefficients for the study of Runge-Kutta integration processes\u201d. In: Journal of the Australian Mathematical Society 3.02 (1963), pp. 185\u2013201.\n\n[14] F. Ceschino and J. Kuntzmann. Probl\u00e8mes diff\u00e9rentiels de conditions initiales (m\u00e9thodes num\u00e9riques). Dunod Paris, 1963.\n\n[15] E. B. Shanks. \u201cSolutions of Differential Equations by Evaluations of Functions\u201d. In: Mathematics of Computation 20.93 (1966), pp. 21\u201338.\n\n[16] E. Hairer and C. Lubich. 
\u201cNumerical solution of ordinary differential equations\u201d. In: The\nPrinceton Companion to Applied Mathematics, ed. by N. Higham. PUP, 2012.\n\n[17] N. Wiener. \u201cExtrapolation, interpolation, and smoothing of stationary time series with engineering\napplications\u201d. In: Bull. Amer. Math. Soc. 56 (1950), pp. 378\u2013381.\n\n[18] S. S\u00e4rkk\u00e4. Bayesian filtering and smoothing. Cambridge University Press, 2013.\n\n[19] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT, 2006.\n\n[20] R. Shumway and D. Stoffer. \u201cAn approach to time series smoothing and forecasting using the\nEM algorithm\u201d. In: Journal of Time Series Analysis 3.4 (1982), pp. 253\u2013264.\n\n[21] Z. Ghahramani and G. Hinton. Parameter estimation for linear dynamical systems. Technical\nReport CRG-TR-96-2, University of Toronto, Dept. of Computer Science, 1996.\n", "award": [], "sourceid": 514, "authors": [{"given_name": "Michael", "family_name": "Schober", "institution": "MPI for Intelligent Systems"}, {"given_name": "David", "family_name": "Duvenaud", "institution": "Harvard University"}, {"given_name": "Philipp", "family_name": "Hennig", "institution": "MPI T\u00fcbingen"}]}