{"title": "Implicit Probabilistic Integrators for ODEs", "book": "Advances in Neural Information Processing Systems", "page_first": 7244, "page_last": 7253, "abstract": "We introduce a family of implicit probabilistic integrators for initial value problems (IVPs), taking as a starting point the multistep Adams\u2013Moulton method. The implicit construction allows for dynamic feedback from the forthcoming time-step, in contrast to previous probabilistic integrators, all of which are based on explicit methods. We begin with a concise survey of the rapidly-expanding field of probabilistic ODE solvers. We then introduce our method, which builds on and adapts the work of Conrad et al. (2016) and Teymur et al. (2016), and provide a rigorous proof of its well-definedness and convergence. We discuss the problem of the calibration of such integrators and suggest one approach. We give an illustrative example highlighting the effect of the use of probabilistic integrators\u2014including our new method\u2014in the setting of parameter inference within an inverse problem.", "full_text": "Implicit Probabilistic Integrators for ODEs\n\nOnur Teymur\u21e4& Ben Calderhead\n\nDepartment of Mathematics\nImperial College London\n\nHan Cheng Lie & T.J. Sullivan\n\nInstitute of Mathematics, Freie Universit\u00a8at Berlin;\n\n& Zuse Institut Berlin\n\nAbstract\n\nWe introduce a family of implicit probabilistic integrators for initial value problems\n(IVPs), taking as a starting point the multistep Adams\u2013Moulton method. The\nimplicit construction allows for dynamic feedback from the forthcoming time-\nstep, in contrast to previous probabilistic integrators, all of which are based on\nexplicit methods. We begin with a concise survey of the rapidly-expanding \ufb01eld of\nprobabilistic ODE solvers. We then introduce our method, which builds on and\nadapts the work of Conrad et al. (2016) and Teymur et al. (2016), and provide a\nrigorous proof of its well-de\ufb01nedness and convergence. 
We discuss the problem of the calibration of such integrators and suggest one approach. We give an illustrative example highlighting the effect of the use of probabilistic integrators—including our new method—in the setting of parameter inference within an inverse problem.\n\n1 Set-up, motivation and context\n\nWe consider the common statistical problem of inferring model parameters θ from data Y. In a Bayesian setting, the parameter posterior is given by p(θ|Y) ∝ p(Y|θ)p(θ). Suppose we have a regression model in which the likelihood term p(Y|θ) requires us to solve an ordinary differential equation (ODE). Specifically, for each datum, we have Yj = x(tYj) + εj for some latent function x(t) satisfying ẋ = f(x, θ) and vector of measurement errors ε with spread parameter σ.\nWe can write the full model as p(θ, σ, x|Y) ∝ p(Y|x, σ)p(x|θ)p(θ)p(σ). Since x is latent, it is included as an integral part of the posterior model. This more general decomposition would not need to be considered explicitly in, say, a linear regression model for which x = θ1 + θ2t; here we would simply have p(x|θ) = δx(θ1 + θ2t). In other words, given θ there is no uncertainty in x, and the model would reduce to simply p(θ, σ|Y) ∝ p(Y|θ, σ)p(θ)p(σ).\nIn our case, however, x is defined implicitly through the ODE ẋ = f(x, θ) and p(x|θ) is therefore no longer trivial. What we mean by this is that x can only be calculated approximately and thus—following the central principle of probabilistic numerical methods (Hennig et al., 2015)—we assign to it a probability distribution representing our lack of knowledge about its true value. Our focus here is on initial value problems (IVPs) where we assume the initial value X0 ≡ x(0) is known (though an extension to unknown X0 is straightforward). 
We thus have the setup\n\np(θ, σ, x|Y, X0) ∝ p(Y|x, σ)p(x|θ, X0)p(θ)p(σ).    (1)\n\nFor our purposes, the interesting term on the right-hand side is p(x|θ), where hereafter we omit X0.\nIn the broadest sense, our aim is to account as accurately as possible for the numerical error which is inevitable in the calculation of x, and to do this within a probabilistic framework by describing p(x|θ). We then wish to consider the effect of this uncertainty as it is propagated through (1), when performing inference on θ in an inverse problem setting. An experimental example in this context is considered in Section 3.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n*Corresponding author: o@teymur.uk\n\n1.1 Probabilistic numerical methods\n\nBefore we give a summary of the current state of probabilistic numerical methods (PN) for ODEs, we take a brief diversion. It is interesting to note that the concept of defining a distribution for p(x|θ) has appeared in the recent literature in different forms. For example, a series of papers (Calderhead et al., 2009; Dondelinger et al., 2013; Wang and Barber, 2014; Macdonald et al., 2015), which arose separately from PN in its modern form, seek to avoid solving the ODE entirely and instead replace it with a 'surrogate' statistical model, parameterised by φ, with the primary motivation being to reduce overall computation. The central principle in these papers is to perform gradient matching (GM) between the full and surrogate models. A consequence of this framework is the introduction of a distribution p(x|θ, φ) akin to the p(x|θ) appearing in (1). The aim is to approximate x using statistical techniques, but there is no attempt to model the error itself—instead, simply an attempt to minimise the discrepancy between the true solution and its surrogate. 
Furthermore, the parameters φ of the surrogate models proposed in the GM framework are fitted by conditioning on data Y, meaning p(x|θ, φ) needs to be viewed as a data-conditioned posterior p(x|θ, φ, Y). In our view this is problematic, since where the uncertainty in a quantity of interest arises solely from the inexactness of the numerical methods used to calculate it, inference over that quantity should not be based on data that is the outcome of an experiment. The circularity induced in (1) by Y-conditioning is clear.\nThe fundamental shift in thinking in the papers by Hennig and Hauberg (2014) and Chkrebtii et al. (2016), building on Skilling (1991), and then followed up and extended by Schober et al. (2014), Conrad et al. (2016), Teymur et al. (2016), Kersting and Hennig (2016), Schober et al. (2018) and others is that of what constitutes 'data' in the algorithm used to determine p(x|θ). By contrast to the GM approach, the experimental data Y is not used in constructing this distribution. Though the point has already been made tacitly in some of these works, we argue that this constitutes the key difference in philosophy. Instead, we should strive to quantify the numerical uncertainty in x first, then propagate this uncertainty via the data likelihood to the Bayesian inversion employed for inferring θ. This is effectively direct probabilistic modelling of the numerical error and is the approach taken in PN.\nHow then is x inferred in PN? The common thread here is that a discrete path Z ≡ Z1:N is generated which numerically approximates X ≡ X1:N—the discretised version of the true solution x—then 'model interrogations' (Chkrebtii et al., 2016) F := f(Z, θ) are thought of as a sort of numerical data and x is inferred based on these. 
Done this way, an entirely model-based description of the uncertainty in x results, with no recourse to experimental data Y.\n\n1.2 Sequential inference\n\nAnother central feature of PN solvers from Chkrebtii et al. (2016) onward is that of treating the problem sequentially, in the manner of a classic IVP integrator. In all of the GM papers, and indeed in Hennig and Hauberg (2014), X is treated as a block – inferred all at once, given data Y (or, in Hennig and Hauberg, F). This necessarily limits the degree of feedback possible from the dynamics of the actual ODE, and in a general IVP this may be the source of significant inaccuracy, since errors in the inexact values Z approximating X are amplified by the ODE itself. In a sequential approach, the numerical data is not a static pre-existing object as the true data Y is, but rather is generated as we go by repeatedly evaluating the ODE at a sequence of input ordinates. Thus it is clear that the numerical data generated at time t is affected by the inferred solution at times before t. This iterative information feedback is qualitatively much more like a standard IVP solver than a block inference approach and is similar to the principle of statistical filtering (Särkkä, 2013).\nWe now examine the existing papers in this area more closely, in order to give context to our own contribution in the subsequent section. In Chkrebtii et al. (2016) a Gaussian process (GP) prior is jointly placed over x and its derivative ẋ, then at step i the current GP parameters are used to predict a value for the state at the next step, Zi+1. This is then transformed to give Fi+1 ≡ f(Zi+1, θ). The modelling assumption now made is that this value is distributed around the true value of the derivative Ẋi+1 with Gaussian error. 
On this basis the new datum is assimilated into the model, giving a posterior for (x, ẋ) which can be used as the starting prior in the next step. This approach does not make direct use of the sequence Z; rather it is merely generated in order to produce the numerical data F which is then compared to the prior model in derivative space. The result is a distributional Gaussian posterior over x consistent with the sequence F.\n\nConrad et al. (2016) take a different approach. Treating the problem in a discrete setting, they produce a sequence Z of values approximating X, with Fi+1 ≡ f(Zi+1, θ) constituting the data and Zi+1 calculated from the previous values Zi and Fi by employing some iterative relation akin to a randomised version of a standard IVP solver. Note that there is no attempt to continuously assimilate the generated values into the model for the unknown X or Ẋ during the run of the algorithm. Instead, the justification for the method comes post hoc in the form of a convergence theorem bounding the maximum expected squared-error max_i E||Zi − Xi||². An extension to multistep methods—in which Zi+1 is allowed to depend on multiple past values F≤i—is introduced in Teymur et al. (2016) and utilises the same basic approach. Various extensions and generalisations of the theoretical results in these papers are given in Lie et al. (2017), and a related idea in which the step-size is randomised is proposed by Abdulle and Garegnani (2018).\nThis approach is intuitive, allowing for modified versions of standard algorithms which inherit known useful properties, and giving provable expected error bounds. It is also more general since it allows for non-parametric posterior distributions for x, though it relies on Monte Carlo sampling to give empirical approximations to it. 
Mathematically, we write\n\np(Z|θ) = ∫ p(Z, F|θ) dF = ∫ [ ∏_{i=0}^{N−1} p(Fi|Zi, θ) p(Zi+1|Zi, F≤i) ] dF.    (2)\n\nHere, Z ≡ Z1:N is the approximation to the unknown discretised solution function X ≡ X1:N, and each Fi ≡ f(Zi, θ) is a piece of numerical data. We use F≤i to mean (Fi, Fi−1, Fi−2, . . . ). Using the terminology of Hennig et al. (2015), the two constituent components of the telescopic decomposition in the right-hand side of (2) correspond to the 'decision rule' (how the algorithm generates a new data-point Fi) and the 'generative model' (which encodes the likelihood model for Z) respectively. Note that, from a statistical viewpoint, the method explicitly defines a distribution over numerical solutions Z rather than an uncertainty centred around the true solution x (or X). The relationship of the measure over Z to that over X is then guaranteed by the convergence analysis.\nThe term p(Fi|Zi, θ) is taken in both Conrad et al. (2016) and Teymur et al. (2016) to be simply a deterministic transformation; this could be written in distributional form as δFi(f(Zi, θ)). The term p(Zi+1|Zi, F≤i) is given by Conrad et al. as a Gaussian centred around the output Zdet_i+1 of any deterministic single step IVP solver, with variance scaled in accordance with the constraints of their theorem. Teymur et al. introduce a construction for this term which permits conditioning on multiple previous Fi's and has mean equivalent to the multistep Adams–Bashforth method. They give the corresponding generalised convergence result. 
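For concreteness, the explicit construction that (2) describes can be sketched in code for the simplest case: a randomised one-step Euler method in the style of Conrad et al. (2016), where the deterministic step is perturbed by zero-mean Gaussian noise whose variance scales as αh^(2s+1) with s = 1. The function names and the toy linear ODE are our own illustrative choices, not taken from the paper.

```python
import numpy as np

def prob_euler_step(f, z, h, theta, alpha, rng):
    """One step of a randomised explicit Euler integrator (Conrad et al. style):
    the deterministic step plus a zero-mean Gaussian perturbation xi_i whose
    variance is alpha * h^(2s+1) with s = 1. Names are illustrative."""
    F = f(z, theta)                 # numerical data F_i = f(Z_i, theta)
    z_det = z + h * F               # deterministic Euler step Z^det_{i+1}
    xi = rng.normal(0.0, np.sqrt(alpha * h**3))
    return z_det + xi

# Toy linear IVP dz/dt = -theta*z, z(0) = 1, integrated to t = 1 with h = 0.01.
rng = np.random.default_rng(0)
f = lambda z, theta: -theta * z
z = 1.0
for _ in range(100):
    z = prob_euler_step(f, z, h=0.01, theta=2.0, alpha=0.1, rng=rng)
# z now lies near exp(-2), scattered by the accumulated perturbations
```

Repeating the loop with fresh seeds yields the Monte Carlo ensemble over which p(Z|θ) is empirically approximated.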
Their proof is also easily verified to be valid for implicit multistep methods—a result we appeal to later—though the specific implicit integrator model they suggest is methodologically inconsistent, for reasons we will explain in Section 2.\nIn all of the approaches described so far, Monte Carlo sampling is required to marginalise F and thereby calculate p(Z|θ). This constitutes an appreciable computational overhead. A third approach, related to stochastic filtering, is presented in Schober et al. (2014), Kersting and Hennig (2016) and Schober et al. (2018). These papers develop a framework which does not rely on sampling, but instead makes the simplifying assumption that all distributions are Gaussian, and propagates the uncertainty from step to step using the theory of Kalman filtering (Särkkä, 2013). This is an alternative settlement to the accuracy/computational cost trade-off, a point which is acknowledged in those papers.\nFor the sake of comparison, we can loosely rewrite their general approach in our notation as follows:\n\np(x|θ) = ∫ [ ∏_i p(Z̃i+1|x[i], F0:i) p(Fi+1|Z̃i+1, θ) p(x[i+1]|Z̃i+1, Fi+1) ] dF dZ̃,    (3)\n\nwhere we write x[i] instead of Zi to emphasise that this represents an i-times updated model for the continuous solution x, rather than the i'th iteration of an algorithm which generates an approximation to the discrete Xi. This algorithm predicts a value for the new state Z̃i+1 from the current model and all previous data, then generates a data point based on that prediction, and then updates the model based on this new datum. 
Note that all distributions in this framework are Gaussian, to permit fast filtering, and as a result the non-linearities in f are effectively linearised, and any numerical method which produces non-Gaussian errors has Gaussians fitted to them anyway.\nThis filtering approach is interesting because of the earlier-stated desideratum of maximising the degree of feedback from the ODE dynamics to the solver. The predict-evaluate-update approach suggested by (3) means that information from the ODE function at the next time step ti+1 is fed back into the procedure at each step, unlike in other methods which only predict forwards. In numerical analysis this is typically a much-desired feature, leading to methods with improved stability and accuracy. However, it is still a three-part procedure, analogous for example to paired Adams–Bashforth and Adams–Moulton integrators used in PEC mode (Butcher, 2008). This connection is referred to in Schober et al. (2018).\n\n2 Our proposed method\n\nWe now propose a different, novel sequential procedure which also incorporates information from the ODE at time step ti+1 but does so directly. This produces a true implicit probabilistic integrator, without a subtle inconsistency present in the method suggested by Teymur et al. (2016). There, the analogue of (2) defines a joint Gaussian distribution over Zi+1 and Fi+1 (the right-hand component, with F≤i replaced by F≤(i+1)) but then generates Fi+1 by passing Zi+1 through the function f (the left hand component). This gives two mutually-incompatible meanings to Fi+1, one linearly and one non-linearly related to Zi+1. Our proposed method fixes this problem. 
Indeed, we specifically exploit the difference in these two quantities by separating them out and directly penalising the discrepancy between them.\nTo introduce the idea we consider the one-dimensional case first, then later we generalise to a multi-dimensional context. We first note that unlike in the explicit randomised integrators of Conrad et al. (2016) and Teymur et al. (2016), we do not have access to the exact deterministic Adams–Moulton predictor, to which we could then add a zero-mean perturbation. An alternative approach is therefore required. Consider instead the following distribution which directly advances the integrator one step and depends only on the current point:\n\np(Zi+1 = z|Zi, θ, η) ∝ g(r(z), η).    (4)\n\nHere, r(z) is a discrepancy measure in derivative space defined in the next paragraph, and g is an η-scaled functional transformation which ensures that the expression is a valid probability distribution in the variable z.\nA concrete example will illuminate the definition. Consider the simplest implicit method, backward Euler. This is defined by the relation Zi+1 = Zi + hFi+1 and typically can only be solved by an iterative calculation, since Fi+1 ≡ f(Zi+1, θ) is of course unknown. If the random variable Zi+1 has value z, then we may express Fi+1 as a function of z. Specifically, we have Fi+1(z) = h⁻¹(z − Zi). The discrepancy r(z) between the value of Fi+1(z) and the value of f(z, θ) can then be used as a measure of the error in the linear method, and penalised. This is equivalent to penalising the difference between the two different expressions for Fi+1 arising from the previously-described naive extension of (2) to the implicit case. 
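The backward Euler discrepancy just described can be sketched in a few lines; the scalar test problem dz/dt = −θz (for which the implicit step has a closed form) is our own illustrative choice, as are the function names.

```python
import numpy as np

def r(z, z_i, h, f, theta):
    """Discrepancy in derivative space for the backward Euler relation
    Z_{i+1} = Z_i + h F_{i+1}: the difference between F_{i+1}(z) = (z - Z_i)/h
    and the vector field evaluated at z. Names are illustrative."""
    return (z - z_i) / h - f(z, theta)

# Toy scalar problem dz/dt = -theta*z: backward Euler has the closed form
# z_{i+1} = z_i / (1 + h*theta), and r vanishes exactly at that point.
f = lambda z, theta: -theta * z
z_i, h, theta = 1.0, 0.1, 2.0
z_be = z_i / (1 + h * theta)     # deterministic backward Euler solution
print(r(z_be, z_i, h, f, theta))  # zero up to floating-point error
```

The distribution (4) then concentrates mass on values of z for which |r(z)| is small, with η controlling how sharply the discrepancy is penalised.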
We write\n\np(Zi+1 = z|Zi, θ, η) = K⁻¹ exp( −(1/2η²) [h⁻¹(z − Zi) − f(z, θ)]² ).    (5)\n\nComparing (4) and (5), r(z) is the expression h⁻¹(z − Zi) − f(z, θ), and g is in this case the transformation u ↦ exp(−u²/2η²). This approach directly advances the solver in a single leap, without collecting explicit numerical data as in previous approaches. It is in general non-parametric and requires either sampling or approximation to be useful (more on which in the next section). Since f is in general non-linear, it follows that r is non-linear too. It then follows that the density in equation (5) does not result in a Gaussian measure, despite g being a squared-exponential transformation.\nThe generalisation to higher order implicit linear multistep methods of Adams–Moulton (AM) type, having the form Zi+1 = Zi + h ∑_{j=−1}^{s−1} βj f(Zi−j, θ), for AM coefficients βj, follows as\n\np(Zi+1 = z|Z≤i, θ, η) = (1/K) exp( −(1/2η²) [ β−1⁻¹( h⁻¹(z − Zi) − ∑_{j=0}^{s−1} βj Fi−j ) − f(z, θ) ]² ).    (6)\n\n2.1 Mathematical properties of the proposed method\n\nThe following analysis proves the well-definedness and convergence properties of the construction proposed in Section 2. First we show that the distribution (6) is well-defined and proper, by proving the finiteness and strict positivity of the normalising constant K. We then describe conditions on the h-dependence of the scaling parameter η, such that the random variables ξi in (8) satisfy the hypotheses of Theorem 3 in Teymur et al. (2016). In particular, the convergence of our method follows from the Adams–Moulton analogue of that result.\nDenote by Ψh_θ,s : R^(d×s) → R^d the deterministic map defined by the s-step Adams–Moulton method. For example, the implicit map associated with the backward Euler method—the 'zero step' AM method—for a fixed parameter θ is Ψh_θ,0(Zi) = Zi + hf(Ψh_θ,0(Zi), θ). 
More generally, the map associated with the s-step AM method is\n\nΨh_θ,s(Zi−s+1:i) = Zi + h[ β−1 f(Ψh_θ,s(Zi−s+1:i), θ) + ∑_{j=0}^{s−1} βj f(Zi−j, θ) ],    (7)\n\nwhere Zi−s+1:i ≡ (Zi, Zi−1, . . . , Zi−s+1), and the βj ∈ R+ are the Adams–Moulton coefficients. Note that Ψh_θ,s(Zi−s+1:i) represents the deterministic Adams–Moulton estimate for Zi+1. Given a probability space (Ω, F, P), define for every i ∈ N the random variable ξh_i : Ω → R^d according to\n\nZi+1 = Ψh_θ,s(Zi−s+1:i) + ξh_i.    (8)\n\nThe relationship between the expressions (6) and (8) is addressed in part (i) of the following Theorem, the proof of which is given in the supplementary material accompanying this paper.\nTheorem. Assume that the vector field f(·, θ) is globally Lipschitz with Lipschitz constant Lf,θ > 0. Fix s ∈ N ∪ {0}, Zi−s+1:i ∈ R^(d×s), θ ∈ R^q, and 0 < h < (Lf,θ β−1)⁻¹. If η = kh^ρ for some k > 0 independent of h and ρ ≥ 1, then the following statements hold:\n(i) The function defined in (6) is a well-defined probability density.\n(ii) For every r ≥ 1, there exists a constant 0 < Cr < ∞ that does not depend on h, such that for all i ∈ N, E[||ξi||^r] ≤ Cr h^((ρ+1)r).\n(iii) If ρ ≥ s + 1/2, the probabilistic integrator defined by (6) converges in mean-square as h → 0, at the same rate as the deterministic s-step Adams–Moulton method.\n\n2.2 Multi-dimensional extension\n\nThe extrapolation part of any linear method operates on each component of a multi-dimensional problem separately. Thus if Z = (Z(1), . . . , Z(d))^T, we have Z(k)i+1 = Z(k)i + h ∑_j βj F(k)i−j for each component k in turn. Of course, this is not true of the transformation Zi+1 ↦ Fi+1 ≡ f(Zi+1), except in the trivial case where f is linear in z; thus in (2), the right-hand distribution is componentwise-independent while the left-hand one is not. 
All previous sequential PN integrators have treated the multi-dimensional problem in this way, as a product of one-dimensional relations.\nIn our proposal it does not make sense to consider the system of equations component by component, due to the presence of the non-linear f(z, θ) term, which appears as an intrinsic part of the step-forward distribution p(Zi+1|Z≤i, θ, η). The multi-dimensional analogue of (6) should take account of this and be defined over all d dimensions together. For vector-valued z, Zk, Fk, we therefore define\n\np(Zi+1|Z≤i, θ, H) ∝ exp( −(1/2) r(z)^T H⁻¹ r(z) ),    (9)\n\nwhere r(z) = β−1⁻¹( h⁻¹(z − Zi) − ∑_{j=0}^{s−1} βj Fi−j ) − f(z, θ) is now a d×1 vector of discrepancies in derivative space, and H is a d×d matrix encoding the solver scale, generalising η. Straightforward modifications to the proof give multi-dimensional analogues to the statements in the Theorem.\n\n2.3 Calibration and setting H\n\nThe issue of calibration of ODE solvers is addressed without consensus in every treatment of this topic referenced in Section 1. The approaches can broadly be split into those of 'forward' type, in which there is an attempt to directly model what the theoretical uncertainty in a solver step should be and propagate that through the calculation; and those of 'backward' type, where the uncertainty scale is somehow matched after the computation to that suggested by some other indicator. Both of these have shortcomings, the former due to the inherent difficulty of explicitly describing the error, and the latter because it is by definition less precise. One major stumbling block is that it is in general a challenging problem to even define what it means for an uncertainty estimate to be well-calibrated.\nIn the present paper, we require a way of setting H. We proceed by modifying and generalising an idea from Conrad et al. 
(2016) which falls into the 'backward' category. There, the variance of the step-forward distribution Var(Zi+1|···) is taken to be a matrix ΣZ = αh^ρ Id, with α determined by a scale-matching procedure that ensures the integrator outputs a global error scale in line with expectations. We refer the reader to the detailed exposition of this procedure in Section 3.1 of that paper. Furthermore, the convergence result from Teymur et al. (2016) implies that, for the probabilistic s-step Adams–Bashforth integrator, the exponent ρ should be taken to be 2s + 1.\nIn our method, we are not able to relate such a matrix ΣZ directly to H because from the definition (9) it is clear that H is a scaling matrix for the spread of the derivative Fi+1, whereas ΣZ measures the spread of the state Zi+1. In order to transform to the correct space without linearising the ODE, we apply the multivariate delta method (Oehlert, 1992) to give an approximation for the variance of the transformed random variable, and set H to be equal to the result. Thus\n\nH = Var(f(Zi+1)) ≈ Jf(E(Zi+1)) ΣZ Jf(E(Zi+1))^T = αh^ρ Jf(E(Zi+1)) Jf(E(Zi+1))^T,    (10)\n\nwhere Jf is the Jacobian of f. The mean value E(Zi+1) is unknown, but we can use an explicit method of equal or higher order to compute an estimate ZAB_i+1 at negligible cost, and use Jf(ZAB_i+1) instead, under the assumption that these are reasonably close. Remember that we are roughly calibrating the method so some level of approximation is unavoidable. This comment applies equally to the case where the Jacobian is not analytically available and is estimated numerically. Such approximations do not affect the fundamental convergence properties of the algorithm, since they do not affect the h-scaling of the stepping distribution. 
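As a sketch of how (10) might be implemented when the Jacobian is only available numerically, the following uses a finite-difference Jacobian evaluated at an explicit predictor, with the FitzHugh–Nagumo right-hand side of Section 3 as the test function. The helper names, the finite-difference scheme, and the chosen evaluation point are our own illustrative assumptions.

```python
import numpy as np

def calibration_matrix(f, z_pred, theta, h, s, alpha, eps=1e-6):
    """Delta-method calibration matrix H = alpha * h^(2s+1) * J_f J_f^T,
    with J_f estimated by forward differences at an explicit predictor
    z_pred (e.g. an Adams-Bashforth estimate). Illustrative sketch."""
    d = len(z_pred)
    J = np.empty((d, d))
    f0 = f(z_pred, theta)
    for k in range(d):                      # forward-difference Jacobian
        dz = np.zeros(d)
        dz[k] = eps
        J[:, k] = (f(z_pred + dz, theta) - f0) / eps
    return alpha * h**(2 * s + 1) * (J @ J.T)

# FitzHugh-Nagumo right-hand side with theta = (0.2, 0.2, 3.0) as in Section 3
def fhn(z, th):
    v, r_ = z
    return np.array([th[2] * (v - v**3 / 3 + r_),
                     -(v - th[0] + th[1] * r_) / th[2]])

H = calibration_matrix(fhn, np.array([-1.0, 1.0]), (0.2, 0.2, 3.0),
                       h=0.1, s=0, alpha=0.2)
```

By construction H is symmetric positive semi-definite, and its off-diagonal entries carry the cross-correlation structure discussed below.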
We also note that we are merely matching variances/spread parameters and nowhere assuming that the distribution (9) is Gaussian. This idea bears some similarity to the original concept in Skilling (1993), where a scalar 'stiffness constant' is used in a similar way to transform the uncertainty scale from solution space to derivative space.\nWe now ascertain the appropriate h-scaling for H by setting the exponent ρ. The condition required by the univariate analysis in this paper is that η = kh^ρ; part (iii) of the Theorem shows that we require ρ ≥ s + 1/2, where s is the number of steps in the corresponding AM method.² Choosing ρ = s + 1/2—an approach supported by the numerical experiments in Section 3—the backward Euler method (s = 0) requires ρ = 1/2. The multidimensional analogue of the above condition is H = Qh^(2ρ) for an h-independent positive-definite matrix Q. Since Jf is independent of h, this means we must set ΣZ to be proportional to h^(2(s+1/2)), and thus we have H = αh^(2s+1) Jf(E(Zi+1)) Jf(E(Zi+1))^T.\nOur construction has the beneficial consequence of giving a non-trivial cross-correlation structure to the error calibration matrix H, allowing a richer description of the error in multi-dimensional problems, something absent from previous approaches. Furthermore, it derives this additional information via direct feedback from the ODE, which we have shown is a desirable attribute.\n\n2.4 Reducing computational expenditure\n\nIn the form described in the previous section, our algorithm results in a non-parametric distribution for Zi+1 at each step. With this approach, a description of the uncertainty in the numerical method can only be evaluated by a Monte Carlo sampling procedure at every iteration. 
Even if this sampling is performed using a method well-suited to targeting distributions close to Gaussian—we use a modified version of the pre-conditioned Crank–Nicolson algorithm proposed by Cotter et al. (2013)—there is clearly a significant computational penalty associated with this.\nThe only way to avoid this penalty is by reverting to distributions of standard form, which are easy to sample from. One possibility is to approximate (6) by a Gaussian distribution—depending on how this approximation is performed, the desideratum of maintaining information feedback from the future dynamics of the target function can be maintained. For example, a first order Taylor expansion of f(z) ≈ f(Zi) + Jf(Zi)(z − Zi), when substituted into r(z) as defined in (9), gives an approximation r̃(z) which is linear in z. This yields a non-centred Gaussian when transformed into a probability measure as in (9). Defining Λ ≡ (h⁻¹Id)β−1⁻¹ − Jf(Zi) and w ≡ f(Zi, θ) + β−1⁻¹( ∑_{j=0}^{s−1} βj Fi−j ), some straightforward algebra gives the moments of the approximating Gaussian measure for the next step as μ = Zi + Λ⁻¹w and Var = αh^(2s+1) Λ⁻¹ Jf Jf^T (Λ⁻¹)^T.\n\n²We take this opportunity to remind the reader of the unfortunate convention from numerical analysis that results in s having different meanings here and in the previous paragraph—the explicit method of order s is the one with s steps, whereas the implicit method of order s is the one with s − 1 steps. 
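The Gaussian moments just derived can be sketched for the s = 0 (backward Euler, β−1 = 1) case, where the history sum in w is empty. For a linear vector field the mean should recover the exact deterministic backward Euler step (I − hA)⁻¹Zi, which gives a convenient check; the function names and the toy linear system are our own illustrative choices.

```python
import numpy as np

def semi_implicit_moments(f, jac, z_i, theta, h, alpha):
    """Moments of the Gaussian approximation to the probabilistic backward
    Euler (AM0) step obtained by linearising f about Z_i (Section 2.4 sketch):
    mu = Z_i + Lam^{-1} w,  Var = alpha h^(2s+1) Lam^{-1} J J^T Lam^{-T},
    with beta_{-1} = 1 and an empty history sum since s = 0."""
    d = len(z_i)
    J = jac(z_i, theta)
    Lam = np.eye(d) / h - J                 # Lambda = (1/h) I_d - J_f(Z_i)
    w = f(z_i, theta)                       # w = f(Z_i, theta) for s = 0
    Lam_inv = np.linalg.inv(Lam)
    mu = z_i + Lam_inv @ w
    var = alpha * h * Lam_inv @ J @ J.T @ Lam_inv.T
    return mu, var

# Toy linear system dz/dt = A z, whose Jacobian is exact
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
f = lambda z, th: A @ z
jac = lambda z, th: A
mu, var = semi_implicit_moments(f, jac, np.array([1.0, 0.0]), None, 0.05, 0.2)
```

Sampling Zi+1 from N(μ, Var) then replaces the pCN-based sampling of (9) at each step.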
We note that this procedure is merely to facilitate straightforward sampling—though r̃(z) is linear in z, the inclusion of the first additional term from the Taylor expansion means that information about the non-linearity (in z) of f is still incorporated to second order, and the generated solution Z is not jointly Gaussian across time steps i. Furthermore, since Λ⁻¹ is order 1 in h, this approximation does not impact the global convergence of the integrator, as long as H is set in accordance with the principles described in Section 2.3. This method of solving implicit integrators by linearising them in f is well-known in classical numerical analysis, and the resulting methods are sometimes called semi-implicit methods (Press et al., 2007).\n\n3 Experimental results\n\nWe illustrate our new algorithm by considering the case of a simple inverse problem, the FitzHugh–Nagumo model discussed in Ramsay et al. (2007) and subsequently considered in several papers on this topic. This is a two-dimensional non-linear dynamical system with three parameters θ = (θ1, θ2, θ3), the values of which (θ1 = 0.2, θ2 = 0.2, θ3 = 3.0) are chosen to produce periodic motion. With the problem having a Bayesian structure, we write down the posterior as\n\np(θ, Z|Y) ∝ p(Y|Z, σ)p(Z|θ, ξ)p(θ)p(ξ).    (11)\n\nThis expression recalls (1), but with Z substituting for x as described in Section 1.2. We write p(Z|θ, ξ) to emphasise that the trajectory Z depends on the sequence of random perturbations ξ0:N. For simplicity we use the known value of σ throughout, so do not include it in the posterior model.\nConrad et al. (2016) remark on the biasing effect on the posterior distribution for θ of naively evaluating the forward model using a standard numerical method. 
They showed that their probabilistic integrator returns wider posteriors, preventing misplaced overconfidence in an erroneous estimate. We now extend these experiments to our new method. In passing, we note interesting recent theoretical developments discussing the quantitative effect on posterior inference of randomised forward models, presented in Lie et al. (2018).\n\nFigure 1: 500 Monte Carlo repetitions of the probabilistic backward Euler (AM0) method applied to the FitzHugh–Nagumo model with h = 0.1 and 0 ≤ t ≤ 20. The approximation from Section 2.4 is used and α*AM0 = 0.2. The upper pane plots the ensemble of discrete trajectories, with the path of the deterministic backward Euler method in light blue. The lower pane is based on the same data, this time summarised. The ensemble mean is shown dashed, and 1σ, 2σ and 3σ intervals are shown shaded, with reference solution in solid black.\n\n3.1 Calibration\n\nThe first task is to infer the appropriate value for the overall scaling constant α for each method, to be used in setting the matrix H in (10). As in Conrad et al. (2016), we calculate a value α* that maximises the agreement between the output of the probabilistic integrator and a measure of global error from the deterministic method, and then fix and proceed with this value.\nFor each of several methods M, α*M was calculated for a range of values of h and was close to constant throughout, suggesting that the h-scaling advocated in Section 2.3 (i.e. taking the equality in the bound in part (iii) of the Theorem) is the correct one. This point has not been specifically addressed in previous works on this subject. 
The actual maxima α*_M for each method are different, and further research is required to examine whether a relationship can be deduced between these values and some known characteristic of each method, such as the number of steps s or the local error constant of the underlying method. Furthermore, we expect these values to be problem-dependent. In this case we found α*_AB1 ≈ 0.2, α*_AB2 ≈ 0.1, α*_AB3 ≈ 0.2, α*_AM0 ≈ 0.2, α*_AM1 ≈ 0.05, α*_AM2 ≈ 0.05.

Having calibrated the probabilistic integrator, we illustrate its typical output in Figure 1: the top pane plots the paths of 500 iterations of the probabilistic backward Euler method run at α = α*_AM0 = 0.2. We plot the discrete values Z_1:N for each repetition, without attempting to distinguish the trajectories from individual runs. This is to stress that each randomised run (resulting from a different instantiation of ξ) is not intended to be viewed as a 'typical sample' from some underlying continuous probability measure, as in some other probabilistic ODE methods, but rather that collectively the runs form an ensemble from which an empirical distribution characterising discrete-time solution uncertainty can be calculated. The bottom pane plots the same data, with shaded bands representing the 1σ, 2σ and 3σ intervals and a dotted line representing the empirical mean.

3.2 Parameter inference

We now consider the inverse problem of inferring the parameters of the FitzHugh–Nagumo model in the range t ∈ [0, 20]. We first generate synthetic data Y: 20 two-dimensional data-points collected at times t_Y = 1, 2, ..., 20, corrupted by centred Gaussian noise with variance Σ = 0.01 · I₂. We then treat the parameters θ as unknown and run an MCMC algorithm, Adaptive Metropolis–Hastings (Haario et al., 2001), to infer their posterior distribution.

In Conrad et al. (2016), the equivalent algorithm performs multiple repetitions of the forward solve at each step of the outer MCMC (each with a different instantiation of ξ) and then marginalises ξ out to form an expected likelihood. This is computationally very expensive; in our experiments we find that for the MCMC to mix well, many tens of repetitions of the forward solve are required at each step. Instead we use a Metropolis-within-Gibbs scheme in which, at MCMC iteration k, a candidate parameter θ* is proposed and accepted or rejected having had its likelihood calculated using the same sample ξ^[k]_0:N as used in the current iteration k. If accepted as θ^[k+1], a new ξ^[k+1]_0:N can then be sampled and the likelihood value recalculated, ready for the next proposal; the proposal at step k+1 is then compared to this new value. Pseudo-code for this algorithm is given in the supplementary material.

Our approach requires that p(Z | θ, ξ) be recalculated exactly once each time a new parameter value θ* is accepted. The cost of this strategy is therefore bounded by twice the cost of an MCMC operating with a deterministic integrator, the bound being achieved only in the scenario that all proposed moves θ* are accepted. Thus the algorithm, in contrast to the calibration procedure (which is relatively costly but need only be performed once), has limited additional computational overhead compared to the naive approach using a classical method.

Figure 2 shows kernel density estimates approximating the posterior distribution of (θ₂, θ₃) for the forward Euler, probabilistic forward Euler, backward Euler and probabilistic backward Euler methods. Each represents 1000 parameter samples from simulations run with step-sizes h = 0.005, 0.01, 0.02, 0.05, drawn from 11000 total samples with the first 1000 discarded as burn-in and the remainder thinned by a factor of 10.
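The Metropolis-within-Gibbs scheme described above can be sketched as follows. This is a schematic of our own (a flat prior and symmetric Gaussian proposal are assumed, and all names are hypothetical); the key point is that a proposal is scored with the same ξ as the current state, and ξ is refreshed only on acceptance.

```python
import numpy as np

rng = np.random.default_rng(1)

def mwg_sampler(loglik, theta0, n_iter, prop_sd, sample_xi):
    """Metropolis-within-Gibbs over (theta, xi).  `loglik(theta, xi)` runs
    one forward solve and returns the log-likelihood; `sample_xi()` draws a
    fresh perturbation sequence.  Flat prior and symmetric random-walk
    proposal assumed, so the acceptance ratio is the likelihood ratio."""
    theta = np.asarray(theta0, dtype=float)
    xi = sample_xi()
    ll = loglik(theta, xi)                    # one forward solve
    samples = []
    for _ in range(n_iter):
        theta_star = theta + prop_sd * rng.standard_normal(theta.shape)
        ll_star = loglik(theta_star, xi)      # scored with the CURRENT xi
        if np.log(rng.uniform()) < ll_star - ll:
            theta = theta_star
            xi = sample_xi()                  # refresh xi only on acceptance
            ll = loglik(theta, xi)            # recompute, ready for next step
        samples.append(theta)
    return np.array(samples)

# Toy check on a standard Gaussian target (xi unused), not the ODE problem:
out = mwg_sampler(lambda th, xi: -0.5 * np.sum(th**2),
                  np.zeros(2), 2000, 0.5, lambda: None)
```

Note that at most two forward solves occur per iteration, and the second only on acceptance, which is the source of the factor-of-two cost bound discussed in the text.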
For each method M, its pre-calculated calibration parameter α*_M is used to set the variance of ξ.

At larger step-sizes, the deterministic methods both give over-confident and biased estimates (on different sides of the true value). In accordance with the findings of Conrad et al. (2016), the probabilistic forward Euler method returns a wider posterior which covers the true solution. The bottom right-hand panel demonstrates the same effect with the probabilistic backward Euler method we have introduced in this paper.

Figure 2: Comparison of the posterior distribution of (θ₂, θ₃) from the FitzHugh–Nagumo model in cases where the forward solve is calculated using one of four different integrators (deterministic and probabilistic backward- and forward-Euler methods), each for four different step sizes h = 0.005, 0.01, 0.02, 0.05. All density estimates are calculated using 1000 MCMC samples. Dashed black lines indicate true parameter values. Full details are given in the main text.

We find similar results for second- and higher-order methods, both explicit and implicit. The scale of the effect is, however, relatively small on such a simple test problem, where a higher-order integrator would not be expected to produce much error in the forward solve. Further work will investigate the application of these methods to more challenging problems.

4 Conclusions and avenues for further work

In this paper we have surveyed the existing collection of probabilistic integrators for ODEs and proposed a new construction, the first to be based on implicit methods, giving a rigorous description of its theoretical properties. We have given preliminary experimental results showing the effect on parameter inference of the use of different first-order methods, both existing and new, in the evaluation of the forward model.
Higher-order multistep methods are also admitted by our construction.

Our discussion of integrator calibration does not claim to resolve this subtle and thorny problem, but suggests several avenues for future research. We have mooted a question on the relationship between the scaling parameter α and other method characteristics; insight into this issue may be the key to making these types of randomised methods more practical, since common tricks for calibration may emerge which are then applicable to different problems. An interesting direction of enquiry, being explored separately, concerns whether estimates of global error from other sources, e.g. adjoint error modelling or condition number estimation, could be gainfully applied to calibrate these methods.

Acknowledgements

HCL and TJS are partially supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation (DFG). This work was partially supported by the DFG through grant CRC 1114 Scaling Cascades in Complex Systems, and by the National Science Foundation (NSF) under grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute's QMC Working Group II Probabilistic Numerics.

References

Abdulle, A. and Garegnani, G. (2018). Random Time Step Probabilistic Methods for Uncertainty Quantification in Chaotic and Geometric Numerical Integration. arXiv:1801.01340.

Butcher, J. (2008). Numerical Methods for Ordinary Differential Equations: Second Edition. Wiley.

Calderhead, B., Girolami, M., and Lawrence, N. (2009). Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes. Advances in Neural Information Processing Systems (NeurIPS), 21:217–224.

Chkrebtii, O., Campbell, D., Calderhead, B., and Girolami, M. (2016). Bayesian Solution Uncertainty Quantification for Differential Equations.
Bayesian Analysis, 11(4):1239–1267.

Conrad, P., Girolami, M., Särkkä, S., Stuart, A., and Zygalakis, K. (2016). Statistical Analysis of Differential Equations: Introducing Probability Measures on Numerical Solutions. Statistics and Computing, pages 1–18.

Cotter, S., Roberts, G., Stuart, A., and White, D. (2013). MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science, 28(3):424–446.

Dondelinger, F., Rogers, S., and Husmeier, D. (2013). ODE Parameter Inference using Adaptive Gradient Matching with Gaussian Processes. Proc. of the 16th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 31:216–228.

Haario, H., Saksman, E., and Tamminen, J. (2001). An Adaptive Metropolis Algorithm. Bernoulli, 7(2):223–242.

Hennig, P. and Hauberg, S. (2014). Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. Proc. of the 17th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 33:347–355.

Hennig, P., Osborne, M., and Girolami, M. (2015). Probabilistic Numerics and Uncertainty in Computations. Proc. R. Soc. A, 471(2179):20150142.

Kersting, H. and Hennig, P. (2016). Active Uncertainty Calibration in Bayesian ODE Solvers. Uncertainty in Artificial Intelligence (UAI), 32.

Lie, H. C., Stuart, A., and Sullivan, T. (2017). Strong Convergence Rates of Probabilistic Integrators for Ordinary Differential Equations. arXiv:1703.03680.

Lie, H. C., Sullivan, T. J., and Teckentrup, A. L. (2018). Random Forward Models and Log-Likelihoods in Bayesian Inverse Problems. SIAM/ASA Journal on Uncertainty Quantification. To appear.

Macdonald, B., Higham, C., and Husmeier, D. (2015). Controversy in Mechanistic Modelling with Gaussian Processes. Proc. of the 32nd Int. Conf. on Machine Learning (ICML), 37:1539–1547.

Oehlert, G. (1992). A Note on the Delta Method. The American Statistician, 46(1):27–29.

Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.

Ramsay, J., Hooker, G., Campbell, D., and Cao, J. (2007). Parameter Estimation for Differential Equations: a Generalized Smoothing Approach. J. Royal Stat. Soc. B, 69(5):741–796.

Särkkä, S. (2013). Bayesian Filtering and Smoothing. Cambridge University Press.

Schober, M., Duvenaud, D., and Hennig, P. (2014). Probabilistic ODE Solvers with Runge-Kutta Means. Advances in Neural Information Processing Systems (NeurIPS), 27:739–747.

Schober, M., Särkkä, S., and Hennig, P. (2018). A Probabilistic Model for the Numerical Solution of Initial Value Problems. Statistics and Computing, pages 1–24.

Skilling, J. (1991). Bayesian Solution of Ordinary Differential Equations. In Smith, C. R., editor, Maximum Entropy and Bayesian Methods, Fundamental Theories of Physics, pages 23–37. Springer, Dordrecht.

Skilling, J. (1993). Bayesian Numerical Analysis. In Grandy, W. J. and Milonni, P., editors, Physics and Probability, pages 207–222. Cambridge University Press.

Teymur, O., Zygalakis, K., and Calderhead, B. (2016). Probabilistic Linear Multistep Methods. Advances in Neural Information Processing Systems (NeurIPS), 29:4314–4321.

Wang, Y. and Barber, D. (2014). Gaussian Processes for Bayesian Estimation in Ordinary Differential Equations. Proc. of the 31st Int. Conf. on Machine Learning (ICML), 32:1485–1493.