{"title": "Scalable Variational Inference for Dynamical Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 4806, "page_last": 4815, "abstract": "Gradient matching is a promising tool for learning parameters and state dynamics of ordinary differential equations. It is a grid free inference approach, which, for fully observable systems is at times competitive with numerical integration. However, for many real-world applications, only sparse observations are available or even unobserved variables are included in the model description. In these cases most gradient matching methods are difficult to apply or simply do not provide satisfactory results. That is why, despite the high computational cost, numerical integration is still the gold standard in many applications. Using an existing gradient matching approach, we propose a scalable variational inference framework which can infer states and parameters simultaneously, offers computational speedups, improved accuracy and works well even under model misspecifications in a partially observable system.", "full_text": "Scalable Variational Inference for Dynamical Systems\n\nDept. of Computer Science\n\nDept. of Computer Science\n\nNico S. Gorbach\u2217\n\nETH Zurich\n\nngorbach@inf.ethz.ch\n\nbauers@inf.ethz.ch\n\njbuhmann@inf.ethz.ch\n\nStefan Bauer\u2217\n\nETH Zurich\n\nJoachim M. Buhmann\n\nDept. of Computer Science\n\nETH Zurich\n\nAbstract\n\nGradient matching is a promising tool for learning parameters and state dynamics\nof ordinary differential equations. It is a grid free inference approach, which,\nfor fully observable systems is at times competitive with numerical integration.\nHowever, for many real-world applications, only sparse observations are available\nor even unobserved variables are included in the model description. In these cases\nmost gradient matching methods are dif\ufb01cult to apply or simply do not provide\nsatisfactory results. 
That is why, despite the high computational cost, numerical\nintegration is still the gold standard in many applications. Using an existing gradient\nmatching approach, we propose a scalable variational inference framework which\ncan infer states and parameters simultaneously, offers computational speedups,\nimproved accuracy and works well even under model misspeci\ufb01cations in a partially\nobservable system.\n\n1\n\nIntroduction\n\nParameter estimation for ordinary differential equations (ODE\u2019s) is challenging due to the high\ncomputational cost of numerical integration. In recent years, gradient matching techniques established\nthemselves as successful tools [e.g. Babtie et al., 2014] to circumvent the high computational\ncost of numerical integration for parameter and state estimation in ordinary differential equations.\nGradient matching is based on minimizing the difference between the interpolated slopes and the time\nderivatives of the state variables in the ODE\u2019s. First steps go back to spline based methods [Varah,\n1982, Ramsay et al., 2007] where in an iterated two-step procedure coef\ufb01cients and parameters are\nestimated. Often cubic B-splines are used as basis functions while more advanced approaches [Niu\net al., 2016] use kernel functions derived from the ODE\u2019s. An overview of recent approaches with a\nfocus on the application for systems biology is provided in Macdonald and Husmeier [2015]. It is\nunfortunately not straightforward to extend spline based approaches to include unobserved variables\nsince they usually require full observability of the system. Moreover, these methods critically\ndepend on the estimation of smoothing parameters, which are dif\ufb01cult to estimate when only sparse\nobservations are available. As a solution for both problems, Gaussian process (GP) regression was\nproposed in Calderhead et al. [2008] and further improved in Dondelinger et al. [2013]. 
While both Bayesian approaches work very well for fully observable systems, they (opposite to splines) cannot simultaneously infer parameters and unobserved states and perform poorly when only combinations of variables are observed or the differential equations contain unobserved variables. Unfortunately this is the case for most practical applications [e.g. Barenco et al., 2006].

Related work. Archambeau et al. [2008] proposed variational inference to approximate the true process of the dynamical system by a time-varying linear system. Their approach was later significantly extended [Ruttor et al., 2013, Ruttor and Opper, 2010, Vrettas et al., 2015]. However, similar to Lyons et al. [2012] they study parameter estimation in stochastic dynamical systems while our work focuses on deterministic systems. In addition, they use the Euler-Maruyama discretization, whereas our approach is grid free. Wang and Barber [2014] propose an approach based on a belief network but as discussed in the controversy of mechanistic modelling [Macdonald et al., 2015], this leads to an intrinsic identifiability problem.

Our contributions. Our proposal is a scalable variational inference based framework which can infer states and parameters simultaneously, offers significant runtime improvements, improved accuracy and works well even in the case of partially observable systems. Since it is based on simplistic mean-field approximations it offers the opportunity for significant future improvements.

*The first two authors contributed equally to this work.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
We illustrate the potential of our work by analyzing a system of up to 1000 states in less than 400 seconds on a standard laptop.2

2 Deterministic Dynamical Systems

A deterministic dynamical system is represented by a set of K ordinary differential equations (ODE's) with model parameters θ that describe the evolution of K states x(t) = [x1(t), x2(t), ..., xK(t)]ᵀ such that:

ẋ(t) = dx(t)/dt = f(x(t), θ).   (1)

A sequence of observations, y(t), is usually contaminated by some measurement error which we assume to be normally distributed with zero mean and variance for each of the K states, i.e. E ∼ N(0, D), with D_ik = σ_k² δ_ik. Thus for N distinct time points the overall system may be summarized as:

Y = X + E,   (2)

where

X = [x(t1), ..., x(tN)] = [x1, ..., xK]ᵀ,
Y = [y(t1), ..., y(tN)] = [y1, ..., yK]ᵀ,

and x_k = [x_k(t1), ..., x_k(tN)]ᵀ is the k'th state sequence and y_k = [y_k(t1), ..., y_k(tN)]ᵀ are the observations. Given the observations Y and the description of the dynamical system (1), the aim is to estimate both state variables X and parameters θ. While numerical integration can be used for both problems, its computational cost is prohibitive for large systems and motivates the grid free method outlined in section 3.

3 GP based Gradient Matching

Gaussian process based gradient matching was originally motivated in Calderhead et al. [2008] and further developed in Dondelinger et al. [2013].
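The setup in (1) and (2) can be made concrete with a short sketch: integrate a known ODE numerically and corrupt the resulting state matrix X with Gaussian noise E. The system, parameter values, noise level and initial states below follow the Lotka-Volterra experiment described in section 5; the solver, seed and tolerances are arbitrary illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, x, theta):
    """Lotka-Volterra right-hand side f(x(t), theta) from section 5.1."""
    th1, th2, th3, th4 = theta
    return [th1 * x[0] - th2 * x[0] * x[1],
            -th3 * x[1] + th4 * x[0] * x[1]]

theta = (2.0, 1.0, 4.0, 1.0)            # ODE parameters used in the experiments
t_obs = np.linspace(0.0, 2.0, 21)       # N = 21 time points, sampling interval 0.1
sol = solve_ivp(f, (0.0, 2.0), [3.0, 5.0], args=(theta,),
                t_eval=t_obs, rtol=1e-8)

X = sol.y                               # K x N state matrix [x_1, ..., x_K]^T
sigma = 0.5                             # noise std, i.e. sigma^2 = 0.25
rng = np.random.default_rng(0)
E = sigma * rng.standard_normal(X.shape)
Y = X + E                               # observation model (2): Y = X + E
```

The matrix Y then plays the role of the data from which both X and θ are to be recovered.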
Assuming a Gaussian process prior on state variables such that:

p(X | φ) := ∏_k N(x_k | 0, C_φk),   (3)

where C_φk is a covariance matrix defined by a given kernel with hyper-parameters φ_k, the k-th element of φ, we obtain a posterior distribution over state-variables (from (2)):

p(X | Y, φ, σ) = ∏_k N(μ_k(y_k), Σ_k),   (4)

where μ_k(y_k) := σ_k⁻² (σ_k⁻² I + C_φk⁻¹)⁻¹ y_k and Σ_k⁻¹ := σ_k⁻² I + C_φk⁻¹.

Assuming that the covariance function C_φk is differentiable and using the closure property under differentiation of Gaussian processes, the conditional distribution over state derivatives is:

p(Ẋ | X, φ) = ∏_k N(ẋ_k | m_k, A_k),   (5)

where the mean and covariance are given by:

m_k := 'C_φk C_φk⁻¹ x_k,   A_k := C''_φk − 'C_φk C_φk⁻¹ C'_φk,   (6)

where C''_φk denotes the auto-covariance for each state-derivative, with C'_φk and 'C_φk denoting the cross-covariances between the state and its derivative.

2 All experiments were run on a 2.5 GHz Intel Core i7 Macbook.

Assuming additive, normally distributed noise with state-specific error variance γ_k in (1), we have:

p(Ẋ | X, θ, γ) = ∏_k N(ẋ_k | f_k(X, θ), γ_k I).   (7)

A product of experts approach combines the ODE informed distribution of state-derivatives (distribution (7)) with the smoothed distribution of state-derivatives (distribution (5)):

p(Ẋ | X, θ, φ, γ) ∝ p(Ẋ | X, φ) p(Ẋ | X, θ, γ).   (8)

The motivation for the product of experts is that the multiplication implies that both the data
\ufb01t\nand the ODE response have to be satis\ufb01ed at the same time in order to achieve a high value of\np( \u02d9X | X, \u03b8, \u03c6, \u03b3). This is contrary to a mixture model, i.e. a normalized addition, where a high\nvalue for one expert e.g. over\ufb01tting the data while neglecting the ODE response or vice versa, is\nacceptable.\nThe proposed methodology in Calderhead et al. [2008] is to analytically integrate out \u02d9X:\n\n(cid:90)\n(cid:89)\n\n(cid:88)\n\ni=1\n\n(cid:89)\n\nj\u2208Mki\n\np(\u03b8|X, \u03c6, \u03b3) = Z\u22121\n= Z\u22121\n\n\u03b8 (X) p(\u03b8)\n\n\u03b8 (X) p(\u03b8)\n\np( \u02d9X|X, \u03c6)p( \u02d9X|X, \u03b8, \u03b3)d \u02d9X\nN (fk(X, \u03b8)|mk, \u039b\u22121\nk ),\n\n(9)\n\nk\n\nk\n\n:= Ak + \u03b3kI and Z\u22121\n\nwith \u039b\u22121\n\u03b8 (X) as the normalization that depends on the states X. Calderhead\net al. [2008] infer the parameters \u03b8 by \ufb01rst sampling the states (i.e. X \u223c p(X | Y, \u03c6, \u03c3)) followed\nby sampling the parameters given the states (i.e. \u03b8, \u03b3 \u223c p(\u03b8, \u03b3 | X, \u03c6, \u03c3)). In this setup, sampling\nX is independent of \u03b8, which implies that \u03b8 and \u03b3 have no in\ufb02uence on the inference of the state\nvariables. The desired feedback loop was closed by Dondelinger et al. [2013] through sampling from\nthe joint posterior of p(\u03b8 | X, \u03c6, \u03c3, \u03b3, Y). Since sampling the states only provides their values at\ndiscrete time points, Calderhead et al. [2008] and Dondelinger et al. [2013] require the existence of an\nexternal ODE solver to obtain continuous trajectories of the state variables. For simplicity, we derived\nthe approach assuming full observability. 
However, the approach has the advantage (as opposed to splines) that the assumption of full observability can be relaxed to include only observations for combinations of states by replacing (2) with Y = AX + E, where A encodes the linear relationship between observations and states. In addition, unobserved states can be naturally included in the inference by simply using the prior on state variables (3) [Calderhead et al., 2008].

4 Variational Inference for Gradient Matching by Exploiting Local Linearity in ODE's

For subsequent sections we consider only models of the form (1) with reactions based on mass-action kinetics, which are given by:

f_k(x(t), θ) = ∑_i θ_ki ∏_{j∈M_ki} x_j,   (10)

with M_ki ⊆ {1, ..., K} describing the state variables in each factor of the equation, i.e. the functions are linear in the parameters and contain arbitrarily large products of monomials of the states. The motivation for the restriction to this functional class is twofold. First, this formulation includes models which exhibit periodicity as well as high nonlinearity and especially physically realistic reactions in systems biology [Schillings et al., 2015].

Second, the true joint posterior over all unknowns is given by:

p(θ, X | Y, φ, γ, σ) = p(θ | X, φ, γ) p(X | Y, φ, σ)
                     = Z_θ⁻¹(X) p(θ) ∏_k N(f_k(X, θ) | m_k, Λ_k⁻¹) N(x_k | μ_k(Y), Σ_k),

where the normalization of the parameter posterior (9), Z_θ(X), depends on the states X. The dependence is nontrivial and induced by the nonlinear couplings of the states X, which make the inference (e.g. by integration) challenging in the first place. Previous approaches ignore the dependence of Z_θ(X) on the states X by setting Z_θ(X) equal to one [Dondelinger et al., 2013, equation 20].
We determine Z_θ(X) analytically by exploiting the local linearity of the ODE's as shown in section 4.1 (and section 7 in the supplementary material). More precisely, for mass-action kinetics (10), we can rewrite the ODE's as a linear combination in an individual state or as a linear combination in the ODE parameters.3 We thus achieve superior performance over existing gradient matching approaches, as shown in the experimental section 5.

4.1 Mean-field Variational Inference

To infer the parameters θ, we want to find the maximum a posteriori estimate (MAP):

θ* := argmax_θ ln p(θ | Y, φ, γ, σ) = argmax_θ ln ∫ p(θ | X, φ, γ) p(X | Y, φ, σ) dX,   (11)

where the integrand is the joint posterior p(θ, X | Y, φ, γ, σ). However, the integral in (11) is intractable in most cases due to the strong couplings induced by the nonlinear ODE's f, which appear in the term p(θ | X, φ, γ) (equation 9). We therefore use mean-field variational inference to establish variational lower bounds that are analytically tractable by decoupling state variables from the ODE parameters as well as decoupling the state variables from each other. Before explaining the mechanism behind mean-field variational inference, we first observe that, due to the model assumption (10), the true conditional distributions p(θ | X, Y, φ, γ, σ) and p(x_u | θ, X_{−u}, Y, φ, γ, σ) are Gaussian distributed, where X_{−u} denotes all states excluding state x_u (i.e. X_{−u} := {x ∈ X | x ≠ x_u}).
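The local linearity can be seen explicitly by writing a mass-action system in design-matrix form: for each state k, f_k(X, θ) = B_k(X) θ, with B_k built from the monomials of (10). A sketch for the Lotka-Volterra equations follows; the random state values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
x1 = rng.uniform(0.5, 5.0, N)                 # illustrative state sequences
x2 = rng.uniform(0.5, 5.0, N)

# dx1/dt = theta1*x1 - theta2*x1*x2: monomial design matrix for state 1
B1 = np.column_stack([x1, -x1 * x2, np.zeros(N), np.zeros(N)])
# dx2/dt = -theta3*x2 + theta4*x1*x2: monomial design matrix for state 2
B2 = np.column_stack([np.zeros(N), np.zeros(N), -x2, x1 * x2])

theta = np.array([2.0, 1.0, 4.0, 1.0])
f1 = B1 @ theta                               # linear in theta for fixed states ...
f2 = B2 @ theta
# ... and, for fixed x2 and theta, f1 = (theta[0] - theta[1] * x2) * x1 is also
# linear in the individual state x1, which is why the conditionals are Gaussian.
```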
For didactical reasons, we write the true conditional distributions in canonical form:

p(θ | X, Y, φ, γ, σ) = h(θ) × exp( η_θ(X, Y, φ, γ, σ)ᵀ t(θ) − a_θ(η_θ(X, Y, φ, γ, σ)) )
p(x_u | θ, X_{−u}, Y, φ, γ, σ) = h(x_u) × exp( η_u(θ, X_{−u}, Y, φ, γ, σ)ᵀ t(x_u) − a_u(η_u(θ, X_{−u}, Y, φ, γ, σ)) )   (12)

where h(·) and a(·) are the base measure and log-normalizer and η(·) and t(·) are the natural parameter and sufficient statistics.

The decoupling is induced by designing a variational distribution Q(θ, X) which is restricted to the family of factorial distributions:

Q := { Q : Q(θ, X) = q(θ | λ) ∏_u q(x_u | ψ_u) },   (13)

where λ and ψ_u are the variational parameters. The particular form of q(θ | λ) and q(x_u | ψ_u) is designed to be in the same exponential family as the true conditional distributions in equation (12):

q(θ | λ) := h(θ) exp( λᵀ t(θ) − a_θ(λ) )
q(x_u | ψ_u) := h(x_u) exp( ψ_uᵀ t(x_u) − a_u(ψ_u) )

3 For mass-action kinetics as in (10), the ODE's are nonlinear in all states but linear in a single state as well as linear in all ODE parameters.

To find the optimal factorial distribution we minimize the Kullback-Leibler divergence between the variational and the true posterior distribution:

Q̂ := argmin_{Q(θ,X)∈Q} KL[ Q(θ, X) || p(θ, X | Y, φ, γ, σ) ]
   = argmin_{Q(θ,X)∈Q} ( E_Q log Q(θ, X) − E_Q log p(θ, X | Y, φ, γ, σ) )
   = argmax_{Q(θ,X)∈Q} L_Q(λ, ψ),   (14)

where Q̂ is the proxy distribution and L_Q(λ, ψ) is the ELBO (Evidence Lower Bound), which depends on the
variational parameters λ and ψ. Maximizing the ELBO w.r.t. the proxy over θ is equivalent to maximizing the following lower bound:

L_θ(λ) := E_Q log p(θ | X, Y, φ, γ, σ) − E_Q log q(θ | λ)
        = (E_Q η_θ)ᵀ ∇_λ a_θ(λ) − λᵀ ∇_λ a_θ(λ),

where we substitute the true conditionals given in equation (12) and ∇_λ is the gradient operator. Similarly, maximizing the ELBO w.r.t. the proxy over the latent state x_u, we have:

L_x(ψ_u) := E_Q log p(x_u | θ, X_{−u}, Y, φ, γ, σ) − E_Q log q(x_u | ψ_u)
          = (E_Q η_u)ᵀ ∇_{ψu} a_u(ψ_u) − ψ_uᵀ ∇_{ψu} a_u(ψ_u).

Given the assumptions we made about the true posterior and the variational distribution (i.e. that each true conditional is in an exponential family and that the corresponding variational distribution is in the same exponential family) we can optimize each coordinate in closed form. To maximize the ELBO we set the gradient w.r.t. the variational parameters to zero:

∇_λ L_θ(λ) = ∇²_λ a_θ(λ) (E_Q η_θ − λ) = 0,

which is zero when:

λ̂ = E_Q η_θ.   (15)

Similarly, the optimal variational parameters of the states are given by:

ψ̂_u = E_Q η_u.   (16)

Since the true conditionals are Gaussian distributed, the expectations over the natural parameters are given by:

E_Q η_θ = ( E_Q[Ω_θ⁻¹ r_θ], −½ E_Q[Ω_θ⁻¹] )ᵀ,   E_Q η_u = ( E_Q[Ω_u⁻¹ r_u], −½ E_Q[Ω_u⁻¹] )ᵀ,   (17)

where r_θ and Ω_θ are the mean and covariance of the true conditional distribution over ODE parameters.
Similarly, r_u and Ω_u are the mean and covariance of the true conditional distribution over states. The variational parameters in equation (17) are derived analytically in the supplementary material (section 7). The coordinate ascent approach (where each step is analytically tractable) for estimating states and parameters is summarized in algorithm 1.

Algorithm 1 Mean-field coordinate ascent for GP Gradient Matching
1: Initialization of proxy moments η_u and η_θ.
2: repeat
3:   Given the proxy over ODE parameters q(θ | λ̂), calculate the proxy over individual states q(x_u | ψ̂_u) ∀ u ≤ n, by computing its moments ψ̂_u = E_Q η_u.
4:   Given the proxy over individual states q(x_u | ψ̂_u), calculate the proxy over ODE parameters q(θ | λ̂), by computing its moments λ̂ = E_Q η_θ.
5: until convergence or maximum number of iterations is exceeded.

Assuming that the maximal number of states for each equation in (10) is constant (which is, to the best of our knowledge, the case for any reasonable dynamical system), the computational complexity of the algorithm is linear in the number of states, O(N · K), for each iteration. This result is experimentally supported by figure 5, where we analyzed a system of up to 1000 states in less than 400 seconds.

5 Experiments

In order to provide a fair comparison to existing approaches, we test our approach on two small to medium sized ODE models, which have been extensively studied in the same parameter settings before [e.g. Calderhead et al., 2008, Dondelinger et al., 2013, Wang and Barber, 2014].
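The alternating structure of Algorithm 1 is ordinary coordinate-ascent variational inference: each sweep sets one proxy's variational parameters to the expected natural parameters under the other proxy, as in (15) and (16). The toy model below, a conjugate Gaussian chain θ → x → y and not the gradient-matching posterior itself, shows the pattern in a form where every update is a one-line closed-form moment computation:

```python
# Toy illustration of Algorithm 1's coordinate ascent: theta ~ N(0, s0^2),
# x | theta ~ N(theta, s1^2), y | x ~ N(x, s2^2). Both mean-field updates
# are closed form, mirroring psi_u = E_Q[eta_u] and lambda = E_Q[eta_theta].
s0_2 = s1_2 = s2_2 = 1.0      # prior, transition and observation variances
y = 1.5                       # a single observation

v_t = 1.0 / (1.0 / s0_2 + 1.0 / s1_2)    # q(theta) variance (constant here)
v_x = 1.0 / (1.0 / s1_2 + 1.0 / s2_2)    # q(x) variance (constant here)
m_t = m_x = 0.0                          # proxy means, initialized at zero

for _ in range(50):                      # "repeat ... until convergence"
    m_x = v_x * (m_t / s1_2 + y / s2_2)  # update q(x) given current q(theta)
    m_t = v_t * (m_x / s1_2)             # update q(theta) given current q(x)

# For this Gaussian toy model, mean-field recovers the exact posterior means,
# E[x | y] = 2/3 * y and E[theta | y] = 1/3 * y when all variances equal 1.
```

In the paper's setting the two updates are the Gaussian moment computations of (17) rather than scalar means, but the control flow is identical.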
Additionally, we show the scalability of our approach on a large-scale partially observable system which has so far been infeasible to analyze with existing gradient matching methods due to the number of unobserved states.

5.1 Lotka-Volterra

Figure 1: Lotka-Volterra: Given few noisy observations (red stars), simulated with a variance of σ² = 0.25, the leftmost plot shows the inferred state dynamics using our variational mean-field method (mean-field GM, median runtime 4.7 sec). Estimated mean and standard deviation for one random data initialization using our approach are illustrated in the left-center plot. The implemented spline method (splines, median runtime 48 sec) was based on Niu et al. [2016] and the adaptive gradient matching (AGM) is the approach proposed by Dondelinger et al. [2013]. Boxplots in the leftmost, right-center and rightmost plots illustrate the variance in the state and parameter estimations over 10 independent datasets.

The ODE's f(X, θ) of the Lotka-Volterra system [Lotka, 1978] are given by:

ẋ1 := θ1 x1 − θ2 x1 x2
ẋ2 := −θ3 x2 + θ4 x1 x2

The above system is used to study predator-prey interactions and exhibits periodicity and nonlinearity at the same time. We used the same ODE parameters as in Dondelinger et al. [2013] (i.e. θ1 = 2, θ2 = 1, θ3 = 4, θ4 = 1) to simulate the data over an interval [0, 2] with a sampling interval of 0.1. Predator species (i.e. x1) were initialized to 3 and prey species (i.e. x2) were initialized to 5. Mean-field variational inference for gradient matching was performed on a simulated dataset with additive Gaussian noise with variance σ² = 0.25.
The radial basis function kernel was used to capture the covariance between a state at different time points.

As shown in figure 1, our method performs significantly better than all other methods at a fraction of the computational cost. The poor performance in accuracy of Niu et al. [2016] can be explained by the significantly lower number of samples and higher noise level, compared to the simpler setting of their experiments. In order to show the potential of our work we decided to follow the more difficult and established experimental settings used in [e.g. Calderhead et al., 2008, Dondelinger et al., 2013, Wang and Barber, 2014]. This illustrates the difficulty of

Figure 2: Lotka-Volterra: Given only observations (red stars) until time t = 2, the state trajectories are inferred including the unobserved time points up to time t = 4. The typical patterns of the Lotka-Volterra system for predator and prey species are recovered. The shaded blue area shows the uncertainty around the inferred state trajectories.

spline based gradient matching methods when only few observations are available. We estimated the smoothing parameter λ in the proposal of Niu et al. [2016] using leave-one-out cross-validation. While their method can in principle achieve the same runtime (e.g. using 10-fold cv) as our method, the performance for parameter estimation is significantly worse already when using leave-one-out cross-validation, where the median parameter estimation over ten independent data initializations is completely off for three out of four parameters (figure 1).
Adaptive gradient matching (AGM) [Dondelinger et al., 2013] would eventually converge to the true parameter values but, at roughly 100 times the runtime, achieves significantly worse results in accuracy than our approach (figure 1). In figure 2 we additionally show that the mechanism of the Lotka-Volterra system is correctly inferred even when including unobserved time intervals.

5.2 Protein Signalling Transduction Pathway

In the following we only compare with the current state of the art in GP based gradient matching [Dondelinger et al., 2013], since spline methods are in general difficult or inapplicable for partially observable systems. In addition, already in the case of a simpler system and more data points (e.g. figure 1), splines were not competitive (in accuracy) with the approach of Dondelinger et al. [2013].

Figure 3: For the noise level of σ² = 0.1 the leftmost and left-center plots show the performance of Dondelinger et al. [2013] (AGM) for inferring the state trajectories of state S. The red curve in all plots is the groundtruth, while the inferred trajectories are plotted in green for AGM (leftmost and left-center plot) and in blue for our approach (right-center and rightmost plot). While in the scenario of the leftmost and right-center plot observations are available (red stars) and both approaches work well, the approach of Dondelinger et al.
[2013] (AGM) is significantly off in inferring the same state when it is unobserved but all other parameters remain the same (left-center plot), while our approach infers similar dynamics in both scenarios.

The chemical kinetics for the protein signalling transduction pathway are governed by a combination of mass-action kinetics and Michaelis-Menten kinetics:

Ṡ = −k1 × S − k2 × S × R + k3 × RS
ḋS = k1 × S
Ṙ = −k2 × S × R + k3 × RS + V × Rpp / (Km + Rpp)
ṘS = k2 × S × R − k3 × RS − k4 × RS
Ṙpp = k4 × RS − V × Rpp / (Km + Rpp)

For a detailed description of the system with its biological interpretations we refer to Vyshemirsky and Girolami [2008]. While the mass-action kinetics in the protein transduction pathway satisfy our constraints on the functional form of the ODE's (10), the Michaelis-Menten kinetics do not, since they give rise to the ratio of states Rpp / (Km + Rpp). We therefore define the following latent variables:

x1 := S, x2 := dS, x3 := R, x4 := RS, x5 := Rpp / (Km + Rpp)
θ1 := k1, θ2 := k2, θ3 := k3, θ4 := k4, θ5 := V

The transformation is motivated by the fact that in the new system, all states only appear as monomials, as required in (10). Our variable transformation includes an inherent error (e.g. by replacing Ṙpp = k4 × RS − V × Rpp / (Km + Rpp) with ẋ5 = θ4 × x4 − θ5 × x5) but despite such a misspecification, our method estimates four out of five parameters correctly (figure 4). Once more, we use the same ODE parameters as in Dondelinger et al. [2013], i.e. k1 = 0.07, k2 = 0.6, k3 = 0.05, k4 = 0.3, V = 0.017.
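After the substitution, every right-hand side of the transformed system is a sum of parameter-weighted monomials as required by (10), with the x5 equation carrying the deliberate misspecification discussed above. A simulation sketch follows; the initial conditions are assumed values for illustration and are not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

def f_monomial(t, x, th):
    """Transformed protein-transduction ODE's; each term is a parameter times a monomial."""
    x1, x2, x3, x4, x5 = x
    th1, th2, th3, th4, th5 = th
    return [-th1 * x1 - th2 * x1 * x3 + th3 * x4,   # x1 = S
            th1 * x1,                               # x2 = dS
            -th2 * x1 * x3 + th3 * x4 + th5 * x5,   # x3 = R
            th2 * x1 * x3 - th3 * x4 - th4 * x4,    # x4 = RS
            th4 * x4 - th5 * x5]                    # x5: misspecified surrogate

theta = (0.07, 0.6, 0.05, 0.3, 0.017)   # k1, k2, k3, k4, V from the text
x0 = [1.0, 0.0, 1.0, 0.0, 0.0]          # assumed initial state, for illustration
t_eval = [0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100]
sol = solve_ivp(f_monomial, (0.0, 100.0), x0, args=(theta,),
                t_eval=t_eval, rtol=1e-8, atol=1e-10)
```

Since dx2/dt = θ1 x1 ≥ 0 whenever x1 ≥ 0, the degraded-substrate state x2 is nondecreasing, which gives a quick sanity check on the simulation.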
The data was sampled over an interval [0, 100] with time point samples at t = [0, 1, 2, 4, 5, 7, 10, 15, 20, 30, 40, 50, 60, 80, 100]. Parameters were inferred in two experiments with additive Gaussian distributed noise with variances σ² = 0.01 and σ² = 0.1.

Even for a misspecified model containing a systematic error, the ranking according to parameter values is preserved, as indicated in figure 4. While the approach of Dondelinger et al. [2013] converges much slower (again a factor of 100 in runtime) to the true values of the parameters (for a fully observable system), it is significantly off if state S is unobserved and is more sensitive to the introduction of noise than our approach (figure 3). Our method infers similar dynamics for the fully and partially observable system, as shown in figure 3, remains unchanged in its estimation accuracy after the introduction of unobserved variables (even having its inherent bias) and performs well even in comparison to numerical integration (figure 4). Plots for the additional state dynamics are shown in the supplementary material (section 6).

Figure 4: From left to right, the plots represent three different inference settings of increasing difficulty using the protein transduction pathway as an example. The left plot shows the results for a fully observable system and a small noise level (σ² = 0.01). Due to the violation of the functional form assumption our approach has an inherent bias and Dondelinger et al. [2013] (AGM) performs better, while Bayesian numerical integration (Bayes num. int.) serves as a gold standard and performs best. The middle plot shows the same system with an increased noise level of σ² = 0.1. Due to many outliers we only show the median over ten independent runs and adjust the scale for the middle and right plot.
In the right plot, state S was unobserved while the noise level was kept at σ² = 0.1 (the estimate for k3 of AGM is at 18 and out of the limits of the plot). Initializing numerical integration with our result (Bayes num. int. mf.) achieves the best results and significantly lowers the estimation error (right plot).

5.3 Scalability

To show the true scalability of our approach we apply it to the Lorenz 96 system, which consists of equations of the form:

f_k(x(t), θ) = (x_{k+1} − x_{k−2}) x_{k−1} − x_k + θ,   (18)

where θ is a scalar forcing parameter, x_{−1} = x_{K−1}, x_0 = x_K and x_{K+1} = x_1 (with K being the number of states in the deterministic system (1)). The Lorenz 96 system can be seen as a minimalistic weather model [Lorenz and Emanuel, 1998] and is often used with an additional diffusion term as a reference model for stochastic systems [e.g. Vrettas et al., 2015]. It offers a flexible framework for increasing the number of states in the inference problem and in our experiments we use between 125 and 1000 states. Due to its dimensionality, the Lorenz 96 system has so far not been analyzed using gradient matching methods, and to additionally increase the difficulty of the inference problem we randomly selected one third of the states to be unobserved. We simulated data setting θ = 8 with an observation noise of σ² = 1, using 32 equally spaced observations between zero and four seconds. Due to its scaling properties, our approach is able to infer a system with 1000 states within less than 400 seconds (right plot in figure 5). We can visually conclude that unobserved states are inferred approximately correctly and that the approximation error is independent of the dimensionality of the problem (right plot in figure 5).
Figure 5: The left plot shows the improved mechanistic modelling and the reduction of the root median squared error (RMSE) with each iteration of our algorithm. The groundtruth for an unobserved state is plotted in red, while the thin gray lines correspond to the inferred state trajectories in each iteration of the algorithm (the first flat thin gray line being the initialisation). The blue line is the inferred state trajectory of the unobserved state after convergence. The right plot shows the scaling of our algorithm with the dimensionality in the states. The red curve is the runtime in seconds, whereas the blue curve corresponds to the RMSE (right plot).

Due to space limitations, we show additional experiments for various dynamical systems in the fields of fluid dynamics, electrical engineering, systems biology and neuroscience only in the supplementary material in section 8.

6 Discussion

Numerical integration is a major bottleneck due to its computational cost for large scale estimation of parameters and states, e.g. in systems biology. However, it still serves as the gold standard for practical applications. Techniques based on gradient matching offer a computationally appealing and successful shortcut for parameter inference, but are difficult to extend to include unobserved variables in the model description or are unable to keep their performance level from fully observed systems. However, most real world applications are only partially observed. Provided that state variables appear as monomials in the ODE, we offer a simple, yet powerful inference framework that is scalable, significantly outperforms existing approaches in runtime and accuracy and performs well in the case of sparse observations even for partially observable systems. Many non-linear and periodic ODE's, e.g.
the Lotka-Volterra system, already fulfill our assumptions. The empirically shown robustness of our model to misspecification, even in the case of additional partial observability, already indicates that a relaxation of the functional form assumption might be possible in future research.

Acknowledgements

This research was partially supported by the Max Planck ETH Center for Learning Systems and the SystemsX.ch project SignalX.

References

Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John S Shawe-Taylor. Variational inference for diffusion processes. Neural Information Processing Systems (NIPS), 2008.

Ann C Babtie, Paul Kirk, and Michael PH Stumpf. Topological sensitivity analysis for systems biology. Proceedings of the National Academy of Sciences, 111(52):18507–18512, 2014.

Martino Barenco, Daniela Tomescu, Daniel Brewer, Robin Callard, Jaroslav Stark, and Michael Hubank. Ranked prediction of p53 targets using hidden variable dynamic modeling. Genome Biology, 7(3):R25, 2006.

Ben Calderhead, Mark Girolami, and Neil D. Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. Neural Information Processing Systems (NIPS), 2008.

Frank Dondelinger, Maurizio Filippone, Simon Rogers, and Dirk Husmeier. ODE parameter inference using adaptive gradient matching with Gaussian processes. International Conference on Artificial Intelligence and Statistics (AISTATS), 2013.

Edward N Lorenz and Kerry A Emanuel. Optimal sites for supplementary weather observations: Simulation with a small model.
Journal of the Atmospheric Sciences, 55(3):399–414, 1998.

Alfred J Lotka. The growth of mixed populations: two species competing for a common food supply. In The Golden Age of Theoretical Ecology: 1923–1940, pages 274–286. Springer, 1978.

Simon Lyons, Amos J Storkey, and Simo Särkkä. The coloured noise expansion and parameter estimation of diffusion processes. Neural Information Processing Systems (NIPS), 2012.

Benn Macdonald and Dirk Husmeier. Gradient matching methods for computational inference in mechanistic models for systems biology: a review and comparative analysis. Frontiers in Bioengineering and Biotechnology, 3, 2015.

Benn Macdonald, Catherine F. Higham, and Dirk Husmeier. Controversy in mechanistic modelling with Gaussian processes. International Conference on Machine Learning (ICML), 2015.

Mu Niu, Simon Rogers, Maurizio Filippone, and Dirk Husmeier. Fast inference in nonlinear dynamical systems using gradient matching. International Conference on Machine Learning (ICML), 2016.

Jim O Ramsay, Giles Hooker, David Campbell, and Jiguo Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(5):741–796, 2007.

Andreas Ruttor and Manfred Opper. Approximate parameter inference in a stochastic reaction-diffusion model. International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.

Andreas Ruttor, Philipp Batz, and Manfred Opper. Approximate Gaussian process inference for the drift function in stochastic differential equations. Neural Information Processing Systems (NIPS), 2013.

Claudia Schillings, Mikael Sunnåker, Jörg Stelling, and Christoph Schwab. Efficient characterization of parametric uncertainty of complex (bio)chemical networks. PLoS Computational Biology, 11(8):e1004457, 2015.

Klaas Enno Stephan, Lars Kasper, Lee M Harrison, Jean Daunizeau, Hanneke EM den Ouden, Michael Breakspear, and Karl J Friston.
Nonlinear dynamic causal models for fMRI. NeuroImage, 42(2):649–662, 2008. doi: 10.1016/j.neuroimage.2008.04.262. URL http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2636907/.

James M Varah. A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3(1):28–46, 1982.

Michail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148, 2015.

Vladislav Vyshemirsky and Mark A Girolami. Bayesian ranking of biochemical system models. Bioinformatics, 24(6):833–839, 2008.

Yali Wang and David Barber. Gaussian processes for Bayesian estimation in ordinary differential equations. International Conference on Machine Learning (ICML), 2014.