{"title": "Neural Networks with Cheap Differential Operators", "book": "Advances in Neural Information Processing Systems", "page_first": 9961, "page_last": 9971, "abstract": "Gradients of neural networks can be computed efficiently for any architecture, but some applications require computing differential operators with higher time complexity. We describe a family of neural network architectures that allow easy access to a family of differential operators involving \\emph{dimension-wise derivatives}, and we show how to modify the backward computation graph to compute them efficiently. We demonstrate the use of these operators for solving root-finding subproblems in implicit ODE solvers, exact density evaluation for continuous normalizing flows, and evaluating the Fokker-Planck equation for training stochastic differential equation models.", "full_text": "Neural Networks with Cheap Differential Operators\n\nRicky T. Q. Chen, David Duvenaud\nUniversity of Toronto, Vector Institute\n\n{rtqichen,duvenaud}@cs.toronto.edu\n\nAbstract\n\nGradients of neural networks can be computed ef\ufb01ciently for any architecture, but\nsome applications require differential operators with higher time complexity. We\ndescribe a family of restricted neural network architectures that allow ef\ufb01cient com-\nputation of a family of differential operators involving dimension-wise derivatives,\nused in cases such as computing the divergence. Our proposed architecture has\na Jacobian matrix composed of diagonal and hollow (non-diagonal) components.\nWe can then modify the backward computation graph to extract dimension-wise\nderivatives ef\ufb01ciently with automatic differentiation. 
We demonstrate these cheap differential operators for solving root-finding subproblems in implicit ODE solvers, exact density evaluation for continuous normalizing flows, and evaluating the Fokker–Planck equation for training stochastic differential equation models.

1 Introduction

Artificial neural networks are useful as arbitrarily-flexible function approximators (Cybenko, 1989; Hornik, 1991) in a number of fields. However, their use in applications involving differential equations is still in its infancy. While many focus on training black-box neural nets to approximately represent solutions of differential equations (e.g., Lagaris et al. (1998); Tompson et al. (2017)), few have focused on designing neural networks such that differential operators can be efficiently applied.

In modeling differential equations, it is common to see differential operators that require only dimension-wise or element-wise derivatives, such as the Jacobian diagonal, the divergence (i.e., the Jacobian trace), or generalizations involving higher-order derivatives. Often we want to compute these operators when evaluating a differential equation or as a downstream task. For instance, once we have fit a stochastic differential equation, we may want to apply the Fokker–Planck equation (Risken, 1996) to compute the probability density, but this requires computing the divergence and other differential operators. The Jacobian diagonal can also be used in numerical optimization schemes such as accelerating fixed-point iterations, where it approximates the full Jacobian while maintaining the same fixed-point solution.

In general, neural networks do not admit cheap evaluation of arbitrary differential operators.
If we view the evaluation of a neural network as traversing a computation graph, then reverse-mode automatic differentiation (a.k.a. backpropagation) traverses the exact same set of nodes in the reverse direction (Griewank and Walther, 2008; Schulman et al., 2015). This allows us to compute what is mathematically equivalent to vector-Jacobian products with asymptotic time cost equal to that of the forward evaluation. However, in general, the number of backward passes (i.e., vector-Jacobian products) required to construct the full Jacobian for unrestricted architectures grows linearly with the dimensionality of the input and output. Unfortunately, this is also true for extracting the diagonal elements of the Jacobian needed for differential operators such as the divergence.

In this work, we construct a neural network in a manner that allows a family of differential operators involving dimension-wise derivatives to be cheaply accessible. We then modify the backward computation graph to efficiently compute these derivatives with a single backward pass.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: (a) Visualization of HollowNet's computation graph, which is composed of a conditioner network (blue) and a transformer network (red). Each line represents a non-linear dependency. (b) We can modify the backward pass to retain only dimension-wise dependencies, which exist only through the transformer network. (c) The connections are designed so we can easily factor the full Jacobian matrix d/dx f into diagonal and hollow components (visualized for dh = 2):

    d/dx f = ∂τ/∂x + (∂τ/∂h)(∂h/∂x)

where ∂τ/∂x is diagonal and (∂τ/∂h)(∂h/∂x) is hollow.

2 HollowNet: Segregating Dimension-wise Derivatives

Given a function f : R^d → R^d, we seek to obtain a vector containing its dimension-wise k-th order derivatives,

    D^k_dim f := [∂^k f_1(x)/∂x_1^k   ∂^k f_2(x)/∂x_2^k   ···   ∂^k f_d(x)/∂x_d^k]^T ∈ R^d    (1)

using only k evaluations of automatic differentiation regardless of the dimension d. For notational simplicity, we denote D_dim := D^1_dim.

We first build the forward computation graph of a neural network such that the Jacobian matrix is composed of a diagonal and a hollow matrix, corresponding to dimension-wise partial derivatives and interactions between dimensions respectively. We can efficiently apply the D^k_dim operator by disconnecting the connections that represent the hollow elements of the Jacobian in the backward computation graph. Due to the structure of the Jacobian matrix, we refer to this type of architecture as a HollowNet.

2.1 Building the Computation Graph

We build the HollowNet architecture for f(x) by first constructing hidden vectors for each dimension i, h_i ∈ R^dh, that don't depend on x_i, and then concatenating them with x_i to be fed into an arbitrary neural network. The combined architecture sets us up for modifying the backward computation graph for cheap access to dimension-wise operators. We can describe this approach as two main steps, borrowing terminology from Huang et al. (2018):

1. Conditioner. h_i = c_i(x_−i) where c_i : R^(d−1) → R^dh and x_−i denotes the vector of length d − 1 where the i-th element is removed.
All states {h_i}_{i=1}^d can be computed in parallel by using networks with masked weights, which exist for both fully connected (Germain et al., 2015) and convolutional architectures (Oord et al., 2016).

2. Transformer. f_i(x) = τ_i(x_i, h_i) where τ_i : R^(dh+1) → R is a neural network that takes as input the concatenated vector [x_i, h_i]. All dimensions of f_i(x) can be computed in parallel if the τ_i's are composed of matrix-vector and element-wise operations, as is standard in deep learning.

Relaxation of existing work. This family of architectures contains existing special cases, such as Inverse (Kingma et al., 2016), Masked (Papamakarios et al., 2017) and Neural (Huang et al., 2018) Autoregressive Flows, as well as NICE and Real NVP (Dinh et al., 2014, 2016). Notably, existing works focus on constraining f(x) to have a triangular Jacobian by using a conditioner network with a specified ordering. They also choose τ_i to be invertible. In contrast, we relax both constraints as they are not required for our application. We compute h = c(x) in parallel by using two masked autoregressive networks (Germain et al., 2015).

Expressiveness. This network introduces a bottleneck in terms of expressiveness. If dh ≥ d − 1, then it is at least as expressive as a standard neural network, since we can simply set h_i = x_−i to recover the same expressiveness as a standard neural net. However, this would require O(d^2) total hidden units for each evaluation, resulting in a cost similar (though still better parallelized) to naïvely computing the full Jacobian of a general neural network with d AD calls. For this reason, we would like to have dh ≪ d to reduce the amount of compute in evaluating our network.
It is worth noting that existing works that make use of masking to parallelize computation typically use dh = 2, which corresponds to the scale and shift parameters of an affine transformation τ (Kingma et al., 2016; Papamakarios et al., 2017; Dinh et al., 2014, 2016).

2.2 Splicing the Computation Graph

Here, we discuss how to compute dimension-wise derivatives for the HollowNet architecture. This procedure allows us to obtain the exact Jacobian diagonal at a cost of only one backward pass, whereas the naïve approach would require d. A single call to reverse-mode automatic differentiation (AD), i.e., a single backward pass, can compute vector-Jacobian products:

    v^T df(x)/dx = Σ_i v_i df_i(x)/dx    (2)

By constructing v to be a one-hot vector (v_i = 1 and v_j = 0 ∀ j ≠ i), we obtain a single row of the Jacobian, df_i(x)/dx, which contains the dimension-wise derivative of the i-th dimension. Unfortunately, to obtain the full Jacobian or even D_dim f would require d AD calls.

Now suppose the computation graph of f is constructed in the manner described in Section 2.1. Let ĥ denote h but with the backward connection removed, so that AD would return ∂ĥ/∂x_j = 0 for any index j. This kind of computation graph modification can be performed with the use of stop_gradient in TensorFlow (Abadi et al., 2016) or detach in PyTorch (Paszke et al., 2017).
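To make the splicing idea concrete, the following is a minimal NumPy sketch of the HollowNet structure. The masked weights W, the toy transformer τ, and the use of central finite differences in place of reverse-mode AD are all illustrative assumptions, not the paper's implementation; holding ĥ fixed plays the role of detach/stop_gradient, and the resulting dimension-wise derivatives agree with the diagonal of the full Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
d, dh = 4, 3

# Conditioner: h_i = c_i(x_{-i}). A hollow mask zeroes the weights from x_i to h_i,
# so h_i carries no dependence on the i-th input (all h_i computed in parallel).
W = rng.normal(size=(d, d, dh))
W[np.arange(d), np.arange(d), :] = 0.0    # sever x_i -> h_i

def conditioner(x):
    # h[i] = tanh(sum_j W[i, j] * x[j]), with W[i, i] = 0.
    return np.tanh(np.einsum('ijk,j->ik', W, x))

# Transformer: f_i = tau(x_i, h_i), element-wise over the concatenation [x_i, h_i].
a = rng.normal(size=dh)
def transformer(x, h):
    return np.sin(x) * (1.0 + h @ a) + x

def f(x):
    return transformer(x, conditioner(x))

# "Splicing": hold h fixed at hhat = h(x0) -- the finite-difference analogue of
# detach/stop_gradient -- so only the diagonal (dimension-wise) paths remain.
x0 = rng.normal(size=d)
h_detached = conditioner(x0)
eps = 1e-6

ddim_spliced = np.array([
    (transformer(x0 + eps * np.eye(d)[i], h_detached)[i]
     - transformer(x0 - eps * np.eye(d)[i], h_detached)[i]) / (2 * eps)
    for i in range(d)
])

# Reference: diagonal of the full Jacobian of f, also by central differences.
ddim_full = np.array([
    (f(x0 + eps * np.eye(d)[i])[i] - f(x0 - eps * np.eye(d)[i])[i]) / (2 * eps)
    for i in range(d)
])
print(np.max(np.abs(ddim_spliced - ddim_full)))   # agree: h_i has no x_i dependence
```

The agreement holds exactly because the mask removes every path from x_i to h_i, so freezing h discards only off-diagonal (hollow) Jacobian entries.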
Let f̂(x) = τ(x, ĥ); then the Jacobian of f̂(x) contains only zeros on the off-diagonal elements:

    ∂f̂_i(x)/∂x_j = ∂τ_i(x_i, ĥ_i)/∂x_j = { ∂f_i(x)/∂x_j  if i = j;  0  if i ≠ j }    (3)

As the Jacobian of f̂(x) is a diagonal matrix, we can recover the diagonals by computing a vector-Jacobian product with a vector with all elements equal to one, denoted as 1:

    1^T ∂f̂(x)/∂x = [∂f_1(x)/∂x_1  ···  ∂f_d(x)/∂x_d]^T = D_dim f̂ = D_dim f    (4)

The higher orders D^k_dim can be obtained by k AD calls, as f̂_i(x) is only connected to the i-th dimension of x in the computation graph, so any differential operator on f̂_i only contains the dimension-wise connections. This can be written as the following recursion:

    1^T ∂(D^{k−1}_dim f̂(x))/∂x = D^k_dim f̂(x) = D^k_dim f(x)    (5)

As connections have been removed from the computation graph, backpropagating through D^k_dim f̂ would give erroneous gradients, as the connections between f_i(x) and x_j for j ≠ i were severed. To ensure correct gradients, we must reconnect ĥ and h in the backward pass:

    ∂D^k_dim f/∂w = ∂D^k_dim f̂/∂w + (∂D^k_dim f̂/∂ĥ)(∂h/∂w)    (6)

(a) Explicit solver is sufficient for nonstiff dynamics.

Figure 2: Comparison of ODE solvers as differential equation models are being trained. (a) Explicit methods such as RK4(5) are generally more efficient when the system isn't too stiff.
(b) However, when a trained dynamics model becomes stiff, predictor-corrector methods (ABM & ABM-Jacobi) are much more efficient. In difficult cases, the Jacobi-Newton iteration (ABM-Jacobi) uses significantly fewer evaluations than functional iteration (ABM). A median filter with a kernel size of 5 iterations was applied prior to visualization.

(b) Training may result in stiff dynamics.

where w is any node in the computation graph. This gradient computation can be implemented as a custom backward procedure, which is available in most modern deep learning frameworks.

Equations (4), (5), and (6) perform computations on only f̂ (shown on the left-hand sides of each equation) to compute dimension-wise derivatives of f (the right-hand sides of each equation). The number of AD calls is k, whereas naïve backpropagation would require k · d calls. We note that this process can only be applied to a single HollowNet, since for a composition of two functions f and g, D_dim(f ◦ g) cannot be written solely in terms of D_dim f and D_dim g.

In the following sections, we show how efficient access to the D_dim operator provides improvements in a number of settings, including (i) more efficient ODE solvers for stiff dynamics, (ii) solving for the density of continuous normalizing flows, and (iii) learning stochastic differential equation models by Fokker–Planck matching. Each of the following sections is stand-alone and can be read individually.

3 Efficient Jacobi-Newton Iterations for Implicit Linear Multistep Methods

Ordinary differential equations (ODEs) parameterized by neural networks are typically solved using explicit methods such as Runge-Kutta 4(5) (Hairer and Peters, 1987). However, the learned ODE can often become stiff, requiring a large number of evaluations to accurately solve with explicit methods.
Instead, implicit methods can achieve better accuracy at the cost of solving an inner-loop fixed-point iteration subproblem at every step. When the initial guess is given by an explicit method of the same order, this is referred to as a predictor-corrector method (Moulton, 1926; Radhakrishnan and Hindmarsh, 1993). Implicit formulations also show up as inverse problems of explicit methods. For instance, an Invertible Residual Network (Behrmann et al., 2018) is an invertible model that computes forward Euler steps in the forward pass, but requires solving (implicit) backward Euler for the inverse computation. Though the following describes and applies HollowNet to implicit ODE solvers, our approach is generally applicable to solving root-finding problems.

The class of linear multistep methods includes forward and backward Euler, explicit and implicit Adams, and backward differentiation formulas:

    y_{n+s} + a_{s−1} y_{n+s−1} + a_{s−2} y_{n+s−2} + ··· + a_0 y_n
        = h (b_s f(t_{n+s}, y_{n+s}) + b_{s−1} f(t_{n+s−1}, y_{n+s−1}) + ··· + b_0 f(t_n, y_n))    (7)

where the values of the state y_i and derivatives f(t_i, y_i) from the previous s steps are used to solve for y_{n+s}. When b_s ≠ 0, this requires solving a non-linear optimization problem, as both y_{n+s} and f(t_{n+s}, y_{n+s}) appear in the equation, resulting in what is known as an implicit method. Simplifying notation with y = y_{n+s}, we can write (7) as a root-finding problem:

    F(y) := y − h b_s f(y) − δ = 0    (8)

where δ is a constant representing the rest of the terms in (7) from previous steps.
Newton-Raphson can be used to solve this problem, resulting in an iterative algorithm:

    y^(k+1) = y^(k) − [∂F(y^(k))/∂y^(k)]^−1 F(y^(k))    (9)

When the full Jacobian is expensive to compute, one can approximate it using the diagonal elements. This approximation results in the Jacobi-Newton iteration (Radhakrishnan and Hindmarsh, 1993):

    y^(k+1) = y^(k) − [D_dim F(y^(k))]^−1 ⊙ F(y^(k))
            = y^(k) − [1 − h b_s D_dim f(y^(k))]^−1 ⊙ (y^(k) − h b_s f(y^(k)) − δ)    (10)

where ⊙ denotes the Hadamard product, 1 is a vector with all elements equal to one, and the inverse is taken element-wise. Each iteration requires evaluating f once. In our implementation, the fixed-point iteration is repeated until ||y^(k−1) − y^(k)|| / √d ≤ τ_a + τ_r ||y^(0)||_∞ for some user-provided tolerance parameters τ_a, τ_r.

Alternatively, when the Jacobian in (9) is approximated by the identity matrix, the resulting algorithm is referred to as functional iteration (Radhakrishnan and Hindmarsh, 1993). Using our efficient computation of the D_dim operator, we can apply Jacobi-Newton and obtain faster convergence than functional iteration while maintaining the same asymptotic computation cost per step.

3.1 Empirical Comparisons

We compare a standard Runge-Kutta (RK) solver with adaptive stepping (Shampine, 1986) and a predictor-corrector Adams-Bashforth-Moulton (ABM) method in Figure 2. A learned ordinary differential equation is used as part of a continuous normalizing flow (discussed in Section 4), and training requires solving this ordinary differential equation at every iteration. We initialized the weights to be the same for fair comparison, but the models may have slight numerical differences during training due to the different amounts of numerical error introduced by the different solvers.
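As a sanity check of the update in (10), here is a minimal NumPy sketch of a Jacobi-Newton solve for a backward-Euler-style step. The element-wise dynamics function f and its closed-form dimension-wise derivative are hypothetical stand-ins for a HollowNet and its cheap D_dim operator:

```python
import numpy as np

# Hypothetical stand-in dynamics: f(y) = -y^3 + sin(y), chosen so that its
# dimension-wise derivative d f_i / d y_i is available in closed form,
# playing the role of the D_dim operator that HollowNet provides cheaply.
def f(y):
    return -y**3 + np.sin(y)

def ddim_f(y):
    # Dimension-wise derivatives (the Jacobian diagonal).
    return -3 * y**2 + np.cos(y)

def jacobi_newton_step(y0, h, bs, delta, atol=1e-10, rtol=1e-10, max_iter=100):
    """Solve F(y) = y - h*bs*f(y) - delta = 0 with the Jacobi-Newton update (10)."""
    y = y0.copy()
    d = y.size
    for _ in range(max_iter):
        F = y - h * bs * f(y) - delta
        # Diagonal of dF/dy, inverted element-wise (Hadamard update).
        y_new = y - F / (1.0 - h * bs * ddim_f(y))
        if np.linalg.norm(y_new - y) / np.sqrt(d) <= atol + rtol * np.linalg.norm(y0, np.inf):
            return y_new
        y = y_new
    return y

y0 = np.array([0.5, -0.3, 1.2])   # predictor guess
h, bs = 0.1, 1.0                  # step size and implicit coefficient
delta = y0                        # backward-Euler: delta is the previous state
y_star = jacobi_newton_step(y0, h, bs, delta)
residual = y_star - h * bs * f(y_star) - delta
print(np.max(np.abs(residual)))   # ~0: root of F found
```

Because this toy f acts element-wise, its Jacobian is exactly diagonal and Jacobi-Newton coincides with full Newton; for a HollowNet the diagonal is only an approximation, but the fixed point of the iteration is unchanged.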
The number of function evaluations includes both the evaluations made in the forward pass and those made for solving the adjoint state in the backward pass for parameter updates, as in Chen et al. (2018). We applied Jacobi-Newton iterations for ABM-Jacobi using the efficient D_dim operator in both the forward and backward passes. As expected, if the learned dynamics model becomes too stiff, RK resorts to very small step sizes and uses almost 10 times the number of evaluations as ABM with Jacobi-Newton iterations. When implicit methods are used with HollowNet, Jacobi-Newton can help reduce the number of evaluations at the cost of just one extra backward pass.

4 Continuous Normalizing Flows with Exact Trace Computation

Continuous normalizing flows (CNF) (Chen et al., 2018) transform particles from a base distribution p(x_0) at time t_0 to another time t_1 according to an ordinary differential equation dh/dt = f(t, h(t)):

    x := x(t_1) = x_0 + ∫_{t_0}^{t_1} f(t, h(t)) dt    (11)

The change in distribution as a result of this transformation is described by an instantaneous change of variables equation (Chen et al., 2018):

    ∂ log p(t, h(t))/∂t = −Tr(∂f(t, h(t))/∂h(t)) = −Σ_{i=1}^d [D_dim f]_i    (12)

If (12) is solved along with (11) as a combined ODE system, we can obtain the density of transformed particles at any desired time t_1.

Due to requiring d AD calls to compute D_dim f for a black-box neural network f, Grathwohl et al. (2019) adopted a stochastic trace estimator (Skilling, 1989; Hutchinson, 1990) to provide unbiased estimates of log p(t, h(t)). Behrmann et al. (2018) used the same estimator and showed that the standard deviation can be quite high for single examples. Furthermore, an unbiased estimator of the log-density has limited uses.
Table 1: Evidence lower bound (ELBO) and negative log-likelihood (NLL) for static MNIST and Omniglot in nats. We outperform CNFs with stochastic trace estimates (FFJORD), but surprisingly, our improved approximate posteriors did not result in better generative models than Sylvester Flows (as indicated by NLL). Bolded estimates are not statistically significant by a two-tailed t-test with significance level 0.05.

Model                                  | MNIST -ELBO ↓ | MNIST NLL ↓  | Omniglot -ELBO ↓ | Omniglot NLL ↓
VAE (Kingma and Welling, 2013)         | 86.55 ± 0.06  | 82.14 ± 0.07 | 104.28 ± 0.39    | 97.25 ± 0.23
Planar (Rezende and Mohamed, 2015)     | 86.06 ± 0.31  | 81.91 ± 0.22 | 102.65 ± 0.42    | 96.04 ± 0.28
IAF (Kingma et al., 2016)              | 84.20 ± 0.17  | 80.79 ± 0.12 | 102.41 ± 0.04    | 96.08 ± 0.16
Sylvester (van den Berg et al., 2018)  | 83.32 ± 0.06  | 80.22 ± 0.03 | 99.00 ± 0.04     | 93.77 ± 0.03
FFJORD (Grathwohl et al., 2019)        | 82.82 ± 0.01  | —            | 98.33 ± 0.09     | —
Hollow-CNF                             | 82.37 ± 0.04  | 80.22 ± 0.08 | 97.42 ± 0.05     | 93.90 ± 0.14

For instance, the IWAE objective (Burda et al., 2015) for estimating a lower bound of the log-likelihood log p(x) = log E_{z∼p(z)}[p(x|z)] of latent variable models has the following form:

    L_IWAE-k = E_{z_1,...,z_k∼q(z|x)} [ log (1/k) Σ_{i=1}^k p(x, z_i)/q(z_i|x) ]    (13)

Flow-based models have been used as the distribution q(z|x) (Rezende and Mohamed, 2015), but an unbiased estimator of log q would not translate into an unbiased estimate of this importance-weighted objective, resulting in biased evaluations and biased gradients if used for training.
For this reason, FFJORD (Grathwohl et al., 2019) was unable to report approximate log-likelihood values for evaluation, which are standardly estimated using (13) with k = 5000 (Burda et al., 2015).

4.1 Exact Trace Computation

By constructing f in the manner described in Section 2.1, we can efficiently compute D_dim f and the trace. This allows us to exactly compute (12) using a single AD call, which is the same cost as the stochastic trace estimator. We believe that using the exact trace should reduce gradient variance during training, allowing models to converge to better local optima. Furthermore, it should help reduce the complexity of solving (12), as stochastic estimates can lead to more difficult dynamics.

4.2 Latent Variable Model Experiments

We trained variational autoencoders (Kingma and Welling, 2013) using the same setup as van den Berg et al. (2018). This corresponds to training using (13) with k = 1, also known as the evidence lower bound (ELBO). We searched over dh ∈ {32, 64, 100} and used dh = 100, as the computational cost was not significantly impacted. We used 2-3 hidden layers for the conditioner and transformer networks, with the ELU activation function. Table 1 shows that training CNFs with exact trace using the HollowNet architecture can lead to improvements on standard benchmark datasets, static MNIST and Omniglot. Furthermore, we can estimate the NLL of our models using k = 5000 for evaluating the quality of the generative model. Interestingly, although the NLLs were not improved significantly, CNFs can achieve much better ELBO values. We conjecture that the CNF approximate posterior may be slow to update, and has a strong effect of anchoring the generative model to this posterior.

4.3 Exact vs. Stochastic Continuous Normalizing Flows

We take a closer look at the effects of using an exact trace.
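As a toy illustration of the two estimators, the sketch below compares an exact trace against Hutchinson estimates with Rademacher probes. The random matrix J is a hypothetical stand-in for a Jacobian; each probe costs one vector-Jacobian product, i.e., one backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
J = rng.normal(size=(d, d))     # stands in for the Jacobian df/dh

exact_trace = np.trace(J)       # what cheap D_dim access gives in one pass

# Hutchinson estimator: E[v^T J v] = tr(J) for Rademacher v.
def hutchinson(J, n_samples):
    v = rng.choice([-1.0, 1.0], size=(n_samples, J.shape[0]))
    # Quadratic form v^T J v per sample, then averaged.
    return np.mean(np.einsum('si,ij,sj->s', v, J, v))

est_small = hutchinson(J, 10)       # unbiased but high-variance
est_large = hutchinson(J, 100000)   # variance shrinks with sample count
print(exact_trace, est_small, est_large)
```

The estimator is unbiased, but with few probes (as used per ODE step in practice) individual estimates can deviate substantially from the exact trace, which is the variance the exact computation removes.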
We compare exact and stochastic trace CNFs with the same architecture and weight initialization. Figure 3 contains comparisons of models trained using maximum likelihood on the MINIBOONE dataset preprocessed by Papamakarios et al. (2017). The comparisons between exact and stochastic trace are carried out across two network settings with 1 or 2 hidden layers. We find that not only can exact trace CNFs achieve better training NLLs, they converge faster. Additionally, the exact trace allows the ODE to be solved with a comparable or smaller number of evaluations, when comparing models with similar performance.

Figure 3: Comparison of exact trace versus stochastically estimated trace on learning continuous normalizing flows with identical initialization. Continuous normalizing flows with exact trace converge faster and can sometimes be easier to solve, shown across two architecture settings.

5 Learning Stochastic Differential Equations by Fokker–Planck Matching

Generalizing ordinary differential equations to contain a stochastic term results in stochastic differential equations (SDE), a special type of stochastic process modeled using differential operators. SDEs are applicable in a wide range of applications, from modeling stock prices (Iacus, 2011) to physical dynamical systems with random perturbations (Øksendal, 2003). Learning SDE models is a difficult task, as exact maximum likelihood is infeasible. Here we propose a new approach to learning SDE models based on matching the Fokker–Planck (FP) equation (Fokker, 1914; Planck, 1917; Risken, 1996).
The main idea is to explicitly construct a density model p(t, x), then train an SDE that matches this density model by ensuring that it satisfies the FP equation.

Let x(t) ∈ R^d follow an SDE described by a drift function f(x(t), t) and diagonal diffusion matrix g(x(t), t) in the Itô sense:

    dx(t) = f(x(t), t) dt + g(x(t), t) dW    (14)

where dW is the differential of a standard Wiener process. The Fokker–Planck equation describes how the density of this SDE at a specified location changes through time t. We rewrite this equation in terms of the D^k_dim operator:

    ∂p(t, x)/∂t = −Σ_{i=1}^d ∂/∂x_i [f_i(t, x) p(t, x)] + (1/2) Σ_{i=1}^d ∂²/∂x_i² [g²_ii(t, x) p(t, x)]
                = Σ_{i=1}^d [ −(D_dim f) p − (∇p) ⊙ f + (D²_dim diag(g)) p
                              + 2 (D_dim diag(g)) ⊙ (∇p) + (1/2) diag(g)² ⊙ (D_dim ∇p) ]_i    (15)

Writing the equation in terms of the D_dim operator makes clear where we can take advantage of efficient dimension-wise derivatives for evaluating the Fokker–Planck equation.

For simplicity, we choose a mixture of m Gaussians as the density model, which can approximate any distribution if m is large enough:

    p(t, x) = Σ_{c=1}^m π_c(t) N(x; ν_c(t), Σ_c(t))    (16)

Under this density model, the differential operators applied to p can be computed exactly. We note that it is also possible to use complex black-box density models such as normalizing flows (Rezende and Mohamed, 2015). The gradient can be easily computed with standard automatic differentiation, and the diagonal Hessian can be easily and cheaply estimated using the approach from Martens et al. (2012).
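As a quick one-dimensional numerical check of the Fokker–Planck relation in (15), the sketch below verifies that the stationary density of an Ornstein-Uhlenbeck process makes the FP right-hand side vanish. The drift, diffusion, and finite-difference derivatives here are illustrative choices, not the paper's setup (which evaluates all derivatives exactly):

```python
import numpy as np

# 1-D Ornstein-Uhlenbeck process: dx = -theta*x dt + sigma dW.
theta, sigma = 1.5, 0.8

def p(x):
    # Stationary density: N(0, sigma^2 / (2*theta)).
    var = sigma**2 / (2 * theta)
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def fp_rhs(x, eps=1e-4):
    # Fokker-Planck right-hand side, -d/dx[f p] + (1/2) d^2/dx^2[sigma^2 p],
    # evaluated with central finite differences.
    f = lambda x: -theta * x
    drift_term = -(f(x + eps) * p(x + eps) - f(x - eps) * p(x - eps)) / (2 * eps)
    diff_term = 0.5 * sigma**2 * (p(x + eps) - 2 * p(x) + p(x - eps)) / eps**2
    return drift_term + diff_term

xs = np.linspace(-2.0, 2.0, 9)
residual = fp_rhs(xs)
print(np.max(np.abs(residual)))   # ~0: dp/dt vanishes for the stationary density
```

In the training objective that follows, the same residual (between ∂p_φ/∂t and the FP right-hand side) is what is driven toward zero, but with the derivatives supplied exactly by the density model and the D_dim operators.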
HollowNet can be used to parameterize f and g so that the D_dim and D²_dim operators in the right-hand side of (15) can be efficiently evaluated, though for these initial experiments we used simple 2-D multilayer perceptrons.

Training. Let θ be the parameters of the SDE model and φ be the parameters of the density model. We seek to perform maximum likelihood on the density model p while simultaneously learning an SDE model that satisfies the Fokker–Planck equation (15) applied to this density. As such, we propose maximizing the objective

    E_{t,x_t∼p_data}[log p_φ(t, x_t)] − λ E_{t,x_t∼p_data}[ |∂p_φ(t, x_t)/∂t − FP(t, x_t | θ, φ)| ]    (17)

where FP(t, x_t | θ, φ) refers to the right-hand side of (15), and λ is a non-negative weight that is annealed to zero by the end of training. Having a positive λ value regularizes the density model to be closer to the SDE model, which can help guide the SDE parameters at the beginning of training.

This purely functional approach has multiple benefits:

1. No reliance on finite-difference approximations. All derivatives are evaluated exactly.
2. No need for sequential simulations. All observations (t, x_t) can be trained on in parallel.
3. Having access to a model of the marginal densities allows us to approximately sample trajectories from the SDE starting from any time.

Limitations.
We note that this process of matching a density model cannot be used to uniquely identify stationary stochastic processes: when the marginal densities are the same across time, no information regarding the individual sample trajectories is present in the density model. Previously, Ait-Sahalia (1996) tried a similar approach where an SDE is trained to match non-parametric kernel density estimates of the data; however, due to the stationarity assumption inherent in kernel density estimation, Pritsker (1998) showed that kernel estimates were not sufficiently informative for learning SDE models. While the inability to distinguish stationary SDEs is also a limitation of our approach, the benefits of FP matching are appealing, and it should be able to learn the correct trajectory of the samples when the data is highly non-stationary.

5.1 Alternative Approaches

A wide range of parameter estimation approaches have been proposed for SDEs (Prakasa Rao, 1999; Sørensen, 2004; Kutoyants, 2013). Exact maximum likelihood is difficult except for very simple models (Jeisman, 2006). An expensive approach is to directly simulate the Fokker–Planck partial differential equation, but approximating the differential operators in (15) with finite differences is intractable in more than two or three dimensions. A related approach to ours is pseudo-maximum likelihood (Florens-Zmirou, 1989; Ozaki, 1992; Kessler, 1997), where the continuous-time stochastic process is discretized. The distribution of a trajectory of observations log p(x(t_1), ..., x(t_N)) is decomposed into conditional distributions:

    Σ_{i=1}^N log p(x(t_i) | x(t_{i−1})) ≈ Σ_{i=1}^N log N(x(t_i); x(t_{i−1}) + f(x(t_{i−1})) Δt_i, g²(x(t_{i−1})) Δt_i)    (18)

with the mean and variance given by the two arguments of N, where we've used the Markov property of SDEs, and N denotes the density of a Normal distribution with the given mean and variance.
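A minimal NumPy sketch of the Euler pseudo-log-likelihood in (18), with hypothetical toy drift and diffusion functions; the trajectory is generated by the noiseless Euler map of the true drift, so the true drift attains the higher pseudo-likelihood:

```python
import numpy as np

# Euler pseudo-log-likelihood (Eq. 18): a sum of Gaussian transition densities
# log N(x_i; x_{i-1} + f(x_{i-1}) * dt, g(x_{i-1})^2 * dt).
def pseudo_loglik(x, t, f, g):
    dt = np.diff(t)
    mean = x[:-1] + f(x[:-1]) * dt
    var = g(x[:-1])**2 * dt
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x[1:] - mean)**2 / (2 * var))

# Toy drift/diffusion (hypothetical choices for illustration).
f_true = lambda x: -2.0 * x
g = lambda x: 0.3 * np.ones_like(x)

# A trajectory generated exactly by the Euler map of the true drift (no noise),
# so transitions match the true drift's predicted means exactly.
t = np.linspace(0.0, 1.0, 51)
x = np.empty_like(t)
x[0] = 1.0
for i in range(1, len(t)):
    x[i] = x[i - 1] + f_true(x[i - 1]) * (t[i] - t[i - 1])

f_wrong = lambda x: +2.0 * x
ll_true = pseudo_loglik(x, t, f_true, g)
ll_wrong = pseudo_loglik(x, t, f_wrong, g)
print(ll_true > ll_wrong)   # True: the true drift fits the transitions better
```

The sketch also makes the limitation visible: the approximation quality hinges on Δt_i being small, which is exactly what fails when observations are sparse.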
The conditional distributions are generally unknown, and the approximation made in (18) is based on Euler discretization (Florens-Zmirou, 1989; Yoshida, 1992). Unlike our approach, the pseudo-maximum likelihood approach relies on a discretization scheme that may not hold when the observations are sparse, and is also not parallelizable across time.

5.2 Experiments on Fokker–Planck Matching

We verify the feasibility of Fokker–Planck matching and compare to the pseudo-maximum likelihood approach. We construct a synthetic experiment where a pendulum is initialized randomly at one of two modes. The pendulum's velocity changes with gravity and is randomly perturbed by a diffusion process. This results in two states, a position and a velocity, following the stochastic differential equation

    d (p, v)^T = (v, −2 sin(p))^T dt + diag(0, 0.2) dW.    (19)

This problem is multimodal and exhibits trends that are difficult to model. By default, we use 50 equally spaced observations for each sample trajectory. We use 3-hidden-layer deep neural networks to parameterize the SDE and density models with the Swish nonlinearity (Ramachandran et al., 2017), and use m = 5 Gaussian mixtures.

Figure 4: Fokker–Planck Matching correctly learns the overall dynamics of an SDE. (Panels: Data; Learned Density; Samples from Learned SDE.)

The result after training for 30000 iterations is shown in Figure 4. The density model correctly recovers the multimodality of the marginal distributions, including at the initial time, and the SDE model correctly recovers the sinusoidal behavior of the data.
The behavior is more erratic where the density model exhibits imperfections, but the overall dynamics were recovered successfully.

A caveat of pseudo-maximum likelihood is its reliance on discretization schemes that do not hold for observations spaced far apart. Instead of using all available observations, we therefore randomly sample a small percentage of them. For quantitative comparison, we report the mean absolute error of the drift f and diffusion g values over 10000 sampled trajectories. Figure 5 shows that when the observations are sparse, pseudo-maximum likelihood has substantial error due to its finite-difference assumption, whereas the Fokker–Planck matching approach is unaffected by the sparsity of the observations.

Figure 5: Fokker–Planck matching outperforms pseudo-maximum likelihood in the sparse-data regime, and its performance is independent of the observation intervals. Error bars show the standard deviation across 3 runs.

6 Conclusion

We propose a neural network construction, along with a computation graph modification, that allows us to obtain "dimension-wise" k-th derivatives with only k evaluations of reverse-mode AD, whereas naïve automatic differentiation would require k · d evaluations. Dimension-wise derivatives are useful for modeling various differential equations, as differential operators frequently appear in such formulations. We show that parameterizing differential equations using this approach allows more efficient solving when the dynamics are stiff, provides a way to scale up Continuous Normalizing Flows without resorting to stochastic likelihood evaluations, and gives rise to a functional approach to parameter estimation for SDE models through matching the Fokker–Planck equation.

References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al.
TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

Yacine Ait-Sahalia. Testing continuous-time models of the spot interest rate. The Review of Financial Studies, 9(2):385–426, 1996.

Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.

George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Danielle Florens-Zmirou. Approximate discrete-time schemes for statistics of diffusion processes. Statistics: A Journal of Theoretical and Applied Statistics, 20(4):547–557, 1989.

Adriaan Daniël Fokker. Die mittlere Energie rotierender elektrischer Dipole im Strahlungsfeld. Annalen der Physik, 348(5):810–820, 1914.

Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud.
FFJORD: Free-form continuous dynamics for scalable reversible generative models. International Conference on Learning Representations, 2019.

Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and techniques of algorithmic differentiation, volume 105. SIAM, 2008.

Hairer and Peters. Solving ordinary differential equations I. Springer Berlin Heidelberg, 1987.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.

Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.

Stefano M Iacus. Option pricing and estimation of financial models with R. John Wiley & Sons, 2011.

Joseph Ian Jeisman. Estimation of the parameters of stochastic differential equations. PhD thesis, Queensland University of Technology, 2006.

Mathieu Kessler. Estimation of an ergodic diffusion from discrete observations. Scandinavian Journal of Statistics, 24(2):211–229, 1997.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

Yury A Kutoyants. Statistical inference for ergodic diffusion processes. Springer Science & Business Media, 2013.

Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations.
IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.

Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-net: Learning PDEs from data. arXiv preprint arXiv:1710.09668, 2017.

James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the Hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464, 2012.

Forest Ray Moulton. New methods in exterior ballistics. 1926.

Bernt Øksendal. Stochastic differential equations. In Stochastic Differential Equations, pages 65–84. Springer, 2003.

Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Tohru Ozaki. A bridge between nonlinear time series models and nonlinear stochastic dynamical systems: a local linearization approach. Statistica Sinica, pages 113–135, 1992.

George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Max Planck. Über einen Satz der statistischen Dynamik und seine Erweiterung in der Quantentheorie. Reimer, 1917.

BLS Prakasa Rao. Statistical inference for diffusion type processes. Kendall's Lib. Statist., 8, 1999.

Matt Pritsker. Nonparametric density estimation and tests of continuous time interest rate models. The Review of Financial Studies, 11(3):449–487, 1998.

Krishnan Radhakrishnan and Alan C Hindmarsh. Description and use of LSODE, the Livermore solver for ordinary differential equations. 1993.

Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations. Journal of Machine Learning Research, 19(25):1–24, 2018.
URL http://jmlr.org/papers/v19/18-046.html.

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Hannes Risken. Fokker-Planck equation. In The Fokker-Planck Equation, pages 63–95. Springer, 1996.

John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pages 3528–3536, 2015.

Lawrence F Shampine. Some practical Runge-Kutta formulas. Mathematics of Computation, 46(173):135–150, 1986.

John Skilling. The eigenvalues of mega-dimensional matrices. In Maximum Entropy and Bayesian Methods, pages 455–466. Springer, 1989.

Helle Sørensen. Parametric inference for diffusion processes observed at discrete points in time: a survey. International Statistical Review, 72(3):337–354, 2004.

Jonathan Tompson, Kristofer Schlachter, Pablo Sprechmann, and Ken Perlin. Accelerating Eulerian fluid simulation with convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3424–3433. JMLR.org, 2017.

Rianne van den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649, 2018.

Nakahiro Yoshida. Estimation for diffusion processes from discrete observation. Journal of Multivariate Analysis, 41(2):220–242, 1992.