{"title": "ANODEV2: A Coupled Neural ODE Framework", "book": "Advances in Neural Information Processing Systems", "page_first": 5151, "page_last": 5161, "abstract": "It has been observed that residual networks can be viewed as the explicit Euler discretization of an Ordinary Differential Equation (ODE). This observation motivated the introduction of so-called Neural ODEs, in which other discretization schemes and/or adaptive time stepping techniques can be used to improve the performance of residual networks. Here, we propose \\OURS, which extends this approach by introducing a framework that allows ODE-based evolution for both the weights and the activations, in a coupled formulation. Such an approach provides more modeling flexibility, and it can help with generalization performance. We present the formulation of \\OURS, derive optimality conditions, and implement the coupled framework in PyTorch. We present empirical results using several different configurations of \\OURS, testing them on the CIFAR-10 dataset. We report results showing that our coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, compared to the baseline ResNet network and the recently-proposed Neural ODE approach.", "full_text": "ANODEV2: A Coupled Neural ODE Framework\n\nTianjun Zhang1\u21e4 Zhewei Yao1\u21e4 Amir Gholami1\u21e4\n\nKurt Keutzer1 Joseph Gonzalez1 George Biros2 Michael W. Mahoney1,3\n1University of California at Berkeley, 2University of Texas at Austin, 3ICSI\n\n{tianjunz, zheweiy, amirgh, keutzer, jegonzal, and mahoneymw}@berkeley.edu, biros@ices.utexas.edu\n\nAbstract\n\nIt has been observed that residual networks can be viewed as the explicit Euler dis-\ncretization of an Ordinary Differential Equation (ODE). This observation motivated\nthe introduction of so-called Neural ODEs, which allow more general discretization\nschemes with adaptive time stepping. 
Here, we propose ANODEV2, which is an extension of this approach that allows evolution of the neural network parameters, in a coupled ODE-based formulation. The Neural ODE method introduced earlier is in fact a special case of this new framework. We present the formulation of ANODEV2, derive optimality conditions, and implement the coupled framework in PyTorch. We present empirical results using several different configurations of ANODEV2, testing them on multiple models on CIFAR-10. We report results showing that this coupled ODE-based framework is indeed trainable, and that it achieves higher accuracy, as compared to the baseline models as well as the recently-proposed Neural ODE approach.

1 Introduction

Residual networks [1, 2] have enabled the training of very deep neural networks (DNNs). Recent work has shown an interesting connection between residual blocks and ODEs, showing that a residual network can be viewed as a discretization of a continuous ODE operator [3, 4, 5, 6, 7, 8]. These formulations are commonly called Neural ODEs, and here we follow the same convention. Neural ODEs provide a general framework that connects discrete DNNs to continuous dynamical systems theory, as well as to the discretization and optimal control of ODEs, all subjects with very rich theory. A basic Neural ODE formulation and its connection to residual networks (for a single block in a network) is the following:

    z_1 = z_0 + f(z_0, θ)                          ResNet,               (1a)
    z(1) = z(0) + ∫_0^1 f(z(t), θ) dt              ODE,                  (1b)
    z(1) = z(0) + f(z_0, θ)                        ODE forward Euler.    (1c)

Here, z_0 is the input to the network and z_1 is the output activation; θ is the vector of network weights (independent of time); and f(z, θ) is the nonlinear operator defined by this block. (Here we have written the ODE dz/dt = f(z, θ) in terms of its solution at t = 1.) 
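The correspondence in Eq. 1 can be sketched in a few lines of NumPy. Here f is a toy scalar operator standing in for a real residual block, not the paper's architecture; with n_steps = 1 the Euler solve reduces exactly to a residual update, and larger n_steps gives the Neural ODE generalization.

```python
import numpy as np

def f(z, theta):
    # Toy stand-in for a residual block's nonlinear operator f(z, theta).
    return np.tanh(theta * z)

def ode_block(z0, theta, n_steps):
    """Forward-Euler discretization of dz/dt = f(z, theta) on t in [0, 1]."""
    z, dt = z0.copy(), 1.0 / n_steps
    for _ in range(n_steps):
        z = z + dt * f(z, theta)
    return z

z0, theta = np.array([1.0, -0.5]), 0.3
resnet_out = z0 + f(z0, theta)         # Eq. 1a: classic residual block
euler_out = ode_block(z0, theta, 1)    # Eq. 1c: a single forward-Euler step
assert np.allclose(resnet_out, euler_out)
```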
We can see that a single step of forward-Euler discretization of the ODE is identical to a traditional residual block. Alternatively, we could use a different time-stepping scheme or, more interestingly, use more time steps. Once the connection to ODEs was identified, several groups incorporated the Neural ODE structure in neural networks and evaluated its performance on several different learning tasks.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A major challenge with training Neural ODEs is that backpropagating through ODE layers requires storage of all the intermediate activations (i.e., z) in time. In principle, the memory footprint of ODE layers has a cost of O(Nt) (where Nt is the number of time steps used to solve the ODE layer), which is prohibitive. The recent work of [8] proposed an adjoint-based method, with a training strategy that required only storage of the activation at the end of the ODE layer. All the intermediate activations were then "re-computed" by solving the ODE layers backwards. However, it has been recently shown that such an approach could lead to incorrect gradients, due both to numerical instability and also to inconsistencies that relate to optimizing infinite-dimensional operators (the so-called Discretize-Then-Optimize vs. Optimize-Then-Discretize issue) [9]. Moreover, and importantly, it was observed that using other discretization schemes such as RK2 or RK4, or using more time steps, does not affect the generalization performance of the model (even with the DTO approach).

In this paper, building on the latter approach of [9], we propose ANODEV2, a more general Neural ODE framework that addresses this problem. 
ANODEV2 allows the evolution of both weights and activations by a coupled system of ODEs:

    z(1) = z(0) + ∫_0^1 f(z(t), θ(t)) dt                      "parent network",
    θ(t) = θ(0) + ∫_0^t q(θ(t), p) dt,   θ(0) = θ_0           "weight network".      (2)

Here, q(·) is a nonlinear operator (essentially controlling the dynamics of the network parameters in time); θ_0 and p are the corresponding parameters for the weight network. Our approach allows θ to be time dependent: θ(t) is parameterized by the learnable dynamics of dθ/dt = q(θ(t), p), which in turn is parameterized by θ_0 and p. In other words, instead of optimizing for a constant θ, we optimize for θ_0 and p. During inference, both the weights θ(t) and the activations z(t) are forward-propagated in time by solving Eq. 2. Observe that if we set q = 0, then we recover the Neural ODE approach proposed by [8]. Eq. 2 replaces the problem of designing appropriate neural network blocks (f) with the problem of choosing an appropriate function (q) in an ODE to model the changes of the parameters θ (the weight network).

In summary, our main contributions are the following.

• We provide a general framework that extends Neural ODEs to a system of coupled ODEs, which allows coupled evolution of both model parameters and activations. 
This coupled formulation addresses a known challenge with Neural ODEs, namely that using more time steps or different discretization schemes does not affect the model's generalization performance [9].

• We derive the optimality conditions that specify how backpropagation should be performed for the coupled ODE formulation, by imposing the standard Karush–Kuhn–Tucker (KKT) conditions. In particular, we implement the corresponding Discretize-Then-Optimize (DTO) approach, along with the checkpointing scheme presented in [9].

• We test the framework using multiple different residual models on CIFAR-10, considering different coupled formulations. In particular, we show examples illustrating how a biologically motivated reaction-diffusion-advection ODE could be used to model the evolution of the neural network parameters.

• We have open-sourced the implementation of the coupled framework in PyTorch, which allows general evolution operators (and not just reaction-diffusion-advection). In fact, some earlier works such as HyperNetworks are special cases of ANODEV2, and they can be implemented in this framework. The code is available in [10].

There is a rich literature on neural evolution research [11, 12, 13, 14, 15, 16, 17]. Several approaches similar to ours have been taken in the line of evolutionary computing, where an auxiliary "child" network is used to generate the parameters for a "parent" network. This approach permits restricting the effective depth that the activations must go through, since the parent network could have a smaller weight space than the child network. One example is HyperNEAT [18], which uses "Compositional Pattern Producing Networks" (CPPNs) to evolve the model parameters [19, 20]. A similar approach using "Compressed Weight Search" was proposed in [21]. A follow-up work extended this approach by using differentiable CPPNs [22]. 
The authors show that neural network parameters can be encoded through a fully connected architecture. Another seminal work in this direction is [23, 24], where an auxiliary network learns to produce "context-aware" weights in a recurrent neural network model. A similar recent approach is taken in HyperNetworks [25], in which the model parameters are evolved through an auxiliary learnable neural network. This approach is a special case of the above framework, which can be derived by using a single time step discretization of Eq. 2, with a neural network for the evolution operator (denoted by q and introduced in the next section). Our framework is a generalization of these evolutionary algorithms, and it provides more flexibility for modeling the evolution of the model parameters in time. For instance, we will show how biologically motivated diffusion-reaction-advection operators can be used for the evolution operator q, with a negligible increase in the model parameter size.

2 Methodology

In this section, we discuss the formulation of the coupled ODE-based neural network model described above, and we derive the corresponding optimality conditions. For a typical learning problem, the goal is to minimize the empirical risk over a set of training examples. Given a loss function ℓ_i, where i indexes the training sample, we seek to find weights, θ ∈ R^d, such that:

    min_θ  (1/N) Σ_{i=1}^N ℓ_i(z_i(θ)) + R(θ),        (3)

where R is a regularization operator and N is the number of training samples. The loss function depends implicitly on θ through the network activation vector z_i. 
This problem is typically solved using Stochastic Gradient Descent (SGD) and backpropagation to compute the gradient of z_i with respect to θ.

2.1 Neural ODE

Consider the following notation for a residual block: z_1 = z_0 + f(z_0; θ), where z_0 is the input activation, f(·) is the neural network kernel (e.g., comprising a series of convolutional blocks with nonlinear or linear activation functions), and z_1 is the output activation. As discussed above, an alternative view of a residual network is the following continuous-time formulation: dz/dt = f(z(t); θ), with z(t = 0) = z_0 and z(t = 1) = z_1 (we will use both z(t) and z_t to denote the activation at time t). In the ODE-based formulation, this neural network has a continuous depth. In this case, we need to solve the following constrained optimization problem (Neural ODE):

    min_θ  (1/N) Σ_{i=1}^N ℓ_i(z_i(1)) + R(θ)
    subject to:  dz/dt = f(z(t), θ),   z(0) = z_0.        (4)

Note that in this formulation the neural network parameters are stale (constant) in time. In fact, it has been observed that using adaptive time stepping or higher-order discretization methods such as Runge-Kutta does not result in any gains in generalization performance in the above framework [9]. To address this, we extend Neural ODEs by considering a system of coupled ODEs, in which the model parameters as well as the activations evolve in time. In fact, this formulation is slightly more general than what we described in the introduction. For this reason, we introduce an auxiliary dynamical system for w(t), which we use to define θ. 
In particular, we propose the following formulation:

    min_{p, w_0}  J(z(1)) = (1/N) Σ_{i=1}^N ℓ_i(z_i(1)) + R(w_0, p),        (5a)

    dz/dt = f(z(t), θ(t)),   z(0) = z_0        "Activation ODE",            (5b)
    ∂w/∂t = q(w; p),   w(0) = w_0              "Evolution ODE",             (5c)
    θ(t) = ∫_0^t K(t − τ) w(τ) dτ.                                          (5d)

Note that here θ(t) is a function of time, and it is parameterized by the whole dynamics of w(t) and a time convolution kernel K (which in the simplest form could be a Dirac delta function, so that θ(t) = w(t)). Also, q(w, p) can be a general function, e.g., another neural network, a linear operator, or even a discretized Partial Differential Equation (PDE) based operator. The latter is perhaps useful if we consider θ(t) as a function θ(u, t), where u parameterizes the signal space (e.g., the 2D pixel space for images). This formulation allows for rich variations of θ(t), while using a lower-dimensional parameterization: notice that implicitly we have θ(t) = θ(w_0, p, t). Also, this formulation permits novel regularization techniques: instead of regularizing θ(t), we can regularize w_0 and p.

A crucial question is: how should one perform backpropagation for this formulation? It is instructive to compute the actual derivatives to illustrate the structure of the problem. To derive the optimality conditions for this constrained problem, we need to first form the Lagrangian and derive the so-called Karush–Kuhn–Tucker (KKT) conditions:

    L = J(z(1)) + ∫_0^1 α(t) · (dz/dt − f(z(t), θ(t))) dt
      + ∫_0^1 β(t) · (∂w/∂t − q(w; p)) dt
      + ∫_0^1 γ(t) · (θ(t) − ∫_0^t K(t − τ) w(τ) dτ) dt.        (6)

Here, α(t), β(t), and γ(t) are the corresponding adjoint variables (Lagrange multiplier vector functions) for the constraints in Eq. 5. 
The solution to the optimization problem of Eq. 5 can be found by computing the stationary points of the Lagrangian (the KKT conditions), i.e., the gradients of L with respect to z(t), w(t), θ(t), p, w_0 and the adjoints α(t), β(t), γ(t). The variations of L with respect to the three adjoint functions simply recover the ODE constraints in Eq. 5. The remaining variations of L are the most interesting and are given below (see Appendix D for additional discussion on the derivation):

    ∂J(z(1))/∂z_1 + α(1) = 0,   −∂α/∂t − (∂f/∂z)^T α(t) = 0;                        (∂L_z)      (7a)
    −(∂f/∂θ)^T α(t) + γ(t) = 0;                                                      (∂L_θ)      (7b)
    −∂β/∂t − (∂q/∂w)^T β(t) − ∫_0^1 H(τ − t) K^T(τ − t) γ(τ) dτ = 0,  β(1) = 0;     (∂L_w)      (7c)
    −β(0) + ∂R/∂w_0 = g_{w_0};                                                       (∂L_{w_0})  (7d)
    ∂R/∂p − ∫_0^1 (∂q/∂p)^T β(t) dt = g_p;                                           (∂L_p)      (7e)

where H(t) is the scalar Heaviside function. To compute the gradients g_p and g_{w_0}, we proceed as follows. Given w_0 and p, we forward-propagate w_0 to compute w(t) and then θ(t). Using θ(t), we can compute the activations z(t). We then solve the adjoint equations for α(t), γ(t), and β(t), in the order of Eq. 7a–7c. Finally, the gradients of the loss function with respect to w_0 (g_{w_0}) and p (g_p) are given by the last two equations. Notice that if we set q = 0, we recover the optimality conditions for a Neural ODE without any dynamics for the model parameters, which is the model presented in [8]. The benefit of this more general framework is that we can encapsulate the time dynamics of the model parameters without increasing the memory footprint of the model. In fact, this approach only requires storing the initial condition for the parameters, which is parameterized by w_0, along with the parameters of the control operator q, which are denoted by p. 
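The forward propagation just described (w_0 → w(t) → θ(t) → z(t)) can be sketched with a forward-Euler loop. This is an illustration only: the operators f and q below are toy placeholders rather than the paper's networks, and the kernel K is taken as a Dirac delta so that θ(t) = w(t).

```python
import numpy as np

def coupled_forward(z0, w0, p, f, q, n_steps=10):
    """Forward-Euler integration of the coupled system in Eq. 5,
    with K chosen as a Dirac delta so that theta(t) = w(t)."""
    z, w, dt = z0, w0, 1.0 / n_steps
    for _ in range(n_steps):
        theta = w                    # Eq. 5d with K = Dirac delta
        z = z + dt * f(z, theta)     # activation ODE (Eq. 5b)
        w = w + dt * q(w, p)         # evolution ODE (Eq. 5c)
    return z

f = lambda z, theta: np.tanh(theta * z)   # toy activation operator
q = lambda w, p: p * w                    # toy "reaction-only" evolution
z1 = coupled_forward(np.ones(3), w0=0.5, p=-0.2, f=f, q=q)
```

Setting q = 0 freezes w at w_0 and recovers the plain Neural ODE forward pass of [8].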
As we show in the results section, the latter can have a negligible memory footprint, while still allowing a rich representation of the model parameter dynamics.

PDE-inspired formulation. There are several different models for q(w, p), the evolution function for the weights of the convolutional network. One possibility is to use a convolutional block (resembling a recurrent network). However, this can increase the number of parameters significantly. Inspired by Turing's reaction-diffusion partial differential equation models for pattern formation, we instead view a convolutional filter as a time-varying pattern (where time here represents the depth of the network) [12]. To illustrate this, we consider a PDE-based model for the control operator q, as follows:

    dw/dt = σ(τ Δw + ν · ∇w + ρ w),        (8)

where τ controls the diffusion (Δw), ν controls the advection (∇w), ρ controls the reaction (w), and σ is a nonlinear activation (such as sigmoid or tanh). We view the weights w as a time-series signal, starting from the initial signal, w(0), and evolving in time to produce w(1). In fact, one can show that the above formulation can evolve the parameters to any weights, if there exists a diffeomorphic transformation between the two distributions (i.e., if there exists a velocity field ν such that w(1) is the solution of Eq. 8 with initial condition w(0) [26]). Although this operator is mainly used here as an example control block (i.e., ANODEV2 is not limited to this model), the diffusion-reaction-advection operator can capture interesting dynamics for the model parameters. For instance, consider a single Gaussian operator for a convolutional kernel, centered in the middle with unit variance. A diffusion operator can simulate multiple different normal distributions with different variances in time. Note that this requires storing only a single diffusion parameter (i.e., τ). 
Another interesting operator is the advection operator, which models species transport. For the Gaussian case, this operator could, for instance, transport the center of the Gaussian to positions other than the center of the convolution. Finally, the reaction operator allows growth or decay of the intensity of the convolution filters. The full diffusion-reaction-advection operator can encapsulate more complex dynamics of the neural network parameters in time. A synthetic example is shown in Figure 3 in the appendix, and a real example (a 5×5 convolutional kernel of AlexNet) is shown in Figure 6 in the appendix.

2.2 Two methods used in this paper

We use two different coupling configurations of ANODEV2, as described below.

Configuration One. We use multiple time steps to solve for both z and θ in the network, instead of just one time step as in the original ResNet. The discretized solution of Eq. 10 in the appendix is then as follows:

    z_{t_0+Δt} = z_{t_0} + Δt f(z_{t_0}; θ_{t_0});   θ_{t_0+Δt} = F^{−1}( exp((−τk² + iνk + ρ)Δt) F(θ_{t_0}) ),        (9)

where Δt is the discretization time step, and F is the Fast Fourier Transform (FFT) operator (for the derivation, see the appendix). In this setting, we alternately update the values of z and θ according to Eq. 9. Hence, the computational cost of an ODE block will be roughly Nt times more expensive than that of the original residual block (the same as in [8]). This network can be viewed as applying Nt different residual blocks in the network, but with neural network weights that evolve in time. Note that this configuration does not increase the parameter size of the original network, except for the slight overhead of τ, ν, and ρ.

The first configuration is shown at the top of Figure 1, where the model parameters and activations are solved with the same discretization. 
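The FFT-based weight update of Eq. 9 can be sketched in 1-D as follows. This is a minimal illustration assuming a periodic signal with unit grid spacing, with coefficients named after the diffusion, advection, and reaction terms of Eq. 8 (real convolution kernels are 2-D, where the same update applies with a 2-D FFT); the nonlinearity is dropped, matching the linear analytic solution.

```python
import numpy as np

def evolve_kernel(w, dt, tau, nu, rho):
    """One analytic step of dw/dt = tau*w_xx + nu*w_x + rho*w for a 1-D
    periodic signal, computed in Fourier space as in Eq. 9."""
    k = 2 * np.pi * np.fft.fftfreq(w.shape[0])   # spatial frequencies (unit spacing)
    symbol = np.exp((-tau * k**2 + 1j * nu * k + rho) * dt)
    return np.fft.ifft(symbol * np.fft.fft(w)).real

# A Gaussian-like 1-D "filter" diffusing, drifting, and decaying over depth.
w0 = np.exp(-0.5 * (np.arange(9) - 4.0) ** 2)
w1 = evolve_kernel(w0, dt=0.2, tau=0.05, nu=0.3, rho=-0.1)
```

Only the scalar coefficients (tau, nu, rho) need to be stored, which is why this evolution adds almost no parameters.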
This is similar to the Neural ODE framework of [8], except that the model parameters are evolved in time for subsequent steps, whereas in [8] the same model parameters are applied to the activations.² The dynamics of the model parameters are illustrated by the different colors used for the convolution kernels at the top of Figure 1. This configuration is equivalent to using the Dirac delta function for the kernel K in Eq. 5d.

Configuration Two. ANODEV2 supports different coupling configurations between the dynamics of the activations and the model parameters. For example, it is possible not to restrict the dynamics of θ and z to align in time, which is the second configuration that we consider. Here, we allow the model parameters to evolve and only apply them to the activations after a fixed number of time steps. For instance, consider the Gaussian example illustrated in Figure 3. In configuration one, a residual block is created for each of the three time steps. However, in configuration two, we only apply the first and last time evolutions of the parameters (i.e., we only use w_0 and w_1 to apply to the activations). This configuration allows sufficient time for the model parameters to evolve and, importantly, limits the depth of the network that the activations go through (see the bottom of Figure 1). In this case, the depth of the network is increased by a factor of two, instead of Nt, as in the first configuration (which is the approach used in [8, 9]). Both configurations are supported in ANODEV2, and we present preliminary results for both settings.

3 Results

In this section, we report the results of ANODEV2 for the two configurations discussed in Section 2, on the CIFAR-10 dataset, which consists of 60,000 32×32 colour images in 10 classes. 
The framework is developed as a library in PyTorch and uses the checkpointing method proposed in [9], along with the discretize-then-optimize formulation of the optimality conditions shown in Eq. 7.

²To be precise, the possibility of using time-varying parameters was discussed in [8], but the experiments were limited to concatenating time to the channels. Our approach here is more general, as it allows the NN parameters to evolve in time.

Figure 1: Illustration of the different configurations in ANODEV2. The top figure shows configuration one, where both the activations and the weights θ are evolved through a coupled system of ODEs. During inference, we solve both of these ODEs forward in time. Blue squares in the figure represent activations with multiple channels; the orange bars represent the convolution kernels. The convolution weights θ are computed by solving an auxiliary ODE. The bottom figure shows configuration two, where the weights are first evolved in time before being applied to the activations.

We test ANODEV2 on AlexNet with residual connections, as well as on two different ResNets. See Appendix B and Appendix A.1 for the details of the model architectures and training settings. We consider the two coupling configurations between the evolution of the activations and the model parameters, as discussed next.

3.1 Configuration One

We first start with configuration one. 
In this con\ufb01guration, each time step corresponds to adding a\nnew residual block in the network, as illustrated in Figure 1 (top). The results shown in Table 1. All\nthe experiments were repeated \ufb01ve times, and we report both the min/max accuracy as well as the\naverage of these \ufb01ve runs.\nNote that the coupled ODE based approach outperforms the baseline in all of the three statistical\nproperties above (i.e., min/max/average accuracy). For example, on ResNet-10 the coupled ODE\nnetwork achieves 89.04% average test accuracy, as compared to 88.10% of baseline, which is 0.94%\nbetter. Meanwhile, a noticeable observation is that the minimum performance of the coupled ODE\nbased network is comparable or even better than the maximum performance of baseline. The coupled\nODE based AlexNet has 88.59% minimum accuracy, which is 1.44% higher than the best performance\nof baseline out of \ufb01ve runs. Hence, the generalization performances of the coupled ODE based\nnetwork are consistently better than those of the baseline. It is important to note that the model\nparameter size of the coupled ODE approach in ANODEV2 is the same as that of the baseline.\nThis is because the size of the control parameters p is negligible. A comparison discussion is shown\nin section 4.1.\n\n6\n\n\fTable 1: Results for using N t = 5 time steps to solve z and \u2713 in neural network with con\ufb01guration\none. We tested on AlexNet, ResNet-4, and ResNet-10. We get 1.75%, 1.16% and 0.94% improvement\nover the baseline respectively. 
Note that the model sizes of ANODEV2 and the baseline are comparable.

              AlexNet                      ResNet-4                     ResNet-10
              Min / Max         Avg       Min / Max         Avg        Min / Max         Avg
Baseline      86.84% / 87.15%   87.03%    76.47% / 77.35%   76.95%     87.79% / 88.52%   88.10%
ANODEV2       88.59% / 88.96%   88.78%    77.27% / 78.58%   78.11%     88.67% / 89.39%   89.04%
Imp.          1.75% / 1.81%     1.75%     0.80% / 1.23%     1.16%      0.88% / 0.87%     0.94%

The dynamics of how the neural network parameters evolve in time are illustrated in Figure 6, where we extract the first 5×5 convolution of AlexNet and show how it evolves in time. Here, Time represents how long θ has evolved, i.e., Time = 0 shows θ(t = 0) and Time = 1 shows θ(t = 1). It can be clearly seen that the coupled ODE-based method encapsulates more complex dynamics of θ in time. Similar illustrations for ResNet-4 and ResNet-10 are shown in Figures 4 and 5 in the appendix.

3.2 Configuration Two

Here, we test the second configuration, in which the evolution of the parameters and of the activations can use different time steps. This means the parameters are applied only after a certain number of evolution time steps, rather than at every time step as in the first configuration. This effectively reduces the depth of the network and the computational cost, and it allows sufficient time for the parameters to evolve, instead of naively applying them at each time step. An illustration of this configuration is shown in Figure 1 (bottom). The results on AlexNet, ResNet-4, and ResNet-10 are shown in Table 2, where we again report the min/max and average accuracy over five runs. As in the previous setting (configuration one), the coupled ODE-based network performs better in all cases. The minimum performance of the coupled ODE-based network is still comparable to, or even better than, the maximum performance of the baseline. 
Although the overall performance in this setting is slightly worse than in the previous configuration, the computational cost is much lower, due to the smaller effective depth of the network that the activations go through.

Table 2: Results for using Nt = 2 time steps to solve z in the neural network and Nt = 10 to solve θ in the ODE block (configuration two). ANODEV2 achieves 1.23%, 0.78%, and 0.83% improvement over the baseline, respectively. Note that the model size is comparable to the baseline in Table 1.

              AlexNet                      ResNet-4                     ResNet-10
              Min / Max         Avg       Min / Max         Avg        Min / Max         Avg
Baseline      86.84% / 87.15%   87.03%    76.47% / 77.35%   76.95%     87.79% / 88.52%   88.10%
ANODEV2       88.10% / 88.33%   88.26%    77.23% / 78.28%   77.73%     88.65% / 89.19%   88.93%
Imp.          1.26% / 1.18%     1.23%     0.76% / 0.93%     0.78%      0.86% / 0.67%     0.83%

4 Ablation Study

Here we perform an ablation study in which we remove the evolution of the model parameters and instead fix them to stale values in time (which is the configuration used in [8, 9]), and compare with a case where the model parameters are indeed evolved in time, which corresponds to the results of Table 2. Precisely, we use two time steps for the activation ODE (Eq. 5b) and ten time steps for the evolution of the model parameters (Eq. 5c). In this setting, both the FLOPs and the model sizes are the same, allowing us to test the efficacy of evolving the model parameters. The results are shown in Table 4. As one can see, there is indeed a benefit in allowing the model parameters to evolve in time, which is rather intuitive, since it gives the neural network more flexibility in evolving the model parameters. The ANODE results are obtained using the DTO approach with the checkpointing presented in [9]. We also tested the Neural ODE approach used in [8]; the results are significantly worse than both ANODE and ANODEV2. 
Also note that evolving the model parameters has a negligible computational cost, since we can use analytical solutions for the reaction-diffusion-advection equation, as discussed in Appendix A.1.

Table 3: Parameter comparison for the two ANODEV2 configurations, the network used in Section 4, and the baseline network. The parameter size of ANODEV2 is comparable with the others.

                                  AlexNet      ResNet-4   ResNet-10
Baseline                          1756.68K     7.71K      44.19K
ANODEV2 config. 1                 1757.51K     8.23K      45.77K
ANODEV2 config. 2                 1757.13K     7.99K      45.05K
Neural ODE [8] / ANODE [9]        1757.13K     7.96K      44.95K

Table 4: We use Nt = 2 time steps to solve z in the neural network and keep θ static for Neural ODE and ANODE. We tested all the configurations on AlexNet, ResNet-4, and ResNet-10. The results show that Neural ODE obtains significantly worse results compared to ANODEV2 and ANODE. ANODEV2 obtains 0.24%, 0.43%, and 0.33% improvement over ANODE, respectively. The model size comparison is shown in Table 3.

                AlexNet                      ResNet-4                     ResNet-10
                Min / Max         Avg       Min / Max         Avg        Min / Max         Avg
Baseline        86.84% / 87.15%   87.03%    76.47% / 77.35%   76.95%     87.79% / 88.52%   88.10%
Neural ODE [8]  74.54% / 76.78%   75.67%    44.73% / 49.91%   47.37%     64.70% / 70.06%   67.94%
ANODE [9]       87.86% / 88.14%   88.02%    76.92% / 77.45%   77.30%     88.48% / 88.75%   88.60%
ANODEV2         88.10% / 88.33%   88.26%    77.23% / 78.28%   77.73%     88.65% / 89.19%   88.93%

4.1 Parameter Size Comparison

Here, we provide the parameter sizes of the two configurations tested above and of the model used in the ablation study in Section 4. It can be clearly seen that the model sizes of both configurations are roughly the same as those of the baseline models. 
In fact, con\ufb01guration one grows the parameter sizes\nof AlexNet, ResNet-4, and ResNet-10 by only 0.5% to 6.7%, as compared to those of baseline models.\nIn con\ufb01guration two, the parameter size increases from 0.2% to 3.6% compared to baseline model\n(note that we even count the additional batch norm parameters for fair comparison). Comparing with\nthe ablation network used in section 4, in which we apply the same model parameters for multiple\ntime steps, ANODEV2 con\ufb01guration two has basically the same number of parameters. Table 3\nsummarizes all the results.\n\n5 Conclusions\nThe connection between residual networks and ODEs has been recently found in several works.\nHere, we propose ANODEV2, which is a more general extension of this approach by introducing\na coupled ODE based framework, motivated by the works in neural evolution. The framework\nallows dynamical evolution of both the residual parameters as well as the activations in a coupled\nformulation. This gives more \ufb02exibility to the neural network to adjust the parameters to achieve\nbetter generalization performance. We derived the optimality conditions for this coupled formulation\nand presented preliminary experiments using two different con\ufb01gurations, and we showed that we\ncan indeed train such models using our differential framework. The results on three neural networks\n(AlexNet, ResNet-4, and ResNet-10) all showed accuracy gains across \ufb01ve different trials. In fact\nthe worst accuracy of the coupled ODE formulation was better than the best performance of the\nbaseline. This is achieved with negligible change in the model parameter size. To the best of the\nour knowledge, this is the \ufb01rst coupled ODE formulation that allows for the evolution of the model\nparameters in time along with the activations. We are working on extending the framework for other\nlearning tasks. 
The source code will be released as open-source software to the public.

6 Rebuttal

We would like to thank all the reviewers and the area chair for taking the time to review our work and for providing us with their valuable feedback.

Here we present results on simulating a 1D wave equation with ANODEV2. For example, we can directly capture a variable-velocity wave equation using the reaction term in the evolution kernel. This is illustrated in Figure 2, where we test a simple transport phenomenon with variable velocity. The governing equation is dz/dt = c(t) dz/dx, where c(t) is the time-varying velocity and z is the signal that evolves in time. The learning task is to predict how the signal changes in time; that is, we are given z(t=0) and want to infer z(t) at different time points. We test a one-layer model in 1D and illustrate the results in Figure 2. ANODEV2 can easily capture the variable velocity c(t) with only a single layer through the reaction operator, and the quality of its prediction is better than that of ANODE. We emphasize that this is a simple problem, and we are now investigating more complex physics-based problems, for which we anticipate ANODEV2 to perform better by incorporating physical constraints in the evolution kernel.

Figure 2: Reconstruction of a signal transport problem. Here the task is to predict the evolution of the input signal (shown in the top left at t = 0) in time. The governing equation is a first-order wave equation with time-varying velocity. The blue curve shows the ground truth, orange shows ANODE, and green shows ANODEV2. We used a single-layer model to learn the transport equation. ANODEV2 performs better, as it can capture the transport physics as a constraint through Turing's reaction operator.
The x-axis shows spatial location, and the y-axis shows signal amplitude.

Acknowledgments
This work was supported by a gracious fund from Intel Corporation, Berkeley Deep Drive (BDD), and the Berkeley AI Research (BAIR) sponsors. We would like to thank the Intel VLAB team for providing us with access to their computing cluster. We also gratefully acknowledge the support of NVIDIA Corporation for the donation of two Titan Xp GPUs used for this research. We would also like to acknowledge ARO, DARPA, NSF, and ONR for providing partial support of this work.

References
[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in European Conference on Computer Vision, pp. 630–645, Springer, 2016.
[3] E. Weinan, "A proposal on machine learning via dynamical systems," Communications in Mathematics and Statistics, vol. 5, no. 1, pp. 1–11, 2017.
[4] E. Haber and L. Ruthotto, "Stable architectures for deep neural networks," Inverse Problems, vol. 34, no. 1, p. 014004, 2017.
[5] L. Ruthotto and E. Haber, "Deep neural networks motivated by partial differential equations," arXiv preprint arXiv:1804.04272, 2018.
[6] Y. Lu, A. Zhong, Q. Li, and B. Dong, "Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations," arXiv preprint arXiv:1710.10121, 2017.
[7] M. Ciccone, M. Gallieri, J. Masci, C. Osendorfer, and F. Gomez, "NAIS-Net: Stable deep networks from non-autonomous differential equations," arXiv preprint arXiv:1804.07209, 2018.
[8] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K.
Duvenaud, "Neural ordinary differential equations," in Advances in Neural Information Processing Systems, pp. 6571–6583, 2018.
[9] A. Gholami, K. Keutzer, and G. Biros, "ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs," arXiv preprint arXiv:1902.10298, 2019.
[10] "Anonymized for review," Nov. 2019.
[11] A. Lindenmayer, "Mathematical models for cellular interactions in development I. Filaments with one-sided inputs," Journal of Theoretical Biology, vol. 18, no. 3, pp. 280–299, 1968.
[12] A. M. Turing, "The chemical basis of morphogenesis," Bulletin of Mathematical Biology, vol. 52, no. 1-2, pp. 153–197, 1952.
[13] R. K. Belew and T. E. Kammeyer, "Evolving aesthetic sorting networks using developmental grammars," in ICGA, p. 629, Citeseer, 1993.
[14] P. Bentley and S. Kumar, "Three ways to grow designs: A comparison of embryogenies for an evolutionary design problem," in Proceedings of the 1st Annual Conference on Genetic and Evolutionary Computation, vol. 1, pp. 35–43, Morgan Kaufmann Publishers Inc., 1999.
[15] F. Dellaert and R. D. Beer, "A developmental model for the evolution of complete autonomous agents," in Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, pp. 393–401, MIT Press, Cambridge, MA, 1996.
[16] P. Eggenberger, "Evolving morphologies of simulated 3D organisms based on differential gene expression," in Proceedings of the Fourth European Conference on Artificial Life, pp. 205–213, 1997.
[17] G. S. Hornby and J. B. Pollack, "Creating high-level components with a generative representation for body-brain evolution," Artificial Life, vol. 8, no. 3, pp. 223–246, 2002.
[18] K. O. Stanley, D. B. D'Ambrosio, and J.
Gauci, "A hypercube-based encoding for evolving large-scale neural networks," Artificial Life, vol. 15, no. 2, pp. 185–212, 2009.
[19] K. O. Stanley, "Exploiting regularity without development," in Proceedings of the AAAI Fall Symposium on Developmental Systems, p. 37, AAAI Press, Menlo Park, CA, 2006.
[20] K. O. Stanley, "Compositional pattern producing networks: A novel abstraction of development," Genetic Programming and Evolvable Machines, vol. 8, no. 2, pp. 131–162, 2007.
[21] J. Koutnik, F. Gomez, and J. Schmidhuber, "Evolving neural networks in compressed weight space," in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, pp. 619–626, ACM, 2010.
[22] C. Fernando, D. Banarse, M. Reynolds, F. Besse, D. Pfau, M. Jaderberg, M. Lanctot, and D. Wierstra, "Convolution by evolution: Differentiable pattern producing networks," in Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 109–116, ACM, 2016.
[23] J. Schmidhuber, "Learning to control fast-weight memories: An alternative to dynamic recurrent networks," Neural Computation, vol. 4, no. 1, pp. 131–139, 1992.
[24] J. Schmidhuber, "A 'self-referential' weight matrix," in International Conference on Artificial Neural Networks, pp. 446–450, Springer, 1993.
[25] D. Ha, A. Dai, and Q. V. Le, "Hypernetworks," arXiv preprint arXiv:1609.09106, 2016.
[26] L.
Younes, "Shapes and diffeomorphisms," Springer Science & Business Media, 2010.