{"title": "Neural Jump Stochastic Differential Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 9847, "page_last": 9858, "abstract": "Many time series are effectively generated by a combination of deterministic continuous flows along with discrete jumps sparked by stochastic events. However, we usually do not have the equation of motion describing the flows, or how they are affected by jumps. To this end, we introduce Neural Jump Stochastic Differential Equations that provide a data-driven approach to learn continuous and discrete dynamic behavior, i.e., hybrid systems that both flow and jump. Our approach extends the framework of Neural Ordinary Differential Equations with a stochastic process term that models discrete events. We then model temporal point processes with a piecewise-continuous latent trajectory, where the discontinuities are caused by stochastic events whose conditional intensity depends on the latent state. We demonstrate the predictive capabilities of our model on a range of synthetic and real-world marked point process datasets, including classical point processes (such as Hawkes processes), awards on Stack Overflow, medical records, and earthquake monitoring.", "full_text": "Neural Jump Stochastic Differential Equations\n\nJunteng Jia\n\nCornell University\n\njj585@cornell.edu\n\nAustin R. Benson\nCornell University\n\narb@cs.cornell.edu\n\nAbstract\n\nMany time series are effectively generated by a combination of deterministic\ncontinuous \ufb02ows along with discrete jumps sparked by stochastic events. However,\nwe usually do not have the equation of motion describing the \ufb02ows, or how they are\naffected by jumps. To this end, we introduce Neural Jump Stochastic Differential\nEquations that provide a data-driven approach to learn continuous and discrete\ndynamic behavior, i.e., hybrid systems that both \ufb02ow and jump. 
Our approach\nextends the framework of Neural Ordinary Differential Equations with a stochastic\nprocess term that models discrete events. We then model temporal point processes\nwith a piecewise-continuous latent trajectory, where the discontinuities are caused\nby stochastic events whose conditional intensity depends on the latent state. We\ndemonstrate the predictive capabilities of our model on a range of synthetic and\nreal-world marked point process datasets, including classical point processes (such\nas Hawkes processes), awards on Stack Over\ufb02ow, medical records, and earthquake\nmonitoring.\n\n1\n\nIntroduction\n\nIn a wide variety of real-world problems, the system of interest evolves continuously over time, but\nmay also be interrupted by stochastic events [1, 2, 3]. For instance, the reputation of a Stack Over\ufb02ow\nuser may gradually build up over time and determines how likely the user gets a certain badge, while\nthe event of earning a badge may steer the user to participate in different future activities [4]. How\ncan we simultaneously model these continuous and discrete dynamics?\n\nOne approach is with hybrid systems, which are dynamical systems characterized by piecewise\ncontinuous trajectories with a \ufb01nite number of discontinuities introduced by discrete events [5].\nHybrid systems have long been used to describe physical scenarios [6], where the equation of motion\nis often given by an ordinary differential equation. A simple example is table tennis \u2014 the ball follows\nphysical laws of motion and changes trajectory abruptly when bouncing off paddles. However, for\nproblems arising in social and information sciences, we usually know little about the time evolution\nmechanism. 
And in general, we also have little insight about how the stochastic events are generated.

Here, we present Neural Jump Stochastic Differential Equations (JSDEs) for learning the continuous and discrete dynamics of a hybrid system in a data-driven manner. In particular, we use a latent vector z(t) to encode the state of a system. The latent vector z(t) flows continuously over time until an event happens at random, which introduces an abrupt jump and changes its trajectory. The continuous flow is described by Neural Ordinary Differential Equations (Neural ODEs), while the event conditional intensity and the influence of the jump are parameterized with neural networks as functions of z(t).

The Neural ODEs framework models the continuous transformation of a latent vector as an ODE flow and parameterizes the flow dynamics with a neural network [7]. The approach is a continuous analogue of residual networks, with infinite depth and infinitesimal step size, which brings about many desirable properties. Remarkably, the derivative of the loss function can be computed via the adjoint method, which integrates the adjoint equation backwards in time with constant memory regardless of the network depth. However, the downside of these continuous models is that they cannot incorporate discrete events (or inputs) that abruptly change the latent vector. To address this limitation, we extend

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

the Neural ODEs framework with discontinuities for modeling hybrid systems. In particular, we show how the discontinuities caused by discrete events should be handled in the adjoint method. More specifically, at the time of a discontinuity, not only does the latent vector describing the state of the system change abruptly; as a consequence, the adjoint vector representing the loss function derivatives also jumps. 
Furthermore, our Neural JSDE model can serve as a stochastic process for generating event sequences. The latent vector z(t) determines the conditional intensity of event arrival, which in turn causes a discontinuity in z(t) at the time an event happens.

A major advantage of Neural JSDEs is that they can be used to model a variety of marked point processes, where events can be accompanied by either a discrete value (say, a class label) or a vector of real-valued features (e.g., spatial locations); thus, our framework is broadly applicable for time series analysis. We test our Neural JSDE model in a variety of scenarios. First, we find that our model can learn the intensity function of a number of classical point processes, including Hawkes processes and self-correcting processes (which are already used broadly in modeling, e.g., social systems [8, 9, 10]). Next, we show that Neural JSDEs can achieve state-of-the-art performance in predicting discrete-typed event labels, using datasets of awards on Stack Overflow and medical records. Finally, we demonstrate the capabilities of Neural JSDEs for modeling point processes where events have real-valued feature vectors, using both synthetic data as well as earthquake data, where the events are accompanied by spatial locations as features.

2 Background, Motivation, and Challenges

In this section, we review classical temporal point process models and the Neural ODE framework of Chen et al. [7]. Compared to a discrete time step model like an RNN, the continuous time formulation of Neural ODEs makes them more suitable for describing events with real-valued timestamps. However, Neural ODEs enforce continuous dynamics and therefore cannot model sudden event effects.

2.1 Classical Temporal Point Process Models

A temporal point process is a stochastic generative model whose output is a sequence of discrete events H = {τ_j}. 
An event sequence can be formally described by a counting function N(t) recording the number of events before time t, which is defined as follows:

N(t) = Σ_{τ_j ∈ H} H(t − τ_j),  where H(t) = 0 for t ≤ 0 and 1 otherwise,  (1)

where H is the Heaviside step function. Oftentimes, we are interested in a temporal point process whose future outcome depends on historical events [11]. Such dependency is best described by a conditional intensity function λ(t). Let H_t denote the subset of events up to but not including t. Then λ(t) defines the probability density of observing an event conditioned on the event history:

P{event in [t, t + dt) | H_t} = λ(t) · dt.  (2)

Using this form, we now describe some of the most well-studied point process models, which we later use in our experiments.

Poisson processes. The conditional intensity is a function g(t) independent of the event history H_t. The simplest case is a homogeneous Poisson process, where the intensity function is a constant λ_0:

λ(t) = g(t) = λ_0.  (3)

Hawkes processes. These processes assume that events are self-exciting. In other words, an event leads to an increase in the conditional intensity function, whose effect decays over time:

λ(t) = λ_0 + α Σ_{τ_j ∈ H_t} κ(t − τ_j),  (4)

where λ_0 is the baseline intensity, α > 0, and κ is a kernel function. We consider two widely used kernels: (1) the exponential kernel κ_1, which is often used for its computational efficiency [12]; and (2) the power-law kernel κ_2, which is used for modeling in seismology [13] and social media [14]:

κ_1(t) = e^{−βt},  κ_2(t) = 0 for t < σ and (β/σ)(t/σ)^{−β−1} otherwise.  (5)

The variant of the power-law kernel we use here has a delaying effect.

Self-correcting processes. 
A self-correcting process assumes the conditional intensity grows exponentially with time and that an event suppresses future events. This model has been used for modeling earthquakes once aftershocks have been removed [15]:

λ(t) = e^{μt − βN(t)}.  (6)

Marked temporal point processes. Oftentimes, we care not only about when an event happens, but also what the event is; having such labels makes the point process marked. In these cases, we use a vector embedding k to denote event type, and H = {(τ_j, k_j)} for an event sequence, where each tuple denotes an event with embedding k_j happening at timestamp τ_j. This setup is applicable to events with discrete types as well as events with real-valued features. For discrete-typed events, we use a one-hot encoding k_j ∈ {0, 1}^m, where m is the number of discrete event types. Otherwise, the k_j are real-valued feature vectors.

2.2 Neural ODEs

A Neural ODE defines a continuous-time transformation of variables [7]. Starting from an initial state z(t_0), the transformed state at any time t_i is given by integrating an ODE forward in time:

dz(t)/dt = f(z(t), t; θ),   z(t_i) = z(t_0) + ∫_{t_0}^{t_i} (dz(t)/dt) dt.  (7)

Here, f is a neural network parameterized by θ that defines the ODE dynamics.

Assuming the loss function depends directly on the latent variable values at a sequence of checkpoints {t_i}_{i=0}^{N} (i.e., L = L({z(t_i)}; θ)), Chen et al. proposed to use the adjoint method to compute the derivatives of the loss function with respect to the initial state z(t_0), model parameters θ, and the initial time t_0 as follows. 
First, define the initial conditions of the adjoint variables as follows:

a(t_N) = ∂L/∂z(t_N),   a_θ(t_N) = 0,   a_t(t_N) = ∂L/∂t_N = a(t_N) f(z(t_N), t_N; θ).  (8)

Then, the loss function derivatives dL/dz(t_0) = a(t_0), dL/dθ = a_θ(t_0), and dL/dt_0 = a_t(t_0) can be computed by integrating the following ordinary differential equations backward in time:

da(t)/dt = −a(t) ∂f(z(t), t; θ)/∂z(t),   a(t_0) = a(t_N) + ∫_{t_N}^{t_0} [ da(t)/dt − Σ_{i≠N} δ(t − t_i) ∂L/∂z(t_i) ] dt,

da_θ(t)/dt = −a(t) ∂f(z(t), t; θ)/∂θ,   a_θ(t_0) = a_θ(t_N) + ∫_{t_N}^{t_0} da_θ(t)/dt dt,

da_t(t)/dt = −a(t) ∂f(z(t), t; θ)/∂t,   a_t(t_0) = a_t(t_N) + ∫_{t_N}^{t_0} [ da_t(t)/dt − Σ_{i≠N} δ(t − t_i) ∂L/∂t_i ] dt.  (9)

Although solving Eq. (9) requires the value of z(t) along its entire trajectory [7], z(t) can be recomputed backwards in time together with the adjoint variables, starting from its final value z(t_N), and therefore induces no memory overhead.

2.3 When can Neural ODEs Model Temporal Point Processes?

The continuous Neural ODE formulation makes it a good candidate for modeling events with real-valued timestamps. In fact, Chen et al. applied their model to learning the intensity of Poisson processes, which notably does not depend on event history. However, in many real-world applications, an event (e.g., a financial transaction or a tweet) often provides feedback to the system and influences the future dynamics [16, 17].

There are two possible ways to encode the event history and model event effects. 
The first approach is to parametrize f with an explicit dependence on time: events that happen before time t change the function f and consequently influence the trajectory z(t) after time t. Unfortunately, even the mild assumption requiring f to be finite would imply that the event effects "kick in" continuously, and therefore this approach cannot model events that create immediate shocks to a system (e.g., effects of Federal Reserve interest rate changes on the stock market). For this reason, areas such as financial mathematics have long advocated for discontinuous time series models [18, 19]. The second alternative is to encode the event effects as abrupt jumps of the latent vector z(t). However, the original Neural ODE framework assumes a Lipschitz continuous trajectory, and therefore cannot model temporal point processes that depend on event history (such as a Hawkes process).

In the next section, we show how to incorporate jumps into the Neural ODE framework for modeling event effects, while maintaining the simplicity of the adjoint method for training.

3 Neural Jump Stochastic Differential Equations

In our setup, we are given a sequence of events H = {(τ_j, k_j)} (i.e., a set of tuples, each consisting of a timestamp and a vector), and we are interested in both simulating and predicting the likelihood of future events.

3.1 Latent Dynamics and Stochastic Events

At a high level, our model represents the latent state of the system with a vector z(t) ∈ R^n. The latent state continuously evolves with a deterministic trajectory until interrupted by a stochastic event. Within any time interval [t, t + dt), an event happens with the following probability:

P{event happens in [t, t + dt) | H_t} = λ(t) · dt,  (10)

where λ(t) = λ(z(t)) is the total conditional intensity for events of all types. The embedding of an event happening at time t is sampled from k(t) ∼ p(k|z(t)). 
Here, both λ(z(t)) and p(k|z(t)) are parameterized with neural networks and learned from data. In cases where events have discrete types, p(k|z(t)) is supported on the finite set of one-hot encodings, and the neural network directly outputs the intensity for every event type. On the other hand, for events with real-valued features, we parameterize p(k|z(t)) with a Gaussian mixture model, whose parameters η depend on z(t). The mapping from z(t) to η is learned with another neural network.

Next, let N(t) be the number of events up to time t. The latent state dynamics of our Neural JSDE model is described by the following equation:

dz(t) = f(z(t), t; θ) · dt + w(z(t), k(t), t; θ) · dN(t),  (11)

where f and w are two neural networks that control the flow and jump, respectively. Following our definition of the counting function (Eq. (1)), all time-dependent variables are left continuous in t, i.e., lim_{ε→0⁺} z(t − ε) = z(t). Section 3.3 describes the neural network architectures for f, w, λ, and p.

Now that we have fully defined the latent dynamics and stochastic event handling, we can simulate the hybrid system by integrating Eq. (11) forward in time with an adaptive step size ODE solver. The complete algorithm for simulating the hybrid system with stochastic events is described in Appendix A.1. However, in this paper, we focus on prediction instead of simulation.

3.2 Learning the Hybrid System

For a given set of model parameters, we compute the log probability density for a sequence of events H = {(τ_j, k_j)} and define the loss function as

L = − log P(H) = − Σ_j log λ(z(τ_j)) − Σ_j log p(k_j|z(τ_j)) + ∫_{t_0}^{t_N} λ(z(t)) dt.  (12)

In practice, the integral in Eq. (12) is computed by a weighted sum of the intensities λ(z(t_i)) on checkpoints {t_i}. 
Therefore, computing the loss function L = L({z(t_i)}; θ) requires integrating Eq. (11) forward from t_0 to t_N and recording the latent vectors along the trajectory.

The loss function derivatives are evaluated with the adjoint method (Eq. (9)). However, we encounter jumps in the latent vector Δz(τ_j) = w(z(τ_j), k_j, τ_j; θ) when integrating the adjoint equations backwards in time (Fig. 1). These jumps also introduce discontinuities to the adjoint vectors at τ_j.

Figure 1: Reverse-mode differentiation of an ODE with discontinuities. Each jump Δz(τ_j) in the latent vector (green, top panel) also introduces a discontinuity for the adjoint vectors (green, bottom panel).

Denote the right limit of any time-dependent variable x(t) by x(t⁺) = lim_{ε→0⁺} x(t + ε). Then, at any timestamp τ_j when an event happens, the left and right limits of the adjoint vectors a, a_θ, and a_t exhibit the following relationships (see Appendix A.2 for the derivation):

a(τ_j) = a(τ_j⁺) + a(τ_j⁺) ∂[w(z(τ_j), k_j, τ_j; θ)]/∂z(τ_j),

a_θ(τ_j) = a_θ(τ_j⁺) + a(τ_j⁺) ∂[w(z(τ_j), k_j, τ_j; θ)]/∂θ,

a_t(τ_j) = a_t(τ_j⁺) + a(τ_j⁺) ∂[w(z(τ_j), k_j, τ_j; θ)]/∂τ_j.  (13)

In order to compute the loss function derivatives dL/dz(t_0) = a(t_0), dL/dθ = a_θ(t_0), and dL/dt_0 = a_t(t_0), we integrate the adjoint vectors backwards in time following Eq. (9). However, at every τ_j when an event happens, the adjoint vectors are discontinuous and need to be lifted from their right limits to their left limits. One caveat is that computing the Jacobians in Eq. (13) requires the value of z(τ_j) at the left limit, which needs to be recorded during forward integration. 
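As a minimal numerical sketch of this forward/backward procedure, the toy example below uses a scalar latent state, a linear flow f(z) = θz, a linear jump w(z) = 0.1z at a single known event time, Euler integration, and the loss L = z(t_N)². All of these concrete choices are illustrative stand-ins for the paper's neural networks, likelihood loss, and adaptive solver, not the actual implementation.

```python
def forward(z0, theta, t0, tN, tau, dt=1e-3):
    """Euler-integrate dz/dt = theta*z, applying the jump w(z) = 0.1*z at event time tau.
    Returns the final state z(tN) and the recorded left limit z(tau)."""
    n = int(round((tN - t0) / dt))
    jump_step = int(round((tau - t0) / dt))
    z, z_tau = z0, None
    for i in range(n):
        if i == jump_step:
            z_tau = z                    # record the left limit z(tau) for the backward pass
            z = z + 0.1 * z              # jump: z(tau+) = z(tau) + w(z(tau))
        z = z * (1.0 + theta * dt)       # Euler step of the flow
    return z, z_tau

def backward(zN, z_tau, theta, t0, tN, tau, dt=1e-3):
    """Adjoint pass for L = z(tN)**2: integrate a and a_theta backwards,
    recomputing z(t) in reverse, and lift a at the event as in Eq. (13)."""
    n = int(round((tN - t0) / dt))
    jump_step = int(round((tau - t0) / dt))
    a = 2.0 * zN                         # a(tN) = dL/dz(tN)
    a_theta = 0.0
    z = zN                               # recompute z backwards alongside the adjoints
    for i in reversed(range(n)):
        z = z / (1.0 + theta * dt)       # undo the Euler flow step
        a_theta += a * z * dt            # accumulate the parameter adjoint
        a = a * (1.0 + theta * dt)       # adjoint through the flow step
        if i == jump_step:
            a = a * (1.0 + 0.1)          # lift: a(tau) = a(tau+) + a(tau+) * dw/dz
            z = z_tau                    # restore the recorded left limit z(tau)
    return a, a_theta                    # dL/dz(t0), dL/dtheta
```

Note that the backward pass reconstructs z(t) in reverse rather than storing the whole trajectory, so memory stays constant in the number of steps; only the left limits at event times must be saved.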
The complete algorithm for integrating z(t) forward and z(t), a(t), a_θ(t), a_t(t) backward is described in Appendix A.3.

3.3 Network Architectures

Figure 2 shows the network architectures that parameterize our model. In order to better simulate the time series, the latent state z(t) ∈ R^n is further split into two vectors: c(t) ∈ R^{n1} encodes the internal state, and h(t) ∈ R^{n2} encodes the memory of events up to time t, where n = n1 + n2.

Dynamics function f(z). We parameterize the internal state dynamics ∂c(t)/∂t by a multi-layer perceptron (MLP). Furthermore, we require ∂c(t)/∂t to be orthogonal to c(t). This constrains the internal state dynamics to a sphere and improves the stability of the ODE solution. On the other hand, the event memory h(t) decays over time, with a decay rate parameterized by another MLP, whose output passes through a softplus activation to guarantee that the decay rate is positive.

Jump function w(z(t)). An event introduces a jump Δh(t) to the event memory h(t). The jump is parameterized by an MLP that takes the event embedding k(t) and internal state c(t) as input. Our architecture also assumes that an event does not directly interrupt the internal state (i.e., Δc(t) = 0).

Intensity λ(z(t)) and probability p(k|z(t)). We use an MLP to compute both the total intensity λ(z(t)) and the probability distribution over the event embedding. For events that are discrete (where k is a one-hot encoding), the MLP directly outputs the intensity of each event type. For events with real-valued features, the probability density distribution is represented by a mixture of Gaussians, and the MLP outputs the weight, mean, and variance of each Gaussian.

4 Experimental Results

Next, we use our model to study a variety of synthetic and real-world time series of events that occur at real-valued timestamps. 
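The architecture of Section 3.3 can be sketched concretely. Below is a NumPy stand-in in which single-hidden-layer MLPs with random weights play the role of the learned networks; the dimensions, hidden sizes, and initializations are all illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """A toy single-hidden-layer MLP with tanh; random weights stand in for learned ones."""
    Ws = [rng.normal(0, 0.5, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    def apply(x):
        for W in Ws[:-1]:
            x = np.tanh(W @ x)
        return Ws[-1] @ x
    return apply

n1, n2, k_dim = 4, 4, 3              # internal state, event memory, event embedding sizes
f_c = mlp([n1 + n2, 16, n1])         # raw internal-state dynamics network
f_h = mlp([n1 + n2, 16, n2])         # raw decay-rate network
w_h = mlp([n1 + k_dim, 16, n2])      # jump network for the event memory

def softplus(x):
    return np.log1p(np.exp(x))

def dynamics(c, h):
    """f(z): dc/dt is projected to be orthogonal to c; dh/dt = -softplus(.) * h decays h."""
    z = np.concatenate([c, h])
    v = f_c(z)
    dc = v - (v @ c) / (c @ c) * c   # remove the radial component: flow on a sphere
    dh = -softplus(f_h(z)) * h       # softplus guarantees a positive decay rate
    return dc, dh

def jump(c, h, k):
    """w(z, k): an event updates only the event memory h (dc = 0)."""
    dh = w_h(np.concatenate([c, k]))
    return c, h + dh
```

The projection step makes dc/dt exactly orthogonal to c, and the softplus keeps the decay rate positive, matching the two architectural constraints described above.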
We train all of our models on a workstation with an 8-core i7-

[Figure 2: architecture diagram. Legend: MLP (multi-layer perceptron), negation, multiplication, softplus, orthogonalization, vector transfer, vector copy, vector concatenation.]

Table 1: The mean absolute percentage error of the predicted conditional intensities. Each column represents a different type of generating process. Each row represents a prediction model. In all cases, our Neural JSDE outperforms the RNN baseline by a significant margin.

                  Poisson   Hawkes (E)   Hawkes (PL)   Self-Correcting
Poisson              0.1       188.2         95.6           29.1
Hawkes (E)           0.3         3.5        155.4           29.1
Hawkes (PL)          0.1       128.5          9.8           29.1
Self-Correcting     98.7       101.0         87.1            1.6
RNN                  3.2        22.0         20.1           24.3
Neural JSDE          1.3         5.9         17.1            9.3

Figure 3: The ground truth and predicted conditional intensity of three event sequences generated by different processes (A–C), and an example of the latent state dynamics (D). Each blue dot represents an event at the corresponding time. In all cases, our model captures the general trends in the intensity.

where λ*_model(t) is the trained model intensity and λ*_GT(t) is the ground truth intensity. The integral is approximated by evaluation at 2000 uniformly spaced points (the same points that are used for training the RNN model). Table 1 reports the errors of the conditional intensity for our model and the baselines. In all cases, our Neural JSDE model is a better fit for the data than the RNN and other point process models (except for the ground truth model, which shows what we can expect to achieve with perfect knowledge of the process). Figure 3 shows how the learned conditional intensity of our Neural JSDE model effectively tracks the ground truth intensities. Remarkably, our model is able to capture the delaying effect in the power-law kernel (Fig. 
3D) through a complex interplay between the internal state and event memory: although an event immediately introduces a jump to the event memory h(t), the intensity function peaks when the internal state c(t) is largest, which lags behind h(t).

4.2 Discrete Event Type Prediction on Social and Medical Datasets

Next, we evaluate our model on a discrete-type event prediction task with two real-world datasets. The Stack Overflow dataset contains the awards history of 6633 users on an online question-answering website [21]. Each sequence is a collection of badges a user received over a period of 2 years, and there are 22 different badge types in total. The medical records (MIMIC2) dataset contains the clinical visit history of 650 de-identified patients in an Intensive Care Unit [21]. Each sequence consists of visit events of a patient over 7 years, where the event type is the reason for the visit (75 reasons in total). Using 5-fold cross validation, we predict the event type of every held-out event (τ_j, k_j) by choosing the event embedding with the largest probability p(k|z(τ_j)) given the past event history H_{τ_j}. For the Stack Overflow dataset, we use a 20-dimensional latent state (n1 = 10, n2 = 10) and MLPs with one hidden layer of 32 units to parameterize the dynamics functions f, w, λ, p. For MIMIC2, we use a 64-dimensional latent state (n1 = 32, n2 = 32) and MLPs with one hidden layer of 64 units. The learning rate is set to 10−3 with weight decay 10−5.

Table 2: The classification error rate of our model on discrete event type prediction. The baseline error rates are taken from [20].

Error Rate        [21]   [20]   NJSDE
Stack Overflow    54.1   53.7   52.7
MIMIC2            18.8   16.8   19.8

We compare the event type classification accuracy of our model against two other models for learning event sequences that directly simulate the next event based on the history, namely neural point process models based on an RNN [21] or LSTM [20]. Note that these approaches model event sequences but not trajectories. Our model not only achieves performance similar to these state-of-the-art models for discrete event type prediction (Table 2), but also allows us to model events with real-valued features, as we study next.

Figure 4: The ground truth and predicted event embedding in one event sequence.

Figure 5: Contour plots of the predicted conditional intensity and the locations of earthquakes (red dots) over the years 2007–2018, using earthquake training data from 1970–2006.

4.3 Real-Valued Event Feature Prediction — Synthetic and Earthquake Data

Next, we use our model to predict events with real-valued features. To this end, we first test our model on synthetic event sequences whose event times are generated by a Hawkes process with an exponential kernel, but the feature of each event records the time interval since the previous event. We train our model in a similar way as in Section 4.1, using a 10-dimensional latent state (n1 = 5, n2 = 5) and MLPs with one hidden layer of 20 units (the learning rate for the Adam optimizer is set to 10−4 with weight decay 10−5). This achieves a mean absolute error of 0.353. In contrast, the baseline of simply predicting the mean of the features seen so far has an error of 3.654. 
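For the real-valued case, Section 3.3 states that the MLP head outputs the weight, mean, and variance of each Gaussian component. Below is a small NumPy sketch of evaluating such a mixture density and forming a point prediction; the function names and the log-weight parameterization are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def log_density(k, log_w, mu, var):
    """log p(k|z) for a mixture of diagonal Gaussians.
    k: (d,) observed feature; log_w: (M,) log mixture weights (assumed normalized);
    mu, var: (M, d) per-component means and variances."""
    comp = -0.5 * np.sum((k - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
    m = np.max(log_w + comp)
    return m + np.log(np.sum(np.exp(log_w + comp - m)))   # stable log-sum-exp

def expected_feature(log_w, mu):
    """Point prediction: the mixture mean, sum_m w_m * mu_m."""
    w = np.exp(log_w - np.max(log_w))
    w = w / w.sum()
    return w @ mu
```

With a head like this, the log p(k_j|z(τ_j)) term of the loss in Eq. (12) is `log_density` evaluated at the observed feature vector.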
Figure 4 shows one event sequence and the predicted event features.

Finally, we provide an illustrative example of real-world data with real-valued features. We use our model to predict the times and locations of earthquakes above level 4.0 in 2007–2018 using historical data from 1970–2006.² In this case, an event's features are the longitude and latitude of an earthquake. This time, we use a 20-dimensional latent state (n1 = 10, n2 = 10), and parameterize the event feature's probability density distribution by a mixture of 5 Gaussians. Figure 5 shows the contours of the conditional intensity of the learned Neural JSDE model.

²Data from https://www.kaggle.com/danielpe/earthquakes

5 Related Work

Modeling point processes. Temporal point processes are elegant abstractions for time series analysis. The self-exciting nature of the Hawkes process has made it a key model within machine learning and information science [22, 23, 24, 25, 26, 27, 28, 29]. However, classical point process models (including the Hawkes process) make strong assumptions about how the event history influences future dynamics. To get around this, RNNs and LSTMs have been adapted to directly model events as time steps within the model [21, 20]. However, these models do not consider latent space dynamics in the absence of events as we do, which may reflect time-varying internal evolution that inherently exists in the system. Xiao et al. also proposed a combined approach to event history and internal state evolution by simultaneously using two RNNs — one that takes the event sequence as input, and one that models evenly spaced time intervals [30]. In contrast, our model provides a unified approach that addresses both aspects by making a connection to ordinary differential equations, and it can be efficiently trained with the adjoint method using only constant memory. 
Another approach uses GANs to circumvent modeling the intensity function [31]; however, this cannot provide insight into the dynamics of the system. Most related to our approach is the recently proposed ODE-RNN model [32], which uses an RNN to make abrupt updates to a hidden state that otherwise evolves continuously.

Learning differential equations. More broadly, learning parameters of differential equations from data has been successful for physics-based problems with deterministic dynamics [33, 34, 35, 36, 37]. In terms of modeling real-world randomness, Wang et al. introduced a jump-diffusion stochastic differential equation framework for modeling user activities [38], parameterizing the conditional intensity and opinion dynamics with a fixed functional form. More recently, Ryder et al. proposed an RNN-based variational method for learning stochastic dynamics [39], which was later generalized to infinitesimal step sizes [40, 41]. These approaches focus on randomness introduced by Brownian motion, which cannot model abrupt jumps of latent states. Finally, recent developments of robust software packages [42, 43] and numerical methods [44] have made the process of learning model parameters easier and more reliable for a host of models.

6 Discussion

We have developed Neural Jump Stochastic Differential Equations, a general framework for modeling temporal event sequences. Our model learns both the latent continuous dynamics of the system and the abrupt effects of events from data. The model maintains the simplicity and memory efficiency of Neural ODEs and uses a similar adjoint method for learning; in our case, we additionally model jumps in the trajectory with a neural network, and handle the effects of this discontinuity in the learning method. 
Our approach is quite flexible, being able to model intensity functions and discrete or continuous event types, all while providing interpretable latent space dynamics.

Acknowledgements. This research was supported by NSF Award DMS-1830274, ARO Award W911NF19-1-0057, and ARO MURI.

References

[1] William Glover and John Lygeros. A stochastic hybrid model for air traffic control simulation. In Rajeev Alur and George J. Pappas, editors, Hybrid Systems: Computation and Control, pages 372–386, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[2] João P. Hespanha. Stochastic hybrid systems: Application to communication networks. In Rajeev Alur and George J. Pappas, editors, Hybrid Systems: Computation and Control, pages 387–401, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[3] Xiangfang Li, Oluwaseyi Omotere, Lijun Qian, and Edward R. Dougherty. Review of stochastic hybrid systems with applications in biological systems modeling and analysis. EURASIP Journal on Bioinformatics and Systems Biology, 2017(1):8, Jun 2017.

[4] Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. Steering user behavior with badges. In Proceedings of the 22nd International Conference on World Wide Web. ACM Press, 2013.

[5] Michael S. Branicky. Introduction to Hybrid Systems. Birkhäuser Boston, Boston, MA, 2005.

[6] Arjan J. Van Der Schaft and Johannes Maria Schumacher. An Introduction to Hybrid Dynamical Systems, volume 251. Springer London, 2000.

[7] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

[8] Charles Blundell, Jeff Beck, and Katherine A. Heller. Modelling reciprocating relationships with Hawkes processes. In Advances in Neural Information Processing Systems, pages 2600–2608, 2012.

[9] Liangda Li and Hongyuan Zha. 
Dyadic event attribution in social networks with mixtures of\nhawkes processes. In Proceedings of the 22nd ACM international conference on Information &\nKnowledge Management, pages 1667\u20131672. ACM, 2013.\n\n[10] Alexey Stomakhin, Martin B. Short, and Andrea L. Bertozzi. Reconstruction of missing data in\nsocial networks based on temporal patterns of interactions. Inverse Problems, 27(11):115013,\n2011.\n\n[11] Daryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes.\nProbability and its Applications (New York). Springer-Verlag, New York, second edition, 2003.\nElementary theory and methods.\n\n[12] Patrick J. Laub, Thomas Taimre, and Philip K. Pollett. Hawkes Processes. arXiv:1507.02822,\n\n2015.\n\n[13] Yosihiko Ogata. Seismicity analysis through point-process modeling: A review. Pure and\n\nApplied Geophysics, 155(2):471\u2013507, Aug 1999.\n\n[14] Marian-Andrei Rizoiu, Lexing Xie, Scott Sanner, Manuel Cebrian, Honglin Yu, and Pascal Van\nHentenryck. Expecting to be HIP: Hawkes intensity processes for social media popularity. In\nProceedings of the 26th International Conference on World Wide Web. ACM Press, 2017.\n\n[15] Yosihiko Ogata and David Vere-Jones. Inference for earthquake models: A self-correcting\n\nmodel. Stochastic Processes and their Applications, 17(2):337 \u2013 347, 1984.\n\n[16] Stephen J. Hardiman, Nicolas Bercot, and Jean-Philippe Bouchaud. Critical re\ufb02exivity in\n\ufb01nancial markets: a hawkes process analysis. The European Physical Journal B, 86(10):442,\n2013.\n\n[17] Ryota Kobayashi and Renaud Lambiotte. Tideh: Time-dependent hawkes process for predicting\n\nretweet dynamics. In Tenth International AAAI Conference on Web and Social Media, 2016.\n\n[18] John C. Cox and Stephen A. Ross. The valuation of options for alternative stochastic processes.\n\nJournal of Financial Economics, 3(1-2):145\u2013166, 1976.\n\n[19] Robert C. Merton. Option pricing when underlying stock returns are discontinuous. 
Journal of Financial Economics, 3(1-2):125–144, 1976.

[20] Hongyuan Mei and Jason Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, 2017.

[21] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.

[22] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pages 1301–1309, 2013.

[23] Mehrdad Farajtabar, Yichen Wang, Manuel Gomez Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network co-evolution. In Advances in Neural Information Processing Systems, pages 1954–1962, 2015.

[24] Isabel Valera and Manuel Gomez-Rodriguez. Modeling adoption and usage of competing products. In 2015 IEEE International Conference on Data Mining, pages 409–418. IEEE, 2015.

[25] Liangda Li and Hongyuan Zha. Learning parametric models for social infectivity in multi-dimensional Hawkes processes. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[26] Hongteng Xu, Dixin Luo, and Hongyuan Zha. Learning Hawkes processes from short doubly-censored event sequences. In Proceedings of the 34th International Conference on Machine Learning, pages 3831–3840, 2017.

[27] Hongteng Xu and Hongyuan Zha. A Dirichlet mixture model of Hawkes processes for event sequence clustering. In Advances in Neural Information Processing Systems, pages 1354–1363, 2017.

[28] Mehrdad Farajtabar, Nan Du, Manuel Gomez Rodriguez, Isabel Valera, Hongyuan Zha, and Le Song. Shaping social activity by incentivizing users.
In Advances in Neural Information Processing Systems, pages 2474–2482, 2014.

[29] Ali Zarezade, Utkarsh Upadhyay, Hamid R. Rabiee, and Manuel Gomez-Rodriguez. RedQueen: An online algorithm for smart broadcasting in social networks. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 51–60. ACM, 2017.

[30] Shuai Xiao, Junchi Yan, Stephen M. Chu, Xiaokang Yang, and Hongyuan Zha. Modeling the intensity function of point process via recurrent neural networks. In AAAI, 2017.

[31] Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. Wasserstein learning of deep generative point process models. In Advances in Neural Information Processing Systems, pages 3247–3257, 2017.

[32] Yulia Rubanova, Ricky T. Q. Chen, and David Duvenaud. Latent ODEs for irregularly-sampled time series. arXiv:1907.03907, 2019.

[33] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018.

[34] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data-driven discovery of nonlinear dynamical systems. arXiv:1801.01236, 2018.

[35] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences, 6(1), 2018.

[36] Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. In Proceedings of the 35th International Conference on Machine Learning, pages 3208–3216, 2018.

[37] Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núñez. A multiscale neural network based on hierarchical matrices. arXiv:1807.01883, 2018.

[38] Yichen Wang et al.
A stochastic differential equation framework for guiding online user activities in closed loop. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.

[39] Thomas Ryder, Andrew Golightly, Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. arXiv:1802.03335, 2018.

[40] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv:1905.09883, 2019.

[41] Stefano Peluchetti and Stefano Favaro. Infinitely deep neural networks as diffusion processes. arXiv:1905.11065, 2019.

[42] Chris Rackauckas, Mike Innes, Yingbo Ma, Jesse Bettencourt, Lyndon White, and Vaibhav Dixit. DiffEqFlux.jl — a Julia library for neural differential equations. arXiv:1902.02376, 2019.

[43] Mike Innes, Alan Edelman, Keno Fischer, Chris Rackauckas, Elliot Saba, Viral B. Shah, and Will Tebbutt. Zygote: A differentiable programming system to bridge machine learning and scientific computing. arXiv:1907.07587, 2019.

[44] Christopher Rackauckas, Yingbo Ma, Vaibhav Dixit, Xingjian Guo, Mike Innes, Jarrett Revels, Joakim Nyberg, and Vijay Ivaturi. A comparison of automatic differentiation and continuous sensitivity analysis for derivatives of differential equation solutions. arXiv:1812.01892, 2018.

[45] Sebastien Corner, Corina Sandu, and Adrian Sandu. Adjoint sensitivity analysis of hybrid multibody dynamical systems. arXiv:1802.07188, 2018.