{"title": "Wasserstein Learning of Deep Generative Point Process Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3247, "page_last": 3257, "abstract": "Point processes are becoming very popular in modeling asynchronous sequential data due to their sound mathematical foundation and strength in modeling a variety of real-world phenomena. Currently, they are often characterized via intensity function which limits model's expressiveness due to unrealistic assumptions on its parametric form used in practice. Furthermore, they are learned via maximum likelihood approach which is prone to failure in multi-modal distributions of sequences. In this paper, we propose an intensity-free approach for point processes modeling that transforms nuisance processes to a target one. Furthermore, we train the model using a likelihood-free leveraging Wasserstein distance between point processes. Experiments on various synthetic and real-world data substantiate the superiority of the proposed point process model over conventional ones.", "full_text": "Wasserstein Learning of Deep Generative Point\n\nProcess Models\n\nShuai Xiao\u2217\u2020 , Mehrdad Farajtabar\u2217(cid:5) Xiaojing Ye\u2021 , Junchi Yan\u00a7 , Le Song(cid:5)\u00b6, Hongyuan Zha(cid:5)\n\n\u2020 Shanghai Jiao Tong University\n\n(cid:5)College of Computing, Georgia Institute of Technology\n\n\u2021 School of Mathematics, Georgia State University\n\n\u00a7 IBM Research \u2013 China\n\n\u00b6Ant Financial\n\nbenjaminforever@sjtu.edu.cn, mehrdad@gatech.edu\n\nxye@gsu.edu, yanjc@cn.ibm.com\n\n{lsong,zha}@cc.gatech.edu\n\nAbstract\n\nPoint processes are becoming very popular in modeling asynchronous sequential\ndata due to their sound mathematical foundation and strength in modeling a variety\nof real-world phenomena. Currently, they are often characterized via intensity\nfunction which limits model\u2019s expressiveness due to unrealistic assumptions on\nits parametric form used in practice. 
Furthermore, they are learned via maximum\nlikelihood approach which is prone to failure in multi-modal distributions of\nsequences. In this paper, we propose an intensity-free approach for point processes\nmodeling that transforms nuisance processes to a target one. Furthermore, we train\nthe model using a likelihood-free leveraging Wasserstein distance between point\nprocesses. Experiments on various synthetic and real-world data substantiate the\nsuperiority of the proposed point process model over conventional ones.\n\n1\n\nIntroduction\n\nEvent sequences are ubiquitous in areas such as e-commerce, social networks, and health informatics.\nFor example, events in e-commerce are the times a customer purchases a product from an online\nvendor such as Amazon. In social networks, event sequences are the times a user signs on or\ngenerates posts, clicks, and likes. In health informatics, events can be the times when a patient\nexhibits symptoms or receives treatments. Bidding and asking orders also comprise events in the\nstock market. In all of these applications, understanding and predicting user behaviors exhibited by\nthe event dynamics are of great practical, economic, and societal interest.\nTemporal point processes [1] is an effective mathematical tool for modeling events data. It has\nbeen applied to sequences arising from social networks [2, 3, 4], electronic health records [5], e-\ncommerce [6], and \ufb01nance [7]. A temporal point process is a random process whose realization\nconsists of a list of discrete events localized in (continuous) time. The point process representation of\nsequence data is fundamentally different from the discrete time representation typically used in time\nseries analysis. It directly models the time period between events as random variables, and allows\ntemporal events to be modeled accurately, without requiring the choice of a time window to aggregate\nevents, which may cause discretization errors. 
Moreover, it has a remarkably extensive theoretical\nfoundation [8].\n\n\u2217Authors contributed equally. Work completed at Georgia Tech.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nHowever, conventional point process models often make strong, unrealistic assumptions about the\ngenerative processes of the event sequences. In fact, a point process is characterized by its conditional\nintensity function \u2013 a stochastic model for the time of the next event given all the times of previous\nevents. The functional form of the intensity is often designed to capture the phenomena of interest [9].\nSome examples are homogeneous and non-homogeneous Poisson processes [10], self-exciting point\nprocesses [11], self-correcting point process models [12], and survival processes [8]. Unfortunately,\nthey make various parametric assumptions about the latent dynamics governing the generation of the\nobserved point patterns. As a consequence, model misspecification can significantly degrade the\nperformance of point process models, as our experimental results later confirm.\nTo address the aforementioned problem, the authors in [13, 14, 15] propose to learn a general\nrepresentation of the underlying dynamics from the event history without assuming a fixed parametric\nform in advance. The intensity function of the temporal point process is viewed as a nonlinear function\nof the history of the process and is parameterized using a recurrent neural network. Attentional\nmechanisms are also explored to discover the underlying structure [16]. Evidently, this line of work still\nrelies on explicit modeling of the intensity function. However, in many tasks such as data generation\nor event prediction, knowledge of the whole intensity function is unnecessary. 
On the other hand,\nsampling sequences from intensity-based models is usually performed via a thinning algorithm [17],\nwhich is computationally expensive; many sampled events might be discarded in the rejection\nstep, especially when the intensity exhibits high variation. More importantly, most of the methods\nbased on intensity function are trained by maximizing log likelihood or a lower bound on it. They\nare asymptotically equivalent to minimizing the Kullback-Leibler (KL) divergence between the data\nand model distributions, which suffers from serious issues such as mode dropping [18, 19].\nGenerative Adversarial Networks (GANs) [20] have proven to be a promising alternative to traditional\nmaximum likelihood approaches [21, 22].\nIn this paper, for the first time, we bypass the intensity-based modeling and likelihood-based estimation\nof temporal point processes and propose a neural network-based model with a generative\nadversarial learning scheme for point processes. In GANs, two models are used to solve a minimax\ngame: a generator which samples synthetic data from the model, and a discriminator which classifies\nthe data as real or synthetic. Theoretically speaking, these models are capable of modeling\nan arbitrarily complex probability distribution, including distributions over discrete events. They\nachieve state-of-the-art results on a variety of generative tasks such as image generation, image\nsuper-resolution, 3D object generation, and video prediction [23, 24].\nThe original GAN in [20] minimizes the Jensen-Shannon (JS) divergence and is regarded as highly unstable and\nprone to miss modes. Recently, the Wasserstein GAN (WGAN) [25] was proposed, which uses the Earth Moving\ndistance (EM) as an objective for training GANs. 
Furthermore it is shown that the EM objective, as\na metric between probability distributions [26] has many advantages as the loss function correlates\nwith the quality of the generated samples and reduces mode dropping [27]. Moreover, it leverages\nthe geometry of the space of event sequences in terms of their distance, which is not the case for an\nMLE-based approach. In this paper we extend the notion of WGAN for temporal point processes\nand adopt a Recurrent Neural Network (RNN) for training. Importantly, we are able to demonstrate\nthat Wasserstein distance training of RNN point process models outperforms the same architecture\ntrained using MLE.\nIn a nutshell, the contributions of the paper are: i) We propose the \ufb01rst intensity-free generative model\nfor point processes and introduce the \ufb01rst (to our best knowledge) likelihood-free corresponding\nlearning methods; ii) We extend WGAN for point processes with Recurrent Neural Network architec-\nture for sequence generation learning; iii) In contrast to the usual subjective measures of evaluating\nGANs we use a statistical and a quantitative measure to compare the performance of the model to\nthe conventional ones. iv) Extensive experiments involving various types of point processes on both\nsynthetic and real datasets show the promising performance of our approach.\n2 Proposed Framework\nIn this section, we de\ufb01ne Point Processes in a way that is suitable to be combined with the WGANs.\n2.1 Point Processes\nLet S be a compact space equipped with a Borel \u03c3-algebra B. Take \u039e as the set of counting measures\non S with C as the smallest \u03c3-algebra on it. Let (\u2126,F, P) be a probability space. A point process\non S is a measurable map \u03be : \u2126 \u2192 \u039e from the probability space (\u2126,F, P) to the measurable space\n(\u039e,C). 
Figure 1-a illustrates this mapping.\n\nEvery realization of a point process \u03be can be written as \u03be = \u2211_{i=1}^{n} \u03b4_{X_i}, where \u03b4 is the Dirac measure,\nn is an integer-valued random variable and the X_i\u2019s are random elements of S, or events. A point process\ncan be equivalently represented by a counting process: N(B) := \u222b_B \u03be(x)dx, which basically is\nthe number of events in each Borel subset B \u2208 B of S. The mean measure M of a point process\n\u03be is a measure on S that assigns to every B \u2208 B the expected number of events of \u03be in B, i.e.,\nM(B) := E[N(B)] for all B \u2208 B.\n\nFor an inhomogeneous Poisson process, M(B) = \u222b_B \u03bb(x)dx, where the intensity function \u03bb(x) is\na positive measurable function on S. Intuitively speaking, \u03bb(x)dx is the expected number of events\nin the infinitesimal dx. For the most common type of point process, a homogeneous Poisson process,\n\u03bb(x) = \u03bb and M(B) = \u03bb|B|, where | \u00b7 | is the Lebesgue measure on (S,B). More generally, in Cox\npoint processes, \u03bb(x) can be a random density possibly depending on the history of the process. For\nany point process, given \u03bb(\u00b7), N(B) \u223c Poisson(\u222b_B \u03bb(x)dx). In addition, if B_1, . . . , B_k \u2208 B are\ndisjoint, then N(B_1), . . . , N(B_k) are independent conditioning on \u03bb(\u00b7).\nFor the ease of exposition, we will present the framework for the case where the events are happening\nin the real half-line of time. But the framework is easily extensible to the general space.\n2.2 Temporal Point Processes\nA particularly interesting case of point processes is given when S is the time interval [0, T), which\nwe will call a temporal point process. Here, a realization is simply a set of time points: \u03be = \u2211_{i=1}^{n} \u03b4_{t_i}.\nWith a slight notation abuse we will write \u03be = {t_1, . . . 
, t_n} where each t_i is a random time before T.\nUsing a conditional intensity (rate) function is the usual way to characterize point processes.\nFor an inhomogeneous Poisson process (IP), the intensity \u03bb(t) is a fixed non-negative function supported\nin [0, T). For example, it can be a multi-modal function comprised of k Gaussian kernels:\n\n\u03bb(t) = \u2211_{i=1}^{k} \u03b1_i (2\u03c0\u03c3_i^2)^{\u22121/2} exp(\u2212(t \u2212 c_i)^2/\u03c3_i^2), for t \u2208 [0, T),\n\nwhere c_i and \u03c3_i are the fixed center and standard deviation, respectively, and \u03b1_i is the weight (or importance) for kernel i.\nA self-exciting (Hawkes) process (SE) is a Cox process where the intensity is determined by previous\n(random) events in a special parametric form: \u03bb(t) = \u00b5 + \u03b2 \u2211_{t_i \u003c t} g(t \u2212 t_i). This process has an implication that the\noccurrence of an event will increase the probability of near future events and its influence will\n(usually) decrease over time, as captured by the (usually) decaying fixed kernel g. \u00b5 is the exogenous\nrate of firing events and \u03b2 is the coefficient for the endogenous rate.\nIn contrast, in self-correcting processes (SC), an event will decrease the probability of an event: [...]\nTherefore,\ng_\u03b8 : \u039e \u2192 \u039e is a transformation in the space of counting measures. Note that \u03bb_z is part of the prior\nknowledge and belief about the problem domain. 
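As a rough illustration of the noise process that the generator transforms (Algorithm 1 samples it from a Poisson process with rate lambda_z), the sketch below draws realizations of a homogeneous Poisson process on [0, T) by accumulating exponential inter-arrival gaps. The function name, the rate 2.0, the horizon 15.0 and the batch size 256 are assumptions chosen for illustration; they are not taken from the authors' code.

```python
import random


def sample_homogeneous_poisson(rate, horizon, rng):
    # One realization on [0, horizon): add up exponential gaps with
    # mean 1 / rate until the running time leaves the window.
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Hypothetical settings: rate 2.0 on [0, 15), 256 noise sequences.
rng = random.Random(0)
noise_batch = [sample_homogeneous_poisson(2.0, 15.0, rng) for _ in range(256)]
mean_count = sum(len(z) for z in noise_batch) / len(noise_batch)
print(round(mean_count, 2))
```

The average number of events per sequence should be close to rate times horizon, which gives a quick sanity check on the sampler.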
Therefore, the objective of learning the generative\nmodel can be written as min W(P_r, P_g) or equivalently:\n\nmin_\u03b8 max_{w\u2208W, \u2016f_w\u2016_L \u2264 1} E_{\u03be\u223cP_r}[f_w(\u03be)] \u2212 E_{\u03b6\u223cP_z}[f_w(g_\u03b8(\u03b6))]    (6)\n\nIn GAN terminology f_w is called the discriminator and g_\u03b8 is known as the generator model. We\nestimate the generative model by enforcing that the sample sequences from the model have the same\ndistribution as training sequences. Given L sample sequences from real data D_r = {\u03be_1, . . . , \u03be_L} and\nfrom the noise D_z = {\u03b6_1, . . . , \u03b6_L}, the two expectations are estimated empirically:\nE_{\u03be\u223cP_r}[f_w(\u03be)] = (1/L) \u2211_{l=1}^{L} f_w(\u03be_l) and E_{\u03b6\u223cP_z}[f_w(g_\u03b8(\u03b6))] = (1/L) \u2211_{l=1}^{L} f_w(g_\u03b8(\u03b6_l)).\n2.5 Ingredients of WGANTPP\nTo proceed with our point process based WGAN, we need the generator function g_\u03b8 : \u039e \u2192 \u039e, the\ndiscriminator function f_w : \u039e \u2192 R, and a way to enforce the Lipschitz constraint on f_w. Figure 4 in Appendix A\nillustrates the data flow for WGANTPP.\nThe generator transforms a given sequence to another sequence. Similar to [32, 33] we use Recurrent\nNeural Networks (RNN) to model the generator. For clarity, we use the vanilla RNN to illustrate the\ncomputational process as below. The LSTM is used in our experiments for its capacity to capture\nlong-range dependency. If the input and output sequences are \u03b6 = {z_1, . . . , z_n} and \u03c1 = {t_1, . . . 
, t_n}\nthen the generator g_\u03b8(\u03b6) = \u03c1 works according to\n\nh_i = \u03c6_g^h(A_g^h z_i + B_g^h h_{i\u22121} + b_g^h),    t_i = \u03c6_g^x(B_g^x h_i + b_g^x)    (7)\n\nHere h_i is the k-dimensional history embedding vector and \u03c6_g^h and \u03c6_g^x are the activation functions.\nThe parameter set of the generator is \u03b8 = {(A_g^h)_{k\u00d71}, (B_g^h)_{k\u00d7k}, (b_g^h)_{k\u00d71}, (B_g^x)_{1\u00d7k}, (b_g^x)_{1\u00d71}}.\nSimilarly, we define the discriminator function, which assigns a scalar value f_w(\u03c1) = \u2211_{i=1}^{n} a_i to the\nsequence \u03c1 = {t_1, . . . , t_n} according to\n\nh_i = \u03c6_d^h(A_d^h t_i + B_d^h h_{i\u22121} + b_d^h),    a_i = \u03c6_d^a(B_d^a h_i + b_d^a)    (8)\n\nwhere the parameter set is comprised of w = {(A_d^h)_{k\u00d71}, (B_d^h)_{k\u00d7k}, (b_d^h)_{k\u00d71}, (B_d^a)_{1\u00d7k}, (b_d^a)_{1\u00d71}}.\nNote that both generator and discriminator RNNs are causal networks. Each event is only influenced\nby the previous events. To enforce the Lipschitz constraints the original WGAN paper [18] adopts\nweight clipping. However, our initial experiments show inferior performance when using weight\nclipping. This is also reported by the same authors in their follow-up paper [27] to the original work.\nThe poor performance of weight clipping for enforcing 1-Lipschitz can be seen theoretically as well:\njust consider a simple neural network with one input, one neuron, and one output: f(x) = \u03c3(wx + b)\nand the weight clipping |w| < c. Then,\n\n|f\u2032(x)| \u2264 1 \u21d4 |w\u03c3\u2032(wx + b)| \u2264 1 \u21d4 |w| \u2264 1/|\u03c3\u2032(wx + b)|    (9)\n\nIt is clear that when 1/|\u03c3\u2032(wx + b)| < c, which is quite likely to happen, the Lipschitz constraint is\nnot necessarily satisfied. 
In our work, we use a novel approach for enforcing the Lipschitz constraints,\navoiding the computation of the gradient which can be costly and difficult for point processes. 
We\nadd the Lipschitz constraint as a regularization term to the empirical loss of RNN.\n\nmin_\u03b8 max_{w\u2208W, \u2016f_w\u2016_L \u2264 1} (1/L) \u2211_{l=1}^{L} f_w(\u03be_l) \u2212 (1/L) \u2211_{l=1}^{L} f_w(g_\u03b8(\u03b6_l)) \u2212 \u03bd \u2211_{l,m=1}^{L} | |f_w(\u03be_l) \u2212 f_w(g_\u03b8(\u03b6_m))| / |\u03be_l \u2212 g_\u03b8(\u03b6_m)|\u22c6 \u2212 1 |    (10)\n\nWe can take each of the (2L choose 2) pairs of real and generator sequences, and regularize based on them;\nhowever, we have seen that only a small portion of pairs (O(L)), randomly selected, is sufficient. The\nprocedure of WGANTPP learning is given in Algorithm 1.\nRemark The significance of Lipschitz constraint and regularization (or more generally any capacity\ncontrol) is more apparent when we consider the connection of W-distance and optimal transport\nproblem [28]. Basically, minimizing the W-distance between the empirical distribution and the model\ndistribution is equivalent to a semidiscrete optimal transport [28]. Without capacity control for the\ngenerator and discriminator, the optimal solution simply maps a partition of the sample space to the\nset of data points, in effect, memorizing the data points.\n\nAlgorithm 1 WGANTPP for Temporal Point Process. 
The default values \u03b1 = 1e\u22124, \u03b2_1 = 0.5,\n\u03b2_2 = 0.9, m = 256, n_critic = 5.\nRequire: the regularization coefficient \u03bd for direct Lipschitz constraint; the batch size, m; the\nnumber of iterations of the critic per generator iteration, n_critic; Adam hyper-parameters \u03b1, \u03b2_1, \u03b2_2.\nRequire: w_0, initial critic parameters; \u03b8_0, initial generator\u2019s parameters.\n1: set prior \u03bb_z to the expectation of event rate for real data.\n2: while \u03b8 has not converged do\n3:   for t = 1, ..., n_critic do\n4:     Sample point process realizations {\u03be^(i)}_{i=1}^{m} \u223c P_r from real data.\n5:     Sample {\u03b6^(i)}_{i=1}^{m} \u223c P_z from a Poisson process with rate \u03bb_z.\n6:     L\u2032 \u2190 [(1/m) \u2211_{i=1}^{m} f_w(g_\u03b8(\u03b6^(i))) \u2212 (1/m) \u2211_{i=1}^{m} f_w(\u03be^(i))] + \u03bd \u2211_{i,j=1}^{m} | |f_w(\u03be^(i)) \u2212 f_w(g_\u03b8(\u03b6^(j)))| / |\u03be^(i) \u2212 g_\u03b8(\u03b6^(j))|\u22c6 \u2212 1 |\n7:     w \u2190 Adam(\u2207_w L\u2032, w, \u03b1, \u03b2_1, \u03b2_2)\n8:   end for\n9:   Sample {\u03b6^(i)}_{i=1}^{m} \u223c P_z from a Poisson process with rate \u03bb_z.\n10:  \u03b8 \u2190 Adam(\u2212\u2207_\u03b8 (1/m) \u2211_{i=1}^{m} f_w(g_\u03b8(\u03b6^(i))), \u03b8, \u03b1, \u03b2_1, \u03b2_2)\n11: end while\n\nFigure 2: Performance of different methods on various synthetic data. Top row: QQ plot slope\ndeviation; middle row: intensity deviation in basic conventional models; bottom row: intensity\ndeviation in mixture of conventional processes.\n\n3 Experiments\nThe current work aims at exploring the feasibility of modeling point process without prior knowledge\nof its underlying generating mechanism. 
To this end, we compare against the most widely used parametrized point processes,\ni.e., self-exciting, self-correcting, and inhomogeneous Poisson processes, and against one flexible neural\nnetwork model, the neural point process. In this work we use the most general forms for\nsimple and clear exposition to the reader and propose the very first model in adversarial training of\npoint processes in contrast to likelihood based models.\n\n3.1 Datasets and Protocol\nSynthetic datasets. We simulate 20,000 sequences over time [0, T) where T = 15, for the inhomogeneous\nprocess (IP), self-exciting process (SE), self-correcting process (SC), and recurrent neural point process\n(NN). 
We also create another 4 (= C_4^3) datasets from the above 4 synthetic data by a uniform mixture\nfrom the triplets. The new datasets IP+SE+SC, IP+SE+NN, IP+SC+NN, SE+SC+NN are created to\ntest for the mode dropping problem of learning a generative model.\n\n[Figure 2 panels. Top row: QQ plots (Theoretical Quantiles vs. Sample Quantiles) for data generated by IP, SE, SC, NN. Middle row: intensity vs. time for data generated by IP, SE, SC, NN. Bottom row: intensity vs. time for data generated by IP+SE+SC, IP+SC+NN, IP+SE+NN, SE+SC+NN. Legend: Real, IP, SE, SC, NN, WGAN.]\n\nTable 1: Deviation of QQ plot slope and empirical intensity for ground-truth and learned model\n\nMetric | Data | MLE-IP | MLE-SE | MLE-SC | MLE-NN | WGAN\nQQP. Dev. | IP | 0.035 (8.0e-4) | 0.284 (7.0e-5) | 0.159 (3.8e-5) | 0.216 (3.3e-2) | 0.033 (3.3e-3)\nQQP. Dev. | SE | 0.055 (6.5e-5) | 0.001 (1.3e-6) | 0.086 (1.1e-6) | 0.104 (6.7e-3) | 0.051 (1.8e-3)\nQQP. Dev. | SC | 3.510 (4.9e-5) | 2.778 (7.4e-5) | 0.002 (8.8e-6) | 4.523 (2.6e-3) | 0.070 (6.4e-3)\nQQP. Dev. | NN | 0.182 (1.6e-5) | 0.687 (5.0e-6) | 1.004 (2.5e-6) | 0.065 (1.2e-2) | 0.012 (4.7e-3)\nInt. Dev. | IP | 0.110 (1.9e-4) | 0.241 (1.0e-4) | 0.289 (2.8e-5) | 0.511 (1.8e-1) | 0.136 (8.7e-3)\nInt. Dev. | SE | 1.950 (4.8e-4) | 0.019 (1.84e-5) | 1.112 (3.1e-6) | 0.414 (1.6e-1) | 0.860 (6.2e-2)\nInt. Dev. | SC | 2.208 (7.0e-5) | 0.653 (1.2e-4) | 0.006 (9.9e-5) | 1.384 (1.7e-1) | 0.302 (2.2e-3)\nInt. Dev. | NN | 1.044 (2.4e-4) | 0.889 (1.2e-5) | 1.101 (1.3e-4) | 0.341 (3.4e-1) | 0.144 (4.28e-2)\nInt. Dev. | IP+SE+SC | 1.505 (3.3e-4) | 0.410 (1.8e-5) | 0.823 (3.1e-6) | 0.929 (1.6e-1) | 0.305 (6.1e-2)\nInt. Dev. | IP+SC+NN | 1.178 (7.0e-5) | 0.588 (1.3e-4) | 0.795 (9.9e-5) | 0.713 (1.7e-1) | 0.525 (2.2e-3)\nInt. Dev. | IP+SE+NN | 1.052 (2.4e-4) | 0.453 (1.2e-4) | 0.583 (1.0e-4) | 0.678 (3.4e-1) | 0.419 (4.2e-2)\nInt. Dev. 
| SE+SC+NN | 1.825 (2.8e-4) | 0.324 (1.1e-4) | 1.269 (1.1e-4) | 0.286 (3.6e-1) | 0.200 (3.8e-2)\n\nThe parameter setting follows:\ni) Inhomogeneous process. The intensity function is independent from history and given in Sec. 2.2,\nwhere k = 3, \u03b1 = [3, 7, 11], c = [1, 1, 1], \u03c3 = [2, 3, 2].\nii) Self-exciting process. The past events increase the rate of future events. The conditional intensity\nfunction is given in Sec. 2.2 where \u00b5 = 1.0, \u03b2 = 0.8 and the decaying kernel g(t \u2212 t_i) = e^{\u2212(t\u2212t_i)}.\niii) Self-correcting process. The conditional intensity function is defined in Sec. 2.2. It increases\nwith time and decreases with each event occurrence. We set \u03b7 = 1.0, \u03b3 = 0.2.\niv) Recurrent Neural Network process. The conditional intensity is given in Sec. 2.2, where the\nneural network\u2019s parameters are set randomly and fixed. 
We \ufb01rst feed random variable from [0,1]\nuniform distribution, and then iteratively sample events from the intensity and feed the output into\nthe RNN to get the new intensity for the next step.\nReal datasets. We collect sequences separately from four public available datasets, namely, health-\ncare MIMIC-III, public media MemeTracker, NYSE stock exchanges, and publications citations. The\ntime scale for all real data are scaled to [0,15], and the details are as follows:\ni) MIMIC. MIMIC-III (Medical Information Mart for Intensive Care III) is a large, publicly available\ndataset, which contains de-identi\ufb01ed health-related data during 2001 to 2012 for more than 40,000\npatients. We worked with patients who appear at least 3 times, which renders 2246 patients. Their\nvisiting timestamps are collected as the sequences.\nii) Meme. MemeTracker tracks the meme diffusion over public media, which contains more than\n172 million news articles or blog posts. The memes are sentences, such as ideas, proverbs, and the\ntime is recorded when it spreads to certain websites. We randomly sample 22,000 cascades.\niii) MAS. Microsoft Academic Search provides access to its data, including publication venues, time,\ncitations, etc. We collect citations records for 50,000 papers.\niv) NYSE. We use 0.7 million high-frequency transaction records from NYSE for a stock in one day.\nThe transactions are evenly divided into 3,200 sequences with equal durations.\n3.2 Experimental Setup\nDetails. We can feed the temporal sequences to generator and discriminator directly. In practice,\nall temporal sequences are transformed into time duration between two consecutive events, i.e.,\ntransforming the sequence \u03be = {t1, . . . , tn} into {\u03c41, . . . , \u03c4n\u22121}, where \u03c4i = ti+1\u2212ti. This approach\nmakes the model train easily and perform robustly. 
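The duration transform described above can be sketched in a few lines. This is a minimal illustration, assuming each sequence is a sorted list of event times; the function names are hypothetical and do not come from the released code. Since the transformed sequence holds only n - 1 gaps, the sketch keeps the first event time so that the original sequence can be rebuilt by accumulation.

```python
from itertools import accumulate


def to_intervals(times):
    # Gaps between consecutive events: tau_i = t_{i+1} - t_i.
    return [later - earlier for earlier, later in zip(times, times[1:])]


def to_times(first_time, intervals):
    # Inverse step: start from the first event time and accumulate the gaps.
    return list(accumulate(intervals, initial=first_time))


events = [0.5, 1.25, 3.0]
gaps = to_intervals(events)          # [0.75, 1.75]
rebuilt = to_times(events[0], gaps)  # [0.5, 1.25, 3.0]
print(gaps, rebuilt)
```

Positive gaps guarantee that the accumulated times are increasing, which is why a generator that outputs gaps only needs a positive output activation.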
The transformed sequences are statistically\nidentical to the original sequences, which can be used as the inputs of our neural network. To make\nsure that the times are increasing, we use the elu + 1 activation function to produce positive inter-arrival\ntimes for the generator and accumulate the intervals to get the sequence. The Adam optimization\nmethod with learning rate 1e-4, \u03b2_1 = 0.5, \u03b2_2 = 0.9, is applied. The code is available at\nhttps://github.com/xiaoshuai09/Wasserstein-Learning-For-Point-Process.\nBaselines. We compare the proposed method of learning point processes (i.e., minimizing sample\ndistance) with maximum likelihood based methods for point process. To use MLE inference for point\nprocess, we have to specify its parametric model. The parametric models used are the inhomogeneous\nPoisson process (mixture of Gaussians), self-exciting process, self-correcting process, and RNN. For\neach dataset, we use all the above solvers to learn the model and generate new sequences, and then we\ncompare the generated sequences with real ones.\nEvaluation metrics. Although our model is an intensity-free approach we will evaluate the performance\nby metrics that are computed via intensity. For all models, we work with the empirical\nintensity instead. Note that our objective measures are in sharp contrast with the best practices in\nGANs in which the performance is usually evaluated subjectively, e.g., by visual quality assessment.\nWe evaluate the performance of different methods to learn the underlying processes via two\nmeasures: 1) The first one is the well-known QQ plot of sequences generated from the learned model.\nThe quantile-quantile (q-q) plot is the graphical representation of the quantiles of the first data set\nagainst the quantiles of the second data set. 
From the time change property [10] of point processes,\nif the sequences come from the point process \u03bb(t), then the integral \u039b = \u222b_{t_i}^{t_{i+1}} \u03bb(s)ds between\nconsecutive events should be exponentially distributed with parameter 1. Therefore, the QQ plot of \u039b\nagainst the exponential distribution with rate 1 should fall approximately along a 45-degree reference line.\nThe evaluation procedure is as follows: i) The ground-truth data is generated from a model, say IP; ii)\nAll 5 methods are used to learn the unknown process using the ground-truth data; iii) The learned\nmodel is used to generate a sequence; iv) The sequence is compared against the theoretical quantiles from\nthe model to see if the sequence is really coming from the ground-truth generator or not; v) The\ndeviation from slope 1 is visualized or reported as a performance measure. 2) The second metric\nis the deviation between the empirical intensity from the learned model and the ground truth intensity.\nWe can estimate the empirical intensity \u03bb\u2032(t) = E(N(t + \u03b4t) \u2212 N(t))/\u03b4t from a sufficient number of\nrealizations of the point process by counting the average number of events during [t, t + \u03b4t], where\nN(t) is the count process for \u03bb(t). The L1 distance between the ground-truth empirical intensity and\nthe learned empirical intensity is reported as a performance measure.\n3.3 Results and Discussion\nSynthetic data. Figure 2 presents the learning ability of WGANTPP when the ground-truth data is\ngenerated via different types of point process. We first compare the QQ plots in the top row from the\nmicro perspective, where the QQ plot describes the dependency between events. The red dots labeled\nReal give the optimal QQ distribution, for which the intensity function that generates the sequences is\nknown. 
We can observe that even though WGANTPP has no prior information about the ground-truth\npoint process, it estimates the model better than every estimator except the one that knows the parametric\nform of the data. 
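The second metric can be sketched as follows: a binned estimate of the empirical intensity, followed by an L1 deviation between two such estimates. This is only an illustrative sketch; the function names, the bin count, the toy sequences, and the choice to approximate the L1 distance by a Riemann sum over equal-width bins are all assumptions rather than details given in the paper.

```python
def empirical_intensity(sequences, horizon, num_bins):
    # Average number of events per unit time in each bin of [0, horizon).
    width = horizon / num_bins
    counts = [0] * num_bins
    for seq in sequences:
        for t in seq:
            if 0.0 <= t < horizon:
                counts[min(int(t / width), num_bins - 1)] += 1
    return [c / (len(sequences) * width) for c in counts]


def l1_deviation(intensity_a, intensity_b, width):
    # Riemann-sum approximation of the L1 distance between two binned curves.
    return sum(abs(a - b) for a, b in zip(intensity_a, intensity_b)) * width


# Toy data: two real and two generated sequences on [0, 3).
real = [[0.5, 1.5], [0.2, 2.5]]
generated = [[0.1], [1.1, 2.9]]
real_curve = empirical_intensity(real, 3.0, 3)
generated_curve = empirical_intensity(generated, 3.0, 3)
deviation = l1_deviation(real_curve, generated_curve, 1.0)
print(real_curve, generated_curve, deviation)
```

With more sequences and narrower bins the binned curve approaches the expectation in the definition of the empirical intensity, at the cost of noisier per-bin counts.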
This is quite expected: when we train a model and know the parametric\nform of the generating model, we can recover it better. However, whenever the model is misspecified\n(i.e., we don\u2019t know the parametric form a priori) WGANTPP outperforms the other parametric forms\nand the RNN approach. The middle row of Figure 2 compares the empirical intensity. The Real line is\nthe optimal empirical intensity estimated from the real data. The estimator can recover the empirical\nintensity well when we know the parametric form the data comes from. Otherwise, the\nestimated intensity degrades considerably when the model is misspecified. We can observe that\nWGANTPP reproduces the empirical intensity better and performs robustly across different types of\npoint process data. For MLE-IP, different numbers of kernels are tested; the empirical intensity\nresults improve but the QQ plot results degrade when the number of kernels increases, so only the result for\n3 kernels is shown, mainly for clarity of presentation. The fact that the empirical intensity estimated\nby the MLE-IP method is good while its QQ plots are very bad indicates that the inhomogeneous Poisson\nprocess can capture the average intensity (macro dynamics) accurately but is incapable of capturing\nthe dependency between events (micro dynamics). To test whether WGANTPP can cope with mode\ndropping, we generate mixtures of data from three different point processes and use this data to\ntrain different models. Models with a specified form can handle limited types of data and fail to learn\nfrom diverse data sources. The last row of Figure 2 shows the learned intensity from mixtures of\ndata. WGANTPP produces better empirical intensity than the alternatives, which fail to capture the\nheterogeneity in the data. 
To verify the robustness of WGANTPP, we randomly initialize the generator\nparameters and run 10 rounds to get the mean and std of deviations for both empirical intensity and\nQQ plot from ground truth. 
For empirical intensity, we compute the integral of the difference between the learned\nintensity and the ground-truth intensity. Table 1 reports the mean and std of deviations for intensity\ndeviation. For each estimator, we obtain the slope from the regression line for its QQ plot. Table 1\nalso reports the mean and std of deviations for the slope of the QQ plot. Compared to the MLE estimators,\nWGANTPP consistently outperforms even without prior knowledge about the parametric form of the\ntrue underlying generative point process. Note that for mixture models the QQ plot is not feasible.\nReal-world data. We evaluate WGANTPP on diverse real-world data from health-care,\npublic media, scientific activities and stock exchange. For those real-world data, the underlying\ngenerative process is unknown; previous works usually assume that they are certain types of point\nprocess based on domain knowledge. Figure 3 shows the intensity learned from different models,\nwhere Real is estimated from the real-world data itself. Table 2 reports the intensity deviation. When\nall models have no prior knowledge about the true generative process, WGANTPP recovers intensity\nbetter than all the other models across the data sets.\n\nFigure 3: Performance of different methods on various real-world datasets.\n\nTable 2: Deviation of empirical intensity for real-world data.\nData\n\nEstimator\n\nMIMIC\nMeme\nMAS\nNYSE\n\nMLE-IP MLE-SE MLE-SC MLE-NN WGAN\n0.122\n0.351\n0.849\n0.303\n\n0.686\n0.920\n2.712\n0.347\n\n0.150\n0.839\n1.089\n0.799\n\n0.160\n1.008\n1.693\n0.426\n\n0.339\n0.701\n1.592\n0.361\n\nAnalysis. We have observed that when the generating model is misspecified WGANTPP outperforms\nthe other methods without leveraging a priori knowledge of the parametric form. However, when\nthe exact parametric form of the data is known and is utilized to learn the parameters, MLE\nwith this full knowledge performs better. 
However, this is generally a strong assumption. 
As we\nhave observed in the real-world experiments, WGANTPP is superior in terms of performance.\nSomewhat surprising is the observation that WGANTPP tends to outperform the MLE-NN approach,\nwhich uses essentially the same RNN architecture but is trained using MLE. The superior performance\nof our approach compared to MLE-NN is further evidence of the benefits of using the W-distance\nin finding a generator that fits the observed sequences well. Even though the expressive power of\nthe estimators is the same for WGANTPP and MLE-NN, MLE-NN may suffer from mode dropping\nor get stuck in an inferior local minimum, since maximizing likelihood is asymptotically equivalent\nto minimizing the Kullback-Leibler (KL) divergence between the data and model distributions. The\ninherent weakness of the KL divergence [25] makes MLE-NN perform unstably, and the large variances\nof its deviations empirically demonstrate this point.\n4 Conclusion and Future Work\nWe have presented a novel approach for Wasserstein learning of deep generative point processes\nthat requires no prior knowledge about the underlying true process and can estimate it accurately\nacross a wide range of theoretical and real-world processes. In future work, we would like to\nexplore the connection of the WGAN with the optimal transport problem. We will also explore other\npossible distance metrics over the realizations of point processes, and more sophisticated transforms\nof point processes, particularly those that are causal. Extending the current work to marked point\nprocesses and processes over structured spaces is another interesting avenue for future work.\nAcknowledgements. This project was supported in part by NSF (IIS-1639792, IIS-1218749, IIS-1717916,\nCMMI-1745382), NIH BIGDATA 1R01GM108341, NSF CAREER IIS-1350983, NSF\nCNS-1704701, ONR N00014-15-1-2340, NSFC 61602176, Intel ISTC, NVIDIA and Amazon AWS.\n\nReferences\n[1] DJ Daley and D Vere-Jones. 
An introduction to the theory of point processes. 2003.\n\n[2] Scott W Linderman and Ryan P Adams. Discovering latent network structure in point process data. In ICML, pages 1413\u20131421, 2014.\n\n[3] Mehrdad Farajtabar, Nan Du, Manuel Gomez-Rodriguez, Isabel Valera, Hongyuan Zha, and Le Song. Shaping social activity by incentivizing users. In NIPS, 2014.\n\n[4] Mehrdad Farajtabar, Xiaojing Ye, Sahar Harati, Hongyuan Zha, and Le Song. Multistage campaigning in social networks. In NIPS, 2016.\n\n[5] Wenzhao Lian, Ricardo Henao, Vinayak Rao, Joseph E Lucas, and Lawrence Carin. A multitask point process predictive model. In ICML, pages 2030\u20132038, 2015.\n\n[6] Lizhen Xu, Jason A Duan, and Andrew Whinston. Path to purchase: A mutually exciting point process model for online advertising and conversion. Management Science, 60(6):1392\u20131412, 2014.\n\n[7] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-Fran\u00e7ois Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.\n\n[8] Odd Aalen, Ornulf Borgan, and Hakon Gjessing. Survival and event history analysis: a process point of view. Springer Science & Business Media, 2008.\n\n[9] Mehrdad Farajtabar, Yichen Wang, Manuel Gomez-Rodriguez, Shuang Li, Hongyuan Zha, and Le Song. Coevolve: A joint point process model for information diffusion and network co-evolution. In NIPS, 2015.\n\n[10] John Frank Charles Kingman. Poisson processes. Wiley Online Library, 1993.\n\n[11] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 1971.\n\n[12] Valerie Isham and Mark Westcott. A self-correcting point process. 
Stochastic Processes and Their Applications, 8(3):335\u2013347, 1979.\n\n[13] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In KDD, 2016.\n\n[14] Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and Stephen M. Chu. Modeling the Intensity Function of Point Process Via Recurrent Neural Networks. In AAAI, 2017.\n\n[15] Hongyuan Mei and Jason Eisner. The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. In NIPS, 2017.\n\n[16] Shuai Xiao, Junchi Yan, Mehrdad Farajtabar, Le Song, Xiaokang Yang, and Hongyuan Zha. Joint Modeling of Event Sequence and Time Series with Attentional Twin Recurrent Neural Networks. arXiv preprint arXiv:1703.08524, 2017.\n\n[17] Yosihiko Ogata. On Lewis\u2019 simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23\u201331, 1981.\n\n[18] Martin Arjovsky and L\u00e9on Bottou. Towards principled methods for training generative adversarial networks. In NIPS 2016 Workshop on Adversarial Training. In review for ICLR, volume 2016, 2017.\n\n[19] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.\n\n[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.\n\n[21] Ferenc Husz\u00e1r. How (not) to train your generative model: Scheduled sampling, likelihood, adversary? arXiv preprint arXiv:1511.05101, 2015.\n\n[22] Lucas Theis, A\u00e4ron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.\n\n[23] Anh Nguyen, Jason Yosinski, Yoshua Bengio, Alexey Dosovitskiy, and Jeff Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. 
arXiv preprint arXiv:1612.00005, 2016.\n\n[24] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.\n\n[25] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n\n[26] Ding Zhou, Jia Li, and Hongyuan Zha. A new Mallows distance based metric for comparing clusterings. In ICML, pages 1028\u20131035, 2005.\n\n[27] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.\n\n[28] C\u00e9dric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.\n\n[29] Marco Cuturi and Mathieu Blondel. Soft-DTW: a Differentiable Loss Function for Time-Series. In ICML, pages 894\u2013903, 2017.\n\n[30] Dominic Schuhmacher and Aihua Xia. A new metric between distributions of point processes. Advances in Applied Probability, 40(3):651\u2013672, 2008.\n\n[31] Laurent Decreusefond, Matthias Schulte, Christoph Th\u00e4le, et al. Functional Poisson approximation in Kantorovich\u2013Rubinstein distance with applications to U-statistics and stochastic geometry. The Annals of Probability, 44(3):2147\u20132197, 2016.\n\n[32] Olof Mogren. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.\n\n[33] Arnab Ghosh, Viveka Kulharia, Amitabha Mukerjee, Vinay Namboodiri, and Mohit Bansal. Contextual RNN-GANs for abstract reasoning diagram generation. 
arXiv preprint arXiv:1609.09444, 2016.\n", "award": [], "sourceid": 1846, "authors": [{"given_name": "Shuai", "family_name": "Xiao", "institution": "Georgia Institute of Technology"}, {"given_name": "Mehrdad", "family_name": "Farajtabar", "institution": "Georgia Tech"}, {"given_name": "Xiaojing", "family_name": "Ye", "institution": "Georgia State University"}, {"given_name": "Junchi", "family_name": "Yan", "institution": "IBM Research - China"}, {"given_name": "Le", "family_name": "Song", "institution": "Georgia Institute of Technology"}, {"given_name": "Hongyuan", "family_name": "Zha", "institution": "Georgia Tech"}]}