{"title": "Nonparametric Regressive Point Processes Based on Conditional Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1064, "page_last": 1074, "abstract": "Real-world event sequences consist of complex mixtures of different types of events occurring in time. An event may depend on past events of the same type, as well as, the other types. Point processes define a general class of models for event sequences. ``Regressive point processes'' refer to point processes that directly model the dependency between an event and any past event, an example of which is a Hawkes process. In this work, we propose and develop a new nonparametric regressive point process model based on Gaussian processes. We show that our model can represent better many commonly observed real-world event sequences and capture the dependencies between events that are difficult to model using existing nonparametric Hawkes process variants. We demonstrate the improved predictive performance of our model against state-of-the-art baselines on multiple synthetic and real-world datasets.", "full_text": "Nonparametric Regressive Point Processes Based on\n\nConditional Gaussian Processes\n\nDepartment of Computer Science\n\nDepartment of Computer Science\n\nMilos Hauskrecht\n\nUniversity of Pittsburgh\nPittsburgh, PA 15213\nmilos@pitt.edu\n\nSiqi Liu\n\nUniversity of Pittsburgh\nPittsburgh, PA 15213\n\nsiqiliu@cs.pitt.edu\n\nAbstract\n\nReal-world event sequences consist of complex mixtures of different types of\nevents occurring in time. An event may depend on past events of the same type,\nas well as, the other types. Point processes de\ufb01ne a general class of models for\nevent sequences. \u201cRegressive point processes\u201d refer to point processes that directly\nmodel the dependency between an event and any past event, an example of which\nis a Hawkes process. 
In this work, we propose and develop a new nonparametric regressive point process model based on Gaussian processes. We show that our model can better represent many commonly observed real-world event sequences and capture the dependencies between events that are difficult to model using existing nonparametric Hawkes process variants. We demonstrate the improved predictive performance of our model against state-of-the-art baselines on multiple synthetic and real-world datasets.\n\n1 Introduction\n\nEvent sequences consist of timestamps of events occurring over a period of time. They arise commonly in our everyday life. Examples are the spread of news in a social network, buying and selling actions in a stock market, occurrences of earthquakes in a region, the administration of medications to a patient, and many others. Because of their wide applicability, event sequences and their models have become popular in machine learning research.\nPoint processes [Daley and Vere-Jones, 2003, 2007] can model event sequences by representing events as points in the one-dimensional space: time. In general, a point process defines a probability distribution over points in a space. For a (temporal) point process, the distribution is uniquely determined by its intensity function, which defines the rate of events occurring at any instant.\nTwo main types of point process models have been developed independently over the years. One type is what we call \u201cregressive point processes\u201d, where the dependencies of the intensity function on the past events are directly modeled. Hawkes processes [Hawkes, 1971] are the most studied and used class of regressive point processes (e.g., [Zhou et al., 2013a, Zhou et al., 2013b, Bacry and Muzy, 2014, Lee et al., 2016, Wang et al., 2016, Xu et al., 2016, Eichler et al., 2017]). A benefit of regressive point processes is that they are easy to apply and interpret.
They can be learned on a set\nof sequences and then applied on another unseen set of sequences. Since the in\ufb02uence of each past\nevent on the intensity is explicitly modeled, it is easy to see how different types of events in\ufb02uence\neach other over time.\nAnother type is what we call \u201clatent-state point processes\u201d, where the dependencies of the intensity\non the past events are indirectly modeled through a latent state. Based on the past latent state, we can\ninfer the future latent state and thereby predict future events. The most studied class of latent-state\npoint processes are Gaussian-process-modulated point processes (e.g., [Adams et al., 2009],[Lasko,\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f2014],[Rao and Teh, 2011],[Gunter et al., 2014],[Lloyd et al., 2015],[Lloyd et al., 2016],[Ding et al.,\n2018],[Kim, 2018]). They use some transformation of a Gaussian process (GP) as the prior for the\nintensity function, which provides a probabilistic distribution over possible intensity functions and\nacts as the latent state. Then the posterior of the intensity function can be inferred from the data. The\nmain bene\ufb01t of GP-modulated point processes is that they provide a principled way to \ufb02exibly model\nthe intensity functions. However, a signi\ufb01cant drawback is that they are harder to apply, compared\nwith Hawkes processes, due to the need of inferring a separate latent state for each sequence. To\nmake inference on any new sequence, long enough history of the sequence must be available. It is\nimpossible to learn a model from a set of sequences and apply it to other unseen sequences (i.e. cold\nstart).\nIn this work, we propose a new nonparametric model, GP regressive point process (GPRPP), com-\nbining the advantages of the above two models: the \ufb02exibility of GP-modulated point processes\nand the applicability of Hawkes processes. 
(A discussion of related work is in the supplementary\nmaterial.) Similar to Hawkes processes, our model directly captures the dependencies of the intensity\nfunction on the past events. However, unlike Hawkes processes, the dependencies are modeled\nnonparametrically through a GP. Meanwhile, different from GP-modulated point processes, the input\nof our GP is not de\ufb01ned by the \u201cabsolute\u201d time relative to each sequence, but by the collection of\n\u201crelative\u201d times from past events of different types. This de\ufb01nes a latent state independent of speci\ufb01c\nsequences, and therefore can be learned from and applied to different sequences. Figure 1 illustrates\nthe differences between GPRPP and the previous models.\nTo better model the dependencies of the intensity function on the past events, we propose a conditional\nGP model for GPRPP. It relies on a set of points introduced in the input space of the GP to capture\nthe dependencies independent of speci\ufb01c sequences. These points, although bearing a similarity to\nthe inducing points in sparse GPs [Qui\u00f1onero-Candela and Rasmussen, 2005, Titsias, 2009], function\nquite differently, because instead of marginalizing them out, we condition on them for inference.\n\n(a) Hawkes process\n\n(b) GP-modulated point process\n\n(c) GP regressive point process\n\nFigure 1: Illustrations of different point process models. The \ufb01rst three rows are a multivariate event\nsequence consisting of events (stems) of three types (u) on the timeline. The vertical line t marks the\ncurrent time. The last row is the estimated conditional intensity function (CIF) \u03bb3(\u00b7) for event type\nu = 3. (a) In a Hawkes process, the CIF depends on the past events through the triggering kernels\n\u03c6uiuj (Eq. 1). (b) In a GP-modulated point process, the entire (transformed) CIF on the speci\ufb01c\nsequence is a function with a GP prior. 
(c) In a GPRPP, the (transformed) CIF depends on the past events through a function with a GP prior.\n\n2 Preliminary: point processes\n\nThe training data consist of multiple sequences D = \{y_c\}_{c=1}^{|D|}. Each y_c is a sequence of (time, label) pairs y_c = \{(t_i, u_i)\}_{i=1}^{|y_c|}, representing the time and the type of each event, where t_i \in \mathbb{R}_{\ge 0} and u_i \in \{1, \dots, U\}. A (temporal) point process has a conditional intensity function (CIF) \lambda(t) = \lim_{dt \to 0^+} E[N([t, t + dt)) | H_t] / dt, defined as the instantaneous rate of events at time t given the history H_t up to t, where N(\cdot) counts the number of events in an interval. For example, for a Hawkes process, the CIF of event type u_i is defined as\n\n\lambda_{u_i}(t) = \mu_{u_i} + \sum_{t_j < t} \phi_{u_i u_j}(t - t_j) \quad (1)\n\nwhere \phi_{u_i u_j} is a function of time that characterizes the influence of past events of type u_j on type u_i. It is called a triggering kernel in previous works. Meanwhile, \mu_{u_i} defines the baseline intensity.\n\nThis is essentially a sum of D kernels, each of which is a product of two kernels K_1 and K_2. K_2 is the squared-exponential kernel on the value of x_d(t). K_1 is the inner product on the indicator I[x_d(t)]. We use the squared-exponential kernel, because it is widely used and has closed-form evaluations of \psi and \Psi as shown in the next section, but it can be replaced by any kernel with the latter property.\n\nRemark 1. The two inputs of the kernel have different notations for x and t, indicating they can come from different sequences at different absolute times. The kernel is actually isolated from the absolute time t in the individual sequences, since it only depends on the value of x(\cdot) at t.
This is very different from previous GP-based models (e.g., [Lloyd et al., 2015, 2016, Ding et al., 2018]), where the inputs of the kernel always come from the same sequence, and the kernel depends on the absolute time t in the sequence.\n\nWe make the following two assumptions to justify the definition of our model. A proof of Theorem 1 is in the supplementary material.\n\nAssumption 1. Events of each type u have time-limited influences on the target events. That is, for each type u there is a time limit \Delta T_u < \infty such that for any event of type u occurring at s_u, \lambda_{\tilde{u}}(t) may depend on s_u only if 0 < t - s_u \le \Delta T_u.\n\nAssumption 2. For each type u of events, there exists M_u : \mathbb{R} \to \mathbb{Z} such that for any bounded time interval I = [t_{beg}, t_{end}), |I| = t_{end} - t_{beg} < \infty, the number of events satisfies N_u(I) \le M_u(|I|) < \infty.\n\nTheorem 1. Given that Assumptions 1 and 2 hold, there exists Q < \infty such that \lambda_{\tilde{u}}(t) depends on at most the last Q events of any type at any time t.\n\n4 Conditional GPRPP\n\nIn this section, we propose a conditional GP model for GPRPP and call it the conditional GP regressive point process (CGPRPP). The input of the GP is defined in the previous section and denoted as x = x(t). When t is not important or is clear from the context, we just denote the input as x.\nPreviously, different forms of sparse GPs based on inducing variables have been proposed to improve the efficiency of GPs (e.g., [Quiñonero-Candela and Rasmussen, 2005, Titsias, 2009]). Typically, a set of inducing points is introduced in the input space, and the inducing variables corresponding to the points are marginalized out for learning and inference. Our idea is similar, but the difference is that we condition on the inducing variables, which correspond to the values of the CIF given different situations of the history.
Therefore, we call these points conditional points.\n\nLet Z \in X^M be a sequence of M conditional points in the input space. We will explain how to pick these points later. Given any input x \in X, the output of the GP is f_x = f(x) \in \mathbb{R}. Let f_Z = f(Z) \in \mathbb{R}^M. We define \mu_x and \mu_Z as the prior mean \mu of the appropriate dimensions for x and Z respectively. Let K_{xx'} = K(x, x') as defined in Eq. 5 for any inputs x and x'. If x and x' are vectors, K_{xx'} is the Gram matrix of the corresponding size. Then p(f_Z) = N(\mu_Z, K_{ZZ}), and p(f_x | f_Z) = N(\mu_{x|Z}, \sigma^2_{x|Z}), where\n\n\mu_{x|Z} = \mu_x + K_{xZ} K_{ZZ}^{-1} (f_Z - \mu_Z), \quad \sigma^2_{x|Z} = K_{xx} - K_{xZ} K_{ZZ}^{-1} K_{Zx} \quad (6)\n\nFrom Eqs. 3 and 4, the conditional density of the sequence y given f_x is\n\n\ln p_{\tilde{u}}(y | f_x) = \sum_{n=1}^{N} \delta(u_n, \tilde{u}) \ln f(x(t_n))^2 - \int_0^T f(x(t))^2 dt \quad (7)\n\nwhere N = |y| is the total number of all types of events in y.\n\nRemark 2. It is worth noting that the conditional points in Z are independent of any specific sequence y, since they are points in X.\n\nAssuming we observe f_Z = m_Z, we can maximize the conditional density p_{\tilde{u}}(y | m_Z) to learn the hyper-parameters of the model:\n\n\ln p_{\tilde{u}}(y | m_Z) = \ln \int p_{\tilde{u}}(y | f_x) p(f_x | m_Z) df_x\n\nwhere p(f_x | m_Z) is defined by Eq. 6 with f_Z = m_Z. Because there is a correspondence between f(x(t)) and the CIF \lambda(t), m_Z essentially corresponds to different values of the CIF given different situations of the history, determined by different z \in Z. Even given the exact same history, the CIF may still be stochastic. Therefore, we allow noise in f_Z, which is a generalization of the noiseless case, so f_Z = m_Z + \epsilon_Z, where p(\epsilon_Z) = N(0, S_\epsilon). Then we can marginalize out \epsilon_Z by integrating w.r.t. p(\epsilon_Z).
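To make the conditioning step concrete, the following is a minimal NumPy sketch of the conditional mean and variance of Eq. 6, together with the extra variance term that the noise covariance S_\epsilon contributes in Eq. 9. This is our own illustration, not the paper's implementation: it uses a plain one-dimensional squared-exponential kernel as a stand-in for the full additive kernel of Eq. 5, and all names (`sq_exp`, `conditional_gp`, the example points) are hypothetical.

```python
import numpy as np

def sq_exp(a, b, gamma=1.0, alpha=1.0):
    # Squared-exponential kernel on scalar inputs (a stand-in for the
    # paper's additive kernel; gamma is the variance, alpha the lengthscale).
    d = a[:, None] - b[None, :]
    return gamma * np.exp(-d ** 2 / (2.0 * alpha ** 2))

def conditional_gp(x, Z, mZ, mu=0.0, S_eps=None, jitter=1e-8):
    # Mean and variance of f(x) conditioned on pseudo-observations mZ at
    # the conditional points Z (Eq. 6).  When a noise covariance S_eps is
    # given, the extra K_xZ K_ZZ^-1 S_eps K_ZZ^-1 K_Zx term of Eq. 9 is added.
    Kzz = sq_exp(Z, Z) + jitter * np.eye(len(Z))
    Kxz = sq_exp(x, Z)
    A = Kxz @ np.linalg.inv(Kzz)                # K_xZ K_ZZ^{-1}
    mean = mu + A @ (mZ - mu)
    var = np.diag(sq_exp(x, x)) - np.sum(A * Kxz, axis=1)
    if S_eps is not None:
        var = var + np.sum((A @ S_eps) * A, axis=1)
    return mean, var

Z = np.array([0.0, 1.0, 2.0])          # conditional points
mZ = np.array([0.5, -0.3, 1.2])        # pseudo-observations at Z
x = np.array([0.0, 1.5])               # test inputs
m0, v0 = conditional_gp(x, Z, mZ)                         # noiseless (Eq. 6)
m1, v1 = conditional_gp(x, Z, mZ, S_eps=0.1 * np.eye(3))  # noisy (Eq. 9)
```

At a conditional point the noiseless posterior interpolates the pseudo-observation with near-zero variance, while the noise term of Eq. 9 can only increase the predictive variance, since the added matrix is positive semi-definite.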
In the end, we maximize\n\n\ln p_{\tilde{u}}(y | m_Z) = \ln \int\!\!\int p_{\tilde{u}}(y | f_x) p(f_x | m_Z, \epsilon_Z) p(\epsilon_Z) df_x d\epsilon_Z = \ln \int p_{\tilde{u}}(y | f_x) p(f_x | m_Z) df_x \quad (8)\n\nwhere p(f_x | m_Z, \epsilon_Z) is defined by Eq. 6, and\n\np(f_x | m_Z) = \int p(f_x | m_Z, \epsilon_Z) p(\epsilon_Z) d\epsilon_Z = N(\tilde{\mu}_x, \tilde{\sigma}^2_x)\n\nhas a closed-form solution\n\n\tilde{\mu}_x = \mu_x + K_{xZ} K_{ZZ}^{-1} (m_Z - \mu_Z), \quad \tilde{\sigma}^2_x = K_{xx} - K_{xZ} K_{ZZ}^{-1} K_{Zx} + K_{xZ} K_{ZZ}^{-1} S_\epsilon K_{ZZ}^{-1} K_{Zx}. \quad (9)\n\nRemark 3. Because we condition on the pseudo-observations m_Z, they can move freely when we optimize Eq. 8, and their values are determined by fitting to the training data. Intuitively, they act as key points of the CIF, which are supposed to capture the key information regarding the entire CIF.\n\nEq. 8 is hard to maximize directly. Instead, we derive a lower bound using Jensen's inequality and maximize the lower bound\n\n\ln \int p_{\tilde{u}}(y | f_x) p(f_x | m_Z) df_x = \ln E[p_{\tilde{u}}(y | f_x)] \ge E[\ln p_{\tilde{u}}(y | f_x)]\n\nwhere the expectation is w.r.t. p(f_x | m_Z), and from Eq. 7\n\nE[\ln p_{\tilde{u}}(y | f_x)] = \sum_{n=1}^{N} \delta(u_n, \tilde{u}) E[\ln f(x(t_n))^2] - \sum_{n=1}^{N+1} \int_{t_{n-1}}^{t_n} \left( E[f(x(t))]^2 + Var[f(x(t))] \right) dt \quad (10)\n\nwhere we define t_0 = 0 and t_{N+1} = T to be the start and end time of the sequence y. From [Lloyd et al., 2015], we have\n\nE[\ln f(x(t_n))^2] = -\tilde{G}\left( -\tilde{\mu}^2_x / (2 \tilde{\sigma}^2_x) \right) + \ln(\tilde{\sigma}^2_x / 2) - C \quad (11)\n\nwhere C \approx 0.57721566 is the Euler-Mascheroni constant, \tilde{G} is defined via the confluent hypergeometric function, and \tilde{\mu}^2_x = E[f_x]^2 and \tilde{\sigma}^2_x = Var[f_x] can be computed as in Eq.
9.\n\nMeanwhile,\n\n\int_{t_{n-1}}^{t_n} E[f_x]^2 dt = (t_n - t_{n-1}) \mu^2_x + 2 \mu_x \psi_n^T K_{ZZ}^{-1} (m_Z - \mu_Z) + (m_Z - \mu_Z)^T K_{ZZ}^{-1} \Psi_n K_{ZZ}^{-1} (m_Z - \mu_Z), \quad (12)\n\n\int_{t_{n-1}}^{t_n} Var[f_x] dt = \sum_{d=1}^{D} \int_{t_{n-1}}^{t_n} \gamma_d I[x_d(t)] dt - Tr(K_{ZZ}^{-1} \Psi_n) + Tr(K_{ZZ}^{-1} S_\epsilon K_{ZZ}^{-1} \Psi_n). \quad (13)\n\nThe definitions of \psi and \Psi are complex and included in the supplementary material. We note that their computation is the bottleneck of the learning algorithm. A straightforward algorithm would cost O(M^2 N D^2). However, by merging computation related to the same event type, we can reduce the complexity to O(M^2 N D Q). By setting each conditional point active on only one dimension, we can reduce the complexity to O(M^2 N Q). If we further set Q_{\tilde{u}} > 1 only for the target type and Q_u = 1 for u \ne \tilde{u}, and assume N_{\tilde{u}} Q_{\tilde{u}} = O(N), where N_u is the number of points of type u, the complexity can be reduced to O(M^2 N), which is what we adopt in the experiments. The details and a proof are in the supplementary material.\n\nLearning We also add an independent noise kernel \sigma^2 I to the existing kernel (Eq. 5), which results in a new term in the integral of the variance (Eq. 13). For learning the model, we maximize the lower bound (Eq. 10) w.r.t. the set of hyper-parameters \Theta = \{\mu, \alpha, \gamma, \sigma, m_Z, S_\epsilon\}.\n\nWe assume that m_Z provides sufficient information for the inference on the test data D^*.\n\nAssumption 3.
Conditioned on m_Z, the test data D^* are independent of the training data D: p(D^* | D, m_Z) = p(D^* | m_Z).\n\nInference For inference of the test likelihood, to compare with non-Bayesian models, we use a point estimate of the CIF, instead of model averaging. We use the optimal hyper-parameters \Theta^* learned from the training data D to estimate the mean CIF\n\n\lambda^*_{\tilde{u}}(t) = E[\lambda_{\tilde{u}}(t) | \Theta^*] = E[f(x(t)) | \Theta^*]^2 + Var[f(x(t)) | \Theta^*] \quad (14)\n\non the test data D^*, where the conditional mean and variance are defined in Eq. 9. Then we use the mean CIF as our prediction to compute the likelihood p(D^* | \lambda^*) on the test data. For prediction of event times, a sampling-based algorithm is derived in the supplementary material.\n\nConditional point placement Due to the high dimensionality of the input space of f, it is preferable that the conditional points are placed beforehand and fixed in the learning procedure. Based on the additive form of our kernel, we place the conditional points independently on each dimension. Each conditional point will be active on only one dimension. In our experiments, for simplicity, we put the conditional points regularly on each dimension within a region. If prior knowledge is available, it can be used to determine the region; otherwise, we can use the following heuristics. The left bound of the region is usually 0. The right bound can be set to the maximum (or some quantile) of the time span between two (Q = 1) or more (Q > 1) consecutive points of the same type, since beyond that, the conditional points will have limited effects.\n\n5 Experiments\n\nWe compare our method with two state-of-the-art nonparametric Hawkes process variants. HP-GS [Xu et al., 2016] is a nonparametric Hawkes process using a set of (Gaussian) basis functions to approximate the triggering kernels, with sparse and group lasso regularization.
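The conditional-point placement heuristic from Section 4 can be sketched as follows. This is a hypothetical helper for the Q = 1 case under our own naming, not code from the paper: the right bound of the region is taken as a quantile (by default the maximum) of the gaps between consecutive events of the same type, and points are then spaced regularly.

```python
import numpy as np

def placement_region(times, quantile=1.0):
    # Heuristic from Section 4 (Q = 1 case): left bound 0, right bound a
    # quantile (default: the maximum) of the gaps between consecutive
    # events of the same type.
    gaps = np.diff(np.sort(np.asarray(times, dtype=float)))
    return 0.0, float(np.quantile(gaps, quantile))

def place_points(times, num_points, quantile=1.0):
    # Regularly spaced conditional points on one input dimension.
    lo, hi = placement_region(times, quantile)
    return np.linspace(lo, hi, num_points)

# Events of one type at these (hypothetical) times:
pts = place_points([0.0, 3.0, 7.0, 12.0], num_points=6)
```

Beyond the largest observed gap, a conditional point would rarely be reached by any input x(t), which is why the region is cut off there.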
For each experiment,\nwe tune its tuning parameters \u03b1S and \u03b1G in a wide range {10\u22122, 10\u22121, . . . , 104} as in the original\nwork using cross-validation based on the likelihood. In all the experiments, the bandwidth of the\nGaussian kernels is set to be optimal, that is the inverse of the cut-off frequency, based on the positions\nof the kernels. The cut-off frequency \u03c90 = \u03c0M/T , where M is the number of kernels and T is the\nright bound on the kernels. HP-LS [Eichler et al., 2017] is another nonparametric Hawkes process.\nThis method allows very \ufb02exible triggering kernels to be estimated by discretizing the kernels and\nsolving a least-square problem. Its parameters are set in accordance with the other methods for each\nexperiment. For our method, to improve ef\ufb01ciency, when we set Q > 1, we only set it for the target\ntype and keep Q = 1 for the others, as discussed in the supplementary material. We tie the parameters\nfor different dimensions q = 1, . . . , Q of the same type u.\n\n5.1 Synthetic datasets\n\nFirst we generate two synthetic datasets representing two distinctive types of event sequences using\nthe thinning algorithm [Ogata, 1981]. The \ufb01rst dataset is generated through a renewal process.\nThe baseline intensity is \u00b5. When there is a new event, the intensity temporarily becomes A(1 \u2212\nsin(2\u03c0t/\u03c4 )), for a limited time t \u2208 (0, \u03c4 ) after the event. Each new event will reset the intensity. We\nset \u00b5 = 0.1, A = 0.1, \u03c4 = 20.\nThe second dataset is generated through a Hawkes process. The baseline intensity is \u00b5. The\ntriggering kernel is A exp(\u2212(t \u2212 b)2/\u03c32), i.e., a Gaussian kernel. Different from the renewal\nprocess, each new event will add a new Gaussian kernel on top of the existing intensity. We set\n\u00b5 = 0.1, A = 0.1, b = 10, \u03c32 = 4.\nFor each dataset we generate 200 sequences of length of 100 time units each. 
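The generation of the first synthetic dataset can be sketched with Ogata's thinning algorithm. This is our own illustrative implementation (names and structure are ours, not from the paper's code): candidates are proposed from a homogeneous process with a dominating rate and accepted with probability \lambda(t)/\lambda_max, where \lambda_max = max(\mu, 2A) bounds the intensity because the sine term lies in [-1, 1].

```python
import numpy as np

MU, A, TAU = 0.1, 0.1, 20.0

def intensity(t, last_event):
    # CIF of the first synthetic dataset: after each event the intensity
    # becomes A * (1 - sin(2*pi*s / tau)) for s in (0, tau) since the most
    # recent event, and reverts to the baseline mu afterwards.
    if last_event is not None and 0.0 < t - last_event < TAU:
        return A * (1.0 - np.sin(2.0 * np.pi * (t - last_event) / TAU))
    return MU

def thinning(T, rng):
    # Ogata's thinning: propose candidate times from a homogeneous process
    # with dominating rate lam_max, accept each with prob lambda(t)/lam_max.
    lam_max = max(MU, 2.0 * A)
    t, last, events = 0.0, None, []
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t > T:
            return events
        if rng.uniform() < intensity(t, last) / lam_max:
            events.append(t)
            last = t

events = thinning(100.0, np.random.default_rng(0))
```

Each accepted event resets the clock (`last`), which gives the renewal behavior of the first dataset; for the Hawkes-style second dataset, accepted events would instead add Gaussian kernels on top of the existing intensity.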
Each dataset is split into 100 training sequences and 100 testing sequences. For the first dataset, we set Q = 1 and conditional points at 0, 5, . . . , 15 for CGPRPP. For HP-GS, the kernels are also placed at 0, 5, . . . , 15. For HP-LS, we set h = 1, k = 20. For the second dataset, we use the same settings for all the methods, except that we vary Q = 1, 5, 10, 20, 40 for CGPRPP to see the effect of adding more regression terms.\n\nWe visualize the influence from a past event of a specific type on the target events as the changes in the intensity of the target type over time since that event. For Hawkes processes, it is similar to plotting the triggering kernels, except that the triggering kernels are added on top of the baseline intensity, so we can compare the intensity after an event with the baseline intensity. For CGPRPP, it is equivalent to simulating an event at time 0 and plotting the changes in the intensity as time elapses.\n\nThe true influence functions are in the first column in Figure 2, followed by the inferred influence functions for each method. For the first dataset, HP-GS cannot learn the influence function, because of its limited flexibility in modeling the dependencies of the CIF on the past events. Although the triggering kernels are nonparametric, the baseline intensity and the triggering kernels are additive in the CIF. This limitation is quite common in nonparametric Hawkes processes (e.g., [Zhang et al., 2018, Donnet et al., 2018, Zhou et al., 2013b]). HP-LS is more flexible and learns a better influence function, but the discretization tends to make the function noisy. CGPRPP almost completely recovers the true influence function. We note that this influence represents an inhibition followed by an excitation, which is common in practice, for example in neural spike trains [Eichler et al., 2017].
However, most Hawkes process variants can only model either excitations or inhibitions, but not a mix of both at the same time. In contrast, CGPRPP models the whole CIF as a nonparametric function of the past events and therefore can model these more complex dependencies.\n\nFor the second dataset, HP-GS is a perfect match for the data, so unsurprisingly it recovers the influence function very well. HP-LS also learns the influence reasonably well, although still suffering from discretization. Interestingly, CGPRPP with Q = 40 (similar for Q = 10, 20) learns an influence function very close to HP-GS.\n\nThe test log-likelihood on the synthetic datasets is shown in Table 1. CGPRPP performs the best on the first dataset, while HP-GS and CGPRPP perform similarly on the second dataset, with CGPRPP being marginally better. The likelihood results are concordant with how well the models recovered the influence function. For the second dataset, we show the performance of CGPRPP selected with the training likelihood (Q = 40). A comparison of GPRPP based on variational sparse GP [Lloyd et al., 2015] and conditional GP, and the effect of varying Q on the performance of CGPRPP, are in the supplementary material.\n\n(a) Ground truth\n\n(b) HP-GS\n\n(c) HP-LS\n\n(d) CGPRPP\n\nFigure 2: Influences from past events on the first (top) and second (bottom) synthetic datasets. Solid lines are the CIFs after an event. Dashed lines are the baseline intensities.
The ground truth is in the first column, followed by the result of each method.\n\nTable 1: Test log-likelihood on synthetic datasets.\nData | HP-GS | HP-LS | CGPRPP\n1 | -2671 | -2770 | -2455\n2 | -4074 | -4161 | -4071\n\nTable 2: Test log-likelihood on IPTV dataset.\nMonth | HP-GS | HP-LS | CGPRPP\n1 | -1.477e+05 | -1.779e+05 | -1.479e+05\n2 | -1.509e+05 | -1.825e+05 | -1.502e+05\n3 | -1.608e+05 | -1.928e+05 | -1.608e+05\n\n5.2 IPTV dataset\n\nThe IPTV dataset consists of TV viewing records of users over 11 months [Luo et al., 2014, Xu et al., 2016]. Each sequence consists of times and types of the TV programs viewed by a user. Events in this dataset are generally very bursty, i.e., one event tends to trigger a group of events of the same type happening in a relatively short amount of time, while the distances between these burst groups are relatively large. This is a distinctive characteristic of data generated by Hawkes processes, so we expect the Hawkes process baselines to perform well. To the best of our knowledge, HP-GS has the best performance on this dataset, but our goal is to confirm whether CGPRPP can also fit the data well and achieve similar or better performance.\n\nThe data are extracted from THAP [Xu and Zha, 2017] (https://github.com/HongtengXu/Hawkes-Process-Toolkit), which contains 302 users in total. For efficiency, we randomly sample 200 users and use 100 for training and the others for testing. All the models are trained on 1 month and tested on the following 3 months. More details are in the supplementary material.
For HP-GS, we put the kernels at every 20 minutes from 0 up to 24 hours,\nsince the length of most TV programs is about 20 to 40 minutes [Xu et al., 2016]. For HP-LS, we\ntrain multiple models with h = 1.25, 5, 20 minutes and k = (24 \u2217 60 + 20)/h correspondingly.\nFor CGPRPP, the conditional points are also placed at every 20 minutes up to 24 hours. We set\nQ = 5, 10, 20 and select Q based on the training likelihood.\nTable 2 shows the test log-likelihood of the models on different months. This is the total log-likelihood\nof all types of events. For HP-LS, we show the best test log-likelihood across different h and k. As\nexpected, HP-GS performs the best, con\ufb01rming the bursty characteristic of the data. However, HP-LS\ndoes not perform well. A problem of HP-LS is that the discretization tends to make the in\ufb02uence\nfunction noisy and fail to generalize well. CGPRPP has a competitive performance close to HP-GS,\nshowing its capability to model bursty events.\n\n5.3 MIMIC datasets\n\nTo show the \ufb02exibility of CGPRPP in modeling other complex event patterns than the bursty patterns\nas in many previously used datasets similar to the IPTV dataset, we derive multiple new event\nsequence datasets from MIMIC III [Johnson et al., 2016] consisting of lab tests ordered to patients in\na hospital. Lab orders tend to have more complex dependencies such as a complex mix of multiple\ninhibitions and excitations over time (e.g., see Figure 3).\nSince there are labs that tend to occur together, we group them into several lab classes. We extract\n20 different datasets targeting the most frequent 20 classes. Each dataset consists of 10 different lab\nclasses, one of which is the target we try to predict, while the others are the predictors. We sample\n200 admissions (sequences) randomly from each dataset, where 100 admissions are used for training\nand the others for testing. 
More details are in the supplementary material.\n\nFor HP-GS, we put the kernels at 0, 8, . . . , 48 hours. We also test a different version of the method, HP-GS-A, using the adaptive basis-function-selection algorithm in [Xu et al., 2016] to place the kernels. For HP-LS, we train multiple models with h = 0.5, 2, 8 hours and k = (48 + 8)/h correspondingly. For CGPRPP, the conditional points are also placed at 0, 8, . . . , 48 hours. We set Q = 1, 10 and select Q based on the training likelihood. As a reference, we also test against a model based on deep neural networks, the neural self-modulating multivariate point process (NSMMPP) [Mei and Eisner, 2017]. The number of hidden units is selected from 64, 128, . . . , 1024 as in the original work through a validation set (80/20 split from the full training set).\n\nThe test log-likelihood of the models is shown in Table 3. Each dataset has a different target lab class. The target classes range from the most frequent (355) to the least frequent (18) based on their occurrences. For HP-LS, we show the best test log-likelihood across different h and k on each dataset. CGPRPP achieves the best or close to the best performance on all datasets except class 550 and 18. On class 355, 60, 151, 113, and 140, CGPRPP outperforms the second best by a large margin. In some cases (e.g., class 550) CGPRPP with a different Q actually has a much better result, although it was not selected. We also conducted time prediction experiments to predict the time for each target event. The results also show the advantage of CGPRPP. The full likelihood and time prediction results are in the supplementary material.\n\nAs an example, we plot the influence functions for class 355 from past events of the same type in Figure 3. HP-GS learns a smooth influence function with excitations around 24 and 48 hours.
This corresponds to the fact that, right after a lab is ordered, it might need to be repeated after one or two days. In contrast, HP-LS (h = 0.5, with the best test likelihood) learns a much noisier pattern due to discretization, which is harder to interpret. Compared with HP-GS, CGPRPP learns not only similar excitations around 24 and 48 hours, but also a strong inhibition after each excitation, showing a more flexible fit to the data.\n\nTable 3: Test log-likelihood on MIMIC lab order datasets.\nDataset | HP-GS | HP-GS-A | HP-LS | NSMMPP | CGPRPP\n355 | -3668 | -3947 | -6510 | -3664 | -3249\n60 | -4673 | -5051 | -7299 | -4660 | -4246\n3 | -3721 | -3733 | -5722 | -3737 | -3759\n95 | -4064 | -4390 | -5712 | -3982 | -3933\n354 | -4344 | -4792 | -7185 | -4409 | -4225\n151 | -3338 | -3574 | -5323 | -3763 | -3093\n394 | -3098 | -3251 | -4945 | -3268 | -3010\n550 | -1053 | -1064 | -1744 | -1039 | -1175\n368 | -3366 | -3711 | -5625 | -3309 | -3378\n113 | -4656 | -5049 | -7143 | -4539 | -4276\n140 | -3206 | -3475 | -4625 | -3244 | -2942\n7 | -2502 | -2533 | -3514 | -2626 | -2512\n294 | -1011 | -1054 | -1308 | -941.2 | -993.6\n150 | -3238 | -3537 | -4894 | -3377 | -3100\n17 | -3783 | -3807 | -5339 | -3758 | -3808\n80 | -3388 | -3772 | -5365 | -3903 | -3402\n1 | -3220 | -3291 | -3772 | -3228 | -3234\n53 | -1913 | -2138 | -2963 | -1916 | -1900\n8 | -1633 | -1667 | -3142 | -1786 | -1694\n18 | -1596 | -1678 | -3085 | -1532 | -1648\n\n(a) HP-GS\n\n(b) HP-LS\n\n(c) CGPRPP\n\nFigure 3: Influences from past events of the same type as the target class 355 on the MIMIC dataset.\n\n6 Conclusion\n\nIn this work, we proposed a new nonparametric method for modeling dependencies between events in event sequences using Gaussian processes. Similar to Hawkes processes and different from previous GP-modulated point processes, the proposed model can be learned on a sample of sequences and then applied to other unseen sequences.
However, we showed that the proposed model is more flexible than state-of-the-art nonparametric Hawkes process variants. It can learn the dependencies between events that are common in practice but difficult for the Hawkes process variants to represent, e.g., a mix of inhibitions and excitations after an event. Our method showed competitive or better performance on different datasets compared with the Hawkes process variants.\n\nAcknowledgement\n\nThis work was supported by NIH grant R01-GM088224. Siqi Liu was also supported by a CS50 Merit Pre-doctoral Fellowship from the Department of Computer Science, University of Pittsburgh. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.\n\nReferences\n\nDaryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer, New York, 2003.\n\nDaryl J. Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.\n\nAlan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971. ISSN 0006-3444. doi: 10.2307/2334319.\n\nKe Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In Artificial Intelligence and Statistics, pages 641–649, 2013a.\n\nKe Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, pages 1301–1309, 2013b.\n\nEmmanuel Bacry and Jean-Francois Muzy. Second order statistics characterization of Hawkes processes and non-parametric estimation.
arXiv:1401.0903 [physics, q-fin, stat], January 2014.

Young Lee, Kar Wai Lim, and Cheng Soon Ong. Hawkes processes with stochastic excitations. In International Conference on Machine Learning, pages 79-88, 2016.

Yichen Wang, Bo Xie, Nan Du, and Le Song. Isotonic Hawkes processes. In International Conference on Machine Learning, pages 2226-2234, 2016.

Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. In International Conference on Machine Learning, pages 1717-1726, 2016.

Michael Eichler, Rainer Dahlhaus, and Johannes Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225-242, March 2017. ISSN 01439782. doi: 10.1111/jtsa.12213.

Ryan Prescott Adams, Iain Murray, and David JC MacKay. Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 9-16. ACM, 2009.

Thomas A. Lasko. Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, pages 469-476, Arlington, Virginia, United States, 2014. AUAI Press. ISBN 978-0-9749039-1-0.

Vinayak Rao and Yee W. Teh. Gaussian process modulated renewal processes. In Advances in Neural Information Processing Systems, pages 2474-2482, 2011.

Tom Gunter, Chris Lloyd, Michael A. Osborne, and Stephen J. Roberts. Efficient Bayesian nonparametric modelling of structured point processes. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, pages 310-319, Arlington, Virginia, United States, 2014. AUAI Press.
ISBN 978-0-9749039-1-0.

Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning, pages 1814-1822, 2015.

Chris Lloyd, Tom Gunter, Michael Osborne, Stephen Roberts, and Tom Nickson. Latent point process allocation. In Artificial Intelligence and Statistics, pages 389-397, May 2016.

Hongyi Ding, Mohammad Khan, Issei Sato, and Masashi Sugiyama. Bayesian nonparametric Poisson-process allocation for time-sequence modeling. In International Conference on Artificial Intelligence and Statistics, pages 1108-1116, 2018.

Minyoung Kim. Markov modulated Gaussian Cox processes for semi-stationary intensity modeling of events data. In International Conference on Machine Learning, pages 2640-2648, July 2018.

Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939-1959, December 2005. ISSN 1532-4435.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, pages 567-574, 2009.

Y. Ogata. On Lewis' simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23-31, January 1981. ISSN 0018-9448. doi: 10.1109/TIT.1981.1056305.

Rui Zhang, Christian Walder, Marian-Andrei Rizoiu, and Lexing Xie. Efficient non-parametric Bayesian Hawkes processes. arXiv:1810.03730 [cs, stat], October 2018.

Sophie Donnet, Vincent Rivoirard, and Judith Rousseau. Nonparametric Bayesian estimation of multivariate Hawkes processes. arXiv:1802.05975 [math, stat], February 2018.

D. Luo, H. Xu, H. Zha, J. Du, R. Xie, X. Yang, and W. Zhang. You are what you watch and when you watch: Inferring household structures from IPTV viewing data.
IEEE Transactions on Broadcasting, 60(1):61-72, March 2014. ISSN 0018-9316. doi: 10.1109/TBC.2013.2295894.

Hongteng Xu and Hongyuan Zha. THAP: A matlab toolkit for learning with Hawkes processes. arXiv:1708.09252 [cs, stat], August 2017.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, May 2016. ISSN 2052-4463. doi: 10.1038/sdata.2016.35.

Hongyuan Mei and Jason M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6757-6767, 2017.