{"title": "Fast Estimation of Causal Interactions using Wold Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2971, "page_last": 2982, "abstract": "We here focus on the task of learning Granger causality matrices for multivariate point processes. In order to accomplish this task, our work is the first to explore the use of Wold processes. By doing so, we are able to develop asymptotically fast MCMC learning algorithms. With $N$ being the total number of events and $K$ the number of processes, our learning algorithm has an $O(N(\\,\\log(N)\\,+\\,\\log(K)))$ cost per iteration. This is much faster than the $O(N^3\\,K^2)$ or $O(K^3)$ for the state of the art. Our approach, called GrangerBusca, is validated on nine datasets. This is an advance in relation to most prior efforts which focus mostly on subsets of the Memetracker data. Regarding accuracy, GrangerBusca is three times more accurate (in Precision@10) than the state of the art for the commonly explored Memetracker subsets. Due to GrangerBusca's much lower training complexity, our approach is the only one able to train models for larger, full, sets of data.", "full_text": "Fast Estimation of Causal Interactions using Wold Processes\n\nFlavio Figueiredo, Guilherme Borges, Pedro O. S. Vaz de Melo, Renato Assun\u00e7\u00e3o\n\nUniversidade Federal de Minas Gerais (UFMG)\n\n{flaviovdf, guilherme.borges, olmo, assuncao}@dcc.ufmg.br\n\nReproducibility: http://github.com/flaviovdf/granger-busca\n\nAbstract\n\nWe here focus on the task of learning Granger causality matrices for multivariate\npoint processes. In order to accomplish this task, our work is the first to explore\nthe use of Wold processes. By doing so, we are able to develop asymptotically fast\nMCMC learning algorithms. With N being the total number of events and K the\nnumber of processes, our learning algorithm has an $O(N(\\log(N) + \\log(K)))$ cost\nper iteration. 
This is much faster than the $O(N^3 K^2)$ or $O(K^3)$ for the state of the\nart. Our approach, called GRANGER-BUSCA, is validated on nine datasets. This is\nan advance in relation to most prior efforts which focus mostly on subsets of the\nMemetracker data. Regarding accuracy, GRANGER-BUSCA is three times more\naccurate (in Precision@10) than the state of the art for the commonly explored\nMemetracker subsets. Due to GRANGER-BUSCA\u2019s much lower training complexity,\nour approach is the only one able to train models for larger, full, sets of data.\n\n1 Introduction\n\nIn order to understand complex systems we need to know how their components (or entities) interact\nwith each other. Networks (or graphs) offer a common language to model such systems, where their\nentities are represented by nodes and their interactions by edges [6]. The networked point process\nis a stochastic model for these systems, when data take the form of a time series of random events\nobserved in each node. That is, in each node of a network we have a temporal point process, which\nis a random process whose realizations consist of the times $P = \\{t_j, j \\in \\mathbb{N}\\}$ of isolated events.\nNetworked point processes are probabilistic models designed to analyze the influence that events\noccurring in a node may have on the events occurring on other nodes of the network.\n\nRecently, several fields have used networked point processes to understand complex systems such as\nspiking biological neurons [36], social networks [8, 42], geo-sensor networks [22], financial agents\nin markets [37], television records [48] and patient visits [11]. One of the main objectives in these\nanalyses is to uncover the causal relationships among the entities of the system, or the interaction\nstructure among the nodes, which is also called the latent network structure. 
Typically, this is\nrepresented by a graph where edges connect nodes that affect each other and edge weights represent\nthe intensity of this pairwise interaction. To the best of our knowledge, all methods that tackle\nthis problem are based on the Hawkes point process [25, 24] with a Granger causality framework [20]\nimposed to retrieve the causal graph from data [1, 48, 53, 35, 34]. A point process Pb is said to\nGranger cause another point process Pa when the past information of Pb can provide statistically\nsignificant information about the future occurrences of Pa. We can thus encode causal relationships as\na matrix [15, 16]. In a multivariate point process, this notion of causality can be reduced to measuring\nif the conditional intensity function of Pa is affected by the previous timestamps of Pb [48].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn contrast with the popular choice of using Hawkes processes to model interacting processes, Wold\nprocesses [47] have been neglected as a possible model. Wold processes are a class of point processes\ndefined in terms of the joint distribution of inter-event times. Let $\\delta_i = t_i - t_{i-1}$ be the waiting\ntime for the i-th event since the occurrence of the (i \u2212 1)-th event. The main characteristic of Wold\nprocesses is that the collection of inter-event times $\\{\\delta_i, i \\in \\mathbb{N}\\}$ follows a Markov chain of finite\nmemory. That is, different from Hawkes processes, whose intensity function depends on the whole\nhistory of previous events, the probability distribution of the i-th inter-event time $\\delta_i$ depends only\non the previous inter-event time $\\delta_{i-1}$. When Wold processes are able to mimic the dynamics of\ncomplex systems [46, 45], this Markovian property can potentially boost the performance of learning\nalgorithms as in our setting. 
Wold processes were proposed over sixty years ago. However, they\nremain scarcely explored in machine learning, which is unfortunate. As we will demonstrate in this\npaper, Wold processes can be both fast and accurate for some learning tasks.\n\nWe here present the first approach to tackle the discovery of latent network structures in point process\ndata using Wold processes. Similar to recent efforts [1, 48], our goal in this work is to extract Granger\ncausality [20] from multivariate point process data only. The main reason to consider the Wold\nprocess as an alternative to the Hawkes process is its Markovian structure. By adding Dirichlet\npriors over the mutual influences among the processes, we exploit the Markov property to develop\nlearning algorithms that are asymptotically fast. We call our approach GRANGER-BUSCA. With N\nbeing the number of observations in all processes and K the number of processes, the state of the art\napproaches learn at a cost of $O(M N^3 K^2)$ [48] (M being defined by hyper-parameters) or $O(K^3)$ [1]\nper iteration. GRANGER-BUSCA, in contrast, learns at a cost of $O(N(\\log(N) + \\log(K)))$.\n\nEqually important, our results show significant improvements over the state of the art methods. 
For\ninstance, when we measure the Precision@10 score between our Granger causal estimates and the\nground-truth number of mentions of the commonly explored Memetracker data [29], our results are\nat least three times more accurate than the most recent and most accurate method [1].\n\nIn summary, our main contributions are: (1) We present the first approach to extract Granger causality\nmatrices via Wold processes; (2) We develop asymptotically fast algorithms to learn such matrices;\n(3) We show how GRANGER-BUSCA is much faster and more accurate than state of the art methods,\nopening up the potential of Wold processes for the machine learning community.\n\n2 Background and Related Work\n\nA temporal point process P is a probability model for a collection of times $\\{0 \\leq t_0 < t_1 < t_2 < \\ldots\\}$\nindexing the occurrences of random isolated events. Our context considers several point processes\nPa, Pb, Pc, . . . observed simultaneously, where each event is associated to a single point process:\n$P_a = \\{0 \\leq t_{a_0} < t_{a_1} < \\ldots\\}$. Let $P = \\bigcup_a P_a$ be the union of all timestamps from all point\nprocesses with N = |P| being the total number of events. We denote by $N_a(t) = \\sum_{i=1}^{|P_a|} \\mathbb{1}_{t_{a_i} \\leq t}$ the\nrandom cumulative count of the number of events up to (and including) time t in process Pa. The\ncollection Na = {Na(t) | t \u2208 [0, T ]} is an equivalent representation of the point process Pa.\n\nThe conditional intensity rate function is the fundamental tool for modeling and making inferences\non point processes. Let Ha(t) be the random history of the process Pa up to, but not including,\ntime t, called the filtration. Similarly, H(t) is defined as the filtration considering the collection\nP of all point processes. In the limit, as dt \u2192 0, the conditional intensity rate function is given\nby $\\lambda_a(t \\mid H(t))\\,dt = \\Pr[N_a(t + dt) - N_a(t) > 0 \\mid H(t)] = E(N_a(t + dt) - N_a(t) \\mid H(t))$. 
The\ninterpretation of this function is that, for a small time interval dt, the value of \u03bba(t|H(t)) dt is\napproximately equal to the expected number of events in (t, t + dt]. It can also be interpreted as the\nprobability that process Pa has at least one event in the interval. The notation emphasizes that the\nconditional intensity at time t depends on the random events that occurred previously.\n\nThe commonly explored Hawkes process P is defined by its set of conditional intensity functions:\n\n$$\\lambda_a(t \\mid H(t)) = \\mu_a + \\sum_{b=0}^{K-1} \\int_0^t \\phi_{ba}(t - t')\\,dN_b(t') = \\mu_a + \\sum_{b=0}^{K-1} \\sum_{t_{b_i} < t} \\phi_{ba}(t - t_{b_i}) \\tag{1}$$\n\nWe can consider processes as being enumerated from {a, b, \u00b7 \u00b7 \u00b7 } \u2208 [0, K). \u00b5a captures the exogenous\n(Poissonian) background arrival rate of process Pa. \u03c6ba captures the influence of the point process\nPb on Pa. \u03c6ba(t) has to be integrable, non-negative, and with \u03c6ba(t) = 0 when t < 0. Usual forms\nof \u03c6ba(t) are the exponential and power-law functions [4]. With \u03bba(t) from Eq (1), there is evidence\nthat Pb Granger causes Pa when there exists a time t where \u03c6ba(t) \u2260 0 [48].\n\nIn contrast to the Hawkes process, a Wold process [14, 47] is defined in terms of a Markovian\ntransition on the inter-event times (increments). Let $D_a = \\{\\delta_{a_i} = t_{a_i} - t_{a_{i-1}}, \\ldots\\}$ be the stochastic\nprocess of time increments associated with point process Pa. 
The Markovian transition between\nincrements is established by the transition probability density $\\Pr[x = \\delta_{a_{i-1}} \\mid y = \\delta_{a_i}]$, which\nmeasures the likelihood of $\\delta_{a_i}$ given the value of the previous increment $\\delta_{a_{i-1}}$ [47, 14]. In this notation, x is the previous increment and y is the next one.\nIt is important to point out that, for most forms of Pr[x | y], the model is analytically intractable [21].\nHowever, when Pr[x | y] has an exponential form, such as $\\Pr[x \\mid y] = f(x) e^{-f(x) y}$, the model has\nseveral interesting properties [12, 13]. In this particular form, the next increment is exponentially\ndistributed with rate \u03bb = f (x) where x is the previous increment, i.e.: $\\delta_{i+1} \\sim \\text{Exponential}(\\lambda = f(\\delta_i))$. For the particular case of f (x) = \u03b2 + \u03b1x, the works of Cox [12] and Daley [13, 14] derive\nthe stationary distribution of increments, showing that it is of the form: $\\Pr[x] \\propto (\\beta + \\alpha x)^{-1} e^{-\\alpha x}$.\n\nGRANGER-BUSCA is motivated by recent efforts [2, 45, 46] that employ variations of the exponential\nWold process (defined above). Instead of defining the Wold process in terms of its interval exponential\nrate, such efforts defined the process in terms of the conditional mean $\\mu(x) = E_a(\\delta_{a_i} \\mid \\delta_{a_{i-1}} = x) = \\beta + \\alpha x$ of an exponentially distributed random variable. This class of point processes is called\nself-feeding processes (SFP). For the particular case of \u03b1 = 1 and $\\beta = \\text{median}(\\delta_{a_i})/e$, with e \u2248 2.718\nbeing Euler\u2019s number, [45, 46] showed that the stationary behavior can be very well approximated\nby the more tractable Log-Logistic distribution. This new form of specifying a Wold process is\ninteresting because it is able to capture both exponential and power-law behavior, disparate behaviors\nsimultaneously observed in real data [2, 45, 46]. 
Realizations of this process tend to generate bursts\nof intense activity followed by long periods of silence.\n\nBusca [2] is another point process model based on Wold processes and it is GRANGER-BUSCA\u2019s\nstarting point. Starting from an SFP model with \u03b1 = 1, and to accommodate the less frequent long\nwaiting times observed in some real datasets, the authors add a background Poissonian\nrate (\u00b5a). The conditional intensity is thus given by:\n\n$$\\lambda_a(t \\mid H_a(t)) = \\mu_a + \\frac{1}{\\beta + \\delta_{a_p}}, \\tag{2}$$\n\nwhere $\\delta_{a_p} = t_{a_p} - t_{a_{p-1}}$ with p being defined by $p = \\arg\\max \\{ i : t_{a_i} < t \\}$. That is, p is the index\nassociated to the closest previous event to t. $\\delta_{a_p}$ is thus the previously observed increment.\n\nOver the years, several Hawkes methods have been created for different applications with varying\nasymptotic costs [5, 8, 9, 10, 33, 41]. We now discuss the methods most related to GRANGER-BUSCA.\nAchab et al. [1] present an $O(K^3)$ approach. Xu et al. [48] learn kernels at a cost of $O(M K^2 N^3)$.\nHere, M represents the number of basis functions used to approximate the kernel non-parametrically.\nSimilarly, Yang et al. [49] discuss a non-parametric Hawkes model that does not explore the infinite\nmemory. The authors show that after W events, the kernel is adequately estimated. However, the\nproposed method still scales in the order of $O(M K^3 W)$ (footnote 1), where M again represents a pre-defined number\nof functions used to approximate the kernel, and W is the maximum number of previous events to be\nconsidered. A similar proposal is presented by Etesami et al. [18]. In contrast, GRANGER-BUSCA has\na computational complexity of $O(N(\\log(N) + \\log(K)))$ per iteration. We achieve this low cost by\nemploying a Bayesian inference where at each step the algorithm learns, for each event in P, which\nother process, if any, caused that event. 
Past efforts have employed similar sampling approaches on\nHawkes-based models [17, 34, 35], which again do not scale.\n\nIt is important to point out that Isham [27] was one of the first and few to discuss multivariate Wold\nprocesses. However, the author was mostly focused on analytical properties of the Multivariate Wold\nProcess on a very specific setting (an infinite process defined on the unit circle).\n\nFootnote 1: The authors discuss an $O(K^2)$ cost for a parallel algorithm with hyper-parameters being O(1). While this\nis correct asymptotically, we compare, for all methods, the non-parallel cost with hyper-parameters. In practice,\nhyper-parameters have a multiplicative effect on learning time. Moreover, for every parallel method (GRANGER-BUSCA included), the reduction factor of K ($K^2$ from $K^3$) is only possible via unbounded parallelization (one\nprocess per CPU), being unfeasible for large data: K >> #CPUs.\n\n3 Model\n\n[Figure 1 panels: (a) GRANGER-BUSCA\u2019s clustering behavior (cross-excitement, burst, idleness); (b) GRANGER-BUSCA\u2019s usage of increments; (c) Rate and Number of Events]\n\nFigure 1: GRANGER-BUSCA at work. Plot (a) shows the events of process Pa (circles) and process\nPb (triangles). The arrows show the excitement component of the model. Plot (b) illustrates how\n\u2206aa(t) and \u2206ba(t) are calculated. Plot (c) shows the cumulative random processes Na(t) and Nb(t)\nin the top, while the bottom plot shows the random conditional intensity functions \u03bba(t) and \u03bbb(t).\n\nWe define GRANGER-BUSCA using Figure 1, which exemplifies the behavior of the model with two\nprocesses. In (a), we show the events of each process as a horizontal line of symbols, triangles in\nthe upper row for process Pb, and circles in the bottom row for process Pa. The unfilled symbols\nrepresent events that are caused in an exogenous way, not triggered by any other event. 
We shall detail\nhow to label events in Section 4. The filled symbols are events that appear as a causal influence from\nsome other previous event. We say that these events have been excited by the previous occurrence\nof other events. The directed edge connects the origin event to the resulting event. Edges crossing\nthe two parallel lines of events represent the cross-excitement component. In this case, one event\nin a given process stimulates the occurrence of events in another process. Another situation is due\nto self-excitement, when events in a given process stimulate further events in the same process.\nThese are represented by the horizontal arrows in Figure 1(a). Figure 1(c) shows the behavior of the\nrandom Na(t) and the random conditional intensity function \u03bba(t). Notice that \u03bba(t) behaves like\na step function. That is, the rate of arrivals is fixed until the next arrival. The figure also shows the\nburst-idleness cycle observed with GRANGER-BUSCA, as well as the cross-excitation.\n\nTo formalize GRANGER-BUSCA, let us first redefine $\\delta_{a_p}$ as a function \u2206a(t). Recall that p is the\nindex of the previously observed event in Pa that is closest to t. Also recall that $\\delta_{a_p}$ is the last\nobserved increment. The function notation will simplify our extension to a multivariate Wold process.\nSee Figure 1(b) for an illustration. Let us now define q as the index of the event in Pb that is closest to $t_{a_p}$,\nthat is, $q = \\arg\\max \\{ j : t_{b_j} < t_{a_p} < t \\}$. With such an index, we can denote by \u2206ba(t) the difference\nbetween $t_{a_p}$ and $t_{b_q}$, i.e.: $\\Delta_{ba}(t) = t_{a_p} - t_{b_q}$. When the expected values of \u2206ba(t) are small, events\nin Pb tend to closely precede events in Pa. 
In this sense, one intuition on how GRANGER-BUSCA works is that larger\nobserved values of \u2206ba(t) lead to weaker evidence for the influence of process Pb on process Pa.\n\nThe above behavior motivates GRANGER-BUSCA\u2019s multivariate conditional intensity function:\n\n$$\\lambda_a(t) = \\underbrace{\\mu_a}_{\\text{Exogenous Poisson Rate}} + \\underbrace{\\sum_{b=0}^{K-1} \\frac{\\alpha_{ba}}{\\beta_b + \\Delta_{ba}(t)}}_{\\text{Endogenous Wold Rate}} \\tag{3}$$\n\nThe random intensity function \u03bba(t) is the sum of two components. The first one is \u00b5a and it\nrepresents the exogenous events in process Pa that are generated at a Poissonian constant rate \u00b5a per\nunit of time. The other component is a sum over all processes, including the same a process. The\nterms in this sum represent the increment on the baseline rate \u00b5a contributed by other previous events\nfrom Pa itself (self-excitement) or from other processes (cross-excitement). Based on a very sparse\nrepresentation, the entire history is concentrated only on the most recent time gaps between events.\nHence, the influence of process Pb on process Pa is represented by the ratio \u03b1ba/(\u03b2b + \u2206ba(t)). The\nnumerator is a scale factor measuring the amount of cross-excitement: when it is equal to zero, there\nis no influence from Pb on Pa. The denominator models how this cross-excitement takes place. At\ntime t, we add the time gap \u2206ba(t) to the flat value \u03b2b. A large gap \u2206ba(t) makes the contribution of\nthe ratio \u03b1ba/(\u03b2b + \u2206ba(t)) small relative to the baseline rate \u00b5a. Otherwise, a small gap raises this\ncontribution up to its maximum possible contribution rate of \u03b1ba/\u03b2b.\n\nModel parameters and definitions: \u0398 = {G, \u03b2, \u00b5} contains the full set of model parameters.\nG = [\u03b1ba] is the Granger matrix of the model. We require that \u03b1ba \u2265 0 and that $\\sum_{a=0}^{K-1} \\alpha_{ba} = 1$.\nHence, the value \u03b1ba in G[b, a] is proportional to the fraction of events from process Pb that\ntriggered events on Pa. By definition, G is row stochastic [26]. This property leads to several\ninteresting characteristics of the model, that we further develop throughout the rest of this section.\nThe vector \u03b2 = [\u03b2b] captures the overall influence strength for each process Pb. That is, when a\nprocess influences another, it does so by exponentially distributed inter-event times with a mean of\n\u03b2b + \u2206ba(t). As we now show, to guarantee stationary conditions, it is necessary that 1 \u2264 \u03b2b < \u221e.\nFinally, we consider that each event $t_{a_i}$ has a latent label indicating either that it is exogenous or\nwhich process caused it (edges in Figure 1(a)). Estimation of \u0398 would be trivial if these labels were\nknown. Thus, our learning algorithm (see Section 4) focuses on estimating such labels from data.\n\nAs pointed out, GRANGER-BUSCA\u2019s stationarity (some authors also call this property stability [14,\n34]) depends on 1 \u2264 \u03b2b < \u221e. Let B(t) be the diagonal matrix where each diagonal cell is equal to\n1/(\u03b2b + \u2206ba(t)). Moreover, let $\\Phi(t) = B(t)G = \\left[ \\frac{\\alpha_{ba}}{\\beta_b + \\Delta_{ba}(t)} = \\lambda_{ba}(t) \\right]$. Each value in \u03a6(t) is the\ncross-intensity for a pair of processes. If Pb are represented in the rows and Pa in the columns, the\nexpected number of events at an infinitesimal region for Pa is equal to the row sum of this matrix.\nThis value is simply GRANGER-BUSCA\u2019s cross-feeding intensity without the exogenous factor. Now,\nlet ||X|| be the induced $\\ell_\\infty$-norm (maximum row sum).\nDefinition 1 (Stationarity): Notice that ||\u03a6(t)|| < 1 [26]. 
The proof for this property is straightforward\nand has interesting implications. As G is row-stochastic, we have ||G|| = 1. Also,\n\u03b2b + \u2206ba(t) > 1 since \u03b2b \u2265 1 and \u2206ba(t) > 0. The matrix B(t) will either scale down this norm\nor keep it unchanged. The spectral radius is \u03c1(\u03a6(t)) < 1. Also, $\\lim_{k \\to \\infty} \\Phi(t)^k = 0$.\n\nDue to the above definitions, the model is stationary/stable as it will never generate infinite offspring.\nIn fact, the total number of offspring at any given time is determined by the sum $\\sum_{k=0}^{\\infty} \\Phi(t)^k = (I - \\Phi(t))^{-1}$. Next, we explore the definition of Granger causality for point processes [16, 48].\n\nDefinition 2 (Granger Causality): A process Pa is said to be independent of any other process Pb\nwhen: $\\lambda_a(t \\mid H_a(t)) = \\lambda_a(t \\mid H_P(t))$ for any t \u2208 [0, T ]. In contrast, Pb is defined to Granger cause Pa\nwhen: $\\lambda_a(t \\mid H_a(t)) \\neq \\lambda_a(t \\mid H_Q(t))$, where Q = Pa \u222a Pb. As a consequence, given two processes\nPa and Pb whose intensity functions follow Eq (3), Granger causality arises when \u03b1ba \u2260 0 (footnote 2).\n\n4 Learning GRANGER-BUSCA\n\nRecall from Figure 1 that GRANGER-BUSCA exhibits a cluster-like behavior. That is, exogenous\nobservations arrive at a fixed rate, with each observation being able to trigger new observations\nleading to a burst/idleness cycle. Based on this remark, we developed our Markov Chain Monte Carlo\n(MCMC) sampling algorithm to learn GRANGER-BUSCA from data. Our algorithm will work by\nsampling, for each observation $t_{a_i}$, the hidden process (or label) that likely caused this observation.\nIn other words, we sample a latent variable, $z_{a_i}$, which takes a value of b \u2208 [0, K \u2212 1] when process\nPb influences $t_{a_i}$. When the timestamp is exogenous, we set this value to a constant K. 
Such a number\nmerely represents a label (exogenous) and does not affect our sampling.\n\nTo simplify our learning strategy, we decided to fix \u03b2 = 1. We note that such a value was shown to be\nsufficient for GRANGER-BUSCA to outperform state of the art baselines (see Section 5). Later in\nthis section, we shall discuss that our learning algorithm is easily adaptable for general forms of the\nintensity function. G is estimated based on the number of events that Pb caused on Pa, whereas \u00b5\nis estimated as the maximum likelihood rate for a Poisson process based on the exogenous events\nfor Pa. We can thus learn GRANGER-BUSCA with an Expectation Maximization approach. Hidden\nlabels are estimated in the Expectation step. The matrix G can also be readily updated in such a step.\nWith the labels, \u00b5 is estimated in the Maximization step. Next, we discuss our fitting strategy.\n\nInitially, we exploit the Markovian nature of the process to evaluate \u2206ba(t) at an $O(\\log(N))$ cost.\nGiven some label assignments for the events, we obtain \u2206ba(t) with a binary search over Pa and Pb.\n\nFootnote 2: When interpreting results from GRANGER-BUSCA, we relax the above condition of strict independence to\n\u03b1ba \u2248 0. In such cases, we can state that we have no statistical evidence for Granger causality.\n\nWe explain now how we update the $z_{a_i}$ labels. Given any event at $t_{a_i}$ in process Pa, the event is either\nexogenous or induced by some other previous event on Pa or from some other process Pb. By the\nsuperposition theorem [14, 34, 35, 17], we obtain the conditional probability that an individual event\nat $t_{a_i}$ is exogenous or was caused by process Pb (where b can be equal to a for SFP-like behavior) as:\n\n$$\\Pr[t_{a_i} \\in \\text{EXOG.}] = \\frac{\\mu_a}{\\mu_a + \\sum_{b'=0}^{K-1} \\lambda_{b'a}(t_{a_i})}, \\qquad \\Pr[t_{a_i} \\leftarrow P_b] = \\frac{\\lambda_{ba}(t_{a_i})}{\\mu_a + \\sum_{b'=0}^{K-1} \\lambda_{b'a}(t_{a_i})}, \\tag{4}$$\n\nwhere the \u2190 operator indicates causality. Notice that $z_{a_i} = b$ is equivalent to $t_{a_i} \\leftarrow P_b$. Eq (4) is\ncarried out conditionally on the history $H_a(t_{a_i})$.\n\nWe can substantially accelerate our evaluations by using a modified binary Fenwick Tree [19] (the\nF+Tree [51]) data structure if we break the $z_{a_i}$ label assignment into two steps. First, we decide if it\nis exogenous. Given it is not, we select the inducing process based on the conditional probability:\n\n$$\\Pr[t_{a_i} \\leftarrow P_b \\mid t_{a_i} \\notin \\text{EXOG.}] = \\frac{\\lambda_{ba}(t_{a_i})}{\\sum_{b'=0}^{K-1} \\lambda_{b'a}(t_{a_i})}. \\tag{5}$$\n\nThe evaluation of the probability that an event is not exogenous has an O(1) cost because we estimate\n$\\Pr[t_{a_i} \\notin \\text{EXOG.}] \\,\\hat{=}\\, 1 - e^{-\\mu_a (t_{a_i} - t_{\\mu_a})}$ where $t_{\\mu_a}$ is the last event before $t_{a_i}$ that arrived from an\nexogenous factor (footnote 3). This probability is the complement of the probability that zero Poissonian events\nhappened between $t_{\\mu_a}$ and $t_{a_i}$. As G is row stochastic, we first add a Dirichlet prior over each row\nof this matrix (\u03b1p). We finally sample the hidden labels $z_{a_i}$ as follows:\n\n1. For each process Pa\n   (a) Sample row a from G as \u223c Dirichlet(\u03b1p)\n2. For each process Pa\n   (a) For each observation $t_{a_i} \\in P_a$\n       i. Sample p \u223c Uniform(0, 1)\n          A. When $p < e^{-\\mu_a (t_{a_i} - t_{\\mu_a})}$: $z_{a_i} \\leftarrow$ exogenous\n          B. Otherwise: sample $z_{a_i} \\sim$ Multinomial(Eq 5)\n\nModel parameters are estimated through an MCMC sampler. Starting at an arbitrary random state\n(i.e., labels\u2019 assignment), let nba be the number of times Pb influenced Pa. Similarly, nb captures\nthe number of times Pb influenced any process, including itself. The conditional probability of\nhidden labels for every observation, z, is given by: $\\Pr[z \\mid \\Theta] = \\prod_{a=0}^{K-1} \\prod_{i=0}^{|P_a|-1} \\Pr[t_{a_i} \\notin \\text{EXOG.}] \\times \\Pr[t_{a_i} \\leftarrow P_b \\mid t_{a_i} \\notin \\text{EXOG.}]$. By factoring G, labels are sampled based on the collapsed Gibbs\nsampler [43]. Thus, we re-write Eq (5) below. In this next equation, we point out the Proposal and\nTarget Distributions used on the Metropolis Based Sampler (discussed next).\n\n$$\\Pr[t_{a_i} \\leftarrow P_b \\mid t_{a_i} \\notin \\text{EXOG.}] \\propto \\overbrace{\\underbrace{\\frac{n_{ba}^{-t_{a_i}} + \\alpha_p}{n_b^{-t_{a_i}} + \\alpha_p K}}_{\\text{Proposal Distribution}} \\cdot \\frac{1}{\\beta_b + \\Delta_{ba}(t)}}^{\\text{Target Distribution}}, \\tag{6}$$\n\nwhere $\\alpha_{ba}^{-t_{a_i}} = (n_{ba}^{-t_{a_i}} + \\alpha_p)/(n_b^{-t_{a_i}} + \\alpha_p K)$ is the current estimate of \u03b1ba, with $n_{ba}^{-t_{a_i}}$ being\nthe count for the pair nba excluding the current assignment for $t_{a_i}$, and $n_b^{-t_{a_i}}$ being similarly defined.\nThus, the full algorithm follows an EM approach. After labels are assigned in the Expectation step,\nwe can compute \u00b5a for every process by simply estimating the maximum likelihood Poissonian rate.\nGiven that it takes $O(\\log(N))$ time to compute \u2206ba(t) and O(K) time to compute Eq 5, the total\nsampling complexity per learning iteration for the Gibbs sampler will be $O(N \\log(N) K)$.\n\nFootnote 3: The $\\hat{=}$ operator means equality by definition.\n\nSpeeding Up with a Metropolis Based Sampler: The K factor in the traditional Gibbs sampler may\nbe reduced by exploiting specific data structures, such as the AliasTable [52, 32] or the F+Tree [51].\nIn order to speed up GRANGER-BUSCA, we shall employ the latter (F+Tree). The trade-offs between\nthe two are discussed in [51]. Our choice is motivated by the fact that the F+Tree does not make use\nof stale samples. Thus, we can perform multinomial sampling with an $O(\\log(K))$ cost. 
Given that the\nAliasTable cannot be updated quickly, it is usually only suitable at later iterations [52, 32].\n\nWe exploit the F+Tree by changing our sampler to a Metropolis-Hastings (MH) approach. Using the\ncommon notation for an MH, let Q(b) be the proposal probability density function as detailed in Eq 6.\nHere, b is a candidate process which may replace the current assignment $c = z_{a_i}$. When the target\ndistribution function is simply Eq 6, i.e., P (c) = Eq 6, we can sample the assignment $z_{a_i}$ using the\nacceptance probability min{1, (P (b)Q(c))/(P (c)Q(b))}. In other words, at each iteration we either\nkeep the previous $z_{a_i}$ or replace it with b according to the acceptance probability. As discussed, with\nthe F+Tree, we can sample from Q(b) in $O(\\log(K))$ time. We can also update the tree with the new\nprobabilities after each step with the same cost. Given that the F+Tree has an $O(K \\log(K))$ cost to build,\nwe build the tree once per process. Finally, as required for the Metropolis-Hastings algorithm, it is\ntrivial to see that the proposal distribution is proportional to the target distribution [23].\n\nParallel Sampler: With the F+Tree, candidates are sampled at an $O(\\log(K))$ cost per event. Moreover,\nfinding the previous timestamp costs $O(\\log(N))$. By adding these two costs, the algorithmic complexity\nof GRANGER-BUSCA per iteration is $O(N(\\log(N) + \\log(K)))$. Finally, notice that the sampler\nis easy to parallelize. That is, by iterating over events on a per-process basis, we parallelize the\nalgorithm by scheduling different processes to different CPU cores. Overall, only one vector of variables\nis shared across processes (nb in Eq (6)). In our implementation, each core will read nb for each\nprocess before an iteration. After the iteration, the value is thus updated globally. 
Our sampler falls\nin the case of being Asynchronous with Shared Memory as discussed in [44].\n\nLearning different Kernels: Consider the equivalent rewrite of $\\lambda_a(t) = \\mu_a + \\sum_{b=0}^{K-1} \\alpha_{ba} \\omega_{ba}(t)$,\nwhere $\\omega_{ba}(t) = 1/(\\beta_b + \\Delta_{ba}(t))$ for GRANGER-BUSCA in particular. With this new form, the model\nacts as a mixture of intensities (\u03c9ba(t)). Each pairwise intensity is weighted by the causal parameters\n\u03b1ba. Now, notice that using this form our EM algorithm is easily extensible for different functions\n\u03c9ba(t). The E-step is able to estimate the causal graph (considering that Eq (6) $\\hat{=}\\ \\alpha_{ba}^{-t_{a_i}} \\omega_{ba}(t)$).\nThe M-Step provides maximum likelihood estimates for the specific parameters appearing in \u03c9ba(t).\nIn fact, even unsupervised forms may be learned. As we discuss in the next section, we keep the\naforementioned intensity given that it is simpler and outperforms baselines in our datasets.\n\n5 Results and Experiments\n\nWe now present our main results. GRANGER-BUSCA is compared with three recently proposed\nbaseline methods: ADM4 [53], Hawkes-Granger [48] and Hawkes-Cumulants [1]. The code for\neach method is publicly available via the library tick (footnote 4). Experiments were performed on a dedicated\nAzure Standard DS15 v2 instance with 20 Intel Xeon CPU E5-2673 v3 cores and 140GB of memory.\nWe point out that comparisons are performed with Hawkes methods given that it is the\nmost prominent framework. There is no Wold method for our task. We compare with approaches that\nare: parametric [53] and non-parametric [48, 1], and explore finite [1] and infinite [53, 48] memory.\n\nHyper Parameters: ADM4 enforces an exponential kernel with a fixed rate. We employ a\nTree-structured Parzen Estimator to learn such a rate [7], optimizing for the best model in terms of\nlog-likelihood. For Hawkes-Granger, we fit the model with M = 10 basis functions as suggested\nin [1]. 
Finally, Hawkes-Cumulants [1] also has a single hyper parameter called the half-width, which\nwas also estimated using [7]. Training is performed until convergence or until 300 iterations are\nreached. Our MCMC sampler executes for 300 iterations with \u03b1p = 1/K and \u03b2 = 1.\n\n4 https://github.com/X-DataInitiative/tick. Results produced with version 0.4.0.0.\n\nTable 1: Datasets used for Experiments and Precision Scores for Full Datasets. Due to their sizes,\nonly GRANGER-BUSCA is able to execute on all datasets. To allow comparisons, we execute baseline\nmethods with only the Top-100 destination nodes. Other results are presented in Table 2 and Figure 2.\n\n| Dataset | # Proc (K) | # Obs. (N) | N (Top-100) | Span | %NZ | P@5 | P@10 | P@20 | TT(s) |\n| bitcoinalpha [28] | 3,257 | 23,399 | 2,279 | 5Y | 0.2% | 0.26 | 0.14 | 0.07 | 3 |\n| bitcoinotc [28] | 4,791 | 33,766 | 2,328 | 5Y | 0.1% | 0.25 | 0.14 | 0.07 | 7 |\n| college-msg [39] | 1,313 | 58,486 | 10,869 | 193D | 1.1% | 0.36 | 0.30 | 0.19 | 1 |\n| email [31, 50] | 803 | 327,677 | 92,924 | 803D | 3.74% | 0.23 | 0.28 | 0.32 | 4 |\n| sx-askubuntu [40] | 88,549 | 879,121 | 58,142 | 7Y | 0.006% | 0.25 | 0.13 | 0.06 | 2774 |\n| sx-mathoverflow [40] | 16,936 | 488,984 | 59,602 | 7Y | 0.07% | 0.28 | 0.16 | 0.09 | 98 |\n| sx-superuser [40] | 114,623 | 1,360,974 | 64,866 | 7Y | 0.006% | 0.26 | 0.14 | 0.07 | 4614 |\n| wikitalk [30, 40] | 251,154 | 7,833,140 | 211,344 | 6Y | 0.003% | 0.25 | 0.14 | 0.07 | 27540 |\n| memetracker-100 [29] | 100 | 24,665,418 | - | 9M | 9.85% | 0.30 | 0.29 | 0.22 | 114 |\n| memetracker-500 [29] | 500 | 39,318,989 | - | 9M | 4.44% | 0.30 | 0.30 | 0.23 | 274 |\n\nTable 2: Comparing GRANGER-BUSCA (GB) with Hawkes-Cumulants (HC) on Memetracker.\n\n| Subset | Precision@5 (HC) | Precision@5 (GB) | Precision@10 (HC) | Precision@10 (GB) | Precision@20 (HC) | Precision@20 (GB) | Kendall (HC) | Kendall (GB) | Rel. Error (HC) | Rel. Error (GB) | TT(s) (HC) | TT(s) (GB) |\n| top-100 | 0.06 | 0.30 | 0.09 | 0.29 | 0.01 | 0.22 | 0.05 | 0.26 | 1.0 | 0.44 | 87 | 114 |\n| top-500 | 0.01 | 0.30 | 0.01 | 0.30 | 0.02 | 0.23 | 0.08 | 0.20 | 1.8 | 0.06 | 715 | 274 |\n\nDatasets: We evaluate GRANGER-BUSCA and the aforementioned three baselines on 9 different\ndatasets, all of which were gathered from the Snap Network Repository (https://snap.stanford.edu/data/). Out of the nine datasets,\nwe pay particular attention to the Memetracker data, which is the only one used to evaluate all past\nefforts. The Memetracker dataset consists of web-domains linking to one another. We pre-process\nthe Memetracker dataset using the code made available by [1]. The other eight datasets consist of\nsource nodes (e.g., students or blogs) sending events to destination nodes (e.g., messages or citations).\nEach dataset can be represented as triples D = {(source, destination, timestamp)}. The ground\ntruth network is defined as the graph Gt = {Vt, Et, Wt}, where the vertices Vt are both the sources\nand the destinations. Edges, e = (b, a) \u2208 Et and {b, a} \u2286 Vt, represent the relationship between\ntwo entities. The weighted adjacency matrix of this graph, Gt, is our ground-truth. It is defined as:\nGt[b, a] = (# (b, a) \u2208 D)/(# b is a source). To compose each process from these datasets, each\ndestination node represents a process. In compliance with our notation, we call such processes Pa.\nNotice that we do not consider source nodes, Pb, as processes. Thus, the models will aim to extract\ncausal graphs based on the likelihood that reception of messages at a Pa destination will trigger\nincoming messages. If this is the case, we have evidence that Pa is also a source node. 
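The ground-truth definition Gt[b, a] = (# (b, a) in D)/(# b is a source) can be sketched as follows. This is a minimal Python sketch, assuming events arrive as (source, destination, timestamp) tuples; the function name ground_truth and the sparse dict-of-dicts layout are illustrative choices and are not taken from the authors' code.

```python
from collections import Counter, defaultdict


def ground_truth(triples):
    # triples: iterable of (source, destination, timestamp) events.
    pair_counts = Counter()
    source_counts = Counter()
    for source, destination, _timestamp in triples:
        pair_counts[(source, destination)] += 1
        source_counts[source] += 1
    # Row b holds the fraction of the events sent by b that reached each a.
    graph = defaultdict(dict)
    for (source, destination), count in pair_counts.items():
        graph[source][destination] = count / source_counts[source]
    return dict(graph)
```

Each row of the result sums to one, and pairs that never interact are simply absent, which keeps the representation small for graphs as sparse as those reported in the %NZ column of Table 1.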
Finally, we\npre-process the data to only consider destinations that have also sent messages.\n\nMetrics: We evaluate GRANGER-BUSCA and its competitors using the Precision@n, Kendall\nCorrelation and Relative Error Scores. Each score is measured per node (or row of G), and is\nsummarized for the entire dataset as the average over every node. Both the Kendall Correlation and\nthe Relative Error were proposed as evaluation metrics for networked point processes in [1].\nPrecision@n captures, for each node, the fraction of the top n edges in G (ordered by weight)\nthat are also present in Gt. As we shall discuss, there are several limitations with the Kendall and\nRelative Error scores due to graph sparsity. We argue that Precision@n measured at different cut-offs\n(n) is a more reliable evaluation metric for large and sparse graphs, such as the ones we explore here.\n\nResults: Table 1 describes the datasets used in this work, presenting their number of nodes and of\nobservations. Most baselines do not execute on large datasets in under 24h of training time (TT), in\ncontrast with GRANGER-BUSCA. Given the asymptotic complexity, we estimate that some models\nmay take months to execute. Hence, to allow comparisons with GRANGER-BUSCA, we considered\nsubsamples containing only the events involving the Top-100 destinations. We pay particular attention\nto Top-100 and 500 nodes for Memetracker, given that it was explored in prior efforts [53, 48, 1].\n\nTable 1 also presents the Precision@n scores for GRANGER-BUSCA. Recall that ours is the\nonly approach able to execute on the full sets of data. Notice that the Kendall and Relative\nError scores are absent from Table 1. Given that datasets are sparse \u2013 as shown by the fraction of\nnon-zeros in the ground truth, or %NZ, in Table 1 \u2013 the Kendall Correlations and Relative Errors are\nunreliable metrics for large networks. 
It is well known that complex networks such as ours have long-tailed\ndistributions for the edge-weights [6], leading to the high sparsity levels (%NZ) seen in Table 1. With\nmost cells being zeros, Kendall Scores also tend to zero as most sources connect to few destinations.\nSimilarly, the relative errors will likely increase. In order to avoid divisions by zero, previous efforts\nimpose a constant penalization, of one, when zero edges exist between two nodes. Similar to the\nKendall Correlation, this penalization will also dominate the score due to the sparsity.\n\nBecause of the above limitations of prior efforts and metrics, we are unable to present a fair comparison\nwith competitors on the full datasets. Instead, in Table 2, we present the overall scores for\nGRANGER-BUSCA and the Hawkes-Cumulants (HC) [1] approach, focusing only on the Memetracker\ndata. In this setting, Hawkes-Cumulants has already been shown to be more accurate and faster than\nADM4 [53] and Hawkes-Granger [48] (HG). Observe that GRANGER-BUSCA is more accurate than\nthe competing method in every score. Indeed, Precision@n scores are at least three times greater\ndepending on the cut-off (Precision@5, 10 and 20). Kendall Scores also show a substantial gain: on the\ntop-500 subset, GRANGER-BUSCA achieves an average correlation of 0.20 against 0.08 for HC (2.5 times higher).\nOn the top-100 subset, relative errors for GRANGER-BUSCA are 56% lower than for HC (0.44 versus 1.0). Finally, notice how\nGRANGER-BUSCA is slightly slower than HC when fewer nodes are present (100), but significantly\nsurpasses HC in speed as K increases. This comes from the O(K^3) cubic cost incurred by HC.\n\nFigure 2: Precision Scores for the Top-100 datasets.\n\nTo present an overall view of GRANGER-BUSCA against all three competing methods (ADM4, HC\nand HG), in Figure 2 we show Precision@5, 10 and 20 scores for each approach on every Top-100\ndataset. A total of 26 (out of 27 possible) points are plotted in the figure. 
One single setting, HC with\nMemetracker, is absent since the model was unable to train in sufficient time. The x-axis of the figure\npresents the Precision@n score for the baseline. Similarly, the y-axis shows the Precision@n score\nfor GRANGER-BUSCA. We also show three regions where either GRANGER-BUSCA or competitors\nperform worse than a Null model. This model was created by performing uniformly random rankings.\nNotice that for Precision@5 and Precision@10, GRANGER-BUSCA outperforms most baselines on\na large fraction of the datasets. In fact, for Precision@5, there is only one setting where ADM4\noutperforms GRANGER-BUSCA. As the precision cut-off increases, so does the efficacy of the Null\nmodel (i.e., it is easier for a random ranking to recover top edges). For Precision@20, there do exist\nsome cases where GRANGER-BUSCA is outperformed by baseline methods. However, the majority\nof these cases exist in the region where both models are below the efficacy of a Null model.\n\nWhy does the model work? Recall that a Wold process is an adequate model when there is a strong\ndependency between consecutive inter-event times \u03b4t and \u03b4t+1. To explain our results, we measured\nthe correlation between \u03b4t and \u03b4t+1 for pairs of interacting processes from the ground-truth data. Out\nof nine datasets, the worst case had the median Pearson correlation per pair equal to 0.3, a moderate\nvalue. However, in the remaining eight datasets this median was above 0.70, indicating the adequacy\nof a Wold model. The high linear dependency implies that \u03b4t+1 \u2248 \u03b1\u03b4t + \u03b2 \u2192 E[\u03b4t+1] \u2248 \u03b1 E[\u03b4t] + \u03b2.\nThus, E[\u03b4t+1] is a linear function, f, of the previous inter-event time, justifying the intensity: \u03b4t+1 \u223c\nExponential(\u00b5 = f(\u03b4t)) (see Section 4 for a discussion on how to learn other forms).\n\nIt is also important to understand the limitations of non-parametric methods such as HC. 
Recall that\nHC relies on the statistical moments (e.g., mean/variance) of the inter-event times [1]. Given that web\ndatasets exhibit long tails (with theoretical distributions exhibiting high, or even infinite, variance),\nsuch moments will not accurately capture the underlying behavior of the dataset (see Section 2).\n\n6 Conclusions and Future Work\n\nIn this work, we present the first method to extract Granger causality matrices via Wold Processes.\nThough it was proposed over sixty years ago, this framework of point processes remains largely\nunexplored by the machine learning community. Our approach, called GRANGER-BUSCA, outperforms\nrecent baseline methods [1, 48, 53] both in training time and in overall accuracy. GRANGER-BUSCA\nworks particularly well when extracting the top connections of a node (Precision@5, 10, 20).\n\nGiven the efficacy of GRANGER-BUSCA, our hope is that current results may open up a new class of\nmodels, Wold processes, to be explored by the machine learning community. Finally,\nGRANGER-BUSCA can be used to explore real world behavior in complex large scale datasets.\n\n[Figure 2 (plot): three scatter panels, Precision@5, Precision@10 and Precision@20. x-axis: Competitor; y-axis: Granger-Busca; both axes range from 0.00 to 0.40. Markers: HC, ADM4, HG. Regions: All are random; Competitors are random; GB is random. Annotations: GB Outperforms Competitors (96% of cases), Competitors Outperform GB (4% of cases); remaining percentage annotations: 73%/27% and 39%/61% of cases.]\n\nAcknowledgements\n\nWe thank Fabricio Murai and the anonymous reviewers for providing comments. We also thank\nGabriel Coutinho for discussions on the mathematical properties of GRANGER-BUSCA, as well as\nAlexandre Souza for providing pointers to prior studies. 
This work has been partially supported by\nthe project ATMOSPHERE (atmosphere-eubrazil.eu), funded by the Brazilian Ministry of Science,\nTechnology and Innovation (Project 51119 - MCTI/RNP 4th Coordinated Call) and by the European\nCommission under the Cooperation Programme, Horizon 2020 grant agreement no 777154. Funding\nwas also provided by the authors\u2019 individual grants from CNPq, CAPES and Fapemig. Computational\nresources were provided by the Microsoft Azure for Data Science Research Award (CRM:0740801).\n\nReferences\n\n[1] M. Achab, E. Bacry, S. Ga\u00effas, I. Mastromatteo, and J.-F. Muzy. Uncovering causality from multivariate hawkes integrated cumulants. In ICML, 2017.\n\n[2] R. Alves, R. Assun\u00e7\u00e3o, and P. O. S. Vaz de Melo. Burstiness scale: A parsimonious model for characterizing random series of events. In KDD, 2016.\n\n[3] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 0.91 edition, 2015.\n\n[4] E. Bacry, I. Mastromatteo, and J.-F. Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01), 2015.\n\n[5] Y. Bao, Z. Kuang, P. Peissig, D. Page, and R. Willett. Hawkes process modeling of adverse drug reactions with longitudinal observational data. In ML for HC, 2017.\n\n[6] A.-L. Barab\u00e1si and M. P\u00f3sfai. Network science. Cambridge University Press, Cambridge, 2016.\n\n[7] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. K\u00e9gl. Algorithms for hyper-parameter optimization. In NIPS, 2011.\n\n[8] C. Blundell, J. Beck, and K. A. Heller. Modeling reciprocating relationships with hawkes processes. In NIPS, 2012.\n\n[9] S. Chen, A. Shojaie, E. Shea-Brown, and D. Witten. The multivariate hawkes process in high dimensions: Beyond mutual excitation. arXiv preprint arXiv:1707.04928, 2017.\n\n[10] J. Chevallier, M. J. C\u00e1ceres, M. Doumic, and P. Reynaud-Bouret. Microscopic approach of a time elapsed neural model. 
Mathematical Models and Methods in Applied Sciences, 25(14), 2015.\n\n[11] E. Choi, N. Du, R. Chen, L. Song, and J. Sun. Constructing disease network and temporal progression model via context-sensitive hawkes process. In ICDM, 2015.\n\n[12] D. R. Cox. Some statistical methods connected with series of events. Journal of the Royal Statistical Society. Series B (Methodological), 1955.\n\n[13] D. Daley. Stationary point processes by markov-dependent intervals and infinite intensity. Journal of Applied Probability, 19(A), 1982.\n\n[14] D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes, vol. 1. Springer, New York, 2003.\n\n[15] M. Dhamala, G. Rangarajan, and M. Ding. Estimating granger causality from fourier and wavelet transforms of time series data. Physical review letters, 100(1), 2008.\n\n[16] V. Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 2008.\n\n[17] N. Du, M. Farajtabar, A. Ahmed, A. J. Smola, and L. Song. Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In KDD, 2015.\n\n[18] J. Etesami, N. Kiyavash, K. Zhang, and K. Singhal. Learning network of multivariate hawkes processes: A time series approach. In UAI, 2016.\n\n[19] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3), 1994.\n\n[20] C. W. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, 1969.\n\n[21] P. Guttorp and T. L. Thorarinsdottir. What happened to discrete chaos, the quenouille process, and the sharp markov property? some history of stochastic point processes. International Statistical Review, 80(2), 2012.\n\n[22] D. Hallac, Y. Park, S. Boyd, and J. Leskovec. Network inference via the time-varying graphical lasso. 
In KDD, 2017.\n\n[23] W. K. Hastings. Monte carlo sampling methods using markov chains and their applications.\n\nBiometrika, 57(1):97\u2013109, 1970.\n\n[24] A. G. Hawkes. Point spectra of some mutually exciting point processes. Journal of the Royal\n\nStatistical Society. Series B (Methodological), 1971.\n\n[25] A. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika,\n\n58(1), 1971.\n\n[26] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge university press, 1990.\n\n[27] V. Isham. A markov construction for a multidimensional point process. Journal of Applied\n\nProbability, 14(3), 1977.\n\n[28] S. Kumar, F. Spezzano, V. Subrahmanian, and C. Faloutsos. Edge weight prediction in weighted\n\nsigned networks. In ICDM, 2016.\n\n[29] J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news\n\ncycle. In KDD, 2009.\n\n[30] J. Leskovec, D. P. Huttenlocher, and J. M. Kleinberg. Governance in social media: A case study\n\nof the wikipedia promotion process. In ICWSM, 2010.\n\n[31] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densi\ufb01cation and shrinking\n\ndiameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1), 2007.\n\n[32] A. Q. Li, A. Ahmed, S. Ravi, and A. J. Smola. Reducing the sampling complexity of topic\n\nmodels. In KDD, 2014.\n\n[33] S. Li, X. Gao, W. Bao, and G. Chen. Fm-hawkes: A hawkes process based approach for\n\nmodeling online activity correlations. In CIKM, 2017.\n\n[34] S. Linderman and R. Adams. Discovering latent network structure in point process data. In\n\nICML, 2014.\n\n[35] S. Linderman and R. Adams. Scalable bayesian inference for excitatory point process networks.\n\narXiv preprint arXiv:1507.03228, 2015.\n\n[36] R. P. Monti, P. Hellyer, D. Sharp, R. Leech, C. Anagnostopoulos, and G. Montana. Estimating\ntime-varying brain connectivity networks from functional mri time series. NeuroImage, 103,\n2014.\n\n[37] A. Namaki, A. 
Shirazi, R. Raei, and G. Jafari. Network analysis of a financial market based on genuine correlation and threshold method. Physica A: Statistical Mechanics and its Applications, 390(21), 2011.\n\n[38] Y. Ogata. On lewis\u2019 simulation method for point processes. IEEE Transactions on Information Theory, 27(1):23\u201331, 1981.\n\n[39] P. Panzarasa, T. Opsahl, and K. M. Carley. Patterns and dynamics of users\u2019 behavior and interaction: Network analysis of an online community. Journal of the Association for Information Science and Technology, 60(5), 2009.\n\n[40] A. Paranjape, A. R. Benson, and J. Leskovec. Motifs in temporal networks. In WSDM, 2017.\n\n[41] M.-A. Rizoiu, Y. Lee, S. Mishra, and L. Xie. Hawkes processes for events in social media. In S.-F. Chang, editor, Frontiers of Multimedia Research. ACM, 2018.\n\n[42] J. Silva and R. Willett. Hypergraph-based anomaly detection of high-dimensional co-occurrences. IEEE transactions on pattern analysis and machine intelligence, 31(3), 2009.\n\n[43] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of latent semantic analysis, 427(7), 2007.\n\n[44] A. Terenin and E. P. Xing. Techniques for proving asynchronous convergence results for markov chain monte carlo methods. In NIPS, 2017.\n\n[45] P. O. S. Vaz de Melo, C. Faloutsos, R. Assun\u00e7\u00e3o, R. Alves, and A. A. Loureiro. Universal and distinct properties of communication dynamics: how to generate realistic inter-event times. ACM Transactions on Knowledge Discovery from Data (TKDD), 9(3), 2015.\n\n[46] P. O. S. Vaz de Melo, C. Faloutsos, R. Assun\u00e7\u00e3o, and A. Loureiro. The self-feeding process: A unifying model for communication dynamics in the web. In WWW, 2013.\n\n[47] H. Wold. On stationary point processes and markov chains. Scandinavian Actuarial Journal, 1948(1-2), 1948.\n\n[48] H. Xu, M. Farajtabar, and H. Zha. Learning granger causality for hawkes processes. 
In ICML, 2016.\n\n[49] Y. Yang, J. Etesami, N. He, and N. Kiyavash. Online learning for multivariate hawkes processes. In NIPS, 2017.\n\n[50] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In KDD, 2017.\n\n[51] H.-F. Yu, C.-J. Hsieh, H. Yun, S. Vishwanathan, and I. S. Dhillon. A scalable asynchronous distributed algorithm for topic modeling. In WWW, 2015.\n\n[52] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing, T.-Y. Liu, and W.-Y. Ma. Lightlda: Big topic models on modest computer clusters. In WWW, 2015.\n\n[53] K. Zhou, H. Zha, and L. Song. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In AISTATS, 2013.\n\nA Simulating GRANGER-BUSCA\n\nIn Algorithm 1 we show how Ogata\u2019s Modified Thinning algorithm [38] is adapted for\nGRANGER-BUSCA. We initially point out that some care has to be taken for the initial simulated timestamps.\nGiven that tai (the previous observation) does not exist, the algorithm will need to either start with a\nsynthetic initial increment or fall back to the Poisson rate. In the algorithm, the rate of each individual\nprocess is computed. Then, a new observation is generated based on the sum of such rates. Given\nthat each process will behave like a Poisson process while a new event does not surface (Figure 1),\nthe sum of these processes is also a Poisson process. Lastly, we employ multinomial sampling to\ndetermine the process to which the observation belongs.\n\nB Log Likelihood\n\nWe now derive the log likelihood of GRANGER-BUSCA for parameters \u0398 = {G, \u03b2, \u00b5}. 
For a point\nprocess with intensity \u03bb(t) observed on [0, T], the likelihood can be computed as [14]:\n\n$$L(\\Theta) = \\prod_{i=1}^{N} \\lambda(t_i) \\exp\\left(-\\int_0^{T} \\lambda(t)\\,dt\\right). \\qquad (7)$$\n\nConsidering the intensity of each process separately, we can write the log likelihood as:\n\n$$LL_a(\\Theta) = \\sum_{t_{a_i} \\in P_a} \\log\\big(\\lambda_a(t_{a_i})\\big) - \\int_0^{T_a} \\lambda_a(t)\\,dt$$\n$$= \\sum_{t_{a_i} \\in P_a} \\log\\Big(\\mu_a + \\sum_{b=0}^{K-1} \\frac{\\alpha_{ba}}{\\beta_b + \\Delta_{ba}(t_{a_i})}\\Big) - T_a \\mu_a - \\sum_{t_{a_i} \\in P_a} \\sum_{b=0}^{K-1} \\frac{\\alpha_{ba}\\,(t_{a_i} - t_{a_{i-1}})}{\\beta_b + \\Delta_{ba}(t_{a_{i-1}})}. \\qquad (8)$$\n\nHere, Ta is the last event from Pa. The expansion of the integral $\\int_0^{T_a} \\lambda_a(t)\\,dt$ comes from the\nstepwise nature of \u03bba(t) (see Figure 1). For simplicity, let us initially assume that there is only one\nprocess. As discussed in the paper, computing \u2206ba(t) has a log(N) cost. Due to summations of\nthe form $\\sum_{t_i \\in P} \\sum_{b=0}^{K-1}$, the cost to evaluate LLa(\u0398) is O(K N log(N)). N log(N) is the cost to\nevaluate \u2206ba(t) for every observation.", "award": [], "sourceid": 1546, "authors": [{"given_name": "Flavio", "family_name": "Figueiredo", "institution": "Universidade Federal de Minas Gerais"}, {"given_name": "Guilherme", "family_name": "Resende Borges", "institution": "Universidade Federal de Minas Gerais"}, {"given_name": "Pedro", "family_name": "O.S. Vaz de Melo", "institution": "Universidade Federal de Minas Gerais, Brazil"}, {"given_name": "Renato", "family_name": "Assun\u00e7\u00e3o", "institution": "UFMG"}]}