{"title": "Learning Hawkes Processes from a handful of events", "book": "Advances in Neural Information Processing Systems", "page_first": 12715, "page_last": 12725, "abstract": "Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.", "full_text": "Learning Hawkes Processes from a Handful of Events

Farnood Salehi* (EPFL), farnood.salehi@epfl.ch
William Trouleau* (EPFL), william.trouleau@epfl.ch
Matthias Grossglauser (EPFL), matthias.grossglauser@epfl.ch
Patrick Thiran (EPFL), patrick.thiran@epfl.ch

Abstract

Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical.
Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.

1 Introduction

In many real-world applications, including finance, computational biology, social-network studies, criminology, and epidemiology, it is important to gain insight from the interactions of multivariate time series of discrete events. For example, in finance, changes in the price of a stock might affect the market [4]; and in epidemiology, individuals infected by an infectious disease might spread the disease to their neighbors [15]. Such networks of time series often exhibit mutually exciting patterns of diffusion. Hence, a recurring issue is to learn in an unsupervised way the causal structure of interacting networks. This task is usually tackled by defining a so-called causal graph of entities, where an edge from a node i to a node j means that events in node j depend on the history of node i. Such causal interactions are typically learned with either directed information [24, 23], transfer entropy [26], or Granger causality [2, 11].

A widely used model for capturing mutually exciting patterns in a multi-dimensional time series is the Multivariate Hawkes process (MHP), a particular type of temporal point process where an event in one dimension can affect future arrivals in other dimensions.
It has been shown that the excitation matrix learned for an MHP encodes the causal structure between the processes, both in terms of Granger causality [12] and directed information [13]. Most studies focus on the scalability of MHPs to large datasets. However, in many applications, data can be very expensive to collect, or simply not available. For example, in economic and public-health studies, collecting survey data is usually an expensive process. Similarly, in the case of epidemic modeling, it is critical to learn as fast as possible the patterns of diffusion of a spreading disease. As a result, the amount of data available is intrinsically limited. MHPs are known to be sensitive to the amount of data used for training, and the excitation patterns learned by MHPs from short sequences can be unreliable [29]. In such settings, the likelihood becomes noisy and regularization becomes critical. Nevertheless, as most hyper-parameter tuning algorithms, such as grid search, random search, and even Bayesian optimization, become challenging when the number of hyper-parameters is large, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model.

*The first two authors contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this work, we address both the small-data and the hyper-parameter tuning issues by considering the parameters of the model as latent variables and by developing an efficient algorithm based on variational expectation-maximization. By estimating the evidence – rather than the likelihood – the proposed approach is able to optimize, with minimal computational complexity, over an extended set of hyper-parameters. Our approach is also able to take into account the uncertainty in the model parameters by fitting a posterior distribution over them.
Therefore, rather than just providing a point estimate, this approach can provide an estimate of the uncertainty on the learned causal graph. Experimental results on synthetic and real datasets show that, as a result, the proposed approach significantly outperforms state-of-the-art methods under short observation sequences, and maintains the same performance in the large-data regime.

2 Related Work

The most common approaches to learning the excitation matrix of MHPs are based on variants of regularized maximum-likelihood estimation (MLE). Zhou et al. [31] propose regularizers that enforce sparse and low-rank structures, along with an efficient algorithm based on the alternating-direction method of multipliers. To mitigate the parametric assumption, Xu et al. [28] represent the excitation functions as a series of basis functions and, to achieve sparsity under this representation, they propose a sparse group-lasso regularizer. Such estimation methods are often referred to as non-parametric, as they enable more flexibility in the shape of the excitation functions [19, 16]. To estimate the excitation matrix without any parametric modeling, fully non-parametric approaches were developed [32, 1]. However, these methods focus on scalability and target settings where large-scale datasets are available.

Bayesian methods go beyond the classic MLE approach by enabling a probabilistic interpretation of the model parameters. A few studies have tackled the problem of learning MHPs from a Bayesian perspective. Linderman and Adams [20] use a Gibbs-sampling-based approach, but the convergence of the proposed algorithm is slow. To tackle this problem, Linderman and Adams [21] discretize time, which introduces noise into the model. In a different setting where some of the events or dimensions are hidden, Linderman et al.
[22] use an expectation-maximization algorithm to marginalize over the unseen part of the network.

Bayesian probabilistic models are usually intractable and require approximate inference. To address this issue, variational inference (VI) approximates the high-dimensional posterior of the probabilistic model. It has recently gained interest in many applications: VI is used, to name a few, for word embedding [5, 7], paragraph embedding [17], and knowledge-graph embedding [6]. For more details on this topic, we refer the reader to Zhang et al. [30] and Blei et al. [9]. Variational inference has also proven to be a successful approach to learning hyper-parameters [8, 6]. Building on recent advances in variational inference, we develop in this work a variational expectation-maximization algorithm by interpreting the parameters of an MHP as latent variables of a probabilistic model.

3 Preliminary Definitions

3.1 Multivariate Hawkes Processes

Formally, a D-dimensional MHP is a collection of D univariate counting processes N_i(t), i = 1, ..., D, whose realization over an observation period [0, T) consists of a sequence of discrete events S = {(t_n, i_n)}, where t_n ∈ [0, T) is the timestamp of the n-th event and i_n ∈ {1, ..., D} is its dimension. Each process has a conditional intensity function of the particular form

\lambda_i(t) = \mu_i + \sum_{j=1}^{D} \int_0^t \phi_{ij}(t - \tau)\, dN_j(\tau),   (1)

where µ_i > 0 is the constant exogenous part of the intensity of process i, and the excitation function φ_ij : R+ →
R+ encodes the effect of past events from dimension j on future events from dimension i. The larger the values of φ_ij(t), the more likely events in dimension j will trigger events in dimension i. It has been shown that the excitation matrix [φ_ij(t)] encodes the causal structure of the MHP in terms of Granger causality, i.e., φ_ij(t) = 0 if and only if process j does not Granger-cause process i [13, 12].

Most of the literature uses a parametric form for the excitation functions. The most popular form is the exponential excitation function

\phi_{ij}(t) = w_{ij}\, \beta e^{-\beta t}.   (2)

However, in most applications the excitation patterns are unknown and this form might be too restrictive. Hence, to alleviate the assumption of a particular form for the excitation function, other approaches [19, 16, 28] over-parameterize the space and encode the excitation functions as a linear combination of M basis functions κ_1(t), κ_2(t), ..., κ_M(t), as

\phi_{ij}(t) = \sum_{m=1}^{M} w^m_{ij}\, \kappa_m(t),   (3)

where the basis functions are often exponential or Gaussian kernels [28]. This kind of approach is generally referred to as non-parametric. In the experiments of Section 5, we investigate the performance of both parametric and non-parametric approaches to learning MHPs from small sequences of observations. We denote the set of exogenous rates by µ = {µ_i}_{i=1}^D ∈ R^D_+ and the excitation matrix by W := {{w^m_{ij}}_{m=1}^M}_{i,j=1}^D ∈ R^{D²M}_+.

3.2 Maximum Likelihood Estimation

Suppose that we observe a sequence of discrete events S = {(t_n, i_n)}_{n=1}^N over an observation period [0, T). The most common approach to learning the parameters of an MHP given S is to perform a regularized maximum-likelihood estimation [31, 3, 28], which amounts to minimizing an objective function that is the sum of the negative log-likelihood and of a penalty term that induces some desired structural properties.
Specifically, the objective is to solve the optimization problem

\hat{\mu}, \hat{W} = \operatorname*{argmin}_{\mu \ge 0,\, W \ge 0} \; -\log p(S | \mu, W) + \frac{1}{\alpha} R(\mu, W),   (4)

where the log-likelihood of the parameters is given by

\log p(S | \mu, W) = \sum_{(t_n, i_n) \in S} \log \lambda_{i_n}(t_n) - \sum_{i=1}^{D} \int_0^T \lambda_i(t)\, dt.   (5)

The particular choice of penalty R(µ, W), along with the single hyper-parameter α controlling its influence, depends on the problem at hand. For example, a necessary condition to ensure that the learned model is stable is that lim_{t→∞} φ_ij(t) = 0 and that the spectral radius of the excitation matrix is less than 1 [10]. Hence, a common penalty used is

R_p(\mu, W) = \sum_{i,j=1}^{D} \sum_{m=1}^{M} |w^m_{ij}|^p,   (6)

with p = 1 or 2 in [32, 31, 28]. Another common assumption is that the graph is sparse. In this case, a Group-Lasso penalty of the form

R_{1,2}(\mu, W) = \sum_{i,j=1}^{D} \sqrt{\sum_{m=1}^{M} (w^m_{ij})^2}   (7)

is commonly used to enforce sparsity in the excitation functions [28].

Small data amplifies the danger of overfitting; hence the choice of regularizers and their hyper-parameters becomes essential. Nevertheless, to control the influence of the penalty in (4), all state-of-the-art methods are limited by the use of a single shared hyper-parameter α. Ideally, we would have a different hyper-parameter to independently control the effect of the penalty on each parameter of the model. However, the number of parameters, i.e., D²M + D, grows quadratically with the dimension D of the problem. To make matters worse, the most common approaches used to fine-tune the choice of hyper-parameters, i.e., grid search and random search, become computationally prohibitive when the number of hyper-parameters becomes large. Indeed, the search space grows exponentially with the number of hyper-parameters.
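Before turning to remedies, the estimation problem itself is easy to make concrete. The sketch below evaluates the intensity (1) and the log-likelihood (5) for the exponential kernel (2), for which the compensator integral has a closed form. The function names and the toy event sequence are ours, not the paper's implementation:

```python
import numpy as np

def intensity(t, events, mu, W, beta):
    """Conditional intensity of Eq. (1) with exponential kernels
    phi_ij(s) = W[i, j] * beta * exp(-beta * s), given past events (t_n, i_n)."""
    lam = mu.copy()
    for t_n, j in events:
        if t_n < t:
            lam += W[:, j] * beta * np.exp(-beta * (t - t_n))
    return lam

def log_likelihood(events, T, mu, W, beta):
    """Log-likelihood of Eq. (5): log-intensities at the events, minus the
    integral of every intensity over [0, T), which is closed-form here."""
    ll = 0.0
    for n, (t_n, i_n) in enumerate(events):  # events sorted by time
        ll += np.log(intensity(t_n, events[:n], mu, W, beta)[i_n])
    ll -= mu.sum() * T  # exogenous part of the compensator
    for t_n, i_n in events:  # each event adds W[:, i_n] * (1 - e^{-beta (T - t_n)})
        ll -= W[:, i_n].sum() * (1.0 - np.exp(-beta * (T - t_n)))
    return ll

# Toy 2-dimensional example: dimension 0 excites dimension 1 only.
mu = np.array([0.5, 0.5])
W = np.array([[0.0, 0.0],
              [0.5, 0.0]])
ll = log_likelihood([(1.0, 0), (2.0, 1)], T=4.0, mu=mu, W=W, beta=1.0)
```

Setting W = 0 reduces the model to independent Poisson processes, which gives a quick correctness check for implementations of (5).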
Another tuning approach is Bayesian optimization of the hyper-parameters, but its cost also becomes prohibitive, as the number of samples required to learn the landscape of the cost function grows exponentially with the number of hyper-parameters [27]. We describe the details of our proposed approach in the next section.

4 Proposed Learning Approach

We now introduce the proposed approach for learning MHPs. The approach enables us to use a different hyper-parameter for each model parameter and to efficiently tune them all while taking parameter uncertainty into account. It is based on the variational expectation-maximization (EM) algorithm and jointly optimizes both the model parameters µ and W and the hyper-parameters α.

First, we can view regularized MLE as a maximum a posteriori (MAP) estimator of a model whose parameters are considered as latent variables. Under this interpretation, regularizers on the model parameters correspond to unnormalized priors on the latent variables. The optimization problem becomes

\hat{\mu}, \hat{W} = \operatorname*{argmax}_{\mu \ge 0,\, W \ge 0} \log p_\alpha(\mu, W, S) = \operatorname*{argmax}_{\mu \ge 0,\, W \ge 0} \log p(S | \mu, W) + \log p_\alpha(\mu, W).   (8)

Therefore, having a better regularizer means having a better prior. In the presence of a long sequence of observations, we want the prior to be as uninformative as possible (i.e., a smaller regularization), as we have access to enough information for the MLE to accurately estimate the parameters of the model. But in the case where we only observe short sequences, we want to use more informative priors to avoid overfitting (i.e., a larger regularization).

Unfortunately, the MAP estimator cannot adjust the influence of the prior by optimizing over α. Indeed, the cost function in (8) is unbounded from above, and solving Equation (8) with respect to α trivially leads to a divergent solution 1/α → ∞.
To address this issue, we can take a Bayesian approach: integrate out the parameters and optimize the evidence (or marginal likelihood) p_α(S) instead of the likelihood. Such an approach changes the optimization problem of Equation (8) into

\hat{\alpha} = \operatorname*{argmax}_{\alpha \ge 0} \iint p(S | \mu, W)\, p_\alpha(\mu, W)\, d\mu\, dW = \operatorname*{argmax}_{\alpha \ge 0} p_\alpha(S).   (9)

Unlike the MAP objective function, maximizing the evidence over α does not lead to a degenerate solution, because the evidence is upper bounded by the likelihood. However, this optimization problem can be solved only for simple models where the integral has a closed form, which requires a prior conjugate to the likelihood. Therefore, we use variational inference to estimate the evidence and develop a variational EM algorithm to optimize our objective with respect to α.

4.1 Variational Expectation-Maximization for Multivariate Hawkes Processes

Variational inference. The derivation of the variational objective is as follows. First, postulate a variational distribution q_λ(µ, W), parameterized by the variational parameters λ, that approximates the posterior p(µ, W | S). The variational parameters λ are chosen such that the Kullback–Leibler divergence between the true posterior p(µ, W | S) and the variational distribution q_λ(µ, W) is minimized.
It is known that minimizing KL[q_λ(µ, W) ‖ p(µ, W | S)] is equivalent to maximizing the evidence lower bound (ELBO) [9, 30], defined as

ELBO(q_\lambda, \alpha) := E_{q_\lambda}[\log p_\alpha(\mu, W, S)] - E_{q_\lambda}[\log q_\lambda(\mu, W)].   (10)

By invoking Jensen's inequality on the integral p_\alpha(S) = \iint p_\alpha(\mu, W, S)\, d\mu\, dW, we obtain the desired lower bound on the evidence, p_α(S) ≥ ELBO(q_λ, α), and maximizing ELBO(q_λ, α) with respect to λ makes the bound tighter.

For simplicity, we adopt the mean-field assumption by choosing a variational distribution q_λ(µ, W) that factorizes over the latent variables.² As the parameters µ and W of an MHP are non-negative, a good candidate to approximate the posterior is a log-normal distribution. We define the variational parameters λ = {ν, e^ω} as the mean and the standard deviation of q_λ. We denote the standard deviation by e^ω because we optimize its log ω, which naturally ensures its positivity and the stability of the optimization procedure. Although we present our learning approach for the log-normal distribution, it is easily generalizable to other distributions.

Variational EM algorithm. In order to efficiently optimize the ELBO with respect to both the variational parameters λ and the hyper-parameters α, we use the variational EM algorithm, which iterates over the two following steps: the E-step maximizes the ELBO with respect to the variational parameters λ in order to get a tighter lower bound on the evidence, and the M-step updates the hyper-parameters α with a closed-form update. Details of the two steps are as follows.

The E-step maximizes the ELBO with respect to the variational parameters λ, to make the variational distribution q_λ(µ, W) close to the exact posterior p(µ, W | S) and to ensure that the ELBO is a good proxy for the evidence. To evaluate the ELBO, we use the black-box variational-inference optimization of [18, 25].
We re-parameterize the model as

\mu = g(\varepsilon^\mu) = \exp(\nu^\mu + e^{\omega^\mu} \odot \varepsilon^\mu),
W = g(\varepsilon^W) = \exp(\nu^W + e^{\omega^W} \odot \varepsilon^W),

where ε^µ (resp. ε^W) has the same shape as µ (resp. W), with each element following a normal distribution N(0, 1), and ⊙ denotes the element-wise product. This trick enables us to rewrite the first, intractable expectation term of the ELBO in (10) as

E_{q_\lambda}[\log p_\alpha(\mu, W, S)] = E_{\varepsilon \sim N(0, I)}\left[\log p_\alpha(g(\varepsilon^\mu), g(\varepsilon^W), S)\right].   (11)

The second term of the ELBO in (10) is the entropy of the log-normal distribution, which can be expressed, up to a constant, as the sum of ν + ω over all parameters. The ELBO can then be estimated by Monte-Carlo integration as

ELBO(q_\lambda, \alpha) \approx \frac{1}{L} \sum_{\ell=1}^{L} \log p_\alpha(g(\varepsilon^\mu_\ell), g(\varepsilon^W_\ell), S) + \sum (\nu + \omega),   (12)

where L is the number of Monte-Carlo samples ε_1, ..., ε_L and the second sum runs over all entries of ν^µ, ω^µ, ν^W, ω^W. Note that the first term of (12) is the cost function of the MAP problem (8) evaluated at {µ, W} = {g(ε^µ_ℓ), g(ε^W_ℓ)} for ℓ ∈ [L]. Hence, the E-step amounts to maximizing the right-hand side of (12) with respect to λ using gradient descent.

In the M-step, the ELBO is used as a proxy for the evidence p_α(S) and is maximized with respect to the hyper-parameters α. Again, we rely on the re-parameterization technique and compute an unbiased estimate of the ELBO as in (12). The maximizer of the estimate (12) with respect to α has a closed form that depends on the choice of prior. We provide the closed-form solutions in Appendix A for a few proposed priors that emulate common regularizers.
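The two steps can be sketched numerically. In the snippet below (our own illustration, not the paper's code), `log_joint` is a stand-in for log p_α(µ, W, S) rather than the actual Hawkes model; the E-step estimate draws reparameterized samples θ = exp(ν + e^ω ⊙ ε), and the hyper-parameter update uses, as a worked example, the maximizer for a Laplace prior p_α(w) = exp(−|w|/α)/(2α), which is the mean absolute value of the sampled weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(log_joint, nu, omega, L=1):
    """Monte-Carlo ELBO estimate in the spirit of Eq. (12), for a mean-field
    log-normal q with mean parameter nu and log-standard-deviation omega."""
    total = 0.0
    for _ in range(L):
        eps = rng.standard_normal(nu.shape)
        theta = np.exp(nu + np.exp(omega) * eps)  # reparameterization theta = g(eps)
        total += log_joint(theta)
    return total / L + np.sum(nu + omega)  # entropy term, up to an additive constant

def damped_alpha_update(alpha, theta_samples, gamma=0.9):
    """Damped hyper-parameter update for a Laplace prior exp(-|w|/alpha)/(2 alpha):
    the per-sample maximizer of the expected log-prior is the mean absolute value
    of the sampled parameters, smoothed by a momentum term gamma."""
    alpha_star = np.mean([np.mean(np.abs(th)) for th in theta_samples])
    return gamma * alpha + (1.0 - gamma) * alpha_star

# Toy check with three latent parameters and a quadratic-in-log-space joint.
log_joint = lambda theta: -0.5 * np.sum(np.log(theta) ** 2)
nu, omega = np.zeros(3), np.log(0.1) * np.ones(3)
val = elbo_estimate(log_joint, nu, omega, L=200)
alpha = damped_alpha_update(1.0, [np.exp(nu)])
```

For this toy joint, the exact expectation of the first term is −0.015, so the Monte-Carlo estimate `val` concentrates around −0.015 + 3 log(0.1).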
To avoid fast changes in α due to the variance of the Monte-Carlo integration, we take an update similar to the one in [6]: a weighted average between the current estimate and the minimizer of the current Monte-Carlo estimate of the negative ELBO,

\alpha := \gamma \cdot \alpha + (1 - \gamma) \cdot \operatorname*{argmin}_{\tilde{\alpha}} \; -\frac{1}{L} \sum_{\ell=1}^{L} \log p_{\tilde{\alpha}}(g(\varepsilon^\mu_\ell), g(\varepsilon^W_\ell), S),   (13)

where γ ∈ [0, 1] is the momentum term.

Algorithm 1 summarizes the proposed variational EM approach.³ The computational complexity of the inner-most loop of Algorithm 1 is L times the complexity of an iteration of gradient descent on the log-likelihood. However, as observed by recent studies in variational inference, using L = 1 is usually sufficient in many applications [18]. Hence, we use L = 1 in all our experiments, leading to the same per-iteration computational complexity as MLE using gradient descent.

²This assumption can be relaxed using more advanced techniques, at the cost of a higher computational complexity.
³Source code is available publicly.

Algorithm 1 Variational EM algorithm for Multivariate Hawkes Processes
Input: Sequence of observations S = {(t_n, i_n)}_{n=1}^N. Initial values for α and λ. Momentum term 0 ≤ γ < 1. Sample size L of the Monte-Carlo integrations. Numbers of iterations T_EM and T_E of EM-steps and E-steps. Learning rate η.
1:  for t = 1, ..., T_EM do
2:      for s = 1, ..., T_E do                      (E step)
3:          Sample Gaussian noise ε_1, ..., ε_L ∼ N(0, I).
4:          Evaluate the ELBO using Equation (12).
5:          Update ν ← ν + η(∇_ν f(ν, ω, ε; α) + 1).
6:          Update ω ← ω + η(∇_ω f(ν, ω, ε; α) + 1).
7:      end for
8:      Sample Gaussian noise ε_1, ..., ε_L.        (M step)
9:      Update α using Equation (13).
10: end for
Output: α, λ

Here f(ν, ω, ε; α) denotes the first (Monte-Carlo) term of (12), and the +1 terms are the gradients of the entropy term Σ(ν + ω).

5 Experimental Results

We carry out two sets of experiments.
First, we perform a link-prediction task on synthetic data to show that our approach can accurately recover the support of the excitation matrix of the MHP under short sequences. Second, we perform an event-prediction task on real datasets of short sequences to show that our approach outperforms state-of-the-art methods in terms of predictive log-likelihood.

We run our experiments in two different settings. First, in a parametric setting where the exponential form of the excitation function is known, we compare our approach (VI-EXP) to the state-of-the-art MLE-based method (MLE-ADM4) of Zhou et al. [31]. Second, we use a non-parametric setting where no assumption is made on the shape of the excitation function. We then set the excitation function as a mixture of M = 10 Gaussian kernels defined as

\kappa_m(t) = (2\pi b^2)^{-1} \exp\!\left(-(t - \tau_m)^2 / (2 b^2)\right), \quad \forall m = 1, \ldots, M,   (14)

where τ_m and b are the known location and scale of the m-th kernel. In this setting, we compare our approach (VI-SG) to the state-of-the-art MLE-based method (MLE-SGLP) of Xu et al. [28] with the same {κ_m(t)}.⁴ Let us stress that the parametric methods have a strong advantage over the non-parametric ones, because they are given the true value of the exponential decay β.

As our VI approach returns a posterior over the parameters rather than a point estimate, we use the mode of the approximate log-normal posterior as the inferred edges {ŵ_ij}. For the non-parametric setting, we use ŵ_ij = Σ_{m=1}^M ŵ^m_ij. To mimic the regularization schemes of the baselines, we use a Laplacian prior on the edge weights {w_ij} to enforce sparsity, and a Gaussian prior on the baseline rates {µ_i}. We tune the hyper-parameters of the baselines using grid search.⁵

5.1 Synthetic Data

We first evaluate the performance of our VI approach on simulated data.
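As a side note, the Gaussian-basis parameterization of the non-parametric setting, combining (14) with the linear expansion (3), can be sketched as follows (our own code; the kernel locations, scale, and weights are illustrative, and the (2πb²)⁻¹ normalization follows Eq. (14) as printed):

```python
import numpy as np

def gaussian_basis(t, centers, b):
    """Gaussian basis kernels kappa_m of Eq. (14): one column per location tau_m."""
    t = np.asarray(t, dtype=float)
    return np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * b ** 2)) / (2 * np.pi * b ** 2)

def excitation(t, w, centers, b):
    """Non-parametric excitation phi_ij(t) = sum_m w[m] * kappa_m(t), as in Eq. (3)."""
    return gaussian_basis(t, centers, b) @ w

# M = 10 kernels with evenly spaced (assumed) locations.
centers = np.linspace(0.5, 5.0, 10)
phi = excitation(np.array([0.5, 1.0]), w=np.full(10, 0.1), centers=centers, b=0.25)
```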
We generate random Erdős–Rényi graphs with D = 50 nodes and edge probability p = log(D)/D. Then, a sequence of observations is generated from an MHP with the exponential excitation functions defined in (2), with exponential decay β = 1. The baseline rates {µ*_i} are sampled independently from Unif[0, 0.02], and the edge weights {w*_ij} are sampled independently from Unif[0.1, 0.2]. Results are averaged over 30 graphs with 10 simulations each. For reproducibility, a detailed description of the experimental setup is provided in Appendix E.

To investigate whether the support of the excitation matrix can be accurately recovered under small data, we evaluate the performance of each approach on three metrics [32, 28, 14].

⁴ We also performed the experiments with other approaches designed for large-scale datasets, but their performance was below that of the reported baselines [1, 20, 21].
⁵ More details are provided in Appendix E.
(cid:105)(cid:96)(cid:28)(cid:66)(cid:77)(cid:66)(cid:77)(cid:59) (cid:50)(cid:112)(cid:50)(cid:77)(cid:105)(cid:98)\n\n(cid:76)(cid:109)(cid:75)(cid:35)(cid:50)(cid:96) (cid:81)(cid:55) (cid:105)(cid:96)(cid:28)(cid:66)(cid:77)(cid:66)(cid:77)(cid:59) (cid:50)(cid:112)(cid:50)(cid:77)(cid:105)(cid:98)\n\n(cid:559)(cid:625)(cid:515)\n(cid:2638)(cid:559)(cid:625)(cid:2615)(cid:559)\n\n(a)\n\n(cid:96)\n(cid:81)\n(cid:96)\n(cid:96)\n(cid:50)\n\n(cid:50)\n(cid:112)\n(cid:66)\n(cid:105)\n(cid:28)\n(cid:72)\n(cid:50)\n(cid:95)\n\n(cid:625)(cid:1264)(cid:606)(cid:513)(cid:625)(cid:1264)(cid:513)(cid:625)(cid:625)(cid:1264)(cid:593)(cid:513)(cid:559)(cid:1264)(cid:625)(cid:625)(cid:559)(cid:1264)(cid:606)(cid:513)(cid:559)(cid:1264)(cid:513)(cid:625)\n\n(cid:559)(cid:625)(cid:600)\n\n(cid:559)(cid:625)(cid:515)\n\n(cid:76)(cid:109)(cid:75)(cid:35)(cid:50)(cid:96) (cid:81)(cid:55) (cid:105)(cid:96)(cid:28)(cid:66)(cid:77)(cid:66)(cid:77)(cid:59) (cid:50)(cid:112)(cid:50)(cid:77)(cid:105)(cid:98)\n\n(c)\n\nFigure 1: Performance measured by (a) F1-Score, (b) Precision@20, and (c) Relative error with\nrespect to the number of training samples. Our VI approaches are shown in solid lines. The non-\nparametric methods are highlighted with square markers. Results are averaged over 30 random graphs\nwith 10 simulations each (\u00b1 standard deviation).\n\nresulting binary edge classi\ufb01cation problem6.\n\n\u2022 F1-score. We zero-out small weights using a threshold \u2318 = 0.04 and measure the F1-score of the\n\u2022 Precision@k. Instead of thresholding, we also report the precision@k de\ufb01ned by the average\nfraction of correctly identi\ufb01ed edges in the top k largest estimated weights. Since the proposed\nVI approach gives an estimate of uncertainty via the variance of the posterior, we select the edges\nwith high weights \u02c6wij and low uncertainty, i.e., the edges with ratio of lowest standard deviation\nover weight \u02c6wij.\n\n\u2022 Relative error. 
To evaluate the distance between the estimated weights and the ground-truth ones, we use the averaged relative error, defined as |ŵ_ij − w*_ij| / w*_ij when w*_ij ≠ 0, and as ŵ_ij / (min_{w*_kl > 0} w*_kl) when w*_ij = 0. This metric is more sensitive to errors in small weights w*_ij, and therefore penalizes false-positive errors more than false-negative ones.

We investigate the sensitivity of each approach to the amount of data available for training by varying the size of the training set from N = 750 to N = 25 000 events, i.e., 15 to 500 events per node. Results are shown in Figure 1. Our approach improves the results in both the parametric and the non-parametric settings for all metrics. The improvements are more substantial in the non-parametric setting. While the accuracy of the top edges is similar for VI-SG and MLE-SGLP in terms of precision@20, VI-SG improves the F1-score by about 20% with N = 5 000 training events. The reason for this improvement is that MLE-SGLP has a much higher false-positive rate, which hurts the F1-score but does not affect the precision@20. VI-SG is also able to reach the same F1-score as the parametric baseline MLE-ADM4 with only N = 4 000 training events.⁷
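For reference, the three evaluation metrics above can be computed as in this short sketch (our own helper functions, applied to a toy weight matrix):

```python
import numpy as np

def precision_at_k(w_true, w_est, k):
    """Fraction of the k largest estimated weights that are true edges."""
    top = np.argsort(w_est, axis=None)[-k:]
    return np.mean(w_true.flatten()[top] > 0)

def relative_error(w_true, w_est):
    """Averaged relative error; zero ground-truth weights are scored against the
    smallest true weight, penalizing false positives more than false negatives."""
    w_min = w_true[w_true > 0].min()
    err = np.where(w_true > 0,
                   np.abs(w_est - w_true) / np.where(w_true > 0, w_true, 1.0),
                   w_est / w_min)
    return err.mean()

# Toy example: two true edges (0.2 and 0.1), one spurious estimate (0.05).
w_true = np.array([[0.2, 0.0],
                   [0.0, 0.1]])
w_est = np.array([[0.1, 0.05],
                  [0.0, 0.1]])
p20 = precision_at_k(w_true, w_est, 2)
rel = relative_error(w_true, w_est)
```

The F1-score of the thresholded weights can then be obtained from the same binary masks (w_est > η versus w_true > 0).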
Note that VI-SG is optimizing D²M + D = 25 050 hyper-parameters with minimal additional cost.

⁶ Additional results with varying thresholds η are provided in Appendix D.
⁷ We present additional results with various thresholds η and k in Appendix D.

Figure 2: Analysis of the robustness of the non-parametric approaches to the number of bases M of the excitation functions (for fixed N = 2 000).
Figure 3: Analysis of the uncertainty of the parameters learned by VI-EXP (for fixed N = 5 000): (a) uncertainty of the inferred edges and (b) histogram of the learned α. The learned α are smaller for non-edges, and false-positive edges have higher uncertainty than the true-positive ones.

In the next experiment, we focus on the non-parametric setting, fix the length of observation to N = 5 000, and study the effect of increasing M on the performance of the algorithms. The results are shown in Figure 2. We see that our approach is more robust to the choice of M than MLE-SGLP. A possible explanation for this behavior is that MLE-SGLP overfits due to the increasing number of model parameters.

Finally, we investigate the parameters of the model learned by our VI-EXP approach. In Figure 3a, we use the variance of the approximate posterior q_λ as a measure of confidence for edge identification, and we report the distribution of the ratio of standard deviation over weight ŵ_ij for both the true-positive and the false-positive edges. Similar results hold between the true-negative and the false-negative edges. The false-positive edges have a higher uncertainty than the true-positive ones. This is relevant when we cannot identify all edges due to lack of data but still wish to identify a subset of edges with high confidence. In addition, Figure 3b confirms that, as expected, the optimized weight priors α are much larger for true edges of the ground-truth excitation matrix than for non-edges.

5.2 Real Data

We also evaluate the performance of our approach on the following three small datasets:

1. Epidemics. This dataset contains records of infection of individuals, along with their corresponding district of residence, during the last Ebola epidemic in West Africa in 2014-2015 [15].
To learn the propagation network of the epidemic, we consider the 54 districts as processes and define infection records as events.

2. Stock market. This dataset contains the stock prices of 12 high-tech companies, sampled every 2 minutes on the New York Stock Exchange for 20 days in April 2008 [13]. We consider each stock as a process and record an event every time a stock price changes by 0.15% from its current value.

3. Enron email. This dataset contains emails exchanged between Enron employees, taken from the Enron corpus. We consider all employees with more than 10 received emails as processes and record an event every time an employee receives an email.

We perform an event-prediction task to show that our approach outperforms the state-of-the-art methods in terms of predictive log-likelihood. To do so, we use the first 70% of the events as the training set, and we compute the averaged held-out log-likelihood on the remaining 30%. We present the results in Table 1.

We first see that the non-parametric methods outperform the parametric ones on both the Epidemics dataset and the Stock market dataset. This suggests that the exponential excitation function might be too restrictive to fit their excitation patterns. In addition, our non-parametric approach VI-SG significantly outperforms MLE-SGLP on all datasets. The improvement is particularly clear for the Epidemics dataset, which has the smallest number of events per dimension. Indeed, the top edges learned by VI-SG correspond to contiguous districts, as expected. 
This is not the case for MLE-SGLP, for which the top learned edges correspond to districts that are far from each other.

Table 1: Predictive log-likelihood for the models learned on various real datasets. The last four columns report the averaged predictive log-likelihood.

Dataset        #dim (D)   #events (N)    VI-SG   MLE-SGLP   VI-EXP   MLE-ADM4
Epidemics            54         5 349    -2.06      -3.03    -4.31      -4.61
Stock market         12         7 089    -1.00      -2.45    -2.82      -2.81
Enron email         143        74 294    -0.42      -1.01    -0.23      -0.40

6 Conclusion

We proposed a novel approach to learn the excitation matrix of a multivariate Hawkes process in the presence of short observation sequences. We observed that state-of-the-art methods are sensitive to the amount of data used for training, and we showed that the proposed approach outperforms these methods when only short training sequences are available. The common tool for tackling this problem is to design smarter regularization schemes. However, all maximum-likelihood-based methods suffer from a common limitation: all the model parameters are regularized equally with only a few hyper-parameters. We developed a variational expectation-maximization algorithm that is able to (1) optimize over an extended set of hyper-parameters, with almost no additional cost, and (2) take into account the uncertainty of the learned model parameters by fitting a posterior distribution over them. We performed experiments on both synthetic and real datasets and showed that our approach outperforms state-of-the-art methods in small-data regimes.

Acknowledgments

We would like to thank Negar Kiyavash and Jalal Etesami for the valuable discussions and insightful feedback at the early stage of this work. The work presented in this paper was supported in part by the Swiss National Science Foundation under grant number 200021-182407.

References

[1] Massil Achab, Emmanuel Bacry, Stéphane Gaïffas, Iacopo Mastromatteo, and Jean-François Muzy. 
Uncovering causality from multivariate Hawkes integrated cumulants. The Journal of Machine Learning Research, 18(1):6998–7025, 2017.

[2] Andrew Arnold, Yan Liu, and Naoki Abe. Temporal causal modeling with graphical Granger methods. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 66–75, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-609-7. doi: 10.1145/1281192.1281203. URL http://doi.acm.org/10.1145/1281192.1281203.

[3] Emmanuel Bacry, Stéphane Gaïffas, and Jean-François Muzy. A generalization error bound for sparse and low-rank multivariate Hawkes processes. arXiv preprint arXiv:1501.00725, 2015.

[4] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.

[5] Robert Bamler and Stephan Mandt. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[6] Robert Bamler, Farnood Salehi, and Stephan Mandt. Augmenting and tuning knowledge graph embeddings. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2019.

[7] Oren Barkan. Bayesian neural word embedding. In AAAI, pages 3135–3143, 2017.

[8] JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, M West, et al. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian Statistics, 7:453–464, 2003.

[9] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

[10] D. J. Daley and D. Vere-Jones. An introduction to the theory of point processes. Vol. I. Probability and its Applications (New York). Springer-Verlag, New York, second edition, 2003. ISBN 0-387-95541-0. 
Elementary theory and methods.

[11] Michael Eichler. Graphical modelling of multivariate time series. Probability Theory and Related Fields, 153(1):233–268, Jun 2012. ISSN 1432-2064. doi: 10.1007/s00440-011-0345-8. URL https://doi.org/10.1007/s00440-011-0345-8.

[12] Michael Eichler, Rainer Dahlhaus, and Johannes Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242, 2017.

[13] Jalal Etesami, Negar Kiyavash, Kun Zhang, and Kushagra Singhal. Learning network of multivariate Hawkes processes: a time series approach. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16, pages 162–171, Arlington, Virginia, United States, 2016. AUAI Press. ISBN 978-0-9966431-1-5. URL http://dl.acm.org/citation.cfm?id=3020948.3020966.

[14] Flavio Figueiredo, Guilherme Borges, Pedro O. S. Vaz de Melo, and Renato Assunção. Fast estimation of causal interactions using Wold processes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 2975–2986, USA, 2018. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=3327144.3327220.

[15] Tini Garske, Anne Cori, Archchun Ariyarajah, Isobel M. Blake, Ilaria Dorigatti, Tim Eckmanns, Christophe Fraser, Wes Hinsley, Thibaut Jombart, Harriet L. Mills, Gemma Nedjati-Gilani, Emily Newton, Pierre Nouvellet, Devin Perkins, Steven Riley, Dirk Schumacher, Anita Shah, Maria D. Van Kerkhove, Christopher Dye, Neil M. Ferguson, and Christl A. Donnelly. Heterogeneities in the case fatality ratio in the West African Ebola outbreak 2013-2016. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1721):20160308, 2017. doi: 10.1098/rstb.2016.0308. 
URL https://royalsocietypublishing.org/doi/abs/10.1098/rstb.2016.0308.

[16] Niels Richard Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21(1):83–143, 2015. doi: 10.3150/13-BEJ562. URL https://hal.archives-ouvertes.fr/hal-00722668.

[17] Geng Ji, Robert Bamler, Erik B Sudderth, and Stephan Mandt. Bayesian paragraph vectors. Symposium on Advances in Approximate Bayesian Inference, 2017.

[18] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

[19] Remi Lemonnier and Nicolas Vayatis. Nonparametric Markovian learning of triggering kernels for mutually exciting and mutually inhibiting multivariate Hawkes processes. In Toon Calders, Floriana Esposito, Eyke Hüllermeier, and Rosa Meo, editors, Machine Learning and Knowledge Discovery in Databases, pages 161–176, Berlin, Heidelberg, 2014. Springer Berlin Heidelberg.

[20] Scott Linderman and Ryan Adams. Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413–1421, 2014.

[21] Scott W Linderman and Ryan P Adams. Scalable Bayesian inference for excitatory point process networks. arXiv preprint arXiv:1507.03228, 2015.

[22] Scott W. Linderman, Yixin Wang, and David M Blei. Bayesian inference for latent Hawkes processes. NeurIPS Symposium on Advances in Approximate Bayesian Inference, 2017.

[23] C. J. Quinn, N. Kiyavash, and T. P. Coleman. Directed information graphs. IEEE Transactions on Information Theory, 61(12):6887–6909, Dec 2015. ISSN 0018-9448. doi: 10.1109/TIT.2015.2478440.

[24] Arvind Rao, Alfred O. Hero, David J. States, and James Douglas Engel. Using directed information to build biologically relevant influence networks. 
Journal of Bioinformatics and Computational Biology, 06(03):493–519, 2008. doi: 10.1142/S0219720008003515.

[25] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[26] Thomas Schreiber. Measuring information transfer. Phys. Rev. Lett., 85:461–464, Jul 2000. doi: 10.1103/PhysRevLett.85.461. URL https://link.aps.org/doi/10.1103/PhysRevLett.85.461.

[27] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.

[28] Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. In Proceedings of The 33rd International Conference on Machine Learning, pages 1717–1726. PMLR, 2016. URL http://proceedings.mlr.press/v48/xuc16.html.

[29] Hongteng Xu, Dixin Luo, and Hongyuan Zha. Learning Hawkes processes from short doubly-censored event sequences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 3831–3840. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3306077.

[30] Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[31] Ke Zhou, Hongyuan Zha, and Le Song. Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In AISTATS, volume 31 of JMLR Workshop and Conference Proceedings, pages 641–649. JMLR.org, 2013.

[32] Ke Zhou, Hongyuan Zha, and Le Song. 
Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning, volume 28, pages 1301–1309, 2013.