{"title": "Online Learning for Multivariate Hawkes Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 4937, "page_last": 4946, "abstract": "We develop a nonparametric and online learning algorithm that estimates the triggering functions of a multivariate Hawkes process (MHP). The approach we take approximates the triggering function $f_{i,j}(t)$ by functions in a reproducing kernel Hilbert space (RKHS), and maximizes a time-discretized version of the log-likelihood, with Tikhonov regularization. Theoretically, our algorithm achieves an $\mathcal{O}(\log T)$ regret bound. Numerical results show that our algorithm offers a performance competitive with that of the nonparametric batch learning algorithm, with a run time comparable to that of parametric online learning algorithms.", "full_text": "Online Learning for Multivariate Hawkes Processes

Yingxiang Yang*, Jalal Etesami†, Niao He†, Negar Kiyavash*†
University of Illinois at Urbana-Champaign
{yyang172,etesami2,niaohe,kiyavash}@illinois.edu
Urbana, IL 61801

Abstract

We develop a nonparametric and online learning algorithm that estimates the triggering functions of a multivariate Hawkes process (MHP). The approach we take approximates the triggering function $f_{i,j}(t)$ by functions in a reproducing kernel Hilbert space (RKHS), and maximizes a time-discretized version of the log-likelihood, with Tikhonov regularization. Theoretically, our algorithm achieves an $\mathcal{O}(\log T)$ regret bound. Numerical results show that our algorithm offers a performance competitive with that of the nonparametric batch learning algorithm, with a run time comparable to that of parametric online learning algorithms.

1 Introduction

Multivariate Hawkes processes (MHPs) are counting processes in which an arrival in one dimension can affect the arrival rates of the other dimensions. They were originally proposed to statistically model the arrival patterns of earthquakes [16].
However, the MHP's ability to capture mutual excitation between the dimensions of a process also makes it a popular model in many other areas, including high frequency trading [3], modeling neural spike trains [24], modeling diffusion in social networks [28], and capturing causality [12, 18].
For a $p$-dimensional MHP, the intensity function of the $i$-th dimension takes the following form:
$$\lambda_i(t) = \mu_i + \sum_{j=1}^{p} \int_0^t f_{i,j}(t-\tau)\,\mathrm{d}N_j(\tau), \qquad (1)$$
where the constant $\mu_i$ is the base intensity of the $i$-th dimension, $N_j(t)$ counts the number of arrivals in the $j$-th dimension within $[0,t]$, and $f_{i,j}(t)$ is the triggering function that embeds the underlying causal structure of the model. In particular, one arrival in the $j$-th dimension at time $\tau$ will affect the intensity of the arrivals in the $i$-th dimension at time $t$ by the amount $f_{i,j}(t-\tau)$ for $t > \tau$. Therefore, learning the triggering functions is the key to learning an MHP model. In this work, we consider the problem of estimating the $f_{i,j}(t)$s using nonparametric online learning techniques.
1.1 Motivations
Why nonparametric? Most existing works consider exponential triggering functions:
$$f_{i,j}(t) = \alpha_{i,j}\exp\{-\beta_{i,j}t\}\mathbf{1}\{t > 0\}, \qquad (2)$$
where $\alpha_{i,j}$ is unknown while $\beta_{i,j}$ is given a priori. Under this assumption, learning $f_{i,j}(t)$ is equivalent to learning a single real number, $\alpha_{i,j}$. However, there are many scenarios where (2) fails to describe the correct mutual influence pattern between dimensions.

*Department of Electrical and Computer Engineering. †Department of Industrial and Enterprise Systems Engineering. This work was supported in part by MURI grant ARMY W911NF-15-1-0479 and ONR grant W911NF-15-1-0479.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
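As a concrete illustration of (1) under the exponential parameterization (2), the sketch below evaluates the intensity vector at a single time point. It is illustrative only: the function name, the rates, and the event times are hypothetical, not taken from the paper.

```python
import numpy as np

def intensity(t, mu, alpha, beta, events):
    """lambda_i(t) = mu_i + sum_j sum_{tau in events[j], tau < t}
    alpha[i, j] * exp(-beta[i, j] * (t - tau)), i.e. (1) with kernels (2)."""
    p = len(mu)
    lam = np.array(mu, dtype=float)
    for j in range(p):
        taus = np.asarray(events[j])
        taus = taus[taus < t]            # only past arrivals excite
        for i in range(p):
            lam[i] += alpha[i, j] * np.exp(-beta[i, j] * (t - taus)).sum()
    return lam

# Illustrative 2-dimensional example (all values hypothetical)
mu = [0.05, 0.05]
alpha = np.array([[0.5, 0.2], [0.0, 0.4]])
beta = np.array([[2.0, 3.0], [1.0, 5.0]])
events = [[0.3, 1.1], [0.8]]             # arrival times per dimension
lam = intensity(2.0, mu, alpha, beta, events)
assert np.all(lam >= np.array(mu))       # excitation only adds to the base rate
```

Note how $\alpha_{2,1} = 0$ makes dimension 1 unable to excite dimension 2, which is exactly the kind of structural (causal) information carried by the triggering matrix.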
For example, [20] and [11] have reported delayed and bell-shaped triggering functions when applying the MHP model to neural spike train datasets. Moreover, when the $f_{i,j}(t)$s are not exponential, or when the $\beta_{i,j}$s are inaccurate, the formulation in (2) is prone to model mismatch [15].
Why online learning? There are many reasons to consider an online framework. (i) Batch learning algorithms do not scale well due to their high computational complexity [15]. (ii) The data can be costly to observe, and can be streaming in nature, for example, in criminology.
The above concerns motivate us to design an online learning algorithm in the nonparametric regime.

1.2 Related Works

Earlier works on learning the triggering functions can be largely categorized into three classes.
Batch and parametric. The simplest way to learn the triggering functions is to assume that they possess a parametric form, e.g. (2), and learn the coefficients. The most widely used estimators include the maximum likelihood estimator [23] and the minimum mean-square error estimator [2]. These estimators can also be generalized to the high dimensional case when the coefficient matrix is sparse and low-rank [2]. More generally, one can assume that the $f_{i,j}(t)$s lie within the span of a given set of basis functions $S = \{e_1(t), \ldots, e_{|S|}(t)\}$: $f_{i,j}(t) = \sum_{i=1}^{|S|} c_i e_i(t)$, where the $e_i(t)$s have a given parametric form [13, 27]. The state of the art among such algorithms is [27], where $|S|$ is adaptively chosen, which sometimes requires a significant portion of the data to determine the optimal $S$.
Batch and nonparametric. A more sophisticated approach towards finding the set $S$ is explored in [29], where the coefficients and the basis functions are iteratively updated and refined. Unlike [27], where the basis functions take a predetermined form, [29] updates the basis functions by solving a set of Euler-Lagrange equations in the nonparametric regime.
However, the formulation of [29] is nonconvex, and therefore optimality is not guaranteed. The method also requires more than $10^5$ arrivals per dimension in order to obtain good results, on networks of fewer than 5 dimensions. Another way to estimate the $f_{i,j}(t)$s nonparametrically is proposed in [4], which solves a set of $p$ Wiener-Hopf systems, each of dimension at least $p^2$. The algorithm works well on small dimensions; however, it requires inverting a $p^2 \times p^2$ matrix, which is costly, if not altogether infeasible, when $p$ is large.
Online and parametric. To the best of our knowledge, learning the triggering functions in an online setting seems largely unexplored. Under the assumption that the $f_{i,j}(t)$s are exponential, [15] proposes an online algorithm using gradient descent, exploiting the evolutionary dynamics of the intensity function. The time axis is discretized into small intervals, and the updates are performed at the end of each interval. While the authors provide the online solution to the parametric case, their work cannot readily extend to the nonparametric setting where the triggering functions are not exponential, mainly because the evolutionary dynamics of the intensity functions no longer hold. Learning triggering functions nonparametrically remains an open problem.

1.3 Challenges and Our Contributions

Designing an online algorithm in the nonparametric regime is not without its challenges: (i) It is not clear how to represent the $f_{i,j}(t)$s. In this work, we relate $f_{i,j}(t)$ to an RKHS. (ii) Although online learning with kernels is a well-studied subject in other scenarios [19], a typical choice of loss function for learning an MHP usually involves the integral of the $f_{i,j}(t)$s, which prevents the direct application of the representer theorem. (iii) The outputs of the algorithm at each step require a projection step to ensure positivity of the intensity function.
This requires solving a quadratic programming problem, which can be computationally expensive. How to circumvent this computational complexity issue is another challenge of this work.
In this paper, we design, to the best of our knowledge, the first online learning algorithm for the triggering functions in the nonparametric regime. In particular, we tackle the challenges mentioned above, and the only assumptions we make are that the triggering functions $f_{i,j}(t)$ are positive, have a decreasing tail, and belong to an RKHS. Theoretically, our algorithm achieves a regret bound of $\mathcal{O}(\log T)$, and numerical experiments show that our approach outperforms the previous approaches despite the fact that they are tailored to a less general setting. In particular, our algorithm attains a similar performance to the nonparametric batch learning maximum likelihood estimators while reducing the run time extensively.
1.4 Notations
Prior to discussing our results, we introduce the basic notations used in the paper. Detailed notations will be introduced along the way. For a $p$-dimensional MHP, we denote the intensity function of the $i$-th dimension by $\lambda_i(t)$. We use $\lambda(t)$ to denote the vector of intensity functions, and we use $F = [f_{i,j}(t)]$ to denote the matrix of triggering functions. The $i$-th row of $F$ is denoted by $f_i$. The number of arrivals in the $i$-th dimension up to $t$ is denoted by the counting process $N_i(t)$. We set $N(t) = \sum_{i=1}^{p} N_i(t)$. The estimates of these quantities are denoted by their "hatted" versions. The arrival time of the $n$-th event in the $j$-th dimension is denoted by $\tau_{j,n}$. Lastly, define $\lfloor x \rfloor_y = y\lfloor x/y \rfloor$.
2 Problem Formulation
In this section, we introduce our assumptions and definitions, followed by the formulation of the loss function. We omit the basics on MHPs and instead refer the readers to [22] for details.
Assumption 2.1.
We assume that the constant base intensity $\mu_i$ is lower bounded by a given threshold $\mu_{\min} > 0$. We also assume bounded and stationary increments for the MHP [16, 9]: for any $t, z > 0$, $N_i(t) - N_i(t-z) \le \kappa_z = \mathcal{O}(z)$. See Appendix A for more details.
Definition 2.1. Suppose that $\{t_k\}_{k=0}^{\infty}$ is an arbitrary time sequence with $t_0 = 0$, and $\sup_{k\ge 1}(t_k - t_{k-1}) \le \delta \le 1$. Let $\varepsilon_f : [0,\infty) \to [0,\infty)$ be a continuous and bounded function such that $\lim_{t\to\infty}\varepsilon_f(t) = 0$. Then, $f(x)$ satisfies the decreasing tail property with tail function $\varepsilon_f(t)$ if
$$\sum_{k=m}^{\infty} (t_k - t_{k-1}) \sup_{x\in(t_{k-1},\,t_k]} |f(x)| \le \varepsilon_f(t_{m-1}), \qquad \forall m > 0.$$
Assumption 2.2. Let $\mathcal{H}$ be an RKHS associated with a kernel $K(\cdot,\cdot)$ that satisfies $K(x,x) \le 1$. Let $L^1[0,\infty)$ be the space of functions whose absolute value is Lebesgue integrable. For any $i,j \in \{1,\ldots,p\}$, we assume that $f_{i,j}(t) \in \mathcal{H}$ and $f_{i,j}(t) \in L^1[0,\infty)$, with both $f_{i,j}(t)$ and $\mathrm{d}f_{i,j}(t)/\mathrm{d}t$ satisfying the decreasing tail property of Definition 2.1.
Assumption 2.1 is common and has been adopted in the existing literature [22]. It ensures that the MHP is not "explosive" by assuming that $N(t)/t$ is bounded. Assumption 2.2 restricts the tail behaviors of both $f_{i,j}(t)$ and $\mathrm{d}f_{i,j}(t)/\mathrm{d}t$. Complicated as it may seem, functions with exponentially decaying tails satisfy this assumption, as is illustrated by the following example (see Appendix B for proof):
Example 1. The functions $f_1(t) = \exp\{-\beta t\}\mathbf{1}\{t > 0\}$ and $f_2(t) = \exp\{-(t-\gamma)^2\}\mathbf{1}\{t > 0\}$ satisfy Assumption 2.2 with tail functions $\beta^{-1}\exp\{-\beta(t-\delta)\}$ and $\sqrt{2\pi}\,\mathrm{erfc}(t/\sqrt{2} - \gamma)\exp\{\delta^2/2\}$, respectively.
2.1 A Discretized Loss Function for Online Learning
A common approach for learning the parameters of an MHP is to perform regularized maximum likelihood estimation.
As such, we introduce a loss function comprised of the negative of the log-likelihood function and a penalty term to enforce desired structural properties, e.g. sparsity of the triggering matrix $F$ or smoothness of the triggering functions (see, e.g., [2, 29, 27]). The negative of the log-likelihood function of an MHP over a time interval $[0,t]$ is given by
$$L_t(\lambda) := -\sum_{i=1}^{p}\left(\int_0^t \log\lambda_i(\tau)\,\mathrm{d}N_i(\tau) - \int_0^t \lambda_i(\tau)\,\mathrm{d}\tau\right). \qquad (3)$$
Let $\{\tau_1, \ldots, \tau_{N(t)}\}$ denote the arrival times of all the events within $[0,t]$ and let $\{t_0, \ldots, t_{M(t)}\}$ be a finite partition of the time interval $[0,t]$ such that $t_0 = 0$ and $t_{k+1} := \min_{\tau_i \ge t_k}\{\lfloor t_k\rfloor_\delta + \delta, \tau_i\}$. Using this partitioning, it is straightforward to see that the function in (3) can be written as
$$L_t(\lambda) = \sum_{i=1}^{p}\sum_{k=1}^{M(t)}\left(\int_{t_{k-1}}^{t_k}\lambda_i(\tau)\,\mathrm{d}\tau - x_{i,k}\log\lambda_i(t_k)\right) := \sum_{i=1}^{p} L_{i,t}(\lambda_i), \qquad (4)$$
where $x_{i,k} := N_i(t_k) - N_i(t_{k-1})$. By the definition of $t_k$, we know that $x_{i,k} \in \{0,1\}$. In order to learn the $f_{i,j}(t)$s using an online kernel method, we require a result similar to the representer theorem in [25] that specifies the form of the optimizer. This theorem requires the regularized version of the loss in (4) to be a function of only the $f_{i,j}(t)$s. However, due to the integral part, $L_t(\lambda)$ is a function of both the $f_{i,j}(t)$s and their integrals, which prevents us from applying the representer theorem directly. To resolve this issue, several approaches can be applied, such as adjusting the Hilbert space as proposed in [14] in the context of Poisson processes, or approximating the log-likelihood function as in [15].
Here, we adopt a method similar to [15] and approximate (4) by discretizing the integral:
$$L_t^{(\delta)}(\lambda) := \sum_{i=1}^{p}\sum_{k=1}^{M(t)}\Big((t_k - t_{k-1})\lambda_i(t_k) - x_{i,k}\log\lambda_i(t_k)\Big) := \sum_{i=1}^{p} L_{i,t}^{(\delta)}(\lambda_i). \qquad (5)$$
Intuitively, if $\delta$ is small enough and the triggering functions are bounded, it is reasonable to expect that $L_{i,t}(\lambda_i)$ is close to $L_{i,t}^{(\delta)}(\lambda_i)$. Below, we characterize the accuracy of the above discretization and also of the truncation of the intensity function. First, we require the following definition.
Definition 2.2. We define the truncated intensity function as follows:
$$\lambda_i^{(z)}(t) := \mu_i + \sum_{j=1}^{p}\int_0^t \mathbf{1}\{t - \tau < z\}\, f_{i,j}(t-\tau)\,\mathrm{d}N_j(\tau). \qquad (6)$$
Proposition 1. Under Assumptions 2.1 and 2.2, for any $i \in \{1,\ldots,p\}$, we have
$$\Big|L_{i,t}^{(\delta)}(\lambda_i^{(z)}) - L_{i,t}(\lambda_i)\Big| \le (1 + \kappa_1\mu_{\min}^{-1})\,N(t-z)\,\varepsilon(z) + \delta N(t)\,\varepsilon'(0),$$
where $\mu_{\min}$ is the lower bound for $\mu_i$, $\kappa_1$ is the upper bound for $N_i(t) - N_i(t-1)$ from Assumption 2.1, while $\varepsilon$ and $\varepsilon'$ are two tail functions that uniformly capture the decreasing tail property of all the $f_{i,j}(t)$s and all the $\mathrm{d}f_{i,j}(t)/\mathrm{d}t$s, respectively.
The first term in the bound characterizes the approximation error incurred when one truncates $\lambda_i(t)$ to $\lambda_i^{(z)}(t)$. The second term describes the approximation error caused by the discretization. When $z = \infty$, $\lambda_i(t) = \lambda_i^{(z)}(t)$, and the approximation error is contributed solely by the discretization. Note that, in many cases, a small enough truncation error can be obtained by setting a relatively small $z$. For example, for $f_{i,j}(t) = \exp\{-3t\}\mathbf{1}\{t > 0\}$, setting $z = 10$ would result in a truncation error less than $10^{-13}$.
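To make the discretization concrete, the following sketch evaluates the discretized loss (5) on the truncated intensity (6). All names and values here are hypothetical, and fixed exponential triggering functions stand in for the RKHS estimates.

```python
import numpy as np

def truncated_intensity(t, mu_i, f_row, events, z):
    """lambda_i^(z)(t) of (6): arrivals older than t - z are ignored."""
    lam = mu_i
    for j, taus in enumerate(events):
        for tau in taus:
            if 0 < t - tau < z:
                lam += f_row[j](t - tau)
    return lam

def discretized_loss(grid, mu_i, f_row, events_i, events, z):
    """L_i^(delta) of (5): sum over intervals of
    (t_k - t_{k-1}) * lambda_i(t_k) - x_{i,k} * log(lambda_i(t_k)),
    where x_{i,k} counts arrivals of dimension i in (t_{k-1}, t_k]."""
    loss = 0.0
    for t_prev, t_k in zip(grid[:-1], grid[1:]):
        lam = truncated_intensity(t_k, mu_i, f_row, events, z)
        x = sum(1 for tau in events_i if t_prev < tau <= t_k)
        loss += (t_k - t_prev) * lam - x * np.log(lam)
    return loss

# Illustrative 2-dimensional example (all values hypothetical)
f_row = [lambda s: 0.5 * np.exp(-2.0 * s), lambda s: 0.2 * np.exp(-3.0 * s)]
events = [[0.3, 1.1], [0.8]]
grid = np.arange(0.0, 2.05, 0.05)        # delta = 0.05
loss = discretized_loss(grid, 0.05, f_row, events[0], events, z=3.0)
```

Shrinking the grid spacing (smaller $\delta$) drives the sum toward the integral term of (4), at the cost of more update steps, which is exactly the trade-off quantified by Proposition 1.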
Meanwhile, truncating $\lambda_i(t)$ greatly simplifies the procedure of computing its value. Hence, in our algorithm, we focus on $\lambda_i^{(z)}$ instead of $\lambda_i$.
In the following, we consider the regularized instantaneous loss function with Tikhonov regularization for the $f_{i,j}(t)$s and $\mu_i$:
$$l_{i,k}(\lambda_i) := (t_k - t_{k-1})\lambda_i(t_k) - x_{i,k}\log\lambda_i(t_k) + \frac{1}{2}\omega_i\mu_i^2 + \sum_{j=1}^{p}\frac{\zeta_{i,j}}{2}\|f_{i,j}\|_{\mathcal{H}}^2, \qquad (7)$$
and aim at producing a sequence of estimates $\{\hat\lambda_i(t_k)\}_{k=1}^{M(t)}$ of $\lambda_i(t)$ with minimal regret:
$$\min_{\mu_i \ge \mu_{\min},\, f_{i,j}(t) \ge 0}\;\sum_{k=1}^{M(t)} l_{i,k}\big(\hat\lambda_i(t_k)\big) - \sum_{k=1}^{M(t)} l_{i,k}\big(\lambda_i(t_k)\big). \qquad (8)$$
Each regularized instantaneous loss function in (7) is jointly strongly convex with respect to the $f_{i,j}$s and $\mu_i$. Combined with the representer theorem in [25], the minimizer of (8) is a linear combination of a finite set of kernels. In addition, by setting $\zeta_{i,j} = \mathcal{O}(1)$, our algorithm achieves $\beta$-stability with $\beta = \mathcal{O}((\zeta_{i,j}t)^{-1})$, which is typical for a learning algorithm in an RKHS (Theorem 22 of [8]).
3 Online Learning for MHPs

We introduce our NonParametric OnLine Estimation for MHPs (NPOLE-MHP) in Algorithm 1. The most important components of the algorithm are (i) the computation of the gradients and (ii) the projections in lines 6 and 8.

Algorithm 1 NonParametric OnLine Estimation for MHP (NPOLE-MHP)
1: input: a sequence of step sizes $\{\eta_k\}_{k=1}^{\infty}$ and a set of regularization coefficients $\zeta_{i,j}$, along with positive values of $\mu_{\min}$, $z$ and $\sigma$. output: $\hat\mu^{(M(t))}$ and $\hat F^{(M(t))}$.
2: Initialize $\hat f_{i,j}^{(0)}$ and $\hat\mu_i^{(0)}$ for all $i, j$.
3: for $k = 0, \ldots, M(t)-1$ do
4:   Observe the interval $[t_k, t_{k+1})$, and compute $x_{i,k}$ for $i \in \{1,\ldots,p\}$.
5:   for $i = 1, \ldots, p$ do
6:     Set $\hat\mu_i^{(k+1)} \leftarrow \max\big\{\hat\mu_i^{(k)} - \eta_{k+1}\,\partial_{\mu_i} l_{i,k}\big(\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big),\; \mu_{\min}\big\}$.
7:     for $j = 1, \ldots, p$ do
8:       Set $\hat f_{i,j}^{(k+\frac{1}{2})} \leftarrow \hat f_{i,j}^{(k)} - \eta_{k+1}\,\partial_{f_{i,j}} l_{i,k}\big(\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big)$, and $\hat f_{i,j}^{(k+1)} \leftarrow \Pi\big[\hat f_{i,j}^{(k+\frac{1}{2})}\big]$.
9:     end for
10:   end for
11: end for

For the partial derivative with respect to $\mu_i$, recall the definition of $l_{i,k}$ in (7) and of $\lambda_i^{(z)}$ in (6). Since $\lambda_i^{(z)}$ is a linear function of $\mu_i$, we have
$$\partial_{\mu_i} l_{i,k}\big(\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big) = (t_k - t_{k-1}) - x_{i,k}\big[\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big]^{-1} + \omega_i\hat\mu_i^{(k)} \triangleq \rho_k + \omega_i\hat\mu_i^{(k)},$$
where $\rho_k$ is the simplified notation for the first two terms. Upon performing gradient descent, the algorithm makes sure that $\hat\mu_i^{(k+1)} \ge \mu_{\min}$, which further ensures that $\lambda_i^{(z)}(\hat\mu_i^{(k+1)}, \hat f_i^{(k+1)}) \ge \mu_{\min}$. For the update step of $\hat f_{i,j}^{(k)}(t)$, notice that $l_{i,k}$ is also a linear function with respect to $f_{i,j}$.
Since $\partial_{f_{i,j}} f_{i,j}(x) = K(x,\cdot)$, which holds true due to the reproducing property of the kernel, we thus have
$$\partial_{f_{i,j}} l_{i,k}\big(\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big) = \rho_k\sum_{\tau_{j,n}\in[t_k - z,\, t_k)} K(t_k - \tau_{j,n}, \cdot) + \zeta_{i,j}\hat f_{i,j}^{(k)}(\cdot). \qquad (9)$$
Once again, a projection $\Pi[\cdot]$ is necessary to ensure that the estimated triggering functions are positive.
3.1 Projection of the Triggering Functions
For any kernel, the projection step for a triggering function can be executed by solving a quadratic programming problem: $\min \|f - \hat f_{i,j}^{(k+\frac{1}{2})}\|_{\mathcal{H}}^2$ subject to $f \in \mathcal{H}$ and $f(t) \ge 0$. Ideally, the positivity constraint has to hold for every $t > 0$, but in order to simplify computation, one can approximate the solution by relaxing the constraint such that $f(t) \ge 0$ holds only for a finite set of $t$s within $[0,z]$.
Semi-Definite Programming (SDP). When the reproducing kernel is polynomial, the problem is much simpler. The projection step can be formulated as an SDP problem [26] as follows:
Proposition 2. Let $S = \cup_{r\le k}\{t_r - \tau_{j,n} : t_r - z \le \tau_{j,n} < t_r\}$ be the set of $t_r - \tau_{j,n}$s. Let $K(x,y) = (1+xy)^{2d}$ and $K'(x,y) = (1+xy)^d$ be two polynomial kernels with $d \ge 1$. Furthermore, let $K$ and $G$ denote the Gramian matrices whose $(i,j)$-th elements correspond to $K(s,s')$ and $K'(s,s')$, respectively, with $s$ and $s'$ being the $i$-th and $j$-th elements of $S$.
Suppose that $a \in \mathbb{R}^{|S|}$ is the coefficient vector such that $\hat f_{i,j}^{(k+\frac{1}{2})}(\cdot) = \sum_{s\in S} a_s K(s,\cdot)$, and that the projection step returns $\hat f_{i,j}^{(k+1)}(\cdot) = \sum_{s\in S} b^*_s K(s,\cdot)$. Then the coefficient vector $b^*$ can be obtained by
$$b^* = \operatorname*{argmin}_{b\in\mathbb{R}^{|S|}} \; -2a^\top K b + b^\top K b, \qquad \text{s.t.}\quad G\cdot\mathrm{diag}(b) + \mathrm{diag}(b)\cdot G \succeq 0. \qquad (10)$$
Non-convex approach. Alternatively, we can assume that $f_{i,j}(t) = g_{i,j}^2(t)$ where $g_{i,j}(t) \in \mathcal{H}$. By minimizing the loss with respect to $g_{i,j}(t)$, one can naturally guarantee that $f_{i,j}(t) \ge 0$. This method was adopted in [14] for estimating the intensity function of non-homogeneous Poisson processes. While this approach breaks the convexity of the loss function, it works relatively well when the initialization is close to the global minimum. It is also interestingly related to a line of recent works on non-convex SDP [6], as well as phase retrieval with Wirtinger flow [10]. Deriving guarantees on the regret bound and convergence performance is a future direction implied by the results of this work.
4 Theoretical Properties
We now discuss the theoretical properties of NPOLE-MHP. We start by defining the regret.
Definition 4.1. The regret of Algorithm 1 at time $t$ is given by
$$R_t^{(\delta)}\big(\lambda_i^{(z)}(\mu_i, f_i)\big) := \sum_{k=1}^{M(t)}\Big(l_{i,k}\big(\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big) - l_{i,k}\big(\lambda_i^{(z)}(\mu_i, f_i)\big)\Big),$$
where $\hat\mu_i^{(k)}$ and $\hat f_i^{(k)}$ denote the estimated base intensity and triggering functions, respectively.
Theorem 1. Suppose that the observations are generated from a $p$-dimensional MHP that satisfies Assumptions 2.1 and 2.2. Let $\zeta = \min_{i,j}\{\zeta_{i,j}, \omega_i\}$, and $\eta_k = 1/(\zeta k + b)$ for some positive constant $b$.
Then
$$R_t^{(\delta)}\big(\lambda_i^{(z)}(\mu_i, f_i)\big) \le C_1(1 + \log M(t)),$$
where $C_1 = 2(1 + p\kappa_z^2)\zeta^{-1}|\delta - \mu_{\min}^{-1}|^2$.
The regret bound of Theorem 1 resembles the regret bound for a typical online learning algorithm with a strongly convex loss function (see, for example, Theorem 3.3 of [17]). When $\delta$, $\zeta$ and $\mu_{\min}^{-1}$ are fixed, $C_1 = \mathcal{O}(p)$, which is intuitive as one needs to update $p$ functions at a time. Note that the regret in Definition 4.1 encodes the performance of Algorithm 1 by comparing its loss with the approximated loss. Below, we compare the loss of Algorithm 1 with the original loss in (4).
Corollary 1. Under the same assumptions as Theorem 1, we have
$$\sum_{k=1}^{M(t)}\Big(l_{i,k}\big(\lambda_i^{(z)}(\hat\mu_i^{(k)}, \hat f_i^{(k)})\big) - l_{i,k}\big(\lambda_i(\mu_i, f_i)\big)\Big) \le C_1[1 + \log M(t)] + C_2 N(t), \qquad (11)$$
where $C_1$ is defined in Theorem 1 and $C_2 = (1 + \kappa_1\mu_{\min}^{-1})\varepsilon(z) + \delta\varepsilon'(0)$. Note that the $C_2 N(t)$ term is due to the discretization and truncation steps, and it can be made arbitrarily small for given $t$ by setting a small $\delta$ and a large enough $z$.
Computational Complexity. Since the $\hat f_i$s can be estimated in parallel, we restrict our analysis to the case of a fixed $i \in \{1,\ldots,p\}$ in a single iteration. For each iteration, the computational complexity comes from evaluating the intensity function and from the projection. Since the number of arrivals within the interval $[t_k - z, t_k)$ is bounded by $p\kappa_z$ and $\kappa_z = \mathcal{O}(1)$, evaluating the intensity costs $\mathcal{O}(p^2)$ operations. For the projection in each step, one can truncate the number of kernels used to represent $f_{i,j}(t)$ to $\mathcal{O}(1)$ with controllable error (Proposition 1 of [19]), and therefore the computation cost is $\mathcal{O}(1)$. Hence, the per iteration computation cost of NPOLE-MHP is $\mathcal{O}(p^2)$.
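The per-iteration updates of Algorithm 1 (lines 6 and 8) can be sketched as follows for a single dimension $i$, with each $\hat f_{i,j}$ stored as the (weight, center) pairs of its kernel expansion, as suggested by (9). This is a hypothetical sketch: the exact projection $\Pi[\cdot]$ of Section 3.1 is replaced here by a crude clamp of negative weights (which keeps a Gaussian-kernel expansion nonnegative but is not the paper's SDP projection), and all names and numbers are illustrative.

```python
import numpy as np

def rbf(x, c, sigma=1.0):
    """Gaussian kernel K(x, c) with bandwidth sigma (stand-in choice)."""
    return np.exp(-(x - c) ** 2 / (2 * sigma ** 2))

def npole_step(mu, expansions, t_prev, t_k, x_ik, recent, eta, omega, zeta, mu_min):
    """One gradient step on (7): mu update (line 6) and f updates (line 8)."""
    p = len(expansions)
    # truncated intensity at t_k from the current estimates
    lam = mu + sum(w * rbf(t_k - tau, c)
                   for j in range(p)
                   for tau in recent[j]
                   for (w, c) in expansions[j])
    rho = (t_k - t_prev) - x_ik / lam          # shared scalar in both gradients
    mu_new = max(mu - eta * (rho + omega * mu), mu_min)
    new_expansions = []
    for j in range(p):
        # shrink old weights (Tikhonov term), add one kernel per recent arrival
        updated = [(w * (1 - eta * zeta), c) for (w, c) in expansions[j]]
        updated += [(-eta * rho, t_k - tau) for tau in recent[j]]
        # heuristic stand-in for the projection Pi[.] of Section 3.1
        updated = [(max(w, 0.0), c) for (w, c) in updated]
        new_expansions.append(updated)
    return mu_new, new_expansions

# Illustrative 2-dimensional step (all values hypothetical)
mu, exps = 0.5, [[(0.3, 0.2)], [(0.1, 1.0)]]
recent = [[1.9], [1.5]]                        # arrivals within [t_k - z, t_k)
mu2, exps2 = npole_step(mu, exps, 1.95, 2.0, 1, recent, eta=0.1,
                        omega=0.1, zeta=0.01, mu_min=0.05)
assert mu2 >= 0.05 and all(w >= 0 for row in exps2 for (w, c) in row)
```

Each step appends at most one kernel per recent arrival and shrinks old weights by the Tikhonov factor, which is why truncating the expansion keeps the per iteration cost bounded.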
By comparison, parametric online algorithms (DMD and OGD of [15]) also require $\mathcal{O}(p^2)$ operations per iteration, while the batch learning algorithms (MLE-SGLP and MLE of [27]) require $\mathcal{O}(p^2 t^3)$ operations.
5 Numerical Experiments
We evaluate the performance of NPOLE-MHP on both synthetic and real data, from multiple aspects: (i) visual assessment of the goodness-of-fit compared to the ground truth; (ii) the "average $L^1$ error", defined as the average of $\sum_{i=1}^{p}\sum_{j=1}^{p}\|f_{i,j} - \hat f_{i,j}\|_{L^1[0,z]}$ over multiple trials; (iii) scalability over both the dimension $p$ and the time horizon $T$. For benchmarks, we compare NPOLE-MHP's performance to that of online parametric algorithms (DMD and OGD of [15]) and nonparametric batch learning algorithms (MLE-SGLP and MLE of [27]).
5.1 Synthetic Data
Consider a 5-dimensional MHP with $\mu_i = 0.05$ for all dimensions. We set the triggering functions as
$$F = \begin{bmatrix} e^{-2.5t} & (1+\cos(\pi t))e^{-t/2} & 0 & e^{-10(t-1)^2} + 0.4e^{-3(t-1)^2} & 0 \\ 2^{-5t} & 2e^{-3t} & e^{-5t} & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0.6e^{-3t^2} & e^{-4t} \\ 0 & 0 & te^{-5(t-1)^2} & 0 & e^{-3t} \end{bmatrix}.$$

Figure 1: Performances of different algorithms for estimating $F$ (panels: $f_{2,2}(t)$, $f_{3,2}(t)$, $f_{1,4}(t)$). The complete set of results can be found in Appendix F. For each subplot, the horizontal axis covers $[0,z]$ and the vertical axis covers $[0,1]$. The performances are similar between DMD and OGD, and between MLE and MLE-SGLP.

The design of $F$ allows us to test NPOLE-MHP's ability to detect (i) exponential triggering functions with various decay rates; (ii) zero functions; and (iii) functions with delayed peaks and tail behaviors different from an exponential function.
Goodness-of-fit.
We run NPOLE-MHP on a dataset with $T = 10^5$ and around $4\times10^4$ events per dimension. The parameters are chosen by grid search over a small portion of the data, and the parameters of the benchmark algorithms are fine-tuned (see Appendix F for details). In particular, we set the discretization level $\delta = 0.05$, the window size $z = 3$, the step size $\eta_k = (k\delta/20 + 100)^{-1}$, and the regularization coefficient $\zeta_{i,j} \equiv \zeta = 10^{-8}$. The performances of NPOLE-MHP and the benchmarks are shown in Figure 1. We see that NPOLE-MHP captures the shape of the functions much better than the DMD and OGD algorithms, whose assumed forms of the triggering functions are mismatched. This is especially visible for $f_{1,4}(t)$ and $f_{2,2}(t)$. In fact, our algorithm scores a performance similar to that of the batch learning MLE estimator, which is optimal for any given set of data. We next plot the average loss per iteration for this dataset in Figure 2. On the left-hand side, the loss is high due to initialization. However, the effect of initialization quickly diminishes as the number of events increases.
Run time comparison. The simulation of the DMD and OGD algorithms took 2 minutes combined on a Macintosh with two 6-core Intel Xeon processors at 2.4 GHz, while NPOLE-MHP took 3 minutes. The batch learning algorithms MLE-SGLP and MLE of [27] each took about 1.5 hours. Therefore, our algorithm achieves a performance similar to that of the batch learning algorithms with a run time close to that of the parametric online learning algorithms.
Effects of the hyperparameters: $\delta$, $\zeta_{i,j}$, and $\eta_k$. We investigate the sensitivity of NPOLE-MHP with respect to the hyperparameters, measuring the "average $L^1$ error" defined at the beginning of this section. We independently generate 100 sets of data with the same parameters, and a smaller $T = 10^4$ for faster data generation. The results are shown in Table 1. For NPOLE-MHP, we fix $\eta_k = 1/(k/2000 + 10)$.
MLE and MLE-SGLP score around 1.949 with 5/5 inner/outer rounds of iterations. NPOLE-MHP's performance is robust when the regularization coefficient and the discretization level are sufficiently small. It surpasses MLE and MLE-SGLP on large datasets, in which case the iterations of MLE and MLE-SGLP are limited due to computational considerations. As $\zeta$ increases, the error decreases first before rising drastically, a phenomenon caused by the mismatch between the loss functions. For the step size, the error varies under different choices of $\eta_k$, which can be selected via grid search on a small portion of the data, as in most other online algorithms.
5.2 Real Data: Inferring Impact Between News Agencies with Memetracker Data
We test the performance of NPOLE-MHP on the memetracker data [21], which collects from the internet a set of popular phrases, including their content, the times they were posted, and the URL addresses of the articles that included them. We study the relationship between different news agencies, modeling the data with a $p$-dimensional MHP where each dimension corresponds to a news website. Unlike [15], which conducted a similar experiment using all of the data, we focus on only the 20 websites that are most active, using 18 days of data. We plot the cumulative losses in Figure 3, using a window size of 3 hours, an update interval $\delta = 0.2$ seconds, and a step size $\eta_k = 1/(k\zeta + 800)$ with $\zeta = 10^{-10}$ for NPOLE-MHP. For DMD and OGD, we set $\eta_k = 5/\sqrt{T/\delta}$. The results show that NPOLE-MHP accumulates a smaller loss per step compared to OGD and DMD.

Table 1: Effect of hyperparameters $\zeta$ and $\delta$, measured by the "average $L^1$ error".

$\delta$ \ $\log_{10}\zeta$ |  -8  |  -6  |  -4  |  -2  |   0
0.01                        | 1.83 | 1.83 | 1.84 | 4.15 | 4.64
0.05                        | 1.86 | 1.86 | 1.86 | 3.10 | 4.64
0.1                         | 1.92 | 1.92 | 1.88 | 2.73 | 4.64
0.5                         | 4.80 | 4.80 | 4.64 | 2.19 | 4.62
1                           | 5.73 | 5.73 | 5.58 | 2.38 | 4.59

Table 2: Average CPU-time for estimating one triggering function (seconds).

$p$ \ Horizon $T$ (days) | 1.8 | 3.6  | 5.4
20                       | 3.9 | 9.1  | 15.3
40                       | 4.6 | 10.4 | 17.0
60                       | 4.6 | 10.2 | 16.7
80                       | 4.5 | 10.0 | 16.4
100                      | 4.5 | 9.7  | 15.9

Figure 2: Effect of discretization in NPOLE-MHP.

Figure 3: Cumulative loss on memetracker data of 20 dimensions.

Scalability and generalization error. Finally, we evaluate the scalability of NPOLE-MHP using the average CPU-time for estimating one triggering function. The results in Table 2 show that the computation cost of NPOLE-MHP scales almost linearly with the dimension and the data size. When scaling the data to 100 dimensions and $2\times10^5$ events, NPOLE-MHP scores an average 0.01 loss per iteration on both training and test data, while OGD and DMD scored 0.005 on training data and 0.14 on test data. This shows a much better generalization performance for NPOLE-MHP.

6 Conclusion

We developed a nonparametric method for learning the triggering functions of a multivariate Hawkes process (MHP) given time series observations. To formulate the instantaneous loss function, we adopted the method of discretizing the time axis into small intervals of length at most $\delta$, and we derived the corresponding upper bound on the approximation error. From this point, we proposed an online learning algorithm, NPOLE-MHP, which is based on the framework of online kernel learning and exploits the interarrival time statistics under the MHP setup. Theoretically, we derived the regret bound for NPOLE-MHP, which is $\mathcal{O}(\log T)$ when the time horizon $T$ is known a priori, and we showed that the per iteration cost of NPOLE-MHP is $\mathcal{O}(p^2)$.
Numerically, we compared NPOLE-MHP's performance with that of parametric online learning algorithms and nonparametric batch learning algorithms. Results on both synthetic and real data showed that NPOLE-MHP achieves a performance similar to that of the nonparametric batch learning algorithms, with a run time comparable to that of the parametric online learning algorithms.

[Figure 2: average loss per iteration versus time, comparing NPOLE-MHP with δ ∈ {0.05, 0.10, 0.50}, DMD with δ = 0.05, and the loss computed with the true f_{i,j}(t)s at δ = 0.05.]

[Figure 3: cumulative loss versus time for NPOLE-MHP, DMD, and OGD.]

References

[1] Emmanuel Bacry, Khalil Dayri, and Jean-François Muzy. Non-parametric kernel estimation for symmetric Hawkes processes. Application to high frequency financial data. The European Physical Journal B - Condensed Matter and Complex Systems, 85(5):1–12, 2012.

[2] Emmanuel Bacry, Stéphane Gaïffas, and Jean-François Muzy. A generalization error bound for sparse and low-rank multivariate Hawkes processes, 2015.

[3] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.

[4] Emmanuel Bacry and Jean-François Muzy. First- and second-order statistics characterization of Hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory, 62(4):2184–2202, 2016.

[5] J Andrew Bagnell and Amir-massoud Farahmand. Learning positive functions in a Hilbert space, 2015.

[6] Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. Conference on Learning Theory, pages 530–582, 2016.

[7] Jacek Bochnak, Michel Coste, and Marie-Françoise Roy. Real Algebraic Geometry, volume 36. Springer Science & Business Media, 2013.

[8] Olivier Bousquet and André Elisseeff.
Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

[9] Pierre Brémaud and Laurent Massoulié. Stability of nonlinear Hawkes processes. The Annals of Probability, pages 1563–1588, 1996.

[10] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, 2015.

[11] Michael Eichler, Rainer Dahlhaus, and Johannes Dueck. Graphical modeling for multivariate Hawkes processes with nonparametric link functions. Journal of Time Series Analysis, 38(2):225–242, 2017.

[12] Jalal Etesami and Negar Kiyavash. Directed information graphs: A generalization of linear dynamical graphs. In American Control Conference (ACC), 2014, pages 2563–2568. IEEE, 2014.

[13] Jalal Etesami, Negar Kiyavash, Kun Zhang, and Kushagra Singhal. Learning network of multivariate Hawkes processes: A time series approach. Conference on Uncertainty in Artificial Intelligence, 2016.

[14] Seth Flaxman, Yee Whye Teh, and Dino Sejdinovic. Poisson intensity estimation with reproducing kernels. International Conference on Artificial Intelligence and Statistics, 2017.

[15] Eric C Hall and Rebecca M Willett. Tracking dynamic point processes on networks. IEEE Transactions on Information Theory, 62(7):4327–4346, 2016.

[16] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

[17] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

[18] Sanggyun Kim, Christopher J Quinn, Negar Kiyavash, and Todd P Coleman. Dynamic and succinct statistical analysis of neuroscience data. Proceedings of the IEEE, 102(5):683–698, 2014.

[19] Jyrki Kivinen, Alexander J Smola, and Robert C Williamson.
Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[20] Michael Krumin, Inna Reutsky, and Shy Shoham. Correlation-based analysis and generation of multiple spike trains using Hawkes models with an exogenous input. Frontiers in Computational Neuroscience, 4, 2010.

[21] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. Meme-tracking and the dynamics of the news cycle. International Conference on Knowledge Discovery and Data Mining, pages 497–506, 2009.

[22] Thomas Josef Liniger. Multivariate Hawkes Processes. PhD thesis, Eidgenössische Technische Hochschule ETH Zürich, 2009.

[23] Tohru Ozaki. Maximum likelihood estimation of Hawkes' self-exciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145–155, 1979.

[24] Patricia Reynaud-Bouret, Sophie Schbath, et al. Adaptive estimation for Hawkes processes; application to genome analysis. The Annals of Statistics, 38(5):2781–2822, 2010.

[25] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. International Conference on Computational Learning Theory, pages 416–426, 2001.

[26] Lieven Vandenberghe and Stephen Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.

[27] Hongteng Xu, Mehrdad Farajtabar, and Hongyuan Zha. Learning Granger causality for Hawkes processes. International Conference on Machine Learning, 48:1717–1726, 2016.

[28] Shuang-Hong Yang and Hongyuan Zha. Mixture of mutually exciting processes for viral diffusion. International Conference on Machine Learning, 28:1–9, 2013.

[29] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional Hawkes processes.
International Conference on Machine Learning, 28:1301–1309, 2013.