{"title": "Modelling sparsity, heterogeneity, reciprocity and community structure in temporal interaction data", "book": "Advances in Neural Information Processing Systems", "page_first": 2343, "page_last": 2352, "abstract": "We propose a novel class of network models for temporal dyadic interaction data. Our objective is to capture important features often observed in social interactions: sparsity, degree heterogeneity, community structure and reciprocity. We use mutually-exciting Hawkes processes to model the interactions between each (directed) pair of individuals. The intensity of each process allows interactions to arise as responses to opposite interactions (reciprocity), or due to shared interests between individuals (community structure). For sparsity and degree heterogeneity, we build the non time dependent part of the intensity function on compound random measures following Todeschini et al., 2016. We conduct experiments on real-world temporal interaction data and show that the proposed model outperforms competing approaches for link prediction, and leads to interpretable parameters.", "full_text": "Modelling sparsity, heterogeneity, reciprocity and\ncommunity structure in temporal interaction data\n\nXenia Miscouridou1, Fran\u00e7ois Caron1, Yee Whye Teh1,2\n\n1Department of Statistics, University of Oxford\n\n{miscouri, caron, y.w.teh}@stats.ox.ac.uk\n\n2DeepMind\n\nAbstract\n\nWe propose a novel class of network models for temporal dyadic interaction\ndata. Our objective is to capture important features often observed in social\ninteractions: sparsity, degree heterogeneity, community structure and reciprocity.\nWe use mutually-exciting Hawkes processes to model the interactions between\neach (directed) pair of individuals. The intensity of each process allows interactions\nto arise as responses to opposite interactions (reciprocity), or due to shared interests\nbetween individuals (community structure). 
For sparsity and degree heterogeneity,\nwe build the non time dependent part of the intensity function on compound random\nmeasures following (Todeschini et al., 2016). We conduct experiments on real-world\ntemporal interaction data and show that the proposed model outperforms\ncompeting approaches for link prediction, and leads to interpretable parameters.\n\n1 Introduction\n\nThere is a growing interest in modelling and understanding temporal dyadic interaction data. Temporal\ninteraction data take the form of time-stamped triples (t, i, j) indicating that an interaction occurred\nbetween individuals i and j at time t. Interactions may be directed or undirected. Examples of such\ninteraction data include commenting on a post on an online social network, exchanging an email, or\nmeeting in a coffee shop. An important challenge is to understand the underlying structure that\nunderpins these interactions. To do so, it is important to develop statistical network models with\ninterpretable parameters that capture the properties observed in real social interaction data.\nOne important aspect to capture is the community structure of the interactions. Individuals are often\naffiliated to some latent communities (e.g. work, sport, etc.), and their affiliations determine their\ninteractions: they are more likely to interact with individuals sharing the same interests than with\nindividuals affiliated with different communities. Another important aspect is reciprocity. Many\nevents are responses to recent events in the opposite direction. For example, if Helen sends an email\nto Mary, then Mary is more likely to send an email to Helen shortly afterwards. A number of papers\nhave proposed statistical models to capture both community structure and reciprocity in temporal\ninteraction data (Blundell et al., 2012; Dubois et al., 2013; Linderman and Adams, 2014). 
They use\nmodels based on Hawkes processes for capturing reciprocity and stochastic block-models or latent\nfeature models for capturing community structure.\nIn addition to the above two properties, it is important to capture the global properties of the interaction\ndata. Interaction data are often sparse: only a small fraction of the pairs of nodes actually interact.\nAdditionally, they typically exhibit high degree (number of interactions per node) heterogeneity:\nsome individuals have a large number of interactions, whereas most individuals have very few,\nresulting in heavy-tailed empirical degree distributions. As shown by Karrer and\nNewman (2011), Gopalan et al. (2013) and Todeschini et al. (2016), failing to account explicitly for\ndegree heterogeneity in the model can have devastating consequences on the estimation of the latent\nstructure.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nRecently, two classes of statistical models, based on random measures, have been proposed to capture\nsparsity and power-law degree distributions in network data. The first one is the class of models\nbased on exchangeable random measures (Caron and Fox, 2017; Veitch and Roy, 2015; Herlau et al.,\n2016; Borgs et al., 2018; Todeschini et al., 2016; Palla et al., 2016; Janson, 2017a). The second\none is the class of edge-exchangeable models (Crane and Dempsey, 2015; 2018; Cai et al., 2016;\nWilliamson, 2016; Janson, 2017b; Ng and Silva, 2017). Both classes of models can handle both\nsparse and dense networks and, although the two constructions are different, connections have been\nhighlighted between the two approaches (Cai et al., 2016; Janson, 2017b).\nThe objective of this paper is to propose a class of statistical models for temporal dyadic interaction\ndata that can capture all the desired properties mentioned above, which are often found in real-world\ninteractions. 
These are sparsity, degree heterogeneity, community structure and reciprocity.\nCombining all the properties in a single model is non-trivial and, to our knowledge, no such construction\nexists. The proposed model generalises existing reciprocating relationships models (Blundell\net al., 2012) to the sparse and power-law regime. Our model can also be seen as a natural extension\nof the classes of models based on exchangeable random measures and edge-exchangeable models\nand it shares properties of both families. The approach is shown to outperform alternative models for\nlink prediction on a variety of temporal network datasets.\nThe construction is based on Hawkes processes and the (static) model of Todeschini et al. (2016)\nfor sparse and modular graphs with overlapping community structure. In Section 2, we present\nHawkes processes and compound completely random measures which form the basis of our model\u2019s\nconstruction. The statistical model for temporal dyadic data is presented in Section 3 and its properties\nderived in Section 4. The inference algorithm is described in Section 5. Section 6 presents experiments\non four real-world temporal interaction datasets.\n\n2 Background material\n\n2.1 Hawkes processes\nLet $(t_k)_{k \\geq 1}$ be a sequence of event times with $t_k \\geq 0$, and let $\\mathcal{H}_t = (t_k \\mid t_k \\leq t)$ be the subset of event\ntimes between time 0 and time t. Let $N(t) = \\sum_{k \\geq 1} 1_{t_k \\leq t}$ denote the number of events between time\n0 and time t, where $1_A = 1$ if A is true, and 0 otherwise. Assume that $N(t)$ is a counting process\nwith conditional intensity function $\\lambda(t)$, that is, for any $t \\geq 0$ and any infinitesimal interval dt\n\n$$\\Pr(N(t + dt) - N(t) = 1 \\mid \\mathcal{H}_t) = \\lambda(t)\\, dt. \\qquad (1)$$\n\nConsider another counting process $\\tilde{N}(t)$ with the corresponding $(\\tilde{t}_k)_{k \\geq 1}$, $\\tilde{\\mathcal{H}}_t$, $\\tilde{\\lambda}(t)$. 
Then, $N(t)$, $\\tilde{N}(t)$\nare mutually-exciting Hawkes processes (Hawkes, 1971) if the conditional intensity functions $\\lambda(t)$\nand $\\tilde{\\lambda}(t)$ take the form\n\n$$\\lambda(t) = \\mu + \\int_0^t g_{\\phi}(t - u)\\, d\\tilde{N}(u), \\qquad \\tilde{\\lambda}(t) = \\tilde{\\mu} + \\int_0^t g_{\\tilde{\\phi}}(t - u)\\, dN(u)$$\n\nwhere $\\mu = \\lambda(0) > 0$, $\\tilde{\\mu} = \\tilde{\\lambda}(0) > 0$ are the base intensities and $g_{\\phi}$, $g_{\\tilde{\\phi}}$ are non-negative kernels\nparameterised by $\\phi$ and $\\tilde{\\phi}$. This defines a pair of processes in which the current rate of events of each\nprocess depends on the occurrence of past events of the opposite process.\nAssume that $\\mu = \\tilde{\\mu}$, $\\phi = \\tilde{\\phi}$ and $g_{\\phi}(t) \\geq 0$ for $t > 0$, $g_{\\phi}(t) = 0$ for $t < 0$. If $g_{\\phi}$ admits a form of fast\ndecay then this results in strong local effects. However, if it prescribes a peak away from the origin\nthen longer term effects are likely to occur. We consider here an exponential kernel\n\n$$g_{\\phi}(t - u) = \\eta e^{-\\delta (t - u)}, \\quad t > u \\qquad (2)$$\n\nwhere $\\phi = (\\eta, \\delta)$. The parameter $\\eta \\geq 0$ determines the sizes of the self-excited jumps and $\\delta > 0$ is the constant\nrate of exponential decay. The stationarity condition for the processes is $\\eta < \\delta$. Figure 1 gives an\nillustration of two mutually-exciting Hawkes processes with exponential kernel and their conditional\nintensities.\n\nFigure 1: Illustration of mutually-exciting Hawkes processes with exponential kernel with parameters\n$\\mu = 0.25$, $\\eta = 1$ and $\\delta = 2$. (a) Counting process $N(t)$ and (b) its conditional intensity $\\lambda(t)$. 
(c)\nCounting process $\\tilde{N}(t)$ and (d) its conditional intensity $\\tilde{\\lambda}(t)$.\n\n2.2 Compound completely random measures\nA homogeneous completely random measure (CRM) (Kingman, 1967; 1993) on $\\mathbb{R}_+$ with neither fixed\natoms nor a deterministic component takes the form\n\n$$W = \\sum_{i \\geq 1} w_i \\delta_{\\theta_i} \\qquad (3)$$\n\nwhere $(w_i, \\theta_i)_{i \\geq 1}$ are the points of a Poisson process on $(0, \\infty) \\times \\mathbb{R}_+$ with mean measure $\\rho(dw) H(d\\theta)$,\nwhere $\\rho$ is a L\u00e9vy measure, $H$ is a locally bounded measure and $\\delta_x$ is the Dirac delta mass at $x$. The\nhomogeneous CRM is completely characterized by $\\rho$ and $H$, and we write $W \\sim \\mathrm{CRM}(\\rho, H)$, or\nsimply $W \\sim \\mathrm{CRM}(\\rho)$ when $H$ is taken to be the Lebesgue measure. Griffin and Leisen (2017)\nproposed a multivariate generalisation of CRMs, called compound CRM (CCRM). A compound\nCRM $(W_1, \\ldots, W_p)$ with independent scores is defined as\n\n$$W_k = \\sum_{i \\geq 1} w_{ik} \\delta_{\\theta_i}, \\quad k = 1, \\ldots, p \\qquad (4)$$\n\nwhere $w_{ik} = \\beta_{ik} w_{i0}$ and the scores $\\beta_{ik} \\geq 0$ are independently distributed from some probability\ndistribution $F_k$ and $W_0 = \\sum_{i \\geq 1} w_{i0} \\delta_{\\theta_i}$ is a CRM with mean measure $\\rho_0(dw_0) H_0(d\\theta)$. In the rest\nof this paper, we assume that $F_k$ is a gamma distribution with parameters $(a_k, b_k)$, $H_0(d\\theta) = d\\theta$ is\nthe Lebesgue measure and $\\rho_0$ is the L\u00e9vy measure of a generalized gamma process\n\n$$\\rho_0(dw) = \\frac{1}{\\Gamma(1 - \\sigma)} w^{-1 - \\sigma} e^{-\\tau w}\\, dw \\qquad (5)$$\n\nwhere $\\sigma \\in (-\\infty, 1)$ and $\\tau > 0$.\n\n3 Statistical model for temporal interaction data\nConsider temporal interaction data of the form $D = (t_k, i_k, j_k)_{k \\geq 1}$ where $(t_k, i_k, j_k) \\in \\mathbb{R}_+ \\times \\mathbb{N}_*^2$\nrepresents a directed interaction at time $t_k$ from node/individual $i_k$ to node/individual $j_k$. 
For example,\nthe data may correspond to the exchange of messages between students on an online social network.\nWe use a point process $(t_k, U_k, V_k)_{k \\geq 1}$ on $\\mathbb{R}_+^3$, and consider that each node i is assigned some\ncontinuous label $\\theta_i \\geq 0$. The labels are only used for the model construction, similarly to (Caron\nand Fox, 2017; Todeschini et al., 2016), and are neither observed nor inferred from the data. A point\nat location $(t_k, U_k, V_k)$ indicates that there is a directed interaction from the node with label $U_k$ to the node with label $V_k$ at\ntime $t_k$. See Figure 2 for an illustration.\n\nFigure 2: (Left) Representation of the temporal dyadic interaction data as a point process on $\\mathbb{R}_+^3$. A\npoint at location $(\\tau, \\theta_i, \\theta_j)$ indicates a directed interaction from node i to node j at time $\\tau$, where\n$\\theta_i > 0$ and $\\theta_j > 0$ are labels of nodes i and j. Interactions between nodes i and j arise from a\nHawkes process $N_{ij}$ with conditional intensity $\\lambda_{ij}$ given by Equation (7). Processes $N_{ik}$ and $N_{jk}$ are\nnot shown for readability. (Right) Conditional intensities of processes $N_{ik}(t)$, $N_{ij}(t)$ and $N_{jk}(t)$.\n\nFor a pair of nodes i and j, with labels $\\theta_i$ and $\\theta_j$, let $N_{ij}(t)$ be the counting process\n\n$$N_{ij}(t) = \\sum_{k \\mid (U_k, V_k) = (\\theta_i, \\theta_j)} 1_{t_k \\leq t} \\qquad (6)$$\n\nfor the number of interactions from i to j in the time interval [0, t]. For each pair i, j, the\ncounting processes $N_{ij}$, $N_{ji}$ are mutually-exciting Hawkes processes with conditional intensities\n\n$$\\lambda_{ij}(t) = \\mu_{ij} + \\int_0^t g_{\\phi}(t - u)\\, dN_{ji}(u), \\qquad \\lambda_{ji}(t) = \\mu_{ji} + \\int_0^t g_{\\phi}(t - u)\\, dN_{ij}(u) \\qquad (7)$$\n\nwhere $g_{\\phi}$ is the exponential kernel defined in Equation (2). Interactions from individual i to individual\nj may arise as a response to past interactions from j to i through the kernel $g_{\\phi}$, or via the base\nintensity $\\mu_{ij}$. We also model assortativity so that individuals with similar interests are more likely\nto interact than individuals with different interests. For this, assume that each node i has a set of\npositive latent parameters $(w_{i1}, \\ldots, w_{ip}) \\in \\mathbb{R}_+^p$, where $w_{ik}$ is the level of its affiliation to each latent\ncommunity $k = 1, \\ldots, p$. The number of communities p is assumed known. We model the base rate\n\n$$\\mu_{ij} = \\mu_{ji} = \\sum_{k=1}^{p} w_{ik} w_{jk}. \\qquad (8)$$\n\nTwo nodes with high levels of affiliation to the same communities will be more likely to interact than\nnodes with affiliation to different communities, favouring assortativity.\nIn order to capture sparsity and power-law properties and as in Todeschini et al. (2016), the set of\naffiliation parameters $(w_{i1}, \\ldots, w_{ip})$ and node labels $\\theta_i$ is modelled via a compound CRM with\ngamma scores, that is $W_0 = \\sum_{i=1}^{\\infty} w_{i0} \\delta_{\\theta_i} \\sim \\mathrm{CRM}(\\rho_0)$ where the L\u00e9vy measure $\\rho_0$ is defined by\nEquation (5), and for each node $i \\geq 1$ and community $k = 1, \\ldots, p$\n\n$$w_{ik} = w_{i0} \\beta_{ik}, \\quad \\text{where } \\beta_{ik} \\overset{\\text{ind}}{\\sim} \\mathrm{Gamma}(a_k, b_k). \\qquad (9)$$\n\nThe parameter $w_{i0} \\geq 0$ is a degree correction for node i and can be interpreted as measuring the\noverall popularity/sociability of a given node i irrespective of its level of affiliation to the different\ncommunities. An individual i with a high sociability parameter $w_{i0}$ will be more likely to have\ninteractions overall than individuals with low sociability parameters. The scores $\\beta_{ik}$ tune the level\nof affiliation of individual i to the community k. The model is defined on $\\mathbb{R}_+^3$. 
We assume that we\nobserve interactions over a subset $[0, T] \\times [0, \\alpha]^2 \\subseteq \\mathbb{R}_+^3$ where \u03b1 and T tune both the number of\nnodes and number of interactions. The whole model is illustrated in Figure 2.\nThe model admits the following set of hyperparameters, with the following interpretation:\n\u2022 The hyperparameters \u03c6 = (\u03b7, \u03b4) of the kernel g\u03c6, where \u03b7 \u2265 0 and \u03b4 > 0, tune the reciprocity.\n\u2022 The hyperparameters (ak, bk) tune the community structure of the interactions. $a_k / b_k = E[\\beta_{ik}]$\ntunes the size of community k while $a_k / b_k^2 = \\mathrm{var}(\\beta_{ik})$ tunes the variability of the level of affiliation\nto this community; larger values imply more separated communities.\n\u2022 The hyperparameter \u03c3 tunes the sparsity and the degree heterogeneity: larger values imply higher\nsparsity and heterogeneity. It also tunes the slope of the degree distribution. Parameter \u03c4 tunes the\nexponential cut-off in the degree distribution. This is illustrated in Figure 3.\n\u2022 Finally, the hyperparameters \u03b1 and T tune the overall number of interactions and nodes.\nWe follow Rasmussen (2013) and use vague Exponential(0.01) priors on \u03b7 and \u03b4. Following\nTodeschini et al. (2016) we set vague Gamma(0.01, 0.01) priors on \u03b1, 1 \u2212 \u03c3, \u03c4, ak and bk. The right\nlimit T of the time window is considered known.\n\nFigure 3: Degree distribution for a Hawkes graph with different values of \u03c3 (left) and \u03c4 (right). The\ndegree of a node i is defined as the number of nodes with whom it has at least one interaction. The\nvalue \u03c3 tunes the slope of the degree distribution, larger values corresponding to a higher slope. 
The\nparameter \u03c4 tunes the exponential cut-off in the degree distribution.\n\n4 Properties\n\n4.1 Connection to sparse vertex-exchangeable and edge-exchangeable models\n\nThe model is a natural extension of sparse vertex-exchangeable and edge-exchangeable graph models.\nLet $z_{ij}(t) = 1_{N_{ij}(t) + N_{ji}(t) > 0}$ be a binary variable indicating if there is at least one interaction in\n[0, t] between nodes i and j in either direction. We have\n\n$$\\Pr(z_{ij}(t) = 1 \\mid (w_{ik}, w_{jk})_{k=1,\\ldots,p}) = 1 - e^{-2t \\sum_{k=1}^{p} w_{ik} w_{jk}}$$\n\nwhich corresponds to the probability of a connection in the static simple graph model proposed by\nTodeschini et al. (2016). Additionally, for fixed \u03b1 > 0 and \u03b7 = 0 (no reciprocal relationships), the\nmodel corresponds to a rank-p extension of the rank-1 Poissonized version of edge-exchangeable\nmodels considered by Cai et al. (2016) and Janson (2017b). The sparsity properties of our model\nfollow from the sparsity properties of these two classes of models.\n\n4.2 Sparsity\n\nThe size of the dataset is tuned by both \u03b1 and T. Given these quantities, both the number of\ninteractions and the number of nodes with at least one interaction are random variables. We now study\nthe behaviour of these quantities, showing that the model exhibits sparsity. 
Let $I_{\\alpha,T}$, $E_{\\alpha,T}$, $V_{\\alpha,T}$ be\nthe overall number of interactions between nodes with label $\\theta_i \\leq \\alpha$ until time T, the total number of\npairs of nodes with label $\\theta_i \\leq \\alpha$ who had at least one interaction before time T, and the number of\nnodes with label $\\theta_i \\leq \\alpha$ who had at least one interaction before time T, respectively:\n\n$$V_{\\alpha,T} = \\sum_{i} 1_{\\sum_{j \\neq i} (N_{ij}(T) + N_{ji}(T)) 1_{\\theta_j \\leq \\alpha} > 0}\\, 1_{\\theta_i \\leq \\alpha}, \\qquad E_{\\alpha,T} = \\sum_{i < j} 1_{N_{ij}(T) + N_{ji}(T) > 0}\\, 1_{\\theta_i \\leq \\alpha} 1_{\\theta_j \\leq \\alpha}, \\qquad I_{\\alpha,T} = \\sum_{i \\neq j} N_{ij}(T)\\, 1_{\\theta_i \\leq \\alpha} 1_{\\theta_j \\leq \\alpha}.$$\n\nWe provide in the supplementary material a theorem for the exact expectations of $I_{\\alpha,T}$, $E_{\\alpha,T}$ and $V_{\\alpha,T}$.\nNow consider the asymptotic behaviour of the expectations of $V_{\\alpha,T}$, $E_{\\alpha,T}$ and $I_{\\alpha,T}$, as \u03b1 and T go\nto infinity.\u00b9 Consider fixed T > 0 and \u03b1 that tends to infinity. Then,\n\n$$E[V_{\\alpha,T}] = \\begin{cases} \\Theta(\\alpha) & \\text{if } \\sigma < 0 \\\\\\\\ \\omega(\\alpha) & \\text{if } \\sigma \\geq 0 \\end{cases}, \\qquad E[E_{\\alpha,T}] = \\Theta(\\alpha^2), \\qquad E[I_{\\alpha,T}] = \\Theta(\\alpha^2)$$\n\nas \u03b1 tends to infinity. For \u03c3 < 0, the number of edges and interactions grows quadratically with the\nnumber of nodes, and we are in the dense regime. When \u03c3 \u2265 0, the number of edges and interactions\ngrows subquadratically, and we are in the sparse regime. Higher values of \u03c3 lead to higher sparsity.\nFor fixed \u03b1,\n\n$$E[V_{\\alpha,T}] = \\begin{cases} \\Theta(1) & \\text{if } \\sigma < 0 \\\\\\\\ \\Theta(\\log T) & \\text{if } \\sigma = 0 \\\\\\\\ \\Theta(T^{\\sigma}) & \\text{if } \\sigma > 0 \\end{cases}, \\qquad E[E_{\\alpha,T}] = \\begin{cases} \\Theta(1) & \\text{if } \\sigma < 0 \\\\\\\\ O(\\log T) & \\text{if } \\sigma = 0 \\\\\\\\ O(T^{3\\sigma/2}) & \\text{if } \\sigma \\in (0, 1/2] \\\\\\\\ O(T^{(1+\\sigma)/2}) & \\text{if } \\sigma \\in (1/2, 1) \\end{cases}$$\n\nand $E[I_{\\alpha,T}] = \\Theta(T)$ as T tends to infinity. Sparsity in T arises when \u03c3 \u2265 0 for the number of\nedges and when \u03c3 > 1/2 for the number of interactions. The derivation of the asymptotic behaviour\nof expectations of $V_{\\alpha,T}$, $E_{\\alpha,T}$ and $I_{\\alpha,T}$ follows the lines of the proofs of Theorems 3 and 5.3 in\n(Todeschini et al., 2016) (\u03b1 \u2192 \u221e) and Lemma D.6 in the supplementary material of (Cai et al.,\n2016) (T \u2192 \u221e), and is omitted here.\n\n\u00b9 We use the following asymptotic notations. $X_{\\alpha} = O(Y_{\\alpha})$ if $\\lim X_{\\alpha}/Y_{\\alpha} < \\infty$, $X_{\\alpha} = \\omega(Y_{\\alpha})$ if $\\lim Y_{\\alpha}/X_{\\alpha} = 0$ and $X_{\\alpha} = \\Theta(Y_{\\alpha})$ if both $X_{\\alpha} = O(Y_{\\alpha})$ and $Y_{\\alpha} = O(X_{\\alpha})$.\n\n5 Approximate Posterior Inference\nAssume a set of observed interactions D = (tk, ik, jk)k\u22651 between V individuals over a period\nof time T. The objective is to approximate the posterior distribution \u03c0(\u03c6, \u03be|D) where \u03c6 are the\nkernel parameters and \u03be = ((wik)i=1,...,V,k=1,...,p, (ak, bk)k=1,...,p, \u03b1, \u03c3, \u03c4) are the parameters and\nhyperparameters of the compound CRM. One possibility is to follow an approach similar to\nthat of Rasmussen (2013): derive a Gibbs sampler using a data augmentation scheme which\nassociates a latent variable to each interaction. However, such an algorithm is unlikely to scale\nwell with the number of interactions. Additionally, we can make use of existing code for posterior\ninference with Hawkes processes and graphs based on compound CRMs. We therefore propose a\ntwo-step approximate inference procedure, motivated by modular Bayesian inference (Jacob et al.,\n2017).\nLet Z = (zij(T))1\u2264i,j\u2264V be the adjacency matrix defined by zij(T) = 1 if there is at least one\ninteraction between i and j in the interval [0, T], and 0 otherwise. 
We have\n\u03c0(\u03c6, \u03be|D) = \u03c0(\u03c6, \u03be|D,Z) = \u03c0(\u03be|D,Z)\u03c0(\u03c6|\u03be,D).\n\nThe idea of the two-step procedure is to (i) approximate \u03c0(\u03be|D,Z) by \u03c0(\u03be|Z) and obtain a Bayesian\npoint estimate $\\hat{\\xi}$, then (ii) approximate \u03c0(\u03c6|\u03be,D) by $\\pi(\\phi \\mid \\hat{\\xi}, D)$.\nThe full posterior is thus approximated by $\\tilde{\\pi}(\\phi, \\xi) = \\pi(\\xi \\mid Z)\\, \\pi(\\phi \\mid \\hat{\\xi}, D)$. As mentioned in Section 4.1,\nthe statistical model for the binary adjacency matrix Z is the same as in (Todeschini et al., 2016). We\nuse the MCMC scheme of (Todeschini et al., 2016) and the accompanying software package SNetOC\u00b2\nto perform inference. The MCMC sampler is a Gibbs sampler which uses a Metropolis-Hastings (MH)\nstep to update the hyperparameters and a Hamiltonian Monte Carlo (HMC) step for the parameters.\nFrom the posterior samples we compute a point estimate $(\\hat{w}_{i1}, \\ldots, \\hat{w}_{ip})$ of the weight vector for each\nnode. We follow the approach of Todeschini et al. (2016) and compute a minimum Bayes risk point\nestimate using a permutation-invariant cost function. Given these point estimates we obtain estimates\nof the base intensities $\\hat{\\mu}_{ij}$. Posterior inference on the parameters \u03c6 of the Hawkes kernel is then\nperformed using Metropolis-Hastings, as in (Rasmussen, 2013). Details of the two-stage inference\nprocedure are given in the supplementary material.\nEmpirical investigation of posterior consistency. To validate the two-step approximation to the\nposterior distribution, we study empirically the convergence of our approximate inference scheme\nusing synthetic data. Experiments suggest that the posterior concentrates around the true parameter\nvalue. 
More details are given in the supplementary material.\n\n6 Experiments\n\nWe perform experiments on four temporal interaction datasets from the Stanford Large Network\nDataset Collection\u00b3 (Leskovec and Krevl, 2014):\n\u2022 The EMAIL dataset consists of emails sent within a large European research institution over 803\ndays. There are 986 nodes, 24929 edges and 332334 interactions. A separate interaction is created\nfor every recipient of an email.\n\u2022 The COLLEGE dataset consists of private messages sent over a period of 193 days on an online\nsocial network (Facebook-like platform) at the University of California, Irvine. There are 1899 nodes,\n20296 edges and 59835 interactions. An interaction (t, i, j) corresponds to a user i sending a private\nmessage to another user j at time t.\n\u2022 The MATH overflow dataset is a temporal network of interactions on the Stack Exchange website\nMath Overflow over 2350 days. There are 24818 nodes, 239978 edges and 506550 interactions. An\ninteraction (t, i, j) means that a user i answered a question of another user j at time t, or commented on\na question/response of another user j.\n\u2022 The UBUNTU dataset is a temporal network of interactions on the Stack Exchange website Ask\nUbuntu over 2613 days. There are 159316 nodes, 596933 edges and 964437 interactions. An\ninteraction (t, i, j) means that a user i answered a question of another user j at time t, or commented on\na question/response of another user j.\n\n\u00b2 https://github.com/misxenia/SNetOC\n\u00b3 https://snap.stanford.edu/data/\n\nComparison on link prediction. We compare our model (Hawkes-CCRM) to five other benchmark\nmethods: (i) our model, without the Hawkes component (obtained by setting \u03b7 = 0), (ii) the\nHawkes-IRM model of Blundell et al. 
(2012) which uses an infinite relational model (IRM) to\ncapture the community structure together with a Hawkes process to capture reciprocity, (iii) the same\nmodel, called Poisson-IRM, without the Hawkes component, (iv) a simple Hawkes model where the\nconditional intensity given by Equation (7) is assumed to be the same for each pair of individuals, with\nunknown parameters \u03b4 and \u03b7, (v) a simple Poisson process model, which assumes that interactions\nbetween two individuals arise at an unknown constant rate. Each of these competing models captures\na subset of the features we aim to capture in the data: sparsity/heterogeneity, community structure\nand reciprocity, as summarized in Table 1. The models are given in the supplementary material. The\nonly model to account for all the features is the proposed Hawkes-CCRM model.\n\nTable 1: (Left) Performance in link prediction. (Right) Properties captured by the different models.\n\nHawkes-CCRM\nCCRM\nHawkes-IRM\nPoisson-IRM\nHawkes\nPoisson\n\ncollege math ubuntu\nemail\n29.1\n10.95\n36.5\n12.08\n59.5\n14.2\n79.3\n31.7\n154.8 153.29 220.10 191.39\n\u223c 103 \u223c 104 \u223c 104 \u223c 104\n\n20.07\n89.0\n96.9\n204.7\n\n1.88\n2.90\n3.56\n15.7\n\nsparsity/\n\nheterogeneity\n\n(cid:88)\n(cid:88)\n\ncommunity\nstructure\n\n(cid:88)\n(cid:88)\n(cid:88)\n(cid:88)\n\nreciprocity\n\n(cid:88)\n(cid:88)\n(cid:88)\n\nWe perform posterior inference using a Markov chain Monte Carlo algorithm. For our Hawkes-CCRM\nmodel, we follow the two-step procedure described in Section 5. For each dataset,\nsome background information guides the choice of the number p of communities. The\nnumber of communities p is set to p = 4 for the EMAIL dataset, as there are 4 departments at the\ninstitution, p = 2 for the COLLEGE dataset corresponding to the two genders, and p = 3 for the\nMATH and UBUNTU datasets, corresponding to the three different types of possible interactions. We\nfollow Todeschini et al. 
(2016) regarding the choice of the MCMC tuning parameters and initialise\nthe MCMC algorithm with the estimates obtained by running an MCMC algorithm with p = 1 feature\nwith fewer iterations. For all experiments we run 2 chains in parallel for each stage of the inference.\nWe use 100000 iterations for the first stage and 10000 for the second one. For the Hawkes-IRM\nmodel, we also use a similar two-step procedure, which first obtains a point estimate of the parameters\nand hyperparameters of the IRM, then estimates the parameters of the Hawkes process given this\npoint estimate. This allows us to scale this approach to the large datasets considered. We use the\nsame number of MCMC samples as for our model for each step.\nWe compare the different algorithms on link prediction. For each dataset, we make a train-test split\nin time so that the training dataset contains 85% of the total temporal interactions. We use the\ntraining data for parameter learning and then use the estimated parameters to perform link prediction\non the held-out test data. We report the root mean square error between the predicted and true\nnumber of interactions for each directed pair in the test set. The results are reported in Table 1. On\nall the datasets, the proposed Hawkes-CCRM approach outperforms other methods. Interestingly,\nthe addition of the Hawkes component brings improvement for both the IRM-based model and the\nCCRM-based model.\nCommunity structure, degree distribution and sparsity. Our model also estimates the latent\nstructure in the data through the weights wik, representing the level of affiliation of a node i to a\ncommunity k. For each dataset, we order the nodes by their highest estimated feature weight, obtaining\na clustering of the nodes. We represent the ordered matrix (zij(T)) of binary interactions in Figure 4\n(a)-(d). 
This shows that the method can uncover the latent community structure in the different datasets.\nWithin each community, nodes still exhibit degree heterogeneity as shown in Figure 4 (e)-(h), where\nthe nodes are sorted within each block according to their estimated sociability $\\hat{w}_{i0}$. The ability of the\napproach to uncover latent structure was illustrated by Todeschini et al. (2016), who demonstrate that\nmodels which do not account for degree heterogeneity cannot capture the latent community structure\nbut rather cluster the nodes based on their degree.\n\nFigure 4 (in each row, the four panels correspond to EMAIL, COLLEGE, MATH and UBUNTU): Top (a)-(d): Sorted adjacency matrix for each dataset. The vertices are classified to one of the\ncommunities based on their highest affiliation. Darker color corresponds to more interactions. Middle (e)-(h):\nSorted adjacency matrix. The nodes are grouped according to their highest affiliation and then sorted\naccording to their estimated sociability parameter $\\hat{w}_{i0}$. Bottom (i)-(l): Empirical degree distribution (red)\nand posterior predictive distribution (blue).\n\nWe also look at the posterior predictive\ndegree distribution based on the estimated hyperparameters, and compare it to the empirical degree\ndistribution in the test set. The results are reported in Figure 4 (i)-(l), showing a reasonable fit to\nthe degree distribution. Finally we report the 95% posterior credible intervals (PCI) for the sparsity\nparameter \u03c3 for all datasets. The PCIs are (\u22120.69, \u22120.50), (\u22120.35, \u22120.20), (0.15, 0.18) and (0.51, 0.57) for EMAIL, COLLEGE, MATH and UBUNTU\nrespectively. The range of \u03c3 is (\u2212\u221e, 1). 
EMAIL and COLLEGE give negative values corresponding\nto denser networks whereas the MATH and UBUNTU datasets are sparser.\n\n7 Conclusion\n\nWe have presented a novel statistical model for temporal interaction data which captures multiple\nimportant features observed in such datasets, and shown that our approach outperforms competing\nmodels in link prediction. The model could be extended in several directions. One could consider\nasymmetry in the base intensities \u00b5ij \u2260 \u00b5ji and/or a bilinear form as in (Zhou, 2015). Another\nimportant extension would be the estimation of the number of latent communities/features p.\nAcknowledgments. The authors thank the reviewers and area chair for their constructive comments.\nXM, FC and YWT acknowledge funding from the ERC under the European Union\u2019s 7th Framework\nprogramme (FP7/2007-2013) ERC grant agreement no. 617071. FC acknowledges support\nfrom EPSRC under grant EP/P026753/1 and from the Alan Turing Institute under EPSRC grant\nEP/N510129/1. XM acknowledges support from the A. G. Leventis Foundation.\n\nReferences\nC. Blundell, K. Heller, and J. Beck. Modelling reciprocating relationships with Hawkes processes.\nIn Advances in Neural Information Processing Systems 25, volume 15, pages 5249\u20135262. Curran\nAssociates, Inc., 2012.\n\nC. Borgs, J. T. Chayes, H. Cohn, and N. Holden. Sparse exchangeable graphs and their limits via\ngraphon processes. Journal of Machine Learning Research, 18:1\u201371, 2018.\n\nD. Cai, T. Campbell, and T. Broderick. Edge-exchangeable graphs and sparsity. In D. D. Lee,\nM. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information\nProcessing Systems 29, pages 4249\u20134257. Curran Associates, Inc., 2016.\n\nF. Caron and E. B. Fox. Sparse graphs using exchangeable random measures. Journal of the Royal\nStatistical Society B, 79(5), 2017.\n\nH. Crane and W. Dempsey. A framework for statistical network modeling. 
arXiv preprint arXiv:1509.08185, 2015.\n\nH. Crane and W. Dempsey. Edge exchangeable models for interaction networks. Journal of the American Statistical Association, 113(523):1311\u20131326, 2018.\n\nC. Dubois, C. Butts, and P. Smyth. Stochastic blockmodeling of relational event dynamics. In Artificial Intelligence and Statistics, pages 238\u2013246, 2013.\n\nP. K. Gopalan, C. Wang, and D. Blei. Modeling overlapping communities with node popularities. In Advances in Neural Information Processing Systems, pages 2850\u20132858, 2013.\n\nJ. E. Griffin and F. Leisen. Compound random measures and their use in Bayesian non-parametrics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):525\u2013545, 2017.\n\nA. G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83\u201390, 1971.\n\nT. Herlau, M. N. Schmidt, and M. M\u00f8rup. Completely random measures for modelling block-structured sparse networks. In Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.\n\nP. E. Jacob, L. M. Murray, C. C. Holmes, and C. P. Robert. Better together? Statistical learning in models made of modules. arXiv preprint arXiv:1708.08719, 2017.\n\nS. Janson. On convergence for graphexes. arXiv preprint arXiv:1702.06389, 2017a.\n\nS. Janson. On edge exchangeable random graphs. To appear in Journal of Statistical Physics, 2017b.\n\nB. Karrer and M. E. J. Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.\n\nJ. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59\u201378, 1967.\n\nJ. Kingman. Poisson processes, volume 3. Oxford University Press, USA, 1993.\n\nJ. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.\n\nS. Linderman and R. Adams. 
Discovering latent network structure in point process data. In International Conference on Machine Learning, pages 1413\u20131421, 2014.\n\nY. C. Ng and R. Silva. A dynamic edge exchangeable model for sparse temporal networks. arXiv preprint arXiv:1710.04008, 2017.\n\nK. Palla, F. Caron, and Y. Teh. Bayesian nonparametrics for sparse dynamic networks. arXiv preprint arXiv:1607.01624, 2016.\n\nJ. G. Rasmussen. Bayesian inference for Hawkes processes. Methodology and Computing in Applied Probability, 15:623\u2013642, 2013.\n\nA. Todeschini, X. Miscouridou, and F. Caron. Exchangeable random measure for sparse networks with overlapping communities. arXiv:1602.02114, 2016.\n\nV. Veitch and D. M. Roy. The class of random graphs arising from exchangeable random measures. arXiv preprint arXiv:1512.03099, 2015.\n\nS. Williamson. Nonparametric network models for link prediction. Journal of Machine Learning Research, 17(202):1\u201321, 2016.\n\nM. Zhou. Infinite Edge Partition Models for Overlapping Community Detection and Link Prediction. In G. Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 1135\u20131143, San Diego, California, USA, 09\u201312 May 2015. PMLR.\n", "award": [], "sourceid": 1190, "authors": [{"given_name": "Xenia", "family_name": "Miscouridou", "institution": "University of Oxford"}, {"given_name": "Francois", "family_name": "Caron", "institution": "Oxford"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford, DeepMind"}]}