{"title": "Proximal Graphical Event Models", "book": "Advances in Neural Information Processing Systems", "page_first": 8136, "page_last": 8145, "abstract": "Event datasets include events that occur irregularly over the timeline and are prevalent in numerous domains. We introduce proximal graphical event models (PGEM) as a representation of such datasets. PGEMs belong to a broader family of models that characterize relationships between various types of events, where the rate of occurrence of an event type depends only on whether or not its parents have occurred in the most recent history. The main advantage over the state of the art models is that they are entirely data driven and do not require additional inputs from the user, which can require knowledge of the domain such as choice of basis functions or hyperparameters in graphical event models. We theoretically justify our learning of  optimal windows for parental history and the choices of parental sets, and the algorithm are sound and complete in terms of parent structure learning.  We present additional efficient heuristics for learning PGEMs from data, demonstrating their effectiveness on synthetic and real datasets.", "full_text": "Proximal Graphical Event Models\n\nDebarun Bhattacharjya\n\nThomas J. Watson Research Center, Yorktown Heights, NY, USA\n\n{debarunb,dharmash,tgao}@us.ibm.com\n\nDharmashankar Subramanian\nIBM Research\n\nTian Gao\n\nAbstract\n\nEvent datasets involve irregular occurrences of events over the timeline and are\nprevalent in numerous domains. We introduce proximal graphical event models\n(PGEMs) as a representation of such datasets. PGEMs belong to a broader family\nof graphical models that characterize relationships between various types of events;\nin a PGEM, the rate of occurrence of an event type depends only on whether or\nnot its parents have occurred in the most recent history. 
The main advantage over\nstate-of-the-art models is that learning is entirely data driven and without the need\nfor additional inputs from the user, which can require knowledge of the domain\nsuch as choice of basis functions and hyper-parameters. We theoretically justify our\nlearning of parental sets and their optimal windows, proposing sound and complete\nalgorithms in terms of parent structure learning. We present ef\ufb01cient heuristics for\nlearning PGEMs from data, demonstrating their effectiveness on synthetic and real\ndatasets.\n\n1\n\nIntroduction and Related Work\n\nEvent datasets are sequences of events of various types that typically occur as irregular and asyn-\nchronous continuous-time arrivals. This is in contrast to time series data, which are observations of\ncontinuous-valued variables over regular discrete epochs in time. Examples of event datasets include\nlogs, transactions, noti\ufb01cations and alarms, insurance claims, medical events, political events, and\n\ufb01nancial events.\nIt is well known that a multivariate point process is able to capture the dynamics of events occurring\nin continuous time, under reasonable regularity conditions, using conditional intensity functions.\nThese are akin to hazard rates in survival analysis and represent the rate at which an event type\noccurs, conditioned on the history of event occurrences. Learning arbitrary history-dependent\nintensity functions can be dif\ufb01cult and impractical, thus the literature makes various simplifying\nassumptions. 
Some examples of such point processes include continuous time noisy-or (CT-NOR)\nmodels [Simma et al., 2008], Poisson cascades [Simma and Jordan, 2010], Poisson networks [Rajaram\net al., 2005], piecewise-constant conditional intensity models [Gunawardana et al., 2011], forest-based\npoint processes [Weiss and Page, 2013], multivariate Hawkes processes [Zhou et al., 2013], and\nnon-homogeneous Poisson processes [Goulding et al., 2016].\nGraphical event models (GEMs) have been proposed as a graphical representation for multivariate\npoint processes [Didelez, 2008, Meek, 2014, Gunawardana and Meek, 2016]. Unlike graphical\nmodels for discrete-time dynamic uncertain variables such as dynamic Bayesian networks [Dean\nand Kanazawa, 1989, Murphy, 2002] and time series graphs [Eichler, 1999, Dahlhaus, 2000], GEMs\ncapture continuous-time processes. They also differ from continuous-time Bayesian networks\n[Nodelman et al., 2002], which represent homogeneous Markov models of the joint trajectories of\ndiscrete variables rather than models of event streams in continuous time. GEMs provide a framework\nthat generalizes many of the afore-mentioned history-dependent models for event datasets, many of\nwhich make the assumption of piece-wise constant conditional intensity functions. The literature\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: An example involving M = 3 event labels: a) Event dataset; b) Example PGEM; c) Surface plot of\nlog likelihood for node C, given parents A and B, as a function of windows wac and wbc.\n\ntakes varying approaches to the representation and learning of such functions, including decision\ntrees [Gunawardana et al., 2011], forests [Weiss and Page, 2013], and generalized linear models\n[Rajaram et al., 2005].\nA major drawback in these existing approaches is that they require the user to specify a set of\nbasis functions in the form of relevant time intervals in the history. 
It is not obvious beforehand\nin most applications how to specify such a basis. Alternatively, if a basis is chosen in a manner\nthat is exhaustively data-adaptive, i.e. using all historical epochs of event arrivals to de\ufb01ne all\nhistorical time intervals of interest, one ends up with a prohibitively large basis set that makes any\nlearning impractical. Thus there is a need to investigate approaches that don\u2019t require such a basis set\nspeci\ufb01cation and yet provide practical learning algorithms.\nIn this paper, we introduce proximal graphical event models (PGEMs), where the intensity of an event\nlabel depends on occurrences of its parent event labels in the graph within the most recent history, i.e.\nin temporal proximity (see Figure 1(b) for an example). Although PGEMs are a special case from\nthe piece-wise constant conditional intensity family, there are several advantages of such models\nand our work on learning them from data: 1) They are practical models, capturing the assumption\nthat the most recent history is suf\ufb01cient for understanding how the future may unfold. We argue\nthat PGEMs are particularly interpretable event models and could be useful for providing insights\nabout the dynamics in an event dataset to political or \ufb01nancial analysts or medical practitioners or\nscientists; 2) Importantly, we present data-driven algorithms that learn a PGEM from an event dataset\nwithout additional user information, unlike the state-of-the-art models; 3) We present polynomial\ntime heuristic algorithms that make PGEM learning computationally more tractable and therefore\namenable to large event datasets, possibly with a large number of event types.\n\n2 Notation and Model Formulation\n\n2.1 Preliminaries\nAn event dataset is denoted D = {(li, ti)}N\ni=1, where ti is the occurrence time of the ith event,\nti \u2208 R+, and li is an event label/type belonging to a \ufb01nite alphabet L with cardinality |L| = M. 
We\nassume a temporally ordered dataset, ti \u2264 tj for i < j, with initial time t0 = 0 \u2264 t1 and end time\ntN +1 = T \u2265 tN , where T is the total time period. Figure 1(a) shows an example event dataset with\nN = 7 events from the event label set L = {A, B, C} over T = 20 days.\nIn this paper, we propose learning algorithms that are data-driven; specifically, we will rely on\ninter-event times between event labels in the dataset. We denote the set of times from the most recent\noccurrence of Z, if Z has occurred, to every occurrence of X (Z \u2260 X) as {\u02c6tzx}. We use {\u02c6tzz} to\ndenote inter-event times between Z occurrences, including the time from the last occurrence of Z to\nthe final time T . In the Figure 1(a) example, {\u02c6tac} = {2, 8}, {\u02c6tbc} = {1, 7} and {\u02c6tbb} = {3, 7, 7}.\n\n2.2 PGEM Formulation\n\nAn event dataset can be modeled using marked point processes, whose parameters are conditional\nintensity functions; in the most general case, the conditional intensity for event label X is a function\nof the entire history, \u03bbx(t|ht), where ht includes all events up to time t, ht = {(li, ti) : ti \u2264 t}. We\nuse lower case x wherever we refer to label X in subscripts or parentheses. A graphical representation\nof a marked point process can help specify the historical dependence. For graph G = (L,E) where\nnodes correspond to event labels, the conditional intensity for label X depends only on historical\noccurrences of its parent event labels, therefore \u03bbx(t|ht) = \u03bbx(t|[h(U)]t), where U are parents of\nnode X in G and [h(U)]t is the history restricted to event labels in set U [Gunawardana and Meek,\n2016]. We refer to nodes and event labels interchangeably.\nA proximal graphical event model M consists of a graph along with a set of (time) windows and\nconditional intensity parameters, M = {G,W, \u039b}. 
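The inter-event statistics defined above can be computed in a single pass over the ordered dataset. Below is a minimal sketch; the event stream used in the test is hypothetical (not the actual Figure 1(a) data), but constructed so that it reproduces the quoted sets:

```python
from collections import defaultdict

def inter_event_times(events, T):
    """Most-recent inter-event statistics for an ordered stream of (label, time) pairs.

    Returns t_hat[(Z, X)]: for Z != X, the time from the most recent Z to each
    later occurrence of X; for Z == X, the gaps between consecutive Z's plus the
    gap from the last Z to the horizon T.
    """
    t_hat = defaultdict(list)
    last = {}                          # most recent occurrence time of each label
    for label, t in events:            # events assumed sorted by time
        for z, tz in last.items():
            if z != label:
                t_hat[(z, label)].append(t - tz)
        if label in last:              # self gap between consecutive occurrences
            t_hat[(label, label)].append(t - last[label])
        last[label] = t
    for z, tz in last.items():         # close out self gaps with the horizon T
        t_hat[(z, z)].append(T - tz)
    return dict(t_hat)
```

The maximum of the self gaps, e.g. the largest entry of the (Z, Z) list, is what later serves as the upper bound on candidate windows from parent Z (Theorem 2).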
There is a window for every edge in the graph, W = {wx :\n\u2200X \u2208 L}, where wx = {wzx : \u2200Z \u2208 U} denotes the set of all windows corresponding to incoming\nedges from X\u2019s parents U. \u039b = {\u03bb^{wx}_{x|u} : \u2200X \u2208 L} is the set of all conditional intensity parameters.\nFor node X, there is a parameter for every instantiation u of its parent occurrences, depending\non whether a parent event label has occurred in its window; thus there are 2^{|U|} parameters for X,\nmaking the PGEM parametrization analogous to a Bayesian network with binary variables. To\navoid notational clutter, we will hide the window superscript for conditional intensities. Figure 1(b)\nprovides an illustrative PGEM graph along with the windows. In this example, parameter \u03bbc|a\u00afb\nsignifies the rate at which C occurs at any time t given that A has occurred at least once in the interval\n[t \u2212 wac, t) and that B has not occurred in [t \u2212 wbc, t).\n\n3 Learning PGEMs\nThe learning problem is as follows: given an event dataset D, estimate a PGEM M = {G,W, \u039b},\ni.e. parents, windows, and conditional intensity parameters for each event label. In this section, we\nfirst discuss learning windows and parameters for a node given its parents, and then present some\ntheoretical results to help with parent search. 
We end the section with practical heuristic algorithms.\n\n3.1 Learning Windows\n\nWhen the parents U of all nodes X are known, the log likelihood of an event dataset given a PGEM\ncan be written in terms of the summary statistics of counts and durations in the data and the conditional\nintensity rates of the PGEM:\n\nlogL(D) = \u2211_X \u2211_u (\u2212\u03bbx|u D(u) + N (x; u) ln(\u03bbx|u)),    (1)\n\nwhere N (x; u) is the number of times that X is observed in the dataset and that the condition u\n(from 2^{|U|} possible parental combinations) is true in the relevant preceding windows, and D(u) is\nthe duration over the entire time period where the condition u is true. Formally, N (x; u) =\n\u2211_{i=1}^{N} I(li = X) I^{wx}_u(ti) and D(u) = \u2211_{i=1}^{N+1} \u222b_{ti\u22121}^{ti} I^{wx}_u(t) dt, where I^{wx}_u(t) is an indicator\nfor whether u is true at time t as a function of the relevant windows wx. Note that we have hidden\nthe dependence of the summary statistics on windows wx for notational simplicity.\nFrom Equation 1, it is easy to see that the maximum likelihood estimates (MLEs) of the conditional\nintensity rates are \u02c6\u03bbx|u = N (x; u)/D(u). The following theorem uses this to provide a high-level recipe for\nfinding optimal windows for a node given its parents. N (x) denotes counts of event label X in the\ndata.\nTheorem 1. The log likelihood maximizing windows for a node X with parents U are those that\nmaximize the KL divergence between a count-based distribution with probabilities N (x; u)/N (x) and a\nduration-based distribution with probabilities D(u)/T .\nNote that for each time t \u2208 [0, T ], there is exactly one parental state u(ht, wx) that is active. Since\nthe number of such parental states over [0, T ] is finite (upper bounded by min(2^{|U|}, 2N ) and further\nlimited by what the data D and windows wx allow), this leads to a finite partition of [0, T ]. 
Each\nmember in this partition corresponds to some parental state u, and in general, it is a union of a\ncollection of non-intersecting half-open or closed time intervals that are subsets of [0, T ]. Each\nmember thus has a net total duration, which sums to T across the above partition, and similarly a net\ntotal count of the number of arrivals of type X. As such, wx taken with D is equivalent to two finite\ndistributions (histograms) whose support is over the above set of partition members, one each for\ncounts and the durations. The above theorem observes that the optimal wx is one where the count\nhistogram across the partition members maximally differs from the corresponding duration histogram,\nas per KL divergence. (All proofs are in the supplementary section.) In informal terms, the windows\nwx that lead to MLE estimates for conditional intensities are the ones where the summary statistics\nof empirical arrival rates differ maximally across the above parental state partition.\nThe challenge with applying Theorem 1 to the practical issue of finding the optimal windows is that\nthis is in general a difficult combinatorial optimization problem with a non-linear objective function.\nFigure 1(c) displays the shape of the log likelihood function for node C as a function of windows\nfrom its parents A and B in the PGEM from Figure 1(b). Note that the maximization over regionally\nconvex areas results in several local maxima.\nNext we provide an upper bound on the optimal window from parent Z to node X regardless of other\nconsiderations.\nTheorem 2. 
The log likelihood maximizing window wzx from parent Z to a node X is upper bounded\nby max{\u02c6tzz}, where {\u02c6t} denotes inter-event times, which is also taken to include the inter-event time\nbetween the last arrival of Z and T (end of the horizon).\n\nThe following theorem shows that when a node has a single parent, one can discover a small number\nof local maxima from the inter-event times in the data, thereby easily computing the global maximum\nby exhaustively comparing all local maxima.\nTheorem 3. For a node X with a single parent Z, the log likelihood maximizing window wzx either\nbelongs to or is a left limit of a window in the candidate set W \u2217 = {\u02c6tzx} \u222a max{\u02c6tzz}, where {\u02c6t}\ndenotes inter-event times.\n\nIn the proof of Theorem 3, we show that the candidate window set arises primarily because the\ncounts N (x; z) only change at the inter-event times {\u02c6tzx}. Note that the counts are step functions\nand therefore discontinuous at the jump points; this is the reason why the optimal window can be\na left limit of an element in W \u2217. We deal with this practically by searching over both W \u2217 as well\nas W \u2217 \u2212 \u0001 (\u0001 = 0.001 is chosen for all experiments). Next, we show that the optimal window from\nparent Z to node X given other parents and their windows also belongs to a grid, albeit a \ufb01ner one\nthan in Theorem 3.\nTheorem 4. 
For a node X and parent(s) Y, the log likelihood maximizing window for a new parent\nZ, wzx, given the windows corresponding to nodes from Y to X, either belongs to or is a left limit of\na window in the candidate set W \u2217 = {\u02c6tzx} \u222a \u02c6Cy,z, where {\u02c6t} denotes inter-event times and \u02c6Cy,z\nare change points across the set of the piecewise linear functions D(y, z) (multiple functions, due to\nmultiple parental state combinations) that are obtainable from Algorithm 1.\n\n\u02c6Cy,z captures all the change points that are pertinent to any of the functions D(y, z) when a window\nw is varied over [0, \u00afW ], where \u00afW = max{\u02c6tzz} is an upper bound on the optimal w (Theorem 2).\nWe will use the above two theorems in our heuristics for \ufb01nding the optimal windows and parameters\ngiven a parent set.\n\n3.2 Optimal Parent Set Search\n\nIn the literature on Bayesian networks, various scores such as Akaike information criterion (AIC),\nBayesian information criterion (BIC), and Bayesian Dirichlet equivalent uniform (BDeu) are used to\nlearn the graphical structure from data. These scores can be viewed as combining the log likelihood\nof the data with a term that penalizes the complexity of the model. Prior work has shown that these\ncriteria enjoy some properties that can help parent search by eliminating non-optimal parent sets\nquickly, reducing the search space size and speeding up learning [Teyssier and Koller, 2005, Campos\nand Ji, 2011]. Here we provide similar theoretical results on parent set search in PGEMs. 
Algorithm 1 Change points in w across all of the piece-wise linear functions D(y, z)\nInputs: Dataset D, Labels X, Z, given parent set Y with windows wyx, y \u2208 Y\nOutputs: \u02c6Cy,z, initialized to \u2205\nInitialize: S = {[tz,i, tz,i+1]}_{i=1}^{N(z)}, the ordered inter-arrival intervals of Z, with tz,N(z)+1 = T .\nfor all y in Y do\n  for all si = [tz,i, tz,i+1] in S do\n    Let: cl(si) = maxk{ty,k | ty,k < tz,i}\n    Let: in(si) = {ty,k | ty,k \u2208 (tz,i, tz,i+1)}\n    \u02c6Cy,z = \u02c6Cy,z \u222a changepoints(si, cl(si), in(si), y)\n\u02c6Cy,z = \u02c6Cy,z \u222a {\u02c6t(k)zz}, i.e. add the set of the N (z) ascending order statistics of label Z inter-event\ntimes {\u02c6tzz}, including the inter-event time between the last arrival of Z and T\n\nchangepoints(si, cl(si), in(si), y):\nInitialize: Stack \u03c3 = \u2205, C = \u2205\nif (cl(si) + wyx) < tz,i+1 then\n  if (cl(si) + wyx) \u2208 (tz,i, tz,i+1) then\n    \u03c3.push(cl(si) + wyx \u2212 tz,i)\nfor all t in in(si) do\n  if \u03c3 not empty then\n    tail = \u03c3.top\n  else\n    tail = -1\n  if t \u2264 tail then\n    \u03c3.pop\n  else\n    \u03c3.push(t \u2212 tz,i)\n  if (t + wyx > tz,i+1) then\n    break\n  else\n    \u03c3.push(t + wyx \u2212 tz,i)\nC = set(\u03c3)\nreturn C\n\nWe use the BIC score in our experiments, defined for a PGEM as:\n\nBIC(D) = logL(D) \u2212 ln(T ) \u2211_X 2^{|U|}.    (2)\n\nFirst, we state a simple way to discard parent sets for a node in a PGEM, as used in Teyssier and\nKoller [2005], Campos and Ji [2011].\nLemma 5. Let X be an arbitrary node of G, a candidate PGEM graph where the parent set of X is\nU\u2032. If U \u2282 U\u2032 such that sX (U) > sX (U\u2032), where s is BIC, AIC, BDeu or a derived scoring criterion,\nthen U\u2032 is not the parent set of X in the optimal PGEM graph G\u2217.\nWhile Lemma 5 provides a way to eliminate low scoring structures locally, one still needs to\ncompute the scores of all possible parent sets and then remove the redundant ones. 
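As an illustration of how Equations 1 and 2 combine, the sketch below scores a single node from its summary statistics, substituting the MLE rates \u02c6\u03bbx|u = N(x; u)/D(u); the dictionary layout is our own illustrative choice, not the authors' implementation:

```python
import math

def local_bic(stats, T):
    """Local BIC score for one node X.

    stats maps each parental state u -> (N(x; u), D(u)); durations should sum to T.
    The log likelihood term plugs the MLE rate N(x; u) / D(u) into Equation 1;
    the penalty is ln(T) times the number of parental states, 2^|U| (Equation 2).
    """
    ll = 0.0
    for n, d in stats.values():
        if n > 0 and d > 0:
            rate = n / d                      # MLE conditional intensity
            ll += -rate * d + n * math.log(rate)
    return ll - math.log(T) * len(stats)
```

Splitting a parental state in two without changing the empirical rates leaves the likelihood term unchanged while growing the penalty, so the coarser parent set scores higher; this is the kind of effect that Lemma 5 exploits to prune parent sets.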
The search still\nrequires M \u00d7 2^M asymptotic score computations and the same complexity for parent score storage,\nalthough the space can be reduced after applying Lemma 5. We present the following two results\nto reduce some of these computations. We focus on the BIC score but similar results should hold\nfor other scores. Since BIC is decomposable, the local BIC score for node X can be expressed as\nsX (U) = LX (U) \u2212 tX (U), where LX (U) is the log likelihood term and tX (U) is the structure\npenalty term, tX (U) = ln T \u00b7 2^{|U|}.\nTheorem 6. Using BIC as the score function, suppose that X and U are such that 2^{|U|} >\nN (x)(1 \u2212 ln N (x))/ln T + N (x), where 2^{|U|} is the total size of all possible parent combinations, N (x)\nis the total count of X in the data and T is the maximal time horizon. If U\u2032 is a proper superset of U,\nthen U\u2032 is not the parent set of X in the optimal PGEM graph.\n\nAlgorithm 2 Forward Backward Search\n\nInputs: Event label X, event dataset D\nOutputs: Parents U, windows wx, lambdas \u03bbx|u, score such as BIC\nForward Search: Initialize U = \u2205; S = \u2212\u221e\nwhile the score can be improved and more parents can be added do\n  for all Z not in U do\n    Find all optimal windows and \u03bbs with Z added to U and the corresponding score S(U \u222a Z)\n  if maxZ{S(U \u222a Z)} > S then\n    Add the maximizing Z to U, S = maxZ{S(U \u222a Z)}\nBackward Search: Start with parent set U and S from forward search\nwhile the score can be improved and U \u2260 \u2205 do\n  for all Z in U do\n    Find all optimal windows and \u03bbs with Z removed from U and the corresponding score S(U\\Z)\n  if maxZ{S(U\\Z)} > S then\n    Remove the maximizing Z from U, S = maxZ{S(U\\Z)}\nCorollary 7. 
Using BIC as the score function, the optimal graph G\u2217 has at most O(log2 N (x))\nparents for each node X.\n\nTheorem 6 and Corollary 7 ensure that we only need to compute O(\u2211_{k=0}^{\u2308log2 N (x)\u2309} (M\u22121 choose k)) elements\nfor each variable X.\nThe next theorem does not directly improve the theoretical bound on the parent set size that is\nachieved by Corollary 7, but it helps in practice as it is applicable to cases where Theorem 6 is not,\nimplying even fewer parent sets need to be tested.\nTheorem 8. Using BIC as score function s, let X be a node with two possible parent sets U \u2282 U\u2032\nsuch that tX (U\u2032) + sX (U) > 0. Then U\u2032 and all its supersets U\u2032\u2032 \u2283 U\u2032 are not optimal parent sets\nfor X in the optimal PGEM graph.\n\nHence, Theorem 8 can be used to discard additional parent sets without computing their local scores.\nEvery time the score of a parent set U of X is about to be computed, we can take the best score of\nany of its subsets and test it against the theorem. If the condition applies, we can safely discard U and\nall its supersets. To summarize, we would need to build all possible parent sets up to size O(log2 N (x))\nfor each X and then use Theorem 8 and then Lemma 5 to test the optimal parent set.\n\n3.3 A Forward-Backward Search Algorithm\n\nWe propose a forward-backward search (FBS) algorithm to learn the structure of a PGEM as shown\nin Algorithm 2. Since a PGEM can include cycles, there are no acyclicity constraints like in Bayesian\nnetworks, therefore we can run Algorithm 2 on each node/label X separately. This local learning\napproach is similar to local learning in Bayesian networks [Gao and Wei, 2018] but can contain\ncycles.\nGiven an event data set D and a target label X, FBS first initializes the parent set U to be empty. 
At\neach step of a forward search, FBS iteratively chooses a parent candidate Z that is not in U, and finds\nthe best window and rates \u03bb that maximize the score S(U \u222a Z) with parent set U \u222a Z (as discussed in\nSection 3.1). If the maximized S(U \u222a Z) is better than the current best score S, then FBS chooses to\nadd Z to U and update S. It runs until all variables have been tested or no parent set would improve\nthe score (as discussed in Section 3.2). Then during the backward search step, FBS iteratively tests if\neach variable Z in U can be removed, i.e. if the removed set U \\ Z would give a better score. If so,\nZ is removed from U. Backward search runs until the score S cannot be improved or U becomes\nempty.\nWith the optimal parent set search with bounded sizes and the determination of optimal windows and\nconditional intensity rates given a graph, one can show the soundness and completeness of Algorithm 2\nunder mild assumptions. Gunawardana and Meek [2016] show that backward and forward search with\nBIC scores is sound and complete for a family of GEMs. Assuming that the underlying distribution\ncan be captured uniquely by a PGEM, and since PGEMs can be considered a sub-class of\nthis family while Algorithm 2 is a similar forward-and-backward search, soundness and completeness\napply in this instance as well.\nTheorem 9. Under the large sample limit and no detailed balance assumptions [Gunawardana and\nMeek, 2016], Algorithm 2 is sound and complete.\n\nJointly optimizing the windows for multiple parents simultaneously is a hard problem in general.\nWe instead realize two efficient heuristics based on the above FBS procedure, namely FBS-IW and\nFBS-CW. In FBS-IW, we independently optimize the window for each parent relative to label X,\nusing the finite characterization of single-parent optimal windows presented in Theorem 3. 
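Abstracting the window and rate optimization into a single score function, the greedy loop of Algorithm 2 can be sketched as follows (an illustrative skeleton, not the authors' implementation):

```python
def forward_backward_search(labels, score):
    """Greedy parent search for one node, in the spirit of Algorithm 2.

    `labels` is the set of candidate parent labels; `score(parents)` returns the
    best achievable score (e.g. BIC) with that parent set, after optimizing the
    windows and conditional intensity rates as in Section 3.1.
    """
    parents = frozenset()
    best = score(parents)
    while True:                                # forward: greedily add parents
        gains = {z: score(parents | {z}) for z in labels - parents}
        if not gains:
            break
        z = max(gains, key=gains.get)
        if gains[z] <= best:                   # no addition improves the score
            break
        parents, best = parents | {z}, gains[z]
    while parents:                             # backward: greedily remove parents
        drops = {z: score(parents - {z}) for z in parents}
        z = max(drops, key=drops.get)
        if drops[z] <= best:                   # no removal improves the score
            break
        parents, best = parents - {z}, drops[z]
    return parents, best
```

FBS-IW and FBS-CW differ only in how `score` optimizes the windows: independently per parent (Theorem 3) or by block coordinate ascent over the current parents (Theorem 4).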
After each\nindividual parent\u2019s window has been independently optimized, we compute the corresponding finite\npartition of [0, T ] in terms of parental states, and use the sufficient statistics in each partition member\nto estimate the corresponding conditional intensity rates. In FBS-CW, we appeal to Theorem 4 and\nrealize a block coordinate ascent strategy (over parent labels) for optimizing the windows. For each\nparent that is added in the forward search, we optimize its window while keeping all the other existing\nparents fixed at their current windows. The rate estimation is then as described above for FBS-IW.\nWe add parents in the forward search if there is a score improvement based on the new windows and\nrates. For the backward search, we delete a parent, retain existing windows for remaining parents and\nonly recompute the intensity rates in both FBS-IW and FBS-CW.\nTheorem 10. If all event labels occur in the dataset in similar proportions, the worst case complexities\nof the FBS-IW and FBS-CW algorithms are O(N^2 + M^3 N) and O(M^3 N^2) respectively.\n\n4 Experiments\n\nWe consider two baselines for our experiments. A superposition of independent Poisson (SIP) arrivals\nis a weak baseline that treats every event label as an independent Poisson process and is equivalent\nto a PGEM without edges. We also test the CPCIM algorithm [Parikh et al., 2012], shown to be an\nimproved version of the piecewise-constant intensity model (PCIM) [Gunawardana et al., 2011] and\nother variants, to compare the performance of the proposed algorithm. For CPCIM, we used the\nfollowing hyper-parameters. The conjugate prior for conditional intensity has two parameters, the\npseudo-count \u03b1 and pseudo-duration \u03b2 for each label. 
We used the same values for all labels, by\ncomputing a ratio \u03c1 of the total number of all arrivals over all labels to the total duration for all labels\n(the product of the number of labels and the horizon T under consideration). This ratio provides an\nempirically based estimate of the arrival rate. We ran experiments using \u03b1 = K\u03c1, \u03b2 = K, for various\nvalues of K = 10, 20, . . . , where higher values of K correspondingly increase the influence of the\nprior on the results. Experimental results presented in this section are for K = 20. The structural\nprior \u03ba was fixed at 0.1 [Gunawardana et al., 2011]. We also experimented with MFPP [Weiss and\nPage, 2013], which is based on random forests, but we observed high sensitivity to forest parameters\nas well as randomness in the optimized log likelihood values, which went to negative infinity in many\nruns. We therefore present comparisons with only SIP and CPCIM in the experiments. Both PGEM\nlearning algorithms use \u03b5 = 0.001 to search for left limiting points.\n\n4.1 Synthetic Datasets\nWe generate PGEMs for a label set L of size M through the following process. For each node, we\nselect the number of its parents K uniformly between the parameters Kmin \u2265 0 and Kmax \u2264 M in\ninteger increments; a random subset of size K from L is then chosen as its parent set. We generate\nwindows for each edge uniformly from wmin to wmax in increments of \u2206w. For the conditional\nintensity rates, we assume that each of a node\u2019s parents has either a multiplicative amplification or damping\neffect beyond a baseline rate of r/M (r = 1 implies an overall rate of one label per time period in\nthe dataset). Nodes that always increase occurrence rate for their children are obtained by randomly\nchoosing a subset LA of size KA from L. 
Nodes in the sets LA and L\\LA have an amplification and\ndamping rate of \u03b3A and \u03b3D respectively.\nFigure 2 compares models using 6 PGEMs generated from the afore-mentioned process; the top and\nbottom rows have PGEMs with M = 5 and M = 10 labels respectively. Other details about the model\ngenerating parameters are described in the supplementary material. For each model, we generated 10\nevent datasets over T = 1000 days (around 3 years) from a synthetic PGEM generator. Windows\nwere chosen to range from a fortnight to 2 months. For CPCIM, we used intervals of the\nform [t \u2212 t\u2217, t) as basis functions, where t\u2217 \u2208 {1, 2, 3, 4, 5, 6, 7, 15, 30, 45, 60, 75, 90, 180}. The\nboxplots indicate that the PGEM learning algorithms beat the baselines and come close to matching\nthe log likelihood of the true model on the datasets. We observed in these and other experiments that\nthe PGEM learning algorithms perform comparably; we therefore restrict our attention to the more\nefficient FBS-IW algorithm in subsequent experiments.\n\nFigure 2: Model comparisons with 10 synthetic event datasets generated from 6 PGEMs. The top and bottom\nrows have PGEMs with M = 5 and M = 10 labels respectively. Both PGEM learning algorithms (FBS-IW and\n-CW) are compared with baseline models SIP and CPCIM as well as the true model.\n\nTable 1: Log likelihood of models for experiments on the books dataset\n\nDataset            | SIP     | PGEM    | CPCIM\nLeviathan (M = 10) | -19432  | -18870  | -19237\nLeviathan (M = 20) | -36398  | -35179  | -36055\nBible (M = 10)     | -76097  | -72013  | -72801\nBible (M = 20)     | -147706 | -138190 | -140327\n\n4.2 Real Datasets\n\nBooks. We consider two books from the SPMF data mining library [Fournier-Viger et al., 2014]:\nLeviathan, a book by Thomas Hobbes from the 1600s, and the Bible. We ignore the 100 most frequent
We ignore the 100 most frequent\nwords to remove stop-words and only retain the next most frequent M words; this provides us with\nlarge event datasets where every word in scope is an event label and its index in the book is the occur-\nrence time. For the Bible with M = 20, there are N = 19009 words. Table 1 shows that PGEM has\ngreater log likelihood than the baselines on the four datasets considered. For CPCIM, we used intervals\nof the form [t \u2212 t\u2217, t) as basis functions, where t\u2217 \u2208 {25, 50, 100, 200, 300, 400, 500, 1000, 5000}.\nThese datasets revealed to us how challenging it could be to identify basis functions, thereby reinforc-\ning the bene\ufb01ts of PGEMs.\nFrom Table 1, we see that PGEM outperforms both SIP and CPCIM consistently on the book datasets,\nwhile CPCIM is better than SIP. PGEM achieves the best result on all 4 datasets, with the smallest\nmargin of 400 in LL and up to 2000 over CPCIM.\n\nICEWS. We consider the Integrated Crisis Early Warning System (ICEWS) political relational\nevent dataset [O\u2019Brien, 2010], where events take the form \u2018who does what to whom\u2019, i.e. an event z\ninvolves a source actor az performing an action/verb vz on a target actor a(cid:48)\nz, denoted z = (az, vz, a(cid:48)\nz).\nIn ICEWS, actors and actions come from the Con\ufb02ict and Mediation Event Observations (CAMEO)\n\n8\n\n\fTable 2: Log likelihood of models for experiments on the ICEWS dataset\n\nDataset\nArgentina\nBrazil\nColombia\nMexico\nVenezuela\n\nSIP\n-11915\n-14289\n-4621\n-7895\n-8922\n\nPGEM CPCIM\n-8412\n-10631\n-8856\n-11706\n-2965\n-3557\n-5676\n-6011\n-5454\n-6757\n\nontology [Gerner et al., 2002]. Actors in this ontology could either be associated with generic\nactor roles and organizations (ex: Police (Brazil)) or they could be speci\ufb01c people (ex: Hugo\nChavez). Actions in the CAMEO framework are hierarchically organized into 20 high-level base\ncoded actions that range 1-20. 
For our experiment, we restricted attention to five countries, namely Brazil, Argentina, Venezuela, Mexico and Colombia, over a four-year time period, Jan 1, 2012 to Dec 31, 2015. We included only 5 types of actors, namely Police, Citizen, Government, Head of Government and Protester, normalizing for actual heads of government (e.g., mapping Hugo Chavez to Head of Government (Venezuela)). We considered 5 types of actions, namely Neutral [1-2], Verbal cooperation [3-5], Material cooperation [6-8], Verbal conflict [9-13] and Material conflict [14-20], where the numbers in brackets show how the action categories map to the CAMEO codes. For CPCIM, we used intervals of the form [t − t∗, t) as basis functions, where t∗ ∈ {7, 15, 30, 45, 60, 75, 90, 180}. From Table 2, we see that PGEM outperforms both SIP and CPCIM on 4 out of 5 countries, while CPCIM is better than PGEM for Mexico.

5 Conclusions

In this paper, we introduce proximal graphical event models, a novel model class for event datasets, with the following major contributions: 1) we study optimal window sizes in PGEMs and provide a theoretical analysis; 2) we derive efficient parent set size bounds in PGEMs for use in structure learning algorithms; 3) we propose a forward-backward search algorithm, with two efficient heuristics, to learn the structure and parameters of PGEMs; and 4) we demonstrate PGEM's superior modeling power on multiple synthetic and real datasets. Compared to existing methods, PGEMs do not require careful tuning of many hyper-parameters, making them both practical and interpretable. In practice, given the underlying parametric assumptions of a PGEM and the proposed heuristic approach to obtaining windows, the learning approach could potentially mischaracterize causal/acausal relationships between event types under more complex underlying distributions.
Nevertheless, we believe that PGEMs are readily suitable for many real-world applications.

Acknowledgments

We thank Nicholas Mattei and Karthikeyan Shanmugam for helpful discussions, Christian Shelton for help with the CPCIM code, and three anonymous reviewers for their valuable feedback.

References

C. P. de Campos and Q. Ji. Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research, 12(Mar):663–689, 2011.

R. Dahlhaus. Graphical interaction models for multivariate time series. Metrika, 51:157–172, 2000.

T. Dean and K. Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5:142–150, 1989.

V. Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society, Ser. B, 70(1):245–264, 2008.

M. Eichler. Graphical Models in Time Series Analysis. PhD thesis, University of Heidelberg, Germany, 1999.

P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu, and V. S. Tseng. SPMF: A Java open-source pattern mining library. Journal of Machine Learning Research (JMLR), 15:3389–3393, 2014.

T. Gao and D. Wei. Parallel Bayesian network structure learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1671–1680, 2018.

D. J. Gerner, P. A. Schrodt, O. Yilmaz, and R. Abu-Jabr. Conflict and mediation event observations (CAMEO): A new event data framework for the analysis of foreign policy interactions. International Studies Association (ISA) Annual Convention, 2002.

J. Goulding, S. Preston, and G. Smith. Event series prediction via non-homogeneous Poisson process modelling. In Proceedings of the Sixteenth IEEE Conference on Data Mining (ICDM), pages 161–170, 2016.

A. Gunawardana and C. Meek. Universal models of multivariate temporal point processes.
In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 556–563, 2016.

A. Gunawardana, C. Meek, and P. Xu. A model for temporal dependencies in event streams. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 1962–1970, 2011.

C. Meek. Toward learning graphical and causal process models. In Proceedings of the Uncertainty in Artificial Intelligence Workshop on Causal Inference: Learning and Prediction, pages 43–48, 2014.

K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California Berkeley, USA, 2002.

U. Nodelman, C. R. Shelton, and D. Koller. Continuous time Bayesian networks. In Proceedings of the Eighteenth International Conference on Uncertainty in Artificial Intelligence (UAI), pages 378–387, 2002.

S. P. O'Brien. Crisis early warning and decision support: Contemporary approaches and thoughts on future research. International Studies Review, 12:87–104, 2010.

A. P. Parikh, A. Gunawardana, and C. Meek. Conjoint modeling of temporal dependencies in event streams. In Proceedings of the Uncertainty in Artificial Intelligence Workshop on Bayesian Modeling Applications, August 2012.

S. Rajaram, T. Graepel, and R. Herbrich. Poisson-networks: A model for structured point processes. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS), pages 277–284, 2005.

A. Simma and M. I. Jordan. Modeling events with cascades of Poisson processes. In Proceedings of the Twenty-Sixth International Conference on Uncertainty in Artificial Intelligence (UAI), pages 546–555, 2010.

A. Simma, M. Goldszmidt, J. MacCormick, P. Barham, R. Black, R. Isaacs, and R. Mortier. CT-NOR: Representing and reasoning about events in continuous time.
In Proceedings of the Twenty-Fourth International Conference on Uncertainty in Artificial Intelligence (UAI), pages 484–493, 2008.

M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 584–590, 2005.

J. C. Weiss and D. Page. Forest-based point process for event prediction from electronic health records. In Machine Learning and Knowledge Discovery in Databases, pages 547–562, 2013.

K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional Hawkes processes. In Proceedings of the International Conference on Machine Learning (ICML), pages 1301–1309, 2013.