{"title": "Uncertainty on Asynchronous Time Event Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 12851, "page_last": 12860, "abstract": "Asynchronous event sequences are the basis of many applications throughout different industries. In this work, we tackle the task of predicting the next event (given a history), and how this prediction changes with the passage of time. Since at some time points (e.g. predictions far into the future) we might not be able to predict anything with confidence, capturing uncertainty in the predictions is crucial. We present two new architectures, WGP-LN and FD-Dir, modelling the evolution of the distribution on the probability simplex with time-dependent logistic normal and Dirichlet distributions. In both cases, the combination of RNNs with either Gaussian process or function decomposition allows to express rich temporal evolution of the distribution parameters, and naturally captures uncertainty. Experiments on class prediction, time prediction and anomaly detection demonstrate the high performances of our models on various datasets compared to other approaches.", "full_text": "Uncertainty on Asynchronous Time Event Prediction\n\nMarin Bilo\u0161\u2217, Bertrand Charpentier\u2217, Stephan G\u00fcnnemann\n\nTechnical University of Munich, Germany\n\n\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\n\nAbstract\n\nAsynchronous event sequences are the basis of many applications throughout dif-\nferent industries. In this work, we tackle the task of predicting the next event (given\na history), and how this prediction changes with the passage of time. Since at some\ntime points (e.g. 
predictions far into the future) we might not be able to predict anything with con\ufb01dence, capturing uncertainty in the predictions is crucial. We present two new architectures, WGP-LN and FD-Dir, modelling the evolution of the distribution on the probability simplex with time-dependent logistic normal and Dirichlet distributions. In both cases, the combination of RNNs with either Gaussian process or function decomposition allows expressing rich temporal evolution of the distribution parameters, and naturally captures uncertainty. Experiments on class prediction, time prediction and anomaly detection demonstrate the high performance of our models on various datasets compared to other approaches.\n\n1\n\nIntroduction\n\nDiscrete events, occurring irregularly over time, are a common data type generated naturally in our everyday interactions with the environment (see Fig. 2a for an illustration). Examples include messages in social networks, medical histories of patients in healthcare, and integrated information from multiple sensors in complex systems like cars. The problem we are solving in this work is: given a (past) sequence of asynchronous events, what will happen next? Answering this question enables us to predict, e.g., what action an internet user will likely perform or which part of a car might fail.\nWhile many recurrent models for asynchronous sequences have been proposed in the past [19, 6], they are ill-suited for this task since they output only a single prediction (e.g. the most likely next event). In an asynchronous setting, however, such a single prediction is not enough since the most likely event can change with the passage of time \u2013 even if no other events happen. Consider a car approaching another vehicle in front of it. Assuming nothing happens in the meantime, we can expect different events at different times in the future. 
When forecasting a short time ahead, one expects the driver to start overtaking; after a longer time one would expect braking; in the long term, one would expect a collision. Thus, the expected behavior changes depending on the time we forecast, assuming no events occurred in the meantime. Fig. 2a illustrates this schematically: having observed a square and a pentagon, it is likely to observe a square after a short time, while a circle is more likely after a longer time. Clearly, if some event occurs, e.g. braking/square, the event at the (then) observed time will be taken into account, updating the temporal prediction.\nAn ad-hoc solution to this problem would be to discretize time. However, if the events are near each other, a high sampling frequency is required, leading to a very high computational cost. Besides, since there can be intervals without events, an arti\ufb01cial \u2018no event\u2019 class is required.\nIn this work, we solve these problems by directly predicting the entire evolution of the events over (continuous) time. Given a past asynchronous sequence as input, we can predict and evaluate for\n\n\u2217Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fany future timepoint what the next event will likely be (under the assumption that no other event happens in between which would lead to an update of our model). Crucially, the likelihood of the events might change and one event can be more likely than others multiple times in the future. This periodicity exists in many event sequences. For instance, given that a person is currently at home, a smart home would predict a high probability that the kitchen will be used at lunch and/or dinner time (see Fig. 1a for an illustration). We require that our model captures such multimodality.\nWhile Fig. 
1a illustrates the evolution of the categorical distribution (corresponding to the probability of a speci\ufb01c event class to happen), an issue still arises outside of the observed data distribution. E.g. in some time intervals we can be certain that two classes are equiprobable, having observed many similar examples. However, if the model has not seen any examples at speci\ufb01c time intervals during training, we do not want to give a con\ufb01dent prediction. Thus, we incorporate uncertainty in a prediction directly in our model. In places where we expect events, the con\ufb01dence will be higher, and outside of these areas the uncertainty in a prediction will grow, as illustrated in Fig. 1b. Technically, instead of modeling the evolution of a categorical distribution, we model the evolution of a distribution on the probability simplex. Overall, our model enables us to operate with the asynchronous discrete event data from the past as input to perform continuous-time predictions into the future, incorporating the predictions\u2019 uncertainty. This is in contrast to existing works such as [6, 18].\n\nFigure 1: (a) An event can be expected multiple times in the future. (b) At some times we should be uncertain in the prediction. Yellow denotes higher probability density.\n\n2 Model Description\n\nWe consider a sequence [e1, . . . , en] of events ei = (ci, ti), where ci \u2208 {1, . . . , C} denotes the class of the ith event and ti \u2208 R is its time of occurrence. We assume the events arrive over time, i.e. ti > ti\u22121, and we introduce \u03c4\u2217i = ti \u2212 ti\u22121 as the observed time gap between the ith and the (i \u2212 1)th event. The history preceding the ith event is denoted by Hi. 
Let S = {p \u2208 [0, 1]C, \u2211c pc = 1} denote the set of probability vectors that form the (C \u2212 1)-dimensional simplex, and P (\u03b8) be a family of probability distributions on this simplex parametrized by parameters \u03b8. Every sample p \u223c P (\u03b8) corresponds to a (categorical) class distribution.\nGiven ei\u22121 and Hi\u22121, our goal is to model the evolution of the class probabilities, and their uncertainty, of the next event i over time. Technically, we model parameters \u03b8(\u03c4 ), leading to a distribution P over the class probabilities p for all \u03c4 \u2265 0. Thus, we can estimate the most likely class after a time gap \u03c4 by calculating argmaxc \u00afp(\u03c4 )c, where \u00afp(\u03c4 ) := Ep(\u03c4 )\u223cP (\u03b8(\u03c4 ))[p(\u03c4 )] is the expected probability vector. Even more, since we do not consider a point estimate, we can get the amount of certainty in a prediction. For this, we estimate the probability of class c being more likely than the other classes, given by qc(\u03c4 ) := Ep(\u03c4 )\u223cP (\u03b8(\u03c4 ))[1p(\u03c4 )c \u2265 maxc\u2032\u2260c p(\u03c4 )c\u2032]. This tells us how certain we are that one class is the most probable (i.e. \u2018how often\u2019 is c the argmax when sampling from P ).\nTwo expressive and well-established choices for the family P are the Dirichlet distribution and the logistic-normal distribution (Appendix A). Based on a common modeling idea, we present two models that exploit the speci\ufb01cities of these distributions: the WGP-LN (Sec. 2.1) and the FD-Dir (Sec. 2.2). We also introduce a novel loss to train these models in Sec. 2.3.\nIndependent of the chosen model, we have to tackle two core challenges: (1) Expressiveness. Since the time dependence of \u03b8(\u03c4 ) may be of different forms, we need to capture complex behavior. (2) Locality. 
For regions out of the observed data we want to have a higher uncertainty in our predictions. Speci\ufb01cally for \u03c4 \u2192 \u221e, i.e. far into the future, the distribution should have a high uncertainty.\n2.1 Logistic-Normal via a Weighted Gaussian Process (WGP-LN)\n\nWe start by describing our model for the case when P is the family of logistic-normal (LN) distributions. How to model a compact yet expressive evolution of the LN distribution? Our core idea is to exploit the fact that the LN distribution corresponds to a multivariate random variable whose logits follow a normal distribution \u2013 and a natural way to model the evolution of a normal distribution is a Gaussian Process. Given this insight, the core idea of our model is illustrated in Fig. 2: (1) we generate M pseudo points based on a hidden state of an RNN whose input is a sequence, (2) we \ufb01t\n\nFigure 2: The model framework. (a) During training we use sequences si. (b) Given a new sequence of events s the model generates pseudo points that describe \u03b8(\u03c4 ), i.e. the temporal evolution of the distribution on the simplex. These pseudo points are based on the data that was observed in the training examples and weighted accordingly. 
We also have a measure of certainty in our prediction.\na Gaussian Process to the pseudo points, thus capturing the temporal evolution, and (3) we use the learned GP for estimating the parameters \u00b5(\u03c4 ) and \u03a3(\u03c4 ) of the \ufb01nal LN distribution at any speci\ufb01c time \u03c4. Thus, by generating a small number of points we characterize the full distribution.\nClassic GP. To keep the complexity low, we train one GP per class c. That is, our model generates M points (\u03c4 (c)j, y(c)j) per class c, where y(c)j represents logits. Note that the \ufb01rst coordinate of each pseudo point corresponds to time, leading to the temporal evolution when \ufb01tting the GP. Essentially we perform a non-parametric regression from the time domain to the logit space. Indeed, using a classic GP along with the pseudo points, the parameters \u03b8 of the logistic-normal distribution, \u00b5 and \u03a3, can be easily computed for any time \u03c4 in closed form:\n\n\u00b5c(\u03c4 ) = kcT Kc\u22121 yc, \u03c32c(\u03c4 ) = sc \u2212 kcT Kc\u22121 kc\n\n(1)\n\nwhere Kc is the Gram matrix w.r.t. the M pseudo points of class c based on a kernel k (e.g. k(\u03c41, \u03c42) = exp(\u2212\u03b32(\u03c41 \u2212 \u03c42)2)). Vector kc contains at position j the value k(\u03c4 (c)j, \u03c4 ), yc at position j the value y(c)j, and sc = k(\u03c4, \u03c4 ). At every time point \u03c4 the logits then follow a multivariate normal distribution with mean \u00b5(\u03c4 ) and covariance \u03a3 = diag(\u03c32(\u03c4 )).\nUsing a GP enables us to describe complex functions. Furthermore, since a GP models uncertainty in the prediction depending on the pseudo points, uncertainty is higher in areas far away from the pseudo points. Speci\ufb01cally, this holds for the distant future; thus matching the idea of locality. However, uncertainty is always low around the M pseudo points. 
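As an illustration of Eq. 1, the following minimal sketch fits a plain (unweighted) GP to hypothetical pseudo points of a single class and evaluates the logit mean and variance at a query time. The pseudo points, the kernel scale gamma and the jitter term are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rbf_kernel(t1, t2, gamma=1.0):
    """k(t1, t2) = exp(-gamma^2 * (t1 - t2)^2), the kernel named in Eq. 1."""
    return np.exp(-(gamma ** 2) * (t1 - t2) ** 2)

def gp_logit_posterior(tau, pseudo_times, pseudo_logits, gamma=1.0, jitter=1e-6):
    """mu_c(tau) = k^T K^-1 y and sigma_c^2(tau) = s - k^T K^-1 k (Eq. 1)."""
    K = rbf_kernel(pseudo_times[:, None], pseudo_times[None, :], gamma)
    K = K + jitter * np.eye(len(pseudo_times))  # jitter for numerical stability
    k = rbf_kernel(pseudo_times, tau, gamma)    # vector k_c at the query time
    mu = k @ np.linalg.solve(K, pseudo_logits)
    var = rbf_kernel(tau, tau, gamma) - k @ np.linalg.solve(K, k)
    return mu, var

# Hypothetical pseudo points for one class: high logits around tau = 1.
times = np.array([0.5, 1.0, 1.5])
logits = np.array([1.0, 2.0, 1.0])
mu_near, var_near = gp_logit_posterior(1.0, times, logits)
mu_far, var_far = gp_logit_posterior(10.0, times, logits)
# Far from the pseudo points the mean reverts to 0 and the variance to the
# prior s = k(tau, tau), matching the locality property described above.
```

Near a pseudo point the variance is close to zero (high certainty), which is exactly the trade-off the weighted extension below addresses.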
Thus M should be carefully picked since there is a trade-off between having high certainty at (too) many time points and the ability to capture complex behavior. Therefore, in the following we present an extended version solving this problem.\nWeighted GP. We would like to pick M large enough to express rich multimodal functions and allow the model to discard unnecessary points. To do this we generate an additional weight vector w(c) \u2208 [0, 1]M that assigns the weight w(c)j to a point \u03c4 (c)j. Giving a zero weight to a point should discard it, and giving 1 will return the same result as with a classic GP. To achieve this goal, we introduce a new kernel function:\n\nk\u2032(\u03c41, \u03c42) = f (w1, w2) k(\u03c41, \u03c42)\n\n(2)\n\nwhere k is the same as above. The function f weights the kernel k according to the weights for \u03c41 and \u03c42. We require f to have the following properties: (1) f should be a valid kernel over the weights, since then the function k\u2032 is a valid kernel as well; (2) the importance of pseudo points should not increase, giving f (w1, w2) \u2264 min(w1, w2); this fact implies that a point with zero weight will be discarded since f (w1, 0) = 0 as desired. The function f (w1, w2) = min(w1, w2) is a simple choice that ful\ufb01lls these properties. In Fig. 3 we show the effect of different weights when \ufb01tting a GP (see Appendix B for a more detailed discussion of the behavior of the min kernel).\n\nFigure 3: WGP on toy data with different weights. (a) All weights are 1 \u2013 classic GP. (b) Zero weights discard points. (c) Mixed weight assignment.\n\nTo predict \u00b5 and \u03c32 for a new time \u03c4, we can now simply apply Eq. 
1 based on the new kernel k\u2032, where the weight for the query point \u03c4 is 1.\n\nFigure 4: Model diagram. An RNN encodes (ei\u22121, Hi\u22121) into a hidden state, a neural network generates the weighted pseudo points [w(c)j, \u03c4 (c)j, y(c)j]Mj=1, a GP \ufb01tted to them yields \u00b5c(\u03c4 ), \u03c32c(\u03c4 ), and \ufb01nally p(\u03c4 ) \u223c P (\u03b8(\u03c4 )).\n\nTo summarize: From a hidden state hi = RNN(ei\u22121, Hi\u22121) we use a neural network to generate M weighted pseudo points (w(c)j, \u03c4 (c)j, y(c)j) per class c. Fitting a Weighted GP to these points enables us to model the temporal evolution of N (\u00b5c(\u03c4 ), \u03c32c(\u03c4 )) and, accordingly, of the logistic-normal distribution. Fig. 4 shows an illustration of this model. Note that the cubic complexity of a GP, due to the matrix inversion, is not an issue since the number M is usually small (< 10), while still allowing to represent rich multimodal functions. Crucially, given the loss de\ufb01ned in Sec. 2.3, our model is fully differentiable, enabling ef\ufb01cient training.\n\n2.2 Dirichlet via a Function Decomposition (FD-Dir)\n\nNext, we consider the Dirichlet distribution to model the uncertainty in the predictions. The goal is to model the evolution of the concentration parameters \u03b1 = (\u03b11, . . . , \u03b1C)T of the Dirichlet over time. Since, unlike for the logistic-normal, we cannot draw the connection to the GP, we propose to decompose the parameters of the Dirichlet distribution with expressive (local) functions in order to allow complex dependence on time. 
Since the concentration parameters \u03b1c(\u03c4 ) need to be positive, we propose the following decomposition of \u03b1c(\u03c4 ) in the log-space\n\nlog \u03b1c(\u03c4 ) = \u2211Mj=1 w(c)j \u00b7 N (\u03c4 | \u03c4 (c)j, \u03c3(c)j) + \u03bd\n\n(3)\n\nwhere the real-valued scalar \u03bd is a constant prior on log \u03b1c(\u03c4 ) which takes over in regions where the Gaussians are close to 0.\nThe decomposition into a sum of Gaussians is bene\ufb01cial for various reasons: (i) First note that the concentration parameter \u03b1c can be viewed as the effective number of observations of class c. Accordingly, the larger log \u03b1c, the more certain the prediction becomes. Thus, the functions N (\u03c4 | \u03c4 (c)j, \u03c3(c)j) can describe time regions where we observed data and, thus, should be more certain; i.e. regions around the time \u03c4 (c)j where the \u2018width\u2019 is controlled by \u03c3(c)j. (ii) Since most of the functions\u2019 mass is centered around their mean, the locality property is ful\ufb01lled. Put differently: in regions where we did not observe data (i.e. where the functions N (\u03c4 | \u03c4 (c)j, \u03c3(c)j) are close to 0), the value log \u03b1c(\u03c4 ) is close to the prior value \u03bd. In the experiments, we use \u03bd = 0, thus \u03b1c(\u03c4 ) = 1 in the out-of-observed-data regions; a common (uninformative) prior value for the Dirichlet parameters. Speci\ufb01cally, for \u03c4 \u2192 \u221e the resulting predictions have a high uncertainty. (iii) Lastly, a linear combination of translated Gaussians is able to approximate a wide family of functions [4]. And, similar to the weighted GP, the coef\ufb01cients w(c)j allow discarding unnecessary basis functions.\nThe basis function parameters (w(c)j, \u03c4 (c)j, \u03c3(c)j) are the output of the neural network, and can also be interpreted as weighted pseudo points that determine the regression of the Dirichlet parameters \u03b8(\u03c4 ), i.e. \u03b1c(\u03c4 ), over time (Fig. 2 & Fig. 4). 
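The decomposition in Eq. 3 is straightforward to evaluate directly. The sketch below computes \u03b1c(\u03c4 ) for made-up basis parameters (the weights, centers and widths are illustrative assumptions, not learned values) and shows the locality property: away from the basis functions, \u03b1c(\u03c4 ) falls back to the prior exp(\u03bd) = 1:

```python
import numpy as np

def alpha_c(tau, w, centers, scales, nu=0.0):
    """Eq. 3: log alpha_c(tau) = sum_j w_j * N(tau | tau_j, sigma_j) + nu."""
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    # Gaussian density values N(tau | tau_j, sigma_j) for every basis function j
    dens = np.exp(-0.5 * ((tau[:, None] - centers) / scales) ** 2) / (
        scales * np.sqrt(2.0 * np.pi))
    return np.exp((w * dens).sum(axis=1) + nu)

# Made-up basis: class-c events were observed around tau = 1 and tau = 3.
w = np.array([2.0, 1.5])
centers = np.array([1.0, 3.0])
scales = np.array([0.2, 0.3])
a_in = alpha_c(1.0, w, centers, scales)    # inside an observed time region
a_out = alpha_c(10.0, w, centers, scales)  # far in the future
# With nu = 0, alpha_c(tau) -> exp(0) = 1 away from the data, i.e. the
# uninformative Dirichlet prior, so the prediction becomes uncertain there.
```

The large concentration inside the observed region and the fallback to 1 outside of it are exactly points (i) and (ii) above.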
The concentration parameters \u03b1c(\u03c4 ) themselves also have a natural interpretation: they can be viewed as the rate of events after time gap \u03c4.\n\n2.3 Model Training with the Distributional Uncertainty Loss\n\nThe core feature of our models is to perform predictions in the future with uncertainty. The classical cross-entropy loss, however, is not well suited to learn uncertainty on the categorical distribution since it is based on only a single point estimate of the class distribution. That is, the standard cross-entropy loss for the ith event between the true categorical distribution p\u2217i and the predicted (mean) categorical distribution pi is LCEi = H[p\u2217i, pi(\u03c4\u2217i)] = \u2212\u2211c p\u2217ic log pic(\u03c4\u2217i). Due to the point estimate pi(\u03c4 ) = Epi\u223cPi(\u03b8(\u03c4 ))[pi], the uncertainty on pi is completely neglected.\nInstead, we propose the uncertainty cross-entropy which takes uncertainty into account:\n\nLUCEi = Epi\u223cPi(\u03b8(\u03c4\u2217i ))[H[p\u2217i , pi]] = \u2212\u222b Pi(\u03b8(\u03c4\u2217i )) \u2211c p\u2217ic log pic\n\n(4)\n\nRemark that the uncertainty cross-entropy does not use the compound distribution pi(\u03c4 ) but considers the expected cross-entropy. Based on Jensen\u2019s inequality, it holds: 0 \u2264 LCEi \u2264 LUCEi. Consequently, a low value of the uncertainty cross-entropy guarantees a low value for the classic cross-entropy loss, while additionally taking the variation in the class probabilities into account. 
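The gap between the two losses can be checked numerically. The sketch below evaluates, for a made-up Dirichlet (our choice of parameters, purely for illustration), the classic cross-entropy of the mean distribution, a Monte-Carlo estimate of the uncertainty cross-entropy, and its digamma closed form (cf. Eq. 5), illustrating Jensen's inequality 0 \u2264 LCE \u2264 LUCE:

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def digamma(x, h=1e-5):
    # numerical digamma via a central difference of log-gamma (avoids SciPy)
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def ce_vs_uce(alpha, true_class, n_samples=200_000):
    """Classic CE on the mean class distribution vs. the uncertainty CE (Eq. 4)
    for p ~ Dir(alpha) with a one-hot target."""
    alpha = np.asarray(alpha, dtype=float)
    ce = -np.log(alpha[true_class] / alpha.sum())           # H[p*, E[p]]
    samples = rng.dirichlet(alpha, size=n_samples)
    uce_mc = -np.log(samples[:, true_class]).mean()         # E[H[p*, p]], Monte Carlo
    uce_cf = digamma(alpha.sum()) - digamma(alpha[true_class])  # digamma closed form
    return ce, uce_mc, uce_cf

ce, uce_mc, uce_cf = ce_vs_uce([2.0, 1.0, 1.0], true_class=0)
# Jensen's inequality: 0 <= ce <= uce_mc, and uce_mc agrees with uce_cf.
```

Both uncertainty-aware quantities exceed the point-estimate cross-entropy exactly because they account for the spread of the class probabilities.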
A comparison between the classic cross-entropy and the uncertainty cross-entropy on a simple classi\ufb01cation task and on anomaly detection in an asynchronous event setting is presented in Appendix F.\n\nIn practice the true distribution p\u2217i is often a one-hot encoded representation of the observed class ci which simpli\ufb01es the computations. During training, the models compute Pi(\u03b8(\u03c4 )) and evaluate it at the true time of the next event \u03c4\u2217i given the past event ei\u22121 and the history Hi\u22121. The \ufb01nal loss for a sequence of events is simply obtained by summing up the loss for each event: L = \u2211i Epi\u223cPi(\u03b8(\u03c4\u2217i ))[H[p\u2217i , pi]].\nFast computation. In order to have an ef\ufb01cient computation of the uncertainty cross-entropy, we propose closed-form expressions. (1) Closed-form loss for Dirichlet. Given that the observed class ci is one-hot encoded by p\u2217i , the uncertainty loss can be computed in closed form for the Dirichlet:\n\nLUCEi = Epi(\u03c4\u2217i )\u223cDir(\u03b1(\u03c4\u2217i ))[log pci (\u03c4\u2217i )] = \u03a8(\u03b1ci (\u03c4\u2217i )) \u2212 \u03a8(\u03b10(\u03c4\u2217i ))\n\n(5)\n\nwhere \u03a8 denotes the digamma function and \u03b10(\u03c4\u2217i ) = \u2211Cc \u03b1c(\u03c4\u2217i ). (2) Loss approximation for GP. For WGP-LN, we approximate LUCEi based on a second-order series expansion (Appendix C):\n\nLUCEi \u2248 \u00b5ci(\u03c4\u2217i ) \u2212 log(\u2211Cc exp(\u00b5c(\u03c4\u2217i ) + \u03c32c(\u03c4\u2217i )/2)) + [\u2211Cc (exp(\u03c32c(\u03c4\u2217i )) \u2212 1) exp(2\u00b5c(\u03c4\u2217i ) + \u03c32c(\u03c4\u2217i ))] / [2 (\u2211Cc exp(\u00b5c(\u03c4\u2217i ) + \u03c32c(\u03c4\u2217i )/2))2]\n\n(6)\n\nNote that we can now fully backpropagate through our loss (and through the models as well), enabling us to train our methods ef\ufb01ciently with automatic differentiation frameworks and, e.g., gradient descent.\nRegularization. While the above loss incorporates uncertainty much better, it is still possible to generate pseudo points with high weight values outside of the observed data regime, giving us predictions with high con\ufb01dence. To eliminate this behaviour we introduce a regularization term rc:\n\nrc = \u03b1 \u222b0T (\u00b5c(\u03c4 ))2 d\u03c4 + \u03b2 \u222b0T (\u03bd \u2212 \u03c32c(\u03c4 ))2 d\u03c4\n\n(7)\n\nwhere the \ufb01rst term pushes the mean to 0 and the second pushes the variance to \u03bd. For the WGP-LN, \u00b5c(\u03c4 ) and \u03c32c(\u03c4 ) correspond to the mean and the variance of the class logits, which are pushed to prior values of 0 and \u03bd. For the FD-Dir, \u00b5c(\u03c4 ) and \u03c32c(\u03c4 ) correspond to the mean and the variance of the class probabilities, where the regularizer on the mean can actually be neglected because of the prior \u03bd introduced in the function decomposition (Eq. 3). In experiments, \u03bd is set to 1 for WGP-LN and to (C \u2212 1)/(C2(C + 1)) for FD-Dir, which is the variance of the classic Dirichlet prior with concentration parameters equal to 1. For both models, this regularizer forces high uncertainty on the interval (0, T ). 
In practice, the integrals can be estimated with Monte-Carlo sampling, whereas \u03b1 and \u03b2 are hyperparameters which are tuned on a validation set.\nIn [16], to train models capable of uncertain predictions, another dataset or a generative model to access out-of-observed-distribution samples is required. In contrast, our regularizer suggests a simple way to consider out-of-distribution data which does not require another model or dataset.\n\n3 Point Process Framework\n\nOur models FD-Dir and WGP-LN predict P (\u03b8(\u03c4 )), enabling us to evaluate, e.g., p after a speci\ufb01c time gap \u03c4. This corresponds to a conditional distribution q(c|\u03c4 ) := pc(\u03c4 ) over the classes. In this section, we introduce a point process framework to generalize FD-Dir to also predict the time distribution q(\u03c4 ). This enables us to predict, e.g., the most likely time the next event is expected, or to evaluate the joint distribution q(c|\u03c4 ) \u00b7 q(\u03c4 ). We call the model FD-Dir-PP.\nWe modify the model so that each class c is modelled using an inhomogeneous Poisson point process with positive locally integrable intensity function \u03bbc(\u03c4 ). Instead of generating parameters \u03b8(\u03c4 ) = (\u03b11(\u03c4 ), ..., \u03b1C(\u03c4 )) by function decomposition, FD-Dir-PP generates intensity parameters over time: log \u03bbc(\u03c4 ) = \u2211Mj=1 w(c)j N (\u03c4 | \u03c4 (c)j, \u03c3(c)j) + \u03bd. The main advantage of such a general decomposition is its potential to describe complex multimodal intensity functions, contrary to other models like RMTPP [6] (Appendix D). Since the concentration parameter \u03b1c(\u03c4 ) and the intensity parameter \u03bbc(\u03c4 ) both relate to the number of events of class c around time \u03c4, it is natural to convert one to the other.\nGiven this C-multivariate point process, the probability of the next class given time and the probability of the next event time are q(c|\u03c4 ) = \u03bbc(\u03c4 )/\u03bb0(\u03c4 ) and q(\u03c4 ) = \u03bb0(\u03c4 ) e\u2212\u222b0\u03c4 \u03bb0(s)ds, where \u03bb0(\u03c4 ) = \u2211Cc=1 \u03bbc(\u03c4 ).\n\nSince the classes are now modelled via a point process, the log-likelihood of the event ei = (ci, \u03c4\u2217i ) is:\n\nlog q(ci, \u03c4\u2217i ) = log q(ci|\u03c4\u2217i ) + log q(\u03c4\u2217i ) = log(\u03bbci(\u03c4\u2217i )/\u03bb0(\u03c4\u2217i )) [term (i)] + log \u03bb0(\u03c4\u2217i ) [term (ii)] \u2212 \u222b0\u03c4\u2217i \u03bb0(t) dt [term (iii)]\n\n(8)\n\nThe terms (ii) and (iii) act like a regularizer on the intensities by penalizing a large cumulative intensity \u03bb0(\u03c4 ) on the time interval [ti\u22121, ti] where no events occurred. The term (i) is the standard cross-entropy loss at time \u03c4i. Equivalently, by modeling the distribution Dir(\u03bb1(\u03c4 ), .., \u03bbC(\u03c4 )), we see that term (i) is equal to LCEi (see Section 2.3). Using this insight, we obtain our \ufb01nal FD-Dir-PP model: we achieve uncertainty on the class prediction by modeling \u03bbc(\u03c4 ) as the concentration parameters of a Dirichlet distribution and train the model with the loss of Eq. 8, replacing term (i) by LUCEi. 
As it becomes apparent, FD-Dir-PP differs from FD-Dir only in the regularization of the loss function, enabling it to be interpreted as a point process.\n\n4 Related Work\n\nPredictions based on discrete sequences of events, regardless of time, can be modelled by Markov models [2] or RNNs, usually with their more advanced variants like LSTMs [11] and GRUs [5]. To exploit the time information, some models [15, 19] additionally take time as an input but still output a single prediction for the entire future. In contrast, the temporal point process framework de\ufb01nes an intensity function that describes the rate of events occurring over time.\nRMTPP [6] uses an RNN to encode the event history into a vector that de\ufb01nes an exponential intensity function. Hence, it is able to capture complex past dependencies and model distributions resulting from simple point processes, such as Hawkes [10] or self-correcting [12], but not e.g. multimodal distributions. On the other hand, the Neural Hawkes Process [18] uses a continuous-time LSTM which allows specifying more complex intensity functions. The likelihood evaluation is then no longer in closed form, but requires Monte Carlo integration. However, these approaches, unlike our models, do not provide any uncertainty in the predictions. In addition, WGP-LN and FD-Dir can be extended with a point process framework while having the expressive power to represent complex time evolutions.\nUncertainty in machine learning has attracted great interest [9, 8, 14]. For example, uncertainty can be imposed by introducing distributions over the weights [3, 17, 20]. Simpler approaches introduce uncertainty directly on the class prediction by using a Dirichlet distribution independent of time [16, 21]. 
In contrast, the FD-Dir model captures the complex temporal evolution of the Dirichlet distribution via function decomposition, which can be adapted to have a point process interpretation.\nOther methods introduce uncertainty into time series prediction by learning a state space model with Gaussian processes [7, 23]. Alternatively, an RNN architecture has been used to model the probability density function over time [25]. Compared to these models, the WGP-LN model uses both Gaussian processes and an RNN to model uncertainty and time. Our models are based on pseudo points. Pseudo points in a GP have been used to reduce the computational complexity [22]. Our goal is not to speed up the computation, since we control the number of points that are generated, but to give them different importance. In [24] a weighted GP has been considered by rescaling points; in contrast, our model uses a custom kernel to discard (pseudo) points.\n\n5 Experiments\n\nWe evaluate our models on large-scale synthetic and real-world data. We compare to neural point process models: RMTPP [6] and the Neural Hawkes Process [18]. Additionally, we use various RNN models with knowledge of the time of the next event. We measure the accuracy of class prediction, the accuracy of time prediction, and evaluate on an anomaly detection task to show prediction uncertainty. We split the data into train, validation and test sets (60%\u201320%\u201320%) and tune all models on the validation set using grid search over learning rate, hidden state dimension and L2 regularization. After running the models on all datasets 5 times we report the mean and standard deviation of test set accuracy. 
Figure 5: Visualization of the prediction evolution (panels: FD-Dir on 3-G, WGP-LN on 3-G, FD-Dir on Multi-G, WGP-LN on Multi-G; top row: pc(\u03c4 ) or logitc(\u03c4 ), bottom row: qc(\u03c4 )). The red line indicates the true time of the next event for an example sequence. Here, both models predict the orange class, which is correct, and capture the variation of the class distributions over time. Generated points from WGP-LN are plotted with the size corresponding to the weight. For predictions in the far future, both models give high uncertainty.\n\nDetails 
The code and further supplementary material are available online.2
We use the following data (more details in Appendix G): (1) Graph. We generate data from a directed Erdős–Rényi graph where nodes represent the states and edges the weighted transitions between them. The time it takes to cross one edge is modelled with one normal distribution per edge. By randomly walking along this graph we created 10K asynchronous events with 10 unique classes. (2) Stack Exchange.3 Sequences contain rewards as events that users get for participation on a question answering website. After preprocessing according to [6] we have 40 classes and over 480K events spread over 2 years of activity of around 6700 users. The goal is to predict the next reward a user will receive. (3) Smart Home [1].4 We use a recorded sequence from a smart house with 14 classes and over 1000 events. Events correspond to the usage of different appliances. The next event will depend on the time of the day, the history of usage and the other appliances. (4) Car Indicators. We obtained a sequence of events from a car's indicators with around 4000 events and 12 unique classes. The sequence is highly asynchronous, with τ ranging from milliseconds to minutes.
Visualization. To analyze the behaviour of the models, we propose visualizations of the evolution of the parameters predicted by FD-Dir and WGP-LN.
Set-up: We use two toy datasets where the probability of an event depends only on time. The first one (3-G) has three classes occurring at three distinct times. It represents the events in Fig. 13a. The second one (Multi-G) consists of two classes where one of them has two modes and corresponds to Fig. 1a. We use these datasets to showcase the importance of time when predicting the next event. In Fig. 5, the four top plots show the evolution of the categorical distribution for the FD-Dir and the logits for the WGP-LN with 10 points each.
The four bottom plots describe the certainty of the models on the probability prediction by plotting the probability qc(τ) that the probability of class c is higher than the others, as introduced in Sec. 2. Additionally, the evolution of the Dirichlet distribution over the probability simplex is presented in Appendix E.
Results. Both models learn meaningful evolutions of the distribution on the simplex. For the 3-G data, we can distinguish four areas: the first three correspond to the three classes; after that the prediction is uncertain. The Multi-G data shows that both models are able to approximate multimodal evolutions.

Class prediction accuracy. The aim of this experiment is to assess whether our models can correctly predict the class of the next event, given the time at which it occurs. For this purpose, we compare our models against the neural Hawkes process and RMTPP and evaluate prediction accuracy on the test set.
Results. We can see (Fig. 6) that our models consistently outperform the other methods on all datasets. Results of the other baselines can be found in Appendix H.2.
Time-Error evaluation. Next, we aim to assess the quality of the time intervals at which we have confidence in one class.
Even though WGP-LN and the FD-Dir do not model a distribution on time, they still have intervals at which we are certain in a class prediction, making the conditional probability a good indicator of the time of occurrence of the event.

Figure 6: Class accuracy (top; higher is better) and Time-Error (bottom; lower is better).

Figure 7: AUROC and AUPR comparison across datasets on anomaly detection. The orange and blue bars use the categorical uncertainty score whereas the green bars use distributional uncertainty.

Set-up. While models predicting a single time $\hat{\tau}_i$ for the next event often use the MSE score $\frac{1}{n}\sum_{i=1}^{n}(\hat{\tau}_i - \tau_i^*)^2$, in our case the MSE is not suitable since one event can occur at multiple time points. In the conventional least-squares approach, the mean of the true distribution is an optimal prediction; however, here it is almost always wrong. Therefore, we use another metric which is better suited for multimodal distributions. Assume that a model returns a score function $g_i^{(c)}(\tau)$ for each class regarding the next event $i$, where a large value means the class $c$ is likely to occur at time $\tau$. We define $\text{Time-Error} = \frac{1}{n}\sum_{i=1}^{n}\int \mathbf{1}\left[g_i^{(c)}(\tau) \geq g_i^{(c)}(\tau_i^*)\right] \mathrm{d}\tau$. The Time-Error computes the size of the time intervals where the predicted score is larger than the score of the observed time $\tau_i^*$. Hence, a performant model would achieve a low Time-Error if its score function $g_i^{(c)}(\tau)$ is high at time $\tau_i^*$. As the score function in our models, we use the corresponding class probability $\bar{p}_{ic}(\tau)$.
Results.
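As a concrete illustration, the Time-Error above can be approximated numerically by discretizing the time axis; the Gaussian score function below is an invented example, not one of the paper's models.

```python
import numpy as np

def time_error(score_fns, true_times, t_max=3.0, n_grid=1000):
    """Average, over events i, of the measure of {tau : g_i(tau) >= g_i(tau_i*)},
    approximated on a uniform grid over [0, t_max]."""
    grid = np.linspace(0.0, t_max, n_grid)
    dt = grid[1] - grid[0]
    total = 0.0
    for g, t_star in zip(score_fns, true_times):
        # Count grid points whose score beats the score at the observed time.
        total += np.sum(g(grid) >= g(t_star)) * dt
    return total / len(score_fns)

# Toy score function peaked at tau = 1: observing the event near the peak
# should give a small Time-Error, observing it far from the peak a large one.
peak = lambda tau: np.exp(-(np.asarray(tau) - 1.0) ** 2)
good = time_error([peak], [1.0])  # event observed where the score is highest
bad = time_error([peak], [2.9])   # event observed where the score is low
```

The metric is invariant to monotone rescalings of the score, which is what makes it applicable both to class probabilities and to point-process intensities.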
We can see that our models clearly obtain the best results on all datasets. The point process version of FD-Dir does not improve the performance. Thus, taking also into account the class prediction performance, we recommend using our other two models. In Appendix H.3 we compare FD-Dir-PP with other neural point process models on time prediction using the MSE score and achieve similar results.
Anomaly detection & Uncertainty. The goal of this experiment is twofold: (1) it assesses the ability of the models to detect anomalies in asynchronous sequences, (2) it evaluates the quality of the predicted uncertainty on the categorical distribution. For this, we use a similar set-up as [16].
Set-up: The experiments consist of introducing anomalies into the datasets by changing the occurrence time of 10% of the events (at random, after the time transformation described in Appendix G). Hence, the anomalies form out-of-distribution data, whereas unchanged events represent in-distribution data. The performance of the anomaly detection is assessed using the Area Under the Receiver Operating Characteristic (AUROC) and the Area Under the Precision-Recall curve (AUPR). We use two approaches: (i) We consider the categorical uncertainty on p̄(τ), i.e., to detect anomalies we use the predicted probability of the true event as the anomaly score. (ii) We use the distribution uncertainty at the observed occurrence time provided by our models. For WGP-LN, we can evaluate qc(τ) directly (difference of two normal distributions). For FD-Dir, this probability does not have a closed-form solution, so instead we use the concentration parameters, which are also indicators of out-of-distribution events. For all scores, i.e. p̄c(τ), qc(τ) and αc(τ), a low value indicates a potential anomaly around time τ.
Results. As seen in Fig. 7, the FD-Dir and the WGP-LN have particularly good performance.
We observe that the FD-Dir gives better results, especially with distributional uncertainty. This might be due to the power of the concentration parameters, which can be viewed as the number of similar events around a given time.
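As an illustration of the evaluation in approach (i), the sketch below scores events by a (synthetic) predicted probability of the true event and computes AUROC with a rank-based formula; the data and separation here are invented, not the paper's results.

```python
import numpy as np

def auroc(scores, is_anomaly):
    """AUROC of low-score-means-anomaly: the probability that a randomly drawn
    anomalous event receives a lower score than a randomly drawn normal event,
    counting ties as 1/2 (rank-based definition)."""
    scores = np.asarray(scores, dtype=float)
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    normal, anom = scores[~is_anomaly], scores[is_anomaly]
    comp = anom[:, None] < normal[None, :]
    ties = anom[:, None] == normal[None, :]
    return float((comp + 0.5 * ties).mean())

# Toy data mimicking the set-up: 10% perturbed events that tend to receive a
# lower predicted probability of the true class than in-distribution events.
rng = np.random.default_rng(0)
p_true = np.concatenate([rng.uniform(0.6, 1.0, 90),   # in-distribution
                         rng.uniform(0.0, 0.5, 10)])  # perturbed (anomalies)
labels = np.concatenate([np.zeros(90, dtype=bool), np.ones(10, dtype=bool)])
score = auroc(p_true, labels)  # near 1.0 when the score separates the groups
```

The same function applies unchanged to the distributional scores qc(τ) or αc(τ), since only the ranking of the scores matters for AUROC.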
6 Conclusion

We proposed two new methods to predict the evolution of the probability of the next event in asynchronous sequences, including the distributions' uncertainty. Both methods follow a common framework consisting of generating pseudo points able to describe rich multimodal time-dependent parameters for the distribution over the probability simplex. The complex evolution is captured via a Gaussian process or a function decomposition, respectively, while still enabling easy training. We also provided an extension and interpretation within a point process framework. In the experiments, WGP-LN and FD-Dir have clearly outperformed state-of-the-art models based on point processes, for event and time prediction as well as for anomaly detection.

Acknowledgement

This research was supported by the German Federal Ministry of Education and Research (BMBF), grant no. 01IS18036B, and by the BMW AG. The authors would like to thank Bernhard Schlegel for helpful discussion and comments. The authors of this work take full responsibility for its content.

References

[1] Activity Recognition in Pervasive Intelligent Environments, Atlantis Ambient and Pervasive Intelligence series, chapter 8. Atlantis Press / Springer, 2010.

[2] Ron Begleiter, Ran El-Yaniv, and Golan Yona. On prediction using variable order Markov models. J. Artif. Int. Res., 22(1):385–421, December 2004.

[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra.
Weight Uncertainty in Neural Networks. arXiv e-prints, arXiv:1505.05424, May 2015.

[4] Craig Calcaterra and Axel Boldt. Approximating with Gaussians. arXiv e-prints, arXiv:0805.3795, 2008.

[5] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[6] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In KDD. ACM, 2016.

[7] Stefanos Eleftheriadis, Tom Nicholson, Marc Deisenroth, and James Hensman. Identification of Gaussian process state space models. In NIPS. Curran Associates, Inc., 2017.

[8] Dhivya Eswaran, Stephan Günnemann, and Christos Faloutsos. The Power of Certainty: A Dirichlet-Multinomial Model for Belief Propagation, pages 144–152. 2017.

[9] Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian Recurrent Neural Networks. arXiv e-prints, arXiv:1704.02798, April 2017.

[10] Alan G. Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[12] Valerie Isham and Mark Westcott. A self-correcting point process. Stochastic Processes and Their Applications, 8(3):335–347, 1979.

[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv e-prints, arXiv:1612.01474, December 2016.

[15] Yang Li, Nan Du, and Samy Bengio.
Time-dependent representation for neural event sequence prediction, 2018.

[16] Andrey Malinin and Mark Gales. Predictive Uncertainty Estimation via Prior Networks. arXiv e-prints, arXiv:1802.10501, February 2018.

[17] Patrick L. McDermott and Christopher K. Wikle. Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data. arXiv e-prints, arXiv:1711.00636, November 2017.

[18] Hongyuan Mei and Jason M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, 2017.

[19] Daniel Neil, Michael Pfeiffer, and Shih-Chii Liu. Phased LSTM: Accelerating recurrent network training for long or event-based sequences. CoRR, abs/1610.09513, 2016.

[20] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018.

[21] Peter Sadowski and Pierre Baldi. Neural network regression with beta, Dirichlet, and Dirichlet-multinomial outputs, 2019.

[22] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, 2006.

[23] Ryan Turner, Marc Deisenroth, and Carl Rasmussen. State-space inference and learning with Gaussian processes. In Yee Whye Teh and Mike Titterington, editors, AISTATS, 2010.

[24] Junfeng Wen, Negar Hassanpour, and Russell Greiner. Weighted Gaussian process for estimating treatment effect. 2018.

[25] Kyongmin Yeo, Igor Melnyk, Nam Nguyen, and Eun Kyung Lee.
Learning temporal evolution of probability distribution with recurrent neural network, 2018.