{"title": "Structured Variational Inference in Continuous Cox Process Models", "book": "Advances in Neural Information Processing Systems", "page_first": 12458, "page_last": 12468, "abstract": "We propose a scalable framework for inference in a continuous sigmoidal Cox process that assumes the corresponding intensity function is given by a Gaussian process (GP) prior transformed with a scaled logistic sigmoid function. We present a tractable representation of the likelihood through augmentation with a superposition of Poisson processes. This view enables a structured variational approximation capturing dependencies across variables in the model. Our framework avoids discretization of the domain, does not require accurate numerical integration over the input space and is not limited to GPs with squared exponential kernels. We evaluate our approach on synthetic and real-world data showing that its benefits are particularly pronounced on multivariate input settings where it overcomes the limitations of mean-field methods and sampling schemes. We provide the state of-the-art in terms of speed, accuracy and uncertainty quantification trade-offs.", "full_text": "Structured Variational Inference in\n\nContinuous Cox Process Models\n\nVirginia Aglietti\n\nUniversity of Warwick\nThe Alan Turing Institute\n\nV.Aglietti@warwick.ac.uk\n\nTheodoros Damoulas\nUniversity of Warwick\nThe Alan Turing Institute\n\nT.Damoulas@warwick.ac.uk\n\nEdwin V. Bonilla\nCSIRO\u2019s Data61\n\nEdwin.Bonilla@data61.csiro.au\n\nSally Cripps\n\nCentre for Translational Data Science\n\nThe University of Sydney\n\nSally.Cripps@sydney.edu.au\n\nAbstract\n\nWe propose a scalable framework for inference in a continuous sigmoidal Cox\nprocess that assumes the corresponding intensity function is given by a Gaussian\nprocess (GP) prior transformed with a scaled logistic sigmoid function. 
We present a tractable representation of the likelihood through augmentation with a superposition of Poisson processes. This view enables a structured variational approximation capturing dependencies across variables in the model. Our framework avoids discretization of the domain, does not require accurate numerical integration over the input space and is not limited to GPs with squared exponential kernels. We evaluate our approach on synthetic and real-world data, showing that its benefits are particularly pronounced in multivariate input settings, where it overcomes the limitations of mean-field methods and sampling schemes. We provide the state of the art in terms of speed, accuracy and uncertainty quantification trade-offs.

1 Introduction

Point processes have been used effectively to model a variety of event data such as occurrences of diseases [9, 19], locations of earthquakes [21] or crime events [2, 11]. The most commonly adopted class of models for such discrete data are non-homogeneous Poisson processes and, in particular, Cox processes [6]. In these, the observed events are assumed to be generated from a Poisson point process (PPP) whose intensity is stochastic, enabling non-parametric inference and uncertainty quantification. Gaussian processes [GPs; 25] form a flexible prior over functions and, therefore, have been used to model the intensity of a Cox process via a non-linear positive link function. Typical mappings are the exponential [9, 22], the square [17, 19] and the sigmoidal [1, 10, 12] transformations. In general, inferring the intensity function over a continuous input space X is highly problematic as it requires integrating an infinite-dimensional random function. This integral is generally intractable and, depending on the transformation used, different algorithms have been proposed to deal with this issue. 
For example, under the exponential transformation, a regular computational grid is commonly introduced [9]. While this significantly simplifies inference, it leads to poor approximations, especially in high-dimensional settings. Increasing the resolution of the grid to improve the approximation yields computationally prohibitive algorithms that do not scale, highlighting the well-known trade-off between statistical performance and computational cost.

Other algorithms have been proposed to deal with a continuous X, but they are computationally expensive [1, 12], are limited to simple covariance functions [19], require accurate numerical integration over the domain [10] or do not account for the model dependencies in the posterior distribution [10]. In this paper we propose an inference framework that addresses all of these modeling and inference limitations by having a tractable representation of the likelihood via augmentation with a superposition of PPPs. 

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Table 1: Summary of related work. X represents the input space, with ∫ and Σ denoting continuous and discrete models respectively. O gives the time complexity of the algorithm. M represents the number of thinned points derived from the thinning [16] of a PPP. K indicates the number of inducing inputs. STVB denotes our approach.

| Method             | Inference | O        | X | λ(x)       |
| STVB (ours)        | SVI       | K³       | ∫ | λ⋆σ(f(x))  |
| LGCP [22]          | MCMC      | N³       | Σ | exp(f(x))  |
| SGCP [1]           | MCMC      | (N + M)³ | ∫ | λ⋆σ(f(x))  |
| Gunter et al. [12] | MCMC      | (N + M)³ | ∫ | λ⋆σ(f(x))  |
| VBPP [19]          | VI-MF     | N K²     | ∫ | (f(x))²    |
| Lian et al. [17]   | VI-MF     | N K²     | Σ | (f(x))²    |
| MFVB [10]          | VI-MF     | N K²     | ∫ | λ⋆σ(f(x))  |

This enables a scalable structured variational inference (SVI) algorithm in the continuous space directly, where the approximate posterior distribution incorporates dependencies between the variables of interest. Our specific contributions are as follows.

Scalable inference in continuous input spaces: The augmentation of the input space via a process superposition view allows us to develop a scalable variational inference algorithm that does not require discretization or accurate numerical integration. With this view, we obtain a joint distribution that is readily normalized, providing a natural regularization over the latent variables in our model.

Efficient structured posterior estimation: We estimate a joint posterior that captures the complex variable dependencies in the model while being significantly faster than sampling approaches.

State-of-the-art performance: Our experimental evaluation shows the benefits of our approach when compared to state-of-the-art inference schemes, link functions, augmentation schemes and representations of the input space X.

1.1 Related work

GP-modulated Poisson point processes are the gold standard for modeling event data. Performing inference in these models is challenging due to the need to integrate an infinite-dimensional random function over X. Under the exponential transformation, inference has typically required discretization, where the domain is gridded and the intensity function is assumed to be constant over each grid cell [4, 7, 9, 22]. Alternatively, Lasko [15] also considers an exponential link function and performs inference over a renewal process, resorting to numerical integration within a computationally expensive sampling scheme. These methods suffer from poor scaling with the dimensionality of X and sensitivity to the choice of discretization or numerical integration technique. 
Several approaches have been proposed to deal with inference in the continuous domain directly, by using alternative transformations along with additional modeling assumptions and computational tricks, or by constraining the GP [20]. One of these alternative transformations is the squared mapping, as developed in the Permanental process [13, 17–19, 28]. Although the square transformation enables analytical computation of the required integrals over X, this only holds for certain standard types of kernels such as the squared exponential. In addition, Permanental processes suffer from important identifiability issues such as reflection invariance¹ and lead to models with "nodal lines" [13].

Another transformation is the scaled logistic sigmoid function proposed by [1], which achieves tractability by augmenting the input space via thinning [16]; thinning can be seen as a point-process variant of rejection sampling. This model is known as the sigmoidal Gaussian Cox process (SGCP). Their proposed inference algorithm is based on Markov chain Monte Carlo (MCMC), which enables drawing 'exact' samples from the posterior intensity. However, as acknowledged by the authors, it has significant computational demands, making it infeasible for large datasets. As an extension to this work, [12] introduce the concept of "adaptive thinning" and propose an expensive MCMC scheme which scales as O(N³). More recently, [10] introduced a neat double augmentation scheme for the SGCP which enables closed-form updates using a mean-field approximation (VI-MF). 
However, it requires accurate numerical integration over X, which makes the performance of the algorithm highly dependent on the number of integration points.

In this work, we overcome the limitations of the aforementioned VI-MF and MCMC schemes by proposing an SVI framework, henceforth STVB, which takes into account the complex posterior dependencies while being scalable and thus applicable to high-dimensional real-world settings. To the best of our knowledge, we are the first to propose a fast structured variational inference framework for GP-modulated point process models. See Tab. 1 for a summary of the most relevant related work.

¹With reflection invariance we refer to the invariance of the intensity function with respect to the sign change of the GP used to model it.

2 Model formulation

We consider learning problems where we are given a dataset of N events D = {x_n}_{n=1}^N, where x_n is a D-dimensional vector in the compact space X ⊂ R^D. We aim at modeling these data via a PPP, inferring its underlying intensity function λ(x) : X → R⁺ and making probabilistic predictions.

2.1 Sigmoidal Gaussian Cox process

Consider a realization ξ = (N, {x_1, ..., x_N}) of a PPP on X, where the points {x_1, ..., x_N} are treated as indistinguishable apart from their locations [8]. Conditioned on λ(x), the Cox process likelihood function evaluated at ξ can be written as:

L(ξ|λ(x)) = exp(−∫_X λ(x) dx) ∏_{n=1}^N λ(x_n),    (1)

where the intensity is given by λ(x) = λ⋆σ(f(x)), with λ⋆ > 0 being an upper bound on λ(x) with prior distribution p(λ⋆), σ(·) denoting the logistic sigmoid function, and f drawn from a zero-mean GP prior with covariance function κ(x, x′; θ) and hyperparameters θ, i.e. 
f|θ ∼ GP(0, κ(x, x′; θ)). We will refer to this joint model as the sigmoidal Gaussian Cox process (SGCP).

Notice that, when considering the tuple (x_1, ..., x_N) instead of the set {x_1, ..., x_N}, and thus the event ξ_0 = (N, (x_1, ..., x_N)), the likelihood function is given by L(ξ_0|λ(x)) = L(ξ|λ(x))/N!. There are indeed N! permutations of the events {x_1, ..., x_N} giving the same point process realization. When the set {x_1, ..., x_N} is known, considering L(ξ|λ(x)) or L(ξ_0|λ(x)) does not affect the inference procedure. The same holds for MCMC algorithms inferring the event locations: in this case, the factorial term disappears in the computation of the acceptance ratio. However, as we shall see later, when the event locations are latent variables in a model and inference proceeds via a variational approximation, the difference between the two likelihoods is essential. Indeed, while L(ξ_0|λ(x)) is normalized with respect to N, one must be cautious when integrating the likelihood in Eq. (1) over sets and bring back the missing N! factor so as to obtain a proper discrete probability mass function for N.

As it turns out, inference in the SGCP is doubly intractable, as it requires solving the integral in Eq. (1) and computing the intractable posterior distribution for the latent function at the N event locations and the bounding intensity, i.e. p(f_N, λ⋆|{x_n}_{n=1}^N), which in turn requires computing the marginal likelihood. One way to avoid the first source of intractability (the integral in Eq. (1)) is through augmentation of the input space [1, 10], a procedure that introduces precisely those latent (event) variables that require explicit normalization during variational inference. 
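To make the first source of intractability concrete, the sketch below evaluates the log of Eq. (1) for one fixed draw of f on a 1D domain, brute-forcing the integral on a quadrature grid — exactly the kind of numerical integration the proposed framework is designed to avoid. The helper names and the stand-in f are illustrative, not part of the paper's code.

```python
import numpy as np

def log_sigmoid(z):
    # numerically stable log(sigma(z)) = -log(1 + exp(-z))
    return -np.logaddexp(0.0, -z)

def cox_log_likelihood(event_f, lam_star, grid, grid_f):
    """Log of Eq. (1) for a fixed latent function f (hypothetical helper).

    event_f : f evaluated at the observed locations x_n
    grid, grid_f : quadrature nodes covering X and f evaluated on them
    """
    # sum_n log lambda(x_n), with lambda(x) = lam_star * sigma(f(x))
    point_term = np.sum(np.log(lam_star) + log_sigmoid(event_f))
    # trapezoidal approximation of int_X lambda(x) dx
    lam_grid = lam_star / (1.0 + np.exp(-grid_f))
    integral = np.sum(0.5 * (lam_grid[1:] + lam_grid[:-1]) * np.diff(grid))
    return point_term - integral

f = lambda x: np.sin(x)                 # stand-in for a GP draw
grid = np.linspace(0.0, 10.0, 2001)
events = np.array([1.2, 1.5, 2.0, 7.9])
ll = cox_log_likelihood(f(events), lam_star=5.0, grid=grid, grid_f=f(grid))
```

The grid-based integral is the term whose cost and accuracy degrade with the dimension of X, which is what motivates the augmentation described next.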
We will describe below a process superposition view of this augmentation scheme that allows us to define a proper distribution over the joint space of observed and latent variables and carry out posterior estimation via variational inference. By superimposing two PPPs with opposite intensities we obtain a homogeneous PPP and thus avoid the integration of the GP over X, while reducing the integral in Eq. (1) to the computation of the measure of the input space ∫_X dx.

2.2 Augmentation via superposition

A very useful property of independent PPPs is that their superposition, defined as the combination of the events from two processes into a single one, is a PPP. Consider two PPPs with intensities λ(x) and ν(x) and realizations (N, {x_1, ..., x_N}) and (M, {y_1, ..., y_M}) respectively. The combined event ξ_R = (R = N + M, {v_1, ..., v_R}) is a realization of a PPP with intensity λ(x) + ν(x), where knowledge of which points originated from which process is assumed lost.

Figure 1: Plate diagram representing the posterior distribution accounting for all model dependencies. In our variational posterior (Eq. (6)) we drop the dependency represented by the dashed line.

The likelihood L(ξ_R|λ(x), ν(x)) can thus be written as:

L(ξ_R|λ(x), ν(x)) = Σ_{N=0}^{R} binom(N+M, N)^{-1} Σ_{P_N ∈ 𝒫_N} [ (exp(−∫_X λ(x)dx)/N!) ∏_{r∈P_N} λ(r) × (exp(−∫_X ν(x)dx)/M!) ∏_{r∈P_N^c} ν(r) ],    (2)

where 𝒫_N denotes the collection of all possible partitions of size N, P_N represents an element of 𝒫_N and P_N^c is its complement.

Consider now R = N + M to be the total number of events resulting from thinning [16], where N is the number of observed events while M is the number of latent events with stochastic locations y_1, ..., y_M. We assume that the probability of observing an event is given by σ(f(x)), while the probability for the event to be latent is σ(−f(x)). In addition, let λ⋆∫_X dx be the expected total number of events. We can see the realization (N + M, (x_1, ..., x_N, y_1, ..., y_M)) as the result of the superposition of two PPPs with intensities λ(x) = λ⋆σ(f(x)) and ν(x) = λ⋆σ(−f(x)). Differently from the standard superposition, we do know which events are observed and which are latent. In writing the likelihood for (N + M, {x_1, ..., x_N, y_1, ..., y_M}) we thus do not need to consider all the possible partitions of size N. We can write L_{N+M} := L(N + M, (x_1, ..., x_N, y_1, ..., y_M)) as:

L_{N+M} = (exp(−∫_X λ(x)dx)/N!) ∏_{n=1}^N λ(x_n) × (exp(−∫_X ν(x)dx)/M!) ∏_{m=1}^M ν(y_m)    (3)
        = (1/(N!M!)) exp(−λ⋆∫_X dx) (λ⋆)^{N+M} ∏_{n=1}^N σ(f(x_n)) ∏_{m=1}^M σ(−f(y_m)).    (4)

The augmentation via superposition offers a different view on the thinning procedure proposed in Adams et al. [1]. However, there is a crucial difference between Eq. (4) and the usual likelihood considered in the SGCP [1]. Eq. (4) represents a distribution over tuples and thus, as mentioned above, is properly normalized. In addition, it makes a distinction between the observed and latent events and is thus different from Eq. (1) written for the tuple (N + M, {x_1, ..., x_N, y_1, ..., y_M}). 
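The thinning view above also gives a direct recipe for simulating SGCP data: draw a homogeneous PPP with rate λ⋆ on X, then mark each point as observed with probability σ(f(x)) and as latent otherwise, so that the two marked sets are exactly the superposed PPPs with intensities λ⋆σ(f) and λ⋆σ(−f). A minimal 1D sketch (function name and parameters are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sgcp_1d(f, lam_star, a, b, rng):
    """Simulate one SGCP realization on X = [a, b] by thinning (sketch).

    Returns (observed, latent): realizations of the two superposed PPPs
    with intensities lam_star*sigma(f) and lam_star*sigma(-f).
    """
    # total events of the homogeneous PPP with rate lam_star on [a, b]
    total = rng.poisson(lam_star * (b - a))
    points = rng.uniform(a, b, size=total)
    # keep each point as "observed" with probability sigma(f(x))
    keep = rng.uniform(size=total) < 1.0 / (1.0 + np.exp(-f(points)))
    return points[keep], points[~keep]

obs, lat = sample_sgcp_1d(lambda x: np.sin(x), lam_star=20.0,
                          a=0.0, b=10.0, rng=rng)
```

By construction the observed and latent sets together recover the homogeneous process, which is why the integral of the GP over X drops out of Eq. (4).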
We can write the full joint distribution L⁺_{N+M} := L({x_n}_{n=1}^N, {y_m}_{m=1}^M, M, f, λ⋆|X, θ) as:

L⁺_{N+M} = ((λ⋆)^{N+M} exp(−λ⋆∫_X dx) / (N!M!)) ∏_{n=1}^N σ(f(x_n)) ∏_{m=1}^M σ(−f(y_m)) × p(f) × p(λ⋆),    (5)

where p(f) := p(f_{N+M}) denotes the joint prior at both {x_n}_{n=1}^N and {y_m}_{m=1}^M, and p(λ⋆) denotes the prior over the upper bound of λ(x). We consider p(λ⋆) = Gamma(a, b) and set a and b so that λ⋆ has mean and standard deviation equal to 2× and 1× the intensity we would expect from a homogeneous PPP on X. Eq. (5) represents the joint distribution of the data and the variables in the model. Estimating their posterior distributions requires computing the marginal likelihood by integrating out all variables in Eq. (5). This is generally intractable, and in §3 we perform inference via a variational approximation which maximises a lower bound on the marginal likelihood, the so-called evidence lower bound (ELBO).

Figure 2: Qualitative results on synthetic data. Solid colored lines denote posterior mean intensities while shaded areas are ± one standard deviation.

2.3 Scalability via inducing variables

As in standard GP-modulated models, the introduction of a GP prior poses significant computational challenges during posterior estimation, as inference is dominated by algebraic operations that are cubic in the number of observations. In order to make inference scalable, we follow the inducing-variable approach proposed by [27] and further developed by [3]. To this end, we consider an augmented prior p(f, u) with K underlying inducing variables denoted by u. The corresponding inducing inputs are given by the K × D matrix Z. 
Major computational gains are realized when K ≪ N + M. The augmented prior distributions for the inducing variables and the latent functions are p(u|θ) = N(0, K_ZZ) and p(f|u, θ) = N(K_XZ(K_ZZ)⁻¹u, K_XX − A K_ZX), where A = K_XZ(K_ZZ)⁻¹ and X denotes the (N + M) × D matrix of all event locations {x_n, y_m}_{n=1,m=1}^{N,M}. K_UV is the covariance matrix obtained by evaluating the covariance function at all pairwise columns of the matrices U and V.

3 Structured Variational Inference in the augmented space

Given the joint distribution in Eq. (5), our goal is to estimate the posterior distribution over all latent variables given the data, i.e. p(f, u, M, {y_m}_{m=1}^M, λ⋆|D). This posterior is analytically intractable and we resort to variational inference [14]. Variational inference entails defining an approximate posterior q(f, u, M, {y_m}_{m=1}^M, λ⋆) and optimizing the ELBO with respect to this distribution. In the SGCP, the GP and the latent variables are highly coupled and breaking their dependencies would lead to poor approximations, especially in high-dimensional settings. Fig. 1 shows the structure of a general posterior distribution for the SGCP without any factorisation assumption. We consider an approximate posterior distribution that takes dependencies into account:

Q(f, u, M, {y_m}_{m=1}^M, λ⋆) = p(f|u) q({y_m}_{m=1}^M|M) q(M|f, λ⋆) q(u) q(λ⋆).    (6)

With respect to the general posterior distribution, the only factorisation we impose in Eq. (6) is in the factor q({y_m}_{m=1}^M|M), where we drop the dependency on f; see the dashed line in Fig. 1. We set:

q({y_m}_{m=1}^M|M) = ∏_{m=1}^M Σ_{s=1}^S π_s N_T(μ_s, σ_s²; X),   q(u) = N(m, S),   q(λ⋆) = Gamma(α, β),

where N_T(·; X) denotes a truncated Gaussian distribution on X. 
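As a concrete illustration, latent locations can be drawn from the mixture-of-truncated-Gaussians factor q({y_m}|M) with a simple rejection step that enforces the domain constraint. This is only a sketch with made-up parameters (an implementation could equally use a dedicated truncated-normal sampler such as SciPy's `truncnorm`):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_trunc_mixture(M, pis, mus, sigmas, lo, hi, rng):
    """Draw M latent locations from a mixture of S Gaussians truncated
    to X = [lo, hi] (illustrative rejection sampler)."""
    samples = np.empty(M)
    comp = rng.choice(len(pis), size=M, p=pis)  # mixture component per point
    for i, s in enumerate(comp):
        while True:                             # resample until inside X
            y = rng.normal(mus[s], sigmas[s])
            if lo <= y <= hi:
                samples[i] = y
                break
    return samples

ys = sample_trunc_mixture(50,
                          pis=np.array([0.3, 0.7]),
                          mus=np.array([1.0, 4.0]),
                          sigmas=np.array([0.5, 1.0]),
                          lo=0.0, hi=5.0, rng=rng)
```

Being reparameterizable component-wise, this family keeps gradient estimates low-variance while guaranteeing all latent events fall inside the domain of interest.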
The factorisation assumption between f and {y_m}_{m=1}^M can be relaxed by considering a PPP with intensity λ⋆σ(−f(x)) as the joint variational distribution q(M, {y_m}_{m=1}^M), which is indeed the true posterior distribution for the number of latent events and their locations [8]. However, considering a fully structured posterior distribution significantly increases the computational cost of the algorithm, as it would require sampling from the full posterior in the computation of the ELBO. The mixture of truncated Gaussians provides a flexible and computationally advantageous alternative while satisfying the constraint of being within the domain of interest.

More importantly, we assume q(M|f, λ⋆) = Poisson(η) with η = λ⋆∫_X σ(−f(x))dx. This is indeed the true conditional posterior distribution for the number of latent points; see Proposition 3.7 in [23]. Considering q(M|f, λ⋆) we thus fully account for the dependency structure existing among M, f and λ⋆. Crucially, while in this work we estimate ∫_X σ(−f(x))dx via Monte Carlo, STVB does not require accurate estimation of this term. Indeed, differently from competing techniques [10], where the convergence of the algorithm and the posterior q(f) depend directly on numerical integration, STVB only requires evaluation of the integral during optimisation, but q(f), and thus λ(x), does not directly depend on its value. In other words, the quality of the posterior intensity does not depend directly on the accurate estimation of this integral.

3.1 Evidence Lower Bound

Following standard variational inference arguments, it is straightforward to show that the ELBO decomposes as:

L_elbo = N(ψ(α) − log(β)) − V α/β − log(N!) + E_Q[M log(λ⋆)] [T1] − E_Q[log(M!)] [T2]
       + Σ_{n=1}^N E_{q(f)}[log(σ(f(x_n)))] + E_Q[Σ_{m=1}^M log(σ(−f(y_m)))] [T3]
       − L^u_kl − L^{λ⋆}_kl − L^M_ent [T4] − L^{{y_m}}_ent [T5],    (7)

where V = ∫_X dx, ψ(·) is the digamma function, q(f) = N(Am, K_XX − A K_ZX + A S Aᵀ), and the bracketed labels mark the terms T_i. The terms denoted by T_i, i = 1, ..., 5 cannot be computed analytically. Naïvely, black-box variational inference algorithms could be used to estimate these terms via Monte Carlo, thus sampling from the full variational posterior in Eq. (6). This would require sampling f, λ⋆, M and {y_m}_{m=1}^M, slowing down the algorithm while leading to slow convergence. On the contrary, we exploit the structure of the model and the approximate posterior to simplify these terms and increase the algorithm's efficiency. Denoting μ(f) := ∫_X σ(−f(x))dx, we can write:

T1 = E_{q(λ⋆)}[λ⋆ log(λ⋆)] E_{q(f)}[μ(f)],    T5 = (α/β) E_{q(y_m)}[log q(y_m)] E_{q(f)}[μ(f)],    (8)

T3 = (α/β) E_{q(f)}[μ(f)] E_{q(f)q(y_m)}[log(σ(−f(y_m)))],    (9)

T4 = (α/β) E_{q(f)}[μ(f)(log(μ(f)) − 1)] + E_{q(λ⋆)}[λ⋆ log(λ⋆)] E_{q(f)}[μ(f)] − E_Q[log(M!)].    (10)

Notice how the term −E_Q[log(M!)] in T4, which would require further approximations, appears with opposite sign in T2 (Eq. (7)) and thus cancels out in the computation of the ELBO. 
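For instance, the factor E_{q(λ⋆)}[λ⋆ log λ⋆] appearing in T1 and T4 is available in closed form through a standard Gamma identity, E[λ⋆ log λ⋆] = (α/β)(ψ(α+1) − log β), while E_{q(f)}[μ(f)] can be estimated by Monte Carlo. The sketch below checks the identity by sampling and illustrates the μ(f) estimate with a stand-in draw of f; all names are hypothetical and the numerical digamma is for demonstration only:

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(2)

def digamma(x, h=1e-5):
    # numerical digamma psi(x) via a central difference of log-Gamma (demo only)
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

# E_{q(lambda*)}[lambda* log lambda*] under q(lambda*) = Gamma(alpha, beta)
alpha, beta = 4.0, 2.0
analytic = (alpha / beta) * (digamma(alpha + 1.0) - np.log(beta))

# Monte Carlo check of the same expectation (numpy parameterizes shape/scale)
lam = rng.gamma(alpha, 1.0 / beta, size=500_000)
mc = np.mean(lam * np.log(lam))

# Monte Carlo estimate of mu(f) = int_X sigma(-f(x)) dx on X = [0, T]
T, P = 10.0, 5_000
x = rng.uniform(0.0, T, size=P)
f_draw = np.sin(x)                                  # stand-in for a draw from q(f)
mu_hat = T * np.mean(1.0 / (1.0 + np.exp(f_draw)))  # sigma(-f) = 1 / (1 + e^f)
```

Since the ELBO only needs μ(f) during optimisation, a modest number of Monte Carlo points P suffices, in line with the robustness claim above.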
See the supplement (§1) for the full derivations.

Eqs. (8)–(10) give an expression for L_elbo which avoids sampling from q(M|f, λ⋆) and q({y_m}_{m=1}^M|M) and does not require computing the GP at the stochastic locations of the latent events. The remaining expectations are with respect to reparameterizable distributions. We thus avoid the use of other estimators (such as score function estimators) which would lead to high-variance gradient estimates. Stochastic optimisation techniques can be used to evaluate T3 and Σ_{n=1}^N E_{q(f)}[log(σ(f(x_n)))] in Eq. (7), thus reducing the computational cost by making it independent of M and N. This would reduce the computational complexity of the algorithm to O(K³). However, when the number of inputs used per mini-batch equals N, the time complexity becomes O(NK²). In the following experiments, we show how the proposed structured approach, together with these efficient ELBO computations, leads to higher predictive performance and better uncertainty quantification. The presented results do not exploit the computational gains attainable via stochastic optimisation, so the CPU times and performances are directly comparable across all methods.

4 Experiments

We test our algorithm on three 1D synthetic data settings and on two 2D real-world applications.²

²Code and data for all the experiments are provided at https://github.com/VirgiAgl/STVB.

Baselines We compare against alternative inference schemes, different link functions and a different augmentation scheme. In terms of continuous models, we consider a sampling approach [SGCP, 1], a Permanental point process model [VBPP, 19] and a mean-field approximation based on a Pólya-Gamma augmentation [MFVB, 10]. In addition, we compare against a discrete variational log-Gaussian Cox process model [LGCP, 24]. 
Details are given in the supplement (§3).

Performance measures We test the algorithms by evaluating the l2 norm to the true intensity function (for the synthetic datasets), the test log likelihood (ℓtest) on the test set and the negative log predicted likelihood (NLPL) on the training set. In order to assess the models' capabilities in terms of uncertainty quantification, we compute the empirical coverage (EC), i.e. the coverage of the empirical count distributions obtained by sampling from the posterior intensity. We do that for different credible intervals (CI) on both the training set (in-sample, p(N|D)) and the test set (out-of-sample, p(N*|D)). Details on the metrics are in the supplement (§2). For the synthetic data experiments, we run the algorithms with 10 training datasets, each including a different PPP realization sampled from the ground truth. For each training set, we then evaluate the performance on 10 further unseen realizations sampled again from the ground truth. We compute the mean and the standard deviation of the presented metrics, averaging across the training and test sets. For the real data settings, we compute the NLPL and in-sample EC on the observed events. We then test the algorithm by computing both ℓtest and out-of-sample EC on the held-out events. In order to compute the out-of-sample EC we rescale the intensity function as λtest(x) = λtrain(x) − Ntrain/V + Ntest/V with V = ∫_X dx. We then sample from λtest(x) and generate the predicted count distributions for different seeds.

Synthetic experiments We test our approach using the three toy examples proposed by [1]:

• λ1(x) = 2 exp(−x/15) + exp(−[(x − 25)/10]²),  x ∈ [0, 50],
• λ2(x) = 5 sin(x²) + 6,  x ∈ [0, 5], and
• λ3(x) piecewise linear through (0, 2), (25, 3), (50, 1), (75, 2.5), (100, 3),  x ∈ [0, 100].

For LGCP, we discretize the input space considering a grid cell width of one for λ1(x) and λ3(x) and of 0.5 for λ2(x). For MFVB we consider 1000 integration points. In terms of q({y_m}_{m=1}^M|M), we set S = 5, but consistent results were found across different values of this parameter. The results are given in Fig. 2 and Tab. 2, where we see that all algorithms recover similar predicted mean intensities and give roughly comparable performances across all metrics. Out of all 9 settings and metrics (top section of Tab. 2), our method (STVB) outperforms competing methods in 3 cases and is second only to SGCP in 6 cases. However, the CPU time of SGCP is almost an order of magnitude larger than ours, even in these simple low-dimensional problems. This confirms the benefits of having structured approximate posteriors within a computationally efficient inference algorithm such as VI. In terms of uncertainty quantification (bottom section of Tab. 2), our algorithm outperforms all competing approaches for λ1(x) and λ2(x).

2D real data experiments In this section we show the performance of the algorithm on two 2D real-world datasets. In both cases, we assume independent two-dimensional truncated Gaussian distributions for q({y_m}_{m=1}^M|M) so that they factorize across input dimensions. Qualitative and quantitative results are given in Fig. 3, Fig. 4 and Tab. 
3.

Our first dataset is concerned with neuronal data, where event locations correspond to the positions of a mouse moving in an arena when a recorded cell fired [5, 26]. We randomly assign the events to either training (N = 583) or test (N = 29710) and we run the model using a regular grid of 10 × 10 inducing inputs. We see that the intensity functions recovered by the three methods vary in terms of smoothness, with MFVB estimating the smoothest λ(x) and VBPP recovering an irregular surface (Fig. 3). MFVB gives slightly better performance in terms of ℓtest, but our method (STVB) outperforms competing approaches in terms of NLPL and EC figures. Remarkably, STVB contains the true number of test events in the 30% credible intervals for 56% of the simulations from the posterior intensity (Tab. 3 and Fig. 4).

As a second dataset, we consider the Porto taxi dataset³, which contains the trajectories of 7000 taxi travels in the years 2013/2014 in the city of Porto. As in [10], we consider the pick-up locations as observations of a PPP and restrict the analysis to events happening within the coordinates (41.147, −8.58) and (41.18, −8.65). We select N = 1000 events at random as training set and train

³http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html.

Table 2: Average performances on synthetic data across 10 training and 10 test datasets with standard errors in brackets. Top: lower values of l2 and NLPL and higher values of ℓtest are better. Bottom: out-of-sample EC for different CI; higher values are better. 
Our method is denoted by STVB.

Top (l2 / ℓtest / NLPL per setting):
| Method | λ1: l2 | λ1: ℓtest | λ1: NLPL | λ2: l2 | λ2: ℓtest | λ2: NLPL | λ3: l2 | λ3: ℓtest | λ3: NLPL | CPU time (s) |
| STVB | 3.44 (1.43) | -1.39 (1.05) | 4.71 (0.51) | 46.28 (9.95) | 56.04 (4.47) | 5.62 (0.72) | 7.39 (2.76) | 153.98 (11.91) | 6.41 (0.64) | 315.59 |
| MFVB | 4.56 (1.43) | -2.84 (1.0) | 4.74 (0.1) | 44.44 (10.7) | 55.35 (4.72) | 5.52 (1.29) | 8.17 (3.43) | 155.08 (10.20) | 5.82 (0.61) | 0.01 |
| VBPP | 9.19 (2.32) | -7.71 (3.31) | 8.91 (1.19) | 48.15 (13.16) | 56.82 (4.42) | 5.20 (1.33) | 20.54 (6.53) | 152.82 (11.43) | 8.35 (2.28) | 0.44 |
| SGCP | 4.22 (1.88) | -1.39 (1.28) | 4.21 (1.04) | 43.50 (8.69) | 55.05 (1.35) | 3.77 (0.54) | 14.44 (2.97) | 165.66 (2.12) | 4.78 (0.33) | 2764.88 |
| LGCP | 67.76 (24.38) | -5.26 (8.84) | 26.26 (8.09) | 106.74 (13.89) | 28.56 (6.88) | 15.75 (3.36) | 19.24 (6.44) | 147.67 (11.76) | 10.84 (1.36) | 4.74 |

Bottom (out-of-sample EC at 30% / 40% / 50% CI per setting):
| Method | EC–λ1: 30% | 40% | 50% | EC–λ2: 30% | 40% | 50% | EC–λ3: 30% | 40% | 50% |
| STVB | 0.81 (0.27) | 0.72 (0.27) | 0.6 (0.34) | 0.91 (0.24) | 0.88 (0.23) | 0.86 (0.22) | 0.99 (0.03) | 0.97 (0.09) | 0.92 (0.15) |
| MFVB | 0.76 (0.25) | 0.61 (0.28) | 0.52 (0.29) | 0.89 (0.23) | 0.84 (0.29) | 0.82 (0.29) | 0.97 (0.09) | 0.91 (0.14) | 0.78 (0.15) |
| VBPP | 0.75 (0.21) | 0.41 (0.25) | 0.04 (0.09) | 0.76 (0.26) | 0.45 (0.26) | 0.05 (0.05) | 0.83 (0.19) | 0.43 (0.14) | 0.03 (0.05) |
| SGCP | 0.39 (0.28) | 0.27 (0.22) | 0.08 (0.12) | 0.64 (0.09) | 0.14 (0.05) | 0.00 (0.00) | 0.49 (0.03) | 0.34 (0.07) | 0.02 (0.04) |
| LGCP | 0.08 (0.12) | 0.03 (0.09) | 0.01 (0.03) | 0.04 (0.08) | 0.00 (0.00) | 0.00 (0.00) | 0.99 (0.00) | 0.99 (0.12) | 0.95 (0.10) |

Table 3: Average performances on real-data experiments with standard errors in brackets. EC is computed across 100 replications using different seeds. 
Higher ℓtest and EC and lower NLPL are better. EC figures are given as In-sample - Out-of-sample.

Neuronal data:
          ℓtest [×10³]      NLPL             EC 30% CI                  EC 40% CI                  CPU time (s)
STVB     -84.55 (16.05)     10.10 (7.02)     1.00-1.00 (0.00)-(0.00)    0.99-0.56 (0.10)-(0.50)    193.07
MFVB     -83.54 (4.60)      10.71 (3.39)     1.00-0.03 (0.00)-(0.17)    0.78-0.00 (0.41)-(0.00)      0.35
VBPP     -83.89 (12.49)     11.39 (8.18)     1.00-0.00 (0.00)-(0.00)    0.83-0.00 (0.38)-(0.00)     26.23

Taxi data:
          ℓtest [×10⁶]      NLPL [×10⁴]      EC 30% CI                  EC 40% CI                  CPU time (s)
STVB     -27.96 (9.16)      27.96 (9.16)     0.81-0.37 (0.39)-(0.48)    0.09-0.01 (0.29)-(0.10)    290.34
MFVB     -40.8 (6.41)       40.65 (6.41)     0.00-0.00 (0.00)-(0.00)    0.00-0.00 (0.00)-(0.00)      0.24
VBPP     -31.32 (8.18)      31.32 (8.18)     0.98-0.00 (0.14)-(0.00)    0.48-0.00 (0.50)-(0.00)      3.62

the model with 400 inducing points placed on a regular grid. The test log-likelihood is then computed on the remaining 3401 events. Our method (STVB) outperforms the competing methods on all performance metrics (Tab. 3), recovering an intensity that is smoother than that of VBPP and captures more structure than that of MFVB (Fig. 3). In terms of uncertainty quantification, the coverage of p(N∗|D) is the highest for STVB across all CIs. Notice how the irregularity of the VBPP intensity leads to good performance on the training set but results in a p(N∗|D) centered on a significantly higher number of test events (Fig. 4). As expected, the SVI approach implies wider count distributions than the mean-field approximation, which generally yields better predictive performance in a variety of settings, especially in higher dimensions.

Figure 3: Real data.
Posterior mean intensities and events on the two-dimensional input space.

Figure 4: Predicted count distributions for the training set (p(N|D)) and the test set (p(N∗|D)) on real data. The gray line denotes the number of observed events. The red bars on the x-axis denote breaks in the axis due to the different shifts of the distributions.

Table 4: Average performances on the spatio-temporal Taxi dataset, with standard errors in brackets. EC is computed across 100 replications using different seeds. Higher ℓtest and EC and lower NLPL are better. EC figures are given as In-sample - Out-of-sample.

Spatio-temporal Taxi data:
          ℓtest [×10⁷]       NLPL [×10⁵]       EC 30% CI                  EC 40% CI                  CPU time (s)
STVB     -31.26 (10.88)      31.26 (10.88)     1.00-0.00 (0.00)-(0.00)    0.98-0.00 (0.14)-(0.00)    1208.00
MFVB     -42.97 (9.56)       42.97 (9.56)      0.00-0.00 (0.00)-(0.00)    0.00-0.00 (0.00)-(0.00)       1.00

Figure 5: Predicted count distributions for the training set (p(N|D)) and the test set (p(N∗|D)).

3D real data experiment. Finally, we show the performance of the algorithm on the spatio-temporal Taxi dataset used above, where, for each taxi trip, we consider both the trajectory and the pick-up time in seconds. Since VBPP does not currently support D > 2, we compare against MFVB and find that STVB outperforms it both in terms of performance metrics and uncertainty quantification (Tab. 4, Fig. 5).

5 Conclusions and discussion

We have proposed a new variational inference framework for estimating the intensity of a continuous sigmoidal Cox process. By viewing the model on an augmented input space arising from a superposition of two PPPs, we have derived a scalable and computationally efficient structured variational approximation.
Our framework does not require discretization or accurate numerical computation of integrals over the input space, it is not limited to specific kernel functions, and it properly accounts for the strong dependencies existing across the latent variables. Through extensive empirical evaluation we have shown that our method compares favorably against ‘exact’ but computationally costly MCMC schemes, while being almost an order of magnitude faster. More importantly, our inference scheme outperforms all competing approaches in terms of uncertainty quantification. The benefits of the proposed scheme and the resulting SVI are particularly pronounced in multivariate input settings, where accounting for the highly coupled variables becomes crucial for interpolation and prediction. Future work will focus on relaxing the factorization assumption between the GP and the latent points in the posterior. Introducing a fully structured variational posterior would further improve the accuracy of the method but would require further approximations in the variational objective.

Acknowledgments

This work was supported by the EPSRC grant EP/L016710/1, The Alan Turing Institute under EPSRC grant EP/N510129/1, the Lloyd's Register Foundation programme on Data Centric Engineering, the University of Sydney's Centre for Translational Data Science and the Australian Research Council grant ARC FT140101266.

References

[1] Adams, R. P., Murray, I., and MacKay, D. J. (2009). Tractable nonparametric Bayesian inference in Poisson processes with Gaussian process intensities. In Annual International Conference on Machine Learning, pages 9–16.

[2] Aglietti, V., Damoulas, T., and Bonilla, E. V. (2019). Efficient Inference in Multi-task Cox Process Models. In Artificial Intelligence and Statistics, pages 537–546.

[3] Bonilla, E. V., Krauth, K., and Dezfouli, A. (2019). Generic Inference in Latent Gaussian Process Models.
Journal of Machine Learning Research, 20(117):1–63.

[4] Brix, A. and Diggle, P. J. (2001). Spatiotemporal prediction for log-Gaussian Cox processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4):823–841.

[5] Centre for the Biology of Memory and Sargolini, F. (2014). Grid cell data Sargolini et al. 2006.

[6] Cox, D. R. (1955). Some statistical methods connected with series of events. Journal of the Royal Statistical Society: Series B (Methodological), pages 129–164.

[7] Cunningham, J. P., Shenoy, K. V., and Sahani, M. (2008). Fast Gaussian Process Methods for Point Process Intensity Estimation. In International Conference on Machine Learning, pages 192–199.

[8] Daley, D. J. and Vere-Jones, D. (2003). An Introduction to the Theory of Point Processes. Volume I: Elementary Theory and Methods. Springer Science & Business Media.

[9] Diggle, P. J., Moraga, P., Rowlingson, B., and Taylor, B. M. (2013). Spatial and spatio-temporal log-Gaussian Cox processes: Extending the geostatistical paradigm. Statistical Science, pages 542–563.

[10] Donner, C. and Opper, M. (2018). Efficient Bayesian Inference of Sigmoidal Gaussian Cox Processes. Journal of Machine Learning Research, 19:2710–2743.

[11] Grubesic, T. H. and Mack, E. A. (2008). Spatio-temporal interaction of urban crime. Journal of Quantitative Criminology, 24(3):285–306.

[12] Gunter, T., Lloyd, C., Osborne, M. A., and Roberts, S. J. (2014). Efficient Bayesian Nonparametric Modelling of Structured Point Processes. In Uncertainty in Artificial Intelligence, pages 310–319.

[13] John, S. and Hensman, J. (2018). Large-Scale Cox Process Inference using Variational Fourier Features. In International Conference on Machine Learning, pages 2362–2370.

[14] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999).
An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.

[15] Lasko, T. A. (2014). Efficient Inference of Gaussian-Process-Modulated Renewal Processes with Application to Medical Event Data. In Uncertainty in Artificial Intelligence, pages 469–476.

[16] Lewis, P. W. and Shedler, G. S. (1979). Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 26(3):403–413.

[17] Lian, W., Henao, R., Rao, V., Lucas, J., and Carin, L. (2015). A Multitask Point Process Predictive Model. In International Conference on Machine Learning, pages 2030–2038.

[18] Lloyd, C., Gunter, T., Nickson, T., Osborne, M., and Roberts, S. J. (2016). Latent Point Process Allocation. In Artificial Intelligence and Statistics, pages 389–397.

[19] Lloyd, C., Gunter, T., Osborne, M. A., and Roberts, S. J. (2015). Variational Inference for Gaussian Process Modulated Poisson Processes. In International Conference on Machine Learning, pages 1814–1822.

[20] López-Lopera, A. F., John, S., and Durrande, N. (2019). Gaussian Process Modulated Cox Processes under Linear Inequality Constraints. In Artificial Intelligence and Statistics.

[21] Marsan, D. and Lengline, O. (2008). Extending earthquakes' reach through cascading. Science, 319(5866):1076–1079.

[22] Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (1998). Log Gaussian Cox Processes. Scandinavian Journal of Statistics, 25(3):451–482.

[23] Møller, J. and Waagepetersen, R. P. (2003). Statistical Inference and Simulation for Spatial Point Processes. Chapman and Hall/CRC.

[24] Nguyen, T. V. and Bonilla, E. V. (2014). Automated Variational Inference for Gaussian Process Models. In Neural Information Processing Systems, pages 1404–1412.

[25] Rasmussen, C. E. and Williams, C. K. I. (2005).
Gaussian Processes for Machine Learning. The MIT Press.

[26] Sargolini, F., Fyhn, M., Hafting, T., McNaughton, B. L., Witter, M. P., Moser, M.-B., and Moser, E. I. (2006). Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science, 312(5774):758–762.

[27] Titsias, M. K. (2009). Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Artificial Intelligence and Statistics, pages 567–574.

[28] Walder, C. J. and Bishop, A. N. (2017). Fast Bayesian Intensity Estimation for the Permanental Process. In International Conference on Machine Learning, pages 3579–3588.