{"title": "Poisson-Gamma dynamical systems", "book": "Advances in Neural Information Processing Systems", "page_first": 5005, "page_last": 5013, "abstract": "This paper presents a dynamical system based on the Poisson-Gamma construction for sequentially observed multivariate count data. Inherent to the model is a novel Bayesian nonparametric prior that ties and shrinks parameters in a powerful way. We develop theory about the model's infinite limit and its steady-state. The model's inductive bias is demonstrated on a variety of real-world datasets where it is shown to learn interpretable structure and have superior predictive performance."}

Poisson–Gamma Dynamical Systems

Aaron Schein
College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA 01003
aschein@cs.umass.edu

Mingyuan Zhou
McCombs School of Business
The University of Texas at Austin
Austin, TX 78712
mingyuan.zhou@mccombs.utexas.edu

Hanna Wallach
Microsoft Research New York
641 Avenue of the Americas
New York, NY 10011
hanna@dirichlet.net

Abstract

We introduce a new dynamical system for sequentially observed multivariate count data. This model is based on the gamma–Poisson construction (a natural choice for count data) and relies on a novel Bayesian nonparametric prior that ties and shrinks the model parameters, thus avoiding overfitting. We present an efficient MCMC inference algorithm that advances recent work on augmentation schemes for inference in negative binomial models. Finally, we demonstrate the model's inductive bias using a variety of real-world data sets, showing that it exhibits superior predictive performance over other models and infers highly interpretable latent structure.

1 Introduction

Sequentially observed count vectors $y^{(1)}, \ldots, y^{(T)}$ are the main object of study in many real-world applications, including text analysis, social network analysis, and recommender systems. Count data pose unique statistical and computational challenges when they are high-dimensional, sparse, and overdispersed, as is often the case in real-world applications. For example, when tracking counts of user interactions in a social network, only a tiny fraction of possible edges are ever active, exhibiting bursty periods of activity when they are. Models of such data should exploit this sparsity in order to scale to high dimensions and be robust to overdispersed temporal patterns. In addition to these characteristics, sequentially observed multivariate count data often exhibit complex dependencies within and across time steps. For example, scientific papers about one topic may encourage researchers to write papers about another related topic in the following year. Models of such data should therefore capture the topic structure of individual documents as well as the excitatory relationships between topics.

The linear dynamical system (LDS) is a widely used model for sequentially observed data, with many well-developed inference techniques based on the Kalman filter [1, 2]. The LDS assumes that each sequentially observed $V$-dimensional vector $r^{(t)}$ is real valued and Gaussian distributed: $r^{(t)} \sim \mathcal{N}(\Phi\,\theta^{(t)}, \Sigma)$, where $\theta^{(t)} \in \mathbb{R}^K$ is a latent state, with $K$ components, that is linked to the observed space via $\Phi \in \mathbb{R}^{V \times K}$. The LDS derives its expressive power from the way it assumes that the latent states evolve: $\theta^{(t)} \sim \mathcal{N}(\Pi\,\theta^{(t-1)}, \Delta)$, where $\Pi \in \mathbb{R}^{K \times K}$ is a transition matrix that captures between-component dependencies across time steps.
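To make the LDS generative process concrete, it can be simulated forward in a few lines of Python; the dimensions and parameter values below are illustrative choices for the sketch, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, T = 3, 5, 50                   # latent components, observed dimensions, time steps
Phi = rng.normal(size=(V, K))        # links the latent space to the observed space
Pi = 0.9 * np.eye(K)                 # transition matrix (simple decaying dynamics here)
Sigma = 0.1 * np.eye(V)              # observation noise covariance
Delta = 0.1 * np.eye(K)              # transition noise covariance

theta = np.zeros((T, K))             # latent states theta^(t)
r = np.zeros((T, V))                 # observations r^(t)
theta[0] = rng.standard_normal(K)
for t in range(T):
    if t > 0:
        theta[t] = rng.multivariate_normal(Pi @ theta[t - 1], Delta)
    r[t] = rng.multivariate_normal(Phi @ theta[t], Sigma)
```

The expressive power referred to above sits entirely in `Pi`: replacing it with the identity recovers the simple random-walk state-space model discussed next.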
Although the LDS can be linked to non-real observations via the extended Kalman filter [3], it cannot efficiently model real-world count data because inference is $O((K + V)^3)$ and thus scales poorly with the dimensionality of the data [2].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Many previous approaches to modeling sequentially observed count data rely on the generalized linear modeling framework [4] to link the observations to a latent Gaussian space, e.g., via the Poisson–lognormal link [5]. Researchers have used this construction to factorize sequentially observed count matrices under a Poisson likelihood, while modeling the temporal structure using well-studied Gaussian techniques [6, 7]. Most of these previous approaches assume a simple Gaussian state-space model, i.e., $\theta^{(t)} \sim \mathcal{N}(\theta^{(t-1)}, \Delta)$, that lacks the expressive transition structure of the LDS; one notable exception is the Poisson linear dynamical system [8]. In practice, these approaches exhibit prohibitive computational complexity in high dimensions, and the Gaussian assumption may fail to accommodate the burstiness often inherent to real-world count data [9].

We present the Poisson–gamma dynamical system (PGDS): a new dynamical system, based on the gamma–Poisson construction, that supports the expressive transition structure of the LDS. This model naturally handles overdispersed data. We introduce a new Bayesian nonparametric prior to automatically infer the model's rank.
We develop an elegant and efficient algorithm for inferring the parameters of the transition structure that advances recent work on augmentation schemes for inference in negative binomial models [10] and scales with the number of non-zero counts, thus exploiting the sparsity inherent to real-world count data. We examine the way in which the dynamical gamma–Poisson construction propagates information and derive the model's steady state, which involves the Lambert W function [11]. Finally, we use the PGDS to analyze a diverse range of real-world data sets, showing that it exhibits excellent predictive performance on smoothing and forecasting tasks and infers interpretable latent structure, an example of which is depicted in figure 1.

2 Poisson–Gamma Dynamical Systems

Figure 1: The time-step factors for three components inferred by the PGDS from a corpus of NIPS papers. Each component is associated with a feature factor for each word type in the corpus; we list the words with the largest factors. The inferred structure tells a familiar story about the rise and fall of certain subfields of machine learning.

We can represent a data set of $V$-dimensional sequentially observed count vectors $y^{(1)}, \ldots, y^{(T)}$ as a $V \times T$ count matrix $Y$. The PGDS models a single count $y^{(t)}_v \in \{0, 1, \ldots\}$ in this matrix as follows:

$$y^{(t)}_v \sim \textrm{Pois}\Big(\delta^{(t)} \textstyle\sum_{k=1}^K \phi_{vk}\, \theta^{(t)}_k\Big) \quad\textrm{and}\quad \theta^{(t)}_k \sim \textrm{Gam}\Big(\tau_0 \textstyle\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(t-1)}_{k_2},\ \tau_0\Big), \qquad (1)$$

where the latent factors $\phi_{vk}$ and $\theta^{(t)}_k$ are both positive, and represent the strength of feature $v$ in component $k$ and the strength of component $k$ at time step $t$, respectively.
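The generative process in equation 1 is easy to simulate forward. The following sketch uses illustrative dimensions and hyperparameter values (and fixes all component weights to one), which are assumptions for the example rather than settings from the paper; note the simulated matrix is laid out as $T \times V$, the transpose of the paper's $V \times T$ convention.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, T = 10, 4, 30
tau0 = 10.0                                       # concentration parameter (illustrative)
delta = np.ones(T)                                # scaling factors (stationary: all equal)
Phi = rng.dirichlet(np.full(V, 0.1), size=K).T    # feature factors; each column sums to 1
Pi = rng.dirichlet(np.full(K, 0.5), size=K).T     # transition weights; each column sums to 1

Theta = np.zeros((T, K))
Theta[0] = rng.gamma(tau0 * np.ones(K), 1.0 / tau0)   # theta^(1) ~ Gam(tau0 * nu, tau0), nu = 1 here
for t in range(1, T):
    shape = tau0 * (Pi @ Theta[t - 1])                # linear combination of previous factors
    Theta[t] = rng.gamma(shape, 1.0 / tau0)           # so E[theta^(t)] = Pi theta^(t-1)
Y = rng.poisson(delta[:, None] * (Theta @ Phi.T))     # y^(t)_v ~ Pois(delta^(t) sum_k phi_vk theta^(t)_k)
```

Because numpy's gamma sampler is parameterized by shape and scale, the rate $\tau_0$ enters as `scale = 1 / tau0`, which gives the conditional expectation $\Pi\,\theta^{(t-1)}$ stated below.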
The scaling factor $\delta^{(t)}$ captures the scale of the counts at time step $t$, and therefore obviates the need to rescale the data as a preprocessing step. We refer to the PGDS as stationary if $\delta^{(t)} = \delta$ for $t = 1, \ldots, T$. We can view the feature factors as a $V \times K$ matrix $\Phi$ and the time-step factors as a $T \times K$ matrix $\Theta$. Because we can also collectively view the scaling factors and time-step factors as a $T \times K$ matrix $\Psi$, where element $\psi_{tk} = \delta^{(t)}\, \theta^{(t)}_k$, the PGDS is a form of Poisson matrix factorization: $Y \sim \textrm{Pois}(\Phi\, \Psi^T)$ [12, 13, 14, 15].

The PGDS is characterized by its expressive transition structure, which assumes that each time-step factor $\theta^{(t)}_k$ is drawn from a gamma distribution, whose shape parameter is a linear combination of the $K$ factors at the previous time step. The latent transition weights $\pi_{11}, \ldots, \pi_{k_1 k_2}, \ldots, \pi_{KK}$, which we can view as a $K \times K$ transition matrix $\Pi$, capture the excitatory relationships between components. The vector $\theta^{(t)} = (\theta^{(t)}_1, \ldots, \theta^{(t)}_K)$ has an expected value of $\mathbb{E}[\theta^{(t)} \mid \theta^{(t-1)}, \Pi] = \Pi\, \theta^{(t-1)}$ and is therefore analogous to a latent state in the LDS. The concentration parameter $\tau_0$ determines the variance of $\theta^{(t)}$, specifically $\textrm{Var}(\theta^{(t)} \mid \theta^{(t-1)}, \Pi) = (\Pi\, \theta^{(t-1)})\, \tau_0^{-1}$, without affecting its expected value.

To model the strength of each component, we introduce $K$ component weights $\nu = (\nu_1, \ldots, \nu_K)$ and place a shrinkage prior over them. We assume that the time-step factors and transition weights for component $k$ are tied to its component weight $\nu_k$. Specifically, we define the following structure:

$$\theta^{(1)}_k \sim \textrm{Gam}(\tau_0\, \nu_k,\ \tau_0) \quad\textrm{and}\quad \pi_k \sim \textrm{Dir}(\nu_1 \nu_k, \ldots, \xi \nu_k, \ldots, \nu_K \nu_k) \quad\textrm{and}\quad \nu_k \sim \textrm{Gam}\big(\tfrac{\gamma_0}{K},\ \beta\big), \qquad (2)$$

where $\pi_k = (\pi_{1k}, \ldots, \pi_{Kk})$ is the $k$th column of $\Pi$. Because $\sum_{k_1=1}^K \pi_{k_1 k} = 1$, we can interpret $\pi_{k_1 k}$ as the probability of transitioning from component $k$ to component $k_1$. (We note that interpreting $\Pi$ as a stochastic transition matrix relates the PGDS to the discrete hidden Markov model.) For a fixed value of $\gamma_0$, increasing $K$ will encourage many of the component weights to be small. A small value of $\nu_k$ will shrink $\theta^{(1)}_k$, as well as the transition weights in the $k$th row of $\Pi$. Small values of the transition weights in the $k$th row of $\Pi$ therefore prevent component $k$ from being excited by the other components and by itself. Specifically, because the shape parameter for the gamma prior over $\theta^{(t)}_k$ involves a linear combination of $\theta^{(t-1)}$ and the transition weights in the $k$th row of $\Pi$, small transition weights will result in a small shape parameter, shrinking $\theta^{(t)}_k$. Thus, the component weights play a critical role in the PGDS by enabling it to automatically turn off any unneeded capacity and avoid overfitting. Finally, we place Dirichlet priors over the feature factors and draw the other parameters from a non-informative gamma prior: $\phi_k = (\phi_{1k}, \ldots, \phi_{Vk}) \sim \textrm{Dir}(\eta_0, \ldots, \eta_0)$ and $\delta^{(t)}, \xi, \beta \sim \textrm{Gam}(\epsilon_0, \epsilon_0)$. The PGDS therefore has four positive hyperparameters to be set by the user: $\tau_0$, $\gamma_0$, $\eta_0$, and $\epsilon_0$.

Bayesian nonparametric interpretation: As $K \rightarrow \infty$, the component weights and their corresponding feature factor vectors constitute a draw $G = \sum_{k=1}^\infty \nu_k\, \mathbb{1}_{\phi_k}$ from a gamma process $\textrm{GamP}(G_0, \beta)$, where $\beta$ is a scale parameter and $G_0$ is a finite and continuous base measure over a complete separable metric space $\Omega$ [16].
Models based on the gamma process have an inherent shrinkage mechanism because the number of atoms with weights greater than $\varepsilon > 0$ follows a Poisson distribution with a finite mean, specifically $\textrm{Pois}\big(\gamma_0 \int_\varepsilon^\infty d\nu\, \nu^{-1} \exp(-\beta \nu)\big)$, where $\gamma_0 = G_0(\Omega)$ is the total mass under the base measure. This interpretation enables us to view the priors over $\Pi$ and $\Theta$ as novel stochastic processes, which we call the column-normalized relational gamma process and the recurrent gamma process, respectively. We provide the definitions of these processes in the supplementary material.

Non-count observations: The PGDS can also model non-count data by linking the observed vectors to latent counts. A binary observation $b^{(t)}_v$ can be linked to a latent Poisson count $y^{(t)}_v$ via the Bernoulli–Poisson distribution: $b^{(t)}_v = \mathbb{1}(y^{(t)}_v \geq 1)$ and $y^{(t)}_v \sim \textrm{Pois}\big(\delta^{(t)} \sum_{k=1}^K \phi_{vk}\, \theta^{(t)}_k\big)$ [17]. Similarly, a real-valued observation $r^{(t)}_v$ can be linked to a latent Poisson count $y^{(t)}_v$ via the Poisson randomized gamma distribution [18]. Finally, Basbug and Engelhardt [19] recently showed that many types of non-count matrices can be linked to a latent count matrix via the compound Poisson distribution [20].

3 MCMC Inference

MCMC inference for the PGDS consists of drawing samples of the model parameters from their joint posterior distribution given an observed count matrix $Y$ and the model hyperparameters $\tau_0$, $\gamma_0$, $\eta_0$, $\epsilon_0$. In this section, we present a Gibbs sampling algorithm for drawing these samples. At a high level, our approach is similar to that used to develop Gibbs sampling algorithms for several other related models [10, 21, 22, 17]; however, we extend this approach to handle the unique properties of the PGDS. The main technical challenge is sampling $\Theta$ from its conditional posterior, which does not have a closed form.
We address this challenge by introducing a set of auxiliary variables. Under this augmented version of the model, marginalizing over $\Theta$ becomes tractable and its conditional posterior has a closed form. Moreover, by introducing these auxiliary variables and marginalizing over $\Theta$, we obtain an alternative model specification that we can subsequently exploit to obtain closed-form conditional posteriors for $\Pi$, $\nu$, and $\xi$. We marginalize over $\Theta$ by performing a "backward filtering" pass, starting with $\theta^{(T)}$. We repeatedly exploit the following three definitions in order to do this.

Definition 1: If $y_\cdot = \sum_{n=1}^N y_n$, where $y_n \sim \textrm{Pois}(\theta_n)$ are independent Poisson-distributed random variables, then $(y_1, \ldots, y_N) \sim \textrm{Mult}\big(y_\cdot, \big(\tfrac{\theta_1}{\sum_{n=1}^N \theta_n}, \ldots, \tfrac{\theta_N}{\sum_{n=1}^N \theta_n}\big)\big)$ and $y_\cdot \sim \textrm{Pois}\big(\sum_{n=1}^N \theta_n\big)$ [23, 24].

Definition 2: If $y \sim \textrm{Pois}(c\, \theta)$, where $c$ is a constant, and $\theta \sim \textrm{Gam}(a, b)$, then $y \sim \textrm{NB}\big(a, \tfrac{c}{b+c}\big)$ is a negative binomial-distributed random variable. We can equivalently parameterize it as $y \sim \textrm{NB}(a, g(\zeta))$, where $g(z) = 1 - \exp(-z)$ is the Bernoulli–Poisson link [17] and $\zeta = \ln(1 + \tfrac{c}{b})$.

Definition 3: If $y \sim \textrm{NB}(a, g(\zeta))$ and $l \sim \textrm{CRT}(y, a)$ is a Chinese restaurant table-distributed random variable, then $y$ and $l$ are equivalently jointly distributed as $y \sim \textrm{SumLog}(l, g(\zeta))$ and $l \sim \textrm{Pois}(a\, \zeta)$ [21].
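Definition 2, which the marginalization below relies on repeatedly, can be sanity-checked by simulation. The sketch below uses arbitrary illustrative values of $a$, $b$, and $c$; note that numpy's negative binomial sampler is parameterized by the probability $\tfrac{b}{b+c}$ complementary to the $\tfrac{c}{b+c}$ used here.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, n = 2.0, 3.0, 1.5, 200_000     # illustrative shape, rate, scaling constant, sample size

# Route 1: y ~ Pois(c * theta) with theta ~ Gam(a, b); b is a rate, so numpy's scale = 1/b
theta = rng.gamma(a, 1.0 / b, size=n)
y_mix = rng.poisson(c * theta)

# Route 2: y ~ NB(a, c/(b+c)); numpy takes the complementary probability b/(b+c)
y_nb = rng.negative_binomial(a, b / (b + c), size=n)

# The two routes agree in distribution; here mean a*c/b = 1.0 and variance a*c/b * (1 + c/b) = 1.5
mean_mix, mean_nb = y_mix.mean(), y_nb.mean()
var_mix, var_nb = y_mix.var(), y_nb.var()
```

The same mixture identity is what lets the derivation below collapse each gamma-distributed time-step factor into a negative binomial over its aggregated counts.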
The sum logarithmic distribution is further defined as the sum of $l$ independent and identically logarithmic-distributed random variables, i.e., $y = \sum_{i=1}^l x_i$ with $x_i \sim \textrm{Log}(g(\zeta))$.

Marginalizing over $\Theta$: We first note that we can re-express the Poisson likelihood in equation 1 in terms of latent subcounts [13]: $y^{(t)}_{vk} \sim \textrm{Pois}(\delta^{(t)} \phi_{vk}\, \theta^{(t)}_k)$ and $y^{(t)}_v = \sum_{k=1}^K y^{(t)}_{vk}$. We then define $y^{(t)}_{\cdot k} = \sum_{v=1}^V y^{(t)}_{vk}$. Via definition 1, we obtain $y^{(t)}_{\cdot k} \sim \textrm{Pois}(\delta^{(t)}\, \theta^{(t)}_k)$ because $\sum_{v=1}^V \phi_{vk} = 1$.

We start with $\theta^{(T)}_k$ because none of the other time-step factors depend on it in their priors. Via definition 2, we can immediately marginalize over $\theta^{(T)}_k$ to obtain the following equation:

$$y^{(T)}_{\cdot k} \sim \textrm{NB}\Big(\tau_0 \textstyle\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(T-1)}_{k_2},\ g(\zeta^{(T)})\Big), \quad\textrm{where } \zeta^{(T)} = \ln\big(1 + \tfrac{\delta^{(T)}}{\tau_0}\big). \qquad (3)$$

Next, we marginalize over $\theta^{(T-1)}_k$. To do this, we introduce an auxiliary variable: $l^{(T)}_k \sim \textrm{CRT}\big(y^{(T)}_{\cdot k},\ \tau_0 \sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(T-1)}_{k_2}\big)$. We can then re-express the joint distribution over $y^{(T)}_{\cdot k}$ and $l^{(T)}_k$ as

$$y^{(T)}_{\cdot k} \sim \textrm{SumLog}\big(l^{(T)}_k,\ g(\zeta^{(T)})\big) \quad\textrm{and}\quad l^{(T)}_k \sim \textrm{Pois}\Big(\zeta^{(T)} \tau_0 \textstyle\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(T-1)}_{k_2}\Big). \qquad (4)$$

We are still unable to marginalize over $\theta^{(T-1)}_k$ because it appears in a sum in the parameter of the Poisson distribution over $l^{(T)}_k$; however, via definition 1, we can re-express this distribution as

$$l^{(T)}_k = \textstyle\sum_{k_2=1}^K l^{(T)}_{kk_2}, \quad\textrm{where } l^{(T)}_{kk_2} \sim \textrm{Pois}\big(\zeta^{(T)} \tau_0\, \pi_{kk_2}\, \theta^{(T-1)}_{k_2}\big). \qquad (5)$$

We then define $l^{(T)}_{\cdot k} = \sum_{k_1=1}^K l^{(T)}_{k_1 k}$. Again via definition 1, we can express the distribution over $l^{(T)}_{\cdot k}$ as $l^{(T)}_{\cdot k} \sim \textrm{Pois}(\zeta^{(T)} \tau_0\, \theta^{(T-1)}_k)$. We note that this expression does not depend on the transition weights because $\sum_{k_1=1}^K \pi_{k_1 k} = 1$. We also note that definition 1 implies that $(l^{(T)}_{1k}, \ldots, l^{(T)}_{Kk}) \sim \textrm{Mult}(l^{(T)}_{\cdot k}, (\pi_{1k}, \ldots, \pi_{Kk}))$. Next, we introduce $m^{(T-1)}_k = y^{(T-1)}_{\cdot k} + l^{(T)}_{\cdot k}$, which summarizes all of the information about the data at time steps $T-1$ and $T$ via $y^{(T-1)}_{\cdot k}$ and $l^{(T)}_{\cdot k}$, respectively. Because $y^{(T-1)}_{\cdot k}$ and $l^{(T)}_{\cdot k}$ are both Poisson distributed, we can use definition 1 to obtain

$$m^{(T-1)}_k \sim \textrm{Pois}\big(\theta^{(T-1)}_k (\delta^{(T-1)} + \zeta^{(T)} \tau_0)\big). \qquad (6)$$

Combining this likelihood with the gamma prior in equation 1, we can marginalize over $\theta^{(T-1)}_k$:

$$m^{(T-1)}_k \sim \textrm{NB}\Big(\tau_0 \textstyle\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(T-2)}_{k_2},\ g(\zeta^{(T-1)})\Big), \quad\textrm{where } \zeta^{(T-1)} = \ln\big(1 + \tfrac{\delta^{(T-1)}}{\tau_0} + \zeta^{(T)}\big). \qquad (7)$$

We then introduce $l^{(T-1)}_k \sim \textrm{CRT}\big(m^{(T-1)}_k,\ \tau_0 \sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(T-2)}_{k_2}\big)$ and re-express the joint distribution over $l^{(T-1)}_k$ and $m^{(T-1)}_k$ as the product of a Poisson and a sum logarithmic distribution, similar to equation 4. This then allows us to marginalize over $\theta^{(T-2)}_k$ to obtain a negative binomial distribution. We can repeat the same process all the way back to $t = 1$, where marginalizing over $\theta^{(1)}_k$ yields $m^{(1)}_k \sim \textrm{NB}(\tau_0\, \nu_k, g(\zeta^{(1)}))$. We note that just as $m^{(t)}_k$ summarizes all of the information about the data at time steps $t, \ldots, T$, $\zeta^{(t)} = \ln(1 + \tfrac{\delta^{(t)}}{\tau_0} + \zeta^{(t+1)})$ summarizes all of the information about $\delta^{(t)}, \ldots, \delta^{(T)}$.

As we mentioned previously, introducing these auxiliary variables and marginalizing over $\Theta$ also enables us to define an alternative model specification that we can exploit to obtain closed-form conditional posteriors for $\Pi$, $\nu$, and $\xi$. We provide part of its generative process in figure 2.

Figure 2: Alternative model specification:
$l^{(1)}_{k\cdot} \sim \textrm{Pois}(\zeta^{(1)} \tau_0\, \nu_k)$;
$(l^{(t)}_{1k}, \ldots, l^{(t)}_{Kk}) \sim \textrm{Mult}(l^{(t)}_{\cdot k}, (\pi_{1k}, \ldots, \pi_{Kk}))$ for $t > 1$;
$m^{(t)}_k \sim \textrm{SumLog}(l^{(t)}_{k\cdot}, g(\zeta^{(t)}))$;
$(y^{(t)}_{\cdot k}, l^{(t+1)}_{\cdot k}) \sim \textrm{Bin}\big(m^{(t)}_k, \big(\tfrac{\delta^{(t)}}{\delta^{(t)} + \zeta^{(t+1)} \tau_0}, \tfrac{\zeta^{(t+1)} \tau_0}{\delta^{(t)} + \zeta^{(t+1)} \tau_0}\big)\big)$;
$(y^{(t)}_{1k}, \ldots, y^{(t)}_{Vk}) \sim \textrm{Mult}(y^{(t)}_{\cdot k}, (\phi_{1k}, \ldots, \phi_{Vk}))$.

We define $m^{(T)}_k = y^{(T)}_{\cdot k} + l^{(T+1)}_{\cdot k}$, where $l^{(T+1)}_{\cdot k} = 0$, and $\zeta^{(T+1)} = 0$ so that we can present the alternative model specification concisely.

Steady state: We draw particular attention to the backward pass $\zeta^{(t)} = \ln(1 + \tfrac{\delta^{(t)}}{\tau_0} + \zeta^{(t+1)})$ that propagates information about $\delta^{(t)}, \ldots, \delta^{(T)}$ as we marginalize over $\Theta$. In the case of the stationary PGDS, i.e., $\delta^{(t)} = \delta$, the backward pass has a fixed point that we define in the following proposition.

Proposition 1: The backward pass has a fixed point of $\zeta^\star = -W_{-1}\big(-\exp\big(-1 - \tfrac{\delta}{\tau_0}\big)\big) - 1 - \tfrac{\delta}{\tau_0}$.

The function $W_{-1}(\cdot)$ is the lower real part of the Lambert W function [11]. We prove this proposition in the supplementary material. During inference, we perform the $O(T)$ backward pass repeatedly. The existence of a fixed point means that we can assume the stationary PGDS is in its steady state and replace the backward pass with an $O(1)$ computation¹ of the fixed point $\zeta^\star$. To make this assumption, we must also assume that $l^{(T+1)}_{\cdot k} \sim \textrm{Pois}(\zeta^\star \tau_0\, \theta^{(T)}_k)$ instead of $l^{(T+1)}_{\cdot k} = 0$. We note that an analogous steady-state approximation exists for the LDS and is routinely exploited to reduce computation [25].

Gibbs sampling algorithm: Given $Y$ and the hyperparameters, Gibbs sampling involves resampling each auxiliary variable or model parameter from its conditional posterior.
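Proposition 1 is straightforward to check numerically. The sketch below uses illustrative values of $\delta$ and $\tau_0$, iterates the backward pass to convergence, and compares the result against the closed-form fixed point computed with scipy's Lambert W implementation.

```python
import numpy as np
from scipy.special import lambertw

tau0, delta = 1.0, 2.0                  # illustrative values; stationary PGDS has delta^(t) = delta

# Backward pass: zeta^(t) = ln(1 + delta/tau0 + zeta^(t+1)), starting from zeta^(T+1) = 0.
zeta = 0.0
for _ in range(200):                    # the update is a contraction, so it converges quickly
    zeta = np.log(1.0 + delta / tau0 + zeta)

# O(1) fixed point from Proposition 1, via the lower real branch of the Lambert W function.
zeta_star = float(-lambertw(-np.exp(-1.0 - delta / tau0), k=-1).real - 1.0 - delta / tau0)
```

By construction, `zeta_star` satisfies the self-consistency condition $\zeta^\star = \ln(1 + \delta/\tau_0 + \zeta^\star)$, which is exactly what the iterated backward pass converges to.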
Our algorithm involves a "backward filtering" pass and a "forward sampling" pass, which together form a "backward filtering–forward sampling" algorithm. We use $- \setminus \Theta^{(\geq t)}$ to denote everything excluding $\theta^{(t)}, \ldots, \theta^{(T)}$.

Sampling the auxiliary variables: This step is the "backward filtering" pass. For the stationary PGDS in its steady state, we first compute $\zeta^\star$ and draw $(l^{(T+1)}_{\cdot k} \mid -) \sim \textrm{Pois}(\zeta^\star \tau_0\, \theta^{(T)}_k)$. For the other variants of the model, we set $l^{(T+1)}_{\cdot k} = \zeta^{(T+1)} = 0$. Then, working backward from $t = T, \ldots, 2$, we draw

$$(l^{(t)}_{k\cdot} \mid - \setminus \Theta^{(\geq t)}) \sim \textrm{CRT}\Big(y^{(t)}_{\cdot k} + l^{(t+1)}_{\cdot k},\ \tau_0 \textstyle\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(t-1)}_{k_2}\Big) \qquad (8)$$

and

$$(l^{(t)}_{k1}, \ldots, l^{(t)}_{kK} \mid - \setminus \Theta^{(\geq t)}) \sim \textrm{Mult}\bigg(l^{(t)}_{k\cdot},\ \Big(\tfrac{\pi_{k1}\, \theta^{(t-1)}_1}{\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(t-1)}_{k_2}}, \ldots, \tfrac{\pi_{kK}\, \theta^{(t-1)}_K}{\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(t-1)}_{k_2}}\Big)\bigg). \qquad (9)$$

After using equations 8 and 9 for all $k = 1, \ldots, K$, we then set $l^{(t)}_{\cdot k} = \sum_{k_1=1}^K l^{(t)}_{k_1 k}$. For the non-steady-state variants, we also set $\zeta^{(t)} = \ln(1 + \tfrac{\delta^{(t)}}{\tau_0} + \zeta^{(t+1)})$; for the steady-state variant, we set $\zeta^{(t)} = \zeta^\star$.

Sampling $\Theta$: We sample $\Theta$ from its conditional posterior by performing a "forward sampling" pass, starting with $\theta^{(1)}$. Conditioned on the values of $l^{(2)}_{\cdot k}, \ldots, l^{(T+1)}_{\cdot k}$ and $\zeta^{(2)}, \ldots, \zeta^{(T+1)}$ obtained via the "backward filtering" pass, we sample forward from $t = 1, \ldots, T$, using the following equations:

$$(\theta^{(1)}_k \mid - \setminus \Theta) \sim \textrm{Gam}\big(y^{(1)}_{\cdot k} + l^{(2)}_{\cdot k} + \tau_0\, \nu_k,\ \tau_0 + \delta^{(1)} + \zeta^{(2)} \tau_0\big) \qquad (10)$$

and

$$(\theta^{(t)}_k \mid - \setminus \Theta^{(\geq t)}) \sim \textrm{Gam}\Big(y^{(t)}_{\cdot k} + l^{(t+1)}_{\cdot k} + \tau_0 \textstyle\sum_{k_2=1}^K \pi_{kk_2}\, \theta^{(t-1)}_{k_2},\ \tau_0 + \delta^{(t)} + \zeta^{(t+1)} \tau_0\Big). \qquad (11)$$

Sampling $\Pi$: The alternative model specification, with $\Theta$ marginalized out, assumes that $(l^{(t)}_{1k}, \ldots, l^{(t)}_{Kk}) \sim \textrm{Mult}(l^{(t)}_{\cdot k}, (\pi_{1k}, \ldots, \pi_{Kk}))$. Therefore, via Dirichlet–multinomial conjugacy,

$$(\pi_k \mid - \setminus \Theta) \sim \textrm{Dir}\Big(\nu_1 \nu_k + \textstyle\sum_{t=1}^T l^{(t)}_{1k}, \ldots, \xi \nu_k + \textstyle\sum_{t=1}^T l^{(t)}_{kk}, \ldots, \nu_K \nu_k + \textstyle\sum_{t=1}^T l^{(t)}_{Kk}\Big). \qquad (12)$$

Sampling $\nu$ and $\xi$: We use the alternative model specification to obtain closed-form conditional posteriors for $\nu_k$ and $\xi$. First, we marginalize over $\pi_k$ to obtain a Dirichlet–multinomial distribution. When augmented with a beta-distributed auxiliary variable, the Dirichlet–multinomial distribution is proportional to the negative binomial distribution [26]. We draw such an auxiliary variable, which we use, along with negative binomial augmentation schemes, to derive closed-form conditional posteriors for $\nu_k$ and $\xi$. We provide these posteriors, along with their derivations, in the supplementary material. We also provide the conditional posteriors for the remaining model parameters ($\Phi$, $\delta^{(1)}, \ldots, \delta^{(T)}$, and $\beta$), which we obtain via Dirichlet–multinomial, gamma–Poisson, and gamma–gamma conjugacy.

4 Experiments

In this section, we compare the predictive performance of the PGDS to that of the LDS and that of gamma process dynamic Poisson factor analysis (GP-DPFA) [22].
GP-DPFA models a single count in $Y$ as $y^{(t)}_v \sim \textrm{Pois}\big(\sum_{k=1}^K \lambda_k\, \phi_{vk}\, \theta^{(t)}_k\big)$, where each component's time-step factors evolve as a simple gamma Markov chain, independently of those belonging to the other components: $\theta^{(t)}_k \sim \textrm{Gam}(\theta^{(t-1)}_k, c^{(t)})$. We consider the stationary variants of all three models.² We used five data sets, and tested each model on two time-series prediction tasks: smoothing, i.e., predicting $y^{(t)}_v$ given $y^{(1)}_v, \ldots, y^{(t-1)}_v, y^{(t+1)}_v, \ldots, y^{(T)}_v$, and forecasting, i.e., predicting $y^{(T+s)}_v$ given $y^{(1)}_v, \ldots, y^{(T)}_v$ for some $s \in \{1, 2, \ldots\}$ [27]. We provide brief descriptions of the data sets below before reporting results.

¹Several software packages contain fast implementations of the Lambert W function.
²We used the pykalman Python library for the LDS and implemented GP-DPFA ourselves.

Global Database of Events, Language, and Tone (GDELT): GDELT is an international relations data set consisting of country-to-country interaction events of the form "country i took action a toward country j at time t," extracted from news corpora. We created five count matrices, one for each year from 2001 through 2005. We treated directed pairs of countries i→j as features and counted the number of events for each pair during each day. We discarded all pairs with fewer than twenty-five total events, leaving $T = 365$, around $V \approx 9{,}000$, and three to six million events for each matrix.

Integrated Crisis Early Warning System (ICEWS): ICEWS is another international relations event data set extracted from news corpora. It is more highly curated than GDELT and contains fewer events. We therefore treated undirected pairs of countries i↔j as features. We created three count matrices, one for 2001–2003, one for 2004–2006, and one for 2007–2009.
We counted the number of events for each pair during each three-day time step, and again discarded all pairs with fewer than twenty-five total events, leaving $T = 365$, around $V \approx 3{,}000$, and 1.3 to 1.5 million events for each matrix.

State-of-the-Union transcripts (SOTU): The SOTU corpus contains the text of the annual SOTU speech transcripts from 1790 through 2014. We created a single count matrix with one column per year. After discarding stopwords, we were left with $T = 225$, $V = 7{,}518$, and 656,949 tokens.

DBLP conference abstracts (DBLP): DBLP is a database of computer science research papers. We used the subset of this corpus that Acharya et al. used to evaluate GP-DPFA [22]. This subset corresponds to a count matrix with $T = 14$ columns, $V = 1{,}771$ unique word types, and 13,431 tokens.

NIPS corpus (NIPS): The NIPS corpus contains the text of every NIPS conference paper from 1987 to 2003. We created a single count matrix with one column per year. We treated unique word types as features and discarded all stopwords, leaving $T = 17$, $V = 9{,}836$, and 3.1 million tokens.

Figure 3: $y^{(t)}_v$ over time for the top four features in the NIPS (left) and ICEWS (right) data sets.

Experimental design: For each matrix, we created four masks indicating some randomly selected subset of columns to treat as held-out data. For the event count matrices, we held out six (non-contiguous) time steps between $t = 2$ and $t = T - 3$ to test the models' smoothing performance, as well as the last two time steps to test their forecasting performance. The other matrices have fewer time steps. For the SOTU matrix, we therefore held out five time steps between $t = 2$ and $t = T - 2$, as well as $t = T$. For the NIPS and DBLP matrices, which contain substantially fewer time steps than the SOTU matrix, we held out three time steps between $t = 2$ and $t = T - 2$, as well as $t = T$.

For each matrix, mask, and model combination, we ran inference four times.³ For the PGDS and GP-DPFA, we performed 6,000 Gibbs sampling iterations, imputing the missing counts from the "smoothing" columns at the same time as sampling the model parameters. We then discarded the first 4,000 samples and retained every hundredth sample thereafter. We used each of these samples to predict the missing counts from the "forecasting" columns. We then averaged the predictions over the samples. For the LDS, we ran EM to learn the model parameters. Then, given these parameter values, we used the Kalman filter and smoother [1] to predict the held-out data. In practice, for all five data sets, $V$ was too large for us to run inference for the LDS, which is $O((K + V)^3)$ [2], using all $V$ features. We therefore report results from two independent sets of experiments: one comparing all three models using only the top $V = 1{,}000$ features for each data set, and one comparing the PGDS to just GP-DPFA using all the features. The first set of experiments is generous to the LDS because the Poisson distribution is well approximated by the Gaussian distribution when its mean is large.

³For the PGDS and GP-DPFA we used $K = 100$. For the PGDS, we set $\tau_0 = 1$, $\gamma_0 = 50$, $\eta_0 = \epsilon_0 = 0.1$. We set the hyperparameters of GP-DPFA to the values used by Acharya et al. [22].
For the LDS, we used the default hyperparameters for pykalman, and report results for the best-performing value of K ∈ {5, 10, 25, 50}.

[Figure 3: y_v^{(t)} over time for the top four features in the NIPS data set (1988–2002) and the ICEWS data set (2009); the labeled ICEWS features are Israel↔Palestine, Russia↔USA, China↔USA, and Iraq↔USA.]

Table 1: Results for the smoothing ("S") and forecasting ("F") tasks. For both error measures, lower values are better. We also report the number of time steps T and the burstiness B̂ of each data set.

                          Mean Relative Error (MRE)                  Mean Absolute Error (MAE)
           T    B̂  Task  PGDS          GP-DPFA       LDS            PGDS          GP-DPFA        LDS
GDELT    365  1.27   S    2.335 ±0.19   2.951 ±0.32   3.493 ±0.53    9.366 ±2.19   9.278 ±2.01    10.098 ±2.39
                     F    2.173 ±0.41   2.207 ±0.42   2.397 ±0.29    7.002 ±1.43   7.095 ±1.67     7.047 ±1.25
ICEWS    365  1.10   S    0.808 ±0.11   0.877 ±0.12   1.023 ±0.15    2.867 ±0.56   2.872 ±0.56     3.104 ±0.60
                     F    0.743 ±0.17   0.792 ±0.17   0.937 ±0.31    1.788 ±0.47   1.894 ±0.50     1.973 ±0.62
SOTU     225  1.45   S    0.233 ±0.01   0.238 ±0.01   0.260 ±0.01    0.408 ±0.01   0.414 ±0.01     0.448 ±0.00
                     F    0.171 ±0.00   0.173 ±0.00   0.225 ±0.01    0.323 ±0.00   0.314 ±0.00     0.370 ±0.00
DBLP      14  1.64   S    0.417 ±0.03   0.422 ±0.05   0.405 ±0.05    0.771 ±0.03   0.782 ±0.06     0.831 ±0.01
                     F    0.322 ±0.00   0.323 ±0.00   0.369 ±0.06    0.747 ±0.01   0.715 ±0.00     0.943 ±0.07
NIPS      17  0.33   S    0.415 ±0.07   0.392 ±0.07   1.609 ±0.43   29.940 ±2.95  28.138 ±3.08   108.378 ±15.44
                     F    0.343 ±0.01   0.312 ±0.00   0.642 ±0.14   62.839 ±0.37  52.963 ±0.52    95.495 ±10.52

Results: We used two error measures—mean relative error (MRE) and mean absolute error (MAE)—to compute the models' smoothing and forecasting scores for each matrix and mask combination. We then averaged these scores over the masks. For the data sets with multiple matrices, we also averaged the scores over the matrices. The two error measures differ as follows: MRE accommodates the scale of the data, while MAE does not. This is because relative error—which we define as |y_v^{(t)} − ŷ_v^{(t)}| / (1 + y_v^{(t)}), where y_v^{(t)} is the true count and ŷ_v^{(t)} is the prediction—divides the absolute error by the true count and thus penalizes overpredictions more harshly than underpredictions. MRE is therefore an especially natural choice for data sets that are bursty—i.e., data sets that exhibit short periods of activity that far exceed their mean. Models that are robust to these kinds of overdispersed temporal patterns are less likely to make overpredictions following a burst, and are therefore rewarded accordingly by MRE.

In table 1, we report the MRE and MAE scores for the experiments using the top V = 1,000 features. We also report the average burstiness of each data set. We define the burstiness of feature v in matrix Y to be B̂_v = (1 / (T−1)) Σ_{t=1}^{T−1} |y_v^{(t+1)} − y_v^{(t)}| / µ̂_v, where µ̂_v = (1/T) Σ_{t=1}^{T} y_v^{(t)}. For each data set, we calculated the burstiness of each feature in each matrix, and then averaged these values to obtain an average burstiness score B̂. The PGDS outperformed the LDS and GP-DPFA on seven of the ten prediction tasks when we used MRE to measure the models' performance; when we used MAE, the PGDS outperformed the other models on five of the tasks. In the supplementary material, we also report the results for the experiments comparing the PGDS to GP-DPFA using all the features. The superiority of the PGDS over GP-DPFA is even more pronounced in these results.
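As a concrete reference, the two error measures and the burstiness score defined above can be sketched in NumPy as follows. This is a minimal sketch under our own conventions: the function names are ours, and the handling of all-zero features in the burstiness computation is an assumption, since the text does not specify it.

```python
import numpy as np

def mean_relative_error(y_true, y_pred):
    """MRE as defined in the text: mean of |y - yhat| / (1 + y).

    The +1 in the denominator keeps the measure finite at zero counts;
    dividing by the true count penalizes overpredictions (e.g., right
    after a burst) more harshly than underpredictions.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / (1.0 + y_true))

def mean_absolute_error(y_true, y_pred):
    """MAE: mean of |y - yhat|; unlike MRE, it ignores the data's scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def burstiness(Y):
    """Average burstiness of a (T, V) count matrix Y (rows are time steps).

    B_v = (1/(T-1)) * sum_t |y_v^{(t+1)} - y_v^{(t)}| / mu_v, where
    mu_v = (1/T) * sum_t y_v^{(t)}. Features with zero mean are skipped
    (an assumption; the paper does not say how they are treated).
    """
    Y = np.asarray(Y, dtype=float)
    mu = Y.mean(axis=0)
    keep = mu > 0
    # np.diff along time gives the (T-1, V) array of successive changes;
    # .mean(axis=0) supplies the 1/(T-1) factor.
    B_v = np.abs(np.diff(Y[:, keep], axis=0)).mean(axis=0) / mu[keep]
    return B_v.mean()
```

For example, a feature that is flat over time has burstiness 0, while one that jumps from 0 to twice its mean has burstiness 2; scores above 1, as for GDELT, ICEWS, SOTU, and DBLP in table 1, indicate typical step-to-step changes exceeding the feature's mean.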
We hypothesize that the difference between these models is related to the burstiness of the data. For both error measures, the only data set for which GP-DPFA outperformed the PGDS on both tasks was the NIPS data set. This data set has a substantially lower average burstiness score than the other data sets. We provide visual evidence in figure 3, where we display y_v^{(t)} over time for the top four features in the NIPS and ICEWS data sets. For the former, the features evolve smoothly; for the latter, they exhibit bursts of activity.

Exploratory analysis: We also explored the latent structure inferred by the PGDS. Because its parameters are positive, they are easy to interpret. In figure 1, we depict three components inferred from the NIPS data set. By examining the time-step factors and feature factors for these components, we see that they capture the decline of research on neural networks between 1987 and 2003, as well as the rise of Bayesian methods in machine learning. These patterns match our prior knowledge.

In figure 4, we depict the three components with the largest component weights inferred by the PGDS from the 2003 GDELT matrix. The top component is in blue, the second is in green, and the third is in red. For each component, we also list the sixteen features (directed pairs of countries) with the largest feature factors. The top component (blue) is most active in March and April, 2003. Its features involve USA, Iraq (IRQ), Great Britain (GBR), Turkey (TUR), and Iran (IRN), among others. This component corresponds to the 2003 invasion of Iraq. The second component (green) exhibits a noticeable increase in activity immediately after April, 2003. Its top features involve Israel (ISR), Palestine (PSE), USA, and Afghanistan (AFG). The third component exhibits a large burst of activity in August, 2003, but is otherwise inactive. Its top features involve North Korea (PRK), South Korea (KOR), Japan (JPN), China (CHN), Russia (RUS), and USA. This component corresponds to the six-party talks—a series of negotiations between these six countries for the purpose of dismantling North Korea's nuclear program. The first round of talks occurred during August 27–29, 2003.

Figure 4: The time-step factors for the top three components inferred by the PGDS from the 2003 GDELT matrix. The top component is in blue, the second is in green, and the third is in red. For each component, we also list the features (directed pairs of countries) with the largest feature factors.

In figure 5, we also show the component weights for the top ten components, along with the corresponding subset of the transition matrix Π. There are two components with weights greater than one: the components that are depicted in blue and green in figure 4. The transition weights in the corresponding rows of Π are also large, meaning that other components are likely to transition to them. As we mentioned previously, the GDELT data set was extracted from news corpora. Therefore, patterns in the data primarily reflect patterns in media coverage of international affairs. We therefore interpret the latent structure inferred by the PGDS in the following way: in 2003, the media briefly covered various major events, including the six-party talks, before quickly returning to a backdrop of the ongoing Iraq war and Israeli–Palestinian relations. By inferring the kind of transition structure depicted in figure 5, the PGDS is able to model persistent, long-term temporal patterns while accommodating the burstiness often inherent to real-world count data. This ability is what enables the PGDS to achieve superior predictive performance over the LDS and GP-DPFA.

5 Summary

We introduced the Poisson–gamma dynamical system (PGDS)—a new Bayesian nonparametric model for sequentially observed multivariate count data.
This model supports the expressive transition structure of the linear dynamical system, and naturally handles overdispersed data. We presented a novel MCMC inference algorithm that remains efficient for high-dimensional data sets, advancing recent work on augmentation schemes for inference in negative binomial models. Finally, we used the PGDS to analyze five real-world data sets, demonstrating that it exhibits superior smoothing and forecasting performance over two baseline models and infers highly interpretable latent structure.

Figure 5: The latent transition structure inferred by the PGDS from the 2003 GDELT matrix. Top: The component weights for the top ten components, in decreasing order from left to right; two of the weights are greater than one. Bottom: The transition weights in the corresponding subset of the transition matrix. This structure means that all components are likely to transition to the top two components.

Acknowledgments

We thank David Belanger, Roy Adams, Kostis Gourgoulias, Ben Marlin, Dan Sheldon, and Tim Vieira for many helpful conversations. This work was supported in part by the UMass Amherst CIIR and in part by NSF grants SBE-0965436 and IIS-1320219. Any opinions, findings, conclusions, or recommendations are those of the authors and do not necessarily reflect those of the sponsors.

References

[1] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.

[2] Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, pages 431–437, 1998.

[3] S. S. Haykin. Kalman Filtering and Neural Networks. 2001.

[4] P. McCullagh and J. A. Nelder. Generalized Linear Models. 1989.

[5] M. G. Bulmer.
On fitting the Poisson lognormal distribution to species-abundance data. Biometrics, pages 101–110, 1974.

[6] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, pages 113–120, 2006.

[7] L. Charlin, R. Ranganath, J. McInerney, and D. M. Blei. Dynamic Poisson factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 155–162, 2015.

[8] J. H. Macke, L. Buesing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems, pages 1350–1358, 2011.

[9] J. Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.

[10] M. Zhou and L. Carin. Augment-and-conquer negative binomial processes. In Advances in Neural Information Processing Systems, pages 2546–2554, 2012.

[11] R. Corless, G. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996.

[12] J. Canny. GaP: A factor model for discrete data. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 122–129, 2004.

[13] A. T. Cemgil. Bayesian inference for nonnegative matrix factorisation models. Computational Intelligence and Neuroscience, 2009.

[14] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, 2012.

[15] P. Gopalan, J. Hofman, and D. Blei. Scalable recommendation with Poisson factorization. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, 2015.

[16] T. S. Ferguson. A Bayesian analysis of some nonparametric problems.
The Annals of Statistics, 1(2):209–230, 1973.

[17] M. Zhou. Infinite edge partition models for overlapping community detection and link prediction. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 1135–1143, 2015.

[18] M. Zhou, Y. Cong, and B. Chen. Augmentable gamma belief networks. Journal of Machine Learning Research, 17(163):1–44, 2016.

[19] M. E. Basbug and B. Engelhardt. Hierarchical compound Poisson factorization. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[20] R. M. Adelson. Compound Poisson distributions. OR, 17(1):73–75, 1966.

[21] M. Zhou and L. Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):307–320, 2015.

[22] A. Acharya, J. Ghosh, and M. Zhou. Nonparametric Bayesian factor analysis for dynamic count matrices. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[23] J. F. C. Kingman. Poisson Processes. Oxford University Press, 1972.

[24] D. B. Dunson and A. H. Herring. Bayesian latent variable models for mixed discrete outcomes. Biostatistics, 6(1):11–25, 2005.

[25] W. J. Rugh. Linear System Theory. Pearson, 1995.

[26] M. Zhou. Nonparametric Bayesian negative binomial factor analysis. arXiv:1604.07464.

[27] J. Durbin and S. J. Koopman. Time Series Analysis by State Space Methods. Oxford University Press, 2012.