{"title": "Deep Dynamic Poisson Factorization Model", "book": "Advances in Neural Information Processing Systems", "page_first": 1666, "page_last": 1674, "abstract": "A new model, named as deep dynamic poisson factorization model, is proposed in this paper for analyzing sequential count vectors. The model based on the Poisson Factor Analysis method captures dependence among time steps by neural networks, representing the implicit distributions. Local complicated relationship is obtained from local implicit distribution, and deep latent structure is exploited to get the long-time dependence. Variational inference on latent variables and gradient descent based on the loss functions derived from variational distribution is performed in our inference. Synthetic datasets and real-world datasets are applied to the proposed model and our results show good predicting and fitting performance with interpretable latent structure.", "full_text": "Deep Dynamic Poisson Factorization Model\n\nDepartment of Information Management\n\nDepartment of Information Management\n\nChengyue Gong\n\nPeking University\n\ncygong@pku.edu.cn\n\nWin-bin Huang\n\nPeking University\n\nhuangwb@pku.edu.cn\n\nAbstract\n\nA new model, named as deep dynamic poisson factorization model, is proposed\nin this paper for analyzing sequential count vectors. The model based on the\nPoisson Factor Analysis method captures dependence among time steps by neural\nnetworks, representing the implicit distributions. Local complicated relationship\nis obtained from local implicit distribution, and deep latent structure is exploited\nto get the long-time dependence. Variational inference on latent variables and\ngradient descent based on the loss functions derived from variational distribution is\nperformed in our inference. 
Synthetic and real-world datasets are applied to the proposed model, and the results show good prediction and fitting performance with an interpretable latent structure.\n\n1 Introduction\n\nThere has been growing interest in analyzing sequentially observed count vectors x1, x2, . . . , xT. Such data appears in many real-world applications, such as recommender systems, text analysis, network analysis and time series analysis. Analyzing such data requires overcoming computational and statistical challenges, since the data are often high-dimensional, sparse, and exhibit complex dependence across time steps. For example, when analyzing the dynamic word-count matrix of research papers, the number of words used is large and many words appear only a few times. Although we know the trend that one topic may encourage researchers to write papers on related topics in the following year, the relationship among the time steps and topics is still hard to analyze completely.\nBayesian factor analysis models have recently achieved success in modeling sequentially observed count matrices. They assume the data is Poisson distributed and model it under Poisson Factor Analysis (PFA). PFA factorizes a count matrix as X \u223c Poisson(\u03a6\u0398^T), where \u03a6 \u2208 R+^{V\u00d7K} is the loading matrix and \u0398 \u2208 R+^{T\u00d7K} is the factor score matrix. The assumption \u03b8t \u223c Gamma(\u03b8t\u22121, \u03b2t) is then included [1, 2] to smooth the transition through time. Using properties of the Gamma-Poisson distribution and the Gamma-NB process, inference via MCMC is performed in these models. To address their inability to capture relationships between factors, a transition matrix is included in the Poisson-Gamma Dynamical System (PGDS) [2]. However, these models may still have shortcomings in exploring long-time dependence among the time steps, as \u03b8t\u22121 and \u03b8t+1 are assumed independent given \u03b8t. 
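The Gamma-Markov smoothing assumption \u03b8t \u223c Gamma(\u03b8t\u22121, \u03b2t) used by these models can be simulated directly. A minimal NumPy sketch, with shape/rate parameterization; the initial value, rate, and chain length here are illustrative choices, not taken from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def gamma_markov_chain(theta0, beta, T):
    """Simulate theta_t ~ Gamma(shape=theta_{t-1}, rate=beta)."""
    thetas = [theta0]
    for _ in range(T - 1):
        # Floor the shape for numerical stability (an illustrative choice).
        shape = max(thetas[-1], 1e-3)
        # NumPy parameterizes Gamma by shape and scale = 1 / rate.
        thetas.append(rng.gamma(shape, 1.0 / beta))
    return np.array(thetas)

chain = gamma_markov_chain(theta0=5.0, beta=1.0, T=50)
# With beta = 1, E[theta_t | theta_{t-1}] = theta_{t-1}, so the chain
# drifts smoothly; observed counts are then Poisson draws around it.
counts = rng.poisson(chain)
```

Since E[\u03b8t | \u03b8t\u22121] = \u03b8t\u22121/\u03b2t under this chain, the rate \u03b2t controls how strongly the factor scores are shrunk between steps, which is the smoothing effect exploited in [1, 2].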
In text analysis problems, the temporal Dirichlet process [3] is used to capture the time dependence of each topic using a given decay rate. This method may have weak points when analyzing other data with different patterns of long-time dependence, such as financial data and disaster data [3].\nDeep models, also called hierarchical models in the Bayesian learning field, are widely used to fit deep relationships between latent variables. Examples include the nested Chinese Restaurant Process [4], the nested hierarchical Dirichlet process [5], deep Gaussian processes [6, 7], and so on. Some models based on neural-network or recurrent structures are also used, such as Deep Exponential Families [8], Deep Poisson Factor Analysis based on the RBM or SBN [9, 10], the Neural Autoregressive Density Estimator based on neural networks [11], Deep Poisson Factor Modeling with a recurrent structure based on PFA using a Bernoulli-Poisson link [12], and Deep Latent Dirichlet Allocation using stochastic gradient MCMC [23]. These models capture deep relationships missed by the shallow models, and often outperform them.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: The visual representation of our model. In (a), the structure of the one-layer model is shown. (b) shows the transmission of the posterior information. The prior and posterior distributions between interfacing layers are shown in (c) and (d).\n\nIn this paper, we present the Deep Dynamic Poisson Factor Analysis (DDPFA) model. Based on PFA, our model includes recurrent neural networks to represent implicit distributions, in order to learn complicated relationships between different factors over short time spans. A deep structure is included in order to capture long-time dependence. An inference algorithm based on variational inference is used for inferring the latent variables. 
Parameters in the neural networks are learnt according to a loss function based on the variational distributions. Finally, the DDPFA model is applied to several synthetic and real-world datasets, and excellent results are obtained in prediction and fitting tasks.\n\n2 Deep Dynamic Poisson Factorization Model\n\nAssume that the V-dimensional sequentially observed count data x1, x2, . . . , xT are represented as a V \u00d7 T count matrix X. A count xvt \u2208 {0, 1, . . .} is generated by the proposed DDPFA model as follows:\n\nxvt \u223c Poisson(\u2211_{k=1}^{K} \u03b8tk \u03c6vk \u03bbk \u03c8v)    (1)\n\nwhere the latent variables \u03b8tk, \u03c6vk, \u03bbk and \u03c8v are all positive. \u03c6k represents the strength of the kth component and is treated as a factor. \u03b8tk represents the strength of the kth component at the tth time step. The feature-wise variable \u03c8v captures the sparsity of the vth feature, and \u03bbk recognizes the importance of the kth component. According to the regular setting in [2, 13-16], the factorization is regarded as X \u223c Poisson(\u03a6\u0398^T), and \u039b and \u03a8 can be absorbed into \u0398. In this paper, in order to extract the sparsity of the vth feature or the kth component and to impose a feature-wise or temporal smoothness constraint, \u03a8 and \u039b are included in our model. The additive property of the Poisson distribution is used to decompose the observed count xvt into K latent counts xvtk, k \u2208 {1, . . . , K}. In this way, the model is rewritten as:\n\nxvt = \u2211_{k=1}^{K} xvtk and xvtk \u223c Poisson(\u03b8tk \u03c6vk \u03bbk \u03c8v)    (2)\n\nCapturing the complicated temporal dependence of \u0398 is the major purpose of this paper. In previous work, a transition via the Gamma-Gamma-Poisson structure is used, where \u03b8t \u223c Gamma(\u03b8t\u22121, \u03b2t) [1]. 
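Equations (1) and (2) can be read generatively. The following NumPy fragment samples all latent variables from illustrative Gamma priors (the sizes and hyperparameters are placeholders, not the paper's settings) and checks that the K latent rates add up to the total Poisson rate, which is exactly the additivity that Eq. (2) relies on:

```python
import numpy as np

rng = np.random.default_rng(1)
V, T, K = 8, 20, 3  # illustrative sizes

# Gamma draws standing in for the priors on the latent variables
phi = rng.gamma(1.0, 1.0, size=(V, K))    # loading matrix, Phi
theta = rng.gamma(1.0, 1.0, size=(T, K))  # factor scores, Theta
lam = rng.gamma(1.0, 1.0, size=K)         # component importance, Lambda
psi = rng.gamma(1.0, 1.0, size=V)         # feature-wise sparsity, Psi

# Eq. (1): x_vt ~ Poisson(sum_k theta_tk * phi_vk * lambda_k * psi_v)
rate = (psi[:, None] * phi * lam) @ theta.T  # V x T matrix of Poisson rates
X = rng.poisson(rate)

# Eq. (2): by Poisson additivity, x_vt decomposes into K latent counts
# x_vtk ~ Poisson(theta_tk * phi_vk * lambda_k * psi_v).
rate_k = psi[:, None, None] * phi[:, None, :] * lam * theta[None, :, :]  # V x T x K
assert np.allclose(rate_k.sum(-1), rate)
```

Given an observed x_vt, the same additivity implies the latent counts (x_vt1, . . . , x_vtK) follow a multinomial with normalized rates, which is used later in the inference section.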
A non-homogeneous Poisson process over time, modeling the stochastic transition over different features, is exploited in Poisson process models [17-19]. These models are then trained via MCMC or variational inference. However, it is difficult for these models to capture complicated time dependence because of their shallow structure in the time dimension. In order to capture the complex time dependence over \u0398, a deep model of long-time dependence with a dynamic structure over time steps is proposed. The first layer over \u0398 is as follows:\n\n\u03b8t \u223c p(\u03b8t | h(0)_{t\u2212c}, ..., h(0)_t)    (3)\n\nwhere c is the size of a window for analysis, and the latent variables in the nth layer, n \u2264 N, are indicated as follows:\n\nh(n)_t \u223c p(h(n)_t | h(n+1)_{t\u2212c}, ..., h(n+1)_t) and h(N)_t \u223c p(h(N)_t | h(N)_{t\u2212c\u22121}, ..., h(N)_{t\u22121})    (4)\n\nwhere the implicit probability distribution p(h(n)_t | \u00b7) is modeled as a recurrent neural network. A Probability AutoEncoder with an auxiliary posterior distribution p(h(n)_t | h(n\u22121)_t, ..., h(n\u22121)_{t+c}), also modeled as a neural network, is exploited in our training phase. h(n)_t is a K-dimensional latent variable in the nth layer at the tth time step. Specifically, in the Nth layer, h(N)_t is generated from a Gamma distribution with h(N)_{t\u2212c\u22121:t\u22121} as the prior information. This structure is illustrated in Figure 1.\n\n2\n\n\fFinally, prior parameters are placed over other latent variables for Bayesian inference. 
These variables\nare generated as: \u03c6vk \u223c Gamma(\u03b1\u03c6, \u03b2\u03c6) and\u03bbk \u223c Gamma(\u03b1\u03bb, \u03b2\u03bb) and \u03c8v \u223c Gamma(\u03b1\u03c8, \u03b2\u03c8).\nAlthough Dirichlet distribution is often used as prior distribution [13, 14, 15] over \u03c6vk in previous\nworks, a Gamma distribution is exploited in our model due to the including of feature-wise parameter\n\u03c8v and the purpose for obtaining feasible factor strength of \u03c6k.\nIn real world applications, like recommendation systems, the observed binary count data can be\nformulated by the proposed DDPFA model with a Bernoulli-Poisson link [1]. The distribution of b\ngiven \u03bb is called Bernoulli-Poisson distribution as: b = 1(x > 1), x \u223c P oisson(\u03bb) and the linking\ndistribution is rewritten as: f (b|x, \u03bb) = e\u2212\u03bb(1\u2212b)(1 \u2212 e\u2212\u03bb)\nb. The conditional posterior distribution\nis then (x|b, \u03bb) \u223c b \u00b7 P oisson+(\u03bb), where P oisson+(\u03bb) is a truncated Poisson distribution, so the\nMCMC or VI methods can be used to do inference. Non-count real-valued matrix can also be linked\nto a latent count matrix via a compound Poisson distribution or a Gamma belief network [20].\n\nt\n\n3\n\nInference\n\nq(\u0398, \u03a6, \u03a8, \u039b, H) =(cid:81)\n\nThere are many classical inference approaches for Bayesian probabilistic model, such as Monte Carlo\nmethods and variational inference. In the proposed method, variational inference is exploited because\nthe implicit distribution is regarded as prior distribution over \u0398. Two stages of inference in our model\nare adopted: the \ufb01rst stage updates latent variables by the coordinate-ascent method with a \ufb01xed\nimplicit distribution, and the parameters in neural networks are learned in the second one.\nMean-\ufb01eld Approximation: In order to obtain mean-\ufb01eld VI, all variables are independent and\ngoverned by its own variational distribution. 
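The Bernoulli-Poisson link described above can be sketched numerically; note that the stated linking distribution f(b|\u03bb) = e^{\u2212\u03bb(1\u2212b)}(1 \u2212 e^{\u2212\u03bb})^b corresponds to b = 1(x \u2265 1). The rate value and the rejection-sampling scheme for the truncated Poisson below are illustrative choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_poisson_sample(lam):
    """b = 1(x >= 1) with x ~ Poisson(lam), so P(b = 1) = 1 - exp(-lam)."""
    return (rng.poisson(lam) >= 1).astype(int)

def truncated_poisson_sample(lam, size):
    """Draw from Poisson_+(lam), the zero-truncated Poisson, by rejection;
    this gives the conditional (x | b = 1, lam) ~ Poisson_+(lam)."""
    out = np.zeros(size, dtype=int)
    todo = np.ones(size, dtype=bool)
    while todo.any():
        draw = rng.poisson(lam, size=todo.sum())
        idx = np.flatnonzero(todo)
        out[idx[draw >= 1]] = draw[draw >= 1]  # keep accepted (nonzero) draws
        todo = out == 0
    return out

lam = 0.7  # illustrative rate
b = bernoulli_poisson_sample(np.full(100000, lam))
# Empirical P(b = 1) approaches 1 - exp(-0.7); all truncated draws are >= 1.
x_plus = truncated_poisson_sample(lam, 1000)
```

Because the conditional (x | b, \u03bb) is either zero or a truncated Poisson, the latent count x can be resampled inside MCMC or VI exactly as the text describes.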
The joint distribution of the variational distribution is\nwritten as:\n\nv,t,n,k q(\u03c6vk|\u03c6\u2217\n\nvk)q(\u03c8k|\u03c8\u2217\n\nk)q(\u03b8tk|\u03b8\u2217\n\ntk)q(\u03bbk|\u03bb\u2217\n\n)\n\nk)q(h(n)\n\n(5)\nwhere y\u2217 represents the prior variational parameter of the variable y. The variational parameters \u03bd\nare \ufb01tted to minimize the KL divergence:\n\n\u03bd = argmin\u03bd\u2217 KL(p(\u0398, \u03a6, \u03a8, \u039b, H|X)||q(\u0398, \u03a6, \u03a8, \u039b, H|\u03bd))\n\n(6)\nThe variational distribution q(\u00b7|\u03bd\u2217) is then used as a proxy for the posterior. The objective actually is\nequal to maximize the evidence low bound (ELBO) [19]. The optimization can be performed by a\ncoordinate-ascent method or a variational-EM method. As a result, each variational parameter can be\noptimized iteratively while the remaining parameters of the model are set to \ufb01xed value. Due to Eq.\n2, the conditional distribution of (xvt1, . . . , xvtk) is a multinomial while its parameter is normalized\nset of rates [19] and formulated as:\n\n(xvt1, . . . 
, xvtk)|\u03b8t, \u03c6v, \u03bb, \u03c8v \u223c M ult(xvt\u00b7; \u03b8t\u03c6v\u03bb\u03c6v/(cid:80)\n\nk \u03b8tk\u03c6vk\u03bbk\u03c8v)\n\n(7)\n\ntk |h(n)\u2217\n\ntk\n\nGiven the auxiliary variables xvtk, the Poisson factorization model is a conditional conjugate model.\nThe complete conditional of the latent variables is Gamma distribution and shown as:\n\n\u03c6vk|\u0398, \u039b, \u03a8, \u03b1, \u03b2, X \u223c Gamma(\u03b1\u03c6 + xv\u00b7k, \u03b2\u03c6 + \u03bbk\u03c8v\u03b8\u00b7k)\n\u03bbk|\u0398, \u03a6, \u03a6, \u03b1, \u03b2, X \u223c Gamma(\u03b1\u03bb + x\u00b7\u00b7k, \u03b2\u03b8 + \u03b8\u00b7k\n\n\u03c8v|\u0398, \u039b, \u03a6, \u03b1, \u03b2, X \u223c Gamma(\u03b1\u03c8 + xv\u00b7\u00b7, \u03b2\u03c8 +(cid:80)\n\nv \u03c8v\u03c6vk)\nk \u03bbk\u03c6vk\u03b8\u00b7k)\n\n(cid:80)\n\n(8)\n\n3\n\n\fGenerally, these distributions are derived from conjugate properties between Poisson and Gamma\ndistribution. The posterior distribution of \u03b8tk described in Eq. 3 can be a Gamma distribution while\nthe prior h(0)\n\nt\u2212c:t is given as:\n\n\u03b8tk|\u03a8, \u039b, \u03a6, h(0), \u03b2, X \u223c Gamma(\u03b1\u03b8tk + xv\u00b7k, \u03b2\u03b8 + \u03bbk\n\nv \u03c8v\u03c6vk)\n\n(9)\n\n(cid:80)\n\nwhere \u03b1\u03b8tk is calculated through a recurrent neural network with (h(0)\nthe posterior distribution of h(0)\n\ntk described in Eq. 4 is given as:\ntk |\u0398, h(1), \u03b2, X \u223c Gamma(\u03b1h(0)\nh(0)\n\nt\u2212c, ..., h(0)\n\nt ) as its inputs. Then\n\n+ \u03b3h(0)\n\ntk\n\ntk\n\n, \u03b2h)\n\n(10)\n\ntk\n\ntk\n\ntk\n\nt\u2212c\n\nis the prior information given by the (n + 1)th layer, \u03b3h(n)\n\nwhere \u03b1h(n)\ngiven by the (n \u2212 1)th layer. Here, the notation h(\u22121)\n, ..., h(n+1)\na recurrent neural network using (h(n+1)\na recurrent neural network using (h(n\u22121)\n, ..., h(n\u22121)\nmentioned in Eq. 9 can be regarded as an implicit conditional distribution of \u03b8tk given (h(0)\nAnd the distribution in Eq. 
10 is an implicit distribution of \u03b1h(n)\n(h(n\u22121)\nVariational Inference: Mean \ufb01eld variational inference can approximate the latent variables while\nall parameters of a neural network are given. If the observed data satis\ufb01es xvt > 0, the auxiliary\nvariables xvtk can be updated by:\n\nis the posterior information\nis calculated through\nis equal to \u03b8tk. \u03b1h(n)\n) as its inputs. \u03b3h(n)\nis calculated through\n) as its inputs. Therefore, the distribution\nt\u2212c, ..., h(0)\nt ).\n) and\n\n, ..., h(n\u22121)\n\ngiven (h(n+1)\n\n, ..., h(n+1)\n\nt\u2212c\n\nt+c\n\nt+c\n\n).\n\ntk\n\ntk\n\ntk\n\nt\n\nt\n\nt\n\nt\n\nxvtk \u221d exp{\u03a8 (\u03b8shp\n\ntk ) \u2212 log\u03b8rte\n\ntk + \u03a8 (\u03bbshp\n\nk\n\n) \u2212 log\u03bbrte\n\nk\n\n+ \u03a8 (\u03c6shp\n\nvk ) \u2212 log\u03c6rte\n\nvk + \u03a8 (\u03c8shp\n\nv\n\n) \u2212 log\u03c8rte\nv }\n\n(11)\n\nwhere \u03a8 (\u00b7) is the digamma function. Variables with the superscript \u201cshp\u201d indicate the shape parameter\nof Gamma distribution, and those with the superscript \u201crte\u201d are the rate parameter of it. This update\ncomes from the expectation of the logarithm of a Gamma variable as (cid:104)log\u03b8(cid:105) = \u03a8 (\u03b8shp) \u2212 log(\u03b8rte).\nHere, \u03b8 is generated from a Gamma distribution and (cid:104)\u00b7(cid:105) represents the expectation of the variable.\nCalculation of the expectation of the variable, obeyed Gamma distribution, is noted as (cid:104)\u03b8(cid:105) =\n\u03b8shp/\u03b8rte. 
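The two expectation identities above, \u27e8log \u03b8\u27e9 = digamma(\u03b8shp) \u2212 log \u03b8rte and \u27e8\u03b8\u27e9 = \u03b8shp/\u03b8rte, and the multinomial allocation of the auxiliary counts behind the Eq. (11) update can be checked numerically. A sketch, where the shape, rate, and log-weights are arbitrary stand-ins for the model's variational parameters:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(3)

# For theta ~ Gamma(shape, rate): <log theta> = digamma(shape) - log(rate).
shape, rate = 2.5, 1.5
analytic = digamma(shape) - np.log(rate)
monte_carlo = np.log(rng.gamma(shape, 1.0 / rate, size=400_000)).mean()
assert abs(analytic - monte_carlo) < 0.01

# Eq. (11)-style update: allocate an observed count x_vt among K components
# in proportion to exp of the summed log-expectations (stand-in values here).
K = 3
log_w = rng.normal(size=K)        # placeholder for the bracketed sums in Eq. (11)
p = np.exp(log_w - log_w.max())
p /= p.sum()                      # normalized rates, as in Eq. (7)
x_vt = 17
x_vtk = rng.multinomial(x_vt, p)  # auxiliary latent counts
assert x_vtk.sum() == x_vt
```

Subtracting the maximum before exponentiating is a standard numerical safeguard; the normalization makes the allocation exactly the multinomial of Eq. (7).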
Variables can be updated by mean-\ufb01eld method as:\n\n\u03c6vk \u223c Gamma(\u03b1\u03c6 + (cid:104)xv\u00b7k(cid:105), \u03b2\u03c6 + (cid:104)\u03bbk(cid:105)(cid:104)\u03c8v(cid:105)(cid:104)\u03b8\u00b7k(cid:105))\n\n\u03bbk \u223c Gamma(\u03b1\u03bb + (cid:104)x\u00b7\u00b7k(cid:105), \u03b2\u03b8 + (cid:104)\u03b8\u00b7k(cid:105)(cid:80)\n\u03c8v \u223c Gamma(\u03b1\u03c8 + (cid:104)xv\u00b7\u00b7(cid:105), \u03b2\u03c8 +(cid:80)\n\u03b8tk \u223c Gamma(\u03b1\u03b8tk + (cid:104)xv\u00b7k(cid:105), \u03b2\u03b8 + (cid:104)\u03bbk(cid:105)(cid:80)\n\nv (cid:104)\u03c8v(cid:105)(cid:104)\u03c6vk(cid:105))\nk (cid:104)\u03bbk(cid:105)(cid:104)\u03c6vk(cid:105)(cid:104)\u03b8\u00b7k(cid:105))\n\nv (cid:104)\u03c8v(cid:105)(cid:104)\u03c6vk(cid:105))\n\nThe latent variables in the deep structure can also be updated by mean-\ufb01eld method:\n\ntk \u223c Gamma(\u03b1h(n)\nh(n)\n= ff eed((cid:104)hn+1(cid:105)), \u03b1h(N )\n\n= ff eed((cid:104)hN\n\ntk\n\n+ \u03b3h(n)\n\ntk\n\n, \u03b2h)\n\nt\u2212c\u22121:t\u22121(cid:105)) and \u03b3h(n)\n\n= fback((cid:104)hn\u22121(cid:105)), \u03b3h(N )\n\nt\n\nt\n\n=\nt+c+1:t+1(cid:105)). ff eed(\u00b7) is the neural network constructing the prior distribution and fback(\u00b7)\n\nwhere \u03b1h(n)\nfback((cid:104)hN\nis the neural network constructing the posterior distribution.\nProbability AutoEncoder: This stage of the inference is to update the parameters of the neural\nnetworks. The bottom layer is used by us as an example. Given all latent variables, these parameters\ncan be approximated by p(\u03b8t|h(0)\nt ) =\nGamma(\u03b1\u03b8t, \u03b2h) is modeled by a RNN with the inputs (h(0)\nt ) and the outputs, \u03b1\u03b8t. The\n\n|\u03b8t+c, ..., \u03b8t). 
p(\u03b8(n)\nt\u2212c, ..., h(0)\n\nt ) and p(h(0)\n\nt\u2212c, ..., h(0)\n\nt\u2212c, ..., h(0)\n\n|h(0)\n\nt\n\nt\n\nt\n\nt\n\n(12)\n\n(13)\n\n(14)\n\n4\n\n\ft\n\n|\u03b8t+c, ..., \u03b8t) is also modeled as a RNN with the inputs (\u03b8t+c, ..., \u03b8t) and the outputs ,\u03b3h(0)\np(h(0)\n. With the posterior distribution from \u0398 to H (0) and the prior distribution from H (0) to \u0398, the\nprobability of \u0398 should be maximized. The loss function of these two neural networks is as follows:\n(15)\n\n{(cid:82) p(\u0398|H (0))p(H (0)|\u0398)dH (0)}\n\nt\n\nmax\nW\n\nis generated by \u0398,\n\nwhere W represents the parameters in neural networks. Because the integration in Eq. 15\nis intractable, a new loss function should include auxiliary variational variables H (0)(cid:48). As-\nsume that H (0)(cid:48)\nthe optimization can be regarded as maximizing the\n{p(\u0398|H (0))} and\nprobability of \u0398 with minimal difference between H (0)(cid:48) and H (0) as max\nmin\nW\nThen approximating the variables generated from a distribution by its expectation, the loss function,\nsimilar to variational AutoEncoder [21], can be simpli\ufb01ed to:\n\n{KL(p(H (0)(cid:48)|\u0398)||p(H (0)|H (1))}\n\nW\n\n{(cid:107)(cid:104)p(H (0)(cid:48)|\u0398)(cid:105) \u2212 (cid:104)p(H (0)|H (1))(cid:105)(cid:107)2 + (cid:107)\u0398 \u2212 (cid:104)p(\u0398|H (0))(cid:105)(cid:107)2}\n\n(16)\n\nmin\nW\n\nSince only a few samples are drawn from one certain distrbution, which means sampling all latent\nvariables is high-cost and useless, differentiable variational Bayes is not suitable. As a result, we\nfocus more on \ufb01tting data than generating data. 
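The simplified two-term objective of Eq. (16) can be sketched with stand-in networks. Here the paper's RNNs are replaced by single linear maps with a softplus output, and all weights and sizes are purely illustrative; the point is only to show the structure of the loss:

```python
import numpy as np

rng = np.random.default_rng(4)
T, K = 20, 3

theta = rng.gamma(1.0, 1.0, size=(T, K))     # factor scores (fixed by mean-field VI)
h1 = rng.gamma(1.0, 1.0, size=(T, K))        # second-layer latent variables
W_back = rng.normal(scale=0.1, size=(K, K))  # stand-in for f_back (posterior net)
W_feed = rng.normal(scale=0.1, size=(K, K))  # stand-in for f_feed (prior net)
W_dec = rng.normal(scale=0.1, size=(K, K))   # stand-in decoder for <p(Theta|H0)>

def softplus(z):
    # Keeps the predicted Gamma means positive.
    return np.log1p(np.exp(z))

def eq16_loss(theta, h1):
    """||<p(H0'|Theta)> - <p(H0|H1)>||^2 + ||Theta - <p(Theta|H0)>||^2."""
    h0_post = softplus(theta @ W_back)   # posterior expectation of H(0)
    h0_prior = softplus(h1 @ W_feed)     # prior expectation of H(0)
    recon = softplus(h0_prior @ W_dec)   # reconstruction of Theta
    match = ((h0_post - h0_prior) ** 2).sum()
    fit = ((theta - recon) ** 2).sum()
    return match + fit

val = eq16_loss(theta, h1)
```

In the model these expectations come from the recurrent networks over the window h(0)_{t\u2212c:t}; minimizing this loss with respect to the network weights is the second stage of the inference.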
In our objective, the \ufb01rst term, a regularization,\nencourages the data to be reconstructed from the latent variables, and the second term encourages the\ndecoder to \ufb01t the data.\nThe parameters in the networks for nth and (n + 1)th layer are trained by the loss function:\n\nmin\nW\n\n{(cid:107)(cid:104)p(H (n+1)(cid:48)|H (n))(cid:105) \u2212 (cid:104)p(H (n)|H (n+1))(cid:105)(cid:107)2\n+ (cid:107)H (n) \u2212 (cid:104)p(H (n)|H (n+1))(cid:105)(cid:107)2}\n\n(17)\n\nIn order to make the convergence more stable, the term of \u0398 in the \ufb01rst layer is collapsed into X by\nusing the \ufb01xed latent variables approximated by mean-\ufb01eld VI, and the loss function is as follows:\n(18)\n\n{(cid:107)(cid:104)p(H (0)(cid:48)|\u0398)(cid:105) \u2212 (cid:104)p(H (0)|H (1))(cid:105)(cid:107)2 + (cid:107)X \u2212 (cid:104)\u03a8(cid:105)(cid:104)\u039b(cid:105)(cid:104)\u03a6(cid:105)(cid:104)p(\u0398|H (0))(cid:105)(cid:107)2}\n\nmin\nW\n\nAfter the layer-wise training, all the parameters in neural networks are jointly trained by the \ufb01ne-tuning\ntrick in stacked AutoEncoder [22].\n\n4 Experiments\n\nIn this section, four multi-dimensional synthetic datasets and \ufb01ve real-world datasets are exploited to\nexamine the performance of the proposed model. Besides, the results of three existed methods, PGDS,\nLSTM, and PFA, are compared with results of our model. PGDS is a dynamic Poisson-Gamma\nsystem mentioned in Section 1, and LSTM is a classical time sequence model. In order to prove the\ndeep relationship learnt by the deep structure can improve the performance, a simple PFA model is\nalso included as a baseline.\nAll hyperparameters of PGDS set in [2] are used in this paper. 1000 times gibbs sampling iterations\nfor PGDS is performed, 100 iterations used mean-\ufb01eld VI for PFA is performed, and 400 epochs is\nexecuted for LSTM. 
The parameters in the proposed DDPFA model are set as follows:\u03b1(\u03bb,\u03c6,\u03c8) =\n1, \u03b2(\u03bb,\u03c6,\u03c8) = 2, \u03b1(\u03b8,h) = 1, \u03b2(\u03b8,h) = 1. The iterations is set to 100. The stochastic gradient\ndescent for the neural networks is executed 10 epochs in each iteration. The size of the window is 4.\nHyperparameters of PFA are set as the same to our model. Data in the last time step is exploited as\nthe predicting target in a prediction task. Mean squared error (MSE) between the ground truth and\nthe estimated value and the predicted mean squared error (PMSE) between the ground truth and the\npredicted value in next time step are exploited to evaluate the performance of each model.\n\n4.1 Synthetic Datasets\n\nThe multi-dimensional synthetic datasets are obtained by using the following functions where the\nsubscript stands for the index of dimension:\n\n5\n\n\fData Measure\n\nSDS1\n\nSDS2\n\nSDS3\n\nSDS4\n\nMSE\nPMSE\nMSE\nPMSE\nMSE\nPMSE\nMSE\nPMSE\n\nTable 1: The result on the synthetic data\nLSTM\n\nDDPFA\n\nPGDS\n\n0.15 \u00b1 0.01\n2.07 \u00b1 0.02\n0.06 \u00b1 0.01\n2.01 \u00b1 0.02\n0.10 \u00b1 0.02\n2.14 \u00b1 0.04\n0.15 \u00b1 0.03\n1.48 \u00b1 0.04\n\n1.48 \u00b1 0.00\n5.96 \u00b1 0.00\n3.38 \u00b1 0.00\n3.50 \u00b1 0.01\n1.62 \u00b1 0.00\n4.33 \u00b1 0.01\n2.92 \u00b1 0.00\n6.41 \u00b1 0.01\n\n2.02 \u00b1 0.23\n2.94 \u00b1 0.31\n1.83 \u00b1 0.04\n2.41 \u00b1 0.06\n1.13 \u00b1 0.06\n3.03 \u00b1 0.05\n4.30 \u00b1 0.26\n4.67 \u00b1 0.24\n\nPFA\n\n-\n\n1.61 \u00b1 0.00\n4.42 \u00b1 0.00\n1.34 \u00b1 0.00\n0.25 \u00b1 0.00\n\n-\n\n-\n\n-\n\n(mod 2), f2(t) = 2t\n\n(mod 2), f2(t) = 2t\n\nSDS1:f1(t) = f2(t) = t, f3(t) = f4(t) = t + 1 on the interval t = [1 : 1 : 6].\nSDS2:f1(t) = t\n(mod 2) + 2, f3(t) = t on the interval t = [1 : 1 : 20].\nSDS3:f1(t) = f2(t) = t, f3(t) = f4(t) = t + 1, f5(t) = I(4|t) on the interval t = [1 : 1 : 20],\nwhere I is an indicator function.\nSDS4:f1(t) = t\nt = [1 : 1 : 100].\nThe number of factor is 
set to K = 3, and the number of the layers is 2. Both \ufb01tting and predicting\ntasks are performed in each model. The hidden layer of LSTM is 4 and the size in each layer is 20. In\nTable 1, it is obviously that DDPFA has the best performance in \ufb01tting and prediction task of all the\ndatasets. Note that the complex relationship learnt from the time steps helps the model catch more\ntime patterns according to the results of DDPFA, PGDS and PFA. LSTM performs worse in SDS4\nbecause the noise in the synthetic data and the long time steps make the neural network dif\ufb01cult to\nmemorize enough information.\n\n(mod 10) on the interval\n\n(mod 2) + 2, f3(t) = t\n\n4.2 Real-world Datasets\n\nFive real-world datasets are used as follows:\nIntegrated Crisis Early Warning System (ICEWS): ICEWS is an international relations event data set\nextracted from news corpora used in [2]. We therefore treated undirected pairs of countries i \u2194 j as\nfeatures and created a count matrix for the year 2003. The number of events for each pair during each\nday time step is counted, and all pairs with fewer than twenty-\ufb01ve total events is discarded, leaving\nT = 365, V = 6197, and 475646 events for the matrix.\nNIPS corpus (NIPS): NIPS corpus contains the text of every NIPS conference paper from the year\n1987 to 2003. We created a single count matrix with one column per year. The dataset is downloaded\nfrom Gal\u2019s page 1, with T = 17, V = 14036, with 3280697 events for the matrix.\nEbola corpus (EBOLA)2 : EBOLA corpus contains the data for the 2014 Ebola outbreak in West\nAfrica every day from Mar 22th, 2014 to Jan 5th 2015, each column represents the cases or deaths in\na West Africa country. After data cleaning, the dataset is with T = 122, V = 16.\nInternational Disaster(ID)3 : The International Disaster dataset contains essential core data on the\noccurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day. 
A\ncount matrix with T = 115 and V = 12 is built from the events of disasters occurred in Europe from\nthe year 1902 to 2016, classi\ufb01ed according to their disaster types.\nAnnual Sheep Population(ASP)4 : The Annual Sheep Population contains the sheep population in\nEngland & Wales from the year 1867 to 1939 yearly. The data matrix is with T = 73, V = 1.\n\n1http://ai.stanford.edu/gal/data.html\n2https://github.com/cmrivers/ebola/blob/master/country_timeseries.csv\n3http://www.emdat.be/\n4https://datamarket.com/data/list/?q=provider:tsdl\n\n6\n\n\fData\n\nMeasure\n\nICEWS\n\nNIPS\n\nEBOLA\n\nID\n\nASP\n\nMSE\nPMSE\nMSE\nPMSE\nMSE\nPMSE\nMSE\nPMSE\nMSE\nPMSE\n\n3.05 \u00b1 0.02\n0.96 \u00b1 0.03\n51.14 \u00b1 0.03\n289.21 \u00b1 0.02\n381.82 \u00b1 0.13\n490.32 \u00b1 0.12\n1.59 \u00b1 0.01\n5.18 \u00b1 0.01\n14.17 \u00b1 0.02\n21.23 \u00b1 0.04\n\n3.21 \u00b1 0.01\n0.97 \u00b1 0.02\n54.71 \u00b1 0.08\n337.60 \u00b1 0.10\n516.57 \u00b1 0.01\n1071.01 \u00b1 0.01\n3.45 \u00b1 0.00\n10.44 \u00b1 0.00\n2128.47 \u00b1 0.02\n760.42 \u00b1 0.02\n\nTable 2: The result on the real-world data\nDDPFA\n\nPGDS\n\nLSTM\n\n4.53 \u00b1 0.04\n6.30 \u00b1 0.03\n\n1053.12 \u00b1 39.01\n1728.04 \u00b1 38.42\n4892.34 \u00b1 10.21\n5839.26 \u00b1 11.92\n11.19 \u00b1 1.32\n10.37 \u00b1 1.54\n\n17962.47 \u00b1 14.12\n21324.72 \u00b1 17.48\n\nPFA\n\n-\n\n-\n\n3.70 \u00b1 0.01\n69.05 \u00b1 0.43\n1493.32 \u00b1 0.21\n4.41 \u00b1 0.01\n388.02 \u00b1 0.01\n\n-\n\n-\n\n-\n\n(a) PGDS\n\n(b) DDPFA\n\nFigure 2: The visual of the factor strength in each time step of the ICEWS data, the data is\n\nnormalized each time step. In (a), the result of PGDS shows the factors are shrunk to some local time\n\nsteps. In (b), the result of DDPFA shows the factors are not taking effects locally.\n\nWe set K = 3 for ID and ASP datasets, while set K = 10 for the others. The size of the hidden\nlayers of the LSTM is 40. The settings of remainder parameters here are the same as those in the\nabove experiment. 
The results of the experiments are shown in Table 2. It compares the four models on the five datasets; the proposed DDPFA has satisfactory performance in most experiments, although its result on the ICEWS prediction task is not as good, where the smoothed data obtained from the transition matrix in PGDS performs well. On the EBOLA and ASP datasets, however, PGDS fails to capture the complicated time dependence. It is also a tough challenge for the LSTM network to memorize enough useful patterns when its input includes long-time patterns or the dimension of the data is particularly high.\nAccording to Figure 2, the factors learnt by our model are not activated only locally, in contrast to PGDS. Naturally, in real-world data, it is impossible that only one factor is active at each time step. For example, in the ICEWS dataset, the connection between Israel and the Occupied Palestinian Territory remains strong during the Iraq War and other events. Figure 2(a) reveals that several factors at certain time steps are not captured by PGDS. In Figure 3, the changes of two meaningful factors in ICEWS are shown. These two factors indicate, respectively, the Israeli-Palestinian conflict and the six-party talks. The long-time activation of factors is shown in this figure, since the DDPFA model can capture weak factor strength over time.\nIn Table 3, we show the performance of our model with different sizes. From the table, performance cannot be improved distinctly by adding more layers or by adding more variables in the upper layers. It is also noticed that expanding the dimension of the bottom layer is more useful than expanding the upper layers. The results reveal two problems of the proposed DDPFA: \"pruning\" and the uselessness of adding network layers.\n\n7\n\n\fFigure 3: Visualization of the top two factors of the ICEWS data generated by the DDPFA method. 
In (a),\n\n\u2019Japan\u2013Russian Federation\u2019, \u2019North Korea\u2013United States\u2019, \u2019Russian Federation\u2013United States\u2019,\n\u2019South Korea\u2013United States\u2019, and \u2019China\u2013Russian Federation\u2019 are the largest features due to their\n\nloading weights. This factor stands for six-party talks and other accidents about it. In (b),\n\n\u2019Israel\u2013Occupied Palestinian Territory\u2019, \u2019Israel\u2013United States\u2019, \u2019Occupied Palestinian\n\nTerritory\u2013United States\u2019 are the largest features and it stands for the Israeli-Palestinian con\ufb02ict.\n\nTable 3: MSE on real datasets with different sizes.\n\nSize\n\n10-10-10\n\n10-10-10 (ladder structure)\n\n10-10\n\n32-32-32\n\n32-32-32 (ladder structure)\n\n32-64-64\n64-32-32\n\nICEWS NIPS\n51.24\n49.81\n51.14\n50.12\n49.26\n50.18\n50.04\n\n2.94\n2.88\n3.05\n2.95\n2.86\n2.93\n2.90\n\nEBOLA\n382.17\n379.08\n381.82\n379.64\n377.81\n380.01\n378.87\n\n[25] notices hierarchical latent variable models do not take advantage of the structure, and gives\nsuch a conclusion that only using the bottom latent layer of hierarchical variational autoencoders\nshould be enough. In order to solve this problem, the ladder-like architecture, in which each layer\ncombines independent variables with latent variables depend on the upper layers, is used in our model.\nIt is noticed that using ladder architecture could reach much better results from Table 3. Another\nproblem, \"pruning\", is a phenomenon where the optimizer severs connections between most of the\nlatent variables and the data [24]. In our experiments, it is noticed that some dimenisions in the latent\nlayers only contain data noise. This problem is also found in differentiable variational Bayes and\nsolved by using auxiliary MCMC strcuture [24]. 
Therefore, we believe this problem is caused by\nMF-variational inference used in our model and we hope it can be solved if we try other inference\nmethods.\n\n5 Summary\n\nA new model, called DDPFA, is proposed to obtain long-time and complicated dependence in time\nseries count data. Inference in DDPFA is based on variational method for estimating the latent\nvariables and approximating parameters in neural networks. In order to show the performance of the\nproposed model, four multi-dimensional synthetic datasets and \ufb01ve real-world datasets, ICEWS, NIPS\ncorpus, EBOLA, International Disaster and Annual Sheep Population, are used, and the performance\nof three existed methods, PGDS, LSTM, and PFA, are compared. According to our experimental\nresults, DDPFA has better effectivity and interpretability in sequential count analysis.\n\n8\n\n\fReferences\n\n[1] A. Ayan, J. Ghosh, & M. Zhou. Nonparametric Bayesian Factor Analysis for Dynamic Count Matrices.\nAISTATS, 2015.\n[2] A. Schein, M. Zhou, & H. Wallach. Poisson\u2013Gamma Dynamical Systems. NIPS, 2016.\n[3] A. Ahmed, & E. Xing. Dynamic Non-Parametric Mixture Models and The Recurrent Chinese Restaurant\nProcess. SDM, 2008.\n[4] D. M. Blei, D. M. Grif\ufb01ths, M. I. Jordan, & J. B. Tenenbaum. Hierarchical topic models and the nested\nChinese restaurant process. NIPS, 2004.\n[5] J. Paisley, C. Wang, D. M. Blei, & M. I. Jordan. Nested hierarchical Dirichlet processes. PAMI, 2015.\n[6] T. D. Bui, D. Hern\u00e1ndezlobato, Y. Li, & et al. Deep Gaussian Processes for Regression using Approximate\nExpectation Propagation. ICML, 2016.\n[7] T. D. Bui, D. Thang, E. Richard, & et al. A unifying framework for sparse Gaussian process approximation\nusing Power Expectation Propagation. arXiv:1605.07066.\n[8] R. Ranganath, L. Tang, L. Charlin, & D. M. Blei. Deep exponential families. AISTATS, 2014.\n[9] Z. Gan, C. Chen, R. Henao, D. Carlson, & L. Carin. 
Scalable deep Poisson factor analysis for topic modeling.\nICML, 2015.\n[10] Z. Gan, R. Henao, D. Carlson, & L. Carin. Learning deep sigmoid belief networks with data augmentation.\nAISTATS, 2015.\n[11] H. Larochelle & S. Lauly. A neural autoregressive topic model. NIPS, 2012.\n[12] H. Ricardo, Z. Gan, J. Lu & L. Carin. Deep Poisson Factor Modeling. NIPS, 2015.\n[13] M. Zhou & L. Carin. Augment-and-conquer negative binomial processes. NIPS, pages 2546\u20132554, 2012.\n[14] M. Zhou, L. Hannah, D. Dunson, & L. Carin. Beta-negative binomial process and Poisson factor analysis.\nAISTATS, 2012.\n[15] M. Zhou & L. Carin. Negative binomial process count and mixture modeling. IEEE Transactions on Pattern\nAnalysis and Machine Intelligence, 37(2):307\u2013320, 2015.\n[16] M. Zhou. Nonparametric Bayesian negative binomial factor analysis. arXiv:1604.07464.\n[17] S. A. Hosseini, K. Alizadeh, A. Khodadadi, & et al. Recurrent Poisson Factorization for Temporal\nRecommendation. KDD, 2017.\n[18] P. Gopalan, J. M. Hofman, & D. M. Blei. Scalable Recommendation with Hierarchical Poisson Factorization.\nUAI, 2015.\n[19] P. Gopalan, J. M. Hofman, & D. M. Blei. Scalable Recommendation with Poisson Factorization.\narXiv:1311.1704.\n[20] M. Zhou, Y. Cong, & B. Chen. Augmentable gamma belief networks. Journal of Machine Learning\nResearch, 17(163):1\u201344, 2016.\n[21] D. P. Kingma & W. Max. Auto-encoding variational Bayes. ICLR, 2014.\n[22] Y. Bengio, P. Lamblin, D. Popovici, & H. Larochelle. Greedy layer-wise training of deep networks. NIPS,\n2006.\n[23] Y. Cong, B. Chen, H. Liu, and M. Zhou, Deep latent Dirichlet allocation with topic-layer-adaptive stochastic\ngradient Riemannian MCMC. ICML, 2017.\n[24] S. Zhao, J. Song, S. Ermon. Learning Hierarchical Features from Generative Models. ICML, 2017.\n[25] M. Hoffman. Learning Deep Latent Gaussian Models with Markov Chain Monte Carlo. 
ICML, 2017.", "award": [], "sourceid": 1039, "authors": [{"given_name": "Chengyue", "family_name": "Gong", "institution": "Peking University"}, {"given_name": "Win-bin", "family_name": "Huang", "institution": "Peking University"}]}