{"title": "Deep Poisson Factor Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 2800, "page_last": 2808, "abstract": "We propose a new deep architecture for topic modeling, based on Poisson Factor Analysis (PFA) modules. The model is composed of a Poisson distribution to model observed vectors of counts, as well as a deep hierarchy of hidden binary units. Rather than using logistic functions to characterize the probability that a latent binary unit is on, we employ a Bernoulli-Poisson link, which allows PFA modules to be used repeatedly in the deep architecture. We also describe an approach to build discriminative topic models, by adapting PFA modules. We derive efficient inference via MCMC and stochastic variational methods, that scale with the number of non-zeros in the data and binary units, yielding significant efficiency, relative to models based on logistic links. Experiments on several corpora demonstrate the advantages of our model when compared to related deep models.", "full_text": "Deep Poisson Factor Modeling\n\nRicardo Henao, Zhe Gan, James Lu and Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\n\nDuke University, Durham, NC 27708\n\n{r.henao,zhe.gan,james.lu,lcarin}@duke.edu\n\nAbstract\n\nWe propose a new deep architecture for topic modeling, based on Poisson Fac-\ntor Analysis (PFA) modules. The model is composed of a Poisson distribution to\nmodel observed vectors of counts, as well as a deep hierarchy of hidden binary\nunits. Rather than using logistic functions to characterize the probability that a\nlatent binary unit is on, we employ a Bernoulli-Poisson link, which allows PFA\nmodules to be used repeatedly in the deep architecture. We also describe an ap-\nproach to build discriminative topic models, by adapting PFA modules. 
We derive efficient inference via MCMC and stochastic variational methods that scale with the number of non-zeros in the data and binary units, yielding significant efficiency relative to models based on logistic links. Experiments on several corpora demonstrate the advantages of our model when compared to related deep models.

1 Introduction

Deep models, understood as multilayer modular networks, have been gaining significant interest from the machine learning community, in part because of their ability to obtain state-of-the-art performance in a wide variety of tasks. Their modular nature is another reason for their popularity. Commonly used modules include, but are not limited to, Restricted Boltzmann Machines (RBMs) [10], Sigmoid Belief Networks (SBNs) [22], convolutional networks [18], feedforward neural networks, and Dirichlet Processes (DPs)¹. Perhaps the two most well-known deep model architectures are the Deep Belief Network (DBN) [11] and the Deep Boltzmann Machine (DBM) [25], the former composed of RBM and SBN modules, whereas the latter is purely built using RBMs.

Deep models are often employed in topic modeling. Specifically, hierarchical tree-structured models have been widely studied over the last decade, often composed of DP modules. Examples of these include the nested Chinese Restaurant Process (nCRP) [1], the hierarchical DP (HDP) [27], and the nested HDP (nHDP) [23]. 
Alternatively, topic models built using modules other than DPs have been proposed recently, for instance the Replicated Softmax Model (RSM) [12] based on RBMs, the Neural Autoregressive Density Estimator (NADE) [17] based on neural networks, the Over-replicated Softmax Model (OSM) [26] based on DBMs, and Deep Poisson Factor Analysis (DPFA) [6] based on SBNs.

DP-based models have attractive characteristics from the standpoint of interpretability, in the sense that their generative mechanism is parameterized in terms of distributions over topics, with each topic characterized by a distribution over words. Alternatively, non-DP-based models, in which modules are parameterized by a deep hierarchy of binary units [12, 17, 26], do not have parameters that are as readily interpretable in terms of topics of this type, although model performance is often excellent. The DPFA model in [6] is one of the first representations that characterizes documents based on distributions over topics and words, while simultaneously employing a deep architecture based on binary units. Specifically, [6] integrates the capabilities of Poisson Factor Analysis (PFA) [32] with a deep architecture composed of SBNs [7]. PFA is a nonnegative matrix factorization framework closely related to DP-based models. Results in [6] show that DPFA outperforms other well-known deep topic models.

¹ Deep models based on DP priors are usually called hierarchical models.

Building upon the success of DPFA, this paper proposes a new deep architecture for topic modeling, based entirely on PFA modules. 
Our model fundamentally merges two key aspects of DP- and non-DP-based architectures, namely: (i) its fully nonnegative formulation relies on Dirichlet distributions, and is thus readily interpretable throughout all its layers, not just at the base layer as in DPFA [6]; (ii) it adopts the rationale of traditional non-DP-based models such as DBNs and DBMs, by connecting layers via binary units, to enable learning of high-order statistics and structured correlations. The probability of a binary unit being on is controlled by a Bernoulli-Poisson link [30] (rather than a logistic link, as in the SBN), allowing repeated application of PFA modules at all layers of the deep architecture.

The main contributions of this paper are: (i) A deep architecture for topic models based entirely on PFA modules. (ii) Unlike DPFA, which is based on SBNs, our model has inherent shrinkage in all its layers, thanks to the DP-like formulation of PFA. (iii) DPFA requires sequential updates for its binary units, while in our formulation these are updated in block, greatly improving mixing. (iv) We show how PFA modules can be used to easily build discriminative topic models. (v) An efficient MCMC inference procedure is developed that scales as a function of the number of non-zeros in the data and binary units; in contrast, models based on RBMs and SBNs scale with the size of the data and binary units. (vi) We also employ a scalable Bayesian inference algorithm based on the recently proposed Stochastic Variational Inference (SVI) framework [15].

2 Model

2.1 Poisson factor analysis as a module

We present the model in terms of document modeling and word counts, but the basic setup is applicable to other problems characterized by vectors of counts (and we consider such a non-document application when presenting results). Assume $x_n$ is an $M$-dimensional vector containing word counts for the $n$-th of $N$ documents, where $M$ is the vocabulary size. 
We impose the model $x_n \sim \mathrm{Poisson}(\Psi(\theta_n \circ h_n))$, where $\Psi \in \mathbb{R}_+^{M \times K}$ is the factor loadings matrix with $K$ factors, $\theta_n \in \mathbb{R}_+^K$ are factor intensities, $h_n \in \{0,1\}^K$ is a vector of binary units indicating which factors are active for observation $n$, and $\circ$ represents the element-wise (Hadamard) product. One possible prior specification for this model, recently introduced in [32], is

$x_{mn} = \sum_{k=1}^K x_{mkn}\,, \quad x_{mkn} \sim \mathrm{Poisson}(\lambda_{mkn})\,, \quad \lambda_{mkn} = \psi_{mk}\theta_{kn}h_{kn}\,,$
$\psi_k \sim \mathrm{Dirichlet}(\eta 1_M)\,, \quad \theta_{kn} \sim \mathrm{Gamma}(r_k, (1-b)b^{-1})\,, \quad h_{kn} \sim \mathrm{Bernoulli}(\pi_{kn})\,,$   (1)

where $1_M$ is an $M$-dimensional vector of ones, and we have used the additive property of the Poisson distribution to decompose the $m$-th observed count of $x_n$ as $K$ latent counts, $\{x_{mkn}\}_{k=1}^K$. Here, $\psi_k$ is column $k$ of $\Psi$, $x_{mn}$ is component $m$ of $x_n$, $\theta_{kn}$ is component $k$ of $\theta_n$, and $h_{kn}$ is component $k$ of $h_n$. Furthermore, we let $\eta = 1/K$, $b = 0.5$ and $r_k \sim \mathrm{Gamma}(1, 1)$. Note that $\eta$ controls the sparsity of $\Psi$, while $r_k$ accommodates over-dispersion in $x_n$ via $\theta_n$ (see [32] for details).

There is one parameter in (1) for which we have not specified a prior distribution, specifically $E[p(h_{kn} = 1)] = \pi_{kn}$. In [32], $h_{kn}$ is provided with a beta-Bernoulli process prior by letting $\pi_{kn} = \pi_k \sim \mathrm{Beta}(c\epsilon, c(1-\epsilon))$, meaning that every document has on average the same probability of seeing a particular topic as active, based on corpus-wide popularity. It further assumes topics are independent of each other. These two assumptions are restrictive because: (i) in practice, documents belong to a rather heterogeneous population, in which themes naturally occur within a corpus; letting documents have individual topic activation probabilities will allow the model to better accommodate heterogeneity in the data. (ii) Some topics are likely to co-occur systematically, so being able to harness such correlation structures can improve the ability of the model to fit the data.

The hierarchical model in (1), which in the following we denote as $x_n \sim \mathrm{PFA}(\Psi, \theta_n, h_n; \eta, r_k, b)$, short for Poisson Factor Analysis (PFA), represents documents, $x_n$, as purely additive combinations of up to $K$ topics (distributions over words), where $h_n$ indicates which topics are active and $\theta_n$ is the intensity of each of the active topics manifested in document $x_n$. It is also worth noting that the model in (1) is closely related to other widely known topic model approaches, such as Latent Dirichlet Allocation (LDA) [3], HDP [27] and Focused Topic Modeling (FTM) [29]. Connections between these models are discussed in Section 4.

2.2 Deep representations with PFA modules

Several models have been proposed recently to address the limitations described above [1, 2, 6, 27]. In particular, [6] proposed using multilayer SBNs [22] to impose correlation structure across topics, while providing each document with the ability to control its topic activation probabilities, without the need of a global beta-Bernoulli process [32]. Here we follow the same rationale as [6], but without SBNs. We start by noting that for a binary vector $h_n$ with elements $h_{kn}$, we can write

$z_{kn} \sim \mathrm{Poisson}(\tilde\lambda_{kn})\,, \qquad h_{kn} = \mathbf{1}(z_{kn} \geq 1)\,,$   (2)

where $z_{kn}$ is a latent count for variable $h_{kn}$, parameterized by a Poisson distribution with rate $\tilde\lambda_{kn}$; $\mathbf{1}(\cdot) = 1$ if the argument is true, and $\mathbf{1}(\cdot) = 0$ otherwise. 
The model in (2), recently proposed in [30], is known as the Bernoulli-Poisson Link (BPL) and is denoted $h_n \sim \mathrm{BPL}(\tilde\lambda_n)$, for $\tilde\lambda_n \in \mathbb{R}_+^K$. After marginalizing out the latent count $z_{kn}$ [30], the model in (2) has the interesting property that $p(h_{kn} = 1) = \mathrm{Bernoulli}(\pi_{kn})$, where $\pi_{kn} = 1 - \exp(-\tilde\lambda_{kn})$. Hence, rather than using the logistic function to represent binary unit probabilities, we employ $\pi_{kn} = 1 - \exp(-\tilde\lambda_{kn})$.

In (1) and (2) we have represented the Poisson rates as $\lambda_{mkn}$ and $\tilde\lambda_{kn}$, respectively, to distinguish between the two. However, the fact that the count vector in (1) and the binary variable in (2) are both represented in terms of Poisson distributions suggests the following deep model, based on PFA modules (see graphical model in Supplementary Material):

$x_n \sim \mathrm{PFA}(\Psi^{(1)}, \theta_n^{(1)}, h_n^{(1)}; \eta^{(1)}, r_k^{(1)}, b^{(1)})\,, \qquad h_n^{(1)} = \mathbf{1}(z_n^{(2)})\,,$
$z_n^{(2)} \sim \mathrm{PFA}(\Psi^{(2)}, \theta_n^{(2)}, h_n^{(2)}; \eta^{(2)}, r_k^{(2)}, b^{(2)})\,, \qquad h_n^{(2)} = \mathbf{1}(z_n^{(3)})\,,$
$\vdots$
$z_n^{(L)} \sim \mathrm{PFA}(\Psi^{(L)}, \theta_n^{(L)}, h_n^{(L)}; \eta^{(L)}, r_k^{(L)}, b^{(L)})\,, \qquad h_n^{(L)} = \mathbf{1}(z_n^{(L+1)})\,,$   (3)

where $L$ is the number of layers in the model, and $\mathbf{1}(\cdot)$ is a vector operation in which each component imposes the left operation in (2). In this Deep Poisson Factor Model (DPFM), the binary units at layer $\ell \in \{1, \dots, L\}$ are drawn $h_n^{(\ell)} \sim \mathrm{BPL}(\lambda_n^{(\ell+1)})$, where $\lambda_n^{(\ell)} = \Psi^{(\ell)}(\theta_n^{(\ell)} \circ h_n^{(\ell)})$. The form of the model in (3) introduces latent variables $\{z_n^{(\ell)}\}_{\ell=2}^{L+1}$ and the element-wise function $\mathbf{1}(\cdot)$, rather than explicitly drawing $\{h_n^{(\ell)}\}_{\ell=1}^{L}$ from the BPL distribution. 
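To make the stacking in (3) concrete, the following is a toy ancestral-sampling sketch of a two-layer DPFM in Python (not the paper's Matlab implementation); the fixed hyperparameters $\eta = 1/K$, $b = 0.5$ and $a_0 = b_0 = 1$ follow the text, while all dimensions, seeds and helper names are illustrative assumptions:

```python
import numpy as np

def pfa_counts(rng, Psi, theta, h):
    """One PFA module: Poisson counts with rate Psi @ (theta o h), Eq. (1)."""
    return rng.poisson(Psi @ (theta * h))

def sample_dpfm(M=100, K1=20, K2=5, seed=0):
    """Toy ancestral sampling of a two-layer DPFM, Eq. (3), for one document.

    eta = 1/K, b = 0.5 (so the Gamma scale (1-b)/b is 1) and a0 = b0 = 1,
    as fixed in the paper; dimensions here are arbitrary toy values.
    """
    rng = np.random.default_rng(seed)
    Psi1 = rng.dirichlet(np.ones(M) / K1, size=K1).T   # M x K1 topics
    Psi2 = rng.dirichlet(np.ones(K1) / K2, size=K2).T  # K1 x K2 meta-topics
    lam_top = rng.gamma(1.0, 1.0, size=K2)             # lambda^(L+1)
    h2 = (rng.poisson(lam_top) >= 1).astype(float)     # BPL: h^(2) = 1(z >= 1)
    theta2 = rng.gamma(rng.gamma(1.0, 1.0, size=K2), 1.0)
    z2 = pfa_counts(rng, Psi2, theta2, h2)             # latent counts z^(2)
    h1 = (z2 >= 1).astype(float)                       # h^(1) = 1(z^(2) >= 1)
    theta1 = rng.gamma(rng.gamma(1.0, 1.0, size=K1), 1.0)
    x = pfa_counts(rng, Psi1, theta1, h1)              # observed word counts
    return x, h1, h2

x, h1, h2 = sample_dpfm()
print(x.shape, h1.shape, h2.shape)
```

Thresholding the layer-2 counts at one is exactly the Bernoulli-Poisson link, so here $p(h_{kn}^{(1)} = 1) = 1 - \exp(-\tilde\lambda_{kn})$ without any logistic function.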
Concerning the top layer, we let $z_{kn}^{(L+1)} \sim \mathrm{Poisson}(\lambda_k^{(L+1)})$ and $\lambda_k^{(L+1)} \sim \mathrm{Gamma}(a_0, b_0)$.

2.3 Model interpretation

Consider layer 1 of (3), from which $x_n$ is drawn. Assuming $h_n^{(1)}$ is known, this corresponds to a focused topic model [29]. The columns of $\Psi^{(1)}$ correspond to topics, with the $k$-th column $\psi_k^{(1)}$ defining the probability with which words are manifested for topic $k$ (each $\psi_k^{(1)}$ is drawn from a Dirichlet distribution, as in (1)). Generalizing the notation from (1), $\lambda_{kn}^{(1)} = \psi_k^{(1)}\theta_{kn}^{(1)}h_{kn}^{(1)} \in \mathbb{R}_+^M$ is the rate vector associated with topic $k$ and document $n$, and it is active when $h_{kn}^{(1)} = 1$. The word-count vector for document $n$ manifested from topic $k$ is $x_{kn} \sim \mathrm{Poisson}(\lambda_{kn}^{(1)})$, and $x_n = \sum_{k=1}^{K_1} x_{kn}$, where $K_1$ is the number of topics in the model. The columns of $\Psi^{(1)}$ define correlation among the words associated with the topics; for a given topic (column of $\Psi^{(1)}$), some words co-occur with high probability, and other words are likely jointly absent.

We now consider a two-layer model, with $h_n^{(2)}$ assumed known. To generate $h_n^{(1)}$, we first draw $z_n^{(2)} = \sum_{k=1}^{K_2} z_{kn}^{(2)}$, with $z_{kn}^{(2)} \sim \mathrm{Poisson}(\lambda_{kn}^{(2)})$ and $\lambda_{kn}^{(2)} = \psi_k^{(2)}\theta_{kn}^{(2)}h_{kn}^{(2)}$. Column $k$ of $\Psi^{(2)}$ corresponds to a meta-topic, with $\psi_k^{(2)}$ a $K_1$-dimensional probability vector denoting the probability with which each of the layer-1 topics is "on" when layer-2 "meta-topic" $k$ is on (i.e., when $h_{kn}^{(2)} = 1$). The columns of $\Psi^{(2)}$ define correlation among the layer-1 topics; for a given layer-2 meta-topic (column of $\Psi^{(2)}$), some layer-1 topics co-occur with high probability, and other layer-1 topics are likely jointly absent.

As one moves up the hierarchy, to layers $\ell > 2$, the meta-topics become increasingly more abstract and sophisticated, manifested in terms of probabilistic combinations of topics and meta-topics at the layers below. Because of the properties of the Dirichlet distribution, each column of a particular $\Psi^{(\ell)}$ is encouraged to be sparse, implying that a column of $\Psi^{(\ell)}$ encourages use of a small subset of columns of $\Psi^{(\ell-1)}$, with this repeated all the way down to the data layer, and the topics reflected in the columns of $\Psi^{(1)}$. This deep architecture imposes correlation across the layer-1 topics, and it does so through use of PFA modules at all layers of the deep architecture, unlike [6], which uses an SBN for layers 2 through $L$ and a PFA at the bottom layer. In addition to the elegance of using a single class of modules at each layer, the proposed deep model has important computational benefits, as later discussed in Section 3.

2.4 PFA modules for discriminative tasks

Assume that there is a label $y_n \in \{1, \dots, C\}$ associated with document $n$. We seek to learn the model for mapping $x_n \to y_n$ simultaneously with learning the above deep topic representation. In fact, the mapping $x_n \to y_n$ is based on the deep generative process for $x_n$ in (3). We represent $y_n$ via the $C$-dimensional one-hot vector $\hat{y}_n$, which has all elements equal to zero except one, with the non-zero value (which is set to one) located at the position of the label. 
We impose the model

$\hat{y}_n \sim \mathrm{Multinomial}(1, \hat\lambda_n)\,, \qquad \hat\lambda_{cn} = \lambda_{cn} / \textstyle\sum_{c=1}^C \lambda_{cn}\,,$   (4)

where $\hat\lambda_{cn}$ is element $c$ of $\hat\lambda_n$, $\lambda_n = B(\theta_n^{(1)} \circ h_n^{(1)})$ and $B \in \mathbb{R}_+^{C \times K}$ is a matrix of nonnegative classification weights, with prior distribution $b_k \sim \mathrm{Dirichlet}(\zeta 1_C)$, where $b_k$ is a column of $B$. Combining (3) with (4) allows us to learn the mapping $x_n \to y_n$ via the shared first-layer local representation, $\theta_n^{(1)} \circ h_n^{(1)}$, that encodes topic usage for document $n$. This sharing mechanism allows the model to learn topics, $\Psi^{(1)}$, and meta-topics, $\{\Psi^{(\ell)}\}_{\ell=2}^L$, biased towards discrimination, as opposed to just explaining the data, $x_n$. We call this construction discriminative deep Poisson factor modeling. It is worth noting that this is the first time that PFA and multi-class classification have been combined into a joint model. Although other DP-based discriminative topic models have been proposed [16, 21], they rely on approximations in order to combine the topic model, usually LDA, with softmax-based classification approaches.

3 Inference

A very convenient feature of the model in (3) is that all its conditional posterior distributions can be written in closed form due to local conjugacy. 
In this section, we focus on Markov chain Monte Carlo (MCMC) via Gibbs sampling as the reference implementation, and on a stochastic variational inference approach for large datasets, where the fully Bayesian treatment becomes prohibitive. Other alternatives for scaling up inference in Bayesian models, such as the parameter server [13, 19], conditional density filtering [9] and stochastic gradient-based approaches [4, 5, 28], are left as interesting future work.

MCMC Due to local conjugacy, Gibbs sampling for the model in (3) amounts to sampling in sequence from the conditional posterior of all the parameters of the model, namely $\{\Psi^{(\ell)}, \theta_n^{(\ell)}, h_n^{(\ell)}, r_k^{(\ell)}\}_{\ell=1}^L$ and $\lambda^{(L+1)}$. The remaining parameters of the model are set to fixed values: $\eta = 1/K$, $b = 0.5$ and $a_0 = b_0 = 1$. We note that priors for $\eta$, $b$, $a_0$ and $b_0$ exist that result in Gibbs-style updates, and can be easily incorporated into the model if desired; however, we opted to keep the model as simple as possible, without compromising flexibility. The key conditional posteriors are shown below, without layer index for clarity,

$\psi_k \sim \mathrm{Dirichlet}(\eta + x_{1k\cdot}, \dots, \eta + x_{Mk\cdot})\,,$
$\theta_{kn} \sim \mathrm{Gamma}(r_k h_{kn} + x_{\cdot kn}, b^{-1})\,,$
$h_{kn} \sim \delta(x_{\cdot kn} = 0)\,\mathrm{Bernoulli}(\tilde\pi_{kn}(\tilde\pi_{kn} + 1 - \pi_{kn})^{-1}) + \delta(x_{\cdot kn} \geq 1)\,,$   (5)

where $x_{mk\cdot} = \sum_{n=1}^N x_{mkn}$, $x_{\cdot kn} = \sum_{m=1}^M x_{mkn}$ and $\tilde\pi_{kn} = \pi_{kn}(1-b)^{r_k}$. Omitted details, including those for the discriminative DPFM in Section 2.4, are given in the Supplementary Material.

Initialization is done at random from prior distributions, followed by layer-wise fitting (pre-training). In the experiments, we run 100 Gibbs sampling cycles per layer. 
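The binary-unit conditional in (5) is what enables blocked updates: wherever the latent count $x_{\cdot kn}$ is positive, $h_{kn}$ is deterministically one, and all remaining entries can be resampled in a single vectorized step. A minimal numerical sketch of that update (toy sizes and probabilities, not the paper's code):

```python
import numpy as np

# Blocked resampling of the binary units h_kn per Eq. (5), with toy sizes:
# h_kn is forced to 1 wherever its latent count x_{.kn} >= 1; where the
# count is zero, h_kn ~ Bernoulli(pi~/(pi~ + 1 - pi)), pi~ = pi (1-b)^r.
rng = np.random.default_rng(3)
K, N, b = 8, 5, 0.5
r = rng.gamma(1.0, 1.0, size=K)             # over-dispersion parameters r_k
pi = rng.uniform(0.05, 0.95, size=(K, N))   # activation probabilities pi_kn
x_kn = rng.poisson(0.3, size=(K, N))        # latent topic counts x_{.kn}

pi_t = pi * (1.0 - b) ** r[:, None]         # tilde-pi_kn in Eq. (5)
p_on = pi_t / (pi_t + 1.0 - pi)             # P(h_kn = 1 | x_{.kn} = 0)
h = np.where(x_kn >= 1, 1, (rng.random((K, N)) < p_on).astype(int))
assert np.all(h[x_kn >= 1] == 1)            # positive counts force h_kn = 1
```

The whole $K \times N$ block is refreshed at once, which is the mixing advantage over the sequential per-unit updates required by SBNs mentioned in the contributions.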
In preliminary trials we observed that 50 cycles are usually enough to obtain good initial values of the global parameters of the model, namely $\{\Psi^{(\ell)}, r_k^{(\ell)}\}_{\ell=1}^L$ and $\lambda^{(L+1)}$.

Stochastic variational inference (SVI) SVI is a scalable algorithm for approximating posterior distributions, consisting of EM-style local-global updates in which subsets of a dataset (mini-batches) are used to update, in closed form, the variational parameters controlling both the local and global structure of the model in an iterative fashion [15]. This is done by using stochastic optimization with noisy natural gradients to optimize the variational objective function. Additional details and theoretical foundations of SVI can be found in [15].

In practice the algorithm proceeds as follows, where again we have omitted the layer index for clarity: (i) let $\{\Psi^{(t)}, r_k^{(t)}, \lambda^{(t)}\}$ be the global variables at iteration $t$. (ii) Sample a mini-batch from the full dataset. (iii) Compute updates for the variational parameters of the local variables using

$\phi_{mkn} \propto \exp(E[\log \psi_{mk}] + E[\log \theta_{kn}])\,,$
$\theta_{kn} \sim \mathrm{Gamma}(E[r_k]E[h_{kn}] + \textstyle\sum_{m=1}^M \phi_{mkn}, b^{-1})\,,$
$h_{kn} \sim E[p(x_{\cdot kn} = 0)]\,\mathrm{Bernoulli}(E[\tilde\pi_{kn}](E[\tilde\pi_{kn}] + 1 - E[\pi_{kn}])^{-1}) + E[p(x_{\cdot kn} \geq 1)]\,,$

where $E[x_{mkn}] = \phi_{mkn}$ and $E[\tilde\pi_{kn}] = E[\pi_{kn}](1-b)^{E[r_k]}$. In practice, expectations for $\theta_{kn}$ and $h_{kn}$ are computed in the log-domain. (iv) Compute a local update for the variational parameters of the global variables (only $\Psi$ is shown) using

$\hat\psi_{mk} = \eta + N N_B^{-1} \textstyle\sum_{n=1}^{N_B} \phi_{mkn}\,,$   (6)

where $N$ and $N_B$ are the sizes of the corpus and mini-batch, respectively. Finally, we update the global variables as $\psi_k^{(t+1)} = (1 - \rho_t)\psi_k^{(t)} + \rho_t\hat\psi_k$, where $\rho_t = (t + \tau)^{-\kappa}$. The forgetting rate, $\kappa \in (0.5, 1]$, controls how fast previous information is forgotten, and the delay, $\tau \geq 0$, down-weights early iterations. These conditions for $\kappa$ and $\tau$ guarantee that the iterative algorithm converges to a local optimum of the variational objective function. In the experiments, we set $\kappa = 0.7$ and $\tau = 128$. Additional details of the SVI algorithm for the model in (3) are given in the Supplementary Material.

Importance of computations scaling as a function of number of non-zeros From a practical standpoint, the most important feature of the model in (3) is that inference does not scale as a function of the size of the corpus, but as a function of its number of non-zero elements, which is advantageous in cases where the input data is sparse (often the case). For instance, 2% of the entries in the widely studied 20 Newsgroups corpus are non-zero; similar proportions are also observed in the Reuters and Wikipedia data. Furthermore, this feature also extends to all the layers of the model, regardless of $\{h_n^{(\ell)}\}$ being latent. Similarly, for the discriminative DPFM in Section 2.4, inference scales with $N$, not $CN$, because the binary vector $\hat{y}_n$ has a single non-zero entry. This is particularly appealing in cases where $C$ is large.

In order to show that this scaling behavior holds, it is enough to see that, by construction from (1), if $x_{mn} = \sum_{k=1}^K x_{mkn} = 0$ (or $z_{mn}^{(\ell)} = 0$ for $\ell > 1$), then $x_{mkn} = 0$, $\forall k$ with probability 1. Besides, from (2) we see that if $h_{kn} = 0$ then $z_{kn} = 0$ with probability 1. 
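A quick numerical illustration of this construction, using the standard Poisson-multinomial allocation of counts (toy data; variable names are ours): only the non-zero entries of the count matrix need to be visited when sampling the latent counts $x_{mkn}$.

```python
import numpy as np

# Allocating observed counts x_mn into latent topic counts x_mkn (the
# augmentation behind Eq. 1) touches only the non-zero entries of X:
# zero entries yield x_mkn = 0 for all k with probability 1.
rng = np.random.default_rng(5)
M, K, N = 40, 6, 20
Psi = rng.dirichlet(np.ones(M), size=K).T                  # M x K loadings
theta_h = rng.gamma(1.0, 1.0, (K, N)) * (rng.random((K, N)) < 0.5)
X = rng.poisson(Psi @ theta_h)                             # sparse counts

xk = np.zeros((M, K, N))
rows, cols = np.nonzero(X)                                 # non-zeros only
for m, n in zip(rows, cols):
    p = Psi[m] * theta_h[:, n]                             # topic rates
    xk[m, :, n] = rng.multinomial(X[m, n], p / p.sum())
print(len(rows), "non-zeros out of", M * N, "entries")
```

The allocation preserves the observed totals (`xk.sum(axis=1)` recovers `X`), so the sufficient statistics needed for (5) come at a cost proportional to the number of non-zeros.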
As a result, the update equations for all parameters of the model except for $\{h_n^{(\ell)}\}$ depend only on the non-zero elements of $x_n$ and $\{z_n^{(\ell)}\}$. Updates for the binary variables can be obtained cheaply in block from $h_{kn}^{(\ell)} \sim \mathrm{Bernoulli}(\pi_{kn}^{(\ell)})$ via $\tilde\lambda_{kn}^{(\ell)}$, as previously described.

It is worth mentioning that models based on multinomial or Poisson likelihoods, such as LDA [3], HDP [27], FTM [29] and PFA [32], also enjoy this property. However, the recently proposed deep PFA [6] does not use PFA modules on layers other than the first one; it uses SBNs or RBMs, which are known to scale with the number of binary variables, as opposed to their non-zero elements.

4 Related work

Connections to other DP-based topic models PFA is a nonnegative matrix factorization model with a Poisson link that is closely related to other DP-based models. Specifically, [32] showed that by making $p(h_{kn} = 1) = 1$ and letting $\theta_{kn}$ have a Dirichlet, instead of a Gamma distribution as in (1), we can recover LDA by using the equivalence between Poisson and multinomial distributions. By looking at (5)-(6), we see that PFA and LDA have the same blocked Gibbs [3] and SVI [14] updates, respectively, when Dirichlet distributions for $\theta_{kn}$ are used. In [32], the authors showed that using the Poisson-gamma representation of the negative binomial distribution and a beta-Bernoulli specification for $p(h_{kn})$ in (1), we can recover the FTM formulation and inference in [29]. More recently, [31] showed that PFA is comparable to HDP in that the former builds group-specific DPs with normalized gamma processes. A more direct relationship between a three-layer HDP [27] and a two-layer version of (3) can be established by grouping documents by categories. In the HDP, three DPs are set for topics, document-wise topic usage and category-wise topic usage. 
In our model, $\Psi^{(1)}$ represents $K_1$ topics, $\theta_n^{(1)} \circ h_n^{(1)}$ encodes document-wise topic usage, and $\Psi^{(2)}$ encodes topic usage for $K_2$ categories. In HDP, documents are assigned to categories a priori, but in our model document-category soft assignments are estimated and encoded via $\theta_n^{(2)} \circ h_n^{(2)}$. As a result, the model in (3) is a more flexible alternative to HDP in that it groups documents into categories in an unsupervised manner.

Similar models Non-DP-based deep models for topic modeling employed in the deep learning literature typically utilize RBMs or SBNs as building blocks. For instance, [12] and [20] extended RBMs via DBNs to topic modeling, and [26] proposed the over-replicated softmax model, a deep version of RSM that generalizes RBMs.

Recently, [24] proposed a framework for generative deep models using exponential family modules. Although they consider Poisson-Poisson and Gamma-Gamma factorization modules akin to our PFA modules, their model lacks the explicit binary units linking layers that are commonly found in traditional deep models. Besides, their inference approach, black-box variational inference, is not as conceptually simple, although it does scale with the number of non-zeros, like our model.

DPFA, proposed in [6], is the model closest to ours. Nevertheless, our proposed model has a number of key differentiating features. (i) Both of them learn topic correlations by building a multilayer modular representation on top of PFA; however, our model uses PFA modules throughout all layers in a conceptually simple and easy-to-interpret way, whereas DPFA uses Gaussian-distributed weight matrices within SBN modules, which are hard to interpret in the context of topic modeling. (ii) SBN architectures have the shortcoming of not having block closed-form conditional posteriors for their binary variables, making them difficult to estimate, especially as the number of variables increases. 
(iii) Factor loading matrices in PFAs have natural shrinkage to counter overfitting, thanks to the Dirichlet prior used for their columns. In SBN-based models, shrinkage has to be added via variable augmentation, at the cost of increased inference complexity. (iv) Inference for SBN modules scales with the number of hidden variables in the model, not with the number of non-zero elements, as in our case.

5 Experiments

Benchmark corpora We present experiments on three corpora: 20 Newsgroups (20 News), Reuters corpus volume I (RCV1) and Wikipedia (Wiki). 20 News is composed of 18,845 documents over a 2,000-word vocabulary, partitioned into a training set of 11,315 documents and a test set of 7,531 documents. RCV1 has 804,414 newswire articles containing 10,000 words. A random subset of 10,000 documents is used for testing. For Wiki, we obtained 10^7 random documents, from which a subset of 1,000 is set aside for testing. Following [14], we keep a vocabulary consisting of 7,702 words taken from the top 10,000 words in the Project Gutenberg Library.

As performance measure we use held-out perplexity, defined as the geometric mean of the inverse marginal likelihood of every word in the set. We cannot evaluate the intractable marginal for our model, thus we compute the predictive perplexity on a 20% subset of the held-out set. The remaining 80% is used to learn document-specific variables of the model. The training set is used to estimate the global parameters of the model. Further details on perplexity evaluation for PFA models can be found in [6, 32].

We compare our model (denoted DPFM) against LDA [3], FTM [29], RSM [12], nHDP [23] and DPFA with SBNs (DPFA-SBN) and RBMs (DPFA-RBM) [6]. For all these models we use the settings described in [6]. Inference methods for RSM and DPFA are contrastive divergence with step size 5 (CD5) and stochastic gradient Nosé-Hoover thermostats (SGNHT) [5], respectively. For our model, we run 3,000 samples (first 2,000 as burn-in) for MCMC and 4,000 iterations with 200-document mini-batches for SVI. For the Wiki corpus, MCMC-based DPFM is run on a random subset of 10^6 documents. The code used, implemented in Matlab, will be made publicly available.

Table 1: Held-out perplexities for 20 News, RCV1 and Wiki. Size indicates the number of topics and/or binary units, accordingly.

Model      Method  Size           20 News   RCV1   Wiki
DPFM       SVI     128-64         818       961    791
DPFM       MCMC    128-64         780       908    783
DPFA-SBN   SGNHT   1024-512-256   —         942    770
DPFA-SBN   SGNHT   128-64-32      827       1143   876
DPFA-RBM   SGNHT   128-64-32      896       920    942
nHDP       SVI     (10,10,5)      889       1041   932
LDA        Gibbs   128            893       1179   1059
FTM        Gibbs   128            887       1155   991
RSM        CD5     128            877       1171   1001

Table 1 shows results for the corpora being considered. Figures for methods other than DPFM were taken from [6]. We see that multilayer models (DPFM, DPFA and nHDP) consistently outperform single-layer ones (LDA, FTM and RSM), and that DPFM has the best performance across all corpora for models of comparable size. OSM results (not shown) are about 20 units better than RSM on 20 News and RCV1; see [26]. We also see that MCMC yields better perplexities when compared to SVI. The difference in performance between these two inference methods is likely due to the mean-field approximation and the online nature of SVI. We verified empirically (results not shown) that doubling the number of hidden units, adding a third layer, or increasing the number of samples/iterations for DPFM does not significantly change the results in Table 1. As a note on computational complexity, one iteration of the two-layer model on the 20 News corpus takes approximately 3 and 2 seconds, for MCMC and SVI, respectively. 
For comparison, we also ran the DPFA-SBN model in [6] using a two-layer model of the same size; in their case it takes about 24, 4 and 5 seconds to run one iteration using MCMC, conditional density filtering (CDF) and SGNHT, respectively. Runtimes for DPFA-RBM are similar to those of DPFA-SBN, LDA and RSM are faster than the 1-layer DPFM, FTM is comparable to the latter, and nHDP is slower than DPFM.

Figure 1 shows a representative meta-topic, $\psi_k^{(2)}$, from the two-layer model for 20 News. For the five largest weights in $\psi_k^{(2)}$ (y-axis), which correspond to layer-1 topic indices (x-axis), we also show the top five words in their layer-1 topic, $\psi_k^{(1)}$. We observe that this meta-topic is loaded with religion-specific topics, judging by the words in them. Additional graphs, and tables showing the top words in each topic for 20 News and RCV1, are provided in the Supplementary Material.

Figure 1: Representative meta-topics obtained from (left) 20 News and (right) medical records. Meta-topic weights $\psi_k^{(2)}$ vs. layer-1 topic indices, with word lists corresponding to the top five words in layer-1 
topics, \u03c8(1)\nk .\n\nlayer-1 topics indices, with word lists corresponding to the top \ufb01ve\n\nk\n\nClassi\ufb01cation We use 20 News for document classi\ufb01cation, to evaluate the discriminative DPFM\nmodel described in Section 2.4. We use test set accuracy on the 20-class task as performance mea-\nsure and compare our model against LDA, DocNADE [17], RSM and OSM. Results for these four\nmodels were obtained from [26], where multinomial logistic regression with cross-entropy loss func-\n\n7\n\n\fTable 2: Test accuracy on 20 News. Subscript accompanying model names indicate their size.\n\nModel\n\nLDA128 DocNADE512 RSM512 OSM512 DPFM128 DPFM128\u221264\n\nAccuracy (%)\n\n65.7\n\n68.4\n\n67.7\n\n69.1\n\n72.11\n\n72.67\n\ntion was used as classi\ufb01cation module. Test accuracies in Table 2 show that our model signi\ufb01cantly\noutperforms the others being considered. Note as well that our one-layer model still improves upon\nthe four times larger OSM, by more than 3%. We veri\ufb01ed that our two-layer model outperforms\nwell known supervised methods like multinomial logistic regression, SVM, supervised LDA and\ntwo-layer feedforward neural networks, for which test accuracies ranged from 67% to 72.14%, using\nterm frequency-inverse document frequency features. We could not improve results by increasing\nthe size of our model, however, we may be able to do so by following the approach of [33], where a\nsingle classi\ufb01cation module (SVM) is shared by 20 one-layer topic models (LDAs). Exploration of\nmore sophisticated deep model architectures for discriminative DPFMs is left as future work.\n\nMedical records The Duke University Health System medical records database used here, is a\n5 year dataset generated within a large health system including three hospitals and an extensive\nnetwork of outpatient clinics. For this analysis, we utilized self-reported medication usage from\nover 240,000 patients that had over 4.4 million patient visits. 
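As context for the preprocessing described next, a minimal sketch of aggregating medication mentions into an active-ingredient-by-patient count matrix. All column names and the `med_to_ai` mapping table are hypothetical; the paper's actual pipeline maps medication names to RxNorm active ingredients:

```python
import pandas as pd

def build_ai_patient_matrix(records, med_to_ai, min_ai_count=10):
    """Build an active-ingredient (AI) x patient count matrix.

    records   : DataFrame with columns ['patient_id', 'medication'],
                one row per medication mention in a patient record
    med_to_ai : DataFrame with columns ['medication', 'ai']; a compound
                medication maps to several AI rows, so each mention
                increments the count of every AI it contains
    """
    # Expand each medication mention into its active ingredient(s)
    hits = records.merge(med_to_ai, on='medication')
    counts = hits.groupby(['ai', 'patient_id']).size().unstack(fill_value=0)
    # Drop rare AIs, then patients left with no medication information
    counts = counts[counts.sum(axis=1) >= min_ai_count]
    counts = counts.loc[:, counts.sum(axis=0) > 0]
    return counts
```

Note that a compound medication contributes one count to every AI it contains, matching the counting convention used for the 1,019 x 131,264 matrix below.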
These patients reported over 34,000 different types of medications, which were then mapped to one of 1,691 pharmaceutical active ingredients (AIs) taken from RxNorm, a repository of medication information maintained by the National Library of Medicine that includes trade names, brand names, dosage information and active ingredients. Counts for patient-medication usage reflected the number of times an AI appears in a patient's record. Compound medications that include multiple active ingredients incremented counts for all AIs in that medication. Removing AIs with fewer than 10 overall occurrences and patients lacking medication information results in a 1,019 x 131,264 matrix of AIs vs. patients.

Results for an MCMC-based DPFM of size 64-32, with the same settings used for the first experiment, indicate that the pharmaceutical topics derived from this analysis form clinically reasonable clusters of pharmaceuticals that may be prescribed to patients for various ailments. In particular, we found that layer-1 topic 46 includes a cluster of insulin products: Insulin Glargine, Insulin Lispro, Insulin Aspart, NPH Insulin and Regular Insulin. Insulin-dependent type-2 diabetes patients often rely on tailored mixtures of insulin products with different pharmacokinetic profiles to ensure glycemic control. In another example, we found in layer-1 topic 22 an Angiotensin Receptor Blocker (ARB), Losartan, with an HMG-CoA Reductase inhibitor, Atorvastatin, and a heart-specific beta blocker, Carvedilol. This combination of medications is commonly used to control hypertension and hyperlipidemia in patients with cardiovascular risk. The second-layer correlation structure between topics of drug products also provides interesting composites of patient types based on the
first-layer pharmaceutical topics. Specifically, layer-2 factor 22 in Figure 1 reveals correlation between layer-1 drug factors that would be used to treat respiratory patients that had chronic obstructive respiratory disease and/or asthma (Albuterol, Montelukast) and seasonal allergies. Additional graphs, including top medications for all pharmaceutical topics found by our model, are provided in the Supplementary Material.

6 Conclusion

We presented a new deep model for topic modeling based on PFA modules. We have combined the interpretability of DP-based specifications found in traditional topic models with deep hierarchies of hidden binary units. Our model is elegant in that a single class of modules is used at each layer, but at the same time enjoys the computational benefit of scaling as a function of the number of non-zeros in the data and binary units. We described a discriminative extension for our deep architecture, and two inference methods: MCMC and SVI, the latter for large datasets. Compelling experimental results on several corpora and on a new medical records database demonstrated the advantages of our model.

Future directions include working towards alternatives for scaling up inference algorithms based on gradient-based approaches, extending the use of PFA modules in deep architectures to more sophisticated discriminative models, multi-modal tasks with mixed data types, and time-series modeling using ideas similar to [8].

Acknowledgements  This research was supported in part by ARO, DARPA, DOE, NGA and ONR.

References

[1] D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2004.
[2] D. M. Blei and J. D. Lafferty. A correlated topic model of science. AOAS, 2007.
[3] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 2003.
[4] T. Chen, E. B. Fox, and C. Guestrin.
Stochastic gradient Hamiltonian Monte Carlo. In ICML, 2014.
[5] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
[6] Z. Gan, C. Chen, R. Henao, D. Carlson, and L. Carin. Scalable deep Poisson factor analysis for topic modeling. In ICML, 2015.
[7] Z. Gan, R. Henao, D. Carlson, and L. Carin. Learning deep sigmoid belief networks with data augmentation. In AISTATS, 2015.
[8] Z. Gan, C. Li, R. Henao, D. Carlson, and L. Carin. Deep temporal sigmoid belief networks for sequence modeling. In NIPS, 2015.
[9] R. Guhaniyogi, S. Qamar, and D. B. Dunson. Bayesian conditional density filtering. arXiv:1401.3632, 2014.
[10] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
[11] G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
[12] G. E. Hinton and R. R. Salakhutdinov. Replicated softmax: an undirected topic model. In NIPS, 2009.
[13] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[14] M. Hoffman, F. R. Bach, and D. M. Blei. Online learning for latent Dirichlet allocation. In NIPS, 2010.
[15] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. JMLR, 2013.
[16] S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. In NIPS, 2009.
[17] H. Larochelle and S. Lauly. A neural autoregressive topic model. In NIPS, 2012.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
[19] M. Li, D. G. Andersen, A. J. Smola, and K. Yu.
Communication efficient distributed machine learning with the parameter server. In NIPS, 2014.
[20] L. Maaloe, M. Arngren, and O. Winther. Deep belief nets for topic modeling. arXiv:1501.04325, 2015.
[21] J. D. Mcauliffe and D. M. Blei. Supervised topic models. In NIPS, 2008.
[22] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 1992.
[23] J. Paisley, C. Wang, D. M. Blei, and M. I. Jordan. Nested hierarchical Dirichlet processes. PAMI, 2015.
[24] R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In AISTATS, 2014.
[25] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
[26] N. Srivastava, R. R. Salakhutdinov, and G. E. Hinton. Modeling documents with deep Boltzmann machines. In UAI, 2013.
[27] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. JASA, 2006.
[28] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
[29] S. Williamson, C. Wang, K. Heller, and D. Blei. The IBP compound Dirichlet process and its application to focused topic modeling. In ICML, 2010.
[30] M. Zhou. Infinite edge partition models for overlapping community detection and link prediction. In AISTATS, 2015.
[31] M. Zhou and L. Carin. Negative binomial process count and mixture modeling. PAMI, 2015.
[32] M. Zhou, L. Hannah, D. Dunson, and L. Carin. Beta-negative binomial process and Poisson factor analysis. In AISTATS, 2012.
[33] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: maximum margin supervised topic models.
JMLR, 2012.