{"title": "Dependent Multinomial Models Made Easy: Stick-Breaking with the Polya-gamma Augmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 3456, "page_last": 3464, "abstract": "Many practical modeling problems involve discrete data that are best represented as draws from multinomial or categorical distributions. For example, nucleotides in a DNA sequence, children's names in a given state and year, and text documents are all commonly modeled with multinomial distributions. In all of these cases, we expect some form of dependency between the draws: the nucleotide at one position in the DNA strand may depend on the preceding nucleotides, children's names are highly correlated from year to year, and topics in text may be correlated and dynamic. These dependencies are not naturally captured by the typical Dirichlet-multinomial formulation. Here, we leverage a logistic stick-breaking representation and recent innovations in Pólya-gamma augmentation to reformulate the multinomial distribution in terms of latent variables with jointly Gaussian likelihoods, enabling us to take advantage of a host of Bayesian inference techniques for Gaussian models with minimal overhead.", "full_text": "Dependent Multinomial Models Made Easy: Stick-Breaking with the Pólya-Gamma Augmentation

Scott W. Linderman* (Harvard University, Cambridge, MA 02138, swl@seas.harvard.edu)
Matthew J. Johnson* (Harvard University, Cambridge, MA 02138, mattjj@csail.mit.edu)
Ryan P. Adams (Twitter & Harvard University, Cambridge, MA 02138, rpa@seas.harvard.edu)

Abstract

Many practical modeling problems involve discrete data that are best represented as draws from multinomial or categorical distributions. For example, nucleotides in a DNA sequence, children's names in a given state and year, and text documents are all commonly modeled with multinomial distributions. 
In all of these cases, we expect some form of dependency between the draws: the nucleotide at one position in the DNA strand may depend on the preceding nucleotides, children's names are highly correlated from year to year, and topics in text may be correlated and dynamic. These dependencies are not naturally captured by the typical Dirichlet-multinomial formulation. Here, we leverage a logistic stick-breaking representation and recent innovations in Pólya-gamma augmentation to reformulate the multinomial distribution in terms of latent variables with jointly Gaussian likelihoods, enabling us to take advantage of a host of Bayesian inference techniques for Gaussian models with minimal overhead.

1 Introduction

It is often desirable to model discrete data in terms of continuous latent structure. In applications involving text corpora, discrete-valued time series, or polling and purchasing decisions, we may want to learn correlations or spatiotemporal dynamics and leverage these structures to improve inferences and predictions. However, adding these continuous latent dependence structures often comes at the cost of significantly complicating inference: such models may require specialized, one-off inference algorithms, such as a non-conjugate variational optimization, or they may only admit very general inference tools like particle MCMC [1] or elliptical slice sampling [2], which can be inefficient and difficult to scale. Developing, extending, and applying these models has remained a challenge.

In this paper we aim to provide a class of such models that are easy and efficient. 
We develop models for categorical and multinomial data in which dependencies among the multinomial parameters are modeled via latent Gaussian distributions or Gaussian processes, and we show that this flexible class of models admits a simple auxiliary variable method that makes inference easy, fast, and modular. This construction not only makes these models simple to develop and apply, but also allows the resulting inference methods to use off-the-shelf algorithms and software for Gaussian processes and linear Gaussian dynamical systems.

The paper is organized as follows. After providing background material and defining our general models and inference methods, we demonstrate the utility of this class of models by applying it to three domains as case studies. First, we develop a correlated topic model for text corpora. Second, we study an application to modeling the spatial and temporal patterns in birth names given only sparse data. Finally, we provide a new continuous state-space model for discrete-valued sequences, including text and human DNA. In each case, given our model construction and auxiliary variable method, inference algorithms are easy to develop and very effective in experiments.

*These authors contributed equally.

Code to use these models, write new models that leverage these inference methods, and reproduce the figures in this paper is available at github.com/HIPS/pgmult.

2 Modeling correlations in multinomial parameters

In this section, we discuss an auxiliary variable scheme that allows multinomial observations to appear as Gaussian likelihoods within a larger probabilistic model. 
The key trick discussed in the following sections is to introduce Pólya-gamma random variables into the joint distribution over data and parameters in such a way that the resulting marginal leaves the original model intact.

The integral identity underlying the Pólya-gamma augmentation scheme [3] is

\frac{(e^\psi)^a}{(1 + e^\psi)^b} = 2^{-b} e^{\kappa\psi} \int_0^\infty e^{-\omega\psi^2/2} \, p(\omega \mid b, 0) \, d\omega,    (1)

where \kappa = a - b/2 and p(\omega \mid b, 0) is the density of the Pólya-gamma distribution PG(b, 0), which does not depend on \psi. Consider a likelihood function of the form

p(x \mid \psi) = c(x) \, \frac{(e^\psi)^{a(x)}}{(1 + e^\psi)^{b(x)}}    (2)

for some functions a, b, and c. Such likelihoods arise, e.g., in logistic regression and in binomial and negative binomial regression [3]. Using (1) along with a prior p(\psi), we can write the joint density of (\psi, x) as

p(\psi, x) = p(\psi) \, c(x) \, \frac{(e^\psi)^{a(x)}}{(1 + e^\psi)^{b(x)}} = \int_0^\infty p(\psi) \, c(x) \, 2^{-b(x)} e^{\kappa(x)\psi} e^{-\omega\psi^2/2} \, p(\omega \mid b(x), 0) \, d\omega.    (3)

The integrand of (3) defines a joint density on (\psi, x, \omega) which admits p(\psi, x) as a marginal density. Conditioned on these auxiliary variables \omega, we have

p(\psi \mid x, \omega) \propto p(\psi) \, e^{\kappa(x)\psi} e^{-\omega\psi^2/2},    (4)

which is Gaussian when p(\psi) is Gaussian. Furthermore, by the exponential tilting property of the Pólya-gamma distribution, we have \omega \mid \psi, x \sim \mathrm{PG}(b(x), \psi). 
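To make the conditional conjugacy concrete, the following sketch implements the two Gibbs updates implied by (4) and the exponential tilting property in the simplest setting: a scalar ψ with a Gaussian prior and a binomial likelihood, so that a(x) = x, b(x) = N, and κ(x) = x − N/2. The truncated sum-of-gammas sampler is only an approximation to an exact Pólya-gamma draw (in practice one would use a dedicated sampler such as the pypolyagamma package); everything else follows the equations above.

```python
import numpy as np

def sample_pg(rng, b, c, trunc=200):
    """Approximate draw from PG(b, c) via its sum-of-gammas series
    representation, truncated to `trunc` terms."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(b, 1.0, size=trunc)
    return np.sum(g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))) / (2 * np.pi ** 2)

def gibbs_binomial_logit(x, N, mu0=0.0, sigma0=2.0, iters=500, seed=0):
    """Gibbs sampler for psi with prior N(mu0, sigma0^2) and
    likelihood x ~ Bin(N, logistic(psi)), i.e. a(x) = x, b(x) = N."""
    rng = np.random.default_rng(seed)
    kappa = x - N / 2.0                         # kappa(x) = a(x) - b(x)/2
    psi, samples = 0.0, []
    for _ in range(iters):
        omega = sample_pg(rng, N, psi)          # omega | psi, x ~ PG(N, psi)
        v = 1.0 / (1.0 / sigma0 ** 2 + omega)   # conditional variance from (4)
        m = v * (mu0 / sigma0 ** 2 + kappa)     # conditional mean
        psi = m + np.sqrt(v) * rng.standard_normal()
        samples.append(psi)
    return np.array(samples)

# Simulated data: psi_true = 1.0 gives success probability ~0.73.
rng = np.random.default_rng(1)
x = rng.binomial(200, 1.0 / (1.0 + np.exp(-1.0)))
samples = gibbs_binomial_logit(x, 200)
psi_hat = samples[100:].mean()   # posterior mean after burn-in
```

With 200 trials the posterior concentrates near the true ψ, so the chain's mean should land close to 1; the point is that each sweep resamples ψ from an exact Gaussian conditional rather than requiring any non-conjugate machinery.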
Thus the identity (1) gives rise to a conditionally conjugate augmentation scheme for Gaussian priors and likelihoods of the form (2). This augmentation scheme has been used to develop Gibbs sampling and variational inference algorithms for Bernoulli, binomial [3], and negative binomial [4] regression models with logit link functions, and has been applied to the multinomial distribution with a multi-class logistic link function [3, 5].

The multi-class logistic "softmax" function, \pi_{\mathrm{LN}}(\psi), maps a real-valued vector \psi \in \mathbb{R}^K to a probability vector \pi \in [0, 1]^K by setting \pi_k = e^{\psi_k} / \sum_{j=1}^K e^{\psi_j}. It is commonly used in multi-class regression [6] and correlated topic modeling [7]. Correlated multinomial parameters can be modeled with a Gaussian prior on the vector \psi, though the resulting models are not conjugate. The Pólya-gamma augmentation can be applied to such models [3, 5], but it only provides single-site Gibbs updating of \psi. This paper develops a joint augmentation in the sense that, given the auxiliary variables, the entire vector \psi is resampled as a block in a single Gibbs update.

2.1 A new Pólya-gamma augmentation for the multinomial distribution

First, rewrite the K-dimensional multinomial recursively in terms of K − 1 binomial densities:

\mathrm{Mult}(x \mid N, \pi) = \prod_{k=1}^{K-1} \mathrm{Bin}(x_k \mid N_k, \widetilde{\pi}_k), \qquad N_k = N - \sum_{j < k} x_j, \qquad \widetilde{\pi}_k = \frac{\pi_k}{1 - \sum_{j < k} \pi_j}
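As a numerical sanity check on this recursive decomposition, the sketch below (with illustrative values; the helper names are ours, not from the pgmult package) evaluates a multinomial log-density both directly and as a product of K − 1 binomial log-densities with the remaining trial counts and renormalized stick probabilities, and also maps a vector ψ ∈ R^{K−1} to the simplex via the logistic stick-breaking transform.

```python
import numpy as np
from math import lgamma

def psi_to_pi(psi):
    """Logistic stick-breaking map from psi in R^{K-1} to the simplex."""
    pi, stick = [], 1.0
    for p in psi:
        frac = 1.0 / (1.0 + np.exp(-p))   # logistic(psi_k): fraction of stick taken
        pi.append(stick * frac)
        stick *= 1.0 - frac
    pi.append(stick)                      # remaining mass becomes pi_K
    return np.array(pi)

def log_binom(x, n, p):
    """log Bin(x | n, p) via log-gamma functions."""
    return (lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
            + x * np.log(p) + (n - x) * np.log(1.0 - p))

def log_mult(x, pi):
    """log Mult(x | N, pi) evaluated directly."""
    n = sum(x)
    return lgamma(n + 1) - sum(lgamma(xk + 1) for xk in x) + np.dot(x, np.log(pi))

def log_mult_stickbreaking(x, pi):
    """Same density as a product of K-1 binomials."""
    N, total = sum(x), 0.0
    cum_x, cum_pi = 0, 0.0
    for k in range(len(pi) - 1):
        Nk = N - cum_x                    # trials not yet assigned to earlier bins
        pk = pi[k] / (1.0 - cum_pi)       # renormalized stick probability
        total += log_binom(x[k], Nk, pk)
        cum_x += x[k]
        cum_pi += pi[k]
    return total

psi = np.array([0.5, -1.0, 0.3])          # K = 4 categories
pi = psi_to_pi(psi)
x = [3, 1, 4, 2]                          # counts with N = 10
lhs = log_mult(x, pi)
rhs = log_mult_stickbreaking(x, pi)
```

The two evaluations agree to machine precision: the binomial normalizing constants telescope into the multinomial coefficient, and the renormalized stick probabilities recover the product of the original π_k terms.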