{"title": "A hybrid sampler for Poisson-Kingman mixture models", "book": "Advances in Neural Information Processing Systems", "page_first": 2161, "page_last": 2169, "abstract": "This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel and compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers.", "full_text": "A hybrid sampler for Poisson-Kingman mixture\n\nmodels\n\nMar\u00b4\u0131a Lomel\u00b4\u0131\nGatsby Unit\n\nUniversity College London\n\nmlomeli@gatsby.ucl.ac.uk\n\nStefano Favaro\n\nDepartment of Economics and Statistics\n\nUniversity of Torino and Collegio Carlo Alberto\n\nstefano.favaro@unito.it\n\nYee Whye Teh\n\nDepartment of Statistics\n\nUniversity of Oxford\n\ny.w.teh@stats.ox.ac.uk\n\nAbstract\n\nThis paper concerns the introduction of a new Markov Chain Monte Carlo scheme\nfor posterior sampling in Bayesian nonparametric mixture models with priors that\nbelong to the general Poisson-Kingman class. We present a novel compact way\nof representing the in\ufb01nite dimensional component of the model such that while\nexplicitly representing this in\ufb01nite component it has less memory and storage re-\nquirements than previous MCMC schemes. We describe comparative simulation\nresults demonstrating the ef\ufb01cacy of the proposed MCMC algorithm against ex-\nisting marginal and conditional MCMC samplers.\n\n1\n\nIntroduction\n\nAccording to Ghahramani [9], models that have a nonparametric component give us more \ufb02exiblity\nthat could lead to better predictive performance. 
This is because their capacity to learn does not saturate, hence their predictions should continue to improve as we get more and more data. Furthermore, we are able to fully quantify our uncertainty about predictions thanks to the Bayesian paradigm. However, a major impediment to the widespread use of Bayesian nonparametric models is the problem of inference. Over the years, many MCMC methods have been proposed to perform inference, usually relying on a tailored representation of the underlying process [5, 4, 18, 20, 28, 6]. This is an active research area since the infinite dimensional component forbids the direct use of standard simulation-based methods for posterior inference, which usually require a finite-dimensional representation. There are two main sampling approaches that facilitate simulation in the case of Bayesian nonparametric models: random truncation and marginalization. These two schemes are known in the literature as conditional and marginal samplers.
In conditional samplers, the infinite-dimensional prior is replaced by a finite-dimensional representation chosen according to a truncation level. In marginal samplers, the need to represent the infinite-dimensional component is bypassed by marginalising it out. Marginal samplers have lower storage requirements than conditional samplers but can potentially have worse mixing properties. On the other hand, not integrating out the infinite dimensional component leads to a more comprehensive representation of the random probability measure, useful to compute expectations of interest with respect to the posterior.
In this paper, we propose a novel MCMC sampler for Poisson-Kingman mixture models, a very large class of Bayesian nonparametric mixture models that encompasses all those previously explored in the literature. Our approach is based on a hybrid scheme that combines the main strengths of both conditional and marginal samplers. 
In the flavour of probabilistic programming, we view our contribution as a step towards wider usage of flexible Bayesian nonparametric models, as it allows automated inference in probabilistic programs built out of a wide variety of Bayesian nonparametric building blocks.

2 Poisson-Kingman processes

Poisson-Kingman random probability measures (RPMs) were introduced in Pitman [23] as a generalization of homogeneous Normalized Random Measures (NRMs) [25, 13]. Let X be a complete and separable metric space endowed with the Borel σ-field B(X), and let µ ∼ CRM(ρ, H_0) be a homogeneous Completely Random Measure (CRM) with Lévy measure ρ and base distribution H_0; see Kingman [15] for a good overview of CRMs and references therein. Let the total mass of µ, T = µ(X), be finite, positive almost surely, and absolutely continuous with respect to Lebesgue measure. For any t ∈ R+, consider the conditional distribution of µ/t given that the total mass T ∈ dt. This distribution, denoted by PK(ρ, δ_t, H_0), is the distribution of an RPM, where δ_t denotes the usual Dirac delta function. Poisson-Kingman RPMs form a class of RPMs whose distributions are obtained by mixing PK(ρ, δ_t, H_0), over t, with respect to some distribution γ on the positive real line. Specifically, a Poisson-Kingman RPM has the following hierarchical representation
$$T \sim \gamma, \qquad P \mid T = t \sim \mathrm{PK}(\rho, \delta_t, H_0). \tag{1}$$
The RPM P is referred to as the Poisson-Kingman RPM with Lévy measure ρ, base distribution H_0 and mixing distribution γ. Throughout the paper we denote by PK(ρ, γ, H_0) the distribution of P and, without loss of generality, we will assume that γ(dt) ∝ h(t)f_ρ(t)dt, where f_ρ is the density of the total mass T under the CRM and h is a non-negative function. 
Note that, when γ(dt) = f_ρ(t)dt, the distribution PK(ρ, f_ρ, H_0) coincides with NRM(ρ, H_0). The resulting P = Σ_{k≥1} p_k δ_{φ_k} is almost surely discrete and, since µ is homogeneous, the atoms (φ_k)_{k≥1} of P are independent of their masses (p_k)_{k≥1} and form a sequence of independent random variables identically distributed according to H_0. Finally, the masses of P have distribution governed by the Lévy measure ρ and the distribution γ.
One nice property is that P is almost surely discrete: if we obtain a sample {Y_i}_{i=1}^n from it, there is a positive probability that Y_i = Y_j for each pair of indexes i ≠ j. Hence, it induces a random partition Π on N, where i and j are in the same block of Π if and only if Y_i = Y_j. Kingman [16] showed that Π is exchangeable; this property will be one of the main tools in the derivation of our hybrid sampler.

2.1 Size-biased sampling Poisson-Kingman processes

A second object induced by a Poisson-Kingman RPM is a size-biased permutation of its atoms. Specifically, order the blocks in Π by increasing order of the least element in each block, and for each k ∈ N let Z_k be the least element of the kth block. Z_k is the index among (Y_i)_{i≥1} of the first appearance of the kth unique value in the sequence. Let J̃_k = µ({Y_{Z_k}}) be the mass of the corresponding atom in µ. Then (J̃_k)_{k≥1} is a size-biased permutation of the masses of the atoms in µ, with Σ_{k≥1} J̃_k = T and with larger masses tending to appear earlier in the sequence. 
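As a toy illustration (a hypothetical finite analogue, not the infinite-dimensional object itself), a size-biased permutation of finitely many atom masses can be simulated by repeatedly drawing an index with probability proportional to its mass, without replacement:

```python
import random

def size_biased_permutation(masses, rng=None):
    """Return a size-biased permutation of `masses`: at each step one
    remaining mass is drawn with probability proportional to its size,
    so larger masses tend to appear earlier in the output sequence."""
    rng = rng or random.Random()
    remaining = list(masses)
    out = []
    while remaining:
        total = sum(remaining)
        u = rng.random() * total
        acc = 0.0
        for pos, m in enumerate(remaining):
            acc += m
            if u <= acc:
                out.append(remaining.pop(pos))
                break
    return out
```

For example, with masses (5.0, 1.0, 0.1), the largest mass appears first in roughly 5/6.1 ≈ 82% of runs.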
It is easy to see that Σ_{k≥1} J̃_k = T and that the sequence can be understood as a stick-breaking construction: starting with a stick of length T_0 = T, break off a first piece of length J̃_1; the surplus length of stick is T_1 = T_0 − J̃_1; then a second piece of length J̃_2 is broken off, etc.
Theorem 2.1 of Perman et al. [21] states that the sequence of surplus masses (T_k)_{k≥0} forms a Markov chain and gives the corresponding initial distribution and transition kernels. The corresponding generative process for the sequence (Y_i)_{i≥1} is as follows:

i) Start by drawing the total mass from its distribution P_{ρ,h,H_0}(T ∈ dt) ∝ h(t)f_ρ(t)dt.

ii) The first draw Y_1 from P is a size-biased pick from the masses of µ. The actual value of Y_1 is simply Y*_1 ∼ H_0, while the mass of the corresponding atom in µ is J̃_1, with conditional distribution
$$P_{\rho,h,H_0}(\tilde{J}_1 \in ds_1 \mid T \in dt) = \frac{s_1}{t}\,\rho(ds_1)\,\frac{f_\rho(t-s_1)}{f_\rho(t)},$$
and surplus mass T_1 = T − J̃_1.

iii) For subsequent draws i ≥ 2:
– Let K be the current number of distinct values among Y_1, . . . , Y_{i−1}, and Y*_1, . . . , Y*_K the unique values, i.e., atoms in µ. The masses of these first K atoms are denoted by J̃_1, . . . , J̃_K and the surplus mass is T_K = T − Σ_{k=1}^K J̃_k.
– For each k ≤ K, with probability J̃_k/T, we set Y_i = Y*_k.
– With probability T_K/T, Y_i takes on the value of an atom in µ besides the first K atoms. The actual value Y*_{K+1} is drawn from H_0, while its mass is drawn from
$$P_{\rho,h,H_0}(\tilde{J}_{K+1} \in ds_{K+1} \mid T_K \in dt_K) = \frac{s_{K+1}}{t_K}\,\rho(ds_{K+1})\,\frac{f_\rho(t_K - s_{K+1})}{f_\rho(t_K)},$$
with surplus mass T_{K+1} = T_K − J̃_{K+1}.

By multiplying the above infinitesimal probabilities, one obtains the joint distribution of the random elements T, Π, (J̃_i)_{i≥1} and (Y*_i)_{i≥1}:
$$P_{\rho,h,H_0}\big(\Pi_n = (c_k)_{k\in[K]},\, Y^*_k \in dy^*_k,\, \tilde{J}_k \in ds_k \text{ for } k \in [K],\, T \in dt\big) = t^{-n} f_\rho\Big(t - \sum_{k=1}^{K} s_k\Big)\, h(t)\,dt \prod_{k=1}^{K} s_k^{|c_k|}\,\rho(ds_k)\,H_0(dy^*_k), \tag{2}$$
where (c_k)_{k∈[K]} denotes a particular partition of [n] with K blocks, c_1, . . . , c_K, ordered by increasing least element, and |c_k| is the cardinality of block c_k. The distribution (2) is invariant to the size-biased order. Such a joint distribution was first obtained in Pitman [23]; see also Pitman [24] for further details.

2.2 Relationship to the usual stick-breaking construction

In the generative process above, we mentioned that it is reminiscent of the well known stick-breaking construction from Ishwaran & James [12], where one breaks a stick of length one, but it is not the same. However, starting from Equation (2), we can effectively reparameterize the model due to two useful identities in distribution: V_j ≐ P_j/(1 − Σ_{ℓ<j} P_ℓ) and V_j ≐ J̃_j/(T − Σ_{ℓ<j} J̃_ℓ), for j = 1, . . . , K. Indeed, using this reparameterization, we obtain the corresponding joint distribution in terms of K (0, 1)-valued stick-breaking weights {V_j}_{j=1}^K, which correspond to a stick-breaking representation. Note that this joint distribution is for a general Lévy measure ρ and density f_ρ, and it is conditioned on the value of the random variable T. We can recover the well known stick-breaking representations for the Dirichlet and Pitman-Yor processes for a specific choice of ρ and if we integrate out T; see the supplementary material for further details about the latter. However, in general, these stick-breaking random variables form a sequence of dependent random variables with a complicated distribution, except for the two previously mentioned processes; see Pitman [22] for details.

2.3 Poisson-Kingman mixture model

We are mainly interested in using Poisson-Kingman RPMs as a building block for an infinite mixture model. Indeed, we can use Equation (1) as the top level of the following hierarchical specification
$$T \sim \gamma, \qquad P \mid T \sim \mathrm{PK}(\rho_\sigma, \delta_T, H_0), \qquad Y_i \mid P \overset{\text{iid}}{\sim} P, \qquad X_i \mid Y_i \overset{\text{ind}}{\sim} F(\cdot \mid Y_i), \tag{3}$$

Figure 1: Varying table size Chinese restaurant representation for observations {X_i}_{i=1}^9

where F(· | Y) is the likelihood term for each mixture component, and our dataset consists of n observations (x_i)_{i∈[n]} of the corresponding variables (X_i)_{i∈[n]}. We will assume that F(· | Y) is smooth. After specifying the model we would like to carry out inference for clustering and/or density estimation tasks. With our novel approach, we can do so exactly and more efficiently than with known MCMC samplers. In the next section, we present our main contribution, and in the following one we show how it outperforms other samplers.

3 Hybrid Sampler

The joint distribution in Equation (2) is written in terms of the first K size-biased weights. In order to obtain a complete representation of the RPM, we need to size-bias sample from it a countably infinite number of times. 
We therefore need some way of representing this object exactly in a computer with finite memory and storage.
We introduce the following novel strategy: starting from Equation (2), we exploit the generative process of Section 2.1 when reassigning observations to clusters. In addition to this, we reparameterize the model in terms of a surplus mass random variable V = T − Σ_{k=1}^K J̃_k and end up with the following joint distribution
$$P_{\rho,h,H_0}\Big(\Pi_n = (c_k)_{k\in[K]},\, Y^*_k \in dy^*_k,\, \tilde{J}_k \in ds_k \text{ for } k \in [K],\, T - \sum_{k=1}^{K} \tilde{J}_k \in dv,\, X_i \in dx_i \text{ for } i \in [n]\Big) = \Big(v + \sum_{k=1}^{K} s_k\Big)^{-n} h\Big(v + \sum_{k=1}^{K} s_k\Big) f_\rho(v) \prod_{k=1}^{K}\Big[ s_k^{|c_k|}\,\rho(ds_k)\,H_0(dy^*_k) \prod_{i\in c_k} F(dx_i \mid y^*_k)\Big]. \tag{4}$$

For this reason, while having a complete representation of the infinite dimensional part of the model, we only need to explicitly represent the size-biased weights associated to occupied clusters plus a surplus mass term associated to the rest of the empty clusters, as Figure 1 shows. The cluster reassignment step can be seen as a lazy sampling scheme: we explicitly represent and update the weights associated to occupied clusters and create a size-biased weight only when a new cluster appears. To make this possible we use the induced partition, and we call Equation (4) the varying table size Chinese restaurant representation because the size-biased weights can be thought of as the sizes of the tables in our restaurant. 
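To make the lazy scheme concrete, here is a schematic sketch of a single cluster-reassignment draw (a hypothetical helper, not the paper's implementation): only the K occupied size-biased weights and the surplus mass are held in memory, the likelihood values F(x_i | y*) are assumed precomputed and passed in, and the surplus mass is split across M proposed new clusters in the spirit of the ReUse step:

```python
import random

def reassign(s, v, lik_existing, lik_new, rng=None):
    """Draw a cluster assignment for one observation.

    s            : size-biased weights of the K occupied clusters
    v            : surplus mass, representing all empty clusters
    lik_existing : likelihood of the observation under each occupied cluster
    lik_new      : likelihoods under M proposed new-cluster parameters
    Returns k in [0, K) for an existing cluster, or ('new', j), meaning the
    j-th proposed cluster was chosen and a new size-biased weight must then
    be created for it."""
    rng = rng or random.Random()
    M = len(lik_new)
    # unnormalised probabilities: s_k * F(x | y*_k) for occupied clusters,
    # (v / M) * F(x | y*_new_j) for each of the M proposed new clusters
    w = [sk * fk for sk, fk in zip(s, lik_existing)]
    w += [(v / M) * fj for fj in lik_new]
    u = rng.random() * sum(w)
    acc = 0.0
    for i, wi in enumerate(w):
        acc += wi
        if u <= acc:
            return i if i < len(s) else ('new', i - len(s))
    return ('new', M - 1)
```

When a `('new', j)` outcome occurs, the sampler must draw the corresponding new size-biased weight from its conditional distribution, which is the step addressed in Section 3.2.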
In the next subsection, we compute the complete conditionals\nof each random variable of interest to implement an overall Gibbs sampling MCMC scheme.\n\nStarting from equation (4), we obtain the following complete conditionals for the Gibbs sampler\n\n(5)\n\n|ci|\ni \u03c1pdsiqIp0,Surpmassiqpsiqdsi\ns\n\n3.1 Complete conditionals\n\nPpV P dv | Restq9\n\u00b4\n\u00af\n\u02dcJi P dsi | Rest\n\n9\n\nP\n\n\u02dc\nv ` K\u00ff\n\u02dc\nk\u201c1\nv ` si `\n\nsk\n\n\u00b8\u00b4n\n\u00ff\n\nsk\n\nk\u2030i\n\n\u02dc\nv ` K\u00ff\n\u02dc\nf\u03c1pvqh\nk\u201c1\nv ` si `\n\n\u00b8\u00b4n\n\nh\n\n\u00b8\n\u00ff\n\nsk\n\ndv\n\n\u00b8\n\nsk\n\nk\u2030i\n\n4\n\n\u02dcJ1,Y\u22171X3X1X2\u02dcJ2,Y\u22172X4X5\u02dcJ3,Y\u22173X6\u02dcJ4,Ye1X1X8T\u2212P4\u2018=1\u02dcJ\u2018nY0e1,Ye2o\fwhere Surpmassi \u201c V `\n\nk\nj\u201c1\nPpci \u201c c | c\u00b4i, Restq9\n\n\u0159\n\n\u0159\n\n\u02dcJj.\n\nj\u0103i\n\n#\n\u02dcJj \u00b4\nscFpdxi | tXjujPc Y \u02da\nc q\nM Fpdxi | Y \u02da\nc q\n\nv\n\nif i is assigned to existing cluster c\nif i is assigned to a new cluster c\n\nAccording to the rule above, the ith observation will be either reassigned to an existing cluster or to\none of the M new clusters in the ReUse algorithm as in Favaro & Teh [6]. If it is assigned to a new\ncluster, then we need to sample a new size-biased weight from the following\n\n\u00b4\n\n\u00af\n\u02dcJk`1 P dsk`1 | Rest\n\nP\n\n9f\u03c1pv \u00b4 sk`1q\u03c1psk`1qsk`1Ip0,vqpsk`1qdsk`1.\n\n(6)\n\nEvery time a new cluster is created we need to obtain its corresponding size-biased weight which\ncould happen 1 \u010f R \u010f n times per iteration hence, it has a signi\ufb01cant contribution to the overall\ncomputational cost. For this reason, an independent and identically distributed (i.i.d.) draw from\nits corresponding complete conditional (6) is highly desirable. In the next subsection we present a\nk ukPrKs, in the case where H0 is\nway to achieve this. 
Finally, for updating cluster parameters tY \u02da\nnon-conjugate to the likelihood, we use an extension of Favaro & Teh [6]\u2019s ReUse algorithm, see\nAlgorithm 3 in the supplementary material for details.\nThe complete conditionals in Equation (5) do not have a standard form but a generic MCMC method\ncan be applied to sample from each within the Gibbs sampler. We use slice sampling from Neal [19]\nto update the size-biased weights and the surplus mass. However, there is a class of priors where the\ntotal mass\u2019s density is intractable so an additional step needs to be introduced to sample the surplus\nmass. In the next subsection we present two alternative ways to overcome this issue.\n\n\u01598\n\n3.2 Example of classes of Poisson-Kingman priors\n\nj!\n\nP\n\nj\u201c0\n\np\u00b41qj`1\n\np0, 1q,\n\nFor any \u03c3\n\nsinp\u03c0\u03c3jq \u0393p\u03c3j`1q\n\na) \u03c3-Stable Poisson-Kingman processes [23].\n1\n\u03c0\n\nlet f\u03c3ptq \u201c\nt\u03c3j`1 be the density function of a positive \u03c3-Stable random variable and\n\u0393p1\u00b4\u03c3q x\u00b4\u03c3\u00b41dx. This class of RPMs is denoted by PKp\u03c1\u03c3, hT , H0q where h\n\u03c1pdxq \u201c \u03c1\u03c3pdxq :\u201c \u03c3\nis a function that indexes each member of the class. For example, in the experimental section, we\npicked 3 choices of the h function that index the following processes: Pitman-Yor, Normalized Sta-\nble and Normalized Generalized Gamma processes. This class includes all Gibbs type priors with\nparameter \u03c3 P p0, 1q, so other choices of h are possible, see Gnedin & Pitman [10] and De Blasi\net al. [1] for a noteworthy account of this class of Bayesian nonparametric priors. In this case, the\ntotal mass\u2019s density is intractable and we propose two ways of dealing with this. Firstly, we used\nKanter [14]\u2019s integral representation for the \u03c3-Stable density as in Lomeli et al. 
[17], introduce an\nauxiliary variable Z and slice sample each variable\n\n\u00b8\u00b4n\n\n\u02dc\nv ` k\u00ff\n\u201d\n\u00b4vp\u00b4 \u03c3\n\nsi\n\n\u0131\nv\u00b4 \u03c3\n1\u00b4\u03c3 exp\n1\u00b4\u03c3qApzq\n\ndz,\n\nPpV P dv | Restq9\ni\u201c1\nPpZ P dz | Restq9Apzq exp\n\n\u201d\n\u00b4v\n\n\u0131\n1\u00b4\u03c3 Apzq\n\u00b4\u03c3\n\nh\n\n\u00b8\n\n\u02dc\nv ` k\u00ff\n\ni\u201c1\n\nsi\n\ndv\n\nsee Algorithm 1 in the supplementary material for details. Alternatively, we can completely bypass\nthe evaluation of the total mass\u2019s density by updating the surplus mass with a Metropolis-Hastings\nstep with an independent proposal from a Stable or from an Exponentially Tilted Stable(\u03bb). It is\nstraight forward to obtain i.i.d draws from these proposals, see Devroye [3] and Hofert [11] for an\nimproved rejection sampling method for the Exponentially tilted case. This leads to the following\nacceptance ratio\n\n\u00b4\n\u00b4\nv1 `\nv `\n\n\u0159\n\u0159\n\nk\ni\u201c1 si\n\nk\ni\u201c1 si\n\n\u00af\u00b4n\n\u00af\u00b4n\n\n\u00b4\n\u0159\n\u00b4\n\u0159\nv1 `\nv `\n\nh\n\nh\n\nk\ni\u201c1 si\n\nk\ni\u201c1 si\n\n\u00af\n\u00af\ndv1 expp\u00b4vq\ndv expp\u00b4v1q ,\n\nsee Algorithm 2 in the supplementary material for details. Finally, to sample a new size-biased\nweight\n\nPpV 1 P dv1 | Restq f\u03c3pvq expp\u00b4\u03bbvq\nPpV P dv | Restq f\u03c3pv1q expp\u00b4\u03bbv1q \u201c\n\u00b4\n\u00af\n\u02dcJk`1 P dsk`1 | Rest\n\nP\n\n9f\u03c3pv \u00b4 sk`1qs\u00b4\u03c3\nk`1\n\nIp0,vqpsk`1qdsk`1.\n\n5\n\n\fFortunately, we can get an i.i.d. draw from the above due to an identity in distribution given by\nFavaro et al. [8] for the usual stick breaking weights for any prior in this class such that \u03c3 \u201c u\nwhere u \u0103 v are coprime integers. 
Then we just reparameterize it back to obtain the new size-biased\nweight, see Algorithm 4 in the supplementary material for details.\n\nv\n\n27].\n\n[25,\n\nprocesses\n\n\u00b4 logBeta-Poisson-Kingman\n\n\u201c\nb)\n\u0393paq\u0393pbq expp\u00b4atqp1 \u00b4 expp\u00b4tqqb\u00b41 be the density of a positive random variable X\n\u0393pa`bq\nd\u201c \u00b4 log Y ,\nwhere Y \u201e Betapa, bq and \u03c1pxq \u201c expp\u00b4axqp1\u00b4expp\u00b4bxqq\n. This class of RPMs generalises the\nGamma process but has similar properties. Indeed, if we take b \u201c 1 and the density function for\nT is \u03b3ptq \u201c f\u03c1ptq we recover the L\u00b4evy measure and total mass\u2019s density function of a Gamma\nprocess. Finally, to sample a new size-biased weight\n\n\u00b4\n\u00af\n9p1 \u00b4 exppsk`1 \u00b4 vqqb\u00b41 p1 \u00b4 expp\u00b4bsk`1qq\n\u02dcJk`1 P dsk`1 | Rest\n\ndsk`1Ip0,vqpsk`1q.\n\nxp1\u00b4expp\u00b4xqq\n\nf\u03c1ptq\n\nLet\n\nP\n\n1 \u00b4 expp\u00b4sk`1q\n\nIf b \u0105 1, this complete conditional is a monotone decreasing unnormalised density with maximum\nat b. We can easily get an i.i.d. draw with a simple rejection sampler [2] where the rejection constant\nis bv and the proposal is Up0, vq. There is no other known sampler for this process.\n\n3.3 Relationship to marginal and conditional MCMC samplers\n\nStarting from equation (2), another strategy would be to reparameterize the model in terms of the\nusual stick breaking weights. Next, we could choose a random truncation level and represent \ufb01nitely\nmany sticks as in Favaro & Walker [7]. Alternatively, we could integrate out the random probability\nmeasure and sample only the partition induced by it as in Lomeli et al. [17]. Conditional samplers\nhave large memory requirements as often, the number of sticks needed can be very large. Fur-\nthermore, the conditional distributions of the stick lengths are quite involved so they tend to have\nslow running times. 
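Returning to Example b): the rejection sampler just described, with proposal U(0, v) and envelope constant b (the unnormalised density is bounded by b for b > 1), can be sketched as follows; the function name and signature are illustrative only:

```python
import math, random

def sample_new_weight_logbeta(v, b, rng=None):
    """Rejection sampler for the new size-biased weight in the -logBeta
    class: unnormalised target on (0, v)
        g(s) = (1 - exp(s - v))**(b - 1) * (1 - exp(-b*s)) / (1 - exp(-s)).
    Since g(s) <= b when b > 1, propose s ~ U(0, v) and accept with
    probability g(s) / b."""
    rng = rng or random.Random()
    while True:
        s = rng.random() * v
        if s == 0.0:  # avoid division by zero at the boundary
            continue
        g = ((1.0 - math.exp(s - v)) ** (b - 1.0)
             * (1.0 - math.exp(-b * s)) / (1.0 - math.exp(-s)))
        if rng.random() * b <= g:
            return s
```

Because the envelope constant is b, the expected number of proposals per accepted draw grows only linearly in b, which keeps this step cheap inside the Gibbs sweep.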
Marginal samplers have lower storage requirements than conditional samplers but can potentially have worse mixing properties. For example, Lomeli et al. [17] had to introduce a number of auxiliary variables which worsen the mixing.
Our novel hybrid sampler exploits the advantages of both marginal and conditional samplers. It has lower memory requirements since it represents only the size-biased weights of occupied clusters, as opposed to conditional samplers, which represent both empty and occupied clusters. Also, it does not integrate out the size-biased weights; thus, we obtain a more comprehensive representation of the RPM.

4 Performance assessment

We illustrate the performance of our hybrid sampler on a range of Bayesian nonparametric mixture models, obtained by different specifications of ρ and γ, as in Equation (3). At the top level of this hierarchical specification, different Bayesian nonparametric priors were chosen from both classes presented in the examples section. We chose the base distribution H_0 and the likelihood term F for the kth cluster to be
$$H_0(d\mu_k) = \mathcal{N}(d\mu_k \mid \mu_0, \sigma_0^2) \quad \text{and} \quad F(dx_1, \ldots, dx_{n_k} \mid \mu_k, \sigma_1^2) = \prod_{i=1}^{n_k} \mathcal{N}(x_i \mid \mu_k, \sigma_1^2),$$
where {X_j}_{j=1}^{n_k} are the n_k observations assigned to the kth cluster at some iteration. N denotes a Normal distribution with mean µ_k and variance σ_1^2, a common parameter among all clusters. The mean's prior distribution is Normal, centered at µ_0 and with variance σ_0^2. Although the base distribution is conjugate to the likelihood, we treated it as a non-conjugate case and sampled the parameters at each iteration rather than integrating them out.
We used the dataset from Roeder [26] to test the algorithmic performance in terms of running time and effective sample size (ESS), as Table 1 shows. 
The dataset consists of measurements of veloc-\nities in km/sec of n \u201c 82 galaxies from a survey of the Corona Borealis region. For the \u03c3-Stable\nPoisson-Kingman class, we compared it against our implementation of Favaro & Walker [7]\u2019s con-\nditional sampler and against the marginal sampler of Lomeli et al. [17]. We chose to compare our\nhybrid sampler against these existing approaches which follow the same general purpose paradigm.\n\nxi | \u00b5k, \u03c32\n\nnk\n\n1\n\n,\n\n6\n\n\fAlgorithm\n\nPitman-Yor process (\u03b8 \u201c 10)\n\nHybrid\n\nHybrid-MH (\u03bb \u201c 0)\n\nHybrid-MH (\u03bb \u201c 50)\n\nConditional\nMarginal\nHybrid\n\nConditional\nMarginal\n\nNormalized Stable process\n\nHybrid\n\nHybrid-MH (\u03bb \u201c 0)\n\nHybrid-MH (\u03bb \u201c 50)\n\nConditional\nMarginal\nHybrid\n\nConditional\nMarginal\n\nNormalized Generalized Gamma process (\u03c4 \u201c 1)\n\nHybrid\n\nHybrid-MH (\u03bb \u201c 0)\n\nConditional\nMarginal\nHybrid\n\nHybrid-MH (\u03bb \u201c 50)\n\nConditional\nMarginal\n\n-logBeta (a \u201c 1, b \u201c 2)\n\nHybrid\n\nConditional\nMarginal\n\nRunning 
time\n\nESS(\u02d8std)\n\n7135.1(28.316)\n5469.4(186.066)\n\n2635.488(187.335)\n2015.625(152.030)\n\nNA\n\n4685.7(84.104)\n3246.9(24.894)\n4902.3(6.936)\n\n10141.6(237.735)\n4757.2(37.077)\n\nNA\n\n2382.799(169.359)\n3595.508(174.075)\n3579.686(135.726)\n905.444(41.475)\n2944.065(195.011)\n\n5054.7(70.675)\n7866.4(803.228)\n\n5324.146(167.843)\n5074.909(100.300)\n\nNA\n\n7658.3(193.773)\n5382.9(57.561)\n4537.2(37.292)\n10033.1(22.647)\n8203.1(106.798)\n\nNA\n\n2630.264(429.877)\n4877.378(469.794)\n4454.999(348.356)\n912.382(167.089)\n3139.412(351.788)\n\n4157.8(92.863)\n4745.5(187.506)\n\n5104.713(200.949)\n4848.560(312.820)\n\nNA\n\n7685.8(208.98)\n6299.2(102.853)\n4686.4(35.661)\n10046.9(206.538)\n8055.6(93.164)\n\nNA\n\n3587.733(569.984)\n4646.987(370.955)\n4343.555(173.113)\n1000.214(70.148)\n4443.905(367.297)\n\n2520.6(121.044)\n\nNA\nNA\n\n3068.174(540.111)\n\nNA\nNA\n\n\u03c3\n\n0.3\n0.3\n0.3\n0.3\n0.5\n0.5\n0.5\n0.5\n\n0.3\n0.3\n0.3\n0.3\n0.5\n0.5\n0.5\n0.5\n\n0.3\n0.3\n0.3\n0.3\n0.5\n0.5\n0.5\n0.5\n\n-\n-\n-\n\nTable 1: Running times in seconds and ESS averaged over 10 chains, 30,000 iterations, 10,000 burn in.\n\nTable 1 shows that different choices of \u03c3 result in differences in the algorithm\u2019s running times and\nESS. The reason for this is that in the \u03c3 \u201c 0.5 case there are readily available random number\ngenerators which do not increase the computational cost. In contrast, in the \u03c3 \u201c 0.3 case, a rejection\nsampler method is needed every time a new size-biased weight is sampled which increases the\ncomputational cost, see Favaro et al. [8] for details. Even so, in most cases, we outperform both\nmarginal and conditional MCMC schemes in terms of running times and in all cases, in terms of\nESS. In the Hybrid-MH case, even thought the ESS and running times are competitive, we found\nthat the acceptance rate is not optimal, we are currently exploring other choices of proposals. 
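For reference, the Hybrid-MH surplus-mass update with λ = 0 can be sketched for σ = 1/2, where an exact positive 1/2-stable draw is available through a Normal variate, and for a polynomial tilting h(t) ∝ t^{−θ} (an assumption made here for concreteness): since the proposal equals the stable density f_{1/2}, those terms cancel, and the acceptance ratio reduces to a power of the total-mass ratio.

```python
import math, random

def mh_update_surplus(v, s_sum, n, theta, rng=None):
    """One independent-proposal MH update of the surplus mass V (lambda = 0).

    Proposal: v' = 1 / (2 * Z**2) with Z ~ N(0, 1), an exact draw from the
    positive 1/2-stable density f_{1/2}(t) = t**-1.5 * exp(-1/(4t)) / (2*sqrt(pi)).
    Because the proposal is f_{1/2} itself, the stable densities cancel in the
    ratio and, with the assumed tilting h(t) = t**(-theta), the acceptance
    probability is min(1, ((v' + s_sum) / (v + s_sum))**(-(n + theta)))."""
    rng = rng or random.Random()
    z = rng.gauss(0.0, 1.0)
    v_prop = 1.0 / (2.0 * z * z)
    log_accept = -(n + theta) * (math.log(v_prop + s_sum) - math.log(v + s_sum))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return v_prop
    return v
```

Here `s_sum` stands for the sum of the currently occupied size-biased weights, and `n` is the number of observations; both names are illustrative.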
Finally, in Example b), our approach is the only one available, and it has good running times and ESS. This qualitative comparison confirms our previous statements about our novel approach.

5 Discussion

Our main contribution is our hybrid MCMC sampler as a general purpose tool for inference with a very large class of infinite mixture models. We argue in favour of an approach in which a generic algorithm can be applied to a very large class of models, so that the modeller has a lot of flexibility in choosing specific models suitable for his/her problem of interest. Our method is a hybrid approach since it combines the perks of the conditional and marginal schemes. Indeed, our experiments confirm that our hybrid sampler is more efficient since it outperforms both marginal and conditional samplers in running times in most cases and in ESS in all cases.
We introduced a new compact way of representing the infinite dimensional component such that it is feasible to perform inference, and showed how to deal with the corresponding intractabilities. However, various challenges remain when dealing with these types of models. For instance, there are some values of σ for which we are unable to perform inference with our novel sampler. Secondly, when a Metropolis-Hastings step is used, there could be other ways to improve the mixing in terms of better proposals. Finally, all BNP MCMC methods can be affected by the dimensionality and size of the dataset when dealing with an infinite mixture model. Indeed, all methods rely on the same way of dealing with the likelihood term. When adding a new cluster, all methods sample its corresponding parameter from the prior distribution. In a high dimensional scenario, it could be very difficult to sample parameter values close to the existing data points. 
We consider these points to be\nan interesting avenue of future research.\n\nAcknowledgments\n\nWe thank Konstantina Palla for her insightful comments. Mar\u00b4\u0131a Lomel\u00b4\u0131 is funded by the Gatsby\nCharitable Foundation, Stefano Favaro is supported by the European Research Council through\nStG N-BNP 306406 and Yee Whye Teh is supported by the European Research Council under\nthe European Unions Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no.\n617071.\n\nReferences\n[1] De Blasi, P., Favaro, S., Lijoi, A., Mena, R. H., Pr\u00a8uenster, I., & Ruggiero, M. 2015. Are Gibbs-type priors\nthe most natural generalization of the Dirichlet process? Pages 212\u2013229 of: IEEE Transactions on Pattern\nAnalysis & Machine Intelligence, vol. 37.\n\n[2] Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer-Verlag.\n[3] Devroye, L. 2009. Random variate generation for exponentially and polynomially tilted Stable distribu-\n\ntions. ACM Transactions on Modelling and Computer Simulation, 19, 1\u201320.\n\n[4] Escobar, M. D. 1994. Estimating normal means with a Dirichlet process prior. Journal of the American\n\nStatistical Association, 89, 268\u2013277.\n\n[5] Escobar, M. D., & West, M. 1995. Bayesian density estimation and inference using mixtures. Journal of\n\nthe American Statistical Association, 90, 577\u2013588.\n\n[6] Favaro, S., & Teh, Y. W. 2013. MCMC for Normalized Random Measure Mixture Models. Statistical\n\nScience, 28(3), 335\u2013359.\n\n[7] Favaro, S., & Walker, S. G. 2012. Slice sampling \u03c3-Stable Poisson-Kingman mixture models. Journal of\n\nComputational and Graphical Statistics, 22, 830\u2013847.\n\n[8] Favaro, S., Lomeli, M., Nipoti, B., & Teh, Y. W. 2014. On the Stick-Breaking representation of \u03c3-Stable\n\nPoisson-Kingman models. Electronic Journal of Statistics, 8, 1063\u20131085.\n\n[9] Ghahramani, Z. 2015. Probabilistic Machine Learning and Arti\ufb01cial Inteligence. 
Nature, 521, 452459.\n[10] Gnedin, A., & Pitman, J. 2006. Exchangeable Gibbs partitions and Stirling triangles. Journal of Mathe-\n\nmatical Sciences, 138, 5674\u20135684.\n\n[11] Hofert, M. 2011. Ef\ufb01ciently sampling nested Archimedean copulas. Comput. Statist. Data Anal., 55,\n\n5770.\n\n[12] Ishwaran, H., & James, L. F. 2001. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the\n\nAmerican Statistical Association, 96(453), 161\u2013173.\n\n[13] James, L. F. 2002. Poisson process partition calculus with applications to exchangeable models and\n\nBayesian nonparametrics. ArXiv:math/0205093.\n\n[14] Kanter, M. 1975. Stable densities under change of scale and total variation inequalities. Annals of\n\nProbability, 3, 697\u2013707.\n\n[15] Kingman, J. F. C. 1967. Completely Random Measures. Paci\ufb01c Journal of Mathematics, 21, 59\u201378.\n[16] Kingman, J. F. C. 1978. The representation of partition structures. Journal of the London Mathematical\n\nSociety, 18, 374\u2013380.\n\n[17] Lomeli, M., Favaro, S., & Teh, Y. W. 2015. A marginal sampler for \u03c3-stable Poisson-Kingman mixture\n\nmodels. Journal of Computational and Graphical Statistics (To appear).\n\n[18] Neal, R. M. 1998. Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Tech. rept.\n\n9815. Department of Statistics, University of Toronto.\n\n[19] Neal, R. M. 2003. Slice sampling. Annals of Statistics, 31, 705\u2013767.\n[20] Papaspiliopoulos, O., & Roberts, G. O. 2008. Retrospective Markov chain Monte Carlo methods for\n\nDirichlet process hierarchical models. Biometrika, 95, 169\u2013186.\n\n[21] Perman, M., Pitman, J., & Yor, M. 1992. Size-biased sampling of Poisson point processes and excursions.\n\nProbability Theory and Related Fields, 92, 21\u201339.\n\n[22] Pitman, J. 1996. Random discrete distributions invariant under size-biased permutation. Advances in\n\nApplied Probability, 28, 525\u2013539.\n\n8\n\n\f[23] Pitman, J. 2003. Poisson-Kingman Partitions. 
Pages 1\u201334 of: Goldstein, D. R. (ed), Statistics and\n\nScience: a Festschrift for Terry Speed. Institute of Mathematical Statistics.\n\n[24] Pitman, J. 2006. Combinatorial Stochastic Processes. Lecture Notes in Mathematics. Springer-Verlag,\n\nBerlin.\n\n[25] Regazzini, E., Lijoi, A., & Pr\u00a8uenster, I. 2003. Distributional results for means of normalized random\n\nmeasures with independent increments. Annals of Statistics, 31, 560\u2013585.\n\n[26] Roeder, K. 1990. Density estimation with con\ufb01dence sets exempli\ufb01ed by super-clusters and voids in the\n\ngalaxies. Journal of the American Statistical Association, 85, 617\u2013624.\n\n[27] von Renesse, M., Yor, M., & Zambotti, L. 2008. Quasi-invariance properties of a class of subordinators.\n\nStochastic Processes and their Applications, 118, 2038\u20132057.\n\n[28] Walker, Stephen G. 2007. Sampling the Dirichlet Mixture Model with Slices. Communications in Statis-\n\ntics - Simulation and Computation, 36, 45.\n\n9\n\n\f", "award": [], "sourceid": 1292, "authors": [{"given_name": "Maria", "family_name": "Lomeli", "institution": "Gatsby Unit, University College London"}, {"given_name": "Stefano", "family_name": "Favaro", "institution": "University of Torino and Collegio Carlo Alberto"}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford"}]}