{"title": "A simple example of Dirichlet process mixture inconsistency for the number of components", "book": "Advances in Neural Information Processing Systems", "page_first": 199, "page_last": 206, "abstract": "For data assumed to come from a finite mixture with an unknown number of components, it has become common to use Dirichlet process mixtures (DPMs) not only for density estimation, but also for inferences about the number of components. The typical approach is to use the posterior distribution on the number of components occurring so far --- that is, the posterior on the number of clusters in the observed data. However, it turns out that this posterior is not consistent --- it does not converge to the true number of components. In this note, we give an elementary demonstration of this inconsistency in what is perhaps the simplest possible setting: a DPM with normal components of unit variance, applied to data from a mixture\" with one standard normal component. Further, we find that this example exhibits severe inconsistency: instead of going to 1, the posterior probability that there is one cluster goes to 0.\"", "full_text": "A simple example of Dirichlet process mixture\ninconsistency for the number of components\n\nJeffrey W. Miller\n\nDivision of Applied Mathematics\n\nBrown University\n\nProvidence, RI 02912\n\njeffrey miller@brown.edu\n\nmatthew harrison@brown.edu\n\nMatthew T. Harrison\n\nDivision of Applied Mathematics\n\nBrown University\n\nProvidence, RI 02912\n\nAbstract\n\nFor data assumed to come from a \ufb01nite mixture with an unknown number of com-\nponents, it has become common to use Dirichlet process mixtures (DPMs) not\nonly for density estimation, but also for inferences about the number of compo-\nnents. The typical approach is to use the posterior distribution on the number of\nclusters \u2014 that is, the posterior on the number of components represented in the\nobserved data. 
However, it turns out that this posterior is not consistent \u2014 it does\nnot concentrate at the true number of components. In this note, we give an elemen-\ntary proof of this inconsistency in what is perhaps the simplest possible setting: a\nDPM with normal components of unit variance, applied to data from a \u201cmixture\u201d\nwith one standard normal component. Further, we show that this example exhibits\nsevere inconsistency: instead of going to 1, the posterior probability that there is\none cluster converges (in probability) to 0.\n\n1\n\nIntroduction\n\nIt is well-known that Dirichlet process mixtures (DPMs) of normals are consistent for the density \u2014\nthat is, given data from a suf\ufb01ciently regular density p0 the posterior converges to the point mass at\np0 (see [1] for details and references). However, it is easy to see that this does not necessarily imply\nconsistency for the number of components, since for example, a good estimate of the density might\ninclude super\ufb02uous components having vanishingly small weight.\nDespite the fact that a DPM has in\ufb01nitely many components with probability 1, it has become\ncommon to apply DPMs to data assumed to come from \ufb01nitely many components or \u201cpopulations\u201d,\nand to apply the posterior on the number of clusters (in other words, the number of components used\nin the process of generating the observed data) for inferences about the true number of components;\nsee [2, 3, 4, 5, 6] for a few prominent examples. Of course, if the data-generating process very\nclosely resembles the DPM model, then it is \ufb01ne to use this posterior for inferences about the number\nof clusters (but beware of misspeci\ufb01cation; see Section 2). 
However, in the examples cited, the authors evaluated the performance of their methods on data simulated from a fixed finite number of components or populations, suggesting that they found this to be more realistic than a DPM for their applications.\nTherefore, it is important to understand the behavior of this posterior when the data comes from a finite mixture — in particular, does it concentrate at the true number of components? In this note, we give a simple example in which a DPM is applied to data from a finite mixture and the posterior distribution on the number of clusters does not concentrate at the true number of components. In fact, DPMs exhibit this type of inconsistency under very general conditions [7] — however, the aim of this note is brevity and clarity. To that end, we focus our attention on a special case that is as simple as possible: a “standard normal DPM”, that is, a DPM using univariate normal components of unit variance, with a standard normal base measure (prior on component means).\n\nFigure 1: Prior (red x) and estimated posterior (blue o) of the number of clusters in the observed data, for a univariate normal DPM on n i.i.d. samples from (a) N(0, 1), and (b) ∑_{k=−2}^{2} (1/5) N(4k, 1/2). The DPM had concentration parameter α = 1 and a Normal–Gamma base measure on the mean and precision: N(µ | 0, 1/cλ) Gamma(λ | a, b) with a = 1, b = 0.1, and c = 0.001. Estimates were made using a collapsed Gibbs sampler, with 10^4 burn-in sweeps and 10^5 sample sweeps; traceplots and running averages were used as convergence diagnostics. Each plot shown is an average over 5 independent runs.\n\nThe rest of the paper is organized as follows. In Section 2, we address several pertinent questions and consider some suggestive experimental evidence. In Section 3, we formally define the DPM model under consideration. 
In Section 4, we give an elementary proof of inconsistency in the case of a standard normal DPM on data from one component, and in Section 5, we show that on standard normal data, a standard normal DPM is in fact severely inconsistent.\n\n2 Discussion\n\nIt should be emphasized that these results do not diminish, in any way, the utility of Dirichlet process mixtures as a flexible prior on densities, i.e., for Bayesian density estimation. In addition to their widespread success in empirical studies, DPMs are backed by theoretical guarantees showing that in many cases the posterior on the density concentrates at the true density at the minimax-optimal rate, up to a logarithmic factor (see [1] and references therein).\n\nMany researchers (e.g. [8, 9], among others) have empirically observed that the DPM posterior on the number of clusters tends to overestimate the number of components, in the sense that it tends to put its mass on a range of values greater than or equal to the true number. Figure 1 illustrates this effect for univariate normals, and similar experiments with different families of component distributions yield similar results. Thus, while our theoretical results in Sections 4 and 5 (and in [7]) are asymptotic in nature, experimental evidence suggests that the issue is present even in small samples.\n\nIt is natural to think that this overestimation is due to the fact that the prior on the number of clusters diverges as n → ∞, at a log n rate. However, this does not seem to be the main issue — rather, the problem is that DPMs strongly prefer having some tiny clusters and will introduce extra clusters even when they are not needed (see [7] for an intuitive explanation of why this is the case).\n\nIn fact, many researchers have observed the presence of tiny extra clusters (e.g. [8, 9]), but the reason for this has not previously been well understood, often being incorrectly attributed to the difficulty of detecting components with small weight. These tiny extra clusters are rather inconvenient, especially in clustering applications, and are often dealt with in an ad hoc way by simply removing them. It might be possible to consistently estimate the number of components in this way, but this remains an open question.\n\nA more natural solution is the following: if the number of components is unknown, put a prior on the number of components. For example, draw the number of components s from a probability mass function p(s) on {1, 2, . . .} with p(s) > 0 for all s, draw mixing weights π = (π1, . . . , πs) (given s), draw component parameters θ1, . . . , θs i.i.d. (given s and π) from an appropriate prior, and draw X1, X2, . . . i.i.d. (given s, π, and θ1:s) from the resulting mixture. This approach has been widely used [10, 11, 12, 13]. Under certain conditions, the posterior on the density has been shown to concentrate at the true density at the minimax-optimal rate, up to a logarithmic factor, for any sufficiently regular true density [14]. Strictly speaking, as defined, such a model is not identifiable, but it is fairly straightforward to modify it to be identifiable by choosing one representative from each equivalence class. Subject to a modification of this sort, it can be shown (see [10]) that under very general conditions, when the data is from a finite mixture of the chosen family, such models are (a.e.) consistent for the number of components, the mixing weights, the component parameters, and the density. 
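The generative process described above can be sketched in a few lines of code. Note that the specific distributional choices here (a Geometric(1/2) prior p(s), symmetric Dirichlet mixing weights, and unit-variance normal components with N(0, 1) means, mirroring the standard normal DPM of Section 3) are purely illustrative assumptions, not the choices made in any of the papers cited above.

```python
import random

def sample_finite_mixture(n, rng):
    """Sketch of a finite mixture with a prior on the number of components.
    All distributional choices are illustrative: Geometric(1/2) prior on s,
    symmetric Dirichlet(1, ..., 1) weights, N(0, 1) component means, and
    unit-variance normal components."""
    s = 1
    while rng.random() >= 0.5:  # s ~ Geometric(1/2) on {1, 2, ...}, so p(s) > 0 for all s
        s += 1
    g = [rng.gammavariate(1.0, 1.0) for _ in range(s)]
    pi = [gi / sum(g) for gi in g]                     # (pi_1, ..., pi_s) ~ Dirichlet(1, ..., 1)
    theta = [rng.gauss(0.0, 1.0) for _ in range(s)]   # component parameters theta_1, ..., theta_s
    xs = []
    for _ in range(n):                                # X_1, ..., X_n from the resulting mixture
        i = rng.choices(range(s), weights=pi)[0]
        xs.append(rng.gauss(theta[i], 1.0))
    return s, pi, theta, xs

rng = random.Random(0)
s, pi, theta, xs = sample_finite_mixture(1000, rng)
```

With conjugate component priors, posterior inference for such a model is typically carried out with reversible jump MCMC [11] or the allocation sampler [13].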
Also see [15] for an interesting discussion about estimating the number of components.\n\nHowever, as a practical matter, when dealing with real-world data, one would not expect to find data coming exactly from a finite mixture of a known family (except, perhaps, in rare circumstances). Unfortunately, even for a model as in the preceding paragraph, the posterior on the number of components will typically be highly sensitive to misspecification, and it seems likely that in order to obtain robust estimators, the problem itself may need to be reformulated. We urge researchers interested in the number of components to be wary of this robustness issue, and to think carefully about whether they really need to estimate the number of components, or whether some other measure of heterogeneity will suffice.\n\n3 Setup\n\nIn this section, we define the Dirichlet process mixture model under consideration.\n\n3.1 Dirichlet process mixture model\n\nThe DPM model was introduced by Ferguson [16] and Lo [17] for the purpose of Bayesian density estimation, and was made practical through the efforts of several authors (see [18] and references therein). We will use p(·) to denote probabilities under the DPM model (as opposed to other probability distributions that will be considered in what follows). The core of the DPM is the so-called Chinese restaurant process (CRP), which defines a certain probability distribution on partitions. Given n ∈ {1, 2, . . .} and t ∈ {1, . . . , n}, let At(n) denote the set of all ordered partitions (A1, . . . , At) of {1, . . . , n} into t nonempty sets. In other words,\n\nAt(n) = { (A1, . . . , At) : A1, . . . , At are disjoint, ∪_{i=1}^t Ai = {1, . . . , n}, |Ai| ≥ 1 ∀i }.\n\nThe CRP with concentration parameter α > 0 defines a probability mass function on A(n) = ∪_{t=1}^n At(n) by setting\n\np(A) = ( α^t/(α^(n) t!) ) ∏_{i=1}^t (|Ai| − 1)!\n\nfor A ∈ At(n), where α^(n) = α(α + 1) · · · (α + n − 1). Note that since t is a function of A, we have p(A) = p(A, t). (It is more common to see this distribution defined in terms of unordered partitions {A1, . . . , At}, in which case the t! does not appear in the denominator — however, for our purposes it is more convenient to use the distribution on ordered partitions (A1, . . . , At) obtained by uniformly permuting the parts. This does not affect the prior or posterior on t.)\n\nConsider the hierarchical model\n\np(A, t) = p(A) = ( α^t/(α^(n) t!) ) ∏_{i=1}^t (|Ai| − 1)!,   (3.1)\np(θ1:t | A, t) = ∏_{i=1}^t π(θi), and\np(x1:n | θ1:t, A, t) = ∏_{i=1}^t ∏_{j∈Ai} pθi(xj),\n\nwhere π(θ) is a prior on component parameters θ ∈ Θ, and {pθ : θ ∈ Θ} is a parametrized family of distributions on x ∈ X for the components. Typically, X ⊂ R^d and Θ ⊂ R^k for some d and k. Here, x1:n = (x1, . . . , xn) with xi ∈ X, and θ1:t = (θ1, . . . , θt) with θi ∈ Θ. This hierarchical model is referred to as a Dirichlet process mixture (DPM) model.\n\nThe prior on the number of clusters t under this model is pn(t) = ∑_{A∈At(n)} p(A, t). We use Tn (rather than T) to denote the random variable representing the number of clusters, as a reminder that its distribution depends on n. 
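To make Equation 3.1 and the induced prior pn(t) concrete, here is a minimal brute-force check (our own illustration, not part of the paper): it enumerates every ordered partition of {1, . . . , n} for a small n, evaluates p(A) from Equation 3.1 in exact rational arithmetic, verifies that the probabilities sum to 1, and compares the resulting pn(t) against the closed form pn(t) = |s(n, t)| α^t / α^(n), where |s(n, t)| denotes the unsigned Stirling numbers of the first kind (a standard identity for the CRP, stated here as background).

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def set_partitions(xs):
    """Yield all unordered partitions of the list xs into nonempty blocks."""
    if len(xs) == 1:
        yield [[xs[0]]]
        return
    first, rest = xs[0], xs[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):                # put `first` into an existing block
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part                    # or into a block of its own

def crp_p(A, n, alpha):
    """p(A) = alpha^t / (alpha^(n) t!) * prod_i (|A_i| - 1)!   (Equation 3.1)."""
    t = len(A)
    rising = Fraction(1)
    for i in range(n):
        rising *= alpha + i                       # alpha^(n) = alpha (alpha+1) ... (alpha+n-1)
    p = Fraction(alpha) ** t / (rising * factorial(t))
    for block in A:
        p *= factorial(len(block) - 1)
    return p

n, alpha = 5, Fraction(1)
# Prior on the number of clusters: p_n(t) = sum over ordered partitions with t blocks.
p_n = [Fraction(0)] * (n + 1)
total = Fraction(0)
for part in set_partitions(list(range(1, n + 1))):
    for A in permutations(part):                  # ordered partitions (A_1, ..., A_t)
        p = crp_p(A, n, alpha)
        p_n[len(A)] += p
        total += p
assert total == 1                                 # Equation 3.1 defines a probability distribution

# Closed form via unsigned Stirling numbers of the first kind:
# |s(m+1, t)| = m |s(m, t)| + |s(m, t-1)|.
s = [[Fraction(0)] * (n + 1) for _ in range(n + 1)]
s[0][0] = Fraction(1)
for m in range(n):
    for t in range(1, m + 2):
        s[m + 1][t] = m * s[m][t] + s[m][t - 1]
rising = Fraction(1)
for i in range(n):
    rising *= alpha + i
assert all(p_n[t] == s[n][t] * alpha ** t / rising for t in range(1, n + 1))
```

Exact agreement also confirms the remark above: the t! in Equation 3.1 cancels the t! orderings of each unordered partition, so the ordering convention does not affect the distribution of t.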
Note that we distinguish between the terms “component” and “cluster”: a component is part of a mixture distribution (e.g. a mixture ∑_{i=1}^∞ πi pθi has components pθ1, pθ2, . . . ), while a cluster is the set of indices of data points coming from a given component (e.g. in the DPM model above, A1, . . . , At are the clusters).\n\nSince we are concerned with the posterior distribution p(Tn = t | x1:n) on the number of clusters, we will be especially interested in the marginal distribution on (x1:n, t), given by\n\np(x1:n, Tn = t) = ∑_{A∈At(n)} ∫ p(x1:n, θ1:t, A, t) dθ1:t\n= ∑_{A∈At(n)} p(A) ∏_{i=1}^t ∫ ( ∏_{j∈Ai} pθi(xj) ) π(θi) dθi\n= ∑_{A∈At(n)} p(A) ∏_{i=1}^t m(xAi)   (3.2)\n\nwhere for any subset of indices S ⊂ {1, . . . , n}, we denote xS = (xj : j ∈ S) and let m(xS) denote the single-cluster marginal of xS,\n\nm(xS) = ∫ ( ∏_{j∈S} pθ(xj) ) π(θ) dθ.   (3.3)\n\n3.2 Specialization to the standard normal case\n\nIn this note, for brevity and clarity, we focus on the univariate normal case with unit variance, with a standard normal prior on means — that is, for x ∈ R and θ ∈ R,\n\npθ(x) = N(x | θ, 1) = (1/√(2π)) exp(−(x − θ)²/2), and\nπ(θ) = N(θ | 0, 1) = (1/√(2π)) exp(−θ²/2).\n\nIt is a straightforward calculation to show that the single-cluster marginal is then\n\nm(x1:n) = (1/√(n + 1)) p0(x1:n) exp( (1/2) (1/(n + 1)) ( ∑_{j=1}^n xj )² ),   (3.4)\n\nwhere p0(x1:n) = p0(x1) · · · p0(xn) (and p0 is the N(0, 1) density). When pθ(x) and π(θ) are as above, we refer to the resulting DPM as a standard normal DPM.\n\n4 Simple example of inconsistency\n\nIn this section, we prove the following result, exhibiting a simple example in which a DPM is inconsistent for the number of components: even when the true number of components is 1 (e.g. N(µ, 1) data), the posterior probability of Tn = 1 does not converge to 1. Interestingly, the result applies even when X1, X2, . . . are identically equal to a constant c ∈ R. To keep it simple, we set α = 1; for more general results, see [7].\n\nTheorem 4.1. If X1, X2, . . . ∈ R are i.i.d. from any distribution with E|Xi| < ∞, then with probability 1, under the standard normal DPM with α = 1 as defined above, p(Tn = 1 | X1:n) does not converge to 1 as n → ∞.\n\nProof. Let n ∈ {2, 3, . . .}. Let x1, . . . , xn ∈ R, A ∈ A2(n), and ai = |Ai| for i = 1, 2. Define sn = ∑_{j=1}^n xj and sAi = ∑_{j∈Ai} xj for i = 1, 2. Using Equation 3.4 and noting that 1/(n + 1) ≤ 1/(n + 2) + 1/n², we have\n\nm(x1:n)/p0(x1:n) = (1/√(n + 1)) exp( (1/2) sn²/(n + 1) ) ≤ (1/√(n + 1)) exp( (1/2) sn²/(n + 2) ) exp( (1/2) sn²/n² ).\n\nThe second factor equals exp(x̄n²/2), where x̄n = (1/n) ∑_{j=1}^n xj. By the convexity of x ↦ x² (and since sn = sA1 + sA2 and (a1 + 1) + (a2 + 1) = n + 2),\n\n( sn/(n + 2) )² ≤ ((a1 + 1)/(n + 2)) ( sA1/(a1 + 1) )² + ((a2 + 1)/(n + 2)) ( sA2/(a2 + 1) )²,\n\nand thus, the first factor is less than or equal to\n\n(1/√(n + 1)) exp( (1/2) sA1²/(a1 + 1) + (1/2) sA2²/(a2 + 1) ) = ( √(a1 + 1) √(a2 + 1)/√(n + 1) ) m(xA1) m(xA2)/p0(x1:n).\n\nHence,\n\nm(x1:n)/( m(xA1) m(xA2) ) ≤ ( √(a1 + 1) √(a2 + 1)/√(n + 1) ) exp(x̄n²/2).   (4.1)\n\nConsequently, we have\n\np(x1:n, Tn = 2)/p(x1:n, Tn = 1)\n(a) = ∑_{A∈A2(n)} n p(A) m(xA1) m(xA2)/m(x1:n)\n(b) ≥ ∑_{A∈A2(n)} n p(A) ( √(n + 1)/(√(|A1| + 1) √(|A2| + 1)) ) exp(−x̄n²/2)\n(c) ≥ ∑_{A∈A2(n): |A1|=1} n ( (n − 2)!/(n! 2!) ) ( √(n + 1)/(√2 √n) ) exp(−x̄n²/2)\n(d) ≥ (1/(2√2)) exp(−x̄n²/2),\n\nwhere step (a) follows from applying Equation 3.2 to both numerator and denominator, plus using Equation 3.1 (with α = 1) to see that p(A) = 1/n when A = ({1, . . . , n}), step (b) follows from Equation 4.1 above, step (c) follows since all the terms in the sum are nonnegative and p(A) = (n − 2)!/(n! 2!) when |A1| = 1 (by Equation 3.1, with α = 1), and step (d) follows since there are n partitions A ∈ A2(n) such that |A1| = 1.\n\nIf X1, X2, . . . ∈ R are i.i.d. with µ = EXj finite, then by the law of large numbers, X̄n = (1/n) ∑_{j=1}^n Xj → µ almost surely as n → ∞. 
Therefore,\n\np(Tn = 1 | X1:n) = p(X1:n, Tn = 1)/∑_{t=1}^∞ p(X1:n, Tn = t)\n≤ p(X1:n, Tn = 1)/( p(X1:n, Tn = 1) + p(X1:n, Tn = 2) )\n≤ 1/( 1 + (1/(2√2)) exp(−X̄n²/2) )\n→ 1/( 1 + (1/(2√2)) exp(−µ²/2) ) < 1 almost surely.\n\nHence, almost surely, p(Tn = 1 | X1:n) does not converge to 1.\n\n5 Severe inconsistency\n\nIn the previous section, we showed that p(Tn = 1 | X1:n) does not converge to 1 for a standard normal DPM on any data with finite mean. In this section, we prove that in fact, it converges to 0, at least on standard normal data. This vividly illustrates that improperly using DPMs in this way can lead to entirely misleading results. The key step in the proof is an application of Hoeffding's strong law of large numbers for U-statistics.\n\nTheorem 5.1. If X1, X2, . . . ∼ N(0, 1) i.i.d. then p(Tn = 1 | X1:n) → 0 in probability as n → ∞, under the standard normal DPM with concentration parameter α = 1.\n\nProof. For t = 1 and t = 2 define\n\nRt(X1:n) = n^(3/2) p(X1:n, Tn = t)/p0(X1:n).\n\nOur method of proof is as follows. We will show that R2(X1:n) → ∞ in probability (in other words, for any B > 0 we have P(R2(X1:n) > B) → 1 as n → ∞), and we will show that R1(X1:n) is bounded in probability: R1(X1:n) = OP(1) (in other words, for any ε > 0 there exists Bε > 0 such that P(R1(X1:n) > Bε) ≤ ε for all n ∈ {1, 2, . . .}). Putting these two together, we will have\n\np(Tn = 1 | X1:n) = p(X1:n, Tn = 1)/∑_{t=1}^∞ p(X1:n, Tn = t) ≤ p(X1:n, Tn = 1)/p(X1:n, Tn = 2) = R1(X1:n)/R2(X1:n) → 0 in probability.\n\nFirst, let's show that R2(X1:n) → ∞ in probability. For S ⊂ {1, . . . , n} with |S| ≥ 1, define h(xS) by\n\nh(xS) = m(xS)/p0(xS) = (1/√(|S| + 1)) exp( (1/2) (1/(|S| + 1)) ( ∑_{j∈S} xj )² ),\n\nwhere m is the single-cluster marginal as in Equations 3.3 and 3.4. Note that when 1 ≤ |S| ≤ n − 1, we have √n h(xS) ≥ 1. Note also that E h(XS) = 1 since\n\nE h(XS) = ∫ h(xS) p0(xS) dxS = ∫ m(xS) dxS = 1,\n\nusing the fact that m(xS) is a density with respect to Lebesgue measure. For k ∈ {1, . . . , n}, define the U-statistics\n\nUk(X1:n) = (1/C(n, k)) ∑_{|S|=k} h(XS),\n\nwhere C(n, k) denotes the binomial coefficient and the sum is over all S ⊂ {1, . . . , n} such that |S| = k. By Hoeffding's strong law of large numbers for U-statistics [19],\n\nUk(X1:n) → E h(X1:k) = 1 almost surely as n → ∞,\n\nfor any k ∈ {1, 2, . . .}. Therefore, using Equations 3.1 and 3.2 we have that for any K ∈ {1, 2, . . .} and any n > K,\n\nR2(X1:n) = n^(3/2) ∑_{A∈A2(n)} p(A) m(XA1) m(XA2)/p0(X1:n)\n= n^(3/2) ∑_{A∈A2(n)} p(A) h(XA1) h(XA2)\n≥ n ∑_{A∈A2(n)} p(A) h(XA1)   (since √n h(XA2) ≥ 1)\n= n ∑_{k=1}^{n−1} ( (k − 1)! (n − k − 1)!/(n! 2!) ) ∑_{|S|=k} h(XS)\n= ∑_{k=1}^{n−1} ( n/(2k(n − k)) ) (1/C(n, k)) ∑_{|S|=k} h(XS)\n= ∑_{k=1}^{n−1} ( n/(2k(n − k)) ) Uk(X1:n)\n≥ ∑_{k=1}^{K} ( n/(2k(n − k)) ) Uk(X1:n)\n→ ∑_{k=1}^{K} 1/(2k) = HK/2 > (log K)/2 almost surely,\n\nwhere HK is the Kth harmonic number, and the last inequality follows from the standard bounds [20] on harmonic numbers: log K < HK ≤ log K + 1. 
Hence, for any K,\n\nlim inf_{n→∞} R2(X1:n) > (log K)/2 almost surely,\n\nand it follows easily that R2(X1:n) → ∞ almost surely. Convergence in probability is implied by almost sure convergence.\n\nNow, let's show that R1(X1:n) = OP(1). By Equations 3.1, 3.2, and 3.4, we have\n\nR1(X1:n) = n^(3/2) p(X1:n, Tn = 1)/p0(X1:n) = n^(3/2) (1/n) m(X1:n)/p0(X1:n) = ( √n/√(n + 1) ) exp( (1/2) (1/(n + 1)) ( ∑_{i=1}^n Xi )² ) ≤ exp(Zn²/2),\n\nwhere Zn = (1/√n) ∑_{i=1}^n Xi ∼ N(0, 1) for each n ∈ {1, 2, . . .}. Since Zn = OP(1), we conclude that R1(X1:n) = OP(1). This completes the proof.\n\nAcknowledgments\n\nWe would like to thank Stu Geman for raising this question, and the anonymous referees for several helpful suggestions that improved the quality of this manuscript. This research was supported in part by the National Science Foundation under grant DMS-1007593 and the Defense Advanced Research Projects Agency under contract FA8650-11-1-715.\n\nReferences\n\n[1] S. Ghosal. The Dirichlet process, related priors and posterior asymptotics. In N.L. Hjort, C. Holmes, P. Müller, and S.G. Walker, editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press, 2010.\n\n[2] J.P. Huelsenbeck and P. Andolfatto. Inference of population structure under a Dirichlet process model. Genetics, 175(4):1787–1802, 2007.\n\n[3] M. Medvedovic and S. Sivaganesan. Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9):1194–1206, 2002.\n\n[4] E. Otranto and G.M. Gallo. A nonparametric Bayesian approach to detect the number of regimes in Markov switching models. Econometric Reviews, 21(4):477–496, 2002.\n\n[5] E.P. Xing, K.A. Sohn, M.I. Jordan, and Y.W. Teh. 
Bayesian multi-population haplotype inference via a hierarchical Dirichlet process mixture. In Proceedings of the 23rd International Conference on Machine Learning, pages 1049–1056, 2006.\n\n[6] P. Fearnhead. Particle filters for mixture models with an unknown number of components. Statistics and Computing, 14(1):11–21, 2004.\n\n[7] J. W. Miller and M. T. Harrison. Inconsistency of Pitman–Yor process mixtures for the number of components. arXiv:1309.0024, 2013.\n\n[8] M. West, P. Müller, and M.D. Escobar. Hierarchical priors and mixture models, with application in regression and density estimation. Institute of Statistics and Decision Sciences, Duke University, 1994.\n\n[9] A. Onogi, M. Nurimoto, and M. Morita. Characterization of a Bayesian genetic clustering algorithm based on a Dirichlet process prior and comparison among Bayesian clustering methods. BMC Bioinformatics, 12(1):263, 2011.\n\n[10] A. Nobile. Bayesian Analysis of Finite Mixture Distributions. PhD thesis, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, 1994.\n\n[11] S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society. Series B, 59(4):731–792, 1997.\n\n[12] P.J. Green and S. Richardson. Modeling heterogeneity with and without the Dirichlet process. Scandinavian Journal of Statistics, 28(2):355–375, June 2001.\n\n[13] A. Nobile and A.T. Fearnside. Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17(2):147–162, 2007.\n\n[14] W. Kruijer, J. Rousseau, and A. Van der Vaart. Adaptive Bayesian density estimation with location-scale mixtures. Electronic Journal of Statistics, 4:1225–1257, 2010.\n\n[15] P. McCullagh and J. Yang. How many clusters? Bayesian Analysis, 3(1):101–120, 2008.\n\n[16] T.S. Ferguson. 
Bayesian density estimation by mixtures of normal distributions. In M. H. Rizvi, J. Rustagi, and D. Siegmund, editors, Recent Advances in Statistics, pages 287–302. Academic Press, 1983.\n\n[17] A. Y. Lo. On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12(1):351–357, 1984.\n\n[18] M.D. Escobar and M. West. Computing nonparametric hierarchical models. In D. Dey, P. Müller, and D. Sinha, editors, Practical Nonparametric and Semiparametric Bayesian Statistics, pages 1–22. Springer-Verlag, New York, 1998.\n\n[19] W. Hoeffding. The strong law of large numbers for U-statistics. Institute of Statistics, Univ. of N. Carolina, Mimeograph Series, 302, 1961.\n\n[20] R.L. Graham, D.E. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1989.", "award": [], "sourceid": 173, "authors": [{"given_name": "Jeffrey", "family_name": "Miller", "institution": "Brown University"}, {"given_name": "Matthew", "family_name": "Harrison", "institution": "Brown University"}]}