{"title": "Convergence rates of a partition based Bayesian multivariate density estimation method", "book": "Advances in Neural Information Processing Systems", "page_first": 4738, "page_last": 4746, "abstract": "We study a class of non-parametric density estimators under Bayesian settings. The estimators are obtained by adaptively partitioning the sample space. Under a suitable prior, we analyze the concentration rate of the posterior distribution, and demonstrate that the rate does not directly depend on the dimension of the problem in several special cases. Another advantage of this class of Bayesian density estimators is that it can adapt to the unknown smoothness of the true density function, thus achieving the optimal convergence rate without artificial conditions on the density. We also validate the theoretical results on a variety of simulated data sets.", "full_text": "Convergence rates of a partition based Bayesian\n\nmultivariate density estimation method\n\nLinxi Liu \u2217\n\nDepartment of Statistics\n\nColumbia University\n\nll3098@columbia.edu\n\nDangna Li\n\nICME\n\nStanford University\n\ndangna@stanford.edu\n\nWing Hung Wong\n\nDepartment of Statistics\n\nStanford University\n\nwhwong@stanford.edu\n\nAbstract\n\nWe study a class of non-parametric density estimators under Bayesian settings.\nThe estimators are obtained by adaptively partitioning the sample space. Under\na suitable prior, we analyze the concentration rate of the posterior distribution,\nand demonstrate that the rate does not directly depend on the dimension of the\nproblem in several special cases. Another advantage of this class of Bayesian\ndensity estimators is that it can adapt to the unknown smoothness of the true\ndensity function, thus achieving the optimal convergence rate without arti\ufb01cial\nconditions on the density. 
We also validate the theoretical results on a variety of simulated data sets.

1 Introduction

In this paper, we study the asymptotic behavior of posterior distributions of a class of Bayesian density estimators based on adaptive partitioning. Density estimation is a building block for many other statistical methods, such as classification, nonparametric testing, clustering, and data compression. With univariate (or bivariate) data, the most basic non-parametric method for density estimation is the histogram method. In this method, the sample space is partitioned into regular intervals (or rectangles), and the density is estimated by the relative frequency of data points falling into each interval (rectangle). However, this method is of limited utility in higher dimensional spaces because the number of cells in a regular partition of a p-dimensional space will grow exponentially with p, which makes the relative frequency highly variable unless the sample size is extremely large. In this situation the histogram may be improved by adapting the partition to the data so that larger rectangles are used in the parts of the sample space where the data are sparse. Motivated by this consideration, researchers have recently developed several multivariate density estimation methods based on adaptive partitioning [13, 12]. For example, by generalizing the classical Pólya tree construction [7], [22] developed the Optional Pólya Tree (OPT) prior on the space of simple functions. Computational issues related to OPT density estimates were discussed in [13], where efficient algorithms were developed to compute the OPT estimate. The method performs quite well when the dimension is moderately large (from 10 to 50).

The purpose of the current paper is to address the following questions on such Bayesian density estimates based on partition-learning. 
Question 1: what is the class of density functions that can be "well estimated" by the partition-learning based methods? Question 2: what is the rate at which the posterior distribution is concentrated around the true density as the sample size increases? Our main contributions lie in the following aspects:

∗Work was done while the author was at Stanford University.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

• We impose a suitable prior on the space of density functions defined on binary partitions, and calculate the posterior concentration rate under the Hellinger distance with mild assumptions. The rate is adaptive to the unknown smoothness of the true density.
• For two dimensional density functions of bounded variation, the posterior contraction rate of our method is $n^{-1/4}(\log n)^{3}$.
• For Hölder continuous (one-dimensional case) or mixed-Hölder continuous (multi-dimensional case) density functions with regularity parameter $\beta \in (0, 1]$, the posterior concentration rate is $n^{-\beta/(2\beta+p)}(\log n)^{2+p/(2\beta)}$, whereas the minimax rate for one-dimensional Hölder continuous functions is $(n/\log n)^{-\beta/(2\beta+1)}$.
• When the true density function is sparse in the sense that the Haar wavelet coefficients satisfy a weak-$l^q$ ($q > 1/2$) constraint, the posterior concentration rate is $n^{-(q-1/2)/(2q)}(\log n)^{2+1/(2q-1)}$.
• We can use a computationally efficient algorithm to sample from the posterior distribution.

We demonstrate the theoretical results on several simulated data sets.

1.1 Related work

An important feature of our method is that it can adapt to the unknown smoothness of the true density function. The adaptivity of Bayesian approaches has drawn great attention in recent years. 
In terms of\ndensity estimation, there are mainly two categories of adaptive Bayesian nonparametric approaches.\nThe \ufb01rst category of work relies on basis expansion of the density function and typically imposes a\nrandom series prior [15, 17]. When the prior on the coef\ufb01cients of the expansion is set to be normal\n[4], it is also a Gaussian process prior. In the multivariate case, most existing work [4, 17] uses\ntensor-product basis. Our improvement over these methods mainly lies in the adaptive structure. In\nfact, as the dimension increases the number of tensor-product basis functions can be prohibitively\nlarge, which imposes a great challenge on computation. By introducing adaptive partition, we are\nable to handle the multivariate case even when the dimension is 30 (Example 2 in Section 4).\nAnother line of work considers mixture priors [16, 11, 18]. Although the mixture distributions have\ngood approximation properties and naturally lead to adaptivity to very high smoothness levels, they\nmay fail to detect or characterize the local features. On the other hand, by learning a partition of the\nsample space, the partition based approaches can provide an informative summary of the structure,\nand allow us to examine the density at different resolutions [14, 21].\nThe paper is organized as follows. In Section 2 we provide more details of the density functions on\nbinary partitions and de\ufb01ne the prior distribution. Section 3 summarizes the theoretical results on\nposterior concentration rates. The results are further validated in Section 4 by several experiments.\n\n2 Bayesian multivariate density estimation\nWe focus on density estimation problems in p-dimensional Euclidean space. Let (\u2126,B) be a mea-\nsurable space and f0 be a compactly supported density function with respect to the Lebesgue\nmeasure \u00b5. Y1, Y2,\u00b7\u00b7\u00b7 , Yn is a sequence of independent variables distributed according to f0. 
After translation and scaling, we can always assume that the support of f0 is contained in the unit cube in $\mathbb{R}^p$. Translating this into notations, we assume that $\Omega = \{(y_1, y_2, \cdots, y_p) : y_l \in [0, 1]\}$. Let $\mathcal{F} = \{f \text{ is a nonnegative measurable function on } \Omega : \int_\Omega f \, d\mu = 1\}$ denote the collection of all the density functions on $(\Omega, \mathcal{B}, \mu)$. Then $\mathcal{F}$ constitutes the parameter space in this problem. Note that $\mathcal{F}$ is an infinite dimensional parameter space.

2.1 Densities on binary partitions

To address the infinite dimensionality of $\mathcal{F}$, we construct a sequence of finite dimensional approximating spaces $\Theta_1, \Theta_2, \cdots, \Theta_I, \cdots$ based on binary partitions. With growing complexity, these spaces provide more and more accurate approximations to the initial parameter space $\mathcal{F}$. Here, we use a recursive procedure to define a binary partition with I subregions of the unit cube in $\mathbb{R}^p$. Let $\Omega = \{(y_1, y_2, \cdots, y_p) : y_l \in [0, 1]\}$ be the unit cube in $\mathbb{R}^p$. In the first step, we choose one of the coordinates $y_l$ and cut $\Omega$ into two subregions along the midpoint of the range of $y_l$. That is, $\Omega = \Omega_0^l \cup \Omega_1^l$, where $\Omega_0^l = \{y \in \Omega : y_l \le 1/2\}$ and $\Omega_1^l = \Omega \setminus \Omega_0^l$. In this way, we get a partition with two subregions. Note that the total number of possible partitions after the first step is equal to the dimension p. Suppose after $I - 1$ steps of the recursion, we have obtained a partition $\{\Omega_i\}_{i=1}^I$ with I subregions. In the I-th step, further partitioning of the region is defined as follows:

1. Choose a region from $\Omega_1, \cdots, \Omega_I$. Denote it as $\Omega_{i_0}$.
2. Choose one coordinate $y_l$ and divide $\Omega_{i_0}$ into two subregions along the midpoint of the range of $y_l$.

Such a partition obtained by $I - 1$ recursive steps is called a binary partition of size I. Figure 1 displays all possible two dimensional binary partitions when I is 1, 2 and 3.

Figure 1: Binary partitions

Now, let
$$\Theta_I = \Big\{f : f = \sum_{i=1}^{I} \frac{\theta_i}{|\Omega_i|} 1_{\Omega_i}, \ \sum_{i=1}^{I} \theta_i = 1, \ \{\Omega_i\}_{i=1}^{I} \text{ is a binary partition of } \Omega\Big\},$$
where $|\Omega_i|$ is the volume of $\Omega_i$. Then, $\Theta_I$ is the collection of the density functions supported by the binary partitions of size I. They constitute a sequence of approximating spaces (i.e. a sieve, see [10, 20] for background on sieve theory). Let $\Theta = \cup_{I=1}^{\infty} \Theta_I$ be the space containing all the density functions supported by the binary partitions. Then $\Theta$ is an approximation of the initial parameter space $\mathcal{F}$ to a certain approximation error which will be characterized later.

We take the metric on $\mathcal{F}$, $\Theta$ and $\Theta_I$ to be the Hellinger distance, which is defined as
$$\rho(f, g) = \Big(\int_\Omega \big(\sqrt{f(y)} - \sqrt{g(y)}\big)^2 \, dy\Big)^{1/2}, \quad f, g \in \mathcal{F}.$$

2.2 Prior distribution

An ideal prior $\Pi$ on $\Theta = \cup_{I=1}^{\infty} \Theta_I$ is supposed to be capable of balancing the approximation error and the complexity of $\Theta$. The prior in this paper penalizes the size of the partition in the sense that the probability mass on each $\Theta_I$ is proportional to $\exp(-\lambda I \log I)$. Given a sample of size n, we restrict our attention to $\Theta_n = \cup_{I=1}^{n/\log n} \Theta_I$, because in practice we need enough samples within each subregion to get a meaningful estimate of the density. 
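The recursive construction of a binary partition in Section 2.1 can be sketched in a few lines of Python. The helper below (`random_binary_partition` is a name of our own choosing, not from the paper) grows a partition of the unit cube by repeatedly halving a randomly chosen region along a randomly chosen coordinate, exactly as in steps 1 and 2 above; each region is represented by its list of per-coordinate intervals.

```python
import random

def random_binary_partition(p, I, seed=0):
    """Grow a binary partition of [0,1]^p with I subregions via I-1 recursive
    midpoint splits; each region is stored as a list of p (low, high) intervals."""
    rng = random.Random(seed)
    regions = [[(0.0, 1.0)] * p]          # start from the whole unit cube
    for _ in range(I - 1):
        i0 = rng.randrange(len(regions))  # step 1: pick a region Omega_{i0}
        l = rng.randrange(p)              # step 2: pick a coordinate y_l
        region = regions.pop(i0)
        lo, hi = region[l]
        mid = (lo + hi) / 2.0             # split at the midpoint of the range
        left, right = list(region), list(region)
        left[l], right[l] = (lo, mid), (mid, hi)
        regions.extend([left, right])
    return regions
```

By construction the I subregions are disjoint and their volumes sum to 1, which is what makes the normalized histogram weights $\theta_i$ in $\Theta_I$ well defined.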
This is to say, when $I \le n/\log n$, $\Pi(\Theta_I) \propto \exp(-\lambda I \log I)$; otherwise $\Pi(\Theta_I) = 0$.

If we use $T_I$ to denote the total number of possible partitions of size I, then it is not hard to see that $\log T_I \le c^* I \log I$, where $c^*$ is a constant. Within each $\Theta_I$, the prior is uniform across all binary partitions. In other words, let $\{\Omega_i\}_{i=1}^I$ be a binary partition of $\Omega$ of size I, and let $\mathcal{F}(\{\Omega_i\}_{i=1}^I)$ be the collection of piecewise constant density functions on this partition (i.e. $\mathcal{F}(\{\Omega_i\}_{i=1}^I) = \{f = \sum_{i=1}^I \frac{\theta_i}{|\Omega_i|} 1_{\Omega_i} : \sum_{i=1}^I \theta_i = 1 \text{ and } \theta_i \ge 0,\ i = 1, \ldots, I\}$); then
$$\Pi\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\big) \propto \exp(-\lambda I \log I)/T_I. \quad (1)$$

Given a partition $\{\Omega_i\}_{i=1}^I$, the weights $\theta_i$ on the subregions follow a Dirichlet distribution with parameters all equal to $\alpha$ ($\alpha < 1$). This is to say, for $x_1, \cdots, x_I \ge 0$ and $\sum_{i=1}^I x_i = 1$,
$$\Pi\Big(f = \sum_{i=1}^I \frac{\theta_i}{|\Omega_i|} 1_{\Omega_i} : \theta_1 \in dx_1, \cdots, \theta_I \in dx_I \,\Big|\, \mathcal{F}\big(\{\Omega_i\}_{i=1}^I\big)\Big) = \frac{1}{D(\alpha, \cdots, \alpha)} \prod_{i=1}^I x_i^{\alpha-1}, \quad (2)$$
where $D(\delta_1, \cdots, \delta_I) = \prod_{i=1}^I \Gamma(\delta_i)/\Gamma(\sum_{i=1}^I \delta_i)$.

Let $\Pi_n(\cdot\,|Y_1, \cdots, Y_n)$ denote the posterior distribution. After integrating out the weights $\theta_i$, we can compute the marginal posterior probability of $\mathcal{F}(\{\Omega_i\}_{i=1}^I)$:
$$\Pi_n\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\,\big|\,Y_1, \cdots, Y_n\big) \propto \Pi\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\big) \int \Big(\prod_{i=1}^I (\theta_i/|\Omega_i|)^{n_i}\Big) \frac{1}{D(\alpha, \cdots, \alpha)} \prod_{i=1}^I \theta_i^{\alpha-1} \, d\theta_1 \cdots d\theta_I \quad (3)$$
$$\propto \frac{\exp(-\lambda I \log I)}{T_I} \cdot \frac{D(\alpha + n_1, \cdots, \alpha + n_I)}{D(\alpha, \cdots, \alpha)} \prod_{i=1}^I \frac{1}{|\Omega_i|^{n_i}}, \quad (4)$$
where $n_i$ is the number of observations in $\Omega_i$. Under the prior introduced in [13], the marginal posterior distribution is:
$$\Pi_n^*\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\,\big|\,Y_1, \cdots, Y_n\big) \propto \exp(-\lambda I) \frac{D(\alpha + n_1, \cdots, \alpha + n_I)}{D(\alpha, \cdots, \alpha)} \prod_{i=1}^I \frac{1}{|\Omega_i|^{n_i}}, \quad (5)$$
while the maximum log-likelihood achieved by histograms on the partition $\{\Omega_i\}_{i=1}^I$ is:
$$l_n\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\big) := \max_{f \in \mathcal{F}(\{\Omega_i\}_{i=1}^I)} l_n(f) = \sum_{i=1}^I n_i \log\Big(\frac{n_i}{n|\Omega_i|}\Big). \quad (6)$$

From a model selection perspective, we may treat the histograms on each binary partition as a model of the data. 
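The marginal posterior score of a given partition is cheap to evaluate in closed form. Below is a minimal Python sketch of the log of the unnormalized score in (4), dropping the combinatorial factor $1/T_I$ (which is constant across partitions of the same size I); the function name and the defaults $\alpha = 0.5$, $\lambda = 1$ are our own illustrative choices, not values fixed by the paper.

```python
import math

def log_partition_score(counts, volumes, alpha=0.5, lam=1.0):
    """Log of the unnormalized marginal posterior score of a binary partition,
    following Eq. (4) without the 1/T_I factor.

    counts[i]  -- number of observations n_i falling in subregion Omega_i
    volumes[i] -- Lebesgue volume |Omega_i| of that subregion
    """
    I = len(counts)
    n = sum(counts)
    # prior penalty: log of exp(-lambda * I * log I)  (log I = 0 when I = 1)
    log_prior = -lam * I * math.log(I)
    # log D(alpha + n_1, ..., alpha + n_I) - log D(alpha, ..., alpha)
    log_dirichlet_ratio = (sum(math.lgamma(alpha + ni) for ni in counts)
                           - math.lgamma(I * alpha + n)
                           - I * math.lgamma(alpha)
                           + math.lgamma(I * alpha))
    # log of prod_i |Omega_i|^{-n_i}
    log_volume = -sum(ni * math.log(v) for ni, v in zip(counts, volumes))
    return log_prior + log_dirichlet_ratio + log_volume
```

For instance, with eight points all lying in the half $\{y_1 \le 1/2\}$ of the unit square, the partition that splits along $y_1$ scores higher than the trivial size-one partition, illustrating how the score trades off the prior penalty against the fit term $\prod_i |\Omega_i|^{-n_i}$.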
When $I \ll n$, asymptotically,
$$\log\Big(\Pi_n^*\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\,\big|\,Y_1, \cdots, Y_n\big)\Big) \asymp l_n\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\big) - \frac{1}{2}(I-1)\log n. \quad (7)$$
This is to say, in [13], selecting the partition which maximizes the marginal posterior distribution is equivalent to applying the Bayesian information criterion (BIC) to perform model selection. However, if we allow I to increase with n, (7) will not hold any more. But if we use the prior introduced in this section, in the case when $I/n \to \zeta \in (0, 1)$ as $n \to \infty$, we still have
$$\log\Big(\Pi_n\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\,\big|\,Y_1, \cdots, Y_n\big)\Big) \asymp l_n\big(\mathcal{F}(\{\Omega_i\}_{i=1}^I)\big) - \lambda I \log I. \quad (8)$$
From a model selection perspective, this is closer to the risk inflation criterion (RIC, [8]).

3 Posterior concentration rates

We are interested in how fast the posterior probability measure concentrates around the true density f0. Under the prior specified above, the posterior probability is the random measure given by
$$\Pi_n(B|Y_1, \cdots, Y_n) = \frac{\int_B \prod_{j=1}^n f(Y_j)\, d\Pi(f)}{\int_\Theta \prod_{j=1}^n f(Y_j)\, d\Pi(f)}.$$
A Bayesian estimator is said to be consistent if the posterior distribution concentrates on arbitrarily small neighborhoods of f0, with probability tending to 1 under $P_0^n$ ($P_0$ is the probability measure corresponding to the density function f0). The posterior concentration rate refers to the rate at which these neighborhoods shrink to zero while still possessing most of the posterior mass. 
More explicitly, we want to find a sequence $\epsilon_n \to 0$, such that for sufficiently large M,
$$\Pi_n\big(\{f : \rho(f, f_0) \ge M\epsilon_n\}\,\big|\,Y_1, \cdots, Y_n\big) \to 0 \ \text{ in } P_0^n\text{-probability}.$$

In [6] and [2], the authors demonstrated that it is impossible to find an estimator which works uniformly well for every f in $\mathcal{F}$. This is the case because for any estimator $\hat{f}$, there always exists $f \in \mathcal{F}$ for which $\hat{f}$ is inconsistent. Given the minimaxity of the Bayes estimator, we have to restrict our attention to a subset of the original parameter space $\mathcal{F}$. Here, we focus on the class of density functions that can be well approximated by the $\Theta_I$'s. To be more rigorous, a density function $f \in \mathcal{F}$ is said to be well approximated by elements in $\Theta$ if there exists a sequence of $f_I \in \Theta_I$ satisfying $\rho(f_I, f) = O(I^{-r})$ ($r > 0$). Let $\mathcal{F}_0$ be the collection of these density functions. We will first derive the posterior concentration rate for the elements in $\mathcal{F}_0$ as a function of r. For different function classes, this approximation rate r can be calculated explicitly. In addition to this, we also assume that f0 has finite second moment.

The following theorem gives the posterior concentration rate under the prior introduced in Section 2.2.

Theorem 3.1. $Y_1, \cdots, Y_n$ is a sequence of independent random variables distributed according to f0. $P_0$ is the probability measure corresponding to f0. $\Theta$ is the collection of p-dimensional density functions supported by the binary partitions as defined in Section 2.1. 
With the modified prior distribution, if $f_0 \in \mathcal{F}_0$, then the posterior concentration rate is $\epsilon_n = n^{-r/(2r+1)}(\log n)^{2+1/(2r)}$.

The strategy to show this theorem is to write the posterior probability of the shrinking ball as
$$\Pi\big(\{f : \rho(f, f_0) \ge M\epsilon_n\}\,\big|\,Y_1, \cdots, Y_n\big) = \frac{\sum_{I=1}^{\infty} \int_{\{f:\rho(f,f_0)\ge M\epsilon_n\}\cap\Theta_I} \prod_{j=1}^n \frac{f(Y_j)}{f_0(Y_j)}\, d\Pi(f)}{\sum_{I=1}^{\infty} \int_{\Theta_I} \prod_{j=1}^n \frac{f(Y_j)}{f_0(Y_j)}\, d\Pi(f)}. \quad (9)$$

The proof employs the mechanism developed in the landmark works [9] and [19]. We first obtain the upper bounds for the terms in the numerator by dividing them into three blocks, each of which accounts for bias, variance, and the rapidly decaying prior respectively, and calculate the upper bound for each block separately. Then we provide the prior thickness result, i.e., we bound the prior mass of a ball around the true density from below. Due to space constraints, the details of the proof are provided in the appendix.

This theorem suggests the following two take-away messages: 1. The rate is adaptive to the unknown smoothness of the true density. 2. The posterior contraction rate is $n^{-r/(2r+1)}(\log n)^{2+1/(2r)}$, which does not directly depend on the dimension p. For some density functions, r may depend on p. But in several special cases, for example when the density function is spatially sparse or lies in a low dimensional subspace, we will show that the rate is not affected by the full dimension of the problem.

In the following three subsections, we will calculate the explicit rates for three density classes. Again, all proofs are given in the appendix.

3.1 Spatial adaptation

First, we assume that the density concentrates spatially. Mathematically, this implies that the density function satisfies a type of sparsity. 
In the past two decades, sparsity has become one of the most discussed types of structure under which we are able to overcome the curse of dimensionality. A remarkable example is that it allows us to solve high-dimensional linear models, especially when the system is underdetermined.

Let f be a p-dimensional density function and $\Psi$ the p-dimensional Haar basis. We will work with $g = \sqrt{f}$ first. Note that $g \in L^2([0,1]^p)$. Thus we can expand g with respect to $\Psi$ as $g = \sum_{\psi \in \Psi} \langle g, \psi\rangle \psi$. We rearrange this summation by the size of the wavelet coefficients. In other words, we order the coefficients as
$$|\langle g, \psi_{(1)}\rangle| \ge |\langle g, \psi_{(2)}\rangle| \ge \cdots \ge |\langle g, \psi_{(k)}\rangle| \ge \cdots;$$
then the sparsity condition imposed on the density functions is that the decay of the wavelet coefficients follows a power law,
$$|\langle g, \psi_{(k)}\rangle| \le C k^{-q} \quad \text{for all } k \in \mathbb{N} \text{ and } q > 1/2, \quad (10)$$
where C is a constant.

We call such a constraint a weak-$l^q$ constraint. The condition has been widely used to characterize the sparsity of signals and images [1, 3]. In particular, in [5], it was shown that for two-dimensional cases, when q > 1/2, this condition reasonably captures the sparsity of real world images.

Corollary 3.2. (Application to spatial adaptation) Suppose f0 is a p-dimensional density function and satisfies the condition (10). If we apply our approach to this type of density functions, the posterior concentration rate is $n^{-(q-1/2)/(2q)}(\log n)^{2+1/(2q-1)}$.

3.2 Density functions of bounded variation

Let $\Omega = [0, 1)^2$ be a domain in $\mathbb{R}^2$. 
We first characterize the space $BV(\Omega)$ of functions of bounded variation on $\Omega$.

For a vector $\nu \in \mathbb{R}^2$, the difference operator $\Delta_\nu$ along the direction $\nu$ is defined by
$$\Delta_\nu(f, y) := f(y + \nu) - f(y).$$
For functions f defined on $\Omega$, $\Delta_\nu(f, y)$ is defined whenever $y \in \Omega(\nu)$, where $\Omega(\nu) := \{y : [y, y + \nu] \subset \Omega\}$ and $[y, y + \nu]$ is the line segment connecting y and $y + \nu$. Denote by $e_l$, $l = 1, 2$, the two coordinate vectors in $\mathbb{R}^2$. We say that a function $f \in L^1(\Omega)$ is in $BV(\Omega)$ if and only if
$$V_\Omega(f) := \sup_{h>0} h^{-1} \sum_{l=1}^2 \|\Delta_{h e_l}(f, \cdot)\|_{L^1(\Omega(h e_l))} = \lim_{h\to 0} h^{-1} \sum_{l=1}^2 \|\Delta_{h e_l}(f, \cdot)\|_{L^1(\Omega(h e_l))}$$
is finite. The quantity $V_\Omega(f)$ is the variation of f over $\Omega$.

Corollary 3.3. Assume that $f_0 \in BV(\Omega)$. If we apply the Bayesian multivariate density estimator based on adaptive partitioning here to estimate f0, the posterior concentration rate is $n^{-1/4}(\log n)^3$.

3.3 Hölder space

In the one-dimensional case, the class of Hölder functions $H(L, \beta)$ with regularity parameter $\beta$ is defined as follows: let $\kappa$ be the largest integer smaller than $\beta$, and denote by $f^{(\kappa)}$ its $\kappa$-th derivative. Then
$$H(L, \beta) = \big\{f : [0,1] \to \mathbb{R} : |f^{(\kappa)}(x) - f^{(\kappa)}(y)| \le L|x - y|^{\beta-\kappa}\big\}.$$

In multi-dimensional cases, we introduce mixed-Hölder continuity. In order to simplify the notation, we give the definition when the dimension is two. It can be easily generalized to higher-dimensional cases. 
A real-valued function f on $\mathbb{R}^2$ is called mixed-Hölder continuous for some nonnegative constant C and $\beta \in (0, 1]$ if, for any $(x_1, y_1), (x_2, y_2) \in \mathbb{R}^2$,
$$|f(x_2, y_2) - f(x_2, y_1) - f(x_1, y_2) + f(x_1, y_1)| \le C|x_1 - x_2|^\beta |y_1 - y_2|^\beta.$$

Corollary 3.4. Let f0 be a p-dimensional density function. If f0 is Hölder continuous (when p = 1) or mixed-Hölder continuous (when p ≥ 2) with regularity parameter $\beta \in (0, 1]$, then the posterior concentration rate of the Bayes estimator is $n^{-\beta/(2\beta+p)}(\log n)^{2+p/(2\beta)}$.

This result also implies that if f0 only depends on $\tilde{p}$ variables where $\tilde{p} < p$, but we do not know in advance which $\tilde{p}$ variables, then the rate of this method is determined by the effective dimension $\tilde{p}$ of the problem, since the smoothness parameter r is only a function of $\tilde{p}$. In the next section, we will use a simulated data set to illustrate this point.

4 Simulation

4.1 Sequential importance sampling

Each partition $\mathcal{A}_I = \{\Omega_i\}_{i=1}^I$ is obtained by recursively partitioning the sample space. We can use a sequence of partitions $\mathcal{A}_1, \mathcal{A}_2, \cdots, \mathcal{A}_I$ to keep track of the path leading to $\mathcal{A}_I$. Let $\Pi_n(\cdot)$ denote the posterior distribution $\Pi_n(\cdot|Y_1, \cdots, Y_n)$ for simplicity, and $\Pi_n^I$ be the posterior distribution conditioning on $\Theta_I$. Then $\Pi_n^I(\mathcal{A}_I)$ can be decomposed as
$$\Pi_n^I(\mathcal{A}_I) = \Pi_n^I(\mathcal{A}_1)\,\Pi_n^I(\mathcal{A}_2|\mathcal{A}_1)\cdots\Pi_n^I(\mathcal{A}_I|\mathcal{A}_{I-1}).$$

Figure 2: Heatmap of the density and plots of the 2-dimensional Haar coefficients. For the plot on the right, the left panel is the plot of the Haar coefficients from low resolution to high resolution up to level 6. The middle one is the plot of the sorted coefficients according to their absolute values. 
And the right one is the same as the middle plot but with the abscissa in log scale.

The conditional distribution $\Pi_n^I(\mathcal{A}_{i+1}|\mathcal{A}_i)$ can be calculated by $\Pi_n^I(\mathcal{A}_{i+1})/\Pi_n^I(\mathcal{A}_i)$. However, the computation of the marginal distribution $\Pi_n^I(\mathcal{A}_i)$ is sometimes infeasible, especially when both I and $I - i$ are large, because we need to sum the marginal posterior probability over all binary partitions of size I for which the first i steps in the partition generating path are the same as those of $\mathcal{A}_i$. Therefore, we adopt the sequential importance sampling algorithm proposed in [13]. In order to build a sequence of binary partitions, at each step, the conditional distribution is approximated by $\Pi_n^{i+1}(\mathcal{A}_{i+1}|\mathcal{A}_i)$. The obtained partition is assigned a weight to compensate for the approximation, where the weight is
$$w_I(\mathcal{A}_I) = \frac{\Pi_n^I(\mathcal{A}_I)}{\Pi_n^1(\mathcal{A}_1)\,\Pi_n^2(\mathcal{A}_2|\mathcal{A}_1)\cdots\Pi_n^I(\mathcal{A}_I|\mathcal{A}_{I-1})}.$$

In order to make the data points as uniform as possible, we apply a copula transformation to each variable in advance whenever the dimension exceeds 3. More specifically, we estimate the marginal distribution of each variable $X_j$ by our approach, denoted as $\hat{f}_j$ (we use $\hat{F}_j$ to denote the cdf of $X_j$), and transform each point $(y_1, \cdots, y_p)$ to $(\hat{F}_1(y_1), \cdots, \hat{F}_p(y_p))$. Another advantage of this transformation is that after the transformation the sample space naturally becomes $[0, 1]^p$.

Example 1. Assume that the two-dimensional density function is
$$\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} \sim \frac{2}{5}\, N\!\left(\begin{pmatrix} 0.25 \\ 0.25 \end{pmatrix}, 0.05^2 I_{2\times 2}\right) + \frac{3}{5}\, N\!\left(\begin{pmatrix} 0.75 \\ 0.75 \end{pmatrix}, 0.05^2 I_{2\times 2}\right).$$
This density function both satisfies the spatial sparsity condition and belongs to the space of functions of bounded variation. 
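To reproduce this experiment, data from the Example 1 mixture can be drawn with a few lines of standard-library Python; `sample_example1` is a hypothetical helper of our own, not code from the paper, and since both components are isotropic with a common mean per coordinate, each coordinate can be sampled as an independent Gaussian around the chosen mode.

```python
import random

def sample_example1(n, seed=0):
    """Draw n points from the Example 1 mixture:
    (2/5) N((0.25, 0.25), 0.05^2 I) + (3/5) N((0.75, 0.75), 0.05^2 I)."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        # pick the component: weight 2/5 for the lower mode, 3/5 for the upper
        m = 0.25 if rng.random() < 0.4 else 0.75
        points.append((rng.gauss(m, 0.05), rng.gauss(m, 0.05)))
    return points
```

With sd 0.05 around each mode, essentially all mass of the upper component lies in $\{y_1 > 1/2\}$, so roughly 60% of a large sample falls in that half, matching the mixture weight 3/5.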
Figure 2 shows the heatmap of the density function and its Haar coefficients. The last panel in the second plot displays the sorted coefficients with the abscissa in log scale. From this we can clearly see that the power-law decay defined in Section 3.1 is satisfied.

We apply the adaptive partitioning approach to estimate the density, and allow the sample size to increase from $10^2$ to $10^5$. In Figure 3, the left plot is the density estimation result based on a sample with 10000 data points. The right one is the plot of the Kullback-Leibler (KL) divergence from the estimated density to f0 vs. sample size in log scale. The sample sizes are set to be 100, 500, 1000, 5000, $10^4$, and $10^5$. The linear trend in the plot validates the posterior concentration rates calculated in Section 3. The reason why we use the KL divergence instead of the Hellinger distance is that for any $f_0 \in \mathcal{F}_0$ and $\hat{f} \in \Theta$, we can show that the KL divergence and the Hellinger distance are of the same order. But the KL divergence is relatively easier to compute in our setting, since we can show that it is linear in the logarithm of the posterior marginal probability of a partition. The proof is provided in the appendix. For each fixed sample size, we run the experiment 10 times and estimate the standard error, which is shown by the lighter blue part in the plot.

Example 2. In the second example we work with a density function of moderately high dimension. Assume that the first five random variables $Y_1, \cdots, Y_5$ are generated from the following location mixture of the Gaussian distribution:
$$\begin{pmatrix} Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} \sim \frac{1}{2}\, N\!\left(\begin{pmatrix} 0.25 \\ 0.25 \\ 0.25 \end{pmatrix}, \begin{pmatrix} 0.05^2 & 0.03^2 & 0 \\ 0.03^2 & 0.05^2 & 0 \\ 0 & 0 & 0.05^2 \end{pmatrix}\right) + \frac{1}{2}\, N\!\left(\begin{pmatrix} 0.75 \\ 0.75 \\ 0.75 \end{pmatrix}, 0.05^2 I_{3\times 3}\right), \qquad Y_4, Y_5 \sim N(0.5, 0.1),$$
and the other components $Y_6, \cdots, Y_p$ are independently uniformly distributed. We run experiments for p = 5, 10, and 30. For a fixed p, we generate $n \in \{500, 1000, 5000, 10^4, 10^5\}$ data points. For each pair of p and n, we repeat the experiment 10 times and calculate the standard error. Figure 4 displays the plot of the KL divergence vs. the sample size on log-log scale. The density function is continuously differentiable; therefore, it satisfies the mixed-Hölder continuity condition. The effective dimension of this example is $\tilde{p} = 5$, and this is reflected in the plot: the slopes of the three lines, which correspond to the concentration rates under different dimensions, remain almost the same as we increase the full dimension of the problem.

Figure 3: Plot of the estimated density and KL divergence against sample size. We use the posterior mean as the estimate. The right plot is on log-log scale, while the labels of the x and y axes still represent the sample size and the KL divergence before we take the logarithm.

Figure 4: KL divergence vs. sample size. The blue, purple and red curves correspond to the cases when p = 5, p = 10 and p = 30 respectively. The slopes of the three lines are almost the same, implying that the concentration rate only depends on the effective dimension of the problem (which is 5 in this example).

5 Conclusion

In this paper, we study the posterior concentration rate of a class of Bayesian density estimators based on adaptive partitioning. We obtain explicit rates when the density function is spatially sparse, belongs to the space of bounded variation, or is Hölder continuous. For the last case, the rate is minimax up to a logarithmic term. When the density function is sparse or lies in a low-dimensional subspace, the rate is not affected by the dimension of the problem. 
Another advantage of this method is that it can adapt to the unknown smoothness of the underlying density function.

Bibliography

[1] Felix Abramovich, Yoav Benjamini, David L. Donoho, and Iain M. Johnstone. Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, 34(2):584–653, 04 2006.

[2] Lucien Birgé and Pascal Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 09 1998.

[3] E.J. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? Information Theory, IEEE Transactions on, 52(12):5406–5425, Dec 2006.

[4] R. de Jonge and J.H. van Zanten. Adaptive estimation of multivariate functions using conditionally gaussian tensor-product spline priors. Electron. J. Statist., 6:1984–2001, 2012.

[5] R.A. DeVore, B. Jawerth, and B.J. Lucier. Image compression through wavelet transform coding. Information Theory, IEEE Transactions on, 38(2):719–746, March 1992.

[6] R. H. Farrell. On the lack of a uniformly consistent sequence of estimators of a density function in certain cases. The Annals of Mathematical Statistics, 38(2):471–474, 04 1967.

[7] Thomas S. Ferguson. Prior distributions on spaces of probability measures. Ann. Statist., 2:615–629, 1974.

[8] Dean P. Foster and Edward I. George. The risk inflation criterion for multiple regression. Ann. Statist., 22(4):1947–1975, 12 1994.

[9] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart. Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531, 04 2000.

[10] U. Grenander. Abstract Inference. Probability and Statistics Series. John Wiley & Sons, 1981.

[11] Willem Kruijer, Judith Rousseau, and Aad van der Vaart. 
Adaptive bayesian density estimation with location-scale mixtures. Electron. J. Statist., 4:1225–1257, 2010.

[12] Dangna Li, Kun Yang, and Wing Hung Wong. Density estimation via discrepancy based adaptive sequential partition. 30th Conference on Neural Information Processing Systems (NIPS 2016), 2016.

[13] Luo Lu, Hui Jiang, and Wing H. Wong. Multivariate density estimation by bayesian sequential partitioning. Journal of the American Statistical Association, 108(504):1402–1410, 2013.

[14] Li Ma and Wing Hung Wong. Coupling optional pólya trees and the two sample problem. Journal of the American Statistical Association, 106(496):1553–1565, 2011.

[15] Vincent Rivoirard and Judith Rousseau. Posterior concentration rates for infinite dimensional exponential families. Bayesian Anal., 7(2):311–334, 06 2012.

[16] Judith Rousseau. Rates of convergence for the posterior distributions of mixtures of betas and adaptive nonparametric estimation of the density. The Annals of Statistics, 38(1):146–180, 02 2010.

[17] Weining Shen and Subhashis Ghosal. Adaptive bayesian procedures using random series priors. Scandinavian Journal of Statistics, 42(4):1194–1213, 2015. 10.1111/sjos.12159.

[18] Weining Shen, Surya T. Tokdar, and Subhashis Ghosal. Adaptive bayesian multivariate density estimation with dirichlet mixtures. Biometrika, 100(3):623–640, 2013.

[19] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714, 06 2001.

[20] Xiaotong Shen and Wing Hung Wong. Convergence rate of sieve estimates. The Annals of Statistics, 22(2):580–615, 1994.

[21] Jacopo Soriano and Li Ma. Probabilistic multi-resolution scanning for two-sample differences. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):547–572, 2017.

[22] Wing H. Wong and Li Ma. 
Optional pólya tree and bayesian inference. The Annals of Statistics, 38(3):1433–1459, 06 2010.", "award": [], "sourceid": 2482, "authors": [{"given_name": "Linxi", "family_name": "Liu", "institution": "Columbia University"}, {"given_name": "Dangna", "family_name": "Li", "institution": "Stanford University"}, {"given_name": "Wing Hung", "family_name": "Wong", "institution": "Stanford university"}]}