{"title": "Approximate Inference in Continuous Determinantal Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1430, "page_last": 1438, "abstract": "Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient algorithm for sampling based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required cannot be directly extended except in a few restricted cases. In this paper, we present efficient approximate DPP sampling schemes based on Nystrom and random Fourier feature approximations that apply to a wide range of kernel functions. We demonstrate the utility of continuous DPPs in repulsive mixture modeling applications and synthesizing human poses spanning activity spaces.", "full_text": "Approximate Inference in Continuous\n\nDeterminantal Point Processes\n\nRaja Ha\ufb01z Affandi1, Emily B. Fox2, and Ben Taskar2\n\n1University of Pennsylvania, rajara@wharton.upenn.edu\n\n2University of Washington, {ebfox@stat,taskar@cs}.washington.edu\n\nAbstract\n\nDeterminantal point processes (DPPs) are random point processes well-suited for\nmodeling repulsion. In machine learning, the focus of DPP-based models has been\non diverse subset selection from a discrete and \ufb01nite base set. This discrete setting\nadmits an ef\ufb01cient sampling algorithm based on the eigendecomposition of the\nde\ufb01ning kernel matrix. Recently, there has been growing interest in using DPPs\nde\ufb01ned on continuous spaces. 
While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required are not tractable in general. In this paper, we present two efficient DPP sampling schemes that apply to a wide range of kernel functions: one based on low-rank approximations via Nyström and random Fourier feature techniques, and another based on Gibbs sampling. We demonstrate the utility of continuous DPPs in repulsive mixture modeling and synthesizing human poses spanning activity spaces.\n\n1 Introduction\n\nSamples from a determinantal point process (DPP) [15] are sets of points that tend to be spread out. More specifically, given Ω ⊆ R^d and a positive semidefinite kernel function L : Ω × Ω → R, the probability density of a point configuration A ⊂ Ω under a DPP with kernel L is given by\n\nP_L(A) ∝ det(L_A),   (1)\n\nwhere L_A is the |A| × |A| matrix with entries L(x, y) for each x, y ∈ A. The tendency for repulsion is captured by the determinant, since it depends on the volume spanned by the selected points in the associated Hilbert space of L. Intuitively, points that are similar according to L, or that are nearly linearly dependent, are less likely to be selected.\n\nBuilding on the foundational work in [5] for the case where Ω is discrete and finite, DPPs have been used in machine learning as a model for subset selection in which diverse sets are preferred [2, 3, 9, 12, 13]. These methods build on the tractability of sampling based on the algorithm of Hough et al. [10], which relies on the eigendecomposition of the kernel matrix to recursively sample points based on their projections onto the subspace spanned by the selected eigenvectors.\n\nRepulsive point processes, such as hard-core processes [7, 16], many of them based on thinned Poisson processes and Gibbs/Markov distributions, have a long history in the spatial statistics community, where considering continuous Ω is key. Many naturally occurring phenomena exhibit diversity: trees tend to grow in the least occupied space [17], ant hill locations are over-dispersed relative to uniform placement [4], and the spatial distribution of nerve fibers is indicative of neuropathy, with hard-core processes providing a critical tool [25]. Repulsive processes on continuous spaces have garnered interest in machine learning as well, especially relating to generative mixture modeling [18, 29].\n\nThe computationally attractive properties of DPPs make them appealing to consider in these applications. On the surface, it seems that the eigendecomposition and projection algorithm of [10] for discrete DPPs would naturally extend to the continuous case. While this is true in a formal sense as L becomes an operator instead of a matrix, the key steps, such as the eigendecomposition of the kernel and the projection of points onto subspaces spanned by eigenfunctions, are computationally infeasible except in a few very limited cases where approximations can be made [14]. The absence of a tractable DPP sampling algorithm for general kernels in continuous spaces has hindered progress in developing DPP-based models for repulsion.\n\nIn this paper, we propose an efficient algorithm to sample from DPPs in continuous spaces using low-rank approximations of the kernel function. We investigate two such schemes: Nyström and random Fourier features.
Our approach utilizes a dual representation of the DPP, a technique that has proven useful in the discrete Ω setting as well [11]. For k-DPPs, which only place positive probability on sets of cardinality k [13], we also devise a Gibbs sampler that iteratively samples points in the k-set conditioned on all k − 1 other points. The derivation relies on representing the conditional DPPs using the Schur complement of the kernel. Our methods allow us to handle a broad range of typical kernels and continuous subspaces, provided certain simple integrals of the kernel function can be computed efficiently. Decomposing our kernel into quality and similarity terms as in [13], this includes, but is not limited to, all cases where (i) the spectral density of the quality and (ii) the characteristic function of the similarity kernel can be computed efficiently. Our methods scale well with dimension, in particular with complexity growing linearly in d.\n\nIn Sec. 2, we review sampling algorithms for discrete DPPs and the challenges associated with sampling from continuous DPPs. We then propose continuous DPP sampling algorithms based on low-rank kernel approximations in Sec. 3 and Gibbs sampling in Sec. 4. An empirical analysis of the two schemes is provided in Sec. 5. Finally, we apply our methods to repulsive mixture modeling and human pose synthesis in Sec. 6 and 7.\n\n2 Sampling from a DPP\n\nWhen Ω is discrete with cardinality N, an efficient algorithm for sampling from a DPP is given in [10]. The algorithm, which is detailed in the supplement, uses an eigendecomposition of the kernel matrix L = ∑_{n=1}^N λ_n v_n v_n^T and recursively samples points x_i as follows, resulting in a set A ∼ DPP(L) with A = {x_i}:\n\nPhase 1: Select eigenvector v_n with probability λ_n/(λ_n + 1). Let V be the selected eigenvectors (k = |V|).\nPhase 2: For i = 1, ..., k, sample points x_i ∈ Ω sequentially with probability based on the projection of x_i onto the subspace spanned by V. Once x_i is sampled, update V by excluding the subspace spanned by the projection of x_i onto V.\n\nWhen Ω is discrete, both steps are straightforward since the first phase involves eigendecomposing a kernel matrix and the second phase involves sampling from discrete probability distributions based on inner products between points and eigenvectors. Extending this algorithm to a continuous space was considered by [14], but for a very limited set of kernels L and spaces Ω. For general L and Ω, we face difficulties in both phases. Extending Phase 1 to a continuous space requires knowledge of the eigendecomposition of the kernel function. When Ω is a compact rectangle in R^d, [14] suggest approximating the eigendecomposition using an orthonormal Fourier basis.\n\nEven if we are able to obtain the eigendecomposition of the kernel function (either directly or via approximations as considered in [14] and Sec. 3), we still need to implement Phase 2 of the sampling algorithm. Whereas the discrete case only requires sampling from a discrete probability function, here we have to sample from a probability density. When Ω is compact, [14] suggest using a rejection sampler with a uniform proposal on Ω. The authors note that the acceptance rate of this rejection sampler decreases with the number of points sampled, making the method inefficient for sampling large sets from a DPP. In most other cases, implementing Phase 2 even via rejection sampling is infeasible since the target density is in general non-standard with unknown normalization. Furthermore, a generic proposal distribution can yield extremely low acceptance rates.\n\nIn summary, current algorithms can sample approximately from a continuous DPP only for translation-invariant kernels defined on a compact space. In Sec.
3, we propose a sampling algorithm that allows us to sample approximately from DPPs for a wide range of kernels L and spaces Ω.\n\n3 Sampling from a low-rank continuous DPP\n\nAgain considering Ω discrete with cardinality N, the sampling algorithm of Sec. 2 has complexity dominated by the eigendecomposition, O(N^3). If the kernel matrix L is low-rank, i.e. L = B^T B with B a D × N matrix and D ≪ N, [11] showed that the complexity of sampling can be reduced to O(N D^2 + D^3). The basic idea is to exploit the fact that L and the dual kernel matrix C = B B^T, which is D × D, share the same nonzero eigenvalues, and that for each eigenvector v_k of L, B v_k is the corresponding eigenvector of C. See the supplement for algorithmic details.\n\nWhile the dependence on N in the dual is sharply reduced, in continuous spaces N is infinite. In order to extend the algorithm, we must find efficient ways to compute C for Phase 1 and to manipulate eigenfunctions implicitly for the projections in Phase 2. Generically, consider sampling from a DPP on a continuous space Ω with kernel L(x, y) = ∑_{n=1}^∞ λ_n φ_n(x) φ_n*(y), where λ_n and φ_n(x) are eigenvalues and eigenfunctions, and φ_n*(y) is the complex conjugate of φ_n(y). Assume that we can approximate L by a low-dimensional (generally complex-valued) mapping B(x) : Ω → C^D:\n\n˜L(x, y) = B(x)* B(y), where B(x) = [B_1(x), ..., B_D(x)]^T.   (2)\n\nHere, A* denotes the complex conjugate transpose of A. We consider two efficient low-rank approximation schemes in Sec. 3.1 and 3.2. Using such a low-rank representation, we propose an analog of the dual sampling algorithm for continuous spaces, described in Algorithm 1. A similar algorithm provides samples from a k-DPP, which only gives positive probability to sets of a fixed cardinality k [13]. 
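The eigenvalue-sharing fact that underpins the dual view is easy to verify numerically in the discrete low-rank setting. The sketch below is not from the paper; it assumes NumPy and a randomly generated feature matrix B (a hypothetical stand-in for a real low-rank kernel factorization):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 5, 200                  # rank D much smaller than ground-set size N

# Hypothetical low-rank factorization: L = B^T B is N x N but has rank D.
B = rng.standard_normal((D, N))
L = B.T @ B                    # primal kernel matrix (N x N)
C = B @ B.T                    # dual kernel matrix (D x D)

# The two matrices share their nonzero eigenvalues.
top_L = np.sort(np.linalg.eigvalsh(L))[-D:]
eig_C = np.sort(np.linalg.eigvalsh(C))
assert np.allclose(top_L, eig_C)

# Eigenvectors transfer between the views: if C v = lambda v, then
# u = B^T v satisfies L u = lambda u (the map B carries them back).
lam, V = np.linalg.eigh(C)
u = B.T @ V[:, -1]
assert np.allclose(L @ u, lam[-1] * u)
```

This is exactly the property that lets Phase 1 run on the small D × D matrix C instead of the (here infinite) primal kernel.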
The only change required is to the for-loop in Phase 1, to select exactly k eigenvectors using an efficient O(Dk) recursion. See the supplement for details.\n\nAlgorithm 1: Dual sampler for a low-rank continuous DPP\nInput: ˜L(x, y) = B(x)* B(y), a rank-D DPP kernel\nPhase 1:\n  Compute C = ∫_Ω B(x) B(x)* dx\n  Compute the eigendecomposition C = ∑_{k=1}^D λ_k v_k v_k*\n  J ← ∅\n  for k = 1, ..., D do\n    J ← J ∪ {k} with probability λ_k/(λ_k + 1)\n  V ← { v_k / sqrt(v_k* C v_k) }_{k∈J}\nPhase 2:\n  X ← ∅\n  while |V| > 0 do\n    Sample x̂ from f(x) = (1/|V|) ∑_{v∈V} |v* B(x)|^2\n    X ← X ∪ {x̂}\n    Let v_0 be a vector in V such that v_0* B(x̂) ≠ 0\n    Update V ← { v − (v* B(x̂) / v_0* B(x̂)) v_0 | v ∈ V − {v_0} }\n    Orthonormalize V w.r.t. ⟨v_1, v_2⟩ = v_1* C v_2\nOutput: X\n\nIn this dual view, we still have the same two-phase structure, and must address two key challenges:\n\nPhase 1: Assuming a low-rank kernel function decomposition as in Eq. (2), we need to be able to compute the dual kernel matrix, given by an integral:\n\nC = ∫_Ω B(x) B(x)* dx.   (3)\n\nPhase 2: In general, sampling directly from the density f(x) is difficult; instead, we can compute the cumulative distribution function (CDF) and sample x using the inverse CDF method [21]:\n\nF(x̂ = (x̂_1, ..., x̂_d)) = ∏_{l=1}^d ∫_{−∞}^{x̂_l} f(x) 1{x_l ∈ Ω} dx_l.   (4)\n\nAssuming (i) the kernel function ˜L is finite-rank and (ii) the terms C and f(x) are computable, Algorithm 1 provides exact samples from a DPP with kernel ˜L. In what follows, approximations only arise from approximating general kernels L with low-rank kernels ˜L. 
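Phase 2 reduces each draw to inverse-CDF sampling from a one-dimensional density. As an illustration of that mechanic only (NumPy assumed; the two-bump density below is a hypothetical stand-in, not the DPP's f(x)), one can tabulate the unnormalized density on a grid, form the CDF numerically, and invert it by interpolation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-D target: an unnormalized two-bump density on [-5, 5].
f = lambda x: np.exp(-(x - 2.0) ** 2) + np.exp(-(x + 2.0) ** 2)

xs = np.linspace(-5.0, 5.0, 10_001)
cdf = np.cumsum(f(xs))
cdf /= cdf[-1]                      # normalize so F ranges over [0, 1]

def sample(n):
    """Inverse-CDF method: u ~ Uniform(0,1), return F^{-1}(u) by interpolation."""
    u = rng.uniform(size=n)
    return np.interp(u, cdf, xs)

draws = sample(50_000)
# By symmetry, the two modes at -2 and +2 should receive roughly equal mass.
assert abs((draws > 0).mean() - 0.5) < 0.02
```

In the paper's setting the CDF integrals in Eq. (4) are instead evaluated analytically for the kernel families considered, but the inversion step is the same.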
If given a finite-rank kernel L to begin with, the sampling procedure is exact.\n\nOne could imagine approximating L as in Eq. (2) by simply truncating the eigendecomposition (either directly or using numerical approximations). However, this simple approximation for known decompositions does not necessarily yield a tractable sampler, because the products of eigenfunctions required in Eq. (3) might not be efficiently integrable. For our approximation algorithm to work, we need methods that not only approximate the kernel function well, but also enable us to solve Eq. (3) and (4) directly for many different kernel functions. We consider two such approaches that enable an efficient sampler for a wide range of kernels: Nyström and random Fourier features.\n\n3.1 Sampling from an RFF-approximated DPP\n\nRandom Fourier features (RFF) [19] is an approach for approximating shift-invariant kernels, k(x, y) = k(x − y), using randomly selected frequencies. The frequencies are sampled independently from the Fourier transform of the kernel function, ω_j ∼ F(k(x − y)), letting\n\n˜k(x − y) = (1/D) ∑_{j=1}^D exp{i ω_j^T (x − y)},  x, y ∈ Ω.   (5)\n\nTo apply RFFs, we factor L into a quality function q and similarity kernel k (i.e., q(x) = sqrt(L(x, x))):\n\nL(x, y) = q(x) k(x, y) q(y),  x, y ∈ Ω, where k(x, x) = 1.   (6)\n\nThe RFF approximation can be applied to cases where the similarity function has a known characteristic function, e.g., Gaussian, Laplacian and Cauchy. Using Eq. (5), we can approximate the similarity kernel function to obtain a low-rank kernel and dual matrix:\n\n˜L_RFF(x, y) = (1/D) ∑_{j=1}^D q(x) exp{i ω_j^T (x − y)} q(y),\nC^RFF_{jk} = (1/D) ∫_Ω q^2(x) exp{i (ω_j − ω_k)^T x} dx.\n\nThe CDF of the sampling distribution f(x) in Algorithm 1 is given by:\n\nF_RFF(x̂) = (1/|V|) ∑_{v∈V} ∑_{j=1}^D ∑_{k=1}^D v_j v_k* ∏_{l=1}^d ∫_{−∞}^{x̂_l} q^2(x) exp{i (ω_j − ω_k)^T x} 1{x_l ∈ Ω} dx_l,   (7)\n\nwhere v_j denotes the jth element of vector v. Note that C^RFF and F_RFF can be computed for many different combinations of Ω and q(x). In fact, this method works for any combination of (i) a translation-invariant similarity kernel k with known characteristic function and (ii) a quality function q with known spectral density. The resulting kernel L need not be translation invariant. In the supplement, we illustrate this method by considering a common and important example where Ω = R^d, q(x) is Gaussian, and k(x, y) is any kernel with known Fourier transform.\n\n3.2 Sampling from a Nyström-approximated DPP\n\nAnother approach to kernel approximation is the Nyström method [27]. In particular, given landmarks z_1, ..., z_D sampled from Ω, we can approximate the kernel function and dual matrix as\n\n˜L_Nys(x, y) = ∑_{j=1}^D ∑_{k=1}^D W^2_{jk} L(x, z_j) L(z_k, y),\nC^Nys_{jk} = ∑_{n=1}^D ∑_{m=1}^D W_{jn} W_{mk} ∫_Ω L(z_n, x) L(x, z_m) dx,\n\nwhere W_{jk} = L(z_j, z_k)^{−1/2}. Denoting w_j(v) = ∑_{n=1}^D W_{jn} v_n, the CDF of f(x) in Alg.
1 is:\n\nF_Nys(x̂) = (1/|V|) ∑_{v∈V} ∑_{j=1}^D ∑_{k=1}^D w_j(v) w_k(v) ∏_{l=1}^d ∫_{−∞}^{x̂_l} L(x, z_j) L(z_k, x) 1{x_l ∈ Ω} dx_l.   (8)\n\nAs with the RFF case, we consider a decomposition L(x, y) = q(x) k(x, y) q(y). Here, there are no translation-invariance requirements, even for the similarity kernel k. In the supplement, we provide the important example where Ω = R^d and both q(x) and k(x, y) are Gaussians, and also the case where k(x, y) is polynomial, which cannot be handled by RFF since it is not translation invariant.\n\n4 Gibbs sampling\n\nFor k-DPPs, we can consider a Gibbs sampling scheme. In the supplement, we derive that the full conditional for the inclusion of point x_k, given the inclusion of the k − 1 other points, is a 1-DPP with a modified kernel, which we know how to sample from. Let the kernel function be represented as before: L(x, y) = q(x) k(x, y) q(y). Denoting J\\k = {x_j}_{j≠k} and M^{\\k} = L_{J\\k}^{−1}, the full conditional can be simplified using Schur's determinantal equality [22]:\n\np(x_k | {x_j}_{j≠k}) ∝ L(x_k, x_k) − ∑_{i,j≠k} M^{\\k}_{ij} L(x_i, x_k) L(x_j, x_k).   (9)\n\nFigure 1: Estimates of total variational distance for Nyström and RFF approximation methods to a DPP with Gaussian quality and similarity with covariances Γ = diag(ρ^2, ..., ρ^2) and Σ = diag(σ^2, ..., σ^2), respectively. (a)-(c) For dimensions d = 1, 5 and 10, each plot considers ρ^2 = 1 and varies σ^2. (d) Eigenvalues for the Gaussian kernels with σ^2 = ρ^2 = 1 and varying dimension d.\n\nIn general, sampling directly from this full conditional is difficult. However, for a wide range of kernel functions, including those which can be handled by the Nyström approximation in Sec. 3.2, the CDF can be computed analytically and x_k can be sampled using the inverse CDF method:\n\nF(x̂_l | {x_j}_{j≠k}) = [ ∫_{−∞}^{x̂_l} ( L(x_l, x_l) − ∑_{i,j≠k} M^{\\k}_{ij} L(x_i, x_l) L(x_j, x_l) ) 1{x_l ∈ Ω} dx_l ] / [ ∫_Ω ( L(x, x) − ∑_{i,j≠k} M^{\\k}_{ij} L(x_i, x) L(x_j, x) ) dx ].   (10)\n\nIn the supplement, we illustrate this method by considering the case where Ω = R^d and q(x) and k(x, y) are Gaussians. We use this same Schur complement scheme for sampling from the full conditionals in the mixture model application of Sec. 6. A key advantage of this scheme for several types of kernels is that the complexity of sampling scales linearly with the number of dimensions d, making it suitable for handling high-dimensional spaces.\n\nAs with any Gibbs sampling scheme, the mixing rate depends on the correlations between variables. In cases where the kernel introduces low repulsion we expect the Gibbs sampler to mix well, while in a high-repulsion setting the sampler can mix slowly due to the strong dependencies between points and the fact that we are only doing one-point-at-a-time moves. We explore the dependence of convergence on repulsion strength in the supplementary materials. Regardless, this sampler provides a useful tool in the k-DPP setting. Asymptotically, theory suggests that we get exact (though correlated) samples from the k-DPP. To extend this approach to standard DPPs, we can first sample k (this assumes knowledge of the eigenvalues of L) and then apply the above method to get a sample. This is fairly inefficient if many samples are needed. 
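To make the Gibbs step concrete, the toy sketch below (an assumption-laden illustration, not the paper's implementation) discretizes Ω to a fine grid so that the conditional in Eq. (9) can be normalized and inverted directly; the Gaussian quality and similarity choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Discretized stand-in for Omega: a grid on [-3, 3]; L(x,y) = q(x) k(x,y) q(y).
grid = np.linspace(-3.0, 3.0, 601)
q = np.exp(-grid ** 2 / 4.0)                                   # Gaussian quality
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / 0.5)        # Gaussian similarity
L = q[:, None] * K * q[None, :]

def gibbs_sweep(idx):
    """One Gibbs sweep for a k-DPP: resample each point given the other k-1.
    The conditional is Eq. (9): L(x,x) - sum_ij M_ij L(x_i,x) L(x_j,x),
    with M the inverse of L restricted to the other points."""
    idx = list(idx)
    for pos in range(len(idx)):
        rest = idx[:pos] + idx[pos + 1:]
        M = np.linalg.inv(L[np.ix_(rest, rest)])
        Lrx = L[rest, :]                        # L(x_i, x) for every grid point x
        cond = np.diag(L) - np.einsum('ix,ij,jx->x', Lrx, M, Lrx)
        cond = np.clip(cond, 0.0, None)         # guard tiny negative round-off
        cdf = np.cumsum(cond)
        cdf /= cdf[-1]
        idx[pos] = int(np.searchsorted(cdf, rng.uniform()))
    return idx

pts = gibbs_sweep([100, 300, 500])              # k = 3
assert len(set(pts)) == 3                       # repulsion keeps points distinct
```

The conditional vanishes at occupied locations (the Schur complement of a duplicated point is zero), which is what drives the repulsion in each one-point update.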
A more involved but potentially efficient approach is to consider a birth-death sampling scheme where the size of the set can grow or shrink by 1 at every step.\n\n5 Empirical analysis\n\nTo evaluate the performance of the RFF and Nyström approximations, we compute the total variational distance ‖P_L − P_˜L‖_1 = (1/2) ∑_X |P_L(X) − P_˜L(X)|, where P_L(X) denotes the probability of set X under a DPP with kernel L, as given by Eq. (1). We restrict our analysis to the case where the quality function and similarity kernel are Gaussians with isotropic covariances Γ = diag(ρ^2, ..., ρ^2) and Σ = diag(σ^2, ..., σ^2), respectively, enabling our analysis based on the easily computed eigenvalues [8]. We also focus on sampling from k-DPPs, for which the size of the set X is always k. Details are in the supplement.\n\nFig. 1 displays estimates of the total variational distance for the RFF and Nyström approximations when ρ^2 = 1, varying σ^2 (the repulsion strength) and the dimension d. Note that the RFF method performs slightly worse as σ^2 increases and is rather invariant to d, while the Nyström method performs much better for increasing σ^2 but worse for increasing d.\n\nWhile this phenomenon seems perplexing at first, a study of the eigenvalues of the Gaussian kernel across dimensions sheds light on the rationale (see Fig. 1). Note that for fixed σ^2 and ρ^2, the decay of eigenvalues is slower in higher dimensions. It has been previously demonstrated that the Nyström method performs favorably in kernel learning tasks compared to RFF in cases where there is a large eigengap in the kernel matrix [28]. The plot of the eigenvalues seems to indicate the same phenomenon here. 
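The slower eigenvalue decay in higher dimensions can be observed directly by eigendecomposing Gaussian kernel matrices built on points drawn from the quality distribution. A small sketch (NumPy assumed; the sample size, bandwidths, and top-25 cutoff are illustrative choices, not those of the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(3)

def top_mass(d, n=400, rho2=1.0, sigma2=1.0, top=25):
    """Fraction of spectral mass in the top eigenvalues of a Gaussian similarity
    kernel matrix on n points drawn from N(0, rho2 * I_d)."""
    X = rng.normal(scale=np.sqrt(rho2), size=(n, d))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * sigma2))
    w = np.sort(np.linalg.eigvalsh(K))[::-1]
    return w[:top].sum() / w.sum()

# Decay is fast in d = 1 and progressively slower as d grows,
# mirroring panel (d) of Fig. 1.
m1, m5, m10 = top_mass(1), top_mass(5), top_mass(10)
assert m1 > m5 > m10
```

The shrinking eigengap as d grows is consistent with Nyström degrading, and RFF becoming relatively competitive, in higher dimensions.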
Furthermore, this result is consistent with the comparison of RFF to Nyström for approximating DPPs in the discrete Ω case provided in [3].\n\nThis behavior can also be explained by looking at the theory behind these two approximations. For the RFF, while the kernel approximation is guaranteed to be an unbiased estimate of the true kernel element-wise, the variance is fairly high [19]. In our case, we note that the RFF estimates of minors are biased because of non-linearity in the matrix entries, overestimating probabilities for point configurations that are more spread out, which leads to samples that are overly dispersed. For the Nyström method, on the other hand, the quality of the approximation depends on how well the landmarks cover Ω. In our experiments the landmarks are sampled i.i.d. from q(x). When either the similarity bandwidth σ^2 is small or the dimension d is high, the effective distance between points increases, thereby decreasing the accuracy of the approximation. Theoretical bounds for the Nyström DPP approximation in the case when Ω is finite are provided in [3]. 
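The element-wise unbiasedness (and the nontrivial Monte Carlo variance) of the RFF estimate can be checked on a Gaussian similarity kernel, whose Fourier transform is itself Gaussian. A minimal sketch, assuming NumPy and illustrative parameter choices:

```python
import numpy as np

rng = np.random.default_rng(4)

d, D, sigma2 = 2, 5000, 1.0
# For k(x - y) = exp(-||x - y||^2 / (2 sigma2)), the spectral density is
# Gaussian, so frequencies are drawn omega_j ~ N(0, I / sigma2).
omega = rng.normal(scale=1.0 / np.sqrt(sigma2), size=(D, d))

def rff_k(x, y):
    """Monte Carlo estimate (1/D) sum_j exp(i omega_j^T (x - y)), as in Eq. (5)."""
    return np.mean(np.exp(1j * omega @ (x - y))).real

x = np.array([0.3, -0.7])
y = np.array([-0.5, 0.4])
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma2))
assert abs(rff_k(x, y) - exact) < 0.05   # close, but with visible noise
```

Each kernel entry is unbiased, yet determinants are non-linear in the entries, which is why the minors (and hence set probabilities) inherit a bias even as individual entries do not.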
We believe the same result holds for continuous Ω by extending the eigenvalues and spectral norm of the kernel matrix to operator eigenvalues and operator norms, respectively.\n\nIn summary, for moderate values of σ^2 it is generally good to use the Nyström approximation in low-dimensional settings and RFF in high-dimensional settings.\n\n6 Repulsive priors for mixture models\n\nMixture models are used in a wide range of applications, from clustering to density estimation. A common issue with such models, especially in density estimation tasks, is the introduction of redundant, overlapping components that increase the complexity and reduce the interpretability of the resulting model. This phenomenon is especially prominent when the number of samples is small. In a Bayesian setting, a common fix to this problem is to consider a sparse Dirichlet prior on the mixture weights, which penalizes the addition of non-zero-weight components. However, such approaches run the risk of inaccuracies in the parameter estimates [18]. Instead, [18] show that sampling the location parameters using repulsive priors leads to better separated clusters while maintaining the accuracy of the density estimate. They propose a class of repulsive priors that rely on explicitly defining a distance metric and the manner in which small distances are penalized. The resulting posterior computations can be fairly complex.\n\nThe theoretical properties of DPPs make them an appealing choice as a repulsive prior. In fact, [29] considered using DPPs as repulsive priors in latent variable models. However, in the absence of a feasible continuous DPP sampling algorithm, their method was restricted to performing MAP inference. 
Here we propose a fully generative probabilistic mixture model using a DPP prior for the location parameters, with a K-component model using a K-DPP.\n\nIn the common case of mixtures of Gaussians (MoG), our posterior computations can be performed using Gibbs sampling with nearly the same simplicity as the standard case where the location parameters μ_k are assumed to be i.i.d. In particular, with the exception of updating the location parameters {μ_1, ..., μ_K}, our sampling steps are identical to standard MoG Gibbs updates in the uncollapsed setting. For the location parameters, instead of sampling each μ_k independently from its conditional posterior, our full conditional depends upon the other locations μ_{\\k} as well. Details are in the supplement, where we show that this full conditional has an interpretation as a single draw from a tilted 1-DPP. As such, we can employ the Gibbs sampling scheme of Sec. 4.\n\nWe assess the clustering and density estimation performance of the DPP-based model on both synthetic and real datasets. In each case, we run 10,000 Gibbs iterations, discard 5,000 as burn-in, and thin the chain by 10. Hyperparameter settings are in the supplement. We randomly permute the labels in each iteration to ensure balanced label switching. Draws are post-processed following the algorithm of [23] to address the label-switching issue.\n\nSynthetic data. To assess the role of the prior in a density estimation task, we generated a small sample of 100 observations from a mixture of two Gaussians. We consider two cases, the first with well-separated components and the second with poorly-separated components. We compare a mixture model with locations sampled i.i.d. (IID) to our DPP repulsive prior (DPP). In both cases, we set an upper bound of six mixture components. In Fig. 2, we see that both IID and DPP provide very similar density estimates. However, IID uses many large-mass components to describe the density. As a measure of the simplicity of the resulting density description, we compute the average entropy of the posterior mixture membership distribution, which is a reasonable metric given the similarity of the overall densities. Lower entropy indicates a more concise representation in an information-theoretic sense. We also assess the accuracy of the density estimate by computing both (i) the Hamming distance error relative to the true cluster labels and (ii) held-out log-likelihood on 100 observations. The results are summarized in Table 1. We see that DPP results in (i) significantly lower entropy, (ii) lower overall clustering error, and (iii) statistically indistinguishable held-out log-likelihood. These results signify that we have a sparser representation with well-separated (interpretable) clusters while maintaining the accuracy of the density estimate.\n\nFigure 2: For each synthetic and real dataset (Well-Sep, Poor-Sep, Galaxy, Enzyme, Acidity): (top) histogram of data overlaid with the actual Gaussian mixture generating the synthetic data, and posterior mean mixture model for (middle) IID and (bottom) DPP. 
Red dashed lines indicate the resulting density estimate.\n\nTable 1: For IID and DPP on synthetic datasets: mean (stdev) of mixture membership entropy, cluster assignment error rate, and held-out log-likelihood of 100 observations under the posterior mean density estimate.\n\nDATASET | ENTROPY (IID) | ENTROPY (DPP) | CLUSTERING ERROR (IID) | CLUSTERING ERROR (DPP) | HELD-OUT LOG-LIKE. (IID) | HELD-OUT LOG-LIKE. (DPP)\nWell-separated | 1.11 (0.3) | 0.88 (0.2) | 0.19 (0.1) | 0.19 (0.1) | -169 (6) | -171 (8)\nPoorly-separated | 1.46 (0.2) | 0.92 (0.3) | 0.47 (0.1) | 0.39 (0.1) | -211 (10) | -207 (9)\n\nReal data. We also tested our DPP model on three real density estimation tasks considered in [20]: 82 measurements of velocity of galaxies diverging from our own (galaxy), acidity measurements of 155 lakes in Wisconsin (acidity), and the distribution of enzymatic activity in the blood of 245 individuals (enzyme). We once again judge the complexity of the density estimates using the posterior mixture membership entropy as a proxy. To assess the accuracy of the density estimates, we performed 5-fold cross-validation to estimate the predictive held-out log-likelihood. As with the synthetic data, we find that DPP visually results in better separated clusters (Fig. 2). The DPP entropy measure is also significantly lower for data that are not well separated (acidity and galaxy), while the differences in predictive log-likelihood estimates are not statistically significant (Table 2).\n\nFinally, we consider a classification task based on the iris dataset: 150 observations from three iris species with four length measurements. For this dataset, there has been significant debate on the optimal number of clusters. While there are three species in the data, it is known that two have very low separation. Based on loss minimization, [24, 26] concluded that the optimal number of clusters was two. 
Table 2 compares the classification error using DPP and IID when we assume for evaluation that the real data has three or two classes (by collapsing the two low-separation classes), but consider a model with a maximum of six components. While both methods perform similarly for three classes, DPP has significantly lower classification error under the assumption of two classes, since DPP places large posterior mass on only two mixture components. This result hints at the possibility of using the DPP mixture model as a model selection method.\n\n7 Generating diverse sample perturbations\n\nWe consider another possible application of continuous-space sampling. In many applications of inverse reinforcement learning or inverse optimal control, the learner is presented with control trajectories executed by an expert and tries to estimate a reward function that would approximately reproduce such policies [1]. In order to estimate the reward function, the learner needs to compare the rewards of a large set of trajectories (or all, if possible), which becomes intractable in high-dimensional spaces with complex non-linear dynamics. 
A typical approximation is to use a set of perturbed expert trajectories as a comparison set, where a good set of trajectories should cover as large a part of the space as possible.\n\nTable 2: For IID and DPP, mean (stdev) of (left) mixture membership entropy and held-out log-likelihood for three density estimation tasks and (right) classification error under 3 vs. 2 assumed true classes for the iris data.\n\nDATA | ENTROPY (IID) | ENTROPY (DPP) | HELD-OUT LL. (IID) | HELD-OUT LL. (DPP)\nGalaxy | 0.89 (0.2) | 0.74 (0.2) | -21 (2) | -20 (2)\nAcidity | 1.32 (0.1) | 0.98 (0.1) | -48 (3) | -49 (2)\nEnzyme | 1.01 (0.1) | 0.96 (0.1) | -55 (2) | -55 (3)\n\nDATA | CLASS ERROR (IID) | CLASS ERROR (DPP)\nIris (3 cls) | 0.43 (0.02) | 0.43 (0.02)\nIris (2 cls) | 0.23 (0.03) | 0.15 (0.03)\n\nFigure 3: Left (panels: Original, DPP Samples): a diverse set of human poses relative to an original pose, obtained by sampling from RFF (top) and Nyström (bottom) approximations with a kernel based on MoCap of the activity dance. Right: fraction of data having a DPP/i.i.d. sample within an ε neighborhood.\n\nWe propose using DPPs to sample a large-coverage set of trajectories, in particular focusing on a human motion application where we assume a set of motion capture (MoCap) training data taken from the CMU database [6]. Here, our dimension d is 62, corresponding to a set of joint angle measurements. 
For a given activity, such as dancing, we aim to select a reference pose and synthesize a set of diverse, perturbed poses. To achieve this, we build a kernel with Gaussian quality and similarity using covariances estimated from the training data associated with the activity. The Gaussian quality is centered about the selected reference pose, and we synthesize new poses by sampling from our continuous DPP using the low-rank approximation scheme. In Fig. 3, we show an example of such DPP-synthesized poses. For the activity dance, to quantitatively assess our performance in covering the activity space, we compute a coverage rate metric based on a random sample of 50 poses from a DPP. For each training MoCap frame, we compute whether the frame has a neighbor in the DPP sample within an ε neighborhood. We compare our coverage to that of i.i.d. sampling from a multivariate Gaussian chosen to have variance matching our DPP sample. Despite favoring the i.i.d. case by inflating the variance to match the diverse DPP sample, the DPP poses still provide better average coverage over 100 runs. See Fig. 3 (right) for an assessment of the coverage metric. A visualization of the samples is in the supplement. Note that the i.i.d. case requires on average ε = 253 to cover all data whereas the DPP only requires ε = 82. By ε = 40, we cover over 90% of the data on average. Capturing the rare poses is extremely challenging with i.i.d. sampling, but the diversity encouraged by the DPP overcomes this issue.

8 Conclusion

Motivated by the recent successes of DPP-based subset modeling in finite-set applications and the growing interest in repulsive processes on continuous spaces, we considered methods by which continuous-DPP sampling can be straightforwardly and efficiently approximated for a wide range of kernels.
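To make the ε-coverage metric from the MoCap experiment above concrete, the minimal sketch below computes the fraction of data points having at least one sampled point within an ε neighborhood; the synthetic Gaussian "frames" and the function name coverage_rate are illustrative stand-ins, not the paper's CMU data or code.

```python
import numpy as np

def coverage_rate(data, samples, eps):
    """Fraction of rows in `data` with at least one row of `samples`
    within Euclidean distance eps (the epsilon-neighborhood test)."""
    # Pairwise distances, shape (n_data, n_samples), via broadcasting.
    dists = np.linalg.norm(data[:, None, :] - samples[None, :, :], axis=-1)
    return float((dists.min(axis=1) <= eps).mean())

# Stand-in for d = 62 MoCap frames: synthetic Gaussian data, with a
# "sample" of 50 points drawn from the data itself for illustration.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 62))
samples = data[rng.choice(1000, size=50, replace=False)]
print(coverage_rate(data, samples, eps=0.0))  # only the 50 chosen frames: 0.05
```

Sweeping eps upward and plotting coverage_rate against it yields a curve of the kind shown in Fig. 3 (right); comparing the curves for DPP versus i.i.d. samples gives the coverage comparison described above.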
Our low-rank approach harnessed approximations provided by Nyström and random Fourier feature methods and then utilized a continuous dual DPP representation. The resulting approximate sampler garners the same efficiencies that led to the success of the DPP in the discrete case. One can use this method as a proposal distribution and correct for the approximations via Metropolis-Hastings, for example. For k-DPPs, we devised an exact Gibbs sampler that utilized the Schur complement representation. Finally, we demonstrated that continuous-DPP sampling is useful both for repulsive mixture modeling (which utilizes the Gibbs sampling scheme) and in synthesizing diverse human poses (which we demonstrated with the low-rank approximation method). As we saw in the MoCap example, we can handle high-dimensional spaces d, with our computations scaling just linearly with d. We believe this work opens up opportunities to use DPPs as parts of many models.

Acknowledgements: RHA and EBF were supported in part by AFOSR Grant FA9550-12-1-0453 and DARPA Grant FA9550-12-1-0406 negotiated by AFOSR. BT was partially supported by NSF CAREER Grant 1054215 and by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

References

[1] P. Abbeel and A.Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.

[2] R. H. Affandi, A. Kulesza, and E. B. Fox. Markov determinantal point processes. In Proc. UAI, 2012.

[3] R.H. Affandi, A. Kulesza, E.B. Fox, and B. Taskar. Nyström approximation for large-scale determinantal processes. In Proc. AISTATS, 2013.

[4] R. A. Bernstein and M. Gobbel. Partitioning of space in communities of ants. Journal of Animal Ecology, 48(3):931–942, 1979.

[5] A. Borodin and E.M. Rains.
Eynard-Mehta theorem, Schur process, and their Pfaffian analogs. Journal of Statistical Physics, 121(3):291–317, 2005.

[6] CMU. Carnegie Mellon University graphics lab motion capture database. http://mocap.cs.cmu.edu/, 2009.

[7] D.J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer, 2003.

[8] G.E. Fasshauer and M.J. McCourt. Stable evaluation of Gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2):737–762, 2012.

[9] J. Gillenwater, A. Kulesza, and B. Taskar. Discovering diverse and salient threads in document collections. In Proc. EMNLP, 2012.

[10] J.B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.

[11] A. Kulesza and B. Taskar. Structured determinantal point processes. In Proc. NIPS, 2010.

[12] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In Proc. ICML, 2011.

[13] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 2012.

[14] F. Lavancier, J. Møller, and E. Rubak. Statistical aspects of determinantal point processes. arXiv preprint arXiv:1205.4818, 2012.

[15] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122, 1975.

[16] B. Matérn. Spatial Variation. Springer-Verlag, 1986.

[17] T. Neeff, G. S. Biging, L. V. Dutra, C. C. Freitas, and J. R. Dos Santos. Markov point processes for modeling of spatial forest patterns in Amazonia derived from interferometric height. Remote Sensing of Environment, 97(4):484–494, 2005.

[18] F. Petralia, V. Rao, and D. Dunson. Repulsive mixtures. In Proc. NIPS, 2012.

[19] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Proc. NIPS, 2007.

[20] S.
Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). JRSS:B, 59(4):731–792, 1997.

[21] C.P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.

[22] J. Schur. Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind. Journal für die reine und angewandte Mathematik, 147:205–232, 1917.

[23] M. Stephens. Dealing with label switching in mixture models. JRSS:B, 62(4):795–809, 2000.

[24] C.A. Sugar and G.M. James. Finding the number of clusters in a dataset: An information-theoretic approach. JASA, 98(463):750–763, 2003.

[25] L. A. Waller, A. Särkkä, V. Olsbo, M. Myllymäki, I.G. Panoutsopoulou, W.R. Kennedy, and G. Wendelschafer-Crabb. Second-order spatial analysis of epidermal nerve fibers. Statistics in Medicine, 30(23):2827–2841, 2011.

[26] J. Wang. Consistent selection of the number of clusters via crossvalidation. Biometrika, 97(4):893–904, 2010.

[27] C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Proc. NIPS, 2000.

[28] T. Yang, Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Proc. NIPS, 2012.

[29] J. Zou and R.P. Adams. Priors for diversity in generative latent variable models. In Proc. NIPS, 2012.