{"title": "Exact sampling of determinantal point processes with sublinear time preprocessing", "book": "Advances in Neural Information Processing Systems", "page_first": 11546, "page_last": 11558, "abstract": "We study the complexity of sampling from a distribution over all index subsets of the set {1, ..., n} with the probability of a subset S proportional to the determinant of the submatrix L_S of some n x n positive semidefinite matrix L, where L_S corresponds to the entries of L indexed by S. Known as a determinantal point process (DPP), this distribution is used in machine learning to induce diversity in subset selection. When sampling from DDPs, we often wish to sample multiple subsets S with small expected size k = E[|S|] << n from a very large matrix L, so it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). For this purpose we provide DPP-VFX, a new algorithm which, given access only to L, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is n poly(k), i.e., sublinear in the size of L, and (2) its sampling cost is poly(k), i.e., independent of the size of L. Prior to our results, state-of-the-art exact samplers required O(n^3) preprocessing time and sampling time linear in n or dependent on the spectral properties of L. We furthermore give a reduction which allows using our algorithm for exact sampling from cardinality constrained determinantal point processes with n poly(k) time preprocessing. 
Our implementation of DPP-VFX is provided at https://github.com/guilgautier/DPPy/.", "full_text": "Exact sampling of determinantal point processes with sublinear time preprocessing

Michał Dereziński* (Department of Statistics, University of California, Berkeley; mderezin@berkeley.edu), Daniele Calandriello* (LCSL, Istituto Italiano di Tecnologia, Italy; daniele.calandriello@iit.it), Michal Valko (DeepMind Paris; valkom@deepmind.com)

Abstract

We study the complexity of sampling from a distribution over all index subsets of the set {1, ..., n} with the probability of a subset S proportional to the determinant of the submatrix L_S of some n × n positive semidefinite matrix L, where L_S corresponds to the entries of L indexed by S. Known as a determinantal point process (DPP), this distribution is used in machine learning to induce diversity in subset selection. When sampling from DPPs, we often wish to sample multiple subsets S with small expected size k := E[|S|] ≪ n from a very large matrix L, so it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). For this purpose we provide DPP-VFX, a new algorithm which, given access only to L, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is n · poly(k), i.e., sublinear in the size of L, and (2) its sampling cost is poly(k), i.e., independent of the size of L. Prior to our results, state-of-the-art exact samplers required O(n³) preprocessing time and sampling time linear in n or dependent on the spectral properties of L. We furthermore give a reduction which allows using our algorithm for exact sampling from cardinality constrained determinantal point processes with n · poly(k) time preprocessing. 
Our implementation of DPP-VFX is provided at https://github.com/guilgautier/DPPy/.

1 Introduction

Given a positive semi-definite (psd) n × n matrix L, a determinantal point process DPP(L), also known as an L-ensemble, is a distribution over all 2^n index subsets S ⊆ {1, ..., n} such that

    Pr(S) := det(L_S) / det(I + L),

where L_S denotes the |S| × |S| submatrix of L with rows and columns indexed by S. Determinantal point processes naturally appear across many scientific domains [Mac75, BLMV17, Gue83], and they have emerged as an important tool in machine learning [KT12] for inducing diversity in subset selection and as a variance reduction approach. DPP sampling has been successfully applied in core ML problems such as recommender systems [GKVM18, CZZ18, Bru18], stochastic optimization [ZKM17, ZÖMS19], data summarization [CKS+18], Gaussian processes [MS16, BRVDW19], experimental design [DW17, DWH18], and many more.

*Equal contribution.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

                       exact DPP   k-DPP   first sample    subsequent samples
[HKP+06, KT11]             x         x     n³              n·k²
[AGR16]                                    n · poly(k)     n · poly(k)
[LJS16b]                                   n² · poly(k)    n² · poly(k)
[LGD18]                    x               n³              poly(k · (1 + ‖L‖))
[Der19]                    x               n³              poly(rank(L))
DPP-VFX (this paper)       x         x     n · poly(k)     poly(k)

Table 1: Comparison of DPP and k-DPP algorithms using the L-ensemble representation. For a DPP, k denotes the expected subset size. Note that k ≤ rank(L) ≤ n. We omit log terms for clarity. 
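The normalization in the DPP definition above relies on the identity that the subset determinants sum to det(I + L). A small NumPy check on a toy kernel (hypothetical data, n kept tiny so that all 2^n subsets can be enumerated) confirms it:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy psd L-ensemble kernel (illustrative example only).
n = 5
B = rng.standard_normal((n, 3))
L = B @ B.T

# Pr(S) = det(L_S) / det(I + L); the determinant of the empty submatrix is 1.
Z = np.linalg.det(np.eye(n) + L)
total = 0.0
for size in range(n + 1):
    for S in itertools.combinations(range(n), size):
        S = list(S)
        total += np.linalg.det(L[np.ix_(S, S)]) if S else 1.0

# The subset determinants sum to det(I + L), so the probabilities sum to 1.
print(abs(total / Z - 1.0) < 1e-8)
```

This brute-force enumeration is of course exponential in n; the point of the paper is to avoid ever touching all of L, let alone all subsets.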
In these applications, we often wish to efficiently produce many DPP samples of small expected size² k := E[|S|] given a large matrix L. Sometimes, the distribution is restricted to subsets of fixed size |S| = k ≪ n, denoted k-DPP(L). [HKP+06] gave an algorithm for drawing samples from DPP(L) distributed exactly, later adapted to k-DPP(L) by [KT11], which can be implemented to run in polynomial time. In many applications, however, sampling is still a computational bottleneck because the algorithm requires performing the eigendecomposition of matrix L at the cost of O(n³). In addition to that initial cost, producing many independent samples S1, S2, ... at high frequency poses a challenge because the cost of each sample is at least linear in n. Many alternative algorithms exist for both DPPs and k-DPPs to reduce the computational cost of preprocessing and/or sampling, including many approximate and heuristic approaches. Contrary to approximate solutions, we present an algorithm which samples exactly from a DPP or a k-DPP with the initial preprocessing cost sublinear in the size of L and the sampling cost independent of the size of L.

Theorem 1 For a psd n × n matrix L, let S1, S2, ... be i.i.d. random sets from DPP(L) or from any k-DPP(L). Then, there is an algorithm which, given access to L, returns
a) the first subset, S1, in: n · poly(k) polylog(n) time,
b) each subsequent Si in: poly(k) time.

We refer to this algorithm as the Very Fast and eXact DPP sampler, or DPP-VFX. Table 1 compares DPP-VFX with other DPP and k-DPP sampling algorithms. In this comparison, we feature the methods that provide strong accuracy guarantees. As seen from the table, our algorithm is the first exact sampler to achieve sublinear overall runtime. 
Only the approximate MCMC sampler of [AGR16] matches our n · poly(k) complexity (and only for a k-DPP), but for this method every next sample is equally expensive, making it less practical when repeated sampling is needed. In fact, to our knowledge, no other exact or approximate method (with rigorous approximation guarantees) achieves the poly(k) sampling time of the present paper.

Our method is based on a technique developed recently by [DWH18, DWH19] and later extended by [Der19]. In this approach, we carefully downsample the index set [n] = {1, ..., n} to a sample σ = (σ1, ..., σt) ∈ [n]^t that is small but still sufficiently larger than the expected target size k, and then run a DPP on σ. As the downsampling distribution we use a regularized determinantal point process (R-DPP), proposed by [Der19], which (informally) samples σ with probability Pr(σ) ∝ det(I + L̃σ), where L̃ is a rescaled version of L. We can summarize this approach as follows, where |S| ≤ t ≪ n:

    {1, ..., n}  --R-DPP-->  σ = (σ1, ..., σt)  --DPP-->  S̃,    S = {σi : i ∈ S̃}.

The DPP algorithm proposed by [Der19] follows the same diagram, however it requires that the size of the intermediate sample σ be Ω(rank(L) · k). This means that their method provides improvement over [HKP+06] only when L can be decomposed as XX^⊤ for some n × r matrix X, with r ≪ n. However, in practice, matrix L is often only approximately low-rank, i.e., it exhibits some form of eigenvalue decay but it does not have a low-rank factorization. In this case, the results of [Der19] are vacuous both in terms of the preprocessing cost and the sampling cost, in that obtaining every sample would take Ω(n³). We propose a different R-DPP implementation (see DPP-VFX as Algorithm 1) where the expected size of σ is O(k²). To make the algorithm efficient, we use new connections between determinantal point processes, ridge leverage scores, and Nyström approximation.

² To avoid complicating the exposition with edge cases, we assume k ≥ 1. Note that this can always be satisfied without distorting the distribution by rescaling L by a constant, and is without loss of generality, as our analysis can be trivially extended to the case 0 ≤ k < 1 with some additional notation.

Definition 1 Given a psd matrix L, its ith λ-ridge leverage score (RLS) τi(λ) is the ith diagonal entry of L(λI + L)^{-1}. The λ-effective dimension d_eff(λ) is the sum of the leverage scores, Σi τi(λ).

An important connection between RLSs and DPPs is that when S ∼ DPP(L), the marginal probability of index i being sampled into S is equal to the ith 1-ridge leverage score of L, and the expected size k of S is equal to the 1-effective dimension:

    Pr(i ∈ S) = [L(I + L)^{-1}]_ii = τi(1),    k := E[|S|] = tr(L(I + L)^{-1}) = d_eff(1).

Intuitively, if the marginal probability of i is high, then this index should likely make it into the intermediate sample σ. This suggests that i.i.d. sampling of the indices σ1, ..., σt proportionally to 1-ridge leverage scores, i.e., Pr(σ1 = i) ∝ τi(1), should serve as a reasonable and cheap heuristic for constructing σ. In fact, we can show that this distribution can be easily corrected by rejection sampling to become the R-DPP that we need. Computing ridge leverage scores exactly costs O(n³), so instead we compute them approximately by first constructing a Nyström approximation of L.

Definition 2 Let L be a psd matrix and C a subset of its row/column indices with size m := |C|. 
Then we define the Nyström approximation of L based on C as the n × n matrix L̂ := (L_{C,I})^⊤ L_C^+ L_{C,I}.

Here, L_{C,I} denotes an m × n matrix consisting of (entire) rows of L indexed by C, and (·)^+ denotes the Moore-Penrose pseudoinverse. Since we use rejection sampling to achieve the right intermediate distribution, the correctness of our algorithm does not depend on which Nyström approximation is chosen. However, the subset C greatly influences the computational cost of the sampling through the rank of L̂ and the probability of rejecting a sample. Since rank(L̂) = |C|, operations such as multiplication and inversion involving the Nyström approximation will scale with m, and therefore a small subset increases efficiency. However, if L̂ is too different from L, the probability of rejecting the sample will be very high and the algorithm inefficient. In this case, a slightly larger subset could improve accuracy and acceptance rate without increasing too much the cost of handling L̂. Therefore, subset C has to be selected so that it is both small and accurately represents the matrix L. Here, we once again rely on ridge leverage score sampling, which has been effectively used for obtaining good Nyström approximations in a number of prior works such as [AM15, CLV17, RCCR18].

While our main algorithm can sample only from the random-size DPP, and not from the fixed-size k-DPP, we present a rigorous reduction argument which lets us use our DPP algorithm to sample exactly from a k-DPP (for any k) with a small computational overhead.

Related work. Prior to our work, fast exact sampling from generic DPPs has been considered out of reach. 
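The Nyström approximation of Definition 2 takes a few lines of NumPy. The sketch below (toy kernel, hypothetical index set C) also checks two facts used throughout the paper: rank(L̂) is at most |C|, and choosing C = [n] recovers L exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
B = rng.standard_normal((n, 4))
L = B @ B.T  # psd kernel

def nystrom(L, C):
    # L_hat = L[:, C] @ pinv(L[C, C]) @ L[C, :]   (Definition 2)
    LIC = L[:, C]
    return LIC @ np.linalg.pinv(L[np.ix_(C, C)]) @ LIC.T

C = [0, 2, 5]
L_hat = nystrom(L, C)

# rank(L_hat) <= |C|, and the full index set gives back L (since L pinv(L) L = L for psd L).
print(np.linalg.matrix_rank(L_hat) <= len(C))
print(np.allclose(nystrom(L, list(range(n))), L))
```

The pseudoinverse is used because L_C may itself be rank-deficient; for a well-conditioned L_C an ordinary solve would do.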
The \ufb01rst procedure to sample general DPPs was given by [HKP+06] and even most\nrecent exact re\ufb01nements [LGD18, Der19, Pou19], when the DPP is represented in the form of an\nL-ensemble, require preprocessing that amounts to an expensive n \u21e5 n matrix diagonalization at the\ncost O(n3), which is shown as the \ufb01rst-sample complexity column in Table 1.\nNonetheless, there are well-known samplers for very speci\ufb01c DPPs that are both fast and exact, for\ninstance for sampling uniform spanning trees [Ald90, Bro89, PW98], which leaves the possibility of\na more generic fast sampler open. Since the sampling from DPPs has several practical large scale\nmachine learning applications [KT12], there are now a number of methods known to be able to\nsample from a DPP approximately, outlined in the following paragraphs.\nAs DPPs can be speci\ufb01ed by kernels (L-kernels or K-kernels), a natural approximation strategy is\nto resort to low-rank approximations [KT11, GKT12, AKFT13, LJS16a]. For example, [AKFT13]\nprovides approximate guarantee for the probability of any subset being sampled as a function of\neigengaps of the L-kernel. Next, [LJS16a] construct coresets approximating a given k-DPP and then\nuse them for sampling. In their Section 4.1, [LJS16a] show in which cases we can hope for a good\napproximation. These guarantees become tight if these approximations (Nystr\u00f6m subspace, coresets)\nare aligned with data. In our work, we aim for an adaptive approach that is able to provide a good\napproximation for any DPP.\nThe second class of approaches are based on Markov chain Monte-Carlo [MU49] techniques [Kan13,\nRK15, AGR16, LJS16b, GBV17]. There are known polynomial bounds on the mixing rates [DS91]\n\n3\n\n\fof MCMC chains with arbitrary DPPs as their limiting measure. In particular, [AGR16] showed\nthem for cardinality-constrained DPPs and [LJS16b] for the general case. 
The two chains have mixing times which are, respectively, linear and quadratic in n (see Table 1). Unfortunately, for any subsequent sample we need to wait until the chain mixes again.

Neither the known low-rank approximations nor the known MCMC methods are able to provide samples that are exactly distributed according to a DPP (also called perfect sampling). This is not surprising as having scalable and exact sampling is very challenging in general. For example, methods based on rejection sampling are always exact, but they typically scale poorly to high-dimensional data and are adversely affected by the spikes in the distribution [EVCM16], resulting in a high rejection rate and inefficiency. Surprisingly, our method is based on both low-rank approximation (a source of inaccuracy) and rejection sampling (a common source of inefficiency). In the following section, we show how to obtain a perfect DPP sampler from a Nyström approximation of the L-kernel. Then, to guarantee efficiency, in Section 3 we bound the number of rejections, which is possible thanks to the use of intermediate downsampling.

2 Exact sampling using any Nyström approximation

Notation. We use [n] to denote the set {1, . . . , n}. For a matrix B ∈ R^{n×m} and index sets C, D, we use B_{C,D} to denote the submatrix of B consisting of the intersection of rows indexed by C with columns indexed by D. If C = D, we use the shorthand B_C, and if D = [m], we may write B_{C,I}. Finally, we also allow C, D to be multisets or sequences, in which case each row/column is duplicated in the matrix according to its multiplicity (and in the case of sequences, we order the rows/columns as they appear in the sequence). Note that with this notation, if L = BB^⊤ then L_{C,D} = B_{C,I} B_{D,I}^⊤.

As discussed in the introduction, our method relies on an intermediate downsampling distribution to reduce the size of the problem. 
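The notation just introduced, including the multiset convention, can be checked numerically (toy matrices; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3
B = rng.standard_normal((n, m))
L = B @ B.T

# With L = B B^T, a submatrix satisfies L_{C,D} = B_{C,I} B_{D,I}^T.
C, D = [1, 4], [0, 2, 5]
print(np.allclose(L[np.ix_(C, D)], B[C] @ B[D].T))

# Sequences with repeats duplicate the corresponding rows/columns.
sigma = [3, 3, 0]
L_sigma = L[np.ix_(sigma, sigma)]
print(L_sigma.shape == (3, 3) and np.isclose(L_sigma[0, 1], L[3, 3]))
```

The duplication convention matters later: the intermediate sample σ is drawn i.i.d., so it can contain repeated indices, and L̃σ is built with exactly this duplicated-submatrix rule.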
The exactness of our sampler relies on the careful choice of that intermediate distribution. To that end, we use regularized determinantal processes, introduced by [Der19]. In the definition below, we adapt them to the kernel setting.

Definition 3 Given an n × n psd matrix L, a distribution p := (p1, . . . , pn) and r > 0, let L̃ denote the n × n matrix with entries L̃_ij := (1/(r √(pi pj))) L_ij for all i, j ∈ [n]. We define R-DPP^r_p(L) as a distribution over events A ⊆ ∪_{t=0}^∞ [n]^t, where

    Pr(A) := E[ 1[σ ∈ A] det(I + L̃σ) ] / det(I + L),    for σ = (σ1, . . . , σt) i.i.d. ∼ p,  t ∼ Poisson(r).

Since the term det(I + L̃σ) has the same form as the normalization constant of DPP(L̃σ), an easy calculation shows that the R-DPP can be used as an intermediate distribution in our algorithm without introducing any distortion in the sampling.

Proposition 1 (Der19, Theorem 8) For any L, p, r, and L̃ defined as in Definition 3, if σ ∼ R-DPP^r_p(L) and S̃ ∼ DPP(L̃σ), then {σi : i ∈ S̃} ∼ DPP(L).

To sample from the R-DPP, our algorithm uses rejection sampling, where the proposal distribution is sampling i.i.d. proportionally to the approximate 1-ridge leverage scores li ≈ τi(1) (see Definition 1 and the following discussion), computed using any Nyström approximation L̂ of matrix L. Apart from L̂, the algorithm also requires an additional parameter q, which controls the size of the intermediate sample. Because of rejection sampling and Proposition 1, the correctness of the algorithm does not depend on the choice of L̂ and q, as demonstrated in the following result.

Theorem 2 Given a psd matrix L, any one of its Nyström approximations L̂ and any positive q, DPP-VFX (Algorithm 1) returns S ∼ DPP(L).

The key part of the proof involves showing that the acceptance probability in Line 6 is bounded by 1. Here, we obtain a considerably tighter bound than the one achieved by [Der19], which allows us to use a much smaller intermediate sample (see Section 3) while maintaining the efficiency of rejection sampling.

Algorithm 1 DPP-VFX sampling S ∼ DPP(L)
Input: L ∈ R^{n×n}, its Nyström approximation L̂, q > 0
 1: Compute li := [(L − L̂) + L̂(I + L̂)^{-1}]_ii ≈ Pr(i ∈ S)
 2: Initialize s := Σi li,  z := tr(L̂(I + L̂)^{-1}),  L̃ := (s/q) [ (1/√(lσi lσj)) L_{σi,σj} ]_{i,j}
 3: repeat
 4:     sample t ∼ Poisson(q e^{s/q})
 5:     sample σ1, . . . , σt i.i.d. ∼ (l1/s, · · · , ln/s)
 6:     sample Acc ∼ Bernoulli( e^z det(I + L̃) / (e^{ts/q} det(I + L̂)) )
 7: until Acc = true
 8: sample S̃ ∼ DPP(L̃)
 9: return S = {σi : i ∈ S̃}

An important implication of Theorem 2 is that even though the choice of L̂ affects the overall execution of the algorithm, it does not affect the distribution of the output. Therefore we can reuse the same L̂ to produce multiple independent samples S1, S2, ... ∼ DPP(L), which we prove in Appendix A. This is particularly important because, as we will see in Section 3, DPP-VFX can generate successive samples S2, S3, . . . much more cheaply than the first sample S1.

Lemma 1 Let C ⊆ [n] be a random set variable with any distribution. Suppose that S1 and S2 are returned by two executions of DPP-VFX, both using inputs constructed from the same L and L̂ = L_{I,C} L_C^+ L_{C,I}. Then S1 and S2 are (unconditionally) independent.

Note that the definition of L̃ (Line 2) is not exactly the one that would be suggested by Definition 3. In particular, DPP-VFX uses q e^{s/q} instead of just q as the mean parameter for t. The extra e^{s/q} factor is necessary to correct for the rejection sampling step in Line 6 during the sampling loop; see the proof below.

Before proceeding with the proof, we highlight for clarity that the Poisson r.v. t in DPP-VFX (Algorithm 1, Line 4) has a different role than the Poisson r.v. 
used in the de\ufb01nition of R-DPPs.\n\nCLC,I. Then S1 and S2 are (unconditionally) independent.\n\nProof of Theorem 2 We start by showing that the Bernoulli probability in Line 6 is bounded by 1.\nNote that this is important not only to sample correctly, but also when we later establish the ef\ufb01ciency\nof the algorithm. If we showed a weaker upper bound, say c > 1, we could always divide the\nexpression by c and retain the correctness, however it would also be c times less likely that Acc = 1.\n\nCBC,IB> = BPB,>\nCBC,I is a projection, so that P2 = P. Let\nb>i . Then, we have\n\nSincebL is a Nystr\u00f6m approximation for some C \u2713I , it can be written as\nbL = LI,CL+\nCLC,I = BB>C,IL+\nfor any B such that L = BB,> where P , B>C,IL+\neL , eBeB,> where the ith row of eB is the rescaled ith row of B, i.e.eb>i ,q s\ndet(I +eB>,IeB,I)\n= detI + (eB>,IeB,I PB>BP)(I + PB>BP)1\n\uf8ff exp\u21e3tr(eB>,IeB,I PB>BP)(I + PB>BP)1\u2318\n= exp\u2713 tXi=1\n\ndet(I +eL)\ndet(I +bL)\n\ndet(I + PB>BP)\n\nqli\u21e5B(I + PB>BP)1B>\u21e4ii\u25c6 \u00b7 ez = ets/qez,\n\n= det(I +eB>,IeB,I)(I + PB>BP)1\n\n=\n\nqli\n\ns\n\nwhere the last equality follows because\n\nB(I + PB>BP)1B> = BI PB>(I + BPB>)1BPB>\n\n= L bL(I +bL)1bL = L bL +bL(I +bL)1.\n\n5\n\n\fas is after exiting the repeat loop. It follows that\n\nThus, we showed that the expression in Line 6 is valid. Lete denote the random variable distributed\nPr(e 2 A) / E\uf8ff1[2A]\neq es/q t! \u00b7 ets/q E\u21e51[2A] det(I +eL) | t\u21e4\nwhich shows thate \u21e0 R-DPPq\n\nets/q det(I +bL) /\n1Xt=0\nez det(I +eL)\n/ Et0hE\u21e51[2A] det(I +eL) | t = t0\u21e4i\n\ns ). 
The claim follows from Proposition 1.\n\n3 Conditions for fast sampling\n\nfor t0 \u21e0 Poisson(q),\n\nl (L) for l = ( l1\ns\n\n, \u00b7 \u00b7 \u00b7 , ln\n\n(q es/q)t\n\nThe complexity cost of DPP-VFX can be roughly summarized as follows: we pay a large one-\n\nrejection sampling scheme which must be multiplied by the number of times we repeat the loop\n\ntime cost to precomputebL and all its associated quantities, and then we pay a smaller cost in the\nuntil acceptance. We \ufb01rst show that ifbL is suf\ufb01ciently close to L then we will exit the loop with\n\nhigh probability. We then analyze how accurate the precomputing step needs to be to satisfy this\ncondition. Finally, we bound the overall computational cost, which includes also the \ufb01nal step where\nwe sample S out of the intermediate sample using any off-the-shelf exact DPP sampler.\n1) Bounding the number of rejections The following result presents the two conditions needed\nfor achieving ef\ufb01cient rejection sampling in DPP-VFX. First, the Nystr\u00f6m approximation needs to\nbe accurate enough, and second, the intermediate sample size (controlled by parameter q) needs to\nbe \u2326(k2). This is a signi\ufb01cant improvement over the guarantee of [Der19] where the intermediate\nsample size is \u2326(rank(L) \u00b7 k), which is only meaningful for low-rank kernels. The main novelty\nin this proof comes from showing the following lower bound on the ratio of the determinants of\nI + L and I +bL, whenbL is a Nystr\u00f6m approximation: det(I + L)/ det(I +bL) ekz, where\nz , tr(bL(I +bL)1). Remarkably, this bound exploits the fact that any Nystr\u00f6m approximation of\nL = BB> can be written asbL = BPB>, where P is a projection matrix. Note that while our result\nholds in the worst case, in deployement, the conditions onbL and on q can be considerably relaxed.\nTheorem 3 If the Nystr\u00f6m approximationbL and the intermediate sample size parameter q satisfy\nthen Pr(Acc = true) e2. 
Therefore, with probability 1 , Algorithm 1 exits the rejection\nsampling loop after at most O(log 1) iterations and, after precomputing all of the inputs, the time\ncomplexity of the rejection sampling loop is Ok6 log 1 + log41.\nP , E\uf8ff ez det(I +eL)\nets/q det(I +bL) =\n\ndet(I +bL)\n1Xt=0\neqt!E\u21e5det(I +eL) | t\u21e4 = eqq es/q+z det(I + L)\ndet(I +bL)\nwhere the last equality follows because the in\ufb01nite series computes the normalization constant of\nR-DPPq\nl (L) given in De\ufb01nition 3. If s 1 then q = s2 and the inequality ex \uf8ff 1 + x + x2 for\nx 2 [0, 1] implies that q q es/q + z = s2(1 e1/s + 1/s) + z s 1 + z s. On the other\nhand, if q = s 2 [0, 1], then q qes/q + z = (1 e)s + z 1 + z s. We proceed to lower bound\nthe determinantal ratio. Here, let L = BB> andbL = BPB,> where P is a projection matrix. Then,\n\nProof We \ufb01rst bound the number of rejections, and then discuss the complexity of each loop iteration.\nLet be distributed as in Line 5. The probability of exiting the repeat loop at each iteration is\n\ntrL(I + L)1L bL(I + L)1bL \uf8ff 1\n\n1Xt=0\ndet(I +bL)\n\nE\u21e5 det(I +eL) | t\u21e4\n\nq max{s2, s},\n\n(q es/q)t\neq es/q t! 
\u00b7\n\ndet(I + B>B)\n\neqq es/q+z\n\nqt\n\nezts/q\n\nand\n\n=\n\ndet(I + L)\n\ndet(I +bL)\n\ndet(I + PB>BP)\n\n= detI (B>B PB>BP)(I + B>B)11\n\n exp\u21e3tr(B>B PB>BP)(I + B>B)1\u2318\n= exp\u21e3trB(I + B>B)1B> trBP(I + B>B)1PB>\u2318,\n\n=\n\n,\n\n6\n\n\fwhere we can reformulate k = trB(I + B>B)1B> and the other elements in terms ofbL and L as\n\nexp\u21e3k trBP(I + B>B B>B)(I + B>B)1PB>\u2318\n= exp\u21e3k tr(bL) + trBPB>B(I + B>B)1PB>\u2318\n= exp\u21e3k tr(bL) + trBPB>(I + BB>)1BPB>\u2318\n= exp\u21e3k tr(bL) + trbL(I + L)1bL\u2318.\ndet(I +bL) exp\u21e3 1 + z s + k tr(bL) + trbL(I + L)1bL\u2318,\n\nPutting all together,\n\neqq es/q+z det(I + L)\n\n, \u00b7 \u00b7 \u00b7 , ln\n\nand therefore,\n\neqq es/q+z det(I + L)\n\nand using the de\ufb01nitions k = trL(I+L)1, s = tr(LbL+bL(bL+I)1), and z = tr(bL(bL+I)1)\non the middle term z s + k tr(bL) we have\ntrbL(bL + I)1 + L(I + L)1 L +bL bL(bL + I)1 bL\n= trL(I + L)1 L = trL(I + L)1 L(I + L)1(I + L) = trL(I + L)1L,\ndet(I +bL) exp\u21e3 1 + trbL(I + L)1bL trL(I + L)1L\u2318,\n\nand we obtain our condition. Thus, with probability 1 , the main loop will be repeated O(log 1)\ntimes. We now quantify the cost of a single loop iteration. First note that since the number of samples\nti drawn from l in the ith iteration of the loop is a Poisson distributed random variable, a standard\nPoisson tail bound implies that with probability 1 , all iterations will satisfy ti = O(k2 + log 1).\nThen, drawing the Poisson r.v. ti (Line 4) requires O(k2 + log 1) time. Drawing from the\nmultinomial ( l1\ns ) can be done by \ufb01rst sorting the li, which is a one-time O(n log(n)) cost,\ns\nand then using specialized samplers which require O(1) time [BP12]. 
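The O(1)-per-draw multinomial samplers referenced here are typically built on alias tables; below is a minimal sketch of Walker's alias method with Vose's O(n) construction. This is an illustrative stand-in, not necessarily the exact construction of [BP12]. The final check verifies the table exactly: each cell i holds mass prob[i]/n of item i and (1 − prob[i])/n of item alias[i], and these must add back up to the input distribution.

```python
import numpy as np

def build_alias(weights):
    """Vose's O(n) construction of an alias table for a discrete distribution."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    n = len(p)
    prob, alias = np.zeros(n), np.zeros(n, dtype=int)
    scaled = p * n
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]   # donate the leftover of cell s from item l
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:            # leftovers are numerically full cells
        prob[i] = 1.0
    return prob, alias

def draw(prob, alias, rng):
    """One O(1) draw: pick a cell uniformly, then flip its biased coin."""
    i = rng.integers(len(prob))
    return i if rng.random() < prob[i] else alias[i]

w = [0.5, 0.1, 0.25, 0.15]
prob, alias = build_alias(w)

# Reconstruct the distribution the table induces and compare to the input.
mass = np.zeros(len(w))
for i in range(len(w)):
    mass[i] += prob[i] / len(w)
    mass[alias[i]] += (1.0 - prob[i]) / len(w)
print(np.allclose(mass, w))

rng = np.random.default_rng(0)
print(0 <= draw(prob, alias, rng) < len(w))
```

After the one-time O(n log n) (or O(n)) setup, each of the t i.i.d. draws of Line 5 costs O(1), which is what the complexity accounting above assumes.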
Therefore, the dominant cost\nis computing the determinant of the matrix I +eL in O(t3\nWe separate the analysis into two steps: how much it costs to choosebL to satisfy the assumption of\nTheorem 3, and how much it costs to compute everything else givenbL, see Appendix A.\nLemma 2 LetbL be constructed by sampling m = O(k3 log n\nThen, with probability 1 ,bL satis\ufb01es the assumption of Theorem 3.\nThere exist many algorithms to sample columns proportionally to their RLSs. For example, we can\ntake the BLESS [RCCR18] with the following guarantee.\nProposition 2 (RCCR18, Theorem 1) There exists an algorithm that with probability 1 sam-\nples m columns proportionally to their RLSs in O(nk2 log2 n\nWe can now compute the remaining preprocessing costs, given a Nystr\u00f6m approximationbL.\nLemma 3 GivenbL with rank m, we can compute li, s, z, andeL in O(nm2 + m3) time.\n3) Bounding the overall cost We can now fully characterize the computational cost.\nTheorem 1 (restated for DPPs only) For a psd n\u21e5 n matrix L, let S1, S2 be i.i.d. random sets from\nDPP(L). Denote withbL a Nystr\u00f6m approximation of L obtained by sampling m = O(k3 log(n/))\nof its columns proportionally to their RLSs. If q max{s2, s}, then w.p. 1 , DPP-VFX returns\n\n2) Bounding the precompute cost All that is left is to control the cost of the precomputation phase.\n\n ) columns proportionally to their RLS.\n\n + k3 log4 n\n\n + m) time.\n\ni ), and the result follows.\n\n + k9 log3 n\n\n + k3 log4 n\n\n ) time,\n\na) subset S1 in: O(nk6 log2 n\nb) then, S2 in: Ok6 log 1\n\n + log4 1\n\n time.\n\n7\n\n\fDiscussion The above result follows from the bounds on the precompute costs, on the number of\n\niteratios, on the iterations cost, and on the fact that the \ufb01nal DPP sampling step eS \u21e0 DPPeL\n\nrequires O(t3) \uf8ffO ((k2 +log 1)3) time. 
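The pieces analyzed above can be combined into a compact end-to-end sketch of Algorithm 1, with a standard spectral sampler [HKP+06] for the final DPP step. This is a didactic NumPy rendering, not the optimized DPPy implementation; the toy kernel, the choice of C, and the clipping constant are illustrative assumptions. The assertion inside the loop checks the Theorem 2 fact that the acceptance probability is at most 1.

```python
import numpy as np

def sample_dpp_spectral(L, rng):
    """Standard exact DPP(L) sampler via eigendecomposition [HKP+06]."""
    lam, V = np.linalg.eigh(L)
    keep = rng.random(len(lam)) < lam / (1.0 + lam)   # phase 1: pick eigenvectors
    V = V[:, keep]
    S = []
    while V.shape[1] > 0:                              # phase 2: pick one item per step
        p = (V ** 2).sum(axis=1)
        i = rng.choice(len(p), p=p / p.sum())
        S.append(int(i))
        j = np.argmax(np.abs(V[i]))                    # a column with V[i, j] != 0
        V = V - np.outer(V[:, j] / V[i, j], V[i])      # zero out row i in all columns
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                     # re-orthonormalize
    return S

def dpp_vfx(L, C, rng):
    """Didactic DPP-VFX: Nystrom approximation, i.i.d. RLS proposal, rejection."""
    n = L.shape[0]
    LIC = L[:, C]
    L_hat = LIC @ np.linalg.pinv(L[np.ix_(C, C)]) @ LIC.T
    M = L_hat @ np.linalg.inv(np.eye(n) + L_hat)
    l = np.maximum(np.diag(L - L_hat) + np.diag(M), 1e-12)  # approximate RLSs (Line 1)
    s, z = l.sum(), np.trace(M)
    q = max(s * s, s)                                        # q >= max(s^2, s)
    logdet_hat = np.linalg.slogdet(np.eye(n) + L_hat)[1]
    while True:
        t = rng.poisson(q * np.exp(s / q))                   # Line 4
        sigma = rng.choice(n, size=t, p=l / s)               # Line 5
        Lt = (s / q) * L[np.ix_(sigma, sigma)] / np.sqrt(np.outer(l[sigma], l[sigma]))
        logdet_t = np.linalg.slogdet(np.eye(t) + Lt)[1]
        acc = np.exp(z + logdet_t - t * s / q - logdet_hat)  # Line 6
        assert acc <= 1.0 + 1e-6                             # guaranteed by Theorem 2
        if rng.random() < acc:
            return sorted({int(sigma[i]) for i in sample_dpp_spectral(Lt, rng)})

rng = np.random.default_rng(0)
B = rng.standard_normal((30, 4))
L = B @ B.T  # toy psd kernel (hypothetical data)
S = dpp_vfx(L, C=[0, 5, 10, 15, 20, 25], rng=rng)
print(all(0 <= i < 30 for i in S))
```

All work after the one-time Nyström/RLS precomputation happens on t × t matrices with t = O(k² + log(1/δ)), which is where the poly(k) resampling cost comes from.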
Note however that due to the nature of rejection sampling, as long as we exit the loop, i.e., we accept the sample, the output of DPP-VFX is guaranteed to follow the DPP distribution for any value of m and q. In Theorem 1 we set m = O(k³ log(n/δ)) and q ≥ max{s², s} to satisfy Theorem 3 and guarantee a constant acceptance probability in the rejection sampling loop, but this might not be necessary or even desirable in practice. Experimentally, much smaller values of m, starting from m = Ω(k log(n/δ)), seem to be sufficient to accept the sample, while at the same time a smaller m greatly reduces the preprocessing costs. In general, we recommend separating DPP-VFX into three phases. First, compute an accurate estimate of the RLSs using off-the-shelf algorithms in O(nk² log²(n/δ) + k³ log⁴(n/δ)) time. Then, sample a small number m of columns to construct an explorative L̂, and try to run DPP-VFX. If the rejection sampling loop does not terminate sufficiently fast, then we can reuse the RLS estimates to compute a more accurate L̂ for a larger m. Using a simple doubling schedule for m, this strategy quickly reaches a regime where DPP-VFX is w.h.p. guaranteed to accept, resulting in faster sampling.

4 Reduction from DPPs to k-DPPs

We next show that with a simple extra rejection sampling step we can efficiently transform any exact DPP sampler into an exact k-DPP sampler. A common heuristic to sample S from a k-DPP is to first sample S from a DPP, and then reject the sample if the size of S is not exactly k. As we show in this section, the success probability of this procedure can be improved by appropriately rescaling L by a constant factor α:

    sample S_α ∼ DPP(αL),    accept if |S_α| = k.

Note that rescaling the DPP by a constant α as above only changes the expected size of the set S_α, and not its distribution. 
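A simple way to pick the rescaling in practice is to tune α so that the expected size k_α = tr(αL(I + αL)^{-1}) matches the target k; since k_α is increasing in α, bisection works. This is a heuristic sketch only (the α* of Theorem 4 is chosen so that k is the mode of |S_α|, and Lemma 4 avoids the full eigendecomposition used here for clarity):

```python
import numpy as np

def expected_size(alpha, eigs):
    # E[|S_alpha|] = tr(alpha*L (I + alpha*L)^{-1}) = sum_i alpha*lam_i / (1 + alpha*lam_i)
    return np.sum(alpha * eigs / (1.0 + alpha * eigs))

def rescale_to_size(L, k, iters=100):
    """Bisection on alpha; expected_size is monotonically increasing in alpha."""
    eigs = np.linalg.eigvalsh(L)
    lo, hi = 1e-12, 1.0
    while expected_size(hi, eigs) < k:   # grow until the target is bracketed
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if expected_size(mid, eigs) < k else (lo, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(3)
B = rng.standard_normal((50, 20))
L = B @ B.T  # toy kernel (hypothetical data); note k must not exceed rank(L)
alpha = rescale_to_size(L, k=5)
print(abs(expected_size(alpha, np.linalg.eigvalsh(L)) - 5) < 1e-6)
```

With α fixed, the k-DPP sampler is then just the rejection loop described above around any exact DPP(αL) sampler.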
Therefore, if we accept only sets with size k, we will be sampling exactly from our k-DPP. Moreover, if k_α = E[|S_α|] is close to k, the success probability will improve. With a slight abuse of notation, in the context of k-DPPs we will indicate with k the desired size of S_α, i.e., the final output size at acceptance, and with k_α = E[|S_α|] the expected size of the scaled DPP. While the above rejection sampling heuristic is widespread, until now there has been no proof that it can provably succeed with few rejections. We solve this open question with two new results. First, we show that for an appropriate rescaling α* we only reject S_α* roughly O(√k) times. Then, we show how to find such an α* with an Õ(n · poly(k)) time preprocessing step.

Theorem 4 There exists a constant C > 0 such that for any rank-n psd matrix L and k ∈ [n], there is an α* > 0 with the following property: if we sample S_α* ∼ DPP(α*L), then Pr(|S_α*| = k) ≥ 1/(C√k).

The proof in Appendix B relies on a known Chernoff bound for the sample size |S_α| of a DPP. When applied naïvely, the inequality does not offer a lower bound on the probability of any single sample size. However, we show that the probability mass is concentrated on O(√k_α) sizes. This leads to a lower bound on the sample size with the largest probability, i.e., the mode of the distribution. Then, it remains to observe that for any k ∈ [n] we can always find α* for which k is the mode of |S_α*|. We conclude that given α*, the rejection sampling scheme described above transforms any poly(k) time DPP sampler into a poly(k) time k-DPP sampler. It remains to efficiently find α*, which once again relies on using a Nyström approximation of L.

Lemma 4 If k ≥ 1, then there is an algorithm that finds α* in Õ(n · poly(k)) time.

While the existence proof of α* 
is general and based on simple unimodality, this characterization is not sufficient to control the mode of the DPP when α* is perturbed, as happens during an approximate optimization of α. However, for DPPs the mode can be expressed via a Poisson binomial random variable [H+56] based on the spectrum of L, which can be controlled [D+64] using the approximate spectrum of L̂. While our analysis shows that in the worst case a much more accurate L̂ is necessary compared to Theorem 3, in practice the same m = Ω(k log n) seems to suffice.

Figure 1: First-sample cost for DPP-VFX against other DPP samplers (n is data size).

Figure 2: Resampling cost for DPP-VFX compared to the exact sampler of [HKP+06].

5 Experiments

In this section, we experimentally evaluate³ the performance of DPP-VFX and compare it to exact sampling [HKP+06] and MCMC-based approaches [AGR16]. In particular, we are only interested in evaluating computational performance, i.e., how DPP-VFX and the baselines scale with the size n of the matrix L. Note that an inexact DPP sampler, e.g., an MCMC sampler or a previous approximate rejection sampler [Der19], also has another metric to validate: it must empirically show that its samples are close enough to a DPP distribution. However, Section 2 proves that DPP-VFX's output is distributed exactly according to the DPP, and therefore it is strictly equivalent to any other exact DPP sampler, e.g., the Gram-Schmidt [Gil14] or dual [KT12] samplers implemented in DPPy. We use the infinite MNIST digits dataset (i.e., MNIST8M [LCB07]) as input data, where n varies up to 10⁶ and d = 784. All algorithms, including DPP-VFX, are implemented in Python as part of the DPPy library [GBV19]. All experiments are carried out on a 24-core CPU, taking full advantage of available parallelization.
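In the experiments below, the rescaling constant is chosen so that the expected sample size E[|S_α|] = Σᵢ αλᵢ/(1 + αλᵢ) matches a target size. Since this quantity is continuous and strictly increasing in α, a one-dimensional bisection suffices; the sketch below assumes access to the (possibly approximate) eigenvalues of L as a NumPy array, and the helper name find_alpha is ours, a practical mean-matching proxy for the mode-based α* of Lemma 4.

```python
import numpy as np

def find_alpha(eigvals, k, iters=100):
    """Bisection for alpha with E[|S_alpha|] = sum_i a*l_i/(1+a*l_i) = k.
    Requires k to be below the number of nonzero eigenvalues, so that the
    monotone map a -> E[|S_alpha|] actually attains the value k."""
    expected_size = lambda a: float(np.sum(a * eigvals / (1.0 + a * eigvals)))
    lo, hi = 0.0, 1.0
    while expected_size(hi) < k:      # grow the bracket until it contains k
        hi *= 2.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_size(mid) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For instance, with four unit eigenvalues and a target size of 2, the search returns α ≈ 1, since 4·α/(1+α) = 2 exactly at α = 1.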
For each experiment we report the mean and a 95% confidence interval over 10 runs of each algorithm's runtime.

The first experiment compares the time required to generate a first sample. We consider subsets of MNIST8M that go from n = 10³ to n = 7·10⁴, i.e., the original MNIST dataset, and use an RBF kernel with σ = √(3d) to construct L. For the Nyström approximation we set m = 10·d_eff(1) ≈ 10k. While this is much lower than the O(k³) value suggested by the theory, as we will see it is already accurate enough to result in drastic runtime improvements over exact and MCMC sampling. Following the strategy of Section 4, for each algorithm we control⁴ the size of the output set by rescaling the input matrix L by a constant α* such that E[|S_α*|] = 10. Results are reported in Figure 1. Exact sampling is performed using eigendecomposition and DPPy's default Gram-Schmidt [Gil14] sampler. It is clearly cubic in n, and we could not push it beyond n = 1.5·10⁴. For MCMC, we enforce mixing by running the chain for nk steps, the minimum recommended by [AGR16]. However, for n = 7·10⁴ the MCMC runtime is 160s and exceeds the plot's limit, while DPP-VFX completes in 16s, an order of magnitude faster. Moreover, DPP-VFX rarely rejects more than 10 times, and the mode of the number of rejections up to n = 7·10⁴ is 1, i.e., we mostly accept at the first iteration.

Since in Figure 1 the resampling time of both the exact sampler and DPP-VFX is negligible, we investigate the cost of a second sample, i.e., of resampling, on a much larger subset of MNIST8M with n up to 10⁶, normalized to have maximum row norm equal to 1. In this case it is not possible to perform an eigendecomposition of L, so we replace the RBF kernel with a linear kernel, for which the input X provides a factorization L = XX⊤, and we use an exact dual sampler on X [KT12].
However, as Figure 2 shows, even with a special algorithm for this simpler setting, the resampling process still scales with n. On the other hand, DPP-VFX's complexity (after preprocessing) is dramatically lower, as it scales only with k and remains constant regardless of n.

³ The code used for these experiments is available at https://github.com/LCSL/dpp-vfx.
⁴ For simplicity we do not perform the full k-DPP rejection step, but only adjust the expected size of the set.

Acknowledgements

MD thanks the NSF for funding via the NSF TRIPODS program. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research.

References

[AGR16] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 103–115. PMLR, 2016.

[AKFT13] Raja Hafiz Affandi, Alex Kulesza, Emily Fox, and Ben Taskar. Nyström approximation for large-scale determinantal processes. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 85–98. PMLR, 2013.

[Ald90] David J. Aldous. The random walk construction of uniform spanning trees and uniform labelled trees. SIAM Journal on Discrete Mathematics, 3(4):450–465, 1990.

[AM15] Ahmed El Alaoui and Michael W.
Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 775–783, 2015.

[BLMV17] Rémi Bardenet, Frédéric Lavancier, Xavier Mary, and Aurélien Vasseur. On a few statistical applications of determinantal point processes. ESAIM: Proceedings and Surveys, 60:180–202, 2017.

[BP12] Karl Bringmann and Konstantinos Panagiotou. Efficient sampling methods for discrete distributions. In International Colloquium on Automata, Languages, and Programming, pages 133–144. Springer, 2012.

[Bra14] Petter Brändén. Unimodality, log-concavity, real-rootedness and beyond. Handbook of Enumerative Combinatorics, 2014.

[Bro89] A. Broder. Generating random spanning trees. In 30th Annual Symposium on Foundations of Computer Science, pages 442–447. IEEE, 1989.

[Bru18] Victor-Emmanuel Brunel. Learning signed determinantal point processes through the principal minor assignment problem. In Advances in Neural Information Processing Systems 31, pages 7365–7374. Curran Associates, Inc., 2018.

[BRVDW19] David Burt, Carl Edward Rasmussen, and Mark van der Wilk. Rates of convergence for sparse variational Gaussian process regression. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 862–871. PMLR, 2019.

[CCL+19] Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and Lorenzo Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret.
In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 533–557. PMLR, 2019.

[CKS+18] Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth Vishnoi. Fair and diverse DPP-based data summarization. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 716–725. PMLR, 2018.

[CLV17] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In AISTATS, 2017.

[CZZ18] Laming Chen, Guoxin Zhang, and Eric Zhou. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems 31, pages 5622–5633. Curran Associates, Inc., 2018.

[D+64] John N. Darroch et al. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, 35(3):1317–1321, 1964.

[Der19] Michał Dereziński. Fast determinantal point processes via distortion-free intermediate sampling. In Proceedings of the 32nd Conference on Learning Theory, 2019.

[DS91] Persi Diaconis and Daniel Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, 1991.

[DW17] Michał Dereziński and Manfred K. Warmuth. Unbiased estimates for linear regression via volume sampling. In Advances in Neural Information Processing Systems 30, pages 3087–3096, 2017.

[DWH18] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu.
Leveraged volume sampling for linear regression. In Advances in Neural Information Processing Systems 31, pages 2510–2519. Curran Associates, Inc., 2018.

[DWH19] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.

[EVCM16] Akram Erraqabi, Michal Valko, Alexandra Carpentier, and Odalric-Ambrym Maillard. Pliable rejection sampling. In International Conference on Machine Learning, 2016.

[GBV17] Guillaume Gautier, Rémi Bardenet, and Michal Valko. Zonotope hit-and-run for efficient sampling from projection DPPs. In International Conference on Machine Learning, 2017.

[GBV19] Guillaume Gautier, Rémi Bardenet, and Michal Valko. DPPy: Sampling determinantal point processes with Python. Journal of Machine Learning Research - Machine Learning Open Source Software (JMLR-MLOSS), 2019.

[Gil14] Jennifer Ann Gillenwater. Approximate inference for determinantal point processes. PhD thesis, 2014.

[GKT12] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Discovering diverse and salient threads in document collections. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 710–720. Association for Computational Linguistics, 2012.

[GKVM18] Jennifer A. Gillenwater, Alex Kulesza, Sergei Vassilvitskii, and Zelda E. Mariet. Maximizing induced cardinality under a determinantal point process. In Advances in Neural Information Processing Systems 31, pages 6911–6920. Curran Associates, Inc., 2018.

[Gue83] A. Guenoche.
Random spanning tree. Journal of Algorithms, 4(3):214–220, 1983.

[H+56] Wassily Hoeffding et al. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, 27(3):713–721, 1956.

[HKP+06] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, Bálint Virág, et al. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.

[Kan13] Byungkon Kang. Fast determinantal point process sampling with application to clustering. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, pages 2319–2327, 2013.

[KT11] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200, 2011.

[KT12] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., 2012.

[LCB07] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Large Scale Kernel Machines, pages 301–320. MIT Press, 2007.

[LGD18] Claire Launay, Bruno Galerne, and Agnès Desolneux. Exact sampling of determinantal point processes without eigendecomposition. arXiv:1802.08429, 2018.

[LJS16a] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Efficient sampling for k-determinantal point processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 1328–1337. PMLR, 2016.

[LJS16b] Chengtao Li, Stefanie Jegelka, and Suvrit Sra.
Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 4195–4203. Curran Associates Inc., 2016.

[Mac75] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.

[MS16] Zelda E. Mariet and Suvrit Sra. Kronecker determinantal point processes. In Advances in Neural Information Processing Systems 29, pages 2694–2702. Curran Associates, Inc., 2016.

[MU49] Nicholas Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.

[Pou19] Jack Poulson. High-performance sampling of generic determinantal point processes. arXiv:1905.00165, 2019.

[PP14] Robin Pemantle and Yuval Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23(1):140–160, 2014.

[PW98] J. G. Propp and D. B. Wilson. How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph. Journal of Algorithms, 27(2):170–217, 1998.

[RCCR18] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems 31, pages 5672–5682, 2018.

[RK15] P. Rebeschini and A. Karbasi. Fast mixing for discrete point processes. In Conference on Learning Theory, pages 1480–1500, 2015.

[ZKM17] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Determinantal point processes for mini-batch diversification.
In 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017). AUAI Press, 2017.

[ZÖMS19] Cheng Zhang, Cengiz Öztireli, Stephan Mandt, and Giampiero Salvi. Active mini-batch sampling using repulsive point processes. In AAAI, 2019.