{"title": "Inference for determinantal point processes without spectral knowledge", "book": "Advances in Neural Information Processing Systems", "page_first": 3393, "page_last": 3401, "abstract": "Determinantal point processes (DPPs) are point process models thatnaturally encode diversity between the points of agiven realization, through a positive definite kernel $K$. DPPs possess desirable properties, such as exactsampling or analyticity of the moments, but learning the parameters ofkernel $K$ through likelihood-based inference is notstraightforward. First, the kernel that appears in thelikelihood is not $K$, but another kernel $L$ related to $K$ throughan often intractable spectral decomposition. This issue is typically bypassed in machine learning bydirectly parametrizing the kernel $L$, at the price of someinterpretability of the model parameters. We follow this approachhere. Second, the likelihood has an intractable normalizingconstant, which takes the form of large determinant in the case of aDPP over a finite set of objects, and the form of a Fredholm determinant in thecase of a DPP over a continuous domain. Our main contribution is to derive bounds on the likelihood ofa DPP, both for finite and continuous domains. Unlike previous work, our bounds arecheap to evaluate since they do not rely on approximating the spectrumof a large matrix or an operator. Through usual arguments, these bounds thus yield cheap variationalinference and moderately expensive exact Markov chain Monte Carlo inference methods for DPPs.", "full_text": "Inference for determinantal point processes\n\nwithout spectral knowledge\n\nR\u00b4emi Bardenet\u2217\nCNRS & CRIStAL\n\nUMR 9189, Univ. Lille, France\n\nremi.bardenet@gmail.com\n\nMichalis K. Titsias\u2217\n\nDepartment of Informatics\n\nAthens Univ. 
of Economics and Business, Greece\nmtitsias@aueb.gr\n\n∗Both authors contributed equally to this work.\n\nAbstract\n\nDeterminantal point processes (DPPs) are point process models that naturally encode diversity between the points of a given realization, through a positive definite kernel K. DPPs possess desirable properties, such as exact sampling or analyticity of the moments, but learning the parameters of kernel K through likelihood-based inference is not straightforward. First, the kernel that appears in the likelihood is not K, but another kernel L related to K through an often intractable spectral decomposition. This issue is typically bypassed in machine learning by directly parametrizing the kernel L, at the price of some interpretability of the model parameters. We follow this approach here. Second, the likelihood has an intractable normalizing constant, which takes the form of a large determinant in the case of a DPP over a finite set of objects, and the form of a Fredholm determinant in the case of a DPP over a continuous domain. Our main contribution is to derive bounds on the likelihood of a DPP, both for finite and continuous domains. Unlike previous work, our bounds are cheap to evaluate since they do not rely on approximating the spectrum of a large matrix or an operator. Through usual arguments, these bounds thus yield cheap variational inference and moderately expensive exact Markov chain Monte Carlo inference methods for DPPs.\n\n1 Introduction\n\nDeterminantal point processes (DPPs) are point processes [1] that encode repulsiveness using algebraic arguments. They first appeared in [2], and have since then received much attention, as they arise in many fields, e.g. random matrix theory, combinatorics, quantum physics. We refer the reader to [3, 4, 5] for detailed tutorial reviews, respectively aimed at audiences of machine learners, statisticians, and probabilists. 
More recently, DPPs have been considered as a modelling tool, see e.g. [4, 3, 6]: DPPs appear to be a natural alternative to Poisson processes when realizations should exhibit repulsiveness. In [3], for example, DPPs are used to model diversity among summary timelines in a large news corpus. In [7], DPPs model diversity among the results of a search engine for a given query. In [4], DPPs model the spatial repartition of trees in a forest, as similar trees compete for nutrients in the ground, and thus tend to grow away from each other. With these modelling applications comes the question of learning a DPP from data, either through a parametrized form [4, 7], or non-parametrically [8, 9]. We focus in this paper on parametric inference.\n\nSimilarly to the correlation between the function values in a Gaussian process (GP; [10]), the repulsiveness in a DPP is defined through a kernel K, which measures how much two points in a realization repel each other. The likelihood of a DPP involves the evaluation and the spectral decomposition of an operator 𝓛 defined through a kernel L that is related to K. There are two main issues that arise when performing likelihood-based inference for a DPP. First, the likelihood involves evaluating the kernel L, while it is more natural to parametrize K instead, and there is no easy link between the parameters of these two kernels. The second issue is that the spectral decomposition of the operator 𝓛 required in the likelihood evaluation is rarely available in practice, for computational or analytical reasons. For example, in the case of a large finite set of objects, as in the news corpus application [3], evaluating the likelihood once requires the eigendecomposition of a large matrix. Similarly, in the case of a continuous domain, as for the forest application [4], the spectral decomposition of the operator 𝓛 may not be analytically tractable for nontrivial choices of kernel L. 
In this paper, we focus on the second issue, i.e., we provide likelihood-based inference methods that assume the kernel L is parametrized, but that do not require any eigendecomposition, unlike [7]. More specifically, our main contribution is to provide bounds on the likelihood of a DPP that do not depend on the spectral decomposition of the operator 𝓛. For the finite case, we draw inspiration from bounds used for variational inference of GPs [11], and we extend these bounds to DPPs over continuous domains.\n\nFor ease of presentation, we first consider DPPs over finite sets of objects in Section 2, and we derive bounds on the likelihood. In Section 3, we plug these bounds into known inference paradigms: variational inference and Markov chain Monte Carlo inference. In Section 4, we extend our results to the case of a DPP over a continuous domain. Readers who are only interested in the finite case, or who are unfamiliar with operator theory, can safely skip Section 4 without missing our main points. In Section 5, we experimentally validate our results, before discussing their breadth in Section 6.\n\n2 DPPs over finite sets\n\n2.1 Definition and likelihood\n\nConsider a discrete set of items Y = {x1, . . . , xn}, where xi ∈ Rd is a vector of attributes that describes item i. Let K be a symmetric positive definite kernel [12] on Rd, and let K = ((K(xi, xj))) be the Gram matrix of K. The DPP of kernel K is defined as the probability distribution over all 2ⁿ possible subsets Y ⊆ Y such that\n\nP(A ⊂ Y ) = det(KA),    (1)\n\nwhere KA denotes the sub-matrix of K indexed by the elements of A. This distribution exists and is unique if and only if the eigenvalues of K are in [0, 1] [5]. Intuitively, we can think of K(x, y) as encoding the amount of negative correlation, or “repulsiveness”, between x and y. 
Indeed, as remarked in [3], (1) first yields that diagonal elements of K are marginal probabilities: P(xi ∈ Y ) = Kii. Equation (1) then entails that xi and xj are likely to co-occur in a realization of Y if and only if\n\ndet K{xi,xj} = K(xi, xi)K(xj, xj) − K(xi, xj)² = P(xi ∈ Y )P(xj ∈ Y ) − Kij²\n\nis large: off-diagonal terms in K indicate whether points tend to co-occur.\n\nProvided the eigenvalues of K are further restricted to be in [0, 1), the DPP of kernel K has a likelihood [1]. More specifically, writing Y1 for a realization of Y ,\n\nP(Y = Y1) = det(LY1) / det(L + I),    (2)\n\nwhere L = (I − K)⁻¹K, I is the n × n identity matrix, and LY1 denotes the sub-matrix of L indexed by the elements of Y1. Now, given a realization Y1, we would like to infer the parameters of kernel K, say the parameters θK = (aK, σK) ∈ (0, ∞)² of a squared exponential kernel [10]\n\nK(x, y) = aK exp(−‖x − y‖² / (2σK²)).    (3)\n\nSince the trace of K is the expected number of points in Y [5], one can estimate aK by the number of points in the data divided by n [4]. But σK, the repulsive “lengthscale”, has to be fitted. If the number of items n is large, likelihood-based methods such as maximum likelihood are too costly: each evaluation of (2) requires O(n²) storage and O(n³) time. Furthermore, valid choices of θK are constrained, since one needs to make sure the eigenvalues of K remain in [0, 1).\n\nA partial work-around is to note that given any symmetric positive definite kernel L, the likelihood (2) with matrix L = ((L(xi, xj))) corresponds to a valid choice of K, since the corresponding matrix K = L(I + L)⁻¹ necessarily has eigenvalues in [0, 1], which makes sure the DPP exists [5]. 
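Both facts – that K = L(I + L)⁻¹ has admissible eigenvalues, and that the likelihoods (2) are properly normalized – can be checked numerically on a small ground set, where det(L + I) is also the exhaustive sum of det(LY1) over all 2ⁿ subsets. A minimal NumPy sketch; the kernel parameters and ground set below are illustrative, not taken from the paper:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Small ground set and a squared exponential L-kernel (illustrative values).
n, a, sigma = 8, 1.0, 0.6
X = rng.uniform(0.0, 3.0, size=n)
L = a * np.exp(-np.subtract.outer(X, X) ** 2 / (2.0 * sigma ** 2))

# K = L(I + L)^{-1} always has eigenvalues in [0, 1), so the DPP exists.
K = L @ np.linalg.inv(np.eye(n) + L)
assert np.all(np.linalg.eigvalsh(K) < 1.0)

# The likelihoods (2) are normalized: the sum over all subsets S of det(L_S)
# equals det(L + I).  (NumPy returns det = 1 for the empty 0x0 matrix.)
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
total = sum(np.linalg.det(L[np.ix_(S, S)]) for S in subsets)
assert np.isclose(total, np.linalg.det(L + np.eye(n)))

# Marginal inclusion probability of item 0 matches the diagonal of K.
norm = np.linalg.det(L + np.eye(n))
marg0 = sum(np.linalg.det(L[np.ix_(S, S)]) for S in subsets if 0 in S) / norm
assert np.isclose(marg0, K[0, 0])
```

Exhaustive enumeration is of course only feasible for tiny n; it is used here purely as a correctness check of the identities above.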
The work-around consists in directly parametrizing and inferring the\nkernel L instead of K, so that the numerator of (2) is cheap to evaluate, and parameters\nare less constrained. Note that this step favours tractability over interpretability of the\ninferred parameters:\nif we assume L to take the squared exponential form (3) instead of\nK, with parameters aL and \u03c3L, the number of points and the repulsiveness of the points in\nY do not decouple as nicely. For example, the expected number of items in Y depends on\naL and \u03c3L now, and both parameters also signi\ufb01cantly a\ufb00ect repulsiveness. There is some\nwork investigating approximations to K to retain the more interpretable parametrization\n[4], but the machine learning literature [3, 7] almost exclusively adopts the more tractable\nparametrization of L. In this paper, we also make this choice of parametrizing L directly.\nNow, the computational bottleneck in the evaluation of (2) is computing det(L + I). While\nthis still prevents the application of maximum likelihood, bounds on this determinant can\nbe used in a variational approach or an MCMC algorithm. In [7], bounds on det(L + I)\nare proposed, requiring only the \ufb01rst m eigenvalues of L, where m is chosen adaptively at\neach MCMC iteration to make the acceptance decision possible. This still requires applying\npower iteration methods, which are limited to \ufb01nite domains, require storing the whole n\u00d7n\nmatrix L, and are prohibitively slow when the number of required eigenvalues m is large.\n\n2.2 Nonspectral bounds on the likelihood\nLet us denote by LAB the submatrix of L where row indices correspond to the elements of A,\nand column indices to those of B. When A = B, we simply write LA for LAA, and we drop\nthe subscript when A = Y. Drawing inspiration from sparse approximations to Gaussian\nprocesses using inducing variables [11], we let Z = {z1, . . . 
, zm} be an arbitrary set of points in Rd, and we approximate L by Q = LYZ[LZ]⁻¹LZY. Note that we do not constrain Z to belong to Y, so that our bounds do not rely on a Nyström-type approximation [13]. We term Z “pseudo-inputs”, or “inducing inputs”.\n\nProposition 1.\n\ne^{−tr(L−Q)} / det(Q + I) ≤ 1 / det(L + I) ≤ 1 / det(Q + I).    (4)\n\nThe proof relies on a nontrivial inequality on determinants [14, Theorem 1], and is provided in the supplementary material.\n\n3 Learning a DPP using bounds\n\nIn this section, we explain how to run variational inference and Markov chain Monte Carlo methods using the bounds in Proposition 1. We also make the connections with variational sparse Gaussian processes more explicit.\n\n3.1 Variational inference\n\nThe lower bound in Proposition 1 can be used for variational inference. Assume we have T point process realizations Y1, . . . , YT , and we fit a DPP with kernel L = Lθ. The log likelihood can be expressed using (2) as\n\nℓ(θ) = Σ_{t=1}^{T} log det(LYt) − T log det(L + I).    (7)\n\nLet Z be an arbitrary set of m points in Rd. Proposition 1 then yields a lower bound\n\nF(θ, Z) ≜ Σ_{t=1}^{T} log det(LYt) − T log det(Q + I) + T tr(L − Q) ≤ ℓ(θ).    (8)\n\nThe lower bound F(θ, Z) can be computed efficiently in O(nm²) time, which is considerably lower than a power iteration in O(n²) if m ≪ n. Instead of maximizing ℓ(θ), we thus maximize F(θ, Z) jointly w.r.t. the kernel parameters θ and the variational parameters Z. To maximize (8), one can e.g. implement an EM-like scheme, alternately optimizing in Z and θ. 
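The sandwich of Proposition 1, and the O(nm²) evaluation of det(Q + I) through the identity det(I + AB) = det(I + BA), can be illustrated numerically. The following sketch uses arbitrary one-dimensional inputs and unoptimized pseudo-inputs; all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: n items, m << n pseudo-inputs.
n, m, sigma = 300, 10, 0.8
X = rng.uniform(0.0, 10.0, size=n)
Zp = np.linspace(0.0, 10.0, m)          # pseudo-inputs, not restricted to X

k = lambda u, v: np.exp(-np.subtract.outer(u, v) ** 2 / (2.0 * sigma ** 2))
L, LZ, LXZ = k(X, X), k(Zp, Zp), k(X, Zp)

# log det(Q + I) and tr(Q) for Q = L_{YZ} L_Z^{-1} L_{ZY}, using only
# m x m linear algebra thanks to det(I + A B) = det(I + B A).
Psi_hat = LXZ.T @ LXZ                    # m x m
M = np.linalg.solve(LZ, Psi_hat)
log_det_QI = np.linalg.slogdet(np.eye(m) + M)[1]
tr_Q = np.trace(M)

# Proposition 1 in log form: bounds on -log det(L + I).
log_det_LI = np.linalg.slogdet(L + np.eye(n))[1]
upper = -log_det_QI
lower = -log_det_QI - (np.trace(L) - tr_Q)
assert lower <= -log_det_LI <= upper
```

Only the reference value log det(L + I) costs O(n³) here; the bounds themselves never touch an n × n determinant.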
Kernels are often differentiable with respect to θ, and sometimes F will also be differentiable with respect to the pseudo-inputs Z, so that gradient-based methods can help. In the general case, black-box optimizers such as CMA-ES [15] can also be employed.\n\n3.2 Markov chain Monte Carlo inference\n\nIf approximate inference is not suitable, we can use the bounds in Proposition 1 to build a more expensive Markov chain Monte Carlo [16] sampler. Given a prior distribution p(θ) on the parameters θ of L, Bayesian inference relies on the posterior distribution π(θ) ∝ exp(ℓ(θ))p(θ), where the log likelihood ℓ(θ) is defined in (7). A standard approach to sample approximately from π(θ) is the Metropolis-Hastings algorithm (MH; [16, Chapter 7.3]). MH consists in building an ergodic Markov chain of invariant distribution π(θ). Given a proposal q(θ′|θ), the MH algorithm starts its chain at a user-defined θ0, then at iteration k + 1 it proposes a candidate state θ′ ∼ q(·|θk) and sets θk+1 to θ′ with probability\n\nα(θk, θ′) = min[1, (e^{ℓ(θ′)} p(θ′) q(θk|θ′)) / (e^{ℓ(θk)} p(θk) q(θ′|θk))],\n\nwhile θk+1 is otherwise set to θk. The core of the algorithm is thus to draw a Bernoulli variable with parameter α = α(θ, θ′) for θ, θ′ ∈ Rd. This is typically implemented by drawing a uniform u ∼ U[0,1] and checking whether u < α. In our DPP application, we cannot evaluate α. But we can use Proposition 1 to build a lower and an upper bound ℓ(θ) ∈ [b−(θ, Z), b+(θ, Z)], which can be arbitrarily refined by increasing the cardinality of Z and optimizing over Z. 
We can thus build a lower and upper bound for log α:\n\nb−(θ′, Z′) − b+(θ, Z) + log[p(θ′)/p(θ)] ≤ log α ≤ b+(θ′, Z′) − b−(θ, Z) + log[p(θ′)/p(θ)].    (13)\n\nNow, another way to draw a Bernoulli variable with parameter α is to first draw u ∼ U[0,1], and then refine the bounds in (13), by augmenting the numbers |Z|, |Z′| of inducing variables and optimizing over Z, Z′, until¹ log u is out of the interval formed by the bounds in (13). Then one can decide whether u < α. This Bernoulli trick is sometimes named retrospective sampling and has been suggested as early as [17]. It has been used within MH for inference on DPPs with spectral bounds in [7]; we simply adapt it to our non-spectral bounds.\n\n4 The case of continuous DPPs\n\nDPPs can be defined over very general spaces [5]. We limit ourselves here to point processes on X ⊂ Rd such that one can extend the notion of likelihood. In particular, we define here a DPP on X as in [1, Example 5.4(c)], by defining its Janossy density. For definitions of traces and determinants of operators, we follow [18, Section VII].\n\n¹Note that this necessarily happens under fairly weak assumptions: saying that the upper and lower bounds in (4) match when m goes to infinity is saying that the integral of the posterior variance of a Gaussian process with no evaluation error goes to zero as we add more distinct training points.\n\n4.1 Definition\n\nLet μ be a measure on (Rd, B(Rd)) that is continuous with respect to the Lebesgue measure, with density μ0. Let L be a symmetric positive definite kernel. L defines a self-adjoint operator 𝓛 on L²(μ) through 𝓛(f) ≜ ∫ L(x, y)f(y) dμ(y). Assume 𝓛 is trace-class, and\n\ntr(𝓛) = ∫_X L(x, x) dμ(x).    (14)\n\nWe assume (14) to avoid technicalities. 
Proving (14) can be done by requiring various assumptions on L and μ. Under the assumptions of Mercer’s theorem, for instance, (14) will be satisfied [18, Section VII, Theorem 2.3]. More generally, the assumptions of [19, Theorem 2.12] apply to kernels over noncompact domains, in particular the Gaussian kernel with Gaussian base measure that is often used in practice. We denote by λi the eigenvalues of the compact operator 𝓛. There exists [1, Example 5.4(c)] a simple² point process on Rd such that\n\nP(there are n particles, one in each of the infinitesimal balls B(xi, dxi)) = (det((L(xi, xj))) / det(I + 𝓛)) μ0(x1) . . . μ0(xn),    (15)\n\nwhere B(x, r) is the open ball of center x and radius r, and where det(I + 𝓛) ≜ ∏_{i=1}^{∞} (λi + 1) is the Fredholm determinant of operator 𝓛 [18, Section VII]. Such a process is called the determinantal point process associated to kernel L and base measure μ.³ Equation (15) is the continuous equivalent of (2). Our bounds require Ψ to be computable. This is the case for the popular Gaussian kernel with Gaussian base measure.\n\n4.2 Nonspectral bounds on the likelihood\n\nIn this section, we derive bounds on the likelihood (15) that do not require computing the Fredholm determinant det(I + 𝓛).\n\nProposition 2. Let Z = {z1, . . . , zm} ⊂ Rd, then\n\n(det LZ / det(LZ + Ψ)) e^{−∫ L(x,x) dμ(x) + tr(LZ⁻¹Ψ)} ≤ 1 / det(I + 𝓛) ≤ det LZ / det(LZ + Ψ),    (16)\n\nwhere LZ = ((L(zi, zj))) and Ψij = ∫ L(zi, x)L(x, zj) dμ(x).\n\nAs for Proposition 1, the proof relies on a nontrivial inequality on determinants [14, Theorem 1] and is provided in the supplementary material. 
We also detail in the supplementary material why (16) is the continuous equivalent to (4).\n\n5 Experiments\n\n5.1 A toy Gaussian continuous experiment\n\nIn this section, we consider a DPP on R, so that the bounds derived in Section 4 apply. As in [7, Section 5.1], we take the base measure to be proportional to a Gaussian, i.e. its density is μ0(x) = κN(x|0, (2α)⁻²). We consider a squared exponential kernel L(x, y) = exp(−ε²‖x − y‖²). In this particular case, the spectral decomposition of operator 𝓛 is known [20]⁴: the eigenfunctions of 𝓛 are scaled Hermite polynomials, while the eigenvalues are a geometrically decreasing sequence. This 1D Gaussian-Gaussian example is interesting for two reasons: first, the spectral decomposition of 𝓛 is known, so that we can sample exactly from the corresponding DPP [5] and thus generate synthetic datasets. Second, the Fredholm determinant det(I + 𝓛) in this special case is a q-Pochhammer symbol, and can thus be efficiently computed, see e.g. the SymPy library in Python. This allows for comparison with “ideal” likelihood-based methods, to check the validity of our MCMC sampler, for instance. We emphasize that these special properties are not needed for the inference methods in Section 3; they are simply useful to demonstrate their correctness.\n\nWe sample a synthetic dataset using (κ, α, ε) = (1000, 0.5, 1), resulting in 13 points shown in red in Figure 1(a).\n\n²i.e., for which all points in a realization are distinct.\n³There is a notion of kernel K for general DPPs [5], but we define L directly here, for the sake of simplicity. The interpretability issues of using L instead of K are the same as for the finite case, see Sections 2 and 5.\n⁴We follow the parametrization of [20] for ease of reference. 
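The q-Pochhammer evaluation of the Fredholm determinant mentioned above can be sketched as follows. We only use the geometric form of the spectrum, λk = λ0 qᵏ for k ≥ 0; the constants λ0 and q below are illustrative, not the ones induced by (κ, α, ε) in [20]. The infinite product ∏k (1 + λ0 qᵏ) is the q-Pochhammer symbol (−λ0; q)∞, available through mpmath (on which SymPy relies):

```python
from mpmath import mp, qp

mp.dps = 30

# Illustrative geometric spectrum lambda_k = lam0 * q**k (not the constants
# of Section 5.1); det(I + L) = prod_k (1 + lam0 q^k) = (-lam0; q)_infinity.
lam0, q = 0.8, 0.5
fredholm = qp(-lam0, q)

# Sanity check against a truncated product.
partial = 1.0
for k in range(200):
    partial *= 1.0 + lam0 * q ** k
assert abs(float(fredholm) - partial) < 1e-9
```

The point of the closed form is speed: the ideal MCMC runs below need one Fredholm determinant per likelihood evaluation, and the q-Pochhammer symbol makes that evaluation essentially free.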
Applying the variational inference method of Section 3.1, jointly optimizing in Z and θ = (κ, α, ε) using the CMA-ES optimizer [15], yields poorly consistent results: κ varies over several orders of magnitude from one run to the other, and relative errors for α and ε go up to 100% (not shown). We thus investigate the identifiability of the parameters with the retrospective MH of Section 3.2. To limit the range of κ, we choose for (log κ, log α, log ε) a wide uniform prior over\n\n[200, 2000] × [−10, 10] × [−10, 10].\n\nWe use a Gaussian proposal, the covariance matrix of which is adapted on-the-fly [21] so as to reach a 25% acceptance rate. We start each iteration with m = 20 pseudo-inputs, and increase it by 10 and re-optimize when the acceptance decision cannot be made. Most iterations could be made with m = 20, and the maximum number of inducing inputs required in our run was 80. We show the results of a run of length 10 000 in Figure 1. Removing a burn-in sample of size 1 000, we show the resulting marginal histograms in Figures 1(b), 1(c), and 1(d). Retrospective MH and the ideal MH agree. The prior pdf is in green. The posterior marginals of α and ε are centered around the values used for simulation, and are very different from the prior, showing that the likelihood contains information about α and ε. However, as expected, almost nothing is learnt about κ, as posterior and prior roughly coincide. This is an example of the issues that come with parametrizing L directly, as mentioned in Section 1. It is also an example where MCMC is preferable to the variational approach in Section 3. Note that this can be detected through the variability of the results of the variational approach across independent runs.\n\nTo conclude, we show a set of optimized pseudo-inputs Z in black in Figure 1(a). 
We also superimpose the marginal of any single point in the realization, which is available through the spectral decomposition of 𝓛 here [5]. In this particular case, this marginal is a Gaussian. Interestingly, the pseudo-inputs look like evenly spread samples from this marginal. Intuitively, they are likely to make the denominator in the likelihood (15) small, as they represent an ideal sample of the Gaussian-Gaussian DPP.\n\n5.2 Diabetic neuropathy dataset\n\nHere, we consider a real dataset of spatial patterns of nerve fibers in diabetic patients. These fibers become more clustered as diabetes progresses [22]. The dataset consists of 7 samples collected from diabetic patients at different stages of diabetic neuropathy and one healthy subject. We follow the experimental setup used in [7] and we split the total samples into two classes: Normal/Mildly Diabetic and Moderately/Severely Diabetic. The first class contains three samples and the second one the remaining four. Figure 2 displays the point process data, which contain on average 90 points per sample in the Normal/Mildly class and 67 in the Moderately/Severely class. We investigate the differences between these classes by fitting a separate DPP to each class and then quantifying the differences in the repulsion or overdispersion of the point process data through the inferred kernel parameters. Paraphrasing [7], we consider a continuous DPP on R², with kernel function\n\nL(xi, xj) = exp(−Σ_{d=1}^{2} (xi,d − xj,d)² / (2σd²)),    (17)\n\nand base measure proportional to a Gaussian, μ0(x) = κ ∏_{d=1}^{2} N(xd|μd, ρd²). As in [7], we quantify the overdispersion of realizations of such a Gaussian-Gaussian DPP through the quantities γd = σd/ρd, which are invariant to the scaling of x. Note however that, strictly speaking, κ also mildly influences repulsion.\n\nWe investigate the ability of the variational method in Section 3.1 to perform approximate maximum likelihood training over the kernel parameters θ = (κ, σ1, σ2, ρ1, ρ2). Specifically, we wish to fit a separate continuous DPP to each class by jointly maximizing the variational lower bound over θ and the inducing inputs Z using gradient-based optimization.\n\nFigure 1: Results of adaptive Metropolis-Hastings in the 1D continuous experiment of Section 5.1. Figure 1(a) shows data in red, a set of optimized pseudo-inputs in black for θ set to the value used in the generation of the synthetic dataset, and the marginal of one point in the realization in blue. Figures 1(b), 1(c), and 1(d) show marginal histograms of κ, α, ε.\n\nGiven that the number of inducing variables determines the amount of approximation, or compression, of the DPP model, we examine different settings for this number and see whether the corresponding trained models provide similar estimates for the overdispersion measures. Thus, we train the DPPs under different approximations having m ∈ {50, 100, 200, 400, 800, 1200} inducing variables and display the estimated overdispersion measures in Figures 3(a) and 3(b). These estimated measures converge to coherent values as m increases. They show a clear separation between the two classes, as also found in [7, 22]. This also suggests tuning m in practice by increasing it until inference results stop varying. 
Furthermore, Figures 3(c) and 3(d) show the values of the upper and lower bounds on the log likelihood, which, as expected, converge to the same limit as m increases. We point out that the overall optimization of the variational lower bound is relatively fast in our MATLAB implementation. For instance, it takes 24 minutes for the most expensive run where m = 1200 to perform 1 000 iterations until convergence. Smaller values of m yield significantly smaller times.\n\nFinally, as in Section 5.1, we comment on the optimized pseudo-inputs. Figure 4 displays the inducing points at the end of a converged run of variational inference for various values of m. Similarly to Figure 1(a), these pseudo-inputs are placed in remarkably neat grids and depart significantly from their initial locations.\n\n6 Discussion\n\nWe have proposed novel, cheap-to-evaluate, nonspectral bounds on the determinants arising in the likelihoods of DPPs, both finite and continuous. We have shown how to use these bounds to infer the parameters of a DPP, and demonstrated their use for expensive-but-exact MCMC and cheap-but-approximate variational inference. In particular, these bounds have some degree of freedom – the pseudo-inputs – which we optimize so as to tighten the bounds. This optimization step is crucial for likelihood-based inference of parametric DPP models, where bounds have to adapt to the point where the likelihood is evaluated to yield decisions which are consistent with the ideal underlying algorithms. In future work, we plan to investigate connections of our bounds with the quadrature-based bounds for Fredholm determinants of [23]. We also plan to consider variants of DPPs that condition on the number of points in the realization, to put joint priors over the within-class distributions of the features in classification problems, in a manner related to [6]. In the long term, we will investigate connections between kernels L and K that could be made without spectral knowledge, to address the issue of replacing L by K.\n\nFigure 2: Six out of the seven nerve fiber samples. The first three samples (from left to right) correspond to a Normal Subject and two Mildly Diabetic Subjects, respectively. The remaining three samples correspond to a Moderately Diabetic Subject and two Severely Diabetic Subjects.\n\nFigure 3: Figures 3(a) and 3(b) show the evolution of the estimated overdispersion measures γ1 and γ2 as functions of the number of inducing variables used. The dotted black lines correspond to the Normal/Mildly Diabetic class while the solid lines to the Moderately/Severely Diabetic class. Figure 3(c) shows the upper bound (red) and the lower bound (blue) on the log likelihood as functions of the number of inducing variables for the Normal/Mildly Diabetic class, while the Moderately/Severely Diabetic case is shown in Figure 3(d).\n\nFigure 4: We illustrate the optimization over the inducing inputs Z for several values of m ∈ {50, 100, 200, 400, 800, 1200} in the DPP of Section 5.2. We consider the Normal/Mildly diabetic class. The panels in the top row show the initial inducing input locations for various values of m, while the corresponding panels in the bottom row show the optimized locations.\n\nAcknowledgments\n\nWe would like to thank Adrien Hardy for useful discussions and Emily Fox for kindly providing access to the diabetic neuropathy dataset. 
RB was funded by a research fellowship through the 2020 Science programme, funded by EPSRC grant number EP/I017909/1, and by ANR project ANR-13-BS-03-0006-01.\n\nReferences\n\n[1] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, 2nd edition, 2003.\n[2] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975.\n[3] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 2012.\n[4] F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Journal of the Royal Statistical Society, 2014.\n[5] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 2006.\n[6] J. Y. Zou and R. P. Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2012.\n[7] R. H. Affandi, E. B. Fox, R. P. Adams, and B. Taskar. Learning the parameters of determinantal point processes. In Proceedings of the International Conference on Machine Learning (ICML), 2014.\n[8] J. Gillenwater, A. Kulesza, E. B. Fox, and B. Taskar. Expectation-maximization for learning determinantal point processes. In Advances in Neural Information Processing Systems (NIPS), 2014.\n[9] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In Advances in Neural Information Processing Systems (NIPS), 2015.\n[10] C. E. Rasmussen and C. K. I. Williams. 
Gaussian Processes for Machine Learning. MIT Press, 2006.\n[11] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the conference on Artificial Intelligence and Statistics (AISTATS), 2009.\n[12] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n[13] R. H. Affandi, A. Kulesza, E. B. Fox, and B. Taskar. Nyström approximation for large-scale determinantal processes. In Proceedings of the conference on Artificial Intelligence and Statistics (AISTATS), 2013.\n[14] E. Seiler and B. Simon. An inequality among determinants. Proceedings of the National Academy of Sciences, 1975.\n[15] N. Hansen. The CMA evolution strategy: a comparing review. In Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms. Springer, 2006.\n[16] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2004.\n[17] L. Devroye. Non-Uniform Random Variate Generation. Springer-Verlag, 1986.\n[18] I. Gohberg, S. Goldberg, and M. A. Kaashoek. Classes of Linear Operators, Volume I. Springer, 1990.\n[19] B. Simon. Trace Ideals and Their Applications. American Mathematical Society, 2nd edition, 2005.\n[20] G. E. Fasshauer and M. J. McCourt. Stable evaluation of Gaussian radial basis function interpolants. SIAM Journal on Scientific Computing, 34(2), 2012.\n[21] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7:223–242, 2001.\n[22] L. A. Waller, A. Särkkä, V. Olsbo, M. Myllymäki, I. G. Panoutsopoulou, W. R. Kennedy, and G. Wendelschafer-Crabb. Second-order spatial analysis of epidermal nerve fibers. Statistics in Medicine, 30(23):2827–2841, 2011.\n[23] F. Bornemann. On the numerical evaluation of Fredholm determinants. 
Mathematics of Computation, 79(270):871–915, 2010.", "award": [], "sourceid": 1872, "authors": [{"given_name": "Rémi", "family_name": "Bardenet", "institution": "University of Lille"}, {"given_name": "Michalis", "family_name": "Titsias RC AUEB", "institution": "Athens University of Economics and Business"}]}