{"title": "DppNet: Approximating Determinantal Point Processes with Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3223, "page_last": 3234, "abstract": "Determinantal point processes (DPPs) provide an elegant and versatile way to sample sets of items that balance the point-wise quality with the set-wise diversity of selected items. For this reason, they have gained prominence in many machine learning applications that rely on subset selection. However, sampling from a DPP over a ground set of size N is a costly operation, requiring in general an O(N^3) preprocessing cost and an O(Nk^3) sampling cost for subsets of size k. We approach this problem by introducing DppNets: generative deep models that produce DPP-like samples for arbitrary ground sets. We develop an inhibitive attention mechanism based on transformer networks that captures a notion of dissimilarity between feature vectors. We show theoretically that such an approximation is sensible as it maintains the guarantees of inhibition or dissimilarity that makes DPPs so powerful and unique. Empirically, we show across multiple datasets that DPPNET is orders of magnitude faster than competing approaches for DPP sampling, while generating high-likelihood samples and performing as well as DPPs on downstream tasks.", "full_text": "DPPNET: Approximating Determinantal Point\n\nProcesses with Deep Networks\n\nZelda Mariet \u2217\n\nMassachusetts Institute of Technology\nCambridge, Massachusetts 02139, USA\n\nzelda@csail.mit.edu\n\nYaniv Ovadia & Jasper Snoek\n\nGoogle Brain\n\nCambridge, Massachusetts 02139, USA\n{yovadia, jsnoek}@google.com\n\nAbstract\n\nDeterminantal point processes (DPPs) provide an elegant and versatile way to\nsample sets of items that balance the quality with the diversity of selected items. For\nthis reason, they have gained prominence in many machine learning applications\nthat rely on subset selection. 
However, sampling from a DPP over a ground set\nof size N is a costly operation, requiring in general an O(N 3) preprocessing cost\nand an O(N k3) sampling cost for subsets of size k. We approach this problem\nby introducing DPPNETs: generative deep models that produce DPP-like samples\nfor arbitrary ground sets. We develop an inhibitive attention mechanism based\non transformer networks that captures a notion of dissimilarity between feature\nvectors. We show theoretically that such an approximation is sensible as it maintains\nthe guarantees of inhibition or dissimilarity that makes DPPs so powerful and\nunique. Empirically, we show across multiple datasets that DPPNET is orders of\nmagnitude faster than competing approaches for DPP sampling, while generating\nhigh-likelihood samples and performing as well as DPPs on downstream tasks.\n\nIntroduction\n\n1\nSelecting a representative sample of data from a large pool of available candidates is an essential step\nof a large class of machine learning problems: noteworthy examples include automatic summarization,\nmatrix approximation, and minibatch selection. Such problems require sampling schemes that\ncalibrate the tradeoff between the point-wise quality \u2013 e.g. the relevance of a sentence to a summary \u2013\nof selected elements and the set-wise diversity2 of the sampled set.\nSubmodular set functions and their log-submodular counterparts (functions f such that log f is\nsubmodular) have arisen as a theoretically grounded model for such diversity modeling problems, with\napplications to settings such as sensor placement [27], summarization [33], and optimal experimental\ndesign [44]. Submodular functions over a ground set [N ] := {1, . . . 
, N} are functions f : 2^[N] → R that satisfy the inequality

f(S) + f(T) ≥ f(S ∩ T) + f(S ∪ T)   for S, T ⊆ [N].

Among log-submodular measures, determinantal point processes (DPPs) have proven to be of particular interest to the machine learning community, due to their ability to elegantly model the tradeoff between quality and diversity. Given a ground set of size N, DPPs allow for O(N^3) sampling over all 2^N possible subsets of elements, assigning to any subset S of elements the probability

P_L(S) = det(L_S) / det(I + L),   (1)

*Work done while at Google Brain.
2 Here, we use diversity to mean useful coverage across dissimilar examples in a meaningful feature space, rather than other definitions of diversity that may appear in the ML fairness literature.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Geometric intuition for DPPs: let φ_i, φ_j be two feature vectors of Φ such that the kernel verifies L = ΦΦ^T; then P_L({i, j}) ∝ Vol(φ_i, φ_j)^2. Increasing the norm of a vector (quality) or increasing the angle between the vectors (diversity) increases the spanned volume [28].

where L ∈ R^{N×N} is the DPP kernel and L_S = [L_ij]_{i,j∈S} denotes the principal submatrix of L indexed by items in S (we adopt here the L-Ensemble construction [7] of a DPP). Intuitively, DPPs measure the volume spanned by the feature embedding of the items in feature space (Figure 1).

First introduced by Macchi [35] to model the distribution of possible states of fermions obeying the Pauli exclusion principle, the properties of DPPs have since then been studied in depth, e.g., [24, 6].
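As a concrete illustration of Eq. 1 and the volume intuition of Figure 1, the following minimal numpy sketch (our illustration, not code from the paper) computes P_L(S) for a Gram kernel L = ΦΦ^T and checks that a pair of orthogonal (diverse) features is more probable than a nearly collinear pair:

```python
import numpy as np

def dpp_probability(L, S):
    """P_L(S) = det(L_S) / det(I + L), the L-ensemble probability of Eq. 1."""
    L = np.asarray(L, dtype=float)
    S = list(S)
    num = np.linalg.det(L[np.ix_(S, S)]) if S else 1.0  # det of the empty submatrix is 1
    return num / np.linalg.det(np.eye(len(L)) + L)

def gram(phi):
    """Kernel L = Phi Phi^T built from row feature vectors."""
    return phi @ phi.T

# Two unit-norm features: widening the angle between them (more diversity)
# increases the spanned volume, hence the probability of the pair {0, 1}.
phi_similar = np.array([[1.0, 0.0], [0.99, np.sqrt(1 - 0.99 ** 2)]])
phi_diverse = np.array([[1.0, 0.0], [0.0, 1.0]])
p_sim = dpp_probability(gram(phi_similar), [0, 1])
p_div = dpp_probability(gram(phi_diverse), [0, 1])
assert p_div > p_sim  # orthogonal features span more volume
```

For the orthogonal pair, L is the identity and P_L({0, 1}) = 1/det(2I) = 1/4 exactly.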
As DPPs capture repulsive forces between similar elements, they arise in many natural\nprocesses, such as the distribution of non-intersecting random walks [25], spectra of random matrix\nensembles [41, 17], and zero-crossings of polynomials with Gaussian coef\ufb01cients [23]. More recently,\nDPPs have become a prominent tool in machine learning due to their elegance and tractability over\nsmall datasets: recent applications include video recommendations [9], minibatch selection [51],\nkernel approximation [31, 38], and neural network pruning [36]; continuous DPPs have also been\nconnected to active learning [22].\nHowever, O(N 3) sampling makes DPPs intractable for large datasets. This has led to the development\nof alternate approaches such as subsampling from {1, . . . , N}, structured kernels [15, 37], tree-\nbased samplers [16] and approximate sampling [2, 30, 1]. While faster than the standard approach,\nthese methods require signi\ufb01cant pre-processing time or cannot be parallelized, and still scale\npoorly with the size of the dataset. Furthermore, when dealing with ground sets with variable\ncomponents, pre-processing costs cannot be amortized, impeding the application of DPPs in practice.\nRecently, Derezi\u00b4nski et al. [12] showed that for exact sampling, the preprocessing cost can be reduced\nto O(Npoly(k)), where k is the size of the sampled set.\nThese setbacks motivate us to investigate more scalable and \ufb02exible models to generate high-quality,\ndiverse samples from datasets. We introduce generative deep models to approximate the DPP\ndistribution over a ground set of items with both \ufb01xed and variable feature representations. We\nshow that a carefully constructed neural network, DPPNET, can generate DPP-like samples with\nlittle overhead, orders of magnitude faster than all competing approaches. 
We further motivate our approach by proving that neural networks are theoretically able to inherit the log-submodularity properties of their target functions. Finally, we show that DPPNETs can trivially approximate conditional DPP samples and greedy mode finding.

2 Related work

Although the greedy maximization of submodular and log-submodular set functions is possible with provable guarantees under a variety of constraints [27], sampling and evaluating submodular functions is not necessarily computationally feasible. Indeed, approximating submodular functions has been studied in discrete optimization and game theory [4, 13]; approximate sampling for log-submodular functions has also been considered by Gotovos et al. [19] via MCMC sampling schemes.

In the general case, sampling exactly from a DPP over a discrete set of N items requires an initial eigendecomposition of the kernel matrix L, incurring an O(N^3) cost. In order to avoid this time-consuming step, several approximate sampling methods have been derived; Affandi et al. [1] approximate the DPP kernel during sampling; more recently, results by Anari et al. [2] followed by Li et al. [30] showed that DPPs are amenable to efficient MCMC-based sampling methods.

Exact methods that significantly speed up sampling by leveraging specific structure in the DPP kernel have also been developed [37, 15, 43, 39]. Of particular interest is the dual sampling method introduced in Kulesza and Taskar [28]: if the DPP kernel can be composed as an inner product over a finite basis, i.e. there exists a feature matrix Φ ∈ R^{N×D} such that the DPP kernel is given by L = ΦΦ^⊤, exact sampling can be done in O(ND^2 + NDk^2 + D^2k^3).

However, MCMC sampling requires a variable number of sampling rounds, which is unfavorable for parallelization; dual DPP sampling requires an explicit feature matrix Φ.
Motivated by recent work on modeling set functions with neural networks [50, 10], we propose to generate approximate samples via a generative network; this allows for simple parallelization while simultaneously benefiting from recent improvements in specialized architectures for neural network models (e.g. parallelized matrix multiplications). We furthermore show that, extending the abilities of dual DPP sampling, neural networks may take as input variable feature matrices Φ and sample from non-linear kernels L.

3 Generating DPP samples with deep models

In this section, we describe our approach to generating approximate DPP samples using a generative neural network; by doing so, we avoid the O(N^3) computational cost of DPP sampling, generating samples orders of magnitude faster than competing approaches.

Our goal is to generate samples from a ground set {1, . . . , N} where each item i is represented by a feature φ_i ∈ R^d. Although in select cases we may know the related feature matrix Φ a priori, in many situations Φ will evolve over time. For example, this is the case when Φ represents a pool of products that are available for sale at a given time, or social media posts whose relevance varies based on context. For this reason, Φ is considered to be an input to our model. Figure 2 presents the architecture of our model.

[Figure 2 diagram: inputs S ⊆ Y and Φ ∈ R^{N×d}; a query Q ∈ R^{k×d} built from S yields the attention vector a = ⊙(1 − softmax(QΦ^⊤/√d)) ∈ R^N; the reweighted features (a^⊤ ⊗ 1_N)Φ feed the block of feed-forward connections, which outputs the probabilities; the loop is sampled k times.]

Figure 2: DPPNET takes as input a set S (one-hot encoding) and a representation Φ ∈ R^{N×d} of the ground set, and outputs a probability vector of length N representing the respective probabilities of adding any item i to S. When initialized with the empty set and repeated k times, this process generates a sample of size k.
When the feature representation Φ does not evolve over time, DPPNET consists only of the block of feed-forward connections and ignores Φ as an input.

3.1 Motivating the model choice

In addition to elegantly capturing the quality/diversity tradeoffs that arise in subset sampling tasks within machine learning, DPPs also enjoy a variety of properties which make them particularly amenable to modeling via neural networks. Conversely, we may want to preserve some of the properties of the standard DPP sampling algorithm. In this section, we show how such properties of DPPs are either preserved by DPPNET or can be efficiently incorporated into it.

Simple computation of marginal probabilities. Given a set Y sampled by a DPP with kernel L and S ⊆ Y, the probability Pr({i} ∪ S ⊆ Y | S ⊆ Y) of also sampling i ∉ S has the closed form

Pr({i} ∪ S ⊆ Y | S ⊆ Y) = 1 − [(L + I_{[N]\S})^{−1}]_{ii}.   (2)

Although Eq. 2 requires an expensive matrix inversion to compute, such costs can be offset during off-line training. These probabilities act as the vector of probabilities that DPPNET seeks to output. This signal has the advantage of providing information during every step of training, compared to other downstream possibilities (e.g., the negative log-likelihood of generated sets). In practice, we found that the L1 loss led to the best performance; hence, DPPNET is trained to minimize the L1 distance from its output vector to the corresponding normalized probabilities of adding an item.

Sequential sampling. The standard DPP sampling algorithm [28] generates samples sequentially, adding items one after the other until reaching a pre-determined size.3 We take advantage of this by recording all intermediary subsets generated by the DPP when sampling training data. In practice, instead of training on n subsets of size k, we train on kn subsets of size 0, . . . , k − 1.

The sequential form of exact sampling is also amenable to simple modifications that yield greedy sampling algorithms [8]. For this reason, our architecture also implements sequential sampling (Alg. 1), yielding a straightforward greedy estimation without further overhead.

Algorithm 1 Sampling and greedy mode for DPPNET
Input: Initial set S, target size k, feature matrix Φ
while |S| < k do
    v ← DPPNET(S, Φ)
    if sampling then
        i ∼ Multinomial(v/‖v‖)
    else if greedy mode then
        i ← argmax v
    S ← S ∪ {i}
return S

Closure under conditioning. DPPs are closed under conditioning: given A ⊆ Y, the conditional distribution over Y \ A given by P_L(S = A ∪ B | A ⊆ S) for B ∩ A = ∅ is a DPP with kernel L^A = ([(L + I_Ā)^{−1}]_Ā)^{−1} − I (see [7]). This property makes DPPs well-suited to applications requiring diversity in conditioned sets, e.g. basket completion tasks.4

Standard deep generative models such as (Variational) Auto-Encoders [26] (VAEs) and Generative Adversarial Networks [18] would not enable conditioning operations during sampling, as such operations would have to take place over the model's latent space. With the DPPNET architecture, we can sample a set via Alg. 1, which allows for trivial basket-completion type conditioning operations.

3.2 The inhibitive attention mechanism: sampling over variable feature matrices

In simple settings, we wish to draw samples over a ground set with fixed features Φ. In this case, DPPNET's knowledge of Φ can be obtained during training, and so DPPNET is a feed-forward network taking a partially sampled set S as input. However, often the feature representation Φ will vary across time or contexts.
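The marginal probabilities of Eq. 2 and the sequential loop of Alg. 1 can be sketched in numpy as follows (an illustrative reimplementation, not the paper's code; the exact DPP marginals stand in for the trained network's output vector v):

```python
import numpy as np

def marginal_probs(L, S):
    """Eq. 2: for i not in S, Pr(add i | S sampled) = 1 - [(L + I_{[N] minus S})^{-1}]_ii."""
    N = len(L)
    mask = np.ones(N, dtype=bool)
    mask[list(S)] = False   # identity restricted to indices outside S
    return 1.0 - np.diag(np.linalg.inv(L + np.diag(mask.astype(float))))

def greedy_sample(L, k):
    """Greedy mode finding in the spirit of Alg. 1, using exact DPP marginals
    in place of the network's output vector v."""
    S = []
    for _ in range(k):
        v = marginal_probs(L, S)
        v[S] = -np.inf      # Eq. 2 is only meaningful for items outside S
        S.append(int(np.argmax(v)))
    return S

# Sanity check: for L = I, items are independent with inclusion probability 1/2.
assert np.allclose(marginal_probs(np.eye(2), []), [0.5, 0.5])
```

On a three-item ground set with features (1, 0), (0, 1) and (1, 1)/√2, the greedy loop picks the two orthogonal items, as the repulsion intuition suggests.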
In such cases, DPPNET also takes as input the feature matrix Φ.

We confirmed that naively adding Φ as input to a stack of feed-forward connections requires deeper networks and larger layers, increasing learning and sampling time. Instead, we draw inspiration from the dot-product attention of [47]. Intuitively, attention is a vector computed by the network that indicates relevant parts of the inputs. For DPPNET, this attention reweights Φ based on items in S.

In [47], the attention mechanism takes three matrices as input, which can be viewed as (1) the keys K, (2) the values V, and (3) the query Q. The attention matrix A := softmax(QK^⊤/√d) reweights the values V, where d is the dimension of each query/key and the softmax is computed across each row. The inner product5 acts as a proxy for the similarity between queries and keys.

For DPPNET, the submatrix of Φ given by Φ_{S,:} ∈ R^{|S|×d} and corresponding to the items in the input set S acts as the query; the representation Φ ∈ R^{N×d} of the ground set is both the keys and the values.
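For reference, the scaled dot-product attention of [47] described above can be sketched in a few lines of numpy (illustrative only; in DPPNET the query is Φ_{S,:} and Φ serves as both the keys and the values):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """A = softmax(Q K^T / sqrt(d)), with the softmax taken across each row;
    returns the reweighted values A V along with A itself."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)   # for numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))        # |S| = 3 queries of dimension d = 4
K = V = rng.normal(size=(5, 4))    # ground set of N = 5 items
out, A = dot_product_attention(Q, K, V)
assert np.allclose(A.sum(axis=1), 1.0) and out.shape == (3, 4)
```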
In order for the attention mechanism to make sense in the framework of DPP modeling, we make two modifications to the attention in [47]:

• DPPNET should attend to items that are dissimilar to those in the input subset S: for i ∈ S, we compute its pairwise dissimilarity to all items in Y as the vector d_i = 1 − softmax(Φ_{i,:}Φ^⊤/√d).

• Instead of returning the k × N matrix of dissimilarities, we return a vector a ∈ R^N in the probability simplex such that a_j ∝ ∏_{i∈S} d_{ij}. This yields a fixed-size input to the network; this forces the similarity of any item j to a single pre-sampled item i to disqualify j from being sampled.

With these modifications, our attention vector a is computed via the inhibitive attention mechanism

a′ = ⊙_{i∈S} (1 − softmax(Φ_{i,:}Φ^⊤/√d)),   a = a′/‖a′‖₁,   (3)

where ⊙ represents the row-wise multiplication; a can be computed in O(kDN) time. The attention a is finally multiplied element-wise with each row of Φ; the resulting reweighted feature matrix is the input to the feed-forward component of DPPNET.

3 The expected sampled set size under a DPP depends on the eigenspectrum of the kernel L.
4 Such tasks require the model to output a set of likely items given a pre-selected choice of items, for example when recommending items to customers that have already chosen certain items to purchase.
5 This inner product could be replaced by the kernel function that defines the true DPP for DPPNET.

Remark 1. An efficient ("dual") DPP sampling algorithm for kernels of the form L = ΦΦ^⊤ was introduced in [28].
However, this algorithm requires knowledge of such a low-rank decomposition. For non-linear kernels, a low-rank decomposition of L(Φ) must first be obtained, requiring O(N^3) time. In comparison, the dynamic DPPNET models DPPs with kernels that depend arbitrarily on Φ, including kernels with kernel functions too costly to be computed on-demand.

3.3 Sampling over ground sets of varying size

When the ground set size N is expected to vary little over time (e.g., recommender systems where available items are added/removed over time in small numbers), we can modify the architecture of Fig. 2 by slightly overshooting the number of rows N′ of the feature matrix Φ so as to guarantee N ≤ N′. By setting the additional N′ − N rows of Φ to 0, as well as the N′ − N coefficients of the output probability vector, we maintain DPPNET properties and allow variations in ground set size.

When N has high variance, the inhibitive attention mechanism can be modified to accommodate subsampling: a subset T ⊆ [N] of pre-defined size is selected by sampling |T| items independently from the distribution parametrized by the attention vector a. The corresponding fixed-size feature matrix is reweighted by the attention, then fed to the learnable feed-forward network. Note that this approach can be combined with or replaced by other subsampling schemes for DPPs, e.g., [11].

3.4 Preserving log-submodularity

A fundamental property of DPPs is their log-submodularity. Indeed, log-submodularity is one of the few key properties responsible for DPPs' preference for diverse subsets [5]. This section presents a surprising result: under certain conditions, the (log-)submodularity of a distribution P can be inherited by a generative model trained to approximate P. In particular, DPPNET may under the right conditions generate samples from a log-submodular distribution.
The proof of Theorem 1 can be found in Appendix A.

Theorem 1. Let f : 2^Y → R be a strictly submodular function over subsets of a ground set Y, and let g be a function over the same space such that

‖f − g‖_∞ ≤ min_{S ≠ T, S,T ∉ {∅, Y}} (1/4) [f(S) + f(T) − f(S ∪ T) − f(S ∩ T)].

Then g is also submodular.

Remark 2. Thm. 1 can also be stated for supermodular functions.

Cor. 1 for DPPNET follows directly from the equivalence of norms in finite-dimensional spaces.

Corollary 1. Let P_L be a strictly log-submodular DPP over Y; let DPPNET be a network trained with loss function ‖p − q‖, where ‖·‖ is a norm and p ∈ R^{2^N} (resp. q) is the probability vector assigned by the DPP (resp. the DPPNET) to each subset of Y. Let α = max_{‖x‖_∞ = 1} 1/‖x‖. The distribution modeled by DPPNET is log-submodular if its loss satisfies

‖p − q‖ ≤ min_{S ≠ T, S,T ∉ {∅, Y}} (1/(4α)) [P_L(S) + P_L(T) − P_L(S ∪ T) − P_L(S ∩ T)].

Remark 3. Cor. 1 is generalizable to the KL divergence loss D_KL(P ‖ Q) via Pinsker's inequality.

Checking numerically whether the conditions for Corollary 1 apply during training is NP-hard: the results of this section are purely theoretical. However, Theorem 1 and Corollary 1 provide an additional justification for the use of probabilities in the objective function of DPPNET, compared to other possible choices for the loss (such as the NLL of generated subsets).

4 Experimental results

To evaluate DPPNET, we assess its performance (a) as a proxy for a static DPP (with fixed kernel L) and (b) as a generator of diverse subsets over varying ground sets. Our models are trained with TensorFlow using the Adam optimizer.
Hyperparameters are tuned to maximize the normalized log-likelihood of generated subsets. We compare DPPNET to standard DPPs and to the following baselines:

• UNIF: Uniform sampling over the ground set,
• HCP: Matérn hard-core point processes. Points are sampled from a Poisson distribution then thinned out to remove points within distance r < 0.2 (chosen by cross-validation) from each other,
• k-MEDOIDS: The clustering algorithm from [21], which uses datapoints as cluster centers. The distance between points is computed using the same metric used by the DPP.

In sections 4.1 and 4.2 we evaluate the quality of training and the ability of DPPNET to emulate DPP samples. For this reason, we evaluate subset quality using the subset's negative log-likelihood (NLL) under the DPP we seek to approximate, as – to the extent of our knowledge – there is no other standard method to benchmark the diversity of a selected subset that depends on specific dataset encodings. In section 4.3, we evaluate DPPNET sampling as a proxy for DPP samples on a downstream task (kernel reconstruction); there, the evaluation metric evaluates the quality of the reconstructed kernel.

4.1 Sampling over the unit square

We begin by analyzing the performance of a DPPNET trained on a DPP with fixed kernel over the unit square. This is motivated by the need for diverse sampling methods on the unit hypercube, e.g. quasi-Monte Carlo methods, Latin hypercube sampling [40] and low-discrepancy sequences.

The ground set consists of the 100 points lying on the 10 × 10 grid on the unit square. The DPP is defined by its kernel L such that L_ij = exp(−‖x_i − x_j‖_2^2 / 2). As the target distribution has a fixed ground set representation (by way of L), DPPNET has no inhibitive attention mechanism. We report the performance of the different sampling methods in Figure 3.
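This setup is straightforward to reproduce; the sketch below (our illustration, assuming evenly spaced grid ticks in [0, 1]) builds the grid kernel and the NLL metric −log P_L(S) used throughout this section:

```python
import numpy as np

# 10 x 10 grid on the unit square (assumed evenly spaced ticks).
ticks = np.linspace(0.0, 1.0, 10)
X = np.array([(x, y) for x in ticks for y in ticks])        # shape (100, 2)

# Fixed kernel L_ij = exp(-||x_i - x_j||^2 / 2).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
L = np.exp(-sq_dists / 2.0)

def nll(L, S):
    """-log P_L(S), the evaluation metric of this section (lower is better)."""
    _, logdet_S = np.linalg.slogdet(L[np.ix_(S, S)])
    _, logdet_norm = np.linalg.slogdet(np.eye(len(L)) + L)
    return logdet_norm - logdet_S

# A well-spread subset (the four corners) scores a much lower NLL than
# four adjacent grid points, whose submatrix is nearly singular.
assert nll(L, [0, 9, 90, 99]) < nll(L, [0, 1, 10, 11])
```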
Visually (Figure 3a) and quantitatively (Figure 3b), DPPNET improves significantly over all other baselines. Furthermore, greedily sampling the mode from the DPPNET achieves a better NLL than DPP samples themselves (Table 1).

Figure 3: Sampling on the unit square with a DPPNET (1 hidden layer, 841 neurons) trained on a single DPP kernel. (a) Sampled subsets of size 20 and corresponding NLLs for several baselines. (b) Normalized log-likelihood of samples drawn from all methods as a function of the sampled set size. Visually, DPPNET gives similar results to the full DPP (left). As evaluated by DPP NLL, the DPPNET's mode achieves superior performance to the full DPP, and DPPNET sampling overlaps with DPP sampling (right). [Panel NLLs in (a): Unif 168.57, DPP 156.71, DPPNET 151.01, DPPNET mode 146.54.]

Table 1: Negative log-likelihood (NLL) under P_L for sets of size k = 20 sampled over the unit square. DPPNET achieves comparable performance to the DPP, outperforming the other baselines. DPP GREEDY is deterministic greedy DPP sampling and achieves the lowest NLL; however, DPPNET MODE is able to reach it (Fig. 3a).

DPP: 154.95 ± 2.93 | DPP GREEDY: 147.76 | UNIFORM: 180.53 ± 9.56 | HCP: 163.40 ± 5.87 | k-MEDOIDS: 169.37 ± 6.41 | DPPNET: 153.44 ± 2.07

4.2 Sampling over variable ground sets

We evaluate the performance of DPPNETs on varying ground set sizes through the MNIST [29], CelebA [34], and MovieLens [20] datasets. For MNIST and CelebA, we generate feature representations of length 32 by training a VAE on the dataset (see App. B for details); for MovieLens, we obtain a feature vector for each movie by applying nonnegative matrix factorization to the rating matrix, obtaining features of length 10.
We train DPPNET with the embeddings corresponding to randomly subsampled ground sets of size N = 100 of the training sets of each dataset; during testing (i.e., in the results below), the trained models are fed feature representations from the corresponding test sets. The DPPNET is trained based on samples from DPPs with a linear kernel for MovieLens and with an exponentiated quadratic kernel for the image datasets. Bandwidths were set to β = 0.0025 for MNIST and β = 0.1 for CelebA, chosen in order to obtain a DPP sample size ≈ 20: for a DPP with kernel L, the expected sample size is given by E_{S∼P_L}[|S|] = Tr[L(L + I)^{−1}].

For MNIST, Figure 4 shows images selected by the baselines and the DPPNET, chosen among 100 digits with all identical labels; visually, DPPNET and DPP samples provide a wider coverage of writing styles. However, the NLL of samples from DPPNET decays significantly, whereas the DPPNET mode maintains competitive performance with DPP samples. For this reason, all further experiments focus on greedy mode samples drawn from the DPPNET.

Numerical results for MNIST are reported in Table 2. Although DPPNET was trained on feature matrices representing random subsets of the training set, we see that when selecting subsets restricted by label at test time, DPPNET remains competitive, suggesting that DPPNET sampling may be leveraged to focus on sub-areas of datasets identified as areas of interest.
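The bandwidth choice above relies on the expected-sample-size identity E_{S∼P_L}[|S|] = Tr[L(L + I)^{−1}] = Σ_i λ_i/(1 + λ_i); the sketch below (on random features standing in for the paper's VAE embeddings) shows how the bandwidth controls the expected set size:

```python
import numpy as np

def expected_sample_size(L):
    """E_{S ~ P_L}[|S|] = Tr[L (L + I)^{-1}] = sum_i lam_i / (1 + lam_i)."""
    lam = np.linalg.eigvalsh(L)
    return float(np.sum(lam / (1.0 + lam)))

def rbf_kernel(Phi, beta):
    """Exponentiated quadratic kernel L_ij = exp(-beta * ||phi_i - phi_j||^2)."""
    sq = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)
    return np.exp(-beta * sq)

# Toy check on random features (not the paper's embeddings): a larger bandwidth
# beta makes items look more dissimilar, so E[|S|] grows toward N/2.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 32))
sizes = [expected_sample_size(rbf_kernel(Phi, b)) for b in (1e-4, 1e-2, 1.0)]
assert sizes[0] < sizes[1] < sizes[2]
```

Sweeping β and reading off the expected size is how one would hit a target sample size ≈ 20 in practice.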
Numerical results for CelebA and MovieLens are reported in Table 3.

To analyze the contribution of the attention mechanism, we furthermore performed an ablation test, training a neural network without the attention block; architecture tuning revealed that the model that achieved the best performance required 6 layers of 585 neurons on MNIST: significantly more parameters than with the attention mechanism (3 layers of 365 neurons).

Figure 4: Digits sampled from a DPPNET (3 layers of 365 neurons) trained on MNIST.

Table 2: NLL (mean ± standard error) under the true DPP of samples drawn uniformly, according to the mode of the DPPNET, and from the DPP itself. We sample subsets of size 20; for each class of digits we build 25 feature matrices Φ from encodings of those digits, and for each feature matrix we draw 25 different samples. For the last column, DPPNET was trained on all digits.

DIGIT          0            1            2            3            4
DPP BASELINE   52.2 ± 0.1   60.5 ± 0.1   49.8 ± 0.0   50.7 ± 0.1   51.0 ± 0.1
UNIF           54.9 ± 0.1   65.1 ± 0.1   51.5 ± 0.1   52.9 ± 0.1   53.3 ± 0.1
MEDOIDS        55.1 ± 0.1   65.0 ± 0.1   51.5 ± 0.0   52.9 ± 0.1   53.1 ± 0.1
DPPNET MODE    53.6 ± 0.3   63.6 ± 0.4   50.8 ± 0.2   51.4 ± 0.3   51.6 ± 0.4

DIGIT          5            6            7            8            9            All
DPP BASELINE   50.4 ± 0.1   51.6 ± 0.1   51.5 ± 0.1   50.9 ± 0.1   52.7 ± 0.1   49.2 ± 0.1
UNIF           52.4 ± 0.1   54.6 ± 0.1   55.1 ± 0.1   53.3 ± 0.1   56.2 ± 0.1   51.6 ± 0.1
MEDOIDS        52.4 ± 0.0   54.4 ± 0.1   55.1 ± 0.1   53.2 ± 0.1   56.1 ± 0.1   51.0 ± 0.1
DPPNET MODE    51.8 ± 0.3   52.8 ± 0.3   52.7 ± 0.4   50.9 ± 0.3   55.0 ± 0.4   48.6 ± 0.2

Table 3: NLLs on CelebA and MovieLens (mean ± standard error); 20 samples of size 20 were drawn for 20 different feature
matrices each, with 100 samples per method; DPPNET achieves the best NLLs.

DATASET     KERNEL   DPP BASELINE   UNIFORM        k-MEDOIDS      DPPNET MODE
CelebA      RBF      49.04 ± 2.03   50.84 ± 1.53   51.18 ± 1.34   49.28 ± 1.57
MovieLens   Linear   84.29 ± 0.20   92.04 ± 0.17   88.90 ± 0.16   80.21 ± 0.33

[Figure 4 panel NLLs: Unif 32.80, DPPNET 30.20, DPPNET mode 28.55, DPP 30.88.]

4.3 DPPNET for kernel reconstruction

As a final experiment, we evaluate DPPNET's performance on a downstream task for which DPPs have been shown to be useful: kernel reconstruction using the Nyström method [42, 48]. Given a positive semidefinite matrix K ∈ R^{N×N}, the Nyström method approximates K by K̂ = K_{·,S} K†_{S,S} K_{S,·}, where K† denotes the pseudoinverse of K and K_{·,S} (resp. K_{S,·}) is the submatrix of K formed by its columns (resp. rows) indexed by S. The Nyström method is a popular choice to scale up kernel methods, e.g., [3, 45, 14, 46]. Reconstruction quality depends directly on the selected set of columns S; choosing the columns by sampling from the DPP with kernel K is a standard approach [31, 38].

Following the approach of Li et al. [31], we evaluate the quality of the kernel reconstruction via the following process: given an RBF ridge regression kernel K built from 1000 training points in the Ailerons regression dataset, and with regularization and bandwidth chosen using 10-fold cross-validation, we report the test prediction error obtained by the Nyström reconstruction K̂. The columns for the reconstruction are chosen with different DPP sampling algorithms: full DPP sampling, DPPNET, and approximate DPP sampling using MCMC with quadrature acceleration [31, 32].

Fig. 5 reports our results, and confirms that DPPNET-based mode sampling performs comparably to other DPP sampling methods (Fig. 5a), while running orders of magnitude faster.
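The Nyström reconstruction itself is a two-line computation; the sketch below (toy 1-D data standing in for the Ailerons kernel, our illustration) shows why diverse landmark columns help:

```python
import numpy as np

def nystrom(K, S):
    """Nystrom reconstruction: K_hat = K[:, S] pinv(K[S, S]) K[S, :]."""
    S = list(S)
    return K[:, S] @ np.linalg.pinv(K[np.ix_(S, S)]) @ K[S, :]

# Toy 1-D RBF kernel (standing in for the Ailerons ridge-regression kernel):
# well-separated landmark columns reconstruct K far better than clustered ones,
# which is why diverse (DPP-like) column selection is a natural fit.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(size=50))
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.02)

err = lambda S: np.linalg.norm(K - nystrom(K, S), "fro")
spread, clustered = list(range(0, 50, 10)), list(range(5))
assert err(spread) < err(clustered)
```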
Furthermore, while all methods were run on CPU, DPPNET is amenable to further acceleration using GPUs.

Figure 5: Results for the Nyström approximation experiments, comparing DPPNET to the fast MCMC sampling method [32]. Subset selection by DPPNET achieves comparable or lower RMSE than other methods and is significantly faster. In (c), the relative size of the marker represents the size of the sampled subset.

5 Conclusion and discussion

We introduced DPPNET: a generative network that approximates DPP sampling over fixed and varying ground sets. We showed across several datasets and applications that DPPNETs are orders of magnitude faster than standard DPP sampling algorithms, without decreasing sample quality.

We derived an inhibitive attention mechanism based on the repulsion process modeled by DPPs; added to DPPNET while learning a class of DPPs over ground sets that vary over time, this mechanism significantly reduces the number of trainable parameters required to learn a DPPNET.

Using DPPNET, several applications of DPPs that remained purely theoretical due to high sampling costs (e.g., minibatch sampling for SGD as suggested in [51]) are now within reach of modern computing abilities; as such, replacing DPPs with DPPNET in cases where approximate, fast sampling is required in downstream applications is a key area for future work.

Our choice of architecture for DPPNET leaves certain questions open. DPPNET's samples are not exchangeable: two sequences i_1, . . . , i_k and σ(i_1), . . . , σ(i_k), where σ is a permutation of [k], will not have the same probability under a DPPNET. Although exchangeability can be enforced by leveraging previous work [50], non-exchangeability can be an asset when sampling a ranking of items.

Finally, our theoretical results (Thm. 1) suggest a new area of research in terms of using generative networks for combinatorial optimization.
Two questions of particular interest are the following. Which properties of set functions, other than submodularity, can be inherited by a generative model? Can generative neural networks be leveraged to learn other combinatorial functions for which marginal probabilities (used to train the network) can be easily obtained?

[Figure 5 panels: (a) Test error (RMSE vs. subset size); (b) Wall clock time (seconds) vs. subset size; (c) Error vs. time.]

Acknowledgements. The authors would like to thank D. Sculley and Dustin Tran for their help with this project.

References

[1] R. H. Affandi, A. Kulesza, E. Fox, and B. Taskar. Nyström approximation for large-scale determinantal processes. In Artificial Intelligence and Statistics, 2013.

[2] N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov Chain algorithms for sampling Strongly Rayleigh distributions and Determinantal Point Processes. In Conference on Learning Theory, 2016.

[3] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, Mar. 2003.

[4] A. Badanidiyuru, S. Dobzinski, H. Fu, R. Kleinberg, N. Nisan, and T. Roughgarden. Sketching valuation functions. In SODA. Society for Industrial and Applied Mathematics, 2012.

[5] J. Borcea, P. Brändén, and T. M. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.

[6] A. Borodin. Determinantal Point Processes, 2009.

[7] A. Borodin and E. M. Rains. Eynard–Mehta theorem, Schur process, and their Pfaffian analogs. Journal of Statistical Physics, 121(3):291–317, Nov. 2005.

[8] L. Chen, G. Zhang, and E. Zhou. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.

[9] L. Chen, G.
Zhang, and E. Zhou. Fast greedy MAP inference for Determinantal Point Process to improve recommendation diversity. In Advances in Neural Information Processing Systems, 2018.

[10] A. Cotter, M. R. Gupta, H. Jiang, J. Muller, T. Narayan, S. Wang, and T. Zhu. Interpretable set functions. arXiv:1806.00050, 2018.

[11] M. Dereziński. Fast determinantal point processes via distortion-free intermediate sampling, 2018.

[12] M. Dereziński, D. Calandriello, and M. Valko. Exact sampling of determinantal point processes with sublinear time preprocessing, 2019.

[13] N. R. Devanur, S. Dughmi, R. Schwartz, A. Sharma, and M. Singh. On the approximation of submodular functions, 2013.

[14] C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Mach. Intell., 26(2):214–225, Jan. 2004.

[15] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of Determinantal Point Processes. In AAAI Conference on Artificial Intelligence, 2017.

[16] J. Gillenwater, A. Kulesza, Z. Mariet, and S. Vassilvitskii. A tree-based method for fast repeated sampling of determinantal point processes. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research. PMLR, 2019.

[17] J. Ginibre. Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3), 1965.

[18] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.

[19] A. Gotovos, H. Hassani, and A. Krause. Sampling from probabilistic submodular models. In Neural Information Processing Systems, 2015.

[20] F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context.
ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, Dec. 2015.

[21] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.

[22] P. Hennig and R. Garnett. Exact sampling from determinantal point processes, 2016.

[23] J. Hough, M. Krishnapur, Y. Peres, and B. Virág. Zeros of Gaussian Analytic Functions and Determinantal Point Processes. American Mathematical Society, 2009.

[24] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probab. Surveys, 3:206–229, 2006.

[25] K. Johansson. Determinantal processes with number variance saturation. Communications in Mathematical Physics, 252(1):111–148, Dec. 2004.

[26] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

[27] A. Krause and D. Golovin. Submodular function maximization. In L. Bordeaux, Y. Hamadi, and P. Kohli, editors, Tractability. Cambridge University Press, 2014.

[28] A. Kulesza and B. Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.

[29] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.

[30] C. Li, S. Jegelka, and S. Sra. Fast mixing Markov chains for Strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems, 2016.

[31] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. In International Conference on Machine Learning, 2016.

[32] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In International Conference on Machine Learning, 2016.

[33] H. Lin and J. Bilmes.
A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, HLT '11. Association for Computational Linguistics, 2011.

[34] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, December 2015.

[35] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975.

[36] Z. Mariet and S. Sra. Diversity networks: Neural network compression using Determinantal Point Processes. In International Conference on Learning Representations, 2016.

[37] Z. Mariet and S. Sra. Kronecker Determinantal Point Processes. In Advances in Neural Information Processing Systems, 2016.

[38] Z. Mariet, S. Sra, and S. Jegelka. Exponentiated Strongly Rayleigh distributions. In Advances in Neural Information Processing Systems 31, 2018.

[39] Z. Mariet, M. Gartrell, and S. Sra. Learning determinantal point processes by corrective negative sampling. In AISTATS, 2019.

[40] M. D. McKay, R. J. Beckman, and W. J. Conover. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.

[41] M. Mehta and M. Gaudin. On the density of eigenvalues of a random matrix. Nuclear Physics, 18:420–427, 1960.

[42] E. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1), 1930. ISSN 0001-5962.

[43] T. Osogami, R. Raymond, A. Goel, T. Shirai, and T. Maehara. Dynamic determinantal point processes, 2018.

[44] T. G. Robertazzi and S. C. Schwartz. An accelerated sequential algorithm for producing D-optimal designs. SIAM J. Sci. Stat. Comput., 10(2), Mar. 1989. ISSN 0196-5204.

[45] H. Shen, S.
Jegelka, and A. Gretton. Fast kernel-based independent component analysis. Trans. Sig. Proc., 57(9):3498–3511, Sept. 2009.

[46] A. Talwalkar, S. Kumar, M. Mohri, and H. Rowley. Large-scale SVD and manifold learning. J. Mach. Learn. Res., 14(1):3129–3152, Jan. 2013.

[47] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[48] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

[49] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, 2016.

[50] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Póczos, R. R. Salakhutdinov, and A. J. Smola. Deep sets. In Advances in Neural Information Processing Systems, 2017.

[51] C. Zhang, H. Kjellström, and S. Mandt. Stochastic learning on imbalanced data: Determinantal Point Processes for mini-batch diversification. In Uncertainty in Artificial Intelligence, 2017.

A Maintaining log-submodularity in the generative model

Theorem 1. Let f be a strictly submodular function over subsets of a ground set Y, and g be a function over the same space such that

    ‖f − g‖_∞ ≤ (1/4) min_{S ≠ T; S,T ∉ {∅, Y}} [f(S) + f(T) − f(S ∪ T) − f(S ∩ T)].    (4)

Then g is also submodular.

Proof. In all the following, we assume that S and T are subsets of the ground set Y such that S ≠ T and S, T ∉ {∅, Y} (the inequalities being immediate in these corner cases). Let

    ∆ := min_{S,T} [f(S) + f(T) − f(S ∪ T) − f(S ∩ T)].

By the strict submodularity hypothesis, we know ∆ > 0. Let S, T ⊆ Y such that S ≠ T and S, T ∉ {∅, Y}.
To show the submodularity of g, it suffices to show that

    g(S) + g(T) ≥ g(S ∪ T) + g(S ∩ T).

By definition of ∆,

    f(S) + f(T) − f(S ∪ T) − f(S ∩ T) ≥ ∆.

From equation 4, we know that

    max_{S ⊆ Y} |f(S) − g(S)| ≤ ∆/4.

It follows that

    g(S) + g(T) − g(S ∪ T) − g(S ∩ T) ≥ f(S) + f(T) − f(S ∪ T) − f(S ∩ T) − ∆ ≥ 0,

which proves the submodularity of g.

B Encoder details

For the MNIST encodings, the VAE encoder consists of a 2D convolutional layer with 64 filters of height and width 4 and strides of 2, followed by a 2D convolutional layer with 128 filters (same height, width, and strides), then by a dense layer of 1024 neurons. The encodings are of length 32.

Figure 6: Digits and VAE reconstructions from the MNIST training set

CelebA encodings were generated by a VAE using a Wide Residual Network [49] encoder with 10 layers and filter-multiplier k = 4, a latent space of 32 full-covariance Gaussians, and a deconvolutional decoder trained end-to-end using an ELBO loss. In detail, the decoder architecture consists of a 16K dense layer followed by a sequence of 4 × 4 convolutions with [512, 256, 128, 64] filters interleaved with 2× upsampling layers and a final 6 × 6 convolution with 3 output channels for each of 5 components in a mixture of quantized logistic distributions representing the decoded image.
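Theorem 1 (Appendix A) can also be sanity-checked numerically on a small ground set. The sketch below uses f(S) = √|S|, an illustrative choice not taken from the paper, and restricts the minimum to incomparable pairs S, T, since for comparable pairs (S ⊆ T or T ⊆ S) the submodularity gap f(S) + f(T) − f(S ∪ T) − f(S ∩ T) is identically zero:

```python
import itertools
import math
import random

def subsets(n):
    # All subsets of {0, ..., n-1} as frozensets.
    for r in range(n + 1):
        yield from (frozenset(c) for c in itertools.combinations(range(n), r))

n = 4
sets = list(subsets(n))
f = {S: math.sqrt(len(S)) for S in sets}  # strictly submodular on incomparable pairs

def gap(h, S, T):
    # Submodularity gap of h at the pair (S, T).
    return h[S] + h[T] - h[S | T] - h[S & T]

# The minimum in Eq. (4) is only meaningful over incomparable pairs.
incomp = [(S, T) for S in sets for T in sets if not (S <= T or T <= S)]
delta = min(gap(f, S, T) for S, T in incomp)
assert delta > 0

# Perturb f by strictly less than delta/4 in sup-norm; by Theorem 1,
# every submodularity gap of g loses at most delta of slack, so g
# remains submodular.
rng = random.Random(0)
g = {S: f[S] + rng.uniform(-1, 1) * 0.99 * (delta / 4) for S in sets}
assert all(gap(g, S, T) >= 0 for S, T in incomp)
print(f"delta = {delta:.4f}")
```

This mirrors the slack argument of the proof: each of the four terms moves by at most ∆/4, so the gap of g is bounded below by the gap of f minus ∆, which is nonnegative.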