{"title": "Maximizing Induced Cardinality Under a Determinantal Point Process", "book": "Advances in Neural Information Processing Systems", "page_first": 6911, "page_last": 6920, "abstract": "Determinantal point processes (DPPs) are well-suited to recommender systems where the goal is to generate collections of diverse, high-quality items. In the existing literature this is usually formulated as finding the mode of the DPP (the so-called MAP set). However, the MAP objective inherently assumes that the DPP models \"optimal\" recommendation sets, and yet obtaining such a DPP is nontrivial when there is no ready source of example optimal sets. In this paper we advocate an alternative framework for applying DPPs to recommender systems. Our approach assumes that the DPP simply models user engagements with recommended items, which is more consistent with how DPPs for recommender systems are typically trained. With this assumption, we are able to formulate a metric that measures the expected number of items that a user will engage with. We formalize this optimization of this metric as the Maximum Induced Cardinality (MIC) problem. Although the MIC objective is not submodular, we show that it can be approximated by a submodular function, and that empirically it is well-optimized by a greedy algorithm.", "full_text": "Maximizing Induced Cardinality Under a\n\nDeterminantal Point Process\n\nJennifer Gillenwater\nGoogle Research NYC\njengi@google.com\n\nAlex Kulesza\n\nGoogle Research NYC\nkulesza@google.com\n\nZelda Mariet\n\nMassachusetts Institute of Technology\n\nzelda@csail.mit.edu\n\nSergei Vassilvitskii\nGoogle Research NYC\nsergeiv@google.com\n\nAbstract\n\nDeterminantal point processes (DPPs) are well-suited to recommender systems\nwhere the goal is to generate collections of diverse, high-quality items. In the\nexisting literature this is usually formulated as \ufb01nding the mode of the DPP (the\nso-called MAP set). 
However, the MAP objective inherently assumes that the DPP models “optimal” recommendation sets, and yet obtaining such a DPP is nontrivial when there is no ready source of example optimal sets. In this paper we advocate an alternative framework for applying DPPs to recommender systems. Our approach assumes that the DPP simply models user engagements with recommended items, which is more consistent with how DPPs for recommender systems are typically trained. With this assumption, we are able to formulate a metric that measures the expected number of items that a user will engage with. We formalize the optimization of this metric as the Maximum Induced Cardinality (MIC) problem. Although the MIC objective is not submodular, we show that it can be approximated by a submodular function, and that empirically it is well-optimized by a greedy algorithm.

1 Introduction

Diversity is frequently advantageous for recommender systems. It can compensate for uncertainty, for example, when a search engine can't be sure which type of “java” a user intended and hence returns results spanning coffee, programming languages, and Indonesia. But diversity can also be an inherently desirable property, reflecting the way that users engage with a set of results. A news feed, for example, might include stories on politics, health, sports, and arts—even when the important news of the day is all political—simply because users enjoy reading a variety of articles. This is one of the reasons why diversity has been a longstanding focus for research on information retrieval and recommender systems [Smyth and McClave, 2001, Herlocker et al., 2004, Ziegler et al., 2005, Hurley and Zhang, 2011].

The determinantal point process (DPP), a probabilistic model of subset selection that prefers diverse sets, is a natural fit for these kinds of applications. 
However, while DPPs offer efficient algorithms for probabilistic operations like marginalization, conditioning, and sampling, in practice we often need to select a single “best” set, and this can be more challenging. To date, most research in this direction has focused on approximation algorithms for finding the set with the highest probability, sometimes called the maximum a posteriori (MAP) set [Gillenwater et al., 2012, Kathuria and Deshpande, 2017, Zhang and Ou, 2016, Nikolov and Singh, 2016]. In particular, the MAP objective has recently been applied to recommender systems with some success [Chen et al., 2017, Wilhelm et al., 2018]. However, we argue that, for most recommender systems, the MAP objective is not the best fit.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

As an alternative, in this paper we propose and analyze induced cardinality (IC), which directly measures the expected number of items that the user will engage with. This is a natural objective for any recommender system where engagements (e.g., clicks) are an important metric. We investigate basic properties of the IC objective for DPPs, finding that it is fractionally subadditive but not submodular, and that, as with MAP, it is NP-hard to find the Maximum Induced Cardinality (MIC) set. Despite this negative result, we are able to establish a data-dependent bound showing that the IC objective can often be well-approximated by a submodular function, which offers corresponding greedy optimization guarantees for the MIC problem. We also show that, empirically, a greedy algorithm typically performs very well.

In the remainder of this section we cover background material on DPPs and discuss why MIC is likely a better fit for recommendation systems than MAP. 
We proceed in Section 2 to study basic properties of the IC objective, and then in Section 3 we consider the implications of those properties for the MIC optimization problem. Finally, in Section 4 we present empirical studies of several optimization algorithms.

1.1 Background

Given an n × n positive semi-definite (PSD) kernel matrix L, the associated determinantal point process assigns to any subset S of [n] = {1, 2, . . . , n} the probability P_L(S) = det(L_S) / det(L + I), where L_S denotes the restriction of L to the row and column indices found in S. Note that, because Σ_{S⊆[n]} det(L_S) = det(L + I), the DPP defined above is a properly normalized probability distribution over all 2^n subsets of items drawn from the ground set [n].

Intuition. If we think of the diagonal kernel entry L_ii as a measurement of the quality of item i, then it is not difficult to see that P_L assigns higher probabilities to sets with high-quality items. If we think of the off-diagonal entry L_ij as a scaled measurement of the similarity between items i and j, then properties of determinants can be used to show that P_L assigns higher probabilities to sets whose items are less similar—i.e., more diverse. Thus a DPP prefers sets of items that are both high-quality and diverse. For more background on DPPs, see Kulesza and Taskar [2012].

Given a training collection consisting of subsets of [n], the general DPP learning problem is to find a kernel L such that P_L best replicates the empirical distribution of the subsets in the training collection. For instance, this can be done using maximum likelihood estimation (MLE), for which a variety of optimization techniques have been developed [Kulesza and Taskar, 2011, Gillenwater et al., 2014, Mariet and Sra, 2015, Dupuy and Bach, 2018, Gartrell et al., 2017]. In this work, we do not attempt to improve upon these learning techniques. 
Rather, we assume that one of these techniques has been applied to learn a DPP kernel L for a recommendation task, and we focus on how to best use that kernel.

1.2 Recommender System Example

Suppose we want to recommend k items from a large set denoted by [n], where n ≫ k. As training data, we have r examples of previously recommended k-sets [S_1, S_2, . . . , S_r] and the associated user engagements with those sets [E_1, E_2, . . . , E_r]. (That is, each E_i ⊆ S_i is the set of items that a user actually clicked on, watched, read, etc.)

To learn a DPP for this problem, assume we have a parameterized kernel L(θ), where θ is the parameter to be learned. (Concretely, you might imagine that each item i has an associated feature vector b_i ∈ R^d, and we define the kernel as something like L_ij(θ) = b_i^⊤ diag(θ) b_j, where θ ∈ R^d and diag(θ) denotes the d × d matrix with θ on the diagonal. In reality, of course, the kernel may also depend on context such as a query string or the user's history.)

Let L(i) be a shorthand for L_{S_i}, the |S_i| × |S_i| kernel matrix over the set S_i. A natural learning problem is to find the value of θ that maximizes the log-likelihood of the interaction sets {E_i} given the recommendation sets {S_i}:

    max_θ Σ_{i=1}^r log( P_{L(i)(θ)}(E_i) ) = max_θ Σ_{i=1}^r [ log det( L(i)(θ)_{E_i} ) − log det( L(i)(θ) + I ) ] .    (1)

Optimization techniques from the references in the previous section can be applied to learn a good θ under this learning objective.

At test time, we want to generate new recommendations using a DPP with kernel parameterized by this fixed θ. 
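To make the learning objective concrete, the per-example log-likelihood in Equation 1 can be evaluated with a few determinant computations. The sketch below assumes the simple diagonal parameterization L_ij(θ) = b_i^⊤ diag(θ) b_j described above; the function names and the NumPy setup are illustrative, not part of the paper.

```python
import numpy as np

def kernel(theta, B):
    """Hypothetical parameterization L(theta) = B diag(theta) B^T,
    where B stacks one feature row b_i per item."""
    return B @ np.diag(theta) @ B.T

def log_likelihood(theta, B, rec_sets, eng_sets):
    """Sum over examples of log P_{L(i)}(E_i)
       = log det(L(i)_{E_i}) - log det(L(i) + I)   (Equation 1)."""
    total = 0.0
    for S, E in zip(rec_sets, eng_sets):
        L_i = kernel(theta, B)[np.ix_(S, S)]      # kernel restricted to shown set S_i
        pos = [S.index(j) for j in E]             # positions of engaged items within S_i
        if pos:                                   # det of an empty submatrix is 1
            total += np.linalg.slogdet(L_i[np.ix_(pos, pos)])[1]
        total -= np.linalg.slogdet(L_i + np.eye(len(S)))[1]
    return total
```

Gradients of this objective with respect to θ could then be taken by hand or with automatic differentiation; the references above develop much more refined optimization schemes.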
Previous work has addressed the problem of generating a recommendation set of size k by (approximately) solving the following optimization problem:

Problem 1 (Maximum a posteriori (MAP)).

    max_{S:|S|=k} P_{L(θ)}(S) = max_{S:|S|=k} det(L_S(θ))    (2)

At first glance, this MAP objective seems quite natural: it seeks the most likely set under the probabilities defined by the DPP. However, it has subtle semantics. Recall that at training time we learned θ by maximizing the likelihood of the items that were engaged with, not the likelihood of all items that were recommended. That is, we maximized the probability of {E_i}, not {S_i}. (It wouldn't make any sense to maximize the likelihood of {S_i}, as this would result in learning a DPP that simply mimics whatever recommender system was used in generating {S_i} in the first place.) Thus, the learned DPP models user engagements, not recommended sets. Formally, this means that the MAP objective P_L(S) = det(L_S) / det(L + I) represents the probability that a user, when presented with a recommendation consisting of all of the available items, will engage with every item in S.

In practice, of course, it is the set S that gets shown to the user, who then engages with some subset of S. Hence, the MAP objective does not have the correct semantics, instead introducing a mismatch between train and test time.

1.3 Maximum Induced Cardinality

As an alternative to MAP, we propose Maximum Induced Cardinality (MIC), which appeals to a specific notion of success that is natural for many recommender systems: maximizing the number of recommended items that the user engages with. 
Whereas with the MAP objective the ground set is [n] and the modeled variable is the set S ⊆ [n] of recommended items, here the ground set is S and the modeled variable is the set E ⊆ S of recommended items that the user eventually engages with:

    P_{L_S}(E) = det(L_E) / det(L_S + I) .    (3)

This matches the learning setup above. The MIC problem, which aims to maximize the expected cardinality of the induced engagement set E, can be formalized as follows.

Problem 2 (Maximum Induced Cardinality (MIC)).

    max_{S:|S|=k} f(S),   f(S) ≡ E_{E∼P_{L_S}}[|E|] .    (4)

While f(S) is naïvely an exponential sum over all 2^k subsets of S, it can be simplified:

    f(S) = Σ_{E⊆S} |E| P_{L_S}(E) = Σ_{E⊆S} |E| det(L_E) / det(L_S + I) = Tr(I − (L_S + I)^{−1}) .    (5)

The final equality follows from Equations 15 and 34 in Kulesza and Taskar [2012]. In this form, the time required to compute f(S) is dominated by the inverse, which requires O(k^3) time.

1.4 MAP vs MIC

As discussed above, the semantics of MIC are more appropriate for recommender systems than those of MAP. However, the choice of objective can also have immediate, practical consequences. While both MIC and MAP can produce relatively diverse sets in general, when the DPP kernel is low-rank MAP can fail dramatically. This occurs in practice, for instance, if there are a small number of features relative to the size of the desired recommendation set.

To illustrate how MAP fails and why MIC does not, let's consider a toy example (see Figure 1). Suppose that we are in a movie recommendation context, and each dot in Figure 1 represents one movie, with the size of the dot proportional to movie quality. Suppose further that the two dimensions in Figure 1 are star ratings and box office revenue. Then the three clusters correspond to three types of movies: 1. 
“Artistic gems” characterized by high ratings but low revenue; 2. “Oscar winners” with high ratings and high revenue; and 3. “Summer blockbusters”, which have low critical ratings but high revenue. Each of these categories could be desirable, depending on the user's mood, so it might be advantageous to recommend one movie from each group. However, when asked to select a set of size k = 3, the MAP objective has equal value (zero) for all size-three sets—with only two features, the rank of the DPP kernel is 2, and hence the determinant of any 3 × 3 matrix will be zero. Hence MAP cannot distinguish among any three-item sets. The MIC objective on the other hand continues to provide useful differentiation even when the number of items requested exceeds the rank of the kernel matrix. It will select one item from each cluster (e.g., the + items), whereas MAP will select a random size-3 set (e.g., the x items).

Figure 1: Example where MIC (+) is more diverse than MAP (x).

2 Properties of Induced Cardinality

We begin by presenting key properties of the IC objective. We show that while f(S) is monotone (Theorem 1), it is not submodular (Example 1.1), but is fractionally subadditive (Theorem 2). These results will inform the subsequent discussion of optimization techniques in Section 3.

Theorem 1. f(S) is monotone increasing.

Showing that f(S) is monotone is a straightforward application of the Cauchy eigenvalue interlacing theorem. The proof can be found in the supplement.

We can also show that f(S) is not submodular. Formally, recall that a set function g is submodular if for all sets S ⊆ T ⊆ [n] and all i ∉ T:

    g(S ∪ {i}) − g(S) ≥ g(T ∪ {i}) − g(T) .    (6)

To show non-submodularity, it suffices to create a single counterexample violating this property.

Example 1.1. 
Consider n = 3 items, and define a matrix F with one row of features per item:

    F = [ 2    0
          2    0
          √2   √2 ] ,   L = F F^⊤ = [ 4     4     2√2
                                      4     4     2√2
                                      2√2   2√2   4   ]

Let S = {1}, T = {1, 2}, and i = 3; then it is easy to verify that the inequality required for submodularity, Equation 6, does not hold.

To give some intuition, recall the original definition of f(Z) as the expected set size under P_{L_Z}. When Z = S = {1}, P_{L_Z} is split between the empty set and the singleton {1}. When Z = T = {1, 2}, the probability is still only split between the empty set and singletons, because items 1 and 2 are identical and so det(L_T) = 0. Hence, f(T) is not much larger than f(S). However, when Z = T ∪ {i}, both {1, 3} and {2, 3} have substantial probability mass, whereas S ∪ {i} only supports a single size-2 subset, {1, 3}. Hence f(T ∪ {i}) ends up being substantially larger than f(S ∪ {i}).

It is also possible to construct examples showing that Equation 6 does not hold even approximately.

Example 1.2. Define the feature matrix

    F = [ x   x
          x   x + 1
          1   1     ]

for a given value x, and let L = F F^⊤ as before. Then, for S = {1}, T = {1, 2}, and i = 3, one can verify that (f(T ∪ {i}) − f(T)) / (f(S ∪ {i}) − f(S)) grows without bound as x → ∞.

We note that there does exist a restricted setting of L where f(S) is provably submodular. Recall that a real matrix L is an M-matrix if all of its off-diagonal entries are non-positive, and all of its eigenvalues are non-negative. Theorem 3 of Friedland and Gaubert [2013] shows that f(S) is submodular whenever the kernel matrix L is an M-matrix. 
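The counterexample of Example 1.1 is easy to check numerically using the closed form f(S) = |S| − Tr((L_S + I)^{−1}) from Equation 5. A minimal NumPy sketch (zero-indexed items; the helper name is ours):

```python
import numpy as np

def f(L, S):
    """Induced cardinality f(S) = |S| - Tr((L_S + I)^{-1})  (Equation 5)."""
    idx = list(S)
    L_S = L[np.ix_(idx, idx)]
    return len(idx) - np.trace(np.linalg.inv(L_S + np.eye(len(idx))))

# Kernel from Example 1.1: items 0 and 1 have identical feature rows.
r2 = np.sqrt(2)
F = np.array([[2, 0], [2, 0], [r2, r2]])
L = F @ F.T

S, T, i = [0], [0, 1], 2
gain_S = f(L, S + [i]) - f(L, S)   # marginal gain of item i at the smaller set
gain_T = f(L, T + [i]) - f(L, T)   # marginal gain of item i at the superset
print(gain_S < gain_T)             # True: the gain grows, violating Equation 6
```

Here gain_S = 52/85 ≈ 0.612 while gain_T = 164/261 ≈ 0.628, so adding item 3 helps *more* at the larger set, which is exactly what submodularity forbids.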
In practice, for many applications, the off-diagonal entries of the kernel matrix are naturally positive, and in these cases, the kernel is not an M-matrix. In Section 4 we consider algorithms that first project the kernel to an M-matrix and then optimize it using the standard greedy submodular maximization algorithm.

While f(S) is not submodular, it is (fractionally) subadditive. Recall that a set function g on [n] is:

• Subadditive if for all sets S, T ⊆ [n]: g(S ∪ T) ≤ g(S) + g(T) .
• Fractionally subadditive if g(S) ≤ Σ_i α_i g(T_i) for all T_i ⊆ [n] and all constants 0 ≤ α_i ≤ 1 such that Σ_{i : j∈T_i} α_i ≥ 1 for all j ∈ S. Note that a fractionally subadditive function is also subadditive.

Theorem 2. f(S) is fractionally subadditive.

The proof can be found in the supplement.

3 Optimizing Induced Cardinality

In the previous section we showed that f(S) is monotone and subadditive, but not submodular. In contrast to monotone submodular functions, for which the greedy algorithm [Nemhauser et al., 1978] is guaranteed to give a (1 − 1/e)-approximation, subadditive functions cannot be approximated by general black box methods [Feige, 2009]. Moreover, we have the following result:

Theorem 3. MIC is NP-hard.

The proof can be found in the supplement. Despite its NP-hardness, however, we will develop an approximation algorithm for MIC. We begin by giving a different representation of the objective function, expressing it as an infinite geometric series. We show that the first few terms of the series are submodular, and can thus be optimized using greedy methods. By bounding the contribution of the remaining terms we can then prove a data-dependent approximation bound.

Geometric Series Representation. Recall from Equation 5 that f(S) = |S| − Tr((L_S + I)^{−1}). Denote the largest eigenvalue of L by λ_n(L). 
Then, define the PSD matrix B = (m − 1)I − L, with m = λ_n(L) + 1. (The smallest eigenvalue of B will be zero, and the largest will be at most m − 1.) Re-arranging, we have: L + I = mI − B. Since λ_n(B/m) < 1, we can apply the Neumann series representation [Suhubi, 2003, page 390] to this expression:

    m(L + I)^{−1} = (I − (1/m)B)^{−1} = Σ_{i=0}^∞ B^i / m^i .    (7)

Thus, we can re-write f(S) as an infinite sum of traces of matrix powers:

    f(S) = |S| − Σ_{i=0}^∞ Tr(B_S^i) / m^{i+1} .    (8)

Note that B_S^i here means (B_S)^i and not (B^i)_S.

Submodularity. The first two terms in this sum are modular functions:

    Tr(B_S^0) / m = |S| / m   and   Tr(B_S) / m^2 = Σ_{i∈S} B_ii / m^2 .    (9)

Corollary 2 in Friedland and Gaubert [2013] states that the third term, Tr(B_S^2) / m^3, is a supermodular function. Thus, the following function, consisting of the first few terms from the geometric series representation of f, is submodular:

    f̂(S) = |S| − |S|/m − Tr(B_S)/m^2 − Tr(B_S^2)/m^3 .    (10)

Monotonicity. This function is also monotone. This is easiest to see by expressing it in terms of f. Let h represent the difference between f̂ and f:

    h(S) = Σ_{i=3}^∞ Tr(B_S^i) / m^{i+1} .    (11)

Then f̂(S) = f(S) + h(S). Since f is monotone, it remains to show that h is monotone. Consider sets S, T such that S ⊆ T. Then, by the Cauchy eigenvalue interlacing theorem, the j-th eigenvalue of B_S is smaller than the (j + |T| − |S|)-th eigenvalue of B_T. Hence, Tr(B_S) ≤ Tr(B_T), and similarly for all higher powers of these matrices. Thus, h is monotone and so is f̂.

We propose to maximize f̂ using the standard greedy algorithm [Nemhauser et al., 1978], which we will refer to as GREEDY. 
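A minimal sketch of GREEDY applied to f̂ follows. It naively recomputes the objective for every candidate at every step rather than using the faster incremental updates discussed in Section 4.2, and the function names are ours, not the paper's:

```python
import numpy as np

def f_hat(B, m, S):
    """Submodular surrogate (Equation 10):
       |S| - |S|/m - Tr(B_S)/m^2 - Tr(B_S^2)/m^3."""
    idx = list(S)
    B_S = B[np.ix_(idx, idx)]
    return (len(idx) * (1 - 1 / m)
            - np.trace(B_S) / m**2
            - np.trace(B_S @ B_S) / m**3)

def greedy(L, k):
    """GREEDY: repeatedly add the item with the largest marginal gain in f_hat."""
    n = L.shape[0]
    m = np.linalg.eigvalsh(L)[-1] + 1   # m = lambda_n(L) + 1
    B = (m - 1) * np.eye(n) - L         # PSD by construction
    S = []
    for _ in range(k):
        gains = [(f_hat(B, m, S + [j]) - f_hat(B, m, S), j)
                 for j in range(n) if j not in S]
        S.append(max(gains)[1])
    return S
```

This naive version costs O(nk) objective evaluations of up to O(k^2) each per step; maintaining the partial sums for Tr(B_S) and Tr(B_S^2) incrementally recovers the O(nk^2) total runtime reported for SIC in Section 4.2.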
Since f̂ is monotone submodular, this gives a (1 − 1/e) approximation; that is, let Ŝ be the solution returned by GREEDY, and let Ŝ* be the true maximizer of f̂. Then:

    f̂(Ŝ) ≥ (1 − 1/e) f̂(Ŝ*) .    (12)

Tail Analysis. To show that Ŝ is a good approximation for MIC, it remains to bound the difference between f̂ and f. Recall that h represents this difference. Let B_S = QAQ^{−1} be the eigendecomposition of the PSD matrix B_S. Note that B_S B_S = QAQ^{−1}QAQ^{−1} = QA^2Q^{−1}, and hence B_S^i = QA^iQ^{−1} is also a PSD matrix. This means that h is non-negative. Thus, f(S) = f̂(S) − h(S) ≤ f̂(S).

We will show (Theorem 4) that f̂ is also bounded by f from above in that there exists some constant 0 < c ≤ 1 such that c f̂(S) ≤ f(S). Combining these inequalities with Equation 12:

    f(Ŝ) ≥ c f̂(Ŝ) ≥ c(1 − 1/e) f̂(Ŝ*) ≥ c(1 − 1/e) f̂(S*) ≥ c(1 − 1/e) f(S*) .

Thus, the final approximation ratio achieved by this procedure is c(1 − 1/e). It remains to prove a bound on c. We start by proving a theorem that bounds c for a particular set S, later extending it to a uniform result as a corollary.

Theorem 4. The ratio of f (Equation 8) to f̂ (Equation 10) is bounded from below:

    f(S)/f̂(S) ≥ 1 − m r(B_S, 3) / ( (m − 1)k − r(B_S, 1) − r(B_S, 2) ) ,   where r(B_S, ℓ) = Σ_{j=1}^{|S|} (λ_j(B_S)/m)^ℓ ,    (13)

k = |S|, m = λ_n(L) + 1, and B = (m − 1)I − L.

Proof. 
The ratio of interest is f(S)/f̂(S) = 1 − h(S)/f̂(S). We lower-bound it by substituting an upper bound for h.

    h(S) = Σ_{i=3}^∞ Tr(B_S^i) / m^{i+1} = Σ_{i=3}^∞ ( Σ_{j=1}^k λ_j(B_S)^i ) / m^{i+1}    (14)
         = (1/m) Σ_{j=1}^k Σ_{i=3}^∞ (λ_j(B_S)/m)^i    (15)
         = (1/m) Σ_{j=1}^k (λ_j(B_S)/m)^3 · ( 1 / (1 − λ_j(B_S)/m) ) ≤ r(B_S, 3) ,    (16)

where the last equality follows from the geometric series summation formula, and the last inequality follows by definition of r and the fact that λ_j(B_S) ≤ m − 1.

Noting that:

    f̂(S) = k − k/m − Tr(B_S)/m^2 − Tr(B_S^2)/m^3 = (1 − 1/m) k − (1/m)[ r(B_S, 1) + r(B_S, 2) ] ,    (17)

we can now substitute the upper bound on h to complete the proof.

Corollary 4.1. For all sets S of size k,

    f(S)/f̂(S) ≥ 1 − m r′(B, k, 3) / ( (m − 1)k − r′(B, k, 1) − r′(B, k, 2) ) ,   with r′(B, k, ℓ) = Σ_{j=n−k+1}^n (λ_j(B)/m)^ℓ .    (18)

The proof of the corollary can be found in the supplement. The value of c given by Corollary 4.1 is best (closest to 1) for matrices B where the eigenvalues are small, which will be the case when eigenvalues of L are close to λ_n(L). In the extreme case where L is approximately a multiple of the identity matrix, B and thus r′(B, k, ℓ) will be close to zero. In this case, c ≈ 1.

The value of c is worst when the eigenvalues of B decay slowly, which means that most eigenvalues of L are small compared to λ_n(L). 
In the extreme case where all of the top-k eigenvalues of B are identical and equal to m − 1, the expression for c is:

    c = 1 − (m − 1)^2 / (m^2 − m + 1) = m / ( (m − 1)^2 + m ) ≈ 1/m ,    (19)

and thus the approximation is less meaningful. However, in contrast to the MAP objective, this degradation of approximation is gradual, and catastrophic failure such as that seen in Figure 1 is completely avoided.

4 Experiments

As described in Section 1, MIC's semantics are a better fit for DPP-based recommendation systems, whereas the traditional application of MAP leads to a mismatch between how the DPP is learned and how it is applied. Properly comparing MIC to MAP on real data requires a live system where we can observe users engaging with different sets of recommendations; a static dataset is not likely to be sufficient since the number of possible recommendation sets is combinatorially large. (See the work by Swaminathan et al. [2017] for a longer discussion of the challenges here.) In this work, we focus on evaluating algorithms that optimize the MIC objective, specifically evaluating the GREEDY algorithm in three settings: 1) when optimizing f(S), 2) when optimizing f(S) after projecting to the space of M-matrices (see Section 4.1 for details), and 3) when optimizing the submodular approximation f̂(S). We call these methods and their results GIC, PIC, and SIC respectively.

4.1 Projecting to the set of M-matrices

We tried several methods for projecting to the set of (real, symmetric) PSD M-matrices for the PIC method. We found that flipping the signs of any positive off-diagonal elements, then projecting to the PSD cone by truncating negative eigenvalues at zero worked best. 
If the PSD projection resulted in any positive off-diagonal elements, we simply iterated the process of flipping their signs and projecting to the PSD cone until the resulting matrix satisfied all requirements.

Note that the sign-flipping step computes a projection onto the set of Z-matrices (under the Frobenius norm). Since the set of Z-matrices is closed and convex, as is the set of PSD matrices, this means that the iterative process described above is guaranteed to converge. (Though it will not necessarily converge to the projection onto the intersection of the two convex sets.)

4.2 Runtime Analysis

GIC: The definition of f(S) in Equation 5 implies that in iteration i of GREEDY we need to compute an i × i matrix inverse to evaluate the objective on each of the remaining items. Rather than doing this directly, requiring time O(nk^4), we can use incremental inverse updates [Hager, 1989]. This reduces the runtime of GREEDY by a factor of k to O(nk^3). (Note that PIC's runtime is identical, ignoring the initial step of projecting to the space of M-matrices.)

SIC: At first glance, f̂(S) requires squaring an i × i matrix for each item (Equation 10). This too can be substantially improved by taking advantage of the fact that Tr(B_S B_S) = Σ_{s1∈S} Σ_{s2∈S} B_{s1 s2}^2. Evaluating a prospective point simply requires updating this sum with i new terms. This reduces the naïve runtime of GREEDY by a factor of k^2 to O(nk^2), making SIC a factor of k faster than GIC.

(a) Average runtimes from 100 trials with n = 500.

(b) Average eigenvalues from 100 trials with n = 200. The cluster kernel uses 50 clusters, and the Laplacian kernel uses p = 0.2.

(c) MIC vs GIC. Lines represent the average value of the ratio 100 · f(GIC)/f(MIC) from 1000 trials with n = 12.

(d) GIC vs PIC and SIC for three types of kernel matrices. 
From left to right, the results are for Wishart matrices, cluster matrices, and Laplacian matrices. Lines represent the average value of the ratio f(PIC)/f(GIC) or f(SIC)/f(GIC) from 100 trials with n = 200.

(e) Average ratio of f to the SIC objective f̂, evaluated on the GREEDY result for f̂.

Figure 2: Experimental results

Figure 2a shows the runtimes for GIC and SIC. For n = 500 and k = 250, SIC runs about 18 times faster.

4.3 Approximation Quality

We ran experiments with three types of kernel matrices:

• Wishart matrix: For each item, draw a feature vector from an n-dimensional zero-mean Gaussian. Stack the feature vectors into a matrix F, and set L to F F^⊤.
• Cluster matrix: Divide items evenly into k clusters, and sample an n-dimensional mean from N(0, 1) for each cluster. Draw each item's feature vector from N(µ, 1), where µ is the corresponding cluster mean. Stack the feature vectors into a matrix F, and set L to F F^⊤.
• Graph Laplacian: Generate an n-node random graph using the Erdős–Rényi random graph model with edge existence probability p. Compute the graph Laplacian matrix from the degree matrix, D, and the adjacency matrix, A: L = D − A.

As Figure 2b shows, each of these three types of matrices has a distinct shape to its spectrum. The Wishart grows rapidly but smoothly. The cluster matrix also has rapid, smooth growth for k of its eigenvalues (one per cluster), but has value zero for all other eigenvalues. The Laplacian has a smooth, nearly linear growth, the slope of which generally increases with p.

4.3.1 Comparison to MIC

Although GIC does not, in general, have any approximation guarantees, empirically we found that it was quite effective. In Figure 2c we plot the ratio of the GREEDY solution, GIC, to the optimum, MIC (for small n where it is possible to compute MIC by brute force). 
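Such a brute-force comparison is straightforward for small n: enumerate all size-k subsets, score each with the induced cardinality f from Equation 5, and compare against the greedy solution. A sketch (helper names are ours, re-using the closed form from Equation 5):

```python
import itertools
import numpy as np

def f(L, S):
    """Induced cardinality f(S) = |S| - Tr((L_S + I)^{-1})  (Equation 5)."""
    idx = list(S)
    return len(idx) - np.trace(np.linalg.inv(L[np.ix_(idx, idx)] + np.eye(len(idx))))

def brute_force_mic(L, k):
    """Exact MIC by scoring all (n choose k) subsets -- feasible only for small n."""
    n = L.shape[0]
    return max(itertools.combinations(range(n), k), key=lambda S: f(L, S))

def greedy_ic(L, k):
    """GIC: greedily maximize f itself, one item at a time."""
    n, S = L.shape[0], []
    for _ in range(k):
        S.append(max((j for j in range(n) if j not in S),
                     key=lambda j: f(L, S + [j])))
    return S
```

The brute-force search costs (n choose k) evaluations of f, each O(k^3), which is why the exact optimum is only computed for n around 12 in Figure 2c.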
GIC does best on the Laplacian matrices (edge existence parameter is fixed at p = 0.2), and slightly worse on the other two matrix types. The success of GIC on the Laplacian may be partly due to the fact that Laplacians are M-matrices, and, as mentioned in Section 2, f(S) is submodular in this case. The performance on the Wishart and cluster kernels is not quite as good, but GIC still achieves more than 99% of the maximum possible value in both cases.

Note that it is only possible to compute MIC for relatively small n, since it requires an exhaustive search of the space of all (n choose k) possible size-k subsets. Thus, in all subsequent experiments we use the GIC solution as the baseline for comparing with PIC and SIC.

4.3.2 Comparison to PIC and SIC

In general, the M-matrix projection, PIC, and the submodular approximation, SIC, slightly underperform GIC. This is despite a formal approximation guarantee on the performance of SIC. Figure 2d shows the performance of the methods on each of the three types of kernels.

• For Wishart matrices, SIC does slightly better than PIC, and both methods are consistently finding good sets whose value is at least 99% that of the GIC.
• For cluster matrices, SIC and PIC struggle, sometimes choosing sets with less than half the value of GIC.
• For Laplacian matrices, PIC is identical to GIC. This is because Laplacian matrices are M-matrices, and hence L does not need to be projected. The SIC results are also consistently good. In the plot, we show values for Laplacians with edge existence parameter p = 0.01 rather than the p = 0.2 used in earlier experiments, as for p = 0.2 the spectrum is non-flat enough that SIC results are indistinguishable from PIC and GIC.

The SIC trends in Figure 2d can be explained by the extent to which f is well-approximated by the SIC objective, f̂. In Figure 2e we plot the ratio of the two. 
Note that for Wishart and Laplacian matrices, the f/f̂ ratio decays slowly with k, hence optimizing f̂ is very similar to optimizing f. For the cluster matrices, the ratio grows dramatically with k, which explains the poor performance of SIC for low values of k.

5 Conclusion

Our proposed MIC optimization problem has advantages over the common MAP setup for recommender systems in terms of interpretability and train-test time matching. In this work we have shown that the MIC objective can often be well-approximated by a submodular function and optimized by a straightforward greedy algorithm. Future work includes the application of MIC to real-world recommender systems.

References

L. Chen, G. Zhang, and H. Zhou. Improving the Diversity of Top-N Recommendation via Determinantal Point Process. In Large Scale Recommendation Systems Workshop, 2017.

C. Dupuy and F. Bach. Learning Determinantal Point Processes in Sublinear Time. In Conference on Artificial Intelligence and Statistics (AISTATS), 2018.

U. Feige. On Maximizing Welfare When Utility Functions Are Subadditive. SIAM Journal on Computing (SICOMP), 39, 2009.

S. Friedland and S. Gaubert. Submodular Spectral Functions of Principal Submatrices of a Hermitian Matrix, Extensions and Applications. Linear Algebra and its Applications, 438, 2013.

M. Gartrell, U. Paquet, and N. Koenigstein. Low-Rank Factorization of Determinantal Point Processes. In AAAI Conference on Artificial Intelligence, 2017.

J. Gillenwater, A. Kulesza, and B. Taskar. Near-Optimal MAP Inference for Determinantal Point Processes. In Neural Information Processing Systems (NIPS), 2012.

J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-Maximization for Learning Determinantal Point Processes. In Neural Information Processing Systems (NIPS), 2014.

W. Hager. Updating the Inverse of a Matrix. SIAM Review, 31, 1989.

J. Herlocker, J. Konstan, L. Terveen, and J.
Riedl. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems (TOIS), 22, 2004.

N. Hurley and M. Zhang. Novelty and Diversity in Top-N Recommendation – Analysis and Evaluation. ACM Transactions on Internet Technology (TOIT), 10, 2011.

T. Kathuria and A. Deshpande. On Sampling and Greedy MAP Inference of Constrained Determinantal Point Processes. 2017.

A. Kulesza and B. Taskar. Learning Determinantal Point Processes. In Conference on Uncertainty in Artificial Intelligence (UAI), 2011.

A. Kulesza and B. Taskar. Determinantal Point Processes for Machine Learning. Foundations and Trends in Machine Learning, 5, 2012.

Z. Mariet and S. Sra. Fixed-point Algorithms for Learning Determinantal Point Processes. In International Conference on Machine Learning (ICML), 2015.

G. Nemhauser, L. Wolsey, and M. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions I. Mathematical Programming, 14, 1978.

A. Nikolov and M. Singh. Maximizing Determinants under Partition Constraints. In Symposium on the Theory of Computing (STOC), 2016.

B. Smyth and P. McClave. Similarity vs. Diversity. In International Conference on Case-Based Reasoning, 2001.

E. Suhubi. Functional Analysis. Springer Netherlands, 2003. ISBN 9781402016165.

A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, and I. Zitouni. Off-Policy Evaluation for Slate Recommendation. In Neural Information Processing Systems (NIPS), 2017.

M. Wilhelm, A. Ramanathan, A. Bonomo, S. Jain, E.H. Chi, and J. Gillenwater. Practical Diversified Recommendations on YouTube with Determinantal Point Processes. In Conference on Information and Knowledge Management (CIKM), 2018.

M. Zhang and Z. Ou. Block-Wise MAP Inference for Determinantal Point Processes with Application to Change-Point Detection. In Statistical Signal Processing Workshop (SSP), 2016.

C.
Ziegler, S. McNee, J. Konstan, and G. Lausen. Improving Recommendation Lists Through Topic Diversification. In International Conference on the World Wide Web (WWW), 2005.