{"title": "Expectation-Maximization for Learning Determinantal Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 3149, "page_last": 3157, "abstract": "A determinantal point process (DPP) is a probabilistic model of set diversity compactly parameterized by a positive semi-definite kernel matrix. To fit a DPP to a given task, we would like to learn the entries of its kernel matrix by maximizing the log-likelihood of the available data. However, log-likelihood is non-convex in the entries of the kernel matrix, and this learning problem is conjectured to be NP-hard. Thus, previous work has instead focused on more restricted convex learning settings: learning only a single weight for each row of the kernel matrix, or learning weights for a linear combination of DPPs with fixed kernel matrices. In this work we propose a novel algorithm for learning the full kernel matrix. By changing the kernel parameterization from matrix entries to eigenvalues and eigenvectors, and then lower-bounding the likelihood in the manner of expectation-maximization algorithms, we obtain an effective optimization procedure. 
We test our method on a real-world product recommendation task, and achieve relative gains of up to 16.5% in test log-likelihood compared to the naive approach of maximizing likelihood by projected gradient ascent on the entries of the kernel matrix.", "full_text": "Expectation-Maximization\n\nfor Learning Determinantal Point Processes\n\nJennifer Gillenwater\n\nComputer and Information Science\n\nUniversity of Pennsylvania\njengi@cis.upenn.edu\n\nAlex Kulesza\n\nComputer Science and Engineering\n\nUniversity of Michigan\nkulesza@umich.edu\n\nEmily Fox\nStatistics\n\nUniversity of Washington\n\nebfox@stat.washington.edu\n\ntaskar@cs.washington.edu\n\nBen Taskar\n\nComputer Science and Engineering\n\nUniversity of Washington\n\nAbstract\n\nA determinantal point process (DPP) is a probabilistic model of set diversity com-\npactly parameterized by a positive semi-de\ufb01nite kernel matrix. To \ufb01t a DPP to a\ngiven task, we would like to learn the entries of its kernel matrix by maximizing\nthe log-likelihood of the available data. However, log-likelihood is non-convex\nin the entries of the kernel matrix, and this learning problem is conjectured to be\nNP-hard [1]. Thus, previous work has instead focused on more restricted convex\nlearning settings: learning only a single weight for each row of the kernel matrix\n[2], or learning weights for a linear combination of DPPs with \ufb01xed kernel ma-\ntrices [3]. In this work we propose a novel algorithm for learning the full kernel\nmatrix. By changing the kernel parameterization from matrix entries to eigen-\nvalues and eigenvectors, and then lower-bounding the likelihood in the manner\nof expectation-maximization algorithms, we obtain an effective optimization pro-\ncedure. 
We test our method on a real-world product recommendation task, and\nachieve relative gains of up to 16.5% in test log-likelihood compared to the naive\napproach of maximizing likelihood by projected gradient ascent on the entries of\nthe kernel matrix.\n\n1\n\nIntroduction\n\nSubset selection is a core task in many real-world applications. For example, in product recom-\nmendation we typically want to choose a small set of products from a large collection; many other\nexamples of subset selection tasks turn up in domains like document summarization [4, 5], sensor\nplacement [6, 7], image search [3, 8], and auction revenue maximization [9], to name a few. In\nthese applications, a good subset is often one whose individual items are all high-quality, but also all\ndistinct. For instance, recommended products should be popular, but they should also be diverse to\nincrease the chance that a user \ufb01nds at least one of them interesting. Determinantal point processes\n(DPPs) offer one way to model this tradeoff; a DPP de\ufb01nes a distribution over all possible subsets\nof a ground set, and the mass it assigns to any given set is a balanced measure of that set\u2019s quality\nand diversity.\nOriginally discovered as models of fermions [10], DPPs have recently been effectively adapted for a\nvariety of machine learning tasks [8, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 20]. They offer attractive\ncomputational properties, including exact and ef\ufb01cient normalization, marginalization, conditioning,\nand sampling [21]. These properties arise in part from the fact that a DPP can be compactly param-\n\n1\n\n\feterized by an N \u00d7 N positive semi-de\ufb01nite matrix L. Unfortunately, though, learning L from\nexample subsets by maximizing likelihood is conjectured to be NP-hard [1, Conjecture 4.1]. 
While\ngradient ascent can be applied in an attempt to approximately optimize the likelihood objective, we\nshow later that it requires a projection step that often produces degenerate results.\nFor this reason, in most previous work only partial learning of L has been attempted. [2] showed that\nthe problem of learning a scalar weight for each row of L is a convex optimization problem. This\namounts to learning what makes an item high-quality, but does not address the issue of what makes\ntwo items similar. [3] explored a different direction, learning weights for a linear combination of\nDPPs with \ufb01xed Ls. This works well in a limited setting, but requires storing a potentially large set\nof kernel matrices, and the \ufb01nal distribution is no longer a DPP, which means that many attractive\ncomputational properties are lost. [8] proposed as an alternative that one \ufb01rst assume L takes on a\nparticular parametric form, and then sample from the posterior distribution over kernel parameters\nusing Bayesian methods. This overcomes some of the disadvantages of [3]\u2019s L-ensemble method,\nbut does not allow for learning an unconstrained, non-parametric L.\nThe learning method we propose in this paper differs from those of prior work in that it does not\nassume \ufb01xed values or restrictive parameterizations for L, and exploits the eigendecomposition of L.\nMany properties of a DPP can be simply characterized in terms of the eigenvalues and eigenvectors\nof L, and working with this decomposition allows us to develop an expectation-maximization (EM)\nstyle optimization algorithm. This algorithm negates the need for the problematic projection step that\nis required for naive gradient ascent to maintain positive semi-de\ufb01niteness of L. As the experiments\nshow, a projection step can sometimes lead to learning a nearly diagonal L, which fails to model\nthe negative interactions between items. 
These interactions are vital, as they lead to the diversity-seeking nature of a DPP. The proposed EM algorithm overcomes this failing, making it more robust to initialization and dataset changes. It is also asymptotically faster than gradient ascent.\n\n2 Background\nFormally, a DPP P on a ground set of items Y = {1, . . . , N} is a probability measure on 2^Y, the set of all subsets of Y. For every Y \u2286 Y we have P(Y) \u221d det(L_Y), where L is a positive semi-definite (PSD) matrix. The subscript L_Y \u2261 [L_ij]_{i,j \u2208 Y} denotes the restriction of L to the entries indexed by elements of Y, and we have det(L_\u2205) \u2261 1. Notice that the restriction to PSD matrices ensures that all principal minors of L are non-negative, so that det(L_Y) \u2265 0 as required for a proper probability distribution. The normalization constant for the distribution can be computed explicitly thanks to the fact that \u03a3_Y det(L_Y) = det(L + I), where I is the N \u00d7 N identity matrix. Intuitively, we can think of a diagonal entry L_ii as capturing the quality of item i, while an off-diagonal entry L_ij measures the similarity between items i and j.\nAn alternative representation of a DPP is given by the marginal kernel: K = L(L + I)^{\u22121}. The L-K relationship can also be written in terms of their eigendecompositions. L and K share the same eigenvectors v, and an eigenvalue \u03bb_i of K corresponds to an eigenvalue \u03bb_i/(1 \u2212 \u03bb_i) of L:\n\nK = \u03a3_{j=1}^{N} \u03bb_j v_j v_j^T \u21d4 L = \u03a3_{j=1}^{N} [\u03bb_j/(1 \u2212 \u03bb_j)] v_j v_j^T . (1)\n\nClearly, if L is PSD then K is as well, and the above equations also imply that the eigenvalues of K are further restricted to be \u2264 1.
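The L-K conversion and normalization above can be checked numerically. The following is a minimal numpy sketch (a toy random kernel, not the paper's code or data): it forms K = L(L + I)^{-1}, verifies the eigenvalue relation of Equation (1), and confirms that P(Y) = det(L_Y)/det(L + I) sums to 1 over all 2^N subsets.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4

# Toy PSD kernel L (illustrative values only).
A = rng.normal(size=(N, N))
L = A @ A.T

# Marginal kernel: K = L(L + I)^{-1}.
K = L @ np.linalg.inv(L + np.eye(N))

# Eigenvalue relation of Eq. (1), read in the L -> K direction:
# an eigenvalue mu of L corresponds to mu / (1 + mu) of K.
mu = np.linalg.eigvalsh(L)
lam = np.linalg.eigvalsh(K)
assert np.allclose(np.sort(mu / (1 + mu)), np.sort(lam))

def dpp_prob(L, Y):
    """P(Y) = det(L_Y) / det(L + I), with det(L_empty) = 1."""
    LY = L[np.ix_(Y, Y)]
    num = np.linalg.det(LY) if len(Y) else 1.0
    return num / np.linalg.det(L + np.eye(len(L)))

# The subset probabilities form a proper distribution: they sum to 1.
total = sum(dpp_prob(L, list(Y))
            for r in range(N + 1)
            for Y in itertools.combinations(range(N), r))
assert np.isclose(total, 1.0)
```

The final assertion is exactly the normalization identity \u03a3_Y det(L_Y) = det(L + I) stated above.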
K is called the marginal kernel because, for any set Y \u223c P and for every A \u2286 Y:\n\nP(A \u2286 Y) = det(K_A) . (2)\n\nWe can also write the exact (non-marginal, normalized) probability of a set Y \u223c P in terms of K:\n\nP(Y) = det(L_Y)/det(L + I) = |det(K \u2212 I_{\u0232})| , (3)\n\nwhere I_{\u0232} is the identity matrix with entry (i, i) zeroed for items i \u2208 Y [1, Equation 3.69]. In what follows we use the K-based formula for P(Y) and learn the marginal kernel K. This is equivalent to learning L, as Equation (1) can be applied to convert from K to L.\n\n3 Learning algorithms\nIn our learning setting the input consists of n example subsets, {Y_1, . . . , Y_n}, where Y_i \u2286 {1, . . . , N} for all i. Our goal is to maximize the likelihood of these example sets. We first describe in Section 3.1 a naive optimization procedure: projected gradient ascent on the entries of the marginal kernel K, which will serve as a baseline in our experiments. We then develop an EM method: Section 3.2 changes variables from kernel entries to eigenvalues and eigenvectors (introducing a hidden variable in the process), Section 3.3 applies Jensen\u2019s inequality to lower-bound the objective, and Sections 3.4 and 3.5 outline a coordinate ascent procedure on this lower bound.\n\n3.1 Projected gradient ascent\nThe log-likelihood maximization problem, based on Equation (3), is:\n\nmax_K \u03a3_{i=1}^{n} log(|det(K \u2212 I_{\u0232_i})|) s.t. K \u2ab0 0, I \u2212 K \u2ab0 0 (4)\n\nwhere the first constraint ensures that K is PSD and the second puts an upper limit of 1 on its eigenvalues. Let L(K) represent this log-likelihood objective.
Its partial derivative with respect to K is easy to compute by applying a standard matrix derivative rule [22, Equation 57]:\n\n\u2202L(K)/\u2202K = \u03a3_{i=1}^{n} (K \u2212 I_{\u0232_i})^{\u22121} . (5)\n\nThus, projected gradient ascent [23] is a viable, simple optimization technique. Algorithm 1 outlines this method, which we refer to as K-Ascent (KA). The initial K supplied as input to the algorithm can be any PSD matrix with eigenvalues \u2264 1. The first part of the projection step, max(\u03bb, 0), chooses the closest (in Frobenius norm) PSD matrix to Q [24, Equation 1]. The second part, min(\u03bb, 1), caps the eigenvalues at 1. (Notice that only the eigenvalues have to be projected; K remains symmetric after the gradient step, so its eigenvectors are already guaranteed to be real.)\nUnfortunately, the projection can take us to a poor local optimum. To see this, consider the case where the starting kernel K is a poor fit to the data. In this case, a large initial step size \u03b7 will probably be accepted; even though such a step will likely result in the truncation of many eigenvalues at 0, the resulting matrix will still be an improvement over the poor initial K. However, with many zero eigenvalues, the new K will be near-diagonal, and, unfortunately, Equation (5) dictates that if the current K is diagonal, then its gradient is as well. Thus, the KA algorithm cannot easily move to any highly non-diagonal matrix. It is possible that employing more complex step-size selection mechanisms could alleviate this problem, but the EM algorithm we develop in the next section negates the need for them entirely.\nThe EM algorithm we develop also has an advantage in terms of asymptotic runtime.
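The projection step discussed above (clipping eigenvalues into [0, 1]) can be sketched in a few lines of numpy. This is an illustrative version, not the authors' implementation:

```python
import numpy as np

def project_K(Q):
    """Project a symmetric matrix onto the feasible set of marginal
    kernels {K : all eigenvalues in [0, 1]} by clipping eigenvalues,
    as in the inner loop of Algorithm 1 (K-Ascent)."""
    lam, V = np.linalg.eigh(Q)          # Q stays symmetric after a gradient step
    lam = np.clip(lam, 0.0, 1.0)        # max(lam, 0) then min(lam, 1)
    return V @ np.diag(lam) @ V.T

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
Q = (A + A.T) / 2                       # symmetric but indefinite test matrix
K = project_K(Q)
lam = np.linalg.eigvalsh(K)
assert lam.min() >= -1e-10 and lam.max() <= 1 + 1e-10
```

The clipping at 0 is what produces the near-diagonal failure mode described above: once many eigenvalues are truncated to zero, the gradient in Equation (5) keeps K close to diagonal.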
The computational complexity of KA is dominated by the matrix inverses of the L derivative, each of which requires O(N^3) operations, and by the eigendecomposition needed for the projection, also O(N^3). The overall runtime of KA, assuming T_1 iterations until convergence and an average of T_2 iterations to find a step size, is O(T_1 n N^3 + T_1 T_2 N^3). As we will show in the following sections, the overall runtime of the EM algorithm is O(T_1 n N k^2 + T_1 T_2 N^3), which can be substantially better than KA\u2019s runtime for k \u226a N.\n\nAlgorithm 1 K-Ascent (KA)\nInput: K, {Y_1, . . . , Y_n}, c\nrepeat\n  G \u2190 \u2202L(K)/\u2202K (Eq. 5)\n  \u03b7 \u2190 1\n  repeat\n    Q \u2190 K + \u03b7G\n    Eigendecompose Q into V, \u03bb\n    \u03bb \u2190 min(max(\u03bb, 0), 1)\n    Q \u2190 V diag(\u03bb) V^T\n    \u03b7 \u2190 \u03b7/2\n  until L(Q) > L(K)\n  \u03b4 \u2190 L(Q) \u2212 L(K)\n  K \u2190 Q\nuntil \u03b4 < c\nOutput: K\n\nAlgorithm 2 Expectation-Maximization (EM)\nInput: K, {Y_1, . . . , Y_n}, c\nEigendecompose K into V, \u03bb\nrepeat\n  for j = 1, . . . , N do\n    \u03bb\u2032_j \u2190 (1/n) \u03a3_i p_K(j \u2208 J | Y_i) (Eq. 19)\n  G \u2190 \u2202F(V, \u03bb\u2032)/\u2202V (Eq. 20)\n  \u03b7 \u2190 1\n  repeat\n    V\u2032 \u2190 V exp[\u03b7(V^T G \u2212 G^T V)]\n    \u03b7 \u2190 \u03b7/2\n  until L(V\u2032, \u03bb\u2032) > L(V, \u03bb\u2032)\n  \u03b4 \u2190 F(V\u2032, \u03bb\u2032) \u2212 F(V, \u03bb)\n  \u03bb \u2190 \u03bb\u2032, V \u2190 V\u2032, \u03b7 \u2190 2\u03b7\nuntil \u03b4 < c\nOutput: K\n\n3.2 Eigendecomposing\nEigendecomposition is key to many core DPP algorithms such as sampling and marginalization. This is because the eigendecomposition provides an alternative view of the DPP as a generative process, which often leads to more efficient algorithms. Specifically, sampling a set Y can be broken down into a two-step process, the first of which involves generating a hidden variable J \u2286 {1, . . . , N} that codes for a particular set of K\u2019s eigenvectors. We review this process below, then exploit it to develop an EM optimization scheme.\nSuppose K = V \u039b V^T is an eigendecomposition of K. Let V^J denote the submatrix of V containing only the columns corresponding to the indices in a set J \u2286 {1, . . . , N}. Consider the corresponding marginal kernel, with all selected eigenvalues set to 1:\n\nK^{V^J} = \u03a3_{j \u2208 J} v_j v_j^T = V^J (V^J)^T . (6)\n\nAny such kernel whose eigenvalues are all 1 is called an elementary DPP. According to [21, Theorem 7], a DPP with marginal kernel K is a mixture of all 2^N possible elementary DPPs:\n\nP(Y) = \u03a3_{J \u2286 {1,...,N}} P^{V^J}(Y) \u03a0_{j \u2208 J} \u03bb_j \u03a0_{j \u2209 J} (1 \u2212 \u03bb_j) , where P^{V^J}(Y) = 1(|Y| = |J|) det(K^{V^J}_Y) . (7)\n\nThis perspective leads to an efficient DPP sampling algorithm, where a set J is first chosen according to its mixture weight in Equation (7), and then a simple algorithm is used to sample from P^{V^J} [5, Algorithm 1]. In this sense, the index set J is an intermediate hidden variable in the process for generating a sample Y.\nWe can exploit this hidden variable J to develop an EM algorithm for learning K.
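Because the mixture weight in Equation (7) factorizes over j, drawing the hidden set J reduces to N independent coin flips, one per eigenvector, with biases \u03bb_j. A small numpy sketch with toy eigenvalues (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(4)
lam = np.array([0.9, 0.5, 0.2, 0.05])   # toy eigenvalues of K

def sample_J(lam, rng):
    """Draw J with P(J) = prod_{j in J} lam_j * prod_{j not in J} (1 - lam_j),
    i.e. include each eigenvector index independently with probability lam_j."""
    return [j for j in range(len(lam)) if rng.random() < lam[j]]

# Sanity check: the empirical inclusion frequency of index j approaches lam_j.
T = 20000
counts = np.zeros(len(lam))
for _ in range(T):
    for j in sample_J(lam, rng):
        counts[j] += 1
assert np.allclose(counts / T, lam, atol=0.02)
```

A full DPP sampler would then draw Y from the elementary DPP P^{V^J} [5, Algorithm 1]; only the first stage is sketched here.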
Re-writing the data log-likelihood to make the hidden variable explicit:\n\nL(K) = L(\u039b, V) = \u03a3_{i=1}^{n} log(\u03a3_J p_K(J, Y_i)) = \u03a3_{i=1}^{n} log(\u03a3_J p_K(Y_i | J) p_K(J)) , where (8)\n\np_K(J) = \u03a0_{j \u2208 J} \u03bb_j \u03a0_{j \u2209 J} (1 \u2212 \u03bb_j) , p_K(Y_i | J) = 1(|Y_i| = |J|) det([V^J (V^J)^T]_{Y_i}) . (9)\n\nThese equations follow directly from Equations (6) and (7).\n\n3.3 Lower bounding the objective\nWe now introduce an auxiliary distribution, q(J | Y_i), and deploy it with Jensen\u2019s inequality to lower-bound the likelihood objective. This is a standard technique for developing EM schemes for dealing with hidden variables [25]. Proceeding in this direction:\n\nL(V, \u039b) = \u03a3_{i=1}^{n} log(\u03a3_J q(J | Y_i) [p_K(J, Y_i)/q(J | Y_i)]) \u2265 \u03a3_{i=1}^{n} \u03a3_J q(J | Y_i) log(p_K(J, Y_i)/q(J | Y_i)) \u2261 F(q, V, \u039b) . (10)\n\nThe function F(q, V, \u039b) can be expressed in either of the following two forms:\n\nF(q, V, \u039b) = \u03a3_{i=1}^{n} \u2212KL(q(J | Y_i) || p_K(J | Y_i)) + L(V, \u039b) (11)\n= \u03a3_{i=1}^{n} E_q[log p_K(J, Y_i)] + H(q) (12)\n\nwhere H is entropy. Consider optimizing this new objective by coordinate ascent. From Equation (11) it is clear that, holding V, \u039b constant, F is concave in q; this follows from the convexity of the KL divergence in q. Holding q constant in Equation (12) yields the following function:\n\nF(V, \u039b) = \u03a3_{i=1}^{n} \u03a3_J q(J | Y_i) [log p_K(J) + log p_K(Y_i | J)] . (13)\n\nThis expression is concave in \u03bb_j, since log is concave. However, it is not concave in V due to the non-convex V^T V = I constraint.
We describe in Section 3.5 one way to handle this.\nTo summarize, coordinate ascent on F(q, V, \u039b) alternates the following \u201cexpectation\u201d and \u201cmaximization\u201d steps; the first is concave in q, and the second is concave in the eigenvalues:\n\nE-step: min_q \u03a3_{i=1}^{n} KL(q(J | Y_i) || p_K(J | Y_i)) (14)\n\nM-step: max_{V,\u039b} \u03a3_{i=1}^{n} E_q[log p_K(J, Y_i)] s.t. 0 \u2264 \u03bb \u2264 1, V^T V = I (15)\n\n3.4 E-step\nThe E-step is easily solved by setting q(J | Y_i) = p_K(J | Y_i), which minimizes the KL divergence. Interestingly, we can show that this distribution is itself a conditional DPP, and hence can be compactly described by an N \u00d7 N kernel matrix. Thus, to complete the E-step, we simply need to construct this kernel. Lemma 1 (see the supplement for a proof) gives an explicit formula. Note that q\u2019s probability mass is restricted to sets of a particular size k, and hence we call it a k-DPP. A k-DPP is a variant of DPP that can also be efficiently sampled from and marginalized, via modifications of the standard DPP algorithms. (See the supplement and [3] for more on k-DPPs.)\nLemma 1. At the completion of the E-step, q(J | Y_i) with |Y_i| = k is a k-DPP with (non-marginal) kernel Q^{Y_i}:\n\nQ^{Y_i} = R Z^{Y_i} R , and q(J | Y_i) \u221d 1(|Y_i| = |J|) det(Q^{Y_i}_J) , where (16)\n\nU = V^T , Z^{Y_i} = U^{Y_i} (U^{Y_i})^T , and R = diag(sqrt(\u03bb/(1 \u2212 \u03bb))) . (17)\n\n3.5 M-step\nThe M-step update for the eigenvalues is a closed-form expression with no need for projection. Taking the derivative of Equation (13) with respect to \u03bb_j, setting it equal to zero, and solving for \u03bb_j:\n\n\u03bb_j = (1/n) \u03a3_{i=1}^{n} \u03a3_{J: j \u2208 J} q(J | Y_i) . (18)\n\nThe exponential-sized sum here is impractical, but we can eliminate it. Recall from Lemma 1 that q(J | Y_i) is a k-DPP with kernel Q^{Y_i}.
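Lemma 1 can be verified by brute force on a tiny ground set. The sketch below (illustrative toy values, not the paper's code) builds Q^{Yi} = R Z^{Yi} R and checks that det(Q_J) is proportional, over all |J| = |Yi|, to the joint p_K(J) p_K(Yi | J) from Equation (9):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N = 5

# Toy marginal kernel K = V diag(lam) V^T (hypothetical eigenvalues/eigenvectors).
V = np.linalg.qr(rng.normal(size=(N, N)))[0]     # orthonormal columns
lam = rng.uniform(0.05, 0.95, size=N)            # strictly below 1 (see the R caveat)
Yi = [0, 2]                                      # one observed set, k = |Yi| = 2
k = len(Yi)

# Lemma 1: U = V^T, Z^{Yi} = U^{Yi} (U^{Yi})^T, R = diag(sqrt(lam / (1 - lam))).
U = V.T
UYi = U[:, Yi]                                   # columns of U for the items in Yi
Z = UYi @ UYi.T
R = np.diag(np.sqrt(lam / (1 - lam)))
Q = R @ Z @ R

def unnorm_q(J):
    """Unnormalized q(J | Yi) from Lemma 1: det(Q_J) for |J| = k."""
    return np.linalg.det(Q[np.ix_(J, J)])

def joint(J):
    """p_K(J) * p_K(Yi | J) from Eq. (9)."""
    out = [j for j in range(N) if j not in J]
    pJ = np.prod(lam[list(J)]) * np.prod(1 - lam[out])
    VJ = V[:, list(J)]
    return pJ * np.linalg.det((VJ @ VJ.T)[np.ix_(Yi, Yi)])

Js = list(itertools.combinations(range(N), k))
qs = np.array([unnorm_q(list(J)) for J in Js])
ps = np.array([joint(J) for J in Js])
assert np.allclose(qs / qs.sum(), ps / ps.sum())
```

The constant of proportionality works out to \u03a0_j (1 \u2212 \u03bb_j), which cancels under normalization.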
Thus, we can use k-DPP marginalization algorithms to efficiently compute the sum over J. More concretely, let V\u0302 represent the eigenvectors of Q^{Y_i}, with v\u0302_r(j) indicating the jth element of the rth eigenvector. Then the marginals are:\n\n\u03a3_{J: j \u2208 J} q(J | Y_i) = q(j \u2208 J | Y_i) = \u03a3_r v\u0302_r(j)^2 , (19)\n\nwhere the sum runs over the eigenvectors v\u0302_r of Q^{Y_i} with nonzero eigenvalue. This allows us to compute the eigenvalue updates in time O(nNk^2), for k = max_i |Y_i|. (See the supplement for the derivation of Equation (19) and its computational complexity.) Note that this update is self-normalizing, so explicit enforcement of the 0 \u2264 \u03bb_j \u2264 1 constraint is unnecessary. There is one small caveat: the Q^{Y_i} matrix will be infinite if any \u03bb_j is exactly equal to 1 (due to R in Equation (17)). In practice, we simply tighten the constraint on \u03bb to keep it slightly below 1.\nTurning now to the M-step update for the eigenvectors, the derivative of Equation (13) with respect to V involves an exponential-size sum over J similar to that of the eigenvalue derivative. However, the terms of the sum in this case depend on V as well as on q(J | Y_i), making it hard to simplify. Yet, for the particular case of the initial gradient, where we have q = p, simplification is possible:\n\n\u2202F(V, \u039b)/\u2202V = \u03a3_{i=1}^{n} 2 B^{Y_i} (H^{Y_i})^{\u22121} V_{Y_i} R^2 , (20)\n\nwhere H^{Y_i} is the |Y_i| \u00d7 |Y_i| matrix V_{Y_i} R^2 V_{Y_i}^T and V_{Y_i} = (U^{Y_i})^T. B^{Y_i} is an N \u00d7 |Y_i| matrix containing the columns of the N \u00d7 N identity corresponding to items in Y_i; B^{Y_i} simply serves to map the gradients with respect to V_{Y_i} into the proper positions in V. This formula allows us to compute the eigenvector derivatives in time O(nNk^2), where again k = max_i |Y_i|.
(See the supplement for the derivation of Equation (20) and its computational complexity.)\nEquation (20) is only valid for the first gradient step, so in practice we do not bother to fully optimize V in each M-step; we simply take a single gradient step on V. Ideally we would repeatedly evaluate the M-step objective, Equation (13), with various step sizes to find the optimal one. However, the M-step objective is intractable to evaluate exactly, as it is an expectation with respect to an exponential-size distribution. In practice, we solve this issue by performing an E-step for each trial step size. That is, we update q\u2019s distribution to match the updated V and \u039b that define p_K, and then determine if the current step size is good by checking for improvement in the likelihood L.\nThere is also the issue of enforcing the non-convex constraint V^T V = I. We could project V to ensure this constraint, but, as previously discussed for eigenvalues, projection steps often lead to poor local optima. Thankfully, for the particular constraint associated with V, more sophisticated update techniques exist: the constraint V^T V = I corresponds to optimization over a Stiefel manifold, so the algorithm from [26, Page 326] can be employed. In practice, we simplify this algorithm by neglecting second-order information (the Hessian) and using the fact that the V in our application is full-rank. With these simplifications, the following multiplicative update is all that is needed:\n\nV \u2190 V exp[\u03b7 (V^T \u2202L/\u2202V \u2212 (\u2202L/\u2202V)^T V)] , (21)\n\nwhere exp denotes the matrix exponential and \u03b7 is the step size. Algorithm 2 summarizes the overall EM method.
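The update of Equation (21) can be sketched in numpy. The exponent is skew-symmetric, so its matrix exponential is orthogonal and V stays exactly on the Stiefel manifold. This is an illustrative sketch with a random stand-in for the gradient (the matrix exponential is computed here via a complex eigendecomposition of the skew-symmetric exponent):

```python
import numpy as np

def expm_skew(A):
    """Matrix exponential of a real skew-symmetric A.
    A is normal, hence diagonalizable: exp(A) = P diag(exp(w)) P^{-1}."""
    w, P = np.linalg.eig(A)
    return (P @ np.diag(np.exp(w)) @ np.linalg.inv(P)).real

rng = np.random.default_rng(3)
N = 6
V = np.linalg.qr(rng.normal(size=(N, N)))[0]   # current orthonormal V
G = rng.normal(size=(N, N))                    # stand-in for dL/dV (toy values)
eta = 0.1

# Eq. (21): V <- V exp[eta (V^T G - G^T V)].
S = eta * (V.T @ G - G.T @ V)                  # skew-symmetric by construction
V_new = V @ expm_skew(S)

# The multiplicative update preserves orthonormality: V_new^T V_new = I.
assert np.allclose(V_new.T @ V_new, np.eye(N), atol=1e-7)
```

This is why no projection of V is needed after the eigenvector step, in contrast to the eigenvalue clipping of Algorithm 1.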
As previously mentioned, assuming T_1 iterations until convergence and an average of T_2 iterations to find a step size, its overall runtime is O(T_1 n N k^2 + T_1 T_2 N^3). The first term in this complexity comes from the eigenvalue updates, Equation (19), and the eigenvector derivative computation, Equation (20). The second term comes from repeatedly computing the Stiefel manifold update of V, Equation (21), during the step size search.\n\n4 Experiments\nWe test the proposed EM learning method (Algorithm 2) by comparing it to K-Ascent (KA, Algorithm 1); code and data for all experiments can be downloaded from https://code.google.com/p/em-for-dpps. Both methods require a starting marginal kernel K\u0303. Note that neither EM nor KA can deal well with starting from a kernel with too many zeros. For example, starting from a diagonal kernel, both gradients, Equations (5) and (20), will be diagonal, resulting in no modeling of diversity. Thus, the two initialization options that we explore have non-trivial off-diagonals. The first of these options is relatively naive, while the other incorporates statistics from the data.\n\nFigure 1: Relative test log-likelihood differences, 100(EM \u2212 KA)/|KA|, using: (a) Wishart initialization in the full-data setting, and (b) moments-matching initialization in the data-poor setting.\n\nFor the first initialization type, we use a Wishart distribution with N degrees of freedom and an identity covariance matrix to draw L\u0303 \u223c W_N(N, I). The Wishart distribution is relatively unassuming: in terms of eigenvectors, it spreads its mass uniformly over all unitary matrices [27]. We make just one simple modification to its output to make it a better fit for practical data: we re-scale the resulting matrix by 1/N so that the corresponding DPP will place a non-trivial amount of probability mass on small sets.
(The Wishart\u2019s mean is N I, so it tends to over-emphasize larger sets unless we re-scale.) We then convert L\u0303 to K\u0303 via Equation (1).\nFor the second initialization type, we employ a form of moment matching. Let m_i and m_ij represent the normalized frequencies of single items and pairs of items in the training data:\n\nm_i = (1/n) \u03a3_{\u2113=1}^{n} 1(i \u2208 Y_\u2113) , m_ij = (1/n) \u03a3_{\u2113=1}^{n} 1(i \u2208 Y_\u2113 \u2227 j \u2208 Y_\u2113) . (22)\n\nRecalling Equation (2), we attempt to match the first and second order moments by choosing K\u0303 as:\n\nK\u0303_ii = m_i , K\u0303_ij = sqrt(max(K\u0303_ii K\u0303_jj \u2212 m_ij, 0)) . (23)\n\nTo ensure a valid starting kernel, we then project K\u0303 by clipping its eigenvalues at 0 and 1.\n\n4.1 Baby registry tests\nConsider a product recommendation task, where the ground set comprises N products that can be added to a particular category (e.g., toys or safety) in a baby registry. A very simple recommendation system might suggest products that are popular with other consumers; however, this does not account for negative interactions: if a consumer has already chosen a carseat, they most likely will not choose an additional carseat, no matter how popular it is with other consumers. DPPs are ideal for capturing such negative interactions. A learned DPP could be used to populate an initial, basic registry, as well as to provide live updates of product recommendations as a consumer builds their registry.\nTo test our DPP learning algorithms, we collected a dataset consisting of 29,632 baby registries from Amazon.com, filtering out those listing fewer than 5 or more than 100 products. Amazon characterizes each product in a baby registry as belonging to one of 18 categories, such as \u201ctoys\u201d and \u201csafety\u201d.
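The moments-matching initializer of Equations (22) and (23), followed by the eigenvalue clipping described above, can be sketched as follows (illustrative numpy with a tiny hypothetical dataset, not the paper's code):

```python
import numpy as np

def moments_init(data, N):
    """Eqs. (22)-(23): K_ii = m_i, K_ij = sqrt(max(K_ii K_jj - m_ij, 0)),
    then clip eigenvalues into [0, 1] to get a valid starting kernel."""
    n = len(data)
    m = np.zeros((N, N))            # m[i,i] = item frequency, m[i,j] = pair frequency
    for Y in data:
        for i in Y:
            for j in Y:
                m[i, j] += 1
    m /= n
    K = np.zeros((N, N))
    for i in range(N):
        K[i, i] = m[i, i]
    for i in range(N):
        for j in range(N):
            if i != j:
                # From Eq. (2): P(i,j in Y) = K_ii K_jj - K_ij^2.
                K[i, j] = np.sqrt(max(K[i, i] * K[j, j] - m[i, j], 0.0))
    lam, V = np.linalg.eigh(K)
    return V @ np.diag(np.clip(lam, 0.0, 1.0)) @ V.T

data = [[0, 1], [0, 2], [1, 2, 3], [0, 3]]      # toy training sets
K0 = moments_init(data, N=4)
lam = np.linalg.eigvalsh(K0)
assert lam.min() >= -1e-10 and lam.max() <= 1 + 1e-10
```

The off-diagonal formula follows from Equation (2) applied to pairs: det(K_{ij}) = K_ii K_jj \u2212 K_ij^2 is matched to the observed pair frequency m_ij.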
For each registry, we created sub-registries by splitting it according to these categories. (A registry with 5 toy items and 10 safety items produces two sub-registries.) For each category, we then filtered down to its top 100 most frequent items, and removed any product that did not occur in at least 100 sub-registries. We discarded categories with N < 25 or fewer than 2N remaining (non-empty) sub-registries for training. The resulting 13 categories have an average inventory size of N = 71 products and an average number of sub-registries n = 8,585. We used 70% of the data for training and 30% for testing. Note that categories such as \u201ccarseats\u201d contain more diverse items than just their namesake; for instance, \u201ccarseats\u201d also contains items such as seat back kick protectors and rear-facing baby view mirrors. See the supplement for more dataset details and for quartile numbers for all of the experiments.\nFigure 1a shows the relative test log-likelihood differences of EM and KA when starting from a Wishart initialization. These numbers are the medians from 25 trials (draws from the Wishart). EM gains an average of 3.7%, but has a much greater advantage for some categories than for others. Speculating that EM has more of an advantage when the off-diagonal components of K are truly important\u2014when products exhibit strong negative interactions\u2014we created a matrix M for each category with the true data marginals from Equation (22) as its entries. We then checked the value of d = (1/N) ||M||_F / ||diag(M)||_2. This value correlates well with the relative gains for EM: the 4 categories for which EM has the largest gains (safety, furniture, carseats, and strollers) all exhibit d > 0.025, while categories such as feeding and gear have d < 0.012. Investigating further, we found that, as foreshadowed in Section 3.1, KA performs particularly poorly in the high-d setting because of its projection step\u2014projection can result in KA learning a near-diagonal matrix.\n\nFigure 2: (a) A high-probability set of size k = 10 selected using an EM model for the \u201csafety\u201d category. (b) Runtime ratios.\n\nIf instead of the Wishart initialization we use the moments-matching initializer, this alleviates KA\u2019s projection problem, as it provides a starting point closer to the true kernel. With this initializer, KA and EM have comparable test log-likelihoods (average EM gain of 0.4%). However, the moments-matching initializer is not a perfect fix for the KA algorithm in all settings. For instance, consider a data-poor setting, where for each category we have only n = 2N training examples. In this case, even with the moments-matching initializer EM has a significant edge over KA, as shown in Figure 1b: EM gains an average of 4.5%, with a maximum gain of 16.5% for the safety category.\nTo give a concrete example of the advantages of EM training, Figure 2a shows a greedy approximation [28, Section 4] to the most-likely ten-item registry in the category \u201csafety\u201d, according to a Wishart-initialized EM model.
The corresponding KA selection differs from Figure 2a in that it replaces the lens filters and the head support with two additional baby monitors: \u201cMotorola MBP36 Remote Wireless Video Baby Monitor\u201d, and \u201cSummer Infant Baby Touch Digital Color Video Monitor\u201d. It seems unlikely that many consumers would select three different brands of video monitor.\nHaving established that EM is more robust than KA, we conclude with an analysis of runtimes. Figure 2b shows the ratio of KA\u2019s runtime to EM\u2019s for each category. As discussed earlier, EM is asymptotically faster than KA, and we see this borne out in practice even for the moderate values of N and n that occur in our registries dataset: on average, EM is 2.1 times faster than KA.\n\n5 Conclusion\nWe have explored learning DPPs in a setting where the kernel K is not assumed to have fixed values or a restrictive parametric form. By exploiting K\u2019s eigendecomposition, we were able to develop a novel EM learning algorithm. On a product recommendation task, we have shown EM to be faster and more robust than the naive approach of maximizing likelihood by projected gradient. In other applications for which modeling negative interactions between items is important, we anticipate that EM will similarly have a significant advantage.\n\nAcknowledgments\nThis work was supported in part by ONR Grant N00014-10-1-0746.\n\n[Figure 2a items: Graco Sweet Slumber Sound Machine; VTech Comm. Audio Monitor; Boppy Noggin Nest Head Support; Cloud b Twilight Constellation Night Light; Braun ThermoScan Lens Filters; Britax EZ-Cling Sun Shades; TL Care Organic Cotton Mittens; Regalo Easy Step Walk Thru Gate; Aquatopia Bath Thermometer Alarm; Infant Optics Video Monitor]\n[Figure 2b, KA runtime / EM runtime: feeding 0.4; gear 0.5; bedding 0.6; bath 0.7; apparel 0.7; diaper 1.1; media 1.3; furniture 1.3; health 1.9; toys 2.2; safety 4.0; carseats 6.0; strollers 7.4]\n\nReferences\n[1] A. Kulesza. Learning with Determinantal Point Processes.
PhD thesis, University of Pennsylvania, 2012.\n[2] A. Kulesza and B. Taskar. Learning Determinantal Point Processes. In Conference on Uncertainty in\n\nArti\ufb01cial Intelligence (UAI), 2011.\n\n[3] A. Kulesza and B. Taskar. k-DPPs: Fixed-Size Determinantal Point Processes. In International Confer-\n\nence on Machine Learning (ICML), 2011.\n\n[4] H. Lin and J. Bilmes. Learning Mixtures of Submodular Shells with Application to Document Summa-\n\nrization. In Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI), 2012.\n\n[5] A. Kulesza and B. Taskar. Determinantal Point Processes for Machine Learning. Foundations and Trends\n\nin Machine Learning, 5(2-3), 2012.\n\n[6] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory,\nEf\ufb01cient Algorithms, and Empirical Studies. Journal of Machine Learning Research (JMLR), 9:235\u2013284,\n2008.\n\n[7] A. Krause and C. Guestrin. Near-Optimal Non-Myopic Value of Information in Graphical Models. In\n\nConference on Uncertainty in Arti\ufb01cial Intelligence (UAI), 2005.\n\n[8] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the Parameters of Determinantal Point Process\n\nKernels. In International Conference on Machine Learning (ICML), 2014.\n\n[9] S. Dughmi, T. Roughgarden, and M. Sundararajan. Revenue Submodularity. In Electronic Commerce,\n\n2009.\n\n[10] O. Macchi. The Coincidence Approach to Stochastic Point Processes. Advances in Applied Probability,\n\n7(1), 1975.\n\n[11] J. Snoek, R. Zemel, and R. Adams. A Determinantal Point Process Latent Variable Model for Inhibition\n\nin Neural Spiking Data. In NIPS, 2013.\n\n[12] B. Kang. Fast Determinantal Point Process Sampling with Application to Clustering. In NIPS, 2013.\n[13] R. Affandi, E. Fox, and B. Taskar. Approximate Inference in Continuous Determinantal Point Processes.\n\nIn NIPS, 2013.\n\n[14] A. Shah and Z. Ghahramani. 
Determinantal Clustering Process \u2014 A Nonparametric Bayesian Approach\nIn Conference on Uncertainty in Arti\ufb01cial Intelligence\n\nto Kernel Based Semi-Supervised Clustering.\n(UAI), 2013.\n\n[15] R. Affandi, A. Kulesza, E. Fox, and B. Taskar. Nystr\u00a8om Approximation for Large-Scale Determinantal\n\nProcesses. In Conference on Arti\ufb01cial Intelligence and Statistics (AIStats), 2013.\n\n[16] J. Gillenwater, A. Kulesza, and B. Taskar. Near-Optimal MAP Inference for Determinantal Point Pro-\n\ncesses. In NIPS, 2012.\n\n[17] J. Zou and R. Adams. Priors for Diversity in Generative Latent Variable Models. In NIPS, 2013.\n[18] R. Affandi, A. Kulesza, and E. Fox. Markov Determinantal Point Processes. In Conference on Uncertainty\n\nin Arti\ufb01cial Intelligence (UAI), 2012.\n\n[19] J. Gillenwater, A. Kulesza, and B. Taskar. Discovering Diverse and Salient Threads in Document Collec-\n\ntions. In Empirical Methods in Natural Language Processing (EMNLP), 2012.\n\n[20] A. Kulesza and B. Taskar. Structured Determinantal Point Processes. In NIPS, 2010.\n[21] J. Hough, M. Krishnapur, Y. Peres, and B. Vir\u00b4ag. Determinantal Processes and Independence. Probability\n\nSurveys, 3, 2006.\n\n[22] K. Petersen and M. Pedersen. The Matrix Cookbook. Technical report, University of Denmark, 2012.\n[23] E. Levitin and B. Polyak. Constrained Minimization Methods. USSR Computational Mathematics and\n\nMathematical Physics, 6(5):1\u201350, 1966.\n\n[24] D. Henrion and J. Malick. Projection Methods for Conic Feasibility Problems. Optimization Methods\n\nand Software, 26(1):23\u201346, 2011.\n\n[25] R. Neal and G. Hinton. A New View of the EM Algorithm that Justies Incremental, Sparse and Other\n\nVariants. Learning in Graphical Models, 1998.\n\n[26] A. Edelman, T. Arias, and S. Smith. The Geometry of Algorithms with Orthogonality Constraints. SIAM\n\nJournal on Matrix Analysis and Applications (SIMAX), 1998.\n\n[27] A. James. 
Distributions of Matrix Variates and Latent Roots Derived from Normal Samples. Annals of\n\nMathematical Statistics, 35(2):475\u2013501, 1964.\n\n[28] G. Nemhauser, L. Wolsey, and M. Fisher. An Analysis of Approximations for Maximizing Submodular\n\nSet Functions I. Mathematical Programming, 14(1), 1978.\n\n9\n\n\f", "award": [], "sourceid": 1621, "authors": [{"given_name": "Jennifer", "family_name": "Gillenwater", "institution": "University of Pennsylvania"}, {"given_name": "Alex", "family_name": "Kulesza", "institution": "University of Michigan"}, {"given_name": "Emily", "family_name": "Fox", "institution": "University of Washington"}, {"given_name": "Ben", "family_name": "Taskar", "institution": "University of Washington"}]}