{"title": "Learning Signed Determinantal Point Processes through the Principal Minor Assignment Problem", "book": "Advances in Neural Information Processing Systems", "page_first": 7365, "page_last": 7374, "abstract": "Symmetric determinantal point processes (DPP) are a class of probabilistic models that encode the random selection of items that have a repulsive behavior. They have attracted a lot of attention in machine learning, where returning diverse sets of items is sought for. Sampling and learning these symmetric DPP's is pretty well understood. In this work, we consider a new class of DPP's, which we call signed DPP's, where we break the symmetry and allow attractive behaviors. We set the ground for learning signed DPP's through a method of moments, by solving the so called principal assignment problem for a class of matrices $K$ that satisfy $K_{i,j}=\\pm K_{j,i}$, $i\\neq j$, in polynomial time.", "full_text": "Learning Signed Determinantal Point Processes\nthrough the Principal Minor Assignment Problem\n\nVictor-Emmanuel Brunel\nDepartment of Mathematics\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\nvebrunel@mit.edu\n\nAbstract\n\nSymmetric determinantal point processes (DPP) are a class of probabilistic models\nthat encode the random selection of items that have a repulsive behavior. They\nhave attracted a lot of attention in machine learning, where returning diverse sets\nof items is sought for. Sampling and learning these symmetric DPP\u2019s is pretty\nwell understood. In this work, we consider a new class of DPP\u2019s, which we call\nsigned DPP\u2019s, where we break the symmetry and allow attractive behaviors. 
We set the ground for learning signed DPP\u2019s through a method of moments, by solving the so-called principal minor assignment problem for a class of matrices K that satisfy Ki,j = \u00b1Kj,i, i \u2260 j, in polynomial time.

1 Introduction

Random point processes on finite spaces are probability distributions that allow one to model random selections of sets of items from a finite collection. For example, the basket of a random customer in a store is a random subset of items selected from that store. In some contexts, random point processes are encoded as random binary vectors, where the coordinates equal to 1 correspond to the selected items. A very famous subclass of random point processes, much used in statistical mechanics, is the Ising model, where the log-likelihood function is a quadratic polynomial in the coordinates of the binary vector. More generally, Markov random fields encompass models of random point processes where stochastic dependence between the coordinates of the random vector is encoded in an undirected graph. In recent years, a different family of random point processes has attracted a lot of attention, mainly for its computational tractability: determinantal point processes (DPP\u2019s). DPP\u2019s were first studied and used in statistical mechanics [19]. Then, following the seminal work [15], discrete DPP\u2019s have been used increasingly in various applications such as recommender systems [10, 11], document and timeline summarization [18, 27], image search [15, 1] and segmentation [17], audio signal processing [26], bioinformatics [5] and neuroscience [24].

A DPP on a finite space is a random subset of that space whose inclusion probabilities are determined by the principal minors of a given matrix. More precisely, encode the finite space with labels [N] = {1, 2, . . . , N}, where N is the size of the space.
A DPP is a random subset Y \u2286 [N] such that P[J \u2286 Y] = det(KJ), for all fixed J \u2286 [N], where K is an N \u00d7 N matrix with real entries, called the kernel of the DPP, and KJ = (Ki,j)i,j\u2208J is the square submatrix of K associated with the set J.

In the applications cited above, it is assumed that K is a symmetric matrix. In that case, it is shown (e.g., see [16]) that a necessary and sufficient condition for K to be the kernel of a DPP is that all its eigenvalues are between 0 and 1. If, in addition, 1 is not an eigenvalue of K, then the DPP with kernel K is also known as an L-ensemble, where the probability mass function is proportional to the principal minors of the matrix L = K(I \u2212 K)^\u22121, where I is the N \u00d7 N identity matrix. DPP\u2019s with symmetric kernels, which we refer to as symmetric DPP\u2019s, model repulsive interactions: Indeed, they imply a strong negative dependence between items, called negative association [7].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.

Recently, symmetric DPP\u2019s have become popular in recommender systems, e.g., automated systems that seek good recommendations for users on online shopping websites [10]. The main idea is to model a random basket as a DPP and learn the kernel K based on previous observations. Then, for a new customer, predict which items are the most likely to be selected next, given his/her current basket, by maximizing the conditional probability P[J \u222a {i} \u2286 Y | J \u2286 Y] over all items i that are not yet in the current basket J. One very attractive feature of DPP\u2019s is that if the final basket Y of a random user is modeled as a DPP, the latter conditional probability is tractable and can be computed in time polynomial in N. However, if the kernel K is symmetric, this procedure enforces diversity in the baskets that are modeled, because of the negative association property.
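Both the inclusion probability and the conditional probability above are determinants of small submatrices, so they can be computed directly. Below is a minimal sketch (assuming numpy is available; the 2-by-2 kernel is a made-up illustration, not an example from the paper):

```python
import numpy as np

def inclusion_prob(K, J):
    """P[J subset of Y] = det(K_J) for Y ~ DPP(K)."""
    J = list(J)
    return np.linalg.det(K[np.ix_(J, J)])

# Illustrative symmetric kernel; its eigenvalues lie in [0, 1],
# so it is a valid DPP kernel.
K = np.array([[0.6, 0.3],
              [0.3, 0.5]])

p_0 = inclusion_prob(K, [0])      # P[item 0 in basket] = K[0, 0] = 0.6
p_01 = inclusion_prob(K, [0, 1])  # det of the 2x2 block = 0.6*0.5 - 0.3*0.3 = 0.21
# Conditional probability used for recommendation:
# P[J u {i} subset of Y | J subset of Y] = det(K_{J u {i}}) / det(K_J)
p_cond = p_01 / p_0               # = 0.35
```

The same ratio of determinants applies to any basket J and candidate item i, which is what makes the prediction step polynomial in N.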
However, in general, not all items should be modeled as repelling each other. For instance, on a website that sells household goods, ground coffee and coffee filters should rather be modeled as attracting each other, since a user who buys ground coffee is more likely to also buy coffee filters. In this work, we extend the class of symmetric DPP\u2019s in order to account for possible attractive interactions, by considering nonsymmetric kernels. From a learning perspective, this extended model poses a question: how can the kernel be estimated from past observations? In the case of symmetric kernels, this problem has been tackled in several works [12, 1, 20, 4, 9, 10, 11, 21, 8, 25]. Here, we assume that K is nonparametric, i.e., it is not parametrized by a low-dimensional parameter. As explained in [8] in the symmetric case, the maximum likelihood approach requires solving a highly nonconvex optimization problem, and even though some algorithms have been proposed, such as fixed point algorithms [21], Expectation-Maximisation [12] and MCMC [1], neither computational nor statistical guarantees are given. The method of moments proposed in [25] provides a polynomial time algorithm based on the estimation of a small number of principal minors of K, and on finding a symmetric matrix \u02c6K whose principal minors approximately match the estimated ones. This algorithm is closely related to the principal minor assignment problem. Here, we are interested in learning a nonsymmetric kernel given available estimates of its principal minors; in order to simplify the exposition, we always assume that the available list of principal minors is exact, not approximate.

In Section 2, we recall the definition of DPP\u2019s, define a new class of nonsymmetric kernels that we call signed kernels, and characterize the set of admissible kernels under lack of symmetry.
We pose the question of identifiability of the kernel of a signed DPP and show that this question, together with the problem of learning the kernel, is related to the principal minor assignment problem. In Section 3, we propose a solution to the principal minor assignment problem for signed kernels, which yields a polynomial time learning algorithm for the kernel of a signed DPP.

2 Determinantal Point Processes

2.1 Definitions

Definition 1 (Discrete Determinantal Point Process). A Determinantal Point Process (DPP) on the finite set [N] is a random subset Y \u2286 [N] for which there exists a matrix K \u2208 R^{N\u00d7N} such that the following holds:

P[J \u2286 Y] = det(KJ), \u2200J \u2286 [N], (1)

where KJ is the submatrix of K obtained by keeping the columns and rows of K whose indices are in J. The matrix K is called the kernel of the DPP, and we write Y \u223c DPP(K).

In short, the inclusion probabilities of a DPP are given by the principal minors of some matrix K. Note that not all matrices K \u2208 R^{N\u00d7N} give rise to a DPP since, for instance, the numbers det(KJ) from (1) must all lie in [0, 1] and be nonincreasing with the set J. We call a matrix K \u2208 R^{N\u00d7N} admissible if there exists a DPP with kernel K. As a simple consequence of [16], we have the following proposition where, for all J \u2286 [N], we denote by IJ the diagonal matrix whose j-th diagonal entry is 1 if j \u2208 J, 0 otherwise.

Proposition 1. A matrix K \u2208 R^{N\u00d7N} is admissible if and only if (\u22121)^|J| det(K \u2212 IJ) \u2265 0, for all J \u2286 [N].

Proof. By [16], if Y \u223c DPP(K), then, necessarily, 0 \u2264 P[Y = J] = (\u22121)^{N\u2212|J|} det(K \u2212 I\u00afJ) for all J \u2286 [N]. Conversely, assume (\u22121)^|J| det(K \u2212 IJ) \u2265 0 for all J \u2286 [N]. Denote by pJ = (\u22121)^|\u00afJ| det(K \u2212 I\u00afJ), for all J \u2286 [N]. By a standard computation, \u2211_{J\u2286[N]} pJ = 1. Hence, one can define a random subset Y \u2286 [N] with P[Y = J] = pJ for all J \u2286 [N]. A simple application of the inclusion-exclusion principle yields that P[J \u2286 Y] = det(KJ) for all J \u2286 [N], hence, Y \u223c DPP(K).

Let K \u2208 R^{N\u00d7N}. Assume that I \u2212 K is invertible and let L = K(I \u2212 K)^\u22121. Then, I + L = (I \u2212 K)^\u22121 is invertible and, by [16], det(LJ) / det(I + L) = (\u22121)^|\u00afJ| det(K \u2212 I\u00afJ) for all J \u2286 [N]. Hence, K is admissible if and only if L is a P0-matrix, i.e., all its principal minors are nonnegative. If, in addition, K is invertible, then it is admissible if and only if L is a P-matrix, i.e., all its principal minors are positive, if and only if TK + (I \u2212 T)(I \u2212 K) is invertible for all diagonal matrices T with entries in [0, 1] (see [14, Theorem 3.3]). Hence, it is easy to see that any matrix K of the form D + \u00b5A, where D is a diagonal matrix with Di,i \u2208 [\u03bb, 1 \u2212 \u03bb], i = 1, . . . , N, for some \u03bb \u2208 (0, 1/2), A \u2208 [\u22121, 1]^{N\u00d7N} and 0 \u2264 \u00b5 < \u03bb/(N \u2212 1), is admissible.

Symmetric DPP\u2019s Most commonly, DPP\u2019s are defined with a real symmetric kernel K. In that case, it is well known ([16]) that admissibility is equivalent to lying in the intersection S of two copies of the cone of positive semidefinite matrices: K \u2ab0 0 and I \u2212 K \u2ab0 0. Such processes possess a very strong property of negative dependence: negative association. A simple observation is that if Y \u223c DPP(K) for some symmetric K \u2208 S, then cov(1i\u2208Y, 1j\u2208Y) = \u2212Ki,j^2 \u2264 0, for all i, j \u2208 [N], i \u2260 j. Moreover, if J, J\u2032 are two disjoint subsets of [N], then cov(1J\u2286Y, 1J\u2032\u2286Y) = det(KJ\u222aJ\u2032) \u2212 det(KJ) det(KJ\u2032) \u2264 0. Negative association is the property that, more generally, cov(f(Y \u2229 J), g(Y \u2229 J\u2032)) \u2264 0 for all disjoint subsets J, J\u2032 \u2286 [N] and for all nondecreasing functions f, g : P([N]) \u2192 R (i.e., f(J1) \u2264 f(J2), \u2200J1 \u2286 J2 \u2286 [N]), where P([N]) is the power set of [N]. We refer to [6] for more details on the account of negative association. For their computational appeal, it is very tempting to apply DPP\u2019s in order to model interactions, e.g., as an alternative to Ising models. However, the negative association property of DPP\u2019s with symmetric kernels is unreasonably restrictive in several contexts, for it forces repulsive interactions between items. Next, we extend the class of DPP\u2019s with symmetric kernels in a simple way that also allows for attractive interactions.

Signed DPP\u2019s We introduce the class T of signed kernels, i.e., matrices K \u2208 R^{N\u00d7N} such that for all i, j \u2208 [N] with i \u2260 j, Kj,i = \u00b1Ki,j, i.e., Kj,i = \u03b5i,jKi,j for some \u03b5i,j \u2208 {\u22121, 1}. We call a signed DPP any DPP with kernel K \u2208 T. Of particular interest, one can also consider signed block DPP\u2019s, with kernels K \u2208 T for which there is a partition of [N] into pairwise disjoint, nonempty groups such that Kj,i = \u2212Ki,j if i and j are in the same group (hence, i and j attract each other), and Kj,i = Ki,j if i and j are in different groups (hence, i and j repel each other).

2.2 Learning DPP\u2019s

The main purpose of this work is to understand how to learn the kernel of a nonsymmetric DPP, given i.i.d. copies of that DPP. Namely, if Y1, . . . , Yn i.i.d.\u223c DPP(K) for some unknown K \u2208 T, how to estimate K from the observation of Y1, . . . , Yn? First comes the question of identifiability of K: two matrices K, K\u2032 \u2208 T can give rise to the same DPP.
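One concrete source of such non-uniqueness, used here purely as an illustration: conjugating K by a diagonal sign matrix D = diag(±1) leaves every principal minor unchanged, since det(D_J K_J D_J) = det(D_J)^2 det(K_J) = det(K_J). A quick numerical check of this fact, assuming numpy is available; the 3-by-3 signed kernel below is made up for illustration:

```python
import numpy as np
from itertools import combinations

def principal_minors(K):
    """All principal minors det(K_J), for nonempty J, keyed by J."""
    N = K.shape[0]
    return {J: np.linalg.det(K[np.ix_(J, J)])
            for r in range(1, N + 1)
            for J in combinations(range(N), r)}

# Illustrative signed kernel (K[j, i] = -K[i, j] off the diagonal).
K = np.array([[0.5, 0.2, 0.1],
              [-0.2, 0.5, 0.2],
              [-0.1, -0.2, 0.5]])
D = np.diag([1.0, -1.0, 1.0])  # any diagonal of signs works
K2 = D @ K @ D                 # flips the signs of row/column 1 off-diagonals

m1, m2 = principal_minors(K), principal_minors(K2)
same = all(abs(m1[J] - m2[J]) < 1e-12 for J in m1)
# K != K2 entrywise, yet `same` is True: both kernels give the same DPP.
```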
To be more specific, DPP(K) = DPP(K\u2032) if and only if K and K\u2032 have the same list of principal minors. Hence, the kernel of a DPP is not necessarily unique. It is actually easy to see that it is unique if and only if it is diagonal. A first natural question that arises in learning the kernel of a DPP is the following:

\u201cWhat is the collection of all matrices K \u2208 T that produce a given DPP?\u201d

Given that the kernel of Y1 is not uniquely defined, the goal is no longer to estimate K exactly, but one possible kernel that would give rise to the same DPP as K. The route that we follow is similar to that of [25], which is based on a method of moments. However, the lack of symmetry of K requires significantly different ideas. The idea is based on the fact that only a few principal minors of K are necessary in order to completely recover K up to identifiability. Moreover, each principal minor \u2206J := det(KJ) can be estimated from the samples by \u02c6\u2206J = n^\u22121 \u2211_{i=1}^{n} 1_{J\u2286Yi}. Since this last step is straightforward, we only focus on the problem of complete recovery of K, up to identifiability, given a list of a few of its principal minors. In other words, we will ask the following question:

\u201cGiven an available list of prescribed principal minors, how to recover a matrix K \u2208 T whose principal minors are given by that list, using as few queries from that list as possible?\u201d

This question, together with the one we asked for identifiability, is known as the principal minor assignment problem, which we state precisely in the next section.

2.3 The principal minor assignment problem

The principal minor assignment problem (PMA) is a well-known problem in linear algebra that consists of finding a matrix with a prescribed list of principal minors [23]. Let H \u2286 C^{N\u00d7N} be a collection of matrices. Typically, H is the set of Hermitian matrices, or real symmetric matrices or, in this work, H = T. Given a list (aJ)_{J\u2286[N], J\u2260\u2205} of 2^N \u2212 1 complex numbers, (PMA) asks the following two questions:

(PMA1) Find a matrix K \u2208 H such that det(KJ) = aJ, \u2200J \u2286 [N], J \u2260 \u2205.

(PMA2) Describe the set of all solutions of (PMA1).

A third question, which we do not address here, is to decide whether (PMA1) has a solution. It is known that this would require the aJ\u2019s to satisfy polynomial equations [22]. Here, we assume that a solution exists, i.e., the list (aJ)_{J\u2286[N], J\u2260\u2205} is a valid list of prescribed principal minors, and we aim to answer (PMA1) efficiently, i.e., output a solution in polynomial time in the size N of the problem, and to answer (PMA2) at a purely theoretical level. In the framework of DPP\u2019s, (PMA1) is related to the problem of estimating K by a method of moments and (PMA2) concerns the identifiability of K.

3 Solving the principal minor assignment problem for nonsymmetric DPP\u2019s

3.1 Preliminaries: PMA for symmetric matrices

Here, we briefly describe the PMA problem for symmetric matrices, i.e., H = S, the set of real symmetric N \u00d7 N matrices. This will give some intuition for the next section.

Fact 1. The principal minors of order one and two of a symmetric matrix completely determine its diagonal entries and the magnitudes of its off-diagonal entries.

The adjacency graph GK = ([N], EK) of a matrix K \u2208 S is the undirected graph on N vertices where, for all i, j \u2208 [N], {i, j} \u2208 EK \u21d0\u21d2 Ki,j \u2260 0. As a consequence of Fact 1, we have:

Fact 2. The adjacency graph of any symmetric solution of (PMA1) can be learned by querying the principal minors of order one and two.
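The moment estimator of each principal minor det(KJ) is simply the empirical frequency of observed sets containing J. A minimal sketch in plain Python, with made-up sample baskets for illustration:

```python
def estimate_minor(samples, J):
    """Method-of-moments estimate of Delta_J = det(K_J):
    the fraction of observed sets that contain J."""
    J = set(J)
    return sum(J <= set(Y) for Y in samples) / len(samples)

# Hypothetical observed baskets over items {0, 1, 2}.
samples = [{0}, {0, 1}, {1, 2}, {0, 1, 2}, set()]

d_0 = estimate_minor(samples, {0})      # 3 of 5 baskets contain item 0
d_01 = estimate_minor(samples, {0, 1})  # 2 of 5 baskets contain both 0 and 1
```

With i.i.d. samples, each such estimate converges at the usual n^(-1/2) rate, which is why the paper treats this step as straightforward and focuses on recovering K from the minors themselves.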
Moreover, any two symmetric solutions of (PMA1) have the same adjacency graph.

Then, the signs of the off-diagonal entries of a symmetric solution of (PMA1) should be determined using queries of higher-order principal minors, and the idea is based on the next fact. For a matrix K \u2208 S and a cycle C in GK, denote by \u03c0K(C) the product of the entries of K along the cycle C, i.e., \u03c0K(C) = \u220f_{{i,j}\u2208C : i