{"title": "Structured Determinantal Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1171, "page_last": 1179, "abstract": "We present a novel probabilistic model for distributions over sets of structures -- for example, sets of sequences, trees, or graphs. The critical characteristic of our model is a preference for diversity: sets containing dissimilar structures are more likely. Our model is a marriage of structured probabilistic models, like Markov random fields and context free grammars, with determinantal point processes, which arise in quantum physics as models of particles with repulsive interactions. We extend the determinantal point process model to handle an exponentially-sized set of particles (structures) via a natural factorization of the model into parts. We show how this factorization leads to tractable algorithms for exact inference, including computing marginals, computing conditional probabilities, and sampling. Our algorithms exploit a novel polynomially-sized dual representation of determinantal point processes, and use message passing over a special semiring to compute relevant quantities. We illustrate the advantages of the model on tracking and articulated pose estimation problems.", "full_text": "Structured Determinantal Point Processes\n\nAlex Kulesza\n\nBen Taskar\n\nDepartment of Computer and Information Science\n\nUniversity of Pennsylvania\n\nPhiladelphia, PA 19104\n\n{kulesza,taskar}@cis.upenn.edu\n\nAbstract\n\nWe present a novel probabilistic model for distributions over sets of structures\u2014\nfor example, sets of sequences, trees, or graphs. The critical characteristic of our\nmodel is a preference for diversity: sets containing dissimilar structures are more\nlikely. 
Our model is a marriage of structured probabilistic models, like Markov random fields and context free grammars, with determinantal point processes, which arise in quantum physics as models of particles with repulsive interactions. We extend the determinantal point process model to handle an exponentially-sized set of particles (structures) via a natural factorization of the model into parts. We show how this factorization leads to tractable algorithms for exact inference, including computing marginals, computing conditional probabilities, and sampling. Our algorithms exploit a novel polynomially-sized dual representation of determinantal point processes, and use message passing over a special semiring to compute relevant quantities. We illustrate the advantages of the model on tracking and articulated pose estimation problems.

1 Introduction

The need for distributions over sets of structures arises frequently in computer vision, computational biology, and natural language processing. For example, in multiple target tracking, sets of structures of interest are multiple object trajectories [6]. In gene finding, sets of structures of interest are multiple proteins coded by a single gene via alternative splicing [13]. In machine translation, sets of structures of interest are multiple interpretations or parses of a sentence in a different language [12].

Consider as a running example the problem of detecting and tracking several objects of the same type (e.g., cars, people, faces) in a video, assuming the number of objects is not known a priori.
We would like a distribution over sets of trajectories that (1) includes sets of different cardinality and (2) prefers sets of trajectories that are spread out in space-time, as objects are likely to be [11, 15].

Determinantal point processes [10] are attractive models for distributions over sets, because they concisely capture probabilistic mutual exclusion between items via a kernel matrix that determines which items are similar and therefore less likely to appear together. Intuitively, the model balances the diversity of a set against the quality of the items it contains (for example, the observation likelihood of an object along its trajectory, or motion smoothness). Remarkably, algorithms for computing certain marginal and conditional probabilities, as well as sampling from this model, are O(N^3), where N is the total number of possible items, even though there are 2^N possible subsets of a set of size N [7, 1].

Figure 1: (a) A set of points in the plane drawn from a DPP (left), and the same number of points sampled independently (right). (b) The first three steps of sampling a DPP on a set of one-dimensional particle positions, from left to right. Red circles indicate already selected positions. The DPP naturally reduces the probabilities for positions that are similar to those already selected.

The problem, however, is that in our setting the total number of possible trajectories N is exponential in the number of time steps. More generally, we consider modeling distributions over sets of structures (e.g., sequences, trees, graphs) where the total number of possible structures is exponential. Our structured determinantal point process model (SDPP) captures such distributions by combining structured probabilistic models (e.g., a Markov random field to model individual trajectory quality)
We introduce a natural factorization of the determinantal model into parts (as in graphical models and grammars), and show that this factorization, together with a novel dual representation of the process, enables tractable inference and sampling using message passing algorithms over a special semiring. The contributions of this paper are: (1) introducing SDPPs, (2) a concise dual representation of determinantal processes, (3) tractable message passing algorithms for exact inference and sampling in SDPPs, (4) experimental validation on synthetic motion tracking and real-world pose detection problems. The paper is organized as follows: we present background on determinantal processes in Section 2 and introduce our model in Section 3; we develop inference and sampling algorithms in Section 4, and we describe experiments in Section 5.

2 Background: determinantal point processes

A point process P on a discrete set Y = {y_1, ..., y_N} is a probability measure on 2^Y, the set of all subsets of Y. P is called a determinantal point process (DPP) if there exists a positive semidefinite matrix K indexed by the elements of Y such that if Y ~ P then for every A ⊆ Y, we have

Determinantal Point Process:    P(A ⊆ Y) = det(K_A) .    (1)

Here K_A = [K_ij]_{y_i, y_j ∈ A} is the restriction of K to the entries indexed by elements of A, and we adopt det(K_∅) = 1. We will refer to K as the marginal kernel, as it contains all the information needed to compute the probability of including any subset A in Y ~ P. A few simple observations follow from Equation (1):

P(y_i ∈ Y) = K_ii    (2)
P(y_i, y_j ∈ Y) = K_ii K_jj − K_ij K_ji = P(y_i ∈ Y) P(y_j ∈ Y) − K_ij^2 .    (3)

That is, the diagonal of K gives the marginal probabilities of inclusion for individual elements of Y, and the off-diagonal elements determine the (anti-)correlations between pairs of elements: large values of K_ij imply that i and j tend not to co-occur. Note that DPPs cannot represent distributions where elements are more likely to co-occur than if they were independent: correlations are negative. Figure 1a shows the difference between sampling a set of points in the plane using a DPP (with K_ij inversely related to the distance between points i and j), which leads to a set that is spread out with good coverage, and sampling points independently, where the points exhibit random clumping.

Determinantal point processes, introduced to model fermions [10], also arise in studies of non-intersecting random paths, random spanning trees, and eigenvalues of random matrices [3, 2, 7]. The most relevant construction of DPPs for our purpose is via L-ensembles [1]. An L-ensemble defines a DPP via a positive semidefinite matrix L indexed by the elements of Y:

L-ensemble DPP:    P_L(Y) = det(L_Y) / det(L + I) ,    (4)

where I is the N × N identity matrix. Note that P_L is normalized due to the identity Σ_{Y ⊆ Y} det(L_Y) = det(L + I). L-ensembles directly define the probability of observing each subset of Y, and subsets that have higher diversity (as measured by the corresponding determinant) have higher likelihood.
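The normalization identity behind Equation (4) can be verified numerically on a tiny base set. The following NumPy sketch is purely illustrative; the random positive semidefinite kernel is our assumption, not an object from the paper:

```python
# Illustrative check of L-ensemble normalization: sum over all subsets Y of
# det(L_Y) equals det(L + I). The random p.s.d. kernel is for demonstration only.
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N))
L = A @ A.T                                  # random positive semidefinite L

total = 0.0
for r in range(N + 1):
    for Y in itertools.combinations(range(N), r):
        LY = L[np.ix_(Y, Y)]                 # restriction of L to the subset Y
        total += np.linalg.det(LY) if r > 0 else 1.0   # det of the empty matrix is 1

assert np.isclose(total, np.linalg.det(L + np.eye(N)))
```

With N = 4 the sum has only 16 terms; the same identity is what later lets the normalization be computed as det(L + I) = Π_k (λ_k + 1) from eigenvalues alone.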
To get probabilities of item co-occurrence as in Equation (1), we can compute the marginal kernel K for the L-ensemble P_L:

L-ensemble marginal kernel:    K = (L + I)^{-1} L .    (5)

Note that K can be computed from the eigendecomposition of L = Σ_{k=1}^N λ_k v_k v_k^T by a simple rescaling of the eigenvalues: K = Σ_{k=1}^N (λ_k / (λ_k + 1)) v_k v_k^T.

To get a better understanding of how L affects the marginals K, note that L can be written as a Gram matrix with L(y_i, y_j) = q(y_i) φ(y_i)^T φ(y_j) q(y_j) for q(y_i) ≥ 0 and some “feature mapping” φ(y) : Y → R^D, where D ≤ N and ||φ(y_i)||_2 = 1. We can think of q(y_i) as the “quality score” for item y_i and φ(y_i)^T φ(y_j) as the normalized “similarity” between items y_i and y_j:

L-ensemble (L = quality * similarity):    P_L(Y) ∝ det(φ(Y)^T φ(Y)) Π_{y_i ∈ Y} q^2(y_i) ,    (6)

where φ(Y) is a D × |Y| matrix with columns φ(y_i), y_i ∈ Y. We will use this quality*similarity based representation extensively below. Roughly speaking, P_L(y_i ∈ Y) increases monotonically with quality q(y_i), and P_L(y_i, y_j ∈ Y) decreases monotonically with similarity φ(y_i)^T φ(y_j).

We briefly mention a few other efficiently computable quantities of DPPs [1]:

L-ensemble conditionals:    P_L(Y = A ∪ B | A ⊆ Y) = det(L_{A∪B}) / det(L + I_{Y\A}) ,    (7)

where I_{Y\A} is the matrix with ones in the diagonal entries indexed by elements of Y \ A and zeros everywhere else. Conditional marginal probabilities P_L(B ⊆ Y | A ⊆ Y) as well as inclusion/exclusion probabilities P_L(A ⊆ Y ∧ B ∩ Y = ∅) can also be computed efficiently using eigendecompositions of L and related matrices.

Sampling

Sampling from P_L is also efficient [7].
Let L = Σ_{k=1}^N λ_k v_k v_k^T be an orthonormal eigendecomposition, and let e_i be the i-th standard basis N-vector (all zeros except for a 1 in the i-th position). Then the following algorithm samples Y ~ P_L:

Initialize: Y = ∅, V = ∅;
Add each eigenvector v_k to V independently with probability λ_k / (λ_k + 1);
while |V| > 0 do
    Select a y_i from Y with Pr(y_i) = (1/|V|) Σ_{v ∈ V} (v^T e_i)^2;
    Update Y = Y ∪ {y_i};
    Compute V_⊥, an orthonormal basis for the subspace of V orthogonal to e_i, and let V = V_⊥;
end
Return Y;

Algorithm 1: Sampling algorithm for L-ensemble DPPs.

This yields a natural and efficient procedure for sampling from P given an eigendecomposition of L. It also offers some additional insights. Because the dimension of V is reduced by one on each iteration of the loop, and because the initial dimension of V is simply the number of selected eigenvectors in step one, the size of Y is distributed as the number of successes in N Bernoulli trials where trial k succeeds with probability λ_k / (λ_k + 1). In particular, |Y| cannot be larger than rank(L), and E[|Y|] = Σ_{k=1}^N λ_k / (λ_k + 1).

To get a feel for the sampling algorithm, it is useful to visualize the distributions used to select y_i at each time step, and to see how they are influenced by previously chosen items. Figure 1b shows this progression for a simple DPP where Y is the set of points in [0, 1], quality scores are uniformly 1, and the feature mapping is such that φ(y_i)^T φ(y_j) ∝ exp(−(y_i − y_j)^2); that is, points are more similar the closer together they are.
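The procedure above can be sketched in NumPy for a small, explicit L. This is an illustrative implementation only; the QR-based re-orthonormalization and the variable names are our choices, not details from the paper:

```python
# Sketch of Algorithm 1: sample Y ~ P_L given an explicit L-ensemble kernel.
import numpy as np

def sample_dpp(L, rng):
    lam, V = np.linalg.eigh(L)                           # orthonormal eigendecomposition
    V = V[:, rng.random(len(lam)) < lam / (lam + 1.0)]   # keep v_k w.p. lam_k/(lam_k+1)
    Y = []
    while V.shape[1] > 0:
        # Pr(y_i) = (1/|V|) sum_{v in V} (v^T e_i)^2: squared row norms of V
        p = np.maximum(np.sum(V ** 2, axis=1), 0.0) / V.shape[1]
        i = int(rng.choice(len(p), p=p / p.sum()))
        Y.append(i)
        # restrict the span of V to the subspace orthogonal to e_i:
        j = int(np.argmax(np.abs(V[i])))                 # a column with nonzero e_i component
        Vj = V[:, j].copy()
        V = np.delete(V, j, axis=1)
        V -= np.outer(Vj, V[i] / Vj[i])                  # zero out coordinate i everywhere
        if V.shape[1]:
            V, _ = np.linalg.qr(V)                       # re-orthonormalize (Gram-Schmidt)
    return sorted(Y)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 6))
L = A.T @ A                                              # rank(L) <= 3
samples = [sample_dpp(L, rng) for _ in range(500)]
assert max(len(Y) for Y in samples) <= 3                 # |Y| never exceeds rank(L)
```

Note how the size bound from the text falls out for free: only eigenvectors with nonzero eigenvalue can be selected in step one, so |Y| is at most rank(L).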
Initially, the eigenvectors V give rise to a fairly uniform distribution over points in Y, but as each successive point is selected and V is updated, the distribution shifts to avoid points near those already chosen.

𝒴, Y, y_i, N : 𝒴 is the base set, Y ⊆ 𝒴 is a subset, y_i is an element of 𝒴, N = |𝒴|
L, L_Y : L is a p.s.d. matrix defining P(Y) ∝ det(L_Y); L_Y is the submatrix indexed by Y
K, K_A : K is a p.s.d. matrix defining marginals via P(A ⊆ Y) = det(K_A)
q(y_i), φ(y_i) : quality*similarity decomposition; L_ij = q(y_i) φ(y_i)^T φ(y_j) q(y_j), φ(y_j) ∈ R^D
B, C : C = B B^T is the dual of L = B^T B; the columns of B are B_i = q(y_i) φ(y_i)
α, y_iα, y_α : α is a factor of a structure; y_iα, y_α index the relevant part of the structure

Table 1: Summary of notation.

3 Structured determinantal point processes

DPPs are amazingly tractable distributions when N, the size of the base set Y, is small. However, we are interested in defining DPPs over exponentially sized Y. For example, consider the case where each y_i is itself a sequence of length T: y_i = (y_i1, ..., y_iT), where y_it is the state at time t (e.g., the location of an object in the t-th frame of a video). Assuming there are n states at each time t and all state transitions are possible, there are n^T possible sequences, so N = n^T.

In order to define a DPP over structures such as sequences or trees, we assume a factorization of the quality score q(y_i) and similarity score φ(y_i)^T φ(y_j) into parts, similar to a graphical model decomposition. For a sequence, the scores can be naturally decomposed into factors that depend on the state y_it at each time t and the states (y_it, y_i,t+1) for each transition (t, t+1).
More generally, we assume a set of factors and use the notation y_iα to refer to the α part of the structure y_i (similarly, we use y_α to refer to the α part of a structure y). We assume that quality decomposes multiplicatively and similarity decomposes additively, as follows. (As before, L(y_i, y_j) = q(y_i) φ(y_i)^T φ(y_j) q(y_j).)

Structured DPP Factorization:    q(y_i) = Π_α q(y_iα)   and   φ(y_i) = Σ_α φ(y_iα) .    (8)

We argue that these are quite natural factorizations. Quality scores, for example, can be given by a typical log-linear Markov random field, which defines a multiplicative distribution over structures. Similarity scores can be thought of as dot products between features of the two labelings.

In our tracking example, the feature mapping φ(y_it) should reflect similarity between trajectories; e.g., features could track coarse-level position at time t, so that the model considers sets with trajectories that pass near or through the same states less likely. A common problem in multiple target tracking is that one object's trajectory and its neighborhood “tube” often receive much higher probability under an HMM or CRF model than other objects' trajectories, so standard sampling from a graphical model will produce very similar, overlapping trajectories, ignoring less “detectable” targets. A sample from the structured DPP model would be much more likely to contain diverse trajectories. (See Figure 2.)

Dual representation

While the factorization in Equation (8) concisely defines a DPP over a structured Y, the more remarkable fact is that it gives rise to tractable algorithms for computing key marginals and conditionals when the set of factors is low-treewidth, just as in graphical model inference [8], even though L is too large to even write down.
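To make the factorization concrete, here is a toy instance with invented factor values: length-T sequences over n states, multiplicative node and transition qualities, and additive indicator features. Because the toy structure space is small, we can materialize B explicitly and check the properties exploited next, namely that L = B^T B shares its nonzero spectrum with the small matrix C = B B^T and has rank at most D:

```python
# Toy structured DPP factorization (Eq. 8): q multiplicative, phi additive.
# All factor values here are invented for illustration.
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 3                                   # n states, length-T sequences
node_q = rng.uniform(0.5, 1.5, size=(T, n))   # per-node quality factors q(y_t)
trans_q = rng.uniform(0.5, 1.5, size=(n, n))  # transition factors q(y_{t-1}, y_t)

def quality(y):                               # q(y): product over all factors
    q = np.prod([node_q[t, s] for t, s in enumerate(y)])
    return q * np.prod([trans_q[y[t - 1], y[t]] for t in range(1, T)])

def features(y):                              # phi(y): sum of per-node features;
    phi = np.zeros(n)                         # here phi_r(y_t) = 1[y_t = r]
    for s in y:
        phi[s] += 1.0
    return phi

seqs = list(itertools.product(range(n), repeat=T))              # N = n**T structures
B = np.stack([quality(y) * features(y) for y in seqs], axis=1)  # D x N with D = n
L = B.T @ B                                   # the structured kernel (tiny here, huge in general)
C = B @ B.T                                   # the D x D dual matrix

assert np.all(np.linalg.eigvalsh(L) > -1e-9)  # L is positive semidefinite
```

Here N = 27 but rank(L) ≤ D = 3, which is the observation that makes the dual representation introduced below worthwhile.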
We propose the following dual representation of L in order to exploit the factorization. Let us define a D × N matrix B whose columns are given by B_i = q(y_i) φ(y_i), so that L = B^T B. Consider the D × D matrix C = B B^T; note that typically D ≪ N (in fact, the rank of B is at most O(nT) in the sequence case). The eigenvalues of C and L are identical, and the eigenvectors are related as follows: if C = Σ_k λ_k v_k v_k^T, then L = Σ_k λ_k (B^T v_k)(B^T v_k)^T. That is, if v_k is the k-th eigenvector of C, then B^T v_k is the k-th eigenvector of L, and it has the same eigenvalue λ_k. This connection allows us to compute important quantities from C.

For example, to compute the L-ensemble normalization det(L + I) = Π_k (λ_k + 1) in Equation (4), we just need the eigenvalues of C. To compute C itself, we need to compute B B^T = Σ_{y_i} q^2(y_i) φ(y_i) φ(y_i)^T. This appears daunting, but the factorization turns out to offer an efficient dynamic programming solution. We discuss in more detail how to compute C for sequences (and for fixed-treewidth factors in general) in the next section.

Figure 2: Sets of (structured) particle trajectories sampled from the SDPP (top row) and independently using only quality scores (bottom row). The curves to the left indicate the quality scores for the possible initial positions.

Assuming we can compute C efficiently, we can eigen-decompose it as C = Σ_k λ_k v_k v_k^T in O(D^3). Then, to compute P_L(y_i ∈ Y), the probability of any single trajectory being included in Y ~ P_L, we have all we need:

Structured Marginal:    K_ii = Σ_k (λ_k / (λ_k + 1)) (B_i^T v_k)^2 = q^2(y_i) Σ_k (λ_k / (λ_k + 1)) (φ(y_i)^T v_k)^2 .    (9)

Similarly, given two trajectories y_i and y_j, P_L(y_i, y_j ∈ Y) = K_ii K_jj − K_ij^2, where:

K_ij = Σ_k (λ_k / (λ_k + 1)) (B_i^T v_k)(B_j^T v_k) = q(y_i) q(y_j) Σ_k (λ_k / (λ_k + 1)) (φ(y_i)^T v_k)(φ(y_j)^T v_k) .    (10)

4 Inference for SDPPs

We now turn to computing C using the factorization in Equation (8). We have

C = Σ_{y ∈ Y} q^2(y) φ(y) φ(y)^T = Σ_{y ∈ Y} (Π_α q^2(y_α)) (Σ_α φ(y_α)) (Σ_α φ(y_α))^T .    (11)

If we think of q^2(y_α) as factor potentials of a graphical model p(y) ∝ Π_α q^2(y_α), then computing C is equivalent to computing second moments of additive features (modulo the normalization Z). A naive algorithm can simply compute all O(T^2) pairwise marginals p(y_α, y_α′) and, by linearity of expectation, add up the contributions: C = Z Σ_{α,α′} Σ_{y_α, y_α′} p(y_α, y_α′) φ(y_α) φ(y_α′)^T.

However, we can use a much more efficient O(D^2 T) algorithm based on second-order semiring message passing [9]. The details are given in Appendix A of the supplementary material, but in short we apply the standard two-pass belief propagation algorithm for trees with a particular semiring in place of the usual sum-product or max-sum. By performing message passing under this second-order semiring, one can efficiently compute any quantity of the form

Σ_{y ∈ Y} (Π_α p(y_α)) (Σ_α a(y_α)) (Σ_α b(y_α))    (12)

for functions p ≥ 0, a, and b in time O(T).
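For chain-structured factors, the second-order semiring computation admits a short sketch. In the illustrative code below (invented potentials; the paper's Appendix A gives the general tree algorithm), each semiring element is a 4-tuple (z, za, zb, zab), and a single forward pass reproduces brute-force enumeration over all n^T sequences:

```python
# Sketch of a second-order semiring forward pass on a chain: computes
# sum_y (prod_t p)(sum_t a)(sum_t b) in O(n^2 T), checked by enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, T = 3, 4
node_p = rng.uniform(0.2, 1.0, size=(T, n))     # p factors on nodes (invented)
trans_p = rng.uniform(0.2, 1.0, size=(n, n))    # p factors on transitions (invented)
a_val = rng.normal(size=(T, n))                 # additive a(y_t)
b_val = rng.normal(size=(T, n))                 # additive b(y_t)

def semiring_mul(x, y):
    # elements (z, A, B, AB) with A = z*(sum a), B = z*(sum b), AB = z*(sum a)(sum b)
    z1, A1, B1, AB1 = x
    z2, A2, B2, AB2 = y
    return (z1 * z2, z1 * A2 + A1 * z2, z1 * B2 + B1 * z2,
            z1 * AB2 + AB1 * z2 + A1 * B2 + B1 * A2)

# forward pass: msg[s] aggregates all prefixes ending in state s
msg = [(node_p[0, s], node_p[0, s] * a_val[0, s], node_p[0, s] * b_val[0, s],
        node_p[0, s] * a_val[0, s] * b_val[0, s]) for s in range(n)]
for t in range(1, T):
    new = []
    for s2 in range(n):
        acc = (0.0, 0.0, 0.0, 0.0)
        for s1 in range(n):
            # lift the (transition, node) factor at step t into the semiring
            p = trans_p[s1, s2] * node_p[t, s2]
            f = (p, p * a_val[t, s2], p * b_val[t, s2], p * a_val[t, s2] * b_val[t, s2])
            term = semiring_mul(msg[s1], f)
            acc = tuple(u + v for u, v in zip(acc, term))
        new.append(acc)
    msg = new
fast = sum(m[3] for m in msg)

brute = sum(np.prod([node_p[t, y[t]] for t in range(T)]) *
            np.prod([trans_p[y[t - 1], y[t]] for t in range(1, T)]) *
            sum(a_val[t, y[t]] for t in range(T)) *
            sum(b_val[t, y[t]] for t in range(T))
            for y in itertools.product(range(n), repeat=T))
assert np.isclose(fast, brute)
```

Running this with one (a, b) pair per entry of the D × D outer product in Equation (11) is exactly how C becomes computable in O(D^2 T).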
Since the outer product in Equation (11) comprises D^2 quantities of the type in Equation (12), we can compute C in time O(D^2 T).

Sampling

As described in Section 3, the eigendecomposition of C yields an implicit representation of L: for each eigenvalue/eigenvector pair (λ_k, v_k) of C, (λ_k, B^T v_k) is a corresponding pair for L. We show that this implicit representation is enough to efficiently perform the sampling procedure in Algorithm 1.

The key is to represent V, the orthonormal set of vectors in R^N, as a set V̂ of vectors in R^D, with the mapping V = {B^T v | v ∈ V̂}. Let v_i, v_j be two arbitrary vectors in V̂. Then we have (B^T v_i)^T (B^T v_j) = v_i^T B B^T v_j = v_i^T C v_j. Thus we can compute dot products between vectors in V using their preimages in V̂. This is sufficient to compute the normalization for each eigenvector B^T v, as required to obtain an initial orthonormal basis. Trivially, we can also compute (implicit) sums between vectors in V; this combined with dot products is enough to perform the Gram-Schmidt orthonormalization needed to obtain V̂_⊥ from V̂ and the most recently selected y_i at each iteration.

All that remains, then, is to choose a structure y_i according to the distribution Pr(y_i) = (1/|V̂|) Σ_{v ∈ V̂} ((B^T v)^T e_i)^2. Recall that the columns of B are given by B_i = q(y_i) φ(y_i). Thus the distribution can be rewritten as

Pr(y_i) = (1/|V̂|) Σ_{v ∈ V̂} q^2(y_i) (v^T φ(y_i))^2 .    (13)

By assumption q^2(y_i) decomposes multiplicatively over the parts of y_i, and v^T φ(y_i) decomposes additively. Thus the distribution is a sum of |V̂| terms, each having the form of Equation (12). We can therefore apply message passing in the second-order semiring to compute marginals of this distribution; that is, for each part y_α we can compute

Σ_{y ~ y_α} (1/|V̂|) Σ_{v ∈ V̂} q^2(y) (v^T φ(y))^2 ,    (14)

where the sum is over all structures consistent with the value of y_α. This only takes O(T |V̂|) time.

In fact, the message-passing computation of these marginals yields an efficient algorithm for sampling individual full structures y_i as required by Algorithm 1; the key is to pass normal messages forward, but conditional messages backward. Suppose we have a sequence model; since the forward pass completes with correct marginals at the final node, we can correctly sample its value before any backwards messages are sent. Once the value of the final node is fixed, we pass a conditional message backwards; that is, we send zeros for all values other than the one just selected. This results in conditional marginals at the penultimate node. We can then conditionally sample its value, and repeat this process until all nodes have been assigned. Furthermore, by applying the second-order semiring we are able to sample from a distribution quite different from that of a traditional graphical model. The algorithm is described in more detail in Appendix B of the supplementary material.

5 Experiments

We begin with a synthetic motion tracking task, where the goal is to follow a collection of particles as they travel in a one-dimensional space over time. This is the structured analog of the setting shown in Figure 1b, where elements of Y are no longer single positions in [0, 1], but are now sequences of such positions over many time periods. For our experiments, we modeled paths y_i over T = 50 time steps, where at each time t a particle can be in one of 50 discretized positions, y_it ∈ {1, . . .
, 50}. The total number of possible trajectories is thus 50^50, and there are 2^(50^50) possible sets of trajectories. While a real tracking problem would involve quality scores q(y) that depend on some observations, e.g., measurements over time from a set of physical sensors, for simplicity we determine the quality of a trajectory using only its starting position and a measure of smoothness over time: q(y) = q(y_1) Π_{t=2}^T q(y_{t−1}, y_t). The initial quality scores q(y_1), depicted on the left of Figure 2, are high in the middle with secondary modes on each side. The transition quality is given by q(y_{t−1}, y_t) = f(y_{t−1} − y_t), where f is the density function of the zero-mean Gaussian with unit variance. We scale the quality scores so that the expected number of selected trajectories is 5.

We want trajectories to be considered similar if they travel through similar positions, so we define a 50-dimensional feature vector φ(y) = Σ_{t=1}^T φ(y_t), where φ_r(y_t) ∝ f(r − y_t) for r = 1, ..., 50. Intuitively, feature r is activated when the trajectory passes near position r, so trajectories passing through nearby positions will activate the same features and thus appear similar.

Figure 2 shows the results of applying our SDPP sampling algorithm to this setting. Sets of trajectories drawn independently according to quality score tend to cluster in the middle region (second row). The SDPP samples, however, are more diverse, tending to cover more of the space while still respecting the quality scores: they are still smooth, and still tend to start near the middle position.

Pose estimation

To demonstrate that SDPPs effectively model characteristics of real-world data, we apply them to a multiple-person pose estimation task.
Our dataset consists of 73 still frames taken from various TV shows, each approximately 720 by 540 pixels in size¹. As much as possible, the selected frames contain three or more people at similar scale, all facing the camera and without serious occlusions. Sample images from the dataset are shown in Figure 4. The task is to identify the location and pose of each person in the image. For our purposes, each pose is a structure containing four parts (head, torso, right arm, and left arm), each of which takes a value consisting of a pixel location and an orientation (one of 24 discretized angles). There are approximately 75,000 possible such values for each part, so there are about 75,000^4 possible poses. Each image was labeled by hand for evaluation.

We use a standard pictorial structure model [4, 5], treating each pose as a two-level tree with the torso as the root and the head and arms as leaves. Our quality scores are derived from [14]; they factorize across the nodes (body parts) P and edges (joints) J as q(y) = γ (Π_{p ∈ P} q(y_p) Π_{pp′ ∈ J} q(y_p, y_{p′}))^β. Here γ is a scale parameter that controls the expected number of poses in each sample, and β is a sharpness parameter that we found helpful in controlling the impact of the quality scores. (We set parameter values using a held-out training set; see below.) Each part receives a quality score q(y_p) given by a customized part detector previously trained on similar images. The joint quality score q(y_p, y_{p′}) is given by a Gaussian “spring” that encourages, for example, the left arm to begin near the left shoulder. Full details of the quality terms are provided in [14].

Given our data, we want to discourage the model from selecting overlapping poses, so we design our similarity features spatially. We define an evenly spaced 8 by 4 grid of reference points x_1, ..., x_32, and use φ(y) = Σ_{p ∈ P} φ(y_p), where φ_r(y_p) ∝ f(‖y_p − x_r‖_2 / σ).
Recall that f is the standard normal density function, and ‖y_p − x_r‖_2 is the distance between the position of part p (ignoring angle) and the reference point x_r. The parameter σ controls the width of the kernel. Poses that occupy the same part of the image will be near the same reference points, and thus appear similar.

We compare our model against two baselines. The first is an independent model which draws poses independently according to the distribution obtained by normalizing the quality scores. The second is a simple non-maxima suppression model that iteratively selects successive poses using the normalized quality scores, but under the hard constraint that they do not overlap with any previously selected pose. (Poses overlap if they cover any of the same pixels when rendered.) In both cases, the number of poses is given by a draw from the SDPP model, ensuring no systematic bias.

We split our data randomly into a training set of 13 images and a test set of 60 images. Using the training set, we select values for γ, β, and σ that optimize overall F1 score at radius 100 (see below), as well as distinct optimal values of β for the baselines. (γ and σ are irrelevant for the baselines.) We then use each model to sample 10 sets of poses for each test image, or 600 samples per model.

For each sample, we compute precision, recall, and F1 score. For our purposes, precision is the fraction of predicted parts where both endpoints are within a particular radius of the endpoints of an expert-labeled part of the same type (head, left arm, etc.). Correspondingly, recall is the fraction of expert-labeled parts within a given radius of a predicted part of the same type. Since our SDPP model encourages diversity, we expect to see improvements in recall at the expense of precision. F1 score is the harmonic mean of precision and recall.
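The matching criterion can be made precise with a small sketch. The data layout and helper names below are ours, not the authors' evaluation code, and for brevity we compare parts of a single type only:

```python
# Minimal radius-based part matching and precision/recall/F1 (illustrative;
# assumes all parts share one type, whereas the paper matches within each type).
import numpy as np

def part_matches(pred, gold, radius):
    # pred, gold: (2, 2) arrays holding the two endpoint coordinates of one part
    return all(np.linalg.norm(p - g) <= radius for p, g in zip(pred, gold))

def prf1(preds, golds, radius):
    # precision: predicted parts with a matching labeled part;
    # recall: labeled parts with a matching predicted part
    tp_pred = sum(any(part_matches(p, g, radius) for g in golds) for p in preds)
    tp_gold = sum(any(part_matches(p, g, radius) for p in preds) for g in golds)
    prec = tp_pred / len(preds) if preds else 0.0
    rec = tp_gold / len(golds) if golds else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1

gold = [np.array([[0.0, 0.0], [10.0, 0.0]])]
pred = [np.array([[1.0, 1.0], [11.0, 1.0]]),      # both endpoints within radius 5
        np.array([[90.0, 90.0], [99.0, 90.0]])]   # far from any labeled part
prec, rec, f1 = prf1(pred, gold, radius=5.0)
```

Sweeping `radius` over a range of thresholds yields curves like those in Figure 3, where a diversity-seeking model trades precision for recall.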
We compute all metrics separately for each sample, and then average the results across samples and images in the test set.

The results over several different radii are shown in Figure 3a. At tight tolerances the SDPP performs comparably to the independent samples (perhaps because the quality scores are only accurate at the mode, so diverse samples are not close enough to be valuable). As the radius increases, however, the SDPP obtains significantly better results, outperforming both baselines. Figure 3b shows the curves for the arms alone; the arms tend to be more difficult to locate accurately. Figure 3c shows the precision/recall obtained by each model. As expected, the SDPP model achieves its improved F1 score by increasing recall at the cost of precision.

¹The images and code from [14] are available at http://www.vision.grasp.upenn.edu/video

Figure 3: Results for pose estimation. The horizontal axis gives the distance threshold used to determine whether two parts are successfully matched. 95% confidence intervals are shown.

Figure 4: Structured marginals for the pose estimation task on successive steps of the sampling algorithm, with already selected poses superimposed. Input images are shown on the left.

For illustration, we show the sampling process for a few images in Figure 4. As in Figure 1b, the SDPP efficiently discounts poses that are similar to those already selected.

6 Conclusion

We introduced the structured determinantal point process (SDPP), a probabilistic model over sets of structures such as sequences, trees, or graphs.
We showed the intuitive “diversification” properties of the SDPP, and developed efficient message-passing algorithms to perform inference through a dual characterization of the standard DPP and a natural factorization.

Acknowledgments

The authors were partially supported by NSF Grant 0803256.

References
[1] A. Borodin. Determinantal point processes, 2009.
[2] A. Borodin and A. Soshnikov. Janossy densities. I. Determinantal ensembles. Journal of Statistical Physics, 113(3):595–610, 2003.
[3] D. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods. Springer, 2003.
[4] P. Felzenszwalb and D. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.
[5] M. Fischler and R. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 100(22), 1973.
[6] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2003.
[7] J. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.
[8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. The MIT Press, 2009.
[9] Z. Li and J. Eisner. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proc. EMNLP, 2009.
[10] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
[11] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. International Journal of Computer Vision, 39(1):57–71, 2000.
[12] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Boston, MA, 1999.
[13] T. Nilsen and B. Graveley. Expansion of the eukaryotic proteome by alternative splicing. Nature, 463(7280):457–463, 2010.
[14] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), 2010.
[15] T. Zhao and R. Nevatia. Tracking multiple humans in complex situations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:1208–1221, 2004.
", "award": [], "sourceid": 880, "authors": [{"given_name": "Alex", "family_name": "Kulesza", "institution": null}, {"given_name": "Ben", "family_name": "Taskar", "institution": null}]}