{"title": "Near-Optimal MAP Inference for Determinantal Point Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2735, "page_last": 2743, "abstract": "Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. Because DPP probabilities are log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of DPPs with monotone kernels. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a general class of non-monotone DPPs. Our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. 
We demonstrate that our approach outperforms greedy methods on both synthetic and real-world data.", "full_text": "Near-Optimal MAP Inference for Determinantal Point Processes\n\nJennifer Gillenwater, Alex Kulesza, Ben Taskar\n{jengi,kulesza,taskar}@cis.upenn.edu\nComputer and Information Science\nUniversity of Pennsylvania\n\nAbstract\n\nDeterminantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. This optimization problem, which also arises in experimental design and sensor placement, involves finding the largest principal minor of a positive semidefinite matrix. Because the objective is log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of monotone objectives, which correspond to a restricted class of DPPs. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently.
We obtain a practical algorithm with a 1/4-approximation guarantee for a more general class of non-monotone DPPs; our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms standard and recent methods on both synthetic and real-world data.\n\n1 Introduction\n\nInformative subset selection problems arise in many applications where a small number of items must be chosen to represent or cover a much larger set; for instance, text summarization [1, 2], document and image search [3, 4, 5], sensor placement [6], viral marketing [7], and many others. Recently, probabilistic models extending determinantal point processes (DPPs) [8, 9] were proposed for several such problems [10, 5, 11]. DPPs offer computationally attractive properties, including exact and efficient computation of marginals [8], sampling [12, 5], and (partial) parameter estimation [13]. They are characterized by a notion of diversity, as shown in Figure 1; points in the plane sampled from a DPP (center) are more spread out than those sampled independently (left).\n\nHowever, in many cases we would like to make use of the most likely configuration (MAP inference, right), which involves finding the largest principal minor of a positive semidefinite matrix. This is an NP-hard problem [14], and so we must resort to approximate inference methods.
The DPP probability is a log-submodular function, and hence greedy algorithms are natural; however, the standard greedy algorithm of Nemhauser and Wolsey [15] offers an approximation guarantee of 1 − 1/e only for non-decreasing (monotone) submodular functions, and does not apply for general DPPs.\n\nFigure 1: From left to right, a set of points in the plane sampled independently at random, a sample drawn from a DPP, and an approximation of the DPP MAP set estimated by our algorithm. (Panels: Independent, DPP sample, DPP MAP.)\n\nIn addition, we are often interested in conditioning MAP inference on knapsack-type budget constraints, matroid constraints, or general polytope constraints. For example, we might consider a DPP model over edges of a bipartite graph and ask for the most likely set under the one-to-one matching constraint. In this paper we propose a new algorithm for approximating MAP inference that handles these types of constraints for non-monotone DPPs.\n\nRecent work on non-monotone submodular function optimization can be broadly split into combinatorial versus continuous approaches. Among combinatorial methods, modified greedy, local search, and simulated annealing algorithms provide certain constant-factor guarantees [16, 17, 18] and have recently been extended to optimization under knapsack and matroid constraints [19, 20]. Continuous methods [21, 22] use a multilinear extension of the submodular set function to the convex hull of the feasible sets and then round fractional solutions obtained by maximizing in the interior of the polytope. Our algorithm falls into the continuous category, using a novel and efficient non-linear continuous extension specifically tailored to DPPs. In comparison to the constant-factor algorithms for general submodular functions, our approach is more efficient because we have explicit access to the objective function and its gradient.
In contrast, methods for general submodular functions assume only a simple function oracle and need to employ sampling to estimate function and gradient values in the polytope interior. We show that our non-linear extension enjoys some of the critical properties of the standard multilinear extension and propose an efficient algorithm that can handle solvable polytope constraints. Our algorithm compares favorably to greedy and recent \u201csymmetric\u201d greedy [18] methods on unconstrained simulated problems, simulated problems under matching constraints, and a real-world matching task using quotes from political candidates.\n\n2 Background\n\nDeterminantal point processes (DPPs) are distributions over subsets that prefer diversity. Originally, DPPs were introduced to model fermions in quantum physics [8], but since then they have arisen in a variety of other settings including non-intersecting random paths, random spanning trees, and eigenvalues of random matrices [9, 23, 12]. More recently, they have been applied as probabilistic models for machine learning problems [10, 13, 5, 11].\n\nFormally, a DPP P on a ground set of items Y = {1, 2, . . . , N} is a probability measure on 2^Y, the set of all subsets of Y. For every subset Y ⊆ Y we have\n\nP(Y) ∝ det(L_Y) ,  (1)\n\nwhere L is a positive semidefinite matrix. L_Y ≡ [L_ij]_{i,j ∈ Y} denotes the restriction of L to the entries indexed by elements of Y, and det(L_∅) = 1. If L is written as a Gram matrix, L = BᵀB, then the quantity det(L_Y) can be interpreted as the squared volume spanned by the column vectors B_i for i ∈ Y. If L_ij = B_iᵀB_j is viewed as a measure of similarity between items i and j, then when i and j are similar their vectors are relatively non-orthogonal, and therefore sets including both i and j will span less volume and be less probable. This is illustrated in Figure 2.
As a result, DPPs assign higher probability to sets that are diverse under L.\n\nFigure 2: (a) The DPP probability of a set Y depends on the volume spanned by vectors B_i for i ∈ Y. (b) As length increases, so does volume. (c) As similarity increases, volume decreases.\n\nThe normalization constant in Equation (1) can be computed explicitly thanks to the identity\n\nΣ_Y det(L_Y) = det(L + I) ,  (2)\n\nwhere I is the N × N identity matrix. In fact, a variety of probabilistic inference operations can be performed efficiently, including sampling, marginalization, and conditioning [12, 24]. However, the maximum a posteriori (MAP) problem arg max_Y det(L_Y) is NP-hard [14]. In many practical situations it would be useful to approximate the MAP set; for instance, during decoding, online training, etc.\n\n2.1 Submodularity\n\nA function f : 2^Y → R is called submodular if it satisfies\n\nf(X ∪ {i}) − f(X) ≥ f(Y ∪ {i}) − f(Y)  (3)\n\nwhenever X ⊆ Y and i ∉ Y. Intuitively, the contribution made by a single item i only decreases as the set grows. Common submodular functions include the mutual information of a set of variables and the number of cut edges leaving a set of vertices of a graph. A submodular function f is called nondecreasing (or monotone) when X ⊆ Y implies f(X) ≤ f(Y).\n\nIt is possible to show that log det(L_Y) is a submodular function: entropy is submodular, and the entropy of a Gaussian is proportional to log det(Σ_Y) (plus a linear term in |Y|), where Σ is the covariance matrix.
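To make the normalization identity (2) concrete, here is a small self-contained Python sketch (our illustration, not the paper's code; the 3-item kernel is an arbitrary toy example). It brute-forces the sum of det(L_Y) over all 2^N subsets and checks that it equals the single determinant det(L + I), and it also exhibits the diversity preference: a set of orthogonal items is more probable than a set of overlapping ones.

```python
from itertools import combinations

def det(M):
    # Laplace expansion; exact enough for the tiny matrices used here.
    n = len(M)
    if n == 0:
        return 1.0
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def submatrix(L, Y):
    # Restriction L_Y of L to the rows/columns indexed by Y.
    return [[L[i][j] for j in Y] for i in Y]

# Toy PSD kernel L = B^T B built from an arbitrary 2x3 feature matrix B.
B = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
N = 3
L = [[sum(B[k][i] * B[k][j] for k in range(len(B))) for j in range(N)]
     for i in range(N)]

# Brute-force normalization: sum det(L_Y) over all subsets (det of the empty set is 1).
subsets = [Y for r in range(N + 1) for Y in combinations(range(N), r)]
Z = sum(det(submatrix(L, Y)) for Y in subsets)

# Identity (2): the same constant from one N x N determinant.
L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(N)] for i in range(N)]
assert abs(Z - det(L_plus_I)) < 1e-9

# Diversity preference: items 0 and 2 have orthogonal features, items 0 and 1 overlap,
# so the set {0, 2} spans more volume and is more probable than {0, 1}.
assert det(submatrix(L, (0, 2))) > det(submatrix(L, (0, 1)))
```

Both sides evaluate to the same constant, so the normalizer never requires an exponential sum.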
Submodular functions are easy to minimize, and a variety of algorithms exist for approximately maximizing them; however, to our knowledge none of these existing algorithms simultaneously allows for general polytope constraints on the set Y, offers an approximation guarantee, and can be implemented in practice without expensive sampling to approximate the objective. We provide a technique that addresses all three criteria for the DPP MAP problem, although approximation guarantees for the general polytope case depend on the choice of rounding algorithm and remain an open problem. We use the submodular maximization algorithm of [21] as a starting point.\n\n3 MAP Inference\n\nWe seek an approximate solution to the generalized DPP MAP problem arg max_{Y ∈ S} log det(L_Y), where S ⊆ [0, 1]^N and Y ∈ S means that the characteristic vector I(Y) is in S. We will assume that S is a down-monotone, solvable polytope; down-monotone means that for x, y ∈ [0, 1]^N, x ∈ S implies y ∈ S whenever x ≥ y (that is, whenever x_i ≥ y_i for all i), and solvable means that for any linear objective function g(x) = aᵀx, we can efficiently find x ∈ S maximizing g(x).\n\nOne common approach for approximating discrete optimization problems is to replace the discrete variables with continuous analogs and extend the objective function to the continuous domain. When the resulting continuous optimization is solved, the result may include fractional variables. Typically, a rounding scheme is then used to produce a valid integral solution. As we will detail below, we use a novel non-linear continuous relaxation that has a nice property: when the polytope is unconstrained, S = [0, 1]^N, our method will (essentially) always produce integral solutions.
For more complex polytopes, a rounding procedure is required.\n\nWhen the objective f(Y) is a submodular set function, as in our setting, the multilinear extension can be used to obtain certain theoretical guarantees for the relaxed optimization scheme described above [21, 25]. The multilinear extension is defined on a vector x ∈ [0, 1]^N:\n\nF(x) = Σ_Y Π_{i ∈ Y} x_i Π_{i ∉ Y} (1 − x_i) f(Y) .  (4)\n\nThat is, F(x) is the expected value of f(Y) when Y is the random set obtained by including element i with probability x_i. Unfortunately, this expectation generally cannot be computed efficiently, since it involves summing over exponentially many sets Y. Thus, to use the multilinear extension in practice requires estimating its value and derivative via Monte Carlo techniques. This makes the optimization quite computationally expensive, as well as introducing a variety of technical convergence issues.\n\nInstead, for the special case of DPP probabilities we propose a new continuous extension that is efficiently computable and differentiable. We refer to the following function as the softmax extension:\n\nF̃(x) = log Σ_Y Π_{i ∈ Y} x_i Π_{i ∉ Y} (1 − x_i) exp(f(Y)) .  (5)\n\nSee the supplementary material for a visual comparison of Equations (4) and (5). While the softmax extension also involves a sum over exponentially many sets Y, we have the following theorem.\n\nTheorem 1. For a positive semidefinite matrix L and x ∈ [0, 1]^N,\n\nΣ_Y Π_{i ∈ Y} x_i Π_{i ∉ Y} (1 − x_i) det(L_Y) = det(diag(x)(L − I) + I) .  (6)\n\nAll proofs are included in the supplementary material.\n\nCorollary 2. For f(Y) = log det(L_Y), we have F̃(x) = log det(diag(x)(L − I) + I) and\n\n∂F̃(x)/∂x_i = tr((diag(x)(L − I) + I)^{-1} (L − I)_i) ,  (7)\n\nwhere (L − I)_i denotes the matrix obtained by zeroing all except the ith row of L − I.\n\nCorollary 2 says that the softmax extension for the DPP MAP problem is computable and differentiable in O(N^3) time.
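Theorem 1 is easy to sanity-check numerically. The sketch below (our illustration, not the paper's code; the 3-item kernel and the point x are arbitrary) compares the exponential-size sum defining the softmax extension against the closed form det(diag(x)(L − I) + I):

```python
import math
from itertools import combinations

def det(M):
    # Laplace expansion; fine for tiny matrices.
    n = len(M)
    if n == 0:
        return 1.0
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def submatrix(L, Y):
    return [[L[i][j] for j in Y] for i in Y]

N = 3
# Arbitrary PSD kernel: Gram matrix of three 2-d feature vectors.
B = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
L = [[sum(B[k][i] * B[k][j] for k in range(len(B))) for j in range(N)]
     for i in range(N)]
x = [0.3, 0.6, 0.9]  # an arbitrary point in [0, 1]^N

# Left side of Theorem 1: explicit sum over all 2^N subsets.
total = 0.0
for r in range(N + 1):
    for Y in combinations(range(N), r):
        p = 1.0
        for i in range(N):
            p *= x[i] if i in Y else 1.0 - x[i]
        total += p * det(submatrix(L, Y))

# Right side: a single determinant, det(diag(x)(L - I) + I).
M = [[x[i] * (L[i][j] - (1.0 if i == j else 0.0)) + (1.0 if i == j else 0.0)
      for j in range(N)] for i in range(N)]
assert abs(total - det(M)) < 1e-9

# F~(x) from Corollary 2, computed in closed form (no exponential sum needed).
softmax_value = math.log(det(M))
```

The gradient in Equation (7) admits the same kind of closed-form evaluation, which is what makes gradient-based optimization of F̃ practical.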
Using a variant of gradient ascent (Section 3.1), this will be sufficient to efficiently find a local maximum of the softmax extension over an arbitrary solvable polytope. It then remains to show that this local maximum comes with approximation guarantees.\n\n3.1 Conditional gradient\n\nWhen the optimization polytope S is simple (for instance, the unit cube [0, 1]^N), we can apply generic gradient-based optimization methods like L-BFGS to rapidly find a local maximum of the softmax extension. In situations where we are able to efficiently project onto the polytope S, we can apply projected gradient methods. In the general case, however, we assume only that the polytope is solvable. In such settings, we can use the conditional gradient algorithm (also known as the Frank-Wolfe algorithm) [26, 27]. Algorithm 1 describes the procedure; intuitively, at each step we move to a convex combination of the current point and the point maximizing the linear approximation of the function given by the current gradient. This ensures that we move in an increasing direction while remaining in S. Note that finding y requires optimizing a linear function over S; this step is efficient whenever the polytope is solvable.\n\n3.2 Approximation bound\n\nIn order to obtain an approximation bound for the DPP MAP problem, we consider the two-phase optimization in Algorithm 2, originally proposed in [21]. The second call to LOCAL-OPT is necessary in theory; however, in practice it can usually be omitted with minimal loss (if any).
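The conditional gradient procedure of Section 3.1 can be sketched in a few lines. This is our simplified illustration, not the authors' implementation: the box [0, 1]^N stands in for the solvable polytope (so the linear subproblem has a trivial solution), the gradient is estimated by finite differences rather than the exact formula of Corollary 2, and the line search is a coarse grid. The toy kernel is arbitrary; for it, the iterates reach the vertex selecting the two orthogonal, high-quality items.

```python
import math

def det(M):
    # Laplace expansion; fine for tiny matrices.
    n = len(M)
    if n == 0:
        return 1.0
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def softmax_ext(L, x):
    # F~(x) = log det(diag(x)(L - I) + I).
    n = len(L)
    M = [[x[i] * (L[i][j] - (1.0 if i == j else 0.0)) + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return math.log(det(M))

def grad(L, x, eps=1e-5):
    # Central finite differences; the paper's exact O(N^3) gradient would be used in practice.
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += eps
        xm[i] -= eps
        g.append((softmax_ext(L, xp) - softmax_ext(L, xm)) / (2 * eps))
    return g

def local_opt(L, iters=30):
    # Conditional gradient (Frank-Wolfe) over the unit cube [0, 1]^N.
    n = len(L)
    x = [0.5] * n
    for _ in range(iters):
        g = grad(L, x)
        # Linear maximization over the box: send each coordinate to 0 or 1.
        y = [1.0 if gi > 0 else 0.0 for gi in g]
        # Grid line search over convex combinations; alpha = 1 keeps x, so no decrease.
        alphas = [a / 20.0 for a in range(21)]
        best = max(alphas, key=lambda a: softmax_ext(L, [a * xi + (1 - a) * yi
                                                         for xi, yi in zip(x, y)]))
        x = [best * xi + (1 - best) * yi for xi, yi in zip(x, y)]
    return x

# Toy kernel: items 0 and 2 are orthogonal and strong, item 1 overlaps both.
L = [[4.0, 2.0, 0.0],
     [2.0, 2.0, 2.0],
     [0.0, 2.0, 4.0]]
x0 = [0.5, 0.5, 0.5]
x_star = local_opt(L)
assert all(0.0 <= xi <= 1.0 for xi in x_star)
assert softmax_ext(L, x_star) >= softmax_ext(L, x0) - 1e-9
```

Consistent with Theorem 9 below, the unconstrained solution here is integral: the iterate lands on a vertex of the cube rather than at a fractional point.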
We will show that Algorithm 2 produces a 1/4-approximation.\n\nAlgorithm 1 LOCAL-OPT\nInput: function F̃, polytope S\nx ← 0\nwhile not converged do\n  y ← arg max_{y' ∈ S} ∇F̃(x)ᵀy'\n  α ← arg max_{α' ∈ [0,1]} F̃(α'x + (1 − α')y)\n  x ← αx + (1 − α)y\nend while\nOutput: x\n\nAlgorithm 2 Approximating the DPP MAP\nInput: kernel L, polytope S\nLet F̃(x) = log det(diag(x)(L − I) + I)\nx ← LOCAL-OPT(F̃, S)\ny ← LOCAL-OPT(F̃, S ∩ {y' | y' ≤ 1 − x})\nOutput: x if F̃(x) > F̃(y), else y\n\nWe begin by proving that the continuous extension F̃ is concave in positive directions, although it is not concave in general.\n\nLemma 3. When u, v ≥ 0, we have\n\n∂²F̃(x + su + tv)/∂s∂t ≤ 0  (8)\n\nwherever 0 < x + su + tv < 1.\n\nCorollary 4. F̃(x + tv) is concave along any direction v ≥ 0 (equivalently, v ≤ 0).\n\nCorollary 4 tells us that a local optimum x of F̃ has certain global properties, namely, that F̃(x) ≥ F̃(y) whenever y ≤ x or y ≥ x. This leads to the following result from [21].\n\nLemma 5. If x is a local optimum of F̃(·), then for any y ∈ [0, 1]^N,\n\n2F̃(x) ≥ F̃(x ∨ y) + F̃(x ∧ y) ,  (9)\n\nwhere (x ∨ y)_i = max(x_i, y_i) and (x ∧ y)_i = min(x_i, y_i).\n\nFollowing [21], we now define a surrogate function F̃*. Let X_i ⊆ [0, 1] be a subset of the unit interval representing x_i = |X_i|, where |X_i| denotes the measure of X_i. (Note that this representation is overcomplete, since there are in general many subsets of [0, 1] with measure x_i.) F̃* is defined on X = (X_1, X_2, . . . , X_N) by\n\nF̃*(X) = F̃(x), x = (|X_1|, |X_2|, . . . , |X_N|) .  (10)\n\nLemma 6. F̃* is submodular.\n\nLemmas 5 and 6 suffice to prove the following theorem, which appears for the multilinear extension in [21], bounding the approximation ratio of Algorithm 2.\n\nTheorem 7.
Let F̃(x) be the softmax extension of a nonnegative submodular function f(Y) = log det(L_Y), let OPT = max_{x' ∈ S} F̃(x'), and let x and y be local optima of F̃ in S and S ∩ {y' | y' ≤ 1 − x}, respectively. Then\n\nmax(F̃(x), F̃(y)) ≥ (1/4) OPT ≥ (1/4) max_{Y ∈ S} log det(L_Y) .  (11)\n\nNote that the softmax extension is an upper bound on the multilinear extension, thus Equation (11) is at least as tight as the corresponding result in [21].\n\nCorollary 8. Algorithm 2 yields a 1/4-approximation to the DPP MAP problem whenever log det(L_Y) ≥ 0 for all Y. In general, the objective value obtained by Algorithm 2 is bounded below by (1/4)(OPT − p_0) + p_0, where p_0 = min_Y log det(L_Y).\n\nIn practice, filtering of near-duplicates can be used to keep p_0 from getting too small; however, in our empirical tests p_0 did not seem to have a significant effect on approximation quality.\n\n3.3 Rounding\n\nWhen the polytope S is unconstrained, it is easy to show that the results of Algorithm 1, and, in turn, Algorithm 2, are integral (or can be rounded without loss).\n\nTheorem 9. If S = [0, 1]^N, then for any local optimum x of F̃, either x is integral or at least one fractional coordinate x_i can be set to 0 or 1 without lowering the objective.\n\nMore generally, however, the polytope S can be complex, and the output of Algorithm 2 needs to be rounded. We speculate that the contention resolution rounding schemes proposed in [21] for the multilinear extension F may be extensible to F̃, but do not attempt to prove so here.
Instead, in our experiments we apply pipage rounding [28] and threshold rounding (rounding all coordinates up or down using a single threshold), which are simple and seem to work well in practice.\n\n3.4 Model combination\n\nIn addition to theoretical guarantees and the empirical advantages we demonstrate in Section 4, the proposed approach to the DPP MAP problem offers a great deal of flexibility. Since the general framework of continuous optimization is widely used in machine learning, this technique allows DPPs to be easily combined with other models. For instance, if S is the local polytope for a Markov random field, then, augmenting the objective with the (linear) log-likelihood of the MRF (additive linear objective terms do not affect the lemmas proved above), we can approximately compute the MAP configuration of the DPP-MRF product model. We might in this way model diverse objects placed in a sequence, or fit to an underlying signal like an image. Empirical studies of these possibilities are left to future work.\n\n4 Experiments\n\nTo illustrate the proposed method, we compare it to the widely used greedy algorithm of Nemhauser and Wolsey [15] (Algorithm 3) and the recently proposed deterministic \u201csymmetric\u201d greedy algorithm [18], which has a 1/3 approximation guarantee for unconstrained non-monotone problems. Note that, while a naive implementation of the arg max in Algorithm 3 requires evaluating the objective for each item in U, here we can exploit the fact that DPPs are closed under conditioning to compute all necessary values with only two matrix inversions [5]. We report baseline runtimes using this optimized greedy algorithm, which is about 10 times faster than the naive version at N = 200.
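For reference, a naive version of this greedy baseline (Algorithm 3 specialized to the unconstrained case S = [0, 1]^N; our sketch, without the conditioning-based speedup from [5], and with an arbitrary toy kernel) looks like this:

```python
import math

def det(M):
    # Laplace expansion; fine for tiny matrices.
    n = len(M)
    if n == 0:
        return 1.0
    if n == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(n))

def submatrix(L, Y):
    return [[L[i][j] for j in Y] for i in Y]

def log_det(L, Y):
    # log det(L_Y); -inf for (numerically) singular restrictions, 0 for the empty set.
    if not Y:
        return 0.0
    d = det(submatrix(L, Y))
    return math.log(d) if d > 0 else float("-inf")

def greedy_map(L):
    # Greedy MAP: repeatedly add the best item until the objective stops improving.
    N = len(L)
    Y, U = [], list(range(N))
    while U:
        i_star = max(U, key=lambda i: log_det(L, Y + [i]))
        if log_det(L, Y + [i_star]) < log_det(L, Y):
            break
        Y.append(i_star)
        U.remove(i_star)
    return sorted(Y)

# Toy kernel: items 0 and 2 are orthogonal and strong, item 1 overlaps both,
# so greedy picks {0, 2} and stops (adding item 1 would make L_Y singular).
L = [[4.0, 2.0, 0.0],
     [2.0, 2.0, 2.0],
     [0.0, 2.0, 4.0]]
assert greedy_map(L) == [0, 2]
```

Each iteration here recomputes a determinant per candidate; the optimized variant mentioned above instead conditions the DPP to share work across candidates.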
The code and data for all experiments can be downloaded from http://www.seas.upenn.edu/~jengi/dpp-map.html.\n\n4.1 Synthetic data\n\nAs a first test, we approximate the MAP configuration for DPPs with random kernels drawn from a Wishart distribution. Specifically, we choose L = BᵀB, where B ∈ R^{N×N} has entries drawn independently from the standard normal distribution, b_ij ∼ N(0, 1). This results in L ∼ W_N(N, I), a Wishart distribution with N degrees of freedom and an identity covariance matrix. This distribution has several desirable properties: (1) in terms of eigenvectors, it spreads its mass uniformly over all unitary matrices [29], and (2) the probability density of eigenvalues λ_1, . . . , λ_N is\n\nexp(−Σ_{i=1}^N λ_i) Π_{i=1}^N Π_{j=i+1}^N (λ_i − λ_j)² / ((N − i)!)² ,  (12)\n\nthe first term of which deters the eigenvalues from being too large, and the second term of which encourages the eigenvalues to be well-separated [30]. Property (1) implies that we will see a variety of eigenvectors, which play an important role in the structure of a DPP [5]. Property (2) implies that interactions between these eigenvectors will be important, as no one eigenvalue is likely to dominate. Combined, these properties suggest that samples should encompass a wide range of DPPs.\n\nFigure 3a shows performance results on these random kernels in the unconstrained setting. Our proposed algorithm outperforms greedy in general, and the performance gap tends to grow with the size of the ground set, N.
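The random kernel construction above is straightforward to reproduce. A minimal sketch (ours, using only Python's standard library, a fixed arbitrary seed, and a small N for illustration) builds L = BᵀB from i.i.d. standard normal entries and sanity-checks that the result is a symmetric positive semidefinite matrix:

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility only

N = 5
# B is N x N with i.i.d. standard normal entries, so L = B^T B ~ Wishart_N(N, I).
B = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(N)]
L = [[sum(B[k][i] * B[k][j] for k in range(N)) for j in range(N)]
     for i in range(N)]

# Sanity check 1: L is symmetric.
for i in range(N):
    for j in range(N):
        assert abs(L[i][j] - L[j][i]) < 1e-12

# Sanity check 2: L is PSD, since v^T L v = ||Bv||^2 >= 0 for any v.
v = [random.gauss(0.0, 1.0) for _ in range(N)]
Bv = [sum(B[k][i] * v[i] for i in range(N)) for k in range(N)]
quad = sum(bv * bv for bv in Bv)
assert quad >= 0.0
assert abs(quad - sum(v[i] * L[i][j] * v[j] for i in range(N) for j in range(N))) < 1e-9
```

In the experiments B is N × N, so L is almost surely full-rank; taking B with fewer rows than columns would instead yield the rank-deficient kernels for which large sets have probability zero.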
(We let N vary in the range [50, 200] since prior work with DPPs in real-world scenarios [5, 13] has typically operated in this range.)\n\nFigure 3: Median and quartile log probability ratios (top) and running time ratios (bottom) for 100 random trials. (a) The proposed algorithm versus greedy on unconstrained problems. (b) The proposed algorithm versus symmetric greedy on unconstrained problems. (c) The proposed algorithm versus greedy on constrained problems. Dotted black lines indicate equal performance.\n\nMoreover, Figure 3a (bottom) illustrates that our method is of comparable efficiency at medium N, and becomes more efficient as N grows. Despite the fact that the symmetric greedy algorithm [18] has an improved approximation guarantee of 1/3, essentially the same analysis applies to Figure 3b.\n\nFigure 3c summarizes the performance of our algorithm in a constrained setting.
To create plausible constraints, in this setting we generate two separate random matrices B^(1) and B^(2), and then select random pairs of rows (B^(1)_i, B^(2)_j). Averaging (B^(1)_i + B^(2)_j)/2 creates one row of the matrix B; we then set L = BᵀB. The constraints require that if the x_k corresponding to the (i, j) pair is 1, no other x_k' can have first element i or second element j; i.e., the pairs cannot overlap. Since exact duplicate pairs produce identical rows in L, they are never both selected and can be pruned ahead of time. This means our constraints are of a form that allows us to apply pipage rounding to the possibly fractional result. Figure 3c shows even greater gains over greedy in this setting; however, enforcing the constraints precludes using fast methods like L-BFGS, so our optimization procedure is in this case somewhat slower than greedy.\n\n4.2 Matched summarization\n\nFinally, we demonstrate our approach using real-world data. Consider the following task: given a set of documents, select a set of document pairs such that the two elements within a pair are similar, but the overall set of pairs is diverse. For instance, we might want to compare the opinions of various authors on a range of topics, or even to compare the statements made at different points in time by the same author, e.g., a politician believed to have changed positions on various issues.\n\nIn this vein, we extract all the statements made by the eight main contenders in the 2012 US Republican primary debates: Bachmann, Cain, Gingrich, Huntsman, Paul, Perry, Romney, and Santorum. See the supplementary material for an example of some of these statements. Each pair of candidates (a, b) constitutes one instance of our task. The task output is a set of statement pairs where the first statement in each pair comes from candidate a and the second from candidate b.
The goal of optimization is to find a set that is diverse (contains many topics, such as healthcare, foreign policy, immigration, etc.) but where both statements in each pair are topically similar.\n\nBefore formulating a DPP objective for this task, we perform some pre-processing. We filter short statements, leaving us with an average of 179 quotes per candidate (min = 93, max = 332 quotes).\n\nAlgorithm 3 Greedy MAP for DPPs\nInput: kernel L, polytope S\nY ← ∅, U ← Y\nwhile U is not empty do\n  i* ← arg max_{i ∈ U} log det(L_{Y ∪ {i}})\n  if log det(L_{Y ∪ {i*}}) < log det(L_Y) then\n    break\n  end if\n  Y ← Y ∪ {i*}\n  U ← {i | i ∉ Y, I(Y ∪ {i}) ∈ S}\nend while\nOutput: Y\n\nFigure 4: Log ratio of the objective value achieved by our method to that achieved by greedy for ten settings of match weight λ.\n\nWe parse the quotes, keeping only nouns. We further filter nouns by document frequency, keeping only those that occur in at least 10% of the quotes. Then we generate a feature matrix W where W_qt is the number of times term t appears in quote q. This matrix is then normalized so that ||W_q||_2 = 1, where W_q is the qth row of W. For a given pair of candidates (a, b) we compute the quality of each possible quote pair (q^(a)_i, q^(b)_j) as the dot product of their rows in W. While the model will naturally ignore low-quality pairs, for efficiency we throw away such pairs in pre-processing. For each of candidate a's quotes q^(a)_i we keep a pair with quote j = arg max_{j'} quality(q^(a)_i, q^(b)_{j'}) from candidate b, and vice-versa. The scores of the unpruned quotes, which we denote r, are re-normalized to span the [0, 1] range.
To create a feature vector describing each pair, we simply add the corresponding pair of quote feature vectors and re-normalize, forming a new W matrix.\n\nOur task is to select some high-quality representative subset of the unpruned quote pairs. We formulate this as a DPP objective with kernel L = MSM, where S_ij is a measurement of similarity between quote pairs i and j, and M is a diagonal matrix with M_ii representing the match quality of pair i. We set S = WWᵀ and diag(M) = √exp(λr), where λ is a hyperparameter. Large λ places more emphasis on picking high-quality pairs than on making the overall set diverse.\n\nTo help limit the number of pairs selected when optimizing the objective, we add some constraints. For each candidate we cluster their quotes using k-means on the word feature vectors and impose the constraint that no more than one quote per cluster can be selected. We round the final solution using the threshold rounding scheme described in Section 3.3.\n\nFigure 4 shows the result of optimizing this constrained objective, averaged over all 56 candidate pairs. For all settings of λ we outperform greedy. In general, we observe that our algorithm is most improved compared to greedy when the constraints are in play. In this case, when λ is small the constraints are less relevant, since the model has an intrinsic preference for smaller sets. On the other hand, when λ is very large the algorithms must choose as many pairs as possible in order to maximize their score; in this case the constraints play an important role.\n\n5 Conclusion\n\nWe presented a new approach to solving the MAP problem for DPPs based on continuous algorithms for submodular maximization. Unlike the multilinear extension used in the general case, the softmax extension we propose is efficiently computable and differentiable.
Furthermore, it allows for general solvable polytope constraints, and yields a guaranteed 1/4-approximation for a subclass of DPPs. Our method makes it easy to combine DPPs with other models like MRFs or matching models, and is faster and more reliable than standard greedy methods on synthetic and real-world problems.\n\nAcknowledgments\n\nThis material is based upon work supported under a National Science Foundation Graduate Research Fellowship, Sloan Research Fellowship, and NSF Grant 0803256.\n\nReferences\n\n[1] A. Nenkova, L. Vanderwende, and K. McKeown. A Compositional Context-Sensitive Multi-Document Summarizer: Exploring the Factors that Influence Summarization. In Proc. SIGIR, 2006.\n[2] H. Lin and J. Bilmes. Multi-document Summarization via Budgeted Maximization of Submodular Functions. In Proc. NAACL/HLT, 2010.\n[3] F. Radlinski, R. Kleinberg, and T. Joachims. Learning Diverse Rankings with Multi-Armed Bandits. In Proc. ICML, 2008.\n[4] Y. Yue and T. Joachims. Predicting Diverse Subsets Using Structural SVMs. In Proc. ICML, 2008.\n[5] A. Kulesza and B. Taskar. k-DPPs: Fixed-Size Determinantal Point Processes. In Proc. ICML, 2011.\n[6] C. Guestrin, A. Krause, and A. Singh. Near-Optimal Sensor Placements in Gaussian Processes. In Proc. ICML, 2005.\n[7] D. Kempe, J. Kleinberg, and E. Tardos. Influential Nodes in a Diffusion Model for Social Networks. In Automata, Languages and Programming, volume 3580 of Lecture Notes in Computer Science. 2005.\n[8] O. Macchi. The Coincidence Approach to Stochastic Point Processes. Advances in Applied Probability, 7(1), 1975.\n[9] D. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Elementary Theory and Methods. 2003.\n[10] A. Kulesza and B. Taskar. Structured Determinantal Point Processes. In Proc. NIPS, 2010.\n[11] A. Kulesza, J. Gillenwater, and B. Taskar. Discovering Diverse and Salient Threads in Document Collections. In Proc.
EMNLP, 2012.\n[12] J. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal Processes and Independence. Probability Surveys, 3, 2006.\n[13] A. Kulesza and B. Taskar. Learning Determinantal Point Processes. In Proc. UAI, 2011.\n[14] C. Ko, J. Lee, and M. Queyranne. An Exact Algorithm for Maximum Entropy Sampling. Operations Research, 43(4), 1995.\n[15] G. Nemhauser, L. Wolsey, and M. Fisher. An Analysis of Approximations for Maximizing Submodular Set Functions I. Mathematical Programming, 14(1), 1978.\n[16] U. Feige, V. Mirrokni, and J. Vondrák. Maximizing Non-Monotone Submodular Functions. In Proc. FOCS, 2007.\n[17] T. Robertazzi and S. Schwartz. An Accelerated Sequential Algorithm for Producing D-optimal Designs. SIAM J. Sci. Stat. Comput., 10(2), 1989.\n[18] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. A Tight Linear Time (1/2)-Approximation for Unconstrained Submodular Maximization. In Proc. FOCS, 2012.\n[19] A. Gupta, A. Roth, G. Schoenebeck, and K. Talwar. Constrained Nonmonotone Submodular Maximization: Offline and Secretary Algorithms. In Internet and Network Economics, volume 6484 of LNCS. 2010.\n[20] S. Gharan and J. Vondrák. Submodular Maximization by Simulated Annealing. In Proc. SODA, 2011.\n[21] C. Chekuri, J. Vondrák, and R. Zenklusen. Submodular Function Maximization via the Multilinear Relaxation and Contention Resolution Schemes. arXiv:1105.4593, 2011.\n[22] M. Feldman, J. Naor, and R. Schwartz. Nonmonotone Submodular Maximization via a Structural Continuous Greedy Algorithm. Automata, Languages and Programming, 2011.\n[23] A. Borodin and A. Soshnikov. Janossy Densities I. Determinantal Ensembles. Journal of Statistical Physics, 113(3), 2003.\n[24] A. Borodin. Determinantal Point Processes. arXiv:0911.1153, 2009.\n[25] M. Feldman, J. Naor, and R. Schwartz. A Unified Continuous Greedy Algorithm for Submodular Maximization. In Proc.
FOCS, 2011.\n[26] D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.\n[27] M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(1-2), 1956.\n[28] A. Ageev and M. Sviridenko. Pipage Rounding: A New Method of Constructing Algorithms with Proven Performance Guarantee. Journal of Combinatorial Optimization, 8(3), 2004.\n[29] A. James. Distributions of Matrix Variates and Latent Roots Derived from Normal Samples. Annals of Mathematical Statistics, 35(2), 1964.\n[30] P. Hsu. On the Distribution of Roots of Certain Determinantal Equations. Annals of Eugenics, 9(3), 1939.\n", "award": [], "sourceid": 1264, "authors": [{"given_name": "Jennifer", "family_name": "Gillenwater", "institution": null}, {"given_name": "Alex", "family_name": "Kulesza", "institution": null}, {"given_name": "Ben", "family_name": "Taskar", "institution": null}]}