{"title": "Leveraging Sparsity for Efficient Submodular Data Summarization", "book": "Advances in Neural Information Processing Systems", "page_first": 3414, "page_last": 3422, "abstract": "The facility location problem is widely used for summarizing large datasets and has additional applications in sensor placement, image retrieval, and clustering. One difficulty of this problem is that submodular optimization algorithms require the calculation of pairwise benefits for all items in the dataset. This is infeasible for large problems, so recent work proposed to only calculate nearest neighbor benefits. One limitation is that several strong assumptions were invoked to obtain provable approximation guarantees. In this paper we establish that these extra assumptions are not necessary\u2014solving the sparsified problem will be almost optimal under the standard assumptions of the problem. We then analyze a different method of sparsification that is a better model for methods such as Locality Sensitive Hashing to accelerate the nearest neighbor computations and extend the use of the problem to a broader family of similarities. We validate our approach by demonstrating that it rapidly generates interpretable summaries.", "full_text": "Leveraging Sparsity for Ef\ufb01cient\nSubmodular Data Summarization\n\nErik M. Lindgren, Shanshan Wu, Alexandros G. Dimakis\n\nerikml@utexas.edu, shanshan@utexas.edu, dimakis@austin.utexas.edu\n\nThe University of Texas at Austin\n\nDepartment of Electrical and Computer Engineering\n\nAbstract\n\nThe facility location problem is widely used for summarizing large datasets and\nhas additional applications in sensor placement, image retrieval, and clustering.\nOne dif\ufb01culty of this problem is that submodular optimization algorithms require\nthe calculation of pairwise bene\ufb01ts for all items in the dataset. 
This is infeasible for large problems, so recent work proposed to only calculate nearest neighbor benefits. One limitation is that several strong assumptions were invoked to obtain provable approximation guarantees. In this paper we establish that these extra assumptions are not necessary—solving the sparsified problem will be almost optimal under the standard assumptions of the problem. We then analyze a different method of sparsification that is a better model for methods such as Locality Sensitive Hashing to accelerate the nearest neighbor computations and extend the use of the problem to a broader family of similarities. We validate our approach by demonstrating that it rapidly generates interpretable summaries.

1 Introduction

In this paper we study the facility location problem: we are given sets V of size n, I of size m, and a benefit matrix of nonnegative numbers $C \in \mathbb{R}^{I \times V}$, where $C_{iv}$ describes the benefit that element i receives from element v. Our goal is to select a small set A of k columns in this matrix. Once we have chosen A, element i will get a benefit equal to the best choice out of the available columns, $\max_{v \in A} C_{iv}$. The total reward is the sum of the row rewards, so the optimal choice of columns is the solution of:

$$\arg\max_{\{A \subseteq V : |A| \le k\}} \sum_{i \in I} \max_{v \in A} C_{iv}. \quad (1)$$

A natural application of this problem is in finding a small set of representative images in a big dataset, where $C_{iv}$ represents the similarity between images i and v. The problem is to select k images that provide a good coverage of the full dataset, since each one has a close representative in the chosen set.

Throughout this paper we follow the nomenclature common to the submodular optimization for machine learning literature. This problem is also known as the maximization version of the k-medians problem or the submodular facility location problem.
A number of recent works have used this problem for selecting subsets of documents or images from a larger corpus [27, 39], to identify locations to monitor in order to quickly identify important events in sensor or blog networks [24, 26], as well as clustering applications [23, 34].

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We can naturally interpret Problem 1 as a maximization of a set function F(A) which takes as input the selected set of columns and returns the total reward of that set. Formally, let $F(\emptyset) = 0$ and for all other sets $A \subseteq V$ define

$$F(A) = \sum_{i \in I} \max_{v \in A} C_{iv}. \quad (2)$$

The set function F is submodular, since for all $j \in V$ and sets $A \subseteq B \subseteq V \setminus \{j\}$, we have $F(A \cup \{j\}) - F(A) \ge F(B \cup \{j\}) - F(B)$; that is, the gain of an element diminishes as we add elements. Since the entries of C are nonnegative, F is monotone: for all $A \subseteq B \subseteq V$, we have $F(A) \le F(B)$. F is also normalized, since $F(\emptyset) = 0$.

The facility location problem is NP-Hard, so we consider approximation algorithms. Like all monotone and normalized submodular functions, the greedy algorithm guarantees a $(1 - 1/e)$-factor approximation to the optimal solution [35]. The greedy algorithm starts with the empty set, then for k iterations adds the element with the largest marginal reward. This approximation is the best possible—the maximum coverage problem is an instance of the submodular facility location problem, which was shown to be NP-Hard to optimize within a factor of $1 - 1/e + \varepsilon$ for all $\varepsilon > 0$ [13].

The problem is that the greedy algorithm has super-quadratic running time $\Theta(nmk)$, and in many datasets n and m can be in the millions. For this reason, several recent papers have focused on accelerating the greedy algorithm. In [26], the authors point out that if the benefit matrix is sparse, this can dramatically speed up the computation time.
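To make the greedy algorithm described above concrete, here is a minimal dense-matrix sketch in Python. The function name and structure are our own; the paper's Algorithm 1 (in its appendix) additionally exploits sparsity and caching, which this sketch omits for clarity.

```python
import numpy as np

def greedy_facility_location(C, k):
    """Greedily maximize F(A) = sum_i max_{v in A} C[i, v].

    C : (m, n) nonnegative benefit matrix; rows index I, columns index V.
    Returns the chosen column indices and the final objective value.
    """
    m, n = C.shape
    best = np.zeros(m)       # current reward max_{v in A} C[i, v] for each row i
    chosen = []
    for _ in range(k):
        # marginal gain of adding column v: sum_i max(C[i, v] - best[i], 0)
        gains = np.maximum(C - best[:, None], 0.0).sum(axis=0)
        gains[chosen] = -1.0  # never re-pick an already chosen column
        v = int(np.argmax(gains))
        chosen.append(v)
        best = np.maximum(best, C[:, v])
    return chosen, float(best.sum())
```

Each iteration scans all n columns against all m rows, which is exactly the $\Theta(nmk)$ cost the sparsification techniques in this paper are designed to avoid.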
Unfortunately, in many problems of interest, data similarities or rewards are not sparse. Wei et al. [40] proposed to first sparsify the benefit matrix and then run the greedy algorithm on this new sparse matrix. In particular, [40] considers t-nearest neighbor sparsification, i.e., keeping for each row the t largest entries and zeroing out the rest. Using this technique they demonstrated an impressive 80-fold speedup over the greedy algorithm with little loss in solution quality. One limitation of their theoretical analysis was the limited setting under which provable approximation guarantees were established.

Our Contributions: Inspired by the work of Wei et al. [40] we improve the theoretical analysis of the approximation error induced by sparsification. Specifically, the previous analysis assumes that the input came from a probability distribution where the preferences of each element $i \in I$ are independently chosen uniformly at random. For this distribution, when $k = \Omega(n)$, they establish that the sparsity can be taken to be $O(\log n)$ and running the greedy algorithm on the sparsified problem will guarantee a constant factor approximation with high probability.
We improve the analysis in the following ways:

• We prove guarantees for all values of k, and our guarantees do not require any assumptions on the input besides nonnegativity of the benefit matrix.
• In the case where $k = \Omega(n)$, we show that it is possible to take the sparsity of each row as low as O(1) while guaranteeing a constant factor approximation.
• Unlike previous work, our analysis does not require the use of any particular algorithm and can be integrated into many algorithms for solving facility location problems.
• We establish a lower bound which shows that our approximation guarantees are tight up to log factors, for all desired approximation factors.

In addition to the above results we propose a novel algorithm that uses a threshold based sparsification where we keep matrix elements that are above a set value threshold. This type of sparsification is easier to efficiently implement using nearest neighbor methods. For this method of sparsification, we obtain worst case guarantees and a lower bound that matches up to constant factors. We also obtain a data dependent guarantee which helps explain why our algorithm empirically performs better than the worst case.

Further, we propose the use of Locality Sensitive Hashing (LSH) and random walk methods to accelerate approximate nearest neighbor computations. Specifically, we use two types of similarity metrics: inner products and personalized PageRank (PPR). We propose the use of fast approximations for these metrics and empirically show that they dramatically improve running times. LSH functions are well-known but, to the best of our knowledge, this is the first time they have been used to accelerate facility location problems. Furthermore, we utilize personalized PageRank as the similarity between vertices on a graph.
Random walks can quickly approximate this similarity and we demonstrate that it yields highly interpretable results for real datasets.

2 Related Work

The use of a sparsified proxy function was shown by Wei et al. to also be useful for finding a subset for training nearest neighbor classifiers [41]. Further, they also show a connection of nearest neighbor classifiers to the facility location function. The facility location function was also used by Mirzasoleiman et al. as part of a summarization objective function in [32], where they present a summarization algorithm that is able to handle a variety of constraints.

The stochastic greedy algorithm was shown to get a $1 - 1/e - \varepsilon$ approximation with runtime $O(nm \log \frac{1}{\varepsilon})$, which has no dependence on k [33]. It works by choosing a sample set from V of size $\frac{n}{k} \log \frac{1}{\varepsilon}$ each iteration and adding to the current set the element of the sample set with the largest gain.

Also, there are several related algorithms for the streaming setting [5] and distributed setting [6, 25, 31, 34]. Since the objective function is defined over the entire dataset, optimizing the submodular facility location function becomes more complicated in these memory limited settings. Often the function is estimated by considering a randomly chosen subset from the set I.

2.1 Benefit Functions and Nearest Neighbor Methods

For many problems, the elements V and I are vectors in some feature space where the benefit matrix is defined by some similarity function sim. For example, in $\mathbb{R}^d$ we may use the RBF kernel $\mathrm{sim}(x, y) = e^{-\|x - y\|_2^2}$, dot product $\mathrm{sim}(x, y) = x^T y$, or cosine similarity $\mathrm{sim}(x, y) = \frac{x^T y}{\|x\| \|y\|}$.

There have been decades of research on nearest neighbor search in geometric spaces. If the vectors are low dimensional, then classical techniques such as kd-trees [7] work well and are exact.
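For concreteness, the three similarity choices above can be written directly in code; the small `benefit_matrix` helper is our own naming, a sketch rather than anything from the paper.

```python
import numpy as np

def rbf(x, y):
    """RBF kernel: exp(-||x - y||_2^2)."""
    return float(np.exp(-np.sum((x - y) ** 2)))

def dot(x, y):
    """Dot product similarity."""
    return float(x @ y)

def cosine(x, y):
    """Cosine similarity: dot product of the normalized vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def benefit_matrix(I_vecs, V_vecs, sim):
    """C[i, v] = sim(I_vecs[i], V_vecs[v]); rows index I, columns index V."""
    return np.array([[sim(x, y) for y in V_vecs] for x in I_vecs])
```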
However, it has been observed that as the dimension grows, the runtime of all known exact methods does little better than a linear scan over the dataset.

As a compromise, researchers have started to work on approximate nearest neighbor methods, one of the most successful approaches being locality sensitive hashing [15, 20]. LSH uses a hash function that hashes together items that are close. Locality sensitive hash functions exist for a variety of metrics and similarities such as Euclidean [11], cosine similarity [3, 9], and dot product [36, 38]. Nearest neighbor methods other than LSH that have been shown to work for machine learning problems include [8, 10]. Additionally, see [14] for efficient and exact GPU methods.

An alternative to vector functions is to use similarities and benefits defined from graph structures. For instance, we can use the personalized PageRank of vertices in a graph to define the benefit matrix [37]. The personalized PageRank is similar to the classic PageRank, except the random jumps, rather than going to anywhere in the graph, go back to the user's "home" vertex. This can be used as a value of "reputation" or "influence" between vertices in a graph [17].

There are a variety of algorithms for finding the vertices with a large PageRank personalized to some vertex. One popular one is the random walk method. If $\pi_i$ is the personalized PageRank vector of some vertex i, then $\pi_i(v)$ is the probability that a random walk of geometric length starting from i ends on a vertex v (where the parameter of the geometric distribution is defined by the probability of jumping back to i) [4].
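The random walk estimator just described can be sketched in a few lines. This is our own minimal rendering of the Monte Carlo idea of [4], assuming an adjacency-list graph; a production version would parallelize the walks.

```python
import random
from collections import Counter

def approx_ppr(adj, home, alpha=0.15, walks=10000, rng=None):
    """Estimate the personalized PageRank vector of `home` with random walks.

    Each walk has geometric length: before every step it stops with
    probability alpha (a jump back home); otherwise it moves to a uniformly
    random neighbor.  pi_home(v) is estimated by the fraction of walks that
    terminate at v.

    adj : dict mapping vertex -> list of neighbors.
    """
    rng = rng or random.Random(0)
    ends = Counter()
    for _ in range(walks):
        v = home
        while rng.random() >= alpha:
            nbrs = adj[v]
            if not nbrs:      # dangling vertex: just stop the walk here
                break
            v = rng.choice(nbrs)
        ends[v] += 1
    return {v: count / walks for v, count in ends.items()}
```

Only vertices where some walk ended appear in the output, which is what makes this useful for recovering just the large entries of a benefit-matrix row.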
Using this approach, we can quickly estimate all elements in the benefit matrix greater than some value τ.

3 Guarantees for t-Nearest Neighbor Sparsification

We associate a bipartite support graph G = (V, I, E) by having an edge between $v \in V$ and $i \in I$ whenever $C_{iv} > 0$. If the support graph is sparse, we can use the graph to calculate the gain of an element much more efficiently, since we only need to consider the neighbors of the element rather than the entire set I. If the average degree of a vertex $i \in I$ is t (and we use a cache for the current best value of an element i), then we can execute greedy in time O(mtk). See Algorithm 1 in the Appendix for pseudocode. If the sparsity t is much smaller than the size of V, the runtime is greatly improved.

However, the instance we wish to optimize may not be sparse. One idea is to sparsify the original matrix by only keeping the values in the benefit matrix C that are t-nearest neighbors, which was considered in [40]. That is, for every element i in I, we only keep the top t elements of $C_{i1}, C_{i2}, \ldots, C_{in}$ and set the rest equal to zero. This leads to a matrix with mt nonzero elements. We then want the solution from optimizing the sparse problem to be close to the value of the optimal solution in the original objective function F.

Our main theorem is that we can set the sparsity parameter t to be $O(\frac{n}{\alpha k} \log m)$—which is a significant improvement for large enough k—while still having the solution to the sparsified problem be at most a factor of $\frac{1}{1+\alpha}$ from the value of the optimal solution.

Theorem 1. Let $O_t$ be the optimal solution to an instance of the submodular facility location problem with a benefit matrix that was sparsified with t-nearest neighbor. For any $t \ge t^*(\alpha) = O(\frac{n}{\alpha k} \log m)$, we have $F(O_t) \ge \frac{1}{1+\alpha} \mathrm{OPT}$.

Proof Sketch.
For the value of t chosen, there exists a set $S$ of size $\alpha k$ such that every element of I has a neighbor of $S$ in the t-nearest neighbor graph; this is proven using the probabilistic method. By appending $S$ to the optimal solution and using the monotonicity of F, we can move to the sparsified function, since no element of I would prefer an element that was zeroed out in the sparsified matrix, as one of its top t most beneficial elements is present in the set $S$. The optimal solution appended with $S$ is a set of size $(1 + \alpha)k$. We then bound the amount that the optimal value of a submodular function can increase when adding $\alpha k$ elements. See the appendix for the complete proof.

Note that Theorem 1 is agnostic to the algorithm used to optimize the sparsified function, and so if we use a $\rho$-approximation algorithm, then we are at most a factor of $\frac{\rho}{1+\alpha}$ from the optimal solution. Later in this section we will utilize this to design a subquadratic algorithm for optimizing facility location problems, as long as we can quickly compute t-nearest neighbors and k is large enough.

If $m = O(n)$ and $k = \Omega(n)$, we can achieve a constant factor approximation even when taking the sparsity parameter as low as $t = O(1)$, which means that the benefit matrix C has only O(n) nonzero entries. Also note that the only assumption we need is that the benefits between elements are nonnegative. When $k = \Omega(n)$, previous work was only able to take $t = O(\log n)$ and required the benefit matrix to come from a probability distribution [40].

Our guarantee has two regimes depending on the value of $\alpha$. If we want the optimal solution to the sparsified function to be a $1 - \varepsilon$ factor from the optimal solution to the original function, we have that
Conversely, if we want to take the sparsity t to be much smaller than\nt\u21e4(\") = O( n\nn\nk log m\nIn the proof of Theorem 1, the only time we utilize the value of t is to show that there exists a small\nset that covers the entire set I in the t-nearest neighbor graph. Real datasets often contain a covering\nset of size \u21b5k for t much smaller than O( n\n\u21b5k ). This observation yields the following corollary.\nCorollary 2. If after sparsifying a problem instance there exists a covering set of size \u21b5k in the\nt-nearest neighbor graph, then the optimal solution Ot of the sparsi\ufb01ed problem satis\ufb01es F (Ot) \n1+\u21b5 OPT.\nIn the datasets we consider in our experiments of roughly 7000 items, we have covering sets with\nonly 25 elements for t = 75, and a covering set of size 10 for t = 150. The size of covering set was\nupper bounded by using the greedy set cover algorithm. In Figure 2 in the appendix, we see how the\nsize of the covering set changes with the choice of the number of neighbors chosen t.\nIt would be desirable to take the sparsity parameter t lower than the value dictated by t\u21e4(\u21b5). As\ndemonstrated by the following lower bound, is not possible to take the sparsity signi\ufb01cantly lower\nthan 1\n\u21b5\nProposition 3. Suppose we take\n\n1+\u21b5 approximation in the worst case.\n\nk and still have a\n\n\u21b5k log m\n\nn\n\n1\n\n1\n\nt = max\u21e2 1\n\n2\u21b5\n\n,\n\n1\n\n1 + \u21b5 n 1\n\n.\n\nk\n1+\u21b52/k OPT.\n\nThere exists a family of inputs such that we have F (Ot) \uf8ff\nThe example we create to show this has the property that in the t-nearest neighbor graph, the set \nneeds \u21b5k elements to cover every element of I. We plant a much smaller covering set that is very\nclose in value to but is hidden after sparsi\ufb01cation. We then embed a modular function within the\nfacility location objective. 
With knowledge of the small covering set, an optimal solver can take advantage of this modular function, while the sparsified solution would prefer to first choose the large covering set before considering the modular function. See the appendix for full details.

Sparsification integrates well with the stochastic greedy algorithm [33]. By taking $t \ge t^*(\varepsilon/2)$ and running stochastic greedy with sample sets of size $\frac{n}{k} \log \frac{2}{\varepsilon}$, we get a $1 - 1/e - \varepsilon$ approximation in expectation that runs in expected time $O(\frac{nm}{\varepsilon k} \log m \log \frac{2}{\varepsilon})$. If we can quickly sparsify the problem and k is large enough, for example $n^{1/3}$, this is subquadratic. The following proposition shows a high probability guarantee on the runtime of this algorithm and is proven in the appendix.

Proposition 4. When $m = O(n)$, the stochastic greedy algorithm [33] with sample sets of size $\frac{n}{k} \ln \frac{2}{\varepsilon}$, combined with sparsification with sparsity parameter t, will terminate in time $O(n \log \frac{1}{\varepsilon} \max\{t, \log n\})$ with high probability. When $t \ge t^*(\varepsilon/2) = O(\frac{n}{\varepsilon k} \log m)$, this algorithm has a $1 - 1/e - \varepsilon$ approximation in expectation.

4 Guarantees for Threshold-Based Sparsification

Rather than t-nearest neighbor sparsification, we now consider using τ-threshold sparsification, where we zero out all entries that have value below a threshold τ. Recall the definition of a locality sensitive hash.

Definition. H is a $(\tau, K\tau, p, q)$-locality sensitive hash family if for x, y satisfying $\mathrm{sim}(x, y) \ge \tau$ we have $P_{h \in H}(h(x) = h(y)) \ge p$, and if x, y satisfy $\mathrm{sim}(x, y) \le K\tau$ we have $P_{h \in H}(h(x) = h(y)) \le q$.

We see that τ-threshold sparsification is a better model than t-nearest neighbors for LSH, as for K = 1 it is a noisy τ-sparsification, and for non-adversarial datasets it is a reasonable approximation of a τ-sparsification method.
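A minimal sketch of τ-threshold sparsification. The `pick_threshold` helper is our own naming: it is the plug-in version of the sampling idea used in this section, estimating from a random sample of entries the threshold that retains roughly a target fraction of the matrix.

```python
import numpy as np

def threshold_sparsify(C, tau):
    """Zero out every benefit below tau (tau-threshold sparsification)."""
    return np.where(C >= tau, C, 0.0)

def pick_threshold(C, keep_fraction, samples=1000, rng=None):
    """Sample entries of C to estimate the threshold tau that keeps roughly
    a `keep_fraction` of them.  (The paper bounds the number of samples
    needed via the Dvoretzky-Kiefer-Wolfowitz inequality; this is just the
    empirical-quantile estimate.)"""
    rng = rng or np.random.default_rng(0)
    flat = C.ravel()
    idx = rng.choice(flat.size, size=min(samples, flat.size), replace=False)
    return float(np.quantile(flat[idx], 1.0 - keep_fraction))
```

The sparsified matrix can then be handed to any of the optimization algorithms discussed earlier; only the entries above τ contribute to any marginal gain.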
Note that due to the approximation constant K, we do not have an a priori guarantee on the runtime for arbitrary datasets. However, we would expect in practice to see only a few elements with similarity above the value τ. See [2] for a discussion on this.

One issue is that we do not know how to choose the threshold τ. We can sample elements of the benefit matrix C to estimate how sparse the threshold graph will be for a given threshold τ. Assuming the values of C are in general position, by using the Dvoretzky-Kiefer-Wolfowitz-Massart inequality [12, 28] we can bound the number of samples needed to choose a threshold that achieves a desired sparsification level.

We establish the following data-dependent bound on the difference in the optimal solutions of the τ-threshold sparsified function and the original function. We denote the set of vertices adjacent to S in the τ-threshold graph by N(S).

Theorem 5. Let $O_\tau$ be the optimal solution to an instance of the facility location problem with a benefit matrix that was sparsified using a τ-threshold. Assume there exists a set S of size k such that in the τ-threshold graph the neighborhood of S satisfies $|N(S)| \ge \mu n$. Then we have

$$F(O_\tau) \ge \left(1 + \frac{1}{\mu}\right)^{-1} \mathrm{OPT}.$$

For the datasets we consider, we see that we can keep a 0.01–0.001 fraction of the elements of C while still having a small set S with a neighborhood N(S) satisfying $|N(S)| \ge 0.3n$. In Figure 3 in the appendix, we plot the relationship between the number of edges in the τ-threshold graph and the number of elements coverable by a set of small size, as estimated by the greedy algorithm for max-cover.

Additionally, we have a worst case dependency between the number of edges in the τ-threshold graph and the approximation factor. The guarantee follows from applying Theorem 5 with the following lemma.

Lemma 6.
For $k \ge \frac{2}{c}$, any graph with at least $\frac{1}{12} c^2 \cdot 2n^2$ edges has a set S of size k such that the neighborhood N(S) satisfies $|N(S)| \ge cn$.

(By the values of C being in general position, we mean that the values of C are all unique, or at least that only a few elements take any particular value. We need this to hold since otherwise a threshold based sparsification may exclusively return an empty graph or the complete graph.)

To get a matching lower bound, consider the case where the graph has two disjoint cliques, one of size cn and one of size (1 − c)n. Details are in the appendix.

5 Experiments

5.1 Summarizing Movies and Music from Ratings Data

We consider the problem of summarizing a large collection of movies. We first need to create a feature vector for each movie. Movies can be categorized by the people who like them, and so we create our feature vectors from the MovieLens ratings data [16]. The MovieLens database has 20 million ratings for 27,000 movies from 138,000 users. To do this, we perform low-rank matrix completion and factorization on the ratings matrix [21, 22] to get a matrix $X = UV^T$, where X is the completed ratings matrix, U is a matrix of feature vectors for each user and V is a matrix of feature vectors for each movie. For movies i and j with vectors $v_i$ and $v_j$, we set the benefit function $C_{ij} = v_i^T v_j$. We do not use the normalized dot product (cosine similarity) because we want our summary movies to be movies that were highly rated, and not normalizing makes highly rated movies increase the objective function more.

We complete the ratings matrix using the MLlib library in Apache Spark [29] after removing all but the top seven thousand most rated movies to remove noise from the data. We use locality sensitive hashing to perform sparsification; in particular we use the LSH in the FALCONN library for cosine similarity [3] and the reduction from a cosine similarity hash to a dot product hash [36].
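FALCONN provides tuned hash families; to make the mechanism concrete, here is the classic random-hyperplane hash for cosine similarity [9] in a few lines. This is an illustrative single-table sketch of our own (real LSH schemes use many tables, and the reduction of [36] maps dot products into this setting); the function name is our own.

```python
import numpy as np
from collections import defaultdict

def simhash_buckets(X, n_bits=8, rng=None):
    """Bucket the rows of X by one random-hyperplane hash table.

    The code of a vector x is the sign pattern of n_bits random projections;
    two vectors agree on a given bit with probability 1 - theta/pi, where
    theta is the angle between them [9].  Rows with equal codes land in the
    same bucket and become candidate near neighbors.
    """
    rng = rng or np.random.default_rng(0)
    planes = rng.standard_normal((X.shape[1], n_bits))
    codes = (X @ planes > 0)            # (n, n_bits) boolean sign patterns
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)
    return buckets
```

Only vectors sharing a bucket are compared exactly, which is how the quadratic all-pairs benefit computation is avoided; the approximation constant K in the LSH definition above reflects that dissimilar pairs can still occasionally collide.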
As a baseline we consider sparsification using a scan over the entire dataset, the stochastic greedy algorithm with lazy evaluations [33], and the greedy algorithm with lazy evaluations [30]. The number of elements chosen was set to 40, and for the LSH method and stochastic greedy we average over five trials. We then do a scan over the sparsity parameter t for the sparsification methods and a scan over the number of samples drawn each iteration for the stochastic greedy algorithm. The sparsified methods use the (non-stochastic) lazy greedy algorithm as the base optimization algorithm, which we found worked best for this particular problem. In Figure 1(a) we see that the LSH method very quickly approaches the greedy solution—it is almost identical in value after just a few seconds, even though the value of t is much less than $t^*(\varepsilon)$. The stochastic greedy method requires much more time to get the same function value. Lazy greedy is not plotted, since it took over 500 seconds to finish.

[Figure 1 shows two plots against runtime in seconds: (a) Fraction of Greedy Set Value vs. Runtime and (b) Fraction of Greedy Set Contained vs. Runtime, each with curves for Exact top-t, LSH top-t, and Stochastic Greedy.]

Figure 1: Results for the MovieLens dataset [16]. Figure (a) shows the function value as the runtime increases, normalized by the value the greedy algorithm obtained. As can be seen, our algorithm is within 99.9% of greedy in less than 5 seconds. For this experiment, the greedy algorithm had a runtime of 512 seconds, so this is a 100x speedup for a small penalty in performance.
We also compare to the stochastic greedy algorithm [33], which needs 125 seconds to get equivalent performance, which is 25x slower. Figure (b) shows the fraction of the set returned by each method that was common with the set returned by greedy. We see that the approximate nearest neighbor method has 90% of its elements in common with the greedy set while being 50x faster than greedy, and using exact nearest neighbors can perfectly match the greedy set while being 4x faster than greedy.

(When experimenting on very large datasets, we found that runtime constraints can make it necessary to use stochastic greedy as the base optimization algorithm.)

Table 1: A subset of the summarization outputted by our algorithm on the MovieLens dataset, plus the elements that are represented by each representative with the largest dot product. Each group has a natural interpretation: 90's slapstick comedies, 80's horror, cult classics, etc. Note that this was obtained using only a similarity matrix derived from ratings.

Happy Gilmore
Tommy Boy
Billy Madison
Dumb & Dumber
Ace Ventura: Pet Detective
Road Trip
American Pie 2
Black Sheep

Nightmare on Elm Street
Friday the 13th
Halloween II
Nightmare on Elm Street 3
Child's Play
Return of the Living Dead II
Friday the 13th Part 2
Puppet Master

Star Wars IV
Star Wars V
Raiders of the Lost Ark
Star Wars VI
Indiana Jones and the Last Crusade
Terminator 2
The Terminator
Star Trek II

Shawshank Redemption
Schindler's List
The Usual Suspects
Life Is Beautiful
Saving Private Ryan
American History X
The Dark Knight
Good Will Hunting

Pulp Fiction
Reservoir Dogs
American Beauty
A Clockwork Orange
Trainspotting
Memento
Old Boy
No Country for Old Men

Pride and Prejudice
Anne of Green Gables
Persuasion
Emma

The Notebook
P.S.
I Love You
The Holiday
Remember Me
A Walk to Remember
Mostly Martha
The Proposal
The Vow
Life as We Know It

Desk Set
The Young Victoria
Mansfield Park

The Godfather
The Godfather II
One Flew Over the Cuckoo's Nest
Goodfellas
Apocalypse Now
Chinatown
12 Angry Men
Taxi Driver

A performance metric that can be better than the objective value is the fraction of elements returned that are common with the greedy algorithm. We treat this as a proxy for the interpretability of the results. We believe this metric is reasonable since we found the subset returned by the greedy algorithm to be quite interpretable. We plot this metric against runtime in Figure 1(b). We see that the LSH method quickly gets to 90% of the elements in the greedy set, while stochastic greedy takes much longer to get to just 70% of the elements. The exact sparsification method is able to completely match the greedy solution at this point. One interesting feature is that the LSH method does not go much higher than 90%. This may be due to the increased inaccuracy when looking at elements with smaller dot products. We plot this metric against the number of exact and approximate nearest neighbors t in Figure 4 in the appendix.

We include a subset of the summarization in Table 1, listing for each representative a few of the elements it represents with the largest dot product, to show the interpretability of our results.

5.2 Finding Influential Actors and Actresses

For our second experiment, we consider how to find a diverse subset of actors and actresses in a collaboration network. We have an edge between two actors or actresses if they collaborated in a movie together, weighted by the number of collaborations. Data was obtained from [19], and an actor or actress was only included if he or she was one of the top six in the cast billing. As a measure of influence, we use personalized PageRank [37].
To quickly calculate the people with the largest influence relative to someone, we used the random walk method [4].

We first consider a small instance where we can see how well the sparsified approach works. We build a graph based on the cast of the top thousand most rated movies. This graph has roughly 6000 vertices and 60,000 edges. We then calculate the entire PPR matrix using the power method. Note that this is infeasible on larger graphs in terms of time and memory; even on this moderately sized graph it took six hours and takes up two gigabytes of space. We then compare the value of the greedy algorithm using the entire PPR matrix with the sparsified algorithm using the matrix approximated by Monte Carlo sampling, using the two metrics mentioned in the previous section. We omit exact nearest neighbor and stochastic greedy because it is not clear how they would work without computing the entire PPR matrix. Instead we compare to an approach where we choose a sample from I and calculate the PPR only on these elements using the power method. As mentioned in Section 2, several algorithms utilize random sampling from I. We take k to be 50 for this instance. In Figure 5 in the appendix we see that sparsification performs drastically better in both function value and percent of the greedy set contained for a given runtime.

We now scale up to a larger graph by taking the actors and actresses billed in the top six for the twenty thousand most rated movies. This graph has 57,000 vertices and 400,000 edges. We would not be able to compute the entire PPR matrix for this graph in a reasonable amount of time, and even if we could, it would be a challenge to store the entire matrix in memory. However, we can run the sparsified algorithm in three hours using only 2 GB of memory, which could be improved further by parallelizing the Monte Carlo approximation.

We run the greedy algorithm separately on the actors and actresses.
For each we take the top twenty-five and compare to the actors and actresses with the largest (non-personalized) PageRank. In Figure 2 of the appendix, we see that the PageRank output fails to capture the diversity in nationality of the dataset, while the submodular facility location optimization returns actors and actresses from many of the world's film industries.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1110007 as well as NSF Grants CCF 1344179, 1344364, 1407278, 1422549 and ARO YIP W911NF-14-1-0258.

References

[1] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley, 3rd edition, 2008.

[2] A. Andoni. Exact algorithms from approximation algorithms? (part 2). Windows on Theory, 2012. https://windowsontheory.org/2012/04/17/exact-algorithms-from-approximation-algorithms-part-2/ (version: 2016-09-06).

[3] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In NIPS, 2015.

[4] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova. Monte Carlo methods in PageRank computation: When one iteration is sufficient. SIAM Journal on Numerical Analysis, 2007.

[5] A. Badanidiyuru, B. Mirzasoleiman, A. Karbasi, and A. Krause. Streaming submodular maximization: Massive data summarization on the fly. In KDD, 2014.

[6] R. Barbosa, A. Ene, H. L. Nguyen, and J. Ward. The power of randomization: Distributed submodular maximization on massive datasets. In ICML, 2015.

[7] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 1975.

[8] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.

[9] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, 2002.

[10] J. Chen, H.-R. Fang, and Y.
Saad. Fast approximate k-NN graph construction for high dimensional data via recursive Lanczos bisection. JMLR, 2009.

[11] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.

[12] A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, pages 642–669, 1956.

[13] U. Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998.

[14] V. Garcia, E. Debreuve, F. Nielsen, and M. Barlaud. K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In International Conference on Image Processing, 2010.

[15] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.

[16] GroupLens. MovieLens 20M dataset, 2015. http://grouplens.org/datasets/movielens/20m/.

[17] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. WTF: The who to follow service at Twitter. In WWW, 2013.

[18] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 1963.

[19] IMDb. Alternative interfaces, 2016. http://www.imdb.com/interfaces.

[20] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, 1998.

[21] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In STOC, 2013.

[22] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 2009.

[23] A. Krause and R. G. Gomes. Budgeted nonparametric learning from data streams. In ICML, 2010.

[24] A. Krause, J. Leskovec, C. Guestrin, J. VanBriesen, and C. Faloutsos.
Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 2008.

[25] R. Kumar, B. Moseley, S. Vassilvitskii, and A. Vattani. Fast greedy algorithms in MapReduce and streaming. ACM Transactions on Parallel Computing, 2015.

[26] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, 2007.

[27] H. Lin and J. A. Bilmes. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.

[28] P. Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, pages 1269–1283, 1990.

[29] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. B. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. JMLR, 2016.

[30] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques. Springer, 1978.

[31] V. Mirrokni and M. Zadimoghaddam. Randomized composable core-sets for distributed submodular maximization. In STOC, 2015.

[32] B. Mirzasoleiman, A. Badanidiyuru, and A. Karbasi. Fast constrained submodular maximization: Personalized data summarization. In ICML, 2016.

[33] B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrak, and A. Krause. Lazier than lazy greedy. In AAAI, 2015.

[34] B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. Distributed submodular maximization: Identifying representative elements in massive data. In NIPS, 2013.

[35] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 1978.

[36] B. Neyshabur and N. Srebro. On symmetric and asymmetric LSHs for inner product search.
In ICML, 2015.

[37] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Stanford Digital Libraries, 1999.

[38] A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS, 2014.

[39] S. Tschiatschek, R. K. Iyer, H. Wei, and J. A. Bilmes. Learning mixtures of submodular functions for image collection summarization. In NIPS, 2014.

[40] K. Wei, R. Iyer, and J. Bilmes. Fast multi-stage submodular maximization. In ICML, 2014.

[41] K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In ICML, 2015.