{"title": "Distributed Submodular Maximization: Identifying Representative Elements in Massive Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2049, "page_last": 2057, "abstract": "Many large-scale machine learning problems (such as clustering, non-parametric learning, kernel machines, etc.) require selecting, out of a massive data set, a manageable, representative subset. Such problems can often be reduced to maximizing a submodular set function subject to cardinality constraints. Classical approaches require centralized access to the full data set; but for truly large-scale problems, rendering the data centrally is often impractical.  In this paper, we consider the problem of submodular function maximization in a distributed fashion. We develop a simple, two-stage protocol GreeDI, that is easily implemented using MapReduce style computations. We theoretically analyze our approach, and show, that under certain natural conditions, performance close to the (impractical) centralized approach can be achieved. In our extensive experiments, we demonstrate the effectiveness of our approach on several applications, including sparse Gaussian process inference on tens of millions of examples using Hadoop.", "full_text": "Distributed Submodular Maximization:\n\nIdentifying Representative Elements in Massive Data\n\nBaharan Mirzasoleiman\n\nETH Zurich\n\nRik Sarkar\n\nUniversity of Edinburgh\n\nAmin Karbasi\nETH Zurich\n\nAndreas Krause\n\nETH Zurich\n\nAbstract\n\nMany large-scale machine learning problems (such as clustering, non-parametric\nlearning, kernel machines, etc.) require selecting, out of a massive data set, a\nmanageable yet representative subset. Such problems can often be reduced to\nmaximizing a submodular set function subject to cardinality constraints. Classical\napproaches require centralized access to the full data set; but for truly large-scale\nproblems, rendering the data centrally is often impractical. 
In this paper, we consider\nthe problem of submodular function maximization in a distributed fashion.\nWe develop a simple, two-stage protocol GREEDI that is easily implemented using\nMapReduce style computations. We theoretically analyze our approach, and\nshow that under certain natural conditions, performance close to the (impractical)\ncentralized approach can be achieved. In our extensive experiments, we demonstrate\nthe effectiveness of our approach on several applications, including sparse\nGaussian process inference and exemplar-based clustering, on tens of millions of\ndata points using Hadoop.\n\n1 Introduction\n\nNumerous machine learning algorithms require selecting representative subsets of manageable size\nout of large data sets. Applications range from exemplar-based clustering [1], to active set selection\nfor large-scale kernel machines [2], to corpus subset selection for the purpose of training complex\nprediction models [3]. Many such problems can be reduced to the problem of maximizing a submodular\nset function subject to cardinality constraints [4, 5].\nSubmodularity is a property of set functions with deep theoretical and practical consequences. Submodular\nmaximization generalizes many well-known problems, e.g., maximum weighted matching\nand max coverage, and finds numerous applications in machine learning and social networks, such\nas influence maximization [6], information gathering [7], document summarization [3] and active\nlearning [8, 9]. A seminal result of Nemhauser et al. [10] states that a simple greedy algorithm produces\nsolutions competitive with the optimal (intractable) solution. In fact, assuming nothing but\nsubmodularity, no efficient algorithm produces better solutions in general [11, 12].\nData volumes are increasing faster than the ability of individual computers to process them. 
Distributed\nand parallel processing is therefore necessary to keep up with modern massive datasets.\nThe greedy algorithms that work well for centralized submodular optimization, however, are unfortunately\nsequential in nature; therefore they are poorly suited for parallel architectures. This\nmismatch makes it inefficient to apply classical algorithms directly to distributed setups.\nIn this paper, we develop a simple, parallel protocol called GREEDI for distributed submodular\nmaximization. It requires minimal communication, and can be easily implemented in MapReduce\nstyle parallel computation models [13]. We theoretically characterize its performance, and show that\nunder some natural conditions, for large data sets the quality of the obtained solution is competitive\nwith the best centralized solution. Our experimental results demonstrate the effectiveness of our\napproach on a variety of submodular maximization problems. We show that for problems such as\nexemplar-based clustering and active set selection, our approach leads to parallel solutions that are\nvery competitive with those obtained via centralized methods (98% in exemplar based clustering\nand 97% in active set selection). We implement our approach in Hadoop, and show how it enables\nsparse Gaussian process inference and exemplar-based clustering on data sets containing tens of\nmillions of points.\n\n2 Background and Related Work\n\nDue to the rapid increase in data set sizes, and the relatively slow advances in sequential processing\ncapabilities of modern CPUs, parallel computing paradigms have received much interest. Inhabiting\na sweet spot of resiliency, expressivity and programming ease, the MapReduce style computing\nmodel [13] has emerged as a prominent foundation for large scale machine learning and data mining\nalgorithms [14, 15]. 
MapReduce works by distributing the data to independent machines, where\nit is processed in parallel by map tasks that produce key-value pairs. The output is shuffled, and\ncombined by reduce tasks. Here, each reduce task processes inputs that share the same key. Their\noutput either comprises the ultimate result, or forms the input to another MapReduce computation.\nThe problem of centralized maximization of submodular functions has received much interest, starting\nwith the seminal work of [10]. Recent work has focused on providing approximation guarantees\nfor more complex constraints. See [5] for a recent survey. The work in [16] considers an algorithm\nfor online distributed submodular maximization with an application to sensor selection. However,\ntheir approach requires k stages of communication, which is unrealistic for large k in a MapReduce\nstyle model. The authors in [4] consider the problem of submodular maximization in a streaming\nmodel; however, their approach is not applicable to the general distributed setting. There have also\nbeen improvements in the running time of the greedy solution for solving SET-COVER when\nthe data is large and disk resident [17]. However, this approach is not parallelizable by nature.\nRecently, specific instances of distributed submodular maximization have been studied. Such scenarios\noften occur in large-scale graph mining problems where the data itself is too large to be\nstored on one machine. Chierichetti et al. [18] address the MAX-COVER problem and provide a\n(1 \u2212 1/e \u2212 \u03b5) approximation to the centralized algorithm, however at the cost of passing over the data\nset many times. Their result is further improved by Blelloch et al. [19]. Lattanzi et al. [20] address\nmore general graph problems by introducing the idea of filtering, namely, reducing the size of the\ninput in a distributed fashion so that the resulting, much smaller, problem instance can be solved on\na single machine. 
This idea is, in spirit, similar to our distributed method GREEDI. In contrast, we\nprovide a more general framework, and analyze in which settings performance competitive with the\ncentralized setting can be obtained.\n\n3 The Distributed Submodular Maximization Problem\n\nWe consider the problem of selecting subsets out of a large data set, indexed by V (called ground\nset). Our goal is to maximize a non-negative set function f : 2^V \u2192 R+, where, for S \u2286 V, f(S)\nquantifies the utility of set S, capturing, e.g., how well S represents V according to some objective.\nWe will discuss concrete instances of functions f in Section 3.1. A set function f is naturally\nassociated with a discrete derivative\n\n\u0394f(e|S) \u2250 f(S \u222a {e}) \u2212 f(S),    (1)\n\nwhere S \u2286 V and e \u2208 V, which quantifies the increase in utility obtained when adding e to set S. f\nis called monotone iff for all e and S it holds that \u0394f(e|S) \u2265 0. Further, f is submodular iff for all\nA \u2286 B \u2286 V and e \u2208 V \\ B the following diminishing returns condition holds:\n\n\u0394f(e|A) \u2265 \u0394f(e|B).    (2)\n\nThroughout this paper, we focus on such monotone submodular functions. For now, we adopt the\ncommon assumption that f is given in terms of a value oracle (a black box) that computes f(S) for\nany S \u2286 V. In Section 4.5, we will discuss the setting where f(S) itself depends on the entire data\nset V, and not just the selected subset S. Submodular functions contain a large class of functions\nthat naturally arise in machine learning applications (cf. [5, 4]). The simplest examples of such\nfunctions are modular functions, for which the inequality (2) holds with equality.\nThe focus of this paper is on maximizing a monotone submodular function (subject to some constraint)\nin a distributed manner. Arguably, the simplest form of constraints are cardinality constraints. 
More precisely, we are interested in the following optimization problem:\n\nmax_{S \u2286 V} f(S)   s.t.   |S| \u2264 k.    (3)\n\nWe will denote by Ac[k] the subset of size at most k that achieves the above maximization, i.e.,\nthe best centralized solution. Unfortunately, problem (3) is NP-hard for many classes of submodular\nfunctions [12]. However, a seminal result by Nemhauser et al. [10] shows that a simple\ngreedy algorithm provides a (1 \u2212 1/e) approximation to (3). This greedy algorithm starts with\nthe empty set S_0, and at each iteration i, it chooses an element e \u2208 V that maximizes (1), i.e.,\nS_i = S_{i\u22121} \u222a {arg max_{e\u2208V} \u0394f(e|S_{i\u22121})}. Let Agc[k] denote this greedy-centralized solution of size\nat most k. For several classes of monotone submodular functions, it is known that (1 \u2212 1/e) is the\nbest approximation guarantee that one can hope for [11, 12, 21]. Moreover, the greedy algorithm\ncan be accelerated using lazy evaluations [22].\nIn many machine learning applications where the ground set V is large (e.g., cannot be stored\non a single computer), running a standard greedy algorithm or its variants (e.g., lazy evaluation)\nin a centralized manner is infeasible. Hence, in those applications we seek a distributed solution,\ne.g., one that can be implemented using MapReduce-style computations (see Section 5). From the\nalgorithmic point of view, however, the above greedy method is in general difficult to parallelize,\nsince at each step, only the object with the highest marginal gain is chosen and every subsequent\nstep depends on the preceding ones. More precisely, the problem we are facing in this paper is the\nfollowing. Let the ground set V be partitioned into V1, V2, . . . , Vm, i.e., V = V1 \u222a V2 \u222a \u00b7\u00b7\u00b7 \u222a Vm and\nVi \u2229 Vj = \u2205 for i \u2260 j. We can think of Vi as a subset of elements (e.g., images) on machine i. 
The\nquestions we are trying to answer in this paper are: how to distribute V among m machines, which\nalgorithm should run on each machine, and how to merge the resulting solutions.\n\n3.1 Example Applications Suitable for Distributed Submodular Maximization\n\nIn this part, we discuss two concrete problem instances, with their corresponding submodular objective\nfunctions f, where the size of the datasets often requires a distributed solution for the underlying\nsubmodular maximization.\n\nActive Set Selection in Sparse Gaussian Processes (GPs): Formally a GP is a joint probability\ndistribution over a (possibly infinite) set of random variables X_V, indexed by our ground set\nV, such that every (finite) subset X_S for S = {e_1, . . . , e_s} is distributed according to a multivariate\nnormal distribution, i.e., P(X_S = x_S) = N(x_S; \u00b5_S, \u03a3_{S,S}), where \u00b5_S = (\u00b5_{e_1}, . . . , \u00b5_{e_s}) and\n\u03a3_{S,S} = [K_{e_i,e_j}] (1 \u2264 i, j \u2264 s) are the prior mean vector and prior covariance matrix, respectively.\nThe covariance matrix is parametrized via a (positive definite kernel) function K. For example, a\ncommonly used kernel function in practice where elements of the ground set V are embedded in a\nEuclidean space is the squared exponential kernel K_{e_i,e_j} = exp(\u2212\u2016e_i \u2212 e_j\u2016_2^2 / h^2). In GP regression,\neach data point e \u2208 V is considered a random variable. Upon observations y_A = x_A + n_A (where\nn_A is a vector of independent Gaussian noise with variance \u03c3^2), the predictive distribution of a new\ndata point e \u2208 V is a normal distribution P(X_e | y_A) = N(\u00b5_{e|A}, \u03c3^2_{e|A}), where\n\n\u00b5_{e|A} = \u00b5_e + \u03a3_{e,A}(\u03a3_{A,A} + \u03c3^2 I)^{\u22121}(x_A \u2212 \u00b5_A),\n\u03c3^2_{e|A} = \u03c3^2_e \u2212 \u03a3_{e,A}(\u03a3_{A,A} + \u03c3^2 I)^{\u22121}\u03a3_{A,e}.    (4)\n\nNote that evaluating (4) is computationally expensive as it requires a matrix inversion. Instead, most\nefficient approaches for making predictions in GPs rely on choosing a small \u2013 so called active \u2013 set\nof data points. 
For instance, in the Informative Vector Machine (IVM) one seeks a set S such that\nthe information gain, f(S) = I(Y_S; X_V) = H(X_V) \u2212 H(X_V | Y_S) = (1/2) log det(I + \u03c3^{\u22122} \u03a3_{S,S}), is\nmaximized. It can be shown that this choice of f is monotone submodular [21]. For medium-scale\nproblems, the standard greedy algorithms provide good solutions. In Section 5, we will show how\nGREEDI can choose near-optimal subsets out of a data set of 45 million vectors.\n\nExemplar Based Clustering: Suppose we wish to select a set of exemplars that best represent a\nmassive data set. One approach for finding such exemplars is solving the k-medoid problem [23],\nwhich aims to minimize the sum of pairwise dissimilarities between exemplars and elements of the\ndataset. More precisely, let us assume that for the data set V we are given a distance function d : V \u00d7\nV \u2192 R (not necessarily assumed symmetric, nor obeying the triangle inequality) such that d(\u00b7,\u00b7) encodes\ndissimilarity between elements of the underlying set V. Then, the loss function for k-medoid\ncan be defined as follows: L(S) = (1/|V|) \u2211_{e\u2208V} min_{\u03c5\u2208S} d(e, \u03c5). By introducing an auxiliary element\ne_0 (e.g., e_0 = 0) we can turn L into a monotone submodular function: f(S) = L({e_0}) \u2212 L(S \u222a {e_0}).\nIn words, f measures the decrease in the loss associated with the set S versus the loss associated\nwith just the auxiliary element. It is easy to see that for a suitable choice of e_0, maximizing f is\nequivalent to minimizing L. Hence, the standard greedy algorithm provides a very good solution.\nBut again, the problem becomes computationally challenging when we have a large data set and we\nwish to extract a small set of exemplars. 
Our distributed solution GREEDI addresses this challenge.\n\n3.2 Naive Approaches Towards Distributed Submodular Maximization\n\nOne way of implementing the greedy algorithm in parallel would be the following. We proceed\nin rounds. In each round, all machines \u2013 in parallel \u2013 compute the marginal gains of all elements\nin their sets Vi. They then communicate their candidate to a central processor, which identifies the\nglobally best element, which is in turn communicated to the m machines. This element is then\ntaken into account when selecting the next element and so on. Unfortunately, this approach requires\nsynchronization after each of the k rounds.\nIn many applications, k is quite large (e.g., tens of\nthousands or more), rendering this approach impractical for MapReduce style computations.\nAn alternative approach for large k would be to \u2013 on each machine \u2013 greedily select k/m elements\nindependently (without synchronization), and then merge them to obtain a solution of size k. This\napproach is much more communication efficient, and can be easily implemented, e.g., using a single\nMapReduce stage. Unfortunately, many machines may select redundant elements, and the merged\nsolution may suffer from diminishing returns.\nIn Section 4, we introduce an alternative protocol GREEDI, which requires little communication,\nwhile at the same time yielding a solution competitive with the centralized one, under certain natural\nadditional assumptions.\n\n4 The GREEDI Approach for Distributed Submodular Maximization\n\nIn this section we present our main results. We first provide our distributed solution GREEDI for\nmaximizing submodular functions under cardinality constraints. 
We then show how we can make\nuse of the geometry of data inherent in many practical settings in order to obtain strong\ndata-dependent bounds on the performance of our distributed algorithm.\n\n4.1 An Intractable, yet Communication Efficient Approach\n\nBefore we introduce GREEDI, we first consider an intractable, but communication-efficient parallel\nprotocol to illustrate the ideas. This approach, shown in Alg. 1, first distributes the ground set V to\nm machines. Each machine then finds the optimal solution, i.e., a set of cardinality at most k, that\nmaximizes the value of f in each partition. These solutions are then merged, and the optimal subset\nof cardinality k is found in the combined set. We call this solution Ad[m, k].\nAs the optimum centralized solution Ac[k] achieves the maximum value of the submodular function,\nit is clear that f(Ac[k]) \u2265 f(Ad[m, k]). Further, for the special case of selecting a single element\nk = 1, we have Ac[1] = Ad[m, 1]. In general, however, there is a gap between the distributed and\nthe centralized solution. Nonetheless, as the following theorem shows, this gap cannot be more than\n1/min(m, k). Furthermore, this is the best result one can hope for under our two-round model.\nTheorem 4.1. Let f be a monotone submodular function and let k > 0. Then, f(Ad[m, k]) \u2265\n(1/min(m, k)) f(Ac[k]). In contrast, for any value of m and k, there is a data partition and a monotone\nsubmodular function f such that f(Ac[k]) = min(m, k) \u00b7 f(Ad[m, k]).\n\nAlgorithm 1 Exact Distrib. Submodular Max.\nInput: Set V, # of partitions m, constraints k.\nOutput: Set Ad[m, k].\n1: Partition V into m sets V1, V2, . . . , Vm.\n2: In each partition Vi find the optimum set Ac_i[k] of cardinality k.\n3: Merge the resulting sets: B = \u222a_{i=1}^{m} Ac_i[k].\n4: Find the optimum set of cardinality k in B. Output this solution Ad[m, k].\n\nAlgorithm 2 Greedy Dist. Subm. Max. 
(GREEDI)\nInput: Set V, # of partitions m, constraints l, \u03ba.\nOutput: Set Agd[m, \u03ba, l].\n1: Partition V into m sets V1, V2, . . . , Vm.\n2: Run the standard greedy algorithm on each set Vi. Find a solution Agc_i[\u03ba].\n3: Merge the resulting sets: B = \u222a_{i=1}^{m} Agc_i[\u03ba].\n4: Run the standard greedy algorithm on B until l elements are selected. Return Agd[m, \u03ba, l].\n\nThe proof of all the theorems can be found in the supplement. The above theorem fully characterizes\nthe performance of two-round distributed algorithms in terms of the best centralized solution.\nA similar result in fact also holds for non-negative (not necessarily monotone) functions. Due to\nspace limitation, the result is reported in the appendix. In practice, we cannot run Alg. 1. In particular,\nthere is no efficient way to identify the optimum subset Ac_i[k] in set Vi, unless P=NP. In the\nfollowing, we introduce our efficient approximation GREEDI.\n\n4.2 Our GREEDI Approximation\n\nOur main efficient distributed method GREEDI is shown in Algorithm 2. It parallels the intractable\nAlgorithm 1, but replaces the selection of optimal subsets by a greedy algorithm. Due to the approximate\nnature of the greedy algorithm, we allow the algorithms to pick sets slightly larger than k. In\nparticular, GREEDI is a two-round algorithm that takes the ground set V, the number of partitions\nm, and the cardinality constraints l (final solution) and \u03ba (intermediate outputs). It first distributes\nthe ground set over m machines. Then each machine separately runs the standard greedy algorithm,\nnamely, it sequentially finds an element e \u2208 Vi that maximizes the discrete derivative shown in\n(1). Each machine i \u2013 in parallel \u2013 continues adding elements to the set Agc_i[\u00b7] until it reaches \u03ba\nelements. 
Then the solutions are merged: B = \u222a_{i=1}^{m} Agc_i[\u03ba], and another round of greedy selection\nis performed over B, which this time selects l elements. We denote this solution by Agd[m, \u03ba, l]: the\ngreedy solution for parameters m, \u03ba and l. The following result parallels Theorem 4.1.\n\nTheorem 4.2. Let f be a monotone submodular function and let l, \u03ba, k > 0. Then\n\nf(Agd[m, \u03ba, l]) \u2265 ((1 \u2212 e^{\u2212\u03ba/k})(1 \u2212 e^{\u2212l/\u03ba}) / min(m, k)) f(Ac[k]).\n\nFor the special case of \u03ba = l = k the result of Theorem 4.2 simplifies to f(Agd[m, \u03ba, k]) \u2265 ((1 \u2212 1/e)^2 / min(m, k)) f(Ac[k]).\nFrom Theorem 4.1, it is clear that in general one cannot hope to eliminate the dependency of the\ndistributed solution on min(k, m). However, as we show below, in many practical settings, the\nground set V and f exhibit rich geometrical structure that can be used to prove stronger results.\n\n4.3 Performance on Datasets with Geometric Structure\n\nIn practice, we can hope to do much better than the worst case bounds shown above by exploiting\nunderlying structures often present in real data and important set functions. In this part, we assume\nthat a metric d exists on the data elements, and analyze performance of the algorithm on functions\nthat change gracefully with change in the input. We refer to these as Lipschitz functions. More\nformally, a function f : 2^V \u2192 R is \u03bb-Lipschitz, if for equal sized sets S = {e_1, e_2, . . . , e_k} and\nS\u2032 = {e\u2032_1, e\u2032_2, . . . , e\u2032_k} and for any matching of elements: M = {(e_1, e\u2032_1), (e_2, e\u2032_2), . . . , (e_k, e\u2032_k)},\nthe difference between f(S) and f(S\u2032) is bounded by the total of distances between respective\nelements: |f(S) \u2212 f(S\u2032)| \u2264 \u03bb \u2211_i d(e_i, e\u2032_i). It is easy to see that the objective functions from both\nexamples in Section 3.1 are \u03bb-Lipschitz for suitable kernels/distance functions. 
Two sets S and S\u2032\nare \u03b5-close with respect to f if |f(S) \u2212 f(S\u2032)| \u2264 \u03b5. Sets that are close with respect to f can be\nthought of as good candidates to approximate the value of f over each other; thus one such set is\na good representative of the other. Our goal is to find sets that are suitably close to Ac[k]. At an\nelement v \u2208 V, let us define its \u03b1-neighborhood to be the set of elements within a distance \u03b1 from\nv (i.e., \u03b1-close to v): N_\u03b1(v) = {w : d(v, w) \u2264 \u03b1}. We can in general consider \u03b1-neighborhoods of\npoints of the metric space.\nOur algorithm GREEDI partitions V into sets V1, V2, . . . , Vm for parallel processing. In this subsection,\nwe assume that GREEDI performs the partition by assigning elements uniformly randomly to\nthe machines. The following theorem says that if the \u03b1-neighborhoods are sufficiently dense and f\nis a \u03bb-Lipschitz function, then this method can produce a solution close to the centralized solution:\nTheorem 4.3. If for each e_i \u2208 Ac[k], |N_\u03b1(e_i)| \u2265 km log(k/\u03b4^{1/m}), and algorithm GREEDI assigns\nelements uniformly randomly to m processors, then with probability at least (1 \u2212 \u03b4),\n\nf(Agd[m, \u03ba, l]) \u2265 (1 \u2212 e^{\u2212\u03ba/k})(1 \u2212 e^{\u2212l/\u03ba})(f(Ac[k]) \u2212 \u03bb\u03b1k).\n\n4.4 Performance Guarantees for Very Large Data Sets\n\nSuppose that our data set is a finite sample drawn from an underlying infinite set, according to some\nunknown probability distribution. Let Ac[k] be an optimal solution in the infinite set such that around\neach e_i \u2208 Ac[k], there is a neighborhood of radius at least \u03b1\u2217, where the probability density is at\nleast \u03b2 at all points, for some constants \u03b1\u2217 and \u03b2. 
This implies that the solution consists of elements\ncoming from reasonably dense and therefore representative regions of the data set.\nLet us consider g : R \u2192 R, the growth function of the metric. g(\u03b1) is defined to be the volume of a\nball of radius \u03b1 centered at a point in the metric space. This means, for e_i \u2208 Ac[k] the probability of\na random element being in N_\u03b1(e_i) is at least \u03b2g(\u03b1) and the expected number of \u03b1-neighbors of e_i\nis at least E[|N_\u03b1(e_i)|] = n\u03b2g(\u03b1). As a concrete example, Euclidean metrics of dimension D have\ng(\u03b1) = O(\u03b1^D). Note that for simplicity we are assuming the metric to be homogeneous, so that the\ngrowth function is the same at every point. For heterogeneous spaces, we require g to be a uniform\nlower bound on the growth function at every point.\nIn these circumstances, the following theorem guarantees that if the data set V is sufficiently large\nand f is a \u03bb-Lipschitz function, then GREEDI produces a solution close to the centralized solution.\n\nTheorem 4.4. For n \u2265 8km log(k/\u03b4^{1/m}) / (\u03b2 g(\u03b5/(\u03bbk))), where \u03b5/(\u03bbk) \u2264 \u03b1\u2217, if the algorithm GREEDI assigns\nelements uniformly randomly to m processors, then with probability at least (1 \u2212 \u03b4),\n\nf(Agd[m, \u03ba, l]) \u2265 (1 \u2212 e^{\u2212\u03ba/k})(1 \u2212 e^{\u2212l/\u03ba})(f(Ac[k]) \u2212 \u03b5).\n\n4.5 Handling Decomposable Functions\n\nSo far, we have assumed that the objective function f is given to us as a black box, which we can\nevaluate for any given set S independently of the data set V. In many settings, however, the objective\nf depends itself on the entire data set. In such a setting, we cannot use GREEDI as presented above,\nsince we cannot evaluate f on the individual machines without access to the full set V. Fortunately,\nmany such functions have a simple structure which we call decomposable. 
More precisely, we call\na monotone submodular function f decomposable if it can be written as a sum of (non-negative)\nmonotone submodular functions as follows: f(S) = (1/|V|) \u2211_{i\u2208V} f_i(S). In other words, there is a\nseparate monotone submodular function associated with every data point i \u2208 V. We require that\neach f_i can be evaluated without access to the full set V. Note that the exemplar based clustering\napplication we discussed in Section 3.1 is an instance of this framework, among many others.\nLet us define the evaluation of f restricted to D \u2286 V as follows: f_D(S) = (1/|D|) \u2211_{i\u2208D} f_i(S). Then,\nin the remainder of this section, our goal is to show that assigning each element of the data set\nrandomly to a machine and running GREEDI will provide a solution that is with high probability\nclose to the optimum solution. For this, let us assume the f_i\u2019s are bounded, and without loss of\ngenerality 0 \u2264 f_i(S) \u2264 1 for 1 \u2264 i \u2264 |V|, S \u2286 V. Similar to Section 4.3 we assume that\nGREEDI performs the partition by assigning elements uniformly randomly to the machines. These\nmachines then each greedily optimize f_{Vi}. The second stage of GREEDI optimizes f_U, where\nU \u2286 V is chosen uniformly at random, of size \u2308n/m\u2309. Then, we can show the following result.\n\nTheorem 4.5. Let m, k, \u03b4 > 0, \u03b5 < 1/4 and let n_0 be an integer such that for n \u2265 n_0 we have\nln(n)/n \u2264 \u03b5^2/(mk). For n \u2265 max(n_0, m log(\u03b4/4m)/\u03b5^2), and under the assumptions of Theorem 4.4,\nwe have, with probability at least 1 \u2212 \u03b4,\n\nf(Agd[m, \u03ba, l]) \u2265 (1 \u2212 e^{\u2212\u03ba/k})(1 \u2212 e^{\u2212l/\u03ba})(f(Ac[k]) \u2212 2\u03b5).\n\nThe above result demonstrates why GREEDI performs well on decomposable submodular functions\nwith massive data even when they are evaluated locally on each machine. 
We will report our experimental\nresults on exemplar-based clustering in the next section.\n\n5 Experiments\n\nIn our experimental evaluation we wish to address the following questions: 1) how well does\nGREEDI perform compared to a centralized solution, 2) how good is the performance of GREEDI\nwhen using decomposable objective functions (see Section 4.5), and finally 3) how well does\nGREEDI scale on massive data sets. To this end, we run GREEDI on two scenarios: exemplar based\nclustering and active set selection in GPs. Further experiments are reported in the supplement.\nWe compare the performance of our GREEDI method (using different values of \u03b1 = \u03ba/k) to the\nfollowing naive approaches: a) random/random: in the first round each machine simply outputs k\nrandomly chosen elements from its local data points and in the second round k out of the merged mk\nelements are again randomly chosen as the final output. b) random/greedy: each machine outputs\nk randomly chosen elements from its local data points, then the standard greedy algorithm is run\nover mk elements to find a solution of size k. c) greedy/merge: in the first round k/m elements are\nchosen greedily from each machine and in the second round they are merged to output a solution\nof size k. d) greedy/max: in the first round each machine greedily finds a solution of size k and in\nthe second round the solution with the maximum value is reported. For data sets where we are able\nto find the centralized solution, we report the ratio f(Adist[k])/f(Agc[k]), where Adist[k] is the\ndistributed solution (in particular Agd[m, \u03b1k, k] = Adist[k] for GREEDI).\nExemplar based clustering. Our exemplar based clustering experiment involves GREEDI applied\nto the clustering utility f(S) (see Sec. 3.1) with d(x, x\u2032) = \u2016x \u2212 x\u2032\u20162. We performed our experiments\non a set of 10,000 Tiny Images [24]. 
Each 32 by 32 RGB pixel image was represented by a 3,072-dimensional\nvector. We subtracted from each vector the mean value, normalized it to unit norm, and\nused the origin as the auxiliary exemplar. Fig. 1a compares the performance of our approach to the\nbenchmarks with the number of exemplars set to k = 50, and varying number of partitions m. It can\nbe seen that GREEDI significantly outperforms the benchmarks and provides a solution that is very\nclose to the centralized one. Interestingly, even for very small \u03b1 = \u03ba/k < 1, GREEDI performs\nvery well. Since the exemplar based clustering utility function is decomposable, we repeated the\nexperiment for the more realistic case where the function evaluation in each machine was restricted\nto the local elements of the dataset in that particular machine (rather than the entire dataset). Fig. 1b\nshows similar qualitative behavior for decomposable objective functions.\nLarge scale experiments with Hadoop. As our first large scale experiment, we applied GREEDI\nto the whole dataset of 80,000,000 Tiny Images [24] in order to select a set of 64 exemplars. Our\nexperimental infrastructure was a cluster of 10 quad-core machines running Hadoop with the number\nof reducers set to m = 8000. Hereby, each machine carried out a set of reduce tasks in sequence.\nWe first partitioned the images uniformly at random to reducers. Each reducer separately performed\nthe lazy greedy algorithm on its own set of 10,000 images (\u2248123MB) to extract 64 images with\nthe highest marginal gains w.r.t. the local elements of the dataset in that particular partition. We\nthen merged the results and performed another round of lazy greedy selection on the merged results\nto extract the final 64 exemplars. Function evaluation in the second stage was performed w.r.t. a\nrandomly selected subset of 10,000 images from the entire dataset. The maximum running time per\nreduce task was 2.5 hours. As Fig. 
1c shows, GREEDI substantially outperforms the other distributed\nbenchmarks and scales well to very large datasets. Fig. 1d shows a set of cluster exemplars\ndiscovered by GREEDI, and each column in Fig. 1h shows the 8 images nearest to exemplars 9 and\n16 (shown with red borders) in Fig. 1d.\n\n[Figure 1 panels: (a) Tiny Images 10K; (b) Tiny Images 10K; (c) Tiny Images 80M; (d); (e) Parkinsons Telemonitoring; (f) Parkinsons Telemonitoring; (g) Yahoo! front page; (h)]\n\nFigure 1: Performance of GREEDI compared to the other benchmarks. a) and b) show the mean and standard\ndeviation of the ratio of distributed vs. centralized solution for global and local objective functions with budget\nk = 50 and varying the number m of partitions, for a set of 10,000 Tiny Images. c) shows the distributed\nsolution with m = 8000 and varying k for local objective functions on the whole dataset of 80,000,000 Tiny\nImages. e) shows the ratio of distributed vs. centralized solution with m = 10 and varying k for Parkinsons\nTelemonitoring. f) shows the same ratio with k = 50 and varying m on the same dataset, and g) shows the\ndistributed solution for m = 32 with varying budget k on Yahoo! Webscope data. d) shows a set of cluster\nexemplars discovered by GREEDI, and each column in h) shows 8 images nearest to exemplars 9 and 16 in d).\n\nActive set selection. Our active set selection experiment involves GREEDI applied to the information\ngain f(S) (see Sec. 3.1) with Gaussian kernel, h = 0.75 and \u03c3 = 1. We used the Parkinsons\nTelemonitoring dataset [25] consisting of 5,875 bio-medical voice measurements with 22 attributes\nfrom people with early-stage Parkinson\u2019s disease. We normalized the vectors to zero mean and unit\nnorm. Fig. 1f compares the performance of GREEDI to the benchmarks with fixed k = 50 and varying\nnumber of partitions m. Similarly, Fig. 1e shows the results for fixed m = 10 and varying k. 
We\nfind that GREEDI significantly outperforms the benchmarks.\nLarge scale experiments with Hadoop. Our second large scale experiment uses 45,811,883\nuser visits from the Featured Tab of the Today Module on Yahoo! Front Page [26]. For each visit,\nboth the user and each of the candidate articles are associated with a feature vector of dimension 6.\nHere, we used the normalized user features. Our experimental setup was a cluster of 5 quad-core\nmachines running Hadoop with the number of reducers set to m = 32. Each reducer performed the lazy\ngreedy algorithm on its own set of 1,431,621 vectors (\u224834MB) in order to extract 128 elements with\nthe highest marginal gains w.r.t. the local elements of the dataset in that particular partition. We then\nmerged the results and performed another round of lazy greedy selection on the merged results to\nextract the final active set of size 128. The maximum running time per reduce task was 2.5 hours.\nFig. 1g shows the performance of GREEDI compared to the benchmarks. We note again that GREEDI\nsignificantly outperforms the other distributed benchmarks and can scale well to very large datasets.\n\n6 Conclusion\nWe have developed an efficient distributed protocol, GREEDI, for maximizing a submodular function\nsubject to cardinality constraints. We have theoretically analyzed the performance of our method and\nshowed that under certain natural conditions it performs very close to the centralized (albeit impractical\nfor massive data sets) greedy solution. We have also demonstrated the effectiveness of our approach\nthrough extensive large scale experiments using Hadoop. We believe our results provide an important\nstep towards solving submodular optimization problems in very large scale, real applications.\n\nAcknowledgments. 
This research was supported by SNF 200021-137971, DARPA MSEE\nFA8650-11-1-7156, ERC StG 307036, a Microsoft Faculty Fellowship, an ETH Fellowship, and the\nScottish Informatics and Computer Science Alliance.\n\nReferences\n[1] Delbert Dueck and Brendan J. Frey. Non-metric affinity propagation for unsupervised image categorization. In ICCV, 2007.\n\n[2] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). 2006.\n\n[3] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In North American chapter of the Assoc. for Comp. Linguistics/Human Lang. Tech., 2011.\n\n[4] Ryan Gomes and Andreas Krause. Budgeted nonparametric learning from data streams. In Proc. International Conference on Machine Learning (ICML), 2010.\n\n[5] Andreas Krause and Daniel Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2013.\n\n[6] David Kempe, Jon Kleinberg, and \u00c9va Tardos. Maximizing the spread of influence through a social network. 
In Proceedings of the ninth ACM SIGKDD, 2003.\n\n[7] Andreas Krause and Carlos Guestrin. Submodularity and its applications in optimized information gathering. ACM Transactions on Intelligent Systems and Technology, 2011.\n\n[8] Andrew Guillory and Jeff Bilmes. Active semi-supervised learning using submodular functions. In Uncertainty in Artificial Intelligence (UAI), Barcelona, Spain, July 2011. AUAI.\n\n[9] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 2011.\n\n[10] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 1978.\n\n[11] G. L. Nemhauser and L. A. Wolsey. Best algorithms for approximating the maximum of a submodular set function. Math. Oper. Research, 1978.\n\n[12] Uriel Feige. A threshold of ln n for approximating set cover. Journal of the ACM, 1998.\n\n[13] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004.\n\n[14] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, and Andrew Y. Ng. Map-reduce for machine learning on multicore. In NIPS, 2006.\n\n[15] Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox. Mapreduce for data intensive scientific analyses. In Proc. of the 4th IEEE Inter. Conf. on eScience.\n\n[16] Daniel Golovin, Matthew Faulkner, and Andreas Krause. Online distributed sensor selection. In IPSN, 2010.\n\n[17] Graham Cormode, Howard Karloff, and Anthony Wirth. Set cover algorithms for very large datasets. In Proc. of the 19th ACM intern. conf. on Inf. knowl. manag.\n\n[18] Flavio Chierichetti, Ravi Kumar, and Andrew Tomkins. Max-cover in map-reduce. In Proceedings of the 19th international conference on World wide web, 2010.\n\n[19] Guy E. 
Blelloch, Richard Peng, and Kanat Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011.\n\n[20] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, 2011.\n\n[21] A. Krause and C. Guestrin. Near-optimal nonmyopic value of information in graphical models. In Proc. of Uncertainty in Artificial Intelligence (UAI), 2005.\n\n[22] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. Optimization Techniques, LNCS, pages 234\u2013243, 1978.\n\n[23] Leonard Kaufman and Peter J Rousseeuw. Finding groups in data: an introduction to cluster analysis, volume 344. Wiley-Interscience, 2009.\n\n[24] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 2008.\n\n[25] Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson\u2019s disease progression. In IEEE Int. Conf. Acoust. Speech Signal Process., 2010.\n\n[26] Yahoo! academic relations. r6a, yahoo! front page today module user click log dataset, version 1.0, 2012.\n\n[27] Tore Opsahl and Pietro Panzarasa. Clustering in weighted networks. Social networks, 2009.\n", "award": [], "sourceid": 1029, "authors": [{"given_name": "Baharan", "family_name": "Mirzasoleiman", "institution": "ETH Zurich"}, {"given_name": "Amin", "family_name": "Karbasi", "institution": "ETH Zurich"}, {"given_name": "Rik", "family_name": "Sarkar", "institution": "University of Edinburgh"}, {"given_name": "Andreas", "family_name": "Krause", "institution": "ETH Zurich"}]}