{"title": "Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 2233, "page_last": 2241, "abstract": "We investigate two novel mixed robust/average-case submodular data partitioning problems that we collectively call Submodular Partitioning. These problems generalize purely robust instances of the problem, namely max-min submodular fair allocation (SFA) and \\emph{min-max submodular load balancing} (SLB), and also average-case instances, that is the submodular welfare problem (SWP) and submodular multiway partition (SMP). While the robust versions have been studied in the theory community, existing work has focused on tight approximation guarantees, and the resultant algorithms are not generally scalable to large real-world applications. This contrasts the average case instances, where most of the algorithms are scalable. In the present paper, we bridge this gap, by proposing several new algorithms (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but that also achieve theoretical approximation guarantees comparable to the state-of-the-art. We moreover provide new scalable algorithms that apply to additive combinations of the robust and average-case objectives. We show that these problems have many applications in machine learning (ML), including data partitioning and load balancing for distributed ML, data clustering, and image segmentation. 
We empirically demonstrate the efficacy of our algorithms on real-world problems involving data partitioning for distributed optimization (of convex and deep neural network objectives), and also purely unsupervised image segmentation.", "full_text": "Mixed Robust/Average Submodular Partitioning: Fast Algorithms, Guarantees, and Applications\n\nKai Wei1, Rishabh Iyer1, Shengjie Wang2, Wenruo Bai1, Jeff Bilmes1\n1 Department of Electrical Engineering, University of Washington\n2 Department of Computer Science, University of Washington\n{kaiwei, rkiyer, wangsj, wrbai, bilmes}@u.washington.edu\n\nAbstract\n\nWe investigate two novel mixed robust/average-case submodular data partitioning problems that we collectively call Submodular Partitioning. These problems generalize purely robust instances of the problem, namely max-min submodular fair allocation (SFA) [12] and min-max submodular load balancing (SLB) [25], and also average-case instances, namely the submodular welfare problem (SWP) [26] and submodular multiway partition (SMP) [5]. While the robust versions have been studied in the theory community [11, 12, 16, 25, 26], existing work has focused on tight approximation guarantees, and the resultant algorithms are not generally scalable to large real-world applications. This is in contrast to the average case, where most of the algorithms are scalable. In the present paper, we bridge this gap by proposing several new algorithms (including greedy, majorization-minimization, minorization-maximization, and relaxation algorithms) that not only scale to large datasets but also achieve theoretical approximation guarantees comparable to the state-of-the-art. We moreover provide new scalable algorithms that apply to additive combinations of the robust and average-case objectives. 
We show that these problems have many applications in machine learning (ML), including data partitioning and load balancing for distributed ML, data clustering, and image segmentation. We empirically demonstrate the efficacy of our algorithms on real-world problems involving data partitioning for distributed optimization (of convex and deep neural network objectives), and also purely unsupervised image segmentation.\n\n1 Introduction\n\nThe problem of data partitioning is of great importance to many machine learning (ML) and data science applications, as is evidenced by the wealth of clustering procedures that have been and continue to be developed and used. Most data partitioning problems are based on expected, or average-case, utility objectives, where the goal is to optimize a sum of cluster costs; this includes the ubiquitous k-means procedure [1]. Other algorithms are based on robust objective functions [10], where the goal is to optimize the worst-case cluster cost. Such robust algorithms are particularly important in mission-critical applications, such as parallel and distributed computing, where one single poor partition block can significantly slow down an entire parallel machine (as all compute nodes might need to spin while waiting for a slow node to complete a round of computation). Taking a weighted combination of both robust and average-case objective functions allows one to balance between optimizing worst-case and overall performance. 
We are unaware, however, of any previous work that allows for a mixing between worst- and average-case objectives in the context of data partitioning. This paper studies two new mixed robust/average-case partitioning problems of the following form:\n\nProb. 1: max_{π∈Π} [ λ̄ min_i f_i(A_i^π) + (λ/m) Σ_{j=1}^m f_j(A_j^π) ],  Prob. 2: min_{π∈Π} [ λ̄ max_i f_i(A_i^π) + (λ/m) Σ_{j=1}^m f_j(A_j^π) ],\n\nwhere 0 ≤ λ ≤ 1, λ̄ ≜ 1 − λ, the set of sets π = (A_1^π, A_2^π, ..., A_m^π) is a partition of a finite set V (i.e., ∪_i A_i^π = V and ∀i ≠ j, A_i^π ∩ A_j^π = ∅), and Π refers to the set of all partitions of V into m blocks. The parameter λ controls the objective: λ = 1 is the average case, λ = 0 is the robust case, and 0 < λ < 1 is a mixed case. In general, Problems 1 and 2 are hopelessly intractable, even to approximate, but we assume that f_1, f_2, ..., f_m are all monotone non-decreasing (i.e., f_i(S) ≤ f_i(T) whenever S ⊆ T), normalized (f_i(∅) = 0), and submodular [9] (i.e., ∀S, T ⊆ V, f_i(S) + f_i(T) ≥ f_i(S ∪ T) + f_i(S ∩ T)). These assumptions allow us to develop fast, simple, and scalable algorithms that have approximation guarantees, as is done in this paper. These assumptions, moreover, allow us to retain the naturalness and applicability of Problems 1 and 2 to a wide variety of practical problems. Submodularity is a natural property in many real-world ML applications [20, 15, 18, 27]. 
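To make the two mixed objectives concrete, the following is a minimal Python sketch (illustrative only, not the paper's implementation) that evaluates the Problem 1 and Problem 2 objectives for a fixed partition; the feature-coverage valuation used here is an assumed stand-in for a monotone submodular f_i.

```python
def coverage(S, feats):
    """Monotone submodular valuation: number of distinct features covered by block S."""
    covered = set()
    for v in S:
        covered |= feats[v]
    return len(covered)

def prob1_value(blocks, f, lam):
    """Mixed objective of Problem 1: lam=0 is the robust case (min block),
    lam=1 is the average case (mean of block values)."""
    vals = [f(B) for B in blocks]
    return (1.0 - lam) * min(vals) + (lam / len(blocks)) * sum(vals)

def prob2_value(blocks, f, lam):
    """Mixed objective of Problem 2: lam=0 is the robust case (max block)."""
    vals = [f(B) for B in blocks]
    return (1.0 - lam) * max(vals) + (lam / len(blocks)) * sum(vals)
```

For example, with `feats = {0: {"a"}, 1: {"a", "b"}, 2: {"c"}, 3: {"d"}}` and the partition `[{0, 1}, {2, 3}]`, sweeping `lam` from 0 to 1 interpolates between the worst-block value and the average block value.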
When minimizing, submodularity naturally models notions of interacting costs and complexity, while when maximizing it readily models notions of diversity, summarization quality, and information. Hence, Problem 1 asks for a partition whose blocks each (and that collectively) are a good, say, summary of the whole. Problem 2, on the other hand, asks for a partition whose blocks each (and that collectively) are internally homogeneous (as is typical in clustering). Taken together, we call Problems 1 and 2 Submodular Partitioning. We further categorize these problems depending on whether the f_i's are identical to each other (homogeneous) or not (heterogeneous).^1 The heterogeneous case clearly generalizes the homogeneous setting, but as we will see, the additional homogeneous structure can be exploited to provide more efficient and/or tighter algorithms.\n\nProblem 1 (Max-(Min+Avg)), approximation factor:\nλ = 0, MATCHING [12]: 1/(n − m + 1)\nλ = 0, BINSRCH [16]: 1/(2m − 1)\nλ = 0, ELLIPSOID [11]: 1/O(√n m^{1/4} log n log^{3/2} m)\nλ = 1, GREEDWELFARE [8]: 1/2\nλ = 0, GREEDMAX†*: 1/m\nλ = 0, GREEDSAT*: (1/2 − δ, δ/(1/2 + δ))\nλ = 0, MMAX*: O(min_i [1 + (|A_i^π̂| − 1)(1 − κ_{f_i}(A_i^π̂))]/(|A_i^π̂| √m log³ m))\n0 < λ < 1, COMBSFASWP*: max{βα/(λ̄β + α), λβ}\n0 < λ < 1, GENERALGREEDSAT*: λ/2\nλ = 0, Hardness: 1/2 [12]; λ = 1, Hardness: 1 − 1/e [26]\n\nProblem 2 (Min-(Max+Avg)), approximation factor:\nλ = 0, BALANCED† [25]: min{m, n/m}\nλ = 0, SAMPLING [25]: O(√(n/ log n))\nλ = 0, ELLIPSOID [11]: O(√n log n)\nλ = 1, GREEDSPLIT† [29, 22]: 2\nλ = 1, RELAX [5]: O(log n)\nλ = 0, MMIN*: 2 max_i |A_i^{π*}|/(1 + (|A_i^{π*}| − 1)(1 − κ_{f_i}(A_i^{π*})))\nλ = 0, LOVÁSZ ROUND*: m\n0 < λ < 1, COMBSLBSMP*: min{mα/(mλ̄ + λ), β(mλ̄ + λ)}\n0 < λ < 1, GENERALLOVÁSZ ROUND*: m\nλ = 0, Hardness*: m; λ = 1, Hardness: 2 − 2/m [7]\n\nTable 1: Summary of our contributions and existing work on Problems 1 and 2.^2 See text for details.\n\nPrevious work: Special cases of Problems 1 and 2 have appeared previously. Problem 1 with λ = 0 is called submodular fair allocation (SFA), and Problem 2 with λ = 0 is called submodular load balancing (SLB); both of these robust optimization problems have been studied previously. When the f_i's are all modular, SLB is called minimum makespan scheduling, and an LP relaxation algorithm provides a 2-approximation for the heterogeneous setting [19]. When the objectives are submodular, the problem becomes much harder. Even in the homogeneous setting, [25] show that the problem is information-theoretically hard to approximate within o(√(n/ log n)). They provide a balanced partitioning algorithm yielding a factor of min{m, n/m} under the homogeneous setting. They also give a sampling-based algorithm achieving O(√(n/ log n)) for the homogeneous setting. However, the sampling-based algorithm is not practical and scalable, since it involves solving, in the worst case, O(n³ log n) instances of submodular function minimization, each of which requires O(n⁵γ + n⁶) computation [23], where γ is the cost of a function valuation. Another approach approximates each submodular function by its ellipsoid approximation (again non-scalable) and reduces SLB to its modular version (minimum makespan scheduling), leading to an approximation factor of O(√n log n) [11]. 
SFA, on the other hand, has been studied mostly in the heterogeneous setting. When the f_i's are all modular, the tightest algorithm so far iteratively rounds an LP solution, achieving an O(1/(√m log³ m)) approximation [2], whereas the problem is NP-hard to approximate to 1/2 + ε for any ε > 0 [12]. When the f_i's are submodular, [12] gives a matching-based algorithm with a factor 1/(n − m + 1) approximation that performs poorly when m ≪ n. [16] proposes a binary search algorithm yielding an improved factor of 1/(2m − 1). Similar to SLB, [11] applies the same ellipsoid approximation techniques, leading to a factor of 1/O(√n m^{1/4} log n log^{3/2} m). These approaches are theoretically interesting, but they do not scale to large problems.\n\n^1 Similar sub-categorizations have been called the “uniform” vs. the “non-uniform” case in the past [25, 11].\n^2 Results obtained in this paper are marked as *. Methods for only the homogeneous setting are marked as †.\n\nProblems 1 and 2, when λ = 1, have also been previously studied. Problem 2 becomes the submodular multiway partition (SMP), for which one can obtain a relaxation-based 2-approximation [5] in the homogeneous case. In the heterogeneous case, the guarantee is O(log n) [6]. Similarly, [29, 22] propose a greedy splitting 2-approximation algorithm for the homogeneous setting. Problem 1 becomes the submodular welfare problem [26], for which a scalable greedy algorithm achieves a 1/2 approximation [8]. Unlike the worst case (λ = 0), many of the algorithms proposed for these problems are scalable. The general case (0 < λ < 1) of Problems 1 and 2 differs from either of these extreme cases, since we wish both for a robust (worst-case) and an average-case partitioning, and controlling λ allows one to trade off between the two. 
As we shall see, the flexibility of a mixture can be more natural in certain applications.\n\nApplications: There are a number of applications of submodular partitioning in ML, as outlined below; some of these we evaluate in Section 4. Submodular functions naturally capture notions of interacting cooperative costs and homogeneity and thus are useful for clustering and image segmentation [22, 17]. While the average-case instance has been used before, a more worst-case variant (i.e., Problem 2 with λ ≈ 0) is useful to produce balanced clusterings (i.e., the submodular valuations of all the blocks should be similar to each other). Problem 2 also addresses a problem in image segmentation, namely how to use only submodular functions (which are instances of pseudo-Boolean functions) for multi-label (i.e., non-Boolean) image segmentation. Problem 2 addresses this by allowing each segment j to have its own submodular function f_j, where the objective measures the homogeneity f_j(A_j^π) of segment j based on the image pixels A_j^π assigned to it. Moreover, by combining the average-case and worst-case objectives, one can achieve a tradeoff between the two. Empirically, we evaluate our algorithms on unsupervised image segmentation (Section 4) and find that they outperform other clustering methods, including k-means, k-medoids, spectral clustering, and graph cuts.\nSubmodularity also accurately represents computational costs in distributed systems, as shown in [20]. In fact, [20] considers two separate problems: 1) text data partitioning for balancing memory demands; and 2) parameter partitioning for balancing communication costs. Both are treated by solving an instance of SLB (Problem 2, λ = 0), where memory costs are modeled using a set-cover submodular function and communication costs are modeled using a modular (additive) function. Another important ML application, evaluated in Section 4, is distributed training of statistical models. 
As data set sizes grow, the need for statistical training procedures tolerant of the distributed data partitioning becomes more important. Existing schemes are often developed and performed assuming data samples are distributed in an arbitrary or random fashion. As an alternate strategy, if the data is intelligently partitioned such that each block of samples can itself lead to a good approximate solution, a consensus amongst the distributed results could be reached more quickly than under a poor partitioning. Submodular functions can in fact express the value of a subset of training data for certain machine learning risk functions, e.g., [27]. Using these functions within Problem 1 (with λ ≈ 0), one can expect a partitioning where each block is a good representative of the entire set, thereby achieving faster convergence in distributed settings. We demonstrate empirically, in Section 4, that this provides better results on several machine learning tasks, including the training of deep neural networks.\nOur Contributions: In contrast to Problems 1 and 2 in the average case (i.e., λ = 1), existing algorithms for the worst case (λ = 0) are not scalable. This paper closes this gap by proposing three new classes of algorithmic frameworks to solve SFA and SLB: (1) greedy algorithms; (2) semigradient-based algorithms; and (3) a Lovász-extension-based relaxation algorithm. For SFA, when m = 2, we formulate the problem as non-monotone submodular maximization, which can be approximated up to a factor of 1/2 with O(n) function evaluations [4]. 
For general m, we give a simple and scalable greedy algorithm (GREEDMAX), and show a factor of 1/m in the homogeneous setting, improving the state-of-the-art factor of 1/(2m − 1) under the heterogeneous setting [16]. For the heterogeneous setting, we propose a “saturate” greedy algorithm (GREEDSAT) that iteratively solves instances of the submodular welfare problem. We show GREEDSAT has a bi-criterion guarantee of (1/2 − δ, δ/(1/2 + δ)), which ensures that at least ⌈m(1/2 − δ)⌉ blocks receive utility at least (δ/(1/2 + δ)) OPT for any 0 < δ < 1/2. For SLB, we first generalize the hardness result in [25] and show that it is hard to approximate better than m for any m = o(√(n/ log n)), even in the homogeneous setting. We then give a Lovász-extension-based relaxation algorithm (LOVÁSZROUND) yielding a tight factor of m for the heterogeneous setting. As far as we know, this is the first algorithm achieving a factor of m for SLB in this setting. For both SFA and SLB, we also obtain more efficient algorithms with bounded approximation factors, which we call majorization-minimization (MMIN) and minorization-maximization (MMAX).\nNext, we show algorithms to handle Problems 1 and 2 with general 0 < λ < 1. We first give two simple and generic schemes (COMBSFASWP and COMBSLBSMP), both of which efficiently combine an algorithm for the worst-case problem (the special case λ = 0) with an algorithm for the average case (the special case λ = 1) to provide a guarantee interpolating between the two bounds. For Problem 1, we generalize GREEDSAT, leading to GENERALGREEDSAT, whose guarantee smoothly interpolates in terms of λ between the bi-criterion factor of GREEDSAT in the case of λ = 0 and the constant factor of 1/2 of the greedy algorithm in the case of λ = 1. 
For Problem 2, we generalize LOVÁSZROUND to obtain a relaxation algorithm (GENERALLOVÁSZROUND) that achieves an m-approximation for general λ. The theoretical contributions and the existing work for Problems 1 and 2 are summarized in Table 1. Lastly, we demonstrate the efficacy of Problem 2 on unsupervised image segmentation, and the success of Problem 1 in distributed machine learning, including ADMM and neural network training.\n\n2 Robust Submodular Partitioning (Problems 1 and 2 when λ = 0)\n\nNotation: we define f(j|S) ≜ f(S ∪ j) − f(S) as the gain of j ∈ V in the context of S ⊆ V. We assume w.l.o.g. that the ground set is V = {1, 2, ..., n}.\n\n2.1 Approximation Algorithms for SFA (Problem 1 with λ = 0)\n\nWe first study approximation algorithms for SFA. When m = 2, the problem becomes max_{A⊆V} g(A), where g(A) = min{f_1(A), f_2(V \\ A)}, which is submodular thanks to Theorem 2.1.\nTheorem 2.1. If f_1 and f_2 are monotone submodular, min{f_1(A), f_2(V \\ A)} is also submodular.\nProofs for all theorems in this paper are given in [28]. The simple bi-directional randomized greedy algorithm [4] therefore approximates SFA with m = 2 to a factor of 1/2, matching the problem's hardness. For general m, we approach SFA from the perspective of greedy algorithms. In this work we introduce two variants of a greedy algorithm – GREEDMAX (Alg. 1) and GREEDSAT (Alg. 2), suited to the homogeneous and heterogeneous settings, respectively.\nGREEDMAX: The key idea of GREEDMAX (see Alg. 1) is to greedily add an item with the maximum marginal gain to the block whose current solution is minimum. Initializing {A_i}_{i=1}^m with empty sets, the greedy flavor also comes from the fact that it incrementally grows the solution by greedily improving the overall objective min_{i=1,...,m} f_i(A_i) until {A_i}_{i=1}^m forms a partition. 
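The GREEDMAX idea just described can be sketched directly in Python (a non-lazy illustration, assuming a single homogeneous monotone submodular f given as a set function; the lazy-evaluation speedup of [21] is omitted for clarity):

```python
def greedmax(f, V, m):
    """GREEDMAX sketch (homogeneous setting): repeatedly add the item with the
    largest marginal gain to the currently minimum-valued block."""
    blocks = [set() for _ in range(m)]
    remaining = set(V)
    while remaining:
        # pick the poorest block under the current partial solution
        j = min(range(m), key=lambda i: f(blocks[i]))
        # add the item with the largest marginal gain w.r.t. that block
        a = max(remaining, key=lambda v: f(blocks[j] | {v}) - f(blocks[j]))
        blocks[j].add(a)
        remaining.remove(a)
    return blocks
```

With a modular valuation such as f(S) = |S|, the sketch produces equal-sized blocks; in practice the marginal gains would be cached lazily rather than recomputed for every candidate item.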
Besides its simplicity, Theorem 2.2 offers the optimality guarantee.\nTheorem 2.2. GREEDMAX achieves a guarantee of 1/m under the homogeneous setting.\nBy assuming homogeneity of the f_i's, we obtain a very simple 1/m-approximation algorithm, improving upon the state-of-the-art factor 1/(2m − 1) [16]. Thanks to the lazy evaluation trick described in [21], Line 5 in Alg. 1 need not recompute the marginal gain for every item in each round, enabling GREEDMAX to scale to large data sets.\nGREEDSAT: Though simple and effective in the homogeneous setting, GREEDMAX performs arbitrarily poorly under the heterogeneous setting. To this end we provide another algorithm – “Saturate” Greedy (GREEDSAT, see Alg. 2). The key idea of GREEDSAT is to relax SFA to a much simpler problem – Submodular Welfare (SWP), i.e., Problem 1 with λ = 1. Similar in flavor to the one proposed in [18], GREEDSAT defines an intermediate objective F̄^c(π) = Σ_{i=1}^m f_i^c(A_i^π), where f_i^c(A) = (1/m) min{f_i(A), c} (Line 2). The parameter c controls the saturation in each block, and f_i^c satisfies submodularity for each i. Unlike SFA, the combinatorial optimization problem max_{π∈Π} F̄^c(π) (Line 6) is much easier and is an instance of SWP. In this work, we solve Line 6 by the efficient greedy algorithm described in [8] with a factor of 1/2. One can also use the more computationally expensive multi-linear relaxation algorithm given in [26] to solve Line 6 with a tight factor α = 1 − 1/e. Setting the input argument α as the approximation factor for Line 6, the essential idea of GREEDSAT is to perform a binary search over the parameter c to find the largest c* such that the returned solution π̂^{c*} for the instance of SWP satisfies F̄^{c*}(π̂^{c*}) ≥ αc*. GREEDSAT terminates after solving O(log(min_i f_i(V)/ε)) instances of SWP. 
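The binary-search structure of GREEDSAT can be sketched as follows (illustrative Python, not the paper's code; the inner SWP instance is solved with the item-by-item greedy of [8], which gives the α = 1/2 guarantee):

```python
def greedsat(fs, V, m, alpha=1.0, eps=1e-3):
    """GREEDSAT sketch: binary search on the saturation level c.  Each inner
    step solves a Submodular Welfare instance on the truncated objective
    F_c(pi) = (1/m) * sum_i min(f_i(A_i), c)."""
    def swp_greedy(c):
        blocks = [set() for _ in range(m)]
        def g(i, S):  # truncated (saturated) objective of block i
            return min(fs[i](S), c) / m
        for v in V:  # welfare greedy: give each item to the block it helps most
            i = max(range(m), key=lambda b: g(b, blocks[b] | {v}) - g(b, blocks[b]))
            blocks[i].add(v)
        return blocks

    cmin, cmax = 0.0, min(f(set(V)) for f in fs)
    best = swp_greedy(cmax)
    while cmax - cmin >= eps:
        c = (cmin + cmax) / 2.0
        pi = swp_greedy(c)
        Fc = sum(min(fs[i](B), c) for i, B in enumerate(pi)) / m
        if Fc < alpha * c:   # saturation level c not attainable: shrink
            cmax = c
        else:                # attainable: keep this partition, raise c
            cmin, best = c, pi
    return best
```

As discussed below, setting `alpha` to the empirical (rather than worst-case) performance of the inner greedy, often close to 1, tightens the search in practice.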
Theorem 2.3 gives a bi-criterion optimality guarantee.\nTheorem 2.3. Given ε > 0, 0 ≤ α ≤ 1 and any 0 < δ < α, GREEDSAT finds a partition such that at least ⌈m(α − δ)⌉ blocks receive utility at least (δ/(1 − α + δ))(max_{π∈Π} min_i f_i(A_i^π) − ε).\n\nAlgorithm 1: GREEDMAX\n1: Input: f, m, V.\n2: Let A_1 = ... = A_m = ∅; R = V.\n3: while R ≠ ∅ do\n4:   j* ∈ argmin_j f(A_j)\n5:   a* ∈ argmax_{a∈R} f(a|A_{j*})\n6:   A_{j*} ← A_{j*} ∪ {a*}; R ← R \\ a*\n7: end while\n8: Output {A_i}_{i=1}^m.\n\nAlgorithm 2: GREEDSAT\n1: Input: {f_i}_{i=1}^m, m, V, α.\n2: Let F̄^c(π) = (1/m) Σ_{i=1}^m min{f_i(A_i^π), c}.\n3: Let c_min = 0, c_max = min_i f_i(V).\n4: while c_max − c_min ≥ ε do\n5:   c = (c_max + c_min)/2\n6:   π̂^c ∈ argmax_{π∈Π} F̄^c(π)\n7:   if F̄^c(π̂^c) < αc then\n8:     c_max = c\n9:   else\n10:     c_min = c; π̂ ← π̂^c\n11:   end if\n12: end while\n13: Output: π̂.\n\nAlgorithm 3: LOVÁSZ ROUND\n1: Input: {f_i}_{i=1}^m, m, V.\n2: Solve for {x_i*}_{i=1}^m via convex relaxation.\n3: Rounding: Let A_1 = ... = A_m = ∅.\n4: for j = 1, ..., n do\n5:   î ∈ argmax_i x_i*(j); A_î = A_î ∪ j\n6: end for\n7: Output {A_i}_{i=1}^m.\n\nAlgorithm 4: GREEDMIN\n1: Input: f, m, V.\n2: Let A_1 = ... = A_m = ∅; R = V.\n3: while R ≠ ∅ do\n4:   j* ∈ argmin_j f(A_j)\n5:   a* ∈ argmin_{a∈R} f(a|A_{j*})\n6:   A_{j*} ← A_{j*} ∪ a*; R ← R \\ a*\n7: end while\n8: Output {A_i}_{i=1}^m.\n\nAlgorithm 5: MMIN\n1: Input: {f_i}_{i=1}^m, m, V, partition π^0.\n2: Let t = 0.\n3: repeat\n4:   for i = 1, ..., m do\n5:     Pick a supergradient m_i at A_i^{π^t} for f_i.\n6:   end for\n7:   π^{t+1} ∈ argmin_{π∈Π} max_i m_i(A_i^π)\n8:   t = t + 1\n9: until π^t = π^{t−1}\n10: Output: π^t.\n\nAlgorithm 6: MMAX\n1: Input: {f_i}_{i=1}^m, m, V, partition π^0.\n2: Let t = 0.\n3: repeat\n4:   for i = 1, ..., m do\n5:     Pick a subgradient h_i at A_i^{π^t} for f_i.\n6:   end for\n7:   π^{t+1} ∈ argmax_{π∈Π} min_i h_i(A_i^π)\n8:   t = t + 1\n9: until π^t = π^{t−1}\n10: Output: π^t.\n\nFor any 0 < δ < α, Theorem 2.3 ensures that the top ⌈m(α − δ)⌉ valued blocks in the partition returned by GREEDSAT are (δ/(1 − α + δ) − ε)-optimal. δ controls the trade-off between the number of top valued blocks to bound and the performance guarantee attained for these blocks: the smaller δ is, the more top blocks are bounded, but with a weaker guarantee. We set the input argument α = 1/2 (or α = 1 − 1/e) as the worst-case performance guarantee for solving SWP so that the above theoretical analysis follows. However, the worst case is often achieved only by very contrived submodular functions; for the ones used in practice, the greedy algorithm often leads to a near-optimal solution ([18] and our own observations). Setting α as the actual performance guarantee for SWP (often very close to 1) can improve the empirical bound, and in practice we typically set α = 1 to good effect.\nMMAX: Lastly, we introduce another algorithm for the heterogeneous setting, called minorization-maximization (MMAX, see Alg. 6). Similar to the one proposed in [14], the idea is to iteratively maximize tight lower bounds of the submodular functions. 
Submodular functions have tight modular lower bounds, which are related to the subdifferential ∂f(Y) of the submodular set function f at a set Y ⊆ V [9]. Denoting a subgradient at Y by h_Y ∈ ∂f(Y), the extreme points of ∂f(Y) may be computed via a greedy algorithm: let σ be a permutation of V that assigns the elements in Y to the first |Y| positions (σ(i) ∈ Y if and only if i ≤ |Y|). Each such permutation defines a chain with elements S_0^σ = ∅, S_i^σ = {σ(1), σ(2), ..., σ(i)}, and S_{|Y|}^σ = Y. An extreme point h_Y^σ of ∂f(Y) has each entry h_Y^σ(σ(i)) = f(S_i^σ) − f(S_{i−1}^σ). Defined as above, h_Y^σ forms a lower bound of f, tight at Y, i.e., h_Y^σ(X) = Σ_{j∈X} h_Y^σ(j) ≤ f(X), ∀X ⊆ V, and h_Y^σ(Y) = f(Y). The idea of MMAX is to consider a modular lower bound tight at the set corresponding to each block of a partition. In other words, at iteration t + 1, for each block i, we approximate f_i with its modular lower bound tight at A_i^{π^t} and solve a modular version of Problem 1 (Line 7), which admits efficient approximation algorithms [2]. MMAX is initialized with a partition π^0, which is obtained by solving Problem 1 where each f_i is replaced with the simple modular function f_i'(A) = Σ_{a∈A} f_i(a). The following worst-case bound holds:\nTheorem 2.4. MMAX achieves a worst-case guarantee of O(min_i [1 + (|A_i^π̂| − 1)(1 − κ_{f_i}(A_i^π̂))]/(|A_i^π̂| √m log³ m)), where π̂ = (A_1^π̂, ..., A_m^π̂) is the partition obtained by the algorithm, and κ_f(A) = 1 − min_{v∈V} f(v|A \\ v)/f(v) ∈ [0, 1] is the curvature of a submodular function f at A ⊆ V. 
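The extreme-point construction above amounts to a single greedy sweep over a chain; a minimal sketch (ours, assuming f is supplied as a Python set function):

```python
def modular_lower_bound(f, Y, V):
    """One extreme point h of the subdifferential of f at Y: order the elements
    of Y first, then V \\ Y, and take the chain gains
    h(sigma(i)) = f(S_i) - f(S_{i-1}).  The resulting modular function
    sum_{j in X} h(j) lower-bounds f(X) for every X, with equality at X = Y."""
    order = list(Y) + [v for v in V if v not in Y]
    h, prev, S = {}, 0.0, set()
    for v in order:
        S.add(v)
        cur = f(S)
        h[v] = cur - prev  # marginal gain along the chain
        prev = cur
    return h
```

In MMAX this modular surrogate replaces each f_i at its current block, turning one iteration of Problem 1 into a modular (additive) partitioning problem.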
2.2 Approximation Algorithms for SLB (Problem 2 with λ = 0)\n\nWe next investigate SLB, where the existing hardness result [25] is o(√(n/ log n)), which is independent of m and implicitly assumes that m = Θ(√(n/ log n)). However, applications of SLB often depend on m, with m ≪ n. We hence offer a hardness analysis in terms of m.\nTheorem 2.5. For any ε > 0, SLB cannot be approximated to a factor of (1 − ε)m for any m = o(√(n/ log n)) with a polynomial number of queries, even under the homogeneous setting.\nFor the rest of the paper, we assume m = o(√(n/ log n)) for SLB, unless stated otherwise.\nGREEDMIN: Theorem 2.5 implies that SLB is hard to approximate better than m. However, an arbitrary partition π ∈ Π already achieves the best approximation factor of m that one can hope for under the homogeneous setting, since max_i f(A_i^π) ≤ f(V) ≤ Σ_i f(A_i^{π'}) ≤ m max_i f(A_i^{π'}) for any π' ∈ Π. In practice, one can still implement a greedy-style heuristic, which we refer to as GREEDMIN (Alg. 4). Very similar to GREEDMAX, GREEDMIN differs only in Line 5, where the item with the smallest marginal gain is added. Since the functions are all monotone, any addition to a block can (if anything) only increase its value, so we choose to add to the minimum-valuation block in Line 4 to attempt to keep the maximum-valuation block from growing further.\nLOVÁSZ ROUND: Next we consider the heterogeneous setting, for which we propose a tight algorithm – LOVÁSZ ROUND (see Alg. 3). 
The algorithm proceeds as follows: (1) apply the Lovász extension of submodular functions to relax SLB to a convex program, which is exactly solved to obtain a fractional solution (Line 2); (2) map the fractional solution to a partition using the θ-rounding technique proposed in [13] (Lines 3–6). The Lovász extension, which naturally connects a submodular function f with its convex relaxation f̃, is defined as follows: given any x ∈ [0, 1]^n, we obtain a permutation σ_x by ordering its elements in non-increasing order, and thereby a chain of sets S_0^{σ_x} ⊂ ... ⊂ S_n^{σ_x} with S_j^{σ_x} = {σ_x(1), ..., σ_x(j)} for j = 1, ..., n. The Lovász extension f̃ of f is the weighted sum of the ordered entries of x: f̃(x) = Σ_{j=1}^n x(σ_x(j)) (f(S_j^{σ_x}) − f(S_{j−1}^{σ_x})).\nGiven the convexity of the f̃_i's, SLB is relaxed to the following convex program:\n\nmin_{x_1,...,x_m ∈ [0,1]^n} max_i f̃_i(x_i),  s.t. Σ_{i=1}^m x_i(j) ≥ 1, for j = 1, ..., n.  (1)\n\nDenoting the optimal solution of Eqn. 1 as {x_1*, ..., x_m*}, the θ-rounding step simply maps each item j ∈ V to a block î such that î ∈ argmax_i x_i*(j). The bound for LOVÁSZ ROUND is as follows:\nTheorem 2.6. LOVÁSZROUND achieves a worst-case approximation factor of m.\nWe remark that, to the best of our knowledge, LOVÁSZROUND is the first algorithm that is tight and that gives an approximation in terms of m for the heterogeneous setting.\nMMIN: Similar to MMAX for SFA, we propose majorization-minimization (MMIN, see Alg. 5) for SLB. Here, we iteratively choose modular upper bounds, which are defined via the superdifferential of a submodular function [15] at Y. 
Moreover, there are specific supergradients [14] that define the following two modular upper bounds (when referring to either one, we write m_X^f):\n\nm_{X,1}^f(Y) ≜ f(X) − Σ_{j∈X\\Y} f(j|V \\ j) + Σ_{j∈Y\\X} f(j|∅),  m_{X,2}^f(Y) ≜ f(X) − Σ_{j∈X\\Y} f(j|X \\ j) + Σ_{j∈Y\\X} f(j|X).\n\nThen m_{X,1}^f(Y) ≥ f(Y) and m_{X,2}^f(Y) ≥ f(Y), ∀Y ⊆ V, and m_{X,1}^f(X) = m_{X,2}^f(X) = f(X). At iteration t + 1, for each block i, MMIN replaces f_i with a choice of its modular upper bound m_i tight at A_i^{π^t} and solves a modular version of Problem 2 (Line 7), for which there exists an efficient LP-relaxation-based algorithm [19]. Similar to MMAX, the initial partition π^0 is obtained by solving Problem 2 where each f_i is substituted with f_i'(A) = Σ_{a∈A} f_i(a). The following worst-case bound holds:\nTheorem 2.7. MMIN achieves a worst-case guarantee of 2 max_i |A_i^{π*}|/(1 + (|A_i^{π*}| − 1)(1 − κ_{f_i}(A_i^{π*}))), where π* = (A_1^{π*}, ..., A_m^{π*}) denotes the optimal partition.\n\nFigure 1: Comparison between submodular and random partitions for distributed ML, including ADMM (Fig. 1a, 20Newsgroups) and distributed neural nets (Fig. 1b, MNIST; Fig. 1c, TIMIT). For the box plots, the central mark is the median, the box edges are the 25th and 75th percentiles, and the bars denote the best and worst cases.\n\n3 General Submodular Partitioning (Problems 1 and 2 when 0 < λ < 1)\n\nIn this section we study Problems 1 and 2 in the most general case, i.e., 0 < λ < 1. We first propose a simple and general “extremal combination” scheme that works for both Problems 1 and 2. It naturally combines an algorithm for solving the worst-case problem (λ = 0) with an algorithm for solving the average case (λ = 1). 
We use Problem 1 as an example, but the same scheme easily works for Problem 2. Denote by ALGWC the algorithm for the worst-case problem (i.e., SFA), and by ALGAC the algorithm for the average case (i.e., SWP). The scheme first obtains a partition $\hat\pi_1$ by running ALGWC on the instance of Problem 1 with $\lambda = 0$, and a second partition $\hat\pi_2$ by running ALGAC with $\lambda = 1$. It then outputs whichever of $\hat\pi_1$ and $\hat\pi_2$ achieves the higher valuation for Problem 1. We call this scheme COMBSFASWP. Suppose ALGWC solves the worst-case problem with a factor $\alpha \leq 1$ and ALGAC solves the average case with a factor $\beta \leq 1$. When applied to Problem 2 we refer to this scheme as COMBSLBSMP (with $\alpha \geq 1$ and $\beta \geq 1$). The following guarantee holds for both schemes:

Theorem 3.1. For any $\lambda \in (0,1)$, COMBSFASWP solves Problem 1 with a factor $\max\{\frac{\beta\alpha}{\bar\lambda\beta + \alpha}, \lambda\beta\}$ in the heterogeneous case, and $\max\{\min\{\alpha, \frac{1}{m}\}, \frac{\beta\alpha}{\bar\lambda\beta + \alpha}, \lambda\beta\}$ in the homogeneous case. Similarly, COMBSLBSMP solves Problem 2 with a factor $\min\{\frac{m\alpha}{m\bar\lambda + \lambda}, \beta(m\bar\lambda + \lambda)\}$ in the heterogeneous case, and $\min\{m, \frac{m\alpha}{m\bar\lambda + \lambda}, \beta(m\bar\lambda + \lambda)\}$ in the homogeneous case.

The drawback of COMBSFASWP and COMBSLBSMP is that they do not explicitly exploit the trade-off between the average-case and worst-case objectives in terms of $\lambda$. To obtain more practically interesting algorithms, we also give GENERALGREEDSAT, which generalizes GREEDSAT to solve Problem 1. Similar to GREEDSAT, we define an intermediate objective in GENERALGREEDSAT: $\bar F^c_\lambda(\pi) = \frac{1}{m} \sum_{i=1}^m \min\{\bar\lambda f_i(A^\pi_i) + \lambda \frac{1}{m} \sum_{j=1}^m f_j(A^\pi_j), c\}$.
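The extremal-combination selection and the mixed objective it compares can be sketched as follows. We assume here that Problem 1's mixed objective has the form $\bar\lambda \min_i f_i(A^\pi_i) + \lambda \frac{1}{m} \sum_j f_j(A^\pi_j)$, matching the intermediate objective above; the code and names are ours, purely for illustration:

```python
def mixed_objective(fs, partition, lam):
    """Mixed Problem 1 objective: (1-lam) * min_i f_i(A_i) + lam * (1/m) * sum_j f_j(A_j).

    fs: list of m set functions; partition: list of m disjoint blocks (sets).
    """
    m = len(fs)
    vals = [fs[i](partition[i]) for i in range(m)]
    return (1 - lam) * min(vals) + lam * sum(vals) / m

def comb_sfa_swp(pi_wc, pi_ac, fs, lam):
    """Extremal-combination scheme: return whichever of the two candidate
    partitions (from the worst-case and average-case solvers) scores higher."""
    return max((pi_wc, pi_ac), key=lambda p: mixed_objective(fs, p, lam))

# Toy check with modular (hence submodular) block objectives f_i = |A_i|:
fs = [len, len]
pi_wc = [{0, 1}, {2, 3}]     # balanced: good for the worst-case term
pi_ac = [{0, 1, 2}, {3}]     # unbalanced
assert comb_sfa_swp(pi_wc, pi_ac, fs, lam=0.2) == pi_wc
```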
Following the same algorithmic design as GREEDSAT, GENERALGREEDSAT differs from GREEDSAT only in Line 6, where the submodular welfare problem is defined on the new objective $\bar F^c_\lambda(\pi)$. In [28] we show that GENERALGREEDSAT gives a $\lambda/2$ approximation, while also yielding a bi-criterion guarantee that generalizes Theorem 2.3. In particular, GENERALGREEDSAT recovers the bi-criterion guarantee shown in Theorem 2.3 when $\lambda = 0$. In the case of $\lambda = 1$, GENERALGREEDSAT recovers the 1/2-approximation guarantee of the greedy algorithm for solving the submodular welfare problem, i.e., the average-case objective. Moreover, an improved guarantee is achieved by GENERALGREEDSAT as $\lambda$ increases. Details are given in [28].

To solve Problem 2 we generalize LOVÁSZROUND, leading to GENERALLOVÁSZROUND. Similar to LOVÁSZROUND, we relax each submodular objective to its convex relaxation using the Lovász extension. GENERALLOVÁSZROUND differs from LOVÁSZROUND only in Line 2, where Problem 2 is relaxed to the following convex program:

$$\min_{x_1, \ldots, x_m \in [0,1]^n} \bar\lambda \max_i \tilde f_i(x_i) + \lambda \frac{1}{m} \sum_{j=1}^m \tilde f_j(x_j), \quad \text{s.t. } \sum_{i=1}^m x_i(j) \geq 1, \text{ for } j = 1, \ldots, n.$$

Following the same rounding procedure as LOVÁSZROUND, GENERALLOVÁSZROUND is guaranteed to give an $m$-approximation for Problem 2 with general $\lambda$. Details are given in [28].

4 Experiments and Conclusions

We conclude in this section by empirically evaluating the algorithms proposed for Problems 1 and 2 on real-world data partitioning applications, including distributed ADMM, distributed deep neural network training, and lastly unsupervised image segmentation tasks.

ADMM: We first consider data partitioning for distributed convex optimization. The evaluation task is text categorization on the 20 Newsgroups data set, which consists of 18,774 articles divided almost evenly across 20 classes. We formulate the multi-class classification as an $\ell_2$-regularized logistic regression, which is solved by ADMM as implemented in [3]. We run 10 instances of random partitioning on the training data as a baseline. In this case, we use the feature-based function (the same as the one used in [27]), in the homogeneous setting of Problem 1 (with $\lambda = 0$). We use GREEDMAX as the partitioning algorithm. In Figure 1a, we observe that the resulting partitioning performs much better than a random partitioning (and significantly better than an adversarial partitioning, formed by grouping similar items together). More details are given in [28].

Distributed Deep Neural Network (DNN) Training: Next we evaluate our framework on distributed deep neural network (DNN) training. We test on two tasks: 1) handwritten digit recognition on the MNIST database, which consists of 60,000 training and 10,000 test samples; 2) phone classification on the TIMIT data, which has 1,124,823 training and 112,487 test samples.
A 4-layer DNN model is applied to the MNIST experiment, and we train a 5-layer DNN for TIMIT. For both experiments the submodular partitioning is obtained by solving the homogeneous case of Problem 1 ($\lambda = 0$) using GREEDMAX on a form of clustered facility location (as proposed and used in [27]). We perform distributed training using an averaging stochastic gradient descent scheme, similar to the one in [24]. We also run 10 instances of random partitioning as a baseline. As shown in Figures 1b and 1c, the submodular partitioning outperforms the random baseline. An adversarial partitioning, formed by grouping items with the same class, cannot in either case even be trained.

Unsupervised Image Segmentation: We test the efficacy of Problem 2 on unsupervised image segmentation over the GrabCut data set (30 color images and their ground-truth foreground/background labels). By "unsupervised", we mean that no labeled data, whether for supervised or semi-supervised training, nor any kind of interactive segmentation, was used in forming or optimizing the objective. The submodular partitioning for each image is obtained by solving the homogeneous case of Problem 2 ($\lambda = 0.8$) using a modified variant of GREEDMIN on the facility location function. We compare our method against the other unsupervised methods k-means, k-medoids, spectral clustering, and graph cuts. Given an $m$-partition of an image and its ground-truth labels, we assign each of the $m$ blocks to whichever of the foreground or background labels has the larger intersection with it. In Fig. 2 we show example segmentation results after this mapping on several example images, as well as the averaged F-measure (relative to ground truth) over the whole data set. More details are given in [28].

Acknowledgments: This material is based upon work supported by the National Science Foundation under Grant No.
IIS-1162606, the National Institutes of Health under award R01GM103544, and by a Google, a Microsoft, and an Intel research award. R. Iyer acknowledges support from a Microsoft Research Ph.D. Fellowship. This work was supported in part by TerraSwarm, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.

Figure 2: Unsupervised image segmentation (right: some examples). The left panel reports the average F-measure on all of GrabCut for the ground truth, k-means, k-medoids, spectral clustering, graph cut, and submodular partitioning.

References

[1] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In SODA, 2007.
[2] A. Asadpour and A. Saberi. An approximation algorithm for max-min fair allocation of indivisible goods. SICOMP, 2010.
[3] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011.
[4] N. Buchbinder, M. Feldman, J. Naor, and R. Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In FOCS, 2012.
[5] C. Chekuri and A. Ene. Approximation algorithms for submodular multiway partition. In FOCS, 2011.
[6] C. Chekuri and A. Ene. Submodular cost allocation problem and applications. In Automata, Languages and Programming, pages 354–366. Springer, 2011.
[7] A. Ene, J. Vondrák, and Y. Wu. Local distribution and the symmetry gap: Approximability of multiway partitioning problems. In SODA, 2013.
[8] M. Fisher, G. Nemhauser, and L. Wolsey. An analysis of approximations for maximizing submodular set functions—II. In Polyhedral Combinatorics, 1978.
[9] S. Fujishige. Submodular Functions and Optimization, volume 58. Elsevier, 2005.
[10] L. A. García-Escudero, A. Gordaliza, C. Matrán, and A. Mayo-Iscar.
A review of robust clustering methods. Advances in Data Analysis and Classification, 4(2-3):89–109, 2010.
[11] M. Goemans, N. Harvey, S. Iwata, and V. Mirrokni. Approximating submodular functions everywhere. In SODA, 2009.
[12] D. Golovin. Max-min fair allocation of indivisible goods. Technical Report CMU-CS-05-144, 2005.
[13] R. Iyer, S. Jegelka, and J. Bilmes. Monotone closure of relaxed constraints in submodular optimization: Connections between minimization and maximization: Extended version.
[14] R. Iyer, S. Jegelka, and J. Bilmes. Fast semidifferential based submodular function optimization. In ICML, 2013.
[15] S. Jegelka and J. Bilmes. Submodularity beyond submodular energies: coupling edges in graph cuts. In CVPR, 2011.
[16] S. Khot and A. Ponnuswami. Approximation algorithms for the max-min allocation problem. In APPROX, 2007.
[17] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? TPAMI, 2004.
[18] A. Krause, B. McMahan, C. Guestrin, and A. Gupta. Robust submodular observation selection. JMLR, 2008.
[19] J. K. Lenstra, D. B. Shmoys, and É. Tardos. Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming, 1990.
[20] M. Li, D. Andersen, and A. Smola. Graph partitioning via parallel submodular approximation to accelerate distributed machine learning. arXiv preprint arXiv:1505.04636, 2015.
[21] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, 1978.
[22] M. Narasimhan, N. Jojic, and J. A. Bilmes. Q-clustering. In NIPS, 2005.
[23] J. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 2009.
[24] D. Povey, X. Zhang, and S. Khudanpur. Parallel training of deep neural networks with natural gradient and parameter averaging. arXiv preprint arXiv:1410.7455, 2014.
[25] Z.
Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. In FOCS, 2008.
[26] J. Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In STOC, 2008.
[27] K. Wei, R. Iyer, and J. Bilmes. Submodularity in data subset selection and active learning. In ICML, 2015.
[28] K. Wei, R. Iyer, S. Wang, W. Bai, and J. Bilmes. Mixed robust/average submodular partitioning: Fast algorithms, guarantees, and applications: NIPS 2015 extended supplementary.
[29] L. Zhao, H. Nagamochi, and T. Ibaraki. On generalized greedy splitting algorithms for multiway partition problems. Discrete Applied Mathematics, 143(1):130–143, 2004.