{"title": "Streaming Min-max Hypergraph Partitioning", "book": "Advances in Neural Information Processing Systems", "page_first": 1900, "page_last": 1908, "abstract": "In many applications, the data is of rich structure that can be represented by a hypergraph, where the data items are represented by vertices and the associations among items are represented by hyperedges. Equivalently, we are given an input bipartite graph with two types of vertices: items, and associations (which we refer to as topics). We consider the problem of partitioning the set of items into a given number of parts such that the maximum number of topics covered by a part of the partition is minimized. This is a natural clustering problem, with various applications, e.g. partitioning of a set of information objects such as documents, images, and videos, and load balancing in the context of computation platforms. In this paper, we focus on the streaming computation model for this problem, in which items arrive online one at a time and each item must be assigned irrevocably to a part of the partition at its arrival time. Motivated by scalability requirements, we focus on the class of streaming computation algorithms with memory limited to be at most linear in the number of the parts of the partition. We show that a greedy assignment strategy is able to recover a hidden co-clustering of items under a natural set of recovery conditions. 
We also report results of an extensive empirical evaluation, which demonstrate that this greedy strategy yields superior performance when compared with alternative approaches.", "full_text": "Streaming Min-Max Hypergraph Partitioning\n\nDan Alistarh\nMicrosoft Research\nCambridge, United Kingdom\ndan.alistarh@microsoft.com\n\nJennifer Iglesias*\nCarnegie Mellon University\nPittsburgh, PA\njiglesia@andrew.cmu.edu\n\nMilan Vojnovic\nMicrosoft Research\nCambridge, United Kingdom\nmilanv@microsoft.com\n\nAbstract\n\nIn many applications, the data is of rich structure that can be represented by a\nhypergraph, where the data items are represented by vertices and the associations\namong items are represented by hyperedges. Equivalently, we are given an input\nbipartite graph with two types of vertices: items, and associations (which we refer\nto as topics). We consider the problem of partitioning the set of items into a given\nnumber of components such that the maximum number of topics covered by a\ncomponent is minimized. This is a clustering problem with various applications,\ne.g. partitioning of a set of information objects such as documents, images, and\nvideos, and load balancing in the context of modern computation platforms.\nIn this paper, we focus on the streaming computation model for this problem, in\nwhich items arrive online one at a time and each item must be assigned irrevocably\nto a component at its arrival time. Motivated by scalability requirements, we focus\non the class of streaming computation algorithms with memory limited to be at\nmost linear in the number of components. We show that a greedy assignment\nstrategy is able to recover a hidden co-clustering of items under a natural set of\nrecovery conditions. 
We also report results of an extensive empirical evaluation,\nwhich demonstrate that this greedy strategy yields superior performance when\ncompared with alternative approaches.\n\n1 Introduction\n\nIn a variety of applications, one needs to process data of rich structure that can be conveniently\nrepresented by a hypergraph, where associations of the data items, represented by vertices, are\nrepresented by hyperedges, i.e. subsets of items. Such a data structure can be equivalently represented\nby a bipartite graph that has two types of vertices: vertices that represent items, and vertices that\nrepresent associations among items, which we refer to as topics. In this bipartite graph, each item\nis connected to one or more topics. The input can be seen as a graph with vertices belonging to\n(overlapping) communities.\n\n* Work performed in part while an intern with Microsoft Research.\n\nFigure 1: A simple example of a set of items with overlapping associations to topics.\n\nFigure 2: An example of hidden co-clustering with \ufb01ve hidden clusters.\n\nThere has been signi\ufb01cant work on partitioning a set of items into disjoint components such that\nsimilar items are assigned to the same component, see, e.g., [8] for a survey. This problem arises in\nthe context of clustering of information objects such as documents, images or videos. For example,\nthe goal may be to partition a given collection of documents into disjoint sub-collections such that\nthe maximum number of distinct topics covered by each sub-collection is minimized, resulting in a\nparsimonious summary. The same fundamental problem also arises in processing of complex data\nworkloads, including enterprise emails [10], online social networks [18], graph data processing and\nmachine learning computation platforms [20, 21, 2], and load balancing in modern streaming query\nprocessing platforms [24]. 
In this context, the goal is to partition a set of data items over a given\nnumber of servers to balance the load according to some given criteria.\n\nProblem De\ufb01nition. We consider the min-max hypergraph partitioning problem de\ufb01ned as follows.\nThe input to the problem is a set of items, a set of topics, the number of components into which the\nset of items is to be partitioned, and a demand matrix that speci\ufb01es which particular subset of topics is associated with\neach individual item. Given a partitioning of the set of items, the cost of a component is de\ufb01ned\nas the number of distinct topics that are associated with items of the given component. The cost of\na given partition is the maximum cost of a component. In other words, given an input hypergraph\nand a partition of the set of vertices into a given number of disjoint components, the cost of a\ncomponent is de\ufb01ned to be the number of hyperedges that have at least one vertex assigned to this\ncomponent. For example, for the simple input graph in Figure 1, a partition of the set of items into\ntwo components {1, 3} and {2, 4} gives each component a cost of 2; thus, the\ncost of the partition is 2. The cost of a component is a submodular function as the distinct\ntopics associated with items of the component correspond to a neighborhood set in the input bipartite\ngraph.\nIn the streaming computation model that we consider, items arrive sequentially one at a time, and\neach item needs to be assigned, irrevocably, to one component at its arrival time. This streaming\ncomputation model allows only limited memory to be used at any time during the execution, with\nsize restricted to be at most linear in the number of components. Both these assumptions arise\nas part of system requirements for deployment in web-scale services.\nThe min-max hypergraph partitioning problem is NP-hard. 
The streaming computation problem is even\nmore dif\ufb01cult, as less information is available to the algorithm when an item must be assigned.\n\nContribution. In this paper, we consider the streaming min-max hypergraph partitioning problem.\nWe identify a greedy item placement strategy which outperforms all alternative approaches\nconsidered on real-world datasets, and can be proven to have a non-trivial recovery property: it recovers\nhidden co-clusters of items in probabilistic inputs subject to a recovery condition.\nSpeci\ufb01cally, we show that, given a set of hidden co-clusters to be placed onto k components, the\ngreedy strategy will tend to place items from the same hidden cluster onto the same component, with\nhigh probability. In turn, this property implies that greedy will provide a constant factor\napproximation of the optimal partition on inputs satisfying the recovery property.\nThe probabilistic input model we consider is de\ufb01ned as follows. The set of topics is assumed to\nbe partitioned into a given number \u2113 \u2265 1 of disjoint hidden clusters. Each item is connected to\ntopics according to a mixture probability distribution de\ufb01ned as follows. Each item \ufb01rst selects one\nof the hidden clusters as a home hidden cluster by drawing an independent sample from a uniform\ndistribution over the hidden clusters. Then, it connects to each topic from its home hidden cluster\nindependently with probability p, and it connects to each topic from each other hidden cluster with\nprobability q \u2264 p. This de\ufb01nes a hidden co-clustering of the input bipartite graph; see Figure 2 for\nan example.\nThis model is similar in spirit to the popular stochastic block model of an undirected graph, and\nit corresponds to a hidden co-clustering [6, 7, 17, 4] model of an undirected bipartite graph. 
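
The generative model just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name sample_hidden_coclustering, the fixed cluster size r, the integer topic numbering (cluster c owns topics c*r through c*r + r - 1), and the seed argument are all choices made for this example.

```python
import random


def sample_hidden_coclustering(n, r, ell, p, q, seed=0):
    # Hypothetical helper: returns (homes, items), where homes[i] is the home
    # cluster of item i and items[i] is the set of topic ids it connects to.
    rng = random.Random(seed)
    homes, items = [], []
    for _ in range(n):
        home = rng.randrange(ell)  # uniform choice of the home hidden cluster
        topics = set()
        for cluster in range(ell):
            prob = p if cluster == home else q  # p inside the home cluster, q outside
            for t in range(r):
                if rng.random() < prob:
                    topics.add(cluster * r + t)  # assumed topic numbering
        homes.append(home)
        items.append(topics)
    return homes, items


homes, items = sample_hidden_coclustering(n=1000, r=20, ell=5, p=0.5, q=0.01)
home_hits = sum(len([t for t in ts if t // 20 == h]) for h, ts in zip(homes, items))
noise_hits = sum(len(ts) for ts in items) - home_hits
# Mean number of home topics and of noise topics per item
print(home_hits / 1000, noise_hits / 1000)
```

With these illustrative parameters, the two printed averages should be close to r*p and (ell - 1)*r*q respectively, which shows how the gap between p and q separates home topics from noise.
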
We\nconsider asymptotically accurate recovery of this hidden co-clustering.\n\nA hidden cluster is said to be asymptotically recovered if the fraction of items from the given hidden\ncluster assigned to the same partition goes to one asymptotically as the number of items observed\ngrows large. An algorithm guarantees balanced asymptotic recovery if, additionally, it ensures that\nthe cost of the most loaded partition is within a constant factor of the average partition load.\nOur main analytical result shows that a simple greedy strategy provides balanced asymptotic\nrecovery of hidden clusters (Theorem 1). We prove that a suf\ufb01cient condition for the recovery of\nhidden clusters is that the number of hidden clusters \u2113 is at least k log k, where k is the number\nof components, and that the gap between the probability parameters q and p is suf\ufb01ciently large:\nq < log r/(kr) < 2 log r/r \u2264 p, where r is the number of topics in a hidden cluster. Roughly\nspeaking, this means that if the mean number of topics with which an item is associated in its\nhome hidden cluster of topics is at least twice as large as the mean number of topics with which an\nitem is associated from other hidden clusters of topics, then the simple greedy online algorithm\nguarantees asymptotic recovery.\nThe proof is based on a coupling argument, where we \ufb01rst show that assigning an item to a\npartition based on the number of topics it has in common with each partition is similar to making the\nassignment proportionally to the number of items corresponding to the same hidden cluster present\non each partition. In turn, this allows us to couple the assignment strategy with a Polya urn\nprocess [5] with \u201crich-get-richer\u201d dynamics, which implies that the policy converges to assigning each\nitem from a hidden cluster to the same partition. Additionally, this phenomenon occurs \u201cin parallel\u201d\nfor each cluster. 
This recovery property will imply that this strategy will ensure a constant factor\napproximation of the optimum assignment.\nFurther, we provide experimental evidence that this greedy online algorithm exhibits good\nperformance for several real-world input bipartite graphs, outperforming more complex assignment\nstrategies, and even some of\ufb02ine approaches.\n\n2 Problem De\ufb01nition and Basic Results\n\nIn this section we provide a formal problem de\ufb01nition, and present some basic results on the\ncomputational hardness and lower bounds.\nInput. The input is de\ufb01ned by a set of items N = {1, 2, ..., n}, a set of topics M = {1, 2, ..., m},\nand a given number of components k. Dependencies between items and topics are given by a demand\nmatrix D = (d_{i,l}) \u2208 {0, 1}^{n\u00d7m} where d_{i,l} = 1 indicates that item i needs topic l, and d_{i,l} = 0,\notherwise.1\nAlternatively, we can represent the input as a bipartite graph G = (N, M, E) where there is an edge\n(i, l) \u2208 E if and only if item i needs topic l or as a hypergraph H = (N, E) where a hyperedge\ne \u2208 E consists of all items that use the same topic.\nThe Problem. An assignment of items to components is given by x \u2208 {0, 1}^{n\u00d7k} where x_{i,j} = 1\nif item i is assigned to component j, and x_{i,j} = 0, otherwise. Given an assignment of items to\ncomponents x, the cost of component j is de\ufb01ned to be equal to the minimum number of distinct\ntopics that are needed by this component to cover all the items assigned to it, i.e.\n\nc_j(x) = \u2211_{l \u2208 M} min{ \u2211_{i \u2208 N} d_{i,l} x_{i,j}, 1 }.\n\nAs de\ufb01ned, the cost of each component is a submodular function of the items assigned to it. 
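
As a concrete illustration of the cost c_j(x), the following sketch computes component costs for a toy instance. The dictionary-based representation, the helper names component_costs and partition_cost, and the instance itself are assumptions made for this example (it is not the instance of Figure 1).

```python
def component_costs(item_topics, assignment, k):
    # item_topics[i]: set of topics needed by item i (row i of the demand matrix D)
    # assignment[i]: index j in 0..k-1 of the component that item i is assigned to
    covered = [set() for _ in range(k)]
    for item, j in assignment.items():
        covered[j] |= item_topics[item]
    # c_j(x) is the number of distinct topics covered by component j
    return [len(topics) for topics in covered]


def partition_cost(item_topics, assignment, k):
    # Min-max objective: the cost of the most loaded component
    return max(component_costs(item_topics, assignment, k))


# Hypothetical toy instance with four items and three topics
item_topics = {1: {'a', 'b'}, 2: {'b', 'c'}, 3: {'a'}, 4: {'c'}}
good = {1: 0, 3: 0, 2: 1, 4: 1}
bad = {1: 0, 2: 0, 3: 1, 4: 1}
print(component_costs(item_topics, good, 2), partition_cost(item_topics, good, 2))
print(component_costs(item_topics, bad, 2), partition_cost(item_topics, bad, 2))
```

In this toy instance, grouping items that share topics keeps the maximum cost at 2, whereas the second assignment raises it to 3.
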
We\nconsider the min-max hypergraph partitioning problem de\ufb01ned as follows:\n\nminimize    max{c_1(x), c_2(x), ..., c_k(x)}\nsubject to  \u2211_{j \u2208 [k]} x_{i,j} = 1  \u2200 i \u2208 [n]        (1)\n            x \u2208 {0, 1}^{n\u00d7k}\n\nWe note that this problem is an instance of submodular load balancing, as de\ufb01ned in [23].\n\n1 The framework allows for a natural generalization to allow for real-valued demands. In this paper we focus\non {0, 1}-valued demands.\n\nBasic Results. This problem is NP-Complete, by reduction from the subset sum problem.\nProposition 1. The min-max hypergraph partitioning problem is NP-Complete.\n\nWe now give a lower bound on the optimal value of the problem, using the observation that each\ntopic needs to be made available on at least one component.\nProposition 2. For every partition of the set of items in k components, the maximum cost of a\ncomponent is larger than or equal to m/k, where m is the number of topics.\n\nWe next analyze the performance of an algorithm which simply assigns each item independently to\na component chosen uniformly at random from the set of all components upon its arrival. Although\nthis is a popular strategy commonly deployed in practice (e.g. for load balancing in computation\nplatforms), the following result shows that it does not yield a good solution for the min-max\nhypergraph partitioning problem.\nProposition 3. The expected maximum load of a component under random assignment is at least\n(1 \u2212 \u2211_{j=1}^{m} (1 \u2212 1/k)^{n_j} / m) \u00b7 m, where n_j is the number of items associated with topic j.\n\nFor instance, if we assume that n_j \u2265 k for each topic j, we obtain that the expected maximum load\nis at least (1 \u2212 1/e)m. 
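
The bound of Proposition 3 is easy to evaluate numerically. The sketch below uses a hypothetical helper, random_assignment_bound, and an invented instance with m = 100 topics, each associated with n_j = 10 items, and k = 10 components. It relies on the identity (1 - sum_j (1 - 1/k)^(n_j) / m) * m = sum_j (1 - (1 - 1/k)^(n_j)).

```python
import math


def random_assignment_bound(topic_sizes, k):
    # topic_sizes[j] is n_j, the number of items associated with topic j.
    # Each term is the probability that topic j lands on one fixed component.
    return sum(1.0 - (1.0 - 1.0 / k) ** n_j for n_j in topic_sizes)


m, k = 100, 10
bound = random_assignment_bound([k] * m, k)
# Proposition 3 bound, the (1 - 1/e) * m estimate, and the lower bound m / k
print(bound, (1 - 1 / math.e) * m, m / k)
```

For this invented instance the bound evaluates to about 65.1, which is above (1 - 1/e) * m (about 63.2) and far above the lower bound m / k = 10 of Proposition 2.
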
This suggests that the performance of random assignment is poor: on\nan input where m topics form k disjoint clusters, and each item subscribes to a single cluster, the\noptimal solution has cost m/k, whereas, by the above claim, random assignment has approximate\ncost 2m/3, yielding a competitive ratio that is linear in k.\n\nBalanced Recovery of Hidden Co-Clusters. We relax the worst-case input requirements by\nde\ufb01ning a family of hidden co-clustering inputs. Our model is a generalization of the stochastic block\nmodel of a graph to the case of hypergraphs.\nWe consider a set of topics R, partitioned into \u2113 clusters C_1, C_2, ..., C_\u2113, each of which contains\nr topics. Given these hidden clusters, each item is associated with topics as follows. Each item is\n\ufb01rst assigned a \u201chome\u201d cluster C_h, chosen uniformly at random among the hidden clusters. The\nitem then connects to topics inside its home cluster by picking each topic independently with \ufb01xed\nprobability p. Further, the item connects to topics from a \ufb01xed arbitrary \u201cnoise\u201d set Q_h of size at\nmost r/2 outside its home cluster C_h, where the item is connected to each topic in Q_h independently at\nrandom, with \ufb01xed probability q. (Sampling outside topics from the set of all possible topics would\nin the limit lead every partition to contain all possible topics, which renders the problem trivial.\nWe do not impose this limitation in the experimental validation.)\nDe\ufb01nition 1 (Hidden Co-Clustering). 
A bipartite graph is in HC(n, r, \u2113, p, q) if it is constructed\nusing the above process, with n items and \u2113 clusters with r topics per cluster, where each item\nsubscribes to topics inside its randomly chosen home cluster with probability p, and to topics from\nthe noise set with probability q.\n\nAt each time step t, a new item is presented in the input stream of items, and is immediately assigned\nto one of the k components, S_1, S_2, ..., S_k, according to some algorithm. Algorithms do not know\nthe number of hidden clusters or their size, but can examine previous assignments.\nDe\ufb01nition 2 (Asymptotic Balanced Recovery). Given a hidden co-clustering HC(n, r, \u2113, p, q), we\nsay an algorithm asymptotically recovers the hidden clusters C_1, C_2, ..., C_\u2113 if there exists a\nrecovery time t_R during its execution after which, for each hidden cluster C_i, there exists a component\nS_j such that each item with home cluster C_i is assigned to component S_j with probability that goes\nto 1 as the number of items grows large. Moreover, the recovery is balanced if the ratio between\nthe maximum cost of a component and the average cost over components is upper bounded by a\nconstant B > 0.\n\n3 Streaming Algorithm and the Recovery Guarantee\n\nRecall that we consider the online problem, where we receive one item at a time together with all its\ncorresponding topics. 
The item must be immediately and irrevocably assigned to some component.\nIn the following, we describe the greedy strategy, speci\ufb01ed in Algorithm 1.\n\nData: Hypergraph H = (V, E), received one item (vertex) at a time, k partitions, capacity bound c\nResult: A partition of V into k parts\n\nSet initial partitions S_1, S_2, ..., S_k to be empty sets\nwhile there are incoming items do\n    Receive the next item t, and its topics R\n    I \u2190 {i : |S_i| \u2264 min_j |S_j| + c}  /* components not exceeding capacity */\n    Compute r_i = |S_i \u2229 R| \u2200 i \u2208 I  /* size of topic intersection */\n    j \u2190 arg max_{i \u2208 I} r_i  /* if tied, choose a least loaded component */\n    S_j \u2190 S_j \u222a R  /* item t and its topics are assigned to S_j */\nreturn S_1, S_2, ..., S_k\n\nAlgorithm 1: The greedy algorithm.\n\nThis strategy places each incoming item onto the component whose incremental cost (after adding\nthe item and its topics) is minimized. The immediate goal is not balancing, but rather clustering\nsimilar items. This could in theory lead to large imbalances; to prevent this, we add a balancing\nconstraint specifying the maximum load imbalance. If adding the item to the \ufb01rst candidate\ncomponent would violate the balancing constraint, then the item is assigned to the \ufb01rst valid component,\nin decreasing order of the intersection size.\n\n3.1 The Recovery Theorem\n\nIn this section, we present our main theoretical result, which provides a suf\ufb01cient condition for the\ngreedy strategy to guarantee balanced asymptotic recovery of hidden clusters.\nTheorem 1 (The Recovery Theorem). For a random input consisting of a hidden co-cluster graph\nG in HC(n, r, \u2113, p, q) to be partitioned across k \u2265 2 components, if the number of clusters is \u2113 \u2265\nk log k, and the probabilities p and q satisfy p \u2265 2 log r/r, and q \u2264 log r/(rk), then the greedy\nalgorithm ensures balanced asymptotic recovery of the hidden clusters.\n\nRemarks. 
Speci\ufb01cally, we prove that, under the given conditions, recovery occurs for each hidden\ncluster by the time r/log r cluster items have been observed, with probability 1 \u2212 1/r^c, where c \u2265 1\nis a constant. Moreover, clusters are randomly distributed among the k components.\nTogether, these results can be used to bound the maximum cost of a partition to be at most a\nconstant factor away from the lower bound of r\u2113/k given by Proposition 2. The extra cost comes from incorrect\nassignments before the recovery time, and from the imperfect balancing of clusters over the\ncomponents.\nCorollary 1. The expected maximum load of a component is at most 2.4r\u2113/k.\n\n3.2 Proof Overview\n\nWe now provide an overview of the main ideas of the proof, which is available in the full version of\nthe paper.\n\nPreliminaries. We say that two random processes are coupled if their random choices are the\nsame. We say that an event occurs with high probability (w.h.p.) if it occurs with probability at least\n1 \u2212 1/r^c, where c \u2265 1 is a constant. We make use of a Polya urn process [5], which is de\ufb01ned as\nfollows. We start each of k \u2265 2 urns with one ball, and, at each step t, observe a new ball. We assign\nthe new ball to urn i \u2208 {1, ..., k} with probability proportional to (b_i)^\u03b3, where \u03b3 > 0 is a \ufb01xed real\nconstant, and b_i is the number of balls in urn i at time t. We use the following classic result.\nLemma 1 (Polya Urn Convergence [5]). Consider a \ufb01nite k-bin Polya urn process with exponent\n\u03b3 > 1, and let x_i^t be the fraction of balls in urn i at time t. Then, almost surely, the limit\nX_i = lim_{t\u2192\u221e} x_i^t exists for each 1 \u2264 i \u2264 k. Moreover, there exists an urn j such that\nX_j = 1, and X_i = 0 for all i \u2260 j.\nStep 1: Recovering a Single Cluster. 
We \ufb01rst prove that, in the case of a single home cluster\nfor all items, and two components (k = 2), the greedy algorithm\nwith no balance constraints converges to a monopoly, i.e., eventually assigns all the items from\nthis cluster onto the same component, w.h.p. Formally, there exists some convergence time t_R and\nsome component S_i such that, after time t_R, all future items will be assigned to component S_i, with\nprobability at least 1 \u2212 1/r^c.\n\nDataset           | Items     | Topics       | # of Items | # of Topics | # edges\nBook Ratings      | Readers   | Books        | 107,549    | 105,283     | 965,949\nFacebook App Data | Users     | Apps         | 173,502    | 13,604      | 5,115,433\nRetail Data       | Customers | Items bought | 74,333     | 16,470      | 947,940\nZune Podcast Data | Listeners | Podcasts     | 80,633     | 7,928       | 1,037,999\n\nFigure 3: A table showing the data sets and information about the items and topics.\n\nOur strategy will be to couple greedy assignment with a Polya urn process with exponent \u03b3 >\n1, showing that the dynamics of the two processes are the same, w.h.p. There is one signi\ufb01cant\ntechnical challenge that one needs to address: while the Polya process assigns new balls based on\nthe ball counts of urns, greedy assigns items (and their respective topics) based on the number of\ntopic intersections between the item and the partition. We resolve this issue by taking a two-tiered\napproach. Roughly, we \ufb01rst prove that, w.h.p., we can couple the number of items in a component\nwith the number of unique topics assigned to the same component. We then prove that this is enough\nto couple the greedy assignment with a Polya urn process with exponent \u03b3 > 1. This will imply that\ngreedy converges to a monopoly, by Lemma 1.\nWe then extend this argument to a single cluster and k \u2265 3 components, but with no load\nbalancing constraints. 
The crux of the extension is that we can apply the k = 2 argument to pairs of\ncomponents to yield that some component achieves a monopoly.\nLemma 2. Given a single cluster instance in HC(n, r, \u2113, p, q) with \u2113 = 1, p \u2265 2 log r/r and q = 0 to\nbe partitioned in k components, the greedy algorithm with no balancing constraints will eventually\nplace every item in the cluster onto the same component w.h.p.\n\nStep 2: The General Case. We complete the proof of Theorem 1 by considering the general\ncase with \u2113 \u2265 2 clusters and q > 0. We proceed in three sub-steps. We \ufb01rst show the recovery claim\nfor a general number of clusters \u2113 \u2265 2, but q = 0 and no balance constraints. This follows since, for\nq = 0, the algorithm\u2019s choices with respect to clusters and their respective topics are independent.\nHence clusters are assigned to components uniformly at random.\nSecond, we extend the proof for any value q \u2264 log r/(rk), by showing that the existence of \u201cnoise\u201d\nedges under this threshold only affects the algorithm\u2019s choices with very low probability. Finally,\nwe prove that the balance constraints are practically never violated for this type of input, as clusters\nare distributed uniformly at random. We obtain the following.\nLemma 3. For a hidden co-cluster input, the greedy algorithm with q = 0 and without capacity\nconstraints can be coupled with a version of the algorithm with q \u2264 log r/(rk) and a constant\ncapacity constraint, w.h.p.\n\nFinal Argument. Putting together Lemmas 2 and 3, we obtain that greedy ensures balanced\nrecovery for general inputs in HC(n, r, \u2113, p, q), for parameter values \u2113 \u2265 k log k, p \u2265 2 log r/r, and\nq \u2264 log r/(rk).\n\n4 Experimental Results\n\nDatasets and Evaluation. We \ufb01rst consider a set of real-world bipartite graph instances with a\nsummary provided in Figure 3. 
All these datasets are available online, except for Zune podcast\nsubscriptions. We chose the consumer to be the item and the resource to be the topic. We provide an\nexperimental validation of the analysis on synthetic co-cluster inputs in the full version of our paper.\nIn our experiments, we considered partitioning of items onto k components for a range of values\ngoing from two to ten components. We report the maximum number of topics in a component\nnormalized by the cost of a perfectly balanced solution m/k, where m is the total number of topics.\n\nOnline Assignment Algorithms. We compared the following online assignment strategies:\n\u2022 All-on-One: trivially assign all items and topics to one component.\n\u2022 Random: assign each item independently to a component chosen uniformly at random from the\nset of all components.\n\u2022 Balance Big: inspect the items in a random order and assign the large items to the least loaded\ncomponent, and the small items according to greedy. An item is considered large if it subscribes to\nmore than 100 topics, and small otherwise.\n\u2022 Prefer Big: inspect the items in a random order, and keep a buffer of up to 100 small items; when\nreceiving a large item, put it on the least loaded component; when the buffer is full, place all the\nsmall items according to greedy.\n\u2022 Greedy: assign the items to the component they have the most topics in common with. We consider\ntwo variants: items arrive in random order, and items arrive in decreasing order of the number of\ntopics. We allow a slack (parameter c) of up to 100 topics.\n\u2022 Proportional Allocation: inspect the items in decreasing order of the number of topics; the\nprobability an item is assigned to a component is proportional to the number of common topics.\n\nFigure 4: The normalized maximum load for various online assignment algorithms under different\ninput bipartite graphs versus the numbers of components. Panels: (a) Book Ratings, (b) Facebook App Data,\n(c) Retail Data, (d) Zune Podcast Data.\n\nResults. Greedy generally outperforms other online heuristics (see Figure 4). Also, its performance\nis improved if items arrive in decreasing order of number of topics. Intuitively, items with a larger\nnumber of topics provide more information about the underlying structure of the bipartite graph than\nthe items with a smaller number of topics. Interestingly, adding randomness to the greedy assignment\nmade it perform far worse; most times Proportional Allocation approached the worst case scenario.\nRandom assignment outperformed Proportional Allocation and regularly outperformed the Prefer Big\nand Balance Big item assignment strategies.\n\nOf\ufb02ine methods. We also tested the streaming algorithm for a wide range of synthetic input\nbipartite graphs according to the model de\ufb01ned in this paper, and several of\ufb02ine approaches for the\nproblem including hMetis [11], label propagation, basic spectral methods, and PARSA [13]. We\nfound that label propagation and spectral methods are extremely time and memory intensive on our\ninputs, due to the large number of topics and item-topic edges. hMetis returns within seconds, but\nits assignments were not competitive. 
Note that hMetis provides balanced hypergraph cuts,\nwhich are not necessarily a good solution to our problem.\n\n[Figure 4 plots: x-axis k (2 to 10); y-axis Normalized Maximum Load (1 to 10); series: All on One, Proportional Greedy (Decreasing Order), Balance Big, Prefer Big, Random, Greedy (Random Order), Greedy (Decreasing Order).]\n\nCompared to PARSA on bipartite graph inputs, greedy provides assignments with up to 3x higher\nmax partition load. On social graphs, the performance difference can be as high as 5x. This\ndiscrepancy is natural since PARSA has the advantage of performing multiple passes through the input.\n\n5 Related Work\n\nThe related min-max multi-way graph cut problem, originally introduced in [23], is\nde\ufb01ned as follows: given an input graph, the objective is to partition the set of vertices such\nthat the maximum number of edges adjacent to a component is minimized. A similar problem was\nrecently studied, e.g. [1], with respect to expansion, de\ufb01ned as the ratio of the sum of weights of\nedges adjacent to a component and the minimum between the sum of the weights of vertices within\nand outside the given component. The balanced graph partition problem is a bi-criteria optimization\nproblem where the goal is to \ufb01nd a balanced partition of the set of vertices that minimizes the total\nnumber of edges cut. The best known approximation ratio for this problem is poly-logarithmic in\nthe number of vertices [12]. 
The balanced graph partition problem was also considered for the set of\nedges of a graph [2]. The related problem of community detection in input graph data has been\ncommonly studied for the planted partition model, also known as the stochastic block model. Tight\nconditions for recovery of hidden clusters are known from the recent work in [16] and [14], as well\nas various approximation algorithms, e.g. see [3]. Some variants of hypergraph partition problems\nwere studied by the machine learning research community, including balanced cuts studied by [9]\nusing relaxations based on the concept of total variation, and the maximum likelihood identi\ufb01cation\nof hidden clusters [17]. The difference is that we consider the min-max multi-way cut problem for\na hypergraph in the streaming computation model. PARSA [13] considers the same problem in an\nof\ufb02ine model, where the entire input is initially available to the algorithm, and provides an ef\ufb01cient\ndistributed algorithm for optimizing multiple criteria. A key component of PARSA is a procedure for\noptimizing the order of examining vertices. By contrast, we focus on performance under arbitrary\narrival order, and provide analytic guarantees under a stochastic input model.\nStreaming computation with limited memory was considered for various canonical problems such\nas principal component analysis [15], community detection [22], balanced graph partition [20, 21],\nand query placement [24]. For the class of (hyper)graph partition problems, most of the work is\nrestricted to studying various streaming heuristics using empirical evaluations, with a few notable\nexceptions. A \ufb01rst theoretical analysis of streaming algorithms for balanced graph partitioning was\npresented in [19] using a framework similar to the one deployed in this paper. 
The paper gives\nsuf\ufb01cient conditions under which a greedy streaming strategy, which makes irrevocable assignments\nof vertices as they are observed in the input stream and uses memory limited to grow linearly with\nthe number of clusters, recovers clusters of vertices for an input graph generated according to the\nstochastic block model.\nAs in our case, the argument uses a reduction to Polya urn processes. The two main differences with\nour work are that we consider a different problem (min-max hypergraph partition) and that this requires a\nnovel proof technique based on a two-step reduction to Polya urn processes. Streaming algorithms\nfor the recovery of clusters in a stochastic block model were also studied in [22], under a weaker\ncomputation model, which does not require irrevocable assignments of vertices at the instants they are\npresented in the input stream and allows for memory polynomial in the number of vertices.\n\n6 Conclusion\n\nWe studied the min-max hypergraph partitioning problem in the streaming computation model with\nthe size of memory limited to be at most linear in the number of the components of the partition.\nWe established the \ufb01rst approximation guarantees for inputs according to a random bipartite graph with\nhidden co-clusters, and evaluated performance on several real-world input graphs. There are\nseveral interesting open questions for future work. It is of interest to study the tightness of the given\nrecovery condition, and, in general, better understand the trade-off between the memory size and\nthe accuracy of the recovery. It is also of interest to consider the recovery problem for a wider set\nof random bipartite graph models. Another question of interest is to consider dynamic graph inputs\nwith addition and deletion of items and topics.\n\nReferences\n[1] N. Bansal, U. Feige, R. Krauthgamer, K. Makarychev, V. Nagarajan, J. (Sef\ufb01) Naor, and\nR. Schwartz. Min-max graph partitioning and small set expansion. SIAM J. 
on Computing, 43(2):872\u2013904, 2014.\n[2] F. Bourse, M. Lelarge, and M. Vojnovic. Balanced graph edge partition. In Proc. of ACM KDD, 2014.\n[3] Y. Chen, S. Sanghavi, and H. Xu. Clustering sparse graphs. In Proc. of NIPS, 2012.\n[4] Y. Cheng and G. M. Church. Biclustering of expression data. In ISMB, volume 8, pages 93\u2013103, 2000.\n[5] F. Chung, S. Handjani, and D. Jungreis. Generalizations of Polya\u2019s urn problem. Annals of Combinatorics, (7):141\u2013153, 2003.\n[6] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proc. of ACM KDD, 2001.\n[7] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proc. of ACM KDD, 2003.\n[8] S. Fortunato. Community detection in graphs. Physics Reports, 486(75), 2010.\n[9] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram. The total variation on hypergraphs - learning hypergraphs revisited. In Proc. of NIPS, 2013.\n[10] T. Karagiannis, C. Gkantsidis, D. Narayanan, and A. Rowstron. Hermes: clustering users in large-scale e-mail services. In Proc. of ACM SoCC, 2010.\n[11] G. Karypis and V. Kumar. Multilevel k-way hypergraph partitioning. VLSI Design, 11(3), 2000.\n[12] R. Krauthgamer, J. S. Naor, and R. Schwartz. Partitioning graphs into balanced components. 2009.\n[13] M. Li, D. G. Andersen, and A. J. Smola. Graph partitioning via parallel submodular approximation to accelerate distributed machine learning. arXiv preprint arXiv:1505.04636, 2015.\n[14] L. Massouli\u00e9. Community detection thresholds and the weak Ramanujan property. In Proc. of ACM STOC, 2014.\n[15] I. Mitliagkas, C. Caramanis, and P. Jain. Memory limited, streaming PCA. In Proc. of NIPS, 2013.\n[16] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, pages 1\u201331, 2014.\n[17] L. O\u2019Connor and S. Feizi. 
Biclustering using message passing. In Proc. of NIPS, 2014.\n[18] J. M. Pujol et al. The little engine(s) that could: Scaling online social networks. IEEE/ACM Trans. Netw., 20(4):1162\u20131175, 2012.\n[19] I. Stanton. Streaming balanced graph partitioning algorithms for random graphs. In Proc. of ACM-SIAM SODA, 2014.\n[20] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In Proc. of ACM KDD, 2012.\n[21] C. E. Tsourakakis, C. Gkantsidis, B. Radunovic, and M. Vojnovic. FENNEL: streaming graph partitioning for massive scale graphs. In Proc. of ACM WSDM, 2014.\n[22] S.-Y. Yun, M. Lelarge, and A. Proutiere. Streaming, memory limited algorithms for community detection. In Proc. of NIPS, 2014.\n[23] Z. Svitkina and E. Tardos. Min-max multiway cut. In K. Jansen, S. Khanna, J. Rolim, and D. Ron, editors, Proc. of APPROX/RANDOM, pages 207\u2013218, 2004.\n[24] B. Zong, C. Gkantsidis, and M. Vojnovic. Herding small streaming queries. In Proc. of ACM DEBS, 2015.\n", "award": [], "sourceid": 1177, "authors": [{"given_name": "Dan", "family_name": "Alistarh", "institution": "Microsoft Research"}, {"given_name": "Jennifer", "family_name": "Iglesias", "institution": "Carnegie Mellon University"}, {"given_name": "Milan", "family_name": "Vojnovic", "institution": "Microsoft Research"}]}