{"title": "Found Graph Data and Planted Vertex Covers", "book": "Advances in Neural Information Processing Systems", "page_first": 1356, "page_last": 1367, "abstract": "A typical way in which network data is recorded is to measure all interactions involving a specified set of core nodes, which produces a graph containing this core together with a potentially larger set of fringe nodes that link to the core. Interactions between nodes in the fringe, however, are not present in the resulting graph data. For example, a phone service provider may only record calls in which at least one of the participants is a customer; this can include calls between a customer and a non-customer, but not between pairs of non-customers. Knowledge of which nodes belong to the core is crucial for interpreting the dataset, but this metadata is unavailable in many cases, either because it has been lost due to difficulties in data provenance, or because the network consists of \"found data\" obtained in settings such as counter-surveillance. This leads to an algorithmic problem of recovering the core set. Since the core is a vertex cover, we essentially have a planted vertex cover problem, but with an arbitrary underlying graph. We develop a framework for analyzing this planted vertex cover problem, based on the theory of fixed-parameter tractability, together with algorithms for recovering the core. Our algorithms are fast, simple to implement, and out-perform several baselines based on core-periphery structure on various real-world datasets.", "full_text": "Found Graph Data and Planted Vertex Covers\n\nAustin R. 
Benson\nCornell University\n\narb@cs.cornell.edu\n\nJon Kleinberg\n\nCornell University\n\nkleinber@cs.cornell.edu\n\nAbstract\n\nA typical way in which network data is recorded is to measure all interactions\ninvolving a speci\ufb01ed set of core nodes, which produces a graph containing this\ncore together with a potentially larger set of fringe nodes that link to the core.\nInteractions between nodes in the fringe, however, are not present in the resulting\ngraph data. For example, a phone service provider may only record calls in which at\nleast one of the participants is a customer; this can include calls between a customer\nand a non-customer, but not between pairs of non-customers. Knowledge of which\nnodes belong to the core is crucial for interpreting the dataset, but this metadata\nis unavailable in many cases, either because it has been lost due to dif\ufb01culties\nin data provenance, or because the network consists of \u201cfound data\u201d obtained in\nsettings such as counter-surveillance. This leads to an algorithmic problem of\nrecovering the core set. Since the core is a vertex cover, we essentially have a\nplanted vertex cover problem, but with an arbitrary underlying graph. We develop\na framework for analyzing this planted vertex cover problem, based on the theory\nof \ufb01xed-parameter tractability, together with algorithms for recovering the core.\nOur algorithms are fast, simple to implement, and out-perform several baselines\nbased on core-periphery structure on various real-world datasets.\n\n1 Partially measured graphs, data provenance, and planted structure\n\nDatasets that take the form of graphs are ubiquitous throughout the sciences [4, 23, 49], but the graph\ndata that we work with is generally incomplete in certain systematic ways [28, 33, 34, 36, 39]. 
A\ncommon type of incompleteness comes from the way in which graph data is generally measured:\nwe observe a set of nodes C and record all the interactions involving this set of nodes. The result\nis a measured graph G consisting of this core set C together with a potentially larger set of\nadditional fringe nodes\u2014the nodes outside of C that interact with some node in C. For example, in\nconstructing a social network dataset, we might study the employees of a company and record all of\ntheir friendships [54]. From this information, we now have a graph that contains all the employees\ntogether with all of their friends, including friends who do not work for the company. This latter\ngroup constitutes the set of fringe nodes in the graph. The edge set of the graph G re\ufb02ects the\nconstruction: each edge involves at least one core node, but if two nodes in the fringe have interacted,\nit is invisible to us and hence not recorded in the data.\nE-mail and other communication datasets typically have this core-fringe structure. For example, the\nwidely-studied Enron email graph [24, 37, 43, 55] contains tens of thousands of nodes; however, this\ngraph was constructed from the email inboxes of fewer than 150 employees [35]. The vast majority\nof the nodes in the graph, therefore, belong to the fringe, and their direct communications are not\npart of the data. The issue comes up in much larger network datasets as well. For example, a phone\nservice provider has data on calls and messages that its customers make both to each other and to\nnon-customers, but it does not have data on communication between pairs of non-customers. Another\nexample is a massive online social network, which may get some information about the contacts of\nits users, with a fringe set comprising people not on the system\u2014often including entire countries\nthat do not participate in the platform\u2014but generally not about the interactions taking place in this\nfringe set. 
Similarly, Internet measurements at the IP-layer from individual service providers give\nonly a partial view of the global Internet network [58, 60].\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fTherefore, graph data often takes the form\nillustrated in Figure 1a: the nodes are divided into a core and a fringe, and we only see edges that\ninvolve a member of the core. Thus, the core set is a vertex cover of the underlying graph\u2014since a\nvertex cover, by de\ufb01nition, is a vertex set incident to all edges.\nOften, a graph dataset is annotated with metadata about which nodes belong to the core, and\nthis is crucial for correctly interpreting the data. But there are a number of important contexts\nwhere this metadata is unavailable, and we do not know which nodes constitute the core. In\nother words, at some point, we have \u201cfound data\u201d that has core-fringe structure, but the labels\nidentifying the core nodes are missing. One reason for this scenario is that metadata is lost over\ntime for a wide range of reasons: data is repeatedly shared and manipulated, managers of datasets\nchange jobs, and hard drives are decommissioned. This is a central theme in data provenance,\nlineage, and preservation [11, 44, 56, 59], an especially challenging issue in modern digitization\nefforts [38] and large-scale data management [32]. A concrete example is the following. Suppose a\ntelecommunications company shares an anonymized dataset of telephone call records with a university\nbut does not indicate which nodes were customers and which were in the fringe set of non-customers.\nBy the time a student begins to analyze the data and realizes the metadata is missing and important\nfor analysis, the researchers at the company who assembled the dataset have left. 
At this point, there\nis no easy way to reconstruct the metadata.\nThese issues also arise in current research on counter-surveillance [29].\nIntelligence agencies\nmay intercept data from adversaries conducting surveillance and build a graph to determine which\ncommunications the adversaries were recording. In different settings, activist groups may petition for\nthe release of surveillance data by governments or infer it from other sources [47]. In these cases,\nthe \u201cfound data\u201d consists of a communication graph in which an unknown core subset of the nodes\nwas observed, and the remainder of the nodes in the graph (the fringe) are there simply because they\ncommunicated with someone in the core. In such situations, there generally is not any annotation\nto distinguish the core from the fringe. In this case, the core nodes are the compromised ones, and\nidentifying the core from the data can help to warn the vulnerable parties, hide future communications,\nor disseminate misinformation.\nPlanted Vertex Covers. Here we study the problem of recovering the set of core nodes in found\ngraph data, motivated by the range of settings above. We can view this as a planted vertex cover\nproblem: we are given a graph G in which an adversary knows a speci\ufb01c vertex cover C. We do not\nknow C, but we want to output a set that is as close to C as possible. The property of being \u201cclose\u201d\nto C corresponds to a performance guarantee that we will formulate in different ways. We may want\nto output a set not much larger than C that is guaranteed to completely contain it, or we may want\na small set that is guaranteed to have substantial overlap with C. 
A simple instance of the task is\ndepicted in Figure 1b, after the explicit labeling of the core nodes has been removed from Figure 1a.\nGenerically, planted problems arise when some hidden structure (like the vertex cover in our case) has\nbeen \u201cplanted\u201d in a larger input, and the goal is to \ufb01nd the structure. Planted problems are typically\n\n(a) Graph dataset built from a small core.\n\n(b) The dataset without the core labeled.\n\nFigure 1: (a) Graph datasets are often constructed by recording the interactions of a set of core nodes.\nThe resulting data contains these core nodes together with a potentially much larger fringe, consisting\nof all other nodes that had an interaction with some member of the core. (b) Knowing which nodes\nare in the core is important for interpreting the dataset. However, in many cases, the graph is \u201cfound\ndata\u201d and this metadata is not available. This can arise from challenges in data provenance that\nlead to the loss of the metadata or when only partial information is available in contexts such as\ncounter-surveillance. We study how accurately we can recover the core, despite limited information\nabout how the dataset was constructed. Algorithmically, this leads to a planted vertex cover problem.\n\n\fbased on formal frameworks in which the input is generated by a highly structured probabilistic\nmodel. This is true in some of the most studied planted problems, such as the planted clique\nproblem [5, 20, 25, 46] and the recovery problem for stochastic block models [1, 2, 3, 7, 19, 48].\nIt might seem inevitable that planted problems should require such probabilistic assumptions\u2014how\nelse could an algorithm guess which part of the graph corresponds to the planted structure, if there are\nno assumptions on what the \u201cnon-planted\u201d part of the graph looks like? 
But the vertex cover problem\nturns out to be different, and it is possible to solve what, surprisingly, can be described as a \u201cworst-\ncase\u201d planted problem. With extremely limited assumptions, we can provide provable guarantees\nfor approximately recovering unknown vertex covers. Speci\ufb01cally, we make only two assumptions\n(both necessary in some form, though relaxable): (i) the planted vertex cover is inclusion-wise\nminimal and (ii) its size is upper-bounded by a known quantity k. We think of k as small compared\nto the total number of nodes, which is consistent with many datasets. Among other results, we use\n\ufb01xed-parameter tractability to show that there is an algorithm operating on arbitrary graphs that can\noutput a set of f (k) nodes (independent of the size of the graph) guaranteed to contain the planted\nvertex cover. We obtain further bounds with additional structural assumptions.\nWe pair our theoretical guarantees with experiments showing the ef\ufb01cacy of these methods on real-\nworld networks with core-fringe structure. Using our theory, we develop a natural heuristic based on\nconstructing a union of minimal vertex covers from random initializations. Our simple algorithm\nprovides superior recovery performance and running time compared to a number of competitive\nbaselines. Among these, we show improvements over a line of well-developed heuristics for detecting\ncore-periphery structure [15, 31, 53, 63]\u2014a sociological notion in which a network has a dense core\nand a sparser periphery, generally for reasons of differential status rather than measurement effects.\n\n2 Theoretical methodology for partial recovery or tight containment\n\nWe \ufb01rst formalize the problem. Suppose there is a large universe of nodes U that interact via\ncommunication, friendship, or another mechanism. We choose a core subset C \u2286 U and measure all\npairwise interactions that involve at least one node in C. 
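The measurement model just described can be made concrete with a small sketch. The following Python fragment is purely illustrative (the graph, the `measure` helper, and the toy core are assumptions for exposition, not data or code from the paper): it records only the interactions incident to the core and checks that the core is a vertex cover of the measured graph.

```python
# Toy sketch of the core-fringe measurement process (hypothetical data).
def measure(universe_edges, core):
    """Record only the interactions that involve at least one core node."""
    E = {(u, v) for (u, v) in universe_edges if u in core or v in core}
    V = core | {w for e in E for w in e}  # observed nodes: core plus fringe
    return V, E

def is_vertex_cover(cover, edges):
    return all(u in cover or v in cover for (u, v) in edges)

core = {1, 2}
universe = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
V, E = measure(universe, core)
# The fringe-fringe edges (3,4) and (4,5) are invisible, node 5 never
# appears in the data, and the core {1,2} covers every measured edge.
```

Note that node 5 interacts only with other fringe nodes, so it is absent from the measured graph entirely, mirroring the phone-provider example above.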
We represent our measurement by a graph\nG = (V, E), where V \u2286 U is all nodes that belong to C or interact with at least one node in C, and\nE is the set of all such interactions. The nodes in V \u2212 C are called the fringe of the graph. We ignore\ndirectionality and self-loops, so G is a simple, undirected graph. Note that C is a vertex cover of G.\n\n2.1 Finding a planted vertex cover\n\nIn the planted vertex cover problem, we observe G and are tasked with \ufb01nding C. Can we say\nanything non-trivial in answer to this question? Without other information, it could be that C = V , so\nwe \ufb01rst assume that |C| \u2264 k, where we think of k as small relative to |V |. With this extra information,\nwe can ask if it is possible to obtain a small set that is guaranteed to contain C:\nQuestion 1. For some function f, can we \ufb01nd a set D of size at most f (k) (independent of the size\nof V ) that is guaranteed to contain the planted vertex cover C?\nThe answer to this question is \u201cno.\u201d For example, let k = 2 and let G be a star graph with n > 3\nnodes v1, ..., vn and edges (v1, vi) for each i > 1. The two endpoints of any edge in the graph form a\nvertex cover of size k = 2, but C could conceivably be any edge. Thus, under these constraints, the\nonly set guaranteed to contain C is the entire node set V .\nIn this negative example, once we put v1 into a 2-node vertex cover, the other node can be arbitrary,\nsince it is super\ufb02uous. This suggests that it would be more reasonable to ask about minimal vertex\ncovers. Formally, C is a minimal vertex cover if for all v \u2208 C, the set C \u2212 {v} is not a vertex cover.\n(In contrast, a minimum vertex cover is a minimal cover of minimum size.) 
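The minimal/minimum distinction can be checked mechanically. The sketch below is an illustrative Python fragment (the `is_minimal` helper and the star-graph example are ours, not from the paper) implementing the definition just given: a cover is minimal exactly when removing any single node breaks the cover property.

```python
def is_vertex_cover(cover, edges):
    return all(u in cover or v in cover for (u, v) in edges)

def is_minimal(cover, edges):
    """Minimal: removing any single node destroys the cover property."""
    return is_vertex_cover(cover, edges) and all(
        not is_vertex_cover(cover - {v}, edges) for v in cover)

# Star graph: center 1, leaves 2..5.
star = [(1, i) for i in range(2, 6)]
# {1} is minimal (and minimum); the leaf set {2,3,4,5} is minimal but not
# minimum; {1,2} is a cover but not minimal, since node 2 is superfluous.
```

The star graph illustrates why minimality alone does not pin down size: both {1} and the leaf set are minimal, but only {1} is minimum.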
Minimality is natural\nwith respect to our motivating applications\u2014it is reasonable to assume that the measured nodes are\nnon-redundant, in the sense that omitting a node from C would cause at least one edge to be lost from\nthe measured communication pattern. Thus, we ask the following adaptation of Question 1:\nQuestion 2. If C is a minimal planted vertex cover, can we \ufb01nd a set D of size at most f (k)\n(independent of |V |) that is guaranteed to contain C?\nInterestingly, the answer to this question is \u201cyes\u201d for arbitrary graphs. We derive this as a consequence\nof results from Damaschke and Molokov [17, 18] in the theory of \ufb01xed-parameter tractability:\n\n\fLemma 1 ([17, 18]). Consider a graph G with a minimum vertex cover size k\u2217. Let U (k) be the\nunion of all minimal vertex covers of size at most k. Then\n\n(a) |U (k)| \u2264 (k + 1)^2/4 + k and is asymptotically tight [17, Theorem 3]\n(b) |U (k)| \u2264 (k \u2212 k\u2217 + 2)k\u2217 and is tight [18, Theorem 12]\n\nThere is an informative direct proof of part (a) that gives the O(k^2) bound, using a kernelization\ntechnique from \ufb01xed-parameter tractability [12, 21]. The proof begins from the following observation:\nObservation 1. Any node with degree strictly greater than |C| must be in C.\nThe observation follows simply from the fact that if a node is omitted from C, then all of its neighbors\nmust belong to C. Thus, if S is the set of all nodes in G with degree greater than k, then S is contained\nin every vertex cover of size at most k. Hence Observation 1 implies that if U (k) is non-empty, we\nmust have |S| \u2264 k and S \u2286 U (k). Now G \u2212 S is a graph with maximum degree k and a vertex cover\nof size \u2264 k, so it has at most O(k^2) edges. Next, let T be the set of all nodes incident to at least one\nof these edges. Any node not in S \u222a T is isolated in G \u2212 S and hence not part of any minimal vertex\ncover of size \u2264 k. 
Therefore U (k) \u2286 S \u222a T , and so |U (k)| = O(k^2). We will later use Observation 1\nas motivation for a degree-ordering component to our planted vertex cover recovery algorithm.\nThe following theorem, giving a positive answer to Question 2, is thus a corollary of Lemma 1(a)\nobtained by setting D = U (k), the union of all minimal vertex covers of size at most k:\nTheorem 1. If C is a minimal planted vertex cover with |C| \u2264 k, then we can \ufb01nd a set D of size\nO(k^2) that is guaranteed to contain C.\nTo see why O(k^2) is tight, let G consist of the disjoint union of k/2 stars each with 1 + k/2\nleaves. Any set consisting of the centers of all but one of the stars, and the leaves of the remaining\nstar, is a minimal vertex cover of size k. Thus every node in G could potentially belong to the planted\nvertex cover C, and so we must output the full node set V . Since |V | = \u2126(k^2), the bound follows.\nWe can compute U (k) in time exponential in k but polynomial in the number of nodes n for \ufb01xed\nk [18], but these algorithms remain impractical for our datasets. However, the results motivate our\nalgorithm in Section 3, which is based on taking the union of several minimal vertex covers.\n\n2.2 Non-minimal vertex covers\n\nA natural next question is whether we can say anything positive when the planted vertex cover C\nis not minimal. In particular, if C is not minimal, can we still ensure that some parts of it must\nbe contained in U (|C|), the union of all minimal vertex covers of size at most |C|? The following\npropositions show that if u \u2208 C links to a node v either outside C or deeply contained in C (with v\nand its neighbors all in C), then u must belong to U (|C|).\nProposition 1. If u \u2208 C and there is an edge (u, v) to a fringe node v \u2209 C, then u \u2208 U (|C|).\nProof. Consider the following iterative procedure for \u201cpruning\u201d the set C. 
We repeatedly check\nwhether there is a node w such that C \u2212 {w} is still a vertex cover. Anytime we encounter such a\nnode w, we delete it from C. When this process terminates, we have a minimal vertex cover C\u2032 \u2286 C.\nSince |C\u2032| \u2264 |C|, we must have C\u2032 \u2286 U (|C|). But in this iterative process we cannot delete u, since\n(u, v) is an edge and v \u2209 C. Thus u \u2208 C\u2032, and hence u \u2208 U (|C|).\nNext, let us say that a node v belongs to the interior of the vertex cover C if v and all the neighbors\nof v belong to C. We have the following result.\nProposition 2. If u \u2208 C and there is an edge (u, v) to a node v in the interior of C, then u \u2208 U (|C|).\nProof. Let u and v be nodes as described in the statement of the proposition. Since all of v\u2019s\nneighbors are in C, it follows that C0 = C \u2212 {v} is a vertex cover. We now proceed as in the\nproof of Proposition 1. We iteratively delete nodes from C0 as long as we can preserve the vertex\ncover property. When this process terminates, we have a minimal vertex cover C\u2032 \u2286 C0, and since\n|C\u2032| \u2264 |C|, we must have C\u2032 \u2286 U (|C|). Now, u could not have been deleted during this process,\nbecause (u, v) is an edge and v \u2209 C0. Thus u \u2208 C\u2032, and hence u \u2208 U (|C|).\nEven with these results, an arbitrarily small fraction of a non-minimal planted vertex cover C may be\nin U (|C|). Consider a star graph with center u and k + 1 leaves, and let C consist of u and any k \u2212 1\nleaves. The set {u} is the only minimal vertex cover of size \u2264 k, and hence |U (|C|)|/|C| = 1/k.\nIn this example, only node u satis\ufb01es the hypotheses of Proposition 1 or 2. However, we will see\nin Section 3 that in many real-world networks, most nodes in C are indeed contained in U (|C|)\nby satisfying a condition in Proposition 1 or Proposition 2. 
The planted vertex cover recovery algorithm that we develop in Section 3 uses the union of minimal vertex covers to identify nodes\nthat are likely in the planted cover. Propositions 1 and 2 show that even if the planted cover is not\nminimal, we can still recover its nodes with such unions of minimal vertex covers.\nThe above example has the property that C is much larger than the minimum vertex cover size k\u2217.\nWe next consider the case in which C may be non-minimal, but is within a constant factor of k\u2217. In\nthis case, we show how to \ufb01nd small sets guaranteed to intersect a constant fraction of the nodes in C.\n\n2.3 Maximal matching 2-approximation to minimum vertex cover and intersecting the core\n\nAlgorithm 1: Greedy maximal matching with 2-approximation for minimum vertex cover.\nInput: Graph G = (V, E)\nOutput: Vertex cover M with |M| \u2264 2k\u2217\nM \u2190 \u2205\nfor e = (u, v) \u2208 E do\n    if u, v \u2209 M then M \u2190 M \u222a {u, v}\n\nA basic building block for our theory and algorithms is the classic maximal matching 2-approximation to minimum vertex cover (Algorithm 1). The greedy algorithm builds a\nmaximal matching M by processing each edge\ne = (u, v) of the graph and adding u and v to\nM if neither endpoint is already in M. Upon\ntermination, M is both a maximal matching and\na vertex cover. It is maximal because if we could\nadd another edge e, then it would have been added when processed; and it is a vertex cover because if\nboth endpoints of an edge e are not in the matching, then e would have been added to the matching\nwhen processed. Since any vertex cover must contain at least one endpoint from each edge in the\nmatching, we have k\u2217 \u2265 |M|/2, or |M| \u2264 2k\u2217, where k\u2217 is the minimum vertex cover size of G.\nThe output M of Algorithm 1 may not be a minimal vertex cover. 
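Algorithm 1 fits in a few lines. The sketch below is an illustrative Python rendering (the paper's own reference implementation, shown later in Figure 3, is in Julia; the function name and the optional `rng` shuffling are our additions):

```python
import random

def greedy_matching_cover(edges, rng=None):
    """Maximal-matching 2-approximation: scan the edges and take both
    endpoints of any edge whose endpoints are both still uncovered."""
    edges = list(edges)
    if rng is not None:
        rng.shuffle(edges)  # random edge order, as used later by UMVC
    M = set()
    for u, v in edges:
        if u not in M and v not in M:
            M |= {u, v}
    return M

# On the path 1-2-3-4, processing (1,2),(2,3),(3,4) in order gives
# {1,2,3,4}, while processing (2,3) first gives {2,3}; both are vertex
# covers of size at most 2k*.
```

The path example shows why the processing order matters for the size of the output, even though the 2-approximation guarantee holds for every order.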
However, we can iteratively prune\nnodes from M to make it minimal, which we do for our recovery algorithm described in Section 3.\nFor the theory in this section, though, we assume no such pruning.\nThe following lemma shows that any vertex cover whose size is bounded by a multiplicative constant\nof k\u2217 must intersect the output of Algorithm 1 in a constant fraction of its nodes.\nLemma 2. Let B be any vertex cover of size |B| \u2264 bk\u2217 for some constant b. Then any set M\nproduced by Algorithm 1 satis\ufb01es |M \u2229 B| \u2265 |B|/(2b).\nProof. The maximal matching M consists of h edges, where h = |M|/2 \u2264 k\u2217. Since B is a vertex cover,\nit must contain at least one endpoint of each edge in M. Hence, |M \u2229 B| \u2265 h \u2265 k\u2217/2 \u2265 |B|/(2b).\nA corollary is that if our planted cover C is relatively small in the sense that it is close to the minimum\nvertex cover size, then Algorithm 1 must partially recover C. We write this as follows.\nCorollary 1. If the planted vertex cover C has size |C| \u2264 ck\u2217, then Algorithm 1 produces a set M\nof size \u2264 2k\u2217 that intersects at least a 1/(2c) fraction of the nodes in C.\nThe planted vertex cover recovery algorithm that we design in Section 3 guesses that nodes output\nby Algorithm 1 are part of the planted cover. Corollary 1 thus tells us that our method will indeed\ncapture part of the planted cover.\nAn important property of Algorithm 1 that will be useful for our algorithm design later in the paper\nis that the algorithm\u2019s guarantees hold regardless of the order in which the edges are processed.\nFurthermore, two matchings produced by the algorithm using two different orderings of the edges\nmust share a constant fraction of nodes, as formalized in the following corollary.\nCorollary 2. Any two sets S1 and S2 obtained from Algorithm 1 (with possibly different orders in\nthe processing of the edges) satisfy |S1 \u2229 S2| \u2265 max(|S1|, |S2|)/4.\nOur results thus far only assume that the graph contains a vertex cover of size at most k. Next, we\nshow how assuming structure on the graph can yield stronger guarantees.\n\n2.4 Improving results with known graph structure\n\nWe now strengthen our theoretical guarantees by assuming structure on C and G. Speci\ufb01cally, we\nconsider how random structure and bounds on the minimum vertex cover size k\u2217\nobtained through computation with Algorithm 1 can strengthen our guarantees. These\nresults are theoretical and do not affect the algorithms we develop in Section 3.\n\n\fFigure 2: Improving Lemma 1(b) by bounding the minimum vertex cover size with the output of\nAlgorithm 1. For each network dataset, we synthetically create a planted vertex cover C for every\n1-hop neighborhood, which covers the corresponding 2-hop neighborhood. Blue dots show relative\nsizes of the planted cover (1-hop neighborhood) and the subgraph it covers (the 2-hop neighborhood).\nThe bound on |U (|C|)| from Lemma 1(a) is the thick black line. Orange squares show improvements\nin the bound on |U (|C|)| from Lemma 1(b) by bounding the minimum vertex cover size with\nAlgorithm 1. The improved bounds appear linear instead of quadratic for CollegeMsg.\n\nStochastic block model. One possible structural assumption is that edges are generated independently at random. The stochastic block model (SBM) is a common generative model for this idealized\nsetting [30] that is the basis for a major class of planted problems [1]. Here we consider a 2-block\nSBM, where one block is the planted vertex cover C and the other block is the fringe nodes F . The\nSBM provides a probability of an edge forming within a block and between blocks. 
For our purposes,\nC is a vertex cover, so we assume that the probability of an edge between nodes in F is 0. Denote the\nprobability of an edge between nodes in C as p and the probability of an edge between a node in C\nand a node in F as q. We make no assumption on the relative values of p and q.\nThe following result combines Lemma 1(b) with well-known lower bounds on independent set size\nin Erd\u0151s-R\u00e9nyi graphs (in the SBM, C induces an Erd\u0151s-R\u00e9nyi graph with edge probability p).\nLemma 3. With probability at least 1 \u2212 |C|^{\u22123 ln|C|/(2p)}, the union of minimal vertex covers of size at\nmost |C| contains at most |C|(3 ln|C|/p + 3) nodes.\nProof. It is straightforward to show that the independence number \u03b1 of C is less than (3 ln|C|)/p + 1\nwith probability at least 1 \u2212 |C|^{\u22123 ln|C|/(2p)} [57]. The minimum vertex cover size of the \ufb01rst block is\nthen k\u2217 = |C| \u2212 \u03b1 \u2265 |C| \u2212 (3 ln|C|)/p \u2212 1 with the same probability. Plugging this bound on k\u2217\ninto Lemma 1(b) gives the result.\nThe following theorem further develops how the SBM provides substantial structure for our problem.\nTheorem 2. Let C be a planted vertex cover in our SBM model, where |C| = k. Let p and q be\nconstants, and let the number of nodes in the SBM be ck for some constant c \u2265 1. Then with high\nprobability as a function of k, there is a set D of size O(k log k) that is guaranteed to contain C.\nProof. By Lemma 3, we know that |U (k)| is O(k log k) with high probability. Now, for any node\nv \u2208 C, the probability that it links to at least one node outside of C is 1 \u2212 (1 \u2212 q)^{(c\u22121)k}. Taking the\nunion bound over all nodes in C shows that with high probability in k, each node in C has at least\none edge to a node outside C. 
In this case, Proposition 1 implies that C \u2286 U (k), so by computing\nU (k), we contain C with high probability.\nBounds on the minimum vertex cover size. It may be impractical to compute the minimum\nvertex cover size k\u2217, but Lemma 1(b) may still be used if we can bound k\u2217 from above and below.\nSpeci\ufb01cally, given lower and upper bounds l and u on k\u2217, |U (k)| \u2264 (k \u2212 l + 2)u. A cheap way to\n\ufb01nd such bounds is to use Algorithm 1 several times with different edge orderings (and subsequent\npruning) to get better lower and upper bounds on k\u2217. We evaluated this methodology on three\ndatasets: (i) a network of messages sent on a college\u2019s online social network (CollegeMsg; [51]), (ii)\nan autonomous systems graph (as-caida20071105; [41]), and (iii) a snapshot of the Gnutella P2P\n\ufb01le sharing network (p2p-Gnutella31; [45]). For each network, we construct many planted covers\nby considering a 1-hop neighborhood of a node u covering the 2-hop neighborhood of u (we then\nremove u from these sets). We used 20 random orderings of the edges with Algorithm 1 to get the best\nbound on |U (k)|. Figure 2 summarizes the results. In most cases, the approximations substantially\nimprove the bound. With CollegeMsg, the bounds appear approximately linear in the cover size.\n\n3 Recovery performance on datasets with \u201creal\u201d planted vertex covers\n\nNext, we study recovery of planted vertex covers in datasets with concealed but known core-fringe\nstructure arising from the measurement processes described in the introduction. 
We use 5 datasets:\n\n\fusing SparseArrays, Random\nfunction UMVC(A::SparseMatrixCSC{Int64,Int64},\n              ncovers::Int64=300)\n    edges = filter(e -> e[1] < e[2],\n                   collect(zip(findnz(A)[1:2]...)))\n    umvc = zeros(Int64, size(A, 1))\n    for _ in 1:ncovers\n        # Run 2-approximation with random edge ordering\n        vc = zeros(Int64, size(A, 1))\n        for (i, j) in shuffle(edges)\n            if vc[[i, j]] == [0, 0]; vc[[i, j]] .= 1; end\n        end\n        # Reduce to a minimal cover\n        while true\n            vc_size = sum(vc)\n            for c in shuffle(findall(vc .== 1))\n                nbrs = findnz(A[:, c])[1]\n                if sum(vc[nbrs]) == length(nbrs); vc[c] = 0; end\n            end\n            if sum(vc) == vc_size; break; end\n        end\n        umvc[findall(vc .== 1)] .= 1\n    end\n    degs = vec(sum(A, dims=1))  # node degrees\n    return sortperm(collect(zip(umvc, degs)), rev=true)\nend\n\nFigure 3: Complete implementation of our\nunion of minimal vertex covers (UMVC)\nalgorithm in 26 lines of Julia code. 
The algorithm repeatedly runs the standard maximal matching 2-approximation\nalgorithm for minimum vertex cover (Algorithm 1; lines 8\u201312) and reduces\neach cover to a minimal one (lines 13\u201321). The union of covers is ranked \ufb01rst in the\nordering, sorted by degree; the remaining nodes are then sorted by degree (line 25).\nCode is available at https://gist.github.com/arbenson/27c6d9ef2871a31cbdbba33239ea60d0.\n\n(1) email-Enron [35] is an email communication network, where the core is the set of email addresses\nwhose inboxes were released via a regulatory investigation; (2) email-W3C [14, 50, 61] is derived\nfrom crawled W3C email list threads, where the core is the set of nodes with a w3.org domain\nin the email address; (3) email-Eu [42, 62] consists of emails involving members of a European\nresearch institution, where the core nodes are the institution\u2019s members. (4) call-Reality and (5)\ntext-Reality [22] come from phone calls and text messages involving a set of students and faculty at\nMIT participating in the reality mining project. The study participants constitute the core, and an\nedge connects two phone numbers if a call or message was made between them. Each dataset has\ntimestamped edges, and we will evaluate how well we can recover the core as the networks evolve.\nTable 1 provides summary statistics of the datasets. We include the minimum vertex cover size, which\nlets us evaluate Lemma 1(b). We also computed the fraction of nodes that are guaranteed to be in\nU (|C|) by Propositions 1 and 2 and \ufb01nd that 82%\u201399% of the nodes \ufb01t these guarantees.\nWe next study recovery of the planted vertex cover, i.e., the core C. The methods we use provide a\nnode ordering, often through a score function. 
We then evaluate recovery using precision at core size (the fraction of the top-|C| ordered nodes that are in C) and the area under the precision-recall curve.

Proposed algorithm: union of minimal vertex covers (UMVC). Our proposed algorithm, which we call the union of minimal vertex covers (UMVC), repeatedly finds minimal vertex covers and takes their union. The nodes in this union are ordered first, sorted by degree; the nodes not appearing in any minimal cover are then ordered by degree after them. Each minimal cover is constructed by first finding a 2-approximate solution to the minimum vertex cover problem using Algorithm 1 (which takes time linear in the size of the data) and then pruning the resulting cover until it is minimal. We randomly order the edges for processing in order to capture different minimal covers (we use 300 covers in our experiments). The algorithm is incredibly simple: Figure 3 shows a complete implementation in just 26 lines of Julia code. We use 300 covers because this keeps the running time to about a minute on the largest dataset; on several datasets, fewer covers suffice for the same performance. In practice, a larger number of covers requires more computation but can capture more nodes in the planted cover; at the same time, it can also lead to more false positives.

Table 1: Summary statistics of datasets: number of nodes (n), number of edges (m), time spanned, planted vertex cover size (|C|), minimum vertex cover size (k*), bounds from Lemma 1 as a fraction of the total number of nodes (trivially capped at 1), and fraction of nodes in C connected to a node v (i) not in C or (ii) in the interior of C (all neighbors are in C). Nodes in the last two cases are guaranteed to be in the union of minimal vertex covers of size at most |C| by Propositions 1 and 2.
Dataset        n      m      days spanned  |C|    k*     Bnd. a  Bnd. b  frac. Prop. 1  frac. Prop. 2
email-Enron    18.6k  43.2k  1.50k         146    146    0.30    0.02    0.99           0.00
email-W3C      20.1k  31.9k  7.52k         1.99k  1.11k  1.00    1.00    0.76           0.06
email-Eu       202k   320k   804           1.22k  1.18k  1.00    0.26    0.99           0.00
call-Reality   9.02k  10.6k  543           90     84     0.24    0.09    0.90           0.01
text-Reality   1.18k  1.95k  478           82     80     1.00    0.41    0.88           0.00

Importantly, UMVC makes no assumption on the size or minimality of the planted cover C. Instead, we are motivated by our theory in Section 2 in several ways. First, we expect that most of C will lie in the union of all minimal vertex covers of size at most |C|, by Propositions 1 and 2 and Theorem 1; the degree ordering is motivated by Observation 1, which says that nodes of sufficiently large degree must be in C. Second, even though we prune the maximal matchings down to minimal vertex covers, Corollary 1 suggests that the matchings should intersect C; if only a constant number of nodes are pruned when making a matching a minimal cover, then the overlap is still a constant fraction of C. Third, Corollary 2 says that we should not expect the union to grow too quickly.

Other algorithms. We compare UMVC against 5 other methods. First, we use a degree ordering of nodes, which captures the fact that the nodes outside of C cannot link to each other and that |C| is much smaller than the total number of vertices. This heuristic is a common baseline for core-periphery identification [53] and is theoretically justified in certain stochastic block models of core-periphery structure [63]. Second, we order nodes by betweenness centrality [26], the idea being that nodes in the core must appear on shortest paths between fringe nodes. Third, we order nodes by Path-Core (PC) scores [16], which have been used to identify core-periphery structure in networks [40].
Fourth, we use a metric from Borgatti and Everett (BE), which scores nodes in a way designed to capture core-periphery structure [8, 53]. Fifth, we use a belief propagation (BP) method for block recovery in stochastic block models of core-periphery structure [63].

Recovery performance. We divide the temporal edges of each dataset into 10-day increments and construct an undirected, unweighted, simple graph from the first 10r days of activity, for r = 1, 2, ..., ⌊T/10⌋, where T is the number of days spanned by the dataset. Given the ordering of nodes from an algorithm, we evaluate recovery performance by the precision at core size (P@CS; Figure 4, left) and the area under the precision-recall curve (AUPRC; Figure 4, right). We also provide upper bounds on performance: for P@CS, the fraction of core nodes that are non-isolated, and for AUPRC, the AUPRC of a node ordering that places all non-isolated core nodes first and then the remaining nodes randomly.

Figure 4: Core recovery performance on real-world datasets using our union of minimal vertex covers algorithm (UMVC), degree ordering, betweenness centrality [26], belief propagation (BP; [63]), Borgatti-Everett scores (BE; [8]), and Path-Core scores (PC; [16]). Each algorithm orders the nodes, and we measure performance with precision at core size (left) and area under the precision-recall curve (right) over every 10 days of real time. UMVC performs well on all datasets. BP is sometimes competitive but is susceptible to poor local minima, giving erratic performance.
UMVC is also much faster than betweenness, BP, and PC (Table 2).

[Figure 4 plots: precision at core size (left) and area under the precision-recall curve (right) versus days of activity, one row per dataset (email-Enron, email-W3C, email-Eu, call-Reality, text-Reality).]

Table 2: Running time of algorithms on our datasets (here, we use the entire dataset, instead of evaluating over time, as in Figure 4). Our proposed union of minimal vertex covers algorithm (UMVC) is fast and provides the best performance on several real-world datasets (Figure 4).

Dataset        UMVC      degree       betweenness [26]  PC [16]    BE [8, 13]  BP [63]
email-W3C      6.5 secs  < 0.01 secs  2.8 mins          1.1 hours  0.1 secs    1.0 mins
email-Enron    8.4 secs  < 0.01 secs  2.5 mins          1.8 hours  0.1 secs    20.2 mins
email-Eu       1.2 mins  < 0.01 secs  11.8 hours        > 3 days   0.9 secs    15.0 mins
call-Reality   2.2 secs  < 0.01 secs  27.9 secs         6.1 mins   1.8 secs    4.3 secs
text-Reality   0.5 secs  < 0.01 secs  0.8 secs          11.4 secs  < 0.1 secs  6.8 secs

Across all datasets, our UMVC algorithm out-performs the degree, betweenness, PC, and BE baselines at nearly all points in time. BP sometimes exhibits slightly better performance but suffers from erratic behavior over time due to landing in poor local minima; see, e.g., the results for email-W3C (Figure 4, row 2). In some cases, BP can hardly pick up any signal, such as on email-Eu (Figure 4, row 3).
On the email-Eu dataset, UMVC vastly out-performs all other baselines. On email-Enron, UMVC achieves perfect recovery after 900 days of activity.

A key reason for UMVC's success is its use of the vertex cover structure. The other algorithms fail to detect low-degree nodes that reside in the core but look like traditional "periphery" nodes. Core-periphery detection algorithms in network science have traditionally relied on SBM benchmarks or heuristic evaluation [53, 63]. We have already shown how the SBM induces substantial structure for our problem, so the SBM is not a great benchmark for further analysis. Here, we instead used a notion of ground-truth labels on which to evaluate the algorithms and exploited the vertex cover structure to get good performance.

Timing performance. Table 2 shows the time to run the algorithms on the entire dataset. We used 300 vertex covers for UMVC, which was implemented in Julia (Figure 3); tuning the number of covers provides a trade-off between running time and (potentially) recovery performance. The degree ordering was also implemented in Julia, and betweenness centrality was computed with the LightGraphs.jl Julia package's implementation of Brandes' algorithm [9, 10]. Path-Core scoring (PC) was implemented with Python's NetworkX, and the belief propagation (BP) algorithm was implemented in C++. We note that our goal is to demonstrate approximate computation times, rather than to compare the most highly optimized implementations. UMVC takes a few seconds on the email-W3C, email-Enron, call-Reality, and text-Reality datasets, and about one minute on the email-Eu dataset. This is an order of magnitude faster than BP, and several orders of magnitude faster than betweenness and PC.
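For readers who prefer Python to Julia, the same pipeline (greedy maximal-matching 2-approximation, pruning to a minimal cover, and ranking union members first by degree) can be sketched as follows. The function and variable names are ours, and this mirrors the logic of Figure 3 rather than reproducing the authors' released code; node labels are assumed comparable so edges can be deduplicated with u < v:

```python
import random

def minimal_cover(adj):
    """One minimal vertex cover of an undirected simple graph.

    Greedy maximal-matching 2-approximation over a random edge order,
    then prune redundant nodes.  adj: node -> set of neighbor nodes.
    """
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    random.shuffle(edges)
    cover = set()
    for u, v in edges:  # take both endpoints of each matched edge
        if u not in cover and v not in cover:
            cover.update((u, v))
    for c in random.sample(sorted(cover), len(cover)):
        if adj[c] <= cover - {c}:  # every neighbor of c is still covered,
            cover.discard(c)       # so c is redundant in the cover
    return cover

def umvc_order(adj, ncovers=300):
    """Union of minimal covers; rank union members first, then by degree."""
    union = set()
    for _ in range(ncovers):
        union |= minimal_cover(adj)
    return sorted(adj, key=lambda v: (v in union, len(adj[v])), reverse=True)
```

On the path a-b-c, for instance, every minimal cover is {b}, so umvc_order ranks b first regardless of the random edge order.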
There are fast approximation algorithms for betweenness [6, 27], but the weak performance of exact betweenness did not justify exploring these approaches.

4 Discussion

Many network datasets are partial measurements and are often found in some way that destroys the record of how the measurements were made. Here, we examined the case where edges are collected by observing interactions involving some core set of nodes, and the identity of the core nodes is lost. Recovering the core nodes is then a planted vertex cover recovery problem. We developed theory for this problem, which we used to devise a simple algorithm that recovers the core with high efficacy on several real-world datasets. We assumed that our graphs were simple and undirected, but richer structure such as edge directions and timestamps could be incorporated in future research.

The network science community has tools for core-periphery detection, but evaluation is typically limited to recovery of synthetic models or ad hoc score functions. Our work is also the first effort to evaluate the recovery of core-periphery-like (i.e., core-fringe) network structure through the lens of machine learning with "ground truth" labels on the nodes. We hope that this provides a valuable testbed for evaluating algorithms that reveal core-periphery or core-fringe structure, although we should not take evaluation on ground-truth labels as absolute [52].

Code and data accompanying the experiments in this paper are available at https://github.com/arbenson/FGDnPVC.

Acknowledgments

We thank Jure Leskovec for providing access to the email-Eu data; Mason Porter and Sang Hoon Lee for providing the Path-Core code; and Travis Martin and Thomas Zhang for providing the belief propagation code. This research was supported in part by a Simons Investigator Award, NSF TRIPODS Award #1740822, and NSF Award DMS-1830274.

References

[1] E. Abbe.
Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86, 2018.

[2] E. Abbe and C. Sandon. Recovering communities in the general stochastic block model without knowing the parameters. In Advances in Neural Information Processing Systems, pages 676–684, 2015.

[3] E. Abbe and C. Sandon. Achieving the KS threshold in the general stochastic block model with linearized acyclic belief propagation. In Advances in Neural Information Processing Systems, pages 1334–1342, 2016.

[4] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 2002.

[5] N. Alon, M. Krivelevich, and B. Sudakov. Finding a large hidden clique in a random graph. Random Structures and Algorithms, 13(3-4):457–466, 1998.

[6] D. A. Bader, S. Kintali, K. Madduri, and M. Mihail. Approximating betweenness centrality. In International Workshop on Algorithms and Models for the Web-Graph, pages 124–137. Springer, 2007.

[7] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.

[8] S. P. Borgatti and M. G. Everett. Models of core/periphery structures. Social Networks, 21(4):375–395, 2000.

[9] U. Brandes. A faster algorithm for betweenness centrality. The Journal of Mathematical Sociology, 25(2):163–177, 2001.

[10] U. Brandes. On variants of shortest-path betweenness centrality and their generic computation. Social Networks, 30(2):136–145, 2008.

[11] P. Buneman, S. Khanna, and T. Wang-Chiew. Why and where: A characterization of data provenance. In The International Conference on Database Theory, pages 316–330. Springer Berlin Heidelberg, 2001.

[12] J. F. Buss and J. Goldsmith. Nondeterminism within P.
SIAM Journal on Computing, 22(3):560–572, 1993.

[13] A. L. Comrey. The minimum residual method of factor analysis. Psychological Reports, 11(1):15–18, 1962.

[14] N. Craswell, A. P. de Vries, and I. Soboroff. Overview of the TREC 2005 enterprise track. In TREC, volume 5, pages 199–205, 2005.

[15] P. Csermely, A. London, L.-Y. Wu, and B. Uzzi. Structure and dynamics of core/periphery networks. Journal of Complex Networks, 1(2):93–123, 2013.

[16] M. Cucuringu, P. Rombach, S. H. Lee, and M. A. Porter. Detection of core-periphery structure in networks using spectral methods and geodesic paths. European Journal of Applied Mathematics, 27(06):846–887, 2016.

[17] P. Damaschke. Parameterized enumeration, transversals, and imperfect phylogeny reconstruction. Theoretical Computer Science, 351(3):337–350, 2006.

[18] P. Damaschke and L. Molokov. The union of minimal hitting sets: Parameterized combinatorial bounds and counting. Journal of Discrete Algorithms, 7(4):391–401, 2009.

[19] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6), 2011.

[20] Y. Deshpande and A. Montanari. Finding hidden cliques of size √(N/e) in nearly linear time. Foundations of Computational Mathematics, 15(4):1069–1128, 2014.

[21] R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer Science & Business Media, 2012.

[22] N. Eagle and A. S. Pentland. Reality mining: sensing complex social systems. Personal and Ubiquitous Computing, 10(4):255–268, 2005.

[23] D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World. Cambridge University Press, 2010.

[24] D. Eppstein and D. Strash. Listing all maximal cliques in large sparse real-world graphs. In Experimental Algorithms, pages 364–375.
Springer Berlin Heidelberg, 2011.

[25] U. Feige and D. Ron. Finding hidden cliques in linear time. In 21st International Meeting on Probabilistic, Combinatorial, and Asymptotic Methods in the Analysis of Algorithms, pages 189–204. Discrete Mathematics and Theoretical Computer Science, 2010.

[26] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35, 1977.

[27] R. Geisberger, P. Sanders, and D. Schultes. Better approximation of betweenness centrality. In Proceedings of the Meeting on Algorithm Engineering & Experiments, pages 90–100. Society for Industrial and Applied Mathematics, 2008.

[28] K. J. Gile and M. S. Handcock. Respondent-driven sampling: An assessment of current methodology. Sociological Methodology, 40(1):285–327, 2010.

[29] S. P. Hier and J. Greenberg. Surveillance: Power, Problems, and Politics. UBC Press, 2009.

[30] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[31] P. Holme. Core-periphery organization of complex networks. Physical Review E, 72(4), 2005.

[32] H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi. Big data and its technical challenges. Communications of the ACM, 57(7):86–94, 2014.

[33] M. Khabbazian, B. Hanlon, Z. Russek, and K. Rohe. Novel sampling design for respondent-driven sampling. Electronic Journal of Statistics, 11(2):4769–4812, 2017.

[34] M. Kim and J. Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In Proceedings of the SIAM Conference on Data Mining, pages 47–58. Society for Industrial and Applied Mathematics, 2011.

[35] B. Klimt and Y. Yang. Introducing the Enron Corpus. In CEAS, 2004.

[36] G. Kossinets. Effects of missing data in social networks. Social Networks, 28(3):247–268, 2006.

[37] D. Koutra, J. T.
Vogelstein, and C. Faloutsos. DeltaCon: A principled massive-graph similarity function. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 162–170. Society for Industrial and Applied Mathematics, May 2013.

[38] T. Kuny. A digital dark ages? Challenges in the preservation of electronic information. In 63rd IFLA Council and General Conference, 1997.

[39] E. O. Laumann, P. V. Marsden, and D. Prensky. The boundary specification problem in network analysis. Research Methods in Social Network Analysis, 61:87, 1989.

[40] S. H. Lee, M. Cucuringu, and M. A. Porter. Density-based and transport-based core-periphery structures in networks. Physical Review E, 89(3), 2014.

[41] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM Press, 2005.

[42] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data, 1(1):2–es, 2007.

[43] J. Leskovec, K. J. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In Proceedings of the 19th International Conference on World Wide Web. ACM Press, 2010.

[44] C. Lynch. How do your data grow? Nature, 455(7209):28–29, 2008.

[45] R. Matei, A. Iamnitchi, and P. Foster. Mapping the Gnutella network. IEEE Internet Computing, 6(1):50–57, 2002.

[46] R. Meka, A. Potechin, and A. Wigderson. Sum-of-squares lower bounds for planted clique. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing. ACM Press, 2015.

[47] T. Monahan, D. J. Phillips, and D. M. Wood. Surveillance and empowerment. Surveillance and Society, 8(2):106–112, 2010.

[48] E. Mossel, J. Neeman, and A.
Sly. Belief propagation, robust reconstruction and optimal recovery of block models. In Conference on Learning Theory, pages 356–370, 2014.

[49] M. E. J. Newman. The structure and function of complex networks. SIAM Review, 45(2), 2003.

[50] D. Oard, T. Elsayed, J. Wang, Y. Wu, P. Zhang, E. Abels, J. Lin, and D. Soergel. TREC-2006 at Maryland: Blog, enterprise, legal and QA tracks. Technical report, University of Maryland Institute for Advanced Computer Studies, 2006.

[51] P. Panzarasa, T. Opsahl, and K. M. Carley. Patterns and dynamics of users' behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology, 60(5):911–932, 2009.

[52] L. Peel, D. B. Larremore, and A. Clauset. The ground truth about metadata and community detection in networks. Science Advances, 3(5):e1602548, 2017.

[53] P. Rombach, M. A. Porter, J. H. Fowler, and P. J. Mucha. Core-periphery structure in networks (revisited). SIAM Review, 59(3):619–646, 2017.

[54] D. M. Romero, B. Uzzi, and J. Kleinberg. Social networks under stress. In Proceedings of the 25th International Conference on World Wide Web. ACM Press, 2016.

[55] C. Seshadhri, A. Pinar, and T. G. Kolda. Fast triangle counting through wedge sampling. In Proceedings of the SIAM Conference on Data Mining, volume 4, page 5, 2013.

[56] Y. L. Simmhan, B. Plale, and D. Gannon. A survey of data provenance in e-science. ACM SIGMOD Record, 34(3):31, 2005.

[57] D. A. Spielman. Erdős–Rényi Random Graphs: Warm Up. Graphs and Networks Lecture Notes. http://www.cs.yale.edu/homes/spielman/462/2010/lect3-10.pdf, 2010.

[58] N. Spring, R. Mahajan, and D. Wetherall. Measuring ISP topologies with Rocketfuel. ACM SIGCOMM Computer Communication Review, 32(4):133, October 2002.

[59] W.-C. Tan. Research problems in data provenance. IEEE Data Engineering Bulletin, 27:45–52, 2004.

[60] A.
Tsiatas, I. Saniee, O. Narayan, and M. Andrews. Spectral analysis of communication networks using Dirichlet eigenvalues. In Proceedings of the 22nd International Conference on World Wide Web. ACM Press, 2013.

[61] Y. Wu, D. W. Oard, and I. Soboroff. An exploratory study of the W3C mailing list test collection for retrieval of emails with pro/con argument. In CEAS, 2006.

[62] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, 2017.

[63] X. Zhang, T. Martin, and M. E. J. Newman. Identification of core-periphery structure in networks. Physical Review E, 91(3), 2015.