{"title": "Estimating the Size of a Large Network and its Communities from a Random Sample", "book": "Advances in Neural Information Processing Systems", "page_first": 3072, "page_last": 3080, "abstract": "Most real-world networks are too large to be measured or studied directly and there is substantial interest in estimating global network properties from smaller sub-samples. One of the most important global properties is the number of vertices/nodes in the network. Estimating the number of vertices in a large network is a major challenge in computer science, epidemiology, demography, and intelligence analysis. In this paper we consider a population random graph G = (V;E) from the stochastic block model (SBM) with K communities/blocks. A sample is obtained by randomly choosing a subset W and letting G(W) be the induced subgraph in G of the vertices in W. In addition to G(W), we observe the total degree of each sampled vertex and its block membership. Given this partial information, we propose an efficient PopULation Size Estimation algorithm, called PULSE, that accurately estimates the size of the whole population as well as the size of each community. To support our theoretical analysis, we perform an exhaustive set of experiments to study the effects of sample size, K, and SBM model parameters on the accuracy of the estimates. The experimental results also demonstrate that PULSE significantly outperforms a widely-used method called the network scale-up estimator in a wide variety of scenarios.", "full_text": "Estimating the Size of a Large Network and its\n\nCommunities from a Random Sample\n\nLin Chen1,2, Amin Karbasi1,2, Forrest W. 
Crawford2,3\n\n1Department of Electrical Engineering, 2Yale Institute for Network Science,\n\n3Department of Biostatistics, Yale University\n\n{lin.chen, amin.karbasi, forrest.crawford}@yale.edu\n\nAbstract\n\nMost real-world networks are too large to be measured or studied directly and\nthere is substantial interest in estimating global network properties from smaller\nsub-samples. One of the most important global properties is the number of ver-\ntices/nodes in the network. Estimating the number of vertices in a large network is a\nmajor challenge in computer science, epidemiology, demography, and intelligence\nanalysis. In this paper we consider a population random graph G = (V, E) from the\nstochastic block model (SBM) with K communities/blocks. A sample is obtained\nby randomly choosing a subset W \u2286 V and letting G(W ) be the induced subgraph\nin G of the vertices in W . In addition to G(W ), we observe the total degree of\neach sampled vertex and its block membership. Given this partial information, we\npropose an ef\ufb01cient PopULation Size Estimation algorithm, called PULSE, that\naccurately estimates the size of the whole population as well as the size of each\ncommunity. To support our theoretical analysis, we perform an exhaustive set of\nexperiments to study the effects of sample size, K, and SBM model parameters\non the accuracy of the estimates. The experimental results also demonstrate that\nPULSE signi\ufb01cantly outperforms a widely-used method called the network scale-up\nestimator in a wide variety of scenarios.\n\n1\n\nIntroduction\n\nMany real-world networks cannot be studied directly because they are obscured in some way, are too\nlarge, or are too dif\ufb01cult to measure. There is therefore a great deal of interest in estimating properties\nof large networks via sub-samples [15, 5]. One of the most important properties of a large network\nis the number of vertices it contains. 
Unfortunately census-like enumeration of all the vertices in a\nnetwork is often impossible, so researchers must try to learn about the size of real-world networks\nby sampling smaller components. In addition to the size of the total network, there is great interest\nin estimating the size of different communities or sub-groups from a sample of a network. Many\nreal-world networks exhibit community structure, where nodes in the same community have denser\nconnections than those in different communities [10, 18]. In the following examples, we describe\nnetwork size estimation problems in which only a small subgraph of a larger network is observed.\nSocial networks. The social and economic value of an online social network (e.g. Facebook,\nInstagram, Twitter) is closely related to the number of users the service has. When a social networking\nservice does not reveal the true number of users, economists, marketers, shareholders, or other groups\nmay wish to estimate the number of people who use the service based on a sub-sample [4].\nWorld Wide Web. Pages on the World-Wide Web can be classi\ufb01ed into several categories (e.g.\nacademic, commercial, media, government, etc.). Pages in the same category tend to have more\nconnections. Computer scientists have developed crawling methods for obtaining a sub-network of\nweb pages, along with their hyperlinks to other unknown pages. Using the crawled sub-network and\nhyperlinks, they can estimate the number of pages of a certain category [17, 16, 21, 13, 19].\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: Illustration of the vertex set size estimation problem with N = 13 and K = 2. White vertices are\ntype-1 and gray are type-2.\n\nSize of the Internet. The number of computers on the Internet (the size of the Internet) is of great\ninterest to computer scientists. 
However, it is impractical to access and enumerate all computers\non the Internet; only a small sample of computers and the connections among them are\naccessible [24].\nCounting terrorists. Intelligence agencies often target a small number of suspicious or radicalized\nindividuals to learn about their communication network. But agencies typically do not know the\nnumber of people in the network. The number of elements in such a covert network might indicate\nthe size of a terrorist force, and would be of great interest [7].\nEpidemiology. Many of the groups at greatest risk for HIV infection (e.g. sex workers, injection\ndrug users, men who have sex with men) are also difficult to survey using conventional methods.\nSince members of these groups cannot be enumerated directly, researchers often trace social links to\nreveal a network among known subjects. Public health and epidemiological interventions to mitigate\nthe spread of HIV rely on knowledge of the number of HIV-positive people in the population [12, 11,\n22, 23, 8].\nCounting disaster victims. After a disaster, it can be challenging to estimate the number of people\naffected. When logistical challenges prevent all victims from being enumerated, it may still be\npossible to obtain a random sample of individuals [2, 3].\nIn this paper, we propose a novel method called PULSE for estimating the number of vertices and the\nsize of individual communities from a random sub-sample of the network. We model the network\nas an undirected simple graph G = (V, E), and we treat G as a realization from the stochastic\nblockmodel (SBM), a widely-studied extension of the Erd\u0151s-R\u00e9nyi random graph model [20] that\naccommodates community structures in the network by mapping each vertex into one of K \u2265 1\ndisjoint types or communities. 
We construct a sample of the network by choosing a sub-sample of\nvertices W \u2286 V uniformly at random without replacement, and forming the induced subgraph G(W )\nof W in G. We assume that the block membership and total degree d(v) of each vertex v \u2208 W are\nobserved. We propose a Bayesian estimation algorithm PULSE for N = |V |, the number of vertices in\nthe network, along with the number of vertices Ni in each block. We first prove important regularity\nresults for the posterior distribution of N. Then we describe the conditions under which relevant\nmoments of the posterior distribution exist. We evaluate the performance of PULSE in comparison\nwith the popular \u201cnetwork scale-up\u201d method (NSUM) [12, 11, 22, 23, 8, 14, 9]. We show that while\nNSUM is asymptotically unbiased, it suffers from serious finite-sample bias and large variance. We\nshow that PULSE has superior performance \u2013 in terms of relative error and variance \u2013 over NSUM in\na wide variety of model and observation scenarios. All proofs are given in the extended version [6].\n\n2 Problem Formulation\n\nThe stochastic blockmodel (SBM) is a random graph model that generalizes the Erd\u0151s-R\u00e9nyi random\ngraph [20]. Let G = (V, E) \u223c G(N, K, p, t) be a realization from an SBM, where N = |V | is the\ntotal number of vertices, the vertices are divided into K types indexed 1, . . . , K, specified by the\nmap t : V \u2192 {1, . . . , K}, and a type-i vertex and a type-j vertex are connected independently with\nprobability pij \u2208 [0, 1]. Let Ni be the number of type-i vertices in G, with $N = \\sum_{i=1}^{K} N_i$. The degree\nof a vertex v is d(v). An edge is said to be of type-(i, j) if it connects a type-i vertex and a type-j\nvertex. A random induced subgraph is obtained by sampling a subset W \u2286 V with |W| = n uniformly\nat random without replacement, and forming the induced subgraph, denoted by G(W ). 
Let Vi be\nthe number of type-i vertices in the sample and Eij be the number of type-(i, j) edges in the sample.\n\nFor a vertex v in the sample, a pendant edge connects vertex v to a vertex outside the sample. Let\n$\\tilde{d}(v) = d(v) - \\sum_{w \\in W} \\mathbf{1}\\{\\{w, v\\} \\in E\\}$ be the number of pendant edges incident to v. Let yi(v) be the\nnumber of type-(t(v), i) pendant edges of vertex v; i.e., $y_i(v) = \\sum_{w \\in V \\setminus W} \\mathbf{1}\\{t(w) = i, \\{w, v\\} \\in E\\}$.\nWe have $\\sum_{i=1}^{K} y_i(v) = \\tilde{d}(v)$. Let \u02dcNi = Ni\u2212Vi be the number of type-i nodes outside the sample. We\ndefine \u02dcN = ( \u02dcNi : 1 \u2264 i \u2264 K), p = (pij : 1 \u2264 i < j \u2264 K), and y = (yi(v) : v \u2208 W, 1 \u2264 i \u2264 K).\nWe observe only G(W ) and the total degree d(v) of each vertex v in the sample. Assume that we\nknow the type of each vertex in the sample. The observed data D consists of G(W ), (d(v) : v \u2208 W )\nand (t(v) : v \u2208 W ); i.e., D = (G(W ), (d(v) : v \u2208 W ), (t(v) : v \u2208 W )).\nProblem 1. Given the observed data D, estimate the size N = |V | of the vertex set and the size\nNi of each community.\n\nFig. 1 illustrates the vertex set size estimation problem. White nodes are of type-1 and gray nodes\nare of type-2. All nodes outside G(W ) are unobserved. We observe the types and the total degree\nof each vertex in the sample. Thus we know the number of pendant edges that connect each vertex\nin the sample to other, unsampled vertices. However, the destinations of these pendant edges are\nunknown to us.\n\n3 Network Scale-Up Estimator\nWe briefly outline a simple and intuitive estimator for N = |V | that will serve as a comparison\nto PULSE. The network scale-up method (NSUM) is a simple estimator for the vertex set size of\nErd\u0151s-R\u00e9nyi random graphs. 
It has been used in real-world applications to estimate the size of\nhidden or hard-to-reach populations such as drug users [12], HIV-infected individuals [11, 22, 23],\nmen who have sex with men (MSM) [8], and homeless people [14]. Consider a random graph\nthat follows the Erd\u0151s-R\u00e9nyi distribution. The expected sum of total degrees in a random sample\nW of vertices is $E[\\sum_{v \\in W} d(v)] = n(N - 1)p$. The expected number of edges in the sample is\n$E[E_S] = \\binom{n}{2} p$, where ES is the number of edges within the sample. A simple estimator of the\nconnection probability p is $\\hat{p} = E_S / \\binom{n}{2}$. Plugging \u02c6p into the moment equation and solving\nfor N yields $\\hat{N} = 1 + (n - 1) \\sum_{v \\in W} d(v) / (2 E_S)$, often simplified to $\\hat{N}_{NS} = n \\sum_{v \\in W} d(v) / (2 E_S)$\n[12, 11, 22, 23, 8, 14, 9].\nTheorem 1. (Proof in [6]) Suppose G follows a stochastic blockmodel with edge probability pij > 0\nfor 1 \u2264 i, j \u2264 K. For any sufficiently large sample size, the NSUM is positively biased and\n$E[\\hat{N}_{NS} \\mid E_S > 0] - N$ has an asymptotic lower bound $E[\\hat{N}_{NS} \\mid E_S > 0] - N \\gtrsim N/n - 1$, as n\nbecomes large, where for two sequences {an} and {bn}, $a_n \\gtrsim b_n$ means that there exists a sequence\ncn such that an \u2265 cn \u223c bn; i.e., an \u2265 cn for all n and limn\u2192\u221e cn/bn = 1. However, as sample size\ngoes to infinity, the NSUM becomes asymptotically unbiased.\n\n4 Main Results\n\nNSUM uses only aggregate information about the sum of the total degrees of vertices in the sample\nand the number of edges in the sample. We propose a novel algorithm, PULSE, that uses individual\ndegree, vertex type, and the network structure information. 
Experiments (Section 5) show that it\noutperforms NSUM in terms of both bias and variance.\nGiven p = (pij : 1 \u2264 i < j \u2264 K), the conditional likelihood of the edges in the sample is given by\n\n$$L_W(D; p) = \\left( \\prod_{1 \\le i < j \\le K} p_{ij}^{E_{ij}} (1 - p_{ij})^{V_i V_j - E_{ij}} \\right) \\times \\left( \\prod_{i=1}^{K} p_{ii}^{E_{ii}} (1 - p_{ii})^{\\binom{V_i}{2} - E_{ii}} \\right),$$\n\nand the conditional likelihood of the pendant edges is given by\n\n$$L_{\\neg W}(D; p) = \\prod_{v \\in W} \\sum_{y(v)} \\prod_{i=1}^{K} \\binom{\\tilde{N}_i}{y_i(v)} p_{i,t(v)}^{y_i(v)} (1 - p_{i,t(v)})^{\\tilde{N}_i - y_i(v)},$$\n\nwhere the sum is taken over all yi(v)\u2019s (i = 1, 2, 3, . . . , K) such that yi(v) \u2265 0, \u22001 \u2264 i \u2264 K and\n$\\sum_{i=1}^{K} y_i(v) = \\tilde{d}(v)$. Thus the total conditional likelihood is L(D; p) = LW (D; p)L\u00acW (D; p).\n\nIf we condition on p and y, the likelihood of the edges within the sample is the same as LW (D; p)\nsince it does not rely on y, while the likelihood of the pendant edges given p and y is\n\n$$L_{\\neg W}(D; p, y) = \\prod_{v \\in W} \\prod_{i=1}^{K} \\binom{\\tilde{N}_i}{y_i(v)} p_{i,t(v)}^{y_i(v)} (1 - p_{i,t(v)})^{\\tilde{N}_i - y_i(v)}.$$\n\nTherefore the total\nlikelihood conditioned on p and y is given by L(D; p, y) =\nLW (D; p)L\u00acW (D; p, y). The conditional likelihood L(D; p) is indeed a function of \u02dcN. We may view\nthis as the likelihood of \u02dcN given the data D and the probabilities p; i.e., L( \u02dcN ; D, p) \u225c L(D; p). Similarly,\nthe likelihood L(D; p, y) conditioned on p and y is a function of \u02dcN and y. It can be viewed as the\njoint likelihood of \u02dcN and y given the data D and the probabilities p; i.e., L( \u02dcN , y; D, p) \u225c L(D; p, y),\nand $\\sum_{y} L(\\tilde{N}, y; D, p) = L(\\tilde{N}; D, p)$, where the sum is taken over all yi(v)\u2019s, v \u2208 W and\n1 \u2264 i \u2264 K, such that yi(v) \u2265 0 and $\\sum_{i=1}^{K} y_i(v) = \\tilde{d}(v)$, \u2200v \u2208 W , \u22001 \u2264 i \u2264 K. To have a\nfull Bayesian approach, we assume that the joint prior distribution for \u02dcN and p is \u03c0( \u02dcN , p). Hence,\nthe population size estimation problem is equivalent to the following optimization problem for \u02dcN:\n\n$$\\hat{\\tilde{N}} = \\arg\\max_{\\tilde{N}} \\int L(\\tilde{N}; D, p) \\pi(\\tilde{N}, p) \\, dp. \\qquad (1)$$\n\nThen we estimate the total population size as $\\hat{N} = \\sum_{i=1}^{K} \\hat{\\tilde{N}}_i + |W|$.\n\nWe briefly study the regularity of the posterior distribution of N. In order to learn about \u02dcN, we must\nobserve enough vertices from each block type, and enough edges connecting members of each block,\nso that the first and second moments of the posterior distribution exist. Intuitively, in order for the\nfirst two moments to exist, either we must observe many edges connecting vertices of each block\ntype, or we must have sufficiently strong prior beliefs about pij.\nTheorem 2. (Proof in [6]) Assume that \u03c0( \u02dcN , p) = \u03c6( \u02dcN )\u03c8(p) and pij follows the Beta distribution\nB(\u03b1ij, \u03b2ij) independently for 1 \u2264 i < j \u2264 K. Let $\\lambda = \\min_{1 \\le i \\le K} \\left( \\sum_{j=1}^{K} (E_{ij} + \\alpha_{ij}) \\right)$. If \u03c6( \u02dcN )\nis bounded and \u03bb > n + 1, then the n-th moment of N exists.\n\nIn particular, if \u03bb > 3, the variance of N exists. Theorem 2 gives the minimum possible number\nof edges in the sample to make the posterior sampling meaningful. If the prior distribution of\npij is Uniform[0, 1], then we need at least three edges incident on type-i vertices for all types i =\n1, 2, 3, . . . 
, K to guarantee the existence of the posterior variance.\n\n4.1 Erd\u0151s-R\u00e9nyi Model\n\nIn order to better understand how PULSE estimates the size of a general stochastic blockmodel,\nwe study the Erd\u0151s-R\u00e9nyi case where K = 1, and all vertices are connected independently with\nprobability p. Let N denote the total population size, W be the sample with size |W| = V1 and\n\u02dcN = N \u2212 |W|. For each vertex v \u2208 W in the sample, let \u02dcd(v) = y(v) denote the number\nof pendant edges of vertex v, and let E = E11 be the number of edges within the sample. Then\n\n$$L_W(D; p) = p^{E} (1 - p)^{\\binom{|W|}{2} - E}, \\qquad L_{\\neg W}(D; p) = \\prod_{v \\in W} \\binom{\\tilde{N}}{\\tilde{d}(v)} p^{\\tilde{d}(v)} (1 - p)^{\\tilde{N} - \\tilde{d}(v)}.$$\n\nIn the Erd\u0151s-R\u00e9nyi case, y(v) = \u02dcd(v) and thus L\u00acW (D; p) = L\u00acW (D; p, y). Therefore, the total\nlikelihood of \u02dcN conditioned on p is given by\n\n$$L(\\tilde{N}; D, p) = L_W(D; p) L_{\\neg W}(D; p) = p^{E} (1 - p)^{\\binom{|W|}{2} - E} \\prod_{v \\in W} \\binom{\\tilde{N}}{\\tilde{d}(v)} p^{\\tilde{d}(v)} (1 - p)^{\\tilde{N} - \\tilde{d}(v)}.$$\n\nWe assume that p has a beta prior B(\u03b1, \u03b2) and that \u02dcN has a prior \u03c6( \u02dcN ). Let\n\n$$L(\\tilde{N}; D) = \\prod_{v \\in W} \\binom{\\tilde{N}}{\\tilde{d}(v)} \\cdot B\\left( E + u + \\alpha, \\binom{|W|}{2} - E + |W| \\tilde{N} - u + \\beta \\right),$$\n\nwhere $u = \\sum_{v \\in W} \\tilde{d}(v)$. The posterior probability Pr[ \u02dcN|D] is proportional to \u039b( \u02dcN ; D) \u225c\n\u03c6( \u02dcN )L( \u02dcN ; D). The algorithm is presented in Algorithm 1.\n\nAlgorithm 1 Population size estimation algorithm PULSE (Erd\u0151s-R\u00e9nyi case)\nInput: Data D; initial guess for \u02c6N, denoted by N (0); parameters of the beta prior, \u03b1 and \u03b2\nOutput: Estimate for the population size \u02c6N\n1: $\\tilde{N}(0) \\leftarrow N^{(0)} - |W|$\n2: $\\tau \\leftarrow 1$\n3: repeat\n4:   Propose $\\tilde{N}'(\\tau)$ according to a proposal distribution $g(\\tilde{N}(\\tau - 1) \\to \\tilde{N}'(\\tau))$\n5:   $q \\leftarrow \\min\\left\\{1, \\frac{\\Lambda(\\tilde{N}'(\\tau); D) \\, g(\\tilde{N}'(\\tau) \\to \\tilde{N}(\\tau - 1))}{\\Lambda(\\tilde{N}(\\tau - 1); D) \\, g(\\tilde{N}(\\tau - 1) \\to \\tilde{N}'(\\tau))}\\right\\}$\n6:   $\\tilde{N}(\\tau) \\leftarrow \\tilde{N}'(\\tau)$ with probability q; otherwise $\\tilde{N}(\\tau) \\leftarrow \\tilde{N}(\\tau - 1)$\n7:   $\\tau \\leftarrow \\tau + 1$\n8: until some termination condition is satisfied\n9: Look at $\\{\\tilde{N}(\\tau) : \\tau > \\tau_0\\}$ and view it as the sampled posterior distribution for \u02dcN\n10: Let $\\hat{\\tilde{N}}$ be the posterior mean with respect to the sampled posterior distribution.\n\nAlgorithm 2 Population size estimation algorithm PULSE (general stochastic blockmodel case)\nInput: Data D; initial guess for \u02dcN, denoted by \u02dcN (0); initial guess for y, denoted by y(0); parameters of\nthe beta prior, \u03b1ij and \u03b2ij, 1 \u2264 i \u2264 j \u2264 K.\nOutput: Estimate for the population size \u02c6N\n1: $\\tau \\leftarrow 1$\n2: repeat\n3:   Randomly decide whether to update \u02dcN or y\n4:   if update \u02dcN then\n5:     Randomly select $i \\in [1, K] \\cap \\mathbb{N}$.\n6:     $\\tilde{N}^{*} \\leftarrow \\tilde{N}^{(\\tau - 1)}$\n7:     Propose $\\tilde{N}_i^{*}$ according to the proposal distribution $g_i(\\tilde{N}_i^{(\\tau - 1)} \\to \\tilde{N}_i^{*})$\n8:     $q \\leftarrow \\min\\left\\{1, \\frac{\\Lambda(\\tilde{N}^{*}, y; D) \\, g_i(\\tilde{N}_i^{*} \\to \\tilde{N}_i^{(\\tau - 1)})}{\\Lambda(\\tilde{N}^{(\\tau - 1)}, y; D) \\, g_i(\\tilde{N}_i^{(\\tau - 1)} \\to \\tilde{N}_i^{*})}\\right\\}$\n9:     $\\tilde{N}^{(\\tau)} \\leftarrow \\tilde{N}^{*}$ with probability q; otherwise $\\tilde{N}^{(\\tau)} \\leftarrow \\tilde{N}^{(\\tau - 1)}$.\n10:     $y^{(\\tau)} \\leftarrow y^{(\\tau - 1)}$\n11:   else\n12:     Randomly select $v \\in W$.\n13:     $y^{*} \\leftarrow y^{(\\tau - 1)}$\n14:     Propose $y(v)^{*}$ according to the proposal distribution $h_v(y(v)^{(\\tau - 1)} \\to y(v)^{*})$\n15:     $q \\leftarrow \\min\\left\\{1, \\frac{L(\\tilde{N}, y^{*}; D) \\, h_v(y(v)^{*} \\to y(v)^{(\\tau - 1)})}{L(\\tilde{N}, y; D) \\, h_v(y(v)^{(\\tau - 1)} \\to y(v)^{*})}\\right\\}$\n16:     $y^{(\\tau)} \\leftarrow y^{*}$ with probability q; otherwise $y^{(\\tau)} \\leftarrow y^{(\\tau - 1)}$.\n17:     $\\tilde{N}^{(\\tau)} \\leftarrow \\tilde{N}^{(\\tau - 1)}$\n18:   end if\n19:   $\\tau \\leftarrow \\tau + 1$\n20: until some termination condition is satisfied\n21: Look at $\\{\\tilde{N}^{(\\tau)} : \\tau > \\tau_0\\}$ and view it as the sampled posterior distribution for \u02dcN\n22: Let \u02c6N be the posterior mean of $\\sum_{i=1}^{K} \\tilde{N}_i + |W|$ with respect to the sampled posterior distribution.\n\n4.2 General Stochastic Blockmodel\n\nIn the Erd\u0151s-R\u00e9nyi case, y(v) = \u02dcd(v). However, in the general stochastic blockmodel case,\nin addition to the unknown variables \u02dcN1, \u02dcN2, . . . , \u02dcNK to be estimated, we do not know yi(v)\n(v \u2208 W , i = 1, 2, 3, . . . , K) either. The expression L\u00acW (D; p) involves costly summation\nover all possible integer compositions of \u02dcd(v) (v \u2208 W ). However, the joint posterior distribution\nfor \u02dcN and y, which is proportional to $\\int L(\\tilde{N}, y; D, p) \\phi(\\tilde{N}) \\psi(p) \\, dp$, does not involve\nsumming over integer compositions; thus we may sample from the joint posterior distribution for \u02dcN\nand y, and obtain the marginal distribution for \u02dcN. 
Our proposed algorithm PULSE realizes this\nidea. Let $L(\\tilde{N}, y; D) = \\int L(\\tilde{N}, y; D, p) \\psi(p) \\, dp$. We know that the joint posterior distribution\nfor \u02dcN and y, denoted by Pr[ \u02dcN , y|D], is proportional to \u039b( \u02dcN , y; D) \u225c L( \u02dcN , y; D)\u03c6( \u02dcN ). In addition,\nthe conditional distributions Pr[ \u02dcNi| \u02dcN\u00aci, y] and Pr[y(v)| \u02dcN , y(\u00acv)] are also proportional to\nL( \u02dcN , y; D)\u03c6( \u02dcN ), where \u02dcN\u00aci = ( \u02dcNj : 1 \u2264 j \u2264 K, j \u2260 i), y(v) = (yi(v) : 1 \u2264 i \u2264 K) and\ny(\u00acv) = (y(w) : w \u2208 W, w \u2260 v). The proposed algorithm PULSE is a Gibbs sampling process that\nsamples from the joint posterior distribution (i.e., Pr[ \u02dcN , y|D]), which is specified in Algorithm 2.\nFor every v \u2208 W and i = 1, 2, 3, . . . , K, 0 \u2264 yi(v) \u2264 \u02dcNi because the number of type-(i, t(v))\npendant edges of vertex v must not exceed the total number of type-i vertices outside the sample.\nTherefore, \u02dcNi \u2265 maxv\u2208W yi(v) must hold for every i = 1, 2, 3, . . . , K. These observations\nput constraints on the choice of proposal distributions gi and hv, i = 1, 2, 3, . . . , K and v \u2208 W ;\ni.e., the support of gi must be contained in [maxv\u2208W yi(v),\u221e) \u2229 N and the support of hv must be\ncontained in $\\{y(v) : \\forall 1 \\le i \\le K, 0 \\le y_i(v) \\le \\tilde{N}_i, \\sum_{j=1}^{K} y_j(v) = \\tilde{d}(v)\\}$.\n\nLet \u03c9i be the window size for \u02dcNi, taking values in N. 
Let $l = \\max\\{\\max_{v \\in W} y_i(v), \\tilde{N}_i^{(\\tau - 1)} - \\omega_i\\}$.\nLet the proposal distribution gi be defined as below:\n\n$$g_i(\\tilde{N}_i^{(\\tau - 1)} \\to \\tilde{N}_i^{*}) = \\begin{cases} \\frac{1}{2\\omega_i + 1} & \\text{if } l \\le \\tilde{N}_i^{*} \\le l + 2\\omega_i, \\\\ 0 & \\text{otherwise.} \\end{cases}$$\n\nThe proposed value $\\tilde{N}_i^{*}$ is always greater than or equal to maxv\u2208W yi(v). This proposal distribution\nis uniform within the window [l, l + 2\u03c9i], and thus the proposal ratio is\n$g_i(\\tilde{N}_i^{*} \\to \\tilde{N}_i^{(\\tau - 1)}) / g_i(\\tilde{N}_i^{(\\tau - 1)} \\to \\tilde{N}_i^{*}) = 1$. The proposal for y(v) is detailed in the extended version [6].\n\n5 Experiment\n\n5.1 Erd\u0151s-R\u00e9nyi\n\nEffect of Parameter p. We first evaluate the performance of PULSE in the Erd\u0151s-R\u00e9nyi case. We\nfix the size of the network at N = 1000 and the sample size |W| = 280 and vary the parameter p.\nFor each p \u2208 [0.1, 0.9], we sample 100 graphs from G(N, p). For each selected graph, we compute\nNSUM and run PULSE 50 times (as it is a randomized algorithm) to compute its performance. We\nrecord the relative errors by the Tukey boxplots shown in Fig. 2a. The posterior mean proposed\nby PULSE is an accurate estimate of the size. For the parameter p varying from 0.1 to 0.9, most of\nthe relative errors are bounded between \u22121% and 1%. We also observe that the NSUM tends to\noverestimate the size as it shows a positive bias. This confirms experimentally the result of Theorem\n1. For both methods, the interquartile ranges (IQRs, hereinafter) correlate negatively with p. This\nshows that the variance of both estimators shrinks when the graph becomes denser. The relative errors\nof PULSE tend to concentrate around 0 with larger p, which means that the performance of PULSE\nimproves with larger p. In contrast, a larger p does not improve the bias of the NSUM.\nEffect of Network Size N. 
We fix the parameter p = 0.3 and the sample size |W| = 280 and vary\nthe network size N from 400 to 1000. For each N \u2208 [400, 1000], we randomly pick 100 graphs from\nG(N, p). For each selected graph, we compute NSUM and run PULSE 50 times. We illustrate the\nresults via Tukey boxplots in Fig. 2b. Again, the estimates given by PULSE are very accurate. Most\nof the relative errors reside in [\u22120.5%, 0.5%] and almost all reside in [\u22121%, 1%]. We also observe\nthat smaller network sizes can be estimated more accurately as PULSE will have a smaller variance.\nFor example, when the network size is N = 400, almost all of the relative errors are bounded in the\nrange [\u22120.7%, 0.7%] while for N = 1000, the relative errors are in [\u22121.5%, 1.5%]. This agrees with\nour intuition that the performance of estimation improves with a larger sampling fraction. In contrast,\nNSUM heavily overestimates the network size as the size increases. In addition, its variance also\ncorrelates positively with network size.\nEffect of Sample Size |W|. We study the effect of the sample size |W| on the estimation error. To this end,\nwe fix the size N = 1000 and the parameter p = 0.3, and we vary the sample size |W| from 100 to\n500. For each |W| \u2208 [100, 500], we randomly select 100 graphs from G(N, p). For every selected\ngraph, we compute the NSUM estimate, run PULSE 50 times, and record the relative errors. The\nresults are presented in Fig. 2c. We observe that for both methods the IQR shrinks as the sample\nsize increases; thus a larger sample size reduces the variance of both estimators. PULSE does not\nexhibit appreciable bias when the sample size varies from 100 to 500. Again, NSUM overestimates\nthe size; however, its bias decreases when the sample size becomes large. This reconfirms Theorem 1.\n\n5.2 General Stochastic Blockmodel\n\nEffect of Sample Size and Type Partition. Here, we study the effect of the sample size and\nthe type partition. 
We set the network size N to 200 and we assume that there are two types of\nvertices in this network: type 1 and type 2 with N1 and N2 nodes, respectively. The ratio N1/N\nquantifies the type partition. We vary N1/N from 0.2 to 0.8 and the sample size |W| from 40\nto 160. For each combination of N1/N and the sample size |W|, we generate 50 graphs with\np11, p22 \u223c Uniform[0.5, 1] and p12 = p21 \u223c Uniform[0, min{p11, p22}]. For each graph, we\ncompute the NSUM and obtain the average relative error. Similarly, for each graph, we run PULSE\n10 times in order to compute the average relative error for the 50 graphs and 10 estimates for each\ngraph. The results are shown as heat maps in Fig. 2d. Note that the color bar on the right side of Fig.\n2d is on logarithmic scale.\n\nFigure 2: Fig. 2a, 2b and 2c are the results of the Erd\u0151s-R\u00e9nyi case: (a) Effect of parameter p on the estimation\nerror. (b) Effect of the network size on the estimation error. (c) Effect of the sample size on the estimation error.\nFig. 2d, 2e, 2f and 2g are the results of the general SBM case: (d) Effect of sample size and type partition on\nthe relative error. Note that the color bar on the right is on logarithmic scale. (e) Effect of deviation from the\nErd\u0151s-R\u00e9nyi model (controlled by \u03b5) on the relative error of NSUM and PULSE in the SBM with K = 2. (f)\nEffect of deviation from the Erd\u0151s-R\u00e9nyi model (controlled by \u03b5) on the relative error of PULSE in estimating\nthe number of type-1 and type-2 nodes in the SBM with K = 2. (g) Effect of the number of types K and the\nsample size on the population estimation. The percentages are the sampling fractions n/N. The horizontal axis\nrepresents the number of types K that varies from 1 to 6. The vertical axis is the relative error in percentage.\n\n
In general, the estimates given by PULSE are very accurate and exhibit\nsignificant superiority over the NSUM estimates. The largest relative errors of PULSE in absolute\nvalue, which are approximately 1%, appear in the upper-left and lower-left corners of the heat map.\nThe performance of the NSUM (see the right subfigure in Fig. 2d) is robust to the type partition and\nequivalently the ratio N1/N. As we enlarge the sample size, its relative error decreases.\nThe left subfigure in Fig. 2d shows the performance of PULSE. When the sample size is small, the\nrelative error decreases as N1/N increases from 0.2 to 0.5; when N1/N rises from 0.5 to 0.8, the\nrelative error becomes large. Given the fixed ratio N1/N, as expected, the relative error declines\nwhen we have a larger sample. This agrees with our observation in the Erd\u0151s-R\u00e9nyi case. However,\nwhen the sample size is large, PULSE exhibits better performance when the type partition is more\nhomogeneous. There is a local minimum relative error in absolute value shown at the center of the\nsubfigure. PULSE performs best when there is a balance between the number of edges in the sampled\ninduced subgraph and the number of pendant edges emanating outward. Larger sampled subgraphs\nallow more precision in knowledge about pij, but more pendant edges allow for better estimation of y,\nand hence each Ni. Thus when the sample is about half of the network size, the balanced combination\nof the number of edges within the sample and those emanating outward leads to better performance.\nEffect of Intra- and Inter-Community Edge Probability. Suppose that there are two types of nodes\nin the network. The mean degree is given by $d_{mean} = \\frac{2}{N} \\left[ \\binom{N_1}{2} p_{11} + \\binom{N_2}{2} p_{22} + N_1 N_2 p_{12} \\right]$. We want\nto keep the mean degree constant and vary the random graph gradually so that we observe 3 phases:\nhigh intra-community and low inter-community edge probability (more cohesive), Erd\u0151s-R\u00e9nyi, and\nlow intra-community and high inter-community edge probability (less cohesive). We introduce a\ncohesion parameter \u03b5. In the Erd\u0151s-R\u00e9nyi case of the two-block model, we have p11 = p22 = p12 = \u02dcp, where \u02dcp is a constant.\nWe call \u03b5 the deviation from this situation and let $p_{11} = \\tilde{p} + \\frac{N_1 N_2 \\epsilon}{2 \\binom{N_1}{2}}$, $p_{22} = \\tilde{p} + \\frac{N_1 N_2 \\epsilon}{2 \\binom{N_2}{2}}$, $p_{12} = \\tilde{p} - \\epsilon$.\nThe mean degree stays constant for different \u03b5. In addition, p11, p12 and p22 must reside in [0, 1].\nThis requirement can be met if we set the absolute value of \u03b5 small enough. By changing \u03b5 from\npositive to negative we go from cohesive behavior to incohesive behavior. Clearly, for \u03b5 = 0, the\ngraph becomes an Erd\u0151s-R\u00e9nyi graph with p11 = p22 = p12 = \u02dcp.\nWe set the network size N to 850, N1 to 350, and N2 to 500. We fix \u02dcp = 0.5 and let \u03b5 vary from \u22120.3\nto 0.3. When \u03b5 = 0.3, the intra-community edge probabilities are p11 = 0.9298 and p22 = 0.7104\nand the inter-community edge probability is p12 = 0.2. When \u03b5 = \u22120.3, the intra-community\nedge probabilities are p11 = 0.0702 and p22 = 0.2896 and the inter-community edge probability is\np12 = 0.8. For each \u03b5, we generate 500 graphs and for each graph, we run PULSE 50 times. 
For each value of $\\epsilon$, the relative errors are shown as box plots in Fig. 2e.\nThe figure shows that, despite deviation from the Erd\u0151s-R\u00e9nyi graph, both methods are\nrobust. However, it also indicates that PULSE is unbiased (as the median is around zero) while NSUM\noverestimates the size on average. This again confirms Theorem 1.\nAn important feature of PULSE is that it can also estimate the number of nodes of each type while\nNSUM cannot. The results for type-1 and type-2 with different $\\epsilon$ are shown in Fig. 2f. We observe\nthat the medians of all boxes agree with the 0% line; thus the separate estimates for $N_1$ and $N_2$ are\nunbiased. Note that when the edge probabilities are more homogeneous (i.e., when the graph becomes\nmore similar to the Erd\u0151s-R\u00e9nyi model) the interquartile ranges (IQRs), as well as the intervals between\nthe two ends of the whiskers, become larger. This shows that when we fit a two-type model to a graph\nthat is close to an Erd\u0151s-R\u00e9nyi graph (a single-type stochastic blockmodel), the variance becomes larger.\nEffect of Number of Types and Sample Size. Finally, we study the impact of the number of types\nK and the sample size |W| = n on the relative error. To generate graphs with different numbers of\ntypes, we use a Chinese restaurant process (CRP) [1]. We set the total number of vertices to 200. We first\npick 100 vertices and use the CRP to assign them to different types. Suppose\nthat the CRP gives K types; we then distribute the remaining 100 vertices evenly among the K types.\nThe edge probability $p_{ii}$ ($1 \\le i \\le K$) is sampled from Uniform[0.7, 1] and $p_{ij}$ ($1 \\le i < j \\le K$) is\nsampled from Uniform[0, $\\min\\{p_{ii}, p_{jj}\\}$], all independently. We set the sampling fraction n/N to\n33%, 50% and 66%, and use NSUM and PULSE to estimate the network size. Relative estimation\nerrors are illustrated in Fig. 2g. 
We observe that, for the same sampling fraction n/N and the same\nnumber of types K, PULSE has a smaller relative error than NSUM. Similarly, the\ninterquartile range of PULSE is also smaller than that of NSUM. Hence, PULSE provides higher\naccuracy with smaller variance. For both methods the relative error decreases (in absolute value) as\nthe sampling fraction increases. Accordingly, the IQRs also shrink for larger sampling fractions. With\nthe sampling fraction fixed, the IQRs become larger when we increase the number of types in the\ngraph; that is, the variance of both methods increases with K. The median of NSUM is\nalways above 0, which indicates that it overestimates the network size.\n\nAcknowledgements\n\nThis research was supported by a Google Faculty Research Award, a DARPA Young Faculty Award\n(D16AP00046), NIH grants from NICHD DP2HD091799, NCATS KL2 TR000140, and NIMH\nP30 MH062294, the Yale Center for Clinical Investigation, and the Yale Center for Interdisciplinary\nResearch on AIDS. LC thanks Zheng Wei for his consistent support.\n\nReferences\n[1] D. J. Aldous. Exchangeability and related topics. Springer, 1985.\n\n[2] H. Bernard, E. Johnsen, P. Killworth, and S. Robinson. How many people died in the Mexico City earthquake. Estimating the Number of People in an Average Network and in an Unknown Event Population. The Small World, ed. M. Kochen (forthcoming). Newark, 1988.\n\n[3] H. R. Bernard, P. D. Killworth, E. C. Johnsen, G. A. Shelley, and C. McCarty. Estimating the ripple effect of a disaster. Connections, 24(2):18\u201322, 2001.\n\n[4] M. S. Bernstein, E. Bakshy, M. Burke, and B. Karrer. Quantifying the invisible audience in social networks. In Proc. SIGCHI, pages 21\u201330. ACM, 2013.\n\n[5] L. Chen, F. W. Crawford, and A. Karbasi. Seeing the unseen network: Inferring hidden social ties from respondent-driven sampling. In AAAI, pages 1174\u20131180, 2016.\n\n[6] L. Chen, A. 
Karbasi, and F. W. Crawford. Estimating the size of a large network and its communities from a random sample. arXiv preprint arXiv:1610.08473, 2016. https://arxiv.org/abs/1610.08473.\n\n[7] F. W. Crawford. The graphical structure of respondent-driven sampling. Sociological Methodology, 46(1):187\u2013211, 2016.\n\n[8] S. Ezoe, T. Morooka, T. Noda, M. L. Sabin, and S. Koike. Population size estimation of men who have sex with men through the network scale-up method in Japan. PLoS One, 7(1):e31184, 2012.\n\n[9] D. M. Feehan and M. J. Salganik. Generalizing the network scale-up method: a new estimator for the size of hidden populations. Sociological Methodology, 46(1):153\u2013186, 2016.\n\n[10] M. Girvan and M. E. Newman. Community structure in social and biological networks. PNAS, 99(12):7821\u20137826, 2002.\n\n[11] W. Guo, S. Bao, W. Lin, G. Wu, W. Zhang, W. Hladik, A. Abdul-Quader, M. Bulterys, S. Fuller, and L. Wang. Estimating the size of HIV key affected populations in Chongqing, China, using the network scale-up method. PLoS One, 8(8):e71796, 2013.\n\n[12] C. Kadushin, P. D. Killworth, H. R. Bernard, and A. A. Beveridge. Scale-up methods as applied to estimates of heroin use. Journal of Drug Issues, 2006.\n\n[13] L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In WWW, pages 597\u2013606. ACM, 2011.\n\n[14] P. D. Killworth, C. McCarty, H. R. Bernard, G. A. Shelley, and E. C. Johnsen. Estimation of seroprevalence, rape, and homelessness in the United States using a social network approach. Eval. Rev., 22(2):289\u2013308, 1998.\n\n[15] A. S. Maiya and T. Y. Berger-Wolf. Benefits of bias: Towards better characterization of network sampling. In Proc. SIGKDD, pages 105\u2013113. ACM, 2011.\n\n[16] L. Massouli\u00e9, E. Le Merrer, A.-M. Kermarrec, and A. Ganesh. Peer counting and sampling in overlay networks: random walk methods. In Proc. PODC, pages 123\u2013132. 
ACM, 2006.\n\n[17] B. H. Murray and A. Moore. Sizing the internet. White paper, Cyveillance, page 3, 2000.\n\n[18] M. E. Newman. Modularity and community structure in networks. PNAS, 103(23):8577\u20138582, 2006.\n\n[19] M. Papagelis, G. Das, and N. Koudas. Sampling online social networks. TKDE, 25(3):662\u2013676, 2013.\n\n[20] A. R\u00e9nyi and P. Erd\u0151s. On random graphs. Publicationes Mathematicae, 6:290\u2013297, 1959.\n\n[21] B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In Proc. IMC, pages 390\u2013403. ACM, 2010.\n\n[22] M. J. Salganik, D. Fazito, N. Bertoni, A. H. Abdo, M. B. Mello, and F. I. Bastos. Assessing network scale-up estimates for groups most at risk of HIV/AIDS: evidence from a multiple-method study of heavy drug users in Curitiba, Brazil. American Journal of Epidemiology, 174(10):1190\u20131196, 2011.\n\n[23] M. Shokoohi, M. R. Baneshi, and A.-a. Haghdoost. Size estimation of groups at high risk of HIV/AIDS using network scale up in Kerman, Iran. Int\u2019l J. Prev. Medi., 3(7):471, 2012.\n\n[24] S. Xing and B.-P. Paris. Measuring the size of the internet via importance sampling. J. Sel. Areas Commun., 21(6):922\u2013933, 2003.", "award": [], "sourceid": 1525, "authors": [{"given_name": "Lin", "family_name": "Chen", "institution": "Yale University"}, {"given_name": "Amin", "family_name": "Karbasi", "institution": "Yale"}, {"given_name": "Forrest", "family_name": "Crawford", "institution": "Yale University"}]}