Paper ID: 1525

Title: Estimating the Size of a Large Network and its Communities from a Random Sample

In this submission, the author(s) provide(s) a new method for estimating the size of communities from the observation of a random subgraph together with, for each sampled vertex, its total degree and its community membership. Their estimator is based on maximizing the likelihood, and they avoid combinatorial intractability by using a Gibbs sampling step in their procedure. No theoretical guarantees are given regarding the rate of convergence of the joint posterior distribution in the Gibbs sampling step of their proposed algorithm, but they have a result on the $n$-th moment of their estimator. There is no result on the consistency of their method.

There are some important typos: n is the size of the sampled subgraph, but in Theorem 2 it is also the number of moments of $N$. The authors claim that Theorem 2 "gives the minimum possible number ... meaningful", but they do not provide a converse to their theorem; what is the sense of "minimum" here? Does it mean "sharp"?

2-Confident (read it all; understood it all reasonably well)

The article "Estimating the Size of a Large Network and its Communities from a Random Sample" considers the problem of estimating the total number of vertices in a network given a subsample of the network and the full degrees of the vertices in the subsample. The authors propose an algorithm that uses a stochastic block model (SBM) approximation of the network to estimate the number of vertices in each block and subsequently in the full network. The properties of the algorithm are demonstrated on a series of examples on model data.
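To make the setting concrete, here is a minimal Python sketch of the estimation problem, assuming Bernoulli vertex sampling and the classical scale-up heuristic (all names and the specific estimator form are illustrative, not taken from the paper's code): each sampled vertex reveals its total degree, while only edges between two sampled vertices are observed, so the ratio of observed to total degree estimates the sampling fraction n/N.

```python
import random

def sample_induced_subgraph(edges, n_total, p):
    """Bernoulli-sample vertices with probability p; return the sampled
    vertex set and the edges of the induced subgraph."""
    sampled = {v for v in range(n_total) if random.random() < p}
    induced = [(u, v) for (u, v) in edges if u in sampled and v in sampled]
    return sampled, induced

def scale_up_estimate(sampled, induced, degree):
    """Naive scale-up estimate of N: each sampled vertex sees roughly a
    fraction n/N of its neighbours inside the sample, so
    sum(induced degrees) ~ (n/N) * sum(total degrees)."""
    n = len(sampled)
    total_deg = sum(degree[v] for v in sampled)
    induced_deg = 2 * len(induced)  # each induced edge contributes to two degrees
    return n * total_deg / induced_deg if induced_deg else float("inf")
```

On an Erdos-Renyi graph this heuristic lands near the true N on average, which is why it serves as the baseline; the review below discusses why the SBM structure complicates the corresponding likelihood.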

The main idea of the article is to improve the estimation of the size of a network by considering the SBM instead of the Erdos-Renyi model used by the baseline algorithm. This is a natural approach, which however leads to significant complications, as the resulting likelihood is costly to evaluate due to a massive summation. The authors overcome this problem by considering the joint posterior and sampling from it via Gibbs sampling. The general approach seems fair; however, the computational limitations of the algorithm are not considered, and it is not evident whether the algorithm achieves the goal of mimicking the target distribution.

I should note that the theorems are presented as one of the main results of the work, but in fact they do not add significant value to the algorithm or the results. The first theorem establishes the finite-sample bias of the basic estimate, but finite-sample unbiasedness of the proposed algorithm is not shown. The second theorem states the tractability of the obtained posterior, but the properties of the algorithm itself are not theoretically assessed.

The experimental part seems to be a proper comparison of the two algorithms on different variants of model data, but no experiments on real data are considered. Real data is very important, as any model for such data is an approximation, which in the SBM case means having to choose some number of blocks K. It would be of interest to study the behavior of the algorithm on real data, especially since it is quite easy to run the experiment on any type of real network simply by sampling a part of it.

To sum up, I think the paper presents a good effort in an interesting research direction; however, the theoretical results do not seem to support the algorithm (while being underlined as important), and the experimental part needs significant improvement.

2-Confident (read it all; understood it all reasonably well)

This paper provides an algorithm to estimate the size (number of vertices) of a large network which is assumed to be generated from a Stochastic Block Model (SBM). The data consist of a subgraph together with the total degree of each sampled vertex and the cluster to which it belongs. The authors first show that a naive estimate is biased. They then propose their algorithm(s), which rely(ies) on Bayesian techniques. They provide a theoretical result giving a sufficient condition on the moments of the posterior distribution of the size. They show the efficiency of their method on simulations.

This paper tackles an interesting problem with a novel, promising method. The two theorems are meaningful, and their proofs in the supplementary material are well written. Numerical experiments support the interest of the proposed algorithms. I list a few points below to help a future reader:
- I found the discussion of the different likelihoods considered, at the top of page 4, quite fuzzy. This paragraph could be improved.
- In Theorem 2, could you make precise that you deal with the n-th moment of N with respect to its posterior distribution?
- Between lines 158-159, could you state that you use B to denote both the beta distribution and its density?
- In Algorithms 1 and 2, you should make precise that you rely mainly on a Metropolis-Hastings algorithm.
I also have some questions that could be discussed:
- Figure 2d shows an astonishing behavior of the relative error; could you discuss why, when the sample size is 140-160, the error drops to 0 around the upper and lower right corners?
- You only use the posterior mean of N; is there any interesting information in the whole distribution? I would have appreciated some insights in this respect.

2-Confident (read it all; understood it all reasonably well)

The article, entitled "Estimating the Size of a Large Network and its Communities from a Random Sample", studies the problem of estimating the size of a large network and its communities from a random sample of a stochastic block model. The authors propose an algorithm and show that it outperforms a baseline on synthetic datasets.

The motivation of the manuscript is clearly explained. However, the proposed model does not quite fit the problem description. The authors only consider a special type of graph, generated from a stochastic block model, for their analysis, and they never explain how well it resembles real-world networks. Moreover, the experiments are performed only on synthetic datasets, which further obscures the actual effect of their algorithm on real networks. Therefore, the paper is at a preliminary stage, and I do not recommend its publication at NIPS.

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

In this paper, the authors consider how to estimate the number of vertices in a large network. They focus on random graphs from the stochastic block model (SBM). Given partial information about the induced subgraph, they propose an algorithm called PULSE to estimate the population size. They also compare PULSE with another algorithm called NSUM. For NSUM, they give a theoretical asymptotic lower bound on the bias. For PULSE, they study the conditions under which the posterior sampling is meaningful. They then study the ER model case and the general SBM case separately. To support the theoretical results, they conduct experiments on the effects of sample size and SBM model parameters on the accuracy of the estimates.
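For readers unfamiliar with the model, a minimal sketch of SBM generation may help (a generic textbook construction, not the authors' code): each vertex carries a block label, and each edge appears independently with a probability that depends only on the blocks of its two endpoints.

```python
import random

def sample_sbm(block_sizes, P):
    """Generate an undirected SBM graph.

    block_sizes[b] is the number of vertices in block b; P[a][b] is the
    probability of an edge between a vertex in block a and one in block b.
    Returns the block label of each vertex and the edge list.
    """
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)
             if random.random() < P[labels[u]][labels[v]]]
    return labels, edges
```

With assortative parameters (within-block probability larger than between-block), most edges fall inside blocks; this is the community structure that PULSE models and that a plain scale-up estimator ignores.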

* Strong points:
- The paper is well written and organized.
- The idea is simple and novel.
- The mathematical analysis is solid.
* Weak points:
- The motivation is interesting and important in the big data era. However, I wonder how many real-world networks can be modeled well by the SBM. For social networks, which you give as an example, can you provide some evidence that they fit the SBM well? For WWW networks, fitting the SBM requires converting them to undirected graphs. It is then hard to get the degrees of the vertices in the sampled graph, since we can only get their out-degrees, as said in line 37.
- The motivation of this paper is that most real-world networks are too *large* to be measured, so people are interested in estimating global network properties from smaller sub-samples. However, in your experimental part, the global graphs you use are not that large (N ~ 1000). What about the scalability of PULSE? Comparatively, NSUM is simpler and can be used on large graphs.
- In Theorem 1, the authors study an asymptotic lower bound for NSUM. However, the sample size n cannot actually go to infinity and is bounded by N. How does this constraint affect your analytical result? In addition, since Theorem 1 points out that \hat N_NS is positively biased, why do you use the simplified version \hat N_NS instead of the original version \hat N (line 108)? What is the benefit of using the simplified but biased estimator?

2-Confident (read it all; understood it all reasonably well)

With the aim of estimating the size of a large (population) graph and its communities, this paper proposes a size-estimation method that uses a randomly sampled (induced) subgraph. The proposed method applies to graphs generated according to the Erdos-Renyi and Stochastic Block models. While an existing popular approach (NSUM) is biased, as proven in the paper, the proposed method is unbiased, and the number of edges in the sampled graph that leads to meaningful estimates is characterized. Experimental evaluations study the performance of the proposed method under a number of parameter settings and demonstrate its superiority over NSUM.

Technical quality: 2
[results]
+ The discussion of the minimum possible number of edges (Lines 145--148) is of practical use.
[experiments]
+ The proposed method is extensively investigated over a number of parameter settings.
- I do not think the results are conclusive enough to demonstrate superiority on "large" networks, because the graph datasets are too small, i.e., at most 1000 vertices.
Novelty / originality: 3
+ The theoretical result (Theorem 1) on the accuracy of NSUM is new and thus motivates the development of an unbiased and accurate estimator.
+ The problem addressed in the paper, which concerns the cluster sizes as well as the number of vertices, has a certain novelty.
Potential impact or usefulness: 3
+ Real-world networks are massive, and estimating the size of the whole graph is one of the most fundamental issues in network analysis; thus the problem addressed in this paper is relevant to a wide range of areas.
Clarity and presentation: 3
[explanations]
+ The experimental evaluation verifies the performance of the proposed method over a number of parameter settings, which clearly strengthens the case for the proposed method against the existing approach.
- There are no comparisons with existing network-size-estimation algorithms (such as a random-walk based approach [a]) other than NSUM. Clarifying the differences would be convincing.
[a] Stephen J. Hardiman and Liran Katzir. Estimating Clustering Coefficients and Size of Social Networks via Random Walk. In WWW 2013.

1-Less confident (might not have understood significant parts)