Review for NeurIPS paper: Provable Overlapping Community Detection in Weighted Graphs

NeurIPS 2020

Provable Overlapping Community Detection in Weighted Graphs

Review 1

Summary and Contributions: In this paper the authors consider the overlapping community detection problem under the Mixed Membership Stochastic Block (MMSB) model. The input to the problem is a weighted matrix with weights resambing probabilities of interaction between nodes, and the number k of the communities to be found. It is assumed that there exists a ground truth community interaction k times k dimensional matrix B.The objective is to compute a partial assignment of each node to each other communities that best describes the input graph. Similarly to the Stochastic Block Model being the main underlying assumption when studying non-overlapping community detection with theoretical guarantees, its generalization, namely the MMSB model, is a reasonable model for studying overlapping community detection with theoretical guarantees. The main differentiation to prior work, is that the authors do not require the assumption that in each community there exists at least one node that belongs to no other community, which are called pure nodes. This assumption is arguably unrealistic in practice. It is known that if not pure nodes are present in the instance, then one cannot recover the community structures. The authors circumvent this barrier by showing that if two different sets of parameters to the MMSB that lead to the same probability matrix P, then the corresponding node-community fractional assignment are close with high probability. This means that the authors recover a node-community assignment that is close to the ground truth. The algorithm suggested by the authors provides some bounds on each node-community assignment. The technical part is well-written, but it seems mostly accessible to people with a good knowledge of the techniques and tools that are used. The experimental part contains a well-executed set of experiments against a competitor algorithm (GeoNMF) which has been reported to outperform several other algorithms for recovery of the MMSB under the pure nodes assumption. The experimental setup contains several synthetic and real-world datasets of moderate size. The authors evaluate the tested algorithms as follows. They first generate an MMSB graph (according to some underlying node-community assignment) and then compare the produced solution to that of the ground truth data that generated the input. The results of the experiments highlight that the newly proposed algorithm performs better that the GeoNMF algorithm in terms of quality, but it suffers in efficiency. An interesting aspect of the paper is that they apply their algorithm on a problem in computational biology, where one is given as an input a weighted graph with protein-protein interactions and the objective is to identify the overlapping communities of proteins that form what is called the protein complexes. This problem is well studied in computational biology and there are specialized metrics for evaluating the quality of the solution. The authors show that their algorithm performs comparably to some well-performing specialized algorithm for the problem, when evaluated on the specialized metrics for the problem. Unfortunately, the code used for the experiments is not available.

Strengths: First study to remove the assumption of pure nodes in each community for the problem, and provide provable guarantees. Experimental case-study on clustering proteins based on protein-protein interactions, showcases real-world applications of the proposed method.

Weaknesses: The scalability of the proposed method suffers compared to competitor algorithms. While the paper is certainly readable, it requires good knowledge of related tools and concepts. The probable guarantees provided in the paper only apply to inputs that are weighted graphs that follow the MMSB models, and for small values of k (the number of communities).

Correctness: I don't see any flaws in the claims. Unfortunately, I didn't manage to go through all the proofs in the appendix.

Clarity: In general, yes. A few parts could be improved. I explain in the comments below.

Relation to Prior Work: Yes, and in fact they explain well what differentiates the paper compared to the previous contributions.

Reproducibility: Yes

Additional Feedback: In Theorem 3.1, is it possible to state bounds where you do not hide O(1/poly k) factors inside \epsilon? In particular, is it easy to state the bound only with respect to k and \kappa? Could you provide a reference for the statement on line 119: "It is known that having pure nodes for each community is both necessary and sufficient for identifiability of MMSB" How do you perform the matrix-multiplication at step 5 of algorithm 2 in O(n^2) time? Could you elaborate more on how mild is the assumption that c_min / c_max > 1/2 in Theorem 3.4 is?

Review 2

Summary and Contributions: The proposed work studied the problem of obtaining node-community distribution matrix for Mixed Membership Stochastic Blockmodel (MMSB) to detect potentially overlapping network communities, without assuming the existence of pure nodes. The work proposed an SP+LP algorithm as the recovery procedure, which was supported with successful evaluation on both synthetic datasets and a real application. ====================== I have read the authors' response. I choose to keep my score.

Strengths: 1. The work proposed a theoretically interesting problem to study, i.e., how to detect potentially overlapping communities without assuming the existence of pure nodes for MMSB. 2. The work designed an algorithm to recover the node-community distribution matrix with theoretical support under certain assumptions. 3. The work demonstrated a successful application with the proposed model and algorithm.

Weaknesses: 1. Different community detection models have been proposed in literature other than MMSB. In practical applications, it is often preferred to see a comparison of effectiveness among these different models. As an example, Newman's modularity model and its subsequent extension on overlapping communities have been long used to detect network communities. Accordingly, could the modularity model be applied on the computational biology problem studied in this paper? 2. The claim of the LP's complexity (Line 134-135) in the proposed algorithm seems problematic to me. The "LP in linear time" claim made in Megiddo [1984] assumed that the variable dimension is fixed. While for the proposed SP+LP algorithm, the dimension is $n$, which seems inappropriate to be treated as a constant number. Let me know if there is misunderstanding here.

Correctness: The work seems technical sound to me, except the analysis of the proposed algorithm's time complexity (Ref. point 2 in "Weaknesses" section).

Clarity: Good.

Relation to Prior Work: Good.

Reproducibility: Yes

Additional Feedback: For questions to authors, please ref. "Weaknesses" section.

Review 3

Summary and Contributions: The paper proposes a new provable overlapping community detection algorithm based on the idea of MMSB, but without the assumption of pure nodes. The algorithm starts with a successive projection which selects almost pure node. Then, using that set of nodes the final estimation is generated through linear programming. The paper also proves the theoretical guarantees of the algorithm.

Strengths: -The idea is simple and it can find communities in the real datasets applied in the paper. -The major contribution is the theoretical guarantees of this algorithm, showing the converge to a correct solution.

Weaknesses: -One of the main problems is the poor performance of the algorithm against state of the art methods. Even though authors try to justify it, saying that ground truth in biology networks as not been achieved, then look for another type of network where the ground truth is known. -The algorithm is quite expensive, has the paper shows the time complexity is O(n^2), which is impossible to apply to medium-size networks. This is corroborated by the real experiment where the largest network has only 5000 nodes (the synthetic network). -The experiment section must be largely improved. Most results are not even discussed. Please, reduce the size of the graph (fit three plots in a row) and the setup, use this space to discuss real results. Also, add SVI, Bayesian variant of SNMF, OCCAM, and SAAC to the results. Even though these algorithms were beaten by GeoNMF, algorithms behave differently on datasets.

Correctness: Yes, they seem correct.

Clarity: -The main idea of the paper is clear, but algorithms not. Algorithms are not explained, not even cited (algorithm 1). You could save some space from the introduction and Problem formulation and expand this part.

Relation to Prior Work: Yes, it is clearly discussed.

Reproducibility: Yes

Additional Feedback: Please, read the following papers: An Analysis of Overlapping Community Detection Algorithms in Social Networks Community-Affiliation Graph Model for Overlapping Network Community Detection Overlapping Community Detection by Local Community Expansion, algorithm with time complexity O(m), where m is the number of edges. Also, this is a website with multiple datasets https://snap.stanford.edu/data/#onlinecoms