Reviews: Structure Learning with Side Information: Sample Complexity

Major comments: Throughout the paper k is used to refer to the maximum number of edges in a graph, but I'm unclear if that means that the proofs hold for some k = max(k_1,k_2) where k_1 is the number of edges in G_1, etc, and k_1 and k_2 are allowed to be different. If it doesn't allow each graph to have a different value for k this should be made clear. If it does allow that, then it's unclear what ranges of k the proofs hold for (presumably min(k_1,k_2) is lower bounded by \gamma k). Allowing each graph to have a different k implies that there could be different recovery rates for each graph, but the error metric is over the full joint space (rather than the subgraph or independently in the two graphs). Is it possible to make statements about the error metric just in the shared subgraph? As this is the (unknown) region where there is extra information about the graph structure then it would make sense that it's simpler to recover the shared structure, but I don't think any of the supplied results make specific statements about the joint subgraph recovery. Figure 2 is unclear. I believe that the solid line is the ML decoder's sample complexity and the dashed line is the theoretical bound, but the caption and surrounding text could do with clarification. If that's incorrect then I think it could do with more extensive clarification. The experimental evaluations use random graphs with a single specific structure pattern. It would be useful to see more random structures to show how it holds across a wider range of more realistic graphs. Minor comments: The description for figure three says that when the error metric punishes smaller deviations then it's harder to score well on the error metric, which seems tautological. As the error measure changes as d changes this could do with a little more explanation, as otherwise it might lead the reader to think that increasing d to relax the computation allowed the ML decoder to better recover the exact graph (with d = 0). There are several small notational issues, e.g. definition 2 spells out gamma instead of using \gamma in the last line, eq 31 is missing a close paren. The second appendix doesn't use the NeurIPS style file, and isn't folded into the first appendix. Is this intended to be part of the submission?

Reviewer 2

While the paper is interesting and useful, the authors should also compare this paper with recent work at ICASSP that superficially looks similar: https://ieeexplore.ieee.org/document/8682199 I am unaware if the authors happen to be the same, but regardless, the two papers seem very very closely related. If applicable, the authors should update their submission (and their response) with any other material (the above, as well as publications other than the ICASSP paper as well) that may be closely related to their work. I will update my review after author feedback on this point.

Reviewer 3

There are, however, several deficiencies that should be addressed in order for the work to be publishable in a venue like NeurIPS. -- The analysis of the achievability part is based on analyzing (9), which is the ML decoder searched over all pairs of graphs that are \eta-similar. The ML analysis is rather straightforward, involving merely large deviation upper bounds, straightforward counting of graphs, together with those for bounding the KL divergence already present in Santhanam and Wainwright (2012). Hence, it’s not clear what the novelties in the analysis here are. -- The bounds for the converse part are useful but the techniques are based on specializing Fano and using techniques not dissimilar to those in Scarlett and Cevher (2018). -- Even if there are technical novelties in the analyses above, the bounds as shown in Table 1 are not very tight; the sufficient condition(s) seems far from the necessary condition(s). Further, there’s no discussion of this gap. Next, they do not seem to depend on d. -- One thing the reviewer fails to understand is how the authors obtained the plots in Figure 2. Usually such plots are obtained by applying tractable algorithms. However, ML is not tractable even for p = 100. Can ML be implemented in "real-life"? Do these plots show any phase transition in k, p, and \eta? -- Some typesetting is very careless. “gammak” is present in Definition 2. Some compilation Latex errors in the supplementary material. Generally, this submission looks “rushed”. Also, in the supplementary, the reviewer wonders why the fonts for the proof of the achievability and converse parts are different. -- One philosophical comment. Upon reading the title, the reviewer was hoping for recovery guarantees for a *single* graph given side information. However, while what the authors have done here is not incorrect, it is misleading given the title. They recover *both* graphs. Naturally, one would wonder whether one can do better (in terms of fundamental limits) if one seeks to only learn G_1 when samples from G_1 and G_2 are given (and G_1 and G_2 are similar in some sense). I would be thinking this is the true essence of graphical model selection with side information

Paper ID:	8139
Title:	Structure Learning with Side Information: Sample Complexity

Reviewer 1

Reviewer 2

Reviewer 3