__ Summary and Contributions__: This paper considers the problem of learning to do combinatorial optimization on graphs. In particular, it focuses on a set of constrained minimization problems: given an objective function and a constraint, find a set of nodes minimizing the objective function subject to the constraint. This is a broad family that includes many NP-hard problems. The goal is to train a neural network such that, when given a new instance of one of these problems, it can efficiently compute a solution satisfying the constraints whose cost is "close" to the optimal cost.
This work proposes a novel framework for unsupervised combinatorial optimization on graphs, which is inspired by Erdos's probabilistic proof technique and makes it more likely that the outputs produce a valid solution when compared to previous approaches. Instead of using a differentiable proxy objective as a replacement for the discrete cost, the authors propose a probabilistic method: minimize an upper bound on the expected value (or similar statistic) of the discrete cost. They present two ways to construct this upper bound in the presence of different types of constraints, and prove that the bound ensures that valid low-cost labels can be sampled with high probability. They further describe how, given such a model, it is possible to deterministically extract a labeling with low cost by greedily making choices that do not increase the expected cost.
The authors evaluate their approach on two NP-hard tasks: maximum clique and constrained minimum cut. For maximum clique, they show that their method outperforms baseline proxy objective methods, although it does not always beat all learned baselines and generally does worse than special-purpose heuristic algorithms. For constrained minimum cut, their method finds better solutions than heuristics and proxy objective baselines. In all cases, their method generates outputs that satisfy the desired constraints, in contrast to other learned approaches.

__ Strengths__: This approach seems like an exciting and novel approach to solving combinatorial optimization problems. Instead of using a smoothed proxy objective that is not guaranteed to correspond to the original problem, the objective proposed here is directly tied to the quality of the solutions that can be extracted. This ensures that, as long as the model has low enough loss, the resulting outputs obey the constraints and are likely to be low-cost solutions. Another notable property of this approach is that, even though the objective is based on a probabilistic distribution on labelings, it is possible to deterministically extract a solution that obeys the constraints instead of having to repeatedly draw samples.
The framework described in section 3 is general enough to apply to a wide variety of problems, and includes theoretical results that describe the conditions under which good solutions are guaranteed. It is unclear exactly how difficult it is in practice to design loss functions that satisfy the requirements of the framework. Even so, this seems like something that could inspire additional research and provide a starting point for solving additional combinatorial optimization problems, or designing other types of upper bound based on a similar probabilistic method.
The experiments are thorough, and compare to a wide set of baselines that cover both state-of-the-art learned approaches and heuristic algorithms. Although their model and objective function is not always superior to the other methods, the authors give a good characterization of the strengths and weaknesses of their approach compared to the others.

__ Weaknesses__: Section 3 seems to assume familiarity with a few things that not all readers may know. In particular, in section 3.1.1 the authors use "the probabilistic method" to justify the existence of a label set of a particular small cost. For readers who may be unfamiliar with Erdos's method, this is a bit opaque. It would be nice to have a concrete example here, or a more intuitive explanation of what the method is and how it works. As another case, the "method of conditional expectation" is explained only very briefly, and could benefit from a more intuitive explanation or a worked-out example.
There are a few minor errors in the presentation of the method, which I have noted in the "correctness" section below.

__ Correctness__: I believe the statement of Theorem 1 as written is incorrect. Here is a counterexample: let f(S;G)=0, beta=1, and sigma be the empty set. It is obvious that the probability of sampling a valid label set is zero, since there are no valid label sets. But Theorem 1 states that, with probability at least t (for any t), the sampled label set is a valid label set.
The problem seems to be that there is a missing condition on the relationship between the loss and beta (and possibly also t). The proof relies on the loss being strictly less than beta, but this is not required in the actual statement of the theorem, which only constraints the function f to be less than beta. This is obscured by the change in notation between the proof, which uses epsilon and constrains it to be less than beta, and the theorem, which uses t and thus doesn't mention anything about epsilon.
Another error: in section 3.2, I believe the comparison has been flipped. To generate an example with a small cost, it would be better to make the choice of v_i that leads to the lower expected cost, not the higher expected cost, right? So line 168 should have "<" instead of ">".
These issues seem fairly easy to fix, and I don't think either of them will cause any problems for the rest of the paper.

__ Clarity__: The paper is well written and was enjoyable to read. I though the discussion of prior work was clear, and the approach was well motivated.
My only concern is that the method may be difficult to understand for people who are unfamiliar with the probabilistic method or the method of conditional expectation; it would be good to provide a bit more explanation of these in the final paper.

__ Relation to Prior Work__: The paper does a good job of describing prior work in this space, and contextualizing their approach relative to this prior work.

__ Reproducibility__: Yes

__ Additional Feedback__: ==== Edit: after author response =====
I am glad my suggestions were helpful, and look forward to seeing the updated paper.

__ Summary and Contributions__: [Post-rebuttal] Great job again; a truly excellent paper!!
This paper establishes a connection between Erdos' Probabilistic Method and learning heuristics for graph optimization problems. The family of graph optimization targeted here involves constrained binary problems with node-based decision variables; many problems can be cast in this form, including Max Clique and Graph Partitioning, which are explored as case studies here.
The key idea is that a parametric mapping, here a Graph Neural Network (GNN), from instances of a graph optimization problem to node probabilities can be trained, without knowledge of optimal solutions, such that high-quality solutions are assigned high probabilities. This is in stark contrast to both ML-based optimization methods that require optimal solutions in the training phase, and reinforcement learning methods that typically have high sample complexity in combinatorial optimization problems. The main step involves constructing an appropriate (unsupervised) loss function that has the desired probabilistic property above. The authors demonstrate how that can be done for the two case studies in question, and similar analytical derivations can be obtained for other graph problems with a reasonable amount of work. Given a trained model, the probabilistic method provides a simple iterative procedure to "decode" a good solution from the predicted node probabilities.
Experimentally, the proposed Erdos GNN performs well compared to recent ML-based combinatorial heuristics, classical heuristics and two exact solvers (depending on the case study). Remarkably, Erdos GNN rarely decodes solutions that violate both problems' constraints.
Overall, this is an excellent paper that should be accepted.

__ Strengths__: - A novel, principled framework for learning heuristics for graph optimization problems: the paper establishes a deep connection between the probabilistic method and learning probabilistic heuristics for graph optimization. Unlike many recent papers in this space which are rather incremental in combining GNN with reinforcement learning in various ways, this paper proposes a fresh, fundamentally new perspective.
- Interest to NeurIPS: This paper is likely to lead to substantial interest from the NeurIPS and optimization communities, as it involves nice ideas from theoretical computer science, deep learning and discrete optimization.
- Impressive experimental results, showing that Erdos' GNN rarely produces infeasible solutions and often produces high-quality solutions for both Clique and Graph Partitioning.
- Solid experimental setup: Comparison against various (very recent) ML and non-ML based algorithms, including exact solvers.

__ Weaknesses__: - Further experiments: I would've liked to see more experimental results on additional graph problems. This could drive your point further by showing more examples of how the loss function can be constructed.
- Scalability bottleneck: GNN training can only scale up to some level. Thoughts on whether your method can be scaled up despite this limitation?
- Architecture choices/tuning: In lines 219-227, how did you select this particular architecture? B.2 in the appendix did not clarify this. There are many ingredients to your architecture but it is difficult to figure out what's necessary and what's not. Perhaps an ablation study or detailed results on the validation sets can help here.

__ Correctness__: - Lines 277-278: "Nevertheless, we should stress that both solvers scale poorly with the number of nodes and are not viable candidates for graphs with more than a few thousand nodes.": You may be referring to scalability w.r.t. finding feasible solutions, whereas Gurobi is an exact solver who's also working to certify optimality. There are ways to encourage MIP solvers like Gurobi to focus on finding feasible solutions, e.g. by setting MIPFocus = 1 in Gurobi: https://www.gurobi.com/documentation/9.0/refman/mipfocus.html.

__ Clarity__: The paper is beautifully written and largely flawless.

__ Relation to Prior Work__: Related work is discussed in sufficient depth and compared against if suitable.

__ Reproducibility__: Yes

__ Additional Feedback__: Nice work on the Broader impact section!

__ Summary and Contributions__: This paper presents a novel method to solve graph-based combinatorial optimisation with deep neural networks. The proposed approach adopts the probabilistic method and transforms the objective of classical combinatorial optimisation problems to continuous and differentiable surrogate loss. Then a deep neural net applied to optimise the probability density function on the vertices of the graph that minimises the surrogate loss function. Finally, the answers are selected by sampling from the learned probability distribution.

__ Strengths__: - The idea of using a probabilistic method to derive surrogate loss from classical CO problems and optimise it with deep neural nets is novel;
- The paper is well-written and easy to read. The proposed method is simple and straightforward, and achieves a good result on the tested tasks;
- Although the performance of the presented approach is still not comparable to classic approximation-based CO tools, it does outperform the other DNN-based methods significantly.

__ Weaknesses__: - The technique of converting constraints into loss function is ad-hoc, which needs to be designed specifically for the CO task;
- The algorithm will work when the probability distribution $\mathcal{D}$ is easy to find, and the surrogate loss is convex. When the derived loss function is non-convex, then there would still be an exponential search space for the initialisation of the neural net, which could even be larger than the original problem space (i.e., it may perform worse than solving the original problem directly with classic methods). I think the author should state in the paper about the limitation of the proposed technique.

__ Correctness__: The proposed approach is reasonable. I only roughly checked the derivations and proofs in the supplementary, but they appear to be correct.

__ Clarity__: This paper is well written and did a good job of clearly explaining the motivation and the derivations of the proposed method.

__ Relation to Prior Work__: This paper has covered a broad range of references, and clearly discussed the difference to previous approaches.

__ Reproducibility__: Yes

__ Additional Feedback__: - Does the time cost in Table 2 include the training time of $g_\theta$?
- Eq 4 and line 150: What is $a_i$? What is its relationship to $v_i$?
- Line 116 says $\mathcal{D}$ is defined "over sets", which is ambiguous because it could refer to the power set of $V$ or $V$ itself.
-----------------------------
In the rebuttal the authors answered my question about the generality of this approach, and clarified that in practical the neural nets are robust for minimising the loss. I think this is a very interesting work and I recommend for acceptance.

__ Summary and Contributions__: This work presents a unsupervised method for combinatorial optimization problems on graphs based on learning probabilistic distributions over sets (nodes) and using these distributions for reasoning about the probability that nodes belong to the solution set on the graph. The main contribution is in the theoretical development of this probabilistic method for identifying the probability of nodes in being members of such a set. Specifically, the probabilistic loss is problem-specific and must be developed for each problem such that the distributions learned by the GNN produce useful (constraint-obeying) distributions over nodes. Two examples of applying the method for finding this loss are included in the paper (min cut and max clique).
The method is evaluated on maximum clique identification and local partitioning (min cut) and compared against some neural approaches, algorithmic approaches, and solvers based on integer programing. Compared to neural approaches, the method matches or improves performance and does not exhibit any constraint violations.

__ Strengths__: The work tackles an important field, CO problems on graphs, which are generally computationally hard and specifically the ability to learn in an unsupervised setting avoids the problem of generating labels for these hard problems. Furthermore, the presented method ensures that the constraints are met, avoiding issues introduced by methods that make use of relaxations to facilitate learning. The main contribution is in the general framework for loss functions of such CO problems.

__ Weaknesses__: One major advantage of neural approaches is the promise of scalability/speed up. However, experiments on unseen large instances shows weak performance (both in speed and accuracy) where the Toenshoff-Greedy method outperforms all methods on both metrics. For graph partitioning, the smooth-relaxation GNNs (L1, L2 loss) methods perform comparably to Erdos GNN but an order of magnitude faster on these particular datasets considered.

__ Correctness__: Claims appear to be well supported by the proofs provided in the appendices and the empirical analysis is clear. One especially important aspect of the empirical study is the inclusion of highly optimized solvers that serve as very strong baselines.

__ Clarity__: The paper is well written and the development of the ideas are laid out nicely and well founded in the motivation

__ Relation to Prior Work__: Structured neural approaches to combinatorial optimization problems are well cited and the relationship between them and the proposed work is discussed clearly.

__ Reproducibility__: Yes

__ Additional Feedback__: Thank you to the authors for their rebuttal that provides answers to the questions I raised in my original review about greedy methods and RUN-CSP.