__ Summary and Contributions__: The authors phrase the problem of Multilevel Budgeted Combinatorial prob-lems as Alternating Markov Games. A simple Multilevel Q-Learning algorithmis extended to a Multilevel Curriculum learning algorithm that leverages onthe combinatorial setting with a-priori available sequence-lengths, by iterativelygenerating harder training sets. They are used to train a graph neural network-based agent, using copies of early training stages as experts for later training.

__ Strengths__: The authors present a novel and non-trivial approach for multi-level budgeted combinatorial problems. The writeup is clear and the experimental results are promising. The topic is not mainstream for the NeurIPS community but still relevant enough as the connections with discrete optimization are on the rise.

__ Weaknesses__: The addressed problem does not have established benchmarks so it is hard to evaluate the real strength of the method.
The main component -- the curriculum learning -- is strongly inspired by Bengio et al, which puts the novelty into question to some extent.
The performance on directed and weighted graphs (Table 2) is not compared to DAAD. Is there a reason for it?
The overall architecture is somewhat complex (given that it is applied on a single task). Are all components necessary? Maybe some ablations would be helpful.

__ Correctness__: The claims seem correct to me. Experimental methodology makes sense to me.

__ Clarity__: Yes, the write-up is very clear.

__ Relation to Prior Work__: Yes, it is sufficiently discussed.
One detail on line 109. The MuZero, AlphaZero extensions of AlphaGo are not committed to a single game.

__ Reproducibility__: Yes

__ Additional Feedback__: line 104: What if player 2 always does nothing? Maybe one could compare to a single-player multilevel task as well? Not sure if there is an application or baseline for that, also not really the point of the paper
line 112: typo
line 139: For clarification: emphasize how budgets at different timesteps look like e.g. $(0, 0, 0, k, b_{l+1}^t, ..., b_L^t)$
\item equation 8: Again for clarification instead of $... \circ N_{1,\pi^*}^{b_1^\text{max}}$ use $... \circ N_{1,\pi^*}^{b_1^\text{max}-1} \circ N_{1,\pi^*}^{b_1^\text{max}}$
Algorithm 1: although it is quite obvious maybe add that on evaluation the $y_k's$ are sampled from $D_\text{val}^j$
line 208 appendix: typo
=== POST REBUTTAL ===
This is just to indicate that I read the rebuttal as well as the other reviews. My evaluation stands.

__ Summary and Contributions__: UPDATE AFTER AUTHOR RESPONSE:
I thank the authors for the response. Based on your clarifications I have increased my score.
==================================
The paper tackles the problem of multi-level combinatorial optimization, formulated as a two-agent, zero-sum, turn-based game. The authors formulate this as a reinforcement learning problem and propose an algorithm that uses curriculum learning to solve it. The proposed solution seems to bring some benefits in terms of the data and number of steps needed to solve the problem for relatively small graphs (<100 nodes), but does not scale beyond this. Regardless, it provides a good first step that could inspire more future curriculum work in the area.

__ Strengths__: 1. Well written introduction, with compelling arguments for the importance of the problem, with applications to security, flood control, matrix decomposition, etc.
2. The idea of using curriculum learning in this setting makes sense, and it is a good idea to bring it up to the combinatorial optimization and game theory community.
3. The results look promising, at least for small graphs.

__ Weaknesses__: 1. Related work on curriculum learning is missing.
2. I found section 4.4. that discusses the proposed curriculum algorithm pretty difficult to read (see more details below).
3. The current version of the curriculum algorithm seems to have some scalability limitations.

__ Correctness__: As far as I can tell, the experiments seem to be conducted correctly, and the authors provide plenty of details in the appendix that should help with reproducing them.

__ Clarity__: Up to section 4.4. I found it very easy to follow. The introduction is particularly well written and motivated. Moreover, the use of English is flawless.
In section 4.4. things become more complicated to follow. I’m not sure I understood why in equation 7 B_t = (0, .., 0, k, b_{l+1}, .., b_L). In general, eq. 7 could use more explanation, as well as algorithm 1.

__ Relation to Prior Work__: The discussed work is clearly explained, but I cannot vouch for completeness on the combinatorial optimization literature side, as I am not very familiar. However, the curriculum learning literature is not discussed at all in the intro or related work section, and I think it should be. There is a brief comparison with the work of Bengio et. al in section 4.4., but this paper is more than 10 years old and dozens of other methods have been proposed since then, especially a lot of approaches for reinforcement learning.

__ Reproducibility__: Yes

__ Additional Feedback__: Questions:
1. How do you set T_val in Algo 1?
2. Given the constraints added in the problem description (e.g., no incentive to pass, perfect information, etc), I am wondering what are some real-world problems that fit this constrained problem formulation?
Suggestions:
1. Add a comparison with curriculum learning literature. Is the curriculum side of the proposed method novel in any way? Could it provide insights for other problems in combinatorial optimization?
2. Introduce section 5 earlier, and use it as a running example in Section 4, it would help a lot in understanding it more intuitively.
Corrections:
I noticed something funny with the bib file. There is a citation in the main text “Dai et al. [30]” but if you look at the list of authors of paper [30], it seems that Dai is the second author. I’m wondering if that’s an issue with your bib file or it was intentional.

__ Summary and Contributions__: This paper proposes a deep reinforcement learning approach to solve multi-level budgeted combinatorial problems. To this end, the authors propose to formulate a multi-level budgeted combinatorial problem as an alternating Markov game. While the proposed alternating Markov game can be solved with regular deep reinforcement learning, the authors propose an additional curriculum learning scheme that boosts the training of agents for solving the multi-level budgeted combinatorial problem. Experiments on the multilevel critical node problem validate the performance of the proposed schemes.

__ Strengths__: - This paper newly proposes the multi-level budgeted combinatorial problem with deep reinforcement learning, which is a novel combination of task and methodology to my knowledge.
- Overall, the proposed methodology is very reasonable and sound.
- The proposed curriculum learning is delivers an interesting message (later stages of the MDP are easier to train).
- Empirical evaluation is complete with details.

__ Weaknesses__: - Experiments were done only on small-scale synthetic datasets. Hence it is unclear whether the proposed method is useful in real life.
- While the paper targets general multi-level budgeted combinatorial problems, experiments were conducted only on the three-level critical node problem. (Is it also useful for higher-levels?)
- Table 2. hints that the proposed approach does not perform well for large-scale instances, e.g., when |V|=100. I think this should be investigated further, e.g., trying the experiment for |V|=200.
- The proposed curriculum learning strategy delivers a similar message made by Florensa et al. 2017, where a robot is trained in reverse, gradually learning to reach the goal from a set of start states increasingly far from the goal. It would be useful if the authors could discuss this work.
Florensa et al. 2017, Reverse Curriculum Generation for Reinforcement Learning.

__ Correctness__: The proposed methodology seems reasonable and the propositions are correct to my knowledge.

__ Clarity__: This paper was very clear and easy to read.

__ Relation to Prior Work__: The positioning of the paper can be improved by making reference to existing works that use curriculum learning for reinforcement learning, e.g., [Florensa et al. 2017].

__ Reproducibility__: Yes

__ Additional Feedback__: - Overall, this paper could be improved by showing its significance in practical and realistic scenarios.
- I think Section E.4 (in the appendix) is worth mentioning in the main text since it addresses an important case where the solver is evaluated on distribution unseen during training.
- To my knowledge, rollout using only the V function is more costly than Q function. It requires the value network to evaluate every possible future state (which has a higher dimension than the action). I am concerned that this approach might not be scalable to large graphs. It would be useful if the authors could elaborate more on this aspect.
- It seems that the original MCN paper [Baggio et al. 2018] also considered a tree-structured graph in the experiments. It would be useful if the authors also experiment in this case.
============
I appreciate the author's thoughtful and detailed rebuttal. I have changed my score since the rebuttal persuaded me on (a) the significance of tackling a new combinatorial optimization problem with deep RL and (b) novelty of the proposed curriculum learning compared to prior works.
I am still concerned that this paper does not provide enough empirical evidence for the proposed algorithm to be useful in practice. However, this seems fine given that the proposed problem itself is in an early stage of development.

__ Summary and Contributions__: This paper proposed a curriculum-based framework to solve a multilevel combinatorial optimization problem with a budget. The authors formulated the problem as multi-agent reinforcement learning(RL) and proposed RL-based algorithms. They conducted experiments on the Multilevel Critical Node problem (MCN) and compared their results with exact and heuristic-based solvers.

__ Strengths__: This paper addresses a problem which is an interesting use case of RL. The paper is overall well-written and the methodology is well-explained. The proposed method sounds novel and introduces a multi-agent RL framework into combinatorial optimization problems. Thus this paper brings an interesting perspective relevant to the NeurIPS community.

__ Weaknesses__: The significance of the results compared to baselines needs to be explained and justified. Some details are missing which are necessary to understand the significance of the proposed framework. Please refer to the details comment section.

__ Correctness__: Claims and methodology seem correct. However, I have a concern about results comparison, time(s) metric. I am expecting an explanation from the authors. Please refer to the details comment section.

__ Clarity__: Overall, the paper is well written and easy to follow. There are a few points which need further clarification. They are in the comment section.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: While I do think this is a good contribution, I have the following concerns which, if resolved, will help me to champion this paper.
Table 1: The loss values are on different scales thus they are not comparable. Any reward-based metric (average reward) might mitigate this issue.
Table 2: DA-AD results are from the reference paper [1]. A comparison of computing resource configuration is needed for the time comparison. The reference paper [1] uses a machine configuration “Intel Xeon E5-2637 processor clocked at 3.50GHz and 8 GB RAM, using a single core”. While this paper uses a different machine “one gpu of a cluster of NVidia V100SXM2 with 16G of memory”. An explanation is necessary to understand the performance improvement in the proposed method. How much improvement because of computing resources and how much due to the proposed technique.
Results in Table 2: The difference between DA-AD and cur in optimality gaps, and approximation ratios seem small in number. How significant are these differences? Any statistical measure or empirical justification would strengthen the proposed improvement.
Does the training data consist of exactly solved instances?
What dataset is used for the results in Table 2? Is it D^1 or D^2? What is the train test split in the results in Table 2?
It seems the testing inference time is relatively quick (few seconds) for up to 100 sized graphs. It would be interesting to see how they transfer to the larger graphs, for example, graphs with #node =1000, 5000, etc. Larger graphs are common in many real-world networks (e.g., social media). The bottleneck seems to be the training time (e.g., MultiL-Cur) as it is not scaled well with graph size.
Algorithm 1:
Line 3: what is the rationale to use 2? Will the loop run at most 2 times? What is the value range of B_max?
What is “number of new updates” and T_val? How are they calculated?
Comments can be added in the algorithm for such clarification.
Section 5: Figure 1 is not clear to me, how the graph is constructed in different steps, and also how the values of \Omega, \phi are calculated? A description of them would be easier to follow in Figure 1.
How does the proposed framework help to generate heuristics (line 302).
Training RL-algorithms take time and training data. An elaborate discussion of how RL can be useful for practical use would be helpful for readers especially when limited or no training data available.