__ Summary and Contributions__: This paper focuses on bandit approaches to decoding LDPC codes, building up the emerging area of using machine learning techniques in communications, where training data is "infinite" and performance metrics are very clear. The idea is to learn a good sequential scheduling policy for the order in which different check nodes are operated on (and presumably not to get stuck in combinatorial structures such as stopping sets), to improve decoding performance. This is in contrast to other work in this broad area that focuses on finding coding schemes, rather than improving decoding algorithms. Comparisons to other decoders shows significant improvements.
It seems the authors have taken all the reviewer comments to heart and will have an improved paper.

__ Strengths__: * ML for communications is a nice area of exploration and such investigations push the scope of neurIPS to settings where performance criteria are very clear
* Performance improvement over BP is compelling and practically could be important
* Broader Impacts section well written and broader impacts themselves are quite positive

__ Weaknesses__: * Although the general ML for communications area is relevant for neurips, this particular paper seems too specific to be of general interest, and may fit better in a venue like IEEE Trans. Commun.
* There is some confusion between multi-armed bandit and reinforcement learning in this work
* No explanation/intuition of why this dynamic scheduling helps is given: some greater insight from [3] would have set the stage better for this work

__ Correctness__: I went over the paper fairly closely and it looks generally correct. Empirical evaluation is quite standard in the communications field.

__ Clarity__: As noted above, some initial intuition as to why scheduling helps (from [3]) would help the reader understand all that is to follow.

__ Relation to Prior Work__: A greater discussion of how this work differs from other ML-for-comms work would help the general reader understand that this paper is focused on the internal operation of the decoding algorithm rather than other work on designing feedback coding schemes (deepcode), etc.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes a novel decoding method for LDPC codes based on reinforcement learning. The proposed method determines check node scheduling based on Q-learning. Some numerical experiments shown in Section 5 indicates that the proposed method provides faster convergence compared with the conventional ones.

__ Strengths__: As far as I know (and as the authors claim), this paper is the first work to employ reinforcement learning to improve decoding performance of BP decoding. I thus think that the paper has enough novelty. The main idea appears quite natural and reasonable to me. The numerical results shown in Section 5 supports the effectiveness of the idea.

__ Weaknesses__: The reward function (2) without detailed explanation (I guess "informative messages" are more rewarded). This part (i.e., justification of the reward function) may be important to understand the proposed scheme. In Fig 2, I would like to see high SNR (waterfall) performance as well. Compared with the naive BP (BP), the decoder using the proposed scheme requires to memorize Q-table to select a CN. And, time complexity of selecting a CN (line 12, Algorithm 1) can be explicitly described in the text. Such overheads (compared with a naive flooding BP) should be written for fair comparison.
========== After Authors' Rebuttal ==========
I read the authors' response and fully understand their explanation.

__ Correctness__: The empirical methodology used in the paper seems correct to me.

__ Clarity__: I think the paper is well written.

__ Relation to Prior Work__: I think the description in the paper is enough.

__ Reproducibility__: Yes

__ Additional Feedback__: It would be nice to see "interpretation of the learned schedules". Some good CN selection rules have been obtained in the experiments. If the learned schedule follows a certain simple rule, I would like to know it, and such an observation may lead to another optimal scheduling without a learning process.

__ Summary and Contributions__: This paper aims at improving LDPC node-wise scheduling decoding performance, by forming the node selection problem as a multi-arm bandit (MAB) problem, which can be solved by Q-learning. This paper also uses clustered Q-learning and cluster connecting sets, to reduce the unavoidable complexity coming from long block length nature of channel coding. The proposed method shows better BER performance and less complexity comparing to a few solid existing benchmarks.

__ Strengths__: LDPC code is an indispensable building block for LTE/5G communication systems, a more efficient and accurate decoding algorithm is impactful for current communication systems. Node-wise scheduling (NS) is known to improve decoding efficiency, yet incurs more complexity. Using Q-learning Table the computation complexity improves, which makes the NS-based method become viable.
The long block length nature of LDPC code, makes the number of state exponential. The author uses clustered based method to reduce the number of potential state. Using the structure of LDPC code, the cluster-connected set (CCS) is bounded, and can be further optimized via cycle maximization.
The proposed method shows better BER performance and less complexity, on not-very-long LDPC code.

__ Weaknesses__: However, the long block length nature of LDPC code incurs a very special clustering and optimization method to migigate the large state space problem. The proposed method is very specific to channel coding, which doesn't have enough interest to the general audience.
Moreover, for very large block length (e.g. LDPC code can be longer than 1000, due to block length gain of channel coding), the proposed algorithm could still suffer from large state space. Some further complexity reduction methods might be needed.
Furthermore, the Q-learning with quantization is a good first try, yet there exists quite a few RL algorithm which can better support this MAB for LDPC NS decoding problem.

__ Correctness__: Yes.

__ Clarity__: Yes

__ Relation to Prior Work__: Yes.

__ Reproducibility__: No

__ Additional Feedback__:

__ Summary and Contributions__: The paper introduce a method for decoding short block codes. The method utilize a sequential update policy for the check nodes by model it as bandits problem. Furthermore, they use clustering approach to reduce the learning complexity.

__ Strengths__: The ideas in the paper to learn the scheduling of the check node is very interesting. Moreover, the learning of the scheduling updates is novel.

__ Weaknesses__: I have a major concern about the results:
1. Please use the standard metrics to evaluate the model, i.e. please use BER/SNR figure. It is very hard to judge what is the improvement with the proposed method with BEP/SNR.
2. Please provide more results for other short block codes (for example - BCH), and compare it to other SOTA results as in previous papers:
"I. Be’ery, N. Raviv, T. Raviv, and Y. Be’ery. Active deep decoding of linear codes. IEEE Transactions on Communications"
"Nachmani, Eliya, and Lior Wolf. "Hyper-graph-network decoders for block codes." Advances in Neural Information Processing Systems. 2019."

__ Correctness__: seems correct

__ Clarity__: yes

__ Relation to Prior Work__: Yes, the paper show what is the difference between the proposed method to other learnable decoders.

__ Reproducibility__: Yes

__ Additional Feedback__: