Summary: The paper proposes to use backward planning for credit assignment in RL. Given an observed state, its Q-value is propagated back to all potential past states from which the observed state could have resulted. A backward probability distribution over past states that end up in a given state is used for this purpose. The paper also compares this backward strategy to classical forward planning, and experiments on toy examples demonstrate the advantages of the method.

Pros:
- The idea of backward planning for credit assignment is interesting
- Rigorous derivation of the algorithm
- Empirical evaluation shows the advantage of the method

Cons:
- Toy experiments
- Lack of examples to explain the method

Discussion and decision: The reviewers agree that this is a good paper, and the idea is intriguing. One issue raised by the reviewers is the toy-like nature of the problems considered in the experiments: the MDPs have very small state spaces. The proposed algorithm needs to be tested on more serious problems and more realistic systems, similar to those considered in state-of-the-art work on RL, especially since the proposed approach comes with no theoretical guarantees.
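
To make the reviewed idea concrete, here is a minimal sketch of backward credit assignment as the summary describes it: a newly observed state's value is spread to candidate predecessor state-action pairs in proportion to a backward distribution obtained from the forward model via Bayes' rule. All names, the uniform prior, and the tabular setup are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

# Illustrative toy MDP; sizes and names are assumptions, not from the paper.
n_states, n_actions = 5, 2
rng = np.random.default_rng(0)

# Forward transition model P[s, a, s'] and a reward at the last state.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = np.zeros(n_states)
R[-1] = 1.0

# Backward distribution via Bayes' rule under a uniform prior over (s, a):
# B[s, a, s'] ∝ P[s, a, s'] * prior(s, a), normalized for each successor s'.
prior = np.full((n_states, n_actions), 1.0 / (n_states * n_actions))
B = P * prior[:, :, None]
B = B / B.sum(axis=(0, 1), keepdims=True)

gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

def backward_update(Q, s_next, v_next):
    """On observing s_next with estimated value v_next, propagate credit
    to every predecessor (s, a), weighted by the backward probability."""
    for s in range(n_states):
        for a in range(n_actions):
            w = B[s, a, s_next]
            target = R[s_next] + gamma * v_next
            Q[s, a] += alpha * w * (target - Q[s, a])
    return Q

# Observing the rewarding terminal state updates all likely predecessors.
Q = backward_update(Q, n_states - 1, 0.0)
```

This contrasts with forward planning, which rolls the model out from the current state; here a single observation updates many predecessors at once, weighted by how likely each was to have led to it.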