Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
The paper proposes a language for expressing task specifications for reinforcement learning (RL). The language is based on linear temporal logic and is compiled into a finite-state automaton (FSA). An augmented MDP that contains the FSA's states as part of its state is then constructed and equipped with a shaped reward function that rewards the agent for being close to completing subtasks without violating constraints.

The paper is clearly written. It is very heavy on notation, but this is probably unavoidable in such a paper. All notation is clearly explained, and with a bit of effort the method can be understood. Empirical evaluation is limited, but it does show big improvements over carefully chosen baselines. As a non-expert in the topic of task specifications for RL, the reviewer can't 100% guarantee that the method is novel, but the related work section seems rather compelling. In particular, it seems that the construction of the augmented MDP and the reward shaping indeed make the paper sufficiently novel for publication.

A question: the augmented MDP effectively introduces a sort of memory. Could the same not be achieved by using something like an LSTM? Another question: it is not very clear from the paper what the SPECTRL system is, and it probably should not be mentioned in the bullet list of contributions in the introduction.

UPD: I have read the rebuttal and the other reviews, and I still think the paper is good. I would highly recommend that the authors release a reference implementation.
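The construction summarized above (spec compiled to an FSA, agent run on the product of the environment and the automaton, with a dense shaped reward) can be sketched roughly as follows. All names here (`Monitor`, `AugmentedMDP`, the weighting constants) are hypothetical illustrations, not the paper's actual definitions:

```python
# Rough sketch of the augmented-MDP construction the review describes.
# `Monitor` stands in for the FSA compiled from the specification; the
# shaping constants and structure are illustrative assumptions only.

class Monitor:
    """Toy FSA: advance to the next monitor state once the current
    subtask's predicate holds; `progress` measures distance to the
    next subtask (used as the dense shaping term)."""
    def __init__(self, subtask_predicates, progress_fns):
        self.predicates = subtask_predicates
        self.progress_fns = progress_fns

    def step(self, q, s):
        if q < len(self.predicates) and self.predicates[q](s):
            return q + 1
        return q

    def progress(self, q, s):
        if q >= len(self.progress_fns):
            return 0.0  # all subtasks completed
        return self.progress_fns[q](s)


class AugmentedMDP:
    """Product of an environment and a task monitor; the augmented
    state is (env_state, monitor_state), and the shaped reward pays a
    bonus for completing a subtask plus a dense closeness term."""
    def __init__(self, env_step, env_reset, monitor):
        self.env_step, self.env_reset, self.monitor = env_step, env_reset, monitor

    def reset(self):
        self.s, self.q = self.env_reset(), 0
        return (self.s, self.q)

    def step(self, a):
        s2 = self.env_step(self.s, a)
        q2 = self.monitor.step(self.q, s2)
        # +1 for each subtask completed this step, minus a small
        # penalty proportional to distance from the next subtask
        reward = (q2 - self.q) * 1.0 - 0.01 * self.monitor.progress(q2, s2)
        self.s, self.q = s2, q2
        return (s2, q2), reward


# Example: 1-D chain where the single subtask is "reach s >= 3"
mdp = AugmentedMDP(lambda s, a: s + a, lambda: 0,
                   Monitor([lambda s: s >= 3], [lambda s: max(0.0, 3 - s)]))
mdp.reset()  # augmented state (0, 0)
```

The monitor state acting as extra memory is exactly what prompts the LSTM question above: both mechanisms let the policy condition on history, but the monitor's memory is structured by the specification rather than learned.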
Significance: I'm not sure Spectrl enables RL to tackle new problems it couldn't before, given that the experiments are on relatively simple tasks and don't include a few baselines I'd like to see (see the "Improvements" section of my review). The approach also seems fairly specific to a certain class of problems (ones that can be specified in the particular formal language). That's fine as long as it results in significant improvements on that class of problems or enables tackling new problems entirely; I wasn't sure that was the case given the existing experiments, at least enough to warrant its use over simpler RL methods that don't require the proposed framework and its added implementation overhead / reward shaping.

Quality: I'd have liked to see more complex/challenging experiments that validate that Spectrl enables something new to be done, or else that it's widely applicable or applicable to harder tasks than grid worlds. Prior work, e.g. on TLTL, shows experiments with an actual robotic arm, for instance. I'm also not sure how to interpret the existing experiments, given there aren't error bars and a few baselines I'd find helpful (see the "Improvements" section of my review).

Clarity: Overall, the writing was clear. The method itself took a while to explain (and a lot of math), so it may help to shorten that. The conclusion is missing, though, and the explanation of the final experiment (cart pole) and its results is cut short (just one paragraph).

Originality: Spectrl seems somewhat close to TLTL, except for the task monitor, which can help handle non-Markovian specifications. However, I would have liked to see more experiments (and on more complex environments) that highlight that Spectrl can handle non-Markovian specifications specifically.
I'm not too familiar with TLTL, but it sounds like it would take only a small modification to enable TLTL to handle non-Markovian cases as well (i.e., adding some extra history to the state); if I'm wrong, please let me know in the rebuttal.
The specification language seems similar to past work, being a restricted form of temporal logic. The atomic predicates come in two flavours: achieving a certain state ("eventually") or avoiding certain states ("always"). Various compositions of these atomic predicates can be used (A then B, A or B, etc.). The paper's proposed finite state machine "task monitor" bears resemblance to the FSM "reward machines" proposed by Icarte et al., which was not cited/discussed, so I will be quite interested in how the authors clarify its differences from Reward Machines. It appears that the main contribution is the use of reward shaping and the theoretical analysis of shaped-reward consistency.

On the quantitative semantics: an example using the L_infinity distance metric was given to quantify the degree to which the agent achieves a subtask. I am wondering how this type of reward shaping affects the SPECTRL agent when there are, say, physical barriers in the environment, which could lead to local minima if the agent relies on the L_infinity metric for a reaching subtask. The experiment in the cartpole OpenAI Gym environment appears incomplete -- is it possible to compare SPECTRL to the other baselines here, as in the other environment?

I found the paper fairly difficult to follow due to the large amount of notation introduced. It may be helpful to include:
- A high-level diagram of the conversion from the task specification language -> task monitor -> augmented MDP.
- A concise algorithm box for SPECTRL. I am aware that there is a detailed algorithm description in the appendix, but a brief, minimal summary of the SPECTRL framework in this form would make it quite clear to the reader.

For the plots, I suggest the more standard plotting of the average over the 5 runs (as done), along with a shaded area showing the standard deviation over the seeds, so that we can understand the stability of SPECTRL.
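The local-minimum concern raised above can be made concrete with a small sketch of quantitative semantics for a "reach" predicate. The function names and the tolerance parameter here are hypothetical, and the paper's actual definitions may differ:

```python
# Illustrative quantitative semantics for a "reach g" predicate using
# the L-infinity metric. Names (`linf_dist`, `reach_degree`, `tol`) are
# hypothetical, not taken from the SPECTRL paper.

def linf_dist(s, g):
    """L-infinity distance between state s and goal g (coordinate-wise
    maximum of absolute differences)."""
    return max(abs(si - gi) for si, gi in zip(s, g))

def reach_degree(s, g, tol=1.0):
    """Degree to which state s satisfies 'reach g': positive once the
    agent is within `tol` of the goal in every coordinate, negative
    otherwise. Following the gradient of this value ignores obstacles
    entirely, which is how a physical barrier between the agent and
    the goal can create a local minimum in the shaped reward."""
    return tol - linf_dist(s, g)
```

For example, an agent directly in front of a wall that separates it from the goal maximizes `reach_degree` by pressing against the wall, even though satisfying the specification requires first moving away from the goal to go around it.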
Each of the axes should also be labelled (in addition to the current caption description).

*** Post Author Response ***

I think the authors did a reasonable job in their response contrasting Reward Machines (RM) with SPECTRL. They clarified that (1) in RM, the FSM is given directly to the agent, while here the contribution is in the conversion from the formal language to the task monitor, and (2) the task monitor also contains registers, making it more powerful than an FSM. On (1), I do want to point out a concurrent work by Camacho et al., to appear in IJCAI 2019, which follows up on RM by using a formal language to specify rewards, automatically constructing the RM, and using reward shaping (potential-based, as in Ng et al. 1999). I only found out about this work now and hence did not mention it in my original review, so I am not trying to penalize the authors for not mentioning/comparing to it. Rather, I think this concurrent work at least supports that this is a research direction people in the community are pursuing and that SPECTRL will be of interest.

On environments, I agree that their existing environments contain obstacles anyway. The clarification on the cartpole environment was also important; I'm sure they ran out of space in the paper to properly discuss it (the modification details could at least have been included in the appendix). But as other reviewers have said, we still don't have baselines for the cartpole environment, so those will be helpful to add / discuss.

Overall I am increasing my score from my initial review.

References:
Icarte, R. T., Klassen, T., Valenzano, R., & McIlraith, S. (2018). Using Reward Machines for High-Level Task Specification and Decomposition in Reinforcement Learning. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:2107-2116.
Camacho, A., Icarte, R. T., Klassen, T. Q., Valenzano, R., & McIlraith, S. A. LTL and Beyond: Formal Languages for Reward Function Specification in Reinforcement Learning.
In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI), 2019. To appear. (PDF: http://www.cs.toronto.edu/~toryn/papers/IJCAI-2019.pdf)