Paper ID: | 1664 |
---|---|

Title: | Safe Exploration for Interactive Machine Learning |

Originality: While safe interactive machine learning has been investigated before, I really like the reductionist approach taken here which uses any unsafe IML algorithm as an oracle and combines it with safety testing to make provably safe decisions. This reductionist approach is original. Adequate citation of earlier works has been provided. Quality: The submission is technically sound. Clarity: Most parts of the paper are well written. I really like the detailed explanation of the algorithm. However, it is still a pretty hard paper to read. The authors use a lot of ideas from a previously published paper Turchetta (2016) and unless the reader is very well aware with previous techniques it is hard to follow some of the arguments. For example, the safe expansion strategy is hard to follow w/o prior knowledge. Lines 184-203 are hard to understand w/o prior knowledge. Similarly, the choice of heuristic (lines 225- 236) paragraph was also hard to understand. Furthermore, it is not clear to me about the computational aspects of the proposed algorithm. For example, what does the safe expansion steps entail computationally? How easy is it to maintain and update the sets S_t^o and S_t^p? Significance: This is a significant result in the area of safe interactive learning.

Overall, the paper is tackling an important problem. Safe exploration is particularly important since it allows autonomous agents to gather data about their environment without violating safety constraints. The paper’s theoretical guarantees are also strong, ensuring that it finds the optimal solution within a slightly constrained version of the safe set. The assumptions appear to be standard. My main concern with this paper is its comparison to prior work, in particular, (Berkenkamp et al., 2017). The authors state that (Berkenkamp et al., 2017) “can benefit from more efficient, goal-oriented exploration”. Is this difference reflected in the theoretical guarantees provided by the authors? If not, don’t the authors at least empirically compare to this prior work to show that they obtain better solutions more quickly? I would be much more positive about this paper is the authors could clearly convey the difference between their paper and this prior work.

This paper considers the safe exploration problem in both (Bayesian, Gaussian Process) optimization and reinforcement learning settings. In this work, as with some previous works, which states are safe is treated as unknown, but it is assumed that safety is determined by a sufficiently smooth constraint function, so that evaluating (exploring) a point may be adequate to ensure that nearby points are also safe on account of smoothness. Perhaps the most significant aspect of this work is the way the problem is formulated. Some previous works allowed unsafe exploration, provided that a near-optimal safe point could be identified; other works treated safe exploration as the sole objective, with finding the optimal point within the safe region as an afterthought. The former model is inappropriate for many reinforcement learning applications in which the learning may happen on-line in a live robotic platform and safety must be ensured during execution; the latter model is simply inefficient, which is in a sense the focus of the evaluation in this work. The approach here is to take an unsafe optimization algorithm, together with a heuristic function for the cost of reaching a state (in the sense of heuristic search, eg., the A* algorithm) and use this to produce an optimization algorithm that is guaranteed to remain in safe states during the process. They use GPs over both the constraint function to guarantee safety, but also over the potential payoff (plus exploration cost) to prioritize queries. The paper provides theoretical guarantees that the search process remains safe and bounds the complexity of exploring the state space, similar to the guarantees provided by prior works. What is different is that empirically, the new method uses much fewer evaluations and much less computation than the prior works by Sui et al. for GPs; fewer evaluations than the work by Turchetta et al. for MDPs that provided such guarantees; and less computation and samples to first path (but slightly worse exploration cost) than a prior algorithm by Wachi et al. that did not feature a theoretical guarantee for exploration. In the last case, the differences in performance correspond to the respective objectives sought by the different works. In any case, the proposed method appears to be general (it can be employed with a variety of different unsafe optimization algorithms) and effective in practice. There are some small issues with the exposition. First, there seems to be some confusion between the notation used in the exposition versus in the formal theorem statement: the theorems use \bar{R} whereas the text uses \tilde{R}. Also the parameter gamma is used without really being introduced; it seems that perhaps given a value of beta a value of gamma can be inferred via the formula on line 211, I wonder how I should interpret it. These don't affect the main message of the paper too seriously though.