NeurIPS 2020

Towards Safe Policy Improvement for Non-Stationary MDPs

Review 1

Summary and Contributions: In this paper, the authors introduce a novel model-free, policy-improvement-based algorithm for smooth non-stationary Markov decision processes (NS-MDPs), focusing on the safety guarantees of their method. The method relies heavily on Assumption 1 (Smooth performance), implicitly assumed in [51], which enables the treatment of off-policy evaluation (OPE) in the NS-MDP as a time-series forecasting (TSF) problem. The authors introduce safe policy improvement for NS-MDPs, termed SPIN, which, under Assumption 1, iterates between a policy evaluation step and a policy improvement step. Importance sampling is used for OPE on past evaluation samples. TSF is then applied to estimate future performance, while wild bootstrapping is used to obtain uncertainty estimates for that forecast. A learnable safe policy is used when the performance estimates fall below those of the safe policy; otherwise, a policy improvement step is made. The experiments on a non-stationary recommender system, RecoSys, and a simulated non-stationary diabetes treatment environment suggest that the method outperforms methods that strive for safety but ignore non-stationarity.
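For concreteness, the evaluation pipeline summarized above can be sketched as follows. This is only a minimal illustration under my own assumptions, not the authors' implementation: the function names, the polynomial time basis, and the simple percentile-style lower bound are all hypothetical choices made for the sketch, and the paper's actual bootstrap procedure may differ.

```python
import numpy as np

def is_estimates(pi_probs, beta_probs, returns):
    """Per-episode importance-sampling estimates of a target policy's return.

    pi_probs[i] and beta_probs[i] hold the action probabilities along episode i
    under the target and behavior policies; returns[i] is the observed return.
    """
    return np.array([
        np.prod(np.asarray(p) / np.asarray(b)) * g
        for p, b, g in zip(pi_probs, beta_probs, returns)
    ])

def basis(t, degree=1):
    """Time features phi(t) = [1, t, ..., t^degree] (an assumed basis)."""
    return np.vander(np.asarray(t, dtype=float), degree + 1, increasing=True)

def forecast(y, horizon, degree=1):
    """Least-squares fit on past estimates; returns the in-sample fit and the
    mean forecast over the next `horizon` episodes."""
    k = len(y)
    Phi = basis(np.arange(k), degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    future = basis(np.arange(k, k + horizon), degree) @ w
    return Phi @ w, future.mean()

def wild_bootstrap_lower_bound(y, horizon, alpha=0.05, n_boot=2000,
                               degree=1, seed=0):
    """Point forecast plus an alpha-quantile lower bound from a wild bootstrap.

    Pseudo-samples are built by flipping the sign of each residual at random
    (Rademacher weights) and refitting; the percentile bound is a simplification.
    """
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    fitted, point = forecast(y, horizon, degree)
    resid = y - fitted
    boots = np.empty(n_boot)
    for b in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=len(y))
        _, boots[b] = forecast(fitted + resid * signs, horizon, degree)
    return point, np.quantile(boots, alpha)
```

In such a sketch, a candidate policy would be deployed only if its lower bound exceeds the corresponding forecast for the safe policy; otherwise the safe policy is kept.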

Strengths: This paper addresses a very important problem in reinforcement learning (RL): the usually false assumption of stationarity. As highlighted by the authors, most real-world settings are intrinsically non-stationary, and hence ignoring this can lead to catastrophic outcomes. The authors show how time-series forecasting (TSF), which has been used extensively for non-stationary data, can enable RL in non-stationary settings. This link can encourage future research in this direction. To the best of my knowledge, the method introduced in this paper is novel, and the empirical results suggest that it is successful in (smooth) non-stationary settings.

Weaknesses: The method relies heavily on Assumption 1 (Smooth performance), and I do not have a good intuition for what it translates to in terms of drift in the transition dynamics and reward function. I wonder if there is a direct connection to a Lipschitz smoothness assumption. Although the empirical results and the theoretical justifications of the introduced method, SPIN, are sound, the lack of an ablation study makes it hard to attribute the final performance improvement to each sub-component. For example, how important in practice are (a) the data-splitting [lines 235-237, 243-244]; (b) searching for the highest lower confidence bound [lines 219-220] rather than, e.g., the expected performance; and (c) the wild bootstrap? It would be great to see an ablation study where, starting from the Baseline (or a simplified backbone), the authors build up towards full SPIN and observe which decisions result in which gains, and also which combinations are the most impactful. Also, how does SPIN compare to methods that only consider non-stationarity, without striving for safety?

Correctness: In my opinion, the claims and method are correct. They are supported by empirical evidence, references, or analytical derivations. The experimental protocol is also well described and doesn’t have any obvious flaws.

Clarity: The paper is well written. It’s self-contained and comprehensive.

Relation to Prior Work: The authors discuss how their work differs from previous contributions and provide the relevant background (both in Section 2 and in the Appendix), so that their method is self-contained and understandable. However, I wonder how this work relates to the literature on (a) lifelong reinforcement learning and (b) zero-shot adaptation, which either explicitly or implicitly makes non-stationarity assumptions and devises algorithms that address challenges similar to the ones the authors face. Also, some of the high-level decisions made by the authors reminded me of prior work in “safe imitation learning” [A]. Finally, it’s not very clear whether the authors credit Assumption 1 to [51] or claim to have introduced it in this work. [A] Zhang et al. (2016) Query-Efficient Imitation Learning for End-to-End Autonomous Driving.

Reproducibility: Yes

Additional Feedback:

Review 2

Summary and Contributions: This paper studies safety in non-stationary MDPs, defining safety as not reducing performance with respect to a current "safe policy." Prior approaches have assumed stationary MDPs. Motivations come from settings such as diabetes, sepsis, and e-commerce. The model considers an exogenous sequence of MDPs and makes use of time-series-forecasting-style approaches to estimate the performance of distinct policies. Crucially, safety is only possible with a smooth transition between environments, so that inference can be done about relative policy quality (either P or R moving slowly, or the payoff of a policy changing slowly enough). The technical contribution is to estimate the performance of a counterfactual policy via importance sampling, following Precup and Thomas, and to use the "wild bootstrap" to get confidence intervals. The "Seldonian" framework from Thomas et al. (2019) is used to perform sequential hypothesis testing, with a sophisticated test, search, and data-splitting approach.

Strengths: The paper is clear, comprehensive, and convincing. Beyond new technical results (Theorem 1, Theorem 2), the SPIN approach is carefully evaluated in simulation in the settings of recommender systems and diabetes treatment.

Weaknesses: None noted. After the rebuttal: thanks also to the authors for their careful responses to the comments from other reviewers.

Correctness: yes, appears correct.

Clarity: yes, very clear.

Relation to Prior Work: yes, very good.

Reproducibility: Yes

Additional Feedback:

Review 3

Summary and Contributions: Update: Thanks to the authors for the detailed feedback. All my concerns are addressed, and I am happy to maintain my judgement of accepting this paper. The paper considers RL with a safety constraint (by comparing to a known safe policy). The authors propose algorithms for this setting and perform an empirical analysis.

Strengths: This is by far the first paper that considers safety constraints in RL. The authors provide detailed steps to build RL algorithms that perform no worse than a baseline policy with high probability.

Weaknesses: 1. It appears that the non-stationarity considered in this paper is governed by an underlying linear model, and it seems that the phi(i)'s are known ahead of time (please correct me if I am wrong). 2. Also, [8] seems to consider a very closely related setting, and the solutions are quite similar. It would be helpful if the authors could explain the key differences in techniques.

Correctness: It looks correct to me.

Clarity: It is well written.

Relation to Prior Work: Some related work on conservative bandit exploration is not mentioned. For example: Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi-Yadkori, Benjamin Van Roy (2016). "Conservative Contextual Linear Bandits."

Reproducibility: Yes

Additional Feedback:

Review 4

Summary and Contributions: This paper proposes a Seldonian framework for safe policy adaptation in non-stationary MDPs. I find both the paper and the approach interesting. However, I am a bit unsure about the overall architecture and the empirical evaluation.

Strengths: The paper proposes algorithms for non-stationary MDPs, which are a challenging and open area in RL. The findings showcase how algorithms could be designed in the presence of non-stationarity using the Seldonian framework, which is itself a recent framework. The algorithm seems to be based on the Seldonian framework and uses pseudo samples, which are generated from models regressed on the existing samples. The authors do note that this can only work when the MDP satisfies some smoothness properties.

Weaknesses: It is not clear how many real-world problems would satisfy these properties. For those that do, would adding a state to the MDP address the issue of non-stationarity?

Correctness: The theory and experiments seem correct.

Clarity: yes, the paper is well written. The assumptions are clearly stated.

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback: I found the paper hard to read, and the idea of creating pseudo samples is still not crystal clear to me. An algorithm listing would have helped here, since this seems to be the key component affecting the method. What happens if the pseudo samples fail? How much data is necessary before this can be done safely? I have gone through the other reviews and the author feedback, and I stay with my recommendation.