NeurIPS 2020

Non-Stochastic Control with Bandit Feedback


Review 1

Summary and Contributions: This work considers online control of fully observable linear systems under adversarial noise and bandit feedback, i.e., when only the value of the incurred cost is observed. Both known and unknown systems are considered, and a regret bound of order T^{3/4} is presented in both settings. Moreover, some experimental results are provided.
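For concreteness, the setting under review is the following (in standard notation, which the paper may instantiate slightly differently): the state evolves as

x_{t+1} = A x_t + B u_t + w_t,

with adversarially chosen disturbances w_t and convex costs c_t; the learner observes only the scalar c_t(x_t, u_t), and regret is measured against the best static state-feedback controller,

\text{Regret}_T = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t(x_t^K, u_t^K),

where (x_t^K, u_t^K) denotes the trajectory induced by playing u_t = -K x_t throughout.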

Strengths:
- The setting is novel and a generalization of prior results. This is the first result in the challenging bandit feedback setting. It also covers general convex loss functions and adversarial noise.
- The bandit convex optimization (BCO) with memory setting is a novel generalization of the previously developed online convex optimization (OCO) with memory tools.
- Simulations of the new algorithm are given for 4 different loss functions in the known-system setting and 3 different disturbance models for the unknown LQR system.
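For readers unfamiliar with the framework referenced in the second point, OCO/BCO with memory (in its standard formulation, which the paper may state slightly differently) has the learner play y_t, suffer a loss depending on the last H decisions, and, under bandit feedback, observe only the scalar value:

f_t(y_{t-H+1}, \dots, y_t) \in \mathbb{R}, \qquad \text{PolicyRegret}_T = \sum_{t=H}^{T} f_t(y_{t-H+1}, \dots, y_t) - \min_{y \in \mathcal{K}} \sum_{t=H}^{T} f_t(y, \dots, y).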

Weaknesses:
- The technical contribution is limited. The BCO-with-memory framework is novel, but it essentially combines the general OCO-with-memory framework with gradient-estimation methods from the bandit-feedback literature. As the authors mention, the naive way of combining the two creates problems, which are resolved using an artificial time delay. The role of this delay is not discussed in detail in the analysis in the main text. Besides this, the regret results follow directly from [4] for the known system and from [16] for the unknown system.
- The comparator for the regret seems a bit puzzling. Enforcing the comparator to be static state feedback even under adversarial noise, i.e., competing against the controller u_t = -K x_t, seems unusual, since it may not be the best-performing controller in the adversarial noise setting.
- For the unknown system, similar to [16], it is assumed that the system is strongly controllable, which is restrictive in the LQR setting and has been generalized to stabilizability in recent works on non-stochastic control.
- It is not clear whether the regret result is tight. The dominating T^{3/4} term comes from the BCO-with-memory subroutine, and it may be improvable to sqrt(T), similar to existing bandit-feedback results in the literature [14].
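To give a rough picture of the delay mechanism alluded to above (a guess at the general shape of the idea, not necessarily the paper's exact scheme): since each loss f_t depends on the H most recent decisions and each decision carries its own exploration noise, the gradient estimate formed at round t is only applied once the decisions it touches have left the memory window, e.g., an update of the form

y_{t+1} = \Pi_{\mathcal{K}}\big(y_t - \eta\, g_{t-H}\big),

so that the exploration noise entering the current loss remains independent of the point being updated.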

Correctness: The technical claims and results are correct. The setting of the empirical study is relevant and correct.

Clarity: The goal of the paper, the setting, the formulation, and the assumptions are clearly described. The BCO setting could be explained in more detail, and there appear to be some typos (see below). The crucial part of the result, the intentional delay used to resolve the time dependency, is not fully discussed and remains somewhat mysterious to the reader.

Relation to Prior Work: The relation to prior work is clearly described and differentiated. It is shown how the current work generalizes the prior results.

Reproducibility: Yes

Additional Feedback:
+++++++++++++Post-Rebuttal++++++++++++++++++++
Thanks for your answers. However, I'm a bit puzzled by the claims in the rebuttal.
1) The regret definition given in lines 146-148 and Equation (4.1) shows that the algorithm competes against a controller u = -Kx. This is the same setting used in [4] and [16]. However, the general setting, especially under adversarial noise, should be linear controllers with internal state, as done in Foster and Simchowitz, "Logarithmic Regret for Adversarial Online Control," or Simchowitz et al., "Improper Learning for Non-Stochastic Control." Therefore, I don't think the regret definition is what is claimed in the rebuttal (all the proofs in the appendix appear to analyze the static linear controller u = -Kx).
2) I disagree with the claim about the generality of controllability. Stabilizability is in fact a more general setting than controllability, contrary to the claims in the rebuttal. There have been many efforts to consider the less restrictive setting of stabilizability, e.g., Foster and Simchowitz, "Logarithmic Regret for Adversarial Online Control," or Simchowitz et al., "Improper Learning for Non-Stochastic Control."
3) I still believe Section 3 and the algorithm need to be rewritten and explained clearly, since many parts are unclear or not well discussed (see the comment on the unclear role of the intentional delay). This is not addressed in the rebuttal, and it is not clear how it will be handled if the paper gets accepted.
In light of these points, even though the paper tackles an important problem that has not been addressed before in online control of linear systems, I think the paper is not ready for publication in its current form. Thus, I would like to lower my score to 5.
+++++++++++++++++++++++++++++++++++++++++++
I believe having a regret result under bandit feedback is an important contribution compared to prior work in the literature with full feedback. The empirical study is also very promising. However, the paper requires some clarification and suffers from some limitations.
- How is the best static state-feedback controller justified as a comparator? For instance, in [31] the best linear controller is chosen as the comparator and a regret result is derived for that setting. It is not clear whether or when a static state-feedback controller outperforms any causal dynamic controller under adversarial disturbances, which is the setting of this work.
- I think it would be useful to rewrite Section 3 of the paper and introduce the BCO-with-memory setting directly through the online control framework. Algorithm 1 appears rather confusing: at line 4 the y_i's are set to some values for i = 1,…,H, and then line 6 predicts y_i for i = 1,…,H-1. I believe the g_i's are the gradient estimates, but as far as I can tell they are not introduced in the main text. Moreover, the notation \in_R \mathcal{S}_1^d is not defined in the main text; its definition is deferred to Appendix A without a reference. The intentional delay used to resolve the time-dependency problem is not explained clearly.
- Do you think the T^{3/4} regret for the BCO-with-memory framework is optimal, or is it an artifact of the analysis?
- For the unknown system it is assumed that a stabilizing controller is given, and this makes the underlying system strongly controllable. I'm not sure why controllability is required in this setting. Is it due to the analysis technique, or is it unavoidable in the bandit feedback setting? In the Appendix it is mentioned that this is required for the exploration phase, but I'm not sure why stabilizability would not be enough.
- The experimental results are presented without discussion. I believe it would be useful to provide a discussion of the performance of the newly designed algorithm for the reader.
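Regarding the undefined g_i's and the \in_R \mathcal{S}_1^d notation: the following is a minimal sketch of the standard one-point spherical gradient estimator that algorithms in this family build on (the function name and interface are illustrative; this is the generic Flaxman-style estimator, not the paper's exact Algorithm 1).

```python
import numpy as np

def one_point_gradient_estimate(f, y, delta):
    """Generic one-point bandit gradient estimate.

    Samples u uniformly at random from the unit sphere S_1^d
    (the "u in_R S_1^d" notation), queries the loss once at the
    perturbed point y + delta * u, and rescales. The result is an
    unbiased estimate of the gradient of the delta-smoothed loss.
    """
    d = y.size
    u = np.random.randn(d)
    u /= np.linalg.norm(u)      # u ~ Uniform(S_1^d)
    value = f(y + delta * u)    # the single scalar bandit observation
    return (d / delta) * value * u
```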


Review 2

Summary and Contributions: This paper studies online control of linear systems with changing convex costs under bandit feedback on the cost and full state measurement. Under both known and unknown system assumptions, regret of order T^{3/4} is obtained (suppressing log terms).

Strengths: The main novelty of the work is the use of bandit feedback on the costs, versus the full information about the costs used in the existing literature. This problem is definitely of interest to the NeurIPS community. The extension of bandit convex optimization to functions with memory, which is used in the derivation, could also be of interest.

Weaknesses: The main weakness is that, in context, the work seems quite incremental. It uses a fairly standard reduction from a control problem to OCO with memory, and then the most basic reduction from full information to bandit feedback. The main insight required to make this work is that the noise used for the gradient estimate needs to be delayed. Even with this insight, the work still seems incremental. In using the FKM reduction to bandit feedback, the regret is likely suboptimal. In particular, the regret increases from T^{1/2} in the full information case to T^{3/4}. This seems entirely due to the suboptimality of the FKM reduction. Using the reductions from:
- Hazan, Elad, and Kfir Levy. "Bandit convex optimization: Towards tight bounds." Advances in Neural Information Processing Systems, 2014.
- Bubeck, Sébastien, Yin Tat Lee, and Ronen Eldan. "Kernel-based methods for bandit convex optimization." Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017.
could potentially close this gap.
Edit: The issue of how the gradients are estimated is subtle, and I missed the significance of Theorem A.9. While I still think the work is slightly incremental, I am bumping my score up because it was not as obvious as I thought.
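For context, the T^{3/4} rate of the basic FKM reduction comes from a standard bias-variance tradeoff, sketched here in the memoryless case for the reader's convenience (ignoring Lipschitz and diameter constants): the one-point estimate has norm O(d/\delta), contributing an O\big(\tfrac{d}{\delta}\sqrt{T}\big) gradient-descent term, while smoothing biases the losses by O(\delta T); balancing with \delta = \Theta(T^{-1/4}) gives

\text{Regret}_T = O\big(\delta T + \tfrac{d}{\delta}\sqrt{T}\big) = O\big(d^{1/2}\, T^{3/4}\big).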

Correctness: The claims and arguments appear to be correct. The empirical evaluations seem sensible.

Clarity: The paper is well written and clear.

Relation to Prior Work: The context of the work is described clearly. As discussed above, however, more sophisticated BCO algorithms were not discussed, and so the potential for improvement is not clearly identified.

Reproducibility: Yes

Additional Feedback:


Review 3

Summary and Contributions: Recent work (e.g., [4]) has proposed algorithms to obtain low regret in linear systems with adversarial disturbances. This work extends the setting to bandit feedback, where the loss function is only observed at the point that is played. The authors propose an algorithm that, like previous literature, reduces the online control problem to an online convex optimization with memory problem, and that attains T^{3/4} regret in systems with known and unknown dynamics. Because the algorithm requires knowledge of the adversarial disturbances to use the BCO-with-memory reduction, the authors propose a two-stage algorithm for unknown dynamics that estimates the dynamics in a first stage and then applies the previous algorithm (with the perturbations estimated from the learned dynamics). Empirical evaluations are also provided, showing improvement on a small problem under a variety of perturbation models. The proposed algorithm seems to handle the perturbations in a much better manner than previous algorithms.
post-rebuttal: I remain supportive of the paper.
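The two-stage structure described above can be summarized as follows (an illustrative least-squares sketch of the explore-then-estimate stage; the names and shapes are mine, and the paper's actual estimator may differ):

```python
import numpy as np

def estimate_dynamics_and_disturbances(xs, us):
    """Fit x_{t+1} ~= A x_t + B u_t by least squares from an exploratory
    rollout, then recover the disturbances as the residuals
    w_t = x_{t+1} - A_hat x_t - B_hat u_t.

    xs: array of shape (T+1, n) of observed states.
    us: array of shape (T, m) of applied controls.
    """
    X_next = xs[1:]                          # (T, n) successor states
    Z = np.hstack([xs[:-1], us])             # (T, n + m) regressors
    Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    n = xs.shape[1]
    A_hat, B_hat = Theta[:n].T, Theta[n:].T  # (n, n), (n, m)
    W_hat = X_next - xs[:-1] @ A_hat.T - us @ B_hat.T
    return A_hat, B_hat, W_hat
```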

Strengths: The problem of adversarial online control is receiving a lot of attention, and the bandit generalization is a natural problem that would interest the community. Especially given that BCO with memory is formulated and a non-trivial algorithm is provided for it, I think the contributions of this work are substantial. I believe they constitute non-trivial extensions of previous ideas. The writing is generally clear and well organized (especially the literature review).

Weaknesses: The paper is more descriptive than illuminating, as not a lot of intuition is provided. Some of the algorithmic choices are not obvious (such as the computation of g_t or the delayed gradient feedback in choosing x_{t+1}), but they are not really motivated by the authors. Many of the assumptions are also stated without any defense; please describe how restrictive they are (e.g., without strong stability no results are known, or without Lipschitzness the problem is obviously intractable). Some effort is spent describing the difficulties overcome (e.g., why it is not possible to apply the trick of Flaxman), but less effort is spent explaining how they are overcome.

Correctness: The claims seem correct.

Clarity: Providing more intuition about the algorithm would be beneficial. For example, why is g_t a good estimate of the gradient? Describing the gradient decomposition used in A.8 would provide a good answer. Highlighting the difference between the proof of Theorem 5.1 and the application of OCO with memory to previous control problems would also help. Some of the definitions are a bit sloppy. For example, the \otimes notation is never defined, nor is it defined what X \sim (some set) means (even though the definition might be clear from context, a definition should still be provided). The matrix norm used is also not defined. The fact that the perturbation sequence needs to be observed should be emphasized, as it does not naturally fit into the unknown-system-dynamics setting, in my opinion.
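On the question of why g_t estimates the gradient: the standard fact underlying one-point estimators of this kind (stated here for the reader's convenience; the paper's A.8 decomposition handles the additional complications introduced by the memory) is that for u drawn uniformly from the unit sphere \mathcal{S}_1^d,

\frac{d}{\delta}\,\mathbb{E}_u\big[f(y + \delta u)\,u\big] = \nabla \hat f_\delta(y), \qquad \hat f_\delta(y) = \mathbb{E}_{v \sim \mathcal{B}^d}\big[f(y + \delta v)\big],

i.e., g_t is an unbiased gradient estimate of the \delta-smoothed loss \hat f_\delta, which is \delta L-close to f when f is L-Lipschitz.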

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:


Review 4

Summary and Contributions:
- This is a theory paper that considers the problem of controlling a (potentially unknown) linear dynamical system in the presence of adversarial perturbations.
- The authors propose a general algorithm for bandit convex optimization with memory (i.e., where the loss functions may depend on a bounded number of past decisions).
- This algorithm is then applied to online control of linear dynamical systems, using a perturbation-based controller parameterization. The main result is a regret bound of O(T^{3/4}) (up to log factors) over the class of strongly stable linear policies, for both known and unknown system dynamics (in the latter case an explore-then-commit paradigm is utilized).
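For readers outside this literature, the perturbation-based (disturbance-action) parameterization mentioned above typically takes the form

u_t = -K x_t + \sum_{i=1}^{H} M^{[i]} w_{t-i},

where K is a fixed stabilizing controller and the learner optimizes online over the matrices M^{[1]}, \dots, M^{[H]}; its appeal is that the counterfactual cost is convex in the M^{[i]} (indexing and constants here follow the usual convention and may differ from the paper's).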

Strengths:
- The paper proposes novel algorithms and derives associated regret bounds for a very challenging setting of online control.
- The theoretical results are of interest to the online control research community (though not much beyond it, at least given the way the paper is currently written).

Weaknesses:
- The paper could benefit from better practical motivation; in its current form it will be quite hard for someone who is not at home in this field to understand why they should care about this work. What are specific practical examples in which the proposed algorithm would be beneficial?
- The presentation of the simulation study does not really do the authors any favors. Specifically, the authors do not really comment on why GPC (the benchmark) performs better than BPC (their method). It would be worth reiterating that this is because of the bandit feedback, i.e., not using information about the form of the cost function.
- More generally, the discussion of the simulation study results could be strengthened. It is not really clear what the reader should take away from the results, and some discussion could help a lot with interpreting them properly.

Correctness:
- As a non-expert in this area, I had a hard time following the details of the derivations. What I did follow appeared to be sound.
- The simulation study is straightforward, though I feel the comparison is somewhat misleading (to the DISadvantage of the authors' contribution, see below).

Clarity:
- The paper is generally well written.
- The theoretical results are discussed at the right level, and their presentation is relatively clear (except for what appear to be some forward references to the appendix, e.g., D in Theorem 3.1, and the "probability of failure" \hat{\delta} in l. 192).
- The presentation/discussion of the simulation results could be improved (see above).

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:
- What does "efficiently model" (iterative decision making) mean?
- Algorithm 2, l4: Should this be \tilde{M}_0^{[i]}?
- Minor comments:
  - define acronym: MDP
  - grammar: "efficient algorithms ... which attains"
  - capitalization of algorithm/figure references in the text is inconsistent
  - l. 170: "whose complete proof is deferred" -> "whose proof is deferred"
- The authors should discuss when an "explore-then-commit" paradigm is justified. In which settings do we only care about asymptotic regret? Relate this to some practical problems.
- Discuss why assuming that the w_t's aren't chosen "reactively" is necessary.
- A reference should be provided for the perturbation-based controller parameterization.
- The assumptions on the loss functions might warrant discussion: how strong are they?
- Assumption of access to an initial stabilizing controller -> might as well assume that A is stable from the get-go? These seem equivalent (see the note at the end of this review).
******* Post-author response update *********
I read the response; not much to argue here.
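Note on the stabilizing-controller point above (a standard reduction, stated for concreteness): given a stabilizing K_0, writing the control as u_t = -K_0 x_t + v_t yields the closed-loop dynamics

x_{t+1} = (A - B K_0)\, x_t + B v_t + w_t,

so learning with access to a stabilizing controller is indeed equivalent to learning in a system whose transition matrix A' = A - B K_0 is stable, with v_t as the new control input.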