__ Summary and Contributions__: This paper presents a new algorithm that learns a near-optimal policy, with sample complexity scaling only with the dimension of the latent state space and the capacity of the decoder function class. The algorithm is oracle-efficient, accessing the decoder class only through calls to a least-squares regression oracle. The authors claim that this work provides the first provable sample-complexity guarantee for continuous control with an unknown nonlinearity in the system model.

__ Strengths__: The paper is very well written and clearly formulates the motivation and contributions of the presented work. The technical contributions are sophisticated and well organized across the appendices. The problem formulation is worthwhile and relevant, and I find it to be of interest to the general NIPS community. In my opinion, this work has strong potential to contribute to the area of perception-based control.

__ Weaknesses__: There are some limitations due to the linearity of the model, but even this is well recognized and justified by the authors.

__ Correctness__: Given the length of the appendices and the time I could spend checking them, I would say it all looks fine.

__ Clarity__: Yes, definitely well written and organized.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__: None.

__ Summary and Contributions__: The topic of this paper is adaptive linear quadratic (LQ) control in a setting where the dynamics are x’ = Ax + Bu + w, costs are quadratic c=x^TQx + u^TRu, and we only observe y = q(x), a high-dimensional non-linear function of the state.
The authors assume that the noise w is Gaussian, (A, B) is controllable, and A is stable. They also assume a known function class F of decoders x = f(y) and decoding realizability (F contains the true decoder f*, and f* is exact). To justify the last assumption, their Theorem 5 shows that in the absence of realizability, when y is a noisy version of f*^{-1}(x), learning the decoder has exponential sample complexity.
In the first phase, the algorithm learns a coarse decoder up to a linear transformation. The key insight is that in each trajectory, the final state x is a linear function of all preceding inputs. Therefore, to recover a decoding of x, they train a regressor on (y_{H+1}, [u_1, …, u_H]) pairs and reduce the output dimension to that of x using PCA. The inputs u_h in this phase are Gaussian.
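To make the Phase 1 idea concrete, here is a hypothetical numpy sketch (my own illustration, not the authors' implementation): a ridge regressor stands in for the least-squares oracle over F, the observation map `q`, the dimensions, and the noise scale are all invented for the example, and PCA is done via SVD since the Bayes predictor of the inputs is linear in x and hence has rank at most dx.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, du, dy, H, n = 2, 1, 10, 5, 2000

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable A, as the paper assumes
B = np.array([[0.0], [1.0]])
proj = rng.normal(size=(dx, dy))

def q(x):
    # hypothetical nonlinear observation map y = q(x)
    return np.tanh(x @ proj)

# Roll out n trajectories of length H with Gaussian excitation inputs.
Y, U = [], []
for _ in range(n):
    x = np.zeros(dx)
    us = rng.normal(size=(H, du))
    for h in range(H):
        x = A @ x + B @ us[h] + 0.01 * rng.normal(size=dx)
    Y.append(q(x))                       # observe y_{H+1}
    U.append(us.ravel())                 # stacked inputs [u_1, ..., u_H]
Y, U = np.array(Y), np.array(U)

# Ridge regression from y_{H+1} to the stacked inputs, standing in for the
# least-squares regression oracle over the decoder class F.
lam = 1e-3
W = np.linalg.solve(Y.T @ Y + lam * np.eye(dy), Y.T @ U)
preds = Y @ W                            # shape (n, H*du)

# PCA: the Bayes predictor is linear in x, so its range has rank <= dx.
preds_c = preds - preds.mean(axis=0)
_, _, Vt = np.linalg.svd(preds_c, full_matrices=False)
coarse = preds_c @ Vt[:dx].T             # coarse decoder output, dim dx
print(coarse.shape)                      # (2000, 2)
```

The output `coarse` recovers the latent state only up to an unknown linear transformation, which is exactly the guarantee the review attributes to Phase 1.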
In the second phase, they estimate dynamics (A, B) and cost parameters in the predicted decoded space. They compute the certainty-equivalent controller corresponding to this estimate.
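A minimal sketch of the certainty-equivalence step described above (my illustration with made-up system matrices, not the authors' code): estimate (A, B) by least squares on transition triples, then solve the discrete algebraic Riccati equation for the corresponding controller. For brevity the sketch uses true states in place of the decoded states.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])   # hypothetical true dynamics
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)              # quadratic cost parameters

# Collect an excited trajectory (states stand in for decoded states here).
X, U, Xn = [], [], []
x = np.zeros(2)
for _ in range(500):
    u = rng.normal(size=1)
    xn = A @ x + B @ u + 0.01 * rng.normal(size=2)
    X.append(x); U.append(u); Xn.append(xn)
    x = xn

# Least-squares estimate of (A, B) from x_{t+1} ≈ A x_t + B u_t.
Z = np.hstack([np.array(X), np.array(U)])            # (500, 3)
Theta, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)
A_hat, B_hat = Theta[:2].T, Theta[2:].T

# Certainty-equivalent LQR gain: u = -K x.
P = solve_discrete_are(A_hat, B_hat, Q, R)
K = np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)
print(np.max(np.abs(np.linalg.eigvals(A_hat - B_hat @ K))))  # spectral radius
```

The printed spectral radius of the estimated closed loop should be below 1, i.e., the certainty-equivalent controller stabilizes the estimated system.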
In the third phase, they improve the decoder by training on near-optimal trajectories, generated using a noisy version of the Phase 2 policy. Rather than having a single decoder, the algorithm now relies on a sequence of decoders f_t (one per timestep), where f_{t+1} is formed from f_t and h_t, and h_t is a residual decoder satisfying h_t(y_{t+1}) - Ah_t(y_t) = x_{t+1} - Ax_t.
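Since x_{t+1} - A x_t = B u_t + w_t, the residual-decoder condition above amounts to fitting h so that h(y_{t+1}) - A h(y_t) ≈ B u_t. As a sketch of one such fit (the linear parameterization h(y) = W y and all dimensions are my assumptions, not the paper's neural-network setup), the objective is linear in W, so vec(W) solves an ordinary least-squares problem via the identity vec(AWy) = (yᵀ ⊗ A) vec(W):

```python
import numpy as np

rng = np.random.default_rng(2)
dx, dy, T = 2, 6, 300
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
M = rng.normal(size=(dy, dx))           # y = M x: linear observations for the sketch

# Roll out one trajectory under random inputs.
xs, us = [np.zeros(dx)], []
for _ in range(T):
    u = rng.normal(size=1)
    us.append(u)
    xs.append(A @ xs[-1] + B @ u + 0.01 * rng.normal(size=dx))
ys = [M @ x for x in xs]

# Least squares for vec(W): rows (y_{t+1}ᵀ ⊗ I - y_tᵀ ⊗ A), targets B u_t.
rows, targs = [], []
for t in range(T):
    rows.append(np.kron(ys[t + 1], np.eye(dx)) - np.kron(ys[t], A))
    targs.append(B @ us[t])
Phi = np.vstack(rows)                   # (T*dx, dx*dy)
b = np.concatenate(targs)
w, *_ = np.linalg.lstsq(Phi, b, rcond=None)
W = w.reshape(dx, dy, order="F")        # vec is column-major
resid = np.linalg.norm(Phi @ w - b) / np.sqrt(T)
print(resid)                            # small: only process noise remains
```

With noiseless dynamics the residual would vanish for any W with WM = I; here it is on the order of the process noise.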
The main theorem (Thm 1) states that under appropriate assumptions, the algorithm learns an eps-optimal policy for horizon T using O((dx + du)^17 T^3 log|F| / eps^6) trajectories.
The main contribution is a new algorithm and analysis of LQ control with hidden states and high-dimensional observations, which are a nonlinear function of the state.

__ Strengths__: The paper applies a rigorous approach to adaptive control in systems with latent linearly-evolving states and high-dimensional observations that are a nonlinear function of the state. This is an important topic with applications in, e.g., perception-based control. Most existing work in this setting is somewhat heuristic (other than LQG control).

__ Weaknesses__: There are no experiments or numerical simulations of the algorithm. The results would be more convincing if demonstrated even in a synthetic realizable setting.
Some of the assumptions in the paper are quite strong (stability, realizability) and it is difficult to assess the extent to which the algorithm is practical.
The decoding procedure in Phase 3 is quite elaborate. Assuming that the simple regression-based Phase 1 decoder is good enough to find a near-optimal controller, it is unclear why a similar procedure does not suffice on noisy near-optimal trajectories, and why data from the two phases cannot be combined. It would be useful to clarify this and/or demonstrate any problematic aspects in simulation.

__ Correctness__: The claims seem correct, though I have not fully checked (and the 80-page appendix is a bit overwhelming). There are no experiments.

__ Clarity__: Section 2.3 is difficult to parse and could be organized better. Otherwise, the main paper is well-written.

__ Relation to Prior Work__: The discussion of related work seems adequate.

__ Reproducibility__: Yes

__ Additional Feedback__: In Algorithm 4 line 28, why is noise added to the optimal policy that is returned to the user?
Most practical systems are only locally linear (and this is also the setting modeled by iLQR and embed-to-control approaches). How difficult is it to extend this algorithm to the locally-linear setting?
Thanks to the authors for the rebuttal and the experiment. After reading the other reviews, my overall opinion is unchanged. I would not be opposed to acceptance if the authors include more simulations in the final version.

__ Summary and Contributions__: This paper provides an algorithm for performing control in a high-dimensional observation space based on LQR. The authors provide a sample-complexity bound for learning a “decoder” into a latent space where LQR can succeed.

__ Strengths__: The problem setting (performing LQR in a latent space with nonlinear observations) is generally difficult, so this paper’s greatest strength is perhaps its significance. Given the wide variety of methods employed in model-based reinforcement learning, it is good to have a straightforward presentation of an algorithm that has guarantees, especially in a powerful model like a latent LQR.
The algorithm has a sample complexity polynomial in the dimensionality of the latent space and control space. The argument is generally well structured but as someone not very familiar with the surrounding literature, I cannot comment on the novelty of the result.

__ Weaknesses__: The paper lacks any empirical evaluation. Although the theoretical argument is strong, even a simple synthetic experiment would go a long way in strengthening its case (perhaps evaluating on a 2-d pointmass system with image observations). With such an experiment this paper merits a higher score in my opinion. As for space constraints, Phase III could be tightened. For example, the noninvertible case could be explained with a sketch and the details relegated to the supplement.
-------------------------------------------------------------------------------------------
In the rebuttal, the authors present an additional empirical experiment with convincing enough results. I've updated my score, and I advocate this paper's acceptance due to the difficulty of the problem setting and the paper's rigorous results. It is a clear starting point for research into other interesting control settings.

__ Correctness__: I skimmed the proofs in the supplement and found no obvious mistakes. The argument in the paper looks correct.

__ Clarity__: The writing and presentation of the problem setting and algorithm is clear and compelling. The authors divide the algorithm into three phases, each of which needs some number of samples and each phase clearly outlines its goals and highlights its achieved result.
Some minor comments:
The paper uses “decoder” in place of what the traditional autoencoding literature calls the “encoder” (an encoder usually maps a high-dimensional state to a low-dimensional code, which is what the function f in this paper does in mapping y to x). The paper would be clearer to readers familiar with the autoencoding literature if the terms were swapped.
It seems that the $h$ in Section 2.3 is different from the $h$ in Section 2.1; one of them should be renamed to make the distinction explicit.

__ Relation to Prior Work__: The paper does a sufficient job of drawing comparisons to prior work. However, regarding the statement in the abstract, “We introduce a new problem setting for continuous control called the LQR with Rich Observations, or RichLQR”, it is certainly not the case that this paper introduces this as a novel problem. It has been extensively studied in the E2C line of work and related papers.

__ Reproducibility__: Yes

__ Additional Feedback__: Minor writing comments:
230, 235, 240, 242, 257: gaussian -> Gaussian
213: propogate -> propagate

__ Summary and Contributions__: This paper proposes a novel problem setting, RichLQR, in which the observations are high-dimensional and complex while the environment evolves in a low-dimensional state described by a linear equation. Under the perfect-decodability assumption, the authors propose RichID-CE to solve the problem. The decoder is realized with a trainable neural network and is iteratively updated.
The proposed algorithm is the first to achieve polynomial sample complexity.

__ Strengths__: The RichLQR problem is expected to be practical and useful. Adopting a neural network as the decoder is a novel idea and a reasonable choice.
The progression through Phases 1, 2, and 3 of RichID-CE makes sense. In particular, iteratively refining the decoder to follow a near-optimal trajectory in Phase 3 is a novel approach.

__ Weaknesses__: I am not very familiar with control theory and related work, so I will mainly focus on questions about the neural network decoder.
In Theorem 1, RichID-CE achieves polynomial-in-dimension sample complexity scaling with T^3 log|F|. However, if F is a family of neural networks, doesn’t the search space (capacity) |F| increase exponentially (i.e., much faster than T)?
During Phase 3, does the sequence of decoders f_t have any convergence guarantee? Can you give a more detailed explanation of how the parameters of the neural network decoder f_{t+1} are updated?

__ Correctness__: I cannot say whether the mathematical derivations are correct.

__ Clarity__: The paper is well organized and describes each part in detail.

__ Relation to Prior Work__: This work directly extends prior work to the rich-observation case, and related works are discussed, mentioned, and compared.

__ Reproducibility__: Yes

__ Additional Feedback__: The simplest decoder network would be a deep convolutional neural network (CNN). An example of the network architecture, training objective, and update rule during Phases 1, 2, and 3 would be very helpful.
----
I've read the rebuttal and updated my score.