__ Summary and Contributions__: Authors derive novel online convex optimisation approaches to improper learning of LDS, which provide logarithmic regret guarantees even in the case of a long forecast memory.

__ Strengths__: System identification has recently seen something of a Renaissance, with several major recent results appearing, incl. those of Hazan et al (NIPS 2017, 2018), Kozdoba et al (AAAI 2019), and Tsiamis and Pappas (submitted). This is a fundamental contribution, combining many non-trivial ideas including the spectral filtering of Hazan et al and truncations therein of Kozdoba et al and Tsiamis and Pappas, and substantially improving upon all the recent works, which made many simplifying assumptions.
A key realisation is the "regret decomposition", followed by many technical steps to bound each part in the decomposition. The regret decomposition could be very useful, more generally.
The results rely on Kolmogorov widths, a rather non-trivial branch of approximation theory. These can be used much more widely in statistical theory and machine learning.

__ Weaknesses__: The experimental part is very weak, for a number of reasons:
-- the examples chosen do not meet the requirements of the other methods tested. (Notably, the non-symmetric matrix A of System 3.) So the authors are beating the straw man, in some sense.
-- the source code of the authors has not been attached. It's not possible to verify the results and recreate the plots.
-- While I appreciate that the authors wanted to pick examples with long forecast memory, their choice seems rather arbitrary. (Notably System 2 has a lot of arbitrary looking constants.) It would be good to present some empirical perturbation analysis, showing that the results of the proposed method are "no fluke".
-- in the case of the work of Hazan et al, it's not clear whether the users utilise the Wave filter or the Spectral filter, actually, or how do they implement it, considering there is no official implementation released. Both methods are non-trivial to implement.
The reliance on Kolmogorov widths, e.g. the results of Temlyakov,
https://link.springer.com/content/pdf/10.1007/BF02312773.pdf
is not made very clear. More generally, even Theorem 2 in the main body of the text may be borderline incomprehensible without a background in approximation theory that cannot be expected in the NeurIPS community at large. ("Five people in the world"?)

__ Correctness__: As far as I can tell.

__ Clarity__: While the first five pages are very clear, Theorem 2 in the main body of the text may be borderline incomprehensible without a background in approximation theory that cannot be expected in the NeurIPS community at large. ("Five people in the world"?)

__ Relation to Prior Work__: Several major recent results are discussed in detail, with a considerable insight, incl. those of Hazan et al (NIPS 2017, 2018), Kozdoba et al (AAAI 2019), and Tsiamis and Pappas (submitted). While the present paper combines non-trivial ideas including the spectral filtering of Hazan et al and truncations therein of Kozdoba et al and Tsiamis and Pappas, it credits them fairly and substantially improves upon all the recent works, which made many simplifying assumptions.
The paper does not mention recent works by Maryam Fazel:
https://scholar.google.com/citations?hl=en&user=vlN_kRoAAAAJ&view_op=list_works&sortby=pubdate
and Ben Recht:
https://scholar.google.com/citations?hl=en&user=a_dbdxAAAAAJ&view_op=list_works&sortby=pubdate
and Csaba Szepesvari:
https://scholar.google.com/citations?hl=en&user=zvC19mQAAAAJ&view_op=list_works&sortby=pubdate
While it does not seem to utilise any particular ideas from these, the citations may be welcome.

__ Reproducibility__: No

__ Additional Feedback__: I have read the rebuttal and I am happy with the answers.

__ Summary and Contributions__: The paper "Spectral Kalman Filtering: Learning to Predict in Unknown Dynamical Systems with Long-Term Memory" provides an algorithm to approximate Kalman filter learning for online prediction of unknown and partially observed LDS. The main contributions are four-fold, they show:
1) that for diagonizable forecast memory G with real eigenvalues the Kalman filter coefficients can be approximated by a linear combination of known filters,
2) the general difficulty of approximating Kalman filter coefficients via linear subspaces in general,
3) a logarithmic regret bound for the proposed algorithm when the system is marginally stable, and
4) experimental results in 3 simulations of LDS prediction.

__ Strengths__: The contribution is well-structured, well-written and theoretically founded. The theorems are well explained and set into context with respect ot the state-of-the-art. As fast approximation algorithm this can be a useful tool for practical use in 1 step forecasting with provable risk bounds for the class of LDS satisfying the assumptions.
The exhaustive rebuttal was one of the best I have ever read. The authors addressed all reviewer doubts and suggestions very convincingly and I changed my assessment acknowledging that this contribution is among the top in NeurIPS.

__ Weaknesses__: It would have been nice to give real world examples in the experiments for systems where the assumptions for the approximation bounds hold to demonstrate the usefulness in real-world scenarios. Also the influence of the choice of the hyperparameters and their sensitivities could have been shown.

__ Correctness__: This paper is mainly theoretical. The theorems seem to be correct. The empirical contribution is not the main focus.

__ Clarity__: Yes it is.

__ Relation to Prior Work__: It differs in its analysis based on the forecast memory and approximation by spectral methods

__ Reproducibility__: No

__ Additional Feedback__: Theoretically yes, empirically one would need transparent list of hyperparameter settings used for all methods, but the practical side was not the focus here. It would have been nice though, since the contribution claims increased practical use in comparison to existing techniques, so more detailed demonstration also on real-world scenarios would be much more convincing than just synthetic examples.

__ Summary and Contributions__: This paper proposes an efficient online prediction algorithm for unknown linear dynamical systems with the long-term prediction memory property. Motivations and contributions are very clearly written.

__ Strengths__: Derivation of the proposed algorithm using some techniques such as Kolmovorov width and small-ball condition is theoretically sound.

__ Weaknesses__: Experiment is a little weak. It would be more helpful, if you showed more exhaustive comarison results. Personally, I want to know the comparison with the usual Kalman filtering with identified model parameters.

__ Correctness__: Yes, the claims made by the authors seem correct.

__ Clarity__: Yes, this paper is well written.

__ Relation to Prior Work__: Basically, yes.
But, I am not clear whether the proposed algorithm (Algorithm 1) is identical to the spectral filtering by [Hazan 2017] or not. What is the main difference between them ?
The authors clearly answered to my question. Now I understood the difference between them.

__ Reproducibility__: Yes

__ Additional Feedback__: While the authors named the proposed method "spectral Kalman filtering", I feel it is a little misleading because Kaman filtering algorithm means recursive update of the posterior distribution (mean vector and covariance matrix) of the latent state vector for many people. Although it is no doubt the proposed method has a close connection to Kalman filtering, it doesn't explicitly estimate the posterior of state vector.

__ Summary and Contributions__: The paper presents a model-free algorithm that converges to Kalman Filter (optimal prediction) results in polynomial time. The predictor is shown to work for marginally stable systems and avoids the inferior performance of a fixed-length window to estimate the future observations. The paper provides a statistical as well as geometric reasoning to justify the technical results. The performance is also verified with an experimental setting where the methodology is compared with two other recently proposed algorithms.

__ Strengths__: The paper shows a strong background on control theory and statistical learning. The topic of the paper is timely and ideas are original. The main result of the paper sounds authentic and with adequate theoretical grounding. The empirical plots also give a good sense of the performance of the algorithm. The contribution is significant and of high relevance to the statistical learning and controls communities.

__ Weaknesses__: The paper begins with a general scope of the problem setup and boils down to the control-free settings for the main results. While this assumption is reasonable due to the space limitation and the amount of analysis behind the general setup, the manuscript would be more complete in case of having a note on the difficulties of adding (active) control. For instance, removing the control from the setup makes the excitation assumption almost trivial (based on martingale small ball condition), but this would definitely need more investigation when the control design problem is added to the scenario.

__ Correctness__: The claims, results, and empirical methodology sound correct.

__ Clarity__: Overall, the paper is well written and easy to follow. There are only a few instances that the authors might want to reconsider. For example B_t is used in the analysis while B was initially reserved for the input. Adding dimensions when introducing new variables (such as Psi in the algorithm) also makes the paper more readable.

__ Relation to Prior Work__: It is clearly explained how the paper is different from previous works in the literature. The authors are also suggested to cite a textbook or a seminal paper on the original Kalman Filter theory.

__ Reproducibility__: Yes

__ Additional Feedback__: After going over the rebuttal, I believe the authors are already aware of the shortcomings of the paper and they intend to add discussions to the manuscript to show that. Regarding my point on excitation signals, as the authors pointed out, having control in the setup makes the machinery more complicated and I agree. I believe the work has the potential to be extended to the case with input analysis. In general, I firmly believe that the paper has a very well-organized structure and the proofs are rigorously provided and I stick to my score.