__ Summary and Contributions__: The paper proposes to use Single-input, multiple-output linear dynamical systems as a building block for complex RNNs. The author shows that reachable SIMO LDS can be computed in parallel. A theoretical analysis is provided on the conditioned MIMO LDS is arbitrarily close to a coupling of d SIMO LDS and a simpler method is presented to approximately implement it. The new method is encapsulate time-varying LDS by LDSTACK to approximate nonlinear RNN.

__ Strengths__: The strengths of this work including propose a theoretical analysis on using LDS in RNN, a new method to approximate nonlinear RNN with LDS, a theoretical analysis on the relation between SIMO LDS and MIMO LDS, a scheme for approximate MIMO LDS with SIMO LDS.

__ Weaknesses__: This paper seems to be a paper with two style, well-written in section 1 to section 4 and not good enough in section 5 and section 6. Section 5 and section 6 seems to be quite strange to this paper. First, as the culmination of this paper, the analyze and the formulation about LDStack in section 5 is not clear enough, which makes this paper confusing,
Second, the experimental result in section 6 is not clear enough. The author describe the background of the experiment, but something is confusing about section 6. The relation between section 6 and previous section is not clear enough. Besides, there are some small mistakes, e.g., Figure 6 is an empty figure.

__ Correctness__: The adopted methodology seems to be correct. The empirical results seems to be convinced.

__ Clarity__: Satisfactory, but not good enough in supplementary and experiments.

__ Relation to Prior Work__: Yes.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: The paper proposes a new algorithm for learning sequence to sequence tasks such as those in time-series and language. The algorithm relies on a single-input linear dynamical system (LDS) as a building block where multiple LDSs are coupled to handle multiple inputs and then stacked together to become functionally equivalent to a recurrent neural network (RNN). Different from an RNN, the proposed stacked-LDS can be parallelized and is easier to analyze. The authors start with a theoretical foundation of the proposed model with detailed proofs and they end by comparing their model to RNNs and apply it to real data.

__ Strengths__: While powerful, RNNs are known to be both computationally expensive and difficult to train. This attracted the community to other sequence models such as convolutional or attention-based. This paper introduces a novel algorithm using linear dynamical systems (LDS) as another alternative to RNNs. They show that their algorithm can be parallelized and since it is based on linear components it can be easily analyzed and explained. This can be a very attractive method to further our understanding of sequence models and RNNs specifically. Both the theory and results presented in this paper were impressive.

__ Weaknesses__: When comparing LDStack to RNNs a number of concerns arise:
- As mentioned by the authors, the dynamics of the RNN are viewed as an Euler discretization and the nonlinear update rule is essentially linearized to work with LDS. Do we understand the limitations of this approximation? Furthermore, wouldn't we get an accumulation of the error over time during inference?
- One advantage of RNNs over other sequence models is their ability to process long term memory, can LDS retain that property? or do the different stacks have to operate on different time-scales?

__ Correctness__: Both the theoretical assumptions, proofs and results were described in detail and are well-explained with no visible errors.

__ Clarity__: The paper is well written, detailed mathematical derivations were well presented as well as their proofs. Experimental results are explained well in the figures.

__ Relation to Prior Work__: The authors cite various contributions from various fields including control theory, dynamical systems and machine learning. The novelty and differences from other studies is well described.

__ Reproducibility__: Yes

__ Additional Feedback__:

__ Summary and Contributions__: In this paper, it is shown that for reachable SIMO LDS, i.e. single-input multi-output (time-invariant) linear dynamical systems, the output, and its gradient w.r.t. parameters can be efficiently computed due to the Parallel Linear Recurrence Algorithm introduced in previous works. This potentially leads to efficient training for RNNs with long sequence data.
The result applies to training more complicated model in the following ways:
1. For MIMO (multi-input multi-output) LDS, given suitable condition, a perturbed system, which can be regarded as multiple coupled SISO LDS, serves as a good approximation;
2. For MIMO LDS, consider multiple SIMO LDS by randomly projecting the original input onto 1-dimension subspaces, the averaged output of these SIMO LDS can approximate the output of the original system given sufficiently large number of projections;
3. For nonliear RNNs, nonliear operation from layer to layer is approximated by a stack of time-varying LDS.

__ Strengths__: This paper suggested several approaches to approximate complicate dynamical systems by combination of simple linear systems, the claims and propositions regarding the approximation accuracy are well developed and supported by the numerical simulations. In the broad sense, these approaches are examples of identifying complicated systems by combination of simple systems with reasonable accuracy, which, as claimed by the author, might improve the computation efficiency.

__ Weaknesses__: To well motivate the approach of system identification with combination of linear systems. The LDStack, the RNN structure proposed in the paper, is supposed to: 1) have as rich expressive power as ordinary RNNs; 2) have some advantages over ordinary RNNs in some aspects such as computational complexity, dimension of parameter space, etc. In this paper, 1) is theoretically and numerically supported, while 2) is not seen from the numerical experiments. To be specific, one of the motivation of such approach is the high computational efficiency of training SIMO LDS for long sequence data. However, in the experiments there is no comparison between LDStack and ordinary RNNs regarding computation efficiency, possibly with dependency on the length of the sequence data.

__ Correctness__: yes

__ Clarity__: Theoretical part: well written with a few minor typos and problems
Algorithm part: Overall good but would be better with more details on the implementation of the LDStack: explicit parameterization and the gradients.
Experiment: see Weaknesses.

__ Relation to Prior Work__: yes

__ Reproducibility__: Yes

__ Additional Feedback__: typos:
1. line 109, C is missing in the canonical form; e_n, e_2 shoud be e_n(\lambda), e_2(\lambda).
2. line 124 Lemma 2, A is required to have the canonical form. The section 2.2 didn't explicitly specify that.
3. Typos in Figure 1. the state transition and output equations.
Comments:
1. line 170 Lemma 3, there is the degenerate case where A has repeated eigenvalues, which is not considered here (actually the reachablity prevents this case, but it is possible to have two eigenvalues close to each other, does that causes any numerical issues?)
2. line 461 in supplement, the proof only consider the case when A is normal, which is insufficient.