__ Summary and Contributions__: The paper presents an approach to forecast discrete-time time series with non-stochastic covariates using the Transformer. The proposed architecture uses the quantile loss as the main training objective, helped with an adversarial loss to encourage the generated forecasts to match training data at the whole sequence level. Promising forecasting results on a number of datasets are presented.

__ Strengths__: The paper studies an important sub-class of forecasting problems. The proposed approach of combining transformers for time series forecasting with an adversarial regularization appears novel and promising. The empirical evaluation is reasonably thorough, and the authors carried out ablation studies to better understand their model performance.

__ Weaknesses__: The presentation of the full model architecture is incomplete, and the notation is sloppy in places. The way in which the time series and covariates are normalized and encoded for input to the transformer should be explained. The way in which the transformer output is mapped to a (probabilistic?) forecast should also be explained. Figure 2 refers to L_G (generator loss) and L_D (discriminator loss), but equations for those losses are not explicitly given (one can infer them from Eqs. (4)-(7), but the reader should not need to).
From the text, it is not clear which of the following is true:
1. one trains multiple models to each predict at a different quantile (e.g. 50%, 90%),
2. a single model predicts multiple quantiles at the same time,
3. a single model outputs a probability distribution (how?), and the quantile loss is used to make sure that the quantiles of the predictive distribution match those of the empirical one.
Resolving this ambiguity is essential to conveying how the model works.
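To make the three readings concrete, here is a minimal sketch of the quantile (pinball) loss, written for reading 2, where a single model emits one forecast per quantile level. The function name `quantile_loss`, the array shapes, and the plain mean reduction are assumptions for illustration; the paper's L_\rho may use a different normalization.

```python
import numpy as np

def quantile_loss(y_true, y_pred, rhos):
    """Mean pinball loss (illustrative; names and shapes are assumed).

    y_true: shape (T,), observed values.
    y_pred: shape (T, Q), one forecast column per quantile level.
    rhos:   shape (Q,), quantile levels in (0, 1), e.g. [0.5, 0.9].
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    diff = y_true[:, None] - y_pred          # positive where the forecast is too low
    # Under-prediction is weighted by rho, over-prediction by (1 - rho).
    loss = np.maximum(rhos * diff, (rhos - 1.0) * diff)
    return loss.mean(axis=0)                 # one loss value per quantile level

y = np.array([1.0, 2.0])
forecasts = np.array([[0.0, 0.0],
                      [3.0, 3.0]])           # same forecast used for both levels
per_quantile = quantile_loss(y, forecasts, [0.5, 0.9])
```

Under reading 1, the same function would be called with a single level per model; under reading 3, the forecast columns would be quantiles read off the predicted distribution.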
In Algorithm 1, the generator G is not properly defined. In particular, it should be made explicit exactly which distribution it samples from, and how the transformer’s output defines this distribution. Moreover, the relationship between the quantile loss and the predictive distribution needs to be made clear. In the update to the Sparse Transformer parameters, the notation for L_\rho seems to be mistaken, and does not match the definition of L_\rho on the next page. Finally, the log loss in the discriminator makes use of an expectation, which is presumably approximated by an average over a minibatch. This should be noted.
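For reference, the minibatch approximation in question would take the following form for the standard GAN discriminator objective. The symbols B (minibatch size), Y^{(b)} (a real sequence), and X^{(b)} (the generator's conditioning input) are assumed here, and the paper's exact loss may differ.

```latex
\mathbb{E}_{Y}\left[\log D(Y)\right]
  + \mathbb{E}_{X}\left[\log\left(1 - D(G(X))\right)\right]
\;\approx\;
\frac{1}{B} \sum_{b=1}^{B}
  \left[ \log D\big(Y^{(b)}\big) + \log\left(1 - D\big(G(X^{(b)})\big)\right) \right]
```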

__ Correctness__: The empirical methodology appears sound.

__ Clarity__: The paper is reasonably easy to follow, although it would benefit from a thorough review of English grammar and style. More importantly, there is a general sloppiness in the mathematical notation used throughout, which makes interpretation at times uncertain. Some points are noted explicitly below.

__ Relation to Prior Work__: There is generally an acceptable review of the extant literature, both around forecasting and transformers. However, there is a missing comparison with the N-BEATS model that was presented at ICLR 2020, and which obtains state-of-the-art results on most of the datasets shown by the authors. Adding N-BEATS to the set of baseline models would appear to be in order:
Oreshkin, B. N., Carpov, D., Chapados, N., & Bengio, Y. (2019). N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv preprint arXiv:1905.10437.

__ Reproducibility__: No

__ Additional Feedback__: Lines 32-33: more explanation is needed for this statement.
Line 36: this should be explained, as it seems to depend on the problem setup: “While during inference, real previous target values are unavailable…”
Line 39: “Recently, …” needs a citation
Lines 74-75: The “complex structure and interdependence between groups of series” is addressed by the following paper:
Chapados, N. (2014). Effective Bayesian Modeling of Groups of Related Count Time Series. In International Conference on Machine Learning (pp. 1395-1403).
Line 75: apply ==> applied
Line 82: explain what is meant by “inflexible objective”.
Figure 1: the differences in the illustrated forecasts are very small and not very convincing — in particular, one would immediately question what those tiny differences would imply for decision making.
Line 120: define “long-term”
Line 123: the footnote “2” next to h is utterly confusing
Line 123: better explain the dimensionality of h; in particular, n was previously used to denote the dimensionality of the parameter vector (line 113). One would assume that this is a new n?
Line 130: metrics ==> matrices
Lines 166-175: it is not clear how a discriminator network with only fully-connected layers can process a time series as input, unless said time series has a fixed length. Is this what is intended?
Line 277: the phrase “improve the contiguous and fidelity from sequence level” is not clear.
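Regarding the comment on Lines 166-175, here is a small sketch of why a purely fully-connected discriminator implies a fixed input length. The layer sizes, the names `T`, `H`, and `mlp_discriminator`, and the random weights are hypothetical and not taken from the paper.

```python
import numpy as np

T, H = 24, 16                                # assumed sequence length and hidden width
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(T, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H,)), 0.0

def mlp_discriminator(seq):
    """Two-layer MLP mapping a length-T sequence to a sigmoid score."""
    seq = np.asarray(seq, dtype=float)
    hidden = np.maximum(seq @ W1 + b1, 0.0)  # first-layer shape (T, H) fixes the input length
    logit = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

score = mlp_discriminator(np.zeros(T))       # accepted: length matches W1

try:
    mlp_discriminator(np.zeros(T + 1))       # rejected: length T + 1 does not match W1
    accepts_other_length = True
except ValueError:
    accepts_other_length = False
```

Handling variable-length series would require padding or truncating to T, or a recurrent, convolutional, or pooling layer ahead of the dense layers.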

__ Summary and Contributions__: This paper proposes a new time series forecasting model, the Adversarial Sparse Transformer (AST), based on Generative Adversarial Networks (GANs). AST adopts a Sparse Transformer as the generator to learn a sparse attention map for time series forecasting, and uses a discriminator to improve the prediction performance at the sequence level. Extensive experiments on several real-world datasets show the effectiveness and efficiency of AST.

__ Strengths__: The paper proposes an effective time series forecasting model, AST.
1) AST introduces a discriminator to encourage the model to generate more realistic time series and to improve performance at the sequence level. Extensive experiments show that the discriminator loss improves the forecasting performance of various time series forecasting models, such as the sparse Transformer and DeepAR [1].
2) AST introduces \alpha-entmax into the sparse Transformer for time series forecasting, which outperforms the previously used softmax and sparsemax. \alpha-entmax was proposed in [2] and used in a sparse Transformer in [3].
3) AST is a good combination of recent advances and works very well according to the experiments in this paper.
[1] DeepAR: Probabilistic forecasting with autoregressive recurrent networks, International Journal of Forecasting 2020.
[2] Learning classifiers with Fenchel-Young losses: Generalized entropies, margins, and algorithms, AISTATS 2019.
[3] Sparse Sequence-to-Sequence Models, ACL 2019.
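As background for point 2), the following is a minimal sketch of \alpha-entmax computed by bisection on the threshold \tau, using the general form p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)} from [3]. The function name `entmax_bisect` and the iteration count are assumptions; this is not the authors' implementation. With \alpha = 2 it reduces to sparsemax, and as \alpha approaches 1 it approaches softmax.

```python
import numpy as np

def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector via bisection on tau (requires alpha > 1)."""
    z = np.asarray(z, dtype=float) * (alpha - 1.0)
    n = z.size
    exponent = 1.0 / (alpha - 1.0)
    lo = z.max() - 1.0                          # largest entry alone gives mass >= 1
    hi = z.max() - (1.0 / n) ** (alpha - 1.0)   # every entry <= 1/n, so mass <= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = (np.clip(z - tau, 0.0, None) ** exponent).sum()
        if mass >= 1.0:
            lo = tau                            # threshold too low: move it up
        else:
            hi = tau                            # threshold too high: move it down
    p = np.clip(z - lo, 0.0, None) ** exponent
    return p / p.sum()                          # remove the residual bisection error

def softmax(z):
    e = np.exp(np.asarray(z, dtype=float) - np.max(z))
    return e / e.sum()

scores = [1.0, 0.5, -1.0]
p_sparsemax = entmax_bisect(scores, alpha=2.0)  # alpha = 2 is sparsemax
p_entmax15 = entmax_bisect(scores, alpha=1.5)
p_softmax = softmax(scores)
```

Unlike softmax, the entmax outputs can contain exact zeros, which is what makes the resulting attention map sparse.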

__ Weaknesses__: Limited technical novelty. The proposed model, AST, appears to be a combination of the sparse Transformer and the GAN framework. TimeGAN [4] has already shown empirically that adversarial training can improve forecasting performance. Although TimeGAN mainly focuses on time series generation, it also shows that adversarial training yields better prediction scores in experiments on both synthetic and real-world datasets. The Transformer and the Sparse Transformer have also already been explored for time series forecasting [5].
[4] Time-series generative adversarial networks, NeurIPS 2019.
[5] Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting, NeurIPS 2019.

__ Correctness__: No code provided.
The method and empirical methodology seem correct.

__ Clarity__: This paper is written clearly.

__ Relation to Prior Work__: The paper clearly discusses how this work differs from previous contributions.

__ Reproducibility__: No

__ Additional Feedback__: Minor comments:
1. What generator does AST use? Is it a standard Transformer with the softmax replaced by \alpha-entmax, or is it similar to ConvTrans?
2. What is DSSM? It is not introduced in the experiments section. Is it DeepState?
3. What is the Q50 loss of AST on elect1d? It is reported as 0.042 in Table 2 and 0.039 in Table 4.
4. Do you plan to release the code of AST?
5. Some papers in the reference list have been published at peer-reviewed conferences but are cited in their arXiv versions.

__ Summary and Contributions__: Time series prediction is an important research area in the scope of NeurIPS. Recent SOTA methods rely on RNN or Transformer architecture to make predictions. However, they mainly optimize a single loss (e.g., MSE) in the training phase. Moreover, the error accumulation problem has limited their predictive performance. In this paper, the authors present a sparse transformer with adversarial training to address these problems. The experiments demonstrate its superiority against the SOTA methods in different datasets.

__ Strengths__: 1. Insight and technical quality are okay. The proposed method can address existing challenges correctly. Using GAN to regularize the prediction model is an interesting and straightforward idea.
2. The topic is very relevant to NeurIPS. I believe most of the researchers in this research area can learn something from this paper.
3. The experiments show the SOTA performance achieved by the proposed method.

__ Weaknesses__: 1. My major concern is the novelty. The work seems to simply combine the Sparse Transformer and a GAN to make predictions. The contribution is somewhat incremental, but sufficient for acceptance.
2. Another concern is the choice of baselines for comparison. To the best of my knowledge, many attention-based papers published in top venues (e.g., NeurIPS, SIGIR) in the last three years have shown more promising results than DeepAR, such as [1] and [2]. Moreover, there have also been several attempts at introducing an extra loss function to regularize forecasting models. For example, [3] introduces a shape loss to preserve the trend of the time series. I would like to see more discussion of (or experiments on) the differences between the proposed method and these existing methods.
References:
[1] Qin, Yao, et al. "A dual-stage attention-based recurrent neural network for time series prediction." IJCAI 2018.
[2] Lai, Guokun, et al. “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks.” SIGIR 2018.
[3] Le Guen, Vincent, and Nicolas Thome. "Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models." NeurIPS 2019.

__ Correctness__: Yes, the claims and empirical methodology are both satisfactory.

__ Clarity__: This paper is well-written and easy to follow. I can easily get most of the points.

__ Relation to Prior Work__: The relation to prior work appears to be discussed, but the discussion still needs to be improved.

__ Reproducibility__: Yes

__ Additional Feedback__: The feedback from the authors has mostly addressed my concerns on the compared baselines. It would be good to add the provided experimental results into the revision.

__ Summary and Contributions__: This paper extends sparse transformer models for time series forecasting with an adversarial training procedure, as in generative adversarial networks. The experimental results show that adversarial training improves over (sparse) transformer models and an LSTM-based model (DeepAR).

__ Strengths__: Using adversarial training to regularize the training of time series forecasting models is a sound idea. Furthermore, the experimental results show that the proposed model consistently improves over (sparse) transformer models and other baselines.

__ Weaknesses__: Although some details of the experimental setup (network structure, training procedure) are provided in the supplementary material, I do not think they are detailed enough to reproduce the results. For instance, for the sparse transformer model, the chosen learning rate is provided, but the grid-search values for the learning rate and the number of layers are not. Furthermore, no details of the experimental setup are given for the baseline approaches: which final learning rates were chosen, and what are the architecture details (hyperparameters of the networks)? Providing the full hyperparameter search space, together with the final hyperparameters that lead to the best validation error for the proposed model and the baselines, would increase reproducibility.
Sharing the code would further increase the reproducibility of the results, and I strongly suggest that the authors do so.
---
The author response promises to address the comments concerning reproducibility. It would be great to have the experimental details of all the models, including the baselines.

__ Correctness__: The proposed model is compared against several baselines: stochastic auto-regressive models (ARIMA, ETS), a matrix-factorization-based model (TRMF), recurrent neural network based models (DeepAR, DSSM, DeepState), and attention-based models (ConvTrans, transformers and sparse transformers). Furthermore, the ablation studies support the main results of the paper.

__ Clarity__: The paper is well written.
Further remarks:
In Figure 2 of the Supplementary Materials, adding x- and y-axis labels and further discussion of the attention range would improve readability.

__ Relation to Prior Work__: The proposed approach is clearly framed compared to earlier work.
---
One very recent paper using GAN training for RNN-based model could be added to related work:
If You Like It, GAN It. Probabilistic Multivariate Times Series Forecast With GAN, Koochali et al, https://arxiv.org/abs/2005.01181, 2020.

__ Reproducibility__: No

__ Additional Feedback__: Figure 1 shows time series predictions from transformer-based approaches only. It would be interesting to see, in the supplementary material, forecasts from the stochastic and other neural network approaches on the electricity dataset and the other datasets.
Furthermore, it would be interesting to visualize/plot the average attention weights on the input data using vanilla, sparse and adversarial sparse transformer models.