Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This paper proposes to use causal convolutions to enhance local structure and a log-sparse attention formulation to reduce the memory requirements of Transformers. Moreover, different log-sparse variants are proposed, for example local and restart. The experiments explore different convolutional kernel sizes, the combination of convolutions and log-sparse attention, and evaluating the data at different resolutions.

Pros:
- Addresses key challenges in Transformers, enhancing locality and reducing memory through the log-sparse formulation.
- The log-sparse formulation is intuitive: dense on recent history and sparser as history grows more distant.
- The figure illustrations, such as Figures 1, 2, and 3, are extremely informative.

Cons:
- The variants of the log-sparse formulation, such as local and restart, do not seem to be tested in the experiments.
- Perhaps provide some plots of the real data against generated sequences to help readers see the challenges in the dataset (such as how frequently change points happen) and, qualitatively, how well the model is able to capture them.
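To make the log-sparse idea concrete, here is a minimal sketch of the kind of attention mask it implies: each query attends densely to its immediate past and at exponentially spaced steps further back. The function name `logsparse_mask` and the exact cell-selection rule are illustrative assumptions, not the paper's precise indexing.

```python
import numpy as np

def logsparse_mask(seq_len: int) -> np.ndarray:
    """Boolean causal attention mask where position t attends to itself
    and to positions t-1, t-2, t-4, t-8, ... (exponentially spaced steps
    into the past), instead of the full history."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for t in range(seq_len):
        mask[t, t] = True  # always attend to the current step
        step = 1
        while t - step >= 0:
            mask[t, t - step] = True
            step *= 2  # log-sparse: double the look-back distance
    return mask

m = logsparse_mask(8)
# Each row activates O(log t) cells vs. t + 1 for full causal attention.
```

This is where the memory saving comes from: per query, the number of attended keys grows logarithmically rather than linearly with sequence length.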
On its own, the paper assigns text space inefficiently. While simulated data is fine (in context), it should not take space from an essential description of the Methods. The baseline method is described in one short and incomplete paragraph, insufficiently explained apart from a couple of perfunctory references (one of which, like several other references, is incompletely listed). The architecture window does not suffice, and it is left to the reader to piece together how the forecasts are computed from data, end-to-end. Several mentions of 'rolling window' are not a sufficient description of the train/validation/test procedure. What was it? Depending on the exact details, the evaluation procedure can result in overfitting. What was the loss function used in training? (It appears only briefly in the 'training curve' figure.) Results are incomplete. For the M4 dataset in particular (which has a test set), there are known accuracy results in the literature which can be compared with the R_0.5 result; they should be included. There are scant or no details given on how the alternative methods (ARIMA, TRMF, DeepAR) were set up (lag length, metaparameters), or how the metaparameters of the proposed method (in particular the kernel size) were chosen *prior* to any ablation studies. Finally, while the stated goal is computational efficiency, no running time is reported, nor the actual software/hardware architecture that implemented the main method.

Minor details, for improved clarity:
- The Methods section is minuscule: less than 10 lines on page 3. Expand it.
- "Deep neural networks have been proposed to capture shared information across related time series for accurate forecasting."
- How were the baseline metaparameters (kernel size, h) chosen before the ablation study?
- Figure 5 suggests the NN was trained iteratively over the same data.
- ARIMA performs significantly worse than the simpler method (ETS), which suggests seasonality was not used.
- In what software/environment were the main method and its improvements implemented, and how fast did these run?
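For reference, the rolling-window evaluation questioned above commonly takes a form like the following. This is a hypothetical sketch of one standard scheme, not the paper's actual (unreported) procedure; the function name `rolling_windows` and its parameters are assumptions.

```python
def rolling_windows(n: int, window: int, horizon: int, stride: int):
    """Yield (train_range, test_range) index pairs: fit on `window`
    consecutive points, evaluate on the next `horizon` points, then
    roll the window forward by `stride`."""
    splits = []
    start = 0
    while start + window + horizon <= n:
        train = range(start, start + window)
        test = range(start + window, start + window + horizon)
        splits.append((train, test))
        start += stride
    return splits

splits = rolling_windows(n=100, window=60, horizon=10, stride=10)
# Test indices always lie strictly after their training window,
# so no forecast is scored on data the model was fit on.
```

The reviewer's overfitting concern is precisely about variations of this scheme: if hyperparameters are tuned on the same windows later used for final evaluation, the reported accuracy is optimistically biased.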
The paper is a rather straightforward extension of the well-known Transformer network for time series forecasting. However, it precisely targets two major limitations of the original algorithm, and the proposed improvements handle them effectively. As a result, it shows a significant improvement over state-of-the-art methods such as DeepAR, especially for datasets which require long-term dependency modeling. The paper is clearly written and the quality is high in all aspects. Readers can understand the benefit of each proposed component, thanks to carefully designed experiments. I think it is a significant contribution to the community, demonstrating the potential of Transformer networks for time series forecasting.

Some questions:
- Could you provide the dimension of the covariate vectors x for each of the experiments, with some details?
- Which positional encoding scheme was used? Exactly the same as the formula on p. 6 of the original Transformer paper?
- It seems the performance of the proposed algorithm for electricity-f_1d and traffic-f_1d in Tables 3 and 4 does not match. Is the kernel size different?
- What was the window size for the electricity-c and traffic-c experiments? The full history length? Could you provide more details?
- Would the performance with k = 1 in Table 2 be almost the same as that of the original Transformer network? I assume sparse attention was used in Table 2, but the performance should be equivalent or better with the full-attention model according to Table 3. Do you have exact numbers for the original Transformer network?
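For context on the positional-encoding question: the formula on p. 6 of the original Transformer paper ("Attention Is All You Need") is the fixed sinusoidal encoding, sketched below. Whether the authors used exactly this variant, a learned embedding, or something else is what the reviewer is asking.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # shape (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 16)
# pe[0] alternates [0, 1, 0, 1, ...] since sin(0) = 0 and cos(0) = 1
```

For time series, the choice matters because covariates such as hour-of-day or day-of-week can partially substitute for positional information, which is presumably why the reviewer wants the scheme stated explicitly.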