{"title": "Shape and Time Distortion Loss for Training Deep Time Series Forecasting Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4189, "page_last": 4201, "abstract": "This paper addresses the problem of time series forecasting for non-stationary\nsignals and multiple future steps prediction. To handle this challenging task, we\nintroduce DILATE (DIstortion Loss including shApe and TimE), a new objective\nfunction for training deep neural networks. DILATE aims at accurately predicting\nsudden changes, and explicitly incorporates two terms supporting precise shape\nand temporal change detection. We introduce a differentiable loss function suitable\nfor training deep neural nets, and provide a custom back-prop implementation for\nspeeding up optimization. We also introduce a variant of DILATE, which provides\na smooth generalization of temporally-constrained Dynamic TimeWarping (DTW).\nExperiments carried out on various non-stationary datasets reveal the very good\nbehaviour of DILATE compared to models trained with the standard Mean Squared\nError (MSE) loss function, and also to DTW and variants. DILATE is also agnostic\nto the choice of the model, and we highlight its benefit for training fully connected\nnetworks as well as specialized recurrent architectures, showing its capacity to\nimprove over state-of-the-art trajectory forecasting approaches.", "full_text": "Shape and Time Distortion Loss for Training Deep\n\nTime Series Forecasting Models\n\nVincent Le Guen 1,2\n\nvincent.le-guen@edf.fr\n\nNicolas Thome 2\n\nnicolas.thome@cnam.fr\n\n(1) EDF R&D\n\n6 quai Watier, 78401 Chatou, France\n\n(2) CEDRIC, Conservatoire National des Arts et M\u00e9tiers\n\n292 rue Saint-Martin, 75003 Paris, France\n\nAbstract\n\nThis paper addresses the problem of time series forecasting for non-stationary\nsignals and multiple future steps prediction. 
To handle this challenging task, we\nintroduce DILATE (DIstortion Loss including shApe and TimE), a new objective\nfunction for training deep neural networks. DILATE aims at accurately predicting\nsudden changes, and explicitly incorporates two terms supporting precise shape\nand temporal change detection. We introduce a differentiable loss function suitable\nfor training deep neural nets, and provide a custom back-prop implementation for\nspeeding up optimization. We also introduce a variant of DILATE, which provides\na smooth generalization of temporally-constrained Dynamic Time Warping (DTW).\nExperiments carried out on various non-stationary datasets reveal the very good\nbehaviour of DILATE compared to models trained with the standard Mean Squared\nError (MSE) loss function, and also to DTW and variants. DILATE is also agnostic\nto the choice of the model, and we highlight its bene\ufb01t for training fully connected\nnetworks as well as specialized recurrent architectures, showing its capacity to\nimprove over state-of-the-art trajectory forecasting approaches.\n\n1\n\nIntroduction\n\nTime series forecasting [6] consists in analyzing the dynamics and correlations between historical data\nfor predicting future behavior. In one-step prediction problems [39, 30], future prediction reduces to\na single scalar value. This is in sharp contrast with multi-step time series prediction [49, 2, 48], which\nconsists in predicting a complete trajectory of future data at a rather long temporal extent. Multi-step\nforecasting thus requires to accurately describe time series evolution.\nThis work focuses on multi-step forecasting problems for non-stationary signals, i.e. when future data\ncannot only be inferred from the past periodicity, and when abrupt changes of regime can occur. This\nincludes important and diverse application \ufb01elds, e.g. 
regulating electricity consumption [63, 36],\npredicting sharp discontinuities in renewable energy production [23] or in traf\ufb01c \ufb02ow [35, 34],\nelectrocardiogram (ECG) analysis [9], stock markets prediction [14], etc.\nDeep learning is an appealing solution for this multi-step and non-stationary prediction problem,\ndue to the ability of deep neural networks to model complex nonlinear time dependencies. Many\napproaches have recently been proposed, mostly relying on the design of speci\ufb01c one-step ahead\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f(a) Non informative prediction\n\n(b) Correct shape, time delay\n\n(c) Correct time, inaccurate shape\n\nFigure 1: Limitation of the euclidean (MSE) loss: when predicting a sudden change (target blue step\nfunction), the 3 predictions (a), (b) and (c) have similar MSE but very different forecasting skills. In\ncontrast, the DILATE loss proposed in this work, which disentangles shape and temporal decay terms,\nsupports predictions (b) and (c) over prediction (a) that does not capture the sharp change of regime.\n\narchitectures recursively applied for multi-step [24, 26, 7, 5], on direct multi-step models [3] such as\nSequence To Sequence [34, 60, 57, 61] or State Space Models for probabilistic forecasts [44, 40].\nRegarding training, the huge majority of methods use the Mean Squared Error (MSE) or its variants\n(MAE, etc) as loss functions. However, relying on MSE may arguably be inadequate in our context,\nas illustrated in Fig 1. Here, the target ground truth prediction is a step function (in blue), and we\npresent three predictions, shown in Fig 1(a), (b), and (c), which have a similar MSE loss compared to\nthe target, but very different forecasting skills. Prediction (a) is not adequate for regulation purposes\nsince it doesn\u2019t capture the sharp drop to come. 
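The premise of Fig 1, that very different forecasts can tie under MSE, is easy to verify numerically. A minimal sketch (the values are chosen for illustration and are not taken from the paper):

```python
import numpy as np

# Target: a sharp drop from 1 to 0 at t = 10, over a 20-step horizon.
k = 20
target = np.where(np.arange(k) < 10, 1.0, 0.0)

# (a) non-informative flat prediction at the mean level
pred_a = np.full(k, 0.5)
# (b) correct shape, but the drop is delayed by 5 steps
pred_b = np.where(np.arange(k) < 15, 1.0, 0.0)
# (c) correct timing, but the drop only falls to ~0.707 instead of 0
pred_c = np.where(np.arange(k) < 10, 1.0, 0.707)

mse = lambda p: np.mean((p - target) ** 2)
print(mse(pred_a), mse(pred_b), mse(pred_c))  # all three are close to 0.25
```

All three predictions incur an MSE of about 0.25, even though only (b) and (c) anticipate the regime change.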
Predictions (b) and (c) much better re\ufb02ect the change\nof regime since the sharp drop is indeed anticipated, although with a slight delay (b) or with a slight\ninaccurate amplitude (c).\nThis paper introduces DILATE (DIstortion Loss including shApe and TimE), a new objective\nfunction for training deep neural networks in the context of multi-step and non-stationary time series\nforecasting. DILATE explicitly disentangles into two terms the penalization related to the shape\nand the temporal localization errors of change detection (section 3). The behaviour of DILATE is\nshown in Fig 1: whereas the values of our proposed shape and temporal losses are large in Fig 1(a),\nthe shape (resp. temporal) term is small in Fig 1(b) (resp. Fig 1(c)). DILATE combines shape and\ntemporal terms, and is consequently able to output a much smaller loss for predictions (b) and (c)\nthan for (a), as expected.\nTo train deep neural nets with DILATE, we derive a differentiable loss function for both shape and\ntemporal terms (section 3.1), and an ef\ufb01cient and custom back-prop implementation for speeding\nup optimization (section 3.2). We also introduce a variant of DILATE, which provides a smooth\ngeneralization of temporally-constrained Dynamic Time Warping (DTW) metrics [43, 28]. Exper-\niments carried out on several synthetic and real non-stationary datasets reveal that models trained\nwith DILATE signi\ufb01cantly outperform models trained with the MSE loss function when evaluated\nwith shape and temporal distortion metrics, while DILATE maintains very good performance when\nevaluated with MSE. 
Finally, we show that DILATE can be used with various network architectures\nand can outperform on shape and time metrics state-of-the-art models speci\ufb01cally designed for\nmulti-step and non-stationary forecasting.\n\n2 Related work\n\nTime series forecasting Traditional methods for time series forecasting include linear auto-\nregressive models, such as the ARIMA model [6], and Exponential Smoothing [27], which both fall\ninto the broad category of linear State Space Models (SSMs) [17]. These methods handle linear\ndynamics and stationary time series (or made stationary by temporal differences). However the\nstationarity assumption is not satis\ufb01ed for many real world time series that can present abrupt changes\nof distribution. Since, Recurrent Neural Networks (RNNs) and variants such as Long Short Term\nMemory Networks (LSTMs) [25] have become popular due to their automatic feature extraction abili-\nties, complex patterns and long term dependencies modeling. In the era of deep learning, much effort\nhas been recently devoted to tackle multivariate time series forecasting with a huge number of input\n\n2\n\n\fseries [31], by leveraging attention mechanisms [30, 39, 50, 12] or tensor factorizations [60, 58, 46]\nfor capturing shared information between series. Another current trend is to combine deep learning\nand State Space Models for modeling uncertainty [45, 44, 40, 56]. In this paper we focus on deter-\nministic multi-step forecasting. To this end, the most common approach is to apply recursively a\none-step ahead trained model. Although mono-step learned models can be adapted and improved\nfor the multi-step setting [55], a thorough comparison of the different multi-step strategies [48] has\nrecommended the direct multi-horizon strategy. Of particular interest in this category are Sequence\nTo Sequence (Seq2Seq) RNNs models 1 [44, 31, 60, 57, 19] which achieved great success in machine\ntranslation. 
Theoretical generalization bounds for Seq2Seq forecasting were derived with an addi-\ntional discrepancy term quantifying the non-stationarity of time series [29]. Following the success\nof WaveNet for audio generation [53], Convolutional Neural Networks with dilation have become a\npopular alternative for time series forecasting [5]. The self-attention Transformer architecture [54]\nwas also lately investigated for accessing long-range context regardless of distance [32]. We highlight\nthat our proposed loss function can be used for training any direct multi-step deep architecture.\n\nEvaluation and training metrics The largely dominant loss function to train and evaluate deep\nmodels is the MAE, MSE and its variants (SMAPE, etc). Metrics re\ufb02ecting shape and temporal\nlocalization exist: Dynamic Time Warping [43] for shape ; timing errors can be casted as a detection\nproblem by computing Precision and Recall scores after segmenting series by Change Point Detection\n[8, 33], or by computing the Hausdorff distance between two sets of change points [22, 51]. For\nassessing the detection of ramps in wind and solar energy forecasting, speci\ufb01c algorithms were\ndesigned: for shape, the ramp score [18, 52] based on a piecewise linear approximation of the\nderivatives of time series; for temporal error estimation, the Temporal Distortion Index (TDI) [20, 52].\nHowever, these evaluation metrics are not differentiable, making them unusable as loss functions\nfor training deep neural networks. 
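The complementarity of the two metric families above can be illustrated concretely: DTW scores shape while being invariant to temporal shifts, which is exactly why a timing metric such as the TDI is needed alongside it. A sketch using the textbook DTW recurrence (not the paper's code):

```python
import numpy as np

def dtw(y_pred, y_true):
    """Classical DTW with squared-Euclidean ground cost, O(k*m) dynamic programming."""
    k, m = len(y_pred), len(y_true)
    R = np.full((k + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, k + 1):
        for j in range(1, m + 1):
            cost = (y_pred[i - 1] - y_true[j - 1]) ** 2
            R[i, j] = cost + min(R[i - 1, j - 1], R[i - 1, j], R[i, j - 1])
    return R[k, m]

target  = np.array([0., 0., 0., 1., 1., 1., 0., 0.])
shifted = np.array([0., 0., 0., 0., 1., 1., 1., 0.])  # same shape, 1-step delay

print(dtw(shifted, target))               # 0.0: DTW ignores the delay entirely
print(np.mean((shifted - target) ** 2))   # 0.25: MSE penalizes it
```

The delayed copy has zero DTW cost but a large MSE, so neither metric alone captures both shape and timing.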
The impossibility to directly optimize the appropriate (often non-differentiable) evaluation metric for a given task has bolstered efforts to design good surrogate losses in various domains, for example in ranking [15, 62] or computer vision [38, 59]. Recently, some attempts have been made to train deep neural networks based on alternatives to MSE, especially based on a smooth approximation of the Dynamic Time Warping (DTW) [13, 37, 1]. Training DNNs with a DTW loss enables to focus on the shape error between two signals. However, since DTW is by design invariant to elastic distortions, it completely ignores the temporal localization of the change. In our context of sharp change detection, both shape and temporal distortions are crucial to provide an adequate forecast. A differentiable timing error loss function based on DTW on the event (binary) space was proposed in [41]; however it is only applicable for predicting binary time series. This paper specifically focuses on designing a loss function able to disentangle shape and temporal delay terms for training deep neural networks on real world time series.

3 Training Deep Neural Networks with DILATE

Our proposed framework for multi-step forecasting is depicted in Figure 2.

Figure 2: Our proposed framework for training deep forecasting models.

During training, we consider a set of N input time series $\{x_i\}_{i \in \{1:N\}}$. For each input example of length $n$, i.e. $x_i = (x_i^1, \ldots, x_i^n) \in \mathbb{R}^{p \times n}$, a forecasting model such as a neural network predicts the future k-step ahead trajectory $\hat{y}_i = (\hat{y}_i^1, \ldots, \hat{y}_i^k) \in \mathbb{R}^{d \times k}$. Our DILATE objective function, which compares this prediction $\hat{y}_i$ with the actual ground truth future trajectory $y_i^* = (y_i^{*1}, \ldots, y_i^{*k})$ of length $k$, is composed of two terms balanced by the hyperparameter $\alpha \in [0, 1]$:

$$\mathcal{L}_{DILATE}(\hat{y}_i, y_i^*) = \alpha \, \mathcal{L}_{shape}(\hat{y}_i, y_i^*) + (1 - \alpha) \, \mathcal{L}_{temporal}(\hat{y}_i, y_i^*) \qquad (1)$$

Notations and definitions: Both our shape $\mathcal{L}_{shape}(\hat{y}_i, y_i^*)$ and temporal $\mathcal{L}_{temporal}(\hat{y}_i, y_i^*)$ distortion terms are based on the alignment between the predicted $\hat{y}_i \in \mathbb{R}^{d \times k}$ and ground truth $y_i^* \in \mathbb{R}^{d \times k}$ time series. We define a warping path as a binary matrix $A \in \{0, 1\}^{k \times k}$ with $A_{h,j} = 1$ if $\hat{y}_i^h$ is associated to $y_i^{*j}$, and 0 otherwise. The set of all valid warping paths connecting the endpoints $(1, 1)$ to $(k, k)$ with the authorized moves $\rightarrow, \downarrow, \searrow$ (step condition) is noted $\mathcal{A}_{k,k}$. Let $\Delta(\hat{y}_i, y_i^*) := [\delta(\hat{y}_i^h, y_i^{*j})]_{h,j}$ be the pairwise cost matrix, where $\delta$ is a given dissimilarity between $\hat{y}_i^h$ and $y_i^{*j}$, e.g. the euclidean distance.

¹A Seq2Seq architecture was the winner of a 2017 Kaggle competition on multi-step time series forecasting (https://www.kaggle.com/c/web-traffic-time-series-forecasting)

3.1 Shape and temporal terms

Shape term: Our shape loss function is based on the Dynamic Time Warping (DTW) [43], which corresponds to the following optimization problem: $DTW(\hat{y}_i, y_i^*) = \min_{A \in \mathcal{A}_{k,k}} \langle A, \Delta(\hat{y}_i, y_i^*) \rangle$. The minimizer $A^* = \arg\min_{A \in \mathcal{A}_{k,k}} \langle A, \Delta(\hat{y}_i, y_i^*) \rangle$ is the optimal association (path) between $\hat{y}_i$ and $y_i^*$. By temporally aligning the predicted $\hat{y}_i$ and ground truth $y_i^*$ time series, the DTW loss focuses on the structural shape dissimilarity between signals. The DTW, however, is known to be non-differentiable. We use the smooth min operator $\min_\gamma(a_1, \ldots, a_n) = -\gamma \log\left(\sum_{i=1}^{n} \exp(-a_i / \gamma)\right)$ with $\gamma > 0$ proposed in [13] to define our differentiable shape term $\mathcal{L}_{shape}$:

$$\mathcal{L}_{shape}(\hat{y}_i, y_i^*) = DTW_\gamma(\hat{y}_i, y_i^*) := -\gamma \log \left( \sum_{A \in \mathcal{A}_{k,k}} \exp\left( -\frac{\langle A, \Delta(\hat{y}_i, y_i^*) \rangle}{\gamma} \right) \right) \qquad (2)$$

Temporal term: Our second term $\mathcal{L}_{temporal}$ in Eq (1) aims at penalizing temporal distortions between $\hat{y}_i$ and $y_i^*$. Our analysis is based on the optimal DTW path $A^*$ between $\hat{y}_i$ and $y_i^*$. $A^*$ is used to register both time series when computing DTW and provide a time-distortion invariant loss. Here, we analyze the form of $A^*$ to compute the temporal distortions between $\hat{y}_i$ and $y_i^*$. More precisely, our loss function is inspired from computing the Time Distortion Index (TDI) for temporal misalignment estimation [20, 52], which basically consists in computing the deviation between the optimal DTW path $A^*$ and the first diagonal. We first rewrite a generalized TDI loss function with our notations:

$$TDI(\hat{y}_i, y_i^*) = \langle A^*, \Omega \rangle = \left\langle \arg\min_{A \in \mathcal{A}_{k,k}} \langle A, \Delta(\hat{y}_i, y_i^*) \rangle, \ \Omega \right\rangle \qquad (3)$$

where $\Omega$ is a square matrix of size $k \times k$ penalizing each element $\hat{y}_i^h$ being associated to an $y_i^{*j}$, for $h \neq j$. In our experiments we choose a squared penalization, e.g. $\Omega(h, j) = \frac{1}{k^2}(h - j)^2$, but other variants could be used. Note that prior knowledge can also be incorporated in the $\Omega$ matrix structure, e.g. to penalize more heavily late than early predictions (and vice versa).

Figure 3: DILATE loss computation for separating the shape and temporal errors.

The TDI loss function in Eq (3) is still non-differentiable. Here, we cannot directly use the same smoothing technique as for defining $DTW_\gamma$ in Eq (2), since the minimization involves two different quantities $\Omega$ and $\Delta$. Since the optimal path $A^*$ is itself non-differentiable, we use the fact that $A^* = \nabla_\Delta DTW(\hat{y}_i, y_i^*)$ to define a smooth approximation $A^*_\gamma$ of the $\arg\min$ operator, i.e.:

$$A^*_\gamma = \nabla_\Delta DTW_\gamma(\hat{y}_i, y_i^*) = \frac{1}{Z} \sum_{A \in \mathcal{A}_{k,k}} A \, \exp\left( -\frac{\langle A, \Delta(\hat{y}_i, y_i^*) \rangle}{\gamma} \right),$$

with $Z = \sum_{A \in \mathcal{A}_{k,k}} \exp\left( -\frac{\langle A, \Delta(\hat{y}_i, y_i^*) \rangle}{\gamma} \right)$ being the partition function. Based on $A^*_\gamma$, we obtain our smoothed temporal loss from Eq (3):

$$\mathcal{L}_{temporal}(\hat{y}_i, y_i^*) := \langle A^*_\gamma, \Omega \rangle = \frac{1}{Z} \sum_{A \in \mathcal{A}_{k,k}} \langle A, \Omega \rangle \exp\left( -\frac{\langle A, \Delta(\hat{y}_i, y_i^*) \rangle}{\gamma} \right) \qquad (4)$$

3.2 DILATE Efficient Forward and Backward Implementation

The direct computation of our shape and temporal losses in Eq (2) and Eq (4) is intractable, due to the cardinal of $\mathcal{A}_{k,k}$, which grows exponentially with $k$. We provide a careful implementation of the forward and backward passes in order to make learning efficient.

Shape loss: Regarding $\mathcal{L}_{shape}$, we rely on [13] to efficiently compute the forward pass with a variant of the Bellman dynamic programming approach [4].
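Before turning to the backward pass, the quantities being computed can be made concrete: for tiny k, the sums over all warping paths in Eqs (2) and (4) can be evaluated literally by brute-force enumeration. A didactic sketch of Eqs (1), (2) and (4) under that strategy (a check on toy inputs, not the paper's O(k²) implementation):

```python
import numpy as np

def all_paths(k):
    """Enumerate every warping path from (0, 0) to (k-1, k-1)
    with the moves right, down, diagonal (the set A_{k,k})."""
    paths = []
    def walk(h, j, cells):
        cells = cells + [(h, j)]
        if (h, j) == (k - 1, k - 1):
            paths.append(cells)
            return
        if h + 1 < k:
            walk(h + 1, j, cells)
        if j + 1 < k:
            walk(h, j + 1, cells)
        if h + 1 < k and j + 1 < k:
            walk(h + 1, j + 1, cells)
    walk(0, 0, [])
    return paths

def dilate(y_pred, y_true, alpha=0.5, gamma=0.01):
    """Brute-force DILATE (Eq 1): soft-DTW shape term (Eq 2)
    plus the smoothed TDI temporal term (Eq 4), for tiny k only."""
    k = len(y_pred)
    delta = (np.asarray(y_pred)[:, None] - np.asarray(y_true)[None, :]) ** 2
    omega = (np.arange(k)[:, None] - np.arange(k)[None, :]) ** 2 / k ** 2
    paths = all_paths(k)
    costs = np.array([sum(delta[h, j] for h, j in p) for p in paths])  # <A, Delta>
    tdis  = np.array([sum(omega[h, j] for h, j in p) for p in paths])  # <A, Omega>
    z = np.exp(-(costs - costs.min()) / gamma)       # stabilized soft-min weights
    l_shape = costs.min() - gamma * np.log(z.sum())  # Eq (2), log-sum-exp form
    l_temporal = np.sum((z / z.sum()) * tdis)        # Eq (4)
    return alpha * l_shape + (1 - alpha) * l_temporal

y = np.array([0., 0., 1., 1.])
z_shift = np.array([0., 1., 1., 1.])  # same step shape, one step early
print(dilate(y, y), dilate(z_shift, y))  # the shifted copy gets the larger loss
```

With α = 1 both series give a near-zero (shape-only) loss, since a zero-cost warping path exists for the shifted copy; the temporal term is what separates them, which is the point of Eq (1).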
For the backward pass, we implement the recursion proposed in [13] in a custom PyTorch loss. This implementation is much more efficient than relying on vanilla auto-differentiation, since it reuses intermediate results from the forward pass.

Temporal loss: For $\mathcal{L}_{temporal}$, note that the bottleneck for the forward pass in Eq (4) is to compute $A^*_\gamma = \nabla_\Delta DTW_\gamma(\hat{y}_i, y_i^*)$, which we implement as explained for the $\mathcal{L}_{shape}$ backward pass. Regarding the $\mathcal{L}_{temporal}$ backward pass, we need to compute the Hessian $\nabla^2_\Delta DTW_\gamma(\hat{y}_i, y_i^*)$. We use the method proposed in [37], based on a dynamic programming implementation that we embed in a custom PyTorch loss. Again, our back-prop implementation allows a significant speed-up compared to standard auto-differentiation (see section 4.4). The resulting time complexity of both shape and temporal losses for forward and backward is $O(k^2)$.

Discussion: A variant of our approach to combine shape and temporal penalization would be to incorporate a temporal term inside our smooth $\mathcal{L}_{shape}$ function in Eq (2), i.e.:

$$\mathcal{L}_{DILATE_t}(\hat{y}_i, y_i^*) := -\gamma \log \left( \sum_{A \in \mathcal{A}_{k,k}} \exp\left( -\frac{\langle A, \ \alpha \Delta(\hat{y}_i, y_i^*) + (1 - \alpha)\Omega \rangle}{\gamma} \right) \right) \qquad (5)$$

We can notice that Eq (5) reduces to minimizing $\langle A, \alpha \Delta(\hat{y}_i, y_i^*) + (1 - \alpha)\Omega \rangle$ when $\gamma \to 0^+$. In this case, $\mathcal{L}_{DILATE_t}$ can recover DTW variants studied in the literature to bias the computation based on penalizing sequence misalignment, by designing specific $\Omega$ matrices:

Sakoe-Chiba DTW hard band constraint [43]: $\Omega(h, j) = +\infty$ if $|h - j| > T$, 0 otherwise
Weighted DTW [28]: $\Omega(h, j) = f(|h - j|)$, with $f$ an increasing function

$\mathcal{L}_{DILATE_t}$ in Eq (5) enables to train deep neural networks with a smooth loss combining shape and temporal criteria. However, $\mathcal{L}_{DILATE_t}$ presents limited capacities for disentangling the shape and temporal errors, since the optimal path is computed from both shape and temporal terms. In contrast, our $\mathcal{L}_{DILATE}$ loss in Eq (1) separates the loss into two shape and temporal misalignment components, the temporal penalization being applied to the optimal unconstrained DTW path. We verify experimentally that our $\mathcal{L}_{DILATE}$ outperforms its "tangled" version $\mathcal{L}_{DILATE_t}$ (section 4.3).

4 Experiments

4.1 Experimental setup

Datasets: To illustrate the relevance of DILATE, we carry out experiments on 3 non-stationary time series datasets from different domains (see examples in Fig 4). The multi-step evaluation consists in forecasting the future trajectory on k future time steps.

Synthetic (k = 20): this dataset consists in predicting sudden changes (step functions) based on an input signal composed of two peaks. This controlled setup was designed to measure precisely the shape and time errors of predictions. We generate 500 time series for train, 500 for validation and 500 for test, with 40 time steps: the first 20 are the inputs, the last 20 are the targets to forecast.
In each series, the input range is composed of 2 peaks of random temporal positions i1 and i2 and random amplitudes j1 and j2 between 0 and 1, and the target range is composed of a step of amplitude j2 − j1 and stochastic position i2 + (i2 − i1) + randint(−3, 3). All time series are corrupted by an additive gaussian white noise of variance 0.01.

ECG5000 (k = 56): this dataset comes from the UCR Time Series Classification Archive [10], and is composed of 5000 electrocardiograms (ECG) (500 for training, 4500 for testing) of length 140. We take the first 84 time steps (60 %) as input and predict the last 56 steps (40 %) of each time series (same setup as in [13]).

Traffic (k = 24): this dataset corresponds to road occupancy rates (between 0 and 1) from the California Department of Transportation (48 months from 2015-2016) measured every 1h. We work on the first univariate series of length 17544 (with the same 60/20/20 train/valid/test split as in [30]), and we train models to predict the 24 future points given the past 168 points (past week).

Network architectures and training: We perform multi-step forecasting with two kinds of neural network architectures: a fully connected network (1 layer of 128 neurons), which does not make any assumption on data structure, and a more specialized Seq2Seq model [47] with Gated Recurrent Units (GRU) [11] with 1 layer of 128 units. Each model is trained with PyTorch for a maximum of 1000 epochs with early stopping, using the ADAM optimizer. The smoothing parameter γ of DTW and TDI is set to 10⁻². The hyperparameter α balancing Lshape and Ltemporal is determined on a validation set so as to obtain a DTW shape performance comparable to that of the DTWγ-trained model: α = 0.5 for Synthetic and ECG5000, and 0.8 for Traffic.
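The Synthetic generation protocol described above can be sketched in a few lines. One possible reading (the exact peak positions, index conventions, and clipping are assumptions not fixed by the text):

```python
import numpy as np

def make_series(rng, n_steps=40, noise_var=0.01):
    """One synthetic series: two peaks in the input half, then a target step
    (one reading of the paper's protocol; index conventions are assumptions)."""
    x = np.zeros(n_steps)
    # two peaks at random positions in the input range, random amplitudes in (0, 1)
    i1, i2 = sorted(rng.choice(np.arange(1, 20), size=2, replace=False))
    j1, j2 = rng.uniform(0, 1, size=2)
    x[i1], x[i2] = j1, j2
    # target range: a step of amplitude j2 - j1 at a stochastic position,
    # clipped into the target half (clipping is an assumption)
    step_pos = np.clip(i2 + (i2 - i1) + rng.integers(-3, 4), 20, n_steps - 1)
    x[step_pos:] += j2 - j1
    # additive gaussian white noise of variance 0.01
    return x + rng.normal(0.0, np.sqrt(noise_var), n_steps)

rng = np.random.default_rng(0)
train = np.stack([make_series(rng) for _ in range(500)])
print(train.shape)  # (500, 40): first 20 columns are inputs, last 20 are targets
```

Splitting each row into its first 20 and last 20 steps then yields the (input, target) pairs used for training.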
Our code implementing DILATE is available online at https://github.com/vincent-leguen/DILATE.

4.2 DILATE forecasting performances

We evaluate the performances of DILATE, and compare it against two strong baselines: the widely used Euclidean (MSE) loss, and the smooth DTW introduced in [13, 37]. For each experiment, we use the same neural network architecture (section 4.1), in order to isolate the impact of the training loss and to enable fair comparisons. The results are evaluated using three metrics: MSE, DTW (shape) and TDI (temporal). We perform a Student t-test with significance level 0.05 to highlight the best method(s) in each experiment (averaged over 10 runs). Overall results are presented in Table 1.

| Dataset | Eval | MSE (MLP) | DTWγ [13] (MLP) | DILATE, ours (MLP) | MSE (Seq2Seq) | DTWγ [13] (Seq2Seq) | DILATE, ours (Seq2Seq) |
|---|---|---|---|---|---|---|---|
| Synth | MSE | 1.65 ± 0.14 | 4.82 ± 0.40 | 1.67 ± 0.184 | 1.10 ± 0.17 | 2.31 ± 0.45 | 1.21 ± 0.13 |
| Synth | DTW | 38.6 ± 1.28 | 27.3 ± 1.37 | 32.1 ± 5.33 | 24.6 ± 1.20 | 22.7 ± 3.55 | 23.1 ± 2.44 |
| Synth | TDI | 15.3 ± 1.39 | 26.9 ± 4.16 | 13.8 ± 0.712 | 17.2 ± 1.22 | 20.0 ± 3.72 | 14.8 ± 1.29 |
| ECG | MSE | 31.5 ± 1.39 | 70.9 ± 37.2 | 37.2 ± 3.59 | 21.2 ± 2.24 | 75.1 ± 6.30 | 30.3 ± 4.10 |
| ECG | DTW | 19.5 ± 0.159 | 18.4 ± 0.749 | 17.7 ± 0.427 | 17.8 ± 1.62 | 17.1 ± 0.650 | 16.1 ± 0.156 |
| ECG | TDI | 7.58 ± 0.192 | 38.9 ± 8.76 | 7.21 ± 0.886 | 8.27 ± 1.03 | 27.2 ± 11.1 | 6.59 ± 0.786 |
| Traffic | MSE | 0.620 ± 0.010 | 2.52 ± 0.230 | 1.93 ± 0.080 | 0.890 ± 0.11 | 2.22 ± 0.26 | 1.00 ± 0.260 |
| Traffic | DTW | 24.6 ± 0.180 | 23.4 ± 5.40 | 23.1 ± 0.41 | 24.6 ± 1.85 | 22.6 ± 1.34 | 23.0 ± 1.62 |
| Traffic | TDI | 16.8 ± 0.799 | 27.4 ± 5.01 | 16.7 ± 0.508 | 15.4 ± 2.25 | 22.3 ± 3.66 | 14.4 ± 1.58 |

Table 1: Forecasting results evaluated with MSE (×100), DTW (×100) and TDI (×10) metrics, averaged over 10 runs (mean ± standard deviation). For each experiment, best method(s) (Student t-test) in bold.

MSE comparison: DILATE outperforms MSE when evaluated on shape (DTW) in all experiments, with significant differences on 5/6 experiments. When evaluated on time (TDI), DILATE also performs better in all experiments (significant differences on 3/6 tests). Finally, DILATE is equivalent to MSE when evaluated on MSE on 3/6 experiments.

DTWγ [13, 37] comparison: When evaluated on shape (DTW), DILATE performs similarly to DTWγ (2 significant improvements, 1 significant drop and 3 equivalent performances). For time (TDI) and MSE evaluations, DILATE is significantly better than DTWγ in all experiments, as expected.

We display a few qualitative examples for the Synthetic, ECG5000 and Traffic datasets in Fig 4 (other examples are provided in supplementary 2). We see that MSE training leads to predictions that are non-sharp, making them inadequate in the presence of drops or sharp spikes. DTWγ leads to very sharp predictions in shape, but with a possibly large temporal misalignment. In contrast, our DILATE predicts series that have both a correct shape and a precise temporal localization.

Figure 4: Qualitative forecasting results.

Evaluation with external metrics: To consolidate the good behaviour of our loss function seen in Table 1, we extend the comparison using two additional (non differentiable) metrics for assessing shape and time. For shape, we compute the ramp score [52].
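The timing metric used below, a Hausdorff distance between two sets of detected change points, is compact enough to transcribe directly. A sketch, assuming the change points have already been extracted from each series:

```python
def hausdorff(T_star, T_hat):
    """Hausdorff distance between two sets of change-point locations:
    the larger of the two directed nearest-neighbour distances."""
    d = lambda src, dst: max(min(abs(t - u) for u in dst) for t in src)
    return max(d(T_hat, T_star), d(T_star, T_hat))

print(hausdorff({3, 10}, {4, 12}))  # 2
print(hausdorff({5}, {5}))          # 0
```

A small value means every predicted change point sits near a true one and vice versa; a single spurious or missed change point far from the others dominates the score.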
For time, we perform change point detection on both series and compute the Hausdorff measure between the sets of detected change points $\mathcal{T}^*$ (in the target signal) and $\hat{\mathcal{T}}$ (in the predicted signal):

$$\text{Hausdorff}(\mathcal{T}^*, \hat{\mathcal{T}}) := \max\left( \max_{\hat{t} \in \hat{\mathcal{T}}} \min_{t^* \in \mathcal{T}^*} |\hat{t} - t^*|, \ \max_{t^* \in \mathcal{T}^*} \min_{\hat{t} \in \hat{\mathcal{T}}} |\hat{t} - t^*| \right) \qquad (6)$$

We provide more details about these external metrics in supplementary 1.1. In Table 2, we report the comparison between Seq2Seq models trained with DILATE, DTWγ and MSE. We see that DILATE is always better than MSE in shape (Ramp score) and equivalent to DTWγ in 2/3 experiments. In time (Hausdorff metric), DILATE is always better or equivalent compared to MSE (and always better than DTWγ, as expected).

| Dataset | Eval | MSE | DTWγ [13] | DILATE (ours) |
|---|---|---|---|---|
| Synthetic | Hausdorff | 2.87 ± 0.127 | 3.45 ± 0.318 | 2.70 ± 0.166 |
| Synthetic | Ramp score (×10) | 5.80 ± 0.104 | 4.27 ± 0.800 | 4.99 ± 0.460 |
| ECG5000 | Hausdorff | 4.32 ± 0.505 | 6.16 ± 0.854 | 4.23 ± 0.414 |
| ECG5000 | Ramp score | 4.84 ± 0.240 | 4.79 ± 0.365 | 4.80 ± 0.249 |
| Traffic | Hausdorff | 2.16 ± 0.378 | 2.29 ± 0.329 | 2.13 ± 0.514 |
| Traffic | Ramp score (×10) | 6.29 ± 0.319 | 5.78 ± 0.404 | 5.93 ± 0.235 |

Table 2: Forecasting results of Seq2Seq evaluated with Hausdorff and Ramp Score, averaged over 10 runs (mean ± standard deviation). For each experiment, best method(s) (Student t-test) in bold.

4.3 Comparison to temporally constrained versions of DTW

In Table 3, we compare the Seq2Seq DILATE to its tangled variants Weighted DTW (DILATEt-W) [28] and Band Constraint (DILATEt-BC) [43] on the Synthetic dataset. We observe that DILATE performances are similar in shape for both the DTW and ramp metrics and better in time than both variants.
This shows that our DILATE leads to a finer disentanglement of the shape and time components. Results for ECG5000 and Traffic are consistent and given in supplementary 3. We also analyze the gradient of DILATE vs DILATEt-W in supplementary 3, showing that DILATEt-W gradients are smaller at low temporal shifts, certainly explaining the superiority of our approach when evaluated with temporal metrics. Qualitative predictions are also provided in supplementary 3.

| | Eval loss | DILATE (ours) | DILATEt-W [28] | DILATEt-BC [43] |
|---|---|---|---|---|
| Euclidean | MSE (×100) | 1.21 ± 0.130 | 1.36 ± 0.107 | 1.83 ± 0.163 |
| Shape | DTW (×100) | 23.1 ± 2.44 | 20.5 ± 2.49 | 21.6 ± 1.74 |
| Shape | Ramp | 4.99 ± 0.460 | 5.56 ± 0.87 | 5.23 ± 0.439 |
| Time | TDI (×10) | 14.8 ± 1.29 | 17.8 ± 1.72 | 19.6 ± 1.72 |
| Time | Hausdorff | 2.70 ± 0.166 | 2.85 ± 0.210 | 3.30 ± 0.273 |

Table 3: Comparison to the tangled variants of DILATE for the Seq2Seq model on the Synthetic dataset, averaged over 10 runs (mean ± standard deviation).

4.4 DILATE Analysis

Custom backward implementation speedup: We compare in Fig 5(a) the computational time between the standard PyTorch auto-differentiation mechanism and our custom backward pass implementation (section 3.2). We plot the speedup of our implementation with respect to the prediction length k (averaged over 10 random target/prediction tuples). We notice the increasing speedup with respect to k: a speedup of ×20 for 20 steps ahead and up to ×35 for 100 steps ahead predictions.

Impact of α (Fig 5(b)): When α = 1, LDILATE reduces to DTWγ, with a good shape but a large temporal error. When α → 0, we only minimize Ltemporal without any shape constraint.
Both MSE and shape errors explode in this case, illustrating the fact that Ltemporal is only meaningful in conjunction with Lshape.

Figure 5(a): Speedup of DILATE. Figure 5(b): Influence of α.

4.5 Comparison to state of the art time series forecasting models

Finally, we compare our Seq2Seq model trained with DILATE with two recent state-of-the-art deep architectures for time series forecasting trained with MSE: LSTNet [30], trained for one-step prediction, which we apply recursively for multi-step (LSTNet-rec); and Tensor-Train RNN (TT-RNN) [60], trained for multi-step². Results in Table 4 for the Traffic dataset reveal the superiority of TT-RNN over LSTNet-rec, which shows that dedicated multi-step prediction approaches are better suited for this task. More importantly, we can observe that our Seq2Seq DILATE outperforms TT-RNN in all shape and time metrics, although it is inferior on MSE. This highlights the relevance of our DILATE loss function, which enables to reach better performances with simpler architectures.

| | Eval loss | LSTNet-rec [30] | TT-RNN [60, 61] | Seq2Seq DILATE |
|---|---|---|---|---|
| Euclidean | MSE (×100) | 1.74 ± 0.11 | 0.837 ± 0.106 | 1.00 ± 0.260 |
| Shape | DTW (×100) | 42.0 ± 2.2 | 25.9 ± 1.99 | 23.0 ± 1.62 |
| Shape | Ramp (×10) | 9.00 ± 0.577 | 6.71 ± 0.546 | 5.93 ± 0.235 |
| Time | TDI (×10) | 25.7 ± 4.75 | 17.8 ± 1.73 | 14.4 ± 1.58 |
| Time | Hausdorff | 2.34 ± 1.41 | 2.19 ± 0.125 | 2.13 ± 0.514 |

Table 4: Comparison with state-of-the-art forecasting architectures trained with MSE on Traffic, averaged over 10 runs (mean ± standard deviation).

²We use the available Github code for both methods.

5 Conclusion and future work

In this paper, we have introduced DILATE, a new differentiable loss function for training deep multi-step time series forecasting models. DILATE combines two terms for precise shape and temporal localization of non-stationary signals with sudden changes.
We showed that DILATE is comparable
to the standard MSE loss when evaluated on MSE, and far better when evaluated on several shape
and timing metrics. DILATE also compares favourably on shape and timing to state-of-the-art
forecasting algorithms trained with MSE.
For future work, we intend to extend these ideas to probabilistic forecasting, for
example by using Bayesian deep learning [21] to compute the predictive distribution of trajectories,
or by embedding the DILATE loss into a deep state space model architecture suited for probabilistic
forecasting. Another interesting direction is to adapt our training scheme to relaxed supervision
contexts, e.g. semi-supervised [42] or weakly supervised [16], in order to perform full trajectory
forecasting using only categorical labels at training time (e.g. presence or absence of change points).

Acknowledgements We would like to thank Stéphanie Dubost, Bruno Charbonnier, Christophe
Chaussin, Loïc Vallance, Lorenzo Audibert, Nicolas Paul and our anonymous reviewers for their
useful feedback and discussions.

² We use the available GitHub code for both methods.

References

[1] Abubakar Abid and James Zou. Learning a warping distance from unlabeled time series using
sequence autoencoders. In Advances in Neural Information Processing Systems (NeurIPS),
pages 10547–10555, 2018.

[2] Nguyen Hoang An and Duong Tuan Anh. Comparison of strategies for multi-step-ahead
prediction of time series using neural network. In International Conference on Advanced
Computing and Applications (ACOMP), pages 142–149. IEEE, 2015.

[3] Yukun Bao, Tao Xiong, and Zhongyi Hu. Multi-step-ahead time series prediction using
multiple-output support vector regression. Neurocomputing, 129:482–493, 2014.

[4] Richard Bellman. On the theory of dynamic programming.
Proceedings of the National
Academy of Sciences of the United States of America, 38(8):716, 1952.

[5] Anastasia Borovykh, Sander Bohte, and Cornelis W Oosterlee. Conditional time series
forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

[6] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series
analysis: forecasting and control. John Wiley & Sons, 2015.

[7] Rohitash Chandra, Yew-Soon Ong, and Chi-Keong Goh. Co-evolutionary multi-task learning
with predictive recurrence for multi-step chaotic time series prediction. Neurocomputing,
243:21–34, 2017.

[8] Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabás Póczos. Kernel change-point
detection with auxiliary deep generative models. In International Conference on Learning
Representations (ICLR), 2019.

[9] Sucheta Chauhan and Lovekesh Vig. Anomaly detection in ECG time signals via deep long
short-term memory networks. In International Conference on Data Science and Advanced
Analytics (DSAA), pages 1–7. IEEE, 2015.

[10] Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen,
and Gustavo Batista. The UCR time series classification archive. 2015.

[11] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares,
Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder
for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

[12] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter
Stewart. RETAIN: An interpretable predictive model for healthcare using reverse time attention
mechanism. In Advances in Neural Information Processing Systems (NIPS), pages 3504–3512,
2016.

[13] Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series.
In
International Conference on Machine Learning (ICML), pages 894–903, 2017.

[14] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. Deep learning for event-driven stock
prediction. In International Joint Conference on Artificial Intelligence (IJCAI), 2015.

[15] Thibaut Durand, Nicolas Thome, and Matthieu Cord. MANTRA: Minimum maximum latent
structural SVM for image classification and ranking. In IEEE International Conference on
Computer Vision (ICCV), pages 2713–2721, 2015.

[16] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Exploiting negative evidence for deep
latent structured models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
41(2):337–351, 2018.

[17] James Durbin and Siem Jan Koopman. Time series analysis by state space methods. Oxford
University Press, 2012.

[18] Anthony Florita, Bri-Mathias Hodge, and Kirsten Orwig. Identifying wind and solar ramping
events. In 2013 IEEE Green Technologies Conference (GreenTech), pages 147–152. IEEE,
2013.

[19] Ian Fox, Lynn Ang, Mamta Jaiswal, Rodica Pop-Busui, and Jenna Wiens. Deep multi-output
forecasting: Learning to accurately predict blood glucose trajectories. In ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages 1387–1395. ACM,
2018.

[20] Laura Frías-Paredes, Fermín Mallor, Martín Gastón-Romeo, and Teresa León. Assessing energy
forecasting inaccuracy by simultaneously considering temporal and absolute errors. Energy
Conversion and Management, 142:533–546, 2017.

[21] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model
uncertainty in deep learning. In International Conference on Machine Learning (ICML), pages
1050–1059, 2016.

[22] Damien Garreau, Sylvain Arlot, et al. Consistent change-point detection with kernels.
Electronic
Journal of Statistics, 12(2):4440–4486, 2018.

[23] Amir Ghaderi, Borhan M Sanandaji, and Faezeh Ghaderi. Deep forecast: Deep learning-based
spatio-temporal forecasting. In ICML Time Series Workshop, 2017.

[24] Agathe Girard and Carl Edward Rasmussen. Multiple-step ahead prediction for non linear
dynamic systems - a Gaussian process treatment with propagation of the uncertainty. In Advances
in Neural Information Processing Systems (NIPS), volume 15, pages 529–536, 2002.

[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation,
9(8):1735–1780, November 1997.

[26] Shamina Hussein, Rohitash Chandra, and Anuraganand Sharma. Multi-step-ahead chaotic
time series prediction using coevolutionary recurrent neural networks. In IEEE Congress on
Evolutionary Computation (CEC), pages 3084–3091. IEEE, 2016.

[27] Rob Hyndman, Anne B Koehler, J Keith Ord, and Ralph D Snyder. Forecasting with exponential
smoothing: the state space approach. Springer Science & Business Media, 2008.

[28] Young-Seon Jeong, Myong K Jeong, and Olufemi A Omitaomu. Weighted dynamic time
warping for time series classification. Pattern Recognition, 44:2231–2240, 2011.

[29] Vitaly Kuznetsov and Zelda Mariet. Foundations of sequence-to-sequence modeling for time
series. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

[30] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term
temporal patterns with deep neural networks. In ACM SIGIR Conference on Research &
Development in Information Retrieval, pages 95–104. ACM, 2018.

[31] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event
forecasting with neural networks at Uber.
In International Conference on Machine Learning
(ICML), number 34, pages 1–5, 2017.

[32] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng
Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time
series forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

[33] Shuang Li, Yao Xie, Hanjun Dai, and Le Song. M-statistic for kernel change-point detection.
In Advances in Neural Information Processing Systems (NIPS), pages 3366–3374, 2015.

[34] Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion convolutional recurrent neural
network: Data-driven traffic forecasting. In International Conference on Learning Representations
(ICLR), 2018.

[35] Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and Fei-Yue Wang. Traffic flow prediction
with big data: a deep learning approach. IEEE Transactions on Intelligent Transportation
Systems, 16(2):865–873, 2015.

[36] Shamsul Masum, Ying Liu, and John Chiverton. Multi-step time series forecasting of electric
load using machine learning models. In International Conference on Artificial Intelligence and
Soft Computing, pages 148–159. Springer, 2018.

[37] Arthur Mensch and Mathieu Blondel. Differentiable dynamic programming for structured
prediction and attention. In International Conference on Machine Learning (ICML), 2018.

[38] Sebastian Nowozin. Optimal decisions from probabilistic models: the intersection-over-union
case. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 548–555,
2014.

[39] Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison W Cottrell. A
dual-stage attention-based recurrent neural network for time series prediction. In International
Joint Conference on Artificial Intelligence (IJCAI), pages 2627–2633.
AAAI Press, 2017.

[40] Syama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang,
and Tim Januschowski. Deep state space models for time series forecasting. In Advances in
Neural Information Processing Systems (NeurIPS), pages 7785–7794, 2018.

[41] François Rivest and Richard Kohar. A new timing error cost function for binary time series
prediction. IEEE Transactions on Neural Networks and Learning Systems, 2019.

[42] Thomas Robert, Nicolas Thome, and Matthieu Cord. HybridNet: Classification and
reconstruction cooperation for semi-supervised learning. In European Conference on Computer
Vision (ECCV), pages 153–169, 2018.

[43] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken
word recognition. Readings in Speech Recognition, 159:224, 1990.

[44] David Salinas, Valentin Flunkert, and Jan Gasthaus. DeepAR: Probabilistic forecasting with
autoregressive recurrent networks. In International Conference on Machine Learning (ICML),
2017.

[45] Matthias W Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand
forecasting for large inventories. In Advances in Neural Information Processing Systems (NIPS),
pages 4646–4654, 2016.

[46] Rajat Sen, Hsiang-Fu Yu, and Inderjit Dhillon. Think globally, act locally: a deep neural
network approach to high dimensional time series forecasting. In Advances in Neural Information
Processing Systems (NeurIPS), 2019.

[47] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural
networks. In Advances in Neural Information Processing Systems (NIPS), pages 3104–3112,
2014.

[48] Souhaib Ben Taieb and Amir F Atiya. A bias and variance analysis for multistep-ahead time
series forecasting. IEEE Transactions on Neural Networks and Learning Systems, 27(1):62–76,
2016.

[49] Souhaib Ben Taieb, Gianluca Bontempi, Amir F Atiya, and Antti Sorjamaa. A review and
comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting
competition. Expert Systems with Applications, 39(8):7067–7083, 2012.

[50] Yunzhe Tao, Lin Ma, Weizhong Zhang, Jian Liu, Wei Liu, and Qiang Du. Hierarchical
attention-based recurrent highway networks for time series prediction. arXiv preprint arXiv:1806.00685,
2018.

[51] Charles Truong, Laurent Oudre, and Nicolas Vayatis. Supervised kernel change point detection
with partial annotations. In International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3147–3151. IEEE, 2019.

[52] Loïc Vallance, Bruno Charbonnier, Nicolas Paul, Stéphanie Dubost, and Philippe Blanc. Towards
a standardized procedure to assess solar forecast accuracy: A new ramp and time
alignment metric. Solar Energy, 150:408–422, 2017.

[53] Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[54] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems (NIPS), pages 5998–6008, 2017.

[55] Arun Venkatraman, Martial Hebert, and J Andrew Bagnell. Improving multi-step prediction of
learned time series models. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[56] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski.
Deep factors for forecasting. In International Conference on Machine Learning (ICML), pages
6607–6617, 2019.

[57] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A
multi-horizon quantile recurrent forecaster.
NIPS Time Series Workshop, 2017.

[58] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S Dhillon. Temporal regularized matrix factorization
for high-dimensional time series prediction. In Advances in Neural Information Processing Systems
(NIPS), pages 847–855, 2016.

[59] Jiaqian Yu and Matthew B Blaschko. The Lovász hinge: A novel convex surrogate for
submodular losses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[60] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using
tensor-train RNNs. arXiv preprint arXiv:1711.00073, 2017.

[61] Rose Yu, Stephan Zheng, and Yan Liu. Learning chaotic dynamics using tensor recurrent
neural networks. In ICML Workshop on Deep Structured Prediction, volume 17, 2017.

[62] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method
for optimizing average precision. In ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 271–278. ACM, 2007.

[63] Jian Zheng, Cencen Xu, Ziang Zhang, and Xiaohua Li. Electric load forecasting in smart grids
using long-short-term-memory based recurrent neural network. In 51st Annual Conference on
Information Sciences and Systems (CISS), pages 1–6. IEEE, 2017.