{"title": "Think Globally, Act Locally: A Deep Neural Network Approach to High-Dimensional Time Series Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 4837, "page_last": 4846, "abstract": "Forecasting high-dimensional time series plays a crucial role in many applications such as demand forecasting and financial predictions. Modern datasets can have millions of correlated time-series that evolve together, i.e they are extremely high dimensional (one dimension for each individual time-series). There is a need for exploiting global patterns and coupling them with local calibration for better prediction. However, most recent deep learning approaches in the literature are one-dimensional, i.e, even though they are trained on the whole dataset, during prediction, the future forecast for a single dimension mainly depends on past values from the same dimension. In this paper, we seek to correct this deficiency and propose DeepGLO, a deep forecasting model which thinks globally and acts locally. In particular, DeepGLO is a hybrid model that combines a global matrix factorization model regularized by a temporal convolution network, along with another temporal network that can capture local properties of each time-series and associated covariates. Our model can be trained effectively on high-dimensional but diverse time series, where different time series can have vastly different scales, without a priori normalization or rescaling. 
Empirical results demonstrate that DeepGLO can outperform state-of-the-art approaches; for example, we see more than 25% improvement in WAPE over other methods on a public dataset that contains more than 100K-dimensional time series.", "full_text": "Think Globally, Act Locally: A Deep Neural Network Approach to High-Dimensional Time Series Forecasting\n\nRajat Sen¹, Hsiang-Fu Yu¹, and Inderjit Dhillon²\n\n¹Amazon    ²Amazon and UT Austin\n\nAbstract\n\nForecasting high-dimensional time series plays a crucial role in many applications such as demand forecasting and financial predictions. Modern datasets can have millions of correlated time-series that evolve together, i.e., they are extremely high dimensional (one dimension for each individual time-series). There is a need for exploiting global patterns and coupling them with local calibration for better prediction. However, most recent deep learning approaches in the literature are one-dimensional, i.e., even though they are trained on the whole dataset, during prediction, the future forecast for a single dimension mainly depends on past values from the same dimension. In this paper, we seek to correct this deficiency and propose DeepGLO, a deep forecasting model which thinks globally and acts locally. In particular, DeepGLO is a hybrid model that combines a global matrix factorization model regularized by a temporal convolution network, along with another temporal network that can capture local properties of each time-series and associated covariates. Our model can be trained effectively on high-dimensional but diverse time series, where different time series can have vastly different scales, without a priori normalization or rescaling. 
Empirical results demonstrate that DeepGLO can outperform state-of-the-art approaches; for example, we see more than 25% improvement in WAPE over other methods on a public dataset that contains more than 100K-dimensional time series.\n\n1 Introduction\n\nTime-series forecasting is an important problem with many industrial applications like retail demand forecasting [21], financial predictions [15], and predicting traffic or weather patterns [5]. In general it plays a key role in automating business processes [17]. Modern data-sets can have millions of correlated time-series over several thousand time-points. For instance, in an online shopping portal like Amazon or Walmart, one may be interested in the future daily demands for all items in a category, where the number of items may be in the millions. This leads to a problem of forecasting n time-series (one for each of the n items), given past demands over t time-steps. Such a time series data-set can be represented as a matrix Y ∈ R^{n×t}, and we are interested in the high-dimensional setting where n can be of the order of millions.\n\nTraditional time-series forecasting methods operate on individual time-series or a small number of time-series at a time. These methods include the well known AR, ARIMA, exponential smoothing [19], the classical Box-Jenkins methodology [4] and, more generally, the linear state-space models [13]. However, these methods are not easily scalable to large data-sets with millions of time-series, owing to the need for individual training. Moreover, they cannot benefit from shared temporal patterns in the whole data-set during training and prediction.\n\nDeep networks have gained popularity in time-series forecasting recently, due to their ability to model non-linear temporal patterns. 
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nRecurrent Neural Networks (RNNs) [10] have been popular in sequential modeling; however, they suffer from the gradient vanishing/exploding problems in training. Long Short Term Memory (LSTM) [11] networks alleviate that issue and have had great success in language modeling and other seq-to-seq tasks [11, 22]. Recently, deep time-series models have used LSTM blocks as internal components [9, 20]. Another popular architecture, that is competitive with LSTMs and arguably easier to train, is temporal convolutions/causal convolutions popularized by the wavenet model [24]. Temporal convolutions have recently been used in time-series forecasting [3, 2]. These deep network based models can be trained on large time-series data-sets as a whole, in mini-batches. However, they still have two important shortcomings.\n\nFirstly, most of the above deep models are difficult to train on data-sets that have wide variation in the scales of the individual time-series. For instance, in the item demand forecasting use-case, the demand for some popular items may be orders of magnitude more than those of niche items. In such data-sets, each time-series needs to be appropriately normalized in order for training to succeed, and then the predictions need to be scaled back to the original scale. The mode and parameters of normalization are difficult to choose and can lead to different accuracies. For example, in [9, 20] each time-series is whitened using the corresponding empirical mean and standard deviation, while in [3] the time-series are scaled by the corresponding value at the first time-point.\n\nSecondly, even though these deep models are trained on the entire data-set, during prediction the models only focus on local past data, i.e., only the past data of a time-series is used for predicting the future of that time-series. 
However, in most datasets, global properties may be useful during prediction time. For instance, in stock market predictions, it might be beneficial to also look at the past values of Alphabet's and Amazon's stock prices while predicting the stock price of Apple. Similarly, in retail demand forecasting, past values of similar items can be leveraged while predicting the future for a certain item. To this end, in [16], the authors propose a combination of 2D convolution and recurrent connections, that can take in multiple time-series in the input layer, thus capturing global properties during prediction. However, this method does not scale beyond a few thousand time-series, owing to the growing size of the input layer. On the other end of the spectrum, TRMF [29] is a temporally regularized matrix factorization model that can express all the time-series as linear combinations of basis time-series. These basis time-series can capture global patterns during prediction. However, TRMF can only model linear temporal dependencies. Moreover, there can be approximation errors due to the factorization, which can be interpreted as a lack of local focus, i.e., the model only concentrates on the global patterns during prediction.\n\nIn light of the above discussion, we aim to propose a deep learning model that can think globally and act locally, i.e., leverage both local and global patterns during training and prediction, and that can also be trained reliably even when there are wide variations in scale. 
The main contributions of this paper are as follows:\n\n• In Section 4, we discuss issues with wide variations in scale among different time-series, and propose a simple initialization scheme, LeveledInit, for Temporal Convolution Networks (TCN) that enables training without a priori normalization.\n\n• In Section 5.1, we present a matrix factorization model regularized by a TCN (TCN-MF), that can express each time-series as a linear combination of k basis time-series, where k is much less than the number of time-series. Unlike TRMF, this model can capture non-linear dependencies, as the regularization and prediction are done using a temporal convolution network trained concurrently, and it is also amenable to scalable mini-batch training. This model can handle global dependencies during prediction.\n\n• In Section 5.2, we propose DeepGLO, a hybrid model, where the predictions from our global TCN-MF model are provided as covariates for a temporal convolution network, thereby enabling the final model to focus both on local per time-series properties as well as global dataset-wide properties, during both training and prediction.\n\n• In Section 6, we show that DeepGLO outperforms other benchmarks on four real-world time-series data-sets, including a public wiki dataset which contains more than 110K dimensions of time series. More details can be found in Tables 1 and 2.\n\n2 Related Work\n\nThe literature on time-series forecasting is vast and spans several decades. Here, we will mostly focus on recent deep learning approaches. For a comprehensive treatment of traditional methods, we refer the readers to [13, 19, 4, 18, 12] and the references therein.\n\nIn recent years deep learning models have gained popularity in time-series forecasting. DeepAR [9] proposes an LSTM based model where the parameters of Bayesian models for the future time-steps are predicted as a function of the corresponding hidden states of the LSTM. 
In [20], the authors combine linear state space models with deep networks. In [26], the authors propose a time-series model where the whole history of a time-series is encoded using an LSTM block, and a multi-horizon MLP decoder is used to decode the input into future forecasts. LSTNet [16] can leverage correlations between multiple time-series through a combination of 2D convolution and recurrent structures. However, it is difficult to scale this model beyond a few thousand time-series because of the growing size of the input layer. Temporal convolutions have recently been used for time-series forecasting [3].\n\nMatrix factorization with temporal regularization was first used in [27] in the context of speech de-noising. A spatio-temporal deep model for traffic data has been proposed in [28]. Perhaps closest to our work is TRMF [29], where the authors propose an AR based temporal regularization. In this paper, we extend this work to non-linear settings where the temporal regularization can be performed by a temporal convolution network (see Section 4). We further combine the global matrix factorization model with a temporal convolution network, thus creating a hybrid model that can focus on both local and global properties. There has been concurrent work [25], where an RNN has been used to evolve a global state common to all time-series.\n\n3 Problem Setting\n\nWe consider the problem of forecasting high-dimensional time series over future time-steps. High-dimensional time-series datasets consist of several possibly correlated time-series evolving over time along with corresponding covariates, and the task is to forecast the values of those time-series in future time-steps. Before we formally define the problem, we will set up some notation.\n\nNotation: We will use bold capital letters to denote matrices such as M ∈ R^{m×n}. M_ij and M[i, j] will be used interchangeably to denote the (i, j)-th entry of the matrix M. 
Let [n] := {1, 2, ..., n} for a positive integer n. For I ⊆ [m] and J ⊆ [n], the notation M[I, J] will denote the sub-matrix of M with rows in I and columns in J. M[:, J] means that all the rows are selected, and similarly M[I, :] means that all the columns are chosen. J + s denotes the set of elements in J increased by s. The notation i : j, for positive integers j > i, is used to denote the set {i, i + 1, ..., j}. ‖M‖_F and ‖M‖_2 denote the Frobenius and spectral norms, respectively. We also define 3-dimensional tensor notation in a similar way. Tensors will also be represented by bold capital letters T ∈ R^{m×r×n}. T_ijk and T[i, j, k] will be used interchangeably to denote the (i, j, k)-th entry of the tensor T. For I ⊆ [m], J ⊆ [n] and K ⊆ [r], the notation T[I, K, J] will denote the sub-tensor of T restricted to the selected coordinates. By convention, all vectors in this paper are row vectors unless otherwise specified. ‖v‖_p denotes the p-norm of the vector v ∈ R^{1×n}. v_I denotes the sub-vector with entries {v_i : i ∈ I}, where v_i denotes the i-th coordinate of v and I ⊆ [n]. The notation v_{i:j} will be used to denote the vector [v_i, ..., v_j]. The notation [v; u] ∈ R^{1×2n} will be used to denote the concatenation of two row vectors v and u. For a vector v ∈ R^{1×n}, µ(v) := (Σ_i v_i)/n denotes the empirical mean of the coordinates, and σ(v) := √((Σ_i (v_i − µ(v))²)/n) denotes the empirical standard deviation.\n\nForecasting Task: A time-series data-set consists of the raw time-series, represented by a matrix Y = [Y^{(tr)} Y^{(te)}], where Y^{(tr)} ∈ R^{n×t}, Y^{(te)} ∈ R^{n×τ}, n is the number of time-series, t is the number of time-points observed during the training phase, and τ is the window size for forecasting. y^{(i)} is used to denote the i-th time series, i.e., the i-th row of Y. In addition to the raw time-series, there may optionally be observed covariates, represented by the tensor Z ∈ R^{n×r×(t+τ)}. z_j^{(i)} = Z[i, :, j] denotes the r-dimensional covariates for time-series i and time-point j. Here, the covariates can consist of global features like time of the day, day of the week, etc., which are common to all time-series, as well as covariates particular to each time-series, for example vectorized text features describing each time-series. The forecasting task is to accurately predict the future in the test range, given the original time-series Y^{(tr)} in the training time-range and the covariates Z. Ŷ^{(te)} ∈ R^{n×τ} will be used to denote the predicted values in the test range.\n\nObjective: The quality of the predictions is generally measured using a metric calculated between the predicted and actual values in the test range. One of the popular metrics is the normalized absolute 
deviation metric [29], defined as follows:\n\nL(Y^{(obs)}, Y^{(pred)}) = ( Σ_{i=1}^{n} Σ_{j=1}^{τ} |Y^{(obs)}_{ij} − Y^{(pred)}_{ij}| ) / ( Σ_{i=1}^{n} Σ_{j=1}^{τ} |Y^{(obs)}_{ij}| ),   (1)\n\nwhere Y^{(obs)} and Y^{(pred)} are the observed and predicted values, respectively. This metric is also referred to as WAPE in the forecasting literature. Other commonly used evaluation metrics are defined in Section C.2. Note that (1) is also used as a loss function in one of our proposed models. We also use the squared-loss L2(Y^{(obs)}, Y^{(pred)}) = (1/nτ) ‖Y^{(obs)} − Y^{(pred)}‖_F^2 during training.\n\nFigure 1: Fig. 1a contains an illustration of a TCN. Note that the base image has been borrowed from [24]. The network has d = 4 layers, with filter size k = 2; the dilation in layer i is 2^{i−1}. The network maps the input y_{t−l:t−1} to the one-shifted output ŷ_{t−l+1:t}. Fig. 1b presents an illustration of the matrix factorization approach in time-series forecasting. The training matrix Y^{(tr)} can be factorized into low-rank factors F (∈ R^{n×k}) and X^{(tr)} (∈ R^{k×t}). If X^{(tr)} preserves temporal structures, then the future values X^{(te)} can be predicted by a time-series forecasting model, and thus the test-period predictions can be made as F X^{(te)}.\n\n4 LeveledInit: Handling Diverse Scales with TCN\n\nIn this section, we propose LeveledInit, a simple initialization scheme for Temporal Convolution Networks (TCN) [2] which is designed to handle high-dimensional time-series data with wide variation in scale, without a priori normalization. As mentioned before, deep networks are difficult to train on time-series datasets where the individual time-series have diverse scales. LSTM based models cannot be reliably trained on such datasets and may need a priori standard normalization [16] or pre-scaling of the Bayesian mean predictions [9]. TCNs have also been shown to require a priori normalization [3] for time-series predictions. 
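The WAPE metric in Eq. (1) and the squared loss are straightforward to compute; below is a minimal NumPy sketch (the function names are ours, not from the paper's code):

```python
import numpy as np

def wape(y_obs, y_pred):
    """Normalized absolute deviation (WAPE), Eq. (1):
    sum of absolute errors divided by sum of absolute actuals."""
    return np.abs(y_obs - y_pred).sum() / np.abs(y_obs).sum()

def squared_loss(y_obs, y_pred):
    """Squared loss: (1 / (n * tau)) * ||Y_obs - Y_pred||_F^2."""
    n, tau = y_obs.shape
    return ((y_obs - y_pred) ** 2).sum() / (n * tau)

# Example: n = 2 time series over a tau = 2 test window.
Y_obs = np.array([[10.0, 20.0], [30.0, 40.0]])
Y_pred = np.array([[12.0, 18.0], [27.0, 44.0]])
print(wape(Y_obs, Y_pred))          # (2 + 2 + 3 + 4) / 100 = 0.11
print(squared_loss(Y_obs, Y_pred))  # (4 + 4 + 9 + 16) / 4 = 8.25
```

Because the numerator and denominator aggregate over all series, WAPE naturally down-weights errors on small-scale series relative to large ones, which suits datasets with diverse scales.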
The choice of normalization parameters can have a significant effect on the prediction performance. Here, we show that a simple initialization scheme for the TCN network weights can alleviate this problem and lead to reliable training without a priori normalization. First, we will briefly discuss the TCN architecture.\n\nTemporal Convolution: Temporal convolutions (also referred to as causal convolutions) [24, 3, 2] are multi-layered neural networks comprised of 1D convolutional layers with dilation. The dilation in layer i is usually set as dil(i) = 2^{i−1}. A temporal convolution network with filter size k and number of layers d has a dynamic range (or look-back) of l₀ := 1 + l = 1 + 2(k − 1)2^{d−1}. Note that it is assumed that the stride is 1 in every layer, and layer i needs to be zero-padded at the beginning with (k − 1)·dil(i) zeros. An example of a temporal convolution network with one channel per layer is shown in Fig. 1a. For more details, we refer the readers to the general description in [2]. Note that in practice, we can have multiple channels per layer of a TC network. The TC network can thus be treated as an object that takes in the previous values of a time-series y_J, where J = {j − l, j − l + 1, ..., j − 1}, along with the past covariates z_J corresponding to that time-series, and outputs the one-step look-ahead predicted value ŷ_{J+1}. We will denote a temporal convolution neural network by T(·|Θ), where Θ denotes the parameter weights of the temporal convolution network. Thus, we have ŷ_{J+1} = T(y_J, z_J | Θ). 
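As a concrete illustration of the zero-padding described above, the following NumPy sketch (our own simplification to one channel, not the paper's implementation) applies a single causal convolution layer with dilation and checks causality: perturbing a future input leaves earlier outputs unchanged.

```python
import numpy as np

def causal_conv1d(y, w, dilation):
    """One causal (temporal) convolution layer with one channel.

    The input is zero-padded at the beginning with (k - 1) * dilation
    zeros, so out[t] depends only on y[:t + 1] (stride 1, same length).
    """
    k = len(w)
    pad = (k - 1) * dilation
    y_padded = np.concatenate([np.zeros(pad), y])
    out = np.zeros_like(y)
    for t in range(len(y)):
        # Taps at t, t - dilation, ..., t - (k - 1) * dilation.
        taps = [y_padded[pad + t - i * dilation] for i in range(k)]
        out[t] = np.dot(w, taps)
    return out

y = np.arange(8, dtype=float)          # a toy series
w = np.array([0.5, 0.5])               # filter size k = 2
out = causal_conv1d(y, w, dilation=2)  # a layer with dil = 2

# Causality check: perturbing y[5] must not change out[:5].
y2 = y.copy(); y2[5] += 100.0
out2 = causal_conv1d(y2, w, dilation=2)
print(np.allclose(out[:5], out2[:5]))  # True
```

Stacking such layers with dilations 1, 2, 4, ... is what gives the network its exponentially growing look-back.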
The same operators can be defined on matrices. Given Y ∈ R^{n×t}, Z ∈ R^{n×r×(t+τ)} and a set of row indices I = {i_1, ..., i_{b_n}} ⊆ [n], we can write Ŷ[I, J + 1] = T(Y[I, J], Z[I, :, J] | Θ).\n\nLeveledInit Scheme: One possible method to alleviate the issues with diverse scales is to start with initial network parameters that result in approximately predicting the average value of a given window of past time-points y_{j−l:j−1} as the future prediction ŷ_j. The hope is that, over the course of training, the network will learn to predict variations around this mean prediction, given that the variations around this mean are relatively scale free. This can be achieved through a simple initialization scheme for some configurations of TCN, which we call LeveledInit. For ease of exposition, let us consider the setting without covariates and only one channel per layer, which can be functionally represented as ŷ_{J+1} = T(y_J | Θ). In this case, the initialization scheme is to simply set all the filter weights to 1/k, where k is the filter size in every layer. This results in the following proposition.\n\nProposition 1 (LeveledInit). Let T(·|Θ) be a TCN with one channel per layer, filter size k = 2, and number of layers d. Here Θ denotes the weights and biases of the network. Let [ŷ_{j−l+1}, ..., ŷ_j] := T(y_J | Θ) for J = {j − l, ..., j − 1} and l = 2(k − 1)2^{d−1}. If all the biases in Θ are set to 0 and all the weights are set to 1/k, then ŷ_j = µ(y_J) if y ≥ 0 and all activation functions are ReLUs.\n\nThe above proposition shows that LeveledInit results in a prediction ŷ_j which is the average of the past l time-points, where l is the dynamic range of the TCN, when the filter size is k = 2 (see Fig. 
1a). The proof of Proposition 1 (see Section A) follows from the fact that an activation value in an internal layer is the average of the corresponding k inputs from the previous layer, and an induction on the layers yields the result. LeveledInit can also be extended to the case with covariates, by setting the corresponding filter weights to 0 in the input layer. It can also be easily extended to multiple channels per layer with k = 2. In Section 6, we show that a TCN with LeveledInit can be trained reliably without a priori normalization on real-world datasets, even for values of k ≠ 2. We provide the pseudo-code for training a TCN with LeveledInit as Algorithm 1.\n\nNote that we have also experimented with a more sophisticated variation of the temporal convolution architecture termed the Deep Leveled Network (DLN), which we include in Appendix A. However, we observed that the simple initialization scheme for TCN (LeveledInit) can match the performance of the Deep Leveled Network.\n\n5 DeepGLO: A Deep Global Local Forecaster\n\nIn this section we will introduce our hybrid model, DeepGLO, which can leverage both global and local features during training and prediction. In Section 5.1, we present the global component, TCN regularized Matrix Factorization (TCN-MF). This model can capture global patterns during prediction, by representing each of the original time-series as a linear combination of k basis time-series, where k ≪ n. 
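Proposition 1 above can also be checked numerically. The sketch below (our own illustration in plain NumPy, not the authors' code) stacks d causal convolution layers with k = 2, all filter weights set to 1/2, zero biases and ReLU activations, and confirms that the last output equals the mean of the past l = 2^d inputs for a nonnegative series.

```python
import numpy as np

def causal_layer(y, dilation, k=2):
    """Causal conv layer, one channel, all filter weights 1/k, bias 0, ReLU."""
    pad = (k - 1) * dilation
    y_padded = np.concatenate([np.zeros(pad), y])
    out = np.array([sum(y_padded[pad + t - i * dilation] for i in range(k)) / k
                    for t in range(len(y))])
    return np.maximum(out, 0.0)  # ReLU (a no-op when all inputs are >= 0)

def leveled_init_tcn(y, d):
    """d layers with dilations 1, 2, ..., 2^(d-1), LeveledInit weights."""
    for i in range(1, d + 1):
        y = causal_layer(y, dilation=2 ** (i - 1))
    return y

d = 3
l = 2 ** d                              # dynamic range for k = 2
rng = np.random.default_rng(0)
y = rng.uniform(1.0, 1000.0, size=l)    # nonnegative, widely scaled series

pred = leveled_init_tcn(y, d)[-1]       # one-step look-ahead prediction
print(np.isclose(pred, y.mean()))       # True: the prediction is the window mean
```

Each layer averages k values from the layer below, so by induction the top output is the mean of the 2^d most recent inputs, exactly as the proposition states.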
In Section 5.2, we detail how the output of the global model can be incorporated as an input covariate dimension for a TCN, thus leading to a hybrid model that can both focus on local per time-series signals and leverage global dataset-wide components.\n\nAlgorithm 1: Mini-batch Training for TCN with LeveledInit\nRequire: learning rate η, horizontal batch size b_t, vertical batch size b_n, and maxiters.\n1: Initialize T(·|Θ) according to LeveledInit\n2: for iter = 1, ..., maxiters do\n3:   for each batch with indices I = {i_1, ..., i_{b_n}} and J = {j + 1, j + 2, ..., j + b_t} in an epoch do\n4:     Ŷ ← T(Y[I, J], Z[I, :, J] | Θ)\n5:     Θ ← Θ − η ∂/∂Θ L(Y[I, J + 1], Ŷ)\n6:   end for\n7: end for\n\nAlgorithm 2: Temporal Matrix Factorization Regularized by TCN (TCN-MF)\nRequire: itersinit, iterstrain, itersalt.\n1: /* Model Initialization */\n2: Initialize T_X(·) by LeveledInit\n3: Initialize F and X^{(tr)} by Alg. 3 for itersinit iterations\n4: /* Alternate training cycles */\n5: for iter = 1, 2, ..., itersalt do\n6:   Update F and X^{(tr)} by Alg. 3 for iterstrain iterations\n7:   Update T_X(·) by Alg. 1 on X^{(tr)} for iterstrain iterations, with no covariates\n8: end for\n\n5.1 Global: Temporal Convolution Network regularized Matrix Factorization (TCN-MF)\n\nIn this section we propose a low-rank matrix factorization model for time-series forecasting that uses a TCN for regularization. The idea is to factorize the training time-series matrix Y^{(tr)} into low-rank factors F ∈ R^{n×k} and X^{(tr)} ∈ R^{k×t}, where k ≪ n. This is illustrated in Figure 1b. Further, we would like to encourage a temporal structure in the X^{(tr)} matrix, such that the future values X^{(te)} in the test range can also be forecasted. Let X = [X^{(tr)} X^{(te)}]. 
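Setting the temporal regularizer aside for a moment, the factorization step of Figure 1b can be sketched as follows. This is our own toy illustration: we obtain the factors with a truncated SVD, whereas the paper trains F and X^{(tr)} jointly with the TCN regularizer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, t, k = 50, 40, 3                      # n series, t training steps, rank k

# Toy data that is exactly rank k: every series is a linear
# combination of k basis time-series.
F_true = rng.normal(size=(n, k))
X_true = rng.normal(size=(k, t))
Y_tr = F_true @ X_true

# One way to obtain the factors of Figure 1b: truncated SVD.
U, s, Vt = np.linalg.svd(Y_tr, full_matrices=False)
F = U[:, :k] * s[:k]                     # F      in R^{n x k}
X_tr = Vt[:k, :]                         # X(tr)  in R^{k x t}

print(np.allclose(Y_tr, F @ X_tr))       # True: exact low-rank fit
# Forecasting then reduces to predicting the k rows of X into the
# test range (X(te)) and forming F @ X(te).
```

The point of the low rank k ≪ n is that only k basis series need to be forecasted, and every one of the n original series inherits those global patterns through its row of F.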
Thus, the matrix X can be thought of as comprised of k basis time-series that capture the global temporal patterns in the whole data-set, and all the original time-series are linear combinations of these basis time-series. In the next subsection we describe how a TCN can be used to encourage the temporal structure of X.\n\nTemporal Regularization by a TCN: If we are provided with a TCN that captures the temporal patterns in the training data-set Y^{(tr)}, then we can encourage temporal structures in X^{(tr)} using this model. Let us assume that the said network is T_X(·). The temporal patterns can be encouraged by including the following temporal regularization in the objective function:\n\nR(X^{(tr)} | T_X(·)) := (1/|J|) L2(X[:, J], T_X(X[:, J − 1])),   (2)\n\nwhere J = {2, ..., t} and L2(·, ·) is the squared-loss metric defined before. This implies that the values of X^{(tr)} at time-index j are close to the predictions from the temporal network applied on the past values between time-steps {j − l, ..., j − 1}. Here, l + 1 is equal to the dynamic range of the network defined in Section 4. Thus the overall loss function for the factors and the temporal network is as follows:\n\nL_G(Y^{(tr)}, F, X^{(tr)}, T_X) := L2(Y^{(tr)}, F X^{(tr)}) + λ_T R(X^{(tr)} | T_X(·)),   (3)\n\nwhere λ_T is the regularization parameter for the temporal regularization component.\n\nTraining: The low-rank factors F, X^{(tr)} and the temporal network T_X(·) can be trained alternately to approximately minimize the loss in Eq. (3). The overall training can be performed through mini-batch SGD, and can be broken down into two main components performed alternatingly: (i) given a fixed T_X(·), minimize L_G(F, X^{(tr)}, T_X) with respect to the factors F, X^{(tr)} (Algorithm 3); and (ii) train the network T_X(·) on the matrix X^{(tr)} using Algorithm 1.\n\nThe overall algorithm is detailed in Algorithm 2. 
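The two terms of Eqs. (2) and (3) compose as in the NumPy sketch below. The helper names are ours, and a trivial "repeat the last value" predictor stands in for the TCN T_X; in the paper the regularizer is computed with the trained network.

```python
import numpy as np

def l2_loss(A, B):
    """Squared loss: (1 / size) * ||A - B||_F^2."""
    return ((A - B) ** 2).sum() / A.size

def temporal_reg(X, predict_one_step):
    """Eq. (2): how well the temporal model maps X[:, :-1] onto X[:, 1:]."""
    X_target = X[:, 1:]                    # values at time-indices 2..t
    X_pred = predict_one_step(X[:, :-1])   # predictions from past values
    return l2_loss(X_target, X_pred)

def global_loss(Y_tr, F, X, predict_one_step, lam_T):
    """Eq. (3): reconstruction loss plus temporal regularization."""
    return l2_loss(Y_tr, F @ X) + lam_T * temporal_reg(X, predict_one_step)

# Toy instance: random-walk basis series, with "carry the last value
# forward" as the stand-in one-step predictor.
rng = np.random.default_rng(2)
k, t, n = 2, 30, 10
X = np.cumsum(rng.normal(scale=0.1, size=(k, t)), axis=1)
F = rng.normal(size=(n, k))
Y_tr = F @ X

loss = global_loss(Y_tr, F, X, predict_one_step=lambda P: P, lam_T=0.5)
print(loss > 0.0)   # True; the reconstruction term is 0, the reg term is not
```

During alternating training, gradient steps on F and X push the first term down, while gradient steps on X (and updates to T_X) push the second term down, so X is steered toward something the temporal network can actually forecast.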
The TCN $\mathcal{T}_X(\cdot)$ is first initialized by LeveledInit. Then, in the second initialization step, the two factors $F$ and $X^{(tr)}$ are trained using the initialized $\mathcal{T}_X(\cdot)$ (step 3), for itersinit iterations. This is followed by itersalt alternating steps to update $F$, $X^{(tr)}$, and $\mathcal{T}_X(\cdot)$ (steps 5-7).

Prediction: The trained temporal network $\mathcal{T}_X(\cdot)$ can be used for multi-step look-ahead prediction in a standard manner. Given the past data-points of a basis time-series, $x_{j-l:j-1}$, the prediction for the next time-step, $\hat{x}_j$, is given by $[\hat{x}_{j-l+1}, \cdots, \hat{x}_j] := \mathcal{T}_X(x_{j-l:j-1})$. Now, the one-step look-ahead prediction can be concatenated with the past values to form the sequence $\tilde{x}_{j-l+1:j} = [x_{j-l+1:j-1}\, \hat{x}_j]$, which can again be passed through the network to get the next prediction: $[\cdots, \hat{x}_{j+1}] = \mathcal{T}_X(\tilde{x}_{j-l+1:j})$. The same procedure can be repeated $\tau$ times to predict $\tau$ time-steps ahead in the future. Thus we can obtain the basis time-series in the test range, $\hat{X}^{(te)}$. The final global predictions are then given by $\hat{Y}^{(te)} = F \hat{X}^{(te)}$.

Remark 1. Note that TCN-MF can perform rolling predictions without retraining, unlike TRMF.
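The multi-step roll-out described above can be sketched in a few lines of plain Python. The TCN is replaced here by a toy linear-trend extrapolator, and only the last output of the network is kept at each step; the names are illustrative, not from the released code.

```python
def multi_step_forecast(past, one_step_net, tau):
    """Autoregressive multi-step look-ahead: repeatedly append the one-step
    prediction to the history and re-apply the network.

    past         : list of the last l observed values of one basis series
    one_step_net : maps a length-l window to the next-value prediction
    tau          : number of future steps to forecast
    """
    window = list(past)
    preds = []
    for _ in range(tau):
        x_hat = one_step_net(window)
        preds.append(x_hat)
        window = window[1:] + [x_hat]  # slide the window forward by one step
    return preds

# Toy stand-in for T_X: a "network" that continues a linear trend.
trend_net = lambda w: 2 * w[-1] - w[-2]

print(multi_step_forecast([1.0, 2.0, 3.0], trend_net, tau=4))  # → [4.0, 5.0, 6.0, 7.0]
```

In DeepGLO the same loop is applied to every row of $X$, and the forecasts $\hat{X}^{(te)}$ are then mixed by $F$ to produce the global predictions.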
We provide more details on rolling predictions in Appendix B, in the interest of space.

Algorithm 3: Training the Low-rank Factors $F, X^{(tr)}$ Given a Fixed Network $\mathcal{T}_X(\cdot)$, for One Epoch
Require: learning rate $\eta$, a TCN $\mathcal{T}_X(\cdot)$.
1: for each batch with indices $\mathcal{I} = \{i_1, ..., i_{b_n}\}$ and $\mathcal{J} = \{j+1, j+2, ..., j+b_t\}$ in an epoch do
2:   $X[:, \mathcal{J}] \leftarrow X[:, \mathcal{J}] - \eta\, \partial_{X[:,\mathcal{J}]} \mathcal{L}_G(Y[\mathcal{I}, \mathcal{J}], F[\mathcal{I}, :], X[:, \mathcal{J}], \mathcal{T}_X)$
3:   $F[\mathcal{I}, :] \leftarrow F[\mathcal{I}, :] - \eta\, \partial_{F[\mathcal{I},:]} \mathcal{L}_G(Y[\mathcal{I}, \mathcal{J}], F[\mathcal{I}, :], X[:, \mathcal{J}], \mathcal{T}_X)$
4: end for

Algorithm 4: DeepGLO - Deep Global Local Forecaster
1: Obtain global $F$, $X^{(tr)}$ and $\mathcal{T}_X(\cdot)$ by Alg. 2.
2: Initialize $\mathcal{T}_Y(\cdot)$ with number of inputs $r + 2$ and LeveledInit.
3: /* Training hybrid model */
4: Let $\hat{Y}^{(g)}$ be the global model prediction in the training range.
5: Create covariates $Z' \in \mathbb{R}^{n \times (r+1) \times t}$ s.t. $Z'[:, 1, :] = \hat{Y}^{(g)}$ and $Z'[:, 2:r+1, :] = Z[:, :, 1:t]$.
6: Train $\mathcal{T}_Y(\cdot)$ using Algorithm 1 with time-series $Y^{(tr)}$ and covariates $Z'$.

5.2 Combining the Global Model with Local Features

In this section, we present our final hybrid model. Recall that our forecasting task has as input the training raw time-series $Y^{(tr)}$ and the covariates $Z \in \mathbb{R}^{n \times r \times (t+\tau)}$. Our hybrid forecaster is a TCN $\mathcal{T}_Y(\cdot|\Theta_Y)$ with an input size of $r + 2$ dimensions: (i) one of the inputs is reserved for the past points of the original raw time-series, (ii) $r$ inputs are reserved for the original $r$-dimensional covariates, and (iii) the remaining dimension is reserved for the output of the global TCN-MF model, which is fed as an input covariate. The overall model is demonstrated in Figure 2.
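Step 5 of Algorithm 4, which stacks the global prediction on top of the original covariates, can be sketched as follows (numpy, 0-based indexing; the helper name is ours, not from the released code):

```python
import numpy as np

def build_hybrid_covariates(global_pred, Z):
    """Stack the global TCN-MF prediction on top of the original covariates:
    the result has r+1 covariate channels, with channel 0 holding the global
    forecast and channels 1..r holding the original covariates."""
    n, r, t = Z.shape
    assert global_pred.shape == (n, t)
    Z_prime = np.empty((n, r + 1, t))
    Z_prime[:, 0, :] = global_pred  # global model output fed as a covariate
    Z_prime[:, 1:, :] = Z           # original r covariate channels
    return Z_prime

n, r, t = 4, 2, 10
Z = np.zeros((n, r, t))          # toy covariates (e.g., time features)
Y_g = np.ones((n, t))            # toy global predictions in the training range
Z_prime = build_hybrid_covariates(Y_g, Z)
print(Z_prime.shape)  # → (4, 3, 10)
```

The hybrid TCN $\mathcal{T}_Y(\cdot)$ then consumes the raw series plus these $r + 1$ covariate channels, for $r + 2$ input dimensions in total.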
The training pseudo-code for our model is provided as Algorithm 4.

Therefore, by providing the global TCN-MF model prediction as covariates to a TCN, we can make the final predictions a function of global, dataset-wide properties as well as the past values of the local time-series and covariates. Note that both rolling predictions and multi-step look-ahead predictions can be performed by DeepGLO, as the global TCN-MF model and the hybrid TCN $\mathcal{T}_Y(\cdot)$ can perform the forecast without any need for re-training. We illustrate some representative results on a time-series from the dataset in Fig. 2. In Fig. 2a, we show some of the basis time-series (global features) extracted from the traffic dataset, which can be combined linearly to yield the individual original time-series. It can be seen that the basis series are highly temporal and can be predicted in the test range using the network $\mathcal{T}_X(\cdot|\Theta_X)$. In Fig. 2b, we illustrate the complete architecture of DeepGLO. It can be observed that the output of the global TCN-MF model is inputted as a covariate to the TCN $\mathcal{T}_Y(\cdot)$, which in turn combines this with local features, and predicts in the test range through multi-step look-ahead predictions.

[Figure 2: (a) basis time-series $f^{(i)}_1, f^{(i)}_2, \ldots, f^{(i)}_k$ in the training range and their forecasts $\hat{X}^{(te)}$ in the test range; (b) the DeepGLO architecture.]

Figure 2: In Fig. 2a, we show some of the basis time-series extracted from the traffic dataset, which can be combined linearly to yield individual original time-series. It can be seen that the basis series are highly temporal and can be predicted in the test range using the network $\mathcal{T}_X(\cdot|\Theta_X)$. In Fig. 2b (base image borrowed from [24]), we show an illustration of DeepGLO. The TCN shown is the network $\mathcal{T}_Y(\cdot)$, which takes as input the original time-points, the original covariates and the output of the global model as covariates. Thus this network can combine the local properties with the output of the global model during prediction.

Table 1: Data statistics and evaluation settings. In the rolling predictions, $\tau_w$ denotes the number of time-points in each window and $n_w$ denotes the number of rolling windows. std({μ}) denotes the standard deviation among the means of all the time series in the data-set, i.e., std({μ}) = std({μ(y^{(i)})}_{i=1}^{n}). Similarly, std({std}) denotes the standard deviation among the standard deviations of all the time series in the data-set, i.e., std({std}) = std({std(y^{(i)})}_{i=1}^{n}). It can be seen that the electricity and wiki datasets have wide variations in scale.

Data        | n       | t      | τ_w | n_w | std({μ}) | std({std})
electricity | 370     | 25,968 | 24  | 7   | 1.19e4   | 7.99e3
traffic     | 963     | 10,392 | 24  | 7   | 1.08e2   | 1.25e2
wiki        | 115,084 | 747    | 14  | 4   | 4.85e4   | 1.26e4
PeMSD7(M)   | 228     | 11,232 | 9   | 160 | 3.97     | 4.42

Table 2: Comparison of algorithms on normalized and unnormalized versions of the data-sets on rolling prediction tasks. The error metrics reported are WAPE/MAPE/SMAPE (see Section C.2). TRMF is retrained before every prediction window during the rolling predictions. All other models are trained once on the initial training set and used for further prediction over all the rolling windows. Note that for DeepAR, the normalized column represents the model trained with scaler=True and unnormalized represents scaler=False. Prophet could not be scaled to the wiki dataset, even though it was parallelized on a 32-core machine.
Below the main table, we provide a MAE/MAPE/RMSE comparison with the models implemented in [28], on the PeMSD7(M) dataset.

Algorithm | electricity (n = 370) Norm. | electricity Unnorm. | traffic (n = 963) Norm. | traffic Unnorm. | wiki (n = 115,084) Norm. | wiki Unnorm.
DeepGLO (Proposed)             | 0.133/0.453/0.162 | 0.082/0.341/0.121 | 0.166/0.210/0.179 | 0.148/0.168/0.142 | 0.569/3.335/1.036 | 0.237/0.441/0.395
Local TCN (LeveledInit) (Local-Only) | 0.143/0.356/0.207 | 0.092/0.237/0.126 | 0.157/0.201/0.156 | 0.169/0.177/0.169 | 0.243/0.545/0.431 | 0.212/0.316/0.296
Global TCN-MF (Global-Only)    | 0.144/0.485/0.174 | 0.106/0.525/0.188 | 0.336/0.415/0.451 | 0.226/0.284/0.247 | 1.19/8.46/1.56    | 0.433/1.59/0.686
LSTM                           | 0.109/0.264/0.154 | 0.896/0.672/0.768 | 0.276/0.389/0.361 | 0.270/0.357/0.263 | 0.427/2.170/0.590 | 0.789/0.686/0.493
DeepAR                         | 0.086/0.259/0.141 | 0.994/0.818/1.85  | 0.140/0.201/0.114 | 0.211/0.331/0.267 | 0.429/2.980/0.424 | 0.993/8.120/1.475
TCN (no LeveledInit)           | 0.147/0.476/0.156 | 0.423/0.769/0.523 | 0.204/0.284/0.236 | 0.239/0.425/0.281 | 0.336/1.322/0.497 | 0.511/0.884/0.509
Prophet                        | 0.197/0.393/0.221 | 0.313/0.600/0.420 | 0.221/0.586/0.524 | 0.303/0.559/0.403 | -                 | -
TRMF (retrained)               | 0.104/0.280/0.151 | 0.105/0.431/0.183 | 0.159/0.226/0.181 | 0.210/0.322/0.275 | 0.309/0.847/0.451 | 0.320/0.938/0.503
SVD+TCN                        | 0.219/0.437/0.238 | 0.468/0.841/0.580 | 0.368/0.779/0.346 | 0.329/0.687/0.340 | 0.697/3.51/0.886  | 0.639/2.000/0.893

Algorithm               | PeMSD7(M) (MAE/MAPE/RMSE)
DeepGLO (Unnormalized)  | 3.53/0.079/6.49
DeepGLO (Normalized)    | 4.53/0.103/6.91
STGCN(Cheb)             | 3.57/0.087/6.77
STGCN(1st)              | 3.79/0.091/7.03

6 Empirical Results

In this section, we validate our model on four real-world data-sets on rolling prediction tasks (see Section C.1 for more details) against other benchmarks. The data-sets in consideration are: (i) electricity [23] - Hourly load on 370 houses.
The training set consists of 25,968 time-points and the task is to perform rolling validation for the next 7 days (24 time-points at a time, for 7 windows), as done in [29, 20, 9]; (ii) traffic [7] - Hourly traffic on 963 roads in San Francisco. The training set consists of 10,392 time-points and the task is to perform rolling validation for the next 7 days (24 time-points at a time, for 7 windows), as done in [29, 20, 9]; (iii) wiki [14] - Daily web-traffic on about 115,084 articles from Wikipedia. We only keep the time-series without missing values from the original data-set. The values for each day are normalized by the total traffic on that day across all time-series and then multiplied by 1e8. The training set consists of 747 time-points and the task is to perform rolling validation for the next 56 days, 14 days at a time; and (iv) PeMSD7(M) [6] - Data collected from the Caltrans PeMS system, which contains 228 time-series collected at 5-minute intervals. The training set consists of 11,232 time-points and we perform rolling validation for the next 1,440 points, 9 points at a time. More data statistics indicating scale variations are provided in Table 1.

For each data-set, all models are compared in two different settings. In the first setting, the models are trained on a normalized version of the data-set, where each time series in the training set is re-scaled as

$$\tilde{y}^{(i)}_{1:t-\tau} = \frac{y^{(i)}_{1:t-\tau} - \mu(y^{(i)}_{1:t-\tau})}{\text{std}(y^{(i)}_{1:t-\tau})},$$

and the predictions are then scaled back to the original scale. In the second setting, the data-set is unnormalized, i.e., left as it is during training and prediction. Note that all models are compared in the test range on the original scale of the data. The purpose of these two settings is to measure the impact of scaling on the accuracy of the different models.

The models under contention are: (1) DeepGLO: the combined local and global model proposed in Section 5.2. We use time-features like time-of-day, day-of-week, etc.
as global covariates, similar to DeepAR. For a more detailed discussion, please refer to Section C.3. (2) Local TCN (LeveledInit): the temporal convolution based architecture discussed in Section 4, with LeveledInit. (3) LSTM: a simple LSTM block that predicts the time-series values as a function of the hidden states [11]. (4) DeepAR: the model proposed in [9]. (5) TCN: a simple temporal convolution model as described in Section 4, without LeveledInit. (6) Prophet: the versatile forecasting model from Facebook based on classical techniques [8]. (7) TRMF: the model proposed in [29]; note that this model needs to be retrained for every rolling prediction window. (8) SVD+TCN: a combination of SVD and TCN; the original data is factorized as Y = UV using SVD and a leveled network is trained on V. This is a simple baseline for a global-only approach. (9) STGCN: the spatio-temporal models in [28]. We use the same hyper-parameters for DeepAR on the traffic and electricity datasets as specified in [9], as implemented in GluonTS [1]. The WAPE values from the original paper could not be used directly, as different values are reported in [9] and [20]. Note that for DeepAR, the normalized and unnormalized settings correspond to using scaler=True and scaler=False in the GluonTS package. The model in TRMF [29] was trained with different hyper-parameters (a larger rank) than in the original paper, and therefore the results are slightly better. More details about all the hyper-parameters used are provided in Section C. The ranks k used on electricity, traffic, wiki and PeMSD7(M) are 64/60, 64/60, 256/1,024 and 64/ for DeepGLO/TRMF, respectively.

In Table 2, we report WAPE/MAPE/SMAPE (see definitions in Section C.2) on the first three datasets under both normalized and unnormalized training. We can see that DeepGLO features among the top two models in almost all categories, under all three metrics.
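For reference, the three error metrics can be sketched as follows. These are the common textbook forms; the paper's exact conventions are the ones given in Section C.2.

```python
def wape(actual, forecast):
    """Weighted absolute percentage error: total |error| over total |actual|."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(abs(a) for a in actual)

def mape(actual, forecast):
    """Mean absolute percentage error, over time-points with nonzero actuals."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actual, forecast) if a != 0]
    return sum(terms) / len(terms)

def smape(actual, forecast):
    """Symmetric MAPE in the common 2|a - f| / (|a| + |f|) form."""
    terms = [2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)]
    return sum(terms) / len(terms)

actual, forecast = [100.0, 200.0, 50.0], [110.0, 180.0, 50.0]
print(wape(actual, forecast))  # 30/350 ≈ 0.0857
```

Because WAPE weights errors by the magnitude of the actuals, it is the natural headline metric for data-sets such as wiki, where the series scales vary by orders of magnitude.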
DeepGLO does better on average than the individual local TCN (LeveledInit) method and the global TCN-MF model, as it is aided by both global and local factors. The local TCN (LeveledInit) model performs the best on the larger wiki dataset, with > 30% improvement in WAPE over all models not proposed in this paper, while DeepGLO is close behind with greater than 25% improvement over all other models. We also find that DeepGLO performs better in the unnormalized setting on all instances, because there is no need for scaling the input and rescaling back the outputs of the model. We find that the TCN (no LeveledInit), DeepAR and LSTM models do not perform well in the unnormalized setting, as expected. In the second table, we compare DeepGLO with the models in [28], which can capture global features but require a weighted graph representing closeness relations between the time-series as external input. We see that DeepGLO (unnormalized) performs the best on all metrics, even though it does not require any external input. Our implementation can be found at https://github.com/rajatsen91/deepglo.

References

[1] A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, J. Gasthaus, T. Januschowski, D. C. Maddix, S. Rangapuram, D. Salinas, J. Schulz, L. Stella, A. C. Türkmen, and Y. Wang. GluonTS: Probabilistic time series modeling in Python. arXiv preprint arXiv:1906.05264, 2019.

[2] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.

[3] Anastasia Borovykh, Sander Bohte, and Cornelis W. Oosterlee. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

[4] George E. P. Box and Gwilym M. Jenkins.
Some recent advances in forecasting and control. Journal of the Royal Statistical Society, Series C (Applied Statistics), 17(2):91–109, 1968.

[5] Chris Chatfield. Time-Series Forecasting. Chapman and Hall/CRC, 2000.

[6] Chao Chen, Karl Petty, Alexander Skabardonis, Pravin Varaiya, and Zhanfeng Jia. Freeway performance measurement system: Mining loop detector data. Transportation Research Record, 1748(1):96–102, 2001.

[7] Marco Cuturi. Fast global alignment kernels. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 929–936, 2011.

[8] Facebook. FBProphet. https://research.fb.com/prophet-forecasting-at-scale/, 2017. [Online; accessed 07-Jan-2019].

[9] Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.

[10] Ken-ichi Funahashi and Yuichi Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 6(6):801–806, 1993.

[11] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

[12] James Douglas Hamilton. Time Series Analysis, volume 2. Princeton University Press, Princeton, NJ, 1994.

[13] Rob Hyndman, Anne B. Koehler, J. Keith Ord, and Ralph D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Science & Business Media, 2008.

[14] Kaggle. Wikipedia web traffic. https://www.kaggle.com/c/web-traffic-time-series-forecasting/data, 2017. [Online; accessed 07-Jan-2019].

[15] Kyoung-jae Kim. Financial time series forecasting using support vector machines. Neurocomputing, 55(1-2):307–319, 2003.

[16] Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. Modeling long- and short-term temporal patterns with deep neural networks.
In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 95–104. ACM, 2018.

[17] Paul D. Larson. Designing and managing the supply chain: Concepts, strategies, and case studies, by David Simchi-Levi, Philip Kaminsky, and Edith Simchi-Levi. Journal of Business Logistics, 22(1):259–261, 2001.

[18] Helmut Lütkepohl. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.

[19] E. D. McKenzie. General exponential smoothing and the equivalent ARMA process. Journal of Forecasting, 3(3):333–344, 1984.

[20] Syama Sundar Rangapuram, Matthias W. Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. Deep state space models for time series forecasting. In Advances in Neural Information Processing Systems, pages 7796–7805, 2018.

[21] Matthias W. Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand forecasting for large inventories. In Advances in Neural Information Processing Systems, pages 4646–4654, 2016.

[22] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, 2012.

[23] Artur Trindade. ElectricityLoadDiagrams20112014 data set. https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014, 2011. [Online; accessed 07-Jan-2019].

[24] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[25] Yuyang Wang, Alex Smola, Danielle Maddix, Jan Gasthaus, Dean Foster, and Tim Januschowski. Deep factors for forecasting. In International Conference on Machine Learning, pages 6607–6617, 2019.

[26] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka.
A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.

[27] Kevin W. Wilson, Bhiksha Raj, and Paris Smaragdis. Regularized non-negative matrix factorization with temporal dependencies for speech denoising. In Ninth Annual Conference of the International Speech Communication Association, 2008.

[28] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875, 2017.

[29] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In Advances in Neural Information Processing Systems, pages 847–855, 2016.