{"title": "Deep State Space Models for Time Series Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 7785, "page_last": 7794, "abstract": "We present a novel approach to probabilistic time series forecasting that combines state space models with deep learning. By parametrizing a per-time-series linear state space model with a jointly-learned recurrent neural network, our method retains desired properties of state space models such as data efficiency and interpretability, while making use of the ability to learn complex patterns from raw data offered by deep learning approaches. Our method scales gracefully from regimes where little training data is available to regimes where data from millions of time series can be leveraged to learn accurate models. We provide qualitative as well as quantitative results with the proposed method, showing that it compares favorably to the state-of-the-art.", "full_text": "Deep State Space Models for Time Series Forecasting\n\nSyama Sundar Rangapuram\n\nMatthias Seeger\n\nJan Gasthaus\n\nLorenzo Stella\n\nYuyang Wang\n\nTim Januschowski\n\n{rangapur, matthis, gasthaus, stellalo, yuyawang, tjnsch}@amazon.com\n\nAmazon Research\n\nAbstract\n\nWe present a novel approach to probabilistic time series forecasting that combines\nstate space models with deep learning. By parametrizing a per-time-series lin-\near state space model with a jointly-learned recurrent neural network, our method\nretains desired properties of state space models such as data ef\ufb01ciency and inter-\npretability, while making use of the ability to learn complex patterns from raw data\noffered by deep learning approaches. Our method scales gracefully from regimes\nwhere little training data is available to regimes where data from large collection\nof time series can be leveraged to learn accurate models. 
We provide qualitative as well as quantitative results with the proposed method, showing that it compares favorably to the state-of-the-art.

1 Introduction

Time series forecasting is a key component in many industrial and business decision processes. A typical example of such tasks is demand forecasting: accurate and up-to-date models are fundamental to successful inventory planning and minimization of operational costs.

State space models (SSMs) [8, 13, 23] provide a principled framework for modeling and learning time series patterns such as trend and seasonality. Prominent examples include ARIMA models [4, 8] and exponential smoothing [13]. SSMs are particularly well-suited for applications where the structure of the time series is well understood, as they allow for the incorporation of structural assumptions into the model. This allows the model to be interpretable and the associated learning procedure to be data efficient, but it requires time series with enough history. In modern forecasting applications with large and diverse time series corpora, SSMs require prohibitively labor- and compute-intensive tasks such as model and covariate selection. Further, traditional SSMs cannot infer shared patterns from a dataset of similar time series, as they are fitted on each time series separately. This makes creating forecasts for time series with little or no history challenging.

Deep neural networks [12, 25, 26] offer an alternative. With their capability to extract higher-order features, they can identify complex patterns within and across time series, and can do so from datasets of raw time series with considerably less human effort [9, 27, 19]. However, as these models make fewer structural assumptions, they typically require larger training datasets to learn accurate models.
Additionally, these models are difficult to interpret and, often more importantly, make it difficult to enforce assumptions such as temporal smoothness.

In this paper we propose to bridge the gap between these two approaches by fusing SSMs with deep (recurrent) neural networks. We present a forecasting method that parametrizes a particular linear SSM using a recurrent neural network (RNN). The parameters of the RNN are learned jointly from a dataset of raw time series and associated covariates, allowing the model to automatically extract features and learn complex temporal patterns. At the same time, as each individual time series is modeled using an SSM, we can enforce and exploit assumptions such as temporal smoothness.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Furthermore, our method is interpretable, in the sense that the SSM parameters for each individual time series can be inspected (and even changed if necessary). By incorporating prior structural assumptions, the presented method scales from small to large data regimes seamlessly. When there is little data to learn from, the structure imposed by the SSM can alleviate overfitting.

The rest of the paper is organized as follows. We first discuss related work in Section 2 and then review the general state space approach to time series forecasting in Section 3. In Section 4, we present our joint forecasting model and describe the training and inference procedure. In Section 5, we first do a qualitative analysis of our method and then present a quantitative comparison against the state-of-the-art. We conclude in Section 6.

2 Related work

Hyndman et al. [13] and Durbin and Koopman [8] provide comprehensive overviews of SSMs. Recent work in the machine learning literature on linear state space models includes [23, 22]. We follow [13] in their approach of using linear state space models.
The assumption of linear dynamics consisting of interpretable components (level/trend/seasonality) makes the forecasts robust and understandable. Note that non-linear effects can still be captured via exogenous variables. In the forecasting context, non-linearities are typically associated with interventions such as promotions, so this assumption is practically reasonable.

Combining state space models (SSMs) with RNNs has been proposed before, through either or both of the following: (i) extending the Gaussian emission to complex likelihood models; (ii) making the transition equation non-linear via a multi-layer perceptron (MLP), or interlacing the SSM with transition matrices temporally specified by RNNs. The so-called Deep Markov Model (DMM) proposed by [18, 17] keeps the Gaussian transition dynamics with mean and covariance matrix parameterized by MLPs. Stochastic RNNs [10] explicitly incorporate the deterministic dynamics from RNNs by interlacing them with an SSM, while the dynamics of the RNNs do not depend on the latent variables. Compared to the DMM, the difference is the added information from the deterministic state of the RNNs. An alternative way to make the transition equation non-linear is to cut the ties between the latent states l_t and associate them with deterministic states h_t of an RNN. In this way, the transition from l_{t-1} to l_t is non-linearly determined by the RNN and the observation model. Chung et al. [7] first proposed such Variational RNNs. They were later used in Latent LSTM Allocation [29] and State-Space LSTM [30]. [15] discusses unsupervised learning of state space models from sequential data.

Arguably the most relevant to our work is [11], which aims to keep the linear Gaussian transition structure intact so that the highly efficient Kalman filter/smoother is applicable. The non-linear behavior is approximated by locally linear transition matrices.
The so-called Kalman Variational Auto-Encoder (KVAE) disentangles the observations (emissions) and the latent dynamics (transitions) with a VAE. By keeping the locally linear part outside of the standard inference routine and using a fully factorized Gaussian "decoder," Kalman smoothing can be readily applied. A similar idea appeared in [14], where a recognition network is used to output conjugate graphical model potentials so that efficient structural inference is feasible. Our model differs from [11] in that instead of using an RNN to specify the linear combination of a fixed set of K parameters, we directly use RNNs to output the SSM parameters, eliminating the need to tune additional hyper-parameters.

3 Background

The general probabilistic forecasting problem is the following. Let {z(i)_{1:Ti}}_{i=1}^{N} be a set of N univariate time series, where z(i)_{1:Ti} = (z(i)_1, z(i)_2, ..., z(i)_{Ti}) and z(i)_t ∈ R denotes the value of the i-th time series at time t.¹ Further, let {x(i)_{1:Ti+τ}}_{i=1}^{N} be a set of associated, time-varying covariate vectors with x(i)_t ∈ R^D. Our goal is to produce a set of probabilistic forecasts, i.e., for each i = 1, ..., N we are interested in the probability distribution of future trajectories z(i)_{Ti+1:Ti+τ} given the past:

p( z(i)_{Ti+1:Ti+τ} | z(i)_{1:Ti}, x(i)_{1:Ti+τ}; Φ ).   (1)

¹We consider time series where the time points are equally spaced, but the units are arbitrary (e.g. hours, days, months). Further, the time series do not have to be aligned, i.e., the starting point t = 1 can refer to a different absolute time point for different time series i.

Here Φ denotes the set of learnable parameters of the model, which are shared between and learned jointly from all N time series.
For any given i, we refer to the time series z(i)_{1:Ti} as the target time series, to the time range {1, 2, ..., Ti} as the training range, and to the time range {Ti + 1, Ti + 2, ..., Ti + τ} as the prediction range. The time point Ti + 1 is referred to as the forecast start time, and τ ∈ N_{>0} is the forecast horizon. Note that we assume that the covariate vectors x(i)_t are given also in the prediction range.

We make the common simplifying assumption that the time series are independent of each other when conditioned on the associated covariates x(i)_{1:Ti} and the parameters Φ. However, in contrast to many related methods that make this assumption, in our approach the model parameters Φ are shared between all time series. So while this assumption precludes us from modeling correlations between time series, it does not mean that the proposed model is unable to share statistical strength between and learn patterns across the different time series, as we are learning the parameters Φ jointly from all time series.

State Space Models. SSMs model the temporal structure of the data via a latent state l_t ∈ R^L that can be used to encode time series components such as level, trend, and seasonality patterns. In the forecasting setting they are typically applied to individual time series (though multivariate extensions with multi-dimensional targets exist). For this reason, we will drop the superscript i from the notation in this section.
A general SSM is described by the so-called state-transition equation, defining the stochastic transition dynamics p(l_t | l_{t-1}) by which the latent state evolves over time, and an observation model specifying the conditional probability p(z_t | l_t) of observations given the latent state.

We consider linear state space models, where the transition equation takes the form

l_t = F_t l_{t-1} + g_t ε_t,   ε_t ∼ N(0, 1).   (2)

Here, at time t, the latent state l_{t-1} maintains information about level, trend, and seasonality patterns, and evolves by way of a deterministic transition matrix F_t and a random innovation g_t ε_t. The structure of the transition matrix F_t and innovation strength g_t determines which kind of time series patterns are encoded by the latent state l_t (see [13] or [22] for details on possible structures; Appendix A.1 in the long version of the paper contains two example instantiations).

The probabilistic observation model then describes how the observations are generated from the latent state l_t. Here we consider a univariate Gaussian observation model of the form

y_t = a_t^T l_{t-1} + b_t,   z_t = y_t + σ_t ϵ_t,   ϵ_t ∼ N(0, 1),   (3)

where a_t ∈ R^L, σ_t ∈ R_{>0}, and b_t ∈ R are further (time-varying) parameters of the model. Finally, the initial state l_0 is assumed to follow an isotropic Gaussian distribution, l_0 ∼ N(μ_0, diag(σ_0²)).

Parameter learning. The state space model is fully specified by the parameters Θ_t = (μ_0, Σ_0, F_t, g_t, a_t, b_t, σ_t), ∀t > 0. In the classical setting the dynamics are assumed to be time-invariant, that is, Θ_t = Θ, ∀t > 0.
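As a concrete illustration, the generative process of Eqs. (2) and (3) can be sketched in a few lines of NumPy. This is a hypothetical sketch under assumed parameter shapes, not the paper's implementation:

```python
import numpy as np

def simulate_lgssm(F, g, a, b, sigma, mu0, sigma0, rng=None):
    """Sample one trajectory z_{1:T} from the linear SSM of Eqs. (2)-(3).

    Assumed shapes: F: (T, L, L), g: (T, L), a: (T, L), b: (T,), sigma: (T,),
    mu0, sigma0: (L,). Names follow the paper's notation.
    """
    rng = np.random.default_rng(rng)
    # initial state l_0 ~ N(mu0, diag(sigma0^2))
    l = mu0 + sigma0 * rng.standard_normal(mu0.shape[0])
    T = b.shape[0]
    z = np.empty(T)
    for t in range(T):
        y = a[t] @ l + b[t]                           # y_t = a_t^T l_{t-1} + b_t
        z[t] = y + sigma[t] * rng.standard_normal()   # z_t = y_t + sigma_t * eps_t
        l = F[t] @ l + g[t] * rng.standard_normal()   # l_t = F_t l_{t-1} + g_t * eps_t
    return z
```

Note that, following Eq. (3), the observation at time t is emitted from l_{t-1} before the state is advanced by Eq. (2).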
One generic way of estimating them is by maximizing the marginal likelihood, i.e., Θ*_{1:T} = argmax_{Θ_{1:T}} p_SS(z_{1:T} | Θ_{1:T}), where

p_SS(z_{1:T} | Θ_{1:T}) := p(z_1 | Θ_1) ∏_{t=2}^{T} p(z_t | z_{1:t-1}, Θ_{1:t}) = ∫ p(l_0) [ ∏_{t=1}^{T} p(z_t | l_t) p(l_t | l_{t-1}) ] dl_{0:T}   (4)

denotes the marginal probability of the observations z_{1:T} given the parameters Θ under the state space model, integrating out the latent state l_t. In the linear-Gaussian case considered here, the required integrals are analytically tractable.

Note that in the classical setting, if there is more than one time series, a separate set of parameters Θ(i) is learned for each time series z(i)_{1:Ti} independently. This has the disadvantage that no information is shared across different time series, making it challenging to apply this approach to time series with limited historical data or high noise levels.

4 Deep State Space Models

Instead of learning the state space parameters Θ(i) for each time series independently, our forecasting model learns a globally shared mapping from the covariate vectors x(i)_{1:Ti} associated with the target time series z(i)_{1:Ti} to the (time-varying) parameters Θ(i)_t of a linear state space model for the i-th time series. This mapping,

Θ(i)_t = Ψ(x(i)_{1:t}, Φ),   i = 1, ..., N,   t = 1, ..., Ti + τ,   (5)

is a function of the entire covariate time series x(i)_{1:t} up to (and including) time t, as well as a set of shared parameters Φ. Then, given the features x(i)_{1:T} and the parameters Φ, under our model, the data z(i)_{1:Ti} is distributed according to

p( z(i)_{1:Ti} | x(i)_{1:Ti}, Φ ) = p_SS( z(i)_{1:Ti} | Θ(i)_{1:Ti} ),   i = 1, ..., N,   (6)

where p_SS denotes the marginal likelihood under a linear state space model as defined in Eq. 4, given its (time-varying) parameters Θ(i)_t.

We parameterize the mapping Ψ from covariates to state space model parameters using a deep recurrent neural network (RNN). Figure 1 shows a sketch of the overall model structure, unrolled for all the time steps in the training range.

Figure 1: Summary of the model. During training, the inputs to the network are the features x(i)_t as well as the previous network output h(i)_{t-1} at each time step t in the training range {1, 2, ..., Ti}. The network output h(i)_t = h(h(i)_{t-1}, x(i)_t, Φ) is then used to compute the parameters of the state space model Θ(i)_t, after mapping it to the corresponding ranges of the parameters. Given the time series observations z(i)_{1:Ti} in the training range, the likelihood of the state space parameters Θ(i)_{1:Ti} (which are functions of the shared network parameters Φ) is computed according to Eq. 4. The shared network parameters Φ are then learned by maximizing the likelihood.
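One way to map an unconstrained RNN output vector to valid SSM parameter ranges is an affine layer followed by elementwise transforms. The following is a hypothetical sketch: the function name `to_ssm_params`, the output layout, and the choice of sigmoid/softplus transforms are our assumptions; the paper's actual transformations are given in Appendix A.2 of the long version:

```python
import numpy as np

def softplus(x):
    """Smooth map from R to R_{>0}."""
    return np.log1p(np.exp(x))

def sigmoid(x):
    """Smooth map from R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def to_ssm_params(rnn_out, L):
    """Hypothetical mapping of an unconstrained vector of size 2*L + 2
    (the affine image of the last LSTM layer's output) to SSM parameters
    constrained to appropriate ranges."""
    a = rnn_out[:L]                       # a_t in R^L: no constraint needed
    g = sigmoid(rnn_out[L:2 * L])         # innovation strengths kept in (0, 1) (assumption)
    b = rnn_out[2 * L]                    # offset b_t in R: no constraint
    sigma = softplus(rnn_out[2 * L + 1])  # observation noise sigma_t > 0
    return a, g, b, sigma
```

The elementwise transforms guarantee, e.g., σ_t > 0 for any real-valued network output, so the likelihood in Eq. 4 is always well-defined during training.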
Given the covariates² x(i)_t associated with time series z(i)_t, a multi-layer recurrent neural network with LSTM cells and parameters Φ computes a representation of the features via a recurrent function h,

h(i)_t = h(h(i)_{t-1}, x(i)_t, Φ).

The real-valued output vector of the last LSTM layer is then mapped to the parameters Θ(i)_t of the state space model, by applying affine mappings followed by suitable elementwise transformations constraining the parameters to appropriate ranges (see Appendix A.2 in the long version of the paper). The parameters Θ(i)_t are then used to compute the likelihood of the given observations z(i)_t, which is used for learning of the network parameters Φ.

4.1 Training

The model parameters Φ are learned by maximizing the probability of observing the data {z(i)_{1:Ti}}_{i=1}^{N} in the training range, i.e., by maximizing the (log-)likelihood: Φ* = argmax_Φ L(Φ), where

L(Φ) = Σ_{i=1}^{N} log p( z(i)_{1:Ti} | x(i)_{1:Ti}, Φ ) = Σ_{i=1}^{N} log p_SS( z(i)_{1:Ti} | Θ(i)_{1:Ti} ).   (7)

²The covariates (features) can be time-dependent (e.g., product price or a set of dummy variables indicating day-of-week) or time-independent (e.g., product brand, category, etc.).

Figure 2: Illustration of how the model is used to make forecasts after the network parameters Φ are learned. Given a time series z(i)_{1:Ti} in the training range {1, 2, ..., Ti} (not necessarily in the training set) and associated features x(i)_{1:Ti+τ} for both training and prediction ranges, forecasts are produced as follows: (i) first the posterior of the latent state p(l_{Ti} | z_{1:Ti}) for the last time step Ti in the training range is computed using the observations z(i)_{1:Ti} and the state space parameters Θ(i)_{1:Ti} obtained by unrolling the RNN in the training range; (ii) given the posterior of the latent state p(l_{Ti} | z_{1:Ti}), prediction samples are generated by recursively applying the transition equation and the observation model (Eq. 8), where the state space parameters for the prediction range Θ(i)_{Ti+1:Ti+τ} are obtained by unrolling the RNN in the prediction range.

We can view each summand of L(Φ) in Eq. 7 as a (negative) loss function that measures compatibility between the state space model parameters Θ(i)_{1:Ti} produced by the RNN when given input x(i)_{1:Ti}, and the true observations z(i)_{1:Ti}. Each of these terms is a standard likelihood computation under a linear-Gaussian state space model, which can be carried out efficiently via Kalman filtering (see e.g. [3, Sec. 24.3] or [22, Appendix A] for details): this involves mainly matrix-matrix and matrix-vector multiplications, which allows us to implement the overall log-likelihood computation using a neural network framework (MXNet), and use automatic differentiation to obtain gradients with respect to the parameters Φ, which are then used by a stochastic gradient descent-based optimization procedure. Note that a forward pass of our network to compute the loss (i.e., negative log-likelihood) essentially uses the same basic primitives as that of classical methods that learn parameters per time series independently.
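The per-series likelihood term p_SS(z_{1:T} | Θ_{1:T}) from Eq. 4 can be evaluated with a textbook Kalman filter. The following is a minimal NumPy sketch for the univariate observation model of Eqs. (2) and (3), under the same assumed parameter shapes as the notation suggests; it is not the paper's MXNet implementation:

```python
import numpy as np

def lgssm_log_likelihood(z, F, g, a, b, sigma, mu0, sigma0):
    """Kalman-filter evaluation of log p_SS(z_{1:T} | Theta_{1:T}).

    Assumed shapes: z: (T,), F: (T, L, L), g: (T, L), a: (T, L),
    b: (T,), sigma: (T,), mu0, sigma0: (L,).
    """
    m, P = mu0.copy(), np.diag(sigma0 ** 2)  # belief over l_{t-1} (initially l_0)
    ll = 0.0
    for t in range(z.shape[0]):
        # predictive distribution of z_t given z_{1:t-1}; Eq. (3) emits from l_{t-1}
        zp = a[t] @ m + b[t]
        S = a[t] @ P @ a[t] + sigma[t] ** 2
        ll += -0.5 * (np.log(2 * np.pi * S) + (z[t] - zp) ** 2 / S)
        # measurement update of the belief over l_{t-1}
        K = P @ a[t] / S
        m = m + K * (z[t] - zp)
        P = P - np.outer(K, a[t]) @ P
        # time update: l_t = F_t l_{t-1} + g_t eps_t  (Eq. 2)
        m = F[t] @ m
        P = F[t] @ P @ F[t].T + np.outer(g[t], g[t])
    return ll
```

Every operation is a matrix-matrix or matrix-vector product, which is why the same recursion can be expressed in a deep learning framework and differentiated with respect to the parameters.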
Thus, one can, in principle, extend our ideas to other instances of state space models by simply redefining their parameters as the outputs of the RNN.

4.2 Prediction

Once the network parameters Φ are learned, we can use them to address our original problem specified in Eq. 1, i.e., to make probabilistic forecasts for each given time series. Given Φ, we can compute the joint distribution over the prediction range for each time series analytically, as this joint distribution is a multivariate Gaussian. However, in practice it is often more convenient to represent the forecast distribution in terms of K Monte Carlo samples,

ẑ(i)_{k,Ti+1:Ti+τ} ∼ p( z(i)_{Ti+1:Ti+τ} | z(i)_{1:Ti}, x(i)_{1:Ti+τ}, Θ(i)_{1:Ti+τ} ),   k = 1, ..., K.

In order to generate prediction samples from a state space model, one first computes the posterior of the latent state p(l_T | z_{1:T}) for the last time step T in the training range, and then recursively applies the transition equation and the observation model to generate prediction samples. More precisely, starting with a sample ℓ_T ∼ p(ℓ_T | z_{1:T}), we recursively apply

y_{T+t} = a_{T+t}^T ℓ_{T+t-1} + b_{T+t},   t = 1, ..., τ,   (8a)
ẑ_{T+t} = y_{T+t} + σ_{T+t} ϵ_{T+t},   ϵ_{T+t} ∼ N(0, 1),   t = 1, ..., τ,   (8b)
ℓ_{T+t} = F_{T+t} ℓ_{T+t-1} + g_{T+t} ε_{T+t},   ε_{T+t} ∼ N(0, 1),   t = 1, ..., τ − 1.   (8c)

In our case, we compute the posterior p(l(i)_{Ti} | z(i)_{1:Ti}) for each of the time series z(i)_{1:Ti} by unrolling the RNN in the training range to obtain Θ(i)_{1:Ti} as shown in Figure 2, and then using the Kalman filtering algorithm. Next, we unroll the RNN for the prediction range t = Ti + 1, ..., Ti + τ and obtain Θ(i)_{Ti+1:Ti+τ}, then generate the prediction samples by recursively applying the above equations K times.³

Remarks. Note that in our model, in contrast to classical and deep learning-based auto-regressive models (e.g., DeepAR [9]), target values are not used as inputs directly. This is a key feature of our method, and brings several advantages: (i) it makes the model more robust to noise, as target values are only incorporated through the likelihood term, where noise is properly accounted for; (ii) missing target values can easily be handled by simply dropping the corresponding likelihood terms; (iii) forecast sample path generation is computationally more efficient, as the RNN needs to be unrolled only once for the entire prediction (independent of the number of samples), whereas auto-regressive models (e.g., [9, 26]) have to be unrolled for each sample.

5 Experiments

Qualitative experiments. In our first experiment, we test whether our model effectively recovers the state space parameters if trained on synthetic data. For this, we generate five groups of time series from a day-of-week seasonality model (see Appendix A.1 in the long version of the paper) but with different initial states and innovation parameters per group. For simplicity, we use the same observation noise σ_t for all time series. Each time series consists of six weeks of daily data and we use the first four weeks of all time series for training our model. We use a group identifier as an input feature. In the ideal case, for each time series the model should output the parameters of the state space model from which this time series was generated.

The state space model parameters in this case are given by Θ(i)_t = (μ(i)_0, σ(i)_0, γ(i)_t, σ(i)_t), t = 1, ..., Ti + τ, where Ti = 28 ∀i and τ = 14. Note that except for σ(i)_t, all the other parameters are different for each group.
We encode the day-of-week seasonality using seven components of the latent state as in [22], i.e., L = 7 and μ_0 ∈ R^7 (each component corresponds to a different day of the week). For simplicity, we fix the term b(i)_t = 0 in this experiment.

To analyse how much data is required for recovering the parameters, we train three different models using N = {20, 40, 140} examples from each group. Figure 3 shows the ground truth values of the parameters as well as the values obtained by our model for different numbers of training examples per group. The columns show the mean of the initial state μ_0 (seven values), the innovation parameter γ_t, as well as the standard deviation σ_t of the observations, while the rows correspond to each of the five groups. The innovation parameter and the standard deviation of the observations are shown for the prediction range (two weeks). The recovery of the state space parameters gradually becomes more accurate as we increase the number of examples from 20 to 140. Moreover, these parameters are recovered reasonably well with N = 140 examples per group. In fact, the means of the initial state are exactly recovered. The standard deviation of the initial state σ_0 (not plotted) has converged to a constant value in all cases. It turns out that the initial state means μ_0 are easy to recover, but the observation noise σ_t and the standard deviation of the initial state σ_0 are the most difficult to recover.

Quantitative experiments. In our first quantitative experiment we evaluate how our model performs under small data regimes. For this, we use the publicly available datasets electricity and traffic [28]. The electricity dataset contains hourly time series of the electricity consumption of 370 customers. The traffic dataset contains hourly occupancy rates (between 0 and 1) of 963 car lanes of San Francisco bay area freeways.
As one expects, all the time series in these datasets exhibit hourly as well as daily seasonal patterns. As baselines we use the classical forecasting methods auto.arima and ets, implemented in R's forecast package, and a recent RNN-based method, DeepAR [9]. We obtained results for DeepAR using the Amazon SageMaker machine learning platform [1]. Since DeepAR and DeepState fit a joint model across the time series, both are given a time-independent feature representing the category (i.e., the index) of the time series and time-based features like hour-of-the-day, day-of-the-week, and day-of-the-month. For DeepState, the size of the SSM (i.e., the latent dimension) directly depends on the granularity of the time series, which determines the number of seasons. For hourly data, we use hour-of-day (24 seasons) as well as day-of-week (7 seasons) models and hence the latent dimension is 31. We train each method on all time series of these datasets but vary the size of the training range Ti ∈ {14, 21, 28} days. We evaluate all the methods on the next τ = 7 days after the forecast start time using the standard p50 and p90 quantile losses. For a given collection of time series z and corresponding predictions ẑ, the ρ-quantile loss for ρ ∈ (0, 1) is defined as

QL_ρ(z, ẑ) = 2 · ( Σ_{i,t} P_ρ(z(i)_t, ẑ(i)_t) ) / ( Σ_{i,t} |z(i)_t| ),
P_ρ(z, ẑ) = ρ(z − ẑ) if z > ẑ, and (1 − ρ)(ẑ − z) otherwise.

³Note that the sampling procedure is trivially parallelizable over K samples once the parameters Θ(i)_{Ti+1:Ti+τ} and the distribution of the final latent state are computed.

Figure 3: Recovery of state space parameters as the number of examples per group is increased. Columns show state space parameters while the rows correspond to each of the five groups. Each plot shows the true and the recovered values of the parameters with an increasing number of examples.

Table 1: Data efficiency. Evaluation on electricity and traffic datasets with increasing training range. The forecast is evaluated on 7 days. Entries are p50Loss/p90Loss.

Dataset      Method       2-weeks       3-weeks       4-weeks
electricity  auto.arima   0.283/0.109   0.291/0.112   0.30/0.11
             ets          0.121/0.101   0.130/0.110   0.13/0.11
             DeepAR       0.153/0.147   0.147/0.132   0.125/0.080
             DeepState    0.087/0.05    0.085/0.052   0.085/0.057
traffic      auto.arima   0.492/0.280   0.492/0.289   0.501/0.298
             ets          0.621/0.650   0.509/0.529   0.532/0.60
             DeepAR       0.177/0.153   0.126/0.096   0.219/0.138
             DeepState    0.168/0.117   0.170/0.113   0.168/0.114

The p50 and p90 losses are reported in Table 1. Overall, our method achieves the best performance except in one case. Moreover, our method achieves very good performance even with two weeks of data, since it can explicitly incorporate seasonal structures (i.e., hour-of-day seasonality). Although ets and auto.arima incorporate such seasonal structures, their results are much worse. Their inability to learn shared patterns across the time series is a possible reason for this worse performance. DeepAR tries to learn seasonal patterns purely from the data, and its performance generally improves with increased training size.
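For concreteness, the ρ-quantile loss defined above translates directly into code. A small sketch (the function name is ours):

```python
import numpy as np

def quantile_loss(z, z_hat, rho):
    """QL_rho(z, z_hat) = 2 * sum P_rho(z, z_hat) / sum |z|, where
    P_rho(z, z_hat) = rho*(z - z_hat) if z > z_hat, else (1 - rho)*(z_hat - z)."""
    z, z_hat = np.asarray(z, dtype=float), np.asarray(z_hat, dtype=float)
    diff = z - z_hat
    # (1 - rho)*(z_hat - z) == (rho - 1)*diff when diff <= 0
    p = np.where(diff > 0, rho * diff, (rho - 1) * diff)
    return 2 * p.sum() / np.abs(z).sum()
```

With ρ = 0.5 this is the p50-loss reported in the tables; ρ = 0.9 penalizes under-prediction of the 90th percentile more heavily than over-prediction.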
We show some example predictions of our method in Appendix A.5.

[Figure 3 panels: prior means (μ_0) for day-of-week seasonality; smoothing parameter (γ_t) and observation noise (σ_t) over the two-week prediction range; each panel compares the truth with the values recovered for N = 20, 40, 140.]

Next, to compare against the matrix factorization method MatFact [28], we repeat the experiment in [28] that evaluates rolling-day forecasts for seven days (i.e., the prediction horizon is one day and the forecast start time is shifted by one day after evaluating the prediction for the current day). Note that unlike MatFact, our method and DeepAR need not be retrained after updating the forecast start time: we just extend the training range by one day and update the posterior of the latent state accordingly. The results are shown in Table 2. Since MatFact only produces point forecasts, we report the normalized deviation as in [28], which in our case is equal to the p50-loss. For DeepAR and DeepState we report both p50- and p90-losses. Note that our method is much better than MatFact even though the latter is retrained after each day of prediction. We get comparable results to DeepAR, which in our experience performs well with short forecast horizons.

Table 2: Average p50/p90-loss for rolling-day prediction over seven days. MatFact outputs point predictions, so we only report its p50-loss.

              MatFact   DeepState     DeepAR
electricity   0.16      0.083/0.056   0.075/0.04
traffic       0.20      0.167/0.113   0.161/0.099

In the final experiment we evaluate our method on a diverse collection of publicly available datasets.
We selected datasets containing time series from a single domain, as our method is most suited for datasets of related time series. This includes monthly and quarterly time series from the tourism competition dataset [2] describing tourism demand, hourly time series from the M4 competition [20], and the parts dataset [6], which contains the monthly demand of spare parts at a US automobile company. The numbers of time series in these datasets are 414 (M4-Hourly), 366 (tourism-Monthly), 427 (tourism-Quarterly), and 1046 (parts).

For the tourism and M4 datasets, train and test splits are already provided. The length of the training time series as well as the starting date differ between the time series in the M4-Hourly and tourism datasets. The prediction horizons for these datasets are 48 hours (M4-Hourly), 24 months (tourism-Monthly), and 8 quarters (tourism-Quarterly). For the parts dataset we use the last 12 months as the prediction range, while the training range contains 39 months. For both tourism-Monthly and tourism-Quarterly we used a month-of-year seasonal model along with a trend component (to accommodate the trend visible in the training range of these time series), and for parts we used a month-of-year seasonal model. For M4-Hourly we used the hour-of-day as well as day-of-week seasonal models. The p50 and p90 losses are reported in Table 3 for all the methods.
These results further show that our method achieves the best performance overall.

                    ets            auto.arima      DeepAR          DeepState
M4-Hourly           0.054/0.0267   0.052/0.0354    0.090/0.0304    0.044/0.0266
parts               1.639/1.0086   1.6444/1.0664   1.273/1.086     1.47/0.935
tourism-Monthly     0.093/0.054    0.0999/0.058    0.107/0.059     0.138/0.067
tourism-Quarterly   0.105/0.055    0.1241/0.062    0.11/0.062      0.098/0.047

Table 3: p50/p90-losses for datasets obtained from publicly available sources.

6 Conclusions

In this paper we propose a new approach to time series forecasting by marrying state space models with deep recurrent neural networks. This combination allows us to explicitly incorporate structural assumptions, to handle small data regimes on the one hand, and to learn complex patterns from raw time series data in larger data regimes on the other. Our experiments on synthetic data suggest that the model is capable of accurately recovering the parameters of the state space model from which the data is generated. We also showed, on real-world datasets, that the proposed method achieves state-of-the-art performance by comparing it against a recent RNN-based method, a matrix factorization method, as well as classical approaches such as ARIMA and ETS. Under regimes of limited data, our method clearly outperforms the other methods by explicitly modelling seasonal structure. Extending our approach to other instances of state space models as well as to non-Gaussian likelihoods is among the directions we are currently pursuing. Some ideas for extending our method to non-Gaussian likelihoods are discussed in Appendix A.4 in the long version of the paper.

References

[1] Amazon SageMaker: DeepAR Forecasting. https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html.

[2] G. Athanasopoulos, R. Hyndman, H. Song, and D. Wu. The tourism forecasting competition. International Journal of Forecasting, 27(3):822–844, 2011.

[3] D. Barber.
Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.

[4] George E. P. Box and Gwilym M. Jenkins. Some recent advances in forecasting and control. Journal of the Royal Statistical Society. Series C (Applied Statistics), 17(2):91–109, 1968.

[5] George E. P. Box and David R. Cox. An analysis of transformations. Journal of the Royal Statistical Society. Series B (Methodological), pages 211–252, 1964.

[6] Nicolas Chapados. Effective Bayesian modeling of groups of related count time series. In International Conference on Machine Learning, pages 1395–1403, 2014.

[7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C. Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.

[8] James Durbin and Siem Jan Koopman. Time Series Analysis by State Space Methods, volume 38. OUP Oxford, 2012.

[9] Valentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. CoRR, abs/1704.04110, 2017. URL http://arxiv.org/abs/1704.04110.

[10] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pages 2199–2207, 2016.

[11] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pages 3604–3613, 2017.

[12] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[13] R. Hyndman, A. B. Koehler, J. K. Ord, and R. D. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer Series in Statistics. Springer, 2008.
ISBN 9783540719182.

[14] Matthew Johnson, David K. Duvenaud, Alex Wiltschko, Ryan P. Adams, and Sandeep R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.

[15] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. ICLR, 2017.

[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.

[17] Rahul G. Krishnan, Uri Shalit, and David Sontag. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.

[18] Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In AAAI, pages 2101–2109, 2017.

[19] Nikolay Laptev, Jason Yosinski, Li Erran Li, and Slawek Smyl. Time-series extreme event forecasting with neural networks at Uber. In ICML Time Series Workshop, 2017.

[20] S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The M4 competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018.

[21] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[22] Matthias Seeger, Syama Rangapuram, Yuyang Wang, David Salinas, Jan Gasthaus, Tim Januschowski, and Valentin Flunkert. Approximate Bayesian inference in linear state space models for intermittent demand forecasting at scale. CoRR, abs/1704.04110, 2017. URL http://arxiv.org/abs/1704.04110.

[23] Matthias W. Seeger, David Salinas, and Valentin Flunkert. Bayesian intermittent demand forecasting for large inventories.
In Advances in Neural Information Processing Systems, pages 4646–4654, 2016.

[25] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

[26] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. CoRR, abs/1609.03499, 2016.

[27] Ruofeng Wen, Kari Torkkola, and Balakrishnan Narayanaswamy. A multi-horizon quantile recurrent forecaster. In NIPS Time Series Workshop, 2017.

[28] Hsiang-Fu Yu, Nikhil Rao, and Inderjit S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 847–855. Curran Associates, Inc., 2016.

[29] Manzil Zaheer, Amr Ahmed, and Alexander J. Smola. Latent LSTM allocation: Joint clustering and non-linear dynamic modeling of sequence data. In International Conference on Machine Learning, pages 3967–3976, 2017.

[30] Xun Zheng, Manzil Zaheer, Amr Ahmed, Yuan Wang, Eric P. Xing, and Alexander J. Smola. State space LSTM models with particle MCMC inference.
arXiv preprint arXiv:1711.11179, 2017.