{"title": "Dynamic Local Regret for Non-convex Online Forecasting", "book": "Advances in Neural Information Processing Systems", "page_first": 7982, "page_last": 7991, "abstract": "We consider online forecasting problems for non-convex machine learning models. Forecasting introduces several challenges such as (i) frequent updates are necessary to deal with concept drift issues since the dynamics of the environment change over time, and (ii) the state of the art models are non-convex models. We address these challenges with a novel regret framework. Standard regret measures commonly do not consider both dynamic environment and non-convex models. We introduce a local regret for non-convex models in a dynamic environment. We present an update rule incurring a cost, according to our proposed local regret, which is sublinear in time T. Our update uses time-smoothed gradients. Using a real-world dataset we show that our time-smoothed approach yields several benefits when compared with state-of-the-art competitors: results are more stable against new data; training is more robust to hyperparameter selection; and our approach is more computationally efficient than the alternatives.", "full_text": "Dynamic Local Regret for Non-convex Online Forecasting\n\nSergul Aydore \u2217\nDepartment of ECE\nStevens Institute of Technology\nHoboken, NJ USA\nsergulaydore@gmail.com\n\nTianhao Zhu\nDepartment of ECE\nStevens Institute of Technology\nHoboken, NJ USA\nromeo.zhuth@gmail.com\n\nDean Foster\nAmazon\nNew York, NY USA\nfoster@amazon.com\n\nAbstract\n\nWe consider online forecasting problems for non-convex machine learning models. Forecasting introduces several challenges such as (i) frequent updates are necessary to deal with concept drift issues since the dynamics of the environment change over time, and (ii) the state of the art models are non-convex models. We address these challenges with a novel regret framework. 
Standard regret measures commonly do not consider both dynamic environment and non-convex models. We introduce a local regret for non-convex models in a dynamic environment. We present an update rule incurring a cost, according to our proposed local regret, which is sublinear in time T. Our update uses time-smoothed gradients. Using a real-world dataset we show that our time-smoothed approach yields several benefits when compared with state-of-the-art competitors: results are more stable against new data; training is more robust to hyperparameter selection; and our approach is more computationally efficient than the alternatives.\n\n1 Introduction\n\nOur goal is to design efficient stochastic gradient descent (SGD) algorithms for training non-convex models for online time-series forecasting problems. A time series is a temporally ordered sequence of real-valued data. Time series applications appear in a variety of domains such as speech processing, financial market analysis, inventory planning, weather prediction, earthquake forecasting, and many other similar areas. Forecasting is the task of predicting future outcomes based on previous observations. However, in some domains such as inventory planning or financial market analysis, the underlying relationship between inputs and outputs changes over time. This phenomenon is called concept drift in machine learning (ML) [\u017dliobait\u0117 et al., 2016]. Using a model that assumes a static relationship can result in poor forecast accuracy. In order to address concept drift, the model should either be periodically re-trained or updated as new data is observed.\n\nRecently, the state of the art in forecasting has been dominated by models with many parameters such as deep neural networks [Flunkert et al., 2017, Rangapuram et al., 2018, Toubeau et al., 2019]. 
In large scale ML, re-training such complex models using the entire dataset will be time consuming. Ideally, we should update our model using new data instead of re-training from scratch at every time step. Offline (batch/mini-batch) learning refers to training an ML model over the entire training dataset. Online learning, on the other hand, refers to updating an ML model on each new example as it is observed. Using online learning approaches, we can make our ML models deal with concept drift efficiently when re-training over the entire data set is infeasible.\n\n\u2217www.sergulaydore.com\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThe performance of online learning algorithms is commonly evaluated by regret, which is defined as the difference between the cumulative loss actually incurred and the minimum cumulative loss over the sequence of updates [Zinkevich, 2003]. If the regret grows linearly with the number of updates, it can be concluded that the model is not learning. If, on the other hand, the regret grows sublinearly, the model is learning and its accuracy is improving. While this definition of regret makes sense for convex optimization problems, it is not appropriate for non-convex problems, due to NP-hardness of non-convex global optimization even in offline settings. Indeed, most research on non-convex problems focuses on finding local optima. In the literature on non-convex optimization algorithms, it is common to use the magnitude of the gradient to analyze convergence [Hazan et al., 2017, Hsu et al., 2012]. Our proposed dynamic local regret adopts this framework, defining regret as a sliding average of gradients.\n\nStandard regret minimization algorithms efficiently learn a static optimal strategy, as mentioned in [Hazan and Seshadhri, 2009]. But this may not be optimal for online forecasting problems where the environment changes due to concept drift. 
To cope with dynamic environments, some have proposed efficient algorithms for adaptive regrets [Daniely et al., 2015, Zhang et al., 2018, Wang et al., 2018]. However, these works are limited to convex problems. Our proposed regret extends the dynamic environment framework to non-convex models.\n\nRelated Work: Online forecasting is an active area of research [Kuznetsov and Mohri, 2016]. There is a rich literature on linear models for online forecasting [Anava et al., 2013, Koolen et al., 2015, Liu et al., 2016, Hazan et al., 2018, Gultekin and Paisley, 2019]. Among these, Kuznetsov and Mohri [2016], Koolen et al. [2015], Hazan et al. [2018] study the online forecasting problem in the regret framework. The regret considered in [Hazan et al., 2018] adapts to the dynamics of the system but it is limited to linear applications.\n\nThe most relevant work to our present contribution is Hazan et al. [2017], which introduced a notion of local regret for online non-convex problems. They also proposed efficient algorithms whose regret, under their definition, is sublinear. The main idea is averaging the gradients of the most recent loss functions within a window that are evaluated at the current parameter. However, this definition of local regret assumes a static best model. This paper addresses both non-convexity and dynamic environments for online forecasting problems in a novel regret framework.\n\nOur Contributions: We present a regret framework for training non-convex forecasting models in dynamic environments. Our contributions:\n\n
\u2022 We introduce a novel local regret and demonstrate analytically that it has certain useful properties, such as robustness to new training data.\n\u2022 We present an update rule for our regret: a dynamic exponentially time-smoothed SGD update. We prove that it is sublinear according to our proposed regret.\n\u2022 We show that on a benchmark dataset our approach yields stability against new data and robustness to hyperparameter selection.\n\u2022 Our approach is more computationally efficient than the algorithm proposed by Hazan et al. [2017]. We show this empirically on a benchmark dataset, and discuss why it is the case.\n\nWe provide extensive experiments using a real-world data set to support our claims. All of our results can be reproduced using the code in https://github.com/Timbasa/Dynamic_Local_Regret_for_Non-convex_Online_Forecasting_NeurIPS2019.\n\n2 Setting\n\nIn online forecasting, our goal is to update the model parameters xt at each time step t in order to incorporate the most recently available information. Assume that t \u2208 T = {1,\u00b7\u00b7\u00b7 , T} represents a collection of T consecutive points where T is an integer and t = 1 represents an initial forecast point. f1,\u00b7\u00b7\u00b7 , fT : K \u2192 R are non-convex loss functions on some convex subset K \u2286 Rd. Then ft(xt) represents the loss function computed using the data from time t and predictions from the model parameterized by xt, which has been updated on data up to time t \u2212 1. In the subsequent sections, we will assume K = Rd.\n\n2.1 Static Local Regret\n\nHazan et al. [2017] introduced a local regret measure based on gradients of the loss. Using gradients allows the authors to address otherwise intractable non-convex models. Their regret is local in the sense that it averages a sliding window of gradients. Their regret quantifies the objective of predicting points with small gradients on average. They are motivated by a game-theoretic perspective, where an adversary reveals observations from an unknown static loss. 
The gradients of the loss functions from the w most recent rounds of play are evaluated at the current model parameters xt, and these gradients are then averaged. More formally, the local regret of Hazan et al. [2017], which we call Static Local Regret, is defined to be the sum of the squared magnitudes of the average gradients as in Definition 2.1.\n\nDefinition 2.1. (Static Local Regret) The w-local regret of an online algorithm is defined as:\n\n$$\\mathrm{SLR}_w(T) \\triangleq \\sum_{t=1}^{T} \\|\\nabla F_{t,w}(x_t)\\|^2 \\qquad (1)$$\n\nwhen K = Rd and $F_{t,w}(x_t) \\triangleq \\frac{1}{w} \\sum_{i=0}^{w-1} f_{t-i}(x_t)$. Hazan et al. [2017] proposed various gradient descent algorithms where the regret SLR is sublinear.\n\nThe motivation behind averaging is two-fold: (i) a randomly selected update has a small time-averaged gradient in expectation if an algorithm incurs local regret sublinear in T; (ii) for any online algorithm, an adversarial sequence of loss functions can force the local regret incurred to scale with T as $\\Omega(T/w^2)$. These arguments presented in Hazan et al. [2017] inspire our use of local regret. However, static local regret computes loss from the past ft\u2212i using the most recent parameter xt. In other words, the model is evaluated on in-sample data. This can cause problems for forecasting applications because of concept drift. For instance, consider a demand forecasting problem where your past loss ft\u2212i represents your objective in November and xt represents the parameters of your model for January in the following year. Assuming that the sales increase in November due to Christmas shopping, evaluating November\u2019s objective using January\u2019s parameters can be misleading for decision making.\n\n2.2 Proposed Dynamic Local Regret\n\nHere, we introduce a new definition of a local regret that suits forecasting problems motivated by the concept of calibration [Foster and Vohra, 1998]. 
First we consider the first order Taylor series expansion of the cumulative loss. The loss function calculated using the data at time t is ft. The model parameters trained on data up to t \u2212 1 are xt. We perturb xt by u:\n\n$$\\sum_{t=1}^{T} f_t(x_t + u) \\approx \\sum_{t=1}^{T} f_t(x_t) + \\sum_{t=1}^{T} \\langle u, \\nabla f_t(x_t) \\rangle \\qquad (2)$$\n\nfor any u \u2208 Rd. If the updates {x1,\u00b7\u00b7\u00b7 , xT} are well-calibrated, then perturbing xt by any u cannot substantially reduce the cumulative loss. Hence, it can be said that the sequence {x1,\u00b7\u00b7\u00b7 , xT} is asymptotically calibrated with respect to {f1,\u00b7\u00b7\u00b7 , fT} if:\n\n$$\\limsup_{T \\to \\infty} \\sup_{\\|u\\|=1} \\frac{\\sum_{t=1}^{T} f_t(x_t) - \\sum_{t=1}^{T} f_t(x_t + \\delta u)}{T} \\le 0$$\n\nwhere \u03b4 is a small positive scalar. Consequently, using the first order Taylor series expansion, we can write the following equation that motivates the left hand side of equation 3:\n\n$$\\limsup_{T \\to \\infty} \\sup_{\\|u\\|=1} -\\frac{1}{T} \\sum_{t=1}^{T} \\langle u, \\nabla f_t(x_t) \\rangle \\le 0.$$\n\nHence, by controlling the term $\\sum_{t=1}^{T} \\langle u, \\nabla f_t(x_t) \\rangle$, we ensure asymptotic calibration. In addition, we can write the following lemma for the upper bound of this first order term as:\n\nLemma 2.2. For all xs, the following equality holds:\n\n$$\\sup_{\\|u\\|=1} \\sum_{s=t-w+1}^{t} \\langle u, \\nabla f_s(x_s) \\rangle = \\left\\| \\sum_{s=t-w+1}^{t} \\nabla f_s(x_s) \\right\\|. \\qquad (3)$$\n\nBased on the above observation, we propose the regret in Definition 2.3. The idea is exponential averaging of the gradients at their corresponding parameters over a window at each update iteration.\n\nDefinition 2.3. 
(Proposed Dynamic Local Regret) We propose a w-local regret as:\n\n$$\\mathrm{DLR}_w(T) \\triangleq \\sum_{t=1}^{T} \\left\\| \\frac{1}{W} \\sum_{i=0}^{w-1} \\alpha^i \\nabla f_{t-i}(x_{t-i}) \\right\\|^2 = \\sum_{t=1}^{T} \\|\\nabla S_{t,w,\\alpha}(x_t)\\|^2$$\n\nwhere $S_{t,w,\\alpha}(x_t) \\triangleq \\frac{1}{W} \\sum_{i=0}^{w-1} \\alpha^i f_{t-i}(x_{t-i})$, $W \\triangleq \\sum_{i=0}^{w-1} \\alpha^i$, and ft(xt) = 0 for t \u2264 0. The motivation of introducing \u03b1 is two-fold: (i) it is reasonable to assign more weight to the most recent values, (ii) having \u03b1 less than 1 results in sublinear convergence as introduced in Theorem 3.4.\n\nUsing our definition of regret, we effectively evaluate an online learning algorithm by computing the exponential average of losses by assigning larger weight to the recent ones at the corresponding parameters over a sliding window. We believe that our definition of regret is more applicable to forecasting problems than the static local regret as evaluating today\u2019s forecast on previous loss functions might be misleading.\n\nMotivation via a Toy Example: We demonstrate the motivation of our dynamic regret via a toy example where the Static Local Regret fails. Concept drift occurs when the optimal model at time t may no longer be the optimal model at time t + 1. Let\u2019s consider an online learning problem with concept drift with T = 3 time periods and loss functions: $f_1(x) = (x - 1)^2$, $f_2(x) = (x - 2)^2$, $f_3(x) = (x - 3)^2$. Obviously, the best possible sequence of parameters is x1 = 1, x2 = 2, x3 = 3. We call this the oracle policy. Also consider a suboptimal sequence, where the model does not react quickly enough to concept drift: x1 = 1, x2 = 1.5, x3 = 2. We call this the stale policy. The values of the stale policy were chosen to minimize Static Local Regret. 
Using the formulation of static and dynamic local regrets, we can write:\n\n$$\\mathrm{SLR}_3(3) = \\left\\| \\frac{\\nabla f_1(x_1)}{3} \\right\\|^2 + \\left\\| \\frac{\\nabla f_2(x_2) + \\nabla f_1(x_2)}{3} \\right\\|^2 + \\left\\| \\frac{\\nabla f_3(x_3) + \\nabla f_2(x_3) + \\nabla f_1(x_3)}{3} \\right\\|^2 \\quad \\text{(Hazan\u2019s)}$$\n\n$$\\mathrm{DLR}_3(3) = \\left\\| \\frac{\\nabla f_1(x_1)}{3} \\right\\|^2 + \\left\\| \\frac{\\nabla f_2(x_2) + \\nabla f_1(x_1)}{3} \\right\\|^2 + \\left\\| \\frac{\\nabla f_3(x_3) + \\nabla f_2(x_2) + \\nabla f_1(x_1)}{3} \\right\\|^2 \\quad \\text{(Ours)}$$\n\nNote that, for the local regrets, we use w = 3 and assume ft(x) = 0 for t \u2264 0. We also set \u03b1 = 1 for our Dynamic Local Regret but other values would not change the results for this example. The formulation of the Standard Regret is $\\sum_{t=1}^{T} f_t(x_t) - \\min_x \\sum_{t=1}^{T} f_t(x)$. Although the oracle policy achieves globally minimal loss, the Static Local Regret favors the stale policy. We can verify this by computing the loss and regret for these policies, as shown in Table 1.\n\nRegret | Oracle Policy | Stale Policy | Decision\nCumulative Loss | 0 | 5/4 | Oracle policy is better\nStandard Regret | -2 | -3/4 | Oracle policy is better\nStatic Local Regret (Hazan et al.) | 40/9 | 0 | Stale policy is better\nDynamic Local Regret (Ours) | 0 | 10/9 | Oracle policy is better\n\nTable 1: Values of various regrets for the toy example. The Static Local Regret incorrectly concludes that the stale policy is better than the oracle policy.\n\n2.3 Dynamic Exponentially Time-Smoothed Stochastic Gradient Descent\n\nBelow we present two algorithms based on SGD, which has been shown to be effective for large-scale ML problems [Robbins and Monro, 1951]. 
Algorithm 1 proposed in [Hazan et al., 2017] represents the static time-smoothed SGD algorithm which is sublinear according to the regret in Definition 2.1. Here, we propose to use dynamic exponentially time-smoothed online gradient descent represented in Algorithm 2 where gradients of loss functions are calculated at their corresponding parameters. Stochastic gradients are represented by $\\hat{\\nabla} f$.\n\nAlgorithm 1 Static Time-Smoothed Stochastic Gradient Descent (STS-SGD)\nRequire: window size w \u2265 1, learning rate \u03b7 > 0. Set $x_1 \\in \\mathbb{R}^d$ arbitrarily\n1: for t = 1,\u00b7\u00b7\u00b7 , T do\n2:   Predict xt. Observe the cost function $f_t : \\mathbb{R}^d \\to \\mathbb{R}$\n3:   Update $x_{t+1} = x_t - \\frac{\\eta}{w} \\sum_{i=0}^{w-1} \\hat{\\nabla} f_{t-i}(x_t)$.\n4: end for\n\nAlgorithm 2 Dynamic Exponentially Time-Smoothed Stochastic Gradient Descent (DTS-SGD)\nRequire: window size w \u2265 1, learning rate \u03b7 > 0, exponential smoothing parameter $\\alpha \\to 1^-$ (means that \u03b1 approaches 1 from the left), normalization parameter $W \\triangleq \\sum_{i=0}^{w-1} \\alpha^i$. Set $x_1 \\in \\mathbb{R}^d$ arbitrarily\n1: for t = 1,\u00b7\u00b7\u00b7 , T do\n2:   Predict xt. Observe the cost function $f_t : \\mathbb{R}^d \\to \\mathbb{R}$\n3:   Update $x_{t+1} = x_t - \\frac{\\eta}{W} \\sum_{i=0}^{w-1} \\alpha^i \\hat{\\nabla} f_{t-i}(x_{t-i})$.\n4: end for\n\nNote that STS-SGD needs to perform w gradient calculations at each time step, while we perform only one and average the past w. This is a computational bottleneck for STS-SGD that we observed in our experimental results as well.\n\n3 Main Theoretical Results\n\nIn this section, we mathematically study the convergence properties of Algorithm 2 according to our proposed local regret. 
First, we assume the following hold for each loss function ft: (i) ft is bounded: $|f_t(x)| \\le M$; (ii) ft is L-Lipschitz: $|f_t(x) - f_t(y)| \\le L\\|x - y\\|$; (iii) ft is \u03b2-smooth: $\\|\\nabla f_t(x) - \\nabla f_t(y)\\| \\le \\beta\\|x - y\\|$; (iv) each estimate of the gradient in SGD is an i.i.d. random vector such that $\\mathbb{E}[\\hat{\\nabla} f(x)] = \\nabla f(x)$ and $\\mathbb{E}[\\|\\hat{\\nabla} f(x) - \\nabla f(x)\\|^2] \\le \\sigma^2$. Using the update in Algorithm 2, we can define the update rule as: $x_{t+1} = x_t - \\eta \\tilde{\\nabla} S_{t,w,\\alpha}(x_t)$. Note that each $\\tilde{\\nabla} S_{t,w,\\alpha}(x_t)$ is a weighted average of w independently sampled unbiased gradient estimates with a bounded variance $\\sigma^2$. Consequently, we have:\n\n$$\\mathbb{E}[\\tilde{\\nabla} S_{t,w,\\alpha}(x_t) \\mid x_t] = \\nabla S_{t,w,\\alpha}(x_t)$$\n\n$$\\mathbb{E}[\\|\\tilde{\\nabla} S_{t,w,\\alpha}(x_t) - \\nabla S_{t,w,\\alpha}(x_t)\\|^2 \\mid x_t] \\le \\frac{\\sigma^2 (1 - \\alpha^{2w})}{W^2 (1 - \\alpha^2)}. \\qquad (4)$$\n\nAs a result of the above construction, we have the following lemma for the upper bound of $\\|\\nabla S_{t,w,\\alpha}(x_t)\\|^2$.\n\nLemma 3.1. For any \u03b7, \u03b2, \u03b1 and w, the following inequality holds:\n\n$$\\left(\\eta - \\frac{\\beta}{2}\\eta^2\\right) \\|\\nabla S_{t,w,\\alpha}(x_t)\\|^2 \\le S_{t,w,\\alpha}(x_t) - S_{t+1,w,\\alpha}(x_{t+1}) + S_{t+1,w,\\alpha}(x_{t+1}) - S_{t,w,\\alpha}(x_{t+1}) + \\eta^2 \\frac{\\beta}{2} \\frac{\\sigma^2 (1 - \\alpha^{2w})}{W^2 (1 - \\alpha^2)} \\qquad (5)$$\n\nNext, we compute upper bounds for the terms in the right hand side of the inequality in Lemma 3.1.\n\nLemma 3.2. For any 0 < \u03b1 < 1 and w, the following inequality holds:\n\n$$S_{t+1,w,\\alpha}(x_{t+1}) - S_{t,w,\\alpha}(x_{t+1}) \\le \\frac{M(1 + \\alpha^{w-1})}{W} + \\frac{M(1 - \\alpha^{w-1})(1 + \\alpha)}{W(1 - \\alpha)} \\qquad (6)$$\n\n
Lemma 3.3. For any 0 < \u03b1 < 1 and w, the following inequality holds:\n\n$$S_{t,w,\\alpha}(x_t) - S_{t+1,w,\\alpha}(x_{t+1}) \\le \\frac{2M(1 - \\alpha^w)}{W(1 - \\alpha)} \\qquad (7)$$\n\nProofs of the above lemmas are given in Sections A.1, A.2, A.3 in supplementary material.\n\nTheorem 3.4. Suppose the assumptions defined above are satisfied, \u03b7 = 1/\u03b2, and $\\alpha \\to 1^-$. Then Algorithm 2 guarantees an upper bound for the regret in Definition 2.3 as:\n\n$$\\mathrm{DLR}_w(T) \\le \\frac{T}{W}\\left(8\\beta M + \\sigma^2\\right) \\qquad (8)$$\n\nwhich can be made sublinear in T if w is selected accordingly.\n\nThe proof is given in Section A.4 in supplementary material. This theorem justifies our use of a window and an exponential parameter \u03b1 that approaches 1 from the left. One interesting observation is that Algorithm 2 is equivalent to momentum based SGD [Sutskever et al., 2013] when T = w. As a consequence, our contribution can be seen as a justification for the use of momentum in online learning by appropriate choice of regret.\n\n4 Forecasting Overview\n\nStandard mean squared error as a loss function summarizes the average relationship between inputs and outputs. The resulting forecast will be a point forecast which is the conditional mean of the value to be predicted given the input values, i.e. the expected outcome. However, point forecasts provide only partial information about the conditional distribution of outputs. Many business applications such as inventory planning require richer information than just the point forecasts. Quantile loss, on the other hand, minimizes a sum that gives asymmetric penalties for overprediction and underprediction. For example, in demand forecasting, the penalty for overprediction and underprediction could be formulated as overage cost and opportunity cost, respectively. Hence, the loss for the ML model can be designed so that the profit is maximized. 
Therefore, using quantile loss as an objective function is often desirable in forecasting applications. The quantile loss for a given quantile q between true value y and the forecast value $\\hat{y}$ is defined as:\n\n$$L_q(y, \\hat{y}) = q \\max(y - \\hat{y}, 0) + (1 - q) \\max(\\hat{y} - y, 0) \\qquad (9)$$\n\nwhere q \u2208 (0, 1). Typically, forecasting systems produce outputs for multiple quantiles and horizons. The total quantile loss function to be minimized in such situations can be written as $\\sum_t \\sum_k \\sum_q L_q(y_{t+k}, \\hat{y}^q_{t+k})$, where $\\hat{y}^q_{t+k}$ is the output of the ML model, e.g. RNN, to forecast the q-th quantile of horizon k at forecast creation time t. This way, the model learns several quantiles of the conditional distribution such that $P(y_{t+k} \\le \\hat{y}^q_{t+k} \\mid y_{:t}) = q$. We use quantile loss as our cost function in the following section to forecast electric demand values from a time-series data set.\n\n5 Experimental Results\n\nWe conduct experiments on a real-world time series dataset to evaluate the performance of our approach and compare with other SGD algorithms.\n\n5.1 Time Series Data set\n\nWe use the data from GEFCom2014 [Barta et al., 2017] for our experiments. It is a public dataset released for a competition in 2014. The data contains 4 sub-datasets among which we use electrical loads. The electrical load directory contains 16 sub-directories: Task1-Task15 and Solution of Task 15. Each Task1-Task15 directory contains two CSV files: benchmark.csv and train.csv. Each train.csv file contains electrical load values per hour and temperature values measured by 25 stations. The train.csv file in Task 1 contains data from January 2005 to September 2010. The other folders have one month of data from October 2010 to December 2012. Each benchmark.csv file has benchmark forecasts of the electrical load values. 
These are point forecasts and score poorly on quantile loss metrics.\n\n5.2 Implementation Details\n\nThe general flow chart of our experiments is illustrated in Figure 1(a). We use the data from January 2005 to September 2010 for training and we set the forecast time between October 2010 and December 2012. We assume that 5-year data arrives in monthly intervals. Therefore, we update the LSTM model every time new monthly data is observed. Computational details are given in Section A.5 in supplementary material.\n\nFigure 1 (panels: (a) the flow chart of our experiments; (b) the architecture of our LSTM model): (a) Each data block in orange represents a month of data from the 5-year dataset. The model is updated each time a new month of data arrives. Our test set is the last 15 months of the dataset. Green blocks represent the forecasts for this period after each update. QLgrand is computed using these forecasts and the true values in black. (b) We use multi-step LSTM to forecast multiple horizons. The input is two-day data of size 48 \u00d7 44 and the output is the prediction of three quantiles of next one-day electrical load values.\n\nLSTM Model: LSTMs are a special kind of RNN developed to deal with exploding and vanishing gradient problems by introducing input, output and forget gates [Hochreiter and Schmidhuber, 1997]. Our model contains two LSTM layers and three fully connected linear layers where each represents one of the three quantiles. The architecture of our LSTM model is illustrated in Figure 1(b). We use multi-step LSTM to forecast multiple horizons. We use electrical load value, hours of the day, days of the week and months of the year as features so that the total number of features is 44. The input to our LSTM model is 48 \u00d7 44 where 48 is hours in two days. 
The output is the prediction of three quantiles of next day\u2019s values.\n\nTraining: During the update, we allow only one pass over the data, which means that the epoch number is set to 1. In order to make learning curves smoother, we adjust the learning rate at each update t so that $\\eta_t \\leftarrow \\eta/\\sqrt{t}$ where \u03b7 is the initial value for the learning rate. In our experiments, we use 1, 3, 5, 9 for the value of \u03b7.\n\nMetrics: After updating the model once, we evaluate the performance on the 15 months of test data (October 2010 - December 2012). We compute quantile loss for each month and report the average of these which we call QLgrand. Lower QLgrand indicates better performance.\n\nMethods: We use one offline and three online methods for training. The offline model uses the standard SGD algorithm and is re-trained from scratch on all data each time new data arrives. We regard this strategy as the best achievable, but also the most expensive in terms of computation. We call this SGD offline in our experiments. The online models are updated on new data as it is observed, without reviewing old data. We use standard SGD (called SGD online), static time-smoothed SGD (called STS-SGD) and our proposed dynamic exponentially time-smoothed SGD (called DTS-SGD) for online models.\n\nFigure 2 (panels: (a) w=20, (b) w=100, (c) w=150, (d) w=200): STS-SGD for different window sizes and learning rates. The learning curves become more sensitive to the selection of learning rates as the window size increases.\n\nFigure 3 (panels: (a) w=20, (b) w=100, (c) w=150, (d) w=200): DTS-SGD (ours) for different window sizes and learning rates. 
The learning curves stay stable against different window sizes and learning rates.\n\n5.3 Results\n\nWe compare the performance of online models in terms of their (i) accuracy, (ii) stability against window size, (iii) stability against the selection of learning rate, and (iv) computational efficiency.\n\nStability Against Window Size: Figures 2 and 3 show stability against window size for STS-SGD and DTS-SGD for different learning rates. As the window size increases, STS-SGD becomes more sensitive to the learning rate. The smoothest results with STS-SGD are obtained when the learning rate and the window size are small. For DTS-SGD, it takes longer for the curves to converge as the window size increases. However, it stays more stable against different learning rates regardless of window size.\n\nStability Against Learning Rate: We plot cumulative loss across t as a function of learning rates in Figure 4 to evaluate sensitivity of three online learning methods to learning rates. It can be seen that DTS-SGD performs well for a wider range than STS-SGD and SGD online. STS-SGD started yielding nan (not a number) results due to very large loss values as \u03b7 becomes larger; hence these are not shown in the figure. The minimum values of cumulative QLgrand for each online method are: 14,612 for SGD online, 14,585 for DTS-SGD and 14,595 for STS-SGD, indicating that the global minima are very similar but DTS-SGD is marginally better. However, other approaches require more careful selection of a learning rate. SGD offline is not shown in this figure because it was computationally infeasible to compute SGD offline for such a wide range of learning rates. In Figure 5, we compare three online methods and SGD offline for a relatively smaller range of learning rates. Each sub-figure shows performance as a function of t given a learning rate. 
The results show that a larger learning rate is needed for SGD offline and it is the best performing model as expected. However, the results for SGD online and STS-SGD oscillate a lot, indicating that they are very sensitive to the changes in learning rate as also observed in Figure 4. Our proposed approach DTS-SGD, on the other hand, stays robust as we increase the learning rate. Note that, for \u03b7 = 9, the values for STS-SGD became nan (not a number) due to very large losses after some number of iterations, hence are not shown in the figure.\n\nWe also ran experiments using SGD with momentum for various decay parameters and concluded that SGD with momentum is not even as stable as SGD online (standard SGD without momentum) to large values of learning rate as shown in Figure A.1.\n\nComputation Time: We further investigate the computation time of each method. Figure 6 shows the amount of time spent in terms of GPU seconds at each update for \u03b7 = 9 and varying w for STS-SGD and DTS-SGD. Note that these results will not be different for other learning rates since computation time does not depend on the learning rate. The figure shows that the elapsed time increases for STS-SGD and DTS-SGD as w increases as expected. It can be seen that the time elapsed curve looks exponential for SGD offline and linear for STS-SGD and DTS-SGD. As w increases, both STS-SGD and DTS-SGD become slower but DTS-SGD is still more efficient than SGD offline.\n\nFigure 4: Comparison of online methods for their sensitivity to the learning rate. Our DTS-SGD performs well for a wider range of learning rates.\n\nFigure 5 (panels: (a) \u03b7=1, (b) \u03b7=3, (c) \u03b7=5, (d) \u03b7=9): Comparison of models in terms of accuracy for various learning rates. Our DTS-SGD is less sensitive to \u03b7 than SGD online and STS-SGD. SGD offline performs the best as expected and yields higher accuracy as \u03b7 increases. 
Note that the values for STS-SGD become nan (not a number)\nafter a few iterations for \u03b7 = 9 because of large gradient values.\n\nSTS-SGD is not as efficient as DTS-SGD because it needs to store previous losses\nand compute their gradients using the current parameters, resulting in more backpropagation steps.\nUnsurprisingly, SGD online is the most efficient, but its accuracy results in Figure 5 were not as stable\nas those of DTS-SGD.\n\n(a) w=10\n\n(b) w=20\n\n(c) w=50\n\n(d) w=200\n\nFigure 6: Comparison of computation time among the four models with varying w when \u03b7 = 9.\nComputation time for STS-SGD and DTS-SGD increases as w increases. Our DTS-SGD is more\nefficient than SGD offline even for large w.\n\n6 Conclusion\n\nIn this work, we introduce a local regret for online forecasting with non-convex models and propose\ndynamic exponentially time-smoothed gradient descent as an update rule. Our contribution is inspired\nby the approach of Hazan et al. [2017], which we adapt to forecasting applications. The main idea is to\nsmooth the gradients in time when an update is performed using the new data set. We evaluate the\nperformance of this approach compared to a static time-smoothed update, a standard online SGD\nupdate, and an expensive offline model re-trained on all past data at every time step. We use a\nreal-world data set to compare all models in terms of computation time and stability against learning\nrate tuning. Our results show that our proposed algorithm DTS-SGD (i) achieves the best loss on the\ntest set (likely a statistical tie); (ii) is not sensitive to the learning rate; and (iii) is computationally\nefficient compared to the alternatives. 
We believe that our contribution can have a significant impact\non applications for online forecasting problems.\n\nAcknowledgements\n\nThis project has been supported by AWS Machine Learning Research Awards.\n\nReferences\n\nOren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction.\nIn Conference on Learning Theory, pages 172\u2013184, 2013.\n\nGergo Barta, Gyula Borbely, Gabor Nagy, Sandor Kazi, and Tamas Henk. GEFCom 2014\u2014probabilistic\nelectricity price forecasting. In International Conference on Intelligent Decision Technologies,\npages 67\u201376. Springer, 2017.\n\nAmit Daniely, Alon Gonen, and Shai Shalev-Shwartz. Strongly adaptive online learning. In\nInternational Conference on Machine Learning, pages 1405\u20131411, 2015.\n\nValentin Flunkert, David Salinas, and Jan Gasthaus. DeepAR: Probabilistic forecasting with\nautoregressive recurrent networks. arXiv preprint arXiv:1704.04110, 2017.\n\nDean P Foster and Rakesh V Vohra. Asymptotic calibration. Biometrika, 85(2):379\u2013390, 1998.\n\nSan Gultekin and John Paisley. Online forecasting matrix factorization. IEEE Transactions on Signal\nProcessing, 67(5):1223\u20131236, 2019.\n\nElad Hazan and Comandur Seshadhri. Efficient learning algorithms for changing environments. In\nProceedings of the 26th Annual International Conference on Machine Learning, pages 393\u2013400.\nACM, 2009.\n\nElad Hazan, Karan Singh, and Cyril Zhang. Efficient regret minimization in non-convex games.\narXiv preprint arXiv:1708.00075, 2017.\n\nElad Hazan, Holden Lee, Karan Singh, Cyril Zhang, and Yi Zhang. Spectral filtering for general linear\ndynamical systems. In Advances in Neural Information Processing Systems, pages 4634\u20134643,\n2018.\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):\n1735\u20131780, 1997.\n\nDaniel Hsu, Sham M Kakade, and Tong Zhang. 
A spectral algorithm for learning hidden Markov\nmodels. Journal of Computer and System Sciences, 78(5):1460\u20131480, 2012.\n\nWouter M Koolen, Alan Malek, Peter L Bartlett, and Yasin Abbasi. Minimax time series prediction.\nIn Advances in Neural Information Processing Systems, pages 2557\u20132565, 2015.\n\nVitaly Kuznetsov and Mehryar Mohri. Time series prediction and online learning. In Conference on\nLearning Theory, pages 1190\u20131213, 2016.\n\nChenghao Liu, Steven CH Hoi, Peilin Zhao, and Jianling Sun. Online ARIMA algorithms for time\nseries prediction. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.\n\nSyama Sundar Rangapuram, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and\nTim Januschowski. Deep state space models for time series forecasting. In Advances in Neural\nInformation Processing Systems, pages 7785\u20137794, 2018.\n\nHerbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical\nStatistics, pages 400\u2013407, 1951.\n\nIlya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization\nand momentum in deep learning. In International Conference on Machine Learning, pages\n1139\u20131147, 2013.\n\nJean-Fran\u00e7ois Toubeau, J\u00e9r\u00e9mie Bottieau, Fran\u00e7ois Vall\u00e9e, and Zacharie De Gr\u00e8ve. Deep\nlearning-based multivariate probabilistic forecasting for short-term scheduling in power markets. IEEE\nTransactions on Power Systems, 34(2):1203\u20131215, 2019.\n\nGuanghui Wang, Dakuan Zhao, and Lijun Zhang. Minimizing adaptive regret with one gradient per\niteration. In IJCAI, pages 2762\u20132768, 2018.\n\nLijun Zhang, Tianbao Yang, Zhi-Hua Zhou, et al. Dynamic regret of strongly adaptive methods. In\nInternational Conference on Machine Learning, pages 5877\u20135886, 2018.\n\nMartin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 
In\nProceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928\u2013936,\n2003.\n\nIndr\u0117 \u017dliobait\u0117, Mykola Pechenizkiy, and Joao Gama. An overview of concept drift applications. In\nBig Data Analysis: New Algorithms for a New Society, pages 91\u2013114. Springer, 2016.", "award": [], "sourceid": 4378, "authors": [{"given_name": "Sergul", "family_name": "Aydore", "institution": "Stevens Institute of Technology"}, {"given_name": "Tianhao", "family_name": "Zhu", "institution": "Stevens Institute of Technology"}, {"given_name": "Dean", "family_name": "Foster", "institution": "Amazon"}]}