{"title": "Bayesian Intermittent Demand Forecasting for Large Inventories", "book": "Advances in Neural Information Processing Systems", "page_first": 4646, "page_last": 4654, "abstract": "We present a scalable and robust Bayesian method for demand forecasting in the context of a large e-commerce platform, paying special attention to intermittent and bursty target statistics. Inference is approximated by the Newton-Raphson algorithm, reduced to linear-time Kalman smoothing, which allows us to operate on several orders of magnitude larger problems than previous related work. In a study on large real-world sales datasets, our method outperforms competing approaches on fast and medium moving items.", "full_text": "Bayesian Intermittent Demand Forecasting for Large\n\nInventories\n\nMatthias Seeger, David Salinas, Valentin Flunkert\n\nAmazon Development Center Germany\n\nmatthis@amazon.de, dsalina@amazon.de, flunkert@amazon.de\n\nKrausenstrasse 38\n\n10115 Berlin\n\nAbstract\n\nWe present a scalable and robust Bayesian method for demand forecasting in the\ncontext of a large e-commerce platform, paying special attention to intermittent\nand bursty target statistics. Inference is approximated by the Newton-Raphson\nalgorithm, reduced to linear-time Kalman smoothing, which allows us to operate on\nseveral orders of magnitude larger problems than previous related work. In a study\non large real-world sales datasets, our method outperforms competing approaches\non fast and medium moving items.\n\n1\n\nIntroduction\n\nDemand forecasting plays a central role in supply chain management, driving automated ordering,\nin-stock management, and facilities planning. Classical forecasting methods, such as exponential\nsmoothing [10] or ARIMA models [5], produce Gaussian predictive distributions. 
While sufficient\nfor inventories of several thousand fast-selling items, Gaussian assumptions are grossly violated\nfor the extremely large catalogues maintained by e-commerce platforms. There, demand is highly\nintermittent and bursty: long runs of zeros, with islands of high counts. Decision making requires\nquantiles of predictive distributions [14], whose accuracy suffers under erroneous assumptions.\nIn this work, we detail a novel methodology for intermittent demand forecasting which operates in\nthe industrial environment of a very large e-commerce platform. Implemented in Apache Spark\n[16], our method is used to process many hundreds of thousands of items and several hundreds of\nmillions of item-days. Key requirements are automated parameter learning (no expert interventions),\nscalability and a high degree of operational robustness. Our system produces forecasts both for short\n(one to three weeks) and longer lead times (up to several months), the latter requiring feature maps\ndepending on holidays, sales days, promotions, and price changes. Previous work on intermittent\ndemand forecasting in Statistics is surveyed in [15]: none of these addresses longer lead times. On a\nmodelling level, our proposal is related to [6], yet several novelties are essential for operating at the\nindustrial scale we target here. This paper makes the following contributions:\n\n\u2022 A combination of generalized linear models and time series smoothing. The former enables\nmedium and longer term forecasts, the latter provides temporal continuity and reasonable\ndistributions over time. Compared to [6], we provide empirical evidence for the usefulness\nof this combination.\n\n\u2022 A novel algorithm for maximum likelihood parameter learning in state space models with\nnon-Gaussian likelihood, using approximate Bayesian inference. While there is substantial\nrelated prior work, our proposal stands out in robustness and scalability. 
We show how\napproximate inference is solved by the Newton-Raphson algorithm, fully reduced to Kalman\nsmoothing once per iteration. This reduction scales linearly (a vanilla implementation would scale\ncubically). While previously used in Statistics [7, Sect. 10.7], this reduction is not widely known in\nMachine Learning. If L-BFGS is used instead (as proposed in [6]), approximate inference fails in\nour real-world use cases.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\f\u2022 A multi-stage likelihood, tailored to intermittent and bursty demand data (extension of\n[15]), and a novel transfer function for the Poisson likelihood, which robustifies the Laplace\napproximation for bursty data. We demonstrate that our approach would not work without\nthese novelties.\n\nThe structure of this paper is as follows. In Section 2, we introduce the intermittent demand likelihood\nfunction as well as a generalized linear model baseline. Our novel latent state forecasting methodology\nis detailed in Section 3. We relate our approach to prior work in Section 4. In Section 5, we evaluate\nour methods both on publicly available data and on a large dataset of real-world demand in the context\nof e-commerce, comparing against state of the art intermittent forecasting methods.\n\n2 Generalized Linear Models\n\nIn this section, we introduce a likelihood function for intermittent demand data, along with a\ngeneralized linear model as baseline. Denote demand by zit \u2208 N (i for item, t for day). Our goal is\nto predict distributions of zit aggregates in the future. We do this by fitting a probabilistic model to\nmaximize the likelihood of training demand data, then drawing sample paths from the fitted model,\nwhich represent forecast distributions. In the sequel, we fix an item i and write zt instead of zit.\nA model is defined by a likelihood P (zt|yt) and a latent function yt. 
An example is the Poisson:\n\nPpoi(z|y) = (1/z!) \u03bb(y)^z e^{\u2212\u03bb(y)}, z \u2208 N, (1)\n\nwhere the rate \u03bb(y) depends on y through a transfer function. Demand data over large inventories is\nboth intermittent (many zt = 0) and bursty (occasional large zt), and is not well represented by a\nPoisson. A better choice is the multi-stage likelihood, generalizing a proposal in [15]. This likelihood\ndecomposes into K = 3 stages, each with its latent function y(k). In stage k = 0, we emit z = 0 with\nprobability1 \u03c3(y(0)). Otherwise, we transfer to stage k = 1, where z = 1 is emitted with probability\n\u03c3(y(1)). Finally, if z \u2265 2, then stage k = 2 draws z \u2212 2 from the Poisson (1) with rate \u03bb(y(2)).\nIf the latent function yt (or functions yt(k)) is linear, yt = xt\u22a4w, we have a generalized linear model\n(GLM) [11]. Features in xt include kernels anchored at holidays (Christmas, Halloween), seasonality\nindicators (DayOfWeek, MonthOfYear), promotion or price change indicators. The weights w are\nlearned by maximizing the training data likelihood. For the multi-stage likelihood, this amounts to\nseparate instances of binary classification at stages 0, 1, and Poisson regression at stage 2. Generalized\nlinear forecasters work reasonably well, but have some important drawbacks. They lack temporal\ncontinuity: for short term predictions, even simple smoothers can outperform a tuned GLM. More\nimportantly, a GLM predicts overly narrow forecast distributions, whose widths do not grow over time,\nand it neglects temporal correlations. Both drawbacks are alleviated in Gaussian linear time series\nmodels, such as exponential smoothing (ES) [10]. 
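The three-stage emission scheme of Section 2 can be sketched as a sampler. This is our own illustration, not the authors' code; the function and parameter names, the use of NumPy, and the concrete latent values are assumptions.

```python
import numpy as np

def sigmoid(u):
    """Logistic sigmoid sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def sample_multistage(y0, y1, y2, rng):
    """Draw one demand value z from the K = 3 stage likelihood.

    Stage 0: emit z = 0 with probability sigma(y0).
    Stage 1: otherwise, emit z = 1 with probability sigma(y1).
    Stage 2: otherwise, draw z - 2 from a Poisson with rate lambda(y2).
    """
    if rng.random() < sigmoid(y0):
        return 0
    if rng.random() < sigmoid(y1):
        return 1
    lam = np.log1p(np.exp(y2))  # logistic transfer g(u) = log(1 + e^u)
    return 2 + rng.poisson(lam)

rng = np.random.default_rng(0)
# Strongly negative y0 makes zeros rare; larger y2 produces bursts.
draws = [sample_multistage(y0=-3.0, y1=-1.0, y2=2.0, rng=rng) for _ in range(5)]
print(draws)  # nonnegative counts, mostly >= 2 for these latent values
```

Flipping the sign of `y0` toward large positive values drives the zero probability to one, which is how intermittency enters the model.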
A major challenge is to combine this technology\nwith general likelihood functions (Poisson, multi-stage) to enable intermittent demand forecasting.\n\n3 Latent State Forecasting\n\nIn this section, we develop latent state forecasting for intermittent demand, combining GLMs, general\nlikelihoods, and exponential smoothing time series models. We begin with a single likelihood\nP (zt|yt), for example the Poisson (1), then consider a multi-stage extension. The latent process is\n\nyt = at\u22a4 lt\u22121 + bt, lt = F lt\u22121 + gt \u03b5t, bt = w\u22a4 xt, \u03b5t \u223c N (0, 1). (2)\n\nHere, bt is the GLM deterministic linear function, lt is a latent state. This innovation state space\nmodel (ISSM) [10] is defined by at, gt and F , as well as the prior l0 \u223c P (l0). Note that ISSMs are\ncharacterized by a single Gaussian innovation variable \u03b5t per time step. In our experiments here, we\nemploy a simple2 instance:\n\nyt = lt\u22121 + bt, lt = lt\u22121 + \u03b1\u03b5t, l0 \u223c N (\u00b50, \u03c30^2),\n\nmeaning that F = [1], at = [1], gt = [\u03b1], and the latent state contains a level component only. The\nfree parameters are w (weights), \u03b1 > 0, and \u00b50, \u03c30 > 0 of P (l0), collected in the vector \u03b8.\n\n1 Here, \u03c3(u) := (1 + e^{\u2212u})^{\u22121} is the logistic sigmoid.\n\n\f3.1 Training. Prediction. Multiple Stages\n\nWe would like to learn \u03b8 by maximizing the likelihood of data [zt]t=1,...,T . Compared to the\nGLM case, this is harder to do, since latent (unobserved) variables s = [\u03b51, . . . , \u03b5T\u22121, l0\u22a4]\u22a4\nhave to be integrated out. If our likelihood P (zt|yt) was Gaussian, this marginalization could be\ncomputed analytically via Kalman smoothing [10]. With a non-Gaussian likelihood, the problem is\nanalytically intractable, yet is amenable to the Laplace approximation [4, Sect. 4.4]. 
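The level-only ISSM instance just defined is cheap to simulate forward. The sketch below is our own illustration with made-up parameter values; only the recursion yt = lt-1 + bt, lt = lt-1 + alpha * eps_t comes from the text.

```python
import numpy as np

def simulate_issm_level(b, alpha, mu0, sigma0, rng):
    """Forward-simulate y_t = l_{t-1} + b_t, l_t = l_{t-1} + alpha * eps_t.

    b: array of deterministic inputs b_t = w^T x_t (passed in directly here).
    Returns one sample path of the latent function values y_1..y_T.
    """
    l = rng.normal(mu0, sigma0)  # l_0 ~ N(mu0, sigma0^2)
    y = np.empty(len(b))
    for t, bt in enumerate(b):
        y[t] = l + bt
        l = l + alpha * rng.normal()  # single Gaussian innovation per step
    return y

rng = np.random.default_rng(0)
b = np.zeros(28)  # four weeks, no features, for simplicity
path = simulate_issm_level(b, alpha=0.5, mu0=1.0, sigma0=0.1, rng=rng)
print(path[:3])
```

With alpha = 0 and sigma0 = 0 the path degenerates to the deterministic part mu0 + b_t, which is a quick sanity check on the recursion.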
The exact log\nlikelihood is log P (z|\u03b8) = log \u222b P (z, s|\u03b8) ds = log \u222b \u220ft P (zt|yt) P (s) ds, where y = y(s) is\nthe affine mapping given by (2). We proceed in two steps. First, we find the mode of the posterior:\n\u02c6s = argmax_s log P (z, s|\u03b8), the inner optimization problem. Second, we replace \u2212 log P (z, s|\u03b8)\nby its quadratic Taylor approximation f (s; \u03b8) at the mode. The criterion to replace the negative\nlog likelihood is \u03c8(\u03b8) := \u2212 log \u222b e^{\u2212f (s;\u03b8)} ds. More precisely, denote \u03c6t(yt) := \u2212 log P (zt|yt),\nand let \u02c6y = y(\u02c6s), where \u02c6s is the posterior mode. The log-concavity of the likelihood implies\nthat \u03c6t(yt) is convex, and \u03c6t''(yt) > 0. The quadratic Taylor approximation to \u03c6t(yt) at \u02c6yt is\n\u02dc\u03c6t(yt) := \u2212 log N (\u02dczt|yt, \u03c3t^2), where \u03c3t^2 = 1/\u03c6t''(\u02c6yt) and \u02dczt = \u02c6yt \u2212 \u03c3t^2 \u03c6t'(\u02c6yt). Now, Laplace\u2019s\napproximation to \u2212 log P (z|\u03b8) can be written as\n\n\u03c8(\u03b8) = \u2212 log \u222b \u220ft N (\u02dczt|yt, \u03c3t^2) P (s) ds + \u2211t (\u03c6t(\u02c6yt) \u2212 \u02dc\u03c6t(\u02c6yt)), y = y(s; \u03b8). (3)\n\nFor log-concave3 P (zt|yt), the inner optimization is a convex problem. We use the Newton-Raphson\nalgorithm to solve it. This algorithm iterates between fitting the current criterion by its local second\norder approximation and minimizing the quadratic surrogate. For the former step, we compute yt\nvalues by a forward pass (2), then replace the potentials P (zt|yt) by N (\u02dczt|yt, \u03c3t^2), where the values\n\u02dczt, \u03c3t^2 are determined by the second order fit (as above, but \u02c6yt \u2192 yt). 
The latter step amounts to\ncomputing the posterior mean (equal to the mode) E[s] of the resulting Gaussian-linear model. This\ninference problem is solved by Kalman smoothing.4\nNot only finding the mode \u02c6s, but also the computation of \u2207\u03b8 \u03c8, is fully reduced to Kalman smoothing.\nThis point is crucial. We can find \u02c6s by the most effective optimization algorithm there is. In\ngeneral, each Newton step requires the O(T^3) inversion of a Hessian matrix. We reduce it to Kalman\nsmoothing, a robust algorithm with O(T) scaling. As shown in Section 4, Newton-Raphson is\nessential here: other commonly used optimizers fail to find \u02c6s in a reasonable time.\nPrediction samples are obtained as follows. Denote observed demand by [z1, z2, . . . , zT ], unobserved\ndemand in the prediction range by [zT +1, zT +2, . . . ]. We run Newton-Raphson one more time to\nobtain the Gaussian approximation to the posterior P (lT |z1:T ) over the final state. For each sample\npath [zT +t], we draw lT \u223c P (lT |z1:T ), \u03b5T +t \u223c N (0, 1), compute [yT +t] by a forward pass, and\nzT +t \u223c P (zT +t|yT +t). Drawing prediction samples is no more expensive than from a GLM.\nFinally, we generalize latent state forecasting to the multi-stage likelihood. As for the GLM, we learn\nparameters \u03b8(k) separately for each stage k. Stages k = 0, 1 are binary classification, while stage\nk = 2 is count regression. Say that a day t is active at stage k if zt \u2265 k. Recall that with GLMs, we\nsimply drop non-active days. Here, we use ISSMs for [yt(k)] on the full range t = 1, . . . , T , but all\nnon-active yt(k) are considered unobserved: no likelihood potential is associated with t. 
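For the exponential-rate Poisson, the Gaussian site parameters z-tilde_t and sigma_t^2 used in each Newton refit have a simple closed form. The sketch below is our own illustration of that single step, not the authors' implementation (which also uses the twice logistic transfer of Section 3.2); the function name is ours.

```python
import math

def poisson_site(y, z):
    """Gaussian replacement N(z_tilde | y, sigma2) of the Poisson potential
    phi(y) = exp(y) - z * y (+ const), i.e. exponential transfer lambda = e^y.

    sigma2 = 1 / phi''(y) and z_tilde = y - sigma2 * phi'(y).
    """
    d1 = math.exp(y) - z   # phi'(y)
    d2 = math.exp(y)       # phi''(y) > 0: the likelihood is log-concave
    sigma2 = 1.0 / d2
    z_tilde = y - sigma2 * d1
    return z_tilde, sigma2

print(poisson_site(0.0, 0))  # -> (-1.0, 1.0)
```

At y = log(z) the gradient phi'(y) vanishes, so z_tilde = y there; for z = 0 and large y, sigma2 = e^(-y) collapses and z_tilde drifts far below y, which foreshadows the numerical trouble discussed in Section 3.2.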
Both Kalman\nsmoothing and mode finding (Laplace approximation) are adapted to missing observations, which\npresents no difficulties (see also Section 5.1).\n\n2 More advanced variants include damping, linear trend, and seasonality factors [10].\n3 Unless otherwise said, all likelihoods in this paper are log-concave.\n4 We use a numerically robust implementation of Kalman smoothing, detailed in [10, Sect. 12].\n\n\f3.2 Some Details\n\nIn this section, we sketch technical details, most of which are novel contributions. As demonstrated\nin our experiments, these details are essential for the whole approach to work robustly at the intended\nscale on our difficult real-world data. Full details are given in a supplemental report.\nWe use L-BFGS for the outer optimization of \u03c8(\u03b8), encoding the constrained parameters: \u03b1 =\n\u03b1m + (\u03b1M \u2212 \u03b1m)\u03c3(\u03b81); 0 < \u03b1m < \u03b1M ; \u03c30 = log(1 + e^\u03b82) > 0. We add a quadratic regularizer\n\u2211j (\u03c1j/2)(\u03b8j \u2212 \u00af\u03b8j)^2 to the criterion, where \u03c1j, \u00af\u03b8j are shared across all items. Finally, recall that\nwith the multi-stage likelihood, day t is unobserved at stage k > 1 if zt < k. If for some item, there\nare fewer than 7 observed days in a stage, we skip training and fall back to fixed parameters \u00af\u03b8.\nEvery single evaluation of \u03c8(\u03b8) requires finding the posterior mode \u02c6s. This high-dimensional inner\noptimization has to converge robustly in few iterations: \u02c6s = argmin_s F (s; \u03b8) := \u2212 log P (z|s) \u2212\nlog P (s) = \u2211t \u03c6t(yt) \u2212 log P (s). The use of Newton-Raphson and its reduction to linear-time\nKalman smoothing was noted above. The algorithm is extended by a line search procedure as well as\na heuristic to pick a starting point s0 (see supplemental report).\nWe have to compute the gradient \u2207\u03b8 \u03c8(\u03b8), where the criterion is given by (3). 
The main difficulty\nhere lies in indirect dependencies: \u03c8(\u03b8, \u02c6y, \u02c6s), where \u02c6y = y(\u02c6s; \u03b8), \u02c6s = \u02c6s(\u03b8). Since \u02c6s is computed by\nan iterative algorithm, commonly used automated differentiation tools do not sensibly apply here.\nPerhaps the most difficult indirect term is (\u2202\u02c6s \u03c8)\u22a4(\u2202\u02c6s/\u2202\u03b8j), where \u03b8j \u2208 \u03b8. First, \u02c6s is defined by\n\u2202\u02c6s F = 0. Taking the derivative w.r.t. \u03b8j on both sides, we obtain (\u2202\u02c6s/\u2202\u03b8j) = \u2212(\u2202\u02c6s,\u02c6s F )^{\u22121} \u2202\u02c6s,\u03b8j F ,\nso we are looking at \u2212(\u2202\u02c6s,\u03b8j F )\u22a4(\u2202\u02c6s,\u02c6s F )^{\u22121}(\u2202\u02c6s \u03c8). It is of course out of the question to compute and\ninvert \u2202\u02c6s,\u02c6s F . But (\u2202\u02c6s,\u02c6s F )^{\u22121}(\u2202\u02c6s \u03c8) corresponds to the posterior mean for an ISSM with Gaussian\nlikelihood, which depends on \u2202\u02c6s \u03c8. This means that the indirect gradient part costs one more run\nof Kalman smoothing, independent of the number of parameters \u03b8j. Note that the same reasoning\nunderlies our reduction of Newton-Raphson to Kalman smoothing.\nA final novel contribution is essential for making the Laplace approximation work on real-world bursty\ndemand data. Recall the transfer function \u03bb(y) for the Poisson likelihood (1) at the highest stage\nk = 2. As shown in Section 4, the exponential choice \u03bb = e^y fails for all but short term forecasts.\nWith a GLM, the logistic transfer \u03bb(y) = g(y) works well, where g(u) := log(1 + e^u). It behaves\nlike e^y for y < 0, but grows linearly for positive y. However, it exhibits grave problems for latent\nstate forecasting. 
Denote \u03c6(y) := \u2212 log P (z|y), where P (z|y) is the Poisson with logistic transfer.\nRecall Laplace\u2019s approximation from Section 3.1: \u03c6(\u00b7) is fit by a quadratic \u02dc\u03c6(\u00b7) = (\u00b7 \u2212 \u02dcz)^2/(2\u03c3^2) (up to a constant),\nwhere \u03c3^2 = 1/\u03c6''(y), \u02dcz = y \u2212 \u03c3^2 \u03c6'(y). For large y and z = 0, these two terms scale as e^y, while\nfor z > 0 they grow polynomially. In real-world data, we regularly observe sizable counts (say, a\nfew zt > 25, driving up yt), followed by a single zt = 0. At this point, huge values (\u02dczt, \u03c3t^2) arise,\ncausing cancellation errors in \u03c8(\u03b8), and the outer optimization terminates prematurely.\nThe root cause for these issues lies with the transfer function: g(y) \u2248 y for large y, and its curvature\nbehaves as e^{\u2212y}. Our remedy is to propose the novel twice logistic transfer function: \u03bb(y) =\ng(y(1 + \u03bag(y))), where \u03ba > 0. If \u03c6\u03ba(y) = \u2212 log P (z|y) with the new transfer function, then \u03c6\u03ba(y)\nbehaves similarly to \u03c6(y) for small or negative y, but crucially (\u03c6\u03ba)''(y) \u2248 2\u03ba for large y and any\nz \u2208 N. This means that Laplace approximation terms are O(1/\u03ba). Setting \u03ba = 0.01 resolves all\nproblems described above. Importantly, the resulting Poisson likelihood is log-concave for any\n\u03ba \u2265 0. We conjecture that similar problems may arise with other \u201clocal\u201d variational or expectation\npropagation inference approximations as well. The twice logistic transfer function should therefore\nbe of wider applicability.\n\n4 Related Work\n\nOur work has precursors both in Statistics and Machine Learning. Maximum likelihood learning for\nexponential smoothing models is developed in [10]. These methods are limited to Gaussian likelihood;\napproximate Bayesian inference is not used. Starting from Croston\u2019s method [10, Sect. 
16.2], there\nis a sizable literature on intermittent demand forecasting, as reviewed in [15]. The best-performing\nmethod in [15] uses negative binomial likelihood and a damped dynamic, parameters are learned\nby maximum likelihood. There is no latent (random) state, and neither non-Gaussian inference nor\nKalman smoothing are required. It does not allow for a combination with GLMs.\n\n4\n\n\fWe employ approximate Bayesian inference in a linear dynamical system, for which there is a lot\nof prior work in Machine Learning [3, 1, 2]. While Laplace\u2019s technique is the most frequently used\ndeterministic approximation in Statistics, both in publications and in automated inference systems\n[13], other techniques such as expectation propagation are applicable to models of interest here\n[12, 8]. The robustness and predictable running time of Laplace\u2019s approximation are key in our\napplication, where inference is driving parameter learning, running in parallel over hundreds of\nthousands of items. Expectation propagation is not guaranteed to converge, and Markov chain Monte\nCarlo methods even lack automated convergence tests.\nThe work most closely related to ours is [6]. They target intermittent demand forecasting, using a\nLaplace approximation for maximum likelihood learning, allow for a combination with GLMs, and\ngo beyond our work transferring information between items by way of a hierarchical prior distribution.\nTheir work is evaluated on small datasets and short term scenarios only. In contrast, our system runs\nrobustly on many hundreds of thousands of items and many millions of item-days, a three orders\nof magnitude larger scale than what they report. They do not explore the value of a feature-based\ndeterministic part, which on our real-world data is essential for medium term forecasts. We \ufb01nd that a\nnumber of choices in [6] are limiting when it comes to robustness and scalability. 
First, they choose a\nlikelihood which is not log-concave for two reasons: they use a negative binomial distribution instead\nof a Poisson, and they use zero-inflation instead of a multi-stage setup.5 This means their inner\noptimization problem is non-convex, jeopardizing robustness and efficiency of the nested learning\nprocess. Moreover, in our multi-stage setup, the conditional probability of zt = 0 versus zt > 0 is\nrepresented exactly, while zero-inflation caters for a time-independent zero probability only.\nNext, they use an exponential transfer function \u03bb = e^y for the negative binomial rate, while we\npropose the novel twice logistic function (Section 3.2). Experiments with the exponential choice on\nour data resulted in total failure, at least beyond short term forecasts. Its huge curvature for large y\nresults in extremely large and unstable predictions around holidays. In fact, the exponential function\ncauses rapid growth of predictions even without a linear function extension, unless the random\nprocess is strongly damped. Finally, they use a standard L-BFGS solver for their inner problem,\nevaluating the criterion using additional sparse matrix software. In contrast, we enable Newton-\nRaphson by reducing it to Kalman smoothing. In Figure 1, we evaluate the usefulness of L-BFGS\nfor mode finding in our setup.6 L-BFGS clearly fails to attain decent accuracy in any reasonable\namount of time, while Newton-Raphson converges reliably. Such inner reliability is key to reaching\nour goal of fully automated learning in an industrial system. In conclusion, while the lack of public\ncode for [6] precludes a direct comparison, their approach, while partly more advanced, should be\nlimited to smaller problems, shorter forecast horizons, and would be hard to run in an industrial\nsetting.\n\nFigure 1: Comparison of Newton-Raphson vs. L-BFGS for inner optimization. Sampled at first\nevaluation of \u03c8(\u03b8). 
Shown are median (P10, P90)\nover ca. 1500 items. L-BFGS fails to converge to decent accuracy.\n\n5 Experiments\n\nIn this section, we present experimental results, comparing variants of our approach to related work.\n\n5.1 Out of Stock Treatment\n\nWith a large and growing inventory, a fraction of items is out of stock at any given time, meaning\nthat order fulfillments are delayed or do not happen at all. When out of stock, an item cannot be sold\n(zt = 0), yet may still elicit considerable customer demand. The probabilistic nature of latent state\nforecasting renders it easy to use out of stock information. If an item is not in stock at day t, the data\nzt = 0 is explained away, and the corresponding likelihood term should be dropped. As noted in\nSection 3.1, this presents no difficulty in our framework.\n\n5 Zero-inflation, p0 I{zt=0} + (1 \u2212 p0)P\u2032(zt|yt), destroys log-concavity for zt = 0.\n6 The inner problem is convex; its criterion is efficiently implemented (no dependence on foreign code). The\nsituation in [6] is likely more difficult.\n\n\fFigure 2: Demand forecast for an item which is partially out of stock. Each panel: Training range\nleft (green), prediction range right (red), true targets black. In color: Median, P10 to P90. Bottom:\nOut of stock (\u2265 80% of day) marked in red. Left: Out of stock signal ignored. Demand forecast\ndrops to zero, strong underbias in prediction range. Right: Out of stock regions treated as missing\nobservations. Demand becomes uncertain in out of stock region. No underbias in prediction range.\n\nIn Figure 2, we show demand forecasts for an item which is out of stock during certain periods in\nthe training range. It is obvious that ignoring the out of stock signal leads to systematic underbias\n(since zt = 0 is interpreted as \u201cno demand\u201d). 
This underbias is corrected for by treating out of stock\nregions as having unobserved targets. Note that an item may be partially out of stock during a day,\nstill creating some sales. In such cases, we could treat zt as unobserved, but lower-bounded by the\nsales, and an expectation maximization extension may be applied. However, such situations are\ncomparatively rare in our data (compared to full-day out of stock). In the rest of this section, latent\nstate forecasting takes out of stock information into account.\n\n5.2 Comparative Study\n\nWe present experimental results obtained on a number of datasets containing intermittent count\ntime series. Parts contains monthly demand of spare parts at a US automobile company, is publicly\navailable, and was previously used in [10, 15, 6]. Further results are obtained on internal daily\ne-commerce sales data. In either case, we subsampled the sets in a stratified manner from a larger\nvolume used in our production setting. EC-sub is medium size and contains fast and medium moving\nitems. EC-all is a large dataset (more than 500K items, 150M item-days), being the union of\nEC-sub with items which are slower moving. Properties of these datasets are given in Figure 3, top\nleft. Demand is highly intermittent and bursty in all cases, as witnessed by a large CV^2 and a high\nproportion of zt = 0: these properties are typical for supply chain data. Not only is EC-all much\nlarger than any public demand forecasting dataset we are aware of, our internal datasets consist of\nlonger series (up to 10\u00d7) and are more bursty than Parts.\nThe following methods are compared. ETS is exponential smoothing with Gaussian additive errors\nand automatic model selection, a frequently used R package [9]. NegBin is our implementation of\nthe negative binomial damped dynamic variant of [15]. 
We consider two variants of our latent state\nforecaster: LS-pure without features, and LS-feats with a feature vector xt (basic seasonality,\nkernels at holidays, price changes, out of stock). Predictive distributions are represented by 100\nsamples over the prediction range (length 8 for Parts, length 365 for others). We employ quadratic\nregularization for all methods except ETS (see Section 3.2). Hyperparameters consist of regularization\nconstants \u03c1j and centers \u00af\u03b8j (full details are given in the supplemental report). We tune7 such\nparameters on a random 10% of the data, evaluating test results on the remaining 90%. For LS-pure\nand LS-feats, we use two sets of tuned hyperparameters on the largest set EC-all: one for the\nEC-sub part, the other for the rest.\nOur metrics quantify the forecast accuracy of certain quantiles of predictive distributions. They\nare defined in terms of spans [L, L + S) in the prediction range, where L is a lead time. In\ngeneral, we ignore days when items are out of stock (see Figure 3, top left, for in-stock ratios).\n\n7 We found that careful hyperparameter tuning is important for obtaining good results, also for NegBin.\nIn contrast, regularization is not even mentioned in [15] (our implementation of NegBin includes the same\nquadratic regularization as for our methods).\n\n\fIf \u03c0it = I{i in stock at t}, define Zi;(L,S) = \u2211_{t=L}^{L+S\u22121} \u03c0it zit. For \u03c1 \u2208 (0, 1), the predicted \u03c1-quantile\nof Zi;(L,S) is denoted by \u02c6Z\u03c1_i;(L,S). These predictions are obtained from the sample paths by first\nsumming over the span, then estimating the quantile by way of sorting. 
The \u03c1-quantile loss8 is\ndefined as L\u03c1(z, \u02c6z) = 2(z \u2212 \u02c6z)(\u03c1 I{z>\u02c6z} \u2212 (1 \u2212 \u03c1) I{z\u2264\u02c6z}). The P(\u03c1 \u00b7 100) risk metric for [L, L + S)\nis defined as R\u03c1[I; (L, S)] = |I|^{\u22121} \u2211_{i\u2208I} L\u03c1(Zi;(L,S), \u02c6Z\u03c1_i;(L,S)), where the left argument Zi;(L,S)\nis computed from test targets.9 We focus on P50 risk (\u03c1 = 0.5; mean absolute error) and P90 risk\n(\u03c1 = 0.9; the 0.9-quantile is often relevant for automated ordering).\n\n                   Parts   EC-sub   EC-all\n# items            19874   39700    534884\nUnit t             month   day      day\nMedian CV^2        2.4     5.8      9.7\nFreq. zt = 0       54%     46%      83%\nIn-stock ratio     100%    73%      71%\nAvg. size series   33      329      293\n# item-days        656K    13M      157M\n\nFigure 3: Table: Dataset properties. CV^2 = Var[zt]/E[zt]^2 measures burstiness. (a): Sum of\nweekly P50 point (median) forecast over a one-year prediction range for the different methods\n(lines) as well as sum of true demand (shaded area), on dataset I = EC-sub. (b): Weekly P50 risk\nR0.5[I; (7 \u00b7 k, 7)], k = 0, 1, . . . , for same dataset. (c): Same as (b) for P90 risk.\n\nWe plot the P50 and P90 risk on dataset EC-sub, as well as the sum of P50 point (median) forecast and\nthe true demand, in the three panels of Figure 3. All methods work well in the first week, but there\nare considerable differences further out. Naturally, losses are highest during the Christmas peak sales\nperiod. LS-feats strongly outperforms all others in this critical region (see Figure 3, top right), by\nmeans of its features (holidays, seasonality). The Gaussian predictive distributions of ETS exhibit\ngrowing errors over time. With the exception of the Christmas period, NegBin works rather well (in\nparticular in P50 risk), but is uniformly outperformed by both LS-pure, and LS-feats in particular.\nA larger range of results is given in Table 1 (Parts, EC-sub) and Table 2 (EC-all), where numbers\nare relative to NegBin. 
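The metric definitions above (span sums, empirical quantiles from sorted sample paths, and the rho-quantile loss) can be sketched as follows. This is our own illustration; the function names and the simple sorted-index quantile estimator are assumptions, not the authors' evaluation code.

```python
import numpy as np

def quantile_loss(z, z_hat, rho):
    """L_rho(z, z_hat) = 2 (z - z_hat) (rho * I{z > z_hat} - (1 - rho) * I{z <= z_hat})."""
    return 2.0 * (z - z_hat) * (rho if z > z_hat else -(1.0 - rho))

def predicted_quantile(sample_paths, L, S, rho):
    """Predicted rho-quantile of the span sum Z = sum_{t=L}^{L+S-1} z_t.

    sample_paths: array of shape (num_samples, horizon). Each path is first
    summed over the span [L, L+S), then the quantile is read off after sorting.
    """
    span_sums = np.sort(sample_paths[:, L:L + S].sum(axis=1))
    idx = min(int(rho * len(span_sums)), len(span_sums) - 1)
    return span_sums[idx]

# For rho = 0.5 the loss reduces to the absolute error |z - z_hat|.
print(quantile_loss(7.0, 4.0, 0.5))  # -> 3.0
```

Averaging `quantile_loss` over an item set, with the true span sums as left argument, gives the P50/P90 risk numbers reported in the tables.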
Note that the R code for ETS could not be run on the large EC-all. On\nParts, NegBin works best, yet LS-pure comes close (we did not use features on this dataset). On\nEC-sub, LS-feats outperforms all others in all scenarios. The featureless NegBin and LS-pure are\ncomparable on this dataset. On the largest set EC-all, LS-feats generally outperforms the others,\nbut differences are smaller.\nFinally, we report running times of parameter learning (outer optimization) for LS-feats on EC-sub.\nL-BFGS was run with maxIters = 55, gradTol = 10\u22125. Our experimental cluster consists of\nabout 150 nodes, with Intel Xeon E5-2670 CPUs (4 cores) and 30GB RAM. Pro\ufb01ling was done\nseparately in each stage: k = 0 (P 5 = 0.180s, P 50 = 1.30s, P 95 = 2.15s), k = 1 (P 5 = 0.143s,\nP 50 = 1.11s, P 95 = 1.79s), k = 2 (P 5 = 0.138s, P 50 = 1.29s, P 95 = 3.25s). Here, we\nquote median (P50), 5% and 95% percentiles (P5, P95). The largest time recorded was 10.4s. The\nnarrow spread of these numbers witnesses the robustness and predictability of the nested optimization\nprocess, crucial properties in the context of production systems running on parallel compute clusters.\n\n8 EZ [L\u03c1(Z, \u02c6z)] is minimized by the \u03c1-quantile. 
Also, L0.5(z, \u02c6z) = |z \u2212 \u02c6z|.\n9 More precisely, we filter I before use in R\u03c1[I; (L, S)]: I\u2032 = {i \u2208 I | \u2211_{t=L}^{L+S\u22121} \u03c0it \u2265 0.8S}.\n\n\f                 Parts                               EC-sub\n                 P90 risk         P50 risk           P90 risk                  P50 risk\n(L, S)           (0, 2)  dy(8)    (0, 2)  dy(8)      (0, 56) (21, 84) wk(33)   (0, 56) (21, 84) wk(33)\nETS              1.04    1.04     1.19    1.38       0.99    1.13     0.75     1.07    1.18     1.10\nLS-pure          1.08    1.06     1.04    1.06       1.07    0.99     0.97     0.95    0.99     1.03\nLS-feats         \u2013       \u2013        \u2013       \u2013          0.80    0.85     0.73     0.84    0.94     0.84\nNegBin           1.00    1.00     1.00    1.00       1.00    1.00     1.00     1.00    1.00     1.00\n\nTable 1: Results for dataset Parts (left) and EC-sub (right). Metric values relative to NegBin\n(each column). dy(8): Average of R\u03c1[I; (k, 1)], k = 0, . . . , 7. wk(33): Average of R\u03c1[I; (7 \u00b7 k, 7)],\nk = 0, . . . , 32.\n\n                 P90 risk                  P50 risk\n(L, S)           (0, 56) (21, 84) wk(33)  (0, 56) (21, 84) wk(33)\nLS-pure          1.11    0.99     1.03    1.00    1.05     1.03\nLS-feats         0.95    0.89     0.86    0.92    0.98     0.88\nNegBin           1.00    1.00     1.00    1.00    1.00     1.00\n\nTable 2: Results for dataset EC-all. Metric values relative to NegBin (each column). ETS could not\nbe run at this scale.\n\n6 Conclusions. Future Work\n\nIn this paper, we developed a framework for maximum likelihood learning of probabilistic latent\nstate forecasting models, which can be seen as principled time series extensions of generalized linear\nmodels. We pay special attention to the intermittent and bursty statistics of demand, characteristic for\nthe vast inventories maintained by large retailers or e-commerce platforms. 
We show how approximate Bayesian inference techniques can be implemented in a robust and highly scalable way, so as to enable a forecasting system which runs safely on hundreds of thousands of items and hundreds of millions of item-days.

We can draw some conclusions from our comparative study on a range of real-world datasets. Our proposed method strongly outperforms competitors on sales data from fast and medium moving items. Besides good short-term forecasts due to temporal smoothness and well-calibrated growth of uncertainty, our use of a feature vector seems most decisive for medium-term forecasts. On slow moving items, simpler methods like NegBin [15] are competitive, even though they lack signal models which could be learned from data.

We are investigating several directions for future work. Our current system uses time-independent ISSMs; in particular, g_t = [α] means that the same amount of innovation variance is applied every day. This assumption is violated by our data, where far more variation happens in the weeks leading up to Christmas or before major holidays than during the rest of the year. To address this, we are exploring learning two parameters: α_h during high-variation periods, and α_l for all remaining days. We also plan to augment the state l_t by seasonality factors [10, Sect. 14] (see footnote 10; both a_t and g_t then depend on time).

One of the most important future directions is to learn about and exploit dependencies between the demand time series of different items. In fact, the strategy of learning and forecasting each item independently is not suitable for items with a short demand history, or for slow moving items.
One approach we pursue is to couple the latent processes by a shared (global) linear or non-linear function.

Acknowledgements

We would like to thank Maren Mahsereci for determining the running time figures, and the Wupper team for all the hard work without which this paper would not have happened.

Footnote 10: Currently, periodic seasonality is dealt with by features in x_t.

References

[1] D. Barber. Expectation correction for smoothing in switching linear Gaussian state space models. Journal of Machine Learning Research, 7:2515–2540, 2006.

[2] D. Barber, T. Cemgil, and S. Chiappa. Bayesian Time Series Models. Cambridge University Press, 1st edition, 2011.

[3] M. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Unit, UCL, 2003.

[4] C. Bishop. Pattern Recognition and Machine Learning. Springer, 1st edition, 2006.

[5] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. John Wiley & Sons, 4th edition, 2013.

[6] N. Chapados. Effective Bayesian modeling of groups of related count time series. In E. Xing and T. Jebara, editors, International Conference on Machine Learning 31, pages 1395–1403. JMLR.org, 2014.

[7] J. Durbin and S. Koopman. Time Series Analysis by State Space Methods. Oxford Statistical Sciences. Oxford University Press, 2nd edition, 2012.

[8] T. Heskes and O. Zoeter. Expectation propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Artificial Intelligence 18. Morgan Kaufmann, 2002.

[9] R. Hyndman and Y. Khandakar. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008.

[10] R. Hyndman, A. Koehler, J. Ord, and R. Snyder. Forecasting with Exponential Smoothing: The State Space Approach. Springer, 1st edition, 2008.

[11] P. McCullagh and J. A. Nelder. Generalized Linear Models. Number 37 in Monographs on Statistics and Applied Probability. Chapman & Hall, 1st edition, 1983.

[12] T. Minka. Expectation propagation for approximate Bayesian inference. In J. Breese and D. Koller, editors, Uncertainty in Artificial Intelligence 17. Morgan Kaufmann, 2001.

[13] H. Rue and S. Martino. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society B, 71(2):319–392, 2009.

[14] L. Snyder and Z. Shen. Fundamentals of Supply Chain Theory. John Wiley & Sons, 1st edition, 2011.

[15] R. Snyder, J. Ord, and A. Beaumont. Forecasting the intermittent demand for slow-moving inventories: A modelling approach. International Journal of Forecasting, 28:485–496, 2012.

[16] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI), page 2, 2012.