{"title": "Forecasting Demand for Electric Power", "book": "Advances in Neural Information Processing Systems", "page_first": 739, "page_last": 746, "abstract": null, "full_text": "Forecasting Demand for Electric Power \n\nJen-Lun Yuan and Terrence L. Fine \n\nSchool of Electrical Engineering \n\nCornell University \nIthaca, NY 14853 \n\nAbstract \n\nWe are developing a forecaster for daily extremes of demand for \nelectric power encountered in the service area of a large midwest(cid:173)\nern utility and using this application as a testbed for approaches \nto input dimension reduction and decomposition of network train(cid:173)\ning. Projection pursuit regression representations and the ability \nof algorithms like SIR to quickly find reasonable weighting vectors \nenable us to confront the vexing architecture selection problem by \nreducing high-dimensional gradient searchs to fitting single-input \nsingle-output (SISO) subnets. We introduce dimension reduction \nalgorithms, to select features or relevant subsets of a set of many \nvariables, based on minimizing an index of level-set dispersions \n(closely related to a projection index and to SIR), and combine \nthem with backfitting to implement a neural network version of \nprojection pursuit. The performance achieved by our approach, \nwhen trained on 1989, 1990 data and tested on 1991 data, is com(cid:173)\nparable to that achieved in our earlier study of backpropagation \ntrained networks. 
\n\n1 \n\nIntroduction \n\nOur work has the intertwined goals of: \n\n(i) contributing to the improvement of the short-term electrical load (demand) \nforecasts used by electric utilities to buy and sell power and ensure that they can \nmeet demand; \n\n739 \n\n\f740 \n\nYuan and Fine \n\n(ii) reducing the computational burden entailed in gradient-based training of neural \nnetworks and thereby enabling the exploration of architectures; \n(iii) improving prospects for good statistical generalization by use of rational meth(cid:173)\nods for reducing complexity through the identification of good small subsets of \nvariables drawn from a large set of candidate predictor variables (feature selection); \n\n(iv) benchmarking backpropagation and neural networks as an approach to the \napplied problem of load forecasting. \n\nOur efforts proceed in the context of a problem suggested by the operational needs \nof a particular electric utility to make daily forecasts of short-term load or demand. \nForecasts are made at midday (1 p.m.) on a weekday t ( Monday - Thursday), for \nthe next evening peak e(t) (occuring usually about 8 p.m. in the winter), the daily \nminimum d(t + 1) (occuring about 4 a.m. the next morning) and the morning \npeak m( t + 1) (about noon ). \nIn addition, on Friday we are to forecast these \nthree variables for the weekend through the Monday morning peak. These daily \nextremes of demand are illustrated in an excerpt from our hourly load data plotted \nin Figure 1. \n\n4600 \n\n4400 \n\nf 4200 \n~ \ni J 4000 \n\n3600 \n\n3400 \n0 \n\n5am \n\nlam \n\nI~am \n\nRpm \n\n5 \n\n10 \n\n15 \n\n20 \n\n2S \n\n6am \n\n30 \n\n' lam \nIlam \n\n9jlm \n\n35 \n\n40 \n\n45 \n\n50 \n\nnumber of houri \n\nFigure 1: Hourly demand for two consecutive days showing the intended forecasting \nvariables. \n\nIn this paper, we focus on forecasting these extremal demands up to three days \nahead (e.g. forecasting on Fridays). 
Neural network-based forecasters are developed which parallel the recently proposed method of sliced inverse regression (SIR) (Li [1991]) and then use backfitting (Hastie and Tibshirani [1990]) to implement a training algorithm for a projection pursuit model (Friedman [1987], Huber [1985]) that can be implemented with a single hidden layer network. Our data consist of hourly integrated system demand (MWH) and hourly temperatures measured at three cities in the service area of a large midwestern utility during 1989-91. We use 1989 and 1990 as a training set and test over the whole of 1991, with the exception of holidays, which occur so infrequently that we have no training base. \n\n2 Baseline Performance \n\n2.1 Previous Work on Load Forecasting \n\nSince demand is a process which does not have a known physical or mathematical model, we do not know the best achievable forecasting performance, and we are led to making comparisons with methods and results reported elsewhere. There is a substantial literature on short-term load forecasting, with Gross et al. [1987] and Willis et al. [1984] providing good reviews of approaches based upon such statistical methods as linear least squares regression and Box-Jenkins and ARMAX time series models. Many utilities rely upon the seemingly seat-of-the-pants estimates produced by individuals who have long been employed at this task and who extrapolate from a large historical data base. In the past few years there have been several efforts to employ neural networks trained through backpropagation. In two such recent studies conducted at the Univ. of Washington, an average peak error of 2.04% was reported by Damborg et al. [1990] and an hourly load error of about 2.2% was given by Connor et al. [1991]. 
However, the accuracies reported in the literature are difficult to compare with, since utilities are exposed to different operating conditions (e.g., weather, residential/industrial balance). To provide a benchmark for the error performance achieved by our method, we evaluated three basic forecasting models on our data. These methods are based on a pair of features made plausible by the scatter plots shown in Figure 2. \n\nFigure 2: Evening peaks (Tue.-Fri., 1989-90) vs. morning peaks m(t) (MWH) and temperatures (deg F). \n\n2.2 Feature Selection and Homogeneous Data Types \n\nDemand depends on predictable calendar factors such as the season, day-of-the-week and time-of-day considerations. We grouped separately Mondays, Tuesdays through Fridays, Saturdays, and Sundays, as well as holidays. In contrast to all of the earlier work on this problem, we ignored seasonal considerations and let the network and training algorithm adjust as needed. The advantage of this was the ability to form larger training data sets. 
We thus constructed twelve networks, one for each pair consisting of one of these four types of days and one of the three daily extremes to be forecast. Demand also depends heavily upon weather, which is the primary random factor affecting forecasts. This dependency can be seen in the scatter plots of current demand vs. previous demand and temperature in Figure 2, particularly in the projection onto the 'current demand-temperature' plane, which shows a pronounced \"U\"-shaped nonlinearity. A two-feature set consisting of the most recent peaks and average temperatures over the three cities and the preceding six hours is employed for testing all three models (Table 1). \n\nTable 1: Most recent peaks of a two-feature set \n\ntype      | m(t+1) | e(t)   | d(t+1) \nMonday    | m(t)   | m(t-3) | d(t-3) \nTue.-Fri. | m(t-1) | m(t)   | d(t-1) \nSaturday  | m(t-1) | m(t-1) | d(t-1) \nSunday    | m(t-2) | m(t-2) | d(t-2) \n\n2.3 Benchmark Results \n\nThe three basic forecasting models using the two-feature set are: \n\n1) a linear regression model fitted to the data in Figure 2; \n\n2) demand vs. temperature models which roughly capture the \"U\"-shaped nonlinear relationship (LOESS with a .5 span was employed for scatter plot smoothing); \n\n3) backpropagation (BP) trained neural networks using 5 logistic nodes in a single hidden layer. \n\nThe test set errors are given in Table 2. \n\nTable 2: Forecasting accuracies (percentage absolute error) for three basic methods \n\n          |       m(t+1)      |        e(t)       |       d(t+1) \ntype      | LLS   LOESS  BP   | LLS   LOESS  BP   | LLS   LOESS  BP \nMonday    | 3.78  2.45   1.73 | 2.42  2.43   1.59 | 4.40  3.30   2.69 \nTue.-Fri. | 3.01  2.44   1.98 | 1.89  3.04   1.65 | 3.29  3.81   2.49 \nSaturday  | 3.37  2.60   2.36 | 4.54  3.76   3.48 | 3.10  3.25   2.06 \nSunday    | 4.83  3.28   3.79 | 4.89  2.74   4.26 | 3.81  2.44   3.03 \n
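For concreteness, the first baseline and the error score used throughout can be sketched as follows. Function names are ours, and the LOESS and BP baselines are not reproduced here; this is only a minimal stand-in for the linear least-squares model and the percentage absolute error metric.

```python
import numpy as np

def fit_lls(X, y):
    # Linear least-squares baseline (model 1): y ~ X b + intercept.
    A = np.column_stack([X, np.ones(len(X))])
    coef, _res, _rank, _sv = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_lls(X, coef):
    # Apply the fitted linear model, intercept included.
    return np.column_stack([X, np.ones(len(X))]) @ coef

def pct_abs_error(y_true, y_pred):
    # Percentage absolute error, the score reported for the forecasts.
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
```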
Note that among these three models, BP-trained neural networks give superior test set performance on all but Sundays. These models all give results comparable to those obtained in our earlier work on forecasting demands for Tuesday-Friday using autoregressive neural networks (Yuan and Fine [1992]). \n\n3 Projection Pursuit Training \n\nSatisfactory forecasting performance of the neural networks described above relies on the appropriate choice of feature sets and network architectures. Unfortunately, BP can only address the problem of appropriate architecture and relevant feature sets through repeated time-consuming experiments, and modeling of high-dimensional input features using gradient search requires extensive computation. We were thus prompted to look at other network structures and at training algorithms that could make it easier to explore architecture and training problems. Our initial attempt combined the dimension reduction algorithm of SIR (Li [1991]), currently replaced by an algorithm of our devising sketched in Section 4, and backfitting (Hastie et al. [1990]) to implement a neural network version of projection pursuit regression (PPR). \n\n3.1 The Algorithm \n\nA general nonlinear regression model for a forecast variable y in terms of a vector x of input variables and model noise ε, independent of x, is given by \n\ny = f(β1'x, β2'x, ..., βk'x, ε)   (*). \n\nA least mean square predictor is the conditional expectation E(y|x). The projection pursuit model/approximation of this conditional expectation is given in terms of a family of SISO functions g1, g2, ..., gk by \n\nE(y|x) = Σ_{i=1}^{k} gi(βi'x) + β0. \n\nA single hidden layer neural network can approximate this representation by introducing subnets whose summed outputs approximate the individual gi. 
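The representation above can be sketched as a forward pass. The parameterization of each SISO subnet as a single logistic node with parameters (w, c, v, b) is our simplification for illustration; the paper's subnets may have one or two nodes.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def ppr_forward(x, betas, subnets, beta0=0.0):
    # Projection pursuit prediction  y ~ sum_i g_i(beta_i' x) + beta0,
    # with each g_i a one-node logistic SISO subnet; (w, c, v, b) are that
    # subnet's input weight, bias, output weight, and output bias.
    y = beta0
    for beta, (w, c, v, b) in zip(betas, subnets):
        s = float(np.dot(beta, x))        # scalar projection beta_i' x
        y += v * logistic(w * s + c) + b  # SISO subnet output g_i(s)
    return y
```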
\nWe train such a 'projection pursuit network', with nodes partitioned into subnets representing the gi, by training the subnets individually in rotation. In this we follow the statistical regression notion of backfitting. The subnet gi is trained to predict the residuals resulting from the difference between the true value of the demand variable and the weighted outputs of the other k - 1 subnets. After a number of training cycles one then proceeds to the next subnet and repeats the process. The inputs to each subnet gi are the low-dimensional projections βi'x of the regression model. One termination criterion for determining the number of subnets k is to stop adding subnets when the projection appears to be normally distributed; results of Diaconis and Freedman point out that 'most' projections will be so distributed and thus are 'uninteresting'. The directions βi can be found by minimizing some projection index which defines interesting projections as those that deviate from Gaussian distributions (e.g., Friedman [1987]). Each βi determines the weights connecting the input to subnet gi. The whole projection pursuit regression process is simplified by decoupling the search for the directions β from the training of the SISO subnets. However, its success depends upon an ability to rapidly discern the significant directions βi. \n\n3.2 Implementations \n\nThere are several variants in the implementation of projection pursuit training algorithms. The general PPR procedure can be implemented in one stage by computa- \n\ntype      | m(t+1)    | e(t)      | d(t+1) \nMonday    | 2.35/3.45 | 1.25/1.60 | 2.76/3.49 \nTue.-Fri. | 
2.37/2.83 | 1.65/1.66 | 2.15/2.66 \nSaturday  | 2.67/3.16 | 2.78/3.96 | 2.57/3.04 \nSunday    | 3.15/5.38 | 2.63/3.67 | 2.29/3.61 \n\nTable 3: Forecasting performance (training/testing percentage error) of projection pursuit trained networks \n\ntionally intensive numerical methods, or in a two-stage heuristic (finding the βi, then the gi) as proposed here. It can be implemented with or without backfitting after the PPR phase is done. Intrator [1992] has recently suggested incorporating the projection index into the objective function and then running an overall BPA. Other variants in training each gi net include using nonparametric smoothing techniques such as LOESS or kernel methods; BP training can then be applied only in the last stage, to fit the smoothed curves so obtained. The complexity of each subnet is then largely determined by the smoothing parameters, such as window sizes, inherent in most nonparametric smoothing techniques. Another practical advantage of this process is that one can easily incorporate fixed functions of a single variable (e.g. linear nodes or quadratic nodes) when one's prior knowledge of the data source suggests that such components may be present. Our current implementation employs the two-stage algorithm with simple (either one or two node) logistic gi subnets. Each SISO gi net runs a BP algorithm to fit the data. The directions βi are calculated by minimizing a projection index (dispersion of level-sets, described in Section 4) which can be computed in a direct fashion. One can encourage the convergence of backfitting by using a relaxation parameter (like a momentum parameter in BPA) to control the size of the update in the current direction. Training (fitting) of each SISO gi net can be carried out more efficiently than running BP on high-dimensional inputs; for example, it is less expensive to evaluate the Hessian matrices in a gi net than in a full BPA network. 
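The backfitting loop of the two-stage scheme can be sketched as follows. Each SISO subnet is stood in for here by a low-degree polynomial fit (np.polyfit) rather than a BP-trained logistic net, so that only the residual-cycling logic is shown; w plays the role of the relaxation parameter mentioned above.

```python
import numpy as np

def backfit(X, y, betas, cycles=20, w=0.1, deg=2):
    # Backfitting sketch: in rotation, refit each 'subnet' g_i to the
    # residual of the others on its projection beta_i' x, moving only a
    # fraction w of the way toward the new fit (relaxation).  A real run
    # would train a one/two-node logistic net by BP instead of polyfit.
    k = len(betas)
    fits = [np.zeros(len(y)) for _ in range(k)]    # current g_i outputs
    coefs = [np.zeros(deg + 1) for _ in range(k)]  # polynomial stand-ins
    for _ in range(cycles):
        for i in range(k):
            s = X @ betas[i]                       # projection beta_i' x
            resid = y - sum(fits[j] for j in range(k) if j != i)
            new = np.polyfit(s, resid, deg)        # refit subnet i
            coefs[i] = (1 - w) * coefs[i] + w * new  # relaxed update
            fits[i] = np.polyval(coefs[i], s)
    return coefs
```

With a single linear subnet the relaxed update converges geometrically toward the least-squares fit, which makes the loop easy to check.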
\n\n3.3 Forecasting Results \n\nExperimental results were obtained using the two-component feature data sets which gave the earlier baseline performance. To calibrate the performance, we employed in all twelve projection pursuit trained networks a uniform architecture of three subnets (a (1,2,2)-logistic network), matching the 5 nodes of the BP network of Section 2. The number of backfitting cycles was set to 20, with a relaxation parameter w = 0.1. BPA was employed for fitting each gi net. The training/testing percentage absolute errors are given in Table 3. The limited data sets in the cases of individual days (Monday, Saturday, Sunday) led to failures in generalization that could have been prevented by using one or two, rather than three, subnets. \n\n4 Dimension Reduction \n\n4.1 Index of Level-Set Dispersion \n\nA key step in the projection pursuit training algorithm is to find for each gi net the projection direction βi, an instance of the important problem of economically choosing input features/variables in constructing a forecasting model. In general, the fewer the input features, the more likely are the results to generalize from training set performance to test set performance - a reduction in variance at the possible expense of an increase in bias. Our controlled-size subnet projection pursuit training algorithm deals with part of the complexity problem, provided that the input features are fixed. We turn now to our approach to finding input features or search directions based on minimizing an index of dispersion of level-sets. Li [1991] proposed taking an inverse ('slicing the y's') point of view to estimate the directions βi. 
The justification provided for this sliced inverse regression (SIR) method, however, requires that the input or feature vector x be elliptically symmetrically distributed, and this is not likely to be the case in our electric load forecasting problem. The basic idea behind minimizing the dispersion of level-sets is that, from Eq. (*), a fixed value of y and small noise ε imply a highly constrained set of values for β1'x, ..., βk'x, while leaving unconstrained the components of x that lie in the subspace B⊥ orthogonal to the space B spanned by the βi. Hence, if one has a good number of i.i.d. observations sharing a similar value of the response y, then there should be more dispersion of the input vectors projected into B⊥ than of their projections into B. We implement this by quantizing the observed y values into, say, H slices, with Lh denoting the h-th level-set containing those inputs with y-value in the h-th slice, and x̄h their sample mean. The β are then picked as the eigenvectors associated with the smallest eigenvalues of the centered covariance matrix \n\nΣ_{h=1}^{H} Σ_{x_i ∈ Lh} (x_i - x̄h)(x_i - x̄h)'. \n\n4.2 Implementations \n\nIn practical implementations, one may discard both extremes of the family of H level sets (trimming) to avoid large response values when it is believed that they may correspond to large magnitudes of the input components. One should also initially standardize the input data to a unit sample covariance matrix; otherwise, our results will reflect the distribution of x rather than the functional relationship of Eq. (*). We have applied this projection index both in finding the βi during projection pursuit training and in reducing a high-dimensional feature set to a low-dimensional one. We have implemented such a feature selection scheme for forecasting the Monday - Friday evening peaks. 
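In outline, the index and its minimizing direction can be sketched as below. This is a best-effort reading of the construction above: the quantile slicing is our choice, trimming of the extreme slices is omitted, and the direction is mapped back from the standardized coordinates at the end.

```python
import numpy as np

def level_set_direction(X, y, H=10):
    # Standardize x to unit sample covariance, slice y into H quantile
    # slices (level sets), pool the within-slice centered scatter, and
    # return the direction of least level-set dispersion.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = X.mean(axis=0)
    L = np.linalg.cholesky(np.cov(X - mu, rowvar=False))
    Ci = np.linalg.inv(L)
    Z = (X - mu) @ Ci.T                    # whitened inputs, unit covariance
    edges = np.quantile(y, np.linspace(0.0, 1.0, H + 1))
    W = np.zeros((X.shape[1], X.shape[1]))
    for h in range(H):
        upper = y <= edges[h + 1] if h == H - 1 else y < edges[h + 1]
        mask = (y >= edges[h]) & upper
        if mask.sum() < 2:
            continue
        Zh = Z[mask] - Z[mask].mean(axis=0)
        W += Zh.T @ Zh                     # within-slice (level-set) scatter
    _vals, vecs = np.linalg.eigh(W)        # ascending eigenvalues
    beta = Ci.T @ vecs[:, 0]               # back to original coordinates
    return beta / np.linalg.norm(beta)
```

On synthetic data where y depends on only one input coordinate, the recovered direction should align with that coordinate.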
The initial feature set consists of thirteen hourly loads from 1 a.m. to 1 p.m., thirteen hourly temperatures from 1 a.m. to 1 p.m., and the temperature around the peak times. Three eigenvectors of the centered covariance matrix were chosen, thereby reducing a 27-dimensional feature set to a 3-dimensional one. We then ran a standard BPA on this reduced feature set and tested on the 1991 data. We obtained a percentage absolute error of 1.6% (rms error about 100 MWH), which is as good as all of our previous efforts. \n\nAcknowledgements \n\nPartial support for this research was provided by NSF Grant No. ECS-9017493. \n\nWe wish to thank Prof. J. Hwang, Mathematics Department, Cornell, for initial discussions of SIR, and are grateful to Dr. P.D. Yeshakul, American Electric Service Corp., for providing the data set and instructing us patiently in the lore of short-term load forecasting. \n\nReferences \n\nConnor, J., L. Atlas, R.D. Martin [1991], Recurrent networks and NARMA modeling, NIPS 91. \n\nDamborg, M., M. El-Sharkawi, R. Marks II [1990], Potential of artificial neural networks in power system operation, Proc. 1990 IEEE Inter. Symp. on Circuits and Systems, 4, 2933-2937. \n\nFriedman, J. [1987], Exploratory projection pursuit, J. Amer. Stat. Assn., 82, 249-266. \n\nGross, G., F. Galiana [1987], Short-term load forecasting, Proc. IEEE, 75, 1558-1573. \n\nHastie, T., R. Tibshirani [1990], Generalized Additive Models, Chapman and Hall. \n\nHuber, P. [1985], Projection pursuit, The Annals of Statistics, 13, 435-475. \n\nIntrator, N. [1992], Combining exploratory projection pursuit and projection pursuit regression with applications to neural networks, to appear in Neural Computation. \n\nLi, K.-C. [1991], Sliced inverse regression for dimension reduction, Journal of the American Statistical Association, 86. \n\nWillis, H.L., J.F.D. 
Northcote-Green [1984], Comparison tests of fourteen load forecasting methods, IEEE Trans. on Power Apparatus and Systems, PAS-103, 1190-1197. \n\nYuan, J.-L., T.L. Fine [1992], Forecasting demand for electric power using autoregressive neural networks, Proc. Conf. on Info. Sci. and Systems, Princeton, NJ. \n", "award": [], "sourceid": 664, "authors": [{"given_name": "Jen-Lun", "family_name": "Yuan", "institution": null}, {"given_name": "Terrence", "family_name": "Fine", "institution": null}]}