{"title": "Memory-based Stochastic Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 1066, "page_last": 1072, "abstract": null, "full_text": "Memory-based Stochastic Optimization \n\nAndrew W. Moore and Jeff Schneider \n\nSchool of Computer Science \nCarnegie-Mellon University \nPittsburgh, PA 15213 \n\nAbstract \n\nIn this paper we introduce new algorithms for optimizing noisy plants in which each experiment is very expensive. The algorithms build a global non-linear model of the expected output at the same time as using Bayesian linear regression analysis of locally weighted polynomial models. The local model answers queries about confidence, noise, gradients and Hessians, and uses them to make automated decisions similar to those made by a practitioner of Response Surface Methodology. The global and local models are combined naturally as a locally weighted regression. We examine the question of whether the global model can really help optimization, and we extend it to the case of time-varying functions. We compare the new algorithms with a highly tuned higher-order stochastic optimization algorithm on randomly-generated functions and a simulated manufacturing task. We note significant improvements in total regret, time to converge, and final solution quality. \n\n1 INTRODUCTION \n\nIn a stochastic optimization problem, noisy samples are taken from a plant. A sample consists of a chosen control u (a vector of real numbers) and a noisy observed response y. y is drawn from a distribution with mean and variance that depend on u. y is assumed to be independent of previous experiments. Informally, the goal is to quickly find the control u that maximizes the expected output E[y | u].
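To make the setting concrete, the sampling setup can be sketched in Python. The quadratic mean response, the noise level, and all names below are illustrative assumptions, not part of the paper's benchmark; the point is only that each call to the plant is one expensive, noisy experiment.

```python
import random

def plant(u):
    """A hypothetical noisy plant. The smooth mean response below is
    unknown to the optimizer; each call is one expensive experiment."""
    mean = 1.0 - (u[0] - 0.6) ** 2 - (u[1] - 0.4) ** 2
    return mean + random.gauss(0.0, 0.1)  # noisy observed response y

def estimate_expected_output(u, n_samples=100):
    """Estimate E[y | u] by averaging repeated experiments at the same u.
    This illustrates the cost of naive sampling: many experiments per control."""
    return sum(plant(u) for _ in range(n_samples)) / n_samples
```

The whole difficulty of the problem is that an experiment budget of hundreds of samples per control, as above, is unaffordable when each experiment is expensive.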
This is different from conventional numerical optimization because the samples can be very noisy, there is no gradient information, and we usually wish to avoid ever performing badly (relative to our start state) even during optimization. Finally and importantly: each experiment is very expensive and there is ample computational time (often many minutes) for deciding on the next experiment. The following questions are both interesting and important: how should this computational time best be used, and how can the data best be used? \n\nStochastic optimization is of real industrial importance, and indeed one of our reasons for investigating it is an association with a U.S. manufacturing company that has many new examples of stochastic optimization problems every year. \n\nThe discrete version of this problem, in which u is chosen from a discrete set, is the well known k-armed bandit problem. Reinforcement learning researchers have recently applied bandit-like algorithms to efficiently optimize several discrete problems [Kaelbling, 1990, Greiner and Jurisica, 1992, Gratch et al., 1993, Maron and Moore, 1993]. This paper considers extensions to the continuous case in which u is a vector of reals. We anticipate useful applications here too. Continuity implies a formidable number of arms (uncountably infinite) but permits us to assume smoothness of E[y | u] as a function of u. \nThe most popular current techniques are: \n\n\u2022 Response Surface Methods (RSM). Current RSM practice is described in the classic reference [Box and Draper, 1987]. Optimization proceeds by cautious steepest ascent hill-climbing. A region of interest (ROI) is established at a starting point and experiments are made at positions within the region that can best be used to identify the function properties with low-order polynomial regression.
A large portion of the RSM literature concerns experimental design: the decision of where to take data points in order to acquire the lowest variance estimate of the local polynomial coefficients in a fixed number of experiments. When the gradient is estimated with sufficient confidence, the ROI is moved accordingly. Regression of a quadratic locates optima within the ROI and also diagnoses ridge systems and saddle points. \nThe strength of RSM is that it is careful not to change operating conditions based on inadequate evidence, but moves once the data justifies it. A weakness of RSM is that human judgment is needed: it is not an algorithm, but a manufacturing methodology. \n\n\u2022 Stochastic Approximation methods. The algorithm of [Robbins and Monro, 1951] does root finding without the use of derivative estimates. Through the use of successively smaller steps, convergence is proven under broad assumptions about noise. Kiefer-Wolfowitz (KW) [Kushner and Clark, 1978] is a related algorithm for optimization problems. From an initial point it estimates the gradient by performing an experiment in each direction along each dimension of the input space. Based on the estimate, it moves its experiment center and repeats. Again, use of decreasing step sizes leads to a proof of convergence to a local optimum. The strength of KW is its aggressive exploration, its simplicity, and that it comes with convergence guarantees. However, it has more of a danger of attempting wild experiments in the presence of noise, and effectively discards the data it collects after each gradient estimate is made. In practice, higher order versions of KW are available in which convergence is accelerated by replacing the fixed step size schedule with an adaptive one [Kushner and Clark, 1978]. Later we compare the performance of our algorithms to such a higher-order KW. \n\n2 MEMORY-BASED OPTIMIZATION \n\nNeither KW nor RSM uses old data.
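For reference, the basic fixed-schedule KW iteration described above can be sketched as follows. The particular gain schedules (a/k for the step size, c/k^(1/3) for the perturbation) are common textbook choices rather than anything specified in this paper, and the function names are our own.

```python
def kiefer_wolfowitz(f, u0, steps=100, a=0.5, c=0.1):
    """Kiefer-Wolfowitz stochastic approximation for maximizing a noisy f.
    Estimates the gradient by a symmetric finite difference along each
    dimension (two experiments per dimension), moves the experiment center
    up the estimated gradient, and shrinks both gains over time."""
    u = list(u0)
    for k in range(1, steps + 1):
        a_k = a / k             # decreasing step size
        c_k = c / k ** (1 / 3)  # decreasing perturbation size
        grad = []
        for j in range(len(u)):
            up, dn = list(u), list(u)
            up[j] += c_k
            dn[j] -= c_k
            grad.append((f(up) - f(dn)) / (2 * c_k))
        u = [u[j] + a_k * grad[j] for j in range(len(u))]
    return u
```

On a real plant, each call to f would be one noisy experiment; note that the 2d experiments behind each gradient estimate are discarded after the move.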
After a gradient has been identified the control u is moved up the gradient and the data that produced the gradient estimate is discarded. Does this lead to inefficiencies in operation? This paper investigates one way of using old data: build a global non-linear plant model with it. \n\nWe use locally weighted regression to model the system [Cleveland and Devlin, 1988, Atkeson, 1989, Moore, 1992]. We have adapted the methods to return posterior distributions for their coefficients and noise (and thus, indirectly, their predictions) based on very broad priors, following the Bayesian methods for global linear regression described in [DeGroot, 1970]. \n\nWe estimate the coefficients β = (β_1, ..., β_m) of a local polynomial model in which the data was generated by the polynomial and corrupted with Gaussian noise of variance σ^2, which we also estimate. Our prior assumption will be that β is distributed according to a multivariate Gaussian of mean 0 and covariance matrix Σ. Our prior on σ is that 1/σ^2 has a gamma distribution with parameters α and β. \n\nAssume we have observed n pieces of data. The jth polynomial term for the ith data point is x_ij and the output response of the ith data point is y_i. Assume further that we wish to estimate the model local to the query point x_q, in which a data point at distance d_i from the query point has weight w_i = exp(-d_i^2 / K). K, the kernel width, is a fixed parameter that determines the degree of localness in the local regression. Let W = Diag(w_1, w_2, ..., w_n). \n\nThe marginal posterior distribution of β is a t distribution with mean \n\nβ_hat = (Σ^-1 + X^T W^2 X)^-1 (X^T W^2 y), \n\ncovariance \n\n(2β + (y^T - β_hat^T X^T) W^2 y) (Σ^-1 + X^T W^2 X)^-1 / (2α + sum_{i=1..n} w_i^2)   (1) \n\nand 2α + sum_{i=1..n} w_i^2 degrees of freedom. \n\nWe assume a wide, weak prior: Σ = Diag(20^2, 20^2, ..., 20^2), α = 0.8, β = 0.001, meaning the prior assumes each regression coefficient independently lies with high probability in the range -20 to 20, and the noise lies in the range 0.01 to 0.5. \n\nBriefly, we note the following reasons that Bayesian locally weighted polynomial regression is particularly suited to this application: \n\n\u2022 We can directly obtain meaningful confidence estimates of the joint pdf of the regressed coefficients and predictions. Indirectly, we can compute the probability distribution of the steepest gradient, the location of local optima and the principal components of the local Hessian. \n\n\u2022 The Bayesian approach allows meaningful regressions even with fewer data points than regression coefficients: the posterior distribution reveals enormous lack of confidence in some aspects of such a model, but other useful aspects can still be predicted with confidence. This is crucial in high dimensions, where it may be more effective to head in a known positive gradient direction without waiting for all the experiments that would be needed for a precise estimate of the steepest gradient. \n\n\u2022 Other pros and cons of locally weighted regression in the context of control can be found in [Moore et al., 1995]. \n\nGiven the ability to derive a plant model from data, how should it best be used? The true optimal answer, which requires solving an infinite-dimensional Markov decision process, is intractable. We have developed four approximate algorithms that use the learned model, described briefly below. \n\n\u2022 AutoRSM. Fully automates the (normally manual) RSM procedure and incorporates weighted data from the model, not only from the current design. It uses online experimental design to pick ROI design points to maximize information about local gradients and optima. Space does not permit description of the linear algebraic formulations of these questions. \n\n\u2022 PMAX.
This is a greedy, simpler approach that uses the global non-linear model built from the data to jump immediately to the model optimum. This is similar to the technique described in [Botros, 1994], with two extensions. First, the Bayesian priors enable useful decisions before the regression becomes full-rank. Second, local quadratic models permit second-order convergence near an optimum. \n\nFigure 1: Three examples of 2-d functions used in optimization experiments. \n\n\u2022 IEMAX. Applies Kaelbling's IE algorithm [Kaelbling, 1990] in the continuous case using Bayesian confidence intervals: \n\nu_chosen = argmax_u f_opt(u)   (2) \n\nwhere f_opt(u) is the top of the 95th percentile confidence interval. The intuition here is that we are encouraged to explore more aggressively than PMAX, but will not explore areas that are confidently below the best known optimum. \n\n\u2022 COMAX. In a real plant we would never want to apply PMAX or IEMAX. Experiments must be cautious for reasons of safety, quality control, and managerial peace of mind. COMAX extends IEMAX thus: \n\nu_chosen = argmax_{u in SAFE} f_opt(u),  where u in SAFE <=> f_pess(u) > disaster threshold   (3) \n\nAnalysis of these algorithms is problematic unless we are prepared to make strong assumptions about the form of E[y | u]. To examine the general case we rely on Monte Carlo simulations, which we now describe. \n\nThe experiments used randomly generated nonlinear unimodal (but not necessarily convex) d-dimensional functions from [0,1]^d -> [0,1]. Figure 1 shows three example 2-d functions. Gaussian noise (σ = 0.1) is added to the functions. This is large noise, and means several function evaluations would be needed to achieve a reliable gradient estimate for a system using even a large step size such as 0.2.
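The pieces above can be sketched together in numpy: fit the weighted Bayesian regression posterior of Equation 1 around each candidate control, then apply the IEMAX rule of Equation 2 by scoring each candidate with the top of its confidence interval. The candidate grid, the plain linear (rather than quadratic) local model, and the function names are simplifying assumptions of this sketch; the prior values mirror those given in the text.

```python
import numpy as np

def posterior(X, y, w, Sigma_inv, alpha=0.8, beta=0.001):
    """Weighted Bayesian linear regression posterior (cf. Equation 1):
    returns the posterior mean of the coefficients and the scale matrix of
    their t distribution, with 2*alpha + sum(w_i^2) degrees of freedom."""
    W2 = np.diag(w ** 2)
    A = Sigma_inv + X.T @ W2 @ X
    mean = np.linalg.solve(A, X.T @ W2 @ y)
    dof = 2 * alpha + np.sum(w ** 2)
    scale = (2 * beta + (y - X @ mean) @ W2 @ y) * np.linalg.inv(A) / dof
    return mean, scale

def iemax_pick(X, y, controls, candidates, kernel_width=0.1, z95=1.65):
    """IEMAX-style choice (cf. Equation 2): weight the data around each
    candidate u with w_i = exp(-d_i^2 / K), fit the local model, and return
    the candidate whose predicted output has the highest upper confidence
    bound -- optimistic where the model is uncertain."""
    Sigma_inv = np.eye(X.shape[1]) / 20.0 ** 2   # wide prior, as in the text
    best_u, best_score = None, -np.inf
    for u in candidates:
        w = np.exp(-np.sum((controls - u) ** 2, axis=1) / kernel_width)
        mean, scale = posterior(X, y, w, Sigma_inv)
        x_q = np.append(u, 1.0)                  # linear terms plus intercept
        ucb = x_q @ mean + z95 * np.sqrt(x_q @ scale @ x_q)
        if ucb > best_score:
            best_u, best_score = u, ucb
    return best_u
```

With data concentrated at low outputs, the rule prefers a distant, uncertain candidate over a confidently mediocre one, which is exactly the intended exploratory behavior.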
\n\nThe following optimization algorithms were tested on a sample of such functions. \n\nVary-KW \n\nFixed-KW \n\nThe best performing KW algorithm we could find varied step size and \nadapted gradient estimation steps to avoid undue regret at optima. \nA version of KW that keeps its gradient-detecting step size fixed. \nThis risks causing extra regret at a true optima, but has less chance \nof becoming delayed by a non-optimum. \nThe best performing version thereof. \n\nAuto-RSM \nPasslve-RSM Auto-RSM continues to identify the precise location of the optimum \nwhen it's arrived at that optimum. When Passive-RSM is confident \n(greater than 99%) that it knows the location of the optimum to two \nsignificant places, it stops experimenting. \nA linear instead of quadratic model, thus restricted to steepest ascent. \nAuto-RSM with conservative parameters, more typical of those rec-\nommended in the RSM literature. \n\nLinear RSM \nCRSM \n\nPmax, IEmax As described above. \nand Comax \n\nFigures 2a and 2b show the first sixty experiments taken by AutoRSM and KW \nrespectively on their journeys to the goal. \n\n\f1070 \n\n(a) \n\n(b) \n\nA. W. MOORE, J. SCHNEIDER \n\nFigure 2a: The path taken (start at \n(0.8,0.2)) by AutoRSM optimizing the \ngiven function with added noise of stan(cid:173)\ndard deviation 0.1 at each experiment. \n\nFigure 2b: The path taken (start at \n(0.8,0.2)) by KW. KW's path looks decep(cid:173)\ntively bad, but remember it is continually \nbuffeted by considerable noise. \n\n(0) RetroI_d ............ _ \n\nte) No. of\", YntII wllhln 0,05 rtf optimum \n\nlet) ............ of FINAL . . . tepe \n\nFigure 3: Comparing nine stochastic optimization algorithms by four criteria: (a) Regret , \n(b) Disasters, (c) Speed to converge (d) Quality at convergence. The partial order depicted \nshows which results are significant at the 99% level (using blocked pairwise comparisons). \nThe outputs of the random functions range between 0-1 over the input domain. 
The numbers in the boxes are means over fifty 5-d functions. (a) Regret is defined as the mean y_opt - y_i: the cost incurred during the optimization compared with performance if we had known the optimum location and used it from the beginning. With the exception of IEMAX, model-based methods perform significantly better than KW, with reduced advantage for cautious and linear methods. (b) The percentage of steps which tried experiments with more than 0.1 units worse performance than at the search start. This matters to a risk-averse manager. AutoRSM has fewer than 1% disasters, but COMAX and the model-free methods do better still. PMAX's aggressive exploration costs it. (c) The number of steps until we reach within 0.05 units of optimal. PMAX's aggressiveness wins. (d) The quality of the \"final\" solution between steps 50 and 60 of the optimization. \n\nResults for 50 trials of each optimization algorithm for five-dimensional randomly generated functions are depicted in Figure 3. Many other experiments were performed in other dimensionalities and for modified versions of the algorithm, but space does not permit detailed discussion here. \n\nFinally we performed experiments with the simulated power-plant process in Figure 4. The catalyst controller adjusts the flow rate of the catalyst to achieve the goal chemical A content. Its actions also affect chemical B content. The temperature controller adjusts the reaction chamber temperature to achieve the goal chemical B content. The chemical contents are also affected by the flow rate, which is determined externally by demand for the product. \n\nThe task is to find the optimal values for the six controller parameters that minimize the total squared deviation from desired values of chemical A and chemical B contents. The feedback loops from sensors to controllers have significant delay.
The controller gains on product demand are feedforward terms since there is significant delay in the effects of demand on the process. Finally, the performance of the system may also depend on variations over time in the composition of the input chemicals, which cannot be directly sensed. \n\nFigure 4: A Simulated Chemical Process. The catalyst controller's parameters are a base input rate, a Sensor A feedback gain, and a product demand feedforward gain; the temperature controller's parameters are a base temperature, a Sensor B feedback gain, and a product demand feedforward gain. Pumps are governed by demand for the product. \n\nThe total summed regrets of the optimization methods on 200 simulated steps were: \n\nStayAtStart: 10.86   FixedKW: 2.82   AutoRSM: 1.32   PMAX: 3.30   COMAX: 4.50 \n\nIn this case AutoRSM is best, considerably beating the best KW algorithm we could find. In contrast PMAX and COMAX did poorly: in this plant wild experiments are very costly to PMAX, and COMAX is too cautious. StayAtStart is the regret that would be incurred if all 200 steps were taken at the initial parameter setting. \n\n3 UNOBSERVED DISTURBANCES \n\nAn apparent danger of learning a model is that if the environment changes, the out of date model will mean poor performance and very slow adaptation. The model-free methods, which use only recent data, will react more nimbly. A simple but unsatisfactory answer to this is to use a model that implicitly (e.g. a neural net) or explicitly (e.g. locally weighted regression of the fifty most recent points) forgets.
An interesting possibility is to learn a model in a way that automatically determines whether a disturbance has occurred, and if so, how far back to forget. \n\nThe following \"adaptive forgetting\" (AF) algorithm was added to the AutoRSM algorithm: At each step, use all the previous data to generate 99% confidence intervals on the output value at the current step. If the observed output is outside the intervals, assume that a large change in the system has occurred and forget all previous data. This algorithm is good for recognizing jumps in the plant's operating characteristics and allows AutoRSM to respond to them quickly, but is not suitable for detecting and handling process drift. \n\nWe tested our algorithm's performance on the simulated plant for 450 steps. Operation began as before, but at step 150 there was an unobserved change in the composition of the raw input chemicals. The total regrets of the optimization methods were: \n\nStayAtStart: 11.90   FixedKW: 5.31   AutoRSM: 8.37   PMAX: 9.23   AutoRSM/AF: 2.75 \n\nAutoRSM and PMAX do poorly because all their decisions after step 150 are based partially on the invalid data collected before then. The AF addition to AutoRSM solves the problem while beating the best KW by a factor of 2. Furthermore, AutoRSM/AF gets 1.76 on the invariant task, thus demonstrating that it can be used safely in cases where it is not known if the process is time varying. \n\n4 DISCUSSION \n\nBotros' thesis [Botros, 1994] discusses an algorithm similar to PMAX based on local linear regression. [Salganicoff and Ungar, 1995] uses a decision tree to learn a model. They use Gittins indices to suggest experiments; we believe that the memory-based methods can benefit from them too. They, however, do not use gradient information, and so require many experiments to search a 2D space.
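The adaptive forgetting check of Section 3 can be sketched as follows. Using a running mean and standard deviation of recent outputs in place of the locally weighted model's 99% predictive interval is a simplification of this sketch, and the class name and threshold handling are our own.

```python
import statistics

class AdaptiveForgetting:
    """AF rule sketch: before trusting old data, check whether the newest
    observation falls inside a 99% interval predicted from all previous
    data; if it does not, assume the plant has jumped and forget everything."""

    def __init__(self, z99=2.58):
        self.z99 = z99
        self.history = []

    def observe(self, y):
        if len(self.history) >= 2:
            mu = statistics.mean(self.history)
            sd = statistics.stdev(self.history)
            if abs(y - mu) > self.z99 * max(sd, 1e-9):
                self.history.clear()  # disturbance detected: discard old data
        self.history.append(y)
        return len(self.history)      # how much data survives for modeling
```

A jump from outputs near 1.0 to 5.0 clears the history at once, while steady operation keeps accumulating data; drift that stays inside the interval is, as noted in Section 3, not detected by this rule.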
\n\nIEmax performed badly in these experiments, but optimism-guided exploration may prove important in algorithms which check for potentially superior local optima. \n\nA possible extension is self-tuning optimization. Part way through an optimization, to estimate the best optimization parameters for an algorithm, we can run Monte Carlo simulations on sample functions drawn from the posterior global model given the current data. \n\nThis paper has examined the question of how much learning a Bayesian memory-based model can accelerate the convergence of stochastic optimization. We have proposed four algorithms for doing this, one based on an autonomous version of RSM, the other three upon greedily jumping to optima of three criteria dependent on predicted output and uncertainty. Empirically the model-based methods provide significant gains over a highly tuned higher order model-free method. \n\nReferences \n\n[Atkeson, 1989] C. G. Atkeson. Using Local Models to Control Movement. In Proceedings of Neural Information Processing Systems Conference, November 1989. \n\n[Botros, 1994] S. M. Botros. Model-Based Techniques in Motor Learning and Task Optimization. PhD Thesis, MIT Dept. of Brain and Cognitive Sciences, February 1994. \n\n[Box and Draper, 1987] G. E. P. Box and N. R. Draper. Empirical Model-Building and Response Surfaces. Wiley, 1987. \n\n[Cleveland and Devlin, 1988] W. S. Cleveland and S. J. Devlin. Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association, 83(403):596-610, September 1988. \n\n[DeGroot, 1970] M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970. \n\n[Gratch et al., 1993] J. Gratch, S. Chien, and G. DeJong. Learning Search Control Knowledge for Deep Space Network Scheduling. In Proceedings of the 10th International Conference on Machine Learning.
Morgan Kaufmann, June 1993. \n\n[Greiner and Jurisica, 1992] R. Greiner and I. Jurisica. A statistical approach to solving the EBL utility problem. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92). MIT Press, 1992. \n\n[Kaelbling, 1990] L. P. Kaelbling. Learning in Embedded Systems. PhD Thesis; Technical Report No. TR-90-04, Stanford University, Department of Computer Science, June 1990. \n\n[Kushner and Clark, 1978] H. Kushner and D. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer-Verlag, 1978. \n\n[Maron and Moore, 1993] O. Maron and A. Moore. Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann, December 1993. \n\n[Moore et al., 1995] A. W. Moore, C. G. Atkeson, and S. Schaal. Memory-based Learning for Control. Technical Report CMU-RI-TR-95-18, CMU Robotics Institute (submitted for publication), 1995. \n\n[Moore, 1992] A. W. Moore. Fast, Robust Adaptive Control by Learning only Forward Models. In J. E. Moody, S. J. Hanson, and R. P. Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, April 1992. \n\n[Robbins and Monro, 1951] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400-407, 1951. \n\n[Salganicoff and Ungar, 1995] M. Salganicoff and L. H. Ungar. Active Exploration and Learning in Real-Valued Spaces using Multi-Armed Bandit Allocation Indices. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 1995.", "award": [], "sourceid": 1124, "authors": [{"given_name": "Andrew", "family_name": "Moore", "institution": null}, {"given_name": "Jeff", "family_name": "Schneider", "institution": null}]}