{"title": "Improving Elevator Performance Using Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1023, "abstract": null, "full_text": "Improving Elevator Performance Using \n\nReinforcement Learning \n\nRobert H. Crites \n\nComputer Science Department \n\nUniversity of Massachusetts \nAmherst, MA 01003-4610 \ncritesGcs.umass.edu \n\nAndrew G. Barto \n\nComputer Science Department \n\nUniversity of Massachusetts \nAmherst, MA 01003-4610 \n\nbartoGcs.umass.edu \n\nAbstract \n\nThis paper describes the application of reinforcement learning (RL) \nto the difficult real world problem of elevator dispatching. The el(cid:173)\nevator domain poses a combination of challenges not seen in most \nRL research to date. Elevator systems operate in continuous state \nspaces and in continuous time as discrete event dynamic systems. \nTheir states are not fully observable and they are nonstationary \ndue to changing passenger arrival rates. In addition, we use a team \nof RL agents, each of which is responsible for controlling one ele(cid:173)\nvator car. The team receives a global reinforcement signal which \nappears noisy to each agent due to the effects of the actions of the \nother agents, the random nature of the arrivals and the incomplete \nobservation of the state. In spite of these complications, we show \nresults that in simulation surpass the best of the heuristic elevator \ncontrol algorithms of which we are aware. These results demon(cid:173)\nstrate the power of RL on a very large scale stochastic dynamic \noptimization problem of practical utility. \n\n1 \n\nINTRODUCTION \n\nRecent algorithmic and theoretical advances in reinforcement learning (RL) have \nattracted widespread interest. RL algorithms have appeared that approximate dy(cid:173)\nnamic programming (DP) on an incremental basis. 
Unlike traditional DP algorithms, these algorithms can perform with or without models of the system, and they can be used online as well as offline, focusing computation on areas of state space that are likely to be visited during actual control. On very large problems, they can provide computationally tractable ways of approximating DP. An example of this is Tesauro's TD-Gammon system (Tesauro, 1992; 1994; 1995), which used RL techniques to learn to play strong masters-level backgammon. Even the best human experts make poor teachers for this class of problems since they do not always know the best actions. Even if they did, the state space is so large that it would be difficult for experts to provide sufficient training data. RL algorithms are naturally suited to this class of problems, since they learn on the basis of their own experience. This paper describes the application of RL to elevator dispatching, another problem where classical DP is completely intractable. The elevator domain poses a number of difficulties that were not present in backgammon. In spite of these complications, we show results that surpass the best of the heuristic elevator control algorithms of which we are aware. The following sections describe the elevator dispatching domain, the RL algorithm and neural network architectures that were used, the results, and some conclusions.

2 THE ELEVATOR SYSTEM

The particular elevator system we examine is a simulated 10-story building with 4 elevator cars (Lewis, 1991; Bao et al., 1994). Passenger arrivals at each floor are assumed to be Poisson, with arrival rates that vary during the course of the day. Our simulations use a traffic profile (Bao et al., 1994) which dictates arrival rates for every 5-minute interval during a typical afternoon down-peak rush hour.
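The Poisson arrival assumption is straightforward to reproduce in a discrete-event simulator by accumulating exponentially distributed inter-arrival gaps. The sketch below is ours, not from the paper; the function name, seed handling, and example rate are illustrative only:

```python
import random

def poisson_arrivals(rate_per_min, duration_secs, seed=0):
    """Sample Poisson arrival times (in seconds) over [0, duration_secs)
    by accumulating exponential inter-arrival gaps."""
    rng = random.Random(seed)
    rate_per_sec = rate_per_min / 60.0
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_per_sec)
        if t >= duration_secs:
            return times
        times.append(t)

# One 5-minute interval at an assumed mean of 3 arrivals per minute
arrivals = poisson_arrivals(3.0, 300.0)
```

In a simulator like the one described above, the rate would simply be swapped every 5 minutes according to the traffic profile.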
Table 1 shows the mean number of passengers arriving at each floor (2-10) during each 5-minute interval who are headed for the lobby. In addition, there is inter-floor traffic which varies from 0% to 10% of the traffic to the lobby.

Table 1: The Down-Peak Traffic Profile

The system dynamics are approximated by the following parameters:

• Floor time (the time to move one floor at the maximum speed): 1.45 secs.
• Stop time (the time needed to decelerate, open and close the doors, and accelerate again): 7.19 secs.
• Turn time (the time needed for a stopped car to change direction): 1 sec.
• Load time (the time for one passenger to enter or exit a car): random variable from a 20th order truncated Erlang distribution with a range from 0.6 to 6.0 secs and a mean of 1 sec.
• Car capacity: 20 passengers.

The state space is continuous because it includes the elapsed times since any hall calls were registered. Even if these real values are approximated as binary values, the size of the state space is still immense. Its components include 2^18 possible combinations of the 18 hall call buttons (up and down buttons at each landing except the top and bottom), 2^40 possible combinations of the 40 car buttons, and 18^4 possible combinations of the positions and directions of the cars (rounding off to the nearest floor). Other parts of the state are not fully observable, for example, the desired destinations of the passengers waiting at each floor. Ignoring everything except the configuration of the hall and car call buttons and the approximate position and direction of the cars, we obtain an extremely conservative estimate of the size of a discrete approximation to the continuous state space:

    2^18 · 2^40 · 18^4 ≈ 10^22 states.

Each car has a small set of primitive actions. If it is stopped at a floor, it must either "move up" or "move down".
If it is in motion between floors, it must either "stop at the next floor" or "continue past the next floor". Due to passenger expectations, there are two constraints on these actions: a car cannot pass a floor if a passenger wants to get off there, and it cannot turn until it has serviced all the car buttons in its present direction. We have added three additional action constraints in an attempt to build in some primitive prior knowledge: a car cannot stop at a floor unless someone wants to get on or off there, it cannot stop to pick up passengers at a floor if another car is already stopped there, and given a choice between moving up and down, it should prefer to move up (since the down-peak traffic tends to push the cars toward the bottom of the building). Because of this last constraint, the only real choices left to each car are the stop and continue actions. The actions of the elevator cars are executed asynchronously since they may take different amounts of time to complete.

The performance objectives of an elevator system can be defined in many ways. One possible objective is to minimize the average wait time, which is the time between the arrival of a passenger and his entry into a car. Another possible objective is to minimize the average system time, which is the sum of the wait time and the travel time. A third possible objective is to minimize the percentage of passengers that wait longer than some dissatisfaction threshold (usually 60 seconds). Another common objective is to minimize the sum of squared wait times. We chose this latter performance objective since it tends to keep the wait times low while also encouraging fair service.

3 THE ALGORITHM AND NETWORK ARCHITECTURE

Elevator systems can be modeled as discrete event systems, where significant events (such as passenger arrivals) occur at discrete times, but the amount of time between events is a real-valued variable.
In such systems, the constant discount factor γ used in most discrete-time reinforcement learning algorithms is inadequate. This problem can be approached using a variable discount factor that depends on the amount of time between events (Bradtke & Duff, 1995). In this case, returns are defined as integrals rather than as infinite sums, as follows:

    Σ_{t=0}^∞ γ^t r_t    becomes    ∫_0^∞ e^{-βτ} r_τ dτ,

where r_t is the immediate cost at discrete time t, r_τ is the instantaneous cost at continuous time τ (e.g., the sum of the squared wait times of all waiting passengers), and β controls the rate of exponential decay.

Calculating reinforcements here poses a problem in that it seems to require knowledge of the waiting times of all waiting passengers. There are two ways of dealing with this problem. The simulator knows how long each passenger has been waiting. It could use this information to determine what could be called omniscient reinforcements. The other possibility is to use only information that would be available to a real system online. Such online reinforcements assume only that the waiting time of the first passenger in each queue is known (which is the elapsed button time). If the Poisson arrival rate λ for each queue is estimated as the reciprocal of the last inter-button time for that queue, the Gamma distribution can be used to estimate the arrival times of subsequent passengers. The time until the nth subsequent arrival follows the Gamma distribution Γ(n, λ) (shape n, rate λ). For each queue, subsequent arrivals will generate the following expected penalties during the first b seconds after the hall button has been pressed:

    Σ_{n=1}^∞ ∫_0^b (prob. nth arrival occurs at time τ) · (penalty given arrival at time τ) dτ.

This integral can be solved by parts to yield the expected penalties.
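Because the nth subsequent arrival is Gamma-distributed with shape n and rate λ, the entire arrival stream is simply a Poisson process with rate λ, so the expected-penalty sum above can also be checked numerically. The sketch below is a Monte Carlo stand-in for the paper's closed-form integration by parts, and it assumes, purely for illustration, that a passenger arriving at time tau contributes a decayed squared-wait penalty exp(-beta*tau)*(b - tau)^2 within the first b seconds:

```python
import math
import random

def expected_penalty(lam, beta, b, n_samples=2000, seed=1):
    """Monte Carlo estimate of the discounted penalty accrued during the
    first b seconds by passengers arriving after the button press.
    Assumes Poisson arrivals with rate lam (per second) and an
    illustrative per-passenger penalty exp(-beta*tau) * (b - tau)**2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = 0.0
        while True:
            t += rng.expovariate(lam)  # nth arrival time is Gamma(n, lam)
            if t >= b:
                break
            total += math.exp(-beta * t) * (b - t) ** 2
    return total / n_samples
```

As one would expect, the estimate grows with the arrival rate: a busy queue accrues a larger expected penalty than a quiet one over the same window.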
We found that using online reinforcements actually produced somewhat better results than using omniscient reinforcements, presumably because the algorithm was trying to learn average values anyway.

Because elevator system events occur randomly in continuous time, the branching factor is effectively infinite, which complicates the use of algorithms that require explicit lookahead. Therefore, we employed a team of discrete-event Q-learning agents, where each agent is responsible for controlling one elevator car. Q(x, a) is defined as the expected infinite discounted return obtained by taking action a in state x and then following an optimal policy (Watkins, 1989). Because of the vast number of states, the Q-values are stored in feedforward neural networks. The networks receive some state information as input, and produce Q-value estimates as output. We have tested two architectures. In the parallel architecture, the agents share a single network, allowing them to learn from each other's experiences and forcing them to learn identical policies. In the fully decentralized architecture, the agents have their own networks, allowing them to specialize their control policies. In either case, none of the agents have explicit access to the actions of the other agents. Cooperation has to be learned indirectly via the global reinforcement signal. Each agent faces added stochasticity and nonstationarity because its environment contains other learning agents. Other work on team Q-learning is described in (Markey, 1994).

The algorithm calls for each car to select its actions probabilistically using the Boltzmann distribution over its Q-value estimates, where the temperature is lowered gradually during training.
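A minimal sketch of Boltzmann action selection of this kind follows (the function name is ours, and the sign convention is an assumption: since the Q-values here estimate discounted cost, lower values should be selected more often, giving preferences proportional to exp(-Q/T)):

```python
import math
import random

def boltzmann_action(q_values, temperature, rng=random):
    """Sample an action index from a Boltzmann (softmax) distribution.
    Q-values are assumed to estimate discounted *cost*, so lower Q means
    higher selection probability: P(a) proportional to exp(-Q(a) / T)."""
    # subtract the min for numerical stability before exponentiating
    q_min = min(q_values)
    prefs = [math.exp(-(q - q_min) / temperature) for q in q_values]
    z = sum(prefs)
    r, acc = rng.random() * z, 0.0
    for i, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return i
    return len(q_values) - 1
```

Lowering the temperature over training moves the policy from near-uniform exploration toward greedy exploitation of the learned Q estimates.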
After every decision, error backpropagation is used to train the car's estimate of Q(x, a) toward the following target output:

    ∫_{t_x}^{t_y} e^{-β(τ - t_x)} r_τ dτ + e^{-β(t_y - t_x)} min_b Q(y, b),

where action a is taken by the car from state x at time t_x, the next decision by that car is required from state y at time t_y, and r_τ and β are defined as above. The factor e^{-β(t_y - t_x)} acts as a variable discount factor that depends on the amount of time between events. The learning rate parameter was set to 0.01 or 0.001 and β was set to 0.01 in the experiments described in this paper.

After considerable experimentation, our best results were obtained using networks for pure down traffic with 47 input units, 20 hidden sigmoid units, and two linear output units (one for each action value). The input units are as follows:

• 18 units: Two units encode information about each of the nine down hall buttons. A real-valued unit encodes the elapsed time if the button has been pushed, and a binary unit is on if the button has not been pushed.

• 16 units: Each of these units represents a possible location and direction for the car whose decision is required. Exactly one of these units will be on at any given time.

• 10 units: These units each represent one of the 10 floors where the other cars may be located. Each car has a "footprint" that depends on its direction and speed. For example, a stopped car causes activation only on the unit corresponding to its current floor, but a moving car causes activation on several units corresponding to the floors it is approaching, with the highest activations on the closest floors.

• 1 unit: This unit is on if the car whose decision is required is at the highest floor with a waiting passenger.

• 1 unit: This unit is on if the car whose decision is required is at the floor with the passenger that has been waiting for the longest amount of time.
\n\n\u2022 1 unit: The bias unit is always on. \n\n4 RESULTS \n\nSince an optimal policy for the elevator dispatching problem is unknown, we mea(cid:173)\nsured the performance of our algorithm against other heuristic algorithms, including \nthe best of which we were aware. The algorithms were: SECTOR, a sector-based \nalgorithm similar to what is used in many actual elevator systems; DLB, Dynamic \nLoad Balancing, attempts to equalize the load of all cars; HUFF, Highest Unan(cid:173)\nswered Floor First, gives priority to the highest floor with people waiting; LQF, \nLongest Queue First, gives priority to the queue with the person who has been \nwaiting for the longest amount of time; FIM, Finite Intervisit Minimization, a re(cid:173)\nceding horizon controller that searches the space of admissible car assignments to \nminimize a load function; ESA, Empty the System Algorithm, a receding horizon \ncontroller that searches for the fastest way to \"empty the system\" assuming no new \npassenger arrivals. ESA uses queue length information that would not be available \nin a real elevator system. ESA/nq is a version of ESA that uses arrival rate informa(cid:173)\ntion to estimate the queue lengths. For more details, see (Bao et al, 1994). These \nreceding horizon controllers are very sophisticated, but also very computationally \nintensive, such that they would be difficult to implement in real time. RLp and \nRLd denote the RL controllers, parallel and decentralized. The RL controllers were \neach trained on 60,000 hours of simulated elevator time, which took four days on a \n100 MIPS workstation. The results are averaged over 30 hours of simulated elevator \ntime. Table 2 shows the results for the traffic profile with down traffic only. 
\n\nAlgorithm \nSECTOR \n\nDLB \n\nBASIC HUFF \n\nLQF \nHUFF \nFIM \n\nESA/nq \n\nESA \nRLp \nRLd \n\nI AvgWait I SquaredWait I SystemTime I Percent>60 secs I \n\n21.4 \n19.4 \n19.9 \n19.1 \n16.8 \n16.0 \n15.8 \n15.1 \n14.8 \n14.7 \n\n674 \n658 \n580 \n534 \n396 \n359 \n358 \n338 \n320 \n313 \n\n47.7 \n53.2 \n47.2 \n46.6 \n48.6 \n47.9 \n47.7 \n47.1 \n41.8 \n41.7 \n\n1.12 \n2.74 \n0.76 \n0.89 \n0.16 \n0.11 \n0.12 \n0.25 \n0.09 \n0.07 \n\nTable 2: Results for Down-Peak Profile with Down Traffic Only \n\n\f1022 \n\nR.H.C~.A.G. BARTO \n\nTable 3 shows the results for the down-peak traffic profile with up and down traffic, \nincluding an average of 2 up passengers per minute at the lobby. The algorithm \nwas trained on down-only traffic, yet it generalizes well when up traffic is added \nand upward moving cars are forced to stop for any upward hall calls. \n\nAlgorithm \nSECTOR \n\nDLB \n\nBASIC HUFF \n\nLQF \nHU ... \u00b7F \nESA \nFIM \nRLp \nRLd \n\nI AvgWait I Squared wait I SystemTime I Percent>60 secs I \n\n27.3 \n21.7 \n22.0 \n21.9 \n19.6 \n18.0 \n17.9 \n16.9 \n16.9 \n\n1252 \n826 \n756 \n732 \n608 \n524 \n476 \n476 \n468 \n\n54.8 \n54.4 \n51.1 \n50.7 \n50.5 \n50.0 \n48.9 \n42.7 \n42.7 \n\n9.24 \n4.74 \n3.46 \n2.87 \n1.99 \n1.56 \n0.50 \n1.53 \n1.40 \n\nTable 3: Results for Down-Peak Profile with Up and Down Traffic \n\nTable 4 shows the results for the down-peak traffic profile with up and down traffic, \nincluding an average of 4 up passengers per minute at the lobby. This time there is \ntwice as much up traffic, and the RL agents generalize extremely well to this new \nsituation. 
\n\nAlgorithm \nSECTOR \n\nHUFF \nDLB \nLQF \n\nBASIC HUFF \n\nFIM \nESA \nRLd \nRLp \n\nI AvgWait I SquaredWait I SystemTime I Percent>60 secs I \n\n30.3 \n22.8 \n22.6 \n23.5 \n23.2 \n20.8 \n20.1 \n18.8 \n18.6 \n\n1643 \n884 \n880 \n877 \n875 \n685 \n667 \n593 \n585 \n\n59.5 \n55.3 \n55.8 \n53.5 \n54.7 \n53.4 \n52.3 \n45.4 \n45.7 \n\n13.50 \n5.10 \n5.18 \n4.92 \n4.94 \n3.10 \n3.12 \n2.40 \n2.49 \n\nTable 4: Results for Down-Peak Profile with Twice as Much Up Traffic \n\nOne can see that both the RL systems achieved very good performance, most no(cid:173)\ntably as measured by system time (the sum of the wait and travel time), a measure \nthat was not directly being minimized. Surprisingly, the decentralized RL system \nwas able to achieve as good a level of performance as the parallel RL system. Bet(cid:173)\nter performance with nonstationary traffic profiles may be obtainable by providing \nthe agents with information about the current traffic context as part of their input \nrepresentation. We expect that an additional advantage of RL over heuristic con(cid:173)\ntrollers may be in buildings with less homogeneous arrival rates at each floor, where \nRL can adapt to idiosyncracies in their individual traffic patterns. \n\n5 CONCLUSIONS \n\nThese results demonstrate the utility of RL on a very large scale dynamic optimiza(cid:173)\ntion problem. By focusing computation onto the states visited during simulated \ntrajectories, RL avoids the need of conventional DP algorithms to exhaustively \n\n\fImproving Elevator Performance Using Reinforcement Learning \n\n1023 \n\nsweep the state set. By storing information in artificial neural networks, it avoids \nthe need to maintain large lookup tables. To achieve the above results, each RL \nsystem experienced 60,000 hours of simulated elevator time, which took four days \nof computer time on a 100 MIPS processor. 
Although this is a considerable amount of computation, it is negligible compared to what any conventional DP algorithm would require. The results also suggest that approaches to decentralized control using RL have considerable promise. Future research on the elevator dispatching problem will investigate other traffic profiles and further explore the parallel and decentralized RL architectures.

Acknowledgements

We thank John McNulty, Christos Cassandras, Asif Gandhi, Dave Pepyne, Kevin Markey, Victor Lesser, Rod Grupen, Rich Sutton, Steve Bradtke, and the ANW group for assistance with the simulator and for helpful discussions. This research was supported by the Air Force Office of Scientific Research under grant F49620-93-1-0269.

References

G. Bao, C. G. Cassandras, T. E. Djaferis, A. D. Gandhi, and D. P. Looze. (1994) Elevator Dispatchers for Down Peak Traffic. Technical Report, ECE Department, University of Massachusetts, Amherst, MA.

S. J. Bradtke and M. O. Duff. (1995) Reinforcement Learning Methods for Continuous-Time Markov Decision Problems. In: G. Tesauro, D. S. Touretzky and T. K. Leen, eds., Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA.

J. Lewis. (1991) A Dynamic Load Balancing Approach to the Control of Multiserver Polling Systems with Applications to Elevator System Dispatching. PhD thesis, University of Massachusetts, Amherst, MA.

K. L. Markey. (1994) Efficient Learning of Multiple Degree-of-Freedom Control Problems with Quasi-independent Q-agents. In: M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman and A. S. Weigend, eds., Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, Hillsdale, NJ.

G. Tesauro. (1992) Practical Issues in Temporal Difference Learning. Machine Learning 8:257-277.

G. Tesauro. (1994) TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation 6:215-219.
\n\nG. Tesauro. (1995) Temporal Difference Learning and TD-Gammon. Communica(cid:173)\ntion, of the ACM 38:58-68. \n\nC. J. C. H. Watkins. (1989) Learning from Delayed Reward,. PhD thesis, Cam(cid:173)\nbridge University. \n\n\f", "award": [], "sourceid": 1073, "authors": [{"given_name": "Robert", "family_name": "Crites", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}]}