{"title": "Learning Instance-Independent Value Functions to Enhance Local Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1017, "page_last": 1023, "abstract": null, "full_text": "Learning Instance-Independent Value Functions \n\nto Enhance Local Search \n\nRobert Moll Andrew G. Barto Theodore J. Perkins \n\nDepartment of Computer Science \n\nUniversity of Massachusetts, Amherst, MA 01003 \n\nAT&T Shannon Laboratory, 180 Park Avenue, Florham Park, NJ 07932 \n\nRichard S. Sutton \n\nAbstract \n\nReinforcement learning methods can be used to improve the performance \nof local search algorithms for combinatorial optimization by learning \nan evaluation function that predicts the outcome of search. The eval(cid:173)\nuation function is therefore able to guide search to low-cost solutions \nbetter than can the original cost function. We describe a reinforcement \nlearning method for enhancing local search that combines aspects of pre(cid:173)\nvious work by Zhang and Dietterich (1995) and Boyan and Moore (1997, \nBoyan 1998). In an off-line learning phase, a value function is learned \nthat is useful for guiding search for multiple problem sizes and instances. \nWe illustrate our technique by developing several such functions for the \nDial-A-Ride Problem. Our learning-enhanced local search algorithm ex(cid:173)\nhibits an improvement of more then 30% over a standard local search \nalgorithm. \n\n1 \n\nINTRODUCTION \n\nCombinatorial optimization is of great importance in computer science, engineering, and \noperations research. We investigated the use of reinforcement learning (RL) to enhance tra(cid:173)\nditionallocal search optimization (hillclimbing). Since local search is a sequential decision \nprocess. 
RL can be used to improve search performance by learning an evaluation function that predicts the outcome of search and is therefore able to guide search to low-cost solutions better than can the original cost function. \n\nThree approaches to using RL to improve combinatorial optimization have been described in the literature. One is to learn a value function over multiple search trajectories of a single problem instance. As the value function improves in its predictive accuracy, its guidance enhances additional search trajectories on the same instance. Boyan and Moore's STAGE algorithm (Boyan and Moore 1997, Boyan 1998) falls into this category, showing excellent performance on a range of optimization problems. Another approach is to learn a value function off-line and then use it over multiple new instances of the same problem. Zhang and Dietterich's (1995) application of RL to a NASA space shuttle mission scheduling problem takes this approach (although it does not strictly involve local search as we define it below). A key issue here is the need to normalize state representations and rewards so that trajectories from instances of different sizes and difficulties yield consistent training data. In each of the above approaches, a state of the RL problem is an entire solution (e.g., a complete tour in a Traveling Salesman Problem (TSP)) and the actions select next solutions from the current solution's neighborhood. A third approach, described by Bertsekas and Tsitsiklis (1996), uses a learned value function for guiding the direct construction of solutions rather than for moving between them. 
\n\nWe focused on combining aspects of the first two of these approaches, with the goal of carefully examining how well the TD(λ) algorithm can learn an instance-independent value function for a given problem, producing an enhanced local search algorithm applicable to all instances of that problem. Our approach combines an off-line learning phase with STAGE's alternation between using the learned value function and the original cost function to guide search. We present an extended case study of this algorithm's application to a somewhat complicated variant of TSP known as the Dial-A-Ride Problem, which exhibits some of the non-uniform structure present in real-world transportation and logistics problems. \n\n2 ENHANCING LOCAL SEARCH \n\nThe components of local search for combinatorial optimization are 1) a finite set of feasible solutions, S; 2) an objective, or cost, function, c : S → ℝ; and 3) a neighborhood function, A : S → P(S) (the power set of S). Local search starts with an initial feasible solution, s0, of a problem instance and then at each step k = 1, 2, ..., it selects a solution sk ∈ A(sk-1) such that c(sk) < c(sk-1). This process continues until further local improvement is impossible, and the current local optimum is returned. If the algorithm always moves to the first less expensive neighboring solution encountered in an enumeration of a neighborhood, it is called first-improvement local search. \n\nFollowing Zhang and Dietterich (1995) and Boyan and Moore (1997), we note that local search can be viewed as a policy of a Markov decision process (MDP) with state set S and action sets A(s), s ∈ S, where an action is identified with the neighboring solution selected. Local search selects actions which decrease the value of c, eventually absorbing at a state with a locally minimum cost. 
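As a concrete illustration, first-improvement local search as defined above can be sketched in a few lines of Python; the function and variable names here are ours, not the paper's:

```python
def first_improvement_search(s0, cost, neighbors):
    """Generic first-improvement local search.

    Repeatedly scans the neighborhood in a fixed enumeration and moves
    to the first strictly cheaper neighbor; stops at a local optimum.
    """
    s = s0
    improved = True
    while improved:
        improved = False
        for t in neighbors(s):
            if cost(t) < cost(s):
                s = t          # take the first improving move found
                improved = True
                break
    return s
```

Any cost function and neighborhood enumeration can be plugged in; the fixed enumeration order determines which improving move is taken first, which is why the enumeration itself is part of the algorithm's definition.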
But c is not the optimal value function for the local search problem, whose objective is to reach the lowest-cost absorbing state (possibly including some tradeoff involving the number of search steps required to do so). RL used with a function approximator can learn an approximate optimal value function, V, thereby producing an enhanced search algorithm that is locally guided by V instead of by c. One way to do this is to give a small penalty, ε, for each transition and a terminal reward upon absorption that is inversely related to the cost of the terminal state. Maximizing the expected undiscounted return accomplishes the desired tradeoff (determined by the value of ε) between quality of final solution and search time (cf. Zhang and Dietterich, 1995). \n\nSince each instance of an optimization problem corresponds to a different MDP, a value function V learned in this way is instance-specific. Whereas Boyan's STAGE algorithm in effect uses such a V to enhance additional searches that start from different states of the same instance, we are interested in learning a V off-line, and then using it for arbitrary instances of the given problem. In this case, the relevant sequential decision problem is more complicated than a single-instance MDP since it is a summary of aspects of all problem instances. It would be extremely difficult to make the structure of this process explicit, but fortunately RL requires only the generation of sample trajectories, which is relatively easy in this case. \n\nIn addition to their cost, secondary characteristics of feasible solutions can provide valuable information for search algorithms. By adjusting the parameters of a function approximation system whose inputs are feature vectors describing feasible solutions, an RL algorithm can produce a compact representation of V. 
Our approach operates in two distinct phases. In the learning phase, it learns a value function by applying the TD(λ) algorithm to a number of randomly chosen instances of the problem. In the performance phase, it uses the resulting value function, now held fixed, to guide local search for additional problem instances. This approach is in principle applicable to any combinatorial optimization problem, but we describe its details in the context of the Dial-A-Ride Problem. \n\n3 THE DIAL-A-RIDE PROBLEM \n\nThe Dial-A-Ride Problem (DARP) has the following formulation. A van is parked at a terminal. The driver receives calls from N customers who need rides. Each call identifies the location of a customer, as well as that customer's destination. After the calls have been received, the van must be routed so that it starts from the terminal, visits each pick-up and drop-off site in some order, and then returns to the terminal. The tour must pick up a passenger before eventually dropping that passenger off. The tour should be of minimal length. Failing this goal (DARP is NP-complete, so it is unlikely that optimal DARP tours will be found easily), at least a good quality tour should be constructed. We assume that the van has unlimited capacity and that the distances between pick-up and drop-off locations are represented by a symmetric Euclidean distance matrix. \n\nWe use the notation \n\n0 1 2 -1 3 -3 -2 \n\nto denote the following tour: \"start at the terminal (0), then pick up 1, then 2, then drop off 1 (thus: -1), pick up 3, drop off 3, drop off 2, and then return to the terminal (site 0).\" Given a tour s, the 2-opt neighborhood of s, A2(s), is the set of legal tours obtainable from s by subsequence reversal. For example, for the tour above, the new tour created by the following subsequence reversal \n\n0 1 / 2 -1 3 / -3 -2  →  0 1 3 -1 2 -3 -2 \n\nis an element of A2(s). However, this reversal \n\n0 1 2 / -1 3 -3 / -2  →  
012 - 33 - 1 - 2 \n\nleads to an infeasible tour, since it asserts that passenger 3 is dropped off first, then picked \nup. The neighborhood structure of DARP is highly non-uniform, varying between A2 \nneighborhood sizes of O(N) and O(N 2 ). \n\nLet s be a feasible DARP tour. By 2-opt(s) we mean the tour obtained by first-improvement \nlocal search using the A2 neighborhood structure (presented in a fixed, standard enumer(cid:173)\nation), with tour length as the cost function. As with TSP, there is a 3-opt algorithm for \n\n\f1020 \n\nR. Moll. A. G. Barto. T J Perkins and R. S. Sutton \n\nDARP, where a 3-opt neighborhood A3(S) is defined and searched in a fixed, systematic \nway, again in first-improvement style. This neighborhood is created by inserting three \nrather than two \"breaks\" in a tour. 3-opt is much slower than 2-opt, more than 100 times \nas slow for N = 50, but it is much more effective, even when 2-opt is given equal time to \ngenerate multiple random starting tours and then complete its improvement scheme. \n\nPsaraftis (1983) was the first to study 2-opt and 3-opt algorithms for DARP. He studied \ntours up to size N = 30, reporting that at that size, 3-opt tours are about 30% shorter \non average than 2-opt tours. In theoretical studies of DARP, Stein (1978) showed that for \nsites placed in the unit square, the globally optimal tour for problem size N has a length \nthat asymptotically approaches 1.02-/2N with probability 1 as N increases. This bound \napplies to our study-although we multiply position coordinates by 100 and then truncate \nto get integer distance matrices-and thus, for example, a value of 1020 gives us a baseline \nestimate of the globally optimal tour cost for N = 50. Healy and Moll (1995) considered \nusing a secondary cost function to extend local search on DARP. In addition to primary \ncost (tour length) they considered as a secondary cost the ratio of tour cost to neighborhood \nsize, which they called cost-hood. 
Their algorithm employed a STAGE-like alternation between these two cost functions: starting from a random tour s, it first found 2-opt(s); then it performed a limited local search using the cost-hood function, which had the effect of driving the search to a new tour with a decent cost and a large neighborhood. These alternating processes were repeated until a time bound was exhausted, at which point the least-cost tour seen so far was reported as the result of the search. This technique worked well, with effectiveness falling midway between that of 2-opt and 3-opt. \n\n4 ENHANCED 2-OPT FOR DARP \n\nWe restrict our description to a learning method for enhancing 2-opt for DARP, but the same method can be used for other problems. In the learning phase, after initializing the function approximator, we conduct a number of training episodes until we are satisfied that the weights have stabilized. For each episode k, we select a problem size N at random (from a predetermined range) and generate a random DARP instance of that size, i.e., we generate a symmetric Euclidean distance matrix by generating random points in the plane inside the square bounded by the points (0,0), (0,100), (100,100) and (100,0). We set the \"terminal site\" to point (50,50) and the initial tour to a randomly generated feasible tour. We then conduct a modified first-improvement 2-opt local search using the negated current value function, -Vk, as the cost function. The modification is that termination is controlled by a parameter ε > 0 as follows: the search terminates at a tour s if there is no s' ∈ A(s) such that Vk(s') > Vk(s) + ε. In other words, a step is taken only if it produces an improvement of at least ε according to the current value function. The episode returns a final tour sf. We run one unmodified 2-opt local search, this time using the DARP cost function c (tour length), from sf to compute 2-opt(sf). 
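The value-guided search inside one learning-phase episode can be sketched as follows; the names are ours, and `V` and `neighbors` stand in for the paper's learned value function Vk and the A2 neighborhood enumeration:

```python
def value_guided_search(s0, V, neighbors, eps):
    """Modified first-improvement search that maximizes a learned value
    function V (equivalently, uses -V as the cost), saving the trajectory.

    A move is taken only when it improves V by more than eps, so a larger
    eps shortens trajectories at some cost in final solution quality.
    """
    trajectory = [s0]
    s = s0
    improved = True
    while improved:
        improved = False
        for t in neighbors(s):
            if V(t) > V(s) + eps:   # eps-thresholded improvement test
                s = t
                trajectory.append(s)
                improved = True
                break
    return s, trajectory
```

The episode then finishes, as described above, by running one unmodified 2-opt search (tour length as cost) from the returned tour.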
We then apply a batch version of undiscounted TD(λ) to the saved search trajectory using the following immediate rewards: -ε for each transition, and -c(2-opt(sf))/SteinN as a terminal reward, where SteinN is the Stein estimate for instance size N. Normalization by SteinN helps make the terminal reward consistent across instance sizes. At the end of this learning phase, we have a final value function, V. V is used in the performance phase, which consists of applying the modified first-improvement 2-opt local search with cost function -V on new instances, followed by a 2-opt application to the resulting tour. \n\nThe results described here were obtained using a simple linear approximator with a bias weight and features developed from the following base features: 1) normcostN(s) = c(s)/SteinN; 2) normhoodN(s) = |A(s)|/aN, where aN is a normalization coefficient defined below; and 3) normproxN(s), which considers a list of the N/4 least expensive edges of the distance matrix, as follows. Let e be one of the edges, with endpoints u and v. The normproxN feature examines the current tour and counts the number of sites on the tour that appear between u and v. normproxN is the sum of these counts over the edges on the proximity list, divided by a normalizing coefficient bN described below. Our function approximator is then given by w0 + w1 * normcostN/(normhoodN)^2 + w2 * normproxN/(normhoodN)^2. \n\nTable 1: Weight Vectors for Learned Value Functions. \n\nValue Function   Weight Vector <w0 (bias), w1 (costhood), w2 (proximity)> \nV                < .951, .033, .0153 > \nV20              < .981, .019, .00017 > \nV30              < .984, .014, .0006 > \nV40              < .977, .022, .0009 > \nV50              < .980, .019, .0015 > \nV60              < .971, .022, .0069 > \n\n
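The batch undiscounted TD(λ) update over a saved trajectory, with a linear approximator and the reward scheme described above (-ε per interior transition, a Stein-normalized terminal reward at absorption), can be sketched as follows. The paper does not spell out its exact batch formulation, so this is one standard eligibility-trace form under those assumptions, and all names are ours:

```python
import numpy as np

def stein_estimate(n):
    """Stein's asymptotic optimal-tour-length baseline, 1.02 * sqrt(2N),
    scaled by 100 to match the paper's integer coordinate scaling."""
    return 1.02 * (2 * n) ** 0.5 * 100

def batch_td_lambda(w, phi, trajectory, eps, terminal_reward,
                    lam=0.8, alpha=0.01):
    """One batch TD(lambda) update of linear weights w from a saved
    search trajectory (undiscounted, absorbing at the final state).

    phi maps a state to a feature vector, V(s) = w . phi(s), and the
    value after absorption is taken to be 0. We assign -eps to interior
    transitions and terminal_reward (e.g. -c(2-opt(s_f))/Stein_N) to
    the absorbing step; folding -eps into that step too is a variant.
    """
    features = [phi(s) for s in trajectory]
    values = [float(np.dot(w, f)) for f in features]
    values.append(0.0)                 # value of the absorbing state
    e = np.zeros_like(w)               # eligibility trace
    dw = np.zeros_like(w)              # accumulated batch update
    T = len(trajectory)
    for t in range(T):
        r = terminal_reward if t == T - 1 else -eps
        delta = r + values[t + 1] - values[t]   # undiscounted TD error
        e = lam * e + features[t]
        dw += alpha * delta * e
    return w + dw
```

For example, `stein_estimate(50)` gives the baseline value 1020 cited above for N = 50; the terminal reward for an episode would be `-tour_length / stein_estimate(N)`.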
The coefficients aN and bN are the result of running linear regression on randomly sampled instances of random sizes to determine coefficients that yield the closest fit to a constant target value for normalized neighborhood size and proximity. The results were aN = .383N^2 + .285N - 244.5 and bN = .43N^2 + .736N - 68.9√N + 181.75. The motivation for the quotient features comes from Healy and Moll (1995), who found that using a similar term improved 2-opt on DARP by allowing it to sacrifice cost improvements to gain large neighborhoods. \n\n5 EXPERIMENTAL RESULTS \n\nComparisons among algorithms were done at five representative sizes N = 20, 30, 40, 50, and 60. For the learning phase, we conducted approximately 3,000 learning episodes, each one using a randomly generated instance of size selected randomly between 20 and 60 inclusive. The result of the learning phase was a value function V. To assess the influence of this multi-instance learning, we also repeated the above learning phase 5 times, except that in each we held the instance size fixed to a different one of the 5 representative sizes, yielding in each case a distinct value function VN, where N is the training instance size. Table 1 shows the resulting weight vectors <bias weight, costhoodN weight, proximityN weight>. With the exception of the proximityN weight, these are quite consistent across training instance size. We do not yet understand why training on multiple-sized instances led to this pattern of variation. \n\nTable 2 compares the tour quality found by six different local search algorithms. For the algorithms using learned value functions, the results are for the performance phase after learning using the algorithm listed. Table entries are the percent by which tour length exceeded SteinN for instance size N, averaged over 100 instances of each representative size. 
Thus, 2-opt exceeded Stein20 = 645 on the 100-instance sample set by an average of 42%. The last row in the table gives the results of using the five different value functions VN for the corresponding N. Results for TD(.8) are shown because they were better than those for other values of λ. \n\nTable 2: Comparison of Six Algorithms at Sizes N = 20, 30, 40, 50, 60. Entries are percentage above SteinN averaged over 100 random instances of size N. \n\nAlgorithm            N=20  N=30  N=40  N=50  N=60 \n2-opt                 42    47    53    56    60 \n3-opt                  8     8    11    10    10 \nTD(1)                 28    31    34    39    40 \nTD(.8) ε = 0          27    30    35    37    39 \nTD(.8) ε = .01/N      29    35    37    41    44 \nTD(.8) ε = 0, VN      29    30    32    36    40 \n\nThe learning-enhanced algorithms do well against 2-opt when running time is ignored; indeed TD(.8), ε = 0, is about 35% better (according to this measure) by size 60. Note that 3-opt clearly produces the best tours, and a non-zero ε for TD(.8) decreases tour quality, as expected, since it causes shorter search trajectories. \n\nTable 3 gives the relative running times of the various algorithms. \n\nTable 3: Average Relative Running Times. Times for 2-opt are in seconds; other entries give time divided by 2-opt time. \n\nAlgorithm            N=20  N=30  N=40  N=50  N=60 \n2-opt                .237  .770  1.09  1.95  3.55 \n3-opt                  32    45   100   162   238 \nTD(.8) ε = 0          3.2   3.4   6.3   6.9   7.1 \nTD(.8) ε = .01/N      2.2   1.8   2.6   2.9   3.0 \n\nThe raw running times for 2-opt are given in seconds (Common Lisp on a 266 MHz Mac G3) at each of the five sizes in the first row. Subsequent rows give approximate running times divided by the corresponding 2-opt running time. Times are averages over 30 instances. The algorithms using learned value functions are slower mainly due to the necessity to evaluate the features. 
Note that TD(.8) becomes significantly faster with ε non-zero. \n\nFinally, Table 4 gives the relative performance of several algorithms, normalized for time, including the STAGE algorithm using linear regression with our features. We generated 20 random instances at each of the representative sizes, and we allowed each algorithm to run for the indicated amount of time on each instance. If time remained when a local optimum was reached, we restarted the algorithm at that point, except in the case of 2-opt, where we selected a new random starting tour. The restarting regime for the learning-enhanced algorithms is the regime employed by STAGE. Each algorithm reports the best result found in the allotted time, and the table reports the averages of these values across the 20 instances. Notice that the algorithms that take advantage of extensive off-line learning significantly outperform the other algorithms, including STAGE, which relies on single-instance learning. \n\nTable 4: Performance Comparisons, Equalized for Running Time. \n\nAlgorithm            N=20    N=30    N=40    N=50     N=60 \n                     10 sec  20 sec  40 sec  100 sec  150 sec \n2-opt                  16      29      28      30       38 \nSTAGE                  18      20      32      24       27 \nTD(.8) ε = 0           12      13      16      22       20 \nTD(.8) ε = .01/N       13      11      14      24       28 \n\n6 DISCUSSION \n\nWe have presented an extension to local search that uses RL to enhance the local search cost function for a particular optimization problem. Our method combines aspects of work by Zhang and Dietterich (1995) and Boyan and Moore (1997; Boyan 1998). We have applied our method to a relatively pure optimization problem, DARP, which possesses a relatively consistent structure across problem instances. This has allowed the method to learn a value function that can be applied across all problem instances at all sizes. 
Our method yields significant improvement over a traditional local search approach to DARP on the basis of a very simple linear approximator, built using a relatively impoverished set of features. It also improves upon Boyan and Moore's (1997) STAGE algorithm in our example problem, benefiting from extensive off-line learning whose cost was not included in our assessment. We think this is appropriate for some types of problems; since it is a one-time learning cost, it can be amortized over many future problem instances of practical importance. \n\nAcknowledgement \n\nWe thank Justin Boyan for very helpful discussions of this subject. This research was supported by a grant from the Air Force Office of Scientific Research, Bolling AFB (AFOSR F49620-96-1-0254). \n\nReferences \n\nBoyan, J. A. (1998). Learning Evaluation Functions for Global Optimization. Ph.D. Thesis, Carnegie Mellon University. \n\nBoyan, J. A., and Moore, A. W. (1997). Using Prediction to Improve Combinatorial Optimization Search. Proceedings of AISTATS-97. \n\nBertsekas, D. P., and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont, MA. \n\nHealy, P., and Moll, R. (1995). A New Extension to Local Search Applied to the Dial-A-Ride Problem. European Journal of Operational Research, 8: 83-104. \n\nPsaraftis, H. N. (1983). k-interchange Procedures for Local Search in a Precedence-Constrained Routing Problem. European Journal of Operational Research, 13: 391-402. \n\nZhang, W., and Dietterich, T. G. (1995). A Reinforcement Learning Approach to Job-Shop Scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1114-1120. Morgan Kaufmann, San Francisco. \n\nStein, D. M. (1978). An Asymptotic Probabilistic Analysis of a Routing Problem. Mathematics of Operations Research, 3: 89-101. 
\n\n\f", "award": [], "sourceid": 1573, "authors": [{"given_name": "Robert", "family_name": "Moll", "institution": null}, {"given_name": "Andrew", "family_name": "Barto", "institution": null}, {"given_name": "Theodore", "family_name": "Perkins", "institution": null}, {"given_name": "Richard", "family_name": "Sutton", "institution": null}]}