{"title": "The Noisy Euclidean Traveling Salesman Problem and Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 351, "page_last": 358, "abstract": null, "full_text": "The Noisy Euclidean Traveling Salesman Problem and Learning \n\nMikio L. Braun, Joachim M. Buhmann \nbraunm@cs.uni-bonn.de, jb@cs.uni-bonn.de \nInstitute for Computer Science, Dept. III, University of Bonn \nRömerstraße 164, 53117 Bonn, Germany \n\nAbstract \n\nWe consider noisy Euclidean traveling salesman problems in the plane, which are random combinatorial problems with underlying structure. Gibbs sampling is used to compute average trajectories, which estimate the underlying structure common to all instances. This procedure requires identifying the exact relationship between permutations and tours. In a learning setting, the average trajectory is used as a model to construct solutions to new instances sampled from the same source. Experimental results show that the average trajectory can in fact estimate the underlying structure, and that overfitting effects occur if the trajectory adapts too closely to a single instance. \n\n1 Introduction \n\nThe approach in combinatorial optimization is traditionally single-instance and worst-case-oriented: an algorithm is tested against the worst possible single instance. In reality, algorithms are often applied to a large number of related instances, the average-case performance being the measurement of interest. This constitutes a completely different problem: given a set of similar instances, construct solutions which are good on average. We call this kind of problem multiple-instance and average-case-oriented. Since the instances share some information, this problem might be expected to be simpler than solving all instances separately, even for NP-hard problems. 
\nWe will study the following example of a multiple-instance average-case problem, which is built from the Euclidean traveling salesman problem (TSP) in the plane. Consider a salesman who makes weekly trips. At the beginning of each week, the salesman has a new set of appointments for the week, for which he has to plan the shortest round-trip. The locations of the appointments will not be completely random, because there are certain areas which have a higher probability of containing an appointment, for example cities, or business districts within cities. Instead of solving the planning problem each week from scratch, a clever salesman will try to exploit the underlying density and have a rough trip pre-planned, which he will only adapt from week to week. \n\nAn idealizing formalization of this setting is as follows. Fix the number of appointments n ∈ ℕ. Let x_1, ..., x_n ∈ ℝ² and σ ∈ ℝ₊. Then, the locations of the appointments for each week are given as samples from the normally distributed random vectors (i ∈ {1, ..., n}) \n\nX_i ∼ N(x_i, σ²·I_2). (1) \n\nThe random vector (X_1, ..., X_n) will be called a scenario, sampled appointment locations a (sampled) instance. The task consists in finding the permutation π ∈ S_n which minimizes π ↦ d_{π(n)π(1)} + Σ_{i=1}^{n-1} d_{π(i)π(i+1)}, where d_{ij} := ||X_i − X_j||_2 and S_n is the set of all bijective functions on the set {1, ..., n}. Typical examples are depicted in figure 1(a)-(c). \n\nIt turns out that the multiple-instance average-case setting is related to learning theory, especially to the theory of cost-based unsupervised learning. This relationship becomes clear if one considers the performance measure of interest. The algorithm takes a set of instances I_1, ..., I_n as input and outputs a number of solutions s_1, ..., s_n. It is then measured by the average performance (1/n) Σ_{k=1}^{n} C(s_k, I_k), where C(s, I) denotes the cost of solution s on instance I. We now modify the performance measure as follows. Given a finite number of instances I_1, ..., I_n, the algorithm has to construct a solution s' on a newly sampled instance I'. The performance is then measured by the expected cost E(C(s', I')). This can be interpreted as a learning task: the instances I_1, ..., I_n are the training data, E(C(s', I')) is the analogue of the expected risk or cost, and the set of solutions is identified with the hypothesis class in learning theory. \n\nIn this paper, the setting presented in the previous paragraph is studied with the further restriction that only one training instance is present. From this training instance, an average solution is constructed, represented by a closed curve in the plane. This average trajectory is supposed to capture the essential structure of the underlying probability density, similar to the centroids in K-means clustering. Then, the average trajectory is used as a seed for a simple heuristic which constructs solutions on newly drawn instances. The average trajectories are computed by geometrically averaging tours which are drawn by a Gibbs sampler at finite temperature. This will be discussed in detail in sections 2 and 3. It turns out that the temperature acts as a scale or smoothing parameter. A few comments concerning the selection of this parameter are given in section 6. \n\nThe technical content of our approach is reminiscent of the \"elastic net\" approaches of Durbin and Willshaw (see [2], [5]), but differs in many points. It is based on a completely different algorithmic approach using Gibbs sampling and a general technique for averaging tours. 
Our algorithm has polynomial complexity per Monte Carlo step, and convergence is guaranteed by the usual bounds for Markov chain Monte Carlo simulation and Gibbs sampling. Furthermore, the goal is not to provide a heuristic for computing the best solution, but to extract the relevant statistics of the Gibbs distribution at finite temperature to generate the average trajectory, which will be used to compute solutions on future instances. \n\n2 The Metropolis algorithm \n\nThe Metropolis algorithm is a well-known algorithm which simulates a homogeneous Markov chain whose distribution converges to the Gibbs distribution. We assume that the reader is familiar with the concepts; we give here only a brief sketch of the relevant results and refer to [6], [3] for further details. Let M be a finite set and f: M → ℝ. The Gibbs distribution at temperature T ∈ ℝ₊ is given by (m ∈ M) \n\ng_T(m) := exp(−f(m)/T) / Σ_{m' ∈ M} exp(−f(m')/T). (2) \n\nThe Metropolis algorithm works as follows. We start with any element m ∈ M and set X_1 ← m. For i ≥ 2, apply a random local update m' := φ(X_i). Then set \n\nX_{i+1} := m' with probability min{1, exp(−(f(m') − f(X_i))/T)}, and X_{i+1} := X_i else. (3) \n\nThis scheme converges to the Gibbs distribution if certain conditions on φ are met. Furthermore, an L²-law of large numbers holds: for h: M → ℝ, (1/n) Σ_{k=1}^{n} h(X_k) → Σ_{m ∈ M} g_T(m) h(m) in L². For the TSP, M = S_n and φ is the Lin-Kernighan two-change [4], which consists in choosing two indices i, j at random and reversing the path between the appointments i and j. Note that the Lin-Kernighan two-change and its generalizations for neighborhood search are powerful heuristics in themselves. \n\n3 Averaging Tours \n\nOur goal is to compute the average trajectory, which should grasp the underlying structure common to all instances, with respect to the Gibbs measure at non-zero temperature T. 
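The sampling step of section 2, with the two-change as the local update φ, can be sketched in a few lines (a minimal illustration in our own notation; the function names and the generator interface are not from the paper):

```python
import math
import random

def tour_length(perm, dist):
    """Cost f of the closed tour represented by the permutation `perm`."""
    n = len(perm)
    return sum(dist[perm[i]][perm[(i + 1) % n]] for i in range(n))

def two_change(perm):
    """Two-change move: reverse the path between two random positions."""
    i, j = sorted(random.sample(range(len(perm)), 2))
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

def metropolis(dist, n_steps, T):
    """Metropolis chain on S_n at temperature T; yields one permutation per step."""
    perm = list(range(len(dist)))
    cost = tour_length(perm, dist)
    for _ in range(n_steps):
        cand = two_change(perm)
        cand_cost = tour_length(cand, dist)
        # Accept with probability min{1, exp(-(f(m') - f(X_i))/T)}, cf. eq. (3).
        if cand_cost <= cost or random.random() < math.exp(-(cand_cost - cost) / T):
            perm, cost = cand, cand_cost
        yield list(perm)
```

In practice only every k-th sample would be kept, to decouple consecutive states as in the experiments of section 5.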
The Metropolis algorithm produces a sequence of permutations π_1, π_2, ... with P{π_n = ·} → g_T(·) for n → ∞. Since permutations cannot be added, we cannot simply compute the empirical mean of the π_n. Instead, we map permutations to their corresponding trajectories. \n\nDefinition 1 (trajectory) The trajectory of π ∈ S_n given n points X_1, ..., X_n is a mapping Γ(π): {1, ..., n} → ℝ² defined by Γ(π)(i) := X_{π(i)}. The set of all trajectories (for all sets of n points) is denoted by T_n (this is the set of all mappings γ: {1, ..., n} → ℝ²). \n\nAddition of trajectories and multiplication with scalars can be defined pointwise. Then it is technically possible to compute (1/t) Σ_{k=1}^{t} Γ(π_k). Unfortunately, this does not yield the desired results, since the relation between permutations and tours is not one-to-one. For example, the permutation obtained by starting the tour at a different city still corresponds to the same tour. We therefore need to define the addition of trajectories in a way which is independent of the choice of permutation (and therefore trajectory) to represent the tour. We will first study the relationship between tours and permutations in some detail, since we feel that the concepts introduced here might be generally useful for analyzing combinatorial optimization problems. \n\nDefinition 2 (tour and length of a tour) Let G = (V, E) be a complete (undirected) graph with V = {1, ..., n} and E = {{i, j} | i, j ∈ V, i ≠ j}. A subset t ⊆ E is called a tour iff |t| = n, for every v ∈ V there exist exactly two e_1, e_2 ∈ t such that v ∈ e_1 and v ∈ e_2, and (V, t) is connected. Given a symmetric matrix (d_{ij}) of distances, the length of a tour t is defined by C(t) := Σ_{{i,j} ∈ t} d_{ij}. \n\nThe tour corresponding to a permutation π ∈ S_n is given by \n\nt(π) := {{π(1), π(n)}} ∪ ⋃_{i=1}^{n-1} {{π(i), π(i+1)}}. (4) \n\nIf t(π) = t for a permutation π and a tour t, we say that π represents t. 
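The mapping (4) and the resulting many-to-one relation between permutations and tours can be illustrated directly (0-based indices; the helper name is ours):

```python
def tour_of(perm):
    """Edge set t(pi) of eq. (4): the n undirected edges {pi(i), pi(i+1)}, cyclically."""
    n = len(perm)
    return frozenset(frozenset((perm[i], perm[(i + 1) % n])) for i in range(n))

# Starting the tour at a different city, or reversing its direction, changes
# the permutation but not the tour it represents.
p = [0, 1, 2, 3, 4]
assert tour_of(p) == tour_of(p[1:] + p[:1]) == tour_of(p[::-1])
```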
We call two permutations π, π' equivalent if they represent the same tour, and write π ∼ π'. Let [π] denote the equivalence class of π as usual. Note that the length of a permutation is fully determined by its equivalence class. Therefore, ∼ describes the intrinsic symmetries of the TSP formulated as an optimization problem on S_n, denoted by TSP(S_n). \n\nWe have to define the addition ⊕ of trajectories such that the sum is independent of the representation. This means that for two tours t_1, t_2 such that t_1 is represented by π_1, π_1' and t_2 by π_2, π_2', it holds that Γ(π_1) ⊕ Γ(π_2) ∼ Γ(π_1') ⊕ Γ(π_2'). The idea will be to normalize both summands before addition. We will first study the exact representation symmetry of TSP(S_n). \n\nThe TSP(S_n) symmetry group. Algebraically speaking, S_n is a group with concatenation of functions as multiplication, so we can characterize the equivalence classes of ∼ by studying the set of operations on a permutation which map to the same equivalence class. We define a group action of S_n on itself by right translation (π, g ∈ S_n): \n\n\"·\": S_n × S_n → S_n, g · π := πg^{-1}. (5) \n\nNote that any permutation in S_n can be mapped to any other by an appropriate group action (namely π to π' by (π'^{-1}π) · π), such that the group action of S_n on itself suffices to study the equivalence classes of ∼. \n\nFor certain g ∈ S_n, it holds that t(g · π) = t(π). We want to determine the maximal set H_t of elements which keep t invariant. It even holds that H_t is a subgroup of S_n: the identity is trivially in H_t, and if g, h are t-invariant, then t((gh^{-1}) · π) = t(g · (h^{-1} · π)) = t(h^{-1} · π) = t(h · (h^{-1} · π)) = t(π). H_t will be called the symmetry group of TSP(S_n), and it follows that [π] = H_t · π := {h · π | h ∈ H_t}. \n\nThe shift σ and reversal ρ are defined by (i ∈ {1, ..., n}) \n\nσ(i) := i + 1 for i < n, σ(n) := 1, ρ(i) := n + 1 − i, (6) \n\nand we set H := ⟨ρ, σ⟩, the group generated by σ and ρ. It holds that (this result is an easy consequence of ρρ = id_{1,...,n}, ρσ = σ^{-1}ρ and σ^n = id_{1,...,n}) \n\nH = {σ^k | k ∈ {1, ..., n}} ∪ {ρσ^k | k ∈ {1, ..., n}}. (7) \n\nThe fundamental result is \n\nTheorem 1 Let t be the mapping which sends permutations to tours as defined in (4). Then H_t = H, where H_t is the set of all t-invariant permutations and H is defined in (7). \n\nProof: It is obvious that H ⊆ H_t. Now let h ∈ H_t. We are going to prove that t-invariant permutations are completely determined by their values on 1 and 2. Let k := h(1). Then h(2) = σ(k) or h(2) = σ^{-1}(k), because otherwise h would give rise to a link {π(h(1)), π(h(2))} ∉ t(π). For the same reason, h(3) must be mapped to σ^{±2}(k). Since h must be bijective, h(3) ≠ h(1), so that the sign of the exponent must be the same as for h(2). In general, h(i) = σ^{±(i−1)}(k). Now note that for i, k ∈ {1, ..., n}, σ^i(k) = σ^k(i), and therefore \n\nh = σ^{k−1} if h(i) = σ^{i−1}(k), and h = ρσ^{n−k} if h(i) = σ^{−i+1}(k). □ \n\nAdding trajectories. We can now define equivalence for trajectories. First define a group action of S_n on T_n analogously to (5): the action of h ∈ H_t on γ ∈ T_n is given by h · γ := γ ∘ h^{-1}. Furthermore, we say that γ ∼ η if H_t · γ = H_t · η. \n\nOur approach is motivated geometrically. We measure distances between trajectories as follows. Let d: ℝ² × ℝ² → ℝ₊ be a metric. Then define (γ, η ∈ T_n) \n\nd(γ, η) := Σ_{k=1}^{n} d(γ(k), η(k)). (8) \n\nBefore adding two trajectories, we first choose equivalent representations γ', η' which minimize d(γ', η'). Because of the results presented so far, searching through all equivalent trajectories is computationally tractable. Note that for h ∈ H_t, it holds that d(h · γ, h · η) = d(γ, η), as h only reorders the summands. It follows that it suffices to change the representation of only one argument, since d(h · γ, i · η) = d(γ, h^{-1}i · η). So the time complexity of one addition reduces to 2n computations of distances which involve n subtractions each. \n\nThe normalizing action is defined by (γ, η ∈ T_n) \n\nn_{γ,η} := argmin_{n ∈ H_t} d(γ, n · η). (9) \n\nAssuming that the normalizing action is unique¹, we can prove \n\nTheorem 2 Let γ, η be two trajectories, and n_{γ,η} the unique normalizing action as defined in (9). Then the operation \n\nγ ⊕ η := γ + n_{γ,η} · η (10) \n\nis representation invariant. \n\nProof: Let γ' = g · γ, η' = h · η for g, h ∈ H_t. We claim that n_{γ',η'} = g n_{γ,η} h^{-1}. The normalizing action is defined by \n\nn_{γ',η'} = argmin_{n' ∈ H_t} d(γ', n' · η') = argmin_{n' ∈ H_t} d(g · γ, n'h · η) = argmin_{n' ∈ H_t} d(γ, g^{-1}n'h · η), (11) \n\nby inserting g^{-1} in parallel before both arguments in the last step. Since the normalizing action is unique, it follows that for the n' realizing the minimum it holds that g^{-1}n'h = n_{γ,η}, and therefore n' = n_{γ',η'} = g n_{γ,η} h^{-1}. Now consider the sum \n\nγ' ⊕ η' = g · γ + (g n_{γ,η} h^{-1}) · (h · η) = g · (γ + n_{γ,η} · η) = g · (γ ⊕ η), (12) \n\nwhich proves the representation independence. □ \n\nThe sum of more than two trajectories can be defined by normalizing everything with respect to the first summand, so that empirical sums (1/t) ⊕_{i=1}^{t} Γ(π_i) are now well-defined. \n\n4 Inferring Solutions on New Instances \n\nWe transfer a trajectory to a new set of appointments X_1, ..., X_n by computing the relaxed tour using the following finite-horizon adaption technique: First of all, passing times t_i for all appointments are computed. We extend the domain of a trajectory γ from {1, ..., n} to the interval [1, n + 1) by linear interpolation. 
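Looking back at section 3, the representation-invariant sum of eqs. (8)-(10) amounts to enumerating the 2n representations given by Theorem 1 and adding the one closest to the first summand. A minimal sketch (0-based trajectories as lists of points; the names are ours):

```python
def equivalent_reps(eta):
    """All 2n representations h . eta for h in H = {shifts} union {reversal * shifts}."""
    reps = []
    for g in (eta, eta[::-1]):          # identity and the reversal rho
        for k in range(len(eta)):       # the shifts sigma^k
            reps.append(g[k:] + g[:k])
    return reps

def l1(gamma, eta):
    """Trajectory distance (8) with the l1 metric on R^2 (as used in section 5)."""
    return sum(abs(g[0] - e[0]) + abs(g[1] - e[1]) for g, e in zip(gamma, eta))

def oplus(gamma, eta):
    """Representation-invariant sum (10): add the representation of eta
    realizing the normalizing action (9) with respect to gamma."""
    best = min(equivalent_reps(eta), key=lambda rep: l1(gamma, rep))
    return [(g[0] + b[0], g[1] + b[1]) for g, b in zip(gamma, best)]
```

The brute-force minimum over 2n candidates matches the complexity argument above: one addition costs 2n distance evaluations of n terms each.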
Then we define t_i such that γ(t_i) is the earliest point with minimal distance between appointment X_i and the trajectory. The passing times can be calculated easily by simple geometric considerations. The permutation which sorts (t_i)_{i=1}^{n} is the relaxed solution of γ to (X_i). \n\nIn a post-processing step, self-intersections are removed first. Then, segments of length w are optimized by exhaustive search. Let π be the relaxed solution. The path from π(i) to π(i + w + 2) (index addition is modulo n) is replaced by the best alternative through the appointments π(i + 1), ..., π(i + w + 1). Iterate for all i ∈ {1, ..., n} until there is no further improvement. Since this procedure has time complexity w!n, it can only be done efficiently for small w. \n\n¹Otherwise, perturb the locations of the appointments by infinitesimal changes. \n\n5 Experiments \n\nFor the experiments, we used the following set-up: we took the ||·||_1-norm to determine the normalizing action. Typical sample sizes for the Markov chain Monte Carlo integration were 1000, with 100 steps in between to decouple consecutive samples. Scenarios were modeled after eq. (1), where the x_i were chosen to form simple geometric shapes. \n\nAverage trajectories for different temperatures are plotted in figures 1(a)-(c). As the temperature decreases, the average trajectory converges to the trajectory of a single locally optimal tour. The graphs demonstrate that the temperature T acts as a smoothing parameter. \n\nTo estimate the expected risk of an average trajectory, the post-processed relaxed (PPR) solutions were averaged over 100 new instances (see figure 1(d)-(g)) in order to estimate the expected costs. The costs of the best solutions are good approximations, within 5% of the average minimum as determined by careful simulated annealing. An interesting effect occurs: the expected costs have their minimum at non-zero temperature. 
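The PPR solutions above start from the relaxed solution of section 4. A crude sketch of that step, which approximates the passing times by sampling the linearly interpolated closed trajectory instead of projecting exactly (the names and the sampling resolution are our own choices):

```python
import math

def relaxed_solution(traj, points, samples_per_segment=20):
    """Order appointments by the earliest time at which the closed,
    linearly interpolated trajectory comes closest to them."""
    n = len(traj)
    times = []
    for px, py in points:
        best_t, best_d = 0.0, math.inf
        for i in range(n):                  # segment traj[i] -> traj[(i+1) % n]
            (ax, ay), (bx, by) = traj[i], traj[(i + 1) % n]
            for s in range(samples_per_segment):
                u = s / samples_per_segment
                x, y = ax + u * (bx - ax), ay + u * (by - ay)
                d = math.hypot(x - px, y - py)
                if d < best_d - 1e-12:      # strict: keep the earliest minimum
                    best_t, best_d = i + u, d
        times.append(best_t)
    return sorted(range(len(points)), key=lambda i: times[i])
```

The subsequent exhaustive-search post-processing over windows of length w is omitted here.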
The corresponding trajectories are plotted in figure 1(e),(f). They recover the structure of the scenario. In other words, average trajectories computed at temperatures which are too low start to overfit to noise present only in the instance for which they were computed. So computation of the global optimum of a noisy combinatorial optimization problem might not be the right strategy, because the solutions might not reflect the underlying structure. Averaging over many suboptimal solutions provides much better statistics. \n\n6 Selection of the Temperature \n\nThe question remains how to select the optimal temperature. This problem is essentially the same as determining the correct model complexity in learning theory, and therefore no fully satisfying answer is readily available. The problem is nevertheless suited for the application of the heuristic provided by the empirical risk approximation (ERA) framework [1], which will be briefly sketched here. \n\nThe main idea of ERA is to coarse-grain the set of hypotheses M by treating hypotheses as equivalent which are only slightly different. Hypotheses whose ℓ1 mutual distance (defined in a similar fashion as (8)) is smaller than a parameter γ ∈ ℝ₊ are considered statistically equivalent. Selecting a subset of solutions such that ℓ1-spheres of radius γ cover M results in the coarse-grained hypothesis set M_γ. VC-type large deviation bounds depending on the size of the coarse-grained hypothesis class can now be derived: \n\nP{C(m_γ) − min_{m ∈ M} C(m) > 2ε} ≤ 2|M_γ| sup_{m ∈ M_γ} exp(−n(ε − γ)² / (a_m + ε(ε − γ))), (13) \n\nwith a_m depending on the distribution. The bound weighs two competing effects: on the one hand, increasing γ introduces a systematic bias in the estimation; on the other hand, decreasing γ increases the cardinality of the hypothesis class. Given a confidence δ > 0, the probability of being worse than ε > 0 on a second instance and γ are linked, so an optimal coarsening γ can be determined. ERA then advocates to either sample from the γ-sphere around the empirical minimizer or to average over these solutions. \n\nNow it is well known that the Gibbs sampler is concentrated on solutions whose costs are below a certain threshold. Therefore, ERA is suited for our approach. In the relating equation, the log-cardinality of the approximation set occurs, which is usually interpreted as microcanonical entropy. This relates back to statistical physics, the starting point of our whole approach. Interpreting γ as energy, we can compute the stop temperature from the optimal γ. Using the well-known relation from statistical physics ∂(entropy)/∂(energy) = T^{-1}, we can derive a lower bound on the optimal temperature depending on variance estimates of the specific scenario given. \n\n7 Conclusion \n\nIn reality, optimization algorithms are often applied to many similar instances. We pointed out that this can be interpreted as a learning problem. The underlying structure of similar instances should be extracted and used in order to reduce the computational complexity of computing solutions to related instances. \n\nStarting with the noisy Euclidean TSP, the construction of average tours is studied in this paper, which involves determining the exact relationship between permutations and tours, and identifying the intrinsic symmetries of the TSP. We hope that this technique might prove to be useful for other applications in the field of averaging over solutions of combinatorial problems. The average trajectories are able to capture the underlying structure common to all instances. A heuristic for constructing solutions on new instances is proposed. An empirical study of these procedures is conducted with results satisfying our expectations. 
\nIn terms of learning theory, overfitting effects can be observed. This phenomenon points at a deep connection between combinatorial optimization problems with noise and learning theory, which might be bidirectional. On the one hand, we believe that noisy (in contrast to random) combinatorial optimization problems are dominant in reality. Robust algorithms could be built by first estimating the undistorted structure and then using this structure as a guideline for constructing solutions for single instances. On the other hand, hardness of efficient optimization might be linked to the inability to extract meaningful structure. These connections, which are the subject of further studies, link statistical complexity to computational complexity. \n\nAcknowledgments \n\nThe authors would like to thank Naftali Tishby, Scott Kirkpatrick and Michael Clausen for their helpful comments and discussions. \n\nReferences \n\n[1] J. M. Buhmann and M. Held. Model selection in clustering by uniform convergence bounds. Advances in Neural Information Processing Systems, 12:216-222, 1999. \n[2] R. Durbin and D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326:689-691, 1987. \n[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983. \n[4] S. Lin and B. Kernighan. An effective heuristic algorithm for the traveling salesman problem. Operations Research, 21:498-516, 1973. \n[5] P. D. Simic. Statistical mechanics as the underlying theory of \"elastic\" and \"neural\" optimizations. Network, 1:89-103, 1990. \n[6] G. Winkler. Image Analysis, Random Fields and Dynamic Monte Carlo Methods, volume 27 of Applications of Mathematics. Springer, Heidelberg, 1995. 
\n\nFigure 1: (a) Average trajectories at different temperatures for n = 100 appointments on a circle with σ² = 0.03. (b) Average trajectories at different temperatures, for multiple Gaussian sources, n = 50 and σ² = 0.025. (c) The same for an instance with structure on two levels. (d) Average tour length of the post-processed relaxed (PPR) solutions for the circle instance plotted in (a). The PPR width was w = 5. The average fits to noise in the data if the temperature is too low, leading to overfitting phenomena. Note that the average best solution is ≤ 16.5. (e) The average trajectory with the smallest average length of its PPR solutions in (d). (f) Average tour length as in (d). The average best solution is ≤ 10.80. (g) Lowest-temperature trajectory with small average PPR solution length in (f).", "award": [], "sourceid": 2049, "authors": [{"given_name": "Mikio", "family_name": "Braun", "institution": null}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": null}]}