{"title": "Softassign versus Softmax: Benchmarks in Combinatorial Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 626, "page_last": 632, "abstract": null, "full_text": "Softassign versus Softmax: Benchmarks \n\nin Combinatorial Optimization \n\nSteven Gold \n\nYale University \n\nAnand Rangarajan \n\nYale University \n\nDepartment of Computer Science \n\nDept. of Diagnostic Radiology \n\nNew Haven, CT 06520-8285 \n\nNew Haven, CT 06520-8042 \n\nAbstract \n\nA new technique, termed soft assign, is applied for the first time \nto two classic combinatorial optimization problems, the travel(cid:173)\ning salesman problem and graph partitioning. Soft assign , which \nhas emerged from the recurrent neural network/statistical physics \nframework, enforces two-way (assignment) constraints without the \nuse of penalty terms in the energy functions. The soft assign can \nalso be generalized from two-way winner-take-all constraints to \nmultiple membership constraints which are required for graph par(cid:173)\ntitioning. The soft assign technique is compared to the softmax \n(Potts glass). Within the statistical physics framework, softmax \nand a penalty term has been a widely used method for enforcing the \ntwo-way constraints common within many combinatorial optimiza(cid:173)\ntion problems. The benchmarks present evidence that soft assign \nhas clear advantages in accuracy, speed, parallelizabilityand algo(cid:173)\nrithmic simplicity over softmax and a penalty term in optimization \nproblems with two-way constraints. \n\n1 \n\nIntroduction \n\nIn a series of papers in the early to mid 1980's, Hopfield and Tank introduced \ntechniques which allowed one to solve combinatorial optimization problems with \nrecurrent neural networks [Hopfield and Tank, 1985]. 
As researchers attempted to reproduce the original traveling salesman problem results of Hopfield and Tank, problems emerged, especially in terms of the quality of the solutions obtained. More recently, however, a number of techniques from statistical physics have been adopted to mitigate these problems. These include deterministic annealing, which convexifies the energy function in order to help avoid some local minima, and the Potts glass approximation, which results in a hard enforcement of a one-way (one set of) winner-take-all (WTA) constraint via the softmax. In the late 80's, armed with these techniques, optimization problems like the traveling salesman problem (TSP) [Peterson and Soderberg, 1989] and graph partitioning [Peterson and Soderberg, 1989, Van den Bout and Miller III, 1990] were reexamined, and much better results compared to the original Hopfield-Tank dynamics were obtained. \nHowever, when the problem calls for two-way interlocking WTA constraints, as do TSP and graph partitioning, the resulting energy function must still include a penalty term when the softmax is employed in order to enforce the second set of WTA constraints. Such penalty terms may introduce spurious local minima in the energy function and involve free parameters which are hard to set. A new technique, termed softassign, eliminates the need for all such penalty terms. The first use of the softassign was in an algorithm for the assignment problem [Kosowsky and Yuille, 1994]. It has since been applied to much more difficult optimization problems, including parametric assignment problems (point matching) [Gold et al., 1994, Gold et al., 1995, Gold et al., 1996] and quadratic assignment problems (graph matching) [Gold et al., 1996, Gold and Rangarajan, 1996, Gold, 1995]. 
\nHere, we for the first time apply the softassign to two classic combinatorial optimization problems, TSP and graph partitioning. Moreover, we show that the softassign can be generalized from two-way winner-take-all constraints to multiple membership constraints, which are required for graph partitioning (as described below). We then run benchmarks against the older softmax (Potts glass) methods and demonstrate advantages in terms of accuracy, speed, parallelizability, and simplicity of implementation. \nIt must be emphasized that there are other conventional techniques for solving some combinatorial optimization problems, such as TSP, which remain superior to this method in certain ways [Lawler et al., 1985]. (We think for some problems, specifically the type of pattern matching problems essential for cognition [Gold, 1995], this technique is superior to conventional methods.) Even within neural networks, elastic net methods may still be better in certain cases. However, the elastic net uses only a one-way constraint in TSP. The main goal of this paper is to provide evidence that, when minimizing energy functions with two-way constraints within the neural network framework, the softassign should be the technique of choice. We therefore compare it to the current dominant technique, softmax with a penalty term. \n\n2 Optimizing With Softassign \n\n2.1 The Traveling Salesman Problem \n\nThe traveling salesman problem may be defined in the following way. Given a set of intercity distances {d_{ab}}, which may take values in R^+, find the permutation matrix M such that the following objective function is minimized: \n\nE_1(M) = \\frac{1}{2} \\sum_{a=1}^N \\sum_{b=1}^N \\sum_{i=1}^N d_{ab} M_{ai} M_{b(i \\oplus 1)}   (1) \n\nsubject to \\forall a \\sum_{i=1}^N M_{ai} = 1, \\forall i \\sum_{a=1}^N M_{ai} = 1, \\forall a,i \\; M_{ai} \\in \\{0, 1\\}. \nIn the above objective d_{ab} represents the distance between cities a and b. 
M is a permutation matrix whose rows represent cities and whose columns represent the day (or order) on which the city was visited, and N is the number of cities. (The notation i \\oplus 1 is used to indicate that subscripts are defined modulo N, i.e. M_{a(N+1)} = M_{a1}.) So if M_{ai} = 1 it indicates that city a was visited on day i. \nThen, following [Peterson and Soderberg, 1989, Yuille and Kosowsky, 1994], we employ Lagrange multipliers and an x \\log x barrier function to enforce the constraints, as well as a \\gamma term for stability, resulting in the following objective: \n\nE_2(M, \\mu, \\nu) = \\frac{1}{2} \\sum_{a=1}^N \\sum_{b=1}^N \\sum_{i=1}^N d_{ab} M_{ai} M_{b(i \\oplus 1)} - \\frac{\\gamma}{2} \\sum_{a=1}^N \\sum_{i=1}^N M_{ai}^2 + \\frac{1}{\\beta} \\sum_{a=1}^N \\sum_{i=1}^N M_{ai} (\\log M_{ai} - 1) + \\sum_{a=1}^N \\mu_a (\\sum_{i=1}^N M_{ai} - 1) + \\sum_{i=1}^N \\nu_i (\\sum_{a=1}^N M_{ai} - 1)   (2) \n\nIn the above we are looking for a saddle point by minimizing with respect to M and maximizing with respect to \\mu and \\nu, the Lagrange multipliers. \n\n2.2 The Softassign \n\nIn the above formulation of TSP we have two-way interlocking WTA constraints. {M_{ai}} must be a permutation matrix to ensure that a valid tour, one in which each city is visited once and only once, is described. A permutation matrix means all the rows and columns must add to one (and the elements must be zero or one) and therefore requires two-way WTA constraints: a set of WTA constraints on the rows and a set of WTA constraints on the columns. This set of two-way constraints may also be considered assignment constraints, since each city must be assigned to one and only one day (the row constraint) and each day must be assigned to one and only one city (the column constraint). \n\nThese assignment constraints can be satisfied using a result from [Sinkhorn, 1964]. 
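\n\nAs a concrete illustration (not part of the original paper), the objective (1) can be evaluated directly for a candidate permutation matrix. The following is a minimal sketch in NumPy; the function and variable names are ours: \n\n```python
import numpy as np

def tsp_energy(D, M):
    """E1 = (1/2) * sum_{a,b,i} D[a,b] * M[a,i] * M[b,(i+1) mod N],
    mirroring objective (1). D is the N x N intercity distance matrix;
    M is an N x N permutation matrix (rows = cities, columns = days)."""
    M_next = np.roll(M, -1, axis=1)  # column i now holds day (i+1) mod N
    return 0.5 * np.einsum('ab,ai,bi->', D, M, M_next)

# Tiny check: 3 cities on a right triangle, visited in order 0 -> 1 -> 2 -> 0.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
M = np.eye(3)  # city a visited on day a
# With M = I the triple sum collapses to sum_i D[i, (i+1) % N], the closed
# tour length 1 + 1 + sqrt(2); the 1/2 factor in (1) halves it.
E = tsp_energy(D, M)
```
\nFor any other permutation M, only which tour edges enter the sum changes; a doubly stochastic M (as produced during annealing) gives a soft interpolation of such tours. \n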
\nIn [Sinkhorn, 1964] it is proven that any square matrix whose elements are all positive will converge to a doubly stochastic matrix just by the iterative process of alternately normalizing the rows and columns. (A doubly stochastic matrix is a matrix whose elements are all positive and whose rows and columns all add up to one; it may roughly be thought of as the continuous analog of a permutation matrix.) \nThe softassign simply employs Sinkhorn's technique within a deterministic annealing context. Figure 1 depicts the contrast between the softassign and the softmax. In the softmax, a one-way WTA constraint is strictly enforced by normalizing over a vector. \n[Kosowsky and Yuille, 1994] used the softassign to solve the assignment problem, i.e. minimize -\\sum_{a=1}^N \\sum_{i=1}^N M_{ai} Q_{ai}. For the special case of the quadratic assignment problem being solved here, by setting Q_{ai} = -\\partial \\hat{E}_2 / \\partial M_{ai} and using the values of M from the previous iteration, we can at each iteration produce a new assignment problem for which the softassign then returns a doubly stochastic matrix. As the temperature is lowered, a series of assignment problems is generated, along with the corresponding doubly stochastic matrices returned by each softassign, until a permutation matrix is reached. \nThe update with the partial derivative in the preceding may be derived using a Taylor series expansion. See [Gold and Rangarajan, 1996, Gold, 1995] for details. \n\n[Figure 1 shows the two update schemes side by side. Softassign: positivity, M_{ai} = \\exp(\\beta Q_{ai}); then two-way constraints via alternating row normalization, M_{ai} \\leftarrow M_{ai} / \\sum_i M_{ai}, and column normalization, M_{ai} \\leftarrow M_{ai} / \\sum_a M_{ai}. Softmax: positivity, M_i = \\exp(\\beta Q_i); then a one-way constraint via a single normalization, M_i \\leftarrow M_i / \\sum_j M_j.] \n\nFigure 1: Softassign and softmax. This paper compares these two techniques. \n\nThe algorithm dynamics then become: 
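\n\nThe contrast in Figure 1 can be stated compactly in code. Below is a minimal sketch of both update schemes, assuming NumPy; the iteration cap, tolerance, and max-shift for numerical stability are our additions, not the paper's: \n\n```python
import numpy as np

def softmax_wta(Q, beta):
    """One-way WTA (Potts glass): positivity, then a single normalization."""
    m = np.exp(beta * (Q - Q.max()))  # shift exponent for numerical stability
    return m / m.sum()

def softassign(Q, beta, n_iters=300, tol=1e-6):
    """Two-way constraints: positivity, then Sinkhorn's alternating row and
    column normalizations, which converge to a doubly stochastic matrix
    from any all-positive starting matrix."""
    M = np.exp(beta * (Q - Q.max()))  # global scale does not affect Sinkhorn
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # row normalization
        M = M / M.sum(axis=0, keepdims=True)  # column normalization
        if np.allclose(M.sum(axis=1), 1.0, atol=tol):
            break
    return M

rng = np.random.default_rng(0)
M = softassign(rng.random((5, 5)), beta=10.0)
# Rows and columns of M now each sum to (approximately) one.
```
\nNote that the softmax enforces only one set of WTA constraints; the softassign's extra inner loop is exactly what replaces the penalty term. \n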
\n\nQ_{ai} = -\\frac{\\partial \\hat{E}_2}{\\partial M_{ai}}   (3) \n\nM_{ai} = \\mathrm{softassign}_{ai}(Q)   (4) \n\n\\hat{E}_2 is E_2 without the \\beta, \\mu, or \\nu terms of (2); therefore no penalty terms are now included. The above dynamics are iterated as \\beta, the inverse temperature, is gradually increased. \nThese dynamics may be obtained by evaluating the saddle points of the objective in (2). Sinkhorn's method finds the saddle points for the Lagrange parameters. \n\n2.3 Graph Partitioning \n\nThe graph partitioning problem may be defined in the following way. Given an unweighted graph G, find the membership matrix M such that the following objective function is minimized: \n\nE_3(M) = -\\sum_{a=1}^A \\sum_{i=1}^I \\sum_{j=1}^I G_{ij} M_{ai} M_{aj}   (5) \n\nsubject to \\forall a \\sum_{i=1}^I M_{ai} = I/A, \\forall i \\sum_{a=1}^A M_{ai} = 1, \\forall a,i \\; M_{ai} \\in \\{0, 1\\}, where graph G has I nodes which should be equally partitioned into A bins. \n{G_{ij}} is the adjacency matrix of the graph, whose elements must be 0 or 1. M is a membership matrix such that M_{ai} = 1 indicates that node i is in bin a. The permutation matrix constraint present in TSP is modified to the membership constraint: node i is a member of only bin a, and the number of members in each bin is fixed at I/A. When the above objective is at a minimum, graph G will be partitioned into A equal-sized bins such that the cutsize is minimum over all possible partitionings of G into A equal-sized bins. We assume I/A is an integer. \nThen, following the treatment for TSP, we derive the following objective: \n\nE_4(M, \\mu, \\nu) = -\\sum_{a=1}^A \\sum_{i=1}^I \\sum_{j=1}^I G_{ij} M_{ai} M_{aj} - \\frac{\\gamma}{2} \\sum_{a=1}^A \\sum_{i=1}^I M_{ai}^2 + \\frac{1}{\\beta} \\sum_{a=1}^A \\sum_{i=1}^I M_{ai} (\\log M_{ai} - 1) + \\sum_{a=1}^A \\mu_a (\\sum_{i=1}^I M_{ai} - I/A) + \\sum_{i=1}^I \\nu_i (\\sum_{a=1}^A M_{ai} - 1)   (6) \n\nwhich is minimized with a similar algorithm employing the softassign. Note, however, that now in the softassign the columns are normalized to I/A instead of 1. 
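\n\nGeneralizing from assignment to membership constraints changes only the normalization targets in the Sinkhorn-style inner loop: one direction still sums to 1, the other to I/A. A minimal sketch under our own naming (bins index rows, nodes index columns; this is an illustration, not the paper's exact implementation): \n\n```python
import numpy as np

def membership_softassign(Q, beta, n_bins, n_nodes, n_iters=300):
    """Softassign with membership constraints: each node belongs to one bin
    (per-node sums = 1) while each bin holds n_nodes / n_bins members
    (per-bin sums = I/A). Q is an (n_bins x n_nodes) benefit matrix."""
    size = n_nodes / n_bins  # I/A, assumed integer in the paper
    M = np.exp(beta * (Q - Q.max()))  # positivity step
    for _ in range(n_iters):
        M = M / M.sum(axis=0, keepdims=True)           # per node: sum over bins = 1
        M = size * M / M.sum(axis=1, keepdims=True)    # per bin: sum over nodes = I/A
    return M

rng = np.random.default_rng(1)
A_bins, I_nodes = 4, 8
M = membership_softassign(rng.random((A_bins, I_nodes)), beta=5.0,
                          n_bins=A_bins, n_nodes=I_nodes)
# Each node's memberships sum to ~1; each bin's memberships sum to I/A = 2.
```
\nThe two prescribed marginals are consistent (A x I/A = I = sum of the per-node targets), so the alternating normalization still converges for any all-positive starting matrix. \n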
\n\n3 Experimental Results \n\nExperiments on Euclidean TSP and graph partitioning were conducted. For each problem three different algorithms were run. One used the softassign described above. The second used the Potts glass dynamics employing synchronous update as described in [Peterson and Soderberg, 1989]. The third used the Potts glass dynamics employing serial update as described in [Peterson and Soderberg, 1989]. Originally the intention was to employ just the synchronous updating version of the Potts glass dynamics, since that is the dynamics used in the algorithms employing softassign and is the method that is massively parallelizable. We believe massive parallelism to be such a critical feature of the neural network architecture [Rumelhart and McClelland, 1986] that any algorithm that does not have this feature loses much of the power of the neural network paradigm. Unfortunately, the synchronous updating algorithms worked so poorly that we also ran the serial versions in order to get a more extensive comparison. Note that the results reported in [Peterson and Soderberg, 1989] were all with the serial versions. \n\n3.1 Euclidean TSP Experiments \n\nFigure 2 shows the results of the Euclidean TSP experiments. 500 different 100-city tours from points uniformly generated in the 2D unit square were used as input. The asymptotic expected length of an optimal tour for cities distributed in the unit square is given by L(n) = K \\sqrt{n}, where n is the number of cities and 0.765 \\le K \\le 0.765 + 4/n [Lawler et al., 1985]. This gives the interval [7.65, 8.05] for the 100 city TSP. 95% of the tour lengths fall in the interval [8, 11] when using the softassign approach. Note the large difference in performance between the softassign and the Potts glass algorithms. The serial Potts glass algorithm ran about 5 times slower than the softassign version. 
Also, as noted previously, the serial version is not massively parallelizable. The synchronous Potts glass ran about 2 times slower. Also note that the softassign algorithm is much simpler to implement, with fewer parameters to tune. \n\n[Figure 2: histograms of tour lengths for the three algorithms.] \n\nFigure 2: 100 City Euclidean TSP. 500 experiments. Left: Softassign. Middle: Softmax (serial update). Right: Softmax (synchronous update). \n\n3.2 Graph Partitioning Experiments \n\nFigure 3 shows the results of the graph partitioning experiments. 2000 different randomly generated 100 node graphs with 10% connectivity were used as input. These graphs were partitioned into four bins. The softassign performs better than the Potts glass algorithms; however, here the difference is more modest than in the TSP experiments. Still, the serial Potts glass algorithm again ran about 5 times slower than the softassign version, and as noted previously the serial version is not massively parallelizable. The synchronous Potts glass ran about 2 times slower. Also again note that the softassign algorithm was much simpler to implement, with fewer parameters to tune. \n\n[Figure 3: histograms of cutsizes for the three algorithms.] \n\nFigure 3: 100 node Graph Partitioning, 4 bins. 2000 experiments. Left: Softassign. Middle: Softmax (serial update). Right: Softmax (synchronous update). \n\nA relatively simple version of graph partitioning was run. It is likely that as the number of bins is increased, the results on graph partitioning will come to resemble the TSP results more closely, since when the number of bins equals the number of nodes, the TSP can be considered a special case of graph partitioning (with some additional restrictions). However, even in this simple case the softassign has clear advantages over the softmax and penalty term. \n\n4 Conclusion \n\nFor the first time, two classic combinatorial optimization problems, TSP and graph partitioning, are solved using a new technique for constraint satisfaction, the softassign. The softassign, which has recently emerged from the statistical physics/neural networks framework, enforces a two-way (assignment) constraint without penalty terms in the energy function. We also show that the softassign can be generalized from two-way winner-take-all constraints to multiple membership constraints, which are required for graph partitioning. Benchmarks against the Potts glass methods, using softmax and a penalty term, clearly demonstrate its advantages in terms of accuracy, speed, parallelizability, and simplicity of implementation. Within the neural network/statistical physics framework, softassign should be considered the technique of choice for enforcing two-way constraints in energy functions. \n\nReferences \n\n[Gold, 1995] Gold, S. (1995). Matching and Learning Structural and Spatial Representations with Neural Networks. PhD thesis, Yale University. \n\n[Gold et al., 1995] Gold, S., Lu, C. P., Rangarajan, A., Pappu, S., and Mjolsness, E. (1995). 
New algorithms for 2-D and 3-D point matching: pose estimation and correspondence. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 957-964. MIT Press, Cambridge, MA. \n\n[Gold et al., 1994] Gold, S., Mjolsness, E., and Rangarajan, A. (1994). Clustering with a domain specific distance measure. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 96-103. Morgan Kaufmann, San Francisco, CA. \n\n[Gold and Rangarajan, 1996] Gold, S. and Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in press). \n\n[Gold et al., 1996] Gold, S., Rangarajan, A., and Mjolsness, E. (1996). Learning with preknowledge: clustering with point and graph matching distance measures. Neural Computation, (in press). \n\n[Hopfield and Tank, 1985] Hopfield, J. J. and Tank, D. (1985). 'Neural' computation of decisions in optimization problems. Biological Cybernetics, 52:141-152. \n\n[Kosowsky and Yuille, 1994] Kosowsky, J. J. and Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3):477-490. \n\n[Lawler et al., 1985] Lawler, E. L., Lenstra, J. K., Kan, A. H. G. R., and Shmoys, D. B., editors (1985). The Traveling Salesman Problem. John Wiley and Sons, Chichester. \n\n[Peterson and Soderberg, 1989] Peterson, C. and Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. Intl. Journal of Neural Systems, 1(1):3-22. \n\n[Rumelhart and McClelland, 1986] Rumelhart, D. and McClelland, J. L. (1986). Parallel Distributed Processing, volume 1. MIT Press, Cambridge, MA. \n\n[Sinkhorn, 1964] Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. 
Ann. Math. Statist., 35:876-879. \n\n[Van den Bout and Miller III, 1990] Van den Bout, D. E. and Miller III, T. K. (1990). Graph partitioning using annealed networks. IEEE Trans. Neural Networks, 1(2):192-203. \n\n[Yuille and Kosowsky, 1994] Yuille, A. L. and Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6(3):341-356. \n", "award": [], "sourceid": 1088, "authors": [{"given_name": "Steven", "family_name": "Gold", "institution": null}, {"given_name": "Anand", "family_name": "Rangarajan", "institution": null}]}