{"title": "Softassign versus Softmax: Benchmarks in Combinatorial Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 626, "page_last": 632, "abstract": null, "full_text": "Softassign versus Softmax:  Benchmarks \n\nin Combinatorial Optimization \n\nSteven Gold \n\nYale University \n\nAnand Rangarajan \n\nYale University \n\nDepartment of Computer Science \n\nDept.  of Diagnostic Radiology \n\nNew  Haven, CT 06520-8285 \n\nNew  Haven,  CT 06520-8042 \n\nAbstract \n\nA  new  technique,  termed  soft assign,  is  applied  for  the  first  time \nto  two  classic  combinatorial  optimization  problems,  the  travel(cid:173)\ning  salesman  problem  and  graph  partitioning.  Soft assign ,  which \nhas emerged from  the recurrent  neural  network/statistical physics \nframework, enforces  two-way  (assignment) constraints without the \nuse  of penalty  terms  in  the  energy  functions.  The  soft assign  can \nalso  be  generalized  from  two-way  winner-take-all  constraints  to \nmultiple membership constraints which are required for graph par(cid:173)\ntitioning.  The  soft assign  technique  is  compared  to  the  softmax \n(Potts  glass).  Within  the  statistical  physics  framework,  softmax \nand a penalty term has been a widely used method for enforcing the \ntwo-way constraints common within many combinatorial optimiza(cid:173)\ntion  problems.  The  benchmarks  present  evidence  that  soft assign \nhas clear advantages in accuracy, speed,  parallelizabilityand algo(cid:173)\nrithmic simplicity over softmax and a penalty term in optimization \nproblems with two-way constraints. \n\n1 \n\nIntroduction \n\nIn  a  series  of papers  in  the  early  to  mid  1980's,  Hopfield  and  Tank  introduced \ntechniques  which  allowed  one  to  solve  combinatorial  optimization  problems  with \nrecurrent  neural  networks  [Hopfield  and Tank, 1985].  
As researchers attempted to reproduce the original traveling salesman problem results of Hopfield and Tank, problems emerged, especially in terms of the quality of the solutions obtained. More recently, however, a number of techniques from statistical physics have been adopted to mitigate these problems. These include deterministic annealing, which convexifies the energy function in order to help avoid some local minima, and the Potts glass approximation, which results in a hard enforcement of a one-way (one set of) winner-take-all (WTA) constraint via the softmax. In the late 80's, armed with these techniques, optimization problems like the traveling salesman problem (TSP) [Peterson and Soderberg, 1989] and graph partitioning [Peterson and Soderberg, 1989, Van den Bout and Miller III, 1990] were reexamined and much better results were obtained compared to the original Hopfield-Tank dynamics. \nHowever, when the problem calls for two-way interlocking WTA constraints, as do TSP and graph partitioning, the resulting energy function must still include a penalty term when the softmax is employed, in order to enforce the second set of WTA constraints. Such penalty terms may introduce spurious local minima in the energy function and involve free parameters which are hard to set. A new technique, termed softassign, eliminates the need for all such penalty terms. The first use of the softassign was in an algorithm for the assignment problem [Kosowsky and Yuille, 1994]. 
It has since been applied to much more difficult optimization problems, including parametric assignment problems - point matching [Gold et al., 1994, Gold et al., 1995, Gold et al., 1996] - and quadratic assignment problems - graph matching [Gold et al., 1996, Gold and Rangarajan, 1996, Gold, 1995]. \nHere, we for the first time apply the softassign to two classic combinatorial optimization problems, TSP and graph partitioning. Moreover, we show that the softassign can be generalized from two-way winner-take-all constraints to multiple membership constraints, which are required for graph partitioning (as described below). We then run benchmarks against the older softmax (Potts glass) methods and demonstrate advantages in terms of accuracy, speed, parallelizability, and simplicity of implementation. \nIt must be emphasized that there are other conventional techniques for solving some combinatorial optimization problems, such as TSP, which remain superior to this method in certain ways [Lawler et al., 1985]. (We think that for some problems - specifically the type of pattern matching problems essential for cognition [Gold, 1995] - this technique is superior to conventional methods.) Even within neural networks, elastic net methods may still be better in certain cases. However, the elastic net uses only a one-way constraint in TSP. The main goal of this paper is to provide evidence that, when minimizing energy functions with two-way constraints within the neural network framework, the softassign should be the technique of choice. We therefore compare it to the current dominant technique, softmax with a penalty term. \n\n2 Optimizing With Softassign \n\n2.1 The Traveling Salesman Problem \n\nThe traveling salesman problem may be defined in the following way. Given a set of intercity distances {h_ab}, which may take values in R+, find the permutation matrix M such that the following objective function is minimized: \n\nE_1(M) = \\frac{1}{2} \\sum_{a=1}^{N} \\sum_{b=1}^{N} \\sum_{i=1}^{N} h_{ab} M_{ai} M_{b(i \\oplus 1)}   (1) \n\nsubject to \\forall a \\; \\sum_{i=1}^{N} M_{ai} = 1, \\forall i \\; \\sum_{a=1}^{N} M_{ai} = 1, \\forall ai \\; M_{ai} \\in \\{0, 1\\}. \nIn the above objective, h_ab represents the distance between cities a and b. M is a permutation matrix whose rows represent cities and whose columns represent the day (or order) on which the city was visited, and N is the number of cities. (The notation i \\oplus 1 is used to indicate that subscripts are defined modulo N, i.e. M_{a(N+1)} = M_{a1}.) So M_ai = 1 indicates that city a was visited on day i. \nThen, following [Peterson and Soderberg, 1989, Yuille and Kosowsky, 1994], we employ Lagrange multipliers and an x log x barrier function to enforce the constraints, as well as a \\gamma term for stability, resulting in the following objective: \n\nE_2(M, \\mu, \\nu) = \\frac{1}{2} \\sum_{a=1}^{N} \\sum_{b=1}^{N} \\sum_{i=1}^{N} h_{ab} M_{ai} M_{b(i \\oplus 1)} - \\frac{\\gamma}{2} \\sum_{a=1}^{N} \\sum_{i=1}^{N} M_{ai}^2 + \\frac{1}{\\beta} \\sum_{a=1}^{N} \\sum_{i=1}^{N} M_{ai} (\\log M_{ai} - 1) + \\sum_{a=1}^{N} \\mu_a (\\sum_{i=1}^{N} M_{ai} - 1) + \\sum_{i=1}^{N} \\nu_i (\\sum_{a=1}^{N} M_{ai} - 1)   (2) \n\nIn the above we are looking for a saddle point by minimizing with respect to M and maximizing with respect to \\mu and \\nu, the Lagrange multipliers. \n\n2.2 The Softassign \n\nIn the above formulation of TSP we have two-way interlocking WTA constraints. {M_ai} must be a permutation matrix to ensure that a valid tour - one in which each city is visited once and only once - is described. 
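As a concrete illustration (ours, not code from the paper), objective (1) is straightforward to evaluate for a candidate permutation matrix; a minimal Python sketch, assuming a symmetric distance matrix h:

```python
import numpy as np

def tour_energy(h, M):
    """E1(M) = 0.5 * sum_{a,b,i} h[a,b] * M[a,i] * M[b,(i+1) mod N].

    h: (N, N) intercity distance matrix.
    M: (N, N) permutation matrix; M[a, i] = 1 iff city a is visited on day i.
    """
    # Column i of M_next holds day (i+1) mod N, implementing the i (+) 1 subscript.
    M_next = np.roll(M, -1, axis=1)
    return 0.5 * np.einsum('ab,ai,bi->', h, M, M_next)

# Tiny example: 3 cities visited in the order 0 -> 1 -> 2 -> 0.
h = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
M = np.eye(3)                 # city a visited on day a
print(tour_energy(h, M))      # 2.0, i.e. half the tour length 1 + 1 + 2 = 4
```

Note that with only the successor term, E1 equals half the tour length for a symmetric h; the 1/2 prefactor matches the convention of (1).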
A permutation matrix means all the rows and columns must add to one (and the elements must be zero or one) and therefore requires two-way WTA constraints - a set of WTA constraints on the rows and a set of WTA constraints on the columns. This set of two-way constraints may also be considered assignment constraints, since each city must be assigned to one and only one day (the row constraint) and each day must be assigned to one and only one city (the column constraint). \n\nThese assignment constraints can be satisfied using a result from [Sinkhorn, 1964], where it is proven that any square matrix whose elements are all positive will converge to a doubly stochastic matrix just by the iterative process of alternately normalizing the rows and columns. (A doubly stochastic matrix is a matrix whose elements are all positive and whose rows and columns all add up to one - it may roughly be thought of as the continuous analog of a permutation matrix.) \nThe softassign simply employs Sinkhorn's technique within a deterministic annealing context. Figure 1 depicts the contrast between the softassign and the softmax. In the softmax, a one-way WTA constraint is strictly enforced by normalizing over a vector. \n[Kosowsky and Yuille, 1994] used the softassign to solve the assignment problem, i.e. minimize -\\sum_{a=1}^{N} \\sum_{i=1}^{N} M_{ai} Q_{ai}. For the special case of the quadratic assignment problem being solved here, by setting Q_{ai} = -\\partial \\hat{E}_2 / \\partial M_{ai} and using the values of M from the previous iteration, we can at each iteration produce a new assignment problem, for which the softassign then returns a doubly stochastic matrix. 
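Sinkhorn's iteration is easy to state concretely; the following minimal Python sketch (our illustration, not code from the paper) alternately normalizes the rows and columns of a positive matrix until it is numerically doubly stochastic:

```python
import numpy as np

def sinkhorn(M, tol=1e-9, max_iter=1000):
    """Alternately normalize rows and columns of a positive matrix.

    By Sinkhorn's theorem the iteration converges to a doubly
    stochastic matrix (all rows and columns sum to one).
    """
    M = np.asarray(M, dtype=float)
    for _ in range(max_iter):
        M = M / M.sum(axis=1, keepdims=True)   # row normalization
        M = M / M.sum(axis=0, keepdims=True)   # column normalization
        # After a column pass the columns sum to 1 exactly; stop once
        # the rows also sum to 1 within tolerance.
        if np.allclose(M.sum(axis=1), 1.0, atol=tol):
            break
    return M

rng = np.random.default_rng(0)
M = sinkhorn(rng.uniform(0.1, 1.0, size=(4, 4)))
print(M.sum(axis=0))   # columns sum to 1
print(M.sum(axis=1))   # rows sum to (approximately) 1
```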
As the temperature is lowered, a series of assignment problems is generated, along with the corresponding doubly stochastic matrices returned by each softassign, until a permutation matrix is reached. \nThe algorithm dynamics then become: \n\nQ_{ai} = -\\frac{\\partial \\hat{E}_2}{\\partial M_{ai}}   (3) \n\nM_{ai} = Softassign_{ai}(Q)   (4) \n\n\\hat{E}_2 is E_2 without the 1/\\beta, \\mu or \\nu terms of (2); therefore no penalty terms are now included. The update with the partial derivative in (3) may be derived using a Taylor series expansion; see [Gold and Rangarajan, 1996, Gold, 1995] for details. The above dynamics are iterated as \\beta, the inverse temperature, is gradually increased. \nThese dynamics may be obtained by evaluating the saddle points of the objective in (2). Sinkhorn's method finds the saddle points for the Lagrange parameters. \n\nFigure 1: Softassign and softmax. This paper compares these two techniques. In the softassign, positivity is enforced by M_{ai} = exp(\\beta Q_{ai}), and the two-way constraints by alternating row normalization, M_{ai} \\leftarrow M_{ai} / \\sum_i M_{ai}, and column normalization, M_{ai} \\leftarrow M_{ai} / \\sum_a M_{ai}. In the softmax, positivity is enforced by M_i = exp(\\beta Q_i), and a one-way constraint by the single normalization M_i \\leftarrow M_i / \\sum_j M_j. \n\n2.3 Graph Partitioning \n\nThe graph partitioning problem may be defined in the following way. Given an unweighted graph G, find the membership matrix M such that the following objective function is minimized: \n\nE_3(M) = -\\sum_{a=1}^{A} \\sum_{i=1}^{I} \\sum_{j=1}^{I} G_{ij} M_{ai} M_{aj}   (5) \n\nsubject to \\forall a \\; \\sum_{i=1}^{I} M_{ai} = I/A, \\forall i \\; \\sum_{a=1}^{A} M_{ai} = 1, \\forall ai \\; M_{ai} \\in \\{0, 1\\}, where graph G has I nodes which should be equally partitioned into A bins. \n{G_ij} is the adjacency matrix of the graph, whose elements must be 0 or 1. 
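Objective (5) can be evaluated directly for a candidate partition; a minimal Python sketch (ours, not the paper's), assuming M is the A x I hard membership matrix with M[a, i] = 1 when node i is in bin a:

```python
import numpy as np

def partition_energy(G, M):
    """E3(M) = -sum_{a,i,j} G[i,j] * M[a,i] * M[a,j].

    G: (I, I) 0/1 adjacency matrix.
    M: (A, I) membership matrix; M[a, i] = 1 iff node i is in bin a.
    Minimizing E3 rewards edges whose endpoints share a bin, so with
    the equal-size bin constraints it minimizes the cutsize.
    """
    return -np.einsum('ij,ai,aj->', G, M, M)

# 4 nodes, two edges (0,1) and (2,3), split into 2 bins of 2 nodes each.
G = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
M = np.array([[1, 1, 0, 0],    # bin 0 holds nodes 0 and 1
              [0, 0, 1, 1]])   # bin 1 holds nodes 2 and 3
print(partition_energy(G, M))  # -4: both edges internal, each counted twice
```

Moving node 1 into bin 1 would cut the (0,1) edge and raise the energy, which is why the minimum of E3 under the membership constraints corresponds to the minimum-cutsize equal partition.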
M is a membership matrix such that M_ai = 1 indicates that node i is in bin a. The permutation matrix constraint present in TSP is modified to the membership constraint: node i is a member of only one bin a, and the number of members in each bin is fixed at I/A. When the above objective is at a minimum, graph G will be partitioned into A equal-sized bins such that the cutsize is minimum over all possible partitionings of G into A equal-sized bins. We assume I/A is an integer. \n\nThen, following the treatment for TSP, we derive the following objective: \n\nE_4(M, \\mu, \\nu) = -\\sum_{a=1}^{A} \\sum_{i=1}^{I} \\sum_{j=1}^{I} G_{ij} M_{ai} M_{aj} - \\frac{\\gamma}{2} \\sum_{a=1}^{A} \\sum_{i=1}^{I} M_{ai}^2 + \\frac{1}{\\beta} \\sum_{a=1}^{A} \\sum_{i=1}^{I} M_{ai} (\\log M_{ai} - 1) + \\sum_{a=1}^{A} \\mu_a (\\sum_{i=1}^{I} M_{ai} - I/A) + \\sum_{i=1}^{I} \\nu_i (\\sum_{a=1}^{A} M_{ai} - 1)   (6) \n\nwhich is minimized with a similar algorithm employing the softassign. Note, however, that now in the softassign the columns are normalized to I/A instead of 1. \n\n3 Experimental Results \n\nExperiments on Euclidean TSP and graph partitioning were conducted. For each problem three different algorithms were run. One used the softassign described above. The second used the Potts glass dynamics employing synchronous update as described in [Peterson and Soderberg, 1989]. The third used the Potts glass dynamics employing serial update as described in [Peterson and Soderberg, 1989]. Originally the intention was to employ just the synchronous updating version of the Potts glass dynamics, since that is the dynamics used in the algorithms employing the softassign and is the method that is massively parallelizable. 
We believe massive parallelism to be such a critical feature of the neural network architecture [Rumelhart and McClelland, 1986] that any algorithm that does not have this feature loses much of the power of the neural network paradigm. Unfortunately, the synchronous updating algorithms worked so poorly that we also ran the serial versions in order to get a more extensive comparison. Note that the results reported in [Peterson and Soderberg, 1989] were all obtained with the serial versions. \n\n3.1 Euclidean TSP Experiments \n\nFigure 2 shows the results of the Euclidean TSP experiments. 500 different 100-city tours from points uniformly generated in the 2D unit square were used as input. The asymptotic expected length of an optimal tour for cities distributed in the unit square is given by L(n) = K\\sqrt{n}, where n is the number of cities and 0.765 \\le K \\le 0.765 + 4/n [Lawler et al., 1985]. This gives the interval [7.65, 8.05] for the 100-city TSP. 95% of the tour lengths fall in the interval [8, 11] when using the softassign approach. Note the large difference in performance between the softassign and the Potts glass algorithms. The serial Potts glass algorithm ran about 5 times slower than the softassign version. Also, as noted previously, the serial version is not massively parallelizable. The synchronous Potts glass ran about 2 times slower. Also note that the softassign algorithm is much simpler to implement, with fewer parameters to tune. \n\n3.2 Graph Partitioning Experiments \n\nFigure 3 shows the results of the graph partitioning experiments. 2000 different randomly generated 100-node graphs with 10% connectivity were used as input. These graphs were partitioned into four bins. 
The softassign performs better than the Potts glass algorithms, although here the difference is more modest than in the TSP experiments. However, the serial Potts glass algorithm again ran about 5 times slower than the softassign version, and as noted previously the serial version is not massively parallelizable. The synchronous Potts glass ran about 2 times slower. \n\nFigure 2: 100 City Euclidean TSP. 500 experiments. Left: Softassign. Middle: Softmax (serial update). Right: Softmax (synchronous update). \n\nAlso, again, note the softassign algorithm was much simpler to implement, with fewer parameters to tune. \n\nFigure 3: 100 node Graph Partitioning, 4 bins. 2000 experiments. Left: Softassign. Middle: Softmax (serial update). Right: Softmax (synchronous update). \n\nA relatively simple version of graph partitioning was run. 
It is likely that as the number of bins is increased, the results on graph partitioning will come to resemble the TSP results more closely, since when the number of bins equals the number of nodes, TSP can be considered a special case of graph partitioning (with some additional restrictions). However, even in this simple case the softassign has clear advantages over the softmax and penalty term. \n\n4 Conclusion \n\nFor the first time, two classic combinatorial optimization problems, TSP and graph partitioning, are solved using a new technique for constraint satisfaction, the softassign. The softassign, which has recently emerged from the statistical physics/neural networks framework, enforces a two-way (assignment) constraint without penalty terms in the energy function. We also show that the softassign can be generalized from two-way winner-take-all constraints to multiple membership constraints, which are required for graph partitioning. Benchmarks against the Potts glass methods, using softmax and a penalty term, clearly demonstrate its advantages in terms of accuracy, speed, parallelizability and simplicity of implementation. Within the neural network/statistical physics framework, the softassign should be considered the technique of choice for enforcing two-way constraints in energy functions. \n\nReferences \n\n[Gold, 1995] Gold, S. (1995). Matching and Learning Structural and Spatial Representations with Neural Networks. PhD thesis, Yale University. \n[Gold et al., 1995] Gold, S., Lu, C. P., Rangarajan, A., Pappu, S., and Mjolsness, E. (1995). New algorithms for 2-D and 3-D point matching: pose estimation and correspondence. In Tesauro, G., Touretzky, D. S., and Leen, T. 
K., editors, Advances in Neural Information Processing Systems 7, pages 957-964. MIT Press, Cambridge, MA. \n[Gold et al., 1994] Gold, S., Mjolsness, E., and Rangarajan, A. (1994). Clustering with a domain specific distance measure. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 96-103. Morgan Kaufmann, San Francisco, CA. \n[Gold and Rangarajan, 1996] Gold, S. and Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, (in press). \n[Gold et al., 1996] Gold, S., Rangarajan, A., and Mjolsness, E. (1996). Learning with preknowledge: clustering with point and graph matching distance measures. Neural Computation, (in press). \n[Hopfield and Tank, 1985] Hopfield, J. J. and Tank, D. (1985). 'Neural' computation of decisions in optimization problems. Biological Cybernetics, 52:141-152. \n[Kosowsky and Yuille, 1994] Kosowsky, J. J. and Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3):477-490. \n[Lawler et al., 1985] Lawler, E. L., Lenstra, J. K., Rinnooy Kan, A. H. G., and Shmoys, D. B., editors (1985). The Traveling Salesman Problem. John Wiley and Sons, Chichester. \n[Peterson and Soderberg, 1989] Peterson, C. and Soderberg, B. (1989). A new method for mapping optimization problems onto neural networks. Intl. Journal of Neural Systems, 1(1):3-22. \n[Rumelhart and McClelland, 1986] Rumelhart, D. and McClelland, J. L. (1986). Parallel Distributed Processing, volume 1. MIT Press, Cambridge, MA. \n[Sinkhorn, 1964] Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. 
Ann. Math. Statist., 35:876-879. \n[Van den Bout and Miller III, 1990] Van den Bout, D. E. and Miller III, T. K. (1990). Graph partitioning using annealed networks. IEEE Trans. Neural Networks, 1(2):192-203. \n[Yuille and Kosowsky, 1994] Yuille, A. L. and Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6(3):341-356. \n", "award": [], "sourceid": 1088, "authors": [{"given_name": "Steven", "family_name": "Gold", "institution": null}, {"given_name": "Anand", "family_name": "Rangarajan", "institution": null}]}