{"title": "Clustering with a Domain-Specific Distance Measure", "book": "Advances in Neural Information Processing Systems", "page_first": 96, "page_last": 103, "abstract": null, "full_text": "Clustering with a  Domain-Specific \n\nDistance Measure \n\nSteven Gold, Eric Mjolsness and Anand Rangarajan \n\nDepartment of Computer Science \n\nYale  University \n\nNew  Haven,  CT 06520-8285 \n\nAbstract \n\nWith  a  point matching distance  measure which  is  invariant under \ntranslation,  rotation  and  permutation,  we  learn  2-D  point-set  ob(cid:173)\njects,  by clustering noisy point-set images.  Unlike  traditional clus(cid:173)\ntering methods which use distance measures that operate on feature \nvectors - a representation common to most problem domains - this \nobject-based  clustering technique employs a  distance measure spe(cid:173)\ncific  to  a  type  of object  within  a  problem  domain.  Formulating \nthe clustering problem as  two  nested objective functions,  we derive \noptimization  dynamics  similar  to  the  Expectation-Maximization \nalgorithm used  in mixture models. \n\n1 \n\nIntroduction \n\nClustering and related unsupervised  learning techniques  such  as  competitive learn(cid:173)\ning and self-organizing maps have  traditionally relied  on measures  of distance,  like \nEuclidean or Mahalanobis distance,  which are generic across most problem domains. \nConsequently,  when  working  in  complex domains like  vision,  extensive  preprocess(cid:173)\ning is required to produce feature sets which reflect properties critical to the domain, \nsuch  as  invariance  to  translation  and  rotation.  Not  only  does  such  preprocessing \nincrease  the  architectural  complexity of these  systems  but  it  may fail  to  preserve \nsome  properties  inherent  in  the  domain.  For  example in  vision,  while  Fourier  de(cid:173)\ncomposition may be adequate to handle reconstructions  invariant under translation \nand  rotation,  it  is  unlikely  that  distortion  invariance  will  be  as  amenable  to  this \ntechnique  (von der  Malsburg,  1988). \n\n96 \n\n\fClustering with a Domain-Specific Distance Measure \n\n97 \n\nThese  problems  may  be  avoided  with  the  help  of more  powerful,  domain-specific \ndistance  measures,  including some  which  have  been  applied  successfully  to  visual \nrecognition  tasks  (Simard,  Le  Cun,  and  Denker,  1993;  Huttenlocher  et  ai.,  1993). \nSuch  measures  can  contain  domain  critical  properties;  for  example,  the  distance \nmeasure  used  here  to  cluster  2-D  point images is  invariant under translation, rota(cid:173)\ntion and labeling permutation.  Moreover,  new  distance measures may constructed, \nas  this  was,  using  Bayesian  inference  on  a  model of the  visual  domain given  by  a \nprobabilistic  grammar  (Mjolsness,  1992).  Distortion  invariant or  graph  matching \nmeasures,  so  formulated,  can  then  be  applied  to other  domains which  may not  be \namenable to description  in terms of features. \n\nObjective functions  can  describe  the distance  measures  constructed  from  a proba(cid:173)\nbilistic grammar, as  well  as learning problems that use  them.  The clustering prob(cid:173)\nlem in the present  paper is  formulated as  two nested  objective functions:  the inner \nobjective  computes  the  distance  measures  and  the  outer  objective  computes  the \ncluster  centers  and cluster memberships.  
2 Theory

2.1 The Distance Measure

Our distance measure quantifies the degree of similarity between two unlabeled 2-D point images, irrespective of their position and orientation. It is calculated with an objective that can be used in an image registration problem. Given two sets of points {X_j} and {Y_k}, one can minimize the following objective to find the translation, rotation and permutation which best maps Y onto X:

$$E_{reg}(m, t, \theta) = \sum_{jk} m_{jk} \| X_j - t - R(\theta) \cdot Y_k \|^2$$

with constraints: $\forall j \ \sum_k m_{jk} = 1$, $\forall k \ \sum_j m_{jk} = 1$.

Such a registration permits the matching of two sparse feature images in the presence of noise (Lu and Mjolsness, 1994). In the above objective, m is a permutation matrix which matches one point in one image with a corresponding point in the other image. The constraints on m ensure that each point in each image corresponds to one and only one point in the other image (though note later remarks regarding fuzziness). Then, given two sets of points {X_j} and {Y_k}, the distance between them is defined as:

$$D(\{X_j\}, \{Y_k\}) = \min_{m, t, \theta} \big( E_{reg}(m, t, \theta) \mid \text{constraints on } m \big) \qquad (1)$$

This measure is an example of a more general image distance measure derived in (Mjolsness, 1992):

$$d(x, y) = \min_T d(x, T(y)) \in [0, \infty)$$

where T is a set of transformation parameters introduced by a visual grammar. In (1) translation, rotation and permutation are the transformations; however, scaling or distortion could also have been included, with consequent changes in the objective function.

The constraints are enforced by applying the Potts glass mean field theory approximations (Peterson and Soderberg, 1989) and then using an equivalent form of the resulting objective, which employs Lagrange multipliers and an x log x barrier function (as in Yuille and Kosowsky, 1992):

$$E_{reg}(m, t, \theta) = \sum_{jk} m_{jk} \| X_j - t - R(\theta) \cdot Y_k \|^2 + \frac{1}{\beta_m} \sum_{jk} m_{jk} (\log m_{jk} - 1) + \sum_j \mu_j \Big( \sum_k m_{jk} - 1 \Big) + \sum_k \nu_k \Big( \sum_j m_{jk} - 1 \Big) \qquad (2)$$

In this objective we are looking for a saddle point. (2) is minimized with respect to m, t, and θ, which are the correspondence matrix, translation, and rotation, and is maximized with respect to μ and ν, the Lagrange multipliers that enforce the row and column constraints for m.
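Definition (1) can be checked directly on small point sets. The sketch below is an editorial illustration, not the algorithm of Section 3: it evaluates D({X_j}, {Y_k}) by exhaustive search over permutations, solving the optimal translation and rotation in closed form (2-D Procrustes alignment without scaling) for each candidate correspondence. The function name and array conventions are our own.

    import itertools
    import numpy as np

    def registration_distance(X, Y):
        """Brute-force evaluation of Eq. (1) for small point sets.

        X, Y: (n, 2) arrays of 2-D points with the same n.  For every
        permutation the optimal t and R(theta) are found in closed form and
        the minimum of E_reg over all permutations is returned.
        """
        n = len(X)
        best = np.inf
        for perm in itertools.permutations(range(n)):
            Yp = Y[list(perm)]
            xc = X - X.mean(axis=0)                      # center both point sets
            yc = Yp - Yp.mean(axis=0)
            A = np.sum(xc[:, 0] * yc[:, 0] + xc[:, 1] * yc[:, 1])
            B = np.sum(xc[:, 1] * yc[:, 0] - xc[:, 0] * yc[:, 1])
            theta = np.arctan2(B, A)                     # optimal rotation angle
            c, s = np.cos(theta), np.sin(theta)
            R = np.array([[c, -s], [s, c]])
            t = X.mean(axis=0) - Yp.mean(axis=0) @ R.T   # optimal translation
            best = min(best, float(np.sum((X - t - Yp @ R.T) ** 2)))
        return best

For two point sets that differ only by a relabeling, a rotation and a translation, the returned value is zero up to rounding; the mean-field procedure of Section 3 replaces this exhaustive search when the point counts make it impractical.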
2.2 The Clustering Objective

The learning problem is formulated as follows: Given a set of I images, {X_i}, with each image consisting of J points, find a set of A cluster centers {Y_a} and match variables {M_ia} defined as

$$M_{ia} = \begin{cases} 1 & \text{if } X_i \text{ is in } Y_a\text{'s cluster} \\ 0 & \text{otherwise,} \end{cases}$$

such that each image is in only one cluster, and the total distance of all the images from their respective cluster centers is minimized. To find {Y_a} and {M_ia}, minimize the cost function

$$E_{cluster}(Y, M) = \sum_{ia} M_{ia} D(X_i, Y_a)$$

with the constraint that $\forall i \ \sum_a M_{ia} = 1$. D(X_i, Y_a), the distance function, is defined by (1).

The constraints on M are enforced in a manner similar to that described for the distance measure, except that now only the rows of the matrix M need to add to one, instead of both the rows and the columns. The Potts glass mean field theory method is applied and an equivalent form of the resulting objective is used:

$$E_{cluster}(Y, M) = \sum_{ia} M_{ia} D(X_i, Y_a) + \frac{1}{\beta_M} \sum_{ia} M_{ia} (\log M_{ia} - 1) + \sum_i \lambda_i \Big( \sum_a M_{ia} - 1 \Big) \qquad (3)$$

Replacing the distance measure by (2), we derive:

$$E_{cluster}(Y, M, t, \theta, m) = \sum_{ia} M_{ia} \sum_{jk} m_{iajk} \| X_{ij} - t_{ia} - R(\theta_{ia}) \cdot Y_{ak} \|^2 + \sum_{ia} \Big[ \frac{1}{\beta_m} \sum_{jk} m_{iajk} (\log m_{iajk} - 1) + \sum_j \mu_{iaj} \Big( \sum_k m_{iajk} - 1 \Big) + \sum_k \nu_{iak} \Big( \sum_j m_{iajk} - 1 \Big) \Big] + \frac{1}{\beta_M} \sum_{ia} M_{ia} (\log M_{ia} - 1) + \sum_i \lambda_i \Big( \sum_a M_{ia} - 1 \Big)$$

A saddle point is required. The objective is minimized with respect to Y, M, m, t, θ, which are respectively the cluster centers, the cluster membership matrix, the correspondence matrices, the translations, and the rotations. It is maximized with respect to λ, which enforces the row constraint for M, and μ and ν, which enforce the column and row constraints for m. M is a cluster membership matrix indicating, for each image i, which cluster a it falls within, and m_ia is a permutation matrix which assigns to each point in cluster center Y_a a corresponding point in image X_i. θ_ia gives the rotation between image i and cluster center a. Both M and m are fuzzy, so a given image may partially fall within several clusters, with the degree of fuzziness depending upon β_m and β_M.

Therefore, given a set of images, X, we construct E_cluster and, upon finding the appropriate saddle point of that objective, we will have Y, their cluster centers, and M, their cluster memberships.
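For concreteness, the following sketch evaluates the data and entropy-barrier terms of the expanded objective for given values of all variables; the Lagrange-multiplier terms are omitted on the assumption that the constraints already hold, and the array-shape conventions and the helper rot are our own rather than the paper's.

    import numpy as np

    def rot(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])

    def clustering_objective(X, Y, M, m, t, theta, beta_m, beta_M):
        """Data + barrier terms of the expanded clustering objective.

        Assumed shapes: X (I, J, 2) images, Y (A, K, 2) cluster centers,
        M (I, A) fuzzy memberships, m (I, A, J, K) fuzzy correspondences,
        t (I, A, 2) translations, theta (I, A) rotations.
        """
        eps = 1e-12
        E = 0.0
        I, A = M.shape
        for i in range(I):
            for a in range(A):
                resid = X[i][:, None, :] - t[i, a] - Y[a][None, :, :] @ rot(theta[i, a]).T
                dist2 = np.sum(resid ** 2, axis=-1)        # (J, K) squared residuals
                E += M[i, a] * np.sum(m[i, a] * dist2)     # data term
                E += (1.0 / beta_m) * np.sum(m[i, a] * (np.log(m[i, a] + eps) - 1.0))
        E += (1.0 / beta_M) * np.sum(M * (np.log(M + eps) - 1.0))
        return E

As β_m and β_M grow, the barrier terms shrink and the saddle-point values of m and M are pushed toward hard assignments, which is the deterministic annealing behaviour described in the next section.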
3 The Algorithm

3.1 Overview - A Clocked Objective Function

The algorithm to minimize the above objective consists of two loops - an inner loop to minimize the distance measure objective (2) and an outer loop to minimize the clustering objective (3). Using coordinate descent in the outer loop results in dynamics similar to the EM algorithm for clustering (Hathaway, 1986). (The EM algorithm has been similarly used in supervised learning [Jordan and Jacobs, 1993].) All variables occurring in the distance measure objective are held fixed during this phase. The inner loop uses coordinate ascent/descent, which results in repeated row and column projections for m. The minimization of m, t and θ occurs in an incremental fashion, that is, their values are saved after each inner loop call from within the outer loop and are then used as initial values for the next call to the inner loop. This tracking of the values of m, t, and θ in the inner loop is essential to the efficiency of the algorithm, since it greatly speeds up each inner loop optimization. Each coordinate ascent/descent phase can be computed analytically, further speeding up the algorithm. Local minima are avoided by deterministic annealing in both the outer and inner loops.

The resulting dynamics can be concisely expressed by formulating the objective as a clocked objective function, which is optimized over distinct sets of variables in phases,

$$E_{clocked} = E_{cluster} \Big( \big( ((\mu, m)^A, (\nu, m)^A)_\oplus, \theta^A, t^A \big)_\oplus, (\lambda, M)^A, Y^A \Big)_\oplus$$

with this special notation employed recursively:

$(x, y)_\oplus$ : coordinate descent on x, then y, iterated (if necessary)
$x^A$ : use analytic solution for x phase

The algorithm can be expressed less concisely in English, as follows:

Initialize t, θ to zero, Y to random values
Begin Outer Loop
  Begin Inner Loop
    Initialize t, θ with previous values
    Find m, t, θ for each ia pair:
      Find m by softmax, projecting across j, then k, iteratively
      Find θ by coordinate descent
      Find t by coordinate descent
  End Inner Loop
  If first time through outer loop, increase β_m and repeat inner loop
  Find M, Y using fixed values of m, t, θ determined in inner loop:
    Find M by softmax, across a
    Find Y by coordinate descent
  Increase β_M, β_m
End Outer Loop

When the distances are calculated for all the X-Y pairs the first time through the outer loop, annealing is needed to minimize the objectives accurately. However, on each succeeding iteration, since good initial estimates are available for t and θ (the values from the previous iteration of the outer loop), annealing is unnecessary and the minimization is much faster.

The speed of the above algorithm is increased by not recalculating the X-Y distance for a given ia pair when its M_ia membership variable drops below a threshold.

3.2 Inner Loop

The inner loop proceeds in three phases. In phase one, while t and θ are held fixed, m is initialized with the softmax function and then iteratively projected across its rows and columns until the procedure converges. In phases two and three, t and θ are updated using coordinate descent. Then β_m is increased and the loop repeats.

In phase one m is updated with softmax:

$$m_{iajk} = \frac{\exp(-\beta_m \| X_{ij} - t_{ia} - R(\theta_{ia}) \cdot Y_{ak} \|^2)}{\sum_{k'} \exp(-\beta_m \| X_{ij} - t_{ia} - R(\theta_{ia}) \cdot Y_{ak'} \|^2)}$$

Then m is iteratively normalized across j and k until $\sum_{jk} |\Delta m_{iajk}| < \epsilon$:

$$m_{iajk} = \frac{m_{iajk}}{\sum_{j'} m_{iaj'k}}$$

Using coordinate descent, θ is then calculated in phase two, and t in phase three, from the closed-form solutions of the corresponding stationarity conditions of (2). Finally β_m is increased and the loop repeats.

By setting the partial derivatives of (2) to zero and initializing μ and ν to zero, the algorithm for phase one may be derived. Phases two and three may be derived by taking the partial derivative of (2) with respect to θ, setting it to zero, solving for θ, and then solving for the fixed point of the vector (t_1, t_2).

Beginning with a small β_m allows minimization over a fuzzy correspondence matrix m, for which a global minimum is easier to find. Raising β_m drives the m's closer to 0 or 1, as the algorithm approaches a saddle point.
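A compact rendering of one inner-loop sweep for a single (i, a) pair is sketched below. The softmax and row/column projections follow the phase-one description above; the closed-form updates for θ and t are the standard weighted least-squares solutions obtained by setting the partial derivatives of (2) to zero, and are our reconstruction rather than a transcription of the paper's displayed equations. A fixed number of projection sweeps stands in for the convergence test on m.

    import numpy as np

    def inner_loop_sweep(Xi, Ya, t, theta, beta_m, n_projections=30):
        """One inner-loop sweep for a single (i, a) pair (a sketch).

        Xi: (J, 2) image points;  Ya: (K, 2) cluster-center points.
        Returns the updated correspondence matrix m, translation t and
        rotation theta at the current value of beta_m.
        """
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])

        # Phase one: softmax across k, then alternate column / row normalization.
        resid = Xi[:, None, :] - t - Ya[None, :, :] @ R.T            # (J, K, 2)
        m = np.exp(-beta_m * np.sum(resid ** 2, axis=-1))
        m /= m.sum(axis=1, keepdims=True)                            # softmax across k
        for _ in range(n_projections):
            m /= m.sum(axis=0, keepdims=True)                        # project across j
            m /= m.sum(axis=1, keepdims=True)                        # project across k

        # Phase two: rotation maximizing the m-weighted alignment of (Xi - t) with R(theta) Ya.
        xc = Xi - t                                                   # (J, 2)
        A = np.sum(m * (xc[:, None, 0] * Ya[None, :, 0] + xc[:, None, 1] * Ya[None, :, 1]))
        B = np.sum(m * (xc[:, None, 1] * Ya[None, :, 0] - xc[:, None, 0] * Ya[None, :, 1]))
        theta = np.arctan2(B, A)
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])

        # Phase three: translation is the m-weighted mean of (X_ij - R(theta) Y_ak).
        diff = Xi[:, None, :] - Ya[None, :, :] @ R.T                  # (J, K, 2)
        t = np.sum(m[..., None] * diff, axis=(0, 1)) / m.sum()
        return m, t, theta

In the full algorithm this sweep is repeated while β_m is slowly increased, and the resulting m, t, θ are retained as the starting point for the next outer-loop iteration, as described above.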
3.3 Outer Loop

The outer loop also proceeds in three phases: (1) distances are calculated by calling the inner loop, (2) M is projected across a using the softmax function, (3) coordinate descent is used to update Y.

Therefore, using softmax, M is updated in phase two:

$$M_{ia} = \frac{\exp\big(-\beta_M \sum_{jk} m_{iajk} \| X_{ij} - t_{ia} - R(\theta_{ia}) \cdot Y_{ak} \|^2\big)}{\sum_{a'} \exp\big(-\beta_M \sum_{jk} m_{ia'jk} \| X_{ij} - t_{ia'} - R(\theta_{ia'}) \cdot Y_{a'k} \|^2\big)}$$

Y, in phase three, is calculated using coordinate descent:

$$Y_{ak1} = \frac{\sum_i M_{ia} \sum_j m_{iajk} \big( \cos\theta_{ia} (X_{ij1} - t_{ia1}) + \sin\theta_{ia} (X_{ij2} - t_{ia2}) \big)}{\sum_i M_{ia} \sum_j m_{iajk}}$$

$$Y_{ak2} = \frac{\sum_i M_{ia} \sum_j m_{iajk} \big( -\sin\theta_{ia} (X_{ij1} - t_{ia1}) + \cos\theta_{ia} (X_{ij2} - t_{ia2}) \big)}{\sum_i M_{ia} \sum_j m_{iajk}}$$

Then β_M is increased and the loop repeats.
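Phases two and three of the outer loop can be sketched as follows, using the same assumed array shapes as the earlier sketches; the subtraction of the row minimum inside the softmax is only a numerical-stability detail of this illustration, not part of the paper's update.

    import numpy as np

    def rot(theta):
        c, s = np.cos(theta), np.sin(theta)
        return np.array([[c, -s], [s, c]])

    def outer_loop_update(X, Y, m, t, theta, beta_M):
        """Softmax update of M across a, then coordinate-descent update of Y.

        Assumed shapes: X (I, J, 2), Y (A, K, 2), m (I, A, J, K),
        t (I, A, 2), theta (I, A).
        """
        I, A = theta.shape
        # D_ia = sum_jk m_iajk ||X_ij - t_ia - R(theta_ia) Y_ak||^2
        D = np.zeros((I, A))
        for i in range(I):
            for a in range(A):
                resid = X[i][:, None, :] - t[i, a] - Y[a][None, :, :] @ rot(theta[i, a]).T
                D[i, a] = np.sum(m[i, a] * np.sum(resid ** 2, axis=-1))

        # Phase two: fuzzy memberships by softmax across the cluster index a.
        M = np.exp(-beta_M * (D - D.min(axis=1, keepdims=True)))
        M /= M.sum(axis=1, keepdims=True)

        # Phase three: each cluster-center point is the (M, m)-weighted mean of the
        # image points mapped back into the cluster frame by R(theta)^T (X - t).
        Y_new = np.zeros_like(Y)
        for a in range(A):
            num = np.zeros_like(Y[a])                       # (K, 2)
            den = np.zeros(Y.shape[1])                      # (K,)
            for i in range(I):
                back = (X[i] - t[i, a]) @ rot(theta[i, a])  # rows are R^T (X_ij - t_ia)
                num += M[i, a] * np.einsum('jk,jd->kd', m[i, a], back)
                den += M[i, a] * m[i, a].sum(axis=0)
            Y_new[a] = num / den[:, None]
        return M, Y_new

When the correspondences are fixed to the identity and the transformations are zero, the Y update reduces to the ordinary weighted cluster-mean update of EM-style mixture fitting, which is the sense in which these dynamics generalize EM.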
4 Methods and Experimental Results

In two experiments (Figures 1a and 1b), 16 and 100 randomly generated images of 15 and 20 points each are clustered into 4 and 10 clusters, respectively.

A stochastic model, formulated with essentially the same visual grammar used to derive the clustering algorithm (Mjolsness, 1992), generated the experimental data. That model begins with the cluster centers and then applies probabilistic transformations according to the rules laid out in the grammar to produce the images. These transformations are then inverted to recover cluster centers from a starting set of images. Therefore, to test the algorithm, the same transformations are applied to produce a set of images, and then the algorithm is run in order to see if it can recover the set of cluster centers from which the images were produced.

First, n = 10 points are selected using a uniform distribution across a normalized square. For each of the n = 10 points a model prototype (cluster center) is created by generating a set of k = 20 points uniformly distributed across a normalized square centered at each original point. Then, m = 10 new images consisting of k = 20 points each are generated from each model prototype by displacing all k model points by a random global translation, rotating all k points by a random global rotation within a 54° arc, and then adding independent noise to each of the translated and rotated points with a Gaussian distribution of variance σ².
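A sketch of this generative procedure is given below; the translation range and the centre of rotation are not specified in the text, so the particular choices here are illustrative assumptions only.

    import numpy as np

    def generate_data(n=10, k=20, m=10, sigma=0.06, arc_deg=54.0, seed=0):
        """Synthetic data along the lines of Section 4 (a sketch).

        n prototypes of k points uniform in a unit square; m images per
        prototype, each given a random global translation, a random global
        rotation within an arc_deg arc, and i.i.d. Gaussian noise of
        variance sigma**2 on every point.
        """
        rng = np.random.default_rng(seed)
        anchors = rng.uniform(0.0, 1.0, size=(n, 2))
        prototypes = anchors[:, None, :] + rng.uniform(-0.5, 0.5, size=(n, k, 2))

        images, labels = [], []
        for a in range(n):
            centroid = prototypes[a].mean(axis=0)
            for _ in range(m):
                angle = rng.uniform(-0.5, 0.5) * np.deg2rad(arc_deg)
                c, s = np.cos(angle), np.sin(angle)
                R = np.array([[c, -s], [s, c]])
                t = rng.uniform(-0.25, 0.25, size=2)              # assumed range
                noise = rng.normal(0.0, sigma, size=(k, 2))
                pts = (prototypes[a] - centroid) @ R.T + centroid + t + noise
                images.append(pts)
                labels.append(a)
        return prototypes, np.array(images), np.array(labels)

With n = m = 10, k = 20 and σ between .02 and .16 this approximates the setup used for Figure 1b, up to the unstated translation range.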
Figure 1: (a) 16 images, 15 points each; (b) 100 images, 20 points each.

The p = n x m = 100 images so generated are the input to the algorithm. The algorithm, which is initially ignorant of cluster membership information, computes n = 10 cluster centers as well as n x p = 1000 match variables determining the cluster membership of each point image. σ is varied, and for each σ the average distance of the computed cluster centers to the theoretical cluster centers (i.e. the original n = 10 model prototypes) is plotted.

Data (Figure 1a) is generated with 20 random seeds with constants of n = 4, k = 15, m = 4, p = 16, varying σ from .02 to .14 by increments of .02 for each seed. This produces 80 model prototype-computed cluster center distances for each value of σ, which are then averaged and plotted, along with an error bar representing the standard deviation of each set. 15 random seeds (Figure 1b) with constants of n = 10, k = 20, m = 10, p = 100, σ varied from .02 to .16 by increments of .02 for each seed, produce 150 model prototype-computed cluster center distances for each value of σ. The straight line plotted on each graph shows the expected model prototype-cluster center distances, b = kσ/√n, which would be obtained if there were no translation or rotation for each generated image, and if the cluster memberships were known. It can be considered a lower bound for the reconstruction performance of our algorithm. Figures 1a and 1b together summarize the results of 280 separate clustering experiments.

For each set of images the algorithm was run four times, varying the initial randomly selected starting cluster centers each time and then selecting the run with the lowest energy for the results. The annealing rate for β_M and β_m was a constant factor of 1.031. Each run of the algorithm averaged ten minutes on an SGI Indigo workstation for the 16-image test, and four hours for the 100-image test. The running time of the algorithm is O(pnk²). Parallelization, as well as hierarchical and attentional mechanisms, all currently under investigation, can reduce these times.

5 Summary

By incorporating a domain-specific distance measure instead of the typical generic distance measures, the new method of unsupervised learning substantially reduces the amount of ad hoc preprocessing required in conventional techniques. Critical features of a domain (such as invariance under translation, rotation, and permutation) are captured within the clustering procedure, rather than reflected in the properties of feature sets created prior to clustering. The distance measure and learning problem are formally described as nested objective functions. We derive an efficient algorithm by using optimization techniques that allow us to divide up the objective function into parts which may be minimized in distinct phases. The algorithm has accurately recreated 10 prototypes from a randomly generated sample database of 100 images consisting of 20 points each in 120 experiments. Finally, by incorporating permutation invariance in our distance measure, we have a technique that we may be able to apply to the clustering of graphs. Our goal is to develop measures which will enable the learning of objects with shape or structure.

Acknowledgements

This work has been supported by AFOSR grant F49620-92-J-0465 and ONR/DARPA grant N00014-92-J-4048.

References

R. Hathaway. (1986) Another interpretation of the EM algorithm for mixture distributions. Statistics and Probability Letters 4:53-56.

D. Huttenlocher, G. Klanderman and W. Rucklidge. (1993) Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9):850-863.

A. L. Yuille and J. J. Kosowsky. (1992) Statistical physics algorithms that converge. Technical Report 92-7, Harvard Robotics Laboratory.

M. I. Jordan and R. A. Jacobs. (1993) Hierarchical mixtures of experts and the EM algorithm. Technical Report 9301, MIT Computational Cognitive Science.

C. P. Lu and E. Mjolsness. (1994) Two-dimensional object localization by coarse-to-fine correlation matching. In this volume, NIPS 6.

C. von der Malsburg. (1988) Pattern recognition by labeled graph matching. Neural Networks 1:141-148.

E. Mjolsness and W. Miranker. (1993) Greedy Lagrangians for neural networks: three levels of optimization in relaxation dynamics. Technical Report 945, Yale University, Department of Computer Science.

E. Mjolsness. (1992) Visual grammars and their neural networks. SPIE Conference on the Science of Artificial Neural Networks, 1710:63-85.

C. Peterson and B. Soderberg. (1989) A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems 1(1):3-22.

P. Simard, Y. Le Cun, and J. Denker. (1993) Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, and C. Giles (eds.), NIPS 5. Morgan Kaufmann, San Mateo, CA.
", "award": [], "sourceid": 838, "authors": [{"given_name": "Steven", "family_name": "Gold", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}, {"given_name": "Anand", "family_name": "Rangarajan", "institution": null}]}