{"title": "A Lagrangian Approach to Fixed Points", "book": "Advances in Neural Information Processing Systems", "page_first": 77, "page_last": 83, "abstract": null, "full_text": "A Lagrangian Approach to Fixed Points\n\nEric Mjolsness\nDepartment of Computer Science\nYale University\nP.O. Box 2158 Yale Station\nNew Haven, CT 06520-2158\n\nWillard L. Miranker\nIBM Watson Research Center\nYorktown Heights, NY 10598\n\nAbstract\n\nWe present a new way to derive dissipative, optimizing dynamics from the Lagrangian formulation of mechanics. It can be used to obtain both standard and novel neural net dynamics for optimization problems. To demonstrate this we derive standard descent dynamics as well as nonstandard variants that introduce a computational attention mechanism.\n\n1 INTRODUCTION\n\nNeural nets are often designed to optimize some objective function $E$ of the current state of the system via a dissipative dynamical system that has a circuit-like implementation. The fixed points of such a system are locally optimal in $E$. In physics the preferred formulation for many dynamical derivations and calculations is by means of an objective function which is an integral over time of a \"Lagrangian\" function, $L$. From Lagrangians one usually derives time-reversible, non-dissipative dynamics which cannot converge to a fixed point, but we present a new way to circumvent this limitation and derive optimizing neural net dynamics from a Lagrangian. We apply the method to derive a general attention mechanism for optimization-based neural nets, and we describe simulations for a graph-matching network. 
\n\n2 LAGRANGIAN FORMULATION OF NEURAL DYNAMICS\n\nOften one must design a network with nontrivial temporal behaviors such as running longer in exchange for less circuitry, or focussing attention on one part of a problem at a time. In this section we transform the original objective function (cf. [Mjolsness and Garrett, 1989]) into a Lagrangian which determines the detailed dynamics by which the objective is optimized. In section 3.1 we will show how to add in an extra level of control dynamics.\n\n2.1 THE LAGRANGIAN\n\nReplacing an objective $E$ with an associated Lagrangian, $L$, is an algebraic transformation:\n\n$$E[v] \\to L[v, \\dot{v} \\mid q] = K[v, \\dot{v} \\mid q] + \\frac{dE}{dt}. \\tag{1}$$\n\nThe \"action\" $S = \\int_{-\\infty}^{\\infty} L \\, dt$ is to be extremized in a novel way:\n\n$$\\frac{\\delta S}{\\delta \\dot{v}_i(t)} = 0. \\tag{2}$$\n\nIn (1), $q$ is an optional set of control parameters (see section 3.1) and $K$ is a cost-of-movement term independent of the problem and of $E$. For one standard class of neural networks,\n\n$$E[v] = -\\frac{1}{2} \\sum_{ij} T_{ij} v_i v_j - \\sum_i h_i v_i + \\sum_i \\phi_i(v_i), \\tag{3}$$\n\nso\n\n$$-\\frac{\\partial E}{\\partial v_i} = \\sum_j T_{ij} v_j + h_i - g^{-1}(v_i), \\tag{4}$$\n\nwhere $g^{-1}(v) = \\phi'(v)$. Also $dE/dt$ is of course $\\sum_i (\\partial E/\\partial v_i) \\dot{v}_i$.\n\n2.2 THE GREEDY FUNCTIONAL DERIVATIVE\n\nIn physics, Lagrangian dynamics usually have a conserved total energy which prohibits convergence to fixed points. Here the main difference is the unusual functional derivative with respect to $\\dot{v}$ rather than $v$ in equation (2). 
This is a \"greedy\" functional derivative, in which the trajectory is optimized from the beginning up to each time $t$ by choosing an extremal value of $v(t)$ without considering its effect on any subsequent portion of the trajectory:\n\n$$\\frac{\\delta}{\\delta \\dot{v}_i(t)} \\int_{-\\infty}^{t} dt' \\, L[v, \\dot{v}] \\propto \\delta(0) \\frac{\\partial L[v, \\dot{v}]}{\\partial \\dot{v}_i(t)} = \\delta(0) \\frac{\\delta}{\\delta \\dot{v}_i(t)} \\int_{-\\infty}^{\\infty} dt' \\, L[v, \\dot{v}] \\propto \\frac{\\delta S}{\\delta \\dot{v}_i(t)}. \\tag{5}$$\n\nSince\n\n$$\\frac{\\delta S}{\\delta \\dot{v}_i} = \\frac{\\partial L}{\\partial \\dot{v}_i} = \\frac{\\partial K}{\\partial \\dot{v}_i} + \\frac{\\partial E}{\\partial v_i}, \\tag{6}$$\n\nequations (1) and (2) preserve fixed points (where $\\partial E/\\partial v_i = 0$) if $\\partial K/\\partial \\dot{v}_i = 0 \\Leftrightarrow \\dot{v} = 0$.\n\n2.3 STEEPEST DESCENT DYNAMICS\n\nFor example, with $K = \\sum_i \\phi(\\dot{v}_i / r)$ one may recover and generalize steepest-descent dynamics:\n\n$$E[v] \\to L[\\dot{v} \\mid r] = \\sum_i \\phi(\\dot{v}_i / r) + \\sum_i \\frac{\\partial E}{\\partial v_i} \\dot{v}_i, \\tag{7}$$\n\n$$\\frac{\\partial L}{\\partial \\dot{v}_i(t)} = 0 \\Rightarrow \\frac{\\phi'(\\dot{v}_i / r)}{r} + \\frac{\\partial E}{\\partial v_i} = 0, \\tag{8}$$\n\ni.e.\n\n$$\\dot{v}_i = r g\\left(-r \\frac{\\partial E}{\\partial v_i}\\right). \\tag{9}$$\n\nAs usual $g = (\\phi')^{-1}$. A transfer function with $-1 < g(x) < 1$ could enforce a velocity constraint $-r < \\dot{v}_i < r$.\n\nFigure 1: (a) Greedy functional derivatives result in greedy optimization: the \"next\" point in a trajectory is chosen on the basis of previous points but not future ones. (b) Two time variables $t$ and $\\tau$ may increase during nonoverlapping intervals of an underlying physical time variable, $T$. For example $t = \\int dT \\, \\phi_1(T)$ and $\\tau = \\int dT \\, \\phi_2(T)$ where $\\phi_1$ and $\\phi_2$ are nonoverlapping clock signals.\n\n2.4 HOPFIELD/GROSSBERG DYNAMICS\n\nWith a suitable $K$ one may recover the analog neuron dynamics of Hopfield (and Grossberg):\n\n$$L[u, \\dot{u}] = \\sum_i \\frac{1}{2} \\dot{u}_i^2 g'(u_i) + \\sum_i \\frac{\\partial E}{\\partial v_i} \\dot{v}_i, \\qquad v_i = g(u_i). \\tag{10}$$\n\n$$\\frac{\\partial L}{\\partial \\dot{u}_i(t)} = 0 \\Rightarrow \\dot{u}_i + \\frac{\\partial E}{\\partial v_i} = 0, \\tag{11}$$\n\ni.e.\n\n$$\\dot{u}_i = -\\frac{\\partial E}{\\partial v_i} \\quad \\text{and} \\quad v_i = g(u_i).$$ 
\n\n(12) \nWe  conjecture  that  this function  J( [Ui, ud  is  optimal in  a  certain  sense:  if we  lin(cid:173)\nearize the u  dynamics and consider the largest and smallest eigenvalues, extremized \nseparately over  the entire domain of u, with  -T constrained  to have  bounded  pos(cid:173)\nitive eigenvalues, then the ratio of such  largest  and smallest eigenvalues  is  minimal \nfor  this  J(.  This  criterion is of practical  importance because  the largest  eigenvalue \nshould  be bounded for  circuit implement ability, and the smallest eigenvalue should \nbe bounded  away from zero  for  circuit convergence  in finite  time. \n\n\f80  Mj olsness and Miranker \n\n2.5  A  CHANGE  OF VARIABLES  SIMPLIFIES L \n\nWe  note  a  change of variable which simplifies  the kinetic energy  term in  the above \ndynamics, for  use  in the  next  section: \n\nL[w] = Li ~wl + Li :~l Wi, \n8L/8wi(t) ==  0 ~ Wi  + 8E/8wi = 0,  I.e. \nWi  = -8E/8wi \n\n(13) \n\nwhich is supposed to be identical to  Ui = -8E/8vi, Vi  = g(Ui)  (c.f.  (12)).  This can \nbe  arranged  by  choosing  w: \n\ni.e. \n\nWi = JUI du.jg'(u)  and  Vi  = JWi  dw.jg'(u(w)). \n\n(14) \n\n(15) \n\n3  APPLICATION  TO  COMPUTATIONAL  ATTENTION \n\nWe can introduce a computational \"attention mechanism\" for neural nets as follows. \nSuppose  we  can  only  afford  to simulate  A  out  of N  ~ A  neurons  at  a  time  in  a \nlarge net.  We shall do this by simulating A  real neurons indexed by a  E {I ... A}, \ncorresponding  to  a  dynamically chosen  subset  of the  N  virtual neurons indexed \nby i  E {l. .. N}. \n\n3.0.1  Constraints \n\nIn  great  generality,  the  correspondance  can  be  chosen  dynamically  via  a  sparse \nmatrix of control parameters \n\nqia  =  ria  E  [0,1] \nL:i ria = 1, \nLa ria  < 1. 
\n\nconstrained  so  that \n\n(16) \n\nAlternatively, the r  variables can be coordinated to describe  a  \"window\" or  \"focus\" \nof  attention  by  taking  ria  to  be  a  function  of a  small  number  of parameters  q \nspecifying  the  window,  which  are  adjusted  to  optimize  E[r[q]].  This  procedure, \nwhich can result  in significant economies,  was  used  for  our computer  experiments. \n\n3.0.2  Neuron Dynamics \n\nThe assumed  control relationship  is \n\nWi  =  Lriaka, \n\na \n\n(17) \n\ni.e.  virtual neuron  Wi  follows  the  real  neuron  to which  r  assigns it.  Equation  (15) \nthen  determines  Ui(t)  and  viet).  A  plausible kinetic  energy  term for  k  is  the  same \n\n\fA Lagrangian Approach to Fixed Points \n\n81 \n\nas  for  w  (c.f.  equation (13\u00bb, since that choice  (equivalent to the Hoplield case)  has \na  good  eigenvalue  ratio for  the  u  variables.  The  Lagrangian  for  the  real  neurons \nbecomes \n\n. \n\n. \nL[k]  =  - L.Jka + L.J -8 . riaka \n\n1 ~\u00b72  ~ 8E \n2 \nWI \n\na \n\n(18) \n\n(19) \n\nand the equations of motion (greedy  variation) may be shown to be \n\nka  = L riavg'(U(w,\u00bb  [I: 71jvj + h, - u,]. \n\n, \n\n. \nla \n\nj \n\n3.1  CONTROL DYNAMICS  FOR ATTENTION \n\nNow we need dynamics for  the control parameters r  or more generally q.  An objec(cid:173)\ntive function transformation (proposed and subjected to preliminary experiments in \n[Mjolsness,  1987])  can be used  to construct  a new objective for  the control parame(cid:173)\nters,  q, which rewards speedy  convergence of the original objective  E  as a function \nof the original variables v  by  measuring dE/dt: \n\nE[v]  -+ E[q] \n\nb(dE/dt) + Ecost [q] \n\n=  b[2:i(8E/8v,)tid + Ecost [q], \n\n(20) \n\nwhere  b is  a monotonic, odd function  that can be used  to limit the range of E.  We \ncan  calculate dE/dt from equations (17)  and (19): \n\nEb~eftt(r) = 6(:~) = 6 [f.><. :! k.]  
= -6 [~ (~>'.Vg,(U') ;~ y] , \n\n(21) \nwhere  8E/8vi  =  2:j  71jvj  + hi  - Ui.  If we  assume  that  Ecost  favors  fixed  points \nfor  which  ria  ~ 0  or  1  and  2:i ria  ~ 0  or  1,  there  is  a  fixed-point-preserving \ntransformation of (21)  to \n\nEb~eftt(r) = -6 [~r,.9'( U;)(;:')2] . \n\n(22) \n\nThis is monotonic in a linear function of r.  It remains to specify  Ecost  and a kinetic \nenergy  term [(. \n\n3.2 \n\nINDEPENDENT VIRTUAL  NEURONS \n\nFirst consider  independent  ria.  As  in the Tank-Hopfield  [Tank and  Hopfield,  1986] \nlinear programming net, we  could take \n\nThus  the  r  dynamics just  sorts the  virtual  neurons  and  chooses  the  A  neurons \nwith largest  g' (ui)8 E / 8v, .  For dynamics, we  introduce  a  new  time variable  T  that \n\n\f82  Mjolsness and Miranker \n\nmay not  even  be proportional to t  (see  figure  1 b)  and  imitate the  Lagrangians for \nHopfield  dynamics: \n\n' \"  1 (dPi a )  2 \n\nL = ~ 2  dr \n\nsa \n\n, \n9 (Pi) +  dr  Ebeneflt + ECOBt \n\n~) \n; \n\nd (_ \n\n(24) \n\n3.3  JUMPING WINDOW OF  ATTENTION \n\nA  far more cost-effective  net involves partitioning the virtual neurons into real-net(cid:173)\nsized  blocks  indexed  by  a,  so  i  -+  (a, a)  where  a  indexes  neurons  within  a  block. \nLet XQ  E [0,1] indicate which block  is  the current  window or focus  of attention, i.e. \n\nUsing  (22),  this  implies \n\nEbeneflt[x] =  -b [Z:XQ Z:g'(UQa)(8~E )2]  , \n\nQ \n\na \n\nQa \n\nand \n\n(26) \n\n(27) \n\n(28) \n\nSince  ECOBt  here  favors  LQ XQ  =  1  and  XQ  E  {O, I},  Ebeneflt  has  the  same  fixed \npoints as, and can be replaced  by, \n\n(29) \n\nThen the dynamics for X is just that of a winner-take-all neural net among the blocks \nwhich will select  the largest  value of b[La g'(uQa )(8E/8vQa)2].  The simulations of \nSection 4  report on an earlier version of this control scheme,  which  selected  instead \nthe block  with the largest  value of La 18E/8vQa l. 
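\n\nThe replacement step above can be made concrete with a short worked sketch. It assumes, hypothetically, that the replacement objective is linear in $\\chi$, which is one form consistent with the winner-take-all description; the shorthand $S_\\alpha$ and the index $\\beta$ of the active block are introduced here only for illustration.\n\n```latex\n% Shorthand (an assumption of this sketch): the per-block argument of b in (26)\nS_\\alpha \\equiv \\sum_a g'(u_{\\alpha a}) \\left( \\frac{\\partial E}{\\partial v_{\\alpha a}} \\right)^2\n\n% At a one-hot point (\\chi_\\beta = 1, all other \\chi_\\alpha = 0) both forms take the same value:\n-b\\Big[ \\sum_\\alpha \\chi_\\alpha S_\\alpha \\Big] = -b[S_\\beta] = -\\sum_\\alpha \\chi_\\alpha b[S_\\alpha]\n```\n\nUnder these assumptions the two expressions agree at the one-hot points that $E_{\\text{cost}}$ favors, and the linear form is minimized over $\\sum_\\alpha \\chi_\\alpha = 1$ by putting all weight on the block with the largest $b[S_\\alpha]$, which matches the winner-take-all behavior described above.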
\n\n3.4 ROLLING WINDOW OF ATTENTION\n\nHere the $r$ variables for a neural net embedded in a $d$-dimensional space are determined by a vector $x$ representing the geometric position of the window. $E_{\\text{cost}}$ can be dropped entirely, and $E$ can be calculated from $r(x)$. Suppose the embedding is via a $d$-dimensional grid which for notational purposes is partitioned into window-sized squares indexed by integer-valued vectors $\\alpha$ and $a$. Then\n\n(30)\n\nwhere\n\n$$\\frac{\\partial w(x)}{\\partial x_\\mu} = \\begin{cases} 6[1/4 - (x_\\mu + L)^2] & \\text{if } -1/2 \\le x_\\mu + L < 1/2 \\\\\\\\ 6[(x_\\mu - L)^2 - 1/4] & \\text{if } -1/2 \\le x_\\mu - L < 1/2 \\\\\\\\ 0 & \\text{otherwise} \\end{cases} \\tag{31}$$\n\nand\n\n$$E[x] = -b\\left[\\sum_{\\alpha a} w(L\\alpha + a - x) g'(u_{\\alpha a}) \\left(\\frac{\\partial E}{\\partial v_{\\alpha a}}\\right)^2\\right]. \\tag{32}$$\n\nThe advantage of (30) over, for example, a jumping or sliding window of attention is that only a small number of real neurons are being reassigned to new virtual neurons at any one time.\n\n3.4.1 Dynamics of a Rolling Window\n\nA candidate Lagrangian is\n\n$$L[x] = \\frac{1}{2} \\sum_\\mu \\left(\\frac{dx_\\mu}{d\\tau}\\right)^2 + \\sum_\\mu \\frac{\\partial E}{\\partial x_\\mu} \\frac{dx_\\mu}{d\\tau}, \\tag{33}$$\n\nwhence greedy variation $\\delta S/\\delta \\dot{x} = 0$ yields\n\n$$\\frac{dx_\\mu}{d\\tau} = \\left[\\sum_{\\alpha a} \\frac{\\partial w(x - L\\alpha - a)}{\\partial x_\\mu} g'(u_{\\alpha a}) \\left(\\frac{\\partial E}{\\partial v_{\\alpha a}}\\right)^2\\right] \\times b'\\left[\\sum_{\\alpha a} w \\, g'(u_{\\alpha a}) \\left(\\frac{\\partial E}{\\partial v_{\\alpha a}}\\right)^2\\right]. \\tag{34}$$\n\nWe may also calculate that the eigenvalues of the linearized dynamics can be bounded away from infinity and zero.\n\n4 SIMULATIONS\n\nA jumping window of attention was simulated for a graph-matching network in which the matching neurons were partitioned into groups, only one of which was active ($r_{ia} = 1$) at any given time. The resulting optimization method produced solutions of similar quality to those of the original neural network, but had a smaller requirement for computational space resources at any given time. 
\n\nAcknowledgement: Charles Garrett performed the computer simulations.\n\nReferences\n\n[Mjolsness, 1987] Mjolsness, E. (1987). Control of attention in neural networks. In Proc. of First International Conference on Neural Networks, volume II, pages 567-574. IEEE.\n\n[Mjolsness and Garrett, 1989] Mjolsness, E. and Garrett, C. (1989). Algebraic transformations of objective functions. Technical Report YALEU/DCS/RR686, Yale University Computer Science Department. Also, in press for Neural Networks.\n\n[Tank and Hopfield, 1986] Tank, D. W. and Hopfield, J. J. (1986). Simple 'neural' optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, CAS-33.", "award": [], "sourceid": 398, "authors": [{"given_name": "Eric", "family_name": "Mjolsness", "institution": null}, {"given_name": "Willard", "family_name": "Miranker", "institution": null}]}