{"title": "Triangulation by Continuous Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 557, "page_last": 563, "abstract": null, "full_text": "Triangulation by Continuous Embedding \n\nMarina Meil\u0103 and Michael I. Jordan \n\n{mmp, jordan}@ai.mit.edu \n\nCenter for Biological & Computational Learning \n\nMassachusetts Institute of Technology \n\n45 Carleton St. E25-201 \nCambridge, MA 02142 \n\nAbstract \n\nWhen triangulating a belief network we aim to obtain a junction tree of minimum state space. According to (Rose, 1970), searching for the optimal triangulation can be cast as a search over all the permutations of the graph's vertices. Our approach is to embed the discrete set of permutations in a convex continuous domain D. By suitably extending the cost function over D and solving the continuous nonlinear optimization task we hope to obtain a good triangulation with respect to the aforementioned cost. This paper presents two ways of embedding the triangulation problem into a continuous domain and shows that they perform well compared to the best known heuristic. \n\n1 INTRODUCTION. WHAT IS TRIANGULATION? \n\nBelief networks are graphical representations of probability distributions over a set of variables. In what follows it will always be assumed that the variables take values in a finite set and that they correspond to the vertices of a graph. The graph's arcs represent the dependencies among variables.  
There are two kinds of representations that have gained wide use: one is the directed acyclic graph model, also called a Bayes net, which represents the joint distribution as a product of the probabilities of each vertex conditioned on the values of its parents; the other is the undirected graph model, also called a Markov field, where the joint distribution is factorized over the cliques^1 of an undirected graph. This factorization is called a junction tree and optimizing it is the subject of the present paper. The power of both models lies in their ability to display and exploit existing marginal and conditional independencies among subsets of variables. Emphasizing independencies is useful from both a qualitative point of view (it reveals something about the domain under study) and a quantitative one (it makes computations tractable). The two models differ in the kinds of independencies they are able to represent and often in their naturalness in particular tasks. Directed graphs are more convenient for learning a model from data; on the other hand, the clique structure of undirected graphs organizes the information in a way that makes it immediately available to inference algorithms. Therefore it is a standard procedure to construct the model of a domain as a Bayes net and then to convert it to a Markov field for the purpose of querying it. \n\n1 A clique is a fully connected set of vertices and a maximal clique is a clique that is not contained in any other clique. \n\nThis process is known as decomposition and it consists of the following stages: first, the directed graph is transformed into an undirected graph by an operation called moralization. Second, the moralized graph is triangulated.  
A graph is called triangulated if any cycle of length > 3 has a chord (i.e. an edge connecting two nonconsecutive vertices). If a graph is not triangulated it is always possible to add new edges so that the resulting graph is triangulated. We shall call this procedure triangulation and the added edges the fill-in. In the final stage, the junction tree (Kj\u00e6rulff, 1991) is constructed from the maximal cliques of the triangulated graph. We define the state space of a clique to be the cartesian product of the state spaces of the variables associated to the vertices in the clique and we call weight of the clique the size of this state space. The weight of the junction tree is the sum of the weights of its component cliques. All further exact inference in the net takes place in the junction tree representation. The number of computations required by an inference operation is proportional to the weight of the tree. \n\nFor each graph there are several and usually a large number of possible triangulations, with widely varying state space sizes. Moreover, triangulation is the only stage where the cost of inference can be influenced. It is therefore critical that the triangulation procedure produces a graph that is optimal or at least \"good\" in this respect. \n\nUnfortunately, this is a hard problem. No efficient optimal triangulation algorithm is known to date. However, a number of heuristic algorithms like maximum cardinality search (Tarjan and Yannakakis, 1984), lexicographic search (Rose et al., 1976) and the minimum weight heuristic (MW) (Kj\u00e6rulff, 1990) are known.  
An optimization method based on simulated annealing which performs better than the heuristics on large graphs has been proposed in (Kj\u00e6rulff, 1991) and recently a \"divide and conquer\" algorithm which bounds the maximum clique size of the triangulated graph has been published (Becker and Geiger, 1996). All but the last algorithm are based on Rose's (Rose, 1970) elimination procedure: choose a node v of the graph, connect all its neighbors to form a clique, then eliminate v and all the edges incident to it and proceed recursively. The resulting filled-in graph is triangulated. \n\nIt can be proven that the optimal triangulation can always be obtained by applying Rose's elimination procedure with an appropriate ordering of the nodes. It follows then that searching for an optimal triangulation can be cast as a search in the space of all node permutations. The idea of the present work is the following: embed the discrete search space of permutations of n objects (where n is the number of vertices) into a suitably chosen continuous space. Then extend the cost to a smooth function over the continuous domain and thus transform the discrete optimization problem into a continuous nonlinear optimization task. This allows one to take advantage of the wealth of optimization methods that exist for continuous cost functions. The rest of the paper will present this procedure in the following sequence: the next section introduces and discusses the objective function; section 3 states the continuous version of the problem; section 4 discusses further aspects of the optimization procedure and presents experimental results and section 5 concludes the paper. 
\n\n2 THE OBJECTIVE \n\nIn this section we introduce the objective function that we used and we discuss its relationship to the junction tree weight. First, some notation. Let $G = (V, E)$ be a graph with vertex set $V$ and edge set $E$. Denote by $n$ the cardinality of the vertex set, by $r_v$ the number of values of the (discrete) variable associated to vertex $v \\in V$, by $\\#$ the elimination ordering of the nodes, such that $\\#v = i$ means that node $v$ is the $i$-th node to be eliminated according to ordering $\\#$, by $n(v)$ the set of neighbors of $v \\in V$ in the triangulated graph and by $C_v = \\{v\\} \\cup \\{u \\in n(v) \\mid \\#u > \\#v\\}$.^2 Then, a result in (Golumbic, 1980) allows us to express the total weight of the junction tree obtained with elimination ordering $\\#$ as \n\n$$J^*(\\#) = \\sum_{v \\in V} \\mathrm{ismax}(C_v) \\prod_{u \\in C_v} r_u \\qquad (1)$$ \n\nwhere $\\mathrm{ismax}(C_v)$ is a variable which is 1 when $C_v$ is a maximal clique and 0 otherwise. As stated, this is the objective of interest for belief net triangulation. Any reference to optimality henceforth will be made with respect to $J^*$. \n\nThis result implies that there are no more than $n$ maximal cliques in a junction tree and provides a method to enumerate them. This suggests defining a cost function that we call the raw weight $J$ as the sum over all the cliques $C_v$ (thus possibly including some non-maximal cliques): \n\n$$J(\\#) = \\sum_{v \\in V} \\prod_{u \\in C_v} r_u \\qquad (2)$$ \n\n$J$ is the cost function that will be used throughout this paper. A reason to use it instead of $J^*$ in our algorithm is that the former is easier to compute and to approximate. How to do this will be the object of the next section. But it is natural to ask first: how well do the two agree? \n\nObviously, $J$ is an upper bound for $J^*$. Moreover, it can be proved that if $r = \\min_v r_v$ then \n\n$$J^*(\\#) \\le J(\\#) \\le \\frac{r}{r-1} J^*(\\#) \\qquad (3)$$ \n\nand therefore that $J$ is less than a fraction $1/(r - 1)$ away from $J^*$.  
The upper bound is attained when the triangulated graph is fully connected and all $r_v$ are equal. \n\nIn other words, the difference between $J$ and $J^*$ is largest for the highest cost triangulation. We also expect this difference to be low for the low cost triangulation. An intuitive argument for this is that good triangulations are associated with a large number of smaller cliques rather than with a few large ones. But the former situation means that there will be only a small number of small size non-maximal cliques to contribute to the difference $J - J^*$, and therefore that the agreement with $J^*$ is usually closer than (3) implies. This conclusion is supported by simulations (Meil\u0103 and Jordan, 1997). \n\n2 Both $n(v)$ and $C_v$ depend on $\\#$ but we chose not to emphasize this in the notation for the sake of readability. \n\n3 THE CONTINUOUS OPTIMIZATION PROBLEM \n\nThis section shows two ways of defining $J$ over continuous domains. Both rely on a formulation of $J$ that eliminates explicit reference to the cliques $C_v$; we describe this formulation here. \n\nLet us first define new variables $\\mu_{uv}$ and $e_{uv}$, $u, v = 1, \\ldots, n$. For any permutation $\\#$ \n\n$$\\mu_{uv} = \\begin{cases} 1 & \\text{if } \\#u \\le \\#v \\\\ 0 & \\text{otherwise} \\end{cases} \\qquad e_{uv} = \\begin{cases} 1 & \\text{if the edge } (u,v) \\in E \\cup F_{\\#} \\\\ 0 & \\text{otherwise} \\end{cases}$$ \n\nwhere $F_{\\#}$ is the set of fill-in edges. \n\nIn other words, $\\mu$ represents precedence relationships and $e$ represents the edges between the $n$ vertices. Therefore, they will be called precedence variables and edge variables respectively. With these variables, $J$ can be expressed as \n\n$$J(\\#) = \\sum_{v \\in V} \\prod_{u \\in V} r_u^{\\mu_{vu} e_{vu}} \\qquad (4)$$ \n\nIn (4), the product $\\mu_{vu} e_{vu}$ acts as an indicator variable being 1 iff \"$u \\in C_v$\" is true. For any given permutation, finding the $\\mu$ variables is straightforward.  
Computing the edge variables is possible thanks to a result in (Rose et al., 1976). It states that an edge $(u, v)$ is contained in $F_{\\#}$ iff there is a path in $G$ between $u$ and $v$ containing only nodes $w$ for which $\\#w < \\min(\\#u, \\#v)$. Formally, $e_{uv} = e_{vu} = 1$ iff there exists a path $P = (u, w_1, w_2, \\ldots, v)$ such that \n\n$$\\prod_{w_i \\in P} \\mu_{w_i u} \\mu_{w_i v} = 1$$ \n\nSo far, we have succeeded in defining the cost $J$ associated with any permutation in terms of the variables $\\mu$ and $e$. In the following, the set of permutations will be embedded in a continuous domain. As a consequence, $\\mu$ and $e$ will take values in the interval $[0,1]$ but the form of $J$ in (4) will stay the same. \n\nThe first method, called $\\mu$-continuous embedding ($\\mu$-CE), assumes that the variables $\\mu_{uv} \\in [0,1]$ represent independent probabilities that $\\#u < \\#v$. For any permutation, the precedence variables have to satisfy the transitivity condition. Transitivity means that if $\\#u < \\#v$ and $\\#v < \\#w$, then $\\#u < \\#w$, or, equivalently, that for any triple $(\\mu_{uv}, \\mu_{vw}, \\mu_{wu})$ the assignments $(0,0,0)$ and $(1,1,1)$ are forbidden. According to the probabilistic interpretation of $\\mu$ we introduce a term that penalizes the probability of a transitivity violation: \n\n$$R(\\mu) = \\sum_{u < v < w} P[(u, v, w) \\text{ nontransitive}] \\qquad (5)$$ \n\n$$= \\sum_{u < v < w} [\\mu_{uv} \\mu_{vw} \\mu_{wu} + (1 - \\mu_{uv})(1 - \\mu_{vw})(1 - \\mu_{wu})] \\qquad (6)$$ \n\n$$\\ge P[\\text{assignment nontransitive}] \\qquad (7)$$ \n\nIn the second approach, called $\\theta$-continuous embedding ($\\theta$-CE), the permutations are directly embedded into the set of doubly stochastic matrices. A doubly stochastic matrix $\\theta$ is a nonnegative matrix for which the elements in each row and in each column sum to one. \n\n$$\\sum_i \\theta_{ij} = \\sum_j \\theta_{ij} = 1, \\qquad \\theta_{ij} \\ge 0 \\quad \\text{for } i, j = 1, \\ldots, n. \\qquad (8)$$ \n\nWhen the $\\theta_{ij}$ are either 0 or 1, implying that there is exactly one nonzero element in each row or column, the matrix is called a permutation matrix. $\\theta_{ij} = 1$ and $\\#i = j$ both mean that the position of object $i$ is $j$ in the given permutation. The set of doubly stochastic matrices $\\Theta$ is a convex polytope of dimension $(n - 1)^2$ whose extreme points are the permutation matrices (Balinski and Russakoff, 1974). Thus, every doubly stochastic matrix can be represented as a convex combination of permutation matrices. To constrain the optimum to be an extreme point, we use the penalty term \n\n$$R(\\theta) = \\sum_{ij} \\theta_{ij} (1 - \\theta_{ij}) \\qquad (9)$$ \n\nThe precedence variables are defined over $\\Theta$ as \n\n$\\mu_{uv}$ \n\n$1 - \\mu_{vu}$ \n\n$1$ \n\nNow, for both embeddings, the edge variables can be computed from $\\mu$ as follows \n\n$$e_{uv} = \\begin{cases} 1 & \\text{for } (u, v) \\in E \\text{ or } u = v \\\\ \\max_{P \\in \\{\\text{paths } u-v\\}} \\prod_{w \\in P} \\mu_{wu} \\mu_{wv} & \\text{otherwise} \\end{cases}$$ \n\nThe above assignments give the correct values for $\\mu$ and $e$ for any point representing a permutation. Over the interior of the domain, $e$ is a continuous, piecewise differentiable function. Each $e_{uv}$, $(u, v) \\notin E$, can be computed by a shortest path algorithm between $u$ and $v$, with the length of $(w_1, w_2) \\in E$ defined as $(-\\log \\mu_{w_1 u} \\mu_{w_2 v})$. \n\n$\\theta$-CE is an interior point method whereas in $\\mu$-CE the current point, although inside $[0,1]^{n(n-1)/2}$, isn't necessarily in the convex hull of the hypercube's corners that represent permutations. The number of operations required for one evaluation of $J$ and its gradient is as follows: $O(n^4)$ operations to compute $\\mu$ from $\\theta$, $O(n^3 \\log n)$ to compute $e$, $O(n^3)$ for one of the partial derivatives and $O(n^2)$ for each of the other two afterwards. Since computing $\\mu$ is the most computationally intensive step, $\\mu$-CE is a clear win in terms of computation cost.  
In addition, by operating directly in the $\\mu$ domain, one level of approximation is eliminated, which makes one expect $\\mu$-CE to perform better than $\\theta$-CE. The results in the following section will confirm this. \n\n4 EXPERIMENTAL RESULTS \n\nTo assess the performance of our algorithms we compared their results with the results of the minimum weight heuristic (MW), the heuristic that scored best in empirical tests (Kj\u00e6rulff, 1990). The lowest junction tree weight obtained in 200 runs of MW was retained and denoted by $J^*_{MW}$. Tests were run on 6 graphs of different sizes and densities: \n\ngraph              h9     h12    d10      m20    a20      d20 \nn = |V|            9      12     10       20     20       20 \ndensity            .33    .25    .6       .25    .45      .6 \nr_min/r_max/r_avg  2/2/2  3/3/3  6/15/10  2/8/5  6/15/10  6/15/10 \nlog10 J*_MW        2.43   2.71   7.44     5.47   12.75    13.94 \n\nFigure 1: Minimum, maximum (solid line) and median (dashed line) values of $J^*/J^*_{MW}$ obtained by $\\theta$-CE (a) and $\\mu$-CE (b) on graphs h9, h12, d10, m20, a20 and d20. \n\nThe last row of the table shows $\\log_{10} J^*_{MW}$. We ran 11 or more trials of each of our two algorithms on each graph. To force the variables to converge to a permutation, we minimized the objective $J + \\lambda R$, where $\\lambda > 0$ is a parameter that was progressively increased following a deterministic annealing schedule and $R$ is one of the aforementioned penalty terms. The algorithms were run for 50-150 optimization cycles, usually enough to reach convergence.  
However, for the $\\mu$-embedding on graph d20, there were several cases where many $\\mu$ values did not converge to 0 or 1. In those cases we picked the most plausible permutation to be the answer. \n\nThe results are shown in figure 1 in terms of the ratio of the true cost obtained by the continuous embedding algorithm (denoted by $J^*$) and $J^*_{MW}$. For the first two graphs, h9 and h12, $J^*_{MW}$ is the optimal cost; the embedding algorithms reach it in most trials. On the remaining graphs, $\\mu$-CE clearly outperforms $\\theta$-CE, which also performs worse than MW on average. On d10, a20 and m20 $\\mu$-CE also outperforms the MW heuristic, attaining junction tree weights that are 1.6 to 5 times lower on average than those obtained by MW. On d20, a denser graph, the results are similar for MW and $\\mu$-CE in half of the cases and worse for $\\mu$-CE otherwise. The plots also show that the variability of the results is much larger for CE than for MW. This behaviour is not surprising, given that the search space for CE, although continuous, comprises a large number of local minima. This induces dependence on the initial point and, as a consequence, nondeterministic behaviour of the algorithm. Moreover, while the number of choices that MW has is much lower than the upper limit of $n!$, the \"choices\" that CE algorithms consider, although soft, span the space of all possible permutations. \n\n5 CONCLUSION \n\nThe idea of continuous embedding is not new in the field of applied mathematics. The large body of literature dealing with smooth (sigmoidal) functions instead of hard nonlinearities (step functions) is only one example. The present paper shows a nontrivial way of applying a similar treatment to a new problem in a new field. The results obtained by $\\mu$-embedding are on average better than those of the standard MW heuristic.  
Although not directly comparable, the best results reported on triangulation (Kj\u00e6rulff, 1991; Becker and Geiger, 1996) are only slightly better than ours. Therefore the significance of the latter goes beyond the scope of the present problem. They are obtained on a hard problem, whose cost function has no feature to ease its minimization ($J$ is neither linear, nor quadratic, nor is it additive w.r.t. the vertices or the edges) and therefore they demonstrate the potential of continuous embedding as a general tool. \n\nCollaterally, we have introduced the cost function $J$, which is directly amenable to continuous approximations and is in good agreement with the true cost $J^*$. Since minimizing $J$ may not be NP-hard, this opens a way for investigating new triangulation methods. \n\nAcknowledgements \n\nThe authors are grateful to Tommi Jaakkola for many discussions and to Ellie Bonsaint for her invaluable help in typing the paper. \n\nReferences \n\nBalinski, M. and Russakoff, R. (1974). On the assignment polytope. SIAM Rev. \n\nBecker, A. and Geiger, D. (1996). A sufficiently fast algorithm for finding close to optimal junction trees. In UAI 96 Proceedings. \n\nGolumbic, M. (1980). Algorithmic Graph Theory and Perfect Graphs. Academic Press, New York. \n\nKj\u00e6rulff, U. (1990). Triangulation of graphs - algorithms giving small total state space. Technical Report R 90-09, Department of Mathematics and Computer Science, Aalborg University, Denmark. \n\nKj\u00e6rulff, U. (1991). Optimal decomposition of probabilistic networks by simulated annealing. Statistics and Computing. \n\nMeil\u0103, M. and Jordan, M. I. (1997). An objective function for belief net triangulation. In Madigan, D., editor, AI and Statistics, number 7.  
(to appear). \n\nRose, D. J. (1970). Triangulated graphs and the elimination process. Journal of Mathematical Analysis and Applications. \n\nRose, D. J., Tarjan, R. E., and Lueker, E. (1976). Algorithmic aspects of vertex elimination on graphs. SIAM J. Comput. \n\nTarjan, R. and Yannakakis, M. (1984). Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and select reduced acyclic hypergraphs. SIAM J. Comput. \n", "award": [], "sourceid": 1318, "authors": [{"given_name": "Marina", "family_name": "Meila", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}