{"title": "Hierarchical Non-linear Factor Analysis and Topographic Maps", "book": "Advances in Neural Information Processing Systems", "page_first": 486, "page_last": 492, "abstract": "", "full_text": "Hierarchical Non-linear Factor  Analysis \n\nand  Topographic  Maps \n\nZoubin Ghahramani and Geoffrey E.  Hinton \nDept.  of Computer Science,  University of Toronto \n\nToronto,  Ontario,  M5S  3H5,  Canada \n\nhttp://www.cs.toronto.edu/neuron/ \n\n{zoubin,hinton}Ocs.toronto.edu \n\nAbstract \n\nWe  first  describe  a  hierarchical,  generative  model  that  can  be \nviewed  as  a  non-linear  generalisation  of factor  analysis  and  can \nbe  implemented  in  a  neural  network.  The  model  performs  per(cid:173)\nceptual  inference  in  a  probabilistically consistent  manner by  using \ntop-down,  bottom-up and  lateral  connections.  These  connections \ncan  be  learned  using  simple  rules  that  require  only  locally  avail(cid:173)\nable  information.  We  then  show  how  to  incorporate  lateral  con(cid:173)\nnections  into the  generative  model.  The  model extracts  a  sparse, \ndistributed,  hierarchical  representation  of  depth  from  simplified \nrandom-dot  stereograms  and  the  localised  disparity  detectors  in \nthe  first  hidden  layer  form  a  topographic  map.  When  presented \nwith image patches from  natural scenes,  the  model develops  topo(cid:173)\ngraphically organised local feature detectors. \n\n1 \n\nIntroduction \n\nFactor  analysis  is  a  probabilistic  model  for  real-valued  data  which  assumes  that \nthe  data is  a  linear combination of real-valued  uncorrelated  Gaussian sources  (the \nfactors).  After  the  linear  combination,  each  component  of the  data vector  is  also \nassumed  to  be  corrupted  by  additional Gaussian noise.  A  major advantage of this \ngenerative  model  is  that,  given  a  data  vector,  the  probability  distribution  in  the \nspace  of factors  is  a  multivariate Gaussian  whose  mean is  a  linear  function  of the \ndata.  It is  therefore  tractable to compute the posterior distribution exactly and to \nuse  it  when  learning  the  parameters  of the  model  (the  linear  combination  matrix \nand noise  variances).  A major disadvantage is that factor analysis is  a linear model \nthat is  insensitive to higher order statistical structure of the observed  data vectors. \n\nOne  way  to  make factor  analysis  non-linear  is  to  use  a  mixture of factor  analyser \nmodules,  each  of which  captures  a  different  linear  regime in  the  data [3].  We  can \nview the factors of all of the modules as a large set of basis functions for describing \nthe  data  and  the  process  of selecting  one  module  then  corresponds  to  selecting \nan  appropriate  subset  of the  basis  functions.  Since  the  number  of subsets  under \nconsideration is only linear in the number of modules, it is still tractable to compute \n\n\fHierarchical Non-linear Factor Analysis and Topographic Maps \n\n487 \n\nthe full  posterior distribution when given a data point.  Unfortunately, this mixture \nmodel  is  often  inadequate.  Consider,  for  example,  a  typical  image that  contains \nmultiple  objects.  To  represent  the  pose  and  deformation  of each  object  we  want \na  componential representation  of the object's  parameters  which  could  be  obtained \nfrom an appropriate factor  analyser.  
2 Rectified Gaussian Belief Nets

The Rectified Gaussian Belief Net (RGBN) uses multiple layers of units with states that are either positive real values or zero [5]. Its main disadvantage is that computing the posterior distribution over the factors given a data vector involves Gibbs sampling. In general, Gibbs sampling can be very time consuming, but in practice 10 to 20 samples per unit have proved adequate, and there are theoretical reasons for believing that learning can work well even when the Gibbs sampling fails to reach equilibrium [10].

We first describe the RGBN without considering neural plausibility. Then we show how lateral interactions within a layer can be used to perform probabilistic inference correctly using locally available information. This makes the RGBN far more plausible as a neural model than a sigmoid belief net [9, 8], because it means that Gibbs sampling can be performed without requiring units in one layer to see the total top-down input to units in the layer below.

The generative model for RGBNs consists of multiple layers of units, each of which has a real-valued unrectified state, y_j, and a rectified state, [y_j]^+, which is zero if y_j is negative and equal to y_j otherwise. This rectification is the only non-linearity in the network.(1) The value of y_j is Gaussian distributed with a standard deviation \sigma_j and a mean, \hat{y}_j, that is determined by the generative bias, g_{0j}, and the combined effects of the rectified states of units, k, in the layer above:

    \hat{y}_j = g_{0j} + \sum_k g_{kj} [y_k]^+     (1)

The rectified state [y_j]^+ therefore has a Gaussian distribution above zero, but all of the mass of the Gaussian that falls below zero is concentrated in an infinitely dense spike at zero, as shown in Fig. 1a. This infinite density creates problems if we attempt to use Gibbs sampling over the rectified states, so, following a suggestion by Radford Neal, we perform Gibbs sampling on the unrectified states.

(1) The key arguments presented in this paper hold for general nonlinear belief networks as long as the noise is Gaussian; they are not specific to the rectification nonlinearity.

Consider a unit, j, in some intermediate layer of a multilayer RGBN. Suppose that we fix the unrectified states of all the other units in the net. To perform Gibbs sampling, we need to stochastically select a value for y_j according to its distribution given the unrectified states of all the other units. If we think in terms of energy functions, which are equal to negative log probabilities (up to a constant), the rectified states of the units in the layer above contribute a quadratic energy term by determining \hat{y}_j. The unrectified states of units, i, in the layer below contribute a constant if [y_j]^+ is 0; if [y_j]^+ is positive they each contribute a quadratic term, because of the effect of [y_j]^+ on \hat{y}_i:

    E_i = \frac{1}{2\sigma_i^2} \left( y_i - g_{0i} - \sum_h g_{hi} [y_h]^+ \right)^2     (2)

where h is an index over all the units in the same layer as j, including j itself. Terms that do not depend on y_j have been omitted from Eq. 2. For values of y_j below zero there is a quadratic energy function which leads to a Gaussian distribution. The same is true for values of y_j above zero, but it is a different quadratic (Fig. 1c). The Gaussian distributions corresponding to the two quadratics must agree at y_j = 0 (Fig. 1b). Because this distribution is piecewise Gaussian it is possible to perform Gibbs sampling exactly.

[Figure 1: a) Probability density in which all the mass of a Gaussian below zero has been replaced by an infinitely dense spike at zero. b) Schematic of the density of a unit's unrectified state. c) Bottom-up and top-down energy functions corresponding to b.]
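To make this sampling step concrete, the following sketch (our code and variable names, not the authors' implementation) performs one exact Gibbs update for a single unit j, assuming its top-down prediction and each child's top-down input from the other parents are available. It completes the square on each branch and samples the resulting truncated Gaussians by inverse CDF:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)

    def gibbs_update(mu_td, var_td, g, y_child, mu_child_rest, var_child):
        """One exact Gibbs update of an unrectified state y_j (illustrative).

        mu_td, var_td    : top-down mean and variance of y_j (Eq. 1)
        g[i]             : generative weight g_{ji} from j to child i
        y_child[i]       : unrectified state of child i
        mu_child_rest[i] : top-down input to child i from all parents except j
        var_child[i]     : noise variance sigma_i^2 of child i
        """
        # Branch y_j <= 0: the children contribute a constant, so the density
        # is just the top-down Gaussian.
        mu_n, sd_n = mu_td, np.sqrt(var_td)
        # Branch y_j >= 0: complete the square over the top-down quadratic and
        # the children's quadratics (Eq. 2).
        prec = 1.0 / var_td + np.sum(g**2 / var_child)
        mu_p = (mu_td / var_td
                + np.sum(g * (y_child - mu_child_rest) / var_child)) / prec
        sd_p = np.sqrt(1.0 / prec)
        # Constant offset that makes the two quadratics agree at y_j = 0
        # (Fig. 1b); written for clarity, not numerical robustness.
        k_pos = 0.5 * mu_td**2 / var_td - 0.5 * mu_p**2 / sd_p**2
        # Unnormalised probability mass of each branch.
        mass_n = sd_n * norm.cdf(-mu_n / sd_n)
        mass_p = np.exp(-k_pos) * sd_p * norm.cdf(mu_p / sd_p)
        # Choose a branch, then draw from its truncated Gaussian by inverse CDF.
        if rng.random() < mass_n / (mass_n + mass_p):
            u = rng.uniform(0.0, norm.cdf(-mu_n / sd_n))
            return mu_n + sd_n * norm.ppf(u)
        u = rng.uniform(norm.cdf(-mu_p / sd_p), 1.0)
        return mu_p + sd_p * norm.ppf(u)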
Given samples from the posterior, the generative weights of an RGBN can be learned by using the online delta rule to maximise the log probability of the data:(2)

    \Delta g_{kj} = \epsilon \, [y_k]^+ (y_j - \hat{y}_j)     (3)

The variance of the local Gaussian noise of each unit, \sigma_j^2, can also be learned by an online rule, \Delta\sigma_j^2 = \epsilon [(y_j - \hat{y}_j)^2 - \sigma_j^2]. Alternatively, \sigma_j^2 can be fixed at 1 for all hidden units, and the effective local noise level can be controlled by scaling the generative weights.

(2) If Gibbs sampling has not been run long enough to reach equilibrium, the delta rule follows the gradient of the penalized log probability of the data [10]. The penalty term is the Kullback-Leibler divergence between the equilibrium distribution and the distribution produced by Gibbs sampling. Other things being equal, the delta rule therefore adjusts the parameters that determine the equilibrium distribution to reduce this penalty, thus favouring models for which Gibbs sampling works quickly.
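A minimal sketch of these two online rules for one pair of layers (our variable names; biases omitted for brevity):

    import numpy as np

    def delta_rule_update(G, y_above, y_below, var_below, lr=0.05):
        """One online update of the generative weights G[k, j] (Eq. 3)."""
        r = np.maximum(y_above, 0.0)   # rectified states [y_k]^+
        y_hat = r @ G                  # top-down predictions (Eq. 1, no biases)
        G += lr * np.outer(r, y_below - y_hat)                # Eq. 3
        var_below += lr * ((y_below - y_hat)**2 - var_below)  # noise-variance rule
        return G, var_below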
3 The Role of Lateral Connections in Perceptual Inference

In RGBNs and other layered belief networks, fixing the value of a unit in one layer causes correlations between the parents of that unit in the layer above. One of the main reasons why purely bottom-up approaches to perceptual inference have proven inadequate for learning in layered belief networks is that they fail to take into account this phenomenon, which is known as "explaining away."

Lee and Seung (1997) introduced a clever way of using lateral connections to handle explaining away effects during perceptual inference [7]. Consider the network shown in Fig. 2. One contribution, E_below, to the energy of the state of the network is the squared difference between the unrectified states of the units in one layer, y_j, and the top-down expectations generated by the states of units in the layer above. Assuming the local noise models for the lower layer units all have unit variance, and ignoring biases and constant terms that are unaffected by the states of the units,

    E_below = \frac{1}{2} \sum_j (y_j - \hat{y}_j)^2 = \frac{1}{2} \sum_j \left( y_j - \sum_k [y_k]^+ g_{kj} \right)^2     (4)

Rearranging this expression and setting r_{jk} = g_{kj} and m_{kl} = -\sum_j g_{kj} g_{lj}, we get

    E_below = \frac{1}{2} \sum_j y_j^2 - \sum_k [y_k]^+ \sum_j y_j r_{jk} - \frac{1}{2} \sum_k [y_k]^+ \sum_l [y_l]^+ m_{kl}     (5)

This energy function can be exactly implemented in a network with recognition weights, r_{jk}, and symmetric lateral interactions, m_{kl}. The lateral and recognition connections allow a unit, k, to compute how E_below for the layer below depends on its own state, and therefore they allow it to follow the gradient of E or to perform Gibbs sampling in E.

[Figure 2: A small segment of a network, showing the generative weights (dashed) and the recognition and lateral weights (solid) which implement perceptual inference and correctly handle explaining away effects.]

Seung's trick can be used in an RGBN, and it eliminates the most neurally implausible aspect of this model, which is that a unit in one layer appears to need to send both its state y and the top-down prediction of its state \hat{y} to units in the layer above. Using the lateral connections, the units in the layer above can, in effect, compute all they need to know about the top-down predictions. In computer simulations, we can simply set each lateral connection m_{kl} to be the dot product -\sum_j g_{kj} g_{lj}. It is also possible to learn these lateral connections in a more biologically plausible way by driving units in the layer below with unit-variance independent Gaussian noise and using a simple anti-Hebbian learning rule. Similarly, a purely local learning rule can learn recognition weights equal to the generative weights. If units at one layer are driven by unit-variance, independent Gaussian noise, and these in turn drive units in the layer below using the generative weights, then Hebbian learning between the two layers will learn the correct recognition weights [5].
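The identity behind Eq. 5 is easy to check numerically. The following sketch (ours) builds the recognition and lateral weights from a random generative matrix and verifies that the two forms of E_below agree:

    import numpy as np

    rng = np.random.default_rng(2)
    K, J = 5, 8                    # units in the upper and lower layer
    G = rng.normal(size=(K, J))    # generative weights g_{kj}
    y_above, y_below = rng.normal(size=K), rng.normal(size=J)
    r = np.maximum(y_above, 0.0)   # rectified states [y_k]^+

    R = G.T                        # recognition weights r_{jk} = g_{kj}
    M = -G @ G.T                   # lateral weights m_{kl} = -sum_j g_{kj} g_{lj}

    e4 = 0.5 * np.sum((y_below - r @ G)**2)                              # Eq. 4
    e5 = 0.5 * np.sum(y_below**2) - r @ (y_below @ R) - 0.5 * r @ M @ r  # Eq. 5
    assert np.isclose(e4, e5)      # the two forms agree up to rounding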
4 Lateral Connections in the Generative Model

When the generative model contains only top-down connections, lateral connections make it possible to do perceptual inference using locally available information. But it is also possible, and often desirable, to have lateral connections in the generative model. Such connections can cause nearby units in a layer to have a priori correlated activities, which in turn can lead to the formation of redundant codes and, as we will see, topographic maps.

Symmetric lateral interactions between the unrectified states of units within a layer have the effect of adding a quadratic term to the energy function

    E_MRF = \frac{1}{2} \sum_k \sum_l M_{kl} \, y_k y_l     (6)

which corresponds to a Gaussian Markov Random Field (MRF). During sampling, this term is simply added to the top-down energy contribution. Learning is more difficult. The difficulty stems from the need to know the derivatives of the partition function of the MRF for each data vector. This partition function depends on the top-down inputs to a layer, so it varies from one data vector to the next, even if the lateral connections themselves are non-adaptive. Fortunately, since both the MRF and the top-down prediction define Gaussians over the states of the units in a layer, these derivatives can be easily calculated. Assuming unit variances,

    \Delta g_{ji} = \epsilon \left( [y_j]^+ (y_i - \hat{y}_i) + [y_j]^+ \sum_k [M(I + M)^{-1}]_{ik} \, \hat{y}_k \right)     (7)

where M is the MRF matrix for the layer including units i and k, and I is the identity matrix. The first term is the delta rule (Eq. 3); the second term is the derivative of the partition function, which unfortunately involves a matrix inversion. Since the partition function for a multivariate Gaussian is analytical, it is also possible to learn the lateral connections in the MRF.

Lateral interactions between the rectified states of units add the quadratic term \frac{1}{2} \sum_k \sum_l M_{kl} [y_k]^+ [y_l]^+. The partition function is no longer analytical, so computing the gradient of the likelihood involves a two-phase Boltzmann-like procedure:

    \Delta g_{ji} = \epsilon \left( \langle [y_j]^+ y_i \rangle^* - \langle [y_j]^+ y_i \rangle^- \right)     (8)

where \langle\cdot\rangle^* averages with respect to the posterior distribution of y_i and y_j, and \langle\cdot\rangle^- averages with respect to the posterior distribution of y_j and the prior of y_i given units in the same layer as j. This learning rule suffers from all the problems of the Boltzmann machine, namely it is slow and requires two phases. However, there is an approximation which results in the familiar one-phase delta rule and which can be described in three equivalent ways: (1) it treats the lateral connections in the generative model as if they were additional lateral connections in the recognition model; (2) instead of lateral connections in the generative model it assumes some fictitious children with clamped values which affect inference but whose likelihood is not maximised during learning; (3) it maximises a penalized likelihood of the model without the lateral connections in the generative model.
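A sketch of the corrected update in Eq. 7 (our code; it assumes unit variances and a fixed, non-adaptive M, and uses an explicit matrix inversion exactly as Eq. 7 warns):

    import numpy as np

    def mrf_delta_rule(G, M, y_above, y_below, lr=0.05):
        """Update generative weights G[j, i] into a layer that carries a
        Gaussian MRF with matrix M (Eq. 7, unit variances assumed)."""
        r = np.maximum(y_above, 0.0)   # rectified states [y_j]^+
        y_hat = r @ G                  # top-down predictions
        # Derivative of the per-datum partition function: the matrix
        # inversion that Eq. 7 warns about.
        correction = M @ np.linalg.inv(np.eye(M.shape[0]) + M) @ y_hat
        G += lr * np.outer(r, (y_below - y_hat) + correction)
        return G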
5 Discovering depth in simplified stereograms

Consider the following generative process for stereo pairs. Random dots of uniformly distributed intensities are scattered sparsely on a one-dimensional surface, and the image is blurred with a Gaussian filter. This surface is then randomly placed at one of two different depths, giving rise to two possible left-to-right disparities between the images seen by each eye. Separate Gaussian noise is then added to the image seen by each eye. Some images generated in this manner are shown in Fig. 3a; a small code sketch of this process appears at the end of this section.

[Figure 3: a) Sample data from the stereo disparity problem. The left and right column of each 2 x 32 image are the inputs to the left and right eye, respectively. Periodic boundary conditions were used. The value of a pixel is represented by the size of the square, with white being positive and black being negative. Notice that pixel noise makes it difficult to infer the disparity, i.e. the vertical shift between the left and right columns, in some images. b) Sample images generated by the model after learning.]

We trained a three-layer RGBN consisting of 64 visible units, 64 units in the first hidden layer and 1 unit in the second hidden layer on the 32-pixel wide stereo disparity problem. Each of the hidden units in the first hidden layer was connected to the entire array of visible units, i.e. it had inputs from both eyes. The hidden units in this layer were also laterally connected in an MRF over the unrectified units. Nearby units excited each other and more distant units inhibited each other, with the net pattern of excitation/inhibition being a difference of two Gaussians. This MRF was initialised with large weights which decayed exponentially to zero over the course of training. The network was trained for 30 passes through a data set of 2000 images. For each image we used 16 iterations of Gibbs sampling to approximate the posterior distribution over hidden states. Each iteration consisted of sampling every hidden unit once in a random order. The states after the fourth iteration of Gibbs sampling were used for learning, with a learning rate of 0.05 and a weight decay parameter of 0.001. Since the top level of the generative process makes a discrete decision between left and right global disparity, we used a trivial extension of the RGBN in which the top level unit saturates both at 0 and 1.

[Figure 4: Generative weights of a three-layered RGBN after being trained on the stereo disparity problem. a) Weights from the top layer hidden unit to the 64 middle-layer hidden units. b) Biases of the middle-layer hidden units, and c) weights from the hidden units to the 2 x 32 visible array.]

Thirty-two of the hidden units learned to become local left-disparity detectors, while the other 32 became local right-disparity detectors (Fig. 4c). The unit in the second hidden layer learned positive weights to the left-disparity detectors in the layer below, and negative weights to the right detectors (Fig. 4a). In fact, the activity of this top unit discriminated the true global disparity of the input images with 99% accuracy. A random sample of images generated by the model after learning is shown in Fig. 3b. In addition to forming a hierarchical distributed representation of disparity, units in the hidden layer self-organised into a topographic map. The MRF caused high correlations between nearby units early in learning, which in turn resulted in nearby units learning similar weight vectors. The emergence of topography depended on the strength of the MRF and on the speed with which it decayed. Results were relatively insensitive to other parametric changes.

We also presented image patches taken from natural images [1] to a network with units in the first hidden layer arranged in a laterally-connected 2D grid. The network developed local feature detectors, with nearby units responding to similar features (Fig. 5). Not all units were used, but the unused units all clustered into one area.

[Figure 5: Generative weights of an RGBN trained on 12 x 12 natural image patches: weights from each of the 100 hidden units, which were arranged in a 10 x 10 sheet with toroidal boundary conditions.]
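For concreteness, here is a sketch of the stereo-pair generator described at the start of this section; the dot density, blur width, one-pixel disparities and noise level are our illustrative choices, not the paper's exact settings:

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    rng = np.random.default_rng(3)

    def make_stereo_pair(width=32, dot_prob=0.15, blur=1.0, noise_std=0.1):
        # Sparse random dots of uniformly distributed intensity.
        surface = rng.uniform(-1.0, 1.0, width) * (rng.random(width) < dot_prob)
        # Gaussian blur with periodic boundary conditions.
        surface = gaussian_filter1d(surface, sigma=blur, mode="wrap")
        # One of two depths -> one of two left-to-right disparities.
        shift = rng.choice([-1, 1])
        left = surface + rng.normal(0.0, noise_std, width)
        right = np.roll(surface, shift) + rng.normal(0.0, noise_std, width)
        return np.column_stack([left, right])  # one column per eye

    pair = make_stereo_pair()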
6 Discussion

Classical models of topography formation such as Kohonen's self-organising map [6] and the elastic net [2, 4] can be thought of as variations on mixture models where additional constraints have been placed to encourage neighboring hidden units to have similar generative weights. The problem with a mixture model is that it cannot handle images in which there are several things going on at once. In contrast, we have shown that topography can arise in much richer hierarchical and componential generative models by inducing correlations between neighboring units.

There is a sense in which topography is a necessary consequence of the lateral connection trick used for perceptual inference. It is infeasible to interconnect all pairs of units in a cortical area. If we assume that direct lateral interactions (or interactions mediated by interneurons) are primarily local, then widely separated units will not have the apparatus required for explaining away. Consequently the computation of the posterior distribution will be incorrect unless the generative weight vectors of widely separated units are orthogonal. If the generative weights are constrained to be positive, the only way two vectors can be orthogonal is for each to have zeros wherever the other has non-zeros. Since the redundancies that the hidden units are trying to model are typically spatially localised, it follows that widely separated units must attend to different parts of the image, and units can only attend to overlapping patches if they are laterally interconnected. The lateral connections in the generative model assist in the formation of the topography required for correct perceptual inference.

Acknowledgements. We thank P. Dayan, B. Frey, G. Goodhill, D. MacKay, R. Neal and M. Revow. The research was funded by NSERC and ITRC. GEH is the Nesbitt-Burns fellow of CIAR.

References

[1] A. Bell & T. J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, in press.
[2] R. Durbin & D. Willshaw. An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326(16):689-691, 1987.
[3] Z. Ghahramani & G. E. Hinton. The EM algorithm for mixtures of factor analyzers. Univ. Toronto Technical Report CRG-TR-96-1, 1996.
[4] G. J. Goodhill & D. J. Willshaw. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network: Comp. in Neur. Sys., 1:41-59, 1990.
[5] G. E. Hinton & Z. Ghahramani. Generative models for discovering sparse distributed representations. Philos. Trans. Roy. Soc. B, 352:1177-1190, 1997.
[6] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59-69, 1982.
[7] D. D. Lee & H. S. Seung. Unsupervised learning by convex and conic coding. In M. Mozer, M. Jordan, & T. Petsche, eds., NIPS 9. MIT Press, Cambridge, MA, 1997.
[8] M. S. Lewicki & T. J. Sejnowski. Bayesian unsupervised learning of higher order structure. In NIPS 9. MIT Press, Cambridge, MA, 1997.
[9] R. M. Neal. Connectionist learning of belief networks. Artif. Intell., 56:71-113, 1992.
[10] R. M. Neal & G. E. Hinton. A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript, 1993.
", "award": [], "sourceid": 1472, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}