{"title": "A Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 444, "page_last": 450, "abstract": "", "full_text": "A Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine\n\nLei Xu\n\n1. Computer Science Dept., The Chinese University of HK, Hong Kong\n\n2. National Machine Perception Lab, Peking University, Beijing\n\nAbstract\n\nA Bayesian-Kullback learning scheme, called the Ying-Yang Machine, is proposed based on two complementary but equivalent Bayesian representations of the joint density and on their Kullback divergence. The scheme not only unifies the major existing supervised and unsupervised learning methods, including classical maximum likelihood or least-squares learning, maximum information preservation, the EM and em algorithms and information geometry, the recently popular Helmholtz machine, and other learning methods, with new variants and new results; it also provides a number of new learning models.\n\n1 INTRODUCTION\n\nMany different learning models have been developed in the literature, and the time may have come to search for a scheme that unifies them. With a unified scheme, we can understand the existing models and their relationships more deeply, which may lead to cross-fertilization among them and hence to new results and variants. We can also be guided in developing new learning models once we understand better which cases have already been studied and which have been missed and deserve further exploration.\n\nRecently, a Bayesian-Kullback scheme, called the YING-YANG Machine, has been proposed as one such effort (Xu, 1995a). 
It is based on the Kullback divergence and on two complementary but equivalent Bayesian representations of the joint distribution of the input space and the representation space, instead of merely using the Kullback divergence to match unstructured joint densities as in information-geometry-type learning (Amari, 1995a&b; Byrne, 1992; Csiszar, 1975). The two representations consist of four different components. Different combinations of choices for each component lead the YING-YANG Machine to different learning models. Thus, it acts as a general learning scheme that unifies the major existing unsupervised and supervised learning methods. As shown in Xu (1995a), one special case reduces to the EM algorithm (Dempster et al., 1977; Hathaway, 1986; Neal & Hinton, 1993) and the closely related information geometry theory and em algorithm (Amari, 1995a&b), to the MDL autoencoder with a \"bits-back\" argument by Hinton & Zemel (1994) and its alternative equivalent form that minimizes the bits of uncoded residual errors and the unused bits in the transmission channel's capacity (Xu, 1995d), as well as to multisets modeling learning (Xu, 1995e), a unified learning framework for clustering, PCA-type learning and the self-organizing map. Another special case reduces to maximum information preservation (Linsker, 1989; Atick & Redlich, 1990; Bell & Sejnowski, 1995). More interestingly, yet another special case reduces to the Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995) with new understandings. Moreover, the YING-YANG Machine also includes maximum likelihood or least-squares learning. 
\n\nFurthermore, the YING-YANG Machine has also been extended to temporal patterns, with a number of new models for signal modeling. Some of them extend the Helmholtz machine or maximum information preservation learning to temporal processing. Others include and extend the hidden Markov model (HMM) and the AMAR and AR models (Xu, 1995b). In addition, it has been shown in Xu (1995a&c, 1996a) that one special case of the YING-YANG Machine provides three variants for clustering or VQ, in particular with criteria and an automatic procedure for selecting the number of clusters in clustering analysis or Gaussian mixtures, a classical problem that has remained open for decades.\n\nIn this paper, we present a deeper and more systematic study. Section 2 redescribes the unified scheme on a more precise and systematic basis by discussing the possible marital statuses of the two Bayesian representations of the joint density. Section 3 summarizes and explains the existing models under the unified scheme; in particular, we clarify a confusion made in the previous papers (Xu, 1995a&b) on maximum information preservation learning. Section 4 proposes and summarizes a number of possible new models suggested by the unified scheme.\n\n2 BAYESIAN-KULLBACK YING-YANG MACHINE\n\nAs argued in Xu (1995a), unsupervised and supervised learning problems can be summarized as the problem of estimating the joint density $P(x, y)$ of patterns in the input space $X$ and the representation space $Y$, as shown in Fig. 1. Under the Bayesian framework, we have two representations for $P(x, y)$. One is $P_{M_1}(x, y) = P_{M_1}(y|x) P_{M_1}(x)$, implemented by a model $M_1$ called the YANG (male) part, since it performs the task of transferring a pattern (a real body) into a code (a seed). The other is $P_{M_2}(x, y) = P_{M_2}(x|y) P_{M_2}(y)$, implemented by a model $M_2$ called the YING part, since it performs the task of generating a pattern (a real body) from a code (a seed). They are complementary to each other and together implement an entire circle $x \\to y \\to x$. This accords with the ancient Chinese YING-YANG philosophy.\n\nHere we have four components: $P_{M_1}(x)$, $P_{M_1}(y|x)$, $P_{M_2}(x|y)$ and $P_{M_2}(y)$. $P_{M_1}(x)$ can be fixed at some density estimate on the input data; for example, we have at least two choices, the Parzen window estimate $P_h(x)$ or the empirical estimate $P_0(x)$:\n\n$$P_h(x) = \\frac{1}{N h^d} \\sum_{i=1}^{N} K\\left(\\frac{x - x_i}{h}\\right), \\qquad P_0(x) = \\lim_{h \\to 0} P_h(x) = \\frac{1}{N} \\sum_{i=1}^{N} \\delta(x - x_i). \\qquad (1)$$\n\nFor $P_{M_1}(y|x)$ and $P_{M_2}(x|y)$, each has three choices: (1) from a parametric family specified by model $M_1$ or $M_2$; (2) free of model, with $P_{M_1}(y|x) = P(y|x)$ or $P_{M_2}(x|y) = P(x|y)$; (3) broken channel, with $P_{M_1}(y|x) = P_{M_1}(y)$ or $P_{M_2}(x|y) = P_{M_2}(x)$. Finally, $P_{M_2}(y)$, with its $y$ consistent with $P_{M_1}(y|x)$, can also be from a parametric family or free of model. Any combination of the choices for the four components forms a potential YING-YANG pair. We have at least $2 \\times 3 \\times 3 \\times 2 = 36$ pairs.\n\nFigure 1: The joint spaces X, Y and the YING-YANG Machine. (Labels recoverable from the figure: representation space Y, holding symbols, integers or binary codes, with $P_{M_2}(y)$; encoding, recognition, representation; decoding, generating, reconstruction.)\n\nA YING-YANG pair has four types of marital status: (a) marry, i.e., YING and YANG match each other; (b) divorce, i.e., YING and YANG go away from each other; (c) YING chases YANG, but YANG escapes; (d) YANG chases YING, but YING escapes. 
The four types can be described by a combination of minimization (chasing) and maximization (escaping) on one of the two Kullback divergences below:\n\n$$K(M_1, M_2) = \\int_{x,y} P_{M_1}(y|x) P_{M_1}(x) \\log \\frac{P_{M_1}(y|x) P_{M_1}(x)}{P_{M_2}(x|y) P_{M_2}(y)} \\, dx \\, dy \\qquad (2a)$$\n\n$$K(M_2, M_1) = \\int_{x,y} P_{M_2}(x|y) P_{M_2}(y) \\log \\frac{P_{M_2}(x|y) P_{M_2}(y)}{P_{M_1}(y|x) P_{M_1}(x)} \\, dx \\, dy \\qquad (2b)$$\n\n[Table 1: marital statuses of a YING-YANG pair; the table body is missing from the source.]\n\nNotes to Table 1: $K(M_1, M_2)$ can be replaced by $K(M_2, M_1)$ in the table. The 2nd and 3rd columns are for (c) and (d) respectively; each has two cases depending on who starts the act, and the two are usually not equivalent. Their results are undefined, depending on the initial condition for $M_1$, $M_2$, except for two special cases: (i) free $P_{M_1}(y|x)$ and parametric $P_{M_2}(x|y)$, with $\\min_{M_2} \\max_{M_1} K$ being the same as (b) with broken $P_{M_1}(y|x)$, and with $\\max_{M_2} \\min_{M_1} K$ defined but useless; (ii) free $P_{M_2}(x|y)$ and parametric $P_{M_1}(y|x)$, with $\\min_{M_1} \\max_{M_2} K$ being the same as case (a) with broken $P_{M_2}(x|y)$, and with $\\max_{M_1} \\min_{M_2} K$ defined but useless.\n\nTherefore, we will focus on the statuses marry and divorce. Even so, not all of the above-mentioned $2 \\times 3 \\times 3 \\times 2 = 36$ YING-YANG pairs provide sensible learning models, although $\\min_{M_1,M_2} K$ and $\\max_{M_1,M_2} K$ are always well defined. Fortunately, quite a number of them do lead to useful learning models, as will be shown in the subsequent sections.\n\nWe can implement $\\min_{M_1,M_2} K(M_1, M_2)$ by the following Alternative Minimization (ALTMIN) procedure:\n\nStep 1: Fix $M_2 = M_2^{old}$, to get $M_1^{new} = \\arg\\min_{M_1} K(M_1, M_2^{old})$.\nStep 2: Fix $M_1 = M_1^{old}$, to get $M_2^{new} = \\arg\\min_{M_2} K(M_1^{old}, M_2)$.\n\nThe ALTMIN iteration will finally converge to a local minimum of $K(M_1, M_2)$. A similar procedure for $\\max_{M_1,M_2} K(M_1, M_2)$ is obtained by replacing Min with Max. 
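
The ALTMIN procedure above can be sketched in code. The following is a minimal illustration and not the paper's implementation: it assumes a one-dimensional Gaussian mixture for M2, a free-form table for P_M1(y|x), and the empirical density for P_M1(x), so that K(M1, M2) is tracked only up to the constant D. The names gauss, kb and altmin are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def gauss(x, mu, var):
    # Gaussian density N(x; mu, var), broadcast over points and components
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def kb(x, post, alpha, mu, var):
    # K(M1, M2) up to the constant D, with P_M1(x) the empirical density:
    # (1/N) sum_i sum_y P_M1(y|x_i) log[ P_M1(y|x_i) / (P_M2(x_i|y) P_M2(y)) ]
    joint2 = alpha * gauss(x[:, None], mu, var)
    safe = np.clip(post, 1e-300, None)        # avoids log(0); 0 * log(.) stays 0
    return float(np.mean(np.sum(post * (np.log(safe) - np.log(joint2)), axis=1)))

def altmin(x, k=2, iters=30, seed=0):
    # Assumed setting: free P_M1(y|x), Gaussian-mixture M2 (illustrative only)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    post = rng.dirichlet(np.ones(k), size=n)  # initial P_M1(y|x_i), one row per point
    alpha = np.full(k, 1.0 / k)               # P_M2(y)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    history = [kb(x, post, alpha, mu, var)]
    for _ in range(iters):
        # Step 1: fix M2, minimise K over the free P_M1(y|x)
        joint2 = alpha * gauss(x[:, None], mu, var)
        post = joint2 / joint2.sum(axis=1, keepdims=True)
        history.append(kb(x, post, alpha, mu, var))
        # Step 2: fix M1, minimise K over the parameters of M2
        w = post.sum(axis=0)
        alpha = w / n
        mu = (post * x[:, None]).sum(axis=0) / w
        var = (post * (x[:, None] - mu) ** 2).sum(axis=0) / w
        history.append(kb(x, post, alpha, mu, var))
    return alpha, mu, var, history
```

Under these assumptions each half-step is an exact minimisation over one model with the other held fixed, so the recorded values of K never increase; replacing each minimisation by a maximisation would give the corresponding Max procedure.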
\n\nSince the above scheme is based on the two complementary YING and YANG Bayesian representations and on their Kullback divergence for their marital status, we call it the Bayesian-Kullback YING-YANG learning scheme. Furthermore, under this scheme we call each obtained YING-YANG pair that is sensible for learning purposes a Bayesian-Kullback YING-YANG Machine, or YING-YANG machine for short.\n\n3 UNIFIED EXISTING LEARNINGS\n\nLet $P_{M_1}(x) = P_0(x)$ by eq. (1) and put it into eq. (2); after some algebra we get $K(M_1, M_2) = h_{M_1} - ha_{M_1} - q_{M_1,2} + D$, with $D$ independent of $M_1$, $M_2$ and with $h_{M_1}$, $ha_{M_1}$, $q_{M_1,2}$ given by Eqs. (E1), (E2) and (E4) in Table 2, respectively. The larger $h_{M_1}$ is, the more discriminative or separable are the representations in Y for the input data set. The larger $ha_{M_1}$ is, the more concentrated are the representations in Y. The larger $q_{M_1,2}$ is, the better $P_{M_2}(x|y)$ fits the input data.\n\nTherefore, $\\min_{M_1,M_2} K(M_1, M_2)$ consists of (1) best fitting of $P_{M_2}(x|y)$ to the input data via $\\max q_{M_1,2}$, which is desirable; (2) producing more concentrated representations in Y so as to occupy fewer resources, which is also desirable and is the underlying reason why the scheme can solve the problem of selecting the cluster number in clustering analysis (Xu, 1995a&c, 1996a); (3) but at the cost of less discriminative representations in Y for the input data. Conversely, $\\max_{M_1,M_2} K(M_1, M_2)$ consists of (1) producing the most discriminative or separable representation $P_{M_1}(y|x)$ in Y for the input data set, which is desirable, at the cost of (2) producing a more uniform representation in Y that fully occupies the resources, and (3) driving $P_{M_2}(x|y)$ away from fitting the input data.\n\nTable 2 shows the existing unsupervised learning methods unified under the scheme. 
For the case H-f-W, we have $h_{M_1} = h$, $ha_{M_1} = ha$, $q_{M_1,2} = q_{M_2}$, and $\\min_{M_1} K(M_1, M_2)$ results in $P_{M_2}(y) = P_{M_1}(y) = \\alpha_y$ and $P_{M_2}(x|y) P_{M_2}(y) = P_{M_2}(x) P_{M_1}(y|x)$ with $P_{M_2}(x) = \\sum_{y=1}^{k} P_{M_2}(x|y) P_{M_2}(y)$. In turn, we get $K(M_1, M_2) = -L_{M_2} + D$ with $L_{M_2}$ being the likelihood given by eq. (E5), i.e., we get maximum likelihood estimation on a mixture model. In fact, the ALTMIN given in Table 2 leads exactly to the EM algorithm of Dempster et al. (1977). Also, here $P_{M_1}(x, y)$ and $P_{M_2}(x, y)$ are equivalent to the data submanifold D and the model submanifold M in information geometry theory (Amari, 1995a&b), with the ALTMIN being the em algorithm. As shown in Xu (1995a), this case also includes the MDL autoencoder (Hinton & Zemel, 1994) and multisets modeling (Xu, 1995e).\n\nFor the case Single-M, $h_{M_1} - ha_{M_1}$ is actually the information transmitted by the YANG part from x to y. In this case, its minimization produces a non-sensible model for learning. However, its maximization is exactly the Infomax learning scheme (Linsker, 1989; Atick & Redlich, 1990; Bell & Sejnowski, 1995). Here, we clear up a confusion made in Xu (1995a&b), where the minimization was mistakenly considered.\n\nFor the case H-m-W, $h_{M_1} - ha_{M_1} - q_{M_1,2}$ is just the $-F(d; \\theta, Q)$ used by Dayan et al. (1995) and Hinton et al. (1995) for the Helmholtz machine. The detailed correspondence is that (i) $P_{M_1}(y|x_i)$ here is their $Q_\\alpha$; (ii) $\\log P_{M_2}(x, y)$ is their $-E_\\alpha$; and (iii) their $P_\\alpha$ is $P_{M_2}(y|x) = P_{M_2}(x|y) P_{M_2}(y) / \\sum_y P_{M_2}(x|y) P_{M_2}(y)$. So we get a new perspective on the Helmholtz machine. Moreover, we know that $K(M_1, M_2)$ becomes a negative likelihood only when $P_{M_2}(x|y) P_{M_2}(y) = P_{M_2}(x) P_{M_1}(y|x)$, which is usually not true when the YANG and YING parts are both parametric. 
So the Helmholtz machine is not equivalent to maximum likelihood learning in general, with a gap depending on $P_{M_2}(x|y) P_{M_2}(y) - P_{M_2}(x) P_{M_1}(y|x)$. The equivalence is approximately acceptable only when the family of $P_{M_2}(x|y)$ and/or $P_{M_1}(y|x)$ is large enough, or when $M_2$, $M_1$ are both linear with Gaussian density.\n\nIn Table 4, the case Single-M under $K(M_2, M_1)$ is classical maximum likelihood (ML) learning for supervised learning, which includes least-squares learning by back propagation (BP) for a feedforward net as a special case. Moreover, its counterpart for a backward net as inverse mapping is the case Single-F under $K(M_1, M_2)$.\n\nTable 2: BKC-YY Machine for Unsupervised Learning (Part I): $K(M_1, M_2)$\n\nGiven data $\\{x_i\\}_{i=1}^{N}$, fix $P_{M_1}(x) = P_0(x)$ by eq. (1), and thus $K(M_1, M_2) = K_b + D$, with $D$ irrelevant to $M_1$, $M_2$ and $K_b$ given by the following formulae and table:\n\n(E1) $h = \\frac{1}{N} \\sum_{i,y} P(y|x_i) \\log P(y|x_i)$, $h_{M_1} = \\frac{1}{N} \\sum_{i,y} P_{M_1}(y|x_i) \\log P_{M_1}(y|x_i)$\n(E2) $ha_{M_1} = \\sum_y \\alpha_y^{M_1} \\log \\alpha_y^{M_1}$, $ha = \\sum_y \\alpha_y \\log \\alpha_y$, $\\alpha_y^{M_1} = \\frac{1}{N} \\sum_i P_{M_1}(y|x_i)$\n(E3) $\\alpha_y = \\frac{1}{N} \\sum_i P(y|x_i)$, $P(y|x_i) = \\alpha_y P_{M_2}(x_i|y) / \\sum_y \\alpha_y P_{M_2}(x_i|y)$\n(E4) $q_{M_1,2} = \\frac{1}{N} \\sum_{i,y} P_{M_1}(y|x_i) \\log P_{M_2}(x_i|y)$, $q_{M_2} = \\frac{1}{N} \\sum_{i,y} P(y|x_i) \\log P_{M_2}(x_i|y)$\n(E5) $L'_{M_2} = \\frac{1}{N} \\sum_{i,y} \\alpha_y \\log P_{M_2}(x_i|y)$, $L_{M_2} = \\frac{1}{N} \\sum_i \\log \\sum_y \\alpha_y P_{M_2}(x_i|y)$\n\nMarriage status | Single-M | Single-F | H-m-W | W-f-H | H-f-W\nCondition | $P_{M_2}(y) = P_{M_1}(y)$, $P_{M_2}(x|y) = P_{M_2}(x) = P_0(x)$ | $P_{M_1}(y|x) = P_{M_1}(y)$ | $P_{M_1}(y|x)$ and $P_{M_2}(x|y)$ | Uniform $P_{M_2}(y)$, and free $P_{M_2}(x|y) = P(x|y)$ | $P_{M_1}(y|x)$ free, i.e., $P_{M_1}(y|x) = P(y|x)$\n$K_b$ | $h_{M_1} - ha_{M_1}$ (max) | $-L'_{M_2}$ (min) | $h_{M_1} - ha_{M_1} - q_{M_1,2}$ (min) | $ha_{M_1}$ (min) | $h - ha - q_{M_2} = -L_{M_2}$ (min)\nALTMIN | Get $M_1$ by max $[h_{M_1} - ha_{M_1}]$. No repeat. | Get $M_2$ by max $L'_{M_2}$. No repeat. | S1: Fix $M_2$, get $M_1$ by min $[h_{M_1} - ha_{M_1} - q_{M_1,2}]$. S2: Fix $M_1$, get $M_2$ by max $q_{M_1,2}$. Repeat S1, S2. | Get $M_1$ by min $ha_{M_1}$. No repeat. | S1: Fix $M_2$, get $P(y|x_i)$ and $\\alpha_y$ by (E3), $\\alpha_y^{M_1}$ by (E2). S2: Get $M_2$ by max $q_{M_2}$. Repeat S1, S2.\nExisting equivalent models | Infomax, maximum mutual information (Lin89) (Ati90) (Bel95) | Duplicated models by ML learning on input data | Helmholtz machine (Hin95) (Day95) | Related to PCA | 1. ML on mixtures & EM (Dem77); 2. Information geometry (Amari95); 3. MDL autoencoder (Hin94); 4. Multisets modeling (Xu94, 95)\n\nNew results:\n1. For the H-f-W type: three VQ variants when $P_{M_2}(x|y)$ is Gaussian; also, criteria for selecting the correct k for VQ or clustering (Xu, 1995a&c).\n2. For the H-m-W type: robust PCA plus a criterion for determining the subspace dimension (Xu, 1995c).\n\nVariants:\n1. Smoother $P_{M_1}(x)$ given by the Parzen window estimate.\n2. Factorial coding $P_{M_2}(y) = \\prod_j P_{M_2}(y_j)$ with binary $y = [y_1, \\ldots, y_m]$.\n3. Factorial coding $P_{M_1}(y|x) = \\prod_j P_{M_1}(y_j|x)$ with binary $y = [y_1, \\ldots, y_m]$.\n4. Replace '$\\sum_y \\cdot$' in all the above items by '$\\int \\cdot \\, dy$' for real y.\n\nNote: H: Husband, W: Wife, f: follows, M: Male, F: Female, m: matches. X-f-Y stands for the X part being free. Single-X stands for the other part being broken. H-m-W stands for both parts being parametric. '(min)' stands for min $K_b$ and '(max)' stands for max $K_b$.\n\nTable 3: BKC-YY Machine for Unsupervised Learning (Part II): $K(M_2, M_1)$\n\nGiven data $\\{x_i\\}_{i=1}^{N}$, fix $P_{M_1}(x) = P_0(x)$ by eq. (1), and thus $K(M_2, M_1) = K_b + D$, with $D$ irrelevant to $M_1$, $M_2$ and $K_b$ given by the following formulae and table:\n\n[Formulae (E6) to (E10) are missing from the source; only their labels survive.]\n\nMarriage status | H-f-W | Single-M | Single-M (if forcing $P_{M_1}(y) = P_{M_2}(y)$) | Single-F | H-m-W | W-f-H\nCondition | The same as those in Table 2.\n$K_b$ | $h_{M_2}$ (max) | $ha_{M_2} - L_{M_1,2}$ (min) | $ha_{M_1} - L_{M_1}$ (min) | $h_{M_2}$ (max) | $h_{M_2} + ha_{M_1} - q_{M_2,1}$ (min) | $-L\\alpha_{M_1}$ (min)\nALTMIN | Get $M_2$ by max $h_{M_2}$. No repeat. | S1: Fix $M_1$, get $\\alpha_y^{M_2}$ by (E7). S2: Update $M_1$ by max $L_{M_1,2}$. Repeat S1, S2. | S1: Fix $M_1$, get $\\alpha_y^{M_1}$ by (E2) in Table 2. S2: Update $M_1$ by max $L_{M_1}$. Repeat S1, S2. | Get $M_2$ by max $h_{M_2}$. No repeat. | S1: Fix $M_2$, get $M_1$ by min $[ha_{M_1} - q_{M_2,1}]$. S2: Fix $M_1$, get $M_2$ by min $[h_{M_2} - q_{M_2,1}]$. Repeat S1, S2. | Get $M_1$ by max $L\\alpha_{M_1}$. No repeat.\nExisting models | None (new) | None (new) | None (new) | None (new) | None (new) | None (new)\nVariants | Similar to those in Table 2.\n\nTable 4: BKC-YY Machine for Supervised Learning\n\nGiven data $\\{x_i, y_i\\}_{i=1}^{N}$, fix $P_{M_1}(x) = P_0(x)$ by eq. (1).\n\n(E11) $h_{M_1} = \\frac{1}{N} \\sum_i P_{M_1}(y_i|x_i) \\log P_{M_1}(y_i|x_i)$, $h_{M_2} = \\frac{1}{N} \\sum_i P_{M_2}(x_i|y_i) \\log P_{M_2}(x_i|y_i)$\n(E12) $q_{M_1,2} = \\frac{1}{N} \\sum_i P_{M_1}(y_i|x_i) \\log P_{M_2}(x_i|y_i)$, $q_{M_2,1} = \\frac{1}{N} \\sum_i P_{M_2}(x_i|y_i) \\log P_{M_1}(y_i|x_i)$\n(E13) $L_{M_1} = \\frac{1}{N} \\sum_i \\log P_{M_1}(y_i|x_i)$, $L_{M_2} = \\frac{1}{N} \\sum_i \\log P_{M_2}(x_i|y_i)$\n\n$K(M_1, M_2) = K_b + D$:\nMarriage status | Single-M | Single-F | H-m-W\n$K_b$ | $h_{M_1}$ (max) | $-L_{M_2}$ (min) | $h_{M_1} - q_{M_1,2}$ (min)\nFeature | Minimum entropy (ME) F-net | ML B-net | Mixed F-B net\nExisting models | None (new) | BP on B-net | None (new)\n\n$K(M_2, M_1) = K_b + D$:\nMarriage status | Single-M | Single-F | H-m-W\n$K_b$ | $-L_{M_1}$ (min) | $h_{M_2}$ (max) | $h_{M_2} - q_{M_2,1}$ (min)\nFeature | ML F-net | Minimum entropy B-net | Mixed B-F net\nExisting models | BP on F-net | None (new) | None (new)\n\n4 NEW LEARNING MODELS\n\nFirst, a number of variants of the above existing models are given in Table 2.\n\nSecond, a particular new model can be obtained from the case H-m-W by changing $\\min_{M_1,M_2}$ into $\\max_{M_1,M_2}$. That is, we have $\\max_{M_1,M_2} [h_{M_1} - ha_{M_1} - q_{M_1,2}]$, denoted H-m-W-Max for short. This model is a dual of the Helmholtz machine, in that it focuses on getting the best discriminative or separable representations $P_{M_1}(y|x)$ in Y instead of the best fit of $P_{M_2}(x|y)$ to the input data.\n\nThird, by replacing $K(M_1, M_2)$ with $K(M_2, M_1)$, in Table 3 we obtain new models that are the counterparts of those given in Table 2. For the case H-f-W, its $\\max_{M_1,M_2}$ gives a minimum entropy estimate of $P_{M_2}(x)$ instead of the maximum likelihood estimate of $P_{M_2}(x)$ in Table 2. For the case Single-M, it functions similarly to the case Single-F in Table 2, but with minimum entropy on $P_{M_1}(y|x)$ in Table 2 replaced by maximum likelihood on $P_{M_1}(y|x)$ here. For the case H-m-W, the focus shifts from getting the best fit of $P_{M_2}(x|y)$ to the input data to getting the best discriminative representations $P_{M_1}(y|x)$ in Y, which is similar to the just-mentioned H-m-W-Max, but with minimum entropy on $P_{M_1}(y|x)$ replaced by maximum likelihood on $P_{M_1}(y|x)$. The other two cases in Table 3 have also been changed similarly from those in Table 2.\n\nFourth, several new models have also been proposed in Table 4 for supervised learning. Instead of maximum likelihood, the new models suggest learning by minimum entropy or by a mix of maximum likelihood and minimum entropy.\n\nFinally, further studies on the other statuses in Table 1 are needed. Heuristically, we can also treat the case H-m-W in two separate steps. We first get $M_1$ by $\\max [h_{M_1} - ha_{M_1}]$ and then get $M_2$ by $\\max q_{M_1,2}$; or we first get $M_2$ by $\\min [h - ha - q_{M_2}]$ and then get $M_1$ by $\\min [h_{M_1} - ha_{M_1} - q_{M_1,2}]$. The two algorithms attempt to get both a good discriminative representation by $P_{M_1}(y|x)$ and a good fit of $P_{M_2}(x|y)$ to the input data. However, whether they work well needs to be tested experimentally.\n\nWe are currently conducting experiments that compare several of the above new models against their existing counterparts.\n\nAcknowledgements\n\nThe work was supported by the HK RGC Earmarked Grant CUHK250/94E.\n\nReferences\n\nAmari, S. (1995a) [Amari95], \"Information geometry of the EM and em algorithms for neural networks\", Neural Networks 8, to appear.\nAmari, S. (1995b), Neural Computation 7, pp13-18.\nAtick, J. J. & Redlich, A. N. (1990) [Ati90], Neural Computation Vol. 2, No. 3, pp308-320.\nBell, A. J. & Sejnowski, T. J. (1995) [Bel95], Neural Computation Vol. 7, No. 6, pp1129-1159.\nByrne, W. (1992), IEEE Trans. Neural Networks 3, pp612-620. 
\nCsiszar, I. (1975), Annals of Probability 3, pp146-158.\nDayan, P., Hinton, G. E., & Neal, R. M. (1995) [Day95], Neural Computation Vol. 7, No. 5, pp889-904.\nDempster, A. P., Laird, N. M., & Rubin, D. B. (1977) [Dem77], J. Royal Statist. Society B39, pp1-38.\nHathaway, R. J. (1986), Statistics & Probability Letters 4, pp53-56.\nHinton, G. E., et al. (1995) [Hin95], Science 268, pp1158-1160.\nHinton, G. E. & Zemel, R. S. (1994) [Hin94], Advances in NIPS 6, pp3-10.\nLinsker, R. (1989) [Lin89], Advances in NIPS 1, pp186-194.\nNeal, R. M. & Hinton, G. E. (1993), A new view of the EM algorithm that justifies incremental and other variants, preprint.\nXu, L. (1996a), \"How Many Clusters?: A YING-YANG Machine Based Theory For A Classical Open Problem In Pattern Recognition\", to appear in Proc. IEEE ICNN96.\nXu, L. (1995a), \"YING-YANG Machine: a Bayesian-Kullback scheme for unified learnings and new results on vector quantization\", Keynote talk, Proc. Intl Conf. on Neural Information Processing (ICONIP95), Oct 30 - Nov. 3, 1995, pp977-988.\nXu, L. (1995b), \"YING-YANG Machine for Temporal Signals\", Keynote talk, Proc. IEEE Intl Conf. Neural Networks & Signal Processing, Vol. I, pp644-651, Nanjing, 10-13, 1995.\nXu, L. (1995c), \"New Advances on The YING-YANG Machine\", Invited paper, Proc. of 1995 Intl. Symposium on Artificial Neural Networks, ppIS07-12, Dec. 18-20, Taiwan.\nXu, L. (1995d), \"Cluster Number Selection, Adaptive EM Algorithms and Competitive Learnings\", Invited paper, Proc. Intl Conf. on Neural Information Processing (ICONIP95), Oct 30 - Nov. 3, 1995, Vol. II, pp1499-1502.\nXu, L. (1995e), Invited paper, Proc. WCNN95, Vol. I, pp35-42. Also, Invited paper, Proc. IEEE ICNN 1994, ppI315-320.\nXu, L. & Jordan, M. I. (1993), Proc. of WCNN '93, Portland, OR, Vol. II, pp431-434.\n", "award": [], "sourceid": 1065, "authors": [{"given_name": "Lei", "family_name": "Xu", "institution": null}]}