{"title": "A Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 444, "page_last": 450, "abstract": "", "full_text": "A Unified Learning Scheme: \n\nBayesian-Kullback Ying-Yang Machine \n\n1. Computer Science Dept., The Chinese University of HK, Hong Kong \n\n2. National Machine Perception Lab, Peking University, Beijing \n\nLei Xu \n\nAbstract \n\nA Bayesian-Kullback learning scheme, called Ying-Yang Machine, \nis proposed based on the two complement but equivalent Bayesian \nrepresentations for joint density and their Kullback divergence. \nNot only the scheme unifies existing major supervised and unsu(cid:173)\npervised learnings, including the classical maximum likelihood or \nleast square learning, the maximum information preservation, the \nEM & em algorithm and information geometry, the recent popular \nHelmholtz machine, as well as other learning methods with new \nvariants and new results; but also the scheme provides a number \nof new learning models. \n\n1 \n\nINTRODUCTION \n\nMany different learning models have been developed in the literature. We may \ncome to an age of searching a unified scheme for them. With a unified scheme, \nwe may understand deeply the existing models and their relationships, which may \ncause cross-fertilization on them to obtain new results and variants; We may also be \nguided to develop new learning models, after we get better understanding on which \ncases we have already studied or missed, which deserve to be further explored. \n\nRecently, a Baysian-Kullback scheme, called the YING-YANG Machine, has been \nproposed as such an effort(Xu, 1995a). 
It is based on the Kullback divergence and two complementary but equivalent Bayesian representations for the joint distribution of the input space and the representation space, instead of merely using the Kullback divergence for matching unstructured joint densities as in information-geometry-type learning (Amari, 1995a&b; Byrne, 1992; Csiszar, 1975). The two representations consist of four different components. Different combinations of choices for each component lead the YING-YANG Machine to different learning models. Thus, it acts as a general learning scheme for unifying the existing major unsupervised and supervised learning methods. As shown in Xu (1995a), one special case reduces to the EM algorithm (Dempster et al., 1977; Hathaway, 1986; Neal & Hinton, 1993) and the closely related Information Geometry theory and em algorithm (Amari, 1995a&b); another reduces to the MDL autoencoder with the \"bits-back\" argument of Hinton & Zemel (1994) and its equivalent alternative form that minimizes the bits of uncoded residual errors and the unused bits in the transmission channel's capacity (Xu, 1995d); a further case reduces to Multisets modeling learning (Xu, 1995e), a unified learning framework for clustering, PCA-type learning and the self-organizing map. Another special case reduces to maximum information preservation (Linsker, 1989; Atick & Redlich, 1990; Bell & Sejnowski, 1995). More interestingly, yet another special case reduces to the Helmholtz machine (Dayan et al., 1995; Hinton et al., 1995), with new understandings. Moreover, the YING-YANG Machine also includes maximum likelihood and least square learning. \n\nFurthermore, the YING-YANG Machine has also been extended to temporal patterns, with a number of new models for signal modeling. Some of them are extensions of the Helmholtz machine or of maximum information preservation learning to temporal processing. 
Some of them include and extend the Hidden Markov Model (HMM) and the ARMA and AR models (Xu, 1995b). In addition, it has been shown in Xu (1995a&c, 1996a) that one special case of the YING-YANG Machine can provide us with three variants for clustering or VQ, in particular with criteria and an automatic procedure developed for selecting the number of clusters in clustering analysis or Gaussian mixtures - a classical problem that has remained open for decades. \n\nIn this paper, we present a deeper and more systematic study. Section 2 re-describes the unified scheme on a more precise and systematic basis by discussing the possible marital statuses of the two Bayesian representations for the joint density. Section 3 summarizes and explains the existing models under the unified scheme; in particular, we clarify some confusion made in previous papers (Xu, 1995a&b) on maximum information preservation learning. Section 4 proposes and summarizes a number of possible new models suggested by the unified scheme. \n\n2 BAYESIAN-KULLBACK YING-YANG MACHINE \n\nAs argued in Xu (1995a), unsupervised and supervised learning problems can be summarized as the problem of estimating the joint density P(x, y) of patterns in the input space X and the representation space Y, as shown in Fig. 1. Under the Bayesian framework, we have two representations for P(x, y). One is P_M1(x, y) = P_M1(y|x)P_M1(x), implemented by a model M1 called the YANG (male) part, since it performs the task of transferring a pattern (a real body) into a code (a seed). The other is P_M2(x, y) = P_M2(x|y)P_M2(y), implemented by a model M2 called the YING part, since it performs the task of generating a pattern (a real body) from a code (a seed). They are complementary to each other and together implement an entire circle x → y → x. This corresponds to the ancient Chinese YING-YANG philosophy. Here we have four components P_M1(x), P_M1(y|x), P_M2(x|y) and P_M2(y). The P_M1
(x) can be fixed at some density estimate on the input data; e.g., we have at least two choices - the Parzen window estimate P_h(x) or the empirical estimate P_0(x): \n\nP_h(x) = (1/(N h^d)) Σ_{i=1}^N K((x − x_i)/h),   P_0(x) = lim_{h→0} P_h(x) = (1/N) Σ_{i=1}^N δ(x − x_i).   (1) \n\nFor P_M1(y|x) and P_M2(x|y), each can have three choices: (1) from a parametric family specified by model M1 or M2; (2) free of model, with P_M1(y|x) = P(y|x) or P_M2(x|y) = P(x|y); (3) a broken channel, P_M1(y|x) = P_M1(y) or P_M2(x|y) = P_M2(x). Finally, P_M2(y), with its y consistent with P_M1(y|x), can also be from a parametric family or free of model. Any combination of the choices of the four components forms a potential YING-YANG pair. We have at least 2 x 3 x 3 x 2 = 36 pairs. A YING-YANG pair has four types of marital status: (a) marry, i.e., YING and YANG match each other; (b) divorce, i.e., YING and YANG go away from each other; (c) YING chases YANG, YANG escapes; (d) YANG chases YING, but YING escapes. \n\n[Figure 1: The joint spaces X, Y and the YING-YANG Machine. The YANG part maps the input space X into the representation space Y (symbols, integers, binary codes) by encoding/recognition/representation; the YING part maps Y back to X by decoding/generating/reconstruction.] \n\nThe four types can be described by a combination of minimization (chasing) and maximization (escaping) of one of the two Kullback divergences below: \n\nK(M1, M2) = ∫∫ P_M1(y|x) P_M1(x) log [ P_M1(y|x) P_M1(x) / ( P_M2(x|y) P_M2(y) ) ] dx dy   (2a) \nK(M2, M1) = ∫∫ P_M2(x|y) P_M2(y) log [ P_M2(x|y) P_M2(y) / ( P_M1(y|x) P_M1(x) ) ] dx dy   (2b) \n\nWe can replace K(M1, M2) by K(M2, M1) in the table. The 2nd & 3rd columns are for (c) and (d) respectively; each has two cases depending on who starts the act, and the two are usually not equivalent. 
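As a concrete numerical illustration (our own toy sketch, not from the paper), the two divergences (2a) and (2b) can be evaluated for a small discrete YING-YANG pair; the probability tables below are hypothetical, chosen only to show that both divergences are non-negative and in general unequal:

```python
# Toy evaluation of the two Kullback divergences (2a)/(2b) for a discrete
# YING-YANG pair. All probability tables are hypothetical illustrations.
import math

X, Y = (0, 1), (0, 1)

# YANG (M1) part: P_M1(x) and P_M1(y|x)
P1_x = {0: 0.6, 1: 0.4}
P1_y_given_x = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}

# YING (M2) part: P_M2(y) and P_M2(x|y)
P2_y = {0: 0.5, 1: 0.5}
P2_x_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}

def joint_yang(x, y):          # P_M1(y|x) P_M1(x)
    return P1_y_given_x[x][y] * P1_x[x]

def joint_ying(x, y):          # P_M2(x|y) P_M2(y)
    return P2_x_given_y[y][x] * P2_y[y]

def kullback(p, q):
    """K = sum over the joint space of p log(p/q), as in eqs. (2a)/(2b)."""
    return sum(p(x, y) * math.log(p(x, y) / q(x, y)) for x in X for y in Y)

K12 = kullback(joint_yang, joint_ying)   # (2a): YANG-side divergence
K21 = kullback(joint_ying, joint_yang)   # (2b): YING-side divergence

print(K12, K21)  # both non-negative; generally K12 != K21
```

The asymmetry of the two divergences is what makes the "who chases whom" distinction among the marital statuses meaningful.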
Their results are undefined, depending on the initial conditions for M1, M2, except in two special cases: (i) free P_M1(y|x) and parametric P_M2(x|y), with min_M2 max_M1 K being the same as (b) with broken P_M1(y|x), and with max_M2 min_M1 K defined but useless; (ii) free P_M2(x|y) and parametric P_M1(y|x), with min_M1 max_M2 K being the same as case (a) with broken P_M2(x|y), and with max_M1 min_M2 K defined but useless. \n\nTherefore, we will focus on the statuses marry and divorce. Even so, not all of the above-mentioned 2 x 3 x 3 x 2 = 36 YING-YANG pairs provide sensible learning models, although min_{M1,M2} K and max_{M1,M2} K are always well defined. Fortunately, quite a number of them indeed lead us to useful learning models, as will be shown in the subsequent sections. \n\nWe can implement min_{M1,M2} K(M1, M2) by the following Alternative Minimization (ALTMIN) procedure: \n\nStep 1: Fix M2 = M2_old, to get M1_new = arg min_M1 K(M1, M2_old) \nStep 2: Fix M1 = M1_new, to get M2_new = arg min_M2 K(M1_new, M2) \n\nThe ALTMIN iteration will finally converge to a local minimum of K(M1, M2). We can have a similar procedure for max_{M1,M2} K(M1, M2) by replacing min with max. \n\nSince the above scheme is based on the two complementary YING and YANG Bayesian representations and their Kullback divergence for their marital status, we call it the Bayesian-Kullback YING-YANG learning scheme. Furthermore, under this scheme we call each obtained YING-YANG pair that is sensible for learning purposes a Bayesian-Kullback YING-YANG Machine, or YING-YANG Machine for short. \n\n3 UNIFIED EXISTING LEARNINGS \n\nLet P_M1(x) = P_0(x) by eq. (1) and put it into eq. (2); through certain mathematics we can get K(M1, M2) = h_M1 − ha_M1 − q_M1,2 + D, with D independent of M1, M2 and h_M1, ha_M1, q_M1,2 given by Eqs. (E1), (E2) & (E4) in Tab. 2 respectively. The larger
The larger \n\n\fA Unified Learning Scheme: Bayesian-Kullback Ying-Yang Machine \n\n447 \n\nis the hMl\n, the more discriminative or separable are the representations in Y for the \ninput data set. The larger is the haMl' the more concentrated the representations \nin Y . The larger is the qMl,2' the better PM2(xIY) fits the input data. \nTherefore, minMl ,M2 K(M1, M2) consists of (1) best fitting of PM2 (xIY) on input \ndata via maxQM l ,2' which is desirable, (2) producing more concentrated representa(cid:173)\ntions in Y to occupy less resource, which is also desirable and is the behind reason for \nsolving the problem of selecting cluster number in clustering analysis Xu(1995a&c, \n1996a), (3) but with the cost of less discriminative representations in Y for the input \ndata. Inversely, maxMl ,M2 K(M1 , M2 ) consists of (1) producing best discriminative \nor separable representation PMl (ylx) in Y for the input data set, which is desirable, \nin the cost of (2) producing a more uniform representation in Y to fully occupy the \nresource, and (3) causing PM2(xly) away from fitting input data. \nShown in Table 2 are the unified existing unsupervised learnings. For the case \nH-f- W, we have hMl = h, haMl =ha , QMl,2 =QM2, and minMJ\u00abM1 , M2) re(cid:173)\nsults in PM2(y) = PMl (y) =O:y and PM2(xly)PM2(Y) = PM2(X)PMl (ylx) with \nIn turn, we get K(M1 , M2) =-LM2 + D with \nPM2 (X) =I:~=l PM2 (xly)PM2 (y)\u00b7 \nLM2 being the likelihood given by eq.(E5), i.e., we get maximum likelihood estima(cid:173)\ntion on mixture model. In fact, the ALTMIN given in Tab.2 leads us to exactly the \nEM algorithm by Dempster et al(1977). Also, here PMl(X,y), PM2(X,y) is equiv(cid:173)\nalent to the data submanifold D and model submanifold M in the Information \nGeometry theory (Amari, 1995a&b), with the ALTMIN being the em algorithm. \nAs shown in Xu(95a), the cases also includes the MDL auto-encoder (Hinton & \nZemel, 1994) and Multi-sets modeling (Xu, 1995e). 
\n\nFor the case Single-M, h_M1 − ha_M1 is actually the information transmitted by the YANG part from x to y. In this case, its minimization produces a non-sensible model for learning. However, its maximization is exactly the Infomax learning scheme (Linsker, 1989; Atick & Redlich, 1990; Bell & Sejnowski, 1995). Here, we clear up a confusion made in Xu (95a&b), where the minimization was mistakenly considered. \n\nFor the case H-m-W, h_M1 − ha_M1 − q_M1,2 is just the −F(d; θ, Q) used by Dayan et al. (1995) and Hinton et al. (1995) for the Helmholtz machine. We can set up the detailed correspondence: (i) here P_M1(y|x_i) is their Q_α; (ii) log P_M2(x, y) is their −E_α; and (iii) their P_α is P_M2(y|x) = P_M2(x|y)P_M2(y) / Σ_y P_M2(x|y)P_M2(y). So we get a new perspective on the Helmholtz machine. Moreover, we know that K(M1, M2) becomes a negative likelihood only when P_M2(x|y)P_M2(y) = P_M2(x)P_M1(y|x), which is usually not true when the YANG and YING parts are both parametric. So the Helmholtz machine is not equivalent to maximum likelihood learning in general, with a gap depending on P_M2(x|y)P_M2(y) − P_M2(x)P_M1(y|x). The equivalence is approximately acceptable only when the family of P_M2(x|y) or/and P_M1(y|x) is large enough, or when M2, M1 are both linear with Gaussian densities. \n\nIn Tab. 4, the case Single-M under K(M2, M1) is the classical maximum likelihood (ML) learning for supervised learning, which includes the least square learning by back propagation (BP) for a feedforward net as a special case. Moreover, its counterpart for a backward net as an inverse mapping is the case Single-F under K(M1, M2). \n\n4 NEW LEARNING MODELS \n\nFirst, a number of variants of the above existing models are given in Table 2. Second, a particular new model can be obtained from the case H-m-W by changing min_{M1,M2} into max_{M1,M2}. 
Table 2: BKC-YY Machine for Unsupervised Learning (Part I): K(M1, M2). \nGiven data {x_i}_{i=1}^N, fix P_M1(x) = P_0(x) by eq. (1), and thus K(M1, M2) = Kb + D, with D irrelevant to M1, M2 and Kb given by the following formulae and table: \n\nh_M1 = −(1/N) Σ_{i,y} P_M1(y|x_i) log P_M1(y|x_i),   h = −(1/N) Σ_{i,y} P(y|x_i) log P(y|x_i)   (E1) \nha_M1 = Σ_y α_y^M1 log α_y^M1,   ha = Σ_y α_y log α_y   (E2) \nα_y^M1 = (1/N) Σ_i P_M1(y|x_i),   α_y = (1/N) Σ_i P(y|x_i),   P(y|x_i) = α_y P_M2(x_i|y) / Σ_y α_y P_M2(x_i|y)   (E3) \nq_M1,2 = (1/N) Σ_{i,y} P_M1(y|x_i) log P_M2(x_i|y),   q_M2 = (1/N) Σ_{i,y} P(y|x_i) log P_M2(x_i|y)   (E4) \nL'_M2 = (1/N) Σ_{i,y} α_y log P_M2(x_i|y),   L_M2 = (1/N) Σ_i log Σ_y α_y P_M2(x_i|y)   (E5) \n\n- Single-F: condition P_M1(y|x) = P_M1(y) = P_M2(y); Kb = −L'_M2 (min); get M2 by max L'_M2, no alternation needed; equivalent to duplicated models by ML learning on the input data. \n- Single-M: condition P_M2(x|y) = P_M2(x) = P_0(x), P_M2(y) = P_M1(y); Kb = h_M1 − ha_M1 (max); get M1 by max [h_M1 − ha_M1], no alternation needed; equivalent to Infomax / maximum mutual information (Lin89; Ati90; Bel95). \n- H-m-W: condition P_M1(y|x) and P_M2(x|y) both parametric; Kb = h_M1 − ha_M1 − q_M1,2 (min); ALTMIN: S1: fix M2, get M1 by min [h_M1 − ha_M1 − q_M1,2]; S2: fix M1, get M2 by max q_M1,2; repeat S1, S2; equivalent to the Helmholtz machine (Hin95; Day95). \n- W-f-H: condition uniform P_M2(y) and free P_M2(x|y) = P(x|y); Kb = ha_M1 (min); get M1 by min ha_M1, no alternation needed; related to PCA. \n- H-f-W: condition P_M1(y|x) free, i.e., P_M1(y|x) = P(y|x); Kb = h − ha − q_M2 = −L_M2 (min); ALTMIN: S1: fix M2, get P(y|x_i) and α_y by (E3) and α_y^M1 by (E2); S2: get M2 by max q_M2; repeat S1, S2; equivalent to: 1. ML on mixtures & the EM algorithm (Dem77); 2. Information geometry (Amari95); 3. MDL autoencoder (Hin94); 4. Multi-sets modeling (Xu94, 95). \n\nNew results: 1. For the H-f-W type: three VQ variants when P_M2(x|y) is Gaussian, plus criteria for selecting the correct k for VQ or clustering (Xu95a&c). 2. For the H-m-W type: robust PCA plus a criterion for determining the subspace dimension (Xu, 95c). \nVariants: 1. A smoother P_M1(x) given by the Parzen window estimate. 2. Factorial coding P_M2(y) = Π_i P_M2(y_i) with binary y = [y_1, ..., y_m]. 3. Factorial coding P_M1(y|x) = Π_i P_M1(y_i|x) with binary [y_1, ..., y_m]. 4. Replace 'Σ_y' in all the above items by '∫ dy' for real y. \nNote: H - Husband, W - Wife, f - follows, M - Male, F - Female, m - matches. X-f-Y stands for the X part being free. Single-X stands for the other part being broken. H-m-W stands for both parts being parametric. '(min)' stands for min Kb and '(max)' stands for max Kb. \n\nTable 3: BKC-YY Machine for Unsupervised Learning (Part II): K(M2, M1). \nGiven data {x_i}_{i=1}^N, fix P_M1(x) = P_0(x) by eq. (1), and thus K(M2, M1) = Kb + D, with D irrelevant to M1, M2 and Kb given by formulae (E6)-(E10), the counterparts of (E1)-(E5) with the roles of M1 and M2 exchanged. The conditions for each case are the same as those in Table 2; none of these cases has an existing equivalent model - all are new. \n\n- Single-M: Kb = ha_M2 − L_M1,2 (min); S1: fix M1, get α_y^M2 by (E7); S2: update M1 by max L_M1,2; repeat S1, S2. \n- Single-F: Kb = ha_M1 − L_M1 (min, if forcing P_M1(y) = P_M2(y)); S1: fix M1, get α_y^M1 by (E2) in Table 2; S2: update M1 by max L_M1; repeat S1, S2. \n- H-m-W: Kb = h'_M2 + ha_M1 − q_M2,1 (min); S1: fix M2, get M1 by min [ha_M1 − q_M2,1]; S2: fix M1, get M2 by min [h'_M2 − q_M2,1]; repeat S1, S2. \n- W-f-H: Kb = −La_M1 (min); get M1 by max La_M1, no alternation needed. \n- H-f-W: Kb = h_M2 (max); get M2 by max h_M2, no alternation needed. \n\nVariants: similar to those in Table 2. \n\nTable 4: BKC-YY Machine for Supervised Learning. \nGiven data {x_i, y_i}_{i=1}^N, fix P_M1(x) = P_0(x) by eq. (1). \n\nh'_M1 = −(1/N) Σ_i P_M1(y_i|x_i) log P_M1(y_i|x_i),   h'_M2 = −(1/N) Σ_i P_M2(x_i|y_i) log P_M2(x_i|y_i)   (E11) \nq'_M1,2 = (1/N) Σ_i P_M1(y_i|x_i) log P_M2(x_i|y_i),   q'_M2,1 = (1/N) Σ_i P_M2(x_i|y_i) log P_M1(y_i|x_i)   (E12) \nL'_M1 = (1/N) Σ_i log P_M1(y_i|x_i),   L'_M2 = (1/N) Σ_i log P_M2(x_i|y_i)   (E13) \n\nUnder K(M1, M2) = Kb + D: \n- Single-M: Kb = h'_M1 (max); minimum entropy (ME) learning on an F-net; new. \n- Single-F: Kb = −L'_M2 (min); ML learning on a B-net; existing: BP on a B-net. \n- H-m-W: Kb = h'_M1 − q'_M1,2 (min); a mixed F-B net; new. \nUnder K(M2, M1) = Kb + D: \n- Single-M: Kb = −L'_M1 (min); ML learning on an F-net; existing: BP on an F-net. \n- Single-F: Kb = h'_M2 (max); minimum entropy learning on a B-net; new. \n- H-m-W: Kb = h'_M2 − q'_M2,1 (min); a mixed B-F net; new. \n\nThat is, we have max_{M1,M2} [h_M1 − ha_M1 − q_M1,2], shortly denoted by H-m-W-Max. This model is a dual to the Helmholtz machine, in that it focuses on getting the best discriminative or separable representations P_M1(y|x) in Y instead of the best fitting of P_M2(x|y) on the input data. \n\nThird, by replacing K(M1, M2) with K(M2, M1), in Table 3 we can obtain new models that are the counterparts of those given in Table 2. For the case H-f-W, its max_{M1,M2} gives a minimum entropy estimate on P_M2(x) instead of the maximum likelihood estimate on P_M2(x) in Table 2. For the case Single-M, it will function similarly to the case Single-F in Table 2, but with minimum entropy on P_M1(y|x) in Table 2 replaced by maximum likelihood on P_M1(y|x) here. 
For the case H-m-W, the focus shifts from getting the best fitting of P_M2(x|y) on the input data to getting the best discriminative representations P_M1(y|x) in Y, which is similar to the just-mentioned H-m-W-Max, but with minimum entropy on P_M1(y|x) replaced by maximum likelihood on P_M1(y|x). The other two cases in Table 3 are also changed similarly from those in Table 2. \n\nFourth, several new models have also been proposed in Table 4 for supervised learning. Instead of maximum likelihood, the new models suggest learning by minimum entropy or by a mix of maximum likelihood and minimum entropy. \n\nFinally, further studies on the other statuses in Table 1 are needed. Heuristically, we can also treat the case H-m-W by two separate steps. We first get M1 by max [h_M1 − ha_M1], and then get M2 by max q_M1,2; or we first get M2 by min [h − ha − q_M2] and then get M1 by min [h_M1 − ha_M1 − q_M1,2]. The two algorithms attempt to get both a good discriminative representation by P_M1(y|x) and a good fitting of P_M2(x|y) on the input data. However, whether they work well needs to be tested experimentally. \n\nWe are currently conducting experiments comparing several of the above new models against their existing counterparts. \n\nAcknowledgements. The work was supported by the HK RGC Earmarked Grant CUHK250/94E. \n\nReferences \n\nAmari, S. (1995a) [Amari95], \"Information geometry of the EM and em algorithms for neural networks\", Neural Networks 8, to appear. \nAmari, S. (1995b), Neural Computation 7, pp. 13-18. \nAtick, J.J. & Redlich, A.N. (1990) [Ati90], Neural Computation, Vol. 2, No. 3, pp. 308-320. \nBell, A.J. & Sejnowski, T.J. (1995) [Bel95], Neural Computation, Vol. 7, No. 6, pp. 1129-1159. \nByrne, W. (1992), IEEE Trans. Neural Networks 3, pp. 612-620. \nCsiszar, I. (1975), Annals of Probability 3, pp. 146-158. \nDayan, P., Hinton, G.E., & Neal, R.N. (1995) [Day95], Neural Computation, Vol. 7, No. 5, pp. 889-904. 
\nDempster, A.P., Laird, N.M., & Rubin, D.B. (1977) [Dem77], J. Royal Statistical Society, B39, pp. 1-38. \nHathaway, R.J. (1986), Statistics & Probability Letters 4, pp. 53-56. \nHinton, G.E., et al. (1995) [Hin95], Science 268, pp. 1158-1160. \nHinton, G.E. & Zemel, R.S. (1994) [Hin94], Advances in NIPS 6, pp. 3-10. \nLinsker, R. (1989) [Lin89], Advances in NIPS 1, pp. 186-194. \nNeal, R.N. & Hinton, G.E. (1993), \"A new view of the EM algorithm that justifies incremental and other variants\", preprint. \nXu, L. (1996), \"How Many Clusters?: A YING-YANG Machine Based Theory For A Classical Open Problem In Pattern Recognition\", to appear in Proc. IEEE ICNN96. \nXu, L. (1995a), \"YING-YANG Machine: a Bayesian-Kullback scheme for unified learnings and new results on vector quantization\", Keynote talk, Proc. Intl Conf. on Neural Information Processing (ICONIP95), Oct 30 - Nov 3, 1995, pp. 977-988. \nXu, L. (1995b), \"YING-YANG Machine for Temporal Signals\", Keynote talk, Proc. IEEE Intl Conf. on Neural Networks & Signal Processing, Vol. I, pp. 644-651, Nanjing, 10-13, 1995. \nXu, L. (1995c), \"New Advances on The YING-YANG Machine\", Invited paper, Proc. of 1995 Intl. Symposium on Artificial Neural Networks, pp. IS07-12, Dec. 18-20, Taiwan. \nXu, L. (1995d), \"Cluster Number Selection, Adaptive EM Algorithms and Competitive Learnings\", Invited paper, Proc. Intl Conf. on Neural Information Processing (ICONIP95), Oct 30 - Nov 3, 1995, Vol. II, pp. 1499-1502. \nXu, L. (1995e), Invited paper, Proc. WCNN95, Vol. I, pp. 35-42. Also, Invited paper, Proc. IEEE ICNN 1994, pp. 1315-1320. \nXu, L. & Jordan, M.I. (1993), Proc. of WCNN'93, Portland, OR, Vol. II, pp. 431-434. \n", "award": [], "sourceid": 1065, "authors": [{"given_name": "Lei", "family_name": "Xu", "institution": null}]}