Products of Gaussians

Advances in Neural Information Processing Systems, pp. 1017-1024

Christopher K. I. Williams
Division of Informatics
University of Edinburgh
Edinburgh EH1 2QL, UK
c.k.i.williams@ed.ac.uk
http://anc.ed.ac.uk

Felix V. Agakov
System Engineering Research Group
Chair of Manufacturing Technology
Universität Erlangen-Nürnberg
91058 Erlangen, Germany
F.Agakov@lft.uni-erlangen.de

Stephen N. Felderhof
Division of Informatics
University of Edinburgh
Edinburgh EH1 2QL, UK
stephenf@dai.ed.ac.uk

Abstract

Recently Hinton (1999) has introduced the Products of Experts (PoE) model, in which several individual probabilistic models for data are combined to provide an overall model of the data. Below we consider PoE models in which each expert is a Gaussian. Although the product of Gaussians is also a Gaussian, if each Gaussian has a simple structure the product can have a richer structure. We examine (1) products of Gaussian pancakes, which give rise to probabilistic Minor Components Analysis, (2) products of 1-factor PPCA models, and (3) a product of experts construction for an AR(1) process.

Recently Hinton (1999) has introduced the Products of Experts (PoE) model, in which several individual probabilistic models for data are combined to provide an overall model of the data. In this paper we consider PoE models in which each expert is a Gaussian. It is easy to see that in this case the product model will also be Gaussian. However, if each Gaussian has a simple structure, the product can have a richer structure. Using Gaussian experts is attractive as it permits a thorough analysis of the product architecture, which can be difficult with other models, e.g. models defined over discrete random variables.
Below we examine three cases of the products of Gaussians construction: (1) products of Gaussian pancakes (PoGP), which give rise to probabilistic Minor Components Analysis (MCA), providing a complementary result to the probabilistic Principal Components Analysis (PPCA) of Tipping and Bishop (1999); (2) products of 1-factor PPCA models; (3) a product of experts construction for an AR(1) process.

If each expert is a Gaussian p_i(x|θ_i) ~ N(μ_i, C_i), the resulting distribution of the product of m Gaussians may be expressed as

    p(x|θ) = (1/Z) ∏_{i=1}^m p_i(x|θ_i).

By completing the square in the exponent it may easily be shown that p(x|θ) ~ N(μ_Σ, C_Σ), where C_Σ^{-1} = Σ_{i=1}^m C_i^{-1}. To simplify the following derivations we will assume that p_i(x|θ_i) ~ N(0, C_i) and thus that p(x|θ) ~ N(0, C_Σ); μ_Σ ≠ 0 can be obtained by translation of the coordinate system.

1 Products of Gaussian Pancakes

A Gaussian "pancake" (GP) is a d-dimensional Gaussian, contracted in one dimension and elongated in the other d − 1 dimensions. In this section we show that the maximum likelihood solution for a product of Gaussian pancakes (PoGP) yields a probabilistic formulation of Minor Components Analysis (MCA).

1.1 Covariance Structure of a GP Expert

Consider a d-dimensional Gaussian whose probability contours are contracted in the direction w and equally elongated in the mutually orthogonal directions v_1, ..., v_{d−1}. We call this a Gaussian pancake or GP. Its inverse covariance may be written as

    C^{-1} = Σ_{i=1}^{d−1} v_i v_i^T β_0 + w w^T β_w,    (1)

where v_1, ..., v_{d−1}, w form a d × d matrix of normalized eigenvectors of the covariance C. Here β_0 = σ_0^{-2} and β_w = σ_w^{-2} define the inverse variances in the directions of elongation and contraction respectively, so that σ_0^2 ≥ σ_w^2.
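As a numerical sketch of the product rule above (not part of the paper; the expert covariances here are arbitrary symmetric positive-definite matrices), the following confirms that the inverse covariances of zero-mean Gaussian experts simply add:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3

# m arbitrary zero-mean Gaussian experts with SPD covariances C_i.
covs = []
for _ in range(m):
    A = rng.standard_normal((d, d))
    covs.append(A @ A.T + d * np.eye(d))

# Product rule: precisions (inverse covariances) add,
# C_Sigma^{-1} = sum_i C_i^{-1}.
prec_sum = sum(np.linalg.inv(C) for C in covs)
C_sigma = np.linalg.inv(prec_sum)

# Check: differences of unnormalized log-densities agree, so the
# renormalized product density is exactly N(0, C_sigma).
def log_unnorm_product(x):
    return sum(-0.5 * x @ np.linalg.inv(C) @ x for C in covs)

x1 = rng.standard_normal(d)
x2 = rng.standard_normal(d)
lhs = log_unnorm_product(x1) - log_unnorm_product(x2)
rhs = -0.5 * x1 @ prec_sum @ x1 + 0.5 * x2 @ prec_sum @ x2
assert np.isclose(lhs, rhs)
```

Comparing log-density differences at two points sidesteps the normalizing constant Z, which cancels in the ratio.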
Expression (1) can be re-written in a more compact form as

    C^{-1} = β_0 I_d + ŵ ŵ^T,    (2)

where ŵ = w √(β_w − β_0) and I_d ∈ R^{d×d} is the identity matrix. Notice that by the constraint considerations above β_0 < β_w, so all elements of ŵ are real-valued. Note the similarity of (2) with the expression for the covariance of the data under a 1-factor probabilistic principal component analysis model, C = σ² I_d + w w^T (Tipping and Bishop, 1999), where σ² is the variance of the factor-independent spherical Gaussian noise. The only difference is that for the constrained Gaussian model it is the inverse covariance matrix, rather than the covariance matrix, which has the structure of a rank-1 update to a multiple of I_d.

1.2 Covariance of the PoGP Model

We now consider a product of m GP experts, each of which is contracted in a single dimension. We will refer to the model as a (1, m) PoGP, where 1 represents the number of directions of contraction of each expert. We also assume that all experts have identical means.

From (1), the inverse covariance of the resulting (1, m) PoGP model can be expressed as

    C_Σ^{-1} = Σ_{i=1}^m C_i^{-1} = β_Σ I_d + W W^T,    (3)

where the columns of W ∈ R^{d×m} correspond to the weight vectors of the m PoGP experts, and β_Σ = Σ_{i=1}^m β_0^{(i)} > 0.

1.3 Maximum-Likelihood Solution for PoGP

Comparing (3) with the m-factor PPCA model, we may conjecture that, in contrast with the PPCA model, where the ML weights correspond to principal components of the data covariance (Tipping and Bishop, 1999), the weights W of the PoGP model define a projection onto the m minor eigenvectors of the sample covariance in the visible d-dimensional space, while the distortion term β_Σ I_d accounts for the larger variations.¹ This is indeed the case.
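The conjectured minor-component structure can be illustrated numerically before it is derived: setting the columns of W to the m minor eigenvectors u_i of S, each scaled by (1/λ_i − β_Σ)^{1/2}, makes the PoGP covariance C_Σ = (β_Σ I_d + W W^T)^{-1} reproduce the m smallest eigenpairs of S exactly, while every other direction gets variance 1/β_Σ. This is a sketch, not the paper's experiment; the particular β_Σ below is an assumption (any positive value below 1/λ_i for the retained minor eigenvalues keeps the weights real):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 6, 2, 5000

# Sample covariance S of some zero-mean data.
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
S = X.T @ X / n

# Eigendecomposition; eigh returns eigenvalues in ascending order.
vals, vecs = np.linalg.eigh(S)
lam, U = vals[:m], vecs[:, :m]           # the m minor eigenpairs
beta_sigma = (d - m) / vals[m:].sum()    # assumed distortion level

# PoGP weights: scaled minor eigenvectors (taking R = I).
W = U @ np.diag(np.sqrt(1.0 / lam - beta_sigma))

# PoGP covariance C_Sigma = (beta_sigma I_d + W W^T)^{-1}.
C_sigma = np.linalg.inv(beta_sigma * np.eye(d) + W @ W.T)

# The m smallest eigenpairs of S are reproduced exactly ...
assert np.allclose(C_sigma @ U, U @ np.diag(lam))
# ... while all remaining directions get variance 1/beta_sigma.
V = vecs[:, m:]
assert np.allclose(C_sigma @ V, V / beta_sigma)
```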
In Williams and Agakov (2001) it is shown that stationarity of the log-likelihood with respect to the weight matrix W and the noise parameter β_Σ results in three classes of solutions for the experts' weight matrix, namely

    W = 0;
    S = C_Σ;
    S W = C_Σ W,  W ≠ 0,  S ≠ C_Σ,    (4)

where S is the covariance matrix of the data (with an assumed mean of zero). The first two conditions in (4) are the same as in Tipping and Bishop (1999), but for PPCA the third condition is replaced by C^{-1} W = S^{-1} W (assuming that S^{-1} exists). In Appendix A and Williams and Agakov (2001) it is shown that the maximum likelihood solution W_ML is given by

    W_ML = U (Λ^{-1} − β_Σ I_m)^{1/2} R^T,    (5)

where R ∈ R^{m×m} is an arbitrary rotation matrix, Λ is an m × m diagonal matrix containing the m smallest eigenvalues of S, and U = [u_1, ..., u_m] ∈ R^{d×m} is a matrix of the corresponding eigenvectors of S. Thus the maximum likelihood solution for the weights of the (1, m) PoGP model corresponds to m scaled and rotated minor eigenvectors of the sample covariance S, and leads to a probabilistic model of minor components analysis. As in the PPCA model, the number of experts m is assumed to be lower than the dimension d of the data space.

The correctness of this derivation has been confirmed experimentally by using a scaled conjugate gradient search to optimize the log likelihood as a function of W and β_Σ.

1.4 Discussion of the PoGP model

An intuitive interpretation of the PoGP model is as follows. Each Gaussian pancake imposes an approximate linear constraint in x space, namely that x should lie close to a particular hyperplane. The conjunction of these constraints is given by the product of the Gaussian pancakes. If m ≪ d it will make sense to define the resulting Gaussian distribution in terms of the constraints. However, if there are many constraints (m > d/2) it can be more efficient to describe the directions of large variability using a PPCA model, rather than the directions of small variability using a PoGP model. This issue is discussed by Xu et al. (1991) in what they call the "Dual Subspace Pattern Recognition Method", where both PCA and MCA models are used (although their work does not use explicit probabilistic models such as PPCA and PoGP).

¹ Because equation (3) has the form of a factor analysis decomposition, but for the inverse covariance matrix, we sometimes refer to PoGP as the rotcaf model.

MCA can be used, for example, for signal extraction in digital signal processing (Oja, 1992), dimensionality reduction, and data visualization. Extraction of the minor component is also used in the Pisarenko Harmonic Decomposition method for detecting sinusoids in white noise (see, e.g., Proakis and Manolakis (1992), p. 911). Formulating minor components analysis as a probabilistic model simplifies comparison of the technique with other dimensionality reduction procedures, permits extending MCA to a mixture of MCA models (modeled as a mixture of products of Gaussian pancakes), permits using PoGP in classification tasks (if each PoGP model defines a class-conditional density), and leads to a number of other advantages over non-probabilistic MCA models (see the discussion of the advantages of PPCA over PCA in Tipping and Bishop (1999)).

2 Products of PPCA

In this section we analyze a product of m 1-factor PPCA models and compare it to an m-factor PPCA model.

2.1 1-factor PPCA model

Consider a 1-factor PPCA model with a latent variable s_i and visible variables x. The joint distribution is given by P(s_i, x) = P(s_i) P(x|s_i). We set P(s_i) ~ N(0, 1) and P(x|s_i) ~ N(w_i s_i, σ² I_d). Integrating out s_i we find that P_i(x) ~ N(0, C_i), where C_i = w_i w_i^T + σ² I_d and

    C_i^{-1} = β (I_d − γ_i w_i w_i^T),    (6)

where β = σ^{-2} and γ_i = β / (1 + β ‖w_i‖²).
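The form of (6) is a direct Sherman-Morrison inversion of C_i = w_i w_i^T + σ² I_d; a minimal numerical check (with an arbitrary w_i and σ², not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
sigma2 = 0.5
beta = 1.0 / sigma2
w = rng.standard_normal(d)

gamma = beta / (1.0 + beta * (w @ w))

# 1-factor PPCA covariance and its claimed inverse, equation (6).
C = np.outer(w, w) + sigma2 * np.eye(d)
C_inv = beta * (np.eye(d) - gamma * np.outer(w, w))

assert np.allclose(np.linalg.inv(C), C_inv)

# gamma is the inverse variance along w; beta in orthogonal directions.
u = w / np.linalg.norm(w)
assert np.isclose(u @ C_inv @ u, gamma)
```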
Here β and γ_i are the inverse variances in the directions of contraction and elongation respectively. The joint distribution of s_i and x is given by

    P(s_i, x) = P(s_i) P(x|s_i) ∝ exp{ −(1/2) s_i² − (β/2) ‖x − w_i s_i‖² }    (7)
              ∝ exp{ −(β/2) [ s_i²/γ_i − 2 x^T w_i s_i + x^T x ] }.    (8)

Tipping and Bishop (1999) showed that the general m-factor PPCA model (m-PPCA) has covariance C = σ² I_d + W W^T, where W is the d × m matrix of factor loadings. When fitting this model to data, the maximum likelihood solution is to choose W proportional to the principal components of the data covariance matrix.

2.2 Products of 1-factor PPCA models

We now consider the product of m 1-factor PPCA models, which we denote a (1, m)-PoPPCA model. The joint distribution over s = (s_1, ..., s_m)^T and x is

    P(x, s) ∝ exp{ −(β/2) Σ_{i=1}^m [ s_i²/γ_i − 2 x^T w_i s_i + x^T x ] }.    (9)

Let z^T := (x^T, s^T). Thus we see that the distribution of z is Gaussian with inverse covariance matrix βM, where

    M = ( m I_d    −W   )
        ( −W^T    Γ^{-1} ),    (10)

and Γ = diag(γ_1, ..., γ_m). Using the inversion equations for partitioned matrices (Press et al., 1992, p. 77) we can show that

    Σ_xx^{-1} = β (m I_d − W Γ W^T),    (11)

where Σ_xx is the covariance of the x variables under this model. It is easy to confirm that this is also the result obtained from summing (6) over i = 1, ..., m.

2.3 Maximum Likelihood solution for PoPPCA

An m-factor PPCA model has covariance σ² I_d + W W^T and thus, by the Woodbury formula, it has inverse covariance β I_d − β W (σ² I_m + W^T W)^{-1} W^T. The maximum likelihood solution for an m-PPCA model is similar to (5), i.e. W = U (Λ − σ² I_m)^{1/2} R^T, but now Λ is a diagonal matrix of the m principal eigenvalues, and U is a matrix of the corresponding eigenvectors. If we choose R^T = I then the columns of W are orthogonal, and the inverse covariance of the maximum likelihood m-PPCA model has the form β I_d − β W Γ W^T.
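As a numerical sketch (with arbitrary weights, not the paper's data), the marginal precision (11) can be checked against both the partitioned-matrix computation from (10) and the sum of the 1-factor precisions (6):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 3
sigma2 = 0.25
beta = 1.0 / sigma2

W = rng.standard_normal((d, m))                  # expert weights w_i as columns
gam = beta / (1.0 + beta * (W * W).sum(axis=0))  # gamma_i for each expert
Gamma = np.diag(gam)

# Equation (10): the joint precision of z = (x, s) is beta * M.
M = np.block([[m * np.eye(d), -W],
              [-W.T, np.diag(1.0 / gam)]])
P = beta * M

# Marginal precision of x via the Schur complement of the s-block.
Pxx, Pxs, Pss = P[:d, :d], P[:d, d:], P[d:, d:]
Sxx_inv_schur = Pxx - Pxs @ np.linalg.inv(Pss) @ Pxs.T

# Equation (11).
Sxx_inv_eq11 = beta * (m * np.eye(d) - W @ Gamma @ W.T)

# Summing the 1-factor precisions (6) over the m experts.
Sxx_inv_sum = sum(beta * (np.eye(d) - g * np.outer(w, w))
                  for g, w in zip(gam, W.T))

assert np.allclose(Sxx_inv_schur, Sxx_inv_eq11)
assert np.allclose(Sxx_inv_sum, Sxx_inv_eq11)
```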
Comparing the m-PPCA inverse covariance to (11) (with the same weight matrix W), we see that the difference is that the first term on the RHS of (11) is βm I_d, while for m-PPCA it is β I_d.

In section 3.4 and Appendix C.3 of Agakov (2000) it is shown that (for m ≥ 2) we obtain the m-factor PPCA solution when