{"title": "Bayesian Exponential Family PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1089, "page_last": 1096, "abstract": "Principal Components Analysis (PCA) has become established as one of the key tools for dimensionality reduction when dealing with real valued data. Approaches such as exponential family PCA and non-negative matrix factorisation have successfully extended PCA to non-Gaussian data types, but these techniques fail to take advantage of Bayesian inference and can suffer from problems of overfitting and poor generalisation. This paper presents a fully probabilistic approach to PCA, which is generalised to the exponential family, based on Hybrid Monte Carlo sampling. We describe the model which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.", "full_text": "Bayesian Exponential Family PCA\n\nShakir Mohamed\n\nKatherine Heller\n\nDepartment of Engineering, University of Cambridge\n{sm694,kah60,zoubin}@eng.cam.ac.uk\n\nCambridge, CB2 1PZ, UK\n\nZoubin Ghahramani\n\nAbstract\n\nPrincipal Components Analysis (PCA) has become established as one of the\nkey tools for dimensionality reduction when dealing with real valued data. Ap-\nproaches such as exponential family PCA and non-negative matrix factorisation\nhave successfully extended PCA to non-Gaussian data types, but these techniques\nfail to take advantage of Bayesian inference and can suffer from problems of over-\n\ufb01tting and poor generalisation. This paper presents a fully probabilistic approach\nto PCA, which is generalised to the exponential family, based on Hybrid Monte\nCarlo sampling. 
We describe the model, which is based on a factorisation of the observed data matrix, and show performance of the model on both synthetic and real data.\n\n1 Introduction\n\nIn Principal Components Analysis (PCA) we seek to reduce the dimensionality of a D-dimensional data vector to a smaller K-dimensional vector, which represents an embedding of the data in a lower dimensional space. The traditional PCA algorithm is non-probabilistic and defines the eigenvectors corresponding to the K largest eigenvalues as this low dimensional embedding. In probabilistic approaches to PCA, such as probabilistic PCA (PPCA) and Bayesian PCA [1], the data is modelled by unobserved latent variables, and these latent variables define the low dimensional embedding. In these models both the data and the latent variables are assumed to be Gaussian distributed.\n\nThis Gaussian assumption may not be suitable for all data types, especially where the data is binary or integer valued. Models such as Non-negative Matrix Factorisation (NMF) [2], Discrete Components Analysis (DCA) [3], Exponential Family PCA (EPCA) [4] and Semi-parametric PCA (SP-PCA) [5] have been developed that endow PCA with the ability to handle data for which Bernoulli or Poisson distributions may be more appropriate. These general approaches to PCA involve the representation of the data matrix X as a product of smaller matrices: the factor score matrix V, representing the reduced vectors; and a data independent part \u0398, known as the factor loading matrix. In the original data matrix there are N \u00d7 D entries, and in the matrix factorisation there are (N + D) \u00d7 K entries, which is a reduction in the number of parameters if K \u226a N, D [3].\n\nModels such as PCA, NMF and EPCA are from the class of deterministic latent variable models [6], since their latent variables are set to their maximum a posteriori (MAP) values. Welling et al. 
[6] argue that the resulting model essentially assigns zero probability to all input configurations that are not in the training set. This problem stems from the use of an inappropriate objective function, and can be remedied by using an alternate approximate inference scheme. In this paper, we propose a fully Bayesian approach to PCA generalised to the exponential family. Our approach follows the method of factorising the data matrix into two lower rank matrices, using an exponential family distribution for the data with conjugate priors. The exponential family of distributions is reviewed in section 2, and the complete specification for the model is given in section 3. Learning and inference in the model are performed using the Hybrid Monte Carlo approach, which is appropriate due to the continuous nature of the variables in the model. The connections to existing generalised PCA methods, such as NMF and EPCA, are discussed in section 4. We present results on the performance of our Bayesian exponential family PCA model in section 5. We report performance using both a synthetic data set, to highlight particular model properties, and two real datasets: the Cedar Buffalo digits dataset and data on cardiac SPECT images. The Bayesian approach gives us many samples of the final low dimensional embedding of the data, and techniques for determining a single low dimensional embedding are discussed in section 6. In section 7 we conclude, and present a survey of possible future work.\n\n2 Exponential Family Models\n\nIn the exponential family of distributions, the conditional probability of a value xn given parameter value \u03b8 takes the following form:\n\np(xn|\u03b8) = exp{s(xn)\u22a4\u03b8 + h(xn) + g(\u03b8)}   (1)\n\nwhere s(xn) are the sufficient statistics, \u03b8 is a vector of natural parameters, h(xn) is a function of the data and g(\u03b8) is a function of the parameters. 
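To make the parameterisation concrete, the following sketch (our illustration, not code from the paper) writes the Bernoulli distribution in the form of equation (1), with s(x) = x, h(x) = 0, theta = ln(mu/(1-mu)) and g(theta) = -ln(1+e^theta), and checks it numerically against the standard form:

```python
import math

# Illustrative check (not from the paper): the Bernoulli distribution
# in the exponential family form of equation (1).
def bernoulli_expfam(x, mu):
    theta = math.log(mu / (1.0 - mu))      # natural parameter (logit of mu)
    g = -math.log(1.0 + math.exp(theta))   # log-partition term g(theta)
    return math.exp(x * theta + g)         # exp{x*theta + h(x) + g(theta)}, h(x) = 0

def bernoulli_standard(x, mu):
    # The usual form p(x|mu) = mu^x (1 - mu)^(1-x)
    return mu ** x * (1.0 - mu) ** (1 - x)

for mu in (0.1, 0.5, 0.9):
    for x in (0, 1):
        assert abs(bernoulli_expfam(x, mu) - bernoulli_standard(x, mu)) < 1e-12
```

The same pattern (sufficient statistic, natural parameter, log-partition) applies to the Poisson and other members of the family.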
In this paper, the natural representation of the exponential family likelihood is used, such that s(xn) = xn.\n\nIt is convenient to represent a variable xn that is drawn from an exponential family distribution using the notation xn \u223c Expon(\u03b8), with natural parameters \u03b8. Probability distributions that belong to the exponential family also have natural conjugate prior distributions p(\u03b8). The conjugate prior distribution for the exponential family distribution of equation (1) is:\n\np(\u03b8) \u221d exp{\u03bb\u22a4\u03b8 + \u03bdg(\u03b8) + f(\u03bb)}   (2)\n\nwhere \u03bb and \u03bd are hyperparameters of the prior distribution. In this case we use the notation \u03b8 \u223c Conj(\u03bb, \u03bd) as shorthand for the conjugate distribution.\n\nAs an example, for binary data an appropriate data distribution is the Bernoulli distribution. The distribution is usually written as p(x|\u00b5) = \u00b5^x(1 \u2212 \u00b5)^(1\u2212x), with \u00b5 in [0,1]. The exponential family form of this distribution, in terms of equation (1), is: h(x) = 0, \u03b8 = ln(\u00b5/(1 \u2212 \u00b5)) and g(\u03b8) = \u2212 ln(1 + e^\u03b8). The natural parameters can be mapped to the parameter values of the distribution using the link function, which is the logistic sigmoid in the case of the Bernoulli distribution. The terms of the conjugate distribution can also be derived easily.\n\n3 Bayesian Exponential Family PCA\n\nWe can consider Bayesian Exponential Family PCA (BXPCA) as a method of searching for two matrices V and \u0398, and we define the product matrix P = V\u0398. In traditional PCA, the elements of the matrix P, which are the means of Gaussians, lie in the same space as the data X. In the case of BXPCA and other methods for non-Gaussian PCA such as EPCA [4], this matrix represents the natural parameters of the exponential family distribution of the data.\n\nWe represent the observed data as an N \u00d7 D matrix X = {x1, . . 
. , xN}, with an individual data point xn = [xn1, . . . , xnD]. N is the number of data points and D is the number of input features. \u0398 is a K \u00d7 D matrix with rows \u03b8k. V is an N \u00d7 K matrix V = {v1, . . . , vN}, with rows vn = [vn1, . . . , vnK], which are K-dimensional vectors of continuous values in R. K is the number of latent factors representing the dimensionality of the reduced space.\n\n3.1 Model Specification\n\nThe generative process for the BXPCA model is described in figure 1. Let m and S be hyperparameters representing a K-dimensional vector of initial mean values and an initial covariance matrix respectively. Let \u03b1 and \u03b2 be the hyperparameters corresponding to the shape and scale parameters of an inverse Gamma distribution. We start by drawing \u00b5 from a Gaussian distribution and the elements \u03c3\u00b2k of the diagonal matrix \u03a3 from an inverse Gamma distribution:\n\n\u00b5 \u223c N(\u00b5|m, S)   \u03c3\u00b2k \u223c iG(\u03b1, \u03b2)   (3)\n\nFigure 1: Graphical Model for Bayesian Exponential Family PCA.\n\nFor each data point n, we draw the K-dimensional entry vn of the factor score matrix:\n\nvn \u223c N(vn|\u00b5, \u03a3)   (4)\n\nThe data is described by an exponential family distribution with natural parameters \u03b8k. The exponential family distribution modelling the data, and the corresponding prior over the model parameters, is:\n\nxn|vn, \u0398 \u223c Expon(\u2211_k vnk\u03b8k)   \u03b8k \u223c Conj(\u03bb, \u03bd)   (5)\n\nWe denote \u2126 = {V, \u0398, \u00b5, \u03a3} as the set of unknown parameters, with hyperparameters \u03a8 = {m, S, \u03b1, \u03b2, \u03bb, \u03bd}. 
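The generative process of equations (3)-(5) can be sketched for binary data as follows. This is a minimal illustration under assumed hyperparameter values, with the Bernoulli likelihood standing in for Expon(.) and a simple Gaussian standing in for the conjugate prior Conj(lambda, nu):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 100, 16, 3                      # assumed sizes, for illustration only

# Illustrative hyperparameter choices
m, S = np.zeros(K), np.eye(K)             # prior mean and covariance for mu
alpha, beta = 2.0, 1.0                    # inverse-Gamma shape and scale

# mu ~ N(m, S); sigma2_k ~ iG(alpha, beta)   (equation 3)
mu = rng.multivariate_normal(m, S)
sigma2 = 1.0 / rng.gamma(alpha, 1.0 / beta, size=K)

# v_n ~ N(mu, Sigma)   (equation 4)
V = rng.multivariate_normal(mu, np.diag(sigma2), size=N)

# theta_k: a simple Gaussian stand-in for Conj(lambda, nu)
Theta = rng.normal(0.0, 1.0, size=(K, D))

# Natural parameters P = V Theta; Bernoulli means via the logistic link
P = V @ Theta
X = rng.binomial(1, 1.0 / (1.0 + np.exp(-P)))

assert X.shape == (N, D) and set(np.unique(X)) <= {0, 1}
```

Running the sketch forward produces a binary N x D matrix X whose structure is controlled by the K-dimensional scores V, mirroring figure 1.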
Given the graphical model, the joint probability of all parameters and variables is:\n\np(X, \u2126|\u03a8) = p(X|V, \u0398)p(\u0398|\u03bb, \u03bd)p(V|\u00b5, \u03a3)p(\u00b5|m, S)p(\u03a3|\u03b1, \u03b2)   (6)\n\nUsing the model specification given by equations (3) - (5) and assuming that the parameter \u03bd = 1, the log-joint probability distribution is:\n\nln p(X, \u2126|\u03a8) = \u2211_{n=1}^{N} [(\u2211_k vnk\u03b8k)\u22a4xn + h(xn) + g(\u2211_k vnk\u03b8k)] + \u2211_{k=1}^{K} [\u03bb\u22a4\u03b8k + g(\u03b8k) + f(\u03bb)] + \u2211_{n=1}^{N} [\u2212(K/2) ln(2\u03c0) \u2212 (1/2) ln|\u03a3| \u2212 (1/2)(vn \u2212 \u00b5)\u22a4\u03a3^{\u22121}(vn \u2212 \u00b5)] \u2212 (K/2) ln(2\u03c0) \u2212 (1/2) ln|S| \u2212 (1/2)(\u00b5 \u2212 m)\u22a4S^{\u22121}(\u00b5 \u2212 m) + \u2211_{i=1}^{K} [\u03b1 ln \u03b2 \u2212 ln \u0393(\u03b1) + (\u03b1 \u2212 1) ln \u03c3\u00b2i \u2212 \u03b2\u03c3\u00b2i]   (7)\n\nwhere the functions h(\u00b7), g(\u00b7) and f(\u00b7) correspond to the functions of the chosen conjugate distribution for the data.\n\n3.2 Learning\n\nThe model parameters \u2126 = {V, \u0398, \u00b5, \u03a3} are learned from the data using Hybrid Monte Carlo (HMC) sampling [7]. While the parameters \u03a8 = {m, S, \u03b1, \u03b2, \u03bb, \u03bd} are treated as fixed hyperparameters, these can also be learned from the data. Hybrid Monte Carlo is a suitable sampler for use with this model since all the variables are continuous and it is possible to compute the derivative of the log-joint probability. 
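As a hedged illustration of why HMC is applicable here, the snippet below implements only the Bernoulli likelihood term of equation (7) and verifies its gradient with respect to V by finite differences; the prior terms, and the full model, are omitted for brevity:

```python
import numpy as np

def loglik(V, Theta, X):
    # Bernoulli likelihood term of equation (7): sum of x*p + g(p),
    # with natural parameters P = V Theta and g(p) = -ln(1 + e^p).
    P = V @ Theta
    return np.sum(X * P - np.logaddexp(0.0, P))

def grad_V(V, Theta, X):
    # Analytic gradient w.r.t. V: (X - sigmoid(P)) Theta^T
    P = V @ Theta
    return (X - 1.0 / (1.0 + np.exp(-P))) @ Theta.T

rng = np.random.default_rng(1)
V = rng.normal(size=(5, 2))
Theta = rng.normal(size=(2, 7))
X = rng.binomial(1, 0.5, size=(5, 7)).astype(float)

# Central finite-difference check of one gradient entry
eps = 1e-6
dV = np.zeros_like(V)
dV[0, 0] = eps
num = (loglik(V + dV, Theta, X) - loglik(V - dV, Theta, X)) / (2 * eps)
assert abs(num - grad_V(V, Theta, X)[0, 0]) < 1e-5
```

In the full sampler, analogous derivatives for Theta, mu and Sigma complete the gradient vector used in the leapfrog simulation.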
HMC is also an attractive scheme for sampling since it avoids the random walk behaviour of the Metropolis or the Gibbs sampling algorithms [7].\n\nHybrid Monte Carlo (HMC) is an auxiliary variable sampler in which we sample from an augmented distribution p(x, u), rather than the target distribution p(x), since it is easier to sample from this augmented distribution [8]. HMC utilises the gradient of the target distribution to improve mixing in high dimensions. In BXPCA, the target distribution is E(\u2126|\u03a8) = \u2212 ln p(X, \u2126|\u03a8), which represents the potential energy function. The auxiliary variable u is Gaussian and is used to define the kinetic energy K = (1/2)u\u22a4u. Furthermore, we define the gradient vector \u2206(X, \u2126) \u225c \u2202E(\u2126)/\u2202\u2126, which can be computed using equation (7). The sum of the kinetic and the potential energy defines the Hamiltonian. Samples of \u2126 and u are obtained by combining the Hamiltonian with the gradient information in the simulation of so-called \u201cleapfrog\u201d steps. These details and the general pseudocode for HMC can be found in MacKay [9].\n\nOne key feature of HMC is that the dynamics are simulated in an unconstrained space. Therefore, to correctly apply HMC to this model, we must ensure that all constrained variables are transformed to an unconstrained space, perform the dynamics in this unconstrained space, and then transform the variables back to the original constrained space. The only variable that is constrained in BXPCA is \u03a3, where each diagonal element \u03c3\u00b2k > 0. Each \u03c3\u00b2k can be transformed to a corresponding unconstrained variable \u03bek using the transformation \u03c3\u00b2k = e^\u03bek. 
This transformation requires that we apply the chain rule for differentiation, and that we include the determinant of the Jacobian of the transformed variables, which is: |J| = |\u2202\u03c3\u00b2k/\u2202\u03bek| = |exp(\u03bek)| = \u03c3\u00b2k.\n\nWe also extended the HMC procedure to handle missing inputs in a principled manner, by analytically integrating them out. In practice, this implies working with missing data under the Missing at Random (MAR) assumption. Here, we divide the data into the set of observed and missing data, X = {Xobs, Xmissing}, and use the set Xobs in the inference.\n\n4 Related Work\n\nExponential Family PCA: Exponential family PCA (EPCA) [4] is a general class of PCA algorithms that allows the ideas of PCA to be applied to any data that can be modelled from a distribution in the exponential family. Like BXPCA, it is based on a factorisation of the data into a factor score matrix V and a factor loading matrix \u0398. The algorithm is based on the optimisation of a loss function defined by the Bregman divergence between the data and the learned reconstruction of the data. The learning is based on an alternating minimisation procedure in which the two matrices V and \u0398 are optimised in turn, and each optimisation is a convex function. The EPCA objective function can be seen as the likelihood function of a probabilistic model, and hence this optimisation corresponds to maximum a posteriori (MAP) learning. The use of MAP learning makes EPCA a deterministic latent variable model [6], since the latent variables are set to their MAP values.\n\nIn both our model and EPCA, the product P = V\u0398 represents the natural parameters of the distribution over the data, and must be transformed using the link function to get to the parameter space of the associated data distribution. 
Our model is different from EPCA in that it is a fully\nprobabilistic model in which all parameters can be integrated out by MCMC. Furthermore, EPCA\ndoes not include any form of regularisation and is prone to over\ufb01tting the data, which is avoided in\nthe Bayesian framework. We will compare BXPCA to EPCA throughout this paper.\n\nNon-negative Matrix Factorisation: Non-negative Matrix Factorisation (NMF) [2] is a technique\nof factorising a matrix into the product of two positive lower rank matrices. In NMF, the matrix\nproduct P approximates the mean parameters of the data distribution, and is thus in the same space\nas the data. A mean parameter for example, is the rate \u03bb if the data is modelled as a Poisson\ndistribution, or is the probability of data being a 1 if the data is modelled as a Bernoulli. In NMF,\nV and \u0398 are restricted to be positive matrices, and inference corresponds to maximum likelihood\nlearning with a Poisson likelihood. Similarly to EPCA, this learning method places NMF in the\nclass of deterministic latent variable methods.\n\n\fDiscrete Components Analysis: The Discrete Components Analysis (DCA) [3] is a family\nof probabilistic algorithms that deals with the application of PCA to discrete data and is a uni\ufb01ca-\ntion of the existing theory relating to dimensionality reduction with discrete distributions. In DCA\nthe product P = V\u0398 is the mean parameter of the appropriate distribution over that data, as with\nNMF, and also constrains V and \u0398 to be non-negative. The various algorithms of the DCA family\nare simulated using either Gibbs sampling or variational approximations.\n\nBayesian Partial Membership: The Bayesian Partial Membership (BPM) model is a clus-\ntering technique that allows data points to have fractional membership in multiple clusters. The\nmodel is derived from a \ufb01nite mixture model which allows the usual indicator variables to take on\nany value in the range [0,1]. 
The resulting model has the same form as the model shown in figure 1, but instead of the model variable V being modelled as a Gaussian with unknown mean and covariance, it is modelled as a Dirichlet distribution. This difference is important, since it affects the interpretation of the results. In BXPCA, we interpret the matrix V as a lower dimensional embedding of the data which can be used for dimensionality reduction. In contrast, the corresponding matrix for the BPM model, whose values are restricted to [0,1], is the partial membership of each data point and represents the extent to which each data point belongs to each of the K clusters.\n\n5 Results and Discussion\n\nSynthetic Data: Synthetic data was generated by creating three 16-bit prototype vectors, with each bit being generated with a probability of 0.5. Each of the three prototypes is replicated 200 times, resulting in a 600-point data set. We then flip bits in the replicates with a probability of 0.1, as in Tipping [10], thus adding noise about each of the prototypes. BXPCA inference was run using this data for 4000 iterations, using the first half as burn-in. Figure 2 demonstrates the learning process of BXPCA. In the initial phase of the sampling, the energy decreases slowly and the model is unable to learn any useful structure from the data. Around sample 750, the energy function decreases and some useful structure has been learnt. By sample 4000 the model has learnt the original data well, as can be seen by comparing sample 4000 and the original data.\n\nTo evaluate the performance of BXPCA, we define training and test data from the available\n\nFigure 2: Reconstruction of data from samples at various stages of the sampling. The top plot shows the change in the energy function. 
The lower plots show the reconstructions and the original data.\n\nFigure 3: Boxplots comparing the NLP and RMSE of BXPCA and EPCA for various latent factors.\n\ndata. The test data was created by randomly selecting 10% of the data points. These test data points were set as missing values in the training data. Inference is then run using BXPCA, which has been extended to consider missing data. This method of using missing data is a natural way of testing these algorithms, since both are generative models. We calculate the negative log probability (NLP) and the root mean squared error (RMSE) using the testing data. We evaluate the same metrics for EPCA, which is also trained considering missing data. This missing data testing methodology is also used in the experiments on real data that are described later.\n\nIn figures 3a and 3b, the RMSE and NLP of the two algorithms are compared respectively, for various choices of the latent factor K. EPCA shows characteristic underfitting for K = 1 and demonstrates severe overfitting for large K. This overfitting is seen in the very large values of NLP for EPCA. If we examine the RMSE on the training data shown in figure 3c, we see the overfitting problem highlighted further: the error on the training set is almost zero for EPCA, whereas BXPCA manages to avoid this problem. We expect that a random model would have an NLP = 10% \u00d7 600 \u00d7 16 = 960 bits, but the NLP values for EPCA are significantly larger than this. 
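The held-out metrics can be computed as in the following sketch (our own illustration; the paper gives no code). NLP is measured in bits under the predicted Bernoulli means, so a maximally uncertain model reproduces the 960-bit baseline quoted above:

```python
import numpy as np

def nlp_bits(x_test, p_pred):
    # Negative log probability of held-out bits, in bits (base-2 log)
    p = np.clip(p_pred, 1e-12, 1.0 - 1e-12)   # guard against log(0)
    return -np.sum(x_test * np.log2(p) + (1 - x_test) * np.log2(1 - p))

def rmse(x_test, p_pred):
    # Root mean squared error between predicted means and held-out bits
    return np.sqrt(np.mean((x_test - p_pred) ** 2))

# A model predicting p = 0.5 everywhere, scored on 10% of a 600 x 16
# dataset (960 held-out bits), gives exactly the 960-bit baseline.
x = np.zeros(960)
p = np.full(960, 0.5)
assert abs(nlp_bits(x, p) - 960.0) < 1e-6
assert abs(rmse(x, p) - 0.5) < 1e-12
```

An overconfident model that places near-zero probability on bits that are actually observed drives the NLP far above this baseline, which is the behaviour seen for EPCA.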
This is because as EPCA begins to overfit, it becomes highly confident in its predictions, and the proportion of bits which it believes are 1, for example, but which are actually 0, increases. This is shown in figure 3d, where we show the frequency of incorrect predictions, in which the error between the predicted and actual bits is greater than 0.95. BXPCA, based on a Bayesian approach, thus avoids overfitting and gives improved predictions.\n\nDigits Data: BXPCA was applied to the CEDAR Buffalo digits dataset. The digit 2 was used, and consists of 700 greyscale images with 64 attributes. The digits were binarised by thresholding at a greyscale value of 128 from the 0 to 255 greyscale range. Table 1 compares the performance of BXPCA and EPCA, using the same method of creating training and testing data sets as for the synthetic data. BXPCA has lower RMSE and NLP than EPCA and also does not exhibit overfitting at large K, which can be seen in EPCA by the large value of NLP at K = 5.\n\nSPECT Data: The data set describes the diagnosis of cardiac Single Photon Emission Computed Tomography (SPECT) images [11]. The data consists of 267 SPECT image sets, and has been processed resulting in 22 binary attributes. Table 2 compares the performance of BXPCA and EPCA. This dataset demonstrates that EPCA quickly overfits the data, as shown by the rapidly increasing values of NLP, and that the two algorithms perform equally well for low values of K.\n\n
Table 1: Comparison of BXPCA and EPCA on the digit 2 dataset.\n\nK            2        3        4        5\nBXPCA NLP    2032.3   2022.9   2002.4   2032.0\nBXPCA RMSE   0.389    0.385    0.380    0.383\nEPCA  NLP    2125.5   2482.1   2990.2   4708.8\nEPCA  RMSE   0.392    0.393    0.399    0.402\n\nTable 2: Comparison of BXPCA and EPCA on the SPECT dataset.\n\nK            1        2        3        4        5        6        7        8\nBXPCA NLP    348.67   343.40   325.94   331.47   291.75   305.22   310.36   319.06\nBXPCA RMSE   0.441    0.433    0.405    0.419    0.377    0.393    0.383    0.396\nEPCA  NLP    388.18   516.78   507.79   1096.6   1727.4   4030.0   4209.0   4330.0\nEPCA  RMSE   0.439    0.427    0.413    0.439    0.487    0.517    0.528    0.560\n\n6 Choice of Final Embedding\n\nFor the purposes of dimensionality reduction, PCA is used to search for a low dimensional embedding V of the data points. In EPCA, the alternating minimisation returns a single V that is the low dimensional representation. In BXPCA, however, we do not get a single V, but rather many samples which represent the variation in the embedding. Furthermore, we cannot simply take the average of these samples to obtain a single V, since we have not included any identifiability constraints in the model. This lack of identifiability subjects V to permutations of the columns, and to rotations of the matrix, making an average of the samples meaningless.\n\nThere are several approaches to obtaining a single low dimensional representation from the set of samples. The simplest approach is to choose, from the set of available samples, the best global configuration, {V\u2217, \u0398\u2217} = arg max_{\u2126(s)} p(X, \u2126(s)|\u03a8), and use this V\u2217. 
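Choosing the best global configuration from the HMC samples can be sketched as follows (illustrative; `log_joint` is a hypothetical stand-in for ln p(X, Omega(s)|Psi) evaluated at a sample):

```python
import numpy as np

def best_sample(samples, log_joint):
    # samples: list of (V, Theta) pairs drawn by the sampler; keep the
    # pair that maximises the log-joint probability of the model.
    scores = [log_joint(V, Theta) for V, Theta in samples]
    return samples[int(np.argmax(scores))]

# Toy stand-in: score each sample with a simple function of its entries
rng = np.random.default_rng(2)
samples = [(rng.normal(size=(4, 2)), rng.normal(size=(2, 6))) for _ in range(5)]
toy_log_joint = lambda V, Theta: -np.sum(V ** 2) - np.sum(Theta ** 2)

V_star, Theta_star = best_sample(samples, toy_log_joint)
assert any(V_star is V for V, _ in samples)
```

In practice the score would be the log-joint of equation (7) evaluated at each stored sample, and V_star becomes the single reported embedding.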
A second approach aims to give further information about the variability of the embedding. We begin by fixing the model parameters to {\u0398\u2217, \u00b5\u2217, \u03a3\u2217}. These can be set using the sample chosen in the first approach. We then sample V from the conditional distribution:\n\nV \u223c p(V|X, \u0398\u2217, \u00b5\u2217, \u03a3\u2217) \u221d p(X|V, \u0398\u2217)p(V|\u00b5\u2217, \u03a3\u2217)   (8)\n\nwhere equation (8) is obtained using Bayes' theorem and the joint probability distribution given in equation (6). We can now average these samples to obtain a single embedding, since the problems of rotation and permutation have been removed by constraining the variables {\u0398\u2217, \u00b5\u2217, \u03a3\u2217}. We demonstrate this procedure using the synthetic data described in the previous section for K = 2. Figure 4 shows the embedding in the 2D space for 10 data points and 20 independent samples drawn according to equation (8). The graph shows that there is some mean value, and also gives us an understanding of the variation that is possible in this 2D embedding. The drawback of this last approach is that it does not give any indication of the effect of variation in \u0398. To gain some understanding of this effect, we can further extend this approach by choosing Q random samples, \u0398\u2217 = {\u0398\u2217(1), \u0398\u2217(2), . . . , \u0398\u2217(Q)}, at convergence of the HMC sampler. We then repeat the aforementioned procedure for these various \u0398\u2217(q). This gives an understanding of the variability of the final embedding, in terms of both \u0398 and V.\n\n7 Conclusions and Future Work\n\nWe have described a Bayesian approach to PCA which is generalised to the exponential family. We have employed a Hybrid Monte Carlo sampling scheme with an energy based on the log-joint probability of the model. 
In particular, we have demonstrated the ability of BXPCA to learn the structure of the data while avoiding the overfitting problems experienced by other maximum likelihood approaches to exponential family PCA. We have demonstrated this using both synthetic and real data.\n\nFigure 4: Variation in final embedding for 10 data points and various samples of V\n\nIn future work, the model can be extended by considering an alternate distribution for the factor score matrix V. Instead of a Gaussian distribution, a Laplacian or other heavy-tailed distribution could be used, which would allow us to determine the lower dimensional embedding of the data, and also give the model a sparseness property. We could also specifically include restrictions on the form of the score and loading matrices, V and \u0398 respectively, to ensure identifiability. This makes learning in the model more complex, since we must ensure that the restrictions are maintained. It will also prove interesting to consider alternate forms of inference, specifically the techniques of sequential Monte Carlo, to allow for online inference.\n\nAcknowledgements: We thank Peter Gehler for the EPCA implementation. SM thanks the NRF SA and the Commonwealth Commission for support. KH was supported by an EPSRC Postdoctoral Fellowship (grant no. EP/E042694/1).\n\nReferences\n\n[1] C. M. Bishop, Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, August 2006.\n\n[2] D. D. Lee and H. S. Seung, \u201cAlgorithms for non-negative matrix factorization,\u201d in Advances in Neural Information Processing Systems, vol. 13, pp. 556 \u2013 562, MIT Press, Cambridge, MA, 2001.\n\n[3] W. Buntine and A. Jakulin, \u201cDiscrete components analysis,\u201d in Subspace, Latent Structure and Feature Selection, vol. 3940/2006, pp. 1\u201333, Springer (LNCS), 2006.\n\n[4] M. Collins, S. Dasgupta, and R. 
Schapire, \u201cA generalization of principal components to the exponential family,\u201d in Advances in Neural Information Processing Systems, vol. 14, pp. 617 \u2013 624, MIT Press, Cambridge, MA, 2002.\n\n[5] Sajama and A. Orlitsky, \u201cSemi-parametric exponential family PCA,\u201d in Advances in Neural Information Processing Systems, vol. 17, pp. 1177 \u2013 1184, MIT Press, Cambridge, MA, 2004.\n\n[6] M. Welling, C. Chemudugunta, and N. Sutter, \u201cDeterministic latent variable models and their pitfalls,\u201d in SIAM Conference on Data Mining (SDM), pp. 196 \u2013 207, 2008.\n\n[7] R. M. Neal, \u201cProbabilistic inference using Markov Chain Monte Carlo methods,\u201d Tech. Rep. CRG-TR-93-1, University of Toronto, Department of Computer Science, 1993.\n\n[8] C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan, \u201cAn introduction to MCMC for machine learning,\u201d Machine Learning, vol. 50, pp. 5\u201343, 2003.\n\n[9] D. J. C. MacKay, Information Theory, Inference & Learning Algorithms. Cambridge University Press, June 2002.\n\n[10] M. E. Tipping, \u201cProbabilistic visualisation of high dimensional binary data,\u201d in Advances in Neural Information Processing Systems, vol. 11, pp. 592 \u2013 598, MIT Press, Cambridge, MA, 1999.\n\n[11] \u201cUCI machine learning repository.\u201d http://archive.ics.uci.edu/ml/datasets/.", "award": [], "sourceid": 745, "authors": [{"given_name": "Shakir", "family_name": "Mohamed", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Katherine", "family_name": "Heller", "institution": null}]}