{"title": "Semi-parametric Exponential Family PCA", "book": "Advances in Neural Information Processing Systems", "page_first": 1177, "page_last": 1184, "abstract": null, "full_text": " Semi-parametric exponential family PCA\n\n\n\n Sajama Alon Orlitsky\n Department of Electrical and Computer Engineering\n University of California at San Diego, La Jolla, CA 92093\n sajama@ucsd.edu and alon@ece.ucsd.edu\n\n\n\n\n Abstract\n\n We present a semi-parametric latent variable model based technique for\n density modelling, dimensionality reduction and visualization. Unlike\n previous methods, we estimate the latent distribution non-parametrically\n which enables us to model data generated by an underlying low dimen-\n sional, multimodal distribution. In addition, we allow the components\n of latent variable models to be drawn from the exponential family which\n makes the method suitable for special data types, for example binary or\n count data. Simulations on real valued, binary and count data show fa-\n vorable comparison to other related schemes both in terms of separating\n different populations and generalization to unseen samples.\n\n\n1 Introduction\n\nPrincipal component analysis (PCA) is widely used for dimensionality reduction with ap-\nplications ranging from pattern recognition and time series prediction to visualization. One\nimportant limitation of PCA is that it is not based on a probability model. A proba-\nbilistic formulation of PCA can offer several advantages like allowing statistical testing,\napplication of Bayesian inference methods and naturally accommodating missing values\n[1]. Latent variable models are commonly used in statistics to summarize observations\n[2]. 
A latent variable model assumes that the distribution of data is determined by a latent or mixing distribution P(θ) and a conditional or component distribution P(x|θ), i.e., P(x) = ∫ P(θ)P(x|θ) dθ.

Probabilistic PCA (PPCA) [1] borrows from one such popular model, called factor analysis, to propose a probabilistic alternative to PCA. A key feature of this probabilistic model is that the latent distribution P(θ) is also assumed to be Gaussian, since this leads to simple and fast model estimation: the density of x is approximated by a Gaussian distribution whose covariance matrix is aligned along a lower dimensional subspace. This may be a good approximation when data is drawn from a single population and the goal is to explain the data in terms of a few variables. However, in machine learning we often deal with data drawn from several populations, and PCA is used to reduce dimensions to control the computational complexity of learning. A mixture model with a Gaussian latent distribution would not be able to capture this information. The projection obtained using a Gaussian latent distribution tends to be skewed toward the center [1], and hence the distinction between nearby sub-populations may be lost in the visualization space. For these reasons, it is important not to make restrictive assumptions about the latent distribution. Several recently proposed dimension reduction methods can, like PPCA, be thought of as special cases of latent variable modelling which differ in the specific assumptions they make about the latent and conditional distributions.

We present an alternative probabilistic formulation, called semi-parametric PCA (SP-PCA), in which no assumptions are made about the distribution of the latent random variable θ. Non-parametric latent distribution estimation allows us to approximate the data density better than previous schemes and hence gives better low dimensional representations. 
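The gain can be illustrated with a small numerical sketch of the latent variable density (our own illustration, not code from the paper; all values are hypothetical): a discrete, two-point mixing distribution produces a bimodal data density that no single Gaussian latent distribution can reproduce.

```python
# A sketch (ours, not the paper's code) of the latent variable density
# P(x) = sum_k pi_k P(x | theta_k) with a discrete two-point mixing
# distribution and spherical Gaussian components.
import numpy as np

def mixture_density(x, thetas, weights, sigma=1.0):
    # P(x) = sum_k pi_k N(x; theta_k, sigma^2 I)
    x = np.atleast_2d(np.asarray(x, dtype=float))
    d = x.shape[1]
    norm = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0)
    dens = np.zeros(len(x))
    for pi_k, theta_k in zip(weights, thetas):
        sq = np.sum((x - theta_k) ** 2, axis=1)
        dens += pi_k * norm * np.exp(-sq / (2.0 * sigma ** 2))
    return dens

# Hypothetical latent distribution supported on two points in R^2
thetas = [np.array([-3.0, 0.0]), np.array([3.0, 0.0])]
weights = [0.5, 0.5]
p_between = mixture_density([[0.0, 0.0]], thetas, weights)[0]
p_at_mode = mixture_density([[3.0, 0.0]], thetas, weights)[0]
print(p_between < p_at_mode)  # True: the density dips between the two modes
```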
In particular, multi-modality of the high dimensional density is better preserved in the projected space. When the observed data is composed of several clusters, this technique can be viewed as performing simultaneous clustering and dimensionality reduction. To make our method suitable for special data types, we allow the conditional distribution P(x|θ) to be any member of the exponential family of distributions. Use of exponential family distributions for P(x|θ) is common in statistics, where it is known as latent trait analysis, and they have also been used in several recently proposed dimensionality reduction schemes [3, 4]. We use Lindsay's non-parametric maximum likelihood estimation theorem to reduce the estimation problem to one with a large enough discrete prior. It turns out that this choice gives us a prior which is `conjugate' to all exponential family distributions, allowing us to give a unified algorithm for all data types. This choice also makes it possible to efficiently estimate the model even in the case when different components of the data vector are of different types.

2 The constrained mixture model

We assume that the d-dimensional observation vectors x1, . . . , xn are outcomes of iid draws of a random variable whose distribution P(x) = ∫ P(θ)P(x|θ) dθ is determined by the latent distribution P(θ) and the conditional distribution P(x|θ). This can also be viewed as a mixture density, with P(θ) the mixing distribution, the mixture components labelled by θ, and P(x|θ) the component distribution corresponding to θ. The latent distribution is used to model the interdependencies among the components of x, and the conditional distribution to model `noise'. For example, in the case of a collection of documents we can think of the `content' of a document as a latent variable, since it cannot be measured. 
For any given content, the words used in the document and their frequency may depend on random factors, for example what the author has been reading recently, and this can be modelled by P(x|θ).

Conditional distribution P(x|θ): We assume that P(θ) adequately models the dependencies among the components of x, and hence that the components of x are independent when conditioned upon θ, i.e., P(x|θ) = ∏j P(xj|θj), where xj and θj are the j'th components of x and θ. As noted in the introduction, using Gaussian conditional distributions and constraining their means to a lower dimensional subspace of the data space is equivalent to using Euclidean distance as a measure of similarity. This Gaussian model may not be appropriate for other data types; for instance, the Bernoulli distribution may be better for binary data and the Poisson for integer data. These three distributions, along with several others, belong to a family of distributions known as the exponential family [5]. Any member of this family can be written in the form

    log P(x|θ) = log P0(x) + xθ − G(θ)

where θ is called the natural parameter and G(θ) is a function that ensures that the probabilities sum to one. An important property of this family is that the mean μ of a distribution and its natural parameter are related through a monotone invertible, nonlinear function μ = G′(θ) = g(θ). It can be shown that the negative log-likelihoods of exponential family distributions can be written as Bregman distances (ignoring constants), which are a family of generalized metrics associated with convex functions [4]. Note that by using different distributions for the various components of x, we can model mixed data types.

Latent distribution P(θ): Like previous latent variable methods, including PCA, we constrain the latent variable θ to an ℓ-dimensional Euclidean subspace of Rd to model the belief that the intrinsic dimensionality of the data is smaller than d. 
One way to represent the (unknown) linear constraint on the values that θ can take is to write θ as an invertible linear transformation of another random variable a which takes values in Rℓ,

    θ = aV + b

where V is an ℓ × d rotation matrix and b is a d-dimensional displacement vector. Hence any distribution P(θ) satisfying the low dimensional constraints can be represented using a triple (P(a), V, b), where P(a) is a distribution over Rℓ. Lindsay's mixture non-parametric maximum likelihood estimation (NPMLE) theorem states that for fixed (V, b), the maximum likelihood (ML) estimate of P(a) exists and is a discrete distribution with no more than n distinct points of support [6]. Hence, if ML is the chosen parameter estimation technique, the SP-PCA model can be assumed (without loss of generality) to be a constrained finite mixture model with at most n mixture components. The number of mixture components in the model, n, grows with the amount of data, and we propose to use pruning to reduce the number of components during model estimation, which helps both computational speed and model generalization. Finally, we note that instead of the natural parameter, any of its invertible transformations could have been constrained to a lower dimensional space. Choosing to linearly constrain the natural parameter affords us computational advantages similar to those available when we use the canonical link in generalized linear regression.

Low dimensional representation: There are several ways in which low dimensional representations can be obtained using the constrained mixture model. We would ideally like to represent a given observation x by the unknown θ (or the corresponding a related to θ by θ = aV + b) that generated it, since the conditional distribution P(x|θ) is used to model random effects. 
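The constrained parameterization can be sketched numerically as follows (all dimensions and values below are illustrative, not from the paper):

```python
# A numerical sketch of the affine constraint theta_k = a_k V + b: every
# component natural parameter lies on an l-dimensional affine subspace of
# R^d, and by Lindsay's NPMLE theorem at most n support points
# a_1, ..., a_n are needed.
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 20, 10, 2                  # samples, data dim, latent dim

V = rng.standard_normal((l, d))      # rows span the latent subspace
b = rng.standard_normal(d)           # displacement vector
A = rng.standard_normal((n, l))      # at most n support points a_k

Theta = A @ V + b                    # component natural parameters, (n, d)

# Check: theta_k - b lies in the row space of V for every component k
proj = ((Theta - b) @ np.linalg.pinv(V)) @ V   # project onto row space of V
print(np.allclose(Theta - b, proj))  # True
```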
However, the actual value of a is not known to us, and all of our knowledge of a is contained in the posterior distribution P(a|x) = P(a)P(x|a)/P(x). Since a belongs to an ℓ-dimensional space, any of its estimators, such as the posterior mean or mode (MAP estimate), can be used to represent x in ℓ dimensions. For the simulation results presented in this paper, we use the posterior mean as the representation point. This representation has been used in other latent variable methods to get meaningful low dimensional views [1, 3]. Another method is to represent x by the point θ on the plane (V, b) that is closest according to the appropriate Bregman distance (it can be shown that there is a unique such θopt on the plane). This representation is a generalization of the standard Euclidean projection and was used in [4].

The Gaussian case: When the exponential family distribution chosen is Gaussian, the model is a mixture of n spherical Gaussians all of whose means lie on a hyperplane in the data space. This can be thought of as a `soft' version of PCA, i.e., the Gaussian case of SP-PCA is related to PCA in the same manner as the Gaussian mixture model is related to K-means. The use of an arbitrary mixing distribution over the plane allows us to approximate an arbitrary spread of data along the hyperplane. Use of fixed variance spherical Gaussians ensures that, as in PCA, the direction perpendicular to the plane (V, b) is irrelevant in any metric involving relative values of the likelihoods P(x|θk), including the posterior mean.

Consider the case when the data density P(x) belongs to our model space, i.e., it is specified by {A, V, b, π, σ}, and let D be any direction parallel to the plane (V, b) along which the latent distribution P(θ) has non-zero variance. Since Gaussian noise with variance σ2 is added to this latent distribution to obtain P(x), the variance of P(x) along D will be greater than σ2. The variance of P(x) along any direction perpendicular to (V, b) will be exactly σ2. 
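This variance argument can be checked with a quick simulation (a sketch under illustrative values; not the paper's code):

```python
# Data generated from the Gaussian case of the model has variance sigma^2
# in every direction perpendicular to the plane (V, b), and larger variance
# along any in-plane direction where the latent distribution varies.
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200000, 3, 1.0
V = np.array([[1.0, 0.0, 0.0]])           # 1-dimensional latent subspace
b = np.zeros(d)

a = rng.choice([-4.0, 4.0], size=(n, 1))  # two-point latent distribution
X = a @ V + b + sigma * rng.standard_normal((n, d))

var_in_plane = X[:, 0].var()              # along V: 16 + sigma^2
var_perp = X[:, 1].var()                  # perpendicular: sigma^2
print(var_in_plane > var_perp)            # True

# The top principal component recovers the subspace direction V
w, U = np.linalg.eigh(np.cov(X.T))
top = U[:, np.argmax(w)]
print(abs(top @ V[0]) > 0.99)             # True: aligned with V
```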
Hence, PCA of P(x) yields the subspace (V, b), which is the same as that obtained using SP-PCA (this may not be true when P(x) does not belong to our model space). We found that SP-PCA differs significantly from PPCA in the predictive power of the low-dimensional density model (see Section 5).

3 Model estimation

Algorithm for ML estimation: We present an EM algorithm for estimating the parameters of a finite mixture model with the components constrained to an ℓ-dimensional Euclidean subspace. We propose an iterative re-weighted least squares (IRLS) method for the maximization step, along the lines of generalized linear model estimation. Use of weighted least squares does not guarantee a monotone increase in data likelihood. To guarantee convergence of the algorithm, we can check the likelihood of the data at the IRLS update and decrease the step size if necessary. Let x1, . . . , xn be iid samples drawn from a d-dimensional density P(x), let c be the number of mixture components and let the mixing density be π = (π1, . . . , πc). Associated with each mixture component (indexed by k) are parameter vectors θk and ak, which are related by θk = akV + b. In this section we work under the assumption that all components of x correspond to the same exponential family, for ease of notation. For each observed xi there is an unobserved `missing' variable zi, a c-dimensional binary vector whose k'th component is one if the k'th mixture component was the outcome in the i'th random draw and zero otherwise. If yl is a vector, we use ylm to denote its m'th component. 
(The derivation of the algorithm is omitted for lack of space; for details please see [7].)

The E-step is identical to the unconstrained finite mixture case (sums over i run from 1 to n, over j from 1 to d, and over k and m from 1 to c):

    ẑik = E(zik) = πk P(xi|θk) / Σm πm P(xi|θm) ;    x̃kj = (Σi ẑik xij) / (Σi ẑik)

In the M-step we update π, V, b, and ak in the following manner:

    πk = (Σi ẑik) / (Σi Σm ẑim) = (Σi ẑik) / n

ak is updated by adding Δak calculated using

    (V Λk V^T) Δak = GRk ;    [Λk]qq = ∂g(θkq)/∂θkq ;    [GRk]l = Σj (x̃kj − g(θkj)) Vlj

Here the function g() is as defined in Section 2 and depends on the member of the exponential family that is being used. Each column of the matrix V, vs, is updated by adding Δvs calculated using

    (A^T Λs A) Δvs = GRs ;    [Λs]kk = ∂g(θks)/∂θks ;    [GRs]l = Σk (x̃ks − g(θks)) Akl

Each component of the vector b, bs, is updated by adding Δbs calculated using

    Hs Δbs = GRs ;    Hs = Σk ∂g(θks)/∂θks ;    GRs = Σk (x̃ks − g(θks))

Pruning the mixture components: Redundant mixture components can be pruned between the EM iterations in order to improve the speed of the algorithm and its generalization properties, while retaining the full capability to approximate P(x). We propose the following pruning criteria:

    Starved components: if πk < C1, then drop the k'th component.

    Nearby components: if maxi |P(xi|θk1) − P(xi|θk2)| < C2, then drop either the k1'th or the k2'th component.

The value of C1 should be of the order of 1/n, since we want to measure how starved a component is by the fraction of the data it is `responsible' for. To measure the nearness of components we use the ∞-norm of the difference between the probabilities the components assign to observations, since we do not want to lose mixture components that are distinguished with respect to a small number of observation vectors. In the case of clustering this means that we do not ignore under-represented clusters. 
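A minimal sketch of the E-step responsibilities and the two pruning criteria might look as follows (thresholds and array shapes are placeholders; this is not the authors' implementation):

```python
# E-step responsibilities and component pruning, sketched with numpy.
import numpy as np

def e_step(log_px_given_theta, pi):
    # z_hat[i, k] = pi_k P(x_i|theta_k) / sum_m pi_m P(x_i|theta_m),
    # computed in the log domain for numerical stability.
    log_w = np.log(pi) + log_px_given_theta      # shape (n, c)
    log_w -= log_w.max(axis=1, keepdims=True)
    w = np.exp(log_w)
    return w / w.sum(axis=1, keepdims=True)

def prune(pi, px_given_theta, c1, c2):
    # Returns a boolean mask of components to keep.
    keep = np.asarray(pi) >= c1                  # drop starved components
    c = len(keep)
    for k1 in range(c):
        for k2 in range(k1 + 1, c):
            if keep[k1] and keep[k2]:
                # infinity-norm of the difference between the probabilities
                # the two components assign to the observations
                if np.max(np.abs(px_given_theta[:, k1]
                                 - px_given_theta[:, k2])) < c2:
                    keep[k2] = False             # drop one of a nearby pair
    return keep
```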
C2 should be chosen to be a small constant, depending on how much pruning is desired.

Convergence of the EM iterations and computational complexity: It is easy to verify that the SP-PCA model satisfies the continuity assumptions of Theorem 2 of [8], and hence we can conclude that any limit point of the EM iterations is a stationary point of the log likelihood function. The computational complexity of the E-step is O(cdn) and that of the M-step is O(cdℓ2). For the Gaussian case, the E-step only takes O(cℓn), since we only need to take into account the variation of the data along the subspace given by the current value of V (see Section 2). The most expensive step is the computation of P(xi|θj). The k-d tree data structure is often used to identify relevant mixture components to speed up this step.

Model selection: While any of the standard model selection methods based on penalizing complexity could be used to choose ℓ, an alternative is to pick the ℓ which minimizes a validation or bootstrap based estimate of the prediction error (negative log likelihood per sample). For the Gaussian case, a fast method to pick ℓ would be to plot the variance of the data along the principal directions (found using PCA) and look for the dimension at which there is a `knee', i.e., a sudden drop in variance, or at which the total residual variance falls below a chosen threshold.

Consistency of the Maximum Likelihood estimator: We propose to use the ML estimator to find the latent space (V, b) and the latent distribution P(a). Usually a parametric form is assumed for P(a), and the consistency of the ML estimate is well known for this task, where the parameter space is a subset of a finite dimensional Euclidean space. In the SP-PCA model, one of the parameters (P(a)) ranges over the space of all distribution functions on Rℓ, and hence we need to do more to verify the validity of our estimator. Exponential family mixtures are not identifiable in general. 
This, however, is not a problem for us, since we are only interested in approximating P(x) well and not in the actual parameters corresponding to the distribution. Hence we use the definition of consistency of an estimator given by Redner. Let φ0 be the `true' parameter from which the observed samples are drawn. Let Cφ0 be the set of all parameters corresponding to the `true' distribution F(x|φ0), i.e., Cφ0 = {φ : F(x|φ) = F(x|φ0) for all x}. Let φ̂n be an estimator of φ based on n observed samples of X, and let Φ̂ be the quotient topological space obtained from the parameter space Φ by identifying the set Cφ0 to a point φ̂0.

    Definition: The sequence of estimators {φ̂n, n = 1, 2, . . .} is said to be strongly consistent in the sense of Redner if limn→∞ φ̂n = φ̂0 almost surely.

    Theorem: If P(a) is assumed to be zero outside a bounded subset of Rℓ, the ML estimator of the parameter (V, b, P(a)) is strongly consistent for Gaussian, Bernoulli and Poisson conditional distributions.

The theorem follows by verifying that the assumptions of Kiefer et al. [9] are satisfied by the SP-PCA model. The assumption that P(a) is zero outside a bounded region is not restrictive in practice, since we expect the observations xi to belong to a bounded region of Rd. (Proof omitted for lack of space; please see [7].)

 Table 1: Bootstrap estimates of prediction error for PPCA and SP-PCA.

             ISOTROPIC         PPCA                  SP-PCA              FULL
   DENSITY   GAUSSIAN    ℓ=1    ℓ=2    ℓ=3     ℓ=1    ℓ=2    ℓ=3     GAUSSIAN

   ERROR     50.39       38.03  34.71  34.76   36.85  30.99  28.54   343.83

4 Relationship to past work

SP-PCA is a factor model that makes fewer assumptions about the latent distribution than PPCA [1]. Mixtures of probabilistic principal component analyzers (also known as mixtures of factor analyzers) is a generalization of PPCA which overcomes the limitation of global linearity of PCA via local dimensionality reduction. Mixtures of SP-PCAs can be similarly defined and used for local dimensionality reduction. Collins et al. 
[4] proposed a generalization of PCA using exponential family distributions. Note that this generalization is not associated with a probability density model for the data. SP-PCA can be thought of as a `soft' version of this generalization of PCA, in the same manner as Gaussian mixtures are a soft version of K-means. Generative topographic mapping (GTM) is a probabilistic alternative to the self organizing map which aims at finding a nonlinear lower dimensional manifold passing close to the data points. An extension of GTM using exponential family distributions to deal with binary and count data is described in [3]. Apart from the fact that GTM is a non-linear dimensionality reduction technique while SP-PCA is globally linear like PCA, one main feature that distinguishes the two is the choice of latent distribution. GTM assumes that the latent distribution is uniform over a finite and discrete grid of points. Both the location of the grid and the nonlinear mapping are to be given as inputs to the algorithm. Tibshirani [10] used a semi-parametric latent variable model for the estimation of principal curves. Discussion of these and other dimensionality reduction schemes based on latent trait and latent class models can be found in [7].

5 Experiments

In this section we present simulations on synthetic and real data to demonstrate the properties of SP-PCA. In the factor analysis literature, it is commonly believed that the choice of prior distribution is unimportant for low dimensional data summarization (see [2], Sections 2.3, 2.10 and 2.16). 
Through the examples below we argue that estimating the prior, instead of assuming it arbitrarily, can make a difference when latent variable models are used for density approximation, data analysis and visualization.

Use of SP-PCA as a low dimensional density model: The Tobamovirus data, which consists of 38 18-dimensional examples, was used in [1] to illustrate properties of PPCA. PPCA and SP-PCA can be thought of as providing a range of low-dimensional density models for the data. The complexity of these densities increases with, and is controlled by, the value of ℓ (the projected space dimension), starting with the zero dimensional model of an isotropic Gaussian. For a fixed lower dimension ℓ, SP-PCA has greater approximation capability than PPCA. In Table 1, we present bootstrap estimates of the predictive power of PPCA and SP-PCA for various values of ℓ. SP-PCA has lower prediction error than PPCA for ℓ = 1, 2 and 3. This indicates that SP-PCA combines flexible density estimation with excellent generalization, even when trained on a small amount of data.

Simulation results on discrete datasets: We present experiments on the 20 Newsgroups dataset comparing SP-PCA to PCA, exponential family GTM [3] and exponential family PCA [4]. Data for the first set of simulations was drawn from the comp.sys.ibm.pc.hardware, comp.sys.mac.hardware and sci.med newsgroups. A dictionary size of 150 words was chosen, and the words in the dictionary were picked to be those which have maximum mutual information with the class labels. 200 documents were drawn from each of the three newsgroups to form the training data. Two-dimensional representations obtained using the various methods are shown in Fig. 1. In the projections obtained using PCA, exponential family PCA and Bernoulli GTM, the classes comp.sys.ibm.pc.hardware and comp.sys.mac.hardware were not well separated in the 2D space. This result (Fig. 
1(c)) was presented in [3], and the overlap between the two groups was attributed to the fact that they are very similar and hence share many words in common. However, SP-PCA was able to separate the three sets reasonably well (Fig. 1(d)). One way to quantify the separation of dissimilar groups in the two-dimensional projections is the training set classification error of the projected data using an SVM. The accuracy of the best SVM classifier (we tried a range of SVM parameter values and picked the best for each projected data set) was 75% for the Bernoulli GTM projection and 82.3% for the SP-PCA projection (the difference corresponds to 44 data points out of a total of 600). We conjecture that the reason comp.sys.ibm.pc.hardware and comp.sys.mac.hardware overlap in the projection using Bernoulli GTM is that the prior is assumed to be over a pre-specified grid in latent space, and the spacing between grid points happened to be large in the region of parameter space close to the two newsgroups. In contrast, in SP-PCA there is no grid and the latent distribution is allowed to adapt to the given data set. Note that a standard clustering algorithm could be used on the data projected using SP-PCA to conclude that the data consisted of three kinds of documents.

[Scatter plots omitted: four panels, (a) PCA, (b) Exponential PCA, (c) GTM, (d) SP-PCA]

Figure 1: Projection by various methods of binary data from 200 documents each from comp.sys.ibm.pc.hardware, comp.sys.mac.hardware and sci.med.

Data for the second set of simulations was drawn from the sci.crypt, sci.med, sci.space and soc.culture.religion.christianity newsgroups. 
A dictionary size of 100 words was chosen, and again the words in the dictionary were picked to be those which have maximum mutual information with the class labels. 100 documents were drawn from each of the newsgroups to form the training data and 100 more to form the test data. Fig. 2 shows two-dimensional representations of the binary data obtained using the various methods. Note that while the four newsgroups are bunched together in the projection obtained using exponential family PCA [4] (Fig. 2(b)), we can still detect the presence of four groups from this projection, and in this sense this projection is better than the PCA projection. This result is pleasing, since it confirms our intuition that using the negative log-likelihood of the Bernoulli distribution as a measure of similarity is more appropriate than squared Euclidean distance for binary data. We conjecture that the reason the four groups are not well separated in this projection is that a conjugate prior has to be used in its estimation for computational purposes [4], and the form and parameters of this prior are considered fixed and given as inputs to the algorithm. Both SP-PCA (Fig. 2(c)) and Bernoulli GTM (Fig. 2(e)) were able to clearly separate the clusters in the training data. Figures 2(d) and 2(f) show representations of the test data using the models estimated by SP-PCA and Bernoulli GTM respectively. To measure the generalization of these methods, we use a K-nearest neighbors based non-parametric estimate of the density of the projected training data. The percentage difference between the log-likelihoods of the training and test data with respect to this density was 9.1% for SP-PCA and 17.6% for GTM for K = 40 (SP-PCA had the smaller percentage change in log-likelihood for most values of K that we tried between 10 and 40). This indicates that SP-PCA generalizes better than GTM. 
This can be seen visually by comparing Figures 2(e) and 2(f), where the projections of the training and test data of sci.space differ significantly.

[Scatter plots omitted: six panels, (a) PCA, (b) Exponential PCA, (c) SP-PCA, (d) Test data - SP-PCA, (e) Bernoulli GTM, (f) Test data - GTM]

Figure 2: Projection by various methods of binary data from 100 documents each from sci.crypt, sci.med, sci.space and soc.culture.religion.christianity.

Acknowledgments

We thank Sanjoy Dasgupta and Thomas John for helpful conversations.

References

 [1] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611-622, 1999.
 [2] David J. Bartholomew and Martin Knott. Latent Variable Models and Factor Analysis, volume 7 of Kendall's Library of Statistics. Oxford University Press, 2nd edition, 1999.
 [3] A. Kaban and M. Girolami. A combined latent class and trait model for the analysis and visualization of discrete data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8):859-872, August 2001.
 [4] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems 14, 2002.
 [5] P. McCullagh and J. A. Nelder. Generalized Linear Models. Monographs on Statistics and Applied Probability. Chapman and Hall, 1983.
 [6] B. G. Lindsay. 
The geometry of mixture likelihoods: a general theory. The Annals of Statistics, 11(1):86-94, 1983.
 [7] Sajama and A. Orlitsky. Semi-parametric exponential family PCA: reducing dimensions via non-parametric latent distribution estimation. Technical Report CS2004-0790, University of California at San Diego, http://cwc.ucsd.edu/~sajama, 2004.
 [8] C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95-103, 1983.
 [9] J. Kiefer and J. Wolfowitz. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27:887-906, 1956.
[10] R. Tibshirani. Principal curves revisited. Statistics and Computing, 2:183-190, 1992.
", "award": [], "sourceid": 2693, "authors": [{"given_name": "Sajama", "family_name": "Sajama", "institution": null}, {"given_name": "Alon", "family_name": "Orlitsky", "institution": null}]}