{"title": "A Generalization of Principal Components Analysis to the Exponential Family", "book": "Advances in Neural Information Processing Systems", "page_first": 617, "page_last": 624, "abstract": null, "full_text": "A Generalization of Principal Component Analysis to the Exponential Family\n\nMichael Collins\n\nSanjoy Dasgupta\n\nRobert E. Schapire\n\nAT&T Labs - Research\n\n180 Park Avenue, Florham Park, NJ 07932\n\n{mcollins, dasgupta, schapire}@research.att.com\n\nAbstract\n\nPrincipal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the Exponential family, Generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.\n\n1 Introduction\n\nPrincipal component analysis (PCA) is a hugely popular dimensionality reduction technique that attempts to find a low-dimensional subspace passing close to a given set of points $x_1, \\ldots, x_n \\in \\mathbb{R}^d$. 
More specifically, in PCA, we find a lower dimensional subspace that minimizes the sum of the squared distances from the data points $x_i$ to their projections $\\theta_i$ in the subspace, i.e.,\n\n$$\\sum_{i=1}^{n} \\|x_i - \\theta_i\\|^2. \\tag{1}$$\n\nThis turns out to be equivalent to choosing a subspace that maximizes the sum of the squared lengths of the projections $\\theta_i$, which is the same as the (empirical) variance of these projections if the data happens to be centered at the origin (so that $\\sum_i x_i = 0$).\n\nPCA also has another convenient interpretation that is perhaps less well known. In this probabilistic interpretation, each point $x_i$ is thought of as a random draw from some unknown distribution $P_{\\theta_i}$, where $P_\\theta$ denotes a unit Gaussian with mean $\\theta \\in \\mathbb{R}^d$. The purpose then of PCA is to find the set of parameters $\\theta_1, \\ldots, \\theta_n$ that maximizes the likelihood of the data, subject to the condition that these parameters all lie in a low-dimensional subspace. In other words, $x_1, \\ldots, x_n$ are considered to be noise-corrupted versions of some true points $\\theta_1, \\ldots, \\theta_n$ which lie in a subspace; the goal is to find these true points, and the main assumption is that the noise is Gaussian. The equivalence of this interpretation to the ones given above follows simply from the fact that negative log likelihood under this Gaussian model is equal (ignoring constants) to Eq. (1).\n\nThis Gaussian assumption may be inappropriate, for instance if data is binary-valued, or integer-valued, or is nonnegative. 
In fact, the Gaussian is only one of the canonical distributions that make up the exponential family, and it is a distribution tailored to real-valued data. The Poisson is better suited to integer data, and the Bernoulli to binary data. It seems natural to consider variants of PCA which are founded upon these other distributions in place of the Gaussian.\n\nWe extend PCA to the rest of the exponential family. Let $\\{P_\\theta\\}$ be any parameterized set of distributions from the exponential family, where $\\theta$ is the natural parameter of a distribution. For instance, a one-dimensional Poisson distribution can be parameterized by $\\theta \\in \\mathbb{R}$, corresponding to mean $e^\\theta$ and distribution $P_\\theta(n) = e^{-e^\\theta} e^{n\\theta} / n!$ for $n \\in \\{0, 1, 2, \\ldots\\}$. Given data $x_1, \\ldots, x_n \\in \\mathbb{R}^d$, the goal is now to find parameters $\\theta_1, \\ldots, \\theta_n$ which lie in a low-dimensional subspace and for which the log-likelihood $\\sum_i \\log P_{\\theta_i}(x_i)$ is maximized.\n\nOur unified approach effortlessly permits hybrid dimensionality reduction schemes in which different types of distributions can be used for different attributes of the data. If the data $x_i$ have a few binary attributes and a few integer-valued attributes, then some coordinates of the corresponding $\\theta_i$ can be parameters of binomial distributions while others are parameters of Poisson distributions. (However, for simplicity of presentation, in this abstract we assume all distributions are of the same type.)\n\nThe dimensionality reduction schemes for non-Gaussian distributions are substantially different from PCA. For instance, in PCA the parameters $\\theta_i$, which are means of Gaussians, lie in a space which coincides with that of the data $x_i$. This is not the case in general, and therefore, although the parameters $\\theta_i$ lie in a linear subspace, they typically correspond to a nonlinear surface in the space of the data.\n\nThe discrepancy and interaction between the space of parameters $\\theta$ and the space of the data $x$ is a central preoccupation in the study of exponential families, generalized linear models (GLM\u2019s), and Bregman distances. Our exposition is inevitably woven around these three intimately related subjects. In particular, we show that the way in which we generalize PCA is exactly analogous to the manner in which regression is generalized by GLM\u2019s. In this respect, and in others which will be elucidated later, it differs from other variants of PCA recently proposed by Lee and Seung [7], and by Hofmann [4].\n\nWe show that the optimization problem we derive can be solved quite naturally by an algorithm that alternately minimizes over the components of the analysis and their coefficients; thus, the algorithm is reminiscent of Csisz\u00e1r and Tusn\u00e1dy\u2019s alternating minimization procedures [2]. 
In our case, each side of the minimization is a simple convex program that can be interpreted as a projection with respect to a suitable Bregman distance; however, the overall program is not generally convex. In the case of Gaussian distributions, our algorithm coincides exactly with the power method for computing eigenvectors; in this sense it is a generalization of one of the oldest algorithms for PCA. Although omitted for lack of space, we can show that our procedure converges in that any limit point of the computed coefficients is a stationary point of the loss function. Moreover, a slight modification of the optimization criterion guarantees the existence of at least one limit point.\n\nSome comments on notation: All vectors in this paper are row vectors. If $M$ is a matrix, we denote its $i$\u2019th row by $m_i$ and its $ij$\u2019th element by $m_{ij}$.\n\n2 The Exponential Family, GLM\u2019s, and Bregman Distances\n\n2.1 The Exponential Family and Generalized Linear Models\n\nIn the exponential family of distributions the conditional probability of a value $x$ given parameter value $\\theta$ takes the following form:\n\n$$\\log P(x \\mid \\theta) = \\log P_0(x) + x\\theta - G(\\theta). \\tag{2}$$\n\nHere, $\\theta$ is the \u201cnatural parameter\u201d of the distribution, and can usually take any value in the reals. $G(\\theta)$ is a function that ensures that the sum (integral) of $P(x \\mid \\theta)$ over the domain of $x$ is 1. From this it follows that $G(\\theta) = \\log \\sum_{x \\in \\mathcal{X}} P_0(x) e^{x\\theta}$. We use $\\mathcal{X}$ to denote the domain of $x$. The sum is replaced by an integral in the continuous case, where $P$ defines a density over $\\mathcal{X}$. $P_0$ is a term that depends only on $x$, and can usually be ignored as a constant during estimation. The main difference between different members of the family is the form of $G(\\theta)$. We will see that almost all of the concepts of the PCA algorithms in this paper stem directly from the definition of $G$.\n\nA first example is a normal distribution, with mean $\\mu$ and unit variance, which has a density that is usually written as $\\log P(x \\mid \\mu) = -\\log\\sqrt{2\\pi} - \\frac{1}{2}(x - \\mu)^2$. It can be verified that this is a member of the exponential family with $\\log P_0(x) = -\\log\\sqrt{2\\pi} - \\frac{1}{2}x^2$, $\\theta = \\mu$, and $G(\\theta) = \\frac{1}{2}\\theta^2$. Another common case is a Bernoulli distribution for the case of binary outcomes. In this case $\\mathcal{X} = \\{0, 1\\}$. The probability of $x \\in \\mathcal{X}$ is usually written $P(x \\mid p) = p^x (1 - p)^{1 - x}$ where $p$ is a parameter in $[0, 1]$. This is a member of the exponential family with $P_0(x) = 1$, $\\theta = \\log\\frac{p}{1 - p}$, and $G(\\theta) = \\log(1 + e^\\theta)$.\n\nA critical function is the derivative $G'(\\theta)$, which we will denote as $g(\\theta)$ throughout this paper. By differentiating $G(\\theta) = \\log \\sum_{x \\in \\mathcal{X}} P_0(x) e^{x\\theta}$, it is easily verified that $g(\\theta) = E[x \\mid \\theta]$, the expectation of $x$ under $P(x \\mid \\theta)$. In the normal distribution, $E[x \\mid \\theta] = \\theta$, and in the Bernoulli case $E[x \\mid \\theta] = p$. In the general case, $E[x \\mid \\theta]$ is referred to as the \u201cexpectation parameter\u201d, and $g$ defines a function from the natural parameter values to the expectation parameter values.\n\nOur generalization of PCA is analogous to the manner in which generalized linear models (GLM\u2019s) [8] provide a unified treatment of regression for the exponential family by generalizing least-squares regression to loss functions that are more appropriate for other members of this family. The regression set-up assumes a training sample of $(x_i, y_i)$ pairs, where $x_i \\in \\mathbb{R}^d$ is a vector of attributes, and $y_i \\in \\mathbb{R}$ is some response variable. The parameters of the model are a vector $\\beta \\in \\mathbb{R}^d$. The dot product $\\beta \\cdot x_i$ is taken to be an approximation of $y_i$. In least squares regression the optimal parameters $\\beta^*$ are set to be $\\beta^* = \\arg\\min_{\\beta \\in \\mathbb{R}^d} \\sum_i (y_i - \\beta \\cdot x_i)^2$.\n\nIn GLM\u2019s, $g(\\beta \\cdot x_i)$ is taken to approximate the expectation parameter of the exponential model, where $g$ is the inverse of the \u201clink function\u201d [8]. A natural choice is to use the \u201ccanonical link\u201d, where $g$ is the derivative $G'(\\theta)$. In this case the natural parameters are directly approximated by $\\beta \\cdot x_i$, and the log-likelihood $\\sum_i \\log P(y_i \\mid \\theta_i)$ is simply $\\sum_i \\{\\log P_0(y_i) + y_i\\, \\beta \\cdot x_i - G(\\beta \\cdot x_i)\\}$. In the case of a normal distribution with fixed variance, $G(\\theta) = \\theta^2/2$ and it follows easily that the maximum-likelihood criterion is equivalent to the least squares criterion. Another interesting case is logistic regression where $G(\\theta) = \\log(1 + e^\\theta)$, and the negative log-likelihood for parameters $\\beta$ is $\\sum_i \\log(1 + e^{-y'_i\\, \\beta \\cdot x_i})$ where $y'_i = 1$ if $y_i = 1$, and $y'_i = -1$ if $y_i = 0$.\n\n2.2 Bregman Distances and the Exponential Family\n\nLet $F : \\Delta \\to \\mathbb{R}$ be a differentiable and strictly convex function defined on a closed, convex set $\\Delta \\subseteq \\mathbb{R}$. The Bregman distance associated with $F$ is defined for $p, q \\in \\Delta$ to be\n\n$$B_F(p \\| q) = F(p) - F(q) - f(q)(p - q)$$\n\nwhere $f(x) = F'(x)$. It can be shown that, in general, every Bregman distance is nonnegative and is equal to zero if and only if its two arguments are equal.\n\nFor the exponential family the log-likelihood $\\log P(x \\mid \\theta)$ is directly related to a Bregman distance. Specifically, [1, 3] define a \u201cdual\u201d function $F$ through $G$ and $g$:\n\n$$F(g(\\theta)) + G(\\theta) = g(\\theta)\\theta. \\tag{3}$$\n\nIt can be shown under fairly general conditions that $f(x) = g^{-1}(x)$. Application of these identities implies that the negative log-likelihood of a point can be expressed through a Bregman distance [1, 3]:\n\n$$-\\log P(x \\mid \\theta) = -\\log P_0(x) - F(x) + B_F(x \\| g(\\theta)). \\tag{4}$$\n\nIn other words, negative log-likelihood can always be written as a Bregman distance plus a term that is constant with respect to $\\theta$ and which therefore can be ignored. Table 1 summarizes various functions of interest for examples of the exponential family.\n\n$$\\begin{array}{l|lll} & \\text{normal} & \\text{Bernoulli} & \\text{Poisson} \\\\ \\hline \\mathcal{X} & \\mathbb{R} & \\{0, 1\\} & \\{0, 1, 2, \\ldots\\} \\\\ G(\\theta) & \\theta^2/2 & \\log(1 + e^\\theta) & e^\\theta \\\\ g(\\theta) & \\theta & e^\\theta/(1 + e^\\theta) & e^\\theta \\\\ F(x) & x^2/2 & x \\log x + (1 - x)\\log(1 - x) & x \\log x - x \\\\ f(x) & x & \\log\\frac{x}{1 - x} & \\log x \\\\ B_F(p \\| q) & (p - q)^2/2 & p \\log\\frac{p}{q} + (1 - p)\\log\\frac{1 - p}{1 - q} & p \\log\\frac{p}{q} + q - p \\\\ B_F(x \\| g(\\theta)) & (x - \\theta)^2/2 & \\log(1 + e^{-x^*\\theta}) \\text{ where } x^* = 2x - 1 & e^\\theta - x\\theta + x \\log x - x \\end{array}$$\n\nTable 1: Various functions of interest for three members of the exponential family\n\nWe will find it useful to extend the idea of Bregman distances to divergences between vectors and matrices. If $x$, $y$ are vectors, and $A$, $B$ are matrices, then we overload the notation as $B_F(x \\| y) = \\sum_i B_F(x_i \\| y_i)$ and $B_F(A \\| B) = \\sum_i \\sum_j B_F(a_{ij} \\| b_{ij})$. (The notion of Bregman distance as well as our generalization of PCA can be extended to vectors in a more general manner; here, for simplicity, we restrict our attention to Bregman distances and PCA problems of this particular form.)\n\n3 PCA for the Exponential Family\n\nWe now generalize PCA to other members of the exponential family. We wish to find $\\theta_i$\u2019s that are \u201cclose\u201d to the $x_i$\u2019s and which belong to a lower dimensional subspace of parameter space. Thus, our approach is to find a basis $v_1, \\ldots, v_\\ell$ in $\\mathbb{R}^d$ and to represent each $\\theta_i$ as the linear combination of these elements $\\theta_i = \\sum_k a_{ik} v_k$ that is \u201cclosest\u201d to $x_i$.\n\nLet $X$ be the $n \\times d$ matrix whose $i$\u2019th row is $x_i$. Let $V$ be the $\\ell \\times d$ matrix whose $k$\u2019th row is $v_k$, and let $A$ be the $n \\times \\ell$ matrix with elements $a_{ik}$. Then $\\Theta = AV$ is an $n \\times d$ matrix whose $i$\u2019th row is $\\theta_i$ as above. This is a matrix of natural parameter values which define the probability of each point in $X$.\n\nFollowing the discussion in Section 2, we consider the loss function taking the form\n\n$$L(V, A) = -\\log P(X \\mid A, V) = -\\sum_i \\sum_j \\log P(x_{ij} \\mid \\theta_{ij}) = C + \\sum_i \\sum_j \\left( -x_{ij}\\theta_{ij} + G(\\theta_{ij}) \\right)$$\n\nwhere $C$ is a constant term which will be dropped from here on. The loss function varies depending on which member of the exponential family is taken, which simply changes the form of $G$. For example, if $X$ is a matrix of real values, and the normal distribution is appropriate for the data, then $G(\\theta) = \\theta^2/2$ and the loss criterion is the usual squared loss for PCA. For the Bernoulli distribution, $G(\\theta) = \\log(1 + e^\\theta)$. If we define $x^*_{ij} = 2x_{ij} - 1$, then $L(V, A) = \\sum_i \\sum_j \\log(1 + e^{-x^*_{ij}\\theta_{ij}})$.\n\nFrom the relationship between log-likelihood and Bregman distances (see Eq. (4)), the loss can also be written as\n\n$$L(V, A) = \\sum_i \\sum_j B_F(x_{ij} \\| g(\\theta_{ij})) = \\sum_i B_F(x_i \\| g(\\theta_i))$$\n\n(where we allow $g$ to be applied to vectors and matrices in a pointwise manner). Once $V$ and $A$ have been found for the data points, the $i$\u2019th data point $x_i \\in \\mathbb{R}^d$ can be represented as the vector $a_i$ in the lower dimensional space $\\mathbb{R}^\\ell$. Then $a_i$ are the coefficients which define a Bregman projection of the vector $x_i$:\n\n$$a_i = \\arg\\min_{a \\in \\mathbb{R}^\\ell} B_F(x_i \\| g(aV)). \\tag{5}$$\n\nThe generalized form of PCA can also be considered to be search for a low dimensional basis (matrix $V$) which defines a surface that is close to all the data points $x_i$. We define the set of points $Q(V)$ to be $Q(V) = \\{g(aV) \\mid a \\in \\mathbb{R}^\\ell\\}$. The optimal value for $V$ then minimizes the sum of projection distances: $V^* = \\arg\\min_V \\sum_i \\min_{q \\in Q(V)} B_F(x_i \\| q)$.\n\nNote that for the normal distribution $g(\\theta) = \\theta$ and the Bregman distance is Euclidean distance so that the projection operation in Eq. (5) is a simple linear projection ($a_i = x_i V^T$). $Q(V)$ is also simplified in the normal case, simply being the hyperplane whose basis is $V$.\n\nTo summarize, once a member of the exponential family (and by implication a convex function $G(\\theta)$) is chosen, regular PCA is generalized in the following way:\n\n\u2022 The loss function is negative log-likelihood, $-\\log P(x \\mid \\theta) = -x\\theta + G(\\theta) + \\text{constant}$.\n\u2022 The matrix $\\Theta = AV$ is taken to be a matrix of natural parameter values.\n\u2022 The derivative $g(\\theta)$ of $G(\\theta)$ defines a matrix of expectation parameters, $g(AV)$.\n\u2022 A function $F$ is derived from $G$ and $g$. A Bregman distance $B_F$ is derived from $F$.\n\u2022 The loss is a sum of Bregman distances from the elements $x_{ij}$ to values $g(\\theta_{ij})$.\n\u2022 PCA can also be thought of as search for a matrix $V$ that defines a surface $Q(V)$ which is \u201cclose\u201d to all the data points.\n\nThe normal distribution is a simple case because $g(\\theta) = \\theta$, and the divergence is Euclidean distance. The projection operation is a linear operation, and $Q(V)$ is the hyperplane which has $V$ as its basis.\n\n4 Generic Algorithms for Minimizing the Loss Function\n\nWe now describe a generic algorithm for minimization of the loss function. First, we concentrate on the simplest case where there is just a single component so that $\\ell = 1$. (We drop the $k$ subscript from $a_{ik}$ and $v_k$.) The method is iterative, with an initial random choice for the value of $V$. Let $V^{(t)}$, $A^{(t)}$, etc. denote the values at the $t$\u2019th iteration, and let $V^{(0)}$ be the initial random choice. We propose the iterative updates $A^{(t)} = \\arg\\min_A L(V^{(t-1)}, A)$ and $V^{(t)} = \\arg\\min_V L(V, A^{(t)})$. Thus $L$ is alternately minimized with respect to its two arguments, each time optimizing one argument while keeping the other one fixed, reminiscent of Csisz\u00e1r and Tusn\u00e1dy\u2019s alternating minimization procedures [2].\n\nIt is useful to write these minimization problems as follows:\n\nFor $i = 1, \\ldots, n$: $a_i^{(t)} = \\arg\\min_{a \\in \\mathbb{R}} \\sum_j B_F(x_{ij} \\| g(a\\, v_j^{(t-1)}))$\n\nFor $j = 1, \\ldots, d$: $v_j^{(t)} = \\arg\\min_{v \\in \\mathbb{R}} \\sum_i B_F(x_{ij} \\| g(a_i^{(t)} v))$\n\nWe can then see that there are $n + d$ optimization problems, and that each one is essentially identical to a GLM regression problem (a very simple one, where there is a single parameter being optimized over). 
These sub-problems are easily solved, as the functions are convex in the argument being optimized over, and the large literature on maximum-likelihood estimation in GLM\u2019s can be directly applied to the problem.\n\nThese updates take a simple form for the normal distribution: $A^{(t)} = X V^{(t-1)T} / \\|V^{(t-1)}\\|^2$, and $V^{(t)} = A^{(t)T} X / \\|A^{(t)}\\|^2$. It follows that $V^{(t)} = c\\, V^{(t-1)} X^T X$, where $c$ is a scalar value. The method is then equivalent to the power method (see Jolliffe [5]) for finding the eigenvector of $X^T X$ with the largest eigenvalue, which is the best single component solution for $V$. Thus the generic algorithm generalizes one of the oldest algorithms for solving the regular PCA problem.\n\nThe loss is convex in either of its arguments with the other fixed, but in general is not convex in the two arguments together. This makes it very difficult to prove convergence to the global minimum. The normal distribution is an interesting special case in this respect: the power method is known to converge to the optimal solution, in spite of the non-convex nature of the loss surface. A simple proof of this comes from properties of eigenvectors (Jolliffe [5]). It can also be explained by analysis of the Hessian $H$: for any stationary point which is not the global minimum, $H$ is not positive semi-definite. Thus these stationary points are saddle points rather than local minima. The Hessian for the generalized loss function is more complex; it remains an open problem whether it is also not positive semi-definite at stationary points other than the global minimum. 
It is also open to determine under which conditions this generic algorithm will converge to a global minimum. In preliminary numerical studies, the algorithm seems to be well behaved in this respect.\n\nMoreover, any limit point of the sequence $\\Theta^{(t)} = A^{(t)} V^{(t)}$ will be a stationary point. However, it is possible for this sequence to diverge since the optimum may be at infinity. To avoid such degenerate choices of $\\Theta$, we can use a modified loss\n\n$$\\sum_i \\sum_j \\left( B_F(x_{ij} \\| g(\\theta_{ij})) + \\epsilon\\, B_F(\\mu_0 \\| g(\\theta_{ij})) \\right)$$\n\nwhere $\\epsilon$ is a small positive constant, and $\\mu_0$ is any value in the range of $g$ (and therefore for which $g^{-1}(\\mu_0)$ is finite). This is roughly equivalent to adding a conjugate prior and finding the maximum a posteriori solution. It can be proved, for this modified loss, that the sequence $\\Theta^{(t)}$ remains in a bounded region and hence always has at least one limit point which must be a stationary point. (All proofs omitted for lack of space.)\n\nThere are various ways to optimize the loss function when there is more than one component. We give one algorithm which cycles through the $\\ell$ components, optimizing each in turn while the others are held fixed:\n\n//Initialization\nSet $A = 0$, $V = 0$\n//Cycle through $\\ell$ components $N$ times\nFor $n = 1, \\ldots, N$, $c = 1, \\ldots, \\ell$:\n    //Now optimize the $c$\u2019th component with other components fixed\n    Initialize $v_c^{(0)}$ randomly, and set $s_{ij} = \\sum_{k \\neq c} a_{ik} v_{kj}$\n    For $t = 1, \\ldots,$ convergence\n        For $i = 1, \\ldots, n$: $a_{ic}^{(t)} = \\arg\\min_{a \\in \\mathbb{R}} \\sum_j B_F(x_{ij} \\| g(a\\, v_{cj}^{(t-1)} + s_{ij}))$\n        For $j = 1, \\ldots, d$: $v_{cj}^{(t)} = \\arg\\min_{v \\in \\mathbb{R}} \\sum_i B_F(x_{ij} \\| g(a_{ic}^{(t)} v + s_{ij}))$\n\nThe modified Bregman projections now include a term $s_{ij}$ representing the contribution of the $\\ell - 1$ fixed components. These sub-problems are again a standard optimization problem regarding Bregman distances, where the terms $s_{ij}$ form a \u201creference prior\u201d.\n\nFigure 1: Regular PCA vs. PCA for the exponential distribution. (Two scatter plots with axes X and Y; legend: data, PCA, exp.)\n\nFigure 2: Projecting from 3- to 1-dimensional space, via Bernoulli PCA. Left: the three points A, B, C are projected onto a one-dimensional curve. Right: point D is added.\n\n5 Illustrative examples\n\nExponential distribution. Our generalization of PCA behaves rather differently for different members of the exponential family. One interesting example is that of the exponential distributions on nonnegative reals. For one-dimensional data, these densities are usually written as $\\alpha e^{-\\alpha x}$, where $1/\\alpha$ is the mean. In the uniform system of notation we have been using, we would instead index each distribution by a single natural parameter $\\theta \\in (-\\infty, 0)$ (basically, $\\theta = -\\alpha$), and write the density as $P(x \\mid \\theta) = e^{x\\theta - G(\\theta)}$, where $G(\\theta) = -\\ln(-\\theta)$. The link function in this case is $g(\\theta) = -1/\\theta$, the mean of the distribution.\n\nSuppose we are given data $X \\in \\mathbb{R}^{n \\times d}$ and want to find the best one-dimensional approximation: a vector $v$ and coefficients $a_i$ such that the approximation $\\theta_i = a_i v$ has minimum loss. The alternating minimization procedure of the previous section has a simple closed form in this case, consisting of the iterative update rule\n\n$$v^T \\leftarrow \\frac{n}{d} \\cdot \\frac{1}{X^T \\frac{1}{X v^T}}.$$\n\nHere the shorthand $1/v$ denotes a componentwise reciprocal, i.e., $1/v = (1/v_1, \\ldots, 1/v_d)$. Notice the similarity to the update rule of the power method for PCA: $v^T \\leftarrow X^T X v^T$. Once $v$ is found, we can recover the coefficients $a_i = -d / (x_i \\cdot v)$. The points $\\theta_i = a_i v$ lie on a line through the origin. 
Normally, we would not expect the points \u0017\nhowever, in this case they do, because any point of the form \u0017\nas \n\nand so must lie in the direction \u0004\n\n.\n\n\u0019\u0012\u0011\n\n\t . Notice the\n\u0006\t\b\n\b\n\b\t\u0006\nis found,\n. Once V\n[bV\nlie on a line through\nto also lie on a straight line;\n\u0006SR\n, can be written\n\nTherefore, we can reasonably ask how the lines found under this exponential assumption\ndiffer from those found under a Gaussian assumption (that is, those found by regular PCA),\nprovided all data is nonnegative. As a very simple illustration, we conducted two toy ex-\n\u001a (Figure 1). In the \ufb01rst, the points all lay very close\n\nperiments with twenty data points in \u0010\n\n\n\u0006\nE\n\u0006\nh\n\u0001\n\u0002\n\u0003\n\n\u000e\n\u0006\n\u0006\n\u0017\n\t\n\n\u001c\n\n\u0002\n\u0004\n\u0006\n'\n\t\n\u001c\n\u0002\n\u0004\n\n\f\n\u0004\n\u0006\n\n\t\n\u001c\n\n\u001b\n'\n\u0006\n\n\n\u0006\n\n\t\n\u001c\n\n\u0004\n\u0004\n\u000e\n\u0010\n\n\u0017\n\u0014\nV\n\t\n\u0006\n \n\u001c\n\u0018\n\u0007\n\u0018\nV\n\n\n\u0007\n^\n\u001c\n\u0018\n\b\n\u0010\n\u0004\n\u0004\n\u0019\n)\n\n[\n\u000e\n\n\u001c\n\n^\n\u001c\n\u0004\n\u0013\n\u0010\n\u0014\n\u001c\nR\n\u0014\nV\n\u0006\n\u0015\n\u0014\n\t\nV\n\t\n\u0006\nR\n\u000e\n\u0010\n\u0004\n\u0018\n\u001c\n\u0004\n\u0010\n\u0010\n\f\u0015 and h\n\nto a line, and the two versions of PCA produced similar results. In the second experiment,\na few of the points were moved farther a\ufb01eld, and these outliers had a larger effect upon\nregular PCA than upon its exponential variant.\nBernoulli distribution. For the Bernoulli distribution, a linear subspace of the space of\nparameters is typically a nonlinear surface in the space of the data.\nIn Figure 2 (left),\nthree points in the three-dimensional hypercube \u0001\u0012\u0017\nare mapped via our PCA to a one-\ndimensional curve. 
The curve passes through one of the points ($A$); the projections of the two others ($B'$ and $C'$) are indicated. Notice that the curve is symmetric about the center of the hypercube, $(\\frac{1}{2}, \\frac{1}{2}, \\frac{1}{2})$. In Figure 2 (right), another point ($D$) is added, and causes the approximating one-dimensional curve to swerve closer to it.\n\n6 Relationship to Previous Work\n\nLee and Seung [6, 7] and Hofmann [4] also describe probabilistic alternatives to PCA, tailored to data types that are not Gaussian. In contrast to our method, [4, 6, 7] approximate mean parameters underlying the generation of the data points, with constraints on the matrices $A$ and $V$ ensuring that the elements of $AV$ are in the correct domain. By instead choosing to approximate the natural parameters, in our method the matrices $A$ and $V$ do not usually need to be constrained; instead, we rely on the link function $g$ to give a transformed matrix $g(AV)$ which lies in the domain of the data points.\n\nMore specifically, Lee and Seung [6] use the loss function $\\sum_{ij} (-x_{ij} \\log \\theta_{ij} + \\theta_{ij})$ (ignoring constant factors, and again defining $\\theta_{ij} = (AV)_{ij}$). This is optimized with the constraint that $A$ and $V$ should be positive. This method has a probabilistic interpretation, where each data point $x_{ij}$ is generated from a Poisson distribution with mean parameter $\\theta_{ij}$. For the Poisson distribution, our method uses the loss function $\\sum_{ij} (-x_{ij} \\theta_{ij} + e^{\\theta_{ij}})$, but without any constraints on the matrices $A$ and $V$. The algorithm in Hofmann [4] uses a loss function $\\sum_{ij} -x_{ij} \\log \\theta_{ij}$, where the matrices $A$ and $V$ are constrained such that all the $\\theta_{ij}$'s are positive, and also such that $\\sum_{ij} \\theta_{ij} = 1$.\n\nBishop and Tipping [9] describe probabilistic variants of the Gaussian case. Tipping [10] discusses a model that is very similar to our case for the Bernoulli family.\n\nAcknowledgements. 
This work builds upon intuitions about exponential families and Bregman distances obtained largely from interactions with Manfred Warmuth, and from his papers. Thanks also to Andreas Buja for several helpful comments.\n\nReferences\n\n[1] Katy S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211\u2013246, 2001.\n\n[2] I. Csisz\u00e1r and G. Tusn\u00e1dy. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue, 1:205\u2013237, 1984.\n\n[3] J\u00fcrgen Forster and Manfred Warmuth. Relative expected instantaneous loss bounds. Journal of Computer and System Sciences, to appear.\n\n[4] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.\n\n[5] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.\n\n[6] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788, 1999.\n\n[7] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, 2001.\n\n[8] P. McCullagh and J. A. Nelder. Generalized Linear Models. CRC Press, 2nd edition, 1990.\n\n[9] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61(3):611\u2013622, 1999.\n\n[10] Michael E. Tipping. Probabilistic visualisation of high-dimensional binary data. 
In Advances in Neural Information Processing Systems 11, pages 592\u2013598, 1999.", "award": [], "sourceid": 2078, "authors": [{"given_name": "Michael", "family_name": "Collins", "institution": null}, {"given_name": "S.", "family_name": "Dasgupta", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}]}