{"title": "EM Optimization of Latent-Variable Density Models", "book": "Advances in Neural Information Processing Systems", "page_first": 465, "page_last": 471, "abstract": "", "full_text": "EM Optimization of Latent-Variable \n\nDensity Models \n\nChristopher M Bishop, Markus Svensen and Christopher K I Williams \n\nNeural Computing Research Group \n\nAston University, Birmingham, B4 7ET, UK \n\nc.m.bishop~aston.ac.uk svensjfm~aston.ac.uk c.k.i.williams~aston.ac.uk \n\nAbstract \n\nThere is currently considerable interest in developing general non(cid:173)\nlinear density models based on latent, or hidden, variables. Such \nmodels have the ability to discover the presence of a relatively small \nnumber of underlying 'causes' which, acting in combination, give \nrise to the apparent complexity of the observed data set. Unfortu(cid:173)\nnately, to train such models generally requires large computational \neffort. In this paper we introduce a novel latent variable algorithm \nwhich retains the general non-linear capabilities of previous models \nbut which uses a training procedure based on the EM algorithm. \nWe demonstrate the performance of the model on a toy problem \nand on data from flow diagnostics for a multi-phase oil pipeline. \n\n1 \n\nINTRODUCTION \n\nMany conventional approaches to density estimation, such as mixture models, rely \non linear superpositions of basis functions to represent the data density. Such \napproaches are unable to discover structure within the data whereby a relatively \nsmall number of 'causes' act in combination to account for apparent complexity in \nthe data. There is therefore considerable interest in latent variable models in which \nthe density function is expressed in terms of of hidden variables. These include \ndensity networks (MacKay, 1995) and Helmholtz machines (Dayan et al., 1995). \nMuch of this work has been concerned with predicting binary variables. In this \npaper we focus on continuous data. \n\n\f466 \n\nc. M. 
BISHOP, M. SVENSEN, C. K. I. WILLIAMS \n\nFigure 1: The latent variable density model constructs a distribution function in t-space in terms of a non-linear mapping y(x; W) from a latent variable x-space. \n\n2 THE LATENT VARIABLE MODEL \n\nSuppose we wish to model the distribution of data which lives in a D-dimensional space t = (t_1, ..., t_D). We first introduce a transformation from the hidden variable space x = (x_1, ..., x_L) to the data space, governed by a non-linear function y(x; W) which is parametrized by a matrix of weight parameters W. Typically we are interested in the situation in which the dimensionality L of the latent variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data itself has an intrinsic dimensionality which is less than D. The transformation y(x; W) then maps the hidden variable space into an L-dimensional non-Euclidean subspace embedded within the data space. This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1. \n\nIf we define a probability distribution p(x) on the latent variable space, this will induce a corresponding distribution p(y) in the data space. We shall refer to p(x) as the prior distribution of x for reasons which will become clear shortly. Since L < D, the distribution in t-space would be confined to a manifold of dimension L and hence would be singular. Since in reality data will only approximately live on a lower-dimensional space, it is appropriate to include a noise model for the t vector. We therefore define the distribution of t, for given x and W, to be a spherical Gaussian centred on y(x; W) having variance β^{-1}, so that \n\np(t|x, W) = (β/2π)^{D/2} exp{-(β/2) ||y(x; W) - t||²} \n\n(1) \n\nThe distribution in t-space, for a given value of the weight matrix W, is then obtained by integration over the x-distribution \n\np(t|W) = ∫ p(t|x, W) p(x) dx. \n\n(2) \n\nFor a given data set V = (t_1, ... 
, t_N) of N data points, we can determine the weight matrix W using maximum likelihood. For convenience we introduce an error function given by the negative log likelihood: \n\nE(W) = -ln Π_{n=1}^{N} p(t_n|W) = -Σ_{n=1}^{N} ln { ∫ p(t_n|x, W) p(x) dx }. \n\n(3) \n\n\fEM Optimization of Latent-Variable Density Models \n\n467 \n\nIn principle we can now seek the maximum likelihood solution for the weight matrix, once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), by minimizing E(W). However, the integrals over x occurring in (3), and in the corresponding expression for ∇E, will, in general, be analytically intractable. MacKay (1995) uses Monte Carlo techniques to evaluate these integrals and conjugate gradients to find the weights. This is computationally very intensive, however, since a Monte Carlo integration must be performed every time the conjugate gradient algorithm requests a value for E(W) or ∇E(W). We now show how, by a suitable choice of model, it is possible to find an EM algorithm for determining the weights. \n\n2.1 EM ALGORITHM \n\nThere are three key steps to finding a tractable EM algorithm for evaluating the weights. The first is to use a generalized linear network model for the mapping function y(x; W). Thus we write \n\ny(x; W) = W φ(x) \n\n(4) \n\nwhere the elements of φ(x) consist of M fixed basis functions φ_j(x), and W is a D x M matrix with elements w_kj. Generalized linear networks possess the same universal approximation capabilities as multi-layer adaptive networks. The price which has to be paid, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space. In the present context this is not a serious problem since the dimensionality is governed by the latent variable space and will typically be small. 
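For concreteness, the generalized linear mapping of equation (4) might be sketched as below. We assume spherically symmetric Gaussian basis functions (the choice the paper makes later, in Section 3); the function names and array shapes are our own illustrative conventions, not the authors' code.

```python
import numpy as np

def rbf_basis(X, centres, width):
    """Evaluate M fixed, spherically symmetric Gaussian basis functions
    phi_j(x) at each latent point.  X: (n, L) latent points,
    centres: (M, L) basis-function centres.  Returns an (n, M) matrix."""
    # squared distances between every latent point and every centre
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def mapping(X, W, centres, width):
    """y(x; W) = W phi(x): maps latent points into the D-dimensional
    data space.  W is the D x M weight matrix of equation (4)."""
    return rbf_basis(X, centres, width) @ W.T   # shape (n, D)
```

For a visualization model one would take L = 2, place the M centres on a regular grid over the latent space, and let D match the data dimensionality.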
In fact we are particularly interested in visualization applications, for which L = 2. \n\nThe second important step is to use a simple Monte Carlo approximation for the integrals over x. In general, for a function Q(x) we can write \n\n∫ Q(x) p(x) dx ≈ (1/K) Σ_{i=1}^{K} Q(x^i) \n\n(5) \n\nwhere x^i represents a sample drawn from the distribution p(x). If we apply this to (3) we obtain \n\nE(W) = -Σ_{n=1}^{N} ln { (1/K) Σ_{i=1}^{K} p(t_n|x^{ni}, W) } \n\n(6) \n\nThe third key step is to choose the sample of points {x^{ni}} to be the same for each term in the summation over n. Thus we can drop the index n on x^{ni} to give \n\nE(W) = -Σ_{n=1}^{N} ln { (1/K) Σ_{i=1}^{K} p(t_n|x^i, W) } \n\n(7) \n\nWe now note that (7) represents the negative log likelihood under a distribution consisting of a mixture of K kernel functions. This allows us to apply the EM algorithm to find the maximum likelihood solution for the weights. Furthermore, as a consequence of our choice (4) for the non-linear mapping function, it will turn out that the M-step can be performed explicitly, leading to a solution in terms of a set of linear equations. \n\n\f468 \n\nC. M. BISHOP, M. SVENSEN, C. K. I. WILLIAMS \n\nWe note that this model corresponds to a constrained Gaussian mixture distribution of the kind discussed in Hinton et al. (1992). \n\nWe can formulate the EM algorithm for this system as follows. Setting the derivatives of (7) with respect to w_kj to zero we obtain \n\nΣ_{n=1}^{N} Σ_{i=1}^{K} R_ni(W) { Σ_{j'} w_{kj'} φ_{j'}(x^i) - t_nk } φ_j(x^i) = 0 \n\n(8) \n\nwhere we have used Bayes' theorem to introduce the posterior probabilities, or responsibilities, for the mixture components given by \n\nR_ni(W) = p(t_n|x^i, W) / Σ_{i'=1}^{K} p(t_n|x^{i'}, W) \n\n(9) \n\nSimilarly, maximizing with respect to β we obtain \n\n1/β = (1/(ND)) Σ_{i=1}^{K} Σ_{n=1}^{N} R_ni(W) ||y(x^i; W) - t_n||². \n\n(10) \n\nThe EM algorithm is obtained by supposing that, at some point in the algorithm, the current weight matrix is given by W^old and the current value of β is β^old. Then we can evaluate the responsibilities using these values for W and β (the E-step), and then solve (8) for the weights to give W^new and subsequently solve (10) to give β^new (the M-step). The two steps are repeated until a suitable convergence criterion is reached. In practice the algorithm converges after a relatively small number of iterations. \n\nA more formal justification for the EM algorithm can be given by introducing auxiliary variables to label which component is responsible for generating each data point, and then computing the expectation with respect to the distribution of these variables. Application of Jensen's inequality then shows that, at each iteration of the algorithm, the error function will decrease unless it is already at a (local) minimum, as discussed for example in Bishop (1995). \n\nIf desired, a regularization term can be added to the error function to control the complexity of the model y(x; W). From a Bayesian viewpoint, this corresponds to a prior distribution over weights. For a regularizer which is a quadratic function of the weight parameters, this leads to a straightforward modification to the weight update equations. It is convenient to write the condition (8) in matrix notation as \n\n(Φ^T G^old Φ + λI)(W^new)^T = Φ^T T^old \n\n(11) \n\nwhere we have included a regularization term with coefficient λ, and I denotes the unit matrix. In (11), Φ is a K x M matrix with elements Φ_ij = φ_j(x^i), T is a K x D matrix, and G is a K x K diagonal matrix, with elements \n\nT_ik = Σ_{n=1}^{N} R_ni(W) t_nk,    G_ii = Σ_{n=1}^{N} R_ni(W). \n\n(12) \n\nWe can now solve (11) for W^new using standard linear matrix inversion techniques, based on singular value decomposition to allow for possible ill-conditioning. 
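Taken together, the responsibilities (9), the linear system (11)-(12), and the β update (10) suggest the following numerical sketch of one EM iteration. This is a minimal illustration under the spherical-Gaussian noise model of (1), not the authors' implementation: the variable names are ours, and a dense linear solve stands in for the SVD-based inversion mentioned above.

```python
import numpy as np

def em_step(T, Phi, W, beta, lam=0.0):
    """One EM iteration for the latent-variable density model.
    T:   (N, D) data points t_n
    Phi: (K, M) basis matrix, Phi[i, j] = phi_j(x^i); fixed throughout
    W:   (D, M) current weight matrix;  beta: current inverse noise variance
    lam: regularization coefficient lambda of equation (11)."""
    N, D = T.shape
    Y = Phi @ W.T                                             # (K, D) centres y(x^i; W)
    d2 = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)   # (K, N) squared distances
    # E-step: responsibilities R_ni via Bayes' theorem, eq. (9).  The Gaussian
    # normalizer (beta/2*pi)^{D/2} cancels; shift by the max for stability.
    log_p = -0.5 * beta * d2
    log_p -= log_p.max(axis=0, keepdims=True)
    R = np.exp(log_p)
    R /= R.sum(axis=0, keepdims=True)                         # columns sum to one
    # M-step: solve (Phi^T G Phi + lambda I) W_new^T = Phi^T T_resp, eqs. (11)-(12)
    G = np.diag(R.sum(axis=1))                                # (K, K) diagonal
    A = Phi.T @ G @ Phi + lam * np.eye(Phi.shape[1])
    W_new = np.linalg.solve(A, Phi.T @ (R @ T)).T             # (D, M)
    # update beta from eq. (10) with the new centres
    d2_new = (((Phi @ W_new.T)[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    beta_new = N * D / (R * d2_new).sum()
    return W_new, beta_new
```

Because Φ never changes, A can be refactorized only when λ or the responsibilities change, and the whole loop is a handful of dense matrix operations per iteration.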
Note that the matrix Φ is constant throughout the algorithm, and so need only be evaluated once at the start. \n\n\fEM Optimization of Latent-Variable Density Models \n\n469 \n\nFigure 2: Results from a toy problem involving data ('x') generated from a 1-dimensional curve embedded in 2 dimensions, together with the projected sample points ('+') and their Gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and an intermediate configuration, obtained after 4 iterations of EM, is shown on the right. \n\n3 RESULTS \n\nWe now present results from the application of this algorithm first to a toy problem involving data in two dimensions, and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multi-phase pipelines. \n\nFor simplicity we choose the distribution p(x) to be uniform over the unit square. The basis functions φ_j(x) are taken to be spherically symmetric Gaussian functions whose centres are distributed on a uniform grid in x-space, with a common width parameter chosen so that the standard deviation is equal to the separation of neighbouring basis functions. For both problems the weights in the network were initialized by performing principal component analysis on the data and then finding the least-squares solution for the weights which best approximates the linear transformation which maps latent space to target space while generating the correct mean and variance in target space. 
\n\nAs a simple demonstration of this algorithm, we consider data generated from a one-dimensional distribution embedded in two dimensions, as shown in Figure 2. \n\n3.1 OIL FLOW DATA \n\nOur second example arises in the problem of determining the fraction of oil in a multi-phase pipeline carrying a mixture of oil, water and gas (Bishop and James, 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data is used which models accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). The three phases in the pipe (oil, water and gas) can belong to one of three different geometrical configurations, corresponding to stratified, homogeneous, and annular flows, and the data set consists of 1000 points distributed equally between the 3 classes. \n\n\f470 \n\nC. M. BISHOP, M. SVENSEN, C. K. I. WILLIAMS \n\nFigure 3: The left plot shows the posterior-mean projection of the oil data in the latent space of the non-linear model. The plot on the right shows the same data set projected onto the first two principal components. In both plots, crosses, circles and plus-signs represent the stratified, annular and homogeneous configurations respectively. \n\nWe take the latent variable space to be two-dimensional. 
This is appropriate for this problem as we know that, locally, the data must have an intrinsic dimensionality of two (neglecting noise on the data) since, for any given geometrical configuration of the three phases, there are two degrees of freedom corresponding to the fractions of oil and water in the pipe (the fraction of gas being redundant since the three fractions must sum to one). It also allows us to use the latent variable model to visualize the data by projection onto x-space. \n\nFor the purposes of visualization, we note that a data point t_n induces a posterior distribution p(x|t_n, W*) in x-space, where W* denotes the value of the weight matrix for the trained network. This provides considerably more information in the visualization space than many simple techniques (which generally project each data point onto a single point in the visualization space). For example, the posterior distribution may be multi-modal, indicating that there is more than one region of x-space which can claim significant responsibility for generating the data point. However, it is often convenient to project each data point down to a unique point in x-space. This can be done by finding the mean of the posterior distribution, which itself can be evaluated by a simple Monte Carlo integration using quantities already calculated in the evaluation of W*. \n\nFigure 3 shows the oil data visualized in the latent-variable space in which, for each data point, we have plotted the posterior mean vector. Again the points have been labelled according to their multi-phase configuration. We have compared these results with those from a number of conventional techniques including factor analysis and principal component analysis. 
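The posterior-mean projection described above can be sketched as follows. Under the fixed Monte Carlo sample {x^i}, the posterior p(x^i | t_n, W*) is exactly the responsibility of equation (9), so the posterior mean is a responsibility-weighted average of the sample points. This is a hedged illustration under the spherical-Gaussian noise model; names and shapes are our own assumptions.

```python
import numpy as np

def posterior_mean_projection(T, X_latent, Phi, W, beta):
    """Project each data point t_n to the mean of its posterior over the
    latent sample points.  T: (N, D) data;  X_latent: (K, L) sample points;
    Phi: (K, M) basis matrix;  W: (D, M) trained weights.
    Returns an (N, L) array of posterior means."""
    Y = Phi @ W.T                                             # (K, D) mixture centres
    d2 = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)   # (K, N)
    log_p = -0.5 * beta * d2                                  # normalizer cancels
    log_p -= log_p.max(axis=0, keepdims=True)
    R = np.exp(log_p)
    R /= R.sum(axis=0, keepdims=True)                         # posterior over x^i
    return R.T @ X_latent                                     # (N, L) posterior means
```

Note that a multi-modal posterior is averaged into a single point by this projection, which is exactly the caveat raised in the text.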
Note that factor analysis is precisely the model which results if a linear mapping is assumed for y(x; W), a Gaussian distribution p(x) is chosen in the latent space, and the noise distribution in data space is taken to be Gaussian with a diagonal covariance matrix. Of these techniques, principal component analysis gave the best class separation (assessed subjectively) and is illustrated in Figure 3. Comparison with the results from the non-linear model clearly shows that the latter gives much better separation of the three classes, as a consequence of the non-linearity permitted by the latent variable mapping. \n\n\fEM Optimization of Latent-Variable Density Models \n\n471 \n\n4 DISCUSSION \n\nThere are interesting relationships between the model discussed here and a number of well-known algorithms for unsupervised learning. We have already commented that factor analysis is a special case of this model, involving a linear mapping from latent space to data space. The Kohonen topographic map algorithm (Kohonen, 1995) can be regarded as an approximation to a latent variable density model of the kind outlined here. Finally, there are interesting similarities to a 'soft' version of the 'principal curves' algorithm (Tibshirani, 1992). \n\nThe model we have described can readily be extended to deal with the problem of missing data, provided we assume that the missing data is ignorable and missing at random (Little and Rubin, 1987). This involves maximizing the likelihood function in which the missing values have been integrated out. For the model discussed here, the integrations can be performed analytically, leading to a modified form of the EM algorithm. \n\nCurrently we are extending the model to allow for mixed continuous and categorical variables. We are also exploring Bayesian approaches, based on Markov chain Monte Carlo, to replace the maximum likelihood procedure. 
\n\nAcknowledgements \n\nThis work was partially supported by EPSRC grant GR/J75425: Novel Developments in Learning Theory. Markus Svensen would like to thank the staff of the SANS group in Stockholm for their hospitality during part of this project. \n\nReferences \n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press. \n\nBishop, C. M. and G. D. James (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research A327, 580-593. \n\nDayan, P., G. E. Hinton, R. M. Neal, and R. S. Zemel (1995). The Helmholtz machine. Neural Computation 7(5), 889-904. \n\nHinton, G. E., C. K. I. Williams, and M. D. Revow (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, and R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. Morgan Kaufmann. \n\nKohonen, T. (1995). Self-Organizing Maps. Berlin: Springer-Verlag. \n\nLittle, R. J. A. and D. B. Rubin (1987). Statistical Analysis with Missing Data. New York: John Wiley. \n\nMacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research A354(1), 73-80. \n\nTibshirani, R. (1992). Principal curves revisited. Statistics and Computing 2, 183-190. \n", "award": [], "sourceid": 1132, "authors": [{"given_name": "Christopher", "family_name": "Bishop", "institution": null}, {"given_name": "Markus", "family_name": "Svens\u00e9n", "institution": null}, {"given_name": "Christopher", "family_name": "Williams", "institution": null}]}