{"title": "Supervised learning from incomplete data via an EM approach", "book": "Advances in Neural Information Processing Systems", "page_first": 120, "page_last": 127, "abstract": null, "full_text": "Supervised learning from incomplete \n\ndata via an EM approach \n\nZoubin Ghahramani and Michael I. Jordan \n\nDepartment of Brain & Cognitive Sciences \n\nMassachusett.s Institute of Technology \n\nCambridge, MA 02139 \n\nAbstract \n\nReal-world learning tasks may involve high-dimensional data sets \nwith arbitrary patterns of missing data. In this paper we present \na framework based on maximum likelihood density estimation for \nlearning from such data set.s. VVe use mixture models for the den(cid:173)\nsity estimates and make two distinct appeals to the Expectation(cid:173)\nMaximization (EM) principle (Dempster et al., 1977) in deriving \na learning algorithm-EM is used both for the estimation of mix(cid:173)\nture components and for coping wit.h missing dat.a. The result(cid:173)\ning algorithm is applicable t.o a wide range of supervised as well \nas unsupervised learning problems. Result.s from a classification \nbenchmark-t.he iris data set-are presented. \n\n1 \n\nIntroduction \n\nAdaptive systems generally operate in environments t.hat are fraught with imper(cid:173)\nfections; nonet.heless they must cope with these imperfections and learn to extract \nas much relevant information as needed for their part.icular goals. One form of \nimperfection is incomplet.eness in sensing information. Incompleteness can arise ex(cid:173)\ntrinsically from the data generation process and intrinsically from failures of the \nsystem's sensors. For example, an object recognition system must be able to learn \nto classify images with occlusions, and a robotic controller must be able to integrate \nmultiple sensors even when only a fraction may operate at any given time. \n\nIn this paper we present a. 
fra.mework-derived from parametric statistics-for learn-\n\n120 \n\n\fSupervised Learning from Incomplete Data via an EM Approach \n\n121 \n\ning from data sets with arbitrary patterns of incompleteness. Learning in this frame(cid:173)\nwork is a classical estimation problem requiring an explicit probabilistic model and \nan algorithm for estimating the parameters of the model. A possible disadvantage \nof parametric methods is their lack of flexibility when compared with nonparamet(cid:173)\nric methods. This problem, however, can be largely circumvented by the use of \nmixture models (McLachlan and Basford, 1988) . Mixture models combine much of \nthe flexibility of nonparametric methods with certain of the analytic advantages of \nparametric methods. \n\nMixture models have been utilized recently for supervised learning problems in the \nform of the \"mixtures of experts\" architecture (Jacobs et al., 1991; Jordan and \nJacobs, 1994). This architecture is a parametric regression model with a modular \nstructure similar to the nonparametric decision tree and adaptive spline models \n(Breiman et al., 1984; Friedman, 1991). The approach presented here differs from \nthese regression-based approaches in that the goal of learning is to estimate the \ndensity of the data. No distinction is made between input and output variables; the \njoint density is estimated and this estimate is then used to form an input/output \nmap. Similar approaches have been discussed by Specht (1991) and Tresp et al. \n(1993). To estimate the vector function y = I(x) the joint density P(x, y) is esti(cid:173)\nmated and, given a particular input x, the conditional density P(ylx) is formed. \nTo obtain a single estimate of y rather than the full conditional density one can \nevaluate y = E(ylx), the expectation of y given x. \nThe density-based approach to learning can be exploited in several ways . 
First, having an estimate of the joint density allows for the representation of any relation between the variables. From P(x, y), we can estimate y = f(x), the inverse x = f^{-1}(y), or any other relation between two subsets of the elements of the concatenated vector (x, y). \n\nSecond, this density-based approach is applicable both to supervised learning and unsupervised learning in exactly the same way. The only distinction between supervised and unsupervised learning in this framework is whether some portion of the data vector is denoted as \"input\" and another portion as \"target\". \n\nThird, as we discuss in this paper, the density-based approach deals naturally with incomplete data, i.e. missing values in the data set. This is because the problem of estimating mixture densities can itself be viewed as a missing data problem (the \"labels\" for the component densities are missing) and an Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be developed to handle both kinds of missing data. \n\n2 Density estimation using EM \n\nThis section outlines the basic learning algorithm for finding the maximum likelihood parameters of a mixture model (Dempster et al., 1977; Duda and Hart, 1973; Nowlan, 1991). We assume that the data X = {x_1, ..., x_N} are generated independently from a mixture density \n\nP(x_i) = sum_{j=1}^{M} P(x_i | w_j; theta_j) P(w_j),    (1) \n\nwhere each component of the mixture is denoted w_j and parametrized by theta_j. From equation (1) and the independence assumption we see that the log likelihood of the parameters given the data set is \n\nl(theta | X) = sum_{i=1}^{N} log sum_{j=1}^{M} P(x_i | w_j; theta_j) P(w_j).    (2) \n\nBy the maximum likelihood principle the best model of the data has parameters that maximize l(theta | X). This function, however, is not easily maximized numerically because it involves the log of a sum. 
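As a concrete illustration of equation (2), not taken from the paper itself, the following sketch evaluates the mixture log likelihood for a one-dimensional Gaussian mixture; the function name and the log-sum-exp stabilization of the log-of-a-sum are our own additions:

```python
import numpy as np

def mixture_log_likelihood(x, means, variances, priors):
    """l(theta | X) = sum_i log sum_j P(x_i | w_j; theta_j) P(w_j), equation (2),
    for a one-dimensional Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]           # shape (N, 1)
    means = np.asarray(means, dtype=float)[None, :]   # shape (1, M)
    var = np.asarray(variances, dtype=float)[None, :]
    # log P(x_i | w_j; theta_j) for every data point / component pair
    log_comp = -0.5 * np.log(2 * np.pi * var) - (x - means) ** 2 / (2 * var)
    log_joint = log_comp + np.log(priors)             # add log P(w_j)
    # log-sum-exp over components keeps the log of a sum numerically stable
    m = log_joint.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))).sum())
```

With a single component and unit variance this reduces to summing log N(x_i; mu, 1), which is a convenient sanity check.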
\n\nIntuitively, there is a \"credit-assignment\" problem: it is not clear which component of the mixture generated a given data point and thus which parameters to adjust to fit that data point. The EM algorithm for mixture models is an iterative method for solving this credit-assignment problem. The intuition is that if one had access to a \"hidden\" random variable z that indicated which data point was generated by which component, then the maximization problem would decouple into a set of simple maximizations. Using the indicator variable z, a \"complete-data\" log likelihood function can be written \n\nl_c(theta | X, Z) = sum_{i=1}^{N} sum_{j=1}^{M} z_ij log P(x_i | z_i; theta) P(z_i; theta),    (3) \n\nwhich does not involve a log of a summation. \n\nSince z is unknown l_c cannot be utilized directly, so we instead work with its expectation, denoted by Q(theta | theta_k). As shown by Dempster et al. (1977), l(theta | X) can be maximized by iterating the following two steps: \n\nE step:  Q(theta | theta_k) = E[l_c(theta | X, Z) | X, theta_k] \nM step:  theta_{k+1} = argmax_theta Q(theta | theta_k).    (4) \n\nThe E (Expectation) step computes the expected complete data log likelihood and the M (Maximization) step finds the parameters that maximize this likelihood. These two steps form the basis of the EM algorithm; in the next two sections we will outline how they can be used for real and discrete density estimation. \n\n2.1 Real-valued data: mixture of Gaussians \n\nReal-valued data can be modeled as a mixture of Gaussians. For this model the E-step simplifies to computing h_ij = E[z_ij | x_i, theta_k], the probability that Gaussian j, as defined by the parameters estimated at time step k, generated data point i: \n\nh_ij = |Sigma_j^k|^{-1/2} exp{-(1/2)(x_i - mu_j^k)^T Sigma_j^{-1,k} (x_i - mu_j^k)} / sum_{l=1}^{M} |Sigma_l^k|^{-1/2} exp{-(1/2)(x_i - mu_l^k)^T Sigma_l^{-1,k} (x_i - mu_l^k)}.    (5) \n\nThe M-step re-estimates the means and covariances of the Gaussians[1] using the data set weighted by the h_ij: \n\nmu_j^{k+1} = sum_{i=1}^{N} h_ij x_i / sum_{i=1}^{N} h_ij,    (6a) \n\nSigma_j^{k+1} = sum_{i=1}^{N} h_ij (x_i - mu_j^{k+1})(x_i - mu_j^{k+1})^T / sum_{i=1}^{N} h_ij.    (6b) \n\n[1] Though this derivation assumes equal priors for the Gaussians, if the priors are viewed as mixing parameters they can also be learned in the maximization step. \n\n2.2 Discrete-valued data: mixture of Bernoullis \n\nD-dimensional binary data x = (x_1, ..., x_d, ..., x_D), x_d in {0, 1}, can be modeled as a mixture of M Bernoulli densities. That is, \n\nP(x | theta) = sum_{j=1}^{M} P(w_j) prod_{d=1}^{D} mu_jd^{x_d} (1 - mu_jd)^{(1 - x_d)}.    (7) \n\nFor this model the E-step involves computing \n\nh_ij = prod_{d=1}^{D} mu_jd^{x_id} (1 - mu_jd)^{(1 - x_id)} / sum_{l=1}^{M} prod_{d=1}^{D} mu_ld^{x_id} (1 - mu_ld)^{(1 - x_id)},    (8) \n\nand the M-step again re-estimates the parameters by \n\nmu_j^{k+1} = sum_{i=1}^{N} h_ij x_i / sum_{i=1}^{N} h_ij.    (9) \n\nMore generally, discrete or categorical data can be modeled as generated by a mixture of multinomial densities and similar derivations for the learning algorithm can be applied. Finally, the extension to data with mixed real, binary, and categorical dimensions can be readily derived by assuming a joint density with mixed components of the three types. \n\n3 Learning from incomplete data \n\nIn the previous section we presented one aspect of the EM algorithm: learning mixture models. Another important application of EM is to learning from data sets with missing values (Little and Rubin, 1987; Dempster et al., 1977). This application has been pursued in the statistics literature for non-mixture density estimation problems; in this paper we combine this application of EM with that of learning mixture parameters. \n\nWe assume that the data set X = {x_1, ...
, x_N} is divided into an observed component X^o and a missing component X^m. Similarly, each data vector x_i is divided into (x_i^o, x_i^m), where each data vector can have different missing components; this would be denoted by superscripts o_i and m_i, but we have simplified the notation for the sake of clarity. \n\nTo handle missing data we rewrite the EM algorithm as follows: \n\nE step:  Q(theta | theta_k) = E[l_c(theta | X^o, X^m, Z) | X^o, theta_k] \nM step:  theta_{k+1} = argmax_theta Q(theta | theta_k).    (10) \n\nComparing to equation (4) we see that aside from the indicator variables Z we have added a second form of incomplete data, X^m, corresponding to the missing values in the data set. The E-step of the algorithm estimates both these forms of missing information; in essence it uses the current estimate of the data density to complete the missing values. \n\n3.1 Real-valued data: mixture of Gaussians \n\nWe start by writing the log likelihood of the complete data, \n\nl_c(theta | X^o, X^m, Z) = sum_{i=1}^{N} sum_{j=1}^{M} z_ij log P(x_i | z_i, theta) + sum_{i=1}^{N} sum_{j=1}^{M} z_ij log P(z_i | theta).    (11) \n\nWe can ignore the second term since we will only be estimating the parameters of the P(x_i | z_i, theta). Using equation (11) for the mixture of Gaussians we note that if only the indicator variables z_i are missing, the E step can be reduced to estimating E[z_ij | x_i, theta]. For the case we are interested in, with two types of missing data z_i and x_i^m, we expand equation (11) using m and o superscripts to denote subvectors and submatrices of the parameters matching the missing and observed components of the data: \n\nl_c(theta | X^o, X^m, Z) = sum_{i=1}^{N} sum_{j=1}^{M} z_ij [ -(n/2) log 2 pi - (1/2) log |Sigma_j| - (1/2)(x_i^o - mu_j^o)^T Sigma_j^{-1,oo} (x_i^o - mu_j^o) - (x_i^o - mu_j^o)^T Sigma_j^{-1,om} (x_i^m - mu_j^m) - (1/2)(x_i^m - mu_j^m)^T Sigma_j^{-1,mm} (x_i^m - mu_j^m) ]. \n\nNote that after taking the expectation, the sufficient statistics for the parameters involve three unknown terms, z_ij, z_ij x_i^m, and z_ij x_i^m x_i^{mT}. Thus we must compute: E[z_ij | x_i^o, theta_k], E[z_ij x_i^m | x_i^o, theta_k], and E[z_ij x_i^m x_i^{mT} | x_i^o, theta_k]. \n\nOne intuitive approach to dealing with missing data is to use the current estimate of the data density to compute the expectation of the missing data in an E-step, complete the data with these expectations, and then use this completed data to re-estimate parameters in an M-step. However, this intuition fails even when dealing with a single two-dimensional Gaussian; the expectation of the missing data always lies along a line, which biases the estimate of the covariance. On the other hand, the approach arising from application of the EM algorithm specifies that one should use the current density estimate to compute the expectation of whatever incomplete terms appear in the likelihood maximization. For the mixture of Gaussians these incomplete terms involve interactions between the indicator variable z_ij and the first and second moments of x_i^m. Thus, simply computing the expectation of the missing data z_i and x_i^m from our model and substituting those values into the M step is not sufficient to guarantee an increase in the likelihood of the parameters. \n\nThe above terms can be computed as follows: E[z_ij | x_i^o, theta_k] is again h_ij, the probability as defined in (5) measured only on the observed dimensions of x_i, and \n\nE[z_ij x_i^m | x_i^o, theta_k] = h_ij E[x_i^m | z_ij = 1, x_i^o, theta_k] = h_ij (mu_j^m + Sigma_j^{mo} Sigma_j^{oo,-1} (x_i^o - mu_j^o)).    (12) \n\nDefining x_ij^m = E[x_i^m | z_ij = 1, x_i^o, theta_k], the regression of x_i^m on x_i^o using Gaussian j, \n\nE[z_ij x_i^m x_i^{mT} | x_i^o, theta_k] = h_ij (Sigma_j^{mm} - Sigma_j^{mo} Sigma_j^{oo,-1} Sigma_j^{moT} + x_ij^m x_ij^{mT}).    (13) \n\nThe M-step uses these expectations substituted into equations (6a) and (6b) to re-estimate the means and covariances. To re-estimate the mean vector, mu_j, we substitute the values E[x_i^m | z_ij = 1, x_i^o, theta_k] for the missing components of x_i in equation (6a). To re-estimate the covariance matrix we substitute the values E[x_i^m x_i^{mT} | z_ij = 1, x_i^o, theta_k] for the outer product matrices involving the missing components of x_i in equation (6b). \n\n3.2 Discrete-valued data: mixture of Bernoullis \n\nFor the Bernoulli mixture the sufficient statistics for the M-step involve the incomplete terms E[z_ij | x_i^o, theta_k] and E[z_ij x_i^m | x_i^o, theta_k]. The first is equal to h_ij calculated over the observed subvector of x_i. The second, since we assume that within a class the individual dimensions of the Bernoulli variable are independent, is simply h_ij mu_j^m. The M-step uses these expectations substituted into equation (9). \n\n4 Supervised learning \n\nIf each vector x_i in the data set is composed of an \"input\" subvector, x_i^i, and a \"target\" or output subvector, x_i^o, then learning the joint density of the input and target is a form of supervised learning. In supervised learning we generally wish to predict the output variables from the input variables. In this section we will outline how this is achieved using the estimated density. \n\n4.1 Function approximation \n\nFor real-valued function approximation we have assumed that the density is estimated using a mixture of Gaussians. Given an input vector x^i we extract all the relevant information from the density P(x^i, x^o) by conditionalizing to P(x^o | x^i). For a single Gaussian this conditional density is normal, and, since P(x^i, x^o) is a mixture of Gaussians so is P(x^o | x^i). 
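The conditioning step just described can be sketched numerically. The following is a minimal illustration, not code from the paper: for each Gaussian component it forms the linear regression mu_j^o + Sigma_j^{oi} Sigma_j^{ii,-1} (x^i - mu_j^i) (as in equation (12)) and weights it by the component's posterior responsibility on the input dimensions; the function name and argument layout are our own.

```python
import numpy as np

def conditional_mean(x_in, priors, means, covs, in_idx, out_idx):
    """Sketch of the least-squares estimate E(x^o | x^i) under a Gaussian
    mixture: a responsibility-weighted sum of per-component regressions."""
    log_h, cond_means = [], []
    for p_j, mu, S in zip(priors, means, covs):
        mu_i, mu_o = mu[in_idx], mu[out_idx]
        S_ii = S[np.ix_(in_idx, in_idx)]        # Sigma_j^{ii}
        S_oi = S[np.ix_(out_idx, in_idx)]       # Sigma_j^{oi}
        diff = x_in - mu_i
        sol = np.linalg.solve(S_ii, diff)       # Sigma_j^{ii,-1} (x^i - mu_j^i)
        _, logdet = np.linalg.slogdet(S_ii)
        # log of P(w_j) N(x^i; mu_j^i, Sigma_j^{ii}), up to a shared constant
        log_h.append(np.log(p_j) - 0.5 * logdet - 0.5 * diff @ sol)
        cond_means.append(mu_o + S_oi @ sol)    # E(x^o | x^i, w_j)
    log_h = np.array(log_h)
    h = np.exp(log_h - log_h.max())
    h /= h.sum()                                # responsibilities h_j(x^i)
    return sum(h_j * m for h_j, m in zip(h, cond_means))
```

With a single component this reduces to ordinary linear regression of x^o on x^i, which matches the observation that the conditional of one Gaussian is normal.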
In principle, this conditional density is the final output of the density estimator. That is, given a particular input the network returns the complete conditional density of the output. However, since many applications require a single estimate of the output, we note three ways to obtain estimates of x^o = f(x^i): the least squares estimate (LSE), which takes x^o(x^i) = E(x^o | x^i); stochastic sampling (STOCH), which samples according to the distribution x^o(x^i) ~ P(x^o | x^i); single component LSE (SLSE), which takes x^o(x^i) = E(x^o | x^i, w_j) where j = argmax_k P(z_k | x^i). For a given input, SLSE picks the Gaussian with highest posterior and approximates the output with the LSE estimator given by that Gaussian alone. \n\nThe conditional expectation or LSE estimator for a Gaussian mixture is \n\nx^o(x^i) = sum_{j=1}^{M} h_ij (mu_j^o + Sigma_j^{oi} Sigma_j^{ii,-1} (x^i - mu_j^i)),    (14) \n\nwhich is a convex sum of linear approximations, where the weights h_ij, computed as in equation (5) on the input dimensions, vary nonlinearly over the input space. The LSE estimator on a Gaussian mixture has interesting relations to algorithms such as CART (Breiman et al., 1984), MARS (Friedman, 1991), and mixtures of experts (Jacobs et al., 1991; Jordan and Jacobs, 1994), in that the mixture of Gaussians competitively partitions the input space and learns a linear regression surface on each partition. This similarity has also been noted by Tresp et al. (1993). \n\nThe stochastic estimator (STOCH) and the single component estimator (SLSE) are better suited than any least squares method for learning non-convex inverse maps, where the mean of several solutions to an inverse might not be a solution. These estimators take advantage of the explicit representation of the input/output density by selecting one of the several solutions to the inverse. \n\nFigure 1: Classification of the iris data set. 100 data points were used for training and 50 for testing. Each data point consisted of 4 real-valued attributes and one of three class labels. The figure shows classification performance +/- 1 standard error (n = 5) as a function of proportion missing features for the EM algorithm and for mean imputation (MI), a common heuristic where the missing values are replaced with their unconditional means. \n\n[Figure: \"Classification with missing inputs\": percent correct classification versus % missing features, comparing EM and MI.] \n\n4.2 Classification \n\nClassification problems involve learning a mapping from an input space into a set of discrete class labels. The density estimation framework presented in this paper lends itself to solving classification problems by estimating the joint density of the input and class label using a mixture model. For example, if the inputs have real-valued attributes and there are D class labels, a mixture model with Gaussian and multinomial components will be used: \n\nP(x, C = d | theta) = sum_{j=1}^{M} P(w_j) (mu_jd / ((2 pi)^{n/2} |Sigma_j|^{1/2})) exp{-(1/2)(x - mu_j)^T Sigma_j^{-1} (x - mu_j)},    (15) \n\ndenoting the joint probability that the data point is x and belongs to class d, where the mu_jd are the parameters for the multinomial. Once this density has been estimated, the maximum likelihood label for a particular input x may be obtained by computing P(C = d | x, theta). Similarly, the class conditional densities can be derived by evaluating P(x | C = d, theta). Conditionalizing over classes in this way yields class conditional densities which are in turn mixtures of Gaussians. Figure 1 shows the performance of the EM algorithm on an example classification problem with varying proportions of missing features. We have also applied these algorithms to the problems of clustering 35-dimensional greyscale images and approximating the kinematics of a three-joint planar arm from incomplete data. \n\n5 Discussion \n\nDensity estimation in high dimensions is generally considered to be more difficult, requiring more parameters, than function approximation. The density-estimation-based approach to learning, however, has two advantages. First, it permits ready incorporation of results from the statistical literature on missing data to yield flexible supervised and unsupervised learning architectures. This is achieved by combining two branches of application of the EM algorithm, yielding a set of learning rules for mixtures under incomplete sampling. \n\nSecond, estimating the density explicitly enables us to represent any relation between the variables. Density estimation is fundamentally more general than function approximation and this generality is needed for a large class of learning problems arising from inverting causal systems (Ghahramani, 1994). These problems cannot be solved easily by traditional function approximation techniques since the data is not generated from noisy samples of a function, but rather of a relation. \n\nAcknowledgments \n\nThanks to D. M. Titterington and David Cohn for helpful comments. This project was supported in part by grants from the McDonnell-Pew Foundation, ATR Auditory and Visual Perception Research Laboratories, Siemens Corporation, the National Science Foundation, and the Office of Naval Research. The iris data set was obtained from the UCI Repository of Machine Learning Databases. \n\nReferences \n\nBreiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA. \n\nDempster, A. 
P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39:1-38. \n\nDuda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley, New York. \n\nFriedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 19:1-141. \n\nGhahramani, Z. (1994). Solving inverse problems using an EM approach to density estimation. In Proceedings of the 1993 Connectionist Models Summer School. Erlbaum, Hillsdale, NJ. \n\nJacobs, R., Jordan, M., Nowlan, S., and Hinton, G. (1991). Adaptive mixture of local experts. Neural Computation, 3:79-87. \n\nJordan, M. and Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214. \n\nLittle, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York. \n\nMcLachlan, G. and Basford, K. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker. \n\nNowlan, S. J. (1991). Soft Competitive Adaptation: Neural Network Learning Algorithms based on Fitting Statistical Mixtures. CMU-CS-91-126, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA. \n\nSpecht, D. F. (1991). A general regression neural network. IEEE Trans. Neural Networks, 2(6):568-576. \n\nTresp, V., Hollatz, J., and Ahmad, S. (1993). Network structuring and training using rule-based knowledge. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. \n", "award": [], "sourceid": 767, "authors": [{"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}