{"title": "A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 571, "page_last": 577, "abstract": null, "full_text": "A Mixture of Experts Classifier with \nLearning Based on Both Labelled and \n\nUnlabelled Data \n\nDavid J. Miller and Hasan S. Uyar \n\nDepartment of Electrical Engineering \n\nThe Pennsylvania State University \n\nUniversity Park, Pa. 16802 \nmiller@perseus.ee.psu.edu \n\nAbstract \n\nWe address statistical classifier design given a mixed training set con(cid:173)\nsisting of a small labelled feature set and a (generally larger) set of \nunlabelled features. This situation arises, e.g., for medical images, where \nalthough training features may be plentiful, expensive expertise is re(cid:173)\nquired to extract their class labels. We propose a classifier structure \nand learning algorithm that make effective use of unlabelled data to im(cid:173)\nprove performance. The learning is based on maximization of the total \ndata likelihood, i.e. over both the labelled and unlabelled data sub(cid:173)\nsets. Two distinct EM learning algorithms are proposed, differing in the \nEM formalism applied for unlabelled data. The classifier, based on a \njoint probability model for features and labels, is a \"mixture of experts\" \nstructure that is equivalent to the radial basis function (RBF) classifier, \nbut unlike RBFs, is amenable to likelihood-based training. The scope of \napplication for the new method is greatly extended by the observation \nthat test data, or any new data to classify, is in fact additional, unlabelled \ndata - thus, a combined learning/classification operation - much akin to \nwhat is done in image segmentation - can be invoked whenever there \nis new data to classify. 
Experiments with data sets from the UC Irvine database demonstrate that the new learning algorithms and structure achieve substantial performance gains over alternative approaches. \n\n1 Introduction \n\nStatistical classifier design is fundamentally a supervised learning problem, wherein a decision function, mapping an input feature vector to an output class label, is learned based on representative (feature, class label) training pairs. While a variety of classifier structures and associated learning algorithms have been developed, a common element of nearly all approaches is the assumption that class labels are known for each feature vector used for training. This is certainly true of neural networks such as multilayer perceptrons and radial basis functions (RBFs), for which classification is usually viewed as function approximation, with the networks trained to minimize the squared distance to target class values. Knowledge of class labels is also required for parametric classifiers such as mixture of Gaussian classifiers, for which learning typically involves dividing the training data into subsets by class and then using maximum likelihood estimation (MLE) to separately learn each class density. While labelled training data may be plentiful for some applications, for others, such as remote sensing and medical imaging, the training set is in principle vast but the size of the labelled subset may be inadequate. The difficulty in obtaining class labels may arise due to limited knowledge or limited resources, as expensive expertise is often required to derive class labels for features. In this work, we address classifier design under these conditions, i.e. the training set X is assumed to consist of two subsets, X = {X_l, X_u}, where X_l = {(x_1, c_1), (x_2, c_2), ... 
, (x_{N_l}, c_{N_l})} is the labelled subset and X_u = {x_{N_l+1}, ..., x_N} is the unlabelled subset.¹ Here, x_i ∈ R^k is the feature vector and c_i ∈ I is the class label from the label set I = {1, 2, ..., N_c}. \n\nThe practical significance of this mixed training problem was recognized in (Lippmann 1989). However, despite this realization, there has been surprisingly little work done on this problem. One likely reason is that it does not appear possible to incorporate unlabelled data directly within conventional supervised learning methods such as back propagation. For these methods, unlabelled features must either be discarded or preprocessed in a suboptimal, heuristic fashion to obtain class label estimates. We also note the existence of work which is less than optimistic concerning the value of unlabelled data for classification (Castelli and Cover 1995). However, (Shashahani and Landgrebe 1994) found that unlabelled data could be used effectively in label-deficient situations. While we build on their work, as well as on our own previous work (Miller and Uyar 1996), our approach differs from (Shashahani and Landgrebe 1994) in several important respects. First, we suggest a more powerful mixture-based probability model with an associated classifier structure that has been shown to be equivalent to the RBF classifier (Miller 1996). The practical significance of this equivalence is that unlike RBFs, which are trained in a conventional supervised fashion, the RBF-equivalent mixture model is naturally suited for statistical training (MLE). The statistical framework is the key to incorporating unlabelled data in the learning. A second departure from prior work is the choice of learning criterion. We maximize the joint data likelihood and suggest two distinct EM algorithms for this purpose, whereas the conditional likelihood was considered in (Shashahani and Landgrebe 1994). We have found that our approach achieves superior results. 
A final novel contribution is a considerable expansion of the range of situations for which the mixed training paradigm can be applied. This is made possible by the realization that test data or new data to classify can also be viewed as an unlabelled set, available for \"training\". This notion will be clarified in the sequel. \n\n2 Unlabelled Data and Classification \n\nHere we briefly provide some intuitive motivation for the use of unlabelled data. Suppose, not very restrictively, that the data is well-modelled by a mixture density, in the following way. The feature vectors are generated according to the density f(x|θ) = Σ_{l=1}^L α_l f(x|θ_l), where f(x|θ_l) is one of L component densities, with non-negative mixing parameters α_l such that Σ_{l=1}^L α_l = 1. Here, θ_l is the set of parameters specifying the component density, with θ = {θ_l}. The class labels are also viewed as random quantities and are assumed chosen conditioned on the selected mixture component m_i ∈ {1, 2, ..., L} and possibly on the feature value, i.e. according to the probabilities P[c_i | x_i, m_i].² Thus, the data pairs are assumed generated by selecting, in order, the mixture component, the feature value, and the class label, with each selection depending in general on preceding ones. The optimal classification rule for this model is the maximum a posteriori rule: \n\nS(x_i) = arg max_k Σ_{j=1}^L P[c_i = k | m_i = j, x_i] P[m_i = j | x_i],   (1) \n\nwhere P[m_i = j | x_i] = α_j f(x_i|θ_j) / Σ_{l=1}^L α_l f(x_i|θ_l), and where S(x) is a selector function with range in I. \n\n¹This problem can be viewed as a type of \"missing data\" problem, wherein the missing items are class labels. As such, it is related to, albeit distinct from, supervised learning involving missing and/or noisy feature components, addressed in (Ghahramani and Jordan 1994), (Tresp et al. 1995). 
Since this rule is based on the a posteriori class probabilities, one can argue that learning should focus solely on estimating these probabilities. However, if the classifier truly implements (1), then implicitly it has been assumed that the estimated mixture density accurately models the feature vectors. If this is not true, then presumably estimates of the a posteriori probabilities will also be affected. This suggests that even in the absence of class labels, the feature vectors can be used to better learn a posteriori probabilities via improved estimation of the mixture-based feature density. A commonly used measure of mixture density accuracy is the data likelihood. \n\n3 Joint Likelihood Maximization for a Mixture of Experts Classifier \n\nThe previous section basically argues for a learning approach that uses labelled data to directly estimate a posteriori probabilities and unlabelled data to estimate the feature density. A criterion which essentially fulfills these objectives is the joint data likelihood, computed over both the labelled and unlabelled data subsets. Given our model, the joint data log-likelihood is written in the form \n\nlog L = Σ_{x_i ∈ X_l} log Σ_{l=1}^L α_l P[c_i | x_i, m_i = l] f(x_i|θ_l) + Σ_{x_i ∈ X_u} log Σ_{l=1}^L α_l f(x_i|θ_l).   (2) \n\nThis objective function consists of a \"supervised\" term based on X_l and an \"unsupervised\" term based on X_u. The joint data likelihood was previously considered in a learning context in (Xu et al. 1995). However, there the primary justification was simplification of the learning algorithm in order to allow parameter estimation based on fixed point iterations rather than gradient descent. Here, the joint likelihood allows the inclusion of unlabelled samples in the learning. We next consider two special cases of the probability model described until now. 
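As an illustration, the joint log-likelihood of (2) can be computed directly once a component family is fixed. The following is a minimal 1-D sketch, assuming Gaussian components and a table betas[c][j] for the class-given-component probabilities P[c_i | m_i = j] (the specialization used by the GM model below); all names and toy parameters are illustrative, not from the paper:

```python
import math

def gauss(x, mu, var):
    # 1-D Gaussian component density f(x | theta_j), with theta_j = (mu, var)
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def joint_log_likelihood(X_l, X_u, alphas, thetas, betas):
    # Eq. (2): a "supervised" term over labelled pairs X_l plus an
    # "unsupervised" term over unlabelled features X_u.
    ll = 0.0
    for x, c in X_l:
        ll += math.log(sum(a * betas[c][j] * gauss(x, mu, var)
                           for j, (a, (mu, var)) in enumerate(zip(alphas, thetas))))
    for x in X_u:
        ll += math.log(sum(a * gauss(x, mu, var)
                           for a, (mu, var) in zip(alphas, thetas)))
    return ll

# Toy setup: two well-separated components, hard-partitioned to two classes.
thetas = [(-2.0, 1.0), (2.0, 1.0)]
betas = [[1.0, 0.0], [0.0, 1.0]]
ll_good = joint_log_likelihood([(-2.0, 0)], [2.0], [0.5, 0.5], thetas, betas)
ll_bad = joint_log_likelihood([(-2.0, 1)], [2.0], [0.5, 0.5], thetas, betas)
```

Labelling the point x = -2 with the class owned by the nearby component (ll_good) yields a higher objective than the mismatched label (ll_bad), which is exactly the sense in which (2) rewards models that explain both the features and the labels.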
\n\n²The usual assumption made is that components are \"hard-partitioned\", in a deterministic fashion, to classes. Our random model includes the \"partitioned\" one as a special case. We have generally found this model to be more powerful than the \"partitioned\" one (Miller and Uyar 1996). \n\nThe \"partitioned\" mixture (PM) model: This is the previously mentioned case where mixture components are \"hard-partitioned\" to classes (Shashahani and Landgrebe 1994). This is written M_j ∈ C_k, where M_j denotes mixture component j and C_k is the subset of components owned by class k. The posterior probabilities have the form \n\nP[c_i = k | x_i] = Σ_{j: M_j ∈ C_k} α_j f(x_i|θ_j) / Σ_{l=1}^L α_l f(x_i|θ_l).   (3) \n\nThe generalized mixture (GM) model: The form of the posterior for each mixture component is now P[c_i | m_i, x_i] = P[c_i | m_i] ≡ β_{c_i|m_i}, i.e., it is independent of the feature value. The overall posterior probability takes the form \n\nP[c_i | x_i] = Σ_{j=1}^L ( α_j f(x_i|θ_j) / Σ_{l=1}^L α_l f(x_i|θ_l) ) β_{c_i|j}.   (4) \n\nThis model was introduced in (Miller and Uyar 1996) and was shown there to lead to performance improvement over the PM model. Note that the probabilities have a \"mixture of experts\" structure, where the \"gating units\" are the probabilities P[m_i = j | x_i] (in parentheses), and with the \"expert\" for component j just the probability β_{c_i|j}. Elsewhere (Miller 1996), it has been shown that the associated classifier decision function is in fact equivalent to that of an RBF classifier (Moody and Darken 1989). Thus, we suggest a probability model equivalent to a widely used neural network classifier, but with the advantage that, unlike the standard RBF, the RBF-equivalent probability model is amenable to statistical training, and hence to the incorporation of unlabelled data in the learning. 
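The GM posterior of (4) is a few lines of code once the component densities are available. This is a minimal 1-D sketch under assumed toy parameters (two Gaussian components, two classes, a hard-partitioned betas table); the function and variable names are illustrative, not from the paper:

```python
import math

def gauss(x, mu, var):
    # 1-D Gaussian component density f(x | theta_j), with theta_j = (mu, var)
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gm_posterior(x, alphas, thetas, betas):
    # Eq. (4): P[c = k | x] = sum_j P[m = j | x] * beta[k][j],
    # a mixture-of-experts sum of gating probabilities times "expert" outputs.
    dens = [a * gauss(x, mu, var) for a, (mu, var) in zip(alphas, thetas)]
    total = sum(dens)
    gates = [d / total for d in dens]  # gating units P[m = j | x]
    return [sum(g * betas[k][j] for j, g in enumerate(gates))
            for k in range(len(betas))]

# Toy example: two well-separated components, hard-partitioned to two classes.
post = gm_posterior(-2.0, [0.5, 0.5], [(-2.0, 1.0), (2.0, 1.0)],
                    [[1.0, 0.0], [0.0, 1.0]])
```

Because the gates are normalized component responsibilities and each betas column is a distribution over classes, the returned posteriors sum to one by construction; the MAP rule (1) is then just an argmax over this list.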
Note that more powerful models P[c_i | m_i, x_i] that do condition on x_i are also possible. However, such models will require many more parameters, which will likely hurt generalization performance, especially in a label-deficient learning context. Interestingly, for the mixed training problem, there are two Expectation-Maximization (EM) (Dempster et al. 1977) formulations that can be applied to maximize the likelihood associated with a given probability model. These two formulations lead to distinct methods that take different learning \"trajectories\", although both ascend in the data likelihood. The difference between the formulations lies in how the \"incomplete\" and \"complete\" data elements are defined within the EM framework. We will develop these two approaches for the suggested GM model. \n\nEM-I (No class labels assumed): Distinct data interpretations are given for X_l and X_u. In this case, for X_u, the incomplete data consists of the features {x_i} and the complete data consists of {(x_i, m_i)}. For X_l, the incomplete data consists of {(x_i, c_i)}, with the complete data now the triple {(x_i, c_i, m_i)}. To clarify, in this case mixture labels are viewed as the sole missing data elements, for X_u as well as for X_l. Thus, in effect class labels are not even postulated to exist for X_u. \n\nEM-II (Class labels assumed): The definitions for X_l are the same as before. However, for X_u, the complete data now consists of the triple {(x_i, c_i, m_i)}, i.e. class labels are also assumed missing for X_u. \n\nFor Gaussian components, we have θ_l = {μ_l, Σ_l}, with μ_l the mean vector and Σ_l the covariance matrix. For EM-I, the resulting fixed point iterations for updating the parameters are: \n\nα_j^{(t+1)} = (1/N) ( Σ_{x_i ∈ X_l} P[m_i = j | x_i, c_i, θ^{(t)}] + Σ_{x_i ∈ X_u} P[m_i = j | x_i, θ^{(t)}] ) \n\nμ_j^{(t+1)} = ( Σ_{x_i ∈ X_l} x_i P[m_i = j | x_i, c_i, θ^{(t)}] + Σ_{x_i ∈ X_u} x_i P[m_i = j | x_i, θ^{(t)}] ) / ( Σ_{x_i ∈ X_l} P[m_i = j | x_i, c_i, θ^{(t)}] + Σ_{x_i ∈ X_u} P[m_i = j | x_i, θ^{(t)}] ) \n\nΣ_j^{(t+1)} = ( Σ_{x_i ∈ X_l} S_ij^{(t)} P[m_i = j | x_i, c_i, θ^{(t)}] + Σ_{x_i ∈ X_u} S_ij^{(t)} P[m_i = j | x_i, θ^{(t)}] ) / ( Σ_{x_i ∈ X_l} P[m_i = j | x_i, c_i, θ^{(t)}] + Σ_{x_i ∈ X_u} P[m_i = j | x_i, θ^{(t)}] ),  ∀j 
\n\nβ_{k|j}^{(t+1)} = Σ_{x_i ∈ X_l: c_i = k} P[m_i = j | x_i, c_i, θ^{(t)}] / Σ_{x_i ∈ X_l} P[m_i = j | x_i, c_i, θ^{(t)}],  ∀k, j.   (5) \n\nHere, S_ij^{(t)} ≡ (x_i − μ_j^{(t)})(x_i − μ_j^{(t)})^T. New parameters are computed at iteration t+1 based on their values at iteration t. In these equations, P[m_i = j | x_i, c_i, θ^{(t)}] = α_j^{(t)} β_{c_i|j}^{(t)} f(x_i|θ_j^{(t)}) / Σ_{l=1}^L α_l^{(t)} β_{c_i|l}^{(t)} f(x_i|θ_l^{(t)}) and P[m_i = j | x_i, θ^{(t)}] = α_j^{(t)} f(x_i|θ_j^{(t)}) / Σ_{l=1}^L α_l^{(t)} f(x_i|θ_l^{(t)}). For EM-II, it can be shown that the resulting re-estimation equations are identical to those in (5) except regarding the parameters {β_{k|j}}. The updates for these parameters now take the form \n\nβ_{k|j}^{(t+1)} = (1 / (N α_j^{(t+1)})) ( Σ_{x_i ∈ X_l: c_i = k} P[m_i = j | x_i, c_i, θ^{(t)}] + Σ_{x_i ∈ X_u} P[m_i = j, c_i = k | x_i, θ^{(t)}] ). \n\nHere, we identify P[m_i = j, c_i = k | x_i, θ^{(t)}] = α_j^{(t)} β_{k|j}^{(t)} f(x_i|θ_j^{(t)}) / Σ_{l=1}^L α_l^{(t)} f(x_i|θ_l^{(t)}). In this formulation, joint probabilities for class and mixture labels are computed for data in X_u and used in the estimation of {β_{k|j}}, whereas in the previous formulation {β_{k|j}} are updated solely on the basis of X_l. While this does appear to be a significant qualitative difference between the two methods, both do ascend in log L, and in practice we have found that they achieve comparable performance. 
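To make the EM-I fixed-point iteration concrete, the following is a minimal sketch for scalar (1-D) Gaussian components, where the covariance update uses the freshly computed mean as in standard EM. The data, initializations, and variable names are illustrative assumptions, not the paper's experimental setup:

```python
import math

def gauss(x, mu, var):
    # 1-D Gaussian component density
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_i_step(X_l, X_u, alphas, mus, varis, betas):
    # One EM-I fixed-point update for 1-D Gaussian components.
    # X_l: list of (x, c) labelled pairs; X_u: list of unlabelled x.
    # Mixture labels are the only missing data; betas use X_l alone.
    L, K, N = len(alphas), len(betas), len(X_l) + len(X_u)
    def resp(x, c=None):
        # P[m = j | x, c] on labelled data, P[m = j | x] on unlabelled data
        d = [alphas[j] * (betas[c][j] if c is not None else 1.0)
             * gauss(x, mus[j], varis[j]) for j in range(L)]
        s = sum(d)
        return [dj / s for dj in d]
    w_l = [resp(x, c) for (x, c) in X_l]
    w_u = [resp(x) for x in X_u]
    a2, m2, v2 = [], [], []
    for j in range(L):
        wsum = sum(w[j] for w in w_l) + sum(w[j] for w in w_u)
        mu = (sum(w[j] * x for w, (x, _) in zip(w_l, X_l))
              + sum(w[j] * x for w, x in zip(w_u, X_u))) / wsum
        var = (sum(w[j] * (x - mu) ** 2 for w, (x, _) in zip(w_l, X_l))
               + sum(w[j] * (x - mu) ** 2 for w, x in zip(w_u, X_u))) / wsum
        a2.append(wsum / N); m2.append(mu); v2.append(max(var, 1e-6))
    # EM-I: class-given-component probabilities from labelled data only
    b2 = [[sum(w[j] for w, (_, c) in zip(w_l, X_l) if c == k)
           / sum(w[j] for w in w_l) for j in range(L)] for k in range(K)]
    return a2, m2, v2, b2

# Toy run: two components, two classes, a few labelled and unlabelled points.
X_l = [(-2.0, 0), (-1.5, 0), (2.0, 1), (1.8, 1)]
X_u = [-2.2, -1.9, 2.1, 1.7]
alphas, mus, varis = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
betas = [[0.5, 0.5], [0.5, 0.5]]
for _ in range(5):
    alphas, mus, varis, betas = em_i_step(X_l, X_u, alphas, mus, varis, betas)
```

After a few iterations the component means migrate to the two clusters, with both labelled and unlabelled points contributing to the mixture parameters, while the betas table sharpens using the labelled subset alone; an EM-II variant would instead fold joint posteriors over X_u into the betas update.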
\n\n4 Combined Learning and Classification \n\nThe range of application for mixed training is greatly extended by the following observation: test data (with labels withheld), or for that matter, any new batch of data to be classified, can be viewed as a new, unlabelled data set. Hence, this new data can be taken to be X_u and used for learning (based on EM-I or EM-II) prior to its classification. What we are suggesting is a combined learning/classification operation that can be applied whenever there is a new batch of data to classify. In the usual supervised learning setting, there is a clear division between the learning and classification (use) phases. In this setting, modification of the classifier for new data is not possible (because the data is unlabelled), while for test data such modification is a form of \"cheating\". However, in our suggested scheme, this learning for unlabelled data is viewed simply as part of the classification operation. This is analogous to image segmentation, wherein we have a common energy function that is minimized for each new image to be segmented. Each such minimization determines a model local to the image and a segmentation for the image. Our \"segmentation\" is just classification, with log L playing the role of the energy function. It may consist of one term which is always fixed (based on a given labelled training set) and one term which is modified based on each new batch of unlabelled data to classify. We can envision several distinct learning contexts where this scheme can be used, as well as different ways of realizing the combined learning/classification operation.³ One use is in classification of an image/speech archive, where each image/speaker segment is a separate data \"batch\". 
Each batch to classify can be used as an unlabelled \"training\" set, either in concert with a representative labelled data set, or to modify a design based on such a set.⁴ Effectively, this scheme would adapt the classifier to each new data batch. A second application is supervised learning wherein the total amount of data is fixed. Here, we need to divide the data into training and test sets, with the conflicting goals of i) achieving a good design and ii) accurately measuring generalization performance. Combined learning and classification can be used here to mitigate the loss in performance associated with the choice of a large test set. More generally, our scheme can be used effectively in any setting where the new data to classify is either a) sizable or b) innovative relative to the existing training set. \n\n5 Experimental Results \n\nFigure 1a shows results for the 40-dimensional, 3-class waveform-+noise data set from the UC Irvine database. The 5000 data pairs were split into equal-size training and test sets. Performance curves were obtained by varying the amount of labelled training data. For each choice of N_l, various learning approaches produced 6 solutions based on random parameter initialization, for each of 7 different labelled subset realizations. The test set performance was then averaged over these 42 \"trials\". All schemes used L = 12 components. DA-RBF (Miller et al. 1996) is a deterministic annealing method for RBF classifiers that has been found to achieve very good results, when given adequate training data.⁵ However, this supervised learning method is forced to discard unlabelled data, which severely handicaps its performance relative to EM-I, especially for small N_l, where the difference is substantial. 
TEM-I and TEM-II are results for the EM methods (both I and II) in combined learning and classification mode, i.e., where the 2500 test vectors were also used as part of X_u. As seen in the figure, this leads to additional, significant performance gains for small N_l. Note also that performance of the two EM methods is comparable. Figure 1b shows results of similar experiments performed on 6-class satellite imagery data (\"sat\"), also from the UC Irvine database. For this set, the feature dimension is 36, and we chose L = 18 components. Here we compared EM-I with the method suggested in (Shashahani and Landgrebe 1994) (SL), based on the PM model. EM-I is seen to achieve substantial performance gains over this alternative learning approach. Note also that the EM-I performance is nearly constant over the entire range of N_l. \n\nFuture work will investigate practical applications of combined learning and classification, as well as variations on this scheme which we have only briefly outlined. Moreover, we will investigate possible extensions of the methods described here for the regression problem. \n\n³The image segmentation analogy in fact suggests an alternative scheme where we perform joint likelihood maximization over both the model parameters and the \"hard\", missing class labels. This approach, which is analogous to segmentation methods such as ICM, would encapsulate the classification operation directly within the learning. Such a scheme will be investigated in future work. \n\n⁴Note that if the classifier is simply modified based on X_u, EM-I will not need to update {β_{k|j}}, while EM-II must update the entire model. \n\n⁵We assumed the same number of basis functions as mixture components. Also, for the DA design, there was only one initialization, since DA is roughly insensitive to this choice. 
\n\n[Figure 1: test set performance curves for (a) the waveform-+noise and (b) the satellite imagery experiments.] \n\nAcknowledgements \n\nThis work was supported in part by National Science Foundation Career Award IRI-9624870. \n\nReferences \n\nV. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16:105-111, 1995. \n\nA. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Ser. B, 39:1-38, 1977. \n\nZ. Ghahramani and M. I. Jordan. Supervised learning from incomplete data via an EM approach. In Neural Information Processing Systems 6, 120-127, 1994. \n\nM. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994. \n\nR. P. Lippmann. Pattern classification using neural networks. IEEE Communications Magazine, 27:47-64, 1989. \n\nD. J. Miller, A. Rao, K. Rose, and A. Gersho. A global optimization method for statistical classifier design. IEEE Transactions on Signal Processing, Dec. 1996. \n\nD. J. Miller and H. S. Uyar. A generalized Gaussian mixture classifier with learning based on both labelled and unlabelled data. Conf. on Info. Sci. and Sys., 1996. \n\nD. J. Miller. A mixture model equivalent to the radial basis function classifier. Submitted to Neural Computation, 1996. \n\nJ. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989. \n\nB. Shashahani and D. Landgrebe. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. 
IEEE Transactions on Geoscience and Remote Sensing, 32:1087-1095, 1994. \n\nV. Tresp, R. Neuneier, and S. Ahmad. Efficient methods for dealing with missing data in supervised learning. In Neural Information Processing Systems 7, 689-696, 1995. \n\nL. Xu, M. I. Jordan, and G. E. Hinton. An alternative model for mixtures of experts. In Neural Information Processing Systems 7, 633-640, 1995. \n", "award": [], "sourceid": 1208, "authors": [{"given_name": "David", "family_name": "Miller", "institution": null}, {"given_name": "Hasan", "family_name": "Uyar", "institution": null}]}