{"title": "Probabilistic Modeling for Face Orientation Discrimination: Learning from Labeled and Unlabeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 854, "page_last": 860, "abstract": null, "full_text": "Probabilistic Modeling for Face Orientation Discrimination:
Learning from Labeled and Unlabeled Data

Shumeet Baluja
baluja@cs.cmu.edu
Justsystem Pittsburgh Research Center &
School of Computer Science, Carnegie Mellon University

Abstract

This paper presents probabilistic modeling methods to solve the problem of discriminating between five facial orientations with very little labeled data. Three models are explored. The first model maintains no inter-pixel dependencies, the second model is capable of modeling a set of arbitrary pair-wise dependencies, and the last model allows dependencies only between neighboring pixels. We show that for all three of these models, the accuracy of the learned models can be greatly improved by augmenting a small number of labeled training images with a large set of unlabeled images using Expectation-Maximization. This is important because it is often difficult to obtain image labels, while many unlabeled images are readily available. Through a large set of empirical tests, we examine the benefits of unlabeled data for each of the models. By using only two randomly selected labeled examples per class, we can discriminate between the five facial orientations with an accuracy of 94%; with six labeled examples, we achieve an accuracy of 98%.

1 Introduction

This paper examines probabilistic modeling techniques for discriminating between five face orientations: left profile, left semi-profile, frontal, right semi-profile, and right profile.
Three models are explored: the first model represents no inter-pixel dependencies, the second model is capable of modeling a set of arbitrary pair-wise dependencies, and the last model allows dependencies only between neighboring pixels.
Models which capture inter-pixel dependencies can provide better classification performance than those that do not capture dependencies. The difficulty in using the more complex models, however, is that as more dependencies are modeled, more parameters must be estimated - which requires more training data. We show that by using Expectation-Maximization, the accuracy of what is learned can be greatly improved by augmenting a small number of labeled training images with unlabeled images, which are much easier to obtain.
The remainder of this section describes the problem of face orientation discrimination in detail. Section 2 provides a brief description of the probabilistic models explored. Section 3 presents results with these models with varying amounts of training data. Also shown is how Expectation-Maximization can be used to augment the limited labeled training data with unlabeled training data. Section 4 briefly discusses related work. Finally, Section 5 closes the paper with conclusions and suggestions for future work.

1.1 Detailed Problem Description

The interest in face orientation discrimination arises from two areas. First, the rapid increase in the availability of inexpensive cameras makes it practical to create systems which automatically monitor a person while using a computer. By using motion, color, and size cues, it is possible to quickly find and segment a person's face when he/she is sitting in front of a computer monitor.
By determining whether the person is looking directly at the computer, or is staring away from the computer, we can provide feedback to any user interface that could benefit from knowing whether a user is paying attention or is distracted (such as computer-based tutoring systems for children, computer games, or even car-mounted cameras that monitor drivers).
Second, to perform accurate face detection for use in video-indexing or content-based image retrieval systems, one approach is to design detectors specific to each face orientation, such as [Rowley et al., 1998, Sung 1996]. Rather than applying all detectors to every location, a face-orientation system can be applied to each candidate face location to "route" the candidate to the appropriate detector, thereby reducing the potential for false-positives, and also reducing the computational cost of applying each detector. This approach was taken in [Rowley et al., 1998].
For the experiments in this paper, each image to be classified is 20x20 pixels. The face is centered in the image, and comprises most of the image. Sample faces are shown in Figure 1. Empirically, our experiments show that accurate pose discrimination is possible from binary versions of the images. First, the images were histogram-equalized to values between 0 and 255. This is a standard non-linear transformation that maps an approximately equal number of pixels to each value within the 0-255 range. It is used to improve the contrast in images. Second, to "binarize" the images, pixels with intensity above 128 were mapped to a value of 255; otherwise the pixels were mapped to a value of 0.
[Figure 1: 4 images of each of the 5 classes to be discriminated. Note the variability in the images. Left: Original Images. Right: Images after histogram equalization and binary quantization.]

2 Methods Explored

This section provides a description of the probabilistic models explored: Naive-Bayes, Dependency Trees (as proposed by [Chow and Liu, 1968]), and a dependence network which models dependencies only between neighboring pixels. For more details on using Bayesian "multinets" (independent networks trained to model each class) for classification in a manner very similar to that used in this paper, see [Friedman et al., 1997].

2.1 The Naive-Bayes Model

The first, and simplest, model assumes that each pixel is independent of every other pixel. Although this assumption is clearly violated in real images, the model often yields good results with limited training data since it requires the estimation of the fewest parameters. Assuming that each image belongs exclusively to one of the five face classes to be discriminated, the probability of the image belonging to a particular class is given as follows:

P(Class_c | Image) = P(Image | Class_c) * P(Class_c) / P(Image)

P(Image | Class_c) = prod_{i=1..400} P(Pixel_i | Class_c)

P(Pixel_i | Class_c) is estimated directly from the training data by:

P(Pixel_i | Class_c) = (k + sum_{TrainingImages} Pixel_i * P(Class_c | Image)) / (2k + sum_{TrainingImages} P(Class_c | Image))

Since we are only counting examples from the training images, P(Class_c | Image) is known.
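As a concrete illustration, the counting scheme above can be sketched as follows. This is a minimal sketch under the paper's setup (binary pixels, smoothing constant k = 0.001, equal class priors); the function and variable names are our own, and the shapes generalize the paper's 400-pixel, 5-class case:

```python
import numpy as np

K = 0.001  # smoothing constant, as in the paper

def fit_naive_bayes(images, soft_labels):
    """Estimate P(Pixel_i = 1 | Class_c) by weighted counting.

    images: (N, P) array of 0/1 pixels (P = 400 in the paper).
    soft_labels: (N, C) array of P(Class_c | Image); rows are one-hot for
    labeled data, but the mass may be spread across classes (as in EM).
    """
    class_mass = soft_labels.sum(axis=0)          # sum over images of P(c | img)
    pixel_mass = soft_labels.T @ images           # (C, P) weighted pixel counts
    return (K + pixel_mass) / (2 * K + class_mass[:, None])

def predict(p_pixel, image):
    """Return argmax_c P(Class_c | Image), assuming equal class priors."""
    log_lik = (image * np.log(p_pixel) +
               (1 - image) * np.log(1 - p_pixel)).sum(axis=1)
    return int(np.argmax(log_lik))
```

Because the priors are equal and P(Image) is common to all classes, taking the argmax of the log-likelihood alone suffices.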
The notation P(Class_c | Image) is used to represent image labels because it is convenient for describing the counting process with both labeled and unlabeled data (this will be described in detail in Section 3). With the labeled data, P(Class_c | Image) ∈ {0,1}. Later, P(Class_c | Image) may not be binary; instead, the probability mass may be divided between classes. Pixel_i ∈ {0,1} since the images are binary. k is a smoothing constant, set to 0.001.

When used for classification, we compute the posterior probabilities and take the maximum, c_predicted, where:

c_predicted = argmax_c P(Class_c | Image) ∝ P(Image | Class_c)

For simplicity, P(Class_c) is assumed equal for all c; P(Image) is a normalization constant which can be ignored since we are only interested in finding the maximum posterior probability.

2.2 Optimal Pair-Wise Dependency Trees

We wish to model a probability distribution P(X_1, ..., X_400 | Class_c), where each X corresponds to a pixel in the image. Instead of assuming pixel independence, we restrict our model to the following form:

P(X_1 ... X_n | Class_c) = prod_{i=1..n} P(X_i | Pa(X_i), Class_c)

where Pa(X_i) is X_i's single "parent" variable. We require that there be no cycles in these "parent-of" relationships: formally, there must exist some permutation m = (m_1, ..., m_n) of (1, ..., n) such that (Pa(X_i) = X_j) implies m(j) < m(i) for all i. In other words, we restrict P' to factorizations representable by Bayesian networks in which each node (except the root) has one parent, i.e., tree-shaped graphs.
A method for finding the optimal model within these restrictions is presented in [Chow and Liu, 1968]. A complete weighted graph G is created in which each variable X_i is represented by a corresponding vertex V_i, and in which the weight W_ij for the edge between vertices V_i and V_j is set to the mutual information I(X_i, X_j) between X_i and X_j.
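As an aside, the Chow-Liu construction just described can be sketched as follows; this is an illustrative sketch with hypothetical function names, assuming binary pixel columns and a simple Kruskal-style maximum spanning tree:

```python
import numpy as np
from itertools import combinations

def mutual_information(xi, xj):
    """I(Xi; Xj) for two binary columns, estimated from empirical counts."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((xi == a) & (xj == b))
            p_a, p_b = np.mean(xi == a), np.mean(xj == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu_edges(images):
    """Maximum spanning tree over mutual-information edge weights.

    images: (N, n) array of 0/1 pixels; returns the (n-1) tree edges.
    """
    n = images.shape[1]
    weights = sorted(((mutual_information(images[:, i], images[:, j]), i, j)
                      for i, j in combinations(range(n), 2)), reverse=True)
    parent = list(range(n))          # union-find forest to reject cycles
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = []
    for w, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                 # edge joins two components: keep it
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

For 400 pixels this enumerates all 79,800 pairs, which is the intended cost of the method: the tree is chosen over the complete weighted graph.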
The edges in the maximum spanning tree of G determine an optimal set of (n-1) conditional probabilities with which to construct a tree-based model of the original probability distribution. We calculate the probabilities P(X_i) and P(X_i, X_j) directly from the dataset. From these, we calculate the mutual information, I(X_i, X_j), between all pairs of variables X_i and X_j:

I(X_i, X_j) = sum_{a,b} P(X_i = a, X_j = b) * log [ P(X_i = a, X_j = b) / (P(X_i = a) * P(X_j = b)) ]

The maximum spanning tree minimizes the Kullback-Leibler divergence D(P || P') between the true and estimated distributions:

D(P || P') = sum_X P(X) * log [ P(X) / P'(X) ]

as shown in [Chow & Liu, 1968]. Among all distributions of the same form, this distribution maximizes the likelihood of the data when the data is a set of empirical observations drawn from any unknown distribution.

2.3 Local Dependency Models

Unlike the Dependency Trees presented in the previous section, the local dependency networks only model dependencies between adjacent pixels. The most obvious dependencies to model are each pixel's eight neighbors. The dependencies are shown graphically in Figure 2 (left). The difficulty with this representation is that two pixels may be dependent upon each other (if this model were represented as a Bayesian network, it would contain cycles). Therefore, to avoid problems with circular dependencies, we use the following model instead. Each pixel is still connected to each of its eight neighbors; however, the arcs are directed such that the dependencies are acyclic. In this local dependence network, each pixel is only dependent on four of its neighbors: the three neighbors to the right and the one immediately below.
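The parent set just described (the three neighbors in the column to the right, plus the pixel immediately below) can be enumerated as in the following sketch; the function name is ours, and dropping off-grid neighbors at the image border is an assumption about how the edges of the 20x20 grid are handled:

```python
def parents(r, c, size=20):
    """Parents of pixel (r, c) in the acyclic local-dependence network:
    the three neighbors in the column to the right, plus the pixel
    immediately below. Off-grid candidates are dropped at the borders.
    Processing columns right-to-left (and rows bottom-to-top within a
    column) gives an ordering in which every parent precedes its child,
    so the network contains no cycles."""
    candidates = [(r - 1, c + 1), (r, c + 1), (r + 1, c + 1), (r + 1, c)]
    return [(i, j) for i, j in candidates
            if 0 <= i < size and 0 <= j < size]
```

Interior pixels thus have exactly four parents, matching the model on the right of Figure 2.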
The dependencies which are modeled are shown graphically in Figure 2 (right). The dependencies are:

P(Image | Class_c) = prod_{i=1..400} P(Pixel_i | Pa(Pixel_i), Class_c)

[Figure 2: Diagram of the dependencies maintained. Each square represents a pixel in the image. Dependencies are shown only for two pixels. (Left) Model with 8 dependencies - note that because this model has circular dependencies, we do not use it. Instead, we use the model shown on the Right. (Right) Model used has 4 dependencies per pixel. By imposing an ordering on the pixels, circular dependencies are avoided.]

3 Performance with Labeled and Unlabeled Data

In this section, we compare the results of the three probabilistic models with varying amounts of labeled training data. The training set consists of between 1 and 500 labeled training examples, and the testing set contains 5500 examples. Each experiment is repeated at least 20 times with random train/test splits of the data.

3.1 Using only Labeled Data

In this section, experiments are conducted with only labeled data. Figure 3 (left) shows each model's accuracy in classifying the images in the test set into the five classes. As expected, as more training data is used, the performance improves for all models.
Note that the model with no dependencies performs the best when there is little data. However, as the amount of data increases, the relative performance of this model, compared to the other models which account for dependencies, decreases. It is interesting to note that when there is little data, the Dependency Trees perform poorly. Since these trees can select dependencies between any two pixels, they are the most susceptible to finding spurious dependencies.
However, as the amount of data increases, the performance of this model rapidly improves. By using all of the labeled data (500 examples total), the Dependency Tree and the Local-Dependence network perform approximately the same, achieving a correct classification rate of approximately 99%.

[Figure 3: Performance of the three models. X Axis: Amount of labeled training data used. Y Axis: Percent correct on an independent test set. In the left graph, only labeled data was used. In the right graph, unlabeled and labeled data was used (the total number of examples was 500, with varying amounts of labeled data).]

3.2 Augmenting the Models with Unlabeled Data

We can augment what is learned from only using the labeled examples by incorporating unlabeled examples through the use of the Expectation-Maximization (EM) algorithm. Although the details of EM are beyond the scope of this paper, the resulting algorithm is easily described (for a description of EM and applications to filling in missing values, see [Dempster et al., 1977] and [Ghahramani & Jordan, 1994]):

1. Build the models using only the labeled data (as in Section 2).

2. Use the models to probabilistically label the unlabeled images.
3. Using the images with the probabilistically assigned labels, and the images with the given labels, recalculate the models' parameters. As mentioned in Section 2, for the images labeled by this process, P(Class_c | Image) is not restricted to {0,1}; the probability mass for an image may be spread to multiple classes.

4. If a pre-specified termination condition is not met, go to step 2.

This process is used for each classifier. The termination condition was five iterations; after five iterations, there was little change in the models' parameters.
The performance of the three classifiers with unlabeled data is shown in Figure 3 (right). Note that with small amounts of data, the performance of all of the classifiers improved dramatically when the unlabeled data was used. Figure 4 shows the percent improvement by using the unlabeled data to augment the labeled data. Note that the error is reduced by almost 90% with the use of unlabeled data (see the case with Dependency Trees with only 4 labeled examples, in which the accuracy rates increase from 44% to 92.5%). With only 50 labeled examples, a classification accuracy of 99% was obtained. This accuracy was obtained with almost an order of magnitude fewer labeled examples than required with classifiers which used only labeled examples.
In almost every case examined, the addition of unlabeled data helped performance. However, unlabeled data actually hurt the no-dependency model when a large amount of labeled data already existed. With large amounts of labeled data, the parameters of the model were estimated well. Incorporating unlabeled data may have hurt performance because the underlying generative process modeled did not match the real generative process.
Therefore, the additional data provided may not have been labeled with the accuracy required to improve the model's classification performance. It is interesting to note that with the more complex models, such as the dependency trees or local dependence networks, even with the same amount of labeled data, unlabeled data improved performance.
[Nigam et al., 1998] have reported similar performance degradation when using a large number of labeled examples and EM with a naive-Bayesian model to classify text documents. They describe two methods for overcoming this problem. First, they adjust the relative weight of the labeled and unlabeled data in the M-step by using cross-validation. Second, they provide multiple centroids per class, which improves the data/model fit. Although not presented here due to space limitations, the first method was attempted - it improved the performance on the face orientation discrimination task.

[Figure 4: Improvement for each model by using unlabeled data to augment the labeled data. Left: with only 1 labeled example, Middle: 4 labeled, Right: 50 labeled. The bars in light gray represent the performance with only labeled data, the dark bars indicate the performance with the unlabeled data. The number in parentheses indicates the absolute (in contrast to relative) percentage change in classification performance with the use of unlabeled data.]
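The four-step EM procedure of Section 3.2 can be sketched end-to-end for the naive-Bayes model as follows. The five-iteration cap and the smoothing constant come from the text; the function names and the equal-priors posterior computation are our own scaffolding:

```python
import numpy as np

K = 0.001  # smoothing constant, as in the paper

def fit(images, soft_labels):
    """M-step for the naive-Bayes pixel model: weighted counts with smoothing."""
    class_mass = soft_labels.sum(axis=0)
    return (K + soft_labels.T @ images) / (2 * K + class_mass[:, None])

def posteriors(p_pixel, images):
    """E-step: P(Class_c | Image) for each image, assuming equal class priors."""
    log_lik = (images @ np.log(p_pixel).T +
               (1 - images) @ np.log(1 - p_pixel).T)
    log_lik -= log_lik.max(axis=1, keepdims=True)   # numerical stability
    lik = np.exp(log_lik)
    return lik / lik.sum(axis=1, keepdims=True)

def em_train(labeled, labels_onehot, unlabeled, n_iter=5):
    """Steps 1-4 of Section 3.2 with a fixed iteration count."""
    p_pixel = fit(labeled, labels_onehot)            # step 1: labeled data only
    for _ in range(n_iter):
        soft = posteriors(p_pixel, unlabeled)        # step 2: probabilistic labels
        all_imgs = np.vstack([labeled, unlabeled])   # step 3: refit on everything
        all_lbls = np.vstack([labels_onehot, soft])
        p_pixel = fit(all_imgs, all_lbls)            # step 4: loop for n_iter rounds
    return p_pixel
```

The same loop applies unchanged to the dependency-tree and local-dependence models; only `fit` and `posteriors` differ.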
4 Related Work

There is a large amount of work which attempts to discover attributes of faces, including (but not limited to) face detection, face expression discrimination, face recognition, and face orientation discrimination (for example [Rowley et al., 1998][Sung, 1996][Bartlett & Sejnowski, 1997][Cottrell & Metcalfe, 1991][Turk & Pentland, 1991]). The work presented in this paper demonstrates the effective incorporation of unlabeled data into image classification procedures; it should be possible to use unlabeled data in any of these tasks.
The closest related work is presented in [Nigam et al., 1998]. They used naive-Bayes methods to classify text documents into a pre-specified number of groups. By using unlabeled data, they achieve significant classification performance improvement over using labeled documents alone. Other work which has employed EM for learning from labeled and unlabeled data includes [Miller and Uyar, 1997], who used a mixture of experts classifier, and [Shahshahani & Landgrebe, 1994], who used a mixture of Gaussians. However, the dimensionality of their input was at least an order of magnitude smaller than used here. There is a wealth of other related work, such as [Ghahramani & Jordan, 1994], who have used EM to fill in missing values in the training examples. In their work, class labels can be regarded as another feature value to fill in.
Other approaches to reducing the need for large amounts of labeled data take the form of active learning, in which the learner can ask for the labels of particular examples. [Cohn et al., 1996] and [McCallum & Nigam, 1998] provide good overviews of active learning.

5 Conclusions & Future Work

This paper has made two contributions. The first contribution is to solve the problem of discriminating between five face orientations with very little data.
With only two labeled example images per class, we were able to obtain classification accuracies of 94% on separate test sets (with the local dependence networks with 4 parents). With only a few more examples, this was increased to greater than 98% accuracy. This task has a range of applications in the design of user-interfaces and user monitoring.
We also explored the use of multiple probabilistic models with unlabeled data. The models varied in their complexity, ranging from modeling no dependencies between pixels, to modeling four dependencies per pixel. While the no-dependency model performs well with very little labeled data, when given a large amount of labeled data, it is unable to match the performance of the other models presented. The Dependency-Tree models perform the worst when given small amounts of data because they are most susceptible to finding spurious dependencies in the data. The local dependency models performed the best overall, both by working well with little data, and by being able to exploit more data, whether labeled or unlabeled. By using EM to incorporate unlabeled data into the training of the classifiers, we improved the performance of the classifiers by up to approximately 90% when little labeled data was available.
The use of unlabeled data is vital in this domain. It is time-consuming to hand label many images, but many unlabeled images are often readily available. Because many similar tasks, such as face recognition and facial expression discrimination, suffer from the same problem of limited labeled data, we hope to apply the methods described in this paper to these applications. Preliminary results on related recognition tasks have been promising.

Acknowledgments

Scott Davies helped tremendously with discussions about modeling dependencies.
I would also like to acknowledge the help of Andrew McCallum for discussions of EM, unlabeled data and the related work. Many thanks are given to Henry Rowley who graciously provided the data set. Finally, thanks are given to Kaari Flagstad for comments on drafts of this paper.

References

Bartlett, M. & Sejnowski, T. (1997) "Viewpoint Invariant Face Recognition using ICA and Attractor Networks", in Adv. in Neural Information Processing Systems (NIPS) 9.
Chow, C. & Liu, C. (1968) "Approximating Discrete Probability Distributions with Dependence Trees", IEEE Transactions on Information Theory, 14: 462-467.
Cohn, D.A., Ghahramani, Z. & Jordan, M. (1996) "Active Learning with Statistical Models", Journal of Artificial Intelligence Research 4: 129-145.
Cottrell, G. & Metcalfe (1991) "Face, Gender and Emotion Recognition using Holons", NIPS 3.
Dempster, A.P., Laird, N.M. & Rubin, D.B. (1977) "Maximum Likelihood from Incomplete Data via the EM Algorithm", J. Royal Statistical Society Series B, 39: 1-38.
Friedman, N., Geiger, D. & Goldszmidt, M. (1997) "Bayesian Network Classifiers", Machine Learning, 29.
Ghahramani, Z. & Jordan, M. (1994) "Supervised Learning from Incomplete Data Via an EM Approach", NIPS 6.
McCallum, A. & Nigam, K. (1998) "Employing EM in Pool-Based Active Learning", in ICML-98.
Miller, D. & Uyar, H. (1997) "A Mixture of Experts Classifier with Learning based on both Labeled and Unlabeled Data", in Adv. in Neural Information Processing Systems 9.
Nigam, K., McCallum, A., Thrun, S. & Mitchell, T. (1998) "Learning to Classify Text from Labeled and Unlabeled Examples", to appear in AAAI-98.
Rowley, H., Baluja, S. & Kanade, T. (1998) "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 20, No. 1, January, 1998.
Shahshahani, B. & Landgrebe, D.
(1994) "The Effect of Unlabeled Samples in Reducing the Small Sample Size Problem and Mitigating the Hughes Phenomenon", IEEE Trans. on Geoscience and Remote Sensing, 32.
Sung, K.K. (1996) Learning and Example Selection for Object and Pattern Detection. Ph.D. Thesis, MIT AI Lab - AI Memo 1572.
Turk, M. & Pentland, A. (1991) "Eigenfaces for Recognition", J. Cognitive Neuroscience, 3(1).
", "award": [], "sourceid": 1567, "authors": [{"given_name": "Shumeet", "family_name": "Baluja", "institution": null}]}