{"title": "Semi-supervised Learning via Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 753, "page_last": 760, "abstract": null, "full_text": "Semi-supervised Learning via Gaussian Processes\n\nNeil D. Lawrence\nDepartment of Computer Science, University of Sheffield\nSheffield, S1 4DP, U.K.\nneil@dcs.shef.ac.uk\n\nMichael I. Jordan\nComputer Science and Statistics, University of California\nBerkeley, CA 94720, U.S.A.\njordan@cs.berkeley.edu\n\nAbstract\n\nWe present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a \"null category noise model\" (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits.\n\n1 Introduction\n\nThe traditional machine learning classification problem involves a set of input vectors X = [x1 . . . xN]^T and associated labels y = [y1 . . . yN]^T, yn ∈ {-1, 1}. The goal is to find a mapping between the inputs and the labels that yields high predictive accuracy. It is natural to consider whether such predictive performance can be improved via \"semi-supervised learning,\" in which a combination of labeled and unlabeled data is available.\n\nProbabilistic approaches to classification either estimate the class-conditional densities or attempt to model p(yn|xn) directly. 
In the latter case, if we fail to make any assumptions about the underlying distribution of input data, the unlabeled data will not affect our predictions. Thus, most attempts to make use of unlabeled data within a probabilistic framework focus on incorporating a model of p(xn): for example, by treating it as a mixture, p(xn) = Σ_yn p(xn|yn) p(yn), and inferring p(yn|xn) (e.g., [5]), or by building kernels based on p(xn) (e.g., [8]). These approaches can be unwieldy, however, in that the complexities of the input distribution are typically of little interest when performing classification, so that much of the effort spent modelling p(xn) may be wasted.\n\nAn alternative is to make weaker assumptions regarding p(xn) that are of particular relevance to classification. In particular, the cluster assumption asserts that the data density should be reduced in the vicinity of a decision boundary (e.g., [2]). Such a qualitative assumption is readily implemented within the context of non-probabilistic kernel-based classifiers. In the current paper we take up the challenge of showing how it can be achieved within a (nonparametric) probabilistic framework.\n\nFigure 1: The ordered categorical noise model. The plot shows p(yn|fn) for different values of yn. Here we have assumed three categories.\n\nOur approach involves a notion of a \"null category region,\" a region which acts to exclude unlabeled data points. Such a region is analogous to the traditional notion of a \"margin\" and indeed our approach is similar in spirit to the transductive SVM [10], which seeks to maximize the margin by allocating labels to the unlabeled data. 
A major difference, however, is that our approach maintains and updates the process variance (not merely the process mean) and, as we will see, this variance turns out to interact in a significant way with the null category concept.\n\nThe structure of the paper is as follows. We introduce the basic probabilistic framework in Section 2 and discuss the effect of the null category in Section 3. Section 4 discusses posterior process updates and prediction. We present comparative experimental results in Section 5 and present our conclusions in Section 6.\n\n2 Probabilistic Model\n\nIn addition to the input vector xn and the label yn, our model includes a latent process variable fn, such that the probability of class membership decomposes as p(yn|xn) = ∫ p(yn|fn) p(fn|xn) dfn. We first focus on the noise model, p(yn|fn), deferring the discussion of an appropriate process model, p(fn|xn), to later.\n\n2.1 Ordered categorical models\n\nWe introduce a novel noise model which we have termed a null category noise model, as it derives from the general class of ordered categorical models [1]. 
In the specific context of binary classification, our focus in this paper, we consider an ordered categorical model containing three categories¹:\n\n    p(yn|fn) = Φ(-(fn + w/2))              for yn = -1,\n               Φ(fn + w/2) - Φ(fn - w/2)   for yn = 0,\n               Φ(fn - w/2)                 for yn = 1,\n\nwhere Φ(x) = ∫_{-∞}^{x} N(z|0, 1) dz is the cumulative Gaussian distribution function and w is a parameter giving the width of category yn = 0 (see Figure 1). We can also express this model in an equivalent and simpler form by replacing the cumulative Gaussian distribution by a Heaviside step function H(·) and adding independent Gaussian noise to the process model:\n\n    p(yn|fn) = H(-(fn + 1/2))              for yn = -1,\n               H(fn + 1/2) - H(fn - 1/2)   for yn = 0,\n               H(fn - 1/2)                 for yn = 1,\n\nwhere we have standardized the width parameter to 1, by assuming that the overall scale is also handled by the process model.\n\n¹See also [9], who makes use of a similar noise model in a discussion of Bayesian interpretations of the SVM.\n\nFigure 2: Graphical representation of the null category model. The fully-shaded nodes are always observed, whereas the lightly-shaded node is observed when zn = 0.\n\nTo use this model in an unlabeled setting we introduce a further variable, zn, which is one if a data point is unlabeled and zero otherwise. We first impose\n\n    p(zn = 1|yn = 0) = 0;    (1)\n\nin other words, a data point cannot be from the category yn = 0 and be unlabeled. We assign probabilities of missing labels to the other classes: p(zn = 1|yn = 1) = γ+ and p(zn = 1|yn = -1) = γ-. We see from the graphical representation in Figure 2 that zn is d-separated from xn. Thus when yn is observed, the posterior process is updated by using p(yn|fn). 
On the other hand, when the data point is unlabeled the posterior process must be updated by p(zn|fn), which is easily computed as:\n\n    p(zn = 1|fn) = Σ_yn p(yn|fn) p(zn = 1|yn).\n\nThe \"effective likelihood function\" for a single data point, L(fn), therefore takes one of three forms:\n\n    L(fn) = H(-(fn + 1/2))                        for yn = -1, zn = 0,\n            γ- H(-(fn + 1/2)) + γ+ H(fn - 1/2)    for zn = 1,\n            H(fn - 1/2)                           for yn = 1, zn = 0.\n\nThe constraint imposed by (1) implies that an unlabeled data point never comes from the class yn = 0. Since yn = 0 lies between the labeled classes this is equivalent to a hard assumption that no data comes from the region around the decision boundary. We can also soften this hard assumption if so desired by injection of noise into the process model. If we also assume that our labeled data only comes from the classes yn = 1 and yn = -1 we will never obtain any evidence for data with yn = 0; for this reason we refer to this category as the null category and the overall model as a null category noise model (NCNM).\n\n3 Process Model and Effect of the Null Category\n\nWe work within the Gaussian process framework and assume\n\n    p(fn|xn) = N(fn|μ(xn), ς(xn)),\n\nwhere the mean μ(xn) and the variance ς(xn) are functions of the input space. 
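\n\nAs a concrete illustration (ours, not from the paper), the three-branch effective likelihood above can be coded directly; the step function and the missing-label rates γ+ and γ- (arbitrary values here) are the only ingredients:\n\n```python\ndef heaviside(a):\n    # Heaviside step: 1 if a > 0, else 0\n    return 1.0 if a > 0 else 0.0\n\ndef effective_likelihood(f, y=0, z=1, gamma_plus=0.1, gamma_minus=0.1):\n    # L(f) for one data point under the NCNM (null category width fixed to 1).\n    # z = 1 marks an unlabeled point; gamma_plus/gamma_minus are illustrative.\n    if z == 1:\n        # sum over y of p(y|f) p(z=1|y); the null category contributes zero\n        return gamma_minus * heaviside(-(f + 0.5)) + gamma_plus * heaviside(f - 0.5)\n    if y == -1:\n        return heaviside(-(f + 0.5))\n    return heaviside(f - 0.5)\n```\n\nNote that any f inside the null category region (-1/2, 1/2) receives zero likelihood from an unlabeled point, which is the hard form of the cluster assumption.\n\n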
Figure 3: Two situations of interest. Diagrams show the prior distribution over fn (long dashes), the effective likelihood function from the noise model when zn = 1 (short dashes) and a schematic of the resulting posterior over fn (solid line). Left: The posterior is bimodal and has a larger variance than the prior. Right: The posterior has one dominant mode and a lower variance than the prior. In both cases the process is pushed away from the null category.\n\nA natural consideration in this setting is the effect of our likelihood function on the distribution over fn from incorporating a new data point. First we note that if yn ∈ {-1, 1} the effect of the likelihood will be similar to that incurred in binary classification, in that the posterior will be a convolution of the step function and a Gaussian distribution. This is comforting, as when a data point is labeled the model will act in a similar manner to a standard binary classification model. Consider now the case when the data point is unlabeled. The effect will depend on the mean and variance of p(fn|xn). If this Gaussian has little mass in the null category region, the posterior will be similar to the prior. However, if the Gaussian has significant mass in the null category region, the outcome may be loosely described in two ways:\n\n1. If p(fn|xn) \"spans the likelihood,\" Figure 3 (Left), then the mass of the posterior can be apportioned to either side of the null category region, leading to a bimodal posterior. 
The variance of the posterior will be greater than the variance of the prior, a consequence of the fact that the effective likelihood function is not log-concave (as can be easily verified).\n\n2. If p(fn|xn) is \"rectified by the likelihood,\" Figure 3 (Right), then the mass of the posterior will be pushed to one side of the null category and the variance of the posterior will be smaller than the variance of the prior.\n\nNote that whenever a portion of the mass of the prior distribution falls within the null category region it is pushed out to one side or both sides. The intuition behind the two situations is that in case 1 it is not clear what label the data point has, but it is clear that it shouldn't be where it currently is (in the null category). The result is that the process variance increases. In case 2 the data point is being assigned a label and the decision boundary is pushed to one side of the point so that it is classified according to the assigned label.\n\n4 Posterior Inference and Prediction\n\nBroadly speaking the effects discussed above are independent of the process model: the effective likelihood will always force the latent function away from the null category. To implement our model, however, we must choose a process model and an inference method. The nature of the noise model means that it is unlikely that we will find a non-trivial process model for which inference (in terms of marginalizing fn) will be tractable. We therefore turn to approximations which are inspired by \"assumed density filtering\" (ADF) methods; see, e.g., [3]. The idea in ADF is to approximate the (generally non-Gaussian) posterior with a Gaussian by matching the moments between the approximation and the true posterior. 
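\n\nTo make the two cases above concrete, here is a small numerical sketch (our illustration; grid quadrature simply stands in for the analytic moment computations). A N(0, 1) prior spanning the null category yields a bimodal posterior whose variance exceeds the prior's, while a N(2, 1) prior is rectified to one side and its variance shrinks:\n\n```python\nimport numpy as np\n\ndef ncnm_posterior_moments(mu, var, gamma=0.5):\n    # Prior N(f|mu, var) times the unlabeled effective likelihood\n    # gamma*H(-(f + 1/2)) + gamma*H(f - 1/2); moments by grid quadrature.\n    f = np.linspace(mu - 12.0, mu + 12.0, 20001)\n    prior = np.exp(-0.5 * (f - mu) ** 2 / var)\n    lik = gamma * (f < -0.5) + gamma * (f > 0.5)\n    w = prior * lik\n    w = w / w.sum()  # uniform grid, so simple sums suffice\n    m = (f * w).sum()\n    v = ((f - m) ** 2 * w).sum()\n    return m, v\n\n_, v_spanning = ncnm_posterior_moments(0.0, 1.0)   # case 1: variance grows\n_, v_rectified = ncnm_posterior_moments(2.0, 1.0)  # case 2: variance shrinks\n```\n\n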
ADF has also been extended to allow each approximation to be revisited and improved as the posterior distribution evolves [7].\n\nRecall from Section 3 that the noise model is not log-concave. When the variance of the process increases, the best Gaussian approximation to our noise model can have negative variance. This situation is discussed in [7], where various suggestions are given to cope with the issue. In our implementation we followed the simplest suggestion: we set a negative variance to zero.\n\nOne important advantage of the Gaussian process framework is that hyperparameters in the covariance function (i.e., the kernel function) can be optimized by type-II maximum likelihood. In practice, however, if the process variance is maximized in an unconstrained manner the effective width of the null category can be driven to zero, yielding a model that is equivalent to a standard binary classification noise model². To prevent this from happening we regularize with an L1 penalty on the process variances (this is equivalent to placing an exponential prior on those parameters).\n\n4.1 Prediction with the NCNM\n\nOnce the parameters of the process model have been learned, we wish to make predictions about a new test point x* via the marginal distribution p(y*|x*). For the NCNM an issue arises here: this distribution will have a non-zero probability of y* = 0, a label that does not exist in either our labeled or unlabeled data. This is where the role of z* becomes essential. The new point also has z* = 1, so in reality the probability that a data point is from the positive class is given by\n\n    p(y*|x*, z*) ∝ p(z*|y*) p(y*|x*).    (2)\n\nThe constraint that p(z* = 1|y* = 0) = 0 causes the predictions to be correctly normalized. So for the distribution to be correctly normalized for a test data point we must assume that we have observed z* = 1.\n\nAn interesting consequence is that observing x* will have an effect on the process model. This is contrary to the standard Gaussian process setup (see, e.g., [11]) in which the predictive distribution depends only on the labeled training data and the location of the test point x*. In the NCNM the entire process model p(f*|x*) should be updated after the observation of x*. This is not a particular disadvantage of our approach; rather, it is an inevitable consequence of any method that allows unlabeled data to affect the location of the decision boundary--a consequence that our framework makes explicit. In our experiments, however, we disregard such considerations and make (possibly suboptimal) predictions of the class labels according to (2).\n\n5 Experiments\n\nSparse representations of the data set are essential for speeding up the process of learning. 
We made use of the informative vector machine³ (IVM) approach [6] to greedily select an active set according to information-theoretic criteria.\n\n²Recall, as discussed in Section 1, that we fix the width of the null category to unity: changes in the scale of the process model are equivalent to changing this width.\n\n³The informative vector machine is an approximation to a full Gaussian process which is competitive with the support vector machine in terms of speed and accuracy.\n\nFigure 4: Results from the toy problem. There are 400 points, which are labeled with probability 0.1. Labelled data-points are shown as circles and crosses. Data-points in the active set are shown as large dots. All other data-points are shown as small dots. Left: Learning on the labeled data only with the IVM algorithm. All labeled points are used in the active set. Right: Learning on the labeled and unlabeled data with the NCNM. There are 100 points in the active set. In both plots decision boundaries are shown as a solid line; dotted lines represent contours within 0.5 of the decision boundary (for the NCNM this is the edge of the null category).\n\nThe IVM also enables efficient learning of kernel hyperparameters, and we made use of this feature in all of our experiments. In all our experiments we used a kernel of the form\n\n    knm = θ2 exp(-θ1 (xn - xm)^T (xn - xm)) + θ3 δnm,\n\nwhere δnm is the Kronecker delta function. 
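\n\nFor reference, a sketch of this covariance function (the θ values here are illustrative defaults, not those learned in the experiments):\n\n```python\nimport numpy as np\n\ndef ncnm_kernel(X, theta1=1.0, theta2=1.0, theta3=0.01):\n    # k_nm = theta2 * exp(-theta1 * (x_n - x_m)^T (x_n - x_m)) + theta3 * delta_nm\n    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)\n    return theta2 * np.exp(-theta1 * sq_dists) + theta3 * np.eye(len(X))\n```\n\nThe θ3 δnm term adds jitter on the diagonal only, so k(xn, xn) = θ2 + θ3 while off-diagonal entries decay with squared distance at rate θ1.\n\n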
The IVM algorithm selects an active set, and the parameters of the kernel were learned by performing type-II maximum likelihood over the active set. Since active set selection causes the marginalized likelihood to fluctuate, it cannot be used to monitor convergence; we therefore simply iterated fifteen times between active set selection and kernel parameter optimisation. The parameters of the noise model, {γ+, γ-}, can also be optimized, but note that if we constrain γ+ = γ- = γ then the likelihood is maximized by setting γ to the proportion of the training set that is unlabeled.\n\nWe first considered an illustrative toy problem to demonstrate the capabilities of our model. We generated two-dimensional data in which two class-conditional densities interlock. There were 400 points in the original data set. Each point was labeled with probability 0.1, leading to 37 labeled points. First a standard IVM classifier was trained on the labeled data only (Figure 4, Left). We then used the null category approach to train a classifier that incorporates the unlabeled data. As shown in Figure 4 (Right), the resulting decision boundary finds a region of low data density and more accurately reflects the underlying data distribution.\n\n5.1 High-dimensional example\n\nTo explore the capabilities of the model when the data set is of a much higher dimensionality we considered the USPS data set⁴ of handwritten digits. The task chosen was to separate the digit 3 from 5. 
To investigate performance across a range of different operating conditions, we varied the proportion of labeled data between 0.2 and 1.25 × 10⁻². We compared four classifiers: a standard IVM trained on the labeled data only, a support vector machine (SVM) trained on the labeled data only, the NCNM trained on the combined labeled-unlabeled data, and an implementation of the transductive SVM trained on the combined labeled-unlabeled data. The SVM and transductive SVM used the SVMlight software [4]. For the SVM, the kernel inverse width hyperparameter θ1 was set to the value learned by the IVM. For the transductive SVM it was set to the higher of the two values learned by the IVM and the NCNM⁵. For the SVM-based models we set θ2 = 1 and θ3 = 0; the margin error cost, C, was left at the SVMlight default setting.\n\n⁴The data set contains 658 examples of 5s and 556 examples of 3s.\n\nFigure 5: Area under the ROC curve plotted against probability of a point being labeled. Mean and standard errors are shown for the IVM (solid line), the NCNM (dotted line), the SVM (dash-dot line) and the transductive SVM (dashed line).\n\nThe quality of the resulting classifiers was evaluated by computing the area under the ROC curve for a previously unseen test data set. Each run was completed ten times with different random seeds. The results are summarized in Figure 5.\n\nThe results show that below a label probability of 2.5 × 10⁻² both the SVM and transductive SVM outperform the NCNM. 
In this region the estimate of θ1 provided by the NCNM was sometimes very low, leading to occasional very poor results (note the large error bar). Above 2.5 × 10⁻² a clear improvement is obtained for the NCNM over the other models. It is of interest to contrast this result with an analogous experiment on discriminating twos vs. threes in [8], where p(xn) was used to derive a kernel. No improvement was found in this case, which [8] attributed to the difficulties of modelling p(xn) in high dimensions. These difficulties appear to be diminished for the NCNM, presumably because it never explicitly models p(xn).\n\nWe would not want to read too much into the comparison between the transductive SVM and the NCNM since an exhaustive exploration of the regularisation parameter C was not undertaken. Similar comments also apply to the regularisation of the process variances for the NCNM. However, these preliminary results appear encouraging for the NCNM. Code for recreating all our experiments is available at http://www.dcs.shef.ac.uk/~neil/ncnm.\n\n⁵Initially we set the value to that learned by the NCNM, but performance was improved by selecting it to be the higher of the two.\n\n6 Discussion\n\nWe have presented an approach to learning a classifier in the presence of unlabeled data which incorporates the natural assumption that the data density between classes should be low. Our approach implements this qualitative assumption within a probabilistic framework without explicit, expensive and possibly counterproductive modeling of the class-conditional densities.\n\nOur approach is similar in spirit to the transductive SVM, but with a major difference: in the SVM the process variance is discarded. In the NCNM, the process variance is a key part of data point selection; in particular, Figure 3 illustrated how inclusion of some data points actually increases the posterior process variance. Discarding process variance has advantages and disadvantages--an advantage is that it leads to an optimisation problem that is naturally sparse, while a disadvantage is that it prevents optimisation of kernel parameters via type-II maximum likelihood.\n\nIn Section 4.1 we discussed how test data points affect the location of our decision boundary. An important desideratum would be that the location of the decision boundary should converge as the amount of test data goes to infinity. One direction for further research would be to investigate whether or not this is the case.\n\nAcknowledgments\n\nThis work was supported under EPSRC Grant No. GR/R84801/01 and a grant from the National Science Foundation.\n\nReferences\n\n[1] A. Agresti. Categorical Data Analysis. John Wiley and Sons, 2002.\n\n[2] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In Advances in Neural Information Processing Systems, Cambridge, MA, 2002. MIT Press.\n\n[3] L. Csató. Gaussian Processes -- Iterative Sparse Approximations. PhD thesis, Aston University, 2002.\n\n[4] T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods: Support Vector Learning, Cambridge, MA, 1998. MIT Press.\n\n[5] N. D. Lawrence and B. Schölkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In Proceedings of the International Conference in Machine Learning, San Francisco, CA, 2001. Morgan Kaufmann.\n\n[6] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In Advances in Neural Information Processing Systems, Cambridge, MA, 2003. MIT Press.\n\n[7] T. P. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts Institute of Technology, 2001.\n\n[8] M. Seeger. 
Covariance kernels from Bayesian generative models. In Advances in Neural Information Processing Systems, Cambridge, MA, 2002. MIT Press.\n\n[9] P. Sollich. Probabilistic interpretation and Bayesian methods for support vector machines. In Proceedings 1999 International Conference on Artificial Neural Networks, ICANN'99, pages 91-96, 1999.\n\n[10] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.\n\n[11] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Learning in Graphical Models, Cambridge, MA, 1999. MIT Press.\n", "award": [], "sourceid": 2605, "authors": [{"given_name": "Neil", "family_name": "Lawrence", "institution": null}, {"given_name": "Michael", "family_name": "Jordan", "institution": null}]}