{"title": "Robust Multi-Class Gaussian Process Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 280, "page_last": 288, "abstract": "Multi-class Gaussian Process Classifiers (MGPCs) are often affected by over-fitting problems when labeling errors occur far from the decision boundaries. To prevent this, we investigate a robust MGPC (RMGPC) which considers labeling errors independently of their distance to the decision boundaries. Expectation propagation is used for approximate inference. Experiments with several datasets in which noise is injected in the class labels illustrate the benefits of RMGPC. This method performs better than other Gaussian process alternatives based on considering latent Gaussian noise or heavy-tailed processes. When no noise is injected in the labels, RMGPC still performs equal or better than the other methods. Finally, we show how RMGPC can be used for successfully identifying data instances which are difficult to classify accurately in practice.", "full_text": "Robust Multi-Class Gaussian Process Classi\ufb01cation\n\nDaniel Hern\u00b4andez-Lobato\n\nICTEAM - Machine Learning Group\n\nUniversit\u00b4e catholique de Louvain\n\nPlace Sainte Barbe, 2\n\nLouvain-La-Neuve, 1348, Belgium\n\ndanielhernandezlobato@gmail.com\n\nJos\u00b4e Miguel Hern\u00b4andez-Lobato\n\nDepartment of Engineering\nUniversity of Cambridge\n\nTrumpington Street, Cambridge\n\nCB2 1PZ, United Kingdom\njmh233@eng.cam.ac.uk\n\nICTEAM - Machine Learning Group\n\nUniversit\u00b4e catholique de Louvain\n\nPlace Sainte Barbe, 2\n\nPierre Dupont\n\nLouvain-La-Neuve, 1348, Belgium\n\npierre.dupont@uclouvain.be\n\nAbstract\n\nMulti-class Gaussian Process Classi\ufb01ers (MGPCs) are often affected by over-\n\ufb01tting problems when labeling errors occur far from the decision boundaries. To\nprevent this, we investigate a robust MGPC (RMGPC) which considers labeling\nerrors independently of their distance to the decision boundaries. 
Expectation\npropagation is used for approximate inference. Experiments with several datasets\nin which noise is injected in the labels illustrate the bene\ufb01ts of RMGPC. This\nmethod performs better than other Gaussian process alternatives based on consid-\nering latent Gaussian noise or heavy-tailed processes. When no noise is injected in\nthe labels, RMGPC still performs equal or better than the other methods. Finally,\nwe show how RMGPC can be used for successfully identifying data instances\nwhich are dif\ufb01cult to classify correctly in practice.\n\n1\n\nIntroduction\n\nMulti-class Gaussian process classi\ufb01ers (MGPCs) are a Bayesian approach to non-parametric multi-\nclass classi\ufb01cation with the advantage of producing probabilistic outputs that measure uncertainty\nin the predictions [1]. MGPCs assume that there are some latent functions (one per class) whose\nvalue at a certain location is related by some rule to the probability of observing a speci\ufb01c class\nthere. The prior for each of these latent functions is speci\ufb01ed to be a Gaussian process. The task of\ninterest is to make inference about the latent functions using Bayes\u2019 theorem. Nevertheless, exact\nBayesian inference in MGPCs is typically intractable and one has to rely on approximate methods.\nApproximate inference can be implemented using Markov-chain Monte Carlo sampling, the Laplace\napproximation or expectation propagation [2, 3, 4, 5].\nA problem of MGPCs is that, typically, the assumed rule that relates the values of the latent functions\nwith the different classes does not consider the possibility of observing errors in the labels of the\ndata, or at most, only considers the possibility of observing errors near the decision boundaries\nof the resulting classi\ufb01er [1]. The consequence is that over-\ufb01tting can become a serious problem\nwhen errors far from these boundaries are observed in practice. 
A notable exception is found in\nthe binary classi\ufb01cation case when the labeling rule suggested in [6] is used. Such rule considers\nthe possibility of observing errors independently of their distance to the decision boundary [7, 8].\nHowever, the generalization of this rule to the multi-class case is dif\ufb01cult. Existing generalizations\n\n1\n\n\fare in practice simpli\ufb01ed so that the probability of observing errors in the labels is zero [3]. Labeling\nerrors in the context of MGPCs are often accounted for by considering that the latent functions of the\nMGPC are contaminated with additive Gaussian noise [1]. Nevertheless, this approach has again the\ndisadvantage of considering only errors near the decision boundaries of the resulting classi\ufb01er and is\nexpected to lead to over-\ufb01tting problems when errors are actually observed far from the boundaries.\nFinally, some authors have replaced the underlying Gaussian processes of the MGPC with heavy-\ntailed processes [9]. These processes have marginal distributions with heavier tails than those of a\nGaussian distribution and are in consequence expected to be more robust to labeling errors far from\nthe decision boundaries.\nIn this paper we investigate a robust MGPC (RMGPC) that addresses labeling errors by introducing\na set of binary latent variables. One latent variable for each data instance. These latent variables\nindicate whether the assumed labeling rule is satis\ufb01ed for the associated instances or not. If such\nrule is not satis\ufb01ed for a given instance, we consider that the corresponding label has been randomly\nselected with uniform probability among the possible classes. This is used as a back-up mechanism\nto explain data instances that are highly unlikely to stem from the assumed labeling rule. The\nresulting likelihood function depends only on the total number of errors, and not on the distances\nof these errors to the decision boundaries. 
Thus, RMGPC is expected to be fairly robust when\nthe data contain noise in the labels. In this model, expectation propagation (EP) can be used to\nef\ufb01ciently carry out approximate inference [10]. The cost of EP is O(ln3), where n is the number\nof training instances and l is the number of different classes. RMGPC is evaluated in four datasets\nextracted from the UCI repository [11] and from other sources [12]. These experiments show the\nbene\ufb01cial properties of the proposed model in terms of prediction performance. When labeling noise\nis introduced in the data, RMGPC outperforms other MGPC approaches based on considering latent\nGaussian noise or heavy-tailed processes. When there is no noise in the data, RMGPC performs\nbetter or equivalent to these alternatives. Extra experiments also illustrate the utility of RMGPC to\nidentify data instances that are unlikely to stem from the assumed labeling rule.\nThe organization of the rest of the manuscript is as follows: Section 2 introduces the RMGPC model.\nSection 3 describes how expectation propagation can be used for approximate Bayesian inference.\nThen, Section 4 evaluates and compares the predictive performance of RMGPC. Finally, Section 5\nsummarizes the conclusions of the investigation.\n\n2 Robust Multi-Class Gaussian Process Classi\ufb01cation\nConsider n training instances in the form of a collection of feature vectors X = {x1, . . . , xn} with\nassociated labels y = {y1, . . . , yn}, where yi \u2208 C = {1, . . . , l} and l is the number of classes. We\nfollow [3] and assume that, in the noise free scenario, the predictive rule for yi given xi is\n\nyi = arg max\n\nfk(xi) ,\n\nk\n\n(1)\n\nwhere f1, . . . , fl are unknown latent functions that have to be estimated. The prediction rule given by\n(1) is unlikely to hold always in practice. For this reason, we introduce a set of binary latent variables\nz = {z1, . . . 
, zn}, one per data instance, to indicate whether (1) is satisfied (zi = 0) or not (zi = 1).\nIn this latter case, the pair (xi, yi) is considered to be an outlier and, instead of assuming that yi is\ngenerated by (1), we assume that xi is assigned a random class sampled uniformly from C. This is\nequivalent to assuming that f1, . . . , fl have been contaminated with an infinite amount of noise and\nserves as a back-up mechanism to explain observations which are highly unlikely to originate from\n(1). The likelihood function for f = (f1(x1), f1(x2), . . . , f1(xn), f2(x1), f2(x2), . . . , f2(xn), . . . ,\nfl(x1), fl(x2), . . . , fl(xn))T given y, X and z is\n\nP(y|X, z, f ) = \u220f_{i=1}^{n} [ \u220f_{k \u2260 yi} \u0398(fyi(xi) \u2212 fk(xi)) ]^{1\u2212zi} [ 1/l ]^{zi} ,   (2)\n\nwhere \u0398(\u00b7) is the Heaviside step function. In (2), the contribution to the likelihood of each instance\n(xi, yi) is a mixture of two terms: a first term equal to \u220f_{k \u2260 yi} \u0398(fyi(xi) \u2212 fk(xi)) and a second\nterm equal to 1/l. The mixing coefficient is the prior probability of zi = 1. Note that only the first\nterm actually depends on the accuracy of f. In particular, it takes value 1 when the corresponding\ninstance is correctly classified using (1) and 0 otherwise. Thus, the likelihood function described in\n(2) considers only the total number of prediction errors made by f and not the distance of these errors\nto the decision boundary. The consequence is that (2) is expected to be robust when the observed\ndata contain labeling errors far from the decision boundaries.\nWe do not have any preference for a particular instance to be considered an outlier. 
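As an illustration, the contribution of a single instance to the likelihood (2) can be sketched as follows (a minimal sketch; the function name and array-based interface are illustrative, not part of the paper's released R implementation):

```python
import numpy as np

def instance_likelihood(f_i, y_i, z_i, l):
    """Contribution of one instance to the likelihood (2).

    f_i : array with the l latent function values at x_i.
    y_i : observed class label in {0, ..., l-1}.
    z_i : 1 if the instance is treated as an outlier, 0 otherwise.
    """
    if z_i == 1:
        # Outlier: the label is assumed drawn uniformly from the l classes.
        return 1.0 / l
    # Non-outlier: product of Heaviside steps, i.e. 1 if the labeling
    # rule (1) holds (y_i attains the maximum latent value), 0 otherwise.
    return float(all(f_i[y_i] > f_i[k] for k in range(l) if k != y_i))
```

Only whether the rule holds matters, not by how much, which is what makes (2) insensitive to the distance of an error to the decision boundaries.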
Thus, z is set to\nfollow a priori a factorizing multivariate Bernoulli distribution:\n\nP(z|\u03c1) = Bern(z|\u03c1) = \u220f_{i=1}^{n} \u03c1^{zi} (1 \u2212 \u03c1)^{1\u2212zi} ,   (3)\n\nwhere \u03c1 is the prior fraction of training instances expected to be outliers. The prior for \u03c1 is set to be\na conjugate beta distribution, namely\n\nP(\u03c1) = Beta(\u03c1|a0, b0) = \u03c1^{a0\u22121} (1 \u2212 \u03c1)^{b0\u22121} / B(a0, b0) ,   (4)\n\nwhere B(\u00b7,\u00b7) is the beta function and a0 and b0 are free hyper-parameters. The values of a0 and b0\ndo not have a big impact on the final model provided that they are consistent with the prior belief\nthat most of the observed data are labeled using (1) (b0 > a0) and that they are small enough that (4)\nis not too constraining. We suggest a0 = 1 and b0 = 9.\nAs in [3], the prior for f1, . . . , fl is set to be a product of Gaussian processes with means equal to 0\nand covariance matrices K1, . . . , Kl, as computed by l covariance functions c1(\u00b7,\u00b7), . . . , cl(\u00b7,\u00b7):\n\nP(f ) = \u220f_{k=1}^{l} N (fk|0, Kk) ,   (5)\n\nwhere N (\u00b7|\u00b5, \u03a3) denotes a multivariate Gaussian density with mean vector \u00b5 and covariance matrix\n\u03a3, f is defined as in (2) and fk = (fk(x1), fk(x2), . . . , fk(xn))T, for k = 1, . . . , l.\n\n2.1 Inference, Prediction and Outlier Identification\n\nGiven the observed data X and y, we make inference about f, z and \u03c1 using Bayes\u2019 theorem:\n\nP(\u03c1, z, f|y, X) = P(y|X, z, f ) P(z|\u03c1) P(\u03c1) P(f ) / P(y|X) ,   (6)\n\nwhere P(y|X) is the model evidence, a constant useful to perform model comparison under a\nBayesian setting [13]. 
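As a concrete illustration, the hierarchical prior over \u03c1 and z in (3)-(4) can be sampled as follows (a sketch using the suggested a0 = 1, b0 = 9; names are illustrative):

```python
import numpy as np

def sample_outlier_prior(n, a0=1.0, b0=9.0, rng=None):
    """Draw rho ~ Beta(a0, b0), then z_i ~ Bernoulli(rho), as in (3)-(4)."""
    rng = np.random.default_rng() if rng is None else rng
    rho = rng.beta(a0, b0)       # prior fraction of outliers, eq. (4)
    z = rng.binomial(1, rho, n)  # per-instance outlier indicators, eq. (3)
    return rho, z
```

With a0 = 1 and b0 = 9 the prior mean of \u03c1 is a0/(a0 + b0) = 0.1, so about 10% of the training instances are expected to be outliers a priori.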
The posterior distribution and the likelihood function can be used to compute\na predictive distribution for the label y* \u2208 C associated to a new observation x*:\n\nP(y*|x*, y, X) = \u222b \u2211_{z, z*} P(y*|x*, z*, f*) P(z*|\u03c1) P(f*|f ) P(\u03c1, z, f|y, X) df df* d\u03c1 ,   (7)\n\nwhere f* = (f1(x*), . . . , fl(x*))T, P(y*|x*, z*, f*) = [ \u220f_{k \u2260 y*} \u0398(fy*(x*) \u2212 fk(x*)) ]^{1\u2212z*} (1/l)^{z*},\nP(z*|\u03c1) = \u03c1^{z*} (1 \u2212 \u03c1)^{1\u2212z*} and P(f*|f ) is a product of l conditional Gaussians with zero mean and\ncovariance matrices given by the covariance functions used to compute K1, . . . , Kl. The posterior for z is\n\nP(z|y, X) = \u222b P(\u03c1, z, f|y, X) df d\u03c1 .   (8)\n\nThis distribution is useful to compute the posterior probability that the i-th training instance is an\noutlier, i.e., P(zi = 1|y, X). For this, we only have to marginalize (8) with respect to all the\ncomponents of z except zi. Unfortunately, the exact computation of (6), (7) and P(zi = 1|y, X) is\nintractable for typical classification problems. 
Nevertheless, these expressions can be approximated\nusing expectation propagation [10].\n\n3 Expectation Propagation\n\nThe joint probability of f, z, \u03c1 and y given X can be written as the product of l(n + 1) + 1 factors:\n\nP(f , z, \u03c1, y|X) = P(y|X, z, f )P(z|\u03c1)P(\u03c1)P(f )\n\n\uf8f9\uf8fb(cid:34) n(cid:89)\n\n(cid:35)\n\n(cid:34) l(cid:89)\n\n(cid:35)\n\n\uf8ee\uf8f0 n(cid:89)\n\n(cid:89)\n\ni=1\n\nk(cid:54)=yi\n\n=\n\n\u03c8ik(f , z, \u03c1)\n\n\u03c8i(f , z, \u03c1)\n\n\u03c8\u03c1(f , z, \u03c1)\n\n\u03c8k(f , z, \u03c1)\n\n,\n\n(9)\n\ni=1\n\n3\n\nk=1\n\n\f,\n\n(cid:34) l(cid:89)\n\nwhere each factor has the following form:\n\n\u03c8ik(f , z, \u03c1) = \u0398(fyi(xi) \u2212 fk(xi))1\u2212zi (l\u2212 1\n\nl\u22121 )zi ,\n\n\u03c8\u03c1(f , z, \u03c1) =\n\n\u03c1a0\u22121(1 \u2212 \u03c1)b0\u22121\n\nB(a0, b0)\n\n\u03c8i(f , z, \u03c1) = \u03c1zi(1 \u2212 \u03c1)1\u2212zi ,\n\u03c8k(f , z, \u03c1) = N (fk|0, Kk) .\n\n(10)\n\n\uf8f9\uf8fb(cid:34) n(cid:89)\n\n(cid:35)\n\n(cid:34) l(cid:89)\n\n(cid:35)\n\n\u02dc\u03c8ik\n\n\u02dc\u03c8i\n\n\u02dc\u03c8\u03c1\n\n\u02dc\u03c8k\n\n.\n\n(11)\n\nLet \u03a8 be the set that contains all these exact factors. Expectation propagation (EP) approximates\neach \u03c8 \u2208 \u03a8 using a corresponding simpler factor \u02dc\u03c8 such that\n\n\uf8ee\uf8f0 n(cid:89)\n\n(cid:89)\n\ni=1\n\nk(cid:54)=yi\n\n\uf8f9\uf8fb(cid:34) n(cid:89)\n\n(cid:35)\n\n\u03c8ik\n\n\u03c8i\n\n\u03c8\u03c1\n\n(cid:35)\n\n\u2248\n\n\u03c8k\n\n\uf8ee\uf8f0 n(cid:89)\n\n(cid:89)\n\nk(cid:54)=yi\n\ni=1\n\nk=1\n\ni=1\n\ni=1\n\nk=1\n\nIn (11) the dependence of the exact and the approximate factors on f, z and \u03c1 has been removed\nto improve readability. The approximate factors \u02dc\u03c8 are constrained to belong to the same family of\nexponential distributions, but they do not have to integrate to one. Once normalized with respect to\nf, z and \u03c1, (9) becomes the exact posterior distribution (6). 
Similarly, the normalized product of the\napproximate factors becomes an approximation to the posterior distribution:\n\n(cid:35)\n\n(cid:34) l(cid:89)\n\n(cid:35)\n\n\uf8ee\uf8f0 n(cid:89)\n\n(cid:89)\n\ni=1\n\nk(cid:54)=yi\n\nQ(f , z, \u03c1) =\n\n1\nZ\n\n\uf8f9\uf8fb(cid:34) n(cid:89)\n\n\u02dc\u03c8ik(f , z, \u03c1)\n\n\u02dc\u03c8i(f , z, \u03c1)\n\n\u02dc\u03c8\u03c1(f , z, \u03c1)\n\n\u02dc\u03c8k(f , z, \u03c1)\n\n,\n\n(12)\n\ni=1\n\nk=1\n\nwhere Z is a normalization constant that approximates P(y|X). Exponential distributions are closed\nunder product and division operations. Therefore, Q has the same form as the approximate factors\nand Z can be readily computed. In practice, the form of Q is selected \ufb01rst, and the approximate\nfactors are then constrained to have the same form as Q. For each approximate factor \u02dc\u03c8 de\ufb01ne\nQ\\ \u02dc\u03c8 \u221d Q/ \u02dc\u03c8 and consider the corresponding exact factor \u03c8. EP iteratively updates each \u02dc\u03c8, one by\none, so that the Kullback-Leibler (KL) divergence between \u03c8Q\\ \u02dc\u03c8 and \u02dc\u03c8Q\\ \u02dc\u03c8 is minimized. The EP\nalgorithm involves the following steps:\n\n1. Initialize all the approximate factors \u02dc\u03c8 and the posterior approximation Q to be uniform.\n2. Repeat until Q converges:\n\n(a) Select an approximate factor \u02dc\u03c8 to re\ufb01ne and compute Q\\ \u02dc\u03c8 \u221d Q/ \u02dc\u03c8.\n(b) Update the approximate factor \u02dc\u03c8 so that KL(\u03c8Q\\ \u02dc\u03c8|| \u02dc\u03c8Q\\ \u02dc\u03c8) is minimized.\n(c) Update the posterior approximation Q to the normalized version of \u02dc\u03c8Q\\ \u02dc\u03c8.\n\n3. Evaluate Z \u2248 P(y|X) as the integral of the product of all the approximate factors.\n\nThe optimization problem in step 2-(b) is convex with a single global optimum. The solution to this\nproblem is found by matching suf\ufb01cient statistics between \u03c8Q\\ \u02dc\u03c8 and \u02dc\u03c8Q\\ \u02dc\u03c8. 
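The scheme above can be sketched on a toy case. The code below, assuming one-dimensional Gaussian factors stored as natural parameters, follows steps 2-(a) to 2-(c); since the exact factors in this toy example are themselves Gaussian, moment matching is exact and the loop recovers the true posterior. It illustrates only the algorithmic skeleton, not the actual RMGPC factor updates:

```python
import numpy as np

# A 1-D Gaussian in natural parameters: (eta1, eta2) = (mu/var, -1/(2*var)).
# Toy EP: the target is a product of Gaussian factors, so moment matching
# in step 2-(b) is exact and the loop converges after one sweep.

def ep_gaussian_product(mus, variances, n_sweeps=5):
    n = len(mus)
    factors = np.zeros((n, 2))       # approximate factors (natural params)
    q = factors.sum(axis=0)          # posterior approximation Q
    for _ in range(n_sweeps):
        for i in range(n):
            cavity = q - factors[i]              # step 2-(a): Q \ factor_i
            exact = np.array([mus[i] / variances[i], -0.5 / variances[i]])
            tilted = cavity + exact              # step 2-(b): match moments
            factors[i] = tilted - cavity         # updated approximate factor
            q = cavity + factors[i]              # step 2-(c): refresh Q
    var = -0.5 / q[1]
    return q[0] * var, var           # posterior mean and variance
```

For two unit-variance factors centred at 0 and 2, the loop returns mean 1 and variance 0.5, the exact product of the two Gaussians; with non-Gaussian exact factors, the moment-matching step would instead compute the moments of the tilted distribution numerically.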
EP is not guaranteed\nto converge globally, but extensive empirical evidence shows that most of the time it converges to\na fixed point [10]. Non-convergence can be prevented by damping the EP updates [14]. Damping\nis a standard procedure and consists in setting \u02dc\u03c8 = [ \u02dc\u03c8new]^\u03b5 [ \u02dc\u03c8old]^{1\u2212\u03b5} in step 2-(b), where \u02dc\u03c8new is the\nupdated factor and \u02dc\u03c8old is the factor before the update. \u03b5 \u2208 [0, 1] is a parameter which controls the\namount of damping. When \u03b5 = 1, the standard EP update operation is recovered. When \u03b5 = 0, no\nupdate of the approximate factors occurs. In our experiments \u03b5 = 0.5 gives good results and EP\nseems to always converge to a stationary solution. EP has shown good overall performance when\ncompared to other methods in the task of classification with binary Gaussian processes [15, 16].\n\n3.1 The Posterior Approximation\n\nThe posterior distribution (6) is approximated by a distribution Q in the exponential family:\n\nQ(f , z, \u03c1) = Bern(z|p) Beta(\u03c1|a, b) \u220f_{k=1}^{l} N (fk|\u00b5k, \u03a3k) ,   (13)\n\nwhere N (\u00b7|\u00b5, \u03a3) is a multivariate Gaussian distribution with mean \u00b5 and covariance matrix \u03a3;\nBeta(\u00b7|a, b) is a beta distribution with parameters a and b; and Bern(\u00b7|p) is a multivariate Bernoulli\ndistribution with parameter vector p. The parameters \u00b5k and \u03a3k for k = 1, . . . , l and p, a and b\nare estimated by EP. Note that Q factorizes with respect to fk for k = 1, . . . , l. This makes the cost\nof the EP algorithm linear in l, the total number of classes. More accurate approximations can be\nobtained at a cubic cost in l by considering correlations among the fk. 
The choice of (13) also makes\nall the required computations tractable and provides good results in Section 4.\nThe approximate factors must have the same functional form as Q but they need not be normalized.\nHowever, the exact factors \u03c8ik with i = 1, . . . , n and k (cid:54)= yi, corresponding to the likelihood,\n(2), only depend on fk(xi), fyi(xi) and zi. Thus, the beta part of the corresponding approximate\nfactors can be removed and the multivariate Gaussian distributions simplify to univariate Gaussians.\nSpeci\ufb01cally, the approximate factors \u02dc\u03c8ik with i = 1, . . . , n and k (cid:54)= yi are:\n\n\u02dc\u03c8ik(f , z, \u03c1) = \u02dcsik exp\n\n(cid:26)\n\n\u2212 1\n2\n\n(cid:20) (fk(xi) \u2212 \u02dc\u00b5ik)2\n\n\u02dc\u03bdik\n\n(cid:21)(cid:27)\n\n(fyi(xi) \u2212 \u02dc\u00b5yi\nik)2\n\n+\n\n\u02dc\u03bdyi\nik\n\nik(1 \u2212 \u02dcpik)1\u2212zi ,\n\u02dcpzi\n\n(14)\n\nwhere \u02dcsik, \u02dcpik, \u02dc\u00b5ik, \u02dc\u03bdik, \u02dc\u00b5yi\nik are free parameters to be estimated by EP. Similarly, the exact\nfactors \u03c8i, with i = 1, . . . , n, corresponding to the prior for the latent variables z, (3), only depend\non \u03c1 and zi. Thus, the Gaussian part of the corresponding approximate factors can be removed and\nthe multivariate Bernoulli distribution simpli\ufb01es to a univariate Bernoulli. The resulting factors are:\n\nik and \u02dc\u03bdyi\n\n\u02dc\u03c8i(f , z, \u03c1) = \u02dcsi\u03c1\u02dcai\u22121(1 \u2212 \u03c1)\n\n\u02dcbi\u22121 \u02dcpzi\n\ni (1 \u2212 \u02dcpi)1\u2212zi ,\n\n(15)\n\nfor i = 1, . . . , n, where \u02dcsi, \u02dcai, \u02dcbi, \u02dcpi are free parameters to be estimated by EP. The exact factor \u03c8\u03c1\ncorresponding to the prior for \u03c1, (4), need not be approximated, i.e., \u02dc\u03c8\u03c1 = \u03c8\u03c1. The same applies to\nthe exact factors \u03c8k, for k = 1, . . . , l, corresponding to the priors for f1, . . . , fl, (5). We set \u02dc\u03c8k = \u03c8k\nfor k = 1, . . . , l. 
All these factors \u02dc\u03c8\u03c1 and \u02dc\u03c8k, for k = 1, . . . , l, need not be re\ufb01ned by EP.\n\n3.2 The EP Update Operations\nThe approximate factors \u02dc\u03c8ik, for i = 1, . . . , n and k (cid:54)= yi, corresponding to the likelihood, are\nre\ufb01ned in parallel, as in [17]. This notably simpli\ufb01es the EP updates. In particular, for each \u02dc\u03c8ik\nwe compute Q\\ \u02dc\u03c8ik as in step 2-(a) of EP. Given each Q\\ \u02dc\u03c8ik and the exact factor \u03c8ik, we update\neach \u02dc\u03c8ik. Then, Qnew is re-computed as the normalized product of all the approximate factors.\nPreliminary experiments indicate that parallel and sequential updates converge to the same solution.\nThe remaining factors, i.e., \u02dc\u03c8i, for i = 1, . . . , n, are updated sequentially, as in standard EP. Further\ndetails about all these EP updates are found in the supplementary material1. The cost of EP, assuming\nconstant iterations until convergence, is O(ln3). This is the cost of inverting l matrices of size n\u00d7n.\n\n3.3 Model Evidence, Prediction and Outlier Identi\ufb01cation\n\nOnce EP has converged, we can evaluate the approximation to the model evidence as the integral of\nthe product of all the approximate terms. 
This gives the following result:\n\nCk \u2212 log |Mk|\n\nlog Z = B +\n\nwhere\n\ni=1\n\n(cid:34) n(cid:88)\n\uf8ee\uf8f0(cid:89)\n(cid:40)(cid:80)\n\nk(cid:54)=yi\n\n\u02dcpik\n\nk(cid:54)=yi\n\u02dc\u00b52\nik/\u02dc\u03bdik\n\nDi = \u02dcpi\n\n\u03c4 k\ni =\n\n+\n\n1\n2\n\nlog Di\n\n(cid:35)\n\uf8f9\uf8fb + (1 \u2212 \u02dcpi)\n\n(cid:34) l(cid:88)\n\uf8ee\uf8f0(cid:89)\n\nk=1\n\n(\u02dc\u00b5yi\n\nik)2/\u02dc\u03bdyi\n\nik\n\nk(cid:54)=yi\n\nif k = yi ,\notherwise ,\n\n\uf8f9\uf8fb + log \u02dcsi\n\n\uf8f9\uf8fb ,\n\n+\n\ni=1\n\nk(cid:54)=yi\n\nlog \u02dcsik\n\n\uf8ee\uf8f0 n(cid:88)\n\uf8ee\uf8f0(cid:88)\n(cid:35)\n\uf8f9\uf8fb , Ck = \u00b5T\nk \u00b5k \u2212 n(cid:88)\nii =(cid:80)\n\nk\u03a3\u22121\n\ni=1\n\nB = log B(a, b) \u2212 log B(a0, b0) ,\n\n(16)\n\n(17)\n\n(1 \u2212 \u02dcpik)\n\n\u03c4 k\ni ,\n\nik )\u22121, if yi = k, and\n(\u02dc\u03bdyi\nik otherwise. It is possible to compute the gradient of log Z with respect to \u03b8kj, i.e., the j-th\n\nand Mk = \u039bkKk + I, with \u039bk a diagonal matrix de\ufb01ned as \u039bk\nii = \u02dc\u03bd\u22121\n\u039bk\n1The supplementary material is available online at http://arantxa.ii.uam.es/%7edhernan/RMGPC/.\n\nk(cid:54)=yi\n\n5\n\n\fhyper-parameter of the k-th covariance function used to compute Kk. Such gradient is useful to \ufb01nd\nthe covariance functions ck(\u00b7,\u00b7), with k = 1, . . . , l, that maximize the model evidence. Speci\ufb01cally,\none can show that, if EP has converged, the gradient of the free parameters of the approximate\nfactors with respect to \u03b8kj is zero [18]. 
Thus, the gradient of log Z with respect to \u03b8kj is\n\n\u2202 log Z\n\u2202\u03b8kj\n\ntrace\n\n= \u2212 1\n2\nn)T with bk\n\n+\n\n(\u03c5k)T(M\u22121\n\n1\nk )T \u2202Kk\n2\n\u2202\u03b8kj\n\u02dc\u00b5yi\nik/\u02dc\u03bdyi\nik , if k = yi, and bk\n\nwhere \u03c5k = (bk\nThe predictive distribution (7) can be approximated when the exact posterior is replaced by Q:\n\ni = \u02dc\u00b5ik/\u02dc\u03bdik otherwise.\n\n2, . . . , bk\n\n1, bk\n\nM\u22121\n\nk \u03c5k ,\n\n(18)\n\n(cid:18)\n\n(cid:19)\n\nM\u22121\n\ni =(cid:80)\n\nk \u039bk \u2202Kk\n\u2202\u03b8kj\n(cid:90)\n\nk(cid:54)=yi\n\nP(y(cid:63)|x(cid:63), y, X) \u2248 \u03c1\nl\n\n+ (1 \u2212 \u03c1)\n\nN (u|my(cid:63) , vy(cid:63) )\n\ndu ,\n\n(19)\n\nwhere \u03a6(\u00b7) is the cumulative probability function of a standard Gaussian distribution and\nk \u03a3kK\u22121\n\u03c1 = a/(a + b) , mk = (k(cid:63)\nk equal to the covariances between x(cid:63) and X, and with \u03ba(cid:63)\n\n(20)\nk equal to the\nfor k = 1, . . . , l, with k(cid:63)\ncorresponding variance at x(cid:63), as computed by ck(\u00b7,\u00b7). There is no closed form expression for the\nintegral in (19). However, it can be easily approximated by a one-dimensional quadrature.\nThe posterior (8) of z can be similarly approximated by marginalizing Q with respect to \u03c1 and f:\n\nk \u2212 K\u22121\n\nk)TK\u22121\n\nk Mk\u03c5k ,\n\nk \u2212 (k(cid:63)\n\nvk = \u03ba(cid:63)\n\nk ,\n\nk\n\n(cid:1) k(cid:63)\n\n(cid:18) u \u2212 mk\u221a\n\n(cid:19)\n\nvk\n\n\u03a6\n\n(cid:89)\nk)T(cid:0)K\u22121\n\nk(cid:54)=y(cid:63)\n\nP(z|y, X) \u2248 Bern(z|p) =\n\nn(cid:89)\n\n(cid:2)pzi\ni (1 \u2212 pi)1\u2212zi(cid:3) ,\n\n(21)\n\ni=1\n\nwhere p = (p1, . . . , pn)T. Each parameter pi of Q, with 1 \u2264 i \u2264 n, approximates P(zi = 1|y, X),\ni.e., the posterior probability that the i-th training instance is an outlier. 
Thus, these parameters can\nbe used to identify the data instances that are more likely to be outliers.\nThe cost of evaluating (16) and (18) is respectively O(ln3) and O(n3). The cost of evaluating (19)\nis O(ln2) since K\u22121\n\nk , with k = 1, . . . , l, needs to be computed only once.\n\n4 Experiments\n\nThe proposed Robust Multi-class Gaussian Process Classi\ufb01er (RMGPC) is compared in several ex-\nperiments with the Standard Multi-class Gaussian Process Classi\ufb01er (SMGPC) suggested in [3].\nSMGPC is a particular case of RMGPC which is obtained when b0 \u2192 \u221e. This forces the prior\ndistribution for \u03c1, (4), to be a delta centered at the origin, indicating that it is not possible to observe\noutliers. SMGPC explains data instances for which (1) is not satis\ufb01ed in practice by considering\nGaussian noise in the estimation of the functions f1, . . . , fl, which is the typical approach found\nin the literature [1]. RMGPC is also compared in these experiments with the Heavy-Tailed Process\nClassi\ufb01er (HTPC) described in [9]. In HTPC, the prior for each latent function fk, with k = 1, . . . , l,\nis a Gaussian Process that has been non-linearly transformed to have marginals that follow hyper-\nbolic secant distributions with scale parameter bk. The hyperbolic secant distribution has heavier\ntails than the Gaussian distribution and is expected to perform better in the presence of outliers.\n\n4.1 Classi\ufb01cation of Noisy Data\n\nWe carry out experiments on four datasets extracted from the UCI repository [11] and from other\nsources [12] to evaluate the predictive performance of RMGPC, SMGPC and HTPC when different\nfractions of outliers are present in the data2. These datasets are described in Table 1. All have\nmultiple classes and a fairly small number n of instances. We have selected problems with small n\nbecause all the methods analyzed scale as O(n3). 
The data for each problem are randomly split 100\ntimes into training and test sets containing respectively 2/3 and 1/3 of the data. Furthermore, the\nlabels of \u03b7 \u2208 {0%, 5%, 10%, 20%} of the training instances are selected uniformly at random from\nC. The data are normalized to have zero mean and unit standard deviation on the training set, and\nthe average balanced class rate (BCR) of each method on the test set is reported for each value of\n\u03b7. The BCR of a method with prediction accuracy ak on those instances of class k (k = 1, . . . , l) is\ndefined as (1/l) \u2211_{k=1}^{l} ak. BCR is preferred to prediction accuracy in datasets with unbalanced class\ndistributions, which is the case for the datasets displayed in Table 1.\n\n2The R source code of RMGPC is available at http://arantxa.ii.uam.es/%7edhernan/RMGPC/.\n\nTable 1: Characteristics of the datasets used in the experiments.\n\nDataset      # Instances  # Attributes  # Classes  Source\nNew-thyroid  215          5             3          UCI\nWine         178          13            3          UCI\nGlass        214          9             6          UCI\nSVMguide2    319          20            3          LIBSVM\n\nIn our experiments, the different methods analyzed (RMGPC, SMGPC and HTPC) use the same\ncovariance function for each latent function, i.e., ck(\u00b7,\u00b7) = c(\u00b7,\u00b7), for k = 1, . . . , l, where\n\nc(xi, xj) = exp{ \u2212 (1 / (2\u03b3)) (xi \u2212 xj)T (xi \u2212 xj) }   (22)\n\nis a standard Gaussian covariance function with length-scale parameter \u03b3. Preliminary experiments\non the datasets analyzed show no significant benefit from considering a different covariance function\nfor each latent function. An extra term equal to \u03d1k^2 is also added to the diagonal of the covariance\nmatrices Kk of SMGPC, for k = 1, . . . , l, to account for latent Gaussian noise with variance \u03d1k^2\naround fk [1]. These extra terms are used by SMGPC to explain those instances that are unlikely\nto stem from (1). 
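The covariance function (22) can be written down directly. The sketch below also exposes SMGPC's extra diagonal term as an optional argument (the function name and the noise_var parameter are illustrative):

```python
import numpy as np

def gaussian_covariance(X, gamma, noise_var=0.0):
    """Covariance matrix for c(xi, xj) = exp(-||xi - xj||^2 / (2 * gamma)),
    optionally adding a latent-noise variance on the diagonal (SMGPC)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared distances
    K = np.exp(-d2 / (2.0 * gamma))
    return K + noise_var * np.eye(len(X))
```

Setting noise_var = 0 recovers RMGPC's plain kernel; a positive value reproduces SMGPC's latent-Gaussian-noise mechanism.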
In both RMGPC and SMGPC the parameter \u03b3 is found by maximizing (16) using\na standard gradient ascent procedure. The same method is used for tuning the parameters \u03d1k in\nSMGPC. In HTPC an approximation to the model evidence is maximized with respect to \u03b3 and the\nscale parameters bk, with k = 1, . . . , l, using also gradient ascent [9].\n\nTable 2: Average BCR in % of each method for each problem, as a function of \u03b7.\n\nDataset      RMGPC (\u03b7=0%)  SMGPC (\u03b7=0%)  HTPC (\u03b7=0%)   RMGPC (\u03b7=5%)  SMGPC (\u03b7=5%)  HTPC (\u03b7=5%)\nNew-thyroid  94.2\u00b14.5      93.9\u00b14.4      90.0\u00b15.5 \u25c1    92.7\u00b14.9      90.7\u00b15.8 \u25c1    89.7\u00b16.1 \u25c1\nWine         98.0\u00b11.6      98.0\u00b11.6      97.3\u00b12.0 \u25c1    97.5\u00b11.7      97.3\u00b12.0      96.6\u00b12.2 \u25c1\nGlass        65.2\u00b17.7      60.6\u00b18.6 \u25c1    59.5\u00b18.0 \u25c1    63.5\u00b18.0      58.9\u00b18.0 \u25c1    57.9\u00b17.5 \u25c1\nSVMguide2    76.3\u00b14.1      74.6\u00b14.2 \u25c1    72.8\u00b14.1 \u25c1    75.6\u00b14.3      73.8\u00b14.4 \u25c1    71.9\u00b14.5 \u25c1\n\nDataset      RMGPC (\u03b7=10%) SMGPC (\u03b7=10%) HTPC (\u03b7=10%)  RMGPC (\u03b7=20%) SMGPC (\u03b7=20%) HTPC (\u03b7=20%)\nNew-thyroid  92.3\u00b15.4      89.0\u00b15.5 \u25c1    88.3\u00b16.6 \u25c1    89.5\u00b16.0      85.9\u00b17.4 \u25c1    85.7\u00b17.7 \u25c1\nWine         97.0\u00b12.2      96.4\u00b12.6      95.6\u00b14.6 \u25c1    96.6\u00b12.7      95.5\u00b12.6 \u25c1    95.1\u00b13.0 \u25c1\nGlass        63.9\u00b17.9      58.0\u00b17.4 \u25c1    55.7\u00b17.7 \u25c1    59.7\u00b18.3      55.5\u00b17.3 \u25c1    52.8\u00b17.8 \u25c1\nSVMguide2    74.9\u00b14.4      72.8\u00b14.7 \u25c1    71.5\u00b14.7 \u25c1    72.8\u00b15.1      71.4\u00b15.0 \u25c1    67.5\u00b15.6 \u25c1\n\nTable 2 displays for each problem the average BCR of each method for the different values of \u03b7\nconsidered. When the performance of a method is significantly different from the performance of\nRMGPC, as estimated by a Wilcoxon rank test (p-value < 1%), the corresponding BCR is marked\nwith the symbol \u25c1. 
The table shows that, when there is no noise in the labels (i.e., \u03b7 = 0%), RMGPC\nperforms similarly to SMGPC in New-thyroid and Wine, while it outperforms SMGPC in Glass\nand SVMguide2. As the level of noise increases, RMGPC is found to outperform SMGPC in all the\nproblems investigated. HTPC typically performs worse than RMGPC and SMGPC independently of\nthe value of \u03b7. This can be a consequence of HTPC using the Laplace approximation for approximate\ninference [9]. In particular, there is evidence indicating that the Laplace approximation performs\nworse than EP in the context of Gaussian process classifiers [15]. Extra experiments comparing\nRMGPC, SMGPC and HTPC under 3 different noise scenarios appear in the supplementary material.\nThey further support the better performance of RMGPC in the presence of outliers in the data.\n\n4.2 Outlier Identification\n\nA second batch of experiments shows the utility of RMGPC to identify data instances that are likely\nto be outliers. These experiments use the Glass dataset from the previous section. Recall that for this\ndataset RMGPC performs significantly better than SMGPC for \u03b7 = 0%, which suggests the presence\nof outliers. After normalizing the Glass dataset, we run RMGPC on the whole data and estimate the\nposterior probability that each instance is an outlier using (21). The hyper-parameters of RMGPC\nare estimated as described in the previous section. Figure 1 shows for each instance (xi, yi) of the\nGlass dataset, with i = 1, . . . , n, the value of P(zi = 1|y, X). Note that most of the instances\nare considered to be outliers with very low posterior probability. Nevertheless, there is a small set\nof instances that have very high posterior probabilities. These instances are unlikely to stem from\n(1) and are expected to be misclassified when placed on the test set. 
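Selecting the instances that are more likely to be outliers than not amounts to thresholding the estimates of P(zi = 1|y, X) from (21) at 1/2 (a sketch; the vector p would come from the fitted EP approximation):

```python
import numpy as np

def flag_outliers(p, threshold=0.5):
    """Indices i whose estimated P(z_i = 1 | y, X) exceeds the threshold."""
    return np.flatnonzero(np.asarray(p) > threshold)
```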
Consider the set of instances that are more likely to be outliers than normal instances (i.e., instances 3, 36, 127, 137, 152, 158 and 188). Assume the experimental protocol of the previous section. Table 3 displays the fraction of times that each of these instances is misclassified by RMGPC, SMGPC and HTPC when placed in the test set. The posterior probability that each instance is an outlier, as estimated by RMGPC, is also reported. The table shows that all the instances are typically misclassified by all the classifiers investigated, which confirms the difficulty of obtaining accurate predictions for them in practice.

Figure 1: Posterior probability that each data instance from the Glass dataset is an outlier.

Table 3: Average test error in % of each method on each data instance that is more likely to be an outlier. The probability that the instance is an outlier, as estimated by RMGPC, is also displayed.

                                        Glass Data Instances
                  3rd        36th       127th      137th      152nd      158th      188th
RMGPC             100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0
SMGPC             100.0±0.0  92.0±5.5   100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0
HTPC              100.0±0.0  84.0±7.5   100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0  100.0±0.0
P(zi = 1|y, X)    0.69       0.96       0.82       0.51       0.86       0.83       1.00

5 Conclusions

We have introduced a Robust Multi-class Gaussian Process Classifier (RMGPC). RMGPC considers only the number of errors made, and not the distance of such errors to the decision boundaries of the classifier. This is achieved by introducing binary latent variables that indicate whether a given instance is considered to be an outlier (a wrongly labeled instance) or not. RMGPC can also identify the training instances that are more likely to be outliers.
Exact Bayesian inference in RMGPC is intractable for typical learning problems. Nevertheless, approximate inference can be efficiently carried out using expectation propagation (EP). When EP is used, the training cost of RMGPC is O(ln³), where l is the number of classes and n is the number of training instances. Experiments in four multi-class classification problems show the benefits of RMGPC when labeling noise is injected in the data. In this case, RMGPC performs better than other alternatives based on considering latent Gaussian noise or noise which follows a heavy-tailed distribution. When there is no noise in the data, RMGPC performs as well as or better than these alternatives. Our experiments also confirm the utility of RMGPC for identifying data instances that are difficult to classify accurately in practice. These instances are typically misclassified by different predictors when included in the test set.

Acknowledgment

All experiments were run on the Center for Intensive Computation and Mass Storage (Louvain). All authors acknowledge support from the Spanish MCyT (Project TIN2010-21575-C02-02).

References

[1] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.

[2] Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351, 1998.

[3] Hyun-Chul Kim and Zoubin Ghahramani. Bayesian Gaussian process classification with the EM-EP algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):1948–1959, 2006.

[4] R. M. Neal. Regression and classification using Gaussian process priors. Bayesian Statistics, 6:475–501, 1999.

[5] Matthias Seeger and Michael I. Jordan.
Sparse Gaussian process classification with multiple classes. Technical report, University of California, Berkeley, 2004.

[6] M. Opper and O. Winther. Gaussian process classification and SVM: Mean field results. In P. Bartlett, B. Schölkopf, D. Schuurmans, and A. Smola, editors, Advances in Large Margin Classifiers, pages 43–65. MIT Press, 2000.

[7] Daniel Hernández-Lobato and José Miguel Hernández-Lobato. Bayes machines for binary classification. Pattern Recognition Letters, 29(10):1466–1473, 2008.

[8] Hyun-Chul Kim and Zoubin Ghahramani. Outlier robust Gaussian process classification. In Structural, Syntactic, and Statistical Pattern Recognition, volume 5342 of Lecture Notes in Computer Science, pages 896–905. Springer Berlin / Heidelberg, 2008.

[9] Fabian L. Wauthier and Michael I. Jordan. Heavy-tailed process priors for selective shrinkage. In J. Lafferty, C. K. I. Williams, R. Zemel, J. Shawe-Taylor, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2406–2414. 2010.

[10] Thomas Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.

[11] A. Asuncion and D. J. Newman. UCI machine learning repository, 2007.

[12] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines, 2001.

[13] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, August 2006.

[14] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Adnan Darwiche and Nir Friedman, editors, Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 352–359. Morgan Kaufmann, 2002.

[15] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian process classification.
Journal of Machine Learning Research, 6:1679–1704, 2005.

[16] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.

[17] Marcel Van Gerven, Botond Cseke, Robert Oostenveld, and Tom Heskes. Bayesian source localization with the multivariate Laplace prior. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1901–1909, 2009.

[18] Matthias Seeger. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.