{"title": "Mind the Nuisance: Gaussian Process Classification using Privileged Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 837, "page_last": 845, "abstract": "The learning with privileged information setting has recently attracted a lot of attention within the machine learning community, as it allows the integration of additional knowledge into the training process of a classifier, even when this comes in the form of a data modality that is not available at test time. Here, we show that privileged information can naturally be treated as noise in the latent function of a Gaussian process classifier (GPC). That is, in contrast to the standard GPC setting, the latent function is not just a nuisance but a feature: it becomes a natural measure of confidence about the training data by modulating the slope of the GPC probit likelihood function. Extensive experiments on public datasets show that the proposed GPC method using privileged noise, called GPC+, improves over a standard GPC without privileged knowledge, and also over the current state-of-the-art SVM-based method, SVM+. Moreover, we show that advanced neural networks and deep learning methods can be compressed as privileged information.", "full_text": "Mind the Nuisance: Gaussian Process Classification using Privileged Noise

Daniel Hernández-Lobato, Universidad Autónoma de Madrid, Madrid, Spain, daniel.hernandez@uam.es
Viktoriia Sharmanska, IST Austria, Klosterneuburg, Austria, vsharman@ist.ac.at
Kristian Kersting, TU Dortmund, Dortmund, Germany, first.last@cs.tu-dortmund.de
Christoph H. Lampert, IST Austria, Klosterneuburg, Austria, chl@ist.ac.at
Novi Quadrianto, SMiLe CLiNiC, University of Sussex, Brighton, United Kingdom, n.quadrianto@sussex.ac.uk

Abstract

The learning with privileged information setting has recently attracted a lot of attention within the machine learning community, as it allows the integration of additional knowledge into the training process of a classifier, even when this comes in the form of a data modality that is not available at test time. Here, we show that privileged information can naturally be treated as noise in the latent function of a Gaussian process classifier (GPC). That is, in contrast to the standard GPC setting, the latent function is not just a nuisance but a feature: it becomes a natural measure of confidence about the training data by modulating the slope of the GPC probit likelihood function. Extensive experiments on public datasets show that the proposed GPC method using privileged noise, called GPC+, improves over a standard GPC without privileged knowledge, and also over the current state-of-the-art SVM-based method, SVM+. Moreover, we show that advanced neural networks and deep learning methods can be compressed as privileged information.

1 Introduction

Prior knowledge is a crucial component of any learning system, as without a form of prior knowledge learning is provably impossible [1]. Many forms of integrating prior knowledge into machine learning algorithms have been developed: as a preference for certain prediction functions over others, as a Bayesian prior over parameters, or as additional information about the samples in the training set used for learning a prediction function. In this work, we rely on the last of these setups, adopting Vapnik and Vashist's learning using privileged information (LUPI), see e.g. [2, 3]: we want to learn a prediction function, e.g.
a classi\ufb01er, and in addition to the main data modality that is to be used for\nprediction, the learning system has access to additional information about each training example.\nThis scenario has recently attracted considerable interest within the machine learning community\nbecause it re\ufb02ects well the increasingly relevant situation of learning as a service: an expert trains a\nmachine learning system for a speci\ufb01c task on request from a customer. Clearly, in order to achieve\nthe best result, the expert will use all the information available to him or her, not necessarily just the\n\n1\n\n\finformation that the system itself will have access to during its operation after deployment. Typical\nscenarios for learning as a service include visual inspection tasks, in which a classi\ufb01er makes real-\ntime decisions based on the input from its sensor, but at training time, additional sensors could be\nmade use of, and the processing time per training example plays less of a role. Similarly, a classi\ufb01er\nbuilt into a robot or mobile device operates under strong energy constraints, while at training time,\nenergy is less of a problem, so additional data can be generated and made use of. A third scenario is\nwhen the additional data is con\ufb01dential, as e.g. in health care applications. Speci\ufb01cally, a diagnosis\nsystem may be improved when more information is available at training time, e.g., speci\ufb01c blood\ntests, genetic sequences, or drug trials, for the subjects that form the training set. However, the same\ndata may not be available at test time, as obtaining it could be impractical, unethical, or illegal.\nWe propose a novel method for using privileged information based on the framework of Gaussian\nprocess classi\ufb01ers (GPCs). The privileged data enters the model in form of a latent variable, which\nmodulates the noise term of the GPC. 
Because the noise is integrated out before obtaining the \ufb01nal\nmodel, the privileged information is only required at training time, not at prediction time. The most\ninteresting aspect of the proposed model is that by this procedure, the in\ufb02uence of the privileged\ninformation becomes very interpretable: its role is to model the con\ufb01dence that the GPC has about\nany training example, which can be directly read off from the slope of the probit likelihood. Instances\nthat are easy to classify by means of their privileged data cause a faster increasing probit, which\nmeans the GP trusts the training example and tried to \ufb01t it well. Instances that are hard to classify\nresult in a slowly increasing slope, so that the GPC considers them less reliable and does not put a\nlot of effort in \ufb01tting their label well. Our experiments on multiple datasets show that this procedure\nleads not just to more interpretable models, but also to better prediction accuracy.\nRelated work: The LUPI framework was originally proposed by Vapnik and Vashist [2], inspired\nby a thought-experiment: when training a soft-margin SVM, what if an oracle would provide us\nwith the optimal values of the slack variables? As it turns out, this would actually provably reduce\nthe amount of training data needed, and consequently, Vapnik and Vashist proposed the SVM+\nclassi\ufb01er that uses privileged data to predict values for the slack variables, which led to improved\nperformance on several categorisation tasks and found applications, e.g., in \ufb01nance [4]. This setup\nwas subsequently improved, by a faster training algorithm [5], better theoretical characterisation [3],\nand it was generalised, e.g., to the learning to rank setting [6], clustering [7], metric learning [8] and\nmulti-class data classi\ufb01cation [9]. 
Recently, however, it was shown that the main effect of the SVM+\nprocedure is to assign a data-dependent weight to each training example in the SVM objective [10].\nThe proposed method, GPC+, constitutes the \ufb01rst Bayesian treatment of classi\ufb01cation using priv-\nileged information. The resulting privileged noise approach is related to input-modulated noise\ncommonly done in the regression task, where several Bayesian treatments of heteroscedastic regres-\nsion using GPs have been proposed. Since the predictive density and marginal likelihood are no\nlonger analytically tractable, most works deal with approximate inference, i.e., techniques such as\nMarkov Chain Monte Carlo [11], maximum a posteriori [12], and variational Bayes [13]. To our\nknowledge, however, there is no prior work on heteroscedastic classi\ufb01cation using GPs \u2014 we will\nelaborate the reasons in Section 2.1 \u2014 and this work is the \ufb01rst to develop approximate inference\nbased on expectation propagation for the heteroscedastic noise case in the context of classi\ufb01cation.\n\n2 GPC+: Gaussian process classi\ufb01cation with privileged noise\n\nFor self-consistency we \ufb01rst review the GPC model [14] with an emphasis on the noise-corrupted\nlatent Gaussian process view. Then, we show how to treat privileged information as heteroscedastic\nnoise in this process. An elegant aspect of this view is how the privileged noise is able to distinguish\nbetween easy and hard samples and to re-calibrate the uncertainty on the class label of each instance.\n\n2.1 Gaussian process classi\ufb01er with noisy latent process\nConsider a set of N input-output data points or samples D = {(x1, y1), . . . , (xN , yN )} \u2282 Rd \u00d7\n{0, 1}. 
Assume that the class label y_n of the sample x_n has been generated as y_n = I[\tilde{f}(x_n) \geq 0], where \tilde{f}(\cdot) is a noisy latent function and I[\cdot] is the Iverson bracket, i.e., I[P] = 1 when the condition P is true, and 0 otherwise. Induced by the label generation process, we adopt the following form of likelihood function for \tilde{f} = (\tilde{f}(x_1), \ldots, \tilde{f}(x_N))^T:

Pr(y | \tilde{f}, X = (x_1, \ldots, x_N)^T) = \prod_{n=1}^{N} Pr(y_n = 1 | x_n, \tilde{f}) = \prod_{n=1}^{N} I[\tilde{f}(x_n) \geq 0],   (1)

where \tilde{f}(x_n) = f(x_n) + \epsilon_n, with f(x_n) being the noise-free latent function. The noise term \epsilon_n is assumed to be independent and normally distributed with zero mean and variance \sigma^2, that is, \epsilon_n \sim N(\epsilon_n | 0, \sigma^2). To make inference about \tilde{f}(x_n), we need to specify a prior over this function. We proceed by imposing a zero-mean Gaussian process prior [14] on the noise-free latent function, that is, f(x_n) \sim GP(0, k(x_n, \cdot)), where k(\cdot, \cdot) is a positive-definite kernel function [15] that specifies prior properties of f(\cdot). A typical kernel function that allows for non-linear smooth functions is the squared exponential kernel k_f(x_n, x_m) = \theta \exp(-\frac{1}{2\ell^2} \|x_n - x_m\|^2), where \theta controls the prior amplitude of f(\cdot) and \ell controls its prior smoothness. The prior and the likelihood are combined using Bayes' rule to get the posterior of \tilde{f}(\cdot). Namely, Pr(\tilde{f} | X, y) = Pr(y | \tilde{f}, X) Pr(\tilde{f}) / Pr(y | X).

We can simplify the above noisy latent process view by integrating out the noise term \epsilon_n and writing down the individual likelihood at sample x_n in terms of the noise-free latent function f(\cdot). Namely,

Pr(y_n = 1 | x_n, f) = \int I[\tilde{f}(x_n) \geq 0] N(\epsilon_n | 0, \sigma^2) d\epsilon_n = \Phi_{(0, \sigma^2)}(f(x_n)),   (2)

where we have used that \tilde{f}(x_n) = f(x_n) + \epsilon_n, and \Phi_{(\mu, \sigma^2)}(\cdot) is a Gaussian cumulative distribution function (CDF) with mean \mu and variance \sigma^2. Typically the standard Gaussian CDF \Phi_{(0,1)}(\cdot) is used in the likelihood of (2). Coupled with a Gaussian process prior on the latent function f(\cdot), this results in the widely adopted noise-free latent Gaussian process view with probit likelihood.

The equivalence between a noise-free latent process with probit likelihood and a noisy latent process with step-function likelihood is widely known [14]. It is also widely accepted that the function \tilde{f}(\cdot) (or the function f(\cdot)) is a nuisance function, as we do not observe its value and its sole purpose is a convenient formulation of the model [14]. However, in this paper, we show that by using privileged information as the noise term, the latent function \tilde{f} now plays a crucial role: the latent function with privileged noise adjusts the slope transition in the Gaussian CDF to be faster or slower, corresponding to more certainty or more uncertainty about the samples in the original input space.

2.2 Introducing privileged information into the nuisance function

In the learning under privileged information (LUPI) paradigm [2], besides input data points {x_1, \ldots, x_N} and associated labels {y_1, \ldots, y_N}, we are given additional information x_n^* \in R^{d^*} about each training instance x_n. However, this privileged information will not be available for unseen test instances. Our goal is to exploit the additional data x^* to influence our choice of the latent function \tilde{f}(\cdot).
This needs to be done while making sure that the function does not directly use the privileged data as input, as it is simply not available at test time. We achieve this naturally by treating the privileged information as a heteroscedastic (input-dependent) noise in the latent process. Our classification model with privileged noise is then as follows:

Assume: \tilde{f}(x_n) = f(x_n) + \epsilon_n   (3)
Likelihood model: Pr(y_n = 1 | x_n, \tilde{f}) = I[\tilde{f}(x_n) \geq 0], where x_n \in R^d   (4)
Privileged noise model: \epsilon_n \sim_{i.i.d.} N(\epsilon_n | 0, z(x_n^*)) with z(x_n^*) = \exp(g(x_n^*)), where x_n^* \in R^{d^*}   (5)
GP prior model: f(x_n) \sim GP(0, k_f(x_n, \cdot)) and g(x_n^*) \sim GP(0, k_g(x_n^*, \cdot)).   (6)

In the above, the function \exp(\cdot) is needed to ensure positivity of the noise variance. The term k_g(\cdot, \cdot) is a positive-definite kernel function that specifies the prior properties of another latent function g(\cdot), which is evaluated in the privileged space x^*. Crucially, the noise term \epsilon_n is now heteroscedastic, that is, it has a different variance z(x_n^*) at each input point x_n. This is in contrast to the standard GPC approach discussed in Section 2.1, where the noise term is homoscedastic, \epsilon_n \sim N(\epsilon_n | 0, z(x_n^*) = \sigma^2). An input-dependent noise term is very common in regression tasks with continuous output values y_n \in R, resulting in heteroscedastic regression models, which have been proven more flexible in numerous applications, as already touched upon in the section on related work. However, to our knowledge, there is no prior work on heteroscedastic classification models. This is not surprising, as the nuisance view of the latent function renders a flexible input-dependent noise pointless.

Figure 1: Effects of privileged noise on the nuisance function. (Left) On synthetic data. Suppose for an input x_n, the latent function value is f(x_n) = 1. Now also assume that the associated privileged information x_n^* for the n-th data point deems the sample as difficult, say \exp(g(x_n^*)) = 5.0. Then the likelihood will reflect this uncertainty, Pr(y_n = 1 | f, g, x_n, x_n^*) = 0.58. In contrast, if the associated privileged information considers the sample as easy, say e.g. \exp(g(x_n^*)) = 0.5, the likelihood is very certain, Pr(y_n = 1 | f, g, x_n, x_n^*) = 0.98. (Right) On real data taken from our experiments in Sec. 4. The posterior means of the \Phi(\cdot) function (solid) and its 1-standard-deviation confidence interval (dash-dot) for easy (blue) and difficult (black) instances of the Chimpanzee v. Giant Panda binary task on the Animals with Attributes (AwA) dataset. (Best viewed in color.)

In the context of privileged information, heteroscedastic classification is a very sensible idea, which is best illustrated when investigating the effect of privileged information in the equivalent formulation of a noise-free latent process, i.e., when one integrates out the privileged input-dependent noise term:

Pr(y_n = 1 | x_n, x_n^*, f, g) = \int I[\tilde{f}(x_n) \geq 0] N(\epsilon_n | 0, \exp(g(x_n^*))) d\epsilon_n = \Phi_{(0, \exp(g(x_n^*)))}(f(x_n)) = \Phi_{(0,1)}(f(x_n) / \sqrt{\exp(g(x_n^*))}).   (7)

This equation shows that the privileged information adjusts the slope transition of the Gaussian CDF through the latent function g(\cdot). For difficult samples the latent function g(\cdot) will be high, the slope transition will be slower, and thus more uncertainty will be in the likelihood Pr(y_n = 1 | x_n, x_n^*, f, g). For easy samples, however, g(\cdot) will be low, the slope transition will be faster, and thus less uncertainty will be in the likelihood term.
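Both the noise-integration identity (2) and the slope modulation in (7) are easy to verify numerically. The sketch below is purely illustrative (the values f(x_n) = 1, \sigma^2 = 2.25 and the privileged-noise variances 0.5, 1 and 5 are chosen for the demonstration, not taken from the paper):

```python
import math
import random

def norm_cdf(x, mean=0.0, var=1.0):
    # Phi_(mean, var)(x), written with the error function
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

# --- Eq. (2): integrating out eps ~ N(0, sigma^2) from I[f + eps >= 0]
# equals the Gaussian CDF Phi_(0, sigma^2)(f). Check by Monte Carlo.
f_val, sigma2 = 1.0, 2.25
rng = random.Random(0)
mc = sum(f_val + rng.gauss(0.0, math.sqrt(sigma2)) >= 0.0
         for _ in range(200_000)) / 200_000
exact = norm_cdf(f_val, 0.0, sigma2)
print(abs(mc - exact) < 0.01)  # True up to Monte Carlo error

# --- Eq. (7): privileged noise rescales the probit slope,
# Pr(y_n = 1 | ...) = Phi_(0,1)( f(x_n) / sqrt(exp(g(x_n*))) )
def privileged_likelihood(f_value, noise_var):
    return norm_cdf(f_value / math.sqrt(noise_var))

easy = privileged_likelihood(1.0, 0.5)   # low privileged noise
plain = privileged_likelihood(1.0, 1.0)  # exp(g) = 1: standard probit
hard = privileged_likelihood(1.0, 5.0)   # high privileged noise
print(easy > plain > hard)  # True: difficult samples are pulled towards 0.5
```

Note that the likelihood of a difficult sample stays above 0.5 for f(x_n) > 0; high privileged noise flattens the probit towards indifference rather than flipping the label.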
This behaviour is illustrated in Figure 1. For non-informative samples in the privileged space, the value of g for those samples should be equal to a global noise value, as in a standard GPC. Thus, privileged information should in principle never hurt. Proving this theoretically is, however, an interesting and challenging research direction; experimentally, we indeed observe this behaviour in Section 4.

2.3 Posterior and prediction on test data

Define g = (g(x_1^*), \ldots, g(x_N^*))^T and X^* = (x_1^*, \ldots, x_N^*)^T. Given the likelihood Pr(y | X, X^*, f, g) = \prod_{n=1}^{N} Pr(y_n = 1 | f, g, x_n, x_n^*), with the individual term Pr(y_n | f, g, x_n, x_n^*) given in (7), and the Gaussian process priors on functions, the posterior for f and g is:

Pr(f, g | y, X, X^*) = Pr(y | X, X^*, f, g) Pr(f) Pr(g) / Pr(y | X, X^*),   (8)

where Pr(y | X, X^*) can be maximised with respect to a set of hyper-parameter values, such as the amplitude \theta and the smoothness \ell of the kernel functions [14]. For a previously unseen test point x_new \in R^d, the predictive distribution for its label y_new is given as:

Pr(y_new = 1 | y, X, X^*) = \int I[\tilde{f}(x_new) \geq 0] Pr(f_new | f) Pr(f, g | y, X, X^*) df dg df_new,   (9)

where Pr(f_new | f) is a Gaussian conditional distribution. We note that in (9) we do not consider the privileged information x_new^* associated with x_new. The interpretation is that we consider homoscedastic
This is a reasonable approach as there is no additional information for increasing\nor decreasing our con\ufb01dence in the newly observed data xnew. Finally, we predict the label for a test\npoint via Bayesian decision theory: the label being predicted is the one with the largest probability.\n\n3 Expectation propagation with numerical quadrature\n\nUnfortunately, as for most interesting Bayesian models, inference in the GPC+ model is very chal-\nlenging. Already in the homoscedastic case, the predictive density and marginal likelihood are\nnot tractable. Here, we therefore adapt Minka\u2019s expectation propagation (EP) [16] with numerical\nquadrature for approximate inference. Our choice is supported on the fact that EP is the preferred\nmethod for approximate inference in GPCs, in terms of accuracy and computational cost [17, 18].\nConsider the joint distribution of f, g and y, Pr(y|X, X\u2217, f , g)Pr(f )Pr(g), where Pr(f ) and Pr(g)\nn, f, g),\nwith Pr(yn|xn, x\u2217\nn, f, g) given by (7). EP approximates each non-normal factor in this distribution\nby an un-normalised bi-variate normal distribution of f and g (we assume independence between f\nand g). The only non-normal factors are those of the likelihood, which are approximated as:\n\nare Gaussian process priors and the likelihood Pr(y|X, X\u2217, f , g) equals(cid:81)N\nn, f, g) \u2248 \u03b3n(f, g) = znN (f (xn)|mf , vf )N (g(x\u2217\n\n(10)\nwhere the parameters with the super-script\nare to be found by EP. The posterior approximation Q\ncomputed by EP results from normalising with respect to f and g the EP approximate joint. 
That is,\nQ is obtained by replacing each likelihood factor by the corresponding approximate factor \u03b3n:\n\nn=1 Pr(yn|xn, x\u2217\n\nPr(yn|xn, x\u2217\n\nn)|mg, vg) ,\n\n(cid:89)N\n\n\u22121[\n\nn=1\n\nPr(f , g|X, X\u2217\n\nn(cid:48)(cid:54)=n \u03b3n(cid:48)\n\n\u03b3(f, g)]Pr(f )Pr(g) ,\n\n, y) \u2248 Q(f , g) := Z\n\nn, f, g)Qold/Zn||Q\n\nPr(f )Pr(g) with all variables different from f (xn) and g(x\u2217\n\n(11)\nwhere Z is a normalisation constant that approximates the model evidence, Pr(y|X, X\u2217). The\nnormal distribution belongs to the exponential family of probability distributions and is closed under\nthe product and division. It is hence possible to show that Q is the product of two multi-variate\nnormals [19]. The \ufb01rst normal approximates the posterior for f and the second the posterior for g.\nto minimise KL(cid:0)Pr(yn|xn, x(cid:63)\nEP tries to \ufb01x the parameters of \u03b3n so that it is similar to the exact factor Pr(yn|xn, x\u2217\nn, f, g) in\n(cid:104)(cid:81)\nregions of high posterior probability [16]. For this, EP iteratively updates each \u03b3n until convergence\n\n(cid:1), where Qold is a normal distribution proportional\n\nto\nn) marginalised out, Zn\nis simply a normalisation constant and KL(\u00b7||\u00b7) denotes the Kullback-Leibler divergence between\nprobability distributions. Assume Qnew is the distribution minimising the previous divergence. Then,\n\u03b3n \u221d Qnew/Qold and the parameter zn of \u03b3n is \ufb01xed to guarantee that \u03b3n integrates the same as\nthe exact factor with respect to Qold. The minimisation of the KL divergence involves matching\nn, f, g)Qold/Zn and Qnew.\nexpected suf\ufb01cient statistics (mean and variance) between Pr(yn|xn, x(cid:63)\nThese expectations can be obtained from the derivatives of log Zn with respect to the (natural)\nparameters of Qold [19]. Unfortunately, the computation of log Zn in closed form is intractable. 
We show here that it can be approximated by a one-dimensional quadrature. Denote by m_f, v_f, m_g and v_g the means and variances of Q^{old} for f(x_n) and g(x_n^*), respectively. Then,

Z_n = \int \Phi_{(0,1)}( y_n m_f / \sqrt{v_f + \exp(g(x_n^*))} ) N(g(x_n^*) | m_g, v_g) dg(x_n^*).   (12)

Thus, EP only requires five quadratures to update each \gamma_n: one to compute log Z_n and four extra quadratures to compute its derivatives with respect to m_f, v_f, m_g and v_g. After convergence, Q can be used to approximate predictive distributions, and the normalisation constant Z can be maximised to find good values for the model's hyper-parameters. In particular, it is possible to compute the gradient of Z with respect to the parameters of the Gaussian process priors for f and g [19]. An R language implementation of GPC+ using EP for approximate inference is found in the supplementary material.

4 Experiments

We investigate the performance of GPC+. To this aim we considered three types of binary classification tasks corresponding to different privileged information, using two real-world datasets: Attribute Discovery and Animals with Attributes. We detail these experiments in turn in the following sections.

Methods: We compared our proposed GPC+ method with the well-established LUPI method based on SVM, SVM+ [5]. As a reference, we also fit standard GP and SVM classifiers when learning on the original space R^d (GPC and SVM baselines). For all four methods, we used a squared exponential kernel with amplitude parameter \theta and smoothness parameter \ell. For simplicity, we set \theta = 1.0 in all cases. There are two hyper-parameters in GPC (the smoothness parameter \ell and the noise variance \sigma^2) and also two in GPC+ (the smoothness parameters \ell of kernel k_f(\cdot, \cdot) and of kernel k_g(\cdot, \cdot)).
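As an implementation aside, the EP update of Section 3 hinges on the one-dimensional integral (12), which any generic quadrature can approximate. Below is a minimal pure-Python sketch using a trapezoidal rule; the cavity parameters (m_f = 0.7, v_f = 0.4, m_g = 0, v_g = 1) and the ±8-standard-deviation integration range are arbitrary illustrative choices, not values from an actual EP run:

```python
import math

def std_norm_cdf(x):
    # Standard Gaussian CDF Phi_(0,1)(x)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_Zn(y, m_f, v_f, m_g, v_g, n_grid=2001, width=8.0):
    # Eq. (12): Z_n = int Phi( y*m_f / sqrt(v_f + exp(g)) ) N(g | m_g, v_g) dg,
    # approximated with a trapezoidal rule on [m_g - width*sd, m_g + width*sd]
    sd = math.sqrt(v_g)
    lo, hi = m_g - width * sd, m_g + width * sd
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        g = lo + i * h
        val = std_norm_cdf(y * m_f / math.sqrt(v_f + math.exp(g))) * norm_pdf(g, m_g, v_g)
        total += (0.5 if i in (0, n_grid - 1) else 1.0) * val
    return math.log(h * total)

# Illustrative cavity parameters (not from the paper)
Zn = math.exp(log_Zn(y=1, m_f=0.7, v_f=0.4, m_g=0.0, v_g=1.0))
print(Zn)  # a probability-like value in (0, 1)
```

As a sanity check, letting v_g shrink towards zero should collapse (12) to a single probit evaluation \Phi_{(0,1)}(y m_f / \sqrt{v_f + \exp(m_g)}), which the quadrature reproduces. The paper's EP code presumably uses a more careful quadrature; this sketch only shows the structure of the integral.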
In GPC\nand GPC+, we used type II-maximum likelihood for \ufb01nding all hyper-parameters. SVM has two\nknobs, i.e., smoothness and regularisation, and SVM+ has four knobs, two smoothness and two\nregularisation parameters. In SVM we used a grid search guided by cross-validation to set all hyper-\nparameters. However, this procedure was too expensive for \ufb01nding the best parameters in SVM+.\nThus, we used the performance on a separate validation set to guide the search. This means that we\ngive a competitive advantage to SVM+ over the other methods, which do not use the validation set.\nEvaluation metric: To evaluate the performance of each method we used the classi\ufb01cation error\nmeasured on an independent test set. We performed 100 repeats of all the experiments to get the\nbetter statistics of the performance and we report the mean and the standard deviation of the error.\n\n4.1 Attribute discovery dataset\n\nThe data set was collected from a website that aggregates product data from a variety of e-commerce\nsources and includes both images and associated textual descriptions [20]. The images and texts are\ngrouped into 4 broad shopping categories: bags, earrings, ties, and shoes. We used 1800 samples\nfrom this dataset. We generated 6 binary classi\ufb01cation tasks for each pair of the 4 classes with 200\nsamples for training, 200 samples for validation, and the rest of the samples for testing performance.\nNeural networks on texts as privileged information: We used images as the original domain and\ntexts as the privileged domain. This setting was also explored in [6]. However, we used a different\ndataset because textual descriptions of the images used in [6] are sparse and contain duplicates. More\nprecisely, we extracted more advanced text features instead of simple term frequency (TF) features.\nFor the images representation, we extracted SURF descriptors [21] and constructed a codebook of\n100 visual words using the k-means clustering. 
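The visual-word construction above (and the word-vector codebook described next) follows the standard bag-of-words recipe: cluster all local descriptors with k-means, then represent each image (or sentence) by the histogram of nearest-codeword assignments. A toy pure-Python sketch, with random 4-dimensional vectors standing in for SURF descriptors and a codebook of 10 words instead of 100 (all sizes here are illustrative placeholders):

```python
import random

def kmeans(points, k, n_iter=20, seed=0):
    # Plain Lloyd's algorithm: returns k centroids
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids

def bow_histogram(descriptors, centroids):
    # Assign each local descriptor to its nearest codeword; the normalised
    # counts form the fixed-length image/sentence representation
    k = len(centroids)
    hist = [0.0] * k
    for d in descriptors:
        j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(d, centroids[c])))
        hist[j] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

rng = random.Random(1)
descriptors = [tuple(rng.gauss(0, 1) for _ in range(4)) for _ in range(300)]  # toy "SURF" vectors
codebook = kmeans(descriptors, k=10)
hist = bow_histogram(descriptors[:50], codebook)
print(len(hist), round(sum(hist), 6))  # 10 1.0
```

The resulting histogram has the codebook size as its fixed dimensionality regardless of how many local descriptors an image produced, which is what makes the representation usable as GPC/SVM input.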
For the text representation, we extracted 200 dimen-\nsional continuous word-vectors using a neural network skip-gram architecture [22]1. To convert this\nword representation into a \ufb01xed-length sentence representation, we constructed a codebook of 100\nword-vectors using again k-means clustering. We note that a more elaborate approach to transform\nword to sentence or document features has recently been developed [23], and we are planning to\nexplore this in the future. We performed PCA for dimensionality reduction in the original and priv-\nileged domains and only kept the top 50 principal components. Finally, we standardised the data so\nthat each feature had zero mean and unit standard deviation.\nThe experimental results are summarised in Table 1. On average over 6 tasks, SVM with hinge loss\noutperforms GPC with probit likelihood. However, GPC+ signi\ufb01cantly improves over GPC provid-\ning the best results on average. This clearly shows that GPC+ is able to employ the neural network\ntextual representation as privileged information. In contrast, SVM+ produced the same result as\nSVM. We suspect this is due to the fact that that SVM has already shown strong performance on\nthe original image space coupled with the dif\ufb01culties of \ufb01nding the best values of the four hyper-\nparameters of SVM+. Keep in mind that in SVM+ we discretised the hyper-parameter search space\nover 625 (5 \u00d7 5 \u00d7 5 \u00d7 5) possible combination values and used a separate validation set to estimate\nthe resulting prediction performance.\n\n4.2 Animals with attributes (AwA) dataset\n\nThe dataset was collected by querying image search engines for each of the 50 animals categories\nwhich have complimentary high level descriptions of their semantic properties such as shape, colour,\nor habitat information among others [24]. The semantic attributes per animal class were retrieved\nfrom a prior psychological study. 
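The final preprocessing step used throughout (after the PCA projection to 50 components) is per-feature standardisation. A minimal pure-Python illustration on toy data, assuming no feature is constant:

```python
import math

def standardise(data):
    # Transform each feature (column) to zero mean and unit standard deviation
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / n) for j in range(d)]
    return [[(row[j] - means[j]) / stds[j] for j in range(d)] for row in data]

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
z = standardise(data)
col0 = [row[0] for row in z]
mean0 = sum(col0) / len(col0)
std0 = math.sqrt(sum(v * v for v in col0) / len(col0))
print(round(mean0, 6), round(std0, 6))  # 0.0 1.0
```

Standardising matters here because the squared exponential kernel uses a single smoothness \ell for all dimensions, so features on wildly different scales would otherwise dominate the distance computation.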
We focused on the 10 categories corresponding to the test set of this dataset, for which the predicted attributes are provided based on the probabilistic DAP model [24]. The 10 classes are: chimpanzee, giant panda, leopard, persian cat, pig, hippopotamus, humpback whale, raccoon, rat, seal, which have 6180 images associated in total. As in Section 4.1, and also in [6], we generated 45 binary classification tasks for each pair of the 10 classes, with 200 samples for training, 200 samples for validation, and the rest of the samples for testing the predictive performance.

1 https://code.google.com/p/word2vec/

Table 1: Average error rate in % (the lower the better) on the Attribute Discovery dataset over 100 repetitions. We used images as the original domain and the neural-network word-vector representation of the texts as the privileged domain. The best method for each binary task is highlighted in boldface. An average rank equal to one means that the corresponding method has the smallest error on the 6 tasks.

| Task | GPC | SVM | SVM+ | GPC+ (Ours) |
| bags v. earrings | 9.79±0.12 | 9.89±0.14 | 9.89±0.13 | **9.50±0.11** |
| bags v. ties | 10.36±0.16 | **9.44±0.16** | 9.47±0.13 | 10.03±0.15 |
| bags v. shoes | 9.66±0.13 | 9.31±0.12 | 9.29±0.14 | **9.22±0.11** |
| earrings v. ties | 10.84±0.14 | 11.15±0.16 | 11.11±0.16 | **10.56±0.13** |
| earrings v. shoes | 7.74±0.11 | 7.75±0.13 | 7.63±0.13 | **7.33±0.10** |
| ties v. shoes | 15.51±0.16 | **14.90±0.21** | 15.10±0.18 | 15.54±0.16 |
| average error on each task | 10.65±0.11 | 10.41±0.11 | 10.42±0.11 | 10.36±0.12 |
| average ranking | 3.0 | 2.7 | 2.5 | 1.8 |

Neural networks on images as privileged information: Deep learning methods have gained increased attention within the machine learning and computer vision community over recent years, due to their capability of extracting informative features and delivering strong predictive performance in many classification tasks. As such, we are interested in exploring the use of deep-learning-based features as privileged information, so that their predictive power can be used even if we do not have access to them at prediction time. We used the standard SURF features [21] with 2000 visual words as the original domain and the recently proposed DeCAF features [25], extracted from the activation of a deep convolutional network trained in a fully supervised fashion, as the privileged domain. The DeCAF features have 4096 dimensions. All features are provided with the AwA dataset2. We again performed PCA for dimensionality reduction in the original and privileged domains and only kept the top 50 principal components, as well as standardised the data.

Attributes as privileged information: Following the experimental setting of [6], we also used images as the original domain and attributes as the privileged domain. Images were represented by
Images were represented by\n2000 visual words based on SURF descriptors and attributes were in the form of 85 dimensional\npredicted attributes based on probabilistic binary classi\ufb01ers [24]. As previously, we also performed\nPCA and kept the top 50 principal components in the original domain and standardised the data.\nThe results of these experiments are shown in Figure 2 in terms of pairwise comparisons over 45\nbinary tasks between GPC+ and the main baselines, GPC and SVM+. The complete results with\nthe error of each method GPC, GPC+, SVM, and SVM+ on each problem are relegated to the\nsupplementary material. In contrast to the results on the attribute discovery dataset, on the AwA\ndataset it is clear that GPC outperforms SVM in almost all of the 45 binary classi\ufb01cation tasks\n(see the supplementary material). The average error of GPC over 4500 (45 tasks and 100 repeats\nper task) experiments is much lower than SVM. On the AwA dataset, SVM+ can take advantage\nof privileged information \u2013 be it deep belief DeCAF features or semantic attributes \u2013 and shows\nsigni\ufb01cant performance improvement over SVM. However, GPC+ still shows the best overall results\nand further improves the already strong performance of GPC. As illustrated in Figure 1 (right), the\nprivileged information modulates the slope of the probit likelihood function differently for easy\nand dif\ufb01cult examples: easy examples gain slope and hence importance whereas dif\ufb01cult ones lose\nimportance in the classi\ufb01cation.\nIn this dataset we analysed our experimental results using the\nmultiple dataset statistical comparison method described in [26]3. The results of the statistical tests\nare summarised in Figure 3. 
When DeCAF features are used as privileged information, there is statistical evidence supporting that GPC+ performs best among the four methods, while when the semantic attributes are used as privileged information, GPC+ still performs best, but there is not enough evidence to reject that GPC+ performs comparably to GPC.

²http://attributes.kyb.tuebingen.mpg.de
³Note that we are not able to use this method on the results of the attribute discovery dataset in Table 1 because the number of methods compared (i.e., 4) is almost equal to the number of tasks or datasets (i.e., 6).

Figure 2 (left panel: DeCAF as privileged; right panel: Attributes as privileged): Pairwise comparison of the proposed GPC+ method and the main baselines, shown via the relative difference of the error rates (top: GPC+ versus GPC; bottom: GPC+ versus SVM+). The lengths of the 45 bars correspond to the relative differences of the error rates over the 45 cases. Average error rates of each method on the AwA dataset across each of the 45 tasks are found in the supplementary material. (Best viewed in color.)

Figure 3 (left panel: DeCAF as privileged; right panel: Attributes as privileged): Average rank (the lower the better) of the four methods, and the critical distance for statistically significant differences (see [26]), on the AwA dataset. An average rank equal to one means that a particular method has the smallest error on all 45 tasks. Whenever the average ranks differ by more than the critical distance, there is statistical evidence (p-value < 10%) supporting a difference in the average ranks, and hence in the performance. We also link two methods with a solid line if they are not statistically different from each other (p-value > 10%). When the DeCAF features are used as privileged information, there is statistical evidence supporting that GPC+ performs best among the four methods considered.
When the attributes are used, GPC+ still performs best, but there is not enough evidence to reject that GPC+ performs comparably to GPC.

5 Conclusions and future work

We presented the first treatment of the learning with privileged information paradigm under the Gaussian process classification (GPC) framework, and called it GPC+. In GPC+, privileged information is used in the latent noise layer, resulting in a data-dependent modulation of the slope of the likelihood. The training time of GPC+ is about twice that of a standard Gaussian process classifier, because GPC+ must train two latent functions, f and g, instead of only one. Nevertheless, our results show that GPC+ is an effective way to use privileged information, which manifests itself in significantly better prediction accuracy. Furthermore, to our knowledge, this is the first time that a heteroscedastic noise term has been used to improve GPC. We have also shown that recent advances in continuous word-vector neural network representations [23] and deep convolutional networks for image representations [25] can be used as privileged information. For future work, we plan to extend the GPC+ framework to the multi-class case and to speed up computation by devising a quadrature-free expectation propagation method, similar to the ones in [27, 28].

Acknowledgement: D. Hernández-Lobato is supported by Dirección General de Investigación MCyT and by Consejería de Educación CAM (projects TIN2010-21575-C02-02, TIN2013-42351-P and S2013/ICE-2845). V. Sharmanska is funded by the European Research Council under ERC grant agreement no. 308036.

References

[1] D.H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341–1390, 1996.

[2] V. Vapnik and A. Vashist.
A new learning paradigm: Learning using privileged information. Neural Networks, 22:544–557, 2009.

[3] D. Pechyony and V. Vapnik. On the theory of learning with privileged information. In Advances in Neural Information Processing Systems (NIPS), pages 1894–1902, 2010.

[4] B. Ribeiro, C. Silva, A. Vieira, A. Gaspar-Cunha, and J.C. das Neves. Financial distress model prediction using SVM+. In International Joint Conference on Neural Networks (IJCNN), 2010.

[5] D. Pechyony and V. Vapnik. Fast optimization algorithms for solving SVM+. In Statistical Learning and Data Science, 2011.

[6] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In International Conference on Computer Vision (ICCV), 2013.

[7] J. Feyereisl and U. Aickelin. Privileged information for data clustering. Information Sciences, 194:4–23, 2012.

[8] S. Fouad, P. Tino, S. Raychaudhury, and P. Schneider. Incorporating privileged information through metric learning. IEEE Transactions on Neural Networks and Learning Systems, 24:1086–1098, 2013.

[9] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to transfer privileged information, 2014. arXiv:1410.0389 [cs.CV].

[10] M. Lapin, M. Hein, and B. Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.

[11] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems (NIPS), 1998.

[12] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine. Kernel conditional quantile estimation via reduction revisited. In International Conference on Data Mining (ICDM), 2009.

[13] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression.
In International Conference on Machine Learning (ICML), 2011.

[14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.

[15] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

[16] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.

[17] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.

[18] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.

[19] M. Seeger. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.

[20] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In European Conference on Computer Vision (ECCV), 2010.

[21] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110:346–359, 2008.

[22] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.

[23] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning (ICML), 2014.

[24] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:453–465, 2014.

[25] J. Donahue, Y. Jia, O.
Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.

[26] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

[27] J. Riihimäki, P. Jylänki, and A. Vehtari. Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood. Journal of Machine Learning Research, 14:75–109, 2013.

[28] D. Hernández-Lobato, J. M. Hernández-Lobato, and P. Dupont. Robust multi-class Gaussian process classification. In Advances in Neural Information Processing Systems (NIPS), 2011.