{"title": "PAC-Bayesian AUC classification and scoring", "book": "Advances in Neural Information Processing Systems", "page_first": 658, "page_last": 666, "abstract": "We develop a scoring and classification procedure based on the PAC-Bayesian approach and the AUC (Area Under Curve) criterion. We focus initially on the class of linear score functions. We derive PAC-Bayesian non-asymptotic bounds for two types of prior for the score parameters: a Gaussian prior, and a spike-and-slab prior; the latter makes it possible to perform feature selection. One important advantage of our approach is that it is amenable to powerful Bayesian computational tools. We derive in particular a Sequential Monte Carlo algorithm, as an efficient method which may be used as a gold standard, and an Expectation-Propagation algorithm, as a much faster but approximate method. We also extend our method to a class of non-linear score functions, essentially leading to a nonparametric procedure, by considering a Gaussian process prior.", "full_text": "PAC-Bayesian AUC classi\ufb01cation and scoring\n\nJames Ridgway\u2217\n\nCREST and CEREMADE University Dauphine\n\njames.ridgway@ensae.fr\n\nPierre Alquier\nCREST (ENSAE)\n\npierre.alquier@ucd.ie\n\nNicolas Chopin\n\nCREST (ENSAE) and HEC Paris\nnicolas.chopin@ensae.fr\n\nFeng Liang\n\nUniversity of Illinois at Urbana-Champaign\n\nliangf@illinois.edu\n\nAbstract\n\nWe develop a scoring and classi\ufb01cation procedure based on the PAC-Bayesian ap-\nproach and the AUC (Area Under Curve) criterion. We focus initially on the class\nof linear score functions. We derive PAC-Bayesian non-asymptotic bounds for\ntwo types of prior for the score parameters: a Gaussian prior, and a spike-and-slab\nprior; the latter makes it possible to perform feature selection. One important ad-\nvantage of our approach is that it is amenable to powerful Bayesian computational\ntools. 
We derive in particular a Sequential Monte Carlo algorithm, as an efficient method which may be used as a gold standard, and an Expectation-Propagation algorithm, as a much faster but approximate method. We also extend our method to a class of non-linear score functions, essentially leading to a nonparametric procedure, by considering a Gaussian process prior.

1 Introduction

Bipartite ranking (scoring) amounts to ranking (scoring) data with binary labels. An important problem in its own right, bipartite ranking is also an elegant way to formalise classification: once a score function has been estimated from the data, classification reduces to choosing a particular threshold, which determines the class assigned to each data-point, according to whether its score is above or below that threshold. It is convenient to choose that threshold only once the score has been estimated, so as to get finer control of the false negative and false positive rates; this is easily achieved by plotting the ROC (Receiver operating characteristic) curve.

A standard optimality criterion for scoring is AUC (Area Under Curve), which measures the area under the ROC curve. AUC is appealing for at least two reasons. First, maximising AUC is equivalent to minimising the L1 distance between the estimated score and the optimal score. Second, under mild conditions, Cortes and Mohri [2003] show that AUC for a score s equals the probability that s(X−) < s(X+) for X− (resp. X+) a random draw from the negative (resp. positive) class. Yan et al. [2003] observed that AUC-based classification handles skewed classes (say, the positive class is much larger than the other) much better than standard classifiers, because it enforces a small score for all members of the negative class (again assuming the negative class is the smaller one).

One practical issue with AUC maximisation is that the empirical version of AUC is not a continuous function. 
One way to address this problem is to “convexify” this function, and study the properties of the so-obtained estimators [Clémençon et al., 2008a]. We follow instead the PAC-Bayesian approach in this paper, which consists of using a random estimator sampled from a pseudo-posterior distribution that penalises exponentially the (in our case) AUC risk. It is well known [see e.g. the monograph of Catoni, 2007] that the PAC-Bayesian approach comes with a set of powerful technical tools to establish non-asymptotic bounds; the first part of the paper derives such bounds. A second advantage of this approach, as we show in the second part of the paper, is that it is amenable to powerful Bayesian computational tools, such as Sequential Monte Carlo and Expectation Propagation.

∗ http://www.crest.fr/pagesperso.php?user=3328

2 Theoretical bounds from the PAC-Bayesian Approach

2.1 Notations

The data D consist of the realisation of n IID (independent and identically distributed) pairs (Xi, Yi) with distribution P, taking values in R^d × {−1, 1}. Let n+ = Σ_{i=1}^n 1{Yi = +1} and n− = n − n+. For a score function s: R^d → R, the AUC risk and its empirical counterpart may be defined as:

R(s) = P_{(X,Y),(X′,Y′)∼P} [ {s(X) − s(X′)}(Y − Y′) < 0 ],

Rn(s) = (1 / (n(n − 1))) Σ_{i≠j} 1[ {s(Xi) − s(Xj)}(Yi − Yj) < 0 ].

Let σ(x) = E(Y|X = x), R̄ = R(σ) and R̄n = Rn(σ). It is well known that σ is the score that minimises R(s), i.e. R(s) ≥ R̄ = R(σ) for any score s.

The results of this section apply to the class of linear scores, s_θ(x) = ⟨θ, x⟩, where ⟨θ, x⟩ = θᵀx denotes the inner product. 
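As an illustration of the definitions above, the empirical AUC risk Rn(s) can be computed directly with an O(n²) pairwise loop that matches the definition term by term (a sketch for illustration only; function and variable names are ours):

```python
import numpy as np

def empirical_auc_risk(scores, labels):
    # R_n(s) = (1 / (n(n-1))) * sum over ordered pairs i != j of
    # 1{ (s(X_i) - s(X_j)) (Y_i - Y_j) < 0 },  with labels in {-1, +1}
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    n = len(y)
    discordant = 0
    for i in range(n):
        for j in range(n):
            if i != j and (s[i] - s[j]) * (y[i] - y[j]) < 0:
                discordant += 1
    return discordant / (n * (n - 1))

# a perfectly ranked toy sample has zero empirical AUC risk
print(empirical_auc_risk([0.1, 0.2, 0.8, 0.9], [-1, -1, 1, 1]))
```

Note that pairs sharing the same label contribute zero, so only positive-negative pairs matter, consistent with the probabilistic interpretation of AUC recalled in the introduction.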
Abusing notation, let R(θ) = R(s_θ), Rn(θ) = Rn(s_θ), and, for a given prior density π_ξ(θ) that may depend on some hyperparameter ξ ∈ Ξ, define the Gibbs posterior density (or pseudo-posterior) as

π_{ξ,γ}(θ|D) := π_ξ(θ) exp{−γ Rn(θ)} / Z_{ξ,γ}(D),   Z_{ξ,γ}(D) = ∫_{R^d} π_ξ(θ̃) exp{−γ Rn(θ̃)} dθ̃

for γ > 0. Both the prior and posterior densities are defined with respect to the Lebesgue measure over R^d.

2.2 Assumptions and general results

Our general results require the following assumptions.

Definition 2.1 We say that Assumption Dens(c) is satisfied for c > 0 if

P(⟨X1 − X2, θ⟩ ≥ 0, ⟨X1 − X2, θ′⟩ ≤ 0) ≤ c ‖θ − θ′‖

for any θ and θ′ ∈ R^d such that ‖θ‖ = ‖θ′‖ = 1. This is a mild assumption, which holds for instance as soon as (X1 − X2)/‖X1 − X2‖ admits a bounded probability density; see the supplement.

Definition 2.2 (Mammen & Tsybakov margin assumption) We say that Assumption MA(κ, C) is satisfied for κ ∈ [1, +∞] and C ≥ 1 if

E[(q^θ_{1,2})²] ≤ C [R(θ) − R̄]^{1/κ}

where q^θ_{i,j} = 1{⟨θ, Xi − Xj⟩(Yi − Yj) < 0} − 1{[σ(Xi) − σ(Xj)](Yi − Yj) < 0} − R(θ) + R̄. This assumption was introduced for classification by Mammen and Tsybakov [1999], and used for ranking by Clémençon et al. [2008b] and Robbiano [2013] (see also a nice discussion in Lecué [2007]). The larger κ, the less restrictive MA(κ, C). 
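Concretely, the pseudo-posterior just defined can be explored with a minimal random-walk Metropolis sketch for a linear score and an independent Gaussian prior (illustrative only; the tuning values and toy data are our assumptions, and the paper's actual algorithms are the SMC and EP procedures of Section 3):

```python
import numpy as np

rng = np.random.default_rng(0)

def rn(theta, X, y):
    # empirical AUC risk R_n(theta) for the linear score s_theta(x) = <theta, x>
    s = X @ theta
    pair = (s[:, None] - s[None, :]) * (y[:, None] - y[None, :])
    n = len(y)
    return np.sum(pair < 0) / (n * (n - 1))

def log_target(theta, X, y, gamma, var):
    # log pi_xi(theta) - gamma * R_n(theta), up to the constant -log Z_{xi,gamma}(D)
    return -0.5 * theta @ theta / var - gamma * rn(theta, X, y)

def rw_metropolis(X, y, gamma, var, n_iter=1000, step=0.5):
    theta = np.zeros(X.shape[1])
    lp = log_target(theta, X, y, gamma, var)
    draws = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_target(prop, X, y, gamma, var)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept / reject
            theta, lp = prop, lp_prop
        draws.append(theta.copy())
    return np.array(draws)

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
draws = rw_metropolis(X, y, gamma=20.0, var=1.0)
```

Since Rn is piecewise constant, the target density is discontinuous in θ, which is precisely why the paper favours SMC and EP over plain MCMC in higher dimensions.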
In fact, MA(∞, C) is always satisfied for C = 4. For a noiseless classification task (i.e. σ(Xi)Yi ≥ 0 almost surely), R̄ = 0,

E[(q^θ_{1,2})²] = Var(q^θ_{1,2}) = E[1{⟨θ, X1 − X2⟩(Y1 − Y2) < 0}] = R(θ) − R̄

and MA(1, 1) holds. More generally, MA(1, C) is satisfied as soon as the noise is small; see the discussion in Robbiano [2013] (Proposition 5, p. 1256) for a formal statement. From now on, we focus on either MA(1, C) or MA(∞, C), C ≥ 1. It is possible to prove convergence under MA(κ, 1) for a general κ ≥ 1, but at the price of complications regarding the choice of γ; see Catoni [2007], Alquier [2008] and Robbiano [2013].

We use the classical PAC-Bayesian methodology initiated by McAllester [1998] and Shawe-Taylor and Williamson [1997] (see Alquier [2008]; Catoni [2007] for a complete survey and more recent advances) to get the following results. Proofs of these and forthcoming results may be found in the supplement. Let K(ρ, π) denote the Kullback–Leibler divergence, K(ρ, π) = ∫ ρ(dθ) log{dρ/dπ(θ)} if ρ ≪ π, ∞ otherwise, and denote M¹₊ the set of probability distributions ρ(dθ).

Lemma 2.1 Assume that MA(1, C) holds with C ≥ 1. For any fixed γ with 0 < γ ≤ (n − 1)/(8C), for any ε > 0, with probability at least 1 − ε on the drawing of the data D,

∫ R(θ) π_{ξ,γ}(θ|D) dθ − R̄ ≤ 2 inf_{ρ∈M¹₊} { ∫ R(θ) ρ(dθ) − R̄ + 2 [K(ρ, π) + log(4/ε)] / γ }.

Lemma 2.2 Assume MA(∞, C) with C ≥ 1. 
For any fixed γ with 0 < γ ≤ (n − 1)/8, for any ε > 0, with probability 1 − ε on the drawing of D,

∫ R(θ) π_{ξ,γ}(θ|D) dθ − R̄ ≤ inf_{ρ∈M¹₊} { ∫ R(θ) ρ(dθ) − R̄ + 2 [K(ρ, π) + log(2/ε)] / γ + 16γ/(n − 1) }.

Both lemmas bound the expected risk excess, for a random estimator of θ generated from π_{ξ,γ}(θ|D).

2.3 Independent Gaussian Prior

We now specialise these results to the prior density π_ξ(θ) = ∏_{i=1}^d φ(θi; 0, ϑ), i.e. a product of independent Gaussian distributions N(0, ϑ); ξ = ϑ in this case.

Theorem 2.3 Assume MA(1, C), C ≥ 1, Dens(c), c > 0, and take ϑ = (2/d)(1 + 1/(n²d)), γ = (n − 1)/(8C); then there exists a constant α = α(c, C, d) such that for any ε > 0, with probability 1 − ε,

∫ R(θ) π_γ(θ|D) dθ − R̄ ≤ 2 inf_{θ0} {R(θ0) − R̄} + α [d log(n) + log(4/ε)] / (n − 1).

Theorem 2.4 Assume MA(∞, C), C ≥ 1, Dens(c), c > 0, and take ϑ = (2/d)(1 + 1/(n²d)), γ = C √(dn log(n)); then there exists a constant α = α(c, C, d) such that for any ε > 0, with probability 1 − ε,

∫ R(θ) π_γ(θ|D) dθ − R̄ ≤ inf_{θ0} {R(θ0) − R̄} + α [√(d log(n)) + log(2/ε)] / √n.

The proof of these results is provided in the supplementary material. It is known that, under MA(κ, C), the rate (d/n)^{κ/(2κ−1)} is minimax-optimal for classification problems, see Lecué [2007]. Following Robbiano [2013] we conjecture that this rate is also optimal for ranking problems.

2.4 Spike and slab prior for feature selection

The independent Gaussian prior considered in the previous section is a natural choice, but it does not accommodate sparsity, that is, the possibility that only a small subset of the components of Xi actually determine the membership to either class. For sparse scenarios, one may use the spike and slab prior of Mitchell and Beauchamp [1988], George and McCulloch [1993],

π_ξ(θ) = ∏_{i=1}^d [ p φ(θi; 0, v1) + (1 − p) φ(θi; 0, v0) ]

with ξ = (p, v0, v1) ∈ [0, 1] × (R₊)², and v0 ≪ v1, for which we obtain the following result. Note that ‖θ‖0 is the number of non-zero coordinates of θ ∈ R^d.

Theorem 2.5 Assume MA(1, C) holds with C ≥ 1, Dens(c) holds with c > 0, and take p = 1 − exp(−1/d), v0 ≤ 1/(2nd log(d)), and γ = (n − 1)/(8C). Then there is a constant α = α(C, v1, c) such that for any ε > 0, with probability at least 1 − ε on the drawing of the data D,

∫ R(θ) π_γ(dθ|D) − R̄ ≤ 2 inf_{θ0} { R(θ0) − R̄ + α [‖θ0‖0 log(nd) + log(4/ε)] / (2(n − 1)) }.

Compared to Theorem 2.3, the bound above increases logarithmically rather than linearly in d, and depends explicitly on ‖θ0‖0, the sparsity of θ0. This suggests that the spike and slab prior should lead to better performance than the Gaussian prior in sparse scenarios. The rate ‖θ0‖0 log(d)/n is the same as the one obtained in sparse regression, see e.g.
Bühlmann and van de Geer [2011].

Finally, note that if v0 → 0, we recover the more standard prior which assigns a point mass at zero to every component. However this leads to a pseudo-posterior which is a mixture of 2^d components that mix Dirac masses and continuous distributions, and which is thus more difficult to approximate (although see the related remark in Section 3.4 for Expectation-Propagation).

3 Practical implementation of the PAC-Bayesian approach

3.1 Choice of hyper-parameters

Theorems 2.3, 2.4, and 2.5 propose specific values for the hyper-parameters γ and ξ, but these values depend on some unknown constant C. Two data-driven ways to choose γ and ξ are (i) cross-validation (which we will use for γ), and (ii) (pseudo-)evidence maximisation (which we will use for ξ). The latter may be justified from intermediate results of our proofs in the supplement, which provide an empirical bound on the expected risk:

∫ R(θ) π_{ξ,γ}(θ|D) dθ − R̄ ≤ Ψ_{γ,n} inf_{ρ∈M¹₊} ( ∫ Rn(θ) ρ(dθ) − R̄n + 2 [K(ρ, π) + log(2/ε)] / γ )

with Ψ_{γ,n} ≤ 2. The right-hand side is minimised at ρ(dθ) = π_{ξ,γ}(θ|D) dθ, and the so-obtained bound is −Ψ_{γ,n} log(Z_{ξ,γ}(D))/γ plus constants. Minimising the upper bound with respect to the hyper-parameter ξ is therefore equivalent to maximising log Z_{ξ,γ}(D) with respect to ξ. This is of course akin to the empirical Bayes approach that is commonly used in probabilistic machine learning. 
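As a toy illustration of this empirical-Bayes strategy (a sketch under simplifying assumptions: plain Monte Carlo over prior draws rather than the SMC or EP estimates of Z used in the paper, and an arbitrary toy dataset), one may pick the prior variance ϑ on a grid by maximising a Monte Carlo estimate of log Z_{ξ,γ}(D):

```python
import numpy as np

rng = np.random.default_rng(1)

def rn(theta, X, y):
    # empirical AUC risk for the linear score <theta, x>
    s = X @ theta
    pair = (s[:, None] - s[None, :]) * (y[:, None] - y[None, :])
    n = len(y)
    return np.sum(pair < 0) / (n * (n - 1))

def log_evidence(X, y, gamma, var, n_draws=500):
    # Z_{xi,gamma}(D) = E_{theta ~ N(0, var I)}[ exp(-gamma * R_n(theta)) ],
    # estimated by plain Monte Carlo, with a log-sum-exp trick for stability
    d = X.shape[1]
    thetas = np.sqrt(var) * rng.standard_normal((n_draws, d))
    a = np.array([-gamma * rn(t, X, y) for t in thetas])
    m = a.max()
    return m + np.log(np.mean(np.exp(a - m)))

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
grid = [0.1, 1.0, 10.0]
best_var = max(grid, key=lambda v: log_evidence(X, y, gamma=10.0, var=v))
```

Because exp(−γ Rn(θ)) ≤ 1, every log-evidence estimate is non-positive; in realistic dimensions this naive estimator degrades quickly, which motivates the SMC and EP estimates of Z described below.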
Regarding γ, the minimisation is more cumbersome, because of the dependence on the log(2/ε) term and on Ψ_{n,γ}, which is why we recommend cross-validation instead.

It seems noteworthy that, beside Alquier and Biau [2013], very few papers discuss the practical implementation of PAC-Bayes, beyond some brief mention of MCMC (Markov chain Monte Carlo). However, estimating the normalising constant of a target density simulated with MCMC is notoriously difficult. In addition, even if one decides to fix the hyperparameters to some arbitrary value, MCMC may become slow and difficult to calibrate if the dimension of the sampling space becomes large. This is particularly true if the target does not (as in our case) have some specific structure that makes it possible to implement Gibbs sampling. The next two sections discuss two efficient approaches that make it possible to approximate both the pseudo-posterior π_{ξ,γ}(θ|D) and its normalising constant, and also to perform cross-validation with little overhead.

3.2 Sequential Monte Carlo

Given the particular structure of the pseudo-posterior π_{ξ,γ}(θ|D), a natural approach to simulate from π_{ξ,γ}(θ|D) is to use tempering SMC [Sequential Monte Carlo, Del Moral et al., 2006]: that is, define a certain sequence γ0 = 0 < γ1 < . . .
< γT, start by sampling from the prior π_ξ(θ), then apply successive importance sampling steps, from π_{ξ,γ_{t−1}}(θ|D) to π_{ξ,γ_t}(θ|D), leading to importance weights proportional to:

π_{ξ,γ_t}(θ|D) / π_{ξ,γ_{t−1}}(θ|D) ∝ exp{−(γ_t − γ_{t−1}) Rn(θ)}.

When the importance weights become too skewed, one rejuvenates the particles through a resampling step (draw particles randomly with replacement, with probability proportional to the weights) and a move step (move particles according to a certain MCMC kernel).

One big advantage of SMC is that it is very easy to make it fully adaptive. For the choice of the successive γ_t, we follow Jasra et al. [2007] in solving numerically (1) in order to impose that the effective sample size has a fixed value. This ensures that the degeneracy of the weights always remains under a certain threshold. For the MCMC kernel, we use a Gaussian random walk Metropolis step, calibrated on the covariance matrix of the resampled particles. See Algorithm 1 for a summary.

Algorithm 1 Tempering SMC
Input: N (number of particles), τ ∈ (0, 1) (ESS threshold), κ > 0 (random walk tuning parameter)
Init.: Sample θ^i_0 ∼ π_ξ(θ) for i = 1 to N; set t ← 1, γ0 = 0, Z0 = 1.
Loop:
a. Solve in γ_t the equation

{ Σ_{i=1}^N w_t(θ^i_{t−1}) }² / Σ_{i=1}^N { w_t(θ^i_{t−1}) }² = τN,   w_t(θ) = exp[−(γ_t − γ_{t−1}) Rn(θ)]   (1)

using bisection search. If γ_t ≥ γ_T, set Z_T = Z_{t−1} × { (1/N) Σ_{i=1}^N w_t(θ^i_{t−1}) }, and stop.
b. Resample: for i = 1 to N, draw A^i_t in 1, . . . , N so that P(A^i_t = j) = w_t(θ^j_{t−1}) / Σ_{k=1}^N w_t(θ^k_{t−1}); see Algorithm 1 in the supplement.
c. Sample θ^i_t ∼ M_t(θ^{A^i_t}_{t−1}, dθ) for i = 1 to N, where M_t is an MCMC kernel that leaves π_t invariant; see Algorithm 2 in the supplement for an instance of such a kernel, which takes as an input S = κ Σ̂, where Σ̂ is the covariance matrix of the θ^{A^i_t}_{t−1}.
d. Set Z_t = Z_{t−1} × { (1/N) Σ_{i=1}^N w_t(θ^i_{t−1}) }.

In our context, tempering SMC brings two extra advantages: it makes it possible to obtain samples from π_{ξ,γ}(θ|D) for a whole range of values of γ, rather than a single value; and it provides an approximation of Z_{ξ,γ}(D) for the same range of γ values, through the quantity Z_t defined in Algorithm 1.

3.3 Expectation-Propagation (Gaussian prior)

The SMC sampler outlined in the previous section works fairly well, and we will use it as a gold standard in our simulations. However, as any other Monte Carlo method, it may be too slow for large datasets. We now turn our attention to EP [Expectation-Propagation, Minka, 2001], a general framework to derive fast approximations to target distributions (and their normalising constants). First note that the pseudo-posterior may be rewritten as:

π_{ξ,γ}(θ|D) = (1/Z_{ξ,γ}(D)) π_ξ(θ) × ∏_{i,j} f_ij(θ),   f_ij(θ) = exp[−γ′ 1{⟨θ, Xi − Xj⟩ < 0}]

where γ′ = γ/(n+n−), and the product is over all (i, j) such that Yi = 1, Yj = −1. EP generates an approximation of this target distribution based on the same factorisation:

q(θ) ∝ q0(θ) ∏_{i,j} q_ij(θ),   q_ij(θ) = exp{−(1/2) θᵀ Q_ij θ + r_ijᵀ θ}.

We consider in this section the case where the prior is Gaussian, as in Section 2.3. 
Then one may set q0(θ) = π_ξ(θ). The approximating factors are un-normalised Gaussian densities (under a natural parametrisation), leading to an overall approximation that is also Gaussian, but other types of exponential family parametrisations may be considered; see the next section and Seeger [2005]. EP updates each site q_ij iteratively (that is, it updates the parameters Q_ij and r_ij), conditional on all the other sites, by matching the moments of q with those of the hybrid distribution

h_ij(θ) ∝ q(θ) f_ij(θ) / q_ij(θ) ∝ q0(θ) f_ij(θ) ∏_{(k,l)≠(i,j)} q_kl(θ)

where again the product is over all (k, l) such that Yk = 1, Yl = −1, and (k, l) ≠ (i, j).

We refer to the supplement for a precise algorithmic description of our EP implementation. We highlight the following points. First, the site update is particularly simple in our case:

h_ij(θ) ∝ exp{θᵀ r^h_ij − (1/2) θᵀ Q^h_ij θ} exp[−γ′ 1{⟨θ, Xi − Xj⟩ < 0}],

with r^h_ij = Σ_{(k,l)≠(i,j)} r_kl and Q^h_ij = Σ_{(k,l)≠(i,j)} Q_kl, which may be interpreted as: θ conditional on T(θ) = ⟨θ, Xi − Xj⟩ has a (d − 1)-dimensional Gaussian distribution, and the distribution of T(θ) is that of a one-dimensional Gaussian penalised by a step function. The two first moments of this particular hybrid may therefore be computed exactly, and in O(d²) time, as explained in the supplement. The updates can be performed efficiently using the fact that the linear combination ⟨θ, Xi − Xj⟩ is a one-dimensional Gaussian. For our numerical experiments we used a parallel version of EP [Van Gerven et al., 2010]. 
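To make this concrete, the first two moments of a one-dimensional Gaussian penalised by a step function admit the following closed form (a standalone sketch using only the normal pdf/cdf; here m and s stand for the cavity mean and standard deviation of T(θ), and gp for γ′, all names being ours):

```python
import math

def norm_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def hybrid_moments(m, s, gp):
    # moments of t ~ N(m, s^2) reweighted by exp(-gp * 1{t < 0}):
    # the negative half-line gets weight a = exp(-gp), the positive one weight 1
    a = math.exp(-gp)
    u = m / s
    z = norm_cdf(u) + a * norm_cdf(-u)                     # normalising constant
    m1 = (m * norm_cdf(u) + s * norm_pdf(u)
          + a * (m * norm_cdf(-u) - s * norm_pdf(u))) / z  # E[t]
    m2 = ((m * m + s * s) * norm_cdf(u) + m * s * norm_pdf(u)
          + a * ((m * m + s * s) * norm_cdf(-u) - m * s * norm_pdf(u))) / z  # E[t^2]
    return m1, m2

# gp = 0 leaves the Gaussian untouched: the moments are (m, m^2 + s^2)
print(hybrid_moments(0.3, 1.2, 0.0))
```

These are the standard truncated-Gaussian partial moments, combined with weights 1 and exp(−γ′) on the two half-lines; they are what a site update needs for moment matching.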
The complexity of our EP implementation is O(n+n−d² + d³). Second, EP offers at no extra cost an approximation of the normalising constant Z_{ξ,γ}(D) of the target π_{ξ,γ}(θ|D); in fact, one may even obtain derivatives of this approximated quantity with respect to hyper-parameters. See again the supplement for more details. Third, in the EP framework, cross-validation may be interpreted as dropping all the factors q_ij that depend on a given data-point Xi from the global approximation q. This makes it possible to implement cross-validation at little extra cost [Opper and Winther, 2000].

3.4 Expectation-Propagation (spike and slab prior)

To adapt our EP algorithm to the spike and slab prior of Section 2.4, we introduce latent variables Z_k ∈ {0, 1} which “choose” for each component θ_k whether it comes from the slab or from the spike, and we consider the joint target

π_{ξ,γ}(θ, z|D) ∝ { ∏_{k=1}^d B(z_k; p) N(θ_k; 0, v_{z_k}) } exp[ −(γ/(n+n−)) Σ_{i,j} 1{⟨θ, Xi − Xj⟩ < 0} ].

On top of the n+n− Gaussian sites defined in the previous section, we add a product of d sites to approximate the prior. Following Hernandez-Lobato et al. [2013], we use

q_k(θ_k, z_k) = exp{ z_k log(p_k/(1 − p_k)) − (1/2) θ_k² u_k + v_k θ_k },

that is, an (un-normalised) product of an independent Bernoulli distribution for z_k, times a Gaussian distribution for θ_k. Again the site update is fairly straightforward, and may be implemented in O(d²) time. See the supplement for more details. Another advantage of this formulation is that we obtain a Bernoulli approximation of the marginal pseudo-posterior π_{ξ,γ}(z_i = 1|D) to use in feature selection. 
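For intuition about the respective roles of v0 and v1, the joint prior on (z, θ) can be sampled directly (an illustrative sketch; the particular values of p, v0, v1 below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_spike_slab(d, p, v0, v1, n_draws):
    # z_k ~ Bernoulli(p);  theta_k | z_k ~ N(0, v1) if z_k = 1 (slab),
    #                                      N(0, v0) if z_k = 0 (spike)
    z = rng.uniform(size=(n_draws, d)) < p
    sd = np.where(z, np.sqrt(v1), np.sqrt(v0))
    return z, sd * rng.standard_normal((n_draws, d))

d = 50
z, theta = sample_spike_slab(d, p=1 - np.exp(-1.0 / d), v0=1e-6, v1=1.0, n_draws=4000)
```

With p = 1 − exp(−1/d), as in Theorem 2.5, only a couple of coordinates per draw come from the slab on average; spike coordinates are numerically indistinguishable from zero when v0 is tiny, which is the mechanism behind the sparsity of the resulting score.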
Interestingly, taking v0 to be exactly zero also yields stable results, corresponding to the case where the spike is a Dirac mass.

4 Extension to non-linear scores

To extend our methodology to non-linear score functions, we consider the pseudo-posterior

π_{ξ,γ}(ds|D) ∝ π_ξ(ds) exp{ −(γ/(n+n−)) Σ_{i∈D+, j∈D−} 1{s(Xi) − s(Xj) < 0} }

where π_ξ(ds) is some prior probability measure with respect to an infinite-dimensional functional class. Let s_i = s(X_i), s_{1:n} = (s_1, . . . , s_n) ∈ R^n, and assume that π_ξ(ds) is a GP (Gaussian process) prior associated to some kernel k_ξ(x, x′); then, using a standard trick in the GP literature [Rasmussen and Williams, 2006], one may derive the marginal (posterior) density (with respect to the n-dimensional Lebesgue measure) of s_{1:n} as

π_{ξ,γ}(s_{1:n}|D) ∝ N_d(s_{1:n}; 0, K_ξ) exp{ −(γ/(n+n−)) Σ_{i∈D+, j∈D−} 1{s_i − s_j < 0} }

where N_d(s_{1:n}; 0, K_ξ) denotes the probability density of the N(0, K_ξ) distribution, and K_ξ is the n × n matrix (k_ξ(X_i, X_j))_{i,j=1}^n. This marginal pseudo-posterior retains essentially the structure of the pseudo-posterior π_{ξ,γ}(θ|D) for linear scores, except that the “parameter” s_{1:n} is now of dimension n. 
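A minimal sketch of this GP construction (assuming a squared-exponential kernel, as used in the experiments of Section 5, and orienting the indicator to penalise misranked pairs i ∈ D+, j ∈ D−; all names and the jitter term are our illustrative assumptions):

```python
import numpy as np

def se_kernel(X, lengthscale=1.0):
    # squared-exponential kernel matrix K_xi = (k_xi(X_i, X_j))_{i,j}
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

def gp_log_target(s, X, y, gamma, jitter=1e-8):
    # log of the unnormalised marginal pseudo-posterior of s_{1:n}:
    # N(0, K_xi) log-density (up to a constant) minus the AUC penalty
    n = len(y)
    K = se_kernel(X) + jitter * np.eye(n)
    pos = np.where(y == 1)[0]
    neg = np.where(y == -1)[0]
    misranked = np.sum(s[pos][:, None] - s[neg][None, :] < 0)
    return -0.5 * s @ np.linalg.solve(K, s) - gamma * misranked / (len(pos) * len(neg))

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1, -1, 1, 1])
s_good = np.array([-1.0, -0.5, 0.5, 1.0])
```

Since the Gaussian factor is quadratic in s, reversing a well-ranked score vector leaves the quadratic term unchanged but pays the full pairwise penalty, so the target indeed favours concordant scores.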
We can apply straightforwardly the SMC sampler of Section 3.2, and the EP algorithm of Section 3.3, to this new target distribution. In fact, for the EP implementation, the particularly simple structure of a single site,

exp[−γ′ 1{s_i − s_j < 0}],

makes it possible to implement a site update in O(1) time, leading to an overall complexity of O(n+n− + n³) for the EP algorithm. Theoretical results for this approach could be obtained by applying lemmas from e.g. van der Vaart and van Zanten [2009], but we leave this for future study.

5 Numerical Illustration

Figure 1 compares the EP approximation with the output of our SMC sampler, on the well-known Pima Indians dataset and a Gaussian prior. Marginal first and second order moments essentially match; see the supplement for further details. The subsequent results are obtained with EP.

Figure 1: EP approximation (green), compared to SMC (blue), of the marginal posterior of the first three coefficients (a) θ1, (b) θ2, (c) θ3, for the Pima dataset (see the supplement for additional analysis).

We now compare our PAC-Bayesian approach (computed with EP) with Bayesian logistic regression (to deal with non-identifiable cases), and with the rankboost algorithm [Freund et al., 2003] on different datasets¹; note that Cortes and Mohri [2003] showed that the function optimised by rankboost is AUC. As mentioned in Section 3, we set the prior hyperparameters by maximizing the evidence, and we use cross-validation to choose γ. To ensure convergence of EP when dealing with difficult sites, we use damping [Seeger, 2005]. The GP version of the algorithm is based on a squared exponential kernel. Table 1 summarises the results; balance refers to the size of the smaller class in the data (recall that the AUC criterion is particularly relevant for unbalanced classification tasks), EP-AUC (resp.
GPEP-AUC) refers to the EP approximation of the pseudo-posterior based on our Gaussian prior (resp. Gaussian process prior). See also Figure 2 for ROC curve comparisons, and Table 1 in the supplement for a CPU time comparison. Note how the GP approach performs better for the Colon data, where the number of covariates (2000) is very large, but the number of observations is only 40. It seems also that EP gives a better approximation in this case because of the lower dimensionality of the pseudo-posterior (Figure 2b).

¹All available at http://archive.ics.uci.edu/ml/

Dataset | Covariates | Balance | EP-AUC | GPEP-AUC | Logit | Rankboost
Pima | 7 | 34% | 0.8617 | 0.8557 | 0.8646 | 0.8224
Credit | 60 | 28% | 0.7952 | 0.7922 | 0.7561 | 0.788
DNA | 180 | 22% | 0.9814 | 0.9812 | 0.9696 | 0.9814
SPECTF | 22 | 50% | 0.8684 | 0.8545 | 0.8715 | 0.8684
Colon | 2000 | 40% | 0.7034 | 0.75 | 0.73 | 0.5935
Glass | 10 | 1% | 0.9843 | 0.9629 | 0.9029 | 0.9436

The Glass dataset has originally more than two classes; we compare the “silicon” class against all the others.

Table 1: Comparison of AUC.

Figure 2: Some ROC curves associated to the examples described in a more systematic manner in Table 1: (a) Rankboost vs EP-AUC on Pima, (b) Rankboost vs GPEP-AUC on Colon, (c) Logistic vs EP-AUC on Glass. The PAC version is always in black.

Finally, we also investigate feature selection for the DNA dataset (180 covariates) using a spike and slab prior. The regularization plot (Figure 3a) shows how certain coefficients shrink to zero as the spike's variance v0 goes to zero, allowing for some sparsity. The aim of a positive variance in the spike is to absorb negligible effects into it [Ročková and George, 2013]. 
We observe this effect on Figure 3a, where one of the covariates becomes positive when v0 decreases.

Figure 3: (a) Regularization plot for v0 ∈ [10⁻⁶, 0.1], and (b) estimate for v0 = 10⁻⁶, for the DNA dataset; blue circles denote posterior probabilities ≥ 0.5.

6 Conclusion

The combination of the PAC-Bayesian theory and Expectation-Propagation leads to fast and efficient AUC classification algorithms, as observed on a variety of datasets, some of them very unbalanced. Future work may include extending our approach to more general ranking problems (e.g. multi-class), establishing non-asymptotic bounds in the nonparametric case, and reducing the CPU time by considering only a subset of all the pairs of datapoints.

Bibliography

P. Alquier. PAC-Bayesian bounds for randomized empirical risk minimizers. Mathematical Methods of Statistics, 17(4):279–304, 2008.
P. Alquier and G. Biau. Sparse single-index model. J. Mach. Learn. Res., 14(1):243–280, 2013.
P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
O. Catoni. PAC-Bayesian Supervised Classification, volume 56. 
IMS Lecture Notes & Monograph Series, 2007.
S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. Ann. Stat., 36(2):844–874, 2008a.
S. Clémençon, V.C. Tran, and H. De Arazoza. A stochastic SIR model with contact-tracing: large population limits and statistical inference. Journal of Biological Dynamics, 2(4):392–414, 2008b.
C. Cortes and M. Mohri. AUC optimization vs. error rate minimization. In NIPS, volume 9, 2003.
P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. J. R. Statist. Soc. B, 68(3):411–436, 2006.
Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res., 4:933–969, 2003.
E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. J. Am. Statist. Assoc., 88(423):881–889, 1993.
D. Hernandez-Lobato, J. Hernandez-Lobato, and P. Dupont. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. J. Mach. Learn. Res., 14:1891–1945, 2013.
A. Jasra, D. Stephens, and C. Holmes. On population-based simulation for static inference. Statist. Comput., 17(3):263–279, 2007.
G. Lecué. Méthodes d'agrégation: optimalité et vitesses rapides. Ph.D. thesis, Université Paris 6, 2007.
E. Mammen and A. Tsybakov. Smooth discrimination analysis. Ann. Stat., 27(6):1808–1829, 1999.
D.A. McAllester. Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 230–234. ACM, 1998.
T. Minka. Expectation Propagation for approximate Bayesian inference. In Proc. 17th Conf. Uncertainty in Artificial Intelligence, UAI '01, pages 362–369. Morgan Kaufmann Publishers Inc., 2001.
T.J. Mitchell and J. Beauchamp. 
Bayesian variable selection in linear regression. J. Am. Statist. Assoc., 83(404):1023–1032, 1988.
M. Opper and O. Winther. Gaussian processes for classification: mean-field algorithms. Neural Computation, 12(11):2655–2684, 2000.
C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
S. Robbiano. Upper bounds and aggregation in bipartite ranking. Elec. J. of Stat., 7:1249–1271, 2013.
V. Ročková and E. George. EMVS: the EM approach to Bayesian variable selection. J. Am. Statist. Assoc., 2013.
M. Seeger. Expectation propagation for exponential families. Technical report, U. of California, 2005.
J. Shawe-Taylor and R.C. Williamson. A PAC analysis of a Bayesian estimator. In Proc. Conf. Computational Learning Theory, pages 2–9. ACM, 1997.
A.W. van der Vaart and J.H. van Zanten. Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. Ann. Stat., pages 2655–2675, 2009.
M.A.J. Van Gerven, B. Cseke, F.P. de Lange, and T. Heskes. Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior. NeuroImage, 50:150–161, 2010.
L. Yan, R. Dodier, M. Mozer, and R. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. Proc. 20th Int. Conf. Mach. Learn., pages 848–855, 2003.
", "award": [], "sourceid": 459, "authors": [{"given_name": "James", "family_name": "Ridgway", "institution": "Crest-Ensae and Dauphine"}, {"given_name": "Pierre", "family_name": "Alquier", "institution": "ENSAE"}, {"given_name": "Nicolas", "family_name": "Chopin", "institution": "CREST"}, {"given_name": "Feng", "family_name": "Liang", "institution": "Univ. of Illinois Urbana-Champaign Statistics"}]}