{"title": "Empirical Bernstein Inequalities for U-Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 1903, "page_last": 1911, "abstract": "We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernstein-type inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions.", "full_text": "Empirical Bernstein Inequalities for U-Statistics\n\nThomas Peel\n\nLIF, Aix-Marseille Universit\u00b4e\n\n39, rue F. Joliot Curie\n\nF-13013 Marseille, France\n\nSandrine Anthoine\n\nLATP, Aix-Marseille Universit\u00b4e, CNRS\n\nthomas.peel@lif.univ-mrs.fr\n\nanthoine@cmi.univ-mrs.fr\n\n39, rue F. Joliot Curie\n\nF-13013 Marseille, France\n\nLiva Ralaivola\n\nLIF, Aix-Marseille Universit\u00b4e\n\n39, rue F. Joliot Curie\n\nF-13013 Marseille, France\n\nliva.ralaivola@lif.univ-mrs.fr\n\nAbstract\n\nWe present original empirical Bernstein inequalities for U-statistics with bounded\nsymmetric kernels q. 
They are expressed with respect to empirical estimates of\neither the variance of q or the conditional variance that appears in the Bernstein-\ntype inequality for U-statistics derived by Arcones [2]. Our result subsumes other\nexisting empirical Bernstein inequalities, as it reduces to them when U-statistics\nof order 1 are considered. In addition, it is based on a rather direct argument using\ntwo applications of the same (non-empirical) Bernstein inequality for U-statistics.\nWe discuss potential applications of our new inequalities, especially in the realm\nof learning ranking/scoring functions. In the process, we exhibit an ef\ufb01cient pro-\ncedure to compute the variance estimates for the special case of bipartite ranking\nthat rests on a sorting argument. We also argue that our results may provide test set\nbounds and particularly interesting empirical racing algorithms for the problem of\nonline learning of scoring functions.\n\n1\n\nIntroduction\n\nThe motivation of the present work lies in the growing interest of the machine learning commu-\nnity for learning tasks that are richer than now well-studied classi\ufb01cation and regression. Among\nthose, we especially have in mind the task of ranking, where one is interested in learning a ranking\nfunction capable of predicting an accurate ordering of objects according to some attached relevance\ninformation. Tackling such problems generally implies the use of loss functions other than the 0-1\nmisclassi\ufb01cation loss such as, for example, a misranking loss [6] or a surrogate thereof. 
For (x, y) and (x′, y′) two pairs from some space Z := X × Y (e.g., X = R^d and Y = R), the misranking loss ℓ_rank and a surrogate convex loss ℓ_sur may be defined for a scoring function f ∈ Y^X as:

ℓ_rank(f, (x, y), (x′, y′)) := 1{(y − y′)(f(x) − f(x′)) < 0},   (1)
ℓ_sur(f, (x, y), (x′, y′)) := (1 − (y − y′)(f(x) − f(x′)))².   (2)

Given such losses or, more generally, a loss ℓ : Y^X × Z × Z → R, and a training sample Z^n = {(X_i, Y_i)}_{i=1}^n of independent copies of some random variable Z := (X, Y) distributed according to D, the learning task is to derive a function f ∈ Y^X such that the expected risk R_ℓ(f) of f,

R_ℓ(f) := E_{Z,Z′∼D} ℓ(f, Z, Z′) = E_{Z,Z′∼D} ℓ(f, (X, Y), (X′, Y′)),

is as small as possible. In practice, this naturally brings up the empirical estimate R̂_ℓ(f, Z^n),

R̂_ℓ(f, Z^n) := (1/(n(n − 1))) Σ_{i≠j} ℓ(f, (X_i, Y_i), (X_j, Y_j)),   (3)

which is a U-statistic [6, 10].

An important question is to precisely characterize how R̂_ℓ(f, Z^n) is related to R_ℓ(f); more specifically, one may want to derive an upper bound on R_ℓ(f) expressed in terms of R̂_ℓ(f, Z^n) and other quantities such as a measure of the capacity of the class of functions f belongs to and the size n of Z^n; in other words, we may talk about generalization bounds [4]. Pivotal tools to perform such an analysis are tail/concentration inequalities, which say how probable it is for a function of several independent variables to deviate from its expectation; of course, the sharper the concentration inequalities, the more accurate the characterization of the relation between the empirical estimate and its expectation.
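For concreteness, the misranking loss (1) and the empirical U-statistic risk (3) can be computed by direct enumeration. The sketch below uses function names of our own choosing and assumes X = R:

```python
from itertools import permutations

def l_rank(f, z1, z2):
    """Misranking loss (1): 1 iff the pair (z1, z2) is ordered incorrectly by f."""
    (x1, y1), (x2, y2) = z1, z2
    return 1.0 if (y1 - y2) * (f(x1) - f(x2)) < 0 else 0.0

def empirical_risk(f, sample):
    """Empirical estimate (3): average of the loss over all ordered pairs i != j."""
    n = len(sample)
    return sum(l_rank(f, zi, zj) for zi, zj in permutations(sample, 2)) / (n * (n - 1))

sample = [(0.1, -1), (0.9, +1), (0.5, -1), (0.7, +1)]  # (x, y) pairs, y in {-1, +1}
print(empirical_risk(lambda x: x, sample))  # the identity scorer ranks this sample perfectly: 0.0
```

The direct enumeration is O(n²) here (kernel of order 2), which motivates the efficiency discussion of Section 3.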
It is therefore of the utmost importance to have at hand tail inequalities that are sharp; it is just as important that these inequalities rely as much as possible on empirical quantities.

Here, we propose new empirical Bernstein inequalities for U-statistics. As indicated by the name, (i) our results are Bernstein-type inequalities and therefore make use of information on the variance of the variables under consideration, (ii) instead of resting on some assumed knowledge about this variance, they only rely on related empirical quantities, and (iii) they apply to U-statistics. Our new inequalities generalize those of [3] and [13], which also feature points (i) and (ii) (but not (iii)), while being based on simple arguments. To the best of our knowledge, these are the first results that fulfill (i), (ii) and (iii); they may give rise to a few applications, of which we describe two in the sequel.

The paper is organized as follows. Section 2 introduces the notation and briefly recalls the basics of U-statistics as well as the tail inequalities our results are based upon. Our empirical Bernstein inequalities are presented in Section 3; we also provide an efficient way of computing the empirical variance when the U-statistics considered are based on the misranking loss ℓ_rank of (1). Section 4 discusses two applications of our new results: test set bounds for bipartite ranking and online ranking.

2 Background

2.1 Notation

The following notation will hold from here on. Z is a random variable of distribution D taking values in Z := X × Y; Z′, Z_1, ..., Z_n are independent copies of Z, Z^n := {Z_i = (X_i, Y_i)}_{i=1}^n and Z^{p:q} := {Z_i}_{i=p}^q. A^m_n denotes the set A^m_n := {(i_1, ..., i_m) : 1 ≤ i_1 ≠ ... ≠ i_m ≤ n}, with 0 ≤ m ≤ n. Finally, a function q : Z^m → R is said to be symmetric if the value of q(z) = q(z_1, ..., z_m) is independent of the order of the z_i's in z.

2.2 U-statistics and Tail Inequalities

Definition 1 (U-statistic, Hoeffding [10]). The random variable Û_q(Z^n) defined as

Û_q(Z^n) := (1/|A^m_n|) Σ_{i∈A^m_n} q(Z_{i_1}, ..., Z_{i_m})

is a U-statistic of order m with kernel q, when q : Z^m → R is a measurable function on Z^m.

Remark 1. Obviously, E_{Z^m} q(Z_1, ..., Z_m) = E_{Z^n} Û_q(Z^n); in addition, Û_q(Z^n) is a lowest-variance estimate of E_{Z^m} q(Z_1, ..., Z_m) based on Z^n [10]. Also, reusing some notation from the introduction, R̂_ℓ(f, Z^n) of Eq. (3) is a U-statistic of order 2 with kernel q_f(Z, Z′) := ℓ(f, Z, Z′).

Remark 2. Two peculiarities of U-statistics that entail special care are the following: (i) they are sums of identically distributed but dependent variables, so special tools must be resorted to in order to deal with these dependencies and characterize the deviation of Û_q(Z^n) from E_{Z^m} q; (ii) from an algorithmic point of view, their direct computation may be expensive, as it scales as O(n^m); in Section 3, we show for the special case of bipartite ranking how this complexity can be reduced.

Figure 1: First two plots: values of the right-hand side of (5) and (6), for D_uni and kernel q_m for m = 2 and m = 10 (see Example 1), as functions of n. Last two plots: same for D_Ber(0.15).

We now recall three tail inequalities (Eq. (5), (6), (7)) that hold for U-statistics with symmetric and bounded kernels q. Normally, these inequalities make explicit use of the length q_max − q_min of the range [q_min, q_max] of q.
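A direct O(n^m) implementation of the U-statistic of Definition 1, enumerating the index set A^m_n, can be sketched as follows (a sketch; function names and the example kernel are our own choices):

```python
from itertools import permutations

def u_statistic(q, sample, m):
    """U-statistic of order m (Definition 1): average of q over all m-tuples of
    distinct sample points, i.e. over the index set A^m_n."""
    tuples = list(permutations(sample, m))  # |A^m_n| = n! / (n - m)! tuples
    return sum(q(*t) for t in tuples) / len(tuples)

# a simple symmetric kernel of order 2 (product kernel)
q2 = lambda z1, z2: z1 * z2
print(u_statistic(q2, [0.0, 0.5, 1.0], 2))
```

For large n this enumeration is prohibitive, which is precisely the issue addressed in Section 3.2 for the bipartite ranking kernel.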
To simplify the reading, we will consider without loss of generality that q has range [0, 1] (an easy way of retrieving the results for bounded q is to consider q/‖q‖_∞). One key quantity that appears in the original versions of tail inequalities (5) and (6) below is ⌊n/m⌋, the integer part of the ratio n/m; this quantity might be thought of as the effective number of data. To simplify the notation, we will assume that n is a multiple of m and, therefore, ⌊n/m⌋ = n/m.

Theorem 1 (First-order tail inequality for Û_q, [11]). Hoeffding proved the following:

∀ε > 0, P_{Z^n}{ |E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≥ ε } ≤ 2 exp{ −(n/m) ε² }.   (4)

Hence, ∀δ ∈ (0, 1], with probability at least 1 − δ over the random draw of Z^n:

|E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≤ √( (1/(n/m)) ln(2/δ) ).   (5)

To go from the tail inequality (4) to the bound version (5), it suffices to make use of the elementary inequality reversal lemma (Lemma 1) provided in Section 3, used also for the bounds given below.

Theorem 2 (Bernstein inequalities for Û_q, [2, 11]). Hoeffding [11] and, later, Arcones [2] refined the previous result in the form of Bernstein-type inequalities of the form

∀ε > 0, P_{Z^n}{ |E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≥ ε } ≤ a exp{ −(n/m) ε² / (2ϑ_{q,m} + b_m ε) }.

For Hoeffding, a = 2, ϑ_{q,m} = Σ²_q, where Σ²_q is the variance of q(Z_1, ..., Z_m), and b_m = 2/3. Hence, ∀δ ∈ (0, 1], with probability at least 1 − δ:

|E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≤ √( (2Σ²_q/(n/m)) ln(2/δ) ) + (2/(3(n/m))) ln(2/δ).   (6)

For Arcones, a = 4, ϑ_{q,m} = mσ²_q, where σ²_q is the variance of E_{Z_2,...,Z_m} q(Z_1, Z_2, ..., Z_m) (this is a function of Z_1), and b_m = 2^{m+3} m^{m−1} + (2/3) m^{−2}. ∀δ ∈ (0, 1], with probability at least 1 − δ:

|E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≤ √( (2mσ²_q/(n/m)) ln(4/δ) ) + (b_m/(n/m)) ln(4/δ).   (7)

With a slight abuse, we will now refer to Eq. (5), (6) and (7) as tail inequalities. In essence, these are confidence intervals at level 1 − δ for E_{Z^m} q(Z^m) = E_{Z^n} Û_q(Z^n).

Remark 3. Eq. (7) is based on the so-called Hoeffding decomposition of U-statistics [11]. It provides a more accurate Bernstein-type inequality than that of Eq. (6), as mσ²_q is known to be smaller than Σ²_q (see [16]). However, for moderate values of n/m (e.g. n/m < 10^5) and reasonable values of δ (e.g. δ = 0.05), the influence of the log terms might be such that the advantage of (7) over (6) goes unnoticed. Thus, we detail our results focusing on an empirical version of (6).

Example 1. To illustrate how the use of the variance information provides smaller confidence intervals, consider the kernel q_m(z_1, ..., z_m) := Π_{i=1}^m z_i and two distributions D_uni and D_Ber(p). D_uni is the uniform distribution on [0, 1], for which Σ² = 1/3^m − 1/4^m. D_Ber(p) is the Bernoulli distribution with parameter p ∈ [0, 1], for which Σ² = p^m(1 − p^m). Figure 1 shows the behaviors of (6) and (5) for various values of m as functions of n.
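The comparison of Example 1 can be reproduced numerically. The sketch below (our own function names) evaluates the half-widths of (5) and (6); for the product kernel under the uniform distribution, Σ² = (1/3)^m − (1/4)^m since E q² = 3^{−m} and (E q)² = 4^{−m}:

```python
import math

def eps_hoeffding(n, m, delta):
    # half-width of (5): sqrt(ln(2/delta) / (n/m))
    return math.sqrt(math.log(2 / delta) / (n / m))

def eps_bernstein(n, m, delta, var):
    # half-width of (6): sqrt(2*var*ln(2/delta)/(n/m)) + 2*ln(2/delta)/(3*(n/m))
    k = n / m
    return math.sqrt(2 * var * math.log(2 / delta) / k) + 2 * math.log(2 / delta) / (3 * k)

m, delta = 2, 0.05
var_uniform = (1 / 3) ** m - (1 / 4) ** m  # variance of the product kernel under U[0, 1]
for n in (100, 1000, 10000):
    print(n, eps_hoeffding(n, m, delta), eps_bernstein(n, m, delta, var_uniform))
```

Since the kernel variance is small here, the Bernstein half-width is visibly smaller than the Hoeffding one for the same n, which is the point made by Figure 1.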
Observe that the variance information renders the bound smaller.

3 Main Results

This section presents the main results of the paper. We first introduce the inequality reversal lemma, which allows one to transform tail inequalities into upper bounds (or confidence intervals), as in (5)-(7).

Lemma 1 (Inequality reversal lemma). Let X be a random variable and a, b > 0, c, d ≥ 0 such that

∀ε > 0, P_X(|X| ≥ ε) ≤ a exp{ −bε²/(c + dε) };   (8)

then, with probability at least 1 − δ,

|X| ≤ √( (c/b) ln(a/δ) ) + (d/b) ln(a/δ).   (9)

Proof. Solving for ε such that the right-hand side of (8) is equal to δ gives

ε = (1/(2b)) ( d ln(a/δ) + √( d² ln²(a/δ) + 4bc ln(a/δ) ) ).

Using √(a + b) ≤ √a + √b gives an upper bound on ε and provides the result.

3.1 Empirical Bernstein Inequalities

Let us now define the empirical variances we will use in our main result.

Definition 2. Let Σ̂²_q be the U-statistic of order 2m defined as

Σ̂²_q(Z^n) := (1/(2|A^{2m}_n|)) Σ_{i∈A^{2m}_n} ( q(Z_{i_1}, ..., Z_{i_m}) − q(Z_{i_{m+1}}, ..., Z_{i_{2m}}) )²,   (10)

and σ̂²_q be the U-statistic of order 2m − 1 defined as

σ̂²_q(Z^n) := (1/|A^{2m−1}_n|) Σ_{i∈A^{2m−1}_n} q(Z_{i_1}, Z_{i_2}, ..., Z_{i_m}) q(Z_{i_1}, Z_{i_{m+1}}, ..., Z_{i_{2m−1}}).   (11)

It is straightforward to see that (cf. the definitions of Σ²_q in (6) and σ²_q in (7))

E_{Z^n} Σ̂²_q(Z^n) = Σ²_q,  and  E_{Z^n} σ̂²_q(Z^n) = σ²_q + E²_{Z^m} q(Z_1, ..., Z_m).

We have the following main result.

Theorem 3 (Empirical Bernstein inequalities/bounds). With probability at least 1 − δ over Z^n,

|E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≤ √( (2Σ̂²_q/(n/m)) ln(4/δ) ) + (5/(n/m)) ln(4/δ).   (12)

And, also, with probability at least 1 − δ (b_m is the same as in (7)),

|E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)| ≤ √( (2mσ̂²_q/(n/m)) ln(8/δ) ) + ((5√m + b_m)/(n/m)) ln(8/δ).   (13)

Proof. We provide the proof of (12) for the upper bound of the confidence interval; the same reasoning carries over to prove the lower bound. The proof of (13) is very similar.

First, let us call Q the kernel of Σ̂²_q:

Q(Z_1, ..., Z_{2m}) := (1/2) ( q(Z_1, ..., Z_m) − q(Z_{m+1}, ..., Z_{2m}) )².

Q is of order 2m and has range [0, 1], but it is not necessarily symmetric. An equivalent symmetric kernel for Σ̂²_q is Q_sym:

Q_sym(Z_1, ..., Z_{2m}) := (1/(2m)!) Σ_{ω∈P_{2m}} (1/2) ( q(Z_{ω(1)}, ..., Z_{ω(m)}) − q(Z_{ω(m+1)}, ..., Z_{ω(2m)}) )²,

where P_{2m} is the set of all the permutations over {1, ..., 2m}. This kernel is symmetric (and has range [0, 1]) and Theorem 2 can be applied to bound Σ² as follows: with probability at least 1 − δ,

Σ² = E_{Z′^{2m}} Q_sym(Z′^{2m}) = E_{Z′^n} Σ̂²_q(Z′^n) ≤ Σ̂²_q(Z^n) + √( (2V(Q_sym)/(n/2m)) ln(2/δ) ) + (2/(3(n/2m))) ln(2/δ),

where V(Q_sym) is the variance of Q_sym. As Q_sym has range [0, 1],

V(Q_sym) = E Q²_sym − E² Q_sym ≤ E Q²_sym ≤ E Q_sym = Σ²,

and therefore

Σ² ≤ Σ̂²_q(Z^n) + √( (4Σ²/(n/m)) ln(2/δ) ) + (4/(3(n/m))) ln(2/δ).

(To establish (13) we additionally use E_{Z^n} σ̂²_q(Z^n) ≥ σ²_q.) Following the approach of [13], we introduce ( √(Σ²) − √((m/n) ln(2/δ)) )², and the latter inequality entails

( √(Σ²) − √((m/n) ln(2/δ)) )² ≤ Σ̂²_q(Z^n) + (7/(3(n/m))) ln(2/δ);

taking the square root of both sides and using 1 + √(7/3) < 3 together with √(a + b) ≤ √a + √b again gives

√(Σ²) ≤ √( Σ̂²_q(Z^n) ) + 3 √( (1/(n/m)) ln(2/δ) ).

We now apply Theorem 2 to bound |E_{Z′^n} Û_q(Z′^n) − Û_q(Z^n)|, plug in the latter inequality, and adjust δ to δ/2 so that the obtained inequality still holds with probability 1 − δ. Bounding appropriate constants gives the desired result.

Remark 4. In addition to providing an empirical Bernstein bound for U-statistics based on arbitrary bounded kernels, our result differs from that of Maurer and Pontil [13] by the way we derive it. Here, we apply the same tail inequality twice, taking advantage of the fact that the estimates for the variances we are interested in are also U-statistics.
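For small n, the variance estimate (10) and the bound (12) can be evaluated by direct enumeration. The sketch below uses our own function names and normalizes the squared difference by 1/2 so that the estimate is unbiased for the kernel variance (checked below against the classical unbiased sample variance in the order-1 case):

```python
import math
from itertools import permutations

def sigma2_hat(q, sample, m):
    """Empirical variance estimate: a U-statistic of order 2m whose kernel is
    half the squared difference of q over two disjoint blocks of size m."""
    tuples = list(permutations(sample, 2 * m))
    return sum((q(*t[:m]) - q(*t[m:])) ** 2 for t in tuples) / (2 * len(tuples))

def empirical_bernstein_radius(q, sample, m, delta):
    """Half-width of the empirical Bernstein bound (12) for a kernel with range [0, 1]."""
    k = len(sample) / m
    s2 = sigma2_hat(q, sample, m)
    return math.sqrt(2 * s2 * math.log(4 / delta) / k) + 5 * math.log(4 / delta) / k

sample = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]
q2 = lambda z1, z2: z1 * z2  # symmetric kernel of order 2 with range [0, 1]
print(empirical_bernstein_radius(q2, sample, 2, 0.05))
```

For m = 1 this reduces to the setting of Maurer and Pontil [13], and sigma2_hat coincides with the usual unbiased sample variance.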
Maurer and Pontil use a tail inequality on self-bounded random variables and do not explicitly take advantage of the fact that the estimates they use are U-statistics.

3.2 Efficient Computation of the Variance Estimate for Bipartite Ranking

We have just shown how empirical Bernstein inequalities can be derived for U-statistics. The estimates that enter into play in the presented results are U-statistics with kernels of order 2m (or 2m − 1), meaning that a direct approach to computing them would scale as O(n^{2m}) (or O(n^{2m−1})). This scaling might be prohibitive as soon as n gets large. Here, we propose an efficient way of evaluating the estimate Σ̂²_q (a similar reasoning carries over for σ̂²_q) in the special case where Y = {−1, +1} and the kernel q_f induces the misranking loss (1):

q_f((x, y), (x′, y′)) := 1{(y − y′)(f(x) − f(x′)) < 0}, ∀f ∈ R^X,

which is a symmetric kernel of order m = 2 with range [0, 1]. In other words, we address the bipartite ranking problem. We have the following result.

Proposition 1 (Efficient computation of Σ̂²_{q_f}). ∀n, the computation of

Σ̂²_{q_f}(z^n) = (1/(2|A⁴_n|)) Σ_{i∈A⁴_n} ( 1{(y_{i_1} − y_{i_2})(f(x_{i_1}) − f(x_{i_2})) < 0} − 1{(y_{i_3} − y_{i_4})(f(x_{i_3}) − f(x_{i_4})) < 0} )²

can be performed in O(n ln n).

Proof. We simply provide an algorithmic way to compute Σ̂²_{q_f}(z^n). To simplify the reading, we replace i_1, i_2, i_3, i_4 by i, j, k, l, respectively. We also drop the normalization factor (2|A⁴_n|)^{−1} (hence the use of ∝ instead of = in the first line below). We have

Σ̂²_{q_f}(z^n) ∝ Σ_{i≠j≠k≠l} ( q_f(z_i, z_j) − q_f(z_k, z_l) )² = Σ_{i≠j≠k≠l} ( q²_f(z_i, z_j) − 2 q_f(z_i, z_j) q_f(z_k, z_l) + q²_f(z_k, z_l) )
= 2(n − 2)(n − 3) Σ_{i≠j} q²_f(z_i, z_j) − 2 Σ_{i≠j≠k≠l} q_f(z_i, z_j) q_f(z_k, z_l).

The first term of the last line is proportional to the well-known Wilcoxon-Mann-Whitney statistic [9]. There exist efficient ways (O(n ln n)) to compute it, based on sorting the values of the f(x_i)'s. We show how to deal with the second term, using sorting arguments as well. Note that

Σ_{i≠j≠k≠l} q_f(z_i, z_j) q_f(z_k, z_l) = ( Σ_{i,j} q_f(z_i, z_j) )² − 4 Σ_{i≠j≠k} q_f(z_i, z_j) q_f(z_i, z_k) − 2 Σ_{i≠j} q²_f(z_i, z_j).

We have subtracted from the square of Σ_{i,j} q_f(z_i, z_j) all the products q_f(z_i, z_j) q_f(z_k, z_l) such that exactly one of the variables appears both in q_f(z_i, z_j) and in q_f(z_k, z_l), which happens when i = k, i = l, j = k or j = l; using the symmetry of q_f then provides the second term (together with the factor 4). We have also subtracted all the products q_f(z_i, z_j) q_f(z_k, z_l) where i = k and j = l, or i = l and j = k, in which case the product reduces to q²_f(z_i, z_j) (hence the factor 2); this gives the last term. Thus, using q²_f = q_f and q_f(z, z) = 0, defining R(z^n) := Σ_{i,j} q_f(z_i, z_j) and doing some simple calculations:

Σ̂²_{q_f}(z^n) = (1/(2|A⁴_n|)) [ −2R²(z^n) + 2(n² − 5n + 8) R(z^n) + 8 Σ_{i≠j≠k} q_f(z_i, z_j) q_f(z_i, z_k) ].   (14)

The only term that now requires special care is the last one (which is proportional to σ̂²_{q_f}(z^n)). Recalling that q_f(z_i, z_j) = 1{(y_i − y_j)(f(x_i) − f(x_j)) < 0}, we observe that

q_f(z_i, z_j) q_f(z_i, z_k) = 1 ⇔ either y_i = −1, y_j = y_k = +1 and f(x_i) > f(x_j), f(x_k), or y_i = +1, y_j = y_k = −1 and f(x_i) < f(x_j), f(x_k).   (15)

Let us define E⁺(i) and E⁻(i) as

E⁺(i) := {j : y_j = −1, f(x_j) > f(x_i)},  and  E⁻(i) := {j : y_j = +1, f(x_j) < f(x_i)},

and their sizes κ⁺_i := |E⁺(i)| and κ⁻_i := |E⁻(i)|. For i such that y_i = +1, κ⁺_i is the number of negative instances that have been scored higher than x_i by f. From (15), we see that the contribution of i to the last term of (14) corresponds to the number κ⁺_i(κ⁺_i − 1) of ordered pairs of indices in E⁺(i) (similarly for κ⁻_i, with y_i = −1). Hence:

Σ_{i≠j≠k} q_f(z_i, z_j) q_f(z_i, z_k) = Σ_{i:y_i=+1} κ⁺_i(κ⁺_i − 1) + Σ_{i:y_i=−1} κ⁻_i(κ⁻_i − 1).

A simple way to compute the first sum (on i such that y_i = +1) is to sort and visit the data by descending order of scores and then to incrementally compute the κ⁺_i's and the corresponding sum: when a negative instance is encountered, the running count κ⁺ is incremented by 1, and when a positive instance i is visited, κ⁺_i(κ⁺_i − 1) is added to the current sum. An identical reasoning works for the second sum. The cost of computing Σ̂²_{q_f} is therefore that of sorting the scores, which is O(n ln n).

4 Applications and Discussion

Here, we mention potential applications of the new empirical inequalities we have just presented.

Figure 2: Left: UCI banana dataset, data labelled +1 (−1) in red (green). Right: half the confidence interval of the Hoeffding bound and that of the empirical Bernstein bound as functions of n_test.

4.1 Test Set Bounds

A direct use of the empirical Bernstein inequalities is to draw test set bounds. In this scenario, a sample Z^n is split into a training set Z_train := Z^{1:n_train} of n_train data and a hold-out set Z_test := Z^{n_train+1:n} of size n_test. Z_train is used to train a model f that minimizes an empirical risk based on a U-statistic-inducing loss (such as in (1) or (2)) and Z_test is used to compute a confidence interval on the expected risk of f.
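The counting scheme in the proof of Proposition 1 can be sketched in code. The snippet below (our own function names; distinct scores assumed) computes the term Σ_{i≠j≠k} q_f(z_i, z_j) q_f(z_i, z_k) in O(n log n) via the κ⁺/κ⁻ passes, and checks it against a brute-force enumeration:

```python
def misranked_pair_count_term(scores, labels):
    """sum_{i != j != k} q_f(z_i, z_j) q_f(z_i, z_k), via the kappa+/kappa-
    counting argument of Proposition 1 (O(n log n), dominated by sorting)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])  # descending scores
    total, neg_seen = 0, 0
    for i in order:  # kappa+ pass: negatives already visited have higher scores
        if labels[i] == -1:
            neg_seen += 1
        else:
            total += neg_seen * (neg_seen - 1)
    pos_seen = 0
    for i in reversed(order):  # kappa- pass: ascending scores, positives below
        if labels[i] == +1:
            pos_seen += 1
        else:
            total += pos_seen * (pos_seen - 1)
    return total

def brute_force(scores, labels):
    """Direct O(n^3) enumeration of the same term, for checking."""
    q = lambda i, j: 1 if (labels[i] - labels[j]) * (scores[i] - scores[j]) < 0 else 0
    n = len(scores)
    return sum(q(i, j) * q(i, k)
               for i in range(n) for j in range(n) for k in range(n)
               if i != j and i != k and j != k)

scores = [0.9, 0.1, 0.5, 0.7, 0.3]
labels = [-1, +1, +1, -1, +1]
print(misranked_pair_count_term(scores, labels), brute_force(scores, labels))  # -> 18 18
```

Combined with an O(n log n) computation of the Wilcoxon-Mann-Whitney term, this yields the full variance estimate at the cost of a sort.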
For instance, if we consider the bipartite ranking problem, the loss is ℓ_rank, the corresponding kernel is q_f(Z, Z′) = ℓ_rank(f, Z, Z′), and, with probability at least 1 − δ,

R_ℓrank(f) ≤ R̂_ℓrank(f, Z_test) + √( (4Σ̂²_{q_f}(Z_test)/n_test) ln(4/δ) ) + (10/n_test) ln(4/δ),   (16)

where Σ̂²_{q_f}(Z_test) is naturally the empirical variance of q_f computed on Z_test.

Figure 2 displays the behavior of such test set bounds as n_test grows for the UCI banana dataset. To produce this plot, we have learned a linear scoring function f(·) = ⟨w, ·⟩ by minimizing

λ‖w‖² + Σ_{i≠j} (1 − (Y_i − Y_j)⟨w, X_i − X_j⟩)²

for λ = 1.0. Of course, a purely linear scoring function would not make it possible to achieve good ranking accuracy, so we in fact work in the reproducing kernel Hilbert space associated with the Gaussian kernel k(x, x′) = exp(−‖x − x′‖²/2). We train our scoring function on n_train = 1000 data points and evaluate the test set bound on n_test = 100, 500, 1000, 5000, 10000 data points. Figure 2 (right) reports the size of half the confidence interval of the Hoeffding bound (5) and that of the empirical Bernstein bound given in (16). Just as in the situation described in Example 1, the use of variance information gives rise to smaller confidence intervals, even for moderate sizes of test sets.

4.2 Online Ranking and Empirical Racing Algorithms

Another application that we would like to describe is online bipartite ranking.
Due to space limitations, we only provide the main ideas on how we think our empirical tail inequalities and the efficient computation of the variance estimates we propose might be particularly useful in this scenario.

First, let us make precise what we mean by online bipartite ranking. Obviously, this means that Y = {−1, +1} and that the loss of interest is ℓ_rank. In addition, it means that, given a training set Z = {Z_i := (X_i, Y_i)}_{i=1}^n, the learning procedure processes the data of Z incrementally to give rise to hypotheses f_1, f_2, ..., f_T. As ℓ_rank entails a kernel of order m = 2, we assume that n = 2T and that we process the data from Z pair by pair, i.e. (Z_1, Z_2) are used to learn f_1, (Z_3, Z_4) and f_1 are used to learn f_2 and, more generally, (Z_{2t−1}, Z_{2t}) and f_{t−1} are used to produce f_t (there exist more clever ways to handle the data but this goes beyond the scope of the present paper). We do not specify any learning algorithm, but we may imagine trying to minimize a penalized empirical risk based on the surrogate loss ℓ_sur: if linear functions f(·) = ⟨w, ·⟩ are considered and a penalization like ‖w‖² is used, then the optimization problem to solve is of the same form as in the batch case:

λ‖w‖² + Σ_{i≠j} (1 − (Y_i − Y_j)⟨w, X_i − X_j⟩)²,

but is solved incrementally here. Rank-1 update formulas for inverses of matrices easily provide means to incrementally solve this problem as new data arrive (this is the main reason why we have mentioned this surrogate function).

As evoked by [5], a nice feature of online learning is that the expected risk of hypothesis f_t can be estimated on the n − 2t examples of Z it was not trained on.
Namely, when 2τ data have been processed, there exist τ hypotheses f_1, ..., f_τ and, for t < τ, with probability at least 1 − δ:

|R_ℓrank(f_t) − R̂_ℓrank(f_t, Z^{2t:2τ})| ≤ √( (2Σ̂²_{q_{f_t}}(Z^{2t:2τ})/(τ − t)) ln(4/δ) ) + (5/(τ − t)) ln(4/δ).

If one wants these confidence intervals to simultaneously hold for all t and all τ with probability 1 − δ, basic computations to count the number of pairs (t, τ) with 1 ≤ t < τ ≤ n show that it suffices to adjust δ to 4δ/(n + 1)². Hence, with probability at least 1 − δ: ∀ 1 ≤ t < τ ≤ n,

|R_ℓrank(f_t) − R̂_ℓrank(f_t, Z^{2t:2τ})| ≤ √( (4Σ̂²_{q_{f_t}}(Z^{2t:2τ})/(τ − t)) ln((n + 1)/δ) ) + (10/(τ − t)) ln((n + 1)/δ).   (17)

We would like to draw the attention of the reader to two features: one has to do with statistical considerations and the other with algorithmic ones. First, if the confidence intervals simultaneously hold for all t and all τ as in (17), it is possible, as the online learning process goes through, to discard the hypotheses f_t whose lower bound (according to (17)) on R_ℓrank(f_t) is higher than the upper bound (according to (17) as well) on R_ℓrank(f_{t′}) for some other hypothesis f_{t′}. This corresponds to a racing algorithm as described in [12]. Theoretically analyzing the relevance of such a race can easily be done with the results of [14], which deal with empirical Bernstein racing, but for non-U-statistics. This full analysis will be provided in a long version of the present paper.
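The elimination rule of such a race can be sketched in a few lines (a sketch with names and toy intervals of our own choosing; in practice the lower/upper bounds would come from the simultaneous intervals (17)):

```python
def race(intervals):
    """Empirical Bernstein race elimination step: discard any hypothesis whose
    lower risk bound exceeds the smallest upper risk bound among survivors."""
    best_upper = min(up for (_, up) in intervals.values())
    return {name: (lo, up) for name, (lo, up) in intervals.items() if lo <= best_upper}

# toy (lower, upper) confidence intervals on the true ranking risk
intervals = {"f1": (0.30, 0.50), "f2": (0.10, 0.25), "f3": (0.05, 0.40)}
survivors = race(intervals)
print(sorted(survivors))  # f1 is eliminated: its lower bound 0.30 exceeds f2's upper bound 0.25
```

Since the eliminated hypotheses no longer need their variance estimates updated, each elimination also reduces the per-round computational load.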
Second, it is algorithmically possible to preserve some efficiency in computing the various variance estimates through the online learning process: these computations rely on sorting arguments, and it is possible to take advantage of structures like binary search trees, such as AVL trees, which are precisely designed to efficiently maintain and update sorted lists of numbers. The remaining question is whether it is possible to share such structures to summarize the sorted lists of scores for various hypotheses (recall that the scores are computed on the same data). This will be the subject of further research.

5 Conclusion

We have proposed new empirical Bernstein inequalities designed for U-statistics. They generalize the empirical inequalities of [13] and [3], while they merely result from two applications of the same non-empirical tail inequality for U-statistics. We also show how, in the bipartite ranking situation, the empirical variance can be efficiently computed. We mention potential applications, with illustrative results for the case of test set bounds in the realm of bipartite ranking. In addition to the possible extensions discussed in the previous section, we wonder whether it is possible to derive similar empirical inequalities for other types of rich statistics such as, e.g., linear rank statistics [8]. Obviously, we plan to work on establishing generalization bounds derived from the new concentration inequalities presented. This would require carefully defining a sound notion of capacity for U-statistic-based classes of functions (inspired, for example, by localized Rademacher complexities). Such new bounds would be compared with those proposed in [1, 6, 7, 15] for the bipartite ranking and/or pairwise classification problems.
Finally, we also plan to carry out intensive simulations \u2014in particular\nfor the task of online ranking\u2014 to get even more insights on the relevance of our contribution.\n\nAcknowledgments\n\nThis work is partially supported by the IST Program of the EC, under the FP7 Pascal 2 Network of\nExcellence, ICT-216886-NOE. LR is partially supported by the ANR project ASAP.\n\n8\n\n\fReferences\n[1] S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth. Generalization Bounds for\n\nthe Area under the ROC Curve. Journal of Machine Learning Research, 6:393\u2013425, 2005.\n\n[2] M. A. Arcones. A bernstein-type inequality for u-statistics and u-processes. Statistics &\n\nprobability letters, 22(3):239\u2013247, 1995.\n\n[3] J.-Y. Audibert, R. Munos, and C. Szepesv\u00b4ari. Tuning bandit algorithms in stochastic environ-\nments. In ALT \u201907: Proceedings of the 18th international conference on Algorithmic Learning\nTheory, pages 150\u2013165, Berlin, Heidelberg, 2007. Springer-Verlag.\n\n[4] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classi\ufb01cation : A survey of some recent\n\nadvances. ESAIM. P&S, 9:323\u2013375, 2005.\n\n[5] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of online learning\n\nalgorithms. IEEE Transactions on Information Theory, 50(9):2050\u20132057, 2004.\n\n[6] S. Cl\u00b4emenc\u00b8on, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of u -statistics.\n\nThe Annals of Statistics, 36(2):844\u2013874, April 2008.\n\n[7] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An ef\ufb01cient boosting algorithm for combining\n\npreferences. Journal of Machine Learning Research, 4:933\u2013969, 2003.\n\n[8] J. H\u00b4ajek and Z. Sid\u00b4ak. Theory of Rank Tests. Academic Press, 1967.\n[9] J. A. Hanley and B. J. Mcneil. The meaning and use of the area under a receiver operating\n\ncharacteristic (roc) curve. Radiology, 143(1):29\u201336, April 1982.\n\n[10] W. Hoeffding. 
A Class of Statistics with Asymptotically Normal Distribution. Annals of\n\nMathematical Statistics, 19(3):293\u2013325, 1948.\n\n[11] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the\n\nAmerican Statistical Association, 58(301):13\u201330, 1963.\n\n[12] O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classi\ufb01ca-\ntion and function approximation. In Adv. in Neural Information Processing Systems NIPS 93,\npages 59\u201366, 1993.\n\n[13] A. Maurer and M. Pontil. Empirical bernstein bounds and sample-variance penalization. In\n\nCOLT 09: Proc. of The 22nd Annual Conference on Learning Theory, 2009.\n[14] V. Mnih, C. Szepesv\u00b4ari, and J.-Y. Audibert. Empirical bernstein stopping.\n\nIn ICML \u201908:\nProceedings of the 25th international conference on Machine learning, pages 672\u2013679, New\nYork, NY, USA, 2008. ACM.\n\n[15] C. Rudin and R. E. Schapire. Margin-based ranking and an equivalence between AdaBoost\n\nand RankBoost. Journal of Machine Learning Research, 10:2193\u20132232, Oct 2009.\n\n[16] R. J. Ser\ufb02ing. Approximation theorems of mathematical statistics. J. Wiley & Sons, 1980.\n\n9\n\n\f", "award": [], "sourceid": 1114, "authors": [{"given_name": "Thomas", "family_name": "Peel", "institution": null}, {"given_name": "Sandrine", "family_name": "Anthoine", "institution": null}, {"given_name": "Liva", "family_name": "Ralaivola", "institution": null}]}