Generalization error bounds for classifiers trained with interdependent data

Nicolas Usunier, Massih-Reza Amini, Patrick Gallinari
Department of Computer Science, University of Paris VI
8, rue du Capitaine Scott, 75015 Paris, France
{usunier, amini, gallinari}@poleia.lip6.fr

Advances in Neural Information Processing Systems, pp. 1369-1376.

Abstract

In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks.

1 Introduction

Many machine learning (ML) applications deal with the problem of bipartite ranking, where the goal is to find a function which orders relevant elements over irrelevant ones. Such problems appear for example in Information Retrieval, where the system returns a list of documents ordered by relevancy to the user's query. The criterion widely used to measure the ranking quality is the Area Under the ROC Curve (AUC) [6]. Given a training set S = ((x_p, y_p))_{p=1}^n with y_p ∈ {±1}, its optimization over a class of real-valued functions G can be carried out by finding a classifier of the form c_g(x, x') = sign(g(x) - g(x')), g ∈ G, which minimizes the error rate over the pairs of examples (x, 1) and (x', -1) in S [6]. More generally, it is well known that the learning of scoring functions can be expressed as a classification task over pairs of examples [7, 5].

The study of the generalization properties of ranking problems is a challenging task, since the pairs of examples violate the central i.i.d. assumption of binary classification. Using task-specific studies, this issue has recently been the focus of a large amount of work. [2] showed that SVM-like algorithms optimizing the AUC have good generalization guarantees, and [11] showed that maximizing the margin of the pairs, defined by the quantity g(x) - g(x'), leads to the minimization of the generalization error. While these results suggest some similarity between the classification of pairs of examples and the classification of independent data, no common framework has been established. As a major drawback, it is not possible to directly deduce results for ranking from those obtained in classification.

In this paper, we present a new framework to study the generalization properties of classifiers over data which can exhibit a suitable dependency structure. Among others, the problems of binary classification, bipartite ranking, and the ranking risk defined in [5] are special cases of our study. It shows that it is possible to infer generalization bounds for classifiers trained over interdependent examples using generalization results known for binary classification. We illustrate this property by proving a new margin-based, data-dependent bound for SVM-like algorithms optimizing the AUC. This bound derives straightforwardly from the same kind of bounds given for SVMs for classification in [12].
Since learning algorithms aim at minimizing the generalization error of their chosen hypothesis, our results suggest that the design of bipartite ranking algorithms can follow the design of standard classification learning systems.

The remainder of this paper is organized as follows. In section 2, we give the formal definition of our framework and detail the progression of our analysis over the paper. In section 3, we present a new concentration inequality which allows us to extend the notion of Rademacher complexity (section 4), and, in section 5, we prove generalization bounds for binary classification and bipartite ranking tasks under our framework. The missing proofs are given in a longer version of this paper [13].

2 Formal framework

We distinguish between the input data and the training data. The input data S = (s_p)_{p=1}^n is a set of n independent examples, while the training data Z = (z_i)_{i=1}^N is composed of N binary classified elements, where each z_i is in X_tr × {-1, +1}, with X_tr the space of characteristics. For example, in the general case of bipartite ranking, the input data is the set of elements to be ordered, while the training data is constituted by the pairs of examples to be classified. The purpose of this work is the study of the generalization properties of classifiers trained using possibly dependent training data, in the special case where the latter is deterministically generated from the input data. The aim here is to select a hypothesis h ∈ H = {h_θ : X_tr → {-1, 1} | θ ∈ Θ} which optimizes the empirical risk L(h, Z) = (1/N) Σ_{i=1}^N ℓ(h, z_i) over the training set Z, ℓ being the instantaneous loss of h.

Definition 1 (Classifiers trained with interdependent data).
A classification algorithm over interdependent training data takes as input data a set S = (s_p)_{p=1}^n supposed to be drawn according to an unknown product distribution ⊗_{p=1}^n D_p over a product sample space S^n (see footnote 1), outputs a binary classifier chosen in a hypothesis space H = {h : X_tr → {+1, -1}}, and has a two-step learning process. In a first step, the learner applies to its input data S a fixed function ϕ : S^n → (X_tr × {-1, 1})^N to generate a vector Z = (z_i)_{i=1}^N = ϕ(S) of N training examples z_i ∈ X_tr × {-1, 1}, i = 1, ..., N. In the second step, the learner runs a classification algorithm in order to obtain the h which minimizes the empirical classification loss L(h, Z) over its training data Z = ϕ(S).

Examples. Using the notations above, when S = X_tr × {±1}, n = N, ϕ is the identity function and S is drawn i.i.d. according to an unknown distribution D, we recover the classical definition of a binary classification algorithm. Another example is the ranking task described in [5], where S = X × R, X_tr = X², N = n(n-1) and, given S = ((x_p, y_p))_{p=1}^n drawn i.i.d. according to a fixed D, ϕ generates all the pairs ((x_k, x_l), sign(y_k - y_l)), k ≠ l.

In the remainder of the paper, we will prove generalization error bounds for the selected hypothesis by upper bounding

    sup_{h∈H} L(h) - L(h, ϕ(S))     (1)

with high confidence over S, where L(h) = E_S L(h, ϕ(S)). To this end, we decompose Z = ϕ(S) using the dependency graph of the random variables composing Z, with a technique similar to the one proposed by [8].
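To make the two-step process of definition 1 concrete, the pair-generating map ϕ of the ranking task of [5] described above can be sketched in a few lines of Python. The function name, the toy data, and the tie-breaking rule are ours, for illustration only:

```python
def phi_ranking(sample):
    """Sketch of the pair-generating map phi for the ranking task of [5]:
    from n scored input examples s_p = (x_p, y_p), build the N = n(n-1)
    ordered pairs z = ((x_k, x_l), sign(y_k - y_l)), k != l.  Each pair
    depends on exactly two input examples, i.e. [i] = {k, l}."""
    pairs = []
    for k, (xk, yk) in enumerate(sample):
        for l, (xl, yl) in enumerate(sample):
            if k != l:
                # sign(y_k - y_l); ties get -1 here, an arbitrary choice
                pairs.append(((xk, xl), 1 if yk > yl else -1))
    return pairs

# n = 3 independent input examples yield N = 6 interdependent training pairs.
Z = phi_ranking([(0.1, 3.0), (0.4, 1.0), (0.9, 2.0)])
assert len(Z) == 6
```

Two generated pairs are independent exactly when the four underlying input examples are distinct, which is the dependency structure exploited in the rest of the paper.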
We go towards this result by first bounding

    sup_{q∈Q} E_S̃ [ (1/N) Σ_{i=1}^N q(ϕ(S̃)_i) ] - (1/N) Σ_{i=1}^N q(ϕ(S)_i)

with high confidence over samples S, where S̃ is also drawn according to ⊗_{p=1}^n D_p, Q is a class of functions taking values in [0, 1], and ϕ(S)_i denotes the i-th training example (theorem 4). This bound uses an extension of the Rademacher complexity [3], the fractional Rademacher complexity (FRC) (definition 3), which is a weighted sum of Rademacher complexities over independent subsets of the training data. We show that the FRC of an arbitrary class of real-valued functions can be trivially computed given the Rademacher complexity of this class of functions and ϕ (theorem 6). This theorem shows that generalization error bounds for classes of classifiers over interdependent data (in the sense of definition 1) follow trivially from the same kind of bounds for the same class of classifiers trained over i.i.d. data. Finally, we show an example of the derivation of a margin-based, data-dependent generalization error bound (i.e. a bound on equation (1) which can be computed on the training data) for the bipartite ranking case when H = {(x, x') ↦ sign(K(θ, x) - K(θ, x')) | K(θ, θ) ≤ B²}, assuming that the input examples are drawn i.i.d. according to a distribution D over X × {±1}, with X ⊂ R^d and K a kernel over X².

(Footnote 1: It is equivalent to say that the input data is a vector of independent, but not necessarily identically distributed, random variables.)

Notations. Throughout the paper, we will use the notations of the preceding subsection, except for Z = (z_i)_{i=1}^N, which will denote an arbitrary element of (X_tr × {-1, 1})^N. In order to obtain the dependency graph of the random variables ϕ(S)_i, we will consider, for each 1 ≤ i ≤ N, a set [i] ⊂ {1, ..., n} such that ϕ(S)_i depends only on the variables s_p ∈ S for which p ∈ [i]. Using these notations, if we consider two indices k, l in {1, ..., N}, we can notice that the two random variables ϕ(S)_k and ϕ(S)_l are independent if and only if [k] ∩ [l] = ∅. The dependency graph of the ϕ(S)_i follows, by constructing the graph Γ(ϕ) with the set of vertices V = {1, ..., N} and with an edge between k and l if and only if [k] ∩ [l] ≠ ∅. The following definitions, taken from [8], will enable us to separate the set of partly dependent variables into sets of independent variables:

• A subset A of V is independent if all the elements in A are independent.
• A sequence C = (C_j)_{j=1}^m of subsets of V is a proper cover of V if, for all j, C_j is independent, and ∪_j C_j = V.
• A sequence C = (C_j, w_j)_{j=1}^m is a proper, exact fractional cover of Γ if w_j > 0 for all j and, for each i ∈ V, Σ_{j=1}^m w_j I_{C_j}(i) = 1, where I_{C_j} is the indicator function of C_j.
• The fractional chromatic number of Γ, noted χ(Γ), is equal to the minimum of Σ_j w_j over all proper, exact fractional covers.

It is to be noted that from lemma 3.2 of [8], the existence of proper, exact fractional covers is ensured. Since Γ is fully determined by the function ϕ, we will note χ(Γ) = χ(ϕ). Moreover, we will denote by C(ϕ) = (C_j, w_j)_{j=1}^κ a proper, exact fractional cover of Γ such that Σ_j w_j = χ(ϕ). Finally, for a given C(ϕ), we denote by κ_j the number of elements in C_j, and we fix the notations C_j = {C_{j1}, ..., C_{jκ_j}}.
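As an illustration of these definitions (our own sketch, not taken from the paper), consider the bipartite pairing where ϕ matches every positive input example with every negative one: a vertex of Γ(ϕ) is a pair (a, b), and two vertices are adjacent iff they share a positive or a negative example. The cover below is proper and exact, and its total weight max(n⁺, n⁻) therefore upper bounds χ(ϕ) for this graph:

```python
from collections import Counter

def bipartite_cover(n_pos, n_neg):
    """Our sketch of a proper, exact fractional cover when phi pairs each
    positive input example a with each negative one b: vertex (a, b) of
    Gamma(phi) is linked to (a', b') iff a == a' or b == b'.  Assumes
    n_pos <= n_neg; every weight is 1 and the weights sum to n_neg."""
    assert n_pos <= n_neg
    return [([(a, (a + j) % n_neg) for a in range(n_pos)], 1.0)
            for j in range(n_neg)]

cover = bipartite_cover(3, 5)
# Proper: within a block, no two vertices share a positive or a negative.
for C_j, _ in cover:
    assert len({a for a, b in C_j}) == len(C_j)
    assert len({b for a, b in C_j}) == len(C_j)
# Exact: every vertex is covered with total weight exactly 1.
weight = Counter(v for C_j, w in cover for v in C_j)
assert all(weight[(a, b)] == 1 for a in range(3) for b in range(5))
# Sum of the weights, an upper bound on chi(phi), is max(n_pos, n_neg).
assert sum(w for _, w in cover) == 5.0
```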
It is to be noted that if (t_i)_{i=1}^N ∈ R^N and C(ϕ) = (C_j, w_j)_{j=1}^κ, lemma 3.1 of [8] states that:

    Σ_{i=1}^N t_i = Σ_{j=1}^κ w_j T_j,  where T_j = Σ_{k=1}^{κ_j} t_{C_{jk}}     (2)

3 A new concentration inequality

Concentration inequalities bound the probability that a random variable deviates too much from its expectation (see [4] for a survey). They play a major role in learning theory, as they can be used for example to bound the probability of deviation of the expected loss of a function from its empirical value estimated over a sample set. A well-known inequality is McDiarmid's theorem [9] for independent random variables, which bounds the probability of deviation from its expectation of an arbitrary function with bounded variations over each one of its parameters. While this theorem is very general, [8] proved a large deviation bound for sums of partly dependent random variables whose dependency structure is known, which can be tighter in some cases. Since we also consider variables with known dependency structure, using such results may lead to tighter bounds. However, we will bound functions like the one in equation (1), which does not write as a sum of partly dependent variables. Thus, we need a result on more general functions than sums of random variables, but which also takes into account the known dependency structure of the variables.

Theorem 2. Let ϕ : X^n → X'^N. Using the notations defined above, let C(ϕ) = (C_j, w_j)_{j=1}^κ. Let f : X'^N → R be such that:

1. There exist κ functions f_j : X'^{κ_j} → R which satisfy, for all Z = (z_1, ..., z_N) ∈ X'^N, f(Z) = Σ_j w_j f_j(z_{C_{j1}}, ..., z_{C_{jκ_j}}).
2. There exist β_1, ..., β_N ∈ R+ such that, for all j and all Z_j, Z_j^k ∈ X'^{κ_j} which differ only in the k-th dimension, |f_j(Z_j) - f_j(Z_j^k)| ≤ β_{C_{jk}}.

Let finally D_1, ..., D_n be n probability distributions over X. Then, we have:

    P_{X∼⊗_{i=1}^n D_i}( f∘ϕ(X) - E[f∘ϕ] > ε ) ≤ exp( -2ε² / ( χ(ϕ) Σ_{i=1}^N β_i² ) )     (3)

and the same holds for P( E[f∘ϕ] - f∘ϕ(X) > ε ).

The proof of this theorem (given in [13]) is a variation of the demonstrations of [8] and of McDiarmid's theorem. The main idea of this theorem is to allow the decomposition of f, which will take as input partly dependent random variables when applied to ϕ(S), into a sum of functions which, when considering f∘ϕ(S), will be functions of independent variables. As we will see, this theorem will be the major tool in our analysis. It is to be noted that when X = X', N = n and ϕ is the identity function of X^n, theorem 2 is exactly McDiarmid's theorem. On the other hand, when f takes the form Σ_{i=1}^N q_i(z_i) with, for all z ∈ X', a ≤ q_i(z) ≤ a + β_i for some a ∈ R, theorem 2 reduces to a particular case of the large deviation bound of [8].

4 The fractional Rademacher complexity

Let Z = (z_i)_{i=1}^N ∈ Z^N. If Z is supposed to be drawn i.i.d. according to a distribution D_Z over Z, then for a class F of functions from Z to R, the Rademacher complexity of F is defined by [10] R_N(F) = E_{Z∼D_Z} R_N(F, Z), where R_N(F, Z) = E_σ sup_{f∈F} (2/N) Σ_{i=1}^N σ_i f(z_i) is the empirical Rademacher complexity of F on Z, and σ = (σ_i)_{i=1}^N is a sequence of independent Rademacher variables, i.e. for all i, P(σ_i = 1) = P(σ_i = -1) = 1/2.
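For a finite class of functions, the empirical Rademacher complexity can be estimated by Monte Carlo. The sketch below is our own helper, using the 2/N normalization of this section; it also illustrates that enlarging the class can only increase the complexity:

```python
import random

def empirical_rademacher(F, Z, n_draws=500, seed=0):
    """Monte Carlo sketch of R_N(F, Z) = E_sigma sup_{f in F}
    (2/N) sum_i sigma_i f(z_i), for a finite class F (an assumption
    made here so that the sup is a max)."""
    rng = random.Random(seed)
    N = len(Z)
    acc = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(N)]
        acc += max((2.0 / N) * sum(s * f(z) for s, z in zip(sigma, Z))
                   for f in F)
    return acc / n_draws

Z = [0.5, -1.0, 2.0, 0.1, -0.7, 1.3]
r_small = empirical_rademacher([lambda z: z], Z, seed=1)
r_large = empirical_rademacher([lambda z: z, lambda z: -z], Z, seed=1)
# Same seed, hence same sigma draws: the sup over a superset dominates
# draw by draw, and sup over {f, -f} is an absolute value, so nonnegative.
assert r_large >= r_small
assert r_large >= 0.0
```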
This quantity has been extensively used to measure the complexity of function classes in previous bounds for binary classification [3, 10]. In particular, if we consider a class of functions Q = {q : Z → [0, 1]}, it can be shown (theorem 4.9 in [12]) that with probability at least 1 - δ over Z, all q ∈ Q verify the following inequality, which serves as a preliminary result to show data-dependent bounds for SVMs in [12]:

    E_{z∼D_Z} q(z) ≤ (1/N) Σ_{i=1}^N q(z_i) + R_N(Q) + √( ln(1/δ) / (2N) )     (4)

In this section, we generalize equation (4) to our case with theorem 4, using the following definition (see footnote 2); we denote λ(q, ϕ(S)) = (1/N) Σ_{i=1}^N q(ϕ(S)_i) and λ(q) = E_S λ(q, ϕ(S)).

Definition 3. Let Q be a class of functions from a set Z to R, let ϕ : X^n → Z^N, and let S be a sample of size n drawn according to a product distribution ⊗_{p=1}^n D_p over X^n. Then, we define the empirical fractional Rademacher complexity (see footnote 3) of Q given ϕ as:

    R*_n(Q, S, ϕ) = (2/N) E_σ Σ_j w_j sup_{q∈Q} Σ_{i∈C_j} σ_i q(ϕ(S)_i)

as well as the fractional Rademacher complexity of Q as R*_n(Q, ϕ) = E_S R*_n(Q, S, ϕ).

Theorem 4. Let Q be a class of functions from Z to [0, 1]. Then, with probability at least 1 - δ over the samples S drawn according to ⊗_{p=1}^n D_p, for all q ∈ Q:

    λ(q) - (1/N) Σ_{i=1}^N q(ϕ(S)_i) ≤ R*_n(Q, ϕ) + √( χ(ϕ) ln(1/δ) / (2N) )

and:

    λ(q) - (1/N) Σ_{i=1}^N q(ϕ(S)_i) ≤ R*_n(Q, S, ϕ) + 3 √( χ(ϕ) ln(2/δ) / (2N) )

In the definition of the fractional Rademacher complexity (FRC), if ϕ is the identity function, we recover the standard Rademacher complexity, and theorem 4 reduces to equation (4). These results are therefore extensions of equation (4), and show that the generalization error bounds for the tasks falling in our framework will follow from a unique approach.

Proof. In order to find a bound, valid for all q in Q, on λ(q) - λ(q, ϕ(S)), we write:

    λ(q) - λ(q, ϕ(S)) ≤ sup_{q∈Q} [ E_S̃ (1/N) Σ_{i=1}^N q(ϕ(S̃)_i) - (1/N) Σ_{i=1}^N q(ϕ(S)_i) ]
                      ≤ (1/N) Σ_j w_j sup_{q∈Q} [ E_S̃ Σ_{i∈C_j} q(ϕ(S̃)_i) - Σ_{i∈C_j} q(ϕ(S)_i) ]     (5)

where we have used equation (2). Now consider, for each j, f_j : Z^{κ_j} → R such that, for all z^(j) ∈ Z^{κ_j}, f_j(z^(j)) = (1/N) sup_{q∈Q} [ E_S̃ Σ_{k=1}^{κ_j} q(ϕ(S̃)_{C_{jk}}) - Σ_{k=1}^{κ_j} q(z_k^(j)) ]. It is clear that if f : Z^N → R is defined by f(Z) = Σ_{j=1}^κ w_j f_j(z_{C_{j1}}, ..., z_{C_{jκ_j}}) for all Z ∈ Z^N, then the right side of equation (5) is equal to f∘ϕ(S), and that f satisfies all the conditions of theorem 2 with β_i = 1/N for all i ∈ {1, ..., N}.
Therefore, by a direct application of theorem 2, we can claim that, with probability at least 1 - δ over samples S drawn according to ⊗_{p=1}^n D_p (we denote λ_j(q, ϕ(S)) = (1/N) Σ_{i∈C_j} q(ϕ(S)_i)):

    λ(q) - λ(q, ϕ(S)) ≤ E_S [ Σ_j w_j sup_{q∈Q} ( E_S̃ λ_j(q, ϕ(S̃)) - λ_j(q, ϕ(S)) ) ] + √( χ(ϕ) ln(1/δ) / (2N) )
                      ≤ E_{S,S̃} Σ_j (w_j/N) sup_{q∈Q} Σ_{i∈C_j} [ q(ϕ(S̃)_i) - q(ϕ(S)_i) ] + √( χ(ϕ) ln(1/δ) / (2N) )     (6)

(Footnote 2: The fractional Rademacher complexity depends on the cover C(ϕ) chosen, since the latter is not unique. However, in practice, our bounds only depend on χ(ϕ) (see section 4.1).)
(Footnote 3: This denomination stands as it is a sum of Rademacher averages over independent parts of ϕ(S).)

Now fix j, and consider σ = (σ_i)_{i=1}^N, a sequence of N independent Rademacher variables. For a given realization of σ, we have that

    E_{S,S̃} sup_{q∈Q} Σ_{i∈C_j} [ q(ϕ(S̃)_i) - q(ϕ(S)_i) ] = E_{S,S̃} sup_{q∈Q} Σ_{i∈C_j} σ_i [ q(ϕ(S̃)_i) - q(ϕ(S)_i) ]     (7)

because, for each σ_i considered, σ_i = -1 simply corresponds to permuting, in S, S̃, the two sequences S_[i] and S̃_[i] (where S_[i] denotes the subset of S that ϕ(S)_i really depends on), which have the same distribution (even though the s_p's are not identically distributed) and are independent from the other S_[k] and S̃_[k], since we are considering i, k ∈ C_j.
Therefore, taking the expectation over S, S̃ with the elements permuted this way is the same as if they were not permuted. Then, from equation (6), the first inequality of the theorem follows. The second inequality is due to an application of theorem 2 to R*_n(Q, S, ϕ).

Remark 5. The symmetrization performed in equation (7) requires the variables ϕ(S)_i appearing in the same sum to be independent. Thus, the generalization of Rademacher complexities could only be performed using a decomposition into independent sets, and the cover C ensures some optimality of the decomposition. Moreover, even though McDiarmid's theorem could be applied each time we used theorem 2, the derivation of the real numbers bounding the differences is not straightforward, and may not lead to the same result. The construction of the dependency graph of ϕ and theorem 2 are therefore necessary tools for obtaining theorem 4.

Properties of the fractional Rademacher complexity

Theorem 6. Let Q be a class of functions from a set Z to R, and ϕ : X^n → Z^N. For S ∈ X^n, the following results are true.

1. Let φ : R → R be an L-Lipschitz function. Then R*_n(φ∘Q, S, ϕ) ≤ L R*_n(Q, S, ϕ).

2. If there exists M > 0 such that, for every k and every sample S_k of size k, R_k(Q, S_k) ≤ M/√k, then R*_n(Q, S, ϕ) ≤ M √( χ(ϕ)/N ).

3. Let K be a kernel over Z, let B > 0, denote ||x||_K = √K(x, x), and define H_{K,B} = {h_θ : Z → R, h_θ(x) = K(θ, x) | ||θ||_K ≤ B}. Then:

    R*_n(H_{K,B}, S, ϕ) ≤ (2B/N) √( χ(ϕ) Σ_{i=1}^N ||ϕ(S)_i||²_K )

The first point of this theorem is a direct consequence of a Rademacher process comparison theorem, namely theorem 7 of [10], and will enable us to obtain margin-based bounds. The second and third points show that results regarding the Rademacher complexity can be used to immediately deduce bounds on the FRC. This result, as well as theorem 4, shows that binary classifiers of i.i.d. data and classifiers of interdependent data have generalization bounds of the same form, but with different convergence rates depending on the dependency structure imposed by ϕ.

Elements of proof. The second point results from Jensen's inequality, using the facts that Σ_j w_j = χ(ϕ) and, from equation (2), Σ_j w_j |C_j| = N. The third point is based on the same calculations, noting that (see e.g. [3]), if S_k = ((x_p, y_p))_{p=1}^k, then R_k(H_{K,B}, S_k) ≤ (2B/k) √( Σ_{p=1}^k ||x_p||²_K ).

5 Data-dependent bounds

The fact that classifiers trained on interdependent data "inherit" the generalization bound of the same classifier trained on i.i.d. data suggests simple ways of obtaining bipartite ranking algorithms. Indeed, suppose we want to learn a linear ranking function, for example a function h ∈ H_{K,B} as defined in theorem 6, where K is a linear kernel, and consider a sample S ∈ (X × {-1, 1})^n with X ⊂ R^d, drawn i.i.d.
according to some distribution D. Then we have, for input examples (x, 1) and (x', -1) in S, h(x) - h(x') = h(x - x'). Therefore, we can learn a bipartite ranking function by applying an SVM algorithm to the pairs ((x, 1), (x', -1)) in S, each pair being represented by x - x', and our framework allows us to immediately obtain generalization bounds for this learning process based on the generalization bounds for SVMs. We show these bounds in theorem 7.

To derive the bounds, we consider φ, the 1-Lipschitz function defined by φ(x) = min(1, max(1 - x, 0)) ≥ [[x ≤ 0]] (see footnote 4). Given a training example z, we denote by z^l its label and by z^f its feature representation. With an abuse of notation, we write φ(h, Z) = (1/N) Σ_{i=1}^N φ(z_i^l h(z_i^f)). For a sample S drawn according to ⊗_{p=1}^n D_p, we have, for all h in some function class H:

    E_S (1/N) Σ_{i=1}^N ℓ(h, z_i) ≤ E_S (1/N) Σ_{i=1}^N φ(z_i^l h(z_i^f))
                                  ≤ φ(h, Z) + E_σ Σ_j (2 w_j / N) sup_{h∈H} Σ_{i∈C_j} σ_i φ(z_i^l h(z_i^f)) + 3 √( χ(ϕ) ln(2/δ) / (2N) )

where ℓ(h, z_i) = [[z_i^l h(z_i^f) ≤ 0]], and the last inequality holds with probability at least 1 - δ over samples S, from theorem 4. Notice that when σ_{C_{jk}} is a Rademacher variable, it has the same distribution as z^l_{C_{jk}} σ_{C_{jk}}, since z^l_{C_{jk}} ∈ {-1, 1}. Thus, using the first result of theorem 6, we have that with probability 1 - δ over the samples S, all h in H satisfy:

    E_S (1/N) Σ_{i=1}^N ℓ(h, z_i) ≤ (1/N) Σ_{i=1}^N φ(z_i^l h(z_i^f)) + R*_n(H, S, ϕ) + 3 √( χ(ϕ) ln(2/δ) / (2N) )     (8)

Now, plugging into equation (8) the third point of theorem 6, with H = H_{K,B} as defined in theorem 6 and Z = X, we obtain the following theorem:

Theorem 7. Let S ∈ (X × {-1, 1})^n be a sample of size n drawn i.i.d. according to an unknown distribution D.
Then, with probability at least 1 - δ, all h ∈ H_{K,B} verify:

    E_S [[ y h(x) ≤ 0 ]] ≤ (1/n) Σ_{i=1}^n φ(y_i h(x_i)) + (2B/n) √( Σ_{i=1}^n ||x_i||²_K ) + 3 √( ln(2/δ) / (2n) )

and:

    E{ [[ h(x) ≤ h(x') ]] | y = 1, y' = -1 } ≤ (1/(n⁺_S n⁻_S)) Σ_{i=1}^{n⁺_S} Σ_{j=1}^{n⁻_S} φ( h(x_{σ(i)}) - h(x_{ν(j)}) )
        + (2B/(n⁺_S n⁻_S)) √( max(n⁺_S, n⁻_S) Σ_{i=1}^{n⁺_S} Σ_{j=1}^{n⁻_S} ||x_{σ(i)} - x_{ν(j)}||²_K ) + 3 √( ln(2/δ) / (2 min(n⁺_S, n⁻_S)) )

where n⁺_S and n⁻_S are the numbers of positive and negative instances in S, and σ and ν also depend on S and are such that x_{σ(i)} is the i-th positive instance in S and x_{ν(j)} the j-th negative instance.

(Footnote 4: Remark that φ is upper bounded by the slack variables of the SVM optimization problem (see e.g. [12]).)

It is to be noted that when h ∈ H_{K,B} with a non-linear kernel, the same bounds apply, with, for the case of bipartite ranking, ||x_{σ(i)} - x_{ν(j)}||²_K replaced by ||x_{σ(i)}||²_K + ||x_{ν(j)}||²_K - 2 K(x_{σ(i)}, x_{ν(j)}).

For binary classification, we recover the bounds of [12], since our framework is a generalization of their approach. As expected, the bounds suggest that kernel machines will generalize well for bipartite ranking. Thus, we recover the results of [2], obtained in a specific framework of algorithmic stability. However, our bound suggests that the convergence rate is controlled by 1/min(n⁺_S, n⁻_S), while their results suggested 1/n⁺_S + 1/n⁻_S.
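The right-hand side of the bipartite ranking bound is directly computable from a sample. The sketch below (the function names and the toy setting are ours) instantiates it for a linear kernel on the reals, so that ||x - x'||_K reduces to |x - x'|:

```python
import math

def phi_margin(t):
    """The 1-Lipschitz surrogate of section 5: phi(t) = min(1, max(1-t, 0)),
    an upper bound on the 0/1 loss [[t <= 0]]."""
    return min(1.0, max(1.0 - t, 0.0))

def ranking_bound_rhs(h, pos, neg, B, delta):
    """Our sketch of the right-hand side of the bipartite ranking bound of
    theorem 7, for a linear kernel on the reals; h is a real-valued scoring
    function, assumed to satisfy ||h||_K <= B."""
    np_, nm = len(pos), len(neg)
    # Empirical pairwise surrogate: (1/(n+ n-)) sum_{i,j} phi(h(x+) - h(x-)).
    emp = sum(phi_margin(h(xp) - h(xn))
              for xp in pos for xn in neg) / (np_ * nm)
    # Complexity term: (2B/(n+ n-)) sqrt(max(n+, n-) * sum ||x+ - x-||^2).
    sq = sum((xp - xn) ** 2 for xp in pos for xn in neg)
    cplx = 2.0 * B / (np_ * nm) * math.sqrt(max(np_, nm) * sq)
    # Confidence term: 3 sqrt(ln(2/delta) / (2 min(n+, n-))).
    conf = 3.0 * math.sqrt(math.log(2.0 / delta) / (2.0 * min(np_, nm)))
    return emp + cplx + conf

h = lambda x: x  # toy scorer, i.e. theta = 1 in H_{K,B} with B = 1
assert phi_margin(2.0) == 0.0 and phi_margin(-3.0) == 1.0
```

The empirical term vanishes exactly when every positive instance beats every negative one by a margin of at least one.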
The full proof, in which we follow the approach of [1], is given in [13].

6 Conclusion

We have presented a general framework for classifiers trained with interdependent data, and provided the necessary tools to study their generalization properties. It gives new insight into the close relationship between the binary classification task and bipartite ranking, and allows us to prove the first data-dependent bounds for this latter case. Moreover, the framework could also yield comparable bounds for other learning tasks.

Acknowledgments

This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.

References

[1] Agarwal S., Graepel T., Herbrich R., Har-Peled S., Roth D. (2005) Generalization error bounds for the area under the ROC curve, Journal of Machine Learning Research.
[2] Agarwal S., Niyogi P. (2005) Stability and generalization of bipartite ranking algorithms, Conference on Learning Theory 18.
[3] Bartlett P., Mendelson S. (2002) Rademacher and Gaussian complexities: risk bounds and structural results, Journal of Machine Learning Research 3, pp. 463-482.
[4] Boucheron S., Bousquet O., Lugosi G. (2004) Concentration inequalities, in O. Bousquet, U. von Luxburg, and G. Rätsch (editors), Advanced Lectures in Machine Learning, Springer, pp. 208-240.
[5] Clémençon S., Lugosi G., Vayatis N. (2005) Ranking and scoring using empirical risk minimization, Conference on Learning Theory 18.
[6] Cortes C., Mohri M. (2004) AUC optimization vs. error rate minimization, NIPS 2003.
[7] Freund Y., Iyer R.D., Schapire R.E., Singer Y. (2003) An efficient boosting algorithm for combining preferences, Journal of Machine Learning Research 4, pp. 933-969.
[8] Janson S. (2004) Large deviations for sums of partly dependent random variables, Random Structures and Algorithms 24, pp. 234-248.
[9] McDiarmid C. (1989) On the method of bounded differences, Surveys in Combinatorics.
[10] Meir R., Zhang T. (2003) Generalization error bounds for Bayesian mixture algorithms, Journal of Machine Learning Research 4, pp. 839-860.
[11] Rudin C., Cortes C., Mohri M., Schapire R.E. (2005) Margin-based ranking meets boosting in the middle, Conference on Learning Theory 18.
[12] Shawe-Taylor J., Cristianini N. (2004) Kernel Methods for Pattern Analysis, Cambridge University Press.
[13] Long version of this paper, available at http://www-connex.lip6.fr/~usunier/nips05-lv.pdf