{"title": "On the Theory of Learnining with Privileged Information", "book": "Advances in Neural Information Processing Systems", "page_first": 1894, "page_last": 1902, "abstract": "In Learning Using Privileged Information (LUPI) paradigm, along with the standard training data in the decision space, a teacher supplies a learner with the privileged information in the correcting space. The goal of the learner is to find a classifier with a low generalization error in the decision space. We consider a new version of empirical risk minimization algorithm, called Privileged ERM, that takes into account the privileged information in order to find a good function in the decision space. We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have much faster learning rate in the decision space than the one of the regular empirical risk minimization.", "full_text": "On the Theory of Learning with Privileged\n\nInformation\n\nDmitry Pechyony\nNEC Laboratories\n\nPrinceton, NJ 08540, USA\n\npechyony@nec-labs.com\n\nVladimir Vapnik\nNEC Laboratories\n\nPrinceton, NJ 08540, USA\nvlad@nec-labs.com\n\nAbstract\n\nIn Learning Using Privileged Information (LUPI) paradigm, along with the stan-\ndard training data in the decision space, a teacher supplies a learner with the priv-\nileged information in the correcting space. The goal of the learner is to \ufb01nd a\nclassi\ufb01er with a low generalization error in the decision space. We consider an\nempirical risk minimization algorithm, called Privileged ERM, that takes into ac-\ncount the privileged information in order to \ufb01nd a good function in the decision\nspace. 
We outline the conditions on the correcting space that, if satisfied, allow Privileged ERM to have a much faster learning rate in the decision space than that of regular empirical risk minimization.

1 Introduction

In the classical supervised machine learning paradigm the learner is given a labeled training set of examples, and her goal is to find a decision function with a small generalization error on the unknown test examples. If the learning problem is easy (e.g. if the learner's space of decision functions contains one with zero generalization error) then, as the training size increases, the decision function found by the learner converges quickly to the optimal one. However, if the learning problem is hard and the learner's space of decision functions is large, then the convergence (or learning) rate is slow. An example of such a hard learning problem is XOR when the space of decision functions is 2-dimensional hyperplanes.

The obvious question is "Can we accelerate the learning rate if the learner is given additional information about the learning problem?". In recent years several new paradigms of learning with additional information were proposed that, under some conditions, provably accelerate the learning rate. For example, in semi-supervised learning such additional information is unlabeled training examples.

In this paper we consider the recently proposed Learning Using Privileged Information (LUPI) paradigm [8, 9, 10], which uses additional information of a different kind. Let X be a decision space. In the LUPI paradigm, in addition to the standard training data (x, y) ∈ X × Y, a teacher supplies the learner with privileged information x* in the correcting space X*. The privileged information is only available for the training examples and is never available for the test examples.
The LUPI paradigm requires, given a training set $\{(x_i, x_i^*, y_i)\}_{i=1}^n$, to find a decision function h : X → Y with a small generalization error on the unknown test examples x ∈ X.

The above question about accelerating the learning rate, reformulated in terms of the LUPI paradigm, is "What kind of additional information should the teacher provide to the learner in order to accelerate her learning rate?". Paraphrased, this question is essentially "Who is a good teacher?". In this paper we outline the conditions on the additional information provided by the teacher that allow for a fast learning rate even in hard problems.

The LUPI paradigm emerges in a number of applications, for example time series prediction, protein classification and human computation. The experiments [9] in these domains demonstrated a clear advantage of the LUPI paradigm over supervised learning.

The LUPI paradigm can be implemented by the SVM+ algorithm [8], which in turn is based on the well-known SVM algorithm [2]. We now present the version of SVM+ for classification; the version for regression can be found in [9]. Let $h(x) = \mathrm{sign}(w \cdot x + b)$ be a decision function and $\phi(x_i^*) = w^* \cdot x_i^* + d$ be a correcting function. The optimization problem of SVM+ is

$$\min_{w,b,w^*,d} \ \frac{1}{2}\|w\|_2^2 + \frac{\gamma}{2}\|w^*\|_2^2 + C\sum_{i=1}^n (w^* \cdot x_i^* + d) \quad (1)$$
$$\text{s.t.} \quad \forall\, 1 \le i \le n, \ \ y_i (w \cdot x_i + b) \ge 1 - (w^* \cdot x_i^* + d),$$
$$\forall\, 1 \le i \le n, \ \ w^* \cdot x_i^* + d \ge 0.$$

The objective function of SVM+ contains two hyperparameters, C > 0 and γ > 0.
The term $\frac{\gamma}{2}\|w^*\|_2^2$ in (1) is intended to restrict the capacity (or VC-dimension) of the function space containing φ.

Let $\ell_X(h(x), y) = 1 - y(w \cdot x + b)$ be the hinge loss of the decision function h = (w, b) on the example (x, y) and $\ell_{X^*}(\phi(x^*)) = [w^* \cdot x^* + d]_+$ be the loss of the correcting function φ = (w*, d) on the example x*. The optimization problem (1) can be rewritten as

$$\min_{h=(w,b),\,\phi=(w^*,d)} \ \frac{1}{2}\|w\|_2^2 + \frac{\gamma}{2}\|w^*\|_2^2 + C\sum_{i=1}^n \ell_{X^*}(\phi(x_i^*)) \quad (2)$$
$$\text{s.t.} \quad \forall\, 1 \le i \le n, \ \ \ell_X(h(x_i), y_i) \le \ell_{X^*}(\phi(x_i^*)).$$

The following optimization problem is a simplified and generalized version of (2):

$$\min_{h \in H,\, \phi \in \Phi} \ \sum_{i=1}^n \ell_{X^*}(\phi(x_i^*), y_i) \quad (3)$$
$$\text{s.t.} \quad \forall\, 1 \le i \le n, \ \ \ell_X(h(x_i), y_i) \le \ell_{X^*}(\phi(x_i^*), y_i), \quad (4)$$

where $\ell_X$ and $\ell_{X^*}$ are arbitrary bounded loss functions, H is a space of decision functions and Φ is a space of correcting functions. Let C > 0 be a constant (defined later), $[t]_+ = \max(t, 0)$, and

$$\ell_0((h, \phi), (x, x^*, y)) = \frac{1}{C} \cdot \ell_{X^*}(\phi(x^*), y) + \left[\ell_X(h(x), y) - \ell_{X^*}(\phi(x^*), y)\right]_+ \quad (5)$$

be the loss of the composite hypothesis (h, φ) on the example (x, x*, y).
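As a quick illustration, the composite loss (5) is straightforward to compute; the plain-Python sketch below (with invented loss values and C) shows its two regimes, depending on whether the constraint (4) holds on the example:

```python
def l0(loss_X, loss_Xstar, C):
    """Composite loss (5): (1/C) * l_X*(phi(x*), y) + [l_X(h(x), y) - l_X*(phi(x*), y)]_+ .

    loss_X     -- l_X(h(x), y), the decision-space loss on this example
    loss_Xstar -- l_X*(phi(x*), y), the correcting-space loss on this example
    C          -- the constant C > 0 from the text
    """
    return loss_Xstar / C + max(loss_X - loss_Xstar, 0.0)

# If the correcting loss dominates the decision loss (constraint (4) holds),
# only the first term is paid; otherwise the violation is added on top.
print(l0(0.0, 0.5, 2.0))   # 0.25 : (4) holds, pay l_X*/C only
print(l0(1.0, 0.5, 2.0))   # 0.75 : 0.25 plus the violation 1.0 - 0.5
```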
In this paper we study the following relaxation of (3):

$$\min_{h \in H,\, \phi \in \Phi} \ \sum_{i=1}^n \ell_0((h, \phi), (x_i, x_i^*, y_i)). \quad (6)$$

We refer to the learning algorithm defined by the optimization problem (6) as empirical risk minimization with privileged information, or Privileged ERM for short.

The basic assumption of Privileged ERM is that if we can achieve a small loss $\ell_{X^*}(\phi(x^*), y)$ in the correcting space then we should also achieve a small loss $\ell_X(h(x), y)$ in the decision space. This assumption reflects the human learning process, where the teacher tells the learner which are the most important examples (the ones with a small loss in the correcting space) that the learner should take into account in order to find a good decision rule.

The regular empirical risk minimization (ERM) finds a hypothesis $\hat{h} \in H$ that minimizes the training error $\sum_{i=1}^n \ell_X(h(x_i), y_i)$. While regular ERM minimizes the training error of h directly, Privileged ERM minimizes the training error of h indirectly, via the minimization of the training error of the correcting function φ and the relaxation of the constraint (4).

Let h* be the best possible decision function (in terms of generalization error) in the hypothesis space H.
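For intuition, Privileged ERM (6) over small finite classes can be run by brute-force enumeration. The toy sketch below is not the paper's algorithm for any real hypothesis space; the threshold classes, the 0/1 losses, the training triples and the value of C are all invented for illustration:

```python
# Brute-force Privileged ERM (6): enumerate finite classes H and Phi and pick
# the pair minimizing the empirical composite loss sum_i l0((h, phi), ...).
C = 2.0

def l01(pred, y):                        # 0/1 loss, used in both spaces
    return 0.0 if pred == y else 1.0

def l0(lX, lXs):                         # composite loss (5)
    return lXs / C + max(lX - lXs, 0.0)

# training triples (x, x*, y): x* is the privileged description of x
train = [(0.5, 0.2, -1), (1.5, 0.4, -1), (2.5, 1.6, 1), (3.5, 1.8, 1)]

H   = [lambda x, t=t: 1 if x  > t else -1 for t in (0.0, 1.0, 2.0, 3.0)]
Phi = [lambda xs, t=t: 1 if xs > t else -1 for t in (0.0, 0.5, 1.0, 1.5)]

def emp_risk(h, phi):
    return sum(l0(l01(h(x), y), l01(phi(xs), y)) for x, xs, y in train)

h_hat, phi_hat = min(((h, p) for h in H for p in Phi),
                     key=lambda hp: emp_risk(*hp))
print([h_hat(x) for x, _, _ in train])   # here: perfect separation, [-1, -1, 1, 1]
```

Note that the decision function is selected only through the composite loss: a decision threshold that violates the constraint (4) on some example pays the positive-part penalty and loses to the feasible one.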
Suppose that for each training example $x_i$ an oracle gives us the value of the loss $\ell_X(h^*(x_i), y_i)$. We use these fixed losses instead of $\ell_{X^*}(\phi(x_i^*), y_i)$ and find an h that satisfies the following system of inequalities:

$$\forall\, 1 \le i \le n, \quad \ell_X(h(x_i), y_i) \le \ell_X(h^*(x_i), y_i). \quad (7)$$

We denote the learning algorithm defined by (7) as OracleERM. A straightforward generalization of the proof of Proposition 1 of [9] shows that the generalization error of the hypothesis $\hat{h}$ found by OracleERM converges to that of h* with the rate of 1/n. This rate is much faster than the worst-case convergence rate $1/\sqrt{n}$ of regular ERM [3].

In this paper we consider a more realistic setting, where the above oracle is not available. Our subsequent derivations rely heavily on the following definition:

Definition 1.1 A decision function h is uniformly better than the correcting function φ if for any example (x, x*, y) that has non-zero probability, $\ell_{X^*}(\phi(x^*), y) \ge \ell_X(h(x), y)$.

Given a space H of decision functions and a space Φ of correcting functions we define

$$\bar{\Phi} = \{\phi \in \Phi \mid \exists h \in H \text{ that is uniformly better than } \phi\}.$$

Note that $\bar{\Phi} \subseteq \Phi$ and that $\bar{\Phi}$ does not contain correcting functions that are too good for H.
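Definition 1.1 and the set $\bar{\Phi}$ can be checked mechanically when the support of the distribution is finite. In the sketch below everything (the classes H and Φ, the support, the 0/1 losses) is a toy assumption; the middle correcting function is "too good" for H and is therefore excluded from $\bar{\Phi}$:

```python
# Checking Definition 1.1 on a finite support: h is uniformly better than phi
# iff l_X*(phi(x*), y) >= l_X(h(x), y) on every (x, x*, y) of non-zero probability.
def l01(pred, y):
    return 0 if pred == y else 1

support = [(0.5, 0.2, -1), (1.5, 0.4, -1), (2.5, 1.6, 1), (3.5, 1.8, 1)]

H   = [lambda x, t=t: 1 if x  > t else -1 for t in (1.0, 3.0)]
Phi = [lambda xs, t=t: 1 if xs > t else -1 for t in (0.1, 1.0, 1.7)]

def uniformly_better(h, phi):
    return all(l01(phi(xs), y) >= l01(h(x), y) for x, xs, y in support)

# Phi_bar keeps only correcting functions dominated by some h in H.
Phi_bar = [phi for phi in Phi if any(uniformly_better(h, phi) for h in H)]
print(len(Phi_bar), "of", len(Phi), "correcting functions are in Phi_bar")
```

The correcting function with threshold 1.0 is error-free on this support, while every h in this H makes a mistake somewhere, so no h dominates it and it falls outside $\bar{\Phi}$.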
Our results are based on the following two assumptions:

Assumption 1.2 $\bar{\Phi} \neq \emptyset$.

This assumption is not restrictive, since it only means that the optimization problem (3) of Privileged ERM has a feasible solution when the training size goes to infinity.

Assumption 1.3 There exists a correcting function $\bar{\phi} \in \Phi$ such that for any (x, x*, y) that has non-zero probability, $\ell_X(h^*(x), y) = \ell_{X^*}(\bar{\phi}(x^*), y)$.

Put another way, we assume the existence of a correcting function in Φ that mimics the losses of h*.

Let r be the learning rate of Privileged ERM when it is run over the joint X × X* space with the space H × Φ of decision and correcting functions. We develop an upper bound for the risk of the decision function found by Privileged ERM. Under the above assumptions this bound converges to the risk of h* at the same rate r. This implies that if the correcting space is good, so that Privileged ERM in the joint X × X* space has a fast learning rate (e.g. 1/n), then Privileged ERM will have the same fast learning rate (e.g. the same 1/n) in the decision space. That is true even if the decision space is hard and regular ERM in the decision space has a slow learning rate (e.g. $1/\sqrt{n}$). We illustrate this result with an artificial learning problem, where regular ERM in the decision space cannot learn with a rate faster than $1/\sqrt{n}$, but the correcting space is good and Privileged ERM learns in the decision space with the rate of 1/n.

The paper has the following structure. In Section 2 we give additional definitions. In Section 3 we review the existing risk bounds that are used to derive our results. Section 4 contains the proof of the risk bound for Privileged ERM. In Section 5 we show an example where Privileged ERM is provably better than regular ERM.
We conclude and give directions for future research in Section 6. Due to space constraints, most of the proofs appear in the supplementary material.

Previous work

The first attempt at a theoretical analysis of LUPI was made by Vapnik and Vashist [9]. In addition to the analysis of learning with an oracle (mentioned above), they considered an algorithm which is close to, but different from, Privileged ERM. They developed a risk bound (Proposition 2 in [9]) for the decision function found by their algorithm. This bound also applies to Privileged ERM. The bound of [9] is tailored to the classification setting, with 0/1 loss functions in the decision and the correcting space. By contrast, our bound holds for any bounded loss functions and allows the loss functions $\ell_X$ and $\ell_{X^*}$ to be different. The bound of [9] depends on the generalization error of the correcting function $\hat{\phi}$ found by Privileged ERM. Vapnik and Vashist [9] concluded that if one could bound the convergence rate of $\hat{\phi}$, then that bound would imply a bound on the convergence rate of the decision function found by their algorithm.

2 Definitions

The triple (x, x*, y) is sampled from a distribution D, which is unknown to the learner. We denote by $D_X$ the marginal distribution over (x, y) and by $D_{X^*}$ the marginal distribution over (x*, y). The distribution $D_X$ is given by nature and the distribution $D_{X^*}$ is constructed by the teacher. The spaces H and Φ of decision and correcting functions are chosen by the learner.

Let $R(h) = \mathbb{E}_{(x,y)\sim D_X}\{\ell_X(h(x), y)\}$ and $R(\phi) = \mathbb{E}_{(x^*,y)\sim D_{X^*}}\{\ell_{X^*}(\phi(x^*), y)\}$ be the generalization errors of the decision function h and the correcting function φ respectively. We assume that the loss functions $\ell_X$ and $\ell_{X^*}$ have range [0, 1].
This assumption can be satisfied by any bounded loss function, by simply dividing it by its maximal value. We denote by $h^* = \arg\min_{h \in H} R(h)$ and $\phi^* = \arg\min_{\phi \in \Phi} R(\phi)$ the decision and the correcting function with the minimal generalization error w.r.t. the loss functions $\ell_X$ and $\ell_{X^*}$. Also, we denote by $\ell_{01}$ the 0/1 loss, by $R_{01}(h) = \mathbb{E}_{(x,y)\sim D_X}\{\ell_{01}(h(x), y)\}$ the generalization error of h w.r.t. the 0/1 loss, and by $h^*_{01} = \arg\min_{h \in H} R_{01}(h)$ the decision function in H with the minimal generalization 0/1 error.

Let $R^0_n(h, \phi) = \frac{1}{n}\sum_{i=1}^n \ell_0((h, \phi), (x_i, x_i^*, y_i))$ and

$$R^0(h, \phi) = \mathbb{E}_{(x,x^*,y)\sim D}\{\ell_0((h, \phi), (x, x^*, y))\} \quad (8)$$

be respectively the empirical and generalization errors of the hypothesis (h, φ) w.r.t. the loss function $\ell_0$. We denote by $(\hat{h}, \hat{\phi}) = \arg\min_{(h,\phi) \in H \times \Phi} R^0_n(h, \phi)$ the empirical risk minimizer and by

$$(h^0, \phi^0) = \arg\min_{(h,\phi) \in H \times \Phi} R^0(h, \phi)$$

the minimizer of the generalization error w.r.t. the loss function $\ell_0$. Note that in general h* can be different from $h^0$, and also $\phi^0$ can be different from φ*.

Let

$$\overline{(H, \Phi)} = \{(h, \phi) \in H \times \Phi \mid h \text{ is uniformly better than } \phi\}.$$

By Assumption 1.2, $\overline{(H, \Phi)} \neq \emptyset$.
We will use an additional technical assumption:

Assumption 2.1 There exists a constant A > 0 such that

$$\inf\left\{\mathbb{E}_{(x,x^*,y)\sim D}\left\{[\ell_X(h(x), y) - \ell_{X^*}(\phi(x^*), y)]_+\right\} \,\middle|\, (h, \phi) \notin \overline{(H, \Phi)},\ R(\phi) < R(\bar{\phi})\right\} \ge A.$$

This assumption is satisfied, for example, in the classification setting when $\ell_X$ and $\ell_{X^*}$ are 0/1 loss functions and the probability density function p(x, x*, y) of the underlying distribution D is bounded away from zero on all points with non-zero probability. In this case $A \ge \inf\{p(x, x^*, y) \mid (x, x^*, y) \text{ such that } p(x, x^*, y) \neq 0\}$.

The following lemma (proved in Appendix A in the full version of the paper) shows that for sufficiently large C the optimization problems (3) and (6) are asymptotically (when n → ∞) equivalent:

Lemma 2.2 Suppose that Assumptions 1.2, 1.3 and 2.1 hold. Then there exists a finite $C_1 \in \mathbb{R}$ such that for any $C \ge C_1$, $(h^0, \phi^0) \in \overline{(H, \Phi)}$. Moreover, $h^0 = h^*$ and $\phi^0 = \bar{\phi}$.

In all our subsequent derivations we assume that C has a finite value for which (3) and (6) are equivalent. Later on we show how to choose the value of C that optimizes the forthcoming risk bound.

The risk bounds presented in this paper are based on the VC-dimension of various function classes. While the definition of VC-dimension for binary functions is well known in the learning community, the one for real-valued functions is less known and we review it here. Let F be a set of real-valued functions f : S → ℝ and let $T(F) = \{(x, t) \in S \times \mathbb{R} \mid \exists f \in F \text{ s.t. } 0 \le |f(x)| \le t\}$. We say that the set $T = \{(x_i, t_i)\}_{i=1}^{|T|} \subseteq T(F)$ is shattered by F if for any $T' \subseteq T$ there exists a function $f \in F$ such that for any $(x_i, t_i) \in T'$, $|f(x_i)| \le t_i$, and for any $(x_i, t_i) \in T \setminus T'$, $|f(x_i)| > t_i$.
The VC-dimension of F is defined as the VC-dimension of the set T(F), namely the maximal size of a set $T \subseteq T(F)$ that is shattered by F.

3 Review of existing excess risk bounds with fast convergence rates

We derive our risk bounds from the generic excess risk bounds developed by Massart and Nedelec [6] and generalized by Gine and Koltchinskii [4] and Koltchinskii [5]. In this paper we use the version of the bounds given in [4] and [5].

Let F be a space of hypotheses f : S → S′ and let ℓ : S′ × {−1, +1} → ℝ be a real-valued loss function such that 0 ≤ ℓ(f(x), y) ≤ 1 for any f ∈ F and any (x, y). Let $f^* = \arg\min_{f \in F} \mathbb{E}_{(x,y)}\{\ell(f(x), y)\}$, $\hat{f}_n = \arg\min_{f \in F} \sum_{i=1}^n \ell(f(x_i), y_i)$, and let D > 0 be a constant such that for any f ∈ F,

$$\mathrm{Var}_{(x,y)}\{\ell(f(x), y) - \ell(f^*(x), y)\} \le D \cdot \mathbb{E}_{(x,y)}\{\ell(f(x), y) - \ell(f^*(x), y)\}. \quad (9)$$

This condition is a generalization of Tsybakov's low-noise condition [7] to arbitrary loss functions and arbitrary hypothesis spaces.

Figure 1: Visualization of the hypothesis spaces. (a) Hypothesis space with small D. (b) Hypothesis space with large D. The horizontal axis measures the distance (in terms of the variance) between a hypothesis f and the best hypothesis f* in F. The vertical axis is the minimal error of hypotheses in F at a fixed distance from f*. Note that the error function displayed in the graphs can be discontinuous. The large value of D in the hypothesis space in graph (b) is caused by hypothesis A, which differs significantly from f* but has nearly-optimal error.

The constant D in (9) characterizes the error surface of the hypothesis space F. Suppose that $\mathbb{E}_{(x,y)}\{\ell(f(x), y) - \ell(f^*(x), y)\}$ is very small, namely f is nearly optimal.
If f is almost the same as f* then the variance on the left-hand side of (9), as well as the value of D, will be small. But if f differs significantly from f* then the variance on the left-hand side of (9), as well as the value of D, will be large. Thus, if we take the variance on the left-hand side of (9) as a measure of distance between f and f*, then the hypothesis spaces with large and small D can be visualized as shown in Figure 1.

Let V be the VC-dimension of F. The following theorem is a straightforward generalization of Theorem 5.8 in [5].

Theorem 3.1 ([5]) There exists a constant K > 0 such that if $n > V \cdot D^2$ then for any δ > 0, with probability of at least 1 − δ,

$$\mathbb{E}_{(x,y)}\{\ell(\hat{f}(x), y)\} \le \mathbb{E}_{(x,y)}\{\ell(f^*(x), y)\} + K D\,\frac{V \log\frac{n}{V D^2} + \ln\frac{1}{\delta}}{n}. \quad (10)$$

Let $B = (V \log n + \log(1/\delta))/n$. If the condition of Theorem 3.1 does not hold, namely if $n \le V \cdot D^2$, then we can use the following fallback risk bound:

Theorem 3.2 ([1, 8]) There exists a constant $K_0$ such that for any δ > 0, with probability of at least 1 − δ,

$$\mathbb{E}_{(x,y)}\{\ell(\hat{f}(x), y)\} \le \mathbb{E}_{(x,y)}\{\ell(f^*(x), y)\} + K_0\left(\sqrt{\mathbb{E}_{(x,y)}\{\ell(f^*(x), y)\}\, B} + B\right). \quad (11)$$

Definition 3.3 Let $T = T(\mathbb{E}_{(x,y)}\{\ell(f^*(x), y)\}, V, \delta)$ be a constant such that for all n < T it holds that $\mathbb{E}_{(x,y)}\{\ell(f^*(x), y)\} < B$.

For n ≤ T the bound (11) has a convergence rate of 1/n, and for n > T the bound (11) has a convergence rate of $1/\sqrt{n}$. The main difference between (10) and (11) is the fast convergence rate of 1/n vs. the slow one of $1/\sqrt{n}$ in the regime of $n > \max(T, V \cdot D^2)$. By Theorem 3.1, starting from $n > n(D) = V \cdot D^2$ we always have the convergence rate of 1/n.
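To see the two regimes numerically, one can plug assumed values into the excess-risk terms of (10) and (11). The constants K and $K_0$ are unspecified in the theorems, so they are set to 1 below purely for visualization; the values of V, D, δ and the best-in-class error are likewise invented:

```python
import math

def B(n, V, delta):
    """The quantity B = (V log n + log(1/delta)) / n from the text."""
    return (V * math.log(n) + math.log(1.0 / delta)) / n

def bound_fast(n, V, D, delta):
    """Excess-risk term of (10), with K = 1; valid only for n > V * D**2."""
    return D * (V * math.log(n / (V * D * D)) + math.log(1.0 / delta)) / n

def bound_slow(n, V, D, delta, err_star):
    """Excess-risk term of the fallback bound (11), with K_0 = 1."""
    b = B(n, V, delta)
    return math.sqrt(err_star * b) + b

V, D, delta, err_star = 2, 3.0, 0.05, 0.25
print("fast regime starts at n(D) =", V * D * D)
for n in (50, 500, 5000):
    print(n, round(bound_fast(n, V, D, delta), 4),
             round(bound_slow(n, V, D, delta, err_star), 4))
```

As n grows past n(D), the 1/n term of (10) drops well below the $1/\sqrt{n}$ term of (11), which is the whole point of the fast-rate regime.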
Thus, the smaller the value of D, the smaller the threshold n(D) for obtaining the fast convergence rate of 1/n.

4 Upper Risk Bound

For any C ≥ 1, any (x, x*, y), any h ∈ H and φ ∈ Φ, and any loss functions $\ell_X$ and $\ell_{X^*}$,

$$\ell_X(h(x), y) \le \ell_{X^*}(\phi(x^*), y) + C\,[\ell_X(h(x), y) - \ell_{X^*}(\phi(x^*), y)]_+.$$

Hence, using (5), we obtain that

$$R(\hat{h}) = \mathbb{E}_{(x,y)}\{\ell_X(\hat{h}(x), y)\} \le C \cdot \mathbb{E}_{(x,x^*,y)}\left\{\ell_0((\hat{h}, \hat{\phi}), (x, x^*, y))\right\} = C \cdot R^0(\hat{h}, \hat{\phi}). \quad (12)$$

Let $\ell_1(h, h^*, x, y) = \ell_X(h(x), y) - \ell_X(h^*(x), y)$ and let $D_H \ge 0$ be a constant such that for any h ∈ H,

$$D_H \cdot \mathbb{E}_{(x,y)}\{\ell_1(h, h^*, x, y)\} \ge \mathrm{Var}_{(x,y)}\{\ell_1(h, h^*, x, y)\}. \quad (13)$$

Similarly, let $\ell_2(h, h^0, \phi, \phi^0, x, x^*, y) = \ell_0((h, \phi), (x, x^*, y)) - \ell_0((h^0, \phi^0), (x, x^*, y))$ and let $D_{H,\Phi} \ge 0$ be a constant such that for all (h, φ) ∈ H × Φ,

$$D_{H,\Phi} \cdot \mathbb{E}_{(x,x^*,y)}\{\ell_2(h, h^0, \phi, \phi^0, x, x^*, y)\} \ge \mathrm{Var}_{(x,x^*,y)}\{\ell_2(h, h^0, \phi, \phi^0, x, x^*, y)\}. \quad (14)$$

Let $L(H, \Phi) = \{\ell_0((h, \phi), (\cdot, \cdot, \cdot)) \mid h \in H, \phi \in \Phi\}$ be the set of loss functions $\ell_0$ corresponding to hypotheses from H × Φ, and let $V_{L(H,\Phi)}$ be the VC-dimension of L(H, Φ). Similarly, let $L(H) = \{\ell_X(h(\cdot), \cdot) \mid h \in H\}$ and $L(\Phi) = \{\ell_{X^*}(\phi(\cdot), \cdot) \mid \phi \in \Phi\}$ be the sets of loss functions that correspond to the hypotheses in H and Φ, and let $V_{L(H)}$ and $V_{L(\Phi)}$ be the VC-dimensions of L(H) and L(Φ) respectively.
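On a finite distribution, Bernstein-type constants such as $D_H$ in (13) can be computed exactly by comparing the variance and the mean of the excess loss for every hypothesis. The sketch below uses a toy four-point distribution and threshold hypotheses; it is not the Section 5 construction, whose probability masses are only given in Figure 2:

```python
# Exact computation of D_H from (13) on a finite distribution:
# D_H = max over h != h* of Var(excess loss) / E(excess loss).
points = [(-1.5, -1), (-0.5, -1), (0.5, 1), (1.5, 1)]   # (x, y), toy values
probs  = [0.1, 0.4, 0.4, 0.1]

def l01(pred, y):
    return 0.0 if pred == y else 1.0

def h(t):
    return lambda x: 1 if x > t else -1

H = [h(0.0), h(1.0), h(-1.0)]

def risk(f):
    return sum(p * l01(f(x), y) for p, (x, y) in zip(probs, points))

h_star = min(H, key=risk)                 # here: the zero-error threshold 0.0

def D_of(f):
    excess = [l01(f(x), y) - l01(h_star(x), y) for x, y in points]
    mean = sum(p * e for p, e in zip(probs, excess))
    var = sum(p * (e - mean) ** 2 for p, e in zip(probs, excess))
    return var / mean if mean > 0 else 0.0

D_H = max(D_of(f) for f in H)
print(round(D_H, 3))
```

With 0/1 losses the excess loss of each h is a shifted Bernoulli variable, so the ratio in (13) reduces to simple moment arithmetic, which is why small examples like this can be checked by hand.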
Note that if $\ell_X = \ell_{01}$ then $V_{L(H)}$ is also the VC-dimension of H (the same holds for $V_{L(\Phi)}$).

Lemma 4.1 $V_{L(H,\Phi)} = V_{L(H)} + V_{L(\Phi)}$.

Proof See Appendix C in the full version of the paper.

We apply Theorem 3.1 to the hypothesis space H × Φ and the loss function $\ell_0((h, \phi), (x, x^*, y))$ and obtain that there exists a constant K > 0 such that if $n > V_{L(H,\Phi)} \cdot D_{H,\Phi}^2$ then for any δ > 0, with probability at least 1 − δ,

$$R^0(\hat{h}, \hat{\phi}) \le R^0(h^0, \phi^0) + K D_{H,\Phi}\,\frac{V_{L(H,\Phi)} \ln\frac{n}{V_{L(H,\Phi)} D_{H,\Phi}^2} + \ln\frac{1}{\delta}}{n}. \quad (15)$$

Using (12) we obtain that

$$R(\hat{h}) \le C \cdot R^0(h^0, \phi^0) + C K D_{H,\Phi}\,\frac{V_{L(H,\Phi)} \ln\frac{n}{V_{L(H,\Phi)} D_{H,\Phi}^2} + \ln\frac{1}{\delta}}{n},$$

where K > 0 is a constant. It follows from Assumption 1.3 and Lemma 2.2 that

$$R^0(h^0, \phi^0) = \frac{1}{C} R(\phi^0) = \frac{1}{C} R(\bar{\phi}) = \frac{1}{C} R(h^*). \quad (16)$$

We substitute (16) into (15) and obtain that there exists a constant K > 0 such that if $n > V_{L(H,\Phi)} \cdot D_{H,\Phi}^2$ then for any δ > 0, with probability at least 1 − δ,

$$R(\hat{h}) \le R(h^*) + C K D_{H,\Phi}\,\frac{V_{L(H,\Phi)} \ln\frac{n}{V_{L(H,\Phi)} D_{H,\Phi}^2} + \ln\frac{1}{\delta}}{n}.$$

We bound $V_{L(H,\Phi)}$ by Lemma 4.1 and obtain our final risk bound, which is summarized in the following theorem:

Theorem 4.2 Suppose that Assumptions 1.2, 1.3 and 2.1 hold. Let $D_{H,\Phi}$ be as defined in (14), let $C_1$ be as defined in Lemma 2.2, and let $\bar{V}_{L(H,\Phi)} = V_{L(H)} + V_{L(\Phi)}$. Suppose that $C > C_1$ and $n > \bar{V}_{L(H,\Phi)} \cdot D_{H,\Phi}^2$.
Then for any δ > 0, with probability of at least 1 − δ,

$$R(\hat{h}) \le R(h^*) + C K D_{H,\Phi}\,\frac{\bar{V}_{L(H,\Phi)} \ln\frac{n}{\bar{V}_{L(H,\Phi)} \cdot D_{H,\Phi}^2} + \ln\frac{1}{\delta}}{n}. \quad (17)$$

According to this bound, $R(\hat{h})$ converges to $R(h^*)$ with the rate of 1/n. If Assumption 1.3 does not hold then it is easy to see that we obtain the same bound as (17), but with $R(h^*)$ replaced by $R(\phi^0)$. In this case the upper bound on $R(\hat{h})$ converges to $R(\phi^0)$ with the rate of 1/n.

We now provide further analysis of the risk bound (17). Let $\ell_3(\phi, \phi^0, x^*, y) = \ell_{X^*}(\phi(x^*), y) - \ell_{X^*}(\phi^0(x^*), y)$ and let $D_\Phi \ge 0$ be a constant such that for any φ ∈ Φ,

$$D_\Phi \cdot \mathbb{E}_{(x^*,y)}\{\ell_3(\phi, \phi^0, x^*, y)\} \ge \mathrm{Var}_{(x^*,y)}\{\ell_3(\phi, \phi^0, x^*, y)\}. \quad (18)$$

Similarly, let $D^0_{H,\Phi} \ge 0$ be a constant such that for all $(h, \phi) \in (H \times \Phi) \setminus \overline{(H, \Phi)}$,

$$D^0_{H,\Phi} \cdot \mathbb{E}_{(x,x^*,y)}\{\ell_2(h, h^0, \phi, \phi^0, x, x^*, y)\} \ge \mathrm{Var}_{(x,x^*,y)}\{\ell_2(h, h^0, \phi, \phi^0, x, x^*, y)\}.$$

Lemma 4.3 $D_{H,\Phi} \le \max\left(D_\Phi / C,\ D^0_{H,\Phi}\right)$.

Proof See Appendix B in the full version of the paper.

By Lemma 4.3, $C \cdot D_{H,\Phi} \le \max(D_\Phi, C \cdot D^0_{H,\Phi})$. Since the loss function $\ell_2$ depends on C, the constant $D^0_{H,\Phi}$ depends on C too. Thus, ignoring the logarithmic term in (17), the optimal value of C is the one that is larger than $C_1$ and minimizes $C \cdot D^0_{H,\Phi}$.
We now show that such a minimum indeed exists. By the definition of the loss function $\ell_2$,

$$0 < \lim_{C \to \infty}\ \sup_{(h,\phi) \in (H \times \Phi) \setminus \overline{(H, \Phi)}}\ \frac{\mathrm{Var}_{(x,x^*,y)}\{\ell_2(h, h^0, \phi, \phi^0, x, x^*, y)\}}{\mathbb{E}_{(x,x^*,y)}\{\ell_2(h, h^0, \phi, \phi^0, x, x^*, y)\}} \le 1. \quad (19)$$

Therefore for very large C it holds that $0 < s \le D^0_{H,\Phi} \le 1$, where s is the value of the above limit. Consequently $\lim_{C \to \infty} C \cdot D^0_{H,\Phi} = \infty$. Since the function $g(C) = C \cdot D^0_{H,\Phi}$ is continuous and finite at $C = C_1$, there exists a point $C = C^* \in [C_1, \infty)$ that minimizes it.

5 When Privileged ERM is provably better than the regular ERM

We show an example that demonstrates the difference between empirical risk minimization in the X space and empirical risk minimization with privileged information in the joint X × X* space. In particular, we show in this example that for not too small training sizes (as specified by the conditions of the bound (11) and Theorem 4.2) the learning rate of regular ERM in the X space is $1/\sqrt{n}$, while the learning rate of Privileged ERM in the joint X × X* space is 1/n.

We consider the classification setting, and all loss functions in our example are the 0/1 loss. Let $\mathcal{D}_X = \{D_X(\epsilon) \mid 0 < \epsilon < 0.1\}$ be an infinite family of distributions of examples in the X space. All distributions in $\mathcal{D}_X$ have non-zero support on four points, denoted by $X_1$, $X_2$, $X_3$ and $X_4$. We assume that these points lie on a 1-dimensional line, as shown in Figure 2(a). Figure 2(a) also shows the probability mass of each point in the distribution $D_X(\epsilon)$. The hypothesis space H consists of the hypotheses $h_t(x) = \mathrm{sign}(x - t)$ and $h'_t(x) = -\mathrm{sign}(x - t)$. The best hypothesis in H is $h'_1$ and its generalization error is $1/4 - 2\epsilon$. The hypothesis space H also contains the hypothesis $h'_3$, which is slightly worse than $h'_1$ and has a generalization error of $1/4 + \epsilon$.
It can be verified that for a fixed $D_X(\epsilon)$ and H the constant $D_H$ (defined in equation (13)) is

$$D_H = \frac{1}{6\epsilon} - \frac{1}{3} - \epsilon \le \frac{1}{6\epsilon}. \quad (20)$$

Note that the inequality in (20) is very tight, since ε can be arbitrarily small. The VC-dimension $V_H$ of H is 2. Suppose that ε is sufficiently small, such that $V_H \cdot D_H^2 > T(1/4 - 2\epsilon, V_H, \delta)$, where the function T(·, ·, ·) is defined in Definition 3.3. In order to use the risk bound (10) with our $D_X(\epsilon)$ and H, the condition

$$n > V_H \cdot D_H^2 = \frac{1}{18\epsilon^2} \quad (21)$$

should be satisfied. But since ε can be very small, condition (21) is not satisfied for a large range of n's. Hence, according to (11), for distributions $D_X(\epsilon)$ that satisfy $T(1/4 - 2\epsilon, 2, \delta) \le \frac{1}{18\epsilon^2}$ we obtain that $R_{01}(\hat{h})$ converges to $R_{01}(h^*)$ with the rate of at least $1/\sqrt{n}$. The following lower bound shows that $R_{01}(\hat{h})$ converges to $R_{01}(h^*)$ with the rate of at most $1/\sqrt{n}$.

Figure 2: The X and X* spaces. (a) X space. (b) X* space.

Lemma 5.1 Suppose that $\epsilon < 1/16$. Let $\delta_n = \exp(-20 n \epsilon^2)$. Then for any n > 256, with probability at least $\delta_n$,

$$R_{01}(\hat{h}) - R_{01}(h^*) \ge \sqrt{\frac{\ln(1/\delta_n)}{20 n}}.$$

By combining the upper and lower bounds we obtain that the convergence rate of $R_{01}(\hat{h})$ to $R_{01}(h^*)$ is exactly $1/\sqrt{n}$. The proof of the lower bound appears in Appendix D in the full version of the paper.

Suppose that the teacher constructed the distribution $D_{X^*}(\epsilon)$ of examples in the X* space in the following way. $D_{X^*}(\epsilon)$ has non-zero support on four points, denoted by $X^*_1$, $X^*_2$, $X^*_3$ and $X^*_4$, that lie on a 1-dimensional line, as shown in Figure 2(b).
Figure 2(b) shows the probability mass of each point in the X* space. We assume that the joint distribution over (X, X*) has non-zero support only on the points $(X_1, X^*_1)$, $(X_2, X^*_2)$, $(X_3, X^*_3)$ and $(X_4, X^*_4)$. The hypothesis space Φ consists of the hypotheses $\phi_t(x^*) = \mathrm{sign}(x^* - t)$ and $\phi'_t(x^*) = -\mathrm{sign}(x^* - t)$. The best hypothesis in Φ is $\phi'_2$ and its generalization error is 0. However, there is no h ∈ H that is uniformly better than $\phi'_2$. The best hypothesis in Φ, among those that have a uniformly better hypothesis in H, is $\phi'_1$, and its generalization error is $1/4 - 2\epsilon$; $h'_1$ is uniformly better than $\phi'_1$. It can be verified that for such $D_{X^*}(\epsilon)$ and Φ the constant $D_\Phi$ (defined in equation (18)) is

$$D_\Phi = \frac{11/16 - 3\epsilon - 4\epsilon^2}{1/4 + 2\epsilon} \le 2.75. \quad (22)$$

Note that the inequality in (22) is very tight, since ε can be arbitrarily small. Moreover, it can be verified that the C minimizing $C \cdot D^0_{H,\Phi}$ is $C^* = 2.6$. For $C = C^*$ it holds that $D^0_{H,\Phi} = 1.71$ and $D_\Phi / C = 1.06$. It is easy to see that our example satisfies Assumptions 1.2 and 1.3 (the latter is satisfied with $\bar{\phi} = -\phi'_1$). Also, it can be verified that Assumption 2.1 is satisfied with $A = 1/4 - 2\epsilon$, and that $C_1 = 1.1 < C^*$ satisfies Lemma 2.2. The VC-dimension of Φ is 2. Hence, by Theorem 4.2 and Lemma 4.3, if $n > (2 + 2) \cdot 1.71^2 = 11.7$ then $R_{01}(\hat{h})$ converges to $R_{01}(h^*)$ with the rate of at least 1/n. Since our bounds on $D_\Phi$ and $D^0_{H,\Phi}$ are independent of ε, the 1/n convergence rate holds for any distribution in $\mathcal{D}_X$.

We obtained that for $11.7 < n \le \frac{1}{18\epsilon^2}$ the upper bound (17) converges to $R_{01}(h^*)$ with the rate of 1/n, while the upper bound (11) converges to $R_{01}(h^*)$ with the rate of $1/\sqrt{n}$.
This improvement was possible due to the teacher's construction of $D_{X^*}(\epsilon)$ and the learner's choice of Φ. The hypothesis $h'_3$ caused the value of $D_H$ to be large and thus prevented a 1/n convergence rate for a large range of n's. We constructed $D_{X^*}(\epsilon)$ and Φ in such a way that Φ does not have a hypothesis φ with exactly the same dichotomy as the bad hypothesis $h'_3$. With such a construction, any φ ∈ Φ such that $h'_3$ is uniformly better than φ has a generalization error significantly larger than that of $h'_3$. For example, the best hypothesis in Φ for which $h'_3$ is uniformly better is $\phi'$, and its generalization error is 1/2.

6 Conclusions

We formulated the algorithm of empirical risk minimization with privileged information and derived a risk bound for it. Our risk bound outlines the conditions on the correcting space that, if satisfied, allow fast learning in the decision space, even if the original learning problem in the decision space is very hard. We showed an example where the privileged information provably and significantly improves the learning rate.

In this paper we showed that a good correcting space can improve the learning rate from $1/\sqrt{n}$ to 1/n. But, given a good correcting space, can we achieve a learning rate faster than 1/n? Another interesting problem is to analyze Privileged ERM when the learner does not completely trust the teacher. This condition translates into the constraint $\ell_X(h(x), y) \le \ell_{X^*}(\phi(x^*), y) + \epsilon$ in (3) and into the corresponding term $[\ell_X(h(x), y) - \ell_{X^*}(\phi(x^*), y)]_+$ in (6), where $\epsilon \ge 0$ is a hyperparameter.
Finally, an important direction is to develop risk bounds for SVM+ (which is a regularized version of Privileged ERM) and to show when it is provably better than SVM.

References

[1] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:329-375, 2005.
[2] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[3] L. Devroye and G. Lugosi. Lower bounds in pattern recognition and learning. Pattern Recognition, 28(7):1011-1018, 1995.
[4] E. Gine and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. Annals of Probability, 34(3):1143-1216, 2006.
[5] V. Koltchinskii. 2008 Saint Flour lectures: Oracle inequalities in empirical risk minimization and sparse recovery problems, 2008. Available at fodava.gatech.edu/files/reports/FODAVA-09-17.pdf.
[6] P. Massart and E. Nedelec. Risk bounds for statistical learning. Annals of Statistics, 34(5):2326-2366, 2006.
[7] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135-166, 2004.
[8] V. Vapnik. Estimation of Dependencies Based on Empirical Data. Springer-Verlag, 2nd edition, 2006.
[9] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5-6):544-557, 2009.
[10] V. Vapnik, A. Vashist, and N. Pavlovich. Learning using hidden information: Master class learning. In Proceedings of the NATO Workshop on Mining Massive Data Sets for Security, pages 3-14, 2008.
", "award": [], "sourceid": 1348, "authors": [{"given_name": "Dmitry", "family_name": "Pechyony", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}