{"title": "The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 593, "abstract": "We derive here new generalization bounds, based on Rademacher Complexity theory, for model selection and error estimation of linear (kernel) classifiers, which exploit the availability of unlabeled samples. In particular, two results are obtained: the first one shows that, using the unlabeled samples, the confidence term of the conventional bound can be reduced by a factor of three; the second one shows that the unlabeled samples can be used to obtain much tighter bounds, by building localized versions of the hypothesis class containing the optimal classifier.", "full_text": "The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

Davide Anguita, Alessandro Ghio, Luca Oneto, Sandro Ridella
Department of Biophysical and Electronic Engineering
University of Genova
Via Opera Pia 11A, I-16145 Genova, Italy
{Davide.Anguita,Alessandro.Ghio}@unige.it
{Luca.Oneto,Sandro.Ridella}@unige.it

Abstract

We derive here new generalization bounds, based on Rademacher Complexity theory, for model selection and error estimation of linear (kernel) classifiers, which exploit the availability of unlabeled samples. In particular, two results are obtained: the first one shows that, using the unlabeled samples, the confidence term of the conventional bound can be reduced by a factor of three; the second one shows that the unlabeled samples can be used to obtain much tighter bounds, by building localized versions of the hypothesis class containing the optimal classifier.

1 Introduction

Understanding the factors that influence the performance of a statistical procedure is a key step towards finding a way to improve it.
One of the most explored procedures in the machine learning approach to pattern classification aims at solving the well-known model selection and error estimation problem, which targets the estimation of the generalization error and the choice of the optimal predictor from a set of possible classifiers. To reach this target, several approaches have been proposed [1, 2, 3, 4], which provide an upper bound on the generalization ability of the classifier and can therefore be used for model selection purposes as well. Typically, all these bounds consist of three terms: the first is the empirical error of the classifier (i.e. the error on the training data); the second is a bias term that takes into account the complexity of the class of functions to which the classifier belongs; and the third is a confidence term, which depends on the cardinality of the training set. These approaches are quite interesting because they investigate the finite sample behavior of a classifier, instead of the asymptotic one, even though their practical applicability has been questioned for a long time^1. One of the most recent methods for obtaining such bounds exploits the Rademacher Complexity, a powerful statistical tool that has been deeply investigated during the last years [5, 6, 7]. This approach has been shown to be of practical use, outperforming more traditional methods [8, 9] for model selection in the small-sample regime [10, 5, 6], i.e. when the dimensionality of the samples is comparable to, or even larger than, the cardinality of the training set. We show in this work how its performance can be further improved by exploiting some extra knowledge on the problem.
In fact, real-world classification problems are often composed of datasets with labeled and unlabeled data [11, 12]: for this reason, an interesting challenge is finding a way to exploit the unlabeled data for obtaining tighter bounds and, therefore, better error estimations.

In this paper, we present two methods for exploiting the unlabeled data in the Rademacher Complexity theory [2]. First, we show how the unlabeled data can have a role in reducing the confidence term, by obtaining a new bound that takes into account both labeled and unlabeled data. Then, we propose a method, based on [7], which exploits the unlabeled data for selecting a better hypothesis space, to which the classifier belongs, resulting in a much sharper and more accurate bound.

^1 See, for example, the NIPS 2004 Workshop (Ab)Use of Bounds or the 2002 Neurocolt Workshop on Bounds less than 0.5.

2 Theoretical framework and results

We consider the following prediction problem: based on a random observation of X ∈ X ⊆ R^d, one has to estimate Y ∈ Y ⊆ {−1, 1} by choosing a suitable prediction rule f : X → [−1, 1]. The generalization error L(f) = E_{X,Y} ℓ(f(X), Y) associated to the prediction rule is defined through a bounded loss function ℓ(f(X), Y) : [−1, 1] × Y → [0, 1]. We observe a set of labeled samples D_{n_l} : {(X^l_1, Y^l_1), ..., (X^l_{n_l}, Y^l_{n_l})} and a set of unlabeled ones D_{n_u} : {X^u_1, ..., X^u_{n_u}}. The data consist of a sequence of independent, identically distributed (i.i.d.) samples with the same distribution P(X, Y) for D_{n_l} and D_{n_u}. The goal is to obtain a bound on L(f) that takes into account both the labeled and unlabeled data.
As we do not know the distribution that has generated the data, we do not know L(f) but only its empirical estimate L_{n_l}(f) = (1/n_l) Σ_{i=1}^{n_l} ℓ(f(X^l_i), Y^l_i). In the typical context of Structural Risk Minimization (SRM) [13] we define an infinite sequence of hypothesis spaces of increasing complexity {F_i, i = 1, 2, ...}, then we choose a suitable function space F_i and, consequently, a model f* ∈ F_i that fits the data. As we do not know the true data distribution, we can only say that:

{L(f) − L_{n_l}(f)}_{f∈F_i} ≤ sup_{f∈F_i} {L(f) − L_{n_l}(f)}   (1)

or, equivalently:

L(f) ≤ L_{n_l}(f) + sup_{f∈F_i} {L(f) − L_{n_l}(f)},   ∀f ∈ F_i   (2)

In this framework, the SRM procedure brings us to the following choice of the function space and the corresponding optimal classifier:

f*, F* : arg min_{F_i ∈ {F_1, F_2, ...}} [ min_{f∈F_i} L_{n_l}(f) + sup_{f∈F_i} {L(f) − L_{n_l}(f)} ]   (3)

Since the generalization bias (sup_{f∈F_i} {L(f) − L_{n_l}(f)}) is a random variable, it is possible to statistically analyze it and obtain a bound that holds with high probability [5].

From this point on, we will consider two types of prediction rule, each with its associated loss function:

f_H(x) = sign(w^T φ(x) + b),   ℓ_H(f_H(x), y) = (1 − y f_H(x)) / 2   (4)

f_S(x) = min(1, w^T φ(x) + b) if w^T φ(x) + b > 0, max(−1, w^T φ(x) + b) if w^T φ(x) + b ≤ 0,   ℓ_S(f_S(x), y) = (1 − y f_S(x)) / 2   (5)

where φ(·) : R^d → R^D with D ≫ d, w ∈ R^D and b ∈ R. The function φ(·) is introduced to allow for a later introduction of kernels, even though, for simplicity, we will focus only on the linear case.
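The two rules and their shared loss form are easy to state concretely; here is a minimal numpy sketch (the function names and the identity feature map φ(x) = x are our own illustrative choices, not the paper's):

```python
import numpy as np

def f_hard(w, b, x):
    # f_H(x) = sign(w^T phi(x) + b), with the identity feature map phi(x) = x
    return np.sign(w @ x + b)

def f_soft(w, b, x):
    # f_S(x): w^T phi(x) + b clipped to [-1, 1] (equivalent to the piecewise min/max above)
    return np.clip(w @ x + b, -1.0, 1.0)

def loss(fx, y):
    # both losses share the form (1 - y f(x)) / 2, bounded in [0, 1]
    return (1.0 - y * fx) / 2.0

w, b = np.array([1.0, -1.0]), 0.0
x, y = np.array([0.3, 0.1]), 1
print(loss(f_hard(w, b, x), y))  # 0.0: the hard loss is 0/1
print(loss(f_soft(w, b, x), y))  # 0.4: the ramp loss is graded
```

Both functions are bounded in [0, 1] and satisfy the symmetry ℓ(f(x), y) = 1 − ℓ(f(x), −y), which the derivations of Section 2.1 rely on.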
Note that both the hard loss ℓ_H(f_H(x), y) and the soft loss (or ramp loss) [14] ℓ_S(f_S(x), y) are bounded ([0, 1]) and symmetric (ℓ(f(x), y) = 1 − ℓ(f(x), −y)). Then, we recall the definition of Rademacher Complexity (R) for a class of functions F:

R̂_{n_l}(F) = E_σ sup_{f∈F} (2/n_l) Σ_{i=1}^{n_l} σ_i ℓ(f(x_i), y_i) = E_σ sup_{f∈F} (1/n_l) Σ_{i=1}^{n_l} σ_i f(x_i)   (6)

where σ_1, ..., σ_{n_l} are n_l independent Rademacher random variables, i.e. independent random variables for which P(σ_i = +1) = P(σ_i = −1) = 1/2, and the last equality holds if we use one of the losses defined before. Note that R̂ is a computable realization of the expected Rademacher Complexity R(F) = E_{X,Y} R̂(F). The most renowned result in Rademacher Complexity theory states that [2]:

L(f)_{f∈F} ≤ L_{n_l}(f)_{f∈F} + R̂_{n_l}(F) + 3 √(log(2/δ) / (2 n_l))   (7)

which holds with probability (1 − δ) and allows us to solve the problem of Eq. (3).

2.1 Exploiting unlabeled samples for reducing the confidence term

Assuming that the amount of unlabeled data is larger than the number of labeled samples, we split them in blocks of similar size by defining the quantity m = ⌊n_u/n_l⌋ + 1, so that we can consider a ghost sample D′_{m n_l} composed of m n_l patterns. Then, we can upper bound the expected generalization bias in the following way^2:

E_{X,Y} sup_{f∈F} {L(f) − L_{n_l}(f)} = E_{X,Y} sup_{f∈F} [ E_{X′,Y′} (1/m) Σ_{i=1}^m (1/n_l) Σ_{k=(i−1)·n_l+1}^{i·n_l} ℓ′_k − (1/n_l) Σ_{i=1}^{n_l} ℓ_i ]
≤ E_{X,Y} E_{X′,Y′} (1/m) Σ_{i=1}^m sup_{f∈F} [ (1/n_l) Σ_{k=(i−1)·n_l+1}^{i·n_l} (ℓ′_k − ℓ_{|k|_{n_l}}) ]
= E_{X,Y} E_{X′,Y′} E_σ (1/m) Σ_{i=1}^m sup_{f∈F} [ (1/n_l) Σ_{k=(i−1)·n_l+1}^{i·n_l} σ_{|k|_{n_l}} (ℓ′_k − ℓ_{|k|_{n_l}}) ]
≤ E_{X,Y} E_σ (1/m) Σ_{i=1}^m sup_{f∈F} [ (2/n_l) Σ_{k=(i−1)·n_l+1}^{i·n_l} σ_{|k|_{n_l}} ℓ_k ] = E_{X,Y} (1/m) Σ_{i=1}^m R̂^i_{n_l}(F)   (8)

where |k|_{n_l} = (k − 1) mod (n_l) + 1. The last quantity (that we call Expected Extended Rademacher Complexity E_{X,Y} R̂_{n_u}(F), where R̂_{n_u}(F) = (1/m) Σ_{i=1}^m R̂^i_{n_l}(F)) and the expected generalization bias are both deterministic quantities, and we know only one realization of them, dependent on the sample. Then, we can use McDiarmid's inequality [15] to obtain:

P[ sup_{f∈F} {L(f) − L_{n_l}(f)} ≥ R̂_{n_u}(F) + ε ]
≤ P[ sup_{f∈F} {L(f) − L_{n_l}(f)} ≥ E_{X,Y} sup_{f∈F} {L(f) − L_{n_l}(f)} + aε ] + P[ E_{X,Y} R̂_{n_u}(F) ≥ R̂_{n_u}(F) + (1 − a)ε ]
≤ e^{−2 n_l a² ε²} + e^{−(m n_l)(1−a)² ε² / 2}   (9)-(11)

with a ∈ [0, 1].
By choosing a = √m / (2 + √m), we can write:

P[ sup_{f∈F} {L(f) − L_{n_l}(f)} ≥ (1/m) Σ_{i=1}^m R̂^i_{n_l}(F) + ε ] ≤ 2 e^{−2 m n_l ε² / (2+√m)²}   (12)

and obtain an explicit bound which holds with probability (1 − δ):

L(f)_{f∈F} ≤ L_{n_l}(f)_{f∈F} + (1/m) Σ_{i=1}^m R̂^i_{n_l}(F) + ((2 + √m)/√m) √(log(2/δ) / (2 n_l))   (13)

where R̂^i_{n_l}(F) is the Rademacher Complexity of the class F computed on the i-th block of unlabeled data. Note that for m = 1 the training set does not contain any unlabeled data and the bound given by Eq. (7) is recovered, while for large m the confidence term is reduced by a factor of 3. At first sight, it would seem impossible to compute the term R̂^i_{n_l} without knowing the labels of the data, but it is easy to show that this is not the case. In fact, let us define K^+_i = {k ∈ {(i−1)·n_l + 1, ..., i·n_l} : σ_{|k|_{n_l}} = +1} and K^−_i = {k ∈ {(i−1)·n_l + 1, ..., i·n_l} : σ_{|k|_{n_l}} = −1}; then, using the symmetry of the loss, ℓ(f_k, y_k) = 1 − ℓ(f_k, −y_k), and the fact that E_σ (2/n_l) |K^+_i| = 1, we have:

R̂_{n_u}(F) = (1/m) Σ_{i=1}^m E_σ sup_{f∈F} (2/n_l) [ Σ_{k∈K^+_i} ℓ(f_k, y_k) − Σ_{k∈K^−_i} ℓ(f_k, y_k) ]
= 1 + (1/m) Σ_{i=1}^m E_σ sup_{f∈F} [ −(2/n_l) Σ_{k∈K^+_i} ℓ(f_k, −y_k) − (2/n_l) Σ_{k∈K^−_i} ℓ(f_k, y_k) ]
= 1 + (1/m) Σ_{i=1}^m E_σ sup_{f∈F} [ −(2/n_l) Σ_{k=(i−1)·n_l+1}^{i·n_l} ℓ(f_k, −σ_{|k|_{n_l}} y_k) ]
= 1 − (1/m) Σ_{i=1}^m E_σ inf_{f∈F} [ (2/n_l) Σ_{k=(i−1)·n_l+1}^{i·n_l} ℓ(f_k, σ_{|k|_{n_l}}) ]

which corresponds to solving a classification problem using all the available data with random labels. The expectation can be easily computed with some Monte Carlo trials.

^2 We define ℓ(f(x_i), y_i) ≡ ℓ_i to simplify the notation.

[Figure 1: The effect of selecting a better center for the hypothesis classes. (a) Conventional function classes; (b) Localized function classes.]

2.2 Exploiting the unlabeled data for tightening the bound

Another way of exploiting the unlabeled data is to use them for selecting a more suitable sequence of hypothesis spaces. For this purpose we could use some of the unlabeled samples or, even better, the n_c = n_u − ⌊n_u/n_l⌋ n_l samples left over from the procedure of the previous section. The idea is inspired by the work of [3] and [7], which propose to inflate the hypothesis classes by centering them around a 'good' classifier. Usually, in fact, we have no a-priori information on what can be considered a good choice of the class center, so a natural choice is the origin [13], as in Figure 1(a). However, if it happens that the center is 'close' to the optimal classifier, the search for a suitable class will stop very soon and the resulting Rademacher Complexity will be consequently reduced (see Figure 1(b)).
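The random-label characterization of the extended Rademacher complexity derived in Section 2.1 (R̂ = 1 − (1/m) Σ_i E_σ inf_f (2/n_l) Σ_k ℓ(f_k, σ_k)) can be approximated with a few Monte Carlo trials, as noted there; here is a rough numpy sketch, where picking the best of a pool of random linear classifiers is our own crude stand-in for actually training a classifier on the randomly labeled data:

```python
import numpy as np

rng = np.random.default_rng(0)

def ramp_loss(margin):
    # ramp loss (1 - y f_S(x)) / 2 with f_S clipped to [-1, 1]
    return (1.0 - np.clip(margin, -1.0, 1.0)) / 2.0

def extended_rademacher(X_blocks, n_trials=20, n_candidates=200):
    """Monte Carlo estimate of (1/m) sum_i hatR^i_{n_l}(F)
    = 1 - (1/m) sum_i E_sigma inf_f (2/n_l) sum_k loss(f(x_k), sigma_k).
    X_blocks: list of m arrays, each of shape (n_l, d) (unlabeled patterns)."""
    d = X_blocks[0].shape[1]
    # crude stand-in for "learn a classifier": best of random unit-norm (w, b) pairs
    W = rng.standard_normal((n_candidates, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    b = rng.uniform(-1, 1, n_candidates)
    vals = []
    for Xb in X_blocks:
        n_l = Xb.shape[0]
        for _ in range(n_trials):
            sigma = rng.choice([-1.0, 1.0], size=n_l)           # random labels
            margins = sigma[None, :] * (W @ Xb.T + b[:, None])  # y * f(x), per candidate
            emp = ramp_loss(margins).mean(axis=1)               # (1/n_l) sum_k loss, per candidate
            vals.append(1.0 - 2.0 * emp.min())                  # 1 - inf (2/n_l) sum_k loss
    return float(np.mean(vals))

X_blocks = [rng.standard_normal((50, 3)) for _ in range(4)]  # m = 4 blocks of n_l = 50
R_hat = extended_rademacher(X_blocks)
print(R_hat)  # always lies in [-1, 1], since each loss is in [0, 1]
```

With a real solver in place of the random pool (as in Section 3), the inner minimization is exactly the random-label learning problem described above.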
We propose here a method for finding two possible 'good' centers for the hypothesis classes. Let us consider n_c unlabeled samples and run a clustering algorithm on them, setting the number of clusters to 2 and obtaining two clusters C_1 and C_2. We build two distinct labeled datasets by assigning the labels +1 and −1 to C_1 and C_2, respectively, and then vice-versa. Finally, we build two classifiers f_{C1}(x) and f_{C2}(x) = −f_{C1}(x) by learning the two datasets^3. The two classifiers, which have been found using only unlabeled samples, can then be used as centers for searching a better hypothesis class. It is worthwhile noting that any supervised learning algorithm can be used [16], because the centers are only a hint for a better centered hypothesis space: their actual classification performance is not of paramount importance. The underlying principle that inspired this procedure relies on the reasonable hypothesis that P(X) is correlated with P(X, Y): in fact, in an unlucky scenario, where the two classes are heavily overlapped, the method would obviously fail.

^3 Note that we could build only one classifier by assigning the most probable labels to the n_c samples, according to the n_l labeled ones but, rigorously speaking, this is not allowed by the SRM principle, because it would lead to using the same data for both choosing the space of functions and computing the Rademacher Complexity.

Choosing a good center for the SRM procedure can greatly reduce the second term of the bound given by Eq. (13) [7] (the bias or complexity term). Note, however, that the confidence term is not affected, so we propose here an improved bound, which makes this term depend on R̂^i_{n_l}(F) as well. We use a recent concentration result for Self-Bounding Functions [17], instead of the looser McDiarmid's inequality.
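The center-selection procedure just described can be sketched in a few lines; this toy numpy implementation is illustrative only (we use a plain 2-means loop and a least-squares linear fit as stand-ins, since, as noted above, any clustering/supervised pair would do):

```python
import numpy as np

rng = np.random.default_rng(1)

def two_means(X, n_iter=50):
    # plain k-means with k = 2 on the n_c unlabeled samples
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in (0, 1):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def linear_fit(X, y):
    # least-squares classifier [w | b]: a cheap stand-in for any supervised learner
    A = np.hstack([X, np.ones((len(X), 1))])
    wb, *_ = np.linalg.lstsq(A, y, rcond=None)
    return wb[:-1], wb[-1]

# two well-separated blobs of unlabeled points (hypothetical toy data)
X_unlab = np.vstack([rng.normal(-2, 0.5, (40, 2)), rng.normal(+2, 0.5, (40, 2))])
assign = two_means(X_unlab)
y_pseudo = np.where(assign == 0, 1.0, -1.0)   # C1 -> +1, C2 -> -1
w_c1, b_c1 = linear_fit(X_unlab, y_pseudo)    # center f_C1
w_c2, b_c2 = -w_c1, -b_c1                     # f_C2 = -f_C1 (labels swapped)
```

The two (w, b) pairs then play the role of the center ŵ in the learning problem of Section 3.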
The detailed proof is omitted due to space constraints; we give here only a sketch (it is a more general version of the proof in [18] for Rademacher Complexities):

P[ sup_{f∈F} {L(f) − L_{n_l}(f)} ≥ R̂_{n_u}(F) + ε ] ≤ e^{−2 n_l a² ε²} + e^{−(m n_l)(1−a)² ε² / (2 E_{X,Y} R̂_{n_u}(F))}   (14)

with a ∈ [0, 1]. Choosing a = √m / (√m + 2 √(E_{X,Y} (1/m) Σ_{i=1}^m R̂^i_{n_l}(F))), we obtain:

P[ sup_{f∈F} {L(f) − L_{n_l}(f)} ≥ R̂_{n_u}(F) + ε ] ≤ 2 e^{−2 m n_l ε² / (√m + 2 √(E_{X,Y} R̂_{n_u}(F)))²}   (15)

so that the following explicit bound holds with probability (1 − δ):

L(f)_{f∈F} ≤ L_{n_l}(f)_{f∈F} + R̂_{n_u}(F) + ((2 √(E_{X,Y} R̂_{n_u}(F)) + √m) / √m) √(log(2/δ) / (2 n_l))   (16)

Note that, in the worst case, E_{X,Y} R̂_{n_u}(F) = 1 and we obtain again Eq. (13). Unfortunately, the Expected Extended Rademacher Complexity cannot be computed, but we can upper bound it with its empirical version (see, for example, [19], pages 420-422, for a justification of this step) as in Eq. (10) to obtain:

P[ sup_{f∈F} {L(f) − L_{n_l}(f)} ≥ R̂_{n_u}(F) + ε ] ≤ e^{−2 n_l a² ε²} + e^{−(m n_l)(1−a)² ε² / (2 (R̂_{n_u}(F) + (1−a)ε))}   (17)

with a ∈ [0, 1]. Differently from Eq. (15), the previous expression cannot be put in explicit form, but it can be simply computed numerically by writing it as:

L(f)_{f∈F} ≤ L_{n_l}(f)_{f∈F} + (1/m) Σ_{i=1}^m R̂^i_{n_l}(F) + ε^b_u   (18)

The value ε^b_u can be obtained by upper bounding with δ the last term of Eq. (17) and solving the inequality with respect to a and ε, so that the bound holds with probability (1 − δ).

We can show the improvements obtained through these new results by plotting the values of the confidence terms and comparing them with the conventional one [2]. Figure 2 shows the value of ε_l in Eq. (7) against ε_u, the corresponding term in Eq. (13), and ε^b_u, as a function of the number of samples.

[Figure 2: Comparison of the new confidence terms with the conventional one. (a) ε_l vs ε_u, for m ∈ {1, 2, 10}; (b) ε_u vs ε^b_u with m = 1, for R ∈ {0, 0.9, 1}.]

3 Performing the Structural Risk Minimization procedure

Computing the values of the bounds described in the previous sections is a straightforward process, at least in theory. The empirical error L_{n_l}(f) is found by learning a classifier with the original labeled dataset, while the (Extended) Rademacher Complexity R̂^i_{n_l}(F) is computed by learning the dataset composed of both labeled and unlabeled samples with random labels. In order to apply the results of the previous section in practice, and to better control the hypothesis space, we formulate the learning phase of the classifier as the following optimization problem, based on the Ivanov version of the Support Vector Machine (I-SVM) [13]:

min_{w,b,ξ} Σ_{i=1}^n η_i
s.t. ‖w − ŵ‖² ≤ ρ²,  y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  η_i = min(2, ξ_i)   (19)-(20)

where the size of the hypothesis space, centered in ŵ, is controlled by the hyperparameter ρ, and the last constraint is introduced for bounding the SVM loss function, which would otherwise be unbounded and would prevent the application of the theory developed so far. Note that, in practice, two sub-problems must be solved: the first one with ŵ = +ŵ_{C1} and the second one with ŵ = −ŵ_{C1}; then the solution corresponding to the smaller value of the objective function is selected. Unfortunately, solving a classification problem with a bounded loss function is computationally intractable, because the problem is no longer convex, and even state-of-the-art solvers like, for example, CPLEX [20] fail to find an exact solution when the training set size exceeds a few tens of samples. Therefore, we propose here to find an approximate solution through well-known algorithms like, for example, the Peeling [6] or the Convex-Concave Constrained Programming (CCCP) technique [14, 21, 22]. Furthermore, we derive a dual formulation of problem (19) that allows us to exploit the well-known Sequential Minimal Optimization (SMO) algorithm for SVM learning [23].

Problem (19) can be rewritten in the equivalent Tikhonov formulation:

min_{w,b,ξ} (1/2) ‖w − ŵ‖² + C Σ_{i=1}^n η_i
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  η_i = min(2, ξ_i)

which gives the same solution as the Ivanov formulation for some value of C [13]. The method for finding the value of C corresponding to a given value of ρ is reported in [10], where it is also shown that C cannot be used directly to control the hypothesis space.
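The bounded constraint η_i = min(2, ξ_i) amounts to replacing the hinge loss with the ramp loss, which can be written as a difference of convex terms, ramp(z) = max(0, 1 − z) − max(0, −1 − z); this decomposition is what makes a CCCP-style approach applicable. A minimal numpy sketch of such an iteration for the linear case follows; it is our own simplified stand-in (ŵ = 0, no bias term, subgradient steps instead of an exact convex solver):

```python
import numpy as np

rng = np.random.default_rng(2)

def ramp(z):
    # ramp(z) = hinge(z) - max(0, -1 - z), bounded in [0, 2]
    return np.maximum(0.0, 1.0 - z) - np.maximum(0.0, -1.0 - z)

def cccp_ramp_svm(X, y, C=1.0, outer=10, inner=200, lr=0.01):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer):
        # linearize the concave part at the current w:
        # delta_i = C if y_i w^T x_i < -1, else 0 (cf. the paper's Delta_i)
        delta = C * (y * (X @ w) < -1.0)
        for _ in range(inner):
            margins = y * (X @ w)
            # subgradient of (1/2)||w||^2 + C sum_i hinge_i + sum_i delta_i y_i w^T x_i
            g = w - C * ((margins < 1.0) * y) @ X + (delta * y) @ X
            w -= lr * g
    return w

# hypothetical separable toy data
X = np.vstack([rng.normal(-1.5, 0.4, (30, 2)), rng.normal(1.5, 0.4, (30, 2))])
y = np.r_[-np.ones(30), np.ones(30)]
w = cccp_ramp_svm(X, y)
acc = np.mean(np.sign(X @ w) == y)
```

Each outer step solves a convex surrogate in which the concave part has been replaced by its tangent, so, as with the paper's Algorithm 1, convergence to a (usually good) local solution is all that can be expected.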
Then, it is possible to apply the CCCP technique, which is synthesized in Algorithm 1, by splitting the objective function into its convex and concave parts:

min_{w,b,ξ} J_convex(θ) + J_concave(θ) = [ (1/2) ‖w − ŵ‖² + C Σ_{i=1}^n ξ_i ] + [ −C Σ_{i=1}^n ς_i ]
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  ς_i = max(0, ξ_i − 2)   (21)

where θ = [w|b] is introduced to simplify the notation. Obviously, the algorithm does not guarantee to find the optimal solution, but it converges to a (usually good) solution in a finite number of steps [14]. To apply the algorithm we must compute the derivative of the concave part of the objective function:

( dJ_concave(θ)/dθ |_{θ^t} ) θ = Σ_{i=1}^n ( d(−C ς_i)/dθ |_{θ^t} ) θ = Σ_{i=1}^n Δ_i y_i (w^T φ(x_i) + b)   (22)

Then, the learning problem becomes:

min_{w,b,ξ} (1/2) ‖w − ŵ‖² + C Σ_{i=1}^n ξ_i + Σ_{i=1}^n Δ_i y_i (w^T φ(x_i) + b)
s.t. y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0   (23)

where

Δ_i = C if y_i f^t(x_i) < −1, 0 otherwise   (24)

Finally, it is possible to obtain the dual formulation (the derivation is omitted due to lack of space):

min_β (1/2) Σ_{i=1}^n Σ_{j=1}^n β_i β_j y_i y_j K(x_i, x_j) + Σ_{i=1}^n [ Σ_{j=1}^{n_{C1}} α̂_j y_i ŷ_j K(x̂_j, x_i) − 1 ] β_i
s.t. −Δ_i ≤ β_i ≤ C − Δ_i,  Σ_{i=1}^n β_i y_i = 0   (25)

where we have used the kernel trick [24] K(·, ·) = φ(·)^T φ(·).

4 A case study

We consider the MNIST dataset [25], which consists of 62000 images representing the numbers from 0 to 9: in particular, we consider the 13074 patterns containing 0's and 1's, allowing us to deal with a binary classification problem. We simulate the small-sample regime by randomly sampling a training set with low cardinality (n_l < 500), while the remaining 13074 − n_l images are used as a test set or as an unlabeled dataset, by simply discarding the labels. In order to build statistically relevant results, this procedure is repeated 30 times.

In Table 1 we compare the conventional bound with our proposal. In the first column the number of labeled patterns (n_l) is reported, while the second column shows the number of unlabeled ones (n_u). The optimal classifier f* is selected by varying ρ in the range [10^{−6}, 1], and selecting the function corresponding to the minimum of the generalization error estimate provided by each bound. Then, for each case, the selected f* is tested on the remaining 13074 − (n_l + n_u) samples, and the classification results are reported in columns three and four, respectively.
The results show that the f* selected by exploiting the unlabeled patterns behaves better than the other and, furthermore, the estimated L(f), reported in columns five and six, shows that the bound is tighter, as expected from theory.

The most interesting result, however, derives from the use of the new bound of Eq. (18), as reported in Table 2, where the unlabeled data is exploited for selecting a more suitable center of the hypothesis space. The results are reported analogously to Table 1. Note that, for each experiment, 30% of the data (n_u) are used for selecting the hypothesis center and the remaining ones (n_l) are used for training the classifier.

Algorithm 1 CCCP procedure
  Initialize θ^0
  repeat
    θ^{t+1} = arg min_θ J_convex(θ) + ( dJ_concave(θ)/dθ |_{θ^t} ) θ
  until θ^{t+1} = θ^t

Table 1: Model selection and error estimation, exploiting unlabeled data for tightening the bound. Columns: n_l | n_u | Test error of f* (Eq. (7)) | Test error of f* (Eq. (13)) | Estimated L(f) (Eq. (7)) | Estimated L(f) (Eq. (13)).
10 | 20 | 13.20 ± 0.86 | 12.40 ± 0.82 | 194.00 ± 0.97 | 157.70 ± 0.97
20 | 40 | 8.93 ± 1.20 | 8.93 ± 1.29 | 142.00 ± 1.06 | 116.33 ± 1.06
40 | 80 | 6.26 ± 0.16 | 6.02 ± 0.17 | 103.00 ± 0.59 | 84.85 ± 0.59
60 | 120 | 5.95 ± 0.12 | 5.88 ± 0.13 | 85.50 ± 0.48 | 70.68 ± 0.48
80 | 160 | 5.61 ± 0.07 | 5.30 ± 0.07 | 73.70 ± 0.40 | 60.86 ± 0.40
100 | 200 | 5.36 ± 0.21 | 5.51 ± 0.22 | 66.10 ± 0.37 | 54.62 ± 0.37
120 | 240 | 4.98 ± 0.40 | 5.36 ± 0.40 | 61.30 ± 0.33 | 50.82 ± 0.33
150 | 300 | 4.41 ± 0.53 | 4.08 ± 0.51 | 55.10 ± 0.28 | 45.73 ± 0.28
170 | 340 | 3.59 ± 0.57 | 3.40 ± 0.64 | 52.40 ± 0.26 | 43.60 ± 0.26
200 | 400 | 2.75 ± 0.47 | 2.67 ± 0.48 | 48.10 ± 0.19 | 39.98 ± 0.19
250 | 500 | 2.07 ± 0.03 | 2.05 ± 0.03 | 42.70 ± 0.22 | 35.44 ± 0.22
300 | 600 | 2.02 ± 0.04 | 1.94 ± 0.04 | 39.20 ± 0.17 | 32.57 ± 0.17
400 | 800 | 1.93 ± 0.02 | 1.79 ± 0.02 | 34.90 ± 0.19 | 29.16 ± 0.19

Table 2: Model selection and error estimation, exploiting unlabeled data for selecting a more suitable hypothesis center. Columns: n_l | n_u | Test error of f* (Eq. (7)) | Test error of f* (Eq. (18)) | Estimated L(f) (Eq. (7)) | Estimated L(f) (Eq. (18)).
7 | 3 | 13.20 ± 0.86 | 8.98 ± 1.12 | 219.15 ± 0.97 | 104.01 ± 1.62
14 | 6 | 8.93 ± 1.20 | 5.10 ± 0.67 | 159.79 ± 1.06 | 86.70 ± 0.01
28 | 12 | 6.26 ± 0.16 | 3.05 ± 0.23 | 115.58 ± 0.59 | 51.35 ± 0.00
42 | 18 | 5.95 ± 0.12 | 2.36 ± 0.23 | 95.77 ± 0.48 | 38.37 ± 0.00
56 | 24 | 5.61 ± 0.07 | 1.96 ± 0.14 | 82.59 ± 0.40 | 31.39 ± 0.00
70 | 30 | 5.36 ± 0.21 | 1.63 ± 0.11 | 74.05 ± 0.37 | 26.83 ± 0.00
84 | 36 | 4.98 ± 0.40 | 1.44 ± 0.11 | 68.56 ± 0.33 | 23.77 ± 0.00
105 | 45 | 4.41 ± 0.53 | 1.27 ± 0.09 | 61.59 ± 0.28 | 20.36 ± 0.00
119 | 51 | 3.59 ± 0.57 | 1.20 ± 0.08 | 58.50 ± 0.26 | 18.77 ± 0.00
140 | 60 | 2.75 ± 0.47 | 1.08 ± 0.09 | 53.72 ± 0.19 | 16.82 ± 0.00
175 | 75 | 2.07 ± 0.03 | 0.92 ± 0.05 | 47.73 ± 0.22 | 14.52 ± 0.00
210 | 90 | 2.02 ± 0.04 | 0.81 ± 0.07 | 43.79 ± 0.17 | 12.91 ± 0.00
280 | 120 | 1.93 ± 0.02 | 0.70 ± 0.06 | 38.88 ± 0.19 | 10.86 ± 0.00
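The model-selection protocol used for these experiments (sweep ρ over [10^{−6}, 1] and keep the classifier whose bound is smallest) reduces to a simple loop; here is a schematic numpy sketch, where `train_and_eval` is a hypothetical stand-in for a learner that returns the empirical error and extended Rademacher complexity for a given ρ:

```python
import numpy as np

def confidence_term(n_l, m, delta=0.05):
    # confidence term of Eq. (13): ((2 + sqrt(m)) / sqrt(m)) * sqrt(log(2/delta) / (2 n_l))
    return (2.0 + np.sqrt(m)) / np.sqrt(m) * np.sqrt(np.log(2.0 / delta) / (2.0 * n_l))

def select_rho(rhos, train_and_eval, n_l, m, delta=0.05):
    # pick the rho minimizing: empirical error + extended Rademacher + confidence term
    bounds = []
    for rho in rhos:
        emp_err, rad = train_and_eval(rho)   # hypothetical learner interface
        bounds.append(emp_err + rad + confidence_term(n_l, m, delta))
    best = int(np.argmin(bounds))
    return rhos[best], bounds[best]

# dummy stand-in: small rho -> small complexity but large error, and vice-versa
rhos = np.logspace(-6, 0, 13)
dummy = lambda rho: (0.05 + 0.2 / (1.0 + 100.0 * rho), 0.8 * rho)
rho_star, bound_star = select_rho(rhos, dummy, n_l=100, m=3)
```

Note that for m = 1 the confidence term reduces to the conventional 3 √(log(2/δ)/(2 n_l)) of Eq. (7), so the same loop covers both bounds.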
The proposed method consistently selects a better classifier, which registers a threefold classification error reduction on the test set, especially for training sets of smaller cardinality. The estimate of L(f) is largely reduced as well.

We have to consider that this very clear performance increase is also favoured by the characteristics of the MNIST dataset, which consists of well-separated classes: this particular data distribution implies that only a few samples suffice for identifying a good hypothesis center. Many more experiments with different datasets, varying the ratio between labeled and unlabeled samples, are needed for establishing the general validity of our proposal, and are currently underway; in any case, these results appear to be very promising.

5 Conclusion

In this paper we have studied two methods which exploit unlabeled samples to tighten the Rademacher Complexity bounds on the generalization error of linear (kernel) classifiers. The first method improves a very well-known result, while the second one aims at changing the entire approach by selecting more suitable hypothesis spaces, not only acting on the bound itself. The recent literature on the theory of bounds attempts to obtain tighter bounds through more refined concentration inequalities (e.g. improving McDiarmid's inequality), but we believe that the idea of reducing the size of the hypothesis space is a more appealing field of research, because it opens the road to possible significant improvements.

References

[1] V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264, 1971.
[2] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463-482, 2003.
[3] P.L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. The Annals of Statistics, 33(4):1497-1537, 2005.
[4] O. Bousquet and A. Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499-526, 2002.
[5] P.L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48(1):85-113, 2002.
[6] D. Anguita, A. Ghio, and S. Ridella. Maximal discrepancy for support vector machines. Neurocomputing, 74(9):1436-1443, 2011.
[7] D. Anguita, A. Ghio, L. Oneto, and S. Ridella. Selecting the hypothesis space for improving the generalization ability of Support Vector Machines. In The 2011 International Joint Conference on Neural Networks (IJCNN), San Jose, California. IEEE, 2011.
[8] S. Arlot and A. Celisse. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40-79, 2010.
[9] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
[10] D. Anguita, A. Ghio, L. Oneto, and S. Ridella. In-sample model selection for Support Vector Machines. In The 2011 International Joint Conference on Neural Networks (IJCNN), San Jose, California. IEEE, 2011.
[11] K.P. Bennett and A. Demiriz. Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 11, page 368. The MIT Press, 1999.
[12] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2010.
[13] V.N. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 2000.
[14] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning, pages 201-208. ACM, 2006.
[15] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148-188, 1989.
[16] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 1994.
[17] S. Boucheron, G. Lugosi, and P. Massart. On concentration of self-bounding functions. Electronic Journal of Probability, 14:1884-1899, 2009.
[18] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities using the entropy method. The Annals of Probability, 31(3):1583-1614, 2003.
[19] G. Casella and R.L. Berger. Statistical Inference. 2001.
[20] ILOG CPLEX 11.0 User's Manual. ILOG SA, 2008.
[21] J. Wang, X. Shen, and W. Pan. On efficient large margin semisupervised learning: Method and theory. Journal of Machine Learning Research, 10:719-742, 2009.
[22] J. Wang and X. Shen. Large margin semi-supervised learning. Journal of Machine Learning Research, 8:1867-1891, 2007.
[23] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Advances in Kernel Methods - Support Vector Learning, 208:1-21, 1998.
[24] J. Shawe-Taylor and N. Cristianini. Margin distribution and soft margin. Advances in Large Margin Classifiers, pages 349-358, 2000.
[25] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In 24th ICML, pages 473-480, 2007.
", "award": [], "sourceid": 413, "authors": [{"given_name": "Luca", "family_name": "Oneto", "institution": null}, {"given_name": "Davide", "family_name": "Anguita", "institution": null}, {"given_name": "Alessandro", "family_name": "Ghio", "institution": null}, {"given_name": "Sandro", "family_name": "Ridella", "institution": null}]}