Let us consider a special family of cost functions. Assume that $\varphi$ is a fixed nonincreasing Lipschitz function from $\mathbb{R}$ into $\mathbb{R}$ such that $\varphi(x) \ge (1+\mathrm{sgn}(-x))/2$ for all $x \in \mathbb{R}$. One can easily observe that $L(\varphi(\cdot/\delta)) \le L(\varphi)\delta^{-1}$. Applying Theorem 1 to the class of Lipschitz functions $\{\varphi(f/\delta) : f \in \mathcal{F}\}$, one arrives at the following result.

Theorem 2 For all $t > 0$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{F}:\ P\{f \le 0\} > \inf_{\delta \in (0,1]}\Big[P_n\varphi(f/\delta) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\,G_n(\mathcal{F}) + C\Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$

In [5] an example was given which shows that, in general, the order of the factor $\delta^{-1}$ in the second term of the bound cannot be improved.

Given a metric space $(T,d)$, we denote by $H_d(T;\varepsilon)$ the $\varepsilon$-entropy of $T$ with respect to $d$, i.e. $H_d(T;\varepsilon) := \log N_d(T;\varepsilon)$, where $N_d(T;\varepsilon)$ is the minimal number of balls of radius $\varepsilon$ covering $T$. The next theorem improves the previous results under some additional assumptions on the growth of the random entropies $H_{d_{P_n,2}}(\mathcal{F};\cdot)$. Define for $\gamma \in (0,1]$

$$\delta_n(\gamma;f) := \sup\big\{\delta \in (0,1):\ \delta^{\gamma} P\{f \le \delta\} \le n^{-1+\gamma/2}\big\}$$

and

$$\hat\delta_n(\gamma;f) := \sup\big\{\delta \in (0,1):\ \delta^{\gamma} P_n\{f \le \delta\} \le n^{-1+\gamma/2}\big\}.$$

We call $\delta_n(\gamma;f)$ and $\hat\delta_n(\gamma;f)$, respectively, the $\gamma$-margin and the empirical $\gamma$-margin of $f$.

Theorem 3 Suppose that for some $\alpha \in (0,2)$ and for some constant $D > 0$

$$H_{d_{P_n,2}}(\mathcal{F};u) \le D u^{-\alpha}, \quad u > 0 \ \text{a.s.} \tag{1}$$

Then for any $\gamma \ge \frac{2\alpha}{2+\alpha}$, for some constants $A, B > 0$ and for all large enough $n$,

$$\mathbb{P}\big\{\forall f \in \mathcal{F}:\ A^{-1}\hat\delta_n(\gamma;f) \le \delta_n(\gamma;f) \le A\,\hat\delta_n(\gamma;f)\big\} \ge 1 - B(\log_2\log_2 n)\exp\{-n^{\gamma/2}/2\}.$$

This implies that with high probability for all $f \in \mathcal{F}$

$$P\{f \le 0\} \le C\big(n^{1-\gamma/2}\,\hat\delta_n(\gamma;f)^{\gamma}\big)^{-1}.$$

The bound of Theorem 2 corresponds to the case $\gamma = 1$. It is easy to see from the definitions of the $\gamma$-margins that the quantity $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ increases in $\gamma \in (0,1]$, which shows that the bound in the case $\gamma < 1$ is tighter. Further discussion of this type of bounds and their experimental study in the case of convex combinations of simple classifiers is given in the next section.

2 Bounding the generalization error of convex combinations of classifiers

Recently, several authors ([1, 8]) suggested a new class of upper bounds on the generalization error that are expressed in terms of the empirical distribution of the margin of the predictor (the classifier). The margin is defined as the product $Yg(X)$.
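For illustration, the empirical $\gamma$-margin defined above can be computed directly from a sample of margin values $m_i = y_i f(x_i)$: since $\delta \mapsto \delta^{\gamma} P_n\{f \le \delta\}$ is nondecreasing, a grid search over $\delta \in (0,1)$ recovers the supremum. A minimal sketch (the function name and grid resolution are our own illustrative choices, not from the paper):

```python
def empirical_gamma_margin(margins, gamma, grid_size=1000):
    """Empirical gamma-margin of f:
    sup{delta in (0,1) : delta^gamma * P_n{margin <= delta} <= n^(-1+gamma/2)},
    approximated by a grid search over (0,1).  The left-hand side is
    nondecreasing in delta, so the largest grid point satisfying the
    inequality approximates the supremum."""
    n = len(margins)
    threshold = n ** (-1.0 + gamma / 2.0)
    best = 0.0
    for k in range(1, grid_size):
        delta = k / grid_size
        p_n = sum(1 for m in margins if m <= delta) / n
        if delta ** gamma * p_n <= threshold:
            best = delta
    return best
```

For a sample whose margins are all large, the empirical $\gamma$-margin is close to $1$; when all margins vanish, it shrinks to roughly $n^{(-1+\gamma/2)/\gamma}$.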
The bounds in question are especially useful in the case of classifiers that are combinations of simpler classifiers (belonging, say, to a class $\mathcal{H}$). One example of such classifiers is provided by the classifiers obtained by boosting [3, 4], bagging [2] and other voting methods of combining classifiers. We will now demonstrate how our general results can be applied to the case of convex combinations of simple base classifiers.

We assume that $S := X \times \{-1,1\}$ and $\tilde{\mathcal{F}} := \{\tilde f : f \in \mathcal{F}\}$, where $\tilde f(x,y) := yf(x)$. $P$ will denote the distribution of $(X,Y)$, and $P_n$ the empirical distribution based on the observations $((X_1,Y_1),\dots,(X_n,Y_n))$. It is easy to see that $G_n(\tilde{\mathcal{F}}) = G_n(\mathcal{F})$. One can also easily see that if $\mathcal{F} := \mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a class of base classifiers, then $G_n(\mathcal{F}) = G_n(\mathcal{H})$. These easy observations allow us to obtain useful bounds for boosting and other methods of combining classifiers. For instance, we get in this case the following theorem, which implies the bound of Schapire, Freund, Bartlett and Lee [8] when $\mathcal{H}$ is a VC-class of sets.

Theorem 4 Let $\mathcal{F} := \mathrm{conv}(\mathcal{H})$, where $\mathcal{H}$ is a class of measurable functions from $(S,\mathcal{A})$ into $\mathbb{R}$. For all $t > 0$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{F}:\ P\{yf(x) \le 0\} > \inf_{\delta\in(0,1]}\Big[P_n\{yf(x) \le \delta\} + \frac{2\sqrt{2\pi}}{\delta}\,G_n(\mathcal{H}) + C\Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{t}{\sqrt{n}}\Big\} \le 2\exp\{-2t^2\}.$$

In particular, if $\mathcal{H}$ is a VC-class of classifiers $h : S \mapsto \{-1,1\}$ (which means that the class of sets $\{\{x : h(x)=+1\} : h \in \mathcal{H}\}$ is a Vapnik–Chervonenkis class) with VC-dimension $V(\mathcal{H})$, we have with some constant $C > 0$, $G_n(\mathcal{H}) \le C(V(\mathcal{H})/n)^{1/2}$. This implies that with probability at least $1-\alpha$

$$P\{yf(x) \le 0\} \le \inf_{\delta\in(0,1]}\Big[P_n\{yf(x)\le\delta\} + \frac{C}{\delta}\sqrt{\frac{V(\mathcal{H})}{n}} + \Big(\frac{\log\log_2(2\delta^{-1})}{n}\Big)^{1/2}\Big] + \frac{1}{\sqrt{n}}\sqrt{\frac{\log\frac{2}{\alpha}}{2}} + \frac{2}{n},$$

which slightly improves the bound obtained previously by Schapire, Freund, Bartlett and Lee [8].
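The VC-case bound of Theorem 4 can be evaluated numerically on the empirical margin distribution. The sketch below keeps only the two leading terms of the infimum (setting the constant to $C=1$ and dropping the log-log and confidence terms, purely for illustration), to exhibit the trade-off between the margin distribution term $P_n\{yf(x) \le \delta\}$ and the complexity term $(C/\delta)\sqrt{V(\mathcal{H})/n}$:

```python
import math

def vc_margin_bound(margins, vc_dim, C=1.0, grid_size=1000):
    """Evaluate  inf over delta in (0,1] of
    P_n{yf(x) <= delta} + (C/delta) * sqrt(V/n)
    on a uniform grid over delta; the log-log and confidence terms of
    the full bound are omitted here for clarity."""
    n = len(margins)
    best = float("inf")
    for k in range(1, grid_size + 1):
        delta = k / grid_size
        p_n = sum(1 for m in margins if m <= delta) / n
        best = min(best, p_n + (C / delta) * math.sqrt(vc_dim / n))
    return best
```

As expected from the form of the bound, a base class of larger VC-dimension yields a larger value, and the infimum is attained at a $\delta$ balancing the two terms.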
Theorem 3 provides some improvement of the above bounds on the generalization error of convex combinations of base classifiers. To be specific, consider the case when $\mathcal{H}$ is a VC-class of classifiers, and let $V := V(\mathcal{H})$ be its VC-dimension. A well-known bound (going back to Dudley) on the entropy of the convex hull (see [11], p. 142) implies that

$$H_{d_{P_n,2}}(\mathrm{conv}(\mathcal{H});u) \le \sup_{Q \in \mathcal{P}(S)} H_{d_{Q,2}}(\mathrm{conv}(\mathcal{H});u) \le D\,u^{-\frac{2(V-1)}{V}}.$$

It immediately follows from Theorem 3 that for all $\gamma \ge \frac{2(V-1)}{2V-1}$ and for some constants $C, B > 0$

$$\mathbb{P}\Big\{\exists f \in \mathrm{conv}(\mathcal{H}):\ P\{f \le 0\} > \frac{C}{n^{1-\gamma/2}\,\hat\delta_n(\gamma;f)^{\gamma}}\Big\} \le B\,\log_2\log_2 n\,\exp\{-n^{\gamma/2}/2\},$$

where

$$\hat\delta_n(\gamma;f) := \sup\big\{\delta \in (0,1):\ \delta^{\gamma}P_n\{(x,y) : yf(x) \le \delta\} \le n^{-1+\gamma/2}\big\}.$$

This shows that when the VC-dimension of the base class is relatively small, the generalization error of boosting and of some other convex combinations of simple classifiers obtained by various versions of voting methods is better than was suggested by the bounds of Schapire, Freund, Bartlett and Lee. One can also conjecture that the remarkable generalization ability of these methods observed in numerous experiments is related to the fact that the combined classifier belongs to a subset of the convex hull for which the random entropy $H_{d_{P_n,2}}$ is much smaller than for the whole convex hull (see [9, 10] for improved margin type bounds in a much more special setting).

To demonstrate the improvement provided by our bounds over previous results, we show some experimental evidence obtained for a simple artificially generated problem, for which we are able to compute exactly the generalization error as well as the $\gamma$-margins.

We consider the problem of learning a classifier consisting of the indicator function of the union of a finite number of intervals in the input space $S = [0,1]$. We used the AdaBoost algorithm [4] to find a combined classifier using as base class $\mathcal{H}$
$= \{[0,b] : b \in [0,1]\} \cup \{[b,1] : b \in [0,1]\}$ (i.e. decision stumps). Notice that in this case $V = 2$, and according to the theory, values of $\gamma$ in $(2/3, 1)$ should result in tighter bounds on the generalization error.

For our experiments we used a target function with 10 equally spaced intervals, and a sample of size 1000, generated according to the uniform distribution on $[0,1]$. We ran AdaBoost for 500 rounds, and computed at each round the generalization error of the combined classifier and the bound $C(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for different values of $\gamma$. We set the constant $C$ to one.

In Figure 1 we plot the generalization error and the bounds for $\gamma = 1$, $0.8$ and $2/3$. As expected, for $\gamma = 1$ (which corresponds roughly to the bounds in [8]) the bound is very loose, and as $\gamma$ decreases, the bound gets closer to the generalization error. In Figure 2 we show that by reducing the value of $\gamma$ further we get a curve even closer to the actual generalization error (although for $\gamma = 0.2$ we no longer get an upper bound). This seems to support the conjecture that AdaBoost generates combined classifiers that belong to a subset of the convex hull of $\mathcal{H}$ with a smaller random entropy. In Figure 3 we plot the ratio $\hat\delta_n(\gamma;f)/\delta_n(\gamma;f)$ for $\gamma = 0.4$, $2/3$ and $0.8$ against the boosting iteration. We can see that the ratio is close to one in all the examples, indicating that the value of the constant $A$ in Theorem 3 is close to one in this case.

Figure 1: Comparison of the generalization error (thicker line) with $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for $\gamma = 1, 0.8$ and $2/3$ (thinner lines, top to bottom).

Figure 2: Comparison of the generalization error (thicker line) with $(n^{1-\gamma/2}\hat\delta_n(\gamma;f)^{\gamma})^{-1}$ for $\gamma = 0.5, 0.4$ and $0.2$ (thinner lines, top to bottom).
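The experiment described above can be reproduced in outline. The following sketch is a bare-bones AdaBoost over decision stumps on $[0,1]$ (the function names, round counts, and the toy samples in the test are our own illustrative choices, not the actual experimental code); it returns the normalized combined classifier $f = \sum_t c_t h_t / \sum_t c_t \in \mathrm{conv}(\mathcal{H})$, whose margins $y f(x)$ are the quantities on which the $\gamma$-margin bounds are evaluated:

```python
import math

def make_stumps(xs):
    """Base class H: for each threshold b, the classifier of the interval
    [0, b] (predict +1 iff x <= b) and of [b, 1] (predict +1 iff x >= b)."""
    stumps = []
    for b in sorted(set(xs)):
        stumps.append(lambda x, b=b: 1 if x <= b else -1)
        stumps.append(lambda x, b=b: 1 if x >= b else -1)
    return stumps

def adaboost(xs, ys, n_rounds=30):
    """Plain AdaBoost over stumps; returns f(x) = sum_t c_t h_t(x) / sum_t c_t,
    a convex combination of base classifiers."""
    n = len(xs)
    w = [1.0 / n] * n
    stumps = make_stumps(xs)
    chosen, coefs = [], []
    for _ in range(n_rounds):
        # pick the stump with the smallest weighted training error
        errs = [sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
                for h in stumps]
        t = min(range(len(stumps)), key=errs.__getitem__)
        err = min(max(errs[t], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        h = stumps[t]
        chosen.append(h)
        coefs.append(alpha)
        # exponential reweighting of the sample, then renormalization
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    total = sum(coefs)
    return lambda x: sum(c * h(x) for c, h in zip(coefs, chosen)) / total
```

On a sample labeled by a union of intervals, the margins $y_i f(x_i)$ of the returned classifier can then be fed into the computation of the empirical $\gamma$-margin and of the resulting bound.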
Figure 3: Ratio $\hat\delta_n(\gamma;f)/\delta_n(\gamma;f)$ versus boosting round for $\gamma = 0.4, 2/3, 0.8$ (top to bottom).

3 Bounding the generalization error in neural network learning

We turn now to applications of the bounds of the previous section to neural network learning. Let $\mathcal{H}$ be a class of measurable functions from $(S,\mathcal{A})$ into $\mathbb{R}$. Given a sigmoid $\sigma$ from $\mathbb{R}$ into $[-1,1]$ and a vector $w := (w_1,\dots,w_n) \in \mathbb{R}^n$, let $N_{\sigma,w}(u_1,\dots,u_n) := \sigma\big(\sum_{j=1}^n w_j u_j\big)$. We call the function $N_{\sigma,w}$ a neuron with weights $w$ and sigmoid $\sigma$. For $w \in \mathbb{R}^n$, $\|w\|_{\ell_1} := \sum_{j=1}^n |w_j|$. Let $\sigma_j$, $j \ge 1$, be functions from $\mathbb{R}$ into $[-1,1]$ satisfying the Lipschitz conditions

$$|\sigma_j(u) - \sigma_j(v)| \le L_j|u-v|, \quad u,v \in \mathbb{R}.$$

Let $\{A_j\}$ be a sequence of positive numbers. We define recursively classes of neural networks with restrictions on the weights of the neurons ($j$ below is the number of layers):

$$\mathcal{H}_0 = \mathcal{H}, \quad \mathcal{H}_j(A_1,\dots,A_j) := \big\{N_{\sigma_j,w}(h_1,\dots,h_n) : n \ge 0,\ h_i \in \mathcal{H}_{j-1}(A_1,\dots,A_{j-1}),\ w \in \mathbb{R}^n,\ \|w\|_{\ell_1} \le A_j\big\} \cup \mathcal{H}_{j-1}(A_1,\dots,A_{j-1}).$$

Theorem 5 For all $t > 0$ and for all $l \ge 1$,

$$\mathbb{P}\Big\{\exists f \in \mathcal{H}_l(A_1,\dots,A_l):\ P\{f \le 0\} > \inf_{\delta\in(0,1]}\Big[P_n\varphi(f/\delta) + \frac{2\sqrt{2\pi}\,L(\varphi)}{\delta}\prod_{j\le l}(2L_jA_j+1)\,G_n(\mathcal{H}) + \cdots$$