{"title": "Algorithmic Luckiness", "book": "Advances in Neural Information Processing Systems", "page_first": 391, "page_last": 397, "abstract": null, "full_text": "Algorithmic Luckiness \n\nRalf Herbrich \n\nMicrosoft Research Ltd. \n\nCB3 OFB Cambridge \n\nUnited Kingdom \n\nrherb@microsoft\u00b7com \n\nRobert C. Williamson \n\nAustralian National University \n\nCanberra 0200 \n\nAustralia \n\nBob. Williamson @anu.edu.au \n\nAbstract \n\nIn contrast to standard statistical learning theory which studies \nuniform bounds on the expected error we present a framework that \nexploits the specific learning algorithm used. Motivated by the \nluckiness framework [8] we are also able to exploit the serendipity \nof the training sample. The main difference to previous approaches \nlies in the complexity measure; rather than covering all hypothe(cid:173)\nses in a given hypothesis space it is only necessary to cover the \nfunctions which could have been learned using the fixed learning \nalgorithm. We show how the resulting framework relates to the \nVC, luckiness and compression frameworks. Finally, we present an \napplication of this framework to the maximum margin algorithm \nfor linear classifiers which results in a bound that exploits both the \nmargin and the distribution of the data in feature space. \n\n1 \n\nIntroduction \n\nStatistical learning theory is mainly concerned with the study of uniform bounds \non the expected error of hypotheses from a given hypothesis space [9, 1]. Such \nbounds have the appealing feature that they provide performance guarantees for \nclassifiers found by any learning algorithm. However, it has been observed that \nthese bounds tend to be overly pessimistic. One explanation is that only in the \ncase of learning algorithms which minimise the training error it has been proven \nthat uniformity of the bounds is equivalent to studying the learning algorithm's \ngeneralisation performance directly. 
\n\nIn this paper we present a theoretical framework which aims at directly studying the \ngeneralisation error of a learning algorithm rather than taking the detour via the \nuniform convergence of training errors to expected errors in a given hypothesis space. \nIn addition, our new model of learning allows the exploitation of the fact that we \nserendipitously observe a training sample which is easy to learn by a given learning \nalgorithm. In that sense, our framework is a descendant of the luckiness framework \nof Shawe-Taylor et al. [8]. In the present case, the luckiness is a function of a given \nlearning algorithm and a given training sample and characterises the diversity of \nthe algorithms solutions. The notion of luckiness allows us to study given learning \nalgorithms at many different perspectives. For example, the maximum margin \nalgorithm [9] can either been studied via the number of dimensions in feature space, \n\n\fthe margin of the classifier learned or the sparsity of the resulting classifier. Our \nmain results are two generalisation error bounds for learning algorithms: one for \nthe zero training error scenario and one agnostic bound (Section 2). We shall \ndemonstrate the usefulness of our new framework by studying its relation to the \nVC framework, the original luckiness framework and the compression framework of \nLittlestone and Warmuth [6] (Section 3). Finally, we present an application of the \nnew framework to the maximum margin algorithm for linear classifiers (Section 4). \nThe detailed proofs of our main results can be found in [5]. \nWe denote vectors using bold face, e.g. x = (Xl, ... ,xm ) and the length of this vector \nby lxi, i.e. Ixl = m. In order to unburden notation we use the shorthand notation \nZ[i:jJ := (Zi,\"\" Zj) for i :::; j. Random variables are typeset in sans-serif font. The \nsymbols Px, Ex [f (X)] and IT denote a probability measure over X, the expectation of \nf (.) 
over the random draw of its argument X, and the indicator function, respectively. The shorthand Z^(∞) := ∪_{m=1}^∞ Z^m denotes the union of all m-fold Cartesian products of the set Z. For any m ∈ N we define I_m ⊂ {1, ..., m}^m as the set of all permutations of the numbers 1, ..., m, \n\nI_m := {(i_1, ..., i_m) ∈ {1, ..., m}^m | ∀ j ≠ k: i_j ≠ i_k}. \n\nGiven a 2m-vector i ∈ I_2m and a sample z ∈ Z^2m we define π_i : {1, ..., 2m} → {1, ..., 2m} by π_i(j) := i_j, and Π_i(z) by Π_i(z) := (z_{π_i(1)}, ..., z_{π_i(2m)}). \n\n2 Algorithmic Luckiness \n\nSuppose we are given a training sample z = (x, y) ∈ (X × Y)^m = Z^m of size m ∈ N drawn independently (iid) from some unknown but fixed distribution P_XY = P_Z, together with a learning algorithm A : Z^(∞) → Y^X. For a predefined loss l : Y × Y → [0,1] we would like to investigate the generalisation error G_l[A, z] := R_l[A(z)] − inf_{h ∈ Y^X} R_l[h] of the algorithm, where the expected error R_l[h] of h is defined by \n\nR_l[h] := E_XY[l(h(X), Y)]. \n\nSince inf_{h ∈ Y^X} R_l[h] (which is also known as the Bayes error) is independent of A, it suffices to bound R_l[A(z)]. Although we know that for any fixed hypothesis h the training error \n\nR̂_l[h, z] := (1/m) Σ_{(x_i, y_i) ∈ z} l(h(x_i), y_i) \n\nis with high probability (over the random draw of the training sample z ∈ Z^(∞)) close to R_l[h], this might no longer be true for the random hypothesis A(z). Hence we would like to state that with only small probability (at most δ) the expected error R_l[A(z)] is larger than the training error R̂_l[A(z), z] plus some sample- and algorithm-dependent complexity ε(A, z, δ), \n\nP_{Z^m}(R_l[A(Z)] > R̂_l[A(Z), Z] + ε(A, Z, δ)) < δ.   (1) \n\nIn order to derive such a bound we utilise a modified version of the basic lemma of Vapnik and Chervonenkis [10]. \n\nLemma 1. 
For all loss functions l : Y × Y → [0,1], all probability measures P_Z, all algorithms A and all measurable formulas Υ : Z^m → {true, false}, if mε² > 2 then \n\nP_{Z^m}((R_l[A(Z)] > R̂_l[A(Z), Z] + ε) ∧ Υ(Z)) < 2 P_{Z^{2m}}((R̂_l[A(Z_[1:m]), Z_[(m+1):2m]] > R̂_l[A(Z_[1:m]), Z_[1:m]] + ε/2) ∧ Υ(Z_[1:m])), \n\nwhere the event under the second probability is denoted by J(Z). \n\nProof (Sketch). The probability on the r.h.s. is lower bounded by the probability of the conjunction of the event on the l.h.s. and Q(z) = R_l[A(z_[1:m])] − R̂_l[A(z_[1:m]), z_[(m+1):2m]] < ε/2. Note that this probability is over z ∈ Z^{2m}. If we now condition on the first m examples, A(z_[1:m]) is fixed and therefore, by an application of Hoeffding's inequality (see, e.g., [1]) and since mε² > 2, the additional event Q has probability of at least 1/2 over the random draw of (z_{m+1}, ..., z_{2m}). □ \n\nUse of Lemma 1 - which is similar to the approach of classical VC analysis - reduces the original problem (1) to the problem of studying the deviation of the training errors on the first and second half of a double sample z ∈ Z^{2m} of size 2m. It is of utmost importance that the hypothesis A(z_[1:m]) is always learned from the first m examples. Now, in order to fully exploit our assumption of the mutual independence of the double sample Z ∈ Z^{2m} we use a technique known as symmetrisation by permutation: since P_{Z^{2m}} is a product measure, it has the property that P_{Z^{2m}}(J(Z)) = P_{Z^{2m}}(J(Π_i(Z))) for any i ∈ I_2m. Hence, it suffices to bound the probability of permutations π_i such that J(Π_i(z)) is true for a given and fixed double sample z. As a consequence, we only need to count the number of different hypotheses that can be learned by A from the first m examples when permuting the double sample. \n\nDefinition 1 (Algorithmic luckiness). Any function L that maps an algorithm A : Z^(∞) → Y^X and a training sample z ∈ Z^(∞) to a real value is called an algorithmic luckiness. 
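The counting argument above can be made concrete with a small numerical sketch. The snippet below is our own illustration, not taken from the paper: it uses a toy learning algorithm (a nearest-class-mean classifier) and counts how many effectively different hypotheses A((Π_i(z))_[1:m]) arise when a double sample is randomly permuted, discretising the hypothesis parameters at a coarse scale as a crude stand-in for a cover. All function names here are hypothetical.

```python
import numpy as np

def learn(X, y):
    """Toy algorithm A: nearest-class-mean classifier, represented by its
    two class means."""
    return X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)

def distinct_hypotheses(X, y, n_perms=200, scale=1, seed=0):
    """Count the hypotheses learned from the first half of the double sample
    z in Z^{2m} under random permutations, discretised at a coarse scale."""
    rng = np.random.default_rng(seed)
    two_m = len(y)
    m = two_m // 2
    seen = set()
    for _ in range(n_perms):
        i = rng.permutation(two_m)       # a random element of I_{2m}
        Xh, yh = X[i[:m]], y[i[:m]]      # first half of Pi_i(z)
        if len(np.unique(yh)) < 2:
            continue                     # need both classes to learn
        mu_pos, mu_neg = learn(Xh, yh)
        seen.add((tuple(mu_pos.round(scale)), tuple(mu_neg.round(scale))))
    return len(seen)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.1, (10, 2)), rng.normal(2.0, 0.1, (10, 2))])
y = np.array([-1] * 10 + [1] * 10)
# For well-clustered data the permuted halves yield nearly identical class
# means, so few effectively distinct hypotheses are observed.
print(distinct_hypotheses(X, y))
```

The point of the sketch is qualitative: for a benign sample the number of learnable hypotheses under permutation is far smaller than the number of permutations, which is exactly what the luckiness framework exploits.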
For all m ∈ N, for any z ∈ Z^{2m}, the lucky set H_A(L, z) ⊆ Y^X is the set of all hypotheses that are learned from the first m examples (z_{π_i(1)}, ..., z_{π_i(m)}) when permuting the whole sample z whilst not decreasing the luckiness, i.e. \n\nH_A(L, z) := {A((Π_i(z))_[1:m]) | i ∈ I(L, z)},   (2) \n\nwhere \n\nI(L, z) := {i ∈ I_2m | L(A, (Π_i(z))_[1:m]) ≥ L(A, z_[1:m])}.   (3) \n\nGiven a fixed loss function l : Y × Y → [0,1], the induced loss function set L_l(H_A(L, z)) is defined by \n\nL_l(H_A(L, z)) := {(x, y) ↦ l(h(x), y) | h ∈ H_A(L, z)}. \n\nFor any luckiness function L and any learning algorithm A, the complexity of the double sample z is the minimal number N_l(τ, L_l(H_A(L, z)), z) of hypotheses ĥ ∈ Y^X needed to cover L_l(H_A(L, z)) at some predefined scale τ, i.e. for any hypothesis h ∈ H_A(L, z) there exists an ĥ ∈ Y^X such that \n\n(1/(2m)) Σ_{i=1}^{2m} |l(h(x_i), y_i) − l(ĥ(x_i), y_i)| ≤ τ.   (4) \n\nTo see this, note that whenever J(Π_i(z)) is true (over the random draw of permutations) there exists a function ĥ which has a difference in the training errors on the two halves of the double sample of at least ε/2 − 2τ. By an application of the union bound we see that the number N_l(τ, L_l(H_A(L, z)), z) is of central importance. Hence, if we are able to bound this number over the random draw of the double sample z using only the luckiness on the first m examples, we can use this bound in place of the worst-case complexity sup_{z ∈ Z^{2m}} N_l(τ, L_l(H_A(L, z)), z) as usually done in the VC framework (see [9]). \n\nDefinition 2 (ω-smallness of L). Given an algorithm A : Z^(∞) → Y^X and a loss l : Y × Y → [0,1], the algorithmic luckiness function L is ω-small at scale τ ∈ R+ if for all m ∈ N, all δ ∈ (0,1] and all P_Z, \n\nP_{Z^{2m}}(N_l(τ, L_l(H_A(L, Z)), Z) > ω(L(A, Z_[1:m]), l, m, δ, τ)) < δ, \n\nwhere the event under this probability is denoted by S(Z). \n\nNote that if the range of l is {0, 1} then N_l(1/(2m), L_l(H_A(L, z)), z) equals the number of dichotomies on z incurred by L_l(H_A(L, z)). \n\nTheorem 1 (Algorithmic luckiness bounds). 
Suppose we have a learning algorithm A : Z^(∞) → Y^X and an algorithmic luckiness L that is ω-small at scale τ for a loss function l : Y × Y → [0,1]. For any probability measure P_Z, any d ∈ N and any δ ∈ (0,1], with probability at least 1 − δ over the random draw of the training sample z ∈ Z^m of size m, if ω(L(A, z), l, m, δ/4, τ) ≤ 2^d then \n\nR_l[A(z)] ≤ R̂_l[A(z), z] + √((8/m)(d + log_2(8/δ))) + 4τ.   (5) \n\nFurthermore, under the above conditions, if the algorithmic luckiness L is ω-small at scale 1/(2m) for a binary loss function l(·,·) ∈ {0,1} and R̂_l[A(z), z] = 0, then \n\nR_l[A(z)] ≤ (2/m)(d + log_2(4/δ)).   (6) \n\nProof (Compressed Sketch). We only sketch the proof of equation (5); the proof of (6) is similar and can be found in [5]. First, we apply Lemma 1 with Υ(z) ≡ ω(L(A, z), l, m, δ/4, τ) ≤ 2^d. We then exploit the fact that \n\nP_{Z^{2m}}(J(Z)) = P_{Z^{2m}}(J(Z) ∧ S(Z)) + P_{Z^{2m}}(J(Z) ∧ ¬S(Z)) < δ/4 + P_{Z^{2m}}(J(Z) ∧ ¬S(Z)), \n\nwhich follows from Definition 2 since P_{Z^{2m}}(J(Z) ∧ S(Z)) ≤ P_{Z^{2m}}(S(Z)) < δ/4. Following the above-mentioned argument, it suffices to bound the probability over a random permutation Π_I(z) that J(Π_I(z)) ∧ ¬S(Π_I(z)) is true for a fixed double sample z. Noticing that Υ(z) ∧ ¬S(z) implies N_l(τ, L_l(H_A(L, z)), z) ≤ 2^d, we see that we only need to consider swappings π_i for which N_l(τ, L_l(H_A(L, Π_i(z))), Π_i(z)) ≤ 2^d. Thus let us consider such a cover of size not more than 2^d. By (4) we know that whenever J(Π_i(z)) ∧ ¬S(Π_i(z)) is true for a swapping i, there exists a hypothesis ĥ ∈ Y^X in the cover such that R̂_l[ĥ, (Π_i(z))_[(m+1):2m]] − R̂_l[ĥ, (Π_i(z))_[1:m]] > ε/2 − 2τ. Using the union bound and Hoeffding's inequality for a particular choice of P_I shows that P_I(J(Π_I(z)) ∧ ¬S(Π_I(z))) ≤ δ/4, which finalises the proof. □ 
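To get a feel for the magnitudes in Theorem 1, the following sketch (our own illustration; the function names are hypothetical and not from the paper) evaluates the right-hand sides of bounds (5) and (6) for given values of the complexity exponent d, the sample size m, the confidence δ and the covering scale τ.

```python
import math

def agnostic_bound(train_err, d, m, delta, tau):
    """Right-hand side of (5):
    R <= R_hat + sqrt((8/m) * (d + log2(8/delta))) + 4*tau."""
    return train_err + math.sqrt((8.0 / m) * (d + math.log2(8.0 / delta))) + 4.0 * tau

def zero_error_bound(d, m, delta):
    """Right-hand side of (6), for zero training error:
    R <= (2/m) * (d + log2(4/delta))."""
    return (2.0 / m) * (d + math.log2(4.0 / delta))

# d enters via the condition w(L(A,z), l, m, delta/4, tau) <= 2^d, so a
# luckier sample (smaller effective complexity) yields a smaller d and a
# tighter bound.
m, delta, tau = 1000, 0.05, 1.0 / (2 * 1000)
for d in (10, 50, 200):
    print(d, round(agnostic_bound(0.05, d, m, delta, tau), 3),
          round(zero_error_bound(d, m, delta), 3))
```

Note the familiar rate difference: the agnostic bound decays like 1/√m, while the zero-training-error bound decays like 1/m.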
\n\nA closer look at (5) and (6) reveals that the essential difference to uniform bounds on the expected error lies in the definition of the covering number: rather than covering all hypotheses h in a given hypothesis space H ⊆ Y^X for a given double sample, it suffices to cover all hypotheses that can be learned by the given learning algorithm from the first half when permuting the double sample. Note that the usage of permutations in the definition (2) is not only a technical matter; it fully exploits all the assumptions made about the training sample, namely that the training sample is drawn iid. \n\n3 Relationship to Other Learning Frameworks \n\nIn this section we present the relationship of algorithmic luckiness to other learning frameworks (see [9, 8, 6] for further details of these frameworks). \n\nVC Framework. If we consider a binary loss function l(·,·) ∈ {0,1} and assume that the algorithm A selects functions from a given hypothesis space H ⊆ Y^X, then L(A, z) = −VCDim(H) is an ω-small luckiness function where \n\nω(L_0, l, m, δ, 1/(2m)) = ((2em)/(−L_0))^(−L_0).   (7) \n\nThis can easily be seen by noticing that the latter term is an upper bound on max_{z ∈ Z^{2m}} |{(l(h(x_1), y_1), ..., l(h(x_{2m}), y_{2m})) : h ∈ H}| (see also [9]). Note that this luckiness function neither exploits the particular training sample observed nor the learning algorithm used. \n\nLuckiness Framework. Firstly, the luckiness framework of Shawe-Taylor et al. [8] only considered binary loss functions l and the zero training error case. In that work, the luckiness ℒ is a function of a hypothesis and a training sample, and is called ω-small if the probability, over the random draw of a 2m-sample z, that there exists a hypothesis h with \n\nω(ℒ(h, (z_1, ..., z_m)), δ) < N_l(1/(2m), {(x, y) ↦ l(g(x), y) | ℒ(g, z) ≥ ℒ(h, z)}, z) \n\nis smaller than δ. 
Although similar in spirit, the classical luckiness framework does not allow exploitation of the learning algorithm used to the same extent as our new luckiness. In fact, in this framework not only the covering number must be estimable but also the variation of the luckiness ℒ itself. These differences make it very difficult to formally relate the two frameworks. \n\nCompression Framework. In the compression framework of Littlestone and Warmuth [6] one considers learning algorithms A which are compression schemes, i.e. A(z) = R(C(z)) where C(z) selects a subsample z̃ ⊆ z and R : Z^(∞) → Y^X is a permutation-invariant reconstruction function. For this class of learning algorithms, the luckiness L(A, z) = −|C(z)| is ω-small where ω is given by (7). In order to see this, note that (3) ensures that we only consider permutations π_i where |C(Π_i(z))| ≤ |C(z)|, i.e. we use no more than −L training examples from z ∈ Z^{2m}. As there are exactly (2m choose d) distinct choices of d training examples from 2m examples, the result follows by an application of Sauer's lemma [9]. Disregarding constants, Theorem 1 gives exactly the same bound as in [6]. \n\n4 A New Margin Bound For Support Vector Machines \n\nIn this section we study the maximum margin algorithm for linear classifiers, i.e. A : Z^(∞) → H_φ where H_φ := {x ↦ ⟨φ(x), w⟩ | w ∈ K} and φ : X → K ⊆ ℓ_2^n is known as the feature mapping. Let us assume that l(h(x), y) = l_{0-1}(h(x), y) := I_{yh(x) ≤ 0}. Classical VC generalisation error bounds exploit the fact that VCDim(H_φ) = n and (7). In the luckiness framework of Shawe-Taylor et al. [8] it has been shown that we can use fat_{H_φ}(γ_z(w)) ≤ (γ_z(w))^(−2) (at the price of an extra log_2(32m) factor) in place of VCDim(H_φ), where γ_z(w) := min_{(x_i, y_i) ∈ z} y_i ⟨φ(x_i), w⟩ / ‖w‖ is known as the margin. Now, the maximum margin algorithm finds the weight vector w_MM that maximises γ_z(w). 
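The quantities that the remainder of this section combines, namely the margin, its normalised variant and the covering radii of the mapped points, are straightforward to compute. The sketch below is our own illustration with hypothetical function names; in particular, the greedy farthest-point covering only upper-bounds the optimal covering radii.

```python
import numpy as np

def margin(Phi, y, w):
    """gamma_z(w) = min_i y_i * <phi(x_i), w> / ||w||."""
    return np.min(y * (Phi @ w)) / np.linalg.norm(w)

def normalised_margin(Phi, y, w):
    """Gamma_z(w) = min_i y_i * <phi(x_i), w> / (||phi(x_i)|| * ||w||)."""
    return np.min(y * (Phi @ w) / np.linalg.norm(Phi, axis=1)) / np.linalg.norm(w)

def greedy_cover_radii(Phi):
    """radii[i-1] upper-bounds eps_i(x): the radius needed to cover
    {phi(x_1), ..., phi(x_m)} with i balls (greedy farthest-point heuristic)."""
    d = np.linalg.norm(Phi - Phi[0], axis=1)   # start with phi(x_1) as centre
    radii = []
    for _ in range(len(Phi)):
        radii.append(float(d.max()))
        j = int(d.argmax())                    # farthest point becomes next centre
        d = np.minimum(d, np.linalg.norm(Phi - Phi[j], axis=1))
    return radii

# Two tight, well-separated clusters in feature space:
rng = np.random.default_rng(0)
Phi = np.vstack([rng.normal(-1.0, 0.05, (20, 3)), rng.normal(1.0, 0.05, (20, 3))])
y = np.array([-1.0] * 20 + [1.0] * 20)
w = Phi[y == 1].mean(axis=0) - Phi[y == -1].mean(axis=0)  # a separating direction
print(margin(Phi, y, w), normalised_margin(Phi, y, w))
print(greedy_cover_radii(Phi)[:3])  # the two-ball radius is small for clustered data
```

For such clustered data the two-ball covering radius drops far below the one-ball radius, which is the effect the bound of this section exploits.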
It is known that w_MM can be written as a linear combination of the φ(x_i). For notational convenience, we shall assume that A : Z^(∞) → R^(∞) maps to the expansion coefficients α such that ‖w_α‖ = 1, where w_α := Σ_{i=1}^{|z|} α_i φ(x_i). Our new margin bound follows from the following theorem together with (6). \n\nTheorem 2. Let ε_i(x) be the smallest ε > 0 such that {φ(x_1), ..., φ(x_m)} can be covered by at most i balls of radius less than or equal to ε. Let the normalised margin Γ_z(w) be defined by \n\nΓ_z(w) := min_{(x_i, y_i) ∈ z} y_i ⟨φ(x_i), w⟩ / (‖φ(x_i)‖ · ‖w‖). \n\nFor the zero-one loss l_{0-1} and the maximum margin algorithm A, the luckiness function \n\nL(A, z) = − min { i ∈ N | i ≥ (ε_i(x) Σ_{j=1}^m |A(z)_j| / Γ_z(w_{A(z)}))² }   (8) \n\nis ω-small at scale 1/(2m) w.r.t. the function \n\nω(L_0, l, m, δ, 1/(2m)) = ((2em)/(−L_0))^(−2 L_0).   (9) \n\nProof (Sketch). First we note that, by a slight refinement of a theorem of Makovoz [7], we know that for any z ∈ Z^m there exists a weight vector ŵ = Σ_{i=1}^m α̂_i φ(x_i) such that \n\n‖w_{A(z)} − ŵ‖ ≤ Γ_z(w_{A(z)})   (10) \n\nand α̂ ∈ R^m has no more than −L(A, z) non-zero components. Although only w_{A(z)} is of unit length, one can show that (10) implies that \n\n⟨w_{A(z)}, ŵ / ‖ŵ‖⟩ ≥ √(1 − Γ_z²(w_{A(z)})). \n\nUsing equation (10) of [4], this implies that ŵ correctly classifies z ∈ Z^m. Consider a fixed double sample z ∈ Z^{2m} and let k_0 := −L(A, (z_1, ..., z_m)). By virtue of (3) and the aforementioned argument we only need to consider permutations π_i such that there exists a weight vector ŵ = Σ_{j=1}^{2m} α̂_j φ(x_j) with no more than k_0 non-zero α̂_j. As there are exactly (2m choose d) distinct choices of d ∈ {1, ..., k_0} training examples from the 2m examples z, there are no more than (2em/k_0)^(k_0) different subsamples to be used in ŵ. 
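The subsample count just used is a standard Sauer-style estimate: the number of ways to choose at most k_0 of the 2m examples is bounded by (2em/k_0)^(k_0). A small numerical check (our own sketch, with hypothetical function names):

```python
import math

def num_subsamples(n, k):
    """Number of non-empty subsamples of size at most k from n examples:
    sum_{d=1}^{k} C(n, d)."""
    return sum(math.comb(n, d) for d in range(1, k + 1))

def sauer_style_bound(n, k):
    """(e*n/k)^k; with n = 2m this is the (2em/k0)^{k0} bound from the proof."""
    return (math.e * n / k) ** k

two_m, k0 = 200, 5   # double sample of size 2m = 200, at most 5 expansion points
print(num_subsamples(two_m, k0), sauer_style_bound(two_m, k0))
```

The bound grows only polynomially in m for a fixed number k_0 of expansion points, which is why sparsity of the expansion enters the final result.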
For each particular subsample z̃ ⊆ z, the weight vector ŵ is a member of the class of linear classifiers in a k_0 (or less) dimensional space. Thus, from (7) it follows that for the given subsample z̃ there are no more than (2em/k_0)^(k_0) different dichotomies induced on the double sample z ∈ Z^{2m}. As this holds for any double sample, the theorem is proven. □ \n\nThere are several interesting features of this margin bound. Firstly, observe that Σ_{j=1}^m |A(z)_j| is a measure of the sparsity of the solution found by the maximum margin algorithm, which, in the present case, is combined with the margin. Note that for normalised data, i.e. ‖φ(·)‖ = constant, the two notions of margin coincide, i.e. Γ_z(w) = γ_z(w). Secondly, the quantity ε_i(x) can be considered a measure of the distribution of the mapped data points in feature space. Note that for all i ∈ N, ε_i(x) ≤ ε_1(x) ≤ max_{j ∈ {1, ..., m}} ‖φ(x_j)‖. Supposing that the two class-conditional probabilities P_{X|Y=y} are highly clustered, ε_2(x) will be very small. An extension of this reasoning is useful in the multi-class case; binary maximum margin classifiers are often used to solve multi-class problems [9]. There also appears to be a close relationship of ε_i(x) to the notion of kernel alignment recently introduced in [3]. Finally, one can use standard entropy number techniques to bound ε_i(x) in terms of eigenvalues of the inner product matrix or its centred variants. It is worth mentioning that although our aim was to study the maximum margin algorithm, the above theorem actually holds for any algorithm whose solution can be represented as a linear combination of the data points. \n\n5 Conclusions \n\nIn this paper we have introduced a new theoretical framework to study the generalisation error of learning algorithms. In contrast to previous approaches, we considered specific learning algorithms rather than specific hypothesis spaces. 
We introduced the notion of algorithmic luckiness, which allowed us to devise data-dependent generalisation error bounds. Thus we were able to relate the compression framework of Littlestone and Warmuth to the VC framework. Furthermore, we presented a new bound for the maximum margin algorithm which exploits not only the margin but also the distribution of the actual training data in feature space. Perhaps the most appealing feature of our margin-based bound is that it naturally combines the three factors considered important for generalisation with linear classifiers: margin, sparsity and the distribution of the data. Further research is concentrated on studying Bayesian algorithms and the relation of algorithmic luckiness to the recent findings for stable learning algorithms [2]. \n\nAcknowledgements. This work was done while RCW was visiting Microsoft Research Cambridge. This work was also partly supported by the Australian Research Council. RH would like to thank Olivier Bousquet for stimulating discussions. \n\nReferences \n\n[1] M. Anthony and P. Bartlett. A Theory of Learning in Artificial Neural Networks. Cambridge University Press, 1999. \n[2] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 196-202. MIT Press, 2001. \n[3] N. Cristianini, A. Elisseeff, and J. Shawe-Taylor. On optimizing kernel alignment. Technical Report NC2-TR-2001-087, NeuroCOLT, http://www.neurocolt.com, 2001. \n[4] R. Herbrich and T. Graepel. A PAC-Bayesian margin bound for linear classifiers: Why SVMs work. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 224-230, Cambridge, MA, 2001. MIT Press. \n[5] R. Herbrich and R. C. Williamson. Algorithmic luckiness. 
Technical report, Microsoft Research, 2002. \n[6] N. Littlestone and M. Warmuth. Relating data compression and learnability. Technical report, University of California Santa Cruz, 1986. \n[7] Y. Makovoz. Random approximants and neural networks. Journal of Approximation Theory, 85:98-109, 1996. \n[8] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998. \n[9] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998. \n[10] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-281, 1971. \n", "award": [], "sourceid": 2105, "authors": [{"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Robert", "family_name": "Williamson", "institution": null}]}