{"title": "Improved risk tail bounds for on-line algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 195, "page_last": 202, "abstract": null, "full_text": "Improved Risk Tail Bounds for On-Line Algorithms *\n\nNicol\u00f2 Cesa-Bianchi\nDSI, Universit\u00e0 di Milano\nvia Comelico 39\n20135 Milano, Italy\ncesa-bianchi@dsi.unimi.it\n\nClaudio Gentile\nDICOM, Universit\u00e0 dell'Insubria\nvia Mazzini 5\n21100 Varese, Italy\ngentile@dsi.unimi.it\n\nAbstract\n\nWe prove the strongest known bound for the risk of hypotheses selected from the ensemble generated by running a learning algorithm incrementally on the training data. Our result is based on proof techniques that are remarkably different from the standard risk analysis based on uniform convergence arguments.\n\n1  Introduction\n\nIn this paper, we analyze the risk of hypotheses selected from the ensemble obtained by running an arbitrary on-line learning algorithm on an i.i.d. sequence of training data. We describe a procedure that selects from the ensemble a hypothesis whose risk is, with high probability, at most\n\n$$M_n + O\\left(\\frac{(\\ln n)^2}{n} + \\sqrt{\\frac{M_n}{n} \\ln n}\\right),$$\n\nwhere $M_n$ is the average cumulative loss incurred by the on-line algorithm on a training sequence of length $n$. Note that this bound exhibits the \"fast\" rate $(\\ln n)^2/n$ whenever the cumulative loss $n M_n$ is $O(1)$.\n\nThis result is proven through a refinement of techniques that we used in [2] to prove the substantially weaker bound $M_n + O\\left(\\sqrt{(\\ln n)/n}\\right)$. As in the proof of the older result, we analyze the empirical process associated with a run of the on-line learner using exponential inequalities for martingales. However, this time we control the large deviations of the on-line process using Bernstein's maximal inequality rather than the Azuma-Hoeffding inequality. 
This provides a much tighter bound on the average risk of the ensemble. Finally, we relate the risk of a specific hypothesis within the ensemble to the average risk. As in [2], we select this hypothesis using a deterministic sequential testing procedure, but the use of Bernstein's inequality makes the analysis of this procedure far more complicated.\n\nThe study of the statistical risk of hypotheses generated by on-line algorithms, initiated by Littlestone [5], uses tools that are sharply different from those used for uniform convergence analysis, a popular approach based on the manipulation of suprema of empirical processes (see, e.g., [3]). Unlike uniform convergence, which is tailored to empirical risk minimization, our bounds hold for any learning algorithm. Indeed, disregarding efficiency issues, any learner can be run incrementally on a data sequence to generate an ensemble of hypotheses.\n\nThe consequences of this line of research for kernel and margin-based algorithms have been presented in our previous work [2].\n\n* Part of the results contained in this paper were presented in a talk given at the NIPS 2004 workshop on \"(Ab)Use of Bounds\".\n\nNotation. An example is a pair $(x, y)$, where $x \\in \\mathcal{X}$ (which we call instance) is a data element and $y \\in \\mathcal{Y}$ is the label associated with it. Instances $x$ are tuples of numerical and/or symbolic attributes. Labels $y$ belong to a finite set of symbols (the class elements) or to an interval of the real line, depending on whether the task is classification or regression. We allow a learning algorithm to output hypotheses of the form $h : \\mathcal{X} \\to \\mathcal{D}$, where $\\mathcal{D}$ is a decision space not necessarily equal to $\\mathcal{Y}$. The goodness of hypothesis $h$ on example $(x, y)$ is measured by the quantity $\\ell(h(x), y)$, where $\\ell : \\mathcal{D} \\times \\mathcal{Y} \\to \\mathbb{R}$ is a nonnegative and bounded loss function. 
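The remark that any learner can be run incrementally to generate an ensemble can be sketched in code. The following is only an illustration under stated assumptions: the names run_online, mean_update, clipped_abs_loss and predict are hypothetical and do not come from the paper, and the toy learner (a running mean with a clipped absolute loss) is chosen purely to keep the loss in [0, 1]. The sketch records each hypothesis before it sees the next example and returns the average cumulative loss M_n mentioned in the introduction.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

H = TypeVar('H')  # hypothesis type (assumption: any Python object)
Z = TypeVar('Z')  # example type; here a bare label is enough for the toy learner


def run_online(
    update: Callable[[H, Z], H],
    loss: Callable[[H, Z], float],
    h0: H,
    examples: Sequence[Z],
) -> Tuple[List[H], float]:
    # Returns the ensemble [H_0, ..., H_{n-1}] and the average cumulative loss.
    ensemble: List[H] = []
    cumulative_loss = 0.0
    h = h0
    for z in examples:
        ensemble.append(h)              # hypothesis held before seeing this example
        cumulative_loss += loss(h, z)   # loss of that hypothesis on this example
        h = update(h, z)                # hypothesis used in the next trial
    n = len(examples)
    return ensemble, (cumulative_loss / n if n else 0.0)


# Toy learner (hypothetical): the hypothesis is (sum of labels, count).
def predict(h: Tuple[float, int]) -> float:
    total, count = h
    return total / count if count else 0.5


def clipped_abs_loss(h: Tuple[float, int], y: float) -> float:
    # Absolute loss clipped to [0, 1], so the loss is bounded.
    return min(1.0, abs(predict(h) - y))


def mean_update(h: Tuple[float, int], y: float) -> Tuple[float, int]:
    total, count = h
    return (total + y, count + 1)


ensemble, m_n = run_online(mean_update, clipped_abs_loss, (0.0, 0), [0.0, 1.0, 1.0])
```

Keeping the hypotheses in a list is the simplest choice for a sketch; a learner with large hypotheses would more likely store only what the later selection step needs.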
\n\n2  A bound on the average risk\n\nAn on-line algorithm $A$ works in a sequence of trials. In each trial $t = 1, 2, \\ldots$ the algorithm takes as input a hypothesis $H_{t-1}$ and an example $Z_t = (X_t, Y_t)$, and returns a new hypothesis $H_t$ to be used in the next trial. We follow the standard assumptions in statistical learning: the sequence of examples $Z^n = ((X_1, Y_1), \\ldots, (X_n, Y_n))$ is drawn i.i.d. according to an unknown distribution over $\\mathcal{X} \\times \\mathcal{Y}$. We also assume that the loss function $\\ell$ satisfies $0 \\le \\ell \\le 1$. The success of a hypothesis $h$ is measured by the risk of $h$, denoted by $\\mathrm{risk}(h)$. This is the expected loss of $h$ on an example $(X, Y)$ drawn from the underlying distribution, $\\mathrm{risk}(h) = \\mathbb{E}\\,\\ell(h(X), Y)$. Define also $\\mathrm{risk}_{\\mathrm{emp}}(h)$ to be the empirical risk of $h$ on a sample $Z^n$,\n\n$$\\mathrm{risk}_{\\mathrm{emp}}(h) = \\frac{1}{n} \\sum_{t=1}^{n} \\ell(h(X_t), Y_t).$$\n\nGiven a sample $Z^n$ and an on-line algorithm $A$, we use $H_0, H_1, \\ldots, H_{n-1}$ to denote the ensemble of hypotheses generated by $A$. Note that the ensemble is a function of the random training sample $Z^n$. Our bounds hinge on the sample statistic\n\n$$M_n = \\frac{1}{n} \\sum_{t=1}^{n} \\ell(H_{t-1}(X_t), Y_t),$$\n\nwhich can be easily computed as the on-line algorithm is run on $Z^n$.\n\nThe following bound, a consequence of Bernstein's maximal inequality for martingales due to Freedman [4], is of primary importance for proving our results.\n\nLemma 1  Let $L_1, L_2, \\ldots$ be a sequence of random variables, $0 \\le L_t \\le 1$. Define the bounded martingale difference sequence $V_t = \\mathbb{E}[L_t \\mid L_1, \\ldots, L_{t-1}] - L_t$ and the associated martingale $S_n = V_1 + \\cdots + V_n$ with conditional variance $K_n = \\sum_{t=1}^{n} \\mathrm{Var}[L_t \\mid L_1, \\ldots, L_{t-1}]$. Then, for all $s, k \\ge 0$,\n\n$$\\mathbb{P}\\left(S_n \\ge s,\\ K_n \\le k\\right) \\le \\exp\\left(-\\frac{s^2}{2k + 2s/3}\\right).$$\n\nThe next proposition, derived from Lemma 1, establishes a bound on the average risk of the ensemble of hypotheses.\n\nProposition 2  Let $H_0, \\ldots$ 
, $H_{n-1}$ be the ensemble of hypotheses generated by an arbitrary on-line algorithm $A$. Then, for any $0 < \\delta \\le 1$,\n\n$$\\mathbb{P}\\left(\\frac{1}{n} \\sum_{t=1}^{n} \\mathrm{risk}(H_{t-1}) \\ge M_n + \\frac{36}{n} \\ln\\left(\\frac{n M_n + 3}{\\delta}\\right) + 2 \\sqrt{\\frac{M_n}{n} \\ln\\left(\\frac{n M_n + 3}{\\delta}\\right)}\\right) \\le \\delta.$$\n\nThe bound shown in Proposition 2 has the same rate as a bound recently proven by Zhang [6, Theorem 5]. However, rather than deriving the bound from Bernstein's inequality as we do, Zhang uses an ad hoc argument.\n\nProof. Let\n\n$$\\mu_n = \\frac{1}{n} \\sum_{t=1}^{n} \\mathrm{risk}(H_{t-1}) \\quad \\text{and} \\quad V_{t-1} = \\mathrm{risk}(H_{t-1}) - \\ell(H_{t-1}(X_t), Y_t) \\quad \\text{for } t \\ge 1.$$\n\nLet $\\kappa_t$ be the conditional variance $\\mathrm{Var}\\left(\\ell(H_{t-1}(X_t), Y_t) \\mid Z_1, \\ldots, Z_{t-1}\\right)$. Also, set for brevity $K_n = \\sum_{t=1}^{n} \\kappa_t$, $K'_n = \\left\\lfloor \\sum_{t=1}^{n} \\kappa_t \\right\\rfloor$, and introduce the function $A(x) = 2 \\ln\\frac{(x+1)(x+3)}{\\delta}$ for $x \\ge 0$. We find upper and lower bounds on the probability\n\n$$\\mathbb{P}\\left(\\sum_{t=1}^{n} V_{t-1} \\ge A(K_n) + \\sqrt{A(K_n)\\, K_n}\\right). \\qquad (1)$$\n\nThe upper bound is determined through a simple stratification argument over Lemma 1. We can write\n\n$$\\begin{aligned} \\mathbb{P}\\left(\\sum_{t=1}^{n} V_{t-1} \\ge A(K_n) + \\sqrt{A(K_n)\\, K_n}\\right) &\\le \\mathbb{P}\\left(\\sum_{t=1}^{n} V_{t-1} \\ge A(K'_n) + \\sqrt{A(K'_n)\\, K'_n}\\right) \\\\ &\\le \\sum_{s=0}^{n} \\mathbb{P}\\left(\\sum_{t=1}^{n} V_{t-1} \\ge A(s) + \\sqrt{A(s)\\, s},\\ K'_n = s\\right) \\\\ &\\le \\sum_{s=0}^{n} \\mathbb{P}\\left(\\sum_{t=1}^{n} V_{t-1} \\ge A(s) + \\sqrt{A(s)\\, s},\\ K_n \\le s + 1\\right) \\\\ &\\le \\sum_{s=0}^{n} \\exp\\left(-\\frac{\\left(A(s) + \\sqrt{A(s)\\, s}\\right)^2}{\\frac{2}{3}\\left(A(s) + \\sqrt{A(s)\\, s}\\right) + 2(s+1)}\\right) \\quad \\text{(using Lemma 1).} \\end{aligned}$$\n\nSince\n\n$$\\frac{\\left(A(s) + \\sqrt{A(s)\\, s}\\right)^2}{\\frac{2}{3}\\left(A(s) + \\sqrt{A(s)\\, s}\\right) + 2(s+1)} \\ge \\frac{A(s)}{2}$$\n\nfor all $s \\ge 0$, we obtain\n\n$$(1) \\le \\sum_{s=0}^{n} e^{-A(s)/2} = \\sum_{s=0}^{n} \\frac{\\delta}{(s+1)(s+3)} < \\delta. \\qquad (2)$$\n\nAs far as the lower bound on (1) is concerned, we note that our assumption $0 \\le \\ell \\le 1$ implies $\\kappa_t \\le \\mathrm{risk}(H_{t-1})$ for all $t$ which, in turn, gives $K_n \\le n\\mu_n$. 
Thus\n\n$$\\begin{aligned} (1) &= \\mathbb{P}\\left(n\\mu_n - nM_n \\ge A(K_n) + \\sqrt{A(K_n)\\, K_n}\\right) \\\\ &\\ge \\mathbb{P}\\left(n\\mu_n - nM_n \\ge A(n\\mu_n) + \\sqrt{A(n\\mu_n)\\, n\\mu_n}\\right) \\\\ &= \\mathbb{P}\\left(2n\\mu_n \\ge 2nM_n + 3A(n\\mu_n) + \\sqrt{4nM_n\\, A(n\\mu_n) + 5A(n\\mu_n)^2}\\right) \\\\ &= \\mathbb{P}\\left(x \\ge B + \\tfrac{3}{2}A(x) + \\sqrt{B\\, A(x) + \\tfrac{5}{4}A^2(x)}\\right), \\end{aligned}$$\n\nwhere we set for brevity $x = n\\mu_n$ and $B = nM_n$. We would like to solve the inequality\n\n$$x \\ge B + \\tfrac{3}{2}A(x) + \\sqrt{B\\, A(x) + \\tfrac{5}{4}A^2(x)} \\qquad (3)$$\n\nw.r.t. $x$. More precisely, we would like to find a suitable upper bound on the (unique) $x^*$ such that the above is satisfied as an equality.\n\nA (tedious) derivative argument along with the upper bound $A(x) \\le 4 \\ln\\left(\\frac{x+3}{\\delta}\\right)$ shows that\n\n$$x' = B + 2\\sqrt{B \\ln\\left(\\frac{B+3}{\\delta}\\right)} + 36 \\ln\\left(\\frac{B+3}{\\delta}\\right)$$\n\nmakes the left-hand side of (3) larger than its right-hand side. Thus $x'$ is an upper bound on $x^*$, and we conclude that\n\n$$(1) \\ge \\mathbb{P}\\left(x \\ge B + 2\\sqrt{B \\ln\\left(\\frac{B+3}{\\delta}\\right)} + 36 \\ln\\left(\\frac{B+3}{\\delta}\\right)\\right),$$\n\nwhich, recalling the definitions of $x$ and $B$, and combining with (2), proves the bound. $\\Box$\n\n3  Selecting a good hypothesis from the ensemble\n\nIf the decision space $\\mathcal{D}$ of $A$ is a convex set and the loss function $\\ell$ is convex in its first argument, then via Jensen's inequality we can directly apply the bound of Proposition 2 to the risk of the average hypothesis $\\overline{H} = \\frac{1}{n} \\sum_{t=1}^{n} H_{t-1}$. This yields\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(\\overline{H}) \\ge M_n + \\frac{36}{n} \\ln\\left(\\frac{n M_n + 3}{\\delta}\\right) + 2\\sqrt{\\frac{M_n}{n} \\ln\\left(\\frac{n M_n + 3}{\\delta}\\right)}\\right) \\le \\delta. \\qquad (4)$$\n\nObserve that this is a $O(1/n)$ bound whenever the cumulative loss $n M_n$ is $O(1)$.\n\nIf the convexity hypotheses do not hold (as in the case of classification problems), then the bound in (4) applies to a hypothesis randomly drawn from the ensemble (this was investigated in [1] though with different goals).\n\nIn this section we show how to deterministically pick from the ensemble a hypothesis whose risk is close to the average ensemble risk. 
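To get a feel for the size of the deviation term in Proposition 2 and in bound (4), the right-hand side can be evaluated numerically. The sketch below is only an illustration: the function name average_risk_bound and its argument names are hypothetical, and it simply evaluates M_n plus (36/n) times ln((n M_n + 3)/delta) plus 2 times the square root of (M_n/n) times the same logarithm, for a loss bounded in [0, 1].

```python
import math


def average_risk_bound(m_n: float, n: int, delta: float) -> float:
    # Right-hand side of the bound on the average ensemble risk.
    # m_n: average cumulative loss of the on-line run, assumed in [0, 1]
    # n: number of training examples; delta: confidence parameter in (0, 1]
    if n < 1 or not 0.0 < delta <= 1.0 or not 0.0 <= m_n <= 1.0:
        raise ValueError('need n >= 1, 0 < delta <= 1 and 0 <= m_n <= 1')
    log_term = math.log((n * m_n + 3.0) / delta)
    return m_n + 36.0 * log_term / n + 2.0 * math.sqrt(m_n * log_term / n)


# With zero cumulative loss only the 1/n term survives.
zero_loss_bound = average_risk_bound(0.0, 1000, 0.05)
# With a constant average loss the square-root term dominates the deviation.
constant_loss_bound = average_risk_bound(0.1, 1000, 0.05)
```

Because of the constant 36, the bound is informative only once n is fairly large relative to ln(1/delta); the sketch makes that easy to check for concrete values.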
\n\nTo see how this could be done, let us first introduce the functions\n\n$$\\mathcal{E}_\\delta(r, t) = \\frac{B}{3(n-t)} + \\sqrt{\\frac{2Br}{n-t}} \\quad \\text{and} \\quad c_\\delta(r, t) = \\mathcal{E}_\\delta\\left(r + \\sqrt{\\frac{2Br}{n-t}},\\ t\\right),$$\n\nwith $B = \\ln\\frac{n(n+2)}{\\delta}$.\n\nLet $\\mathrm{risk}_{\\mathrm{emp}}(H_t, t+1) + \\mathcal{E}_\\delta\\left(\\mathrm{risk}_{\\mathrm{emp}}(H_t, t+1), t\\right)$ be the penalized empirical risk of hypothesis $H_t$, where\n\n$$\\mathrm{risk}_{\\mathrm{emp}}(H_t, t+1) = \\frac{1}{n-t} \\sum_{i=t+1}^{n} \\ell(H_t(X_i), Y_i)$$\n\nis the empirical risk of $H_t$ on the remaining sample $Z_{t+1}, \\ldots, Z_n$. We now analyze the performance of the learning algorithm that returns the hypothesis $\\widehat{H}$ minimizing the penalized risk estimate over all hypotheses in the ensemble, i.e.,$^{1}$\n\n$$\\widehat{H} = \\arg\\min_{0 \\le t < n} \\left(\\mathrm{risk}_{\\mathrm{emp}}(H_t, t+1) + \\mathcal{E}_\\delta\\left(\\mathrm{risk}_{\\mathrm{emp}}(H_t, t+1), t\\right)\\right). \\qquad (5)$$\n\n1 Note that, from an algorithmic point of view, this hypothesis is fairly easy to compute. In particular, if the underlying on-line algorithm is a standard kernel-based algorithm, $\\widehat{H}$ can be calculated via a single sweep through the example sequence.\n\nLemma 3  Let $H_0, \\ldots, H_{n-1}$ be the ensemble of hypotheses generated by an arbitrary on-line algorithm $A$ working with a loss $\\ell$ satisfying $0 \\le \\ell \\le 1$. Then, for any $0 < \\delta \\le 1$, the hypothesis $\\widehat{H}$ satisfies\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\min_{0 \\le t < n} \\left(\\mathrm{risk}(H_t) + 2 c_\\delta(\\mathrm{risk}(H_t), t)\\right)\\right) \\le \\delta.$$\n\nProof. We introduce the following short-hand notation\n\n$$\\begin{aligned} R_t &= \\mathrm{risk}_{\\mathrm{emp}}(H_t, t+1), \\\\ \\widehat{T} &= \\arg\\min_{0 \\le t < n} \\left(R_t + \\mathcal{E}_\\delta(R_t, t)\\right), \\\\ T^* &= \\arg\\min_{0 \\le t < n} \\left(\\mathrm{risk}(H_t) + 2 c_\\delta(\\mathrm{risk}(H_t), t)\\right). \\end{aligned}$$\n\nAlso, let $H^* = H_{T^*}$ and $R^* = \\mathrm{risk}_{\\mathrm{emp}}(H_{T^*}, T^* + 1) = R_{T^*}$. Note that $\\widehat{H}$ defined in (5) coincides with $H_{\\widehat{T}}$. Finally, let\n\n$$Q(r, t) = \\frac{\\sqrt{2B\\left(2B + 9r(n-t)\\right)} - 2B}{3(n-t)}.$$ 
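\n\nWith this notation we can write\n\n$$\\begin{aligned} &\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2c_\\delta(\\mathrm{risk}(H^*), T^*)\\right) \\\\ &\\quad \\le \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2c_\\delta(R^* - Q(R^*, T^*), T^*)\\right) + \\mathbb{P}\\left(\\mathrm{risk}(H^*) < R^* - Q(R^*, T^*)\\right) \\\\ &\\quad \\le \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2c_\\delta(R^* - Q(R^*, T^*), T^*)\\right) + \\sum_{t=0}^{n-1} \\mathbb{P}\\left(\\mathrm{risk}(H_t) < R_t - Q(R_t, t)\\right). \\end{aligned}$$\n\nApplying the standard Bernstein's inequality (see, e.g., [3, Ch. 8]) to the random variables $R_t$ with $|R_t| \\le 1$ and expected value $\\mathrm{risk}(H_t)$, and upper bounding the variance of $R_t$ with $\\mathrm{risk}(H_t)$, yields\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(H_t) < R_t - \\frac{B + \\sqrt{B\\left(B + 18(n-t)\\,\\mathrm{risk}(H_t)\\right)}}{3(n-t)}\\right) \\le e^{-B}.$$\n\nWith a little algebra, it is easy to show that\n\n$$\\mathrm{risk}(H_t) < R_t - \\frac{B + \\sqrt{B\\left(B + 18(n-t)\\,\\mathrm{risk}(H_t)\\right)}}{3(n-t)}$$\n\nis equivalent to $\\mathrm{risk}(H_t) < R_t - Q(R_t, t)$. Hence, we get\n\n$$\\begin{aligned} &\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2c_\\delta(\\mathrm{risk}(H^*), T^*)\\right) \\\\ &\\quad \\le \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2c_\\delta(R^* - Q(R^*, T^*), T^*)\\right) + n e^{-B} \\\\ &\\quad \\le \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2\\mathcal{E}_\\delta(R^*, T^*)\\right) + n e^{-B}, \\end{aligned}$$\n\nwhere in the last step we used\n\n$$Q(r, t) \\le \\sqrt{\\frac{2Br}{n-t}} \\quad \\text{and} \\quad c_\\delta\\left(r - \\sqrt{\\frac{2Br}{n-t}},\\ t\\right) = \\mathcal{E}_\\delta(r, t).$$\n\nSet for brevity $\\mathcal{E} = \\mathcal{E}_\\delta(R^*, T^*)$. We have\n\n$$\\begin{aligned} \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right) &= \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2\\mathcal{E},\\ R_{\\widehat{T}} + \\mathcal{E}_\\delta(R_{\\widehat{T}}, \\widehat{T}) \\le R^* + \\mathcal{E}\\right) \\\\ &\\qquad \\text{(since } R_{\\widehat{T}} + \\mathcal{E}_\\delta(R_{\\widehat{T}}, \\widehat{T}) \\le R^* + \\mathcal{E} \\text{ holds with certainty)} \\\\ &\\le \\sum_{t=0}^{n-1} \\mathbb{P}\\left(R_t + \\mathcal{E}_\\delta(R_t, t) \\le R^* + \\mathcal{E},\\ \\mathrm{risk}(H_t) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right). \\qquad (6) \\end{aligned}$$\n\nNow, if $R_t + \\mathcal{E}_\\delta(R_t, t) \\le R^* + \\mathcal{E}$ holds, then at least one of the following three conditions\n\n$$R_t \\le \\mathrm{risk}(H_t) - \\mathcal{E}_\\delta(R_t, t), \\qquad R^* > \\mathrm{risk}(H^*) + \\mathcal{E}, \\qquad \\mathrm{risk}(H_t) - \\mathrm{risk}(H^*) < 2\\mathcal{E}$$\n\nmust hold.  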
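Hence, for any fixed $t$ we can write\n\n$$\\begin{aligned} &\\mathbb{P}\\left(R_t + \\mathcal{E}_\\delta(R_t, t) \\le R^* + \\mathcal{E},\\ \\mathrm{risk}(H_t) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right) \\\\ &\\quad \\le \\mathbb{P}\\left(R_t \\le \\mathrm{risk}(H_t) - \\mathcal{E}_\\delta(R_t, t),\\ \\mathrm{risk}(H_t) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right) \\\\ &\\qquad + \\mathbb{P}\\left(R^* > \\mathrm{risk}(H^*) + \\mathcal{E},\\ \\mathrm{risk}(H_t) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right) \\\\ &\\qquad + \\mathbb{P}\\left(\\mathrm{risk}(H_t) - \\mathrm{risk}(H^*) < 2\\mathcal{E},\\ \\mathrm{risk}(H_t) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right) \\\\ &\\quad \\le \\mathbb{P}\\left(R_t \\le \\mathrm{risk}(H_t) - \\mathcal{E}_\\delta(R_t, t)\\right) + \\mathbb{P}\\left(R^* > \\mathrm{risk}(H^*) + \\mathcal{E}\\right). \\qquad (7) \\end{aligned}$$\n\nPlugging (7) into (6) we have\n\n$$\\begin{aligned} \\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2\\mathcal{E}\\right) &\\le \\sum_{t=0}^{n-1} \\mathbb{P}\\left(R_t \\le \\mathrm{risk}(H_t) - \\mathcal{E}_\\delta(R_t, t)\\right) + n\\, \\mathbb{P}\\left(R^* > \\mathrm{risk}(H^*) + \\mathcal{E}\\right) \\\\ &\\le n e^{-B} + n \\sum_{t=0}^{n-1} \\mathbb{P}\\left(R_t \\ge \\mathrm{risk}(H_t) + \\mathcal{E}_\\delta(R_t, t)\\right) \\le n e^{-B} + n^2 e^{-B}, \\end{aligned}$$\n\nwhere in the last two inequalities we applied again Bernstein's inequality to the random variables $R_t$ with mean $\\mathrm{risk}(H_t)$. Putting together we obtain\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\mathrm{risk}(H^*) + 2c_\\delta(\\mathrm{risk}(H^*), T^*)\\right) \\le (2n + n^2) e^{-B}$$\n\nwhich, recalling that $B = \\ln\\frac{n(n+2)}{\\delta}$, implies the thesis. $\\Box$\n\nFix $n \\ge 1$ and $\\delta \\in (0, 1)$. For each $t = 0, \\ldots, n-1$, introduce the function\n\n$$f_t(x) = x + \\frac{11C}{3} \\cdot \\frac{\\ln(n-t) + 1}{n-t} + 2\\sqrt{\\frac{2Cx}{n-t}}, \\qquad x \\ge 0,$$\n\nwhere $C = \\ln\\frac{2n(n+2)}{\\delta}$. Note that each $f_t$ is monotonically increasing. We are now ready to state and prove the main result of this paper.\n\nTheorem 4  Fix any loss function $\\ell$ satisfying $0 \\le \\ell \\le 1$. Let $H_0, \\ldots, H_{n-1}$ be the ensemble of hypotheses generated by an arbitrary on-line algorithm $A$ and let $\\widehat{H}$ be the hypothesis minimizing the penalized empirical risk expression obtained by replacing $\\mathcal{E}_\\delta$ with $\\mathcal{E}_{\\delta/2}$ in (5). Then, for any $0 < \\delta \\le 1$, $\\widehat{H}$ satisfies\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) \\ge \\min_{0 \\le t < n} f_t\\left(M_{t,n} + \\frac{36}{n-t} \\ln\\frac{2n(n+3)}{\\delta} + 2\\sqrt{\\frac{M_{t,n}}{n-t} \\ln\\frac{2n(n+3)}{\\delta}}\\right)\\right) \\le \\delta,$$\n\nwhere $M_{t,n} = \\frac{1}{n-t} \\sum_{i=t+1}^{n} \\ell(H_{i-1}(X_i), Y_i)$.  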
In particular, upper bounding the minimum over $t$ with $t = 0$ yields\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) \\ge f_0\\left(M_n + \\frac{36}{n} \\ln\\frac{2n(n+3)}{\\delta} + 2\\sqrt{\\frac{M_n}{n} \\ln\\frac{2n(n+3)}{\\delta}}\\right)\\right) \\le \\delta. \\qquad (8)$$\n\nFor $n \\to \\infty$, bound (8) shows that $\\mathrm{risk}(\\widehat{H})$ is bounded with high probability by\n\n$$M_n + O\\left(\\frac{\\ln^2 n}{n} + \\sqrt{\\frac{M_n \\ln n}{n}}\\right).$$\n\nIf the empirical cumulative loss $n M_n$ is small (say, $M_n \\le c/n$, where $c$ is constant with $n$), then our penalized empirical risk minimizer $\\widehat{H}$ achieves a $O\\left((\\ln^2 n)/n\\right)$ risk bound. Also, recall that, in this case, under convexity assumptions the average hypothesis $\\overline{H}$ achieves the sharper bound $O(1/n)$.\n\nProof. Let $\\mu_{t,n} = \\frac{1}{n-t} \\sum_{i=t}^{n-1} \\mathrm{risk}(H_i)$. Applying Lemma 3 with $c_{\\delta/2}$ we obtain\n\n$$\\mathbb{P}\\left(\\mathrm{risk}(\\widehat{H}) > \\min_{0 \\le t < n} \\left(\\mathrm{risk}(H_t) + 2c_{\\delta/2}(\\mathrm{risk}(H_t), t)\\right)\\right) \\le \\frac{\\delta}{2}. \\qquad (9)$$\n\nWe then observe that\n\n$$\\begin{aligned} \\min_{0 \\le t < n} \\left(\\mathrm{risk}(H_t) + 2c_{\\delta/2}(\\mathrm{risk}(H_t), t)\\right) &= \\min_{0 \\le t < n}\\ \\min_{t \\le i < n} \\left(\\mathrm{risk}(H_i) + 2c_{\\delta/2}(\\mathrm{risk}(H_i), i)\\right) \\\\ &\\le \\min_{0 \\le t < n} \\frac{1}{n-t} \\sum_{i=t}^{n-1} \\left(\\mathrm{risk}(H_i) + 2c_{\\delta/2}(\\mathrm{risk}(H_i), i)\\right) \\\\ &\\le \\min_{0 \\le t < n} \\left(\\mu_{t,n} + \\frac{11C}{3} \\cdot \\frac{\\ln(n-t) + 1}{n-t} + 2\\sqrt{\\frac{2C\\mu_{t,n}}{n-t}}\\right) \\\\ &= \\min_{0 \\le t < n} f_t(\\mu_{t,n}), \\end{aligned}$$\n\nwhere the last inequality uses $\\sqrt{x + y} \\le \\sqrt{x} + \\frac{y}{2\\sqrt{x}}$, the bound $\\sum_{i=1}^{k} 1/i \\le 1 + \\ln k$, and the concavity of the square root.\n\nNow, it is clear that Proposition 2 can be immediately generalized to imply the following set of inequalities, one for each $t = 0, \\ldots, n-1$,\n\n$$\\mathbb{P}\\left(\\mu_{t,n} \\ge M_{t,n} + \\frac{36A}{n-t} + 2\\sqrt{\\frac{M_{t,n} A}{n-t}}\\right) \\le \\frac{\\delta}{2n}, \\qquad (10)$$\n\nwhere $A = \\ln\\frac{2n(n+3)}{\\delta}$. Introduce the random variables $K_0, \\ldots$ 
, $K_{n-1}$ to be defined later. We can write\n\n$$\\begin{aligned} \\mathbb{P}\\left(\\min_{0 \\le t < n} \\left(\\mathrm{risk}(H_t) + 2c_{\\delta/2}(\\mathrm{risk}(H_t), t)\\right) \\ge \\min_{0 \\le t < n} K_t\\right) &\\le \\mathbb{P}\\left(\\min_{0 \\le t < n} f_t(\\mu_{t,n}) \\ge \\min_{0 \\le t < n} K_t\\right) \\\\ &\\le \\sum_{t=0}^{n-1} \\mathbb{P}\\left(f_t(\\mu_{t,n}) \\ge K_t\\right). \\end{aligned}$$\n\nNow, for each $t = 0, \\ldots, n-1$, define $K_t = f_t\\left(M_{t,n} + \\frac{36A}{n-t} + 2\\sqrt{\\frac{M_{t,n} A}{n-t}}\\right)$. Then (10) and the monotonicity of $f_0, \\ldots, f_{n-1}$ allow us to obtain\n\n$$\\begin{aligned} \\mathbb{P}\\left(\\min_{0 \\le t < n} \\left(\\mathrm{risk}(H_t) + 2c_{\\delta/2}(\\mathrm{risk}(H_t), t)\\right) \\ge \\min_{0 \\le t < n} K_t\\right) &\\le \\sum_{t=0}^{n-1} \\mathbb{P}\\left(f_t(\\mu_{t,n}) \\ge f_t\\left(M_{t,n} + \\frac{36A}{n-t} + 2\\sqrt{\\frac{M_{t,n} A}{n-t}}\\right)\\right) \\\\ &= \\sum_{t=0}^{n-1} \\mathbb{P}\\left(\\mu_{t,n} \\ge M_{t,n} + \\frac{36A}{n-t} + 2\\sqrt{\\frac{M_{t,n} A}{n-t}}\\right) \\le \\frac{\\delta}{2}. \\end{aligned}$$\n\nCombining with (9) concludes the proof. $\\Box$\n\n4  Conclusions and current research issues\n\nWe have shown tail risk bounds for specific hypotheses selected from the ensemble generated by the run of an arbitrary on-line algorithm. Proposition 2, our simplest bound, is proven via an easy application of Bernstein's maximal inequality for martingales, a quite basic result in probability theory. The analysis of Theorem 4 is also centered on the same martingale inequality. An open problem is to simplify this analysis, possibly obtaining a more readable bound. Also, the bound shown in Theorem 4 contains $\\ln n$ terms. We do not know whether these logarithmic terms can be improved to $\\ln(M_n n)$, similarly to Proposition 2. A further open problem is to prove lower bounds, even in the special case when $n M_n$ is bounded by a constant.\n\nReferences\n[1] A. Blum, A. Kalai, and J. Langford. Beating the hold-out. In Proc. 12th COLT, 1999.\n[2] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. on Information Theory, 50(9):2050-2057, 2004.\n[3] L. Devroye, L. Gy\u00f6rfi, and G. Lugosi. 
A Probabilistic Theory of Pattern Recognition. Springer Verlag, 1996.\n[4] D. A. Freedman. On tail probabilities for martingales. The Annals of Probability, 3:100-118, 1975.\n[5] N. Littlestone. From on-line to batch learning. In Proc. 2nd COLT, 1989.\n[6] T. Zhang. Data dependent concentration bounds for sequential prediction algorithms. In Proc. 18th COLT, 2005.\n", "award": [], "sourceid": 2839, "authors": [{"given_name": "Nicol\u00f2", "family_name": "Cesa-bianchi", "institution": null}, {"given_name": "Claudio", "family_name": "Gentile", "institution": null}]}