{"title": "Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations", "book": "Advances in Neural Information Processing Systems", "page_first": 819, "page_last": 826, "abstract": "", "full_text": "Convergence and Consistency of\n\nRegularized Boosting Algorithms with\n\nStationary \u03b2-Mixing Observations\n\nDepartment of Electrical Engineering\n\nDepartment of Electrical Engineering\n\nAur\u00b4elie C. Lozano\n\nPrinceton University\nPrinceton, NJ 08544\n\nSanjeev R. Kulkarni\n\nPrinceton University\nPrinceton, NJ 08544\n\nalozano@princeton.edu\n\nkulkarni@princeton.edu\n\nRobert E. Schapire\n\nDepartment of Computer Science\n\nPrinceton University\nPrinceton, NJ 08544\n\nschapire@cs.princeton.edu\n\nAbstract\n\nWe study the statistical convergence and consistency of regularized\nBoosting methods, where the samples are not independent and identi-\ncally distributed (i.i.d.) but come from empirical processes of stationary\n\u03b2-mixing sequences. Utilizing a technique that constructs a sequence of\nindependent blocks close in distribution to the original samples, we prove\nthe consistency of the composite classi\ufb01ers resulting from a regulariza-\ntion achieved by restricting the 1-norm of the base classi\ufb01ers\u2019 weights.\nWhen compared to the i.i.d. case, the nature of sampling manifests in the\nconsistency result only through generalization of the original condition\non the growth of the regularization parameter.\n\n1 Introduction\n\nA signi\ufb01cant development in machine learning for classi\ufb01cation has been the emergence\nof boosting algorithms [1]. Simply put, a boosting algorithm is an iterative procedure that\ncombines weak prediction rules to produce a composite classi\ufb01er, the idea being that one\ncan obtain very precise prediction rules by combining rough ones. It was shown in [2] that\nAdaBoost, the most popular Boosting algorithm, can be seen as stage-wise \ufb01tting of addi-\ntive models under the exponential loss function and it effectively minimizes an empirical\nloss function that differs from the probability of incorrect prediction. From this perspec-\ntive, boosting can be seen as performing a greedy stage-wise minimization of various loss\nfunctions empirically. The question of whether boosting achieves Bayes-consistency then\narises, since minimizing an empirical loss function does not necessarily imply minimizing\nthe generalization error. When run a very long time, the AdaBoost algorithm, though resis-\ntant to over\ufb01tting, is not immune to it [2, 3]. There also exist cases where running Adaboost\n\n\fforever leads to a prediction error larger than the Bayes error in the limit of in\ufb01nite sample\nsize. Consequently, one approach for the study of consistency is to modify the original Ad-\naboost algorithm by imposing some constraints on the weights of the composite classi\ufb01er\nto avoid over\ufb01tting. In this regularized version of Adaboost, the 1-norm of the weights of\nthe base classi\ufb01ers is restricted to a \ufb01xed value. The minimization of the loss function is\nperformed over the restricted class [4, 5].\n\nIn this paper, we examine the convergence and consistency of regularized boosting algo-\nrithms with samples that are no longer i.i.d. but come from empirical processes of station-\nary weakly dependent sequences. A practical motivation for our study of non i.i.d. 
In this paper, we examine the convergence and consistency of regularized boosting algorithms with samples that are no longer i.i.d. but come from empirical processes of stationary, weakly dependent sequences. A practical motivation for our study of non-i.i.d. sampling is that in many learning applications observations are intrinsically temporal and hence often weakly dependent. Ignoring this dependence could seriously undermine the performance of the learning process (for instance, information related to the time-dependent ordering of the samples would be lost). Recognition of this issue has led to several studies of non-i.i.d. sampling [6, 7, 8, 9, 10, 11, 12].

To cope with weak dependence we apply mixing theory, which, through its definition of mixing coefficients, offers a powerful approach for extending results established for traditional i.i.d. observations to weakly dependent, or mixing, sequences. We consider the β-mixing coefficients, whose mathematical definition is deferred to Section 2.1. Intuitively, they provide a "measure" of how fast the dependence between observations diminishes as the distance between them increases. If certain conditions on the mixing coefficients are satisfied, reflecting a sufficiently fast decline of the dependence between observations as their distance grows, counterparts of results for i.i.d. random processes can be established. A comprehensive review of mixing theory is provided in [13].

Our principal finding is that consistency of regularized boosting methods can be established for non-i.i.d. samples coming from empirical processes of stationary β-mixing sequences. Among the conditions that guarantee consistency, the mixing nature of the sampling appears only through a generalization of the condition on the growth of the regularization parameter originally stated for the i.i.d. case [4].

2 Background and Setup

2.1 Mixing Sequences

Let W = (W_i)_{i≥1} be a strictly stationary sequence of random variables, each having the same distribution P on D ⊂ R^d. Let σ_1^l = σ(W_1, W_2, ..., W_l) be the σ-field generated by W_1, ..., W_l. Similarly, let σ_{l+k}^∞ = σ(W_{l+k}, W_{l+k+1}, ...). The following mixing coefficient characterizes how close to independent a sequence W is.

Definition 1. For any sequence W, the β-mixing¹ coefficient is defined by

β_W(n) = sup_k E sup { |P(A | σ_1^k) − P(A)| : A ∈ σ_{k+n}^∞ },

where the expectation is taken with respect to σ_1^k.

¹ To gain insight into the notion of β-mixing, it is useful to think of the σ-field generated by a random variable X as the "body of information" carried by X. This leads to the following interpretation of β-mixing. Suppose that the index i in W_i is a time index. Let A be an event happening in the future, within the period of time between t = k + n and t = ∞. Then |P(A | σ_1^k) − P(A)| is the absolute difference between the probability that event A occurs given knowledge of the information generated by the past up to t = k, and the probability of event A occurring without this knowledge. The greater the dependence between σ_1^k (the information generated by (W_1, ..., W_k)) and σ_{k+n}^∞ (the information generated by (W_{k+n}, ..., W_∞)), the larger the coefficient β_W(n).

Hence β_W(n) quantifies the degree of dependence between "future" observations and "past" ones separated by a distance of at least n. In this study, we will assume that the sequences we consider are algebraically β-mixing. This property implies that the dependence between observations decreases sufficiently fast as the distance between them increases.

Definition 2. A sequence W is called β-mixing if lim_{n→∞} β_W(n) = 0. It is furthermore algebraically β-mixing if there is a positive constant r_β such that β_W(n) = O(n^{−r_β}).

The choice of β-mixing appears appropriate given previous results showing that the "uniform convergence of empirical means uniformly in probability" and "probably approximately correct" properties are preserved for β-mixing inputs [11]. Some examples of β-mixing sequences that fit naturally in a learning scenario are certain Markov processes and hidden Markov models [11]. In practice, if the mixing properties are unknown, they need to be estimated. Although this is difficult in general, there exist simple methods for determining the mixing rates of various classes of random processes (e.g., Gaussian, Markov, ARMA, ARCH, GARCH). Hence the assumption of a known mixing rate is reasonable and has been adopted in many studies [6, 7, 8, 9, 10, 12].
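As an illustration of such a sampling model (not taken from the paper), the sketch below draws a training sequence (X_i, Y_i) whose dependence comes from a two-state Markov chain; finite, irreducible, aperiodic Markov chains are β-mixing with geometrically decaying coefficients, hence algebraically β-mixing for every r_β. The transition probability, the class-conditional feature distributions, and the label-noise level are all assumptions chosen for the example.

import numpy as np

def markov_mixing_sample(n, stay=0.9, noise=0.1, seed=0):
    """Stationary sequence (X_i, Y_i) driven by a symmetric two-state Markov chain:
    nearby observations are correlated, distant ones are nearly independent."""
    rng = np.random.default_rng(seed)
    states = np.empty(n, dtype=int)
    states[0] = rng.integers(2)              # the uniform distribution is stationary here
    for i in range(1, n):
        states[i] = states[i - 1] if rng.random() < stay else 1 - states[i - 1]
    X = rng.normal(loc=np.where(states == 1, 1.0, -1.0), scale=0.7)   # feature driven by the hidden state
    Y = np.where(states == 1, 1, -1)
    Y = np.where(rng.random(n) < noise, -Y, Y)                        # label noise, so L* > 0
    return X, Y

X, Y = markov_mixing_sample(2000)
# crude check that the dependence decays with the lag
for lag in (1, 5, 20, 100):
    print(lag, round(np.corrcoef(Y[:-lag], Y[lag:])[0, 1], 3))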
2.2 Classification with Stationary β-Mixing Training Data

In the standard binary classification problem, the training data consist of a set S_n = {(X_1, Y_1), ..., (X_n, Y_n)}, where X_k belongs to some measurable space X and Y_k is in {−1, 1}. Using S_n, a classifier h_n : X → {−1, 1} is built to predict the label Y of an unlabeled observation X. Traditionally, the samples are assumed to be i.i.d., and to our knowledge this assumption is made in all studies of boosting consistency. In this paper, we suppose that the sampling is no longer i.i.d. but corresponds to an empirical process of a stationary β-mixing sequence. More precisely, let D = X × Y, where Y = {−1, +1}, and let W_i = (X_i, Y_i). We suppose that W = (W_i)_{i≥1} is a strictly stationary sequence of random variables, each having the same distribution P on D, and that W is β-mixing (see Definition 2). This setup is in line with [7]. We assume that the unlabeled observation is such that (X, Y) is independent of S_n but has the same marginal distribution.

3 Statistical Convergence and Consistency of Regularized Boosting for Stationary β-Mixing Sequences

3.1 Regularized Boosting

We adopt the framework of [4], which we now recall. Let H denote the class of base classifiers h : X → {−1, 1}, which usually consists of simple rules (for instance, decision stumps). This class is required to have finite VC-dimension. Let F be the class of functions f : X → [−1, 1] obtained as convex combinations of the classifiers in H:

F = { f(X) = Σ_{j=1}^t α_j h_j(X) : t ∈ N, α_1, ..., α_t ≥ 0, Σ_{j=1}^t α_j = 1, h_1, ..., h_t ∈ H }.   (1)

Each f_n ∈ F defines a classifier h_{f_n} = sign(f_n), and for simplicity the generalization error L(h_{f_n}) is denoted by L(f_n). The training error is denoted by L_n(f_n) = (1/n) Σ_{i=1}^n I[h_{f_n}(X_i) ≠ Y_i]. Define Z(f) = −f(X)Y and Z_i(f) = −f(X_i)Y_i. Instead of minimizing the indicator of misclassification (I[−f(X)Y > 0]), boosting methods can be shown to effectively minimize a smooth convex cost function of Z(f); for instance, AdaBoost is based on the exponential function. Consider a positive, differentiable, strictly increasing, and strictly convex function φ : R → R_+, and assume that φ(0) = 1 and that lim_{x→−∞} φ(x) = 0. The corresponding cost function and empirical cost function are, respectively, C(f) = Eφ(Z(f)) and C_n(f) = (1/n) Σ_{i=1}^n φ(Z_i(f)). Note that L(f) ≤ C(f), since I[x>0] ≤ φ(x).

The iterative aspect of boosting methods is ignored here; we consider only their performing an (approximate) minimization of the empirical cost function or, as we shall see, of a series of cost functions. To avoid overfitting, the following regularization procedure is used for the choice of the cost functions. Define φ_λ such that, for all λ > 0, φ_λ(x) = φ(λx). The corresponding empirical and expected cost functions become C_n^λ(f) = (1/n) Σ_{i=1}^n φ_λ(Z_i(f)) and C^λ(f) = Eφ_λ(Z(f)). The minimization of a series of cost functions C^λ over the convex hull of H is then analyzed.
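The next sketch simply evaluates these quantities for a given f ∈ F represented as a convex combination of decision stumps: the regularized empirical cost C_n^λ(f) = (1/n) Σ_i φ(−λ f(X_i)Y_i) and the training error of h_f = sign(f). It is illustrative only; the exponential cost and the stump parametrization are assumptions made for the example, not a prescription from the paper.

import numpy as np

phi = np.exp        # exponential cost: phi(0) = 1, increasing, strictly convex, phi(x) -> 0 as x -> -inf

def f_convex_combination(x, alphas, stumps):
    """f(x) = sum_j alpha_j h_j(x) with alpha_j >= 0 summing to 1, so f maps into [-1, 1]."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and abs(alphas.sum() - 1.0) < 1e-12
    out = np.zeros_like(np.asarray(x, dtype=float))
    for a, (thr, sign) in zip(alphas, stumps):
        out += a * sign * np.where(np.asarray(x) > thr, 1.0, -1.0)
    return out

def regularized_empirical_cost(x, y, alphas, stumps, lam):
    """C_n^lambda(f) = (1/n) sum_i phi(-lambda * f(X_i) * Y_i), plus the training error L_n(f)."""
    f_vals = f_convex_combination(x, alphas, stumps)
    cost = np.mean(phi(-lam * f_vals * y))
    train_err = np.mean(np.sign(f_vals) != y)     # note: sign(0) = 0 is counted as a mistake
    return cost, train_err

x = np.array([-0.8, -0.3, 0.1, 0.4, 0.9])
y = np.array([-1, -1, 1, 1, 1])
stumps = [(0.0, +1.0), (-0.5, +1.0)]              # (threshold, orientation) pairs
for lam in (0.5, 1.0, 2.0):
    print(lam, regularized_empirical_cost(x, y, [0.6, 0.4], stumps, lam))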
3.2 Statistical Convergence

The nature of the sampling intervenes in the following two lemmas, which relate the empirical cost C_n^λ(f) to the true cost C^λ(f).

Lemma 1. Suppose that for any n the training data (X_1, Y_1), ..., (X_n, Y_n) come from a stationary algebraically β-mixing sequence with β-mixing coefficients β(m) satisfying β(m) = O(m^{−r_β}), m ∈ N, r_β a positive constant. Then for any λ > 0 and b ∈ [0, 1),

E sup_{f∈F} |C^λ(f) − C_n^λ(f)| ≤ 4λφ'(λ) c_1/n^{(1−b)/2} + 2φ(λ) ( 1/n^{b(1+r_β)−1} + 2/n^{1−b} ).   (2)

Lemma 2. Let the training data be as in Lemma 1. For any b ∈ [0, 1) and α ∈ (0, 1−b), let ε_n = 3(2c_1 + n^{α/2})λφ'(λ)/n^{(1−b)/2}. Then for any λ > 0,

P( sup_{f∈F} |C^λ(f) − C_n^λ(f)| > ε_n ) ≤ exp(−4c_2 n^α) + O(n^{1−b(r_β+1)}).   (3)

The constants c_1 and c_2 in the above lemmas are given in the proofs of Lemma 1 (Section 4.2) and Lemma 2 (Section 4.3), respectively.

3.3 Consistency Result

The following summarizes the assumptions made to prove consistency.

Assumption 1.
I - Properties of the sample sequence: the samples (X_1, Y_1), ..., (X_n, Y_n) are assumed to come from a stationary algebraically β-mixing sequence with β-mixing coefficients β_{X,Y}(n) = O(n^{−r_β}), r_β being a positive constant.
II - Properties of the cost function φ: φ is assumed to be a differentiable, strictly convex, strictly increasing cost function such that φ(0) = 1 and lim_{x→−∞} φ(x) = 0.
III - Properties of the base hypothesis space: H has finite VC-dimension. The distribution of (X, Y) and the class H are such that lim_{λ→∞} inf_{f∈λF} C(f) = C*, where λF = {λf : f ∈ F} and C* = inf C(f), the infimum being taken over all measurable functions f : X → R.
IV - Properties of the smoothing parameter: we assume that λ_1, λ_2, ... is a sequence of positive numbers satisfying λ_n → ∞ as n → ∞, and that there exists a constant c ∈ (1/(1+r_β), 1) such that λ_nφ'(λ_n)/n^{(1−c)/2} → 0 as n → ∞.

Call \hat{f}_n^λ the function in F which approximately minimizes C_n^λ(f), i.e., \hat{f}_n^λ is such that C_n^λ(\hat{f}_n^λ) ≤ inf_{f∈F} C_n^λ(f) + ε_n = inf_{f∈F} (1/n) Σ_{i=1}^n φ_λ(Z_i(f)) + ε_n, with ε_n → 0 as n → ∞. The main result is the following.

Theorem 1 (Consistency of regularized boosting methods for stationary β-mixing sequences). Let f_n = \hat{f}_n^{λ_n} ∈ F, where \hat{f}_n^{λ_n} (approximately) minimizes C_n^{λ_n}(f). Under Assumption 1, lim_{n→∞} L(h_{f_n} = sign(f_n)) = L* almost surely, i.e., h_{f_n} is strongly Bayes-risk consistent.

Cost functions satisfying Assumption 1.II include the exponential function and the logit function log_2(1 + e^x). Regarding Assumption 1.III, the reader is referred to the remark on the denseness assumption in [4]. In Assumption 1.IV, notice that the nature of the sampling leads to a generalization of the condition on the growth of λ_nφ'(λ_n) already present in the i.i.d. setting [4]. More precisely, the nature of the sampling manifests itself through the parameter c, whose range is limited by r_β. The assumption that r_β is known is rather strict but cannot be avoided (such an assumption is widely made, for instance, in the field of time series analysis). On a positive note, if unknown, r_β can be determined for various classes of processes, as mentioned in Section 2.1.
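Assumption 1.IV ties the growth of λ_n to the mixing rate r_β through the exponent c ∈ (1/(1+r_β), 1). As a purely numerical illustration (not part of the paper), the sketch below checks the condition λ_nφ'(λ_n)/n^{(1−c)/2} → 0 for the exponential cost φ(x) = e^x and a logarithmic schedule λ_n = γ log n; any γ < (1−c)/2 works asymptotically, and the values of r_β, c, and γ are assumptions chosen for the example.

import numpy as np

r_beta = 2.0                 # assumed (known) algebraic mixing rate
c = 0.5                      # must satisfy 1/(1 + r_beta) < c < 1; here 1/3 < 0.5 < 1
gamma = 0.05                 # schedule lambda_n = gamma * log(n); needs gamma < (1 - c)/2 = 0.25

phi_prime = np.exp           # derivative of the exponential cost phi(x) = exp(x)

for n in [10**3, 10**4, 10**5, 10**6, 10**7]:
    lam_n = gamma * np.log(n)
    ratio = lam_n * phi_prime(lam_n) / n ** ((1 - c) / 2)
    print(n, round(lam_n, 3), ratio)     # the ratio tends to 0, as required by Assumption 1.IV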
4 Proofs

4.1 Preparation for the Proofs: the Blocking Technique

The key issue resides in upper bounding

sup_{f∈F} |C_n^λ(f) − C^λ(f)| = sup_{f∈F} | (1/n) Σ_{i=1}^n φ(−λf(X_i)Y_i) − Eφ(−λf(X_1)Y_1) |,   (4)

where F is given by (1). Let W = (X, Y) and W_i = (X_i, Y_i). Define the function g_λ by g_λ(W) = g_λ(X, Y) = φ(−λf(X)Y) and the class G_λ by G_λ = {g_λ : g_λ(X, Y) = φ(−λf(X)Y), f ∈ F}. Then (4) can be rewritten as

sup_{f∈F} |C_n^λ(f) − C^λ(f)| = sup_{g_λ∈G_λ} | n^{−1} Σ_{i=1}^n g_λ(W_i) − Eg_λ(W_1) |.

Note that the class G_λ is uniformly bounded by φ(λ). Moreover, if H is a class of measurable functions, then G_λ is also a class of measurable functions, by measurability of F.

As the W_i's are not i.i.d., we use the blocking technique developed in [12, 14] to construct i.i.d. blocks of observations which are close in distribution to the original sequence W_1, ..., W_n. This enables us to work with the sequence of independent blocks instead of the original sequence. We use the same notation as in [12]. The protocol is the following. Let (b_n, μ_n) be a pair of integers such that

(n − 2b_n) ≤ 2b_n μ_n ≤ n.   (5)

Divide the segment W_1 = (X_1, Y_1), ..., W_n = (X_n, Y_n) of the mixing sequence into 2μ_n blocks of size b_n, followed by a remaining block (of size at most 2b_n). Consider the odd blocks only. If their size b_n is large enough, the dependence between them is weak, since two odd blocks are separated by an even block of the same size b_n. Therefore, the odd blocks can be approximated by a sequence of independent blocks with the same within-block structure. The same holds if we consider the even blocks.
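To fix ideas, the following sketch (an illustration consistent with (5), not code from the paper) partitions the index set {1, ..., n} into 2μ_n alternating blocks of length b_n plus a remainder block, returning the odd and even blocks separately; the choice b_n = ⌊n^b⌋ mirrors the one made in the proofs below.

import numpy as np

def make_blocks(n, b=0.5):
    """Split indices 0..n-1 into 2*mu_n alternating blocks of length b_n
    (plus a remainder of length at most 2*b_n), as in the blocking protocol (5)."""
    b_n = max(1, int(n ** b))                 # block length b_n = floor(n^b)
    mu_n = n // (2 * b_n)                     # number of (odd, even) block pairs
    odd_blocks, even_blocks = [], []
    for j in range(mu_n):
        start = 2 * j * b_n
        odd_blocks.append(np.arange(start, start + b_n))              # blocks 1, 3, 5, ...
        even_blocks.append(np.arange(start + b_n, start + 2 * b_n))   # blocks 2, 4, 6, ...
    remainder = np.arange(2 * mu_n * b_n, n)  # leftover indices, fewer than 2*b_n of them
    return b_n, mu_n, odd_blocks, even_blocks, remainder

b_n, mu_n, odd, even, rest = make_blocks(1000, b=0.5)
print(b_n, mu_n, len(rest))                   # 31, 16, 8: note (n - 2*b_n) <= 2*b_n*mu_n <= n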
Let (ξ_1, ..., ξ_{b_n}), (ξ_{b_n+1}, ..., ξ_{2b_n}), ..., (ξ_{(2μ_n−1)b_n+1}, ..., ξ_{2μ_nb_n}) be independent blocks such that (ξ_{jb_n+1}, ..., ξ_{(j+1)b_n}) =_D (W_{jb_n+1}, ..., W_{(j+1)b_n}), for j = 0, ..., μ_n − 1. For j = 1, ..., 2μ_n, and any g ∈ G_λ, define

Z_{j,g} := Σ_{i=(j−1)b_n+1}^{jb_n} g(ξ_i) − b_n Eg(ξ_1),    Z̃_{j,g} := Σ_{i=(j−1)b_n+1}^{jb_n} g(W_i) − b_n Eg(W_1).

Let O_{μ_n} = {1, 3, ..., 2μ_n − 1} and E_{μ_n} = {2, 4, ..., 2μ_n}. Define Z_{i,j}(f) as Z_{i,j}(f) := −f(ξ_{(2j−2)b_n+i,1}) · ξ_{(2j−2)b_n+i,2}, where ξ_{k,1} and ξ_{k,2} are respectively the first and second coordinates of the vector ξ_k. These correspond to Z_k(f) = −f(X_k)Y_k for k in the odd blocks 1, ..., b_n, 2b_n+1, ..., 3b_n, ....
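As a small illustration of these block quantities (hypothetical code, not from the paper), the sketch below evaluates (1/n) Σ_{j∈O_{μ_n}} Z̃_{j,g} for a fixed g ∈ G_λ built from an f with values in [−1, 1]; since Eg(W_1) is unknown in practice, the centering constant is passed in explicitly (in the analysis it is the true mean, here a plug-in value is used), and the toy sample is i.i.d. purely for brevity.

import numpy as np

def odd_block_deviation(g_vals, b_n, mu_n, center):
    """(1/n) * sum over odd blocks j of Z~_{j,g}, where Z~_{j,g} is the sum of g(W_i)
    over the j-th block of length b_n, centered by b_n * center."""
    n = len(g_vals)
    total = 0.0
    for j in range(mu_n):                      # odd blocks are blocks 1, 3, ..., 2*mu_n - 1
        start = 2 * j * b_n
        total += g_vals[start:start + b_n].sum() - b_n * center
    return total / n

rng = np.random.default_rng(0)
n, b_n = 1000, 31
mu_n = n // (2 * b_n)
X = rng.uniform(-1, 1, n)
Y = np.where(X + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)   # toy sample, i.i.d. only for illustration
lam = 1.0
f_vals = np.clip(X, -1.0, 1.0)                              # some fixed f taking values in [-1, 1]
g_vals = np.exp(-lam * f_vals * Y)                          # g(W_i) = phi(-lam f(X_i) Y_i), phi = exp
center = g_vals.mean()                                      # plug-in stand-in for E g(W_1)
print(odd_block_deviation(g_vals, b_n, mu_n, center))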
4.2 Proof Sketch of Lemma 1

A. Working with independent blocks. We show that

E sup_{g∈G_λ} | (1/n) Σ_{i=1}^n g(W_i) − Eg(W_1) | ≤ 2 E sup_{g∈G_λ} | (1/n) Σ_{j∈O_{μ_n}} Z_{j,g} | + φ(λ) ( μ_n β_W(b_n) + 2b_n/n ).   (6)

Proof. Without loss of generality, assume that Eg(W_1) = Eg(ξ_1) = 0. Then E sup_g |(1/n) Σ_{i=1}^n g(W_i)| = E sup_g |(1/n) ( Σ_{j∈O_{μ_n}} Z̃_{j,g} + Σ_{j∈E_{μ_n}} Z̃_{j,g} + R )|, where R is a remainder term consisting of a sum of at most 2b_n terms. Noting that |g| ≤ φ(λ) for all g ∈ G_λ, it follows that E sup_g |(1/n) Σ_{i=1}^n g(W_i)| ≤ E sup_g |(1/n) Σ_{j∈O_{μ_n}} Z̃_{j,g}| + E sup_g |(1/n) Σ_{j∈E_{μ_n}} Z̃_{j,g}| + φ(λ)(2b_n)/n. We use the following intermediary lemma.

Lemma 3 (adapted from [15], Lemma 4.1). Call Q the distribution of (W_1, ..., W_{b_n}, W_{2b_n+1}, ..., W_{3b_n}, ...) and Q̃ the distribution of (ξ_1, ..., ξ_{b_n}, ξ_{2b_n+1}, ..., ξ_{3b_n}, ...). For any measurable function h on R^{b_nμ_n} with bound H, |Qh(W_1, ...) − Q̃h(ξ_1, ...)| ≤ H(μ_n − 1)β_W(b_n). The same result holds for (W_{b_n+1}, ..., W_{2b_n}, W_{3b_n+1}, ..., W_{4b_n}, ...).

Using this with h(W_1, ...) = sup_g |(1/n) Σ_{j∈O_{μ_n}} Z̃_{j,g}| and h(W_{b_n+1}, ...) = sup_g |(1/n) Σ_{j∈E_{μ_n}} Z̃_{j,g}| respectively, and noting that H = φ(λ)/2, we have E sup_g |(1/n) Σ_{j∈O_{μ_n}} Z̃_{j,g}| ≤ (φ(λ)/2) μ_n β_W(b_n) + E sup_g |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}|, and similarly for the even blocks. Hence E sup_g |(1/n) Σ_{i=1}^n g(W_i)| ≤ E sup_g |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}| + E sup_g |(1/n) Σ_{j∈E_{μ_n}} Z_{j,g}| + φ(λ) μ_n β_W(b_n) + φ(λ)(2b_n)/n. As the Z_{j,g}'s from the odd and the even blocks have the same distribution, we obtain (6). □

B. Symmetrization. The odd-block Z_{j,g}'s being independent, we can use standard symmetrization techniques. Let the Z'_{j,g}'s be i.i.d. copies of the Z_{j,g}'s, and let the Z'_{i,j}(f)'s be the corresponding copies of the Z_{i,j}(f)'s. Let (σ_j) be a Rademacher sequence, i.e., a sequence of independent random variables taking the values ±1 with probability 1/2. Then by [16], Lemma 6.3 (the proof is omitted due to space constraints), we have

E sup_g | (1/n) Σ_{j∈O_{μ_n}} Z_{j,g} | ≤ E sup_g | (1/n) Σ_{j∈O_{μ_n}} σ_j (Z_{j,g} − Z'_{j,g}) |.   (7)

C. Contraction principle. We now show that

E sup_{g∈G_λ} | (1/n) Σ_{j∈O_{μ_n}} Z_{j,g} | ≤ 2 b_n λφ'(λ) E sup_{f∈F} | (1/n) Σ_{j=1}^{μ_n} σ_j Z_{1,j}(f) |.   (8)

Proof. Writing Z_{j,g} in terms of the summands φ_λ(Z_{i,j}(f)), i = 1, ..., b_n, and noting that the Z_{i,j}(f)'s and Z'_{i,j}(f)'s are i.i.d., (7) gives E sup_g |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}| ≤ E sup_f |(1/n) Σ_{j∈O_{μ_n}} σ_j Σ_{i=1}^{b_n} (φ_λ(Z_{i,j}(f)) − φ_λ(Z'_{i,j}(f)))| ≤ 2 b_n E sup_f |(1/n) Σ_{j=1}^{μ_n} σ_j (φ_λ(Z_{1,j}(f)) − 1)|. By applying the "Comparison Theorem", Theorem 7 in [17], to the contraction ψ(x) = (1/(λφ'(λ)))(φ_λ(x) − 1), we obtain (8). □

D. Maximal inequality. We show that there exists a constant c_1 > 0 such that

E sup_{f∈F} | (1/n) Σ_{j=1}^{μ_n} σ_j Z_{1,j}(f) | ≤ c_1 √μ_n / n.   (9)

Proof. Denote (h_1, ..., h_N) by h_1^N. One can write E sup_{f∈F} |(1/n) Σ_{j=1}^{μ_n} σ_j Z_{1,j}(f)| = (1/n) E sup_{N≥1} sup_{h_1^N∈H^N} sup_{α_1,...,α_N} |Σ_{j=1}^{μ_n} Σ_{k=1}^N α_k σ_j ξ_{(2j−2)b_n+1,2} h_k(ξ_{(2j−2)b_n+1,1})|. Since ξ_{(2j−2)b_n+1,2} and ξ_{(2j'−2)b_n+1,2} are i.i.d. for all j ≠ j' (they come from different blocks), and (σ_j) is a Rademacher sequence, the sequence (σ_j ξ_{(2j−2)b_n+1,2} h_k(ξ_{(2j−2)b_n+1,1}))_{j=1,...,μ_n} has the same distribution as (σ_j h_k(ξ_{(2j−2)b_n+1,1}))_{j=1,...,μ_n}. Hence

E sup_{f∈F} | (1/n) Σ_{j=1}^{μ_n} σ_j Z_{1,j}(f) | = (1/n) E sup_{N≥1} sup_{h_1^N∈H^N} sup_{α_1,...,α_N} | Σ_{j=1}^{μ_n} Σ_{k=1}^N σ_j α_k h_k(ξ_{(2j−2)b_n+1,1}) |.

By the same argument as used in [4], p. 53, on the maximum of a linear function over a convex polygon, the supremum is achieved when α_k = 1 for some k.
Hence we get

E sup_{f∈F} | (1/n) Σ_{j=1}^{μ_n} σ_j Z_{1,j}(f) | = (1/n) E sup_{h∈H} | Σ_{j=1}^{μ_n} σ_j h(ξ_{(2j−2)b_n+1,1}) |.

Noting that for all j ≠ j', h(ξ_{(2j−2)b_n+1,1}) and h(ξ_{(2j'−2)b_n+1,1}) are i.i.d., and that Rademacher processes are sub-gaussian, we have by [18], Corollary 2.2.8,

(1/n) E sup_{h∈H∪{0}} | Σ_{j=1}^{μ_n} σ_j h(ξ_{(2j−2)b_n+1,1}) | ≤ (c'√μ_n / n) ∫_0^∞ (log sup_{P_n} N(ε, ρ_{2,P_n}, H∪{0}))^{1/2} dε,

where c' is a constant and N(ε, ρ_{2,P_n}, H∪{0}) is the empirical L_2 covering number. As H has finite VC-dimension (see Assumption 1.III), there exists a positive constant w such that sup_{P_n} N(ε, ρ_{2,P_n}, H∪{0}) = O(ε^{−w}) (see [18], Theorem 2.6.1). Hence ∫_0^∞ (log sup_{P_n} N(ε, ρ_{2,P_n}, H∪{0}))^{1/2} dε < ∞, and (9) follows. □

E. Establishing (2). Combining (6), (8), and (9), we have

E sup_{g∈G_λ} | (1/n) Σ_{i=1}^n g(W_i) − Eg(W_1) | ≤ 4 b_n λφ'(λ) c_1 √μ_n / n + φ(λ) ( μ_n β_W(b_n) + 2b_n/n ).

Take b_n = n^b, with 0 ≤ b < 1. By (5), we obtain μ_n ≤ n^{1−b}/2. Moreover, since the sequence W is assumed to be algebraically β-mixing (see Definition 2), β_W(n) = O(n^{−r_β}). Then μ_n β_W(b_n) = O(n^{1−b(1+r_β)}), and we arrive at (2). □

4.3 Proof Sketch of Lemma 2

A. Working with independent blocks and symmetrization. For any b ∈ [0, 1) and α ∈ (0, 1−b), let

ε_n = 3(2c_1 + n^{α/2}) λφ'(λ) / n^{(1−b)/2}.   (10)

We show that

P( sup_{g∈G_λ} | (1/n) Σ_{i=1}^n g(W_i) − Eg(W_1) | > ε_n ) ≤ 2 P( sup_{g∈G_λ} | (1/n) Σ_{j∈O_{μ_n}} Z_{j,g} | > ε_n/3 ) + O(n^{1−b(1+r_β)}).   (11)

Proof. By [12], Lemma 3.1, we have that for any ε_n such that φ(λ)b_n = o(nε_n), P( sup_g |(1/n) Σ_{i=1}^n g(W_i) − Eg(W_1)| > ε_n ) ≤ 2 P( sup_g |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}| > ε_n/3 ) + 4μ_n β_W(b_n). Set b_n = n^b, with 0 ≤ b < 1. Then μ_n β_W(b_n) = O(n^{1−b(1+r_β)}) (for the same reasons as in Section 4.2 E).
With ε_n as in (10), and since Assumption 1.II implies that λφ'(λ) ≥ φ(λ) − 1, we automatically obtain φ(λ)b_n = o(nε_n). □

B. McDiarmid's bounded difference inequality. For ε_n as in (10), there exists a constant c_2 > 0 such that

P( sup_{g∈G_λ} | (1/n) Σ_{j∈O_{μ_n}} Z_{j,g} | > ε_n/3 ) ≤ exp(−4c_2 n^α).   (12)

Proof. The Z_{j,g}'s of the odd blocks being independent, we can apply McDiarmid's bounded difference inequality ([19], Theorem 9.2, p. 136) to the function sup_{g∈G_λ} |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}|, which depends on Z_{1,g}, Z_{3,g}, ..., Z_{2μ_n−1,g}. Noting that changing the value of one variable does not change the value of the function by more than b_nφ(λ)/n, we obtain with b_n = n^b that for all ε > 0,

P( sup_g |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}| > E sup_g |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}| + ε ) ≤ exp( −4ε²n^{1−b}/φ(λ)² ).

Combining (8) and (9) from the proof of Lemma 1, and taking b_n = n^b, we have E sup_{g∈G_λ} |(1/n) Σ_{j∈O_{μ_n}} Z_{j,g}| ≤ 2λφ'(λ)c_1/n^{(1−b)/2}. With ε = n^{α/2}λφ'(λ)/n^{(1−b)/2}, we obtain ε_n as in (10). Pick λ_0 such that 0 < λ_0 < λ. Then, since λφ'(λ) ≥ φ(λ) − 1, (12) follows with c_2 = (1 − 1/φ(λ_0))². □

C. Establishing (3). Combining (11) and (12), we obtain (3). □

4.4 Proof Sketch of Theorem 1

Let f̄_λ be a function in F minimizing C^λ. With f_n = \hat{f}_n^{λ_n}, we have

C(λ_nf_n) − C* = ( C^{λ_n}(\hat{f}_n^{λ_n}) − C^{λ_n}(f̄_{λ_n}) ) + ( inf_{f∈λ_nF} C(f) − C* ).

Since λ_n → ∞, the second term on the right-hand side converges to zero by Assumption 1.III. By [19], Lemma 8.2, we have C^{λ_n}(\hat{f}_n^{λ_n}) − C^{λ_n}(f̄_{λ_n}) ≤ 2 sup_{f∈F} |C^{λ_n}(f) − C_n^{λ_n}(f)|. By Lemma 2, sup_{f∈F} |C^{λ_n}(f) − C_n^{λ_n}(f)| → 0 with probability 1 if, as n → ∞, λ_nφ'(λ_n)n^{(α+b−1)/2} → 0 and b > 1/(1+r_β). Hence, if Assumption 1.IV holds, C(λ_nf_n) → C* with probability 1. By [4], Lemma 5, the theorem follows. □

References

[1] Schapire, R.E.: The boosting approach to machine learning: an overview. In: Proc. of the MSRI Workshop on Nonlinear Estimation and Classification (2002)
[2] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 (2000) 337–374
[3] Jiang, W.: Does boosting overfit: views from an exact solution. Technical Report 00-03, Department of Statistics, Northwestern University (2000)
[4] Lugosi, G., Vayatis, N.: On the Bayes-risk consistency of regularized boosting methods. Ann. Statist. 32 (2004) 30–55
[5] Zhang, T.: Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Statist. 32 (2004) 56–85
[6] Györfi, L., Härdle, W., Sarda, P., Vieu, P.: Nonparametric Curve Estimation from Time Series. Lecture Notes in Statistics. Springer-Verlag, Berlin (1989)
[7] Irle, A.: On the consistency in nonparametric estimation under mixing assumptions. J. Multivariate Anal. 60 (1997) 123–147
[8] Meir, R.: Nonparametric time series prediction through adaptive model selection. Machine Learning 39 (2000) 5–34
[9] Modha, D., Masry, E.: Memory-universal prediction of stationary random processes. IEEE Trans. Inform. Theory 44 (1998) 117–133
[10] Roussas, G.G.: Nonparametric estimation in mixing sequences of random variables. J. Statist. Plann. Inference 18 (1988) 135–149
[11] Vidyasagar, M.: A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Second edition. Springer-Verlag, London (2002)
[12] Yu, B.: Density estimation in the L∞ norm for dependent data with applications. Ann. Statist. 21 (1993) 711–735
[13] Doukhan, P.: Mixing: Properties and Examples. Springer-Verlag, New York (1995)
[14] Yu, B.: Some Results on Empirical Processes and Stochastic Complexity. Ph.D. thesis, Department of Statistics, U.C. Berkeley (1990)
[15] Yu, B.: Rates of convergence for empirical processes of stationary mixing sequences. Ann. Probab. 22 (1994) 94–116
[16] Ledoux, M., Talagrand, M.: Probability in Banach Spaces. Springer, New York (1991)
[17] Meir, R., Zhang, T.: Generalization error bounds for Bayesian mixture algorithms. J. Machine Learning Research (2003)
[18] van der Vaart, A.W., Wellner, J.A.: Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York (1996)
[19] Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer, New York (1996)
", "award": [], "sourceid": 2790, "authors": [{"given_name": "Aurelie", "family_name": "Lozano", "institution": null}, {"given_name": "Sanjeev", "family_name": "Kulkarni", "institution": null}, {"given_name": "Robert", "family_name": "Schapire", "institution": null}]}