{"title": "Rademacher Complexity Bounds for Non-I.I.D. Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1097, "page_last": 1104, "abstract": "This paper presents the first data-dependent generalization bounds for non-i.i.d. settings based on the notion of Rademacher complexity. Our bounds extend to the non-i.i.d. case existing Rademacher complexity bounds derived for the i.i.d. setting. These bounds provide a strict generalization of the ones found in the i.i.d. case, and can also be used within the standard i.i.d. scenario. They apply to the standard scenario of beta-mixing stationary sequences examined in many previous studies of non-i.i.d. settings and benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from finite samples and leads to tighter bounds.", "full_text": "Rademacher Complexity Bounds for Non-I.I.D. Processes\n\nMehryar Mohri\n\nCourant Institute of Mathematical Sciences\n\nand Google Research\n\n251 Mercer Street\n\nNew York, NY 10012\nmohri@cims.nyu.edu\n\nAfshin Rostamizadeh\n\nDepartment of Computer Science\n\nCourant Institute of Mathematical Sciences\n\n251 Mercer Street\n\nNew York, NY 10012\nrostami@cs.nyu.edu\n\nAbstract\n\nThis paper presents the first Rademacher complexity-based error bounds for non-i.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary \u03b2-mixing process, which is commonly adopted in many previous studies of non-i.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. 
In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and leads to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.\n\n1 Introduction\n\nMost learning theory models such as the standard PAC learning framework [13] are based on the assumption that sample points are independently and identically distributed (i.i.d.). The design of most learning algorithms also relies on this key assumption. In practice, however, the i.i.d. assumption often does not hold. Sample points have some temporal dependence that can affect the learning process. This dependence may appear more clearly in time series prediction or when the samples are drawn from a Markov chain, but various degrees of time-dependence can also affect other learning problems.\n\nA natural scenario for the analysis of non-i.i.d. processes in machine learning is that of observations drawn from a stationary mixing sequence, a standard assumption adopted in most previous studies, which implies a dependence between observations that diminishes with time [7, 9, 10, 14, 15]. The pioneering work of Yu [15] led to VC-dimension bounds for stationary \u03b2-mixing sequences. Similarly, Meir [9] gave bounds based on covering numbers for time series prediction. Vidyasagar [14] studied the extension of PAC learning algorithms to these non-i.i.d. scenarios and proved that under some sub-additivity conditions, a PAC learning algorithm continues to be PAC for these settings. Lozano et al. studied the convergence and consistency of regularized boosting under the same assumptions [7]. Generalization bounds have also been derived for stable algorithms with weakly dependent observations [10]. 
The consistency of learning under the more general scenario of \u03b1-mixing with non-stationary sequences has also been studied by Irle [3] and Steinwart et al. [12].\n\nThis paper gives data-dependent generalization bounds for stationary \u03b2-mixing sequences. Our bounds are based on the notion of Rademacher complexity. They extend to the non-i.i.d. case the Rademacher complexity bounds derived in the i.i.d. setting [2, 4, 5]. To the best of our knowledge, these are the first Rademacher complexity bounds derived for non-i.i.d. processes. Our proofs make use of the so-called independent block technique due to Yu [15] and Bernstein and extend the applicability of the notion of Rademacher complexity to non-i.i.d. cases.\n\nOur generalization bounds benefit from all the advantageous properties of Rademacher complexity as in the i.i.d. case. In particular, since the Rademacher complexity can be bounded in terms of other complexity measures such as covering numbers and VC-dimension [1], it allows us to derive generalization bounds in terms of these other complexity measures, and in fact improve on existing bounds in terms of these other measures, e.g., VC-dimension. But perhaps the most crucial advantage of bounds based on the empirical Rademacher complexity is that they are data-dependent: they measure the complexity of a class of hypotheses based on the training sample and thus better capture the properties of the distribution that has generated the data. The empirical Rademacher complexity can be estimated from finite samples and leads to tighter bounds. Furthermore, the Rademacher complexity of large hypothesis sets such as kernel-based hypotheses, decision trees, and convex neural networks can sometimes be bounded in some specific ways [2]. 
For example, the Rademacher complexity of kernel-based hypotheses can be bounded in terms of the trace of the kernel matrix.\n\nIn Section 2, we present the essential notion of a mixing process for the discussion of learning in non-i.i.d. cases and define the learning scenario. Section 3 introduces the idea of independent blocks and proves a bound on the expected deviation of the error from its empirical estimate. In Section 4, we present our main Rademacher generalization bounds and discuss their properties.\n\n2 Preliminaries\n\nThis section introduces the concepts needed to define the non-i.i.d. scenario we will consider, which coincides with the assumptions made in previous studies [7, 9, 10, 14, 15].\n\n2.1 Non-I.I.D. Distributions\n\nThe non-i.i.d. scenario we will consider is based on stationary \u03b2-mixing processes.\n\nDefinition 1 (Stationarity). A sequence of random variables $\\mathbf{Z} = \\{Z_t\\}_{t=-\\infty}^{\\infty}$ is said to be stationary if for any $t$ and non-negative integers $m$ and $k$, the random vectors $(Z_t, \\ldots, Z_{t+m})$ and $(Z_{t+k}, \\ldots, Z_{t+m+k})$ have the same distribution.\n\nThus, the index $t$, or time, does not affect the distribution of a variable $Z_t$ in a stationary sequence (note that this does not imply independence).\n\nDefinition 2 (\u03b2-mixing). Let $\\mathbf{Z} = \\{Z_t\\}_{t=-\\infty}^{\\infty}$ be a stationary sequence of random variables. For any $i, j \\in \\mathbb{Z} \\cup \\{-\\infty, +\\infty\\}$, let $\\sigma_i^j$ denote the $\\sigma$-algebra generated by the random variables $Z_k$, $i \\le k \\le j$. Then, for any positive integer $k$, the \u03b2-mixing coefficient of the stochastic process $\\mathbf{Z}$ is defined as\n\n$$\\beta(k) = \\sup_{n} \\mathbb{E}_{B \\in \\sigma_{-\\infty}^{n}} \\Big[ \\sup_{A \\in \\sigma_{n+k}^{\\infty}} \\big| \\Pr[A \\mid B] - \\Pr[A] \\big| \\Big]. \\qquad (1)$$\n\n$\\mathbf{Z}$ is said to be \u03b2-mixing if $\\beta(k) \\to 0$. 
It is said to be algebraically \u03b2-mixing if there exist real numbers $\\beta_0 > 0$ and $r > 0$ such that $\\beta(k) \\le \\beta_0 / k^r$ for all $k$, and exponentially mixing if there exist real numbers $\\beta_0$ and $\\beta_1$ such that $\\beta(k) \\le \\beta_0 \\exp(-\\beta_1 k^r)$ for all $k$.\n\nThus, a sequence of random variables is mixing when the dependence of an event on those occurring $k$ units of time in the past weakens as a function of $k$.\n\n2.2 Rademacher Complexity\n\nOur generalization bounds will be based on the following measure of the complexity of a class of functions.\n\nDefinition 3 (Rademacher Complexity). Given a sample $S \\in X^m$, the empirical Rademacher complexity of a set of real-valued functions $H$ defined over a set $X$ is defined as follows:\n\n$$\\widehat{R}_S(H) = \\frac{2}{m} \\mathbb{E}_{\\sigma} \\Big[ \\sup_{h \\in H} \\Big| \\sum_{i=1}^{m} \\sigma_i h(x_i) \\Big| \\;\\Big|\\; S = (x_1, \\ldots, x_m) \\Big]. \\qquad (2)$$\n\nThe expectation is taken over $\\sigma = (\\sigma_1, \\ldots, \\sigma_m)$, where the $\\sigma_i$ are independent uniform random variables taking values in $\\{-1, +1\\}$ called Rademacher random variables. The Rademacher complexity of a hypothesis set $H$ is defined as the expectation of $\\widehat{R}_S(H)$ over all samples of size $m$:\n\n$$R_m(H) = \\mathbb{E}_{S} \\big[ \\widehat{R}_S(H) \\;\\big|\\; |S| = m \\big]. \\qquad (3)$$\n\nThe definition of the Rademacher complexity depends on the distribution according to which samples $S$ of size $m$ are drawn, which in general is a dependent \u03b2-mixing distribution $D$. In the rare instances where a different distribution $\\widetilde{D}$ is considered, typically for an i.i.d. setting, we explicitly indicate that distribution as a superscript: $R_m^{\\widetilde{D}}(H)$.\n\nThe Rademacher complexity measures the ability of a class of functions to fit noise. The empirical Rademacher complexity has the added advantage that it is data-dependent and can be measured from finite samples. 
This can lead to tighter bounds than those based on other measures of complexity such as the VC-dimension [2, 4, 5].\n\nWe will denote by $\\widehat{R}_S(h)$ the empirical average of a hypothesis $h : X \\to \\mathbb{R}$ and by $R(h)$ its expectation over a sample $S$ drawn according to a stationary \u03b2-mixing distribution:\n\n$$\\widehat{R}_S(h) = \\frac{1}{m} \\sum_{i=1}^{m} h(z_i), \\qquad R(h) = \\mathbb{E}_{S} [\\widehat{R}_S(h)]. \\qquad (4)$$\n\nThe following proposition shows that this expectation is independent of the size of the sample $S$, as in the i.i.d. case.\n\nProposition 1. For any sample $S$ of size $m$ drawn from a stationary distribution $D$, the following holds: $\\mathbb{E}_{S \\sim D^m} [\\widehat{R}_S(h)] = \\mathbb{E}_{z \\sim D} [h(z)]$.\n\nProof. Let $S = (z_1, \\ldots, z_m)$. By stationarity, $\\mathbb{E}_{z_i \\sim D} [h(z_i)] = \\mathbb{E}_{z_j \\sim D} [h(z_j)]$ for all $1 \\le i, j \\le m$; thus, we can write:\n\n$$\\mathbb{E}_{S} [\\widehat{R}_S(h)] = \\frac{1}{m} \\sum_{i=1}^{m} \\mathbb{E}_{S} [h(z_i)] = \\frac{1}{m} \\sum_{i=1}^{m} \\mathbb{E}_{z_i} [h(z_i)] = \\mathbb{E}_{z} [h(z)].$$\n\n3 Proof Components\n\nOur proof makes use of McDiarmid\u2019s inequality [8] to show that the empirical average closely estimates its expectation. To derive a Rademacher generalization bound, we apply McDiarmid\u2019s inequality to the following random variable, which is the quantity we wish to bound:\n\n$$\\Phi(S) = \\sup_{h \\in H} R(h) - \\widehat{R}_S(h). \\qquad (5)$$\n\nMcDiarmid\u2019s inequality bounds the deviation of $\\Phi$ from its mean; thus, we must also bound the expectation $\\mathbb{E}[\\Phi]$. However, we immediately face two obstacles: both McDiarmid\u2019s inequality and the standard bound on $\\mathbb{E}[\\Phi]$ hold only for samples drawn in an i.i.d. fashion. The main idea behind our proof is to analyze the non-i.i.d. setting and transfer it to a close independent setting. The following sections will describe in detail our solution to these problems.\n\n3.1 Independent Blocks\n\nWe derive Rademacher generalization bounds for the case where training and test points are drawn from a stationary \u03b2-mixing sequence. As in previous non-i.i.d. 
analyses [7, 9, 10, 15], we use a technique transferring the original problem based on dependent points to one based on a sequence of independent blocks. The method consists of first splitting a sequence $S$ into two subsequences $S_0$ and $S_1$, each made of $\\mu$ blocks of $a$ consecutive points. Given a sequence $S = (z_1, \\ldots, z_m)$ with $m = 2a\\mu$, $S_0$ and $S_1$ are defined as follows:\n\n$$S_0 = (Z_1, Z_2, \\ldots, Z_\\mu), \\quad \\text{where } Z_i = (z_{2(i-1)a+1}, \\ldots, z_{(2i-1)a}), \\qquad (6)$$\n\n$$S_1 = (Z_1^{(1)}, Z_2^{(1)}, \\ldots, Z_\\mu^{(1)}), \\quad \\text{where } Z_i^{(1)} = (z_{(2i-1)a+1}, \\ldots, z_{2ia}). \\qquad (7)$$\n\nInstead of the original sequence of odd blocks $S_0$, we will be working with a sequence $\\widetilde{S}_0$ of independent blocks of equal size $a$ to which standard i.i.d. techniques can be applied: $\\widetilde{S}_0 = (\\widetilde{Z}_1, \\widetilde{Z}_2, \\ldots, \\widetilde{Z}_\\mu)$ with mutually independent $\\widetilde{Z}_k$s, where the points within each block $\\widetilde{Z}_k$ follow the same distribution as in $Z_k$. As stated by the following result of Yu [15, Corollary 2.7], for a sufficiently large spacing $a$ between blocks and a sufficiently fast mixing distribution, the expectation of a bounded measurable function $h$ is essentially unchanged if we work with $\\widetilde{S}_0$ instead of $S_0$.\n\nCorollary 1 ([15]). Let $h$ be a measurable function bounded by $M \\ge 0$ defined over the blocks $Z_k$, then the following holds:\n\n$$\\big| \\mathbb{E}_{S_0}[h] - \\mathbb{E}_{\\widetilde{S}_0}[h] \\big| \\le (\\mu - 1) M \\beta(a), \\qquad (8)$$\n\nwhere $\\mathbb{E}_{S_0}$ denotes the expectation with respect to $S_0$ and $\\mathbb{E}_{\\widetilde{S}_0}$ the expectation with respect to $\\widetilde{S}_0$.\n\nWe denote by $\\widetilde{D}$ the distribution corresponding to the independent blocks $\\widetilde{Z}_k$. 
Also, to work with block sequences, we extend some of our definitions: we define the extension $h_a : Z^a \\to \\mathbb{R}$ of any hypothesis $h \\in H$ to a block-hypothesis by $h_a(B) = \\frac{1}{a} \\sum_{i=1}^{a} h(z_i)$ for any block $B = (z_1, \\ldots, z_a) \\in Z^a$, and define $H_a$ as the set of all block-based hypotheses $h_a$ generated from $h \\in H$.\n\nIt will also be useful to define the subsequence $S_\\mu$, which consists of $\\mu$ singleton points separated by a gap of $2a - 1$ points. This can be thought of as the sequence constructed from $S_0$, or $S_1$, by selecting only the $j$th point from each block, for any fixed $j \\in \\{1, \\ldots, a\\}$.\n\n3.2 Concentration Inequality\n\nMcDiarmid\u2019s inequality requires the sample to be i.i.d. Thus, we first show that $\\Pr[\\Phi(S) > \\epsilon]$ can be bounded in terms of independent blocks and then apply McDiarmid\u2019s inequality to the independent blocks.\n\nLemma 1. Let $H$ be a set of hypotheses bounded by $M$. Let $S$ denote a sample of size $m$ drawn according to a stationary \u03b2-mixing distribution and let $\\widetilde{S}_0$ denote a sequence of independent blocks. Then, for all $a, \\mu, \\epsilon > 0$ with $2\\mu a = m$ and $\\epsilon > \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$, the following bound holds:\n\n$$\\Pr_{S}[\\Phi(S) > \\epsilon] \\le 2 \\Pr_{\\widetilde{S}_0}\\big[\\Phi(\\widetilde{S}_0) - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] > \\epsilon'\\big] + 2(\\mu - 1)\\beta(a),$$\n\nwhere $\\epsilon' = \\epsilon - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$.\n\nProof. We first rewrite the left-hand side probability in terms of even and odd blocks and then apply Corollary 1 as follows:\n\n$$\\begin{aligned} \\Pr_{S}[\\Phi(S) > \\epsilon] &= \\Pr_{S}\\Big[\\sup_{h} \\big(R(h) - \\widehat{R}_S(h)\\big) > \\epsilon\\Big] \\\\ &= \\Pr_{S}\\Big[\\sup_{h} \\Big(\\frac{R(h) - \\widehat{R}_{S_0}(h)}{2} + \\frac{R(h) - \\widehat{R}_{S_1}(h)}{2}\\Big) > \\epsilon\\Big] && \\text{(def. of } \\widehat{R}_S(h)\\text{)} \\\\ &\\le \\Pr_{S}\\Big[\\frac{1}{2}\\Big(\\sup_{h} \\big(R(h) - \\widehat{R}_{S_0}(h)\\big) + \\sup_{h} \\big(R(h) - \\widehat{R}_{S_1}(h)\\big)\\Big) > \\epsilon\\Big] && \\text{(convexity of sup)} \\\\ &= \\Pr_{S}[\\Phi(S_0) + \\Phi(S_1) > 2\\epsilon] && \\text{(def. of } \\Phi\\text{)} \\\\ &\\le \\Pr_{S_0}[\\Phi(S_0) > \\epsilon] + \\Pr_{S_1}[\\Phi(S_1) > \\epsilon] && \\text{(union bound)} \\\\ &= 2 \\Pr_{S_0}[\\Phi(S_0) > \\epsilon] && \\text{(stationarity)} \\\\ &= 2 \\Pr_{S_0}\\big[\\Phi(S_0) - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] > \\epsilon'\\big]. && \\text{(def. of } \\epsilon'\\text{)} \\end{aligned}$$\n\nThe second inequality holds by the union bound and the fact that $\\Phi(S_0)$ or $\\Phi(S_1)$ must surpass $\\epsilon$ for their sum to surpass $2\\epsilon$. 
To complete the proof, we apply Corollary 1 to the expectation of the indicator variable of the event $\\{\\Phi(S_0) - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] > \\epsilon'\\}$, which yields\n\n$$2 \\Pr_{S_0}\\big[\\Phi(S_0) - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] > \\epsilon'\\big] \\le 2 \\Pr_{\\widetilde{S}_0}\\big[\\Phi(\\widetilde{S}_0) - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] > \\epsilon'\\big] + 2(\\mu - 1)\\beta(a).$$\n\nWe can now apply McDiarmid\u2019s inequality to the independent blocks of Lemma 1.\n\nProposition 2. For the same assumptions as in Lemma 1, the following bound holds for all $\\epsilon > \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$:\n\n$$\\Pr_{S}[\\Phi(S) > \\epsilon] \\le 2 \\exp\\Big(\\frac{-2\\mu\\epsilon'^2}{M^2}\\Big) + 2(\\mu - 1)\\beta(a),$$\n\nwhere $\\epsilon' = \\epsilon - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$.\n\nProof. To apply McDiarmid\u2019s inequality, we view each block as an i.i.d. point with respect to $h_a$. $\\Phi(\\widetilde{S}_0)$ can be written in terms of $h_a$ as: $\\Phi(\\widetilde{S}_0) = \\sup_{h_a \\in H_a} R(h_a) - \\widehat{R}_{\\widetilde{S}_0}(h_a) = \\sup_{h_a \\in H_a} R(h_a) - \\frac{1}{\\mu} \\sum_{k=1}^{\\mu} h_a(\\widetilde{Z}_k)$. Thus, changing a block $\\widetilde{Z}_k$ of the sample $\\widetilde{S}_0$ can change $\\Phi(\\widetilde{S}_0)$ by at most $\\frac{1}{\\mu} |h_a(\\widetilde{Z}_k)| \\le M/\\mu$. By McDiarmid\u2019s inequality, the following holds for any $\\epsilon > 2(\\mu - 1)M\\beta(a)$:\n\n$$\\Pr_{\\widetilde{S}_0}\\big[\\Phi(\\widetilde{S}_0) - \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] > \\epsilon'\\big] \\le \\exp\\Big(\\frac{-2\\epsilon'^2}{\\sum_{i=1}^{\\mu} (M/\\mu)^2}\\Big) = \\exp\\Big(\\frac{-2\\mu\\epsilon'^2}{M^2}\\Big).$$\n\nPlugging in the right-hand side in the statement of Lemma 1 proves the proposition.\n\n3.3 Bound on the Expectation\n\nHere, we give a bound on $\\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$ based on the Rademacher complexity, as in the i.i.d. case [2]. 
But, unlike the standard case, the proof requires an analysis in terms of independent blocks.\n\nLemma 2. The following inequality holds for the expectation $\\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$ defined in terms of an independent block sequence: $\\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] \\le R_\\mu^{\\widetilde{D}}(H)$.\n\nProof. By the convexity of the supremum function and Jensen\u2019s inequality, $\\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$ can be bounded in terms of empirical averages over two samples:\n\n$$\\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] = \\mathbb{E}_{\\widetilde{S}_0}\\Big[\\sup_{h \\in H} \\mathbb{E}_{\\widetilde{S}_0'}[\\widehat{R}_{\\widetilde{S}_0'}(h)] - \\widehat{R}_{\\widetilde{S}_0}(h)\\Big] \\le \\mathbb{E}_{\\widetilde{S}_0, \\widetilde{S}_0'}\\Big[\\sup_{h \\in H} \\widehat{R}_{\\widetilde{S}_0'}(h) - \\widehat{R}_{\\widetilde{S}_0}(h)\\Big].$$\n\nWe now proceed with a standard symmetrization argument with the independent blocks thought of as i.i.d. points:\n\n$$\\begin{aligned} \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] &\\le \\mathbb{E}_{\\widetilde{S}_0, \\widetilde{S}_0'}\\Big[\\sup_{h \\in H} \\widehat{R}_{\\widetilde{S}_0'}(h) - \\widehat{R}_{\\widetilde{S}_0}(h)\\Big] \\\\ &= \\mathbb{E}_{\\widetilde{S}_0, \\widetilde{S}_0'}\\Big[\\sup_{h_a \\in H_a} \\frac{1}{\\mu} \\sum_{i=1}^{\\mu} \\big(h_a(Z_i') - h_a(Z_i)\\big)\\Big] && \\text{(def. of } \\widehat{R}\\text{)} \\\\ &= \\mathbb{E}_{\\widetilde{S}_0, \\widetilde{S}_0', \\sigma}\\Big[\\sup_{h_a \\in H_a} \\frac{1}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i \\big(h_a(Z_i') - h_a(Z_i)\\big)\\Big] && \\text{(Rad. var.\u2019s)} \\\\ &\\le \\mathbb{E}_{\\widetilde{S}_0', \\sigma}\\Big[\\sup_{h_a \\in H_a} \\frac{1}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h_a(Z_i')\\Big] + \\mathbb{E}_{\\widetilde{S}_0, \\sigma}\\Big[\\sup_{h_a \\in H_a} \\frac{1}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h_a(Z_i)\\Big] && \\text{(sub-add. of sup)} \\\\ &= 2\\, \\mathbb{E}_{\\widetilde{S}_0, \\sigma}\\Big[\\sup_{h_a \\in H_a} \\frac{1}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h_a(Z_i)\\Big]. \\end{aligned}$$\n\nIn the second equality, we introduced the Rademacher random variables $\\sigma_i$. With probability 1/2, $\\sigma_i = 1$ and the difference $h_a(Z_i') - h_a(Z_i)$ is left unchanged; and, with probability 1/2, $\\sigma_i = -1$ and $Z_i$ and $Z_i'$ are permuted. Since the blocks $Z_i$, or $Z_i'$, are independent, taking the expectation over $\\sigma$ leaves the expectation unchanged. The inequality follows from the sub-additivity of the supremum function and the linearity of expectation. 
The final equality holds because $\\widetilde{S}_0$ and $\\widetilde{S}_0'$ are identically distributed due to the assumption of stationarity.\n\nWe now relate the Rademacher block sequence to a sequence over independent points. The right-hand side of the inequality just presented can be rewritten as\n\n$$2\\, \\mathbb{E}_{\\widetilde{S}_0, \\sigma}\\Big[\\sup_{h_a \\in H_a} \\frac{1}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h_a(Z_i)\\Big] = \\mathbb{E}_{\\widetilde{S}_0, \\sigma}\\Big[\\sup_{h \\in H} \\frac{2}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i \\frac{1}{a} \\sum_{j=1}^{a} h(z_j^{(i)})\\Big],$$\n\nwhere $z_j^{(i)}$ denotes the $j$th point of the $i$th block. For $j \\in [1, a]$, let $\\widetilde{S}_0^j$ denote the i.i.d. sample constructed from the $j$th point of each independent block $Z_i$, $i \\in [1, \\mu]$. By reversing the order of summations and using the convexity of the supremum function, we obtain the following:\n\n$$\\begin{aligned} \\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)] &\\le \\mathbb{E}_{\\widetilde{S}_0, \\sigma}\\Big[\\sup_{h \\in H} \\frac{1}{a} \\sum_{j=1}^{a} \\frac{2}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h(z_j^{(i)})\\Big] && \\text{(reversing order of sums)} \\\\ &\\le \\frac{1}{a} \\sum_{j=1}^{a} \\mathbb{E}_{\\widetilde{S}_0, \\sigma}\\Big[\\sup_{h \\in H} \\frac{2}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h(z_j^{(i)})\\Big] && \\text{(convexity of sup)} \\\\ &= \\frac{1}{a} \\sum_{j=1}^{a} \\mathbb{E}_{\\widetilde{S}_0^j, \\sigma}\\Big[\\sup_{h \\in H} \\frac{2}{\\mu} \\sum_{i=1}^{\\mu} \\sigma_i h(z_j^{(i)})\\Big] && \\text{(marginalization)} \\\\ &= \\mathbb{E}_{\\widetilde{S}_\\mu, \\sigma}\\Big[\\sup_{h \\in H} \\frac{2}{\\mu} \\sum_{z_i \\in \\widetilde{S}_\\mu} \\sigma_i h(z_i)\\Big] \\le R_\\mu^{\\widetilde{D}}(H). \\end{aligned}$$\n\nThe first equality in this derivation is obtained by marginalizing over the variables that do not appear within the inner sum. Then, the second equality holds since, by stationarity, the choice of $j$ does not change the value of the expectation. 
The remaining quantity, modulo absolute values, is the Rademacher complexity over $\\mu$ independent points.\n\n4 Non-i.i.d. Rademacher Generalization Bounds\n\n4.1 General Bounds\n\nThis section presents and analyzes our main Rademacher complexity generalization bounds for stationary \u03b2-mixing sequences.\n\nTheorem 1 (Rademacher complexity bound). Let $H$ be a set of hypotheses bounded by $M \\ge 0$. Then, for any sample $S$ of size $m$ drawn from a stationary \u03b2-mixing distribution, and for any $\\mu, a > 0$ with $2\\mu a = m$ and $\\delta > 2(\\mu - 1)\\beta(a)$, with probability at least $1 - \\delta$, the following inequality holds for all hypotheses $h \\in H$:\n\n$$R(h) \\le \\widehat{R}_S(h) + R_\\mu^{\\widetilde{D}}(H) + M \\sqrt{\\frac{\\log \\frac{2}{\\delta'}}{2\\mu}},$$\n\nwhere $\\delta' = \\delta - 2(\\mu - 1)\\beta(a)$.\n\nProof. Setting the right-hand side of Proposition 2 to $\\delta$ and using Lemma 2 to bound $\\mathbb{E}_{\\widetilde{S}_0}[\\Phi(\\widetilde{S}_0)]$ with the Rademacher complexity $R_\\mu^{\\widetilde{D}}(H)$ shows the result.\n\nAs pointed out earlier, a key advantage of the Rademacher complexity is that it can be measured from data, assuming that the computation of the minimal empirical error can be done effectively and efficiently. In particular, we can closely estimate $\\widehat{R}_{S_\\mu}(H)$, where $S_\\mu$ is a subsample of the sample $S$ drawn from a \u03b2-mixing distribution, by considering random samples of $\\sigma$. The following theorem gives a bound precisely with respect to the empirical Rademacher complexity $\\widehat{R}_{S_\\mu}$.\n\nTheorem 2 (Empirical Rademacher complexity bound). Under the same assumptions as in Theorem 1, for any $\\mu, a > 0$ with $2\\mu a = m$ and $\\delta > 4(\\mu - 1)\\beta(a)$, with probability at least $1 - \\delta$, the following inequality holds for all hypotheses $h \\in H$:\n\n$$R(h) \\le \\widehat{R}_S(h) + \\widehat{R}_{S_\\mu}(H) + 3M \\sqrt{\\frac{\\log \\frac{4}{\\delta'}}{2\\mu}},$$\n\nwhere $\\delta' = \\delta - 4(\\mu - 1)\\beta(a)$.\n\nProof. 
To derive this result from Theorem 1, it suffices to bound $R_\\mu^{\\widetilde{D}}(H)$ in terms of $\\widehat{R}_{S_\\mu}(H)$. The application of Corollary 1 to the indicator variable of the event $\\{R_\\mu^{\\widetilde{D}}(H) - \\widehat{R}_{S_\\mu}(H) > \\epsilon\\}$ yields\n\n$$\\Pr\\big(R_\\mu^{\\widetilde{D}}(H) - \\widehat{R}_{S_\\mu}(H) > \\epsilon\\big) \\le \\Pr\\big(R_\\mu^{\\widetilde{D}}(H) - \\widehat{R}_{\\widetilde{S}_\\mu}(H) > \\epsilon\\big) + (\\mu - 1)\\beta(2a - 1). \\qquad (9)$$\n\nNow, we can apply McDiarmid\u2019s inequality to $R_\\mu^{\\widetilde{D}}(H) - \\widehat{R}_{\\widetilde{S}_\\mu}(H)$, which is defined over points drawn in an i.i.d. fashion. Changing a point of $\\widetilde{S}_\\mu$ can affect $\\widehat{R}_{\\widetilde{S}_\\mu}$ by at most $2M/\\mu$; thus, McDiarmid\u2019s inequality gives\n\n$$\\Pr\\big(R_\\mu^{\\widetilde{D}}(H) - \\widehat{R}_{S_\\mu}(H) > \\epsilon\\big) \\le \\exp\\Big(\\frac{-\\mu\\epsilon^2}{2M^2}\\Big) + (\\mu - 1)\\beta(2a - 1). \\qquad (10)$$\n\nNote that $\\beta$ is a decreasing function, which implies $\\beta(2a - 1) \\le \\beta(a)$. Thus, with probability at least $1 - \\delta/2$, $R_\\mu^{\\widetilde{D}}(H) \\le \\widehat{R}_{S_\\mu}(H) + M \\sqrt{\\frac{2 \\log \\frac{1}{\\delta'}}{\\mu}}$, with $\\delta' = \\delta/2 - (\\mu - 1)\\beta(a)$, a fortiori with $\\delta' = \\delta/4 - (\\mu - 1)\\beta(a)$. The result follows from this inequality combined with the statement of Theorem 1 for a confidence parameter $\\delta/2$.\n\nThis theorem can be used to derive generalization bounds for a variety of hypothesis sets and learning settings. In the next section, we present margin bounds for kernel-based classification.\n\n4.2 Classification\n\nLet $X$ denote the input space, $Y = \\{-1, +1\\}$ the target values in classification, and $Z = X \\times Y$. For any hypothesis $h$ and margin $\\rho > 0$, let $\\widehat{R}_S^{\\rho}(h)$ denote the average amount by which $yh(x)$ deviates from $\\rho$ over a sample $S$: $\\widehat{R}_S^{\\rho}(h) = \\frac{1}{m} \\sum_{i=1}^{m} (\\rho - y_i h(x_i))_+$. 
Given a positive definite symmetric kernel $K : X \\times X \\to \\mathbb{R}$, let $\\mathbf{K}$ denote its Gram matrix for the sample $S$ and $H_K$ the kernel-based hypothesis set $\\{x \\mapsto \\sum_{i=1}^{m} \\alpha_i K(x_i, x) : \\alpha^{\\top} \\mathbf{K} \\alpha \\le 1\\}$, where $\\alpha \\in \\mathbb{R}^{m \\times 1}$ denotes the column-vector with components $\\alpha_i$, $i = 1, \\ldots, m$.\n\nTheorem 3 (Margin bound). Let $\\rho > 0$ and $K$ be a positive definite symmetric kernel. Then, for any $\\mu, a > 0$ with $2\\mu a = m$ and $\\delta > 4(\\mu - 1)\\beta(a)$, with probability at least $1 - \\delta$ over samples $S$ of size $m$ drawn from a stationary \u03b2-mixing distribution, the following inequality holds for all hypotheses $h \\in H_K$:\n\n$$\\Pr[yh(x) \\le 0] \\le \\frac{1}{\\rho} \\widehat{R}_S^{\\rho}(h) + \\frac{4}{\\mu\\rho} \\sqrt{\\mathrm{Tr}[\\mathbf{K}]} + 3 \\sqrt{\\frac{\\log \\frac{4}{\\delta'}}{2\\mu}},$$\n\nwhere $\\delta' = \\delta - 4(\\mu - 1)\\beta(a)$.\n\nProof. For any $h \\in H$, let $\\bar{h}$ denote the corresponding hypothesis defined over $Z$ by: $\\forall z \\in Z$, $\\bar{h}(z) = -yh(x)$; and $\\bar{H}_K$ the hypothesis set $\\{z \\in Z \\mapsto \\bar{h}(z) : h \\in H_K\\}$. Let $L$ denote the loss function associated to the margin loss $\\widehat{R}_S^{\\rho}(h)$. Then, $\\Pr[yh(x) \\le 0] \\le \\Pr[(L \\circ \\bar{h})(z) \\le 0] = R(L \\circ \\bar{h})$. Since $L - 1$ is $1/\\rho$-Lipschitz and $(L - 1)(0) = 0$, by Talagrand\u2019s lemma [6], $\\widehat{R}_S((L - 1) \\circ \\bar{H}_K) \\le 2 \\widehat{R}_S(\\bar{H}_K)/\\rho$. The result is then obtained by applying Theorem 2 to $R((L - 1) \\circ \\bar{h}) = R(L \\circ \\bar{h}) - 1$ with $\\widehat{R}((L - 1) \\circ \\bar{h}) = \\widehat{R}(L \\circ \\bar{h}) - 1$, and using the known bound for the empirical Rademacher complexity of kernel-based classifiers [2, 11]: $\\widehat{R}_S(\\bar{H}_K) \\le \\frac{2}{|S|} \\sqrt{\\mathrm{Tr}[\\mathbf{K}]}$.\n\nIn order to show that this bound converges, we must appropriately choose the parameter $\\mu$, or equivalently $a$, which will depend on the mixing parameter $\\beta$. 
In the case of algebraic mixing and using the straightforward bound $\\mathrm{Tr}[\\mathbf{K}] \\le mR^2$ for the kernel trace, where $R$ is the radius of the ball that contains the data, the following corollary holds.\n\nCorollary 2. With the same assumptions as in Theorem 3, if the distribution is further algebraically \u03b2-mixing, $\\beta(a) = \\beta_0 a^{-r}$, then, with probability at least $1 - \\delta$, the following bound holds for all hypotheses $h \\in H_K$:\n\n$$\\Pr[yh(x) \\le 0] \\le \\frac{1}{\\rho} \\widehat{R}_S^{\\rho}(h) + \\frac{8R m^{\\gamma_1}}{\\rho} + 3 m^{\\gamma_2} \\sqrt{\\log \\frac{4}{\\delta'}},$$\n\nwhere $\\gamma_1 = \\frac{1}{2}\\big(\\frac{3}{r+2} - 1\\big)$, $\\gamma_2 = \\frac{1}{2}\\big(\\frac{3}{2r+4} - 1\\big)$ and $\\delta' = \\delta - 2\\beta_0 m^{\\gamma_1}$.\n\nThis bound is obtained by choosing $\\mu = \\frac{1}{2} m^{\\frac{2r+1}{2r+4}}$, which, modulo a multiplicative constant, is the minimizer of $(\\sqrt{m}/\\mu + \\mu\\beta(a))$. Note that for $r > 1$ we have $\\gamma_1, \\gamma_2 < 0$ and thus, it is clear that the bound converges, while the actual rate will depend on the distribution parameter $r$. A tighter estimate of the trace of the kernel matrix, possibly derived from data, would provide a better bound, as would stronger mixing assumptions, e.g., exponential mixing. Finally, we note that as $r \\to \\infty$ and $\\beta_0 \\to 0$, that is, as the dependence between points vanishes, the right-hand side of the bound approaches $O(\\widehat{R}_S^{\\rho} + 1/\\sqrt{m})$, which coincides with the asymptotic behavior in the i.i.d. case [2, 4, 5].\n\n5 Conclusion\n\nWe presented the first Rademacher complexity error bounds for dependent samples generated by a stationary \u03b2-mixing process, a generalization of similar existing bounds derived for the i.i.d. case. We also gave the first margin bounds for kernel-based classification in this non-i.i.d. setting, including explicit bounds for algebraic \u03b2-mixing processes. 
Similar margin bounds can be obtained for the regression setting by using Theorem 2 and the properties of the empirical Rademacher complexity, as in the i.i.d. case. Many non-i.i.d. bounds based on other complexity measures such as the VC-dimension or covering numbers can be retrieved from our framework. Our framework and the bounds presented could serve as the basis for the design of regularization-based algorithms for dependent samples generated by a stationary \u03b2-mixing process.\n\nAcknowledgements\n\nThis work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR).\n\nReferences\n\n[1] M. Anthony and P. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, UK, 1999.\n\n[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:2002, 2002.\n\n[3] A. Irle. On the consistency in nonparametric estimation under mixing assumptions. Journal of Multivariate Analysis, 60:123\u2013147, 1997.\n\n[4] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443\u2013459. preprint, 2000.\n\n[5] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.\n\n[6] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.\n\n[7] A. Lozano, S. Kulkarni, and R. Schapire. Convergence and consistency of regularized boosting algorithms with stationary \u03b2-mixing observations. Advances in Neural Information Processing Systems, 18, 2006.\n\n[8] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148\u2013188. Cambridge University Press, 1989.\n\n[9] R. Meir. 
Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5\u201334, 2000.\n\n[10] M. Mohri and A. Rostamizadeh. Stability bounds for non-iid processes. Advances in Neural Information Processing Systems, 2007.\n\n[11] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.\n\n[12] I. Steinwart, D. Hush, and C. Scovel. Learning from dependent observations. Technical Report LA-UR-06-3507, Los Alamos National Laboratory, 2007.\n\n[13] L. G. Valiant. A theory of the learnable. ACM Press New York, NY, USA, 1984.\n\n[14] M. Vidyasagar. Learning and Generalization: with Applications to Neural Networks. Springer, 2003.\n\n[15] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. Annals of Probability, 22(1):94\u2013116, 1994.", "award": [], "sourceid": 419, "authors": [{"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Afshin", "family_name": "Rostamizadeh", "institution": null}]}