{"title": "Hypothesis Set Stability and Generalization", "book": "Advances in Neural Information Processing Systems", "page_first": 6729, "page_last": 6739, "abstract": "We present a study of generalization for data-dependent hypothesis\n sets. We give a general learning guarantee for data-dependent\n hypothesis sets based on a notion of transductive Rademacher\n complexity. Our main result is a generalization bound for\n data-dependent hypothesis sets expressed in terms of a notion of\n hypothesis set stability and a notion of Rademacher\n complexity for data-dependent hypothesis sets that we\n introduce. This bound admits as special cases both standard\n Rademacher complexity bounds and algorithm-dependent uniform\n stability bounds. We also illustrate the use of these learning\n bounds in the analysis of several scenarios.", "full_text": "Hypothesis Set Stability and Generalization\n\nDylan J. Foster\n\nMassachusetts Institute of Technology\n\ndylanf@mit.edu\n\nSpencer Greenberg\n\nSpark Wave\n\nadmin@sparkwave.tech\n\nSatyen Kale\n\nGoogle Research\n\nHaipeng Luo\n\nUniversity of Southern California\n\nsatyen@satyenkale.com\n\nhaipengl@usc.edu\n\nMehryar Mohri\n\nGoogle Research and Courant Institute\n\nKarthik Sridharan\nCornell University\n\nmohri@google.com\n\nsridharan@cs.cornell.edu\n\nAbstract\n\nWe present a study of generalization for data-dependent hypothesis sets. We give a\ngeneral learning guarantee for data-dependent hypothesis sets based on a notion of\ntransductive Rademacher complexity. Our main result is a generalization bound\nfor data-dependent hypothesis sets expressed in terms of a notion of hypothesis set\nstability and a notion of Rademacher complexity for data-dependent hypothesis sets\nthat we introduce. This bound admits as special cases both standard Rademacher\ncomplexity bounds and algorithm-dependent uniform stability bounds. 
We also illustrate the use of these learning bounds in the analysis of several scenarios.

1 Introduction

Most generalization bounds in learning theory hold for a fixed hypothesis set, selected before receiving a sample. This includes learning bounds based on covering numbers, VC-dimension, pseudo-dimension, Rademacher complexity, local Rademacher complexity, and other complexity measures [Pollard, 1984, Zhang, 2002, Vapnik, 1998, Koltchinskii and Panchenko, 2002, Bartlett et al., 2002]. Some alternative guarantees have also been derived for specific algorithms. Among them, the most general family is that of uniform stability bounds given by Bousquet and Elisseeff [2002]. These bounds were recently significantly improved by Feldman and Vondrak [2019], who proved guarantees that are informative even when the stability parameter $\beta$ is only in $o(1)$, as opposed to $o(1/\sqrt{m})$. The $\log^2 m$ factor in these bounds was later reduced to $\log m$ by Bousquet et al. [2019]. Bounds for a restricted class of algorithms were also recently presented by Maurer [2017], under a number of assumptions on the smoothness of the loss function. Appendix A gives more background on stability.

In practice, machine learning engineers commonly resort to hypothesis sets depending on the same sample as the one used for training. This includes instances where a regularization, a feature transformation, or a data normalization is selected using the training sample, or other instances where the family of predictors is restricted to a smaller class based on the sample received. In other instances, as is common in deep learning, the data representation and the predictor are learned using the same sample. In ensemble learning, the sample used to train models sometimes coincides with the one used to determine their aggregation weights.
However, standard generalization bounds are not directly applicable in these scenarios since they assume a fixed hypothesis set.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Decomposition of the learning algorithm's hypothesis selection into two stages. In the first stage, the algorithm determines a hypothesis set $H_S$ associated to the training sample $S$, which may be a small subset of the set of all hypotheses that could be considered, say $H = \bigcup_{S \in Z^m} H_S$. The second stage then consists of selecting a hypothesis $h_S$ out of $H_S$.

1.1 Contributions of this paper.

1. Foundational definitions of data-dependent hypothesis sets. We present foundational definitions of learning algorithms that rely on data-dependent hypothesis sets. Here, the algorithm decomposes into two stages: a first stage where the algorithm, on receiving the sample $S$, chooses a hypothesis set $H_S$ dependent on $S$, and a second stage where a hypothesis $h_S$ is selected from $H_S$. Standard generalization bounds correspond to the case where $H_S$ is equal to some fixed $H$ independent of $S$. Algorithm-dependent analyses, such as uniform stability bounds, coincide with the case where $H_S$ is chosen to be a singleton $H_S = \{h_S\}$. Thus, the scenario we study covers both existing settings and other intermediate scenarios. Figure 1 illustrates our general scenario.

2. Learning bounds via transductive Rademacher complexity. We present general learning bounds for data-dependent hypothesis sets using a notion of transductive Rademacher complexity (Section 3). These bounds hold for arbitrary bounded losses and improve upon previous guarantees given by Gat [2001] and Cannon et al.
[2002] for the binary loss, which were expressed in terms of a notion of shattering coefficient adapted to the data-dependent case; they are also more explicit than the guarantees presented by Philips [2005, Corollary 4.6 and Theorem 4.7]. Nevertheless, such bounds may often not be sufficiently informative, since they ignore the relationship between hypothesis sets based on similar samples.

3. Learning bounds via hypothesis set stability. We provide finer generalization bounds based on the key notion of hypothesis set stability that we introduce in this paper. This notion admits algorithmic stability as a special case, when the hypothesis sets are reduced to singletons. We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. Our main results are generalization bounds (Section 4) for stable data-dependent hypothesis sets expressed in terms of the hypothesis set stability parameter, our notion of Rademacher complexity, and a notion of cross-validation stability that, in turn, can be upper-bounded by the diameter of the family of hypothesis sets. Our learning bounds admit as special cases both standard Rademacher complexity bounds and algorithm-dependent uniform stability bounds.

4. New generalization bounds for specific learning applications. In Section 5 (see also Appendix G), we illustrate the generality and the benefits of our hypothesis set stability learning bounds by using them to derive new generalization bounds for several learning applications. To our knowledge, there is no straightforward analysis based on previously existing tools that yields these generalization bounds.
These applications include: (a) bagging algorithms that may employ non-uniform, data-dependent averaging of the base predictors; (b) stochastic strongly-convex optimization algorithms based on an average of other stochastic optimization algorithms; (c) stable representation learning algorithms, which first learn a data representation using the sample and then learn a predictor on top of the learned representation; and (d) distillation algorithms, which first compute a complex predictor using the sample and then use it to learn a simpler predictor that is close to it.

1.2 Related work on data-dependent hypothesis sets.

Shawe-Taylor et al. [1998] presented an analysis of structural risk minimization over data-dependent hierarchies based on a concept of luckiness, which generalizes the notion of margin of linear classifiers. Their analysis can be viewed as an alternative and general study of data-dependent hypothesis sets, using luckiness functions and $\omega$-smallness (or $\omega$-smoothness) conditions. A luckiness function helps decompose a hypothesis set into lucky sets, that is, sets of functions luckier than a given function. The luckiness framework is attractive and the notion of luckiness, for example the margin, can in fact be combined with our results. However, finding pairs of truly data-dependent luckiness and $\omega$-smallness functions, other than those based on the margin and the empirical VC-dimension, is quite difficult, in particular because of the very technical $\omega$-smallness condition [see Philips, 2005, p. 70]. In contrast, hypothesis set stability is simpler and often easier to bound. The notions of luckiness and $\omega$-smallness have also been used by Herbrich and Williamson [2002] to derive algorithm-specific guarantees.
In fact, the authors show a connection with algorithmic stability (not hypothesis set stability), at the price of a guarantee requiring the strong condition that the stability parameter be in $o(1/m)$, where $m$ is the sample size [see Herbrich and Williamson, 2002, pp. 189-190].

Data-dependent hypothesis classes are conceptually related to the notion of data-dependent priors in PAC-Bayesian generalization bounds. Catoni [2007] developed a localized PAC-Bayes analysis by using a prior defined in terms of the data-generating distribution. This work was extended by Lever et al. [2013], who proved sharp risk bounds for stochastic exponential weights algorithms. Parrado-Hernández et al. [2012] investigated the possibility of learning the prior from a separate data set, as well as priors obtained by computing a data-dependent bound on the KL term. More closely related to this paper is the work of Dziugaite and Roy [2018a,b], who develop PAC-Bayes bounds by choosing the prior via a data-dependent differentially private mechanism, and who also showed that notions weaker than differential privacy suffice to yield valid bounds. In Appendix H, we give a more detailed discussion of PAC-Bayes bounds, in particular to show how finer PAC-Bayes bounds than standard ones can be derived from Rademacher complexity bounds, here with an alternative analysis and different constants than [Kakade et al., 2008], and how data-dependent PAC-Bayes bounds can be derived from our data-dependent Rademacher complexity bounds. More discussion on data-dependent priors can be found in Appendix F.3.

2 Definitions and Properties

Let $X$ be the input space and $Y$ the output space, and define $Z := X \times Y$. We denote by $D$ the unknown distribution over $X \times Y$ according to which samples are drawn.

The hypotheses $h$ we consider map $X$ to a set $Y'$ sometimes different from $Y$; for example, in binary classification, we may have $Y = \{-1, +1\}$ and $Y' = R$. Thus, we denote by $\ell : Y' \times Y \to [0, 1]$ a loss function defined on $Y' \times Y$ and taking non-negative real values bounded by one. We denote the loss of a hypothesis $h : X \to Y'$ at point $z = (x, y) \in X \times Y$ by $L(h, z) = \ell(h(x), y)$. We denote by $R(h)$ the generalization error or expected loss of a hypothesis $h \in H$ and by $\widehat{R}_S(h)$ its empirical loss over a sample $S = (z_1, \ldots, z_m)$:
\[
R(h) = \operatorname*{E}_{z \sim D}[L(h, z)], \qquad \widehat{R}_S(h) = \operatorname*{E}_{z \sim S}[L(h, z)] = \frac{1}{m} \sum_{i=1}^m L(h, z_i).
\]
In the general framework we consider, a hypothesis set depends on the sample received. We will denote by $H_S$ the hypothesis set depending on the labeled sample $S \in Z^m$ of size $m \geq 1$. We assume that $H_S$ is invariant to the ordering of the points in $S$.

Definition 1 (Hypothesis set uniform stability). Fix $m \geq 1$. We will say that a family of data-dependent hypothesis sets $\mathcal{H} = (H_S)_{S \in Z^m}$ is $\beta$-uniformly stable (or simply $\beta$-stable) for some $\beta \geq 0$, if for any two samples $S$ and $S'$ of size $m$ differing only by one point, the following holds:
\[
\forall h \in H_S, \exists h' \in H_{S'} : \forall z \in Z, \; |L(h, z) - L(h', z)| \leq \beta. \tag{1}
\]
Thus, two hypothesis sets derived from samples differing by one element are close in the sense that any hypothesis in one admits a counterpart in the other set with $\beta$-similar losses. A closely related notion is the sensitivity of a function $f : Z^m \to R$. Such a function $f$ is called $\beta$-sensitive if for any two samples $S$ and $S'$ of size $m$ differing only by one point, we have $|f(S) - f(S')| \leq \beta$.

We also introduce a new notion of Rademacher complexity for data-dependent hypothesis sets. To introduce its definition, for any two samples $S, T \in Z^m$ and a vector of Rademacher variables $\sigma$, denote by $S_{T,\sigma}$ the sample derived from $S$ by replacing its $i$th element with the $i$th element of $T$, for all $i \in [m] = \{1, 2, \ldots, m\}$ with $\sigma_i = -1$.
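For illustration purposes only (this sketch is ours, not part of the paper), Definition 1 can be checked numerically on the simplest data-dependent family, a singleton $H_S = \{h_S\}$ produced by a stable algorithm. Here the first stage returns the constant predictor given by the sample mean, so swapping one point changes losses by at most $\beta = 1/m$; the names `h_S`, `loss`, and `stability_gap` are illustrative.

```python
# Toy check of Definition 1 (hypothesis set uniform stability) for a singleton
# family H_S = {h_S}, where h_S predicts the sample mean of labels in [0, 1].
# Replacing one point of S moves the mean by at most 1/m; since the clipped
# absolute loss is 1-Lipschitz in the prediction, the family is (1/m)-stable.
# All names here are illustrative stand-ins, not from the paper.

def h_S(sample):
    """First stage: the (singleton) hypothesis set associated to S."""
    return sum(sample) / len(sample)

def loss(pred, z):
    """Bounded loss L(h, z) = min(|h - z|, 1), with z a label in [0, 1]."""
    return min(abs(pred - z), 1.0)

def stability_gap(S, S_prime, test_points):
    """sup_z |L(h_S, z) - L(h_S', z)| over a grid of test points."""
    h, h_prime = h_S(S), h_S(S_prime)
    return max(abs(loss(h, z) - loss(h_prime, z)) for z in test_points)

m = 50
S = [i / m for i in range(m)]      # sample of m labels in [0, 1]
S_prime = S[:-1] + [1.0]           # differs from S in exactly one point
beta = 1.0 / m                     # predicted stability parameter
gap = stability_gap(S, S_prime, [j / 100 for j in range(101)])
print(gap <= beta)                 # the observed gap never exceeds 1/m
```

In this singleton case, hypothesis set stability reduces to classical algorithmic uniform stability, as noted in the contributions above.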
We will use $H^\sigma_{S,T}$ to denote the hypothesis set $H_{S_{T,\sigma}}$.

Definition 2 (Rademacher complexity of data-dependent hypothesis sets). Fix $m \geq 1$. The empirical Rademacher complexity $\widehat{\mathfrak{R}}^\Diamond_m(\mathcal{H})$ and the Rademacher complexity $\mathfrak{R}^\Diamond_m(\mathcal{H})$ of a family of data-dependent hypothesis sets $\mathcal{H} = (H_S)_{S \in Z^m}$ for two samples $S = (z^S_1, \ldots, z^S_m)$ and $T = (z^T_1, \ldots, z^T_m)$ in $Z^m$ are defined by
\[
\widehat{\mathfrak{R}}^\Diamond_m(\mathcal{H}) = \frac{1}{m} \operatorname*{E}_{\sigma} \Big[ \sup_{h \in H^\sigma_{S,T}} \sum_{i=1}^m \sigma_i h(z^T_i) \Big], \qquad \mathfrak{R}^\Diamond_m(\mathcal{H}) = \frac{1}{m} \operatorname*{E}_{S,T \sim D^m,\, \sigma} \Big[ \sup_{h \in H^\sigma_{S,T}} \sum_{i=1}^m \sigma_i h(z^T_i) \Big]. \tag{2}
\]
When the family of data-dependent hypothesis sets $\mathcal{H}$ is $\beta$-stable with $\beta = O(1/m)$, the empirical Rademacher complexity $\widehat{\mathfrak{R}}^\Diamond_m(\mathcal{G})$ is sharply concentrated around its expectation $\mathfrak{R}^\Diamond_m(\mathcal{G})$, as with the standard empirical Rademacher complexity (see Lemma 4).

Let $H_{S,T}$ denote the union of all hypothesis sets based on $\mathcal{U} = \{U : U \subseteq (S \cup T), U \in Z^m\}$, the subsamples of $S \cup T$ of size $m$: $H_{S,T} = \bigcup_{U \in \mathcal{U}} H_U$. Since for any $\sigma$ we have $H^\sigma_{S,T} \subseteq H_{S,T}$, the following simpler upper bound in terms of the standard empirical Rademacher complexity of $H_{S,T}$ can be used for our notion of Rademacher complexity:
\[
\mathfrak{R}^\Diamond_m(\mathcal{H}) \leq \frac{1}{m} \operatorname*{E}_{S,T \sim D^m} \operatorname*{E}_{\sigma} \Big[ \sup_{h \in H_{S,T}} \sum_{i=1}^m \sigma_i h(z^T_i) \Big] = \operatorname*{E}_{S,T \sim D^m} \big[ \widehat{\mathfrak{R}}_T(H_{S,T}) \big],
\]
where $\widehat{\mathfrak{R}}_T(H_{S,T})$ is the standard empirical Rademacher complexity¹ of $H_{S,T}$ for the sample $T$. Some properties of our notion of Rademacher complexity are given in Appendix B.

Let $G_S$ denote the family of loss functions associated to $H_S$:
\[
G_S = \{ z \mapsto L(h, z) : h \in H_S \}, \tag{3}
\]
and let $\mathcal{G} = (G_S)_{S \in Z^m}$ denote the family of hypothesis sets $G_S$. Our main results will be expressed in terms of $\mathfrak{R}^\Diamond_m(\mathcal{G})$. When the loss function $\ell$ is $\mu$-Lipschitz, by Talagrand's contraction lemma [Ledoux and Talagrand, 1991], in all our results $\mathfrak{R}^\Diamond_m(\mathcal{G})$ can be replaced by $\mu \operatorname*{E}_{S,T \sim D^m}[\widehat{\mathfrak{R}}_T(H_{S,T})]$.

Rademacher complexity is one way to measure the capacity of the family of data-dependent hypothesis sets. We also derive learning bounds in situations where a notion of diameter of the hypothesis sets is small. We now define a notion of cross-validation stability and diameter for data-dependent hypothesis sets. In the following, for a sample $S$, $S_{z \leftrightarrow z'}$ denotes the sample obtained from $S$ by replacing $z \in S$ by $z' \in Z$.

Definition 3 (Hypothesis set Cross-Validation (CV) stability, diameter). Fix $m \geq 1$. For some $\bar\chi, \chi, \bar\Delta, \Delta, \Delta_{\max} \geq 0$, we say that a family of data-dependent hypothesis sets $\mathcal{H} = (H_S)_{S \in Z^m}$ has average CV-stability $\bar\chi$, CV-stability $\chi$, average diameter $\bar\Delta$, diameter $\Delta$ and max-diameter $\Delta_{\max}$ if the following hold:
\[
\operatorname*{E}_{S \sim D^m} \operatorname*{E}_{z' \sim D,\, z \sim S} \Big[ \sup_{h \in H_S,\, h' \in H_{S_{z \leftrightarrow z'}}} L(h', z) - L(h, z) \Big] \leq \bar\chi \tag{4}
\]
\[
\sup_{S \in Z^m} \operatorname*{E}_{z' \sim D,\, z \sim S} \Big[ \sup_{h \in H_S,\, h' \in H_{S_{z \leftrightarrow z'}}} L(h', z) - L(h, z) \Big] \leq \chi \tag{5}
\]
\[
\operatorname*{E}_{S \sim D^m} \operatorname*{E}_{z \sim S} \Big[ \sup_{h, h' \in H_S} L(h', z) - L(h, z) \Big] \leq \bar\Delta \tag{6}
\]
\[
\sup_{S \in Z^m} \operatorname*{E}_{z \sim S} \Big[ \sup_{h, h' \in H_S} L(h', z) - L(h, z) \Big] \leq \Delta \tag{7}
\]
\[
\sup_{S \in Z^m} \max_{z \in S} \Big[ \sup_{h, h' \in H_S} L(h', z) - L(h, z) \Big] \leq \Delta_{\max}. \tag{8}
\]
CV-stability of hypothesis sets can be bounded in terms of their stability and diameter (see the straightforward proof in Appendix C).

¹Note that the standard definition of Rademacher complexity assumes that hypothesis sets are not data-dependent; however, the definition remains valid for data-dependent hypothesis sets.

Lemma 1. Suppose a family of data-dependent hypothesis sets $\mathcal{H}$ is $\beta$-uniformly stable. Then if it has diameter $\Delta$, it is $(\Delta + \beta)$-CV-stable, and if it has average diameter $\bar\Delta$, then it is $(\bar\Delta + \beta)$-average CV-stable.

3 General learning bound for data-dependent hypothesis sets

In this section, we present general learning bounds for data-dependent hypothesis sets that do not make use of the notion of hypothesis set stability. One straightforward idea to derive such guarantees for data-dependent hypothesis sets is to replace the hypothesis set $H_S$ depending on the observed sample $S$ by the union of all such hypothesis sets over all samples of size $m$, $H_m = \bigcup_{S \in Z^m} H_S$. However, in general, $H_m$ can be very rich, which can lead to uninformative learning bounds. A somewhat better alternative consists of considering the union of all such hypothesis sets for samples of size $m$ included in some supersample $U$ of size $m + n$, with $n \geq 1$: $H_{U,m} = \bigcup_{S \in Z^m,\, S \subseteq U} H_S$. We will derive learning guarantees based on the maximum transductive Rademacher complexity of $H_{U,m}$. There is a trade-off in the choice of $n$: smaller values lead to less complex sets $H_{U,m}$, but they also lead to weaker dependencies on sample sizes. Our bounds are more refined guarantees than the shattering-coefficient bounds originally given for this problem by Gat [2001] in the case $n = m$, and later by Cannon et al. [2002] for any $n \geq 1$. They also apply to arbitrary bounded loss functions and not just the binary loss. They are expressed in terms of the following notion of transductive Rademacher complexity for data-dependent hypothesis sets:
\[
\widehat{\mathfrak{R}}^\Diamond_{U,m}(\mathcal{G}) = \operatorname*{E}_{\sigma} \Big[ \sup_{h \in H_{U,m}} \frac{1}{m + n} \sum_{i=1}^{m+n} \sigma_i L(h, z^U_i) \Big],
\]
where $U = (z^U_1, \ldots, z^U_{m+n}) \in Z^{m+n}$ and where $\sigma$ is a vector of $(m + n)$ independent random variables taking value $\frac{m+n}{n}$ with probability $\frac{n}{m+n}$, and $-\frac{m+n}{m}$ with probability $\frac{m}{m+n}$. Our notion of transductive Rademacher complexity is simpler than that of El-Yaniv and Pechyony [2007] (in the data-independent case) and leads to simpler proofs and guarantees.
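As an informal illustration (ours, not the paper's), the transductive Rademacher variables above are centered by construction, and for a small finite class the complexity can be approximated by Monte Carlo. The finite loss-vector class `H` below is an arbitrary stand-in chosen only to make the computation concrete.

```python
# Monte Carlo sketch of the transductive Rademacher complexity for a tiny
# finite class H_{U,m}, represented directly by loss vectors (L(h, z_i))_i.
# Each sigma_i takes value (m+n)/n with probability n/(m+n) and -(m+n)/m
# with probability m/(m+n), so E[sigma_i] = 0. Toy setup, not from the paper.
import random

random.seed(0)
m, n = 20, 20
H = [[random.random() for _ in range(m + n)] for _ in range(5)]  # 5 hypotheses

def draw_sigma():
    """One draw of the (m+n) transductive Rademacher variables."""
    return [(m + n) / n if random.random() < n / (m + n) else -(m + n) / m
            for _ in range(m + n)]

trials = 2000
total = 0.0
for _ in range(trials):
    sigma = draw_sigma()
    total += max(sum(s * L for s, L in zip(sigma, h)) for h in H) / (m + n)
complexity = total / trials
print(0.0 < complexity < 1.0)  # small and non-negative for this tiny class
```

Since the supremum of centered sums has non-negative expectation, the estimate is positive but small, which matches the intuition that a small finite class has low capacity.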
A by-product of our analysis is a learning guarantee for standard transductive learning in terms of this notion of transductive Rademacher complexity, which can be of independent interest.

Theorem 1. Let $\mathcal{H} = (H_S)_{S \in Z^m}$ be a family of data-dependent hypothesis sets. Let $\mathcal{G}$ be defined as in (3). Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the choice of the draw of the sample $S \sim D^m$, the following inequality holds for all $h \in H_S$:
\[
R(h) \leq \widehat{R}_S(h) + \max_{U \in Z^{m+n}} \widehat{\mathfrak{R}}^\Diamond_{U,m}(\mathcal{G}) + 3 \sqrt{\frac{m + n}{2mn} \log\Big(\frac{2}{\delta}\Big)}.
\]
Proof. (Sketch; full proof in Appendix D.) We use the following symmetrization result, which holds for any $\epsilon > 0$ with $m\epsilon^2 \geq 2$ for data-dependent hypothesis sets (Lemma 5, Appendix D):
\[
\operatorname*{P}_{S \sim D^m} \Big[ \sup_{h \in H_S} R(h) - \widehat{R}_S(h) > \epsilon \Big] \leq 2 \operatorname*{P}_{S \sim D^m,\, T \sim D^n} \Big[ \sup_{h \in H_S} \widehat{R}_T(h) - \widehat{R}_S(h) > \frac{\epsilon}{2} \Big].
\]
To bound the right-hand side, we use an extension of McDiarmid's inequality to sampling without replacement [Cortes et al., 2008] applied to $\Phi(S) = \sup_{h \in H_{U,m}} \widehat{R}_T(h) - \widehat{R}_S(h)$. Lemma 6 (Appendix D) is then used to bound $\operatorname{E}[\Phi(S)]$ in terms of our notion of transductive Rademacher complexity.

4 Learning bound for stable data-dependent hypothesis sets

In this section, we present our main generalization bounds for data-dependent hypothesis sets.

Theorem 2. Let $\mathcal{H} = (H_S)_{S \in Z^m}$ be a $\beta$-stable family of data-dependent hypothesis sets, with $\bar\chi$ average CV-stability, $\chi$ CV-stability and $\Delta_{\max}$ max-diameter. Let $\mathcal{G}$ be defined as in (3). Then, for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of a sample $S \sim D^m$, the inequality $R(h) \leq \widehat{R}_S(h) + B$ holds for all $h \in H_S$, where $B$ is the minimum of the following three quantities:
\[
\min\{2\mathfrak{R}^\Diamond_m(\mathcal{G}), \bar\chi\} + (1 + 2\beta m)\sqrt{\tfrac{1}{2m}\log\big(\tfrac{1}{\delta}\big)}, \tag{9}
\]
\[
\sqrt{e}\,\chi + 4\sqrt{\tfrac{1}{m}\log\big(\tfrac{4}{\delta}\big)} + \big(\tfrac{1}{m} + 2\beta\big)\log\big(\tfrac{6}{\delta}\big), \tag{10}
\]
\[
48(3\beta + \Delta_{\max})\log(m)\log\big(\tfrac{5m^3}{\delta}\big) + \sqrt{\tfrac{1}{2m}\log\big(\tfrac{4}{\delta}\big)}. \tag{11}
\]
The proof of the theorem is given in Appendix E. The main idea is to control the sensitivity of the function $\Psi(S, S')$ defined for any two samples $S, S'$ as follows:
\[
\Psi(S, S') = \sup_{h \in H_S} R(h) - \widehat{R}_{S'}(h).
\]
To prove bound (9), we apply McDiarmid's inequality to $\Psi(S, S)$, using the $(\frac{1}{m} + 2\beta)$-sensitivity of $\Psi(S, S)$, and then upper bound the expectation $\operatorname{E}_{S \sim D^m}[\Psi(S, S)]$ in terms of our notion of Rademacher complexity. The bound (10) is obtained via a differential-privacy-based technique, as in Feldman and Vondrak [2018], and (11) is a direct consequence of the bound of Feldman and Vondrak [2019], using the observation that an algorithm that chooses an arbitrary $h \in H_S$ is $O(\beta + \Delta_{\max})$-uniformly stable in the classical [Bousquet and Elisseeff, 2002] sense.

Bound (9) admits as a special case the standard Rademacher complexity bound for fixed hypothesis sets [Koltchinskii and Panchenko, 2002, Bartlett and Mendelson, 2002]: in that case, we have $H_S = H$ for some fixed $H$, thus $\mathfrak{R}^\Diamond_m(\mathcal{G})$ coincides with the standard Rademacher complexity $\mathfrak{R}_m(\mathcal{G})$; furthermore, the family of hypothesis sets is $0$-stable, thus the bound holds with $\beta = 0$. Bounds (10) and (11) specialize to the bounds of Feldman and Vondrak [2018] and Feldman and Vondrak [2019], respectively, for the special case of standard uniform stability of algorithms, since in that case $H_S$ is reduced to a singleton, $H_S = \{h_S\}$, and so $\Delta = 0$, which implies that $\chi \leq \Delta + \beta = \beta$.

The Rademacher complexity-based bound (9) typically gives the tightest control on generalization error compared to the bounds (10) and (11), which rely on the cruder diameter notion. However, the diameter may be easier to bound for some applications than the Rademacher complexity. To compare the diameter-based bounds: in applications where $\Delta_{\max} = O(\Delta)$, bound (11) may be tighter than (10); but, in several applications, we have $\beta = O(\frac{1}{m})$, and then bound (10) is tighter.

5 Applications

We now discuss several applications of our learning guarantees, with some others in Appendix G.

5.1 Bagging

Bagging [Breiman, 1996] is a prominent ensemble method used to improve the stability of learning algorithms. It consists of generating $k$ new samples $B_1, B_2, \ldots, B_k$, each of size $p$, by sampling uniformly with replacement from the original sample $S$ of size $m$. An algorithm $A$ is then trained on each of these samples to generate $k$ predictors $A(B_i)$, $i \in [k]$. In regression, the predictors are combined by taking a convex combination $\sum_{i=1}^k w_i A(B_i)$. Here, we analyze a common instance of bagging to illustrate the application of our learning guarantees: we will assume a regression setting and a uniform sampling from $S$ without replacement.² We will also assume that the loss function is $\mu$-Lipschitz in its first argument, that the predictions are in the range $[0, 1]$, and that all the mixing weights $w_i$ are bounded by $\frac{C}{k}$ for some constant $C \geq 1$, in order to ensure that no subsample $B_i$ is overly influential in the final regressor (in practice, a uniform mixture is typically used in bagging).

To analyze bagging, we cast it in our framework.
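The two-stage view of the bagging scheme just described can be sketched as follows; this is a minimal illustration of ours, with a trivial base algorithm (a constant mean regressor) standing in for a generic $A$, and all names chosen for exposition only.

```python
# Sketch of the bagging construction analyzed in Section 5.1: k subsamples of
# size p drawn from S without replacement, a base algorithm A trained on each,
# and a convex combination with all mixing weights capped at C/k.
# The base algorithm here is a toy stand-in, not from the paper.
import random

random.seed(1)

def A(subsample):
    """Base algorithm: returns a constant predictor (the mean label)."""
    mean_y = sum(y for _, y in subsample) / len(subsample)
    return lambda x: mean_y

def bagged_predictor(S, k, p, C):
    """First stage: draw subsamples and train; second stage: fix weights."""
    subsamples = [random.sample(S, p) for _ in range(k)]  # without replacement
    predictors = [A(B) for B in subsamples]
    # Any convex weights with w_i <= C/k are allowed; uniform weights are the
    # usual choice and trivially satisfy the cap when C >= 1.
    w = [1.0 / k] * k
    assert all(wi <= C / k for wi in w) and abs(sum(w) - 1.0) < 1e-9
    return lambda x: sum(wi * g(x) for wi, g in zip(w, predictors))

S = [(x / 100, x / 100) for x in range(100)]  # toy data, labels in [0, 1]
h = bagged_predictor(S, k=20, p=5, C=2.0)
print(0.0 <= h(0.5) <= 1.0)  # predictions stay in [0, 1], as assumed above
```

The hypothesis set of the analysis corresponds to varying the weight vector `w` over the capped simplex while the trained `predictors` stay fixed.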
First, to deal with the randomness in choosing the subsamples, we can equivalently imagine the process as choosing indices in $[m]$ to form the subsamples rather than samples in $S$; then, once $S$ is drawn, the subsamples are generated by filling in the samples at the corresponding indexes. For any index $i \in [m]$, the chance that it is picked in any subsample is $\frac{p}{m}$. Thus, by Chernoff's bound, with probability at least $1 - \delta$, no index in $[m]$ appears in more than $t := \frac{kp}{m} + \sqrt{\frac{2kp\log(\frac{m}{\delta})}{m}}$ subsamples. In the following, we condition on the random seed of the bagging algorithm so that this is indeed the case, and later use a union bound to control the chance that the chosen random seed does not satisfy this property, as elucidated in Section F.2.

Define the data-dependent family of hypothesis sets $\mathcal{H}$ as $H_S := \{\sum_{i=1}^k w_i A(B_i) : w \in \Delta_{C/k}\}$, where $\Delta_{C/k}$ denotes the simplex of distributions over $k$ items with all weights $w_i \leq \frac{C}{k}$. Next, we give upper bounds on the hypothesis set stability and the Rademacher complexity of $\mathcal{H}$. Assume that algorithm $A$ admits uniform stability $\beta_A$ [Bousquet and Elisseeff, 2002], i.e. for any two samples $B$ and $B'$ of size $p$ that differ in exactly one data point and for all $x \in X$, we have $|A(B)(x) - A(B')(x)| \leq \beta_A$.

Now, let $S$ and $S'$ be two samples of size $m$ differing by one point at the same index, $z \in S$ and $z' \in S'$. Then, consider the subsets $B'_i$ of $S'$ which are obtained from the $B_i$s by copying over all the elements except $z$, and replacing all instances of $z$ by $z'$. For any $B_i$, if $z \notin B_i$, then $A(B_i) = A(B'_i)$ and, if $z \in B_i$, then $|A(B_i)(x) - A(B'_i)(x)| \leq \beta_A$ for any $x \in X$. We can now bound the hypothesis set uniform stability as follows: since $L$ is $\mu$-Lipschitz in the prediction, for any $z'' \in Z$ and any $w \in \Delta_{C/k}$, we have
\[
\Big| L\Big(\sum_{i=1}^k w_i A(B_i), z''\Big) - L\Big(\sum_{i=1}^k w_i A(B'_i), z''\Big) \Big| \leq \frac{tC\mu\beta_A}{k} = \Big(\frac{p}{m} + \sqrt{\frac{2p\log(\frac{m}{\delta})}{km}}\Big) \cdot C\mu\beta_A.
\]
It is easy to check that the CV-stability and the diameter of the hypothesis sets are $\Omega(1)$ in the worst case. Thus, the CV-stability-based bound (10) and the standard uniform-stability bound (11) are not informative here, and we use the Rademacher complexity based bound (9) instead. Bounding the Rademacher complexity $\widehat{\mathfrak{R}}_S(H_{S,T})$ for $S, T \in Z^m$ is non-trivial. Instead, we can derive a reasonable upper bound by analyzing the Rademacher complexity of a larger function class. Specifically, for any $z \in Z$, define the $d := \binom{2m}{p}$-dimensional vector $u_z = \big(A(B)(z)\big)_{B \subseteq S \cup T,\, |B| = p}$. Then, the larger class of functions is $F_{S,T} := \{z \mapsto w^\top u_z : w \in R^d, \|w\|_1 = 1\}$. Clearly $H_{S,T} \subseteq F_{S,T}$. Since $\|u_z\|_\infty \leq 1$, a standard Rademacher complexity bound (see Theorem 11.15 in [Mohri et al., 2018]) implies
\[
\widehat{\mathfrak{R}}_S(F_{S,T}) \leq \sqrt{\frac{2\log\big(2\binom{2m}{p}\big)}{m}} \leq \sqrt{\frac{2p\log(4m)}{m}}.
\]
Thus, by Talagrand's inequality, we conclude that $\widehat{\mathfrak{R}}_S(G_{S,T}) \leq \mu\sqrt{\frac{2p\log(4m)}{m}}$. In view of that, by Theorem 2, for any $\delta > 0$, with probability at least $1 - 2\delta$ over the draws of a sample $S \sim D^m$ and the randomness in the bagging algorithm, the following inequality holds for any $h \in H_S$:
\[
R(h) \leq \widehat{R}_S(h) + 2\mu\sqrt{\frac{2p\log(4m)}{m}} + \Big(1 + 2\Big(p + \sqrt{\tfrac{2pm\log(\frac{m}{\delta})}{k}}\Big)C\mu\beta_A\Big)\sqrt{\frac{\log\frac{2}{\delta}}{2m}}.
\]
For $p = o(\sqrt{m})$ and $k = \omega(p)$, the generalization gap goes to $0$ as $m \to \infty$, regardless of the stability of $A$. This gives a new generalization guarantee for bagging, similar (but incomparable) to the one derived by Elisseeff et al. [2005]. One major point of difference is that, unlike their bound, our bound allows for non-uniform averaging schemes.

²Sampling without replacement is only adopted to make the analysis more concise; its extension to sampling with replacement is straightforward.

5.2 Stochastic strongly-convex optimization

Here, we consider data-dependent hypothesis sets based on stochastic strongly-convex optimization algorithms. As shown by Shalev-Shwartz et al. [2010], uniform convergence bounds do not hold for the stochastic convex optimization problem in general.

Consider $K$ stochastic strongly-convex optimization algorithms $A_j$, each returning a vector $\hat w^S_j$ after receiving the sample $S \in Z^m$, $j \in [K]$. As shown by Shalev-Shwartz et al. [2010], such algorithms are $\beta = O(\frac{1}{m})$-sensitive in their output vector, i.e. for all $j \in [K]$, we have $\|\hat w^S_j - \hat w^{S'}_j\| \leq \beta$ if $S$ and $S'$ differ by one point.

Assume that the loss $L(w, z)$ is $\mu$-Lipschitz with respect to its first argument $w$. Let the data-dependent hypothesis set be defined as follows: $H_S = \{\sum_{j=1}^K \alpha_j \hat w^S_j : \alpha \in \Delta_K \cap B_1(\alpha_0, r)\}$, where we choose $r = \frac{1}{2\mu D\sqrt{m}}$; a natural choice for $\alpha_0$ would be the uniform mixture. Since the loss function is $\mu$-Lipschitz, the family of hypothesis sets $(H_S)_{S \in Z^m}$ is $\mu\beta$-stable.
In this setting,\nbounding the Rademacher complexity is dif\ufb01cult, so we resort to the diameter based bound (10)\n\n\u03b10 is in the simplex of distributions \u2206K and B1(\u03b10, r) is the L1 ball of radius r> 0 around \u03b10. We\nchoose r=\ninstead. Note that for any \u03b1, \u03b1\u2032\u2208 \u2206K\u2229 B1(\u03b10, r) and any z\u2208 Z, we have\n\u2264 \u00b5\u0002[wS\n\u03b1j\u0302wS\n)\u0302wS\n\u0302wS\n\u0016 wS\nL\u0004 KQ\nj , z\u0004\u2212 L\u0004 KQ\n\u0005\n(\u03b1i\u2212 \u03b1\u2032\nj , z\u0004\u2264 \u00b5\u0005 KQ\n\u0001\u03b1\u2212 \u03b1\u2032\u00011\u2264 2\u00b5rD,\n\u03b1\u2032\nj=1\nj=1\nj=1\nwhere\u0002[wS\n]\u0002\n= maxi\u2208[K]\u0001wS\n\u0016 wS\n\u00012\u2264 D. Thus, the average diameter\n\u2236= maxx\u22600\n\u0001\u2211k\nj\u00012\nadmits the following upper bound: \u0302\n. In view of that, by Theorem 2, for any \u03b4> 0,\n\u2206\u2264 2\u00b5rD= 1\u221a\nj=1 xj wS\n\u0001x\u00011\nwith probability at least 1\u2212 \u03b4, the following holds for all \u03b1\u2208 \u2206K\u2229 B1(\u03b10, r):\n\u0004\u0003 1\n\u0004+\u0003\n+\u221a\n\u03b1i\u0302wS\n\u03b1j\u0302wS\nL\u0004 KQ\n\u0003.\n+ 2\u00b5\u03b2\u0003 log\u0003 6\ne\u00b5\u03b2+ 4\n\u0004L\u0004 KQ\nj , z\u0004\u0004\u2264 1\nmQ\nz\u223cD\nj=1\ni=1\nj=1\nAs an aside, we note that the analysis of section 5.1 can be carried over to this setting, by settingA to\n\nThe second stage of an algorithm in this context consists of choosing \u03b1, potentially using a non-stable\nalgorithm. This application both illustrates the use of our learning bounds using the diameter and its\napplication even in the absence of uniform convergence bounds.\n\n]\u0002\n\nbe a stochastic strongly-convex optimization algorithm which outputs a weight vector \u02c6w. 
This yields generalization bounds for aggregating over a larger set of mixing weights, albeit with the restriction that each algorithm uses only a small part of $S$.

5.3 $\Delta$-sensitive feature mappings

Consider the scenario where the training sample $S \in Z^m$ is used to learn a non-linear feature mapping $\Phi_S \colon X \to \mathbb R^N$ that is $\Delta$-sensitive for some $\Delta = O(\frac{1}{m})$. $\Phi_S$ may be the feature mapping corresponding to some positive definite symmetric kernel, or a mapping defined by the top layer of an artificial neural network trained on $S$, with a stability property.

To define the second stage, let $\mathcal L$ be a set of $\gamma$-Lipschitz functions $f \colon \mathbb R^N \to \mathbb R$. Then we define $H_S = \{x \mapsto f(\Phi_S(x)) : f \in \mathcal L\}$. Assume that the loss function $\ell$ is $\mu$-Lipschitz with respect to its first argument. Then, for any hypothesis $h \colon x \mapsto f(\Phi_S(x)) \in H_S$ and any sample $S'$ differing from $S$ by one element, the hypothesis $h' \colon x \mapsto f(\Phi_{S'}(x)) \in H_{S'}$ admits losses that are $\beta$-close to those of $h$, with $\beta = \mu\gamma\Delta$, since, for all $(x, y) \in X \times Y$, by the Lipschitz conditions on $\ell$ and $f$, the following inequality holds:
\[
\ell(f(\Phi_S(x)), y) - \ell(f(\Phi_{S'}(x)), y)
\le \mu \big| f(\Phi_S(x)) - f(\Phi_{S'}(x)) \big|
\le \mu\gamma \|\Phi_S(x) - \Phi_{S'}(x)\|
\le \mu\gamma\Delta.
\]
Thus, the family of hypothesis sets $\mathcal H = (H_S)_{S \in Z^m}$ is uniformly $\beta$-stable with $\beta = \mu\gamma\Delta = O(\frac{1}{m})$.
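The chain of Lipschitz inequalities above can be verified numerically. The sketch below uses stand-in choices that are ours, not the paper's: a linear feature map whose weights move by $\epsilon$ when one training point changes, the $\gamma$-Lipschitz predictor $f(v) = \gamma \|v\|_2$, and the $\mu$-Lipschitz loss $\ell(t, y) = \mu |t - y|$.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, gamma, N, d = 2.0, 3.0, 8, 5
A = rng.normal(size=(N, d))   # base feature-map weights
E = rng.normal(size=(N, d))   # direction of the sample-dependent perturbation

def phi(x, eps):
    # Feature map before (eps = 0) and after (eps > 0) one point of S changes.
    return (A + eps * E) @ x

f = lambda v: gamma * np.linalg.norm(v)   # gamma-Lipschitz second-stage predictor
loss = lambda t, y: mu * abs(t - y)       # mu-Lipschitz loss in its first argument

x, y, eps = rng.normal(size=d), 0.5, 1e-3
gap = abs(loss(f(phi(x, 0.0)), y) - loss(f(phi(x, eps)), y))
bound = mu * gamma * np.linalg.norm(phi(x, 0.0) - phi(x, eps))
print(gap, bound)  # gap never exceeds the Lipschitz bound mu*gamma*||dPhi||
```

Here $\|\Phi_S(x) - \Phi_{S'}(x)\|$ plays the role of the sensitivity $\Delta$, so the printed gap is always at most $\mu\gamma$ times it.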
In\nview of that, by Theorem 2, for any \u03b4> 0, with probability at least 1\u2212 \u03b4 over the draw of a sample\nS\u223c Dm, the following inequality holds for any h\u2208 HS:\n\ninequality holds:\n\nm\n\nR(h)\u2264 \u0302\n\nRS(h)+ 2R\u25c7\n\nm\n\n(G)+(1+ 2\u00b5\u03b3\u2206m)\u0002\n\n2m log( 1\n\n1\n\n\u03b4\n\n).\n\n(12)\n\nNotice that this bound applies even when the second stage of an algorithm, which consists of selecting\na hypothesis hS in HS, is not stable. A standard uniform stability guarantee cannot be used in that\ncase. The setting described here can be straightforwardly extended to the case of other norms for the\nde\ufb01nition of sensitivity and that of the norm used in the de\ufb01nition of HS.\n\n5.4 Distillation\n\nHere, we consider distillation algorithms which, in the \ufb01rst stage, train a very complex model on the\n\nlabeled sample. Let f\u2217\nwill assume that the training algorithm is \u03b2-sensitive, that is\u0001f\u2217\n\n\u2236 X\u2192 R denote the resulting predictor for a training sample S of size m. We\n) for S and S\u2032\n\n\u2212 f\u2217\nS\u2032\u0001\u2264 \u03b2= O( 1\n\nS\n\nS\n\nm\n\ndiffering by one point.\n\n8\n\n\fAssume that the loss (cid:96) is \u00b5-Lipschitz with respect to its\n\ufb01rst argument and that H is a subset of a vector space. Let\n\nIn the second stage, the algorithm selects a hypothesis\nS from a less complex family of pre-\ndictors H. This de\ufb01nes the following sample-dependent\n\nthat is \u03b3-close to f\u2217\nhypothesis set: HS=\u0001h\u2208 H\u2236\u0001(h\u2212 f\u2217\n)\u0001\u221e\u2264 \u03b3\u0001.\nS\u2032\u2212 f\u2217\nS and S\u2032 be two samples differing by one point. Note, f\u2217\nLet h be in HS, then the hypothesis h\u2032= h+ f\u2217\nS\u2032\u2212 f\u2217\nmay not be in H, but we will assume that f\u2217\nin HS\u2032 since\u0001h\u2032\u2212 f\u2217\n\u0001\u221e\u2264 \u03b3. 
Figure 2 illus-\nS\u2032\u0001\u221e=\u0001h\u2212 f\u2217\nloss, for any z=(x, y)\u2208 Z,(cid:96)(h\u2032(x), y)\u2212 (cid:96)(h(x), y)\u2264\n\u00b5\u0001h\u2032(x)\u2212 h(x)\u0001\u221e= \u00b5\u0001f\u2217\nS\u2032\u2212 f\u2217\n\u0001\u2264 \u00b5\u03b2. Thus, the family\nIn view of that, by Theorem 2, for any \u03b4> 0, with probability at least 1\u2212 \u03b4 over the draw of a sample\nS\u223c Dm, the following inequality holds for any h\u2208 HS:\nR(h)\u2264 \u0302\nRS(h)+ 2R\u25c7\n\nFigure 2: Illustration of the distillation\nhypothesis sets. Notice that the diame-\nter of a hypothesis set HS may be large\nhere.\n\n(G)+(1+ 2\u00b5\u03b2m)\u0002\n\ntrates the hypothesis sets. By the \u00b5-Lipschitzness of the\n\n2m log( 1\n\n1\n\n\u03b4\n\n).\n\nof hypothesis sets HS is \u00b5\u03b2-stable.\n\nS\n\nS\n\nS\n\nS\n\nS is in H.\nS is\n\nm\n\nNotice that a standard uniform-stability argument would not necessarily apply here since HS could\nbe relatively complex and the second stage not necessarily stable.\n\n6 Conclusion\n\nWe presented a broad study of generalization with data-dependent hypothesis sets, including general\nlearning bounds using a notion of transductive Rademacher complexity and, more importantly,\nlearning bounds for stable data-dependent hypothesis sets. We illustrated the applications of these\nguarantees to the analysis of several problems. Our framework is general and covers learning scenarios\ncommonly arising in applications for which standard generalization bounds are not applicable. Our\nresults can be further augmented and re\ufb01ned to include model selection bounds and local Rademacher\ncomplexity bounds for stable data-dependent hypothesis sets (to be presented in a more extended\nversion of this manuscript), and further extensions described in Appendix F. Our analysis can also be\nextended to the non-i.i.d. setting and other learning scenarios such as that of transduction. 
Several by-products of our analysis, including our proof techniques, new guarantees for transductive learning, and our PAC-Bayesian bounds for randomized algorithms, both in the sample-independent and sample-dependent cases, can be of independent interest. While we highlighted several applications of our learning bounds, a tighter analysis might be needed to derive guarantees for a wider range of data-dependent hypothesis classes or scenarios.

Acknowledgements. HL is supported by NSF IIS-1755781. The work of SG and MM was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award. KS would like to acknowledge NSF CAREER Award 1750575 and a Sloan Research Fellowship.

References

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 2002.

Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Localized Rademacher complexities. In Proceedings of COLT, volume 2375, pages 79-97. Springer-Verlag, 2002.

Raef Bassily, Kobbi Nissim, Adam D. Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of STOC, pages 1046-1059, 2016.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.

Olivier Bousquet, Yegor Klochkov, and Nikita Zhivotovskiy. Sharper bounds for uniformly stable algorithms. CoRR, abs/1910.07833, 2019. URL http://arxiv.org/abs/1910.07833.

Leo Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Adam Cannon, J. Mark Ettinger, Don R. Hush, and Clint Scovel. Machine learning with data dependent hypothesis classes. Journal of Machine Learning Research, 2:335-358, 2002.

Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics, 2007.

Nicolò Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. In Proceedings of NIPS, pages 359-366, 2001.

Corinna Cortes, Mehryar Mohri, Dmitry Pechyony, and Ashish Rastogi. Stability of transductive regression algorithms. In Proceedings of ICML, pages 176-183, 2008.

Luc Devroye and T. J. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202-207, 1979.

Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Proceedings of NeurIPS, pages 8440-8450, 2018a.

Gintare Karolina Dziugaite and Daniel M. Roy. Entropy-SGD optimizes the prior of a PAC-Bayes bound: Generalization properties of Entropy-SGD and data-dependent priors. In Proceedings of ICML, pages 1376-1385, 2018b.

Ran El-Yaniv and Dmitry Pechyony. Transductive Rademacher complexity and its applications. In Proceedings of COLT, pages 157-171, 2007.

André Elisseeff, Theodoros Evgeniou, and Massimiliano Pontil. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55-79, 2005.

Vitaly Feldman and Jan Vondrak. Generalization bounds for uniformly stable algorithms. In Proceedings of NeurIPS, pages 9770-9780, 2018.

Vitaly Feldman and Jan Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Proceedings of COLT, 2019.

Yoram Gat. A learning generalization bound with an application to sparse-representation classifiers. Machine Learning, 42(3):233-239, 2001.

Ralf Herbrich and Robert C. Williamson. Algorithmic luckiness. Journal of Machine Learning Research, 3:175-212, 2002.

Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Proceedings of NIPS, pages 793-800, 2008.

Satyen Kale, Ravi Kumar, and Sergei Vassilvitskii. Cross-validation and mean-square stability. In Proceedings of Innovations in Computer Science, pages 487-495, 2011.

Michael J. Kearns and Dana Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Proceedings of COLT, pages 152-162. ACM, 1997.

Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

Samuel Kutin and Partha Niyogi. Almost-everywhere algorithmic stability and generalization error. In Proceedings of UAI, pages 275-282, 2002.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93-117, 2017.

Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.

Guy Lever, François Laviolette, and John Shawe-Taylor. Tighter PAC-Bayes bounds through distribution-dependent priors. Theoretical Computer Science, 473:4-28, 2013.

Andreas Maurer. A second-order look at stability and generalization. In Proceedings of COLT, pages 1461-1475, 2017.

David A. McAllester. Simplified PAC-Bayesian margin bounds. In Proceedings of COLT, pages 203-215, 2003.

Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of FOCS, pages 94-103, 2007.

Kohei Miyaguchi. PAC-Bayesian transportation bound. CoRR, abs/1905.13435, 2019.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary phi-mixing and beta-mixing processes. Journal of Machine Learning Research, 11:789-814, 2010.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, second edition, 2018.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In Proceedings of ICLR, 2018.

Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors. Journal of Machine Learning Research, 13:3507-3531, 2012.

Petra Philips. Data-Dependent Analysis of Learning Algorithms. PhD thesis, Australian National University, 2005.

David Pollard. Convergence of Stochastic Processes. Springer, 1984.

W. H. Rogers and T. J. Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3):506-514, 1978.

Mark Rudelson and Roman Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21, 2007.

Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635-2670, 2010.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926-1940, 1998.

Nati Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. In Proceedings of NIPS, 2010.

Thomas Steinke and Jonathan Ullman. Subgaussian tail bounds via stability arguments. CoRR, abs/1701.03493, 2017. URL http://arxiv.org/abs/1701.03493.

G. W. Stewart. Matrix Algorithms: Volume 1, Basic Decompositions. Society for Industrial and Applied Mathematics, 1998.

Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, 1998.

Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527-550, 2002.