{"title": "Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates", "book": "Advances in Neural Information Processing Systems", "page_first": 11015, "page_last": 11025, "abstract": "In this work, we improve upon the stepwise analysis of noisy iterative learning algorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu, Zou, and Veeravalli (2019). Our main contributions are significantly improved mutual information bounds for Stochastic Gradient Langevin Dynamics via data-dependent estimates. Our approach is based on the variational characterization of mutual information and the use of data-dependent priors that forecast the mini-batch gradient based on a subset of the training samples. Our approach is broadly applicable within the information-theoretic framework of Russo and Zou (2015) and Xu and Raginsky (2017). Our bound can be tied to a measure of flatness of the empirical risk surface. As compared with other bounds that depend on the squared norms of gradients, empirical investigations show that the terms in our bounds are orders of magnitude smaller.", "full_text": "Information-Theoretic Generalization Bounds for\n\nSGLD via Data-Dependent Estimates\n\nJeffrey Negrea\u2217\nUniversity of Toronto,\n\nVector Institute\n\nMahdi Haghifam\u2217\nUniversity of Toronto,\n\nElement AI\n\nGintare Karolina Dziugaite\n\nElement AI\n\nAshish Khisti\n\nUniversity of Toronto\n\nDaniel M. Roy\n\nUniversity of Toronto,\n\nVector Institute\n\nAbstract\n\nIn this work, we improve upon the stepwise analysis of noisy iterative learning\nalgorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu,\nZou, and Veeravalli (2019). Our main contributions are signi\ufb01cantly improved\nmutual information bounds for Stochastic Gradient Langevin Dynamics via data-\ndependent estimates. 
Our approach is based on the variational characterization of mutual information and the use of data-dependent priors that forecast the mini-batch gradient based on a subset of the training samples. Our approach is broadly applicable within the information-theoretic framework of Russo and Zou (2015) and Xu and Raginsky (2017). Our bound can be tied to a measure of flatness of the empirical risk surface. As compared with other bounds that depend on the squared norms of gradients, empirical investigations show that the terms in our bounds are orders of magnitude smaller.

1 Introduction

Stochastic subgradient methods, especially stochastic gradient descent (SGD), are at the core of recent advances in deep-learning practice. Despite some progress, developing a precise understanding of generalization error for that class of algorithms remains wide open. Concurrently, there has been steady progress for noisy variants of SGD, such as stochastic gradient Langevin dynamics (SGLD) [13, 26, 34] and its full-batch counterpart, the Langevin algorithm [13]. The introduction of Gaussian noise to the iterates of SGD expands the set of theoretical frameworks that can be brought to bear on the study of generalization. In pioneering work, Raginsky, Rakhlin, and Telgarsky [26] exploit the fact that SGLD approximates Langevin diffusion, a continuous-time Markov process, in the small-step-size limit. One drawback of this and related analyses involving Markov processes is the reliance on mixing. We hypothesize that SGLD is not mixing in practice, so results based upon mixing may not be representative of empirical performance.

In recent work, Pensia, Jog, and Loh [24] perform a stepwise analysis of a family of noisy iterative algorithms that includes SGLD and the Langevin algorithm.
At the foundation of this work is the framework of Russo and Zou [29] and Xu and Raginsky [35], where mean generalization error is controlled in terms of the mutual information between the dataset and the learned parameters. (See also the study of on-average KL stability by Wang, Lei, and Fienberg [33].) However, because the data distribution is unknown, so is any mutual information involving the data. This presents a significant barrier to understanding generalization in terms of mutual information.

*Equal contribution authors, order of names was determined randomly.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One of the key contributions of Pensia et al. is a bound on the mutual information between the data and the final weights, which they construct from a bound on the mutual information between the data and the entire trajectory of weights. By exploiting properties of mutual information, they express the latter as a sum of conditional mutual informations associated with each gradient step. While these conditional mutual informations are also unknown, Pensia et al. obtain a bound in terms of the Lipschitz constant for the objective function being optimized.

By passing to the full trajectory and exploiting Lipschitz continuity, Pensia et al. circumvent the statistical barrier posed by the unknown mutual information. Their analysis, however, introduces several sources of looseness. In particular, the use of Lipschitz constants, which lead to distribution-independent bounds, eradicates any hope that these bounds will be non-vacuous for modern models and datasets. Indeed, for deep neural networks, the Lipschitz constant for the empirical risk would be prohibitively large, or in some cases infinite, and would immediately render any bound that depends on it vacuous in regimes of interest.
In order to fully exploit the decomposition proposed by [24], one needs distribution-dependent bounds on the incremental mutual information at each step. In fact, by a small change, the bounds established by Pensia et al. can be made to depend on expected-squared-gradient-norms, rather than Lipschitz constants, producing distribution-dependent bounds. The resulting bound would be similar to a PAC-Bayesian bound due to Mou et al. [22], which we consider to be the SGLD generalization result most similar to the present work. Writing $\sum_{t\le T}\eta_t$ for $\sum_{t=1}^T\eta_t$, their bound is $O\big(\sqrt{(\beta/n)\sum_{t\le T}\eta_t}\big)$ and does not place restrictions on the learning rate or Lipschitz continuity of the loss or its gradient. In other related work, Li, Luo, and Qiao [20] derive an $O\big((1/n)\sqrt{\beta\sum_{t\le T}\eta_t}\big)$ generalization bound for SGLD that depends on expected-squared-gradient-norms. However, their result requires the learning rate to scale inversely with the inverse temperature and the Lipschitz constant of the loss, severely limiting the applicability of their result to typical learning problems. Empirically, squared gradient norms are very large during training, which suggests that bounds based on these quantities may not explain empirical performance. As we will show, the dependence on the expected-squared-gradient-norm is spurious.

The key contribution of the present work is the observation that variants of the mutual information between the learned parameters and a subset of the data can be estimated using the rest of the data. We refer to such estimates as data-dependent due to their intermediate dependence on part of the data. The use of data-dependent estimates leads to distribution-dependent bounds that naturally adapt to the model of interest and the data distribution.
In particular, using data-dependent estimates, we arrive at bounds in terms of the incoherence of gradients in the dataset. Roughly speaking, the incoherence measures the amount by which batch gradients computed on subsets of the data disagree, as quantified by squared norm. Crucially, the incoherence is never larger than the squared-gradient-norm on average, and the incoherence is 0 for most iterations of SGLD with small batches.

We note that the mutual information between the learned parameter and a single data point is used to produce generalization bounds in work by Bu, Zou, and Veeravalli [6], Raginsky et al. [27], and Wang, Lei, and Fienberg [33]. However, in the SGLD analysis of [6], they do not use data-dependent estimates. Instead, they also rely on Lipschitz constants, leading to bounds similar to [24].

In the process of developing tighter distribution-dependent bounds, we also observe that, in some circumstances, one may obtain tighter estimates by working with conditional or disintegrated information-theoretic quantities. In particular, doing so provides more opportunities to exchange expectation and concave functions than are available with previous mutual information bounds. Using their own mutual information bound and the chain rule, [6] improve on the generalization error bound for SGLD from [24] by a factor of $\sqrt{\log n}$, where $n$ is the sample size. The advantage of [6] that enables this improvement is that their bound is only penalized once per epoch at a randomly chosen step.
This effectively changes the order of an expectation and a square-root, improving the bound. Building upon [6, 29, 35], we develop generalization bounds in terms of disintegrated information-theoretic quantities that extract expectations from concave functions as much as possible.

Finally, much like the stepwise analysis of SGD carried out by Hardt, Recht, and Singer [14], one could consider an analysis in terms of uniform stability, e.g., in terms of average leave-one-out KL stability [12]. Under an assumption of uniform stability, [22] also showed that expected generalization error decays rapidly at a $O(1/n)$ rate. However, uniform stability has poor dependence on the Lipschitz constant, and so does not even hold in simple settings, like univariate logistic regression. As such, we do not believe this framework is suitable for studying SGLD as applied in modern machine learning. For other work on information-theoretic analyses of generalization error, and on SGLD, see [1, 3, 4, 15, 16, 27, 32].

1.1 Contributions

The present paper makes the following contributions:

• We provide novel information-theoretic generalization bounds that relate a learned parameter to a random subset of the training data. These bounds depend on forms of on-average information stability, but are different from those in existing work due to our use of disintegration.

• We introduce the technique of data-dependent priors for bounding mutual information in data-dependent estimates of expected generalization error. Specifically, we use data-dependent priors to forecast the dynamics of iterative algorithms using a randomly chosen subset of the data. Each possible subset yields a generalization bound for the empirical risk over the complementary subset.
Combining this with our information-theoretic generalization bounds, we recover generalization error bounds for the empirical risk on the full dataset.

• We develop bounds for Langevin dynamics and SGLD that depend on a measure of the incoherence of empirical gradients. This quantity is typically orders of magnitude smaller than the squared gradient norms or Lipschitz constants that other bounds depend upon. In our experiments, the difference was a multiplicative factor between $10^2$ and $10^4$.

• Our generalization bound for SGLD is $O\big(\min\big\{\sqrt{(\beta/bn)\sum_{t\le T}\eta_t},\ (1/n)\sum_{t\le T}\sqrt{\beta\eta_t}\big\}\big)$, where $\eta_t$ is the learning rate at iteration $t$, $T$ is the number of iterations, $\beta$ is the inverse temperature, and $b$ is the minibatch size. This bound is currently state of the art for bounds without assumptions on the smoothness of the loss or restrictions on the learning rate.

1.2 Preliminaries

Let $\mathcal{D}$ be an unknown distribution on a space $Z$ and let $\mathcal{W}$ be a space of parameters. Consider a loss function $\ell : Z \times \mathcal{W} \to \mathbb{R}$ and the corresponding risk function $R_{\mathcal{D}}(w) = \mathbb{E}\,\ell(Z,w)$. Given an i.i.d. dataset of size $n$, $S \sim \mathcal{D}^n$, we may form the empirical risk function $\hat{R}_S(w) = \frac{1}{n}\sum_{i=1}^n \ell(Z_i,w)$, where $S = (Z_1,\ldots,Z_n)$. In the setting of classification and continuous parameter spaces, the loss function is discontinuous and the empirical risk function does not convey useful gradient information. For this reason, it is common to work with a surrogate loss, such as cross entropy. To that end, let $\tilde{\ell} : Z \times \mathcal{W} \to \mathbb{R}$ denote a surrogate loss and let $\tilde{R}_{\mathcal{D}}(w) = \mathbb{E}\,\tilde{\ell}(Z,w)$ and $\tilde{R}_S(w) = \frac{1}{n}\sum_{i=1}^n \tilde{\ell}(Z_i,w)$ be the corresponding surrogate risk and empirical surrogate risk.

Our primary interest is in the generalization performance of learning algorithms.
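To fix ideas, the quantities just defined can be simulated in a few lines. The sketch below is purely illustrative and not from the paper: the quadratic loss, the sample sizes, and the use of a large fresh sample as a Monte Carlo stand-in for the unknown $\mathcal{D}$ are all our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def emp_risk(loss, w, sample):
    """Empirical risk: the loss averaged over the points in `sample`."""
    return float(np.mean(loss(sample, w)))

# A bounded toy loss on Z = R (clipping keeps it [0, 4]-bounded).
loss = lambda z, w: np.clip((z - w) ** 2, 0.0, 4.0)

S = rng.normal(size=50)           # training set S ~ D^n with D = N(0, 1)
w = float(S.mean())               # "learned" parameter W = A(S)

fresh = rng.normal(size=200_000)  # fresh draws from D, to estimate R_D(w)
gap = emp_risk(loss, w, fresh) - emp_risk(loss, w, S)
print(f"estimated generalization error: {gap:.4f}")
```

Because $w$ is fit to $S$, the empirical risk on $S$ is optimistic, and the gap is what the bounds in this paper control in expectation.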
Abstractly, let $W$ be a random element in $\mathcal{W}$ satisfying $W = \mathcal{A}(S,V)$, where $V$ is some auxiliary random element independent from $S$ and $\mathcal{A}$ is a measurable function representing a randomized learning algorithm that maps the data $S$ to a learned parameter $W$. Our focus will be the (mean) generalization error of $W$, i.e., $\mathbb{E}\big[R_{\mathcal{D}}(W) - \hat{R}_S(W)\big]$. Note that we have averaged over both the choice of dataset and the source of randomness $V$ available to the learning algorithm $\mathcal{A}$.

For random variables $X$ and $Y$, write $\mathbb{E}_Y X = \mathbb{E}[X\,|\,Y]$ and $\mathbb{P}_Y[X]$ for the conditional expectation and (regular) conditional distribution, respectively, of $X$ given $Y$.² Besides the usual notions of KL divergence, mutual information, and conditional mutual information (see Appendix A for formal definitions), we rely on the following less common notion:

Definition 1.1. Let $X$, $Y$, and $Z$ be arbitrary random elements. Let $\otimes$ form product measures. The disintegrated mutual information between $X$ and $Y$ given $Z$ is

\[ I_Z(X;Y) = \mathrm{KL}\big(\mathbb{P}_Z[(X,Y)] \,\big\|\, \mathbb{P}_Z[X] \otimes \mathbb{P}_Z[Y]\big). \]

It follows immediately from the definitions that $I(X;Y|Z) = \mathbb{E}\,I_Z(X;Y)$. Letting $\phi$ satisfy $\phi(Z) = I_Z(X;Y)$ a.s., define $I(X;Y|Z=z) = \phi(z)$. This notation is necessarily well defined only up to a null set under the marginal distribution of $Z$.

²We fix arbitrary versions and assume regular versions of conditional distributions exist.

2 Methods

In this section, we establish generalization bounds for learning algorithms in terms of information-theoretic quantities (conditional mutual information, disintegrated mutual information, relative entropy) that depend on the unknown data distribution and the probabilistic properties of the learning algorithm. We then describe two complementary strategies that we employ to bound these otherwise intractable quantities.
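Definition 1.1 and the identity $I(X;Y|Z) = \mathbb{E}\,I_Z(X;Y)$ can be checked numerically on a small discrete example. The joint pmf below is made up for illustration; the sketch computes each disintegrated term and averages it over $Z$:

```python
import numpy as np

# Hypothetical joint pmf p[x, y, z] for binary X, Y, Z (entries sum to 1).
p = np.array([[[0.10, 0.05], [0.05, 0.10]],
              [[0.05, 0.20], [0.20, 0.25]]])
pz = p.sum(axis=(0, 1))  # marginal pmf of Z

def I_given(z):
    """Disintegrated mutual information I_z(X;Y) = KL(P_z[(X,Y)] || P_z[X] x P_z[Y])."""
    pxy = p[:, :, z] / pz[z]  # conditional joint of (X, Y) given Z = z
    prod = np.outer(pxy.sum(axis=1), pxy.sum(axis=0))
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / prod[mask])))

# Conditional mutual information is the Z-average of the disintegrated terms.
I_cond = sum(pz[z] * I_given(z) for z in range(2))
```

Each $I_z(X;Y)$ is itself a nonnegative KL divergence, which is exactly what makes disintegrated quantities usable pointwise, before any averaging over $Z$.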
In Section 3, we apply these methods to the study of the Langevin algorithm and SGLD.

We make repeated use of generalized notions of priors and posteriors, which arise in the PAC-Bayes literature ([7, 21, 31], etc.) and relate to variational bounds on mutual information, which we will now describe: Consider learned parameters $W$, data $S$, and auxiliary variables $V$, viewed as random elements in $\mathcal{W}$, $Z^n$, etc., respectively. In PAC-Bayes, a generalized posterior is an arbitrary random measure on $\mathcal{W}$. In our setting, the posterior, $Q$ (of $W$ given $S$ and $V$), is the conditional distribution of $W$ given $S$ and $V$. (Formally, $Q$ is a probability kernel, but one can think informally that $Q = f(S,V)$ for some measurable function taking values in the space of Borel probability measures, and so we will simply say that $Q$ is $\sigma(S,V)$-measurable.)

Definition 2.1 (Data-dependent prior). Let $Q$ be a $\sigma(S,V)$-measurable posterior. A (generalized) prior $P$ is a random measure on $\mathcal{W}$, measurable with respect to some sub-$\sigma$-algebra of $\sigma(S,V)$. A prior $P$ is said to be data-dependent if it is not independent of $S$.

Let $P$ be an $\mathcal{F}$-measurable data-dependent prior, where $\sigma(V) \subset \mathcal{F}$. Using a variational characterization of mutual information (see Appendix B.1), we have

\[ \mathbb{E}_{\mathcal{F}}[\mathrm{KL}(Q\|P)] \ge I_{\mathcal{F}}(W;S) \quad \text{a.s.,} \quad (1) \]

with equality for $P = \mathbb{P}_{\mathcal{F}}[W]$. Therefore, if the expected KL divergence is small, $W$ contains little information about $S$ beyond what is already captured by $\mathcal{F}$. In the special case where the disintegrated mutual information is zero, $W$ is independent of $S$ given $\mathcal{F}$. In the context of generalization, this implies that the data $S$ not contained in $\mathcal{F}$ can be used to form an unbiased estimate of the risk of $W$.
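The variational bound in Eq. (1) admits a closed-form illustration in a toy Gaussian setting of our own (not from the paper): take a scalar "dataset summary" $S \sim N(0,\tau^2)$ and posterior $Q = N(S, v)$. The expected KL to the exact marginal of $W$ recovers $I(W;S)$, while any mismatched prior pays strictly more:

```python
import math
import random

def gauss_kl(mu1, var1, mu2, var2):
    """KL(N(mu1, var1) || N(mu2, var2)) in nats."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

tau2, vw = 4.0, 1.0                   # S ~ N(0, tau2);  W | S ~ N(S, vw)
mi = 0.5 * math.log(1.0 + tau2 / vw)  # I(W; S) for this Gaussian channel

random.seed(0)
draws = [random.gauss(0.0, math.sqrt(tau2)) for _ in range(100_000)]

# Expected KL to the exact marginal of W, i.e. the optimal prior N(0, tau2 + vw):
ekl_opt = sum(gauss_kl(s, vw, 0.0, tau2 + vw) for s in draws) / len(draws)
# Expected KL to a mismatched prior N(0, 9.0): can only be larger.
ekl_bad = sum(gauss_kl(s, vw, 0.0, 9.0) for s in draws) / len(draws)
```

With the exact marginal as prior, the Monte Carlo estimate of the expected KL matches $I(W;S)$, which is the equality case $P = \mathbb{P}_{\mathcal{F}}[W]$ noted after Eq. (1) (here with trivial $\mathcal{F}$).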
The bounds we present below extend this logic to nonzero mutual information.

The utility of using data-dependent priors to control disintegrated mutual information depends on the balance of two effects: On the one hand, $I(W;S) \le I(W;S|\mathcal{F})$, and so conditioning never improves a theoretical bound and may make it looser. On the other hand, $I(W;S)$ depends on the unknown data distribution, and so distribution-independent bounds will often be very loose. In contrast, the KL divergence based on $P$ can exploit the information in $\mathcal{F} \subset \sigma(S,V)$ to obtain tighter data-dependent bounds on $I_{\mathcal{F}}(W;S)$.

In order to construct data-dependent priors, we partition the dataset $S$ into two halves, based on a random subset $J \subset \{1,\ldots,n\}$ with $\#J = m$ nonrandom. Let $J = \{j_1,\ldots,j_m\}$. The first half, $S_J = (Z_{j_1},\ldots,Z_{j_m})$, contains $m$ points, which we will use to construct a data-dependent prior $P$. The second half, $S^c_J$, containing the remaining $n-m$ points, is independent of $P$. (Note that $S_J$ and $S^c_J$ are independent of $J$, since $m$ is nonrandom.)

This particular construction of data-dependent priors allows us to leverage a type of non-uniform KL-stability: the prior $P$ may exploit $S_J$ to make a data-dependent forecast of $Q$, yielding a bound, $B$, on the conditional expected generalization error (with respect to the remaining $n-m$ data points in $S^c_J$). Averaging over $S_J$, we obtain a bound on the (unconditional) expected generalization error.

Definition 2.2. Let $S_J, S^c_J$ be defined as above. Suppose that $\mathcal{F}$ is a $\sigma$-field with $\sigma(S_J) \subset \mathcal{F} \perp\!\!\!\perp \sigma(S^c_J)$. An expected generalization error bound based on a data-dependent estimate is one of the form
\[ \mathbb{E}\big[R_{\mathcal{D}}(W) - \hat{R}_S(W)\big] \le \mathbb{E}[B], \quad (2) \]

where $B$ is $\mathcal{F}$-measurable and satisfies $\mathbb{E}_{\mathcal{F}}\big[R_{\mathcal{D}}(W) - \hat{R}_{S^c_J}(W)\big] \le B$.

The idea of using data-dependent priors to obtain tighter bounds is standard in the PAC-Bayes literature [2, 10, 23, 28], but its utility in the present work is brought out by our introduction of data-dependent estimates. In the following section, we derive information-theoretic bounds on expected generalization error that can exploit data-dependent priors to form data-dependent estimates. We will then use these tools to study SGLD, without mixing assumptions.

2.1 Information-Theoretic Generalization Bounds based on Random Subsets of Data

Existing work by Xu and Raginsky [35] bounds the expected generalization error of a learning algorithm in terms of the mutual information between the random parameters and the data. The following result is a simple extension of [35, Thm. 1] that bounds the expected generalization error in terms of the mutual information between the parameters and a random subset of the data.

Theorem 2.3 (Data-Dependent Mutual Information Bound). Let $W$ be a random element in $\mathcal{W}$, let $S \sim \mathcal{D}^n$, and let $J \subseteq [n]$, $|J| = m$, be uniformly distributed and independent from $S$ and $W$. Suppose that $\ell(Z,w)$ is $\sigma$-subgaussian when $Z \sim \mathcal{D}$, for each $w \in \mathcal{W}$. Let $Q = \mathbb{P}_S[W]$, and let $P$ be a $\sigma(S_J)$-measurable data-dependent prior on $\mathcal{W}$. Then

\[ \mathbb{E}\big[R_{\mathcal{D}}(W) - \hat{R}_S(W)\big] \le \sqrt{\frac{2\sigma^2}{n-m}\, I(W;S^c_J)} \le \sqrt{\frac{2\sigma^2}{n-m}\, \mathbb{E}[\mathrm{KL}(Q\|P)]}. \]

The proof of this result can be found in Appendix B. When $m = 0$, this recovers [35, Thm. 1]. When the size of the subset is $m = n-1$, this bound is weaker than [6, Prop. 1], due to the order of the concave square-root function and the expectation over the choice of datapoint to be left out.
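The looseness caused by the order of the square root and the expectation is easy to see numerically: by Jensen's inequality, $\mathbb{E}\sqrt{X} \le \sqrt{\mathbb{E}X}$, and the gap grows with the spread of $X$. A quick sketch with hypothetical per-datapoint information terms:

```python
import math
import random

random.seed(0)
# Hypothetical per-datapoint information terms with a heavy spread.
kls = [random.expovariate(1.0) for _ in range(100_000)]

sqrt_of_mean = math.sqrt(sum(kls) / len(kls))             # sqrt outside the average
mean_of_sqrt = sum(math.sqrt(k) for k in kls) / len(kls)  # average outside the sqrt
print(f"E sqrt(X) = {mean_of_sqrt:.3f} <= sqrt(E X) = {sqrt_of_mean:.3f}")
```

For Exponential(1) terms the two sides are $\Gamma(3/2) \approx 0.886$ and $1$; pulling the expectation outside the square root, as the next result does, keeps the smaller quantity.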
This difference is addressed by our next result.

Randomization is one way that learning algorithms can control the mutual information between (a random subset of) the data and the learned parameter. Let $U$ be a random element independent from $S$ and $J$, representing some aspect of the source of randomness used by the learning algorithm. Because $S \perp\!\!\!\perp \{J,U\}$ and $S \sim \mathcal{D}^n$, we have $(S_J,U) \perp\!\!\!\perp S^c_J$ and thus

\[ I(W;S^c_J) \le I(W;S^c_J|S_J,U) = \mathbb{E}\,I_{S_J,U}(W;S^c_J), \]

where the last equality follows from the definition of conditional mutual information. The next result shows that we can pull the expectation over both $S_J$ and $U$ outside the concave square-root function. In the case of SGLD, $U$ will be the sequence of minibatch index sets.

Theorem 2.4 (Data-Dependent Disintegrated Mutual Information Bound). Let $W$, $S$, and $J$ be as in Theorem 2.3, and let $U$ be independent from $S$ and $J$. Suppose that $\ell(Z,w)$ is $\sigma$-subgaussian when $Z \sim \mathcal{D}$, for each $w \in \mathcal{W}$. Let $Q = \mathbb{P}_{S,U}[W]$ and let $P$ be a $\sigma(S_J,U)$-measurable data-dependent prior on $\mathcal{W}$. Then

\[ \mathbb{E}\big[R_{\mathcal{D}}(W) - \hat{R}_S(W)\big] \le \mathbb{E}\sqrt{\frac{2\sigma^2}{n-m}\, I_{S_J,U}(W;S^c_J)} \le \mathbb{E}\sqrt{\frac{2\sigma^2}{n-m}\, \mathbb{E}_{S_J,U}\mathrm{KL}(Q\|P)}. \]

The proof of this result can be found in Appendix B. Since $I_{S_J,U}(W;S^c_J)$ is $(S_J,U)$-measurable, we may use $S_J$ and $U$ to obtain a data-dependent bound. In the case that $m = n-1$, our bound is similar to, but not strictly comparable to, [6, Prop. 1]. Our bound is incomparable due to our use of disintegrated mutual information, $I_{S_J}(W;S^c_J)$, and the fact that we take the expectations over the dataset outside of the concave square-root function. The disintegrated mutual information cannot be upper bounded by the full mutual information, $I(W;S^c_J)$, which appears in [6] (even by taking expectations under the square root using Jensen's inequality).
However, Theorem 2.4 is essentially a disintegrated version of [6, Prop. 1]. In their actual SGLD expected generalization error bound, [6] controls the unconditional mutual information using the Lipschitz constant of the surrogate loss. Hence, one could easily recover the same bound using our result. The conditioning we have done, however, allows us to control the mutual information more carefully in order to achieve a tighter bound for SGLD than is provided by [6].

These bounds allow for a tradeoff: for large $m$, the mutual information is measured between the parameter and a small random subset of the data, and so we expect the mutual information to be small. (Indeed, this term will decrease monotonically in $m$.) At the same time, the $\frac{1}{n-m}$ term is larger, reflecting the reduced effect of averaging over only $n-m$ data points to form our estimate of the empirical risk. It is unclear without further context whether this bound is tighter in the regimes of small, intermediate, or large $m$. In fact, we find that, for the bounds we derive in our applications, $m = n-1$ is optimal. This difference materially affects the quality and tightness of the bounds, as is discussed in Remark 3.4. However, for $m = n-1$ and bounded loss, the following bound is tighter, while it is incomparable for other values of $m$.

Theorem 2.5 (Data-Dependent KL Bound). Let $W$, $S$, $J$, and $U$ be as in Theorem 2.4. Let $Q = \mathbb{P}_{S,U}[W]$ and let $P$ be a $\sigma(S_J,U)$-measurable data-dependent prior on $\mathcal{W}$. Suppose that $\ell(Z,w)$ is $[a_1,a_2]$-bounded a.s. when $Z \sim \mathcal{D}$, for each $w \in \mathcal{W}$. Then

\[ \mathbb{E}\big[R_{\mathcal{D}}(W) - \hat{R}_S(W)\big] \le \mathbb{E}\sqrt{\frac{(a_2-a_1)^2}{2}\,\mathrm{KL}(Q\|P)}. \]

The proof of this result can be found in Appendix B. For an analytic comparison of the three bounds in the case that $m = n-1$, see Appendix F.
Remark B.2 explains why this result is only stated for bounded loss functions.

2.2 Decomposing KL Divergences and Mutual Information for Sequential Algorithms

Consider an iterative learning algorithm, and let $W_0, W_1, W_2, \ldots, W_T \in \mathcal{W}$ be the parameters during the course of $T$ iterations. In light of the variational bound for mutual information, we can obtain a generalization bound for $W_T$ by bounding the expected KL divergences between the conditional distribution $\mathbb{P}_{S_J}[W_T]$ and some $S_J$-measurable "prior" distribution $P(Z)$. Unfortunately, the first distribution has no known tractable representation. Pensia, Jog, and Loh [24] use monotonicity to bound a mutual information involving the terminal parameter with one involving the full trajectory, then use the chain rule to decompose this into a sum of conditional mutual informations. The same principles allow us to first bound the terminal KL divergence by the KL for the full trajectory, and then decompose the KL divergence for the full trajectory over each individual step.

Setting some notation, let $T$ be a nonnegative integer, let $[T]_0 = \{0,1,2,\ldots,T\}$, let $\mu$ be a distribution on $\mathcal{W}^{[T]_0}$, and let $X$ be a random variable with distribution $\mu$. We are interested in naming certain marginal and conditional distributions (disintegrations) related to $\mu$. In particular, for $t \in [T]_0$, let

i) $\mu_t = \mathbb{P}[X_t]$, the marginal law of $X_t$;
ii) $\mu_{t|} = \mathbb{P}_{X_{0:(t-1)}}[X_t]$, the conditional law of $X_t$ given $X_{0:(t-1)}$; and
iii) $\mu_{0:t} = \mathbb{P}[X_{0:t}]$, the marginal law of $X_{0:t}$.

Proposition 2.6 (Decomposition of KL Divergences). Let $Q, P$ be probability measures on $\mathcal{W}^{[T]_0}$. Suppose that $Q_0 = P_0$. Then

\[ \mathrm{KL}(Q_T\|P_T) \le \mathrm{KL}(Q\|P) = \sum_{t=1}^T \mathbb{E}_{Q_{0:(t-1)}}[\mathrm{KL}(Q_{t|}\|P_{t|})], \]

where, as per Section 1.2, $Q_{t|}$ is the conditional law of the $t$-th iterate given the previous iterates, and so $\mathrm{KL}(Q_{t|}\|P_{t|})$ is a random variable which depends on $(W_0,\ldots,W_{t-1}) \sim Q_{0:(t-1)}$.

The proof of this result may be found in Appendix B.

Considering the KL between full trajectories may yield a loose upper bound on the KL between terminal parameters (in particular, when the trajectory cannot be inferred from the terminus). We gain, however, analytical tractability, as we will see in the next section when we analyze particular algorithms stepwise. In fact, many bounds that appear in the literature implicitly require this form of incrementation. Our approach based on the KL divergence and data-dependent priors gives us much tighter control of the KL divergence contribution of each step.

3 Generalization Bounds for Specific Algorithms

Now that we have all of the theoretical tools required, we may establish bounds on the generalization error of specific noisy iterative learning algorithms by inventing sensible data-dependent priors. The use of a data-dependent prior which closely forecasts the true algorithm in each step is key in establishing tighter generalization bounds. We first consider the stochastic gradient Langevin dynamics (SGLD) algorithm [34], then handle its full-batch counterpart, the (unadjusted) Langevin algorithm [9, 11], which we will refer to informally as Langevin dynamics (LD). Note that the loss and risk functions used for training, $(\tilde{\ell}, \tilde{R}_{\mathcal{D}}, \tilde{R}_S)$, need not be the same loss functions used for assessing performance and generalization error, $(\ell, R_{\mathcal{D}}, \hat{R}_S)$, as explained in Section 1.2.

3.1 Stochastic Gradient Langevin Dynamics

Let $\eta_t$ be the learning rate at time $t$; $\beta_t$ be the inverse temperature at time $t$; and $\varepsilon_t$ be i.i.d. $N(0, I_d)$. Let $b_t$ be the minibatch size at time $t$.
We are interested in stochastic gradient Langevin dynamics, whose iterates are given by

\[ W_{t+1} = W_t - \eta_t \nabla\tilde{R}_{S_t}(W_t) + \sqrt{2\eta_t/\beta_t}\,\varepsilon_t, \quad (3) \]

where $\tilde{R}_{S_t}(w) = \frac{1}{b_t}\sum_{z\in S_t}\tilde{\ell}(w,z)$, and $S_t$ is a subset of $S$ of size $b_t$ sampled uniformly at random with a sampling procedure which is independent of $S$, and independent of $\{\varepsilon_t\}_{t\ge 0}$. The $b_t$ data points in $S_t$ are chosen without replacement.

3.1.1 A data-dependent prior for SGLD

Let $S_J$ be a random subset of $S$, of size $m$, chosen independently from $W_0, W_1, \ldots$, and independently of the sequence of minibatches, $\{S_t\}_{t\ge 0}$. Let the set of indices appearing in the $t$-th minibatch be denoted by $K_t$, so that $S_t = S_{K_t}$ for each $t$. By assumption, each $K_t$ is a uniformly random subset of $\{1,\ldots,n\}$ of size $b_t$. We set $U = (K_1,\ldots,K_T)$, so as to match the notation in the theorems of Section 2.1. Let $S_{Jt} = S_J \cap S_t = S_{J\cap K_t}$ and let $S^c_t = S_t \setminus S_J = S_{K_t\setminus J}$. Let $b'_t = \#S_{Jt}$ and $b^c_t = b_t - b'_t$. Define

\[ \xi_t = \frac{b^c_t}{b_t}\big(\nabla\tilde{R}_{S^c_t}(W_t) - \nabla\tilde{R}_{S_J}(W_t)\big). \quad (4) \]

Let $Q(S,U)$ be the joint law of $(W_0,\ldots,W_T)$ given a dataset $S$ and minibatch sequence $U$. Then $Q(S,U)$ is a random measure, as it depends on the random dataset $S$ and the sequence of indices $U$. It follows from Eq. (3) that $Q(S,U)_{t|}$ is multivariate normal with mean $\mu_{Q,t}(S,U) = W_t - \eta_t\nabla\tilde{R}_{S_t}(W_t)$ and covariance $\frac{2\eta_t}{\beta_t} I_d$. Consider the data-dependent prior defined so that its conditional $P_{t|}(S_J,U)$ is a multivariate normal with covariance $\frac{2\eta_t}{\beta_t} I_d$, and with mean

\[ \mu_{P,t}(S_J,U) = W_t - \eta_t\left(\frac{b'_t}{b_t}\nabla\tilde{R}_{S_{Jt}}(W_t) + \frac{b_t - b'_t}{b_t}\nabla\tilde{R}_{S_J}(W_t)\right). \]

Note that $\|\mu_{Q,t}(S,U) - \mu_{P,t}(S_J,U)\| = \eta_t\|\xi_t\|$.
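The construction above can be made concrete in a few lines. The sketch below uses a toy quadratic surrogate loss and made-up sizes (it is not the paper's experimental code): it forms the posterior mean, the prior's forecast mean, and the incoherence $\xi_t$ of Eq. (4), and checks that the two means differ by $\eta_t\|\xi_t\|$ in norm.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, b, d = 100, 90, 10, 5
eta, beta = 0.1, 100.0

S = rng.normal(size=(n, d))               # dataset S
J = rng.choice(n, size=m, replace=False)  # prior's index set, size m
w = rng.normal(size=d)                    # current iterate W_t

def grad(w, Z):
    """Mean gradient of the toy surrogate loss 0.5 * ||w - z||^2 over the rows of Z."""
    return w - Z.mean(axis=0)

K = rng.choice(n, size=b, replace=False)  # minibatch indices K_t
K_in = np.intersect1d(K, J)               # batch points the prior has seen
K_out = np.setdiff1d(K, J)                # held-out batch points
bp, bc = len(K_in), len(K_out)

# Posterior mean: an SGLD step on the true minibatch gradient.
mu_q = w - eta * grad(w, S[K])
# Prior mean: forecast the minibatch gradient, replacing the unseen points'
# contribution with the average gradient over S_J.
g_in = grad(w, S[K_in]) if bp else np.zeros(d)
mu_p = w - eta * ((bp / b) * g_in + (bc / b) * grad(w, S[J]))

# Incoherence, Eq. (4): it vanishes whenever the whole batch lies inside S_J.
xi = (bc / b) * (grad(w, S[K_out]) - grad(w, S[J])) if bc else np.zeros(d)

# Gaussian KL between two kernels with equal covariance (2*eta/beta) I:
# ||mu_q - mu_p||^2 / (2 var), so it scales with beta * eta * ||xi||^2
# and the bound charges nothing on steps where xi = 0.
kl = float(np.sum((mu_q - mu_p) ** 2) / (2 * (2 * eta / beta)))
```

The KL paid at step $t$ is thus driven entirely by how well the held-in points forecast the held-out portion of the batch gradient.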
Thus the one-step KL divergence satis\ufb01es\n\n2KL(Qt+1|(S,idx)(cid:107)Pt+1|(SJ,U)) =\n\n\u03b2t\u03b7t\n4\n\n(cid:107)\u03bet(cid:107)2\n\n2\n\nApplying Proposition 2.6, we have (almost surely over the choice of (S,J,U))\n\n2KL(QT (S,U)(cid:107)PT (SJ,U)) \u2264 T\n\u2211\n\nt=1\n\nES,J,UKL(Qt|(S,U)(cid:107)Pt|(SJ,U)) =\n\nT\n\n\u2211\n\nt=1\n\nES,J,U \u03b2t\u03b7t\n4\n\n(cid:107)\u03bet(cid:107)2\n2.\n\nNote that \u03bet depends on the exact weight sequence, and hence is \u03c3 (S,J,U,Wt\u22121)-measurable, but\nnot \u03c3 (S,J,U)-measurable. Hence, ES,J,U \u03b2t\u03b7t\n\n2 is a \u03c3 (S,J,U)-measurable for each t.\n\n8 (cid:107)\u03bet(cid:107)2\n\n3.1.2 Expected Generalization Error Bounds for SGLD\nTheorem 3.1 (Expected Generalization Error Bounds for SGLD). Let {Wt}t\u2208[T ] denote the iterates\nof SGLD. Let the batch size be constant, bt = b. If (cid:96)(Z,w) is \u03c3-subgaussian for each w \u2208 W , then\n\nT\n\n\u2211\nt=1\n\nn\u2212 m\n\n(cid:118)(cid:117)(cid:117)(cid:116) \u03c3 2\n(cid:118)(cid:117)(cid:117)(cid:116) (a2 \u2212 a1)2\n\n4\n\nE(RD (WT )\u2212 RS(WT )) \u2264 E\n\n\u03b2t\u03b7t\n4\n\nESJ ,J,U(cid:107)\u03bet(cid:107)2\n\n2 \u2264 \u03c3\n2\n\nand if (cid:96)(Z,w) is [a1,a2]-bounded, and if m = n\u2212 1, then\n\nE(RD (WT )\u2212 RS(WT )) \u2264 E\n\nT\n\n\u2211\nt=1\n\n\u03b2t\u03b7t\n4\n\nES,J,U(cid:107)\u03bet(cid:107)2\n\n2 \u2264\n\n(cid:0) 1\n\nT\n\n\u2211\nt=1\n\n(n\u2212 1)2\n\n(cid:118)(cid:117)(cid:117)(cid:116) n\n(cid:20) (a2 \u2212 a1)2n\n\n4(n\u2212 1)2b\n\n(cid:1)\u03b2t\u03b7ttr(E[ \u02c6\u03a3t (S)])\n\nb + 1\nn\n\nn\u2212m\u22121\n\nm\n\n(cid:21)1/2\n\n(cid:118)(cid:117)(cid:117)(cid:116) T\n\n\u2211\nt=1\n\nE\n\n(5)\n\n\u03b2t\u03b7t\n4\n\ntr(ES[ \u02c6\u03a3t (S)])\n\n(6)\n\nwhere \u02c6\u03a3t (S) = VarWt ,S\nZ\u223cUnif(S)\n\n(\u2207 \u02dcRZ(Wt )) is the \ufb01nite population variance matrix of surrogate gradients.\n\nProof. 
The results are the direct combinations of Theorem 2.4 and Propositions 2.6 and B.1, and of Theorem 2.5 and Proposition 2.6, respectively, with our data-dependent prior. Jensen's inequality is used to move expectations under $\sqrt{\cdot}$. Lemma D.2 expresses the results in terms of $\hat{\Sigma}$.

Remark 3.2. Suppose that $\beta_t = \beta$, $b_t = b$, and $m = n-1$. Under uniform moment conditions on $\mathbb{E}_{S_J,J,U}\|\xi_t\|_2^2$, our generalization error bound in Eq. (5) is clearly $O\big(\sqrt{(\beta/bn)\sum_{t\le T}\eta_t}\big)$. Since $\xi_t = 0$ whenever $K_t \subset J$, we find that our first bound in Eq. (5) is also $O\big((1/n)\sum_{t\le T}\sqrt{\beta\eta_t}\big)$. To see this, notice that for non-negative random variables $C_t$ and $B_t \sim \mathrm{Ber}(p)$,

\[ \mathbb{E}\sqrt{\sum_{t=1}^T B_t C_t} \le \mathbb{E}\Big[\sum_{t=1}^T B_t\sqrt{C_t}\Big] = p\sum_{t=1}^T \mathbb{E}[\sqrt{C_t}\,|\,B_t = 1]. \]

When $m = n-1$, taking $B_t = \mathbb{1}_{\xi_t\ne 0}$, $p = b/n$, and $C_t = \frac{\beta_t\eta_t}{8}\,\mathbb{E}_{S_J,J,U}\|\xi_t\|_2^2$ yields the stated rate.

3.2 Langevin Dynamics

Under the same notation as above, the iterates of the Langevin dynamics algorithm are given by

\[ W_{t+1} = W_t - \eta_t \nabla\tilde{R}_S(W_t) + \sqrt{2\eta_t/\beta_t}\,\varepsilon_t. \quad (7) \]

3.2.1 Expected Generalization Error Bounds for LD

We can recover generalization error bounds for LD as a special case of SGLD when the batch size is the dataset size, $b_t = n$ for all $t$. The data-dependent prior is the same as for SGLD.

Theorem 3.3 (Expected Generalization Error Bounds for Langevin Dynamics). Let $\{W_t\}_{t\in[T]}$ denote the iterates of the Langevin dynamics algorithm.
If \(\ell(Z,w)\) is \(\sigma\)-subgaussian for each \(w \in \mathcal{W}\), then
\[
\mathbb{E}\big(R_{\mathcal{D}}(W_T) - R_S(W_T)\big) \le \sqrt{\frac{\sigma^2}{(n-1)m} \sum_{t=1}^{T} \frac{\beta_t\eta_t}{4}\, \mathbb{E}\,\mathrm{tr}\big(\hat\Sigma_t(S)\big)}, \tag{8}
\]
and if \(\ell(Z,w)\) is \([a_1,a_2]\)-bounded and \(m = n-1\), then
\[
\mathbb{E}\big(R_{\mathcal{D}}(W_T) - R_S(W_T)\big) \le \mathbb{E}\sqrt{\frac{(a_2-a_1)^2}{4} \sum_{t=1}^{T} \frac{\beta_t\eta_t}{4}\, \mathbb{E}_{S_J}\|\xi_t\|_2^2} \le \frac{a_2-a_1}{2(n-1)}\, \mathbb{E}\sqrt{\sum_{t=1}^{T} \frac{\beta_t\eta_t}{4}\, \mathbb{E}_S\,\mathrm{tr}\big(\hat\Sigma_t(S)\big)},
\]
where \(\hat\Sigma_t(S) = \mathrm{Var}_{Z\sim\mathrm{Unif}(S)}\big(\nabla\tilde R_Z(W_t) \,\big|\, W_t, S\big)\) is the finite population variance matrix of the surrogate gradients.

For asymptotic properties of this bound when \(\tilde\ell\) is L-Lipschitz, as in [24], see Appendix E. For a simple analytic worked example of mean estimation using Langevin dynamics, refer to Appendix G.

Remark 3.4 (Dependence of our bounds on the subset size, m). The choice of \(m \in \{1,\dots,n\}\) can make a material difference in the quality of the bound and whether or not it is vacuous. As seen in Eq. (8), if m is \(\Omega(n)\), then the upper bound on the expected generalization error is \(O(\beta/n)\). If \(\beta\) is \(\Omega(\sqrt{n})\), as is typical in practice, then overall the bound is \(O(n^{-1/2})\). If, on the other hand, m is \(o(n)\), then the order of the bound with respect to n would be lower; in particular, if m is \(O(\sqrt{n})\), then our bound would not be decreasing in n for \(\beta\) of order \(\Omega(\sqrt{n})\).

4 Empirical Results

We have developed bounds that depend on the gradient prediction residual of our data-dependent priors (which we call the incoherence of the gradients), rather than on the gradient norms (as in [22]) or Lipschitz constants (as in [6, 24]). The extent to which this represents an advance is, however, an empirical question. The functional form of our bounds and those in the cited work are nearly identical.
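To make the quantities entering these bounds concrete, the following toy sketch (not the authors' code) runs the Langevin dynamics update of Eq. (7) on an assumed squared loss and evaluates the subgaussian form of the bound in Eq. (8); the loss, dimensions, and the choice of subgaussian constant are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (assumed for illustration): squared loss l(z, w) = 0.5*||w - z||^2,
# whose per-example surrogate gradient is simply (w - z).
n, d, T = 100, 5, 200
S = rng.normal(size=(n, d))          # training sample
m = n - 1                            # size of the subset the prior may see
eta, beta, sigma = 0.01, 100.0, 1.0  # step size, inverse temperature, subgaussian constant

w = np.zeros(d)
weighted_trace_sum = 0.0
for t in range(T):
    grads = w - S                    # per-example gradients, shape (n, d)
    # tr(Sigma_t(S)): trace of the finite-population variance of the gradients
    weighted_trace_sum += beta * eta / 4 * np.var(grads, axis=0).sum()
    # Langevin dynamics update, Eq. (7): full-batch gradient plus Gaussian noise
    w = w - eta * grads.mean(axis=0) + np.sqrt(2 * eta / beta) * rng.normal(size=d)

# Subgaussian form of the bound, Eq. (8)
gen_bound = np.sqrt(sigma**2 / ((n - 1) * m) * weighted_trace_sum)
print(gen_bound)
```

Note that for this quadratic loss the per-example gradient variance does not depend on the iterate, so the bound grows like the square root of the number of steps, as Eq. (8) suggests.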
The \ufb01rst key differences between our work and others is the replacement of\ngradient norms ((cid:107)\u2207 \u02dcRt(cid:107)2) and Lipschitz constants in other work with gradient prediction residual,\n((cid:107)\u03bet(cid:107)) in our work. The second key difference is the order of expectations and square-roots, which\nfavor our bounds due to Jensen\u2019s inequality. In this section, we perform an empirical comparison of\nthe gradient prediction residual of our data dependent priors and the gradient norm across various\narchitectures and datasets. This illustrates the \ufb01rst of the differences, the quantities appearing in the\nbound. Our results indicate that that our data-dependent priors yield signi\ufb01cantly tighter results, as\nthe sum of square gradient incoherences of our data dependent priors are between 102 and 104 times\nsmaller than the sum of square gradient norms in the experiments we ran.\n\n8\n\n\f(a) MLP for MNIST.\n\n(b) CNN for MNIST.\n\n(c) CNN for MNIST.\n\n(d) CNN for MNIST.\n\n(e) CNN for Fashion-MNIST.\n\n(f) CNN for CIFAR-10.\n\nFigure 1: Numerical results for various datasets and architectures. All x-axes show the number of Epochs of training. Fig. 1a shows the\neffect of different amounts of heldout data on the summands appearing in our bound, and what those would be if we upper bounded the\nincoherence (cid:107)\u03be(cid:107) by (cid:107)\u2207 \u02c6R(cid:107) when it is not 0. Fig. 1b compares a Monte Carlo estimate of our bound with that of [22] and shows the effect of\ninverse temperature on each. Fig. 1c compares a Monte Carlo estimate of our bound with that of [22] and shows the effect of learning rate on\neach. Figs. 1d to 1f compare the summands appearing in our bound and those of [22] across datasets.\n\nIn Fig. 1, we compare (cid:107)\u03bet(cid:107)2 and (cid:107)\u2207 \u02dcRt(cid:107)2 in order to assess the improvement our methods bring\n\nover existing results for SGLD. 
Speci\ufb01cally, the values of each plot are the averages of(cid:112)\u03b7\u03b2(cid:107)\u03bet(cid:107)/b\nand(cid:112)\u03b7\u03b2(cid:107)\u2207 \u02dcRSt(cid:107)/b over an epoch. These serve as estimates of the per-epoch contributions to the\n\nrespective summations in our Theorem 3.1 and the bound of Mou et al. (Thm. 2 therein, when there\nis no L2-regularization). The average and standard error of both expressions taken over multiple runs\nare displayed. Bounds from related work that depend on Lipschitz constants would further upper\nbound what we show for [22], by replacing (cid:107)\u2207 \u02dcRt(cid:107) with a Lipschitz constant. The Lipschitz constant\ncould be lower bounded by the largest observed gradient norm, and would be off the chart.\nFrom Fig. 1a, we see that the empirical performance re\ufb02ects our analytical results that the bound is\ntighter for large m. As can be inferred from Eq. (4), the difference between (cid:107)\u03bet(cid:107)2 and (cid:107)\u2207 \u02dcRt(cid:107)2 in-\ncreases with m. From Figs. 1d to 1f we see that the squared gradient incoherence, (cid:107)\u03bet(cid:107)2, are between\n100 and 10,000 times smaller than the squared gradient norms, (cid:107)\u2207 \u02dcR(cid:107)2 in all of these examples.\nUsing Monte Carlo simulation, we compared estimates of our expected generalization error bounds\nwith (coupled) estimates of the bound from [22]. The results, in Figs. 1b and 1c, show that our\nbounds are materially tighter, and remain non-vacuous after many more epochs. Fig. 1b also com-\npares the two generalization error bounds for different inverse temperature schedules. Fig. 1c com-\npares the two generalization error bounds based for different learning rate schedules. It can inferred\nfrom Figs. 1b and 1c that our proposed bound yields to tighter values when the learning rate and\nthe inverse temperature are small. 
However, it should be noted that with a small learning rate and inverse temperature, it would be difficult to achieve a very low training error when the empirical risk minimization is performed using SGLD.

The details of our model architectures, temperature and learning rate schedules, and hyperparameter selections may be found in Appendix H. We did not aim to achieve state-of-the-art predictive performance. With further tuning, the prediction results could be improved.

Acknowledgments

JN is supported by an NSERC Vanier Canada Graduate Scholarship, and by the Vector Institute. MH was supported by a MITACS Accelerate Fellowship with Element AI. DMR is supported by an NSERC Discovery Grant and an Ontario Early Researcher Award. This research was carried out in part while GKD and DMR were visiting the Simons Institute for the Theory of Computing.

References

[1] A. Lopez and V. Jog.
\u201cGeneralization error bounds using Wasserstein distances\u201d. In: IEEE\n\nInformation Theory Workshop. 2018.\n\n[2] A. Ambroladze, E. Parrado-Hern\u00e1ndez, and J. Shawe-Taylor. \u201cTighter PAC-Bayes bounds\u201d.\n\nIn: Advances in Neural Information Processing Systems. 2007, pp. 9\u201316.\n\n[3] A. Asadi, E. Abbe, and S. Verd\u00fa. \u201cChaining mutual information and tightening generalization\n\nbounds\u201d. In: Advances in Neural Information Processing Systems. 2018, pp. 7234\u20137243.\n\n[4] R. Bassily, S. Moran, I. Nachum, J. Shafer, and A. Yehudayoff. \u201cLearners that Use Little\n\nInformation\u201d. In: Algorithmic Learning Theory. 2018, pp. 25\u201355.\n\n[5] S. Boucheron, G. Lugosi, and P. Massart. Concentration inequalities: A nonasymptotic theory\n\nof independence. Oxford university press, 2013.\n\n[6] Y. Bu, S. Zou, and V. V. Veeravalli. \u201cTightening Mutual Information Based Bounds on Gener-\nalization Error\u201d. In: IEEE International Symposium on Information Theory (ISIT). To appear.\n2019. arXiv: 1901.04609.\n\n[7] O. Catoni. \u201cPAC-Bayesian supervised classi\ufb01cation: the thermodynamics of statistical learn-\ning\u201d. In: Institute of Mathematical Statistics Lecture Notes-Monograph Series. Vol. 56. 2007.\narXiv: 1901.04609.\n\n[8] M. D. Donsker and S. S. Varadhan. \u201cAsymptotic evaluation of certain Markov process expec-\ntations for large time, I\u201d. Communications on Pure and Applied Mathematics 28.1 (1975),\npp. 1\u201347.\n\n[9] A. Durmus and E. Moulines. \u201cNonasymptotic convergence analysis for the unadjusted\n\nLangevin algorithm\u201d. The Annals of Applied Probability 27.3 (2017), pp. 1551\u20131587.\n\n[10] G. K. Dziugaite and D. M. Roy. \u201cData-dependent PAC-Bayes priors via differential privacy\u201d.\nIn: Advances in Neural Information Processing Systems (NIPS). Vol. 29. Cambridge, MA:\nMIT Press, 2018. arXiv: 1802.09583.\n\n[11] D. L. Ermak. 
\u201cA computer simulation of charged particles in solution. I. Technique and equi-\n\nlibrium properties\u201d. The Journal of Chemical Physics 62.10 (1975), pp. 4189\u20134196.\n\n[12] V. Feldman and T. Steinke. \u201cCalibrating Noise to Variance in Adaptive Data Analysis\u201d. In:\n\nConference On Learning Theory. 2018, pp. 535\u2013544.\n\n[13] S. B. Gelfand and S. K. Mitter. \u201cRecursive stochastic algorithms for global optimization in\n\nR\u02c6d\u201d. SIAM Journal on Control and Optimization 29.5 (1991), pp. 999\u20131018.\n\n[14] M. Hardt, B. Recht, and Y. Singer. \u201cTrain faster, generalize better: Stability of stochastic\ngradient descent\u201d. In: International Conference on Machine Learning. 2016. arXiv: 1509.\n01240.\n\n[15] A. E. I. Ibrahim and M. Gastpar. Strengthened Information-theoretic Bounds on the General-\n\nization Error. 2019. arXiv: 1903.03787.\nJ. Jiao, Y. Han, and T. Weissman. \u201cDependence measures bounding the exploration bias for\ngeneral measurements\u201d. In: IEEE International Symposium on Information Theory. 2017.\n\n[16]\n\n[17] O. Kallenberg. Foundations of modern probability. Springer Science & Business Media,\n\n2006.\nJ. Kemperman. \u201cOn the Shannon capacity of an arbitrary channel\u201d. In: Indagationes Mathe-\nmaticae (Proceedings). Vol. 77. 2. North-Holland. 1974, pp. 101\u2013115.\n\n[18]\n\n[19] Y. LeCun, C. Cortes, and C. J. C. Burges. MNIST handwritten digit database.\n\nhttp://yann.lecun.com/exdb/mnist/. 2010.\nJ. Li, X. Luo, and M. Qiao. On generalization error bounds of noisy gradient methods for\nnon-convex learning. 2019. arXiv: 1902.00621.\n\n[20]\n\n[21] D. A. McAllester. \u201cSome PAC-Bayesian Theorems\u201d. Machine Learning 37.3 (Dec. 1999),\n\npp. 355\u2013363. ISSN: 1573-0565.\n\n[22] W. Mou, L. Wang, X. Zhai, and K. Zheng. \u201cGeneralization Bounds of SGLD for Non-convex\nLearning: Two Theoretical Viewpoints\u201d. In: Proceedings of the 31st Conference On Learn-\ning Theory. Ed. by S. 
Bubeck, V. Perchet, and P. Rigollet. Vol. 75. Proceedings of Machine Learning Research. PMLR, June 2018, pp. 605–638.

[23] E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun. "PAC-Bayes bounds with data dependent priors". Journal of Machine Learning Research 13.Dec (2012), pp. 3507–3531.

[24] A. Pensia, V. Jog, and P.-L. Loh. "Generalization error bounds for noisy, iterative algorithms". In: 2018 IEEE International Symposium on Information Theory (ISIT). 2018, pp. 546–550.

[25] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker. "On variational bounds of mutual information" (2019). arXiv: 1905.06922.

[26] M. Raginsky, A. Rakhlin, and M. Telgarsky. "Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis". In: Proc. Conference on Learning Theory (COLT). 2017. arXiv: 1702.03849.

[27] M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu. "Information-theoretic analysis of stability and bias of learning algorithms". In: 2016 IEEE Information Theory Workshop (ITW). IEEE. 2016, pp. 26–30.

[28] O. Rivasplata, C. Szepesvari, J. S. Shawe-Taylor, E. Parrado-Hernandez, and S. Sun. "PAC-Bayes bounds for stable algorithms with instance-dependent priors". In: Advances in Neural Information Processing Systems. 2018, pp. 9214–9224.

[29] D. Russo and J. Zou. How much does your data exploration overfit? Controlling bias via information usage. 2015. arXiv: 1511.05219.

[30] S. Shalev-Shwartz and S. Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

[31] J. Shawe-Taylor and R. C. Williamson. "A PAC analysis of a Bayesian estimator". In: Proceedings of the tenth annual conference on Computational learning theory. ACM. 1997, pp. 2–9.

[32] V. Thomas, F. Pedregosa, B. van Merriënboer, P.-A. Manzagol, Y.
Bengio, and N. L. Roux. "Information matrices and generalization". arXiv preprint arXiv:1906.07774 (2019).

[33] Y.-X. Wang, J. Lei, and S. E. Fienberg. "On-Average KL-Privacy and Its Equivalence to Generalization for Max-Entropy Mechanisms". In: Privacy in Statistical Databases. Ed. by J. Domingo-Ferrer and M. Pejić-Bach. Cham: Springer International Publishing, 2016, pp. 121–134. ISBN: 978-3-319-45381-1.

[34] M. Welling and Y. W. Teh. "Bayesian learning via stochastic gradient Langevin dynamics". In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). 2011, pp. 681–688.

[35] A. Xu and M. Raginsky. "Information-theoretic analysis of generalization capability of learning algorithms". In: Advances in Neural Information Processing Systems. 2017, pp. 2524–2533.