{"title": "Data-dependent PAC-Bayes priors via differential privacy", "book": "Advances in Neural Information Processing Systems", "page_first": 8430, "page_last": 8441, "abstract": "The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999) can incorporate knowledge about the learning algorithm and (data) distribution through the use of distribution-dependent priors, yielding tighter generalization bounds on data-dependent posteriors. Using this flexibility, however, is difficult, especially when the data distribution is presumed to be unknown. We show how a differentially private data-dependent prior yields a valid PAC-Bayes bound, and then show how non-private mechanisms for choosing priors can also yield generalization bounds. As an application of this result, we show that a Gaussian prior mean chosen via stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011) leads to a valid PAC-Bayes bound due to control of the 2-Wasserstein distance to a differentially private stationary distribution. We study our data-dependent bounds empirically, and show that they can be nonvacuous even when other distribution-dependent bounds are vacuous.", "full_text": "Data-dependent PAC-Bayes priors\n\nvia differential privacy\n\nGintare Karolina Dziugaite\n\nUniversity of Cambridge; Element AI\n\nDaniel M. Roy\n\nUniversity of Toronto; Vector Institute\n\nAbstract\n\nThe Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999)\ncan incorporate knowledge about the learning algorithm and (data) distribution\nthrough the use of distribution-dependent priors, yielding tighter generalization\nbounds on data-dependent posteriors. Using this \ufb02exibility, however, is dif\ufb01cult,\nespecially when the data distribution is presumed to be unknown. We show how an\ne-differentially private data-dependent prior yields a valid PAC-Bayes bound, and\nthen show how non-private mechanisms for choosing priors can also yield gener-\nalization bounds. 
As an application of this result, we show that a Gaussian prior mean chosen via stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011) leads to a valid PAC-Bayes bound given control of the 2-Wasserstein distance to an $\varepsilon$-differentially private stationary distribution. We study our data-dependent bounds empirically, and show that they can be nonvacuous even when other distribution-dependent bounds are vacuous.

1 Introduction

There has been a resurgence of interest in PAC-Bayes bounds, especially towards explaining generalization in large-scale neural networks trained by stochastic gradient descent (Dziugaite and Roy, 2017; Neyshabur et al., 2017b; Neyshabur et al., 2017a; London, 2017). See also (Bégin et al., 2016; Germain et al., 2016; Thiemann et al., 2017; Bartlett, Foster, and Telgarsky, 2017; Raginsky, Rakhlin, and Telgarsky, 2017; Grünwald and Mehta, 2017; Smith and Le, 2018).

PAC-Bayes bounds control the generalization error of Gibbs classifiers (aka PAC-Bayes "posteriors") in terms of the Kullback–Leibler (KL) divergence to a fixed probability measure (aka PAC-Bayes "prior") on the space of classifiers. PAC-Bayes bounds depend on a tradeoff between the empirical risk of the posterior $Q$ and a penalty $\frac{1}{m}\mathrm{KL}(Q\|P)$, where $P$ is the prior, fixed independently of the sample $S \in Z^m$ from some space $Z$ of labelled examples. The KL penalty is typically the largest contribution to the bound, and so finding the tightest possible bound generally depends on minimizing the KL term.

The KL penalty vanishes for $Q = P$, but typically $P$, viewed as a randomized (Gibbs) classifier, has poor performance since it has been chosen independently of the data. On the other hand, since $P$ is chosen independently of the data, posteriors $Q$ tuned to the data to achieve minimal empirical risk often bear little resemblance to the data-independent prior $P$, causing $\mathrm{KL}(Q\|P)$ to be large.
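To make the size of this penalty concrete (our illustration, not from the paper): when prior and posterior are isotropic Gaussians with a shared variance, the KL term reduces to a squared distance between the means, so a posterior whose mean has moved far from a data-independent prior mean pays a penalty that grows with dimension and squared shift. A minimal sketch:

```python
import numpy as np

def kl_gaussians(mu_q, mu_p, sigma2):
    """KL(N(mu_q, sigma2*I) || N(mu_p, sigma2*I)) = ||mu_q - mu_p||^2 / (2*sigma2)."""
    diff = np.asarray(mu_q, dtype=float) - np.asarray(mu_p, dtype=float)
    return float(np.sum(diff ** 2) / (2.0 * sigma2))

# A posterior mean shifted by 0.1 per coordinate in 1000 dimensions already
# incurs a KL penalty of 1000 * 0.01 / (2 * 0.01) = 500 nats, which enters
# the PAC-Bayes bound as KL(Q||P)/m.
mu_p = np.zeros(1000)      # hypothetical data-independent prior mean
mu_q = mu_p + 0.1          # hypothetical data-tuned posterior mean
print(kl_gaussians(mu_q, mu_p, sigma2=0.01))   # roughly 500
```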
As a result, PAC-Bayes bounds can be loose or even vacuous.

The problem of excessive KL penalties is not inherent to the PAC-Bayes framework. Indeed, the PAC-Bayes theorem permits one to choose the prior $P$ based on the distribution $\mathcal{D}$ of the data. However, since $\mathcal{D}$ is considered unknown, and our only insight as to $\mathcal{D}$ is through the sample $S$, this flexibility would seem to be useless, as $P$ must be chosen independently of $S$ in existing bounds. Nevertheless, it is possible to make progress in this direction, and it is likely the best way towards tighter bounds and deeper understanding.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

There is a growing body of work in the PAC-Bayes literature on data-distribution-dependent priors (Catoni, 2007; Parrado-Hernández et al., 2012; Lever, Laviolette, and Shawe-Taylor, 2013). Our focus is on generalization bounds that use all the data, and in this setting, Lever, Laviolette, and Shawe-Taylor (2013) prove a remarkable result for Gibbs posteriors. Writing $\hat{L}_S$ for the empirical risk function with respect to the sample $S$, Lever, Laviolette, and Shawe-Taylor study the randomized classifier $Q$ with density (relative to some base measure) proportional to $\exp(-\tau \hat{L}_S)$. For large values of $\tau$, $Q$ concentrates increasingly around the empirical risk minimizers. As a prior, the authors consider the probability distribution with density proportional to $\exp(-\tau L_{\mathcal{D}})$, where the empirical risk has been replaced by its expectation, $L_{\mathcal{D}}$, the (true) risk. Remarkably, Lever, Laviolette, and Shawe-Taylor are able to upper bound the KL divergence between the two distributions by a term independent of $\mathcal{D}$, yielding the following result. Here, $\mathrm{kl}(q\|p)$ is the KL divergence between Bernoulli measures with means $q$ and $p$. See Section 4 for more details.

Theorem 1.1 (Lever, Laviolette, and Shawe-Taylor 2013). Fix $\tau > 0$.
For $S \in Z^m$, let $Q(S) = P^{\exp(-\tau \hat{L}_S)}$ be a Gibbs posterior with respect to some base measure $P$ on $\mathbb{R}^p$, where the empirical risk $\hat{L}_S$ is bounded in $[0,1]$. For every $\delta > 0$, $m \in \mathbb{N}$, and distribution $\mathcal{D}$ on $Z$,

$$\mathbb{P}_{S \sim \mathcal{D}^m}\!\left( \mathrm{kl}\big(\hat{L}_S(Q(S)) \,\big\|\, L_{\mathcal{D}}(Q(S))\big) \le \frac{1}{m}\left( \tau\sqrt{\frac{2}{m}\ln\frac{2\sqrt{m}}{\delta}} + \frac{\tau^2}{2m} + \ln\frac{2\sqrt{m}}{\delta} \right) \right) \ge 1 - \delta. \qquad (1.1)$$

The dependence on the data distribution is captured through $\tau$, which is ideally chosen as small as possible, subject to $Q(S)$ yielding small empirical risk. (One can use a union bound to tune $\tau$ based on $S$.) The fact that the KL bound does not depend on $\mathcal{D}$, other than through $\tau$, implies that the bound must be loose for all $\tau$ such that there exists a distribution $\mathcal{D}$ that causes $Q$ to overfit with high probability on size-$m$ datasets $S \sim \mathcal{D}^m$. In other words, for fixed $\tau$, the bound is no longer distribution dependent. This would not be important if not for the following empirical finding: weights sampled according to high values of $\tau$ do not overfit on real data, but they do on data whose labels have been randomized. Thus these bounds are vacuous in practice even when the generalization error is, in fact, small. Evidently, the KL bound gives up too much.

Our work launches a different attack on the problem of using distribution-dependent priors. Loosely speaking, if a prior is chosen on the basis of the data, but in a way that is very stable to perturbations of the data set, then the resulting data-dependent prior should reflect the underlying data distribution, rather than the data, resulting in a bound that should still hold, perhaps with smaller probability. We formalize this intuition using differential privacy (Dwork, 2006; Dwork et al., 2015b). We show that an $\varepsilon$-differentially private prior mean yields a valid, though necessarily looser, PAC-Bayes generalization bound. (See Theorem 4.2.)
The result is a straightforward application of results connecting privacy and adaptive data analysis (Dwork et al., 2015b; Dwork et al., 2015a). The real challenge is using such a result: in practice, $\varepsilon$-differentially private mechanisms can be expensive to compute. In the context of generalization bounds for neural networks, we consider the possibility of using stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011) to choose a data-dependent prior by way of stochastic optimization/sampling.

By various results, SGLD is known to produce an $(\varepsilon,\delta)$-differentially private release (Mir, 2013; Bassily, Smith, and Thakurta, 2014; Dimitrakakis et al., 2014; Wang, Fienberg, and Smola, 2015; Minami et al., 2016). A gap remains between pure and approximate differential privacy. Even if this gap were to be closed, the privacy/accuracy tradeoff of these analyses is too poor because they do not take advantage of the fact that, under some technical conditions, the distribution of SGLD's output converges weakly towards a stationary distribution (Teh, Thiery, and Vollmer, 2016, Thm. 7), which is $\varepsilon$-differentially private. One can also bound the KL divergence (and then the 2-Wasserstein distance) of SGLD to stationarity within a constant given an appropriate fixed step size (Raginsky, Rakhlin, and Telgarsky, 2017). Neither result implies that SGLD achieves pure $\varepsilon$-differential privacy.

We show that we can circumvent this barrier in our PAC-Bayes setting. We give a general PAC-Bayes bound for non-private data-dependent priors and then an application to multivariate Gaussian priors with non-private data-dependent means, with explicit bounds for the case of Gibbs posteriors. In particular, conditional on a data set, if a data-dependent mean vector $w$ is close in 2-Wasserstein distance to an $\varepsilon$-differentially private mean vector, then the generalization error is close to that of the $\varepsilon$-differentially private mean.
The data-dependent mean $w$ is not necessarily differentially private, even approximately. As a consequence, under suitable assumptions, SGLD can be used to optimize a data-dependent mean and still yield a valid PAC-Bayes bound.

2 Other Related Work

Our analysis relies on the stability of a data-dependent prior. Stability has long been understood to relate to generalization (Bousquet and Elisseeff, 2002). Our result relies on the connection between generalization and differential privacy (Dwork et al., 2015b; Dwork et al., 2015a; Bassily et al., 2016; Oneto, Ridella, and Anguita, 2017), which can be viewed as a particularly stringent notion of stability. See (Dwork, 2008) for a survey of differential privacy.

Kifer, Smith, and Thakurta (2012, Thm. 1) also establish a "limit" theorem for differential privacy, showing that the almost sure limit of mechanisms of the same privacy level is itself a private mechanism of the same privacy level. Our result can be viewed as a significant weakening of the hypothesis to require only that the weak limit be private: no element of the sequence need be private.

The bounds we establish hold for bounded loss functions and i.i.d. data. Under additional assumptions, one can obtain PAC-Bayes generalization and excess risk bounds for unbounded loss with heavy tails (Catoni, 2007; Germain et al., 2016; Grünwald and Mehta, 2016; Alquier and Guedj, 2018). Alquier and Guedj (2018) also consider non-i.i.d. training data. Our approach to differentially private data-dependent priors can be readily extended to these settings.

3 Preliminaries

Let $Z$ be a measurable space, let $\mathcal{M}_1(Z)$ denote the space of probability measures on $Z$, and let $\mathcal{D} \in \mathcal{M}_1(Z)$ be unknown.
We consider the batch supervised learning setting under a loss function bounded below: having observed $S \sim \mathcal{D}^m$, i.e., $m$ independent and identically distributed samples from $\mathcal{D}$, we aim to choose a predictor, parameterized by a weight vector $w \in \mathbb{R}^p$, with minimal risk

$$L_{\mathcal{D}}(w) \stackrel{\text{def}}{=} \mathbb{E}_{z \sim \mathcal{D}}\{\ell(w,z)\},$$

where $\ell : \mathbb{R}^p \times Z \to \mathbb{R}$ is measurable and bounded below. (We ignore the possibility of constraints on the weight vector for simplicity.) We also consider randomized predictors, represented by probability measures $Q \in \mathcal{M}_1(\mathbb{R}^p)$, whose risks are defined via averaging,

$$L_{\mathcal{D}}(Q) \stackrel{\text{def}}{=} \mathbb{E}_{w \sim Q}\{L_{\mathcal{D}}(w)\} = \mathbb{E}_{z \sim \mathcal{D}}\big\{ \mathbb{E}_{w \sim Q}\{\ell(w,z)\} \big\},$$

where the second equality follows from Tonelli's theorem and the fact that $\ell$ is bounded below.

Let $S = (z_1, \dots, z_m)$ and let $\hat{\mathcal{D}} \stackrel{\text{def}}{=} \frac{1}{m}\sum_{i=1}^m \delta_{z_i}$ be the empirical distribution. Given a randomized predictor $Q$, such as that chosen by a learning algorithm on the basis of data $S$, its empirical risk

$$\hat{L}_S(Q) \stackrel{\text{def}}{=} L_{\hat{\mathcal{D}}}(Q) = \frac{1}{m}\sum_{i=1}^m \mathbb{E}_{w \sim Q}\{\ell(w,z_i)\},$$

is studied as a stand-in for its risk, which we cannot compute. While $\hat{L}_S(Q)$ is easily seen to be an unbiased estimate of $L_{\mathcal{D}}(Q)$ when $Q$ is independent of $S$, our goal is to characterize the (one-sided) generalization error $L_{\mathcal{D}}(Q) - \hat{L}_S(Q)$ when $Q$ is random and dependent on $S$.

Finally, when optimizing the weight vector or defining tractable distributions on $\mathbb{R}^p$, we use a (differentiable) surrogate risk $\tilde{L}_S$, which is the empirical average of a bounded surrogate loss taking values in an interval of length $\Delta$.

3.1 Differential privacy

For readers not familiar with differential privacy, Appendix A provides the basic definitions and results. We use the notation $\mathcal{A} : R \rightsquigarrow T$ to denote a randomized algorithm $\mathcal{A}$ that takes an input in a measurable space $R$ and produces a random output in the measurable space $T$.

Definition 3.1.
A randomized algorithm $\mathcal{A} : Z^m \rightsquigarrow T$ is $(\varepsilon,\delta)$-differentially private if, for all pairs $S, S' \in Z^m$ that differ at only one coordinate, and all measurable subsets $B \subseteq T$, we have $\mathbb{P}\{\mathcal{A}(S) \in B\} \le e^{\varepsilon}\, \mathbb{P}\{\mathcal{A}(S') \in B\} + \delta$. Further, $\varepsilon$-differentially private means $(\varepsilon,0)$-differentially private.

For our purposes, max-information is the key quantity controlled by differential privacy.

Definition 3.2 (Dwork et al. 2015a, §3). Let $\beta \ge 0$, let $X$ and $Y$ be random variables in arbitrary measurable spaces, and let $X'$ be independent of $Y$ and equal in distribution to $X$. The $\beta$-approximate max-information between $X$ and $Y$, denoted $I^{\beta}_{\infty}(X;Y)$, is the least value $k$ such that, for all product-measurable events $E$,

$$\mathbb{P}\{(X,Y) \in E\} \le e^{k}\, \mathbb{P}\{(X',Y) \in E\} + \beta. \qquad (3.1)$$

The max-information $I_{\infty}(X;Y)$ is defined to be $I^{\beta}_{\infty}(X;Y)$ for $\beta = 0$. For $m \in \mathbb{N}$ and $\mathcal{A} : Z^m \rightsquigarrow T$, the $\beta$-approximate max-information of $\mathcal{A}$, denoted $I^{\beta}_{\infty}(\mathcal{A},m)$, is the least value $k$ such that, for all $\mathcal{D} \in \mathcal{M}_1(Z)$, $I^{\beta}_{\infty}(S; \mathcal{A}(S)) \le k$ when $S \sim \mathcal{D}^m$. The max-information of $\mathcal{A}$ is defined similarly.¹

In Section 4.1, we consider the case where the dataset $S$ and a data-dependent prior $P(S)$ have small approximate max-information. The above definition tells us that we can almost treat the data-dependent prior as if it were chosen independently of $S$. The following is the key result connecting pure differential privacy and max-information:

Theorem 3.3 (Dwork et al. 2015a, Thms. 19–20). Fix $m \in \mathbb{N}$. Let $\mathcal{A} : Z^m \rightsquigarrow T$ be $\varepsilon$-differentially private. Then $I_{\infty}(\mathcal{A},m) \le \varepsilon m$ and, for all $\beta > 0$, $I^{\beta}_{\infty}(\mathcal{A},m) \le \varepsilon^2 m/2 + \varepsilon\sqrt{m \ln(2/\beta)/2}$.

4 PAC-Bayes bounds

Let $Q, P \in \mathcal{M}_1(\mathbb{R}^p)$. When $Q$ is absolutely continuous with respect to $P$, written $Q \ll P$, we write $\frac{dQ}{dP} : \mathbb{R}^p \to \mathbb{R}_+ \cup \{\infty\}$ for some Radon–Nikodym derivative of $Q$ with respect to $P$. The Kullback–Leibler (KL) divergence from $Q$ to $P$ is $\mathrm{KL}(Q\|P) = \int \ln \frac{dQ}{dP}\, dQ$ if $Q \ll P$ and $\infty$ otherwise.
Let $B_p$ denote the Bernoulli distribution on $\{0,1\}$ with mean $p$. For $p, q \in [0,1]$, we abuse notation and define

$$\mathrm{kl}(q\|p) \stackrel{\text{def}}{=} \mathrm{KL}(B_q\|B_p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}.$$

The following PAC-Bayes bound for bounded loss is due to Maurer (2004), and extends the 0–1 bound established by Langford and Seeger (2001), building off the seminal work of McAllester (1999) and Shawe-Taylor and Williamson (1997). See also (Langford, 2002) and (Catoni, 2007).

Theorem 4.1 (PAC-Bayes; Maurer 2004, Thm. 5). Under bounded loss $\ell \in [0,1]$, for every $\delta > 0$, $m \in \mathbb{N}$, distribution $\mathcal{D}$ on $Z$, and distribution $P$ on $\mathbb{R}^p$,

$$\mathbb{P}_{S \sim \mathcal{D}^m}\!\left( (\forall Q)\ \mathrm{kl}\big(\hat{L}_S(Q) \,\big\|\, L_{\mathcal{D}}(Q)\big) \le \frac{\mathrm{KL}(Q\|P) + \ln\frac{2\sqrt{m}}{\delta}}{m} \right) \ge 1 - \delta. \qquad (4.1)$$

One can use Pinsker's inequality to obtain a bound on the generalization error $|\hat{L}_S(Q) - L_{\mathcal{D}}(Q)|$; however, this significantly loosens the bound, especially when $\hat{L}_S(Q)$ is close to zero. We refer to the quantity $\mathrm{kl}(\hat{L}_S(Q)\|L_{\mathcal{D}}(Q))$ as the KL-generalization error. From a bound on this quantity, we can bound the risk as follows: given empirical risk $q$ and a bound $c$ on the KL-generalization error, the risk is bounded by the largest value $p \in [0,1]$ such that $\mathrm{kl}(q\|p) \le c$. See (Dziugaite and Roy, 2017) for a discussion of this computation. When the empirical risk is near zero, the KL-generalization error is essentially the generalization error. As empirical risk increases, the bound loosens and the square root of the KL-generalization error bounds the generalization error.

4.1 Data-dependent priors

The prior $P$ that appears in the PAC-Bayes generalization bound must be chosen independently of the data $S \sim \mathcal{D}^m$, but can depend on the data distribution $\mathcal{D}$ itself. If a data-dependent prior $P(S)$ does not depend too much on any individual data point, it should behave as if it depends only on the distribution.
Theorem 3.3 allows us to formalize this intuition: we can obtain new PAC-Bayes bounds that use data-dependent priors, provided they are $\varepsilon$-differentially private. We provide an example using the bound of Maurer (Theorem 4.1).

Theorem 4.2 (PAC-Bayes with private data-dependent priors). Fix a bounded loss $\ell \in [0,1]$. Let $m \in \mathbb{N}$, let $P : Z^m \rightsquigarrow \mathcal{M}_1(\mathbb{R}^p)$ be an $\varepsilon$-differentially private mechanism for choosing a data-dependent prior, let $\mathcal{D} \in \mathcal{M}_1(Z)$, and let $S \sim \mathcal{D}^m$. Then, with probability at least $1 - \delta$,

$$\forall Q \in \mathcal{M}_1(\mathbb{R}^p),\quad \mathrm{kl}\big(\hat{L}_S(Q) \,\big\|\, L_{\mathcal{D}}(Q)\big) \le \frac{\mathrm{KL}(Q\|P(S)) + \ln\frac{4\sqrt{m}}{\delta}}{m} + \frac{\varepsilon^2}{2} + \varepsilon\sqrt{\frac{\ln(4/\delta)}{2m}}. \qquad (4.2)$$

¹Note that in much of the literature it is standard to express the max-information in bits, i.e., the factor $e^{k}$ above is replaced by $2^{k'}$ with $k' = k \log_2 e$.

See Appendix B for a proof of a more general statement of the theorem and further discussion. The main innovation here is recognizing the potential to choose data-dependent priors using private mechanisms. The hard work is done by Theorem 3.3: obtaining differentially private versions of other PAC-Bayes bounds is straightforward.

When choosing the privacy parameter $\varepsilon$, there is a balance between minimizing the direct contributions of $\varepsilon$ to the bound (forcing $\varepsilon$ smaller) and minimizing the indirect contribution of $\varepsilon$ through the KL term for posteriors $Q$ that have low empirical risk (forcing $\varepsilon$ larger). The optimal value for $\varepsilon$ is often much less than one, which can be challenging to obtain. We discuss strategies for achieving the required privacy in later sections.

5 Weak approximations to $\varepsilon$-differentially private priors

Theorem 4.2 permits data-dependent priors that are chosen by $\varepsilon$-differentially private mechanisms. In this section, we discuss concrete families of priors and mechanisms for choosing among them in data-dependent ways.
We also relax Theorem 4.2 to allow non-private priors.

We apply our main result to non-private data-dependent Gaussian priors with a fixed covariance matrix. Thus, we choose only the mean $w_0 \in \mathbb{R}^p$ privately in a data-dependent way. We show that it suffices for the data-dependent mean to be merely close in 2-Wasserstein distance to a private mean to yield a generalization bound. (It is natural to consider also choosing a data-dependent covariance, but, as argued below, the privacy budget we have in applications to generalization is very small.)

Ideally, we would choose a mean vector $w_0$ that leads to a tight bound. A reasonable approach is to choose $w_0$ by approximately minimizing the empirical risk $\hat{L}_S$ or surrogate risk $\tilde{L}_S$, subject to privacy constraints. A natural way to do this is via an exponential mechanism. We pause to introduce some notation for Gibbs distributions: for a measure $P$ on $\mathbb{R}^p$ and a measurable function $g : \mathbb{R}^p \to \mathbb{R}$, let $P[g]$ denote the expectation $\int g(h)\, P(dh)$ and, provided $P[g] < \infty$, let $P^{g}$ denote the probability measure on $\mathbb{R}^p$, absolutely continuous with respect to $P$, with Radon–Nikodym derivative $\frac{dP^{g}}{dP}(h) = \frac{g(h)}{P[g]}$. A distribution of the form $P^{\exp(-\tau g)}$ is generally referred to as a Gibbs distribution with energy function $g$ and inverse temperature $\tau$. In the special case where $P$ is a probability measure, we call $P^{\exp(-\tau \tilde{L}_S)}$ a "Gibbs posterior".

Lemma 5.1 (McSherry and Talwar 2007, Thm. 6). Let $q : Z^m \times \mathbb{R}^p \to \mathbb{R}$ be measurable, let $P$ be a $\sigma$-finite measure on $\mathbb{R}^p$, let $\beta > 0$, and assume $P[\exp(\beta q(S,\cdot))] < \infty$ for all $S \in Z^m$. Let $\Delta q \stackrel{\text{def}}{=} \sup_{S,S'} \sup_{w \in \mathbb{R}^p} |q(S,w) - q(S',w)|$, where the first supremum ranges over pairs $S, S' \in Z^m$ that disagree on no more than one coordinate. Let $\mathcal{A} : Z^m \rightsquigarrow \mathbb{R}^p$, on input $S \in Z^m$, output a sample from the Gibbs distribution $P^{\exp(\beta q(S,\cdot))}$.
Then $\mathcal{A}$ is $2\beta \Delta q$-differentially private.

The following result is a straightforward application of Lemma 5.1, and essentially equivalent results have appeared in numerous studies of the differential privacy of Bayesian and Gibbs posteriors (Mir, 2013; Bassily, Smith, and Thakurta, 2014; Dimitrakakis et al., 2014; Wang, Fienberg, and Smola, 2015; Minami et al., 2016).

Corollary 5.2. Let $\tau > 0$ and let $\tilde{L}_S$ denote the surrogate risk, taking values in an interval of length $\Delta$. One sample from the Gibbs posterior $P^{\exp(-\tau \tilde{L}_S)}$ is $\frac{2\tau\Delta}{m}$-differentially private.

5.1 Weak convergence yields valid PAC-Bayes bounds

Even for small values of the inverse temperature, it is difficult to implement the exponential mechanism because sampling from Gibbs posteriors exactly is intractable. On the other hand, a number of algorithms exist for generating approximate samples from Gibbs posteriors. If one can control the total-variation distance, one can obtain a bound like Theorem 4.2 by applying approximate max-information bounds for $(\varepsilon,\delta)$-differentially private mechanisms. However, many algorithms do not control the total-variation distance to stationarity, or do so poorly.

The generalization properties of randomized classifiers are generally insensitive to small variations of the parameters, however, and so it stands to reason that our data-dependent prior need only be close to an $\varepsilon$-differentially private prior. We formalize this intuition here by deriving bounds on $\mathrm{KL}(Q\|P(S))$ in terms of a non-private data-dependent prior $P_S$. We start with an identity:

Lemma 5.3. If $P' \ll P$ then $\mathrm{KL}(Q\|P) = \mathrm{KL}(Q\|P') + Q[\ln \frac{dP'}{dP}]$.
The proof is straightforward. The lemma highlights the role of $Q$ in judging the difference between $P'$ and $P$, and leads immediately to the following corollary (see Appendix C).

Lemma 5.4 (Non-private priors). Let $m \in \mathbb{N}$, let $P : Z^m \rightsquigarrow \mathcal{M}_1(\mathbb{R}^p)$ be $\varepsilon$-differentially private, let $\mathcal{D} \in \mathcal{M}_1(Z)$, let $S \sim \mathcal{D}^m$, and let $P_S$ be a data-dependent prior such that, for some $P^*(S)$ satisfying $\mathbb{P}[P^*(S)|S] = \mathbb{P}[P(S)|S]$, we have $P_S \ll P^*(S)$ with probability at least $1 - \delta'$. Then, with probability at least $1 - \delta - \delta'$, Eq. (4.2) holds with $\mathrm{KL}(Q\|P(S))$ replaced by $\mathrm{KL}(Q\|P_S) + Q[\ln \frac{dP_S}{dP^*(S)}]$.

The conditions that $P_S \ll P^*(S)$ and $Q[\ln \frac{dP_S}{dP^*(S)}] < \infty$ for some $Q$ do not constrain $P_S$ to be differentially private. In fact, $S$ could be $P_S$-measurable!

Lemma 5.4 is not immediately applicable because $P^*(S)$ is intractable to generate. The following application considers multivariate Gaussian priors, $N(w)$, indexed by their mean vectors $w \in \mathbb{R}^p$, with a fixed positive definite covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. We require two technical results: the first implies that we can bound $Q[\ln \frac{dP_S}{dP^*(S)}]$ if $Q$ concentrates near the non-private mean; the second characterizes this concentration for Gibbs posteriors built from bounded surrogate risks.

Lemma 5.5. $Q[\ln \frac{dN(w')}{dN(w)}] \le \frac{1}{2}\|w - w'\|^2_{\Sigma^{-1}} + \|w - w'\|_{\Sigma^{-1}}\, \mathbb{E}_{v \sim Q}\|v - w'\|_{\Sigma^{-1}}$.

Lemma 5.6. Let $P = N(w)$ and $Q = P^{\exp(-h)}$ for $h \ge 0$. Then $\mathbb{E}_{v \sim Q}\|v - w\|_{\Sigma^{-1}} \le \sqrt{2\|h\|_{L^{\infty}(P)}} + \sqrt{2/\pi}$.

Corollary 5.7 (Gaussian means close to private means). Let $m \in \mathbb{N}$, let $\mathcal{D} \in \mathcal{M}_1(Z)$, let $S \sim \mathcal{D}^m$, let $\mathcal{A} : Z^m \rightsquigarrow \mathbb{R}^p$ be $\varepsilon$-differentially private, and let $w(S)$ denote a data-dependent mean vector such that, for some $w^*(S)$ satisfying $\mathbb{P}[w^*(S)|S] = \mathbb{P}[\mathcal{A}(S)|S]$, we have

$$\|w(S) - w^*(S)\|_2^2 \le C \qquad (5.1)$$

with probability at least $1 - \delta'$. Let $\sigma_{\min}$ be the minimum eigenvalue of $\Sigma$. Then, with probability at least $1 - \delta - \delta'$, Eq. (4.2) holds with $\mathrm{KL}(Q\|P(S))$ replaced by $\mathrm{KL}(Q\|N(w(S))) + \frac{1}{2}\, C/\sigma_{\min} + \sqrt{C/\sigma_{\min}}\; \mathbb{E}_{v \sim Q}\|v - w(S)\|_{\Sigma^{-1}}$.
In particular, for a Gibbs posterior $Q = P_S^{\exp(-\tau \tilde{L}_S)}$, we have $\mathbb{E}_{v \sim Q}\|v - w(S)\|_{\Sigma^{-1}} \le \sqrt{2\tau\Delta} + \sqrt{2/\pi}$.

See Appendix C for details and further discussion.

One way to achieve Eq. (5.1) is to construct $w_1(S), w_2(S), \dots$ so that $\mathbb{P}[w_n(S)|S] \to \mathbb{P}[\mathcal{A}(S)|S]$ weakly with high probability. Skorohod's representation theorem then implies the existence of $w^*(S)$. One of the standard algorithms used to generate such sequences for high-dimensional Gibbs distributions is stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011).

In order to get nonasymptotic results, it suffices to bound the 2-Wasserstein distance of the SGLD Markov chain to stationarity. Recall that the $p$-Wasserstein distance between $\mu$ and $\nu$ is given by $(W_p(\mu,\nu))^p = \inf_{\gamma} \int \|v - w\|_2^p\, d\gamma(v,w)$, where the infimum runs over couplings of $\mu$ and $\nu$, i.e., distributions $\gamma \in \mathcal{M}_1(\mathbb{R}^p \times \mathbb{R}^p)$ with marginals $\mu$ and $\nu$, respectively.

Let $\mathcal{A}(S)$ return a sample from the Gibbs posterior $P^{\exp(-\tau \tilde{L}_S)}$ with Gaussian $P$ and surrogate risk $\tilde{L}_S$ constructed from smooth loss functions taking values in a length-$\Delta$ interval. By Corollary 5.2, $\mathcal{A}$ is $\frac{2\tau\Delta}{m}$-differentially private. Consider running SGLD to target $\mathbb{P}[\mathcal{A}(S)|S] = P^{\exp(-\tau \tilde{L}_S)}$. Assume² that, for every $c > 0$, there is a step size $\eta > 0$ and a number of SGLD iterations $n \in O(\frac{1}{c^q})$, such that the $n$-th iterate $w(S) \in \mathbb{R}^p$ produced by SGLD satisfies $W_2(\mathbb{P}[w(S)|S], \mathbb{P}[\mathcal{A}(S)|S]) \le c$. Markov's inequality and the definition of $W_2$ immediately imply the following.

Corollary 5.8 (Prior via SGLD). Under the above assumption, for some step size $\eta > 0$, the $n$-th iterate $w(S)$ of SGLD targeting $P^{\exp(-\tau \tilde{L}_S)}$ satisfies Corollary 5.7 with $\varepsilon = \frac{2\tau\Delta}{m}$ and $C \in O\big(\frac{1}{\delta' n^{1/q}}\big)$.

The dependence on $\delta'$ is poor.
However, one can construct Markov chain algorithms that are geometrically ergodic, in which case $n^{1/q}$ is replaced by a term $2^{\Omega(n)}$, allowing one to spend computation to control the $1/\delta'$ term.

²The status of this assumption relates to results by Mattingly, Stuart, and Higham (2002, Thm. 7.3) and Raginsky, Rakhlin, and Telgarsky (2017, Prop. 3.3), under so-called dissipativity conditions on the regularized loss. We have hidden potentially exponential dependence on $\tau$, which is problem dependent. See also (Erdogdu, Mackey, and Shamir, 2018).

6 Empirical studies

We have presented data-dependent bounds, and so it is necessary to study them empirically to evaluate their usefulness. The goal of this section is to make a number of arguments. First, it is an empirical question as to what value of the inverse temperature $\tau$ is sufficient to yield small empirical risk from a Gibbs posterior. Indeed, we compare to the bound of Lever, Laviolette, and Shawe-Taylor (2013), presented above as Theorem 1.1. As Lever, Laviolette, and Shawe-Taylor point out, this bound depends explicitly on $\tau$, where it plays an obvious role as a measure of complexity. Second, despite how tight this bound is for small values of $\tau$, the bound must become vacuous before the Gibbs posterior would have started to overfit on random labels, because this bound holds for all data distributions. We demonstrate that this phase transition happens well before the Gibbs posterior achieves its minimum risk on true labels. Third, because our bound retains the KL term, we can potentially identify easy data. Indeed, our risk bound decreases beyond the point where the same classifier begins to overfit to random labels. Finally, our results suggest that we can use the property that SGLD converges weakly to investigate the generalization properties of Gibbs classifiers.
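For reference, the SGLD transitions used throughout take the standard form of Welling and Teh (2011): a (stochastic) gradient half-step on the energy plus Gaussian noise whose variance equals the step size, so that the chain targets the corresponding Gibbs distribution. The following is a minimal sketch on a toy quadratic energy of our own choosing, not the training loop used in our experiments:

```python
import numpy as np

def sgld_step(w, grad_U, eta, rng):
    """One SGLD step targeting pi(w) proportional to exp(-U(w)).

    grad_U: (stochastic estimate of) the gradient of the energy U at w.
    eta: step size; the injected Gaussian noise has variance eta.
    """
    return w - 0.5 * eta * grad_U + rng.normal(scale=np.sqrt(eta), size=w.shape)

# Toy Gibbs target: energy U(w) = tau * L(w) with hypothetical surrogate risk
# L(w) = 0.5 * ||w||^2, so the stationary distribution is N(0, I / tau).
rng = np.random.default_rng(0)
tau, eta = 10.0, 0.01
w = np.zeros(2)
samples = []
for t in range(20000):
    w = sgld_step(w, tau * w, eta, rng)   # grad of U at w is tau * w here
    if t >= 5000:                         # discard burn-in
        samples.append(w.copy())
# The empirical variance should be close to 1/tau = 0.1, up to a
# discretization bias of order eta.
print(np.var(np.asarray(samples)))
```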
More work is clearly needed to scale our study to full-fledged deep learning benchmarks. (See Appendix E for details on how we compute the KL term, and the challenges there due to Gibbs posteriors $Q$ not being exactly samplable.)

More concretely, we perform an empirical evaluation using SGLD to approximate simulating from Gibbs distributions. Corollary 5.7 provides some justification by showing that a slight weakening of the PAC-Bayes generalization bound is valid provided that SGLD eventually controls the 2-Wasserstein distance to stationarity. However, because we cannot measure our convergence in practice, it is an empirical question as to whether our samples are accurate enough.

Violated bounds would be an obvious sign of trouble. We expect the bound on the classification error not to go below the true error as estimated on the held-out test set (with high probability). We perform an experiment on MNIST (and CIFAR10, with the same conclusion, so we have not included it) using true and random labels, and find that no bounds are violated. The results suggest that it may be useful to empirically study bounds for Gibbs classifiers using SGLD.

Our main focus is a synthetic experiment comparing the bounds of Lever, Laviolette, and Shawe-Taylor (2013) to our new bounds based on privacy. The main finding here is that, as expected, the bounds of Lever, Laviolette, and Shawe-Taylor must explode when the Gibbs classifier begins to overfit random labels, whereas our bounds, on true labels, continue to track the training error and bound the test error.

6.1 Setup

Our focus is on classification by neural networks into $K$ classes. Thus $Z = X \times [K]$, and we use neural networks that output probability vectors over these $K$ classes. Given weights $w \in \mathbb{R}^p$ and input $x \in X$, the probability vector output by the network is $p(w,x) \in [0,1]^K$.
Networks are trained by minimizing the cross-entropy loss: $\ell(w,(x,y)) = g(p(w,x), y)$, where $g((p_1,\dots,p_K), y) = -\ln p_y$. Note that the cross-entropy loss is merely bounded below. We report results in terms of the $\{0,1\}$-valued classification error: $\ell(w,(x,y)) = 0$ if and only if $y$ is the largest coordinate of $p(w,x)$.

We refer to elements of $\mathbb{R}^p$ and $\mathcal{M}_1(\mathbb{R}^p)$ as classifiers and randomized classifiers, respectively, and to the (empirical) 0–1 risk as the (empirical) error. We train two different architectures using SGLD on MNIST and a synthetic dataset, SYNTH. The experimental setup is explained in Appendix F.

One-stage training procedure. We run SGLD for $T$ training epochs with a fixed value of the parameter $\tau$. We observe that convergence appears to occur within 10 epochs, but we use a much larger number of training epochs to potentially expose nonconvergence behavior. The value of the inverse temperature $\tau$ is fixed during the whole training procedure.

Two-stage training procedure. In order to evaluate our private PAC-Bayes bound (Theorem 4.2), we perform a two-stage training procedure:

• Stage One. We run SGLD for $T_1$ epochs with inverse temperature $\tau_1$, minimizing the standard cross-entropy objective. Let $w_0$ denote the neural network weights after stage one.

Figure 1: Results for a fully connected neural network trained on the MNIST dataset with SGLD and a fixed value of $\tau$. We vary $\tau$ on the x-axis. The y-axis shows the average 0–1 loss. We plot the estimated generalization gap, which is the difference between the training and test errors. The left plot shows the results for the true-label dataset. We observe that the training error converges to zero as $\tau$ increases. Further, while the generalization error increases for intermediate values of $\tau$ ($10^4$ to $10^6$), it starts dropping again as $\tau$ increases. We see that the Lever bound fails to capture this behaviour due to its monotonic increase with $\tau$.
The right-hand plot shows the results for a classifier trained on a random labelling of MNIST images. The true error is around 0.9. For small values of τ (under 10^3) the network fails to learn and the training error stays at around 0.9. When τ exceeds the number of training points, the network starts to overfit heavily. The sharp increase of the generalization gap is predicted by the Lever bound.

• Transition. We restart the learning rate schedule and continue SGLD for T1 epochs while linearly annealing the inverse temperature between τ1 and τ2, i.e., τ_t = ((t - T1) τ2 + (2 T1 - t) τ1) / T1, where t is the current epoch number. The objective at w is the cross entropy loss for w plus a weight-decay term (γ/2) ||w - w0||^2_2.

• Stage Two. We continue SGLD for T2 - T1 epochs with inverse temperature τ2. The objective is the same as in the transition stage.

During the first stage, the k-step transitions of SGLD converge weakly towards a Gibbs distribution with a uniform base measure, producing a random vector w0 ∈ R^p. The private data-dependent prior P_w0 is the Gaussian distribution centred at w0 with diagonal covariance (1/γ) I_p. During the second stage, SGLD converges to the Gibbs posterior with the Gaussian base measure P_w0, i.e., Q_τ2 ∝ exp(-τ2 L̂_S) P_w0.

Bound calculation  Our experiments evaluate different values of the inverse temperature τ. We evaluate Lever bounds for the randomized classifier Q_τ obtained by the one-stage training procedure, with T = 1000. We do so on both the MNIST and SYNTH datasets.
We also evaluate our private PAC-Bayes bound (Theorem 4.2) for the randomized classifier Q_τ2 and the private data-dependent prior P_w0, where the privacy parameter depends on τ1. The bound depends on the value of KL(Q_τ2 || P_w0). The challenges of estimating this term are described in Appendix E.
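The two-stage procedure described above can be sketched in a few lines. The following is a minimal numpy illustration of the SGLD update targeting a Gibbs density proportional to exp(-τ L(w)), run on a toy quadratic objective; the step size, epoch counts, and loss here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def sgld_step(w, grad, tau, lr, rng):
    # SGLD update targeting the Gibbs density proportional to exp(-tau * L(w)):
    # gradient drift scaled by the inverse temperature tau, plus Gaussian noise.
    return w - 0.5 * lr * tau * grad(w) + np.sqrt(lr) * rng.normal(size=w.shape)

def two_stage_sgld(grad, w_init, tau1, tau2, gamma, T1, T2, lr, rng):
    w = np.asarray(w_init, dtype=float).copy()
    # Stage one: SGLD at inverse temperature tau1 on the plain objective.
    for _ in range(T1):
        w = sgld_step(w, grad, tau1, lr, rng)
    w0 = w.copy()  # centre of the Gaussian prior P_{w0}
    # Augmented objective: original loss plus the weight-decay term
    # (gamma / 2) * ||w - w0||^2, whose gradient is gamma * (w - w0).
    grad_aug = lambda v: grad(v) + gamma * (v - w0)
    # Transition: T1 more steps, annealing tau linearly from tau1 to tau2.
    for t in range(1, T1 + 1):
        tau_t = (t * tau2 + (T1 - t) * tau1) / T1
        w = sgld_step(w, grad_aug, tau_t, lr, rng)
    # Stage two: T2 - T1 steps at inverse temperature tau2.
    for _ in range(T2 - T1):
        w = sgld_step(w, grad_aug, tau2, lr, rng)
    return w0, w

# Toy objective L(w) = 0.5 * ||w - c||^2, so grad L(w) = w - c.
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0])
w0, w = two_stage_sgld(lambda v: v - c, np.zeros(2),
                       tau1=10.0, tau2=1e4, gamma=0.01,
                       T1=200, T2=2000, lr=1e-4, rng=rng)
```

With a large τ2, the stage-two samples concentrate near the minimizer of the augmented objective, which is what makes the second-stage classifier accurate while w0 (chosen at small τ1) keeps the prior private.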
We only evaluate the differentially private PAC-Bayes bounds on the small neural network and the SYNTH dataset.
The parameter settings for the SYNTH experiments are: T1 = 100, T2 = 1000, γ = 2; for MNIST: T1 = 500, T2 = 1000, γ = 5. When evaluating Lever bounds with the one-stage learning procedure for either dataset, T = 1000.

6.2 Results
Results are presented in Figs. 1 and 2. We never observe a violation of the PAC-Bayes bounds for Gibbs distributions. This suggests that either our assumption that SGLD has nearly converged is accurate enough, or the bounds are sufficiently loose that any effect from nonconvergence was masked.
Our MNIST experiments highlight that the Lever bounds upper bound the risk for every possible data distribution, including the random label distribution. In the random label experiment (Fig. 1, right plot), when τ gets close to the number of training samples, the generalization error starts increasing steeply. This phase transition is captured by the Lever bound. In the true label experiment (left

Figure 2: Results for a small fully connected neural network trained on a synthetically generated dataset, SYNTH, consisting of 50 training examples. The x-axis shows the τ value, and the y-axis the average 0–1 loss. To generate the top plots, we train the network with one-stage SGLD. The top-left plot corresponds to the true label dataset, and the top-right to the random label dataset. As in the MNIST experiments, we do not witness any violation of the Lever bounds. Once again, we notice that the Lever bound gets very loose for larger values of τ in the true label case. The bottom plot shows the results for the two-stage SGLD. In this case the x-axis plots the τ value used in the second-stage optimization. The first stage used τ1 = 1. The network is trained on true labels.
We see that the differentially private PAC-Bayes bound yields a much tighter estimate of the generalization gap for larger values of τ than the Lever bound (top left). When τ becomes very large relative to the amount of training data, it becomes more difficult to sample from the Gibbs posterior. This results in a looser upper bound on the KL divergence between the prior and posterior.

plot), the generalization error does not rise with τ. Indeed, it continues to decrease, and so the Lever bound quickly becomes vacuous as we increase τ. The Lever bound cannot capture this behavior because it must simultaneously bound the generalization error under random labels.
On the SYNTH dataset, we see the same phase transition under random labels, and so the Lever bounds remain vacuous after this point. In contrast, we see that our private PAC-Bayes bounds can track the error beyond the phase transition that occurs under random labels. (See Fig. 2.) At high values of τ, our KL upper bound becomes very loose.

Private versus Lever PAC-Bayes bound  While the Lever PAC-Bayes bound fails to explain generalization for high τ values, our private PAC-Bayes bound may remain nonvacuous. This is because it retains the KL term, which is sensitive to the data distribution via Q, and thus can be much lower than the upper bound on the KL in Lever, Laviolette, and Shawe-Taylor (2013) for datasets with small true Bayes risk. Two-stage optimization, inspired by the DP PAC-Bayes bound, allows us to obtain more accurate classifiers by setting a higher inverse temperature parameter, τ2, at the second stage.
We do not plot the DP PAC-Bayes bound for the MNIST experiments due to the computational challenges of approximating the KL term for a high-dimensional parameter space, as discussed in Appendix E. We evaluate our private PAC-Bayes bound on the MNIST dataset only for the combination of τ1 = 10^3 and τ2 ∈ {3×10^3, 3×10^4, 10^5, 3×10^5}.
The values are chosen such that τ1 incurs only a small penalty for using the data to learn the prior, and τ2 is chosen such that at τ = τ2 Lever's bound is vacuous (as seen in Fig. 1). We use 10^5 samples from the DP Gaussian prior learnt in stage one to approximate the ln Z term that appears in the KL, as defined in Eq. (E.5).
The results are presented in the table below. While the DP PAC-Bayes bound is very loose, it is still smaller than Lever's bound for high values of the inverse temperature.
Note that for smaller values of τ2, we can use Lever's upper bound on the KL term instead of performing a Monte Carlo approximation. Since τ1 is small and adds only a small penalty (∼1%), the DP PAC-Bayes bound is equal to Lever's bound plus a differential privacy penalty (∼1%).

τ2                                    3×10^3   3×10^4   10^5   3×10^5
Test                                  0.12     0.07     0.06   0.04
DP PAC-Bayes bound on test            0.21     0.35     0.65   1
Lever PAC-Bayes with τ = τ2 on test   0.26     1        1      1

Acknowledgments
The authors would like to thank Olivier Catoni, Pascal Germain, Mufan Li, David McAllester, Alexander Rakhlin, and John Shawe-Taylor for helpful discussions. This research was carried out in part while the authors were visiting the Simons Institute for the Theory of Computing at UC Berkeley. GKD was additionally supported by an EPSRC studentship. DMR was additionally supported by an NSERC Discovery Grant and an Ontario Early Researcher Award.

References
P. Alquier and B. Guedj (2018). “Simpler PAC-Bayesian Bounds for Hostile Data”. Mach. Learn. 107.5, pp. 887–902. DOI: 10.1007/s10994-017-5690-0.
P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017). “Spectrally-normalized margin bounds for neural networks”. Advances in Neural Info. Proc. Syst. (NIPS), pp. 6241–6250.
R. Bassily, A. Smith, and A. Thakurta (2014).
Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. arXiv: 1405.7085v2 [cs.LG].
R. Bassily, K. Nissim, A. Smith, T. Steinke, U. Stemmer, and J. Ullman (2016). “Algorithmic stability for adaptive data analysis”. Proc. Symp. Theory of Comput. (STOC), pp. 1046–1059.
L. Bégin, P. Germain, F. Laviolette, and J.-F. Roy (2016). “PAC-Bayesian bounds based on the Rényi divergence”. Proc. Artificial Intelligence and Statistics (AISTATS), pp. 435–444.
O. Bousquet and A. Elisseeff (2002). “Stability and generalization”. Journal of Machine Learning Research 2.Mar, pp. 499–526.
O. Catoni (2007). PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Lecture Notes–Monograph Series. Institute of Mathematical Statistics. DOI: 10.1214/074921707000000391. arXiv: 0712.0248 [stat.ML].
C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. I. Rubinstein (2014). “Robust and private Bayesian inference”. Int. Conf. Algorithmic Learning Theory (ALT), pp. 291–305.
C. Dwork (2006). “Differential Privacy”. Int. Colloq. Automata, Languages and Programming (ICALP), pp. 1–12. DOI: 10.1007/11787006_1.
– (2008). “Differential privacy: A survey of results”. International Conference on Theory and Applications of Models of Computation. Springer, pp. 1–19.
C. Dwork, A. Roth, et al. (2014). “The algorithmic foundations of differential privacy”. Foundations and Trends in Theoretical Computer Science 9.3–4, pp. 211–407.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth (2015a). “Generalization in adaptive data analysis and holdout reuse”. Advances in Neural Info. Proc. Syst., pp. 2350–2358.
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth (2015b).
\u201cPreserving\nstatistical validity in adaptive data analysis\u201d. Proc. Symp. Theory of Comput. (STOC), pp. 117\u2013\n126.\n\nG. K. Dziugaite and D. M. Roy (2017). \u201cComputing Nonvacuous Generalization Bounds for Deep\n(Stochastic) Neural Networks with Many More Parameters than Training Data\u201d. Proc. 33rd Conf.\nUncertainty in Arti\ufb01cial Intelligence (UAI). arXiv: 1703.11008.\n\nM. A. Erdogdu, L. Mackey, and O. Shamir (2018). Global Non-convex Optimization with Dis-\n\ncretized Diffusions. arXiv: 1810.12361.\n\nP. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien (2016). \u201cPAC-Bayesian Theory Meets\n\nBayesian Inference\u201d. Advances in Neural Info. Proc. Syst. Pp. 1884\u20131892.\n\nP. D. Gr\u00fcnwald and N. A. Mehta (2016). \u201cFast Rates for General Unbounded Loss Functions: from\n\nERM to Generalized Bayes\u201d. arXiv: 1605.00252.\n\n\u2013 (2017). A Tight Excess Risk Bound via a Uni\ufb01ed PAC-Bayesian-Rademacher-Shtarkov-MDL\n\nComplexity. arXiv: 1710.07732.\n\n10\n\n\fD. Kifer, A. Smith, and A. Thakurta (2012). \u201cPrivate convex empirical risk minimization and high-\n\ndimensional regression\u201d. Journal of Machine Learning Research 1.41, pp. 1\u201340.\n\nJ. Langford (2002). \u201cQuantitatively tight sample complexity bounds\u201d. PhD thesis. Carnegie Mellon\n\nUniversity.\n\nJ. Langford and M. Seeger (2001). Bounds for Averaging Classi\ufb01ers. Tech. rep. CMU-CS-01-102.\n\nCarnegie Mellon University.\n\nG. Lever, F. Laviolette, and J. Shawe-Taylor (2013). \u201cTighter PAC-Bayes bounds through\ndistribution-dependent priors\u201d. Theoretical Computer Science 473, pp. 4\u201328. DOI: 10 . 1016 /\nj.tcs.2012.10.013.\n\nB. London (2017). \u201cA PAC-Bayesian Analysis of Randomized Learning with Application to\n\nStochastic Gradient Descent\u201d. Advances in Neural Info. Proc. Syst. (NIPS), pp. 2931\u20132940.\n\nJ. Mattingly, A. Stuart, and D. Higham (2002). 
\u201cErgodicity for SDEs and approximations: locally\nLipschitz vector \ufb01elds and degenerate noise\u201d. Stochastic Processes and their Applications 101.2,\npp. 185 \u2013232. DOI: https://doi.org/10.1016/S0304-4149(02)00150-3.\n\nA. Maurer (2004). A note on the PAC-Bayesian theorem. arXiv: cs/0411099 [cs.LG].\nD. A. McAllester (1999). \u201cPAC-Bayesian Model Averaging\u201d. Proc. Conf. Learning Theory (COLT),\n\npp. 164\u2013170.\n\nF. McSherry and K. Talwar (2007). \u201cMechanism Design via Differential Privacy\u201d. Proc. Symp.\n\nFound. Comp. Sci. (FOCS), pp. 94\u2013103.\n\nK. Minami, H. Arai, I. Sato, and H. Nakagawa (2016). \u201cDifferential Privacy without Sensitivity\u201d.\n\nAdvances in Neural Info. Proc. Syst. (NIPS), pp. 956\u2013964.\n\nD. J. Mir (2013). \u201cDifferential privacy: an exploration of the privacy-utility landscape\u201d. PhD thesis.\n\nRutgers University.\n\nB. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017a). \u201cA PAC-Bayesian approach\nto spectrally-normalized margin bounds for neural networks\u201d. Proc. Int. Conf. on Learning Rep-\nresentation (ICLR). arXiv: 1707.09564.\n\nB. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017b). \u201cExploring generalization in\n\ndeep learning\u201d. Advances in Neural Info. Proc. Syst. (NIPS), pp. 5949\u20135958.\n\nL. Oneto, S. Ridella, and D. Anguita (2017). \u201cDifferential privacy and generalization: Sharper\nbounds with applications\u201d. Pattern Recognition Letters 89, pp. 31\u201338. DOI: 10.1016/j.patrec.\n2017.02.006.\n\nE. Parrado-Hern\u00e1ndez, A. Ambroladze, J. Shawe-Taylor, and S. Sun (2012). \u201cPAC-Bayes bounds\n\nwith data dependent priors\u201d. Journal of Machine Learning Research 13.Dec, pp. 3507\u20133531.\n\nM. Raginsky, A. Rakhlin, and M. Telgarsky (2017). \u201cNon-convex learning via Stochastic Gradient\nLangevin Dynamics: a nonasymptotic analysis\u201d. Proc. Conf. on Learning Theory (COLT). arXiv:\n1702.03849.\n\nO. 
Rivasplata, C. Szepesvari, J. S. Shawe-Taylor, E. Parrado-Hernandez, and S. Sun (2018). “PAC-Bayes bounds for stable algorithms with instance-dependent priors”. Advances in Neural Info. Proc. Syst. 31, pp. 9234–9244.
J. Shawe-Taylor and R. C. Williamson (1997). “A PAC analysis of a Bayesian estimator”. Proc. Conference on Learning Theory (COLT). ACM, pp. 2–9.
S. L. Smith and Q. V. Le (2018). “A Bayesian perspective on generalization and stochastic gradient descent”. Proc. Int. Conf. on Learning Representations (ICLR).
Y. W. Teh, A. H. Thiery, and S. J. Vollmer (2016). “Consistency and fluctuations for stochastic gradient Langevin dynamics”. Journal of Machine Learning Research 17, pp. 1–33.
N. Thiemann, C. Igel, O. Wintenberger, and Y. Seldin (2017). “A Strongly Quasiconvex PAC-Bayesian Bound”. Int. Conf. on Algorithmic Learning Theory (ALT), pp. 466–492. arXiv: 1608.05610.
Y.-X. Wang, S. E. Fienberg, and A. J. Smola (2015). “Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo”. Proc. Int. Conf. Machine Learning (ICML), pp. 2493–2502.
M. Welling and Y. W. Teh (2011). “Bayesian learning via stochastic gradient Langevin dynamics”. Proc. of the 28th Int. Conf. on Machine Learning (ICML), pp. 681–688.