{"title": "Limits of Private Learning with Access to Public Data", "book": "Advances in Neural Information Processing Systems", "page_first": 10342, "page_last": 10352, "abstract": "We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where all examples are private) and classical learning (where all examples are public). \r\n\r\nWe study the limits of learning in this setting in terms of private and public sample complexities. We show that any hypothesis class of VC-dimension $d$ can be agnostically learned up to an excess error of $\\alpha$ using only (roughly) $d/\\alpha$ public examples and $d/\\alpha^2$ private labeled examples. This result holds even when the public examples are unlabeled. This gives a quadratic improvement over the standard $d/\\alpha^2$ upper bound on the public sample complexity (where private examples can be ignored altogether if the public examples are labeled). Furthermore, we give a nearly matching lower bound, which we prove via a generic reduction from this setting to the one of private learning without public data.", "full_text": "Limits of Private Learning with Access to Public Data\n\nNoga Alon\n\nDepartment of Mathematics\n\nPrinceton University\n\nnalon@math.princeton.edu\n\nRaef Bassily\u2217\n\nDepartment of Computer Science & Engineering\n\nThe Ohio State University\nbassily.1@osu.edu\n\nShay Moran\nGoogle AI\nPrinceton\n\nshaymoran1@gmail.com\n\nAbstract\n\nWe consider learning problems where the training set consists of two types of\nexamples: private and public. The goal is to design a learning algorithm that\nsatis\ufb01es differential privacy only with respect to the private examples. 
This setting\ninterpolates between private learning (where all examples are private) and classical\nlearning (where all examples are public).\nWe study the limits of learning in this setting in terms of private and public sam-\nple complexities. We show that any hypothesis class of VC-dimension d can be\nagnostically learned up to an excess error of \u03b1 using only (roughly) d/\u03b1 public\nexamples and d/\u03b12 private labeled examples. This result holds even when the\npublic examples are unlabeled. This gives a quadratic improvement over the stan-\ndard d/\u03b12 upper bound on the public sample complexity (where private examples\ncan be ignored altogether if the public examples are labeled). Furthermore, we give\na nearly matching lower bound, which we prove via a generic reduction from this\nsetting to the one of private learning without public data.\n\n1\n\nIntroduction\n\nIn this work, we study a relaxed notion of differentially private (DP) supervised learning which was\nintroduced by Beimel et al. in [BNS13], where it was coined semi-private learning. In this setting, the\nlearning algorithm takes as input a training set that is comprised of two parts: (i) a private sample that\ncontains personal and sensitive information, and (ii) a \u201cpublic\u201d sample that poses no privacy concerns.\nWe assume that the private sample is always labeled, while the public sample can be either labeled or\nunlabeled. The algorithm is required to satisfy DP only with respect to the private sample. The goal\nis to design algorithms that can exploit as little public data as possible to achieve non-trivial gains in\naccuracy (or, equivalently savings in sample complexity) over standard DP learning algorithms, while\nstill providing strong privacy guarantees for the private dataset. Similar settings have been studied\nbefore in literature (see \u201cRelated Work\u201d section below).\nThere are several motivations for studying this problem. 
First, in practical scenarios, it is often not hard to collect a reasonable amount of public data from users or organizations. For example, in the language of consumer privacy, there is a considerable amount of data collected from the so-called “opt-in” users, who voluntarily offer or sell their data to companies or organizations. Such data is deemed by its original owner to pose no threat to personal privacy. There are also a variety of other sources of public data that can be harnessed. Moreover, in many scenarios, it is often much easier to collect unlabeled data than labeled data.

∗Part of this work was done while visiting the Simons Institute for the Theory of Computing.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Another motivation emerges from several pessimistic results in DP learning that either limit or eliminate the possibility of differentially private learning, even for elementary problems such as one-dimensional thresholds, which are trivially learnable without privacy constraints [BNSV15, ALMM19]. It is therefore natural to explore whether a small amount of public data circumvents these impossibility results.
A third motivation arises from the following observation: consider a learning problem in which the marginal distribution DX over the domain X is completely known to the algorithm, but the target concept c : X → {0, 1} is unknown. One can show that in this setting every VC class can be learned privately with (roughly) the same sample complexity as in the standard, non-private, case. The other extreme is the standard PAC setting in which both DX and c are unknown to the algorithm. As mentioned earlier, in this case even very simple classes such as one-dimensional thresholds cannot be learned privately. In the setting considered in this work, the distribution DX is unknown but the learner has access to some public examples from it. 
This naturally interpolates between these two extremes: the case where DX is unknown corresponds to having no public examples, and the case where DX is known corresponds to having an unbounded amount of public examples. It is therefore natural to study the intermediate behaviour as the number of public examples grows from 0 to ∞. The same question can also be asked in the “easier” case where the public examples are labeled.
We will generally refer to the setting described above as semi-private learning, and to algorithms in that setting as semi-private learners. (See Section 2 for precise definitions.) Following previous works in private learning, we consider two types of semi-private learners: those that satisfy the notion of pure DP (the stronger notion of DP), as well as those that satisfy approximate DP. We will call the former type pure semi-private learners, and call the latter approximate semi-private learners.

Main Results

In this work we concentrate on the sample complexity of semi-private learners in the agnostic setting. We especially focus on the minimal number of public examples with which it is possible to learn every VC class.
1. Upper bound: Every hypothesis class H can be learned up to excess error α by a pure semi-private algorithm whose private sample complexity is (roughly) VC(H)/α² and public sample complexity is (roughly) VC(H)/α. Moreover, the input public sample can be unlabeled. Recall that VC(H)/α² examples are necessary to learn in the agnostic setting (even without privacy constraints); therefore, this result establishes a quadratic saving.
2. Lower bound: Assume H has an infinite Littlestone dimension². 
Then, any approximate semi-private learner for H must have public sample complexity Ω(1/α), where α is the excess error. This holds even when the public sample is labeled.
One example of a class with an infinite Littlestone dimension is the class of thresholds over R. This class has VC dimension 1, and therefore demonstrates that the upper and lower bounds above nearly match.
3. Dichotomy for pure semi-private learning: Every hypothesis class H satisfies exactly one of the following:
(i) H is learnable by a pure DP algorithm, and therefore can be semi-privately learned without any public examples.
(ii) Any pure semi-private learner for H must have public sample complexity Ω(1/α), where α is the excess error.

Techniques

Upper bound: The idea of the construction for the upper bound is to use the (unlabeled) public data to construct a finite class H′ that forms a “good approximation” of the original class H, then reduce the problem to DP learning of a finite class. Such an approximation is captured via the notion of α-covering (Definition 2.7).

²The Littlestone dimension is a combinatorial parameter that arises in online learning [Lit87, BPS09].

2

By standard uniform-convergence arguments, it is not hard to see that (roughly) VC(H)/α² public examples suffice to construct such an approximation. We show that the number of public examples can be reduced to only about VC(H)/α, even in the agnostic setting. Our construction is essentially the same as a construction due to Beimel et al. 
[BNS13], but our proof technique is different (see the “Related Work” section for a more detailed comparison).
Lower bounds: The lower bounds boil down to a public-data-reduction lemma, which shows that if we are given a semi-private learner whose public sample complexity is ≪ 1/α, we can transform it into a fully private learner (which uses no public examples) whose excess error is a small constant (say 1/100). Stated contrapositively, this implies that if a class cannot be privately learned up to an excess error of 1/100, then it cannot be semi-privately learned with ≪ 1/α public examples. This allows us to exploit known lower bounds for private learning to derive a lower bound on the public sample complexity.
Related Work: Our algorithm for the upper bound is essentially the same as a construction due to Beimel et al. [BNS13]. Although [BNS13] focuses on the realizable case of semi-private learning, their analysis can be extended to the agnostic case to yield a similar upper bound to the one we present here. However, the proof technique we give here is different from theirs. In particular, our proof relies on and emphasizes the use of α-coverings, which provides a direct argument for both the realizable and agnostic cases. We believe the notion of α-covering can be a useful tool in the analysis of other differentially private algorithms even outside the learning context.
There are also several other works that considered similar problems. A similar notion known as “label-private learning” was considered in [CH11] (see also references therein) and in [BNS13]. In this notion, only the labels in the training set are considered private. This notion is weaker than semi-private learning. In particular, any semi-private learner can be easily transformed into a label-private learner. Another line of work considers the problem of private knowledge transfer [HCB16], [PAE+17], [PSM+18], and [BTT18]. 
In this problem, first a DP classification algorithm with an input private sample is used to provide labels for an unlabeled public dataset. Then, the result is used to train a non-private learner. [BTT18] gives sample complexity bounds in the setting where the DP algorithm is required to label the public data in an online fashion. Their bounds are thus not comparable to ours.

2 Preliminaries

Let X denote an arbitrary domain, let Z = X × {0, 1} denote the examples domain, and let Z* = ∪_{n=1}^{∞} Z^n. A function h : X → {0, 1} is called a concept/hypothesis, and a set of hypotheses H ⊆ {0, 1}^X is called a concept/hypothesis class. The VC dimension of H is denoted by VC(H). We use D to denote a distribution over Z, and DX to denote the marginal distribution over X. We use S ∼ D^n to denote a sample/dataset S = {(x_1, y_1), . . . , (x_n, y_n)} of n i.i.d. draws from D.
Expected error: The expected/population error of a hypothesis h : X → {0, 1} with respect to a distribution D over Z is defined by err(h; D) ≜ E_{(x,y)∼D}[1(h(x) ≠ y)]. A distribution D is called realizable by H if there exists h* ∈ H such that err(h*; D) = 0. In this case, the data distribution D is described by a distribution DX over X and a hypothesis h* ∈ H. For realizable distributions, the expected error of a hypothesis h will be denoted by err(h; (DX, h*)) ≜ E_{x∼DX}[1(h(x) ≠ h*(x))].
Empirical error: The empirical error of a hypothesis h : X → {0, 1} with respect to a labeled dataset S = {(x_1, y_1), . . . , (x_n, y_n)} will be denoted by êrr(h; S) ≜ (1/n) Σ_{i=1}^{n} 1(h(x_i) ≠ y_i).
Expected disagreement: The expected disagreement between a pair of hypotheses h_1 and h_2 with respect to a distribution DX over X is defined as dis(h_1, h_2; DX) ≜ E_{x∼DX}[1(h_1(x) ≠ h_2(x))].

3

Empirical disagreement: The empirical disagreement between a pair of hypotheses h_1 and h_2 w.r.t. an unlabeled dataset T = {x_1, . . . , x_n} is defined as d̂is(h_1, h_2; T) ≜ (1/n) Σ_{i=1}^{n} 1(h_1(x_i) ≠ h_2(x_i)).
Definition 2.1 (Differential Privacy [DMNS06, DKM+06]). Let ε, δ > 0. A (randomized) algorithm A with input domain Z* and output range R is called (ε, δ)-differentially private if for all pairs of datasets S, S′ ∈ Z* that differ in exactly one data point, and every measurable O ⊆ R, we have

Pr(A(S) ∈ O) ≤ e^ε · Pr(A(S′) ∈ O) + δ,

where the probability is over the random coins of A. When δ = 0, we say that A is pure ε-differentially private.

We study learning algorithms that take as input two datasets: a private dataset Spriv and a public dataset Spub, and output a hypothesis h : X → {0, 1}. The private set Spriv ∈ (X × {0, 1})* is labeled. We distinguish between two settings of the learning problem depending on whether the public dataset is labeled or not. To avoid confusion, we denote an unlabeled public set as Tpub ∈ X*, and use Spub to denote a labeled public set. We formally define learners in these two settings.
Definition 2.2 ((α, β, ε, δ)-Semi-Private Learner). Let H ⊂ {0, 1}^X be a hypothesis class. 
A randomized algorithm A is an (α, β, ε, δ)-SP (semi-private) learner for H with private sample size npriv and public sample size npub if the following conditions hold:

1. For every distribution D over Z = X × {0, 1}, given datasets Spriv ∼ D^npriv and Spub ∼ D^npub as inputs to A, with probability at least 1 − β (over the choice of Spriv, Spub, and the random coins of A), A outputs a hypothesis A(Spriv, Spub) = ĥ ∈ {0, 1}^X satisfying

err(ĥ; D) ≤ inf_{h∈H} err(h; D) + α.

2. For all S ∈ Z^npub, A(·, S) is (ε, δ)-differentially private.

When the second condition is satisfied with δ = 0 (i.e., pure differential privacy), we refer to A as an (α, β, ε)-SP learner (i.e., a pure semi-private learner).
As a special case of the above definition, we say that an algorithm A is an (α, β, ε, δ)-semi-private learner for a class H under the realizability assumption if it satisfies the first condition in the definition only with respect to all distributions that are realizable by H.
Definition 2.3 (Semi-Privately Learnable Class). 
We say that a class H is semi-privately learnable if there are functions npriv : (0, 1)² → N and npub : (0, 1)² → N, where npub(α, ·) = o(1/α²), and there is an algorithm A such that for every α, β ∈ (0, 1), when A is given private and public samples of sizes npriv = npriv(α, β) and npub = npub(α, β), it (α, β, 0.1, negl(npriv))-semi-privately learns H.
Note that in the definition above, the privacy parameters are set as follows: ε = 0.1, and δ is a negligible function in the private sample size (and δ = 0 for a pure semi-private learner).
The restriction npub = o(1/α²) in the above definition is because taking Ω(VC(H)/α²) public examples suffices for learning the class without any private examples.
Definition 2.4 ((α, β, ε, δ)-Semi-Supervised Semi-Private Learner). The definition is analogous to Definition 2.2 except that the public sample is unlabeled. An algorithm that satisfies this definition is referred to as an (α, β, ε, δ)-SS-SP (semi-supervised semi-private) learner.

Private learning without public data: In the standard setting of (ε, δ)-differentially private learning, the learner has no access to public data. We note that this setting can be viewed as a special case of Definitions 2.2 and 2.4 by taking npub = 0. In such a case, we refer to the learner as an (α, β, ε, δ)-private learner. As before, when δ = 0, we call the learner a pure private learner. The notion of a privately learnable class H is defined analogously to Definition 2.3 with npub(α, β) = 0 for all α, β.
We will use the following lemma due to Beimel et al. [BNS15]:
Lemma 2.5 (Special case of Theorem 4.16 in [BNS15]). 
Any class H that is privately learnable under the realizability assumption is also privately learnable (i.e., in the agnostic setting).

4

The following fact follows from the private boosting technique due to [DRV10]:
Lemma 2.6 (follows from Theorem 6.1 of [DRV10] (the full version)). For any class H, under the realizability assumption, if there is a (0.1, 0.1, 0.1)-pure private learner for H, then H is privately learnable by a pure private algorithm.

We note that no analogous statement to the one in Lemma 2.6 is known for approximate private learners (see the full version [ABM19] for a discussion).
We will also use the following notion of coverings:

Definition 2.7 (α-cover for a hypothesis class). A family of hypotheses H̃ is said to form an α-cover for a hypothesis class H ⊆ {0, 1}^X with respect to a distribution DX over X if for every h ∈ H, there is h̃ ∈ H̃ such that dis(h, h̃; DX) ≤ α.

3 Upper Bound

In this section we show that every VC class H can be semi-privately learned in the agnostic case with only Õ(VC(H)/α) public examples:
Theorem 3.1 (Upper bound). Let H be a hypothesis class and let VC(H) = d. For any α, β ∈ (0, 1) and ε > 0, ASSPP is an (α, β, ε)-semi-supervised semi-private agnostic learner for H with private and public sample complexities:

npriv = O((d log(1/α) + log(1/β)) · max{1/(ε α), 1/α²}),
npub = O((d log(1/α) + log(1/β))/α).

Proof overview. The upper bound is based on a reduction to the fact that any finite hypothesis class H′ can be learned privately with sample complexity (roughly) O(log |H′|) via the exponential mechanism [KLN+08]. 
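As a concrete illustration (ours, not from the paper), the following Python sketch implements this finite-class step: each hypothesis in a finite class is scored by minus its empirical error, whose sensitivity with respect to one example is 1/n, so sampling with weight exp(ε·n·score/2) is ε-differentially private. The threshold "cover" and the toy data are purely illustrative.

```python
import math
import random

def exponential_mechanism(sample, hypotheses, epsilon):
    # Score each hypothesis by minus its empirical error on the labeled
    # sample; one changed example moves the score by at most 1/n, so
    # sampling with weight exp(epsilon * n * score / 2) is epsilon-DP.
    n = len(sample)
    errs = [sum(h(x) != y for (x, y) in sample) / n for h in hypotheses]
    weights = [math.exp(-epsilon * n * e / 2) for e in errs]
    return random.choices(hypotheses, weights=weights, k=1)[0]

# Toy usage: a small finite "cover" of threshold functions over [0, 1].
cover = [lambda x, t=t: int(x >= t) for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
data = [(i / 20, int(i / 20 >= 0.5)) for i in range(21)]
chosen = exponential_mechanism(data, cover, epsilon=5.0)
```

Low-error hypotheses are exponentially more likely to be selected, which is what drives the O(log |H′|) private sample complexity for a finite class.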
In more detail, we use the (unlabeled) public data to construct a finite class H′ that forms a “good enough approximation” of the (possibly infinite) original class H (see Algorithm 1). The relevant notion of approximation is captured by the definition of α-cover (Definition 2.7). Indeed, it suffices to output a hypothesis h′ ∈ H′ that “α-approximates” an optimal hypothesis h* ∈ H.
Thus, the crux of the proof boils down to the question: how many samples from DX are needed in order to construct an α-cover for H? It is not hard to see that (roughly) O(VC(H)/α²) examples suffice: indeed, these many examples suffice to approximate the distances dis(h′, h′′; DX) for every h′, h′′ ∈ H, which suffices to construct the α-cover. We show how to reduce the number of examples to only (roughly) O(VC(H)/α) examples (Lemma 3.3), which, by our lower bound, is nearly optimal.

Algorithm 1 ASSPP: Semi-Supervised Semi-Private Agnostic Learner
Input: Private labeled dataset Spriv = {(x_1, y_1), . . . , (x_npriv, y_npriv)} ∈ Z^npriv, a public unlabeled dataset Tpub = (x̃_1, · · · , x̃_npub) ∈ X^npub, a hypothesis class H ⊂ {0, 1}^X, and a privacy parameter ε > 0.
1: Let T̃ = {x̂_1, . . . , x̂_m̂} be the set of points x ∈ X appearing at least once in Tpub.
2: Let Π_H(T̃) = {(h(x̂_1), . . . , h(x̂_m̂)) : h ∈ H}.
3: Initialize H̃Tpub = ∅.
4: for each c = (c_1, . . . , c_m̂) ∈ Π_H(T̃): do
5: Add to H̃Tpub an arbitrary h ∈ H that satisfies h(x̂_j) = c_j for every j = 1, . . .
, m̂.
6: Use the exponential mechanism with inputs Spriv, H̃Tpub, ε and score function q(Spriv, h) ≜ −êrr(h; Spriv) to select hpriv ∈ H̃Tpub.
7: return hpriv.

The proof of Theorem 3.1 relies on the following lemmas. (The full proof of Theorem 3.1 can be found in the full version [ABM19].)

5

Lemma 3.2. For all Tpub ∈ X^npub, ASSPP(·, Tpub) is ε-differentially private.
The proof is straightforward and is deferred to the full version [ABM19].
Lemma 3.3 (α-cover for H). Let Tpub ∼ DX^npub, where npub = O((d log(1/α) + log(1/β))/α). Then, with probability at least 1 − β, the family H̃Tpub constructed in Step 5 of Algorithm 1 is an α-cover for H w.r.t. DX.
We now prove Lemma 3.3. (We include a more detailed version in the full version [ABM19].)
We need to show that with high probability, for every h ∈ H there exists h̃ ∈ H̃Tpub such that dis(h, h̃; DX) ≤ α. Let T̃ = {x̂_1, . . . , x̂_m̂} be as defined in ASSPP (Algorithm 1), and define h(T̃) = (h(x̂_1), . . . , h(x̂_m̂)). By construction, there must exist h̃ ∈ H̃Tpub such that h̃(x̂_j) = h(x̂_j) for all j ∈ [m̂]; that is, d̂is(h̃, h; Tpub) = 0. For Tpub ∼ DX^npub, define the event

Bad = {∃ h_1, h_2 ∈ H : dis(h_1, h_2; DX) > α and d̂is(h_1, h_2; Tpub) = 0}.

We will show that

P_{Tpub∼DX^npub}[Bad] ≤ 2 (2e npub/d)^{2d} e^{−α npub/4}.   (1)

Before we do so, we first show that (1) suffices to prove the lemma. Indeed, if dis(h̃, h; DX) > α for some h ∈ H, then the event Bad occurs. 
Hence,

P_{Tpub∼DX^npub}[H̃Tpub is not an α-cover] ≤ 2 (2e npub/d)^{2d} e^{−α npub/4}.

Now, via standard manipulation, this bound is at most β when npub = O((d log(1/α) + log(1/β))/α), which yields the desired bound and finishes the proof.
Now, it is left to prove (1). To do so, we use a standard VC-based uniform convergence bound (a.k.a. α-net bound) on the class HΔ ≜ {h_1Δh_2 : h_1, h_2 ∈ H}, where h_1Δh_2 : X → {0, 1} is defined as

h_1Δh_2(x) ≜ 1(h_1(x) ≠ h_2(x))   ∀ x ∈ X.

Let G_HΔ denote the growth function of HΔ; i.e., for any m, G_HΔ(m) ≜ max_{V : |V|=m} |Π_HΔ(V)|, where Π_HΔ(V) is the set of all possible dichotomies that can be generated by HΔ on a set V of size m. Note that G_HΔ(m) ≤ (e m/d)^{2d}. This follows from the fact that for any set V of size m, we have |Π_HΔ(V)| ≤ |Π_H(V)|² since every dichotomy in Π_HΔ is determined by a pair of dichotomies in Π_H(V). Hence, G_HΔ(m) ≤ (G_H(m))² ≤ (e m/d)^{2d}, where the last inequality follows from Sauer’s Lemma [Sau72]. Now, by invoking a uniform convergence argument, we have

P_{Tpub∼DX^npub}[Bad] = P_{Tpub∼DX^npub}[∃ h ∈ HΔ : dis(h, h_0; DX) > α and d̂is(h, h_0; Tpub) = 0] ≤ 2 G_HΔ(2 npub) e^{−α npub/4} ≤ 2 (2e npub/d)^{2d} e^{−α npub/4},

where h_0 denotes the all-zero hypothesis. The first bound in the second line follows from the so-called double-sample argument used in virtually all VC-based uniform convergence bounds (e.g., [SSBD14]). 
This completes the proof of Lemma 3.3.

4 Lower Bound

In this section we establish that our upper bound on the public sample complexity is nearly tight.
Theorem 4.1 (Lower bound for classes of infinite Littlestone dimension). Let H be any class with an infinite Littlestone dimension (e.g., the class of thresholds over R). Then, any semi-private learner for H must have a public sample of size npub = Ω(1/α), where α is the excess error.

6

In the case of pure differential privacy we get a stronger statement, which manifests a dichotomy that applies to every class:
Theorem 4.2 (Pure private vs. pure semi-private learners). Every class H must satisfy exactly one of the following:

1. H is learnable by a pure private learner.
2. Any pure semi-private learner for H must have npub = Ω(1/α), where α is the excess error.

Proof overview. The crux of the argument is a public-data-reduction lemma (Lemma 4.4), which shows how one can reduce the number of public examples at the price of a proportional increase in the excess error. This lemma implies, for example, that if H can be learned up to an excess error of α with less than 1/(1000α) public examples, then it can also be privately learned without any public examples and with excess error < 1/10. Stated contrapositively, if H cannot be privately learned with excess error < 1/10, then it cannot be semi-privately learned with less than 1/(1000α) public examples. This yields a lower bound of Ω(1/α) on the public sample complexity for every class H which is not privately learnable with constant excess error.
One example of such a class is any class with infinite Littlestone dimension (e.g., the class of 1-dimensional thresholds over an infinite domain). This follows from the result in [ALMM19]:
Theorem 4.3 (Restatement of Corollary 2 in [ALMM19]). 
Let H be any class of infinite Littlestone dimension (e.g., the class of thresholds over an infinite domain X ⊆ R). For any n ∈ N, given a private sample of size n, there is no (1/16, 1/16, 0.1, 1/(100 n² log(n)))-private learner for H (even in the realizable case).

The aforementioned reduction we use for the lower bound holds even when the public sample is labeled, and it holds for both pure and approximate private/semi-private learners.
We now state and prove the reduction lemma outlined above.
Lemma 4.4 (Public data reduction lemma). Let 0 < α ≤ 1/100, ε > 0, δ ≥ 0. Suppose there is an (α, 1/18, ε, δ)-agnostic semi-private learner for a hypothesis class H with private sample size npriv and public sample size npub. Then, there is a (100 npub α, 1/16, ε, δ)-private learner that learns any distribution realizable by H with input sample size ⌈npriv/(10 npub)⌉.

Proof. Let A denote the assumed agnostic-case semi-private learner for H with input private sample of size npriv and input public sample of size npub. Using A, we construct a realizable-case private learner for H, which we denote by B. The description of B appears in Algorithm 2.
The following two claims about B suffice to prove the lemma.

Algorithm 2 Description of the private learner B:
Input: Private sample S̃ = (z̃_1, . . .
, z̃_ñ) of size ñ = ⌈npriv/(10 · npub)⌉.
1: Pick a fixed (dummy) distribution D0 over Z = X × {0, 1} where the label y ∈ {0, 1} is drawn uniformly at random from {0, 1}, independently from x ∈ X.
2: Set p = 1/(100 · npub).
3: Using S̃ and D0, construct samples Spriv, Spub using the procedures PrivSamp(S̃, D0, p, npriv) and PubSamp(D0, npub) given by Algorithms 3 and 4 below.
4: Return ĥ = A(Spriv, Spub).

Claim 4.5 (Privacy guarantee of B). B is (ε, δ)-differentially private.
The above claim easily follows since A is a semi-private learner, Spub does not contain any points from S̃, and each point in S̃ appears at most once in Spriv.
Claim 4.6 (Accuracy guarantee of B). Let D be any distribution over Z that is realizable by H. Suppose S̃ ∼ D^ñ. Then, except with probability at most 1/16 (over the choice of S̃ and the internal randomness of B), the output hypothesis ĥ satisfies: err(ĥ; D) ≤ 100 npub α.

7

Algorithm 3 Private Sample Generator PrivSamp:
Input: Sample S̃ = (z̃_1, . . . , z̃_ñ), distribution D0, parameter p, sample size npriv.
1: i := 1
2: while S̃ ≠ ∅ and i ≤ npriv: do
3: Sample b_i ∼ Ber(p) (independently for each i), where Ber(p) is the Bernoulli distribution with mean p.
4: if b_i = 1: then
5: Set zprv_i to be the next element in S̃, i.e., zprv_i = z̃_{j_i}, where j_i = Σ_{k=1}^{i} b_k.
6: Remove this element from S̃: S̃ ← S̃ \ {z̃_{j_i}}.
7: else
8: Set zprv_i = z0_i, where z0_i is a fresh independent example from the “dummy” distribution D0.
9: i ← i + 1
10: return Spriv = (zprv_1, . . . , zprv_npriv).

Algorithm 4 Public Sample Generator PubSamp:
Input: Distribution D0, sample size npub.
1: for i = 1, . . .
, npub : do
2: Set zpub_i = z0_i, where z0_i is a fresh independent example from D0.
3: return Spub = (zpub_1, . . . , zpub_npub).

Let D(p) denote the mixture distribution p · D + (1 − p) · D0 (recall the definition of p from Algorithm 2). To prove Claim 4.6, we first show that both Spriv and Spub can be “viewed” as being sampled from D(p). The claim will then follow since A learns H with respect to D(p).
First, note that since ñ = 10 · p · npriv, then by Chernoff’s bound, except with probability < 0.01, Algorithm 3 exits the WHILE loop with i = npriv. Thus, except with probability < 0.01, we have

|Spriv| = npriv, hence Spriv ∼ D(p)^npriv.   (2)

As for Spub, note that Spub = (z0_1, . . . , z0_npub) ∼ D0^npub. We will show that D0^npub is close in total variation to D(p)^npub. Let Ŝpub = (ẑ_1, . . . , ẑ_npub) be an i.i.d. sequence generated as follows: for each i ∈ [npub], ẑ_i = b_i v_i + (1 − b_i) z0_i, where (b_1, . . . , b_npub) ∼ (Ber(p))^npub and (v_1, . . . , v_npub) ∼ D^npub. It is clear that Ŝpub ∼ D(p)^npub. Moreover, observe that

P[Ŝpub = Spub] ≥ P[b_i = 0 ∀ i ∈ [npub]] = (1 − 1/(100 npub))^npub ≥ 0.99.

Note that P[Ŝpub ≠ Spub] is at most the probability measure attributed to the first component of the mixture distribution D(p) of Ŝpub (i.e., the component from D). Hence, it follows that the total variation between the distribution of Ŝpub (induced by the mixture D(p)) and the distribution of Spub (induced by D0) is at most 0.01. In particular, the probability of any event w.r.t. the distribution of Ŝpub is at most 0.01 far from the probability of the same event w.r.t. the distribution of Spub. 
Hence,

  P_{Spriv, Spub, A}[ err(A(Spriv, Spub); D(p)) − min_{h∈H} err(h; D(p)) > α ]
    − P_{Spriv, Ŝpub, A}[ err(A(Spriv, Ŝpub); D(p)) − min_{h∈H} err(h; D(p)) > α ] ≤ 0.01.   (3)

Now, from (2) and the premise that A is an agnostic semi-private learner, we have

  P_{Spriv, Ŝpub, A}[ err(A(Spriv, Ŝpub); D(p)) − min_{h∈H} err(h; D(p)) > α ] ≤ 1/17.

Hence, using (3), we conclude that except with probability < 1/16,

  err(A(Spriv, Spub); D(p)) − min_{h∈H} err(h; D(p)) ≤ α.   (4)

Note that for any h, err(h; D(p)) = p · err(h; D) + (1 − p) · err(h; D0) = p · err(h; D) + (1 − p)/2, where the last equality follows from the fact that the labels generated by D0 are completely noisy (uniformly random labels). Hence, we have argmin_{h∈H} err(h; D(p)) = argmin_{h∈H} err(h; D). That is, the optimal hypothesis with respect to the realizable distribution D is also optimal with respect to the mixture distribution D(p). Let h* ∈ H denote such a hypothesis. Note that err(h*; D) = 0 and err(h*; D(p)) = (1 − p)/2. These observations together with (4) imply that except with probability < 1/16, we have

  α ≥ p · err(A(Spriv, Spub); D).

Hence, err(B(S̃); D) = err(A(Spriv, Spub); D) ≤ 100 · npub · α. This completes the proof.

With Lemma 4.4, we are now ready to prove the main results for this section:

Proof of Theorem 4.1
Proof. Suppose A is a semi-private learner for H with sample complexities npriv, npub. In particular, given npriv(α, 1/18) private and npub(α, 1/18) public examples, A is an (α, 1/18, 0.1, 1/(100 · npriv² · log(npriv)))-semi-private learner for H.
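The identity err(h; D(p)) = p · err(h; D) + (1 − p)/2 and the fact that it preserves the optimal hypothesis can be checked exactly on a toy example. The following Python sketch is illustrative only (the finite domain, threshold target, and hypothesis class are hypothetical stand-ins, not from the paper): the error of every h under the noisy-label distribution D0 is computed by enumeration and equals exactly 1/2, so the mixture error is an increasing affine function of err(h; D) and the minimizer is unchanged.

```python
from fractions import Fraction as F

X = range(10)                      # toy domain (stand-in for X)
p = F(1, 100)                      # mixture weight of the "real" distribution D

def target(x):                     # D: x uniform on X, labeled by this threshold
    return 1 if x >= 5 else 0

def h_t(t):                        # toy threshold hypothesis class
    return lambda x: 1 if x >= t else 0

def err_D(h):                      # exact error under the realizable distribution D
    return F(sum(h(x) != target(x) for x in X), len(X))

def err_D0(h):                     # exact error under D0: x uniform, label uniform
    return F(sum(h(x) != y for x in X for y in (0, 1)), 2 * len(X))

def err_mix(h):                    # error under the mixture D(p) = p*D + (1-p)*D0
    return p * err_D(h) + (1 - p) * err_D0(h)

for t in range(11):
    h = h_t(t)
    assert err_D0(h) == F(1, 2)                      # noisy labels: every h errs w.p. 1/2
    assert err_mix(h) == p * err_D(h) + (1 - p) / 2  # the identity used in the proof

# the optimal threshold is the same under D and under D(p)
assert min(range(11), key=lambda t: err_mix(h_t(t))) == \
       min(range(11), key=lambda t: err_D(h_t(t))) == 5
```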
Hence, by Lemma 4.4, there is a (100 · npub · α, 1/16, 0.1, 1/(100 · npriv² · log(npriv)))-private learner for H. Thus, Theorem 4.3 implies that 100 · npub · α > 1/16, and hence that npub > 1/(1600 · α), as required.

Proof of Theorem 4.2
Proof. First, if H is learnable by a pure private learner, then the second condition trivially cannot hold, since H can be learned without any public examples. Now, suppose that the first item does not hold. Note that by Lemma 2.5, this implies that there is no pure private learner for H with respect to realizable distributions. By Lemma 2.6, this in turn implies that there is no (1/16, 1/16, 0.1)-pure private learner for H with respect to realizable distributions. Now, suppose A is a pure semi-private learner for H with sample complexities npriv(α, 1/18), npub(α, 1/18). Then, this implies that for any α > 0, A is an (α, 1/18, 0.1)-pure semi-private learner for H. Hence, by Lemma 4.4, there is a (100 · npub · α, 1/16, 0.1)-pure private learner for H w.r.t. realizable distributions. This, together with the earlier conclusion, implies that 100 · npub · α > 1/16, and therefore that npub > 1/(1600 · α), which shows that the condition in the second item holds.

Acknowledgements

N. Alon's research is supported in part by NSF grant DMS-1855464, BSF grant 2018267, and the Simons Foundation. R. Bassily's research is supported by NSF Awards AF-1908281 and SHF-1907715, a Google Faculty Research Award, and OSU faculty start-up support.

References
[ABM19] Noga Alon, Raef Bassily, and Shay Moran. Limits of private learning with access to public data. arXiv:1910.11519 [cs.LG], 2019.

[ALMM19] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. STOC 2019, pp.
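As a quick numeric sanity check of the final arithmetic step (illustrative only, not part of the proof): 100 · npub · α > 1/16 is equivalent to npub > 1/(1600 · α), and the smallest integer npub satisfying it can be computed exactly.

```python
from fractions import Fraction as F

def min_npub(alpha):
    """Smallest integer n_pub with 100 * n_pub * alpha > 1/16,
    i.e. n_pub > 1/(1600 * alpha)."""
    bound = F(1, 1600) / alpha
    return int(bound) + 1          # strict inequality, so floor(bound) + 1

for alpha in (F(1, 100000), F(1, 1000000)):
    n = min_npub(alpha)
    assert 100 * n * alpha > F(1, 16)         # n satisfies the bound
    assert 100 * (n - 1) * alpha <= F(1, 16)  # n - 1 does not
```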
852–860 (arXiv preprint arXiv:1806.00949), 2019.

[BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 363–378. Springer, 2013.

[BNS15] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Learning privately with labeled and unlabeled examples. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 461–477. Society for Industrial and Applied Mathematics, 2015.

[BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 634–649. IEEE, 2015.

[BPS09] Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009.

[BTT18] Raef Bassily, Abhradeep Guha Thakurta, and Om Dipakbhai Thakkar. Model-agnostic private learning. In Advances in Neural Information Processing Systems, pages 7102–7112, 2018.

[CH11] Kamalika Chaudhuri and Daniel Hsu. Sample complexity bounds for differentially private learning. In Proceedings of the 24th Annual Conference on Learning Theory, pages 155–186, 2011.

[DKM+06] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.

[DRV10] Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan.
Boosting and differential privacy. In FOCS (full version: https://guyrothblum.files.wordpress.com/2014/11/drv10.pdf), 2010.

[HCB16] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555–563, 2016.

[KLN+08] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? In FOCS, pages 531–540. IEEE Computer Society, 2008.

[Lit87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1987.

[PAE+17] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. stat, 1050, 2017.

[PSM+18] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with PATE. arXiv preprint arXiv:1802.08908, 2018.

[Sau72] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.

[SSBD14] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.