{"title": "Private Hypothesis Selection", "book": "Advances in Neural Information Processing Systems", "page_first": 156, "page_last": 167, "abstract": "We provide a differentially private algorithm for hypothesis selection. \n Given samples from an unknown probability distribution $P$ and a set of $m$ probability distributions $\\mathcal{H}$, the goal is to output, in a $\\varepsilon$-differentially private manner, a distribution from $\\mathcal{H}$ whose total variation distance to $P$ is comparable to that of the best such distribution (which we denote by $\\alpha$).\n The sample complexity of our basic algorithm is $O\\left(\\frac{\\log m}{\\alpha^2} + \\frac{\\log m}{\\alpha \\varepsilon}\\right)$, representing a minimal cost for privacy when compared to the non-private algorithm. We also can handle infinite hypothesis classes $\\mathcal{H}$ by relaxing to $(\\varepsilon,\\delta)$-differential privacy.\n\n We apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes.\n Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class.\n As the covering and packing numbers are often closely related, for constant $\\alpha$, our algorithms achieve the optimal sample complexity for many classes of interest.\n Finally, we describe an application to private distribution-free PAC learning.", "full_text": "Private Hypothesis Selection\n\nDepartment of Computer Science\n\nCheriton School of Computer Science\n\nMark Bun\n\nBoston University\n\nmbun@bu.edu\n\nGautam Kamath\n\nUniversity of Waterloo\n\ng@csail.mit.edu\n\nThomas Steinke\nIBM Research\n\nphs@thomas-steinke.net\n\nDepartment of Computer Science & Engineering\n\nZhiwei Steven 
Wu\nUniversity of Minnesota\nzsw@umn.edu\n\nAbstract\n\nWe provide a differentially private algorithm for hypothesis selection. Given samples from an unknown probability distribution P and a set of m probability distributions H, the goal is to output, in an ε-differentially private manner, a distribution from H whose total variation distance to P is comparable to that of the best such distribution (which we denote by α). The sample complexity of our basic algorithm is O((log m)/α² + (log m)/(αε)), representing a minimal cost for privacy when compared to the non-private algorithm. We can also handle infinite hypothesis classes H by relaxing to (ε, δ)-differential privacy.\nWe apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes. Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class. As the covering and packing numbers are often closely related, for constant α, our algorithms achieve the optimal sample complexity for many classes of interest. Finally, we describe an application to private distribution-free PAC learning.\n\n1 Introduction\n\nWe consider the problem of hypothesis selection: given samples from an unknown probability distribution, select a distribution from some fixed set of candidates which is “close” to the unknown distribution in some appropriate distance measure. Such situations can arise naturally in a number of settings. For instance, we may have a number of different methods which work under various circumstances, which are not known in advance. 
One option is to run all the methods to generate a set of hypotheses, and pick the best from this set afterwards. Relatedly, an algorithm may branch its behavior based on a number of “guesses,” which will similarly result in a set of candidates, corresponding to the output at the end of each branch. Finally, if we know that the underlying distribution belongs to some (parametric) class, it is possible to essentially enumerate the class (also known as a cover) to create a collection of hypotheses. Observe that this last example is quite general, and this approach can give generic learning algorithms for many settings of interest.\n\nA full version of the paper, with additional details and proofs, appears in the supplement, or on arXiv [9].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nThis problem of hypothesis selection has been extensively studied (see, e.g., [46, 17, 18, 19]), resulting in algorithms with a sample complexity which is logarithmic in the number of hypotheses. Such a mild dependence is critical, as it facilitates sample-efficient algorithms even when the number of candidates may be large. These initial works have triggered a great deal of study into hypothesis selection with additional considerations, including computational efficiency, understanding the optimal approximation factor, adversarial robustness, and weakening access to the hypotheses (e.g., [36, 15, 16, 43, 2, 21, 1, 7]).\nHowever, in modern settings of data analysis, data may contain sensitive information about individuals. Some examples of such data include medical records, GPS location data, or private message transcripts. As such, we would like to perform statistical inference in these settings without revealing significant information about any particular individual’s data. 
To this end, there have been many proposed notions of data privacy, but perhaps the gold standard is that of differential privacy [26]. Informally, differential privacy requires that, if a single datapoint in the dataset is changed, then the distribution over outputs produced by the algorithm should be similar (see Definition 4). Differential privacy has seen widespread adoption, including deployment by Apple [22], Google [28], and the US Census Bureau [14].\nThis naturally raises the question of whether one can perform hypothesis selection under the constraint of differential privacy, while maintaining a logarithmic dependence on the size of the cover. Such a tool would allow us to generically obtain private learning results for a wide variety of settings.\n\n1.1 Results\n\nOur main results answer this in the affirmative: we provide differentially private algorithms for selecting a good hypothesis from a set of distributions. The output distribution is competitive with the best distribution, and the sample complexity is bounded by the logarithm of the size of the set. The following is a basic version of our main result.\nTheorem 1. Let H = {H1, . . . , Hm} be a set of probability distributions. Let D = {X1, . . . , Xn} be a set of samples drawn independently from an unknown probability distribution P. There exists an ε-differentially private algorithm (with respect to the dataset D) which has the following guarantees. Suppose there exists a distribution H* ∈ H such that dTV(P, H*) ≤ α. If n = Ω((log m)/α² + (log m)/(αε)), then the algorithm will output a distribution Ĥ ∈ H such that dTV(P, Ĥ) ≤ (3 + ζ)α with probability at least 9/10, for any constant ζ > 0. 
The running time of the algorithm is O(nm²).\n\nThe sample complexity of this problem without privacy constraints is O((log m)/α²), and thus the additional cost for ε-differential privacy is an additive O((log m)/(αε)). We consider this cost to be minimal; in particular, the dependence on m is unchanged. Note that the running time of our algorithm is O(nm²) – we conjecture it may be possible to reduce this to Õ(nm) as has been done in the non-private setting [16, 43, 2, 1], though we have not attempted to perform this optimization. Regardless, our main focus is on the sample complexity rather than the running time, since any method for generic hypothesis selection requires Ω(m) time, thus precluding efficient algorithms when m is large. Note that the approximation factor of (3 + ζ)α is effectively tight [19, 36, 7].\nTheorem 1 requires prior knowledge of the value of α, though we can use this to obtain an algorithm with similar guarantees which does not (Theorem 3).\nIt is possible to improve the guarantees of this algorithm in two ways (Theorem 4). First, if the distributions are nicely structured, the former term in the sample complexity can be reduced from O(log m/α²) to O(d/α²), where d is a VC-dimension-based measure of the complexity of the collection of distributions. Second, if there are few hypotheses which are close to the true distribution, then we can pay only logarithmically in this number, as opposed to the total number of hypotheses. These modifications allow us to handle instances where m may be very large (or even infinite), albeit at the cost of weakening to approximate differential privacy to perform the second refinement. 
A technical discussion of our methods is in Section 1.2, our basic approach is covered in Section 3, and the version with all the bells and whistles appears in Section 4.\nFrom Theorem 1, we immediately obtain Corollary 1, which applies when H itself may not be finite, but admits a finite cover with respect to total variation distance.\n\nCorollary 1. Suppose there exists an α-cover Cα of a set of distributions H, and that we are given a set of samples X1, . . . , Xn ∼ P, where dTV(P, H) ≤ α. For any constant ζ > 0, there exists an ε-differentially private algorithm (with respect to the input {X1, . . . , Xn}) which outputs a distribution H* ∈ Cα such that dTV(P, H*) ≤ (6 + 2ζ)α with probability ≥ 9/10, as long as n = Ω((log |Cα|)/α² + (log |Cα|)/(αε)).\nInformally, this says that if a hypothesis class has an α-cover Cα, then there is a private learning algorithm for the class which requires O(log |Cα|) samples. Note that our algorithm works even if the unknown distribution is only close to the hypothesis class. This is useful when we may have model misspecification, or when we require adversarial robustness. The requirements for this theorem to apply are minimal, and thus it generically provides learning algorithms for a wide variety of hypothesis classes. That said, in non-private settings, the sample complexity given by this method is rather lossy: as an extreme example, there is no finite-size cover of univariate Gaussian distributions with unbounded parameters, so this approach does not give a finite-sample algorithm, even though it is well-known that O(1/α²) samples suffice to estimate a Gaussian in total variation distance. 
In the private setting, our theorem incurs a cost which is somewhat necessary: in particular, it is folklore that any pure ε-differentially private learning algorithm must pay a cost which is logarithmic in the packing number of the class. Due to the relationship between packing and covering numbers, this implies that, up to a constant factor relaxation in the learning accuracy, our results are tight. A more formal discussion appears in the supplement.\nGiven Corollary 1, in Section 5, we derive new learning results for a number of classes. Our main applications are for d-dimensional Gaussian and product distributions. Informally, we obtain Õ(d) sample algorithms for learning a product distribution and a Gaussian with known covariance, and an Õ(d²) algorithm for learning a Gaussian with unknown covariance (Corollaries 2 and 3). These improve on recent results by Kamath, Li, Singhal, and Ullman [33] in two different ways. First, as mentioned before, our results are semi-agnostic, so we can handle the case when the distribution is only close to a product or Gaussian distribution. Second, our results hold for pure (ε, 0)-differential privacy, which is a stronger notion than ε²-zCDP as considered in [33]. In this weaker model, they also obtained Õ(d) and Õ(d²) sample algorithms, but the natural modifications to achieve ε-DP incur extra poly(d) factors.¹ [33] also showed Ω̃(d) lower bounds for Gaussian and product distribution estimation in the even weaker model of (ε, δ)-differential privacy. Thus, our results show that the dimension dependence for these problems is unchanged for essentially any notion of differential privacy. 
In particular, our results show a previously-unknown separation between mean estimation of product distributions and non-product distributions under pure (ε, 0)-differential privacy; see Remark 1.\nWe also apply Theorem 4 to obtain algorithms for learning Gaussians under (ε, δ)-differential privacy, with no bounds on the mean and variance parameters. More specifically, we provide algorithms for learning multivariate Gaussians with unknown mean and known covariance (Corollary 4), in which we manage to avoid dependences which arise due to the application of advanced composition (similar to Remark 1). We additionally give algorithms for learning mixtures of any coverable class (Corollary 5). In particular, this immediately implies algorithms for learning mixtures of Gaussians, product distributions, and all other classes mentioned above. Additional classes we can learn privately, described and discussed in the supplement, include piecewise polynomials, sums of independent random variables, and univariate Gaussians with unbounded mean and variance.\nTo conclude our applications, we discuss a connection to PAC learning. It is known that the sample complexity of differentially private distribution-free PAC learning can be higher than that of non-private learning. However, this gap does not exist for distribution-specific learning, where the learning algorithm knows the distribution of (unlabeled) examples, as both sample complexities are characterized by VC dimension. Private hypothesis selection allows us to address an intermediate situation where the distribution of unlabeled examples is not known exactly, but is known to come (approximately) from a class of distributions. 
When this class has a small cover, we are able to recover sample complexity guarantees for private PAC learning which are comparable to the non-private case. Details and discussion appear in the supplement.\n\n¹Roughly, this is due to the fact that the Laplace and Gaussian mechanisms are based on ℓ1 and ℓ2 sensitivity, respectively, and that there is a √d-factor relationship between these two norms in the worst case.\n\n1.2 Techniques\n\nNon-privately, most algorithms for hypothesis selection involve a tournament-style approach. We conduct a number of pairwise comparisons between distributions, which may either have a winner and a loser, or may be declared a draw. Intuitively, a distribution will be declared the winner of a comparison if it is much closer than the alternative to the unknown distribution, and a tie will be declared if the two distributions are comparably close. The algorithm will output any distribution which never loses a comparison. A single comparison between a pair of hypotheses requires O(1/α²) samples, and a Chernoff plus union bound argument over the O(m²) possible comparisons increases the sample complexity to O(log m/α²). In fact, we can use uniform convergence arguments to reduce this sample complexity to O(d/α²), where d is the VC dimension of the 2·(m choose 2) sets (the “Scheffé” sets) defined by the subsets of the domain where the PDF of one distribution dominates another. Crucially, we must reuse the same set of samples for all comparisons to avoid paying polynomially in the number of hypotheses.\nA private algorithm for this problem requires additional care. Since a single comparison is based on the number of samples which fall into a particular subset of the domain, the sensitivity of the underlying statistic is low, and thus privacy may seem easily achievable at first glance. 
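To make the basic comparison concrete, the single pairwise contest described above can be sketched as follows. This is our own minimal illustration, not the paper's implementation; the density functions and the Scheffé-set masses p1, p2 are assumed to be available exactly (the paper later notes these can instead be estimated to arbitrary accuracy).

```python
import numpy as np

def scheffe_compare(h_pdf, hp_pdf, p1, p2, samples, alpha, zeta):
    """One non-private pairwise contest between hypotheses H and H'.

    h_pdf, hp_pdf: density (or pmf) functions of H and H'.
    p1, p2: the masses H(W1), H'(W1) on the Scheffe set
    W1 = {x : H(x) > H'(x)}, assumed known exactly for this sketch.
    Returns "H", "H'", or "Draw".
    """
    # Empirical mass of the Scheffe set; changing one of the n samples
    # moves tau_hat by only 1/n, so this statistic has low sensitivity.
    in_w1 = np.array([h_pdf(x) > hp_pdf(x) for x in samples])
    tau_hat = in_w1.mean()
    if p1 - p2 <= (2 + zeta) * alpha:
        return "Draw"  # hypotheses too close to distinguish at scale alpha
    if tau_hat > p1 - (1 + zeta / 2) * alpha:
        return "H"
    if tau_hat < p2 + (1 + zeta / 2) * alpha:
        return "H'"
    return "Draw"
```

Running all O(m²) such contests on the same sample set and outputting a hypothesis that never loses recovers the non-private tournament.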
However, the challenge comes from the fact that the same samples are reused for all pairwise comparisons, thus greatly increasing the sensitivity: changing a single datapoint could flip the result of every comparison! In order to avoid this pitfall, we instead carefully construct a score function for each hypothesis, namely, the minimum number of points that must be changed to cause the distribution to lose any comparison. For this to be a useful score function, we must show that the best hypothesis will win all of its comparisons by a large margin. We can then use the Exponential Mechanism [37] to select a distribution with high score.\nFurther improvements can be made if we are guaranteed that the number of “good” hypotheses (i.e., those that have total variation distance from the true distribution bounded by (3 + ζ)α) is at most some parameter k, and if we are willing to relax to approximate differential privacy. The parameter k here is related to the doubling dimension of the hypothesis class with respect to total variation distance. If we randomly assign the hypotheses to Ω(k²) buckets, with high probability, no bucket will contain more than one good hypothesis. We can identify a bucket containing a good hypothesis using a similar method based on the exponential mechanism as described above. Moreover, since we are likely to have only one “good” hypothesis in the chosen bucket, this implies a significant gap between the best and second-best scores in that bucket. This allows us to use stability-based techniques [25, 44], and in particular the GAP-MAX algorithm of Bun, Dwork, Rothblum, and Steinke [8], to identify an accurate distribution.\n\n1.3 Related Work\n\nOur main result builds on a long line of work on non-private hypothesis selection. 
One starting point for the particular style of approach we consider here is [46], which was expanded on in [17, 18, 19]. Since then, there has been study of hypothesis selection under additional considerations, including computational efficiency, understanding the optimal approximation factor, adversarial robustness, and weakening access to the hypotheses [36, 15, 16, 43, 2, 21, 1, 7]. Our private algorithm examines the same type of problem, with the additional constraint of differential privacy.\nThere has recently been a great deal of interest in differentially private distribution learning. In the central model, most relevant are [20], which gives algorithms for learning structured univariate distributions, and [35, 33], which focus on learning Gaussians and binary product distributions. [13] also studies private statistical parameter estimation. Privately learning mixtures of Gaussians was considered in [39, 34]. The latter paper (which is concurrent with the present work) gives a computationally efficient algorithm for the problem, but with a worse sample complexity and incomparable accuracy guarantees (they require a separation condition, and perform clustering and parameter estimation, while we do proper learning). [10] give an algorithm for learning distributions in Kolmogorov distance. Upper and lower bounds for learning the mean of a product distribution over the hypercube in ℓ∞-distance include [6, 12, 26, 42]. [3] focuses on estimating properties of a distribution, rather than the distribution itself. [40] gives an algorithm which allows one to estimate asymptotically normal statistics with optimal convergence rates, but no finite sample complexity guarantees. There has also been a great deal of work on distribution learning in the local model of differential privacy [23, 45, 32, 4, 24, 31, 47, 29]. 
Additional discussion of non-private learning appears in the supplementary material.\n\n2 Preliminaries\n\nDefinition 1. The total variation distance or statistical distance between P and Q is defined as dTV(P, Q) = max_{S⊆Ω} [P(S) − Q(S)] = ½ ∫_{x∈Ω} |P(x) − Q(x)| dx = ½ ‖P − Q‖₁ ∈ [0, 1]. Moreover, if H is a set of distributions over a common domain, we define dTV(P, H) = inf_{H∈H} dTV(P, H).\nDefinition 2. A γ-cover of a set of distributions H is a set of distributions Cγ such that for every H ∈ H, there exists some P ∈ Cγ such that dTV(P, H) ≤ γ. A γ-packing of a set of distributions H is a set of distributions Pγ ⊆ H such that for every pair of distributions P, Q ∈ Pγ, we have that dTV(P, Q) ≥ γ.\n\nIn this paper, we present semi-agnostic learning algorithms.\nDefinition 3. An algorithm is said to be an α-semi-agnostic learner for a class H if it has the following guarantees. Suppose we are given X1, . . . , Xn ∼ P, where dTV(P, H) ≤ OPT. The algorithm must output some distribution Ĥ such that dTV(P, Ĥ) ≤ c · OPT + O(α), for some constant c ≥ 1. If c = 1, then the algorithm is said to be agnostic.\n\nWe consider algorithms under the constraint of differential privacy.\nDefinition 4 ([26]). A randomized algorithm T : X* → R is (ε, δ)-differentially private if for all n ≥ 1, for all neighboring datasets D, D′ ∈ Xⁿ, and for all events S ⊆ R, Pr[T(D) ∈ S] ≤ e^ε Pr[T(D′) ∈ S] + δ. If δ = 0, we say that T is ε-differentially private.\n\nWe will also use the related notion of concentrated differential privacy [27, 11], which is defined in the supplement.\nOur methods will rely heavily upon the exponential mechanism.\nTheorem 2 (Exponential Mechanism [37]). 
For a score function q : X* × R → R, define its sensitivity as Δ(q) = max_{r∈R, D∼D′} |q(D, r) − q(D′, r)|, where D ∼ D′ denotes that D and D′ are neighboring datasets. For any q, input dataset D, and privacy parameter ε > 0, the exponential mechanism M_E(D, q, ε) picks an outcome r ∈ R with probability proportional to exp(εq(D, r)/(2Δ(q))). This is ε-differentially private, and with probability at least 1 − β, selects an outcome r ∈ R such that q(D, r) ≥ max_{r′∈R} q(D, r′) − 2Δ(q) log(|R|/β)/ε.\n\n3 A First Method for Private Hypothesis Selection\n\nIn this section, we present our first algorithm for private hypothesis selection and obtain Theorem 1. Note that the sample complexity bound above scales logarithmically with the size of the hypothesis class. In Section 4, we will provide a stronger result (which subsumes the present one as a special case) that can handle certain infinite hypothesis classes. For the sake of exposition, we begin in this section with the basic algorithm.\n\n3.1 Pairwise Comparisons\n\nWe first present a subroutine which compares two hypothesis distributions. Let H and H′ be two distributions over domain X and consider the following set, which is called the Scheffé set: W1 = {x ∈ X | H(x) > H′(x)}. Define p1 = H(W1), p2 = H′(W1), and τ = P(W1) to be the probability masses that H, H′, and P place on W1, respectively. It follows that p1 > p2 and p1 − p2 = dTV(H, H′).²\n\nAlgorithm 1: PAIRWISE CONTEST: PC(H, H′, D, ζ, α)\nInput: Two hypotheses H and H′, input dataset D of size n drawn i.i.d. 
from target distribution P, approximation parameter ζ > 0, and accuracy parameter α ∈ (0, 1).\nInitialize: Compute the fraction of points that fall into W1: τ̂ = (1/n) · |{x ∈ D | x ∈ W1}|.\nIf p1 − p2 ≤ (2 + ζ)α, return “Draw”.\nElse If τ̂ > p1 − (1 + ζ/2)α, return H as the winner.\nElse If τ̂ < p2 + (1 + ζ/2)α, return H′ as the winner.\nElse return “Draw”.\n\nNow consider the following function of this ordered pair of hypotheses: Γζ(H, H′, D) = n, if p1 − p2 ≤ (2 + ζ)α; and Γζ(H, H′, D) = n · max{0, τ̂ − (p2 + (1 + ζ/2)α)}, otherwise.\nWhen the two hypotheses are sufficiently far apart (i.e., dTV(H, H′) > (2 + ζ)α), Γζ(H, H′, D) is essentially the number of points one needs to change in D to make H′ the winner.\nLemma 1. Let P, H, H′ be distributions as above. With probability at least 1 − 2 exp(−nζ²α²/8) over the random draws of D from Pⁿ, τ̂ satisfies |τ̂ − τ| < ζα/4, and if dTV(P, H) ≤ α, then Γζ(H, H′, D) > ζαn/4.\n\nProof. By applying Hoeffding’s inequality, we know that with probability at least 1 − 2 exp(−nζ²α²/8), |τ − τ̂| < ζα/4. We condition on this event for the remainder of the proof. Consider the following two cases. In the first case, suppose that p1 − p2 ≤ (2 + ζ)α. Then we know that Γζ(H, H′, D) = n > αn. In the second case, suppose that p1 − p2 > (2 + ζ)α. Since dTV(P, H) ≤ α, we know that |p1 − τ| ≤ α, and so |p1 − τ̂| < (1 + ζ/4)α. Since p1 > p2 + (2 + ζ)α, we also have τ̂ > p2 + (1 + 3ζ/4)α. 
It follows that Γζ(H, H′, D) = n(τ̂ − (p2 + (1 + ζ/2)α)) > ζαn/4. This completes the proof.\n\n3.2 Selection via Exponential Mechanism\n\nIn light of the pairwise comparison defined above, we consider the following score function S : H × Xⁿ → R: for any Hj ∈ H and dataset D, S(Hj, D) = min_{Hk∈H} Γζ(Hj, Hk, D). Roughly speaking, S(Hj, D) is the minimum number of points required to change in D in order for Hj to lose at least one pairwise contest against a different hypothesis. When the hypothesis Hj is very close to every other distribution, such that all pairwise contests return “Draw,” then the score will be n.\n\nAlgorithm 2: PRIVATE HYPOTHESIS SELECTION: PHS(H, D, ε)\nInput: Dataset D, a collection of hypotheses H = {H1, . . . , Hm}, privacy parameter ε.\nOutput a random hypothesis Ĥ ∈ H such that for each Hj, Pr[Ĥ = Hj] ∝ exp(εS(Hj, D)/2).\n\nLemma 2 (Privacy). For any ε > 0 and collection of hypotheses H, the algorithm PHS(H, ·, ε) satisfies ε-differential privacy.\nProof. First, observe that for any pair of hypotheses Hj, Hk, the function Γζ(Hj, Hk, ·) has sensitivity 1. As a result, the score function S is also 1-sensitive. The result then follows directly from the privacy guarantee of the exponential mechanism (Theorem 2).\nLemma 3 (Utility). Fix any α, β ∈ (0, 1), and constant ζ > 0. Suppose that there exists H* ∈ H such that dTV(P, H*) ≤ α. Then with probability 1 − β over the sample D and the algorithm PHS, we have that PHS(H, D) outputs a hypothesis Ĥ such that dTV(P, Ĥ) ≤ (3 + ζ)α, as long as the sample size satisfies n ≥ 8 ln(4m/β)/(ζ²α²) + 8 ln(2m/β)/(ζαε).\n\n²For simplicity of our exposition, we will assume that we can evaluate the two quantities p1 and p2 exactly. In general, we can estimate these quantities to arbitrary accuracy, as long as, for each hypothesis H, we can evaluate the density of each point under H and also draw samples from H.\n\nProof. First, consider the m pairwise contests between H* and every candidate in H. Let Wj = {x ∈ X | Hj(x) > H*(x)} be the collection of Scheffé sets. For any event W ⊆ X, let P̂(W) denote the empirical probability of event W on the dataset D. By Lemma 1 and an application of the union bound, we know that with probability at least 1 − 2m exp(−nζ²α²/8) over the draws of D, |P(Wj) − P̂(Wj)| ≤ ζα/4 and Γζ(H*, Hj, D) > ζαn/4 for all Hj ∈ H. In particular, the latter event implies that S(H*, D) > ζαn/4.\nNext, by the utility guarantee of the exponential mechanism (Theorem 2), we know that with probability at least 1 − β/2, the output hypothesis satisfies S(Ĥ, D) ≥ S(H*, D) − 2 ln(2m/β)/ε > ζαn/4 − 2 ln(2m/β)/ε. Then as long as n ≥ 8 ln(4m/β)/(ζ²α²) + 8 ln(2m/β)/(ζαε), we know that with probability at least 1 − β, S(Ĥ, D) > 0. Let us condition on this event, which implies that Γζ(Ĥ, H*, D) > 0. We will now show that dTV(Ĥ, H*) ≤ (2 + ζ)α, which directly implies that dTV(Ĥ, P) ≤ (3 + ζ)α by the triangle inequality. Suppose to the contrary that dTV(Ĥ, H*) > (2 + ζ)α. Then by the definition of Γζ, P̂(Ŵ) > H*(Ŵ) + (1 + ζ/2)α, where Ŵ = {x ∈ X | Ĥ(x) > H*(x)}. Since |P(Ŵ) − P̂(Ŵ)| ≤ ζα/4, we have P(Ŵ) > H*(Ŵ) + (1 + ζ/4)α, which is a contradiction to the assumption that dTV(P, H*) ≤ α.\n\nWhile Theorem 1 requires an upper bound α on the accuracy of the best hypothesis, the following theorem obviates this need, at the cost of a mild increase in the sample complexity and a constant factor in the accuracy of the output hypothesis.\nTheorem 3. Let α, β, ε ∈ (0, 1), and let ζ > 0 be a constant. Let H be a set of m distributions and let P be a distribution with dTV(P, H) = OPT. There is an ε-differentially private algorithm which takes as input n samples from P and, with probability at least 1 − β, outputs a distribution Ĥ ∈ H with dTV(P, Ĥ) ≤ 18(3 + ζ) OPT + α, as long as n ≥ O((log(m/β) + log log(1/α))/α² + (log m + log²(1/α) · (log(1/β) + log log(1/α)))/(αε)).\n\n4 An Advanced Method for Private Hypothesis Selection\n\nIn Section 3, we provided a simple algorithm whose sample complexity grows logarithmically in the size of the hypothesis class. We now demonstrate that this dependence can be improved and, indeed, that we can handle infinite hypothesis classes, given that their VC dimension is finite and that the cover has small doubling dimension.\nTo obtain this improved dependence on the hypothesis class size, we must make two improvements to the analysis and algorithm. 
First, rather than applying a union bound over all the pairwise contests to analyse the tournament, we use a uniform convergence bound in terms of the VC dimension of the Scheffé sets. Second, rather than using the exponential mechanism to select a hypothesis, we use a “GAP-MAX” algorithm [8]. This takes advantage of the fact that, in many cases, even for infinite hypothesis classes, only a handful of hypotheses will have high scores. The GAP-MAX algorithm need only pay for the hypotheses that are close to optimal. To exploit this, we must move to a relaxation of pure differential privacy which is not subject to strong packing lower bounds (as we describe in the supplement). Specifically, we consider approximate differential privacy, although results with an improved dependence are also possible under various variants of concentrated differential privacy [27, 11, 38, 8].\nTheorem 4. Let H be a set of probability distributions on X. Let d be the VC dimension of the set of functions f_{H,H′} : X → {0, 1} defined by f_{H,H′}(x) = 1 ⟺ H(x) > H′(x), where H, H′ ∈ H. There exists an (ε, δ)-differentially private algorithm which has the following guarantee. Let D = {X1, . . . , Xn} be a set of private samples drawn independently from an unknown probability distribution P. Let k = |{H ∈ H : dTV(H, P) ≤ 7α}|. Suppose there exists a distribution H* ∈ H such that dTV(P, H*) ≤ α. If n = Ω((d + log(1/β))/α² + (log(k/β) + min{log |H|, log(1/δ)})/(αε)), then the algorithm will output a distribution Ĥ ∈ H such that dTV(P, Ĥ) ≤ 7α with probability at least 
1 − β. Alternatively, we can demand that the algorithm be (1/2)ε²-concentrated differentially private if n = Ω( (d + log(1/β))/α² + (log(k/β) + √(log |H|))/(αε) ).

Comparing Theorem 4 to Theorem 1, we see that the first (non-private) log |H| term is replaced by the VC dimension d and the second (private) log |H| term is replaced by log k + log(1/δ). Here k is a measure of the "local" size of the hypothesis class H; its definition is similar to that of the doubling dimension of the hypothesis class under total variation distance. We note that the log(1/δ) term could be large, as the privacy failure probability δ should be cryptographically small. Thus our result includes statements for pure differential privacy (by using the other term in the minimum with δ = 0) and also concentrated differential privacy. Note that, since d and log k can be upper-bounded by O(log |H|), this result supersedes the guarantees of Theorem 1. Further details and a proof of correctness appear in the supplementary material.

5 Applications of Hypothesis Selection

In this section, we give a number of applications of Theorem 1, primarily to obtain sample complexity bounds for learning a number of distribution classes of interest. Recall Corollary 1, which is an immediate corollary of Theorem 1.
This indicates that we can privately semi-agnostically learn a class of distributions with a number of samples proportional to the log of its covering number. We instantiate this corollary to give sample complexity results for semi-agnostically learning product distributions (Section 5.1), Gaussian distributions (Section 5.2), and mixtures (Section 5.3). Additional applications and all proofs of correctness appear in the supplement.

5.1 Product Distributions

As a first application, we give an ε-differentially private algorithm for learning product distributions.

Definition 5. A (k, d)-product distribution is a distribution over [k]^d such that its marginal distributions are independent (i.e., the distribution is the product of its marginals).

We start by constructing a cover for product distributions.

Lemma 4. There exists an α-cover of the set of (k, d)-product distributions of size O(kd/α)^{d(k−1)}.

With this cover in hand, applying Corollary 1 allows us to conclude the following sample complexity upper bound.

Corollary 2. Suppose we are given a set of samples X1, . . . , Xn ∼ P, where P is α-close to a (k, d)-product distribution. Then for any constant ζ > 0, there exists an ε-differentially private algorithm which outputs a (k, d)-product distribution H∗ such that dTV(P, H∗) ≤ (6 + 2ζ)α with probability ≥ 9/10, so long as n = Ω( kd log(kd/α)·(1/α² + 1/(αε)) ).

This gives the first Õ(d)-sample algorithm for learning a binary product distribution under pure differential privacy, improving upon the work of Kamath, Li, Singhal, and Ullman [33] by strengthening the privacy guarantee at a minimal cost in the sample complexity.
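To make the cover of Lemma 4 concrete in the simplest case, the sketch below grids each marginal mean of a Bernoulli (k = 2) product distribution in steps of 2α/d; since total variation distance between product distributions is at most the sum of the per-coordinate distances, rounding each mean to the grid yields an α-cover of size O(d/α)^d, matching the O(kd/α)^{d(k−1)} shape of the lemma. The gridding and helper names here are our simplification for illustration, not the paper's exact construction.

```python
import random

def bernoulli_cover_grid(d, alpha):
    """Per-coordinate grid of means; the cover is the (implicit) d-fold
    product of this grid. Rounding a mean to the nearest grid point
    incurs per-coordinate TV error at most alpha/d."""
    step = 2 * alpha / d
    return [min(1.0, i * step) for i in range(int(1 / step) + 2)]

def product_tv_upper_bound(p, q):
    # dTV between two Bernoulli product distributions is at most
    # the sum of per-coordinate mean differences (subadditivity).
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

d, alpha = 5, 0.1
grid = bernoulli_cover_grid(d, alpha)
p = [random.random() for _ in range(d)]                   # arbitrary means
q = [min(grid, key=lambda g: abs(g - pi)) for pi in p]    # nearest cover element
assert product_tv_upper_bound(p, q) <= alpha
# cover size is len(grid) ** d, i.e. O(d / alpha) ** d
```

Gridding all k − 1 free marginal probabilities per coordinate in the same way gives the general (k, d) bound.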
The natural way to adapt their result from concentrated to pure differential privacy would require Ω(d^{3/2}) samples.

Remark 1. Properly learning a product distribution over {0, 1}^d to total variation distance ≤ 1/2 implies learning its mean µ ∈ [0, 1]^d up to ℓ1 error ≤ 2√d. Thus Corollary 2 implies an ε-differentially private algorithm which takes n = Õ(d/ε) samples from a product distribution P on {0, 1}^d and, with high probability, outputs an estimate µ̂ of its mean µ with ‖µ̂ − µ‖1 ≤ 2√d. In contrast, for non-product distributions over the hypercube, estimating the mean to the same accuracy under ε-differential privacy requires n = Ω(d^{3/2}/ε) samples [30, 41]. Thus we have a polynomial separation between estimating product and non-product distributions under pure differential privacy.

5.2 Gaussian Distributions

We next give private algorithms for learning Gaussian distributions. We discuss covers for Gaussian distributions with known and unknown covariance.

Lemma 5. There exists an α-cover of the set of Gaussian distributions N(µ, Σ) in d dimensions with ‖µ‖2 ≤ R and I ⪯ Σ ⪯ κI of size O(dR/α)^d · O(dκ/α)^{d(d+1)/2}. If Σ = I, the size is O(dR/α)^d.

In addition, we can obtain bounds on the VC dimension of the Scheffé sets of Gaussian distributions.

Lemma 6 ([5]). The set of Gaussian distributions with fixed variance – i.e., all N(µ, I) with µ ∈ R^d – has VC dimension d + 1.
Furthermore, the set of Gaussians with unknown variance – i.e., all N(µ, Σ) with µ ∈ R^d and Σ ∈ R^{d×d} positive definite – has VC dimension O(d²).

Combining the covers of Lemma 5 and the VC bound of Lemma 6 with Theorem 4 implies the following corollaries for Gaussian estimation.

Corollary 3. Suppose we are given a set of samples X1, . . . , Xn ∼ P, where P is α-close to a Gaussian distribution N(µ, Σ) in d dimensions with ‖µ‖ ≤ R. Then for any constant ζ > 0, there exists an ε-differentially private algorithm which outputs a Gaussian distribution H∗ such that dTV(P, H∗) ≤ (6 + 2ζ)α with probability ≥ 9/10. If Σ = I, the algorithm requires that n = Ω( d/α² + (d/(αε)) log(dR/α) ). If I ⪯ Σ ⪯ κI, it requires that n = Ω( d²/α² + (1/(αε))·(d log(dR/α) + d² log(dκ/α)) ).

Similar to the product distribution case, these are the first Õ(d)- and Õ(d²)-sample algorithms for learning Gaussians in total variation distance under pure differential privacy, improving upon the concentrated differential privacy results of Kamath, Li, Singhal, and Ullman [33].

5.2.1 Gaussians with Unbounded Mean

Extending Corollary 3, we consider multivariate Gaussian hypotheses with known covariance and unknown mean, without a bound on the mean (the parameter R above). This requires relaxation to approximate differential privacy. In place of Lemma 5, we construct a locally small cover:

Lemma 7.
For any d ∈ N and α ∈ (0, 1/30], there exists an α-cover C_α of the set of d-dimensional Gaussian distributions N(µ, I) satisfying, for all µ ∈ R^d, |{H ∈ C_α : dTV(H, N(µ, I)) ≤ 7α}| ≤ 2^{15d}.

Applying Theorem 4 with the cover of Lemma 7 and the VC bound from Lemma 6 gives:

Corollary 4. Suppose we are given a set of samples X1, . . . , Xn ∼ P, where P is a spherical Gaussian distribution N(µ, I) in d dimensions. Then there exists an (ε, δ)-differentially private algorithm which outputs a spherical Gaussian distribution H∗ such that dTV(P, H∗) ≤ 7α with probability ≥ 1 − 2^{−d}, so long as n = Ω( d/α² + (d + log(1/δ))/(αε) ).

Karwa and Vadhan [35] give an algorithm for estimating a univariate Gaussian with unbounded mean, which can be applied independently to the d coordinates to get a sample complexity bound of Õ( d/α² + d/(αε) + √d · log^{3/2}(1/δ)/ε ). Our bound dominates this except for very small values of α.

5.3 Mixtures

In this section, we show that our results extend to learning mixtures of classes of distributions.

Definition 6. Let H be some set of distributions. A k-mixture of H is a distribution with density ∑_{i=1}^k w_i P_i, where each P_i ∈ H.

Our results follow roughly due to the fact that a cover for k-mixtures of a class can be written as the Cartesian product of k covers for the class, and then an application of Corollary 1.

Lemma 8. Consider the class of k-mixtures of H, where H is some set of distributions. There exists a 2α-cover of this class of size |C_α|^k · (k/(2α) + 1)^{k−1}, where C_α is an α-cover of H.

Corollary 5. Let X1, . . . , Xn ∼ P, where P is α-close to a k-mixture of distributions from some set H. Let C_α be an α-cover of the set H, and let ζ > 0 be a constant. There exists an ε-differentially private algorithm which outputs a distribution which is (9 + 3ζ)α-close to P with probability ≥ 9/10, as long as n = Ω( (k log |C_α| + k log(k/α))·(1/α² + 1/(αε)) ).

For example, instantiating this for mixtures of Gaussians (and disregarding terms which depend on R and κ), we get an algorithm with sample complexity Õ( kd²/α² + kd²/(αε) ).

Acknowledgments

The authors would like to thank Shay Moran for bringing to their attention the application to PAC learning mentioned in the supplement, Jonathan Ullman for asking questions which motivated Remark 1, and Clément Canonne for assistance in reducing the constant factor in the approximation guarantee. This work was done while the authors were all affiliated with the Simons Institute for the Theory of Computing. MB was supported by a Google Research Fellowship, as part of the Simons-Berkeley Research Fellowship program. GK was supported by a Microsoft Research Fellowship, as part of the Simons-Berkeley Research Fellowship program, and the work was also partially done while visiting Microsoft Research, Redmond. TS was supported by a Patrick J. McGovern Research Fellowship, as part of the Simons-Berkeley Research Fellowship program. ZSW was supported in part by a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Facebook Research Award.

References

[1] Jayadev Acharya, Moein Falahatgar, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh. Maximum selection and sorting with adversarial comparators. Journal of Machine Learning Research, 19(1):2427–2457, 2018.

[2] Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh.
Sorting with adversarial comparators and application to density estimation. In Proceedings of the 2014 IEEE International Symposium on Information Theory, ISIT '14, pages 1682–1686, Washington, DC, USA, 2014. IEEE Computer Society.

[3] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. Inspectre: Privately estimating the unseen. In Proceedings of the 35th International Conference on Machine Learning, ICML '18, pages 30–39. JMLR, Inc., 2018.

[4] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS '19, pages 1120–1129. JMLR, Inc., 2019.

[5] Martin Anthony. Classification by polynomial surfaces. Discrete Applied Mathematics, 61(2):91–103, 1995.

[6] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '05, pages 128–138, New York, NY, USA, 2005. ACM.

[7] Olivier Bousquet, Daniel M. Kane, and Shay Moran. The optimal approximation factor in density estimation. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 318–341, 2019.

[8] Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 74–86, New York, NY, USA, 2018. ACM.

[9] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. arXiv preprint arXiv:1905.13229, 2019.

[10] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan.
Differentially private release and learning of threshold functions. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS '15, pages 634–649, Washington, DC, USA, 2015. IEEE Computer Society.

[11] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Proceedings of the 14th Conference on Theory of Cryptography, TCC '16-B, pages 635–658, Berlin, Heidelberg, 2016. Springer.

[12] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 1–10, New York, NY, USA, 2014. ACM.

[13] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.

[14] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

[15] Constantinos Daskalakis, Ilias Diakonikolas, and Rocco A. Servedio. Learning Poisson binomial distributions. In Proceedings of the 44th Annual ACM Symposium on the Theory of Computing, STOC '12, pages 709–728, New York, NY, USA, 2012. ACM.

[16] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms for proper learning mixtures of Gaussians. In Proceedings of the 27th Annual Conference on Learning Theory, COLT '14, pages 1183–1213, 2014.

[17] Luc Devroye and Gábor Lugosi.
A universally acceptable smoothing factor for kernel density estimation. The Annals of Statistics, 24(6):2499–2512, 1996.

[18] Luc Devroye and Gábor Lugosi. Nonasymptotic universal smoothing factors, kernel complexity and Yatracos classes. The Annals of Statistics, 25(6):2626–2637, 1997.

[19] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer, 2001.

[20] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. In Advances in Neural Information Processing Systems 28, NIPS '15, pages 2566–2574. Curran Associates, Inc., 2015.

[21] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high dimensions without the computational intractability. In Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS '16, pages 655–664, Washington, DC, USA, 2016. IEEE Computer Society.

[22] Differential Privacy Team, Apple. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.

[23] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS '13, pages 429–438, Washington, DC, USA, 2013. IEEE Computer Society.

[24] John C. Duchi and Feng Ruan. The right complexity measure in locally private estimation: It is not the Fisher information. arXiv preprint arXiv:1806.05756, 2018.

[25] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the 41st Annual ACM Symposium on the Theory of Computing, STOC '09, pages 371–380, New York, NY, USA, 2009. ACM.

[26] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith.
Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265–284, Berlin, Heidelberg, 2006. Springer.

[27] Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. arXiv preprint arXiv:1603.01887, 2016.

[28] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS '14, pages 1054–1067, New York, NY, USA, 2014. ACM.

[29] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private confidence intervals: Z-test and tight confidence intervals. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS '19, pages 2545–2554. JMLR, Inc., 2019.

[30] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the 42nd Annual ACM Symposium on the Theory of Computing, STOC '10, pages 705–714, New York, NY, USA, 2010. ACM.

[31] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Zhiwei Steven Wu. Locally private Gaussian estimation. arXiv preprint arXiv:1811.08382, 2018.

[32] Peter Kairouz, Keith Bonawitz, and Daniel Ramage. Discrete distribution estimation under local privacy. In Proceedings of the 33rd International Conference on Machine Learning, ICML '16, pages 2436–2444. JMLR, Inc., 2016.

[33] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Proceedings of the 32nd Annual Conference on Learning Theory, COLT '19, pages 1853–1902, 2019.

[34] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private algorithms for learning mixtures of separated Gaussians. In Advances in Neural Information Processing Systems 32, NeurIPS '19.
Curran Associates, Inc., 2019.

[35] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. In Proceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS '18, pages 44:1–44:9, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[36] Satyaki Mahalanabis and Daniel Stefankovic. Density estimation in linear time. In Proceedings of the 21st Annual Conference on Learning Theory, COLT '08, pages 503–512, 2008.

[37] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, pages 94–103, Washington, DC, USA, 2007. IEEE Computer Society.

[38] Ilya Mironov. Rényi differential privacy. In Proceedings of the 30th IEEE Computer Security Foundations Symposium, CSF '17, pages 263–275, Washington, DC, USA, 2017. IEEE Computer Society.

[39] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on the Theory of Computing, STOC '07, pages 75–84, New York, NY, USA, 2007. ACM.

[40] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the 43rd Annual ACM Symposium on the Theory of Computing, STOC '11, pages 813–822, New York, NY, USA, 2011. ACM.

[41] Thomas Steinke and Jonathan Ullman. Interactive fingerprinting codes and the hardness of preventing false discovery. In Proceedings of the 28th Annual Conference on Learning Theory, COLT '15, pages 1588–1628, 2015.

[42] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. The Journal of Privacy and Confidentiality, 7(2):3–22, 2017.

[43] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour.
Near-optimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information Processing Systems 27, NIPS '14, pages 1395–1403. Curran Associates, Inc., 2014.

[44] Abhradeep Guha Thakurta and Adam Smith. Differentially private feature selection via stability arguments, and the robustness of the lasso. In Proceedings of the 26th Annual Conference on Learning Theory, COLT '13, pages 819–850, 2013.

[45] Shaowei Wang, Liusheng Huang, Pengzhan Wang, Yiwen Nie, Hongli Xu, Wei Yang, Xiang-Yang Li, and Chunming Qiao. Mutual information optimally local private discrete distribution estimation. arXiv preprint arXiv:1607.08025, 2016.

[46] Yannis G. Yatracos. Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13(2):768–774, 1985.

[47] Min Ye and Alexander Barg. Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory, 64(8):5662–5676, 2018.