{"title": "Differentially Private Algorithms for Learning Mixtures of Separated Gaussians", "book": "Advances in Neural Information Processing Systems", "page_first": 168, "page_last": 180, "abstract": "Learning the parameters of Gaussian mixture models is a fundamental and widely studied problem with numerous applications. In this work, we give new algorithms for learning the parameters of a high-dimensional, well separated, Gaussian mixture model subject to the strong constraint of differential privacy. In particular, we give a differentially private analogue of the algorithm of Achlioptas and McSherry. Our algorithm has two key properties not achieved by prior work: (1) The algorithm's sample complexity matches that of the corresponding non-private algorithm up to lower order terms in a wide range of parameters. (2) The algorithm requires very weak a priori bounds on the parameters of the mixture components.", "full_text": "Differentially Private Algorithms for Learning Mixtures of Separated Gaussians

Gautam Kamath
David R. Cheriton School of Computer Science
University of Waterloo
Waterloo, ON, Canada N2L 3G1
g@csail.mit.edu

Or Sheffet
Department of Computer Science, Faculty of Exact Sciences
Bar-Ilan University
Ramat-Gan, 5290002 Israel
or.sheffet@biu.ac.il

Vikrant Singhal
Khoury College of Computer Sciences
Northeastern University
360 Huntington Ave., Boston, MA 02115
singhal.vi@husky.neu.edu

Jonathan Ullman
Khoury College of Computer Sciences
Northeastern University
360 Huntington Ave., Boston, MA 02115
jullman@ccs.neu.edu

Abstract

Learning the parameters of Gaussian mixture models is a fundamental and widely studied problem with numerous applications. In this work, we give new algorithms for learning the parameters of a high-dimensional, well separated, Gaussian mixture model subject to the strong constraint of differential privacy. 
In particular, we give a differentially private analogue of the algorithm of Achlioptas and McSherry (COLT 2005). Our algorithm has two key properties not achieved by prior work: (1) The algorithm's sample complexity matches that of the corresponding non-private algorithm up to lower order terms in a wide range of parameters. (2) The algorithm requires very weak a priori bounds on the parameters of the mixture.

1 Introduction

The Gaussian mixture model is one of the most important and widely studied models in Statistics, with roots going back over a century [58], and has numerous applications in the physical, life, and social sciences. In a Gaussian mixture model, we suppose that each sample is drawn by randomly selecting one of k distinct Gaussian distributions G1, . . . , Gk in Rd and then drawing a sample from that distribution. The problem of learning a Gaussian mixture model asks us to take samples from this distribution and approximately recover the parameters (mean and covariance) of each of the underlying Gaussians. The past decades have seen tremendous progress towards understanding both the sample complexity and computational complexity of learning Gaussian mixtures [20, 21, 5, 64, 3, 17, 16, 52, 7, 59, 41, 27, 51, 46, 54, 10, 42, 4, 11, 40, 38, 66, 23, 6, 35, 36, 22, 63, 26, 53, 15].

Due to significant space restrictions, a full version of the paper, with additional details and all proofs, appears in the supplementary material [48].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In many of the applications of Gaussian mixture models, especially those in the social sciences, the sample consists of sensitive data belonging to individuals. In these cases, it is crucial that we not only learn the parameters of the mixture model, but do so while preserving these individuals' privacy. 
In this work, we study algorithms for learning Gaussian mixtures subject to differential privacy [32], which has become the de facto standard for individual privacy in statistical analysis of sensitive data. Intuitively, differential privacy guarantees that the output of the algorithm does not depend significantly on any one individual's data, which in this case means any one sample. Differential privacy is used as a measure of privacy for data analysis systems at Google [34], Apple [28], and the U.S. Census Bureau [19]. Differential privacy and related notions of algorithmic stability are also crucial for statistical validity even when individual privacy is not a primary concern, as they provide generalization guarantees in an adaptive setting [31, 9].
The first differentially private algorithm for learning Gaussian mixtures comes from the work of Nissim, Raskhodnikova, and Smith [55] as an application of their influential subsample-and-aggregate framework. However, their algorithm is a reduction from private estimation to non-private estimation that blows up the sample complexity by at least a poly(d) factor.

The contribution of this work is new differentially private algorithms for recovering the parameters of an unknown Gaussian mixture provided that the components are sufficiently well separated. In particular we give differentially private analogues of the algorithm of Achlioptas and McSherry [3], which requires that the means are separated by a factor proportional to √k, but independent of the dimension d. Our algorithms have two main features not shared by previous methods:

• The sample complexity of the algorithm matches that of the corresponding non-private algorithm up to lower order additive terms for a wide range of parameters.

• The algorithm requires very weak a priori bounds on the parameters of the mixture components. 
That is, like many algorithms, we require that the algorithm is seeded with some information about the range of the parameters, but the algorithm's sample complexity depends only mildly (polylogarithmically) on the size of this range.

1.1 Problem Formulation

There are a plethora of algorithms for (non-privately) learning Gaussian mixtures, each with different learning guarantees under different assumptions on the parameters of the underlying distribution.1 In this section we describe the version of the problem that our work studies and give some justification for these choices.
We assume that the underlying distribution D is a mixture of k Gaussians in high dimension d. The mixture is specified by k components, where each component Gi is selected with probability wi ∈ [0, 1] and is distributed as a Gaussian with mean μi ∈ Rd and covariance Σi ∈ Rd×d. Thus the mixture is specified by the tuple {(wi, μi, Σi)}i∈[k].
Our goal is to accurately α-recover this tuple of parameters. Intuitively, with probability 1 − β, we would like to recover a tuple {(ŵi, μ̂i, Σ̂i)}i∈[k], specifying a mixture G̃, such that ‖ŵ − w‖1 is small and ‖μ̂i − μi‖Σi and ‖Σ̂i − Σi‖Σi are small for every i ∈ [k] (specifically, we require them all to be O(α)). Here, ‖·‖Σ is the appropriate vector/matrix norm that ensures N(μi, Σi) and N(μ̂i, Σ̂i) are close in total variation distance, and we also compare ŵ and w in ‖·‖1 to ensure that the mixtures are close in total variation distance. Of course, since the labeling of the components is arbitrary, we can actually only hope to recover the parameters up to some unknown permutation π : [k] → [k] on the components. We say that an algorithm learns a mixture of Gaussians using n samples if it takes n i.i.d. samples from an unknown mixture D and outputs the parameters of a mixture D̂ satisfying these conditions, which will imply the resulting mixtures are α-close in total variation distance.2

1We remark that there are also many popular heuristics for learning Gaussian mixtures, notably the EM algorithm [24], but in this work we focus on algorithms with provable guarantees.
2To provide context, one might settle for a weaker goal of proper learning, where the goal is merely to learn some Gaussian mixture, possibly with a different number of components, that is close to the true mixture, or improper learning, where it suffices to learn any such distribution.

In this work, we consider mixtures that satisfy the separation condition

∀ i ≠ j:  ‖μi − μj‖2 ≥ Ω̃( √(k + 1/wi + 1/wj) · max{ ‖Σi^{1/2}‖, ‖Σj^{1/2}‖ } )   (1)

Note that the separation condition does not depend on the dimension d, only on the number of mixture components. However, (1) is not the weakest possible condition under which one can learn a mixture of Gaussians. We focus on (1) because in this regime we can learn the mixture components using statistical properties of the data, such as the large principal components of the data and the centers of a good clustering, which are amenable to privacy. In contrast, algorithms that learn with separation proportional to k^{1/4} [64], k^ε [41, 51, 27], or even √(log k) [59] use algorithmic machinery such as the sum-of-squares algorithm that has not been studied from the perspective of privacy, or are not computationally efficient. 
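To make condition (1) concrete, the following small checker tests whether a candidate mixture satisfies the separation. This is our own illustration, not code from the paper: the constant C stands in for the constant and logarithmic factors hidden by the Ω̃, so C is a user-chosen parameter here.

```python
import numpy as np

def is_separated(mus, sigmas, ws, C=1.0):
    """Check separation condition (1) for a candidate mixture.

    mus:    list of k mean vectors, each of shape (d,)
    sigmas: list of k covariance matrices, each of shape (d, d)
    ws:     list of k mixing weights
    C:      illustrative stand-in for the constant hidden by the Omega-tilde
    """
    k = len(mus)
    # ||Sigma^{1/2}||_2 is the square root of the largest eigenvalue of Sigma.
    op_norms = [np.sqrt(np.linalg.eigvalsh(S)[-1]) for S in sigmas]
    for i in range(k):
        for j in range(i + 1, k):
            lhs = np.linalg.norm(mus[i] - mus[j])
            rhs = C * np.sqrt(k + 1.0 / ws[i] + 1.0 / ws[j]) \
                    * max(op_norms[i], op_norms[j])
            if lhs < rhs:
                return False
    return True
```

Note that the bound involves only k and the weights, never the dimension d, matching the discussion above.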
In particular, a barrier to learning with separation k^{1/4} is that the non-private algorithms are based on single-linkage clustering, which is not amenable to privacy due to its crucial dependence on the distance between individual points. We remark that one can also learn without any separation conditions, but only with exp(k) many samples from the distribution [54].

In this work, our goal is to design learning algorithms for mixtures of Gaussians that are also differentially private. An (ε, δ)-differentially private [32] randomized algorithm A for learning mixtures of Gaussians is an algorithm that takes a dataset X of samples and satisfies:

• A is (ε, δ)-differentially private in the worst case. That is, for every pair of datasets X, X′ of n samples differing on one sample, A(X) and A(X′) are (ε, δ)-close in the following sense: for every set of outcomes S, Pr[A(X) ∈ S] ≤ e^ε · Pr[A(X′) ∈ S] + δ.
• If n is sufficiently large and X1, . . . , Xn ∼ D for a mixture D satisfying our assumptions, then A(X) outputs an approximation to the parameters of D.

Note that, while the learning guarantees necessarily rely on the assumption that the data is drawn i.i.d. from some mixture of Gaussians, the privacy guarantee is worst-case. It is important for privacy not to rely on distributional assumptions because we have no way of verifying that the data was truly drawn from a mixture of Gaussians, and if our assumption turns out to be wrong we cannot recover privacy once it is lost.
Furthermore, our algorithms assume certain boundedness of the mixture components. 
Specifically, we assume that there are known quantities R, σmax, σmin, wmin such that

∀ i ∈ [k]:  ‖μi‖2 ≤ R,  σ²min ≤ ‖Σi‖2 ≤ σ²max,  and  wmin ≤ wi.   (2)

For brevity, we define G(ℓ, k, R, σmin, σmax, wmin, s) to be the family of mixtures of k Gaussians in ℓ dimensions satisfying (2) and a separation condition defined by s (in our case, s conforms to (1)). These assumptions are to some extent necessary, as even the state-of-the-art algorithms for learning a single multivariate normal [47] require boundedness.3 However, since R and σmax/σmin can be quite large, and even if they are not we cannot expect the user of the algorithm to know these parameters a priori, the algorithm's sample complexity should depend only mildly on these parameters so that they can be taken to be quite large.

1.2 Our Contributions

The main contribution of our work is an algorithm with improved sample complexity for learning mixtures of Gaussians that are separated and bounded.
Theorem 1.1 (Main, Informal). 
There is an (ε, δ)-differentially private algorithm that takes

n = ( d²/(α²·wmin) + d²/(α·wmin·ε) + poly(k)·d^{3/2}/(wmin·ε) ) · polylog( d·k·R·(σmax/σmin) / (α·β·ε·δ) )

samples from an unknown mixture of k Gaussians D in Rd satisfying (1) and (2), where wmin = mini wi, and, with probability at least 1 − β, learns the parameters of D up to error α.

3These boundedness conditions are also provably necessary to learn even a single univariate Gaussian for pure differential privacy, concentrated differential privacy, or Rényi differential privacy, by the argument of [50]. One could only hope to avoid boundedness using the most general formulation of (ε, δ)-differential privacy.

We remark that the sample complexity in Theorem 1.1 compares favorably to the sample complexity of methods based on subsample-and-aggregate. In particular, when ε ≥ α and k is a small polynomial in d, the sample complexity is dominated by d²/(α²·wmin), which is optimal even for non-private algorithms. In Section 5 we give an optimized version of the subsample-and-aggregate-based reduction from [55] and show that we can learn mixtures of Gaussians with sample complexity roughly Õ(√d/ε) times the sample complexity of the corresponding non-private algorithm. 
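The subsample-and-aggregate pattern referenced here can be sketched generically. The code below is our own minimal illustration for a one-dimensional statistic, not the optimized reduction of Section 5: the data is split into m disjoint chunks, a non-private estimator runs on each chunk, and the clamped per-chunk estimates are averaged with Laplace noise. The function name, the clamp range R, and the Laplace-mean aggregation are illustrative choices.

```python
import numpy as np

def subsample_and_aggregate(data, estimator, m, eps, R, rng=None):
    """Generic subsample-and-aggregate release of a 1-D statistic.

    Splits `data` into m disjoint chunks, runs the (non-private)
    `estimator` on each chunk, clamps each estimate to [-R, R], and
    releases their average with Laplace noise. Changing one sample
    changes at most one chunk's clamped estimate, so the average has
    sensitivity 2R/m, and the release is eps-differentially private.
    """
    rng = np.random.default_rng(rng)
    chunks = np.array_split(np.asarray(data), m)
    estimates = np.clip([estimator(c) for c in chunks], -R, R)
    sensitivity = 2.0 * R / m
    return float(np.mean(estimates) + rng.laplace(scale=sensitivity / eps))
```

The poly(d) blow-up discussed above comes from applying this pattern with a full mixture learner in place of the simple scalar `estimator` used here.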
In contrast, the sample complexity of our algorithm does not grow by dimension-dependent factors compared to the non-private algorithm on which it is based.

At a high level, our algorithm mimics the approach of Achlioptas and McSherry [3], which is to use PCA to project the data into a low-dimensional space, which has the effect of projecting out much of the noise, and then to recursively cluster the data points in that low-dimensional space. However, where their algorithm uses a Kruskal-based clustering algorithm, we have to use alternative clustering methods that are more amenable to privacy. We develop our algorithm in two distinct phases addressing different aspects of the problem:

In Section 3 we consider an "easy case" of Theorem 1.1, where we assume that all components are spherical Gaussians, the variances of the components lie in a small, known range (such that their ratio is bounded by a constant factor), and the means of the Gaussians lie in a small ball around the origin. Under these assumptions, it is fairly straightforward to make the PCA-projection step [64, 3] private. The key piece of the algorithm that needs to be private is computing the principal components of the data's covariance matrix. We can make this step private by adding appropriate noise to this covariance, and the key piece of the analysis is to analyze the effect of this noise on the principal components, extending the work of Dwork et al. [33] on private PCA. Using the assumptions we make in this easy case, we can show that the projection shifts each component's mean by O(√k·σmax), which preserves the separation of the data because all variances are within a constant factor of one another. Then, we iteratively cluster the data using the 1-cluster technique of [57, 56]. 
Lastly, we apply a simplified version of [47] to learn each component.

We then consider the general case, where the Gaussians can be non-spherical and wildly different from each other. In this case, if we directly add noise to the covariance matrix to achieve privacy, then the noise will scale polynomially with σmax/σmin, which is undesirable. For the general case, we use a recursive algorithm, which repeatedly identifies a secluded cluster in the data, and then recurses on this isolated cluster and the points outside of the cluster. To that end we develop in Section 4.1 a variant of the private clustering algorithm of [57, 56] that finds a secluded ball: a set of points that lie inside of some ball Br(p) such that the annulus B10r(p) \ Br(p) is (essentially) empty.4
We can obtain a recursive algorithm in the following way. First we try to find a secluded ball in the unprojected data. If we find one, then we can split and recurse on the inside and outside of the ball. If we cannot find a ball, then we can argue that the diameter of the dataset is poly(d, k, σmax). In the latter case, we can ensure that with poly(d, k) samples, the PCA-projection of the data preserves the mean of each component up to O(√k·σmax), which guarantees that the cluster with the largest variance is secluded, in which case we can find the secluded ball and recurse.

1.3 Related Work

There has also been a great deal of work on learning mixtures of distribution classes, particularly mixtures of Gaussians. 
There are many ways the objective can be defined, including clustering [20, 21, 5, 64, 3, 17, 16, 52, 7, 59, 41, 27, 51], parameter estimation [46, 54, 10, 42, 4, 11, 40, 38, 66, 23, 6], proper learning [35, 36, 22, 63, 26, 53], and improper learning [15].
There has recently been a great deal of interest in differentially private distribution learning, the most relevant works being [50, 47], which focus on learning a single Gaussian. There are also algorithms for learning structured univariate distributions in TV-distance [25], and for learning arbitrary univariate distributions in Kolmogorov distance [13]. Upper and lower bounds for learning the mean of a product distribution over the hypercube in ℓ∞-distance include [12, 14, 32, 61]. [1] focuses on estimating properties of a distribution, rather than the distribution itself. [60] gives an algorithm which allows one to estimate asymptotically normal statistics with minimal increase in the sample complexity. There has also been a great deal of work on distribution learning in the local model of differential privacy [29, 65, 45, 2, 30, 44, 67, 37].
Within differential privacy, there are many algorithms for tasks that are related to learning mixtures of Gaussians, notably PCA [12, 49, 18, 33] and clustering [55, 39, 57, 56, 8, 62, 43]. Applying these algorithms naïvely to the problem of learning Gaussian mixtures would necessarily introduce a polynomial dependence on the range of the data, which we seek to avoid. Nonetheless, private algorithms for PCA and clustering feature in our solution, which builds directly on these works.

4Since [57, 56] call the ball found by their algorithm a good ball, we call ours a terrific ball.

2 Robustness of PCA-Projection to Noise

One of the main tools used in learning mixtures of Gaussians under separation is principal component analysis (PCA). 
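As a concrete illustration of this tool, the sketch below projects the rows of a data matrix onto the top-k eigenvectors of a noisily released Gram matrix. This is our own illustration: the function name is ours, and `noise_scale` is a stand-in for the (ε, δ)-calibrated scale used by the actual algorithm.

```python
import numpy as np

def noisy_kpca_projection(X, k, noise_scale, rng=None):
    """Project rows of X onto the top-k eigenvectors of a noisy X^T X.

    A symmetric Gaussian noise matrix N is added to the Gram matrix
    before the eigendecomposition; `noise_scale` is an illustrative
    stand-in for the scale dictated by the privacy parameters and the
    radius of the data.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    N = rng.normal(scale=noise_scale, size=(d, d))
    N = (N + N.T) / 2.0                      # symmetrize the noise
    _, vecs = np.linalg.eigh(X.T @ X + N)    # eigenvalues in ascending order
    U = vecs[:, -k:]                         # top-k eigenvectors
    return X @ U @ U.T                       # rank-k projection of the rows
```

For well-separated data, the projected cluster means remain far apart even though the projection was computed from a perturbed Gram matrix, which is the phenomenon analyzed in this section.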
Specifically, we project onto the subspace spanned by the top k principal components, which has the effect of preserving the means of the components while eliminating the directions that are purely noise. In this section, we show that PCA achieves a similar effect even when adding additional noise for privacy.
Before showing the result for perturbed PCA, we reiterate the very simple analysis of Achlioptas and McSherry [3]. Let X ∈ R^{n×d} be a matrix of samples and A ∈ R^{n×d} be the rank-k matrix of the corresponding cluster centers. Fixing a cluster i, denoting its empirical mean as μ̄i, the mean of the resulting projection as μ̂i, Π as the k-PCA projection matrix, and ui ∈ {0, 1}^n as the vector indicating which datapoints were sampled from cluster i, we have

‖μ̄i − μ̂i‖2 = ‖(1/ni)·(X^T − (XΠ)^T)·ui‖2 ≤ ‖X − XΠ‖2 · ‖ui‖2/ni ≤ (1/√ni)·‖X − A‖2,

where the last inequality follows from XΠ being the best rank-k approximation of X, whereas A is some rank-k matrix. We extend this analysis to a perturbed k-PCA projection, as given by the following lemma.
Lemma 2.1. Let X ∈ R^{n×d} be a collection of n datapoints from k clusters centered at μ1, μ2, ..., μk. Let A ∈ R^{n×d} be the corresponding matrix of (unknown) centers (in row j we place the center μ_{c(j)}, with c(j) denoting the cluster that point Xj belongs to). Let Π_{Vk} ∈ R^{d×d} denote the k-PCA projection of X's rows. Let Π_U ∈ R^{d×d} be a projection such that for some bound B ≥ 0 it holds that ‖X^T X − (XΠ_U)^T (XΠ_U)‖2 ≤ ‖X^T X − (XΠ_{Vk})^T (XΠ_{Vk})‖2 + B. 
Denote by μ̄i the empirical mean of all points in cluster i, and by μ̂i = Π_U·μ̄i the projection of the empirical mean. Then

‖μ̄i − μ̂i‖2 ≤ (1/√ni)·‖X − A‖2 + √(B/ni).

We can instantiate this in the following lemma for mixtures of Gaussians.
Lemma 2.2. Let X ∈ R^{n×d} be a sample from a mixture of Gaussians D, and let A ∈ R^{n×d} be the matrix where each row i is the (unknown) mean of the Gaussian from which Xi was sampled. For each i, let σ²i denote the maximum directional variance of component i, and wi denote its mixing weight. Define σ² = max_i {σ²i} and wmin = min_i {wi}. If n ≥ (1/wmin)·(ξ1·d + ξ2·log(2k/β)), where ξ1, ξ2 are universal constants, then with probability at least 1 − β,

(1/4)·√(n·wmin)·σ ≤ ‖X − A‖2 ≤ 4·√( n · Σ_{i=1}^{k} wi·σ²i ).

3 A Warm Up: Strongly Bounded Spherical Gaussian Mixtures

We first give an algorithm to learn mixtures of spherical Gaussians whose variances are within a constant factor of one another, whose means lie in a ball of radius k·√d·σ, where σ² is the largest variance, and whose mixing weights are all uniform. We denote such a family of spherical Gaussians by S(d, k, κ, s), where κ is an upper bound on the ratio of the maximum and minimum variances (a constant), and s defines the separation condition. Now, we present the main theorem of this section.
Theorem 3.1. 
There exists an (ε, δ)-differentially private algorithm which, if given n independent samples from D ∈ S(d, k, κ, C·√ℓ), such that C = ξ + 16·√κ, where ξ, κ ∈ Θ(1) and ξ is a universal constant, ℓ = max{512·ln(nk/β), k}, and

n ≥ ( dk/α² + d^{3/2}·k³/ε + dk/(αε) + √d·k·ln(k/β)/(αε) + k²/α² + ℓ^{5/9}·k^{5/3}/ε^{10/9} ) · polylog( ℓ, k, 1/ε, 1/δ, 1/β ),

then it (α, β)-learns D.
We assume that our algorithm is given samples X ∈ R^{n×d}, a parameter σmin that is the exact smallest variance in the mixture, and the parameter κ. The algorithm proceeds as follows:
(1) We first truncate the dataset so that all points lie within a ball of radius 2k·√(dκ)·σmin around the origin. So with high probability, no points in X are lost in this step.
(2) Next, we do privacy-preserving ℓ-PCA. By the guarantees in the previous section and from Theorem 9 of [33], since all variances are almost identical, the distances between projected means are changed by at most O(√ℓ·σmin), which maintains the separation.
(3) Now that the Gaussians have shrunk down, and are still far apart, we can find individual components using an extension of the 1-cluster algorithm of [56], given below.
Theorem 3.2 (Extension of [56]). There is an (ε, δ)-DP algorithm PGLOC(X, t; ε, δ, R, σmin, σmax) with the following guarantee. Let X = (X1, . . . , Xn) ∈ R^{n×d} be a set of n points drawn from a mixture of Gaussians D ∈ G(ℓ, k, R, σmin, σmax, wmin, s). Let S ⊆ X be such that |S| ≥ t, and let 0 < a < 1 be any small absolute constant (say, one can take a = 0.1). 
If t = γn, where 0 < γ ≤ 1, and

n ≥ ( √ℓ/(γε) )^{1/(1−a)} · 9^{log*( √ℓ·(R·σmax/σmin)^ℓ )} · polylog( ℓ, 1/ε, 1/δ, 1/β, 1/γ ) + O( (ℓ + log(k/β))/wmin ),

then for some absolute constant c > 4 that depends on a, with probability at least 1 − β, the algorithm outputs (r, c⃗) such that the following hold: (1) Br(c⃗) contains at least t/2 points of S, that is, |Br(c⃗) ∩ S| ≥ t/2. (2) If ropt is the radius of the smallest ball containing at least t points of S, then r ≤ c·(ropt + (1/4)·√ℓ·σmin).
For simplicity, we set the constant a = 0.1. Since we have balls that isolate individual components in the lower-dimensional space and that give us a constant-factor approximation of each variance, we can partition the original dataset to find a set of nearly optimal balls containing the points from each cluster.
(4) We can finally learn the parameters of each individual Gaussian using a simplified version of the Gaussian learner of [47] that is tailored specifically for spherical Gaussians.
Theorem 3.3. There exists an (ε, δ)-differentially private algorithm PSGE(X; c⃗, r, αμ, ασ, β, ε, δ) with the following guarantee. If Br(c⃗) ⊆ R^ℓ is a ball, X1, . . . 
, Xn ∼ N(μ, σ²·I_{ℓ×ℓ}), and n ≥ nμ + nσ + 6·ln(5/β)/ε, where

nμ = O( ℓ·ln(1/β)/α²μ + r·ln(1/β)/(αμ·ε·σ) + r·√(ℓ·ln(1/δ))/(αμ·ε·σ) ),  nσ = O( ln(1/β)/(α²σ·ℓ) + r²·ln(1/β)/(ασ·ε·σ²·ℓ) + ln(1/β)/(ασ·ε) ),

then with probability at least 1 − β, the algorithm returns μ̂, σ̂ such that if X is contained in Br(c⃗) (that is, Xi ∈ Br(c⃗) for every i) and ℓ ≥ 8·ln(10/β), then ‖μ − μ̂‖2 ≤ αμ·σ and (1 − ασ) ≤ σ̂²/σ² ≤ (1 + ασ).

4 An Algorithm for Privately Learning Mixtures of Gaussians

In this section, we present our main algorithm for privately learning mixtures of Gaussians and prove Theorem 1.1 from the introduction.

4.1 Finding a Terrific Ball

In this section we describe a key building block in our algorithm for learning Gaussian mixtures. This particular subroutine is an adaptation of the work of Nissim and Stemmer [56] (who in turn built on [57]) that finds a ball containing many datapoints. We show how to tweak their algorithm so that it produces a ball with a few additional properties. More specifically, our goal in this section is to privately locate a ball Br(p) that (i) contains many datapoints, (ii) leaves out many datapoints, and (iii) is secluded, in the sense that B2r(p) \ Br(p) holds very few (and ideally zero) datapoints. Formally, we use the following definition.

Definition 4.1. 
Given a dataset X and an integer t > 0, we say a ball Br(p) is (c, Γ)-terrific for a constant c > 1 and parameter Γ ≥ 0 if all of the following three properties hold: (i) the number of datapoints in Br(p) is at least t − Γ; (ii) the number of datapoints outside the ball Bcr(p) is at least t − Γ; and (iii) the number of datapoints in the annulus Bcr(p) \ Br(p) is at most Γ.
We say a ball is c-terrific if it is (c, 0)-terrific, and when c is clear from context we call such a ball terrific. In this section, we provide a differentially private algorithm that locates a terrific ball.
The algorithm of [56] is composed of two subroutines. The first, GoodRadius, privately computes some radius r̃ such that r̃ ≤ 4·ropt, with ropt denoting the radius of the smallest ball that contains t datapoints. The second, GoodCenter, takes r̃ as an input and produces a ball of radius γ·r̃ that holds (roughly) t datapoints, with γ denoting some constant > 2. The GoodCenter procedure works by cleverly combining locality-sensitive hashing (LSH) with randomly chosen axis-aligned boxes to retrieve a polynomial-length list of candidate centers, and then applying ABOVETHRESHOLD to find a center point p such that the ball Br̃(p) satisfies the required condition (holding enough datapoints).
In our modification, the latter procedure is essentially unchanged, with the single condition replaced by the three conditions which we require. The former procedure requires the same modification, but since our new scoring function (which takes into account all three conditions) is no longer monotone or quasi-concave, we cap the scores to facilitate our search. Our modification of GoodRadius is called TerrificRadius, to reflect its stronger guarantees.
The resulting combination of these tools gives the following lemma.
Lemma 4.2. 
The TerrificBall procedure is a (2ε, δ)-DP algorithm for which the following holds. Suppose it is run with size parameter t ≥ (1000·c²/ε)·n^a·√d·log(nd/β)·log(1/δ)·log log(U/L), for some arbitrarily small constant a > 0 (say a = 0.1), and is set to find a c-terrific ball with c > γ (γ being the parameter fed into the LSH in the GoodCenter procedure). Then with probability ≥ 1 − 2β, if it returns a ball Br′(p), this ball is a (c, 2Γ)-terrific ball of radius r′ ≤ (1 + c/10)·r.

4.2 The Algorithm

We finally introduce our differentially private version of the Achlioptas-McSherry algorithm. Recall that we assume the separation condition

∀ i, j:  ‖μi − μj‖ ≥ C·(σi + σj)·( √(k·log n) + 1/√wi + 1/√wj )   (3)

for some constant C > 0, and that n ≥ poly(d, k) (we note that our constant C is larger than that of [3]). We also make one additional technical assumption, that the Gaussians are not "too skinny":

∀ i:  ‖Σi‖F·√(log(n/β)) ≤ (1/8)·tr(Σi)  and  ‖Σi‖2·log(n/β) ≤ (1/8)·tr(Σi).   (4)

Note that for a spherical Gaussian (where Σi = σ²i·I_{d×d}) we have that tr(Σi) = d·σ²i, while ‖Σi‖F = √d·σ²i and ‖Σi‖2 = σ²i; thus the above condition translates to requiring a sufficiently large dimension, an assumption made explicit in the work on learning spherical Gaussians [64].
We now detail the main components of our algorithm. Algorithm 1 takes the dataset X and returns a k-partition of X into subsets corresponding to different mixture components. There are two key points to note about this algorithm. 
First, the parameter k is an upper bound on the number of mixture components that have points in X, so in every recursive call, even though we specify some upper bound k′, the number of clusters returned will match the exact number of components. Second, the partition itself cannot be private (since it is a list of points in the dataset). So once this k-partition is done, one must apply the existing (ε, δ)-DP procedure (called PGE, an adaptation of the private learner for high-dimensional Gaussians from [47] to the case where a tiny fraction of the points may be lost, as described in the full version of this paper) for estimating the mean and the covariance of each cluster, as well as apply the ε-DP histogram to find the cluster weights. The overall algorithm (PGME) is in Algorithm 2.
The main theorem of our section is as follows.
Theorem 4.3. There is an (ε, δ)-differentially private algorithm that takes

n = ( d²/(α²·w_min) + d²/(α·w_min·ε) + (k^9.06·d^(3/2)/(w_min·ε))^(1/(1−a)) + dk/(w_min·ε) + k^(3/2)/(α·w_min·ε) ) · polylog( dkR(σ_max/σ_min) / (αβεδ) )

(where a > 0 is an arbitrarily small constant as in Theorem 3.2) samples from an unknown mixture of k Gaussians D in R^d satisfying (3) and (4), where w_min = min_i w_i, and (α, β)-learns D.

Algorithm 1: Private Gaussian Mixture Partitioning RPGMP(X; k, R, w_min, σ_min, σ_max, ε, δ)
Input: Dataset X ∈ R^{n×d} coming from a mixture of at most k Gaussians, such that each x_i ∈ X has length ≤ O(R + σ_max·√d). Privacy parameters ε, δ > 0; failure probability β > 0.
Output: Partition of X into clusters.

1. If k = 1, skip to last step (#8).
2.
Find a small ball that contains X, and bound the range of points to within that ball:
   Set n′ ← |X| + Lap(2/ε) − n·w_min/20.
   B_{r″}(p) ← PGLOC(X, n′; ε/2, δ, R, σ_min, σ_max).
   Set r ← 12r″.
   Set X ← X ∩ B_r(p).
3. Find a 5-terrific ball in X with t = n·w_min/2:
   B_{r′}(p′) ← PTERRIFICBALL(X, n·w_min/2, c = 5, largest = FALSE; ε, δ, r, √d·σ_min).
4. If the data is separable already, we recurse on each part:
   If B_{r′}(p′) ≠ ⊥ then partition X into A = X ∩ B_{r′}(p′) and B = X \ B_{5r′}(p′) and return
   RPGMP(A; k − 1, r, w_min, σ_min, σ_max, ε, δ) ∪ RPGMP(B; k − 1, r, w_min, σ_min, σ_max, ε, δ).
5. Find a private k-PCA of X: Π ← k-PCA of XᵀX + N, where N is a symmetric matrix whose entries are drawn from N(0, 4r⁴ ln(2/δ)/ε²).
6. Find a 5-terrific ball in XΠ with t = n·w_min/2:
   B_{r′}(p′) ← PTERRIFICBALL(XΠ, n·w_min/2, c = 5, largest = TRUE; ε, δ, r, √k·σ_min).
7. If the projected data is separable, we recurse on each part:
   If B_{r′}(p′) ≠ ⊥ then partition X into A = {x_i ∈ X : Πx_i ∈ B_{r′}(p′)} and B = {x_i ∈ X : Πx_i ∉ B_{5r′}(p′)} and return
   RPGMP(A; k − 1, r, w_min, σ_min, σ_max, ε, δ) ∪ RPGMP(B; k − 1, r, w_min, σ_min, σ_max, ε, δ).
8.
Since the data isn't separable, we treat it as a single Gaussian:
   Set a single cluster C ← {i : x_i ∈ X} and return the singleton {C}.

Algorithm 2: Privately Learn Gaussian Mixture PGME(X; k, R, w_min, σ_min, σ_max, ε, δ, β)
Input: Dataset X ∈ R^{n×d} coming from a k-Gaussian mixture model. Privacy parameters ε, δ > 0; failure probability β > 0.
Output: Model parameters estimation.

1. Truncate the dataset so that for all points, ‖X_i‖₂ ≤ O(R + σ_max·√d).
2. {C_1, ..., C_k} ← RPGMP(X; k, R, w_min, σ_min, σ_max, ε, δ).
3. For j from 1 to k: let (µ_j, Σ_j) ← PGE({x_i : i ∈ C_j}; R, σ_min, σ_max, ε, δ) and ñ_j ← |C_j| + Lap(1/ε).
4. Set weights such that for all j, w_j ← ñ_j / (∑_j ñ_j).
5. Return ⟨µ_j, Σ_j, w_j⟩ for j = 1, ..., k.

5 Sample and Aggregate

In this section, we detail methods based on sample and aggregate, and derive their sample complexity. This will serve as a baseline for comparison with our methods. In short, the method repeatedly runs a non-private learning algorithm, and aggregates the results using the 1-cluster algorithms of [57, 56]. A similar sample and aggregate method was considered in [55], but they focused on a restricted case (when all mixing weights are equal, and all components are spherical with a known variance), and did not explore certain considerations (e.g., how to minimize the impact of a large domain). We provide a more in-depth exploration and attempt to optimize the sample complexity.
The main advantage of the sample and aggregate method we describe here is that it is extremely flexible: given any non-private algorithm for learning mixtures of Gaussians, it can immediately be converted to a private method. However, there are a few drawbacks, which our main algorithm avoids.
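Concretely, the template is: split the dataset into m disjoint chunks, run the non-private learner on each chunk, and privately aggregate the m per-chunk estimates. A minimal sketch, with a simple clip-and-Gaussian-noise average standing in for the 1-cluster aggregation of [57, 56] that the actual method uses (all names and parameters here are illustrative):

```python
import numpy as np

def sample_and_aggregate(X, nonprivate_learner, m, epsilon, delta, clip):
    """Illustrative sample-and-aggregate: one sample influences one chunk,
    hence one of the m estimates, so a clipped average has low sensitivity."""
    chunks = np.array_split(X, m)
    estimates = np.array([nonprivate_learner(chunk) for chunk in chunks])
    # Clip each per-chunk estimate into an l2-ball of radius `clip` so that
    # changing one sample moves the average by at most 2*clip/m in l2.
    norms = np.maximum(np.linalg.norm(estimates, axis=1, keepdims=True), 1e-12)
    estimates = estimates * np.minimum(1.0, clip / norms)
    # Gaussian mechanism calibrated to that l2 sensitivity.
    sigma = (2.0 * clip / m) * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    rng = np.random.default_rng(0)  # fixed seed only for reproducibility here
    return estimates.mean(axis=0) + rng.normal(0.0, sigma, estimates.shape[1])
```

The 1-cluster aggregation of [57, 56] replaces the clipped average, removing the need for an a priori clipping radius; it is also the source of the ℓ₂-ball limitation discussed next.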
First, by the nature of the approach, it will increase the sample complexity multiplicatively by Ω(√d/ε), thus losing any chance of the non-private sample complexity being the dominating term in any parameter regime. Second, it is not clear how to adapt this method to non-spherical Gaussians, since the methods of [57, 56] can only find ℓ₂-balls containing many points, rather than the ellipsoids required by non-spherical Gaussians. We consider aggregation methods that can handle settings where the required metric is unknown to be a very interesting direction for further study. Our main sample-and-aggregate meta-theorem is the following.

Theorem 5.1 (Informal). Let m = Θ̃( (√(kd) + k^1.5)/ε · log²(1/δ) · 2^{O(log*(dRσ_max/(ασ_min)))} ). Given a non-private algorithm which learns a mixture of separated spherical Gaussians with n samples, there exists an (ε, δ)-differentially private algorithm which learns the same mixture with O(mn) samples.

Combining with results from [64], this implies the following private learning algorithm.

Theorem 5.2 (Informal). There exists an (ε, δ)-differentially private algorithm that learns a mixture of spherical Gaussians under the separation condition ‖µ_i − µ_j‖₂ ≥ (σ_i + σ_j) · Ω̃(k^(1/4) · polylog(k, d, 1/ε, log(1/δ), log*(Rσ_max/(ασ_min)))). The number of samples it requires is

n = Õ( (√(kd) + k^1.5)/ε · log²(1/δ) · 2^{O(log*(dRσ_max/(ασ_min)))} + d/(w_min·α²) + k²/α² + d³/w²_min ).

Acknowledgments

Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing.
Parts of this work were done while GK was supported as a Microsoft Research Fellow, as part of the Simons-Berkeley Research Fellowship program, while visiting Microsoft Research, Redmond, and while supported by a University of Waterloo startup grant. This work was done while OS was affiliated with the University of Alberta. OS gratefully acknowledges the Natural Sciences and Engineering Research Council of Canada (NSERC) for its support through grant #2017-06701. JU and VS were supported by NSF grants CCF-1718088, CCF-1750640, and CNS-1816028.

References

[1] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. Inspectre: Privately estimating the unseen. In Proceedings of the 35th International Conference on Machine Learning, ICML '18, pages 30–39. JMLR, Inc., 2018.

[2] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, AISTATS '19, pages 1120–1129. JMLR, Inc., 2019.

[3] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, COLT '05, pages 458–469. Springer, 2005.

[4] Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, and James R. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. In Proceedings of the 27th Annual Conference on Learning Theory, COLT '14, pages 1135–1164, 2014.

[5] Sanjeev Arora and Ravi Kannan. Learning mixtures of arbitrary Gaussians. In Proceedings of the 33rd Annual ACM Symposium on the Theory of Computing, STOC '01, pages 247–257, New York, NY, USA, 2001. ACM.

[6] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan.
Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes. In Advances in Neural Information Processing Systems 31, NeurIPS '18, pages 3412–3421. Curran Associates, Inc., 2018.

[7] Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX '12, pages 37–49. Springer, 2012.

[8] Maria-Florina Balcan, Travis Dick, Yingyu Liang, Wenlong Mou, and Hongyang Zhang. Differentially private clustering in high-dimensional Euclidean spaces. In Proceedings of the 34th International Conference on Machine Learning, ICML '17, pages 322–331. JMLR, Inc., 2017.

[9] Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, and Jonathan Ullman. Algorithmic stability for adaptive data analysis. In Proceedings of the 48th Annual ACM Symposium on the Theory of Computing, STOC '16, pages 1046–1059, New York, NY, USA, 2016. ACM.

[10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 103–112, Washington, DC, USA, 2010. IEEE Computer Society.

[11] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 594–603, New York, NY, USA, 2014. ACM.

[12] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. In Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '05, pages 128–138, New York, NY, USA, 2005. ACM.

[13] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan.
Differentially private release and learning of threshold functions. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science, FOCS '15, pages 634–649, Washington, DC, USA, 2015. IEEE Computer Society.

[14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 1–10, New York, NY, USA, 2014. ACM.

[15] Siu On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 604–613, New York, NY, USA, 2014. ACM.

[16] Kamalika Chaudhuri and Satish Rao. Beyond Gaussians: Spectral methods for learning mixtures of heavy-tailed distributions. In Proceedings of the 21st Annual Conference on Learning Theory, COLT '08, pages 21–32, 2008.

[17] Kamalika Chaudhuri and Satish Rao. Learning mixtures of product distributions using correlations and independence. In Proceedings of the 21st Annual Conference on Learning Theory, COLT '08, pages 9–20, 2008.

[18] Kamalika Chaudhuri, Anand D. Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. Journal of Machine Learning Research, 14(Sep):2905–2943, 2013.

[19] Aref N. Dajani, Amy D. Lauger, Phyllis E. Singer, Daniel Kifer, Jerome P. Reiter, Ashwin Machanavajjhala, Simson L. Garfinkel, Scot A. Dahl, Matthew Graham, Vishesh Karwa, Hang Kim, Philip Lelerc, Ian M. Schmutte, William N. Sexton, Lars Vilhuber, and John M. Abowd. The modernization of statistical disclosure limitation at the U.S. Census Bureau, 2017. Presented at the September 2017 meeting of the Census Scientific Advisory Committee.

[20] Sanjoy Dasgupta.
Learning mixtures of Gaussians. In Proceedings of the 40th Annual IEEE\nSymposium on Foundations of Computer Science, FOCS \u201999, pages 634\u2013644, Washington, DC,\nUSA, 1999. IEEE Computer Society.\n\n[21] Sanjoy Dasgupta and Leonard J. Schulman. A two-round variant of EM for Gaussian mixtures.\nIn Proceedings of the 16th Conference in Uncertainty in Arti\ufb01cial Intelligence, UAI \u201900, pages\n152\u2013159. Morgan Kaufmann, 2000.\n\n[22] Constantinos Daskalakis and Gautam Kamath. Faster and sample near-optimal algorithms\nfor proper learning mixtures of Gaussians. In Proceedings of the 27th Annual Conference on\nLearning Theory, COLT \u201914, pages 1183\u20131213, 2014.\n\n[23] Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suf\ufb01ce\nfor mixtures of two Gaussians. In Proceedings of the 30th Annual Conference on Learning\nTheory, COLT \u201917, pages 704\u2013710, 2017.\n\n[24] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data\nvia the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological),\n39(1):1\u201338, 1977.\n\n[25] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of\nstructured discrete distributions. In Advances in Neural Information Processing Systems 28,\nNIPS \u201915, pages 2566\u20132574. Curran Associates, Inc., 2015.\n\n[26] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair\nStewart. Robust estimators in high dimensions without the computational intractability. In\nProceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS\n\u201916, pages 655\u2013664, Washington, DC, USA, 2016. IEEE Computer Society.\n\n[27] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. List-decodable robust mean estima-\ntion and learning mixtures of spherical Gaussians. 
In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 1047–1060, New York, NY, USA, 2018. ACM.

[28] Differential Privacy Team, Apple. Learning with privacy at scale. https://machinelearning.apple.com/docs/learning-with-privacy-at-scale/appledifferentialprivacysystem.pdf, December 2017.

[29] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS '13, pages 429–438, Washington, DC, USA, 2013. IEEE Computer Society.

[30] John C. Duchi and Feng Ruan. The right complexity measure in locally private estimation: It is not the Fisher information. arXiv preprint arXiv:1806.05756, 2018.

[31] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.

[32] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Conference on Theory of Cryptography, TCC '06, pages 265–284, Berlin, Heidelberg, 2006. Springer.

[33] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze Gauss: Optimal bounds for privacy-preserving principal component analysis. In Proceedings of the 46th Annual ACM Symposium on the Theory of Computing, STOC '14, pages 11–20, New York, NY, USA, 2014. ACM.

[34] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS '14, pages 1054–1067, New York, NY, USA, 2014. ACM.

[35] Jon Feldman, Ryan O'Donnell, and Rocco A. Servedio.
PAC learning axis-aligned mixtures of\nGaussians with no separation assumption. In Proceedings of the 19th Annual Conference on\nLearning Theory, COLT \u201906, pages 20\u201334, Berlin, Heidelberg, 2006. Springer.\n\n[36] Jon Feldman, Ryan O\u2019Donnell, and Rocco A. Servedio. Learning mixtures of product distribu-\n\ntions over discrete domains. SIAM Journal on Computing, 37(5):1536\u20131564, 2008.\n\n[37] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private con\ufb01dence intervals: Z-test and\ntight con\ufb01dence intervals. In Proceedings of the 22nd International Conference on Arti\ufb01cial\nIntelligence and Statistics, AISTATS \u201919, pages 2545\u20132554. JMLR, Inc., 2019.\n\n[38] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of Gaussians in high\ndimensions. In Proceedings of the 47th Annual ACM Symposium on the Theory of Computing,\nSTOC \u201915, pages 761\u2013770, New York, NY, USA, 2015. ACM.\n\n[39] Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially\nprivate combinatorial optimization. In Proceedings of the 21st Annual ACM-SIAM Symposium\non Discrete Algorithms, SODA \u201910, pages 1106\u20131125, Philadelphia, PA, USA, 2010. SIAM.\n\n[40] Moritz Hardt and Eric Price. Sharp bounds for learning a mixture of two Gaussians.\n\nIn\nProceedings of the 47th Annual ACM Symposium on the Theory of Computing, STOC \u201915, pages\n753\u2013760, New York, NY, USA, 2015. ACM.\n\n[41] Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In\nProceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC \u201918, pages\n1021\u20131034, New York, NY, USA, 2018. ACM.\n\n[42] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: Moment methods\nand spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical\nComputer Science, ITCS \u201913, pages 11\u201320, New York, NY, USA, 2013. 
ACM.\n\n[43] Zhiyi Huang and Jinyan Liu. Optimal differentially private algorithms for k-means clustering. In\nProceedings of the 37th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database\nSystems, PODS \u201918, pages 395\u2013408, New York, NY, USA, 2018. ACM.\n\n[44] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Zhiwei Steven Wu. Locally private\n\ngaussian estimation. arXiv preprint arXiv:1811.08382, 2018.\n\n[45] Peter Kairouz, Keith Bonawitz, and Daniel Ramage. Discrete distribution estimation under\nlocal privacy. In Proceedings of the 33rd International Conference on Machine Learning, ICML\n\u201916, pages 2436\u20132444. JMLR, Inc., 2016.\n\n[46] Adam Tauman Kalai, Ankur Moitra, and Gregory Valiant. Ef\ufb01ciently learning mixtures of two\nGaussians. In Proceedings of the 42nd Annual ACM Symposium on the Theory of Computing,\nSTOC \u201910, pages 553\u2013562, New York, NY, USA, 2010. ACM.\n\n[47] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-\ndimensional distributions. In Proceedings of the 32nd Annual Conference on Learning Theory,\nCOLT \u201919, 2019.\n\n[48] Gautam Kamath, Or Sheffet, Vikrant Singhal, and Jonathan Ullman. Differentially private\n\nalgorithms for learning mixtures of separated gaussians, 2019.\n\n[49] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In\nProceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA \u201913,\npages 1395\u20131414, Philadelphia, PA, USA, 2013. SIAM.\n\n[50] Vishesh Karwa and Salil Vadhan. Finite sample differentially private con\ufb01dence intervals. In\nProceedings of the 9th Conference on Innovations in Theoretical Computer Science, ITCS \u201918,\npages 44:1\u201344:9, Dagstuhl, Germany, 2018. Schloss Dagstuhl\u2013Leibniz-Zentrum fuer Informatik.\n\n[51] Pravesh Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved\nclustering via sum of squares. 
In Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC '18, pages 1035–1046, New York, NY, USA, 2018. ACM.

[52] Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 299–308, Washington, DC, USA, 2010. IEEE Computer Society.

[53] Jerry Li and Ludwig Schmidt. Robust proper learning for mixtures of Gaussians via systems of polynomial inequalities. In Proceedings of the 30th Annual Conference on Learning Theory, COLT '17, pages 1302–1382, 2017.

[54] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS '10, pages 93–102, Washington, DC, USA, 2010. IEEE Computer Society.

[55] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on the Theory of Computing, STOC '07, pages 75–84, New York, NY, USA, 2007. ACM.

[56] Kobbi Nissim and Uri Stemmer. Clustering algorithms for the centralized and local models. In Algorithmic Learning Theory, ALT '18, pages 619–653. JMLR, Inc., 2018.

[57] Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Locating a small cluster privately. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '16, pages 413–427, New York, NY, USA, 2016. ACM.

[58] Karl Pearson. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, pages 71–110, 1894.

[59] Oded Regev and Aravindan Vijayaraghavan.
On learning mixtures of well-separated Gaussians.\nIn Proceedings of the 58th Annual IEEE Symposium on Foundations of Computer Science,\nFOCS \u201917, pages 85\u201396, Washington, DC, USA, 2017. IEEE Computer Society.\n\n[60] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. In\nProceedings of the 43rd Annual ACM Symposium on the Theory of Computing, STOC \u201911,\npages 813\u2013822, New York, NY, USA, 2011. ACM.\n\n[61] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. The\n\nJournal of Privacy and Con\ufb01dentiality, 7(2):3\u201322, 2017.\n\n[62] Uri Stemmer and Haim Kaplan. Differentially private k-means with constant multiplicative error.\nIn Advances in Neural Information Processing Systems 31, NeurIPS \u201918, pages 5431\u20135441.\nCurran Associates, Inc., 2018.\n\n[63] Ananda Theertha Suresh, Alon Orlitsky, Jayadev Acharya, and Ashkan Jafarpour. Near-\noptimal-sample estimators for spherical Gaussian mixtures. In Advances in Neural Information\nProcessing Systems 27, NIPS \u201914, pages 1395\u20131403. Curran Associates, Inc., 2014.\n\n[64] Santosh Vempala and Grant Wang. A spectral algorithm for learning mixtures of distributions.\nIn Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science,\nFOCS \u201902, pages 113\u2013123, Washington, DC, USA, 2002. IEEE Computer Society.\n\n[65] Shaowei Wang, Liusheng Huang, Pengzhan Wang, Yiwen Nie, Hongli Xu, Wei Yang, Xiang-\nYang Li, and Chunming Qiao. Mutual information optimally local private discrete distribution\nestimation. arXiv preprint arXiv:1607.08025, 2016.\n\n[66] Ji Xu, Daniel J. Hsu, and Arian Maleki. Global analysis of expectation maximization for\nmixtures of two Gaussians. In Advances in Neural Information Processing Systems 29, NIPS\n\u201916, pages 2676\u20132684. Curran Associates, Inc., 2016.\n\n[67] Min Ye and Alexander Barg. 
Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory, 64(8):5662–5676, 2018.