{"title": "Private Testing of Distributions via Sample Permutations", "book": "Advances in Neural Information Processing Systems", "page_first": 10878, "page_last": 10889, "abstract": "Statistical tests are at the heart of many scientific tasks.\nTo validate their hypothesis, researchers in medical and \nsocial sciences use individuals' data. The sensitivity of \nparticipants' data requires the design of statistical tests that \nensure the privacy of the individuals in the most efficient way. \nIn this paper, we use the framework of property testing to design algorithms to test the properties of the distribution that the data is drawn from with respect to differential privacy. \nIn particular, we investigate testing two fundamental properties of distributions: (1) testing the equivalence of two distributions when we have unequal numbers of \nsamples from the two distributions. \n(2) Testing independence of two random variables. \nIn both cases, we show that our testers achieve near optimal sample complexity (up to logarithmic factors). \nMoreover, our dependence on the privacy parameter is an additive term, which indicates that differential privacy can be obtained \nin most regimes of parameters for free.", "full_text": "Private Testing of Distributions via Sample\n\nPermutations\n\nMaryam Aliakbarpour\n\nCSAIL, MIT\n\nmaryama@mit.edu\n\nDaniel Kane\n\nUniversity of California, San Diego\n\ndakane@ucsd.edu\n\nIlias Diakonikolas\n\nUniversity of Wisconsin, Madison\nilias.diakonikolas@gmail.com\n\nRonitt Rubinfeld\nCSAIL, MIT, TAU\n\nronitt@csail.mit.edu\n\nAbstract\n\nStatistical tests are at the heart of many scienti\ufb01c tasks. To validate their hypothe-\nses, researchers in medical and social sciences use individuals\u2019 data. 
The sensi-\ntivity of participants\u2019 data requires the design of statistical tests that ensure the\nprivacy of the individuals in the most ef\ufb01cient way.\nIn this paper, we use the\nframework of property testing to design algorithms to test the properties of the dis-\ntribution that the data is drawn from with respect to differential privacy. In particu-\nlar, we investigate testing two fundamental properties of distributions: (1) testing\nthe equivalence of two distributions when we have unequal numbers of samples\nfrom the two distributions. (2) Testing independence of two random variables. In\nboth cases, we show that our testers achieve near optimal sample complexity (up\nto logarithmic factors). Moreover, our dependence on the privacy parameter is an\nadditive term, which indicates that differential privacy can be obtained in most\nregimes of parameters for free.\n\n1\n\nIntroduction\n\nWe study questions in statistical hypothesis testing, a \ufb01eld of statistics with fundamental importance\nin scienti\ufb01c discovery. At a high level, given samples from an unknown statistical model, the goal of\na hypothesis test is to determine whether the model has a desired property. The \ufb01rst \u2014 and arguably\nthe most fundamental \u2014 objective in hypothesis testing is to make an accurate determination with\nas few samples as possible. In this work, we focus on understanding the trade off between sample\nsize and an additional important criterion \u2014 preserving the privacy of the underlying data sets.\nEarly work in statistics [Pea00, NP33] studied the asymptotic regime, where the sample size goes to\nin\ufb01nity. In this paper, we are interested in obtaining \ufb01nite sample bounds in the minimax setting that\nhas been extensively studied in the computer science and information theory communities during the\npast couple of decades. 
More speci\ufb01cally, we will work with the formalism of distribution property\ntesting [BFR+00, BFR+13]: Given samples from a collection of unknown probability distributions\nover discrete domains, do the underlying distributions satisfy a desired property P or are they \u201cfar\u201d\nfrom satisfying the property? (The de\ufb01nition of \u201cfar\u201d is typically quanti\ufb01ed via some global error\nmetric, e.g., the total variation distance. See Preliminaries section.) The goal is to develop testers\nfor various properties with information-theoretically optimal sample complexity, which is typically\nsublinear in the domain sizes of the underlying distributions. We note that, in recent years, such\nsample-optimal methods have been obtained for testing a range of properties, including identity\ntesting (\u201cgoodness-of-\ufb01t\u201d), closeness testing (\u201cequivalence testing\u201d or \u201ctwo-sample testing\u201d), and\nindependence testing.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fThe primary goal in classical statistics theory is to minimize the sample size of inference tasks. In\nrecent years, a wide range of settings involves performing hypothesis testing tasks on sensitive data\nrepresenting speci\ufb01c individuals, such as data describing medical or other behavioral phenomena.\nIn such cases, the outputs of standard tests may reveal private information that should not be di-\nvulged. Differential privacy [Dwo09, DR14a] is a formal framework (see Preliminaries section) that\nmay allow us to obtain the scienti\ufb01c bene\ufb01t of statistical tests without compromising the privacy of\nthe underlying individuals. Intuitively, differential privacy postulates that similar data sets should\nhave statistically close outputs; and that once this guarantee is achieved, then provable privacy is\npreserved. 
Differentially private data analysis is a very active research area, in which a wealth of\ntechniques have been developed for a range of tasks.\nWhen designing a differentially private hypothesis testing algorithm, there are two criteria to balance.\nOn the one hand, we require that the algorithm satis\ufb01es the differential privacy condition on any\ninput dataset. On the other hand, we require that the algorithm is a valid statistical hypothesis tester\n(i.e., it correctly distinguishes inputs satisfying property P from inputs that are far from satisfying\nP). These competing criteria suggest that, in general, the sample size required to ensure both of\nthem grows compared to the non-private setting. Recent work [CDK17, ADR18, ASZ18] has shown\nthat for basic tasks, such as identity testing and equivalence testing, the sample size increase of a\ndifferentially private tester compared to its non-private analogue is negligible.\nIn this work, we continue this line of investigation. We give a new general technique that yields\nsample-ef\ufb01cient differentially private testers and apply it for two fundamental statistical tasks: the\nproblem of equivalence testing with unequal sized samples and the problem of independence testing\n(de\ufb01ned in the following paragraph). Notably, prior techniques were inherently unable to provide\nsample-ef\ufb01cient private testers for either of these problems.\n\nOur Contributions. The main contribution of this work is a general technique for preserving\nprivacy that in particular can be used to obtain sample-ef\ufb01cient and differentially private hypothesis\ntesters based on the algorithmic technique in the work of [DK16]. 
This technique can be applied to several testing problems and, in particular, yields the only known sample-optimal testers for the following problems:

Equivalence Testing with Unequal Sized Samples: Given a target error parameter ε > 0, s1 independent draws from an unknown distribution p over [n], and s2 independent draws from an unknown distribution q over [n], distinguish the case that p = q from the case that ∥p − q∥1 ≥ ε.

Independence Testing: Given a target error parameter ε > 0 and s independent draws from an unknown distribution p over [n] × [m], distinguish the case that p is a product distribution (i.e., its two coordinates are independent) from the case that p is ε-far, in ℓ1-distance, from any product distribution.

These problems have been extensively investigated in distribution testing during the past decade [BFF+01, LRR11, CDVV14, VV14, AJOS14, ADK15, BV15, DK16], and sample-optimal testers are known for them in the non-private setting [DK16]. In this work, we design the first differentially private testers for these problems with optimal (or near-optimal) sample complexity. In particular, we show that the sample complexity of both these problems in the private setting is nearly the same as in the non-private setting, i.e., privacy comes essentially for free.

In this work, we focus on privatizing the optimal non-private testers for the above problems presented in [DK16]. The algorithmic technique of [DK16] splits the samples into two groups, "flattening samples" and "testing samples", that are used in very different ways. To obtain privacy guarantees, we must design algorithms that have low sensitivity to the samples: that is, changing one sample should not have much effect on the outcome.
Previous works on private hypothesis testing are based on algorithms that use the samples analogously to the use of the "testing samples" in [DK16]. One can similarly use those techniques to design algorithms with low sensitivity with respect to the "testing samples" in our setting. However, since using "flattening samples" is key to achieving testers with the optimal sample complexity, we also need low sensitivity with respect to the "flattening samples". As the output of the testing algorithm can be very sensitive to small changes in the set of "flattening samples", more care is required in designing low-sensitivity algorithms that use them. In a nutshell, we present a technique for designing private testers which considers the output of the algorithm in [DK16] on every permutation of the samples and outputs a result based on the aggregate. In order to show that this approach gives the correct answer, we have to show not only that the expectation of the result is the same as that of the non-private tester (which is straightforward), but also that the variance is small enough so that the resulting output is correct with high probability (which does not follow from the calculation of the variance of the non-private testers). For the independence testing problem, we need an extra step to reduce sensitivity further. We first show that trying every permutation results in low sensitivity in the typical case (over the random samples), and thus gives a private tester. In the non-typical case, more care must be taken: we give an algorithm for modifying the samples in such a way that we can reduce to the typical case. We describe the challenges and techniques in more detail in Section 3.
Though our methods are mainly tailored for use with the [DK16] testers, there are several other distribution property testing algorithms which use samples in similarly sensitive ways, e.g., [CDGR16]; the hope is that these techniques will prove fruitful in allowing those algorithms to be made differentially private as well.

Related Work. The field of distribution property testing [BFR+00] has been extensively investigated in the past couple of decades; see [Rub12, Can15, Gol17]. A large body of the literature has focused on characterizing the sample size needed to test properties of arbitrary discrete distributions. This regime is fairly well understood: for many properties of interest there exist sample-efficient testers [Pan08, CDVV14, VV14, DKN15b, ADK15, CDGR16, DK16, DGPP16, CDS17, Gol17, DGPP17, BC17, DKS18, CDKS17b]. More recently, an emerging body of work has concentrated on leveraging a priori structure of the underlying distributions to obtain significantly improved sample complexities [BKR04, DDS+13, DKN15b, DKN15a, CDKS17a, DP17, DKN17, DKP19].

Differential privacy was first introduced in [DMNS06]. Recently, a new line of research studies distribution testing and learning problems with respect to differential privacy [DHS15, CDK17, ADR18, ASZ18]. The focus of these works is on testing identity and closeness of distributions, leaving other properties mostly unexplored. Other models for distribution testing problems with respect to differential privacy have been studied [WLK15, GLRV16, KR17, KFS17]. In most of these latter works, only a type I error guarantee is provided, which is a significantly weaker guarantee compared to ours. In addition, other settings for privacy, e.g., local privacy, have been investigated [She18, GR18, ACFT19].

2 Preliminaries

2.1 Definitions

Notation: We use [n] to denote the set {1, 2, …, n}.
We consider discrete distributions over a finite domain, in particular, over [n] without loss of generality. For a distribution p, we write p(i) to denote the probability of element i in [n]. One can assume each distribution has an associated probability function p : [n] → [0, 1] such that the p(i)'s are non-negative and Σ_{i∈[n]} p(i) = 1. For a set S ⊆ [n], p(S) denotes the total probability of the elements in S (i.e., Σ_{i∈S} p(i)). Note that one can think of each discrete distribution over [n] as a vector in R^n where the i-th coordinate is the probability of element i. Having said that, we can define the ℓk-norm of a distribution in the same manner as for a vector: For a vector x ∈ R^n and any k > 0, the ℓk-norm of x is equal to (Σ_{i∈[n]} |x_i|^k)^{1/k}, and is denoted by ∥x∥k. In addition, the ℓk-distance between two distributions p and q over [n] is equal to ∥p − q∥k. Throughout this paper, we use the ℓ1-distance to measure the discrepancy between distributions, which is equivalent to the total variation distance up to a factor of two. In particular, we say distribution p is ε-far from distribution q if ∥p − q∥1 ≥ ε. We use Lap(λ) to denote the zero-mean Laplace distribution with parameter λ. The probability density function of the Laplace distribution with parameter λ at point x ∈ R is Lap(x; λ) = e^{−|x|/λ}/(2λ). We use the same notation for a distribution and a random variable drawn from it interchangeably when the difference is clear from the context.

Distribution Testing: Formally, we define a property P to be a set of distributions. We say a distribution p has the property P if p ∈ P; and we say p is ε-far from having the property P when p is ε-far from all distributions in P. Assume an algorithm has sample access to a distribution p.
We say the algorithm is an (ε, δ)-tester for property P if the following holds with probability at least 1 − δ:

• Completeness case: If p has the property P, then the algorithm outputs accept.
• Soundness case: If p is ε-far from P, then the algorithm outputs reject.

Analogously, one can generalize the above definition to the case in which we have sample access to two distributions. In particular, given sample access to two distributions p and q over [n], an (ε, δ)-tester for testing closeness of p and q distinguishes the following cases with probability at least 1 − δ:

• Completeness case: If p is equal to q, then the algorithm outputs accept.
• Soundness case: If p is ε-far from q, then the algorithm outputs reject.

Testing closeness of distributions via the flattening technique: In this paper, we build on the non-private closeness tester presented in [DK16, CDVV14]. Here, we give an overview of the closeness tester and the flattening technique, which turns it into a tester with the optimal sample complexity. Suppose we have sample access to two distributions p and q on the domain [n]. Let s be a parameter that determines the expected number of samples. We draw Poi(s) samples from p and q. For each i ∈ [n], let Xi and Yi denote the number of occurrences of element i in the sample sets from p and q respectively. In [CDVV14], the authors proposed the following statistic to test the closeness of p and q:

Z = Σ_{i=1}^{n} [(Xi − Yi)² − Xi − Yi].

The expected value of Z is proportional to the squared ℓ2-distance of p and q, which enables us to use Z for testing closeness of p and q. While this statistic mainly measures the ℓ2-distance, one can use it for testing in ℓ1-distance as well by using trivial inequalities between the distances.
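As a concrete illustration, the statistic Z is straightforward to compute from the two sets of samples. The following is a minimal sketch of ours (not the authors' code), taking the domain to be {1, …, n}:

```python
from collections import Counter

def closeness_statistic(samples_p, samples_q, n):
    """Compute Z = sum over i in [n] of (X_i - Y_i)^2 - X_i - Y_i,
    where X_i and Y_i count the occurrences of element i in the
    sample sets drawn from p and q, respectively."""
    X = Counter(samples_p)
    Y = Counter(samples_q)
    return sum((X[i] - Y[i]) ** 2 - X[i] - Y[i] for i in range(1, n + 1))
```

Under Poissonized sampling, each term of the sum has expectation zero when p = q, which is why Z stays below the threshold in the completeness case.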
By a careful analysis of the variance of Z, it is shown that Z concentrates around its expectation when we draw at least s = Θ(n · max(∥p∥2, ∥q∥2)/ε²) samples.¹ More specifically, it is shown that in the case where p = q, Z is below a threshold parameter τ; and when p is ε-far from q, Z is at least τ with high probability. Thus, we can test the closeness of p and q by computing Z from a large enough sample set and comparing it with the threshold τ.

The above algorithm is sample-efficient only when max(∥p∥2, ∥q∥2) is not too large. In [DK16], the authors provide a technique, called flattening, that decreases the ℓ2-norm of a distribution. Using this technique, they map p and q to two other distributions, at least one of which has smaller ℓ2-norm. Then, they obtain a sample-optimal ℓ1-closeness tester for p and q by showing its equivalence to closeness testers for the two distributions obtained after flattening.

We discuss the flattening technique in more detail. To flatten a distribution p, we need a (multi)set of the domain elements, denoted by F. This set is usually obtained by drawing samples from the underlying distributions, and the elements of F are called flattening samples. Using the flattening samples, we transform p into another distribution p(F) over a larger domain in such a way that the ℓ2-norm of p(F) is small. We build the new domain for p(F) as follows: For each element i in the domain of p, we first count the number of occurrences of i in F, namely ki, and put bi := ki + 1 elements associated to i in the new domain. We refer to the elements of the new domain as buckets. We define p(F) to be the distribution that assigns probability mass p(i)/bi to each of the bi buckets of i, for all i in the domain.
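The bucket construction just described can be sketched as follows (our illustration, assuming p is given explicitly as a dictionary from elements to probabilities):

```python
from collections import Counter

def flatten_distribution(p, flattening_samples):
    """Given a distribution p over its support (dict: element -> probability)
    and a multiset F of flattening samples, build the flattened distribution
    p^(F): element i gets b_i = k_i + 1 buckets, where k_i is the number of
    occurrences of i in F, and each bucket (i, j) carries mass p(i) / b_i."""
    k = Counter(flattening_samples)
    flattened = {}
    for i, mass in p.items():
        b_i = k[i] + 1
        for j in range(b_i):
            flattened[(i, j)] = mass / b_i
    return flattened
```

Heavy elements tend to appear often in F and are therefore split across many buckets, which is exactly what lowers the ℓ2-norm.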
Note that one can generate a sample from p(F) upon receiving a sample from p: For a fresh sample X = i drawn from p, the flattening procedure maps it to a randomly selected bucket j among the bi buckets of i. Then, it outputs X′ = (i, j) as a sample from p(F). The above procedure has several important properties which help us later in our analysis:

• By the above construction, it is clear that the size of the new domain is Σ_i bi = n + |F|. Thus, as long as |F| is not larger than n (which holds in the regime where we have a sublinear number of samples), the size of the domain increases only by a constant factor.
• It is shown that if F contains Poi(k) samples from p, then the expected squared ℓ2-norm of p(F) is at most 1/k.
• If we flatten two distributions using the same assignments for the buckets (i.e., the same flattening set F), the ℓ1-distance between the two distributions remains unchanged. Thus, it suffices to test closeness of p(F) and q(F) in order to test the closeness of p and q.

Privacy: We say two sample sets, X and X′, from a universe [n] are neighboring if and only if their Hamming distance is one (meaning that they differ in exactly one sample). A randomized algorithm A is ξ-private if for any subset S of the possible outputs of the algorithm ({accept, reject} in the context of this paper), and for any two neighboring X and X′, the following holds:

Pr[A(X) ∈ S] ≤ e^ξ · Pr[A(X′) ∈ S].

¹In fact, as a byproduct of our analysis, one can show it suffices to have s ≥ Θ(n · min(∥p∥2, ∥q∥2)/ε²).

For a function f over sample sets, the sensitivity of f is defined as follows:

Δ(f) = max_{X, X′} |f(X) − f(X′)|,

where the maximum is taken over all possible sample sets that differ in only one sample.
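To make the sensitivity definition concrete: for a simple counting function, changing one sample changes the value by at most one, so Δ(f) = 1, and adding Laplace noise of scale Δ(f)/ξ privatizes it. The sketch below is a generic illustration of ours, not the paper's testers; it draws Laplace noise as the difference of two independent exponential draws, a standard identity:

```python
import random

def count_occurrences(samples, element):
    """f(X) = number of occurrences of `element` in X.
    Changing one sample alters this count by at most 1, so Delta(f) = 1."""
    return sum(1 for x in samples if x == element)

def privatize(value, sensitivity, xi):
    """Release `value` with Laplace noise of scale sensitivity / xi, which
    makes the released value xi-differentially private."""
    scale = sensitivity / xi
    # Exp(1/scale) - Exp(1/scale) is distributed as Laplace with scale `scale`.
    return value + random.expovariate(1 / scale) - random.expovariate(1 / scale)
```

For instance, `privatize(count_occurrences(X, 3), 1.0, xi)` releases a ξ-private count of element 3.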
A standard method for making functions private is the Laplace mechanism [DR14b]. In this mechanism, to make a function f ξ-differentially private, one adds Laplace noise to the output of the function: f̂(X) := f(X) + Lap(Δ(f)/ξ). It is easy to show that f̂ satisfies the definition of differential privacy with parameter ξ.

3 General approach for making closeness-based testers private

As we mentioned earlier, several properties can be tested via a reduction to the flattening-based closeness tester. These reductions and the resulting optimal testers were presented in [DK16]. In this section, we describe a general approach for making such testers differentially private. We start by explaining the structure of the existing reductions in the non-private setting. Next, we explain our main idea for making the reductions and the testers private, which is to derandomize the non-private tester. Then we give the characteristics of the reductions that can be turned into a private algorithm (see Definition 3.2). In particular, if a property can be tested via such a reduction to the closeness testing problem, it can also be privately tested via a reduction to our general private closeness tester. At the end, we describe our algorithm; we prove its correctness in the full version.

3.1 Reduction procedure in the non-private setting

In this section, we elaborate on the structure of the reductions to the closeness tester with the use of the flattening technique proposed in [DK16]. Suppose we aim to test whether a distribution² has property P or is ε-far from any distribution in P via a reduction to the flattening-based closeness tester in the non-private setting.
The reduction has the following structure: Upon receiving a sample set from the underlying distribution, the reduction procedure splits the samples into two sets: test samples, denoted by T, and flattening samples, denoted by F. The reduction procedure uses these sample sets as follows:

• Test samples: The reduction procedure uses the test samples to generate samples from two distributions p and q over a domain of size n. The distributions p and q are designed in such a way that if the underlying distribution has the property P, then p and q are the same; and if the underlying distribution is ε-far from any distribution that has the property, then p and q are Θ(ε)-far from each other as well. This transformation is essentially the core of the reduction to the closeness testing problem.
• Flattening samples: In addition to samples from p and q, the reduction procedure generates n positive integers b1, b2, …, bn that indicate the number of buckets for each domain element. These numbers are used for the flattening of p and q. (See the Preliminaries section for more details.)

An example of such a reduction is testing independence. Suppose d is a distribution over [n] × [m], and our goal is to test whether the two coordinates of the samples drawn from d are independent or not. It is not hard to see that this problem is equivalent to testing whether p := d is equal to q := d1 × d2, where d1 and d2 are the two marginal distributions of d.
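At the level of explicit distributions, this reduction is easy to state. The sketch below is our illustration only (the actual tester sees samples, not the table of probabilities); it assumes d is given as an n × m table:

```python
def independence_to_closeness(d):
    """Given d over [n] x [m] as a nested list of probabilities, form the
    pair of the reduction: p = d and q = d1 x d2, the product of the two
    marginals. d is independent iff p = q; the l1 distance between them is
    the quantity the closeness tester estimates from samples."""
    n, m = len(d), len(d[0])
    d1 = [sum(d[i][j] for j in range(m)) for i in range(n)]  # first marginal
    d2 = [sum(d[i][j] for i in range(n)) for j in range(m)]  # second marginal
    q = [[d1[i] * d2[j] for j in range(m)] for i in range(n)]
    l1 = sum(abs(d[i][j] - q[i][j]) for i in range(n) for j in range(m))
    return q, l1
```

A fully correlated d (mass only on the diagonal) maximizes the gap, while any product distribution gives an ℓ1 distance of zero.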
For more examples, see [DK16].

After the reduction procedure generates samples from p and q, and the number of buckets for each domain element, we use the flattening-based closeness tester for testing the closeness of p and q. As we explained in the Preliminaries section, we first transform p and q to p(F) and q(F) on the new domain:

D = {(i, j) | i ∈ [n] and j ∈ [bi]}.

The hope is that after flattening, the ℓ2-norm of at least one of the resulting distributions is small. Recall that the probability of each element (i, j) in D, i.e., a bucket, according to p(F) is p(i)/bi.

²We may have more than one underlying distribution.

[Figure 1 here. The figure shows the standard pipeline: the sample set X from the underlying distribution(s) is split by the reduction procedure into flattening samples F, which provide the number of buckets bi for each element i, and test samples T, which provide samples from p and q; the tester computes Z and compares it with a threshold τ: if Z ≤ τ, it outputs accept; otherwise, it outputs reject.]

Figure 1: Standard reduction procedure for testing closeness of two distributions.

The closeness tester transforms samples from p into samples from p(F) (and similarly for q). For each sample i from p, the closeness tester assigns it to the bucket (i, j), where j is picked uniformly at random from [bi]. We denote the number of occurrences of bucket (i, j) in the sample set from p(F) (and q(F)) by vi,j,1 (and vi,j,2). The flattening-based closeness tester computes the following statistic Z:

Z := Σ_{i=1}^{n} Σ_{j=1}^{bi} [(vi,j,1 − vi,j,2)² − vi,j,1 − vi,j,2],   (1)

and compares it with a threshold to establish whether p(F) and q(F) are equal or Θ(ε)-far from each other. Since the transformation to p(F) and q(F) does not change the ℓ1-distance between p and q, the output of the tester determines whether d has the property P or is ε-far from it.
See Figure 1 for an overview of this process.

3.2 Derandomizing the non-private tester

To develop a differentially private algorithm for closeness testing with flattening, we "derandomize" the standard non-private closeness tester provided in [DK16, CDVV14]. The derandomization of the tester results in a stable statistic, meaning that if we change one sample in the sample set, the value of the statistic does not change drastically. The stability implies that the statistic has low sensitivity, and it can be privatized using a smaller number of extra samples.

In the previous section, we explained how the reduction and the tester work. Note that there are two steps at which the algorithm makes random choices: (i) The algorithm splits the samples into two sets F and T (flattening and test samples): Upon receiving a set of s + k samples, X = {x1, x2, …, x_{s+k}}, where s and k are the numbers of test and flattening samples respectively, the algorithm usually assigns the first s samples to the test set and the rest to the flattening set. Equivalently, we can view this step as follows: the algorithm permutes the samples according to a random permutation π, and then splits them into T and F. In other words, T is equal to {x_{π(1)}, x_{π(2)}, …, x_{π(s)}}, and F is equal to {x_{π(s+1)}, x_{π(s+2)}, …, x_{π(s+k)}}. (ii) The algorithm randomly selects a bucket for each sample from p and q. Let r denote the string of random bits that the algorithm uses to choose the buckets. We eliminate the randomness of these two steps by setting our new statistic, Z̄, to be the expected value of the statistic Z over the random choices of the algorithm.
More precisely, for a given input sample set X, we define:

Z̄(X) := E_{π,r}[Z | X].

We can simplify the above statistic a step further by computing the closed form of the expected value of Z over all the random choices of r, i.e., E_r[Z | X, π]. We provide the exact value in the following lemma, which is proved in the full version.

Lemma 3.1. Let si,1 (similarly si,2) be the number of occurrences of element i in the sample set from p (from q). Assume bi is the number of buckets assigned to element i. Let vi,j,1 (similarly vi,j,2) be the number of occurrences of bucket (i, j) in the sample set from p(F) (from q(F)). Then, we have:

E_r[ Σ_{j=1}^{bi} [(vi,j,1 − vi,j,2)² − vi,j,1 − vi,j,2] | bi, si,1, si,2 ] = [(si,1 − si,2)² − si,1 − si,2] / bi,

where the expectation is taken over all random assignments of the samples to the buckets.

The lemma above immediately implies the following equation:

Z̄(X) = E_π[ E_r[Z | X, π] | X ] = E_π[ Σ_{i=1}^{n} [(si,1 − si,2)² − si,1 − si,2] / bi | X ].   (2)

For the rest of this paper, we work with this latter form of Z̄.

3.3 Designing a general private tester

Note that the algorithm we want to design has to satisfy two guarantees: First, it should be an accurate tester, i.e., it should output the correct answer with high probability. Second, it should be a differentially private algorithm.

For the accuracy guarantee, we first need to show that the proposed statistic, Z̄, is sufficient for testing closeness of p and q, and ultimately for testing property P.
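The closed form in Lemma 3.1 can be sanity-checked by exact enumeration on tiny instances. The following brute-force sketch is ours, not the paper's proof; it enumerates every possible bucket assignment with exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

def expected_bucket_statistic(s1, s2, b):
    """Exact E_r[ sum_j (v_{j,1} - v_{j,2})^2 - v_{j,1} - v_{j,2} ] when s1
    samples of p and s2 samples of q are each assigned to one of b buckets
    uniformly and independently, computed over all b^(s1+s2) assignments."""
    total = Fraction(0)
    count = 0
    for assign in product(range(b), repeat=s1 + s2):
        v1 = [assign[:s1].count(j) for j in range(b)]  # p-counts per bucket
        v2 = [assign[s1:].count(j) for j in range(b)]  # q-counts per bucket
        total += sum((v1[j] - v2[j]) ** 2 - v1[j] - v2[j] for j in range(b))
        count += 1
    return total / count

def closed_form(s1, s2, b):
    """The right-hand side of Lemma 3.1: ((s1 - s2)^2 - s1 - s2) / b."""
    return Fraction((s1 - s2) ** 2 - s1 - s2, b)
```

For example, with si,1 = si,2 = 2 and bi = 2, both sides equal −2.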
At first glance, the claim seems trivially true: We know that for any sample set X and any random choices of π and r, the statistic Z is a sufficient statistic for testing property P, simply by the properties of the non-private tester. Since Z̄ is essentially the expected value of a group of sufficient statistics, it might seem immediate that Z̄ is a sufficient statistic as well. However, there is a subtle difference between the guarantees we require for the statistics Z and Z̄. To analyze the standard tester in [DK16], the authors showed that with high probability the set of flattening samples decreases the ℓ2-norm of one of the two distributions p and q. The low ℓ2-norm property is sufficient to show that Z has low variance, and one can use it for closeness testing. In fact, the authors show that the statistic Z works for some flattening set which occurs with high constant probability. However, in our setting, we wish to show that the variance of Z̄ is low over the random choices of both the flattening samples and the test samples. Hence, while we are taking out the randomness of r and π, we are introducing a new source of randomness, which is the set F. Thus, it is unclear whether Z̄ can be used as a statistic for the problem or not.

The goal for the rest of this section is to prove that if the reduction procedure has the desired characteristics, then Z̄ is a sufficient statistic for testing property P. We say a reduction procedure A is a proper procedure if it has these desired guarantees, which we formalize in the following definition:

Definition 3.2 (Proper procedure). Let A be a procedure that reduces testing property P to testing closeness of two distributions p and q over [n] given an input sample set X.
We say A is a proper procedure if A flattens p and q in such a way that the following holds for two non-negative constants c0 < 1 and c1 ≥ 1:

Pr_X[ E_π[ ∥p(F) − q(F)∥2² | X ] ≥ c0 · E_F[ ∥p(F) − q(F)∥2² ] ] ≥ 0.9,   (3)

E_F[ ∥p(F) − q(F)∥4⁴ ] ≤ c1 · ( E_F[ ∥p(F) − q(F)∥2² ] )².   (4)

[Figure 2 here. The figure shows our pipeline: the sample set X is permuted according to each permutation π1, π2, …, π_{(s+k)!}; the reduction procedure is run on every permutation and a statistic Z is computed for each; the results are aggregated into Z̄; Laplace noise is added to obtain Ẑ, which is compared with a threshold τ: if Ẑ ≤ τ, output accept; otherwise, output reject.]

Figure 2: Our approach, which uses Z̄ instead of Z to reduce sensitivity.

For the privacy guarantee, using the statistic Z̄, we design a ξ-private algorithm. To make the statistic ξ-differentially private, we use a standard technique in differential privacy, the Laplace mechanism: we add Lap(Δ(Z̄)/ξ) to the statistic, where Δ(Z̄) is the sensitivity of Z̄. The importance of taking Z̄ instead of Z as the statistic is that it results in a stable statistic with low sensitivity. Thus, the magnitude of the noise we add is small, and we can achieve nearly optimal sample complexity.
Note that the exact value of Δ(Z̄) depends on the reduction procedure. We bound this quantity for each of the properties we consider separately. However, in this section, we state our result in terms of Δ(Z̄).
Finally, we propose Algorithm 1 for differentially private testing of property P. We analyze the correctness of the algorithm in Theorem 3.3. We provide an illustration of our approach in Figure 2, in comparison with the standard approach in Figure 1.
Theorem 3.3. Let A be a proper procedure for testing property P as defined in Definition 3.2. Suppose the expected number of test samples, s, is bounded from below:

\[ s \;\ge\; \Theta\!\left( \frac{n \cdot \sqrt{\mathbb{E}_F\!\left[\min\!\left(\|p^{(F)}\|_2^2, \|q^{(F)}\|_2^2\right)\right]}}{\epsilon^2} \;+\; \frac{\sqrt{n}\,\Delta(\bar{Z})}{\epsilon\,\xi} \right). \]

Then Algorithm 1 is a ξ-differentially private (ϵ, 3/4)-tester for testing property P.
Remark 3.4 (Sample complexity of the algorithm). Our algorithm receives two parameters, s and k, for the numbers of test and flattening samples. Our analysis is based on the Poissonization method, so the algorithm is required to generate ŝ := Poi(s) samples from p and q, and k̂ := Poi(k) samples for flattening. In this section, for simplicity, we use s and k as the numbers of samples.
Note that we assume that one can generate a sample from p and q using Θ(1) samples from the underlying distribution, and that with probability 0.99, ŝ and k̂ are within a constant factor of their expectations. Thus, the sample complexity remains Θ(s + k).

Algorithm 1 A private procedure for property testing
1: procedure PRIVATETESTER(n, ϵ, s, k)
2:     X ← Draw s + k samples, x_1, x_2, …, x_{s+k}, from the underlying distribution.
3:     Z̄ ← 0
4:     for each permutation π do
5:         T ← x_{π(1)}, …, x_{π(s)}
6:         F ← x_{π(s+1)}, …, x_{π(s+k)}
7:         S_p ← A determines a set of samples from p using the test samples T
8:         S_q ← A determines a set of samples from q using the test samples T
9:         n ← A determines an upper bound for the new domain size
10:        for i = 1, 2, …, n do
11:            b_i ← A determines the number of buckets for element i using the flattening samples F
12:            s_{i,1} ← number of occurrences of element i in S_p
13:            s_{i,2} ← number of occurrences of element i in S_q
14:            Z̄ ← Z̄ + ((s_{i,1} − s_{i,2})^2 − s_{i,1} − s_{i,2}) / b_i · Pr[picking π]
15:    Ẑ ← Z̄ + Lap(Δ(Z̄)/ξ)
16:    if Ẑ ≤ τ then
17:        Output accept.
18:    else
19:        Output reject.

Remark 3.5.
Although the running time of Algorithm 1 is exponential as stated, one can run it in Poly(s) time as follows: for each domain element i and three numbers a, b, and c, one can calculate the probability that s_{i,1} = a, s_{i,2} = b, and b_i = c. Hence, one can compute E_π[((s_{i,1} − s_{i,2})^2 − s_{i,1} − s_{i,2})/b_i], and therefore Z̄, without trying all permutations π.

3.4 Applications of Our Framework

We use our general private tester to obtain differentially private algorithms for the two distribution testing problems we mentioned earlier: (i) testing closeness of two distributions with unequal-sized sample sets, and (ii) testing independence. As stated earlier, our approach is to use the non-private testers for these problems and show that they satisfy the proper procedure definition (Definition 3.2). Then, using our general methodology, we achieve near sample-optimal private testers for these problems. In particular, we have the following theorems. For more information, see the full version.

Theorem 3.6. Suppose p and q are two distributions over [n]. There exists a ξ-differentially private (ϵ, 2/3)-tester for closeness of p and q that uses k_1 = Ω(max(n^{2/3}/ϵ^{4/3}, √n/ϵ^2, √n/(ϵ√ξ))) samples from p, and Θ(max(n/(ϵ^2 √min(n, k_1)), √n/(ϵ√ξ), 1/(ϵ^2 ξ))) samples from both p and q.

Theorem 3.7. Let p be a distribution over [n] × [m]. There exists a ξ-differentially private (ϵ, 2/3)-tester for independence of p that uses Θ(s) samples, where

\[ s \;=\; \Theta\!\left( \frac{n^{2/3} m^{1/3}}{\epsilon^{4/3}} \;+\; \frac{(mn)^{1/2}}{\epsilon^2} \;+\; \frac{(mn \log n)^{1/2}}{\epsilon \sqrt{\xi}} \;+\; \frac{\log n}{\epsilon^2 \xi} \right). \]

Acknowledgments

MA is supported by funds from the MIT-IBM Watson AI Lab (Agreement No. W1771646) and the NSF grants IIS-1741137 and CCF-1733808.
ID is supported by NSF Award CCF-1652862 (CAREER), NSF AiTF award CCF-1733796, and a Sloan Research Fellowship. DK is supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship. RR is supported by funds from the MIT-IBM Watson AI Lab (Agreement No. W1771646) and the NSF grants CCF-1650733, CCF-1733808, IIS-1741137, and CCF-1740751.

References

[ACFT19] J. Acharya, C. Canonne, C. Freitag, and H. Tyagi. Test without trust: Optimal locally private distribution testing. In Proceedings of Machine Learning Research, volume 89, pages 2067–2076. PMLR, 2019.

[ADK15] J. Acharya, C. Daskalakis, and G. Kamath. Optimal testing for properties of distributions. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 3591–3599, 2015.

[ADR18] M. Aliakbarpour, I. Diakonikolas, and R. Rubinfeld. Differentially private identity and equivalence testing of discrete distributions. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 169–178, 2018.

[AJOS14] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh. Sublinear algorithms for outlier detection and generalized closeness testing. In 2014 IEEE International Symposium on Information Theory, pages 3200–3204, 2014.

[ASZ18] J. Acharya, Z. Sun, and H. Zhang. Differentially private testing of identity and closeness of discrete distributions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, pages 6879–6891, 2018.

[BC17] T. Batu and C. L. Canonne. Generalized uniformity testing. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 880–889, 2017.

[BFF+01] T. Batu, E. Fischer, L. Fortnow, R. Kumar, R. Rubinfeld, and P. White. Testing random variables for independence and identity.
In Proc. 42nd IEEE Symposium on Foundations of Computer Science, pages 442–451, 2001.

[BFR+00] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing that distributions are close. In IEEE Symposium on Foundations of Computer Science, pages 259–269, 2000.

[BFR+13] T. Batu, L. Fortnow, R. Rubinfeld, W. D. Smith, and P. White. Testing closeness of discrete distributions. J. ACM, 60(1):4, 2013.

[BKR04] T. Batu, R. Kumar, and R. Rubinfeld. Sublinear algorithms for testing monotone and unimodal distributions. In ACM Symposium on Theory of Computing, pages 381–390, 2004.

[BV15] B. B. Bhattacharya and G. Valiant. Testing closeness with unequal sized samples. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pages 2611–2619, 2015.

[Can15] C. L. Canonne. A survey on distribution testing: Your data is big. But is it blue? Electronic Colloquium on Computational Complexity (ECCC), 22:63, 2015.

[CDGR16] C. L. Canonne, I. Diakonikolas, T. Gouleakis, and R. Rubinfeld. Testing shape restrictions of discrete distributions. In 33rd Symposium on Theoretical Aspects of Computer Science, STACS 2016, pages 25:1–25:14, 2016.

[CDK17] B. Cai, C. Daskalakis, and G. Kamath. Priv'it: Private and sample efficient identity testing. In International Conference on Machine Learning, ICML, pages 635–644, 2017.

[CDKS17a] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing Bayesian networks. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 370–448, 2017.

[CDKS17b] C. L. Canonne, I. Diakonikolas, D. M. Kane, and A. Stewart. Testing conditional independence of discrete distributions. CoRR, abs/1711.11560, 2017. In STOC'18.

[CDS17] C. L. Canonne, I. Diakonikolas, and A. Stewart. Fourier-based testing for families of distributions. CoRR, abs/1706.05738, 2017.
In NeurIPS 2018.

[CDVV14] S. Chan, I. Diakonikolas, P. Valiant, and G. Valiant. Optimal algorithms for testing closeness of discrete distributions. In SODA, pages 1193–1203, 2014.

[DDS+13] C. Daskalakis, I. Diakonikolas, R. Servedio, G. Valiant, and P. Valiant. Testing k-modal distributions: Optimal algorithms via reductions. In SODA, pages 1833–1852, 2013.

[DGPP16] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Collision-based testers are optimal for uniformity and closeness. Electronic Colloquium on Computational Complexity (ECCC), 23:178, 2016.

[DGPP17] I. Diakonikolas, T. Gouleakis, J. Peebles, and E. Price. Sample-optimal identity testing with high probability. CoRR, abs/1708.02728, 2017. In ICALP 2018.

[DHS15] I. Diakonikolas, M. Hardt, and L. Schmidt. Differentially private learning of structured discrete distributions. In Conference on Neural Information Processing Systems, NIPS, pages 2566–2574, 2015.

[DK16] I. Diakonikolas and D. M. Kane. A new approach for testing properties of discrete distributions. In IEEE Symposium on Foundations of Computer Science, FOCS, pages 685–694, 2016. Full version available at abs/1601.05557.

[DKN15a] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Optimal algorithms and lower bounds for testing closeness of structured distributions. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, pages 1183–1202, 2015.

[DKN15b] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Testing identity of structured distributions. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 1841–1854, 2015.

[DKN17] I. Diakonikolas, D. M. Kane, and V. Nikishkin. Near-optimal closeness testing of discrete histogram distributions. In 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, pages 8:1–8:15, 2017.

[DKP19] I. Diakonikolas, D. M. Kane, and J.
Peebles. Testing identity of multidimensional histograms. In Conference on Learning Theory, COLT 2019, pages 1107–1131, 2019.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Sharp bounds for generalized uniformity testing. In Advances in Neural Information Processing Systems 31, NeurIPS 2018, pages 6204–6213, 2018.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC'06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag.

[DP17] C. Daskalakis and Q. Pan. Square Hellinger subadditivity for Bayesian networks and its applications to identity testing. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 697–703, 2017.

[DR14a] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[DR14b] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

[Dwo09] C. Dwork. The differential privacy frontier (extended abstract). In TCC, pages 496–502, 2009.

[GLRV16] M. Gaboardi, H. W. Lim, R. M. Rogers, and S. P. Vadhan. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In International Conference on Machine Learning, ICML, pages 2111–2120, 2016.

[Gol17] O. Goldreich. Introduction to Property Testing. Forthcoming, 2017.

[GR18] M. Gaboardi and R. Rogers. Local private hypothesis testing: Chi-square tests. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 1612–1621, 2018.

[KFS17] K. Kakizaki, K. Fukuchi, and J. Sakuma. Differentially private chi-squared test by unit circle mechanism.
In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1761–1770. PMLR, 2017.

[KR17] D. Kifer and R. Rogers. A new class of private chi-square hypothesis tests. In International Conference on Artificial Intelligence and Statistics, AISTATS, pages 991–1000, 2017.

[LRR11] R. Levi, D. Ron, and R. Rubinfeld. Testing properties of collections of distributions. In ICS, pages 179–194, 2011.

[NP33] J. Neyman and E. S. Pearson. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694–706):289–337, 1933.

[Pan08] L. Paninski. A coincidence-based test for uniformity given very sparsely-sampled discrete data. IEEE Transactions on Information Theory, 54:4750–4755, 2008.

[Pea00] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302):157–175, 1900.

[Rub12] R. Rubinfeld. Taming big probability distributions. XRDS, 19(1):24–28, 2012.

[She18] O. Sheffet. Locally private hypothesis testing. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pages 4612–4621, 2018.

[VV14] G. Valiant and P. Valiant. An automatic inequality prover and instance optimal identity testing. In FOCS, 2014.

[WLK15] Y. Wang, J. Lee, and D. Kifer. Revisiting differentially private hypothesis tests for categorical data.
CoRR, abs/1511.03376, 2015.