{"title": "Differentially Private Anonymized Histograms", "book": "Advances in Neural Information Processing Systems", "page_first": 7971, "page_last": 7981, "abstract": "For a dataset of label-count pairs, an anonymized histogram is the multiset of counts. Anonymized histograms appear in various potentially sensitive contexts such as password-frequency lists, degree distribution in social networks, and estimation of symmetric properties of discrete distributions. Motivated by these applications, we propose the first differentially private mechanism to release anonymized histograms that achieves near-optimal privacy utility trade-off both in terms of number of items and the privacy parameter. Further, if the underlying histogram is given in a compact format, the proposed algorithm runs in time sub-linear in the number of items. For anonymized histograms generated from unknown discrete distributions, we show that the released histogram can be directly used for estimating symmetric properties of the underlying distribution.", "full_text": "Differentially private anonymized histograms\n\nAnanda Theertha Suresh\nGoogle Research, New York\n\ntheertha@google.com\n\nAbstract\n\nFor a dataset of label-count pairs, an anonymized histogram is the multiset of counts.\nAnonymized histograms appear in various potentially sensitive contexts such as\npassword-frequency lists, degree distribution in social networks, and estimation of\nsymmetric properties of discrete distributions. Motivated by these applications, we\npropose the \ufb01rst differentially private mechanism to release anonymized histograms\nthat achieves near-optimal privacy utility trade-off both in terms of number of items\nand the privacy parameter. Further, if the underlying histogram is given in a\ncompact format, the proposed algorithm runs in time sub-linear in the number of\nitems. 
For anonymized histograms generated from unknown discrete distributions,\nwe show that the released histogram can be directly used for estimating symmetric\nproperties of the underlying distribution.\n\n1 Introduction\n\nGiven a set of labels $\\mathcal{X}$, a dataset $D$ is a collection of labels and counts, $D \\overset{\\text{def}}{=} \\{(x, n_x) : x \\in \\mathcal{X}\\}$. An\nanonymized histogram of such a dataset is the unordered multiset of all non-zero counts without any\nlabel information,\n\n$$h(D) \\overset{\\text{def}}{=} \\{n_x : x \\in \\mathcal{X} \\text{ and } n_x > 0\\}.$$\n\nFor example, if $\\mathcal{X} = \\{a, b, c, d\\}$ and $D = \\{(a, 8), (b, 0), (c, 8), (d, 3)\\}$, then $h(D) = \\{3, 8, 8\\}$.\u00b9\nAnonymized histograms do not contain any information about the labels, including the cardinality of\n$\\mathcal{X}$. Furthermore, we only consider histograms with positive counts. The results can be extended to\nhistograms that include zero counts. A histogram can also be represented succinctly using prevalences.\nFor a histogram $h$, the prevalence $\\varphi_r$ is the number of elements in the histogram with count $r$,\n\n$$\\varphi_r(h) \\overset{\\text{def}}{=} \\sum_{n_x \\in h} \\mathbf{1}_{n_x = r}.$$\n\nIn the above example, $\\varphi_3(h) = 1$, $\\varphi_8(h) = 2$, and $\\varphi_r(h) = 0$ for $r \\notin \\{3, 8\\}$. Anonymized histograms\nare also referred to as histogram of histograms [1], histogram order statistics [2], profiles [3],\nunattributed histograms [4], fingerprints [5], and frequency lists [6].\nAnonymized histograms appear in several potentially sensitive contexts ranging from password\nfrequency lists to social networks. Before we proceed to the problem formulation and results, we first\nprovide an overview of the various contexts where anonymized histograms have been studied under\ndifferential privacy and their motivation.\nPassword frequency lists: Anonymized password histograms are useful to security researchers who\nwish to understand the underlying password distribution in order to estimate the security risks or\nevaluate various password defenses [7, 6]. 
For example, if $n_{(i)}$ is the count of the $i$th most frequent password,\nthen $\\lambda_\\beta = \\sum_{i=1}^{\\beta} n_{(i)}$ is the number of accounts that an adversary could compromise with $\\beta$ guesses\nper user. Hence, if the server changes the k-strikes policy from 3 to 5, the frequency distribution\ncan be used to evaluate the security implications of this change. We refer readers to [6, 8] for more\nuses of password frequency lists. Despite their usefulness, organizations may be wary of publishing\nthese lists due to privacy concerns. This is further justified as it is not unreasonable to expect that an\nadversary will have some side information based on attacks against other organizations. Motivated by\nthis, [6] studied the problem of releasing anonymized password histograms.\n\n\u00b9 $h(D)$ is a multiset and not a set, so duplicates are allowed.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nDegree distributions in social networks: The degree distribution is one of the most widely studied\nproperties of a graph, as it influences the structure of the graph. The degree distribution can also be\nused to estimate linear queries on degree distributions, such as the number of k-stars. However, some\ngraphs may have unique degree distributions, and releasing exact degree distributions is no safer\nthan naive anonymization, which can leave social network participants vulnerable to a variety of\nattacks [9, 10, 11]. Thus releasing them exactly can be revealing. Hence, [12, 4, 13, 14, 15, 16, 17]\nconsidered the problem of releasing degree distributions of graphs with differential privacy. Degree\ndistributions are anonymized histograms over the graph node degrees.\nEstimating symmetric properties of discrete distributions: Let $k \\overset{\\text{def}}{=} |\\mathcal{X}|$. A discrete distribution $p$\nis a mapping from a domain $\\mathcal{X}$ to $[0, 1]^k$ such that $\\sum_x p_x = 1$. Given a discrete distribution $p$ over $k$\nsymbols, a symmetric property is a property that depends only on the multiset of probabilities [18, 19],\ne.g., entropy ($\\sum_{x \\in \\mathcal{X}} p_x \\log \\frac{1}{p_x}$). Other symmetric properties include support size, R\u00e9nyi entropy,\ndistance to uniformity, and support coverage. Given independent samples from an unknown $p$, the goal\nof property estimation is to estimate the value of the symmetric property of interest for $p$. Estimating\nsymmetric properties of unknown distributions has received wide attention in the recent past,\ne.g., [5, 18, 20, 21, 22, 19, 23, 24], and has applications in various fields from neuroscience [2]\nto genetics [25]. Recently, [26] proposed algorithms to estimate support size, support coverage,\nand entropy with differential privacy. Optimal estimators for symmetric properties only depend\non the anonymized histograms of the samples [1, 19]. Hence, releasing anonymous histograms\nwith differential privacy would simultaneously yield differentially private plug-in estimators for all\nsymmetric properties.\n\n2 Differential privacy\n2.1 Definitions\nBefore we outline our results, we first define the privacy and utility aspects of anonymized histograms.\nPrivacy has been studied extensively in statistics and computer science; see [27, 28, 29, 30] and references\ntherein. Perhaps the most studied form of privacy is differential privacy (DP) [31, 32], where the\nobjective is to ensure that an adversary cannot infer whether a user is present in the dataset or not.\nWe study the problem of releasing anonymized histograms through the lens of global DP. We begin by\ndefining the notion of DP. 
Formally, given a set of datasets $\\mathcal{H}$, a notion of neighboring datasets\n$\\mathcal{N}_{\\mathcal{H}} \\subseteq \\mathcal{H} \\times \\mathcal{H}$, and a query function $f : \\mathcal{H} \\to \\mathcal{Y}$ for some domain $\\mathcal{Y}$, a mechanism $\\mathcal{M} : \\mathcal{Y} \\to \\mathcal{O}$\nis said to be $\\epsilon$-DP if, for any two neighboring datasets $(h_1, h_2) \\in \\mathcal{N}_{\\mathcal{H}}$ and all $S \\subseteq \\mathcal{O}$,\n\n$$\\Pr(\\mathcal{M}(f(h_1)) \\in S) \\leq e^{\\epsilon} \\Pr(\\mathcal{M}(f(h_2)) \\in S). \\qquad (1)$$\n\nBroadly speaking, $\\epsilon$-DP ensures that given the output, an attacker would not be able to differentiate\nbetween any two neighboring datasets. $\\epsilon$-DP is also called pure DP and provides stricter guarantees\nthan the approximate $(\\epsilon, \\delta)$-DP, where equation (1) needs to hold with probability $1 - \\delta$.\nSince its introduction, DP has been studied extensively in various applications from dataset release to\nlearning machine learning models [33]. It has also been adopted by industry [34]. There are two\nmodels of DP. The first is server, global, or output DP, where a centralized entity has access to the entire dataset\nand answers the queries in a DP manner. The second model is local DP, where $\\epsilon$-DP is guaranteed\nfor each individual user\u2019s data [35, 36, 37, 38, 39]. We study the problem of releasing anonymized\nhistograms under global DP. Here $\\mathcal{H}$ is the set of anonymized histograms, $f$ is the identity mapping,\nand $\\mathcal{O} = \\mathcal{H}$.\n2.2 Distance measure\nFor DP, a general notion of neighbors is as follows. Two datasets are neighbors if and only if one can\nbe obtained from another by adding or removing a user [30]. Since anonymized histograms do not\ncontain explicit user information, we need a few definitions to apply the above notion. We first define a\nnotion of distance between label-count datasets. A natural notion of distance between datasets $D_1$\nand $D_2$ over $\\mathcal{X}$ is the $\\ell_1$ distance,\n\n$$\\ell_1(D_1, D_2) \\overset{\\text{def}}{=} \\sum_{x \\in \\mathcal{X}} |n_x(D_1) - n_x(D_2)|,$$\n\nwhere $n_x(D)$ is the count of $x$ in dataset $D$. 
Since anonymized histograms do not contain any\ninformation about labels, we define the distance between two histograms $h_1, h_2$ as\n\n$$\\ell_1(h_1, h_2) \\overset{\\text{def}}{=} \\min_{D_1, D_2 : h(D_1) = h_1, h(D_2) = h_2} \\ell_1(D_1, D_2). \\qquad (2)$$\n\nThe following simple lemma characterizes the above distance in terms of counts.\nLemma 1 (Appendix B). For an anonymized histogram $h = \\{n_x\\}$, let $n_{(i)}$ be the $i$th highest count in\nthe dataset.\u00b2 For any two anonymized histograms $h_1, h_2$, $\\ell_1(h_1, h_2) = \\sum_{i \\geq 1} |n_{(i)}(h_1) - n_{(i)}(h_2)|$.\n\nThe above distance is also referred to as sorted $\\ell_1$ distance or earth-mover\u2019s distance. With the above\ndefinition of distance, we can define neighbors as follows.\nDefinition 1. Two anonymized histograms $h$ and $h'$ are neighbors if and only if $\\ell_1(h, h') = 1$.\n\nThe above definition of neighboring histograms is the same as the definition of neighbors in the previous\nworks on anonymized histograms [4, 6].\n\n3 Previous and new results\n\n3.1 Anonymized histogram estimation\n\nSimilar to previous works [6], we measure the utility of the algorithm in terms of the number of items\nin the anonymized histogram, $n \\overset{\\text{def}}{=} \\sum_{n_x \\in h} n_x = \\sum_{r \\geq 1} \\varphi_r(h) r$.\nPrevious results: The problem of releasing anonymized histograms was first studied by [12, 4]\nin the context of degree distributions of graphs. They showed that adding Laplace noise to each\ncount, followed by a post-processing isotonic regression step, results in a histogram $H$ with expected\nsorted-$\\ell_2^2$ error of\n\n$$\\mathbb{E}[\\ell_2^2(h, H)] = \\mathbb{E}\\left[\\sum_{i \\geq 1} (n_{(i)}(h) - n_{(i)}(H))^2\\right] = \\sum_{r \\geq 0} O\\left(\\frac{\\log^3 \\max(\\varphi_r, 1)}{\\epsilon^2}\\right) = O\\left(\\frac{\\sqrt{n}}{\\epsilon^2}\\right).$$\n\nTheir algorithm runs in time $O(n)$. The problem was also considered in the context of password\nfrequency lists by [6]. They observed that an exponential mechanism over integer partitions yields\nan $\\epsilon$-DP algorithm. 
Based on this, for $\\epsilon = \\Omega(1/\\sqrt{n})$, they proposed a dynamic programming\nbased relaxation of the exponential mechanism that runs in time $O\\left(\\frac{n^{3/2}}{\\epsilon} + n \\log \\frac{1}{\\delta}\\right)$ and returns\na histogram $H$ such that $\\ell_1(h, H) = O\\left(\\frac{\\sqrt{n} + \\log \\frac{1}{\\delta}}{\\epsilon}\\right)$, with probability at least $1 - \\delta$. Furthermore, the\nrelaxed mechanism is $(\\epsilon, \\delta)$-DP.\nThe best information-theoretic lower bound for the $\\ell_1$ utility of any $\\epsilon$-DP mechanism is due to [40],\nwho showed that for $\\epsilon \\geq \\Omega(1/n)$, any $\\epsilon$-DP mechanism has expected $\\ell_1$ error of $\\Omega(\\sqrt{n}/\\sqrt{\\epsilon})$ for\nsome dataset.\nNew results: Following [6], we study the problem in the $\\ell_1$ metric. We propose a new DP mechanism\nPRIVHIST that satisfies the following:\nTheorem 1. Given a histogram in the prevalence form $h = \\{(r, \\varphi_r) : \\varphi_r > 0\\}$, PRIVHIST returns a\nhistogram $H$ and a sum count $N$ that is $\\epsilon$-DP. Furthermore, if $\\epsilon > 1$, then\n\n$$\\mathbb{E}[\\ell_1(h, H)] = O\\left(\\sqrt{n} \\cdot e^{-c\\epsilon}\\right) \\text{ and } \\mathbb{E}[|N - n|] \\leq e^{-c\\epsilon},$$\n\nfor some constant $c > 0$, and it has an expected run time of $\\tilde{O}(\\sqrt{n})$. If $1 \\geq \\epsilon = \\Omega(1/n)$, then\n\n$$\\mathbb{E}[\\ell_1(h, H)] = O\\left(\\sqrt{\\frac{n}{\\epsilon} \\log \\frac{2}{\\epsilon}}\\right) \\text{ and } \\mathbb{E}[|N - n|] \\leq O\\left(\\frac{1}{\\epsilon}\\right),$$\n\nand it has an expected run time of $\\tilde{O}\\left(\\sqrt{\\frac{n}{\\epsilon}} + \\frac{1}{\\epsilon}\\right)$.\n\n\u00b2 For $i$ larger than the number of counts in $h$, $n_{(i)} = 0$.\n\nTogether with the lower bound of [40], this settles the optimal privacy utility trade-off for\n$\\epsilon \\in [\\Omega(1/n), 1]$ up to a multiplicative factor of $O(\\sqrt{\\log(2/\\epsilon)})$. We also show that PRIVHIST is near-optimal\nfor $\\epsilon > 1$, by showing the following lower bound.\nTheorem 2 (Appendix E). For a given $n$, let $\\mathcal{H} = \\{h : n \\leq \\sum_r r \\varphi_r(h) \\leq n + 1\\}$. For any $\\epsilon$-DP\nmechanism $\\mathcal{M}$, there exists a histogram $h \\in \\mathcal{H}$ such that\n\n$$\\mathbb{E}[\\ell_1(h, \\mathcal{M}(h))] \\geq \\Omega(\\sqrt{n} e^{-2\\epsilon}).$$\n\nTheorems 1 and 2 together with [40] show that the proposed mechanism has near-optimal utility\nfor all $\\epsilon = \\Omega(1/n)$. We can infer the number of items in the dataset by $\\sum_r r \\cdot \\varphi_r(H)$. 
However, this\nestimate is very noisy. Hence, we also return the sum of counts $N$ as it is useful for applications in\nsymmetric property estimation for distributions. Apart from the near-optimal privacy-utility trade-off,\nwe also show that PRIVHIST has several other useful properties.\nTime complexity: By the Hardy-Ramanujan integer partition theorem [41], the number of\nanonymized histograms with $n$ items is $e^{\\Theta(\\sqrt{n})}$. Hence, we can succinctly represent them using\n$O(\\sqrt{n})$ space. Recall that any anonymized histogram can be written as $\\{(r, \\varphi_r) : \\varphi_r > 0\\}$,\nwhere $\\varphi_r$ is the number of symbols with count $r$. Let $t$ be the number of distinct counts and let\n$r_1, r_2, \\ldots, r_t$ be the distinct counts with non-zero prevalences, in increasing order. Then $r_i \\geq i$ and\n\n$$n = \\sum_{i=1}^{t} r_i \\varphi_{r_i} \\geq \\sum_{i=1}^{t} r_i \\geq \\sum_{i=1}^{t} i \\geq \\frac{t^2}{2},$$\n\nand hence there are at most $t \\leq \\sqrt{2n}$ non-zero prevalences and $h$ can be represented as\n$\\{(r, \\varphi_r) : \\varphi_r > 0\\}$ using $O(\\sqrt{n})$ count-prevalence pairs. Histograms are often stored in this format for\nspace efficiency, e.g., password frequency lists in [42]. PRIVHIST takes advantage of this succinct\nrepresentation. Hence, given such a succinct representation, it runs in time $O(\\sqrt{n})$ as opposed to the\n$O(n)$ running time of [12] and $O(n^{3/2})$ running time of [6]. This is highly advantageous for large\ndatasets such as password frequency lists with $n = 70$M data points [6].\nPure vs approximate differential privacy: The only previously known algorithm with $\\ell_1$ utility of\n$O(\\sqrt{n})$ is that of [6] and it runs in time $O(n^{3/2})$. However, their algorithm is $(\\epsilon, \\delta)$-approximate DP,\nwhich is strictly weaker than the guarantee of PRIVHIST, whose output is $\\epsilon$-DP. For applications in social networks it\nis desirable to have group privacy for large groups [32]. For groups of size $k$, $(\\epsilon, \\delta)$-approximate DP\nscales as $(k\\epsilon, e^{k\\epsilon}\\delta)$-DP, which can be prohibitive for large values of $k$. 
Hence $\\epsilon$-DP is preferable.\nApplications to symmetric property estimation: We show that the output of PRIVHIST can be directly\napplied to obtain near-optimal sample complexity algorithms for discrete distribution symmetric\nproperty estimation.\n\n3.2 Symmetric property estimation of discrete distributions\nFor a symmetric property $f$ and an estimator $\\hat{f}$ that uses $n$ samples, let $\\mathcal{E}(\\hat{f}, n)$ be an upper\nbound on the worst expected error over all distributions $p$ with support at most $k$,\n$\\mathcal{E}(\\hat{f}, n) \\overset{\\text{def}}{=} \\max_{p \\in \\Delta^k} \\mathbb{E}[|f(p) - \\hat{f}(X^n)|]$. Let the sample complexity $n(f, \\alpha)$ denote the minimum number of\nsamples such that $\\mathcal{E}(\\hat{f}, n) \\leq \\alpha$, i.e., $n(f, \\alpha) \\overset{\\text{def}}{=} \\min\\{n : \\mathcal{E}(\\hat{f}, n) \\leq \\alpha\\}$.\nGiven samples $X^n \\overset{\\text{def}}{=} X_1, X_2, \\ldots, X_n$, let $h(X^n)$ denote the corresponding anonymous histogram.\nFor a symmetric property $f$, linear estimators of the form\n\n$$\\hat{f}(h(X^n)) \\overset{\\text{def}}{=} \\sum_{r \\geq 1} f(r, n) \\cdot \\varphi_r(h(X^n))$$\n\nare shown to be sample-optimal for symmetric properties such as entropy [21], support size [18, 20],\nsupport coverage [22], and R\u00e9nyi entropy [43, 44], where the $f(r, n)$ are distribution-independent\ncoefficients that depend on the property $f$. Recently, [26] showed that for any given property such as\nentropy or support size, one can construct DP estimators by adding Laplace noise to the non-private\nestimator. They further showed that this approach is information theoretically near-optimal.\nInstead of just computing a DP estimate for a given property, the output of PRIVHIST can be directly\nused to estimate any symmetric property. By the post-processing lemma [32], since the output of\nPRIVHIST is DP, the estimate is also DP. For an estimator $\\hat{f}$, let $L^n_{\\hat{f}}$ be the Lipschitz constant given\nby $L^n_{\\hat{f}} \\overset{\\text{def}}{=} \\max(f(1, n), \\max_{r \\geq 1} |f(r, n) - f(r + 1, n)|)$. If instead of $h(X^n)$, a DP histogram $H$\nand the sum of counts $N$ are available, then $\\hat{f}$ can be modified as\n\n$$\\hat{f}^{dp} \\overset{\\text{def}}{=} \\sum_{r \\geq 1} f(r, N) \\cdot \\varphi_r(H),$$\n\nwhich is differentially private. Using Theorem 1, we show that:\nCorollary 1 (Appendix F). Let $\\hat{f}$ satisfy $L^n_{\\hat{f}} \\leq n^{\\beta - 1}$ for a $\\beta \\leq 0.5$. Further, let there exist $\\mathcal{E}$\nsuch that $|\\mathcal{E}(\\hat{f}, n) - \\mathcal{E}(\\hat{f}, n + 1)| \\leq n^{\\beta - 1}$. Let $f_{\\max} = \\max_{p \\in \\Delta^k} f(p)$. If $n(\\hat{f}, \\alpha)$ is the sample\ncomplexity of estimator $\\hat{f}$, then for $\\epsilon > 1$,\n\n$$n(\\hat{f}^{dp}, 2\\alpha) \\leq \\max\\left(n(\\hat{f}, \\alpha), O\\left(\\left(\\frac{1}{\\alpha e^{c\\epsilon}}\\right)^{\\frac{2}{1 - 2\\beta}} + \\frac{1}{\\epsilon} \\log \\frac{f_{\\max}}{\\alpha}\\right)\\right),$$\n\nfor some constant $c > 0$. For $\\Omega(1/n) \\leq \\epsilon \\leq 1$,\n\n$$n(\\hat{f}^{dp}, 2\\alpha) \\leq \\max\\left(n(\\hat{f}, \\alpha), O\\left(\\left(\\frac{\\log(2/\\epsilon)}{\\alpha^2 \\epsilon}\\right)^{\\frac{1}{1 - 2\\beta}} + \\frac{1}{\\epsilon} \\log \\frac{f_{\\max}}{\\alpha \\epsilon}\\right)\\right).$$\n\nFurther, by the post-processing lemma, $\\hat{f}^{dp}$ is also $\\epsilon$-DP.\n\nFor entropy ($-\\sum_x p_x \\log p_x$), normalized support size ($\\sum_x \\mathbf{1}_{p_x > 1/k}/k$), and normalized support\ncoverage, there exist sample-optimal linear estimators with $\\beta < 0.1$ that have the property\n$|\\mathcal{E}(\\hat{f}, n) - \\mathcal{E}(\\hat{f}, n + 1)| \\leq \\mathcal{E}(\\hat{f}, n) n^{\\beta - 1}$ [19, 26]. Hence the sample complexity of the proposed algorithm\nincreases at most by a polynomial in $1/(\\epsilon\\alpha)$. Furthermore, the increase is dependent on the maximum\nvalue of the function for distributions of interest and it does not explicitly depend on the support size.\nThis result is slightly worse than the property-specific results of [26] in terms of dependence on $\\epsilon$ and\n$\\alpha$. In particular, for entropy estimation, the main term in our privacy cost is $\\tilde{O}\\left((1/(\\alpha^2 \\epsilon))^{\\frac{1}{1 - 2\\beta}}\\right)$ and\nthe bound of [26] is $O\\left(1/(\\alpha\\epsilon)^{1 + \\beta}\\right)$. 
Thus for $\\beta = 0.1$, our dependence on $\\epsilon$ and $\\alpha$ is slightly worse.\n\nHowever, we note that our results are more general in that $H$ can be used with any linear estimator.\nFor example, our algorithm implies DP algorithms for estimating distance to uniformity, which have\nnot been studied before. Furthermore, PRIVHIST can also be combined with the maximum\nlikelihood estimators of [3, 22, 45] and linear programming estimators of [5]; however, we do not\nprovide any theoretical guarantees for these combined algorithms.\n\n4 PRIVHIST\n\nIn the algorithm description and analysis, let $\\bar{x}$ denote the vector $x$ and let $\\varphi_{r+}(h) \\overset{\\text{def}}{=} \\sum_{s \\geq r} \\varphi_s(h)$\ndenote the cumulative prevalences. Since anonymized histograms are multisets, we can define the\nsum of two anonymized histograms as follows: for two histograms $h_1, h_2$, the sum $h = h_1 + h_2$ is\ngiven by $\\varphi_r(h) = \\varphi_r(h_1) + \\varphi_r(h_2)$, $\\forall r$. Furthermore, since there is a one-to-one mapping between\nhistograms in count form $h = \\{n_{(i)}\\}$ and in prevalence form $h = \\{(r, \\varphi_r) : \\varphi_r > 0\\}$, we use both\ninterchangeably. For ease of analysis, we also use the notion of an improper histogram, where\nthe $\\varphi_r$\u2019s can be negative or non-integers. Finally, for a histogram $h^a$ indexed by superscript $a$, we\ndefine $\\varphi^a \\overset{\\text{def}}{=} \\varphi(h^a)$ for ease of notation.\n\n4.1 Approach\nInstead of describing the technicalities involved in the algorithm directly, we first motivate the\nalgorithm with a few incorrect or high-error algorithms. Before we proceed, recall that histograms can\nbe written either in terms of prevalences $\\varphi_r$ or in terms of sorted counts $n_{(i)}$.\nAn incorrect algorithm: A naive tendency would be to add noise only to the non-zero prevalences\nor counts. However, this is not differentially private. For example, consider two neighboring\nhistograms in prevalence format, $h = \\{\\varphi_1 = 2\\}$ and $h' = \\{\\varphi_1 = 1, \\varphi_2 = 1\\}$. 
The resulting outputs\nfor the above two inputs would be very different, as the output for $h$ never produces a non-zero $\\varphi_2$,\nwhereas the output for $h'$ produces a non-zero $\\varphi_2$ with high probability. Similar counterexamples can\nbe shown for sorted counts.\nA high-error algorithm: Instead of adding noise to non-zero counts or prevalences, one can add\nnoise to all the counts or prevalences. It can be shown that adding noise to all the counts (including\nthose that appear zero times) yields an $\\ell_1$ error of $O(n/\\epsilon)$, whereas adding noise to prevalences can yield\nan $\\ell_1$ error of $O(n^2/\\epsilon)$, if we naively use the utility bound in terms of prevalences (3). We note that\n[12] showed that a post-processing step after adding noise to sorted counts improves the $\\ell_2$ utility.\nA naive application of the Cauchy-Schwarz inequality yields an $\\ell_1$ error of $n^{3/4}/\\epsilon$ for that algorithm.\nWhile it might be possible to improve the dependence on $n$ by a tighter analysis, it is not clear if the\ndependence on $\\epsilon$ can be improved.\nThe algorithm is given in PRIVHIST. After some computation, it calls one of two sub-routines,\nPRIVHIST-LOWPRIVACY or PRIVHIST-HIGHPRIVACY, depending on the value of $\\epsilon$. PRIVHIST has two main\nnew ideas: (i) splitting $r$ around $\\sqrt{n}$ and using prevalences in one regime and counts in another and\n(ii) the smoothing technique used to zero out the prevalence vector. Of the two, (i) is crucial for the\ncomputational complexity of the algorithm and (ii) is crucial in improving the $\\epsilon$-dependence from\n$1/\\epsilon$ to $1/\\sqrt{\\epsilon}$ in the high privacy regime ($\\epsilon \\leq 1$). There are more subtle differences such as using\ncumulative prevalences instead of actual prevalences. We highlight them in the next section. We now\noverview our algorithm for low and high privacy regimes separately.\n\n4.2 Low privacy regime\nWe first consider the problem in the low privacy regime when $\\epsilon > 1$. 
We make a few observations.\nGeometric mechanism vs Laplace mechanism: For obtaining DP output of integer data, one can\nadd either Laplace noise or geometric noise [46]. For $\\epsilon$-DP, the expected $\\ell_1$ noise added by the Laplace\nmechanism is $1/\\epsilon$, which is strictly larger than that of the geometric mechanism ($2e^{-\\epsilon}/(1 - e^{-2\\epsilon})$) (see\nAppendix A). For $\\epsilon > 1$, we use the geometric mechanism to obtain the optimal trade-off in terms of $\\epsilon$.\nPrevalences vs counts: If we add noise to each coordinate of a $d$-dimensional vector, the total\namount of noise in $\\ell_1$ norm scales linearly in $d$; hence it is better to add noise to a low-dimensional\nvector. In the worst case, both prevalences and counts can be an $n$-dimensional vector. Hence, we\npropose to use prevalences for small values of $r \\leq \\sqrt{n}$, and use counts for $r > \\sqrt{n}$. This ensures\nthat the dimensionality of vectors to which we add noise is at most $O(\\sqrt{n})$.\nCumulative prevalences vs prevalences: The $\\ell_1$ error can be bounded in terms of prevalences as\nfollows. See Appendix B for a proof.\n\n$$\\ell_1(h_1, h_2) \\leq \\sum_{r \\geq 1} |\\varphi_{r+}(h_1) - \\varphi_{r+}(h_2)| \\leq \\sum_{r \\geq 1} r |\\varphi_r(h_1) - \\varphi_r(h_2)|. \\qquad (3)$$\n\nIf we add noise to prevalences, the $\\ell_1$ error can be very high as noise is multiplied with the corresponding\ncount $r$ in (3). The bound in terms of cumulative prevalences $\\varphi_{r+}$ is much tighter. Hence, for\nsmall values of $r$, we use cumulative prevalences instead of prevalences themselves.\nThe above observations provide an algorithm for the low privacy regime. However, there are a few\ntechnical difficulties. For example, if we split the counts at a threshold naively, then it is not\ndifferentially private. We now describe each of the steps in the low-privacy regime.\n(1) Find $\\sqrt{n}$: To divide the histogram into two smaller histograms, we need to know $n$, which may\nnot be available. 
Hence, we allot $\\epsilon_1$ privacy cost to find a DP value of $n$.\n(2) Sensitivity preserving histogram split: If we divide the histogram into two parts based on\ncounts naively and analyze the privacy costs independently for the higher and smaller parts separately,\nthen the sensitivity would be a lot higher for certain neighboring histograms. For example, consider\ntwo neighboring histograms $h_1 = \\{\\varphi_T = 1, \\varphi_{n-T} = 1\\}$ and $h_2 = \\{\\varphi_{T+1} = 1, \\varphi_{n-T-1} = 1\\}$. If\nwe divide each histogram into two parts based on threshold $T$, say $h_1^s = \\{\\varphi_T = 1\\}$ and $h_1^\\ell = \\{\\varphi_{n-T} = 1\\}$,\nand $h_2^s = \\{\\}$ and $h_2^\\ell = \\{\\varphi_{T+1} = 1, \\varphi_{n-T-1} = 1\\}$, then $\\ell_1(h_1^\\ell, h_2^\\ell) = T + 2$. Thus, the $\\ell_1$ distances\nbetween neighboring separated histograms, $\\ell_1(h_1^\\ell, h_2^\\ell)$ and $\\ell_1(h_1^s, h_2^s)$, would be much higher compared\nto $\\ell_1(h_1, h_2)$ and we need to add a lot of noise. Therefore, we perturb $\\varphi_T$ and $\\varphi_{T+1}$ using geometric\nnoise. This ensures DP in instances where the neighboring histograms differ at $\\varphi_T$ and $\\varphi_{T+1}$, and\ndoesn\u2019t change the privacy analysis for other types of histograms. However, adding noise may make\nthe histogram improper as $\\varphi_T$ can become negative. To this end, we add $M$ fake counts at $T$ and\n$T + 1$ to ensure that the histogram is proper with high probability. We remove them later in (L4). We\nrefer readers to Appendix C.2 for details about this step.\n(3,4) Add noise: Let $H^{bs}$ (small counts) and $H^{b\\ell}$ (large counts) be the split histograms. We add\nnoise to cumulative prevalences in $H^{bs}$ and counts in $H^{b\\ell}$ as described in the algorithm overview.\n(L1, L2) Post-processing: The noisy versions of $\\varphi_{r+}$ may not satisfy the properties satisfied by\nthe histograms, i.e., $\\varphi_{r+} \\geq \\varphi_{(r+1)+}$. To overcome this, we run isotonic regression over noisy\n$\\varphi_{r+}$ subject to the monotonicity constraints, i.e., given noisy counts $\\varphi_{r+}$, find $\\varphi^{\\text{mon}}_{r+}$ that minimizes\n$\\sum_{r \\leq T} (\\varphi_{r+} - \\varphi^{\\text{mon}}_{r+})^2$, subject to the constraint that $\\varphi^{\\text{mon}}_{r+} \\geq \\varphi^{\\text{mon}}_{(r+1)+}$, for all $r \\leq T$. 
Isotonic\nregression in one dimension can be run in time linear in the number of inputs using the pool adjacent\nviolators algorithm (PAVA) or its variants [47, 48]. Hence, the time complexity of this algorithm\nis $O(T) \\approx \\sqrt{n}$. We then round the prevalences to the nearest non-negative integers. We similarly\npost-process large counts and remove the fake counts that we introduced in step (2).\nSince we used the succinct representation of histograms, used prevalences for $r$ smaller than $O(\\sqrt{n})$ and\ncounts otherwise, the expected run time of the algorithm is $\\tilde{O}(\\sqrt{n})$ for $\\epsilon > 1$.\n\n4.3 High privacy regime\n\nFor the high-privacy regime, when $\\epsilon \\leq 1$, all known previous algorithms achieve an error of $1/\\epsilon$. To\nreduce the error from $1/\\epsilon$ to $1/\\sqrt{\\epsilon}$, we use smoothing techniques to reduce the sensitivity and hence\nreduce the amount of added noise.\nSmoothing method: Recall that the amount of noise added to a vector depends on its dimensionality.\nSince prevalences have length $n$, the amount of $\\ell_1$ noise would be $O(n/\\epsilon)$. To improve on this, we\nfirst smooth the input prevalence vector such that it is non-zero only for a few values of $r$ and show\nthat the smoothing method also reduces the sensitivity of cumulative prevalences and hence reduces\nthe amount of noise added.\nWhile applying smoothing is the core idea, two questions remain: how to select the location of\nnon-zero values and how to smooth to reduce the sensitivity? We now describe these technical details.\n(H1) Approximate high prevalences: Recall that $N$ was obtained by adding geometric noise to $n$.\nIn the rare case that this geometric noise is very negative, there can be prevalences at counts much larger\nthan $2N$. This can affect the smoothing step. 
To overcome this, we move all counts above $2N$ to $2N$.\nSince this changes the histogram with low probability, it does not affect the $\\ell_1$ error.\n(H2) Compute boundaries: We find a set of boundaries $S$ and smooth counts to elements in $S$.\nIdeally we would like to ensure that there is a boundary close to every count. For small values of\n$r$, we ensure this by adding all the counts and hence there is no smoothing. If $r \\approx \\sqrt{n}$, we use\nboundaries that are uniform in the log-count space. However, using this technique for all values of $r$\nresults in an additional $\\log n$ factor. To overcome this, for $r \\gg \\sqrt{n}$, we use the noisy large counts in\nstep (4) to find the boundaries and ensure that there is a boundary close to every count.\n(H3) Smooth prevalences: The main ingredient in proving better utility in the high privacy regime is\nthe smoothing technique, which we describe now with an example. Suppose that all histograms have\nnon-zero prevalences only between counts $s$ and $s + t$ and further suppose we have two neighboring\nhistograms $h^1$ and $h^2$ as follows: $\\varphi^1_r = 1$ and for all $i \\in [s, s + t] \\setminus \\{r\\}$, $\\varphi^1_i = 0$. Similarly, let\n$\\varphi^2_{r+1} = 1$ and for all $i \\in [s, s + t] \\setminus \\{r + 1\\}$, $\\varphi^2_i = 0$. If we want to release the prevalences or\ncumulative prevalences, we add $\\ell_1$ noise of $O(1/\\epsilon)$ for each prevalence in $[s, s + t]$. Thus the $\\ell_1$\nnorm of the noise would be $O(t/\\epsilon)$. We propose to reduce this noise by smoothing prevalences.\nFor an $r \\in [s, s + t]$, we divide $\\varphi_r$ into $\\varphi_s$ and $\\varphi_{s+t}$ as follows. We assign a $\\frac{s + t - r}{t}$ fraction of $\\varphi_r$\nto $\\varphi_s$ and the remaining to $\\varphi_{s+t}$. After this transformation, the first histogram becomes $h^{t1}$ given\nby $\\varphi^{t1}_s = \\frac{t + s - r}{t}$ and $\\varphi^{t1}_{s+t} = \\frac{r - s}{t}$, and all other prevalences are zero. Similarly, the second histogram\nbecomes $h^{t2}$ given by $\\varphi^{t2}_s = \\frac{t + s - r - 1}{t}$ and $\\varphi^{t2}_{s+t} = \\frac{r + 1 - s}{t}$, and all other prevalences are zero. 
Thus the\nprevalences after smoothing differ only in two locations $s$ and $s + t$ and they differ by at most $1/t$.\nThus the total amount of noise needed for a DP release is $O(1/(t\\epsilon))$ to these two prevalences. However,\nnote that we also incur a loss due to smoothing, which can be shown to be $O(t)$. Hence, the total\namount of error would be $O(1/(t\\epsilon) + t)$. Choosing $t = 1/\\sqrt{\\epsilon}$ yields a total error of $O(1/\\sqrt{\\epsilon})$. The\nabove analysis is for a toy example and extending it to general histograms requires additional work.\nIn particular, we need to find the smoothing boundaries that give the best utility. As described above,\nwe choose boundaries based on a logarithmic partition of counts and also by private values of counts.\nThe utility analysis with these choices of boundaries is in Appendix D.2.\n(H4) Add small noise: Since the prevalences are smoothed, we add a small amount of noise to the\ncorresponding cumulative prevalences. For $\\varphi_{s_i+}$, we add $L(1/((s_i - s_{i-1})\\epsilon))$ to obtain $\\epsilon$-DP.\n(H5) Post-processing: Finally, we post-process the prevalences similar to (L1) to impose monotonicity\nand ensure that the resulting prevalences are non-negative integers.\nSince we used the succinct histogram representation, ensured that the size of $S$ is small, and used counts\nlarger than $\\tilde{O}(\\sqrt{n\\epsilon})$ to find boundaries, the expected run time is $\\tilde{O}\\left(\\sqrt{\\frac{n}{\\epsilon}} + \\frac{1}{\\epsilon}\\right)$ for $\\epsilon \\leq 1$.\nPrivacy budget allocation: We allocate $\\epsilon_1$ privacy budget to estimate $n$, $\\epsilon_2$ to the rest of PRIVHIST\nand $\\epsilon_3$ to PRIVHIST-HIGHPRIVACY. We set $\\epsilon_1 = \\epsilon_2 = \\epsilon_3$ in our algorithms. We note that there is\nno particular reason for $\\epsilon_1$, $\\epsilon_2$, and $\\epsilon_3$ to be equal and we chose those values for simplicity and easy\nreadability. For example, since $\\epsilon_1$ is just used to estimate $n$, the analysis of the algorithm shows that\n$\\epsilon_2, \\epsilon_3$ affect utility more than $\\epsilon_1$. Hence, we can set $\\epsilon_2 = \\epsilon_3 = \\epsilon(1 - o(1))/2$ and $\\epsilon_1 = o(\\epsilon)$ to get\nbetter practical results. Furthermore, for the low privacy regime, the algorithm only uses a privacy budget\nof $\\epsilon_1 + \\epsilon_2$. Hence, we can set $\\epsilon_1 = o(\\epsilon)$, $\\epsilon_2 = \\epsilon(1 - o(1))$, and $\\epsilon_3 = 0$.\n\n5 Acknowledgments\n\nAuthors thank Jayadev Acharya and Alex Kulesza for helpful comments and discussions.\n\nAlgorithm PRIVHIST\n\nInput: anonymized histogram $h$ in terms of prevalences, i.e., $\\{(r, \\varphi_r) : \\varphi_r > 0\\}$, privacy cost $\\epsilon$.\nParameters: $\\epsilon_1 = \\epsilon_2 = \\epsilon_3 = \\epsilon/3$.\nOutput: DP anonymized histogram $H$ and $N$ (an estimate of $n$).\n1. DP value of the total sum: $N = \\max(\\sum_{n_x \\in h} n_x + Z^a, 0)$, where $Z^a \\sim G(e^{-\\epsilon_1})$. If $N = 0$,\noutput empty histogram and $N$. Otherwise continue.\n2. Split $h$: Let $T = \\lceil \\sqrt{N \\min(\\epsilon, 1)} \\rceil$ and $M = \\left\\lceil \\frac{\\max(2 \\log N e^{\\epsilon_2}, 1)}{\\epsilon_2} \\right\\rceil$.\n(a) $H^a$: $\\varphi^a_T = \\varphi_T + M$, $\\varphi^a_{T+1} = \\varphi_{T+1} + M$ and $\\forall r \\notin \\{T, T + 1\\}$, $\\varphi^a_r = \\varphi_r$.\n(b) $H^b$: $\\varphi^b_{T+1} = \\varphi^a_{T+1} + Z^b$, $\\varphi^b_T = \\varphi^a_T - Z^b$ and $\\forall r \\notin \\{T, T + 1\\}$, $\\varphi^b_r = \\varphi^a_r$, where $Z^b \\sim G(e^{-\\epsilon_2})$.\n(c) Divide $H^b$ into two histograms $H^{bs}$ and $H^{b\\ell}$. For all $r \\geq T + 1$, $\\varphi^{b\\ell}_r = \\max(0, \\sum_{s=T+1}^{r} \\varphi^b_s - \\sum_{s=T+1}^{r-1} \\varphi^{b\\ell}_s)$, and for all $r \\leq T$, $\\varphi^{bs}_r = \\max(0, \\sum_{s=r}^{T} \\varphi^b_s - \\sum_{s=r+1}^{T} \\varphi^{bs}_s)$.\n3. DP value of $H^{bs}$: Let $Z^{cs}_r \\sim G(e^{-\\epsilon_2})$ i.i.d. and $H^{cs}$ be $\\varphi^{cs}_{r+} = \\varphi^{bs}_{r+} + Z^{cs}_r$ for $r \\leq T$.\n4. DP value of $H^{b\\ell}$: Let $Z^{c\\ell}_i \\sim G(e^{-\\epsilon_2})$ i.i.d. and $H^{c\\ell}$ be $N^{c\\ell}_i = N^{b\\ell}_{(i)} + Z^{c\\ell}_i$ for $N_{(i)} \\in H^{b\\ell}$.\n5. If $\\epsilon > 1$, output PRIVHIST-LOWPRIVACY($H^{cs}, H^{c\\ell}, T, M$) and $N$.\n6. If $\\epsilon \\leq 1$, output PRIVHIST-HIGHPRIVACY($h, H^{c\\ell}, T, N, \\epsilon_3$) and $N$.\n\nAlgorithm PRIVHIST-LOWPRIVACY\n\nInput: low-count histogram $H^{cs}$, high-count histogram $H^{c\\ell}$, $T$, $M$ and Output: a histogram $H$.\nL1. 
Post-processing of $H^{cs}$:\n(a) Find $\\bar{\\varphi}^{\\mathrm{mon}}$ that minimizes $\\sum_{r \\ge 1} (\\varphi^{\\mathrm{mon}}_{r+} - \\varphi_{r+}(H^{cs}))^2$ with $\\varphi^{\\mathrm{mon}}_{r+} \\ge \\varphi^{\\mathrm{mon}}_{(r+1)+}$, $\\forall r$.\n(b) $H^{ds}$: for all $r$, $\\varphi^{ds}_{r+} = \\operatorname{round}(\\max(\\varphi^{\\mathrm{mon}}_{r+}, 0))$.\nL2. Post-processing of $H^{c\\ell}$: Compute $H^{d\\ell} = \\{\\max(N_i(H^{c\\ell}), T), \\forall i\\}$.\nL3. Let $H^d = H^{ds} + H^{d\\ell}$.\nL4. Compute $H^e$ by removing $M$ elements closest to $T+1$ from $H^d$ and then removing $M$ elements\nclosest to $T$, and output it.\n\nAlgorithm PRIVHIST-HIGHPRIVACY\n\nInput: non-private histogram $h$, high-count histogram $H^{\\ell}$, $T$, $N$, $\\epsilon_3$.\nOutput: a histogram $H$.\nH1. Approximate higher prevalences: for $r < 2N$, $\\varphi^u_r = \\varphi_r(h)$ and $\\varphi^u_{2N} = \\varphi_{2N+}(h)$.\nH2. Compute boundaries: Let the set $S$ be defined as follows:\n(a) $T' = \\lceil 10 \\sqrt{N/\\epsilon_3^3} \\rceil$, $q = \\sqrt{\\log(1/\\epsilon_3)/(N \\epsilon_3)}$.\n(b) $S = \\{1, 2, \\ldots, T\\} \\cup \\{\\lfloor T(1+q)^i \\rfloor : i \\le \\log_{1+q}(T'/T)\\} \\cup \\{N_x : N_x \\in H^{\\ell}, N_x \\ge T'\\} \\cup \\{2N\\}$.\nH3. Smooth prevalences: Let $s_i$ denote the $i$th smallest element in $S$.\n(a) $\\varphi^v_{s_i} = \\sum_{j=s_i}^{s_{i+1}} \\varphi^u_j \\cdot \\frac{s_{i+1} - j}{s_{i+1} - s_i} + \\sum_{j=s_{i-1}}^{s_i - 1} \\varphi^u_j \\cdot \\frac{j - s_{i-1}}{s_i - s_{i-1}}$, and if $j \\notin S$, $\\varphi^v_j = 0$.\nH4. DP value of $H^v$: for each $s_i \\in S$, let $\\varphi^w_{s_i+} = \\varphi^v_{s_i+} + Z^{s_i}$, where $Z^{s_i} \\sim L\\left(\\frac{1}{\\epsilon_3 (s_i - s_{i-1})}\\right)$.\nH5. Find $H^x$ that minimizes $\\sum_{s_i \\in S} (\\varphi^x_{s_i+} - \\varphi^w_{s_i+})^2 (s_i - s_{i-1})^2$ such that $\\varphi^x_{s_i+} \\ge \\varphi^x_{s_{i+1}+}$ $\\forall i$.\nH6. Return $H^y$ given by $\\varphi^y_{r+} = \\operatorname{round}(\\max(\\varphi^x_{r+}, 0))$ $\\forall r$.\n\nReferences\n\n[1] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D Smith, and Patrick White. Testing\nthat distributions are close. In Foundations of Computer Science, 2000. Proceedings. 41st\nAnnual Symposium on, pages 259\u2013269. IEEE, 2000.\n\n[2] Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6):1191\u20131253, 2003.\n\n[3] Alon Orlitsky, Narayana P Santhanam, Krishnamurthy Viswanathan, and Junan Zhang. On\nmodeling profiles instead of values. In Proceedings of the 20th conference on Uncertainty in\nartificial intelligence, pages 426\u2013435. 
AUAI Press, 2004.\n\n[4] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of\ndifferentially private histograms through consistency. Proceedings of the VLDB Endowment,\n3(1-2):1021\u20131032, 2010.\n\n[5] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log (n)-sample estimator for\nentropy and support size, shown optimal via new clts. In Proceedings of the forty-third annual\nACM symposium on Theory of computing, pages 685\u2013694. ACM, 2011.\n\n[6] Jeremiah Blocki, Anupam Datta, and Joseph Bonneau. Differentially private password frequency\nlists. In NDSS, volume 16, page 153, 2016.\n\n[7] Joseph Bonneau. The science of guessing: analyzing an anonymized corpus of 70 million\npasswords. In Security and Privacy (SP), 2012 IEEE Symposium on, pages 538\u2013552. IEEE,\n2012.\n\n[8] Jeremiah Blocki, Benjamin Harsha, and Samson Zhou. On the economics of offline password\ncracking. In 2018 IEEE Symposium on Security and Privacy (SP), pages 853\u2013871. IEEE, 2018.\n\n[9] Lars Backstrom, Cynthia Dwork, and Jon Kleinberg. Wherefore art thou r3579x?: anonymized\nsocial networks, hidden patterns, and structural steganography. In Proceedings of the 16th\ninternational conference on World Wide Web, pages 181\u2013190. ACM, 2007.\n\n[10] Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Philipp Weis. Resisting\nstructural re-identification in anonymized social networks. Proceedings of the VLDB Endowment,\n1(1):102\u2013114, 2008.\n\n[11] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. In 2009 30th IEEE\nSymposium on Security and Privacy, pages 173\u2013187. IEEE, 2009.\n\n[12] Michael Hay, Chao Li, Gerome Miklau, and David Jensen. Accurate estimation of the degree\ndistribution of private networks. In Data Mining, 2009. ICDM\u201909. Ninth IEEE International\nConference on, pages 169\u2013178. IEEE, 2009.\n\n[13] Vishesh Karwa and Aleksandra B Slavkovi\u0107. 
Differentially private graphical degree sequences\nand synthetic graphs. In International Conference on Privacy in Statistical Databases, pages\n273\u2013285. Springer, 2012.\n\n[14] Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Analyzing\ngraphs with node differential privacy. In Theory of Cryptography, pages 457\u2013476. Springer,\n2013.\n\n[15] Sofya Raskhodnikova and Adam Smith. Efficient lipschitz extensions for high-dimensional\ngraph statistics and node private degree distributions. In FOCS, 2016.\n\n[16] Jeremiah Blocki. Differentially private integer partitions and their applications. 2016.\n\n[17] Wei-Yen Day, Ninghui Li, and Min Lyu. Publishing graph degree distribution with node\ndifferential privacy. In Proceedings of the 2016 International Conference on Management of\nData, pages 123\u2013138. ACM, 2016.\n\n[18] Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer\nScience (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403\u2013412. IEEE, 2011.\n\n[19] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified\nmaximum likelihood approach for estimating symmetric properties of discrete distributions. In\nInternational Conference on Machine Learning, pages 11\u201321, 2017.\n\n[20] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of\nfunctionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835\u20132885, 2015.\n\n[21] Yihong Wu and Pengkun Yang. Minimax rates of entropy estimation on large alphabets via best\npolynomial approximation. IEEE Transactions on Information Theory, 62(6):3702\u20133720, 2016.\n\n[22] Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. Optimal prediction of the number\nof unseen species. Proceedings of the National Academy of Sciences, 113(47):13283\u201313288,\n2016.\n\n[23] Yi Hao, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu. 
Data amplification: A unified\nand competitive approach to property estimation. In Advances in Neural Information Processing\nSystems, pages 8834\u20138843, 2018.\n\n[24] Yi Hao and Alon Orlitsky. Data amplification: Instance-optimal property estimation. arXiv\npreprint arXiv:1903.01432, 2019.\n\n[25] James Zou, Gregory Valiant, Paul Valiant, Konrad Karczewski, Siu On Chan, Kaitlin Samocha,\nMonkol Lek, Shamil Sunyaev, Mark Daly, and Daniel G MacArthur. Quantifying unobserved\nprotein-coding variants in human populations provides a roadmap for large-scale sequencing\nprojects. Nature communications, 7:13293, 2016.\n\n[26] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. Inspectre: Privately\nestimating the unseen. In International Conference on Machine Learning, pages 30\u201339, 2018.\n\n[27] Tore Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift,\n15:429\u2013444, 1977.\n\n[28] Nabil R Adam and John C Worthmann. Security-control methods for statistical databases: a\ncomparative study. ACM Computing Surveys (CSUR), 21(4):515\u2013556, 1989.\n\n[29] Dakshi Agrawal and Charu C Aggarwal. On the design and quantification of privacy preserving\ndata mining algorithms. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART\nsymposium on Principles of database systems, pages 247\u2013255. ACM, 2001.\n\n[30] Cynthia Dwork. Differential privacy: A survey of results. In International Conference on\nTheory and Applications of Models of Computation, pages 1\u201319. Springer, 2008.\n\n[31] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to\nsensitivity in private data analysis. In Theory of cryptography conference, pages 265\u2013284.\nSpringer, 2006.\n\n[32] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. 
Foundations and Trends\u00ae in Theoretical Computer Science, 9(3\u20134):211\u2013407, 2014.\n\n[33] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar,\nand Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC\nConference on Computer and Communications Security, pages 308\u2013318. ACM, 2016.\n\n[34] \u00dalfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable\nprivacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on\ncomputer and communications security, pages 1054\u20131067. ACM, 2014.\n\n[35] Stanley L Warner. Randomized response: A survey technique for eliminating evasive answer\nbias. Journal of the American Statistical Association, 60(309):63\u201369, 1965.\n\n[36] Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam\nSmith. What can we learn privately? SIAM Journal on Computing, 40(3):793\u2013826, 2011.\n\n[37] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax\nrates. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on,\npages 429\u2013438. IEEE, 2013.\n\n[38] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential\nprivacy. In Advances in neural information processing systems, pages 2879\u20132887, 2014.\n\n[39] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions\nprivately, efficiently, and with little communication. In AISTATS, 2019.\n\n[40] Francesco Ald\u00e0 and Hans Ulrich Simon. A lower bound on the release of differentially private\ninteger partitions. Information Processing Letters, 129:1\u20134, 2018.\n\n[41] Godfrey H Hardy and Srinivasa Ramanujan. Asymptotic formula\u00e6 in combinatory analysis.\nProceedings of the London Mathematical Society, 2(1):75\u2013115, 1918.\n\n[42] Joseph Bonneau. 
Yahoo password frequency corpus, Dec 2015.\n\n[43] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. The complexity\nof estimating r\u00e9nyi entropy. In Proceedings of the twenty-sixth annual ACM-SIAM symposium\non Discrete algorithms, pages 1855\u20131869. SIAM, 2014.\n\n[44] Jayadev Acharya, Alon Orlitsky, Ananda Theertha Suresh, and Himanshu Tyagi. Estimating\nr\u00e9nyi entropy of discrete distributions. IEEE Transactions on Information Theory, 63(1):38\u201356,\n2016.\n\n[45] Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. arXiv preprint\narXiv:1906.03794, 2019.\n\n[46] Arpita Ghosh, Tim Roughgarden, and Mukund Sundararajan. Universally utility-maximizing\nprivacy mechanisms. SIAM Journal on Computing, 41(6):1673\u20131693, 2012.\n\n[47] Richard E Barlow, David J Bartholomew, JM Bremner, and H Daniel Brunk. Statistical inference\nunder order restrictions: The theory and application of isotonic regression. Technical report,\nWiley New York, 1972.\n\n[48] Patrick Mair, Kurt Hornik, and Jan de Leeuw. Isotone optimization in r: pool-adjacent-violators\nalgorithm (pava) and active set methods. Journal of statistical software, 32(5):1\u201324, 2009.", "award": [], "sourceid": 4375, "authors": [{"given_name": "Ananda Theertha", "family_name": "Suresh", "institution": "Google"}]}